Differential-Geometrical Methods in Statistics
Lecture Notes in
Statistics
Edited by J. Berger, S. Fienberg, J. Gani,
K. Krickeberg, I. Olkin, and B. Singer
28
Shun-ichi Amari
Differential-Geometrical
Methods in Statistics
Springer-Verlag
Berlin Heidelberg New York London Paris Tokyo Hong Kong
Author
Shun-ichi Amari
University of Tokyo, Faculty of Engineering
Department of Mathematical Engineering and Information Physics
Bunkyo-ku, Tokyo 113, Japan
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation,
broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication
of this publication or parts thereof is only permitted under the provisions of the German Copyright
Law of September 9, 1965, in its version of June 24, 1985, and a copyright fee must always be
paid. Violations fall under the prosecution act of the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1985
Chapter 1. Introduction 1
3.1. a-representation 66
3.2. Dual affine connections 70
3.3. a-family of distributions 73
3.4. Duality in a-flat manifolds 79
3.5. a-divergence 84
3.6. a-projection 89
3.7. On geometry of function space of distributions 93
3.8. Remarks on possible divergence, metric and
connection in statistical manifold 96
3.9. Notes 102
REFERENCES 276
Why Geometry?
Historical Remark
It was Rao (1945), in his early twenties, who first noticed the
importance of the differential-geometrical approach. He introduced
the Riemannian metric in a statistical manifold by using the Fisher
information matrix and calculated the geodesic distances between two
distributions for various statistical models. This theory made an
impact, and not a few researchers have tried to construct a theory
along this Riemannian line. Jeffreys also remarked on the Riemannian
distance (Jeffreys, 1948), and the invariant prior of Jeffreys (1946)
was based on the Riemannian concept. The properties of the
Riemannian manifold of a statistical model have further been studied
by a number of researchers independently, e.g., Amari (1968), James
(1973), Atkinson and Mitchell (1981), Dawid (1977), Akin (1979),
Kass (1980), Skovgaard (1984), etc. Amari's unpublished results
(1959) inspired a number of studies in Japan: Yoshizawa (1971a,
b), Takiyama (1974), Ozeki (1971), Sato et al. (1979), Ingarden et
al. (1979), etc. Nevertheless, the statistical implications of the
Personal Remarks
It was in 1959, while I was studying for my Master's
degree at the University of Tokyo, that I became enchanted by the
idea of a beautiful geometrical structure of a statistical model.
It was suggested that I consider the geometrical structure of the family
of normal distributions, using the Fisher information as a
Riemannian metric. This was Professor Rao's excellent idea, proposed
in 1945. I found that the family of normal distributions forms a
Riemannian manifold of constant negative curvature, which is the
Bolyai-Lobachevsky geometry well known in the theory of
non-Euclidean geometry. My results on the geodesics, geodesic
distance and curvature appeared in an unpublished report. I could
not understand the statistical meaning of these results, in
particular the meaning of the Riemannian curvature of a statistical
manifold. Since then, I have been dreaming of constructing a
theory of differential geometry for statistics, although my work has
been concentrated in non-statistical areas, namely graph theory,
continuum mechanics, information sciences, the mathematical theory of
neural nets, and other aspects of mathematical engineering. It was
Acknowledgement
I would like to express my sincere gratitude to Professor
Emeritus Kazuo Kondo, who organized the RAAG (Research Association
of Applied Geometry) and introduced me to the world of applied
geometry. The author also thanks Professor S. Moriguti for his
suggestion of the geometrical approach in statistics. I especially
appreciate the valuable suggestions and encouragement from Professor K.
Takeuchi, without which I could not have completed the present work. I am
grateful to many statisticians for their warm encouragement, useful
comments and inspiring discussions. I would like to mention
especially Professor B. Efron, Professor A.P. Dawid, Professor
D.R. Cox, Professor C.R. Rao, Professor O. Barndorff-Nielsen,
Professor S. Lauritzen, Professor D.V. Hinkley, Professor
D.A. Pierce, Professors Ib. and L.T. Skovgaard, Professor T.
Kitagawa, Professor T. Okuno, Professor T.S. Han, Professor M.
Akahira, and Dr. A. Mitchell. The comments by Professor H. Kimura and
Professor K. Kanatani were also useful. Professor Th. Chang and
Dr. R. Lockhart were kind enough to read the first version of the
Since the first printing of this monograph in 1985, many papers have
appeared on this subject, and this dual geometry has been recognized to be
applicable to a wide range of information sciences. New references that have
appeared during these four years are added in this second printing;
they show the new developments in this field.
PART I. GEOMETRICAL STRUCTURES OF A FAMILY
OF PROBABILITY DISTRIBUTIONS
Fig. 2.1

Fig. 2.2
the transformation

det |∂ξ^i/∂θ^j|

does not vanish on U, where det denotes the determinant of the
matrix whose (i, j)-element is ∂ξ^i/∂θ^j. In this case, the inverse

ċ f = d(f∘c)/dt = (d/dt) f{θ(t)} = Σ_{i=1}^n (dθ^i/dt) ∂_i f
~ aa~
coordinate curves c 1 ' c 2 ' ... , c n ' passing through a point PO. For
example, the first coordinate curve c 1 is the curve along which only
the value of the first coordinate e 1 changes while all the other
coordinates are fixed. Hence, the curve c 1 is represented by
1 2
e 1 (t) (e a + t, eo' en)
a
where 80 (8 01 , en) is the coordinates of po· Then, the
a
tangent vector C1 of c 1 is nothing but the partial derivative with
respect to e i ,
a
-;r f .
Hence, we may denote the tangent C₁ by ∂/∂θ¹ or shortly by ∂₁.
Similarly, the tangent vector C_i of the coordinate curve c_i is
denoted by ∂_i (Fig. 2.4); C_i is simply the partial derivative
∂/∂θ^i, and ∂_i can be regarded as the abbreviation of ∂/∂θ^i. It
can be proved that the n vectors ∂_i are linearly independent,
forming a basis of T_P. We call {∂_i} the natural basis associated
with the coordinate system θ. Any tangent vector A ∈ T_P can be
represented as a linear combination of the ∂_i,

A = Σ_{i=1}^n A^i ∂_i ,
where the A^i are the components of A with respect to the natural
basis. In the following, we adopt the Einstein summation
convention: summation is automatically taken, without the summation
symbol Σ, over those indices which appear twice in one term, once as
a subscript and once as a superscript. Hence, A^i ∂_i automatically
implies Σ_{i=1}^n A^i ∂_i. The tangent vector θ̇ of a curve θ(t) in
the coordinate expression is indeed given by θ̇ = θ̇^i ∂_i (which implies
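The convention is easy to mimic numerically. The sketch below (with a hypothetical vector and metric, chosen only for illustration) uses numpy.einsum, whose subscript strings play exactly the role of the repeated upper/lower index pairs:

```python
import numpy as np

# Hypothetical data: a tangent vector A = A^i d_i and a metric g_ij in
# n = 3 dimensions (illustrative values only).
A = np.array([1.0, 2.0, 3.0])      # contravariant components A^i
g = np.diag([1.0, 1.0, 2.0])       # metric tensor g_ij

# The repeated index pairs in A^i A^j g_ij are summed automatically:
sq_norm = np.einsum('i,j,ij->', A, A, g)

# Lowering an index, A_j = g_ij A^i, is a single contraction:
A_lower = np.einsum('ij,i->j', g, A)

print(sq_norm)    # 1*1 + 2*2 + 2*3*3 = 23.0
print(A_lower)    # [1. 2. 6.]
```

The subscript string 'i,j,ij->' contracts every index, returning the scalar A^i A^j g_ij that appears later as the squared length |A|².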
∂_i ∈ T_θ ↔ ∂_i ℓ(x, θ) ∈ T_θ^{(1)} .

Obviously, a tangent vector (derivative operator) A = A^i ∂_i ∈ T_θ
corresponds to the random variable A(x) = A^i ∂_i ℓ(x, θ) ∈ T_θ^{(1)}
having the same components A^i. We can identify T_θ with T_θ^{(1)},
regarding T_θ as the differentiation-operator representation of the
tangent space, while T_θ^{(1)} is the random-variable representation.
The expectation with respect to p(x, θ) is

E[f(x)] = ∫ f(x) p(x, θ) dP .   (2.3)
i = 1, ..., n ;   α = 1, ..., n .

Here, the index i is used to denote the components of θ, while the
index α is used to denote the components of ξ. It is convenient to
use different index letters to denote the components with respect to
different coordinate systems. Thus, we use i, j, k, etc. for
representing quantities with respect to θ, and α, β, γ, etc. for
quantities with respect to ξ.
The Jacobian matrices of the above coordinate transformations
are written as

B_i^α(θ) = ∂ξ^α/∂θ^i ,   B_α^i(ξ) = ∂θ^i/∂ξ^α .

By differentiating the identity θ[ξ(θ)] = θ, or

θ^i[ξ¹(θ), ..., ξⁿ(θ)] = θ^i ,

with respect to θ^j, we have

(∂θ^i/∂ξ^α)(∂ξ^α/∂θ^j) = B_α^i B_j^α = δ_j^i ,

where δ_j^i is the Kronecker delta, which is equal to 1 when i = j
and otherwise equal to 0. Similarly, we have

B_i^α B_β^i = δ_β^α .

Hence, the two Jacobian matrices (B_i^α) and (B_α^i) are mutually
inverse matrices. Let {∂_i} and {∂_α} be the natural bases of the
tangent space with respect to θ and ξ, respectively. Then, the
relations

∂_α = B_α^i ∂_i   (2.4)

hold. Expressing the same vector A in these two bases,
A = A^i ∂_i = A^α ∂_α, we have the respective components A^i and
A^α. From the relations (2.4), it is shown that the components are
related by

A^i = B_α^i A^α ,   A^α = B_i^α A^i ,   (2.5)
with

c = − E[a x² + b x] = − a(σ² + μ²) − b μ .
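The reciprocity of the two Jacobian matrices and the component transformation rule (2.5) can be checked numerically. A sketch for the normal family, assuming the coordinate pair θ = (μ, σ) and ξ = (μ, μ² + σ²) (an illustrative choice of second coordinate system):

```python
import numpy as np

# Assumed example: theta = (mu, sigma), xi = (mu, mu^2 + sigma^2).
mu, sigma = 0.5, 2.0

# Jacobian B^alpha_i = d xi^alpha / d theta^i and its inverse B^i_alpha.
B = np.array([[1.0, 0.0],
              [2 * mu, 2 * sigma]])
B_inv = np.linalg.inv(B)

# The two Jacobian matrices are mutually inverse:
# B^i_alpha B^alpha_j = delta^i_j.
print(np.allclose(B @ B_inv, np.eye(2)))    # True

# Components transform as A^alpha = B^alpha_i A^i and back, as in (2.5).
A_theta = np.array([1.0, -1.0])
A_xi = B @ A_theta
print(A_xi)                                  # [ 1. -3.]
print(np.allclose(B_inv @ A_xi, A_theta))    # True
```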
The coordinate transformation is given by

B_i^α = ∂ξ^α/∂θ^i = [[1, 0], [2μ, 2σ]] ,

and its inverse is given by

B_α^i = ∂θ^i/∂ξ^α = [[1, 0], [−μ/σ, 1/(2σ)]] .

The coordinate curves are given in Fig. 2.5 b), where the natural
basis vectors {∂_α}, ∂_α = B_α^i ∂_i, are also shown. The tangent
vectors ∂_α, α = 1', 2', are written as

∂_1' = ∂₁ − (μ/σ) ∂₂ ,   ∂_2' = (1/(2σ)) ∂₂ ,

where 1' and 2' are used to denote the {∂_α}-system. Their
1-representations are

∂_1' ℓ = (x − μ)/σ² − μ(x − μ)²/σ⁴ + μ/σ² ,
∂_2' ℓ = (x − μ)²/(2σ⁴) − 1/(2σ²) .
Fig. 2.5. a) θ-coordinates; b) ξ-coordinates
or

|A|² = A^i A^j g_ij .

This is obviously the variance of the 1-representation A(x),

|A|² = E[{A(x)}²] .
semi-definite matrix.

(2.7) or (2.10) as

g_ij(θ) = (1/σ²) [[1, 0], [0, 2]] .

Since the cross components g₁₂(θ) and g₂₁(θ) vanish identically, the
basis vectors ∂₁ and ∂₂ are always orthogonal. Hence the coordinate
system θ is an orthogonal system, composed of two families of
mutually orthogonal coordinate curves, θ¹ = μ = const. and θ² = σ =
const. However, the length of ∂_i depends on the position θ (more
precisely on σ),

|∂₁|² = var[(x − μ)/σ²] = 1/σ² ,   |∂₂|² = 2/σ² ,

and the coordinate system θ is not Cartesian.
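Since g_ij = E[(∂_i ℓ)(∂_j ℓ)], the metric g = diag(1/σ², 2/σ²) can be confirmed by simulation. A Monte Carlo sketch:

```python
import numpy as np

# Monte Carlo check that the Fisher metric of N(mu, sigma) in the
# coordinates theta = (mu, sigma) is g = diag(1/sigma^2, 2/sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

d1 = (x - mu) / sigma**2                  # dl/dmu
d2 = (x - mu)**2 / sigma**3 - 1 / sigma   # dl/dsigma

g_hat = np.array([[np.mean(d1 * d1), np.mean(d1 * d2)],
                  [np.mean(d2 * d1), np.mean(d2 * d2)]])
g_exact = np.array([[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]])
print(np.round(g_hat, 3))   # close to [[0.25, 0], [0, 0.5]]
```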
In terms of another coordinate system ξ, d(θ₁, θ₂)

The geodesic curve θ(t) connecting two normal distributions is given
by

θ¹(t) = c₁ + 2 c₂ tanh(t/√2 + c₃) ,   θ²(t) = √2 c₂ / cosh(t/√2 + c₃) ,

or

θ¹(t) = μ ,   θ²(t) = exp(t/√2 + c) .
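A quick numerical sketch: a geodesic parametrized by arc length must have constant unit speed, so the curve family above can be checked against the Fisher metric ds² = (dμ² + 2 dσ²)/σ² by finite differences (constants c₁, c₂, c₃ are arbitrary illustrative values):

```python
import numpy as np

# Check that the geodesic curve has constant unit speed in the metric
# ds^2 = (d mu^2 + 2 d sigma^2) / sigma^2 of the normal family.
c1, c2, c3 = 0.3, 1.5, -0.2

def curve(t):
    s = t / np.sqrt(2) + c3
    return c1 + 2 * c2 * np.tanh(s), np.sqrt(2) * c2 / np.cosh(s)

h = 1e-6
speeds = []
for ti in np.linspace(-2, 2, 9):
    (m1, s1), (m2, s2) = curve(ti - h), curve(ti + h)
    dmu, dsig = (m2 - m1) / (2 * h), (s2 - s1) / (2 * h)
    sigma = curve(ti)[1]
    speeds.append((dmu**2 + 2 * dsig**2) / sigma**2)
print(np.round(speeds, 6))   # all approximately 1.0
```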
A = A^i(θ) ∂_i(θ) .

It is smooth when the components A^i(θ) are smooth functions in θ.
The basis vector field ∂_i itself is a smooth vector field. The set
of all the smooth vector fields of S is denoted by T(S), or shortly
by T.

Since the tangent spaces T_θ and T_θ' are different when θ and θ'
are different points, there is no direct means to compare two
vectors A(θ) ∈ T_θ and A(θ') ∈ T_θ'. The direct comparison of their
components A^i(θ) and A^i(θ') is meaningless, because the basis
vectors ∂_i(θ) and ∂_i(θ') are different. (Even when the space is
Euclidean, ∂_i(θ) and ∂_i(θ') are different in the case that the
coordinate system θ is curvilinear.) In order to compare two
vectors belonging to two different vector spaces, it is necessary to
establish a one-to-one correspondence between the vector spaces so
that one vector space is mapped to another. Let us try to give an
affine correspondence between two adjacent tangent spaces T_θ and
Fig. 2.7
is expressed as

Δ∂_j = dθ^i Γ_ij^k(θ) ∂_k ,

where the dθ^i Γ_ij^k are the components of Δ∂_j ∈ T_θ. Hence, the
mapping m is determined by the n³ functions Γ_ij^k(θ), i, j, k =
1, ..., n, in θ. Since m is linear, a vector A^i ∂_i ∈ T_{θ+dθ} is
mapped to

A^i m[∂_i] = (A^k + dθ^i Γ_ij^k A^j) ∂_k ∈ T_θ ,

thus establishing a correspondence between the vectors in T_θ and
T_{θ+dθ}. An affine correspondence between T_θ and T_{θ+dθ} is
obtained from the map m by considering that the origin of T_{θ+dθ}
is mapped to the point dθ^i ∂_i in T_θ. By this affine
correspondence, a point A^i ∂_i in T_{θ+dθ} is mapped to the point
in T_θ shown by the vector

(A^k + dθ^k + dθ^i A^j Γ_ij^k) ∂_k .

The n³ functions Γ_ij^k(θ) in θ are called the coefficients of the
affine connection, because m gives an affine correspondence between
T_θ and T_{θ+dθ}.

The difference Δ∂_j can be regarded as the intrinsic change in
the j-th basis vector ∂_j(θ) as the point changes from θ to θ + dθ.
Hence, if we denote by ∇_{∂_i} ∂_j the rate of the intrinsic change
of ∂_j as the point θ changes in the direction of ∂_i, it is given
by the vector
(2.11)

This is again a vector field. The vector field ∇_{∂_i} ∂_j is called
the covariant derivative of the vector field ∂_j along ∂_i. It is
determined from the coefficients Γ_ij^k(θ) of the affine connection.
Taking the inner product of both sides of (2.11) with ∂_m, the
relation

⟨∇_{∂_i} ∂_j, ∂_m⟩ = Γ_ij^k ⟨∂_k, ∂_m⟩ = Γ_ij^k(θ) g_km(θ)   (2.12)

follows. It is convenient to define the covariant expression of the

(2.13)
(2.17)

(2.18)

where

A f = A^i(θ) ∂_i f(θ) .

Obviously, the coefficients of the affine connection are given by

Γ_ijk(θ) = ⟨∇_{∂_i} ∂_j, ∂_k⟩ .   (2.19)

On the other hand, when the Γ_ijk are given, the covariant
derivative ∇_A B for A = A^i(θ) ∂_i, B = B^i(θ) ∂_i can be
calculated as

∇_A (B^j ∂_j) = (A^i ∂_i B^j) ∂_j + A^i B^j ∇_{∂_i} ∂_j .

(2.21)
This shows how the coefficients of an affine connection change under
coordinate transformations. It should be noticed that Γ_ijk is not a
tensor. Even if Γ_ijk(θ) = 0 holds identically for some coordinate
= 0 ,   (2.22)

where the relation

θ̇^i ∂_i f = (d/dt) f[θ(t)] = ḟ

is used.

Fig. 2.8

When θ(t) is the solution of the above equation, the
∇_θ̇ θ̇ = c(t) θ̇ .

By choosing an appropriate parameter t, this equation reduces to the
simpler form

∇_θ̇ θ̇ = 0 ,   (2.23)

which implies that the tangent vector of the curve does not change
at all along itself. This is a generalization of the straight line
in Euclidean geometry. It is called a geodesic with respect to
the affine connection. The equation (2.23) can be rewritten as

θ̈^k(t) + θ̇^i(t) θ̇^j(t) Γ_ij^k{θ(t)} = 0 .   (2.24)
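Equation (2.24) can be integrated numerically. The sketch below uses the Christoffel symbols (with raised index) of the Fisher metric g = diag(1/σ², 2/σ²) of the normal family, an illustrative choice, and checks that the squared speed g(θ̇, θ̇) is conserved, as it must be along a metric geodesic:

```python
import numpy as np

# Geodesic equation for theta = (mu, sigma) with the Levi-Civita
# connection of g = diag(1/sigma^2, 2/sigma^2); the nonzero raised
# Christoffel symbols are
#   G^1_12 = -1/sigma,  G^2_11 = 1/(2 sigma),  G^2_22 = -1/sigma.
def deriv(state):
    mu, sig, dmu, dsig = state
    return np.array([dmu, dsig,
                     2 * dmu * dsig / sig,                    # mu''
                     -dmu**2 / (2 * sig) + dsig**2 / sig])    # sigma''

def rk4(state, dt, steps):
    for _ in range(steps):
        k1 = deriv(state); k2 = deriv(state + dt / 2 * k1)
        k3 = deriv(state + dt / 2 * k2); k4 = deriv(state + dt * k3)
        state = state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

def speed2(s):  # g(theta', theta')
    return (s[2]**2 + 2 * s[3]**2) / s[1]**2

s0 = np.array([0.0, 1.0, 1.0, 0.0])
s1 = rk4(s0, 1e-3, 1000)
print(round(speed2(s1), 8))   # speed is conserved along the geodesic
```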
α-connections.

We have thus introduced a one-parameter family of affine
connections. The reader might ask which is the true connection to be
introduced in S. This question is meaningless: for any α, the
α-connection has its proper meaning depending on α, and it plays its
proper role in statistical inference. The α-connection defines the
curve
p₁(x) and 100t% from p₂(x). For ℓ(x, t) = log p(x, t), it is easy to
show

relation, we have

E[∂_t ∂_t ℓ(x, t)] + g_tt = 0 ,

where g_tt is the Fisher information. Hence, ∂_t ℓ(x, t + dt) ∈
T_{t+dt}^{(1)} is naturally mapped to ∂_t ℓ(x) ∈ T_t^{(1)} if the
modification (2.25) is used.

(2.32)
(2.25) as

Γ^{(1)}_{221} = Γ^{(1)}_{122} = Γ^{(1)}_{212} = 0 ,
Γ^{(1)}_{121} = Γ^{(1)}_{211} = − 2/σ³ ,   Γ^{(1)}_{222} = − 6/σ³ .

The components of T_ijk are calculated as

T₁₁₁ = T₂₂₁ = 0 ,   T₁₁₂ = 2/σ³ ,   T₂₂₂ = 8/σ³ ,

and all the other components are obtained from the symmetry of
T_ijk, i.e., T_ijk = T_jik = T_kij. Hence, the α-connection is given
from (2.29) as

Γ^{(α)}_{111} = Γ^{(α)}_{212} = Γ^{(α)}_{122} = Γ^{(α)}_{221} = 0 ,   Γ^{(α)}_{112} = (1 − α)/σ³ ,
Γ^{(α)}_{121} = Γ^{(α)}_{211} = − (1 + α)/σ³ ,   Γ^{(α)}_{222} = − 2(1 + 2α)/σ³ .
different α. One can check that Γ^{(0)}_{ijk} coincides with the
Christoffel symbol [i, j; k] calculated from the metric tensor. It
is not difficult to check that the geodesic curve given in Example
2.3 is indeed the α-geodesic satisfying (2.24) for the α = 0
connection.

the manifold and are described by the torsion and curvature of the
manifold.
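As a numerical illustration, assuming the standard definition Γ^{(α)}_{ijk} = E[(∂_i∂_j ℓ + ((1 − α)/2) ∂_i ℓ ∂_j ℓ) ∂_k ℓ], the component Γ^{(α)}_{112} = (1 − α)/σ³ of the normal family can be estimated by Monte Carlo:

```python
import numpy as np

# Monte Carlo estimate of Gamma^(a)_{112} for N(mu, sigma) in the
# coordinates theta = (mu, sigma), compared with (1 - a)/sigma^3.
rng = np.random.default_rng(1)
mu, sigma, alpha = 0.0, 2.0, 0.5
x = rng.normal(mu, sigma, size=2_000_000)

d1 = (x - mu) / sigma**2                    # dl/dmu
d2 = (x - mu)**2 / sigma**3 - 1 / sigma     # dl/dsigma
d11 = -np.ones_like(x) / sigma**2           # d^2 l / dmu^2

gamma_112 = np.mean((d11 + 0.5 * (1 - alpha) * d1 * d1) * d2)
print(gamma_112, (1 - alpha) / sigma**3)    # both near 0.0625
```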
Q_{α₁…αₖ} = B_{α₁}^{i₁} ⋯ B_{αₖ}^{iₖ} Q_{i₁…iₖ} .

This gives the transformation rule of a tensor. When the components
of a tensor vanish in one coordinate system, they vanish in any
coordinate system. The Riemannian metric is a tensor of order two
whose components are given by g_ij. An affine connection is not a
tensor, although its components are written as Γ_ijk, because it does
not define a multilinear mapping and its transformation rule (2.21)
is different from that of a tensor.
We define another type of tensor. Let R be a multilinear
mapping from k vector fields to a vector field,

R : T(S) × ⋯ × T(S) → T(S) .

This is called a tensor of order k + 1, of covariant order k and
and

R'_{i₁…iₖ m} = R_{i₁…iₖ}^j g_jm .

Hence, R' is a covariant version (lower-index version) of R. We
hereafter identify R and R', and omit the prime, regarding them as
different expressions of one and the same quantity.
The torsion is a bilinear mapping from T(S) × T(S) to T(S)
induced by the affine connection. Hence, it is a tensor of order

holds. The result is

R₁₂₁₂ = (1 − α²)/σ⁴

in the coordinate system (μ, σ). The scalar curvature K is
defined by

K = (1/(n(n − 1))) R_{ijkm} g^{im} g^{jk}

and is a scalar taking the same value in any coordinate system. In
our case, it is given by

K^{(α)} = − (1 − α²)/2 .
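The value K^{(0)} = −1/2 can be confirmed from the metric alone, using the Gaussian-curvature formula for orthogonal coordinates (a finite-difference sketch; since E = 1/σ² and G = 2/σ² depend on σ only, the formula reduces to a single derivative term):

```python
import math

# Gaussian curvature of ds^2 = E d mu^2 + G d sigma^2 with
# E = 1/sigma^2, G = 2/sigma^2 (the alpha = 0 case):
#   K = -(1 / (2 sqrt(EG))) * d/d sigma ( (dE/d sigma) / sqrt(EG) ).
def E(s): return 1 / s**2
def W(s): return math.sqrt(E(s) * 2 / s**2)   # sqrt(E*G)

def dE(s, h=1e-5): return (E(s + h) - E(s - h)) / (2 * h)
def inner(s): return dE(s) / W(s)

sigma, h = 1.3, 1e-5
K = -(1 / (2 * W(sigma))) * (inner(sigma + h) - inner(sigma - h)) / (2 * h)
print(round(K, 6))   # close to -0.5, i.e. -(1 - alpha^2)/2 at alpha = 0
```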
Fig. 2.9
g_ab(u) = ⟨∂_a, ∂_b⟩ = B_a^i B_b^j ⟨∂_i, ∂_j⟩ = B_a^i B_b^j g_ij(u) ,   (2.41)

where g_ij(u) = g_ij[θ(u)]. This gives the induced metric tensor of M.
We can also calculate the covariant derivative ∇_{∂_a} ∂_b of the vector
field ∂_b (which is defined only on M) along ∂_a in the enveloping
manifold S as follows.

It does not necessarily belong to T_u(M), because the intrinsic
change in the tangent vector ∂_b may have
Fig. 2.10
Fig. 2.11
ξ¹ = u¹ ,   ξ² = u² ,   ... ,   ξ^m = u^m
for a, and the second part stands for κ. Indices κ, λ, μ, etc. are
used to denote quantities related to the coordinate system v in A(u).
It is convenient to fix the origin v = 0 of each A(u) at the point
θ(u) which is the intersection of A(u) and M. Then, all the points
of M have v = 0 coordinates, so that the points of M have the
following θ-coordinates: θ = θ(u) = θ(u, 0). Hence, θ = θ(u) is a
parametric representation of M, and v^κ = 0 (κ = m + 1, ..., n) is
another representation of M.
A foliation of a manifold S is a partitioning

S = ∪_{u∈M} A(u)

of the manifold S into submanifolds A(u) of dimension n − m. Hence, a
foliation defines an ancillary family when A(u) transverses M at u.
When M is a regular submanifold in S, there always exists a

U_M = ∪_{u∈M} A(u) .
∂_κ = ∂/∂v^κ ,   κ = m + 1, ..., n .
The vectors ∂_a span the tangent space T_u(M) at θ = θ(u) ∈ M ⊂ S,
and the vectors ∂_κ span the tangent space T_u(A) at θ = θ(u) ∈ M.
The tangent space T_u(S) at θ = θ(u) is decomposed into the direct
sum of these two,
Γ_{αβγ} = B_α^i B_β^j B_γ^k Γ_ijk + B_γ^j (∂_α B_β^i) g_ij .
The v-part of the metric,

g_{κλ}(u) = B_κ^i B_λ^j g_ij ,   (2.48)

represents the induced metric of A(u), and the u-part

g_ab(u) = B_a^i B_b^j g_ij   (2.49)

at θ = θ(u) represents the metric of M. The mixed part

g_{aκ}(u) = ⟨∂_a, ∂_κ⟩ = B_a^i B_κ^j g_ij   (2.50)

at θ = θ(u) represents the angles between T_u(M) and T_u(A). When

g_{aκ}(u) = 0   (2.51)

holds, T_u(M) and T_u(A) are the orthogonal complements of each
other. When (2.51) holds, A = {A(u)} is called an orthogonal
ancillary family. This property is independent of the choice of the
coordinate system v in A(u).
The v-part of the connection,

Γ_{κλμ}(u) = ⟨∇_{∂_κ} ∂_λ, ∂_μ⟩ = B_κ^i B_λ^j B_μ^k Γ_ijk + B_μ^j (∂_κ B_λ^i) g_ij ,   (2.52)

is the induced affine connection of A(u), while the u-part

Γ_abc(u) = ⟨∇_{∂_a} ∂_b, ∂_c⟩ = B_a^i B_b^j B_c^k Γ_ijk + B_c^j (∂_a B_b^i) g_ij   (2.53)

H_{abκ}   (2.55)

gives the imbedding curvature of M in S at θ(u).
It is always possible to choose a coordinate system v in each
A(u) such that {∂_κ} forms an orthonormal basis in A(u) on M,
i.e., at the points v = 0, so that
θ¹(u, v) = u ,   θ²(u, v) = 1 + v ,   i = 1, 2 ,

and the Jacobian matrix is the identity, where ξ = (u, v) and
∂_a B_β^i = 0 holds. The metric tensor g_αβ is given by

g_αβ = (1/(1 + v)²) [[1, 0], [0, 2]] .
This implies that the metric (Fisher information) of M₁ is g_ab(u) =
1, where we put v = 0, that the metric of A(u) on M₁ is g_{κλ}(u) = 2,
and that ∂_a and ∂_κ are mutually orthogonal, g_{aκ}(u) = 0. Therefore,
the ancillary family is orthogonal. The α-connection Γ^{(α)}_{αβγ}(u)
in the associated coordinate system ξ = (u, v) is given by

Γ^{(α)}_{abc}(u) = 0 ,   Γ^{(α)}_{κλμ}(u) = − 2(1 + 2α)/(1 + v)³ ,
Γ^{(α)}_{abκ}(u) = (1 − α)/(1 + v)³ ,   Γ^{(α)}_{κλa}(u) = 0 ,
Γ^{(α)}_{aκλ}(u) = 0 ,   Γ^{(α)}_{κaλ}(u) = − (1 + α)/(1 + v)³ ,

where a, b, c stand for 1 and κ, λ, μ stand for 2. The above results
show that the model M₁ is not an α-flat submanifold in S except for
q(x, u) = (1/(√(2π) u)) exp{− x²/(2u²)} .

The imbedding θ = θ(u) is given in this case by

θ¹(u) = 0 ,   θ²(u) = u .
Fig. 2.13
θ¹(u, v) = v ,   θ²(u, v) = u ,

B_α^i = [[0, 1], [1, 0]] ,

and ∂_a B_β^i = 0 also holds. By calculating the metric tensor
g_αβ(u), we have

g_ab(u) = 2/u² ,   g_{κλ}(u) = 1/u² ,   g_{aλ}(u) = 0 ,

so that the ancillary family A is an orthogonal family. The
α-connection is given by
Γ^{(α)}_{abκ}(u) = 0 ,
M₃ is imbedded in S with the following A(u):

Fig. 2.14

θ¹(u, v) = u − 2v ,   θ²(u, v) = u + v ,

so that

B_α^i(u) = [[1, −2], [1, 1]]

and ∂_a B_β^i = 0 on M₃. The metric tensor g_αβ(u) is decomposed
into the following three parts:

g_ab(u) = 3/u² ,   g_{aκ}(u) = 0 ,   g_{κλ}(u) = 6/u² .
2.9. Notes

It was in the 1940's that the possibility of constructing a
differential-geometrical theory of statistical models was first
remarked. Rao [1945] showed that a statistical model forms a
Riemannian manifold, where the Fisher information matrix plays the
role of the metric tensor g_ij. He calculated the geodesic distances
in some statistical models. The idea of the non-informative prior
distribution of Jeffreys [1946] seems to have been proposed from the
Riemannian point of view. Since then a number of researchers have
tried to elucidate the statistical meaning of the Riemannian
structure. We can mention, among others, unpublished works by Amari
in 1959 and by Moriguti in 1960, which influenced the subsequent
studies by Yoshizawa [1971a, b], Takiyama [1974], Sato et al. [1979],
Ozeki [1977] and Ingarden et al. [1979] in Japan. The Riemannian
approach was also adopted by Atkinson and Mitchell [1981], Kass
[1981] and Skovgaard [1981]. However, the statistical implications
of Riemannian structures, especially of the Riemann-Christoffel
curvature, are still not clear, except for the fact that there are
no covariant stabilizing transformations unless the
Riemann-Christoffel curvature vanishes. Here, a covariant
stabilizing transformation is a transformation of the parameter (or
3.1. α-representation

We have used the logarithm ℓ(x, θ) of the density function p(x, θ)
to define the fundamental geometric structures in the previous
chapter. However, there exist more convenient representations in
which the meaning of the α-connections can be understood more
directly. Let F_α(p) be a one-parameter family of functions defined
by

F_α(p) = (2/(1 − α)) p^{(1−α)/2} ,   α ≠ 1 ,
F_α(p) = log p ,   α = 1 .   (3.1)

The derivative F'_α(p) = p^{−(1+α)/2} is a homogeneous function in p
of degree −(1 + α)/2. We call

ℓ_α(x, θ) = F_α{p(x, θ)}   (3.2)
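A small numeric sketch of (3.1): the derivative of F_α is indeed p^{−(1+α)/2}, and the constant-shifted version (2/(1 − α))(p^{(1−α)/2} − 1), which differs from (3.1) only by a constant, tends to log p as α → 1:

```python
import math

# The alpha-representation F_alpha of (3.1).
def F(p, alpha):
    if alpha == 1:
        return math.log(p)
    return 2 / (1 - alpha) * p**((1 - alpha) / 2)

# Check F'(p) = p^(-(1+alpha)/2) by central differences.
p, alpha, h = 0.7, -0.5, 1e-6
deriv = (F(p + h, alpha) - F(p - h, alpha)) / (2 * h)
print(round(deriv, 6), round(p**(-(1 + alpha) / 2), 6))   # agree

# The shifted family (2/(1-alpha)) (p^((1-alpha)/2) - 1) -> log p.
eps = 1e-6
shifted = (2 / eps) * (p**(eps / 2) - 1)
print(round(shifted, 6), round(math.log(p), 6))           # agree
```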
B = B^j(θ) ∂_j, we have

A B ℓ_α(x, θ) = {p(x, θ)}^{(1−α)/2} (A B ℓ + ((1 − α)/2)(Aℓ)(Bℓ)) .

Hence, for three vector fields A, B, C, the α-covariant derivative
can be written as

∫ {A B ℓ_α(x, θ)}{C ℓ_{−α}(x, θ)} dP .   (3.9)

A variant is

F̃_α(p) = (2/(1 − α)) {p^{(1−α)/2} − 1} ,   α ≠ 1 ;   F̃_α(p) = log p ,   α = 1 ,

which tends to log p as α → 1. The definition (3.1), where we put
C_α(x) = 0, is used only for brevity's sake.
Let

This shows that every affine connection has a unique dual, determined
by

A⟨B, C⟩ = ∫ (A B ℓ_α)(C ℓ_{−α}) dP + ∫ (B ℓ_α)(A C ℓ_{−α}) dP ,

and ∇*. For a loop c, let c⁻¹ be the inverse loop encircling c in
the reverse order. It is easy to show

Π_{c⁻¹} = (Π_c)⁻¹ .

Hence, we have

⟨Π_c C, D⟩ = ⟨Π_{c⁻¹} Π_c C, Π*_{c⁻¹} D⟩ = ⟨C, Π*_{c⁻¹} D⟩ .

Since c⁻¹ is the reverse loop of c, for any vector fields A, B, C, D,
where i is the imaginary unit and s = (s_j). This shows that
exp{ψ(s + θ) − ψ(θ)} is the moment generating function.

δ_i(x) = 1 ,   x = i ;   δ_i(x) = 0 ,   x ≠ i .
By virtue of the relations

{p̃^i δ_i(x)}^{(1−α)/2} = (p̃^i)^{(1−α)/2} δ_i(x) ,
log{p̃^i δ_i(x)} = (log p̃^i) δ_i(x) ,

if we choose a new parametrization θ_α defined by

θ̃_α^i = (2/(1 − α)) (p̃^i)^{(1−α)/2} ,   α ≠ 1 ,
θ̃_α^i = log p̃^i ,   α = 1 ,

the α-representation of the distribution takes the following form
with respect to the new coordinate system,

ℓ_α(x, θ_α) = θ̃_α^i δ_i(x) .

This shows that the set S of all the distributions over n + 1 atoms
forms an α-family for any α, and that the θ_α gives the natural
homogeneous coordinate system of S_n as the α-family. Especially,
S_n is an exponential family and also a mixture family.
This implies that S̃ is α-flat and that the natural coordinate system
in general curved in S̃.

There is a method of obtaining the α-geodesic c in the α-family S
from the α-geodesic c̃ in the extended S̃. Roughly speaking, the
projection of c̃ from the origin to S is the desired geodesic in

Fig. 3.1
The constraint K(θ̃) = Σ_{i=1}^{n+1} θ̃^i = 1 which determines S is
linear in θ̃. Hence, the mixture family S itself is autoparallel.
The exponential family (α = 1) is also special. The extended
manifold S̃ of an exponential family is of the following form,

ℓ̃(x, θ̃) = θ̃^i c_i(x) + θ̃^{n+1} ,

so that the constraint determining S is θ̃^{n+1} = − ψ(θ̃) and is not
linear in θ̃. Hence, S is not an autoparallel submanifold in S̃.
However, S itself is also a 1-flat manifold having null
Riemann-Christoffel curvature, because the (α = 1)-connection
vanishes in this case,

Γ^{(1)}_{ijk}(θ) = E[∂_i ∂_j ℓ(x, θ) ∂_k ℓ(x, θ)] = 0 .

As a summary, the extended manifold S̃ of any α-family S is
α-flat, while S itself is in general not so. The mixture and
exponential families are exceptional in the sense that they are
−1- and 1-flat by themselves.
Proof. When dual coordinate systems (θ, η) exist, the metric g_ij is

Γ*_{ijk} = ∂_i g_jk .

Since ∇* is torsion-free, i.e., Γ*_{ijk} = Γ*_{jik},

∂_i g_jk = ∂_j g_ik

follows. This guarantees the existence of the potential ψ such that

or

E[c_i(x)] = ∂_i ψ(θ) = η_i .

The components of the (−1)-affine connection vanish in this
coordinate system,

Γ^{(−1) ijk}(η) = ⟨∇^{(−1)}_{∂^i} ∂^j, ∂^k⟩ = 0 ,

as we have already seen. The metric g^{ij}(η) = ⟨∂^i, ∂^j⟩ is the
inverse of g_ji, derived by g^{ij}(η) = ∂^i ∂^j φ(η) from the potential
φ(η). Let H(η) be the entropy function of the distribution specified
by η. It is given by
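The Legendre structure η = ∇ψ(θ), g = ∇²ψ(θ), φ(η) = θ·η − ψ(θ) can be illustrated with the Bernoulli family p(x; θ) = exp(θx − ψ(θ)), x ∈ {0, 1}, ψ(θ) = log(1 + e^θ) (an illustrative example, not the text's); its dual potential is the negative entropy:

```python
import math

# Bernoulli exponential family: psi(theta) = log(1 + e^theta).
theta = 0.8
psi = math.log(1 + math.exp(theta))
eta = 1 / (1 + math.exp(-theta))     # psi'(theta) = E[x], the dual coordinate
g = eta * (1 - eta)                  # psi''(theta) = Var[x] = Fisher information

# Legendre transform: phi(eta) = theta*eta - psi(theta) = -H(eta).
phi = theta * eta - psi
neg_entropy = eta * math.log(eta) + (1 - eta) * math.log(1 - eta)
print(round(phi, 10) == round(neg_entropy, 10))   # True
```

So the entropy H(η) is recovered as −φ(η), in agreement with the duality described above.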
system η is given by

∂_j η̃_i = (2/(1 + α)) ∫ ∂_j ∂_i m(x, θ̃) dP = g_ji(θ̃) ,

and the potential by

φ_α(η) = (2/(1 − α)) K(η) ,

where K(η) is the total measure of the distribution specified by η.
The inverse transformation is given by

θ̃^i = (2/(1 − α)) ∂^i K(η̃) .

The relation (3.20) reduces in this case to

θ̃^i η̃_i = 4 K/(1 − α²) .
3.5. α-divergence

Returning to a general Riemannian manifold S admitting a pair
of dual coordinate systems (θ, η), let us define a function
D : S × S → R by

D(P₁, P₂) = ψ(θ₁) + φ(η₂) − θ₁ · η₂ ,   (3.21)

that is, D is a function of two points P₁ and P₂ ∈ S whose
θ-coordinates are θ₁ and θ₂ and whose η-coordinates are η₁ and η₂,
respectively, where ψ and φ are the dual potentials and θ₁ · η₂ is
the abbreviation of

θ₁ · η₂ = θ₁^i η_{2i} .

As can easily be shown from (3.20), when P₁ = P₂, D(P₁, P₂) = 0. The
function D is called the divergence of the two points P₁ and P₂, from
P₁ to P₂, because it satisfies the following properties, where
D(θ, θ') denotes the divergence of the points whose θ-coordinates are
θ and θ':

Lemma.
1) D(θ, θ') ≥ 0. The equality holds when, and only when, θ = θ'.
2) ∂_i D(θ, θ') = ∂'_i D(θ, θ') = 0 at θ = θ', where ∂'_i = ∂/∂θ'^i.
3) ∂_i ∂_j D(θ, θ') = g_ij(θ) at θ = θ'.
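The Lemma is easy to verify numerically. A sketch, assuming the form (3.21) of D and using the Bernoulli exponential family p(x; θ) = exp(θx − ψ(θ)), ψ(θ) = log(1 + e^θ), as an illustrative manifold:

```python
import math

# Divergence D(theta, theta') = psi(theta) + phi(eta') - theta . eta'
# for the Bernoulli family (illustrative choice).
def psi(t): return math.log(1 + math.exp(t))
def eta(t): return 1 / (1 + math.exp(-t))
def phi(e): return e * math.log(e) + (1 - e) * math.log(1 - e)

def D(t1, t2):
    return psi(t1) + phi(eta(t2)) - t1 * eta(t2)

# Property 1: D >= 0, with equality exactly at theta = theta'.
print(round(D(0.7, 0.7), 12))                 # 0.0
print(D(0.7, -0.3) > 0, D(-0.3, 0.7) > 0)     # True True (and asymmetric)

# Property 3: the second theta-derivative at theta = theta'
# recovers the metric g(theta) = psi''(theta).
h, t = 1e-4, 0.7
hess = (D(t + h, t) - 2 * D(t, t) + D(t - h, t)) / h**2
print(round(hess, 4), round(eta(t) * (1 - eta(t)), 4))   # agree
```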
relation
has dual coordinate systems (θ, η) with the potentials ψ(θ) and φ(η).
differential-geometrical structures.
Fig. 3.2
θ̃^i η̃'_i = (2/(1 + α)) θ̃^i ∂̃_i K(θ̃')
      = (2/(1 + α)) ∫ θ̃^i c_i(x) {((1 − α)/2) θ̃'^j c_j(x)}^{(1+α)/(1−α)} dP
      = (4/(1 − α²)) ∫ {m(x, θ̃)}^{(1−α)/2} {m(x, θ̃')}^{(1+α)/2} dP
      = (4/(1 − α²)) ∫ m(x, θ̃) {m(x, θ̃')/m(x, θ̃)}^{(1+α)/2} dP .

For two distributions p(x, θ), p(x, θ') belonging to S, K(θ) = K(θ')
= 1 holds, so that (3.26) is proved.
It should be remarked that the α-divergences are already known and have been widely used in the statistical literature, without notice of their differential-geometrical background. Csiszar [1967, 1975] treats a general theory of this kind of divergence. The −1-divergence from p(x, θ) to p(x, θ') (and hence the 1-divergence from p(x, θ') to p(x, θ)) is the well-known Kullback information,

D_{−1}(θ, θ') = I{p(x, θ') : p(x, θ)} .

The α-divergence is nothing but the Chernoff distance of degree α. Especially, the 0-divergence is the Hellinger distance,

D_0(θ, θ') = 2 ∫ {√p(x, θ) − √p(x, θ')}^2 dP .  (3.27)
D_α(θ, θ') = (4/(1 − α²)) [1 − exp(− C_α{p(x, θ) : p(x, θ')})] ,  (3.29)

where C_α{p : p'} = − log ∫ p^{(1−α)/2} p'^{(1+α)/2} dP.
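On a finite sample space the α-divergence of (3.26) can be evaluated directly. The following sketch is an illustration (the distributions p, q are arbitrary example values, not from the text); it checks that α = 0 reproduces the Hellinger-type expression (3.27) and that α near −1 approaches the Kullback information.

```python
import math

def alpha_div(p, q, alpha):
    # D_alpha(p, q) = 4/(1 - alpha^2) * [1 - sum p^{(1-alpha)/2} q^{(1+alpha)/2}], alpha != +-1
    s = sum(pi ** ((1.0 - alpha) / 2.0) * qi ** ((1.0 + alpha) / 2.0)
            for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha * alpha) * (1.0 - s)

def kullback(p, q):
    # Kullback information: sum p log(p/q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def hellinger0(p, q):
    # the 0-divergence 2 * sum (sqrt(p) - sqrt(q))^2 of (3.27)
    return 2.0 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
```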
Fig. 3.3
Proof. Assume that θ' is the α-extreme point of θ. Then, for any vector A' belonging to the tangent space T_θ'(S') at θ', A'D_α(θ, θ') = 0. When θ' is the α-projection of θ, ⟨A', θ − θ'⟩ = 0 holds for any A' which is tangential to S'. Hence, A'D_α(θ, θ') = 0, proving that θ' is the α-extreme point.
The α-projection of θ is not necessarily unique in general. Moreover, an extreme point θ' is not necessarily the minimum point giving the best approximation of θ. The next theorem yields a condition which guarantees that the extreme point is unique and is the α-minimal point.
Proof. Assume the contrary, that there exist two points θ_1 and θ_2 ∈ ∂V (θ_1 ≠ θ_2), both of which are α-extreme points of θ ∈ S − V to V. Let us construct a triangle (θ, θ_1, θ_2), whose sides c_i connect θ
Fig. 3.4
follows. Let us construct a triangle (θ, θ', ζ), where θ' is the α-projection of θ and ζ is any point in V. Since the −α-geodesic connecting θ' and ζ is inside V, the angle between the two geodesics connecting θ and θ', and θ' and ζ, is not less than π/2. Hence, D_α(θ, ζ) ≥ D_α(θ, θ'), proving the theorem.
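For α = −1 the projection theorem contains the familiar Pythagorean relation of the Kullback information. The sketch below is an illustrative construction (the three-point distributions, the constraint function f and the level c are arbitrary choices): V = {r : Σ r_i f_i = c} is a mixture-flat set, the projection q of p onto V is found as an exponential tilt of p by bisection, and the relation D(r, p) = D(r, q) + D(q, p) is then checked for another member r of V.

```python
import math

p = [0.5, 0.3, 0.2]
f = [1.0, 0.0, -1.0]
c = 0.2                       # the mixture-flat set V = {r : E_r[f] = c}

def tilt(lam):
    # exponential tilt q ∝ p * exp(lam * f): candidate projections of p onto V
    w = [pi * math.exp(lam * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_f(r):
    return sum(ri * fi for ri, fi in zip(r, f))

# solve E_q[f] = c for the tilt parameter by bisection (E_q[f] is increasing in lam)
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_f(tilt(mid)) < c:
        lo = mid
    else:
        hi = mid
q = tilt(0.5 * (lo + hi))

def kullback(a, b):
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

r = [0.45, 0.3, 0.25]         # another member of V: mean_f(r) = 0.2
```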
where the tangent vector of curve i at the intersecting point is taken, i = 1, 2.
When the above inner product vanishes, the two curves are said to be
orthogonal. The tangent directions of a submanifold are defined in a
similar manner.
The α-geodesic connecting two points m_1(x) and m_2(x) in S is defined by the curve

ℓ_α(x, t) = ℓ_{α1}(x) + t{ℓ_{α2}(x) − ℓ_{α1}(x)} ,  t ∈ [0, 1]

in the α-representation. This definition suggests that S is α-flat for any α, and ℓ_α(x) gives the α-affine coordinate system of S. The α-geodesic p(x, t) connecting two probability distributions p_1(x) and p_2(x) in S is given by

ℓ_α(x, t) = c(t)[ℓ_{α1}(x) + t{ℓ_{α2}(x) − ℓ_{α1}(x)}] ,  α ≠ 1 ,
ℓ(x, t) = c(t) + ℓ_1(x) + t{ℓ_2(x) − ℓ_1(x)} ,  α = 1 .
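For discrete distributions the normalized α-geodesic can be implemented directly from the formula above (α ≠ 1): interpolate in the α-representation ℓ_α ∝ p^{(1−α)/2} and renormalize, which supplies the factor c(t). A small illustrative sketch (the example distributions are arbitrary):

```python
def alpha_geodesic(p1, p2, t, alpha):
    # point at parameter t on the alpha-geodesic between p1 and p2 (alpha != 1):
    # linear interpolation in the representation p^{(1-alpha)/2}, then renormalization
    k = (1.0 - alpha) / 2.0
    m = [(a ** k + t * (b ** k - a ** k)) ** (1.0 / k) for a, b in zip(p1, p2)]
    z = sum(m)                 # normalization supplies the factor c(t)
    return [mi / z for mi in m]

p1 = [0.2, 0.5, 0.3]
p2 = [0.6, 0.1, 0.3]
mid_mixture = alpha_geodesic(p1, p2, 0.5, -1.0)   # alpha = -1: ordinary mixture
end0 = alpha_geodesic(p1, p2, 0.0, 0.0)
end1 = alpha_geodesic(p1, p2, 1.0, 0.0)
```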
Fig. 3.5
Γ*_ijk(θ) = − ∂'_i ∂'_j ∂_k D(θ, θ')|_{θ'=θ} .  (3.39)
It is not difficult to prove that these two are indeed mutually dual,
satisfying the law of coordinate transformations for affine
connections. However, it should be noted that these geometrical structures depend only on the local properties of the function D(θ, θ') in a small neighborhood of θ = θ'.
Now we confine our attention within the class of invariant
divergences, and search for the geometrical structures derived
therefrom. Since a divergence D(p, q) from p(x) to q(x) is a
functional of p(x) and q(x) taking non-negative values, it is
natural to consider the following type of functionals
D(p, q) = E_p[F{p(x), q(x)}] = ∫ F{p(x), q(x)} p(x) dP ,  (3.40)
where F is some function and Ep is the expectation with respect to
p(x). We then require that D(p, q) should be invariant under any (coordinate) transformation of the sample space X, i.e., a transformation of the random variable x into y. Then, p(x) and q(x) are transformed to

p̃(y) = p{x(y)} J^{−1}(y) ,  q̃(y) = q{x(y)} J^{−1}(y) ,

where J(y) is the Jacobian of the transformation.
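On a finite sample space, a smooth transformation of X reduces to a relabeling of its points (the Jacobian factor being trivial), and invariance of a functional of the form (3.40) is immediate to check. An illustrative sketch (F{p, q} = −log(q/p) is chosen here, giving the Kullback information; any other choice of F would do):

```python
import math

def div(p, q, F):
    # D(p, q) = E_p[ F{p(x), q(x)} ]  -- the form (3.40) on a finite sample space
    return sum(pi * F(pi, qi) for pi, qi in zip(p, q))

F = lambda pi, qi: -math.log(qi / pi)   # Kullback choice of F

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
perm = [2, 0, 1]                        # a bijection of the sample space
p_t = [p[i] for i in perm]
q_t = [q[i] for i in perm]
```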
Proof. From

D_f(θ, θ') = ∫ p(x, θ) f{ p(x, θ')/p(x, θ) } dP(x) ,

we have by differentiating the above

∂_j ∂_i D_f = E_θ[(∂_j p)(∂_i p) f''{p(x, θ')/p(x, θ)} p(x, θ')/{p(x, θ)}^3] ,

where α = 2f'''(1) + 3.
torsion-free, when and only when g_ij and Γ_ijk are given from a divergence function D(θ, θ') by (3.37) and (3.38), as Eguchi did.
6. Given a Riemannian manifold {S, g}, is it possible to associate a tensor T_ijk such that the induced manifold {S, g, T} is α-flat for some α? If not, what is the condition imposed on the Riemannian metric to guarantee this? Lauritzen (1984) defined a new notion called conjugate symmetry. A statistical manifold S is said to be conjugate symmetric, when its Riemann-Christoffel curvature tensor satisfies, for any α,

R^(α)_ijkm = R^(−α)_ijkm

or equivalently

R^(α)_ijkm = − R^(α)_ijmk .

This always holds for α = 0, because the 0-connection is metric. Many statistical manifolds are conjugate symmetric. Lauritzen showed that an α-flat family for some α is always conjugate symmetric. He also presented an example of a statistical manifold which is not conjugate symmetric. We do not yet know the statistical implications of conjugate symmetry.
3.9. Notes
Many researchers have proposed various distance-like measures between two probability distributions. They are, for example, the Bhattacharyya [1943] distance, Hellinger distance, Rao's Riemannian distance, Jeffreys [1948] divergence, Kullback-Leibler [1951] information, Chernoff [1952] distance, Matusita [1955] distance, Kagan [1963] divergence, Csiszar [1967a, b] f-divergence, etc. Chentsov [1972] and Csiszar [1975] remarked on the dualistic structures of the geometry of the exponential family based on the Kullback divergence (see also Efron [1978], Barndorff-Nielsen [1978]).
Csiszar [1967a, b] studied f-divergence (which includes the
a-divergence as a special case) and showed the topological
R^(α)_ijkm(θ) = ((1 − α²)/4)(T_ikr T_jms − T_imr T_jks) g^{rs} ,

so that it vanishes for α = ±1.
or conversely

∂^j = g^{ji} ∂_i ,

where g^{ji} is the inverse of the metric tensor g_ij. The metric tensor in the η-coordinate system is given by the inverse of g_ji, because of

⟨∂^i, ∂^j⟩ = ⟨g^{ik} ∂_k, g^{jm} ∂_m⟩ = g^{ik} g^{jm} g_{km} = g^{ij} .

Similarly, we can obtain the α-connection by

Γ^(α)ijk = ⟨∇^(α)_{∂^i} ∂^j, ∂^k⟩ = − ((1 + α)/2) T^{ijk}

in the η-coordinate system, where T^{ijk} = g^{im} g^{jr} g^{ks} T_{mrs}. This vanishes for α = −1, showing that the η-coordinate system is −1-affine.
The dual potential φ(η) is defined implicitly by

ψ(θ) + φ(η) = θ^i η_i .

For the normal distribution,

η_1 = − θ_1/(2θ_2) = μ ,  η_2 = {θ_1/(2θ_2)}^2 − 1/(2θ_2) = μ^2 + σ^2 ,
g_ij(θ) = [ σ^2      2μσ^2
            2μσ^2    2σ^4 + 4μ^2σ^2 ] ,

g^{ij}(η) = (1/σ^4) [ σ^2 + 2μ^2   −μ
                      −μ           1/2 ] ,
where μ and σ^2 are considered as functions of θ or η. They differ from those given in Example 2.2, because the coordinate systems are different. The tensor T_ijk is given by

T_112 = 2σ^4 ,  T_122 = 8μσ^4 ,

and the α-connection is

Γ^(α)_ijk = ((1 − α)/2) T_ijk .

The Riemann-Christoffel curvature R_ijkm is given by R_1212 = − (1 − α^2)σ^6 in this coordinate system θ, and the scalar curvature is K = − (1 − α^2)/2, which is the same for any coordinate system.
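The metric of this example can be verified numerically. Writing the normal family as p(x, θ) = exp{θ_1 x + θ_2 x² − ψ(θ)} with ψ(θ) = −θ_1²/(4θ_2) + (1/2) log(π/(−θ_2)) (a standard computation, restated here as an assumption of the sketch), g_ij = ∂_i∂_j ψ can be approximated by finite differences and compared with the matrix above:

```python
import math

def psi(t1, t2):
    # log-normalizer of exp(t1*x + t2*x^2), valid for t2 < 0
    return -t1 * t1 / (4.0 * t2) + 0.5 * math.log(math.pi / (-t2))

mu, sig2 = 1.0, 4.0
t1, t2 = mu / sig2, -1.0 / (2.0 * sig2)   # theta-coordinates of N(mu, sig2)

h = 1e-5
def second_diff(i, j):
    # central mixed second difference of psi at (t1, t2)
    di = (h, 0.0) if i == 0 else (0.0, h)
    dj = (h, 0.0) if j == 0 else (0.0, h)
    return (psi(t1 + di[0] + dj[0], t2 + di[1] + dj[1])
            - psi(t1 + di[0] - dj[0], t2 + di[1] - dj[1])
            - psi(t1 - di[0] + dj[0], t2 - di[1] + dj[1])
            + psi(t1 - di[0] - dj[0], t2 - di[1] - dj[1])) / (4.0 * h * h)

g_num = [[second_diff(i, j) for j in (0, 1)] for i in (0, 1)]
g_exact = [[sig2, 2.0 * mu * sig2],
           [2.0 * mu * sig2, 2.0 * sig2 ** 2 + 4.0 * mu ** 2 * sig2]]
```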
i.e.,

A(u) = {η | η_1 = u + uv , η_2 = 2u^2 + u^2 v} ,

where v is the parameter specifying points on A(u). The parameter v can be regarded as a coordinate on the line A(u), where the origin v
Fig. 4.2
denote the point which is on A(u) and which has the coordinate v in A(u). The θ- and η-coordinates of this point are written as

θ = θ(w) = θ(u, v) ,  η = η(w) = η(u, v) ,
The tangent vectors ∂_α = ∂/∂w^α of the w-coordinate system are represented by

∂_α = B_α^i ∂_i  or  ∂_α = B_{αi} ∂^i ,

where B_α^i = ∂θ^i/∂w^α and B_{αi} = ∂η_i/∂w^α. The metric tensor is

g_αβ = ⟨∂_α, ∂_β⟩ = B_α^i B_β^j g_ij = B_{αi} B_{βj} g^{ij} .
Obviously, g_ab and g_κλ are, respectively, the metric tensors of M and A(u). The mixed part g_aκ = ⟨∂_a, ∂_κ⟩ evaluated at w = (u, 0) represents the angle between the tangent spaces T(M) and T(A) at the intersecting point w = (u, 0). When g_aκ = 0, M and A(u) are said to intersect perpendicularly.
The components of the α-affine connection are given by
(4.9)

H^(α)_abκ = ⟨∇^(α)_{∂_a} ∂_b , ∂_κ⟩ = Γ^(α)_abκ .  (4.11)
Hence, B_{αi} is given by

B_{αi} = [ 1 + v    2u(2 + v)
           u        u^2       ] ,
where α = 1, 2, and w^1 = u, w^2 = v, the indices a, b, etc. standing only for 1 and κ, λ, etc. standing only for 2. The tangent vectors ∂_a and ∂_κ of M and A(u) are given by ∂_α = B_{αi} ∂^i, and the metric tensor g_αβ is given by

g_αβ = [ g_ab   g_aκ     [ 3/u^2   0
         g_λb   g_λκ ] =   0       3/2 ] ,
where g_αβ is evaluated on M, i.e., at (u, 0). The g_ab = 3/u^2 is the Fisher information of M, and g_aκ = ⟨∂_a, ∂_κ⟩ = 0 implies that M and A(u) are orthogonal at the intersection. The tensor T_αβγ is

T_abc = 14/u^3 ,  T_abκ = − 1/u^2 ,
T_aκλ = 2/u ,     T_κλν = − 4 ,
in the w-coordinate system. By calculating (∂_β B_{γi}) B_{δj} g^{ij} and evaluating them at (u, 0), we have the components of the α-connection as follows,

Γ^(α)_abc = − (3 + 7α)/u^3 ,
Γ^(α)_aκb = (3 + α)/(2u^2) ,
Γ^(α)_κλa = − (1 + α)/u .

Especially, the exponential curvature of M is

H^(e)_abκ = − 1/u^2 .
This does not vanish, since M itself is not an exponential family.
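The entries of T_αβγ above can be spot-checked against standard normal moments. For this model x ~ N(u, u²) one finds, with z = (x − u)/u, the tangent scores ∂_a ℓ = (z² + z − 1)/u along M and ∂_κ ℓ = (−z² + 2z + 1)/2 along A(u) at v = 0 (a short computation not displayed in the text). The sketch below evaluates the E[(∂ℓ)³]-type expectations exactly through the moments E[z^n] = (n − 1)!! and recovers the four values listed:

```python
def poly_mul(a, b):
    # multiply two polynomials in z given by coefficient lists (index = power)
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def normal_expect(a):
    # E[poly(z)] for z ~ N(0, 1): E[z^n] = 0 (n odd), (n-1)!! (n even)
    def moment(n):
        if n % 2:
            return 0.0
        r = 1.0
        for k in range(n - 1, 0, -2):
            r *= k
        return r
    return sum(ai * moment(i) for i, ai in enumerate(a))

su = [-1.0, 1.0, 1.0]    # u * (score along M):       z^2 + z - 1
sv = [0.5, 1.0, -0.5]    # score along A(u) at v = 0: (-z^2 + 2z + 1)/2

T_abc = normal_expect(poly_mul(su, poly_mul(su, su)))    # -> 14, i.e. T_abc = 14/u^3
T_abk = normal_expect(poly_mul(poly_mul(su, su), sv))    # -> -1, i.e. T_abk = -1/u^2
T_akl = normal_expect(poly_mul(su, poly_mul(sv, sv)))    # ->  2, i.e. T_akl = 2/u
T_kkk = normal_expect(poly_mul(sv, poly_mul(sv, sv)))    # -> -4
```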
(4.15)
where

x̄ = (1/N) Σ_{i=1}^N x_i  (4.16)

is the arithmetic mean of the N observations. The statistic x̄ is sufficient, because the density function p(x_1, ..., x_N; θ) depends on x_1, ..., x_N only through x̄. Moreover,

ℓ(x_1, ..., x_N; θ) = N ℓ(x̄, θ)

holds. The expectation and covariance of x̄ are
A(u) = {η | u = û(η)} ,

i.e., A(u) is the set of the observed points which are mapped to u by the estimator û (Fig. 4.3). It is assumed that {A(u)} forms a smooth
Fig. 4.3
critical region of the test T, is the subset of S such that, when the
function of ft only.
We will give the joint probability distribution of the new variables. When the true distribution is q(x, u), i.e., the true parameter is u, define

x̃ = √N {x̄ − η(u, 0)} ,  ũ = √N (û − u) ,  ṽ = √N v̂ ,  (4.19)
w̃ = (ũ, ṽ) .
rewritten as

x̃_i = B_{αi} w̃^α + (1/(2√N)) C_{αβi} w̃^α w̃^β + (1/(6N)) D_{αβγi} w̃^α w̃^β w̃^γ + o_p(N^{−3/2}) ,

where o_p(N^{−3/2}) denotes a random variable of order smaller than N^{−3/2}, and

D_{βγδα} = (∂_β ∂_γ B_{δi}) B_{αj} g^{ij} .  (4.22)

This can be seen as follows:
∇^(m)_{∂^i} ∂^j = 0 ,

from which ∇^(m)_{∂_β} ∂^j = 0 follows. Therefore,

∇^(m)_{∂_β} ∂_γ = ∇^(m)_{∂_β} (B_{γi} ∂^i) = (∂_β B_{γi}) ∂^i ,
∇^(m)_{∂_β} ∇^(m)_{∂_γ} ∂_δ = ∇^(m)_{∂_β} {(∂_γ B_{δi}) ∂^i} = (∂_β ∂_γ B_{δi}) ∂^i

are obtained. By taking the inner product of the above and ∂_α = B_{αj} ∂^j, (4.21) and (4.22) are derived.
Since the statistic x̄ tends to a normal random variable as N tends to infinity, so does the statistic w̃, as is seen from (4.20). The average and covariance of w̃ are given by
(4.24)
This enables us to calculate the expectation more accurately as
(4.25)
we modify û into

û*^a = û^a + C^a(û)/(2N) ,

so that

E[w̃*] = O(N^{−3/2}) .
the variable w̃ = (w̃^α) with respect to the metric g_αβ, and study their properties. Let us define a derivation operator D^α by D^α = g^{αβ} ∂_β. Then, operating D^α repeatedly on the density function n(w) = n[w; g^{αβ}] of the normal distribution with covariance g^{αβ}, we easily have

D^α n(w) = − w^α n(w) ,
D^{αβ} n(w) = (w^α w^β − g^{αβ}) n(w) ,
D^{αβγ} n(w) = (3g^{(αβ} w^{γ)} − w^α w^β w^γ) n(w) ,

etc., where D^{α_1 ... α_k} is the abbreviation of D^{α_1} D^{α_2} ... D^{α_k}. The tensorial Hermite polynomials are defined by the coefficient polynomials of the above derivations,

h^{α_1 ... α_k}(w) = (−1)^k {D^{α_1 ... α_k} n(w)}/n(w) .  (4.30)

We show some of them explicitly,

h^0 = 1 ,  h^α = w^α ,  h^{αβ} = w^α w^β − g^{αβ} ,
h^{αβγ} = w^α w^β w^γ − 3g^{(αβ} w^{γ)} ,
h^{αβγδ} = w^α w^β w^γ w^δ − 6g^{(αβ} w^γ w^{δ)} + 3g^{(αβ} g^{γδ)} ,

where k! times the bracket (α_1 ... α_k) implies that all the possible k! permutations are taken for the indices α_1, ..., α_k and all the resultant terms are added.
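Specialized to one dimension with a scalar metric g, the polynomials above obey the recurrence h_{k+1}(w) = w h_k(w) − k g h_{k−1}(w) (the scalar counterpart of the symmetrized-index structure; an illustrative reduction, not stated in the text):

```python
def hermite(k, w, g):
    # one-dimensional tensorial Hermite polynomial h^k(w) with metric g
    h_prev, h = 1.0, w           # h^0 = 1, h^1 = w
    if k == 0:
        return h_prev
    for n in range(1, k):
        h_prev, h = h, w * h - n * g * h_prev
    return h

w, g = 0.7, 2.0
h2 = hermite(2, w, g)            # w^2 - g
h3 = hermite(3, w, g)            # w^3 - 3 g w
h4 = hermite(4, w, g)            # w^4 - 6 g w^2 + 3 g^2
```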
When w is decomposed into w^α = (u^a, v^κ) and the orthogonality relation g_aκ = 0 holds (i.e., ∂_a and ∂_κ are orthogonal), the following Pythagorean decomposition

g_αβ w^α w^β = g_ab u^a u^b + g_κλ v^κ v^λ

is valid. This implies

n(w; g^{αβ}) = n(u; g^{ab}) n(v; g^{κλ}) ,

and hence by operating D^{a_1 ... a_k κ_1 ... κ_l} on the both sides of the above follows

h^{a_1 ... a_k κ_1 ... κ_l}(w) = h^{a_1 ... a_k}(u) h^{κ_1 ... κ_l}(v) .  (4.32)
Moreover, the following relation is useful,

∫_{−∞}^{∞} c_{α_1 ... α_k} h^{α_1 ... α_k}(w) n(w) dv = c_{a_1 ... a_k} h^{a_1 ... a_k}(u) n(u) .  (4.33)
Returning to the Gram-Charlier expansion

p(w*; u) = n(w*; g_αβ) {1 + Σ_{k=1} c_{α_1 ... α_k} h^{α_1 ... α_k}(w*)} ,

the contravariant (i.e., upper index) version of the coefficients is given by the use of the orthogonality relation (4.31), where indices are raised or lowered by the use of the metric tensor, and all the other coefficients vanish except for the terms of order higher than N^{−1}. Hence, in order to get the Edgeworth expansion up to the order of N^{−1}, we need only to calculate the moments of w*^α up to the fourth order. The cumulants are easily obtained therefrom.
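How a Gram-Charlier coefficient encodes a cumulant can be seen in a one-dimensional illustration (not the multivariate expansion of the text): take p(w) = n(w){1 + (κ_3/6) h³(w)} with n the standard normal density and h³(w) = w³ − 3w; then p integrates to one, has mean 0 and variance 1, and its third moment is exactly κ_3. The sketch checks this by quadrature:

```python
import math

kappa3 = 0.4

def n_pdf(w):
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)

def p_pdf(w):
    # Gram-Charlier density with a single third-order term
    return n_pdf(w) * (1.0 + (kappa3 / 6.0) * (w**3 - 3.0 * w))

# trapezoidal moments over a wide grid (the integrand is negligible at the ends)
a, b, m = -10.0, 10.0, 20001
step = (b - a) / (m - 1)
mom = [0.0, 0.0, 0.0, 0.0]
for i in range(m):
    w = a + i * step
    wt = (0.5 if i in (0, m - 1) else 1.0) * step
    pw = p_pdf(w)
    for r in range(4):
        mom[r] += (w ** r) * pw * wt
```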
The calculation of the covariance (second moment) of w̃* is most cumbersome. Although we can get it by direct calculations from
(4.37)

Theorem 4.3.

p(w̃*; u) = n(w̃*; g_αβ) {1 + A_N(w̃*) − (1/(2N)) (∂_μ C^γ) g_γβ h^{μβ}(w̃*)} + O(N^{−3/2}) ,  (4.38)

where
(4.39)
It may be more natural to correct the bias of an estimator û not at ŵ = (û, v̂) but at (û, 0). To this end, we define the partial bias correction by

û** = û + C{(û, 0)}/(2N) .

Then, by defining

Theorem 4.4.

p(w̃**; u)
4.5. Notes

The curved exponential family is a statistical model which is convenient in constructing a differential-geometrical theory, because there exists a finite-dimensional sufficient statistic x̄ irrespective of the number N of independent observations. (It is possible to generalize the geometrical theory to be applicable to a more general family.) The term "curved exponential family" was introduced by Efron [1975, 1978], and has been widely used. The dualistic structures of the exponential family are studied in detail by Barndorff-Nielsen [1978]. See also Chentsov [1972], Efron [1978]. The differential geometry of the exponential and curved exponential family is studied in Amari [1980], in which he introduced the ancillary family A(u) with a related coordinate system. This gives a geometrical framework for statistical inference.
The Edgeworth and Gram-Charlier expansions are explained in many books, e.g., in Kendall and Stuart [1963]. The Edgeworth expansion of the distribution of an estimator is evaluated up to the third order (i.e., up to the term of order N^{−1}) by a number of researchers who have constructed the so-called higher-order asymptotic theory of statistical inference (see, e.g., LeCam [1956], Rao [1961, 1962], Akahira and Takeuchi [1981a], Pfanzagl [1980], Ghosh and Subramanyam [1974], Chibisov [1972], Skovgaard [1982], etc.). The present treatment is based on Amari and Kumon [1983]. The merits of the present method are the following two: Firstly, the joint density function of û and v̂ is evaluated for the first time, so that the density function of v̂ (which turns out to be asymptotically ancillary) and the conditional distribution p(û | v̂) are easily obtained therefrom. Secondly, the geometrical interpretation is given to each term in the expansion, so that we can see which terms are invariant and which terms depend on the estimator or on the manner of parameterization of M. (See also McCullagh, 1984b.)
5. ASYMPTOTIC THEORY OF ESTIMATION
and

C^a = C^a_{cd} g^{cd} + C^a_{κλ} g^{κλ} = Γ^(m)a_{cd} g^{cd} + H^(m)a_{κλ} g^{κλ} ,  (5.3)

because of g_aκ = 0, C^a_{cd} = Γ^(m)a_{cd} and C^a_{κλ} = H^(m)a_{κλ}. Hence the asymptotic bias b^a of an efficient estimator û^a is given by the sum of the two terms: one is derived from the mixture connection of M and is common to all the efficient estimators, and the other is derived from the mixture curvature of the associated A, which depends
+ (1/(72N)) K_{abc} K_{def} h^{abcdef} ,

where h^{abc} etc. are the Hermite polynomials in ũ* with respect to the metric g_ab. The third and fourth cumulants of ũ* are given by

and they are common to all the first-order efficient estimators. The estimators differ only in the term

which represents the geometric properties of the associated ancillary family A as

C²_ab = (Γ^(m)_M)²_ab + 2(H^(e)_M)²_ab + (H^(m)_A)²_ab ,  (5.6)
where

C_{aκb} = ∂_a g_{κb} − H^(e)_{abκ}  (5.10)

is used in calculating C²_ab, where g_{κb} = 0.
for p 2.
Since

ℓ(x̄, û) = θ(û)·x̄ − ψ{θ(û)} ,

it is written as

B_a^i(û) {x̄_i − η_i(û)} = 0 ,  (5.12)

where B_a^i = ∂θ^i/∂u^a. This shows that the associated ancillary manifold A(u) = û^{−1}(u) attached to the point u is given by

B_a^i(u) {x̄_i − η_i(u)} = 0 .  (5.13)
Then, the general solutions of (5.13) are given as
0a = B;(u)oj
of M. because of
oK > o.
Fig. 5.1
B_a^i(u) g_ij(u) {θ^j − θ^j(u)} = 0 ,

or 4θ_2 u^2 + θ_1 u + 1 = 0 in the present case. By the use of the relation between θ and η, this can be rewritten as 2u^2 − uη_1 + η_1^2 − η_2 = 0 in the η-coordinate system. Hence, A'(u) is a parabola in the η-coordinate system (Fig. 5.1), and the dual m.l.e. û' is given by solving

2û'^2 − x̄_1 û' + (x̄_1)^2 − x̄_2 = 0 .
By introducing a coordinate v in A'(u), we have the new (u, v) coordinate system such that
estimator is

û'* = {1 + 4/(9N)} û' .

The mean square error is

E[(û'* − u)^2] = { 1/(3N) + 2/(9N^2) } u^2 + O(N^{−3}) .

This is second-order efficient, but is not third-order efficient, because of H^(m)_{κλa} ≠ 0.
u^2), or

x = u + ε_u ,

where ε_u ~ N(0, u^2). Hence, given N independent observations x_1, ..., x_N, we can consider the least squares estimator û_LS, which minimizes

Σ_i (x_i − u)^2 ,

by

η_1(u, v) = u ,  η_2(u, v) = 2u^2 + v .
The tangent vector ∂_a of M is given by

B_{ai} = [1, 4u] ,
as

g_αβ = [ g_ab   g_aκ
         g_λb   g_λκ ] .

Since g_aκ ≠ 0, the A(u) is not orthogonal. Hence, û_LS is not first-order efficient. From

ḡ_ab = 1/u^2 ,

we have

E[(û_LS − u)^2] = (1/N) u^2 + O(N^{−2}) .
(5.14)

When H^(m)_{κλa} = 0, no improvement is possible, because û is already third-order efficient. The improved û does not include the term H^(m)_{κλa} v^κ v^λ. In other words, the ancillary submanifold is modified by (5.14) such that the improved ancillary family is free from the mixture curvature at v = 0. Hence û* is third-order efficient.
or

2v^2 − x̄_1 v + 2x̄_1^2 − x̄_2 = 0 .

The improved û is third-order efficient, but is different from the m.l.e.
A_N(ũ'; u) − (1/(2N)) (∂_a C^d) g_db h^{ab}(ũ') ,

(1/(2N)) C^a(u) .
Since the bias term C^a includes H^(m)a_{κλ} g^{κλ}, the third-order term of the mean square error depends not only on the mixture curvature H^(m)_{κλa}(u) but also on the derivative ∂_b H^(m)_{κλa}(u) of the mixture curvature. This implies that there is no ancillary family A which
parametrization.
We give a rather tricky example to show that one can further

g_ij = [ 1   0
         0   1 ] ,

and the α-connection identically vanishes for all α. The η-coordinates and θ-coordinates coincide, η_1 = θ_1, η_2 = θ_2.
In this problem, only y is related to the unknown parameter u, and z is merely a noise presented independently of u. Hence, y is a sufficient statistic, and it is plausible to assume that the m.l.e. given by û = ȳ is the optimal estimator. However, we consider for comparison the following c-estimator

û_c = ȳ + c z̄^2 / ȳ ,

where c is a constant. Obviously û_0 (c = 0) is the m.l.e., and û_c is
Fig. 5.2
(1/3)-estimator is

E[(ũ*_{1/3})^2] = 1 − {1/(3u^2)} N^{−1} + O(N^{−2}) ,

which is smaller than that of the m.l.e.

û'_c = ȳ + c/(N ȳ) ,

which is obtained by replacing z̄^2 by its expectation 1/N, yields a better estimator,

E[(ũ'_c)^2] = 1 + {c(c − 2)/u^2} N^{−1} + O(N^{−2}) .
This kind of improvement is virtual, depending on the bias term. It also depends on the manner of parametrization and on the region to which u is assumed to belong. This suggests that it is inadequate to compare the higher-order asymptotic behaviors of estimators without correcting the bias. Or it is inadequate to evaluate an estimator by the criterion of the least square error or of the covariance without making a bias correction. An invariant evaluation is given only after bias correction or by the amount of Fisher information which an estimator carries.
be the inverse image of the bias-corrected version û*. Then, A*(u) depends on N, so that we may regard û* as an estimator whose ancillary
so that

C_{aκb} = ∂_a g_{κb} − H^(e)_{abκ} = − H^(e)_{abκ} + O(N^{−1/2})
is used. In the next chapter concerning statistical tests, we treat
function.
Γ^(0)_abc(u_0) = 0 ,
This parametrization can be considered covariance stabilizing in a
E[∂_a ∂_b ∂_c ℓ(x, u)] = 0 ,

as can easily be proved from the definition. This implies that g_ab(û)(û^b − u^b) is close to a normal distribution in some sense in
t_α = ∫ u^{−1−7α/3} du ∝ u^{−7α/3} ,  α ≠ 0 ,
t_0 ∝ log u ,  α = 0 .

This t = t_α(u) is the α-affine parameter. For example, t_{−1} = u^{7/3} is
i = 1, ..., n ,  (5.23)

where ε_i are mutually independent normal random variables subject to N(0, 1), f is a known function of known control parameters c_i and of
distributions
p(x, θ) ,  ψ(θ) ,
affine connection Γ_abc. See also Hougaard [1981] and Kass [1984].
following notations,

ŵ = (û, v̂) ,  ŵ_(i) = (û_(i), v̂_(i)) ,
δŵ_(i) = ŵ_(i) − ŵ = (δû_(i), δv̂_(i)) .

The jackknife estimate b̂ = (b̂^a) of the bias of an estimator û is given by the (N − 1) times average of the deviations δû_(i),

b̂ = ((N − 1)/N) Σ δû_(i) .  (5.28)

The jackknife estimator û_JK of û is the bias-corrected one,

û_JK = û − ((N − 1)/N) Σ δû_(i) .  (5.29)
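The definitions (5.28) and (5.29) can be illustrated on the plug-in variance, whose O(1/N) bias the jackknife removes exactly; that the corrected value coincides with the unbiased (N − 1)-denominator variance is a standard fact, checked here numerically (the data are arbitrary illustrative numbers):

```python
def plug_in_var(xs):
    # biased variance estimator with denominator N
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def jackknife(est, xs):
    # jackknife bias estimate and corrected estimator in the sense of (5.28)-(5.29)
    N = len(xs)
    loo = [est(xs[:i] + xs[i + 1:]) for i in range(N)]   # leave-one-out estimates
    theta = est(xs)
    bias = (N - 1.0) * (sum(loo) / N - theta)            # bias estimate
    return bias, theta - bias                            # (bias, corrected estimator)

xs = [2.0, 4.0, 4.5, 7.0, 1.5, 3.0]
bias_hat, corrected = jackknife(plug_in_var, xs)
N = len(xs)
unbiased = plug_in_var(xs) * N / (N - 1.0)
```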
where the indices of B_{αi} and C_{αβi} are also omitted, and B and C are

Theorem 5.11. The bias estimate b̂ converges to the true bias of û in probability. The jackknife estimator û_JK coincides with the bias-corrected estimator û* up to O_p(N^{−1}).
C^α_{βγ} g^{βγ} + O_p(N^{−3/2}) ,

we have

Σ δŵ_(i)^α = (1/(2N)) C^α_{βγ} g^{βγ} + O_p(N^{−3/2}) .

The term Σ δû_(i) is the u-part of the above, so that
5.7. Notes

The higher-order asymptotic theory of estimation was initiated by Fisher (1925) and Rao (1961, 1962, 1963), and was studied by many researchers, e.g., Chibisov (1973b), Ghosh and Subramanyam (1974), Pfanzagl (1982), Akahira and Takeuchi (1981a). It was Efron (1975) who first pointed out the important role of the statistical curvature in the higher-order theory of estimation. Efron's statistical curvature γ² is indeed

γ² = (H^(e)_M)²_{ab} g^{ab} ,

the square of the exponential curvature of the statistical model in our terminology. A multidimensional generalization of the statistical curvature is given by Reeds (1975) and Madsen (1979). Its importance has been widely recognized (see, e.g., Reid (1983)). The geometrical foundation was given by Amari (1980, 1982a) and Amari and Kumon (1983), where the mixture curvature plays as important a role as the exponential curvature does. The mixture curvature also plays an important role when a statistical model includes nuisance parameters. It is possible to define these curvatures for a general regular statistical model other than a curved exponential family, and to construct the higher-order theory of estimation (see Amari, 1984a).
There seems to be some confusion in the usage of the term "order" in the higher-order asymptotic theory. The distribution function of ũ or ũ* is expanded as

p(ũ*; u) = p_1(ũ*) + N^{−1/2} p_2(ũ*) + N^{−1} p_3(ũ*) + O(N^{−3/2}) ,

as in (5.5). Some people call the term p_i(ũ*) the i-th order term,
g²_ab = 0

for regular efficient estimators. Hence, one sometimes calls g³_ab, derived from p_3(ũ*), the second-order term. We shall use the latter usage in Chapter 7, where loss of information is treated.
We have shown the characteristics of an estimator û which depend
Wei (1984) for the jackknife method. Akahira (1982) evaluated the
submanifold.
The power P_T(u) of a test T at u is the probability that the hypothesis H_0 is rejected, when the true parameter of the distribution
Fig. 6.1a)
Fig. 6.1b)
A test T of significance level α is unbiased, when its power P_T(u) at any u ∉ D is not less than α. The power function P_T(u) can be expressed as
is written as
as
where
instead of û, when one calculates the power P_T(u) at a point u, where

C^a = C^a_{αβ} g^{αβ} ,  i.e.,
Fig. 6.2
holds.
between η and η(u) ∈ M, where η and η(u) are in the same A(u). Then, any point η ∈ S is specified by the new coordinate system w = (u, v) as η = η(w), or explicitly as

η_1 = (1 − v) sin u ,  η_2 = 1 − (1 − v) cos u ,

where u shows that η is in A(u) and v denotes the distance between η and M. The power P_T(u) is given by

P_T(u) = 1 − ∫_{u_−}^{u_+} p(û; u) dû .
On the other hand, the likelihood ratio test (l.r. test) uses the following test statistic

λ(x̄) = − 2 log[ q(x̄, u_0) / max_u q(x̄, u) ]
      = − 2 log[ q(x̄, u_0) / q(x̄, û) ] ,

where û(x̄) is the m.l.e. We have

λ(x̄) = 2{ x̄_1 sin û − (1 − x̄_2)(1 − cos û) } .

The associated ancillary family is composed of the curves {λ(η) = c}. The curves are parabolas given by

η_2 = − (1/(4c)) (η_1)^2 + 1 + c

(see Fig. 6.3b). They can be expressed in the parametric form
concern whether these two submanifolds A(u_−) and A(u_+) are a part of one connected submanifold or two separate ones.
u is the true parameter, the power P_T(u) tends to 1 for any fixed u ∉ D. Hence, characteristics of a test sequence {T_N} should be
distance t/√N. When D has a smooth boundary ∂D, the set U_N(t) is an (m−1)-dimensional submanifold surrounding D in M, and it approaches
Fig. 6.4
two tests, we use their average powers in all the directions. Let

P̄_T(t, N) = ∫_{u ∈ U_N(t)} P_T(u) du / S_N(t) ,

where S_N(t) is the area of U_N(t). (When S_N(t) is not finite, we can
P̄_T1(t) < P̄_T'1(t)

at some t. A first-order uniformly efficient test T is said to be

P̄_T2(t) ≥ P̄_T'2(t)

at all t compared with any other first-order efficient test T'. A
call

lim_{N→∞} N {P̄(t, N) − P̄_T(t, N)}  (6.7)
inf_{u ∈ U_N(t)} P_T(u) .
However, the result is the same up to the third order, provided D is sufficiently smooth. It is also possible to define U_N(t) by using a distance function other than the Riemannian geodesic distance. (Note that all the α-distances are asymptotically equivalent to the Riemannian distance, since t/√N is infinitesimally small for large N.) If one would like to emphasize the power in a specific direction, one needs to use an anisotropic distance.
against
The other is the one-sided test, which tests
against
The latter can be considered to be the case with D = [−∞, u_0]. (The
only for 1 in the present scalar parameter case. The point whose u-coordinate is

u_t = u_0 + t/√(Ng)  (6.12)

is separated from u_0 by a geodesic distance |t| N^{−1/2} + O(N^{−1}). We
case.
In order to calculate the i-th order powers P_Ti(t), we use the Edgeworth expansion of p(ũ; u_t), where ũ is the coordinate of the

w̃*_t = w̃_t + C(û)/(2√N) ,  (6.14)

P_T(u_t) = 1 − ∫_{u_−}^{u_+} p(û; u_t) dû ,

where R̄ ∩ M is the interval [u_−, u_+]. We put u_+ = ∞ in the one-sided case. The transformation of the variable û to ũ*_t is given by ũ_t = √N(û − u_t) and by (6.14), so that ũ*_t satisfies

ũ*_t = ũ*_0 − t/√g .  (6.15)
The interval R̄ ∩ M is transformed into the interval [ũ*_{t−}, ũ*_{t+}] in the ũ*_t-coordinate by (6.14). The same interval is expressed as

P_T(t) = 1 − ∫_{ũ*_{t−}}^{ũ*_{t+}} p(ũ*_t; u_t) dũ*_t .  (6.17)

= 1 − α  (6.18)

(d/dt) ∫_{ũ*_{t−}}^{ũ*_{t+}} p(ũ*_t; u_t) dũ*_t |_{t=0} = 0 ,  (6.19)
The following theorem holds from the fact that ḡ ≤ g, where the equality holds when and only when g_aκ(u_t) = 0 except for terms of order N^{−1/2}:

g_aκ(u_0) = 0 ,

and is asymptotically orthogonal at u_t,

g_aκ(u_t) = O(N^{−1/2}) ,

where |u_t − u_0| is of order N^{−1/2}. Let us define a tensor at u_0
This is the only difference from (5.18). We show the result in the general multi-dimensional case for later use. Let e^a be a unit vector normal to ∂D at u_0 ∈ ∂D,

g_ab(u_0) e^a e^b = 1 .

Then the point u_t defined by

u_t^a = u_0^a + (t/√N) e^a

is separated from D by a geodesic distance t/√N. In the scalar parameter case, e^a is simply equal to 1/√g, because all the quantities are one-dimensional.
We omit the proof (see Amari and Kumon [1982]). When Q_abκ = 0, the new term B_N vanishes. If we replace the term tQ_abκ e^b by g_aκ(u_t) and put all the other Q_abκ's equal to zero, we get (5.18) from (6.25). It should be remarked that the second-order term (i.e., the term of order N^{−1/2}) of p(ũ*_t; u_t) does not depend on the ancillary family A or the test T. It is common to all the efficient tests. The third-order term (the term of order N^{−1}) depends on A or test T only through the two geometric quantities Q_abκ and H^(m)_{κλa}, which represent the asymptotic angle g_aκ(u_t) between ∂R and M and the mixture curvature H^(m)_{κλa} of the boundary ∂R of the critical region.
In order to calculate the second- and third-order powers, it is necessary to evaluate u_+ and u_− up to the third-order terms. We expand them as

√g u_+ = u_1(α) + δ N^{−1/2} + ε N^{−1} + O(N^{−3/2})

in the one-sided case, where u_− = −∞, and

√g u_+ = u_2(α) + δ_+ N^{−1/2} + ε_+ N^{−1} + O(N^{−3/2}) ,
√g u_− = − u_2(α) + δ_− N^{−1/2} + ε_− N^{−1} + O(N^{−3/2})

in the two-sided case, and δ and ε are derived therefrom. The power is decomposed into the sum of two terms. One term does not depend on Q_abκ and H^(m)_{κλa}, so that it is common to all the efficient tests. We denote it by P_i(t, α), where i = 1 for the one-sided case and i = 2 for the two-sided case.
It will soon be proved that this is the third-order envelope function
The other depends on them, representing how the power is
affected by the geometric properties of the critical region which
characterizes the test. We show here only the results, asking the
where i = 1 for the one-sided case and i = 2 for the two-sided case, and

2t n{u_1(α) − t} ,
t [ n{u_2(α) − t} − n{u_2(α) + t} ] ,

J_1(t, α) = 1 − t/{2u_1(α)} ,  (6.26)
section.
parameter case
We compare the behaviors of widely used tests by their
third-order power-loss functions or deficiencies. The third-order
t-efficient test is also given explicitly.
The m.l.e. test. The test statistic of the m.l.e. test is the
of the m.l.e., if the level condition is the same. Hence, all the
given by
i = 1, 2 .  (6.32)
Fig. 6.6
Fig. 6.7
P″_T3(0) is largest in the two-sided case). Obviously, the locally most powerful test is third-order locally most powerful.
given by

λ(η) = B_a^i(u_0) {η_i − η_i(u_0)} = c .

This is linear in η, so that A(u) is mixture-flat.
(6.35)
larger than that of the locally most powerful test at all t in the two-sided case.
We can now compare the third-order characteristics of various first-order efficient tests. It should be noted that the usual χ² procedure for various tests does not guarantee the level condition up to the order of N^{−1}. The performances of tests we are studying here are those which are adjusted by introducing the terms δ and ε such that the level and unbiasedness conditions are correct up to N^{−1}. Fig. 6.6 shows the third-order power-loss functions of various tests in the one-sided case, where α = 0.05. Fig. 6.7 shows the two-sided case. It is seen that the efficient-score test or locally most
J_2(t, α) = 1 − t / tanh{u_2(α) t} .  (6.36)

efficient test is given by the t-l.r. test in the one-sided case and

J_i(t', α) = 1 − t'/{2u_i(α)} ,

and we have ε_i = ε_i(t') by solving this.
Table 6.1. Maximal power loss ΔP(T)/γ²  (α = 0.05)

             optimal   l.r.    m.l.e.   locally most    eff. score
             test      test    test     powerful test   test
one-sided    0.07      0.12    0.24     0.57            0.57
two-sided    0.07      0.12    0.40     0.54            0.78
Fig. 6.8 (α = 0.05; N = 5, N → ∞)
Fig. 6.9 (two-sided m.l.e. test; α = 0.05)
This shows that the third-order theory can predict the qualitative behavior of the m.l.e. test even when N = 5. This is true in the case of other tests. Hence, so far as the Fisher circle model is concerned, the third-order theory is useful to know the behaviors of tests for N = 5 ∼ 20, and the first-order theory is sufficient for N > 20.
R = ∪_r R(r)

be the union of the critical regions of all the B(r)'s. This is the
(unconditional) critical region of the conditional test. It is characterized by the fact that not only the total R but each component R(r) satisfies the level condition (and the unbiasedness condition in the two-sided case). The total (unconditional) behaviors of the conditional test are analyzed by the geometrical features of this R.
The problem for the conditional test is that there does not always exist an exact ancillary statistic. However, there always exists a maximal asymptotically ancillary statistic. In fact, the statistic v̂ in the decomposition

x̄ = η(û, v̂) ,
gd = °d·
We can define the asymptotic conditional test based on the asymptotically ancillary v̂ or ṽ instead of the exact ancillary r̃. The asymptotic ancillarity and asymptotic sufficiency will be studied in detail in Chapter 7, so that we briefly describe here the characteristics of the asymptotic conditional test (see Kumon and Amari, 1983).
In the m.l.e. decomposition x̄ = η(û, v̂), let the acceptance region R̄(v) conditioned on v be

R̄(v) = { ũ*_0 < d(v) } ,

with

r̃ = H^(e)_{abκ} ṽ^κ ,

δ̄(v) = − (1/2) g^{−3/2} { Γ^(m)_abc + (1/3) K_abc } u_1^2(α) ,  (6.40)

ε̄(v) = (1/2) g^{−1} u_1(α) r̃ .  (6.41)
It is noteworthy that d(v) depends on v only through the statistic

r̃ = H^(e)_{abκ} ṽ^κ = N^{1/2} { ∂_a ∂_b ℓ(x̄, u_0) + g_ab(u_0) } ,

the difference between the expected and observed informations. In fact, it will be shown that r̃ is the asymptotic ancillary carrying the whole information. The critical region R is obtained in the above. By analyzing its shape we have the following theorem.
(6.42)
The size L_I(u_0) of I is the expectation of the length of the confidence interval I(x̄),
Fig. 6.10
U_t = { u_t = u_0 + (t/√N) e | g_ab(u_0) e^a e^b = 1 } .
For a test T, we can construct an associated ancillary family A such that the critical region R is composed of A(u)'s where u ∈ R_M, i.e.,

R = { A(u) | u ∈ R_M = R ∩ M } .

The statistic x̄ is decomposed into (û, v̂) by x̄ = η(û, v̂) by this ancillary family, and we define

where R̄_{N,t,e} is the domain of integration obtained from R_M by

power into
P_T(t) = P_T1(t) + (1/√N) P_T2(t) + (1/N) P_T3(t) + O(N^{−3/2}) ,

the first-, second- and third-order efficiencies are defined by using these i-th order functions in the same manner as in the scalar parameter case. The level condition is written as P_T1(0) = α + O(N^{−3/2}). The unbiasedness condition P′_T1(0) = 0 or more strongly
Since ḡ_ab ≤ g_ab always holds, with equality when and only when g_aκ(u_0) = 0, the distribution of ũ*_{t,e} is mostly concentrated around the origin when the ancillary family is (asymptotically) orthogonal, i.e., when g_aκ(u_0) = 0. Hence, a first-order efficient test should have an asymptotically orthogonal ancillary family. We next search for the section R_M = R ∩ M of the critical region by M. The average power P̄_T(t) depends on the shape of R_M, so that we obtain the R_M which maximizes P̄_T(t) uniformly in t. (In the scalar parameter case, R_M is an interval and is automatically determined from the level and
    = ∫_{R_{M,t,e}} p(u*_{t,e} ; u_{t,e}) du*_{t,e} ,
where S_{m-1}(c_0) is an (m-1)-dimensional sphere with radius c_0. The
above expression does not depend on a specific direction e.

    Z_m(c, t) = c^{m-1} S_{m-1} n(u_c - te ; g_{ab}) ,

where u_c is any point satisfying g_{ab}(u_0^a - u_c^a)(u_0^b - u_c^b) = c²/N and
S_{m-1} is the area of the (m-1)-dimensional unit sphere (m > 1). This
does not depend on g_{ab}, so that we may put g_{ab} = δ_{ab}. It can be
expressed as

    Z_m(c, t) = (2π)^{-m/2} c^{m-1} S_{m-2} exp{ -(1/2)(t² + c²) } A_m(c, t) ,   (6.59)

    A_m(c, t) = ∫_{-1}^{1} (1 - z²)^{(m-3)/2} exp{-ctz} dz .
    I(t, m) ,                                                            (6.61)

    J(t, m) = 1 - Z_{m+4}(c_0, t) / { 2(m+2) Z_{m+2}(c_0, t) } .         (6.62)

We then have the following theorem, whose proof is omitted (Kumon and
Amari [1985]).
    (6.63)

where

    g^{abcd} = g^{ab} g^{cd} + g^{ac} g^{bd} + g^{ad} g^{bc} ,
through the scalar curvature γ², i.e., ΔP_T(t)/γ² does not depend on
the model M. Define the modified coordinate

    û'^a = û^a + g^{ab} Q_{bcκ} (û^c - u_0^c) v̂^κ .                     (6.68)

Then û' = const. gives a modified ancillary family with the
coordinate system v̂' = v̂ in each A(û'). The observed point x̄ in S
can be represented by the new coordinate system as

    x̄ = η'(û', v̂') = η(û + gQ(û - u_0)v̂, v̂) ,
where η' is the coordinate transformation from (u', v') to η. Hence,
the term g'_{aκ}(u) of the new ancillary family at v = 0 is given by

    g'_{aκ}(u) = (∂_a η'^i)(∂_κ η'^j) g_{ij}
               = B_a^i(u) { B_{κi}(u) + B_{bi} g^{bc} Q_{cdκ} (u^d - u_0^d) }
               = Q_{abκ} (u^b - u_0^b) .

This shows that the new ancillary family has the specified Q_{abκ}.
Hence, any scalar function of û', e.g., g_{ab}(u_0)(û'^a - u_0^a)(û'^b - u_0^b),
gives the test statistic of a test which is efficient and has the
statistic t̂_{bc}.
represented as

    D = { u | f(u) ≤ c }

by using a smooth function f(u). The tangent space of the boundary ∂D
is spanned by ∂_s = ∂/∂z^s, s = 2, ..., m.

The problem is now restated: test the hypothesis H_0 : u^1 ≤ u_0^1
against H_1 : u^1 > u_0^1 in the curved exponential family {q(x, u)},
u = (u^1, z^s).
Here, we are interested in the value of the first
parameter u^1, and the other parameters z^s (s = 2, ..., m) are of no
interest. Such parameters are called nuisance parameters. Hence,
the problem is a one-sided test of the scalar parameter u^1 in the
presence of the nuisance parameters z.
We denote by (u_t^1, z) the point which is separated from D by a
geodesic distance t/√N and whose nuisance parameter is z = (z^s). We
then have

    (6.71)

where g_{11}(u_0, z) is the metric along the u^1-axis at (u_0, z). For a
test T, we can associate an ancillary family A. Then, the statistic
x̄ is decomposed into (û^1, ẑ, v̂) by x̄ = η(û^1, ẑ, v̂). In this case,
the problem reduces to testing u^1 ≤ u_0^1 under the model {q(x, u^1; ẑ)},
where the estimate ẑ is substituted for the nuisance parameter z.

By integrating the Edgeworth expansion of p(u_t*, z*; u_t, z) with
respect to z*, where u_t* is defined as before, we have the Edgeworth
expansion of p(u_t*; u_t, z). This function is used to calculate the
power P_T(t, z) of test T at (u_t, z). It is easy to show that a test
T is efficient when and only when the associated ancillary family is
asymptotically orthogonal, satisfying g_{1κ}(u_0, z) = 0 for all z. The
domain R_M of an efficient test T is of the form R_M = { û^1 | û^1 >
u_+(r̂*) }, where

    √g_{11}(u_0, ẑ) u_+ = u^1(α) + δ(ẑ) N^{-1/2} + ε(ẑ) N^{-1} + O(N^{-3/2}) .

Since the problem is reduced to the one-sided test of a scalar
parameter except for the term ẑ, we can use the same techniques
as we used in Sections 6.2 and 6.3. We show only the results (see
Kumon and Amari [1985]).
family satisfies

    H^{(m)}_{11κ} = 0 ,                                                  (6.72)

    γ²(u_0, z) = H^{(e)}_{11κ} H^{(e)}_{11λ} g^{11} g^{11} g^{κλ}
depends on the nuisance parameter z. This does not imply that the
third-order power function is the same in both cases. Although
the power loss function is the same, the third-order envelope
function is not the same. We can measure the effect of the intervention
of the nuisance parameter by comparing the third-order power P_{T3}(t,
z) of an efficient test T when the value of z is known with that when
z is unknown. As is expected, the distribution p(u_t*; u_t, z) is a
little more dispersed when z is unknown than when z is known. It is
given by integrating p(u_t*, z*; u_t, z) with respect to z* when z is
unknown. This additional dispersion arises from the squares
of the two curvatures of ∂D defined in the following. One is the
square of the mixture curvature of ∂D,

    (H^m_{U,Z})² = H^{(m)}_{pq1} H^{(m)}_{rs1} g^{11} g^{pr} g^{qs} ,

and the other is the square of the twister component of the
exponential curvature of ∂D,

    (H^e_{U,Z,V})² = H^{(e)}_{p1κ} H^{(e)}_{q1λ} g^{pq} g^{11} g^{κλ} .

These quantities are defined and explained in Chapter 8 in more detail,
when we study statistical inference in the presence of nuisance
parameters.
Theorem 6.20. The third-order power loss induced when the value
of z is unknown is given by

    ΔP(t) = J_1(t, α) { (H^m_{U,Z})² + (H^e_{U,Z,V})² } .                (6.73)
6.7. Notes
The higher-order asymptotic theory of statistical tests has been
developed by Pfanzagl [1973], Chibisov [1973 a], Pfanzagl and
Wefelmeyer [1978], Pfanzagl [1980], Bickel et al. [1981], etc.
However, the theory had been far from complete compared with the
higher-order asymptotic theory of estimation. This is partly because
the structure of the associated ancillary family is much more
complicated in the case of tests than in the case of estimation.
Efron (1975) pointed out the importance of the statistical curvature
in the problem of tests.
The geometrical theory of higher-order asymptotics of
statistical tests and interval estimators was fully developed in
Kumon and Amari (1983) in the scalar parameter case, where both
one-sided and two-sided tests were equally analyzed. There are
widely used efficient tests such as the likelihood ratio test,
efficient score test (Rao test), Wald test, locally most powerful
test, etc. They are also second-order efficient. However, their
third-order characteristics were not well known. Their third-order
power loss (deficiency) functions were explicitly given by Amari
(1983a) based on Kumon and Amari (1983). The power loss functions
are universal in the sense that they depend on the statistical model M
only through its statistical curvature γ². The results elucidate the
characteristics of these widely used tests. For example, the Rao test is
Obviously, for the statistic x itself, g_{ab}(x) is the ordinary Fisher
information matrix. A statistic t is said to be sufficient when
g_{ab}(t) = g_{ab}(x) holds, i.e., when t carries the same amount of
information as the original x does. On the other hand, a statistic t
When there are two statistics t(x) and s(x), the amount g_{ab}(t, s)
of the information which t and s together carry is defined
similarly by using the joint probability density of (t, s). When t
and s are independent, it is easy to prove the additivity of
information

    (7.3)

    (7.5)

which is indeed the amount of the loss of information caused by summarizing
the original data x into t and keeping only t. The following relation is
of frequent use when we calculate the information loss by conditioning on t.
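As a numerical illustration of these definitions (not from the text; the model and statistics are chosen for convenience), consider N observations from N(θ, 1): the sufficient statistic x̄ carries the full information N, while the lossy summary x_1 alone carries only 1. The information of a statistic can be estimated as the variance of the score of its marginal density:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N, trials = 0.5, 10, 200_000

x = rng.normal(theta, 1.0, size=(trials, N))

# Score of the marginal density of t = x_bar ~ N(theta, 1/N):
#   d/dtheta log p(t; theta) = N * (t - theta)
score_mean = N * (x.mean(axis=1) - theta)

# Score of the marginal density of t = x_1 ~ N(theta, 1)
score_x1 = x[:, 0] - theta

info_mean = score_mean.var()   # close to N: x_bar is sufficient
info_x1 = score_x1.var()       # close to 1: x_1 loses information
print(info_mean, info_x1)
```

The additivity (7.3) can be checked the same way: for independent halves of the sample, the estimated informations of the two half-sample means add up to N.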
There exist in general neither non-trivial ancillary statistics
nor non-trivial sufficient statistics. However, we can always
construct asymptotic ancillary statistics and asymptotic sufficient
statistics in the following sense. When N independent (vector)
observations xl' ... ,x N are available, they together carry the amount
N
gab(X ) = Ngab(X)
of Fisher information, where XN denotes the set of N observations.
In a curved exponential family M, the arithmetic mean
x= ~ LXi
is a sufficient statistic, retaining the whole information
gab(~ = Ngab(X) .
Let t(x̄) be a statistic which is a function of x̄. When the loss of
information by summarizing x̄ into t is

    Δg_{ab}(t) = O(N^{-q+1}) ,

t is said to be q-th order asymptotically sufficient, retaining the
information of order N in g_{ab}(t).
We show examples of asymptotically sufficient and asymptotically
ancillary statistics.

Let û(x̄) be a consistent estimator and let A be the associated
ancillary family. By introducing a coordinate system v = (v^κ) in
each A(u), the sufficient statistic x̄ is decomposed into two
statistics (û, v̂) by

    x̄ = η(û, v̂) .

curvature of A,

    Δg_{ab}(û) = (H^e_M)²_{ab} + (1/2)(H^m_A)²_{ab} + O(N^{-1}) ,        (7.11)

    ḡ_{ab} = g_{ab} - g_{aκ} g_{bλ} g^{κλ} = g_{ab} - Δg_{1ab} .
This shows that the first-order term of the covariance of a
consistent estimator û is given by the first-order term of its amount
of information loss, and vice versa. Note the difference: the covariance of
an efficient û* also depends on the bias correction, but the information
matrix does not,

    g_{ab}(û) = g_{ab}(û*) .

It is often easier to calculate the loss of information Δg_{ab} than to
calculate the covariance. In this case the third-order term of the
covariance is obtained from Δg_{ab} by adding the term (Γ^m)^{2ab}/2.
    û'^a = û^a + c^a/(2√N) ,

where h_{ab}, h_κ, etc. are the Hermite polynomials in ū, v̄, etc.

    ∂_a ( B_b^i B_κ^j g_{ij} ) = 0 .

Hence, we can show

    √N E[ū^a | v̂] = -(1/2) { H^{(e)a}_{κλ} (v̄^κ v̄^λ - g^{κλ}) + c^a } + O(N^{-1}) ,

    Cov[ū^a, ū^b | v̂] = g^{ab} + N^{-1/2} H^{(e)ab}_κ v̄^κ + O(N^{-1}) .
    Δg_{ab}(û, t̂, ŝ) = O(N^{-1})

by the use of the expansion (7.10). This implies that, among the

This shows that t̂_{ab} and ŝ_a keep all the conditional
information of order 1 which the conditioning statistic û loses.
    f(x̄) = f_0(û) + N^{-1/2} f_{1κ}(û) v̂^κ + o_p(N^{-1}) .

When f(x̄) is first-order asymptotically ancillary, f_0(û) cannot
depend on û but is a constant. Hence any first-order ancillary f(x̄)
represents the k statistics c_1, ..., c_k. Indeed, they are given by

because of P_C e_i = e_i .
We next define the subspace T_A^{(1)} of T_0(A) spanned by the
m(m+1)/2 exponential curvature vectors

    e_{ab} = H^{(e)κ}_{ab} ∂_κ ,  a, b = 1, ..., m .                     (7.19)

by

    (7.21)

where

together with û, i.e., (û, ĉ) are second-order sufficient, when and
only when

    v̂ = P_C v̂ + (δ - P_C) v̂ = v̂_C + (v̂ - v̂_C) ,

where δ = (δ^κ_λ) is the identity operator, we have

when and only when T_A(C) ⊃ T_A^{(1)}, the latter half of the theorem is
proved.
Now we show how the conditional inference divides the entire
density of x̄. From (7.23).

Theorem 7.6. The covariance of the conditional Fisher

by

    q(y, z; u) = exp[ -(1/2)(1-u²)^{-1}(y² + z² - 2uyz) - (1/2) log(1-u²) ] .
The family M = {q(y, z; u)} can be regarded as a (3, 1)-curved
exponential family imbedded in an exponential family S = {p(x, θ)},
with

    x_1 = y² ,  x_2 = z² ,  x_3 = yz ,

and the imbedding is given by

    θ_1(u) = -(1/2)(1-u²)^{-1} ,  θ_2(u) = -(1/2)(1-u²)^{-1} ,
    θ_3(u) = u(1-u²)^{-1} .
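The imbedding can be verified mechanically (an illustrative sketch, not part of the text): the log-density of the correlation model equals θ(u)·x minus the normalizing term, with x = (y², z², yz):

```python
import numpy as np

# Check numerically that the correlation model factors as an exponential
# family in x = (y^2, z^2, yz): log q = th1*x1 + th2*x2 + th3*x3 - psi(u).
def log_q(y, z, u):
    return -0.5 * (y**2 + z**2 - 2*u*y*z) / (1 - u**2) - 0.5 * np.log(1 - u**2)

def theta(u):
    return np.array([-0.5/(1-u**2), -0.5/(1-u**2), u/(1-u**2)])

u = 0.3
rng = np.random.default_rng(1)
y, z = rng.normal(size=2)
x = np.array([y**2, z**2, y*z])
psi = 0.5 * np.log(1 - u**2)   # normalizing term (constant 2*pi factor dropped)
assert np.isclose(log_q(y, z, u), theta(u) @ x - psi)
```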
We hereafter write θ^i instead of θ_i to avoid confusion. The metric
tensor of S is

    g_{ij} = D^{-2} ( 8θ_2²          8θ_1θ_2 - 2D   -4θ_2θ_3
                      8θ_1θ_2 - 2D   8θ_1²          -4θ_1θ_3
                      -4θ_2θ_3       -4θ_1θ_3       2θ_3² + D ) ,

where D = 4θ_1θ_2 - θ_3². On M, i.e., at θ = θ(u), this becomes

    g_{ij}(u) = ( 2     2u²   2u
                  2u²   2     2u
                  2u    2u    u² + 1 ) ,

with inverse

    g^{ij}(u) = (1/2)(1-u²)^{-2} ( 1     u²    -2u
                                   u²    1     -2u
                                   -2u   -2u   2(1+u²) ) .
The tangent vector ∂_a of M is given by

    ∂_a = B_a^i ∂_i = B_{ai} ∂^i ,

where the suffix a stands only for 1, and

    B_a^i = θ̇^i(u) = (1 - u²)^{-2} [-u, -u, 1 + u²] ,
    B_{ai} = η̇_i(u) = [0, 0, 1] ,

the dot denoting the derivative with respect to u. The Fisher
information g_{ab} of M is

    g_{ab} = ⟨∂_a, ∂_b⟩ = B_a^i B_{bi} = (1 + u²)(1 - u²)^{-2} .
The exponential curvature of M is

    H^{(e)i}_{ab} = ∂_a B_b^i - Γ^{(e)}_{abd} B_c^i g^{cd}
                  = -(1 + u²)^{-1} (1 - u²)^{-1} [1, 1, 0] ,

or, with the orthogonal basis vectors

    B_2^i ∝ [u², 1 + u², 2u] ,   B_3^i ∝ [1 - u², -(1 - u²), 0] ,

we have

    H^{(e)}_{11κ} = -2(1 + u²)^{-1/2} (1 - u²)^{-1} [1, 0] .
    η_i(u, v) = η_i(u) + v^κ B_{κi}(u)
              = [1, 1, u] + v²(1 + u²)^{-1/2} [1 + u², 1 + u², 2u]
                + v³(1 - u²)^{-1/2} [1 - u², -1 + u², 0] .
For N independent observations, the observed point η̂_i is given by

where

    x̄ = (1/2)(x̄_1 + x̄_2) ,   v̂ = N^{1/2}(x̄_1 + x̄_2 - 2) .

The expected conditional informations of x̄_1, x̄_2, v̂² and v̂³ are
Since

    ψ(θ) = (1/2)(θ_1² + θ_2² + θ_3²) ,

the metric tensor is given by

    g_{ij} = δ_{ij} ,

and T_{ijk} = 0. Hence, all the α-connections are identical, because
in S by

    η_1(u) = cos u ,  η_2(u) = sin u ,  η_3(u) = u ,

    q(x, u) = c exp[ -(1/2){ (x_1 - cos u)² + (x_2 - sin u)² + (x_3 - u)² } ] .

The scalar parameter u may take on all real values -∞ < u < ∞, or it
may take only 0 ≤ u < 2π (mod 2π). The model M forms a spiral curve
in S. The tangent vector of η(u) is given by

    ∂_a = (η̇_i) = (B_{ai}) = [-sin u, cos u, 1] ,

with

    g_{ab} = ⟨∂_a, ∂_b⟩ = 2 .
The curvature direction is given by

    η̈_i = [-cos u, -sin u, 0] ,

which is orthogonal to ∂_a in the present case. We define a basis
{∂_κ} (κ = 2, 3) in the subspace T_u(A) orthogonal to T_u(M) by

    ∂_2 = [-cos u, -sin u, 0] ,
    ∂_3 = (1/√2)[sin u, -cos u, 1] ,

such that g_{κλ} = ⟨∂_κ, ∂_λ⟩ = δ_{κλ} and ∂_2 is the curvature direction.
We have

    (H_M)²_{ab} = H_{acκ} H_{bdλ} g^{κλ} g^{cd} = 1/2 .
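A quick numerical check of the spiral example (illustrative only): the tangent [-sin u, cos u, 1] has squared length g_{ab} = 2, the curvature direction is orthogonal to it, and the basis ∂_2, ∂_3 is orthonormal:

```python
import numpy as np

u = 0.7  # arbitrary parameter value
tangent = np.array([-np.sin(u), np.cos(u), 1.0])    # d_a = eta'(u)
curv = np.array([-np.cos(u), -np.sin(u), 0.0])      # eta''(u), curvature direction
d2 = curv                                           # basis vector in T_u(A)
d3 = np.array([np.sin(u), -np.cos(u), 1.0]) / np.sqrt(2)

assert np.isclose(tangent @ tangent, 2.0)   # g_ab = 2
assert np.isclose(tangent @ d2, 0.0)        # curvature direction orthogonal to M
assert np.isclose(tangent @ d3, 0.0)
assert np.isclose(d2 @ d2, 1.0) and np.isclose(d3 @ d3, 1.0)
assert np.isclose(d2 @ d3, 0.0)

# Squared curvature (H_M)^2 = (H_112)^2 * g^{11}, with H_112 = <eta'', d2> = 1
H112 = curv @ d2
print(H112**2 * (1 / 2.0))
```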
    x̄_3 = û + (1/√2) v̂³ ,

and v̂³ as

    Δg_{ab}(v̂ | û) = Δg_{ab}(v̂² | û) ,
    Δg_{ab}(v̂³ | û) = O(N^{-1}) .

This shows that the curvature-direction ancillary v̂² keeps all the
conditional information of order 1 while v̂³ keeps none. The
conditional Fisher informations conditioned on v̂, v̂², v̂³ are
where

    η_i(u, 0) = (1/3)[1 - u, 1 + u, 2 - u]

and

    v̄¹ = x̄_1 + x̄_2 - 1/3 ,
    v̄² = ⋯ - 1/2 ,

if we choose B_{κi} suitably. Each of v̄¹ and v̄² is exactly ancillary,
while v̂ = (v̂¹, v̂²) is only asymptotically ancillary. The curvature-
direction component of v̂ is given by

    t̂ = -3(2 + û²) v̄¹ + 8û v̄² .

This shows that we should condition on this asymptotic ancillary
rather than on the exact ancillary v̄¹ or v̄². The effect of
conditioning on t̂ is equivalent to conditioning on v̄ = (v̄¹, v̄²).
    x̄ = (1/2N) ( Σ_j x_{(1)j} + Σ_j x_{(2)j} )

for all the 2N observations. It is not necessary to retain all of
them; it suffices to retain only the sufficient statistics

    x̄_i = (1/N) Σ_j x_{(i)j} ,  i = 1, 2 ,

of the two samples, because x̄ is given by

    x̄ = (1/2)(x̄_1 + x̄_2) .

Indeed there is no loss of information even if we summarize
we have

    x̄_i = x̄_i^0 + N^{-1/2} B_{κi} v̄_0^κ - N^{-1} ∂_a B_{κi} ū^a v̄_0^κ + O_p(N^{-3/2}) ,

    Δg_{ab}(x̄^0) = O(N^{-1}) ,

proving that x̄^0 is second-order sufficient.

the pooled samples are calculated from x̄^0. We give its explicit form
where

    x̄ = η(û, v̂) ,

with

    x̄ = (1/2)(x̄_1 + x̄_2) ,

so that

    η(û, v̂) = (1/2){ η(û_1, v̂_1) + η(û_2, v̂_2) } .                      (7.29)
Using the expansion

    η_i(û, v̂) = η_i(u_0) + B_{ai} ū^a + B_{κi} v̄^κ
                + (1/2) ∂_a B_{bi} ū^a ū^b + ∂_a B_{κi} ū^a v̄^κ + ⋯

and similar expansions of η(û_1, v̂_1) and η(û_2, v̂_2), we have the
first-order equation

    ū^a = (1/2) Γ^{(m)a}_{cd} d̄^c d̄^d - (1/2) g^{ab} ( Γ̄_{1bc} - Γ̄_{2bc} ) d̄^c + O(N^{-1/2}) ,

where d̄ = √N d, t̄ = √N t. This gives (7.27). In the above solution
process, the same ū^a and t̄_{ab} are obtained if we replace B_{κi} v̄_1^κ and
where

    (7.31)

curvature.

We can further define the p-th curvature of M in a similar
manner. Let

    ∇^{(e)}_{b_1} ⋯ ∇^{(e)}_{b_p} ∂_a ,
Proof. From

    x̄_i = η_i(û, 0) + B_{κi}(û) v̂^κ ,

we have
Here, the derivative ∇_b ∂_a should be restricted to T_u(M) by the use of
the induced covariant derivative ∇ of M. We then have the
tensorial rule of covariant differentiation

    ∇_b B_a^i = ∂_b B_a^i + Γ^i_{jk} B_a^k B_b^j - Γ^c_{ba} B_c^i .

We may use the covariant derivatives ∇_{b_1} ⋯ ∇_{b_p} B_a^i, which are tensors
obtained in this manner, in defining the higher-order curvature
tensors. The components orthogonal to T_u(M), T_A^{(1)}, ..., T_A^{(p-1)},
define the higher-order curvature tensor H^κ_{a b_1 ⋯ b_p}.
7.5. Notes
Fisher information is a fundamental quantity
representing the potential quality of a statistic t(x) which is to be
used for statistical inference. It represents how well the statistical
data are summarized in it (see Efron, 1982). We have shown that the
efficient estimator from the point of view of the mean square error
or of the covariance. However, it is not necessarily third-order
efficient from the mean-square-error point of view. This is because
the mean square error is not an invariant quantity, except for the
first-order term, which is given by the inverse of the first-order
term of the Fisher information of û, while the Fisher information is
a tensor and hence is invariant. The mean square error includes
terms depending on the manner of parametrization. However, we can
always construct a third-order efficient estimator from the m.l.e.
û by making a one-step bias correction, which depends on the manner of
parametrization. Moreover, we can obtain the third-order term of
the mean square error of an efficient estimator from the
second-order term of the amount of information contained in it. In
this sense, information is the more fundamental quantity.
The optimality of the m.l.e. does not hold from the point of
view of the third-order term of the information loss. The statistical
implications of the third- or higher-order terms of the information loss
are not clear (cf. Rao et al., 1982). We have decomposed the amount
of information which x̄ carries into the set of statistics {û, t̂^{(1)},
t̂^{(2)}, ...}, where t̂^{(p)} carries conditionally all the information of
order N^{1-p} and its magnitude is given by the p-th order exponential
curvature of the model M. Although

    t̂_{ab} = H^{(e)}_{abκ} v̂^κ = ∂_a ∂_b ℓ(x̄, û) + g_{ab}(û)

the conditionality principle, etc. See, e.g., Basu (1975), Cox and
Hinkley (1973), Hinkley (1981), Efron (1982). The usefulness of the
ancillary t̂ or the observed Fisher information
Fig. 8.1
true (u, z) lies or testing the hypothesis HO that the true parameter
(u, z) is on Z(u O). We have already treated the problem of tests in
the presence of nuisance parameters in Chapter 6, so that we
concentrate here on the problem of estimation in the presence of
nuisance parameters.
The problem may take another form. It sometimes takes the form
of estimating the values of m independent functions f^i(w), i = 1, ...,
m, where w is an (m + k)-dimensional parameter in an (n, m + k)-curved
exponential family M = {q(x, w)}. Indeed, the set Z(u) of the
points w satisfying

and we are interested in which Z(u) the true distribution lies, but have no
interest in the relative position within Z(u). It is convenient to
does not change but z changes. They can be regarded as the z-axes.
The u-axes are composed of those points on which the z-coordinates do
not change, as shown by the dotted lines in Fig. 8.2a). Under the
transformation (8.1), the u-axes change as shown in Fig. 8.2b). However,
the Z(u)'s are invariant. We give a simple example.

Since we are interested in the mean μ and not in the variance σ², we
can put u = μ and z = σ.
The z defines a coordinate system in each Z(u), shown by a vertical
line (Fig. 8.3a). The dotted lines, on which z takes constant
values, define the u-axis. However, we may choose
Fig. 8.3

the dotted line in Fig. 8.3b). The coordinate system (u, z') is

the associated ancillary family is

    A(u) = û^{-1}(u) = { η | û(η) = u } ,
Fig. 8.4
the associated A(u)'s are such that each A(u) includes Z(u) (Fig. 8.4).
We introduce new variables v = (v K), K = m+k+l, ... , n, such that
(z, v) is a coordinate system of A(u) and that the origin v = 0 is
The tangent space T_{(u,z)}(S) of the whole space S at a point (u, z) ∈ M
is spanned by three kinds of vectors {∂_a, ∂_p, ∂_κ},

    ∂_a = B_a^i ∂_i ,  a = 1, ..., m ,
    ∂_p = B_p^i ∂_i ,  p = m+1, ..., m+k ,
    ∂_κ = B_κ^i ∂_i ,  κ = m+k+1, ..., n .

Here, the ∂_κ = ∂/∂v^κ denote the tangent directions along the coordinates
v^κ, and are tangent to A(u). The vectors ∂_p = ∂/∂z^p, which are along
the coordinates z^p, span the tangent space of Z(u). Hence, {∂_κ, ∂_p}
together span the tangent space of A(u). The vectors ∂_a = ∂/∂u^a are
along the coordinate curves u^a on which the z^p are fixed. Hence, the
tangent space of the model M is spanned by {∂_p, ∂_a} (see Fig. 8.4).
= 0 .

As will soon be shown, an estimator is efficient when and only when ∂_p,

Since we are estimating only u, the quantity g_{ab} = ⟨∂_a, ∂_b⟩ does not
represent the amount of information available in estimating u. Indeed,
this g_{ab} depends on the manner of parametrization z of the nuisance
parameter, or on the coordinate system z in each Z(u), which can be
chosen arbitrarily. By the coordinate transformation (8.1) from z to
z', the vectors ∂_a and ∂_p change into ∂'_a and ∂'_p by

    ∂_a = ∂'_a + H_a^p ∂'_p ,   ∂_p = H_p^q ∂'_q ,                       (8.2)

or

    ∂'_a = ∂_a - H'^q_a ∂_q ,   ∂'_p = H'^q_p ∂_q ,

respectively, where

    H_a^p = ∂h^p(u, z)/∂u^a ,   H_p^q = ∂h^p(u, z)/∂z^q .
Fig. 8.5
    ⟨∂'_a, ∂'_b⟩ = g_{ab} + g_{pq} H'^p_a H'^q_b - g_{pa} H'^p_b - g_{qb} H'^q_a ,
    ⟨∂'_a, ∂'_p⟩ = g_{aq} H'^q_p - g_{qr} H'^r_a H'^q_p ,
    ⟨∂'_p, ∂'_q⟩ = g_{rs} H'^r_p H'^s_q .                                (8.3)
How do we define the amount of information in the presence of
nuisance parameters? In order to answer this question, we decompose
the vectors ∂_a into two components. One is the component tangential to
Z(u), given by a linear combination of the ∂_p. The other is the component
orthogonal to Z(u). The part orthogonal to Z(u) is given by

    ∂̄_a = ∂_a - g_a^p ∂_p ,

as can easily be shown from (8.3) and (8.4). The inner products of
these orthogonalized ∂̄_a give an invariant tensor

    ḡ_{ab} = ⟨∂̄_a, ∂̄_b⟩ .

When and only when g_{ap} = ⟨∂_a, ∂_p⟩ = 0, the orthogonalized
information coincides with the Fisher information,

    ḡ_{ab} = g_{ab} .
However, in general,

    ḡ_{ab} ≤ g_{ab} ,   ḡ^{ab} ≥ g^{ab} ,

where ḡ^{ab} is the inverse of ḡ_{ab}, hold in the sense of positive
semi-definiteness.

Since the inverse matrix ḡ^{ab} of the orthogonalized information
ḡ_{ba} is the (b, a)-component of the inverse of the total Fisher
information matrix of M, ḡ^{ab} gives the asymptotic covariance of any
efficient estimator û. We show this in the following.
where (u, z, v) are the new coordinates associated with the ancillary
family A(u). When the true parameters are (u, z), we can obtain the
Edgeworth expansion of the joint distribution of

    ū = √N (û - u) ,   z̄ = √N (ẑ - z) ,   v̄ = √N v̂ .

Fisher information ḡ_{ab}. A first-order efficient estimator is
second-order efficient.
    (8.6)

are linear combinations of the ∂̄_c, that is, there exist S_{ab}^c such that

    [∂̄_a, ∂̄_b] = S_{ab}^c ∂̄_c ,

or

    ⟨[∂̄_a, ∂̄_b], ∂_p⟩ = 0 ,

where

    ∂̄_a = ∂_a - g_a^p ∂_p ,   g_a^p = g_{aq} g^{qp} .

By calculating the Lie bracket, we have

holds.
orthogonal parametrization.

    C²_{ab} = C_{aαβ} C_{bγδ} g^{αγ} g^{βδ}
            = C_{cda} C_{efb} g^{ce} g^{df} + 2 C_{cpa} C_{dqb} g^{cd} g^{pq}
              + C_{pqa} C_{rsb} g^{pr} g^{qs} + C_{κλa} C_{νμb} g^{κν} g^{λμ} + ⋯ ,

consisting of six non-negative terms. Each term is interpreted
geometrically as follows.

    C_{cda} = Γ^{(m)}_{cda} ,   C_{pqa} = H^{(m)}_{pqa} ,

so that the squares of the terms

    (Γ^m)²_{ab} = Γ^{(m)}_{cda} Γ^{(m)}_{efb} g^{ce} g^{df} ,            (8.9)

    (8.10)

    H^{(m)}_{κλa} H^{(m)}_{νμb} g^{κν} g^{λμ} ,                          (8.11)
which vanishes when ⟨∂_a, ∂_p⟩ = 0, i.e., when the nuisance parameter
is orthogonal to the parameter of interest. This implies that, when
they are not orthogonal, knowledge of the nuisance parameter carries
much information.
In many cases, the nuisance parameter z is orthogonal to the
parameter of interest even when we can utilize its knowledge z_0, so
that g_{ab} - ḡ_{ab} = 0. Then, what is the amount of information gained by
knowing the true value z_0 of the nuisance parameter in this case?
Let g_{ab}(û) be the amount of information carried by the third-order
efficient estimator û (e.g., the m.l.e.) when we have no knowledge of
the nuisance parameters, and let g'_{ab}(û) be the corresponding amount
when we know its true value z_0. (They are calculated in the next
section.) The amount Δg_{ab} of information which the knowledge of the
nuisance parameter carries can be defined by the difference

    Δg_{ab} = g'_{ab}(û) - g_{ab}(û) .                                   (8.17)

When the nuisance parameter is not orthogonal, this definition
coincides with the previous one (8.16) except for higher-order
terms. When the nuisance parameter is orthogonal, i.e., ḡ_{ab} = g_{ab} or
g_{ap} = 0, the knowledge that z = z_0 still carries some information of
order 1.
Since the amount g_{ab}(û) or g'_{ab}(û) of information included in an
efficient û is related to the covariance matrix of the bias-corrected
estimator û* by (7.11), we can calculate Δg_{ab} from the covariance
matrix of the m.l.e.'s when z_0 is known and when it is unknown.
When z_0 is known, the model reduces to M(z_0), and the ancillary
family A(u) associated with the third-order efficient estimator (the
m.l.e.) is orthogonal to M(z_0) and is mixture-flat. This A(u) cannot
include Z(u) unless Z(u) is mixture-flat. The third-order term of
the covariance of a bias-corrected efficient estimator û* in the
model M(z_0) is the sum of the three non-negative terms as given by
(5.11). The term (Γ^m)² is the same, but the ancillary directions for
the model M(z_0) are decomposed into mutually orthogonal ∂_κ and ∂_p
directions in the present case. Therefore, for indices K and L,
standing for the pairs (p, κ) and (q, λ) of indices, the square of the
exponential curvature of M(z_0) is decomposed as
    (H^e_M)²_{ab} = H^{(e)}_{acK} H^{(e)}_{bdL} g^{KL} g^{cd}
                  = (H^e_{U,V})²_{ab} + (H^e_{U,Z})²_{ab}
in the present case. These terms also appear in the case of the
unknown nuisance parameter. The square of the mixture curvature,
which is decomposed as

    H^{(m)}_{acK} H^{(m)}_{bdL} g^{KL} g^{cd}
        = (H^m_A)²_{ab} + (H^m_{U,Z})²_{ab} + 2(H^e_{U,Z,V})²_{ab} ,

vanishes by choosing a mixture-flat A when z_0 is known. When z_0 is
unknown, only the first term (H^m_A)²_{ab} vanishes for the third-order
efficient estimator, because of the restriction that A(u) includes
Z(u). Hence

    Δg_{ab} = N g_{ap} g_{bq} g^{pq} + O(1)

when the nuisance parameter is not orthogonal, and

    Δg_{ab} = (H^e_{U,Z,V})²_{ab} + (1/2)(H^m_{U,Z})²_{ab} + O(N^{-1})

when it is orthogonal.
Example 8.2. We again use the set M = {N(μ, σ²)} of normal
distributions of Example 8.1, although it is very special in the
sense that M itself is an exponential family, so that S = M and there
are no v-directions orthogonal to M. Let u = μ be the parameter of
interest, and assume first that we know the value of the nuisance
parameter z = σ, say z = z_0. The metric in this (u, z) coordinate
system was already given in Example 2.2 as

    g_{ab} = 1/σ² ,   g_{pq} = 2/σ² ,   g_{ap} = 0 .
z_0, then the model M(z_0) is a curved line, as shown by the dotted
line in Fig. 8.3b). The metric tensor in the coordinate system (u,
z), u = μ, z = μ² + σ², is also given in Example 2.2 as

    g_{ab} = (2μ² + σ²)/σ⁴ ,   g_{ap} = -μ/σ⁴ ,   g_{pq} = 1/(2σ⁴) .

The orthogonalized information is ḡ_{ab} = 1/σ². Hence, when we make
use of the knowledge z = z_0, the asymptotic variance decreases from
(ḡ_{ab})^{-1} = σ² to (g_{ab})^{-1} = σ²/(2μ² + σ²). The knowledge brings an
amount

    Δg_{ab} = 2Nμ²/σ⁴ + O(1)

of information.
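These values can be checked mechanically (an illustrative sketch; the coordinates are as in the example): transform the Fisher information of N(μ, σ²) from the orthogonal coordinates (μ, σ²) to (u, z) = (μ, μ² + σ²) with the Jacobian, and recover the orthogonalized information ḡ_{ab} = 1/σ² as a Schur complement:

```python
import numpy as np

mu, sigma2 = 0.8, 1.5
sigma4 = sigma2**2

# Fisher information of N(mu, sigma^2) in the coordinates (mu, sigma^2)
G = np.diag([1/sigma2, 1/(2*sigma4)])

# Change to (u, z) = (mu, mu^2 + sigma^2): (mu, sigma^2) = (u, z - u^2)
J = np.array([[1.0, 0.0],      # d(mu)/du,      d(mu)/dz
              [-2*mu, 1.0]])   # d(sigma^2)/du, d(sigma^2)/dz
Guz = J.T @ G @ J

assert np.isclose(Guz[0, 0], (2*mu**2 + sigma2)/sigma4)   # g_ab
assert np.isclose(Guz[0, 1], -mu/sigma4)                  # g_ap
assert np.isclose(Guz[1, 1], 1/(2*sigma4))                # g_pq

# Orthogonalized information: Schur complement g_ab - g_ap g_bq g^{pq}
g_bar = Guz[0, 0] - Guz[0, 1]**2 / Guz[1, 1]
assert np.isclose(g_bar, 1/sigma2)
```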
Consider the third case, where the parameter of interest is the

    (8.18)

evaluated by

    Δg_{ab}(t) .                                                         (8.21)
In particular, when u and z are orthogonal at the true (u, z),

    g_a^p = g_{aq} g^{qp} = 0

and

    Δḡ_{ab}(t) = Δg_{ab}(t)

holds except for higher-order terms. It is hence convenient to use
an orthogonal parametrization (u, z). When it does not exist, we
may use one which is orthogonal at least at a point (û, ẑ).
    (8.22)

    Δg_{ab}(û)                                                           (8.23)

which is in correspondence with the mean square error of a
bias-corrected efficient estimator û* given in (8.15). The term
(H^m_A)²_{ab} vanishes for the m.l.e.
We next calculate the loss of information caused by summarizing
x̄ into the pair (û, ẑ). From

    N ∂_a ℓ(x̄, u, z) = h_a(û, ẑ) + (1/2) H^{(e)}_{aκλ} v̄^κ v̄^λ - H^{(e)}_{abκ} ū^b v̄^κ ,

orthogonal at the true value, as can be seen from (8.24), {û, ẑ, t̂_{ab},
t̂_{ap}} are second-order sufficient.
When (u, z) is not orthogonal, we need to use the components in
the orthogonal directions ∂̄_a of the exponential curvature, instead of
those in the ∂_a-directions. Since the orthogonal directions are
given by

    ∂̄_a = ∂_a - g_a^p ∂_p ,

the orthogonal-direction components of the exponential-direction
components t̂ of v̂ are given by

    t̂'_{ab} = t̂_{ab} - g_a^p t̂_{bp} - g_b^q t̂_{aq} + t̂_{pq} g_a^p g_b^q ,
    t̂'_{ap} = t̂_{ap} - g_a^q t̂_{pq} ,

where g_a^p = g_{aq}(û, ẑ) g^{qp}(û, ẑ). The statistics {û, ẑ, t̂'_{ab}, t̂'_{ap}} are
second-order sufficient in the non-orthogonal case.
Now let us consider asymptotic ancillary statistics. When a
statistic t is ancillary of order q with respect to the full
parameter (u, z), it is ancillary of order q with respect to the
parameter u, because

    ḡ_{ab} = g_{ab} - g_{ap} g_{bq} g^{pq}

is of order N^{-q} when each of g_{ab}, g_{ap}, g_{pq} is of that order.

    g_{ab}(v̂) = O(N^{-1}) ,

when (û, ẑ) is efficient and the v-coordinate system satisfies g_{κλ}(u,
given by

    p(z̄ | u, z) ,

where

Hence,

    p(ū | z̄) = c exp{ -(1/2) ḡ_{ab}(u, z)(ū^a + g_a^p z̄_p)(ū^b + g_b^q z̄_q) } + O(N^{-1/2}) .

This shows that the conditional covariance of ū is ḡ^{ab}, and the
conditional expectation of ū is
z) of û.

    (8.30)

where

    s̄'_a = H^{(m)'}_{κλa} v̄^κ v̄^λ .

The estimate

    ĝ^{ab}(û, ẑ) = t̂'^{ab}

of the covariance is the (a, b)-component of the inverse of the matrix

    ( ∂_a ∂_b ℓ(x̄, û, ẑ)   ∂_a ∂_p ℓ(x̄, û, ẑ)
      ∂_b ∂_q ℓ(x̄, û, ẑ)   ∂_p ∂_q ℓ(x̄, û, ẑ) ) ,
where G^{ab}, G^{aq}, G^i_{bc}, G^i_{bp}, etc. are defined in the same manner as
in (7.30), by considering that indices take pairs (a, p) etc. in the
present case.

We next show the method of constructing a third-order efficient
estimator û when the i-th sample consists of N_i independent
observations from the distribution q(x, u, z_i), such that u is common
but the nuisance parameter takes different unknown values for each
sample.

We treat here a very simple case with two samples of the same
size for illustration. The general case is treated by the same
method, so the result is shown later.
Let x_{11}, ..., x_{1N} and x_{21}, ..., x_{2N} be N independent observations
from the distributions q(x_1, u, z_1) and q(x_2, u, z_2), respectively.
The joint distribution of x = (x_1, x_2), where x_1 is from the first
distribution and x_2 from the second, is given by

    q(x, u, z_1, z_2) = exp{ x_1·θ_1 + x_2·θ_2 - ψ(θ_1) - ψ(θ_2) } ,     (8.33)

where

    θ_1 = θ(u, z_1) ,   θ_2 = θ(u, z_2) ,

and the dot in x_1·θ_1 denotes the inner product. The distribution
(8.33) defines a (2n, m + 2k)-curved exponential family M imbedded
in the 2n-dimensional S with the natural parameter

    θ = [ θ_1 ; θ_2 ] = [ θ(u, z_1) ; θ(u, z_2) ]

and the expectation parameter

    η = [ η_1 ; η_2 ] = [ η(u, z_1) ; η(u, z_2) ] ,

where u = (u^a) is the m-dimensional parameter of interest, z = (z_1,
z_2) is the 2k-dimensional nuisance parameter, and θ_1 = (θ_1^i), θ_2 =
(θ_2^i), η_1 = (η_{1i}), η_2 = (η_{2i}) are n-dimensional vectors.
The tangent vectors of M at (u, z) are given
by ∂θ/∂u^a, ∂θ/∂z_1^p and ∂θ/∂z_2^p. They are 2n-dimensional vectors of
the forms

    ḡ_2^{ab} ( B_{2bi} - g_{2b}^p B_{2pi} ) ,

where the B_{κi}(u, z) are the vectors satisfying

    B_a^i B_{κi} = 0

and

    B_{1ci} = ∂η_i(u, z_1)/∂u^c = B_{ci}(u, z_1) ,
    B_{1pi} = ∂η_i(u, z_1)/∂z_1^p = B_{pi}(u, z_1) ,
    g_{1ab} = g_{ab}(u, z_1) ,

etc. It can easily be checked that these vectors are orthogonal to
those spanning T(M).
We treat the m.l.e. for simplicity's sake. Let (û_1, ẑ_1) and (û_2,
ẑ_2) be the m.l.e.'s from the observations x_{11}, ..., x_{1N} and x_{21}, ..., x_{2N},
respectively. Then, we can decompose the sufficient statistics x̄_i of the
respective samples.

From the first equations of (8.34) and (8.35), we have

    η_i(û_1, ẑ_1) - η_i(û, ẑ'_1) = B_{1κi} v̂_1^κ - B'_{1κi} v̂'^κ - ḡ_1^{ab} B'_{1bi} û_{0a} ,   (8.36)

where B_{1κi} etc. are evaluated at (û_1, ẑ_1) and B'_{1κi} etc. are
evaluated at (û, ẑ'_1). The left-hand side of the above equation is
expanded as

    B_{1ai}(û_1^a - û^a) + g_{1ap}(ẑ_1^p - ẑ'^p_1) = -û_{0a} + higher-order terms .

We first derive the first-order approximation of û, ẑ'_1, ẑ'_2, û_{0a}. To
this end, we multiply both sides of (8.36) by B_1^{ai} and B_2^{ai}. Then,
we have

    (8.37)
where

    G_1{}^a{}_b = G^{ac} ḡ_{1bc} ,   G_2{}^a{}_b = G^{ac} ḡ_{2bc} ,

and G^{ac} is the inverse of G_{ac} = ḡ_{1ac} + ḡ_{2ac}. This shows that the
best estimator is obtained as the weighted mean of û_1 and û_2, using
the orthogonalized Fisher information matrices ḡ_{1ab} and ḡ_{2ab} as the
weights.
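The weighted-mean rule can be sketched numerically (the matrices below are hypothetical stand-ins for the orthogonalized informations ḡ_{1ab}, ḡ_{2ab}): û = G^{-1} Σ_i ḡ_i û_i with G = Σ_i ḡ_i, which for scalar u reduces to an inverse-variance weighted mean.

```python
import numpy as np

# Hypothetical per-sample quantities: estimates u_i and their
# orthogonalized Fisher information matrices g_i (m x m, here m = 2).
u_hats = [np.array([1.0, 0.5]), np.array([1.2, 0.3])]
g_bars = [np.array([[4.0, 0.5], [0.5, 2.0]]),
          np.array([[1.0, 0.0], [0.0, 3.0]])]

# First-order combined estimator: u = G^{-1} sum_i g_i u_i, G = sum_i g_i
G = sum(g_bars)
u_hat = np.linalg.solve(G, sum(g @ u for g, u in zip(g_bars, u_hats)))

# Sanity check: with equal informations this reduces to the plain mean
assert np.allclose(
    np.linalg.solve(g_bars[0] + g_bars[0],
                    g_bars[0] @ u_hats[0] + g_bars[0] @ u_hats[1]),
    (u_hats[0] + u_hats[1]) / 2)
print(u_hat)
```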
We can proceed further to obtain the higher-order terms of û.
Here, we assume that (u, z) is orthogonal at (u, z_1) and (u, z_2) for
simplicity's sake. When it is not orthogonal, the same result holds.

    û^a = û_1^a + ḡ_1^{ab} û_{0b} + ε_1^a ,                              (8.38)
    û^a = û_2^a + ḡ_2^{ab} û_{0b} + ε_2^a .

By multiplying both sides of (8.36) by B_{1ai} and by expanding the
various quantities at (û, ẑ_1), we have

    û^a = G_2{}^a{}_b (û_2^b + ε_2^b) + G_1{}^a{}_b (û_1^b + ε_1^b) ,    (8.39)

where

    ḡ_{1ab} ε_1^b = (1/2) Γ^{(m)}_{1bca} G_2{}^c{}_{c'} G_2{}^a{}_{a'} (û_2^{a'} - ⋯ ,

etc.
Finally, we describe the result for the general case. Let û^a be
the third-order efficient estimator derived from the second-order
sufficient statistics û_i^a, ẑ_i, and t̂_{iab} of the i-th sample, consisting
of N_i independent observations from the distribution q(x, u, z_i).
The first-order evaluation of û^a is given by

    û'^a = G'^{ab} { Σ N_i ḡ_{bc}(û_i, ẑ_i) û_i^c } ,                    (8.40)

where G'^{ab} is the inverse of

    G'_{ab} = Σ N_i ḡ_{ab}(û_i, ẑ_i) .

Let us define

    Ḡ_{iab} = N_i { ḡ_{iab} + (1/2) Γ^{(m)}_{abc} (û_i^c - û'^c) - t̂'_{iab} } .   (8.41)
8.6. Notes

Since we are interested only in the value of the parameter u in
statistical inference in the presence of the nuisance parameter z, we may
take any scale for z. In other words, we may introduce any
coordinate system z in Z(u). Hence, an allowable general parameter
transformation is of the form

    u' = k(u) ,   z' = h(z, u) ,

which does not destroy the structure of the problem, i.e., which
keeps the family of submanifolds {Z(u)} invariant. If we can choose
a parametrization (u, z) such that ∂_a and ∂_p are mutually orthogonal,

    ∂̄_a = ∂_a ,   ḡ_{ab} = g_{ab} .
Begun, J.M., Hall, W.J., Huang, W.-M. and Wellner, J.A. (1983).
Information and asymptotic efficiency in parametric-
nonparametric models. Ann. Statist., 11, 432-452
Ghosh, J.K., Sinha, B.K. and Wieand, H.S. (1980). Second order
efficiency of mle with respect to any bounded bowl-shaped loss
function. Ann. Statist., 8, 506-521
Supplements to REFERENCES
Monographs
Amari, S., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L., and Rao, C. R.
(1987). Differential Geometry in Statistical Inference. IMS Lecture Notes-
Monograph Series, vol. 10, Hayward, California: IMS, including the
following papers:
Kass, R. E., Introduction, Chap. 1, 1 - 18.
Amari, S., Differential Geometrical Theory of Statistics, Chap. 2, 19 - 94.
Barndorff-Nielsen, O. E., Differential and Integral Geometry in Statistical
Inference, Chap. 3, 95 - 162.
Lauritzen, S. L., Statistical Manifolds, Chap. 4, 163 - 216.
Rao, C. R., Differential Metrics in Probability Spaces, Chap. 5, 217 - 240.
Papers
Amari, S. (1987). Differential geometrical method in asymptotics of statistical
inference. Invited Paper, Proc. of the 1st Bernoulli Society World Congress
on Mathematical Statistics and Probability Theory (eds. Prohorov, Yu. and
Sazonov, V. V.), 2, 195-204, VNU Press
Amari, S. (1987). Differential geometry in statistical inference. Proc. of ISI, 52,
Book 2, Invited Paper, 6.1, 46th Session of the ISI, 321-338
Amari, S. (1987). Statistical curvature. Encyclopedia of Statistical Sciences (eds.
Kotz, S. and Johnson, N. L.), 8, 642-646, Wiley
Amari, S. (1987). Differential geometry of a parametric family of invertible
linear systems - Riemannian metric, dual affine connections and
divergence. Mathematical Systems Theory, 20, 53-82
Amari, S. (1989). Fisher information under restriction of Shannon information.
AISM, to appear
Amari, S. and Kumon, M. (1988). Estimation in the presence of infinitely many
nuisance parameters --- geometry of estimating functions. Annals of
Statistics, 16, 1044-1068
Amari, S. and Han, T. S. (1989). Statistical inference under multi-terminal rate
restrictions --- a differential geometrical approach. IEEE Trans. on
Information Theory, IT-35, 217-227