
Lecture Notes in Statistics
Edited by J. Berger, S. Fienberg, J. Gani,
K. Krickeberg, I. Olkin, and B. Singer

28

Shun-ichi Amari

Differential-Geometrical
Methods in Statistics

Springer-Verlag
Berlin Heidelberg New York London Paris Tokyo Hong Kong
Author
Shun-ichi Amari
University of Tokyo, Faculty of Engineering
Department of Mathematical Engineering and Information Physics
Bunkyo-ku, Tokyo 113, Japan

1st Edition 1985


Corrected 2nd Printing 1990

Mathematical Subject Classification: 62-03, 60E99, 62E99

ISBN-13: 978-0-387-96056-2 e-ISBN-13: 978-1-4612-5056-2


DOI: 10.1007/978-1-4612-5056-2

This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation,
broadcasting, reproduction on microfilms or in other ways, and storage in data banks. Duplication
of this publication or parts thereof is only permitted under the provisions of the German Copyright
Law of September 9, 1965, in its version of June 24, 1985, and a copyright fee must always be
paid. Violations fall under the prosecution act of the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1985

2847/3140-543210- Printed on acid-free paper


CONTENTS

Chapter 1. Introduction 1

PART I. GEOMETRICAL STRUCTURES OF A FAMILY OF PROBABILITY


DISTRIBUTIONS 11

Chapter 2. Differential Geometry of Statistical Models 11

2.1. Manifold of statistical model 11


2.2. Tangent space 16
2.3. Riemannian metric and Fisher information 25
2.4. Affine connection 32
2.5. Statistical α-connection 38
2.6. Curvature and torsion 43
2.7. Imbedding and submanifold 49
2.8. Family of ancillary submanifolds 54
2.9. Notes 63

Chapter 3. α-Divergence and α-Projection in Statistical Manifold 66

3.1. α-representation 66
3.2. Dual affine connections 70
3.3. α-family of distributions 73
3.4. Duality in α-flat manifolds 79
3.5. α-divergence 84
3.6. α-projection 89
3.7. On geometry of function space of distributions 93
3.8. Remarks on possible divergence, metric and
connection in statistical manifold 96
3.9. Notes 102

PART II. HIGHER-ORDER ASYMPTOTIC THEORY OF STATISTICAL INFERENCE


IN CURVED EXPONENTIAL FAMILIES 104

Chapter 4. Curved Exponential Families and Edgeworth Expansions 104

4.1. Exponential family 104


4.2. Curved exponential family 108
4.3. Geometrical aspects of statistical inference 115
4.4. Edgeworth expansion 120
4.5. Notes 127

Chapter 5. Asymptotic Theory of Estimation 128

5.1. Consistency and efficiency of estimators 128


5.2. Second- and third-order efficient estimator 131
5.3. Third-order error of estimator without bias
correction 141
5.4. Ancillary family depending on the number of
observations 145
5.5. Effects of parametrization 148
5.6. Geometrical aspects of jackknifing 156
5.7. Notes 159

Chapter 6. Asymptotic Theory of Tests and Interval Estimators 161

6.1. Ancillary family associated with a test 161


6.2. Asymptotic evaluations of tests: scalar parameter
case 171
6.3. Characteristics of widely used efficient tests:
Scalar parameter case 181
6.4. Conditional test 190
6.5. Asymptotic properties of interval estimators 193
6.6. Asymptotic evaluations of tests: general case 197
6.7. Notes 208

Chapter 7. Information, Ancillarity and Conditional Inference 210

7.1. Conditional information, asymptotic sufficiency


and asymptotic ancillarity 210
7.2. Conditional inference 217
7.3. Pooling independent observations 231
7.4. Complete decomposition of information 236
7.5. Notes 241

Chapter 8. Statistical Inference in the Presence of Nuisance


Parameters 244

8.1. Orthogonal parametrization and orthogonalized


information 244
8.2. Higher-order efficiency of estimators 255
8.3. The amount of information carried by knowledge
of nuisance parameter 257
8.4. Asymptotic sufficiency and ancillarity 261
8.5. Reconstruction of estimator from those of
independent samples 268
8.6. Notes 273

REFERENCES 276

SUBJECT INDICES 291


1. INTRODUCTION

Why Geometry?

One may ask why geometry, in particular differential geometry,


is useful for statistics. The reason seems very simple and strong.
A statistical model is a set of probability distributions to which

we believe the true distribution belongs. It is a subset of all the


possible probability distributions. In particular, a parametric
model usually forms a finite-dimensional manifold imbedded in the
set of all the possible probability distributions. In particular, a parametric
model usually forms a finite-dimensional manifold imbedded in the
set of all the possible probability distributions. For example, a
normal model consists of the probability distributions N(μ, σ²)

parametrized by the two parameters (μ, σ). The normal model M = {N(μ, σ²)}


forms a two-dimensional manifold with coordinates μ and σ, and
is imbedded in the set S = {p(x)} of all the regular probability
distributions of a random variable x. One often uses a statistical
model to carry out statistical inference, assuming that the true

distribution is included in the model. However, a model is merely a


hypothesis. The true distribution may not be in the model but be
only close to it. Therefore, in order to evaluate statistical
inference procedures, it is important to know what part the
statistical model occupies in the entire set of probability
distributions and what shape the statistical model has in the entire

set. This is the problem of geometry of statistical models. It is


therefore expected that a fundamental role is played in statistics
by the geometrical quantities such as the distance or divergence of
two probability distributions, the flatness or curvature of a
statistical model, etc. However, it is by no means a trivial task
to define such geometrical structures in a natural and invariant
manner.
Statistical inference can be carried out more and more


precisely as the number of observations increases, so that one can
construct a universal asymptotic theory of statistical inference in


the regular case. Since the estimated probability distribution lies
very close to the true distribution in this case, it is sufficient
when evaluating statistical procedures to take account of only the
local structure of the model in a small neighborhood of the true or
estimated distribution. Hence, one can locally linearize the model
at the true or estimated distribution, even if the model is curved
in the entire set. Geometrically, this local linearization is an
approximation to the manifold by the tangent space at a point. The
tangent space has a natural inner product (Riemannian metric) given
by the Fisher information matrix. From the geometrical point of
view, one may say that the asymptotic theory of statistical
inference has indeed been constructed by using the linear geometry
of tangent spaces of a statistical model, even if it has not been
explicitly stated.
Local linearization accounts only for local properties of a
model. In order to elucidate larger-scale properties of a model,
one needs to introduce mutual relations of two different tangent
spaces at two neighboring points in the model. This can be done by
defining an affine correspondence between two tangent spaces at
neighboring points. This is a standard technique of differential
geometry and the correspondence is called an affine connection. By
an affine connection, one can study local non-linear properties,
such as curvature, of a model beyond linear approximation. This
suggests that a higher-order asymptotic theory can naturally be
constructed in the framework of differential geometry. Moreover,
one can obtain global properties of a model by connecting tangent
spaces at various points. These considerations show the usefulness
and validity of the differential-geometrical approach to statistics.
Although the present monograph treats mainly the higher-order
asymptotic theory of statistical inference, the
differential-geometrical method is useful for more general


statistical analyses. It seems rather surprising that few theories
have so far been developed concerning geometrical properties of a
family of probability distributions.

Historical Remark
It was Rao (1945), in his early twenties, who first noticed the
importance of the differential-geometrical approach. He introduced
the Riemannian metric in a statistical manifold by using the Fisher
information matrix and calculated the geodesic distances between two
distributions for various statistical models. This theory made an
impact and not a few researchers have tried to construct a theory
along this Riemannian line. Jeffreys also remarked on the Riemannian
distance (Jeffreys, 1948), and the invariant prior of Jeffreys (1946)
was based on the Riemannian concept. The properties of the
Riemannian manifold of a statistical model have further been studied
by a number of researchers independently, e.g., Amari (1968), James
(1973), Atkinson and Mitchell (1981), Dawid (1977), Akin (1979),
Kass (1980), Skovgaard (1984), etc. Amari's unpublished results
(1959) induced a number of researches in Japan: Yoshizawa (1971a,
b), Takiyama (1974), Ozeki (1971), Sato et al. (1979), Ingarden et
al. (1979), etc. Nevertheless, the statistical implications of the

Riemannian curvature of a model did not become clear. Some


additional concepts seemed necessary for proving the usefulness of
the geometrical approach.
It was an isolated work by Chentsov (1972), in a Russian book
(translated into English in 1982) and in some papers prior to the book,
that developed a new concept of statistical manifolds. He
introduced a family of affine connections in a statistical manifold,
whereas only the Riemannian (Levi-Civita) connection was used in the
above works. He also proved that the Fisher information and these
affine connections are unique in the manifold of probability


distributions on a finite number of atoms. He proved this from the
point of view of the categorical invariance, by considering a
category whose objects are multinomial distributions and whose
morphisms are Markovian mappings between them. His theory is deep
and fundamental, and he elucidated the geometrical structures of the
exponential family. However, he did not remark on the curvature of a
statistical manifold, which plays a central role in the higher-order
asymptotic theory of statistical inference.
It was Efron (1975, 1978) who opened up a new approach independently
of Chentsov's work. He defined the statistical curvature of a
statistical model, and pointed out that the statistical curvature
plays a fundamental role in the higher-order asymptotic theory of
statistical inference. Although he did not introduce an affine
connection explicitly, a new affine connection (exponential
connection) was introduced implicitly in his theory, as was
elucidated by Dawid (1975). Dawid also suggested the possibility of
introducing another affine connection (mixture connection). Efron's
idea was generalized by Madsen (1979); see also Reeds (1975).
Under the strong influence of Efron's paper and Dawid's
suggestion, Amari (1980, 1982a) introduced a one-parameter family of
affine connections (a-connections), which turned out to be
equivalent to those Chentsov had already defined. Amari further
proposed a differential-geometrical framework for constructing a
higher-order asymptotic theory of statistical inference. He,
defining the a-curvature of a submanifold, pointed out important
roles of the exponential and mixture curvatures and their duality in
statistical inference. Being stimulated by this framework, a number
of papers appeared, e.g. Amari (1982b, 1983a, b), Amari and Kumon
(1983), Kumon and Amari (1983, 1984, 1985), Eguchi (1983, 1984); see
also Wei and Tsai (1983), Kass (1984). The theoretical background
was further deepened by Nagaoka and Amari (1982), where the


dualistic viewpoint was refined and some new geometrical concepts
were introduced. Here statistics contributes to differential
geometry.
Professors D. R. Cox, O. E. Barndorff-Nielsen and D.V. Hinkley
organized a NATO Advanced Workshop on Differential Geometry in
Statistical Inference in April, 1984 in London. More than forty
researchers participated, and stimulating discussions took place
concerning the present achievement by and future prospects for the
differential-geometrical method in statistics. New directions of
developments were shown, e.g. by Amari (1984 a), Barndorff-
Nielsen (1984), Lauritzen (1984), etc. I believe that the differential-
geometrical method will become established as one of the main and
indispensable theoretical methods in statistics.

Organization of the Monograph


Part I treats fundamental geometrical properties of parametric
families of probability distributions. We define in Chapter 2 the
basic quantities of a statistical manifold, such as the Riemannian
metric, the α-affine connection, the α-curvature of a submanifold,
etc. This chapter also provides a good introduction to differential
geometry, so that one can read the Monograph without any prior
knowledge of differential geometry. The explanation is rather
intuitive, and unnecessarily rigorous treatments are avoided. The
reader is asked to refer to Kobayashi and Nomizu (1963, 1969) or any
other textbooks for the modern approach to differential geometry,
and to Schouten (1954) for the old tensorial style of notations.
Chapter 3 presents an advanced theory of differential geometry of
statistical manifolds. A pair of dual connections are introduced in
a differentiable manifold with a Riemannian metric. The dualistic
characteristics of an α-flat manifold are especially interesting.
We can define an α-divergence measure between two probability
distributions in an α-flat manifold, which fits well with the
differential-geometrical structures. The Kullback-Leibler
information, the Chernoff distance, the f-divergence of Csiszar, the
Hellinger distance, etc. are all included in this class of
α-divergences. This chapter is based mainly on Nagaoka and Amari
(1982), which unifies the geometry of Csiszar (1967a, b; 1975) and
that of Chentsov (1972) and Amari (1982a). This type of duality
theory cannot be found in any differential geometry literature.
Part II is devoted to the higher-order asymptotic theory of
statistical inference in the framework of a curved exponential
family. We present the fundamental method of approach in Chapter 4,
by decomposing the minimal sufficient statistic into the sum of an
asymptotically sufficient statistic and an asymptotically ancillary
statistic in the tangent space of a model. The Edgeworth expansion
of their joint probability distribution is explicitly given in
geometrical terms up to the term of order 1/N, where N is the number of
observations. Chapter 5 is devoted to the theory of estimation,
where both the exponential and mixture curvatures play important
roles. Chapter 6 treats the theory of statistical tests. We
calculate the power functions of various efficient tests such as the
Wald test, the Rao test (efficient score test), the likelihood ratio
test, etc. up to the term of order 1/N. The characteristics of
various first-order efficient tests are compared. Chapter 7 treats
more basic structures concerning information such as higher-order
asymptotic sufficiency and ancillarity. Conditional inference is
studied from the geometrical point of view. The relation between
the Fisher information and higher-order curvatures is elucidated.
Chapter 8 treats statistical inference in the presence of nuisance
parameters. The mixture and exponential curvatures again play
important roles.
It was not possible to include in this volume the newly


developing topics such as those presented and discussed at the NATO
Workshop. See, e.g., Barndorff-Nielsen (1984), Lauritzen (1984) and
Amari (1984 a), which together will appear as a volume of the IMS
Monograph Series, and the papers by R.E. Kass, C.L.Tsai, etc. See
also Kumon and Amari (1984), Amari and Kumon (1985), Amari (1984 c).
The differential-geometrical method developed in statistics is also
applicable to other fields of sciences such as information theory
and systems theory (Amari, 1983 c, 1984 b). See Ingarden (1981) and
Caianiello (1983) for applications to physics. They together will
open a new field, which I would like to call information geometry.

Personal Remarks
It was in 1959, while I was studying for my Master's
Degree at the University of Tokyo, that I became enchanted by the
idea of a beautiful geometrical structure of a statistical model. It
was suggested that I consider the geometrical structure of the family
of normal distributions, using the Fisher information as a
Riemannian metric. This was Professor Rao's excellent idea proposed
in 1945. I found that the family of normal distributions forms a
Riemannian manifold of constant negative curvature, which is the
Bolyai-Lobachevsky geometry well known in the theory of
non-Euclidean geometry. My results on the geodesic, geodesic
distance and curvature appeared in an unpublished report. I could
not understand the statistical meaning of these results, in
particular the meaning of the Riemannian curvature of a statistical
manifold. Since then, I have been dreaming of constructing a
theory of differential geometry for statistics, although my work has
been concentrated in non-statistical areas, namely graph theory,
continuum mechanics, information sciences, mathematical theory of
neural nets, and other aspects of mathematical engineering. It was
a paper by Professor Efron that awoke me from my dream and led me to


work enthusiastically on constructing a differential-geometrical
theory of statistics. This Monograph is a result of several years
of endeavour by myself along this line.

Finally, I list some problems in which I am now interested and
which I am now studying.

1. Extension of the geometric theory of statistical inference


such that it is applicable to a general regular parametric model
which is not necessarily a curved exponential family. This
extension is possible by introducing the jet bundle which is an
aggregate of local exponential families. Here, a local exponential
family is attached to each point of the model such that the original
model is locally (approximately) imbedded in the exponential family
at that point.
2. Extension of the present theory to the function space of
regular probability distributions. This enables us to construct a
geometrical theory of non-parametric, semi-parametric and robust
statistical inference.
3. The problem of estimating a structural parameter in the
presence of as many incidental parameters as the number of
observations. This classical problem can be elucidated by
introducing a Hilbert bundle to the underlying statistical model.
4. Differential geometry of a statistical model which
possesses an invariant transformation group. The structure of such
a model is highly related to the existence of an exact ancillary
statistic.
5. Geometry of statistical models of discrete random variables
and categorical data analysis.
6. Geometry of multivariate statistical analysis.
7. Geometry of time-series analysis. Local and global


structures of parametric time-series models are interesting.
8. Differential-geometrical theory of systems.
9. Application of differential geometry to information theory,
coding theory and the theory of flow. We need to study geometrical
structures of a manifold of information sources (e.g., the manifold
of Markov chains and the manifold of coders, which map the manifold
of all the information sources into itself).
10. Geometry of non-regular statistical models. Asymptotic
properties of statistical inference in a non-regular model are
related to both the Finsler geometry and the theory of stable
distributions of degree α.

Acknowledgement
I would like to express my sincere gratitude to Professor
Emeritus Kazuo Kondo, who organized the RAAG (Research Association
of Applied Geometry) and introduced me to the world of applied
geometry. The author also thanks Professor S. Moriguti for his
suggestion of the geometrical approach in statistics. I especially
appreciate valuable suggestions and encouragement from Professor K.
Takeuchi, without which I could not complete the present work. I am
grateful to many statisticians for their warm encouragement, useful
comments and inspiring discussions. I would like to mention
especially Professor B. Efron, Professor A.P.Dawid, Professor
D.R.Cox, Professor C.R.Rao, Professor O. Barndorff-Nielsen,
Professor S. Lauritzen, Professor D.V. Hinkley, Professor
D.A. Pierce, Professors Ib and L.T. Skovgaard, Professor T.
Kitagawa, Professor T. Okuno, Professor T.S.Han, and Professor M.
Akahira, Dr. A. Mitchell. The comments by Professor H. Kimura and
Professor K. Kanatani were also useful. Professor Th. Chang and
Dr. R. Lockhart were kind enough to read the first version of the
manuscript and gave me detailed and valuable suggestions both from


the mathematical and editorial points of view.
My special thanks go to young Japanese researchers Dr.M. Kumon
and Mr. H. Nagaoka who are actively working in this field. They
collaborated with me for years when they were my students in

constructing the differential-geometrical theory of statistical


inference. Without their cooperation, it would have been difficult
to construct the differential geometrical theory at such a speed.
Mr. S. Shimada, Mr. K. Kurata and many other members of my
laboratory checked the manuscript carefully. Mr. K. Shimada helped
me make numerical calculations and fine illustrations. Last but
not least, I would like to express my heartfelt thanks to Mrs. T.
Shintani and Miss K. Enomoto for their devotion and patience in
typing such a difficult manuscript.

Since the first printing of this monograph in 1985, many papers have
appeared on this subject, and this dual geometry has been recognized to be
applicable to a wide range of information sciences. New references that have
appeared during these four years are appended in this second printing; they
show the new developments in this field.
PART I. GEOMETRICAL STRUCTURES OF A FAMILY
OF PROBABILITY DISTRIBUTIONS

2. DIFFERENTIAL GEOMETRY OF STATISTICAL MODELS

The present chapter is devoted to the introduction of

fundamental differential-geometrical structures of


statistical models. The tangent space, the Riemannian
metric and the α-connections are introduced in a

statistical manifold. No differential-geometrical

background is required for reading this monograph, because


the present chapter provides a readable introduction to
differential geometry.

2.1. Manifold of statistical model


Statisticians often treat a parametrized family of probability
distributions as a statistical model. Let S = {p(x, θ)} be such a
statistical model, where x is a random variable belonging to a sample
space X, and p(x, θ) is the probability density function of x,
parametrized by θ, with respect to some common dominating measure P
on X. Here, θ is a real n-dimensional parameter θ = (θ^1, θ^2, ...,
θ^n) belonging to some open subset Θ of the n-dimensional real space
R^n. For example, the normal model is the family of probability
distributions having the following density functions,

p(x, θ) = (1/(√(2π) σ)) exp{- (x - μ)²/(2σ²)},

where the sample space X is the real line R^1 with the Lebesgue measure
dP = dx and the parameter θ is two-dimensional; we may put θ = (θ^1, θ^2)
= (μ, σ), because μ and σ are usually used as the parameters
specifying a normal distribution. Here, the parameter set Θ is a
half plane,

Θ = {(μ, σ) | -∞ < μ < ∞, 0 < σ}.

Thus, the set S is composed of all the normal distributions, and
each normal distribution N(μ, σ²) in S is specified by the
two-dimensional parameter θ = (μ, σ).

We give another example. Let x be a random variable taking its
value in the integer sample set X = {1, 2, ..., n+1}. Let p_i be the
probability that x is equal to i, where

Σ p_i = 1,  1 > p_i > 0,  i = 1, ..., n+1.

Then, the p_i's define a multinomial distribution. By putting

θ^1 = p_1, θ^2 = p_2, ..., θ^n = p_n,

the probability function of a multinomial distribution is written as

p(x, θ) = Σ_i δ(x - i)θ^i + δ(x - n - 1)(1 - Σ_i θ^i),

where δ(x - i) = 1 when x = i and otherwise δ(x - i) = 0. The
multinomial statistical model is the set S composed of all the above
multinomial distributions, and each distribution is specified by the
n-dimensional parameter θ = (θ^1, ..., θ^n).
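To fix ideas, a minimal Python sketch (an editorial illustration, not part of the original text; the values are arbitrary) represents a point of this multinomial model by its coordinates θ and recovers the full probability vector:

import numpy as np

def multinomial_probs(theta):
    # theta = (theta^1, ..., theta^n) = (p_1, ..., p_n); p_{n+1} = 1 - sum(theta)
    theta = np.asarray(theta)
    return np.append(theta, 1.0 - theta.sum())   # (p_1, ..., p_{n+1})

p = multinomial_probs([0.2, 0.3])   # a point of the 2-dimensional model (n = 2)
print(p, p.sum())                   # probabilities of x = 1, 2, 3; they sum to 1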
When p(x, θ) is sufficiently smooth in θ, it is natural to
introduce in a statistical model S the structure of an n-dimensional
manifold, where θ plays the role of a coordinate system. We give
here a brief introductory explanation of differentiable manifolds.
Refer to Kobayashi and Nomizu [1963] for rigorous definitions. An
n-dimensional manifold S is, intuitively speaking, a Hausdorff space
which is locally homeomorphic to an n-dimensional Euclidean space
R^n. Let U be an open subset of S which is homeomorphic to R^n with a
homeomorphism φ. Then, a point p ∈ U is mapped to a point θ = (θ^1,
θ^2, ..., θ^n) ∈ R^n, i.e., φ(p) = θ = (θ^1, θ^2, ..., θ^n) (Fig. 2.1).
This mapping φ is called a coordinate function in the coordinate
neighborhood U. We have thus introduced a coordinate system in U
Fig. 2.1

such that each point p in U is given the coordinates θ = (θ^1, ..., θ^n),
or shortly θ = (θ^i), i = 1, ..., n. The coordinates θ may be
considered as a name given to the point p. We can draw the coordinate
curves in U by mapping the coordinate curves in R^n by φ^{-1}.

There exist many other coordinate functions in U. Let ψ be
another coordinate function by which a point p is given the
coordinates ξ = ψ(p) = (ξ^1, ..., ξ^n). The coordinates ξ = (ξ^i),
i = 1, ..., n, define another name given to the same point p. Given
two coordinate systems, each point has two names or two coordinates
θ and ξ. Obviously, there exists a one-to-one correspondence between
the coordinates θ and ξ:

ξ = ψ∘φ^{-1}(θ),  θ = φ∘ψ^{-1}(ξ)

(Fig. 2.2), which can be written in the component form as

ξ^i = ξ^i(θ^1, ..., θ^n),  θ^i = θ^i(ξ^1, ..., ξ^n),  i = 1, ..., n.

These are the coordinate transformations.

The transformation from θ to ξ is said to be a diffeomorphism,
when the n functions ξ^i(θ^1, ..., θ^n) are differentiable (up to
necessary orders) with respect to θ^1, ..., θ^n and the Jacobian of
the transformation

det |∂ξ^i/∂θ^j|

does not vanish on U, where det denotes the determinant of the
matrix whose (i, j)-element is ∂ξ^i/∂θ^j.

Fig. 2.2

In this case, the inverse

transformation from ξ to θ is also a diffeomorphism. When we


consider the differentiable structure of a manifold, only those
coordinate systems which are mutually connected by diffeomorphisms
are allowed. More precisely, a local differentiable structure is
introduced in U by defining a coordinate system. The same
differentiable structure is introduced by any of the allowable

coordinate systems connected by diffeomorphisms.

We have so far treated the local structure of manifold S by


restricting our attention to an open set U. Unless S itself is
homeomorphic to R^n, there are no coordinate functions which cover
the entire S. In this case, we consider an open cover U = {U_i} of
S such that a coordinate function φ_i is defined on each open set
U_i. Whenever two open sets U_i and U_j overlap, a point p in
U_i ∩ U_j has the two sets of coordinates θ = φ_i(p) and ξ = φ_j(p).
Hence, we can define the coordinate transformation from θ = φ_i(p) to
ξ = φ_j(p) for points p belonging to both U_i and U_j. When all such
coordinate transformations are diffeomorphisms, the differentiable
structure is introduced in S by the open cover U together with the
coordinate functions φ_i defined on the U_i. A metrizable Hausdorff
space is called a differentiable manifold, when it has such an open
cover. A pair (U_i, φ_i) of a coordinate neighborhood and a
coordinate function is called a chart, and the collection of the
(U_i, φ_i)'s is called an atlas. However, since the present theory
treats only local properties of manifolds of statistical models, we
do not hereafter consider the global structure of S. Instead, it is
assumed that a manifold S always has a global coordinate system
covering the entire S. Otherwise, our theory is valid on some
neighborhood U of S.
Let us return to the family of probability distributions S =
{p(x, θ)} of a statistical model. We can define a mapping φ : S →
R^n by φ[p(x, θ)] = θ. When this function plays the role of a
coordinate function, the vector θ is used as the coordinates or the
name of the distribution p(x, θ), and a differentiable structure is
introduced in S by this coordinate function. Thus, S is a
differentiable manifold. Let ξ = (ξ^1, ..., ξ^n) be another
parametrization of the model S such that θ and ξ are connected by
diffeomorphisms ξ = ξ(θ) and θ = θ(ξ). Then, ξ defines another
coordinate system in S. Any allowable coordinate system can be used
to analyze the geometric properties of S. Notice that the
coordinates are nothing but a "name" attached to each point
(distribution) p ∈ S. The intrinsic geometric properties should be
independent of the naming. However, there often exists a very
convenient naming (coordinate system) depending on the specific
properties of S. There are no reasons to avoid such a convenient
coordinate system when one analyzes a specific statistical model S.
The following regularity conditions are required in the
following geometrical theory.
1) All the p(x, θ)'s have a common support, so that p(x, θ) > 0
for all x ∈ X, where X is the support.
2) Let ℓ(x, θ) = log p(x, θ). For every fixed θ, the n functions
in x

∂ℓ(x, θ)/∂θ^i,  i = 1, 2, ..., n,

are linearly independent.
3) The moments of the random variables (∂/∂θ^i)ℓ(x, θ) exist up to
necessary orders.
4) The partial derivatives ∂/∂θ^i and the integration with
respect to the measure P can always be interchanged, as

(∂/∂θ^i) ∫ f(x, θ)dP = ∫ (∂/∂θ^i) f(x, θ)dP,

for any functions f(x, θ) we treat in the following.

2.2. Tangent space


The tangent space Tp at point p of a manifold S is, roughly
speaking, a vector space obtained by local linearization of S
around p. It is composed
of the tangent vectors of
smooth curves passing
through p (Fig. 2.3) . By a
curve c = c (t), we mean a
continuous mapping c from a
closed interval [a, bj E R1
into S, where c(t) E S is
the image of t E [a, bj.
If we use a coordinate Fig 2.3
system e = ~(p), the image
point c(t) of t is given by the coordinates e(t) = {e 1 (t),
en(t)}. The equation e = e(t) is the parametric representation of
17

the curve c. A curve is said to be (sufficiently) smooth when a(t)


is differentiable up to necessary order. Mathematicians define the
tangent space in the following formal way. Let F be the set of all
the smooth real functions on S. By using a coordinate system a, fE
F is a smooth function f(a l , ... , an) in a. l.Lven a smooth curve c
c (t) or a (t) and a function f E F, we can define a function f 0 c
[a, b]--+R l , which is written as f{a(t)} in the coordinate
expression.
Let Cf be the derivative of this function,

Cf = d(f∘c)/dt = df{θ(t)}/dt = Σ_{i=1}^n (dθ^i/dt)(∂f/∂θ^i).

This is obviously the derivative of f along the curve c or in the
direction of the tangent of c. Thus, a directional derivative
operator C is associated with each curve, and intuitively C depends
only on the "tangent vector" dθ^i/dt of the curve c. Moreover, the
operator C satisfies the following two conditions at each point
c(t_0) on the curve:

(1) C is a linear mapping from F to R.

(2) C(fg) = (Cf)g + f(Cg), for f, g ∈ F.

Conversely, it can be shown that a mapping C satisfying the above
conditions is always derived as the directional derivative operator
of a curve. The set of these mappings C can be proved to form an
n-dimensional vector space, provided S is sufficiently smooth
(a C^∞-manifold). It is called the tangent space T_p of S at p.
When a coordinate system θ is given, we can consider the n
coordinate curves c_1, c_2, ..., c_n passing through a point p_0.
For example, the first coordinate curve c_1 is the curve along which
only the value of the first coordinate θ^1 changes while all the
other coordinates are fixed. Hence, the curve c_1 is represented by

θ_1(t) = (θ_0^1 + t, θ_0^2, ..., θ_0^n),

where θ_0 = (θ_0^1, ..., θ_0^n) is the coordinates of p_0. Then, the
tangent vector C_1 of c_1 is nothing but the partial derivative with
respect to θ^1,

C_1 f = ∂f/∂θ^1.

Hence, we may denote the tangent C_1 by ∂/∂θ^1 or shortly by ∂_1.
Similarly, the tangent vector C_i of the coordinate curve c_i is
denoted by ∂_i (Fig. 2.4). Since C_i is simply the partial
derivative ∂/∂θ^i, ∂_i can be regarded as the abbreviation of
∂/∂θ^i. It can be proved that the n vectors ∂_i are linearly
independent, forming a basis of T_p. We call {∂_i} the natural basis
associated with the coordinate system θ. Any tangent vector A ∈ T_p
can be represented as a linear combination of the ∂_i,

A = Σ_{i=1}^n A^i ∂_i,

where the A^i are the components of A with respect to the natural
basis. In the following, we adopt the Einstein summation
convention: summation is automatically taken, without the summation
symbol Σ, for those indices which appear twice in one term, once as
a subscript and once as a superscript. Hence, A^i ∂_i automatically
implies Σ_{i=1}^n A^i ∂_i. The tangent vector θ̇ of a curve θ(t) in
the coordinate expression is indeed given by θ̇ = θ̇^i ∂_i (which
implies Σ θ̇^i ∂_i), where ˙ denotes d/dt, because of

θ̇f = (d/dt) f[θ(t)] = θ̇^i ∂f/∂θ^i.   (2.1)

Hence, the θ̇^i are the components of the tangent vector θ̇ of the
curve θ(t).
There exists a more familiar representation of a tangent vector
in the case of the manifold S = {p(x, θ)} of a statistical model.
Let us put

ℓ(x, θ) = log p(x, θ)   (2.2)

and consider the n partial derivatives ∂_i ℓ(x, θ), i = 1, 2, ..., n.
It has been assumed that they are linearly independent functions in x
for every fixed θ. We can construct the following n-dimensional
vector space spanned by the n functions ∂_i ℓ(x, θ) in x,

T_θ^(1) = {A(x) | A(x) = A^i ∂_i ℓ(x, θ)},

i.e., A(x) ∈ T_θ^(1) can be written as a linear combination of the ∂_i ℓ, as
A(x) = A^i ∂_i ℓ(x, θ), where the A^i are the components of A(x) with respect
to the basis ∂_i ℓ(x, θ). Since x is a random variable, T_θ^(1) is the
linear space of random variables spanned by the ∂_i ℓ(x, θ).
There is a natural isomorphism between the two vector spaces T_θ
and T_θ^(1) given by the following correspondence:

∂_i ∈ T_θ  ↔  ∂_i ℓ(x, θ) ∈ T_θ^(1).

Obviously, a tangent vector (derivative operator) A = A^i ∂_i ∈ T_θ
corresponds to the random variable A(x) = A^i ∂_i ℓ(x, θ) ∈ T_θ^(1) having
the same components A^i. We can identify T_θ with T_θ^(1), regarding
T_θ as the differentiation-operator representation of the
tangent space, while T_θ^(1) is the random-variable representation of
the same tangent space. The space T_θ^(1) is called the
1-representation of the tangent space.
Let E[·] be the expectation with respect to the distribution
p(x, θ),

E[f(x)] = ∫ f(x)p(x, θ)dP.   (2.3)

By differentiating the identity ∫ p(x, θ)dP = 1 with respect to θ^i,
0 = ∂_i ∫ p(x, θ)dP = ∫ ∂_i p(x, θ)dP = ∫ p(x, θ)∂_i ℓ(x, θ)dP = E[∂_i ℓ(x, θ)]

is derived. Hence, for any random variable A(x) belonging to T_θ^(1),

E[A(x)] = 0.
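As a quick numerical illustration (a Python sketch added in editing, not part of the original text; parameter values are arbitrary), one may verify E[A(x)] = 0 by Monte Carlo for the normal model, whose score functions ∂_1 ℓ = (x - μ)/σ² and ∂_2 ℓ = (x - μ)²/σ³ - 1/σ are derived in Example 2.1 below:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)    # observations from p(x, theta)

# 1-representations of the basis vectors: the score functions d_i l(x, theta)
d1_l = (x - mu) / sigma**2                   # corresponds to the basis vector d_1
d2_l = (x - mu)**2 / sigma**3 - 1.0 / sigma  # corresponds to the basis vector d_2

# Any A(x) = A^i d_i l(x, theta) in T_theta^(1) has zero expectation
A = 0.3 * d1_l - 1.7 * d2_l
print(A.mean())   # close to 0 up to Monte Carlo error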
We have so far used the coordinate system θ = (θ^i). However, we
can use another coordinate system ξ = (ξ^α), α = 1, ..., n, to
specify a distribution in S. There is a diffeomorphism between θ
and ξ, ξ = ξ(θ), θ = θ(ξ), or in the component form

ξ^α = ξ^α(θ^1, ..., θ^n),  θ^i = θ^i(ξ^1, ..., ξ^n),
i = 1, ..., n;  α = 1, ..., n.

Here, the index i is used to denote the components of θ while the
index α is used to denote the components of ξ. It is convenient to
use different index letters to denote the components with respect to
different coordinate systems. Thus, we use i, j, k, etc. for
representing quantities with respect to θ, and α, β, γ, etc. for
quantities with respect to ξ.
The Jacobian matrices of the above coordinate transformations
are written as

B_i^α(θ) = ∂ξ^α/∂θ^i,  B_α^i(ξ) = ∂θ^i/∂ξ^α.

By differentiating the identity θ[ξ(θ)] = θ, or

θ^i[ξ^1(θ), ..., ξ^n(θ)] = θ^i,

with respect to θ^j, we have

(∂θ^i/∂ξ^α)(∂ξ^α/∂θ^j) = B_α^i B_j^α = δ_j^i,

where δ_j^i is the Kronecker delta, which is equal to 1 when i = j and
otherwise equal to 0. Similarly, we have

B_α^i B_i^β = δ_α^β.

Hence, the two Jacobian matrices (B_i^α) and (B_α^i) are mutually inverse
matrices. Let {∂_i} and {∂_α} be the natural bases of the tangent
space with respect to θ and ξ, respectively. Then, the relations

∂_α = B_α^i ∂_i,  ∂_i = B_i^α ∂_α   (2.4)

hold, because the ∂'s are partial derivatives. By representing the
same vector A in these two bases, A = A^i ∂_i = A^α ∂_α, we have the
respective components A^i and A^α. From the relations (2.4), it is
shown that the components are related by

A^i = B_α^i A^α,  A^α = B_i^α A^i.   (2.5)

These show how the components of a vector are changed by the
coordinate transformation.
The 1-representation A(x) of A is invariant for any coordinate
system,

A(x) = A^i ∂_i ℓ(x, θ) = A^α ∂_α ℓ(x, ξ),

and only its components change, in a contravariant manner, as the basis
changes.

Example 2.1. Normal distribution.

The mean μ and the standard deviation σ are frequently used as the
parameter θ = (θ^1, θ^2), θ^1 = μ, θ^2 = σ, to specify the family S =
{N(μ, σ²)} of the normal distributions. Because of σ > 0, the
parameter space is the upper half-plane, as is shown in Fig. 2.5(a).
The natural basis {∂_i} is

∂_1 = ∂/∂μ,  ∂_2 = ∂/∂σ.

The tangent space T_θ is spanned by these vectors. From

ℓ(x, θ) = - (x - μ)²/(2σ²) - log(√(2π) σ),

the basis ∂_i ℓ(x, θ) of the 1-representation is calculated as

∂_1 ℓ = (x - μ)/σ²,  ∂_2 ℓ = (x - μ)²/σ³ - 1/σ.

The space T_θ^(1) is spanned by these two random variables, so that it
consists of all the quadratic polynomials in x whose expectation
vanishes,

T_θ^(1) = {ax² + bx + c}

with

c = - E[ax² + bx] = - a(σ² + μ²) - bμ.

It is possible to use the first and second moments of x,

ξ^1 = E[x] = μ,  ξ^2 = E[x²] = μ² + σ²,

as the parameter ξ = (ξ^α), α = 1, 2, specifying the distributions.
This defines another coordinate system, and the Jacobian matrix of
the coordinate transformation is given by

B_i^α = ∂ξ^α/∂θ^i = [1, 2μ; 0, 2σ],

and its inverse is given by

B_α^i = ∂θ^i/∂ξ^α = [1, -μ/σ; 0, 1/(2σ)].

The coordinate curves are given in Fig. 2.5(b), where the natural basis
vectors {∂_α}, ∂_α = B_α^i ∂_i, are also shown. The tangent vectors ∂_α, α
= 1', 2', are written as

∂_1' = ∂_1 - (μ/σ)∂_2,  ∂_2' = (1/(2σ))∂_2,

where 1' and 2' are used to denote the {ξ^α}-system. Their
1-representations are

∂_1' ℓ = (x - μ)/σ² + μ/σ² - μ(x - μ)²/σ⁴,

∂_2' ℓ = (x - μ)²/(2σ⁴) - 1/(2σ²).

We have drawn Figs. 2.5(a) and (b) as if the coordinate system θ is
linear and ξ is curvilinear. However, we do not yet have any a
priori reason to decide the linearity of the coordinate systems. It
will be shown later that the coordinate system ξ is linear in a
certain sense.
Fig. 2.5. a) θ-coordinates; b) ξ-coordinates.

Fig. 2.6a)  Fig. 2.6b)
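The computations of Example 2.1 can be checked numerically. The following Python sketch (an editorial illustration with arbitrary values, not part of the original text) verifies that the two Jacobian matrices are mutually inverse and that the 1-representation A(x) is invariant under the change of coordinates (2.4), (2.5):

import numpy as np

mu, sigma = 1.0, 2.0

# Jacobians of theta = (mu, sigma) <-> xi = (mu, mu^2 + sigma^2):
# B[i, a] = d xi^a / d theta^i and Binv[a, i] = d theta^i / d xi^a
B = np.array([[1.0, 2 * mu],
              [0.0, 2 * sigma]])
Binv = np.array([[1.0, -mu / sigma],
                 [0.0, 1 / (2 * sigma)]])
print(B @ Binv)                  # identity: mutually inverse matrices

# Invariance of the 1-representation A(x) = A^i d_i l = A^a d_a l
x = 0.7                          # an arbitrary sample value
dl = np.array([(x - mu) / sigma**2,
               (x - mu)**2 / sigma**3 - 1 / sigma])   # d_i l(x, theta)
dl_xi = Binv @ dl                # d_a l = B_a^i d_i l        (2.4)
A_th = np.array([0.5, -1.2])     # components A^i in the theta basis
A_xi = B.T @ A_th                # A^a = B_i^a A^i            (2.5)
print(A_th @ dl, A_xi @ dl_xi)   # the two values coincide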

Example 2.2. Multinomial distribution.

In the case of the manifold S of multinomial distributions, we
put

θ^1 = p_1, θ^2 = p_2, ..., θ^n = p_n, θ^{n+1} = p_{n+1}.

Then,

θ^1 + θ^2 + ... + θ^{n+1} = 1

holds, so that S is n-dimensional, and we can use θ = (θ^1, θ^2, ...,
θ^n) as a coordinate system of S. The probability distribution
specified by θ is

p(x, θ) = Σ_{i=1}^{n+1} δ(x - i)θ^i,

and its logarithm is

ℓ(x, θ) = Σ_{i=1}^{n+1} δ(x - i) log θ^i,

where θ^{n+1} = 1 - θ^1 - ... - θ^n is regarded as a function of θ = (θ^1,
..., θ^n). The manifold S can be identified with the simplex defined
by

Σ_{i=1}^{n+1} θ^i = 1,  θ^i > 0,

in R^{n+1}, whose coordinate system is θ̄ = (θ^1, ..., θ^{n+1}), as is shown
in Fig. 2.6a), where n = 2. The tangent space T_θ is spanned by the n
vectors ∂_1, ..., ∂_n, and their 1-representations are

∂_i ℓ(x, θ) = δ(x - i)/θ^i - δ(x - n - 1)/θ^{n+1}.

Let us define ξ^α, α = 1, 2, ..., n+1, by

ξ^1 = 2√p_1, ξ^2 = 2√p_2, ..., ξ^{n+1} = 2√p_{n+1},

so that

Σ_{α=1}^{n+1} (ξ^α)² = 4  or  p_α = (ξ^α)²/4.

Then, ξ = (ξ^α), α = 1, ..., n, defines another coordinate system.
When we use this coordinate system, it is convenient to regard S as
a part of the n-dimensional sphere with radius 2 imbedded in R^{n+1},
whose coordinate system is ξ̄ = (ξ^1, ξ^2, ..., ξ^{n+1}) (Fig. 2.6b).
The Jacobian matrix B_α^i is given by the diagonal matrix

B_α^i = ∂θ^i/∂ξ^α = (ξ^i/2) δ_α^i,

and the natural basis ∂_α is obtained from ∂_α = B_α^i ∂_i.

2.3. Riemannian metric and Fisher information


When the inner product ⟨A, B⟩ of two tangent vectors A, B ∈
T_θ is defined, the manifold S is called a Riemannian space. The
inner product can be introduced in the manifold of a statistical
model in the following natural way. Let A(x), B(x) be the
1-representations of A, B. Then, their inner product is naturally
defined by

⟨A, B⟩ = E[A(x)B(x)].   (2.6)

Hence, the inner product is the covariance Cov[A(x), B(x)] of the two
random variables A(x), B(x), because of E[A(x)] = E[B(x)] = 0.
Especially, the inner product of the two basis vectors ∂_i and ∂_j is

g_ij(θ) = ⟨∂_i, ∂_j⟩ = E[∂_i ℓ(x, θ) ∂_j ℓ(x, θ)].   (2.7)

The n² quantities g_ij(θ), i, j = 1, 2, ..., n, together form a
geometric object called the metric tensor. The inner product of two
vectors A = A^i ∂_i and B = B^j ∂_j can be expressed as

⟨A, B⟩ = ⟨A^i ∂_i, B^j ∂_j⟩ = A^i B^j g_ij

in the component form. Hence, the inner product is determined by
giving the metric tensor g_ij. The metric tensor in the
coordinate system ξ = (ξ^α) is given by

g_αβ = ⟨∂_α, ∂_β⟩ = ⟨B_α^i ∂_i, B_β^j ∂_j⟩ = B_α^i B_β^j ⟨∂_i, ∂_j⟩ = B_α^i B_β^j g_ij.

A geometric quantity t is called a covariant tensor of order 2, when
it is expressed by n² components t_ij in a coordinate system θ = (θ^i)
and the components t_ij change into

t_αβ = B_α^i B_β^j t_ij

when it is expressed in another coordinate system ξ = (ξ^α). Hence,
the metric tensor is indeed a covariant tensor of order 2.
Two tangent vectors A, B are said to be orthogonal, when their
inner product vanishes, ⟨A, B⟩ = 0. Obviously, A and B
are orthogonal when their 1-representations A(x) and B(x) are
uncorrelated, i.e. their covariance vanishes. Two curves θ_1(t)
and θ_2(t) passing through a point θ_0 = θ_1(0) = θ_2(0) at t = 0 are
said to be orthogonal at this point, when their tangent vectors θ̇_1(0)
and θ̇_2(0) are orthogonal at this point, ⟨θ̇_1, θ̇_2⟩ = θ̇_1^i θ̇_2^j g_ij = 0,
i.e.,

Cov[(d/dt)ℓ{x, θ_1(t)}, (d/dt)ℓ{x, θ_2(t)}] = 0

at t = 0.
The length |A| of a tangent vector A is defined by

|A|² = ⟨A, A⟩ = A^i A^j g_ij.

This is obviously the variance of the 1-representation A(x),

|A|² = E[{A(x)}²].

Let p and p' be two points in S, which are "infinitesimally" close,
and let θ and θ + dθ be, respectively, their coordinates. By
regarding pp' = dθ^i ∂_i as an "infinitesimal" vector in T_θ, the square
of the distance ds = |pp'| between the two distributions p = p(x, θ) and
p' = p(x, θ + dθ) is given by the quadratic form

ds² = |pp'|² = ⟨pp', pp'⟩ = g_ij dθ^i dθ^j.   (2.8)

The statistical meaning of the above distance is elucidated by the
Cramér-Rao theorem, which is one of the fundamental theorems in
the theory of statistical estimation.

The matrix (g_ij) is well known in statistics as the Fisher
information matrix. Let (g^{ij}) be its inverse,

g^{ij} g_jk = δ_k^i.

Let θ̂ be an unbiased estimator of the parameter θ based on an
observation x from the true distribution p(x, θ), E[θ̂] = θ.

Cramér-Rao Theorem. The covariance of any unbiased estimator θ̂ =
(θ̂^i) is bounded by the inverse of the Fisher information matrix,

Cov[θ̂^i, θ̂^j] ≥ g^{ij},   (2.9)

where ≥ implies that Cov[θ̂^i, θ̂^j] - g^{ij} forms a positive
semi-definite matrix.

The above bound is attained asymptotically in the following
sense. Let x_1, x_2, ..., x_N be N independent observations from the
identical distribution p(x, θ). Then, there exists an estimator θ̂_N
based on these N observations such that the covariance of θ̂_N tends
to g^{ij}/N as N tends to infinity,

Cov[θ̂_N^i, θ̂_N^j] → (1/N) g^{ij}.

The maximum likelihood estimator is such one. Moreover, the
distribution of the above estimator tends to the normal
distribution N(θ, g^{ij}/N), i.e., the probability density function
p(θ̂_N, θ) of θ̂_N, where θ is the true parameter, tends to

p(θ̂_N, θ) = const. exp{- (N/2) g_ij(θ) dθ^i dθ^j},

where dθ^i = θ̂_N^i - θ^i is the estimation error.
The indistinguishability or non-separability of two nearby
distributions p(x, θ) and p(x, θ') may be measured by the
probability that θ' is obtained as the estimated value θ̂_N from N
independent observations from p(x, θ). When N is large, this
probability of confusion between p(x, θ) and p(x, θ') is
determined by their distance

ds² = g_ij(θ)(θ'^i - θ^i)(θ'^j - θ^j),

where dθ^i = θ'^i - θ^i is infinitesimally small as N tends to
infinity. Hence, the distance (2.8) is shown to be based on the
separability of two distributions by a large number of independent
observations. When two distributions are separated by a large
distance, it is easy to distinguish them based on observations of
the random variable. It is also possible to show that the distance
ds² is related to the power of testing one hypothesis H_0 : p(x, θ_0)
against the other H_1 : p(x, θ_1) based on a large number of
observations.
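The asymptotic attainment of the bound can be observed in a small simulation. The following Python sketch (an editorial illustration, not part of the original text; sample sizes and parameters are arbitrary) computes the maximum likelihood estimates of (μ, σ) over many repetitions and compares N times their empirical covariance with the inverse Fisher information diag(σ², σ²/2) of the normal model:

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, N, trials = 0.0, 2.0, 400, 20_000

x = rng.normal(mu, sigma, size=(trials, N))
mu_hat = x.mean(axis=1)          # MLE of mu
sigma_hat = x.std(axis=1)        # MLE of sigma (divides by N)

est = np.stack([mu_hat, sigma_hat])
print(np.cov(est) * N)           # ~ g^{ij}: diag(sigma^2, sigma^2 / 2)
print(sigma**2, sigma**2 / 2)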
Let c : θ(t) be a smooth curve connecting two points θ_0 = θ(t_0)
and θ_1 = θ(t_1). Then, the distance s from θ_0 to θ_1 along the curve
c is obtained by integrating the infinitesimal distance ds between
θ(t) and θ(t + dt) = θ(t) + θ̇ dt,

ds² = g_ij[θ(t)] θ̇^i θ̇^j dt²,

so that

s = ∫ ds = ∫_{t_0}^{t_1} √(g_ij θ̇^i θ̇^j) dt.

Among all the curves connecting two points θ_0 and θ_1, the one which
gives the minimum distance is called the Riemannian geodesic
connecting θ_0 and θ_1. The Riemannian distance between θ_0 and θ_1 is
defined by the distance along the Riemannian geodesic.
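The length integral is easy to evaluate numerically for a given curve. The sketch below (Python, an editorial illustration, not part of the original text) integrates √(g_ij θ̇^i θ̇^j) by the midpoint rule for the normal-model metric g = diag(1/σ², 2/σ²) of Example 2.3 below; the curve chosen here is arbitrary:

import numpy as np

def g(theta):
    # Fisher metric of the normal model in theta = (mu, sigma)
    return np.array([[1.0, 0.0], [0.0, 2.0]]) / theta[1]**2

def curve_length(theta_of_t, t0, t1, steps=10_000):
    # s = integral of sqrt(g_ij theta_dot^i theta_dot^j) dt, midpoint rule
    t = np.linspace(t0, t1, steps + 1)
    pts = np.array([theta_of_t(u) for u in t])
    mids = 0.5 * (pts[1:] + pts[:-1])
    vel = (pts[1:] - pts[:-1]) / (t[1] - t[0])
    ds = [np.sqrt(v @ g(m) @ v) for v, m in zip(vel, mids)]
    return sum(ds) * (t[1] - t[0])

# Length of the straight segment from N(0, 1) to N(1, 1) in the (mu, sigma) plane
print(curve_length(lambda t: np.array([t, 1.0]), 0.0, 1.0))   # = 1 along sigma = 1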


There is a formula convenient for calculating the metric tensor
g_ij or the Fisher information matrix:

g_ij(θ) = - E[∂_i ∂_j ℓ(x, θ)].   (2.10)

This can easily be proved from the relation

p(x, θ) ∂_i ∂_j ℓ(x, θ) = ∂_i ∂_j p(x, θ) - p(x, θ) ∂_i ℓ(x, θ) ∂_j ℓ(x, θ)

by integration, because ∫ ∂_i ∂_j p(x, θ)dP = ∂_i ∂_j ∫ p(x, θ)dP = 0.
This equation gives another interpretation of the metric tensor.
Given x, ℓ(x, θ) is the log likelihood function in θ, and the maximum
likelihood estimator θ̂ is the one which maximizes ℓ(x, θ), i.e., it
satisfies ∂_i ℓ(x, θ̂) = 0. We can expand the function ℓ(x, θ) at θ̂,

ℓ(x, θ) = ℓ(x, θ̂) + (1/2) ∂_i ∂_j ℓ(x, θ̂)(θ^i - θ̂^i)(θ^j - θ̂^j)
+ higher order terms.

The maximum of ℓ(x, θ) is attained at θ = θ̂, and the term
-∂_i ∂_j ℓ(x, θ̂) shows how sharp the peak of ℓ(x, θ) is at θ̂. The
Fisher information is the negative of the expectation of this second
derivative of ℓ(x, θ).

Example 2.3. Metric in the manifold of normal distributions.

The metric tensor g_ij(θ) in the coordinate system θ = (μ, σ) of
the normal family N(μ, σ²) is calculated easily from the definition
(2.7) or (2.10) as

g_ij(θ) = (1/σ²) [1, 0; 0, 2].

Since the cross components g_12(θ) and g_21(θ) vanish identically, the
basis vectors ∂_1 and ∂_2 are always orthogonal. Hence the coordinate
system θ is an orthogonal system, composed of two families of
mutually orthogonal coordinate curves, θ^1 = μ = const. and θ^2 = σ =
const. However, the length of ∂_i depends on the position θ (more
precisely on σ),

|∂_1|² = Var[(x - μ)/σ²] = 1/σ²,  |∂_2|² = 2/σ²,

and the coordinate system θ is not Cartesian.
In terms of the other coordinate system ξ, ξ^1 = μ, ξ^2 = μ² + σ², the
metric tensor has the following form,

g_αβ = (1/σ⁴) [σ² + 2μ², -μ; -μ, 1/2].

This can be ascertained either by calculating the covariance matrix
of ∂_α ℓ(x, ξ) directly, or by the tensorial rule of coordinate
transformations. The two basis vectors ∂_α, α = 1', 2', are not
orthogonal.
The Riemannian distance d(θ_1, θ_2) between two points (two normal
distributions) θ_1 = (μ_1, σ_1) and θ_2 = (μ_2, σ_2) is obtained by
integrating ds along the geodesic connecting them. The geodesic
curve θ(t) connecting two normal distributions is given by

θ^1(t) = c_1 + 2c_2 tanh(t/√2 + c_3),
θ^2(t) = √2 c_2 / cosh(t/√2 + c_3),

when μ_1 ≠ μ_2 and, when μ_1 = μ_2 = μ,

θ^1(t) = μ,  θ^2(t) = exp(t/√2 + c),

where c and the c_i are constants and t is the geodesic length. Here we
do not mention the procedures to obtain these results (see, e.g.,
Atkinson and Mitchell [1981]; Skovgaard [1981]). It was shown by
Amari in 1959 (unpublished) and later by many others that the family
of normal distributions is a space of constant negative curvature
with the scalar curvature K = -1/2. The geometry of such a space
was studied by J. Bolyai and by N.I. Lobacevskii and is known as the
non-Euclidean geometry. The Riemannian geometry of the multivariate
normal model is fully studied in Skovgaard [1981].
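One can check numerically that the parameter t in the geodesic formulas above is indeed the arc length: the velocity has unit length in the metric of Example 2.3 at every t. A Python sketch (an editorial illustration, not part of the original text; the constants c_i are arbitrary):

import numpy as np

c1, c2, c3 = 0.0, 1.3, -0.4      # arbitrary geodesic constants

def theta(t):
    u = t / np.sqrt(2) + c3
    return np.array([c1 + 2 * c2 * np.tanh(u),
                     np.sqrt(2) * c2 / np.cosh(u)])

def speed(t, h=1e-6):
    v = (theta(t + h) - theta(t - h)) / (2 * h)   # numerical theta_dot
    mu, sigma = theta(t)
    gmat = np.array([[1.0, 0.0], [0.0, 2.0]]) / sigma**2
    return float(np.sqrt(v @ gmat @ v))

print([round(speed(t), 6) for t in (-1.0, 0.0, 2.5)])   # all ~ 1.0: t is arc length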

Example 2.4. Metric in the manifold of multinomial distributions.

The metric tensor g_ij in the coordinate system θ (θ^i = p_i) is
calculated from (2.10) as

g_ij = (p_i)^{-1} δ_ij + (p_{n+1})^{-1},

where (p_i)^{-1} δ_ij is the diagonal matrix with the diagonal entries
(p_1)^{-1}, ..., (p_n)^{-1}. This is obtained by taking the expectation of

- ∂_i ∂_i ℓ = δ(x - i)(θ^i)^{-2} + δ(x - n - 1)(θ^{n+1})^{-2},
- ∂_i ∂_j ℓ = δ(x - n - 1)(θ^{n+1})^{-2}  (i ≠ j).

In terms of the new coordinate system ξ^α = 2√p_α, the metric tensor
is given by g_αβ = ⟨∂_α, ∂_β⟩ or, by using the relation g_αβ = B_α^i B_β^j g_ij,

g_αβ = δ_αβ + (ξ^{n+1})^{-2} ξ^α ξ^β,  α, β = 1, ..., n.

It is sometimes convenient to add the (n + 1)-st coordinate ξ^{n+1} to ξ
= (ξ^1, ..., ξ^n), so that we get the (n + 1)-dimensional coordinates

ξ̄ = (ξ^1, ..., ξ^n, ξ^{n+1}).

Then, a point ξ̄ is regarded as a point on the sphere with radius 2
of the (n + 1)-dimensional space R^{n+1}, because of the relation

(ξ^{n+1})² = 4 - (ξ^1)² - ... - (ξ^n)².

Let ξ and ξ' = ξ + dξ be the coordinates of two probability
distributions which are infinitesimally close to each other. Then,
the square of their distance is given by

ds² = g_αβ dξ^α dξ^β = δ_αβ dξ^α dξ^β + (ξ^{n+1})^{-2} (Σ_α ξ^α dξ^α)².

On the other hand, let ds̃² be the square of the Euclidean distance
between the two points ξ̄ and ξ̄ + dξ̄ in R^{n+1}. It is given by

ds̃² = Σ_{α=1}^{n+1} (dξ^α)².

Since both ξ̄ and ξ̄ + dξ̄ are on the sphere,

Σ_{α=1}^{n+1} ξ^α dξ^α = 0  or  ξ^{n+1} dξ^{n+1} = - Σ_{α=1}^n ξ^α dξ^α

holds. By substituting this in ds̃², it is proved that the Euclidean
distance coincides with the Riemannian distance ds². In other
words, the Riemannian metric of the manifold S of multinomial
distributions is induced from the Euclidean R^{n+1} by the constraint
that S is on the sphere. That is, we can imbed S in the
Euclidean space R^{n+1} isometrically. The geodesic connecting two
distributions (p_i) and (q_i) is given by the great circle connecting
them, so that their Riemannian distance is easily calculated. Let ξ̄
and ξ̄' be their (n + 1)-dimensional coordinates in R^{n+1}. Then, the
angle γ between the two vectors ξ̄ and ξ̄' is given by

cos γ = (1/4) Σ_{α=1}^{n+1} ξ^α ξ'^α = Σ_{i=1}^{n+1} √(p_i q_i).

The Riemannian distance s is the arc length on the sphere, so
that it is given by

s = 2 cos^{-1}(Σ_{i=1}^{n+1} √(p_i q_i)).

The geodesic ξ̄(t) connecting them is given by

ξ̄^α(t) = c(t)[ξ^α + t(ξ'^α - ξ^α)],

where c(t) is the normalizing constant. In terms of the
probabilities, it is given by

p_i(t) = c² {(1 - t)√p_i + t √q_i}²,

where c(t) is determined from Σ p_i(t) = 1.
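The closed-form distance and geodesic of Example 2.4 are easily coded. The following Python sketch (an editorial illustration with arbitrary distributions, not part of the original text) computes s = 2 cos^{-1}(Σ√(p_i q_i)) and a point on the normalized great-circle geodesic:

import numpy as np

def multinomial_distance(p, q):
    # Riemannian distance s = 2 arccos(sum_i sqrt(p_i q_i))
    c = np.sqrt(np.asarray(p) * np.asarray(q)).sum()
    return 2 * np.arccos(np.clip(c, -1.0, 1.0))

def geodesic_point(p, q, t):
    # p_i(t) = c(t)^2 {(1 - t) sqrt(p_i) + t sqrt(q_i)}^2, normalized to sum to 1
    w = (1 - t) * np.sqrt(np.asarray(p)) + t * np.sqrt(np.asarray(q))
    return w**2 / (w**2).sum()

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
print(multinomial_distance(p, q))
print(geodesic_point(p, q, 0.5), geodesic_point(p, q, 0.5).sum())  # midpoint, sums to 1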

2.4. Affine connection


When a vector A(e) E Te is attached to each point e E S, the
collection A = {A(e) l e E S} is called a vector field. More
formally, a vector field A is a mapping from S to Te' which
assigns a vector A(e) E Te to a point e E S. The i-th basis vector
di in the coordinate system e is a vector field, which assigns the
partial derivative d i E Te to each e E S. The basis vector di at e
should be written as di(e), when it is necessary to show explicitly
the point e at which di is operated. A vector field A can be
represented in the component form as
33

A = Ai(e)di(e)
It is smooth when the components Ai(e) are smooth functions in e.
The basis vector field di itself is a smooth vector field. The set
of all the smooth vector fields of S is denoted by T(S) or shortly
by T.
Since the tangent spaces $T_\theta$ and $T_{\theta'}$ are different when $\theta$ and $\theta'$ are different points, there is no direct means to compare two vectors $A(\theta) \in T_\theta$ and $A(\theta') \in T_{\theta'}$. The direct comparison of their components $A^i(\theta)$ and $A^i(\theta')$ is meaningless, because the basis vectors $\partial_i(\theta)$ and $\partial_i(\theta')$ are different. (Even when the space is Euclidean, $\partial_i(\theta)$ and $\partial_i(\theta')$ are different in the case that the coordinate system $\theta$ is curvilinear.) In order to compare two vectors belonging to two different vector spaces, it is necessary to establish a one-to-one correspondence between the vector spaces so that one vector space is mapped to the other. Let us try to give an affine correspondence between two adjacent tangent spaces $T_\theta$ and $T_{\theta'}$, where $\theta' = \theta + d\theta$ is "infinitesimally" close to $\theta$. Once such a correspondence is established for any two adjacent points, it can be extended along a curve $\theta(t)$ to give a correspondence between two tangent spaces $T_{\theta(t_0)}$ and $T_{\theta(t_1)}$ at distant points $\theta(t_0)$ and $\theta(t_1)$, although the correspondence depends in general on the curve connecting the two points.
Let us consider a linear mapping $m : T_{\theta+d\theta} \to T_\theta$ depending on $d\theta$, which reduces to the identity map as $d\theta$ tends to 0. Since $d\theta$ is small, the basis vector $\partial_j = \partial_j(\theta + d\theta) \in T_{\theta+d\theta}$ is mapped to a vector $m(\partial_j)$ close to $\partial_j(\theta)$ (Fig. 2.7).

Fig. 2.7

By expanding the difference $\Delta\partial_j$ between $m(\partial_j)$ and $\partial_j$, i.e.,
$$\Delta\partial_j = m[\partial_j] - \partial_j(\theta) \in T_\theta,$$
with respect to $d\theta$ and neglecting higher-order terms, the vector $\Delta\partial_j$ is expressed as
$$\Delta\partial_j = d\theta^i\,\Gamma_{ij}^{\;k}(\theta)\,\partial_k,$$
where $d\theta^i\Gamma_{ij}^{\;k}$ are the components of $\Delta\partial_j \in T_\theta$. Hence, the mapping $m$ is determined by $n^3$ functions $\Gamma_{ij}^{\;k}(\theta)$, $i, j, k = 1, \ldots, n$, in $\theta$. Since $m$ is linear, a vector $A^i\partial_i \in T_{\theta+d\theta}$ is mapped to
$$A^i m[\partial_i] = (A^k + d\theta^i\Gamma_{ij}^{\;k}A^j)\partial_k \in T_\theta,$$
thus establishing a correspondence between vectors in $T_\theta$ and $T_{\theta+d\theta}$. An affine correspondence between $T_\theta$ and $T_{\theta+d\theta}$ is obtained from the map $m$ by considering that the origin of $T_{\theta+d\theta}$ is mapped to the point $d\theta^i\partial_i$ in $T_\theta$. By this affine correspondence, a point $A^i\partial_i$ in $T_{\theta+d\theta}$ is mapped to a point in $T_\theta$ shown by the vector
$$(A^k + d\theta^k + d\theta^i\Gamma_{ij}^{\;k}A^j)\partial_k.$$
The $n^3$ functions $\Gamma_{ij}^{\;k}(\theta)$ in $\theta$ are called the coefficients of the affine connection, because $m$ gives an affine correspondence between $T_\theta$ and $T_{\theta+d\theta}$.
The difference $\Delta\partial_j$ can be regarded as the intrinsic change in the $j$-th basis vector $\partial_j(\theta)$ as the point changes from $\theta$ to $\theta + d\theta$. Hence, if we denote by $\nabla_{\partial_i}\partial_j$ the rate of the intrinsic change of $\partial_j$ as the point $\theta$ changes in the direction of $\partial_i$, it is given by the vector
$$\nabla_{\partial_i}\partial_j = \Gamma_{ij}^{\;k}(\theta)\,\partial_k. \qquad (2.11)$$
This is again a vector field. The vector field $\nabla_{\partial_i}\partial_j$ is called the covariant derivative of the vector field $\partial_j$ along $\partial_i$. It is determined from the coefficients $\Gamma_{ij}^{\;k}(\theta)$ of the affine connection. Conversely, the covariant derivatives $\nabla_{\partial_i}\partial_j$ uniquely determine the coefficients of the underlying affine connection. Indeed, by taking the inner product of both sides of (2.11) with $\partial_m$, the relation
$$\langle\nabla_{\partial_i}\partial_j, \partial_m\rangle = \Gamma_{ij}^{\;k}\langle\partial_k, \partial_m\rangle = \Gamma_{ij}^{\;k}(\theta)g_{km}(\theta) \qquad (2.12)$$
follows. It is convenient to define the covariant expression of the coefficients of the affine connection by
$$\Gamma_{ijm}(\theta) = \langle\nabla_{\partial_i}\partial_j, \partial_m\rangle. \qquad (2.13)$$
Then, by using the inverse $(g^{mk})$ of the matrix $(g_{km})$,
$$\Gamma_{ij}^{\;k} = g^{km}\Gamma_{ijm}. \qquad (2.14)$$
Apart from the above intuitive introduction, let us give a more precise definition of an affine connection. An affine connection on $S$ is a covariant derivative $\nabla$, namely a mapping from $T(S) \times T(S)$ to $T(S)$ satisfying the conditions described below. For two vector fields $A, B \in T(S)$, the covariant derivative of $B$ along $A$ is a vector field $C$ denoted by $C = \nabla_A B$, where the vector $C(\theta) \in T_\theta$ can be interpreted as the rate of the intrinsic change in the vector field $B(\theta)$ as the point $\theta$ changes in the direction of the vector $A(\theta)$. The covariant derivative should satisfy, for vector fields $A, A', B, B' \in T(S)$, the linearity conditions
$$\nabla_A(B + B') = \nabla_A B + \nabla_A B', \qquad (2.15)$$
$$\nabla_{A+A'}B = \nabla_A B + \nabla_{A'}B. \qquad (2.16)$$
For a smooth scalar function $f : S \to R$, $fA$ is also a vector field whose value at $\theta$ is $f(\theta)A(\theta) \in T_\theta$. The covariant derivative should also satisfy
$$\nabla_{fA}B = f\nabla_A B, \qquad (2.17)$$
$$\nabla_A(fB) = f\nabla_A B + (Af)B, \qquad (2.18)$$
where
$$Af = A^i(\theta)\partial_i f(\theta).$$
Obviously, the coefficients of the affine connection are given by
$$\Gamma_{ijk}(\theta) = \langle\nabla_{\partial_i}\partial_j, \partial_k\rangle. \qquad (2.19)$$
On the other hand, when the $\Gamma_{ijk}$ are given, the covariant derivative $\nabla_A B$ for $A = A^i(\theta)\partial_i$, $B = B^i(\theta)\partial_i$ can be calculated as
$$\nabla_A B = \nabla_A(B^j\partial_j) = (A^i\partial_i B^j)\partial_j + A^i B^j\nabla_{\partial_i}\partial_j = (A^i\partial_i B^k + A^i B^j\Gamma_{ij}^{\;k})\partial_k.$$
Hence, the $n^3$ quantities $\Gamma_{ijk}(\theta)$ define an affine connection.
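Written out as a small symbolic sketch (illustrative only; the example fields and the all-zero connection below are hypothetical choices, not taken from the text), the component formula for $\nabla_A B$ reads:

    import sympy as sp

    t1, t2 = sp.symbols('theta1 theta2')
    th = [t1, t2]; n = 2
    # Hypothetical connection coefficients Gamma[i][j][k] = Γ_ij^k (all zero here)
    Gamma = [[[0]*n for _ in range(n)] for _ in range(n)]
    A = [1, 0]          # example field A = ∂_1
    B = [t2, t1]        # example field B = θ²∂_1 + θ¹∂_2
    # (∇_A B)^k = A^i ∂_i B^k + A^i B^j Γ_ij^k, summing over repeated indices
    nabla_AB = [sum(A[i]*sp.diff(B[k], th[i]) for i in range(n))
                + sum(A[i]*B[j]*Gamma[i][j][k]
                      for i in range(n) for j in range(n))
                for k in range(n)]
    print(nabla_AB)     # [0, 1] for this flat example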


The coefficients of the affine connection depend on the coordinate system $\theta$. For another coordinate system $\xi = (\xi^\alpha)$ with the Jacobian matrix $B_\alpha^i = \partial\theta^i/\partial\xi^\alpha$, the coefficients of the affine connection are calculated as
$$\Gamma_{\alpha\beta\gamma}(\xi) = B_\alpha^i B_\beta^j B_\gamma^k\Gamma_{ijk}(\theta) + B_\gamma^j(\partial_\alpha B_\beta^i)g_{ij}. \qquad (2.21)$$
This shows how the coefficients of an affine connection change under coordinate transformations. It should be noticed that $\Gamma_{ijk}$ is not a tensor. Even if $\Gamma_{ijk}(\theta) = 0$ holds identically for some coordinate system $\theta$, $\Gamma_{\alpha\beta\gamma}(\xi) = 0$ does not hold for another coordinate system $\xi$, unless $\theta$ and $\xi$ are linearly related. Although the covariant derivative $\nabla$ is defined independently of the coordinate system, its expression $\Gamma_{ijk}$ depends on the coordinate system because the natural vector fields $\partial_i$ are defined based on the coordinate system $\theta = (\theta^i)$.
Once an affine connection is introduced, we can talk about the straightness (and hence the curvature) of a curve. Let $\theta(t)$ be a smooth curve in $S$ and let $B$ be a vector field defined on the curve such that $B(\theta)$ at $\theta = \theta(t)$ is written as $B(t) = B^i(t)\partial_i$. The tangent vector $\dot\theta = \dot\theta^i(t)\partial_i$ of the curve is also a vector field defined on the curve (Fig. 2.8). The covariant derivative $\nabla_{\dot\theta}B$ denotes how $B(t)$ changes along the curve, i.e., the intrinsic change in $B$ along the curve. When there are no intrinsic changes in $B(t)$, $B(t)$ satisfies the equation $\nabla_{\dot\theta}B = 0$, which is equivalent to
$$\dot B^k(t) + \dot\theta^i(t)B^j(t)\Gamma_{ij}^{\;k}\{\theta(t)\} = 0, \qquad (2.22)$$
where the relation
$$\dot\theta f = \dot\theta^i\partial_i f = \frac{d}{dt}f[\theta(t)] = \dot f$$
is used. When $B(t)$ is the solution of the above equation, the vector $B(t)$ at $T_{\theta(t)}$ is said to be the parallel shift of $B(t')$ at $T_{\theta(t')}$ along the curve. A vector $B$ at $\theta'$ can be shifted in parallel to any point $\theta$ along a curve connecting these two points. (The result depends in general on the curve.)

Fig. 2.8

When the tangent vector $\dot\theta$ of a curve $\theta(t)$ may change its magnitude but not its direction, it satisfies the equation
$$\nabla_{\dot\theta}\dot\theta = c(t)\dot\theta.$$
By choosing an appropriate parameter $t$, this equation reduces to the simpler form
$$\nabla_{\dot\theta}\dot\theta = 0, \qquad (2.23)$$
which implies that the tangent vector of the curve does not change at all along itself. This is a generalization of the straight line in Euclidean geometry. It is called a geodesic with respect to the affine connection. The equation (2.23) can be rewritten as
$$\ddot\theta^k(t) + \dot\theta^i(t)\dot\theta^j(t)\Gamma_{ij}^{\;k}\{\theta(t)\} = 0 \qquad (2.24)$$
in the component form. It should be noticed that straightness is directly related to the underlying affine connection. The curvature can be defined by the deviation from straightness, as will be shown soon, so that it is also related to the affine connection.
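The geodesic equation (2.24) is an ordinary differential equation and can be integrated numerically once the $\Gamma_{ij}^{\;k}$ are known. The following sketch (illustrative only) uses simple Euler steps; the coefficients used below are those of the 0-connection of the normal family $N(\mu, \sigma)$, derived from the Fisher metric $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ that appears later in Examples 2.5 and 2.8, and should be treated as an assumption of this sketch:

    import numpy as np

    def integrate_geodesic(gamma, theta0, thetadot0, dt=1e-3, steps=2000):
        # Euler integration of (2.24):
        # d²θ^k/dt² + Γ_ij^k(θ) dθ^i/dt dθ^j/dt = 0.
        th, v = np.array(theta0, float), np.array(thetadot0, float)
        for _ in range(steps):
            G = gamma(th)                        # G[i, j, k] = Γ_ij^k(θ)
            a = -np.einsum('ijk,i,j->k', G, v, v)
            th, v = th + dt * v, v + dt * a
        return th

    def gamma_normal(th):
        # 0-connection of N(μ, σ) with metric g = diag(1/σ², 2/σ²)
        mu, sg = th
        G = np.zeros((2, 2, 2))
        G[0, 0, 1] = 1.0 / (2.0 * sg)            # Γ_11²
        G[0, 1, 0] = G[1, 0, 0] = -1.0 / sg      # Γ_12¹ = Γ_21¹
        G[1, 1, 1] = -1.0 / sg                   # Γ_22²
        return G

    print(integrate_geodesic(gamma_normal, [0.0, 1.0], [1.0, 0.0]))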

2.5. Statistical α-connection


We have so far explained the general mathematical notion of an affine connection. Now is the time to study the possibility of introducing an affine connection in the space $S$ of a statistical model such that it represents the intrinsic properties of the family of probability distributions. The 1-representation of the tangent space again provides a good guide for this. Since the natural basis $\partial_j(\theta + d\theta)$ of $T_{\theta+d\theta}$ is represented by the random variable
$$\partial_j\ell(x, \theta + d\theta) = \partial_j\ell(x, \theta) + \partial_i\partial_j\ell(x, \theta)d\theta^i,$$
to obtain an affine connection, it is necessary to find a way of mapping it to the space $T_\theta^{(1)}$ spanned by the $n$ functions $\partial_j\ell(x, \theta)$. Since the expectation at $\theta$ of $\partial_i\partial_j\ell(x, \theta)$ does not vanish because of (2.10), $\partial_j\ell(x, \theta + d\theta)$ does not belong to $T_\theta^{(1)}$. So we first modify $\partial_j\ell(x, \theta + d\theta) \in T_{\theta+d\theta}^{(1)}$ so that its expectation vanishes at $\theta$. This can be done by adding $g_{ij}(\theta)d\theta^i$ to yield
$$\partial_j\ell(x, \theta) + \{\partial_i\partial_j\ell(x, \theta) + g_{ij}(\theta)\}d\theta^i.$$
Since this still does not in general belong to $T_\theta^{(1)}$, we project this random variable to the linear space $T_\theta^{(1)}$. By this projection, a linear correspondence between $T_\theta$ and $T_{\theta+d\theta}$ is established. Since $\Delta\partial_j$ is the projection of $\{\partial_i\partial_j\ell(x, \theta) + g_{ij}(\theta)\}d\theta^i$ to $T_\theta^{(1)}$, the resultant affine connection is given by
$$\Gamma_{ijk}(\theta) = E[\{\partial_i\partial_j\ell(x, \theta) + g_{ij}(\theta)\}\partial_k\ell(x, \theta)] = E[\partial_i\partial_j\ell(x, \theta)\partial_k\ell(x, \theta)]. \qquad (2.25)$$
This connection is also derived by projecting $\partial_i\partial_j\ell(x, \theta)d\theta^i$ directly to $T_\theta^{(1)}$.
There is another possibility of modifying the random variable $\partial_i\partial_j\ell(x, \theta)$, because the expectation at $\theta$ of
$$\partial_i\partial_j\ell(x, \theta) + \partial_i\ell(x, \theta)\partial_j\ell(x, \theta)$$
also vanishes. This modification leads to another affine connection, whose coefficients are given by
$$\Gamma_{ijk}(\theta) = E[\{\partial_i\partial_j\ell(x, \theta) + \partial_i\ell(x, \theta)\partial_j\ell(x, \theta)\}\partial_k\ell(x, \theta)]. \qquad (2.26)$$
The above two definitions suggest that an infinite number of affine connections can be introduced by using a weighted mean. Let $\alpha$ be a scalar parameter. Then, the modification of $\partial_i\partial_j\ell(x, \theta)$ into
$$\partial_i\partial_j\ell(x, \theta) + \frac{1+\alpha}{2}g_{ij}(\theta) + \frac{1-\alpha}{2}\partial_i\ell\,\partial_j\ell$$
leads to the affine connection whose coefficients are given by
$$\Gamma_{ijk}^{(\alpha)}(\theta) = E\Bigl[\Bigl\{\partial_i\partial_j\ell(x, \theta) + \frac{1-\alpha}{2}\partial_i\ell(x, \theta)\partial_j\ell(x, \theta)\Bigr\}\partial_k\ell(x, \theta)\Bigr]. \qquad (2.27)$$
This is called the α-connection, and it reduces to the former one (2.25) when $\alpha = 1$, and to the latter one (2.26) when $\alpha = -1$. The covariant derivative with respect to the α-connection is denoted by $\nabla^{(\alpha)}$.
Let us define a third-order tensor by
$$T_{ijk}(\theta) = E[\partial_i\ell(x, \theta)\,\partial_j\ell(x, \theta)\,\partial_k\ell(x, \theta)]. \qquad (2.28)$$
Notice that its components change as
$$T_{\alpha\beta\gamma} = B_\alpha^i B_\beta^j B_\gamma^k T_{ijk}$$
under coordinate transformations. Such a quantity is called a tensor. The α-connection can be written as
$$\Gamma_{ijk}^{(\alpha)} = \Gamma_{ijk}^{(1)} + \frac{1-\alpha}{2}T_{ijk}, \qquad (2.29)$$
which is convenient for calculating the coefficients of the α-connections.
We have thus introduced a one-parameter family of affine connections. The reader might ask which is the true connection to be introduced in $S$. This question is meaningless. For any $\alpha$, the α-connection has its proper meaning depending on $\alpha$, and it plays a proper role in statistical inference. The α-connection defines α-straightness, so that the deviation from it is measured by the α-curvature defined in the following.
Although the meanings of the α-connections are to be elucidated in the following (especially in the next chapter), we give intuitive explanations aiming at satisfying the curiosity of some readers. Let $p_1(x)$ and $p_2(x)$ be two density functions. We can connect them by a curve
$$p(x, t) = (1 - t)p_1(x) + tp_2(x), \qquad (2.30)$$
where $t$ is the parameter connecting them with $p(x, 0) = p_1(x)$, $p(x, 1) = p_2(x)$. This curve forms a one-parameter family of probability distributions called the mixture family, because the random variable $x$ subject to $p(x, t)$ is considered as the mixture 100(1 − t)% from $p_1(x)$ and 100t% from $p_2(x)$. For $\ell(x, t) = \log p(x, t)$, it is easy to show
$$\partial_t\ell(x, t) = \{p_2(x) - p_1(x)\}/p(x, t),$$
$$\partial_t^2\ell(x, t) + \{\partial_t\ell(x, t)\}^2 = 0,$$
where $\partial_t = \partial/\partial t$. This shows that the basis vector $\partial_t\ell(x, t + dt) \in T_{t+dt}^{(1)}$ is mapped to $\partial_t\ell \in T_t^{(1)}$ if the modification is done in the manner shown by (2.26), i.e., by $\alpha = -1$. This demonstrates that the −1-connection manifests the criterion that mixture families should be understood as straight models. Hence, a mixture family (2.30) can be regarded as the −1-straight line connecting two distributions. This point of view can be extended to the mixture of a number of distributions $p_1(x), \ldots, p_{n+1}(x)$, giving an $n$-dimensional −1-flat manifold.

There is another way of connecting two distributions. The family
$$p(x, t) = \exp\{(1 - t)\ell_1(x) + t\ell_2(x) - c(t)\}, \qquad (2.31)$$
where $c(t)$ is the normalization factor to be determined from $\int p(x, t)dP = 1$, is obtained by taking linear combinations of $\ell_i(x) = \log p_i(x)$. This family is well known to statisticians as the exponential family, and will later play a very important role. From the relation
$$\exp\{c(t)\} = \int\exp\{(1 - t)\ell_1(x) + t\ell_2(x)\}dP,$$
we have
$$\partial_t\partial_t\ell(x, t) + g_{tt} = 0,$$
where $g_{tt}$ is the Fisher information. Hence, $\partial_t\ell(x, t + dt) \in T_{t+dt}^{(1)}$ is naturally mapped to $\partial_t\ell(x, t) \in T_t^{(1)}$ if the modification (2.25) is used, which implies $\alpha = 1$. Hence, the 1-connection is based on the criterion which regards the exponential family as a straight line.
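To check this relation, differentiate $\ell(x, t) = (1 - t)\ell_1(x) + t\ell_2(x) - c(t)$ directly:
$$\partial_t\ell = \ell_2(x) - \ell_1(x) - \dot c(t), \qquad \partial_t\partial_t\ell = -\ddot c(t),$$
and differentiating $\exp\{c(t)\} = \int\exp\{(1 - t)\ell_1 + t\ell_2\}dP$ twice gives $\ddot c(t) = \mathrm{Var}_t[\ell_2 - \ell_1] = E_t[(\partial_t\ell)^2] = g_{tt}$, whence $\partial_t\partial_t\ell + g_{tt} = 0$.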

Readers familiar with differential geometry might have wondered why we do not introduce the Levi-Civita parallelism, which regards a Riemannian geodesic (the minimum length curve connecting two points) as a straight line. The Riemannian metric tensor $g_{ij}(\theta)$ brings about a natural affine connection, whose coefficients are given by
$$[i, j; k] = \frac{1}{2}(\partial_i g_{jk} + \partial_j g_{ik} - \partial_k g_{ij}), \qquad (2.32)$$
which is sometimes called the Christoffel three-index symbol. A careful calculation proves that this is nothing but the 0-connection,
$$\Gamma_{ijk}^{(0)} = [i, j; k]. \qquad (2.33)$$
This connection is called the Levi-Civita connection, the information connection or the Riemannian connection. It is given rise to by the criterion that the minimum length curve is a straight line.
Among the α-connections, the Riemannian (i.e., $\alpha = 0$) connection is the only one by which the parallel shift of a vector does not change its length. Such a connection is said to be metric. The α-connections are in general non-metric except for the case $\alpha = 0$.

Example 2.5. The α-connections of normal distributions. By differentiating $\ell(x, \theta)$ twice with respect to $\theta = (\mu, \sigma)$ for the normal distributions, $\partial_i\partial_j\ell(x, \theta)$ can be obtained as
$$\partial_1\partial_1\ell(x, \theta) = -1/\sigma^2, \qquad \partial_1\partial_2\ell(x, \theta) = -2(x - \mu)/\sigma^3,$$
$$\partial_2\partial_2\ell(x, \theta) = -3(x - \mu)^2/\sigma^4 + 1/\sigma^2.$$
Although the above random variables are quadratic in $x$, they do not belong to $T_\theta^{(1)}$, because their expectations do not vanish. Hence, in order to find the counterpart in $T_\theta^{(1)}$ of $\partial_i\ell(x, \theta + d\theta) = \partial_i\ell + \partial_i\partial_j\ell\,d\theta^j \in T_{\theta+d\theta}^{(1)}$, it is necessary to modify the terms $\partial_i\partial_j\ell(x, \theta)$ such that their expectations vanish. The 1-connection is given from (2.25) as
$$\Gamma_{111}^{(1)} = \Gamma_{112}^{(1)} = \Gamma_{221}^{(1)} = \Gamma_{122}^{(1)} = \Gamma_{212}^{(1)} = 0,$$
$$\Gamma_{121}^{(1)} = \Gamma_{211}^{(1)} = -2/\sigma^3, \qquad \Gamma_{222}^{(1)} = -6/\sigma^3.$$
The components of $T_{ijk}$ are calculated as
$$T_{111} = T_{221} = 0, \qquad T_{112} = 2/\sigma^3, \qquad T_{222} = 8/\sigma^3,$$
and all the other components are obtained from the symmetry of $T_{ijk}$, i.e., $T_{ijk} = T_{jik} = T_{kij}$. Hence, the α-connection is given from (2.29) as
$$\Gamma_{111}^{(\alpha)} = \Gamma_{212}^{(\alpha)} = \Gamma_{122}^{(\alpha)} = \Gamma_{221}^{(\alpha)} = 0, \qquad \Gamma_{112}^{(\alpha)} = (1 - \alpha)/\sigma^3,$$
$$\Gamma_{121}^{(\alpha)} = \Gamma_{211}^{(\alpha)} = -(1 + \alpha)/\sigma^3, \qquad \Gamma_{222}^{(\alpha)} = -2(1 + 2\alpha)/\sigma^3.$$
The α-geodesics connecting two points $\theta_1$ and $\theta_2$ are different for different $\alpha$. One can check that $\Gamma_{ijk}^{(0)}$ coincides with the Christoffel symbol $[i, j; k]$ calculated from the metric tensor. It is not difficult to check that the geodesic curve given in Example 2.3 is indeed the α-geodesic satisfying (2.24) for the $\alpha = 0$ connection.
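These components can be machine-checked by integrating the definition (2.27) symbolically over $x$; the following SymPy sketch (illustrative only, and possibly slow) is one way to do it:

    import sympy as sp

    x, mu, al = sp.symbols('x mu alpha', real=True)
    sg = sp.Symbol('sigma', positive=True)
    p = sp.exp(-(x - mu)**2 / (2*sg**2)) / (sp.sqrt(2*sp.pi)*sg)
    l = sp.log(p)
    th = [mu, sg]

    def Gamma(i, j, k):
        # Γ^{(α)}_{ijk} = E[{∂_i∂_j l + (1-α)/2 ∂_i l ∂_j l} ∂_k l], eq. (2.27)
        integrand = (sp.diff(l, th[i], th[j])
                     + (1 - al)/2 * sp.diff(l, th[i]) * sp.diff(l, th[j])) \
                    * sp.diff(l, th[k]) * p
        return sp.simplify(sp.integrate(integrand, (x, -sp.oo, sp.oo)))

    print(Gamma(0, 0, 1))   # expected: (1 - alpha)/sigma**3
    print(Gamma(1, 1, 1))   # expected: -2*(1 + 2*alpha)/sigma**3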

Example 2.6. The α-connections of multinomial distributions. From $\partial_k\ell$ and $\partial_i\partial_j\ell$ calculated in Examples 2.2 and 2.4, we obtain
$$\Gamma_{ijk}^{(\alpha)} = \frac{1 + \alpha}{2}\bigl\{-(\theta^i)^{-2}\delta_{i,j,k} + (\theta^{n+1})^{-2}\bigr\},$$
where $\delta_{i,j,k}$ is equal to 1 when $i = j = k$ and otherwise is equal to 0. This shows that $\Gamma_{ijk}^{(\alpha)}$ vanishes identically for $\alpha = -1$. In other words, the basis vector $\partial_i(\theta')$ at $T_{\theta'}$ corresponds to $\partial_i(\theta)$ at $T_\theta$ in the θ-coordinate system. This is because the multinomial distributions form a mixture family. We can calculate the coefficients of the α-connection in the ξ-coordinate system directly or by using the transformation rule (2.21).
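This expression also follows quickly from (2.29): since $\Gamma^{(-1)}_{ijk} = \Gamma^{(1)}_{ijk} + T_{ijk}$, we have
$$\Gamma^{(\alpha)}_{ijk} = \Gamma^{(-1)}_{ijk} - \frac{1+\alpha}{2}T_{ijk} = -\frac{1+\alpha}{2}T_{ijk}$$
here, because $\Gamma^{(-1)}_{ijk} = 0$ for a mixture family, and a direct computation for the multinomial model gives $T_{ijk} = E[\partial_i\ell\,\partial_j\ell\,\partial_k\ell] = (\theta^i)^{-2}\delta_{i,j,k} - (\theta^{n+1})^{-2}$, which reproduces the displayed formula.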

2.6. Curvature and torsion


In a manifold with an affine connection, the tangent space $T_\theta$ at $\theta$ can be mapped by an affine transformation to the tangent space $T_{\theta'}$ at $\theta'$ along a curve $\theta(t)$ connecting $\theta$ and $\theta'$. This affine transformation depends in general on the curve connecting the two points. A curve $\theta(t)$, $t \in [t_0, t_1]$, is called a loop when it is closed, i.e., $\theta(t_0) = \theta(t_1)$ holds ($t_0 \neq t_1$). When a loop $\theta(t)$ is given, the tangent space $T_\theta$ at $\theta = \theta(t_0) = \theta(t_1)$ is mapped to itself along the loop $\theta(t)$. This map is not necessarily the identity map, so that the origin of $T_\theta$ is mapped to another point in $T_\theta$ by encircling the loop $\theta(t)$. The direction of a vector also changes by encircling the loop. These discrepancies represent characteristic structures of the manifold and are described by the torsion and curvature of the manifold.
In order to define the torsion and curvature, we need to introduce the notion of a tensor. A (covariant) tensor $Q$ of order $k$ is a multilinear mapping from $k$ vector fields to the real numbers,
$$Q : T(S) \times \cdots \times T(S) \to R.$$
For $k$ vector fields $A_1, \ldots, A_k \in T(S)$, its value is written as $Q(A_1, A_2, \ldots, A_k)$, where $Q$ is linear in each $A_i$ with any scalar functions as coefficients. By using a coordinate system $\theta$, we can evaluate $Q$ at the basis vector fields $\partial_1, \ldots, \partial_n$. Then, we obtain the $n^k$ quantities
$$Q_{i_1 \ldots i_k} = Q(\partial_{i_1}, \partial_{i_2}, \ldots, \partial_{i_k}), \qquad i_1, i_2, \ldots, i_k = 1, 2, \ldots, n.$$
For vector fields $A_1 = A_1^i\partial_i$, $A_2 = A_2^i\partial_i, \ldots, A_k = A_k^i\partial_i$, we have
$$Q(A_1, \ldots, A_k) = Q_{i_1 \ldots i_k}A_1^{i_1}\cdots A_k^{i_k}.$$
This is the component form of the tensor operation. The quantities $Q_{i_1 \ldots i_k}$ are called the components of the tensor $Q$ with respect to the natural basis $\{\partial_i\}$. In terms of another coordinate system $\xi$, the components are given by
$$Q_{\alpha_1 \ldots \alpha_k} = B_{\alpha_1}^{i_1}\cdots B_{\alpha_k}^{i_k}Q_{i_1 \ldots i_k}.$$
This gives the transformation rule of a tensor. When the components of a tensor vanish in one coordinate system, they vanish in any coordinate system. The Riemannian metric is a tensor of order two whose components are given by $g_{ij}$. An affine connection is not a tensor, although its components are written as $\Gamma_{ijk}$, because it does not define a multilinear mapping and its transformation rule (2.21) is different from that of a tensor.
We define another type of tensor. Let $R$ be a multilinear mapping from $k$ vector fields to a vector field,
$$R : T(S) \times \cdots \times T(S) \to T(S).$$
This is called a tensor of order $k + 1$, of covariant order $k$ and contravariant order 1. It maps $k$ vector fields $A_1, \ldots, A_k$ to a vector $R(A_1, A_2, \ldots, A_k)$. For the basis vector fields $\partial_{i_1}, \ldots, \partial_{i_k}$, it gives a vector
$$R(\partial_{i_1}, \ldots, \partial_{i_k}) = R_{i_1 \ldots i_k}{}^j\partial_j.$$
The $n^{k+1}$ quantities $R_{i_1 \ldots i_k}{}^j$ are called the components of the tensor $R$ with respect to the coordinate system $\theta = (\theta^i)$ or the natural basis $\{\partial_i\}$. For $A_1 = A_1^i\partial_i, \ldots, A_k = A_k^i\partial_i$,
$$R(A_1, \ldots, A_k) = A_1^{i_1}\cdots A_k^{i_k}R_{i_1 \ldots i_k}{}^j\partial_j.$$
When an inner product is introduced in the tangent space $T_\theta$, this $R$ defines a covariant tensor $R'$ of order $k + 1$ by
$$R'(A_1, \ldots, A_k, B) = \langle R(A_1, \ldots, A_k), B\rangle.$$
The components of $R'$ are given by
$$R_{i_1 \ldots i_k j} = \langle R(\partial_{i_1}, \ldots, \partial_{i_k}), \partial_j\rangle$$
and
$$R_{i_1 \ldots i_k}{}^m = R_{i_1 \ldots i_k j}\,g^{jm}.$$
Hence, $R'$ is a covariant version (lower index version) of $R$. We hereafter identify $R$ and $R'$, and omit the prime, regarding them as different expressions of one and the same quantity.
The torsion is a bilinear mapping from $T(S) \times T(S)$ to $T(S)$ induced by the affine connection. Hence, it is a tensor of order three. For two vector fields $A, B \in T(S)$, the mapping is defined by
$$S(A, B) = \nabla_A B - \nabla_B A - [A, B], \qquad (2.34)$$
which is also a vector field. Intuitively speaking, $\varepsilon^2 S(A, B)$ represents the change in the position of the origin of $T_\theta$ after shifting $T_\theta$ in parallel along the parallelogram composed of two infinitesimal vectors $\varepsilon A$ and $\varepsilon B$, where $\varepsilon$ is a small quantity. Given a coordinate system $\theta$, the torsion is represented by the torsion tensor, whose components are defined by
$$S_{ijk} = \langle S(\partial_i, \partial_j), \partial_k\rangle.$$
It is easy to show that $S(A, B)$ can be calculated by the use of $S_{ijk}$ in the component form. Since the partial derivatives $\partial_i$ and $\partial_j$ commute, $\partial_i\partial_j - \partial_j\partial_i = 0$, $S_{ijk}$ is obtained from $\Gamma_{ijk}$ by
$$S_{ijk}(\theta) = \Gamma_{ijk} - \Gamma_{jik}. \qquad (2.35)$$
This $S_{ijk}$ is a tensor antisymmetric with respect to $i$ and $j$.
Since the coefficients $\Gamma_{ijk}^{(\alpha)}(\theta)$ of the α-connection are symmetric with respect to the first two indices $i$ and $j$, as can be checked from the definition (2.27), the torsion tensor $S_{ijk}$ identically vanishes for any α-connection. Therefore, the manifold of a statistical model is torsion-free, and we will not hereafter mention the torsion.
The Riemann-Christoffel curvature $R$ is a trilinear mapping from $T(S) \times T(S) \times T(S)$ to $T(S)$ induced by the affine connection. Hence, it is a tensor of order four. For three vector fields $A, B, C \in T(S)$, the map is defined by
$$R(A, B, C) = [\nabla_A, \nabla_B]C - \nabla_{[A,B]}C,$$
where $[\ ,\ ]$ implies alternation, for example,
$$[\nabla_A, \nabla_B] = \nabla_A\nabla_B - \nabla_B\nabla_A, \qquad [A, B] = AB - BA.$$
Intuitively speaking, $\varepsilon^2 R(A, B, C)$ represents the change in the vector $C$ when $C$ is shifted in parallel along the parallelogram composed of two infinitesimally small vectors $\varepsilon A$ and $\varepsilon B$ in $S$. Given a coordinate system $\theta$, the curvature is represented by the components of a tensor,
$$R_{ijkm} = \langle R(\partial_i, \partial_j, \partial_k), \partial_m\rangle, \qquad (2.36)$$
called the Riemann-Christoffel curvature tensor. Conversely, $R(A, B, C)$ can be calculated from the components of $A$, $B$, $C$ and the curvature tensor.
By calculating (2.36), the Riemann-Christoffel curvature tensor $R_{ijkm}$ is obtained as
$$R_{ijkm} = (\partial_i\Gamma_{jk}^{\;s} - \partial_j\Gamma_{ik}^{\;s})g_{sm} + (\Gamma_{irm}\Gamma_{jk}^{\;r} - \Gamma_{jrm}\Gamma_{ik}^{\;r}). \qquad (2.37)$$
A space with an affine connection is said to be flat when the Riemann-Christoffel curvature vanishes identically, i.e., $R(A, B, C) = 0$ holds for any $A, B, C \in T(S)$, or equivalently $R_{ijkm}(\theta) = 0$ holds for any $\theta$. In a flat space, a vector $A \in T_\theta$ suffers no change when it is shifted in parallel along a loop passing through $\theta$. This implies that, when a vector $A \in T_\theta$ is shifted in parallel to the tangent space $T_{\theta'}$ at another point $\theta'$, the shifted vector $A' \in T_{\theta'}$ is uniquely determined, irrespective of which curve the vector is shifted along in parallel. Hence, we can construct a vector field $A$ such that $A(\theta)$ for any $\theta \in S$ is obtained from $A(\theta_0)$ at $\theta_0$ by its parallel shift. Such a vector field is said to be a parallel vector field.
It can be proved that, when and only when the Riemann-Christoffel curvature vanishes identically, i.e., $R_{ijkm} = 0$ at any point in $S$, there exists a coordinate system $\theta = (\theta^i)$ in a torsion-free manifold such that the coefficients of the connection vanish identically, $\Gamma_{ijk}(\theta) = 0$. In this case, $\nabla_{\partial_i}\partial_j = 0$, so that the basis vector fields $\partial_i$ are parallel vector fields, i.e., the $\partial_i \in T_\theta$ at any $\theta$ correspond to each other by the parallel shift. Such a coordinate system is said to be affine. All the affine coordinate systems are connected by affine transformations in a flat manifold $S$. The geodesic equation (2.24) is rewritten in an affine coordinate system as
$$\ddot\theta^k(t) = 0,$$
so that its general solution is given by the linear form
$$\theta^k(t) = ta^k + b^k,$$
where $a^k$ and $b^k$ are constants. In particular, the coordinate curve $\theta^i$ itself is a geodesic, because, for example, $(a^k) = (1, 0, 0, \ldots, 0)$ gives the first coordinate curve $\theta^1(t) = t$, $\theta^2(t) = \cdots = \theta^n(t) = 0$.
Obviously, there exist no affine coordinate systems in $S$ unless $S$ is curvature-free. However, for any point $\theta_0$, there exists a coordinate system such that the coefficients of the affine connection vanish at this one point $\theta_0$,
$$\Gamma_{ijk}(\theta_0) = 0, \qquad (2.38)$$
etc. hold. Such a system is called a normal coordinate system at $\theta_0$.


The manifold of a statistical model is equipped with the one-parameter family of affine connections. The Riemann-Christoffel curvature based on the α-connection is called the α-Riemann-Christoffel curvature, and its tensor is denoted by $R_{ijkm}^{(\alpha)}$. When it vanishes identically, the space is said to be α-flat. We will prove later that an α-flat manifold is −α-flat and vice versa. This is an example of the dualistic structure of a statistical manifold with the α-connection. An α-affine coordinate system is an affine coordinate system with respect to the α-connection. A coordinate system is said to be α-normal at $\theta_0$ when it is a normal coordinate system at $\theta_0$ with respect to the α-connection.
The notion of α-flatness is defined from the α-affine connection and has apparently nothing to do with the metric tensor. Hence, the metric tensor $g_{ij}(\theta)$ in general depends on $\theta$ even when an α-affine coordinate system is used in an α-flat manifold. This implies that an α-flat manifold is not necessarily Euclidean, even when it is torsion-free, because the α-connection is not in general metric. When and only when $S$ is 0-flat, $g_{ij}(\theta)$ becomes constant in a 0-affine coordinate system. The space $S$ is Euclidean in this case, and there exists an orthonormal Cartesian coordinate system (which is 0-affine) such that
$$g_{ij}(\theta) = \delta_{ij}$$
holds.

Example 2.7. The α-curvature of normal distributions. The α-Riemann-Christoffel curvature of the family $N(\mu, \sigma^2)$ can be calculated from the definition (2.36) after tedious calculations. In our case the relations
$$R_{ijmk}^{(\alpha)} = -R_{jimk}^{(\alpha)} = -R_{ijkm}^{(\alpha)}$$
hold, so that all the components are calculated from $R_{1212}^{(\alpha)}$. The result is
$$R_{1212}^{(\alpha)} = (1 - \alpha^2)/\sigma^4$$
in the coordinate system $(\mu, \sigma)$. The scalar curvature $K$ is defined by
$$K = \frac{1}{n(n-1)}R_{ijkm}g^{im}g^{jk}$$
and is a scalar taking the same value in any coordinate system. In our case, it is given by
$$K^{(\alpha)} = -(1 - \alpha^2)/2,$$
which happens to be independent of the point $\theta$ in $S$. Such a space is called a space of constant curvature. This result was first remarked by Amari in an unpublished report (1959) in the Riemannian case of $\alpha = 0$. When $\alpha = \pm 1$, the α-curvature vanishes so that $S$ is ±1-flat, and α-affine coordinate systems exist for $\alpha = \pm 1$. Indeed, one can show that the coordinate system $\xi = (\xi^1, \xi^2)$,
$$\xi^1 = \mu, \qquad \xi^2 = \mu^2 + \sigma^2,$$
is −1-affine, i.e., $\Gamma_{ijk}^{(-1)} = 0$ in this coordinate system. The coordinate system $\zeta$ given by
$$\zeta^1 = \mu/\sigma^2, \qquad \zeta^2 = -1/(2\sigma^2)$$
is 1-affine, i.e., $\Gamma_{ijk}^{(1)} = 0$ in this coordinate system. It will be demonstrated later that $S$ belongs to the exponential type family of distributions, which is ±1-flat. The 1-affine coordinate system $\zeta$ is called the natural or canonical parameter, while the −1-affine coordinate system $\xi$ is called the expectation parameter.
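To trace the numbers: for $n = 2$ the inverse metric is $g^{11} = \sigma^2$, $g^{22} = \sigma^2/2$, and by the antisymmetries above the only surviving terms of the defining sum are
$$K^{(\alpha)} = \frac{1}{2}\bigl(R^{(\alpha)}_{1221} + R^{(\alpha)}_{2112}\bigr)g^{11}g^{22} = -R^{(\alpha)}_{1212}\cdot\sigma^2\cdot\frac{\sigma^2}{2} = -\frac{1 - \alpha^2}{2}.$$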

2.7. Imbedding and submanifold


Let $M$ be a subset of an $n$-dimensional manifold $S$ such that $M$ itself has the structure of an $m$-dimensional manifold, where $m < n$. A submanifold is regular if its topology is the relative topology induced in $S$. We treat only regular submanifolds. Let $\theta = (\theta^i)$, $i = 1, \ldots, n$, be a coordinate system of $S$ and let $u = (u^a)$, $a = 1, \ldots, m$, be a coordinate system of $M$. Since $M$ is in $S$, a point $u \in M$ has coordinates $\theta = \theta(u)$ in $S$ which are determined by $u$. The equation $\theta = \theta(u)$ gives a parametric representation of $M$ in $S$. This can be regarded as an injection from $M$ to $S$. When the mapping $\theta = \theta(u)$ is smooth and the Jacobian matrix
$$B_a^i(u) = \frac{\partial\theta^i}{\partial u^a}, \qquad i = 1, \ldots, n;\ a = 1, \ldots, m, \qquad (2.39)$$
has full rank, i.e., rank $m$, for all $u \in M$, the mapping is called an imbedding of $M$ in $S$. Such an $M$ in $S$ is called a submanifold of $S$. We hereafter treat the image of $M$ by the imbedding. A point in $M$ can be represented by two types of coordinates, one in terms of $u = (u^a)$ and the other in terms of $\theta = [\theta^i(u)]$. Indices $a, b, c$, etc. are used to denote quantities of $M$ in the coordinate system $u$, while indices $i, j, k$, etc. are used to denote quantities of $S$ in the coordinate system $\theta$.
The tangent space $T_u(M)$ of $M$ at $u$ is a vector space spanned by $m$ vectors $\partial_a$, $a = 1, \ldots, m$, which are tangent to the coordinate curves of $u$, where $\partial_a$ is the abbreviation of the partial derivative $\partial_a = \partial/\partial u^a$ (Fig. 2.9). Since $M$ is imbedded in $S$, the tangent vector of a curve in $M$ is also tangent to the same curve in $S$. Therefore, the tangent space $T_u(M)$ is a subspace of the tangent space $T_{\theta(u)}(S)$ of $S$ at $\theta = \theta(u)$, which is denoted shortly by $T_u(S)$. Indeed, by restricting a function $f : S \to R$ to $M$, we have a function $f(u) = f[\theta(u)]$ from $M$ to $R$, and the directional derivative $\partial_a$ is obtained by
$$\partial_a f = B_a^i(u)\partial_i f.$$
This shows that $\partial_a$ is a linear combination of the $\partial_i$,
$$\partial_a = B_a^i(u)\partial_i, \qquad (2.40)$$
and the components of the vector $\partial_a \in T_u(S)$ with respect to the basis $\{\partial_i\}$ are given by $B_a^i$, $i = 1, \ldots, n$. The $\partial_a$, $a = 1, \ldots, m$, are linearly independent in $T_\theta(S)$ because the rank of $(B_a^i)$ is full, and hence they span the $m$-dimensional subspace $T_u(M)$ in $T_u(S)$.
Fig. 2.9

When $S$ is equipped with a Riemannian metric $g_{ij}$ and an affine connection $\Gamma_{ijk}$, the same kind of geometrical structures are naturally induced in $M$. The inner products of the basis vectors $\partial_a$ of $T_u(M)$ are given by considering them as vectors of $T_u(S)$,
$$g_{ab}(u) = \langle\partial_a, \partial_b\rangle = B_a^i B_b^j\langle\partial_i, \partial_j\rangle = B_a^i B_b^j g_{ij}(u), \qquad (2.41)$$
where $g_{ij}(u) = g_{ij}[\theta(u)]$. This gives the induced metric tensor of $M$.
We can also calculate the covariant derivative $\nabla_{\partial_a}\partial_b$ of the vector field $\partial_b$ (which is defined only on $M$) along $\partial_a$ in the enveloping manifold $S$ as follows,
$$\nabla_{\partial_a}\partial_b = \nabla_{\partial_a}(B_b^j\partial_j) = (\partial_a B_b^j)\partial_j + B_a^i B_b^j\nabla_{\partial_i}\partial_j = (B_a^i B_b^j\Gamma_{ij}^{\;k} + \partial_a B_b^k)\partial_k.$$
The resultant vector does not necessarily belong to $T_u(M)$, because the intrinsic change in the tangent vector $\partial_b$ may have a component orthogonal to $T_u(M)$. This component shows how $M$ is curved in $S$. The proper covariant derivative $\bar\nabla_{\partial_a}\partial_b$ to be defined in $M$ should be a vector belonging to $T_u(M)$. In order to define $\bar\nabla_{\partial_a}\partial_b$, we have only to project $\nabla_{\partial_a}\partial_b$ to $T_u(M)$ by discarding the component orthogonal to $T_u(M)$. The coefficients $\Gamma_{abc}$ of the induced affine connection in $M$ are given by
$$\Gamma_{abc}(u) = \langle\bar\nabla_{\partial_a}\partial_b, \partial_c\rangle = \langle\nabla_{\partial_a}\partial_b, \partial_c\rangle = (B_a^i B_b^j\Gamma_{ij}^{\;k} + \partial_a B_b^k)\langle\partial_k, B_c^m\partial_m\rangle = B_a^i B_b^j B_c^k\Gamma_{ijk} + (\partial_a B_b^k)B_c^m g_{km}. \qquad (2.42)$$
The manifold $M$ is in general curved in $S$. In order to describe the curvature of $M$ in $S$, let us choose $n - m$ vectors $\partial_\kappa$, $\kappa = m+1, m+2, \ldots, n$, in $T_u(S)$ such that 1) the set $\{\partial_a, \partial_\kappa\}$, $a = 1, \ldots, m$; $\kappa = m+1, \ldots, n$, forms a basis of $T_u(S)$ at every $u \in M$, 2) the $\partial_\kappa$ are orthogonal to $T_u(M)$, i.e.,
$$\langle\partial_a, \partial_\kappa\rangle = 0,$$
and 3) these $\partial_\kappa$ defined on $M$ form smooth vector fields in $T(S)$ (Fig. 2.10). The curvature of $M$ in $S$ is measured by the intrinsic changes in the directions of the tangent space $T_u(M)$ in $T_u(S)$ as the point $u$ moves in $M$. When there are no intrinsic changes in the directions of $T_u(M)$, the submanifold $M$ is flat in $S$. These intrinsic changes can be measured by the orthogonal components of the covariant derivatives $\nabla_{\partial_a}\partial_b$ in $S$ of the basis vectors $\partial_b$ along $\partial_a$. Hence, by the use of the orthogonal vectors $\partial_\kappa$, the curvature of $M$ in $S$ is defined by the following tensor,
$$H_{ab\kappa}(u) = \langle\nabla_{\partial_a}\partial_b, \partial_\kappa\rangle. \qquad (2.43)$$
This is called the Euler-Schouten curvature tensor or the imbedding curvature of $M$ in $S$. We can define more formally the imbedding curvature as the bilinear mapping $H$ from $T(M) \times T(M)$ to the orthogonal complement of $T(M)$ in $T(S)$ which assigns to $A, B \in T(M)$ the component $H(A, B)$ of the vector $\nabla_A B$ orthogonal to $T(M)$.
The Riemann-Christoffel curvature of $M$ is related to the Euler-Schouten imbedding curvature of $M$ in $S$ and the Riemann-Christoffel curvature of $S$. Their relation is known by the equation (see Vos [1987, 1989])
$$R_{abcd}^{(\alpha)} = B_a^i B_b^j B_c^k B_d^m R_{ijkm}^{(\alpha)} + g^{\kappa\lambda}\bigl(H_{ad\kappa}^{(\alpha)}H_{bc\lambda}^{(-\alpha)} - H_{ac\kappa}^{(\alpha)}H_{bd\lambda}^{(-\alpha)}\bigr), \qquad (2.44)$$
where $(g^{\kappa\lambda})$ is the inverse of the matrix $g_{\kappa\lambda}$ defined by
$$g_{\kappa\lambda} = \langle\partial_\kappa, \partial_\lambda\rangle.$$

Fig. 2.10
When the enveloping space $S$ is flat, i.e., $R_{ijkm} = 0$, the Riemann-Christoffel curvature $R_{abcd}$ is determined from the imbedding curvatures $H_{ab\kappa}^{(\alpha)}$ and $H_{ab\kappa}^{(-\alpha)}$ by
$$R_{abcd}^{(\alpha)} = \bigl(H_{ad\kappa}^{(\alpha)}H_{bc\lambda}^{(-\alpha)} - H_{ac\kappa}^{(\alpha)}H_{bd\lambda}^{(-\alpha)}\bigr)g^{\kappa\lambda}.$$
When $M$ is flat in a flat $S$, $H_{ab\kappa} = 0$ and $R_{ijkm} = 0$ hold, so that $M$ is itself flat, $R_{abcd} = 0$. However, the fact that $M$ is flat, $R_{abcd} = 0$, does not imply that $M$ is a flat submanifold of $S$, i.e., $H_{ab\kappa} = 0$. This can be explained by the example of the surface $M$ of a cylinder imbedded in the Euclidean 3-space $S$. This cylinder $M$ is obviously curved in $S$, having non-vanishing $H_{ab\kappa}$, because the tangent space $T_u(M)$ is changing in $S$. However, one can locally unroll $M$ in a Euclidean 2-space without destroying its local geometrical structures, such as distance. Hence, $M$ itself is a Euclidean 2-space with $R_{abcd} = 0$. The Riemann-Christoffel curvature is an intrinsic characteristic of $M$, not depending on how it is imbedded in $S$, while the Euler-Schouten curvature $H_{ab\kappa}$ represents the manner in which $M$ is imbedded in $S$. They are different geometrical characteristics, although they are related by (2.44).
A statistical model $M = \{q(x, u)\}$ parametrized by $u = (u^a)$ is a (curved) submodel of another statistical model $S = \{p(x, \theta)\}$ parametrized by $\theta = (\theta^i)$, when there exists an injection $\theta(u)$ such that
$$q(x, u) = p\{x, \theta(u)\}.$$
When the mapping $\theta(u)$ is smooth, having a full-rank Jacobian matrix, the statistical manifold $M$ is a smooth submanifold imbedded in $S$. In Part II, we treat the problem of statistical inference for a submanifold $M$ imbedded in an exponential family $S$. The geometrical structures, i.e., the Riemannian metric $g_{ab}$ and the α-connection $\Gamma_{abc}^{(\alpha)}$, are introduced in $M$ directly by
$$g_{ab} = \langle\partial_a, \partial_b\rangle = E[\partial_a\ell(x, u)\partial_b\ell(x, u)],$$
$$\Gamma_{abc}^{(\alpha)} = \langle\nabla_{\partial_a}^{(\alpha)}\partial_b, \partial_c\rangle = E\Bigl[\Bigl\{\partial_a\partial_b\ell(x, u) + \frac{1-\alpha}{2}\partial_a\ell\,\partial_b\ell\Bigr\}\partial_c\ell\Bigr],$$
without referring to the geometric structures of $S$, where $\nabla^{(\alpha)}$ is the α-covariant derivative in $M$ and $\ell(x, u) = \log q(x, u)$. However, they coincide with those induced from the geometrical structures of $S$ by imbedding $M$ in $S$. This can be shown by using the fact that the basis vector $\partial_a\ell(x, u)$ in the 1-representation $T_u^{(1)}(S)$ is given by
$$\partial_a\ell(x, u) = B_a^i(u)\partial_i\ell\{x, \theta(u)\}, \qquad (2.45)$$
where $\ell(x, \theta) = \log p(x, \theta)$. A submanifold $M$ is in general curved in the sense of the α-connection. The α-curvature $H_{ab\kappa}^{(\alpha)}$ of $M$ plays an important role in statistical inference, as will be shown later. Some examples will be given in the next section.

2.8. Family of ancillary submanifolds


In order to analyze geometrical properties of $M$ imbedded in $S$, it is convenient to introduce a new coordinate system $\xi = (\xi^\alpha)$ in the enveloping $S$ in the following manner. Let us attach to each point $u \in M$ an $(n-m)$-dimensional smooth submanifold $A(u)$ of $S$ which transverses $M$ at $\theta(u)$. Moreover, we assume that the family $A = \{A(u) \mid u \in M\}$ fills up $S$ smoothly (or it fills up at least a neighborhood of $M$ in $S$). This implies that, by introducing an adequate coordinate system $v = (v^\kappa)$, $\kappa = m+1, m+2, \ldots, n$, in each $A(u)$, a pair $(u, v)$ specifies uniquely a point in $S$. This point is in the submanifold $A(u)$ rigging $u \in M$ and has the coordinates $v$ in $A(u)$ (see Fig. 2.11, where $M$ is one-dimensional, $m = 1$, and $A(u)$ is two-dimensional, $n - m = 2$). Conversely, every point in $S$ (or at least in a neighborhood of $M$) can uniquely be specified by a pair $(u, v)$. More precisely, we say that a family $A = \{A(u) \mid u \in M\}$ is smooth when there exists a coordinate system $v$ in each $A(u)$ such that the pair $(u, v)$ forms an allowable coordinate system of $S$. Such an $A(u)$ is called an ancillary submanifold rigging $u$, and the family $A = \{A(u)\}$ is called a family of ancillary submanifolds, or shortly an ancillary family rigging $M$. The new coordinate system $(u, v)$ is denoted shortly by $\xi = (\xi^\alpha) = (u^a, v^\kappa)$, $\alpha = 1, \ldots, n$; $a = 1, \ldots, m$; $\kappa = m+1, \ldots, n$, implying that the first $m$ components of $\xi$ are $u$ and the latter $n - m$ components of $\xi$ are $v$,
$$\xi^1 = u^1, \quad \xi^2 = u^2, \quad \ldots, \quad \xi^m = u^m; \qquad \xi^{m+1} = v^{m+1}, \quad \ldots, \quad \xi^n = v^n.$$

Fig. 2.11

This $\xi$ gives the coordinate system associated with the ancillary family $A$. The coordinate transformation from $\xi$ to $\theta$ can be written in the form
$$\theta = \theta(\xi) = \theta(u, v). \qquad (2.46)$$
Since this is a diffeomorphism, the Jacobian matrix $B_\alpha^i = \partial\theta^i/\partial\xi^\alpha$ has full rank, i.e., is invertible. The Jacobian matrix can be decomposed into two parts,
$$B_a^i = \partial\theta^i/\partial u^a, \quad B_\kappa^i = \partial\theta^i/\partial v^\kappa, \qquad a = 1, \ldots, m;\ \kappa = m+1, \ldots, n, \qquad (2.47)$$
because the first part of the index $\alpha$ stands for $a$ and the second part stands for $\kappa$. Indices $\kappa, \lambda, \mu$, etc. are used to denote quantities related to the coordinate system $v$ in $A(u)$. It is convenient to fix the origin $v = 0$ of each $A(u)$ at the point $\theta(u)$, which is the intersection of $A(u)$ and $M$. Then, all the points of $M$ have $v = 0$ coordinates, so that the points of $M$ have the following θ-coordinates, $\theta = \theta(u) = \theta(u, 0)$. Hence, $\theta = \theta(u)$ is a parametric representation of $M$, and $v^\kappa = 0$ ($\kappa = m+1, \ldots, n$) is another representation of $M$.
A foliation of a manifold $S$ is a partitioning
$$S = \bigcup_{u \in M}A(u)$$
of the manifold $S$ into submanifolds $A(u)$ of dimension $n - m$. Hence, a foliation defines an ancillary family when $A(u)$ transverses $M$ at $u$. When $M$ is a regular submanifold of $S$, there always exists a neighborhood $U_M$ of $M$ such that an ancillary family $A = \{A(u)\}$ is defined in $U_M$,
$$U_M = \bigcup_{u \in M}A(u).$$
This $U_M$ is called a tubular neighborhood. An ancillary family is a local foliation of $S$, i.e., a foliation of $U_M$. Our $\xi = (\xi^\alpha) = (u^a, v^\kappa)$ is a local coordinate system in $U_M$.

The natural basis $\{\partial_\alpha\}$, $\partial_\alpha = \partial/\partial\xi^\alpha$, of $T_u(S)$ can be decomposed into two parts, $\{\partial_\alpha\} = \{\partial_a, \partial_\kappa\}$, where $\partial_\kappa$ is the abbreviation of
$$\partial_\kappa = \frac{\partial}{\partial v^\kappa}, \qquad \kappa = m+1, \ldots, n.$$
The vectors $\partial_a$ span the tangent space $T_u(M)$ at $\theta = \theta(u) \in M \subset S$, and the vectors $\partial_\kappa$ span the tangent space $T_u(A)$ at $\theta = \theta(u) \in M$. The tangent space $T_u(S)$ at $\theta = \theta(u)$ is decomposed into the direct sum of these two,
$$T_u(S) = T_u(M) \oplus T_u(A).$$
The basis vectors $\partial_a$ and $\partial_\kappa$ can be written as
$$\partial_a = B_a^i(u)\partial_i, \qquad \partial_\kappa = B_\kappa^i(u)\partial_i$$
in the form of linear combinations of the old basis $\{\partial_i\}$, where
$$B_a^i(u) = B_a^i(u, 0) = \partial\theta^i(u, 0)/\partial u^a, \qquad B_\kappa^i(u) = B_\kappa^i(u, 0) = \partial\theta^i(u, 0)/\partial v^\kappa.$$
When we evaluate a quantity $f(u, v)$ on $M$, i.e., at $v = 0$, we often denote it by $f(u)$ instead of $f(u, 0)$ for brevity's sake.
The geometric quantities of $S$ can be represented in terms of the new coordinate system $\xi$ as
$$g_{\alpha\beta} = B_\alpha^i B_\beta^j g_{ij}, \qquad \Gamma_{\alpha\beta\gamma} = B_\alpha^i B_\beta^j B_\gamma^k\Gamma_{ijk} + B_\gamma^j(\partial_\alpha B_\beta^i)g_{ij}.$$
The $v$-part of the metric,
$$g_{\kappa\lambda}(u) = B_\kappa^i B_\lambda^j g_{ij}, \qquad (2.48)$$
represents the induced metric of $A(u)$, and the $u$-part,
$$g_{ab}(u) = B_a^i B_b^j g_{ij}, \qquad (2.49)$$
at $\theta = \theta(u)$ represents the metric of $M$. The mixed part,
$$g_{a\kappa}(u) = \langle\partial_a, \partial_\kappa\rangle = B_a^i B_\kappa^j g_{ij}, \qquad (2.50)$$
at $\theta = \theta(u)$ represents the angles between $T_u(M)$ and $T_u(A)$. When
$$g_{a\kappa}(u) = 0 \qquad (2.51)$$
holds, $T_u(M)$ and $T_u(A)$ are the orthogonal complements of each other. When (2.51) holds, $A = \{A(u)\}$ is called an orthogonal ancillary family. This property is independent of the choice of the coordinate system $v$ in $A(u)$.
The $v$-part of the connection,
$$\Gamma_{\kappa\lambda\mu}(u) = \langle\nabla_{\partial_\kappa}\partial_\lambda, \partial_\mu\rangle = B_\kappa^i B_\lambda^j B_\mu^k\Gamma_{ijk} + B_\mu^j(\partial_\kappa B_\lambda^i)g_{ij}, \qquad (2.52)$$
is the induced affine connection of $A(u)$, while the $u$-part,
$$\Gamma_{abc}(u) = \langle\nabla_{\partial_a}\partial_b, \partial_c\rangle = B_a^i B_b^j B_c^k\Gamma_{ijk} + B_c^j(\partial_a B_b^i)g_{ij}, \qquad (2.53)$$
is the induced affine connection of $M$. When $A$ is an orthogonal ancillary family, the mixed part
$$\Gamma_{\kappa\lambda a}(u) = \langle\nabla_{\partial_\kappa}\partial_\lambda, \partial_a\rangle = H_{\kappa\lambda a}(u) \qquad (2.54)$$
gives the imbedding curvature of $A(u)$ in $S$ at $\theta = \theta(u)$, and
$$\Gamma_{ab\kappa}(u) = \langle\nabla_{\partial_a}\partial_b, \partial_\kappa\rangle = B_a^i B_b^j B_\kappa^k\Gamma_{ijk} + B_\kappa^j(\partial_a B_b^i)g_{ij} = H_{ab\kappa} \qquad (2.55)$$
gives the imbedding curvature of $M$ in $S$ at $\theta(u)$.
It is always possible to choose a coordinate system $v$ in each $A(u)$ such that $\{\partial_\kappa\}$ forms an orthonormal basis of $T_u(A)$ on $M$, i.e., at the points $v = 0$, so that
$$g_{\kappa\lambda}(u) = \delta_{\kappa\lambda},$$
and at the same time $v$ is the normal coordinate system of $A(u)$ at the points $v = 0$ on $M$, so that
$$\Gamma_{\kappa\lambda\mu}(u) = 0,$$
etc. hold. However, it should be remarked in the case of the manifold of a statistical model that the α-normal coordinate $v$ is not necessarily normal for another $\alpha'$ ($\alpha' \neq \alpha$).

Example 2.8. Submanifolds of the manifold of normal distributions. We consider three submanifolds $M_1$, $M_2$ and $M_3$ imbedded in the space $S$ of normal distributions $N(\mu, \sigma^2)$, where we use the coordinate system $\theta = (\mu, \sigma)$.
1. Model $M_1$ consisting of the normal distributions $N(u, 1)$, i.e., the mean $u$ is varying but the variance is fixed to 1,
$$q(x, u) = \frac{1}{\sqrt{2\pi}}\exp\Bigl\{-\frac{(x - u)^2}{2}\Bigr\}.$$
The imbedding $\theta = \theta(u)$ is given by $\theta^1(u) = u$, $\theta^2(u) = 1$, so that
$$B_a^i(u) = (1, 0), \qquad a = 1;\ i = 1, 2.$$
We attach to each point $u \in M_1$ a rigging ancillary submanifold $A(u)$ as is shown in Fig. 2.12, i.e., $A(u)$ consists of the normal distributions with fixed mean $u$ and varying $\sigma$.

Fig. 2.12

Let $v = \sigma - 1$ be the coordinate of a point $N(u, \sigma^2)$ in $A(u)$. Then, $(u, v)$ forms a coordinate system of $S$ with the coordinate transformation
$$\theta^1(u, v) = u, \qquad \theta^2(u, v) = v + 1,$$
and the Jacobian matrix
$$B_\alpha^i = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},$$
where $\xi = (u, v)$, and $\partial_\alpha B_\beta^i = 0$ holds. The metric tensor $g_{\alpha\beta}$ is given by
$$g_{\alpha\beta} = \begin{pmatrix}(1 + v)^{-2} & 0\\ 0 & 2(1 + v)^{-2}\end{pmatrix}.$$
This implies that the metric (Fisher information) of $M_1$ is $g_{ab}(u) = 1$, where we put $v = 0$, that the metric of $A(u)$ on $M_1$ is $g_{\kappa\lambda}(u) = 2$, and that $\partial_a$ and $\partial_\kappa$ are mutually orthogonal, $g_{a\kappa}(u) = 0$. Therefore, the ancillary family is orthogonal. The α-connection $\Gamma_{\alpha\beta\gamma}^{(\alpha)}(u)$ in the associated coordinate system $\xi = (u, v)$ is given by
$$\Gamma_{abc}^{(\alpha)}(u) = 0, \qquad \Gamma_{\kappa\lambda\mu}^{(\alpha)}(u) = -2(1 + 2\alpha)/(1 + v)^3,$$
$$\Gamma_{ab\kappa}^{(\alpha)}(u) = (1 - \alpha)/(1 + v)^3, \qquad \Gamma_{\kappa\lambda a}^{(\alpha)}(u) = 0,$$
$$\Gamma_{a\kappa\lambda}^{(\alpha)}(u) = 0, \qquad \Gamma_{a\kappa b}^{(\alpha)}(u) = -(1 + \alpha)/(1 + v)^3,$$
where $a, b, c$ stand for 1 and $\kappa, \lambda, \mu$ stand for 2. The above results show that the model $M_1$ is not an α-flat submanifold of $S$ except for the case $\alpha = 1$, because $H_{ab\kappa}^{(\alpha)}(u) = (1 - \alpha)/(1 + v)^3$.


The 1-dimensional model $M_1$ is not an α-geodesic ($\alpha \neq 1$), and in particular is not a 0-geodesic. Hence, the minimum length curve connecting two points $N(u_1, 1)$ and $N(u_2, 1)$ is not in $M_1$. However, the Riemann-Christoffel curvature $R_{abcd}^{(\alpha)}(u)$ vanishes identically, so that $M_1$ itself is α-flat, since every one-dimensional manifold is curvature-free. This is because there are no non-trivial loops in a one-dimensional manifold. The ancillary submanifolds $A(u)$ are α-flat in $S$ for any $\alpha$. The coordinate $u$ of $M_1$ is α-affine for any $\alpha$, while the coordinate $v$ of $A(u)$ is not α-affine unless $\alpha = -1/2$.
2. Model $M_2$ consisting of the normal distributions $N(0, u^2)$, i.e., the mean is fixed to 0 but the variance is varying,
$$q(x, u) = \frac{1}{\sqrt{2\pi}u}\exp\Bigl\{-\frac{x^2}{2u^2}\Bigr\}.$$
The imbedding $\theta = \theta(u)$ is given in this case by
$$\theta^1(u) = 0, \qquad \theta^2(u) = u.$$
Let us attach to each point $u \in M_2$ a rigging ancillary submanifold $A(u)$ consisting of $N(v, u^2)$, i.e., the normal distributions with varying mean $v$ and the fixed variance $u^2$, as is shown in Fig. 2.13.

Fig. 2.13

Then, $(u, v)$ forms a coordinate system of $S$ with the coordinate transformation
$$\theta^1(u, v) = v, \qquad \theta^2(u, v) = u.$$
The Jacobian matrix is
$$B_\alpha^i = \begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix},$$
and $\partial_\alpha B_\beta^i = 0$ also holds. By calculating the metric tensor $g_{\alpha\beta}(u)$, we have
$$g_{ab}(u) = 2/u^2, \qquad g_{\kappa\lambda}(u) = 1/u^2, \qquad g_{a\lambda}(u) = 0,$$
so that the ancillary family $A$ is an orthogonal family. The α-connection is given by
$$\Gamma_{abc}^{(\alpha)}(u) = -2(1 + 2\alpha)/u^3, \qquad \Gamma_{\kappa\lambda\mu}^{(\alpha)}(u) = 0,$$
$$\Gamma_{ab\kappa}^{(\alpha)}(u) = 0, \qquad \Gamma_{\kappa\lambda a}^{(\alpha)}(u) = (1 - \alpha)/u^3,$$
$$\Gamma_{a\kappa b}^{(\alpha)}(u) = 0, \qquad \Gamma_{a\kappa\lambda}^{(\alpha)}(u) = -(1 + \alpha)/u^3.$$
Hence, $M_2$ is an α-flat submanifold of $S$ for any $\alpha$, because $H_{ab\kappa}^{(\alpha)}(u) = 0$, while $A(u)$ is not an α-flat submanifold of $S$ for $\alpha \neq 1$. The parameter $u$ is not an affine coordinate system of $M_2$ if $\alpha \neq -1/2$.

3. Model $M_3$ consisting of the normal distributions $N(u, u^2)$, $u > 0$,
$$q(x, u) = \frac{1}{\sqrt{2\pi}u}\exp\Bigl\{-\frac{(x - u)^2}{2u^2}\Bigr\}.$$
Such a family of distributions is derived when the random variable $x$ is given by $x = uz$, where $u > 0$ is the unknown multiplicative parameter and $z$ is a random variable subject to $N(1, 1)$. Since $u > 0$, we consider the positive half part $\{\mu > 0, \sigma > 0\}$ of $S$. The model $M_3$ is imbedded in $S$ by
$$\theta^1(u) = u, \qquad \theta^2(u) = u.$$
Since $B_a^i = \partial\theta^i/\partial u = 1$ for $i = 1, 2$, the tangent vector $\partial_a$ of the model is given by
$$\partial_a = \partial_1 + \partial_2,$$
where $\partial_1$ and $\partial_2$ denote $\partial_i$, $i = 1, 2$. In order to obtain an orthogonal ancillary family, we look for a vector field $\partial_\kappa$ on $M_3$ which is orthogonal to $\partial_a$. We can easily check that the vector $\partial_\kappa = 2\partial_1 - \partial_2$ is orthogonal to $\partial_a$. Hence, we attach to each $u \in M$ the following $A(u)$,
$$A(u) = \{(\theta^1, \theta^2) \mid 2(\theta^2 - u) + (\theta^1 - u) = 0\},$$
which is a straight line in the $(\theta^1, \theta^2)$-plane (Fig. 2.14).

Fig. 2.14

By introducing a coordinate $v$ in every $A(u)$, we have the following coordinate transformation $\theta(\xi)$,
$$\theta^1(u, v) = u - 2v, \qquad \theta^2(u, v) = u + v,$$
so that
$$B_\alpha^i(u) = \begin{pmatrix}1 & -2\\ 1 & 1\end{pmatrix},$$
and $\partial_\alpha B_\beta^i = 0$ on $M_3$. The metric tensor $g_{\alpha\beta}(u)$ is decomposed into the following three parts,
$$g_{ab}(u) = 3/u^2, \qquad g_{a\kappa}(u) = 0, \qquad g_{\kappa\lambda}(u) = 6/u^2,$$
ascertaining that $A$ is an orthogonal family. By calculating the α-connection, we obtain some of the components as
$$\Gamma_{abc}^{(\alpha)}(u) = -(7\alpha + 3)/u^3, \qquad \Gamma_{ab\kappa}^{(\alpha)}(u) = (3 - \alpha)/u^3, \qquad \Gamma_{\kappa\lambda a}^{(\alpha)}(u) = 2(3 - 2\alpha)/u^3.$$
This shows that $u$ is not an affine coordinate in $M_3$ unless $\alpha = -3/7$, and that $M_3$ is curved in $S$, having the α-imbedding curvature $-(\alpha - 3)/u^3$. Hence, $M_3$ is flat in $S$ when $\alpha = 3$. This implies that $M_3$ is not a Riemannian geodesic in $S$. The ancillary submanifold $A(u)$ is also curved unless $\alpha = 3/2$.
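The decomposition of $g_{\alpha\beta}(u)$ above can be checked numerically from (2.41) and the Jacobian matrix; here is a small NumPy sketch (illustrative only, with $u = 1.5$ chosen arbitrarily):

    import numpy as np

    def metric(theta):
        # Fisher metric of N(μ, σ) in the coordinates θ = (μ, σ)
        mu, sg = theta
        return np.diag([1.0 / sg**2, 2.0 / sg**2])

    # Jacobian of θ(u, v) = (u - 2v, u + v): columns are B_a^i and B_κ^i
    B = np.array([[1.0, -2.0],
                  [1.0,  1.0]])

    u = 1.5
    g = B.T @ metric([u, u]) @ B    # g_αβ = B_α^i B_β^j g_ij at v = 0
    print(g)                        # expect diag(3/u², 6/u²), zero off-diagonal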

2.9. Notes
It was in the 1940's that the possibility of constructing a differential-geometrical theory of statistical models was first remarked. Rao [1945] showed that a statistical model forms a Riemannian manifold, where the Fisher information matrix plays the role of the metric tensor $g_{ij}$. He calculated the geodesic distances in some statistical models. The idea of the non-informative prior distribution by Jeffreys [1946] seems to have been proposed from the Riemannian point of view. Since then a number of researchers have tried to elucidate the statistical meaning of the Riemannian structure. We can mention, among others, unpublished works by Amari in 1959 and by Moriguti in 1960, which had some influence on the studies that followed by Yoshizawa [1971a, b], Takiyama [1974], Sato et al. [1979], Ozeki [1977] and Ingarden et al. [1979] in Japan. The Riemannian approach was also adopted by Atkinson and Mitchell [1981], Kass [1981] and Skovgaard [1981]. However, the statistical implications of Riemannian structures, especially of the Riemann-Christoffel curvature, are still not clear, except for the fact that there are no covariant stabilizing transformations unless the Riemann-Christoffel curvature vanishes. Here, a covariant stabilizing transformation is a transformation of the parameter (or reparametrization) of a statistical model such that the Fisher information matrix $g_{ij}(\theta)$ reduces to the unit matrix $\delta_{ij}$ for all $\theta$.
A new development was made by the Russian mathematician Chentsov, whose results are summarized in a Russian book (Chentsov [1972], translated into English in 1982). He introduced the α-affine connections in a statistical manifold and elucidated the structure of the exponential family by using the exponential (i.e., $\alpha = 1$) connection. However, statistical curvatures were not treated there. It is unfortunate that the higher-order asymptotic theory of statistical inference, to which the geometrical structures are most closely related, was not well developed at that time. It was Efron [1975] who first noticed the importance of the curvature of a statistical model in the higher-order asymptotic theory of estimation. He defined the statistical scalar curvature of a one-dimensional statistical model, and clarified its role in statistical inference. Dawid [1975] pointed out further possibilities of developing Efron's rather intuitive but fertile idea. Reeds [1975] suggested the multi-dimensional generalization, and Madsen [1979] generalized Efron's idea to the multi-parameter case, giving a differential-geometrical background to it.
Amari [1980, 1982a] defined the α-connections, which are proved to be the same as those Chentsov had already introduced, and constructed a general differential-geometrical framework for the higher-order asymptotic theory of statistical inference. Here, not only the exponential curvature but also the mixture (i.e., $\alpha = -1$) curvature is shown to play a proper important role. The theory has been developed further in Amari [1982b], Nagaoka and Amari [1982], Amari and Kumon [1983] and Kumon and Amari [1983], and is still developing (see also Amari [1983a], Amari [1983b], Eguchi [1983], Kass [1984], Tsai [1983], Amari [1984a], Lauritzen [1984], Barndorff-Nielsen [1984]). Recently, Pfanzagl [1982] has also proposed a general geometrical theory of statistical inference from another point of view. However, it is possible to understand the theory from the differential-geometrical viewpoint.
3. α-DIVERGENCE AND α-PROJECTION IN STATISTICAL MANIFOLD

The present chapter treats more fundamental structures underlying the differential geometry of statistical manifolds. A pair of dual affine connections, together with a Riemannian metric, plays a fundamental role in the present theory. Such a structure has not so far been noticed in the literature of differential geometry. The dualistic structure of the geometry is elucidated by using the α-representation of the tangent space. An α-flat manifold will turn out to be an interesting generalization of the Euclidean space, admitting the Pythagorean relation with respect to the α-divergence of two points.

3.1. α-representation
We have used the logarithm $\ell(x, \theta)$ of the density function $p(x, \theta)$ to define the fundamental geometric structures in the previous chapter. However, there exist more convenient representations in which the meaning of the α-connections can be understood more directly. Let $F_\alpha(p)$ be a one-parameter family of functions defined by
$$F_\alpha(p) = \begin{cases}\dfrac{2}{1-\alpha}\,p^{(1-\alpha)/2}, & \alpha \neq 1,\\[2mm] \log p, & \alpha = 1.\end{cases} \qquad (3.1)$$
The derivative $F'_\alpha(p) = p^{-(1+\alpha)/2}$ is a homogeneous function in $p$ of degree $-(1 + \alpha)/2$. We call
$$\ell_\alpha(x, \theta) = F_\alpha\{p(x, \theta)\} \qquad (3.2)$$
the α-representation of the density function $p(x, \theta)$. The 1-representation $\ell_1(x, \theta)$ is the logarithm $\ell(x, \theta)$, while the −1-representation $\ell_{-1}(x, \theta)$ is the density function $p(x, \theta)$ itself.

Let $T_\theta^{(\alpha)}$ be the vector space spanned by the $n$ linearly independent functions $\partial_i\ell_\alpha(x, \theta)$ in $x$, $i = 1, \ldots, n$,
$$T_\theta^{(\alpha)} = \{A(x) \mid A(x) = A^i\partial_i\ell_\alpha(x, \theta)\}. \qquad (3.3)$$
We have the natural isomorphism $T_\theta \cong T_\theta^{(\alpha)}$, where $\partial_i \in T_\theta$ corresponds to $\partial_i\ell_\alpha \in T_\theta^{(\alpha)}$. The vector space $T_\theta^{(\alpha)}$ is called the α-representation of the tangent space $T_\theta$. The α-representation of a vector $A = A^i\partial_i \in T_\theta$ is the random variable
$$A_\alpha(x) = A\ell_\alpha = A^i\partial_i\ell_\alpha(x, \theta).$$
We next look for the expression in $T_\theta^{(\alpha)}$ of the inner product of two vectors. To this end, we introduce the α-expectation of a random variable $f(x)$ by
$$E_\alpha[f(x)] = \int\{p(x, \theta)\}^\alpha f(x)dP. \qquad (3.4)$$
From the relation
$$\partial_i\ell_\alpha(x, \theta) = \{p(x, \theta)\}^{(1-\alpha)/2}\partial_i\ell(x, \theta) \qquad (3.5)$$
or, for $A = A^i\partial_i \in T_\theta$,
$$A\ell_\alpha(x, \theta) = \{p(x, \theta)\}^{(1-\alpha)/2}A\ell(x, \theta),$$
it is easy to see that the inner product of two vectors $A = A^i\partial_i$ and $B = B^i\partial_i$ has the following expression in the α-representation $T_\theta^{(\alpha)}$,
$$\langle A, B\rangle = E_\alpha[(A\ell_\alpha)(B\ell_\alpha)]. \qquad (3.6)$$

Moreover, by virtue of the following relation,
$$(\partial_i\ell_\alpha)(\partial_j\ell_{-\alpha}) = p(x, \theta)(\partial_i\ell)(\partial_j\ell),$$
the inner product has the following dualistic expression for any $\alpha$,
$$\langle A, B\rangle = \int\{A\ell_\alpha(x, \theta)\}\{B\ell_{-\alpha}(x, \theta)\}dP. \qquad (3.7)$$
This shows that the two vector spaces $T_\theta^{(\alpha)}$ and $T_\theta^{(-\alpha)}$ are dually coupled: that is, the inner product of two vectors $A$, $B$ is given by the integration of the product of their α- and −α-representations.
By differentiating (3.5) again, it follows that
$$\partial_i\partial_j\ell_\alpha(x, \theta) = \{p(x, \theta)\}^{(1-\alpha)/2}\Bigl(\partial_i\partial_j\ell + \frac{1-\alpha}{2}\partial_i\ell\,\partial_j\ell\Bigr).$$
Hence, the components $\Gamma_{ijk}^{(\alpha)}$ of the α-connection defined by (2.27) can be written as
$$\Gamma_{ijk}^{(\alpha)}(\theta) = E_\alpha[(\partial_i\partial_j\ell_\alpha)(\partial_k\ell_\alpha)] = \int(\partial_i\partial_j\ell_\alpha)(\partial_k\ell_{-\alpha})dP \qquad (3.8)$$
by using the ±α-representations. Similarly, for vector fields $A = A^i(\theta)\partial_i$, $B = B^j(\theta)\partial_j$, we have
$$AB\ell_\alpha(x, \theta) = \{p(x, \theta)\}^{(1-\alpha)/2}\Bigl(AB\ell + \frac{1-\alpha}{2}A\ell\,B\ell\Bigr).$$
Hence, for three vector fields $A$, $B$, $C$, the α-covariant derivative can be written as
$$\langle\nabla_A^{(\alpha)}B, C\rangle = E_\alpha[(AB\ell_\alpha)(C\ell_\alpha)] = \int\{AB\ell_\alpha(x, \theta)\}\{C\ell_{-\alpha}(x, \theta)\}dP. \qquad (3.9)$$
This shows that the α-covariant derivative $\nabla_A^{(\alpha)}B$ is given by projecting $AB\ell_\alpha$ to $T_\theta^{(\alpha)}$ in the α-representation. Therefore, $T_\theta^{(\alpha)}$ provides a natural frame for representing the α-covariant derivative, and is convenient for studying the properties of the α-connection.

Remarks. Invariance of the geometrical structures. It is obvious from the definition that the above geometrical structures do not depend on the manner of parametrization $\theta$ of the family $S = \{p(x, \theta)\}$. They are also invariant under a one-to-one transformation of the random variable $x$ to $y$, $y = f(x)$. This may be regarded as a coordinate transformation of the sample space $X$. The transformation produces another family $S' = \{q(y, \theta)\}$, parametrized by $\theta$, of distributions of the random variable $y$, where the density function $q(y, \theta)$ is induced from $p(x, \theta)$ by $p(x, \theta)dx = q(y, \theta)dy$ or
$$q(y, \theta) = p\{f^{-1}(y), \theta\}J^{-1}(y),$$
with the Jacobian determinant $J = \det|\partial f/\partial x|$, which does not depend on $\theta$. Hence, from
$$\partial_i\log q(y, \theta) = \partial_i\log p\{f^{-1}(y), \theta\}$$
follows
$$\langle\partial_i, \partial_j\rangle = E[(\partial_i\log p)(\partial_j\log p)] = E[(\partial_i\log q)(\partial_j\log q)].$$
This shows that the metric is invariant under the transformation of $x$ to $y$. Similarly, the same α-connection is derived by using the distributions $q(y, \theta)$ instead of $p(x, \theta)$. Hence, the geometrical structures are invariant under transformations of $x$. In other words, the geometry of statistical distributions does not depend on the manner of representing the random variable $x$.


The α-covariant derivative is naturally defined in the α-representation based on the function $F_\alpha$. This raises the possibility of introducing another geometrical structure by using another function $F$ instead of $F_\alpha$. To answer this problem, let us try to introduce another inner product $\langle A, B\rangle'$ of $A, B \in T_\theta$ in the representation $F = F\{p(x, \theta)\}$ by
$$\langle A, B\rangle' = E'[(AF)(BF)]$$
and another covariant derivative $\nabla'$ by
$$\langle\nabla'_A B, C\rangle' = E'[(ABF)(CF)],$$
where $E'$ represents a kind of expectation operator, e.g.,
$$E'[f(x)] = \int G\{p(x, \theta)\}f(x)dP$$
for some function $G$. If we impose the condition that the above definitions should be invariant under (coordinate) transformations of the sample space $X$, it can be proved that the derivative $F'(p)$ must be a homogeneous function in $p$. Hence, we are naturally led to the class of functions $F_\alpha$ defined in (3.1). No other definition can produce invariant structures. This explains the uniqueness of the geometrical structures introduced in $S$.
We can use the functions
$$F_\alpha(p) = \begin{cases}\dfrac{2}{1-\alpha}\{p^{(1-\alpha)/2} - C_\alpha(x)\}, & \alpha \neq 1,\\[2mm] \log p, & \alpha = 1,\end{cases}$$
instead of (3.1) to define the same geometrical structures, where $C_\alpha(x)$ is a function in $x$. When $\lim_{\alpha\to 1}C_\alpha(x) = 1$ holds, $F_\alpha(p)$ is continuous with respect to $\alpha$, because
$$\lim_{\alpha\to 1}\frac{2}{1-\alpha}\{p^{(1-\alpha)/2} - 1\} = \log p.$$
The definition (3.1), where we put $C_\alpha(x) = 0$, is used only for brevity's sake.
3.2. Dual affine connections

Let $c : \theta(t)$ be a smooth curve in a statistical manifold $S$. The tangent vector of the curve is given by $\dot\theta(t) = \dot\theta^i(t)\partial_i$ at the point $\theta(t)$ on the curve. Let $B(t)$ be a vector field defined on the curve. The intrinsic change in the vector $B(t)$ along the curve is measured by the covariant derivative $\nabla_{\dot\theta}B(t)$,
$$\nabla_{\dot\theta}B(t) = (\dot B^i + \Gamma_{jk}^{\;i}B^k\dot\theta^j)\partial_i.$$
When $\nabla_{\dot\theta}B(t) = 0$ is satisfied, there is no intrinsic change in $B(t)$ along the curve. In this case, the vector $B(t')$ is said to be the parallel displacement of the vector $B(t)$ from $\theta(t)$ to $\theta(t')$ along the curve. The parallel displacement defines a mapping $\Pi_c$ which maps the tangent space $T_\theta$ at a point $\theta = \theta(t)$ to the tangent space $T_{\theta'}$ at another point $\theta' = \theta(t')$ along the curve $c$,
$$\Pi_c : T_\theta \to T_{\theta'}, \qquad \Pi_c B(t) = B(t').$$
The parallel displacement based on the α-connection is denoted by $\Pi_c^{(\alpha)}$. The parallel displacement does not necessarily preserve the metric structure of the tangent space. For two vectors $A, B \in T_\theta$ and their parallel displacements $\Pi_c A, \Pi_c B \in T_{\theta'}$, where $c$ is a curve connecting $\theta$ and $\theta'$, their inner products are not necessarily the same, i.e.,
$$\langle A, B\rangle_\theta = \langle\Pi_c A, \Pi_c B\rangle_{\theta'}$$
does not necessarily hold, where $\langle A, B\rangle_\theta$ is the inner product in $T_\theta$. An affine connection or the related covariant derivative is said to be metric when the metric (inner product) is preserved by the parallel displacement, and is otherwise said to be non-metric.
For a non-metric covariant derivative $\nabla$, there might exist another covariant derivative $\nabla^*$ such that the pair $(\nabla, \nabla^*)$ satisfies
$$\langle A, B\rangle_\theta = \langle\Pi_c A, \Pi_c^* B\rangle_{\theta'}, \qquad (3.10)$$
i.e., the inner product is preserved by the parallel displacement $\Pi_c$ of one vector and by the parallel displacement $\Pi_c^*$ of the other. Such a pair of covariant derivatives $\nabla$ and $\nabla^*$ are said to be mutually dual. When $\nabla$ is metric, it is self-dual, i.e., it is dual to itself.
It is convenient to define the dual connections in the following differential manner.

Definition 3.1. Two covariant derivatives $\nabla$ and $\nabla^*$, or the related affine connections, are said to be dual with respect to the metric when
$$A\langle B, C\rangle = \langle\nabla_A B, C\rangle + \langle B, \nabla_A^* C\rangle \qquad (3.11)$$
holds for any vector fields $A$, $B$, $C$.

Let $\Gamma_{ijk}$ and $\Gamma^*_{ijk}$ be the components of the dual affine connections. Then, by substituting $A = \partial_i$, $B = \partial_j$, $C = \partial_k$ in (3.11), we have the following relation connecting the components of the dual affine connections,
$$\partial_i g_{jk} = \Gamma_{ijk} + \Gamma^*_{ikj}. \qquad (3.12)$$
This shows that every affine connection has a unique dual, determined by
$$\Gamma^*_{ijk} = \partial_i g_{jk} - \Gamma_{ikj}.$$
The dual of the dual is the primal one, $\nabla^{**} = \nabla$.
It is easy to prove that (3.10) holds for two dual connections. Let $A(t)$ and $B(t)$ be the vector fields obtained by the parallel displacements of two vectors $A, B \in T_\theta$ along a curve $c$ with respect to the dual connections $\nabla$ and $\nabla^*$, respectively,
$$A(t) = \Pi_c A, \qquad B(t) = \Pi_c^* B.$$
Because $A(t)$ and $B(t)$ are parallel fields,
$$\nabla_{\dot\theta}A(t) = 0, \qquad \nabla^*_{\dot\theta}B(t) = 0,$$
and hence
$$\dot\theta\langle A(t), B(t)\rangle = \langle\nabla_{\dot\theta}A, B\rangle + \langle A, \nabla^*_{\dot\theta}B\rangle = 0$$
holds, proving (3.10).
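For a concrete instance of (3.12), take the normal family of Example 2.5 with $\Gamma^* = \Gamma^{(-\alpha)}$ (anticipating Theorem 3.1 below) and $i = j = k = 2$: with $g_{22} = 2/\sigma^2$,
$$\partial_2 g_{22} = -4/\sigma^3, \qquad \Gamma^{(\alpha)}_{222} + \Gamma^{(-\alpha)}_{222} = -\frac{2(1+2\alpha)}{\sigma^3} - \frac{2(1-2\alpha)}{\sigma^3} = -\frac{4}{\sigma^3},$$
so the relation is satisfied.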
Now we can prove the following theorem, which elucidates the dualistic structures of the differential geometry of statistical manifolds.

Theorem 3.1. The α- and −α-connections $\nabla^{(\alpha)}$ and $\nabla^{(-\alpha)}$ are mutually dual. In particular, the 0-connection is self-dual and hence is metric.

Proof. By the use of the α-representation, we have
$$A\langle B, C\rangle = A\int(B\ell_\alpha)(C\ell_{-\alpha})dP = \int(AB\ell_\alpha)(C\ell_{-\alpha})dP + \int(B\ell_\alpha)(AC\ell_{-\alpha})dP = \langle\nabla_A^{(\alpha)}B, C\rangle + \langle B, \nabla_A^{(-\alpha)}C\rangle,$$
which proves the theorem.

Let $c$ be a loop passing through a point $\theta$, i.e., a curve $\theta(t)$ satisfying $\theta = \theta(t_0) = \theta(t_1)$. For a vector $A \in T_\theta$, we can define its parallel displacement $\Pi_c A \in T_\theta$ encircling the loop. A manifold $S$ is said to be flat when no vector is changed by the parallel displacement along any loop, i.e., when $\Pi_c A = A$ holds for any $A$ and $c$. The Riemann-Christoffel curvature $R$ vanishes identically in this case. From this fact, we can prove the following theorem.

Theorem 3.2. When $S$ is flat with respect to $\nabla$, it is also flat with respect to its dual $\nabla^*$. Especially, an α-flat manifold is −α-flat.

Proof. For the parallel displacement along a loop $c$,
$$\langle\Pi_c A, \Pi_c^* B\rangle = \langle A, B\rangle$$
holds for any $A, B \in T_\theta$. When $S$ is flat with respect to $\nabla$, we have $\Pi_c A = A$, so that
$$\langle A, \Pi_c^* B\rangle = \langle A, B\rangle$$
holds for any $A$, $B$. This implies $\Pi_c^* B = B$, proving that $S$ is flat with respect to $\nabla^*$, too.
The theorem shows that there is a close relation between the Riemann-Christoffel curvatures $R$ and $R^*$ of dual affine connections $\nabla$ and $\nabla^*$. For a loop $c$, let $c^{-1}$ be the inverse loop encircling $c$ in the reverse order. It is easy to show
$$\Pi_{c^{-1}} = (\Pi_c)^{-1}.$$
Hence, we have
$$\langle\Pi_c C, D\rangle = \langle\Pi_{c^{-1}}\Pi_c C, \Pi^*_{c^{-1}}D\rangle = \langle C, \Pi^*_{c^{-1}}D\rangle.$$
Since $c^{-1}$ is the reverse loop of $c$, for any vector fields $A$, $B$, $C$, $D$,
$$R(A, B, C, D) = R^*(B, A, D, C) = -R^*(A, B, D, C),$$
or in the component form,
$$R_{ijkm} = -R^*_{ijmk}.$$
Therefore, in the statistical manifold,
$$R_{ijkm}^{(\alpha)} = -R_{ijmk}^{(-\alpha)}.$$
This relation was proved by Lauritzen (1984).

3.3. α-family of distributions

The exponential family of distributions is well known to play an important role in constructing the theory of statistical inference (see Part II). The mixture family of distributions is also widely used. We define the α-family of distributions by extending these two well-known families.

Definition 3.2. A family S = {p(x, θ)} of distributions is said to be an α-family, when their α-representations can be written as

ℓ_α(x, θ) = θ^i c_i(x) + k(θ) c_{n+1}(x)    (3.13)

by choosing an adequate parametrization θ = (θ^1, ..., θ^n), where the c_i(x) are fixed random variables and k(θ) is determined from the normalization condition ∫ p(x, θ) dP = 1.

By introducing the (n+1)st coordinate θ̃^{n+1}, which is not free but subject to θ̃^{n+1} = k(θ), the α-representation of the α-family is written as

ℓ_α(x, θ̃) = θ̃^i c_i(x) ,    (3.14)

where θ̃ = (θ̃^1, ..., θ̃^{n+1}) is the newly introduced coordinate system. Here, the index i runs from 1 to n+1, and θ̃^i = θ^i for i = 1, ..., n and θ̃^{n+1} = k(θ). We call θ̃ the natural or canonical homogeneous coordinate system of the α-family. Note that only n among the n+1 components of θ̃ are independent. The first n coordinates θ = (θ^1, ..., θ^n) can be adopted as an ordinary coordinate system.
The 1-family (α = 1) is written in the form

ℓ(x, θ̃) = θ^i c_i(x) + θ̃^{n+1} c_{n+1}(x) .

When c_{n+1}(x) = 1 holds, by putting θ̃^{n+1} = k(θ) = − ψ(θ),

p(x, θ) = exp{θ^i c_i(x) − ψ(θ)} ,    (3.15)

which is the standard form of the well-known exponential family with respect to the measure P(x) of the sample space X. Here, θ is called the natural or canonical parameter and ψ(θ) is the normalization factor to be determined from

exp{ψ(θ)} = ∫ exp{θ^i c_i(x)} dP .
The function ψ(θ) is related to the cumulant generating function. Indeed, the characteristic function of the random variables c_i(x) is given by

E_θ[exp{i s^j c_j(x)}] = ∫ exp{(i s^j + θ^j) c_j(x) − ψ(θ)} dP
                       = exp{ψ(is + θ) − ψ(θ)} ,

where i is the imaginary unit and s = (s^j). This shows that exp{ψ(s + θ) − ψ(θ)} is the moment generating function,

E[c_{i_1}(x) ⋯ c_{i_p}(x)] = ∂^p/(∂s^{i_1} ⋯ ∂s^{i_p}) exp{ψ(s + θ) − ψ(θ)} |_{s=0} .

Hence, ψ(s + θ) − ψ(θ) or ψ(s + θ) itself is the cumulant generating function of the c_i(x) with respect to the distribution p(x, θ).
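As a minimal numerical sketch of these cumulant relations, consider the Bernoulli family with c(x) = x ∈ {0, 1} and normalization factor ψ(θ) = log(1 + e^θ); the helper names and the finite-difference scheme below are illustrative choices, assuming NumPy.

```python
import numpy as np

# Illustrative sketch: Bernoulli family p(x, theta) = exp{theta*x - psi(theta)},
# x in {0, 1}, with normalization factor psi(theta) = log(1 + e^theta).
# Derivatives of psi recover the cumulants of c(x) = x.
def psi(theta):
    return np.log1p(np.exp(theta))

theta, h = 0.3, 1e-5
mean_fd = (psi(theta + h) - psi(theta - h)) / (2 * h)                # first cumulant E[x]
var_fd = (psi(theta + h) - 2 * psi(theta) + psi(theta - h)) / h**2   # second cumulant var[x]

p = np.exp(theta) / (1 + np.exp(theta))
print(mean_fd, p)            # both approximately 0.5744
print(var_fd, p * (1 - p))   # both approximately 0.2445
```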

The −1-family (α = −1) can be written as

p(x, θ̃) = θ̃^i c_i(x) + θ̃^{n+1} c_{n+1}(x) .    (3.16)

When all the c_i's satisfy c_i(x) > 0 and ∫ c_i(x) dP = 1, this is the well-known mixture family, mixing the n+1 distributions c_i(x) with weights θ̃^i, where

Σ_{i=1}^{n+1} θ̃^i = 1 ,  0 < θ̃^i < 1 .

Example 3.1. Discrete or multinomial distributions.

Let S_n = {p(x, ξ)} be the set of all the distributions over a finite number of atoms x = 1, 2, ..., n+1, with weights ξ^i. The density function p(x, ξ) can be written as

p(x, ξ) = Σ_{i=1}^{n+1} ξ^i δ_i(x) ,  Σ_{i=1}^{n+1} ξ^i = 1 ,

where δ_i(x) = δ(x − i), or δ_i(x) = 1 for x = i and δ_i(x) = 0 for x ≠ i. By virtue of the relations

{Σ ξ^i δ_i(x)}^{(1−α)/2} = Σ (ξ^i)^{(1−α)/2} δ_i(x) ,
log{Σ ξ^i δ_i(x)} = Σ (log ξ^i) δ_i(x) ,

if we choose a new parametrization θ̃_α defined by

θ̃_α^i = (2/(1 − α)) (ξ^i)^{(1−α)/2} ,  α ≠ 1 ,
θ̃_α^i = log ξ^i ,  α = 1 ,

the α-representation of the distribution takes the following form

ℓ_α(x, θ̃_α) = θ̃_α^i δ_i(x)

with respect to the new coordinate system. This shows that the set S_n of all the distributions over n+1 atoms forms an α-family for any α, and that θ̃_α gives the natural homogeneous coordinate system of S_n as the α-family. Especially, S_n is an exponential family and also it is a mixture family.
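A small illustrative sketch of this example, assuming NumPy (the function names are our own): it maps a discrete distribution ξ to its homogeneous α-coordinates θ̃_α and back, using only the formulas above.

```python
import numpy as np

# Illustrative sketch of Example 3.1: homogeneous alpha-coordinates of a
# discrete distribution xi over n+1 atoms and the inverse map.
def alpha_coords(xi, alpha):
    xi = np.asarray(xi, dtype=float)
    if alpha == 1:
        return np.log(xi)
    return 2.0 / (1.0 - alpha) * xi ** ((1.0 - alpha) / 2.0)

def from_alpha_coords(theta, alpha):
    theta = np.asarray(theta, dtype=float)
    if alpha == 1:
        return np.exp(theta)
    return ((1.0 - alpha) / 2.0 * theta) ** (2.0 / (1.0 - alpha))

xi = np.array([0.2, 0.3, 0.5])
for alpha in (-1.0, 0.0, 0.5, 1.0):
    theta = alpha_coords(xi, alpha)
    assert np.allclose(from_alpha_coords(theta, alpha), xi)
# alpha = -1 returns xi itself (mixture coordinates); alpha = 0 gives 2*sqrt(xi).
```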
We have so far treated a manifold S of probability distributions. In order to study the properties of S, it is convenient to extend S to a manifold S̃ = {c p(x, θ) | c > 0} of finite measures and consider S as a submanifold of S̃, because S̃ has simpler geometrical structures. The extended S̃ consists of all the density functions of the form S̃ = {m(x, θ, c)},

m(x, θ, c) = c p(x, θ) ,  c > 0 ,  p(x, θ) ∈ S ,

so that it is (n+1)-dimensional and the pair (θ, c) is an example of the coordinate system of S̃. Let θ̃ = (θ̃^1, ..., θ̃^{n+1}) be a coordinate system in S̃ such that a member of S̃ is parametrized as m(x, θ̃), and let K(θ̃) be the total measure K(θ̃) = ∫ m(x, θ̃) dP of the distribution m(x, θ̃). Then, the original S forms a submanifold in S̃ defined by

K(θ̃) = 1 .

The geometrical structures can be introduced in S̃ in the same manner as in S as follows. Let {∂̃_i}, i = 1, ..., n+1, be the natural basis of the tangent space T̃_θ̃ of S̃ associated with the coordinate system θ̃. Then, the α-representation of ∂̃_i is given by ∂̃_i ℓ̃_α(x, θ̃), where ℓ̃_α(x, θ̃) = F_α{m(x, θ̃)}. The inner product of two vectors A, B ∈ T̃_θ̃ is defined by

⟨A, B⟩ = ∫ {A ℓ̃_α(x, θ̃)}{B ℓ̃_{−α}(x, θ̃)} dP .

The α-covariant derivative ∇̃^α_A B of a vector field B along A is given by

⟨∇̃^α_A B, C⟩ = ∫ (AB ℓ̃_α)(C ℓ̃_{−α}) dP

for any vector field C. The geometrical structures of the original S are compatible with those induced from S̃ as a submanifold.

When S is an α-family, the α-representation of the measures in the extended S̃ is

ℓ̃_α(x, θ̃) = θ̃^i c_i(x) ,

where θ̃ = (θ̃^1, ..., θ̃^{n+1}) is the natural coordinate system of S̃. (When α = 1, we assume c_{n+1}(x) = 1 in the above.) Since ∂̃_i ∂̃_j ℓ̃_α(x, θ̃) = 0 holds, the α-covariant derivative ∇̃^α of S̃ satisfies

∇̃^α_{∂̃_i} ∂̃_j = 0 .
This implies that S̃ is α-flat and that the natural coordinate system θ̃ is α-affine. The geodesic connecting two points θ̃_1 and θ̃_2 in S̃ is given by

θ̃(t) = (1 − t) θ̃_1 + t θ̃_2 .    (3.17)

It should be remarked that, even when the two points θ̃_1, θ̃_2 belong to S, the geodesic θ̃(t) does not necessarily belong to S, because S is in general curved in S̃.
There is a method of obtaining the geodesic c in the α-family S from the α-geodesic c̃ in the extended S̃. Roughly speaking, the projection of c̃ from the origin to S is the desired geodesic in S (Fig. 3.1). The following theorem, whose proof is omitted (see Nagaoka and Amari, 1982), elucidates the relation between a flat submanifold S' in S and its extension S̃' in S̃. A submanifold S' in S is said to be autoparallel, when it has vanishing imbedding (or Euler-Schouten) curvature. Since our manifolds are torsion-free, an autoparallel submanifold is totally geodesic in the sense that it consists of all the geodesics whose tangent vectors belong to the tangent space T_θ(S'). For a submanifold S' of an α-family S, the extension S̃' of S' is the submanifold of S̃ such that S̃' = {c p(x, θ) | p(x, θ) ∈ S', c > 0}.

Fig. 3.1

Theorem 3.3. A submanifold S' of an α-family S is autoparallel in S, when, and only when, the extended submanifold S̃' of S' is autoparallel in the extended manifold S̃.

By the use of the theorem, we can obtain the geodesic of an α-family S connecting two points θ_1 and θ_2 in S. The natural homogeneous coordinates of the two points are θ̃_1 and θ̃_2, where K(θ̃_1) = K(θ̃_2) = 1 is satisfied because they are in S. The geodesic θ̃(t) given by (3.17) connects the two points in S̃, so that in general K{θ̃(t)} ≠ 1. However, the curve

θ̃(t) = c(t){(1 − t) θ̃_1 + t θ̃_2} ,  α ≠ 1 ,
θ̃(t) = (1 − t) θ̃_1 + t θ̃_2 + c(t) ,  α = 1 ,

where c(t) is the normalization constant to be determined from K(θ̃) = 1, is the α-geodesic of S connecting θ̃_1 and θ̃_2. This implies that the geodesic in S is obtained from the geodesic (3.17) in S̃ by normalizing the measure m(x, θ̃) so that K(θ̃) = 1 holds.
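For discrete distributions this normalization step can be made concrete. The following minimal sketch (assuming NumPy; our own function names) interpolates linearly in the α-representation and then rescales the resulting measure so that K = 1.

```python
import numpy as np

# Illustrative sketch: normalized alpha-geodesic between two discrete
# distributions p and q, interpolating in the alpha-representation of the
# extended manifold and rescaling so that the total mass K equals 1.
def alpha_geodesic(p, q, t, alpha):
    p, q = np.asarray(p, float), np.asarray(q, float)
    if alpha == 1:   # log-linear interpolation; c(t) is additive in the log
        m = np.exp((1 - t) * np.log(p) + t * np.log(q))
    else:
        e = (1.0 - alpha) / 2.0
        m = ((1 - t) * p**e + t * q**e) ** (1.0 / e)
    return m / m.sum()

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.5, 0.4, 0.1])
print(alpha_geodesic(p, q, 0.5, -1.0))  # alpha = -1: the ordinary mixture, c(t) = 1
print(alpha_geodesic(p, q, 0.5,  1.0))  # alpha = +1: normalized geometric mean
```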
The mixture family (α = −1) is special in the sense that S itself is also a −1-flat submanifold in S̃. Indeed, the constraint K(θ̃) = Σ_{i=1}^{n+1} θ̃^i = 1 which determines S is linear in θ̃. Hence, the mixture family S itself is autoparallel. The exponential family (α = 1) is also special. The extended manifold S̃ of an exponential family is of the following form

ℓ̃(x, θ̃) = θ̃^i c_i(x) + θ̃^{n+1} ,

so that the constraint determining S is θ̃^{n+1} = − ψ(θ) and is not linear in θ̃. Hence, S is not an autoparallel submanifold in S̃. However, S itself is also a 1-flat manifold having null Riemann-Christoffel curvature, because the (α = 1)-connection vanishes,

Γ^{(1)}_ijk(θ) = E[∂_i ∂_j ℓ(x, θ) ∂_k ℓ(x, θ)] = 0 .

However, the imbedding curvature of S in S̃ does not vanish in this case.

As a summary, the extended manifold S̃ of any α-family S is α-flat, while S itself is in general not so. The mixture and exponential families are exceptional in the sense that they are −1- and 1-flat by themselves.

3.4. Duality in α-flat manifolds

We have already seen that, when a manifold S is ∇-flat (i.e., flat with respect to the covariant derivative ∇), it is also ∇*-flat, where ∇* is the dual of ∇. There exist two special coordinate systems in such a dually flat manifold: a ∇-affine coordinate system θ and a ∇*-affine coordinate system η. The manifold has a beautiful dualistic structure concerning the pair ∇ and ∇* of affine connections. We show it first in a general framework and then apply the results to the extended manifold S̃ of an α-family, which is flat with respect to the α- and −α-affine connections. The results can also be applied directly to the exponential and mixture families, because they are 1- and −1-flat by themselves.
Let θ = (θ^i) and η = (η_i) be two coordinate systems in an n-dimensional Riemannian manifold S, where the lower index is employed to denote the components of η with the intention of constructing a dualistic theory. The natural basis of the tangent space T_P at a point P ∈ S is {∂_i}, ∂_i = ∂/∂θ^i, for the coordinate system θ, and is {∂^i}, ∂^i = ∂/∂η_i, for the coordinate system η. Any vector A ∈ T_P can be represented by

A = A^i ∂_i = A_i ∂^i

in these bases. When the basis vectors ∂_i and ∂^j satisfy

⟨∂_i, ∂^j⟩ = δ_i^j ,

the two bases are said to be biorthogonal. The components A^i and A_i of a vector A can be derived by

A^i = ⟨A, ∂^i⟩ ,  A_i = ⟨A, ∂_i⟩

when the bases are biorthogonal.


Two coordinate systems θ and η are said to be mutually dual, when their natural bases are biorthogonal. Dual coordinate systems do not necessarily exist in a Riemannian manifold. It will soon be shown that they always exist in an α-flat manifold. Here, we assume that a pair of dual systems exists in S and let

θ = θ(η) ,  η = η(θ)

be the coordinate transformations between θ and η. Then, the two natural bases {∂_i} and {∂^i} are related by

∂_i = (∂η_k/∂θ^i) ∂^k ,  ∂^j = (∂θ^k/∂η_j) ∂_k ,

where ∂η_j/∂θ^i and ∂θ^k/∂η_j are the mutually inverse Jacobian matrices of the coordinate transformations. Taking the inner product of ∂_i and ∂_j, we have

g_ij = ⟨∂_i, ∂_j⟩ = (∂η_k/∂θ^i) ⟨∂^k, ∂_j⟩ = ∂η_j/∂θ^i ,

where g_ij is the metric of S. Hence,

∂θ^j/∂η_k = g^{jk}

holds, where g^{jk} is the inverse matrix of g_{kj}. The metric tensor in the basis {∂^i} is given by this inverse matrix, because of

⟨∂^i, ∂^j⟩ = (∂θ^k/∂η_i)(∂θ^m/∂η_j) ⟨∂_k, ∂_m⟩ = g^{ik} g^{jm} g_km = g^{ij} .
The following two theorems are fundamental in the new dualistic
theory of differential geometry.

Theorem 3.4. When a Riemannian manifold S has a pair of dual coordinate systems (θ, η), there exist potential functions ψ(θ) and φ(η) such that the metric tensors are derived by

g_ij(θ) = ∂_i ∂_j ψ(θ) ,  g^{ij}(η) = ∂^i ∂^j φ(η) .    (3.18)

Conversely, when either potential function ψ or φ exists from which the metric is derived by differentiating it twice, there exists a pair of dual coordinate systems. The dual coordinate systems are related by the following Legendre transformations

θ^i = ∂^i φ(η) ,  η_i = ∂_i ψ(θ) ,    (3.19)

where the two potential functions satisfy the identity

ψ(θ) + φ(η) − θ^i η_i = 0 .    (3.20)

Proof. When dual coordinate systems (θ, η) exist, the metric g_ij is given by g_ij = ∂_i η_j. Since it is symmetric, ∂_i η_j − ∂_j η_i = 0 holds. This shows that there exists a potential function ψ(θ) such that η_i = ∂_i ψ(θ). Therefore, g_ij = ∂_i ∂_j ψ is derived. Dually to the above, there exists φ(η) such that θ^i = ∂^i φ(η) and g^{ij} = ∂^i ∂^j φ(η). By differentiating φ(η) − θ^i η_i with respect to θ^j, where η is considered as a function of θ, we have

∂_j {φ(η) − θ^i η_i} = g_jk ∂^k φ − η_j − θ^i g_ij = − η_j .

Therefore, − {φ(η) − θ^i η_i} yields a potential function ψ(θ), so that we can choose ψ and φ such that (3.20) holds. On the contrary, when there exists a coordinate system θ such that the metric is derived by differentiating a potential function ψ twice, g_ij = ∂_i ∂_j ψ, we can define a new coordinate system η by η_i(θ) = ∂_i ψ(θ). It is easy to prove that the pair (θ, η) thus constructed consists of dual coordinate systems, because the natural basis ∂^i of η is given by ∂^i = g^{ij} ∂_j, which is biorthogonal to ∂_i.
Theorem 3.5. When a Riemannian manifold S is flat with respect to a pair of torsion-free dual affine connections ∇ and ∇*, there exists a pair (θ, η) of dual coordinate systems such that θ is a ∇-affine and η is a ∇*-affine coordinate system.

Proof. Since S is ∇-flat, there exists a ∇-affine coordinate system θ in which Γ_ijk = 0. Let Γ*_ijk be the components of the dual connection ∇* in this coordinate system. Then, we have from (3.12)

Γ*_ijk = ∂_i g_jk .

Since ∇* is torsion-free, i.e., Γ*_ijk = Γ*_jik, ∂_i g_jk = ∂_j g_ik follows. This guarantees the existence of the potential ψ such that g_ij(θ) = ∂_i ∂_j ψ(θ). Hence, from Theorem 3.4, there exists a pair of dual coordinate systems (θ, η), where η is defined by η_i(θ) = ∂_i ψ(θ). Let ∂_i and ∂^i be the natural basis vectors with respect to θ and η, respectively. In order to prove that η is ∇*-affine, we use the relation (3.11). For any vector A = A^i ∂_i,

A ⟨∂_i, ∂^j⟩ = ⟨∇_A ∂_i, ∂^j⟩ + ⟨∂_i, ∇*_A ∂^j⟩

holds. Here, the left-hand side vanishes because of ⟨∂_i, ∂^j⟩ = δ_i^j, and ∇_A ∂_i = 0 because θ is ∇-affine. Hence, ⟨∂_i, ∇*_A ∂^j⟩ = 0, which proves that η is a ∇*-affine coordinate system.

Since the ±α-connections are dual and torsion-free, the above theorem is directly applicable to an α-flat manifold. We first apply it to an exponential family S. An exponential family S = {p(x, θ)} given by (3.15) is (α = 1)-flat, where the natural parameter θ is the 1-affine coordinate system, because

Γ^{(1)}_ijk(θ) = E[(∂_i ∂_j ℓ)(∂_k ℓ)] = 0

in this coordinate system. The metric tensor g_ij is calculated as

g_ij = − E[∂_i ∂_j ℓ(x, θ)] = ∂_i ∂_j ψ(θ) ,

which shows that ψ(θ) is indeed the potential function. Hence, the dual coordinate system η is given by η_i = ∂_i ψ(θ), which is well known as the expectation parameter of the exponential family, because

E[c_i(x)] = ∂_i ψ(θ) = η_i .

The components of the (−1)-affine connection vanish in this coordinate system,

Γ^{(−1)ijk}(η) = ⟨∇^{(−1)}_{∂^i} ∂^j, ∂^k⟩ = 0 ,

as we have already seen. The metric g^{ij}(η) = ⟨∂^i, ∂^j⟩ is the inverse of g_ji, derived by g^{ij}(η) = ∂^i ∂^j φ(η) from the potential φ(η). Let H(η) be the entropy function of the distribution specified by η. It is given by
H(η) = − ∫ p(x, θ) log p(x, θ) dP = − E[θ^i c_i(x) − ψ(θ)] = ψ(θ) − θ^i η_i ,

where θ is considered as a function of η. This shows that the potential φ(η) is the negative entropy, φ(η) = − H(η), and θ^i = ∂^i φ(η) holds. The two coordinates θ and η are connected by the Legendre transformation with the potential functions ψ(θ) (which is related to the cumulant generating function) and φ(η) (which is the negative of the entropy function).
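As a minimal numerical sketch of this Legendre duality (our own illustration, assuming NumPy), take the Bernoulli family with ψ(θ) = log(1 + e^θ); its dual potential φ(η) is the negative binary entropy, and the identity (3.20) and the inverse map θ = ∂φ/∂η can be checked directly.

```python
import numpy as np

# Illustrative sketch: Legendre duality for the Bernoulli family.
# psi(theta) = log(1 + e^theta), eta = psi'(theta) = E[x], and
# phi(eta) = eta*log(eta) + (1-eta)*log(1-eta) is the negative entropy.
def psi(theta):
    return np.log1p(np.exp(theta))

def phi(eta):
    return eta * np.log(eta) + (1 - eta) * np.log(1 - eta)

theta = 0.7
eta = np.exp(theta) / (1 + np.exp(theta))                    # eta = d psi / d theta
assert np.isclose(psi(theta) + phi(eta) - theta * eta, 0.0)  # identity (3.20)
assert np.isclose(np.log(eta / (1 - eta)), theta)            # theta = d phi / d eta
```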


A mixture family given by (3.16) is −1-flat. The coordinate system θ is −1-affine, because the components of the −1-connection vanish in this coordinate system,

Γ^{(−1)}_ijk = ⟨∇^{(−1)}_{∂_i} ∂_j, ∂_k⟩ = ∫ (∂_i ∂_j p)(∂_k ℓ) dP = 0 ,

where the −1-representation ℓ_{−1} = p, for which ∂_i ∂_j p = 0, is used. The potential function ψ(θ) is given in this case by the negative entropy

ψ(θ) = − H(θ) = ∫ p(x, θ) log p(x, θ) dP .

Calculations indeed show g_ij(θ) = − ∂_i ∂_j H(θ). The dual coordinate system η is given by

η_i = ∫ {c_i(x) − c_{n+1}(x)} log p(x, θ) dP .

The dual potential φ(η) is

φ(η) = − ∫ c_{n+1}(x) log p(x, θ) dP ,

where θ is considered as a function of η, and θ^i = ∂^i φ(η). The identity (3.20) is obviously satisfied.
An α-family S (α ≠ ±1) is not α-flat, but its extended manifold S̃ is. The natural coordinate system θ̃ in S̃ is α-affine. The potential function is given by the total measure as

ψ̃_α(θ̃) = (2/(1 + α)) K(θ̃) .

This can be shown by differentiating the above ψ̃_α, obtaining

∂̃_j ∂̃_i ψ̃_α(θ̃) = (2/(1 + α)) ∫ ∂̃_j ∂̃_i m(x, θ̃) dP = g_ji(θ̃) .

The dual coordinates η̃ are given by

η̃_i(θ̃) = (2/(1 + α)) ∂̃_i K(θ̃) .

The dual potential is also given by the total measure as

φ̃_α(η̃) = (2/(1 − α)) K(η̃) ,

where K(η̃) is the total measure of the distribution specified by η̃. The inverse transformation is given by

θ̃^i = (2/(1 − α)) ∂̃^i K(η̃) .

The relation (3.20) reduces in this case to

θ̃^i η̃_i = 4K/(1 − α²) .
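For a measure on finitely many atoms these formulas can be checked directly. The sketch below (our own illustration, assuming NumPy) verifies the relation θ̃ · η̃ = 4K/(1 − α²).

```python
import numpy as np

# Illustrative check: for a finite measure m over atoms,
# theta~^i = 2/(1-alpha) m_i^((1-alpha)/2), eta~_i = 2/(1+alpha) m_i^((1+alpha)/2),
# and theta~ . eta~ = 4K/(1 - alpha^2) with K = sum(m).
m = np.array([0.4, 0.7, 1.2])     # an arbitrary finite measure
alpha = 0.5
theta = 2.0 / (1.0 - alpha) * m ** ((1.0 - alpha) / 2.0)
eta   = 2.0 / (1.0 + alpha) * m ** ((1.0 + alpha) / 2.0)
assert np.isclose(theta @ eta, 4.0 * m.sum() / (1.0 - alpha**2))
```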

3.5. α-divergence

Returning to a general Riemannian manifold S admitting a pair of dual coordinate systems (θ, η), let us define a function D : S × S → R by

D(P_1, P_2) = ψ(θ_1) + φ(η_2) − θ_1 · η_2 ,    (3.21)

that is, D is a function of two points P_1 and P_2 ∈ S whose θ-coordinates are θ_1 and θ_2 and whose η-coordinates are η_1 and η_2, respectively, where ψ and φ are the dual potentials and θ_1 · η_2 is the abbreviation of

θ_1 · η_2 = θ_1^i η_{2i} .

As can easily be shown from (3.20), when P_1 = P_2, D(P_1, P_2) = 0. The function D is called the divergence of the two points P_1 and P_2 from P_1 to P_2, because it satisfies the following properties, where D(θ, θ') denotes the divergence of the points whose θ-coordinates are θ and θ'.

Lemma.
1) D(θ, θ') ≥ 0. The equality holds when, and only when, θ = θ'.
2) ∂_i D(θ, θ') = ∂'_i D(θ, θ') = 0 at θ = θ', where ∂'_i = ∂/∂θ'^i.
3) ∂_i ∂_j D(θ, θ') = g_ij(θ).

Proof. By differentiating (3.21), the relations
∂_i D(θ, θ') = η_i(θ) − η_i(θ') ,
∂'_i D(θ, θ') = g_ij(θ')(θ'^j − θ^j) ,
∂_i ∂_j D(θ, θ') = g_ij(θ)

are derived, which prove 2) and 3). Since g_ij is positive definite, D(θ, θ') is a strictly convex function in θ. Hence, D(θ, θ') > 0 when θ ≠ θ', because of D(θ, θ) = 0, ∂_i D(θ, θ) = 0, proving 1).

The divergence does not satisfy the triangle inequality. Nor is it symmetric. Hence, it is not a usual distance. However, the divergence between two neighboring points θ and θ + dθ satisfies the relation

D(θ, θ + dθ) = D(θ + dθ, θ) = (1/2) g_ij(θ) dθ^i dθ^j ,    (3.22)

where the term of order O(|dθ|³) is neglected. Hence, the divergence can be regarded as an extension of the square of the Riemannian distance.
Let us consider an α-flat manifold S (e.g., an exponential family, a mixture family, the extended manifold of an α-family) which has dual coordinate systems (θ, η) with the potentials ψ(θ) and φ(η). Then, the divergence is introduced in S. It is called the α-divergence, and is denoted by D_α(θ, θ'),

D_α(θ, θ') = ψ(θ) + φ(η') − θ · η' .    (3.23)

Since S is −α-flat at the same time with dual coordinate systems (η, θ), the −α-divergence is also introduced by

D_{−α}(η, η') = φ(η) + ψ(θ') − θ' · η .

The two divergences are related by

D_α(P, P') = D_{−α}(P', P) ,

which demonstrates the dualistic structure of S.

The following theorem shows that the α-divergence satisfies an important relation similar to the Pythagorean relation in Euclidean geometry, where the ±α-geodesics play the role of straight lines. This elucidates the relation between the divergence and the differential-geometrical structures.

Theorem 3.6. Given three points θ, θ', θ'' in an α-flat manifold S, let c_+ be the α-geodesic connecting θ and θ', and let c_− be the −α-geodesic connecting θ' and θ''. Then, the following Pythagorean relation holds,

D_α(θ, θ') + D_α(θ', θ'') ⋚ D_α(θ, θ'') ,    (3.24)

where <, =, or > holds according to whether the angle between the tangent vectors of the geodesics c_+ and c_− is greater than, equal to, or less than π/2 at θ'.

Fig. 3.2

Proof. From the definition of the divergence, we have

D_α(θ, θ') + D_α(θ', θ'') = ψ(θ) + φ(η'') − θ · η'' + (θ − θ') · (η'' − η')
                          = D_α(θ, θ'') + (θ − θ') · (η'' − η') .

Since θ and η are α- and −α-affine coordinate systems, respectively, the α-geodesic c_+ and the −α-geodesic c_− can be written as

c_+ : θ(t) = θ' + t(θ − θ') ,
c_− : η(t) = η' + t(η'' − η')

in the respective coordinate systems. Hence, their tangent vectors at θ' (or η') are given by θ − θ' and η'' − η', and their inner product can be written as

(θ − θ') · (η'' − η') = (θ^i − θ'^i)(η''_i − η'_i) .

Since the sign of the above inner product is determined by the angle between c_+ and c_−, the theorem is proved. The Pythagorean law holds for a right triangle whose one side is an α-geodesic and the other side is a −α-geodesic (Fig. 3.2).
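The key identity in the proof lends itself to a numerical check. The following minimal sketch (assuming NumPy; the function names are our own) uses the categorical exponential family over three atoms with the 1-divergence (Kullback information).

```python
import numpy as np

# Illustrative check of the identity in the proof:
# D(th, th') + D(th', th'') - D(th, th'') = (th - th') . (eta'' - eta'),
# for the 1-divergence on the categorical exponential family over 3 atoms,
# with p_i proportional to exp(theta_i) and theta_3 = 0 fixed.
def prob(th):
    e = np.exp(np.append(th, 0.0))
    return e / e.sum()

def D1(th_a, th_b):    # D_1(a, b) = psi(a) + phi(eta_b) - a . eta_b = KL(p_b : p_a)
    pa, pb = prob(th_a), prob(th_b)
    return np.sum(pb * np.log(pb / pa))

th  = np.array([0.2, -0.4])
th1 = np.array([1.0,  0.3])
th2 = np.array([-0.5, 0.8])
eta1, eta2 = prob(th1)[:2], prob(th2)[:2]    # expectation coordinates eta_i = p_i
lhs = D1(th, th1) + D1(th1, th2) - D1(th, th2)
rhs = (th - th1) @ (eta2 - eta1)
assert np.isclose(lhs, rhs)
```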

The α-divergence can be expressed directly in a simpler and familiar form for an α-family of distributions. Let us define the following functions

f_α(u) = (4/(1 − α²)){1 − u^{(1+α)/2}} ,  α ≠ ±1 ,
f_α(u) = u log u ,  α = 1 ,    (3.25)
f_α(u) = − log u ,  α = −1 .

These functions satisfy the relations

f_α(1) = 0 ,  f''_α(1) = 1 ,

and are obtained by integrating the differential equation

f''_α(u) = u^{(α−3)/2} .

Hence, f''_α(u) > 0, so that they are convex.

Theorem 3.7. The α-divergence D_α(θ, θ') in an α-family is given by

D_α(θ, θ') = E_θ[ f_α{ p(x, θ') / p(x, θ) } ] .    (3.26)
Proof. For an exponential family (α = 1), we have

E_{θ'}[− ℓ(x, θ)] = E_{θ'}[− θ^i c_i(x) + ψ(θ)] = ψ(θ) − θ · η' ,
E_{θ'}[− ℓ(x, θ')] = H(θ') = − φ(η') ,

where E_{θ'} denotes the expectation with respect to p(x, θ'). Hence, the 1-divergence from θ to θ' is given by

D_1(θ, θ') = E_{θ'}[ − log{p(x, θ)/p(x, θ')} ]
           = E_θ[ {p(x, θ')/p(x, θ)} log{p(x, θ')/p(x, θ)} ]
           = E_θ[ f_1{ p(x, θ')/p(x, θ) } ] ,

and the −1-divergence is given by

D_{−1}(θ, θ') = D_1(θ', θ) = E_θ[ f_{−1}{ p(x, θ')/p(x, θ) } ] .

Similarly, for a mixture family (α = −1),

E_θ[ℓ(x, θ)] = − H(θ) = ψ(θ) ,
E_θ[ℓ(x, θ')] = ∫ [θ^i{c_i(x) − c_{n+1}(x)} + c_{n+1}(x)] log p(x, θ') dP = θ^i η'_i − φ(η')

hold, so that (3.26) holds for α = ±1.
For a general α-family S (α ≠ ±1), we consider the α-divergence in the extended manifold S̃, because S itself is not α-flat but S̃ is. The α-divergence from θ̃ to θ̃' is given by

D̃_α(θ̃, θ̃') = (2/(1 + α)) K(θ̃) + (2/(1 − α)) K(θ̃') − θ̃ · η̃'

in S̃, where θ̃ = (θ̃^1, ..., θ̃^{n+1}). We have

θ̃ · η̃' = (2/(1 + α)) θ̃^i ∂̃_i K(θ̃')
        = (2/(1 + α)) ∫ θ̃^i c_i(x) {m(x, θ̃')}^{(1+α)/2} dP
        = (4/(1 − α²)) ∫ {m(x, θ̃)}^{(1−α)/2} {m(x, θ̃')}^{(1+α)/2} dP ,

since θ̃^i c_i(x) = (2/(1 − α)) {m(x, θ̃)}^{(1−α)/2}. For two distributions p(x, θ), p(x, θ') belonging to S, K(θ̃) = K(θ̃') = 1 holds, so that

D_α(θ, θ') = (4/(1 − α²)) [1 − ∫ {p(x, θ)}^{(1−α)/2} {p(x, θ')}^{(1+α)/2} dP] = E_θ[ f_α{ p(x, θ')/p(x, θ) } ] ,

and (3.26) is proved.
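The following minimal sketch (assuming NumPy; our own function names) implements (3.25) and (3.26) for discrete distributions and checks the special cases α = −1 (Kullback information) and α = 0 (Hellinger distance, cf. (3.27)).

```python
import numpy as np

# Illustrative sketch of the alpha-divergence (3.26) between two discrete
# probability distributions, using f_alpha from (3.25).
def f_alpha(u, alpha):
    if alpha == 1:
        return u * np.log(u)
    if alpha == -1:
        return -np.log(u)
    return 4.0 / (1.0 - alpha**2) * (1.0 - u ** ((1.0 + alpha) / 2.0))

def D_alpha(p, q, alpha):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * f_alpha(q / p, alpha))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.5, 0.4, 0.1])
print(D_alpha(p, q, -1))                          # Kullback information sum p log(p/q)
print(np.sum(p * np.log(p / q)))                  # same value
print(D_alpha(p, q, 0))                           # Hellinger distance (3.27)
print(2 * np.sum((np.sqrt(p) - np.sqrt(q))**2))   # same value
```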
It should be remarked that the α-divergences have already been known and widely used in the statistical literature, without noticing the differential-geometrical background. Csiszár [1967, 1975] treats a general theory of this kind of divergence. The −1-divergence from p(x, θ) to p(x, θ') (and hence the 1-divergence from p(x, θ') to p(x, θ)) is the well-known Kullback information,

D_{−1}(θ, θ') = I{p(x, θ') : p(x, θ)} .

The α-divergence is nothing but the Chernoff distance of degree α. Especially, the 0-divergence is the Hellinger distance,

D_0(θ, θ') = 2 ∫ {√p(x, θ) − √p(x, θ')}² dP .    (3.27)

As will be shown soon, the Hellinger distance is directly related to the Riemannian distance between θ and θ'. The α-divergence is also equivalent to Rényi's α-information, defined by

I_α{p(x) : q(x)} = (1/(α − 1)) log ∫ {p(x)}^α {q(x)}^{1−α} dP .    (3.28)

When α = 1, the 1-information is defined by taking the limit α → 1 in the above, and it reduces to the Shannon entropy with respect to the carrier measure q dP. The α-divergence is related to the Rényi α-information by

D_α(θ, θ') = (4/(1 − α²)) [1 − exp{(α' − 1) I_{α'}{p(x, θ) : p(x, θ')}}] ,    (3.29)

where α' = (1 − α)/2.


3.6. α-projection

Let S' be a smooth submanifold imbedded in a flat manifold S with dual coordinate systems (θ, η). We consider the problem of obtaining the point θ' ∈ S' that is closest to a given point θ ∈ S in the sense that the divergence D(θ, ξ) from θ to a point ξ in S' is minimized at ξ = θ' (Fig. 3.3),

min_{ξ ∈ S'} D(θ, ξ) = D(θ, θ') .

Fig. 3.3

In the case of an α-family S, this is the problem of approximating a distribution p(x, θ) by p(x, θ') belonging to a submanifold S', in the sense of minimizing the α-divergence D_α(θ, θ'), or dually in the sense of minimizing the −α-divergence D_{−α}(θ, θ') = D_α(θ', θ). We first define some fundamental concepts for solving this problem. The solution elucidates the intrinsic relation between the α-divergence and the α-geodesic or α-connection.

Definition 3.3. A point θ' ∈ S' is called an α-extreme point of θ ∈ S, when D_α(θ, ξ), ξ ∈ S', takes an extreme value at ξ = θ'.

Let A' be a tangent vector belonging to the tangent space T_{θ'}(S') of S' at θ'. Then, θ' is an extreme point of θ, if and only if A'D_α(θ, θ') = 0, where A' = A'^i ∂/∂θ'^i operates on the second variable θ'.

Definition 3.4. A point θ' ∈ S' is called an α-projection of θ to S', when the α-geodesic connecting θ and θ' is orthogonal to S'.

Let A_α(θ') be the submanifold of S consisting of all the α-geodesics which intersect S' at θ' perpendicularly. In other words, A_α(θ') is the inverse image of the α-projection,

A_α(θ') = {θ | the α-projection of θ is θ'} .

The submanifold A_α(θ') consisting of all such α-geodesics is totally α-geodesic, and is α-autoparallel.

Theorem 3.8. A point θ' ∈ S' is an α-extreme point of θ, when and only when θ' is the α-projection of θ to S'.

Proof. Assume that θ' is an α-extreme point of θ. Then, for any vector A' belonging to the tangent space T_{θ'}(S') at θ', A'D_α(θ, θ') = 0 holds. Hence, by using the expression A' = A'_i ∂'^i in the η-coordinates, where ∂'^i = ∂/∂η'_i, we obtain

A'D_α(θ, θ') = A'_i ∂'^i [φ(η') − θ · η'] = − ⟨A', θ − θ'⟩ = 0 .

This shows that the curve c,

c : θ(t) = θ' + t(θ − θ') ,

is orthogonal to S' at θ', because its tangent is θ̇(0) = θ − θ' at θ'. Since c is the α-geodesic connecting θ and θ', θ ∈ A_α(θ') and θ' is the α-projection of θ. Conversely, if θ' is the α-projection of θ, ⟨A', θ − θ'⟩ = 0 holds for any A' which is tangential to S'. Hence, A'D_α(θ, θ') = 0, proving that θ' is an α-extreme point.
The α-projection of θ is not necessarily unique in general. Moreover, an extreme point θ' is not necessarily the minimum point giving the best approximation of θ. The next theorem yields the condition which guarantees that the extreme point is unique, if it exists, giving the minimum α-divergence. To state the theorem, we need one more concept.

Definition 3.5. A subset V of S is said to be α-convex, when, for any points θ_1 and θ_2 in V, there exists a unique α-geodesic connecting θ_1 and θ_2 which is entirely included in V.

Theorem 3.9. When a closed set V in S is −α-convex having a smooth boundary ∂V, the α-projection from outside V to the boundary ∂V is unique. The unique projection θ' ∈ ∂V from θ gives the point which minimizes the α-divergence from θ to V. Especially, the α-projection to a −α-convex submanifold S' is unique, giving the α-minimal point.

Proof. Assume the contrary, that there exist two points θ_1 and θ_2 ∈ ∂V (θ_1 ≠ θ_2) both of which are α-extreme points of θ ∈ S − V to V. Let us construct a triangle (θ, θ_1, θ_2), whose sides c_i connecting θ and θ_i (i = 1, 2) are α-geodesics and whose side c_0 connecting θ_1 and θ_2 is a −α-geodesic (Fig. 3.4). Since V is −α-convex, c_0 is included in V, so that the angle between c_i and c_0 is not less than π/2. Hence, the Pythagorean theorem yields

D_α(θ, θ_1) ≥ D_α(θ, θ_2) + D_α(θ_2, θ_1) ,
D_α(θ, θ_2) ≥ D_α(θ, θ_1) + D_α(θ_1, θ_2) .

From this follows

D_α(θ_1, θ_2) + D_α(θ_2, θ_1) ≤ 0 ,

which is a contradiction, proving the uniqueness of the α-projection.

Fig. 3.4

The minimality of the α-projection is proved in a similar manner as follows. Let us construct a triangle (θ, θ', ξ), where θ' is the α-projection of θ and ξ is any point in V. Since the −α-geodesic connecting θ' and ξ is inside V, the angle between the two geodesics connecting θ and θ', and θ' and ξ, is not less than π/2. Hence, D_α(θ, ξ) ≥ D_α(θ, θ'), proving the theorem.
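As a small illustrative sketch of Definitions 3.3-3.5 (assuming NumPy; the submanifold and the brute-force search are our own choices), a distribution over {0, 1, 2} is projected onto the curved one-parameter family of binomial(2, t) distributions by minimizing the −1-divergence over a grid of t.

```python
import numpy as np

# Illustrative sketch: a -1-projection computed by brute force.
def D_minus1(p, q):                 # -1-divergence (Kullback information)
    return np.sum(p * np.log(p / q))

def binom2(t):                      # submanifold S' = {binomial(2, t)}
    return np.array([(1 - t)**2, 2 * t * (1 - t), t**2])

p = np.array([0.5, 0.2, 0.3])
ts = np.linspace(1e-3, 1 - 1e-3, 10001)
t_star = ts[np.argmin([D_minus1(p, binom2(t)) for t in ts])]
print(t_star, binom2(t_star))       # the projection point, an extreme point of D
```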

The exponential and mixture families are ±1-flat, so that Theorems 3.8 and 3.9 are directly applicable to these families for obtaining the closest distribution belonging to a submanifold S' from a point of S in the sense of the ±1-divergence. The approximation by the −1-divergence (Kullback information) is especially important in the theory of statistical inference, as will be shown in Part II. In the case of an α-family S (α ≠ ±1), since S is not α-flat by itself, it apparently seems that the above theory is not directly applicable to an α-family S, but only to the extended S̃ in which S is imbedded. However, geodesics in S are tightly related to those of S̃, as we saw in Theorem 3.3. By using this theorem, we can prove that Theorems 3.8 and 3.9 are valid for any α-family S.

3.7. On geometry of function space of distributions

We have so far treated parametrized families of statistical models, which form finite-dimensional manifolds, from the geometrical point of view. Then, what kind of geometry can one construct on the set of all the regular density functions, which is a non-parametric model from the statistical point of view? We have already seen in Example 3.1 that the set of all the distributions on a finite number of atoms is an α-family for any α. Therefore, it might be expected that the set of all the mutually absolutely continuous density functions is also an α-family for any α. This assertion seems true in the sense that the set shares most properties with α-families. However, the problem is to find an adequate topology with which the set of density functions forms a manifold. We do not discuss this difficulty here but only suggest the geometrical properties of the non-parametric statistical model by the following method.

Let S = {p(x)} be the set of all the mutually absolutely continuous regular density functions on X with respect to a measure P(x), and let S̃ = {m(x)}, where

m(x) = c p(x) ,  c > 0 ,

be its extended set of finite measures. We call ℓ_α(x) = F_α{m(x)} the α-representation of m(x). Let m(x, t) be a smooth curve (in some topology) in S̃ and let us put ℓ_α(x, t) = F_α{m(x, t)}. We call

ℓ̇_α(x, 0) = m^{−(1+α)/2} ṁ(x, 0)

the α-representation of the tangent of the curve m(x, t) at t = 0, i.e., at m(x, 0), where ˙ denotes d/dt. The inner product of the tangents ℓ̇_{α,1}(x, 0) and ℓ̇_{α,2}(x, 0) of two curves m_1(x, t) and m_2(x, t) at their intersection point m(x, 0) = m_1(x, 0) = m_2(x, 0) is given by

⟨ℓ̇_{α,1}, ℓ̇_{α,2}⟩ = ∫ ℓ̇_{α,1}(x, 0) ℓ̇_{−α,2}(x, 0) dP ,

where ℓ̇_{α,i} denotes the α-representation of the tangent of the curve m_i(x, t), i = 1, 2.

When the above inner product vanishes, the two curves are said to be orthogonal. The tangent directions of a submanifold are defined in a similar manner.

The α-geodesic connecting two points m_1(x) and m_2(x) in S̃ is defined by the curve

ℓ_α(x, t) = ℓ_{α1}(x) + t{ℓ_{α2}(x) − ℓ_{α1}(x)} ,  t ∈ [0, 1] ,

in the α-representation. This definition suggests that S̃ is α-flat for any α, and ℓ_α(x) gives the α-affine coordinate system of S̃. The α-geodesic p(x, t) connecting two probability distributions p_1(x) and p_2(x) in S is given by

ℓ_α(x, t) = c(t)[ℓ_{α1}(x) + t{ℓ_{α2}(x) − ℓ_{α1}(x)}] ,  α ≠ 1 ,
ℓ(x, t) = c(t) + ℓ_1(x) + t{ℓ_2(x) − ℓ_1(x)} ,  α = 1 ,

where c(t) is the normalization constant to be determined from

∫ p(x, t) dP = 1 .
Let

K = ∫ m(x) dP ,  H = − ∫ m(x) log m(x) dP .

Then,

ψ_α(m) = (2/(1 + α)) K ,  α ≠ −1 ,
ψ_α(m) = − H − K ,  α = −1 ,    (3.30)

gives the potential function of the α-flat manifold S̃, where the α-affine coordinate system is given by ℓ_α(x). The dual of ℓ_α(x) is ℓ_{−α}(x), which is obtained by the Fréchet derivative of ψ_α,
ℓ_{−α}(x) = δψ_α[ℓ_α(x)]/δℓ_α(x) .

The dual potential is given by

φ_α = ψ_{−α} ,

and

ψ_α + φ_α − ∫ ℓ_α(x) ℓ_{−α}(x) dP = 0    (3.31)

holds. The second-order Fréchet derivative of ψ_α gives the metric as

(δ²ψ_α/δℓ_α δℓ_α)[ℓ̇_α^1, ℓ̇_α^2] = ∫ ℓ̇_α^1 ℓ̇_{−α}^2 dP .    (3.32)

The α-divergence from p_1(x) to p_2(x) is given by

D_α{p_1(x), p_2(x)} = ∫ p_1(x) f_α{ p_2(x)/p_1(x) } dP .    (3.33)


Let S' be a smooth subset of S. Then, an α-extreme point of p(x) in S' is given by the α-projection of p(x) to S'. Theorems 3.8 and 3.9 hold also in the case of the function space of density functions.
Finally, we touch upon the Riemannian geometry of the function space of distributions. It is the geometry obtained by putting α = 0. The 0-representation of a finite measure m(x) is given by

ℓ_0(x) = F_0{m(x)} = 2√m(x) .

Hence, the extended manifold S̃ = {2√m(x) | m(x) is a finite measure} in the 0-representation is the L²-space, and the manifold S of the probability distributions is part of the sphere of radius 2 imbedded in S̃, since ∫ {ℓ_0(x)}² dP = 4K = 4 on S.

The potential ψ_0(m) is given by

ψ_0(m) = 2K(m) = (1/2) ∫ {ℓ_0(x)}² dP ,

which is equal to φ_0 because of the self-duality. The metric obtained by differentiating it twice in the sense of Fréchet is

(δ²ψ_0/δℓ_0 δℓ_0)[A, B] = ∫ AB dP .

This is the L²-metric, so that the manifold S̃ is the ordinary L²-space. The S is a curved manifold imbedded in S̃ as a sphere.
The Riemannian geodesic c connecting p(x) and q(x) is given by

ℓ_0(x, t) = 2c(t)[√p(x) + t{√q(x) − √p(x)}]

in the 0-representation, or by

p(x, t) = c²(t)[√p(x) + t{√q(x) − √p(x)}]²    (3.34)

in the ordinary density representation, where c(t) is the normalization constant. The geodesic c̃ in S̃ is obviously the straight line connecting √p and √q in the 0-representation, and the geodesic in S is its projection on the sphere S (Fig. 3.5).

Fig. 3.5

The 0-divergence between p(x) and q(x),

D_0(p, q) = 4(1 − ∫ √(pq) dP) ,    (3.35)

is a half of the squared length of the chord c̃ in L². This is known as the Hellinger distance. The Riemannian distance s(p, q) is the length of the arc c on the sphere S and is obtained by integrating the infinitesimal distance ds along c. It is related to D_0 by

s(p, q) = 2 cos^{−1}(1 − D_0/4) = 2 cos^{−1} ∫ √(pq) dP ,    (3.36)

as can easily be understood from the relation between an arc and a chord. This distance is known as the Bhattacharyya distance.
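A small numerical sketch of this chord-arc relation (our own illustration, assuming NumPy): both distances are computed from the Bhattacharyya coefficient ∫√(pq) dP for two discrete distributions.

```python
import numpy as np

# Illustrative sketch: chord (Hellinger) versus arc (Bhattacharyya/Riemannian)
# distance between two discrete distributions, relations (3.35) and (3.36).
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.5, 0.4, 0.1])
bc = np.sum(np.sqrt(p * q))     # Bhattacharyya coefficient, integral of sqrt(pq)
D0 = 4.0 * (1.0 - bc)           # 0-divergence / Hellinger distance (3.35)
s  = 2.0 * np.arccos(bc)        # Riemannian (Bhattacharyya) distance (3.36)
assert np.isclose(s, 2.0 * np.arccos(1.0 - D0 / 4.0))
print(D0, s)
```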

3.8. Remarks on possible divergence, metric and connection in statistical manifolds

We have introduced the Riemannian metric and the 1-parameter family of affine connections (α-connections) in statistical manifolds. When the manifold is α-flat, the α-divergence is also introduced, which is compatible with the α-connection in the sense that the α-geodesic minimizes the α-divergence. Here we again consider the problem of what kinds of geometrical structures can be naturally introduced in a statistical manifold. Mathematically speaking, any n³ differentiable functions Γ_ijk(θ) define an affine connection of S, provided the components Γ_αβγ(ξ) in another coordinate system ξ = (ξ^a) are defined according to the law (2.21) of the coordinate transformations. Also, any n² smooth functions g_ij(θ) define a Riemannian metric, provided they form a positive definite matrix and their components g_αβ(ξ) in another coordinate system ξ = (ξ^a) are given by the tensorial law. However, such arbitrarily defined connections and metrics do not reflect any stochastic or statistical properties of the family of probability distributions, so that such structures are quite useless. Then, the problem naturally arises: under what conditions are the Fisher metric and the α-affine connections uniquely introduced? We have already shown that our definitions are invariant under the choices of the coordinate systems both in the sample space X and in the parameter space Θ.
Cencov [1972] studied this problem in the framework of the category whose objects are the manifolds of all the probability distributions on a finite number of atoms, with Markovian morphisms between manifolds. He proved that the Riemannian metric is unique (to within a constant factor) and that the α-connections are the only invariant connections in this category with Markovian morphisms.

If we do not require the invariance under the coordinate transformations of the sample space X, it is possible to introduce other metrics and affine connections (see, e.g., Rao and Burbea [1982b]). Let D(p, q) be a divergence or contrast function between two density functions p(x) and q(x), which is smooth in p and q and which satisfies D(p, q) ≥ 0, with the equality when and only when p(x) = q(x). Eguchi [1983] defined the D-metric tensor g^D_ij(θ) in the statistical manifold S with the coordinate system θ by

g^D_ij(θ) = − (∂²/∂θ^i ∂θ'^j) D(θ, θ') |_{θ'=θ} ,    (3.37)

where

D(θ, θ') = D{p(x, θ), p(x, θ')} .

This is non-negative definite (and hence its positivity is required for D). He also defined a pair of dual affine connections by

Γ^D_ijk(θ) = − (∂³/∂θ^i ∂θ^j ∂θ'^k) D(θ, θ') |_{θ'=θ} ,    (3.38)
Γ^{D*}_ijk(θ) = − (∂³/∂θ'^i ∂θ'^j ∂θ^k) D(θ, θ') |_{θ'=θ} .    (3.39)

It is not difficult to prove that these two are indeed mutually dual, satisfying the law of coordinate transformations for affine connections. However, it should be noted that these geometrical structures depend only on the local properties of the function D(θ, θ') in a small neighborhood of θ = θ'.
Now we confine our attention to the class of invariant divergences, and search for the geometrical structures derived therefrom. Since a divergence D(p, q) from p(x) to q(x) is a functional of p(x) and q(x) taking non-negative values, it is natural to consider the following type of functionals,

D(p, q) = E_p[F{p(x), q(x)}] = ∫ F{p(x), q(x)} p(x) dP ,    (3.40)

where F is some function and E_p is the expectation with respect to p(x). We then require that D(p, q) should be invariant under any (coordinate) transformation of the sample space X, i.e., a transformation of the random variable x into y. Then, p(x) and q(x) are transformed to

p̄(y) = p{x(y)} J^{−1}(y) ,  q̄(y) = q{x(y)} J^{−1}(y) ,

where J = det|∂y/∂x|. From the invariance follows E_p[F(p, q)] = E_{p̄}[F(p̄, q̄)], which requires that F(p, q) should be a function of the ratio of the arguments, i.e.,

F(p, q) = f(q/p)

for some f. Hence, any invariant divergence can be written as

D_f(p, q) = E_p[f(q/p)]    (3.41)

by using some function f(u). In order that D_f(p, p) = 0 holds, f(1) = 0 is required. Moreover, for the positivity D_f(p, q) ≥ 0, f should be a convex function. We further assume that f is a differentiable function up to the third order, and normalize f such that f''(1) = 1 holds. The above D_f(p, q) is the same as the f-divergence of Csiszár [1967a, b], who studied its properties in detail.
We can introduce a pair of dual differential-geometrical structures in S from any f-divergence D_f(p, q) by the use of the relations (3.37) ~ (3.39). These structures are indeed invariant under the transformations of x and θ. However, the following theorem again ascertains the universality of the α-geometry, i.e., of the Fisher information metric and the α-connections.

Theorem 3.10. The Fisher information is the only metric introduced by the invariant divergences D_f(p, q). The ±α-connections are the only connections introduced by the invariant divergences D_f(p, q), where α is given by α = 2f'''(1) + 3.

Proof. From

D_f(θ, θ') = ∫ p(x, θ) f{ p(x, θ')/p(x, θ) } dP(x) ,

we have by differentiating the above

∂_j ∂'_i D_f = − E_θ[(∂_j p)(∂_i p) f'' p(x, θ')/{p(x, θ)}³] ,

where ∂_j = ∂/∂θ^j, ∂'_i = ∂/∂θ'^i, and f'' = f''{p(x, θ')/p(x, θ)}. Hence, by putting θ' = θ, we have

g^f_ji(θ) = − ∂_j ∂'_i D_f(θ, θ') |_{θ'=θ} = f''(1) g_ji = g_ji(θ) .

Similarly, after tedious calculations, we have

Γ^f_ijk = − ∂_i ∂_j ∂'_k D_f(θ, θ') |_{θ'=θ} = Γ^{(0)}_ijk − {f'''(1) + 2f''(1) − 1/2} T_ijk ,
Γ^{f*}_ijk = − ∂'_i ∂'_j ∂_k D_f(θ, θ') |_{θ'=θ} = Γ^{(0)}_ijk + {f'''(1) + 2f''(1) − 1/2} T_ijk .

This proves that Γ^f_ijk is the α-connection Γ^{(α)}_ijk with α = 2f'''(1) + 3 and that Γ^{f*}_ijk is the −α-connection Γ^{(−α)}_ijk.

If we use f = f_α defined in (3.25), the f-divergence reduces to the α-divergence D_α(θ, θ') defined in (3.26). Since f'''_α(1) = (α − 3)/2, we have 2f'''_α(1) + 3 = α. Therefore, the connection derived from the f_α-divergence D_α(θ, θ') is the α-connection.

We can prove, by calculations, that any f-divergence D_f(θ, θ + dθ) can be expanded in the Taylor series as

D_f(θ, θ + dθ) = (1/2) g_ij(θ) dθ^i dθ^j + (1/2) Γ^{(−α/3)}_ijk dθ^i dθ^j dθ^k + O(|dθ|⁴) ,    (3.42)

where α = 2f'''(1) + 3.
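The leading term of (3.42) is easy to check numerically. In the sketch below (our own illustration, assuming NumPy), the Bernoulli model in its natural parameter has Fisher information g(θ) = p(1 − p), and the α-divergence of two nearby points is compared with (1/2) g dθ² for several α.

```python
import numpy as np

# Illustrative check of the leading term of (3.42) for the Bernoulli model:
# D_f(theta, theta + d) is approximately (1/2) g(theta) d^2.
def bern(theta):
    p = 1.0 / (1.0 + np.exp(-theta))
    return np.array([1 - p, p])

def D_alpha(p, q, alpha):
    return 4.0 / (1.0 - alpha**2) * (1.0 - np.sum(p**((1 - alpha) / 2) * q**((1 + alpha) / 2)))

theta, d = 0.4, 1e-3
p = bern(theta)
g = p[0] * p[1]                        # Fisher information in the natural parameter
for alpha in (-0.5, 0.0, 0.5):
    lhs = D_alpha(bern(theta), bern(theta + d), alpha)
    print(alpha, lhs, 0.5 * g * d**2)  # agree up to O(d^3)
```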

The above considerations lead us to the following conjecture. If the conjecture is not true, we then have to answer the question of what additional requirements guarantee the uniqueness of the Fisher information metric and the α-connections.

Conjecture. The (scalar multiples of the) Fisher information metric and the α-connections are the only metric and affine connections which are invariant under coordinate transformations of the sample space X and of the parameter space.

There remain many mathematical problems to be studied further. We list here some of them, which were discussed at the NATO Advanced Workshop on differential geometry in statistical inference in London, 1984.
1. The differential geometry of the non-parametric statistical model should be studied in detail. It is necessary to construct a geometrical theory of non-parametric statistical inference.
2. The conditions which guarantee the uniqueness of the α-geometry should be studied further. It is also interesting to study the geometrical structure induced from a general divergence function D{p(x), q(x)} (cf. Burbea and Rao, 1982b; Eguchi, 1983). If we do not require the invariance under coordinate transformations of the sample space X, it produces a geometry other than the α-structure.
3. Barndorff-Nielsen (1984) proposed a geometrical structure of a statistical manifold depending on an observed asymptotically ancillary statistic. This defines another differential geometry. Differential geometry of a statistical manifold which admits a transformation group is also interesting and important; see Barndorff-Nielsen (1984), Barndorff-Nielsen et al. (1982), Kariya (1983). Some types of non-regular statistical manifolds admit a Finsler-type geometry. It is interesting to study all of these wider classes of geometry of statistical manifolds.
4. A statistical manifold S is equipped with a Riemannian metric g_ij together with a symmetric tensor T_ijk, and the α-connections are defined therefrom. Thus, the α-geometry is represented by the object {S, g, T}, which a statistical manifold has. Conversely, when an α-geometrical object {S, g, T} is given, is it possible to define a manifold of statistical distributions whose α-geometry coincides with the given one? If not, what conditions are to be further imposed for the α-geometry to be realized as that of a statistical manifold? This is an unsolved problem.
5. When a metric tensor g_ij and a torsion-free connection Γ_ijk are given, we can always construct a dual connection Γ*_ijk from (3.12). However, it is not necessarily torsion-free, and the triplet (g, Γ, Γ*) does not necessarily give the α-structure. Nagaoka (private communication) proved that the dual Γ* is torsion-free, when and only when g_ij and Γ_ijk are derived from a divergence function D(θ, θ') by (3.37) and (3.38) as Eguchi did.
6. Given a Riemannian manifold {S, g}, is it possible to associate a tensor T_ijk such that the induced manifold {S, g, T} is α-flat for some α? If not, what is the condition to be imposed on the Riemannian metric to guarantee this? Lauritzen (1984) defined a new notion called conjugate symmetry. A statistical manifold S is said to be conjugate symmetric, when its Riemann-Christoffel curvature tensor satisfies, for any α,

R^{(α)}_ijkm = R^{(−α)}_ijkm ,

or equivalently

R^{(α)}_ijkm = − R^{(α)}_ijmk .

This always holds for α = 0, because the 0-connection is metric. Many statistical manifolds are conjugate symmetric. Lauritzen showed that a β-flat family for some β is always conjugate symmetric. He also presented an example of a statistical manifold which is not conjugate symmetric. We do not yet know the statistical implications of conjugate symmetry.

3.9. Notes

Many researchers have proposed various distance-like measures between two probability distributions. They are, for example, the Bhattacharyya [1943] distance, the Hellinger distance, Rao's Riemannian distance, the Jeffreys divergence [1948], the Kullback-Leibler [1951] information, the Chernoff [1952] distance, the Matusita [1955] distance, the Kagan divergence [1963], the Csiszár [1967a, b] f-divergence, etc. Chentsov [1972] and Csiszár [1975] remarked on the dualistic structures of the geometry of the exponential family based on the Kullback divergence (see also Efron [1978], Barndorff-Nielsen [1978]). Csiszár [1967a, b] studied the f-divergence (which includes the α-divergence as a special case) and showed the topological properties of the divergence. He also remarked on the relation between the α-divergence and the α-information (Rényi [1961]), which is a generalization of Shannon's entropy. The relation between generalized entropy and distance in statistical models is also studied by Burbea and Rao [1982a, b].

The relation between the α-connection and the α-divergence was pointed out by Amari [1982a]. The idea was further developed by Nagaoka and Amari [1982] such that Csiszár's geometry of the α-divergence and the Chentsov-Amari geometry of the α-connection are unified. (See also Ingarden [1981] for information geometry.) The concept of dual affine connections was introduced for this purpose (Nagaoka and Amari [1982]). Eguchi [1983] also studied the dualistic structures of affine connections derived from divergence functions. It is expected that this newly introduced concept will play an important role in applications of differential geometry to physics, statistics, information theory and other engineering sciences as well. The present chapter is mostly based on Nagaoka and Amari [1982]. There seem to be some difficulties in extending the geometrical structures to the function space of density functions. See Csiszár [1967b], Koshevnik and Levit [1976], Chentsov [1972], Pfanzagl [1982] in this respect.
II. HIGHER-ORDER ASYMPTOTIC THEORY OF STATISTICAL INFERENCE IN CURVED EXPONENTIAL FAMILIES

4. CURVED EXPONENTIAL FAMILIES AND EDGEWORTH EXPANSIONS

Part II is devoted to the higher-order asymptotic theory of statistical inference in a curved exponential family M imbedded in an exponential family S. A number of independent observations are summarized into a vector sufficient statistic x̄ in a curved exponential family, which defines an observed point or distribution in S. In Chapter 4, we decompose x̄ into a pair (û, v̂) of statistics such that û is asymptotically sufficient and v̂ is asymptotically ancillary. The Edgeworth expansion of the joint distribution p(û, v̂) of û and v̂ is given explicitly up to the third-order terms by using the related geometrical quantities in S and M.

4.1. Exponential family

We first study the geometry of the exponential family. A family S = {p(x, θ)} of distributions is said to be an exponential family or of exponential type, when the density function can be written in the following form

p(x, θ) = exp{θ^i x_i − ψ(θ)}    (4.1)

with respect to some carrier measure P(x), by choosing an adequate parametrization θ = (θ^i) and adequate random variables x = (x_i). (In Chapter 3, we used the expression exp{θ^i c_i(x) − ψ(θ)} for the exponential family. If we define new random variables x_i by x_i = c_i(x), we obtain the expression (4.1) as the density function of x.)


The parameter e of the above form is called the canonical or
105

natural parameter of the exponential family. The exponential family


has a number of good properties. Many popular families are of the
exponential type. For example, the family S = {N(~, 02)} of normal
distributions is of the exponential type. This can be shown as
follows. Since the density function of N(~, 0 2 ) is
p(x, ~, 0 2 ) = exp{(~/02)x - (1/202)x 2 - (~2/202) - 10g(12TI0)}
if we define new two-dimensional parameter e = (e l , 6 2 ) by

and new two-dimensional random variable x


x 2 = (x) 2 ,

the density function can be rewritten as


p(x, e) = exp {eix i - w(O)} ,
where
12 2 1 2 1
He) = - (e ) / (4e ) - "2log(- e ) + "'2 log 1T • (4.2)
Hence, S is an exponential family with the natural parameter e = (e l ,
62 ). The random variables Xl and x 2 are not independent but related
by x 2 = (x l )2, so that the dominating measure P(x) is concentrated on
the parabolla x 2 = (x l )2 in the (Xl' x 2 )-plane.
We next examine the geometrical structures of the manifold S of an exponential family. The following relations

∂_i ℓ(x, θ) = x_i − ∂_i ψ(θ) ,  ∂_i ∂_j ℓ(x, θ) = − ∂_i ∂_j ψ(θ)

are easily obtained from ℓ(x, θ) = θ^i x_i − ψ(θ). This shows that the normalization factor ψ(θ) defined by

ψ(θ) = log ∫ exp{θ^i x_i} dP

plays a fundamental role. It is the potential function in the sense of Chapter 3 and is related to the cumulant generating function. In fact, the expectation, covariance and third-order central moments of the x_i are given by

E[x_i] = ∂_i ψ(θ) ,  cov[x_i, x_j] = ∂_i ∂_j ψ(θ) ,
E[(x_i − ∂_i ψ)(x_j − ∂_j ψ)(x_k − ∂_k ψ)] = ∂_i ∂_j ∂_k ψ(θ) ,

respectively. These relations can be proved directly from the definition of ψ or from E[∂_i ℓ(x, θ)] = 0 and by calculating E[∂_i ∂_j ℓ] and E[∂_i ∂_j ∂_k ℓ].

The geometrical quantities are given in terms of the potential or cumulant generating function.

Theorem 4.1. The metric tensor and the α-connection of an exponential family are given, respectively, by

g_ij(θ) = ∂_i ∂_j ψ(θ) ,    (4.3)
Γ^{(α)}_ijk(θ) = ((1 − α)/2) ∂_i ∂_j ∂_k ψ(θ) ,    (4.4)

in the natural coordinate system θ. Especially, the exponential family is 1-flat and the natural parameter θ is 1-affine. The α = ±1 Riemann-Christoffel curvatures vanish identically.

Proof. The metric tensor g_ij is derived from (2.10) immediately. It is the covariance of ∂_i ℓ = x_i − ∂_i ψ. The α-connection (4.4) is obtained from the following relations,

E[∂_i ∂_j ℓ ∂_k ℓ] = E[− (∂_i ∂_j ψ) ∂_k ℓ(x, θ)] = 0 ,
T_ijk = E[∂_i ℓ ∂_j ℓ ∂_k ℓ] = ∂_i ∂_j ∂_k ψ(θ) ,    (4.5)

the latter of which is proved by calculating E[∂_i ∂_j ∂_k ℓ]. The Riemann-Christoffel curvature tensor is calculated from the connection (4.4); it vanishes for α = ±1.

Since an exponential family S is 1-flat, it is also −1-flat. Hence, there exists the dual coordinate system η which is −1-affine. It is given by η_i = ∂_i ψ(θ), since ψ is the potential function (Chapter 3). It is easy to show that

η_i = E[x_i] = ∂_i ψ(θ)

holds, so that the dual parameter η_i is the expectation of x_i with respect to p(x, θ). The η is called the expectation parameter, and
it defines the expectation coordinate system of S. The mapping between θ and η is bijective, and any distribution in S is specified by its η-coordinates.

We next study the geometrical quantities in terms of the expectation coordinate system η. The natural basis {∂^i}, ∂^i = ∂/∂η_i, of the tangent space in η is related to the natural basis {∂_j}, ∂_j = ∂/∂θ^j, in θ by

∂^i = g^{ij} ∂_j ,  or conversely  ∂_j = g_ji ∂^i ,

where g^{ji} is the inverse of the metric tensor g_ij. The metric tensor in the η-coordinate system is given by the inverse of g_ji because of

⟨∂^i, ∂^j⟩ = ⟨g^{ik} ∂_k, g^{jm} ∂_m⟩ = g^{ik} g^{jm} g_km = g^{ij} .

Similarly, we can obtain the α-connection by

Γ^{(α)ijk} = ⟨∇^{(α)}_{∂^i} ∂^j, ∂^k⟩ = − ((1 + α)/2) T^{ijk}

in the η-coordinate system, where T^{ijk} = g^{im} g^{jr} g^{ks} T_mrs. This vanishes for α = −1, showing that the η is −1-affine.

The dual potential φ(η) is defined implicitly by

ψ(θ) + φ(η) = θ^i η_i ,

and the following relations hold,

θ^i = ∂^i φ(η) ,  g^{ij}(η) = ∂^i ∂^j φ(η) .
The Kullback-Leibler information I[p(x, θ_2) : p(x, θ_1)] is the −1-divergence from p(x, θ_1) to p(x, θ_2) in our terminology and is defined by

D_{−1}(θ_1, θ_2) = − ∫ p(x, θ_1) log {p(x, θ_2)/p(x, θ_1)} dP .

It is explicitly given by

D_{−1}(θ_1, θ_2) = ψ(θ_2) + φ(η_1) − θ_2 · η_1
                = ψ(θ_2) − ψ(θ_1) − (θ_2 − θ_1) · η_1 .    (4.6)

The Pythagorean theorem holds for the −1-divergence with the help of the ±1-geodesics, as has been shown in §3.
Example 4.1. Normal distributions.

We again treat the family of normal distributions. This is of exponential type with the natural parameter θ = (θ_1, θ_2), θ_1 = μ/σ², θ_2 = − 1/(2σ²), and the potential ψ(θ) given in (4.2). In the following examples, we use the notation θ_i instead of θ^i, because θ^i might be confused with the i-th power of θ. The dual parameters are given from η_i = ∂_i ψ(θ) as

η_1 = − θ_1/(2θ_2) = μ ,  η_2 = (θ_1/(2θ_2))² − 1/(2θ_2) = μ² + σ² .

The dual potential is

φ(η) = − (1/2) log(η_2 − η_1²) − (1/2) log 2πe .    (4.7)

The metric tensor is given by

g_ij(θ) = [ σ²      2μσ²
            2μσ²    4μ²σ² + 2σ⁴ ] ,

g^{ij}(η) = (1/σ⁴) [ σ² + 2μ²   −μ
                     −μ          1/2 ] ,

where μ and σ² are considered as functions of θ or η. They differ from those given in Example 2.2, because the coordinate systems are different. The tensor T_ijk is given by

T_112 = 2σ⁴ ,  T_122 = 8μσ⁴ ,

and the α-connection is

Γ^{(α)}_ijk = ((1 − α)/2) T_ijk .

The Riemann-Christoffel curvature R_ijkm is given by R_1212 = (1 − α²)σ⁶ in this coordinate system θ, and the scalar curvature is K = − (1 − α²)/2, which is the same for any coordinate system.
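A small numerical sketch of this example (our own illustration, assuming NumPy): differentiating ψ(θ) of (4.2) by finite differences recovers the expectation parameters η_1 = μ and η_2 = μ² + σ².

```python
import numpy as np

# Illustrative check for Example 4.1: with theta_1 = mu/sigma^2 and
# theta_2 = -1/(2 sigma^2), the gradient of psi(theta) from (4.2)
# recovers eta_1 = mu and eta_2 = mu^2 + sigma^2.
def psi(t1, t2):
    return -t1**2 / (4 * t2) - 0.5 * np.log(-t2) + 0.5 * np.log(np.pi)

mu, sigma2 = 1.5, 2.0
t1, t2 = mu / sigma2, -1.0 / (2 * sigma2)
h = 1e-6
eta1 = (psi(t1 + h, t2) - psi(t1 - h, t2)) / (2 * h)
eta2 = (psi(t1, t2 + h) - psi(t1, t2 - h)) / (2 * h)
print(eta1, mu)              # both 1.5
print(eta2, mu**2 + sigma2)  # both 4.25
```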

4.2. Curved exponential family

A family M = {q(x, u)} of distributions parametrized by a vector parameter u is said to be a curved exponential family, when the density functions q(x, u) are written in the following form

q(x, u) = exp{θ^i(u) x_i − ψ[θ(u)]} ,    (4.8)

where θ = θ(u) is a (vector-valued) function of u. When u = (u^a), a = 1, 2, ..., m, is an m-dimensional parameter while θ = (θ^i), i = 1, ..., n, is n-dimensional (m < n), the M is called an (n, m)-curved exponential family. Geometrically speaking, M is an m-dimensional smooth submanifold imbedded in the n-dimensional manifold S of an exponential family, provided θ(u) is a sufficiently smooth function. The parameter u = (u^a) defines a coordinate system of M, and a point (distribution) whose coordinates are u in M is the same as the point (distribution) whose coordinates are θ(u) in S. Thus the manifold of an exponential family provides a frame in which statistical inference takes place in Part II. By introducing the notion of the fibre bundle of local exponential families, the present theory can be generalized to be applicable to more general families of distributions.

We first give an example of a curved exponential family, and show our method of approach.

Example 4.2. Let z be a normal random variable subject to N(1, 1). What we observe is x = uz, where u is an unknown multiplicative factor (u > 0). Therefore, the random variable x is subject to N(u, u²), and the probability density function q(x, u) is

q(x, u) = (1/(√(2π) u)) exp{− (x − u)²/(2u²)} .

By the use of the probability density function p(x, θ) = exp{θ^i x_i − ψ(θ)} of the normal distributions, it can be rewritten as

q(x, u) = p{x, θ(u)} ,

where θ(u) = (θ_1(u), θ_2(u)) is

θ_1(u) = μ/σ² = 1/u ,  θ_2(u) = − 1/(2σ²) = − 1/(2u²) .

Hence, the family M = {q(x, u)} is a (2, 1)-curved exponential family. The family M forms a one-dimensional submanifold (i.e. a
curve) in the two-dimensional manifold S. It is a parabola in the θ-plane of the natural coordinates θ, because θ(u) satisfies

θ_2 = − (1/2)(θ_1)²

(Fig. 4.1a). If we use the expectation parameter η = (η_1, η_2), the η-coordinates η(u) of the distribution specified by u are

η_1(u) = u ,  η_2(u) = 2u² .

Hence, the family M is represented also by a parabola η_2 = 2η_1² in the η-plane of the expectation coordinates (Fig. 4.1b).

Let us attach a one-dimensional submanifold A(u) of S to each point u, such that A(u) transverses M at the point θ(u) or η(u). In the present example, we attach a straight line A(u) at each η(u) in the η-coordinate system (Fig. 4.1b). The equation of A(u) is

η_1 = u + uv ,  η_2 = 2u² + u²v ,

Fig. 4.1a)  Fig. 4.1b)

i.e.,

A(u) = {η | η_1 = u + uv , η_2 = 2u² + u²v} ,

where v is the parameter specifying points on A(u). The parameter v can be regarded as a coordinate on the line A(u), where the origin v
= 0 is chosen at the intersection of M and A(u). Then, the pair (u, v) determines a point η in S. The point η determined by (u, v) is on A(u) and has the coordinate v on that A(u). We write the η-coordinates of the point as η(u, v).

We can use the pair (u, v) as a new coordinate system specifying points in S in a neighborhood of M. It should be remarked that A(u) and A(u') (u ≠ u') may intersect, so that the coordinate system (u, v) covers only a neighborhood of M. In this neighborhood, by solving η = η(u, v), we have

u = u(η) ,  v = v(η) ,

which give the u- and v-coordinates of a point η. We can use the (u, v)-coordinates to analyze the asymptotic behaviors of statistical inference procedures. We have used the η-coordinates for convenience' sake. The same argument holds in the θ-coordinates,

Fig. 4.2

although the submanifold A(u) attached to u is not linear in the θ-coordinate system, even when it is linear in the η-coordinate system.
Returning to the general case, we study the geometry of a curved exponential family M imbedded in S. To this end, it is convenient to introduce an ancillary family A = {A(u)} by attaching an (n − m)-dimensional submanifold A(u) to each point θ(u) in M, that transverses M at θ(u). The ancillary family A is naturally determined from the statistical inference procedure whose characteristics we would like to analyze, as will be shown later. Let v = (v^κ), κ = m+1, m+2, ..., n, be a coordinate system in A(u), where the origin v = 0 is put at the point θ(u) on M (see Fig. 4.2, where m = 1, n = 3, so that A(u) is two-dimensional). Moreover, it is assumed that the pair w = (u, v) or w = (u^1, ..., u^m, v^{m+1}, ..., v^n) can be used as a coordinate system in S (at least in some neighborhood of M) so that it forms a local foliation of S. We use indices a, b, c, etc. running from 1 to m to denote the quantities related to M, and indices κ, λ, μ, etc. running from m+1 to n to denote those related to A(u). Indices α, β, γ, etc. running from 1 to n are used to denote the quantities of S related to the new combined coordinate system w = (w^α) = (u^a, v^κ). Let θ(u, v) denote the point which is on A(u) and which has the coordinates v in A(u). The θ- and η-coordinates of this point are written as

θ = θ(w) = θ(u, v) ,  η = η(w) = η(u, v) ,

which define the coordinate transformations from w to θ and to η. We can express various geometrical quantities in the new coordinate system w. The quantities can be decomposed into u-, v- and mixed parts, which represent the geometrical structures of M, A(u) and their interaction, respectively. The natural basis vectors are represented by

∂_α = B^i_α ∂_i  or  ∂_α = B_{αi} ∂^i ,

where

B^i_α = ∂θ^i/∂w^α ,  B_{αi} = ∂η_i/∂w^α = B^j_α g_ji ,

and are decomposed into the two parts {∂_a} and {∂_κ}. Obviously the m vectors

∂_a = ∂/∂u^a ,  a = 1, ..., m ,

span the tangent space T(M) of M, and the n − m vectors

∂_κ = ∂/∂v^κ ,  κ = m+1, ..., n ,

span the tangent space T(A) of A(u), and the tangent space T(S) of S at a point (u, v) is decomposed into the direct sum

T(S) = T(M) ⊕ T(A) .
The metric tensor gaS in the w-coordinate system is written as

gaS = <
aa' dS
i j
BaBsgij )
BaiBSj g
ij

and is decomposed into

Obviously, gab and gKA are, respectively, the metric tensors of M and
A(u) . The mixed part gaK <aa' dK) evaluated at w = (u, 0)

represents the angles between the tangent spaces T(M) and T(A) at the
intersecting point w = (u, 0). When gaK = 0, M and A(u) are said to
intersect perpendicularly.
The components of the a-affine connection is given by

(4.9)

The following expression is also derived:

r~~~ <BSiV~i(Byjdj), B~ak>


BSi ByjB~ <V~i aj , ak ) + <a SByj )B~
l+a ()j (4.10)
2 Tsye + aSByj Be .
Its u-part r(a) gives the a-affine connection of M and its v-part
abc
r(a) gives the a-affine connection of A(u). When the ancillary
KAJ.!
family is orthogonal, i.e., gaK = 0, the a-curvature of M is given by
114

H(a) = (V Cl
abK
Cl
Cla b' K
= r(a)
abK
> (4.11)

and ~he a-curvature of A(u) is given by

H~~~ = <V~KClA' Cl a > = r~~~ (4.12)


The exponential (i.e., a z 1) curvature (which is denoted by H~~~)Of
M and the mixture (i. e., a = - 1) curvature (which is denoted by
H~~~) of A(u) play important roles in the following. The exponential
curvature of M is given from (4.9) and (4.11) by
H(e) = (Cl Bi)B . (4.13)
abK a b K~
The mixture curvature of A(u) is given from (4.10) and (4.12) by
(4.14)

Example 4.3. In the model M = {N(u, u 2 )} of Example 4.2 with


linear A(u)'s, the n-coordinates of a point specified by (u, v) is
given by

Hence, Bai
1 2
1 1 + v 2u(2 + V)]
Ba~.
2 u u2
where a = I, 2, and wI u, W2 v, the indices a, b, etc.
standing only for 1 and K, A, etc. standing only for 2. The tangent
i
vectors Cl a and Cl K of M and A(u) are given by Cl a Bai Cl , and the
metric tensor gaS is given by

= [gab gAb J o

J.
gaK gAK 3/2
where gaS is evaluated on M, i.e. at (u, 0). The gab = 3/u2 is the
Fisher information of M, and gaK = <Cl a , Cl K> = 0 implies that M and
A(u) are orthogonal at the intersection. The tensor Tasy is
T abc = l4/u 3 , TabK = - II u 2 ,
115

TaKA = 2/u , T = - 4
KA 11 '
in the w-coordinate system. By calculating (asB .)B~.gij and
y~ uJ
evaluating them at (u, 0), we have the components of the a-connection
as follows,
r (a)
- (3 + 7a)/u 3
abc
r(a) (3 + a) I (2u 2 )
aKb
r(a) - (1 + a) lu ,
KAa
Especially, the exponential curvature of M is
H(e) = _ l/u 2
abK .
This does not vanish, since M itself is not an exponential family.

The mixture curvature of A(u) vanishes,


H(m) = 0
KAa .
This is obvious, because A(u) is linear in the mixture affine
n-coordinate system.

4.3. Geometrical aspects of statistical inference

In many cases, statistical inference is carried into effect by

the use of a number of independent observations from an identical but


unknown distribution. Therefore, it is necessary to study the
geometrical structures of the parameter space in the case when
repeated observations are allowed. Let x, x, ... , x be N independent
1 2 N
observations from the same distribution p(x, e) e S. Then, their
joint distribution is given by the density function
N
p(x, ... , x; e) II p(x, e)
1 N i=l i
and its logarithm is
N
L JI.(x., e) .
I = log p(x, ... , x; e) =
1 N i=l ~
Therefore, the l-representationNof the basis vector a j becomes
a. I (x, ... , x; e) = La. JI. (x, e) .
J 1 N i=l J i
Since x's are independent and E[a.JI.(x, e)] 0 holds, the metric
i J
tensor based on N observations is
116

Similarly, the components of the a-connection are


- - 1 - a
E[ Cl j ClkR,ClmR, + -Z- Cl j -R,ClkR,ClmR,]
- -

= I E[CljClkR,ClmR, + Y CljR,ClkR,ClmR,] = Nrjk~


This shows that the metric tensor and the a-connection based on N
independent observations are N times those based on one observation.
Hence, the two geometric structures are similar, and it is not
necessary to study them separately.
In the case of an exponential family with the natural parameter
e, the joint density can be written as

p(~, ~; e) = eXP{t [ej~j - w(e)]}

(4.15)
where
- 1 ~ (4.16)
x N i~l ~
is the arithmetic mean of N observations. The statistic x is
sufficient, because the density function p(x, ... , x; e) depends on
1 N
x, ... , x only through X. Moreover,
1 N
I(x, ... , x; e) = NR,(x, e)
1 N
holds. The expectation and covariance of x- are

Let ~ or A be the maximum likelihood estimator (m.l.e.) of e or n in


the exponential family, e and n being connected by
ni = Cl i He) .
The estimator ~ is given by solving the likelihood equation CliR,(x,
e) = 0 or CliW(e) = xi' The n is simply given by Ai = xi' The A is
unbiased, E[A] = n, and its covariance is
(4.17)
which shows that the Cramer-Rao lower bound is exactly attained by
the m.l. e. A. The expectation parameter n is special in this
respect.
Let us consider some fundamental inference problems in a curved
exponential family M. Let x, ... , x be N independent observations
1 N
117

from q(x, u) = p[x, e(u)], where p(x, e) belongs to an exponential


family S. Then, the joint density function of x. x can be
1 N

written in the form


N
q(x. ... , X; u) II p{x. e (u)} [p{x. e(u)}]N.
1 N i=l i
which depends on x • ... , X -
only through x. Hence. x is a sufficient
1 N

statistic in this case. too. and there is no loss of information by


summarizing N observations x •...• x into one x. The sufficient
1 N
statistic x defines a special point (distribution) in the enveloping
exponential family S. It is the point (distribution) in S whose
n-coordinates are given by n = x. i.e .• the distribution p(x. 8) with
ni = ai~(e). We call the point e or n the observed point. The e or
n is the m.l.e. in the enlarged model S. Hence. it is a point in S
but it does not necessarily belong to M.
In the following. we treat statistical inference procedures
based on the sufficient statistic x or equivalently the observed
point e or n. Consider the problem of parameter estimation. In this
case. an estimator ft = ft(x) is a function of X. which d'etermines the
estimated ft from x. In other words. an estimator ft can be regarded
as a mapping from S to M. ft : S + M. which maps an observed point n=
x in S to the estimated point ft or the distribution q(x. ft) in M. We
use the n-coordinate system in S. because the observed point is given
directly by ~ = x in this coordinate system. Let A(u) be the inverse
image of the mapping ft.

A(u) = {n I u ft(n)} •

i.e .• A(u) is the set of the observed points which are mapped to u
by the estimator ft (Fig.4.3). It is assumed that {A(u)} forms a
smooth
118

Fig. 4.3

family of (n-m)-dimensional submanifolds in a neighborhood of M, so


that they define an ancillary family or a local foliation of S. This
is called the ancillary family associated with the estimator.
Conversely, an ancillary family A defines an estimator fi such that A

is associated with it. The characteristics of the estimator can be


obtained from the geometrical features of an ancillary family A =
{A(u)} associated with the estimator. In a general case, the
submanifold A(u) does not necessarily pass through the point n(u).
However, we show soon that when the estimator is consistent, A(u)

passes through n(u). We can introduce a coordinate system v in each


A(u) such that w = (u, v) forms a coordinate system of S (at least in

a neighborhood of M). Let 1) n(w) = 1)(u, v) be the coordinate


transformation from w = (u, v) to 1). Then, the w-coordinates ~ = (fi,

~) of the observed point A= x are given by solving


x = n(~) = n(a, ~) .
The sufficient statistic x is thus decomposed into the two statistics
Q and ~, which together are also sufficient, and Q obviously is the
119

estimator. If we know the (asymptotic) joint probability


distribution of (ft, ~), we can calculate various characteristics of
the estimator therefrom. The joint probability is obtained in the
form of the Edgeworth expansion in the next section. The

distribution of asymptotic ancillary statistics and the conditional

distribution of the estimator conditioned on the asymptotic ancillary

are also obtained therefrom.


Let us next consider the problem of testing a simple hypothesis

HO : u = Uo against the alternative Hi : u f Uo in an (n,m)-curved


exponential family M. The decision concerning the rejection or

acceptance of the hypothesis HO is made on the basis of the observed

point n= x in S. Hence, a test T is a mapping from S to the binary


set {r, r}, T : S ~ {r, r} such that the hypothesis HO is rejected
when T(x) = r. The inverse image R = T-l(r), which is called the

critical region of the test T, is the subset of S such that, when the

observed point belongs to R, the hypothesis HO is rej ected. The

characteristics of a test T depend on the geometrical features of the

critical region, so that they can be analyzed by the geometrical


method. We can introduce an ancillary family A = {A(u)} by attaching
an (n-m)-dimensional submanifold to each point n(u) ~ M such that the
critical region is surrounded by these submanifolds. The
decomposition of the sufficient statistic x into (ft, ~) by x = n(ft,

~) again plays a fundamental role in analyzing the higher-order


asymptotic performance of a test and in designing the optimal test,

as will be shown in a later chapter. The test statistic is a

function of ft only.
We will give the joint probability distribution of the new

statistics (ft, ~), which are determined depending on the associated

ancillary family A, in the form of the Edgeworth expansion.


120

4.4. Edgeworth expansion


When an ancillary family A = {A(u)} is given with a coordinate
system v in each A(u), the statistic representing the observed point
n= i is decomposed into two parts a and~, where x - = n(~) = n(a, ~).

When the true distribution is q(x, u), i.e., the true parameter is u,

the law of the large number guarantees that x converges to its

expectation, i.e. the point n(u, 0) as N tends to infinity.


Moreover, the covariance of x- is gij (u) /N, tending to 0 as N ... 00.

Hence, in order to evaluate the asymptotic distributions of these


statistics, it is customary to enlarge them and define the new random

variables,
/N{x - n(u, O)} , il vN(a - u) , vN~, (4.19)
w (il, v)

We then calculate the asymptotic joint probability distribution of w


(il, v).

By expanding x- = n(w+w/IN) at w (u, 0), we have the


following stochastic expansion

ni(u, 0) + d a n.~ (w)wa/IN + Zl (a a don.)wawS/N


.., ~
1 (d dod n· )wawSw Y/NIN + higher order terms
+ "6 a .., Y ~
By putting

Bai = dani' CaSi = dad Sn i , Da8yi = dadSdyni '


where all of the above quantities are evaluated at w = (u, 0), it is

rewritten as
X.~ = Ba~.wa + z _ CNO;waw S + _l_n
--1..--
6N aSy~
v~
.wawSwy + 0 (N- 3 / Z)
~..,~ p
where op
(N- 3 / Z) denotes a random variable of order N- 3 / Z . This can

again be rewritten in the following form,


wa =
gaSB~x; _ __1_ C awSwy _ __1__ D awSwyw e + 0 (N- 3 / Z)
.., ~ ZIN Sy 6N Sye p (4.Z0)
where indices a, 8, y, etc. are raised or lowered by multiplying the
metric tensor g as or gSa' and
C = C .B i D
Sya Sy~ a ' Syea
121

Lemma 4.1. The coefficients C..,ya


Q are the components of the
mixture (i.e., a = - 1) connection in the w-coordinate system
(m) _I, m ) (4.21)
Csya
r Sya -~VdSdy' da '
=

where the superscript m in Vm or rem) denotes the mixture (a - 1)

covariant derivative or connection. The coefficients Dsyoa are given


by

D
Syoa (4.22)

Proof. Since the coordinate system n is mixture-flat,

v~idj = 0 ,
from which Vm dj = 0 follows. Therefore,
dS
m m i i
VdSd y = VdS(Byid ) = (dSByi)d ,
V~SV~ydO = V~S{(dyBoi)di} = (dSdyBoi)di
are obtained. By taking the inner product of the above and da
Bjd j , (4.21) and (4.22) are derived.

The moments of xi are given by

E[xix j 1 = gij , E[x.x.Xkl


~ J
1
E[xiXjXkXml = 3g(ij g km) + N Sijkm '
where T ijk and Sijkm are the third and fourth order cumulants given
by

T ijk = didjdk1/i , Sijkm = di d j dk dm1/i (4.23)


and the bracket ( ) implies the symmetrization with respect to the

indices in it, e.g.,

3g(ij g km)
Since the statistic x tends to a normal random variable as N tends to
infinity, so does the statistic w as is seen from (4.20). The
average and covariance of ware given by

(4.24)
This enables us to calculate the expectation more accurately as

(4.25)
122

where the asymptotic bias term Ca is given by


Ca = C
ags y (4.26)
Sy
In order to calculate the moments of wa , it is convenient to
modify it into the (higher-order asymptotically) unbiased form as
w*a= wa + Ca(~)/(21N) , (4.27)

where the asymptotic bias term Ca is evaluated at~. In other words,

we modify ~ into
~*a = ~a + Ca(~)/(2N) ,

which is also a statistic, and w* = IN(~* - w). It easily follows

that
E[w*] = O(N- 3 / 2 ) .

Since w* is asymptotically normally distributed, the probability


density function p(w*; u) of w* under the true parameter u of the
distribution can be written in the following form
(4.28)
where AN(w*) is the small order term converging to 0 as N tends to
infinity, and
n[w*; gaS] = (detlgasl)1/2(2n)-n/2 exp{- ~ gaBw*aw*B} (4.29)
is the density function of the normal distribution with mean 0 and
covariance gaB. When we expand AN(w*) in terms of the (tensorial)
Hermite polynomials hat·· .a~ in W*
L c
AN(w*) = haL ... aR(W*)
k=l at·· . a ...
we get the Gram-Charlier expansion, where c tends to 0 as N
al" .ak
tends to infinity. If we expand it into the power series of N- l / 2
A (w*) = l N- k / 2f k (W*)
N k=l
we get the Edgeworth expansion, where fk are polynomials in w*. The
coefficients c of the Gram-Charlier expansion can be obtained
a1. ••• all.
by calculating the moments or cumulants of w*, as we shall show soon.
We can then obtain the Edgeworth expansion by evaluating the
coefficients up to the terms of necessary orders.
Now we define the tensorial Hermite polynomials h a1. ••• a~(w) of
123

the variable w = (wo.) with respect to the metric go.B' and study
their properties. Let us define a derivation operator DO. by DO.
go.B aB . Then, operating DO. repeatedly on the density function n(w)
n[w; go.B] of the normal distribution with covariance go.B, we easily
have
Do.n(w) - wo.n(w) ,
Do.Bn(w) (wo.wB _ go.B)n(w)
Do.Byn(w) (3g(o.B wY) _ wo.wBwY)n(w)
etc., where Do.1. ... a. It is the abbreviation of Do.1 Do.,t ... Do.ll. The
tensorial Hermite polynomials are defined by the coefficient
polynomials of the above derivations,
ho.1·· ·o.~(w) = (_1)k{Do.1·· ·Cl.kn(w)}/n(w) (4.30)
We show some of them explicitly,
hO 1 hCl. = wCl. hCl.B
hCl.By wo.wBwY _3g(o.BwY) ,
hCl.Byo = wCl.wBwYwo _ 6g(Cl.BwYwo ) + 3g(Cl.BgYo)

The following orthogonality relations are very important,


f hCl.J.· .. Cl.k. (w)h B1 ··· Bm(w)n(w)dw

= [k! g (CI.~ I B1 I gCl. .. I B.. I ... gCl.k)B~, when k m


(4.31)
o , k f m

where the k! times the bracket (0. 1 1 Bli ... Cl.k) implies that all the
possible k! permutations are taken for the indices 0. 1 ' ... , Cl. k and
all the resultant terms are added.
When w is decomposed into wCl. (u a , v K) and the orthogonality
relation gaK = 0 holds (i.e., aa and a K are orthogonal), the
following Pythagorean decomposition
Cl.B ab KA
gCl.Bw w = gabu u + gKAv v
is valid. This implies
n(w; gCl.B) = n(u; gab)n(v; gKA)
and hence by operating Das. ... a~ Kl. ... K!o on the both sides of the
124

above follows
hal"'~ Kl (4.32)
Moreover, the following relation is useful
f -00
c
at. ... a",
hat, ... a,(w)n(w)dv=c
at. ... a ...
hat. ... all(u)n(u) (4.33)
Returning to the Gram-Charlier expansion
p(w*; u) = n(w*; g ){l + I c hal" .ak(w*)}
as k=l at ... alt
the contravariant (i. e., upper index) version of the coefficients
are given by the use of the orthogonality relation (4.31) as

where indices are raised or lowered by the use of the metric tensor

gas or gaS' Since h at ··· ak is a polynomial in w*, its expectation


can be written as a linear combination of the moments

of degrees s not larger than k. As is well known, the moments can be


written as linear combinations of the cumulants Kat. ... ak. Moreover,
the k-th cumulant of w* is of order N-(k-2)/2 (k ~ 2) and ~a = Ka

O(N- 3 / 2 ). Hence, in order to evaluate the coefficients cat. . .. a~ up


-1
to the terms of order N , it is convenient to use the cumulant
instead of moments. We then have the following relations which hold
up to the terms of order N- l ,
2caS = KaS _ gas, 3caSy = K
aSy ,

24caSYo = KaSyo , 72caSyoEo = K(aSYK oEo )

and all the other coefficients vanish except for the terms of order
higher than N- l . Hence, in order to get the Edgeworth expansion up
to the order of N- l , we need only to calculate the moments of w*a up
to the fourth order. The cumulants are easily obtained therefrom.
The calculation of the covariance (second moment) of w* is most
cumbersome. Although we can get it by direct calculations from

(4.20), there is a trick to calculate it. The following is a general


useful identity,

E[k(x, W)da~(X, w)]


125

which is proved by integration by parts. Now put k w*a or wa . We


then have
E[W*a dBt ] = o~ + 0(N- 3 / 2 ) ,

E[wadBt] = o~ - ~N dBC a + 0(N- 3 / 2 )


On the other hand, we have
w*a - gaB dBt = 2~ (CByaw*Bw*Y - Ca ) + O(N- l )
By calculating the covariance of the both sides, we have
E[w*aw*B gaB + -1C a C BgEYgOO + 0(N- 3 / 2 )
2N yo EO
It is interesting to note that the term DBYOawBwYwo/N is not
necessary for calculating the second-order moment of the bias
corrected w*. We have the following theorems by careful calculations
of the moments of w* by the use of (4.20) ..

Theorem 4.2. The Edgeworth expansion of the distribution of w*


up to the terms of order N- l is given by
p(w*; u) = n(w*; gaB){l + AN(w*)},

AN(w*) = _1_ K haBy (w*) + L C2 haS (w*) + _1_ K h aBYo (w*)


61N aBy 4N aB 24N aByo

+ 1 K K haByoEo(w*) + 0(N- 3 / 2 )} , (4.34)


""i2N aSy OEO

where the coefficients are given in the geometrical terms as


K = - 3r(-1/3) = T - 3r(m) (4.35)
aBy aBy aBy aBy
C2 r(m) r(m) gEYgOO (4.36)
aB yoa 80B

(4.37)

It is easy to obtain the Edgeworth expansion of the unmodified w


from that of w*.
126

Theorem 4.3.
p(w; u) n(w'; gaB) {l + AN(w')
- ...L(d CY)g hIlB(w')} + O(N- 3 / 2 )} (4.38)
2N Il yB
where
(4.39)
It may be more natural to correct the bias of an estimator 0 not
at ~ = (0, ~) but at (0, 0). To this end, we define the partial bias
correction by
~** = ~ + C{(O, 0)}/(2N)

Then, by defining

~** = ~ - E(u, O)[w]


we have the following theorem.

Theorem 4.4.
P (w** ; u)

The higher-order asymptotic theory of statistical inference is


constructed upon the present Edgeworth expansion in the following.
It should be noted that the joint distributions of (0, ~) etc. are
obtained by the above theorems. By integrating them with respect to
V, v* or v**, we have the Edgeworth expansion of ij, ij>~ or ij**. When
the ancillary family A is orthogonal, i.e., gaK = 0, the integration

is easily carried out by using (4.33). When gaK is not equal to 0


but is a quantity of order N- l / 2 , we need a little more careful
calculations. Such calculations are required when we evaluate the
higher-order asymptotic behaviors of tests. In either case, it will
easily be proved that the distributions of ij* and ij** are identical
up to the terms of order N- l .
127

4.5. Notes
The curved exponential family is a statistical model which is
covenient in constructing a differential geometrical theory, because
there exists a finite-dimensional sufficient statistic x irrespective
of the number N of independent observations. (It is possible to
generalize the geometrical theory to be applicable to a more general
family. ) The term "curved exponential family" was introduced by
Efron [1975, 1978], and has been widely used. The dualistic
structures of the exponential family are studied in detail by
Barndorff-Nielsen [1978]. See also Chentsov [1972], Efron [1978].
The differential geometry of the exponential and curved exponential
family are studied in Amari [1980], in which he introduced the
ancillary family A(u) with a related coordinate system. This gives a
geometrical framework for statistical inference.
The Edgeworth and Gram-Charlier expansions are explained in many
books, e.g., in Kendall and Stuart [1963]. The Edgeworth expansion
of the
distribution of an estimator is evaluated up to the third
-1
order (i. e., up to the term of order N ) by a number of researchers
who have constructed the so-called higher-order asymptotic theory of
statistical inference (see, e.g., LeCam (1956), Rao (1961, 1962),
Akahira and Takeuchi [198laJ. Pfanzagl [1980], Ghosh and Subramanyam
[1974], Chibisov [1972], Skovgaard [1982], etc.). The present
treatment is based on Amari and Kumon [1983]. The merits of the
present method are the following two: Firstly, the joint density
function of a and ~ is evaluated for the frist time, so that the
density function of ~ (which turns out to be asymptotically
ancillary) and the conditional distribution pea I ~) are easily
ob tained therefrom. Secondly, the geometrical interpretation is
given to each term in the expansion, so that we can see which terms
are invariant and which terms depend on the estimator or on the
manner of parameterization of M. (See also McCullagn, 1984b.)
5. ASYMPTOTIC THEORY OF ESTIMATION

A higher-order asymptotic theory of estimation is


presented in this Chapter in the framework of the geometry
of the model M and the ancillary family A associated with
the estimator. Conditions for the consistency and

efficiency of an estimator are given in geometrical terms


of A. The higher-order terms of the covariance of an
efficient estimators are decomposed into the sum of three

non-negative geometrical terms. This proves that the bias


corrected maximum likelihood estimator is the best

estimator from the point of view of the third order


asymptotic evaluation. The effect of parametrization is
elucidated from the geometrical viewpoint.

5.1. Consistency and efficiency of estimators


We analyze the higher-order asymptotic behaviors of a smooth

estimator ft(x) in an (n,m)-curved exponential family M = {q(x, u)}.


It is assumed that the estimator ft is a function, independently of N,
of the arithmetic mean x = (L x)/N of N independent observations. It
i
is easy to extend the theory to the case when ft(x) depends explicitly
on N. The sufficient statistic x can be identified with the observed

point A= x in manifold S of the enveloping exponential family in the


n-coordinate system. Hence, an estimator ft defines a mapping ft : S +

M. The inverse image of u by the estimator ft is


A(u) = ft-l(u) = {nE S I ft(n) = u} , (5.1)
which forms an (n-m) -dimensional submanifold A(u) attached to the
point UE M. The value of the estimator ft is u when the observed

point A= x belongs to A(u). The family A = {A(u)} of these A(u)'s


129

is the ancillary family associated with the estimator ft. The


asymptotic behaviors of an estimator ft are closely related to the
geometric properties of the associated ancillary fa~ily.

Theorem 5.1. An estimator ft is consistent, when and only when


every point II (u) Ii: Me S is included in the associated ancillary
submanifold A(u) attached to the point u.

Proof. Since the expectation of x is E[x] = ll(U) when u is the true


parameter, the arithmetic mean x converges to II (u) as N tends to
infinity. Hence, the estimator ft is consistent, when and only when
ft{ll(U)} = u, i.e., when and only when A(u) = ft-l(u) includes the
point ll(U).
We hereafter treat the class of consistent estimators. We can
introduce an appropriate coordinate system v in each A(u) such that w
(u, v) forms a coordinate system in a neighborhood of M. The

origin v o is put at the intersection of A(u) and M. The


sufficient statistic x is decomposed into (ft, ~) by x= ll(ft, ~). The
Edgeworth expansion of the distribution of the estimator ft can easily
be obtained by integrating with respect to ~ the joint distribution
of ~ = (ft, ~), whose Edgeworth expansion is given in Theorems 4.2 and
4.3 in Chapter 4.
We can evaluate an estimator ft by various criteria. Here, we
evaluate it by the mean square error or equivalently the expectation
where u a = iN"(fta - u a ) and u = (ua ) is the true parameter.
Let us expand the error in the power series of N- l / 2 ,
E[uaub ] = gfb(u) + g~b(U)N-l/2 + g~b(U)N-l + O(N- 3 / 2 ) (5.2)

A consistent estimator is said to be (first-order or Fisher)


efficient, when its first-order term gfb(u) is minimal at all u among
all other consistent estimators. Since gfb is a matrix, the
minimality of a matrix is defined by the order relation h ab ~ gab
130

implying that h ab gab is a non-negative definite matrix. A

first-order efficient estimator is said to be second-order efficient,


when its second-order term g~b(u) is minimal at all u among all other
first-order efficient estimators. Similarly, a second-order
efficient estimator is third-order efficient, when its third-order
term g~b(u) is minimal at all u among all other second-order
efficient estimators.
We first search for the geometric properties of the first-order
efficient estimator. The first-order term of the distribution of u
is given by integrating with respect to v the joint distribution of w
= (ii, v),
p(w; u)
By virtue of
-a-B _a_b + 2 -a-K + -K-
gaB w w gabu u gaK u v gKA v vA
g ,(v K + gK~g ua)(v A + gAVg bub) +
KI\ ~a v
we have

In(w; gaB)dv = Jc exp{- ~aBWawB}dV = n(u; glab) ,


where c is the normalizing constant, gKA is the inverse matrix of gKA

and

glab(u) = gab(u) - ga~(u)gbv(u)g~v(u)


.
1S th
e ·1nverse 0 f t h e asymptot1c
.var1ance
. ab (u)
gl 0 f u. S·1nce t h e

term ga~gbvg~V is positive-semi-definite, glab is maximized and hence


ab 0 holds. This leads to
gl is minimized, when and only when g a~ (u) =

the following theorem, because of ga~ = <aa' a~ >.


Theorem 5. 2 . The covariance of a consistent estimator 11 is
given by
E[(l1a _ u a )(l1b _ u b )] = ~ g~b + O(N-2),

where glab is the inverse of

(5.3)

A consistent estimator is first-order efficient, when and only when


131

the associated ancillary family is orthogonal, i.e., A(u) is


orthogonal to M, <aa' alJ> = galJ(u) = O.

This is the geometrical interpretaion of the well-known result.


The term glab reduces to the Fisher information gab for an efficient
estimator, and the asymptotic variance g~b is equal to the inverse
gab of gba' The first-order term of the distribution of an efficient
estimator ft is
p(ii; u)

5.2. Second- and third-order efficient estimator


The Edgeworth expansion of the distribution of the
bias-corrected first-order efficient estimator ft* or ft** is
calculated here. Due to the relation
ii*a = ii**a - VKa KCa /(2N) ,
the moments of ii* coincide with those of ii** up to the terms of order
N- l . Hence, their distributions are the same up to the term of order
N- l . Therefore, in the following, we simply identify ft* and denote
by ft* the estimator ft** which is bias-corrected at (ft, 0). The bias
of an estimator ft is given by
E[fta _ u a ] = ba(u) + O(N- 3 / 2 ),
where
b a(u) = -
1 Ca
2N = -
1 Cas a g as
2N (5.4)

is called the asymptotic bias of an first-order efficient estimator.


By decomposing gas and gKA, we have

cd g
=
C a cd + C a KA = r(m) cd + H(m)a KA
Ca
KA g cd g KA g ,
because of gaK = 0, Ccd a = r(m)a and C a = H(m)a
cd KA KA'
a
Hence the asymptotic bias b of an efficient estimator fta is given by
the sum of the two terms, one is derived from the mixture connection
of 'M and is common to all the efficient estimators, and the other is
derived from the mixture curvature of the associated A which depends
132

on the estimator. The bias-corrected estimator (ft**) is then written


as
ft* = ft - b(ft) .
The distribution of ii* or ii** is obtained by integrating (4.34) or
(4.40) with respect to v* or v** by the use of the relation g aK = 0,
giving the same result.

Theorem 5.3. The distribution of the bias corrected first-order


efficient estimator ii* is expanded as
p(ii*; u) = n[ii*; gab(u)]{l + AN(ii*;u)} + O(N- 3 / 2 ) , (5.5)

AN(ii*; u) = -L K h abc + ...L C2 h ab + ~ K h abcd


6,1N" abc 4N ab 24N abcd

+ __
1_ K K
72N abc def
habcdef ,

where h abc etc. are the Hermite polynomials in ii* with respect to the
metric gab· The third and fourth cumulants of ii* are given by

and they are common to all the first-order efficient estimators. The
estimators differ only in the term
represents the geometric properties of the associated ancillary
family A as
2
Cab (rm)2 + 2(He )2 + (Hm)2 (5.6)
ab M ab A ab '
where

(rm)2 r(m) r(m) gcegdf (5.7)


ab cda efb

(He )2 H(e) H(e) gCdgKA (5.8)


M ab aCK bdA
H(m) H(m) gKVgq.l
(H~);b KAa vllb
(5.9)

Proof. Since the associated ancillary family A is orthogonal, we can


use the relations (4.32), (4.33) and (4.31) when we integrate p(w*;
u) in (4.34) with respect to v*. The identity
133

CaKb <V~a' db) = da(d K, db> - <d K, V~adb)

d - H(e) (5.10)
= agKb abK

2
is used in calculating Cab' where gKb o.

We define the contravariant versions of the quantities (5.7),


(5.8) and (5.9) by
(r m)2ab g
ac bd(rm)2
g cd
(He )2ab ac bd(He)2
M g g M cd
(H~)2ab ac bd(Hm)2
g g A cd
They are, respectively, the square of the mixture connection of M,
the square of the exponential curvature of M, and the square of the
mixture curvature of the ancillary submanifold A(u). All of them are
non-negative definite. The mean square error of a first-order
efficient estimator is obtained by calculating E[u*au*b j by the use
of (5.5), where the orthogonality of the Hermite polynomials
guarantees
J n(u*)u*au*b h CI ••• c, (u*)du* 0

except for p = 2, and

for p 2.

Theorem 5.4. The mean square error of a bias corrected


first-order efficient estimator is given by
E[u*au*b j = gab + 2~ {(rm)2ab + 2(He )2ab + (Hm)2ab} + O(N- 3 / 2 ) .
M A (5.11)

The first-order term gab is the inverse of the Fisher


information gba of M. The second-order term, i.e., the term of order
N- l / 2 , vanishes for all the first-order efficient estimators, so that
a first-order efficient estimator is automatically second-order
efficient. The third-order term is decomposed into the sum of three
134

non-negative terms. The first is a half of the square of the


components of the mixture connection. It depends on the manner of
parametrization of M, but it is common to all the estimators. If we
adopt the mixture normal coordinate system at a specific point u, it
vanishes at this point. The second is the square of the exponential
curvature of the model M. It is a tensor depending on the
geometrical property of M, but not depending on the manner of
parametrization nor the manner of estimation. The third is a half of
the square of the mixture curvature of the ancillary submanifold A(u)
at v = O. Only this term depends on the estimator. Hence, we have
the following theorem.

Theorem 5.5. A bias-corrected first-order efficient estimator


is automatically second-order efficient. It is third-order efficient
when, and only when, the associated ancillary submanifold A(u) has
zero mixture curvature at v = O.

We can nO..l evaluate the maximum likelihood estimator (m.l. e~ ft


by analyzing the geometric features of the associated ancillary
family A. The m.l.e. is given from the observed point x by solving
the likelihood equation

Since
R,(x, ft) e(ft)·x - ~{e(u)},
it is written as
i -
Ba(ft) {Xi - ni(ft)} = 0, (5.12)
where Bi aei/au a . This shows that the associated ancillary
a
manifold A(u) = ft-l(u) rigging to point u is given by

A(u) = {n I B;(uHni - ni (u)} O}.


In other words, the points n in A(u) satisfy the linear equation

(5. l3)
135

This shows that A(u) is a linear submanifold in the n-coordinate


system so that it is mixture flat. It passes through the point n(u).
Let Xi = BKi (K = m + 1 •...• n) be n-m independent solutions of the
linear equation

B;(U)X i = O.
Then. the general solutions of (5.13) are given as

n 1· = n 1.(u. v) = BK1.v K - n.(u).


1

which is the parametric representation of the points in A(u). Here.


v K are the parameters which play the role of a linear coordinate
system in A(u). The tangent vectors of A(u) are given by
i
'\ = BKi (u) 0 •
which are orthogonal to the tangent vectors

0a = B;(u)oj
of M. because of
oK > o.

Theorem 5.6. The ancillary family A(u) associated with the


maximum likelihood estimator is a mixture-flat submanifold passing
through n(u) and orthogonal to M. It is consistent. first-order (and
hence second-order) efficient. and the bias corrected m.l.e. is
third-order efficient.

The theorem can be proved by the following direct argument.


Given an observed point n = X. the m.l.e. 0 is the one that maximizes
the log likelihood
~{x. 6(u)} = 6(u) x - ~{6(u)} .
Since the -l-divergence D_l{~' 6(u)} from the observed point ~ to a
point 6(u) in M is written as

D_l{~' 6(u)} Dl{n(u). A} = ~(A) + ~(u) - 6(u)·A


~(A) - ~{x. (u)}.

the m.l. e. is given by the point u on the model M at which the


136

-l-divergence from the observed point f\ is minimized, because the


negative entropy 4>(fI) does no t depend on u. It should be also
remarked that D_ l {§', 8(u)} is the Kullback information from '§' to
8 (u), so that the information is minimized at the m.l. e. 11. The
m.l.e. is thus given by the -l-projection from the observed point x
to M, showing that the associated ancillary family A(u) is -l-flat
(mixture-flat), orthogonal to M and passing through n(u).

Example 5.1. We analyze estimators in the model M = {N(u, u 2 )}


of Chapter 4. The maximum likelihood estimator 11 is given by solving
the likelihood equation
- (1/112) (Xl - 11) + (1/11 3 )(x 2 - 2112) 0
or
112 + ftXl - x2 = 0
It is a non-linear equation in 11. However, the ancillary submanifold
A(u) attached to u is linear in the n-coordinate system, and is given
by
2
A(u) = {n I unl - n2 + u = O}
(Fig.s.l.). It is given in the following parametric form,
nl(u, v) u + uv, n2(u, v) = 2u 2 + u 2v ,
by using an adequate parameter v. This ancillary family is the same
as that in Example 4.3, and various geometric quantities have already
been calculated there. Since <(la' (lK) = gaK = 0, 11 is efficient.
2 _ / (m) / 3 (e) _ 2
From the relations gab 3 / u , gKA - 3 2, rabc 4 u , HabK - - l/u ,
H(m)
KAa = 0, the asymptotic bias is given by b(u) - (2u)/(9N), so that
11* =
{I + 2/(9N)}11 .
From (r m)2ab = (16/81)u 2 , (H~)2ab 0, (~)2ab (2/81)u 2 , the mean
square error of 11* is
2 1 10 2
E[ (11* - u) ] = ( 3N + 81N2)u + O(N -3 ) .
137

2 '12

Fig. 5.1

There are many other first-order efficient estimators. Let 0'


be the dual of the m.l.e. at which the l-divergence Dl(~' 8(u)) from
the observed point ~ to M is minimized, or equivalently at which the
-l-divergence (Kullback information) from M to the observed point ~
is minimized. Obviously, the dual m.l.e. 0' is given by the
l-projection of ~ to M. Hence, the associated ancillary A' (u) is
l-autoparallel. That is, it is a linear subspace passing through the
point 8(u) in the 8-coordinate system, and is orthogonal to the
tangent vector da (B; (u)) of M at u. The equation of A' (u) is
written as

B;(U){8 j - 8 j (U)}gij(u) = 0 ,
or 48 2 u 2 + 8 1u + 1 = 0 in the present case.
By the use of the
2 2
relation between 8 and n, this can be rewritten as 2u - unl + nl -
n2 = 0 in the n-coordinate system. Hence, A'(u) is a parabolla in
The n-coordinate system (Fig. 5.1), and the dual m.l.e. 0' is given
138

by solving
2ft,2 - xlft' + (x l )2 - x 2 = 0 .
By introducing a coordinate v in A' (u), we have the new (u, v)
coordinate system such that

We can calculate various geometrical quantities from the above. As

= 0, so that ft' is first-order efficient,


0, H~~~ = 2/u 3 , (H~) 2ab = (16/ 81)u2 .
Hence, the asymptotic bias is -4u/(9N), and the bias-corrected

estimator is
ft'* = {l + 4/(9N)}ft'
The mean square error is
1
E[(Q'* - u)2] ( 3N + -22 )u
2
+ 0 (N
-3
)•
9N
This is second-order efficient, but is not third-order efficient

KAa T~ 0 .
because of H(m)

In the present example, the random variable x is subject to N(u,

u 2 ) or

x = u + £u
2
where £u ~ N(O, u). Hence, given N independent observations I' ... ,
~, we can consider the least square estimator ~S' which minimizes

2 (~.. -
i.
2
u) .

The estimator is given by the arithmetic mean of f'


-
ftLS = xl'
in which x 2 plays no role. The associated ancillary family is given

by

A(u) = {n= (nl' n2) I nl=u},


which is a horizontal line passing through n(u) = (u, 2u 2 ). Hence,

ftLS is consistent. A parametric representation of the A(u) is

nl(u, v) = u, n2(u, v) = 2u 2 + v.
The tangent vector da of M is

[1, 4u]
139

and the tangent vector of A(u) is


i
'\ = BKi d , [0, 1].
Hence, we can calculate
_ ij
gaB - BaiBSjg
by the use of gij evaluated at w = (u, 0) (cf. Example 4.1),

as

gaB [
gab gaK J
gAb gAK
Since gaK f 0, the A(u) is not orthogonal. Hence, ~S is not
first-order efficient. From
l/u 2 ,
we have
2 1 2 -2
E [(11LS - u) ] ="'Nu + 0 (N ).

Since the bias-corrected estimator is third-order efficient if


the associated ancillary family A(u) is orthogonal to M and
mixture-flat at v = 0, there exist an infinite number of third-order
efficient estimators. Among others, the m.l. e. is special in the
sense that its ancillary submanifolds are everywhere mixture-flat.
It should be noted that a first-order efficient estimator can always
be improved to be third-order efficient by the following method.
Given a first-order efficient but not third-order efficient estimator
11, the sufficient statistic x can be decomposed into (0, ~) such that
x = 11(11, ~). It will be shown in a later subsection that the
statistic v, which is discarded when we use the estimator U, is
asymptotically ancillary. We improve 0 to ~ by using the ancillary ~
as
140

(5.14)
When H~~! = 0, no improvement is possible, because ft is already
third-order efficient.

Theorem 5 . 7 . The bias-corrected version ~* of the improved


estimator ft is third-order efficient.

Proof. Let us put U /N(t't-u), and substitute i t in the expression


=
_1_
na g ab~_xi 1
2 IN' CaSawawS + O(N- )
which obtained from (4.20) by taking the u-part of W. Then, by
a
decomposing the term including CaS ' we have
~a = abR.i- ....L (m)a-b-c 2H(e)a-b-K)
g -bxi - ilFl (r bc u u - bK U V
The term H~~~ vanishes in the above expression of ila, so that the

dav-KvA .
improved ~ does not include the term H(m) In other words, the
ancillary submanifo1d is modified by (5.14) such that the improved
ancillary family is free from the mixture curvature at v = O. Hence
~* is third-order efficient.

In the case of Example 5.1., the improved dual m.l.e. is given


by
~ = ft'

where the asymptotic ancillary ~' is the solution of x- n(ft'. ~')

or
2v 2 - xlv
- -2
+ 2x1 - x2 = 0 .
The improved ~ is third-order efficient. but is different from the
m.l.e.

Finally, it should be remarked that all the terms in AN of the


Edgeworth expansion (5.5) are common to all the first-order efficient
estimators, except for one term (H~);b included in C;b' This implies
that the third order characteristics of an efficient estimator
141

depends only on the mixture curvature (H~);b of the associated


ancillary family. Moreover, when (H~);b = 0 holds, the estimator is
superior to any other estimators up to the third order not only by
the mean square error criterion but by any reasonable criterion, as

is understood from (5.5). More precisely, it is superior to others


under any monotone loss function. See Ghosh et al. (1980), Akahira
and Takeuchi (1981).

5.3. Third-order error of estimator without bias correction


We shall see in this subsection what will result if we evaluate
the mean square error of a first-order efficient estimator without

correcting the bias. The Edgeworth expansion of the distribution of


a first-order efficient estimator ft is obtained by integrating p(w;
u) in (4.38) with respect to v. The result is almost similar to
(5.5)

Theorem 5.8. The Edgeworth expansion of the distribution of u


is given by
p(ij; u) = n[ij'; gab(u)]{l + BN(ij; u)} + O(N- 3/ 2 ) , (5.15 )

AN(ij'; u) - 1
2NI (
daC d) gdb h ab ,

1 a
UN C (u) .

Theorem 5.9. The mean square error of a first-order efficient


estimator without bias correction is given by

Since the bias term Ca includes H~~)ag KA., the third-order term

of the mean square error depends not only on the mixture curvature
H(m)(U) but also on the derivative d H(m)(u) of the mixture
KAa b KAa
curvature. This implies that there is no ancillary family A which
142

minimizes the third-order term uniformly in u. It is indeed always


possible to assign an arbitrary large value to 2g da ad Cb at a specific
point u at the sacrifice of the values of H~~! at other u' s . This
shows that there are no third-order efficient estimators in the class
without bias correction.

It is sometimes possible to improve an estimator in the sense of


the mean square error in a wide range of u by adding a bias term.
However, the superiority of the resultant estimator depends on the
manner of parametrization. This is because the mean square error
criterion is invariant under the choice of parameters only when the
bias correction is done. Once the bias term is corrected, we can

compare the behaviors of all the estimators up to the third order


terms invariantly under parametrizations. The third-order efficient
estimator exists in this situation irrespectively of the manner of

parametrization.
We give a rather tricky example to show that one can further

"improve", in a specific parametrization, the third-order efficient


estimator, when the bias correction does not take place. Stein's
contraction estimator can also be understood from this point of view.

Example 5.2. Let x = (y, z) be a two-dimensional normal random


variable, with y ~ N(u, 1), z ~ N(O, 1), and y and z are independent.
We assume that u is larger than some positive constant. The family M
{q(x, u)} forms a rather trivial curved exponential family,

imbedded in S = {p(x, 8)},


p(x, 8) = exp{- ~ (y - 8 1 )2 + (z - 8 2 )2) - 10g(21T) 1
1 122 122
ZiT exp { - T (z + Y )} exp {y 8 1 + z 8 2 - "2 (8 1 + 8 2 )}

The imbedding q(x, u) = p{x, 8(u)} is given by


8(u) (8 l (u), 8 2 (U)), 8 l (u) = u, 8 2 (u) = 0,
=

where we again use 8 i instead of 8 i . Since the imbedding 8 (u) is


143

linear, M is l-flat in S implying that M itself is an exponential


family. The manifold S is Euclidean,

[: :]
and the a-connection identically vanishes for all a, o.
The n-coordinates and 8-coordinates coincide, nl = 81 , n2 = 82 .
In this problem, only y is related to the unknown parameter u,
and z is merely a noise presented independently of u. Hence, y is a
sufficient statistic and it is plausible to assume that the m.l. e.
given by ft = Y is the optimal estimator. However, we consider for
comparison the following c-estimator
ftc = y- + cz-2-
/y ,
where c is a constant. Obviously GO (c = 0) is the m.l.e., and ftc is

a modification of ftO depending on the noise z which is produced


independently of u. The associated ancillary family A(u) is given by
the curves
A(u)
in S, which pass through the point nl = u, n2 o. Let v n2 be the
v-coordinate in A(u). Then, by solving
2
u = nl + cn2/nl ' v = n2 '
we have
nl(u, v) ~ (u + !u2 - 4cv2 ) ,
n2(u, v) v.
The ancillary family is shown in Fig.S.2. In order to guarantee the
existence of the moments of ftc' it is necessary to modify ftc when y
is smaller than some small fixed positive constant. However, when N
is la~ge, the probability that the above modification takes place
decreases in the exponential order in N. Hence, this modification
144

Fig. 5.2

does not affect the following evaluation of the asymptotic behavior


of ftc.
It is easy to calculate the geometric quantities as
Bcd (u)
and all Ca.f3i (u) - 2c/u.
Hence, we have
H(e)
abK
= 0 , o H(m)
KAa
=- 2c/u ,
gab 1, 2c/u ,
The mean square error of the bias corrected c-estimator is given by
E[(u*)2 1 = 1 + (2c 2 /u 2 )N- l + O(N- 2 ) .
c
This indeed ascertains that the m.l.e. (c = 0) is third-order optimal
in the class of the bias-corrected estimators. On the other hand,
the mean square error of the c-estimator without bias correction is
gievn by
E[u~l = 1 + {c(3c - 2)/u 2 }N- l + O(N- 2 )
Hence, it is minimized at c = 1/3. The mean square error of the
145

(1/3)-estimator is
E[ui/3] 1 - {1/(3u 2 )}N- l + O(N- 2 )
which is smaller than that of the m.l.e.

It might seem ridiculous that an estimator is "improved" by


knowing a random noise. Indeed, the randomness is not necessary.
The non-random modification

filc = Y+ c/(Ny) ,
which is obtained by replacing z2 by its expectation l/N, yields a
better estimator,
E[(u~)2] = 1 + {c(c - 2)/u 2 }N- l + O(N- 2 )
This kind of improvement is virtual depending on the bias term. It
also depends on the manner of parametrization and also on the region
which u is assumed to belong to. This suggests that it is inadequate
to compare the higher-order asymptotic behaviors of estimators
without correcting the bias. Or it is inadequate to evaluate an
estimator by the criterion of the least square error or of the
covariance without making bias correction. An invariant evaluation
is given only after bias correction or by the amount of Fisher
information which an estimator carries.

5.4. Ancillary family depending on the number of observations


We have assumed that an estimator fi(i) is a function, which does
not depend on the number N of observations, in the observed point i.
However, in some special case, we can use an estimator ~(i) which
depends on N explicitly. This is the cas e , for examp Ie, when we
consider the geometrical theory of estimation of a time series model.
Such an estimator fiN is accompanied by an ancillary family A(N)
{A(u, N)} which explicitly depend on N. The decomposition of the
sufficient statistics
i = nN(fi, ~)
146

also depends on N, so that we need to write fiN and ~N instead of fi


and ~ , when it is necessary to show the dependence on N explicitly.
However, the Edgeworth expansions (4.34) and (4.38) are valid in this
case, too, if we accept that the geometrical quantities gaB'
KaBy ' etc. depend on N, because the related ancillary family depends
on N.
The distribution 0 -*
f fiN or u N can be obtained by integrating the

joint probability distribution of (fiN' ~N) or (U*N' v*N) with respect


to ~N or v *N' as we did in the previous sections. However, Theorems
5.2 and 5.2 should be replaced by the following one.

Theorem 5.10. An estimator fiN is consistent when, and only


when, the ancillary submanifold A(u, N) includes the point n (u) E M
in the limit N-t oo ,
lim A(u, N) ;:, n(u).
II-+QII
A consistent estimator is first-order efficient when, and only when,
the associated ancillary family is asymptotically orthogonal to M in
the sense that
lim g
a~
o.

Let u + dN(u) be the point at which the model M intersects the


ancillary submanifold A(u, N) of an estimator ~. When ~ is
consistent
lim dN(u) = 0
holds. Hence, the asymptotic bias of ~ is written as
b:(u) = E[~] = d:(u) - (1/2N)C a (u).
For an estimator fi which does not depend on N, let
A* (u) = {n I fi* (n) = u}

*
be the inverse image of the bias corrected version fi. Then, A* (u)
depends on N so that we may regard fi* as an estimator whose ancillary
147

family depends on N. In this case, the term


1 a -2
dN(u) = 2N C (u) + O(N ),
depends on N but the metric gaS(u) does not depend on N .
Let ON be an efficient estimator whose ancillary submanifold
A(u, N) is asymptotically orthogonal to M, and assume that

ga~(u) = <d a , d~> = O(N- 1/2 ).


Moreover, assume that its bias corrected version ~ * satisfies
E[~ - ul = O(N- 3/2 ).

In order to integrate (4.34) with respect to v-* , we decompose n[w*


gas l as
n[u* , glabln[v', gKA l
1
+ -g ~K ab
g g h·
n[u*, gabln[v', gKAl{l 2 a~ bK

+ O(N- 3/2 )},


where

Considering that gAa is of order N


-1/2 and putting
2 _ ~A
Qab - 2Nga~gbAg , (5.17)
we have the following Edgeworth expansion of u*. Here, it is also
assumed that

so that
C
aKb =
d
agKb
- H(e)
abK
= - H(e)
abK
+ O(N- 1/2 )
is used. In the next chapter concerning statistical tests, we treat

the Edgeworth expansion in the case where gaK is of order N


-1/2 but

its derivative QabK is of order 1.

Theorem 5.11. The distribution of the bias corrected vesion uN


of a bias corrected first-order efficient estimator ON is expanded as
-* _ -* 1 1 2 ab 1 (e) KA abc
p(~; u) - n[~; gab {1 + 4N Qabh - 2N HabKg gAc h
+ AN(U;; u) + O(N- 3/2 )}, (5.18)
where AN is given in (5.5).
148

Corollary. The mean square error is decomposed as


E[U;au;b] = gab + iN{(r m)2ab + 2(H~)2ab
+ (H~ 2ab + Q2ab} + O(N- 3 / 2 ) (5.19)

This shows that an efficient estimator ftN depending on N has one


more non-negative mean square error term Q2ab which is related to the
square of the cosines of the angles between M and A(u, N). It should
be noted that one more extra term appears in the coefficient of h abc
in (5.18). This implies a bias-corrected estimator is third-order
uniformly optimal under any synnnetric loss function, when and only
w'hen
H(m) = 0 gaK = O(N
-1
)
abK '
holds, even in the class of estimators ftN depending on N, provided
= O(N-
Qa b l / 2) However, this is not true for an asvmmetric loss
K ' J-

function.

5.5. Effects of parametrization


We have studied the characteristics of estimators in terms of
geometry. The geometric features themselves do not depend on the
manner of parametrizing a model M {q (x, u)}. We may choose any
allowable parameters (or coordinate systems) u' instead of u in
specifying the distributions in M. The theory holds for any
parametrization. For example, an estimator ft* is third-order
optimal, when its associated ancillary family is orthogonal to M and
mixture-flat at M, irrespective of the choice of parameter u.
However, the bias term Ca of ft and the mean square error of the bias
corrected ft* do depend on the manner of parametrization. Therefore,
for some practical purpose, one parametrization can be more
preferable than others. One may want to choose a parameter, if
possible, such that the Fisher information gab(u) does not depend on
u. Holland [1973] called this the covariant stabilizing parameter.
149

As Yoshizawa[1971 a] pointed out, a covariant stabilizing parameter


exists, when and only when the Riemann-Christoffel curvature
vanishes. Hougaard [1982] studied the problem of parametrizing
one-dimensional models suitably in various situations stated in the
sequel. We treat here the problem of how to parametrize
multi-dimensional models from the geometrical viewpoint, by extending
the idea of Hougaard [1982], see also Kass [1984]. The problem is
also related to the parameter transformation of a non-linear
regression model by Bates and Watts [1980, 1981].
We first seek for the parameter which minimizes the mean square
error of a bias-corrected efficient estimator. The third-order term
of the mean square error of ij* can be decomposed into the sum of
three positive terms (5.11). The first term (rm)2ab is the square of
the mixture connection which depends on the parametrization, while
the other two are tensors so that they do not depend on the manner of
parametrizaiton. More precisely, the latter two change under
parameter (coordinate) transformations exactly in the same manner as
the inverse of the Fisher information matrix (metric tensor) gab
changes. If there exists a parametrization u in which the mixture
connection r ~~~ (u) vanishes at all u, this is the parametrization
that minimizes the mean square error uniformly in u up to the third
order. Such a parameter is called the minimum covariance parameter.
The term (rm)2abcorresponds to the Bhattacharrya bound, and is
sometimes called the naming curvature (or the parameter-depending
curvature), although it is not a curvature but a connection in the
geometrical sense. Unfortunately, such a parametrization does not in
general exist. It is a mixture affine (-l-affine) coordinate system,
and it exists when, and only when, M is mixture-flat, i.e. M has a
vanishing -l-Riemann-Christoffel curvature. A mixture affine
parametrization has one more feature. Since the asymptotic bias of a
first-order efficient estimator is given by (5.4), the asymptotic
150

bias of the maximum likelihood estimator, for which 0,


vanishes when the parametrization is mixture affine.
When M is not mixture- flat, we can choos e a mixture-normal
parametrization (coordinate system) at a specific point u o' in which
the components and their derivatives of the mixture connection vanish

This is the parametrization that minimizes the mean square error of a


bias-corrected estimator at U o (and in a small neighborhood of u O).
This parameter is called a locally minimum covariance parameter. The
asymptotic bias of the maximum likelihood estimator also vanishes at
U o (and in a small neighborhood of u O).
We next study the parametrization in which the Fisher
information gab (u) becomes a constant matrix not depending on u.
When such a parametrization exists, it is possible to make gab(u) =

cab (unit matrix) by a linear transformation of the parameter. Such


a parameterization is said to be covariance stabilizing, because the
first-order term of the covariance of an efficient estimator is gab
at any point u in M. In the covariance stabilizing
parametrization (coordinate system), the components r abc
(0) (u) of the
O-affine connection vanishes identically, because the O-affine
connection is the metric connection given by the Christoffel
three-index symbol [a, b·, c 1. Hence, the covariance stabilizing
parametrization yields a O-affine coordinate system, and such a
parametrization exists when, and only when, M is O-flat, i.e., M has
a vanishing Riemannian curvature. This is a well-known fact in the
theory of Riemannian geometry: A Riemannian space is Euclidean, when
and only when the Riemann-Christoffel curvature vanishes. A
Euclidean space has a Cartesian coordinate system in which the metric

tensor reduces gab(u) = cab.


Even when M is not O-flat, there exis ts a O-normal
151

parametrization (coordinate system) at a specific point u O' in which

the components and their derivatives of the O-connection vanish at

r(O)(u) = 0
abc 0 '
This parametrization can be considered covariance stabilizing in a

small neighborhood of a point u O • Hence, it is called a locally

covariance stabilizing parametrization.


We can show the characteristic feature of some other

parametrizations. A (-1/3)-affine parameter, in which Kabc =

- 3r~b~/3)(u) vanishes identically, may be called a zero asymptotic


skewness parameter, because any first-order efficient estimator 0 has
zero asymptotic skevmess in this parametrization. Indeed, the term

Kabchabc vanishes in the Edgeworth expansions of u* or U, as can be


seen from (5.5) or (5.15). Obviously, a zero asymptotic skewness
parameter exists, when and only when M is (-1/3)-flat having a
vanishing (-1/3)-Riemann-Christoffel curvature. When M is not
(-1/3)-flat, there exists a locally zero skewness parametrization,

which is a (-1/3)-normal parametrization (coordinate system) at u O.


An efficient estimator has asymptotic zero skewness in a small
neighborhood of u o in this parametrization.
A (1/3)-affine parameter, when it exists, has another feature.
The expectation of the third derivatives of the log likelihood

function vanishes identically in this parametrization

E[dadbdct(x, u)] = 0 ,
as can easily be proved from the definition. This implies

E[dadbt(x, u)] can be approximated by a quadratic function in u at


around the true parameter. In other words, the distribution of

gab ( uA)(ACi
u - ua) is close to a normal distribution in some sense in

this parametrization. Hence, it is called a normal likelihood


parametrization. It exists, when and only when M is (1/3)-flat. A
(1/3)-normal parametrization at U o is called a locally normal
152

likelihood parametrization at u O '


Finally, an exponential-affine (i.e., l-affine) parameter is a
natural parameter when M is completely l-flat (i. e., when M is an
exponential family). Hence, an exponential affine parameter may be
said to be natural for any l-flat manifold M (which is not
necessarily an exponential family). A l-normal parametrization at u o
is said to be locally natural at u O ' We can summarize the above
results in the following theorem.

Theorem 5.12. Characteristic features of a-affine parameter (or


of a-normal parameter at u O) are as follows:
1) When a 1, it is (locally) natural.
2) When a 1/3, it is a (locally) normal likelihood parameter.
3) When a 0, it is a (locally) covariance stabilizing
parameter.
4) When a = - 1/3, it is a (locally) zero skewness parameter.
5) When a 1, it is a (locally) minimum covariance
parameter. The maximum likelihood estimator 0 has an asymptotically
vanishing bias of higher-order in this parameter.
An a-affine parameter exists, when and only when the
a-Riemann-Christoffel curvature vanishes.

A one-dimensional manifold (a scalar parameter family of


distributions) M has a remarkable characteristic that the
Riemann-Christoffel curvature always vanishes for any affine
connections.

Corollary. A scalar parameter family M of distributions always


has a natural, a normal likelihood, a covariance stabilizing, a zero
skewness, and a minimum covarince parameter.
153

We next state the method of obtaining an a-affine parameter u' =

(u' a) (a = 1. ...• m) from the present parameter u (u a ) by the


parameter transformation u' = u'(u) or u,a = u,a(u l • urn). a = 1•
...• m. The law of transformation of an affine connection (2.21) is
written as
r(a)
abc
where
Ba ' = au,a'/aua •
a
g'c'd' is the metric in the u'-coordinate system. and r~~~ and
r'~ba~
a c
, are. respectively. the a-connections in u- and u'-coordinate

sys terns. Th ere f ore. r ,(a)


a' b' c' = oY1e
' ld s t he . 1
part1a d'ff . 1
1 erent1a
equation
BC r(a)
d' abc
or
a au' c' ( u) = _ r (a) cd" u ,c' (5.20)
a b abc g °d
The integrability condition of this equation is vanishment of the
a-Riemann-Christoffel curvature
R(a) = 0
abcd
The a-affine parameter u' is given by solving the differential
equation (5.22) as a function of u.
It is easier to get an a-normal parameter at u o. An a-geodesic
u(t) satisfies the geodesic equation.
ua + r(a)ucu d = 0 •
g
ab cdb
where t is the a-geodesic parameter and' denotes d/dt. This is an
ordinary differential equation. The initial conditions at t = 0 are
u(O) = Uo • u(O) = A .
showing that u(t) passes through U o in the direction A at t = O.
where A is a tangent vector at u o . Let Al •...• Am be m independent
tangent vectors. It is convenient to choose them such that they are
orthonormal
154

Then, any vector A can be written as a linear combination of Aa's,


A L u,aAa
Let us denote by u = u(t; u'), u' = (u,a), the geodesic u(t) which
passes through Uo in the direction A = L u,aAa . By putting t = 1, we

have u u(l; u') or


ua = ua(l; u,b) , a, b = 1, ... , m . (5.21)

This defines a transformation from u' to u. By solving the above


equation with respect to u', we have u' = u'(u) or
u,b' = u,o' (ua ) . (5.22)
This u' yields an a-normal parameter (coordinate system) of M. When

M is a-flat, it is an a-affine parameter at the same time.

Example 5.3. The a-affine parameter of the model M = N(u, u 2 )


is derived as follows. The a-geodesic equation for u(t a ) is given
3..
--2-
u
u - (3 +3 7a)
u
u.2 = °.
where t is the a-affine parameter. The equation is solved to yield

t = f u -7a/3 , a f- °
This t = t
a ~log u , a
(u) is the a-affine parameter.
° For example t_l = u 7/ 3 is

the minimum variance parameter, and to log u is the variance


stabilizing parameter.

Bates and Watts [1980, 1982] discussed the effect of parameter


transformations in a non-linear regression model in order to
elucidate the non-linearity of the model. A non-linear regression
model has a very simple geometrical structure, and this problem can
easily be sol ved in our framework. Let i = 1, n, be
independent normal random variables which can be expressed as

i = 1, ... , n (5.23)
where £i are mutually independent normal random variable subj ect to
N(O, 1), f is a known function of known control parameters c i and of

unknown m-dimensional parameter u = (u a ). When f is non-linear in u,


155

this defines a non-linear normal regression model, where we put the


variance (i of xi or £i equal to 1 for simplicity I s sake. The

probability density function of x = (xi) is


q(x, u) = c exp{- -i- (xi - ei )2} ,
where
i
f (c ; u) . (5.24)
This forms an m-dimensional manifold M = {q(x, u)} parametrized by u.
Let S = {p(x, e)} be the n-dimensional exponential family of normal

distributions
p (x, e)

ljJ( e)

Then, the non-linear regression model M is an (n,m)-curved

exponential family imbedded in S by e e(u) in (5.24). The


enveloping manifold S is of very simple structure, because
o ,
It is a Euclidean space for any a, and the dual coordinate n can be

identified with e, The tangent space Tu(M) of M at u is


i
spanned by m vectors da = Badi'
Bi (u) = ~
a °a
ei = ~ f (c.
°a~'
u)
. (5.25)
The a-covariant derivative of the tangent vector field db along d a is
given by
i
Cab di
and is the same for all a, because all the a-covariant derivatives

are identical. The quantity

Vdadb = dadbf(c i , u)di


can be decomposed into two parts, one being tangential to M and the
other being normal to M. The tangential part is given by taking the

inner product with dC'

rabc = <Vdadb' d C >.


This is the affine connection of M, which depends on the manner of

parametrization. We can parametrize M such the r abc vanishes at a


156

point u O. This is the normal coordinate system. It should be noted


that all the a-connections are identical in a non-linear normal
regression model. Therefore, this parameter is locally a-normal for
any a, having natural, normal likelihood, covariance stabilizing,
zero skewness, and. minimum covariance properties at the same time.
The tangential part can also be represented by
C(T)i = r Bi cd (5.26)
ab abc dg
or
p~c j
J ab
where p~ is the projection operator form T(S) to T(M).
J
The normal part
C(N)i = C i _ C(T)i (5.27)
ab ab ab
is the imbedding (Euler-Schouten) curvature tensor of the manifold M,
which does not depend on the manner of parametrization. The normal
part is the intrinsic curvature of M. The scalar
K~ = Ic~~)ieaeb 12 ,
where e a is a unit vector, is called the intrinsic curvature in the

direction e of a non-linear regression model by Beale [1960] and


Bates and Watts [1980]. The tangential part C(bT)i
a is not
. a curvature
but a cOIllIllon affine connection. It is called the parameter-effects
curvature array by Bates and Watts [1980, 1981]. The parameter which
Bates and Watts proposed is the normal coordinate of the cOIllIllon

affine connection rabc. See also Hougaard [1981] and Kass [1984].

5.6. Geometrical aspects of jacknifing


The jacknife is a widely applicable non-parametric method of
evaluating the bias and covariance of an estimator by reusing already
observed sample (see Efron [1982]). This is a kind of resampling or
subsampling plan. Although the jacknife is a non-parametric method,
it can be applied to a parametric model. Here, we briefly analyze
the asymptotic properties of the jacknife estimator in a curved
157

exponential family, intending to show the characteristics of the


jacknife in a simple model.
Let x, ... , x be N independent observations from an (n,m)-curved
1 N
exponential family M = {q(x, u)}. Let 0 be an efficient estimator

which is a function of x = r x/N,


i
and let v be the coordinates of the

ancillary submanifold A(u) associated with the estimator. Then, the

sufficient statistic x can be decomposed into (0, ~) by n(O, ~) = X,


where u is the estimator. Let O(i) be the value of the estimator

from the N-l observations x, ... , x, x, ... , x where the i-th


1 i-l i+l N
observation x is omitted. Then, by defining,
i
(i) = 'i'
-
X
L.Hi ~/(N - 1)
the estimator O(i) is obtain~d from n(O(i) ~(i» x(i). We use the

following notations,
~ = (0, ~) , ~(i) = (O(i), ~(i» ,
~(i) o~ (i) = (00 (i), o~ (i» ,

" 1\ a
The jacknife estimate b = (b ) of the bias of an estimator 0 is
given by the (N - 1) times average of the deviation oO(i),
b = N N1 LoO(i) . (5.28)
The jacknife estimator 0JK of 0 is the bias-corrected one,
0JK = 0 - N N1 LOO(i) . (5.29)

The Jacknife estimate g = (sab) of the covariance of 0 is also given


the sample covariance of oO(i) as
~ = N N1 LOo(i)oO(i) - Ntb (5.30)

where the tensorial indices are omitted.


- (i) and hence o~(i) , are random
I t is easy to show that ox ,
-1
variables of order Op(N ), because of
N
ox(i) = (x - x)/(N - 1) = 1
N(N - 1) L (x - x) (5.31)
i j=l j i
Hence, by expanding n(~ + o~(i» = x + ox(i), we have
ox(i) = BO~(i) + ~ CO~(i) o~(i) + 0p (N- 3 ) ,
158

where the indices of B . and C Q. are also omitted, and Band Care
a~ a,,~

evaluated at~. From this follows


o~(i) = B-lox(i) - ~ B-1Co~(i) o~(i) + 0p (N- 3 )

This yields an evaluation of the bias estimate.

A
Theorem 5.11. The bias estimate b converges to the true bias
E[ft] in probability. The jacknife estimator UJK coincides with the
bias-corrected estimator ft* up to Op(N- 1 ).

Proof. From Lox(i) 0, it follows


Low(i) ~ B- 1 C Lo~(i)o~(i) + Op(N- 2 )
Since
o~io~i = B- 1 B- 1 ox(i)ox(i) + Op(N- 3 ) ,
we need to evaluate Lox(i)ox(i). By substituting (5.31) in it and

taking the index notation, e.g., ox(i) = (oxJ~i)) and x (~J.)' we


i ~
have
\ -(i) x-(i) _ 1
~ x. k -
~ J

which is the (N - 1)-1 times the sample covariance, converging to

gjk/(N - 1) by the law of large numbers. Hence, from


L ow(i)S ow(i)y 1 BSjBykg + 0 (N- 3/2 )
i N - 1 jk p

~ gSY + Op(N-3/2)

we have
Low(i)a 2~ CSyagSY + Op(N- 3/2 )
The term Lou(i) is the u-part of the above, so that

ba = - 2~ Ca(ft) + Op(N- 3/2 ) ,


proving the theorem.
We next study the asymptotic properties of the covariance

estimator. It is easy to show


159

E[gab] = ~ gab + O(N-2) .


We can calculate the term of order N- 2 . We do not here write down
it, because it does not coincide with the term of order N- 2 of the

covariance of neither 0 nor 0JK'

5.7.Notes
The higher-order asymptotic theory of estimation was initiated
by Fisher (1925) and Rao (1961, 1962, 1963), and was studied by many
researches, e.g. Chibisov(1973 b), Ghosh and Subramanyam (1974),
Pfanzagl (1982), Akahira and Takeuchi (198la). It was Efron (1975)
who first pointed out the important role of the statistical curvature
in the higher-order theory of estimation. Efron's statistical
curvature y2 is indeed
2 2 ab
y = (HM)ab g ,
the square of the exponential curvature of the statistical model in
our terminology. A multidimensional generalization of the
statistical curvature is given by Reeds (1975) and Madsen (1979).
Its importance has widely been recognized (see, e.g., Reid (1983)).
The geometrical foundation was given by Amari (1980, 1982a) and Amari
and Kumon(1983), where the mixture curvature plays as an important
role as the exponential curvature does. The mixture curvature plays
also an important role when a statistical model includes nuisance
paremeters. It is possible to define these curvatures for a general
regular statistical model other than a curved exponential family, and
to construct the higer-order theory of estimation (see Amari, 1984 ah
There seems to be some confusion in the usage of the term
"order" in the higher-order asymptotic theory. The distribution
_'k
function of U or u is expanded as
>~ *
p(u, u) = Pl(u ) + N
-1/2
P2(u) + N- 1P 3(u*) + O(N- 2/3 ),
7'

,,<
as in (5.5). Some people call the terms Pi(u) the i-th order term,
160

as we did in the present chapter. However, the second-order term


-*
P2(u ) is common to all the first-order efficient esimator, so that
one may call P3(u* ) the second-order term. In fact, if we expand the
mean square error as
E[u*au*b] = g~b + N-l/2g~b + N-lg;b + O(N-3/2),

there is no second-order term,

g~b = 0
for regular efficient estimators. Hence, one sometimes calls g3
ab

derived from P3(u*) the second-order term. We shall use the latter
usage in Chapter 7, where loss of information is treated.
We have shown the characteristic of an estimator ~ which depend

on N. When we consider the higher-order theory of estimation from

non-i. i. d. observations, such an estimator frequently appears. For


example, we can extend the present theory such that it is applicable
to the parameter estimation of parametrized time-series models such
as AR models and ARMA models, where a number of well known efficient

estimators are of this type.


The problem of parametrization is discussed in Holland (1973),
Yoshizawa (1971 a), Hougaard (1981, 1983), and Kass (1984). The
non-linear effect of a statistical model was studied by Beale (1960),
Bates and Watts (1980, 1981). This is a special example of general
geometrical properties of statistical models as was discussed by Tsai

(1983) in his doctral dissertation. See Efron (1982), Hinkley and

Wei (1984) for the Jacknife method. Akahira (1982) evaluated the

properties of the Jacknife estimator in the framework of parametric

models. See also DiCiccio (1984) for the effect of parametrization.


6. ASYMPTOTIC THEORY OF TESTS AND INTERVAL ESTIMATORS

The present chapter studies the higher-order


asymptotic theory of statistical tests and interval (or
region) estimators with or without nuisance parameters.
The power function of a test is determined by the

geometrical features of the boundary of its critical


region. It is proved that a first-order efficient test is
automatically second-order efficient, but there is in
general no third-order uniformly most powerful test. The

third-order power loss functions are explicitly given for


various widely used first-order efficient tests. The
results demonstrate the universal characteristics of these
tests, not depending on a specific model M. We also give
the characteristics of the conditional test conditioned on
the asymptotic ancillary. The third-order characteristics

of interval estimators are also shown. For the sake of


simplicity, we maily treat a one-dimensional model, and the
multi-dimensional generalization is explained shortly.

6.1. Ancillary family associated with a test


The present section treats the third-order asymptotic theory of
statistical tests in a curved exponential family M = {q(x, u)}.

Consider a null hypothesis HO u E: D that the true parameter u

belongs to a subset D of M. It is tested against the alternative


HI : u~D', based on N independent observations ~' .•• , x from the
N
identical but unknown distribution q(x, u). In the asymptotic theory
where the number N of observations is large, D' is taken to be the

complement of D, so that the alternative is written as HI : u¢-D.


When D is a singleton set D = {uO}, the hypothesis is simple, HO : u =
162

u o. Otherwise, it is composite. A test T we consider here is a

mapping from N observations x, ... , x to the binary set {r, r}


1 N
through the sufficient statistic X, where r implies rejection of the

null hypothesis HO and r implies that it is not rejected. Since


the sufficient statistic x = L x/N defines the observed point ~ x
-
i
in S, T is a mapping from S to {r, r}. In other words, T assigns r

or r to every point n ~ S. Let us denote the inverse images of rand


r, respectively, by
R = T-l(r), R = T-l(r)

Then, the hypothesis HO is rejected when the observed point ~ is in


R, and is not rejected when the observed point is in R. The set R is
called the critical region, and its complement R is called the
acceptance region. The manifold S is thus partitioned into Rand R,
RUR = S, Rna = cpo A test T is determined by its critical region R.

We assume that Rand R have smooth boundaries.


It is convenient to use a test statistic A(X), which is a

function of x, to denote the critical region R. The hypothesis HO is


not rej ected when A (x) < c. In two sided cases, it is not rej ected
when c l < A(X) < c Z. The acceptance region R is defined by
R= {n I A(n) < c}
or

respectively. Here, cons tants c, cl ' C z are to be determined from


the level condition (and the unbiasedness condition) of a test,
stated later. The critical region R is bounded by (n-l)-dimensional

submanifold(s) A(n) = c or A(n) = c l and A(n) = c Z. Fig.6.l a) shows


the case where HO is simple and R is bounded by two submanifolds, and
6.1 b) shows the case where HO is composite and R is bounded by one

submanifold.
The power PT(u) of a test T at u is the probability that the
hypothesis HO is rejected, when the true parameter of the distribution
163

Fig. 6.lb)
Fig. 6.la)

is u. A test T is of significance level u, when its power is not

greater than u at any point u belonging to the null hypothesis,

ue:D
A test T of significance level u is unbiased, when its power PT(u) at
any uf.D is not less than u. The power function PT(u) can be
expressed as

fR p(x; u)dP(x) 1 - fR p(x; u)dP(x) , (6.1)

where p(x; u) is the density function of x when the true parameter is


u.
In order to calculate the power function PT(u) of a test T, it
is convenient to introduce an ancillary family associated with the

test T. Let us associate with each point uEM an (n-m)-dimensional

submanifold A(u) which transverses M at u such that A = {A(u)} forms


164

an ancillary family or a local foliation. Let ~ R ()M be the


intersection of Rand M, i.e., nE:'i\t implies that n is in the
acceptance region R and in M. Let ~ = Rn M. An ancillary family A
is said to be associated with test T, when its critical region R is

composed of the ancillary submanifolds A(u)'s attached to the points

R = uuc:~ A(u) = {n I n EA(u), UE:~} •

When A is an associated ancillary family, the critical region R is

bounded by those A(u)'s which are attached to the boundary aRM of ~

(Fig.6.2). By introducing a coordinate system v to each A(u), we

have a new coordinate system w = (u, v) of S. The critical region R

is written as

R = {(u, v) I U€~, v is arbitrary}


in the new coordinate system w. The statistic x- is transformed to ~

= (ft, ~) by x = n(~). Since we already have the asymptotic expansion

(4.34), (4.38) or (4.40) of the distribution p(~; u) of the observed

point ~ in the new coordinate system, the power function is written

as

JR p(~; u)d~ = J~JA(U) p(~; u)d~dft

J~ p(ft; u)dft = 1 - J~ p(ft; u)dft , (6.2)

where

p(ft; u) = JA(u) p(~; u)d~

is the distribution of ft when the true parameter is u. It is

convenient to use the bias-corrected variable


11*a = IN(fta _ u a ) + ~ Ca(ft) (6.3)

instead of ft, when one calculates the power PT(u) at a point u, where
Ca = Cas a g as ,~.e.,
.

Eu[fta] = - 2~ Ca(u) + O(N- 3 / 2 )


The region ~ of integration should be expressed in terms of 11*, when

we integrate p(11*, u).


Before explaining the higher-order powers of a test, we give a
165

Fig. 6.2

simple example to illustrate the ancillary family associated with a


test. Note that the bias correction term Ca is evaluated not at ~
(ft, ~) but at (ft, 0). However, as we noted in Chapter 4, the
distribution of u* is the same in either case, so long as gaK = 0

holds.

Example 6.1. Fisher's circle model


Let M = {q(x, u)} be a (2,l)-curved exponential family imbedded
in the bivariate normal distributions S {p(x, n)},
1 1 212
p(x, n) = 21T exp{- "2 (Xl - nl) - T(x 2 - n2) } ,
x = (Xl' x 2 ), n = (nl' n2)
n(u) = [nl(u), n2(u)] [sin u, 1 - cos u] ,
q(x, u) = pix, n(u)} .
The M forms a unit circle with center at (0, 1) in S. This M is
called Fisher's circle model. The problem is testing a simple
166

hypothesis HO : u = 0 against Hl : u '" O. We consider some typical


unbiased tests and ancillary families associated with them.
The maximum likelihood estimator ft is given by solving daq(X, u)
0, where d a = d/du, as
ft = tan-l{xl/(l - x 2 )}.
We first consider the test based on the m.l.e., (m.l.e. test), which
uses this ft as the test statistic, A (x) = ft(x). The so-called Wald
test, whose test statistic is gab(uO)(ft - u O)2 or gab(ft)(ft - u O)2 is
a version of the m.l. e. test. The associated ancillary family A =

{A(u)} is the same as that associated with the maximum likelihood


estimator. Therefore, A(u) is a straight line connecting the center
(0, 1) and point 11 (u) on M (Fig. 6.3 a). The critical region R is
bounded by two of these lines, say, A(u_) and A(u+) where u_ and u+

Fig. 6. 3a) Fig. 6. 3b)

are to be determined from the significance level and the unbiasedness


of the test. It should be noted that all the A(u) pass through the
center. Hence, the family A covers only some neighborhood of the
circle M and we cannot extend A to cover the entire S. However, it
167

is sufficient for the asymptotic theory that A is defined in some


neighborhood of M. The acceptance region is the sector bounded by

A(u_) and A (u+) , and ~ is the interval [u_, u+] on M. We can


introduce a local coordinate v in each A(u). Let v be the distance

between nand n (u) € M, where nand n (u) are in the same A(u). Then,
any point n € S is specified by the new coordinate system w = (u, v)
as n = n(w) or explicitly as

nl = (1 - v)sin u , nz = 1 - (1 - v)cos u ,
where u shows that n is in A(u) and v denotes the distance between n
and M. The power PT(u) is given by

u+
f p(tl; u)dtl
u

On the other hand, the likelihood ratio test (1. r. test) uses
the following test statistic
A(X) = - z log[q(x, 0)/ max q(x, u) ]
u
Z log[q(x, O)/q(x, tl)] ,
where tl(x) is the m.l.e. We have
A(X) = Z{xlsin tl - (1 - xZ)(l - cos tl)} .
The associated ancillary family is composed of the curves {A(n) c}.
The curves are parabola given by
1 Z
nZ = - 4c (nl) + 1 + c
(see Fig.6.3 b). They can be expressed in the parametric form

nl (1 - cos u)sin v/(l - cos v) ,


n2 1 - (1 - cos u)cos v/(l - cos v)
where v plays the role of a local coordinate system in A(u). In this
case, a parabola A(u) intersects M at two points u_ and u+. In the
asymptotic theory, we consider only a neighborhood of M, in which an

A(u) is divided into two parts. One is the ancillary submanifold

attached to u+ and the other attached to u_. It is not of our

concern whether these two submanifolds A(u_) and A(u+) are a part of
one connected submanifold or separate two.
168

In an asymptotic theory, the number N of observations increases


without limit, and hence the critical region R changes depending on N.

Hence, a test T should be written as TN with the corresponding


critical region RN , by showing N explicitly. We are indeed treating

a test sequence T l , T2 , ... , TN' in the asymptotic theory in

order to evaluate the asymptotic behaviors of TN for large N. The

ancillary family AN also depends on N. We hereafter neglect the


suffix N for simplicity's sake. However, it should be noted that we
are treating a test sequence T Tl , T2 , ... in the following. It
should also be remarked that, in the asymptotic theory, it is not

necessary for an ancillary family A to cover the whole S. It


suffices that A covers a neighborhood of M.
Now we evaluate the asymptotic behaviors of a test (sequence) T.
Since i converges to n(u) in probability as N tends to infinity where

u is the true parameter, the power PT(u) tends to 1 for any fixed
uf: D. Hence, characteristics of a test sequence {TN} should be

evaluated by the power PT(uN) at a point uN which approaches to the

domain D with a reasonable convergence speed as N tends to inftnity.


To this end, let us define a set UN(t) by
UN ( t ) = {u E Mid ( u , D) = t lIN} ,
where d(u, D) is the geodesic distance from u to D. In other words,

UN(t) is the set of the points in M which are separated from D by a

distance t/R. When D has a smooth boundary 3D, the set UN(t) is an
(m-l) -dimensional submanifold surrounding D in M, and it approaches

to aD as N tends to infinity (Fig.6.4). aD for


convenience' sake. We evaluate a test sequence T by the asymptotic
behavior of the power PT(u) at u E UN(t) for various t. Obviously,
PT(u) depends not only on the geodesic distance t/~ of u from D but
also on the direction in which u is separated form D. It is possible

to construct a test which is very powerful in a specific direction at


the sacrifice of the powers in other directions. In order to compare
169

Fig. 6.4

two tests, we use their average powers in all the direc"tions. Let

PT(t, N) be the average of the powers of test T over all u E. UN(t) ,

(t) PT(u)du/SN(t) ,
PT(t, N) = I UEU
N
where SN(t) is the area of UN(t). (When SN(t) is not finite, we can

also define PT(t, N) by the average of PT(u) for all u EUN(t).)


Then, PT(t, N) represents the average power of a test T over the

points which are separated from D by a geodesic distance t/1Nr.

Let us expand PT(t, N) in the power series of N- l / 2 as


PT(t, N) P
Tl
(t) + P T2 (t)N- l / 2 + P T3 (t)N- l + O(N- 3 / 2 ) (6.4)

where PTi(t), i = 1, 2, 3, is called the i-th order asymptotic power

at t of a test T. A test T is said to be first-order uniformly


efficient (most powerful), when there are no

tests T' whose first-order power PT'l(t) is greater than PT1(t)"

PTl(t)< PT'l(t)
at some t. A first-order uniformly efficient test T is said to be

second-order uniformly efficient (most powerful), when its


second-order power satisfies

P T2 (t) ~ P T '2(t)
at all t compared with any other first-order efficient test T'. A

first-order uniformly efficient test is said simply to be efficient


170

in short. It will soon be proved that an efficient test is


automatically second-order uniformly efficient. It will also soon be
found that there does not in general exist a third-order uniformly
efficient test T in the sense that its third-order power satisfies
P T3 (t) ~ PT '3(t) for all t and all T'. Hence, we use the following
definition in order to evaluate the third-order optimality. An
efficient test T is said to be third-order t-efficient (t-most
powerful) or third-order efficient at t, when its power satisfies
P T3 (t) ~ P T '3(t)
at a specific t compared with any other efficient test T'. An
efficient test T is said to be O-efficient (O-most powerful) or
locally efficient (most powerful), when it is third-order t-efficient
for infinitesimally small t. A test T is said to be third-order
admissible when there are no efficient tests T' whose third order
power P T '3(t) is larger than or equal to PT3 (t) at all t.
It is known that there is in general no uniformly most powerful
test T such that PT(t, N) ~ PT,(t, N) for all t and T'. We call
P(t, N) = sup PT(t, N) (6.5)
T
the envelope power function, where sup is taken at each t
independently. Then, P(t, N) ~ PT(t, N) for all t and T, so that the
power function of any T is bounded by the envelope function.
Moreover, for any fixed t, there exists a test T(t), such that
PT(t, N) is as close to P(t, N) as desired at that t. The envelope
function is expanded as
P(t, N) = Pl(t) + PZ(t)N- l / Z + P 3 (t)N- l + O(N- 3 / Z) (6.6)
For any test T, Pl(t) > PT1(t), and PT1(t) = Pl(t) for a first-order
uniformly efficient T. Similarly, PTZ(t) = PZ(t) for a second-order
uniformly efficient T. For any efficient T, P T3 (t) < P 3 (t), and the
equali ty holds at point t, when T is third-order t-efficient. We

call
lim N{P(t, N) - PT(t, N)} (6.7)
N+oo
171

the (third-order) power-loss function of an efficient test T. The


characteristics of an efficient test T are represented by its
power-loss function. It is a kind of deficiency. We will give the
power loss functions of various widely used efficient tests. For a
given t, we show a method of designing the third-order t-efficient
test in the following.
Let ~P(T) be the supremum of ~PT(t),

~P(T) = sup ~PT(t) (6.8)


t
It represents the power loss of a test T at the worst position. We
call ~P(T) the (third-order) maximal power loss of an efficient test
T. The test T* which minimizes ~P(T) is called the third-order
optimal test in the minimax sense.
We have evaluated the average power. It is possible to evaluate
the power at t not by the average over UN(t) but by

infU&UN(t) PT(u) .
However, the result is the same up to the third-order, provided D is
sufficiently smooth. It is also possible to define UN(t) by using a
distance function other than the Riemannian geodesic distance. (Note
that all the a-distances are asymptotically equivalent to the
Riemannian distance, since t/IN" is infinitesimally small for large
N.) If one would like to emphasize the power in a specific
direction, one needs to use an unisotropic distance.

6.2. Asymptotic evaluations of tests: scalar parameter case


Let us first study higher-order characteristics of tests in a
one-dimensional (i. e., scalar parameter) statistical model M = {q (x,
u) }, The same method can be used in the vector-parameter case. A

one-dimensional model M is a curve imbedded in an n-dimensional


manifold S of exponential family. There are two types of tests. One
is the so-called two-sided unbiased test, in which D is a singleton
set D = {uo}' It tests
172

against
The other is the one-sided test, which tests
against
The latter can be considered to be the case with D = [- 00, uOl. (The

case with HO : uE D = [uO' uOl against Hl : u~ D reduces to the


composition of the following two one-sided tests, HO u = U o agains t
Hl : u < U o and HO : u = Uo against Hl : u > u O .) The critical
region R for a two-sided test is bounded by two (n-l)-dimensional
submanifolds A(u_) and A(u+) , which intersect M at u and u+,
respectively, where u_ < U o < u+ (Fig.6.S a). The critical region R
of a one-sided test is bounded by one (n-l)-dimensional submanifold

A(u+) which intersects M at u+, u+ > U o (Fig.6.S. b). In either


case, given a test T, we can construct an associated ancillary family
A = {A(u)} such that R is bounded by A(u+) and A(u_). For example,
when the test statistic A(X) is given, one can construct the
associated ancillary family A by the submanifolds Ac given by A(n) =
c for various constants c.
A test T of significance level a should satisfy the level
condition
(6.9)

In addition, a two-sided test should satisfy the unbiasedness


condition

aaPT(uO) = P±(uO) = 0 , (6.10)


where a = I denotes the differentiation with respect to u. In the
a
present asymptotic theory, it is required that these conditions hold
to within terms of order N- 3 / 2 .
Let us introduce a new variable t enlarging the scale of u at
around Uo by
(6.11)

where g = gab (u O) is the Fisher information at U o and t takes both


positive and negative values. Note that suffices a, b, etc. stand
173

Fig. 6.5a) Fig. 6.5b)

only for 1 in the present scalar parameter case. The point whose
u-coordinate is
(6.12)
is separated from U o by a geodesic distance It1N- 1/2 + O(N- l ). We

use this t to evaluate the power, and expand PT(u t , N) as


PT(u t , N) = PT1(t) + PT2(t)N-1/2 + PT3 (t)N- l + O(N- 3/2 ) . (6.13)

Here t takes on both negative and positive values in the two-sided

case.
In order to calculate the i-th order powers PTi (t), we use the
Edgeworth expansion of p(~; u t ), where ~ is the coordinates of the

observed point x in the coordinate system associated with the test T.

It is convenient to define a random variable wt by


wt = m(~ - wt )

where wt (u t ' 0), wt = (u t , v). We modify it to correct the bias


term as

W~ = wt + C(ft)/(2/N) , (6.14)

(U~, v*). Here, the bias term is corrected not at ~ but


174

at 0 or (0, 0). The Edgeworth expansion of the distribution w~ is


given by (4.40), where the notion w** was used. By integrating this
with respect to v*, we have the Edgeworth expansion of p(u~; u t ).
The power PT(u t ) of test T is given by

u+
PT(u t ) = 1 - f p(u; ut)dO ,
u_
where ~ 'R()M is the interval [u_, u+]. We put u in the
one-sided case. The transformation of the variable 0 to u~ is given
by u t = IN(O - u t ) and by (6.14), so that u *
t satisfies
u~ = u5 - t/.Ig . (6.15 )
The interval ~
coordinate by (6.14). The same interval is expressed as

[U t _, u t +] in terms of the coordinate u~, where


t/!g (6.16)
The power is written as

U+
PT (t)' = 1 - Ju_ t _ p(u*;
t ut)du*t . (6.17)
t

The interval RM is determined from the level condition (and


unbiasedness condition). That is, u+ and U are determined from

1 - a (6.18)

il
dtd J_ t+ PUt;
(-* u t ) d u-*t I t=O = 0 , (6.19)
ut _

where the latter is used only in the two-sided case because u t _ is -


00 in the one-sided case. We begin with the first-order theory. It
is easy to show that the first-order term of p(u~; ut ) is a normal
distribution,
p(u~; u t ) = n(u~; g) + O(N- l / 2 )
where the variance is the inverse of
- KA
g = gab(u O) - gaK(uO)gbA(uO)g (uO)
Since d K span the tangent space of dR, g depends on the quantity gaK
175

= <aa' a K) · Therefore, the first-order term of p(11~; u t ) depends


on the ancillary family A (or underlying test T) only through gaK'
which represents the angles between M and aR. When the boundary aR
is orthogonal to M, gaK = 0 holds, and g reduces to the Fisher
information g = gab' In general, g ~ g, and the equality holds when
gaK = 0 is satisfied. More precisely, in our pre~ent context of the
first-order asymptotic theory, gaK(u t ) = O(N- l / 2 ) guarantees g- = g
except for the term of order N- l / 2 . When gaK(u t ) = O(N- l / 2 ) holds,
the ancillary family A is said to be asymptotically orthogonal.
The interval ~ [u_, u+l is determined from the level and
unbiasedness conditions (6.18) and (6.19) as
;g 11+ = u 2 (a) + O(N -1/2 ) , ~
.g 11_
in the two-sided case, where u2(a) is the two-sided lOOa% point of
N(O, 1), defined by
u 2 (a)
f n(u; l)du 1 - a
-u 2 (a)

In the one-sided case, 11_ - 00 and


;g u+ = 11 1 (a) + O(N- l / 2 )
where u l (a) is the one-sided lOOa'7o point of N(O, 1), i.e., u l (a) =

u 2 (2a). The first-order power PT1(t) is obtained from (6.17) by


neglecting the terms of order N- l / 2 . The results are

in the one-sided case, and

in the two-sided case, where ~ is

~(t) = foo (21T)-1/2 exp{- ~ u 2 }du


t

The following theorem holds from the fact that g ~ g and the equality
holds when and only when gaK(u t ) = 0 except for terms of order N- l / 2 .

Theorem 6.1. A test T is first-order uniformly efficient, when


176

and only when the boundary dR of the critical region R is


asymptotically orthogonal to M, i. e., orthogonal to M in the limit
N + 00 The efficient first-order power is given by
~ [u 1 «l) - t] (6.20)

~[u2«l) - t] + ~[u2«l) + t] (6.21)


in the one-sided and two-sided cases, respectively.

In order to evaluate the higher-order powers, it is necessary to


calculate the second- and third-order terms of p(u~; ut ) for an
asymptotically orthogonal A. We have already calculated the p(u*; u)
in (5.5) for an exactly orthogonal A and in (5.18) for an asymptotic
orthogonal case. It is assumed in (5.18) that
gaK(u) = 0(N- 1 / 2 ), dagbK(U) = 0(N- l / 2 )
holds uniformly in u. In the present case of test, we assume that
the ancillary submanifo1d A(u; N) is orthogonal at u = uo '

gaK(u O) = 0
and is asymptotically orhtogona1 at u t '
gaK(u t ) = 0(N- 1 / 2 ),
where IUt - uol is of order N- l / 2 . Let us define a tensor at Uo

QabK = dagbK(U O) (6.22)


and expand gaK(u t ) to yei1d
t -1
gaK(u t ) = ~ QabK + O(N ). (6.23)

In this case QabK = 0(1) while gaK (u t ) = 0(N- 1 / 2 ), so that the


situation is a little different from that in (5.18). When a test T or
its critical region R is given, we can calculate QabK from gaK(u+)

QabK = gaK(u+)/(u+ - u O)·


We now integrate p(w~; u t ) with respect to v~ to get p(u~; u t ),
taking account that gaK(u t ) is not exactly equal to 0 but is a small
order term given by (6.23). Since QabK is of order 1, we need to use

the relation (5.10), i.e.,


C = Q - H(e)
aKb aKb abK'
177

This is the only difference from (5.18). We show the result in the
general multi-dimensional case for later use. Let ea be a unit
vector normal to aD at Uo E aD,

gab(uO)eae b = 1
Then the point u t defined by

u~ = u~ + ~ e a
is separated from D by a geodesic distance t/ IN": In the scalar
parameter case, ea is simply equal to l/fg, because all the
quantities are one-dimensional.

Theorem 6.2. The Edgeworth expansion of p(u~; ut ) for an


asymptotically orthogonal ancillary family is given by

p(u~; u t ) = n[u~;gab(Ut)]{l + AN(u~; u t ) + BN(u~; u t )} + O(N- 3 / 2 ) ,


(6.24)
with AN given in Theorem 5.3 and BN given by
2NBN(u~; u t ) = QacK{(gcd + t 2 e c e d )QbdA - 2H~~~gcd}gKA h ab

+ tQadA(2QbcK _H~~~)gKAed h abc

+ QabK(QcdA- H~~~)gKA h abcd , (6.25)


where h ab etc. are the Hermite polynomials in u*
t'

We omit the proof (see Amari and Kumon [1982]). When QabK = 0,
the new term BN vanishes. If we replace the term tQabK eb by gaK(u t )
and put all the other QabK's equal to zero, we get (5.18) from (6.25).
It should be remarked that the second-order term (i.e., the term of
order N- l / 2 ) of p(u~; u t ) does not depend on the ancillary family A or
the test T. It is common to all the efficient tests. The third-order
term (the term of order N- l ) depends on A or test T only through the
two geometric quantities QabK and H~~~, which represent the asymptotic
angle gaK(u t ) between aR and M and the mixture-curvature of aR,

respectively. In other words, the third-order characteristics of an


efficient test are completely determined by the angles gaK(u t ) and the
178

mixture-curvature H(~)
Kl\a of the boundary oR of the critical region.
In order to calculate the second- and third-order powers, it is
necessary to evaluate u+ and u_ up to the third-order terms. We
expand them as
-1/2 -1 -3/2
r=:
.g u t = ul(a) + oN + EN + O(N )
in the one-sided case where u "", and
u 2 (a) + 0+N- l / 2 + E+N- l + O(N- 3 / 2 ) ,
- u 2 (a) + 0_N- l / 2 + E_N-l + O(N- 3 / 2 )

in the two-sided case. All of these quantities are determined from


the level condition (and the unbiasedness condition in the two-sided
case). The second-order terms 0, 0+, 0 do not depend on the
ancillary family, so that it is cOIInllon to all the efficient tests.
Moreover, 0+ = 0 is proved by using (6.24). This leads to the
following theorem, although we do not write down the explicit form of
o (see Kumon and Amari [1983]).

Theorem 6.3. All the first-order uniformly efficient tests are


second-order uniformly efficient, having the same second-order power

function PT2(t) = P 2 (t).

We then proceed to the calculation of third-order terms E, E+,


E. In the two-sided case, E+ = - E_ is proved. We can obtain the
third-order power PT3 by calculating (6.17) with the help of (6.24),

and 0 and E derived therefrom. The power is decomposed into the sum
of two terms. One term does not depend on QabK and H(m) so tnat it
KAa'
is cOIInllon to all the efficient tests. We denote it by Pi(t, a),
where i 1 for the one-sided case and i = 2 for the two sided case.
It will soon be proved that this is the third-order envelope function
The other depends on them, representing how the power is
affected by the geometric properties of the critical region which
characterizes the test. We show here only the results, asking the
179

reader to refer to Kumon and Amari [1983].

Theorem 6.4. The third-order power PT3 (t) of an efficient test


T is given by

where i = 1 for the one-sided case and i = Z for the two-sided case,
and
2't n{u l (a.) - t} ,
~ [n{uZ(a.) - t} n{uZ(a.) + t}] ,

t
Jl(t, a. ) = 1 - Zul(a.) , (6.Z6)

( HZ) = -lH(m) H(m) KV A~


A g KAa v~b g g ,
n(u) being the standard normal density.

It is easy to show ~i(t, a.) ~ O. Since the term in the bracket


of (6.Z6) is the sum of the square of H(m) and the square of Q -
J.H(e), where the suffices are omitted,
1

Pi(t, a.) ~ P T3 (t)


for all tests T. Moreover, equality holds at t, when
H(m) = 0 QabK -- J 1. (t, a.)H(e)
KAa ' abK
are satisfied. Hence, Pi(t, a.) is the third-order envelope function
P 3 (t). We thus have the following theorem.

Theorem 6.5. The third-order power loss function of an


efficient test T characterized by H;~~ and QabK is given by
lIP T (t) =
1 Z)
~i (t, a. ) {T (HA + u iZ (a.) g KA g -Z
~(QabK - Ji(t, a.)H~~~)(QCdA - Ji(t, a.)H~~~)}. (6.Z7)

Here we introduce the notion of a k-test. A test is called a


180

k-test, when the associated ancillary family is asymptotically


orthogonal, mixture-flat at v e
and Qa b K is proportional to H(b0,
=
a )K
with the proportionality coefficient k,
QabK = kH(e)
abK'
Obviously, a k-test is first-order efficient. The following
Corollary is immediate from Theorem 6.5.

Corollary. The third-order power loss function of a k-test is


given by
(6.28)
where y is the statistical curvature of Efron, i.e., the absolute
scalar of the exponential curvature of M defined by
y2 = gKA g -2 H~~~H~~~ (6.29)
An efficient test is third-order efficient at to' when and only when

it is k = Ji(t O' a)-test.

It is interesting to note that the power loss is in proportion


to the square of the statistical curvature (exponential curvature) y2
of model M. Moreover, in order to attain the third-order efficiency
at some to' the associated ancillary family at the boundary aR should
not be exactly orthogonal to M but should have a slight inclination in
proportion to the exponential curvature of M such that the
effect of the curvature is compensated. The third-order power loss
function of a third-order admissible test is determined by (6.28)
which is universal in the sense that it depends on the model M only
through the statistical curvature. The maximal power loss of the
k-test
~P{T(k)} = max u?(a)~.(t, a){k - J. (t, a)}2y2. (6.30)
t ~ ~ ~

The third-order optimal test T* = T(k * ) in the minimax sense is given


by minimizing ~P{T(k)} with respect to k,
~P(T *) = min ~P{T(k)} (6.31)
k
181

The k = k* which minimizes lIP{T(k)} is uniquely determined as a

function of the level a, irrespective of the statistical model M.


These universal structures are studied in more detail in the next

section.

6.3. Characteristics of widely used efficient tests: Scalar

parameter case
We compare the behaviors of widely used tests by their
third-order power-loss functions or deficiencies. The third-order
t-efficient test is also given explicitly.

The m.l.e. test. The test statistic of the m.l.e. test is the

maximum likelihood estimator O(i) or its function A{O(i)}. The


2
so-called Wald test, whose test statistic is {O(i) u O} gab (u O) or
2
{O(i) - u O} gab (0), belongs to the m. l. e. tes t. Of course all the
m.l.e. tests are equivalent, having the same ancillary family as that

of the m.l.e., if the level condition is the same. Hence, all the

A(u) are mixture-flat, H~~~ = 0, and A(u t ) is exactly orthogonal to M


for any u t ' gaK(u t ) = O. Hence, QabK = O.

Theorem 6.6. The m.l.e. test is the k = O-test characterized by

H~~~ = 0, QabK = O. It is third-order efficient at to(a) given as the


solution of J i (t, a) = O. Its third-order power-loss function is

given by
i I, 2 . (6.32)

The solution to(a) of Ji(t, a) o is to 2u l (a) in the

one-sided case, and to = ± 2u 2 (a) in the two-sided case. For widely

used significance levels a = 0.05 and 0.01, the solutions are to

3.3 and to = 4.6, respectively, in the one-sided case, and to = ± 4

and to = ± 5 in the two-sided case. This implies that the m.l.e.

test is powerful for rather distant alternative hypotheses. The

power-loss function of the m.l.e. test is shown in Figs.6.6 and 6.7


182

a = 0.05, two-sided tests

efficient score test

test

0.5

5
Fig. 6.6

a = 0.05, one-sided tests


~efficient score test
vr (locally most powerful)
0.5
test

test

optimum test

Fig. 6.7
183

by comparing it with other tests (Amari, 1983a).


The likelihood ratio test (l.r.test). The test statistic of the
likelihood ratio test is given by
A(X) 2log{q(x, uO)/max q(x, u)} ,
where max q(x, u) = q(x, fi), fi(x) being the m.l.e.(In the one-sided
case, fi is the m.1.e. if fi > Uo and fi = uo ' otherwise.) The test
statistic is written as
- i i-
A(X) = 2(6 - 60 )x i - 2{~(6) - ~(60)}

where 60 The behaviors of the 1. r. test are


obtained by studying the geometrical properties of the associated
ancillary submanifolds defined by
i i
A(n) = 2(6 - 60 )ni - 2{~(6) - ~(60)} = c ,
where 6 (n) is the 6-coordinates of the mixture-flat projection of n
to M.

Theorem 6.7. The l.r. test is the k = 1/2-test characterized by


It is third order efficient at to (a)
given by the solution of Ji(t, a) = 1/2. Its third-order power-loss
function is given by
2 1 2 2
L'lPT(t) = ui(aHi(t, a){ T - Ji(t, a)} y (6.33)
i = 1, 2 .
Proof. We first prove that H(m) o at u o ' so that H(m) (u) =
KAa KAa t
0(N- l / 2 ). Let us consider the submanifold A(n) O. This
submanifold consists of the points n satisfying 6(n) = 6 0 , i.e., the
points n such that fi(n) = u o ' where fi is the m.l.e. mapping. Hence,
it coincides with the ancillary submanifold A(u O) associated with the
m.l.e. Therefore, it is mixture-flat and is orthogonal to M at u o '
proving H~~~ (uO) = 0, gaK(u O) = O. The submanifolds A(u) at u ~ Uo

associated with the l.r. test are in general neither exactly

mixture-flat nor orthogonal. We next calculate QabK from gaK (u t ) =

The gradient dA/dni of A(u) at n(u) satisfies


184

Bd (u)aA(U)/dlli = O . . The gradient is calculated as


1 OA (e i - e~)lli + lljdej/dlli - d~(e)/dlli
"2 dlli
At point II ll(U, 0), the relations ei(ll) = ei(u) and dW(e)/dlli
llj(dej/dlli) are satisfied so that
i i
dA(u)/dlli = e (u) - e (u O)
i i
Hence, we have BKi(ut)[e (u t ) - e (u O)] = 0 or by expansion
1 2
i
BKi(u t ) [Ba(u t ) (u t - uo) -
i
daBb(U t - uo) ] = 0 . :r
This yields
_
gaK(u t ) - :r1 (e)
HabK(u t
Proving QabK = (1/2)H(e).
abK

The solution of J i (t, ex) = 1/2 is given by to = u l (ex) in the


one-sided case, and to = u 2 (ex) in the two-sided case. Hence, the
l.r. test is third-order most powerful at about to = l.7 (ex = 0.05)
or to = 2.3 (ex = 0.01) in the one-sided case, and at about to = ± 2
(ex 0.05) or to = ± 2.5 (ex = 0.01) in the two-sided case. The power
loss function does not have a large peak, so that characteristics are
good in a wide range (see Figs.6.6 and 6.7 for the power-loss
function).

Locally most powerful test. A test T is said to be locally most


powerful, when P±(t) at t = 0 is largest in the one-sided case and
p:r(t) at t = 0 is largest in the two-sided case among all other
tests. This implies that its power PT(t) at infinitesimally small t
is largest compared with other tests. (Note that P±(O) o is
satisfied in the two-sided case because of the unbiasedness
condition.) An efficient test T is third-order locally most
powerful, when its third-order power P T3 (t) at infinitesimally small
t is maximal (i.e., PT3 '(0) is largest in the one-sided case and

PT3 " (0) is largest in the two-sided case). Obviously, the locally
most powerful test is third-order locally most powerful.
185

Theorem 6.8. The third-order locally most powerful test is the k


c;-test characterized by H(m)
... KAa
Q
0 'abK = c H(e)
i ab K'
where c -
1 - 1 ;n
...
the one-sided case and c 2 = 1 - 1/ {2u~ (a.)} in the two-sided case.
The third-order power-loss function is given by
2 2 2
6P T (t) = ui(o.)~i(t, o.){c i - Ji(t, a.)} y , i 1, 2 . (6.34)

Proof. Since we have

lim Jl(t, 0.) = c l ' lim J 2 (t, 0.) = c 2


t+O t+O
the third-order power loss is minimized at t .... 0 by putting QabK
(e)
c i HabK ·

Efficient-score test. The efficient score test uses as the test


statistic the efficient score at or its
function. Rao proposed a test based on the test statistic
ab - -
g (u O) daJ!,(X, u O) dbJ!,(x, u O)· This is the efficient-score test. In
the scalar parameter case, the associated ancillary submanifolds are

given by
i
A(n) = Ba(uO){ni - ni(u O)} = c .
This is linear in n, so that A(u) is mixture-flat.

Theorem 6.9. The efficient-score test is the k 1 - test

characterized by H~~~ = 0 and QabK = H~~~. It is locally most

powerful in the one-sided case, but is not third-order admissible in


the two-sided case. The third-order power-loss function is given by

(6.35)

Proof. We have already shown that the associated A(u) is


mixture-flat, H(m)
i.e. , O. The gradient of A(n) is
KAa dA/dni =

B;(U O) for any n(u), so that


o .
Hence, we have
186

which yields gaK (u t ) = H~:~ (u t - u O), proving QabK = H~~~. This


shows that the efficient-score test is locally most powerful in the
one-sided case. However, it is third-order efficient for no t, in
the two-sided case, because J 2 (t, ex) < 1 for all t and hence {l -

J 2 (t, ex)}2 vanishes for no t. Indeed, its power loss function is

larger than that of the locally most poweful test at all t in the
two-sided case.
We can now compare the third-order characteristics of various
first-order efficient tests. It should be noted that the usal x2
procedure for various tests does not guarantee the level condition up
to the order of N- l . The performances of tests we are studying here
are those which are adjusted by introducing the terms I) and £ such
that the level and unbiasedness condition is correct up to N- l .
Fig.6.6 shows the third-order power-loss functions of various tests
in the one-sided case, where ex = 0.05. Fig. 6.7 shows the two-sided
case. It is seen that the efficient-score test or locally most

powerful tests behaves rather badly at t = 2 ~ 3. The Rao test is not


third-order admissible, but it behaves better than the Wald test and
the 1. r. test for 0< t < 1. The m.l.e. test is good at larger t (t =

3 ~ 4) as is expected, but is not so good at t = 1.5 ~ 2. The l.r.


test has a good performance throughout a wide range. The third-order
optimal test in the minimax sense is also shown, which will be given
later. It should be remarked that the third-order power-loss function
is universal in the sense that it depends on the specific model M only
through the statistical curvature y.
In order to design the third-order t-efficient test, we define
the t-likelihood-ratio test.

t-likelihood ratio test. A test T is called the t-likelihood


ratio test (t-l. r. test), when its test statistic is a function of
q(x, uO)/q(x, u t ) in the one-sided case and of q(x, uO)/max{q(x, u t ),
187

q(x, u_ t )} in the two-sided case. It is well known that the t-lor.


test is most powerful at u t in the one-sided case (Neyman-Pearson

fundamental lemma), and hence it gives the third-order t-efficient


test. The t-lo r. tes t is also widely used in the two-sided case.
Then, what is the characteristic of the t-l.r. test in the two-sided

case? To answer this question, we define ti(t) by tl(t) = t and

t
tanh uZ(a)t (6.36 )

with the convention tZ(O) = l/uZ(a). The ti is the solution of the


equation
t
1 -

Theorem 6.10. The t-l.r. test is the k(t)-test characterized by


H(m) = 0 and
KAa
k(t) = 1 - t/{Zui(a)}. (6.37)

It is third-order efficient at ti(t). Conversely, the third-order t-

efficient test is given by the t-l.r. test in the one-sided case and

by the [t/tanh{uZ(a)t}]-l.r. test in the two sided case. Especially,


the third-order locally most powerful test is given by the
{l/uZ(a)}-lor. test. The third-order power loss function of t'-lor.
test is
Z Z Z
6P T (t) = ui(a)~i(t, a){Ji(ti, a) - Ji(t, a)} y , (6.38)
where t! = t. (t').
~ ~

Proof. Let us put

A(n, t) - Zlog{q(n, uO)/q(n, u t )}


i i
Z[{e (u t ) - e (uO)}ni - ~(et) + ~(eO)] .
Since the associated ancillary submanifolds of the t-l. r. test are

defined by A(n, t) c [or max{A(n, t), A(n, t)} = c 1, they are


(piece-wise) linear in n, and hence H(~)
Kl\a
= O. Since the gradient of
A = c is given by
188

(when A(n t) ~ A(n, - t) in the two-sided case), we have


i i
o= BKi(u)(oA/oni) BKi ( u){e (u t ) - e (u O)} = 0
Now we evaluate gaK(u) at u u i (a) by expanding ei(u t )

this u. Then, we have (6.37). Since the t'-l.r. test is third-order


efficient at the point E' satisfying

t'
Ji(E', a) = 1 - 2 ui(a)
we have Ei Ei(t') by solving this.

The third-order optimal test T*. The third-order optimal test


T* is one which minimizes ~P(T) = max PT(t). Since ~P(T) depends on
t
H(~)
Kl\a and Qa b K of T, the minimum is taken over all H(~)
Kl\a and. Qa bK' It
is easily seen from Theorem 6.5 that the minimum is attained only
when H(m) = 0 and QabK is proportional to H(e). We put Q = kH(e)
KAa abK abK abK
and search for the minimum of
F(k) = max u.
2 (a)~. (t, a){k - Ji(t, a)} 2 .
t ~ ~

This is a minimax problem. Unfortunately, the optimal pair (k, t) is


not a saddle-point. Hence, we need numerical calculations to get the
optimal pair (kt, tt). The result is as follows.

Theorem 6.11. The third-order optimal test T* is the

third-order tt(a)-efficient test, where tt(0.05) = 2.2, tt(O.Ol)


2.7 in the one-sided case, and t~(0.05) = 2.5, t~(O.Ol) = 3.0 in the
two-sided case.
We show the maximal power losses of various tests for comparison
in Table 6.1, where a = 0.05:

2 (a =
Table 6.L Maximal power loss ~P(T)/y 0.05)
optimal lor. m.Le. l.most eff. score
test test test powerful t test
one-sided 0.07 0.12 0.24 0.57 0.57
two-sided 0.07 0.12 0.40 0.54 0.78
189

~---~N -+ OC'

N =5
a = 0.05

locally most powerful test


0.5

-t----~------r---~+-----~----~--~t
5
Fig. 6.8

a = 0.05
0.5 two-sided m.l.e. test

Fig. 6.9
190

We have so far developed the third-order theory of tests. When N


is finite, how well does the third-order theory fit? When N is very
large, the second- and third-order terms are negligible, and the
first-order theory fits very well. When N is small, the
applicability of the Edgeworth expansion is doubtful. It is
difficult to answer the range of N in which the third-order theory is
applicable, because it depends on the model. We give a numerical
example by using the Fisher circle model. Fig.6.8 shows the (exact)
power functions of the m.l.e. test, l.r. test and locally most
powerful test in the two-sided case for N = 5. For N = 20, all the
power functions are almost identical with the first-order power
corresponding to N = 00. Fig.6.9 shows N times the power-loss function
of the m.l.e. test, where the third-order theory corresponds to N = 00.

This shows that the third-order theory can predict the qualitative
behavior of the m.l.e. test even when N = 5. This is true in the case
of other tests. Hence, so far as the Fisher circle model is
concerned, the third-order theory is useful to k~ow the behaviors of
tests for N = 5 'V 20, and the first-order theory is sufficient for
N > 20.

6.4. Conditional test


When there exists an exact ancillary statistic r(x) whose
distribution does not depend on the parameter u, the conditionality
principle states that statistical inference should be performed
conditionally on the observed value f' of the ancillary statistic.
Therefore, the critical region R(r) of the conditional test is
determined such that the level and unbiasedness conditions are
satisfied for each value of r separately, thus decomposing the whole
test problem into independent subproblems conditioned on r. Let B(r)
191

be the region of X or S on which the value of the ancillary statistic


is r, i.e.,
B(r) = {n r(n) = r},
where we identify X with S by taking the expectation coordinate
system. Then, the whole space S is partitioned into B(r)'s.,

S = UB(r),
r

and the critical region R(r) is defined in each B(r). Let


R = U R(r)
r

be the union of the critical regions of all the B(r)'s. This is the
(unconditional) critical region of the conditional test. It is
characterized by the fact that not only the total R but each
component R(r) satisfies the level condition (and the unbiasedness
condition on the two-sided case). The total (unconditional)
behaviors of the conditional test are analyzed by the geometrical
features of this R.
The problem for the conditional test is that there does not
always exist an exact ancillary statistic. However, there always
exists an maximal asymptotically ancillary statistic. In fact, the
statistic ~ in the decomposition
x=n(a,~),

where 11 is the m.l.e. or other efficient estimators, is the maximal


asymptotically ancillary statistic, provided the coordinate system in
each A(u) is taken so as to satisfy

gd = °d·
We can define the asymptotic conditional test based on the
asymptotically ancillary ~ or v instead of the exact ancillary f'.
The asymptotic ancillarity and asymptotic sufficiency will be studied
in detail in Chapter 7, so that we briefly describe here
characteristics of the asymptotic conditional test (see Kumon and
192

Amari, 1983).
In the m. 1 . e . decomposition x- n(a, ~), let the acceptance
region R(v) conditioned on v be
-R(v) = {u*
O < d(v)}

in the one-sided case, where used as the test statistic


conditioned on v. The d(v) is determined from the level condition by
using the conditional distribution p (uO I v) given in (7.13). The
result is
d(v) = u(Za) + N-l/ZE(v),
E(V) = tg-3/ZK{uZ(Za) -l} +ig-lu(Za)r, (6.39 )
where

is the third-order cumulant of -*o


U and

ab
r =
= H(e)v
abK
rK

is the exponential-curvature direction component of the ancillary v.


In the two-sided case, the conditional acceptance region is
given by
R(v)
where
u(a) + N-l/Z(E - ~),
u(a) + N- l / Z(£ + ~),

with
1"
u
_.1
- zg
-3/Z{r(m) +lK Z( )}
abc 3 u a , (6.40)
_(_)
£ v =
l
zg- lu( )
a -
r. (6.41)
It is noteworthy that d i (v) or d(v) depends on v only through the
statistic
r =
H(e)v K = Nl/Z{o a ~(x u) + gab(uO)}'
abK a b ' 0
the difference between the expected and observed informations. In
fact, it will be shown that 1:' is the asymptotic ancillary carrying
the whole information. The critical regin R is obtained in the
above. By analyzing its shape we have the following theorem.
193

Theorem 6.12 The asymptotic conditional test is the k = l/2-test


characterized by H(~) = 0 and Q b = (1/2)H(be ), and hence its behavior
Kl\a a c a K
is third-order asymptotically equivalent to the likelihood ratio test.

Proof. The boundary of R is defined by


*
tiO = d(v)
in the w-coordinate system. Since d(v) or di(v) is linear in V, the
boundary is linear in the n-coordinate system because n is linear in
v for the m.l. e. This shows that H~~~ = O. The angle between the
model M and the boundary of R is given by
1.N-1/2dH(e)
gaK = 2 abK'
where d is u(2a) or u(a). This proves
QabK = (1/2)H(e).
abK

We can analyze the asymptotic conditional test in the


multiparameter case in a similar manner. However, it is not
third-order asymptotically equivalent to the 1. r. test in the
multiparameter case.

6.5. Asymptotic properties of interval estimators


Higher-order asymptotic properties of an interval estimator are
analyzed in an (n, l)-curved exponential family M = {q(x, u)}. The
result can be generalized to the vector parameter case. An interval
estimator is a mapping I from the sufficient statistic x- to an
interval, I (x) = (u_, u+), which is called the confidence interval
with lower bound u_ and upper bound u+. When the lower bound u is
put equal to _ 00
, the interval estimator is said to be one-sided.
Otherwise, it is said to be two-sided. We can evaluate an interval
estimator I by its power, or by its average size. When the true
parameter is u O ' the power P1(u I u O) at u of an estimator I is the
probability that I(x) does not include that u,
194

(6.42)
The size LI(u O) of I is the expectation of the length of the
confidence interval I(x),

LI(u O) = Nl / 2 E[(u+ - u_)j = Nl / 2 Eu [/I(i)1g(Uj dUj, (6.43)

where Igdu is the line element.


The interval estimation is closely related to the testing
hypothesis. Let T(u O) be an unbiased test for a simple null
hypothesis HO u = Uo against the alternatives Hl : u f u o ' and let
~(uo) be the critical region of the test T(uO). We can then
construct an interval estimator IT from a family of tests T = {T(u)},
uEM by
(6.44)
This implies that a point u I EM is included in the confidence region
IT(x) , when and only when one cannot reject the hypothesis
HO : u = uI from the observation x. Conversely, given an interval
estimator I(x), one can construct a test TI whose critical region is
given by
(6.45)
This shows that the test is one aspect and the interval estimation is
another aspect of the same structure given by a subset KCM x S (see
Fig.6 .10) . Its section at u = Uo EM gives the acceptance region
R(uO) = {X I (u O' x) E. K}, and its section at fl = xES gives the
confidence interval I(x) = {u I (u, x)e K}. Therefore, we can analyze
the performance of an interval estimator by using the behaviors of the
corresponding test. For example, we have
Prob{u¢IT(x) I u O} = Prob{x1=Rr(u)
when Uo is the true parameter. This shows that

PI(u I u O) = PT(uO I u) , (6.46)


the right-hand side of which is the power of test T at Uo when the
true parameter is u. Since we have analyzed higher-order terms of
the power of a test, we can translate the results to yield the
195

Fig. 6.10

higher-order terms of the power of an interval estimator.


We first treat an unbiased estimator without nuisance parameter.
An interval estimator 1 is said to be of level a when its power at Uo
is

and is unbiased when


da P1(uO I u O) = 0 .
These conditions are required to be satisfied except for a term of
order N- 3 / 2 in the asymptotic theory. We put u t U o+ t(gN)-1/2 as
before, and expand the power P1 (u t I u O) as
P1 (u t I u O) = P1l (t) + P I2 (t)N- 1 / 2 + P13 (t)N- l + O(N- 3 / 2 ). (6.47)
Similarly, the average size L1(u O) is expanded as
L = L + L N- l / 2 + L N- l + O(N- 3 / 2 ) (6.48)
1 11 12 13
196

An interval estimator I is first-order uniformly efficient, when its


first-order power PI1(t) is maximal at all t among all other
estimators. The second-order uniform efficiency is similarly
defined. The following theorem is immediate.

Theorem 6.13. An interval estimator I is first-order uniformly


efficient, and is second-order uniformly efficient at the same time,
when and olny when the associated ancillary family is asymptotically
orthogonal.

An efficient interval estimator I is said to be third-order


efficient or most powerful at t', when PT3 (t') is maximal among all
other efficient estimators. It is of wide use to define the
efficiency of an interval estimator by its average size instead of its
power. An estimator is said to be of first-order minimal size, when
its first-order size LIl is minimal among all other estimators. The
second- and third-order minimality of the size is defined similarly.
The interval estimator obtained from the k-test is called the
k-interval estimator. Its ancillary family satisfies H~~~ = 0, QabK
= k H(e) From Corollary of Theorem 6.5, the third-order efficient
abK·
estimator is given as follows.

Theorem 6 . 14 . A k-interval estimator is first-order and


second-order efficient. It is third-order most powerful at to' when
k is

In order to obtain the third-order minimal size estimator, we


have only to evaluate E: in u+ and u_, because the third-order term
L I3 is related to it. We show the results, omitting the proof (refer
to Kumon and Amari, 1983).
197

Theorem 6. 15 . A k-interval estimator is of first-order and


second-order minimal size. It is of third-order minimal size, when k
is given by
k (6.49)

It is third-order efficient at ti given by k 1, 2.

The upper and lower bounds of the k-estimator is also explicitly


given in the scalar-parameter case.

Theorem 6.16. The confidence interval of the k-estimator is


g iven in terms of the m.l.e. 0 and the ancillary statistic r H(e)v K =
abK
N1/2{dadbt(x, 0) + gab(O)} as follows. In the one-sided case,

u+ 0 + g-1/2 u1 (a)N- 1 / 2 + [u 1 (a)g-3/2 kr

+ {2uf(a) - 1}g-2 r*/2]N- l , (6.50)

where * is the component r~~~ of a = 1/{2uf(a) - 1} connection. In


the two-sided case
2 -2 -1
0+ u 2 ( a ) g -1/2N-1/2 + [u2(a)g-3/2kr u2(a)g r*/2]N
2 -2 -1
u u 2 ( a ) g -1/2N-1/2 [u2(a)g-3/2kr + u2(a)g r*/2]N (6.51)
where r* is the component of a = (-1/3)-connection, which vanishes
when u is the normal likelihood parameter.

6.6. Asymptotic evaluations of tests: general case


Our geometrical method of test is easily generalized so as to
be applicable to the general (n, m) -curved exponential family. We
consider three cases: (1) D is simple hypothesis D = {u O}' (2) D is
a region having an m-dimensiona1 volume, and (3) D is a submanifold.
The last two cases may be regarded as testing a simple hypothesis
under influences of unknown nuisance parameter, as will be shown
soon. We describe the method and results in brief, avoiding the
198

detailed calculations. Refer to Kumon and Amari [1985] for details.

(1) Testing simple hypothesis D A simple hypothesis


U o is tested against alternatives HI : u f- u O . The power
function PT(u) is subject to the level condition PT(uO) a +
O(N- 3 / 2 ), and the unbiasedness condition aaPT(uO) = O(N- 3 / 2 ). We use
a unit vector e = (e a ) belonging to the tangent space of M at u O'
gab(uO)eaeb = I ,
and denote by

ut,e = Uo + teN- 1/2


the point which is separated from U o in the direction e by the
geodesic distance t/IN. We simetimes omit the indices of quantities
as in the above. All of these u t ,e constitute the set

Ut = {u t e I gab(uO)eae b = I} .
,
For a test T, we can construct an associated ancillary family A such
that the critical region R is composed of A(u)'s where u E~, i.e.
R = {A(u) I uE ~ = RnM} .
The statiS'tic x is decomposed into (0, {)) by x n (ft, ~) by this
ancillary family, and we define

ute IN(ft - ut,e) + C(ft)/(2,;N) .


Then, we already have the Edgeworth expansion of the distribution

p(u~ , e; u t , e) in (6.24), in terms of QabK and H~~!. The power


PT(t, e) of a test T at ut,e is given by

where R__ t,
--N, e is the domain of integration obtained from R__
--M by

transforming the variable from ft to u~,e.

We evaluate the power function PT(t) of a test T by the average


of the powers at u t ,e in all the directions e,

PT(t) = < PT(t, e) >' (6.52)


where < > denotes the average with respect to e. By expanding the
199

power into
PT(t) = Pn (t) + At PT2 (t) + ~ PT3 (t) + 0(N- 3 / 2 )
the first-, second- and third-order efficiencies are defined by using
these i-th order functions in the same manner as in the scalar
parameter case. The level condition is written as PTI (0) = tl +
O(N- 3 / 2 ). The unbiasedness condition P T ' (0) = 0 or more strongly

PT(t) = PT(-t) is automatically satisfied, because the power PT(t) is


obtained by taking the average in all the directions.
Now we show the first-order theory. The first-order term of the
distribution of u*t ,e is

p (u*t , e; u t , e) n(u*t , e; gab ) + 0(N- l / 2 )


where

-
Since gab S gab always holds with equality when and only when gaK(u O)
0, the distribution of
U*t,e is mostly concentrated around the
origin when the ancillary family is (asymptotically) orthogonal,
i.e., when gaK(u O) = O. Hence, a first-order efficient test should
have an asymptotically orthogonal ancillary family. We next search
for the section ~ = RnM of the critical region by M. The average
power PT(t) depends on the shape of 111' so that we obtain the I11
which maximizes PT(t) uniformly in t. (In the scalar parameter case,
RM is an interval and is automatically determined from the level and

unbiasedness conditions.) We maximize

= <f R_ _
--M, t, e
P (u~ e; u t e) du~ e )
' , ,

fl11<S(U5 - te; gab» dU5 (6.53)


subject to the level condition PT(O) = tl. The optimal is given by
Lagragian variational method by the solution of

C f~ {~(u5 - te; gab»- An(u5; gab)} du 5 = 0 , (6.54)


where A is the Lagrangian multiplier and the variation c is operated
on the domain of integration RM. The boundary d~ of the optimal
200

section l\i, that satisfies the variational equation, is composed of


the point u5 satisfying

<n(u5 - te; gab» - An(u5; gab) = 0


Since the average over all the directions e of n(u5 - te; gab)
depends only on t and gabu5au5b, the section is given by

~ = {u5 Igab(uO)u5au5b S c6} (6.55)


where RM does not depend on t. Here, Co is to be determined from the
level condition. This shows that the test statistic of a first-order
uniformly efficient test is given by

A( x-) gab ()_*a_*b


Uo Uo Uo . (6.56)
It is subject to the X2 -distribution of m degrees of freedom. Hence,
2
c0 = x2 is the upper lOOoc% point of x 2 (m). The first-order optimal
m,a
power is given by

Pl(t) = 1 - IRM n(u5 - te; gab) du5 '

1 - Is (cO) n(u - te; 0ab)du (6.57)


m-l

where Sm_l (cO) is an (m-l) -dimensional sphere with radius cO. The
above expression does not depend on a specific direction e.

Theorem 6.17. A test T is first-order uniformly efficient, when


and only when the associated ancillary family is asymptotically
orthogonal and the section RM of the acceptance region is given by
-1/2 2 _ 2
the m-dimensional geodisic ball with radius cON ,cO - Xm,a.

The second- and third-order powers of a first-order efficient


test T can be calculated in a similar but very complicated way. We
assume that the section RM of the acceptance region has a spherical
shape boundary similar to (6.55), so that it is given by

RM = {u5 I gab(u O)(u5 a - oa)(ut - ob) S (cO + £)2} (6.58)


2 2
wi th Co = Xm , a . Here, 0 and £ are determined from the level and
unbiasedness condition. It is shown that 0 is a quantity of order
201

N- 1 / 2 and it is conunon to all the firs t-order efficient tes ts not


depending on
H(m) which characterize the higher-order
KAa
properties of a test. The quantity e: is of order N- 1 depending on
(m)
HKAa and QabK' This shows that Theorem 6.3 that the first-order
efficiency implies the second-order efficiency holds in this case,
too.
In order to show the third-order results, we define the function

Zm(c, t) t m- 1 Sm_1~n(uc
- te; gab):>
any point satisfying gab (u O - u ca ) (ubO - u bc )
a
where u c is = c 2 /N and
Sm-1 is the area of the (m-1)-dimensiona1 unit sphere (m > 1). This

does not depend on gab so that we may put gab = 0ab' It can be
expressed as
-m/2 n-2 1 2
Zm(c, t) = (21T) c Sm_2 exp{- T (t + c 2 )}Am(c, t ) , (6.59)
1
Am(c, t) = f (1 - z2)(m-3)/2 exp{- ctz}dz .
-1

The first-order power function (6.56) can be expressed as


2
Co
P1(t) = 1 - fo Zm(c, t)dc (6.60)

We next define two functions,

~(t, m) (6.61)

Zm+4(c O' t)
J(t, m) = 1 - (6.62)
2 (m+2) Zm+2(c O ' t)

We then have the following theorem, whose proof is omitted (Kumon and

Amari [1985]).

Theorem 6.18. The third-order power loss function of an

efficient test T is given by


202

(6.63)
where
g {abcd} =
gabg cd + gacg bd + g
ad g
bc ,

J(t) 0 H(e) = J(t m)H(e) _ (m/2c 2 )g gcdH(e) (6.64)


abK ' abK 0 ab cdK
An efficient test T is third-order most-powerful at t' , when the
associated ancillary family satisfies
H(m) = 0, Q = J(t') 0 H(e) (6.65)
KAa abK abK

Obvious ly, when m = 1, the above theorem reduces to the scalar


parameter case. We can similarly characterize the m.l. e. test, the

1. r. t., and the efficient score test in the multi-parameter case.


The results are the same as in the scalar parameter case. They all
satisfy H(~) = 0, and Q b = 0 for the m.l.e. test, Q b (1/2)H(e)
Kl\a a K a K ab K
for the l. r. test, and Q b e
= H(b ) for the efficient score test.
a K a K
However, in the multiparameter case, none of these tests are
admissilbe, i.e. not third-order efficient. In order to obtain a

third-order efficient test, it is necessary that Q = J (t') 0 H(e)


abK abK
is satisfied for some t'.

From this theorem, we can obtain the third-order power loss


functions of various widely used efficient tests. In the
one-dimensional case, the power loss function l'.PT(t) of a test T is
universal in the sense that it depends on the statistical model only

through the scalar curvature y2, i.e., l'.P T (t)/y2 does not depend on

the model. In the multiparameter case, we can define two scalar


exponential curvatures of model M. They are y2 and K2 defined by
2 = l H(e) H(e) ab cd KA (6.66)
y m abK cdA g g g ,
,2 _ H(e) H(e)gac bdgKA
y - abK cdA g ,
2 ,2 2
K Y - y . (6.67)
The power loss function depends on the model through these two scalar
exponential curvatures of the model M.
203

We can explicitly obtain the power loss functions of various tests


(see Kumon and Amari, 1985). The likelihood ratio test behaves
nicely in a wide range. The efficient score test (Rao test) behaves
nicely in a multiparameter case.
When desired QabK is given, it is possible to design a test
whose ancillary family has the specified QabK and H~~~ = O. We show
one method of obtaining the third-order t' -efficient test. By this
method the m.l. e. a is modified to yield new test statistic by the
use of the (asymptotic) ancillary statistic. Let (a, ~) be the
w-coordinates of the observed point x in the coordinate system
associated with the m.l.e., x= n(a, ~). Since this ancillary family
is exactly orthogonal,
gaK(u) = Bai(u)B Kj (u)gij (u) =0
holds, where Bai = dani' BKj = dKnj. When desired QabK is given,
let us modify the m.l.e. in the following way,
a,a aa + gabQbcK(ac _ U~)~K ,

~' (6.68)
Then, a' const. gives a modified ancillary family with the
coordinate system ~' = ~ in each A(a'). The observed point x in S
can be represented by the new coordinate systems
x = n'(a', ~') = n(a + gQ(a - uO)~, ~) ,
where n' is the coordinate transformation from (u', v') to n . Hence,
the term g' (u) of the new ancillary family at v = 0 is given by
aK
(dani) (aKnj)gi j
Bai(u){BKj(u) + Bbjgbc QCdK (d d )}
u - uO
b b
QabK(u u O)
This shows that the new ancillary family has the specified QabK.
Hence, any scalar function of a', e.g., gab(uO)(a,a - uO)(a,b - u~)
gives the test statistic of a test which is efficient and has the

prescribed QabK. Hence, in order to get a third-order t-efficient


test, we may use a function of
204

a,a = aa + gabJ(t)OH~~~~K(ac - u~) (6.69)


as the test statistic. From the relation
- i - i
dbdc~(x, Q) = dbBc[x i - ni(uO)l - BbBci

the term Hb(e)~K may be replaced by


CK

tbc = abac~(x, a) + gbc(a) .


We will show in the next section that this is a first-order
asymptotic ancillary statistic. The third-order efficient test can
be designed by using the m.l. e. a (or any other efficient estimator
or test statistic) together with the first-order asymptotic ancillary

statistic tbc.

(2) Testing hypothesis where D is a region. When D is a region


of M with an (m-l)-dimensional smooth boundary aD, it can be

represented as
D = {u I f(u) ~ c}
by using a smooth function f(u). The boundary aD is represented by

f(u) c. The problem ~_s then reduced to testing HO : f(u) < c

against Hi: f (u) > c. It is convenient to introduce a new


coordinate system u = (u l , s = 2, ... , m, in M such that the

region D is expressed by u l ~ c by the use of the first coordinate of


u in the new coordinate system. The remaining (m-l) coordinates
2, ... , m) serve as a coordinate system of the boundary
manifold aD. Moreover, we can choose a new coordinate system u =

(u l , zS) such that

gls <ai' as) = 0 , (6.70)


always holds at aD. Then, the coordinate axis u l ( or the tangent
vector a l of u l ) is orthogonal to aD, whose tangent space is spanned

by as = a/az s , s = 2, ... , m.
1
The problem is now restated: Test a hypothesis HO : u ~ U o
against Hi : u l > U o in the curved exponential family {q(x, u)}, u =
1
(u, z).
s Here, we are interested in the value of the first
205

1
parameter u, and the other parameters ZS (s = 2, ... , m) are of no
interest. Such parameters are called nuisance parameters. Hence,
the problem is a one-sided test of a scalar parameter u l in the
presence of nuisance parameters z.
We denote by (u~, z) the point which is separated from D by a
geodesic distance t/~ and whose nuisance parameter is z = (zs). We
then have
(6.71)

where gll(uO' z) is the metric along the axis u l at (u O' z). For a
test T, we can associate an ancillary family A. Then, the statistic
x is decomposed into (ftl, 2, ~) by x= 1) (ftl, 2, ~). In this case,
the problem reduces to testing u l ~ Uo under the model {q(x, Ulj 2)}
where the estimate 2 is substituted for the nuisance parameter z.
By integrating the Edgeworth expansion of p(u~, z*j ut ' z) with
respect to z*, where u~ is defined as before, we have the Edgeworth
expansion of p (ut jUt' z). This function is used to calculate the
power PT(t, z) of test T at (u t ' z). It is easy to show that a test
T is efficient, when and only when the associated ancillary family is
asymptotically orthogonal, satisfying glK(u O' z) = 0 for all z. The
domain RM of an efficient test T is of the form ~ = {ut) I ut) >

u+ (1:'*) }, where
Ig1l(u O' 2) u+ = ul(a) + o(2)N- 1/2 + E(2)N- l + O(N- 3/2 )
Since the problem is reduced to a one-sided test of a scalar
parameter case except for the term 2, we can use the same techniques
as we used in sections 6.2 and 6.3. We show only the results (see
Kumon and Amari [1985]).

Theorem 6.19. A test T is first-order, and at the same time


second-order, uniformly efficient, when and only when the associated
ancillary family is asymptotically orthogonal. An efficient test T
is third-order most powerful at t, when and only when the ancillary
206

family satisfies
H(m)
KH o,
(6.72)

The theorem shows that a test of a function f(u) of u can be


treated in the same manner as in the scalar parameter case. The
first- and second-order power functions of an efficient test is the
same as in the scalar parameter case. The third-order power loss
function is also the same, except that the square of the curvature

y2(uO' z) Hi~~Hi~~giigKA
depends on the nuisance parameter z. This does not imply that the
third-order power function is the same in the both cases. Although
the power loss function is the same, the third-order envelope
function is not the same. We can measure the effect of intervention
of the nuisance parameter by comparing the third-order power PT3 (t,
z) of an efficient test T when the value of z is known with that when
z is unknown. As is expected, the distribution p(u*
t ; ut ' z) is a
little more dispersed when z is known than when z is unknown. It is
given by integrating p(u*
t, z) with respect to z* when z is
unknown. This additional dispersion is given rise to by the squares
of the two curvatures of aD defined in the following. One is the
square of the mixture curvature of aD,
(Hm )2 = H(m) H(m) 11 pr qs
U, Z pql rsl g g g
and the other is the square of the twister component of the
exponential curvature of aD,
e 2 = H(e)
(HU , Z , V) ph
These quantities are defined and explained in Chap.S in more detail,
when we study the statistical inference in the presence of nuisance
parameters.

Theorem 6.20. The third-order power loss induced when the value
207

of z is unknown is given by
(6.73)
~P(t) =~~l(t, a){(~,Z)2 + (H~,Z,V)2}.

(3) Testing hypothesis when D is a submanifold. When D is a


(m-k)-dimensional smooth submanifold, it is represented as
D = {u I fl(u) = c l ' ... , fk(u) = ck }
by using k smooth functions. We can choose a new coordinate system
(ua , zS) such that the submanifold D is given by u l = cl.' u 2.. c 2 '
... , uk ck by using the new coordinates. Then, the other
components z = (zs), s = k + 1, ... , m, serve as a coordinate system
in D. The problem then reduces to test HO : u a = c a (a = 1, ... , k)
in the presence of the nuisance parameter z = (zs), s = k + 1, ... ,
m. This is the problem of testing a simple hypothesis HO : u = U o
against the others HI : u ; u O' where u = (ua ) , U o = (c l ' ... , c k ),
in the presence of the nuisance parameters. We can choose the
coordinate system such that
gas = <aa' as) = 0
holds at D. Hence, aa is orthogonal to D.
The method of analysis is quite similar to the previous case.
We define a point (u t ' z), which is determined by t, z, and e as
ut = U o+ te/lN' , (6.74)
where gab (uO' z) eae b = 1. Then the point is separated from D by a
geodesic dis tance tIN at point ZED in the orthogonal direction e.
For a test T, we have an associated ancillary family, with which i is
decomposed into (ft, ~, ~) by i = n(ft, ~,~). We can then obtain the
Edgeworth expansion of the distribution of 0 or ft~ by integrating the
distribution of (0, ~, ~) with respect to ~ and~. Then, the problem
reduces to the case of testing a simple hypothesis HO : u = Uo in the
k-dimensional parameter u in the presence of the nuisance parameter.
We show only the results.
208

Theorem 6.21. A test T is first- and second-order uniformly


efficient, when and only when the associated ancillary family is
asymptotically orthogonal. An efficient test T is third-order most
powerful at t, when and only when the ancillary family satisfies
H(m)
KAa = 0 ' aQ
bA = J(t', k) 0 H(e)
abK' QaSK -- QSpK =
0 . (6.75)

The test loses the third-order power by amount


~P(t) = {(k + 2)/(2c6)}s(t, k){(H~ , z)2 + (6.76)
when the value of the nuisance parameter z is unknown.

6.7. Notes
The higher-order asymptotic theory of statistical tests has been
developed by Pfanzagl [1973], Chibisov [1973 a], Pfanzagl and
Wefelmeyer [1978], Pfanzagl [1980], Bickel et al. [1981], etc.
However, the theory had been far from complete compared to the
higher-order asymptotic theory of estimation. This is partly because
the structure of the associated ancillary family is much more
complicated in the case of test than in the case of estimation.
Efron (1975) pointed out the importance of the statistical curvature
in the problem of tests.
The geometrical theory of higher-order asymptotics of
statistical tests and interval estimators was fully developed in
Kumon and Amari (1983) in the scalar parameter case, where both
one-sided and two-sided tests were equally analyzed. There are
widely used efficient tests such as the likelihood ratio test,
efficient score test (Rao test), Wald test, locally most powerful
test, etc. They are also second-order efficient. However, their
third-order characteritics were not well known. Their third-order
power loss (deficiency) functions were explicitly given by Amari
(1983a) based on Kumon and Amari (1983). The power loss functions
are universal in the sense that they depend on statistical model M
only through its statistical curvature y2. The results elucidate the
characteristics of these widely used tests. For example, Rao test is
209

not third-order admissible in the two-sided case, but it behaves


locally very nice (see also Chandra and Joshi, 1983). The likelihood
ratio test behaves nicely in a wide range. The characteristics of
interval estimators were also shown in Kumon and Amari (1983).
It is still difficult to construct a general third-order
asymptotic theory of test in a vector parameter case. We have shown
some results and the method of approach in the isotropic case, i.e.,
the case where the power depends only on the geodesic distance from
the null hypothesis and it does not depend on the direction of the
alternative hypothesis. Refer to Kumon and Amari (1985) for details.
7. INFORMATION, ANCILLARITY AND CONDITIONAL INFERENCE

The present chapter studies the amount of information


carried by a statistic t(x) from the geometica1 point of
view. The amount of information plays a fundamental role
in parameter estimation and statistical hypothesis

testing. Higher-order asymptotic sufficiency,


higher-order asymptotic anci11arity, and condi tiona1
information are defined in the beginning. Then, the
performance of asymptotic conditional inference is
studied, and the role of an asymptotic ancillary statistic
is elucidated therefrom. We give an answer at least from
the asymptotic point of view to the problem on which
ancillary statistics one should condition when there are a
number of ancillaries. Finally, the sufficient vector
statistic x is decomposed into component statistics of

geometrical character according as the magnitude of the


amount of information.

7.1. Conditional information, asymptotic sufficiency and asymptotic


ancillarity
In a statistical model M = {q(x, u)}, the metric tensor gab is
defined by the Fisher information which the random variable x carries
for estimating u. A statistic t(x) which is a fuction of x also
carries some information for estimating u. The amount of information
which t carries is defined similarly by using the logarithm t(t, u)
of the probability density function of the statistic t as
(7.1)

where the capital T is used to show that we are interested in the


information carried by statistic t and not in a specific value of t.
This gab(T) is a positive semidefinite matrix depending on u.
211

Obviously, for the statistic x itself, gab (X) is the ordinary Fisher
information matrix. A statistic t is said to be sufficient, when

gab (T) = gab (X) holds, i. e., when t carries the same amount of
information as the original x does. On the other hand, a statistic t

is said to be ancillary, when gab(T) = 0, i.e., when it carries no


information for estimating u.

When there are two statistics t(x) and s(x), the amount gab(T,
S) of the information which t and s together carry is defined
similarly by using the joint probability density of (t, s). When t
and s are independent, it is easy to prove the additivity

gab(T, S) = gab(T) + gab(S) .


This additivity does not hold in general, because t and s may carry
common information or t and s together carry mutually complementary
information which each of them alone does not provide.
When the value of statistic s is known, what is the additional
amount of information obtained by knowing the value of another

statistic t? This is given by the following conditional information

gab(T I s) of statistic t conditioned on statistic s,

gab(T s) = E[da£(t I s, u)db£(t I s, u) I s1 , (7.2)

where £(t s, u) is the logarithm of the conditional probability


density of t conditioned on sand E[ . I s1 is the conditional
expectation. Its average over all s yields the expected conditional

information
(7.3)

which we call the expected conditional information of T conditioned


on S. The following relation is easily derived,
(7.4)

which shows that the conditional information gab(S I T) represents


the amount of loss of information when we discard t from a pair of

statistics sand t, keeping only s.

We denote the conditional information gab(X I T) by ~gab(T),


212

(7.5)
which is indeed the amount of loss of information by summarizing the
original data x into t and keeping only t. The following relation is
of frequent use, when we calculate the information loss,

l1g ab (T) = E[COV[daR.(X, u), dbR.(x, u) It]] , (7.6)


where Cov [r, s i t ] is the condi tional covariance be tween rand s
conditioned on t. It is easy to see that a statistic t is
sufficient, if l1g ab (T) 0 holds. The relation

gab(S T) = l1g ab (T) - l1g ab (T, S) (7.7)


is also useful.

When t is ancillary, gab (T) = O. Although t itself does not


carry any information, the conditional information of T

conditioned on S may carry some information. This shows that, even

when t is ancillary, it may recover some information together with


another statistic s which s does not provide by itself. We can make

use of this information for statistical inference based on s by

conditioning on t.
There exist in general neither non-trivial ancillary statistics
nor non-trivial sufficient statistics. However, we can always
construct asymptotic ancillary statistics and asymptotic sufficient
statistics in the following sense. When N independent (vector)

observations xl' ... ,x N are available, they together carry the amount
N
gab(X ) = Ngab(X)
of Fisher information, where XN denotes the set of N observations.
In a curved exponential family M, the arithmetic mean

x= ~ LXi
is a sufficient statistic, retaining the whole information

gab(~ = Ngab(X) .
Let t(x) be a statistic which is a function of x. When the loss of
information by summarizing x into t is
213

~gab(T) = 0(N-q+1) ,

it is said to be asymptotically sufficient of order q. Hence, a


first-order asymptotically sufficient statistic carries all the
information of order N except for that of order 1. On the other

hand, a statistic t(x) is said to be asymptotic ancillary of order q,


when its information is of order N-q+1 in spite that x carries

information of order N,

gab(T)
We show examples of asymptotically sufficient and asymptotically
ancillary statistics.
Let u(x) be a consistent estimator and let A be the associated
ancillary family. By introducing a coordinate system v = (VK) in
each A(u), the sufficient statistic -
x is decomposed into two
statistics (a, ~) by
x = 'l(a, ~).

Let us study the amount of information included in a and ~. We

first calculate the information gab(U) of a or the information loss


~gab (U) caused by summarizing x into the estimated value a. This
gives an evaluation of the estimator a from the Fisher information
point of view. By substituting (4.20) in
i -
Noat(x, u) = NBa(u) [xi - 'li(u)l
we have
(7.8)

The leading term of the information loss ~gab(U)'" is evaluated from


(7.6) as

E[Cov[Noat, Nobt I all


NE[Cov[gaawa , gbS wS lall + 0(1).
By the use of the decomposition
_a _b + _K
gaaw gab u gaK v
and the asymptotic normality of w = (u, v), we easily have
KA
~gab(U) = NgaAgbKg + 0(1).
214

Therefore, from (7.5), the amount of information carried by 0 is


'" \K
gab(U) = N(gab - ga\gb\g ) + 0(1) (7.9)

in agreement with the fact that the covariance of 0 is


N-lg ab = {gab(U)}-l + 0(N- 2 ).
When and only when the associated ancillary family A is orthogonal,

ga\ = 0 and the information loss is of order 1. Hence, a first-order

efficient estimator is first-order sufficient.

In order to evaluate the order 1 term of ~gab(U) in the


orthogonal case, we use the expansion
NdaJl,(X, u) - INg u b +.1. r(m) ubu c +.1.. H(m) vKv\
- ab 2 bca 2 K\a
-H(b ) ubv + 0(N- / ),
e K l 2 (7.10)
a K
which is obtained from (7.8) by taking account of
C = r(m) c = H(m) C = _H(e)
bca bca' K\a K\a' aKb abK'

The loss of information is then obtained from (7.6) as follows.

Theorem 7.1. The loss of information by summarizing data into an

efficient estimator 0 is given by the sum of the square of the


exponential curvature of M and a half of the square of the mixture

curvature of A,
~ gab(U)
(He)2 +.1. (Hm)2 + O(N- l ).
M ab 2
=
A ab
(7.11)

It is minimized if the associated A(u) of 0 is mixture-flat at v = O.

We can evaluate a consistent estimator 0 from the Fisher


information point of view. Let us expand the amount of information

gab(U) and the information loss ~gab(fi) of 0 as


h -1
~gab(u) N~glab + ~g2ab + N ~g3ab +
~ -1
gab(U) Ngab - ~g2ab - N ~g3ab
where
215

KA.
gab = gab - gaKgb\g = gab - ~glab'
This shows that the first-order term of the covariance of a
consistent estimator ft is given by the first-order term of its amount

of information gab(U), Hence, an estimator is first-order sufficient


if and only if it is first-order efficient. Geometrically, this

implies that the associated ancillary family A is orthogonal. Among

the first-order efficient estimators which are automatically


second-order efficient, one whose ancillary family has a vanishing
mixture curvature at v = 0 is third-order efficient, minimizing the

third-order term of the covariance of its bias corrected version. A


third-order efficient estimator minimizes the second-order term ~g2ab

of the information loss, and vice versa. Note the difference in the

definition of the order of terms, since we expand the information


loss in the power series of N- l , while we expand the covariance in
the power series of N- l / 2 . The amount of information of an efficient
estimator ft is closely related to its covariance, although the
....
inverse of the information gab(U) does not coincide with the

covariance of except for only the first-order term. The


third-order term (i.e., the term of order N- l ) of the covariance of

an efficient u* is written as

In fact, the information matrix "


gab(U) is a tensor, but the
. _*a *b
covariance matr~x Cov[u , u 1 is not. The latter depends on the
manner of parametrization, because it includes the term (r m)2ab. It

also depends on the bias correction, but the information matrix does

not depend on the bias correction

gab(U) = gab(U*)'
It is often easier to calculate the loss of information ~gab than to
calculate the covarinace. In this case the third-order term of the
. m 2ab
covariance is obtained from ~gab by add~ng the term ( r ) /2.
216

We have already shown that the ancillary family A of the m.l.e.


is orthogonal to M and is mixture-flat. Hence, it is third-order
efficient. Its information loss is minimal up to the term ~g2ab of
order 1. Then, a question naturally arises: Does there exist an
estimator whose loss of information is minimal at all u up to the
term of order N- l . If it exists, is the m.l.e. such a one? We can
answer this question by evaluating (7.8) up to the term of order
O(N- l / 2 ) and evaluating ~gab by the use of the Edgeworth expansion
(4.34). The caluculations are cumbersome. The result includes the
derivative of the mixture curvature H(m~
Kl\a . From this, one can prove
that there exist in general no estimators whose information loss
~gab (tJ) is minimal at all u up to the term of order N- l . The
superiority of the m.l.e. does not follow any more from this super
higher order asymptotic point of view.
We next study the amount of information carried by ~ -*
or v. To
this end, we integrate the Edgeworth expansion (4.34) of the
d istribution 0 -*
f w .
w~t h respect to u-* Then, we have the following
expansion of the fensity function of v * .
_* * 1 KAjJ
p(v; u) = n{v ; gKA(U)} {I + 67§KKAjJ(u)h (7.12)

where the orthogonality gaK = 0 of A is taken into account. The


amount gab(V*) of information carried by ~* or v* is calculated from
,,*
the above expansion. By direct calculations, it is seen that gab(V )
or gab (V) is of order 1, so that ~* or ~ is first-order maximal
,..*
ancillary. The amount gab (V) of information depends on the
coordinate system v taken in each A(u), because gKA in general
depends on u. But is is always possible to choose such a coordinate
system v in each A(u) that gKA(U) = 0KA holds at v = 0 for all u.
Then, the statistic ~ * or ~ expressed in this new coordinate system
is second-order maximal ancillary,
,..* 1 ,.. 1
gab(V) = O(N-), gab(V) = O(N- ).
It is also possible to choose such a coordinate system that KKAjJ(U)
217

o at v = 0 for all u in addition to gKA(u) This is indeed


given by a = -1/3 normal coordinate system at v = 0 of each A(u),

because of KKA~ -3r~~t/3) The ~* then becomes third-order


ancillary

in this coordinate system. However, it is in general impossible to

obtain fourth-order ancillary statistics in this method, because the


term O(N- l ) in (7.12) includes tensorial terms which does not in
general vanish in any coordinate system. However, if we allow an

ancillary family A(N) depending on the number N of observations, it


is possible to construct an ancillary statistic of an arbitrary order

by the method used in Chernoff [1949] (I. Skovgaard, private


communication).

7.2. Conditional inference


When there exists an exact ancillary statistic r, the
conditionality principle requires that the statistical inference
should be performed by conditioning on r. A statistical problem
then is decomposed into subproblems in each of which r is fixed at
its observed value, thus dividing the whole set of the possible data
points into subclasses. It is expected that each subclass consists
of relatively homogeneous points with respect to the informativeness

about u. We can then evaluate our conclusion about u based on r, and


it gives a better evaluation than the overall average one. This is a
way of utilizing information which ancillary r conditionally carries.
However in many cases, there are no exact ancillaries to be
condi tioned on. In the asymptotic case, we can always use the
second-order asymptotic maximum ancillary ~

previous subsection. In order to evaluate the asymptotic conditional


inference, we need to obtain the conditional distribution of an
efficient Q conditioned on the second-order maximal ancillary v.
218

Theorem 7.2. The conditional distribution p(u I v) conditioned

on the second order maximal ancillary v is expanded as

p(u I v) n ( u-. ; gab ){l + ~


1 BN (-
u, v-) + O(N- l )}, (7.13)

u' ua + Ca /(2/N),
where h ab , hK etc. are the Hermite polynomials in U, v, etc.

Proof. By differentiating the relation gbK(u) o which holds for an


efficient ft, we have

da(Bbi BKjgij) O.
Hence, we have

Since ~ is second-order ancillary, gKA(U) = 0KA so that

o= dagKA(U) = da(BKiBAjgij) = CaKA + CaAK - TaKA·


Hence,
K
KAa
= TKAa - CKAa - CaKA - CaAK = - CKAa = - HKAa"
(m)

By the use of these relations, p(ft I ~) is immediately derived from

the joint distribution of (u, v) given in (4.38).


It is noteworthy that the conditional distribution of an
efficient estimator ft depends on the second-order maximal ancillary ~

only through the two quantities

rab = H~~~(u)vK, sa -iH;~~(U)(VKVA _ gKA),


where sa = 0 when ft is the m.l.e. We first give the conditional
evaluation of an efficient estimator ft. From (7.12), it is easy to

show
INE[u a I ~l = _i{H;~)a(VKvA - gKA) + Ca !+ O(N- l ) ,
Cov[ua , u b I ~l = gab + ~H;e)abvK + O(N- l ).

Hence, the quantity sa contributes to the evaluation of only the

bias, and the quantity rab contributes to the evaluation of the


conditional covariance of the estimator ft.
219

Theorem 7.3. The conditional expectation and covariance of an


efficient estimator a are given, respectively, by
E[aa I v] u a + N-l(sa - ica ) + O(N- 2 ), (7.15)
N Cov[aa, a b I v] = gab + N- l / 2 r ab + O(N- l ). (7.16 )

The estimates of sa and rab are given, respectively, by


S -(1/2)H(m)(a){v Kv A _ gKA(a)}
=
a KAa '
r = H(e) (a)v K (t = H(e)(a)~K)
ab abK ' ab abK .
They are related to the derivatives of the log likelihood as
dat(X, a) _2N- l (Sa + H~~)agKA) + 0p(N- 3 / 2 )
dadbt(x, a) = -gab(a) + N- l / 2 r ab + O(N- l ),
in which the negative -dadbt(x, a) of the second derivative of the
likelihood is called the observed Fisher information. Since
-{dadbt(x, a)}-l = gab(a) + N- l / 2 r ab + 0p(N- l ),
the inverse of the observed Fisher information gives an estimate of
the conditional covariance of the estimator. More precisely,
NCov[aa, a b I ~] = -{dadbt(x, a)}-l - N-l/2dcgabuc + Op(N- l ),
so that the estimate is accurate up to the term of order N- 3 / 2 when M

is parametrized so as to satisfy d c gab (u) o. Such a


parametrization is possible only when M is O-flat. When u is a
scalar parameter, it is always possible to get such a parameter.

The set of statistics {~a, tab' ~c} plays a fundamental role in

conditional inference. Indeed, we can prove that they together are


second-order sufficient,

~gab(U, a, 5) = O(N- l )
by the use of the expansion (7.10). This implies that, among the

maximum ancillary ~, the statistics tab and ~a are sufficient to


recover the order 1 term of the loss of information caused by keeping

onlya. In other words from (7.7)

gab(R, 51 U) =~ gab(U) + O(N- l ). (7.17)


220

This shows that tab and ~a keep all the amount of the conditional
information of order 1 which the conditioning statistic 0 loses.

Therfore, only the components tab and ~a are important for


conditional inference in the asymptotic sense, among all the
components of the maximum ancillary v. This gives a suggestion from
the asymptotic point of view to the problem on which ancillaries one
should condition when there are two ancillary statistics. We should
condition on the asymptotic ancillary that recovers the lost
information of order 1 completely, even if there exists another exact
ancillary or a more higher-order ancillary statistic.
In order to show this more clearly, we treat the m.l.e. 0, with

which the term sa related to the bias term is identically equal to O.


The maximal asymptotic ancillary is regarded as an
(n-m)-dimensional vector in the tangent space TO(A) of the ancillary
subspace A(O) at v = O. Its components are vK,
v = V-K"oK'

Here {OK} is an orthonormal basis of TO(A),

gKA = <OK' °A> = °KA'


because v is second-order ancillary. Let us consider k independent
vector fields
i 1, ... , k k !> n-m

of TO(A) and let

ci = <v, e i > = aiKv K


be the projection of v in the ei-direction. Here a.
~K

depend on O. The set of statistics are first-order


asymptotically ancillary but not maximal, because it has only k
components while v has n-m components. We search for the effect of
conditional inference conditioned on k statistics c i instead of the

whole V. We are considering only linear statistics ci in v as

conditioning ancillary. This is justified asymptotically as follows.


Any statistic f(x) is expanded stochastically as
221

f(x) = f o(ft) + N- l / 2 f
1K
(ft)V K + 0 (N- l )
p'
When f(x) is first-order asymptotically ancillary, fO(ft) cannot
depend on ft but is a constant. Hence any first-order ancillary f(x)

can be asymptotically represented by a linear statistic flK(O)v K by


neglecting the term of order N- l .

Let TA(C) be the linear subspace of Tft(A) spanned by k vectors

ei . Obviously, the set of the statistics {c i } is equivalent to the

TA(C)-component of V, i.e., the projection of v to the subspace


Tft(A). Let Pc = (P~) be the projection operator of Tu(A) to TA(C).
Then the vector
(7.18)

reperesents the k statistics c l '" .,ck ' Indeed, they are given by

because of PCe i = e i .
We next define the subspace Til) of TO(A) spanned by
m(m + 1)/2 exponential curvature vectors
e ab = H(e)K~
ab oK' a , b -- 1 , ... , m . (7.19)

They represent the directions of the exponential curvature of M.

Then, the statistics tab defined by (7.13) are given by

tab = <e ab , v> H~~~(O)vK, (7.20)


so that they represent the Til)-component of v, i.e., the exponential

- curvature direction component of V.


The following theorem shows the amount of the expected
conditional information that asymptotic ancillary statistics C = {c i }

carry conditioned on ft.

Theorem 7.4. The expected conditional information of C is given

by
(7.21)

where
222

The statistics c = (c i ) recover all the information of order 1

together with ft, i.e., (ft, c) are second-order sufficient, when and
only when

Corollary. The curvature-direction components tab give minimal

second-order sufficient statistics together with ft.

Proof. By using the relations

gab(C I U) = ~gab(U) - ~gab(U, C),


(7.6) and the expansion (7.8), we have

Since the vector v is decomposed into

v = PCv + (6 - PC)v = Vc + (v - v C)
where 0 (6 A) is the identity operator, we have
K

This proves (7.21). Since

when and only when TA (C):>T1P), the latter half of the theorem is

proved.
Now we show how the conditional inference divides the entire
223

problem into relatively homogeneously informative subclasses. There

are two ways of evaluating the homogeneity of subclasses. One is


based on the covariance of the asymptotic covariance Cov[ua , u b I vel
of 0 conditioned on C, and the other is based on the covariance of

the Fisher information gab(X I v C) conditioned on C (cf. Cox, 1971).

However, the two evaluations lead to the same conclusion, because of

the following theorem.

Theorem 7.5. The conditional covariance of u is given by


_b
Cov[ua , u I vCl gab + H(e)ab v KN-l/2 + 0 (N- l ) (7.22)
K C P
and is given from the inverse of the conditional Fisher information

gab (X I ve) = N{gab - H(e)v KN- l / 2 + Op(N- l )} (7.23)


abK e

Proof. The conditional distribution p(u I v C) is given from

p(u I v C) = Jp(u I v)p(v')dv',


where p(v') is the probability density of v-, = -
v - Vc which is the
component of v orthogonal to TA(C). From (7.13), we easily have

p(u I v C) = n(u'; gab){l + (1/2)N-l/2H~~~V~hab + O(N- l )}.


Hence, the conditional covariance is given by (7.22). The
conditional information is given by the conditional covariance of

dai (x I v C ), where i(x I v e ) is the logarithm of the conditional

density of X. From

dai(x I v e ) = Nda£(X I v e ) - da£(V e ),


where £(v C) is the logarithm of the density of VC ' we have

eov[dai(x I v C), dbi(x I v C) I vcl


N 2 E[d a £(x,u)db£(x ,u) I vcl + 0(1),
since Vc is first-order ancillary. From the expansion (7.8), we have

(7.23).
Theorem 7.6 . The covariance of the condi tional Fisher

information gab(X IvC) conditioned on Vc is given by


224

Cov[gab(X I vC)' gcd(X I vC)] = H~~~ H~~~ g~A . (7.24)


It is maximal, when TA(C) :::> Til) , and {tab} gives the minimal set of
the asymptotic ancillary statistics having the maximal effect.

The proof is i\lUllediate. These results show that the

curvature-direction statistics tab are important, not because they


are higher-order ancillary but because they keep all the information
of order 1. At least in the asymptotic sense, it is not important to
condition on an exact ancillary even when it exists, but is important
to condition on the curvature-direction components of the first-order

ancillary v. We have already shown this in section 6.4 of the


conditional test. The present theory also gives a solution to the
problem on which ancillary one should condition (cf. Basu's famous
example of a multinomial distribution, Basu, 1975), although the
Edgeworth expansion is not necessarily valid to discrete
distributions. We give some examples.

Example 7.1. Correlation coefficient


Let (y, z) be jointly normal random variables with mean 0 and
covariance matrix

where the covariance or correlation coefficient u is the unknown


parameter of interest. The probability density q(y, Z; u) is given

by
q(y, Z; u) = exp[-(1/2) {(1_u2 )-1(y2 + z2 - 2uyz)}
-(1/2)log(1-u 2 )].
The family M = {q(y, z; u)} can be regarded as (3, l)-curved
exponential family imbedded in an exponential family S = {p(x, e)},

where x = (xl' x 2 ' x 3 ), e = (e 1 , e 2 , e 3 ) and


p(x, e) = exp{eix i - ~(e)}
225

with
2 2
xl = Y , x2 = z x3 yz
and the imbedding is given by
2 -1 2 -1
6 1 (u) -(1/2)(1-u) , 6 2 (u) - (1/2)(1-u) ,
2 -1
6 3 (u) u(l-u) .
We hereafter write 6 i instead of 6 i to avoid the confusion. The

potential function ~(6) is


~(6) = -(1/2)logD + const.,
It should be noted that 6 1 (U) = 6 2 (u) holds for all u.
Since the submanifold given by 6 1 = 6 2 is I-flat and hence is

two-dimensional exponential family, M is actually a (2, 1) -curved

exponential family. However we analyze M in the three-dimensional S

in this example to show the role of ancillary statistics.


The geometrical quantities of S are given by

S6 22 S6 1 6 2 -2D -46 2 6 3

D- 2 S6 1 6 2 -2D S6 21 -46 1 6 3
gij

-46 2 6 3 -46 1 6 3 26 2 + D
3

They are evaluated on M as

lli(u) [1, 1, u],

{U
2u 2 2u

J
gij (u) 2 2 2u

2u 2u u2 +
226

-2u
1 -2u
-2u 2(1 +
The tangent vector 0a of M is given by
0a Bio. = B oi
=
a ~ ai'
where the suffix a standing only for 1,
Bi ei(u) (1 _ u 2 )-2[_u, -u,
a
Bai ni(u) [0, 0, 1],
and denoting the derivative with respect to u. The Fisher
information gab of M is
i
gab = <oa' db> BaBbi = (1 + u 2 ) (1 - u 2 )-2.

The model M is mixture flat, because of


~ a Bbi -- n" i -- (0 , 0 , 0) ,
o
r(M) =
abc H(m?
0 'ab~ = 0,

The exponential curvature H~~)i and the exponential connection r~~~


of M are given from
oaB~ = ei = (1 - u 2 )-3[_3u 2 - 1, -3u 2 - 1, 2u(u 2 + 3)]
r~~~ = (oaB~)Bci 2u(u 2 + 3)(1 u 2 )-3,

H(e)i r abc
(e) Bigcd = _ (1 + u 2 )-1(1 2)-1[1 1 0]
ab d - u '"

H(e? = -2(1 + u 2 )-1(1 _ u 2 )-1[1 + u 2 , 1 + u 2 , 2u] .


ab~

The square of the exponential curvature is


(H~);b = 4(1 + u 2 )-1(1 _ u 2 )-2,
(H e )2ab
M
= 4(1 _ u 2 )2(1 + u 2 )3.
Let A be the ancillary family associated with the m.l.e. Then,
A(u) is mixture-flat and its tangent space Tu(A) is orthogonal to 0a'
Let us fix an orthonormal basis {oK} (K = 2,3) of Tu(A) , oK = B;oi
BK~.oi such that 02 is a unit vector in the curvature direction,
H(e)i o . = H(e)2 0
ab ~ ab 2,
spanning Til), and 03 is a unit vector orthogonal to it. We then
have
227

or
u
2
, 1 + u 2 , 2u],
2 2
- u , -(1 - u ) , 0].

Hence
H~~~ = -2(1 + u 2 )-1/2(1 - u 2 )-1[1, 0].

Since A(u) is mixture-flat, we can specify the n-coordinates of a


point of S in terms of the new (u, v)-coordinates (u, v l ' v 2 ) as

ni(u, v) n;(u)
...
+ vKB K~.(u)
[1, 1, u] + v 2 (1 + u 2 ) -1/2 [1 + u 2 1 + u 2 , 2u]
+ v 3 (1 -u 2 ) -1/2 [1 - u 2 , -1 + u 2 , 0]
For N independent observations, the observed point Ai is given

Xl = ~hf, x2 = ~Izf, x3 = ~IYizi'


It is easy to show that xl or x 2 is an exact ancillary. Moreover,

(xl + x 2 ' x3 ) is the minimal sufficient statistic as we noticed


before that M is a (2,1)-curved exponential family. The new

coordinates «(1, ~l' ~2) are obtained by solving


- 1 + ~2(1 + (12)1/2 + ~3(1 _ (12)1/2,
xl
x- 2 1 + ~2(1 + (12)1/2 - ~3(1 _ (12)1/2,
-
x3 (1 + 2(1(1 + (12)-1/2~2'

They are given asymptotically by


-
(l = x 3 - 2N
-1/2 xx- (1 + x-2 ) -1 + 0p(N -1 ),
3 3
~2 N- l / 2x(1 + x~)-1/2 + Op(N- l ),

~3 (1/2)N- l / 2 (xl - x2)(1 - x~)-1/2 + Op(N- l ),

where
-
X 1(_
= 2: xl + x-) 1 Nl/2(-xl + x- - 2) .
2 ="2 2
The expected conditional informations of xl' x 2 ' v 2 and v3 are

given from Theorem 7.4 as


-,.. 2-1
gab(X 2 I U) = 2(1 + u ) + O(N ),
4(1 + u 2) + O(N- l ),
228

This shows that the exponential-curvature direction v2 of vK keeps


conditionally all the information of order 1 which a loses, while the
exact ancillary xl (or x 2 ) recovers only part of it. The conditional
Fisher informations are given, respectively, by

gab (X xl) N{gab (1/2)H(e)(1 + u 2 )1/2(xl - l)} + 0(1),


ab2
gab (X ~2) N(gab - H(e)~ )
ab2 2 + 0(1),
gab (X ~3) Ng ab + 0(1).
The variances of the conditional Fisher information are

V[gab(X xl)] 2N(1 + u 2 )2(1 _ u 2 )-2 + 0(1),


V[gab(X ~2)] 4N(1 + u 2 )(1 _ u 2 )-2 + 0(1).

V[gab (X ~3)] 0(1).


This shows that it is effective to condition on the

curvature-direction asymptotic ancillary ~2 even when an exact


ancillary xl or x 2 exists.

Example 7.2. Spiral model


Let us consider the following family S of distributions S

{p(x, 6)} of three-dimensional random variable x = (xl' x 2 ' x 3 ),


1 2 2 2
p(x,6) c exp[- Z{(x l - 6 1 ) + (x 2 - 6 2 ) + (x 3 - 6 3 ) }]
1 2 2 2 i
c exp[-I(x l + x 2 + x 3 )]exp{x i 6 - ~(e)}
specified by 6 (6 1 , e2 , 6 3 ). The random variables xi are
independently and normally distributed with mean ei and variance 1.

Since
1 2 2 2
~(6) = 2(6 1 + 6 2 + 6 3 ),
the metric tensor is given by

gij = °ij
and Tijk = O. Hence, all the a-connections are identical, because

d~k) = 0 for any a. The manifold S is Euclidean and 6 gives its


~J

affine coordinate system. Obviously, ei = ni holds.


Let M = {q(x,u)} be a (3,1)-curved exponential family imbedded
229

in S by
nl(u) cos u, n2(u) = sin u, n3(u) = u,
q(x, u) = c exp[-~(xl - cos u)2 + (x 2 - sin u)2 + (x 3 u)2}].
The scalar parameter u may take on all the real values - 00 < u <
00 or it may take only on 0 ~ u < 2n with mod 2n. The model M forms
a spiral curve in S. The tangent vector of n(u) is given by
da = (ni) = (B ai ) = [-sin u, cos u, 1],
with
gab = <d a , db> = 2.
The curvature direction is given by
ni = daBbi = [-cos u, -sin u, 0],
which is orthogonal to da in the present case. We define a basis
{d K} (K = 2,3) in the subspace Tu(A) orthogonal to Tu(M)
by
d2 [- cos u, - sin u, 0],
d3 (1/ /2') [sin u, -cos u, 1],
such that gK\ = <d K, d\> = 0KA and d2 is the curvature direction. We
have

HabKdK = Hab 2d 2'


Hab2 = 1, Hab3 = 0,
where all the a-curvatures are identical in the present spiral model.
The square of the curvature tensor is

(HM);b = HacKHbdAgKAgcd t,
(HM)2ab = ~ .

In the coordinate system (u, v) = (u, v 2 , v 3 ) associated with


the m.l.e., the point ni(u, v) is written as
ni(u, v) = ni(u, 0) + VKd K.
Hence, for N observations, the observed point x yields
x- = n(Q, ~) or
- cos Q - ~2 cos Q + (1/ /2')~3 sin Q,
Xl
x2 sin Q ~2 sin Q (1/ /2)~3 cos Q,
230

-
x3 = ft + (1/12)~3·

By solving these equations, we have


1 -2 -2 -1
~2 = 2(1 - xl - x 2 ) + 0p(N ),
~3 = -f{z - tan(x 2 /x l )} + 0p (N- l ),
ft = t{z + tan(x 2 / xl)} + 0p(N- l ).
The joint distribution of u and v is given by
p(u, v; u) = c exp{-~(2u2 + v~ + v~)}{l + (1/2)N- l / 2
,,(u2 - -t)v2 + O(N- l )},
and the score function is by
(u + sin u) - v 2 sin u + (v 3 /1!)(1 - cos u)
°
2u - uv 2 + p (N- 3 / 2 ) .
We can hence calculate the conditional Fisher information of ~, ~2

and ~3 as
gab(V I U) = gab(V 2 I U)
gab(V" 3 I"U) = O(N -1 ).
This shows that the curvature-direction ancillary v 2 keeps all the
conditional information of order 1 while v3 keeps none. The
conditional Fisher informations conditioned on V, v 2 , v3 are

gab (X v) = gab(X I v 2) = 2N - ~v2 + 0(1),


gab (X v 3) = 2N + 0(1),
and their variances are

V[gab(X v)] = V[gab(X I v 2 )] N + 0(1),


V[gab(X v 3 )] = 0(1).
The present model admits two exact ancillary statistics,
-2 -2 - -1 - -
xl + x 2 ' z - tan (x2/xl)'
which together form a maximal exact ancillary. They are
asymptotically equivalent to v 2 and v3 ' respectively. Hence, it is
··
useful to con d ition on t h e curvature d 1rect10n anC1·11 ary xl
-2 + x-2 '
2
but it is not so useful in the asymptotic sense to condition on -z -
tan -1 (x- 2 /x- l ).
231

Example 7.3. Basu's multinomial distribution


Let x be a discrete random variable taking value i (i = 1, 2, 3,
4) with probability Pi' where Pi is specified by an unknown parameter
u as
1 1 1 1
Pl = 6"(1 - u), P2 = 6"(1 + u), P3 = 6"(2 - u), P4 = 6"(2 + u) .
The distribution is written as

p(x, u) = 2t=1 0i(x)Pi(u),


where 0i(x) = 1 when x = i and otherwise 0i(x) = O. The family M =
{p(x, u)} is a (3, l)-curved exponential family, because a family of
multinomial distributions is of exponential type. Let NX i be the
number that x = i is observed, where N is the number of total
observations. Then x= (xl' x 2 ' x 3 ) is a sufficient statistic.
Let ft be the m.l.e. Then, we have

where
1
ni(u, 0) = ~[l - u, 1 + u, 2 - ul
and
-1
v Xl + x- 2 - 1/3,
-2
v 1/2,
if we choose BKi suitably. Each of v-1 and v-2 is exact ancillary,
while v = (v l , v 2 ) is only asymptotically ancillary. The curvature
direction component of v is given by
t = -3(2 + ft2)v l + Bftv2.
This shows that we should condition on this asymptotic ancillary
rather than on the exact ancillary v-1 or v-2 . The effect of
-1 -2
conditioning on t is equivalent to conditioning on v = (v , v ).

7.3. Pooling independent observations


When there exist two independent samples from the same distribution,
where each sample consists of N independent observations x(l) l'
...... x(l)N; x(2)1'····· ,x(2)N' we can pool all these 2N independent
232

observations together into one sample and then. we can perform an


optimal statistical inference based on them. In order to obtain the
sufficient statistic

x = 2~ (~x(l)j + ~x(2)j)
for all the 2N observations. it is not necessary to retain all of
them but it suffices only to retain the sufficient statistics xl and
x 2 of the two samples.
- 1 \' i
Xi = 1f LX(i)j' 1.2
because x is given by
- 1 -
x = Z(x l + x- 2 ).
Indeed there is no loss of information even if we summarize

x(i)l.···. x(i)N into However. if we summarize a sample of N


independent observations into the third-order efficient estimator.
m.l.e. 0i (i = 1. 2). we cannot obtain a third-order efficient 0 for
2N observations from the two third-order efficient 0 1 and O2 , This
is because 0i is not second-order sufficient. loosing some
information of order 1. If we retain 0i and ti (i = 1.2). where we
suppressed the indices a or a. b in O~ and t iab for brevity's sake.
it is expected that the third-order efficient 0 can be obtained
therefrom. This shows the usefulness of the curvature-direction
components of the ancillary.
The sufficient statistic x for N independent observations

Xl' . . .• ~ is equivalently represented by (0. ~). Let "R be the


projection of " to the exponential-curvature directions Til) .
Obviously. "R can be reconstructed from tab' because of (7.20). Let
us consider the point x' of S in the n-coordinate system that is
given by

x' = n(O. ~R)'

This is the point reconstructed from O. ~R and is different from the


original x = n(O.~) reconstructed from O.~. However. this x' is
second-order sufficient. We can obtain 0 and ~R from x' .
233

Given the second-order sufficient statistics xi = n(ft l , ~Rl) and


xi n(ft 2 , ~R2)from two independentsamples, we can construct a pooled
statistic
- 1 -, -
+ x2)
I
x~ = 2(X l

by taking simply the average of xi and x2' More generally, when


there are k independent samples of sample size Ni' we can construct a
pooled statistic
N (7.25)

Then, we can prove that the pooled statistic x~ is second-order


sufficient for all the pooled observations. Indeed x; coincides up
to the necessary order with x' from all the observations. Hence, we
can carry out higher-order statistical inference based on the pooled
x~. We show this in the following.

Theorem 7.7. The pooled statistic x~ is second-order sufficient


for all the observations.

Proof. We first derive some useful relations for a pooled sample of


N observations. By decomposing the ancillary v into the sum of the
curvature-direction component vR and the component orthogonal to it,
v = vR + Vo
we have
H (e) V-K = 0
abK 0 .
By substituting the decomposition in the expansion for the m.l.e.ft
n;(ft,

0) + BK~. (ft)~K

we have
-
Xi
= x!~ + N- l / 2 BK~.v0K _ N-la a BKi u-a_K
Vo + 0 p (N- / )
3 2

where xi are the components of x! Hence,


Na a ~(x, u) = /NBix~
a •
+ O(N- l / 2 ), (7.26)
234

where gaK = 0 and H(e)v K


abK 0 = 0 are taken into account and
xi = ~{xi - ni(u,O)}.
When there are k independent samples of sample size Ni , the score

function dat(x(l)l, ... ,x(k)Nk; u) of all the N = I Nj observations is


given by
,k -
L.j=lNjdat(x j , u) = B;(u){I~=lNj(x(j)i - ni(u, O»} + O(N- l / 2 )
= NB;(u){x~i - ni(u, O)} + O(N- l / 2 )
where x(j)i is the i-th component of the statistic x~ for the j-th
sample and xpi is the i-th component of the pooled x~. The loss of
information of x~ is hence given by the expectation of the

conditional covariance of the above score function as

gab(X~) = O(N- l ),
proving that x~ is second-order sufficient.

The third-order efficient estimator 0 and the ancillary tab for

the pooled samples are calculated from x~. We give its explicit form

first in the case with two samples of the equal size.

Theorem 7. 8 . The third-order efficient estimator 0 and the


ancillary fab of the entire sample are given, respectively, from
those of the divided samples by
Oa = i(o! + O~) + ~{r ~~)adcab - gab (t lbc - t 2bc )ac}
+ O(N- 3 / 2 ), (7.27)
(7.28)

where

Proof. The m.l.e. 0 based on all the pooled samples is given by

x- = nCO, ~),

where
-
x (1/2)(x l + x2)' so that
1
n(O,~) = Z{n(Ol' ~l) + n(02' ~2)}· (7.29)
235

By using the expansion

ni ( 0. ~ ) = ni ( ) + Baiu-a + BKiv
u.O
-K 1 -a-b
+ ~abBaiu u
+ "a
'\ BK~.iiav- K + •••

and similar expansions of n(Ol' ~l) and n(OZ' ~Z). we have the
first-order equation

Baiu a + BKivK = tBai(ui + u~) + !BKi(V l + v Z)+ 0p(N- l / Z).


This gives the first-order approximation of ii a and vK•
-a 1 -a -a -K 1 -K -K
U = 2(u l + uZ). v = Z(v l + vZ)
and hence
1
tab = t<tlab + t Zab )·
In order to obtain the second-order approximation. we put

ua = t(ui + u~) + N- l / Z6a + O(N- l )


and substitute it in the expanded form of (7.Z9). By multiplying Bbi
to the both sides. we have
~ _ l r(m){Z_b_c + Z_b-c (_b + - b)(_c + -c)}
U a - 8 bca ulul uzuz - u l Uz ul Uz
1 H(e){Z-b_K + Z-b-K _ (_b + _b)(_K + -K)} + O(N-l/Z)
- 4 abK ulv l uzv z ul Uz v l Vz
_ 1 r(m)(_b _b)(_c -c) 1 H(e)(-b -b)(-K _K)
-"8 bca u l - Uz u l - Uz - 4" abK u l - Uz v l - V z
+ O(N- l / Z).
Hence

6a ~r~~)adddc
= - ~gab(i'lbc - i'Zbc)d c + O(N- l / Z).
where d = INd. l' = !Nt. This gives (7. Z7) . In the above solution

process. the same ~a and tab are obtained if we replace BKiv l and

BKiVZ by BKiv~l and BKiv~Z' respectively. Hence. the same m.l.e. °


is obtained from (xi + xZ)/Z instead of x up to the necessary order.

It is easy to generalize the above result. Let us consider k

independent samples (x(i)l' ... x(i)Ni) (i = 1 •...• k) consisting of


Ni independent observations. Let (Oi' t iab ) be the second-order
sufficient statistics of the i-th sample. We define the following k
matrices (i = 1 •...• k)
(7.30)
236

where
(7.31)

It should be noted that Giab is different from the observed Fisher


information. Moreover, we define

Theorem 7.9. The estimator


Gab[~G. fi7]
fia =
L 1bc 1

is a third-order efficient estimator from the pooled samples. It is


equal to the u-coordinates of the point

We have the following relation by neglecting higher-order terms


ab a a a
G Gibc = (Ni/N)(oc + Eic - EC)'
where
a _ ab{lr(m)(~d ~.d) ~ }
Eic - g 2 bcd ui - U - Libc '
a ~ i a
EC = L (Ni/N)E ic '
Hence, (7.32) can be represented as
fia = L (Ni/N){fi~ + (£i~ - E~)fi~}. (7.33)

7.4. Complete decomposition of information


We see that, among the sufficient statistic x- or equiva1etnt1y
(fi , ~), the m.1.e. fi retains all the information of order N, losing
only that of order 1 which the maximal ancillary ~ retains
conditionally on fi. Moreover, among all the components of ~ its
curvature-direction components ~R retain all the remaining
conditional information of order 1, losing only that of order N- 1 .
It is possible to continue these processes of decomposing the
sufficient statistic x into a series of statistics (fi, ~R"")

according as their orders of the amounts of conditional information.


237

To carry out the decomposition. we define higher-order curvatures and


higher-order curvature directions.
The tangent space Tu(S) of S at e(u) is decomposed as
Tu (S) = Tu (M)Ef) Tu (A) •
where Tu(M) is the tangent space of M spanned by m vectors
i
aa = Baa i • a = 1 •...• m
and Tu(A) is the tangent space of the orthogonal ancillary family A
associated with the m.l.e. ft. Tu(A) is spanned by n - m vectors
K =m+ 1 •...• n.
The intrinsic exponential change in the tangent directions of M as u
moves is represented by the exponential curvature. Let us consider
the exponential covariant derivative of ab in the direction aa
v(e)a = v(e)(Ria.) = (a Ri)a.
a b a -b ~ a-b 1.'
where v~e) is the abbreviation of v~:) and v(e)a.
a 1.
= 0 is taken into

account. The proj ection of the above vectors to Tu (A) which is


orthogonal to Tu(M). gives the exponential curvature. They are given
by the vectors
a.b 1 ..... m
where
H(e) = <V(e)a a >.
abK a b' K
These vectors span the subspace Til) of the curvature directions.
The second covariant derivative of aa
v(e)v(e)a = (a a Bi)a.
b c a b c a 1.
shows how the curvature directions change as the point u moves.
However. this vector includes components belonging to Tu(M) and Til).
too. Let the projection of v~e)v~e)aa to the subspace orthogonal to
both Til) and Tu(M) be
(2) _ H(e)K a
e abc - abc K'
We call the components H~~~K of this projected vector e~~~ the second
curvature. Let Ti 2 ) be the subspace spanned by the vectors e!~~
(a. b. c = 1..... m). which represents the directions of the second
238

curvature.
We can further define the p-th curvature of M in a similar
manner. Let

,
(e) (e)
V........ Va d a
~,

be the p-th order covariant derivative of d a . This vector can be


decompoeed into the sum of the components belonging to Tu(M), T1 l ),
... , TA (p-l) and the components orthogonal to all of them. We denote
the orthogonal component by
(e) K
Hal. . "a,a oK'
and call it the p-th curvature of M. We denote by T1p ) the subspace
spanned by these vectors. The subspace T1p ) represents the
directions of the p-th curvature of M.
The tangent space Tu(A), which is the orthogonal complement of
Tu (M) is thus decomposed into the direct sum of the orthogonal
subspaces of the curvature directions
Tu (A) = T1 l )® T2){f)· .. ,
where the sum is finite, because the number of dimensions of Tu(A) is
(n - m). By decomposing the sufficient statistic x into (fi, ~) by
xi ni(fi,~) = ni(fi, 0) + BKi(fi)~K,
where fi is the m.l.e. and a linear coordinate system v is adopted in
the mixture-flat Tu (A), the approximate ancillary statistic ~ is
considered to be a vector ~ = ~KoK belonging to Tfi(A) of A(fi). The

vector ~ is then decomposed into the sum of those belonging to the


higher-order curvature directions T1p ) (p = 1, 2, ... ). The quantity

indeed represents the p-th order curvature direction components of


~. The sufficient statistic x or (fi, ~) is thus decomposed into
the statistics

{fi, tab' t~~~, ... , ta(P)"'a


1
}.
P.l
Let us study the amounts of information carried by these statistics.
239

To this end, we define the following sets of statistics


R (P) = {"'(p)
La!. •• ·apb' a l , ... ,a p ' b = 1, ... , m}
and
S(l) {ft},
S(2) {ft, t(l)}

s(p) = {ft, t~l) ... , t(p-l)}.

Let us define the square of the p-th order exponential curvature


of M by
e,p 2 _ (e) (e) KA {a a b b}
(HM ) ab - Hal. ... a, aKHbj. .. ·bp bAg g l. .. pl. .. p ,
where g{a l ·· .apb l · .. b p } is the symmetrization of tensor galb l
g a 2g b 2 ... g a p b p multiplied by a cons tant depending on p. It is
defined by
g {al···ap bl··· bpI = _l_·E[-Bt·
(p! )2. u .. u-a, u_bi. ... u_bt ]
,
where u is a normal random variable with covariance matrix gab

Theorem 7.10. The set S(p) of statistics is asymptotically


sufficient of order p, with the information loss given by
~gab(S(P» = N-P+l(H~'P);b + O(N-P). (7.34)

Proof. From

nl..(u, 0) + BKl..(ft)~K},
and

we have

where
240

Aa(P) = [(-l)P/(p!)] t bJ. ... b, aU-bJ. ... u_b, .


The conditional covariance of aa~(x, u) conditioned on S(p) is
written as
Cov[aa~' ab~ 1 S(p)] = N-PCOV[A~P), A~P) 1 S(p)] + O(N-p-l),
because all of A~O), ... , A~P-l) are fixed when S(p) is fixed. The
term A(P) is composed of
a
[( -l)P / (p!)] H(e) (l1)~KiibJ. ... ab,. .
bi . . . b, aK
Hence, by calculating the expectation of its conditional covariance,
we have (7.34).

The statistic x- is decomposed into 11 and ~, or more finely into


11, The information which x carries is
decomposed as
gab (X) = gab(s(l» + gab (R(l) Is(l» + gab(R(2) Is(2» + ...
The conditional information of R(P) conditioned on S(p) is given by
gab(S(p+l» - gab(S(P»
bgab(S(p» - bgab(S(p+l»
N-p+l(H~'P);b + O(N-P).
Hence, the p-th order curvature direction components R(p) of ~ carry
the full information of order N-p+l conditioned on S(p) which carry
all the amount of information up to O(N-p+2).

Theorem 7.11. The information gab(X) is decomposed as


gab (X) gab(U) + L N-P+l(H~'P);b' (7.35)
where the term of order N-p+l is the amount of conditional
information carried by R(P).

We have studied the m.l.e. and the related ancillary~. The


theory is essentially based on the expansion
aa~(x, u) = aa~(x, 11) + L(;pa~ .. .aa; (x, l1)u aj. uap
where aa ~(x, 11) = 0 for the m.l. e. 11. For an arbitrary efficient
241

estimator ft, we have the expansion of aai(x, ft) in terms of u and v


including the higher-order mixture curvatures of A(ft) , too. Then, we
can expand gab (X) - gab dh in a series of N- l . The result is a
higher-order extension of theorem 7.11. However. the result is
complicated and we do not mention about it here.
We have one final remark for readers who are familiar with
tensor caluculus. The covariant derivative of aa' e.g .•
i
vaa b = (aaBb)a i
is a vector for fixed (a, b), but Hab i a a Bi
b
is not a tensor having
three indices a. band i. Indeed, if we change the coordinate system
of M from u a to another u a' Hab i does not behave as a tensor.
although B; is a tensor. In order to obtain a tensor by covariantly
differentiating B;, we need to regard B; as the components of the
quantity Bia.
a 1 aa and operate the covariant derivative Vb in the
following manner,

+Bia.Va a
a 1 b
a
Here. the derivative vba should be restricted to Tu (M) by the use of
the induced covariant derivative V of M. We then have the following
tensorial rule of covariant derivative
V Bi = a Bi + r~ BkBj _ r C Bi
b a b a Jk a b b a c·
We may use the covariant derivatives Vb.t ... Vbp B;. which are tensors,
obtained in this manner in defining the higher-order curvature
tensors. The components orthogonal to Tu(M). Til) •...• Tip - l ) define
K
the higher-order curvature tensor Ha1 ... B.t b .

7.5. Notes
Fisher information is a fundamental quantity
representing potential quality of a statistic t(x) which is to be
used for statistical inference. It represents how well statistical
data is summarized in it (see Efron, 1982). We have shown that the
242

m.l.e. is one of the asymptotically best information summarizers of


the first and second order from the geometrical point of view. The
same result was derived by Rao(1962), Efron(1975), etc. The m.l.e.
is at the same time the first-order (and hence second-orde~

efficient estimator from the point of view of the mean square error
or of the covariance. However, it is not necessarily third-order
efficient from the mean square error point of view. This is because
the mean square error is not an invariant quantity, except for the
first-order term which is given by the inverse of the first-order
term of the Fisher information of Q, while the Fisher information is
a tensor and hence is an invariant. The mean square error includes
terms depending on the manner of parametrization. However, we can
always construct the third-order efficient estimator from the m.l.e.
Q by making one-step bias correction, which depends on the manner of
parametrization. Moreover, we can obtain the third-order term of
the mean square error of an efficient estimator from the
second-order term of the amount of information contained in it. In
this sense, information is a more fundamental quantity.
The optimality of the m.l. e. does not hold from the point of
view of the third-order term of information loss. The statistical
implications of the third- or higher-order terms of information loss
is not clear (cf. Rao et al., 1982). We have decomposed the amount
of information which i carries into the set of statistics {Q, t(l),
t(2), ... }, where r(p) carries conditionally all the information of
order Nl-P and its magnitude is given by the p-th order exponential
curvature of model M. Although
(e) ( ) K
tab = HabK Q~ = aaab~ x, Q) + gab(Q)
(-

which is the curvature-direction components of ~, is first-order


ancillary, the pair (Q, t) is second-order sufficient, where t
t(l). It is not because t is asymptotically ancillary but because it
carries all the conditional information of order I, that it is useful
243

for statistical inference including conditional inference. We have

already shown in Chapter 6 that we can construct the third-order


optimal test and the third-order optimal confidence interval
estimator in various senses from the pair (ft, t). We can reconstruct

the third-order efficient estimator for all the pooled data, if we


keep (ft i , t i ) of each sample from the same distribution. See Akahira

and Takeuchi (198lb) for the higher-order asymptotic theory of pooling


independently observed samples in various situations. we have also
shown that one should condition on t in conditional inference in the
asymptotic case not because it is asymptotic ancillary but because it
carries the maximum conditional information.

There are a number of interesting discussions concerning

theoretical aspects of statistics such as the likelihood principle,

the conditionality principle, etc. See e.g. Basu (1975), Cox and
Hinkley (1973), Hinkley (1981), Efron (1982). The usefulness of the
ancillary t or the observed Fisher information

- 0aobt(x, ft) = gab(ft) - tab


was suggested by Pierce (1975), Efron and Hinkley (1978), and a
number of researches followed concerning the asymptotic ancillarity

and asymptotic conditional inference. See Peers (1978),


Barndorff-Nielsen (1980), Cox (1980), Hinkley(1980) , Amari (1982 b),

Ryall (1981), McCullagh (1984a). Wei and Tsai (1983) treated


conditional inference in the presence of nuisance parameters. The

higher-order sufficiency was studied by Michel (1978) and Suzuki


(1978). Statistic t(p) is in fact constructed from the (p + l)st

derivative of the log likelihood t(x, ft).


8. STATISTICAL INFERENCE IN THE PRESENCE OF NUISANCE PARAMETERS

The present final chapter studies a geometrical


theory of statistical inference in the presence of
nuisance parameters. When we fix the value of the
parameter of interest while the nuisance parameters take
any values, we have a submanifold in the model M depending
on the fixed value. Statistical inference is carried out
concerning the family of these submanifolds, so that their
geometrical properties play an important role on
evaluating inferential procedures. We first search for
the possibility of reparametrizing the nuisance parameter
such that it is always orthogonal to the parameter of
interest. We next study the amount of information loss
caused by not knowing the true value of the nuisance
parameter. It is related to the mixture curvature of the
submanifold defined by the nuisance parameter.

8.1. Orthogonal parametrization and orthogonalized information


Let us consider a curved exponential family M = {q(x, u, z)}
parametrized by two vector parameters u and z, u being m-dimensional
and z being k-dimensional, so that M is an (n, m+k)-curved
exponential family. We sometimes have interest only in u and do not
care about the value of z. In such a case, u is called the parameter
of interest or the structural parameter, and z is called the nuisance
parameter. Statistical inference about u-part of the parameter (u,
z) is called the statistical inference in the presence of nuisance
parameters. Since u = const. defines a k-dimensional submanifold
Z (u) in M (Fig. 8.1) on which z takes on arbitrary values, the
statistical inference in the presence of nuisance parameter is
regarded as, for example, estimating the submanifold Z(u) on which the
245

Fig. 8.1

true (u, z) lies or testing the hypothesis HO that the true parameter
(u, z) is on Z(u O). We have already treated the problem of tests in
the presence of nuisance parameters in Chapter 6, so that we
concentrate here on the problem of estimation in the presence of
nuisance parameters.

The problem may take another form. It sometimes takes the form
of estimating tha value of m independent functions fi(w), i = 1,···,
m, where w is an (m + k)-dimensional parameter in (n, m + k)-curved
exponential family M {q(x, w)} Indeed, the set Z(u) of the
points w satisfying

constitutes a k-dimensional submanifold of M, provided the fi are


smooth. In this case, M is decomposed into a family of these Z(u)'s,

and we have interest in which Z(u) the true distribution lies, but no
interest in the relative position within Z(u). It is convenient to

introduce a new coordinate system (u, z) instead of w such that


ui = fi(w) , i = 1,···, m

hold. The remaining coodinates z are the nuisance parameters, and


246

they can be chosen arbitrarily, because they merely define a


coordinate system in each Z(u).
From now on, we denote the parameter of interest by u = (ua ), a
1, ... , m and denote the nuisance parameter by z = (zp), p = m +
1, ... , m + k, using indices p, q, r, etc. for quantities related to

the nuisance parameter. The model M is parametrized by (u, z), M


{q(x, u, z)}. The set of points whose u-coodinates are U o is denoted
by Z(uO)' This plays the role of the coordinate hypersurfaces for
the u-coordinates. To estimate u is to estimate the submanifold Z(u)
on which the true parameter lies, where we do not have interest in
the relative position z of the true parameter in Z(u). The nuisance
parameter z defines a coordinate system in Z(u), specifying the
position in Z(u), so that (u, z) completely determines a point of M.
We may take another coordinate system z' for each Z(u). This gives a
coordinate transformation of the form

,-___ z

Fig. 8.2a) Fig. 8.2b)

z' = h(z, u), (8.1)


where u is kept unchanged. More generally, the allowable coordinate
transformations are of the type u' = k(u), z' = h(z, u), which keeps
Z(u) invariant. Since we have no interest in the value or z of z' ,
the inference procedure should not be affected by this
247

transformation. Fig.S.2 explains the effect of a coordinate


transformation. The full lines show Z(u)'s on which the value of u

does not change but z changes. They can be regarded as the z-axes.
The u-axes are composed of those points on which the z-coordinates do
not change, as is shown by the dotted lines in Fig. S.2.a). They are

artificial, and if we choose another z'-coordinate system of the form

(S.l), the u-axes change as shown in Fig. S.2b). However, the Z(u)'s
are invariant. We give a simple example.

Example S.l. The set M of normal distributions N(~, 0 2 ), which

is an exponential family, is considered. When we have interest in

t h e mean ~ ·
an d not ~n t he .
var~ance 0 2 , we can put u = ~ an d z = o.
The z defines a coordinate system in each Z(u) shown by a vertial
line (Fig. 8.3a). The dotted lines, on which z takes constant
values, define the u-axis. However, we may choose

a
u=const. u=const.

- - - - - - - - - - - - - ~ z=cons t.

-------------7
~- -- .....
/
" .... -.... " , "~
/ / z=cons to
----1----------- /' ",;0 -_, \

I I '\
I I ~ \V

Fig. 8.3a) Fig. 8. 3b)

as another coordinate system in each Z(u). Then, the new u-axis is

the dotted line in Fig S.3b). The coordinate system (u, z') is

convenient in some case, because it is -l-affine. We may analyze


248

estimation procedures in either coordinate system. On the other


hand, when the variance 0 is of interest and the mean ~ is not, we
put u = 0, and the nuisance parameter is z = ~. or any other z' = h(u,
z) .

Let d = d(x) be an estimator. This defines a mapping from the

observed point f1 X to the parameter space {u}. The associated

ancillary family is
A(u) = ft-l(u) = {n I ft(n) = u} ,

where A(u) is an (n-m)-dimensional submanifold attached to a point u

in the parameter space of interest. In order that the estimator is

consistent, it is necessary and sufficient that A(u) includes Z(u),

i. e., any point n (u, z) is included in A(u). This can be proved

easily, because the observed point x- tends to n(u, z) in the

n-coordinates as the number N of observations tends to infinity,

Fig. 8.4

where (u, z) is the true parameter. Hence, when ft is consistent,


249

the associated A(u)'s are such that each A(u) includes Z(u) (Fig.

8.4) .
We introduce new variables v = (v K), K = m+k+l, ... , n, such that
(z, v) is a coordinate system of A(u) and that the origin v = 0 is

put at Z(u). Then, the triple (u, z, v) constitutes a coordinate

system associated with the estimator, n = n(u, z, v).

The tangent space T( u,z )(5) of the whole space 5 at point (u, z)e M
is spanned by three kinds of vectors {d a , dp ' d K},
i
da Badi' a = 1, ... , m,
i
dp Bpd i , P m+l, ... , m+k,
i
dK BKd i , K m+k+l, ... , n.
Here, d K =d/dV K denote the tangent directions along the coordinates

v K, and are tangent to A(u). The vectors dp d/dZ P , which are along
the coordinates zP, span the tangent space of Z(u). Hence,{d K, dp }
together span the tangent space of A(u). The vectors d a = a/dU a are
along the coordinate curves u a on which zP are fixed. Hence, the
tangent space of the model M is spanned by {d p ' d a } (see Fig. 8.4).

When A(u) is orthogonal to M, we can choose a v-coordinate

system such that d K are orthgonal to both da and dp '


d > 0,
K

o.
As will be soon shown, an estimator is efficient, when and only when

A(u) is orthogonal to M. Hence, we may assume that gaK = gpK = 0,


when we treat efficient estimators. The inner products among da and

dp '

gab = <d a , db>' gap = <d a , dp > , gpq


together give the Fisher information matrix

of M, which denotes the amount of information in estimating the

parameters u and z jointly. However, when we have interests in


250

estimationg only u, the quantity gab = <aa' ab> does not represent
the amount of information available in estimating u. Indeed, this
gab depends on the manner of parametrization z of the nuisance
parameter or the coordinate system z in each Z (u), which can be
chosen arbitrarily. By the coordinate transformation (8.1) from z to
z' the vectors aa and ap change into a' and a' by
a p
a = a' + HPa' a p Hqa' (8.2)
a a a p p q'
or
a' = a - H,qa ' a' H,qa
a a a q , p p q'
respectively, where
H~ = ahP(u, z)/aua , z)/az p ,

H'~ = (H-l)~, H'~


This shows that {a~} again spans the tangent space of Z(u) but the
directions spanned by {a~} change depending on the manner of
parametrization of z. The inner products of a' and a' are given by
a p

Fig. 8.5

g
ab
+ g pq H'PH,q
a b
- g H'P
pa b
a' a'> g H,q - g H,rH,q
< a' p aq p qr a p'
<a' a'> g H,rH,s (8.3)
p' q rs p q'
How do we define the amount of information in the presence of
nuisance parameters? In order to answer this question, we decompose
the vectors aa in two components. One is the component tangential to
Z(u) given by a linear combination of a p . The other is the component
orthogonal to Z(u). The part which is tangential to Z(u) is given by
251

<d a , dp>gPqdq = (gapgPq)d q ,


where gpq is the inverse matrix of gqp' and hence the part orthogonal
to Z(u) is given by
3a da - gapgPqdq. (8.4)
Obviously <3 a , dp > = 0 holds (Fig. 8.5). The orthogonal vector 3a or

the corresponding random variable 3a t(X, u, z) does not include any

components in the directions of dp or dpt(X, u, z). Hence, it is


responsible only for changes in the value of the parameter u of
interest and is not responsible for changes in the value of the
nuisance parameter. Moreover, it is invariant under the
reparametrization (8.1) of the nuisance parameter,

3a = 3~
as can easily be shown from (8.3) and (8.4). The inner products of
these orthogonalized 3a give an invariant tensor

gab = <3 a , 3b > = gab - gapgbqgPq (8.5)


which is called the orthogonalized Fisher information. This plays
the role of the Fisher information in the presence of nuisance

parameters. It is invariant under the parameter transformation

(8.1) .

When and only when gap = < d a' dp > = 0, the orthogonalized
information coincides with the Fisher information,

gab = gab·
However, in general,
g-ab ,g
...... ab ,
-
where g-~ is the inverse of gab' hold in the sense of the positive
semi-definiteness.
Since the inverse matrix g-ab of the orthogonalized information
-
gba is the (b, a) -component of the inverse of the total Fisher
information matrix
252

of M, g-ab.
g~ves the asymptot~c . .
covar~ance 0f any e ff'~c~ent
. estimator
O. We show this in the following.

When N independent observations are given, we can decompose the


observed point n = x into the three statistics 0, 2 and ~ by
x= n(O, 2, ~),

where (u, z, v) are the new coordinates associated with the ancillary

family A(u). When the true parameters are (u, z), we can obtain the
Edgeworth expansion of the joint distribution of
ii = IN(O - u), z = 1N(2 - z), v = IN~

or of their bias corrected version ii , Z , * * v* , in the same manner as


we obtained (4.40) or (4.34). In particular, the first-order term of

the covariance matrix of w= (ii, Z, v) is given by gas, that is the

inverse of gaS' where indices a and S stand for a triplet of indices


(a, p, K). Therefore, the covariance matrix of (ii, z) is minimized,
when and only when A(u) is orthogonal to M. In this case we can
K • -ab
choose v such that gaK = gpK = 0 holds. S~nce g is the (a, b)
component of the inverse gaS of the matrix gSa in this case, we

obtain the following well-known theorem which shows the validity of

using the orthogonalized Fisher information gab in the case with


nuisance parameters.

Theorem 8.1. An estimator is consistent, if and only if the


associated A(u) includes Z(u). It is first-order efficient, when and

only when A(u) is orthogonal to M. The first-order term of the

covariance matrix of an efficient estimator in the presence of


nuisnace parameters is given by the inverse gab of the orthogonalized

Fisher information
-
gab' A first-order efficient estimator is
second-order efficient.

The m.l.e. 0 is given in the presence of nuisance parameter by


solving the simaltaneous likelihood equations,
253

dai(X. ft, 2) = 0, dpi(X, ft, 2) = O.


It is the u-part of the entire m.l.e. (ft, 2). The associated A(u) is
orthogonal to M and hence the m.l.e. is efficient.
When the parameter u of interest is specified and a family of
Z(u)'s are given, we may choose any parametrization z of the nuisance
parameter by (8.1) or we may introduce any coordinate sys tem z in
each A(u). It is convenient, if possible, to choose z in each Z(u)
such that da and dp are always orthogonal, gap (u, z) = 0 at all (u,
z). We call such a coordinate system an orthogonal parametrization.
- reduces to the Fisher information,
The orthogonalized information gab
and aa da holds in this special coordinate system. There always
exists a coordinate system such that gap (and dbgap) vanish at a
specified one point (u O' zO)' However, unfortunately, an orthogonal
coordinate system for which gap(u, z) = 0 at all (u, z) does not in
general exist, except for the case when u is a scalar parameter. We
prove this by showing a necessary and sufficient condition for the
existence of an orthogonal parametrization.
When Z(u) is given, its tangent space T(Z) is spanned by
vectors dp in terms of a coordinate system (u, z). At each point (u,
z) E M, we define the vector space T(U) consisiting of the vectors
that are tangential to M and are orthogonal to T(Z), i. e., T(U) is
the orthogonal complement of T(Z) in T(M). Obviously, T(U) is
spanned by m orthogonalized vectors aa' and
T(M) = T(Z)$T(U).
The vector fields 3a (a = 1, ... , m) define the orthogonal
directions T(U). If there exists an orthogonal coordinate (u, z'),
the tangent directions d~ of the coordinate hyperplane defined by z'
= const. are always orthogonal to T(Z). Hence, the tangent space of
the submanifold z' =const. coinciedes with T(U) spanned by 3a . Thus,
the problem of obtaining an orthogonal parametrization is to search
for a family of m-dimensional submanifolds z' = const. such that
254

their tangent spaces are spanned by m vectors 3 a . Such a submanifold

is called the integral submanifold of given m vector fields 3 a . It


is known that a family of integral submanifolds exist, when and only
when the Lie algebra generated by 3a is closed. The Lie algebra is

said to be closed. if the vectors generated by the Lie bracket,

(8.6)

are linear combinations of 3 c ' that is, there exist Sab c such that

[3 a , 3b l = Sab c3 c'
or
<[3 a , 3b l, dp > = O.

Let (u, z) be an arbitrary parametrization. Then, the

orthogonalized vector is given by

3a = da - g~dp'
where
p qp
ga = gaqg .
By calculating the Lie bracket, we have

[3 a' 3b 1 2{d[bg~l + (dpgq[b) g~l}dq'


where the bracket 1 denotes the alternation of indices as, for
example,

2 d[bg~l = dbg~ - dag~·


Obviously, when u is a scalar parameter, [3 a , 3b l o always holds.

Theorem 8.2. There exists an orthogonal parametrization, when

and only when

d[bg~l + (dpg[~)g~l o (8.7)

holds.

Collorary. When u is a scalar parameter, there always exists an

orthogonal parametrization.

When the condition (8.7) is satisfied, we can obtain an


255

orthogonal parameter (u, z') from a given parametrization (u, z) by


the following transformation
z,p = hP(u, z).

The transformation is obtained as follows. From


' =<d'a' d'>=O,
g ap p
we have the differential equation

dahq(u, z) = g~dphq(u, z). (8.8)


The equation (8.7) is the integrability condition of this partial
differential equation. When u is a scalar, (8.8) is always
integrable.

8.2. Higher-order efficiency of estimators


The higher-order terms of the covariance of a bias-corrected
first-order efficient estimator (/ are derived in the case with
nuisance parameter. When 0 is first-order efficient, the associated
ancillary A(u) includes Z(u). Hence, A(u) is not free from some
curvature inherited from that of Z(u). In order to evaluate the
higher-order covariances, it is convenient to use such a
parametrization that the coordinates (u, z, v) are mutually
orthogonal at the true parameter (u O' zo' 0), i,e., da , dp and d K
are mutually orthogonal at this point. Then, da = aa and gab = gab
at this point. We obtain the Edgeworth expansion of the

distribution of ij* from the Edgeworth expansion of p (w* ; u o ' zO),


where w* = (ij* , Z* , v),
* by integrating it with respect to Z* and
~
v. Since the Edgeworth expansion of p(w* ) is the same as in (4.34)
except that indices a, 8, y, etc. stand for triplets (a, p, K) etc.
in the present case, we see that the Edgeworth expansion of p (ij* ;
u O' zO) is also given by (5.5), if each term is adequately
reinterpreted. In particular, the covariance matrix of ij* is given
by
E[ij*aij*b] = gab + (1/2)N- l (C 2 )ab + O(N-2),
where the covariant (lower index) form of (C 2 )ab is
256

C2 C C ay So
ab aSa yob g g

C C ce df + 2C C cd pq
cda efb g g cpa dqb g g

+ 2CcKaCdAbgcdgKA + 2CpKaCqAb gpqgKA

+ C C pr qs + C C KV A~
pqa rsb g g KAa v~bg g ,
consisting of six non-negative terms. Each term is interpreted
geometrically as follows.
Ccdar(m)
= C = H(m)
cda' pqa pqa'
so that the squares of the terms
(r m)2 r(m)r(m) ce df (8.9)
ab cda efb g g ,

(8.10)

H(m)H(m)gKV AV (8.11)
KAa v~b g ,

It can easily be proved that


H(e) C
caK' pKa
hold. Hence,
e 2 (8.12)
(HU,V)ab
is the square of the oK directions of the exponential curvature of 3a
directions in M. The tensor
e 2 (8.13)
(HU , Z , V)ab
is the square of the cross components of the exponential curvature
H(e) of M, which represents a kind of twist along Z(u) of M, since
apK
H(e) = <v(e)o d >
paK op a' K

represents how the normal directions 3a of Z(u) change in the


directions oK orthogonal to M as the point moves in the directions dp
along Z(u). The last term is denoted by
257

(Hue z)a2b = r(m)r(m) gCdgpq (8.14)


cpa dqb
rhe qUantity' r~~~ behaves as a tensor under the transformations
(8.1), and hence has an invariant meaning. If we choose such a
parametrization that da gbp = 0 at (uO,zO) in addition to gbq = 0, we
have
r(m) = _ H(e).
cpa cap
~ence, it represents the dp-directions of the exponential curvature
)f the u-axes in M.
Among the six non-negative quantities, only (Hm 2
A) ab depends on
the ancillary family A or the estimator ft, and this vanishes for the
n.l.e. The square (rm);b of the mixture connection depends only on
the manner of parametrization u of the parameter of interest. All
the other terms are common to all the efficient estimators, because
they are determined from the geometrical structure of M and Z(u). We
thus have the following theorem.

Theorem. 8.3. The mean square error of a bias corrected


~fficient estimator ft* in the presence of nuisance parameters is
~iven by
-ab + ~{(rm)2ab + 2(He )2ab + 2(He )2ab
g 2N U, Z U, Z , V

+ 2(H e )2ab + (Hm


z )2ab + (Hm
A)2ab} + O(N-2).
U,V
(8.15)
\n efficient estimator is automatically second-order efficient, and
is third-order efficient when, and only when, the mixture curvature
i~~~ of the related A(u) vanishes on M. The m.l.e. is third-order
~fficient.

3.3. The amount of information carried by knowledge on nuisance


Jarameter
We have studied the characteristics of estimators when we do not
~now the value of the nuisance parameter. If we know the true value
258

Zo of the nuisance parameter, we can obtain a better estimator by


using this knowledge. By comparing the covariances of the efficient
estimators in the both cases, we can define the amount of information
which knowledge on the nuisance parameters carries. In other words,
this is the amount of information lost by intervention of the
nuisance parameter. When the value Zo is known in a parametrization
(u, z), the statistical model reduces from (m+k)-dimensional M =
q (x, u, z) with two unknown parameters (u, z) to the simplified
m-dimensional model

M(zO) = {q(x, u, zO)},


where Zo is fixed to the known value. The model M(zO) depends on
the parametrization in which the value of z is known. Since the
model M(zO) includes no nuisance parameter, the ancillary family
A(u) associated with an efficient estimator is orthogonal to M(zO)'
However, the restriction that the A(u) includes Z(u) is not
necessary in this case. The Fisher information of this reduced

model is given by gab(u, zO), which in general is larger than the

orthogonalized gab(u, zO),

gab (u, zO) ~ gab (u, zO)·


The asymptotic covariance of an efficient estimator is g ab( u, zO)
when we know zo' while it is gab when we do not. The difference
shows the gain of information obtained by knowing the value zo of the
true nuisance parameter. When the number of observations is N, the
gain is
lIg ab = N(gab - gab)
N
gapgbqg
pq (8.16 )

which vanishes when <(la' (l > 0, i.e. , when the nuisance parameter
p
is orthogonal to the parameter of interest. This implies that. when
they are not orthogonal, knowledge on the nuisance parameter carries
much information.
In many cases, the nuisance parameter z is orthogonal to the
parameter of interest, even when we can utilize its knowledge zo' so
259

-
that gab - gab = O. Then, what is the amount of information by
knowing the true value Zo of the nuisance parameter in this case?
Let gab(U) be the amount of information carried by the third-order
efficient estimator a (e.g. the m.l.e.) when we have no knowledge on
the nuisance parameters, and let gab(U) be the one when we know its
true value zo. (They are calculated in the next section.) The
amount L\gab of information which the knowledge on the nuisance
parameter carries can be defined by the difference
(8.17)
When the nuisance parameter is not orthogonal, this definition
coincides with the previous one (8.16) except for the higher-order
term. When the nuisance parameter is orthogonal, i. e., gab = gab or
5ap = 0, the knowledge that z = Zo still carries some information of
)rder 1.
Since the amount gab(U) or gab(U) of information included in an
~fficient a is related to the covariance matrix of the bias-corrected
~stimator a* by (7.11), we can calculate L\gab from the covariance
natrix of the m.l.e. 's when Zo is known and that when it is unknown.
Vhen Zo is known, the model reduces to M(zO)' and the ancillary
:amily A(u) associated with the third-order efficient estimator (the
~.l.e.) is orthogonal to M(zO) and is mixture-flat. This A(u) cannot
~nclude Z(u), unless Z(u) is mixture-flat. The third-order terms of
:he covariance of a bias-corrected efficient estimator a* in the
wdel M(zO) is the sum of the three non-negative terms as is given by
:5.11). The term (rm)2 is the same, but the ancillary directions for
:he model M(zO) are decomposed into mutually orthogonal '\ and d p
lirections in the present case. Therefore, for indices K and L,
:tanding for (p, K) and (q, A) of pairs of indices, the square of the
!xponential curvature of M(zO) is decomposed as
(H~)2 H~~H~~gKMgMN = H~~~ H~~~gKAgcd
(H~ , V);b + (H~ , Z);b
260

in the present case. These terms also appear in the case with the
unknown nuisance parameter. The square of the mixture curvature,
which is decomposed as
H~~H~~gKMgLN

H(m)H(m)gKVgA~ + H(m)H(m)gpqgrs+ 2H(m)H(m)gKA g pq


KAa Wb pra qsb Kpa Aqb

( m2 m 2 e 2
HA)ab+ (HU,Z)ab + 2(H U,Z,V)ab' -
vanishes by choosing a mixture flat A when Zo is known. When Zo is
unknown, only the firs t term (H~);b vanishes for the third-order
efficient estimator, because of the restriction that A(u) includes

Z(u). The term (H~,Z);b represents the square of the mixture


curvature of Z(u) in M, and the term (H~,Z,V);b represents the square
of a kind of twist of Z(u) in M. These terms also appeared when we
calculated the power loss of a test in the presence of the
nuisance parameters.

Theorem 8.4. The amount of information carried by knowledge on


the true value Zo of the nuisance parameter is given by

6g ab Ngapgbqgpq + 0(1)
when the nuisnace parameter is not orthogonal, and
e 21m 2 -1
6g ab = (HU,Z,V)ab + Z(HU,Z)ab + O(N ),
when it is orthogonal.

Example 8.2. We again use the set M ={ N(~, 02)} of the normal
dis tributions in Example 8.1, although it is very special in the
sense that M itself is an exponential family so that S = M and there
are no v-directions orthogonal to M. Let u = ~ be the parameter of
interest, and we first assume that we know the value of the nuisance
parameter z = 0, say z = z00 The metric in this (u, z) coordinate
system is already given in Example 2.2 as
2 o.
gab = 1/0 gpq = 2/0 2 gap
261

Hence, M(zO) in Fig.8.3 a) is orthogonal to Z(u), and the first-order


covariance of the efficient estimator ft is the same, even if we can
make use of the knowledge 0 Moreover, the Z(u) is
mixture-flat, H~~~ = 0, in this special case as we showed in Example
2.5. Hence, there is no-third order gain 6g ab = 0, even when we know
z zOo Consider another situation. When we know, not the
variance 0 2 but the mean square of the random variable, z l + 0 2

zo' then the model M(zO) is a curved line as is shown in the dotted
line in Fig. 8.3 b). The metric tensor in the coordinate system (u,
z) , u = ~, z = ~
2+ 02, .
1S a 1 so given in Example 2.2 as
2
gab = (2~ + 0 2 )/0 , 4
gap = -~ 4 /0 , gpq = 1/(20 4 ).
The orthogonalized information is gab = 1/0 2 . Hence, when we make
use of the knowledge z = zO' the asymptotic variance decreases from
- -1 2 -1 2 2 2
(gab) = 0 to (gab) = 0 /(2~ + 0). The knowledge brings an
amount
6g ab = 2N~2/04 + 0(1)
of information.
Consider the third case, where the parameter of interest is the

variance, u = 0 , and z = ~ is the nuisance parameter. Since gap = 0


in the (u, z) coordinate system, there is no first-order information
gain by knowing ~ = zo or the model M(zO)' which is the vertical full
line in Fig 8.3a). The mixture curvature H~~~ of Z(u) (the dotted
line in Fig. 8.3a) is not zero but H~~~ = 2/0 3 , and hence (H~);b
H(m)H(m)gprgqs = 20 2 Therefore, the knowledge " = zo makes the
pqa r s b · ~

variance of the m.l.e. ft* decrease by 0 2 /N, and 6g ab = 02 .

8.4. Asymptotic sufficiency and ancillarity


The amount of information which a statistic t = t (x) carries
with respect to u in the presence of nuisance parameters is measured
by the orthogonalized information matrix

(8.18)
262

Similarly, the average conditional information of t(x) conditioned on


s(x) is defined in the present case by
(8.19)
where gab(T, S) is the amount of information which t and s together
carry. The loss of information caused by retaining t instead of
keeping the original x is written as
llgab(T) = gab (X) - gab(T) = gab(X I T) (8.20)

in the case with nuisance parameter.


By using the above definitions in terms of gab instead of gab'
we can define the concept of asymptotic sufficiency and asymptotic
ancillarity in the presence of nuisance parameters. A statistic t
which summarizes N independent observations xl, ... ,xN is said to be
asymptotically sufficient of order q, when
llgab(T) = O(N-q+l)
holds.
Let t be a statistic which is sufficient of order q for the
entire parameters (u, z),
llgAB(T) = O(Nl-q),
where suffices A, B, etc. stand for paris (a, p), etc., as for
example,

Then, it is sufficient of order q for the parameter u


llgab(T) = O(Nl-q).
This can be shown as follows. From
gAB(T) = gAB - llgAB'
where gAB = gAB (X) is of order N, we have

except for higher-order terms. Hence, the loss of information is


263

evaluated by
tlgab(T) (8.21)
In particular, when u and z are orthogonal at the true (u, z),
gP
a
gpqg
aq
= 0
and
tlgab(T) = tlgab(T)
holds except for higher-order terms. It is hence convenient to use
an orthogonal parametrization (u, z). When it does not exist, we
may use one which is orthogonal at least at a point (ft, ~).

We next calculate the loss of information, which is caused by


summarizing the sufficient x into an efficient ft. Since ft is
efficient, gaK = gpK = 0, and
_b -p
/NClaR,(X, u, z) gab u + gapz ,
_a -q
INClpR,(X, u, z) gpa u + gpqZ ,
except for higher order terms, where (a, ~) is a joint efficient
estimator of (u, z) and Z = IN(~ - z) . It is easy from the above
expansion to show
N(gab
0(1)
so that
gab(O) = Ng ab + 0(1),
tlg ab (0) = 0(1).
Hence, an efficient estimator ft is first-order sufficient. In order
to evaluate the term of order 1 in tlgab(U), we use a parametrization
which is orthogonal at the true point (u, z). Then, from the
expansion
f a ( A) +~ H(m)z-pz-q + 1 H(m)-K-A H(e)_b_p
NoaR,(X, u, z) u 2 pqa T KAav v - abpu z

(8.22)

where fa(ft) is the term depending only on ft, we have


264

...
~gab(U)

(8.23)
which is in a correspondence with the mean square error of a
bias-corrected efficient estimator 0.* given in (8.15). The term
(H~);b vanishes for the m.l.e.
We next calculate the loss of information caused by summarizing
x into the pair (0., 2) From
Naa~(x, u, z) ha(o., 2) + ~ H~~~yKyA - H~~~UbyK
=

_ H(e)zpyK + 0 (N- 1 / 2 ) (8.24)


apK p ,
where ha(o., 2) is the term depending only on 0. and 2, we have
~gab(U, Z) ~gab(U, Z)
.l(Hm) 2 + (He
= (8.25)
2 A ab U,
By subtracting (8.25) from (8.23), one have
- "I ~ 1 m 2 m
+ (HU 2 -1 (8.26)
gab(Z u) = 2(H Z)ab , Z)ab + O(N ),
which is the amount of information carried by the estimator 2 of the
nuisance parameter conditioned on 0..
In order to obtain the second-order sufficient statistics,
let us define the following quantities,
H(e) (0. 2)~K ,
tab abK '
H(e) (0. 2)~K, (8.27)
tap apK '
H(e) (0. 2)~K .
tpq pqK '
They are the exponential-curvature direction components of the
ancillary yK for the full model M with the parameters (u, z), and
hence are second-order sufficient for the entire parameters (u, z)
together with the m.l.e. 0. and~. Hence, the statistics (0., 2, tab'
tap' t pq ) are second-order sufficient,
~gab(U, Z, R) = O(N- 1 ),
where 1t (tab' tap' t pq )' When the parametrization (u, z) is
265

orthogonal at the true value, as can be seen from (8.24), {ft, ~, tab'
tap} are second-order sufficient.
When (u, z) is not orthogonal, we need to use the components in
the orthogonal directions aa of the exponential curvature, instead of
those in the da-directions. Since the orthogonal directions are
given by

lia = da - g~dp'
the orthogonal direction components of the exponential-direction
components t of ~ are given by
t'
ab
t
ab
gPt
a bp
gqt
b aq
+ t pq gPgq
a b'
t' t gqt
ap ap a pq'
where g~ = gab(ft, ~)gpb(ft, ~). The statistics {ft, ~, t~b' t~p} are
second-order sufficient in the non-orthogonal case.
Now let us consider asymptotic ancillary statistics. When a
statistic t is an ancillary of order q with respect to the full
parameter (u, z), it is an ancillary of order q with respect to the

parameter u, because
- pq
gab = gab - gap gbqg
is of order N-q when each of gab' is of order

Therefore, the statistic vK is second order ancillary

gab(V) = O(N- l ),
when (ft, ~) is efficient and the v-coordinate system satisfies gKA(U,

z, 0) = 0KA in the decomposition x - = n(ft, ~, ~). The statistic 2,


which is an efficient estimator of the nuisance parameter, is a
first-order ancillary, because the marginal distribution of z is

given by
p(z I u, z)
where

Hence,
0(1),
266

showing that 2 is first-order ancillary. The average conditional


information of 2 conditioned on Q is given by (8.26). The average
conditional information of ~ conditioned on Q, 2 is given by
~ 1 m 2
~ e 2 e 2
gab(V I U, Z) = ~HA)ab + (HU, Z)ab + (HU, Z , V)ab·
- A
(8.29)
It is also shown that, among all the components of vK, only tab' tap'

tpq and sa H~~~hKA carryall the information of order 1,


conditionally on u, z.

In order to show the role of the above ancillary statistics, let


us consider the conditional distribution of an estimator Q or u

conditioned on 2 or on 2, ~. We first consider the conditional

distribution of u conditioned on 2. It is given by

p(u I z) = c exp
1
{--z-
gab (u, z)(u a + g~zp) (ub + g~zq)}
+ O(N- 1 / 2 ).
This shows that the conditional covariance of u is gab when

condi tioned on 2. Since the uncondi tiona1 covariance of u is gab,

the conditional evaluation of Q might seem to be better than the

unconditional one. However, this is not true~ The condi tiona1

expectation of u is

but we cannot know the deviation z = ,;N(2 - z) even if we know 2.


Hence, conditioned on 2, u is distributed around unknown g~zP with
covariance gab If we take average over unknown zP, the covariance

becomes g-ab Hence, nothing is gained by the conditional inference

conditioned on the first-order ancillary 2. We can make use of the

information carried by 2, only when we evaluate the accuracy of the


estimator Q. Indeed, when 2 is known, gab(Q, 2) gives an estimate of

the asymptotic covariance g-ab ( u, z) cannot


of u. Without 2, we
-ab
obtain any consistent estimator of the asymptotic covariance g (u,

z) of Q.

We next consider the conditional distribution of u conditioned


267

on ~. By integrating p (il, z I ~), which is obtained in the same


manner as in deriving (7.13), with respect to z, we have
p(u- I v) = 1 gab
c exp { -2 - u_,a_,b}
u {l + 1 BN(-u, v-) + O(N- l )},
27R

BN(u-, v) abc + H(e)h ab - K H(m)-K-A


=
b1 Kabc h abK v - KAav v ,
provided (u, z) is orthogonal at this point. When u and z are not
orthogonal, we need to integrate the Hermite polynomials in il and z
with respect to non-orthogonal variables. The result is
BN(il, v) = H(e)'habv K H(m),-K-A
abK - KAa v v ,
where

H(m) , H(e) _ H(e)gP


KAa KAa KAp a'
Hence, we have in general
1 - -,a_,b}
p(il I v) = c exp { -2 gabu u

(8.30)
where
-, = H(m)'vKv
sa KAa
A

vanishes for the m.l.e. (0, 2).


The conditional covariance of il conditioned on ~ is given by
Cov[ila, ilb I ~] = gab _ t,ab + Op(N- l ), (8.31)
t,ab = -ac-bdH(e)'~K
g g cdK .
However, in order to obtain the estimate gab(O, 2) and H~~~'(O, 2)
of g-ab and H(e), we need 2.
abK '
If we use the observed Fisher information matrix

_['N<:'
0, 2) dadpR,(X,
0, OJ
dbdqt(X, 0, 2) dpdqR,(X, 0, 2)
the estimate
- (0 2) _ t,ab
gab '
of the covariance is the (a, b)-component of the inverse of the
268

entire observed Fisher information matrix.

8.5. Reconstruction of estimator from those of independent samples


Let us consider how we can construct the third-order efficient
estimator ft from the second-order sufficient statistics (ft i , 2i' ti)
extracted from s independent samples from the distributions q(x, u,
zi) of sample size Ni (i=l, ... ,s), where u is cOIllIllon but zi are
different. We first treat the case when the values zi of the
nuisance parameter is also cOIllIllon to all the samples, so that the
underlying distribution is the same q(x, u, z) with unknown u and z.
In this case, the joint efficient estimators (ft, 2) of (u, z) is
constructed from (ft i , 2i , ti) by the method in Theorem 7.9.
Therefore, its u-part fta gives the desired estimator ft from all the
pooled samples,

ft a -_ Gab E (G ibcftci + Gibp 2p)


i + Gaq E(G iqcftic + Giqp 2p)
i' (8 . 3Z)

where, Gab, Gaq, Gibc ' Gibp ' etc. are defined in the same manner as
in (7.30) by considering that indices take pairs (a, p) etc. in the
present case.
We next show the method of constructing a third-order efficient
estimator ft, when the i-th sample consists of Ni independent
observations from the distribution q(x, u, zi) such that u is cOIllIllon
but the nuisance parameter takes different unknown values for each
sample.
We treat here a very simple case with two samples of the same
size for illustration. The general case is treated by the same
method, so that the result is shown later.
Let xli' ... ,x lN and x Zl ,' .. x ZN be N independent observations
from the distributions q(x l , u, zl) and q(x Z ' u, zZ), respectively.
The joint distribution of x = (xl' x Z), where xl is from the first
distribution and X z from the second, is given by
q(x, u, zl' zZ) = exp{x l ·6 1 + x Z ·6 Z - ~(61) - ~(6Z)} (8.33)
269

where
81 = 8(u, zl)' 8 Z = 8(u, zZ)
and the dot in x l .8 l denotes the inner product. The distribution
(8.33) defines an (Zn, m + Zk)-curved exponential family M imbedded
in Zn-dimensional S with the natural parameter

8 = [8 1 ] ~[8(U' Zl)]

8Z 8 (u, zZ)
and the expectation parameter

n = [nlj =[n(u, Zl)] ,

nZ n (u, zZ)
where u = (u a ) is the m-dimensional parameter of interests, Z = (zl'
zZ) is the Zk-dimensional nuisance parameter, and 81 = (8~), 8Z = (
i
8Z)' nl = (nli)' n Z = (nZi) are n-dimensional vectors.
The tangent vectors of M at (u, ZZ) are given
by ae/aua,a8/azl and a8/az~. They are Zn-dimensional vectors of the
forms

These vectors, m + Zk in number, span the tangent space T(M) of M.


The tangent subspace T(A) of the ancillary family A associated with
the m.l.e. (11, 2 1 , 2 Z) is orthogonal to T(M) and is spanned by the
vectors
270

B~ = [ g~b (B lbi - glbB1Pi)],

-ab p
gz (B Zbi - gZbBZpi)
where BKi(u, z) is the vectors satisfying
BiB. 0
a Kl.
and
Blci Cl11i(U, zl)/ClUc Bci(u, zl)
Blpi Cl11i(U, Zl)/ClZl Bpi(u, zl) ,
glba gab(u, zl)'
etc. It can easily be checked that these vectors are orthogonal to
those spanning T(M).
We treat the m.l.e. for simplicity's sake. Let (ft l , 21 ) and (ft Z '
2 Z) be the m.l.e.s from the observations x ll '·· .,x lN and x Zl '·· .,xZ N'
respectively. Then, we can decompose the sufficient statistics xi =

~ ~xik (i = 1,Z) into the triplets (ft i , 2i' ~i) as


x- lj 11j(ftl , 2 1 , ~l)
11j(ft l , 2 1 ) + BKj (ft l , 2l)~~ (8.34)
x- Zj 11j(ftZ ' 2 Z ' ~Z)
11j (ft Z ' 2 Z) + BKj(ft Z ' 2Z)~~'
where ~l and ~Z are the approximate ancillary statistics and 11 is
linear in v in the case of the m.l.e. The m.l.e. (ft, 2i, zi ) of the
entire model is given by the decomposition

xli = 11i(ft, 2i) + BKi(ft, 2i)~iK + B~liftOa'


(8.35)
XZi = 11i(ft, 2i) + BKi(ft, 2i)~iK - B~ZiftOa'
where (~lK, ~iK, ftOa) are the approximate ancillary statistics for
the enlarged model M corresponding to the orthogonal directions B1K ,
BZK and B~, respectivel~ and
B~~i = gab(ft, 2i)~i(ft, 2i),
etc. By equating the above two expressions, we can obtain the m.l.e.
ft of the whole pooled samples from ftl' ftZ' 2 1 , 2 Z ' ~l and ~Z of the
271

respective samples.
From the first equations of (8.34) and (8.35), we have
~ ...) (~2 ' ) B ",K B' A' K - ' ab ,
ni ( Ul' Gl - ni u, 1 = lKivl - lKivl - g 1 BlbiftOa'

(8.36)
where BIKi etc. are evaluated at and Bl' Kl.. etc. are
evaluated at (ft, 2i). The left hand side of the above equation is
expanded as
Biai(ftt - fta) + giap(21 - 2iP ) = -ftOa + higher order terms.
We first derive the first-order approximation of ft, 2i, 2 2, ftOa . To
this end, we multiply Bi~ and Bi~ to the both sides of (8.36). Then,
we have

giab(ftt - fta) + giap(21 - 2iP ) -ftOa + higher order terms,

giqa(ftt - fta) + gi qp (21 - 2'i) higher order terms.


Similar equations hold for ft~ - fta and 2~- 2'P2· By solving the above
liner simultaneous equations, where giab etc. may be replaced by glab
in the linear approximation, we have
2'
1

(8.37)

where
a ac- a ac-
Gl b glbc' G2 b
G G g2bc'
and Gac is the inverse of Gac = (glac + g2bc). This shows that the
best estimator is obtained by the weighted mean of ftl and ft2 by using
the orthogonalized Fisher information matrices glab and g2ab as the
weights.
We can proceed further to obtain the higher-order terms of $\hat u$. Here, we assume that $(u, z)$ is orthogonal at $(u, z_1)$ and $(u, z_2)$ for simplicity's sake. When it is not orthogonal, the same result holds simply by using the orthogonalized $\bar g_{ab}$ and the modified $\bar t_{ab}$ instead of $g_{ab}$ and $t_{ab}$. In order to get the higher-order terms, we put
\[
\hat u^a = \hat u_1^a + g_1^{ab}\hat u_{0b} + \varepsilon_1^a,
\qquad
\hat u^a = \hat u_2^a - g_2^{ab}\hat u_{0b} + \varepsilon_2^a.
\qquad (8.38)
\]
By multiplying $B_{1ai}'$ to both sides of (8.36) and by expanding the various quantities at $(\hat u, \hat z_1)$, we have
\[
g_{1ab}(\hat u_1^b - \hat u^b) + \tfrac{1}{2}\Gamma^{(m)}_{1abc}(\hat u_1^b - \hat u^b)(\hat u_1^c - \hat u^c) + H^{(e)}_{1ab\kappa}\hat v_1^{\kappa}(\hat u_1^b - \hat u^b) = -\hat u_{0a},
\]
where we neglected higher-order terms by taking $\hat z_1' - \hat z_1 = O_p(N^{-1})$ into account. By substituting (8.38) in the above, we have
\[
g_{1ab}\varepsilon_1^b = -\tfrac{1}{2}\Gamma^{(m)bc}_{1a}\,\hat u_{0b}\hat u_{0c} + t_{1ab}\,g_1^{bc}\hat u_{0c}.
\]
On the other hand, by solving (8.38), $\hat u_{0b}$ is evaluated as
\[
g_1^{ab}\hat u_{0b} = G_2'^{\,a}{}_{b}\{(\hat u_2^b - \hat u_1^b) + (\varepsilon_2^b - \varepsilon_1^b)\},
\]
where the prime in $G_2'^{\,a}{}_{b}$ denotes that the quantity $G_2{}^{a}{}_{b}$ is evaluated at $(\hat u, \hat z_1', \hat z_2')$. The final result is
\[
\hat u^a = G_2'^{\,a}{}_{b}(\hat u_2^b + \varepsilon_2^b) + G_1'^{\,a}{}_{b}(\hat u_1^b + \varepsilon_1^b),
\qquad (8.39)
\]
where
\[
g_{1ab}\varepsilon_1^b = \tfrac{1}{2}\,\Gamma^{(m)}_{1bca}\,G_2'^{\,b}{}_{b'}\,G_2'^{\,c}{}_{c'}\,(\hat u_2^{b'} - \hat u_1^{b'})(\hat u_2^{c'} - \hat u_1^{c'}) + \cdots,
\]
etc.
Finally, we describe the result of the general case. Let $\hat u^a$ be the third-order efficient estimator derived from the second-order sufficient statistics $\hat u_i^a$, $\hat z_i$ and $\hat t_{iab}$ of the $i$-th sample consisting of $N_i$ independent observations from the distribution $q(x, u, z_i)$. The first-order evaluation of $\hat u^a$ is given by
\[
\hat u'^{a} = G'^{ab}\Bigl\{\sum_i N_i\,\bar g_{bc}(\hat u_i, \hat z_i)\,\hat u_i^c\Bigr\},
\qquad (8.40)
\]
where $G'^{ab}$ is the inverse of
\[
G_{ab}' = \sum_i N_i\,\bar g_{ab}(\hat u_i, \hat z_i).
\]
Let us define
\[
\bar G_{iab} = N_i\Bigl\{\bar g_{iab} + \tfrac{1}{2}\,\bar\Gamma^{(m)}_{abc}\,(\hat u_i^c - \hat u'^{c}) - \bar t_{iab}\Bigr\},
\qquad (8.41)
\]

where all the quantities with suffix $i$ are evaluated at $(\hat u', \hat z_i)$.

Then, we have

Theorem 8.5. The third-order efficient estimator is given by the weighted average
\[
\hat u^a = G^{ab}\Bigl(\sum_i \bar G_{ibc}\,\hat u_i^c\Bigr)
\qquad (8.42)
\]
of the estimators from the various samples with the weight matrices $\bar G_{ibc}$, where $G^{ab}$ is the inverse of
\[
G_{ab} = \sum_i \bar G_{iab}.
\]
The bias-corrected version $\hat u^{*a}$ is obtained by correcting the bias of $\hat u^a$.
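
To make the bookkeeping in (8.41)--(8.42) concrete, here is a hedged sketch of our own (the variable names and the plain NumPy setting are ours; the inputs $\bar g_i$, $\bar\Gamma^{(m)}$ and $\bar t_i$ are assumed to have been evaluated at $(\hat u', \hat z_i)$ as described above, with $\hat u'$ taken, e.g., from the first-order formula (8.40)):

import numpy as np

def weight_matrix(N_i, gbar_i, Gamma_m, u_i, u_first, t_i):
    # Gbar_{i,ab} = N_i { gbar_{i,ab}
    #                     + (1/2) Gamma^(m)_{abc} (u_i - u')^c - tbar_{i,ab} }   (8.41)
    correction = 0.5 * np.einsum('abc,c->ab', Gamma_m, u_i - u_first)
    return N_i * (gbar_i + correction - t_i)

def pool_third_order(N, gbar, Gamma_m, u_hats, u_first, t):
    # u^a = G^{ab} sum_i Gbar_{i,bc} u_i^c,  with  G_{ab} = sum_i Gbar_{i,ab}   (8.42)
    Gs = [weight_matrix(N[i], gbar[i], Gamma_m, u_hats[i], u_first, t[i])
          for i in range(len(N))]
    G = sum(Gs)
    rhs = sum(Gi @ ui for Gi, ui in zip(Gs, u_hats))
    return np.linalg.solve(G, rhs)

The weight matrices thus differ from the first-order weights only by the higher-order curvature and ancillary corrections appearing in (8.41).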

8.6 Notes
Since we are interested only in the value of the parameter $u$ in statistical inference in the presence of a nuisance parameter $z$, we may take any scale for $z$. In other words, we may introduce any coordinate system $z$ in $Z(u)$. Hence, an allowable general parameter transformation is of the form
\[
z' = h(z, u), \qquad u' = k(u),
\]
which does not destroy the structure of the problem, i.e., which keeps the family of submanifolds $\{Z(u)\}$ invariant. If we can choose a parametrization $(u, z)$ such that $\partial_a$ and $\partial_p$ are mutually orthogonal, then
\[
\partial_a = \bar\partial_a, \qquad g_{ab} = \bar g_{ab}
\]
hold and the discussion becomes highly transparent. This problem of orthogonalization of parameters was suggested by K. Takeuchi (private communication) and is solved here. Except for the scalar parameter case, there does not in general exist an orthogonal parametrization. However, if we use $\bar\partial_a$ instead of $\partial_a$, we get the same procedures and results as in the orthogonal parameter case.
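
As a toy illustration of orthogonalization (our own example, not from the text): suppose $x \sim N(u, 1)$ and $y \sim N(u + z, 1)$ are observed independently. Then
\[
g = \begin{pmatrix} g_{uu} & g_{uz} \\ g_{zu} & g_{zz} \end{pmatrix}
= \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix},
\qquad
\bar g_{uu} = g_{uu} - g_{uz}\,g^{zz}\,g_{zu} = 2 - 1 = 1,
\]
so the orthogonalized information about $u$ is $\bar g_{uu} = 1$, while the raw entry $g_{uu} = 2$ overstates what is available when $z$ is unknown. The difference $g_{uu} - \bar g_{uu} = g_{uz}g_{uz}g^{zz} = 1$ is exactly the amount of information regained by knowing $z$, which is quantified in general below; the squared cosine of the angle between $\partial_u$ and $\partial_z$ is $g_{uz}^2/(g_{uu}g_{zz}) = 1/2$.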
When the true value $z = z_0$ of the nuisance parameter is known, we can construct better statistical inference procedures than when it is unknown. What is the amount of information which the knowledge $z = z_0$ carries? We give a solution to this problem. When the parameter $z$ is not orthogonal, the knowledge carries the amount
\[
g_{ab} - \bar g_{ab} = g_{ap}\,g_{bq}\,g^{pq}
\]
of information per observation, which is the square of the cosine of the angle between $\partial_a$ and $\partial_p$. When $z$ is orthogonal, the amount of information which the knowledge carries is the sum of the squares of two types of curvatures of $Z(u)$, one being the mixture curvature of $Z(u)$ in $M$ and the other being the exponential twist curvature of $Z(u)$ in $S$. This information is of order 1 in total, and it appears in the evaluation of the characteristics of both the estimation and testing problems. In this connection, we have defined information, asymptotic sufficiency and asymptotic ancillarity in the case with a nuisance parameter. However, the theory does not seem satisfactory, and there remain many problems to be studied further. See also Liang (1983). Wei and Tsai (1983) studied conditional inference in the presence of a nuisance parameter. The problem of reconstructing a third-order efficient estimator from the asymptotically sufficient statistics of independent samples was also studied by Akahira and Takeuchi (1981b) in the case when all the samples are subject to distributions with nuisance parameters of different unknown values.
There is an interesting problem of estimating the structural parameter $u$ from independent observations $x_1, \ldots, x_N$, where $x_i$ is observed only once and is subject to $q(x_i, u, z_i)$ with unknown and different $z_i$ $(i = 1, \ldots, N)$. This problem has been studied by many researchers, e.g., Neyman and Scott (1948), Lindsay (1980), Hasminskii and Ibragimov (1983), Begun et al. (1983), Kumon and Amari (1984). Differential geometry provides a fundamental framework for studying this problem, too, but we cannot discuss it here. See Amari (1984a), Amari and Kumon (1985).
REFERENCES

Akahira, M. (1983). Asymptotic deficiency of the jackknife


estimator. Australian J. Statist., 25, 123-129

Akahira, M. and Takeuchi, K. (198la). Asymptotic Efficiency of


Statistical Estimators: Concepts and Higher Order Asymptotic
Efficiency. Springer Lecture Notes in Statistics, vol. 7,
Springer

Akahira, M. and Takeuchi, K. (198lb). On asymptotic deficiency of


estimators in pooled samples. Tech. Rep., Limburgs Univ. Centrum,
Belgium

Akin, E. (1979). The Geometry of Population Genetics. Springer


Lecture Notes in Biomathematics, vol. 31

Amari, S. (1968). Theory of information spaces --- a geometrical


foundation of the analysis of communication systems. RAAG
Memoirs, 4, 373-418

Amari, S. (1980). Theory of information spaces --- a differential


geometrical foundation of statistics. POST RAAG Report, No. 106

Amari, S. (1982a). Differential geometry of curved exponential


families --- curvatures and information loss. Ann. Statist.,
10, 357-387

Amari, S. (1982b). Geometrical theory of asymptotic ancillarity


and conditional inference. Biometrika, 69, 1-17

Amari, S. (1983a). Comparisons of asymptotically efficient tests


in terms of geometry of statistical structures. Bull. Int.
Statist. Inst., Proc. 44th Session, Book 2, 1190-1206

Amari, S. (1983b). Differential geometry of statistical inference,


Probability Theory and Mathematical Statistics (ed. Ito, K. and
Prokhorov, J. V.), Springer Lecture Notes in Math., vol. 1021,
26-40

Amari, S. (1983c). A foundation of information geometry.


Electronics and Communication in Japan, 66-A, 1-10

Amari, S. (1984a). Differential geometry of statistics---towards


new developments. METR84-1, Univ. Tokyo

Amari, S. (1984b). Differential geometry of systems. Lecture


Notes, No. 528, Inst. Math. Analysis, Kyoto Univ., 235-253

Amari, S. (1984c). Finsler geometry of non-regular statistical


models (in Japanese). Lecture Notes, No. 538, Inst. Math.
Analysis, Kyoto Univ., 81-95

Amari, S. and Kumon, M. (1983). Differential geometry of Edgeworth


expansions in curved exponential family. Ann. Inst. Statist.
Math., 35A, 1-24

Amari, S. and Kumon, M. (1985). Hilbert bundle theory ----


estimation in the presence of an infinitely increasing number
of nuisance parameters, to appear

Atkinson, C. and Mitchell, A.F. (1981). Rao's distance measure.


Sankhya, A43, 345-365

Barndorff-Nielsen, O. (1978). Information and Exponential Families


in Statistical Theory. New York: Wiley

Barndorff-Nielsen, O. (1980). Conditionality resolutions.


Biometrika, 67, 293-310

Barndorff-Nielsen, 0., Blaesild, P., Jensen, J.L. and Jorgensen, B.


(1982). Exponential transformation models. Proc. Roy. Soc.
London, A379, 41-65

Barndorff-Nielsen, O. E. (1984). Differential and integral geometry


in statistical inference. Research Report No. 106, Dept.
Theor. Statist., Univ. Aarhus
Basu, D. (1975). Statistical information and likelihood. Sankhya,
37A, 1-71

Bates, D.M. and Watts, D.G. (1980). Relative curvature measures of


non-linearity. J. Roy. Statist. Soc., B42, 1-25

Bates, D. M. and Watts, D. G. (1981). Parameter transformations


for improved approximate confidence regions in non-linear least
squares. Ann. Statist. 9, 1152-1167

Beale, E.M.L. (1960). Confidence regions in non-linear estimation.


J. Roy. Statist. Soc., B22, 41-88

Begun, J. M., Hall, W.J., Huang, W.-M. and Wellner, J.A. (1983).
Information and asymptotic efficiency in parametric-
nonparametric models. Ann. Statist., 11, 432-452

Bhattacharyya, A. (1943). On discrimination and divergence.


29th Indian Sci. Cong., Part III, 13

Bickel, P.J., Chibisov, D.M. and Van Zwet, W.R. (1981). On


efficiency of first and second order. Int. Statist. Review, 49,
169-175

Burbea, J. and Rao, C.R. (1982 a). Entropy differential metric,


distance and divergence measures in probability spaces: A
unified approach. J. Multi. Var. Analys., 12, 575-596

Burbea, J. and Rao, C. R. (1982 b). On the convexity of some


divergence measures based on entropy functions. IEEE Trans. on
Inf. Theor., IT-28, 489-495.

Caianiello, E.T. (1983). A geometrical view of quantum and


information theories. Incontri di Fisica Teorica

Chandra, T. K. and Joshi, S. N. (1983). Comparison of the


likelihood ratio, Rao's and Wald's tests and a conjecture of C.
R. Rao. Sankhya, 45, Ser. A, 226-246.

Chentsov, N.N. (1972). Statistical Decision Rules and Optimal


Inference (in Russian). Nauka, Moscow; translated in English
(1982), AMS, Rhode Island
Chernoff, H. (1949). Asymptotic studentization in testing of
hypotheses. Ann. Math. Statist., 20, 268-278

Chernoff, H. (1952). A measure of asymptotic efficiency for tests


of a hypothesis based on a sum of observations. Ann. Math.
Statist., 23, 493-507

Chibisov, D. M. (1972). On the normal approximation for a certain


class of statistics. Proc. Sixth Berkeley Symp. on Math.
Statist. and Prob., 1, 153-174

Chibisov, D. M. (1973a). Asymptotic expansions for some


asymptotically optimal tests. Proc. Prague Symp. Asymptotic
Statist. 2, 37-68

Chibisov, D. M. (1973b). An asymptotic expansion for a class of


estimators containing maximum likelihood estimators. Theor.
Probab. Appl., 18, 295-303

Cox, D. R. (1971). The choice between alternative ancillary


statistics. J. Roy. Statist. Soc., B33, 251-255

Cox, D. R. (1980). Local ancillarity. Biometrika, 67, 279-286

Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics.


Chapman and Hall, London

Csiszar, I. (1967 a). Information-type measures of difference of


probability distribution and indirect observations. Studia
Sci. Math. Hungar. 2, 299-318

Csiszar, I. (1967 b). On topological properties of f-divergence.


Studia Sci. Math. Hungar., 2, 329-339

Csiszar, I. (1975). I-divergence geometry of probability


distributions and minimization problems. Ann. Prob., 3,
146-158

Dawid, A.P. (1975). Discussion to Efron's paper. Ann. Statist.,


3, 1231-1234

Dawid, A.P. (1977). Further comments on a paper by Bradley Efron.


Ann. Statist., 5, 1249

DiCiccio, T.J. (1984). On parameter transformation and interval


estimation. Biometrika, 71, 477-485

Efron, B. (1975). Defining the curvature of a statistical problem


(with application to second order efficiency) (with
Discussion). Ann. Statist., 3, 1189-1242

Efron, B. (1978) . The geometry of exponential families. Ann.


Statist., 6, 362-376

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling


Plans. SIAM: Philadelphia

Efron, B. and Hinkley, D. V. (1978). Assessing the accuracy of the


maximum likelihood estimator: Observed versus expected Fisher
information (with Discussion). Biometrika, 65, 457-487

Eguchi, S. (1983). Second order efficiency of minimum contrast


estimators in a curved exponential family. Ann. Statist., 11,
793-803

Eguchi, S. (1984). A characterization of second order efficiency


in a curved exponential family. Ann. Inst. Statist. Math. ,
36A, 199-206

Fisher, R. A. (1925). Theory of statistical estimation. Proc.


Cambridge Philos. Soc., 22, 700-725

Ghosh, J.K., Sinha, B.K. and Wieand, H.S. (1980). Second order
efficiency of mle with respect to any bounded bowl-shaped loss
function. Ann. Statist., 8, 506-521

Ghosh, J.K. and Subramanyam, K. (1974). Second order efficiency of


the maximum likelihood estimator. Sankhya, A36, 325-358

Hamilton, D. C., Watts, D.G., and Bates, D.M. (1982). Accounting


for intrinsic nonlinearity in nonlinear regression parameter
inference regions. Ann. Statist., 10, 386-393

Hasminskii, R.Z. and Ibragimov, I.A. (1983). On asymptotic


efficiency in the presence of an infinite-dimensional nuisance
parameter. Probability Theory and Mathematical Statistics (eds.
Ito, K. and Prokhorov, J.V.), Springer Lecture Notes in Math.,
1021, 195-229

Hinkley, D. V. (1980). Likelihood as approximate pivotal


distribution. Biometrika, 67, 287-292

Hinkley, D.V. (1981). Likelihood. Canad. J. Statist., 9, 151-163

Hinkley, D.V. and Wei, B.C. (1984). Improvements of jackknife


confidence limit methods. Biometrika, 71, 331-339

Holland, P. W. (1973). Covariance stabilizing transformations.


Ann. Statist. 1, 84-92

Hougaard, P. (1981). The appropriateness of the asymptotic


distribution in a non-linear regression model in relation to
curvature. Res. Rep. 81/9, Statistical Research Unit, Danish
Medical & Social Science Research Council

Hougaard, P. (1983). Parametrization of non-linear models. J. Roy.


Statist. Soc., B44, 244-252

Ingarden, R. S. (1981). Information geometry in function spaces of


classical and quantum finite statistical systems. Intern. J.
Engrg. Science, 19, 1609-1633

Ingarden, R. S., Sato, Y., Sugawa, K. and Kawaguchi, M. (1979).


Information thermodynamics and differential geometry. Tensor,
n.s. 33, 347-353

James, A.T. (1973). The variance information manifold and the


functions on it. Multivariate Analysis (ed. Krishnaiah, P.R.),
Academic Press, 157-169

Jeffreys, H. (1946). An invariant form for the prior probability in


estimation problems. Proc. Roy. Soc., A186, 453-461

Jeffreys, H. (1948). Theory of Probability, second ed. Clarendon


Press, Oxford

Kagan, A.M. (1963). On the theory of Fisher's amount of


information. Dokl. Akad. Nauk SSSR, 151, 277-278

Kass, R.E. (1980). The Riemannian structure of model spaces:


A geometrical approach to inference. Ph.D. Thesis, Univ. of
Chicago

Kass, R. E. (1984). Canonical parametrization and zero parameter


effects curvature. J. Roy. Statist. Soc. B., 46, 86-92

Kariya, T. (1983). An invariance approach in a curved model.


Discussion Paper Ser. 88, Hitotsubashi Univ.

Kendall, M.G. and Stuart, A. (1963). The Advanced Theory of


Statistics, 1. Charles Griffin, London

Koshevnik, Yu.A. and Levit, B. Ya. (1976). On a non-parametric


analogue of the information matrix. Theory of Prob. and its
Appl., 21, 738-753

Kullback, S. (1959). Information Theory and Statistics. Wiley,


New York

Kullback, S. and Leibler, R. A. (1951). On information and


sufficiency. Ann. Math. Statist., 22, 79-86

Kumon, M. and Amari, S. (1983). Geometrical theory of higher-order


asymptotics of test, interval estimator and conditional
inference. Proc. Roy. Soc. London, A387, 429-458

Kumon, M. and Amari, S. (1984). Estimation of structural parameter


in the presence of a large number of nuisance parameters.
Biometrika, 71, 445-459

Kumon, M. and Amari, S. (1985). Differential geometry of testing


hypothesis: A higher order asymptotic theory in multiparameter
curved exponential family, to appear.
Lauritzen, S. L. (1984). Some differential geometrical notions and
their use in statistical theory. R 84-12, Inst. Electr.
Syst., Aalborg Univ.

LeCam, L. (1956). On the asymptotic theory of estimation and


testing hypothesis. Proc. Third Berkeley Symp. on Math.
Statist. & Prob., 1, 129-156

Lindsay, B.G. (1982). Conditional score functions: Some


optimality results. Biometrika, 69, 503-512

Liang, K.-Y. (1983). On information and ancillarity in the


presence of a nuisance parameter. Biometrika, 70, 607-612

Madsen, L.T. (1979). The geometry of statistical models --- a


generalization of curvature. Research Report, 79-1, Statist.
Res. Unit., Danish Medical Res. Council

Matusita, K. (1955). Decision rule based on the distance of the


classification problem. Ann. Inst. Statist. Math., 8, 67-77

McCullagh, P. (1984 a). Local sufficiency. Biometrika, 71, 233-244

McCullagh, P. (1984 b). Tensor notation and cumulants of


polynomials. Biometrika, 71, 461-476

Michel, R. (1978). Asymptotic sufficiency up to higher orders and


its applications to statistical tests and estimates. Osaka J.
Math., 15, 575-588

Nagaoka, H. and Amari, S. (1982). Differential geometry of smooth


families of probability distributions, METR 82-7, Univ. Tokyo

Neyman, J. and Scott, E. L. (1948). Consistent estimates based on


partially consistent observations. Econometrica, 16, 1-32

Ozeki, K. (1977). A Riemannian metric structure on autoregressive


parameter space (in Japanese). Report S77-04, Acoustical
Society of Japan.

Peers, H.W. (1978). Second-order sufficiency and statistical


invariants. Biometrika, 65, 489-496

Pierce, D.A. (1975). Discussion to Efron's paper. Ann. Statist., 3,


1219-1221

Pfanzagl, J. (1973). Asymptotically optimum estimation and test


procedures. Proc. Prague Symp. Asymptotic Statist., 1, 201-272

Pfanzagl, J. (1979). First order efficiency implies second order


efficiency. Contributions to Statistics [J. Hájek Memorial
Volume] (ed. J. Jurečková), pp. 167-196. Prague: Academia

Pfanzagl, J. (1980). Asymptotic expansion in parametric


statistical theory. Developments in Statistics, vol.3 (ed.
Krishnaiah, P.R.), Chap. 1, 1-97, Academic Press

Pfanzagl, J. (1982). Contributions to General Asymptotic


Statistical Theory. Lecture Notes in Statistics, 13, Springer

Pfanzagl, J. and Wefelmeyer, W. (1978). An asymptotically complete


class of tests. Z. Wahrsch. & Verw. Gebiete, 45, 49-72.

Rao, C.R. (1945). Information and accuracy attainable in the


estimation of statistical parameters. Bull. Calcutta. Math.
Soc., 37, 81-91

Rao, C.R. (1961). Asymptotic efficiency and limiting information.


Proc. Fourth Berkeley Symposium, vol. 1, 531-546

Rao, C.R. (1962). Efficient estimates and optimum inference


procedures in large samples (with discussion). J. Roy.
Statist. Soc. B 24, 46-72

Rao, C.R. (1963). Criteria of estimation in large samples. Sankhya,


A25, 189-206

Rao, C.R., Sinha, B. K. and Subramanyam, K. (1982). Third order


efficiency of the maximum likelihood estimator in the
multinomial distribution. Statistics and Decisions, 1, 1-16

Reeds, J. (1975). Discussion to Efron's paper. Ann. Statist., 3,


1234-1238

Reid, N. (1983). Curvature and linear rank statistics. Tech. Rep.


No. 83-13, Inst. Appl. Math. & Statist., Univ. British Columbia

Renyi, A. (1961). On measures of entropy and information. Proc.


4th Berkeley Symp. Math. Statist. Probab., Univ. California
Press

Ryall, T.A. (1981). Extensions of the concept of local ancillarity.


Biometrika, 68, 677-683

Sato, Y., Sugawa, K. and Kawaguchi, M. (1979). The geometrical


structure of the parameter space of two-dimensional normal
distribution. Rep. Math. Phys. 16, 111-119

Schouten, J. A. (1954). Ricci-Calculus, 2nd ed. Springer, Berlin

Skovgaard, I.M. (1981). Edgeworth expansions of the distributions


of maximum likelihood estimators in the general (non i.i.d.)
case. Scand. J. Statist., 8, 227-236

Skovgaard, L.T. (1984). A Riemannian geometry of the multivariate


normal model, Scand. J. Statist., in press

Suzuki, T. (1978). Asymptotic sufficiency up to higher orders and


its applications to statistical tests and estimates. Osaka J.
Math., 15, 575-588

Takiyama, R. (1974). On geometrical structures of parameter spaces


of one-dimensional distributions (in Japanese). Trans. Inst.
Electr. Comm. Eng. Japan, 57-A, 67-69

Tsai, C.-L. (1983). Contributions to the design and analysis of


non-linear models. Ph. D. Thesis, Univ. Minnesota

Yoshizawa, T. (1971 a). A geometry of parameter space and its


statistical interpretation. Memo TYH-2, Harvard Univ.

Yoshizawa, T. (1971 b). A geometrical interpretation of location


and scale parameters. Memo TYH-3, Harvard Univ.

Wei, B.C. and Tsai, C.L. (1983). Geometrical method of asymptotic


conditional inference based on the subset parameters.
Tech. Rep. 417, Univ. Minnesota

Supplements to REFERENCES

Monographs

Amari, S., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L., and Rao, C. R.
(1987). Differential Geometry in Statistical Inference. IMS Lecture Notes
Monograph Series, vol. 10, Hayward, California, IMS, including the
following papers:
Kass, R. E., Introduction, Chap. 1, 1 - 18.
Amari, S., Differential Geometrical Theory of Statistics, Chap. 2, 19 - 94.
Barndorff-Nielsen, O. E., Differential and Integral Geometry in Statistical
Inference, Chap. 3, 95 - 162.
Lauritzen, S. L., Statistical Manifolds, Chap. 4, 163 - 216.
Rao, C. R., Differential Metrics in Probability Spaces, Chap. 5, 217 - 240.

Barndorff-Nielsen, O. E. (1988). Parametric Statistical Models and Likelihood.


Springer Lecture Notes in Statistics, vol. 50, Springer-Verlag

Dodson, C. T. J. (1987). Geometrization of Statistical Theory. Proc. of the


GST Workshop, Univ. of Lancaster. ULDM Publications, Dept. of Math.,
Univ. of Lancaster, including the following papers and others:
Lauritzen, S. L., Conjugate connections in statistical theory, 33 - 52.
Barndorff-Nielsen, O. E., On some differential geometric concepts of
relevance to statistics, 53 - 90.
Jupp, P. E., Differential geometry and parameters of interest, 91-122.
Amari, S. Dual connections on the Hilbert bundles of statistical models, 123-
152.
Dodson, C. T. J., Systems of connections for parametric models, 153 - 170.
Kendall, W. S., Computer algebra, Brownian motion, and the statistics of
shape, 171- 192.

Blæsild, P., Elemental properties with statistical applications, 193 - 198.


Picard, D. B., Invariance properties of metrics and connections in regular
families, 203 - 208.
Lyons, T. J., What you can do with n observations, 209 - 218
Hanzon, B., A differential-geometric approach to approximate nonlinear
filtering, 219 - 224.
Eriksen, P. S., Geodesics connected with the Fisher metric on the
multivariate normal manifold, 225 - 230.

McCullagh, P. (1987). Tensor Methods in Statistics. Chapman and Hall, London

Papers
Amari, S. (1987). Differential geometrical method in asymptotics of statistical
inference. Invited Paper, Proc. of the 1st Bernoulli Society World Congress
on Mathematical Statistics and Probability Theory (eds. Prohorov, Yu. and
Sazanov, V. V.), 2, 195-204, VNU Press
Amari, S. (1987). Differential geometry in statistical inference. Proc. of ISI, 52,
Book 2, Invited Paper, 6.1, 46th Session of the ISI, 321-338
Amari, S. (1987). Statistical curvature. Encyclopedia of Statistical Sciences (eds.
Kotz, S. and Johnson, N. L.), 8, 642-646, Wiley
Amari, S. (1987). Differential geometry of a parametric family of invertible
linear systems - Riemannian metric, dual affine connections and
divergence. Mathematical Systems Theory, 20, 53-82
Amari, S. (1989). Fisher information under restriction of Shannon information.
AISM, to appear
Amari, S. and Kumon, M. (1988). Estimation in the presence of infinitely many
nuisance parameters --- geometry of estimating functions. Annals of
Statistics, 16, 1044-1068
Amari, S. and Han, T. S. (1989). Statistical inference under multi-terminal rate
restrictions --- a differential geometrical approach. IEEE Trans. on
Information Theory, IT-35, 217-227

Barndorff-Nielsen, O. E. (1986). Likelihood and observed geometries. Ann.


Statist., 14, 856-873
Barndorff-Nielsen, O. E. (1986). Strings, tensorial combinants, and Bartlett
adjustments. Proc. Roy. Soc. London, A406, 127-137
Barndorff-Nielsen, O. E. (1987). Differential geometry and statistics: some
mathematical aspects. The Journal of Mathematics, 29-3, 335-350
Barndorff-Nielsen, O. E. and Blæsild, P. (1987). Strings: mathematical theory
and statistical examples. Proc. Roy. Soc. London, A411, 155-176
Barndorff-Nielsen, O. E. and Blæsild, P. (1987). Derivative strings:
contravariant aspects. Proc. Roy. Soc. London, A411, 421-444

Barndorff-Nielsen, O. E. and Blæsild, P. (1988). Coordinate-free definition of


structurally symmetric derivative strings. Advances in Applied
Mathematics, 9, 1-6
Barndorff-Nielsen, O. E., Cox, D. R. and Reid, N. (1986). The role of differential
geometry in statistical theory. Int. Statist. Rev., 54, 83-96
Barndorff-Nielsen, O. E. and Jupp, P. E. (1988). Differential geometry, profile
likelihood, L-sufficiency and composite transformation models, Ann.
Statist., 16, 1009-1043
Barndorff-Nielsen, O. E. and Jupp, P. E. (1989). Approximating exponential
models. AISM, 41, 247-267
Blæsild, P. (1989). Yokes and tensors derived from yokes. AISM, submitted
Borre, K. and Lauritzen, S. L. (1989). Some geometric aspects of adjustment.
Festschrift to Torben Krarup (eds. Kejlsø, E., Poder, K. and Tscherning, C.
C.), 58, 77-89
Burbea, J. (1986). Informative geometry of probability spaces. Expo. Math., 4,
347-378
Burbea, J. and Oller, J. M. (1987). The information metric for univariate linear
elliptic models. Tech. Rep., 87-20, Center for Multivariate Analysis, Univ.
of Pittsburgh

Caianiello, E. T. (1986). A geometrical view of quantum and information


theories, Frontiers of Non-Equilibrium Statistical Physics, (eds. Moore, G.
T. and Scully, M. O.), Plenum, 163-187
Caianiello, E. T. and Guz, W. (1988). Quantum Fisher metric and uncertainty
relations. Physics Letters, A126-4, 223-225
Campbell, L. L. (1985). The relation between information theory and the
differential geometric approach to statistics, Inf. Sci., 35, 199-210
Cox, D. R. and Reid, N. (1987). Parameter orthogonalization and approximate
conditional inference (with discussions). J. R. Statist. Soc., B49, 1-39
Eriksen, P. S. (1987). Proportionality of covariance matrices. Ann.
Statist.,15, 732-748
Kass, R. E. (1989). The geometry of asymptotic inference (with discussions).
Statistical Science, 4, 188-234

Kumon, M. and Amari, S. (1988). Differential geometry of testing hypothesis - a


higher order asymptotic theory in multiparameter curved exponential
family. J. Fac. Eng., Univ. Tokyo, B-39, 3, 241-274
Kurose, T. (1988). Dual connections and affine geometry. Tech. Rep. UTYO-
MATH, 88-26
McCullagh, P. and Cox, D. R. (1987). Invariants and likelihood ratio statistics.
Ann. Statist., 14, 1419-1430
Mitchell, A. F. S. and Krzanowski, W. J. (1985). The Mahalanobis distance and
elliptic distributions. Biometrika, 72, 464-467
Mitchell, A. F. S. (1988). Statistical manifolds of univariate elliptic distributions.
Int. Statist. Rev., 56, 1-16
Mitchell, A. F. S. (1989). The information matrix, skewness tensors and Q-
connections for the general multivariate elliptic distribution. AISM, 41,
289-304
Mora, M. (1989). Geometrical expansions for the distributions of the score vector
and the maximum likelihood estimator. AISM, submitted
Murray, M. K. (1988). Coordinate systems and Taylor series in statistics. Proc.
Royal. Soc. London, A415, 445-452

Nomizu, K. and Pinkall, U. (1987). On the geometry of affine immersions. Math.


Z. 195, 165-178
Okamoto, I., Amari, S. and Takeuchi, K. (1989). Asymptotic theory of sequential
estimation procedures for curved exponential families. to be submitted
Oller, J. M. (1987). Information metric for extreme value and logistic probability
distributions. Sankhya, 49A, 17-23
Pazman, A. (1989). Small sample distributional properties of nonlinear
regression estimators (a geometric approach) (with discussions). Statistics,
to appear
Picard, D. B. (1989). Invariance and uniqueness of statistical manifolds under
statistical morphisms. AISM, submitted
Ross, W. H. (1987). The geometry of case deletion and the assessment of influence
in non-linear regression. Canad. J. Statist., 15, 91-103
Skovgaard, L. T. (1984). A Riemannian geometry of the multivariate normal
model. Scand. J. Statist., 11, 211-223
Skovgaard, I. M. (1986). A note on the differentiation of cumulants of log
likelihood derivatives. Int. Statist. Rev., 54, 169-186
Vos, P. W. (1987). Dual geometries and their applications to generalized linear
models. Ph. D. Dissertation, University of Chicago
Vos, P. W. (1989). Fundamental equations for statistical submanifolds with
applications to the Bartlett correction. AISM, 41, 429-450
Vos, P. W. (1989). Minimum f-divergence estimators and quasi-likelihood
functions. AISM, submitted
Wei, B. C. (1987). Geometric approach to nonlinear regression asymptotics.
Technical Report, Nanjing Institute of Technology
Xu, D. (1989). Differential geometrical structures related to forecasting error
variance ratios. AISM, submitted
SUBJECT INDEX

acceptance region 162
affine connection 34
affine coordinate system 47
α-affine 48
α-connection 39
α-convex 91
α-curvature 48
α-divergence 85
α-expectation 67
α-extreme point 90
α-family 73
α-flat 48
α-information 89
α-normal coordinate 48
α-projection 90
α-representation 66
ancillarity 211
ancillary family 55
--- associated with estimator 118
--- associated with test 163
ancillary statistic 190
ancillary submanifold 55
asymptotic ancillary 213
asymptotic bias 131
asymptotic conditional test 191
asymptotic power 169
asymptotic sufficient 213
asymptotically orthogonal ancillary family 175
autoparallel 77
bias corrected estimator 131
canonical parameter 74
Chernoff distance 88
Christoffel symbol 42
conditional covariance of estimator 223
conditional inference 217
conditional information 211
conditional test 190
confidence interval 193
consistent 129
coordinate function 12
coordinate neighborhood 12
coordinate system 12
coordinate transformation 20
covariance of conditional information 223
covariance stabilizing 150
covariant derivative 35
Cramer-Rao theorem 27

critical region 162
curvature 43
--- higher-order 238
--- second 237
--- p-th 238
curved exponential family 108
decomposition of information 236
differentiable manifold 15
differentiable structure 14
divergence 84
dual connection 71
Edgeworth expansion 120
efficient 129
--- first-order 129
--- second-order 130
--- third-order 130
efficient interval estimator 196
--- first-order uniformly 196
--- second-order uniformly 196
--- third-order uniformly 196
efficient score test 185
efficient test 169
--- first-order uniformly 169
--- second-order uniformly 169
--- third-order t-efficient 170
Efron curvature 159
Einstein summation convention 18
entropy function 83
envelope power function 170
estimator 128
Euler-Schouten curvature 52
expectation coordinate system 107
expectation parameter 106
expected conditional information 211
exponential connection (see 1-connection)
exponential curvature 114
exponential family 74, 104
Fisher information matrix 27
foliation 56
function space 93
geodesic 38
Hellinger distance 88
Hermite polynomial 122

imbedding 50
imbedding curvature 52
information 211
information carried by nuisance parameter 258
interval estimator 193
--- k- 196
invariancy 68
jackknife estimator 156
jackknifing 156
Kullback information 88
Legendre transformation 80
likelihood ratio test 183
--- t- 186
locally most powerful test 184
loss of information 212
manifold 12
metric connection 70
metric tensor 26
minimum covariance parameter 149
--- locally 150
mixture connection (see -1-connection)
mixture curvature 114
mixture family 75
mixture-normal parametrization 150
m.l.e. test 181
natural basis 18
natural coordinate system 106
natural parameter 74, 105, 152
non-metric connection 70
normal coordinate system 48
normal likelihood parameter 151
nuisance parameter 244
1-representation 19
optimal test 181
orthogonal ancillary family 131
orthogonal parametrization 253
orthogonalized Fisher information 251

parallel displacement 70
parallel vector field 47
parameter of interest 244
pooled statistic 233
pooling observations 231
potential function 80
power loss function 171
power of interval estimator 194
power of test 168
Pythagorean theorem 92
Rao test 185
Riemann-Christoffel curvature 46
Riemannian connection 42
Riemannian distance 28
Riemannian metric 25
Riemannian space 25
size of interval estimator 194
statistical curvature 159
statistical inference 115
statistical model 11
structural parameter 244
submanifold 49
sufficiency 211
tangent space 16
tangent vector 17
tensor 26, 44
tensorial Hermite polynomial 122
third-order admissible test 170
torsion tensor 45
totally geodesic 77
tubular neighborhood 56
vector field 32
Wald test 181
zero asymptotic skewness parameter 151