An Elementary Introduction To Information Geometry
An Elementary Introduction To Information Geometry
Frank Nielsen
Sony Computer Science Laboratories, Tokyo 141-0022, Japan; [email protected]
Keywords: differential geometry; metric tensor; affine connection; metric compatibility; conjugate
connections; dual metric-compatible parallel transport; information manifold; statistical manifold;
curvature and flatness; dually flat manifolds; Hessian manifolds; exponential family; mixture family;
statistical divergence; parameter divergence; separable divergence; Fisher–Rao distance;
statistical invariance; Bayesian hypothesis testing; mixture clustering; a-embeddings; mixed
parameterization; gauge freedom
1. Introduction
from a family of parametric models. This framework was advocated by Abraham Wald [ 8? ,9] who
considered all statistical problems as statistical decision problems. Dissimilarities (also loosely called
distances among others) play a crucial role not only for measuring the goodness-of-fit of data to
model (say, likelihood in statistics, classifier loss functions in ML, objective functions in mathematical
programming or operations research, etc.) but also for measuring the discrepancy (or deviance)
between models.
m
^ D
n( )
m
2
m
1
M
ˆ
Figure 1. The parameter inference of a model from data D can also be interpreted as a decision q
making problem: decide which parameter of a parametric family of models M = fm q gq2Q suits the
“best” the data. Information geometry provides a differential-geometric structure on manifold M which
useful for designing and studying statistical decision rules.
One may ponder why adopting a geometric approach? Geometry allows one to study
invariance of “figures” in a coordinate-free framework. The geometric language (e.g., line, ball or
projection) also provides affordances that help us reason intuitively about problems. Note that
although figures can be visualized (i.e., plotted in coordinate charts), they should be thought of as
purely abstract objects, namely, geometric figures.
Geometry also allows one to study equivariance: For example, the centroid c(T) of a triangle is
equivariant under any affine transformation A: c(A.T) = A.c(T). In Statistics, the Maximum Likelihood
Estimator (MLE) is equivariant under a monotonic transformation g of the model parameter
\ ˆ ˆ q: (g(q)) = g(q), where the MLE of q is denoted by q.
present the conjugate connection manifolds (M, g, r, r ), the statistical manifolds (M, g, C) where C
Entropy 2020, 22, 1100 3 of 61
a a
denotes a cubic tensor, and show how to derive a family of information manifolds (M, g, r , r ) for
1 1
a 2 R provided any given pair (r = r , r = r ) of conjugate connections. We explain how to get
conjugate connections r and r coupled to the metric g from any smooth (potentially asymmetric)
distances (called divergences), present the dually flat manifolds obtained when considering Bregman
divergences, and define, when dealing with parametric family of probability models, the exponential
e m
connection r and the mixture connection r that are dual connections coupled to the Fisher information
metric. We discuss the concept of statistical invariance for the metric tensor and the notion of information
monotonicity for statistical divergences [2,18]. It follows that the Fisher information metric is the unique
invariant metric (up to a scaling factor), and that the f -divergences are the unique separable invariant
divergences.
In the third part (Section 4), we illustrate how to use these information-geometric structures in
simple applications: First, we described the natural gradient descent method in Section 4.1 and its
relationships with the Riemannian gradient descent and the Bregman mirror descent. Second, we
consider two applications in dually flat spaces in Section 4.2: In the first application, we consider the
problem of Bayesian hypothesis testing and show how Chernoff information (which defines the best
error exponent) can be geometrically characterized on the dually flat structure of an exponential
family manifold. In the second application, we show how to cluster statistical mixtures sharing the
same component distributions on the dually flat mixture family manifold.
Finally, we conclude in Section 5 by summarizing the important concepts and structures of
information geometry, and by providing further references and textbooks [2,16] for further readings
to more advanced structures and applications of information geometry. We also mention recent
studies of generic classes of principled distances and divergences.
In Appendix A, we show how to estimate the statistical f -divergences between two probability
distributions in order to ensure that the estimates are non-negative in Appendix B, and report the
canonical decomposition of the multivariate Gaussian family, an example of exponential family
which admits a dually flat structure.
At the beginning of each part, we start by outlining its contents. A summary of the notations
used throughout this survey is provided in Appendix C.
from his bed. In practice, we shall use the most expedient coordinate system to facilitate calculations.
In information geometry, we usually handle a single chart fully covering the manifold.
k k
A C manifold is obtained when the change of chart transformations are C . The manifold is
¥
said smooth when it is C . At each point p 2 M, a tangent plane T p locally best linearizes the
manifold. On any smooth manifold M, we can define two independent structures:
The metric tensor g induces on each tangent plane T p an inner product space that allows one
to measure vector magnitudes (vector “lengths”) and angles/orthogonality between vectors. The
affine connection r is a differential operator that allows one to define:
1. The covariant derivative operator which provides a way to calculate differentials of a vector
field Y with respect to another vector field X: namely, the covariant derivative r XY,
r
2. The parallel transport Õ c which defines a way to transport vectors between tangent planes
along any smooth curve c,
3. The notion of r-geodesics gr which are defined as autoparallel curves, thus extending the
ordinary notion of Euclidean straightness,
4. The intrinsic curvature and torsion of the manifold.
(using notation =), where (X)B := (X ) denotes the contravariant vector components (manipulated
¶
¶xi
as “column vectors” in algebra) in the natural basis B = fe 1 = ¶1, . . . , e D = ¶Dg with ¶i:=: . A tangent plane (vector space) equipped with an inner product h , i yields an inner product space. We define a
reciprocal basis B = fe i = ¶igi of B = fe i = ¶igi so that vectors can also be expressed using the covariant vector components in the natural reciprocal basis. The primal and reciprocal basis are mutually
orthogonal by construction as illustrated in Figure 2 .
Entropy 2020, 22, 1100 5 of 61
x2
j j
hei; e i = i
e2
2
e
e 1
1 x
1
e
Figure 2. Primal basis (red) and reciprocal basis (blue) of an inner product h , i space. The
1 2
primal/reciprocal basis are mutually orthogonal: e is orthogonal to e2, and e1 is orthogonal to e .
i
For any vector v, its contravariant components v ’s (superscript notation) and its covariant
components vi’s (subscript notation) can be retrieved from v using the inner product with the use of
the reciprocal and primal basis, respectively:
i i
v = hv, e i, (2)
vi = hv, eii. (3)
The inner product defines a metric tensor g and a dual metric tensor g :
g
ij := hei, eji, (4)
i j
g ij := he , e i. (5)
hu, uig(p). Using local coordinates of a chart (U, x), we get the vector contravariant/covariant
components, and compute the metric tensor using matrix algebra (with column vectors by convention) as
follows:
> > 1
g(u, v) = (u) B Gx(p) (v)B = (u) B Gx( p) (v)B , (9)
Entropy 2020, 22, 1100 6 of 61
since it follows from the primal/reciprocal basis that G G = I, the identity matrix. Thus on any tangent
plane Tp, we get a Mahalanobis distance:
MG(u, v) := u v G= v å å Gij(ui i
v )(u
j j
v ).
(10)
k k u D D
u
i=1 j=1
The inner product of two vectors u and v is a scalar (a 0-tensor) that can be equivalently calculated as:
S S
i i
hu, vi := g(u, v) = u vi = uiv . (11)
A metric tensor g of manifold M is said conformal when h , i p = k(p)h , iEuclidean. That is, when
the inner product is a scalar function k( ) of the Euclidean dot product. More precisely, we define the
0
notion of a metric g conformal to another metric g when these metrics define the same angles
between vectors u and v of a tangent plane T p:
0
g p(u, v) gp(u, v)
0 0
q g p(u, u) q g p(v, v) =q gp(u, u) q gp(v, v) . (12)
0
Usually g is chosen as the Euclidean metric. In conformal geometry, we can measure angles between
vectors in tangent planes as if we were in an Euclidean space, without any deformation. This is handy for
checking orthogonality in charts. For example, the Poincaré disk model of hyperbolic geometry is
conformal but Klein disk model is not conformal (except at the origin), see [20].
The Christoffel symbols are not tensors (fields) because the transformation rules induced by a
change of basis do not obey the tensor contravariant/covariant rules.
r
2.3.2. Parallel Transport Õ c along a Smooth Curve c
Since the manifold is not embedded in a Euclidean space, we cannot add a vector v 2 T p to a vector
0
v 2 Tp0 as the tangent vector spaces are unrelated to each others without a connection (the Whitney
2D
embedding theorem [21] states that any D-dimensional Riemannian manifold can be embedded into R ;
Euc
when embedded, we can implicitly use the ambient Euclidean connection r on the manifold, see [22]).
Thus a connection r defines how to associate vectors between infinitesimally close tangent planes T p and
Tp+dp. Then the connection allows us to smoothly transport a vector
v 2 Tp by sliding it (with infinitesimal moves) along a smooth curve c (t) (with c(0) = p and c(1) = q), so
that the vector vp 2 Tp “corresponds” to a vector vq 2 Tq: this is called the parallel transport. This
mathematical prescription is necessary in order to study dynamics on manifolds (e.g., study the motion of
a particle on the manifold). We can express the parallel transport along the smooth curve c as:
r
v T, t [0, 1], v = v T
8 2 p 8 2 c(t) Õ 2 c(t) (16)
c(0)!c(t)
v
c(t) vq
v
p
c(t) q
p vq = Qcr vp
M
Figure 3. Illustration of the parallel transport of vectors on tangent planes along a smooth curve. For
a smooth curve c, with c(0) = p and c(1) = q, a vector v p 2 Tp is parallel transported smoothly to a
vector vq 2 Tq such that for any t 2 [0, 1], we have vc(t) 2 Tc(t).
Elie Cartan introduced the notion of affine connections [23,24] in the 1920s motivated by the
principle of inertia in mechanics: a point particle, without any force acting on it, shall move along a
straight line with constant velocity.
rg˙g˙ = 0. (17)
That is, the velocity vector g˙ is moving along the curve parallel to itself (and all tangent vectors
on the geodesics are mutually parallel): In other words, r-geodesics generalize the notion of
k
“straight Euclidean” lines. In local coordinates (U, x), g(t) = (g (t))k, the autoparallelism amounts to
solve the following second-order Ordinary Differential Equations (ODEs):
k l l
g¨(t) + Gij g˙(t)g˙(t) = 0, g (t) = x g(t), (18)
Entropy 2020, 22, 1100 8 of 61
k
where Gij are the Christoffel symbols of the second kind, with:
S S
k lk l
Gij = Gij,l g , Gij,k = glkGij , (19)
where Gthe Christoffel symbols of the first kind. Geodesics are 1D autoparallel submanifolds and
ij,l
• Initial Value Problem (IVP): fix the conditions g(0) = p and g˙(0) = v for some vector v 2 Tp.
• Boundary Value Problem (BVP): fix the geodesic extremities g(0) = p and g(1) = q.
(rV R)(X, Y)Z + (rX R)(Y, V)Z + (rY R)(V, X)Z = 0. (25)
In general, the parallel transport is path-dependent. The angle defect of a vector transported on
an infinitesimal closed loop (a smooth curve with coinciding extremities) is related to the curvature.
However for a flat connection, the parallel transport does not depend on the path, and yields
absolute parallelism geometry [25]. Figure 4 illustrates the parallel transport along a loop curve for a
curved manifold (the sphere manifold) and a flat manifold (the cylinder manifold).
Entropy 2020, 22, 1100 9 of 61
Figure 4. Parallel transport with respect to the metric connection: the curvature effect can be
visualized as the angle defect along the parallel transport on smooth (infinitesimal) loops. For a
sphere manifold, a vector parallel-transported along a loop does not coincide with itself (e.g., a
sphere), while it always conside with itself for a (flat) manifold (e.g., a cylinder).
Historically, the Gaussian curvature at of point of a manifold has been defined as the product of
the minimal and maximal sectional curvatures: kG := kminkmax . For a cylinder, since kmin = 0, it
follows that the Gaussian curvature of a cylinder is 0. Gauss’s Theorema Egregium (meaning
“remarkable theorem”) proved that the Gaussian curvature is intrinsic and does not depend on how
the surface is embedded into the ambient Euclidean space.
An affine connection is a torsion-free linear connection. Figure 5 summarizes the various
concepts of differential geometry induced by an affine connection r and a metric tensor g.
A ne connection
r
Curvature jkl
Parallel transport
r Q
c(t)
Geodesic
(r = 0)
Volume form
(r r! = 0)
Covariant
rX Y
r
r Ri v r ! = fdv derivative
Levi-civita connection
gr
Scalar curvature
ij
Scal = g Rij
Manifold (M; g; r)
Divergence Gradient Hessian Laplacian
div(X) = tr(Y 7!g(rY X; Y )) g(gradf; X) = df(X) Hessf(x)[v] = rvgradf f = div(grad(f))
Figure 5. Differential-geometric concepts associated to an affine connection r and a metric tensor g.
Curvature is a fundamental concept inherent to geometry [26]: there are several notions of
curvatures in differential geometry: scalar curvature, sectional curvature, Gaussian curvature of
surfaces to Riemannian–Christoffel 4-tensor, Ricci symmetric 2-tensor, synthetic Ricci curvature in
Alexandrov geometry, etc.
For example, the real-valued Gaussian curvature sec T p M on a 2D Riemannian manifold (M, g) with
Riemann curvature (1, 3)-tensor R at a point p (with local basis f¶1, ¶2g on its tangent plane Tp M) is
defined by:
Entropy 2020, 22, 1100 10 of 61
R
2112
g g 2
sec Tp M = 11 22 g12 , (26)
r ¶
, ¶2
r ¶2
r ¶1 ¶1 ¶1 ¶2 1 p
= det(g) r . (27)
In general, the sectional curvatures are real values defined for 2-dimensional subspaces pp of the
tangent plane Tp M (called tangent 2-planes) as:
2
Qp(X, Y) := hX, XiphY, Yip h X, Yi p, (29)
denotes the squared area of the parallelogram spanned by vectors X and Y of T p M. It can be shown that
secp(p) is independent of the chosen basis X and Y. In a local basis f¶igi of D-dimensional tangent plane
Tp, we thus get the sectional curvatures at point p 2 M as the following real values:
A Riemannian manifold (M, g) is said of constant curvature k if and only if secp(p) = k for all p 2 M
and pp Tp M. In particular, the Riemannian manifold is said flat when it is of constant curvature 0.
Notice that the definition of sectional curvatures relies on the metric tensor g but the Riemann–
Christoffel curvature tensor is defined with respect to an affine connection (which can be taken as
the default Levi–Civita metric connection induced by the metric g).
2.4. The Fundamental Theorem of Riemannian Geometry: The Levi–Civita Metric Connection
By definition, an affine connection r is said metric compatible with g when it satisfies for any
triple (X, Y, Z) of vector fields the following equation:
Theorem 1 (Levi–Civita metric connection). There exists a unique torsion-free affine connection
LC
compatible with the metric called the Levi–Civita connection r.
The Christoffel symbols of the Levi–Civita connection can be expressed from the metric tensor g as
follows:
S1
LC k kl
Gij = 2 g ¶i gil + ¶j gil
¶l gij , (35)
ij 1
where g denote the matrix elements of the inverse matrix g .
The Levi–Civita connection can also be defined coordinate-free with the Koszul formula:
2g(rXY, Z) = X(g(Y, Z)) + Y(g(X, Z)) Z(g(X, Y)) + g([X, Y], Z) g([X, Z], Y) g([Y, Z], X). (36)
There exists metric-compatible connections with torsions studied in theoretical physics. See for
example the flat Weitzenböck connection [27].
The metric tensor g induces the torsion-free metric-compatible Levi–Civita connection that
determines the local structure of the manifold. However, the metric g does not fix the global
topological structure: For example, although a cone and a cylinder have locally the same flat
Euclidean metric, they exhibit different global structures.
r r
*
hu, vic(0) = Õ u, Õ v + . (37)
c(0)!c(t) c(0)!c(t) c(t)
Thus the Riemannian manifold (M, g) can be interpreted as the self-dual information-geometric
LC
manifold obtained for r = r = r the unique torsion-free Levi–Civita metric connection: (M, g) ( M, g,
LC LC LC
r, r=
r). However, let us point out that for a pair of self-dual Levi–Civita conjugate connections,
the information-geometric manifold does not induce a distance. This contrasts with the Riemannian
modeling (M, g) which provides a Riemmanian metric distance D r(p, q) defined by the length of the
geodesic g connecting the two points p = g(0) and q = g(1):
Z Z
1 1 q
0
Dr(p, q) := kg (t)kg(t)dt = gg(t)(g˙(t), g˙(t))dt, (38)
0 0
Z 1 q
>
= g˙(t) gg(t)g˙(t)dt. (39)
0
This geodesic length distance Dr(p, q) can also be interpreted as the shortest path linking point p to
R 1 0
point q: Dr(p, q) = infg 0 kg (t)kg(t)dt (with p = g(0) and q = g(1)).
Usually, this Riemannian geodesic distance is not available in closed-form (and need to be
approximated or bounded) because the geodesics cannot be explicitly parameterized (see geodesic
shooting methods [28]).
We are now ready to introduce the key geometric structures of information geometry.
Entropy 2020, 22, 1100 12 of 61
3. Information Manifolds
3.1. Overview
In this part, we explain the dualistic structures of manifolds in information geometry. In Section
3.2, we first present the core Conjugate Connection Manifolds (CCMs) (M, g, r, r ), and show how to
build Statistical Manifolds (SMs) (M, g, C) from a CCM in Section 3.3. From any statistical manifold,
a a
we can build a 1-parameter family (M, g, r , r ) of CCMs, the information a-manifolds. We state the
fundamental theorem of information geometry in Section 3.5. These CCMs and SMs structures are
not related to any distance a priori but require at first a pair (r, r ) of conjugate connections coupled
to a metric tensor g. We show two methods to build an initial pair of conjugate connections. A first
D D
method consists of building a pair of conjugate connections ( r, r ) from any divergence D in
Section 3.6. Thus we obtain self-conjugate connections when the divergence is symmetric: D(q1 :
q2) = D(q2 : q1). When the divergences are Bregman divergences (i.e., D = B F for a strictly convex
2 F F
and differentiable Bregman generator), we obtain Dually Flat Manifolds (DFMs) (M, r F, r, r ) in
Section 3.7. DFMs nicely generalize the Euclidean geometry and exhibit Pythagorean theorems. We
F F
further characterize when orthogonal r-projections and dual r -projections of a point on
submanifold a is unique. In Euclidean geometry, the orthogonal projection of a point p onto an affine
subspace S is proved to be unique using the Pythagorean theorem. A second method to get a pair
e m
of conjugate connections ( r, r) consists of defining these connections from a regular parametric
e
family of probability distributions P = fpq (x)gq . In that case, these ‘e’xponential connection r and
m
‘m’ixture connection r are coupled to the Fisher information metric P g. A statistical manifold ( P, P
g, P C) can be recovered by considering the skewness Amari–Chentsov cubic tensor P C, and it
a +a
follows a 1-parameter family of CCMs, (P, P g, P r , P r ), the statistical expected a-manifolds. In
this parametric statistical context, these information manifolds are called expected information
manifolds because the various quantities are expressed from statistical expectations E [ ]. Notice
that these information manifolds can be used in information sciences in general, beyond the
traditional fields of statistics. In statistics, we motivate the choice of the connections, metric tensors
and divergences by studying statistical invariance criteria, in Section 3.10. We explain how to
recover the expected a-connections from standard f -divergences that are the only separable
divergences that satisfy the property of information monotonicity. Finally, in Section 3.11, the recall
the Fisher–Rao expected Riemannian manifolds that are Riemannian manifolds (P, P g) equipped
with a geodesic metric distance called the Fisher–Rao distance, or Rao distance for short.
Conjugation is an involution: (r ) = r.
Definition 2 (Conjugate Connection Manifold). The structure of the Conjugate Connection Manifold
(CCM) is denoted by (M, g, r, r ), where (r, r ) are conjugate connections with respect to the metric g.
A remarkable property is that the dual parallel transport of vectors preserves the metric. That is,
for any smooth curve c(t), the inner product is conserved when we transport one of the vector u
r r
using the primal parallel transport Õ c and the other vector v using the dual parallel transport Õ c .
r r
*
hu, vic(0) = Õ u, Õ v + . (43)
c(0)!c(t) c(0)!c(t) c(t)
Property 1 (Dual parallel transport preserves the metric). A pair (r, r ) of conjugate connections
preserves the metric g if and only if:
r r
*
8t 2 [0, 1], Õ u, Õ v + = hu, vic(0). (44)
c(0)!c(t) c(0)!c(t) c(t)
Property 2. Given a connection r on (M, g) (i.e., a structure (M, g, r)), there exists a unique
conjugate connection r (i.e., a dual structure (M, g, r )).
We consider a manifold M equipped with a pair of conjugate connections r and r that are coupled with
the metric tensor g so that the dual parallel transport preserves the metric. We define the
¯
mean connection r: ¯ = r+r , (45)
r 2
¯
with corresponding Christoffel coefficients denoted by G . This mean connection coincides with the
Levi–Civita metric connection:
¯ LC
r= r. (46)
¯
Property 3. The mean connection r is self-conjugate, and coincide with the Levi–Civita metric connection.
3.3. Statistical Manifolds: (M, g, C)
Lauritzen introduced this corner structure [29] of information geometry in 1987. Beware that
although it bears the name “statistical manifold”, it is a purely geometric construction that may be
used outside of the field of Statistics. However, as we shall mention later, we can always find a
statistical model P corresponding to a statistical manifold [30]. We shall see how we can convert a
conjugate connection manifold into such a statistical manifold, and how we can subsequently derive
an infinite family of CCMs from a statistical manifold. In other words, once we have a pair of
conjugate connections, we will be able to build a family of pairs of conjugate connections.
We define a cubic (0, 3)-tensor (i.e., 3-covariant tensor) called the Amari–Chentsov tensor:
k
C := G Gk , (47)
ijk ij ij
or in coordinate-free equation:
Using the local basis, this cubic tensor can be expressed as:
C = C (¶ , ¶ , ¶ ) = ¶ ¶ ,¶ (49)
hr r i
ijk i j k ¶i j ¶i j k
Definition 3 (Statistical manifold [29]). A statistical manifold (M, g, C) is a manifold M equipped with
a metric tensor g and a totally symmetric cubic tensor C.
a a a
3.4. A Family f(M, g, r , r = (r ) )ga2R of Conjugate Connection Manifolds
For any pair (r, r ) of conjugate connections, we can define a 1-parameter family of connections
a a a 0 ¯ LC
fr ga2R, called the a-connections such that (r , r ) are dually coupled to the metric, with r = r = r,
1 1
r = r and r = r . By observing that the scaled cubic tensor aC is also a totally symmetric cubic 3-
covariant tensor, we can derive the a-connections from a statistical manifold (M, g, C) as:
a 0 aC
G = G , (50)
ij,k ij,k 2 ij,k
0
G a = G +a C , (51)
ij,k ij,k 2 ij,k
S
0
where G are the Levi–Civita Christoffel symbols, and Gki,j = G glk (by index juggling). ij,k
l
ij
a
a LC
g(r XY, Z) = g( rXY, Z) + 2 C(X, Y, Z), 8X, Y, Z 2 X(M). (52)
a a a
Theorem 2 (Family of information a-manifolds). For any a 2 R, (M, g, r , r = (r ) ) is a
conjugate connection manifold.
a
The a-connections r can also be constructed directly from a pair (r, r ) of conjugate
connections by taking the following weighted combination:
Ga = 1 + aG + 1 aG . (53)
ij,k 2 ij,k 2 ij,k
3.5. The Fundamental Theorem of Information Geometry: r k-Curved , r k-Curved
We now state the fundamental theorem of information geometry and its corollaries:
Theorem 3 (Dually constant curvature manifolds). If a torsion-free affine connection r has constant
curvature k then its conjugate torsion-free connection r has necessarily the same constant curvature k.
a a a a
Corollary 1 (Dually a-flat manifolds). A manifold (M, g, r , r ) is r -flat if and only if it is r -flat.
Corollary 2 (Dually flat manifolds (a = 1)). A manifold (M, g, r, r ) is r-flat if and only if it is r -flat.
Definition 4 (Constant curvature k). A statistical structure (M, g, r) is said of constant curvature k when
r
R (X, Y)Z = kfg(Y, Z)X g(X, Z)Yg, 8 X, Y, Z 2 G(TM),
where G(TM) denote the space of smooth vector fields.
It can be proved that the Riemann–Christoffel (RC) 4-tensors of conjugate a-connections [16] are
related as follows:
(a) ( a)
g R (X, Y)Z, W + g Z, R (X, Y)W = 0. (54)
r r
We have g R (X, Y)Z, W = g Z, R (X, Y)W . connections, we can always build a 1-parametric
D D
3.6. Conjugate Connections from Divergences: (M, D) (M, g, r, Dr = D r)
Loosely speaking, a divergence D( : ) is a smooth distance [32], potentially asymmetric. In order
to define precisely a divergence, let us first introduce the following handy notations: ¶ f (x, y) =
2
¶ ¶ ¶ ¶ ¶ ¶2 i,
i
¶xf (x, y), ¶ ,j f (x, y) = ¶yj f (x, y), ¶ij,k f (x, y) = i j
¶x ¶x ¶y
k
f (x, y) and ¶i,jk f (x, y) = i j
¶x ¶y ¶y
k
f (x, y), etc.
Definition 5 (Divergence). A divergence D : M M ! [0, ¥) on a manifold M with respect to a local chart
D 3
Q R is a C -function satisfying the following properties:
0 0 0
1. D(q : q ) 0 for all q, q 2 Q with equality holding iff q = q (law of the indiscernibles),
0 0
2. ¶i, D(q : q )jq=q0 = ¶ ,j D(q : q )jq=q0 = 0 for all i, j 2 [D],
0
3. ¶ ,i¶ ,j D(q : q )jq=q0 is positive-definite.
The dual divergence is defined by swapping the arguments:
0 0
D (q : q ) := D(q : q), (55)
and is also called the reverse divergence (reference duality in information geometry). Reference
duality of divergences is an involution: (D ) = D.
The Euclidean distance is a metric distance but not a divergence. The squared Euclidean
distance is a non-metric symmetric divergence. The metric tensor g yields Riemannian metric
distance Dr but it is never a divergence.
From any given divergence D, we can define a conjugate connection manifold following the
construction of Eguchi [33,34] (1983):
D D D
Theorem 4 (Manifold from divergence). (M, g, r, r) is an information manifold with:
0 D
D g := ¶i,j D(q : q )jq=q0 = g, (56)
D 0
Gijk := ¶ij,k D(q : q )jq=q0 , (57)
D 0
Gijk := ¶k,ij D(q : q )jq=q0 . (58)
D D
The associated statistical manifold is (M, g, C) with:
D D D
Cijk = Gijk Gijk. (59)
Entropy 2020, 22, 1100 16 of 61
D
Since a C is a totally symmetric cubic tensor for any a 2 R, we can derive a one-parameter
family of conjugate connection manifolds:
n o
D D a D D a D a D a
(M, g, C ) (M, g, r , ( r ) = r ) . (60)
a2R
In the remainder, we use the shortcut (M, D) to denote the divergence-induced information
D D D
manifold (M, g, r, r ). Notice that it follows from construction that:
Dr = D r. (61)
B B B B
3.7. Dually Flat Manifolds (Bregman Geometry): (M, F) (M, F g, F r, F r= F r)
We consider dually flat manifolds that satisfy asymmetric Pythagorean theorems. These flat
manifolds can be obtained from a canonical Bregman divergence.
Consider a strictly convex smooth function F(q) called a potential function, with q 2 Q where Q is an
open convex domain. Notice that the function convexity does not change by an affine transformation. We
associate to the potential function F a corresponding Bregman divergence (parameter divergence):
0 0 0> 0
BF(q : q ) := F(q) F(q ) (q q ) rF(q ). (62)
We write also the Bregman divergence between point P and point Q as D(P : Q) := B F(q(P) :
q(Q)), where q(P) denotes the coordinates of a point P.
The information-geometric structure induced by a Bregman generator is (M, F g, FC) := (M, BF g, BF
C) with:
B
F g := F g= ¶¶ (q : q ) = 2 (q ), (63)
i jBF j
FG := BF Gij,k(q) = 0, 0 q 0=q rF (64)
F
C := B
F C = ¶ ¶ ¶ F(q). (65)
ijk ijk i j k
Here, we define a Bregman generator as a proper, lower semi-continuous, and strictly convex
3
and C differentiable real-valued function.
Since all coefficients of the Christoffel symbols vanish (Equation (64)), the information manifold
is Fr-flat. The Levi–Civita connection LCr is obtained from the metric tensor F g (usually not flat),
and we get the conjugate connection ( Fr) = Fr1 from (M, F g, FC).
The Legendre–Fenchel transformation yields the convex conjugate F that is interpreted as the
dual potential function:
F (h) := supfq>h F(q )g. (66)
q2Q
A function f is lower semicontinous (lsc) at x 0 iff f (x0) limx!x0 inf f (x). A function f is lsc if it is lsc
at x for all x in the function domain. The following theorem states that the conjugation of lower
semicontinuous and convex functions is an involution:
In a dually flat manifold, there exists two global dual affine coordinate systems h = rF(q) and
q = rF (h), and therefore the manifold can be covered by a single chart. Thus if a probability family
belongs to an exponential family then its natural parameters cannot belong to, say, a spherical space
(that requires at least two charts).
We have the Crouzeix [36] identity relating the Hessians of the potential functions:
2 2
r F(q)r F (h) = I, (67)
Entropy 2020, 22, 1100 17 of 61
j
where I denote the D D identity matrix. This Crouzeix identity reveals that B = f¶igi and B = f¶ gj are
the primal and reciprocal basis, respectively.
The Bregman divergence can be reinterpreted using Young–Fenchel (in)equality as the
canonical divergence AF,F [37]:
0 0 0 > 0 0
BF(q : q ) = AF,F (q : h ) = F(q) + F (h ) q h = AF ,F(h : q). (68)
0 0 0
The dual Bregman divergence BF (q : q ) := BF(q : q) = BF (h : h ) yields
F ij i j l ¶
g (h) = ¶ ¶ F (h), ¶ :=: l (69)
¶h
ijk i j k
FG (h) = 0, FCijk = ¶ ¶ ¶ F (h ) (70)
F F
Thus the information manifold is both r-flat and r -flat: This structure is called a dually flat
manifold (DFM). In a DFM, we have two global affine coordinate systems q( ) and h( ) related by the
Legendre–Fenchel transformation of a pair of potential functions F and F . That is, (M, F) (M, F ),
and the dual atlases are A = f(M, q)g and A = f(M, h)g.
In a dually flat manifold, any pair of points P and Q can either be linked using the r-geodesic
3
(that is q-straight) or the r -geodesic (that is h-straight). In general, there are 2 = 8 types of
geodesic triangles in a dually flat manifold.
On a Bregman manifold, the primal parallel transport of a vector does not change the contravariant
vector components, and the dual parallel transport does not change the covariant vector components.
Because the dual connections are flat, the dual parallel transports are path-independent.
Moreover, the dual Pythagorean theorems [38] illustrated in Figure 6 holds. Let g(P, Q) = gr(P,
Q) denote the r-geodesic passing through points P and Q, and g (P, Q) = gr (P, Q) denote the r -
geodesic passing through points P and Q. Two curves g1 and g2 are orthogonal at point
p = g1(t1) = g2(t2) with respect to the metric tensor g when g(g˙1(t1), g˙2(t2)) = 0.
(P; Q) ?F (Q; R) (P; Q) F (Q; R)
?
P P
R
R
Q
Q
The dual r -projection PS is unique if M S is r-flat and minimizes the divergence D(q(Q) : q(P)):
0 0
Let S M and S M, then we define the divergence between S and S as
0 0
D(S : S ) := min D(s : s ). (73)
0 0
s2S,s 2S
0 0
When S is a r-flat submanifold and S r -flat submanifold, the divergence D(S : S ) between
0
submanifold S and submanifold S can be calculated using the method of alternating projections [2].
Let us remark that Kurose [39] reported a Pythagorean theorem for dually constant curvature
manifolds that generalizes the Pythagorean theorems of dually flat spaces.
We shall concisely explain the space of Bregman spheres explained in details in [40]. Let D denote
the dimension of Q. We define the lifting of primal coordinates q to the primal potential function
ˆ
F = fq = (q, qD+1 = F(q)) : q 2 Qg using an extra dimension qD+1. A Bregman ball S
of a (D + 2) (D + 2) determinant:
InBregmanBallF(P1, . . . , Pd+1; P) := sign
0
q(P1) .. .. .. q(PD+1) 1
1
F (q(P1)) . . . F (q(PD+1))
1
1
@ q (P ) C . (76)
B
F(q(P)) A
We have:
InBregmanBallF(P1, . . . , Pd+1; P) : 8=0 , P2 ¶InBregmanBallF(P1, . . . , PD+1; P) (77)
and be lifted to the dual potential function F . Notice that Ball F(C : r) = BallF (C : r). Figure 7 displays
five concentric pairs of dual Itakura–Saito circles obtained for the separable Burg
negentropy generator divergence the Itakura–Saito divergence)
Using the space of spheres, it is easy to design algorithms for calculating the union or
intersection of Bregman spheres [40], or data-structures for proximity queries [41] (relying on the
radical axis of two Bregman spheres). The Bregman spheres are considered for building Bregman
Voronoi diagrams in [40,42].
The smallest enclosing Bregman ball [43,44] (SEBB) of a set of points P1, . . . , Pn (with respective
q-coordinates q1, . . . , qn) can also be modeled as a convex program; indeed, point P i belongs to the
D
lower halfspace H of equation qD+1 hha, qi + b (parameterized by vector h a 2 R and scalar b 2 R) iff
hha, qi + b F(qi). Thus we seek to minimize min ha,b r = F (ha) + b such that hqi, hai + b F(qi) 0 for all i 2 f1,
. . . , ng. This is a convex program since F is the convex conjugate of a convex generator F. When F( q) =
1 >
2 q q (i.e., Euclidean geometry), we recover the fact that the smallest enclosing ball of a point set in
Euclidean geometry can be solved using quadratic programming [45]. Faster approximation algorithms
for the smallest enclosing Bregman ball can be built based on core-sets [43].
In general, we have the following quadrilateral relation for Bregman divergences:
Property 4 (Bregman 4-parameter property [46]). For any four points P1, P2, Q1, Q2, we have the
following identity:
space can be built from any smooth strictly convex generator F. For example, a dually flat geometry
can be built on homogeneous cones with the characteristic function F of the cone [48]. Figure 8
illustrates several common constructions of dually flat spaces.
Exponential family
F : cumulant function
Mixture family
Shannon negentropy Bregman divergence
Figure 8. Common dually flat spaces associated to smooth and strictly convex generators.
F F a F a
3.8. Hessian a-Geometry: (M, F, a) (M, g, r , r )
The dually flat manifold is also called a manifold with a Hessian structure [48] induced by a
B F B F F
convex potential function F. Since we built two dual affine connections F r = r and F r= r= r,
we can build a family of a-geometry as follows:
F F ij i j
gij(q) = ¶i¶j F(q), g (h) = ¶ ¶ F(h), (80)
and
F a 1 a F a F a 1+a i j k
Gijk (q) = 2 ¶i¶j¶k F(q), Gijk (h) = Gijk (h) = 2 ¶ ¶ ¶ F (h). (81)
a a
3.9. Expected a-Manifolds of a Family of Parametric Probability Distributions: (P, P g, P r , P r )
Informally speaking, an expected manifold is an information manifold built on a regular
parametric family of distributions. It is sometimes called “expected” manifold or “expected” geometry
in the literature [49] because the components of the metric tensor g and the Amari–Chentsov cubic
tensor C are expressed using statistical expectations.
Let P be a parametric family of probability distributions:
More precisely, the likelihood function is an equivalence class of functions defined modulo a
positive scaling factor.
The score vector:
sq = rq l = (¶il)i, (84)
¶
indicates the sensitivity of the likelihood ¶il:=: ¶q i l(q; x).
The Fisher information matrix (FIM) of D D for dim(Q) = D is defined by:
¶ l¶ l 0,
P I(q) := Eq i j ij (85)
Entropy 2020, 22, 1100 21 of 61
where denotes the Löwner order. That is, for two symmetric positive-definite matrices A and B, A B
if and only if matrix A B is positive semidefinite. For regular models [16], the FIM is positive definite:
P I(q) 0, where A B if and only if matrix A B is positive-definite.
The FIM is invariant by reparameterization of the sample space X , and covariant by
reparameterization of the parameter space Q, see [16]. That is, let p¯(x; h) = p(q(h); x). Then we have:
¯ # " #
I (h) = "¶hij > I(q(h)) ¶hij ij . (86)
¶q ¶q
ij
h i
Matrix Jij = ¶qi is the Jacobian matrix.
¶hj ij
Let us give illustrate the covariance of the Fisher information matrix with the following example:
4 0 l22 5 0 s2
2 0 1 0
(88)
and
Il0 l0 = " 1
0 1 2
# =" 1 1 4 #
l 2
0
2 0 s 0 (89)
2l2 2s
Since the FIM is covariant, we have the following the change of transformation:
0 > 0
Il 0 l = Jl ,l0 Il l l Jl,l0 , (90)
with J0
l ,l = " 0 2s
#
1 0 (91)
As a corollary, notice that we can recognize the Euclidean metric in any other coordinate
>
system if the metric tensor g can be written J l ,l0 Jl,l0 . For example, the Riemannian geometry
induced by a dually flat space with a separable potential function is Euclidean [50].
Entropy 2020, 22, 1100 22 of 61
In statistics, the FIM plays a role in the attainable precision of unbiased estimators. For any
unbiased estimator, the Cramér–Rao lower bound [51] on the variance of the estimator is:
ˆ 1 1
I
Varq [qn(X)] n P (q ). (93)
Figure 9 illustrates the Cramér–Rao lower bound (CRLB) for the univariate distributions:
At regular grid locations (m, s) of the upper space of normal parameters, we repeat 200 runs (trials)
\
of estimating the normal parameters ( ,) using the MLE on 100 iid samples x , . . . , x N( , ).
Figure 9. Visualizing the Cramér–Rao lower bound: the red ellipses display the Fisher information
2
matrix of normal distributions N(m, s ) at grid locations. The black ellipses are sample covariance
matrices centered at the sample means calculated by repeating 200 runs of sampling 100 iid
variates for the normal parameters of the grid.
We report the expression of the FIM for two important generic parametric family of probability
distributions: (1) an exponential family (with its prominent multivariate normal family), and (2) a mixture
family.
Entropy 2020, 22, 1100 23 of 61
Example 2 (FIM of an exponential family E). An exponential family [52] E is defined for a sufficient statistic
vector t(x) = (t1(x), . . . , tD(x)), and an auxiliary carrier measure k(x) by the following canonical density:
E ( i=1 ! 2 )
D
where F is the strictly convex cumulant function (also called log-normalizer, and log partition function or
free energy in statistical mechanics). Exponential families include the Gaussian family, the Gamma and
Beta families, the probability simplex D, etc. The FIM of an exponential family is given by:
2 2 1
E I(q) = CovX pq (x)[t(x)] = r F(q) = (r F (h)) 0. (95)
2 2 2
Indeed, under mild conditions [2], we have I(q) = Epq [r log pq (x)]. Since r log pq (x) = r F(q), it follows that
2
E I(q) = r F(q). Natural parameters beyond vector types can also be used in the canonical decomposition of
the density of an exponential family. For example, we may use a matrix type for defining the zero-centered
multivariate Gaussian family or the Wishart family, a complex numbers for defining the complex-valued
D
Gaussian distribution family, etc. We then replace the term å i =1 ti(x)qi in Equation (94) by an inner product
defined for the natural parameter type (e.g., dot product for vectors, matrix product trace for matrices, etc.).
Furthermore, natural parameters can be of compound types: For example, the multivariate Gaussian
distribution can be written using q = (qv, qM) where qv is a vector part and qM a matrix part, see [52].
1 ij
Let S = [sij] denote the covariance matrix and S = [s ] the precision matrix of a multivariate normal
distribution. The Fisher information matrix of the multivariate Gaussian [53,54] N(m, S) is given by
m S = [sij]
s s + ss S = [skl ] (96)
ij
il jk ik jl
I(m, S) = s 0 m
0
Notice that the lower right block matrix is a 4D tensor of dimension d d d d. The zero subblock
matrices in the FIM indicate that the parameters m and S are orthogonal to each other. In particular,
11 1
when d = 1, since s = 2 , we recover the Fisher information matrix of the univariate Gaussian:
s
" #
1
s
2 0
I(m, S) = 1 (97)
0 4
2s
We refer to [55] for the FIM of a Gaussian distribution using other canonical parameterizations
(natural/expectation parameters of exponential family).
Example 3 (FIM of a mixture family M). A mixture family is defined for D + 1 functions F1, . . . , FD and
C as: ( i=1 2 )
D
M
where the functions fFi(x)gi are linearly independent on the common support X and satisfying Fi(x)dm(x) =
R
0. Function C is such that C(x)dm(x) = 1. Mixture families include statistical mixtures with prescribed
R
component distributions and the probability simplex D. The FIM of a mixture family is given by:
(x)F (x) F (x)F (x)
F
i j ij
2 Z
(pq (x)) = X
M I(q) = EX pq (x) pq (x) dm(x) 0. (99)
The family of Gaussian mixture model (GMM) with prescribed component distributions (i.e.,
convex weight combinations of D + 1 Gaussian densities) form a mixture family [56].
Entropy 2020, 22, 1100 24 of 61
Notice that the probability simplex of discrete distributions can be both modeled as an
exponential family or a mixture family [2].
The expected a-geometry is built from the expected dual a-connections. The Fisher
“information metric” tensor is built from the FIM as follows:
>
P g(u, v) := (u)q P I(q) (v)q (100)
The expected exponential connection and expected mixture connection are given by
m := Eq (¶i¶jl + ¶il¶jl)(¶kl) . (102)
e
P r := Eq (¶i¶jl)(¶kl) , (101)
Pr
m e
The dualistic structure is denoted by (P, P g, Pr, P r) with Amari–Chentsov cubic tensor
called the skewness tensor:
(104)
a +a ,
(P, P g, P r , P r ) a2R
with
a
G (q) := E ¶ ¶ l¶ l + 1 a C (q), (105)
2
a
P ij,k q i j k ijk
1 (106)
a a (107)
¯ = P r + P r = LC := LC ( g)
Pr 2 P r rP
The a-Riemann–Christoffel curvature tensor is:
(108)
R = ¶ Ga ¶G a + grs G a Ga Ga G a ,
P ijkl i jk,l j ik ,l ik ,r js,l jk,r is ,l
a a
with Rijkl = Rijlk . We check that the expected a-connections are coupled with the metric: ¶i gjk =
a a
Gij ,k + Gik, j.
In case of an exponential family E or a mixture family M equipped with the dual
exponential/mixture connection, we get dually flat manifolds (Bregman geometry).
Indeed, for the exponential/mixture families, it is easy to check that the Christoffel symbols of
e m
r and r vanish:
m e m
e G= G= G= G = 0. (109)
M M E E
3.10. Criteria for Statistical Invariance
So far we have explained how to build an information manifold (or information a-manifold) from
a pair of conjugate connections. Then we reported two ways to obtain such a pair of conjugate
connections: (1) from a parametric divergence, or (2) by using the predefined expected
exponential/mixture connections. We now ask the following question: which information manifold
makes sense in Statistics? We can refine the question as follows:
• Which metric tensors g make sense in statistics?
• Which affine connections r make sense in statistics?
• Which statistical divergences make sense in statistics (from which we can get the metric tensor and
dual connections)?
Entropy 2020, 22, 1100 25 of 61
By definition, an invariant metric tensor g shall preserve the inner product under important
0
statistical mappings called Markov embeddings. Informally, we embed D D into DD0 with D > D and
the induced metric should be preserved (see [2], page 62).
Theorem 8 (Uniqueness of Fisher information metric [57,58]). The Fisher information metric is the
unique invariant metric tensor under Markov embeddings up to a scaling constant.
A D-dimensional parameter (discrete) divergence satisfies the information monotonicity if and only if:
0 0
D(q ¯ : q ¯ ) D(q : q ) (110)
A A
E
for any coarse-grained partition A = fAigi =1 of [D] = f1, . . . , Dg (A-lumping [59]) with
ED, where qi ¯ = j
åj2Ai q for i 2 [E]. This concept of coarse-graining is illustrated in
A
Figure 10. This information monotonicity property could be renamed as the “distance coarse-binning
inequality property.”
p p p p p p p p p
1 2 3 4 5 6 7 8
coarse graining
p1 + p2 p3 + p4 + p5 p6 p7 + p8 pA
0 0
Figure 10. A divergence satisfies the property of information monotonicity iff D(qA¯ : q ¯ ) D(q : q ).
A
Here, parameter q represents a discrete distribution.
A separable divergence D(q1 : q2) is a divergence that can be expressed as the sum of elementary
scalar divergences d(x : y):
D(q1 : q2) := åd(q1i : q2j). (111)
i
i i2
For example, the squared Euclidean distance D(q1 : q2) = åi(q1 q2 ) is a separable divergence for the
2 i
scalar Euclidean divergence d(x : y) = (x y) . The Euclidean distance D (q , q )= (q qi )2
0
D
i=1
q q
0
If (q : q ) := å qi f f (1), f (1) = 0. (112)
0
The standard f -divergences are defined for f -generators satisfying f (1) = 0 (choose fl(u) := f
00
(u) + l(u 1) since Ifl = If ), and f (u) = 1 (scale fixed).
Statistical f -divergences are invariant [61] under one-to-one/sufficient statistic transformations
y = t(x) of sample space: p(x; q) = q(y(x); q):
0 0
If [p(x; q) : p(x; q )] = ZX p(x; q) f p(x; q ) dm(x),
p(x; q )
0
= ZY q(y; q) f q(y; q ) dm(y),
q(y; q )
0
= If [q(y; q) : q(y; q )].
0 0 0
If [p(x; q) : p(x; q )] = If [p(x; q ) : p(x; q)] = If [p(x; q) : p(x; q )] (113)
Entropy 2020, 22, 1100 26 of 61
+
4
1 a 1 a
2
Ia[p : q] := 1 a 1 Z p
2 (x )q 2 (x)dm(x) , (115)
4 1+a
obtained for f (u) = 1 2 (1 u 2 ). The a-divergences include:
a
KL [p : q] :=
Z
q(x) log p(x) dm(x) = KL[q : p], (117)
for f (u) = u log u.
– The symmetric squared Hellinger divergence:
Z 2
q q
2
H [p : q] := p(x) q(x) dm(x), (118)
p
for f (u) = ( u 1)2 (corresponding to a = 0)
– The Pearson and Neyman chi-squared divergences [62], etc.
• The Jensen–Shannon divergence:
1 2 ( ) 2()
px qx
JS[p : q] := 2 Z p(x) log p(x) + q(x) + q(x) log p(x) + q(x) dm(x), (119)
for f (u) = (u + 1) log 1+u + u log u.
2
• The Total Variation
1
TV[p : q] := 2 Z jp(x) q(x)j dm(x), (120)
1
for f (u) = 2 ju 1j. The total variation distance is the only metric f -divergence (up to a scaling
factor).
The f -topology is the topology generated by open f -balls, open balls with respect to f -divergences.
0 0
A topology T is said to be stronger than a topology T if T contains all the open sets of T . Csiszar’s
theorem [63] states that when jaj < 1, the a-topology is equivalent to the topology induced by the total
variation metric distance. Otherwise, the a-topology is stronger than the TV topology.
Let us state an important feature of f divergences:
Theorem 9. The f -divergences are invariant by diffeomorphisms m(x) of the sample space X : Let Y = m(X),
and Xi pi with Yi = m(Xi) qi. Then we have If [q1 : q2] = If [p1 : p2].
Entropy 2020, 22, 1100 27 of 61
Example 4. Consider the exponential distributions and the Rayleigh distributions which are related by:
p
X Exponential(l) , Y = m(X) = X Rayleigh s = p2l .
1
It follows that = D q p
DKL pl1 : pl2 KL 2l1 : q p2l1
1 1
=
log
2l2 + 2l2 2l1 l2
2l1 1 1
l
= log l2l + l
1 1.
1 2
A remarkable property is that invariant standard f -divergences yield the Fisher information
matrix and the a-connections. Indeed, the invariant standard f -divergences is related infinitesimally to the
Fisher metric as follows:
If [p(x; q) : p(x; q + dq)] = Z p(x; q) f
p
p(x; q) dm(x) (121)
(x; q + dq)
S 1
i j
= 2 F gij(q)dq dq . (122)
where g denotes the geodesic passing through g(0) = q1 and g(1) = q2. The Fisher–Rao distance can q
R1 >
also be defined as the shortest path length: Dr(p , p ) = infg g˙(t) g (t)g˙(t)dt.
q1 q2 0 g
Definition 6 (Fisher–Rao distance). The Fisher–Rao distance is the geodesic metric distance of the
Fisher–Riemannian manifold (P, P g).
• The Fisher–Riemannian manifold of the family of categorical distributions (also called finite
discrete distributions in [2]) amount to the spherical geometry [14] (spherical manifold).
• The Fisher–Riemannian manifold of the family of bivariate location-scale families amount to
hyperbolic geometry (hyperbolic manifold).
• The Fisher–Riemannian manifold of the family of location families amount to Euclidean
geometry (Euclidean manifold).
S
2
The first fundamental form of the Riemannian geometry is ds = hdx, dxi = gijdxidxj where ds
denotes the line element. Let us give an example of Fisher–Rao geometry for location-scale families:
Example 5. Consider the location-scale family induced by a symmetric probability density f (x) with respect to
R R R 2
0 such that X f (x)dm(x) = 1, X x f (x)dm(x) = 0 and X x f (x)dm(x) = 1 (with support X = R):
1 x q1
P= pq (x) = f , q = (q1, q2) 2 R R++ . (128)
q2 q2
The density f (x) is called the standard density of the location-scale family, and corresponds to the parameter
q0 = (0, 1): p(0,1)(x) = f (x). The parameter space Q = R R++ corresponds to the upper plane, and the
Fisher information matrix can be structurally calculated [67] as the following diagonal matrix:
I(q) = 0 ,
" 2 #
b
(129)
a2 0
with scalars:
(x ) 2
2 Z f
0
(130)
a := f (x) f (x)dm(x),
b2 :=
Z
f
x f
0
(x ) + 12
(x ) (131)
f (x)dm(x).
Entropy 2020, 22, 1100 29 of 61
0 0
By rescaling q = (q , q2) as q = (q 0 , q 0 ) with q = a q 0
and q2 = q2, we get the FIM with respect to
p
0 1 1 2 1 b 2 1
q expressed as: I(q0) = (q20)2
"
0 1 #
2
b 1 0
. (132)
We recognize that this metric is a constant time the metric of the Poincaré upper plane. Thus the Fisher–
Rao manifold of a location-scale family (with symmetric standard probability density f ) is isometric to the
1
planar hyperbolic space of negative curvature k = b 2 . In practice, the Klein non-conformal model of
hyperbolic geometry is often used to implement computational geometric algorithms [20].
3.12. The Monotone a-Embeddings and the Gauge Freedom of the Metric
Another common mathematically equivalent expression of the FIM [16] is given by:
Z
q q
ka(u) := 1 a u 2 , if a 6= 1, (134)
a
The function l (x; q) is called the a-likelihood function. Then the a-representation of the FIM,
the a-FIM for short, is expressed as:
a Z a a
Iij (q) := ¶il (x; q)¶jl (x; q)dm(x). (135)
a R a a
We can rewrite compactly the a-FIM, as Iij (q) = ¶il ¶jl dm(x). Expanding the a-FIM, we get:
Iij(q) =
(
1 ¶
a i
log p (x; q)¶ p(x;
j
q)dm(x) for a 6 1,1 . (136)
1
R 1 a 1+a
¶ p(x; q)
i
a 2 2 2
R 2f g
The 1-representation of the density is called the logarithmic representation (or e-
representation), the 1-representation the mixture representation (or m-representation), and its 0-
a
representation is called the square root representation. The set of a-scores vectors Ba := f¶il gi are
interpreted as the tangent basis vectors of the a-base Ba. Thus the FIM is a-independent.
Entropy 2020, 22, 1100 30 of 61
Furthermore, the a-representation of the FIM can be rewritten under mild conditions [16] as:
a Z
I (q ) = 2 1+a
a
p(x; q) 2
¶i¶jl (x; q)dm(x). (137)
ij 1+ a
Since we have:
a (138)
¶i¶jl (x; q) = 1 a 1 a
p 2 ¶i¶jl + 2 ¶il¶jl ,
it follows that:
a 2
I (q ) = 1 a (139)
ij 1+ a
Iij(q) + 2 Iij = Iij(q).
Notice that when a = 1, we recover the equivalent expression of the FIM (under mild conditions):
1 2
Iij (q) = E[r log p(x; q)]. (140)
In particular, when the family is an exponential family [52] with cumulant function F(q) (satisfying the
mild conditions), we have:
2
I(q) = r F(q). (141)
Zhang [71,72] further discussed the representation/reference biduality which was confounded
in the a-geometry.
Gauge freedom of the Riemannian metric tensor has been investigated under the framework of
(r, t)-monotone embeddings [71–73] in information geometry: let r and t be two strictly increasing
0
functions, and f a strictly convex function such that f (r(u)) = t(u) (with f denoting its convex conjugate).
Observe that the set of strictly increasing real-valued univariate functions has a group structure for the
group operation chosen as the functional composition . Let us write p q (x) = p(x; q).
r,t r,t
The (r, t)-metric tensor g(q) = [ gij(q)]ij can be derived from the (r, t)-divergence:
Z
Dr,t (p : q) = ( f (r(p(x))) + f (t(q(x))) r(p(x))t(q(x))) dn(x). (142)
We have:
r,t
gij(q) = Z (¶ir(pq (x))) ¶jt(pq (x)) dn(x), (143)
0 0
= Z r (pq (x))t (pq (x)) (¶i pq (x)) ¶j pq (x) dn(x), (144)
00
= Z f (r(pq (x))) (¶ir(pq (x))) ¶jr(pq (x)) dn(x), (145)
00
= Z ( f ) (t(pq (x))) (¶it(pq (x))) ¶jt(pq (x)) dn(x). (146)
3.13. Dually Flat Spaces and Canonical Bregman Divergences
We have described how to build a dually flat space from any strictly convex and smooth
2
generator F: A Hessian structure is built from F(q) with Riemannian Hessian metric r F(q), and the
convex conjugate F (h) (obtained by the Legendre–Fenchel duality) yields the dual Hessian
2
structure with Riemannian Hessian metric r F (h). The dual connections r and r are coupled with
the metric. The connections are defined by their respective Christoffel symbols G(q) = 0 and G (h) =
0, showing that they are flat connections.
Conversely, it can be proved [2] that given two dually flat connections r and r , we can
reconstruct two dual canonical strictly convex potential functions F(q) and F (h) such that h = rF(q)
and q = rF (h). The canonical divergence AF,F yields the dual Bregman divergences BF and BF .
Entropy 2020, 22, 1100 31 of 61
The only symmetric Bregman divergences are squared Mahalanobis distances M [40] with the Q
2
q
0 0 > 0
MQ(q, q ) = (q q) Q(q q). (147)
>
Let Q = LL be the Cholesky decomposition of a positive-definite matrix Q 0. It is well-known that
the Mahalanobis distance MQ amounts to the Euclidean distance on affinely transformed points:
2 0 >
MQ (q, q ) = Dq QDq, (148)
> >
= Dq LL Dq, (149)
2 > > 0 > > 0 2
= M I(L q, L q ) = kL q L q k , (150)
0
where Dq = q q.
2
The squared Mahalanobis distance MQ does not satisfy the triangle inequality, but the
Mahalanobis distance MQ is a metric distance. We can convert a Mahalanobis distance M Q1 into
another Mahalanobis distance MQ2 , and vice versa, as follows:
>
Proof. Let us write matrix Q = L L 0 using the Cholesky decomposition. Then we have
2 1 >
We have MQ (q1, q2) = BF(q1, q2) (Bregman divergence) with F(q) = 2 q Qq for a positive-
1 > 1 1 1
definite matrix Q 0. The convex conjugate F (h) = 2 h Q h (with Q 0). We have h = Q q and
2 2
h = Qq. We have the following identity between the dual Mahalanobis divergences M Q and MQ 1 :
2 2
MQ (q1, q2) = MQ 1 (h1, h2). (154)
When the Bregman generator is based on an integral, i.e., the log-normalizer F(q)
R
=
R
log ( exp(ht(x), qidm(x)) for exponential families E, or the negative Shannon entropy F(q) =
mq (x) log m(h)dm(x) for mixture families M, the associated Bregman divergences B F,E or BF,M can be
relaxed and interpreted as a statistical distance. We explain how to obtain the reconstruction below:
Z
F (h) = h(pq ) = p(x) log p(x)dm(x), (157)
where h(p) = p(x) log p(x)dm(x) denotes Shannon’s entropy.
R >
Let l(i) denotes the i-th coordinates of vector l, and let us calculate the inner product q1 h2 =
åi q1(i)h2(i) of the Legendre–Fenchel divergence. We have h2(i) = Epq [ti(x)]. Using the linear
2
property of the expectation E[ ], we find that åi q1(i)h2(i) = Epq2 [åi q1(i)ti(x)]. Moreover, we
have åi q1(i)ti(x) = log pq1 (x) + F(q1). Thus we have:
q >h = E
1 2 pq2 log pq1 + F(q1) = F(q1) + Epq2 log pq1 . (158)
It follows that we get
B [p : p ] >
F,E q1 q2 = F(q1) + F (h2) q 1 h 2, (159)
E [log p ]
= F(q1) h(pq2 ) pq2 q1 F(q1), (160)
p
q2
Ep p :p
= q2 log q1 =: DKL [pq1 q2 ]. (161)
By relaxing the exponential family densities p q1 and pq2 to be arbitrary densities p1 and p2, we
obtain the reverse KL divergence between p1 and p2 from the dually flat structure induced by
the integral-based log-normalizer of an exponential family:
p2 p 2 (x )
E Z
DKL [p1 : p2] = p2 log p1 = p2(x) log p1 (x) dm(x), (162)
= DKL[p2 : p1]. (163)
Thus we have recovered the reverse Kullback–Leibler divergence DKL from BF,E .
The dual divergence D [p1 : p2] := D[p2 : p1] is obtained by swapping the distribution parameter
orders. We have:
:p ]= E
DKL [p1 : p2] := DKL [p2 1 p1 log p2 =: DKL[p1 : p2], (164)
p
1
Z
F(q) = FM(mq ) = h(mq ) = mq (x) log mq (x)dm(x). (166)
We have
Z
hi = [rF(q)]i = (pi(x) p0(x)) log mq (x)dm(x), (167)
and the dual convex potential function is
Z
åi q1(i) (pi(x) p0(x)) log mq2 (x)dm(x) = Z åi q1(i)pi(x) log mq2 (x)dm(x)
åq1(i)p0(x) log mq2 (x)dm(x). (169)
i
That is
> Z
å å
q1 h2 = i
q1(i)pi log mq2 dm i q1(i)p0 log mq2 dm. (170)
Thus it follows that we have the following statistical distance:
B [m : m ] := >
F,M q1 q2 F(q1) + F (h2) q1 h2, (171)
Z q (i))p (x) +
= h(mq1 ) ((1 åi 1 0 åi q1(i)pi(x)) log mq2 (x)dm(x),(173)
Example 6. Consider the family $\mathcal{P}$ of Poisson distributions with rate parameter $\lambda > 0$:
$$p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x \in \{0, 1, 2, \ldots\}.$$
This family is a univariate discrete exponential family of order one (i.e., $d = 1$ and $D = 1$) with the following canonical decomposition of its probability mass function $p_\lambda(x)$:
• Base measure: $\nu(x) = \frac{\mu(x)}{x!} = e^{-k(x)}\mu(x)$, where $\mu$ is the counting measure and $k(x) = \log(x!)$ represents an auxiliary carrier term for defining the base measure $\nu$,
• Sufficient statistic: $t(x) = x$,
• Natural parameter: $\theta = \theta(\lambda) = \log(\lambda) \in \Theta = \mathbb{R}$,
• Log-normalizer: $F(\theta) = \exp(\theta)$ since $F(\theta(\lambda)) = \lambda$.
Thus we can rewrite the Poisson family as the following Discrete Exponential Family (DEF): $p_\lambda(x) = \exp\left(t(x)\theta - F(\theta) - k(x)\right)$. The Kullback–Leibler divergence between two Poisson distributions is
$$D_{\mathrm{KL}}\left[p_{\lambda_1} : p_{\lambda_2}\right] = B_F(\theta(\lambda_2) : \theta(\lambda_1)), \tag{179}$$
$$= \lambda_1\log\frac{\lambda_1}{\lambda_2} + \lambda_2 - \lambda_1. \tag{180}$$
We recognize the expression of the univariate Kullback–Leibler divergence extended to the positive scalars.
We have $\eta = F'(\theta) = \lambda$ and $I_\eta(\eta) = (F^*)''(\eta) = \frac{1}{\eta}$, where $F^*(\eta) = \eta\log\eta - \eta$ is the convex conjugate of $F(\theta)$. Since $\eta = \lambda$, we deduce that the Fisher information is $I_\lambda(\lambda) = \frac{1}{\lambda}$. Notice that $I_\theta(\theta) = F''(\theta) = \exp(\theta) = \lambda$. Thus we check the Crouzeix identity: $F''(\theta)(F^*)''(\eta) = \lambda\,\frac{1}{\lambda} = 1$. Beware that although $I_\theta(\theta) = \lambda$, this is not the FIM $I_\lambda$. Using the covariance equation of the FIM of Equation (86), we have:
$$I_\lambda(\lambda) = \frac{\mathrm{d}\theta}{\mathrm{d}\lambda}\,I_\theta(\theta(\lambda))\,\frac{\mathrm{d}\theta}{\mathrm{d}\lambda}, \tag{181}$$
$$= \frac{1}{\lambda}\exp(\log(\lambda))\frac{1}{\lambda} = \frac{1}{\lambda}. \tag{182}$$
The Fisher–Rao distance [78] between two Poisson distributions $p_{\lambda_1}$ and $p_{\lambda_2}$ is:
$$D_\rho(\lambda_1, \lambda_2) = 2\left|\sqrt{\lambda_1} - \sqrt{\lambda_2}\right|. \tag{183}$$
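A small sketch illustrating Equations (179), (180) and (183) for two arbitrary rates $\lambda_1, \lambda_2$ (an added example; the closed-form KL and its Bregman expression coincide):

```python
import numpy as np

def kl_poisson(lam1, lam2):
    """KL(p_lam1 : p_lam2) = lam1 log(lam1/lam2) + lam2 - lam1 (Eq. 180)."""
    return lam1 * np.log(lam1 / lam2) + lam2 - lam1

def kl_poisson_bregman(lam1, lam2):
    """Same KL as the Bregman divergence B_F(theta2 : theta1) for the
    log-normalizer F(theta) = exp(theta) with theta = log(lambda) (Eq. 179)."""
    t1, t2 = np.log(lam1), np.log(lam2)
    F, gradF = np.exp, np.exp
    return F(t2) - F(t1) - (t2 - t1) * gradF(t1)

def fisher_rao_poisson(lam1, lam2):
    """Fisher-Rao distance 2 |sqrt(lam1) - sqrt(lam2)| (Eq. 183)."""
    return 2.0 * abs(np.sqrt(lam1) - np.sqrt(lam2))

lam1, lam2 = 3.0, 5.5
print(kl_poisson(lam1, lam2), kl_poisson_bregman(lam1, lam2))  # identical values
print(fisher_rao_poisson(lam1, lam2))
```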
In general, it is easy to get the Fisher–Rao distance of uni-order families (order $D = 1$) because both the length elements and the geodesics are available in closed form.
The following example demonstrates the computational intractability of the Fisher–Rao distance.
Example 7. Consider the parametric family of Gamma distributions [79] with probability density:
$$p_{a,b}(x) = \frac{b^a x^{a-1}\exp(-bx)}{\Gamma(a)}, \tag{184}$$
for shape parameter $a > 0$, rate parameter $b > 0$ and support $\mathcal{X} = (0, \infty)$. Function $\Gamma(z) = \int_0^\infty x^{z-1}e^{-x}\,\mathrm{d}x$ is the Gamma function defined for $z > 0$, and satisfying $\Gamma(n) = (n-1)!$ for integers $n$.
The Gamma distributions $\{p_{a,b}\}$, $a, b > 0$, form a univariate exponential family of order 2 (i.e., $d = 1$ and $D = 2$) with the following canonical decomposition:
• Natural parameters: $\theta = (\theta_1, \theta_2)$ with $\theta(\lambda) = (-b, a - 1)$ and source parameter $\lambda = (a, b)$,
• Sufficient statistics: $t(x) = (x, \log(x))$,
• Log-normalizer: $F(\theta) = -(\theta_2 + 1)\log(-\theta_1) + \log\Gamma(\theta_2 + 1)$,
• Dual parameterization: $\eta = (\eta_1, \eta_2) = E_{p_\theta}[t(x)] = \nabla F(\theta) = \left(-\frac{\theta_2 + 1}{\theta_1},\; -\log(-\theta_1) + \psi(\theta_2 + 1)\right)$, where $\psi(x) = \frac{\mathrm{d}}{\mathrm{d}x}\ln(\Gamma(x)) = \frac{\Gamma'(x)}{\Gamma(x)}$ denotes the digamma function.
It follows that the Kullback–Leibler divergence between two Gamma distributions $p_{\lambda_1}$ and $p_{\lambda_2}$ with respective source parameters $\lambda_1 = (a_1, b_1)$ and $\lambda_2 = (a_2, b_2)$ is:
$$D_{\mathrm{KL}}\left[p_{a_1,b_1} : p_{a_2,b_2}\right] = B_F(\theta(\lambda_2) : \theta(\lambda_1)), \tag{185}$$
$$= (a_1 - a_2)\,\psi(a_1) - \log\Gamma(a_1) + \log\Gamma(a_2) \tag{186}$$
$$\quad + a_2(\log b_1 - \log b_2) + a_1\frac{b_2 - b_1}{b_1}. \tag{187}$$
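A hedged numerical check (an added example with arbitrary source parameters) that the closed form of Equations (186) and (187) agrees with the Bregman divergence $B_F(\theta(\lambda_2) : \theta(\lambda_1))$ computed from the Gamma log-normalizer:

```python
import numpy as np
from scipy.special import gammaln, digamma

def F(theta):
    """Gamma log-normalizer F(theta) = -(theta2+1) log(-theta1) + log Gamma(theta2+1)."""
    t1, t2 = theta
    return -(t2 + 1.0) * np.log(-t1) + gammaln(t2 + 1.0)

def gradF(theta):
    t1, t2 = theta
    return np.array([-(t2 + 1.0) / t1, -np.log(-t1) + digamma(t2 + 1.0)])

def theta_of(a, b):
    """Natural parameters theta = (-b, a - 1)."""
    return np.array([-b, a - 1.0])

def kl_gamma_closed_form(a1, b1, a2, b2):
    return ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
            + a2 * (np.log(b1) - np.log(b2)) + a1 * (b2 - b1) / b1)

def kl_gamma_bregman(a1, b1, a2, b2):
    t1, t2 = theta_of(a1, b1), theta_of(a2, b2)
    return F(t2) - F(t1) - (t2 - t1) @ gradF(t1)   # B_F(theta(lam2) : theta(lam1))

a1, b1, a2, b2 = 2.0, 1.5, 3.0, 0.7
print(kl_gamma_closed_form(a1, b1, a2, b2), kl_gamma_bregman(a1, b1, a2, b2))
```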
The Fisher information matrix is $I_\theta(\theta) = \nabla^2 F(\theta)$. It can be expressed using the $\lambda$-parameterization [80] as:
$$I_\lambda(a, b) = \begin{bmatrix} \psi_1(a) & -\frac{1}{b} \\ -\frac{1}{b} & \frac{a}{b^2} \end{bmatrix}, \tag{188}$$
where $\psi_1$ denotes the trigamma function. In a suitable reparameterization $(\delta_1, \delta_2)$ the FIM becomes diagonal (Equations (189)–(190)): the parameters $\delta_1$ and $\delta_2$ are then uncorrelated and orthogonal since the FIM is diagonal.
The numerical evaluation of the Fisher–Rao distance between two gamma distributions has been studied in [81]. Let $\omega = \log\frac{a}{b}$. The length element is shown to be, in this $(a, \omega)$ parameterization:
$$\mathrm{d}s^2 = \left(\psi_1(a) - \frac{1}{a}\right)(\mathrm{d}a)^2 + a\,(\mathrm{d}\omega)^2. \tag{191}$$
However, no closed-form expression is known for the Fisher–Rao distance between two gamma distributions because of the intractability of the geodesic equations on the gamma Fisher–Rao manifold [81]. This example highlights the fact that computing the Fisher–Rao distance even for simple families of distributions can be challenging. In fact, we do not know the Fisher–Rao distance between any two multivariate Gaussian distributions [82] (except in a few cases, including the univariate case).
In general, dually flat spaces can be built from any strictly convex $C^3$ generator $F$. Vinberg and Koszul [48] showed how to obtain such a convex generator for homogeneous cones. A cone $C$ in a vector space $V$ yields a dual cone $C^*$ of positive linear functionals in the dual vector space $V^*$:
$$C^* := \{\omega \in V^* \;:\; \forall v \in C,\; \omega(v) \geq 0\}. \tag{192}$$
The characteristic function of the cone is
$$\chi_C(\theta) := \int_{C^*}\exp(-\omega(\theta))\,\mathrm{d}\omega > 0, \tag{193}$$
and the function $\log\chi_C(\theta)$ defines a Bregman generator which induces a Hessian structure and a dually flat space.
Figure 11 displays the main types of information manifolds encountered in information geometry with their relationships. Information-geometric structures have found applications in many areas, among which:
• Pattern recognition [83] and machine learning: Restricted Boltzmann machines [2] (RBMs),
neuromanifolds [84] and natural gradient [85],
• Signal processing: Principal Component Analysis (PCA), Independent Component Analysis
(ICA), Non-negative Matrix Factorization (NMF),
• Mathematical programming: Barrier function of interior point methods,
• Game theory: Score functions.
Next, we shall describe a few applications, starting with the celebrated natural gradient descent.
Figure 11. Overview of the main types of information manifolds with their relationships in information geometry.
Thus in general, the two gradient descent location sequences $\{\theta_t\}_t$ and $\{\eta_t\}_t$ (initialized at $\theta_0 = \theta(\eta_0)$ and $\eta_0 = \eta(\theta_0)$) are different (because usually $\eta(\theta) \neq \theta$), and the two GDs may potentially reach different stationary points. In other words, the GD local optimization depends on the choice of the parameterization of the function $L$ (i.e., $L_\theta$ or $L_\eta$). For example, minimizing with the gradient descent a temperature function $L_\theta(\theta)$ with respect to Celsius degrees $\theta$ may yield a different result than minimizing the same temperature function $L_\eta(\eta) = L_\theta(\theta(\eta))$ expressed with respect to Fahrenheit degrees $\eta$. That is, the GD optimization is extrinsic since it depends on the choice of the parameterization of the function, and does not take into account the underlying geometry of the parameter space $\Theta$.
The natural gradient precisely addresses this problem and solves it by choosing intrinsically the
steepest direction with respect to a Riemannian metric tensor field on the parameter manifold. We
shall explain the natural gradient descent method and highlight its connections with the Riemannian
gradient descent, the mirror descent and even the ordinary gradient descent when the parameter
space is dually flat.
4.1.2. Natural Gradient and Its Connection with the Riemannian Gradient
Let $(M, g)$ be a $D$-dimensional Riemannian space [10] equipped with a metric tensor $g$, and $L \in C^\infty(M)$ a smooth function to minimize on the manifold $M$. The Riemannian gradient [89] uses the Riemannian exponential map $\exp_p : T_p M \rightarrow M$ to update the sequence of points $p_t$ on the manifold as follows:
$$\mathrm{RG}: \quad p_{t+1} = \exp_{p_t}\left(-\alpha_t\nabla_M L(p_t)\right), \tag{197}$$
where the Riemannian gradient $\nabla_M$ is defined according to a directional derivative $\nabla_v$ by:
$$\nabla_M L(p) := \nabla_v L\left(\exp_p(v)\right)\Big|_{v=0}, \tag{198}$$
with
$$\nabla_v L(p) := \lim_{h \to 0}\frac{L(p + hv) - L(p)}{h}. \tag{199}$$
However, the Riemannian exponential mapping $\exp_p(\cdot)$ is often computationally intractable since it requires solving a system of second-order differential equations [10,22]. Thus instead of using $\exp_p$, we shall rather use a computable Euclidean retraction $R : T_p \rightarrow \mathbb{R}^D$ of the exponential map expressed in a local $\theta$-coordinate system. The natural gradient
$$\nabla^{\mathrm{NG}} L_\theta(\theta) := g_\theta^{-1}(\theta)\,\nabla_\theta L_\theta(\theta) \tag{202}$$
encodes the Riemannian steepest descent vector, and the natural gradient descent method yields the following update rule:
$$\mathrm{NG}: \quad \theta_{t+1} = \theta_t - \alpha_t\,\nabla^{\mathrm{NG}} L_\theta(\theta_t). \tag{203}$$
Notice that the natural gradient is a contravariant vector while the ordinary gradient is a covariant vector. Recall that a covariant vector $[v_i]$ is transformed into a contravariant vector $[v^i]$ by $v^i = \sum_j g^{ij} v_j$, that is, by using the dual Riemannian metric $g_\eta(\eta) = g_\theta(\theta)^{-1}$. The natural gradient is invariant under
an invertible smooth change of parameterization. However, the natural gradient descent does not guarantee that the locations $\theta_t$ always stay on the manifold: indeed, it may happen that for some $t$, $\theta_t \notin \Theta$ when $\Theta \neq \mathbb{R}^D$.
Property 5 ([89]). The natural gradient descent approximates the intrinsic Riemannian gradient
descent using a contravariant gradient vector induced by the Riemannian metric tensor g. The
natural gradient is invariant to coordinate transformations.
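As a minimal illustration (an added sketch, assuming the well-known diagonal FIM $\mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$ in the $(\mu, \sigma^2)$ parameterization and an arbitrary synthetic dataset), the natural gradient update of Equation (203) can be implemented as follows for fitting a univariate Gaussian by maximum likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=3.0, size=10_000)

def grad_nll(mu, var, x):
    """Ordinary (Euclidean) gradient of the average negative log-likelihood."""
    g_mu = -(np.mean(x) - mu) / var
    g_var = 0.5 / var - np.mean((x - mu) ** 2) / (2.0 * var ** 2)
    return np.array([g_mu, g_var])

def fim(mu, var):
    """Fisher information matrix in the (mu, sigma^2) parameterization."""
    return np.diag([1.0 / var, 1.0 / (2.0 * var ** 2)])

mu, var, lr = 0.0, 1.0, 0.5
for _ in range(200):
    nat_grad = np.linalg.solve(fim(mu, var), grad_nll(mu, var, data))
    mu, var = np.array([mu, var]) - lr * nat_grad
    var = max(var, 1e-6)          # crude guard: the iterate may leave the manifold
print(mu, np.sqrt(var))           # close to the generating values (2.0, 3.0)
```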
Next, we shall explain how the natural gradient descent is related to the mirror descent and the
ordinary gradient when the Riemannian space Q is dually flat.
4.1.3. Natural Gradient in Dually Flat Spaces: Connections to Bregman Mirror Descent
and Ordinary Gradient
Recall that a dually flat space $(M, g, \nabla, \nabla^*)$ is a manifold $M$ equipped with a pair $(\nabla, \nabla^*)$ of dual torsion-free flat connections which are coupled to the Riemannian metric tensor $g$ [2,90] in the sense that $\frac{\nabla + \nabla^*}{2} = \nabla^{\mathrm{LC}}$, where $\nabla^{\mathrm{LC}}$ denotes the unique metric torsion-free Levi–Civita connection.
On a dually flat space, there exists a pair of dual global Hessian structures [48] with dual canonical Bregman divergences [2,91]. The dual Riemannian metrics can be expressed as the Hessians of dual convex potential functions $F$ and $F^*$. Examples of Hessian manifolds are the manifolds of exponential families or the manifolds of mixture families [92]. On a dually flat space induced by a strictly convex and $C^3$ function $F$ (Bregman generator), we have two dual global coordinate systems: $\theta(\eta) = \nabla F^*(\eta)$ and $\eta(\theta) = \nabla F(\theta)$, where $F^*$ denotes the Legendre–Fenchel convex conjugate function [51,93]. The Hessian metric expressed in the primal $\theta$-coordinate system is $g_\theta(\theta) = \nabla^2 F(\theta)$, and the dual Hessian metric expressed in the dual coordinate system is $g_\eta(\eta) = \nabla^2 F^*(\eta)$. Crouzeix's identity [36] shows that $g_\theta(\theta)\,g_\eta(\eta) = I$, where $I$ denotes the $D \times D$ identity matrix.
The ordinary gradient descent method can be extended using a proximity function $\Phi(\cdot,\cdot)$ as follows:
$$\mathrm{PGD}: \quad \theta_{t+1} = \arg\min_{\theta \in \Theta}\left\{\langle\theta, \nabla L_\theta(\theta_t)\rangle + \frac{1}{\alpha_t}\Phi(\theta, \theta_t)\right\}. \tag{204}$$
When $\Phi(\theta, \theta_t) = \frac{1}{2}\|\theta - \theta_t\|^2$, the PGD update rule becomes the ordinary GD update rule.
Consider a Bregman divergence [91] $B_F$ for the proximity function $\Phi$: $\Phi(p, q) = B_F(p : q)$. Then the PGD yields the following mirror descent (MD):
$$\mathrm{MD}: \quad \theta_{t+1} = \arg\min_{\theta \in \Theta}\left\{\langle\theta, \nabla L_\theta(\theta_t)\rangle + \frac{1}{\alpha_t}B_F(\theta : \theta_t)\right\}. \tag{205}$$
Property 6 ([94]). Bregman mirror descent on the Hessian manifold $(M, g = \nabla^2 F(\theta))$ is equivalent to natural gradient descent on the dual Hessian manifold $(M, g^* = \nabla^2 F^*(\eta))$, where $F$ is a Bregman generator, $\eta = \nabla F(\theta)$ and $\theta = \nabla F^*(\eta)$.
Indeed, rewriting the mirror descent update rule in the dual coordinate system yields a natural gradient update rule (Property 6).

Property 7 ([96]). In a dually flat space induced by a convex potential function $F$, the natural gradient amounts to the ordinary gradient on the dually parameterized function: $\nabla^{\mathrm{NG}} L_\theta(\theta) = \nabla_\eta L_\eta(\eta)$, where $\eta = \nabla_\theta F(\theta)$ and $L_\eta(\eta) = L_\theta(\theta(\eta))$.

Proof. Let $(M, g, \nabla, \nabla^*)$ be a dually flat space. We have $g_\theta(\theta) = \nabla^2 F(\theta) = \nabla_\theta\nabla_\theta F(\theta) = \nabla_\theta\eta$ since $\eta = \nabla_\theta F(\theta)$. The function to minimize can be written either as $L_\theta(\theta) = L_\eta(\eta(\theta))$ or as $L_\eta(\eta) = L_\theta(\theta(\eta))$. Recall the chain rule of differentiation: $\nabla_\theta L_\theta(\theta) = (\nabla_\theta\eta)\,\nabla_\eta L_\eta(\eta(\theta))$. It follows that
$$\nabla^{\mathrm{NG}} L_\theta(\theta) := g_\theta^{-1}(\theta)\,\nabla_\theta L_\theta(\theta), \tag{209}$$
$$= (\nabla_\theta\eta)^{-1}(\nabla_\theta\eta)\,\nabla_\eta L_\eta(\eta), \tag{210}$$
$$= \nabla_\eta L_\eta(\eta). \tag{211}$$
It follows that the natural gradient descent on a loss function $L_\theta(\theta)$ amounts to an ordinary gradient descent on the dually parameterized loss function $L_\eta(\eta) := L_\theta(\theta(\eta))$. In short, $\nabla_\theta^{\mathrm{NG}} L_\theta = \nabla_\eta L_\eta$.
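A quick numerical illustration of Property 7 (an added sketch with an assumed separable generator $F(\theta) = \sum_i \exp(\theta_i)$ and an arbitrary smooth loss): the natural gradient in the $\theta$-coordinates coincides with the ordinary gradient of the dually parameterized loss:

```python
import numpy as np

def L_theta(theta):                 # an arbitrary smooth loss in theta-coordinates
    return np.sum(np.sin(theta) + 0.5 * theta ** 2)

def grad(f, x, eps=1e-6):           # central finite differences
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

theta = np.array([0.3, -1.2, 0.7])
eta = np.exp(theta)                          # eta = grad F(theta)
g_theta = np.diag(np.exp(theta))             # Hessian of F = metric g_theta(theta)

natural_grad = np.linalg.solve(g_theta, grad(L_theta, theta))
L_eta = lambda e: L_theta(np.log(e))         # L_eta(eta) := L_theta(theta(eta))
dual_grad = grad(L_eta, eta)
print(natural_grad, dual_grad)               # the two vectors coincide
```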
In practice, the FIM may be approximated from a sample $x_1, \ldots, x_n$ by the empirical outer product of the scores:
$$\tilde{I}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\nabla_\lambda l_\lambda(x_i)\left(\nabla_\lambda l_\lambda(x_i)\right)^\top \approx I(\lambda), \tag{214}$$
where $l_\lambda(x) := \log p_\lambda(x)$ denotes the log-likelihood function. Notice that the approximated FIM may potentially be degenerate and may not respect the structure of the true FIM. For example, for the univariate normal we have $\nabla_\mu l(x; \mu, \sigma^2) = \frac{x - \mu}{\sigma^2}$ and $\nabla_{\sigma^2} l(x; \mu, \sigma^2) = \frac{(x - \mu)^2 - \sigma^2}{2\sigma^4}$. The non-diagonal elements of the approximate FIM $\tilde{I}(\lambda)$ are close to zero but usually non-zero, although the expected FIM is diagonal: $I(\mu, \sigma^2) = \mathrm{diag}\left(\frac{1}{\sigma^2}, \frac{1}{2\sigma^4}\right)$. Thus we may estimate the FIM until the non-diagonal elements have absolute values less than a prescribed $\epsilon > 0$. For multivariate normals, we have $\nabla_\mu l(x; \mu, \Sigma) = \Sigma^{-1}(x - \mu)$ and $\nabla_\Sigma l(x; \mu, \Sigma) = \frac{1}{2}\left(\nabla_\mu l(x; \mu, \Sigma)\nabla_\mu l(x; \mu, \Sigma)^\top - \Sigma^{-1}\right)$.
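A small sketch of the empirical FIM estimator of Equation (214) for the univariate normal (an added example using the score expressions above; the sample size $n$ is arbitrary), showing that the off-diagonal entry is close to, but not exactly, zero:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, var, n = 1.0, 4.0, 100_000
x = rng.normal(mu, np.sqrt(var), size=n)

# Scores (gradients of the log-likelihood) for the (mu, sigma^2) parameterization.
scores = np.stack([(x - mu) / var,
                   ((x - mu) ** 2 - var) / (2.0 * var ** 2)], axis=1)
fim_mc = scores.T @ scores / n                       # (1/n) sum grad l grad l^T
fim_expected = np.diag([1.0 / var, 1.0 / (2.0 * var ** 2)])
print(fim_mc)
print(fim_expected)        # the off-diagonal of fim_mc shrinks as n grows
```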
4.3. Hypothesis Testing in the Dually Flat Exponential Family Manifold $(\mathcal{E}, \mathrm{KL}^*)$
Given two probability distributions $P_0 \sim p_0(x)$ and $P_1 \sim p_1(x)$, we ask to classify a set of i.i.d. observations $X_{1:n} = \{x_1, \ldots, x_n\}$ as either sampled from $P_0$ or from $P_1$. This is a statistical decision problem [100]. For example, $P_0$ can represent the signal distribution and $P_1$ the noise distribution.
Figure 12 displays the probability distributions and the unavoidable error that is made by any
statistical decision rule (on observations x1 and x2).
Figure 12. Statistical Bayesian hypothesis testing: the best Maximum A Posteriori (MAP) rule assigns an observation to the class that yields the maximum likelihood.
Assume that both distributions $P_0 \sim P_{\theta_0}$ and $P_1 \sim P_{\theta_1}$ belong to the same exponential family $\mathcal{E} = \{P_\theta : \theta \in \Theta\}$, and consider the exponential family manifold with the dually flat structure $(\mathcal{E}, {}^{\mathcal{E}}g, {}^{\mathcal{E}}\nabla^e, {}^{\mathcal{E}}\nabla^m)$. That is, the manifold equipped with the Fisher information metric tensor field, the expected exponential connection and the conjugate expected mixture connection. More generally, the expected $\alpha$-geometry of an exponential family $\mathcal{E}$ with cumulant function $F$ is given by:
$$g_{ij}(\theta) = \partial_i\partial_j F(\theta), \tag{215}$$
$$\Gamma^\alpha_{ij,k} = \frac{1 - \alpha}{2}\,\partial_i\partial_j\partial_k F(\theta). \tag{216}$$
When $\alpha = 1$, $\Gamma^1_{ij,k} = 0$ and $\nabla^1$ is flat, and so is $\nabla^{-1}$ by the fundamental theorem of information geometry.
The $\pm 1$-structure can also be derived from a divergence manifold structure by choosing the reverse Kullback–Leibler divergence $\mathrm{KL}^*$:
$$(\mathcal{E}, {}^{\mathcal{E}}g, {}^{\mathcal{E}}\nabla^e, {}^{\mathcal{E}}\nabla^m) \equiv (\mathcal{E}, \mathrm{KL}^*). \tag{217}$$
Therefore, the Kullback–Leibler divergence $\mathrm{KL}[P_\theta : P_{\theta'}]$ amounts to a Bregman divergence (for the cumulant function of the exponential family):
$$\mathrm{KL}^*[P_{\theta'} : P_\theta] = \mathrm{KL}[P_\theta : P_{\theta'}] = B_F(\theta' : \theta). \tag{218}$$
The best error exponent $\alpha^*$ of the best Maximum A Posteriori (MAP) decision rule is found by minimizing the Bhattacharyya distance to get the Chernoff information [101]:
$$C[P_1, P_2] = -\log\min_{\alpha \in (0,1)}\int_{x \in \mathcal{X}} p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,\mathrm{d}\mu(x) \geq 0. \tag{219}$$
On the exponential family manifold $\mathcal{E}$, the Bhattacharyya distance
$$B_\alpha[p_1 : p_2] = -\log\int_{x \in \mathcal{X}} p_1^\alpha(x)\,p_2^{1-\alpha}(x)\,\mathrm{d}\mu(x) \tag{220}$$
amounts to a skew Jensen parameter divergence [102] (also called Burbea–Rao divergence):
$$J_F^\alpha(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1 - \alpha)F(\theta_2) - F(\alpha\theta_1 + (1 - \alpha)\theta_2). \tag{221}$$
It can be shown that the Chernoff information [100,103,104] (obtained at the optimal skewing $\alpha^*$) is equivalent to a Bregman divergence: namely, the Bregman divergence for exponential families evaluated at the optimal exponent value $\alpha^*$.
Theorem 10 (Chernoff information [100]). The Chernoff information between two distributions belonging to the same exponential family amounts to a Bregman divergence:
$$C[P_{\theta_1} : P_{\theta_2}] = B_F\left(\theta_1 : \theta_{12}^{\alpha^*}\right) = B_F\left(\theta_2 : \theta_{12}^{\alpha^*}\right), \tag{222}$$
where $\theta_{12}^\alpha = (1 - \alpha)\theta_1 + \alpha\theta_2$, and $\alpha^*$ denotes the best error exponent.
Let $\theta_{12}^* := \theta_{12}^{\alpha^*}$. The geometry [100] of the best error exponent can be explained on the dually flat exponential family manifold as follows: the optimal parameter $\theta_{12}^*$ lies on the exponential geodesic linking $\theta_1$ to $\theta_2$, at its intersection with the $m$-bisector
$$\mathrm{Bi}_m(P_1, P_2) = \left\{P \;:\; F(\theta_1) - F(\theta_2) + \eta(P)^\top(\theta_2 - \theta_1) = 0\right\}. \tag{224}$$
Figure 13 illustrates how to retrieve the best error exponent from an exponential arc ($\theta$-geodesic) intersecting the $m$-bisector.

Figure 13. Exact geometric characterization (not necessarily in closed form) of the best error exponent rate $\alpha^*$.
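As an added sketch (the two univariate Gaussians and their parameters are arbitrary, and $F$ below is the standard log-normalizer of the univariate Gaussian family), the Chernoff information can be obtained by maximizing the skew Jensen divergence of Equation (221) over the exponent, and the optimal value can be cross-checked against the Bregman characterization of Theorem 10:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F(theta):        # log-normalizer of the univariate Gaussian exponential family
    t1, t2 = theta
    return -t1 ** 2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)

def gradF(theta):
    t1, t2 = theta
    return np.array([-t1 / (2.0 * t2), t1 ** 2 / (4.0 * t2 ** 2) - 1.0 / (2.0 * t2)])

def bregman(ta, tb):
    return F(ta) - F(tb) - (ta - tb) @ gradF(tb)

def natural(mu, sigma2):
    return np.array([mu / sigma2, -0.5 / sigma2])

t1, t2 = natural(0.0, 1.0), natural(3.0, 2.0)
skew_jensen = lambda a: a * F(t1) + (1 - a) * F(t2) - F(a * t1 + (1 - a) * t2)
res = minimize_scalar(lambda a: -skew_jensen(a), bounds=(1e-6, 1 - 1e-6), method="bounded")
a_star = res.x
t_star = a_star * t1 + (1 - a_star) * t2
print(skew_jensen(a_star))                       # Chernoff information C[P1, P2]
print(bregman(t1, t_star), bregman(t2, t_star))  # equal at the optimum (Theorem 10)
```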
Furthermore, instead of considering two distributions for this statistical binary decision problem, we may consider a set of $n$ distributions $P_1, \ldots, P_n \in \mathcal{E}$. The geometry of the error exponent in this multiple hypothesis testing setting has been investigated in [105]. On the dually flat exponential family manifold, it corresponds to checking the exponential arcs between natural neighbors (sharing Voronoi subfaces) of a Bregman Voronoi diagram [40]. See Figure 14 for an illustration.
4.4. Clustering Mixtures in the Dually Flat Mixture Family Manifold (M, KL)
Given a set of k prescribed statistical distributions p0(x), . . . , pk 1(x), all sharing the same support
X (say, R), a mixture family M of order D = k 1 consists of all strictly convex combinations of these
component distributions [56]:
$$\mathcal{M} := \left\{ m(x; \theta) = \sum_{i=1}^{k-1}\theta_i p_i(x) + \left(1 - \sum_{i=1}^{k-1}\theta_i\right)p_0(x) \;\text{ such that }\; \theta_i > 0,\; \sum_{i=1}^{k-1}\theta_i < 1 \right\}. \tag{225}$$
Figure 15 displays two mixtures obtained as convex combinations of prescribed Laplacian,
Gaussian and Cauchy component distributions (D = 2). When considering a set of prescribed
Gaussian component distributions, we obtain a w-Gaussian Mixture Model, or w-GMM for short.
Figure 15. Example of a mixture family of order $D = 2$ (3 prefixed components: Laplace(0,1), Gaussian(−2,1) and Cauchy(2,1)); the two mixtures $M_1$ and $M_2$ shown are convex combinations of these components.
We consider the expected information manifold $(\mathcal{M}, {}^{\mathcal{M}}g, {}^{\mathcal{M}}\nabla^m, {}^{\mathcal{M}}\nabla^e)$, which is dually flat and equivalent to $(\mathcal{M}_\Theta, \mathrm{KL})$. That is, the KL divergence between two mixtures with prescribed components ($w$-mixtures, for short) is equivalent to a Bregman divergence for $F(\theta) = -h(m_\theta)$, where $h(p) = -\int p(x)\log p(x)\,\mathrm{d}\mu(x)$ is Shannon's differential entropy (so that $F$ is the negative entropy) [56]:
$$\mathrm{KL}[m_{\theta_1} : m_{\theta_2}] = B_F(\theta_1 : \theta_2). \tag{226}$$
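A minimal sketch of this Monte Carlo construction (an added illustration assuming the three components of Figure 15, a Cauchy proposal density $r$ for the fixed sample $S$, and arbitrary mixture weights): the Bregman divergence of the Monte Carlo generator $\tilde{F}_S$ approximates the KL divergence between two $w$-mixtures:

```python
import numpy as np
from scipy.stats import laplace, norm, cauchy

p0 = laplace(loc=0, scale=1).pdf
comps = [norm(loc=-2, scale=1).pdf, cauchy(loc=2, scale=1).pdf]   # p_1, p_2
proposal = cauchy(loc=0, scale=5)          # fixed proposal r used to draw S
S = proposal.rvs(size=50_000, random_state=3)
w = 1.0 / proposal.pdf(S)                  # importance weights

def mix(theta, x):
    """Mixture density m(x; theta) = sum_i theta_i p_i(x) + (1 - sum_i theta_i) p_0(x)."""
    return sum(t * p(x) for t, p in zip(theta, comps)) + (1.0 - sum(theta)) * p0(x)

def F_mc(theta):
    """Monte Carlo estimate of the negative differential entropy of m_theta."""
    m = mix(theta, S)
    return np.mean(w * m * np.log(m))

def gradF_mc(theta):
    m = mix(theta, S)
    return np.array([np.mean(w * (p(S) - p0(S)) * (np.log(m) + 1.0)) for p in comps])

def kl_mc_bregman(theta1, theta2):
    """B_{F_S}(theta1 : theta2), a consistent estimate of KL(m_theta1 : m_theta2)."""
    t1, t2 = np.asarray(theta1), np.asarray(theta2)
    return F_mc(t1) - F_mc(t2) - (t1 - t2) @ gradF_mc(t2)

theta1, theta2 = (0.2, 0.5), (0.6, 0.1)        # two w-mixtures of order D = 2
m1, m2 = mix(theta1, S), mix(theta2, S)
kl_direct = np.mean(w * m1 * np.log(m1 / m2))  # importance-sampling estimate of the KL
print(kl_mc_bregman(theta1, theta2), kl_direct)  # two close values
```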
Consider a set $\{m_{\theta_1}, \ldots, m_{\theta_n}\}$ of $n$ $w$-mixtures [56]. Because $F(\theta) = -h(m(x; \theta))$ is the negative differential entropy of a mixture (not available in closed form [106]), we approximate the intractable $F$ by another close tractable generator $\tilde{F}$. We use Monte Carlo stochastic sampling to get a Monte Carlo convex generator $\tilde{F}_S$ for an independent and identically distributed sample $S$.
Thus we can build a nested sequence $(\mathcal{M}, \tilde{F}_{S_1}), \ldots, (\mathcal{M}, \tilde{F}_{S_m})$ of tractable dually flat manifolds; the computations carried inside these dually flat manifolds $(\mathcal{M}, \tilde{F}_S)$ are consistent, see [56].
For example, we can apply Bregman k-means [107] on these Monte Carlo dually flat spaces
[108] of w-GMMs (Gaussian Mixture Models) to cluster a set of w-GMMs. Figure 16 displays the
result of such a clustering.
We have briefly described two applications using dually flat manifolds: (1) the dually flat exponential family manifold induced by the statistical reverse Kullback–Leibler divergence on an exponential family (structure $(\mathcal{E}, \mathrm{KL}^*)$), and (2) the dually flat mixture family manifold induced by the statistical Kullback–Leibler divergence on a mixture family (structure $(\mathcal{M}, \mathrm{KL})$). There are many other dually flat structures that can be met in a statistical context. For example, two other dually flat structures for the $D$-dimensional probability simplex $\Delta_D$ are reported in Amari's textbook [2]: (1) the conformal deformation of the $\alpha$-geometry (page 88, Equation 4.95 of [2]), and (2) the $\chi$-escort geometry (page 91, Equation 4.114 of [2]).
5.1. Summary
We explained the dualistic nature of information manifolds $(M, g, \nabla, \nabla^*)$ in information geometry. The dualistic structure is defined by a pair of conjugate connections coupled with the metric tensor that provides a dual parallel transport preserving the metric. We showed how to extend this structure to a 1-parameter family of structures. From a pair of conjugate connections, the pipeline to build this 1-parameter family of structures can be informally summarized as:
$$(M, g, \nabla, \nabla^*) \Rightarrow (M, g, C) \Rightarrow (M, g, \alpha C) \Rightarrow (M, g, \nabla^{\alpha}, \nabla^{-\alpha}), \quad \forall\alpha \in \mathbb{R}. \tag{227}$$
We stated the fundamental theorem of information geometry on dual constant-curvature manifolds, including the special but important case of dually flat manifolds, on which there exist two potential functions and global affine coordinate systems related by the Legendre–Fenchel transformation. Although information geometry historically started with the Riemannian modeling $(\mathcal{P}, {}^{\mathcal{P}}g)$ of a parametric family of probability distributions $\mathcal{P}$ by letting the metric tensor be the Fisher information matrix, we have emphasized the dualistic view of information geometry, which considers non-Riemannian manifolds that can be derived from any divergence and are not necessarily tied to a statistical context (e.g., information manifolds can be used in mathematical programming [109]). Let us notice that for any symmetric divergence (e.g., any symmetrized $f$-divergence like the squared Hellinger divergence), the induced conjugate connections coincide with the Levi–Civita connection, but the Fisher–Rao metric distance does not coincide with the squared Hellinger divergence.
On one hand, a Riemannian metric distance $D_\rho$ is never a divergence because the rooted distance functions fail to be smooth at the extremities, but a squared Riemannian metric distance is always a divergence. On the other hand, taking the power $\delta$ of a divergence $D$ (i.e., $D^\delta$) for some $\delta > 0$ may yield a metric distance (e.g., the square root of the Jensen–Shannon divergence [110]), but this may not always be the case: the powered Jeffreys divergence $J^\delta$ is never a metric distance (see [111], page 889). Recently, Optimal Transport (OT) theory [112] gained interest in statistics and machine learning. However, the optimal transport distance between two members of a same elliptically-contoured family is given by the same formula (see [113], Eq. 16 and Eq. 17), although these families have different Kullback–Leibler divergences. Another essential difference is that the Fisher–Rao manifold of location-scale families is hyperbolic, but the Wasserstein manifold of location-scale families has positive curvature [113,114].
Notice that we may convert back and forth a similarity $S(p, q) \in (0, 1]$ into a dissimilarity $D(p, q) \in [0, \infty)$, for example via $D(p, q) = -\log S(p, q)$ and $S(p, q) = \exp(-D(p, q))$.
Hotelling mentioned that location-scale probability families yield Riemannian manifolds of constant
non-positive curvatures. This Riemannian modeling of parametric families of densities was further independently studied by Calyampudi Radhakrishna Rao (C.R. Rao) in his celebrated paper [66] (1945), which also includes the Cramér–Rao lower bound [51] and the Rao–Blackwellization technique used in statistics. Nowadays the induced Riemannian metric distance is often called the Fisher–Rao distance [119] or Rao distance [81]. Yet another use of Riemannian geometry in statistics was pioneered by Harold Jeffreys [70], who proposed to use as an invariant prior the normalized volume element of the Fisher–Riemannian manifold. In those seminal papers, there was no theoretical justification for using the Fisher information matrix as a metric tensor (besides the fact that it is a well-defined positive-definite matrix for regular identifiable models). Nowadays, this Riemannian metric tensor is called the information metric for short. Modern information geometry considers a generalization of this approach using a non-Riemannian dualistic modeling $(M, g, \nabla, \nabla^*)$ which coincides with the Riemannian manifold when $\nabla = \nabla^* = \nabla^{\mathrm{LC}}$, the Levi–Civita connection (the unique torsion-free affine connection compatible with the metric tensor). The Fisher–Rao geometry has also been explored in thermodynamics, yielding the Ruppeiner geometry [120]; the geometry of thermodynamics is nowadays called geometrothermodynamics [121].
In the 1960s, Nikolai Chentsov (also commonly written Čencov) studied the algebraic category of all statistical decision rules with its induced geometric structures: namely, the $\alpha$-geometries (“equivalent differential geometry”) and the dually flat manifolds (“Nonsymmetric Pythagorean geometry” of the exponential families with respect to the Kullback–Leibler divergence). In the preface of the English translation of his 1972 Russian monograph [115], the field of investigation is defined as “geometrical statistics.” However, in the original Russian monograph, Chentsov used the Russian term geometrostatistics. According to Professor Alexander Holevo, the geometrostatistics
term was coined by Andrey Kolmogorov to define the field of differential geometry of statistical
models. In the monograph of Chentsov [115], the Fisher information metric is shown to be the
unique metric tensor (up to a scaling factor) yielding statistical invariance under Markov morphisms
(see [57] for a simpler proof that generalizes to positive measures).
The dual nature of the information geometry was thoroughly investigated by Professor Shun-
ichi Amari [122]. In the preface of his 1985 monograph [116], Professor Amari coined the term
information geometry as follows: “The differential-geometrical method developed in statistics is also
applicable to other fields of sciences such as information theory and systems theory... They together
will open a new field, which I would like to call information geometry.” Professor Amari mentioned in
[116] that he considered the Gaussian Riemannian manifold as a hyperbolic manifold in 1959, and
was strongly influenced by Efron’s paper on statistical curvature [123] (1975) to study the family of
a-connections in the 1980s [122,124]. Professor Amari prepared his PhD under the supervision of
Professor Kondo [125], an expert of differential geometry in touch with Professor Kawaguchi [126].
The role of differential geometry in statistics has been discussed in [127].
Note that the dual affine connections of information geometry have also been investigated
independently in affine differential geometry [128] which considers invariance under volume-
preserving affine transformations by defining a volume form instead of a metric form for Riemannian
geometry. The notion of dual parallel transport compatible with the metric is due to Aleksandr
Norden [129] and Rabindra Nath Sen [130–132] (see the Senian geometry in
http://insaindia.res.in/detail/N54-0728).
5.3. Perspectives
We recommend the two recent textbooks [2,16] for an in-depth coverage of (parametric) information geometry, and the book [133] for a thorough description of some infinite-dimensional statistical models. (Japanese readers may refer to [134,135].) We did not report the various coefficients of the metric tensors, Christoffel symbols and skewness tensors for the expected $\alpha$-geometry of common parametric models like the multivariate Gaussian distributions, the Gamma/Beta distributions, etc. They can be found in [15,16] and in various articles dealing with less common families of distributions [15,64,136–140]. Although we have focused on the finite parametric setting, information geometry also considers non-parametric families of distributions [141], and quantum information geometry [142].
We have shown that we can always create an information manifold $(M, D)$ from any divergence function $D$. It is therefore important to consider generic classes of divergences in applications, that are ideally axiomatized and shown to have exhaustive characteristics. The $\alpha$-skewed Jensen divergences [102] are defined for a real-valued strictly convex function $F(\theta)$ by:
$$J_F^\alpha(\theta_1 : \theta_2) := (1 - \alpha)F(\theta_1) + \alpha F(\theta_2) - F((1 - \alpha)\theta_1 + \alpha\theta_2) \geq 0, \tag{230}$$
where both $\theta_1$ and $\theta_2$ belong to the parameter space $\Theta$. Clearly, we have $J_F^\alpha(\theta_1 : \theta_2) = J_F^{1-\alpha}(\theta_2 : \theta_1)$. Scaled skew Jensen divergences tend asymptotically to Bregman divergences:
$$\lim_{\alpha \to 0^+}\frac{J_F^{1-\alpha}(\theta_1 : \theta_2)}{\alpha(1 - \alpha)} = B_F(\theta_1 : \theta_2), \tag{231}$$
$$\lim_{\alpha \to 1^-}\frac{J_F^{1-\alpha}(\theta_1 : \theta_2)}{\alpha(1 - \alpha)} = B_F(\theta_2 : \theta_1), \tag{232}$$
where $B_F(\theta_1 : \theta_2)$ is the Bregman divergence [91] induced by a strictly convex and differentiable function $F(\theta)$:
$$B_F(\theta_1 : \theta_2) := F(\theta_1) - F(\theta_2) - (\theta_1 - \theta_2)F'(\theta_2). \tag{233}$$
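A quick numerical check of the limits (231) and (232) (an added illustration with the assumed generator $F(\theta) = \theta\log\theta - \theta$):

```python
import numpy as np

F = lambda t: t * np.log(t) - t
Fprime = lambda t: np.log(t)
B = lambda t1, t2: F(t1) - F(t2) - (t1 - t2) * Fprime(t2)                     # Eq. (233)
J = lambda a, t1, t2: (1 - a) * F(t1) + a * F(t2) - F((1 - a) * t1 + a * t2)  # Eq. (230)

t1, t2 = 0.5, 2.0
for a in [1e-3, 1e-5]:
    print(J(1 - a, t1, t2) / (a * (1 - a)), B(t1, t2))   # Eq. (231): alpha -> 0^+
    print(J(a, t1, t2) / (a * (1 - a)), B(t2, t1))       # Eq. (232): alpha -> 1^-
```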
Appendix C further reports how to interpret geometrically these Jensen/Bregman divergences from
the chordal slope theorem. Beyond the three main Bregman/Csiszár/Jensen classes (these
classes overlap [143]), we may also mention the class of conformal divergences [73,144,145], the
class of projective divergences [146,147], etc. Figure 17 illustrates the relationships between the
principal classes of distances.
There are many perspectives on information geometry as attested by the new Springer journal
(see online at https://www.springer.com/mathematics/geometry/journal/41884), and the biannual
international conference “Geometric Sciences of Information” (GSI) [148–150] with its collective
post-conference edited books [151,152]. We also mention the edited book [153] on the Occasion of
Shun-ichi Amari’s 80th birthday.
Figure 17. Principled classes of distances/divergences.
Recall that the Csiszár $f$-divergence is defined by $I_f(P : Q) := \int p(x)\,f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x)$, where $P = p\,\mathrm{d}\mu$ and $Q = q\,\mathrm{d}\mu$ (i.e., $p$ and $q$ denote the Radon–Nikodym derivatives with respect to $\mu$). We use the following conventions:
$$0\,f\!\left(\frac{0}{0}\right) = 0, \quad f(0) = \lim_{u \to 0^+} f(u), \quad \forall a > 0,\; 0\,f\!\left(\frac{a}{0}\right) = \lim_{u \to 0^+} u\,f\!\left(\frac{a}{u}\right) = a\lim_{u \to \infty}\frac{f(u)}{u}. \tag{A2}$$
When sampling $x_1, \ldots, x_n \sim_{\mathrm{iid}} r$ from a proposal density $r$, the Kullback–Leibler divergence can be estimated by importance sampling as
$$\widehat{\mathrm{KL}}_n(p : q) = \frac{1}{n}\sum_{i=1}^{n}\frac{p(x_i)}{r(x_i)}\log\frac{p(x_i)}{q(x_i)}.$$
When $r$ is chosen as $p$, the KLD can be estimated as:
$$\widehat{\mathrm{KL}}_n(p : q) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(x_i)}{q(x_i)}. \tag{A5}$$
Monte Carlo estimators are consistent under mild conditions: $\lim_{n\to\infty}\widehat{\mathrm{KL}}_n(p : q) = \mathrm{KL}(p : q)$. In practice, one problem when implementing Equation (A5) is that we may potentially end up with $\widehat{\mathrm{KL}}_n(p : q) < 0$. This may have disastrous consequences as algorithms implemented by programs expect non-negative divergences to execute a correct workflow. The potential negative value problem of Equation (A5) comes from the fact that $\sum_i p(x_i) \neq 1$ and $\sum_i q(x_i) \neq 1$. One way to circumvent this problem is to consider the extended $f$-divergences:
Definition A1 (Extended $f$-divergence). The extended $f$-divergence for a convex generator $f$, strictly convex at 1 and satisfying $f(1) = 0$, is defined by
$$I_f^e(p : q) = \int p(x)\left(f\!\left(\frac{q(x)}{p(x)}\right) - f'(1)\left(\frac{q(x)}{p(x)} - 1\right)\right)\mathrm{d}\mu(x). \tag{A6}$$
Indeed, for a strictly convex generator $f$, let us consider the scalar Bregman divergence [91]:
$$B_f(a : b) = f(a) - f(b) - (a - b)f'(b) \geq 0. \tag{A7}$$
Setting $a = \frac{q(x)}{p(x)}$ and $b = 1$ in Equation (A7), and using the fact that $f(1) = 0$, we get
$$f\!\left(\frac{q(x)}{p(x)}\right) - \left(\frac{q(x)}{p(x)} - 1\right)f'(1) \geq 0. \tag{A8}$$
Then we estimate the extended $f$-divergence using Monte Carlo integration with respect to the distribution $p$, using $n$ variates $x_1, \ldots, x_n \sim_{\mathrm{iid}} p$, as:
$$\hat{I}_{f,n}(p : q) = \frac{1}{n}\sum_{i=1}^{n}\left(f\!\left(\frac{q(x_i)}{p(x_i)}\right) - f'(1)\left(\frac{q(x_i)}{p(x_i)} - 1\right)\right) \geq 0. \tag{A11}$$
Entropy 2020, 22, 1100 50 of 61
For example, for the KLD, we obtain the following Monte Carlo estimator:
$$\widehat{\mathrm{KL}}_n(p : q) = \frac{1}{n}\sum_{i=1}^{n}\left(\log\frac{p(x_i)}{q(x_i)} + \frac{q(x_i)}{p(x_i)} - 1\right) \geq 0, \tag{A12}$$
since the extended KLD is
$$D_{\mathrm{KL}^e}(p : q) = \int\left(p(x)\log\frac{p(x)}{q(x)} + q(x) - p(x)\right)\mathrm{d}\mu(x). \tag{A13}$$
Equation (A12) can be interpreted as a sum of scalar Itakura–Saito divergences, since the Itakura–Saito divergence is scale-invariant: $\widehat{\mathrm{KL}}_n(p : q) = \frac{1}{n}\sum_{i=1}^{n} D_{\mathrm{IS}}(q(x_i) : p(x_i))$ with the scalar Itakura–Saito divergence
$$D_{\mathrm{IS}}(a : b) = \frac{a}{b} - \log\frac{a}{b} - 1 \geq 0. \tag{A14}$$
More generally, given an $f$-divergence generator $f$, we may define the extended generator
$$f_e(u) := f(u) - f(1) - f'(1)(u - 1). \tag{A15}$$
We check that the generator $f_e$ satisfies both $f_e(1) = 0$ and $f_e'(1) = 0$, and we have $I_f^e(p : q) = I_{f_e}(p : q)$. Thus $D_{\mathrm{KL}^e}(p : q) = I_{f_{\mathrm{KL}^e}}(p : q)$ with $f_{\mathrm{KL}^e}(u) = -\log u + u - 1$.
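A small sketch contrasting the plain estimator of Equation (A5), which may go negative, with the extended estimator of Equation (A12), which is a sum of nonnegative terms (an added example with two arbitrary normal densities and a small sample):

```python
import numpy as np
from scipy.stats import norm

p, q = norm(loc=0.0, scale=1.0), norm(loc=0.2, scale=1.0)
x = p.rvs(size=50, random_state=4)         # a small i.i.d. sample from p

ratio_log = np.log(p.pdf(x) / q.pdf(x))
kl_naive = np.mean(ratio_log)                                   # Eq. (A5), may be < 0
kl_extended = np.mean(ratio_log + q.pdf(x) / p.pdf(x) - 1.0)    # Eq. (A12), always >= 0
print(kl_naive, kl_extended)
```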
Let us remark that we only need the scalar function to be strictly convex at 1 to ensure that $B_f\!\left(\frac{a}{b} : 1\right) \geq 0$. Indeed, we may use the definition of Bregman divergences extended to strictly convex but not necessarily smooth functions [156,157]:
$$B_f(x : y) = \max_{g(y) \in \partial f(y)}\left\{f(x) - f(y) - (x - y)g(y)\right\}, \tag{A16}$$
where $\partial f(y)$ denotes the subdifferential of $f$ at $y$.
where $|\cdot|$ denotes the matrix determinant. The natural parameters $\theta$ are also expressed using both a vector parameter $\theta_v$ and a matrix parameter $\theta_M$ in a compound object $\theta = (\theta_v, \theta_M)$. By defining the following compound inner product on a composite (vector, matrix) object
$$\langle\theta, \theta'\rangle := \theta_v^\top\theta'_v + \mathrm{tr}\left(\theta_M'^\top\theta_M\right), \tag{A18}$$
where $\mathrm{tr}(\cdot)$ denotes the matrix trace, we rewrite the MVN density of Equation (A17) in the canonical form of an exponential family [52]: $p(x; \theta) = \exp\left(\langle t(x), \theta\rangle - F_\theta(\theta)\right)$, where
$$t(x) = (x, -xx^\top) \tag{A21}$$
is the compound sufficient statistic. The function $F_\theta$ is the strictly convex and continuously differentiable log-normalizer defined by:
$$F_\theta(\theta) = \frac{1}{2}\left(d\log\pi - \log|\theta_M| + \frac{1}{2}\theta_v^\top\theta_M^{-1}\theta_v\right). \tag{A22}$$
The log-normalizer can be expressed using the ordinary parameters, $\lambda = (\mu, \Sigma)$, as:
$$F_\lambda(\lambda) = \frac{1}{2}\left(\lambda_v^\top\lambda_M^{-1}\lambda_v + \log|\lambda_M| + d\log 2\pi\right), \tag{A23}$$
$$= \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \log|\Sigma| + d\log 2\pi\right). \tag{A24}$$
The moment/expectation parameters [2] are $\eta = (\eta_v, \eta_M) = E[t(x)] = \nabla F_\theta(\theta)$, with dual potential function
$$F_\eta(\eta) = -\frac{1}{2}\left(\log(1 + \eta_v^\top\eta_M^{-1}\eta_v) + \log|-\eta_M| + d(1 + \log 2\pi)\right), \tag{A29}$$
and $\theta = \nabla_\eta F_\eta(\eta)$.
The dual potentials satisfy the Fenchel–Young equality
$$F_\theta(\theta) + F_\eta(\eta) - \langle\theta, \eta\rangle = 0. \tag{A30}$$
The Kullback–Leibler divergence between two $d$-dimensional Gaussian distributions $p_{(\mu_1,\Sigma_1)}$ and $p_{(\mu_2,\Sigma_2)}$ (with $\Delta\mu = \mu_2 - \mu_1$) is
$$\mathrm{KL}(p_{(\mu_1,\Sigma_1)} : p_{(\mu_2,\Sigma_2)}) = \frac{1}{2}\left(\mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + \Delta\mu^\top\Sigma_2^{-1}\Delta\mu + \log\frac{|\Sigma_2|}{|\Sigma_1|} - d\right) = \mathrm{KL}(p_{\lambda_1} : p_{\lambda_2}). \tag{A31}$$
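A small helper implementing Equation (A31) (an added sketch; the parameters below are arbitrary):

```python
import numpy as np

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """KL(N(mu1, Sigma1) : N(mu2, Sigma2)) via Eq. (A31)."""
    d = len(mu1)
    dmu = mu2 - mu1
    inv2 = np.linalg.inv(Sigma2)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    return 0.5 * (np.trace(inv2 @ Sigma1) + dmu @ inv2 @ dmu + logdet2 - logdet1 - d)

mu1, Sigma1 = np.zeros(2), np.eye(2)
mu2, Sigma2 = np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
print(kl_gaussians(mu1, Sigma1, mu2, Sigma2))
print(kl_gaussians(mu1, Sigma1, mu1, Sigma1))   # 0 by construction
```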
We check that $\mathrm{KL}(p_{(\mu,\Sigma)} : p_{(\mu,\Sigma)}) = 0$ since $\Delta\mu = 0$ and $\mathrm{tr}(\Sigma^{-1}\Sigma) = \mathrm{tr}(I) = d$. Notice that when $\Sigma_1 = \Sigma_2 = \Sigma$, we have
$$\mathrm{KL}(p_{(\mu_1,\Sigma)} : p_{(\mu_2,\Sigma)}) = \frac{1}{2}\Delta\mu^\top\Sigma^{-1}\Delta\mu = \frac{1}{2}D^2_{\Sigma^{-1}}(\mu_1, \mu_2), \tag{A32}$$
that is, half the squared Mahalanobis distance for the precision matrix $\Sigma^{-1}$ (a positive-definite matrix: $\Sigma^{-1} \succ 0$), where the Mahalanobis distance is defined for any positive-definite matrix $Q \succ 0$ as follows:
$$D_Q(p_1 : p_2) = \sqrt{(p_1 - p_2)^\top Q (p_1 - p_2)}. \tag{A33}$$
The Kullback–Leibler divergence between two probability densities of the same exponential family amounts to a Bregman divergence [2]:
$$\mathrm{KL}(p_{(\mu_1,\Sigma_1)} : p_{(\mu_2,\Sigma_2)}) = \mathrm{KL}(p_{\lambda_1} : p_{\lambda_2}) = B_F(\theta_2 : \theta_1) = B_{F^*}(\eta_1 : \eta_2), \tag{A34}$$
where the Bregman divergence is defined by
$$B_F(\theta : \theta') := F(\theta) - F(\theta') - \langle\theta - \theta', \nabla F(\theta')\rangle, \tag{A35}$$
with $\eta' = \nabla F(\theta')$. Define the canonical divergence [2] $A_{F,F^*}(\theta : \eta') := F(\theta) + F^*(\eta') - \langle\theta, \eta'\rangle$, so that $B_F(\theta : \theta') = A_{F,F^*}(\theta : \eta')$.
Lemma A1 (Chordal slope lemma). Let $F$ be a strictly convex function on an interval $I = (a, b)$ for $a < b$. For any $\theta_1 < \theta < \theta_2$ in $I$, we have:
$$\frac{F(\theta) - F(\theta_1)}{\theta - \theta_1} < \frac{F(\theta_2) - F(\theta_1)}{\theta_2 - \theta_1} < \frac{F(\theta_2) - F(\theta)}{\theta_2 - \theta}.$$
That is, define the points $P_1 = (\theta_1, F(\theta_1))$, $P = (\theta, F(\theta))$, and $P_2 = (\theta_2, F(\theta_2))$. Then the chordal slope lemma states that the slope of the chord $[P_1 P]$ is less than the slope of the chord $[P_1 P_2]$, which is itself less than the slope of $[P P_2]$ (Figure A1).
That is, we recover for $\theta_1 < (1 - \alpha)\theta_1 + \alpha\theta_2 < \theta_2$ (i.e., $\alpha \in (0, 1)$ and $\theta_1 \neq \theta_2$) the $\alpha$-skewed Jensen divergence [102]:
$$J_F^\alpha(\theta_1 : \theta_2) := (1 - \alpha)F(\theta_1) + \alpha F(\theta_2) - F((1 - \alpha)\theta_1 + \alpha\theta_2) > 0. \tag{A44}$$
Now, a consequence of the chordal slope lemma is that for a strictly convex and differentiable function $F$, we have:
$$F'(\theta_1) \leq \frac{F(\theta_2) - F(\theta_1)}{\theta_2 - \theta_1} \leq F'(\theta_2). \tag{A45}$$
This can be geometrically visualized in Figure A1. That is,
$$F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)F'(\theta_1) \geq 0, \tag{A46}$$
$$F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)F'(\theta_2) \leq 0. \tag{A47}$$
We recognize the expressions of the Bregman divergences [91]:
$$B_F(\theta_1 : \theta_2) := F(\theta_1) - F(\theta_2) - (\theta_1 - \theta_2)F'(\theta_2), \tag{A48}$$
and get $B_F(\theta_2 : \theta_1) \geq 0$ and $B_F(\theta_1 : \theta_2) \geq 0$.
Theorem A1. A multivariate Bregman divergence between two parameters $\theta_1$ and $\theta_2$ can be expressed as a univariate Bregman divergence for the generator $F_{\theta_1,\theta_2}$ induced by the parameters:
$$B_F(\theta_1 : \theta_2) = B_{F_{\theta_1,\theta_2}}(0 : 1), \quad \forall\theta_1 \neq \theta_2, \tag{A51}$$
where
$$F_{\theta_1,\theta_2}(u) := F(\theta_1 + u(\theta_2 - \theta_1)). \tag{A52}$$
Proof. The functions $\{F_{\theta_1,\theta_2}\}_{(\theta_1,\theta_2)\in\Theta^2}$ are 1D Bregman generators. Consider the directional derivative
$$F'_{\theta_1,\theta_2}(u) = (\theta_2 - \theta_1)^\top\nabla F(\theta_1 + u(\theta_2 - \theta_1)). \tag{A54}$$
Since $F_{\theta_1,\theta_2}(0) = F(\theta_1)$, $F_{\theta_1,\theta_2}(1) = F(\theta_2)$, and $F'_{\theta_1,\theta_2}(u) = \nabla_{\theta_2-\theta_1}F_{\theta_1,\theta_2}(u)$, it follows that
$$B_{F_{\theta_1,\theta_2}}(0 : 1) = F_{\theta_1,\theta_2}(0) - F_{\theta_1,\theta_2}(1) - (0 - 1)F'_{\theta_1,\theta_2}(1), \tag{A55}$$
$$= F(\theta_1) - F(\theta_2) + (\theta_2 - \theta_1)^\top\nabla F(\theta_2), \tag{A56}$$
$$= B_F(\theta_1 : \theta_2). \tag{A57}$$
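A numerical illustration of Theorem A1 (an added sketch with the assumed separable generator $F(\theta) = \sum_i \theta_i\log\theta_i$ and arbitrary positive parameter vectors):

```python
import numpy as np

F = lambda t: np.sum(t * np.log(t))
gradF = lambda t: np.log(t) + 1.0

def bregman(t1, t2):
    return F(t1) - F(t2) - (t1 - t2) @ gradF(t2)

theta1 = np.array([0.2, 0.5, 0.3])
theta2 = np.array([0.6, 0.1, 0.3])

# Univariate restriction F_{theta1,theta2}(u) := F(theta1 + u (theta2 - theta1)) (Eq. A52)
F1d = lambda u: F(theta1 + u * (theta2 - theta1))
F1d_prime = lambda u: (theta2 - theta1) @ gradF(theta1 + u * (theta2 - theta1))  # Eq. (A54)
bregman_1d = F1d(0.0) - F1d(1.0) - (0.0 - 1.0) * F1d_prime(1.0)                  # Eq. (A55)
print(bregman(theta1, theta2), bregman_1d)    # equal values (Eq. A51)
```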
Notations
Below is a list of notations we used in this document:
$[D]$ : $[D] := \{1, \ldots, D\}$
$\langle\cdot,\cdot\rangle$ : inner product
$M_Q(u, v)$ : Mahalanobis distance $M_Q(u, v) = \|u - v\|_Q = \sqrt{\sum_{i,j}(u^i - v^i)(u^j - v^j)Q_{ij}}$, $Q \succ 0$
$D(\theta : \theta')$ : parameter divergence
$D[p(x) : p'(x)]$ : statistical divergence
$D$, $D^*$ : divergence and dual (reverse) divergence
$I_f(\theta : \theta')$ : Csiszár divergence $I_f(\theta : \theta') := \sum_{i=1}^{D}\theta_i f\!\left(\frac{\theta'_i}{\theta_i}\right)$ with $f(1) = 0$
$B_F(\theta : \theta')$ : Bregman divergence $B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top\nabla F(\theta')$
$A_{F,F^*}(\theta : \eta')$ : canonical divergence $A_{F,F^*}(\theta : \eta') = F(\theta) + F^*(\eta') - \theta^\top\eta'$
$B_\alpha[p_1 : p_2]$ : Bhattacharyya distance $B_\alpha[p_1 : p_2] = -\log\int_{x\in\mathcal{X}} p_1^\alpha(x)p_2^{1-\alpha}(x)\,\mathrm{d}\mu(x)$
$J_F^\alpha(\theta_1 : \theta_2)$ : Jensen/Burbea–Rao divergence $J_F^\alpha(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1-\alpha)F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2)$
$C[P_1, P_2]$ : Chernoff information $C[P_1, P_2] = -\log\min_{\alpha\in(0,1)}\int_{x\in\mathcal{X}} p_1^\alpha(x)p_2^{1-\alpha}(x)\,\mathrm{d}\mu(x)$
$F$, $F^*$ : potential functions related by the Legendre–Fenchel transformation
$D_\rho(p, q)$ : Riemannian distance $D_\rho(p, q) := \int_0^1\|\gamma'(t)\|_{\gamma(t)}\,\mathrm{d}t$ along the geodesic $\gamma$ linking $p$ to $q$
$B$, $B^*$ : basis, reciprocal basis
$B = \{e_1 = \partial_1, \ldots, e_D = \partial_D\}$ : natural basis
$\{\mathrm{d}x^i\}_i$ : covector basis (one-forms)
$(v)_B := (v^i)$ : contravariant components of vector $v$
$(v)_{B^*} := (v_i)$ : covariant components of vector $v$
$u \perp v$ : vector $u$ is perpendicular to vector $v$ ($\langle u, v\rangle = 0$)
$\nabla$ : affine connection
$\nabla_X Y$ : covariant derivative
$\prod_c$ : parallel transport of vectors $v \in T_{c(0)}$ along a smooth curve $c$
$\gamma$, $\gamma_\nabla$ : geodesic, geodesic with respect to connection $\nabla$
$\Gamma_{ij,l}$ : Christoffel symbols of the first kind (functions)
$\Gamma^k_{ij}$ : Christoffel symbols of the second kind (functions)
$R$ : Riemann–Christoffel curvature tensor
$[X, Y]$ : Lie bracket $[X, Y](f) = X(Y(f)) - Y(X(f))$, $\forall f \in \mathcal{F}(M)$
$\nabla$-projection : $P_S = \arg\min_{Q\in S} D(\theta(P) : \theta(Q))$
$\nabla^*$-projection : $P^*_S = \arg\min_{Q\in S} D(\theta(Q) : \theta(P))$
$C$ : Amari–Chentsov totally symmetric cubic 3-covariant tensor
$\mathcal{P} = \{p_\theta(x)\}_{\theta\in\Theta}$ : parametric family of probability distributions
$\mathcal{E}$, $\mathcal{M}$, $\Delta_D$ : exponential family, mixture family, probability simplex
${}_{\mathcal{P}}I(\theta)$ : Fisher Information Matrix (FIM) of a parametric family $\mathcal{P}$
${}_{\mathcal{P}}g$ : Fisher information metric tensor field
${}^e\nabla$, ${}^e\Gamma$ : exponential connection ${}^e\Gamma := E_\theta[(\partial_i\partial_j l)(\partial_k l)]$
${}^m\nabla$, ${}^m\Gamma$ : mixture connection ${}^m\Gamma := E_\theta[(\partial_i\partial_j l + \partial_i l\,\partial_j l)(\partial_k l)]$
$C_{ijk}$ : expected skewness tensor $C_{ijk} := E_\theta[\partial_i l\,\partial_j l\,\partial_k l]$
References
1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 623–656.
2. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo,
Japan, 2016.
3. Kakihara, S.; Ohara, A.; Tsuchiya, T. Information Geometry and Interior-Point Algorithms in Semidefinite
Programs and Symmetric Cone Programs. J. Optim. Theory Appl. 2013, 157, 749–780,
doi:10.1007/s10957-012-0180-9.
4. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence,
RI, USA, 2007.
5. Peirce, C.S. Chance, Love, and Logic: Philosophical Essays; U of Nebraska Press: Lincoln, NE, USA, 1998.
6. Schurz, G. Patterns of abduction. Synthese 2008, 164, 201–234.
7. Wald, A. Statistical decision functions. Ann. Math. Stat. 1949, pp. 165–205.
8. Wald, A. Statistical Decision Functions; Wiley: Chichester, U.K., 1950.
9. Dabak, A.G. A Geometry for Detection Theory. Ph.D. Thesis, Rice University, Houston, USA, 1993.
10. Do Carmo, M.P. Differential Geometry of Curves and Surfaces; Courier Dover Publications: New York,
NY, USA, Revised and Updated Second Edition; 2016.
11. Amari, S.; Barndorff-Nielsen, O.E.; Kass, R.E.; Lauritzen, S.L.; Rao, C.R. Differential Geometry in
Statistical Inference; Institute of Mathematical Statistics: Hayward, CA, USA, 1987.
12. Dodson, C.T.J. (Ed.) Geometrization of Statistical Theory; ULDM Publications; University of Lancaster,
Department of Mathematics: Bailrigg, Lancaster, UK, 1987.
13. Murray, M.; Rice, J. Differential Geometry and Statistics; Number 48 in Monographs on Statistics and
Applied Probability; Chapman and Hall: London, UK, 1993.
14. Kass, R.E.; Vos, P.W. Geometrical Foundations of Asymptotic Inference; Wiley-Interscience: New York,
NY, USA, 1997.
15. Arwini, K.A.; Dodson, C.T.J. Information Geometry: Near Randomness and Near Independance; Springer:
Berlin, Germany, 2008.
16. Calin, O.; Udriste, C. Geometric Modeling in Probability and Statistics; Mathematics and Statistics,
Springer International Publishing: Cham, Switzerland, 2014.
Entropy 2020, 22, 1100 56 of 61
17. Ay, N.; Jost, J.; Vân Lê, H.; Schwachhöfer, L. Information Geometry; Springer: Berlin, Germany, 2017;
Volume 64.
18. Corcuera, J.; Giummolè, F. A characterization of monotone and regular divergences. Ann. Inst. Stat.
Math. 1998, 50, 433–450.
19. Mühlich, U. Fundamentals of Tensor Calculus for Engineers with a Primer on Smooth Manifolds; Springer:
Berlin, Germany, 2017; Volume 230.
20. Nielsen, F.; Nock, R. Hyperbolic Voronoi diagrams made easy. In Proceedings of the IEEE International
Conference on Computational Science and Its Applications (ICCSA), Fukuoka, Japan, 23–26 March 2010;
pp. 74–80.
21. Whitney, H.; Eells, J.; Toledo, D. Collected Papers of Hassler Whitney; Nelson Thornes: London, UK, 1992.
22. Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University
Press: Princeton, NJ, USA, 2009.
23. Cartan, E.J. On Manifolds with an Affine Connection and the Theory of General Relativity; Bibliopolis;
Humanities Press; First English Edition, 1986.
24. Akivis, M.A.; Rosenfeld, B.A. Élie Cartan (1869–1951); American Mathematical Society: Cambridge, MA,
USA 2011; Volume 123.
25. Wanas, M. Absolute parallelism geometry: Developments, applications and problems. arXiv 2002,
arXiv:gr-qc/0209050.
26. Bourguignon, J.P. Ricci curvature and measures. Jpn. J. Math. 2009, 4, 27–45.
27. Baez, J.C.; Wise, D.K. Teleparallel gravity as a higher gauge theory. Commun. Math. Phys. 2015, 333, 153–186.
28. Ashburner, J.; Friston, K.J. Diffeomorphic registration using geodesic shooting and Gauss-Newton
optimisation. NeuroImage 2011, 55, 954–967.
29. Lauritzen, S.L. Statistical manifolds. Differ. Geom. Stat. Inference 1987, 10, 163–216.
30. Vân Lê, H. Statistical manifolds are statistical models. J. Geom. 2006, 84, 83–93.
31. Furuhata, H. Hypersurfaces in statistical manifolds. Differ. Geom. Its Appl. 2009, 27, 420–429.
32. Zhang, J. Divergence functions and geometric structures they induce on a manifold. In Geometric Theory
of Information; Nielsen, F., Ed.; Springer: Berlin, Germany, 2014; pp. 1–30.
33. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann.
Stat. 1983, 11, 793–803.
34. Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals.
Hiroshima Math. J. 1985, 15, 341–391.
35. Hiriart-Urruty, J.B.; Lemaréchal, C. Fundamentals of Convex Analysis; Springer Science & Business
Media: Berlin, Germany, 2012.
36. Crouzeix, J.P. A relationship between the second derivatives of a convex function and of its conjugate.
Math. Program. 1977, 13, 364–365.
37. Ay, N.; Amari, S. A novel approach to canonical divergences within information geometry. Entropy 2015,
17, 8111–8129.
38. Nielsen, F. What is... an information projection? Not. AMS 2018, 65, 321–324, doi:10.1090/noti1647.
39. Kurose, T. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. Second Ser.
1994, 46, 427–433.
40. Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discret. Comput. Geom. 2010, 44,
281–307.
41. Nielsen, F.; Piro, P.; Barlaud, M. Bregman vantage point trees for efficient nearest neighbor queries. In
Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, New York, NY, USA, 28
June–3 July 2009; pp. 878–881.
42. Nielsen, F.; Boissonnat, J.D.; Nock, R. Visualizing Bregman Voronoi diagrams. In Proceedings of the
Twenty-Third Annual Symposium on Computational Geometry, Gyeongju, Korea, 6–8 June 2007;
pp. 121–122.
43. Nock, R.; Nielsen, F. Fitting the smallest enclosing Bregman ball. In Proceedings of the European Conference
on Machine Learning; Porto, Portugal, 3–7 October 2005; Springer: Berlin, Germany, 2005, pp. 649–656.
44. Nielsen, F.; Nock, R. On the smallest enclosing information disk. Inf. Process. Lett. 2008, 105, 93–97.
Entropy 2020, 22, 1100 57 of 61
45. Fischer, K.; Gärtner, B.; Kutz, M. Fast smallest-enclosing-ball computation in high dimensions. In
Proceedings of the European Symposium on Algorithms; Budapest, Hungary, 16–19 September 2003;
Springer: Berlin, Germany, 2003, pp. 630–641.
46. Della Pietra, S.; Della Pietra, V.; Lafferty, J. Inducing features of random fields. IEEE Trans. Pattern Anal.
Mach. Intell. 1997, 19, 380–393.
47. Nielsen, F. On Voronoi Diagrams on the Information-Geometric Cauchy Manifolds. Entropy 2020, 22, 713.
48. Shima, H. The Geometry of Hessian Structures; World Scientific: New Jersey, NJ, USA, 2007.
49. Zhang, J. Reference duality and representation duality in information geometry. AIP Conf. Proc. 2015,
1641, 130–146.
50. Gomes-Gonçalves, E.; Gzyl, H.; Nielsen, F. Geometry and Fixed-Rate Quantization in Riemannian Metric
Spaces Induced by Separable Bregman Divergences. In Proceedings of the 4th International Conference
on Geometric Science of Information (GSI), Toulouse, France, 27–29 August 2019; Nielsen, F.,
Barbaresco, F., Eds.; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2019; Volume
11712, pp. 351–358, doi:10.1007/978-3-030-26980-7\_36.
51. Nielsen, F. Cramér-Rao lower bound and information geometry. In Connected at Infinity II; Springer:
Berlin, Germany, 2013; pp. 18–37.
52. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009, arXiv:0911.4863.
53. Sato, Y.; Sugawa, K.; Kawaguchi, M. The geometrical structure of the parameter space of the two-
dimensional normal distribution. Rep. Math. Phys. 1979, 16, 111–119.
54. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223.
55. Malagò, L.; Pistone, G. Information geometry of the Gaussian distribution in view of stochastic
optimization. In Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII,
Aberystwyth, UK, 17–20 January 2015; pp. 150–162.
56. Nielsen, F.; Nock, R. On the geometry of mixtures of prescribed distributions. In Proceedings of the
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB,
Canada,
15–20 April 2018; pp. 2861–2865.
57. Campbell, L.L. An extended Čencov characterization of the information metric. Proc. Am. Math. Soc. 1986,
98, 135–141.
58. Vân Lê, H. The uniqueness of the Fisher metric as information metric. Ann. Inst. Stat. Math. 2017,
69, 879–896.
59. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; Foundations and Trends in
Communications and Information Theory; Now Publishers Inc.: Hanover, MA, USA, 2004; Volume 1, pp.
417–528.
60. Jiao, J.; Courtade, T.A.; No, A.; Venkat, K.; Weissman, T. Information measures: The curious case of the
binary alphabet. IEEE Trans. Inf. Theory 2014, 60, 7616–7626.
61. Qiao, Y.; Minematsu, N. A Study on Invariance of f -Divergence and Its Application to Speech
Recognition. IEEE Trans. Signal Process. 2010, 58, 3884–3890.
62. Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f -divergences.
IEEE Signal Process. Lett. 2013, 21, 10–13.
63. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation.
Stud. Sci. Math. Hung. 1967, 2, 229–318.
64. Mitchell, A.F.S. Statistical manifolds of univariate elliptic distributions. Int. Stat. Rev. 1988, 56, 1–16.
65. Hotelling, H. Spaces of statistical parameters. Bull. Am. Math. Soc. (AMS) 1930, 36, 191.
66. Rao, R.C. Information and the accuracy attainable in the estimation of statistical parameters. Bull.
Calcutta Math. Soc. 1945, 37, 81–91.
67. Komaki, F. Bayesian prediction based on a class of shrinkage priors for location-scale models. Ann. Inst.
Stat. Math. 2007, 59, 135–146.
68. Stigler, S.M. The epic story of maximum likelihood. Stat. Sci. 2007, pp. 598–620.
69. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. In
Breakthroughs in Statistics; Springer: Berlin, Germany, 1992; pp. 235–247.
70. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A 1946,
186, 453–461.
71. Zhang, J. On monotone embedding in information geometry. Entropy 2015, 17, 4485–4499.
Entropy 2020, 22, 1100 58 of 61
72. Naudts, J.; Zhang, J. Rho–tau embedding and gauge freedom in information geometry. Inf. Geom. 2018,
doi:10.1007/s41884-018-0004-6.
73. Nock, R.; Nielsen, F.; Amari, S. On Conformal Divergences and Their Population Minimizers. IEEE TIT
2016, 62, 527–538.
74. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential
family of distributions. Mach. Learn. 2001, 43, 211–246.
75. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn.
Res. 2005, 6, 1705–1749.
76. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE
International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 3621–3624.
77. Nielsen, F.; Nock, R. Cumulant-free closed-form formulas for some common (dis)similarities between
densities of an exponential family. Tech. Rep. 2020, doi:10.13140/RG.2.2.34792.70400
78. Amari, S. Differential geometry of a parametric family of invertible linear systems: Riemannian metric,
dual affine connections, and divergence. Math. Syst. Theory 1987, 20, 53–82.
79. Schwander, O.; Nielsen, F. Fast learning of Gamma mixture models with k-MLE. In Proceedings of the
International Workshop on Similarity-Based Pattern Recognition, York, UK, 3–5 July 2013; Springer:
Berlin, Germany, 2013; pp. 235–249.
80. Miura, K. An introduction to maximum likelihood estimation and information geometry. Interdiscip. Inf. Sci.
2011, 17, 155–174.
81. Reverter, F.; Oller, J.M. Computing the Rao distance for Gamma distributions. J. Comput. Appl. Math.
2003, 157, 155–167.
82. Pinele, J.; Strapasson, J.E.; Costa, S.I. The Fisher-Rao Distance between Multivariate Normal
Distributions: Special Cases, Bounds and Applications. Entropy 2020, 22, 404.
83. Nielsen, F. Pattern learning and recognition on statistical manifolds: An information-geometric review. In
Proceedings of the International Workshop on Similarity-Based Pattern Recognition, York, UK, 3–5 July
2013; Springer: Berlin, Germany, 2013; pp. 1–25.
84. Sun, K.; Nielsen, F. Lightlike Neuromanifolds, Occam’s Razor and Deep Learning. arXiv 2019,
arXiv:1905.11027.
85. Sun, K.; Nielsen, F. Relative Fisher Information and Natural Gradient for Learning Large Modular Models.
In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–
11 August 2017; pp. 3289–3298.
86. Amari, S. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276.
87. Cauchy, A. Methode générale pour la résolution des systèmes d’équations simultanées. C. R. de
l’Académie Sci. 1847, 25, 536–538.
88. Curry, H.B. The method of steepest descent for non-linear minimization problems. Q. Appl. Math. 1944, 2,
258–261.
89. Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Autom. Control 2013,
58, 2217–2229.
90. Nielsen, F. On geodesic triangles with right angles in a dually flat space. arXiv 2019, arXiv:1910.03935.
91. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to
the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
92. Nielsen, F.; Hadjeres, G. Monte Carlo information-geometric structures. In Geometric Structures of
Information; Springer, Berlin, Germany, 2019; pp. 69–103.
93. Nielsen, F. Legendre Transformation and Information Geometry; Springer, Berlin, Heidelberg, Germany, 2010.
94. Raskutti, G.; Mukherjee, S. The information geometry of mirror descent. IEEE Trans. Inf. Theory 2015, 61,
1451–1457.
95. Bubeck, S. Convex Optimization: Algorithms and Complexity; Foundations and Trends in Machine
Learning; Hanover, MA, USA, 2015; Volume 8, pp. 231–357.
96. Zhang, G.; Sun, S.; Duvenaud, D.; Grosse, R. Noisy natural gradient as variational inference. In Proceedings of
the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5852–5861.
97. Beyer, H.G.; Schwefel, H.P. Evolution strategies–A comprehensive introduction. Nat. Comput. 2002, 1, 3–52.
Entropy 2020, 22, 1100 59 of 61
98. Berny, A. Selection and reinforcement learning for combinatorial optimization. In Proceedings of the
International Conference on Parallel Problem Solving from Nature, Paris, France, 18–20 September 2000;
Springer: Berlin, Germany, 2000; pp. 601–610.
99. Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Peters, J.; Schmidhuber, J. Natural evolution
strategies. J. Mach. Learn. Res. 2014, 15, 949–980.
100. Nielsen, F. An Information-Geometric Characterization of Chernoff Information. IEEE Sig. Proc. Lett.
2013, 20, 269–272.
101. Pham, G.; Boyer, R.; Nielsen, F. Computational Information Geometry for Binary Classification of High-
Dimensional Random Tensors. Entropy 2018, 20, 203.
102. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57,
5455–5466.
103. Nielsen, F. Chernoff Information of Exponential Families; Technical Report arXiv:1102.2684; Ithaca, NY,
USA, 2011.
104. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic
means. Pattern Recognit. Lett. 2014, 42, 25–34.
105. Nielsen, F. Hypothesis Testing, Information Divergence and Computational Geometry. In Proceedings of
the International Conference on Geometric Science of Information Geometric Science of Information
(GSI), Paris, France, 28–30 August 2013; pp. 241–248.
106. Nielsen, F.; Sun, K. Guaranteed Bounds on Information-Theoretic Measures of Univariate Mixtures Using
Piecewise Log-Sum-Exp Inequalities. Entropy 2016, 18, 442.
107. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55.
108. Nielsen, F.; Hadjeres, G. Monte Carlo Information Geometry: The dually flat case. arXiv 2018,
arXiv:1803.07225.
109. Ohara, A.; Tsuchiya, T. An Information Geometric Approach to Polynomial-Time Interior-Point Algorithms:
Complexity Bound via Curvature Integral; Research Memorandum; The Institute of Statistical
Mathematics: Tokyo, Japan, 2007; Volume 1055.
110. Fuglede, B.; Topsøe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the IEEE
International Symposium on Information Theory (ISIT), Chicago, IL, USA, 27 June–2 July 2004; p. 31.
111. Vajda, I. On metric divergences of probability measures. Kybernetika 2009, 45, 885–900.
112. Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin, Germany, 2008;
Volume 338.
113. Dowson, D.C.; Landau, B.V. The Fréchet distance between multivariate normal distributions.
J. Multivar. Anal. 1982, 12, 450–455.
114. Takatsu, A. Wasserstein geometry of Gaussian measures. Osaka J. Math. 2011, 48, 1005–1026.
115. Chentsov, N.N. Statistical Decision Rules and Optimal Inference; Monographs; American Mathematical
Society: Providence, RI, USA, 1982.
116. Amari, S. Differential-Geometrical Methods in Statistics; Lecture Notes on Statistics; Second Edition in
1990; 1985; Volume 28.
117. Amari, S.; Nagaoka, H. Methods of Information Geometry; Jouhou kika no Houhou; Iwanami Shoten:
Tokyo, Japan, 1993. (In Japanese)
118. Gibilisco, P.; Riccomagno, E.; Rogantin, M.P.; Wynn, H.P. (Eds.) Algebraic and Geometric Methods in
Statistics; Cambridge University Press: Cambridge, UK, 2009.
119. Srivastava, A.; Wu, W.; Kurtek, S.; Klassen, E.; Marron, J.S. Registration of Functional Data Using
Fisher-Rao Metric. arXiv 2011, arXiv:1103.3817.
120. Wei, S.W.; Liu, Y.X.; Mann, R.B. Ruppeiner geometry, phase transitions, and the microstructure of
charged AdS black holes. Phys. Rev. D 2019, 100, 124033.
121. Quevedo, H. Geometrothermodynamics. J. Math. Phys. 2007, 48, 013506.
122. Amari, S. Theory of Information Spaces: A Differential Geometrical Foundation of Statistics; Post RAAG
Reports; Tokyo, Japan, 1980.
123. Efron, B. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann.
Stat. 1975, 3, 1189–1242.
124. Nagaoka, H.; Amari, S. Differential Geometry of Smooth Families of Probability Distributions; Technical
Report; METR 82-7; University of Tokyo, Tokyo, Japan, 1982.
Entropy 2020, 22, 1100 60 of 61
125. Croll, G.J. The Natural Philosophy of Kazuo Kondo. arXiv 2007, arXiv:0712.0641.
126. Kawaguchi, M. An introduction to the theory of higher order spaces I. The theory of Kawaguchi spaces.
RAAG Memoirs 1960, 3, 718–734.
127. Barndorff-Nielsen, O.E.; Cox, D.R.; Reid, N. The role of differential geometry in statistical theory. Int. Stat.
Rev. 1986, 54, 83–96.
128. Nomizu, K.; Sasaki, T. Affine Differential Geometry: Geometry of Affine Immersions;
Cambridge University Press: Cambridge, UK, 1994.
129. Norden, A.P. On pairs of conjugate parallel displacements in multidimensional spaces. Doklady Akademii
Nauk SSSR (Comptes Rendus de l'Académie des Sciences de l'URSS) 1945, 49, 1345–1347.
130. Sen, R.N. On parallelism in Riemannian space I. Bull. Calcutta Math. Soc. 1944, 36, 102–107.
131. Sen, R.N. On parallelism in Riemannian space II. Bull. Calcutta Math. Soc. 1944, 37, 153–159.
132. Sen, R.N. On parallelism in Riemannian space III. Bull. Calcutta Math. Soc. 1946, 38, 161–167.
133. Giné, E.; Nickl, R. Mathematical Foundations of Infinite-Dimensional Statistical Models; Cambridge
University Press: Cambridge, UK, 2015; Volume 40.
134. Amari, S. New Developments of Information Geometry; Jouhou Kikagaku no Shintenkai; Saiensu’sha:
Tokyo, Japan, 2014. (In Japanese)
135. Fujiwara, A. Foundations of Information Geometry; Jouhou Kikagaku no Kisou; Makino Shoten: Tokyo,
Japan, 2015; p. 223. (In Japanese)
136. Mitchell, A.F.S. The information matrix, skewness tensor and a-connections for the general multivariate
elliptic distribution. Ann. Inst. Stat. Math. 1989, 41, 289–304.
137. Zhang, Z.; Sun, H.; Zhong, F. Information geometry of the power inverse Gaussian distribution. Appl. Sci.
2007, 9, 194–203.
138. Peng, T.L.L.; Sun, H. The geometric structure of the inverse gamma distribution. Contrib. Algebra Geom.
2008, 49, 217–225.
139. Zhong, F.; Sun, H.; Zhang, Z. The geometry of the Dirichlet manifold. J. Korean Math. Soc. 2008, 45, 859–870.
140. Peng, L.; Sun, H.; Jiu, L. The geometric structure of the Pareto distribution. Bol. de la Asoc. Mat. Venez.
2007, 14, 5–13.
141. Pistone, G. Nonparametric information geometry. In Geometric Science of Information; Springer: Berlin,
Germany, 2013; pp. 5–36.
142. Hayashi, M. Quantum Information; Springer: Berlin, Germany, 2006.
143. Pardo, M.d.C.; Vajda, I. About distances of discrete distributions satisfying the data processing theorem
of information theory. IEEE Trans. Inf. Theory 1997, 43, 1288–1293.
144. Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and k-means++ clustering. arXiv
2013, arXiv:1309.7109.
145. Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of
the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Brisbane, Australia, 19–24 April 2015; pp. 2016–2020.
146. Nielsen, F.; Nock, R. Patch Matching with Polynomial Exponential Families and Projective Divergences.
In Proceedings of the International Conference on Similarity Search and Applications (SISAP), Tokyo,
Japan, 24–26 October 2016; pp. 109–116.
147. Nielsen, F.; Sun, K.; Marchand-Maillet, S. On Hölder Projective Divergences. Entropy 2017, 19, 122.
148. Nielsen, F.; Barbaresco, F. (Eds.) Geometric Science of Information; Lecture Notes in Computer Science;
Springer: Berlin, Germany, 2013; Volume 8085, doi:10.1007/978-3-642-40020-9.
149. Nielsen, F.; Barbaresco, F. (Eds.) Geometric Science of Information; Lecture Notes in Computer Science;
Springer: Berlin, Germany, 2015; Volume 9389, doi:10.1007/978-3-319-25040-3.
150. Nielsen, F.; Barbaresco, F. (Eds.) Geometric Science of Information; Lecture Notes in Computer Science;
Springer: Berlin, Germany, 2017; Volume 10589, doi:10.1007/978-3-319-68445-1.
151. Nielsen, F. Geometric Structures of Information; Springer: Berlin, Germany, 2018.
152. Nielsen, F. Geometric Theory of Information; Springer: Berlin, Germany, 2014.
153. Ay, N.; Gibilisco, P.; Matús, F. Information Geometry and its Applications: On the Occasion of Shun-ichi
Amari’s 80th Birthday, IGAIA IV Liblice, Czech Republic, 12–17 June 2016; Springer Proceedings in
Mathematics & Statistics; Springer: Berlin, Germany, 2018; Volume 252.
154. Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer: Berlin, Germany, 2011.
155. Nielsen, F.; Sun, K. Guaranteed bounds on the Kullback–Leibler divergence of univariate mixtures. IEEE
Signal Process. Lett. 2016, 23, 1543–1546.
156. Gordon, G.J. Approximate Solutions to Markov Decision Processes. Ph.D. Thesis, Department of
Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, 1999.
157. Telgarsky, M.; Dasgupta, S. Agglomerative Bregman clustering. In Proceedings of the 29th International
Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012; Omnipress: Madison, WI, USA,
pp. 1011–1018.
158. Yoshizawa, S.; Tanabe, K. Dual differential geometry associated with Kullback–Leibler information on the
Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999, 35, 113–137.
159. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy
2019, 21, 485.
160. Niculescu, C.; Persson, L.E. Convex Functions and Their Applications; Springer: Berlin, Germany, 2006.
161. Nielsen, F.; Nock, R. The Bregman chord divergence. In Proceedings of the International Conference on
Geometric Science of Information, Toulouse, France, 27–29 August 2019; Springer: Berlin, Germany, pp.
299–308.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open
access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).