Cramér-Rao Lower Bound and Information Geometry∗
Frank Nielsen
Sony Computer Science Laboratories Inc., Japan
École Polytechnique, LIX, France
Bhatia and C.S. Rajan, Eds.), special volume of Texts and Readings In Mathematics (TRIM), Hindustan Book Agency, 2013. http://www.hindbook.com/trims.php
1 Introduction and Historical Background
C. R. Rao's 1945 contribution has led to the birth of a flourishing field of Information Geometry [6].
2.1 Rao’s lower bound for statistical estimators
For a fixed integer n ≥ 2, let {X1 , ..., Xn } be a random sample of size n
on a random variable X which has a probability density function (pdf) (or,
probability mass function (pmf)) p(x). Suppose the unknown distribution
p(x) belongs to a parameterized family F = {pθ(x) | θ ∈ Θ} of distributions indexed by a parameter θ. The likelihood function is

$$L(\theta; x_1, \ldots, x_n) = p_\theta(x_1, \ldots, x_n),$$

the joint density of the sample viewed as a function of θ. When the observations are independently drawn from a common pdf (or, pmf) pθ(x) (for instance, if X1, . . . , Xn is a random sample from pθ(x)), the likelihood function is

$$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta(x_i).$$
An estimator T = T(X1, . . . , Xn) is said to be unbiased when

$$E_\theta(T) = \theta, \quad \text{for all } \theta \in \Theta.$$
Consider probability distributions with pdf (or, pmf) satisfying the following regularity conditions:

• The support {x | pθ(x) > 0} is identical for all distributions (and thus does not depend on θ),

• ∫ pθ(x) dx can be differentiated under the integral sign with respect to θ.
Let us first consider the case of uni-parameter distributions like Poisson dis-
tributions with mean parameter λ. These families are also called order-1
families of probabilities. The C. R. Rao lower bound in the case of uni-
parameter distributions can be stated now.
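In its standard form (restated here; the multi-parameter version appears as Theorem 2 below), the bound says that any unbiased estimator θ̂ of θ built from an IID sample of size n satisfies

$$V[\hat{\theta}] \geq \frac{1}{n\, I(\theta)}, \qquad \text{where } I(\theta) = E_\theta\!\left[\left(\frac{\mathrm{d}}{\mathrm{d}\theta} \log p_\theta(x)\right)^{\!2}\right]$$

is the Fisher information. For the Poisson example with mean parameter λ, the log-density is l(x; λ) = log pλ(x) = x log λ − λ − log x!, whose first two derivatives with respect to λ are computed next.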
$$l'(x; \lambda) = -1 + \frac{x}{\lambda}, \qquad l''(x; \lambda) = -\frac{x}{\lambda^2}.$$
The first derivative is technically called the score function. It follows that
$$I(\lambda) = -E_\lambda\!\left[\frac{\mathrm{d}^2 l(x; \lambda)}{\mathrm{d}\lambda^2}\right] = \frac{1}{\lambda^2}\, E_\lambda[x] = \frac{1}{\lambda}$$
since E[X] = λ for a random variable X following a Poisson distribution
with parameter λ: X ∼ Poisson(λ). What the RLB theorem states in plain
words is that for any unbiased estimator λ̂ based on an IID sample of size n
of a Poisson distribution with parameter θ∗ = λ∗ , the variance of λ̂ cannot
go below 1/(nI(λ∗)) = λ∗/n.
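A quick Monte Carlo experiment makes this concrete. The sketch below (assuming NumPy is available; λ*, n and the number of trials are arbitrary choices) uses the sample mean, which is unbiased and efficient for the Poisson family, so its empirical variance should essentially match λ*/n.

```python
import numpy as np

# Monte Carlo sanity check of the Rao lower bound for the Poisson family
# (an illustrative sketch; parameter values and sample sizes are arbitrary).
rng = np.random.default_rng(0)

lam_star = 3.0   # true parameter lambda*
n = 50           # sample size
trials = 20_000  # number of simulated samples

# The sample mean is an unbiased estimator of lambda.
estimates = rng.poisson(lam_star, size=(trials, n)).mean(axis=1)

empirical_var = estimates.var()
crlb = lam_star / n   # 1 / (n I(lambda*)) with I(lambda) = 1/lambda

print(f"empirical variance of the sample mean : {empirical_var:.5f}")
print(f"Cramer-Rao lower bound lambda*/n      : {crlb:.5f}")
# The two numbers agree here because the sample mean is an efficient estimator.
```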
The Fisher information, defined as the variance of the score, can be geo-
metrically interpreted as the curvature of the log-likelihood function. When
the curvature is low (log-likelihood curve is almost flat), we may expect some
large amount of deviation from the optimal θ∗ . But when the curvature is
high (peaky log-likelihood), we rather expect a small amount of deviation
from θ∗ .
For multi-parameter distributions, the Fisher information matrix I(θ) is defined by its entries:

$$[I(\theta)]_{ij} = E_\theta\!\left[\frac{\partial}{\partial\theta_i} \log p_\theta(x)\, \frac{\partial}{\partial\theta_j} \log p_\theta(x)\right], \qquad (1)$$

$$[I(\theta)]_{ij} = \int \frac{\partial}{\partial\theta_i} \log p_\theta(x)\, \frac{\partial}{\partial\theta_j} \log p_\theta(x)\; p_\theta(x)\, \mathrm{d}x. \qquad (2)$$
Provided certain regularity conditions are met (see [6], section 2.2), the
Fisher information matrix can be written equivalently as:
$$[I(\theta)]_{ij} = -E_\theta\!\left[\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log p_\theta(x)\right],$$
or as:
$$[I(\theta)]_{ij} = 4 \int_{x \in \mathcal{X}} \frac{\partial}{\partial\theta_i} \sqrt{p_\theta(x)}\; \frac{\partial}{\partial\theta_j} \sqrt{p_\theta(x)}\, \mathrm{d}x.$$
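The agreement between these expressions can be checked numerically. The sketch below (assuming NumPy, and using the uni-parameter Poisson family so that the matrix reduces to a scalar) compares the Monte Carlo average of the squared score with the negative expected second derivative and with the closed form 1/λ.

```python
import numpy as np

# Monte Carlo check that two of the equivalent expectations defining the
# Fisher information coincide, on the uni-parameter Poisson family (a sketch).
rng = np.random.default_rng(1)
lam = 2.5
x = rng.poisson(lam, size=200_000)

score = x / lam - 1.0              # d/dlambda log p_lambda(x)
second_derivative = -x / lam**2    # d^2/dlambda^2 log p_lambda(x)

print("E[score^2]            :", np.mean(score**2))            # ~ 1/lambda
print("-E[second derivative] :", np.mean(-second_derivative))  # ~ 1/lambda
print("closed form 1/lambda  :", 1.0 / lam)
```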
In the case of multi-parameter distributions, the lower bound on the accuracy of unbiased estimators can be extended using the Löwner partial ordering on matrices defined by A ⪰ B ⇔ A − B ⪰ 0, where M ⪰ 0 means M is positive semidefinite [11] (we similarly write M ≻ 0 to indicate that M is positive definite).
The Fisher information matrix is always positive semi-definite [33]. It
can be shown that the Fisher information matrix of regular probability dis-
tributions is positive definite, and therefore always invertible. Theorem 1 on
the lower bound on the inaccuracy extends to the multi-parameter setting as
follows:
(Footnote: Multi-parameter distributions can be univariate, like the 1D Gaussians N(µ, σ), or multivariate, like the Dirichlet distributions or d-dimensional Gaussians.)
Theorem 2 (Multi-parameter Rao lower bound (RLB)) Let θ be a vector-valued parameter. Then for an unbiased estimator θ̂ of θ∗ based on an IID random sample of n observations, one has V[θ̂] ⪰ n⁻¹I⁻¹(θ∗), where V[θ̂] now denotes the variance-covariance matrix of θ̂ and I⁻¹(θ∗) denotes the inverse of the Fisher information matrix evaluated at the optimal parameter θ∗.
Therefore,

$$V[\hat{\theta}] \succeq n^{-1} I(\theta^*)^{-1} = \begin{bmatrix} n^{-1}\sigma^{*2} & 0 \\ 0 & 2 n^{-1} \sigma^{*4} \end{bmatrix}.$$
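This matrix bound can be probed numerically. The sketch below (assuming NumPy; the true parameters, sample size and trial count are arbitrary, and the θ = (µ, σ²) parameterization implicit in the bound above is used) compares the empirical covariance of the unbiased estimators (sample mean, unbiased sample variance) with n⁻¹I⁻¹(θ*).

```python
import numpy as np

# Monte Carlo illustration of the multi-parameter Rao lower bound for the
# Gaussian family with theta = (mu, sigma^2) (a sketch; values are arbitrary).
rng = np.random.default_rng(2)
mu, sigma2, n, trials = 1.0, 4.0, 30, 50_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
mu_hat = samples.mean(axis=1)            # unbiased estimator of mu
var_hat = samples.var(axis=1, ddof=1)    # unbiased estimator of sigma^2

empirical_cov = np.cov(np.vstack([mu_hat, var_hat]))
bound = np.diag([sigma2 / n, 2 * sigma2**2 / n])   # n^{-1} I(theta*)^{-1}

print("empirical covariance of (mu_hat, var_hat):\n", empirical_cov)
print("lower-bound matrix n^{-1} I^{-1}(theta*):\n", bound)
# The sample mean attains the bound; the unbiased variance estimator has
# variance 2 sigma^4 / (n - 1), strictly above the bound 2 sigma^4 / n.
```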
There has been a continuous flow of research along the lines of the CRLB,
including the case where the Fisher information matrix is singular (positive
semidefinite, e.g. in statistical mixture models). We refer the reader to the
book of Watanabe [47] for a modern algebraic treatment of degeneracies in
statistical learning theory.
information geometry [6]. Although there was already precursor geometric work [35, 12, 36] linking geometry to statistics by the Indian community (Professors Mahalanobis and Bhattacharyya), none of these works studied the differential concepts or made the connection with the Fisher information matrix. C. R. Rao is again a pioneer in offering Statisticians the geometric lens.
$$\mathrm{d}s^2 = \sum_{i,j} g_{ij}(\theta)\, \mathrm{d}\theta_i\, \mathrm{d}\theta_j = (\mathrm{d}\theta)^\top G(\theta)\, \mathrm{d}\theta,$$
The elements gij (θ) form the quadratic differential form defining the el-
ementary length of Riemannian geometry. The matrix G(θ) = [gij (θ)] ≻ 0
is positive definite and turns out to be equivalent to the Fisher information
matrix: G(θ) = I(θ). The information matrix is invariant under monotone transformations of the parameter space [43], which makes it a good candidate for a Riemannian metric.
We shall discuss the concepts of invariance in statistical manifolds in more detail later [18, 38].
In [43], Rao proposed a novel versatile notion of statistical distance in-
duced by the Riemannian geometry beyond the traditional Mahalanobis D-
squared distance [35] and the Bhattacharyya distance [12]. The Mahalanobis
D-squared distance [35] of a vector x to a group of vectors with covariance
matrix Σ and mean µ is defined originally as
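$$D^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$$

(the standard quadratic form, restated here). The Bhattacharyya distance [12] between two parametric densities, which appears in Eq. (3) below, is likewise the standard quantity

$$B(\theta_1, \theta_2) = -\log \int \sqrt{p_{\theta_1}(x)\, p_{\theta_2}(x)}\, \mathrm{d}x.$$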
$$H^2(\theta_1, \theta_2) = \frac{1}{2} \int \left(\sqrt{p_{\theta_1}(x)} - \sqrt{p_{\theta_2}(x)}\right)^2 \mathrm{d}x = 1 - e^{-B(\theta_1, \theta_2)} \leq 1 \qquad (3)$$
2.2.2 Rao's distance: Riemannian distance between two populations
Therefore we need to calculate explicitly the geodesic linking pθ1 (x) to pθ2 (x)
to compute Rao’s distance. This is done by solving the following second
order ordinary differential equation (ODE) [6]:
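(restating the standard geodesic equation in terms of the Christoffel symbols of the first kind defined below)

$$g_{ki}(\theta)\,\ddot{\theta}^i(t) + \Gamma_{k,ij}(\theta)\,\dot{\theta}^i(t)\,\dot{\theta}^j(t) = 0,$$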
where Einstein summation [6] convention has been used to simplify the math-
ematical writing by removing the leading sum symbols. The coefficients Γk,ij
are the Christoffel symbols of the first kind defined by:
$$\Gamma_{k,ij} = \frac{1}{2}\left(\frac{\partial g_{ik}}{\partial \theta_j} + \frac{\partial g_{kj}}{\partial \theta_i} - \frac{\partial g_{ij}}{\partial \theta_k}\right).$$
To give an example of the Rao distance, consider the smooth manifold of
univariate normal distributions, indexed by the θ = (µ, σ) coordinate system.
The Fisher information matrix is

$$I(\theta) = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{bmatrix} \succ 0. \qquad (4)$$
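As a quick sanity check of Eq. (4), the following sketch (assuming NumPy; µ and σ are arbitrary) estimates I(θ) by the Monte Carlo average of the outer product of the score vector in the (µ, σ) coordinates, which should approach diag(1/σ², 2/σ²).

```python
import numpy as np

# Monte Carlo check of the Fisher information matrix (4) for theta = (mu, sigma),
# using the outer product of the score vector (an illustrative sketch).
rng = np.random.default_rng(3)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=500_000)

# Score components of log N(x; mu, sigma) with respect to mu and sigma.
score_mu = (x - mu) / sigma**2
score_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3

scores = np.vstack([score_mu, score_sigma])   # shape (2, N)
fisher_mc = scores @ scores.T / x.size        # E[score score^T]

print("Monte Carlo estimate:\n", fisher_mc)
print("closed form diag(1/sigma^2, 2/sigma^2):\n",
      np.diag([1 / sigma**2, 2 / sigma**2]))
```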
3 A brief overview of information geometry
Since the seminal work of Rao [43] in 1945, the interplay of differential ge-
ometry with statistics has further strengthened and developed into a new
discipline called information geometry with a few dedicated monographs
[5, 40, 30, 6, 46, 7]. It has been proved by Chentsov and published in his Rus-
sian monograph in 1972 (translated into English in 1982 by the AMS [18]) that
the Fisher information matrix is the only invariant Riemannian metric for
statistical manifolds (up to some scalar factor). Furthermore, Chentsov [18]
proved that there exists a family of connections, termed the α-connections,
that ensures statistical invariance.
Under a smooth invertible mapping of the sample space

$$m: \mathcal{X} \to \mathcal{Y}, \qquad x \mapsto y = m(x),$$

a probability density p(x) is converted into another density q(y) such that

$$p(x)\,\mathrm{d}x = q(y)\,\mathrm{d}y, \qquad \mathrm{d}y = |M(x)|\,\mathrm{d}x,$$
where |M(x)| denotes the determinant of the Jacobian matrix [6] of the
transformation m (i.e., the partial derivatives):
$$M(x) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_d}{\partial x_1} & \cdots & \frac{\partial y_d}{\partial x_d} \end{bmatrix}.$$
It follows that

$$q(y) = q(m(x)) = p(x)\,|M(x)|^{-1}.$$
For any two densities p1 and p2, the f-divergence on the transformed densities q1 and q2 can be rewritten as

$$D_f(q_1 : q_2) = \int_{y \in \mathcal{Y}} q_1(y)\, f\!\left(\frac{q_2(y)}{q_1(y)}\right) \mathrm{d}y = \int_{x \in \mathcal{X}} p_1(x)\,|M(x)|^{-1} f\!\left(\frac{p_2(x)}{p_1(x)}\right) |M(x)|\, \mathrm{d}x = D_f(p_1 : p_2).$$
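As a small numerical illustration of this invariance, the sketch below applies an affine map y = ax + b (a hypothetical choice of m) to two univariate Gaussians and evaluates the Kullback-Leibler divergence, an f-divergence, through its standard closed form before and after the transformation.

```python
import math

# Invariance D_f(q1:q2) = D_f(p1:p2) illustrated with the Kullback-Leibler
# divergence under an affine map y = a*x + b applied to two Gaussians (a sketch).

def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) : N(m2, s2^2)) in closed form."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

a, b = 3.0, -1.0                       # the transformation y = a*x + b
p1, p2 = (0.0, 1.0), (2.0, 1.5)        # (mean, std) of the two source densities
q1 = (a * p1[0] + b, abs(a) * p1[1])   # transformed densities are again Gaussian
q2 = (a * p2[0] + b, abs(a) * p2[1])

print("KL(p1:p2) =", kl_gauss(*p1, *p2))
print("KL(q1:q2) =", kl_gauss(*q1, *q2))   # identical, as the invariance predicts
```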
Furthermore, the f -divergences are the only divergences satisfying the re-
markable data-processing theorem [24] that characterizes the property of
information monotonicity [4]. Consider discrete distributions on an alphabet
X of d letters. For any partition B = X1 ∪ . . . ∪ Xb of X that merges the alphabet letters into b ≤ d bins, we have

$$D_f(\bar{p}_1 : \bar{p}_2) \leq D_f(p_1 : p_2),$$

where p̄1 and p̄2 are the discrete distributions induced by the partition B on X. That is, we lose discrimination power by coarse-graining the support of the distributions.
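A minimal numerical illustration of this monotonicity (with an arbitrary 4-letter alphabet, a 2-bin partition, and the Kullback-Leibler divergence as the f-divergence) is sketched below.

```python
import numpy as np

# Information monotonicity: merging alphabet letters (coarse-graining)
# can only decrease the Kullback-Leibler divergence. The alphabet, the
# distributions and the partition below are arbitrary illustrative choices.

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def coarse_grain(p):
    # Partition B: merge letters {0, 1} into one bin and {2, 3} into another.
    return np.array([p[0] + p[1], p[2] + p[3]])

p1 = np.array([0.1, 0.4, 0.2, 0.3])
p2 = np.array([0.3, 0.1, 0.4, 0.2])

print("KL(p1 : p2)               =", kl(p1, p2))
print("KL(coarse p1 : coarse p2) =", kl(coarse_grain(p1), coarse_grain(p2)))
```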
The most fundamental f -divergence is the Kullback-Leibler divergence
[19] obtained for the generator f (x) = x log x:
$$\mathrm{KL}(p : q) = \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x.$$
The Kullback-Leibler divergence between two distributions p(x) and q(x) is
equal to the cross-entropy H × (p : q) minus the Shannon entropy H(p):
$$\mathrm{KL}(p : q) = \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x = H^{\times}(p : q) - H(p),$$

with

$$H^{\times}(p : q) = -\int p(x) \log q(x)\, \mathrm{d}x, \qquad H(p) = -\int p(x) \log p(x)\, \mathrm{d}x = H^{\times}(p : p).$$
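This identity is easy to check numerically; a minimal sketch on discrete distributions (arbitrary p and q with common support) follows.

```python
import numpy as np

# Numerical check of KL(p:q) = cross-entropy(p:q) - entropy(p) on
# discrete distributions (a sketch; any strictly positive p and q will do).
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))
cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))

print(kl, cross_entropy - entropy)   # the two values coincide
```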
for estimating the parameter µ of Gaussians). In statistics, the concept
of sufficiency was introduced by Fisher [26]:
“... the statistic chosen should summarize the whole of the relevant
information supplied by the sample. ”
Mathematically, the fact that all information should be aggregated in-
side the sufficient statistic is written as
Pr(x|t, θ) = Pr(x|t).
• Poisson distributions are univariate exponential distributions of order 1 (with X = {0, 1, 2, 3, . . .} and dim Θ = 1) with associated probability mass function

$$\frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$
The canonical exponential family decomposition yields
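(in its standard form, restated here, with t(x) = x the sufficient statistic, θ = log λ the natural parameter, F(θ) = exp(θ) the cumulant function, and k(x) = −log x! the carrier term)

$$p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \exp\bigl(x \log\lambda - \lambda - \log x!\bigr) = \exp\bigl(\theta\, t(x) - F(\theta) + k(x)\bigr).$$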
Here θ is called the natural parameter. The cumulant function F is obtained as the log-Laplace transform

$$F(\theta) = \log \int_{x \in \mathcal{X}} \exp\bigl(\theta^\top t(x) + k(x)\bigr)\, \mathrm{d}x.$$
To illustrate the generic behavior of exponential families in Statistics [14],
let us consider the maximum likelihood estimator for a distribution belonging
to the exponential family. We have the MLE θ̂:
$$\hat{\theta} = (\nabla F)^{-1}\!\left(\frac{1}{n} \sum_{i=1}^{n} t(x_i)\right),$$
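For the Poisson family of the earlier example, t(x) = x and F(θ) = e^θ, so (∇F)⁻¹ is simply the logarithm and the formula reduces to the familiar MLE λ̂ = x̄; a minimal sketch (assuming NumPy):

```python
import numpy as np

# Sketch of the exponential-family MLE formula for the Poisson family,
# where t(x) = x and F(theta) = exp(theta), so (grad F)^{-1}(eta) = log(eta).
rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=10_000)

eta_hat = x.mean()              # (1/n) sum_i t(x_i)
theta_hat = np.log(eta_hat)     # (grad F)^{-1} applied to the average statistic
lambda_hat = np.exp(theta_hat)  # back to the usual mean parameterization

print("theta_hat  =", theta_hat)
print("lambda_hat =", lambda_hat, "(the sample mean, the familiar Poisson MLE)")
```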
The Fisher information matrix of an exponential family is the Hessian of the cumulant function:

$$I(\theta) = \nabla^2 F(\theta) \succ 0.$$
As mentioned earlier, the “:” notation emphasizes that the distance is not a
metric: it satisfies neither the symmetry nor the triangle inequality in general. The divergence BF is called a Bregman divergence [13], and is the canonical distance of dually flat spaces [6]. This Kullback-Leibler divergence on den-
sities ↔ divergence on parameters relies on the dual canonical parameteri-
zation of exponential families [14]. A random variable X ∼ pF,θ (x), whose
distribution belongs to an exponential family, can be dually indexed by its
expectation parameter η such that
$$\eta = E[t(X)] = \int_{x \in \mathcal{X}} t(x)\, e^{\theta^\top t(x) - F(\theta) + k(x)}\, \mathrm{d}x = \nabla F(\theta).$$
The maximum of θ⊤η − F(θ), which defines the Legendre transform F∗(η), is attained for η = ∇F(θ) and is unique since F(θ) is strictly convex (∇²F(θ) ≻ 0). It follows that θ = (∇F)⁻¹(η), where (∇F)⁻¹ denotes the functional inverse of the gradient map. This implies that:
$$\eta = \nabla F(\theta), \qquad \theta = \nabla F^*(\eta).$$
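As a concrete instance of this duality, consider again the Poisson family, for which F(θ) = e^θ (a standard worked example, sketched here):

$$\eta = \nabla F(\theta) = e^{\theta} = \lambda, \qquad F^*(\eta) = \sup_{\theta}\{\theta \eta - F(\theta)\} = \eta \log \eta - \eta, \qquad \theta = \nabla F^*(\eta) = \log \eta.$$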
The Bregman divergence can also be rewritten in a canonical mixed coordinate form CF or in the θ- or η-coordinate systems.

$$D_{-1}(p : q) = \mathrm{KL}(p : q) = \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x,$$

$$D_0(p : q) = D_0(q : p) = 4\left(1 - \int \sqrt{p(x)\, q(x)}\, \mathrm{d}x\right) = 4 H^2(p, q).$$
3.5 Exponential geodesics and mixture geodesics
Information geometry as further pioneered by Amari [6] considers dual affine
geometries introduced by a pair of connections: the α-connection and −α-
connection instead of taking the Levi-Civita connection induced by the Fisher
information Riemannian metric of Rao. The ±1-connections give rise to
dually flat spaces [6] equipped with the Kullback-Leibler divergence [19].
The case of α = −1 denotes the mixture family, and the exponential family
is obtained for α = 1. We omit technical details in this expository paper,
but refer the reader to the monograph [6] for details.
For our purpose, let us say that the geodesics are no longer defined as shortest paths (as in the metric case of the Fisher-Rao geometry) but rather as curves that ensure the parallel transport of vectors [6]. This defines the notion of "straightness" of lines. Riemannian geodesics satisfy both the straightness property and the minimum length requirement. When introducing dual connections, distances are no longer interpreted as curve lengths, and the geodesics are defined by the notion of straightness only.
In information geometry, we have dual geodesics that are expressed for
the exponential family (induced by a convex function F ) in the dual affine
coordinate systems θ/η for α = ±1 as:
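$$\gamma_{pq}: \ \theta(t) = (1-t)\,\theta_p + t\,\theta_q, \qquad \gamma^{*}_{pq}: \ \eta(t) = (1-t)\,\eta_p + t\,\eta_q, \qquad t \in [0, 1],$$

that is (a standard statement restated here in the notation used below, see [6]), the exponential geodesic (α = +1) is affine in the natural θ-coordinates and the mixture geodesic (α = −1) is affine in the expectation η-coordinates.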
In fact, a more general triangle relation (extending the law of cosines) exists:
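(in the Bregman form, a standard three-point identity that follows directly from the definition of BF)

$$B_F(\theta_p : \theta_q) + B_F(\theta_q : \theta_r) - B_F(\theta_p : \theta_r) = (\theta_p - \theta_q)^\top (\eta_r - \eta_q).$$

When the right-hand side vanishes, which is precisely the orthogonality situation described next, we recover the dually flat Pythagorean theorem BF(θp : θr) = BF(θp : θq) + BF(θq : θr).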
Note that the θ-geodesic γpq and the η-geodesic γ∗qr are orthogonal with respect
to the inner product G(q) defined at q (with G(q) = I(q) being the Fisher
information matrix at q). Two vectors u and v in the tangent plane Tq at q
are said to be orthogonal if and only if their inner product equals zero:
u ⊥q v ⇔ u⊤ I(q)v = 0.
Observe that in any tangent plane Tx of the manifold, the inner product
induces a squared Mahalanobis distance:
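$$D^2_x(u, v) = (u - v)^\top I(x)\, (u - v), \qquad u, v \in T_x$$

(written here in the notation of the inner product above).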
the Cramér-Rao lower bound (CRLB) and the Fisher-Rao geometry. Both
contributions are related to the Fisher information, a concept due to Sir R. A. Fisher, the father of mathematical statistics [26], who introduced the concepts of consistency, efficiency and sufficiency of estimators. Rao's 1945 paper is undoubtedly recognized as the cornerstone for introducing differen-
tial geometric methods in Statistics. This seminal work has inspired many
researchers and has evolved into the field of information geometry [6]. Ge-
ometry is originally the science of Earth measurements. But geometry is
also the science of invariance as advocated by Felix Klein's Erlangen program,
the science of intrinsic measurement analysis. This expository paper has
presented the two key contributions of C. R. Rao in his 1945 foundational
paper, and briefly presented information geometry without the burden of
differential geometry (e.g., vector fields, tensors, and connections). Informa-
tion geometry has now ramified far beyond its initial statistical scope, and is
further expanding prolifically in many different new horizons. To illustrate
the versatility of information geometry, let us mention a few research areas:
Geometry with its own specialized language, where words like distances,
balls, geodesics, angles, orthogonal projections, etc., provides “thinking
tools” (affordances) to manipulate non-trivial mathematical objects and no-
tions. The richness of geometric concepts in information geometry helps one
to reinterpret, extend or design novel algorithms and data-structures by en-
hancing creativity. For example, the traditional expectation-maximization
(EM) algorithm [25] often used in Statistics has been reinterpreted and fur-
ther extended using the framework of information-theoretic alternative pro-
jections [3]. In machine learning, the famous boosting technique that learns
a strong classifier by combining linearly weak weighted classifiers has been
revisited [39] under the framework of information geometry. Another strik-
ing example is the study of the geometry of dependence and Gaussianity for
Independent Component Analysis [15].
References
[1] Ali, S.M. and Silvey, S. D. (1966). A general class of coefficients of
divergence of one distribution from another. J. Roy. Statist. Soc. Series
B 28, 131–142.
[5] Amari, S., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L. and
Rao, C. R. (1987). Differential Geometry in Statistical Inference. Lecture
Notes-Monograph Series. Institute of Mathematical Statistics.
[6] Amari, S. and Nagaoka, H. (2000). Methods of Information Geometry.
Oxford University Press.
[9] Banerjee, A., Merugu, S., Dhillon, I. S. and Ghosh, J. (2005). Clustering
with Bregman divergences. J. Machine Learning Res. 6, 1705–1749.
[15] Cardoso, J. F. (2003). Dependence, correlation and Gaussianity in inde-
pendent component analysis. J. Machine Learning Res. 4, 1177–1203.
[23] Dawid, A. P. (2007). The geometry of proper scoring rules. Ann. Instt.
Statist. Math. 59, 77–93.
[24] del Carmen Pardo, M. C. and Vajda, I. (1997). About distances of dis-
crete distributions satisfying the data processing theorem of information
theory. IEEE Trans. Inf. Theory 43, 1288–1293.
[26] Fisher, R. A. (1922). On the mathematical foundations of theoretical
statistics. Phil. Trans. Roy. Soc. London, A 222, 309–368.
[38] Morozova, E. A. and Chentsov, N. N. (1991). Markov invariant geometry
on manifolds of states. J. Math. Sci. 56, 2648–2669.
[39] Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. (2004). Infor-
mation geometry of U-boost and Bregman divergence. Neural Comput.
16, 1437–1481.
[43] Rao, C. R. (1945). Information and the accuracy attainable in the esti-
mation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89.