
Information Geometry

https://doi.org/10.1007/s41884-025-00167-2

RESEARCH

A geometric modeling of Occam’s razor in deep learning

Ke Sun¹ · Frank Nielsen²

¹ CSIRO's Data61, Sydney, Australia
² Sony Computer Science Laboratories Inc. (Sony CSL), Tokyo, Japan

Communicated by Noboru Murata.

Received: 17 December 2023 / Revised: 26 March 2025 / Accepted: 24 May 2025

© Crown 2025

Abstract
Why do deep neural networks (DNNs) benefit from very high dimensional parameter
spaces? The contrast between their huge parameter complexity and their stunning performance in practice is all
the more intriguing, and it cannot be explained using the standard theory of model selection
for regular models. In this work, we propose a geometrically flavored information-
theoretic approach to study this phenomenon. With the belief that simplicity is linked
to better generalization, as grounded in the theory of minimum description length,
the objective of our analysis is to examine and bound the complexity of DNNs. We
introduce the locally varying dimensionality of the parameter space of neural network
models by considering the number of significant dimensions of the Fisher information
matrix, and model the parameter space as a manifold using the framework of singu-
lar semi-Riemannian geometry. We derive model complexity measures which yield
short description lengths for deep neural network models based on their singularity
analysis thus explaining the good performance of DNNs despite their large number of
parameters.

Keywords Information geometry · Deep learning · Minimum description length · Fisher information · Stochastic complexity

1 Introduction

Deep neural networks (DNNs) are usually large models in terms of storage costs.
In the classical model selection theory, such models are not favored as compared to
simple models with the same training performance. For example, if one applies the
Bayesian information criterion (BIC) [1] to a DNN, a shallow neural network (NN)

will be preferred over a deep NN due to the penalty term with respect to (w.r.t.) the
complexity. A basic principle in science is Occam's razor¹, which favors simple
models over complex ones that accomplish the same task. This raises the fundamental
question of how to measure the simplicity or the complexity of a model.
Formally, the preference of simple models has been studied in the area of minimum
description length (MDL) [2–4], also known in another thread of research as the
minimum message length (MML) [5]. By the theory of MDL [4], statistical models
that can most concisely communicate the observed data are favored and expected to
generalize better [6–10]. This is intuitive, as complex models often lead to overfitting.
Consider a parametric family of distributions M = {p(x | θ)} with θ ∈ Θ ⊂ ℝ^D.
The distributions are mutually absolutely continuous, which guarantees all densities
to have the same support. Otherwise, many problems of non-regularity will arise,
as described by [11, 12]. The Fisher information matrix (FIM) I(θ) is a D × D
positive semi-definite (psd) matrix: I(θ) ⪰ 0. The model is called regular if it is (i)
identifiable [13] with (ii) a non-degenerate and finite Fisher information matrix (i.e.,
I(θ) ≻ 0).
In a Bayesian setting, the description length of a set of N i.i.d. observations X =
{x_i}_{i=1}^N ⊂ 𝒳 w.r.t. M can be defined as the number of nats with the coding scheme
of a parametric model p(x | θ) and a prior p(θ). The code length of any x_i is given
by the cross entropy between the empirical distribution δ_i(x) = δ(x − x_i), where
δ(·) denotes the Dirac delta function, and p(x) = ∫ p(x | θ) p(θ) dθ. Therefore, the
description length of X is

    −log p(X) = Σ_{i=1}^N h_×(δ_i : p) = −Σ_{i=1}^N log ∫ p(x_i | θ) p(θ) dθ,    (1)


where h_×(p : q) := −∫ p(x) log q(x) dx denotes the cross entropy between p(x) and
q(x), and log denotes the natural logarithm throughout the paper. The code length means
the cumulative loss of the Bayesian mixture model p(x) w.r.t. the observations X.
Equation (1) corresponds to the Bayesian universal code. In MDL, the optimal code
in terms of the minimax strategy [14] is given by the normalized maximum likelihood
(NML) code. With a suitable choice of the prior, the Bayesian universal code and the
NML code asymptotically coincide [4] with O(1) difference.
By using Jeffreys'² non-informative prior [15] as p(θ), the MDL in eq. (1) can be
approximated (see [2, 3, 16]) as

    χ = −log p(X | θ̂) + (D/2) log(N/2π) + log ∫ √|I(θ)| dθ,    (2)

1 William of Ockham (ca. 1287 — ca. 1347), a monk (friar) and philosopher.
2 Sir Harold Jeffreys (1891–1989), a British statistician.

123
A geometric modeling of Occam’s…

where θ̂ ∈ Θ is the maximum likelihood estimate (MLE), or the projection [15] of
X onto the model, D = dim(Θ) is the model size, N is the number of observations,
and | · | denotes the matrix determinant. In this paper, the symbols χ and O and the
term "razor" all refer to the same concept, that is, the description length of the data X
by the model M. The smaller these quantities, the better.
The first term in eq. (2) is the fitness of the model to the observed data. The second
and the third terms measure the geometric complexity [17] and make χ favor simple
models. The second, O(log N), term only depends on the number of parameters D
and the number of observations N. It penalizes large models with a high degree of
freedom (dof). The third, O(1), term is independent of the observed data and measures
the model capacity, or the total "number" of distinguishable distributions [17] in the
model.
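To make the interplay of the three terms concrete, the following is a minimal numerical sketch of eq. (2) for a regular one-parameter model (a Bernoulli family), where both the Fisher information and its integral are available in closed form. The sample size and the true parameter below are arbitrary illustrative choices, not values used in the paper.

    import numpy as np

    # Toy illustration of eq. (2) for a regular 1-parameter model (Bernoulli).
    # For Bernoulli, I(theta) = 1/(theta(1-theta)) and the capacity integral equals pi.
    rng = np.random.default_rng(0)
    N = 200
    x = rng.binomial(1, 0.3, size=N)           # N coin flips
    theta_hat = x.mean()                       # MLE of the Bernoulli parameter

    # Term 1: fitness, the negative maximized log-likelihood.
    fitness = -(np.sum(x) * np.log(theta_hat) + np.sum(1 - x) * np.log(1 - theta_hat))
    # Term 2: dimension penalty, (D/2) log(N / 2 pi) with D = 1.
    dim_penalty = 0.5 * np.log(N / (2 * np.pi))
    # Term 3: model capacity, log of the integral of sqrt(I(theta)).
    capacity = np.log(np.pi)

    razor = fitness + dim_penalty + capacity
    print(f"fitness={fitness:.2f}  dim penalty={dim_penalty:.2f}  "
          f"capacity={capacity:.2f}  razor={razor:.2f}")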
Unfortunately, this razor χ in eq. (2) does not fit straightforwardly into DNNs,
which are high-dimensional singular models. The FIM I(θ) is a large singular matrix
(not full rank) and the last term may be difficult to evaluate. Based on the second term
on the right-hand side (RHS), a DNN can have very high complexity and therefore
is less favored than a shallow network. This contradicts the good generalization of
DNNs as compared to shallow NNs. These issues call for a new analysis of the MDL
in the DNN setting.
Toward this direction, we make the following contributions in this paper:

– New concepts and methodologies from singular semi-Riemannian geometry [18] to analyze the space of neural networks;
– A definition of the local dimensionality in this space, that is, the amount of non-singularity, with bounding analysis;
– A connection between the f-mean and DNN model complexity with related bounds;
– A new MDL formulation, which explains how the singularities contribute to the "negative complexity" of DNNs: that is, the model turns simpler as the number of parameters grows.

The rest of this paper is organized as follows. Section 2 reviews singularities in infor-
mation geometry. In the setting of a DNN, section 3 introduces its singular parameter
manifold. Section 4 bounds the number of singular dimensions of the parameter man-
ifold of the DNN. Sections 5, 6, 7, and 8 derive our MDL criterion based on two
different priors, and discuss how model complexity is affected by the singular geome-
try. We discuss related work in section 9 and conclude in section 10. Proofs and related
derivations of our main results are provided in the appendix.

2 Lightlike statistical manifold

In this paper, bold capital letters like A denote matrices, bold small letters like a denote
vectors, normal capital/small letters like A/a and Greek letters like α denote scalars,
and calligraphy letters like M denote manifolds (with exceptions). We use X, Y and
Z to denote a collection of N random observations and use x, y and z to denote one
single observation.

123
K. Sun, F. Nielsen

The term "statistical manifold" refers to M = {p(x | θ)}, where each point of
M corresponds to a probability distribution p(x | θ).³ The discipline of information
geometry [15] studies such a space in the Riemannian and more generally differential
geometry framework. Hotelling [20] and independently Rao [21, 22] proposed to
endow a parametric space of statistical models with the Fisher information matrix as
a Riemannian metric:
    I(θ) := E_p[ (∂log p(x | θ)/∂θ) (∂log p(x | θ)/∂θ)^⊤ ],    (3)

where E_p denotes the expectation w.r.t. p(x | θ). The corresponding infinitesimal
squared length element ds² = tr(I(θ) dθ dθ^⊤) = ⟨dθ, dθ⟩_{I(θ)} = dθ^⊤ I(θ) dθ, where
tr(·) means the matrix trace⁴, is independent of the underlying parameterization of the
population space.
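As a quick sanity check of the definition in eq. (3), the following sketch (ours, not from the paper) estimates the FIM of a univariate Gaussian by Monte Carlo averaging of the score outer product and compares it with the closed form diag(1/σ², 2/σ²); the parameter values are arbitrary.

    import numpy as np

    # Monte Carlo check of eq. (3) for N(mu, sigma^2) with theta = (mu, sigma).
    rng = np.random.default_rng(0)
    mu, sigma = 1.0, 2.0
    x = rng.normal(mu, sigma, size=200_000)

    # Score function d/dtheta log p(x | theta) at the true parameters.
    score = np.stack([(x - mu) / sigma**2,
                      -1.0 / sigma + (x - mu)**2 / sigma**3], axis=1)

    fim_mc = score.T @ score / len(x)          # E_p[score score^T]
    fim_exact = np.diag([1 / sigma**2, 2 / sigma**2])
    print(np.round(fim_mc, 3))
    print(fim_exact)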
Amari further developed this approach by revealing the dualistic structure of statis-
tical manifolds which extends the Riemannian framework [15, 23]. The MDL criterion
arising from the geometry of Bayesian inference with Jeffreys’ prior for regular mod-
els is detailed in [16]. In information geometry, the regular assumption is (1) an open
connected parameter space in some Euclidean space; and (2) the FIM exists and is
non-singular. However, in general, the FIM is only positive semi-definite and thus for
non-regular models like neuromanifolds [15] or Gaussian mixture models [24], the
manifold is not Riemannian but singular semi-Riemannian [18, 25]. In the machine
learning community, singularities have often been dealt with as a minor issue: For
example, the natural gradient has been generalized based on the Moore-Penrose inverse
of I(θ) [26] to avoid potential non-invertible FIMs. Watanabe [24] addressed the fact
that most usual learning machines are singular in his singular learning theory which
relies on algebraic geometry. Nakajima and Ohmoto [27] discussed dually flat struc-
tures for singular models.
Recently, preliminary efforts [28, 29] tackle singularity at the core, mostly from
a mathematical standpoint. For example, Jain et al. [29] studied the Ricci curvature
tensor of such manifolds. These mathematical notions are used in the community of
differential geometry or general relativity but have not yet been ported to the machine
learning community.
Following these efforts, we first introduce informally some basic concepts from a
machine learning perspective to define the differential geometry of non-regular sta-
tistical manifolds. The tangent space Tθ (M) is a D-dimensional (D = dim(M))
real vector space, that is the local linear approximation of the manifold M at the
point θ ∈ M, equipped with the inner product induced by I(θ). The tangent bundle
T M:={(θ, v), θ ∈ M, v ∈ Tθ } is the 2D-dimensional manifold obtained by com-
bining all tangent spaces for all θ ∈ M. A vector field is a smooth mapping from M to
T M such that each point θ ∈ M is attached a tangent vector originating from itself.
Vector fields are cross-sections of the tangent bundle. In a local coordinate chart θ,
the vector fields along the frame are denoted as ∂θi . A distribution (not to be confused
3 To be more precise, a statistical manifold [19] is a structure (∇, g, C) on a smooth manifold M, where
g is a metric tensor, ∇ a torsion-free affine connection, and C is a symmetric covariant tensor of order 3.
4 Using the cyclic property of the matrix trace, we have ds² = tr(I(θ) dθ dθ^⊤) = dθ^⊤ I(θ) dθ.

123
A geometric modeling of Occam’s…

Fig. 1 A toy lightlike manifold M with a null curve. The ellipses are Tissot's indicatrices, showing how
circles of infinitesimal radius are distorted by the lightlike geometry on M. On the null curve, the FIM is
degenerate so that ⟨∂θ_i, ∂θ_i⟩_I = 0. Therefore the local dynamic ∂θ_i (tangent vector of the null curve) has
zero length, meaning that it does not change the model. The radical distribution Rad(TM) is formed by
the null curve and its tangent vectors. In the context of DNNs, such dynamics refer to small changes of NN
weights/biases that do not alter the global model

with probability distributions which are points on M) means a vector subspace of


the tangent bundle spanned by several independent vector fields, such that each point
θ ∈ M is associated with a subspace of Tθ (M) and those subspaces vary smoothly
with θ . Its dimensionality is defined by the dimensionality of the subspace, i.e., the
number of vector fields that span the distribution.
In a lightlike manifold [18, 25] M, I(θ) can be degenerate. The tangent space
Tθ (M) is a vector space with a kernel subspace, i.e., a nullspace. A null vector field is
formed by null vectors, whose lengths measured according to the Fisher metric tensor
are all zero. The radical 5 distribution Rad(T M) is the distribution spanned by the null
vector fields. Locally at θ ∈ M, the tangent vectors in Tθ (M) which span the kernel of
I(θ) are denoted as Radθ (T M). In a local coordinate chart, Rad(T M) is well defined
if these Rad_θ(TM) form a valid distribution. We write TM = Rad(TM) ⊕ S(TM),
where "⊕" denotes the direct sum, and the screen distribution S(TM) is complementary
to the radical distribution Rad(TM) and has a non-degenerate induced metric. See
fig. 1 for an illustration of the concept of radical distribution.
We can find a local coordinate frame (a frame is an ordered basis) (θ_1, ⋯, θ_d, θ_{d+1}, ⋯, θ_D),
where the first d dimensions θ_s = (θ_1, ⋯, θ_d) correspond to the screen distribution,
and the remaining d̄ := D − d dimensions θ_r = (θ_{d+1}, ⋯, θ_D) correspond
to the radical distribution. The local inner product ⟨·, ·⟩_I satisfies

    ⟨∂θ_i, ∂θ_j⟩_I = δ_ij,  (∀ 1 ≤ i, j ≤ d)
    ⟨∂θ_i, ∂θ_k⟩_I = 0,   (∀ d + 1 ≤ i ≤ D, 1 ≤ k ≤ D)

5 Radical stems from Latin and means root.

123
K. Sun, F. Nielsen

where δ_ij = 1 if and only if (iff) i = j, and δ_ij = 0 otherwise. Unfortunately, this
frame is not unique [30]. We will abuse I to denote both the FIM of θ and the FIM
of θ_s. One has to remember that I(θ) ⪰ 0, while I(θ_s) ≻ 0 is a proper Riemannian
metric. Hence, both I^{−1}(θ_s) and log |I(θ_s)| are well-defined.
Remark Notice that the Fisher information matrix is covariant under reparameterization.
That is, let θ(λ) be an invertible smooth reparameterization of λ. Then the FIM
rewrites in the θ-parameterization as:

    I(θ) = J_{θ→λ} I(λ(θ)) J_{θ→λ}^⊤,    (4)

where J_{θ→λ} is the full rank Jacobian matrix.


The natural gradient flows (vector fields on M) with respect to λ and θ coincide
but not the natural gradient descent methods (learning paths that consist of sequences
of points on M) because of the non-zero learning step sizes.
Furthermore, the ranks of I(θ) and I(λ) as well as the dimensions of the screen
and radical distributions coincide. Hence, the notion of singularities is intrinsic and
independent of the smooth reparameterization.

3 Lightlike neuromanifold

This section instantiates the concepts in the previous section 2 in terms of a sim-
ple DNN predictive model. The random variable x = (z, y) of interest consists of
two components: z, referred to as the “input”, and y, referred to as the “target”. By
assumption, their joint probability distribution is specified by

log p(x | ψ, θ ) = log p(z | ψ) + log p(y | z, θ ),

where p(z | ψ) is a generative model of z which is parameterized by ψ, p(y | z, θ )


is a predictive DNN, and θ consists of all neural network parameters.
Our main subject is the latter predictive model p(y | z, θ ) and its parameter
manifold Mθ . Here, we need the generative model p(z | ψ) for the purpose of
discussing how the geometry of Mθ is affected by the choice of p(z | ψ) and can be
studied independent of the parameter space of p(z | ψ), which we denote as Mψ .
In the end, our results do not depend on the specific form of p(z) or whether it is
parametric.
For p(y | z, θ), we consider a deep feed-forward network with L layers, uniform
width M except the last layer which has m output units (m < M), input z ∈ Z
with dim(Z) = M, pre-activations h^l of size M (except that in the last layer, h^L has
m elements), post-activations z^l of size M, weight matrices W^l and bias vectors b^l
(1 ≤ l ≤ L). The layers are given by

    z^l = φ(h^l),
    h^l = W^l z^{l−1} + b^l,    (5)
    z^0 = z,

123
A geometric modeling of Occam’s…

where φ is an element-wise nonlinear activation function such as ReLU [31].
Without loss of generality, we assume multinomial⁶ output units and the DNN
output [32]

    y ∼ Multinomial(SoftMax(h^L))

is a random label in the set {1, ⋯, m}, where

    SoftMax(t) := (1/Σ_{i=1}^m exp(t_i)) (exp(t_1), exp(t_2), ⋯, exp(t_m))

denotes the softmax function. SoftMax(h^L) is a random point in Δ_m, the (m − 1)-dimensional
statistical simplex. Therefore, p(y = k) = exp(h_k^L)/Σ_{j=1}^m exp(h_j^L),
k = 1, ⋯, m. The neural network parameters θ consist of W^l and b^l, l = 1, ⋯, L.
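The following is a minimal numpy sketch of the feed-forward model in eq. (5) with the multinomial output unit above; the depth, width and number of classes are our own illustrative assumptions, not values used in the paper.

    import numpy as np

    # Sketch of eq. (5): L layers of uniform width M with ReLU activations,
    # and m softmax output units in the last layer.
    rng = np.random.default_rng(0)
    L, M, m = 3, 8, 4                                # depth, width, #classes

    # Weight matrices W^l and biases b^l; the last layer maps to m outputs.
    Ws = [rng.normal(0, 1 / np.sqrt(M), size=(M if l < L - 1 else m, M)) for l in range(L)]
    bs = [np.zeros(M if l < L - 1 else m) for l in range(L)]

    def softmax(t):
        e = np.exp(t - t.max())                      # numerically stable softmax
        return e / e.sum()

    def forward(z):
        """Return the last pre-activation h^L and the class probabilities o(z)."""
        for l in range(L):
            h = Ws[l] @ z + bs[l]                    # h^l = W^l z^{l-1} + b^l
            z = np.maximum(h, 0.0) if l < L - 1 else h   # ReLU, except last layer
        return h, softmax(h)

    z = rng.normal(size=M)                           # one input sample
    h_L, o = forward(z)
    y = rng.choice(m, p=o)                           # y ~ Multinomial(SoftMax(h^L))
    print(h_L.round(3), o.round(3), y)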


In this supervised setting, the code length in eq. (1) means the predictive loss of the
Bayesian mixture model p(x) = p(z) ∫ p(y | z, θ) p(θ) dθ w.r.t. the observed pairs
(z_i, y_i). The smaller the code length, the more accurate the prediction.
All such neural networks NNθ when θ varies in a parameter space are referred to
as the neuromanifold: Mθ = {NNθ : θ ∈ }. Similarly, the parameter space of the
distribution family p(z | ψ) is denoted as Mψ . In machine learning, we are often
interested in the FIM w.r.t. θ as it reveals the geometry of the parameter space. However,
the FIM can also be computed relatively w.r.t. a subset of θ in a sub-system [33].
By the definition in eq. (3), the FIM on the product manifold M_ψ × M_θ is in a
block-diagonal form

    I(ψ, θ) = [ I(ψ)   0
                0    I(θ) ].    (6)
The off-diagonal blocks are zero-matrices (denoted as 0) because the generative and
predictive models do not share parameters. Indeed, we have

    E_p[ (∂log p(x | ψ, θ)/∂ψ) (∂log p(x | ψ, θ)/∂θ)^⊤ ] = E_p[ (∂log p(z | ψ)/∂ψ) (∂log p(y | z, θ)/∂θ)^⊤ ]
        = E_{p(z|ψ)}[ (∂log p(z | ψ)/∂ψ) E_{p(y|z,θ)}[ (∂log p(y | z, θ)/∂θ)^⊤ ] ] = 0,

where E_{p(y|z,θ)}[ ∂log p(y | z, θ)/∂θ ] is the expectation of the score function and is always
zero. The metric I(ψ, θ) is a product metric, meaning that the geometry of M_θ defined
by I(θ) can be studied separately from the geometry of M_ψ.
As we are interested in the predictive model corresponding to the diagonal block
I(θ), we further have (see e.g. [34, 35] for derivations)

    I(θ) = E_{p(z)}[ (∂h^L(z)/∂θ)^⊤ C(z) (∂h^L(z)/∂θ) ],    (7)

6 In fact, a generalization of the Bernoulli distribution with integer k ≥ 2 mutually exclusive events, called
informally a multinoulli distribution since it is a multinomial distribution with a single trial.

123
K. Sun, F. Nielsen

where the expectation is taken w.r.t. p(z) := p(z | ψ), an underlying true distribution
in the input space depending on the parameter ψ; ∂h^L(z)/∂θ is the m × D parameter-output
Jacobian matrix based on a given input z; C(z) := diag(o(z)) − o(z) o(z)^⊤ ⪰ 0, where diag(·)
means the diagonal matrix with the given diagonal entries; and o(z) := SoftMax(h^L(z))
is the vector of predicted class probabilities of z. By the definition of SoftMax, each dimension
of o(z) represents a positive probability, although o(z) can be arbitrarily close to a
one-hot vector. As a result, the kernel of the psd matrix C(z) is given by {λ1 : λ ∈ ℝ},
where 1 is the vector of all 1's.
In eq. (7), I(θ) is the single-observation FIM. It is obvious that the FIM w.r.t.
the joint distribution p(X | θ) of multiple observations is N I(θ) (Fisher information
is additive), so that I(θ) does not scale with N. In theory, computing I(θ) requires
assuming p(z), which depends on the parameter ψ. This makes sense as (ψ_1, θ) and
(ψ_2, θ) with ψ_1 ≠ ψ_2 are different points on the product manifold M_ψ × M_θ and
thus their I(θ) should be different. In practice, one only gets access to a set of N i.i.d.
samples drawn from an unknown p(z | ψ). In this case, it is reasonable to take p(z) in
eq. (7) to be the empirical distribution p̂(z) := (1/N) Σ_{i=1}^N δ(z − z_i), then

    I(θ) = Î(θ) := (1/N) Σ_{i=1}^N (∂h^L(z_i)/∂θ)^⊤ C(z_i) (∂h^L(z_i)/∂θ).    (8)

The FIM computed in this way does not rely on the assumption of a parametric
generative model p(z | ψ) or the choice of a particular ψ. Î(θ) can be directly computed
from the observed z_i's and does not depend on the observed y_i's. Although denoted
differently from I(θ) in the current paper, this Î(θ) is a standard version of the definition
of the FIM for neural networks [33, 35–37].
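To illustrate eq. (8), the sketch below (ours, not from the paper) computes Î(θ) for the simplest instance, a one-layer softmax model h(z) = Wz + b, whose parameter-output Jacobian has a closed form; the sizes and the random data are illustrative assumptions. The rank printed at the end already hints at the singularity discussed in section 4.

    import numpy as np

    # Empirical FIM of eq. (8) for a one-layer softmax model h(z) = W z + b.
    rng = np.random.default_rng(0)
    m, M, N = 3, 5, 4                         # classes, input width, sample size
    W = rng.normal(0, 1 / np.sqrt(M), size=(m, M))
    b = np.zeros(m)
    Z = rng.normal(size=(N, M))               # observed inputs z_1, ..., z_N
    D = m * M + m                             # number of parameters

    def softmax(t):
        e = np.exp(t - t.max())
        return e / e.sum()

    fim = np.zeros((D, D))
    for z in Z:
        o = softmax(W @ z + b)                # predicted class probabilities o(z)
        C = np.diag(o) - np.outer(o, o)       # C(z), psd with kernel spanned by 1
        # Jacobian of h w.r.t. theta = (vec(W), b), with W flattened row by row:
        # dh_k / dW_ij = delta_{ki} z_j  and  dh_k / db_i = delta_{ki}.
        J = np.hstack([np.kron(np.eye(m), z[None, :]), np.eye(m)])
        fim += J.T @ C @ J / N                # accumulate eq. (8)

    print("D =", D, " rank of empirical FIM =", np.linalg.matrix_rank(fim))
    print("upper bound from proposition 2: N (m - 1) =", N * (m - 1))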
By considering the neural network weights and biases as random variables satisfy-
ing a prescribed prior distribution [38, 39], this I(θ) can be regarded as a random
matrix [40] depending on the structure of the DNN and the prior. The empirical
density of I(θ) is the empirical distribution of its eigenvalues {λ_i}_{i=1}^D, that is,
ρ_D(λ) = (1/D) Σ_{i=1}^D δ(λ − λ_i). If at the limit D → ∞ the empirical density converges to a
probability density function (pdf), then

    ρ_I(λ) := lim_{D→∞} ρ_D(λ)    (9)

is called the spectral density of the Fisher information matrix.


For DNN, we assume that
(A1) At the MLE θ̂ , the prediction SoftMax(h L (z i )) perfectly recovers (tending to be
one-hot vectors) the training target yi , for all the training samples (z i , yi ).
In this case, the negative Hessian of the average log-likelihood

    J(θ) := −(1/N) ∂²log p(X | θ)/∂θ ∂θ^⊤ = −(1/N) Σ_{i=1}^N ∂²log p(y_i | z_i, θ)/∂θ ∂θ^⊤


Table 1 The FIM and the observed FIM. The last three columns explain whether the tensor depends on
the observed z_i's, whether it depends on the observed y_i's, and whether it can be computed in practice
based on empirical observations

    Notation | Name                          | Depends on z_i | Depends on y_i | Computable
    I(θ)     | FIM (w.r.t. true p(z))        | No             | No             | No
    Î(θ)     | FIM (w.r.t. empirical p̂(z))   | Yes            | No             | Yes
    J(θ)     | observed FIM                  | Yes            | Yes            | Yes

is called the observed FIM (sample-based FIM), which is also known as the "empirical
Fisher" in machine learning literature [36, 37]. In the notation explained in table 1,
the FIM I depends on the true distribution p(z) and does not depend on the observed
samples. In the expression of the FIM in eq. (7), if p(z) = p̂(z), then I becomes
Î, which depends on the observed inputs z_i. The observed FIM J depends on both
the observed inputs z_i and the observed targets y_i. If p(z) = p̂(z), the observed
FIM coincides with the FIM at the MLE θ̂ and J(θ̂) = Î(θ̂). For general statistical
models, there is a residual term between these two matrices which scales with the
training error (see e.g. Eq. 6.19 in section 6 of [41], or eq. (A1) in the appendix). How
these different metric tensors are named is a matter of terminology. One should
distinguish them by examining whether/how they depend (partially) on the observed
information.

4 Local dimensionality

This section quantitatively measures the singularity of the neuromanifold. Our main
definitions and results do not depend on the settings introduced in the previous section
and can be generalized to similar models, including stochastic neural networks [42].
For example, if the output units or the network structure is changed, the expression of
the FIM and related results can be adapted straightforwardly. Our derivations depend
on the facts that (1) DNNs have a large amount of singularity, corresponding to zero eigenvalues
of the FIM; and (2) the spectrum of the (observed) FIM has many eigenvalues close to
zero [43]. That being said, our results also apply to singular models [24] with similar
properties.
Definition 1 (Local dimensionality) The local dimensionality d(θ) := rank(I(θ)) of
the neuromanifold M at θ ∈ M refers to the rank of the FIM I(θ). If p(z) = p̂(z),
then d(θ) = d̂(θ) := rank(Î(θ)).

The local dimensionality d(θ ) is the number of degrees of freedom at θ ∈ M which


can change the probabilistic model p(y | z, θ ) in terms of information theory. One can
find a reparameterized DNN with d(θ ) parameters, which is locally equivalent to the
original DNN with D parameters. Recall the dimensionality of the tangent bundle is
two times the dimensionality of the manifold.
Remark The dimensionality of the screen distribution S(T M) at θ is 2 d(θ ).

123
K. Sun, F. Nielsen

By definition, the FIM as the singular semi-Riemannian metric of M must be


psd. Therefore it only has positive and zero eigenvalues, and the number of positive
eigenvalues d(θ ) is not constant as θ varies in general.
Remark The local metric signature (number of positive, negative, zero eigenvalues
of the FIM) of the neuromanifold M is (d(θ ), 0, D − d(θ )), where d(θ ) is the local
dimensionality.
The local dimensionality d(θ) depends on the specific choice of p(z). If p(z) =
p̂(z), then d(θ) = d̂(θ) = rank(Î(θ)). On the other hand, one can use the rank
of the negative Hessian J(θ) (i.e., the observed rank) to get an approximation of the
local dimensionality: d(θ) ≈ rank(J(θ)). At the MLE θ̂, this approximation becomes
accurate. We simply write d and d̂, instead of d(θ) and d̂(θ), if θ is clear from the
context.
We first show that the lightlike dimensions of M do not affect the neural network
model in eq. (5).

Lemma 1 If (θ, Σ_j α_j ∂θ_j) ∈ Rad(TM), i.e. ⟨Σ_j α_j ∂θ_j, Σ_j α_j ∂θ_j⟩_{I(θ)} = 0, then
almost surely we have (∂h^L(z)/∂θ) α = λ(z) 1, where λ(z) ∈ ℝ.

By lemma 1, the Jacobian ∂h^L(z)/∂θ is the local linear approximation of the map θ → h^L.
The dynamic α (coordinates of a tangent vector) on M causes a uniform increment on
the output h^L, which, after the SoftMax function, does not change the neural network
map z → y.
Then, we can upper-bound the local dimensionality using the rank of the parameter-output
Jacobian ∂h^L(z)/∂θ.

Proposition 2 ∀θ ∈ M, d̂(θ) ≤ Σ_{i=1}^N min{ rank(∂h^L(z_i)/∂θ), m − 1 }.

Remark While the total number D of free parameters is unbounded in DNNs, the local
dimensionality estimated by d̂(θ ) grows at most linearly w.r.t. the sample size N , given
fixed m (size of the last layer). If both N and m are fixed, then d̂(θ ) is bounded even
when the network width M → ∞ and/or depth L → ∞.

The above bound is based on the inequality rank(Σ_i A_i) ≤ Σ_i rank(A_i) for any
matrices A_i, which could lead to loose bounds. Alternatively, an upper bound can be
established directly based on the definition of the matrix rank.

Proposition 3 For all θ ∈ M, we have d(θ) ≤ dim span( ∪_{z∈supp(p)} Row(∂h^L(z)/∂θ) ),
where supp(p) is the support of p(z), and Row(A) denotes the row vectors of the matrix
A. Similarly, d̂(θ) has an upper bound obtained by replacing supp(p) with supp(p̂)
on the RHS, i.e. the union is over the observed z_i's.
In summary, the lower the rank of ∂h^L(z)/∂θ, the more potential singularities in the
neuromanifold. Note that the Jacobian ∂h^L(z)/∂θ can be further written as

    ∂h^L(z)/∂θ = ( ∂h^L(z)/∂w^1, ∂h^L(z)/∂w^2, ⋯, ∂h^L(z)/∂w^L ),    (10)

where w^l contains all parameters in the l'th layer (l = 1, ⋯, L) and is obtained by
stacking the columns of W^l and b^l into a long vector. We can bound ∂h^L(z)/∂w^l individually
as below.

Proposition 4 It holds that

    rank(∂h^L(z)/∂w^l) ≤ rank(W^L Λ^{L−1} W^{L−1} ⋯ W^{l+1} Λ^l) ≤ min_{s=l}^{L−1} rank(Λ^s),    (11)

where Λ^l = diag(φ'(h^l_1), ⋯, φ'(h^l_M)) is the Jacobian of the l'th activation layer.

Observe that the upper bounds in proposition 4 are monotonically decreasing with
respect to L. For example, we have

    rank(W^L Λ^{L−1} ⋯ W^{l+1} Λ^l) ≤ rank(W^L Λ^{L−1} ⋯ W^{l+2} Λ^{l+1}).    (12)

Based on these upper bounds, ∂h^L(z)/∂w^l is potentially more singular for layers that are
close to the input.

Remark If φ is ReLU, then the diagonal entries of Λ^s form a binary vector, and
rank(Λ^s) is the number of activated neurons in the s'th layer. In this case, the upper
bound min_{s=l}^{L−1} rank(Λ^s) means the smallest number of activated neurons across all
layers.
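The second inequality of eq. (11) is easy to check numerically; the sketch below does so for a small bias-free ReLU network with our own illustrative sizes, counting the activated neurons of each layer via the trace of Λ^s.

    import numpy as np

    # Check of the second inequality in eq. (11) for l = 1 (biases omitted).
    rng = np.random.default_rng(1)
    L, M, m = 4, 6, 3
    Ws = [rng.normal(0, 1 / np.sqrt(M), size=(M if l < L - 1 else m, M)) for l in range(L)]
    z = rng.normal(size=M)

    # Forward pass, recording the 0/1 activation Jacobians Lambda^1, ..., Lambda^{L-1}.
    Lambdas = []
    for l in range(L - 1):
        h = Ws[l] @ z
        Lambdas.append(np.diag((h > 0).astype(float)))
        z = np.maximum(h, 0.0)

    # Product W^L Lambda^{L-1} W^{L-1} ... W^2 Lambda^1 appearing in eq. (11).
    prod = Ws[L - 1]                                   # W^L
    for s in range(L - 1, 1, -1):                      # paper indices s = L-1, ..., 2
        prod = prod @ Lambdas[s - 1] @ Ws[s - 1]       # append Lambda^s W^s
    prod = prod @ Lambdas[0]                           # trailing Lambda^1

    lhs = np.linalg.matrix_rank(prod)
    rhs = min(int(np.trace(Lam)) for Lam in Lambdas)   # min_s rank(Lambda^s)
    print("rank of the product:", lhs, " <= min_s rank(Lambda^s):", rhs)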

To understand d(θ ), one can parameterize the DNN, locally, with only d(θ ) free
parameters while maintaining the same predictive model. The log-likelihood is a func-
tion of these d(θ ) parameters, and therefore its Hessian has at most rank d(θ ). In theory,
one can only reparameterize M so that at one single point θ̂, the screen and radical
distributions are separated based on the coordinate chart. Such a chart may neither
exist locally (in a neighborhood around θ̂ ) nor globally.
The local dimensionality is not constant and may vary with θ . The global topology of
the neuromanifold is therefore like a stratifold [44, 45]. As θ has a large dimensionality
in DNNs, singularities are more likely to occur in M. Compared to the notion of
intrinsic dimensionality [46], our d(θ ) is well-defined mathematically rather than
based on empirical evaluations. One can regard our local dimensionality as an upper
bound of the intrinsic dimensionality, because a very small singular value of I still
counts towards the local dimensionality. Notice that random matrices have full rank
with probability 1 [47].
We can regard small singular values (below a prescribed threshold ε > 0) as ε-
singular dimensions, and use ε-rank defined below to estimate the local dimensionality.

Definition 2 The ε-rank of the FIM I(θ) is the number of eigenvalues of I(θ) which
is not less than some given ε > 0.

By definition, the ε-rank is a lower bound of the rank of the FIM, which depends on
the θ -parameterization — different parameterizations of the DNN may yield different

123
K. Sun, F. Nielsen

ε-ranks of the corresponding FIM. If ε → 0, the ε-rank of I(θ) becomes the true
rank of I(θ) given by d(θ ). The spectral density ρI (probability distribution of the
eigenvalues of I(θ)) affects the ε-rank of I(θ ) and the expected local dimensionality
of M. On the support of ρI , the higher the probability of the region [0, ε), the more
likely M is singular. By the Cramér-Rao lower bound, the variance of an unbiased
1D estimator θ̂ must satisfy

    var(θ̂) ≥ I(θ)^{−1} ≥ 1/ε.

Therefore the ε-singular dimensions lead to a large variance of the estimator θ̂: a
single observation x i carries little or no information regarding θ , and it requires a
large number of observations to achieve the same precision. The notion of thresholding
eigenvalues close to zero may depend on the parameterization but the intrinsic ranks
given by the local dimensionality are invariant.
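In practice, the ε-rank of definition 2 can be read off the eigenvalues of an estimated FIM. The helper below is a minimal sketch on a synthetic low-rank psd matrix with a small noise floor; the threshold ε and the test matrix are arbitrary choices of ours.

    import numpy as np

    def eps_rank(psd_matrix, eps=1e-3):
        """Number of eigenvalues of a psd matrix that are at least eps (definition 2)."""
        eigvals = np.linalg.eigvalsh(psd_matrix)      # real eigenvalues, ascending
        return int(np.sum(eigvals >= eps))

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 6))                      # low-rank factor
    fim_like = A @ A.T / 50 + 1e-6 * np.eye(50)       # rank ~6 plus a tiny noise floor
    print("full rank:", np.linalg.matrix_rank(fim_like))
    print("eps-rank (eps=1e-3):", eps_rank(fim_like))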
In a DNN, there are several typical sources of singularities:
• First, if a neuron is saturated and gives constant output regardless of the input
sample z i , then all dynamics of its input and output connections are in Rad(T M).
• Second, two neurons in the same layer can have linearly dependent output, e.g.
when they share the same weight vector and bias. They can be merged into one
single neuron, as there exists redundancy in the original parameterization.
• Third, if the activation function φ(·) is homogeneous, e.g. ReLU, then any neuron
in the DNN induces a reparametrization by multiplying the input links by α and
the output links by 1/α^k (k is the degree of homogeneity). This reparametrization
corresponds to a null curve in the neuromanifold parameterized by α.
• Fourth, certain structures such as recurrent neural networks (RNNs) suffer from
vanishing gradient [32]. As the FIM is the variance of the gradient of the log-
likelihood (known as variance of the score in statistics), its scale goes to zero
along the dimensions associated with such structures.
It is meaningful to formally define the notion of “lightlike neuromanifold”. Using
geometric tools, related studies can be invariant w.r.t. neural network reparametriza-
tion. Moreover, the connection between neuromanifold and singular semi-Riemannian
geometry, which is used in general relativity, is not yet widely adopted in machine
learning. For example, the textbook [24] in singular statistics mainly used tools from
algebraic geometry which is a different field.
Notice that the Fisher-Rao distance along a null curve is undefined because there the
FIM is degenerate and there is no arc-length reparameterization along null curves [48].

5 General formulation of our razor

In this section, we derive a new formula of MDL for DNNs, aiming to explain how
the high dimensional DNN structure can nevertheless admit a short code length of the given
data. Notice that this work focuses on the concept of model complexity but not on
generalization bounds. We aim to show that the DNN model is intrinsically simple

because it can be described shortly. The theoretical connection between generalization


power and MDL is studied in PAC-Bayesian theory and PAC-MDL (see [6–10] and
references therein). This is beyond the scope of this paper.
We derive a simple asymptotic formula for the case of large sample size and large
network size. Therefore crude approximations are taken and the low-order terms are
ignored, which are common practices in deriving information criteria [1, 49].
In the following, we will abuse p(x | θ) to denote the DNN model p(y | z, θ) for
shorter equations and to be consistent with the introduction. Assume

(A2) The absolute values of the third-order derivatives of log p(x | θ) w.r.t. θ are bounded by some constant.
(A3) ∀i, |θ_i − θ̂_i| = O(1/√M), where O(·) is Bachmann-Landau's big-O notation.

Recall that M is the width of the neural network. We consider that the NN weights
have an order of O(1/√M). For example, if the input to a neuron follows the standard
Gaussian distribution, then its weights with order O(1/√M) guarantee the output is
O(1). In practice, this constraint can be enforced by clipping the weight vector to a prescribed
range. This scaling is commonly used by random initialization techniques [50, 51]
for training DNNs.
We rewrite the code length in eq. (1) based on the Taylor expansion of log p(X | θ)
at θ = θ̂ up to the second order:

    −log p(X) = −log ∫_M p(θ) exp( log p(X | θ̂) − (N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) + O(N ‖θ − θ̂‖³) ) dθ.    (13)

Notice that the first order term vanishes because θ̂ is a local optimum of log p(X | θ),
and in the second order term, −N J(θ̂) is the Hessian matrix of the likelihood function
log p(X | θ) evaluated at θ̂. At the MLE, J(θ̂) ⪰ 0, while in general the Hessian of
the loss of a DNN evaluated at θ = θ̂ can have a negative spectrum [52, 53].
Through a change of variable φ := √N (θ − θ̂), the density of φ is p(φ) =
(1/√N) p(φ/√N + θ̂) so that ∫_M p(φ) dφ = 1. In the integration in eq. (13), the term
−(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) has an order of O(‖φ‖²). The cubic remainder term has
an order of O(‖φ‖³/√N). If N is sufficiently large, this remainder can be ignored.
Therefore we can write

    −log p(X) ≈ −log p(X | θ̂) − log E_p exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) ).    (14)

On the RHS, the first term measures the error of the model w.r.t. the observed data X.
The second term measures the model complexity. We have the following bound.

Proposition 5 We have ∀θ ∈ M,

    0 ≤ −log E_p exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) )
      ≤ (N/2) tr( J(θ̂) [ (μ(θ) − θ̂)(μ(θ) − θ̂)^⊤ + cov(θ) ] ),

where μ(θ) and cov(θ) denote the mean and covariance matrix of the prior p(θ),
respectively.

Therefore the complexity is always non-negative and its scale is bounded by the prior
p(θ). The model has low complexity when θ̂ is close to the mean of p(θ) and/or when
the variance of p(θ) is small.
Consider the prior p(θ) = κ(θ)/∫_M κ(θ) dθ, where κ(θ) > 0 is a positive measure
on M so that 0 < ∫_M κ(θ) dθ < ∞. Based on the above approximation of −log p(X),
we arrive at a general formula

    O := −log p(X | θ̂) + log ∫_M κ(θ) dθ − log ∫_M κ(θ) exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) ) dθ,    (15)

where “O” stands for Occam’s razor. Compared with previous formulations of
MDL [2, 3, 16], eq. (15) relies on a quadratic approximation of the log-likelihood
function and can be instantiated based on different assumptions of κ(θ). The non-
normalized κ(θ) in Bayesian coding serves a similar role to the luckiness function in
NML coding [4], as they both incorporate prior knowledge to favor certain parameters
in the parameter space M.
Informally, the term ∫_M κ(θ) dθ gives the total capacity of models in M specified
by the improper prior κ(θ), up to constant scaling. For example, if κ(θ) is
uniform on a subregion of M, then ∫_M κ(θ) dθ corresponds to the size of this region
w.r.t. the base measure dθ. The term ∫_M κ(θ) exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) ) dθ
gives the model capacity specified by the posterior p(θ | X) ∝ p(θ) p(X | θ) ∝
κ(θ) exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) ). It shrinks to zero when the number N of observations
increases. The last two terms in eq. (15) give the log-ratio between the model
capacity w.r.t. the prior and the capacity w.r.t. the posterior. A large log-ratio means
there are many distributions on M which have a relatively large value of κ(θ) but a
small value of κ(θ) exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) ). The associated model is considered
to have a high complexity, meaning that only a small "percentage" of the models
are helpful to describe the given data.
DNNs have a large amount of symmetry: the parameter space consists of many
pieces that look exactly the same. This can be caused, e.g., by permuting the neurons
in the same layer. This is a different, non-local property than singularity, which is a local
differential property. Our O is not affected by the model size caused by symmetry,
because these symmetric models are counted in both the prior and the posterior, and
the log-ratio in eq. (15) cancels out symmetric models. Formally, suppose M has ζ symmetric
pieces denoted by M_1, ⋯, M_ζ. Note any MLE on M_i is mirrored on those ζ pieces.
Then both integrations on the RHS of eq. (15) are multiplied by a factor of ζ. Therefore
O is invariant to symmetry.
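Before moving to specific priors, the complexity term of eq. (14) can be checked numerically. The sketch below (ours, not from the paper) uses an isotropic Gaussian prior with the MLE placed at the prior mean, for which −log E_p exp(·) has the closed form (1/2) log det(I + Nσ J(θ̂)); it also verifies the upper bound of proposition 5. All sizes and numbers are illustrative assumptions.

    import numpy as np

    # Monte Carlo estimate of the complexity term in eq. (14) for a Gaussian prior
    # N(0, sigma I) and a synthetic low-rank "observed FIM" J.
    rng = np.random.default_rng(0)
    D, N, sigma = 10, 100, 0.1
    A = rng.normal(size=(D, 3))
    J = A @ A.T / D                                   # psd, rank 3 (singular model)
    theta_hat = np.zeros(D)                           # MLE at the prior mean

    samples = rng.normal(0, np.sqrt(sigma), size=(200_000, D))   # theta ~ N(0, sigma I)
    quad = 0.5 * N * np.einsum('ij,jk,ik->i', samples - theta_hat, J, samples - theta_hat)
    complexity_mc = -np.log(np.mean(np.exp(-quad)))

    complexity_exact = 0.5 * np.linalg.slogdet(np.eye(D) + N * sigma * J)[1]
    upper_bound = 0.5 * N * sigma * np.trace(J)       # proposition 5 with mu = theta_hat

    print(f"Monte Carlo: {complexity_mc:.3f}  closed form: {complexity_exact:.3f}  "
          f"prop. 5 bound: {upper_bound:.3f}")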


6 Connection with f -mean

In this section, we show that the model complexity terms in Eqs. (14) and (15) can be
studied based on the notion of f -mean, defined below.

Definition 3 (f-mean) Given a set T = {t_i}_{i=1}^n ⊂ ℝ and a continuous and strictly
monotonous function f : ℝ → ℝ, the f-mean of T is

    M_f(T) := f^{−1}( (1/n) Σ_{i=1}^n f(t_i) ).

The f-mean, also known as the quasi-arithmetic mean, was studied in [54, 55]; such
means are therefore also called Kolmogorov-Nagumo means [56]. By definition, the image of
M_f(T) under f is the arithmetic mean of the image of T under the same mapping.
Therefore, M_f(T) lies between the smallest and largest elements of T. If f(x) = x,
then M_f becomes the arithmetic mean, which we denote as T̄. We have the following
bound.

Lemma 6 Given a real matrix T = (t_ij)_{n×m}, we use t_i to denote the i'th row of T,
and t_{:,j} to denote the j'th column of T. If f(t) = exp(−t), then

    M_f(T) ≤ (1/m) Σ_{j=1}^m M_f(t_{:,j}) ≤ M_f({t̄_1, ⋯, t̄_n}) ≤ T̄,

where M_f(T) is the f-mean of all n × m elements of T, t̄_i denotes the arithmetic mean
of the i'th row, and T̄ is the arithmetic mean of all elements.

Particular attention should be given to the second "≤". If the arithmetic mean of
each row is first evaluated, and then their f-mean is evaluated, we get an upper
bound of the arithmetic mean of the f-mean of the columns. In simple terms, for
f(t) = exp(−t), the f-mean of arithmetic means is lower bounded by the arithmetic
mean of f-means. The proof is straightforward from Jensen's inequality, and by
noting that −log Σ_i exp(−t_i) is a concave function of t. The last "≤" leads to a proof
of the upper bound in proposition 5. Lemma 6 pertains to the mean of a matrix. It
leads to bounds of the mean value of a bi-variable function w.r.t. a probability mass
function or a probability density function, or a mix of both. This is straightforward
and omitted.
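The chain of inequalities in lemma 6 is easy to verify numerically for f(t) = exp(−t); the sketch below (our own check) does so on an arbitrary random nonnegative matrix.

    import numpy as np

    def f_mean(values):
        """Quasi-arithmetic mean with f(t) = exp(-t), i.e. -log mean(exp(-t))."""
        values = np.asarray(values, dtype=float)
        return -np.log(np.mean(np.exp(-values)))

    rng = np.random.default_rng(0)
    T = rng.exponential(scale=1.0, size=(5, 7))       # n x m matrix of elements t_ij

    a = f_mean(T)                                     # f-mean of all elements
    b = np.mean([f_mean(T[:, j]) for j in range(T.shape[1])])   # mean of column f-means
    c = f_mean(T.mean(axis=1))                        # f-mean of the row means
    d = T.mean()                                      # arithmetic mean
    print(a <= b <= c <= d, np.round([a, b, c, d], 4))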

Remark All instances of “≤” in lemma 6 are derived from Jensen’s inequality. Con-
sequently, the gaps of these bounds shrink as the variance of the matrix elements
decreases, where the specific way of measuring the variance depends on the particular
“≤”. For instance, the gap associated with the second “≤” becomes smaller as the
variance across each row t i reduces.

Remark The second complexity term on the RHS of eq. (14) is the f-mean of the
quadratic term (N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) w.r.t. the prior p(θ), where f(t) = exp(−t).


Based on the spectrum decomposition J(θ̂) = Σ_{j=1}^{rank J(θ̂)} λ_j^+ v_j v_j^⊤, where the
positive eigenvalues λ_j^+ := λ_j^+(J(θ̂)) and the eigenvectors v_j := v_j(θ̂) depend on the
MLE θ̂, we further write this term as

    (N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) = Σ_{j=1}^{rank J(θ̂)} ( λ_j^+ / tr(J(θ̂)) ) · (N/2) tr(J(θ̂)) ⟨θ − θ̂, v_j⟩².

By lemma 6, we have

    −log E_p exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) )
      ≥ −Σ_{j=1}^{rank J(θ̂)} ( λ_j^+ / tr(J(θ̂)) ) log E_p exp( −(N/2) tr(J(θ̂)) ⟨θ − θ̂, v_j⟩² ),

where the f-mean and the mean w.r.t. λ_j^+ / tr(J(θ̂)) are swapped on the RHS. Denote ϕ_j =
⟨θ − θ̂, v_j⟩, which in matrix form is written as ϕ = V^⊤(θ − θ̂), where V has orthonormal
columns and the j'th column of V is v_j. The prior of ϕ is given by p(Vϕ + θ̂). Then

    −log E_p exp( −(N/2) tr(J(θ̂)) ⟨θ − θ̂, v_j⟩² ) = −log E_{p(ϕ_j)} exp( −(N/2) tr(J(θ̂)) ϕ_j² ),

where the RHS is a variance-like measure of p(θ) up to scaling, as it is the f-mean of
(N/2) tr(J(θ̂)) ϕ_j² evaluated at point θ̂ along the direction v_j. In summary, we get a lower
bound of the model complexity, which is tighter than the lower bound in proposition 5,
given by

    −log E_p exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) )
      ≥ −Σ_{j=1}^{rank J(θ̂)} ( λ_j^+ / tr(J(θ̂)) ) log E_{p(ϕ_j)} exp( −(N/2) tr(J(θ̂)) ϕ_j² ).    (16)

The RHS is determined by the quantity (N/2) tr(J(θ̂)) ϕ_j² after evaluating the f-mean and
some weighted mean, where ϕ_j is an orthogonal transformation of the local coordinates
θ_i based on the spectrum of J(θ̂). Recall that the trace of the observed FIM J(θ̂) means
the overall amount of information a random observation contains w.r.t. the underlying
model. Given the same sample size N, a larger tr(J(θ̂)) indicates that the samples are
more informative and the likelihood is more sensitive to the choice of the parameters
on M. Consequently, it is reasonable to regard the model as more complex, because
small changes of model parameters more easily lead to different representations.
The bound in eq. (16) is tight when the variance of (N/2) tr(J(θ̂)) ϕ_j² w.r.t. the discrete
distribution λ_j^+ / tr(J(θ̂)) is small. In the case when J(θ̂) is rank-one, "≥" becomes "=".
In practice, the FIM of DNNs exhibits a pathological spectrum [43], where most
eigenvalues of J(θ̂) are near zero, with a small fraction taking large values. This means
that the ε-rank of J(θ̂) is limited, and the distribution λ_j^+ / tr(J(θ̂)) has lower variance as
compared to a uniform spectrum. This distinctive property of DNNs offers some basis
for considering the lower bound in eq. (16) as a proxy of the model complexity.
As θ̂ is the MLE, we have J(θ̂) = Î(θ̂). Recall from eq. (7) that the FIM Î(θ̂) is a
numerical average over all observed samples. We can have alternative lower bounds
of the model complexity based on lemma 6:

    −log E_p exp( −(N/2)(θ − θ̂)^⊤ J(θ̂)(θ − θ̂) )
      ≥ −(1/N) Σ_{i=1}^N log E_p exp( −(N/2)(θ − θ̂)^⊤ (∂h^L(z_i)/∂θ)^⊤ C_i (∂h^L(z_i)/∂θ) (θ − θ̂) )
      ≥ −(1/N) Σ_{i=1}^N E_{p(y|z_i)} log E_p exp( −(N/2) ( (∂log p(y | z_i)/∂θ)^⊤ (θ − θ̂) )² ).    (17)

The bounds are obtained by swapping the f-mean with the numerical average over the
samples, and by swapping the f-mean with the expectation w.r.t. p(y | z_i). Therefore the
model complexity can be bounded by the average scale of the vector (∂h^L(z_i)/∂θ)(θ − θ̂),
where θ ∼ p(θ). Note that ∂h^L(z_i)/∂θ is the parameter-output Jacobian matrix, or a linear
approximation of the neural network mapping θ → h^L. The complexity lower bounds
in eq. (17) measure how the local parameter change (θ − θ̂) w.r.t. the prior p(θ) affects
the output. If the output is sensitive to these parameter variations, then the model is
considered to have high complexity. In summary, the f-mean offers a powerful tool
to analyze our model complexity and obtain its approximations.

7 The razor based on Gaussian prior

The simplest and most widely-used choice of the prior p(θ ) is the Gaussian prior (see
e.g. [38, 57] among many others). In eq. (15), we set

    κ(θ) = exp( −θ^⊤ diag(1/σ) θ ),

where diag (·) means a diagonal matrix constructed with given entries, and σ > 0
(elementwisely). Equivalently, the associated prior is pG (θ ) = G(θ | 0, diag (σ )),


meaning a Gaussian distribution with mean 0 and covariance matrix diag(σ). We
further assume

(A4) M has a global coordinate chart and M is homeomorphic to ℝ^D.
(A5) Regardless of D, θ̂^⊤ diag(1/σ) θ̂ < ∞.
Assumption (A4) enables us to define a Gaussian distribution in a global coordinate
system, which typically represents the neural network weights and biases. By assump-
tion (A5), the MLE θ̂ has a non-zero probability under the Gaussian prior.
From eq. (15), we get a closed form expression of the razor

    O_G := −log p(X | θ̂) + (rank J(θ̂)/2) log N + (1/2) Σ_{i=1}^{rank J(θ̂)} log( λ_i^+(J(θ̂) diag(σ)) + 1/N ) + O(1),    (18)

where λ_i^+(J(θ̂) diag(σ)) denotes the i'th positive eigenvalue of J(θ̂) diag(σ). Notice
that J(θ̂) diag(σ) and diag(√σ) J(θ̂) diag(√σ) share the same set of non-zero eigenvalues,
and the latter is psd with rank J(θ̂) positive eigenvalues.
In our razor expressions, all terms that do not scale with the sample size N or the
number of parameters D are discarded. The first two terms on the RHS are similar to
BIC [1] up to scaling. The complexity terms (second and third terms on the RHS of
eq. (18)) do not scale with D but are bounded by the rank of the Hessian, or the observed
FIM. In other words, the radical distribution associated with zero-eigenvalues of J(θ̂ )
does not affect the model complexity. This is different from previous formulations of
MDL [2, 3, 16] and BIC [1]. For example, the 2nd term on the RHS of eq. (2) increases
linearly with D, while the 2nd term on the RHS of eq. (18) increases linearly with
rank(J(θ̂)) ≤ D.
Interestingly, if λ_i^+(J(θ̂)) < (1/σ_max)(1 − 1/N), the third term on the RHS of
eq. (18) becomes negative. In the extreme case when λ_i^+(J(θ̂)) tends to zero,
(1/2) log( σ_max λ_i^+(J(θ̂)) + 1/N ) → −(1/2) log N, which cancels out the model complexity
penalty in the term (rank J(θ̂)/2) log N. In other words, the corresponding parameter
is added for free (without increasing the model complexity). Informally, we call similar
terms that are helpful in decreasing the complexity while contributing to model
flexibility the negative complexity.
We have

    Σ_{i=1}^{rank J(θ̂)} log( σ_min λ_i^+(J(θ̂)) + 1/N ) ≤ Σ_{i=1}^{rank J(θ̂)} log( λ_i^+(J(θ̂) diag(σ)) + 1/N )
                                                  ≤ Σ_{i=1}^{rank J(θ̂)} log( σ_max λ_i^+(J(θ̂)) + 1/N ),

where σ_max and σ_min denote the largest and smallest elements of σ, respectively.
Therefore the term can be bounded based on the spectrum of J(θ̂). If σ = σ1, where
σ > 0, then both of the above "≤"s become equalities. In this case, we let D → ∞
and rewrite the razor in terms of the spectrum density ρ_I(λ) of J(θ̂):

    O_G = −log p(X | θ̂) + (rank J(θ̂)/2) · E_{ρ_I(λ)} log(N σ λ + 1) + O(1).    (19)

Note rank J(θ̂ ) = d̂(θ̂) is the local dimensionality at θ̂ , which could have a smaller
order than D, especially when N is finite. If ρI (λ) is highly concentrated around 0 as
shown in [43], then the expectation of log (N σ λ + 1) can be roughly approximated
as zero. This approximation is also linked to the low intrinsic complexity of DNNs.
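The complexity part of eq. (18), equivalently eq. (19) with σ = σ1, depends only on the spectrum of J(θ̂). The sketch below (ours) evaluates it for a made-up spectrum with a few large eigenvalues and many near-zero ones, mimicking the pathological FIM spectra reported for DNNs, and shows how the near-zero eigenvalues contribute negative complexity.

    import numpy as np

    # Complexity terms of eq. (18) for an isotropic prior, from a synthetic spectrum.
    N, sigma = 10_000, 0.1
    eigvals = np.concatenate([np.array([5.0, 1.0, 0.2]),      # few informative directions
                              np.full(200, 1e-7)])            # many near-zero eigenvalues

    term2 = 0.5 * len(eigvals) * np.log(N)                    # (rank/2) log N
    term3 = 0.5 * np.sum(np.log(sigma * eigvals + 1.0 / N))   # per-eigenvalue correction
    print(f"(rank/2) log N   = {term2:8.2f}")
    print(f"sum of log terms = {term3:8.2f}")
    print(f"total complexity = {term2 + term3:8.2f}")
    # Equivalent form, cf. eq. (19): 0.5 * sum(log(N * sigma * eigvals + 1)).
    print(f"check            = {0.5 * np.sum(np.log(N * sigma * eigvals + 1)):8.2f}")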
The Gaussian prior p_G is helpful to give simple and intuitive expressions of O_G.
However, the problem in choosing p_G is twofold. First, it is not invariant: under a
reparametrization (e.g. normalization or centering techniques), the Gaussian prior in
the new parameter system does not correspond to the original prior. Second, it double
counts equivalent models. Because of the many singularities of the neuromanifold, a
small dynamic in the parameter system may not change the prediction model. However,
the Gaussian prior is defined in a real vector space and may not fit this singular
semi-Riemannian structure. Gaussian distributions can be defined on Riemannian manifolds [58],
which leads to potential extensions of the discussed prior p_G(θ).

8 The razor based on Jeffreys’ non-informative prior



Jeffreys' prior is specified by p_J(θ) ∝ √|I(θ)|. It is non-informative in the sense that
no neural network model θ_1 is prioritized over any other model θ_2. It is invariant to
the choice of the coordinate system. Under a reparameterization θ → η,

    √|I(η)| dη = √| (∂θ/∂η)^⊤ I(θ) (∂θ/∂η) | dη = √|I(θ)| · |∂θ/∂η| dη = √|I(θ)| dθ,

showing that the Riemannian volume element is the same in different coordinate
systems. Unfortunately, Jeffreys' prior is not well defined on the lightlike neuromanifold
M, where the metric I(θ) is degenerate and √|I(θ)| becomes zero. The
stratifold structure of M, where d(θ) varies with θ ∈ M, makes it difficult to
properly define the base measure dθ and integrate functions as in eq. (15). From
a mathematical standpoint, one has to integrate on the screen distribution S(TM),
which has a Riemannian structure. We refer the reader to [59, 60] for other extensions
of Jeffreys' prior.


In this paper, we take a simple approach by examining a submanifold of M, denoted
as M̃ and parameterized by ξ, which has a Riemannian metric I(ξ) ≻ 0 that is induced
by the FIM I(θ) ⪰ 0 and the mapping ξ → θ. The dimensionality of M̃ is upper-bounded
by the local dimensionality d(θ). Intuitively, any infinitesimal dynamic on
M̃ means such a change of neural network parameters that leads to a non-zero change
of the global predictive model z → y. For example, M̃ can be defined based on a
subset of sensitive parameters. In theory, we would like to construct M̃ so that it is
representative of M, meaning that dim(M̃) is close to the local dimensionality d(θ),
and at the same time M̃ remains Riemannian. The following results are constrained
to the choice of the submanifold M̃.
In eq. (15), let κ(ξ) = √|I(ξ)|. We further assume

(A6) 0 < ∫_{M̃} √|I(ξ)| dξ < ∞;

meaning that the Riemannian volume of M̃ is bounded. After straightforward derivations,
we arrive at

    O_J(ξ) = −log p(X | ξ̂) + log ∫_{M̃} √|I(ξ)| dξ − log ∫_{M̃} exp( −(N/2)(ξ − ξ̂)^⊤ J(ξ̂)(ξ − ξ̂) ) √|I(ξ)| dξ
           = −log p(X | ξ̂) + log ∫_{M̃} √|I(ξ)| dξ − log ∫_{M̃} ω(ξ) √|I(ξ)| dξ,    (20)

where ω(ξ) := exp( −(N/2)(ξ − ξ̂)^⊤ J(ξ̂)(ξ − ξ̂) ) is a shorthand. Let us examine the
meaning of O_J(ξ). As I(ξ) is the Riemannian metric of M̃ based on information
geometry, √|I(ξ)| dξ is a Riemannian volume element (volume form). In the second
term on the RHS of eq. (20), the integral ∫_{M̃} √|I(ξ)| dξ is the information volume,
or the total "number" of different DNN models [17] on M̃. In the last (third) term,
because 0 < ω(ξ) ≤ 1, the integral on the LHS of

    ∫_{M̃} ω(ξ) √|I(ξ)| dξ ≤ ∫_{M̃} √|I(ξ)| dξ

means a "weighted volume" of M̃, where the positive weights ω(ξ) are determined by
the observed FIM J(ξ̂). Combining these two terms, the model complexity is the log-ratio
between the unweighted volume and the weighted volume and is lower bounded
by 0.
Assume the spectrum decomposition J(ξ̂) = Q diag( λ_i^+(J(ξ̂)) ) Q^⊤, where Q has
orthonormal columns, and λ_i^+(J(ξ̂)) are the positive eigenvalues of J(ξ̂). Equation (20)
becomes

    O_J(ζ) = −log p(X | ζ̂) + log ∫_{M̃} √|I(ζ)| dζ − log ∫_{M̃} exp( −(N/2) Σ_{i=1}^{rank J(ξ̂)} λ_i^+(J(ξ̂)) (ζ_i − ζ̂_i)² ) √|I(ζ)| dζ,    (21)

where ζ = Q^⊤ ξ is an orthogonal transformation of ξ, and O_J is invariant to such
transformations. If an eigenvalue of J(ξ̂) has an order of o(1/N), the last two terms in
eq. (21) cancel out in the corresponding direction, meaning no complexity is added.
This is similar to how the positive and negative complexity terms cancel out in eq. (18)
– small eigenvalues of J(ξ̂ ) are helpful to enhance the representation power of DNNs
without increasing the model complexity. Only eigenvalues that are large enough
contribute significantly to the model complexity.
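For a one-dimensional Riemannian submanifold, the two volume integrals in eq. (20) can be evaluated numerically. The sketch below (ours, not from the paper) takes the Bernoulli family as M̃, for which √|I(ξ)| = 1/√(ξ(1−ξ)); the sample size and the MLE are arbitrary illustrative values.

    import numpy as np

    # 1-D numerical sketch of eq. (20) on the Bernoulli family, xi in (0, 1).
    N, xi_hat = 100, 0.3
    xi = np.linspace(1e-4, 1 - 1e-4, 200_000)
    dx = xi[1] - xi[0]
    sqrt_I = 1.0 / np.sqrt(xi * (1 - xi))             # Riemannian volume element
    J_hat = 1.0 / (xi_hat * (1 - xi_hat))             # observed FIM at the MLE
    omega = np.exp(-0.5 * N * J_hat * (xi - xi_hat) ** 2)

    vol_prior = np.sum(sqrt_I) * dx                   # total information volume (~ pi)
    vol_post = np.sum(omega * sqrt_I) * dx            # "weighted volume"
    complexity = np.log(vol_prior) - np.log(vol_post)
    print(f"prior volume ~ {vol_prior:.4f} (pi = {np.pi:.4f}), "
          f"complexity = {complexity:.3f}")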
In the rest of this section, we connect our O_J with previous formulations of
MDL [16, 17]. Observe that ω(ξ) in eq. (20) resembles a Gaussian density up to
a scaling factor. If J(ξ̂) has full rank, we can further write

    O_J(M̃) = −log p(X | ξ̂) + (dim(M̃)/2) log(N/2π) + log ∫_{M̃} √|I(ξ)| dξ
              − log ∫_{M̃} G( ξ | ξ̂, (1/N) J^{−1}(ξ̂) ) · |I(ξ)|^{1/2}/|J(ξ̂)|^{1/2} dξ.    (22)

By assumption (A6), the RHS of eq. (20) is well defined, while the RHS of eq. (22)
is only meaningful for a full rank J(ξ̂). If J(ξ̂) is not invertible, one can consider the
limit case when the zero eigenvalues of J(ξ̂) are replaced by a small ε > 0 and still
apply the expression in eq. (22). One has to note that

    ∫_{M̃} G( ξ | ξ̂, (1/N) J^{−1}(ξ̂) ) dξ ≤ 1,

as the integral is over M̃, which is a subset of ℝ^{dim(M̃)}. The last term on the RHS of
eq. (22) resembles an expectation w.r.t. a Gaussian distribution centered at ξ̂ on M̃,
except that the Gaussian density may be truncated by M̃. One can therefore take the
rough approximation based on the mean of the Gaussian:

    −log ∫_{M̃} G( ξ | ξ̂, (1/N) J^{−1}(ξ̂) ) · |I(ξ)|^{1/2}/|J(ξ̂)|^{1/2} dξ ≈ (1/2) log( |J(ξ̂)| / |I(ξ̂)| ).    (23)

Under this approximation, eq. (22) gives the MDL criterion discussed in [16, 17],
where the term on the RHS of eq. (23) is interpreted as a penalty to models that
lack robustness and are sensitive to the choice of parameters. We therefore consider
the spectrum of both matrices I(ξ) and J(ξ), noting that in the large sample limit
N → ∞, they become identical. Because of the finite N, the observed FIM J(ξ̂) is
singular in potentially many directions. The true FIM I(ξ̂) can be regarded as the
sum of the observed FIM J(ξ̂) and the FIM w.r.t. unobserved samples, up to a scaling
factor. Based on how M̃ is constructed, I(ξ̂) ≻ 0 is positive definite and suffers
less from singularities. In the directions where J(ξ̂) is nearly singular, the log-ratio
log |J(ξ̂)|/|I(ξ̂)| contributes significantly and negatively to the model complexity. As
a result, eq. (23) serves as a negative complexity term and explains how singularities
of J(ξ̂) correspond to the simplicity of DNNs.
Compared with O_G, O_J is based on a more accurate geometric modeling. However,
it is hard to compute numerically, as it depends on how M̃ is constructed, and on
I(ξ) and p(z), which are unknown due to limited observations. Despite that O_G and
O_J have different expressions, their preference for model dimensions with small Fisher
information (as in DNNs) is similar.
Hence, we can conclude that the intrinsic complexity of a DNN is affected by the
singularity and spectral properties of the Fisher information matrix.

9 Related work

The dynamics of supervised learning of a DNN describe a trajectory on the parameter
space of the DNN, geometrically modeled as a manifold when endowed with
the FIM (e.g., ordinary/natural gradient descent learning the parameters of an MLP).
Singular regions of the neuromanifold [61] correspond to non-identifiable parameters
with rank-deficient FIM, and the learning trajectory typically exhibits chaotic pat-
terns [41] with the singularities which translate into slowdown plateau phenomena
when plotting the loss function value against time. By building an elementary singular
DNN, [41] (and references therein) showed that stochastic gradient descent learning
dynamics yield a Milnor-type attractor with both attractor/repulser subregions, where
the learning trajectory is attracted into the attractor region, stays there for a long time,
and then escapes through the repulser region. The natural gradient is shown to be free
of critical slowdowns. Furthermore, although DNNs have potentially many singular
regions, the interaction of elementary units cancels out the Milnor-type attractors. It
was shown [62] that skip connections are helpful to reduce the effect of singulari-
ties. However, a full understanding of the learning dynamics [63] for generic DNN
architectures with multiple output values or recurrent DNNs is yet to be investigated.
The MDL criterion has undergone several fundamental revisions, such as the origi-
nal crude MDL [2] and refined MDL with the introduction of stochastic complexity [3,
64], and the NML [14, 65] as a modern refinement. We refer the reader to the book [4]
for a comprehensive introduction to this area and [6] for a recent review. We should
also mention that the relationship between MDL and generalization has been explored
in the PAC-MDL framework [4, 7–10]. See [6] (section 6.4) for related remarks.
The relationship between MDL and information geometry is well established [3,
15–17, 66]. For example, they both rely on fundamental concepts such as the Fisher
information. The geometric complexity of statistical models is commonly formulated
using tools from information geometry [3, 16, 49, 66]. The stochastic complexity in
singular mixture models can be bounded [67] and therefore is smaller than that of
regular models. On this line of research, our derivations based on a Taylor expansion


of the log-likelihood are similar to [16]. This technique is also used for deriving natural
gradient optimization for deep learning [34, 41, 68].
Recently, MDL has been ported to deep learning [69] focusing on variational
methods. Practical techniques such as weight sharing [70], binarization [71], model
compression [72], etc., follow similar principles of MDL. In the same community,
many efforts have been made to develop a theory of deep learning, for example,
based on PAC-Bayes theory [73], statistical learning theory [74], algorithmic infor-
mation theory [75], information geometry [76], geometry of the DNN mapping [77], or
through defining an intrinsic dimensionality [46] that is much smaller than the network
size. Our analysis depends on J(θ̂ ) and therefore is related to the flatness/sharpness
of the local minima [78, 79], which is known to affect generalization. Using advanced
mathematical tools such as random matrix theory, investigations are conducted on the
spectrum of the input-output Jacobian matrix [80], the Hessian matrix w.r.t. the neural
network weights [81], and the FIM [38, 39, 43, 82, 83].

10 Conclusion

We consider mathematical tools from singular semi-Riemannian geometry to study


the locally varying intrinsic dimensionality of a deep learning model. These models
fall in the category of non-identifiable parameterizations. We take a meaningful step to
quantify geometric singularity through the notion of local dimensionality d(θ ) yielding
a singular semi-Riemannian neuromanifold with varying metric signature. We show
that d(θ ) grows at most linearly with the sample size N . Recent findings show that the
spectrum of the Fisher information matrix shifts towards 0+ with a large number of
small eigenvalues. We show that these singular dimensions help to reduce the model
complexity. As a result, we contribute a simple and general MDL for deep learning.
It provides theoretical insights on the description length of DNNs. DNNs benefit
from a high-dimensional parameter space in that the singular dimensions impose a
negative complexity to describe the data, which can be seen in our derivations based
on Gaussian and Jeffreys' priors. How the short description length is connected to the empirical performance of DNNs and to related generalization bounds requires further examination; this is not addressed in the current work. A more careful analysis of the
FIM’s spectrum, e.g. through considering higher-order terms, could give more practical
formulations of the proposed criterion. We leave empirical studies as potential future
work.

Appendix A Proof of J(θ̂) = Î(θ̂)

Proof

$$
p(y_i \mid z_i, \theta) = \exp\Big( \mathrm{OneHot}(y_i)^\top h^L(z_i) - \log \sum_j \exp\big( h^L_j(z_i) \big) \Big),
$$


where OneHot(y) is the binary vector with the same dimensionality as h L (z i ), with
the y’th bit set to 1 and the rest bits set to 0. Therefore,
 
$$
\frac{\partial \log p(y_i \mid z_i, \theta)}{\partial \theta} = \Big( \frac{\partial h^L}{\partial \theta} \Big)^{\!\top} \big[ \mathrm{OneHot}(y_i) - \mathrm{SoftMax}(h^L(z_i)) \big].
$$

Therefore,

$$
\frac{\partial^2 \log p(y_i \mid z_i, \theta)}{\partial \theta \partial \theta^\top} = \sum_j \big[ \mathrm{OneHot}_j(y_i) - \mathrm{SoftMax}_j(h^L(z_i)) \big] \frac{\partial^2 h^L_j}{\partial \theta \partial \theta^\top} - \Big( \frac{\partial h^L}{\partial \theta} \Big)^{\!\top} C_i\, \frac{\partial h^L}{\partial \theta}. \qquad (A1)
$$

where

$$
C_i = \frac{\partial\, \mathrm{SoftMax}(h^L(z_i))}{\partial h^L(z_i)} = \mathrm{diag}(o_i) - o_i o_i^\top, \qquad o_i = \mathrm{SoftMax}(h^L(z_i)).
$$

By (A1), at the MLE θ̂ ,

∀i, SoftMax(h L (z i )) = OneHot(yi ).

Therefore
$$
\forall i, \quad -\frac{\partial^2 \log p(y_i \mid z_i, \theta)}{\partial \theta \partial \theta^\top} = \Big( \frac{\partial h^L}{\partial \theta} \Big)^{\!\top} C_i\, \frac{\partial h^L}{\partial \theta}.
$$
Taking the sample average on both sides, we get

J(θ̂ ) = Î(θ̂ ).
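As a sanity check, the following sketch (arbitrary toy logits, not an experiment from the paper) verifies the softmax Jacobian identity C_i = diag(o_i) − o_i o_i^⊤ used throughout this proof against finite differences:

```python
# Sketch (assumed toy logits): check C = d SoftMax(h)/dh = diag(o) - o o^T,
# the matrix appearing in the per-sample term (dh/dtheta)^T C (dh/dtheta) above.
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

rng = np.random.default_rng(1)
m = 5
h = rng.standard_normal(m)
o = softmax(h)
C_analytic = np.diag(o) - np.outer(o, o)

eps = 1e-6
C_numeric = np.zeros((m, m))
for j in range(m):                       # finite-difference Jacobian w.r.t. the logits
    h_plus = h.copy(); h_plus[j] += eps
    C_numeric[:, j] = (softmax(h_plus) - o) / eps

print("max |C_analytic - C_numeric| =", np.abs(C_analytic - C_numeric).max())
```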




Appendix B Proof of Lemma 1



Proof If $(\theta, \sum_j \alpha_j \partial_{\theta_j}) \in \mathrm{Rad}(TM)$, then
$$
\Big\langle \sum_j \alpha_j \partial_{\theta_j},\; \sum_j \alpha_j \partial_{\theta_j} \Big\rangle_{I(\theta)} = 0.
$$

In matrix form, it is simply $\alpha^\top I(\theta)\, \alpha = 0$. We have the analytical expression
$$
I(\theta) = E_p\Big[ \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta} \Big].
$$


Therefore
  
$$
E_p\Big[ \alpha^\top \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta}\, \alpha \Big] = 0.
$$

By noting that C(z) ⪰ 0 is positive semi-definite, we have almost surely that
$$
\alpha^\top \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta}\, \alpha = 0.
$$

Any eigenvector of C(z) associated with the zero eigenvalues must be a multiple of
1. Indeed,
 
$$
v^\top C(z)\, v = v^\top \big( \mathrm{diag}(o(z)) - o(z) o(z)^\top \big) v = \sum_j o_j(z) \Big( v_j - \sum_k o_k(z) v_k \Big)^{2} = 0 \;\Leftrightarrow\; v \propto \mathbf{1},
$$

where o j (z) > 0 is the j’th element of o(z). Hence, almost surely

$$
\frac{\partial h^L(z)}{\partial \theta}\, \alpha = \lambda(z)\, \mathbf{1}.
$$



Remark α is associated with a tangent vector in Rad(TM), meaning a dynamic along the lightlike dimensions. The Jacobian ∂h^L(z)/∂θ is the local linear approximation of the mapping θ → h^L(z). By lemma 1, with probability 1 such a dynamic leads to uniform increments of the output units, meaning h^L(z) → h^L(z) + λ(z)𝟙 for every input z, and therefore the output distribution SoftMax(h^L(z)) is not affected. In summary, we have verified that the radical distribution does not affect the neural network mapping.
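The remark can be illustrated numerically. The sketch below (arbitrary toy logits, an assumption for illustration only) checks that a uniform shift h^L ↦ h^L + λ𝟙, which is what a radical direction induces to first order, leaves the SoftMax output unchanged:

```python
# Sketch of the remark (toy numbers): a uniform increment lambda * 1 of the logits
# h^L does not change SoftMax(h^L), so a lightlike direction leaves the output
# distribution of the network invariant.
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h = np.array([1.0, -0.3, 2.5, 0.0])
lam = 7.31                                  # arbitrary shift along the lightlike direction
print(np.allclose(softmax(h), softmax(h + lam * np.ones_like(h))))   # True
```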

Appendix C Proof of Proposition 2

Proof
$$
\hat{d}(\theta) = \mathrm{rank}\,\hat{I}(\theta) = \mathrm{rank}\Big( \sum_{i=1}^{N} \Big(\frac{\partial h^L(z_i)}{\partial \theta}\Big)^{\!\top} C_i\, \frac{\partial h^L(z_i)}{\partial \theta} \Big) \le \sum_{i=1}^{N} \mathrm{rank}\Big( \Big(\frac{\partial h^L(z_i)}{\partial \theta}\Big)^{\!\top} C_i\, \frac{\partial h^L(z_i)}{\partial \theta} \Big)
$$
$$
\le \sum_{i=1}^{N} \min\Big\{ \mathrm{rank}\Big(\frac{\partial h^L(z_i)}{\partial \theta}\Big),\; \mathrm{rank}(C_i) \Big\}.
$$


Note the matrix ∂h^L(z_i)/∂θ has size m × D, and C_i has size m × m and rank (m − 1). We also have d̂(θ) = rank Î(θ) ≤ D = dim(θ). Therefore
$$
\hat{d}(\theta) \le \min\Big\{ \sum_{i=1}^{N} \min\Big\{ \mathrm{rank}\Big(\frac{\partial h^L(z_i)}{\partial \theta}\Big),\; m-1 \Big\},\; D \Big\}.
$$
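For concreteness, the following sketch (a toy linear-softmax model h = Wz + b with synthetic Gaussian inputs; not an experiment from the paper) checks that the rank of the empirical FIM Σ_i (∂h/∂θ)^⊤ C_i (∂h/∂θ) respects the looser consequence min{N(m−1), D} of the bound above:

```python
# Sketch (assumed toy model): empirical FIM of a linear softmax classifier
# h = W z + b, with theta = (vec(W), b), and a numerical check of the rank bound.
import numpy as np

rng = np.random.default_rng(2)
m, d, N = 4, 10, 6                          # classes, input dim, sample size
D = m * d + m                               # number of parameters
W, b = rng.standard_normal((m, d)), rng.standard_normal(m)

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

I_hat = np.zeros((D, D))
for _ in range(N):
    z = rng.standard_normal(d)
    o = softmax(W @ z + b)
    C = np.diag(o) - np.outer(o, o)                              # rank m - 1
    J = np.hstack([np.kron(z[None, :], np.eye(m)), np.eye(m)])   # dh/dtheta, shape m x D
    I_hat += J.T @ C @ J

print(np.linalg.matrix_rank(I_hat), "<=", min(N * (m - 1), D))   # bound holds
```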




Appendix D Proof of Proposition 3

Proof We only prove the upper bound of d(θ ). The upper bound of d̂(θ ) can be proved
similarly.

   
$$
d(\theta) = \mathrm{rank}\,(I(\theta)) = \mathrm{rank}\Big( E_{p(z)}\Big[ \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta} \Big] \Big).
$$
That means d(θ) is the dimensionality of the image of $E_{p(z)}\big[ (\partial h^L(z)/\partial\theta)^\top C(z)\, \partial h^L(z)/\partial\theta \big]$.
∀θ , we have
     
$$
E_{p(z)}\Big[ \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta} \Big] \theta = E_{p(z)}\Big[ \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta}\, \theta \Big] = E_{p(z)}\Big[ \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} \beta(z) \Big],
$$

where β(z) = C(z) (∂h^L(z)/∂θ) θ is an m-dimensional vector. Therefore

    
$$
E_{p(z)}\Big[ \Big(\frac{\partial h^L(z)}{\partial \theta}\Big)^{\!\top} C(z)\, \frac{\partial h^L(z)}{\partial \theta} \Big] \theta \;\in\; \mathrm{span}\Big( \bigcup_{z \in \mathrm{supp}(p)} \mathrm{Row}\Big( \frac{\partial h^L(z)}{\partial \theta} \Big) \Big).
$$

Letting θ vary in R D , and applying dim(·) on both sides, the statement follows imme-
diately. 


Appendix E Proof of Proposition 4

Proof We have the total derivative


$$
dh^L(z) = \sum_{l=1}^{L} W^L \Lambda^{L-1} W^{L-1} \cdots \Lambda^{l} \big( dW^l\, z^{l-1} + db^l \big).
$$


Therefore,

$$
\forall w \in \mathbb{R}^{\dim(w^l)}, \quad \frac{\partial h^L(z)}{\partial w^l}\, w = W^L \Lambda^{L-1} W^{L-1} \cdots \Lambda^{l}\, \mathrm{mat}(w) \begin{pmatrix} z^{l-1} \\ 1 \end{pmatrix},
$$

where mat(·) means to rearrange the vector into a matrix. Therefore,

 
$$
\mathrm{rank}\Big( \frac{\partial h^L(z)}{\partial w^l} \Big) \le \mathrm{rank}\big( W^L \Lambda^{L-1} W^{L-1} \cdots \Lambda^{l} \big).
$$

The second “≤” in the statement is because

 
$$
\mathrm{rank}\big( W^L \Lambda^{L-1} W^{L-1} \cdots \Lambda^{l} \big) \le \min\big\{ \mathrm{rank}(W^L),\, \mathrm{rank}(\Lambda^{L-1}),\, \cdots,\, \mathrm{rank}(\Lambda^{l}) \big\}
$$
$$
\le \min\big\{ \mathrm{rank}(\Lambda^{L-1}),\, \cdots,\, \mathrm{rank}(\Lambda^{l}) \big\} = \min_{s=l}^{L-1} \mathrm{rank}(\Lambda^{s}).
$$
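A quick numerical check of this rank bound is sketched below, using random Gaussian weight matrices and random 0/1 ReLU activation patterns Λ^s as assumed stand-ins:

```python
# Sketch (assumed random weights and 0/1 ReLU patterns): the rank of the product
# W^L Lambda^{L-1} W^{L-1} ... Lambda^l never exceeds the smallest rank of the
# diagonal activation matrices Lambda^s, as used in the proof above.
import numpy as np

rng = np.random.default_rng(3)
widths = [8, 8, 8, 5]                       # layer widths, output dimension m = 5
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]
Lams = [np.diag(rng.integers(0, 2, size=widths[i + 1])) for i in range(2)]

prod = Ws[2] @ Lams[1] @ Ws[1] @ Lams[0]    # W^3 Lambda^2 W^2 Lambda^1
lhs = np.linalg.matrix_rank(prod)
rhs = min(np.linalg.matrix_rank(L) for L in Lams)
print(lhs, "<=", rhs)
```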




Appendix F Metric Signature of the Neuromanifold

The metric signature of M


(d(θ ), 0, D − d(θ ))

is straightforward from the fact that I(θ) is positive semi-definite (there are no negative eigenvalues), and the local dimensionality d(θ), by definition, is rank(I(θ)) (the number of non-zero eigenvalues).
We also examine whether rank(J(θ)) = d̂(θ). Recall that d̂(θ) = rank Î(θ), and
$$
\mathrm{rank}\,(J(\theta)) = \mathrm{rank}\Big( \frac{\partial^2 \ell}{\partial\theta\partial\theta^\top} \Big) = \mathrm{rank}\Big( \sum_i \frac{\partial^2 \ell_i}{\partial\theta\partial\theta^\top} \Big),
$$

where ℓ is the log-likelihood, and ℓ_i = log p(y_i | z_i, θ). We write the analytical form of the elementwise Hessian
$$
\frac{\partial^2 \ell_i}{\partial\theta\partial\theta^\top} = \sum_{j=1}^{m} \big( \mathrm{OneHot}_j(y) - \mathrm{SoftMax}_j(h^L) \big) \frac{\partial^2 h^L_j(z_i)}{\partial\theta\partial\theta^\top} - I(\theta),
$$


where OneHot(·) denotes the one-hot vector associated with the given target label y. Therefore
$$
\alpha^\top \frac{\partial^2 \ell_i}{\partial\theta\partial\theta^\top}\, \alpha = \sum_{j=1}^{m} \big( \mathrm{OneHot}_j(y) - \mathrm{SoftMax}_j(h^L) \big)\, \alpha^\top \frac{\partial^2 h^L_j(z_i)}{\partial\theta\partial\theta^\top}\, \alpha \;-\; \alpha^\top I(\theta)\, \alpha.
$$

Because of the first term on the RHS, the kernels of the two matrices J(θ ) and Î(θ)
are different, and thus their ranks are also different.
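Numerically, the signature (d(θ), 0, D − d(θ)) can be read off the eigenvalues of any positive semi-definite surrogate of I(θ); the sketch below uses an assumed synthetic rank-deficient Gram matrix purely for illustration:

```python
# Sketch (assumed synthetic PSD matrix): the metric signature (d, 0, D - d) is the
# count of positive, negative, and (numerically) zero eigenvalues of I(theta).
import numpy as np

rng = np.random.default_rng(4)
D, d = 12, 7
A = rng.standard_normal((D, d))
I_theta = A @ A.T                           # PSD with rank d (almost surely)

eigs = np.linalg.eigvalsh(I_theta)
tol = 1e-10 * eigs.max()
signature = (int((eigs > tol).sum()),
             int((eigs < -tol).sum()),
             int((np.abs(eigs) <= tol).sum()))
print(signature)                            # (7, 0, 5)
```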

Appendix G Proof of Proposition 5

Proof As θ̂ is the MLE, we have J(θ̂) ⪰ 0, and ∀θ ∈ M,
$$
-\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \le 0.
$$

Hence,
$$
E_p \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big) \le 1.
$$
Hence,
$$
-\log E_p \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big) \ge 0.
$$

This proves the first “≤”.


As − log(x) is convex, by Jensen’s inequality, we get

$$
-\log E_p \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big) \le E_p \Big[ -\log \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big) \Big]
$$
$$
= E_p \Big[ \frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big] = \frac{N}{2} \mathrm{tr}\Big( E_p\big[ J(\hat\theta)\, (\theta - \hat\theta)(\theta - \hat\theta)^\top \big] \Big) = \frac{N}{2} \mathrm{tr}\Big( J(\hat\theta) \big[ (\mu(\theta) - \hat\theta)(\mu(\theta) - \hat\theta)^\top + \mathrm{cov}(\theta) \big] \Big).
$$

This proves the second “≤”. 
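Both inequalities can be checked by Monte Carlo with a Gaussian p(θ) and a synthetic PSD matrix standing in for J(θ̂); the sketch below is purely illustrative and all constants are assumptions:

```python
# Monte-Carlo sketch (assumed toy numbers) of the two inequalities:
# 0 <= -log E_p exp(-q(theta)) <= (N/2) tr( J [ (mu - theta_hat)(mu - theta_hat)^T + cov ] ),
# with q(theta) = (N/2)(theta - theta_hat)^T J (theta - theta_hat).
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 20
B = rng.standard_normal((D, D))
J_hat = B @ B.T / D                                   # PSD stand-in for J(theta_hat)
theta_hat = rng.standard_normal(D)

mu, cov = rng.standard_normal(D) * 0.1, 0.05 * np.eye(D)   # Gaussian prior p(theta)
samples = rng.multivariate_normal(mu, cov, size=200_000)

diff = samples - theta_hat
q = 0.5 * N * np.einsum("nd,de,ne->n", diff, J_hat, diff)
lhs = -np.log(np.mean(np.exp(-q)))
rhs = 0.5 * N * np.trace(J_hat @ (np.outer(mu - theta_hat, mu - theta_hat) + cov))
print(0.0, "<=", lhs, "<=", rhs)
```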





Appendix H Proof of Lemma 6

Proof Due to the convexity of − log t, we have


$$
\overline{\{ M_f(t_{:,1}), \cdots, M_f(t_{:,m}) \}} = \frac{1}{m} \sum_{j=1}^{m} \Big[ -\log\Big( \frac{1}{n} \sum_{i=1}^{n} \exp(-t_{ij}) \Big) \Big] \ge -\log\Big[ \frac{1}{m} \sum_{j=1}^{m} \frac{1}{n} \sum_{i=1}^{n} \exp(-t_{ij}) \Big] = M_f(T).
$$

This proves the first "≤". To prove the second "≤", we note that $-\log\big( \frac{1}{n} \sum_{i=1}^{n} \exp(-t_i) \big)$ is a concave function of $(t_1, \cdots, t_n)$. Therefore
$$
\overline{\{ M_f(t_{:,1}), \cdots, M_f(t_{:,m}) \}} = \frac{1}{m} \sum_{j=1}^{m} \Big[ -\log\Big( \frac{1}{n} \sum_{i=1}^{n} \exp(-t_{ij}) \Big) \Big] \le -\log\Big( \frac{1}{n} \sum_{i=1}^{n} \exp\Big( -\frac{1}{m} \sum_{j=1}^{m} t_{ij} \Big) \Big) = M_f\big( \{ \bar{t}_1, \cdots, \bar{t}_n \} \big).
$$

The last “≤” is based on the convexity of − log t. Once again, by Jensen’s inequality,
we have
$$
M_f\big( \{ \bar{t}_1, \cdots, \bar{t}_n \} \big) \le -\log\Big( \exp\Big( -\frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} t_{ij} \Big) \Big) = \overline{T}.
$$
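The chain of inequalities can be verified numerically for a random matrix T; the sketch below (random entries, an assumption for illustration) defines M_f(t) = −log(mean_i exp(−t_i)) and checks the ordering:

```python
# Sketch (assumed random entries): numerical check of
# M_f(T) <= mean_j M_f(t_{:,j}) <= M_f(row means) <= overall mean.
import numpy as np

rng = np.random.default_rng(6)
n, m = 30, 8
T = rng.standard_normal((n, m))

def M_f(t):
    return -np.log(np.mean(np.exp(-np.asarray(t))))

a = M_f(T)                                        # pooled over all entries
b = np.mean([M_f(T[:, j]) for j in range(m)])     # average of column-wise values
c = M_f(T.mean(axis=1))                           # M_f of the row averages
d = T.mean()                                      # plain arithmetic mean
print(a <= b <= c <= d)                           # True
```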




Appendix I Derivations of OG

We recall the general formulation in eq. (15):



$$
O := -\log p(X \mid \hat\theta) + \log \int_M \kappa(\theta)\, d\theta - \log \int_M \kappa(\theta) \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big)\, d\theta.
$$
   
If $\kappa(\theta) = \exp\big( -\frac{1}{2} \theta^\top \mathrm{diag}\big(\frac{1}{\sigma}\big)\, \theta \big)$, then the second term on the RHS is
$$
\log \int_M \kappa(\theta)\, d\theta = \log \int_M \exp\Big( -\frac{1}{2} \theta^\top \mathrm{diag}\Big(\frac{1}{\sigma}\Big)\, \theta \Big)\, d\theta = \frac{D}{2} \log 2\pi + \frac{1}{2} \log |\mathrm{diag}(\sigma)|
$$
$$
+ \log \int_M \exp\Big( -\frac{D}{2} \log 2\pi - \frac{1}{2} \log |\mathrm{diag}(\sigma)| - \frac{1}{2} \theta^\top \mathrm{diag}\Big(\frac{1}{\sigma}\Big)\, \theta \Big)\, d\theta
$$
$$
= \frac{D}{2} \log 2\pi + \frac{1}{2} \log |\mathrm{diag}(\sigma)| + \log 1 = \frac{D}{2} \log 2\pi + \frac{1}{2} \log |\mathrm{diag}(\sigma)|.
$$

The third (last) term on the RHS is



$$
-\log \int_M \kappa(\theta) \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big)\, d\theta = -\log \int_M \exp\Big( -\frac{1}{2} \theta^\top \mathrm{diag}\Big(\frac{1}{\sigma}\Big)\, \theta - \frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big)\, d\theta
$$
$$
= -\log \int_M \exp\Big( -\frac{1}{2} \theta^\top A\, \theta + b^\top \theta + c \Big)\, d\theta,
$$
where
$$
A = N J(\hat\theta) + \mathrm{diag}\Big(\frac{1}{\sigma}\Big) \succ 0, \qquad b = N J(\hat\theta)\, \hat\theta, \qquad c = -\frac{N}{2} \hat\theta^\top J(\hat\theta)\, \hat\theta.
$$

Then,

$$
-\log \int_M \kappa(\theta) \exp\Big( -\frac{N}{2} (\theta - \hat\theta)^\top J(\hat\theta)\, (\theta - \hat\theta) \Big)\, d\theta = -\log \int_M \exp\Big( -\frac{1}{2} (\theta - \bar\theta)^\top A\, (\theta - \bar\theta) + c + \frac{1}{2} \bar\theta^\top A\, \bar\theta \Big)\, d\theta
$$
$$
= -\frac{D}{2} \log 2\pi + \frac{1}{2} \log |A| - c - \frac{1}{2} \bar\theta^\top A\, \bar\theta - \log \int_M \exp\Big( -\frac{D}{2} \log 2\pi + \frac{1}{2} \log |A| - \frac{1}{2} (\theta - \bar\theta)^\top A\, (\theta - \bar\theta) \Big)\, d\theta
$$
$$
= -\frac{D}{2} \log 2\pi + \frac{1}{2} \log |A| - c - \frac{1}{2} \bar\theta^\top A\, \bar\theta,
$$

where Aθ̄ = b. To sum up,

D 1
OG = − log p(X | θ̂ ) + log 2π + log |diag (σ ) |
2 2
D 1 1 
− log 2π + log | A| − c − θ̄ Aθ̄
2 2 2
1 1 1 
= − log p(X | θ̂ ) + log |diag (σ ) | + log | A| − c − θ̄ Aθ̄ ,
2 2 2
1 1 1
= − log p(X | θ̂ ) + log |diag (σ ) | + log |N J(θ̂) + diag |
2 2 σ
 −1
N  1 1
+ θ̂ J(θ̂ )θ̂ − N J(θ̂ )θ̂ N J(θ̂ ) + diag N J(θ̂ )θ̂
2 2 σ

123
A geometric modeling of Occam’s…

1
= − log p(X | θ̂ ) + log |N J(θ̂ )diag (σ ) + I|
2
 −1
1  1 1 1
+ θ̂ J(θ̂ ) J(θ̂ ) + diag diag θ̂
2 N σ σ
1
= − log p(X | θ̂ ) + log |N J(θ̂ )diag (σ ) + I|
2
−1
1  1
+ θ̂ J(θ̂ ) diag (σ ) J(θ̂ ) + I θ̂ .
2 N

The last term does not scale with N and has a smaller order as compared to other
terms. Indeed,

$$
\lim_{N\to\infty} J(\hat\theta) \Big( J(\hat\theta) + \frac{1}{N}\mathrm{diag}\Big(\frac{1}{\sigma}\Big) \Big)^{-1} = J(\hat\theta)\, J(\hat\theta)^{+},
$$
where $J(\hat\theta)^{+}$ is the Moore–Penrose inverse of $J(\hat\theta)$. Hence, as N → ∞,
$$
\frac{1}{2}\hat\theta^\top J(\hat\theta) \Big( \mathrm{diag}(\sigma) J(\hat\theta) + \frac{1}{N} I \Big)^{-1} \hat\theta \;\to\; \frac{1}{2}\hat\theta^\top J(\hat\theta)\, J(\hat\theta)^{+}\, \mathrm{diag}\Big(\frac{1}{\sigma}\Big)\hat\theta \;\le\; \frac{1}{2}\hat\theta^\top \mathrm{diag}\Big(\frac{1}{\sigma}\Big)\hat\theta.
$$
By assumption (A5), the RHS is O(1). This term is therefore dropped. We get
$$
O_G = -\log p(X \mid \hat\theta) + \frac{1}{2}\log\big| N J(\hat\theta)\,\mathrm{diag}(\sigma) + I \big| + O(1).
$$

Note that rank J(θ̂) ≤ D, and the matrix J(θ̂)diag(σ) has the same rank as J(θ̂). We can write J(θ̂) = L(θ̂)L(θ̂)^⊤, where L(θ̂) has shape D × rank J(θ̂). We abuse I to denote both the identity matrix of shape D × D and the identity matrix of shape rank J(θ̂) × rank J(θ̂). By the Weinstein–Aronszajn identity,
$$
O_G = -\log p(X \mid \hat\theta) + \frac{1}{2}\log\big| N L(\hat\theta) L(\hat\theta)^\top \mathrm{diag}(\sigma) + I \big| + O(1) = -\log p(X \mid \hat\theta) + \frac{1}{2}\log\big| N L(\hat\theta)^\top \mathrm{diag}(\sigma) L(\hat\theta) + I \big| + O(1)
$$
$$
= -\log p(X \mid \hat\theta) + \frac{\mathrm{rank}\, J(\hat\theta)}{2}\log N + \frac{1}{2}\log\Big| L(\hat\theta)^\top \mathrm{diag}(\sigma) L(\hat\theta) + \frac{1}{N} I \Big| + O(1).
$$


Note L(θ̂)^⊤ diag(σ) L(θ̂) has the same set of non-zero eigenvalues as L(θ̂)L(θ̂)^⊤ diag(σ) = J(θ̂)diag(σ), which we denote as λ_i^+(J(θ̂)diag(σ)). Then,
$$
O_G = -\log p(X \mid \hat\theta) + \frac{\mathrm{rank}\, J(\hat\theta)}{2}\log N + \frac{1}{2}\sum_{i=1}^{\mathrm{rank}\, J(\hat\theta)} \log\Big( \lambda_i^{+}\big( J(\hat\theta)\,\mathrm{diag}(\sigma) \big) + \frac{1}{N} \Big) + O(1).
$$

Denote the largest and smallest elements of σ as σ_max and σ_min, respectively. Then,
$$
L(\hat\theta)^\top \mathrm{diag}(\sigma)\, L(\hat\theta) \preceq \sigma_{\max}\, L(\hat\theta)^\top L(\hat\theta).
$$
Hence,
$$
\frac{1}{2}\log\Big| L(\hat\theta)^\top \mathrm{diag}(\sigma)\, L(\hat\theta) + \frac{1}{N} I \Big| \le \frac{1}{2}\log\Big| \sigma_{\max}\, L(\hat\theta)^\top L(\hat\theta) + \frac{1}{N} I \Big| = \frac{1}{2}\sum_{i=1}^{\mathrm{rank}\, J(\hat\theta)} \log\Big( \sigma_{\max}\, \lambda_i^{+}(J(\hat\theta)) + \frac{1}{N} \Big).
$$
Similarly,
$$
\frac{1}{2}\log\Big| L(\hat\theta)^\top \mathrm{diag}(\sigma)\, L(\hat\theta) + \frac{1}{N} I \Big| \ge \frac{1}{2}\sum_{i=1}^{\mathrm{rank}\, J(\hat\theta)} \log\Big( \sigma_{\min}\, \lambda_i^{+}(J(\hat\theta)) + \frac{1}{N} \Big).
$$
If σ = σ𝟙, then σ_max = σ_min = σ, and both "≤" and "≥" in the above inequalities become tight.
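To see the effect of this expression, the sketch below (an assumed synthetic spectrum with isotropic σ, not a measured network spectrum) evaluates the O_G penalty (rank J(θ̂)/2) log N + ½ Σ_i log(σ λ_i^+ + 1/N) and compares it with a BIC-style (D/2) log N dimension count:

```python
# Sketch (assumed synthetic spectrum, isotropic sigma): eigenvalues far below
# 1/(sigma N) contribute negatively to the O_G penalty.
import numpy as np

N, sigma = 10_000, 1.0
lam = np.concatenate([np.full(20, 1.0), np.full(480, 1e-7)])   # assumed FIM spectrum

penalty = 0.5 * len(lam) * np.log(N) + 0.5 * np.sum(np.log(sigma * lam + 1.0 / N))
naive_bic = 0.5 * len(lam) * np.log(N)      # BIC-style count of all D dimensions
print(f"O_G penalty: {penalty:.1f}  vs  (D/2) log N: {naive_bic:.1f}")
```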

Appendix J Probability Measures on M

Probability measures are not defined on the lightlike M, because along the lightlike
geodesics, the distance is zero. To compute the integral of a given function f (θ ) on
M, one has to first choose a proper Riemannian submanifold M_s ⊂ M specified by an embedding θ_s(θ), whose metric is not singular. Then, the integral on M_s can be defined as ∫_{M_s} f(θ_s(θ)) dθ_s, where M_s is the sub-manifold associated with the frame θ_s = (θ^1, ⋯, θ^d), so that T M_s = S(T M), and the induced Riemannian volume element is
$$
d\theta_s = \sqrt{|I(\theta_s)|}\; d\theta^1 \wedge d\theta^2 \wedge \cdots \wedge d\theta^d = \sqrt{|I(\theta_s)|}\; d_E\theta_s, \qquad (J2)
$$


where d_Eθ is the Euclidean volume element. We artificially shift I(θ) to be positive definite and define the volume element as
$$
d\theta := \sqrt{|I(\theta) + \epsilon_1 I|}\; d\theta^1 \wedge d\theta^2 \wedge \cdots \wedge d\theta^D = \sqrt{|I(\theta) + \epsilon_1 I|}\; d_E\theta, \qquad (J3)
$$
where ε1 > 0 is a very small value as compared to the scale of I(θ) given by (1/D) tr(I(θ)), i.e. the average of its eigenvalues. Notice this element will vary with θ: different coordinate systems will yield different volumes. Therefore it depends on how θ can be uniquely specified. This is roughly guaranteed by our assumption that the θ-coordinates correspond to the input coordinates (weights and biases) up to an orthogonal transformation. Although eq. (J3) is a loose mathematical definition, it makes intuitive sense and is convenient for making derivations. Then, we can integrate functions
$$
\int_M f(\theta)\, d\theta = \int f(\theta)\, \sqrt{|I(\theta) + \epsilon_1 I|}\; d_E\theta, \qquad (J4)
$$

where the RHS is an integration over R D , assuming θ is real-valued.


Using this tool, we first consider Jeffreys’ non-informative prior on a sub-manifold
Ms , given by
$$
p_J(\theta_s) = \frac{\sqrt{|I(\theta_s)|}}{\int_{M_s} \sqrt{|I(\theta_s)|}\; d_E\theta_s}. \qquad (J5)
$$
It is easy to check that $\int_{M_s} p_J(\theta_s)\, d_E\theta_s = 1$. This prior may lead to similar results as [3, 16], i.e. a "razor" of the model M_s. However, we will instead use a Gaussian-like prior, because Jeffreys' prior is not well defined on M. Moreover, the integral $\int_{M_s} \sqrt{|I(\theta_s)|}\; d_E\theta_s$ is likely to diverge based on our revised volume element in eq. (J3).
If the parameter space is real-valued, one can easily check that the volume based on eq. (J3) along the lightlike dimensions will diverge. The zero-centered Gaussian prior
corresponds to a better code, because it is commonly acknowledged that one can
achieve the same training error and generalization without using large weights. For
example, regularizing the norm of the weights is widely used in deep learning. By
using such an informative prior, one can have the same training error in the first term
in eq. (2), while having a smaller “complexity” in the rest of the terms, because we
only encode such models with constrained weights. Given the DNN, we define an
informative prior on the lightlike neuromanifold
 
$$
p(\theta) = \frac{1}{V} \exp\Big( -\frac{1}{2\epsilon_2^2} \|\theta\|^2 \Big) \sqrt{|I(\theta) + \epsilon_1 I|}, \qquad (J6)
$$
where ε2 > 0 is a scale parameter of θ, and V is a normalizing constant to ensure $\int p(\theta)\, d_E\theta = 1$. Here, the base measure is the Euclidean volume element d_Eθ, as $\sqrt{|I(\theta) + \epsilon_1 I|}$ already appeared in p(θ). Keep in mind, again, that this p(θ) is defined
in a special coordinate system, and is not invariant to re-parametrization. This distri-


bution is also isotropic in the input coordinate system, which agrees with initialization techniques (different layers, or weights and biases, may use different variances in their initialization; this minor issue can be solved by a simple re-scaling re-parameterization).
This bi-parametric prior connects Jeffreys’ prior (that is widely used in MDL) and a
Gaussian prior (that is widely used in deep learning). If ε2 → ∞, ε1 → 0, it coincides
with Jeffreys’ prior (if it is well defined and I(θ) has full rank); if ε1 is large, the
metric (I(θ) + ε1 I) becomes spherical, and eq. (J6) becomes a Gaussian prior. We
refer the reader to [59, 60] for other extensions of Jeffreys’ prior.
The normalizing constant of eq. (J6) is an information volume measure of M, given
by

  
$$
V := \int_M \exp\Big( -\frac{1}{2\epsilon_2^2} \|\theta\|^2 \Big)\, d\theta. \qquad (J7)
$$

Unlike Jeffreys’ prior whose information volume (the 3rd term on the RHS of eq. (2))
can be unbounded, this volume can be bounded as stated in the following theorem.

Theorem 7
$$
\big( \sqrt{2\pi\epsilon_1}\,\epsilon_2 \big)^D \;\le\; V \;\le\; \big( \sqrt{2\pi(\epsilon_1 + \lambda_m)}\,\epsilon_2 \big)^D, \qquad (J8)
$$

where λm is the largest eigenvalue of the FIM I(θ).

Notice λm may not exist, as the integration is taken over θ ∈ M. Intuitively, V is a


weighted volume w.r.t. a Gaussian-like prior distribution on M, while the 3rd term
on the RHS of eq. (2) is an unweighted volume. The larger the radius ε2, the more possible DNNs are included; the larger the parameter ε1, the larger the local volume element in eq. (J3), and therefore the larger the measured total volume. log V is an O(D) term, meaning the volume grows with the number of dimensions.

Appendix J.1 Proof of Theorem 7

By definition,

     
$$
V = \int_M \exp\Big( -\frac{1}{2\epsilon_2^2}\|\theta\|^2 \Big)\, d\theta = \int \exp\Big( -\frac{1}{2\epsilon_2^2}\|\theta\|^2 \Big)\, \sqrt{|I(\theta) + \epsilon_1 I|}\; d_E\theta.
$$

By our assumption, θ is an orthogonal transformation of the neural network weights


and biases, and therefore θ ∈ R D . We have

$$
\sqrt{|I(\theta) + \epsilon_1 I|} \ge \sqrt{|\epsilon_1 I|} = \epsilon_1^{D/2}.
$$



Hence
$$
V \ge \int \exp\Big( -\frac{1}{2\epsilon_2^2}\|\theta\|^2 \Big)\, \epsilon_1^{D/2}\; d_E\theta = (2\pi)^{D/2}\,\epsilon_2^{D}\,\epsilon_1^{D/2} \int \exp\Big( -\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\epsilon_2^2 I| - \frac{1}{2\epsilon_2^2}\|\theta\|^2 \Big)\; d_E\theta
$$
$$
= (2\pi)^{D/2}\,\epsilon_2^{D}\,\epsilon_1^{D/2} = \big( \sqrt{2\pi\epsilon_1}\,\epsilon_2 \big)^D.
$$

For the upper bound, we prove a stronger result as follows.

$$
\sqrt{|I(\theta) + \epsilon_1 I|} = \Big( \prod_{i=1}^{D} (\lambda_i + \epsilon_1) \Big)^{1/2} \le \Big( \frac{1}{D}\mathrm{tr}(I(\theta)) + \epsilon_1 \Big)^{D/2}.
$$
Therefore
$$
V \le \big( \sqrt{2\pi}\,\epsilon_2 \big)^D \Big( \frac{1}{D}\mathrm{tr}(I(\theta)) + \epsilon_1 \Big)^{D/2}.
$$
If one applies $\frac{1}{D}\mathrm{tr}(I(\theta)) \le \lambda_m$ to the RHS, the upper bound is further relaxed as
$$
V \le \big( \sqrt{2\pi}\,\epsilon_2 \big)^D (\lambda_m + \epsilon_1)^{D/2} = \big( \sqrt{2\pi(\epsilon_1 + \lambda_m)}\,\epsilon_2 \big)^D.
$$
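In the special case where I(θ) is constant and diagonal, V has a closed form and the bounds of theorem 7 can be checked directly; the sketch below uses assumed values of D, ε1, ε2 and an assumed spectrum, purely for illustration:

```python
# Sketch (assumed I(theta) constant and diagonal): closed-form V and a check of
# the bounds of theorem 7, computed in log scale to avoid overflow.
import numpy as np

rng = np.random.default_rng(7)
D, eps1, eps2 = 100, 1e-3, 0.5
lam = rng.uniform(0.0, 2.0, size=D)         # eigenvalues of the constant FIM

log_V = 0.5 * D * np.log(2 * np.pi * eps2**2) + 0.5 * np.sum(np.log(lam + eps1))
log_lower = D * np.log(np.sqrt(2 * np.pi * eps1) * eps2)
log_upper = D * np.log(np.sqrt(2 * np.pi * (eps1 + lam.max())) * eps2)
print(log_lower <= log_V <= log_upper)      # True
```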

Appendix K An alternative derivation of the razor

In this section, we provide an alternative derivation of the proposed razor O based on a


different prior. The main observations on the negative complexity are consistent with
the cases of Gaussian and Jeffreys’ priors.
We plug in the expression of p(θ ) in eq. (J6) and get

$$
-\log p(X) \approx -\log p(X \mid \hat\theta) + \log V - \log \int_M \exp\Big( -\frac{\|\theta\|^2}{2\epsilon_2^2} - \frac{N}{2}(\theta - \hat\theta)^\top J(\hat\theta)\,(\theta - \hat\theta) \Big)\, d\theta.
$$

In the last term on the RHS, the expression inside the parentheses is a quadratic function w.r.t. θ. However, the integration is w.r.t. the non-Euclidean volume element dθ and therefore does not have a closed form. We need to assume
(A7) N is large enough so that |I(θ) + ε1 I| ≈ |I(θ̂) + ε1 I|.
This means the quadratic function will be sharp enough to make the volume element dθ roughly constant. Along the lightlike dimensions (zero eigenvalues of I(θ)) this is trivial.


Plugging eq. (J6) into eq. (13), the following three terms
$$
\frac{1}{V}, \qquad \sqrt{|I(\theta) + \epsilon_1 I|} \approx \sqrt{|I(\hat\theta) + \epsilon_1 I|}, \qquad \exp\big( \log p(X \mid \hat\theta) \big) = p(X \mid \hat\theta)
$$
can all be taken out of the integration as constant scalers, as they do not depend on θ.
The main difficulty is to perform the integration
  
$$
\int \exp\Big( -\frac{\|\theta\|^2}{2\epsilon_2^2} - \frac{N}{2}(\theta - \hat\theta)^\top J(\hat\theta)\,(\theta - \hat\theta) \Big)\, d_E\theta = \int \exp\Big( -\frac{1}{2}\theta^\top A\,\theta + b^\top\theta + c \Big)\, d_E\theta
$$
$$
= \int \exp\Big( -\frac{1}{2}(\theta - A^{-1}b)^\top A\,(\theta - A^{-1}b) + \frac{1}{2} b^\top A^{-1} b + c \Big)\, d_E\theta
$$
$$
= \exp\Big( \frac{1}{2} b^\top A^{-1} b + c \Big) \int \exp\Big( -\frac{1}{2}(\theta - A^{-1}b)^\top A\,(\theta - A^{-1}b) \Big)\, d_E\theta
$$
$$
= \exp\Big( \frac{1}{2} b^\top A^{-1} b + c \Big) \exp\Big( \frac{D}{2}\log 2\pi - \frac{1}{2}\log|A| \Big) = \exp\Big( \frac{1}{2} b^\top A^{-1} b + c + \frac{D}{2}\log 2\pi - \frac{1}{2}\log|A| \Big),
$$

where
$$
A = N J(\hat\theta) + \frac{1}{\epsilon_2^2} I, \qquad b = N J(\hat\theta)\,\hat\theta, \qquad c = -\frac{1}{2}\hat\theta^\top N J(\hat\theta)\,\hat\theta.
$$

The rest of the derivations are straightforward. Note $R = -c - \frac{1}{2} b^\top A^{-1} b$. After derivations and simplifications, we get

$$
-\log p(X) \approx -\log p(X \mid \hat\theta) + \frac{D}{2}\log\frac{N}{2\pi} + \log V + \frac{1}{2}\log\Big| J(\hat\theta) + \frac{1}{N\epsilon_2^2} I \Big| - \frac{1}{2}\log\big| I(\hat\theta) + \epsilon_1 I \big| + R. \qquad (K9)
$$

The remainder term is given by


$$
R = \frac{1}{2}\hat\theta^\top \Big[ N J(\hat\theta) - N J(\hat\theta) \Big( N J(\hat\theta) + \frac{1}{\epsilon_2^2} I \Big)^{-1} N J(\hat\theta) \Big] \hat\theta. \qquad (K10)
$$

We need to analyze the order of this R term. Assume the largest eigenvalue of J(θ̂ ) is
λm , then

$$
|R| \le \frac{N\lambda_m}{\epsilon_2^2 N\lambda_m + 1}\, \|\hat\theta\|^2. \qquad (K11)
$$


Fig. 2 A: a model far from the truth (underlying distribution of observed data); B: close to the truth but sensitive to parameters; C (deep learning): close to the truth with many good local optima

We assume
(A8) The ratio between the scale of each dimension of the MLE θ̂ and ε2, i.e. θ̂_i/ε2 (i = 1, ⋯, D), is of order O(1).
Intuitively, the scale parameter ε2 in our prior p(θ) in eq. (J6) is chosen to "cover" the good models. Therefore, the order of R is O(D). As N becomes large, R will be dominated by the 2nd O(D log N) term. We will therefore discard R for simplicity. It could be useful for a more delicate analysis. In conclusion, we arrive at the following expression
$$
O := -\log p(X \mid \hat\theta) + \frac{D}{2}\log\frac{N}{2\pi} + \log V + \frac{1}{2}\log\frac{\big| J(\hat\theta) + \frac{1}{N\epsilon_2^2} I \big|}{\big| I(\hat\theta) + \epsilon_1 I \big|}. \qquad (K12)
$$

Notice the similarity with eq. (2), where the first two terms on the RHS are exactly
the same. The 3rd term is an O(D) term, similar to the 3rd term in eq. (2). It is bounded
according to theorem 7, while the 3rd term in eq. (2) could be unbounded. Our last term is in a similar form to the last term in eq. (2), except that it is well defined on the lightlike manifold. If we let ε2 → ∞, ε1 → 0, we get exactly eq. (2), and in this case O = χ. As the number of parameters D becomes large, both the 2nd and 3rd terms will grow linearly w.r.t. D, meaning that they contribute positively to the model complexity. Interestingly, the fourth term is a "negative complexity". Regard 1/(Nε2²) and ε1 as small positive values. The fourth term is essentially a log-ratio of the observed FIM to the true FIM. For small models, the two coincide, because the sample size N is large relative to the model size; in this case, the effect of this term is minor. For DNNs, the sample size N is very limited relative to the huge model size D. Along a dimension θ_i, J(θ) is likely to be singular as stated in proposition 2, even if I(θ) has a very small positive value there. In this case, their log-ratio will be negative. Therefore, the razor O favors DNNs with their Fisher spectrum clustered around 0.
In fig. 2, model C displays the concepts of a DNN, where there are many good
local optima. The performance is not sensitive to specific values of model parameters.
On the lightlike neuromanifold M, there are many directions that are very close to
being lightlike. When a DNN model varies along these directions, the model slightly changes in terms of I(θ), but its predictions on the samples, as measured by J(θ), are invariant. These directions count negatively towards the complexity, because these


extra freedoms (dimensions of θ ) occupy almost zero volume in the geometric sense,
and are helpful to give a shorter code to future unseen samples.
To obtain a simpler expression, we consider the case that I(θ) ≡ I(θ̂) is both constant and diagonal in the region of interest defined by eq. (J6). In this case,
$$
\log V \approx \frac{D}{2}\log 2\pi + D\log\epsilon_2 + \frac{1}{2}\log|I(\hat\theta) + \epsilon_1 I|. \qquad (K13)
$$

On the other hand, as D → ∞, the spectrum of the FIM I(θ) will follow the density ρ_I(λ). We plug these expressions into eq. (K12), discard all lower-order terms, and get a simplified version of the razor
$$
O \approx -\log p(X \mid \hat\theta) + \frac{D}{2}\log N + \frac{D}{2}\int_0^\infty \rho_I(\lambda)\, \log\Big( \lambda + \frac{1}{N\epsilon_2^2} \Big)\, d\lambda, \qquad (K14)
$$

where ρI denotes the spectral density of the Fisher information matrix.
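The simplified razor can be evaluated by replacing ρ_I with the empirical spectral measure of the FIM. The sketch below (an assumed synthetic spectrum, not a measured one) shows how eigenvalues clustered near 0+ make the last term strongly negative and offset most of the (D/2) log N term:

```python
# Sketch (assumed synthetic spectrum): penalty of eq. (K14) with the empirical
# spectral measure in place of rho_I; (D/2) * average over the spectrum equals
# (1/2) * sum over eigenvalues.
import numpy as np

N, eps2 = 50_000, 1.0
lam = np.concatenate([np.full(50, 1.0), np.full(950, 1e-8)])   # assumed FIM spectrum, D = 1000
D = len(lam)

bic_like = 0.5 * D * np.log(N)
spectral = 0.5 * np.sum(np.log(lam + 1.0 / (N * eps2**2)))
print(f"(D/2) log N = {bic_like:.0f},  spectral term = {spectral:.0f},  total penalty = {bic_like + spectral:.0f}")
```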


Author Contributions KS initiated this work. KS and FN jointly developed this work and wrote the paper.

Funding Open access funding provided by CSIRO Library Services.

Data availability No datasets were generated or analysed during the current study.

Declarations
Conflicts of Interest Frank Nielsen is a Co-Editor of Information Geometry. He was not involved in the
peer review or handling of the manuscript. On behalf of all authors, the corresponding author states that
there is no other conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
2. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
3. Rissanen, J.: Fisher information and stochastic complexity. IEEE Trans. Inf. Theory 42(1), 40–47
(1996)
4. Grünwald, P.D.: The Minimum Description Length Principle. Adaptive Computation and Machine Learning series.
The MIT Press, Cambridge, Massachusetts (2007)
5. Wallace, C.S., Boulton, D.M.: An information measure for classification. Comput. J. 11(2), 185–194
(1968)
6. Grünwald, P., Roos, T.: Minimum description length revisited. Int. J. Math. Ind. 11(01), 1930001
(2019)


7. Barron, A.R., Cover, T.M.: Minimum complexity density estimation. IEEE Trans. Inf. Theory 37(4),
1034–1054 (1991)
8. Zhang, T.: Information-theoretic upper and lower bounds for statistical estimation. IEEE Trans. Inf.
Theory 52(4), 1307–1321 (2006)
9. Grünwald, P.D., Mehta, N.A.: A tight excess risk bound via a unified PAC-Bayesian-Rademacher-
Shtarkov-MDL complexity. In: Garivier, A., Kale, S. (eds.) Proceedings of the 30th International
Conference on Algorithmic Learning Theory. Proceedings of Machine Learning Research, vol. 98, pp.
433–465 (2019)
10. Blum, A., Langford, J.: PAC-MDL bounds. In: Proc. Sixteenth Conf. Learning Theory (COLT’ 03),
pp. 344–357 (2003)
11. Hayashi, M.: Large deviation theory for non-regular location shift family. Ann. Inst. Stat. Math. 63(4),
689–716 (2011)
12. Pollard, D.: A note on insufficiency and the preservation of Fisher information. In: From Probability
to Statistics and Back: High-Dimensional Models and Processes-A Festschrift in Honor of Jon A.
Wellner, pp. 266–275. Institute of Mathematical Statistics, Beachwood, Ohio (2013)
13. Calin, O., Udrişte, C.: Geometric Modeling in Probability and Statistics. Springer, Cham (2014)
14. Shtarkov, Y.M.: Universal sequential coding of single messages. Probl. Inf. Transm. 23(3), 3–17 (1987)
15. Amari, S.: Information geometry and its applications. Applied Mathematical Sciences, vol. 194.
Springer, Japan (2016)
16. Balasubramanian, V.: MDL, Bayesian inference and the geometry of the space of probability distribu-
tions. In: Advances in Minimum Description Length: Theory and Applications, pp. 81–98. MIT Press,
Cambridge, Massachusetts (2005)
17. Myung, I.J., Balasubramanian, V., Pitt, M.A.: Counting probability distributions: differential geometry
and model selection. Proc. Natl. Acad. Sci. 97(21), 11170–11175 (2000)
18. Kupeli, D.N.: Singular Semi-Riemannian Geometry. Mathematics and Its Applications, vol. 366.
Springer, Netherlands (1996)
19. Lauritzen, S.L.: Statistical manifolds. Differential geometry in statistical inference 10, 163–216 (1987)
20. Hotelling, H.: Spaces of statistical parameters. Bull. Amer. Math. Soc 36, 191 (1930)
21. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. Bulletin
of the Calcutta Mathematical Society 37(3), 81–91 (1945)
22. Rao, C.R.: Information and the accuracy attainable in the estimation of statistical parameters. In:
Breakthroughs in Statistics, pp. 235–247. Springer, New York, NY (1992)
23. Nomizu, K., Katsumi, N., Sasaki, T.: Affine Differential Geometry: Geometry of Affine Immersions.
Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, United Kingdom (1994)
24. Watanabe, S.: Algebraic Geometry and Statistical Learning Theory. Cambridge Monographs on
Applied and Computational Mathematics, vol. 25. Cambridge University Press, Cambridge, United
Kingdom (2009)
25. Duggal, K., Bejancu, A.: Lightlike Submanifolds of Semi-Riemannian Manifolds and Applications.
Mathematics and Its Applications, vol. 364. Springer, Netherlands (1996)
26. Thomas, P.: Genga: a generalization of natural gradient ascent with positive and negative convergence
results. In: International Conference on Machine Learning. Proceedings of Machine Learning Research,
vol. 32 (2), pp. 1575–1583 (2014)
27. Nakajima, N., Ohmoto, T.: The dually flat structure for singular models. Inf. Geom. 4(1), 31–64 (2021)
28. Bahadir, O., Tripathi, M.M.: Geometry of lightlike hypersurfaces of a statistical manifold.
arXiv:1901.09251 [math.DG] (2019)
29. Jain, V., Singh, A.P., Kumar, R.: On the geometry of lightlike submanifolds of indefinite statistical
manifolds. arXiv:1903.07387 [math.DG] (2019)
30. Duggal, K.: A review on unique existence theorems in lightlike geometry. Geometry 2014 (2014).
Article ID 835394
31. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: International Conference
on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 15, pp. 315–
323 (2011)
32. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT press, Cambridge, Massachusetts (2016)
33. Sun, K., Nielsen, F.: Relative Fisher information and natural gradient for learning large modular models.
In: International Conference on Machine Learning. Proceedings of Machine Learning Research, vol.
70, pp. 3289–3298 (2017)


34. Pascanu, R., Bengio, Y.: Revisiting natural gradient for deep networks. In: International Conference
on Learning Representations (ICLR) (2014)
35. Soen, A., Sun, K.: On the variance of the Fisher information for deep learning. In: Advances in Neural
Information Processing Systems 34, pp. 5708–5719. Curran Associates, Inc., NY 12571, USA (2021)
36. Martens, J.: New insights and perspectives on the natural gradient method. J. Mach. Learn. Res. 21(146),
1–76 (2020)
37. Kunstner, F., Hennig, P., Balles, L.: Limitations of the empirical Fisher approximation for natural
gradient descent. In: Advances in Neural Information Processing Systems 32, pp. 4158–4169. Curran
Associates, Inc., NY 12571, USA (2019)
38. Karakida, R., Akaho, S., Amari, S.: Universal statistics of Fisher information in deep neural networks:
Mean field approach. In: International Conference on Artificial Intelligence and Statistics. Proceedings
of Machine Learning Research, vol. 89, pp. 1032–1041 (2019)
39. Pennington, J., Worah, P.: The spectrum of the Fisher information matrix of a single-hidden-layer
neural network. In: Advances in Neural Information Processing Systems 31, pp. 5410–5419. Curran
Associates, Inc., NY 12571, USA (2018)
40. Mingo, J.A., Speicher, R.: Free Probability and Random Matrices. Fields Institute Monographs, vol.
35. Springer, New York (2017)
41. Amari, S., Ozeki, T., Karakida, R., Yoshida, Y., Okada, M.: Dynamics of learning in MLP: Natural
gradient and singularity revisited. Neural Comput. 30(1), 1–33 (2018)
42. Calin, O.: Deep Learning Architectures. Springer, London (2020)
43. Karakida, R., Akaho, S., Amari, S.-i: Pathological Spectra of the Fisher Information Metric and Its
Variants in Deep Neural Networks. Neural Computation 33(8), 2274–2307 (2021)
44. Aoki, T., Kuribayashi, K.: On the category of stratifolds. Cahiers de Topologie et Géométrie Différen-
tielle Catégoriques LVIII(2), 131–160 (2017). arXiv:1605.04142 [math.CT]
45. Esser, P.M., Nielsen, F.: Towards modeling and resolving singular parameter spaces using stratifolds.
arXiv preprint arXiv:2112.03734 (2021)
46. Li, C., Farkhoor, H., Liu, R., Yosinski, J.: Measuring the intrinsic dimension of objective landscapes.
In: International Conference on Learning Representations (ICLR) (2018)
47. Feng, X., Zhang, Z.: The rank of a random matrix. Appl. Math. Comput. 185(1), 689–694 (2007)
48. Kay, D.C.: Schaum's Outline of Theory and Problems of Tensor Calculus. McGraw-Hill, New York (1988)
49. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Automat. Contr. 19(6),
716–723 (1974)
50. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In:
Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS),
pp. 249–256 (2010)
51. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance
on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision
(ICCV), pp. 1026–1034 (2015)
52. Alain, G., Roux, N.L., Manzagol, P.-A.: Negative eigenvalues of the Hessian in deep neural networks.
In: ICLR’18 Workshop (2018). arXiv:1902.02366 [cs.LG]
53. Sagun, L., Evci, U., Guney, V.U., Dauphin, Y., Bottou, L.: Empirical analysis of the Hessian of over-
parametrized neural networks. In: ICLR’18 Workshop (2018). arXiv:1706.04454 [cs.LG]
54. Nagumo, M.: Über eine Klasse der Mittelwerte. In: Japanese Journal of Mathematics: Transactions
and Abstracts, vol. 7, pp. 71–79 (1930). The Mathematical Society of Japan
55. Kolmogorov, A.N.: Sur la Notion de la Moyenne. G. Bardi, tip. della R. Accad. dei Lincei, Rome, Italy
(1930)
56. Komori, O., Eguchi, S.: A unified formulation of k-Means, fuzzy c-Means and Gaussian mixture model
by the Kolmogorov-Nagumo average. Entropy 23(5), 518 (2021)
57. MacKay, D.J.C.: Bayesian methods for adaptive models. PhD thesis, California Institute of Technology
(1992)
58. Said, S., Hajri, H., Bombrun, L., Vemuri, B.C.: Gaussian distributions on Riemannian symmetric
spaces: statistical learning with structured covariance matrices. IEEE Trans. Inf. Theory 64(2), 752–
772 (2017)
59. Takeuchi, J., Amari, S.-I.: α-parallel prior and its properties. IEEE Trans. Inf. Theory 51(3), 1011–1023
(2005)
60. Jiang, R., Tavakoli, J., Zhao, Y.: Weyl prior and Bayesian statistics. Entropy 22(4) (2020)


61. Wei, H., Zhang, J., Cousseau, F., Ozeki, T., Amari, S.: Dynamics of learning near singularities in
layered networks. Neural Comput. 20(3), 813–843 (2008)
62. Orhan, A.E., Pitkow, X.: Skip connections eliminate singularities. In: International Conference on
Learning Representations (ICLR) (2018)
63. Yoshida, Y., Karakida, R., Okada, M., Amari, S.: Statistical mechanical analysis of learning dynamics
of two-layer perceptron with multiple output units. Journal of Physics A: Mathematical and Theoretical
(2019)
64. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling.
IEEE Trans. Inf. Theory 44(6), 2743–2760 (1998)
65. Rissanen, J.: Strong optimality of the normalized ml models as universal codes and information in
data. IEEE Trans. Inf. Theory 47(5), 1712–1717 (2001)
66. Murata, N., Yoshizawa, S., Amari, S.-i: Network information criterion-determining the number of
hidden units for an artificial neural network model. IEEE transactions on neural networks 5(6), 865–
872 (1994)
67. Yamazaki, K., Watanabe, S.: Singularities in mixture models and upper bounds of stochastic complexity.
Neural Netw. 16(7), 1029–1038 (2003)
68. Lin, W., Duruisseaux, V., Leok, M., Nielsen, F., Khan, M.E., Schmidt, M.: Simplifying momentum-
based positive-definite submanifold optimization with applications to deep learning. In: International
Conference on Machine Learning, pp. 21026–21050 (2023). PMLR
69. Blier, L., Ollivier, Y.: The description length of deep learning models. In: Advances in Neural Infor-
mation Processing Systems 31, pp. 2216–2226. Curran Associates, Inc., NY 12571, USA (2018)
70. Gaier, A., Ha, D.: Weight agnostic neural networks. In: Advances in Neural Information Processing
Systems 32, pp. 5365–5379. Curran Associates, Inc., NY 12571, USA (2019)
71. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In:
Advances in Neural Information Processing Systems 29, pp. 4107–4115. Curran Associates, Inc., NY
12571, USA (2016)
72. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: Model compression and acceleration for deep neural net-
works: the principles, progress, and challenges. IEEE Signal Process. Mag. 35(1), 126–136 (2018)
73. Neyshabur, B., Bhojanapalli, S., Mcallester, D., Srebro, N.: Exploring generalization in deep learning.
In: Advances in Neural Information Processing Systems 30, pp. 5947–5956. Curran Associates, Inc.,
NY 12571, USA (2017)
74. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethink-
ing generalization. In: International Conference on Learning Representations (ICLR) (2017)
75. Valle-Pérez, G., Camargo, C.Q., Louis, A.A.: Deep learning generalizes because the parameter-function
map is biased towards simple functions. In: International Conference on Learning Representations
(ICLR) (2019)
76. Liang, T., Poggio, T., Rakhlin, A., Stokes, J.: Fisher-Rao metric, geometry, and complexity of neural
networks. In: International Conference on Artificial Intelligence and Statistics. Proceedings of Machine
Learning Research, vol. 89, pp. 888–896 (2019)
77. Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., Sohl-Dickstein, J.: On the expressive power of deep
neural networks. In: International Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 70, pp. 2847–2854 (2017)
78. Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Comput. 9(1), 1–42 (1997)
79. Dinh, L., Pascanu, R., Bengio, S., Bengio, Y.: Sharp minima can generalize for deep nets. In: Inter-
national Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp.
1019–1028 (2017)
80. Pennington, J., Schoenholz, S., Ganguli, S.: The emergence of spectral universality in deep networks.
In: International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning
Research, vol. 84, pp. 1924–1932 (2018)
81. Pennington, J., Bahri, Y.: Geometry of neural network loss surfaces via random matrix theory. In:
International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70,
pp. 2798–2806 (2017)
82. Hayase, T., Karakida, R.: The spectrum of Fisher information of deep networks achieving dynamical
isometry. In: International Conference on Artificial Intelligence and Statistics, pp. 334–342 (2021)
83. Papyan, V.: Traces of class/cross-class structure pervade deep learning spectra. J. Mach. Learn. Res.
21(252), 1–64 (2020)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.