Loss Landscapes and Optimization in Over-Parameterized Non-Linear Systems and Neural Networks

Department of Computer Science and Engineering, University of California, San Diego
Halicioğlu Data Science Institute, University of California, San Diego
Abstract
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-
based optimization methods applied to large neural networks. The purpose of this work is to propose
a modern view and a general mathematical framework for loss landscapes and efficient optimization
in over-parameterized machine learning models and systems of non-linear equations, a setting that
includes over-parameterized deep neural networks. Our starting observation is that optimization
problems corresponding to such systems are generally not convex, even locally. We argue that instead
they satisfy PL∗ , a variant of the Polyak-Lojasiewicz condition on most (but not all) of the parameter
space, which guarantees both the existence of solutions and efficient optimization by (stochastic)
gradient descent (SGD/GD). The PL∗ condition of these systems is closely related to the condition
number of the tangent kernel associated to the non-linear system, showing how a PL∗-based non-linear
theory parallels classical analyses of over-parameterized linear equations. We show that wide neural
networks satisfy the PL∗ condition, which explains the (S)GD convergence to a global minimum.
Finally, we propose a relaxation of the PL∗ condition applicable to “almost” over-parameterized
systems.
1 Introduction
A singular feature of modern machine learning is a large number of trainable model parameters. Just in
the last few years we have seen state-of-the-art models grow from tens or hundreds of millions of parameters
to much larger systems with hundreds of billions [6] or even trillions of parameters [13]. Invariably these models
are trained by gradient descent based methods, such as Stochastic Gradient Descent (SGD) or Adam [18].
Why are these local gradient methods so effective in optimizing complex highly non-convex systems? In
the past few years an emerging understanding of gradient-based methods has started to focus on the
insight that the optimization dynamics of “modern” over-parameterized models, with more parameters than
constraints, are very different from those of “classical” models, where the number of constraints exceeds
the number of parameters.
The goal of this paper is to provide a modern view of the optimization landscapes, isolating key
mathematical and conceptual elements that are essential for an optimization theory of over-parameterized
models.
We start by characterizing a key difference between under-parameterized and over-parameterized
landscapes. While both are generally non-convex, the nature of the non-convexity is rather different:
under-parameterized landscapes are (generically) locally convex in a sufficiently small neighborhood of
a local minimum. Thus classical analyses apply, if only locally, sufficiently close to a minimum. In
contrast, over-parameterized systems are “essentially” non-convex: they are not convex even in arbitrarily small
neighbourhoods around global minima.
Figure 1: Loss landscapes. Panel (a): the loss landscape of under-parameterized models is locally convex at local minima. Panel (b): the loss landscape of over-parameterized models is incompatible with local convexity, as the set of global minima is not locally linear.
Thus, we cannot expect the extensive theory of convexity-based analyses to apply to such over-
parameterized problems. In contrast, we argue that these systems typically satisfy the Polyak-Lojasiewicz
condition, or more precisely, its slightly modified variant – PL∗ condition, on most (but not all) of the
parameter space. This condition ensures existence of solutions and convergence of GD and SGD, if it
holds in a ball of sufficient radius. Importantly, we show that sufficiently wide neural networks satisfy
the PL∗ condition around their initialization point, thus guaranteeing convergence. In addition, we show
how the PL∗ condition can be relaxed without significantly changing the analysis. We conjecture that
many large systems behave as if they were over-parameterized along the stretch of their optimization path
from initialization to the early stopping point.
In a typical supervised learning task, given a training dataset of size n, D = {x_i, y_i}_{i=1}^n, x_i ∈ R^d, y_i ∈ R,
and a parametric family of models f(w; x), w ∈ R^m, e.g., a neural network, one aims to find a model with parameter
vector w∗ that fits the training data, i.e.,

f(w∗; x_i) ≈ y_i,   i = 1, 2, . . . , n.    (1)

Defining the map F(w) = (f(w; x_1), . . . , f(w; x_n)) and the label vector y = (y_1, . . . , y_n), this amounts to solving, exactly or approximately, the system of non-linear equations

F(w) = y,    (2)

which is typically done by minimizing a loss function, such as the square loss L(w) = ½‖F(w) − y‖²,
constructed so that the solutions of Eq.(2) are global minimizers of L(w). This is a non-linear least
squares problem, which is well-studied under classical under-parameterized settings (see [28], Chapter
10). An exact solution of Eq.(2) corresponds to interpolation, where a predictor fits the data exactly.
As we discuss below, for over-parameterized systems (m > n), we expect exact solutions to exist.
Footnote 1: The same setup works for multiple outputs. For example, for multi-class classification problems both the prediction f(w; x_i) and the label y_i (a one-hot vector) are c-dimensional vectors, where c is the number of classes; in this case we are in fact solving n × c equations. Similarly, for multi-output regression with c outputs, we have n × c equations.

Essential non-convexity. Our starting point, discussed in detail in Section 3, is the observation that
the loss landscape of an over-parameterized system is generally not convex in any neighborhood of any
global minimizer. This is different from the case of under-parameterized systems, where the loss landscape
is globally not convex, but still typically locally convex, in a sufficiently small neighbourhood of a (locally
unique) minimizer. In contrast, the set of solutions of over-parameterized systems is generically a manifold
of positive dimension [10] (and indeed, sufficiently large systems have no non-global minima [21, 27, 37]).
Unless the solution manifold is linear (which is not generally the case) the landscape cannot be locally
convex. The contrast between over- and under-parameterization is illustrated pictorially in Fig. 1.
The non-zero curvature of the curve of global minimizers at the bottom of the valley in Fig 1(b)
indicates the essential non-convexity of the landscape. In contrast, an under-parameterized landscape
generally has multiple isolated local minima with positive definite Hessian of the loss, where the function
is locally convex. Thus we conclude that
Convexity is not the right framework for analysis of over-parameterized systems, even locally.
Without assistance from local convexity, what alternative mathematical framework can be used to
analyze loss landscapes and optimization behavior of non-linear over-parameterized systems?
In this paper we argue that such a general framework is provided by the Polyak-Lojasiewicz condi-
tion [32, 24], or, more specifically, by its variant that we call PL∗ condition (also called local PL condition
in [29]). We say that a non-negative function L satisfies the µ-PL∗ condition on a set S ⊂ R^m, for µ > 0, if

½ ‖∇L(w)‖² ≥ µ L(w),   ∀w ∈ S.    (3)
We will now outline some key reasons why PL∗ condition provides a general framework for analyzing
over-parameterized systems. In particular, we show that it is satisfied by the loss functions of sufficiently
wide neural networks, even though these loss functions are certainly non-convex.
PL∗ condition =⇒ existence of solutions and exponential convergence of (S)GD. The first
key point (see Section 5), is that if L satisfies the µ-PL∗ condition in a ball of radius O(1/µ) then L has
a global minimum in that ball (corresponding to a solution of Eq.(2)). Furthermore, (S)GD initialized
at the center of such a ball2 converges exponentially to a global minimum of L. Thus to establish both
existence of solutions to Eq.(2) and efficient optimization, it is sufficient to verify the PL∗ condition in a
ball of a certain radius.
Analytic form of the PL∗ condition via the spectrum of the tangent kernel. Let DF(w) be the
differential of the map F at w, viewed as an n × m matrix. The tangent kernel of F is defined as the n × n
matrix

K(w) = DF(w) DF(w)^T.

It is clear that K(w) is a positive semi-definite matrix. It can be seen (Section 4) that the square loss L is
µ-PL∗ at w, where

µ = λ_min(K(w))    (4)
is the smallest eigenvalue of the kernel matrix. Thus verification of the PL∗ condition reduces to analyzing
the spectrum of the tangent kernel matrix associated to F.
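To make this concrete, the following minimal numerical sketch (our illustration, not an experiment from the paper) builds a toy over-parameterized map F, forms the tangent kernel K(w) = DF(w) DF(w)^T, and checks the inequality ½‖∇L(w)‖² ≥ λ_min(K(w)) L(w) for the square loss at random parameter points. The map, sizes, and data are arbitrary choices.

```python
import numpy as np

# Minimal sketch (not from the paper): a toy over-parameterized map
# F: R^m -> R^n with m > n, its tangent kernel K(w) = DF(w) DF(w)^T,
# and a numerical check of the PL* inequality (1/2)||grad L||^2 >= mu * L
# with mu = lambda_min(K(w)) for the square loss.

rng = np.random.default_rng(0)
m, n = 20, 5                      # parameters vs. equations (over-parameterized)
A = rng.normal(size=(n, m))       # fixed weights defining the toy non-linearity
y = rng.normal(size=n)            # target vector

def F(w):                         # toy non-linear system F(w)
    return np.tanh(A @ w)

def DF(w):                        # Jacobian of F, an n x m matrix
    return (1.0 - np.tanh(A @ w) ** 2)[:, None] * A

for _ in range(5):
    w = rng.normal(size=m)
    K = DF(w) @ DF(w).T           # tangent kernel, n x n
    mu = np.linalg.eigvalsh(K).min()
    L = 0.5 * np.sum((F(w) - y) ** 2)          # square loss
    grad = DF(w).T @ (F(w) - y)                # grad L(w) = DF(w)^T (F(w) - y)
    print(f"lambda_min(K) = {mu:.4f},  "
          f"(1/2)||grad L||^2 / L = {0.5 * grad @ grad / L:.4f}")
```

The printed ratio is never below λ_min(K), illustrating why the smallest tangent-kernel eigenvalue serves as the PL∗ constant µ.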
It is worthwhile to compare this to the standard analytical condition for convexity, requiring that the
Hessian of the loss function, HL , is positive definite. While these spectral conditions appear to be similar,
the similarity is superficial as K(w) contains first order derivatives, while the Hessian is a second order
operator. Hence, as we discuss below, the tangent kernel and the Hessian have very different properties
under coordinate transformations and in other settings.
Why PL∗ holds across most of the parameter space for over-parameterized systems. We will
now discuss the intuition for why µ-PL∗, for a sufficiently small µ, should be expected to hold on a domain
containing most (but not all) of the parameter space R^m. The intuition is based on simple parameter
counting. For the simplest example consider the case n = 1. In that case the tangent kernel is a scalar
and K(w) = ‖DF(w)‖² is singular if and only if ‖DF(w)‖ = 0. Thus, by parameter counting, we
expect the singular set {w : K(w) = 0} to have co-dimension m and thus, generically, to consist of isolated
points. Because of Eq.(4) we generally expect most points sufficiently removed from the singular set
to satisfy the PL∗ condition. By a similar parameter counting argument, the singular set of w, such that
K(w) is not full rank, will be of co-dimension m − n + 1. As long as m > n, we expect the surroundings of
the singular set, where the PL∗ condition does not hold, to be “small” compared to the totality of the
parameter space. Furthermore, the larger the degree of model over-parameterization (m − n) is, the
smaller the singular set is expected to be. This intuition is represented graphically in Fig. 2.

Figure 2: The loss function L(w) is µ-PL∗ inside the shaded domain. The singular set corresponds to
parameters w with degenerate tangent kernel K(w). Every ball of radius R = O(1/µ) within the shaded
set intersects the set of global minima of L(w), i.e., solutions to F(w) = y.

Footnote 2: The constant in O(1/µ) is different for GD and SGD.
Note that under-parameterized systems are always rank deficient and have λmin (K(w)) ≡ 0. Hence
such systems never satisfy PL∗ .
Wide neural networks satisfy the PL∗ condition. We show that wide neural networks, as special cases
of over-parameterized models, are PL∗, using the technical tools developed in Section 4. This property of
neural networks is a key step in understanding the success of gradient descent based optimization,
as seen in Section 5. To show the PL∗ condition for wide neural networks, a powerful, if somewhat
crude, tool is provided by the remarkable recent discovery [16] that tangent kernels (the so-called NTK) of
wide neural networks with linear output layer are nearly constant in a ball B of a certain radius with a
random center. More precisely, it can be shown that the norm of the Hessian tensor satisfies
‖H_F(w)‖ = O∗(1/√m), where m is the width of the neural network and w ∈ B [23].
Furthermore (see Section 4.1),

|λ_min(K(w)) − λ_min(K(w_0))| < O( sup_{w∈B} ‖H_F(w)‖ ) = O(1/√m).

Bounding the kernel eigenvalue at the initialization point λ_min(K(w_0)) from below (using the results
from [12, 11]) completes the analysis.
To prove the result for general neural networks (note that the Hessian norm is typically large for
such networks [23]) we observe that they can be decomposed as a composition of a network with a
linear output layer and a coordinate-wise non-linear transformation of the output. The PL∗ condition is
preserved under well-behaved non-linear transformations which yields the result.
Relaxing the PL∗ condition: when the systems are almost over-parameterized. Finally, a framework for understanding large systems cannot be complete without addressing the question of transition between under-parameterized and over-parameterized systems. While neural network models such as those in [6, 13] have billions or trillions of parameters, they are often trained on very large datasets and thus have a comparable number of constraints. Thus in the process of optimization they may initially appear over-parameterized, while no truly interpolating solution may exist. Such an optimization landscape is shown in Fig. 3. While an in-depth analysis of this question is beyond the scope of this work, we propose a version of the PL∗ condition that can account for such behavior by postulating that the PL∗ condition is satisfied for values of L(w) above a certain threshold and that the optimization path from the initialization to the early stopping point lies in the PL∗ domain. Approximate convergence of (S)GD can be shown under this condition, see Section 6.

Figure 3: The loss landscape of “almost over-parameterized” systems. The landscape looks over-parameterized except for the grey area where the loss is small. Local minima of the loss are contained there.
respectively. For vectors, we use ‖·‖ to denote the Euclidean norm and ‖·‖_∞ for the ∞-norm. For
matrices, we use ‖·‖_F to denote the Frobenius norm and ‖·‖_2 to denote the spectral norm (i.e., 2-norm).
We use DF to represent the differential map of F : R^m → R^n. Note that DF is represented as an n × m
matrix, with (DF)_{ij} := ∂F_i/∂w_j. We denote the Hessian of the function F by H_F, which is an n × m × m tensor
with (H_F)_{ijk} = ∂²F_i/(∂w_j ∂w_k), and define the norm of the Hessian tensor to be the maximum of the spectral
norms of its Hessian components, i.e., ‖H_F‖ = max_{i∈[n]} ‖H_{F_i}‖_2, where H_{F_i} = ∂²F_i/∂w². When
necessary, we also assume that H_F is continuous. We also denote the Hessian matrix of the loss function
by H_L := ∂²L/∂w², which is an m × m matrix. We denote the smallest eigenvalue of a matrix K by
λ_min(K).
In this paper, we consider the problem of solving a system of equations of the form Eq.(2), i.e. finding
w, such that F(w) = y. This problem is solved by minimizing a loss function L(F(w), y), such as the
square loss L(w) = ½‖F(w) − y‖², with gradient-based algorithms. Specifically, the gradient descent
method starts from the initialization point w_0 and updates the parameters as follows:

w_{t+1} = w_t − η ∇L(w_t),   t = 0, 1, 2, . . . ,    (5)

where η > 0 is the step size.
Throughout this paper, we assume the map F is Lipschitz continuous and smooth. See the definitions
below:

Definition 1 (Lipschitz continuity). A map F : R^m → R^n is L_F-Lipschitz continuous if

‖F(w) − F(v)‖ ≤ L_F ‖w − v‖,   ∀w, v ∈ R^m.    (6)

Similarly, F is β_F-smooth if

‖F(w) − F(v) − DF(v)(w − v)‖ ≤ (β_F/2) ‖w − v‖²,   ∀w, v ∈ R^m.    (7)
the solutions are isolated. A specific result for wide neural networks, showing non-existence of isolated
global minima of the loss function is given in Proposition 6 (Appendix A).
It is important to note that such continuous manifolds of solutions generically have non-zero curvature³,
due to the non-linear nature of the system of equations. This results in the lack of local convexity
of the loss landscape, i.e., the loss landscape is non-convex in any neighborhood of a solution (i.e., a global
minimizer of L). This can be seen from the following argument. Suppose that the landscape
of L is convex within a ball B that intersects the set of global minimizers M. The minimizers of a convex
function within a convex domain form a convex set, thus M ∩ B must also be convex. Hence M ∩ B
must be a convex subset of a lower-dimensional linear subspace of R^m and cannot have curvature
(either extrinsic or intrinsic). This geometric intuition is illustrated in Fig. 1b, where the set of global
minimizers is of dimension one.
To provide an alternative analytical intuition (leading to a precise argument), consider an over-
parameterized system F(w) : R^m → R^n, where m > n, with the square loss function L(F(w), y) =
½‖F(w) − y‖². The Hessian matrix of the loss function takes the form

H_L(w) = DF(w)^T (∂²L/∂F²)(w) DF(w) + Σ_{i=1}^n (F(w) − y)_i H_{F_i}(w),    (9)

where the first term is denoted A(w), the second term is denoted B(w), and H_{F_i}(w) ∈ R^{m×m} is the Hessian matrix of the i-th output of F with respect to w.
Let w∗ be a solution to Eq.(2). Since w∗ is a global minimizer of L, B(w∗ ) = 0. We note that A(w∗ )
is a positive semi-definite matrix of rank at most n with at least m − n zero eigenvalues.
Yet, in a neighbourhood of w∗ there typically are points where B(w) has rank m. As we show below,
this observation, together with a mild technical assumption on the loss, implies that HL (w) cannot be
positive semi-definite in any ball around w∗ and hence is not locally convex. To see why this is the case,
consider an example of a system with only one equation (n = 1). The loss function takes the form
L(w) = ½(F(w) − y)², y ∈ R, and the Hessian of the loss function can be written as

H_L(w) = DF(w)^T DF(w) + (F(w) − y) H_F(w).    (10)
Let w∗ be a solution, F(w∗ ) = y and suppose that DF(w∗ ) does not vanish. In the neighborhood of w∗ ,
there exist arbitrarily close points w∗ + δ and w∗ − δ, such that F(w∗ + δ) − y > 0 and F(w∗ − δ) −y < 0.
Assuming that the rank of H_F(w∗) is greater than one, and noting that the rank of DF(w)^T DF(w)
is at most one, it is easy to see that either H_L(w∗ + δ) or H_L(w∗ − δ) must have a negative eigenvalue,
which rules out local convexity at w∗.
A more general version of this argument is given in the following:
Proposition 2 (Local non-convexity). Let L(w∗) = 0 and, furthermore, assume that DF(w∗) ≠ 0 and
rank(H_{F_i}(w∗)) > 2n for at least one i ∈ [n]. Then L(w) is not convex in any neighborhood of w∗.

Remark 2. Note that in general we do not expect DF to vanish at w∗, as there is no reason why a
solution to Eq.(2) should be a critical point of F. For a general loss L(w), the assumption in Proposition 2
that DF(w∗) ≠ 0 is replaced by (d/dw)(∂L/∂F)(w∗) ≠ 0.
A full proof of Prop. 2 for a general loss function can be found in Appendix B.
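The following toy computation (our illustration, not from the paper) makes the n = 1 argument above concrete for the hypothetical system F(w) = w_1 w_2 with target y = 1: arbitrarily close to the global minimizer w∗ = (1, 1), the Hessian of the square loss has a negative eigenvalue.

```python
import numpy as np

# Toy illustration (not from the paper) of essential non-convexity: for the
# over-parameterized system F(w) = w1*w2 = 1 (n = 1, m = 2), the loss
# L(w) = (1/2)(w1*w2 - 1)^2 has Hessian
#   H_L(w) = DF(w)^T DF(w) + (F(w) - 1) * H_F(w),
# which acquires a negative eigenvalue arbitrarily close to the global
# minimizer w* = (1, 1), so L is not convex in any neighborhood of w*.

def hessian_L(w):
    w1, w2 = w
    DF = np.array([[w2, w1]])                 # 1 x 2 Jacobian of F
    H_F = np.array([[0.0, 1.0], [1.0, 0.0]])  # Hessian of F(w) = w1*w2
    return DF.T @ DF + (w1 * w2 - 1.0) * H_F

for eps in [1e-1, 1e-2, 1e-3]:
    w_plus = np.array([1.0 + eps, 1.0])       # a point just off the solution set
    eigs = np.linalg.eigvalsh(hessian_L(w_plus))
    print(f"eps = {eps:g}: eigenvalues of H_L = {eigs}")  # smallest one is negative
```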
Footnote 3: Both extrinsic (the second fundamental form) and intrinsic (Gaussian) curvature.
4 Over-parameterized non-linear systems are PL∗ on most of
the parameter space
In this section we argue that loss landscapes of over-parameterized systems satisfy the PL∗ condition
across most of their parameter space.
We begin with a sufficient analytical condition, uniform conditioning of a system, which is closely related to
PL∗. We say that the system F is µ-uniformly conditioned on a set S ⊂ R^m, for µ > 0, if the tangent kernel satisfies

λ_min(K(w)) ≥ µ,   ∀w ∈ S.    (11)
The following important connection shows that uniform conditioning of the system is sufficient for the
corresponding square loss landscape to satisfy the PL∗ condition.
Theorem 1 (Uniform conditioning ⇒ PL∗ condition). If F(w) is µ-uniformly conditioned, on a set
S ⊂ Rm , then the square loss function L(w) = 21 kF(w) − yk2 satisfies µ-PL∗ condition on S.
Proof. For the square loss,

½ ‖∇L(w)‖² = ½ (F(w) − y)^T K(w) (F(w) − y) ≥ ½ λ_min(K(w)) ‖F(w) − y‖² = λ_min(K(w)) L(w) ≥ µ L(w).
We will now provide some intuition for why we expect λ_min(K(w)) to be separated from zero over
most, but not all, of the optimization landscape. The key observation is that the tangent kernel K(w) is
degenerate exactly when DF(w) is rank deficient. Note that K(w) is an n × n positive semi-definite matrix
by definition. Hence the singular set S_sing, where the tangent kernel is degenerate, can be written as

S_sing := {w ∈ R^m : λ_min(K(w)) = 0} = {w ∈ R^m : rank DF(w) < n}.
Generically, rank DF(w) = min(m, n). Thus for over-parameterized systems, when m > n, we expect
Ssing to have positive codimension and to be a set of measure zero. In contrast, in under-parametrized
settings, m < n and the tangent kernel is always degenerate, λmin (K(w)) ≡ 0 so such systems cannot be
uniformly conditioned according to the definition above. Furthermore, while Eq.(11) provides a sufficient
condition, it’s also in a certain sense necessary:
Proposition 3. If λmin (K(w0 )) = 0 then the system F(w) = y cannot satisfy PL∗ condition for all y
on any set S containing w0 .
Proof. Since λmin (K(w0 )) = 0, the matrix K(w0 ) has a non-trivial null-space. Therefore we can choose
y so that K(w0 )(F(w0 ) − y) = 0 and F(w0 ) − y 6= 0. We have
½ ‖∇L(w_0)‖² = ½ (F(w_0) − y)^T K(w_0) (F(w_0) − y) = 0,
and hence the PL∗ condition cannot be satisfied at w0 .
Thus we see that only over-parameterized systems can be PL∗ , if we want that condition to hold for
any label vector y.
By parameter counting, it is easy to see that the codimension of the singular set Ssing is expected to
be m − n + 1. Thus, on a compact set, for µ sufficiently small, points which are not µ-PL∗ will be found
around S_sing, a low-dimensional subset of R^m. This is represented graphically in Fig. 4. Note that the
more over-parameterization we have, the larger this codimension is expected to be.

Figure 4: The µ-PL∗ domain and the singular set. We expect points away from the singular set to satisfy
the µ-PL∗ condition for sufficiently small µ.
To see a particularly simple example of this phenomenon, consider the case when DF(w) is a random
matrix with Gaussian entries⁴. Denote by

κ = λ_max(K(w)) / λ_min(K(w))

the condition number of K(w). Note that by definition κ ≥ 1. It is shown in [9] that (assuming m > n)

E(log κ) < 2 log( m / (m − n + 1) ) + 5.
We see that as the amount of over-parameterization increases, κ becomes, in expectation (and, as can also be
shown, with high probability), a small constant. While this example is rather special, it is representative
of the concept that over-parameterization helps with conditioning. In particular, as we discuss below in
Section 4.2, wide neural networks satisfy the PL∗ with high probability within a random ball of a constant
radius.
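The following short numerical sketch (our illustration, not from the paper) samples Gaussian Jacobians and reports the observed mean log condition number of K = DF DF^T against the bound quoted above; the sizes and number of trials are arbitrary choices.

```python
import numpy as np

# Numerical sketch (our illustration, not from the paper): for a random n x m
# Gaussian matrix DF, the kernel K = DF DF^T becomes better conditioned as m
# grows relative to n, in line with the bound E(log kappa) quoted from [9].

rng = np.random.default_rng(0)
n, trials = 50, 20

for m in [55, 100, 500, 5000]:
    log_kappas = []
    for _ in range(trials):
        DF = rng.normal(size=(n, m))
        eigs = np.linalg.eigvalsh(DF @ DF.T)        # spectrum of the tangent kernel
        log_kappas.append(np.log(eigs.max() / eigs.min()))
    bound = 2 * np.log(m / (m - n + 1)) + 5
    print(f"m = {m:5d}:  mean log(kappa) = {np.mean(log_kappas):6.2f}   "
          f"(bound quoted from [9]: {bound:5.2f})")
```

As m grows with n fixed, the observed log condition number shrinks toward a small constant, which is the over-parameterization effect described in the text.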
We will now provide techniques for proving that the PL∗ condition holds for specific systems and later
in Section 4.2 we will show how these techniques apply to wide neural networks, demonstrating that they
are PL∗ . In Section 5 we discuss the implications of the PL∗ condition for the existence of solutions and
convergence of (S)GD, in particular for deep neural networks.
networks with linear output layer have small Hessian norm. For some intuition on why this appears to
be a feature of many large systems see Appendix C.
The second approach to demonstrating conditioning is by noticing that it is preserved under well-
behaved transformations of the input or output or when composing certain models.
Combining these methods yields a general result on PL∗ condition for wide neural networks in Sec-
tion 4.2.
Uniform conditioning through Hessian spectral norm. We will now show that controlling the
norm of the Hessian tensor for the map F leads to well-conditioned systems. The idea of the analysis is
that the change of the tangent kernel of F(w) can be bounded in terms of the norm of the Hessian of
F(w). Intuitively, this is analogous to the mean value theorem, bounding the first derivative of F by its
second derivative. If the Hessian norm is sufficiently small, the change of the tangent kernel and hence
its conditioning can be controlled in a ball B(w0 , R) with a finite radius, as long as the tangent kernel
matrix at the center point K(w0 ) is well-conditioned.
Theorem 2 (PL∗ condition via Hessian norm). Given a point w_0 ∈ R^m, suppose the tangent kernel
matrix K(w_0) is strictly positive definite, i.e., λ_0 := λ_min(K(w_0)) > 0. If the Hessian spectral norm satisfies

‖H_F‖ ≤ (λ_0 − µ) / (2 L_F √n R)

within the ball B(w_0, R) for some R > 0 and µ > 0, then the tangent kernel
K(w) is µ-uniformly conditioned in the ball B(w_0, R). Hence, the square loss L(w) satisfies the µ-PL∗
condition in B(w_0, R).
Proof. First, let us bound the difference between the tangent kernel matrices at w ∈ B(w_0, R) and at
w_0. By the assumption, we have, for all v ∈ B(w_0, R), ‖H_F(v)‖ < (λ_0 − µ)/(2 L_F √n R). Hence, for each i ∈ [n],
‖H_{F_i}(w)‖_2 < (λ_0 − µ)/(2 L_F √n R). Now, consider an arbitrary point w ∈ B(w_0, R). For all i ∈ [n], we have

DF_i(w) = DF_i(w_0) + ∫_0^1 H_{F_i}(w_0 + τ(w − w_0)) (w − w_0) dτ.    (12)

Since τ ∈ [0, 1], the point w_0 + τ(w − w_0) lies on the line segment S(w_0, w) between w_0 and w, which is inside
the ball B(w_0, R). Hence,

‖DF_i(w) − DF_i(w_0)‖ ≤ sup_{τ∈[0,1]} ‖H_{F_i}(w_0 + τ(w − w_0))‖_2 · ‖w − w_0‖ ≤ (λ_0 − µ)/(2 L_F √n R) · R = (λ_0 − µ)/(2 L_F √n).

In the second inequality above, we used the fact that ‖H_{F_i}‖_2 < (λ_0 − µ)/(2 L_F √n R) in the ball B(w_0, R). Hence,

‖DF(w) − DF(w_0)‖_F = sqrt( Σ_{i∈[n]} ‖DF_i(w) − DF_i(w_0)‖² ) ≤ √n · (λ_0 − µ)/(2 L_F √n) = (λ_0 − µ)/(2 L_F).

Below we also use the L_F-Lipschitz continuity of F (which gives ‖DF‖_2 ≤ L_F) and the fact that ‖A‖_2 ≤ ‖A‖_F
for a matrix A.
By the triangle inequality, we have, at any point w ∈ B(w_0, R),

λ_min(K(w)) ≥ λ_min(K(w_0)) − ‖K(w) − K(w_0)‖_2 ≥ λ_0 − 2 L_F ‖DF(w) − DF(w_0)‖_F ≥ λ_0 − (λ_0 − µ) = µ.

Hence, the tangent kernel is µ-uniformly conditioned in the ball B(w_0, R).
By Theorem 1, we immediately have that the square loss L(w) satisfies µ-PL∗ condition in the ball
B(w0 , R).
Below, in Section 4.2, we will see that wide neural networks with a linear output layer have small Hessian
norm (Theorem 5), so that Theorem 2 applies. An illustration for a class of large models is given in Appendix C.
Conditioning of transformed systems. We now discuss why the conditioning of a system F(w) = y
is preserved under a transformations of the domain or range of F, as long as the original system is well-
conditioned and the transformation has a bounded inverse.
Remark 3. Note that even if the original system has a small Hessian norm, there is no such guarantee
for the transformed system.
Consider a transformation Φ : R^n → R^n that, composed with F, results in a new transformed system
Φ ∘ F(w) = y. Put

ρ := inf_{w∈B(w_0,R)} 1 / ‖J_Φ^{−1}(w)‖_2,    (14)

where J_Φ(w) := J_Φ(F(w)) is the Jacobian of Φ evaluated at F(w). We will assume that ρ > 0.
Theorem 3. If a system F is µ-uniformly conditioned in a ball B(w_0, R) with R > 0, then the transformed
system Φ ∘ F(w) is µρ²-uniformly conditioned in B(w_0, R). Hence, the square loss function
½‖Φ ∘ F(w) − y‖² satisfies the µρ²-PL∗ condition in B(w_0, R).
Proof. The tangent kernel of the transformed system is

K_{Φ∘F}(w) = ∇(Φ ∘ F)(w) ∇(Φ ∘ F)(w)^T = J_Φ(w) K_F(w) J_Φ(w)^T.

Hence, if F(w) is µ-uniformly conditioned in B(w_0, R), i.e., λ_min(K_F(w)) ≥ µ, we have for any v ∈ R^n
with ‖v‖ = 1,

v^T K_{Φ∘F}(w) v = (J_Φ(w)^T v)^T K_F(w) (J_Φ(w)^T v) ≥ µ ‖J_Φ(w)^T v‖² ≥ µρ²,

so that λ_min(K_{Φ∘F}(w)) ≥ µρ².

A similar statement holds for transformations Ψ of the parameter domain: if F is µ-uniformly conditioned with respect to Ψ(w) in B(w_0, R), then an analysis similar to Theorem 3
shows that F ∘ Ψ is also µρ²-uniformly conditioned in B(w_0, R).
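A minimal numerical check (our illustration, not from the paper) of this statement, using a random Jacobian for F and an arbitrary smooth coordinate-wise transformation Φ of our own choosing; it verifies λ_min(J_Φ K_F J_Φ^T) ≥ ρ² λ_min(K_F).

```python
import numpy as np

# Minimal numerical sketch (our illustration) of Theorem 3: composing a
# well-conditioned system F with an output transformation Phi whose Jacobian
# has a bounded inverse rescales the tangent kernel as
#   K_{Phi o F}(w) = J_Phi K_F(w) J_Phi^T,
# and lambda_min(K_{Phi o F}) >= rho^2 * lambda_min(K_F),
# with rho = 1 / ||J_Phi^{-1}||_2.

rng = np.random.default_rng(1)
n, m = 5, 30
DF = rng.normal(size=(n, m))              # Jacobian of F at some point w
K_F = DF @ DF.T                           # tangent kernel of F

# A coordinate-wise output transformation Phi(z)_i = z_i + 0.5*sin(z_i):
# its Jacobian at F(w) is diagonal with entries 1 + 0.5*cos(F_i(w)).
z = rng.normal(size=n)                    # stands in for F(w)
J_Phi = np.diag(1.0 + 0.5 * np.cos(z))
rho = 1.0 / np.linalg.norm(np.linalg.inv(J_Phi), 2)

K_comp = J_Phi @ K_F @ J_Phi.T            # tangent kernel of Phi o F
lam_F = np.linalg.eigvalsh(K_F).min()
lam_comp = np.linalg.eigvalsh(K_comp).min()
print(f"lambda_min(K_F)          = {lam_F:.4f}")
print(f"lambda_min(K_{{Phi o F}})  = {lam_comp:.4f}  >=  rho^2 * lambda_min(K_F) "
      f"= {rho**2 * lam_F:.4f}")
```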
Conditioning of composition models. Although composing different large models often leads to
non-constant tangent kernels, the corresponding tangent kernels can also be uniformly conditioned, under
certain conditions. Consider the composition of two models h := g ∘ f, where f : R^d → R^{d′} and
g : R^{d′} → R^{d″}. Denote by w_f and w_g the parameters of models f and g, respectively. Then the
parameters of the composition model h are w := (w_g, w_f). Examples of composition models include
“bottleneck” neural networks, where the modules below or above the bottleneck layer can be considered
as the composing (sub-)models.
Let’s denote the tangent kernel matrices of models g and f by Kg (wg ; Z) and Kf (wf ; X ) respectively,
where the second arguments, Z and X , are the datasets that the tangent kernel matrices are evaluated
on. Given a dataset D = {(xi , yi )}ni=1 , denote f (D) as {(f (xi ), yi )}ni=1 .
Proposition 4. Consider the composition model h = g ◦ f with parameters w = (wg , wf ). Given a
dataset D, the tangent kernel matrix of h takes the form:
α^(0) = x,
α^(l) = σ_l( (1/√m_{l−1}) W^(l) α^(l−1) ),   ∀l = 1, 2, · · · , L + 1,
f(W; x) = α^(L+1).    (17)
Here, m_l is the width (i.e., number of neurons) of the l-th layer, α^(l) ∈ R^{m_l} denotes the vector of the l-th
hidden layer neurons, W := {W^(1), W^(2), . . . , W^(L), W^(L+1)} denotes the collection of the parameters (or
weights) W^(l) ∈ R^{m_l×m_{l−1}} of each layer, and σ_l is the activation function of the l-th layer, e.g., sigmoid,
tanh, or linear activation. We also denote the width of the neural network as m := min_{l∈[L]} m_l, i.e., the
minimal width of the hidden layers. In the following analysis, we assume that the activation functions σ_l
are twice differentiable. Although this assumption excludes ReLU, we believe the same results also
apply when the hidden-layer activation functions are ReLU.
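The sketch below (our illustration, not from the paper; tanh hidden activations and a linear output layer are our choices) implements the forward pass of Eq.(17) and shows that the 1/√m scaling keeps the output at a random initialization of order O(1) as the width grows.

```python
import numpy as np

# A minimal sketch (our illustration) of the network in Eq.(17): each layer
# applies a Gaussian-initialized weight matrix scaled by 1/sqrt(m_{l-1})
# followed by an activation (here tanh), with a linear output layer. Under
# this parameterization the output at a random initialization stays O(1)
# as the width grows.

rng = np.random.default_rng(0)

def forward(Ws, x):
    alpha = x
    for W in Ws[:-1]:
        alpha = np.tanh(W @ alpha / np.sqrt(W.shape[1]))   # hidden layers, Eq.(17)
    W_out = Ws[-1]
    return W_out @ alpha / np.sqrt(W_out.shape[1])         # linear output layer

d, L = 10, 3                                   # input dimension, hidden layers
x = rng.normal(size=d)
for m in [10, 100, 1000, 10000]:
    widths = [d] + [m] * L + [1]               # m_0 = d, hidden widths m, output 1
    Ws = [rng.normal(size=(widths[l + 1], widths[l])) for l in range(L + 1)]
    print(f"width m = {m:6d}:  f(W_0; x) = {forward(Ws, x).item():+.4f}")
```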
Remark 5. The above definition of neural networks does not include convolutional (CNN) and residual
(ResNet) neural networks. In Appendix E, we show that both CNN and ResNet also satisfy the PL∗
condition. Please see the definitions and analysis there.
We study the loss landscape of wide neural networks in regions around randomly chosen points in
parameter space. Specifically, we consider the ball B(W_0, R), which has a fixed radius R > 0 (we will
see later, in Section 5, that R can be chosen to cover the whole optimization path) and is centered at a
random parameter point W_0, i.e., W_0^(l) ∼ N(0, I_{m_l×m_{l−1}}) for l ∈ [L + 1]. Note that such a random
parameter point W0 is a common choice to initialize a neural network. Importantly, the tangent kernel
matrix at W0 is generally strictly positive definite, i.e., λmin (K(W0 )) > 0. Indeed, this is proven
for infinitely wide networks as long as the training data is not degenerate (see Theorem 3.1 of [12]
and Proposition F.1 and F.2 of [11]). As for finite width networks, with high probability w.r.t. the
initialization randomness, its tangent kernel K(W0 ) is close to that of the infinite network and the
minimum eigenvalue λmin (K(W0 )) = O(1).
Using the techniques in Section 4.1, the following theorem shows that neural networks of sufficient
width satisfy the PL∗ condition in a ball of any fixed radius around W_0, as long as the tangent kernel
K(W_0) is strictly positive definite.
Theorem 4 (Wide neural networks satisfy the PL∗ condition). Consider the neural network f(W; x) in
Eq.(17), and a random parameter setting W_0 such that W_0^(l) ∼ N(0, I_{m_l×m_{l−1}}) for l ∈ [L + 1]. Suppose
that the last layer activation σ_{L+1} satisfies |σ′_{L+1}(z)| ≥ ρ > 0 and that λ_0 := λ_min(K(W_0)) > 0. For
any µ ∈ (0, λ_0 ρ²), if the width of the network satisfies

m = Ω̃( nR^{6L+2} / (λ_0 − µρ^{−2})² ),    (18)

then the µ-PL∗ condition holds for the square loss function in the ball B(W_0, R).
Remark 6. In fact, it is not necessary to require |σ′_{L+1}(z)| to be greater than ρ for all z. The theorem still
holds as long as |σ′_{L+1}(z)| > ρ is true for all z actually achieved by the output neuron before activation.
This theorem tells us that while the loss landscape of wide neural networks is nowhere convex (as seen
in Section 3), it can still be described by the PL∗ condition at most points, in line with our general
discussion.
Proof of Theorem 4. We divide the proof into two distinct steps based on representing an arbitrary
neural network as a composition of network with a linear output layer and an output non-linearity
σL+1 (·). In Step 1 we prove the PL∗ condition for the case of a network with a linear output layer (i.e.,
σL+1 (z) ≡ z). The argument relies on the fact that wide neural networks with linear output layer have
small Hessian norm in a ball around initialization. In Step 2 for general networks we observe that an
arbitrary neural network is simply a neural network with a linear output layer from Step 1 with output
transformed by applying σL+1 (z) coordinate-wise. We obtain the result by combining Theorem 3 with
Step 1.
Step 1. Linear output layer: σL+1 (z) ≡ z. In this case, ρ = 1 and the output layer of the network
has a linear form, i.e., a linear combination of the units from the last hidden layer.
As was shown in [23], for this type of network with sufficient width, the model Hessian matrix has
arbitrarily small spectral norm (a transition to linearity). This is formulated in the following theorem:
Theorem 5 (Theorem 3.2 of [23]: transition to linearity). Consider a neural network f(W; x) of the
form Eq.(17). Let m be the minimum of the hidden layer widths, i.e., m = min_{l∈[L]} m_l. Given any fixed
R > 0, and any W ∈ B(W_0, R) := {W : ‖W − W_0‖ ≤ R}, with high probability over the initialization,
the Hessian spectral norm satisfies the following:

‖H_f(W)‖ = Õ( R^{3L} / √m ).    (19)

In Eq.(19), we explicitly write out the dependence of the Hessian norm on the radius R, according to the
proof in [23]. Plugging Eq.(19) into the condition of Theorem 2, requiring the right-hand side to be at
most (λ_0 − µ)/(2 L_F √n R) and noticing that ρ = 1, we directly obtain the expression (18) for the width m.
Step 2. General networks: σ_{L+1}(·) is non-linear. Wide neural networks with a non-linear output
layer generally do not exhibit transition to linearity or a near-constant tangent kernel, as was shown in [23].
Despite that, these wide networks still satisfy the PL∗ condition in the ball B(W_0, R). Observe that
this type of network can be viewed as a composition of the non-linear transformation σ_{L+1} with
a network f̃ which has a linear output layer:

f(W; x) = σ_{L+1}( f̃(W; x) ).

By the same argument as in Step 1, we see that f̃, with the width as in Eq.(18), is µ/ρ²-uniformly
conditioned.

Now we apply our analysis for transformed systems in Theorem 3. In this case, the transformation
map Φ becomes the coordinate-wise transformation of the output given by

( Φ(z) )_i = σ_{L+1}(z_i),   i ∈ [n],

whose Jacobian is diagonal with entries σ′_{L+1}(z_i), so that the quantity ρ in Eq.(14) is bounded below by
inf_z |σ′_{L+1}(z)| ≥ ρ. Hence, the µ/ρ²-uniform conditioning of f̃ immediately implies µ-uniform conditioning of f, as desired.
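The following sketch (our illustration, not from the paper; a small two-layer tanh network with a linear output layer and toy data) shows the mechanism behind Theorem 4 numerically: λ_min(K(W_0)) is bounded away from zero at initialization, and the change of λ_min(K(·)) across a ball of fixed radius shrinks as the width m grows.

```python
import numpy as np

# Numerical sketch (our illustration) for a two-layer network with linear
# output layer, f(W, v; x) = (1/sqrt(m)) * v^T tanh(W x): the tangent kernel
# K evaluated on n data points stays close to K at initialization, and hence
# well conditioned, across a ball of fixed radius once the width m is large.

rng = np.random.default_rng(0)
d, n, R = 5, 8, 1.0
X = rng.normal(size=(n, d))                     # n training inputs

def jacobian(W, v, X):
    """Rows are gradients of f(W, v; x_i) w.r.t. all parameters (W, v)."""
    m = v.shape[0]
    pre = X @ W.T                               # n x m pre-activations
    act, dact = np.tanh(pre), 1.0 - np.tanh(pre) ** 2
    dW = (dact * v)[:, :, None] * X[:, None, :] / np.sqrt(m)   # n x m x d
    dv = act / np.sqrt(m)                                      # n x m
    return np.concatenate([dW.reshape(n, -1), dv], axis=1)

for m in [16, 128, 1024]:
    W0, v0 = rng.normal(size=(m, d)), rng.normal(size=m)
    K0 = jacobian(W0, v0, X) @ jacobian(W0, v0, X).T
    # perturb the parameters by a vector of norm R (staying inside B(W_0, R))
    delta = rng.normal(size=m * d + m)
    delta *= R / np.linalg.norm(delta)
    W1, v1 = W0 + delta[:m * d].reshape(m, d), v0 + delta[m * d:]
    K1 = jacobian(W1, v1, X) @ jacobian(W1, v1, X).T
    lam0, lam1 = np.linalg.eigvalsh(K0).min(), np.linalg.eigvalsh(K1).min()
    print(f"m = {m:5d}:  lambda_min(K(W_0)) = {lam0:.4f},  "
          f"|lambda_min(K(W)) - lambda_min(K(W_0))| = {abs(lam1 - lam0):.4f}")
```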
Remark 8. It is easy to see that the usual condition number κ(w) = λ_max(K(w))/λ_min(K(w)) of the
tangent kernel K(w) is upper bounded by κ_F(S).
Now, we are ready to present the optimization theory based on the PL∗ condition. First, let the
arbitrary set S be a Euclidean ball B(w_0, R) around the initialization w_0 of the gradient descent method,
with a reasonably large but finite radius R. The following theorem shows that satisfaction of the PL∗
condition on B(w0 , R) implies the existence of at least one global solution of the system in the same
ball B(w0 , R). Moreover, following the original argument from [32], the PL∗ condition also implies fast
convergence of gradient descent to a global solution w∗ in the ball B(w0 , R).
Theorem 6 (Local PL∗ condition ⇒ existence of a solution + fast convergence). Suppose the system F
is L_F-Lipschitz continuous and β_F-smooth, and that the square loss L(w) satisfies the µ-PL∗ condition in the
ball B(w_0, R) := {w ∈ R^m : ‖w − w_0‖ ≤ R} with R = 2L_F ‖F(w_0) − y‖ / µ. Then we have the following:

(a) Existence of a solution: There exists a solution (global minimizer of L) w∗ ∈ B(w_0, R), such that
F(w∗) = y.

(b) Convergence of GD: Gradient descent with a step size η ≤ 1/(L_F² + β_F ‖F(w_0) − y‖) converges to a
global solution in B(w_0, R), with an exponential (a.k.a. linear) convergence rate:

L(w_t) ≤ ( 1 − κ_F^{−1}(B(w_0, R)) )^t L(w_0),    (26)

where the condition number κ_F(B(w_0, R)) = 1/(ηµ).
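The following sketch (our illustration, not the paper's experiment; the toy system, step size and iteration count are our own choices, not the constants of Theorem 6) runs gradient descent on an over-parameterized non-linear least-squares problem and compares the final loss against the PL∗-type rate (1 − ηµ)^T L(w_0), with µ estimated as the smallest tangent-kernel eigenvalue seen along the path.

```python
import numpy as np

# Minimal sketch (our illustration): gradient descent on the square loss of a
# toy over-parameterized non-linear system with an exact solution. The final
# loss respects the PL*-type bound (1 - eta*mu)^T * L(w_0), where mu
# lower-bounds lambda_min(K(w)) along the optimization path.

rng = np.random.default_rng(0)
m, n = 40, 5
A = rng.normal(size=(n, m)) / np.sqrt(m)       # fixed weights of the toy map
y = np.tanh(A @ rng.normal(size=m))            # guarantees an exact solution exists

F = lambda w: np.tanh(A @ w)
DF = lambda w: (1.0 - np.tanh(A @ w) ** 2)[:, None] * A
loss = lambda w: 0.5 * np.sum((F(w) - y) ** 2)

w, eta, T = rng.normal(size=m), 0.2, 1000
L0, mu = loss(w), np.inf
for _ in range(T):
    J = DF(w)
    mu = min(mu, np.linalg.eigvalsh(J @ J.T).min())   # lambda_min(K) along the path
    w = w - eta * J.T @ (F(w) - y)                    # gradient descent step
print(f"L(w_0) = {L0:.3e},  L(w_T) = {loss(w):.3e}")
print(f"PL*-type bound (1 - eta*mu)^T * L(w_0) = {(1 - eta * mu) ** T * L0:.3e}")
```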
Convergence of SGD. In most practical machine learning settings, including typical problems of
supervised learning, the loss function L(w) has the form

L(w) = Σ_{i=1}^n ℓ_i(w).

For example, for the square loss, ℓ_i(w) = ½(F_i(w) − y_i)², so that the loss ℓ_i
corresponds simply to the loss for the i-th equation. Mini-batch SGD updates the parameter w according to
the gradient of s individual loss functions ℓ_i(w) at a time:

w_{t+1} = w_t − η Σ_{i∈S⊂[n]} ∇ℓ_i(w_t),   ∀t ∈ N.
We will assume that each element of the set S is chosen uniformly at random at every iteration.
We now show that the PL∗ condition on L also implies exponential convergence of SGD within a ball,
an SGD analogue of Theorem 6. Our result can be considered as a local version of Theorem 1 in [4] which
showed exponential convergence of SGD, assuming PL condition holds in the entire parameter space. See
also [29] for a related result.
Theorem 7. Assume each ℓ_i(w) is β-smooth and L(w) satisfies the µ-PL∗ condition in the ball B(w_0, R)
with R = 2n√(2βL(w_0)) / (µδ), where δ > 0. Then, with probability 1 − δ, SGD with mini-batch size s ∈ N and
step size η ≤ nµ / ( nβ(n²β + µ(s − 1)) ) converges to a global solution in the ball B(w_0, R), with an exponential
convergence rate:

E[L(w_t)] ≤ ( 1 − µsη/n )^t L(w_0).    (27)
The proof is deferred to Appendix G.
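A minimal sketch (our illustration, not from the paper; the toy system, mini-batch size and step size are our own choices, not the constants of Theorem 7) of the behavior described above: with a constant step size, mini-batch SGD on an interpolating over-parameterized problem drives the loss toward zero without any decaying learning-rate schedule.

```python
import numpy as np

# Minimal sketch (our illustration) of the setting of Theorem 7: mini-batch SGD
# on L(w) = sum_i l_i(w) with l_i(w) = (1/2)(F_i(w) - y_i)^2 for a toy
# over-parameterized system that admits an exact (interpolating) solution.

rng = np.random.default_rng(0)
m, n, s = 60, 10, 2                       # parameters, equations, mini-batch size
A = rng.normal(size=(n, m)) / np.sqrt(m)
y = np.tanh(A @ rng.normal(size=m))       # ensures an exact solution exists

F = lambda w: np.tanh(A @ w)
loss = lambda w: 0.5 * np.sum((F(w) - y) ** 2)

w, eta = rng.normal(size=m), 0.2
for t in range(20001):
    batch = rng.choice(n, size=s, replace=False)
    residual = np.tanh(A[batch] @ w) - y[batch]
    grad = A[batch].T @ ((1.0 - np.tanh(A[batch] @ w) ** 2) * residual)
    w = w - eta * grad                    # SGD step on the sampled losses
    if t % 5000 == 0:
        print(f"iter {t:6d}:  L(w) = {loss(w):.3e}")
```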
Theorem 8. Consider the neural network f(W; x) and its random initialization W_0 under the same
conditions as in Theorem 4. If the network width satisfies m = Ω̃( n / ( µ^{6L+2} (λ_0 − µρ^{−2})² ) ), then, with an
appropriate step size, gradient descent converges to a global minimizer in the ball B(W_0, R), where
R = O(1/µ), with an exponential convergence rate:

L(W_t) ≤ (1 − ηµ)^t L(W_0).    (28)

Proof. The result is obtained by combining Theorems 4 and 6, after setting the ball radius R = 2L_F ‖F(W_0) −
y‖/µ and choosing the step size 0 < η < 1/(L_F² + β_F ‖F(W_0) − y‖).
It remains to verify that all of the quantities L_F, β_F and ‖F(W_0) − y‖ are of order O(1) with respect
to the network width m. First note that, with the random initialization of W_0 as in Theorem 4, it is
shown in [16] that, with high probability, f(W_0; x) = O(1) when the width is sufficiently large, under mild
assumptions on the non-linearities σ_l(·), l = 1, . . . , L + 1. Hence ‖F(W_0) − y‖ = O(1).
Using the definition of L_F, we have

L_F = sup_{W∈B(W_0,R)} ‖DF(W)‖
    ≤ L_σ ( ‖DF̃(W_0)‖ + R√n · sup_{W∈B(W_0,R)} ‖H_{F̃}(W)‖ )
    = sqrt(‖K_{F̃}(W_0)‖) L_σ + L_σ R√n · Õ(1/√m) = O(1),

where the last inequality follows from Theorem 5, and ‖K_{F̃}(W_0)‖ = O(1) with high probability over the
random initialization.
Finally, βF is bounded as follows:
Remark 9. Using the same argument, a result similar to Theorem 8, but with a different convergence
rate,

L(W_t) ≤ (1 − ηsµ)^t L(W_0),    (29)

and with different constants, can be obtained for SGD by applying Theorem 7.
Note that (near) constancy of the tangent kernel is not a necessary condition for exponential conver-
gence of gradient descent or (S)GD.
Theorem 9 (Local PL∗ condition ⇒ fast convergence). Assume the loss function L(w) (not necessarily
the square loss) is β-smooth and satisfies the µ-PL∗ condition in the ball B(w_0, R) := {w ∈ R^m :
‖w − w_0‖ ≤ R} with R = 2√(2βL(w_0)) / µ. We have the following:

(a) Existence of a point with loss less than ε: There exists a point w∗ ∈ B(w_0, R), such that
L(w∗) < ε.

(b) Convergence of GD: Gradient descent with the step size η = 1/ sup_{w∈B(w_0,R)} ‖H_L(w)‖_2 after T =
Ω(log(1/ε)) iterations satisfies L(w_T) < ε, staying in the ball B(w_0, R), with an exponential (also known as linear)
convergence rate:

L(w_t) ≤ ( 1 − κ_{L,F}^{−1}(B(w_0, R)) )^t L(w_0),   for all t ≤ T,    (31)

where the condition number κ_{L,F}(B(w_0, R)) = 1/(ηµ).
It is not hard to see that it suffices to have (1 − ηµ)^T L(w_0) ≤ ε, from which we get T = Ω(log(1/ε)).
And, obviously, w_0, w_1, . . . , w_T are also in the ball B(w_0, R) with R = 2√(2βL(w_0)) / µ.
E[L(w_t)] ≤ ( 1 − µ s η∗(s) / n )^t L(w_0),    (35)
While the global landscape of the loss function L(w) can be complex, the conditions above allow us
to find solutions within a certain ball around the initialization point w0 .
7 Concluding thoughts and comments
In this paper we have proposed a general framework for understanding generically non-convex landscapes
and optimization of over-parameterized systems in terms of the PL∗ condition. We have argued that the PL∗
condition generally holds on most, but not all, of the parameter space, which is sufficient for the existence
of solutions and convergence of gradient-based methods to global minimizers. In contrast, it is not possible
for loss landscapes of under-parameterized systems ‖F(w) − y‖² to satisfy PL∗ for any y. We conclude
with a number of comments and observations.
Transition over the interpolation threshold. Recognizing the power of over-parameterization has
been a key insight stemming from the practice of deep learning. Transition to over-parameterized mod-
els – over the interpolation threshold – leads to a qualitative change in a range of system properties.
Statistically, over-parameterized systems enter a new interpolating regime, where increasing the number
of parameters, even indefinitely to infinity, can improve generalization [5, 34]. From the optimization
point of view, over-parameterized systems are generally easier to solve. There has been significant effort
(continued in this work) toward understanding the effectiveness of local methods in this setting [33, 12, 25,
2, 17, 30].
In this paper we note another aspect of this transition, to the best of our knowledge not addressed in
the existing literature – transition from local convexity to essential non-convexity. This relatively simple
observation has significant consequences, indicating the need to depart from the machinery of convex
analysis. Interestingly, our analyses suggest that this loss of local convexity is of little consequence for
optimization, at least as far as gradient-based methods are concerned.
be the case in the over-parameterized non-linear setting as well. To give just one example, we expect
accelerated methods, such as Nesterov's method [26] and its stochastic gradient extensions to the
over-parameterized case [22, 35], to have faster convergence rates for non-linear systems in terms of the
condition numbers defined in this work.
Acknowledgements
We thank Raef Bassily and Siyuan Ma for many earlier discussions about gradient methods and Polyak-
Lojasiewicz conditions and Stephen Wright for insightful comments and corrections. The authors ac-
knowledge support from the NSF, the Simons Foundation and a Google Faculty Research Award.
References
[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A Convergence Theory for Deep Learning via
Over-Parameterization”. In: International Conference on Machine Learning. 2019, pp. 242–252.
[2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. “Fine-Grained Analysis of
Optimization and Generalization for Overparameterized Two-Layer Neural Networks”. In: Inter-
national Conference on Machine Learning. 2019, pp. 322–332.
[3] Peter L Bartlett, David P Helmbold, and Philip M Long. “Gradient descent with identity initializa-
tion efficiently learns positive-definite linear transformations by deep residual networks”. In: Neural
computation 31.3 (2019), pp. 477–502.
[4] Raef Bassily, Mikhail Belkin, and Siyuan Ma. “On exponential convergence of sgd in non-convex
over-parametrized learning”. In: arXiv preprint arXiv:1811.02564 (2018).
[5] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. “Reconciling modern machine-
learning practice and the classical bias–variance trade-off”. In: Proceedings of the National Academy
of Sciences 116.32 (2019), pp. 15849–15854.
[6] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are
few-shot learners”. In: arXiv preprint arXiv:2005.14165 (2020).
[7] Peter Bürgisser and Felipe Cucker. Condition: The geometry of numerical algorithms. Vol. 349.
Springer Science & Business Media, 2013.
[8] Zachary Charles and Dimitris Papailiopoulos. “Stability and generalization of learning algorithms
that converge to global optima”. In: International Conference on Machine Learning. PMLR. 2018,
pp. 745–754.
[9] Zizhong Chen and Jack J Dongarra. “Condition numbers of Gaussian random matrices”. In: SIAM
Journal on Matrix Analysis and Applications 27.3 (2005), pp. 603–620.
[10] Yaim Cooper. “The loss landscape of overparameterized neural networks”. In: arXiv preprint
arXiv:1804.10200 (2018).
[11] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. “Gradient Descent Finds Global
Minima of Deep Neural Networks”. In: International Conference on Machine Learning. 2019,
pp. 1675–1685.
[12] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. “Gradient descent provably optimizes
over-parameterized neural networks”. In: arXiv preprint arXiv:1810.02054 (2018).
[13] William Fedus, Barret Zoph, and Noam Shazeer. “Switch Transformers: Scaling to Trillion Param-
eter Models with Simple and Efficient Sparsity”. In: arXiv preprint arXiv:2101.03961 (2021).
[14] Chirag Gupta, Sivaraman Balakrishnan, and Aaditya Ramdas. “Path length bounds for gradient
descent and flow”. In: arXiv preprint arXiv:1908.01089 (2019).
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[16] Arthur Jacot, Franck Gabriel, and Clément Hongler. “Neural tangent kernel: Convergence and
generalization in neural networks”. In: Advances in neural information processing systems. 2018,
pp. 8571–8580.
[17] Ziwei Ji and Matus Telgarsky. “Polylogarithmic width suffices for gradient descent to achieve arbi-
trarily small test error with shallow ReLU networks”. In: arXiv preprint arXiv:1909.12292 (2019).
[18] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[19] Johannes Lederer. “No Spurious Local Minima: on the Optimization Landscapes of Wide and Deep
Neural Networks”. In: arXiv preprint arXiv:2010.00885 (2020).
[20] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein,
and Jeffrey Pennington. “Wide neural networks of any depth evolve as linear models under gradient
descent”. In: Advances in neural information processing systems. 2019, pp. 8570–8581.
[21] Dawei Li, Tian Ding, and Ruoyu Sun. “Over-parameterized deep neural networks have no strict
local minima for any continuous activations”. In: arXiv preprint arXiv:1812.11039 (2018).
[22] Chaoyue Liu and Mikhail Belkin. “Accelerating SGD with momentum for over-parameterized learn-
ing”. In: The 8th International Conference on Learning Representations (ICLR). 2020.
[23] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. “On the linearity of large non-linear models: when
and why the tangent kernel is constant”. In: Advances in Neural Information Processing Systems
33 (2020).
[24] Stanislaw Lojasiewicz. “A topological property of real analytic subsets”. In: Coll. du CNRS, Les
équations aux dérivées partielles 117 (1963), pp. 87–89.
[25] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. “A mean field view of the landscape of
two-layer neural networks”. In: Proceedings of the National Academy of Sciences 115.33 (2018),
E7665–E7671.
[26] Yurii Nesterov. “A method for unconstrained convex minimization problem with the rate of con-
vergence O(1/k²)”. In: Doklady AN USSR. Vol. 269. 1983, pp. 543–547.
[27] Quynh Nguyen, Mahesh Chandra Mukkamala, and Matthias Hein. “On the loss landscape of a class
of deep neural networks with no bad local valleys”. In: arXiv preprint arXiv:1809.10749 (2018).
[28] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media,
2006.
[29] Samet Oymak and Mahdi Soltanolkotabi. “Overparameterized nonlinear learning: Gradient de-
scent takes the shortest path?” In: International Conference on Machine Learning. PMLR. 2019,
pp. 4951–4960.
[30] Samet Oymak and Mahdi Soltanolkotabi. “Towards moderate overparameterization: global con-
vergence guarantees for training shallow neural networks”. In: arXiv preprint arXiv:1902.04674
(2019).
[31] Tomaso Poggio, Gil Kur, and Andrzej Banburski. “Double descent in the condition number”. In:
arXiv preprint arXiv:1912.06190 (2019).
[32] Boris Teodorovich Polyak. “Gradient methods for minimizing functionals”. In: Zhurnal Vychisli-
tel’noi Matematiki i Matematicheskoi Fiziki 3.4 (1963), pp. 643–653.
[33] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. “Theoretical insights into the optimiza-
tion landscape of over-parameterized shallow neural networks”. In: IEEE Transactions on Infor-
mation Theory 65.2 (2018), pp. 742–769.
[34] S Spigler, M Geiger, S d’Ascoli, L Sagun, G Biroli, and M Wyart. “A jamming transition from
under- to over-parametrization affects generalization in deep learning”. In: Journal of Physics A:
Mathematical and Theoretical 52.47 (Oct. 2019), p. 474001.
[35] Sharan Vaswani, Francis Bach, and Mark Schmidt. “Fast and Faster Convergence of SGD for Over-
Parameterized Models and an Accelerated Perceptron”. In: The 22th International Conference on
Artificial Intelligence and Statistics. 2019, pp. 1195–1204.
[36] Patrick M Wensing and Jean-Jacques Slotine. “Beyond convexity—Contraction and global conver-
gence of gradient descent”. In: Plos one 15.8 (2020), e0236661.
[37] Xiao-Hu Yu and Guo-An Chen. “On the local minima free condition of backpropagation learning”.
In: IEEE Transactions on Neural Networks 6.5 (1995), pp. 1300–1303.
[38] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. “Stochastic gradient descent optimizes
over-parameterized deep relu networks”. In: arXiv preprint arXiv:1811.08888 (2018).
A Wide neural networks have no isolated local/global minima
In this section, we show that, for feedforward neural networks, if the network width is sufficiently large,
there is no isolated local/global minima in the loss landscape.
Consider the following feedforward neural networks:
Here, x ∈ Rd is the input, L is the number of hidden layers of the network. Let ml be the width
of l-th layer, l ∈ [L], and m0 = d and mL+1 = c where c is the dimension of the network output.
W := {W (1) , W (2) , · · · , W (L) , W (L+1) }, with W (l) ∈ Rml−1 ×ml , is the collection of parameters, and φl
is the activation function at each hidden layer. The minimal width of hidden layers, i.e., width of the
network, is denoted as m := min{m1 , · · · , mL }. The parameter space is denoted as M. Here, we further
assume the loss function L(W) is of the form

L(W) = Σ_{i=1}^n l( f(W; x_i), y_i ),    (37)

where the loss l(·, ·) is convex in the first argument, and (x_i, y_i) ∈ R^d × R^c is one of the n training samples.
Proposition 6 (No isolated minima). Consider the feedforward neural network in Eq.(36). If the network
width m ≥ 2c(n + 1)^L, then, given a local (global) minimum W∗ of the loss function Eq.(37), there are always
other local (global) minima in any neighborhood of W∗.
The main idea of this proposition is based on some of the intermediate-level results of the work [19].
Before starting the proof, let’s review some of the important concepts and results therein. To be consistent
with the notation in this paper, we modify some of their notations.
Definition 6 (Path constant). Consider two parameters W, V ∈ M. If there is a continuous function
hW,V : [0, 1] → M that satisfies hW,V (0) = W, hW,V (1) = V and t 7→ L(hW,V (t)) is constant, we say
that W and V are path constant and write W ↔ V.
Path constantness means the two parameters W and V are connected by a continuous path of pa-
rameters that is contained in a level set of the loss function L.
Lemma 1 (Transitivity). W ↔ V and V ↔ U ⇒ W ↔ U.
Definition 7 (Block parameters). Consider a number s ∈ {0, 1, · · · } and a parameter W ∈ M. If

1. W^(1)_{ji} = 0 for all j > s;
2. W^(l)_{ij} = 0 for all l ∈ [L] and i > s, and for all l ∈ [L] and j > s;
3. W^(L+1)_{ij} = 0 for all j > s,

we call W an s-lower-block parameter of depth L. We denote the sets of s-upper-block and s-lower-block
parameters of depth L by U_{s,L} and V_{s,L}, respectively.
The key result that we use in this paper is that every parameter is path constant to a block parameter,
or more formally:
Proposition 7 (Path connections to block parameters [19]). For every parameter W ∈ M and s :=
c(n + 1)^L (n is the number of training samples), there are parameters W′, W′′ ∈ M with W′ ∈ U_{s,L} and W′′ ∈ V_{s,L}
such that W ↔ W′ and W ↔ W′′.
The above proposition says that, if the network width m is large enough, every parameter is path
connected to both an upper-block parameter and a lower-block parameter, by continuous paths contained
in a level set of the loss function.
Now, let’s present the proof of Proposition 6.
Proof of Proposition 6. Let W∗ ∈ M be an arbitrary local/global minimum of the loss function L(W).
According to Proposition 7, there exist an upper-block parameter W′ ∈ U_{s,L} ⊂ M and a lower-block parameter
W′′ ∈ V_{s,L} ⊂ M such that W∗ ↔ W′ and W∗ ↔ W′′. Note that W′ and W′′ are distinct, because U_{s,L} and V_{s,L} do not
intersect except at zero, due to m ≥ 2s. Hence at least one of them is distinct from W∗ and
is connected to W∗ via a continuous path contained in a level set of the loss function. All
the points (i.e., parameters) along this path have the same loss value as L(W∗), hence are local/global
minima. Therefore, W∗ is not isolated (i.e., there are other local/global minima in any neighborhood of
W∗).
B Proof of Proposition 2
Proof. The Hessian matrix of a general loss function L(F(w)) takes the form

H_L(w) = DF(w)^T (∂²L/∂F²)(w) DF(w) + Σ_{i=1}^n (∂L/∂F_i)(w) H_{F_i}(w).

Recall that H_{F_i}(w) is the Hessian matrix of the i-th output of F with respect to w.
We consider the Hessian matrix of L around a global minimizer w∗ of the loss function, i.e., a solution
of the system of equations. Specifically, consider the two points w∗ + δ and w∗ − δ, which are
in a sufficiently small neighborhood of the minimizer w∗. The Hessians of the loss at these two points are

H_L(w∗ + δ) = DF(w∗ + δ)^T (∂²L/∂F²)(w∗ + δ) DF(w∗ + δ) + Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] H_{F_i}(w∗ + δ) + o(‖δ‖),

H_L(w∗ − δ) = DF(w∗ − δ)^T (∂²L/∂F²)(w∗ − δ) DF(w∗ − δ) − Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] H_{F_i}(w∗ − δ) + o(‖δ‖),

where the first terms on the right-hand sides are denoted A(w∗ + δ) and A(w∗ − δ), respectively.
Note that both the terms A(w∗ + δ) and A(w∗ − δ) are matrices with rank at most n, since DF is of
the size n × m.
By the assumption, at least one component HFk of the Hessian of F satisfies that the rank of HFk (w∗ )
is greater than 2n. By the continuity of the Hessian, we have that, if magnitude of δ is sufficiently small,
then the ranks of HFk (w∗ + δ) and HFk (w∗ − δ) are also greater than 2n.
Hence, we can always find a unit-length vector v ∈ R^m such that

DF(w∗ + δ) v = 0   and   DF(w∗ − δ) v = 0,    (38)

but

v^T H_{F_k}(w∗ + δ) v ≠ 0,   v^T H_{F_k}(w∗ − δ) v ≠ 0.    (39)

Consequently, the vectors ⟨v^T H_{F_1}(w∗ + δ) v, . . . , v^T H_{F_n}(w∗ + δ) v⟩ ≠ 0 and ⟨v^T H_{F_1}(w∗ − δ) v, . . . , v^T H_{F_n}(w∗ − δ) v⟩ ≠ 0.
With the same v, we have

v^T H_L(w∗ + δ) v = Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ + δ) v + o(‖δ‖),    (40)

v^T H_L(w∗ − δ) v = − Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ − δ) v + o(‖δ‖).    (41)
In the following, we show that, for sufficiently small δ, v^T H_L(w∗ + δ) v and v^T H_L(w∗ − δ) v cannot both be
non-negative, which immediately implies that H_L is not positive semi-definite in a close
neighborhood of w∗, and hence that L is not locally convex at w∗.

Specifically, under the condition (d/dw)(∂L/∂F)(w∗) ≠ 0, for Eq.(40) and Eq.(41) we have the following
cases.

Case 1: If Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ + δ) v < 0, then directly v^T H_L(w∗ + δ) v < 0 if δ is small
enough, which completes the proof.

Case 2: Otherwise, if Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ + δ) v > 0, then by the continuity of each H_{F_i}(·) we
have

− Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ − δ) v
  = − Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ + δ) v + Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T ( H_{F_i}(w∗ + δ) − H_{F_i}(w∗ − δ) ) v
  = − Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ + δ) v + o(‖δ‖)
  < 0,

when δ is small enough.

Note: If Σ_{i=1}^n [ (d/dw)(∂L/∂F_i)(w∗) δ ] v^T H_{F_i}(w∗ + δ) v = 0, we can always adjust δ slightly so that it falls
into Case 1 or Case 2.

In conclusion, for certain arbitrarily small δ, either v^T H_L(w∗ + δ) v or v^T H_L(w∗ − δ) v
has to be negative, which means L(w) has no convex neighborhood around w∗.
Due to the randomness of the sub-model weights v_i, the scaling factor s(m) satisfies 1/s(m) = o(1) with respect to the size m
of the model f (e.g., s(m) = √m for neural networks [16, 23]).

The following theorem states that, as long as the model size m is sufficiently large, the Hessian spectral
norm of the model f is arbitrarily small.

Theorem 11. Consider the model f defined in Eq.(42). Under Assumption 1, the spectral norm of the
Hessian of the model f satisfies

‖H_f‖ ≤ β / s(m).    (43)
Proof. An entry (H_f)_{jk} of the model Hessian matrix is

(H_f)_{jk} = (1/s(m)) Σ_{i=1}^m v_i ∂²α_i/(∂w_j ∂w_k) =: (1/s(m)) Σ_{i=1}^m v_i (H_{α_i})_{jk},    (44)

with H_{α_i} being the Hessian matrix of the sub-model α_i. Because the parameters of f are the concatenation
of the sub-model parameters and the sub-models share no common parameters, the summation in Eq.(44)
contains at most one non-zero term (non-zero only when w_j and w_k belong to the same sub-model α_i).
Thus, the Hessian spectral norm can be bounded by

‖H_f‖ ≤ (1/s(m)) max_{i∈[m]} |v_i| · ‖H_{α_i}‖ ≤ (1/s(m)) · β.    (45)
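As a small sanity check (our illustration; Assumption 1 and Eq.(42) are not shown in this excerpt, so the sub-models α_i(w_i) = sin(w_i), one parameter each, and the scaling s(m) = √m are our own choices), the Hessian of such an additively composed model is block-diagonal and its norm scales as in the first inequality of Eq.(45):

```python
import numpy as np

# Numerical sketch (our illustration) of the scaling in Theorem 11 with scalar
# sub-models: f(w) = (1/sqrt(m)) * sum_i v_i * sin(w_i). Since the sub-models
# share no parameters, the Hessian of f is diagonal, with entries
# -v_i * sin(w_i) / sqrt(m), and its spectral norm is bounded by
# max_i |v_i| * beta / sqrt(m) with beta = sup ||H_{alpha_i}|| = 1.

rng = np.random.default_rng(0)
for m in [10, 100, 1000, 10000]:
    v = rng.normal(size=m)                     # random sub-model weights
    w = rng.normal(size=m)                     # one parameter per sub-model
    hess_diag = -v * np.sin(w) / np.sqrt(m)    # exact (diagonal) Hessian of f
    print(f"m = {m:6d}:  ||H_f|| = {np.max(np.abs(hess_diag)):.4f}   "
          f"bound max|v_i| / sqrt(m) = {np.max(np.abs(v)) / np.sqrt(m):.4f}")
```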
Here we set the frequencies ω_k ∼ N(0, 1) and ν_k ∼ N(0, n²), for all k ∈ [m], to be fixed. The trainable
parameters of the model are (u, a) ∈ R^{2m} × R^{2m}. It is not hard to see that the model h(x) = g ∘ f(x) can
be viewed as a 4-layer “bottleneck” neural network where the second hidden layer has only one neuron,
i.e., the output of f.

For simplicity, we let the input x be 1-dimensional, and denote the training data as x_1, . . . , x_n ∈ R.
We consider the case where both sub-models, f and g, are large; in particular, m → ∞.

By Proposition 4, the tangent kernel matrix of h, i.e., K_h, can be decomposed into the sum of two
positive semi-definite matrices, and the uniform conditioning of K_h can be guaranteed if one of them is
uniformly conditioned, as demonstrated in Eq.(16). In the following, we show that K_g is uniformly conditioned
by making the outputs of f well separated.
We assume the training data are not degenerate and the parameters $u$ are randomly initialized. This makes sure that the initial outputs of $f$, which are the initial inputs of $g$, are not degenerate, with high probability. For example, let $\min_{i \neq j} |x_i - x_j| \geq \sqrt{2}$ and initialize $u$ by $N(0, 100R^2)$ for a given number $R > 0$. Then, we have
\[
f(u_0; x_i) - f(u_0; x_j) \sim N\!\left(0,\; 200R^2 - 200R^2 e^{-\frac{|x_i - x_j|^2}{2}}\right), \quad \forall i, j \in [n].
\]
Since $\min_{i \neq j} |x_i - x_j| \geq \sqrt{2}$, the variance $\mathrm{Var} := 200R^2 - 200R^2 e^{-\frac{|x_i - x_j|^2}{2}} > 100R^2$. For this Gaussian distribution, we have, with probability at least 0.96, that the initial outputs of $f$ are separated by at least $2R$.
For this model, the partial tangent kernel $K_g$ is the Gaussian kernel in the limit $m \to \infty$, with the following entries:
\[
K_{g,ij}(u) = \exp\!\left(-\frac{n^2 |f(u; x_i) - f(u; x_j)|^2}{2}\right).
\]
By the Gershgorin circle theorem, its smallest eigenvalue is lower bounded by
\[
\inf_{u \in S} \lambda_{\min}(K_g(u)) \geq 1 - (n-1)\exp\!\left(-\frac{n^2 \rho(S)^2}{2}\right),
\]
where $\rho(S) := \inf_{u \in S} \min_{i \neq j} |f(u; x_i) - f(u; x_j)|$ is the minimum separation between the outputs of $f$, i.e., the inputs of $g$, on $S$. If $\rho(S) \geq \sqrt{2\ln(2n-2)}/n$, then $\inf_{u \in S} \lambda_{\min}(K_g(u)) \geq 1/2$. Therefore, we see that the uniform conditioning of $K_g$, and hence that of $K_h$, is controlled by the separation between the inputs of $g$, i.e., the outputs of $f$.
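The Gershgorin estimate is easy to verify numerically. Below is a minimal sketch with stand-in values for the outputs of $f$ (the evenly spaced values and the choice $n = 8$ are illustrative).

```python
import numpy as np

# Build the Gaussian kernel K_g with bandwidth n from some outputs of f and
# compare lambda_min(K_g) with the Gershgorin bound 1 - (n-1)exp(-n^2 rho^2 / 2).

n = 8
f_out = np.linspace(0.0, 2.1, n)          # stand-ins for f(u; x_i), spacing 0.3

diff = f_out[:, None] - f_out[None, :]
K_g = np.exp(-(n ** 2) * diff ** 2 / 2.0)

rho = 0.3                                  # minimum separation of f_out
gershgorin_bound = 1.0 - (n - 1) * np.exp(-(n ** 2) * rho ** 2 / 2.0)

print("lambda_min(K_g): ", np.linalg.eigvalsh(K_g).min())
print("Gershgorin bound:", gershgorin_bound)
# Since rho >= sqrt(2*ln(2n - 2)) / n here, both values are at least 1/2.
```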
Within the ball $B((u_0, a_0), R) := \{(u, a) : \|u - u_0\|^2 + \|a - a_0\|^2 \leq R^2\}$ with an arbitrary radius $R > 0$, the outputs of $f$ are always well separated, given that the initial outputs of $f$ are separated by $2R$, as discussed above. This is because
\[
|f(u; x) - f(u_0; x)| = \frac{1}{\sqrt{m}} \left|\sum_{k=1}^{m} \big((u_k - u_{0,k})\cos(\omega_k x) + (u_{k+m} - u_{0,k+m})\sin(\omega_k x)\big)\right| \leq \|u - u_0\| \leq R,
\]
which leads to $\rho(B((u_0, a_0), R)) > R$ by Eq.(48). By choosing $R$ appropriately, specifically $R \geq \sqrt{2\ln(2n-2)}/n$, the uniform conditioning of $K_g$ is satisfied.
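The displacement bound used above can also be checked numerically. The following is a minimal sketch for the random-feature form of $f$; the specific values of $m$, $R$, and the test points are illustrative.

```python
import numpy as np

# For f(u; x) = (1/sqrt(m)) * sum_k (u_k cos(w_k x) + u_{k+m} sin(w_k x)),
# moving u on the sphere ||u - u0|| = R moves each output of f by at most R,
# since the feature vector (cos(w_k x), sin(w_k x))_k / sqrt(m) has unit norm.

rng = np.random.default_rng(2)
m, R = 2000, 1.0
omega = rng.normal(size=m)                   # fixed frequencies ~ N(0, 1)

def f(u, x):
    feats = np.concatenate([np.cos(omega * x), np.sin(omega * x)]) / np.sqrt(m)
    return feats @ u

u0 = rng.normal(scale=10.0 * R, size=2 * m)  # initialization ~ N(0, 100 R^2)
for _ in range(5):
    step = rng.normal(size=2 * m)
    u = u0 + R * step / np.linalg.norm(step)  # a point with ||u - u0|| = R
    x = rng.uniform(-3, 3)
    print(abs(f(u, x) - f(u0, x)), "<=", np.linalg.norm(u - u0))
```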
Hence, we see that composing large non-linear models may make the tangent kernel no longer constant,
but the uniform conditioning of the tangent kernel can remain.
\[
\begin{aligned}
\alpha^{(0)} &= x, \\
\alpha^{(l)} &= \sigma_l\!\left(\frac{1}{\sqrt{m_{l-1}}}\, W^{(l)} * \alpha^{(l-1)}\right), \quad \forall l = 1, 2, \ldots, L, \\
f(\mathbf{W}; x) &= \frac{1}{\sqrt{m_L}} \left\langle W^{(L+1)}, \alpha^{(L)} \right\rangle, \tag{49}
\end{aligned}
\]
where $*$ is the convolution operator (see the definition below) and $\langle \cdot, \cdot \rangle$ is the standard matrix inner product.
Compared to the definition of fully-connected neural networks in Eq.(17), the $l$-th hidden layer $\alpha^{(l)} \in \mathbb{R}^{m_l \times Q}$ is a matrix, where $m_l$ is the number of channels and $Q$ is the number of pixels, and $W^{(l)} \in \mathbb{R}^{K \times m_l \times m_{l-1}}$ is an order-3 tensor, where $K$ is the filter size, except that $W^{(L+1)} \in \mathbb{R}^{m_L \times Q}$.

For simplicity of notation, we give the definition of the convolution operation for 1-D CNNs in the following. We note that it is not hard to extend it to higher-dimensional CNNs, and one will find that our analysis still applies.
\[
\left(W^{(l)} * \alpha^{(l-1)}\right)_{i,q} = \sum_{k=1}^{K} \sum_{j=1}^{m_{l-1}} W^{(l)}_{k,i,j}\, \alpha^{(l-1)}_{j,\, q+k-\frac{K+1}{2}}, \quad i \in [m_l],\ q \in [Q]. \tag{50}
\]
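Below is a minimal sketch of the 1-D convolution in Eq.(50). Zero padding at the boundary and an odd filter size $K$ are assumptions, since the definition leaves boundary handling implicit; the tensor shapes follow the definition above.

```python
import numpy as np

# W has shape (K, m_l, m_{l-1}); alpha has shape (m_{l-1}, Q).
# (W * alpha)_{i,q} = sum_{k,j} W[k,i,j] * alpha[j, q + k - (K+1)/2] (1-based),
# i.e., a window of K pixels centered at q, with zero padding at the boundary.

def conv1d(W, alpha):
    K, m_out, m_in = W.shape
    _, Q = alpha.shape
    half = (K - 1) // 2
    padded = np.pad(alpha, ((0, 0), (half, half)))   # zero-pad the pixel axis
    out = np.zeros((m_out, Q))
    for q in range(Q):
        window = padded[:, q:q + K]                  # shape (m_in, K)
        out[:, q] = np.einsum('kij,jk->i', W, window)
    return out

W = np.random.randn(3, 4, 2)     # K = 3, m_l = 4 output channels, m_{l-1} = 2
alpha = np.random.randn(2, 5)    # Q = 5 pixels
print(conv1d(W, alpha).shape)    # (4, 5)
```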
We define the ResNet $f(\mathbf{W}; x)$ as follows:
\[
\begin{aligned}
\alpha^{(0)} &= x, \\
\alpha^{(l)} &= \sigma_l\!\left(\frac{1}{\sqrt{m_{l-1}}}\, W^{(l)} \alpha^{(l-1)}\right) + \alpha^{(l-1)}, \quad \forall l = 1, 2, \ldots, L+1, \\
f(\mathbf{W}; x) &= \alpha^{(L+1)}. \tag{51}
\end{aligned}
\]
We see that the ResNet is the same as a fully-connected neural network, Eq.(17), except that the activation $\alpha^{(l)}$ has an extra additive term $\alpha^{(l-1)}$ from the previous layer, interpreted as a skip connection.

Remark 10. This definition of ResNet differs from the standard ResNet architecture in [15] in that the skip connections are at every layer, instead of every two layers. One will find that the same analysis can be easily generalized to cases where skip connections are at every two or more layers. The same definition, up to a scaling factor, was also theoretically studied in [11].
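For concreteness, here is a minimal sketch of the forward pass in Eq.(51); the constant layer width (so that the skip connection is dimensionally consistent) and the tanh activation are illustrative choices.

```python
import numpy as np

# Forward pass of the ResNet in Eq.(51): each layer applies a scaled linear
# map, a pointwise activation, and adds the previous layer's activation.

def resnet_forward(weights, x, sigma=np.tanh):
    # weights: list of L+1 matrices W^(l), each of shape (m_l, m_{l-1})
    alpha = x
    for W in weights:
        alpha = sigma(W @ alpha / np.sqrt(W.shape[1])) + alpha
    return alpha                               # f(W; x) = alpha^(L+1)

m = 64                                         # constant width for the skip term
rng = np.random.default_rng(3)
weights = [rng.normal(size=(m, m)) for _ in range(4)]   # L + 1 = 4 layers
x = rng.normal(size=m)
print(resnet_forward(weights, x).shape)        # (64,)
```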
By the following theorem, we have an upper bound for the Hessian spectral norm of the CNN and the ResNet, similar to Theorem 5 for fully-connected neural networks.

Theorem 12 (Theorem 3.3 of [23]). Consider a neural network $f(\mathbf{W}; x)$ of the form Eq.(49) or Eq.(51). Let $m$ be the minimum of the hidden layer widths, i.e., $m = \min_{l \in [L]} m_l$. Given any fixed $R > 0$ and any $\mathbf{W} \in B(\mathbf{W}_0, R) := \{\mathbf{W} : \|\mathbf{W} - \mathbf{W}_0\| \leq R\}$, with high probability over the initialization, the Hessian spectral norm satisfies:
\[
\|H_f(\mathbf{W})\| = \tilde{O}\!\left(R^{3L} / \sqrt{m}\right). \tag{52}
\]
Using the same analysis as in Section 4.2, we obtain a result similar to Theorem 4 for CNNs and ResNets, showing that they also satisfy the PL∗ condition:

Theorem 13 (Wide CNN and ResNet satisfy PL∗ condition). Consider the neural network $f(\mathbf{W}; x)$ in Eq.(49) or Eq.(51), and a random parameter setting $\mathbf{W}_0$ such that each element of $W_0^{(l)}$ for $l \in [L+1]$ follows $N(0, 1)$. Suppose that the last layer activation $\sigma_{L+1}$ satisfies $|\sigma'_{L+1}(z)| \geq \rho > 0$ and that $\lambda_0 := \lambda_{\min}(K(\mathbf{W}_0)) > 0$. For any $\mu \in (0, \lambda_0 \rho^2)$, if the width of the network satisfies
\[
m = \tilde{\Omega}\!\left(\frac{n R^{6L+2}}{(\lambda_0 - \mu\rho^{-2})^2}\right), \tag{53}
\]
then the $\mu$-PL∗ condition holds for the square loss in the ball $B(\mathbf{W}_0, R)$.
Theorem 14. Suppose the loss function $\mathcal{L}(w)$ (not necessarily the square loss) is $\beta$-smooth and satisfies the $\mu$-PL∗ condition in the ball $B(w_0, R) := \{w \in \mathbb{R}^m : \|w - w_0\| \leq R\}$ with $R = \frac{2\sqrt{2\beta \mathcal{L}(w_0)}}{\mu}$. Then we have the following:
(a) Existence of a solution: There exists a solution (global minimizer of $\mathcal{L}$) $w^* \in B(w_0, R)$, such that $F(w^*) = y$.

(b) Convergence of GD: Gradient descent with a step size $\eta \leq 1/\sup_{w \in B(w_0, R)} \|H_{\mathcal{L}}(w)\|_2$ converges to a global solution in $B(w_0, R)$, with an exponential (also known as linear) convergence rate:
\[
\mathcal{L}(w_t) \leq \left(1 - \kappa_{\mathcal{L},F}^{-1}(B(w_0, R))\right)^t \mathcal{L}(w_0), \tag{54}
\]
where the condition number $\kappa_{\mathcal{L},F}(B(w_0, R)) = \frac{1}{\eta\mu}$.
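Before turning to the proof, here is a minimal numerical sketch of part (b) on an over-parameterized linear least-squares problem; the matrix $A$, the dimensions, and the tolerance are illustrative choices. In this linear case, the tangent kernel $A A^T$ directly gives the PL∗ constant $\mu = \lambda_{\min}(A A^T)$ and the Hessian norm $\lambda_{\max}(A A^T)$, so the rate (54) can be checked step by step.

```python
import numpy as np

# Gradient descent on L(w) = 0.5 * ||A w - y||^2 with n < m (over-parameterized).
# Here mu = lambda_min(A A^T) is the PL* constant and eta = 1 / lambda_max(A A^T).

rng = np.random.default_rng(4)
n, m = 20, 200
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

K = A @ A.T                                # tangent kernel
mu = np.linalg.eigvalsh(K).min()           # PL* constant
eta = 1.0 / np.linalg.eigvalsh(K).max()    # 1 / sup ||H_L||

w = np.zeros(m)
loss0 = 0.5 * np.linalg.norm(A @ w - y) ** 2
for t in range(1, 201):
    w -= eta * A.T @ (A @ w - y)           # gradient step
    loss = 0.5 * np.linalg.norm(A @ w - y) ** 2
    assert loss <= (1 - eta * mu) ** t * loss0 + 1e-12   # rate of Eq.(54)
print("final loss:", loss)
```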
Proof. Let us start with Theorem 14. We prove this theorem by induction. The induction hypothesis is that, for all $t \geq 0$, $w_t$ is within the ball $B(w_0, R)$ with $R = \frac{2\sqrt{2\beta \mathcal{L}(w_0)}}{\mu}$, and
\[
\mathcal{L}(w_t) \leq (1 - \eta\mu)^t \mathcal{L}(w_0). \tag{55}
\]
In the base case, where $t = 0$, it is trivial that $w_0 \in B(w_0, R)$ and that $\mathcal{L}(w_0) \leq (1 - \eta\mu)^0 \mathcal{L}(w_0)$.
Suppose that, for a given $t \geq 0$, $w_t$ is in the ball $B(w_0, R)$ and Eq.(55) holds. As we show separately below (in Eq.(59)), we also have $w_{t+1} \in B(w_0, R)$. Since the step size satisfies $\eta \leq 1/\sup_{w \in B(w_0, R)} \|H_{\mathcal{L}}(w)\|_2$, the loss $\mathcal{L}$ is $1/\eta$-smooth in $B(w_0, R)$. Using that, we obtain the following:
\[
\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) - \nabla\mathcal{L}(w_t)(w_{t+1} - w_t) \leq \frac{1}{2} \sup_{w \in B(w_0, R)} \|H_{\mathcal{L}}\|_2\, \|w_{t+1} - w_t\|^2 \leq \frac{1}{2\eta} \|w_{t+1} - w_t\|^2. \tag{56}
\]
Taking $w_{t+1} - w_t = -\eta\nabla\mathcal{L}(w_t)$ and using the $\mu$-PL∗ condition at the point $w_t$, we have
\[
\mathcal{L}(w_{t+1}) - \mathcal{L}(w_t) \leq -\frac{\eta}{2}\|\nabla\mathcal{L}(w_t)\|^2 \leq -\eta\mu\,\mathcal{L}(w_t). \tag{57}
\]
Therefore,
\[
\mathcal{L}(w_{t+1}) \leq (1 - \eta\mu)\,\mathcal{L}(w_t) \leq (1 - \eta\mu)^{t+1}\,\mathcal{L}(w_0). \tag{58}
\]
To prove $w_{t+1} \in B(w_0, R)$, by the fact that $\mathcal{L}$ is $\beta$-smooth, we have
\[
\begin{aligned}
\|w_{t+1} - w_0\| &= \eta\left\|\sum_{i=0}^{t} \nabla\mathcal{L}(w_i)\right\| \\
&\leq \eta \sum_{i=0}^{t} \|\nabla\mathcal{L}(w_i)\| \\
&\leq \eta \sum_{i=0}^{t} \sqrt{2\beta\big(\mathcal{L}(w_i) - \mathcal{L}(w_{i+1})\big)} \\
&\leq \eta \sum_{i=0}^{t} \sqrt{2\beta\,\mathcal{L}(w_i)} \\
&\leq \eta\sqrt{2\beta}\left(\sum_{i=0}^{t} (1 - \eta\mu)^{i/2}\right)\sqrt{\mathcal{L}(w_0)} \\
&\leq \eta\sqrt{2\beta}\,\sqrt{\mathcal{L}(w_0)}\,\frac{1}{1 - \sqrt{1 - \eta\mu}} \\
&\leq \frac{2\sqrt{2\beta}\,\sqrt{\mathcal{L}(w_0)}}{\mu} \\
&= R. \tag{59}
\end{aligned}
\]
Thus, wt+1 resides in the ball B(w0 , R).
By the principle of induction, the hypothesis holds for all $t \geq 0$.
Now, we prove Theorem 6, i.e., the particular case of the square loss $\mathcal{L}(w) = \frac{1}{2}\|F(w) - y\|^2$. In this case, note that, for all $t > 0$, $\|H_{\mathcal{L}}(w_t)\|_2 \leq L_F^2 + \beta_F \cdot \|F(w_0) - y\|$, since $\|F(w_t) - y\| \leq \|F(w_0) - y\|$. Hence, the step size $\eta = 1/\big(L_F^2 + \beta_F \cdot \|F(w_0) - y\|\big)$ is valid.
G Proof of Theorem 7
Proof. We first aggressively assume that the $\mu$-PL∗ condition holds in the whole parameter space $\mathbb{R}^m$. We will find that the condition can be relaxed to hold only in the ball $B(w_0, R)$.

Following a similar analysis to the proof of Theorem 1 in [4], by the $\mu$-PL∗ condition, we can get that, for any mini-batch size $s$, mini-batch SGD with step size $\eta^*(s) := \frac{n\mu}{n\lambda(n^2\lambda + \mu(s-1))}$ has an exponential convergence rate for all $t > 0$:
\[
\mathbb{E}[\mathcal{L}(w_t)] \leq \left(1 - \frac{\mu s\, \eta^*(s)}{n}\right)^t \mathcal{L}(w_0), \tag{62}
\]
where we use $\{i_t^{(1)}, i_t^{(2)}, \ldots, i_t^{(s)}\}$ to denote a random mini-batch of the dataset at step $t$.
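For intuition, the following is a minimal illustrative sketch of this exponential decay in expectation, reusing the over-parameterized least-squares setup from the gradient-descent sketch above; the constant step size here is a simple conservative choice, not the exact $\eta^*(s)$ of the theorem.

```python
import numpy as np

# Mini-batch SGD on L(w) = 0.5 * ||A w - y||^2 with n < m.  The mini-batch
# gradient is an unbiased estimate of the full gradient, and (averaged over
# runs) the loss decays roughly geometrically, as in Eq.(62).

rng = np.random.default_rng(5)
n, m, s = 20, 200, 4
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

def loss(w):
    return 0.5 * np.linalg.norm(A @ w - y) ** 2

eta = s / (n * np.linalg.eigvalsh(A @ A.T).max())   # simple conservative step size
avg = np.zeros(101)
for run in range(50):
    w = np.zeros(m)
    for t in range(100):
        avg[t] += loss(w) / 50
        batch = rng.choice(n, size=s, replace=False)
        grad = (n / s) * A[batch].T @ (A[batch] @ w - y[batch])  # unbiased estimate
        w -= eta * grad
    avg[100] += loss(w) / 50
print(avg[::20])   # roughly geometric decay of the expected loss
```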
Then the expectation of the length of the whole optimization path is bounded by
\[
\mathbb{E}\left[\sum_{i=0}^{\infty} \|w_{i+1} - w_i\|\right] = \sum_{i=0}^{\infty} \mathbb{E}\|w_{i+1} - w_i\| \leq \eta^*(s)\, s\sqrt{2\lambda} \sum_{i=0}^{\infty} \left(1 - \frac{\mu s\, \eta^*(s)}{n}\right)^{i/2} \sqrt{\mathcal{L}(w_0)} \leq \frac{2n\sqrt{2\lambda}\,\sqrt{\mathcal{L}(w_0)}}{\mu}.
\]
By Markov's inequality, we have, with probability at least $1 - \delta$, that the length of the path is shorter than $R$, i.e.,
\[
\sum_{i=0}^{\infty} \|w_{i+1} - w_i\| \leq \frac{2n\sqrt{2\lambda\,\mathcal{L}(w_0)}}{\mu\delta} = R.
\]
This means that, with probability at least $1 - \delta$, the whole path is covered by the ball $B(w_0, R)$; namely, for all $t$,
\[
\|w_t - w_0\| \leq \sum_{i=0}^{t-1} \|w_{i+1} - w_i\| \leq R.
\]
For those events in which the whole path is covered by the ball, the satisfaction of the PL∗ condition can be relaxed from the whole space to the ball.