
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks


Chaoyue Liu (a), Libin Zhu (b, c), and Mikhail Belkin (c)

(a) Department of Computer Science and Engineering, The Ohio State University
(b) Department of Computer Science and Engineering, University of California, San Diego
(c) Halicioğlu Data Science Institute, University of California, San Diego

arXiv:2003.00307v2 [cs.LG] 26 May 2021

May 28, 2021

Abstract
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-
based optimization methods applied to large neural networks. The purpose of this work is to propose
a modern view and a general mathematical framework for loss landscapes and efficient optimization
in over-parameterized machine learning models and systems of non-linear equations, a setting that
includes over-parameterized deep neural networks. Our starting observation is that optimization
problems corresponding to such systems are generally not convex, even locally. We argue that instead
they satisfy PL∗ , a variant of the Polyak-Lojasiewicz condition on most (but not all) of the parameter
space, which guarantees both the existence of solutions and efficient optimization by (stochastic)
gradient descent (SGD/GD). The PL∗ condition of these systems is closely related to the condition
number of the tangent kernel associated to a non-linear system, showing how a PL∗-based non-linear
theory parallels classical analyses of over-parameterized linear equations. We show that wide neural
networks satisfy the PL∗ condition, which explains the (S)GD convergence to a global minimum.
Finally we propose a relaxation of the PL∗ condition applicable to “almost” over-parameterized
systems.

1 Introduction
A singular feature of modern machine learning is a large number of trainable model parameters. Just in
the last few years we have seen state-of-the-art models grow from tens or hundreds of millions of parameters
to much larger systems with hundreds of billions [6] or even trillions of parameters [13]. Invariably these models
are trained by gradient descent based methods, such as Stochastic Gradient Descent (SGD) or Adam [18].
Why are these local gradient methods so effective in optimizing complex, highly non-convex systems? In
the past few years an emerging understanding of gradient-based methods has started to focus on the
insight that the optimization dynamics of "modern" over-parameterized models, with more parameters than
constraints, are very different from those of "classical" models, where the number of constraints exceeds
the number of parameters.
The goal of this paper is to provide a modern view of the optimization landscapes, isolating key
mathematical and conceptual elements that are essential for an optimization theory of over-parameterized
models.
We start by characterizing a key difference between under-parameterized and over-parameterized
landscapes. While both are generally non-convex, the nature of the non-convexity is rather different:
under-parameterized landscapes are (generically) locally convex in a sufficiently small neighborhood of
a local minimum. Thus classical analyses apply, if only locally, sufficiently close to a minimum. In
contrast, over-parameterized systems are "essentially" non-convex: they are not convex even in arbitrarily small
neighbourhoods around global minima.

Figure 1: Panel (a): the loss landscape of an under-parameterized model is locally convex at local minima. Panel (b): the loss landscape of an over-parameterized model is incompatible with local convexity, as the set of global minima is not locally linear.

Thus, we cannot expect the extensive theory of convexity-based analyses to apply to such over-
parameterized problems. In contrast, we argue that these systems typically satisfy the Polyak-Lojasiewicz
condition, or more precisely, its slightly modified variant – PL∗ condition, on most (but not all) of the
parameter space. This condition ensures existence of solutions and convergence of GD and SGD, if it
holds in a ball of sufficient radius. Importantly, we show that sufficiently wide neural networks satisfy
the PL∗ condition around their initialization point, thus guaranteeing convergence. In addition, we show
how the PL∗ condition can be relaxed without significantly changing the analysis. We conjecture that
many large systems behave as if they were over-parameterized along the stretch of their optimization path
from initialization to the early stopping point.
In a typical supervised learning task, given a training dataset of size n, D = {x_i, y_i}_{i=1}^n, x_i ∈ R^d, y_i ∈ R,
and a parametric family of models f(w; x), e.g., a neural network, one aims to find a model with parameter
vector w∗ that fits the training data, i.e.,

f (w∗ ; xi ) ≈ yi , i = 1, 2, . . . , n. (1)

Mathematically, this is equivalent to solving, exactly or approximately, a system of n equations¹. Aggregating them in a single map (and suppressing the dependence on the training data in the notation) we write:
\[
F(w) = y, \quad \text{where } w \in \mathbb{R}^m,\ y \in \mathbb{R}^n,\ F(\cdot): \mathbb{R}^m \to \mathbb{R}^n. \tag{2}
\]
Here (F(w))_i := f(w; x_i).
The system in Eq.(2) is solved through minimizing a certain loss function L(w), e.g., the square loss
\[
L(w) = \frac{1}{2}\|F(w) - y\|^2 = \frac{1}{2}\sum_{i=1}^{n} (f(w; x_i) - y_i)^2,
\]
constructed so that the solutions of Eq.(2) are global minimizers of L(w). This is a non-linear least
squares problem, which is well studied in the classical under-parameterized setting (see [28], Chapter
10). An exact solution of Eq.(2) corresponds to interpolation, where a predictor fits the data exactly.
As we discuss below, for over-parameterized systems (m > n), we expect exact solutions to exist.
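To make the setup concrete, here is a minimal numerical sketch of the aggregated map F and the square loss of Eq. (2), for a hypothetical toy model f(w; x) = tanh(w · x); the model, sizes, and data are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 64                                  # n equations, m parameters (m > n)
X = rng.normal(size=(n, m))                   # training inputs x_i, one per row
y = rng.normal(size=n)                        # training labels

F = lambda w: np.tanh(X @ w)                  # (F(w))_i = f(w; x_i)
L = lambda w: 0.5 * np.sum((F(w) - y) ** 2)   # square loss
```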

Essential non-convexity. Our starting point, discussed in detail in Section 3, is the observation that
the loss landscape of an over-parameterized system is generally not convex in any neighborhood of any global minimizer.
¹ The same setup works for multiple outputs. For example, for multi-class classification problems both the prediction f(w; x_i) and the label y_i (a one-hot vector) are c-dimensional vectors, where c is the number of classes. In this case, we are in fact solving n × c equations. Similarly, for multi-output regression with c outputs, we have n × c equations.

This is different from the case of under-parameterized systems, where the loss landscape
is globally not convex, but still typically locally convex, in a sufficiently small neighbourhood of a (locally
unique) minimizer. In contrast, the set of solutions of over-parameterized systems is generically a manifold
of positive dimension [10] (and indeed, systems large enough have no non-global minima [21, 27, 37]).
Unless the solution manifold is linear (which is not generally the case) the landscape cannot be locally
convex. The contrast between over- and under-parameterization is illustrated pictorially in Fig. 1.
The non-zero curvature of the curve of global minimizers at the bottom of the valley in Fig 1(b)
indicates the essential non-convexity of the landscape. In contrast, an under-parameterized landscape
generally has multiple isolated local minima with positive definite Hessian of the loss, where the function
is locally convex. Thus we conclude that
Convexity is not the right framework for analysis of over-parameterized systems, even locally.
Without assistance from local convexity, what alternative mathematical framework can be used to
analyze loss landscapes and optimization behavior of non-linear over-parameterized systems?
In this paper we argue that such a general framework is provided by the Polyak-Lojasiewicz condi-
tion [32, 24], or, more specifically, by its variant that we call PL∗ condition (also called local PL condition
in [29]). We say that a non-negative function L satisfies the µ-PL∗ condition on a set S ⊂ R^m for µ > 0, if
\[
\|\nabla L(w)\|^2 \ge \mu L(w), \quad \forall\, w \in S. \tag{3}
\]

We will now outline some key reasons why the PL∗ condition provides a general framework for analyzing
over-parameterized systems. In particular, we show how it is satisfied by the loss functions of sufficiently
wide neural networks, which are certainly non-convex.

PL∗ condition =⇒ existence of solutions and exponential convergence of (S)GD. The first
key point (see Section 5) is that if L satisfies the µ-PL∗ condition in a ball of radius O(1/µ), then L has
a global minimum in that ball (corresponding to a solution of Eq.(2)). Furthermore, (S)GD initialized
at the center of such a ball² converges exponentially to a global minimum of L. Thus, to establish both
existence of solutions to Eq.(2) and efficient optimization, it is sufficient to verify the PL∗ condition in a
ball of a certain radius.

Analytic form of PL∗ condition via the spectrum of the tangent kernel. Let DF(w) be the
differential of the map F at w, viewed as a n × m matrix. The tangent kernel of F is defined as an n × n
matrix
K(w) = DF(w) DF T (w).
It is clear that K(w) is a positive semi-definite matrix. It can be seen (Section 4) that square loss L is
µ-PL∗ at w, where
µ = λmin (K(w)), (4)
is the smallest eigenvalue of the kernel matrix. Thus verification of the PL∗ condition reduces to analyzing
the spectrum of the tangent kernel matrix associated to F.
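As an illustration of this reduction, the following sketch estimates the tangent kernel K(w) = DF(w)DF(w)^T and its smallest eigenvalue numerically, using finite differences for DF; the toy system F(w) = tanh(Xw) is an assumed stand-in for a real model.

```python
import numpy as np

def tangent_kernel(F, w, eps=1e-6):
    # K(w) = DF(w) DF(w)^T, with DF estimated by central finite differences.
    m = w.shape[0]
    Fw = F(w)
    DF = np.zeros((Fw.shape[0], m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = eps
        DF[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return DF @ DF.T

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))              # n = 3 equations, m = 50 parameters
F = lambda w: np.tanh(X @ w)
K = tangent_kernel(F, rng.normal(size=50))
mu = np.linalg.eigvalsh(K).min()          # candidate PL* constant, Eq. (4)
print(K.shape, mu)
```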
It is worthwhile to compare this to the standard analytical condition for convexity, requiring that the
Hessian of the loss function, HL , is positive definite. While these spectral conditions appear to be similar,
the similarity is superficial as K(w) contains first order derivatives, while the Hessian is a second order
operator. Hence, as we discuss below, the tangent kernel and the Hessian have very different properties
under coordinate transformations and in other settings.

Why PL∗ holds across most of the parameter space for over-parameterized systems. We will
now discuss the intuition why µ-PL∗ for a sufficiently small µ should be expected to hold in the domain
containing most (but not all) of the parameter space Rm . The intuition is based on simple parameter
counting. For the simplest example consider the case n = 1. In that case the tangent kernel is a scalar and K(w) = ‖DF(w)‖² is singular if and only if ‖DF(w)‖ = 0.

² The constant in O(1/µ) is different for GD and SGD.

Figure 2: The loss function L(w) is µ-PL∗ inside the shaded domain. The singular set corresponds to
parameters w with degenerate tangent kernel K(w). Every ball of radius R = O(1/µ) within the shaded
set intersects the set of global minima of L(w), i.e., solutions to F(w) = y.

Thus, by parameter counting, we expect the singular set {w : K(w) = 0} to have co-dimension m and thus, generically, to consist of isolated points. Because of Eq.(4) we generally expect most points sufficiently removed from the singular set to satisfy the PL∗ condition. By a similar parameter counting argument, the singular set of w such that K(w) is not full rank will be of co-dimension m − n + 1. As long as m > n, we expect the surroundings of the singular set, where the PL∗ condition does not hold, to be "small" compared to the totality of the parameter space. Furthermore, the larger the degree of model over-parameterization (m − n) is, the smaller the singular set is expected to be. This intuition is represented graphically in Fig. 2.
Note that under-parameterized systems are always rank deficient and have λmin (K(w)) ≡ 0. Hence
such systems never satisfy PL∗ .

Wide neural networks satisfy the PL∗ condition. We show that wide neural networks, as special cases
of over-parameterized models, are PL∗, using the technical tools developed in Section 4. This property of
neural networks is a key step toward understanding the success of gradient descent based optimization,
as seen in Section 5. To show the PL∗ condition for wide neural networks, a powerful, if somewhat
crude, tool is provided by the remarkable recent discovery [16] that the tangent kernels (the so-called NTK) of
wide neural networks with a linear output layer are nearly constant in a ball B of a certain radius around
a random center. More precisely, it can be shown that the norm of the Hessian tensor satisfies
‖H_F(w)‖ = Õ(1/√m), where m is the width of the neural network and w ∈ B [23].
Furthermore (see Section 4.1):
\[
|\lambda_{\min}(K(w)) - \lambda_{\min}(K(w_0))| < O\Big(\sup_{w \in B} \|H_F(w)\|\Big) = O(1/\sqrt{m}).
\]
Bounding the kernel eigenvalue at the initialization point λ_min(K(w_0)) from below (using the results
from [12, 11]) completes the analysis.
To prove the result for general neural networks (note that the Hessian norm is typically large for
such networks [23]) we observe that they can be decomposed as a composition of a network with a
linear output layer and a coordinate-wise non-linear transformation of the output. The PL∗ condition is
preserved under well-behaved non-linear transformations which yields the result.

Relaxing the PL∗ condition: when the systems are almost over-parameterized. Finally, a framework for understanding large systems cannot be complete without addressing the question of the transition between under-parameterized and over-parameterized systems. While neural network models such as those in [6, 13] have billions or trillions of parameters, they are often trained on very large datasets and thus have a comparable number of constraints. Thus in the process of optimization they may initially appear over-parameterized, while no truly interpolating solution may exist. Such an optimization landscape is shown in Fig. 3. While an in-depth analysis of this question is beyond the scope of this work, we propose a version of the PL∗ condition that can account for such behavior by postulating that the PL∗ condition is satisfied for values of L(w) above a certain threshold and that the optimization path from the initialization to the early stopping point lies in the PL∗ domain. Approximate convergence of (S)GD can be shown under this condition, see Section 6.

Figure 3: The loss landscape of "almost over-parameterized" systems. The landscape looks over-parameterized except for the grey area where the loss is small. Local minima of the loss are contained there.

1.1 Related work


Loss landscapes of over-parameterized machine learning models, especially deep neural networks, have
recently been studied in a number of papers including [10, 21, 27, 37, 19, 25]. The work [10] suggests
that the solutions of an over-parameterized system typically are not discrete and form a lower dimen-
sional manifold in the parameter space. In particular [21, 27] show that sufficiently over-parameterized
neural networks have no strict local minima under certain mild assumptions. Furthermore in [19] it
is proved that, for sufficiently over-parameterized neural networks, each local minimum (if it exists) is
“path connected” to a global minimum where the loss value on the path is non-increasing. While the
above works mostly focus on the properties of the minima of the loss landscape, in this paper we focus on the
landscape within neighborhoods of these minima, which can often cover the whole optimization path of
gradient-based methods.
There has also been a rich literature that particularly focuses on the optimization of wide neural
networks using gradient descent [33, 12, 11, 20, 3, 2, 17, 30], and SGD [1, 38], especially after the
discovery of the constancy of neural tangent kernels (NTK) for certain neural networks [16]. Many of
these analyses are based on the observation that the constancy of the tangent kernel implies that the
training dynamics of these networks are approximately those of a linear model [20]. This type of analysis
can be viewed within our framework of Hessian norm control.
The Polyak-Lojasiewicz (PL) condition has attracted interest in connection with convergence of neural
networks and other non-linear systems, as it allows one to establish, in non-convex settings, some of the key
optimization properties usually associated with convexity. For example, the work [8] proved that a composition of
a strongly convex function with a one-layer leaky ReLU network satisfies the PL condition. The paper [4]
proved fast convergence of stochastic gradient descent with constant step size for over-parameterized
models, while [29] refined this result, showing SGD convergence within a ball of a certain radius. The
related work [14] shows that the optimization path length can be upper bounded.
Finally, it is interesting to point out the connections with the contraction theory for differential
equations explored in [36].

2 Notation and standard definitions


We use bold lowercase letters, e.g., v, to denote vectors, capital letters, e.g., W, to denote matrices,
and bold capital letters, e.g., W, to denote tuples of matrices or higher order tensors. We denote the
set {1, 2, · · · , n} as [n]. Given a function σ(·), we use σ'(·) and σ''(·) to denote its first and second
derivatives, respectively. For vectors, we use ‖·‖ to denote the Euclidean norm and ‖·‖_∞ for the ∞-norm. For
matrices, we use ‖·‖_F to denote the Frobenius norm and ‖·‖_2 to denote the spectral norm (i.e., 2-norm).
We use DF to represent the differential map of F : R^m → R^n. Note that DF is represented as an n × m
matrix, with (DF)_{ij} := ∂F_i/∂w_j. We denote the Hessian of the map F by H_F, which is an n × m × m tensor
with (H_F)_{ijk} = ∂²F_i/(∂w_j ∂w_k), and define the norm of the Hessian tensor to be the maximum of the spectral
norms of its components, i.e., ‖H_F‖ = max_{i∈[n]} ‖H_{F_i}‖_2, where H_{F_i} = ∂²F_i/∂w². When
necessary, we also assume that H_F is continuous. We also denote the Hessian matrix of the loss function
by H_L := ∂²L/∂w², which is an m × m matrix. We denote the smallest eigenvalue of a matrix K as
λ_min(K).
In this paper, we consider the problem of solving a system of equations of the form Eq.(2), i.e., finding
w such that F(w) = y. This problem is solved by minimizing a loss function L(F(w), y), such as the
square loss L(w) = ½‖F(w) − y‖², with gradient-based algorithms. Specifically, the gradient descent
method starts from the initialization point w_0 and updates the parameters as follows:
\[
w_{t+1} = w_t - \eta \nabla L(w_t), \quad \forall t \in \mathbb{N}. \tag{5}
\]
We call the set {w_t}_{t=0}^{\infty} the optimization path, and put w_∞ = lim_{t→∞} w_t (assuming the limit exists).
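The update of Eq. (5) can be transcribed directly; the sketch below does so for the square loss of a hypothetical model F(w) = tanh(Xw) (the model, step size, and data are illustrative assumptions).

```python
import numpy as np

def gradient_descent(grad_L, w0, eta, num_steps):
    # Plain gradient descent, Eq. (5): w_{t+1} = w_t - eta * grad L(w_t).
    path = [np.asarray(w0, dtype=float)]
    for _ in range(num_steps):
        path.append(path[-1] - eta * grad_L(path[-1]))
    return path                               # the optimization path {w_t}

# For the square loss, grad L(w) = DF(w)^T (F(w) - y); here F(w) = tanh(X w)
# is a hypothetical toy model used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))
y = rng.normal(size=5)
grad_L = lambda w: ((1 - np.tanh(X @ w) ** 2) * (np.tanh(X @ w) - y)) @ X
path = gradient_descent(grad_L, rng.normal(size=40), eta=0.01, num_steps=1000)
```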

Throughout this paper, we assume the map F is Lipschitz continuous and smooth. See the definitions
below:
Definition 1 (Lipschitz continuity). A map F : R^m → R^n is L_F-Lipschitz continuous if
\[
\|F(w) - F(v)\| \le L_F \|w - v\|, \quad \forall w, v \in \mathbb{R}^m. \tag{6}
\]
Remark 1. In supervised learning cases, (F(w))_i = f(w; x_i). If we denote by L_f the Lipschitz constant of the machine learning model f(w; x) w.r.t. the parameters w, then ‖F(w) − F(v)‖² = Σ_i (f(w; x_i) − f(v; x_i))² ≤ n L_f² ‖w − v‖². One therefore has L_F ≤ √n L_f.

A direct consequence of the L_F-Lipschitz condition is that ‖DF(w)‖_2 ≤ L_F for all w ∈ R^m. Furthermore, it is easy to see that the tangent kernel has bounded spectral norm:

Proposition 1. If the map F is L_F-Lipschitz, then ‖K(w)‖_2 ≤ L_F².

Definition 2 (Smoothness). A map F : R^m → R^n is β_F-smooth if
\[
\|F(w) - F(v) - DF(v)(w - v)\| \le \frac{\beta_F}{2}\|w - v\|^2, \quad \forall w, v \in \mathbb{R}^m. \tag{7}
\]

Jacobian matrix. Given a map Φ : z ∈ R^n ↦ Φ(z) ∈ R^n, the corresponding Jacobian matrix J_Φ is defined entry-wise as
\[
(J_\Phi)_{ij} := \frac{\partial \Phi_i}{\partial z_j}. \tag{8}
\]

3 Essential non-convexity of loss landscapes of over-parameterized non-linear systems
In this section we discuss the observation that over-parameterized systems give rise to landscapes that are
essentially non-convex – there typically exists no neighborhood around any global minimizer where the
loss landscape is convex. This is in contrast to under-parameterized systems where such a neighborhood
typically exists, although it can be small.
For an over-parameterized system of equations, the number of parameters m is larger than the number
of constraints n. In this case, the system of equations generically has exact solutions, which form a
continuous manifold (or several continuous manifolds), typically of dimension m − n > 0, so that none of
the solutions are isolated. A specific result for wide neural networks, showing non-existence of isolated
global minima of the loss function is given in Proposition 6 (Appendix A).
It is important to note that such continuous manifolds of solutions generically have non-zero curvature³, due to the non-linear nature of the system of equations. This results in the lack of local convexity of the loss landscape, i.e., the loss landscape is non-convex in any neighborhood of a solution (i.e., a global minimizer of L). This can be seen from the following argument. Suppose that the landscape of the loss function L is convex within a ball B that intersects the set of global minimizers M. The minimizers of a convex function within a convex domain form a convex set, thus M ∩ B must also be convex. Hence M ∩ B must be a convex subset of a lower dimensional linear subspace of R^m and cannot have curvature (either extrinsic or intrinsic). This geometric intuition is illustrated in Fig. 1b, where the set of global minimizers is of dimension one.
To provide an alternative analytical intuition (leading to a precise argument) consider an over-parameterized system F(w) : R^m → R^n, where m > n, with the square loss function L(F(w), y) = ½‖F(w) − y‖². The Hessian matrix of the loss function takes the form
\[
H_L(w) = \underbrace{DF(w)^T\, \frac{\partial^2 L}{\partial F^2}\, DF(w)}_{A(w)} + \underbrace{\sum_{i=1}^{n} (F(w) - y)_i\, H_{F_i}(w)}_{B(w)}, \tag{9}
\]
where H_{F_i}(w) ∈ R^{m×m} is the Hessian matrix of the i-th output of F with respect to w.
Let w∗ be a solution to Eq.(2). Since w∗ is a global minimizer of L, B(w∗ ) = 0. We note that A(w∗ )
is a positive semi-definite matrix of rank at most n with at least m − n zero eigenvalues.
Yet, in a neighbourhood of w∗ there typically are points where B(w) has rank m. As we show below,
this observation, together with a mild technical assumption on the loss, implies that H_L(w) cannot be
positive semi-definite in any ball around w∗, and hence L is not locally convex there. To see why this is the case,
consider an example of a system with only one equation (n = 1). The loss function takes the form
L(w) = ½(F(w) − y)², y ∈ R, and the Hessian of the loss function can be written as
\[
H_L(w) = DF(w)^T DF(w) + (F(w) - y)\, H_F(w). \tag{10}
\]

Let w∗ be a solution, F(w∗) = y, and suppose that DF(w∗) does not vanish. In any neighborhood of w∗,
there exist arbitrarily close points w∗ + δ and w∗ − δ such that F(w∗ + δ) − y > 0 and F(w∗ − δ) − y < 0.
Assuming that the rank of H_F(w∗) is greater than one, and noting that the rank of DF(w)^T DF(w)
is at most one, it is easy to see that either H_L(w∗ + δ) or H_L(w∗ − δ) must have a negative eigenvalue,
which rules out local convexity at w∗.
A more general version of this argument is given in the following:
Proposition 2 (Local non-convexity). Let L(w∗) = 0 and, furthermore, assume that DF(w∗) ≠ 0 and
rank(H_{F_i}(w∗)) > 2n for at least one i ∈ [n]. Then L(w) is not convex in any neighborhood of w∗.

Remark 2. Note that in general we do not expect DF to vanish at w∗, as there is no reason why a solution to Eq.(2) should be a critical point of F. For a general loss L(w), the assumption in Proposition 2 that DF(w∗) ≠ 0 is replaced by \frac{d}{dw}\frac{\partial L}{\partial F}(w^*) \neq 0.

A full proof of Prop. 2 for a general loss function can be found in Appendix B.
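The sign-flip argument above is also easy to check numerically. The sketch below does so for a hypothetical one-equation system F(w) = Σ_k tanh(w_k x_k), whose Hessian generically has rank larger than one, using a finite-difference Hessian of the loss; all specific choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8
x = rng.normal(size=m)
w_star = rng.normal(size=m)
y = np.sum(np.tanh(w_star * x))                        # makes w* an exact solution
L = lambda w: 0.5 * (np.sum(np.tanh(w * x)) - y) ** 2  # square loss, n = 1

def hessian_fd(L, w, eps=1e-3):
    # Central finite-difference approximation of the m x m Hessian of L.
    m = w.size
    H = np.zeros((m, m))
    I = np.eye(m) * eps
    for i in range(m):
        for j in range(m):
            H[i, j] = (L(w + I[i] + I[j]) - L(w + I[i] - I[j])
                       - L(w - I[i] + I[j]) + L(w - I[i] - I[j])) / (4 * eps ** 2)
    return H

delta = 1e-2 * rng.normal(size=m)
for v in (w_star + delta, w_star - delta):
    # At least one of the two sides is expected to show a negative eigenvalue.
    print(np.linalg.eigvalsh(hessian_fd(L, v)).min())
```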

Comparison to under-parameterized systems. For under-parameterized systems, local minima


are generally isolated, as illustrated in Figure 1a. Since HL (w∗ ) is generically positive definite when w∗
is an isolated local minimizer, by the continuity of HL (·), positive definiteness holds in the neighborhood
of w∗ . Therefore, L(w) is locally convex around w∗ .

³ Both extrinsic (the second fundamental form) and intrinsic (Gaussian) curvature.

4 Over-parameterized non-linear systems are PL∗ on most of the parameter space
In this section we argue that loss landscapes of over-parameterized systems satisfy the PL∗ condition
across most of their parameter space.
We begin with a sufficient analytical condition, the uniform conditioning of a system, which is closely related to
PL∗.

Definition 3 (Uniform conditioning). We say that F(w) is µ-uniformly conditioned (µ > 0) on S ⊂ Rm


if the smallest eigenvalue of its tangent kernel K(w) satisfies

λmin (K(w)) ≥ µ, ∀w ∈ S. (11)

The following important connection shows that uniform conditioning of the system is sufficient for the
corresponding square loss landscape to satisfy the PL∗ condition.
Theorem 1 (Uniform conditioning ⇒ PL∗ condition). If F(w) is µ-uniformly conditioned on a set
S ⊂ R^m, then the square loss function L(w) = ½‖F(w) − y‖² satisfies the µ-PL∗ condition on S.
Proof.
\[
\frac{1}{2}\|\nabla L(w)\|^2 = \frac{1}{2}(F(w)-y)^T K(w)(F(w)-y) \ge \frac{1}{2}\lambda_{\min}(K(w))\|F(w)-y\|^2 = \lambda_{\min}(K(w))\,L(w) \ge \mu L(w).
\]
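The inequality chain in this proof can be verified numerically at random parameter points. The sketch below does so for a hypothetical toy system F(w) = tanh(Xw) with the square loss; the model and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 50
X = rng.normal(size=(n, m))
y = rng.normal(size=n)

def F(w):  return np.tanh(X @ w)
def DF(w): return (1.0 - np.tanh(X @ w) ** 2)[:, None] * X     # n x m Jacobian

for _ in range(5):
    w = rng.normal(size=m)
    r = F(w) - y
    grad_L = DF(w).T @ r                       # gradient of L(w) = 0.5 ||F(w) - y||^2
    lam_min = np.linalg.eigvalsh(DF(w) @ DF(w).T).min()
    loss = 0.5 * r @ r
    # The chain from the proof: 0.5 * ||grad L||^2 >= lambda_min(K(w)) * L(w).
    assert 0.5 * grad_L @ grad_L >= lam_min * loss - 1e-9
```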

We will now provide some intuition for why we expect λmin (K(w)) to be separated from zero over
most but not all of the optimization landscape. The key observation is that

\[
\operatorname{rank} K(w) = \operatorname{rank}\big(DF(w)\, DF(w)^T\big) = \operatorname{rank} DF(w).
\]
Note that kernel K(w) is an n × n positive semi-definite matrix by definition. Hence the singular set
Ssing , where the tangent kernel is degenerate, can be written as

Ssing = {w ∈ Rm | λmin (K(w)) = 0} = {w ∈ Rm | rank DF(w) < n}.

Generically, rank DF(w) = min(m, n). Thus for over-parameterized systems, when m > n, we expect
Ssing to have positive codimension and to be a set of measure zero. In contrast, in under-parametrized
settings, m < n and the tangent kernel is always degenerate, λmin (K(w)) ≡ 0 so such systems cannot be
uniformly conditioned according to the definition above. Furthermore, while Eq.(11) provides a sufficient
condition, it’s also in a certain sense necessary:
Proposition 3. If λmin (K(w0 )) = 0 then the system F(w) = y cannot satisfy PL∗ condition for all y
on any set S containing w0 .
Proof. Since λmin (K(w0 )) = 0, the matrix K(w0 ) has a non-trivial null-space. Therefore we can choose
y so that K(w0 )(F(w0 ) − y) = 0 and F(w0 ) − y 6= 0. We have
\[
\frac{1}{2}\|\nabla L(w_0)\|^2 = \frac{1}{2}(F(w_0)-y)^T K(w_0)(F(w_0)-y) = 0,
\]
and hence the PL∗ condition cannot be satisfied at w0 .
Thus we see that only over-parameterized systems can be PL∗ , if we want that condition to hold for
any label vector y.
By parameter counting, it is easy to see that the codimension of the singular set S_sing is expected to
be m − n + 1. Thus, on a compact set, for µ sufficiently small, points which are not µ-PL∗ will be found
around S_sing, a low-dimensional subset of R^m.
Figure 4: µ-PL∗ domain and the singular set. We expect points away from the singular set to satisfy
µ-PL∗ condition for sufficiently small µ.

This is represented graphically in Fig. 4. Note that the
more over-parameterization we have, the larger the codimension is expected to be.
To see a particularly simple example of this phenomenon, consider the case when DF(w) is a random
matrix with Gaussian entries⁴. Denote by
\[
\kappa = \frac{\lambda_{\max}(K(w))}{\lambda_{\min}(K(w))}
\]
the condition number of K(w). Note that by definition κ ≥ 1. It is shown in [9] that (assuming m > n)
\[
\mathbb{E}(\log\kappa) < 2\log\frac{m}{m-n+1} + 5.
\]
We see that as the amount of over-parameterization increases, κ converges in expectation (and, as can also be shown, with high probability) to a small constant. While this example is rather special, it is representative of the concept that over-parameterization helps with conditioning. In particular, as we discuss below in Section 4.2, wide neural networks satisfy the PL∗ condition with high probability within a random ball of a constant radius.
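A small simulation along the lines of this Gaussian example, with a random n × m matrix standing in for DF(w), illustrates how the condition number of K concentrates as m/n grows; the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
for m in (60, 100, 400, 1600):
    kappas = []
    for _ in range(20):
        DF = rng.normal(size=(n, m))       # random "DF(w)" with Gaussian entries
        eig = np.linalg.eigvalsh(DF @ DF.T)
        kappas.append(eig.max() / eig.min())
    print(f"m = {m:5d}: average condition number of K = {np.mean(kappas):.2f}")
```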
We will now provide techniques for proving that the PL∗ condition holds for specific systems and later
in Section 4.2 we will show how these techniques apply to wide neural networks, demonstrating that they
are PL∗ . In Section 5 we discuss the implications of the PL∗ condition for the existence of solutions and
convergence of (S)GD, in particular for deep neural networks.

4.1 Techniques for establishing the PL∗ condition


While we expect typical over-parameterized systems to satisfy the PL∗ condition at most points, directly
analyzing the smallest eigenvalue of the corresponding kernel matrix is often difficult. Below we describe
two methods for demonstrating that the PL∗ condition holds in a ball of a certain radius. The first method is
based on the observation that a system is well-conditioned in a ball provided it is well-conditioned at the center
of the ball and the Hessian norm is not large compared to the radius. Interestingly, this "Hessian control"
condition holds for a broad class of non-linear systems. In particular, as discussed in Sec 4.2, wide neural
networks with a linear output layer have small Hessian norm.
⁴ In this example the matrix "DF(w)" does not have to correspond to an actual map F; rather, the intention is to illustrate how over-parameterization leads to better conditioning.

For some intuition on why this appears to be a feature of many large systems, see Appendix C.
The second approach to demonstrating conditioning is by noticing that it is preserved under well-
behaved transformations of the input or output or when composing certain models.
Combining these methods yields a general result on PL∗ condition for wide neural networks in Sec-
tion 4.2.

Uniform conditioning through Hessian spectral norm. We will now show that controlling the
norm of the Hessian tensor for the map F leads to well-conditioned systems. The idea of the analysis is
that the change of the tangent kernel of F(w) can be bounded in terms of the norm of the Hessian of
F(w). Intuitively, this is analogous to the mean value theorem, bounding the first derivative of F by its
second derivative. If the Hessian norm is sufficiently small, the change of the tangent kernel and hence
its conditioning can be controlled in a ball B(w0 , R) with a finite radius, as long as the tangent kernel
matrix at the center point K(w0 ) is well-conditioned.
Theorem 2 (PL∗ condition via Hessian norm). Given a point w_0 ∈ R^m, suppose the tangent kernel
matrix K(w_0) is strictly positive definite, i.e., λ_0 := λ_min(K(w_0)) > 0. If the Hessian spectral norm satisfies
\[
\|H_F\| \le \frac{\lambda_0 - \mu}{2 L_F \sqrt{n}\, R}
\]
within the ball B(w_0, R) for some R > 0 and µ > 0, then the tangent kernel
K(w) is µ-uniformly conditioned in the ball B(w_0, R). Hence, the square loss L(w) satisfies the µ-PL∗
condition in B(w_0, R).

Proof. First, let us consider the difference between the tangent kernel matrices at w ∈ B(w_0, R) and at w_0.
By the assumption, we have, for all v ∈ B(w_0, R), ‖H_F(v)‖ ≤ (λ_0 − µ)/(2L_F√n R). Hence, for each i ∈ [n], ‖H_{F_i}(w)‖_2 ≤ (λ_0 − µ)/(2L_F√n R). Now, consider an arbitrary point w ∈ B(w_0, R). For all i ∈ [n], we have:
\[
DF_i(w) = DF_i(w_0) + \int_0^1 H_{F_i}(w_0 + \tau(w - w_0))\,(w - w_0)\, d\tau. \tag{12}
\]
Since τ is in [0, 1], w_0 + τ(w − w_0) is on the line segment S(w_0, w) between w_0 and w, which is inside the ball B(w_0, R). Hence,
\[
\|DF_i(w) - DF_i(w_0)\| \le \sup_{\tau\in[0,1]} \|H_{F_i}(w_0 + \tau(w - w_0))\|_2 \cdot \|w - w_0\| \le \frac{\lambda_0 - \mu}{2L_F\sqrt{n}\,R} \cdot R = \frac{\lambda_0 - \mu}{2L_F\sqrt{n}}.
\]
In the second inequality above, we used the fact that ‖H_{F_i}‖_2 ≤ (λ_0 − µ)/(2L_F√n R) in the ball B(w_0, R). Hence,
\[
\|DF(w) - DF(w_0)\|_F = \sqrt{\sum_{i\in[n]} \|DF_i(w) - DF_i(w_0)\|^2} \le \sqrt{n}\,\frac{\lambda_0 - \mu}{2L_F\sqrt{n}} = \frac{\lambda_0 - \mu}{2L_F}.
\]
Then, the spectral norm of the change of the tangent kernel is bounded by
\[
\begin{aligned}
\|K(w) - K(w_0)\|_2 &= \|DF(w)DF(w)^T - DF(w_0)DF(w_0)^T\|_2 \\
&= \|DF(w)\,(DF(w) - DF(w_0))^T + (DF(w) - DF(w_0))\,DF(w_0)^T\|_2 \\
&\le \|DF(w)\|_2 \|DF(w) - DF(w_0)\|_2 + \|DF(w) - DF(w_0)\|_2 \|DF(w_0)\|_2 \\
&\le L_F \cdot \|DF(w) - DF(w_0)\|_F + \|DF(w) - DF(w_0)\|_F \cdot L_F \\
&\le 2L_F \cdot \frac{\lambda_0 - \mu}{2L_F} = \lambda_0 - \mu.
\end{aligned}
\]
In the second inequality above, we used the L_F-Lipschitz continuity of F and the fact that ‖A‖_2 ≤ ‖A‖_F for a matrix A.
By the triangle inequality, we have, at any point w ∈ B(w_0, R),
\[
\lambda_{\min}(K(w)) \ge \lambda_{\min}(K(w_0)) - \|K(w) - K(w_0)\|_2 \ge \mu. \tag{13}
\]
Hence, the tangent kernel is µ-uniformly conditioned in the ball B(w_0, R).
By Theorem 1, we immediately have that the square loss L(w) satisfies the µ-PL∗ condition in the ball B(w_0, R).
Below, in Section 4.2, we will see that wide neural networks with a linear output layer have small Hessian
norm (Theorem 5). An illustration for a class of large models is given in Appendix C.

Conditioning of transformed systems. We now discuss why the conditioning of a system F(w) = y
is preserved under transformations of the domain or range of F, as long as the original system is well-conditioned and the transformation has a bounded inverse.
Remark 3. Note that even if the original system has a small Hessian norm, there is no such guarantee
for the transformed system.
Consider a transformation Φ : R^n → R^n that, composed with F, results in a new transformed system
Φ ∘ F(w) = y. Put
\[
\rho := \inf_{w\in B(w_0,R)} \frac{1}{\|J_\Phi^{-1}(w)\|_2}, \tag{14}
\]
where J_Φ(w) := J_Φ(F(w)) is the Jacobian of Φ evaluated at F(w). We will assume that ρ > 0.
Theorem 3. If a system F is µ-uniformly conditioned in a ball B(w_0, R) with R > 0, then the transformed system Φ ∘ F(w) is µρ²-uniformly conditioned in B(w_0, R). Hence, the square loss function
½‖Φ ∘ F(w) − y‖² satisfies the µρ²-PL∗ condition in B(w_0, R).

Proof. First, note that
\[
K_{\Phi\circ F}(w) = \nabla(\Phi\circ F)(w)\,\nabla(\Phi\circ F)^T(w) = J_\Phi(w)\,K_F(w)\,J_\Phi(w)^T.
\]
Hence, if F(w) is µ-uniformly conditioned in B(w_0, R), i.e., λ_min(K_F(w)) ≥ µ, we have for any v ∈ R^n with ‖v‖ = 1,
\[
\begin{aligned}
v^T K_{\Phi\circ F}(w)\, v &= (J_\Phi(w)^T v)^T K_F(w)\,(J_\Phi(w)^T v) \\
&\ge \lambda_{\min}(K_F(w))\, \|J_\Phi(w)^T v\|^2 \\
&\ge \lambda_{\min}(K_F(w)) / \|J_\Phi^{-1}(w)\|_2^2 \ \ge\ \mu\rho^2.
\end{aligned}
\]
Applying Theorem 1, we immediately obtain that ½‖Φ ∘ F(w) − y‖² satisfies the µρ²-PL∗ condition in B(w_0, R).
Remark 4. A result analogous to Theorem 3 is easy to obtain for a system with transformed input,
F ∘ Ψ(w) = y. Assume the transformation map Ψ : R^m → R^m, which applies to the input w of the system, satisfies
\[
\rho := \inf_{w\in B(w_0,R)} \frac{1}{\|J_\Psi^{-1}(w)\|_2} > 0. \tag{15}
\]
If F is µ-uniformly conditioned with respect to Ψ(w) in B(w_0, R), then an analysis similar to Theorem 3
shows that F ∘ Ψ is also µρ²-uniformly conditioned in B(w_0, R).

Conditioning of composition models. Although composing different large models often leads to
non-constant tangent kernels, the corresponding tangent kernels can also be uniformly conditioned, under
certain conditions. Consider the composition of two models h := g ∘ f, where f : R^d → R^{d'} and
g : R^{d'} → R^{d''}. Denote by w_f and w_g the parameters of the models f and g respectively. Then the
parameters of the composition model h are w := (w_g, w_f). Examples of composition models include
"bottleneck" neural networks, where the modules below or above the bottleneck layer can be considered
as the composing (sub-)models.
Let us denote the tangent kernel matrices of models g and f by K_g(w_g; Z) and K_f(w_f; X) respectively,
where the second arguments, Z and X, are the datasets on which the tangent kernel matrices are evaluated.
Given a dataset D = {(x_i, y_i)}_{i=1}^n, denote f(D) := {(f(x_i), y_i)}_{i=1}^n.
Proposition 4. Consider the composition model h = g ∘ f with parameters w = (w_g, w_f). Given a
dataset D, the tangent kernel matrix of h takes the form:
\[
K_h(w; D) = K_g(w_g; f(D)) + J_g(f(D))\, K_f(w_f; D)\, J_g(f(D))^T,
\]
where J_g(f(D)) ∈ R^{d''×d'} is the Jacobian of g w.r.t. f evaluated on f(D).
From the above proposition, we see that the tangent kernel of the composition model h decomposes into
the sum of two positive semi-definite matrices; hence the minimum eigenvalue of K_h can be lower bounded by
\[
\lambda_{\min}(K_h(w)) \ge \lambda_{\min}(K_g(w_g; f(D))). \tag{16}
\]
We note that g takes as its inputs the outputs of f, which depend on w_f, while for the model f the
inputs are fixed. Hence, if the model g is uniformly conditioned at all the inputs provided by the model
f, we can expect uniform conditioning of the composition model h.
We provide a simple illustrative example for a “bottleneck” neural network in Appendix D.
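Proposition 4 can also be checked directly on a small example. The sketch below does so for a hypothetical scalar "bottleneck" composition h = g ∘ f, with f(x) = tanh(a·x) and g(z) = tanh(bz), comparing the decomposition with the kernel computed from the full parameter gradient; the models are illustrative assumptions, not the construction of Appendix D.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
a = rng.normal(size=d)          # parameters of f, with f(x) = tanh(a . x)
b = rng.normal()                # parameter of g, with g(z) = tanh(b z)

fX   = np.tanh(X @ a)                              # f evaluated on the data
Df   = (1 - np.tanh(X @ a) ** 2)[:, None] * X      # gradients of f w.r.t. a   (n x d)
Dg_b = (1 - np.tanh(b * fX) ** 2) * fX             # gradients of g w.r.t. b   (n,)
Jg   = np.diag((1 - np.tanh(b * fX) ** 2) * b)     # Jacobian of g w.r.t. its input

K_f = Df @ Df.T
K_g = np.outer(Dg_b, Dg_b)
K_h_decomposed = K_g + Jg @ K_f @ Jg.T             # right-hand side of Proposition 4

# Direct computation from the full parameter gradient of h = g(f(x)), w = (b, a).
Dh = np.concatenate([Dg_b[:, None], Jg @ Df], axis=1)
assert np.allclose(Dh @ Dh.T, K_h_decomposed)
```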

4.2 Wide neural networks satisfy PL∗ condition


In this subsection, we show that wide neural networks satisfy the PL∗ condition, using the techniques we
developed in the last subsection.
An L-layer (feedforward) neural network f(W; x), with parameters W and input x, is defined as follows:
\[
\begin{aligned}
\alpha^{(0)} &= x, \\
\alpha^{(l)} &= \sigma_l\!\left(\frac{1}{\sqrt{m_{l-1}}}\, W^{(l)} \alpha^{(l-1)}\right), \quad \forall l = 1, 2, \cdots, L+1, \\
f(W; x) &= \alpha^{(L+1)}.
\end{aligned} \tag{17}
\]

Here, m_l is the width (i.e., the number of neurons) of the l-th layer, α^{(l)} ∈ R^{m_l} denotes the vector of l-th
hidden layer neurons, W := {W^{(1)}, W^{(2)}, ..., W^{(L)}, W^{(L+1)}} denotes the collection of the parameters (or
weights) W^{(l)} ∈ R^{m_l × m_{l−1}} of each layer, and σ_l is the activation function of the l-th layer, e.g., sigmoid,
tanh, or linear activation. We also denote the width of the neural network by m := min_{l∈[L]} m_l, i.e., the
minimal width of the hidden layers. In the following analysis, we assume that the activation functions σ_l
are twice differentiable. Although this assumption excludes ReLU, we believe the same results also
apply when the hidden layer activation functions are ReLU.
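For concreteness, the definition in Eq. (17) can be transcribed directly in code; in the sketch below the widths are arbitrary, all activations (including the output activation σ_{L+1}) are taken to be tanh, and the weights are drawn W^{(l)} ∼ N(0, I), as in Section 4.2.

```python
import numpy as np

def init_params(layer_widths, rng):
    # layer_widths = [m_0 (input dim), m_1, ..., m_{L+1} (output dim)];
    # each W^{(l)} has i.i.d. N(0, 1) entries.
    return [rng.normal(size=(layer_widths[l + 1], layer_widths[l]))
            for l in range(len(layer_widths) - 1)]

def forward(W, x, sigma=np.tanh):
    # Eq. (17): alpha^{(l)} = sigma_l( W^{(l)} alpha^{(l-1)} / sqrt(m_{l-1}) ).
    alpha = x
    for W_l in W:
        alpha = sigma(W_l @ alpha / np.sqrt(W_l.shape[1]))
    return alpha

rng = np.random.default_rng(0)
W0 = init_params([10, 256, 256, 1], rng)
y_hat = forward(W0, rng.normal(size=10))
```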
Remark 5. The above definition of neural networks does not include convolutional (CNN) and residual
(ResNet) neural networks. In Appendix E, we show that both CNN and ResNet also satisfy the PL∗
condition. Please see the definitions and analysis there.
We study the loss landscape of wide neural networks in regions around randomly chosen points in
parameter space. Specifically, we consider the ball B(W_0, R), which has a fixed radius R > 0 (we will
see later, in Section 5, that R can be chosen to cover the whole optimization path) and is centered at a
random parameter point W_0, i.e., W_0^{(l)} ∼ N(0, I_{m_l × m_{l−1}}) for l ∈ [L + 1]. Note that such a random
parameter point W_0 is a common choice to initialize a neural network. Importantly, the tangent kernel
matrix at W_0 is generally strictly positive definite, i.e., λ_min(K(W_0)) > 0. Indeed, this is proven
for infinitely wide networks as long as the training data is not degenerate (see Theorem 3.1 of [12]
and Propositions F.1 and F.2 of [11]). As for finite width networks, with high probability w.r.t. the
initialization randomness, the tangent kernel K(W_0) is close to that of the infinite network and the
minimum eigenvalue λ_min(K(W_0)) = O(1).
Using the techniques in Section 4.1, the following theorem shows that a neural network with sufficient
width satisfies the PL∗ condition in a ball of any fixed radius around W_0, as long as the tangent kernel
K(W_0) is strictly positive definite.

Theorem 4 (Wide neural networks satisfy the PL∗ condition). Consider the neural network f(W; x) in
Eq.(17), and a random parameter setting W_0 such that W_0^{(l)} ∼ N(0, I_{m_l × m_{l−1}}) for l ∈ [L + 1]. Suppose
that the last layer activation σ_{L+1} satisfies |σ'_{L+1}(z)| ≥ ρ > 0 and that λ_0 := λ_min(K(W_0)) > 0. For
any µ ∈ (0, λ_0 ρ²), if the width of the network satisfies
\[
m = \tilde{\Omega}\left(\frac{n R^{6L+2}}{(\lambda_0 - \mu\rho^{-2})^2}\right), \tag{18}
\]
then the µ-PL∗ condition holds for the square loss function in the ball B(W_0, R).
Remark 6. In fact, it is not necessary to require |σ'_{L+1}(z)| to be greater than ρ for all z. The theorem still
holds as long as |σ'_{L+1}(z)| > ρ is true for all z actually achieved by the output neuron before activation.

This theorem tells us that, while the loss landscape of wide neural networks is nowhere convex (as seen
in Section 3), it can still be described by the PL∗ condition at most points, in line with our general
discussion.

Proof of Theorem 4. We divide the proof into two distinct steps, based on representing an arbitrary
neural network as a composition of a network with a linear output layer and an output non-linearity
σ_{L+1}(·). In Step 1 we prove the PL∗ condition for the case of a network with a linear output layer (i.e.,
σ_{L+1}(z) ≡ z). The argument relies on the fact that wide neural networks with a linear output layer have
small Hessian norm in a ball around initialization. In Step 2, for general networks, we observe that an
arbitrary neural network is simply a neural network with a linear output layer from Step 1 with the output
transformed by applying σ_{L+1}(z) coordinate-wise. We obtain the result by combining Theorem 3 with
Step 1.

Step 1. Linear output layer: σ_{L+1}(z) ≡ z. In this case, ρ = 1 and the output layer of the network
has a linear form, i.e., it is a linear combination of the units from the last hidden layer.
As was shown in [23], for this type of network with sufficient width, the model Hessian has
arbitrarily small spectral norm (a transition to linearity). This is formulated in the following theorem:
Theorem 5 (Theorem 3.2 of [23]: transition to linearity). Consider a neural network f(W; x) of the
form Eq.(17). Let m be the minimum of the hidden layer widths, i.e., m = min_{l∈[L]} m_l. Given any fixed
R > 0 and any W ∈ B(W_0, R) := {W : ‖W − W_0‖ ≤ R}, with high probability over the initialization,
the Hessian spectral norm satisfies the following:
\[
\|H_f(W)\| = \tilde{O}\left(R^{3L}/\sqrt{m}\right). \tag{19}
\]
In Eq.(19), we explicitly write out the dependence of the Hessian norm on the radius R, according to the
proof in [23].
Directly plugging Eq.(19) into the condition of Theorem 2, with eigenvalue gap λ_0 − µ and noticing that ρ = 1,
we directly obtain the expression (18) for the width m.

Step 2. General networks: σ_{L+1}(·) is non-linear. Wide neural networks with a non-linear output
layer generally do not exhibit transition to linearity or a near-constant tangent kernel, as was shown in [23].
Despite that, these wide networks still satisfy the PL∗ condition in the ball B(W_0, R). Observe that
this type of network can be viewed as a composition of a non-linear transformation function σ_{L+1} with
a network f̃ which has a linear output layer:
\[
f(W; x) = \sigma_{L+1}(\tilde{f}(W; x)). \tag{20}
\]
By the same argument as in Step 1, we see that f̃, with the width as in Eq.(18), is µ/ρ²-uniformly
conditioned.
Now we apply our analysis for transformed systems in Theorem 3. In this case, the transformation
map Φ is the coordinate-wise application of σ_{L+1} to the output,
\[
\Phi(\cdot) = (\underbrace{\sigma_{L+1}(\cdot),\, \sigma_{L+1}(\cdot),\, \cdots,\, \sigma_{L+1}(\cdot)}_{n}), \tag{21}
\]
and the norm of the inverse Jacobian matrix ‖J_Φ^{-1}(w)‖_2 satisfies
\[
\|J_\Phi^{-1}(w)\|_2 = \frac{1}{\min_{i\in[n]} |\sigma'_{L+1}(\tilde{f}(w; x_i))|} \le \rho^{-1}. \tag{22}
\]
Hence, the µ/ρ²-uniform conditioning of f̃ immediately implies the µ-uniform conditioning of f, as desired.

5 PL∗ condition in a ball guarantees existence of solutions and fast convergence of (S)GD
In this section, we show that fast convergence of gradient descent methods is guaranteed by the PL∗
condition in a ball with appropriate size. We assume the system F(w) is LF -Lipschitz continuous and
βF -smooth on the local region S that we considered. In what follows S will typically be a Euclidean ball
B(w0 , R), with an appropriate radius R, chosen to cover the optimization path of GD or SGD.
First, we define the (non-linear) condition number as follows:
Definition 4 (Condition number). Consider a system in Eq.(2), with a loss function L(w) = L(F(w), y)
and a set S ⊂ R^m. If the loss L(w) is µ-PL∗ on S, define the condition number κ_{L,F}(S) as
\[
\kappa_{L,F}(S) := \frac{\sup_{w \in S} \lambda_{\max}(H_L(w))}{\mu}, \tag{23}
\]
where H_L(w) is the Hessian matrix of the loss function. The condition number for the square loss (used
throughout the paper) will be written simply as κ_F(S), omitting the subscript L.
Remark 7. In the special case of a linear system F(w) = Aw with the square loss ½‖Aw − y‖², both the
Hessian H_L = A^T A and the tangent kernel K(w) = A A^T are constant matrices. As AA^T and A^T A have
the same set of non-zero eigenvalues, the largest eigenvalue λ_max(H_L) is equal to λ_max(K). In this case,
the condition number κ_F(S) reduces to the standard condition number of the tangent kernel K,
\[
\kappa_F(S) = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)}. \tag{24}
\]
Since F is L_F-Lipschitz continuous and β_F-smooth by assumption, we directly obtain the following by
substituting the definition of the square loss function into Eq.(23).
Proposition 5. For the square loss function L, the condition number is upper bounded by:
\[
\kappa_F(S) \le \frac{L_F^2 + \beta_F \cdot \sup_{w\in S}\|F(w) - y\|}{\mu}. \tag{25}
\]

Remark 8. It is easy to see that the usual condition number κ(w) = λmax (K(w))/λmin (K(w)) of the
tangent kernel K(w), is upper bounded by κF (S).
Now, we are ready to present the optimization theory based on the PL∗ condition. First, let the
set S be a Euclidean ball B(w_0, R) around the initialization w_0 of the gradient descent method,
with a reasonably large but finite radius R. The following theorem shows that satisfaction of the PL∗
condition on B(w_0, R) implies the existence of at least one global solution of the system in the same
ball B(w_0, R). Moreover, following the original argument from [32], the PL∗ condition also implies fast
convergence of gradient descent to a global solution w∗ in the ball B(w_0, R).
Theorem 6 (Local PL∗ condition ⇒ existence of a solution + fast convergence). Suppose the system F
is L_F-Lipschitz continuous and β_F-smooth, and that the square loss L(w) satisfies the µ-PL∗ condition in the
ball B(w_0, R) := {w ∈ R^m : ‖w − w_0‖ ≤ R} with R = \frac{2 L_F \|F(w_0) - y\|}{\mu}. Then we have the following:
(a) Existence of a solution: There exists a solution (global minimizer of L) w∗ ∈ B(w_0, R), such that
F(w∗) = y.
(b) Convergence of GD: Gradient descent with a step size η ≤ 1/(L_F² + β_F‖F(w_0) − y‖) converges to a
global solution in B(w_0, R), with an exponential (a.k.a. linear) convergence rate:
\[
L(w_t) \le \left(1 - \kappa_F^{-1}(B(w_0, R))\right)^t L(w_0), \tag{26}
\]
where the condition number κ_F(B(w_0, R)) = \frac{1}{\eta\mu}.

The proof of the theorem is deferred to Appendix F.


It is interesting to note that the radius R of the ball B(w_0, R) takes a finite value, which means
that the optimization path {w_t}_{t=0}^{∞} ⊂ B(w_0, R) has at most a finite length and the optimization
happens only in a finite local region around the initialization. Hence, the conditioning of the tangent
kernel and the satisfaction of the PL∗ condition outside of this ball are irrelevant to this optimization
and are not required.
Indeed, this radius R has to be of order Ω(1). From the L_F-Lipschitz continuity of F(w), it follows
that there is no solution within distance ‖F(w_0) − y‖/L_F of the initialization point: any solution
must be at Euclidean distance at least R_min = ‖F(w_0) − y‖/L_F from w_0, which is finite. This
means that the parameter update ∆w = w∗ − w_0 must not be too small in terms of Euclidean distance.
However, due to the large number m of model parameters, each individual parameter w_i may change
only slightly during gradient descent training, i.e., |w_i − w_{0,i}| = O(1/√m). Indeed, this is what
happens for wide neural networks [16, 23].
Below, we make an extension of the above theory: from (deterministic) gradient descent to stochastic
gradient descent (SGD).

Convergence of SGD. In most practical machine learning settings, including typical problems of
supervised learning, the loss function L(w) has the form
\[
L(w) = \sum_{i=1}^{n} \ell_i(w).
\]
For example, for the square loss, ℓ_i(w) = ½(F_i(w) − y_i)²; here the loss ℓ_i corresponds simply to the loss
for the i-th equation. Mini-batch SGD updates the parameter w according to the gradients of s individual
loss functions ℓ_i(w) at a time:
\[
w_{t+1} = w_t - \eta \sum_{i\in S\subset[n]} \nabla \ell_i(w_t), \quad \forall t \in \mathbb{N}.
\]
We will assume that each element of the set S is chosen uniformly at random at every iteration.
We now show that the PL∗ condition on L also implies exponential convergence of SGD within a ball,
an SGD analogue of Theorem 6. Our result can be considered as a local version of Theorem 1 in [4] which
showed exponential convergence of SGD, assuming PL condition holds in the entire parameter space. See
also [29] for a related result.

Theorem 7. Assume each ℓ_i(w) is β-smooth and L(w) satisfies the µ-PL∗ condition in the ball B(w_0, R)
with R = \frac{2n\sqrt{2\beta L(w_0)}}{\mu\delta}, where δ > 0. Then, with probability 1 − δ, SGD with mini-batch size s ∈ N and
a sufficiently small step size η (whose admissible range depends on n, β, µ, and s) converges to a global solution in the ball B(w_0, R), with an exponential
convergence rate:
\[
\mathbb{E}[L(w_t)] \le \left(1 - \frac{\mu s \eta}{n}\right)^t L(w_0). \tag{27}
\]
The proof is deferred to Appendix G.
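For concreteness, here is a minimal mini-batch SGD sketch matching the update analyzed in Theorem 7; the individual losses ℓ_i, the teacher-generated labels, and the step size are illustrative assumptions (in particular, the step size is not the threshold from the theorem).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s, eta = 20, 100, 5, 1e-3
X = rng.normal(size=(n, m))
y = np.tanh(X @ rng.normal(size=m))        # teacher labels, so a solution exists

def grad_li(w, i):
    # gradient of l_i(w) = 0.5 * (tanh(x_i . w) - y_i)^2
    z = X[i] @ w
    return (np.tanh(z) - y[i]) * (1.0 - np.tanh(z) ** 2) * X[i]

w = rng.normal(size=m)
for t in range(3000):
    batch = rng.choice(n, size=s, replace=False)
    w = w - eta * sum(grad_li(w, i) for i in batch)
print(0.5 * np.sum((np.tanh(X @ w) - y) ** 2))   # training loss after SGD
```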

5.1 Convergence for wide neural networks.


Using the optimization theory developed above, we can now show convergence of (S)GD for sufficiently
wide neural networks. As we have seen in Section 4.2, the loss landscape of a wide neural network f(W; x)
defined in Eq.(17) satisfies the µ-PL∗ condition in a ball B(W_0, R) of an arbitrary radius R. We will now show
convergence of GD for these models.
As before (Section 4.2), write f(W; x) = σ_{L+1}(f̃(W; x)), where f̃(W; x) is a neural network with a linear output layer, and denote F(W) = (f(W; x_1), ..., f(W; x_n)), F̃(W) = (f̃(W; x_1), ..., f̃(W; x_n))
and y = (y_1, ..., y_n). We use a tilde, e.g., Õ(·), to suppress logarithmic terms in Big-O notation.
We further assume that σ_{L+1}(·) is L_σ-Lipschitz continuous and β_σ-smooth.

Theorem 8. Consider the neural network f(W; x) and its random initialization W_0 under the same
conditions as in Theorem 4. If the network width satisfies m = \tilde{\Omega}\left(\frac{n}{\mu^{6L+2}(\lambda_0 - \mu\rho^{-2})^2}\right), then, with an
appropriate step size, gradient descent converges to a global minimizer in the ball B(W_0, R), where
R = O(1/µ), with an exponential convergence rate:
\[
L(W_t) \le (1 - \eta\mu)^t L(W_0). \tag{28}
\]
Proof. The result is obtained by combining Theorems 4 and 6, after setting the ball radius R = 2L_F‖F(W_0) −
y‖/µ and choosing the step size 0 < η < \frac{1}{L_F^2 + \beta_F\|F(W_0) - y\|}.
It remains to verify that all of the quantities L_F, β_F and ‖F(W_0) − y‖ are of order O(1) with respect
to the network width m. First note that, with the random initialization of W_0 as in Theorem 4, it is
shown in [16] that, with high probability, f(W_0; x) = O(1) when the width is sufficiently large, under mild
assumptions on the non-linearities σ_l(·), l = 1, ..., L + 1. Hence ‖F(W_0) − y‖ = O(1).
Using the definition of L_F, we have
\[
\begin{aligned}
L_F &= \sup_{W\in B(W_0,R)} \|DF(W)\| \\
&\le L_\sigma\Big(\|D\tilde{F}(W_0)\| + R\sqrt{n}\cdot \sup_{W\in B(W_0,R)}\|H_{\tilde{F}}(W)\|\Big) \\
&= \sqrt{\|K_{\tilde{F}}(W_0)\|}\,L_\sigma + L_\sigma R\sqrt{n}\cdot \tilde{O}\Big(\frac{1}{\sqrt{m}}\Big) = O(1),
\end{aligned}
\]
where the last inequality follows from Theorem 5, and ‖K_{F̃}(W_0)‖ = O(1) with high probability over the
random initialization.

Finally, β_F is bounded as follows:
\[
\begin{aligned}
\beta_F &= \sup_{W\in B(W_0,R)} \|H_F(W)\| = \sup_{W\in B(W_0,R)} \|H_{f_k}(W)\| \\
&= \sup_{W\in B(W_0,R)} \left\| \sigma''_{L+1}\big(\tilde{f}_k(W)\big)\, D\tilde{f}_k(W) D\tilde{f}_k(W)^T + \sigma'_{L+1}\big(\tilde{f}_k(W)\big)\, H_{\tilde{f}_k}(W) \right\| \\
&\le \beta_\sigma \Big(\sup_{W\in B(W_0,R)} \|D\tilde{f}_k(W)\|\Big)^2 + L_\sigma \sup_{W\in B(W_0,R)} \|H_{\tilde{f}_k}(W)\| \\
&\le \beta_\sigma \cdot O(1) + L_\sigma \cdot \tilde{O}\Big(\frac{1}{\sqrt{m}}\Big) = O(1),
\end{aligned}
\]
where k = \operatorname{argmax}_{i\in[n]} \sup_{W\in B(W_0,R)} \|H_{f_i}(W)\|.

Remark 9. Using the same argument, a result similar to Theorem 8, but with a different convergence rate,
\[
L(W_t) \le (1 - \eta s\mu)^t L(W_0), \tag{29}
\]
and with different constants, can be obtained for SGD by applying Theorem 7.
Note that (near) constancy of the tangent kernel is not a necessary condition for exponential convergence of gradient descent or (S)GD.

6 Relaxation of the PL∗ condition


In certain situations, for example mildly under-parameterized cases, the PL∗ condition may not hold
exactly, since an exact solution to the system F(w) = y may not exist. Fortunately, in practice, we do not
need to run algorithms until exact convergence. Most of the time, early stopping is employed, i.e., we
stop the algorithm once it achieves a certain small loss ε > 0. To account for that case, we define the PL∗_ε
condition, a relaxed variant of the PL∗ condition, which still implies fast convergence of gradient-based
algorithms up to loss ε.
Definition 5 (PL∗_ε condition). Given a set S ⊂ R^m and ε > 0, define the set S_ε := {w ∈ S : L(w) ≥ ε}.
A loss function L(w) is µ-PL∗_ε on S if the following holds:
\[
\frac{1}{2}\|\nabla L(w)\|^2 \ge \mu L(w), \quad \forall w \in S_\varepsilon. \tag{30}
\]
Intuitively, the PL∗_ε condition is the same as the PL∗ condition, except that the loss landscape can be
arbitrary wherever the loss is less than ε. This is illustrated in Figure 5.
Below, following a similar argument to the one above, we show that a local PL∗_ε condition guarantees fast
convergence to an approximation of a global minimizer. Basically, gradient descent terminates when the
loss is less than ε.

Theorem 9 (Local PL∗_ε condition ⇒ fast convergence). Assume the loss function L(w) (not necessarily
the square loss) is β-smooth and satisfies the µ-PL∗_ε condition in the ball B(w_0, R) := {w ∈ R^m :
‖w − w_0‖ ≤ R} with R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}. We have the following:
(a) Existence of a point at which the loss is less than ε: There exists a point w∗ ∈ B(w_0, R) such that
L(w∗) < ε.
(b) Convergence of GD: Gradient descent with the step size η = 1/\sup_{w\in B(w_0,R)}\|H_L(w)\|_2 after T =
Ω(log(1/ε)) iterations satisfies L(w_T) < ε in the ball B(w_0, R), with an exponential (also known as linear)
convergence rate:
\[
L(w_t) \le \left(1 - \kappa_{L,F}^{-1}(B(w_0, R))\right)^t L(w_0), \quad \text{for all } t \le T, \tag{31}
\]
where the condition number κ_{L,F}(B(w_0, R)) = \frac{1}{\eta\mu}.

Proof. If L(w_0) < ε, we let w∗ = w_0 and we are done.
Suppose L(w_0) ≥ ε. Following an analysis similar to the proof of Theorem 6, as long as L(w_t) ≥ ε
for t ≥ 0, we have
\[
L(w_t) \le (1 - \eta\mu)^t L(w_0). \tag{32}
\]
Hence there must exist a minimal T > 0 such that
\[
L(w_T) < \varepsilon. \tag{33}
\]
It is not hard to see that ε ≤ (1 − ηµ)^T L(w_0), from which we get T = Ω(log(1/ε)).
And obviously, w_0, w_1, ..., w_T are also in the ball B(w_0, R) with R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}.
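In code, the optimization procedure analyzed in Theorem 9 is just gradient descent with an early-stopping test on the loss; the sketch below is a generic transcription of that rule, with no assumptions beyond the update of Eq. (5).

```python
import numpy as np

def gd_early_stop(loss, grad, w0, eta, eps, max_steps=10_000):
    # Run the update of Eq. (5) only while L(w_t) >= eps, as in Theorem 9.
    w = np.asarray(w0, dtype=float)
    t = 0
    while t < max_steps and loss(w) >= eps:
        w = w - eta * grad(w)
        t += 1
    return w, t
```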

Following a similar analysis, when the PL∗_ε condition holds in the ball, SGD converges exponentially to an approximation of the global solution.

Theorem 10. Assume each ℓ_i(w) is γ-smooth and L(w) satisfies the µ-PL∗_ε condition in the ball B(w_0, R) with R = \frac{2n\sqrt{2\gamma L(w_0)}}{\mu\delta}, where δ > 0. Then, with probability 1 − δ, SGD with mini-batch size s ∈ N and a sufficiently small step size η∗(s) (whose admissible range depends on n, γ, µ, and s) satisfies min(E[L(w_T)], L(w_T)) < ε after at most T = Ω(log(1/ε)) iterations within the ball B(w_0, R), with an exponential convergence rate:
\[
\mathbb{E}[L(w_t)] \le \left(1 - \frac{\mu s \eta^*(s)}{n}\right)^t L(w_0), \quad t \le T. \tag{34}
\]

Proof. If L(w_0) < ε, we let T = 0 and we are done.
Suppose L(w_0) ≥ ε. By an analysis similar to the proof of Theorem 7, under the µ-PL∗_ε condition, for any mini-batch size s, mini-batch SGD with step size η∗(s) has an exponential convergence rate:
\[
\mathbb{E}[L(w_t)] \le \left(1 - \frac{\mu s \eta^*(s)}{n}\right)^t L(w_0), \tag{35}
\]
for all t such that L(w_t) ≥ ε.
Hence there must exist a minimal T > 0 such that either L(w_T) < ε or E[L(w_T)] < ε. Supposing
E[L(w_T)] < ε at time step T, we have ε ≤ \left(1 - \frac{\mu s \eta^*(s)}{n}\right)^T L(w_0), therefore T = Ω(log(1/ε)).
It is easy to check that, with probability 1 − δ, the optimization path w_0, w_1, ..., w_T is covered by
the ball B(w_0, R) with R = \frac{2n\sqrt{2\gamma L(w_0)}}{\mu\delta}.

Figure 5: The loss landscape of under-parameterized systems. In the set S_ε, where the loss is larger than ε, PL∗_ε still holds. Beyond that (the grey area) the loss landscape can be arbitrary, and PL∗_ε does not hold any more.

While the global landscape of the loss function L(w) can be complex, the conditions above allow us
to find solutions within a certain ball around the initialization point w0 .

7 Concluding thoughts and comments
In this paper we have proposed a general framework for understanding generically non-convex landscapes
and optimization of over-parameterized systems in terms of the PL∗ condition. We have argued that the PL∗
condition generally holds on most, but not all, of the parameter space, which is sufficient for the existence
of solutions and convergence of gradient-based methods to global minimizers. In contrast, it is not possible
for the loss landscape ‖F(w) − y‖² of an under-parameterized system to satisfy the PL∗ condition for every y. We conclude
with a number of comments and observations.

Linear and non-linear systems. A remarkable property of over-parameterized non-linear systems


discussed in this work is their strong resemblance to linear systems with respect to optimization by
(S)GD, even as their dynamics remain nonlinear. In particular, optimization by gradient-based methods
and proximity to global minimizers is controlled by non-linear condition numbers, similarly to classical
analyses of linear systems. The key difference is that while for linear systems the condition number is
constant, in the non-linear case we need a uniform bound in a domain containing the optimization path.
In contrast, the optimization properties of non-linear systems in the under-parameterized regime appear
very different from those of linear systems. Furthermore, increasing the degree of over-parameterization
generally improves conditioning just like it does for linear systems (cf. [9] and the discussion in [31]). In
particular, this suggests that the effectiveness of optimization should improve, up to a certain limit, with
increased over-parameterization.

Transition over the interpolation threshold. Recognizing the power of over-parameterization has
been a key insight stemming from the practice of deep learning. Transition to over-parameterized mod-
els – over the interpolation threshold – leads to a qualitative change in a range of system properties.
Statistically, over-parameterized systems enter a new interpolating regime, where increasing the number
of parameters, even indefinitely to infinity, can improve generalization [5, 34]. From the optimization
point of view, over-parameterized systems are generally easier to solve. There has been significant effort
(continued in this work) toward understanding effectiveness of local methods in this setting [33, 12, 25,
2, 17, 30].
In this paper we note another aspect of this transition, to the best of our knowledge not addressed in
the existing literature – transition from local convexity to essential non-convexity. This relatively simple
observation has significant consequences, indicating the need to depart from the machinery of convex
analysis. Interestingly, our analyses suggest that this loss of local convexity is of little consequence for
optimization, at least as far as gradient-based methods are concerned.

Transition from over-parameterization to under-parameterization along the optimization
path. As discussed above, transition over the interpolation threshold occurs when the number of pa-
rameters in a variably parameterized system exceeds the number of constraints (corresponding to data
points in typical ML scenarios). Over-parameterization does not refer to the number of parameters as
such but to the difference between the number of parameters and the number of constraints. While some
learning models are very large, with billions or even trillions of parameters, they are often trained on equally
large datasets and are thus not necessarily over-parameterized. Yet, it is still tempting to view these mod-
els through the lens of over-parameterization. While precise technical analysis is beyond the scope of this
paper, we conjecture that a transition from effective over-parameterization to under-parameterization happens along the optimization trajectory: initially, the system behaves as over-parameterized, but as the optimization process continues it effectively becomes under-parameterized and the loss fails to reach zero. Mathematically, this can be captured by the relaxed version of our PL∗ condition. We speculate that for many realistic large models trained on big data the full optimization
path lies within the PL∗ domain and hence, functionally, the analyses in this paper apply.

Condition numbers and optimization methods. In this work we concentrate on optimization
by gradient descent and SGD. Yet, for linear systems of equations and in many other settings, the
importance of conditioning extends far beyond one specific optimization technique [7]. We expect this to be the case in the over-parameterized non-linear setting as well. To give just one example, we expect accelerated methods, such as Nesterov's method [26] and its stochastic gradient extensions for the over-parameterized case [22, 35], to have faster convergence rates for non-linear systems in terms of the condition numbers defined in this work.

Equations on manifolds. In this paper we consider systems of equations F(w) = y defined on
Euclidean spaces and with Euclidean output. A more general setting is to look for solutions of arbitrary
systems of equations defined by a map between two Riemannian manifolds F : M → N . In that case
the loss function L needs to be defined on N . The over-parameterization corresponds to the case when
dimension dim(M) > dim(N ). While analyzing gradient descent requires some care on a manifold,
most of the mathematical machinery, including the definitions of the PL∗ condition and the condition
number associated to F, is still applicable without significant change. In particular, as we discussed
above (see Remark 4 in Section 4.1), the condition number is preserved under “well-behaved” coordinate
transformations. In contrast, this is not the case for the Hessian and thus manifold optimization analyses
based on geodesic convexity require knowledge about specific coordinate charts, such as those given by
the exponential map.
We note that manifold and structural assumptions on the weight vector w are a natural setting for
addressing many problems in inference. In particular, the important class of convolutional neural networks
is an example of such a structural assumption on w, which is made invariant to certain parallel transforms.
Furthermore, there are many settings, e.g., robot motion planning, where the output of a predictor, y,
also belongs to a certain manifold.

Acknowledgements
We thank Raef Bassily and Siyuan Ma for many earlier discussions about gradient methods and Polyak-
Lojasiewicz conditions and Stephen Wright for insightful comments and corrections. The authors ac-
knowledge support from the NSF, the Simons Foundation and a Google Faculty Research Award.

References
[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A Convergence Theory for Deep Learning via
Over-Parameterization”. In: International Conference on Machine Learning. 2019, pp. 242–252.
[2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. “Fine-Grained Analysis of
Optimization and Generalization for Overparameterized Two-Layer Neural Networks”. In: Inter-
national Conference on Machine Learning. 2019, pp. 322–332.
[3] Peter L Bartlett, David P Helmbold, and Philip M Long. “Gradient descent with identity initializa-
tion efficiently learns positive-definite linear transformations by deep residual networks”. In: Neural
computation 31.3 (2019), pp. 477–502.
[4] Raef Bassily, Mikhail Belkin, and Siyuan Ma. “On exponential convergence of sgd in non-convex
over-parametrized learning”. In: arXiv preprint arXiv:1811.02564 (2018).
[5] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. “Reconciling modern machine-
learning practice and the classical bias–variance trade-off”. In: Proceedings of the National Academy
of Sciences 116.32 (2019), pp. 15849–15854.
[6] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are
few-shot learners”. In: arXiv preprint arXiv:2005.14165 (2020).
[7] Peter Bürgisser and Felipe Cucker. Condition: The geometry of numerical algorithms. Vol. 349.
Springer Science & Business Media, 2013.
[8] Zachary Charles and Dimitris Papailiopoulos. “Stability and generalization of learning algorithms
that converge to global optima”. In: International Conference on Machine Learning. PMLR. 2018,
pp. 745–754.
[9] Zizhong Chen and Jack J Dongarra. “Condition numbers of Gaussian random matrices”. In: SIAM
Journal on Matrix Analysis and Applications 27.3 (2005), pp. 603–620.
[10] Yaim Cooper. “The loss landscape of overparameterized neural networks”. In: arXiv preprint
arXiv:1804.10200 (2018).
[11] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. “Gradient Descent Finds Global
Minima of Deep Neural Networks”. In: International Conference on Machine Learning. 2019,
pp. 1675–1685.
[12] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. “Gradient descent provably optimizes
over-parameterized neural networks”. In: arXiv preprint arXiv:1810.02054 (2018).
[13] William Fedus, Barret Zoph, and Noam Shazeer. “Switch Transformers: Scaling to Trillion Param-
eter Models with Simple and Efficient Sparsity”. In: arXiv preprint arXiv:2101.03961 (2021).
[14] Chirag Gupta, Sivaraman Balakrishnan, and Aaditya Ramdas. “Path length bounds for gradient
descent and flow”. In: arXiv preprint arXiv:1908.01089 (2019).
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[16] Arthur Jacot, Franck Gabriel, and Clément Hongler. “Neural tangent kernel: Convergence and
generalization in neural networks”. In: Advances in neural information processing systems. 2018,
pp. 8571–8580.
[17] Ziwei Ji and Matus Telgarsky. “Polylogarithmic width suffices for gradient descent to achieve arbi-
trarily small test error with shallow ReLU networks”. In: arXiv preprint arXiv:1909.12292 (2019).
[18] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[19] Johannes Lederer. “No Spurious Local Minima: on the Optimization Landscapes of Wide and Deep
Neural Networks”. In: arXiv preprint arXiv:2010.00885 (2020).
[20] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein,
and Jeffrey Pennington. “Wide neural networks of any depth evolve as linear models under gradient
descent”. In: Advances in neural information processing systems. 2019, pp. 8570–8581.
[21] Dawei Li, Tian Ding, and Ruoyu Sun. “Over-parameterized deep neural networks have no strict
local minima for any continuous activations”. In: arXiv preprint arXiv:1812.11039 (2018).
[22] Chaoyue Liu and Mikhail Belkin. “Accelerating SGD with momentum for over-parameterized learn-
ing”. In: The 8th International Conference on Learning Representations (ICLR). 2020.
[23] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. “On the linearity of large non-linear models: when
and why the tangent kernel is constant”. In: Advances in Neural Information Processing Systems
33 (2020).
[24] Stanislaw Lojasiewicz. “A topological property of real analytic subsets”. In: Coll. du CNRS, Les
équations aux dérivées partielles 117 (1963), pp. 87–89.
[25] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. “A mean field view of the landscape of
two-layer neural networks”. In: Proceedings of the National Academy of Sciences 115.33 (2018),
E7665–E7671.
[26] Yurii Nesterov. “A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)”. In: Doklady AN USSR. Vol. 269. 1983, pp. 543–547.
[27] Quynh Nguyen, Mahesh Chandra Mukkamala, and Matthias Hein. “On the loss landscape of a class
of deep neural networks with no bad local valleys”. In: arXiv preprint arXiv:1809.10749 (2018).
[28] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media,
2006.
[29] Samet Oymak and Mahdi Soltanolkotabi. “Overparameterized nonlinear learning: Gradient de-
scent takes the shortest path?” In: International Conference on Machine Learning. PMLR. 2019,
pp. 4951–4960.
[30] Samet Oymak and Mahdi Soltanolkotabi. “Towards moderate overparameterization: global con-
vergence guarantees for training shallow neural networks”. In: arXiv preprint arXiv:1902.04674
(2019).
[31] Tomaso Poggio, Gil Kur, and Andrzej Banburski. “Double descent in the condition number”. In:
arXiv preprint arXiv:1912.06190 (2019).
[32] Boris Teodorovich Polyak. “Gradient methods for minimizing functionals”. In: Zhurnal Vychisli-
tel’noi Matematiki i Matematicheskoi Fiziki 3.4 (1963), pp. 643–653.
[33] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. “Theoretical insights into the optimiza-
tion landscape of over-parameterized shallow neural networks”. In: IEEE Transactions on Infor-
mation Theory 65.2 (2018), pp. 742–769.
[34] S Spigler, M Geiger, S d’Ascoli, L Sagun, G Biroli, and M Wyart. “A jamming transition from
under- to over-parametrization affects generalization in deep learning”. In: Journal of Physics A:
Mathematical and Theoretical 52.47 (Oct. 2019), p. 474001.
[35] Sharan Vaswani, Francis Bach, and Mark Schmidt. “Fast and Faster Convergence of SGD for Over-
Parameterized Models and an Accelerated Perceptron”. In: The 22th International Conference on
Artificial Intelligence and Statistics. 2019, pp. 1195–1204.
[36] Patrick M Wensing and Jean-Jacques Slotine. “Beyond convexity—Contraction and global conver-
gence of gradient descent”. In: Plos one 15.8 (2020), e0236661.
[37] Xiao-Hu Yu and Guo-An Chen. “On the local minima free condition of backpropagation learning”.
In: IEEE Transactions on Neural Networks 6.5 (1995), pp. 1300–1303.
[38] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. “Stochastic gradient descent optimizes
over-parameterized deep relu networks”. In: arXiv preprint arXiv:1811.08888 (2018).
A Wide neural networks have no isolated local/global minima
In this section, we show that, for feedforward neural networks, if the network width is sufficiently large, there are no isolated local/global minima in the loss landscape.
Consider the following feedforward neural networks:

f (W; x) := W (L+1) φL (W (L) · · · φ1 (W (1) x)). (36)

Here, x ∈ R^d is the input and L is the number of hidden layers of the network. Let m_l be the width of the l-th layer, l ∈ [L], with m_0 = d and m_{L+1} = c, where c is the dimension of the network output.
W := {W (1) , W (2) , · · · , W (L) , W (L+1) }, with W (l) ∈ Rml−1 ×ml , is the collection of parameters, and φl
is the activation function at each hidden layer. The minimal width of hidden layers, i.e., width of the
network, is denoted as m := min{m1 , · · · , mL }. The parameter space is denoted as M. Here, we further
assume the loss function L(W) is of the form:
L(W) = Σ_{i=1}^n l(f(W; x_i), y_i),   (37)

where loss l(·, ·) is convex in the first argument, and (xi , yi ) ∈ Rd × Rc is one of the n training samples.
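To make the setup concrete, the following minimal sketch (not from the paper) implements the network of Eq.(36) and the loss of Eq.(37); the particular widths, the ReLU activations, the squared-error choice of l, and the (out_dim, in_dim) matrix-shape convention are illustrative assumptions.

```python
import numpy as np

def forward(W, x, phis):
    """f(W; x) = W^(L+1) phi_L(W^(L) ... phi_1(W^(1) x)), cf. Eq.(36)."""
    a = x
    for W_l, phi in zip(W[:-1], phis):
        a = phi(W_l @ a)
    return W[-1] @ a

def total_loss(W, X, Y, phis, ell=lambda f, y: 0.5 * np.sum((f - y) ** 2)):
    """L(W) = sum_i ell(f(W; x_i), y_i), with ell convex in its first argument, cf. Eq.(37)."""
    return sum(ell(forward(W, x, phis), y) for x, y in zip(X, Y))

# Example: L = 2 hidden layers, widths d = 3 -> 8 -> 8 -> c = 2.
rng = np.random.default_rng(0)
dims = [3, 8, 8, 2]
W = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
phis = [lambda z: np.maximum(z, 0.0)] * 2       # ReLU activations (illustrative)
X, Y = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
print(total_loss(W, X, Y, phis))
```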
Proposition 6 (No isolated minima). Consider the feedforward neural network in Eq.(36). If the network
width m ≥ 2c(n + 1)L , given a local (global) minimum W∗ of the loss function Eq.(37), there are always
other local (global) minima in any neighborhood of W∗ .
The main idea of this proposition is based on some of the intermediate-level results of the work [19].
Before starting the proof, let’s review some of the important concepts and results therein. To be consistent
with the notation in this paper, we modify some of their notations.
Definition 6 (Path constant). Consider two parameters W, V ∈ M. If there is a continuous function
hW,V : [0, 1] → M that satisfies hW,V (0) = W, hW,V (1) = V and t 7→ L(hW,V (t)) is constant, we say
that W and V are path constant and write W ↔ V.
Path constantness means the two parameters W and V are connected by a continuous path of pa-
rameters that is contained in a level set of the loss function L.
Lemma 1 (Transitivity). W ↔ V and V ↔ U ⇒ W ↔ U.
Definition 7 (Block parameters). Consider a number s ∈ {0, 1, · · · } and a parameter W ∈ M. If
1. W^(1)_{ji} = 0 for all j > s;
2. W^(l)_{ij} = 0 for all l ∈ [L] and i > s, and for all l ∈ [L] and j > s;
3. W^(L+1)_{ij} = 0 for all j > s,
we call W an s-upper-block parameter of depth L.
Similarly, if
1. W^(1)_{ji} = 0 for all j ≤ m_1 − s;
2. W^(l)_{ij} = 0 for all l ∈ [L] and i ≤ m_l − s, and for all l ∈ [L] and j ≤ m_{l−1} − s;
3. W^(L+1)_{ij} = 0 for all j ≤ m_L − s,
we call W an s-lower-block parameter of depth L. We denote the sets of the s-upper-block and s-lower-block parameters of depth L by U_{s,L} and V_{s,L}, respectively.
The key result that we use in this paper is that every parameter is path constant to a block parameter,
or more formally:
Proposition 7 (Path connections to block parameters [19]). For every parameter W ∈ M and s := c(n + 1)L (n is the number of training samples), there exist W̄, W̲ ∈ M with W̄ ∈ U_{s,L} and W̲ ∈ V_{s,L} such that W ↔ W̄ and W ↔ W̲.

The above proposition says that, if the network width m is large enough, every parameter is path
connected to both an upper-block parameter and a lower-block parameter, by continuous paths contained
in a level set of the loss function.
Now, let’s present the proof of Proposition 6.
Proof of Proposition 6. Let W∗ ∈ M be an arbitrary local/global minimum of the loss function L(W). According to Proposition 7, there exist an upper-block parameter W̄ ∈ U_{s,L} ⊂ M and a lower-block parameter W̲ ∈ V_{s,L} ⊂ M such that W∗ ↔ W̄ and W∗ ↔ W̲. Note that W̄ and W̲ are distinct, because U_{s,L} and V_{s,L} do not intersect except at zero due to m ≥ 2s. This means there must be parameters distinct from W∗ that are connected to W∗ via a continuous path contained in a level set of the loss function. Note that all the points (i.e., parameters) along this path have the same loss value as L(W∗), hence are local/global minima. Therefore, W∗ is not isolated (i.e., there are other local/global minima in any neighborhood of W∗).

B Proof of Proposition 2
Proof. The Hessian matrix of a general loss function L(F(w)) takes the form
H_L(w) = DF(w)^T (∂²L/∂F²)(w) DF(w) + Σ_{i=1}^n (∂L/∂F_i)(w) H_{F_i}(w).

Recall that H_{F_i}(w) is the Hessian matrix of the i-th output of F with respect to w.
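As an aside, this decomposition is easy to verify numerically. The sketch below (not part of the proof) checks it for the square loss L(w) = ½‖F(w) − y‖², where ∂²L/∂F² = I and ∂L/∂F_i = F_i − y_i, using a toy map F_i(w) = tanh(a_iᵀw) chosen purely for illustration.

```python
import numpy as np

# Check H_L = DF^T (d^2L/dF^2) DF + sum_i (dL/dF_i) H_{F_i} for the square loss,
# comparing a finite-difference Hessian of L against the analytic decomposition.
rng = np.random.default_rng(0)
n, m = 2, 3
A, y, w = rng.normal(size=(n, m)), rng.normal(size=n), rng.normal(size=m)

F = lambda w: np.tanh(A @ w)                                   # toy non-linear map
DF = lambda w: (1 - np.tanh(A @ w) ** 2)[:, None] * A          # n x m Jacobian
grad_L = lambda w: DF(w).T @ (F(w) - y)                        # gradient of 0.5*||F(w)-y||^2

eps, H_num = 1e-5, np.zeros((m, m))
for j in range(m):                                             # central finite differences of grad_L
    e = np.zeros(m); e[j] = eps
    H_num[:, j] = (grad_L(w + e) - grad_L(w - e)) / (2 * eps)

z = np.tanh(A @ w)
H_F = [-2 * z[i] * (1 - z[i] ** 2) * np.outer(A[i], A[i]) for i in range(n)]  # tanh''(a_i^T w) a_i a_i^T
H_dec = DF(w).T @ DF(w) + sum((F(w)[i] - y[i]) * H_F[i] for i in range(n))
print(np.max(np.abs(H_num - H_dec)))                           # tiny (finite-difference error only)
```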
We consider the Hessian matrices of L around a global minimizer w∗ of the loss function, i.e., solution
of the system of equations. Specifically, consider the following two points w∗ + δ and w∗ − δ, which are
in a sufficiently small neighborhood of the minimizer w∗. Then the Hessians of the loss at these two points are

H_L(w∗ + δ) = A(w∗ + δ) + Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i H_{F_i}(w∗ + δ) + o(‖δ‖),
H_L(w∗ − δ) = A(w∗ − δ) − Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i H_{F_i}(w∗ − δ) + o(‖δ‖),

where A(w∗ ± δ) := DF(w∗ ± δ)^T (∂²L/∂F²)(w∗ ± δ) DF(w∗ ± δ).

Note that both the terms A(w∗ + δ) and A(w∗ − δ) are matrices with rank at most n, since DF is of
the size n × m.
By the assumption, at least one component HFk of the Hessian of F satisfies that the rank of HFk (w∗ )
is greater than 2n. By the continuity of the Hessian, we have that, if magnitude of δ is sufficiently small,
then the ranks of HFk (w∗ + δ) and HFk (w∗ − δ) are also greater than 2n.
Hence, we can always find a unit length vector v ∈ Rm s.t.

vT A(w∗ + δ)v = vT A(w∗ − δ)v = 0, (38)

but
v^T H_{F_k}(w∗ + δ) v ≠ 0,   v^T H_{F_k}(w∗ − δ) v ≠ 0.   (39)
Consequently, the vectors ⟨v^T H_{F_1}(w∗ + δ)v, . . . , v^T H_{F_n}(w∗ + δ)v⟩ ≠ 0 and ⟨v^T H_{F_1}(w∗ − δ)v, . . . , v^T H_{F_n}(w∗ − δ)v⟩ ≠ 0.
With the same v, we have
v^T H_L(w∗ + δ) v = Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ + δ) v + o(‖δ‖),   (40)

v^T H_L(w∗ − δ) v = − Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ − δ) v + o(‖δ‖).   (41)
In the following, we show that, for sufficiently small δ, v^T H_L(w∗ + δ)v and v^T H_L(w∗ − δ)v cannot both be non-negative, which immediately implies that H_L is not positive semi-definite in a close neighborhood of w∗, hence L is not locally convex at w∗.
Specifically, under the condition (d/dw)(∂L/∂F)(w∗) ≠ 0, for Eq.(40) and Eq.(41) we have the following cases:
Case 1: If Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ + δ) v < 0, then directly v^T H_L(w∗ + δ) v < 0 if δ is small enough, which completes the proof.
Case 2: Otherwise, if Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ + δ) v > 0, by the continuity of each H_{F_i}(·), we have

− Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ − δ) v
  = − Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ + δ) v + Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T (H_{F_i}(w∗ + δ) − H_{F_i}(w∗ − δ)) v
  = − Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ + δ) v + o(‖δ‖)
  < 0,

when δ is small enough.
Note: If Σ_{i=1}^n [ (d/dw)(∂L/∂F)(w∗) δ ]_i v^T H_{F_i}(w∗ + δ) v = 0, we can always adjust δ slightly so that it falls into Case 1 or Case 2.

In conclusion, for a certain arbitrarily small δ, either v^T H_L(w∗ + δ)v or v^T H_L(w∗ − δ)v has to be negative, which means L(w) has no convex neighborhood around w∗.

C Small Hessian is a feature of certain large models


Here, we show that a small Hessian spectral norm is not a strong condition. In fact, it is a mathematical freebie as long as the model has a certain structure and is large enough. For example, if a neural network
has a linear output layer and is wide enough, its Hessian spectral norm can be arbitrarily small, see [23].
In the following, let’s consider an illustrative example. Let the model f be a linear combination of m
independent sub-models,
f(w; x) = (1/s(m)) Σ_{i=1}^m v_i α_i(w_i; x),   (42)

where w_i are the trainable parameters of the sub-model α_i, i ∈ [m], and 1/s(m) is a scaling factor that depends on m. The sub-model weights v_i are independently randomly chosen from {−1, 1} and are not trainable. The parameters of the model f are the concatenation of the sub-model parameters: w := (w_1, · · · , w_m) with w_i ∈ R^p. We make the following mild assumptions.
Assumption 1. For simplicity, we assume that the sub-models αi (wi , x) have the same structure but
different initial parameters wi,0 due to random initialization. We further assume each sub-model has a
Θ(1) output, and is second-order differentiable and β-smooth.
Due to the randomness of the sub-model weights v_i, the scaling factor satisfies 1/s(m) = o(1) w.r.t. the size m of the model f (e.g., 1/s(m) = 1/√m for neural networks [16, 23]).
The following theorem states that, as long as the model size m is sufficiently large, the Hessian spectral
norm of the model f is arbitrarily small.
Theorem 11. Consider the model f defined in Eq.(42). Under Assumption 1, the spectral norm of the
Hessian of model f satisfies:
‖H_f‖ ≤ β/s(m).   (43)
Proof. An entry (H_f)_{jk} of the model Hessian matrix, with j, k indexing the coordinates of w ∈ R^{mp}, is

(H_f)_{jk} = (1/s(m)) Σ_{i=1}^m v_i ∂²α_i/(∂w_j ∂w_k) =: (1/s(m)) Σ_{i=1}^m v_i (H_{α_i})_{jk},   (44)

with H_{α_i} being the Hessian matrix of the sub-model α_i. Because the parameters of f are the concatenation of the sub-model parameters and the sub-models share no common parameters, in the summation of Eq.(44) there is at most one non-zero term (non-zero only when w_j and w_k belong to the same sub-model α_i).
Thus, the Hessian spectral norm can be bounded by

‖H_f‖ ≤ (1/s(m)) max_{i∈[m]} |v_i| · ‖H_{α_i}‖ ≤ (1/s(m)) · β.   (45)
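The following sketch (an illustration, not taken from the paper) instantiates Eq.(42) with tanh sub-models α_i(w_i; x) = tanh(w_iᵀx) and s(m) = √m. Since the Hessian is block-diagonal with blocks (v_i/s(m)) H_{α_i}, its spectral norm can be computed exactly and is seen to shrink like 1/√m while the output stays O(1).

```python
import numpy as np

# Ensemble f(w; x) = (1/sqrt(m)) sum_i v_i tanh(w_i^T x): the Hessian block for
# sub-model i is (v_i/sqrt(m)) tanh''(w_i^T x) x x^T, so ||H_f|| = max_i |tanh''| * ||x||^2 / sqrt(m).
rng = np.random.default_rng(0)
p = 5
x = rng.normal(size=p)

for m in [10, 100, 1000, 10000]:
    W = rng.normal(size=(m, p))                   # sub-model parameters w_i
    v = rng.choice([-1.0, 1.0], size=m)           # fixed, non-trainable signs
    z = W @ x
    f = v @ np.tanh(z) / np.sqrt(m)
    block_norms = np.abs(-2 * np.tanh(z) * (1 - np.tanh(z) ** 2)) * (x @ x)
    print(f"m={m:6d}  f={f:+.3f}  ||H_f|| = {block_norms.max() / np.sqrt(m):.4f}")
```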

D An illustrative example of composition models


Consider the model h = g ◦ f as a composition of 2 random Fourier feature models f : R → R and
g : R → R with:
f(u; x) = (1/√m) Σ_{k=1}^m (u_k cos(ω_k x) + u_{k+m} sin(ω_k x)),   (46)

g(a; z) = (1/√m) Σ_{k=1}^m (a_k cos(ν_k z) + a_{k+m} sin(ν_k z)).   (47)

Here we set the frequencies ω_k ∼ N(0, 1) and ν_k ∼ N(0, n²), for all k ∈ [m], to be fixed. The trainable
parameters of the model are (u, a) ∈ R2m × R2m . It’s not hard to see that the model h(x) = g ◦ f (x) can
be viewed as a 4-layer ”bottleneck” neural network where the second hidden layer has only one neuron,
i.e. the output of f .
For simplicity, we let the input x be 1-dimensional, and denote the training data as x1 , ..., xn ∈ R.
We consider the case where both sub-models, f and g, are large; specifically, we take m → ∞.
By Proposition 4, the tangent kernel matrix of h, i.e. Kh , can be decomposed into the sum of two
positive semi-definite matrices, and the uniform conditioning of Kh can be guaranteed if one of them is
uniformly conditioned, as demonstrated in Eq.(16). In the following, we show Kg is uniformly conditioned
by making f well separated.
We assume the training data are not degenerate and the parameters u are randomly initialized. This ensures that the initial outputs of f, which are the initial inputs of g, are not degenerate, with high probability. For example, let min_{i≠j} |x_i − x_j| ≥ √2 and initialize u by N(0, 100R²) with a given number R > 0. Then, we have

f(u_0; x_i) − f(u_0; x_j) ∼ N(0, 200R² − 200R² e^{−|x_i − x_j|²/2}),   ∀i, j ∈ [n].
Since min_{i≠j} |x_i − x_j| ≥ √2, the variance Var := 200R² − 200R² e^{−|x_i − x_j|²/2} > 100R². For this Gaussian distribution, we have, with probability at least 0.96, that

min_{i≠j} |f(u_0; x_i) − f(u_0; x_j)| > 2R.   (48)

For this model, the partial tangent kernel K_g is the Gaussian kernel in the limit of m → ∞, with the following entries:

K_{g,ij}(u) = exp(−n² |f(u; x_i) − f(u; x_j)|² / 2).
By the Gershgorin circle theorem, its smallest eigenvalue is lower bounded by:

inf_{u∈S} λ_min(K_g(u)) ≥ 1 − (n − 1) exp(−n² ρ(S)² / 2),

where ρ(S) := inf_{u∈S} min_{i≠j} |f(u; x_i) − f(u; x_j)| is the minimum separation between the outputs of f, i.e., the inputs of g, in S. If ρ(S) ≥ √(2 ln(2n − 2))/n, then we have inf_{u∈S} λ_min(K_g(u)) ≥ 1/2. Therefore, we see that the uniform conditioning of K_g, hence that of K_h, is controlled by the separation between the inputs of g, i.e., the outputs of f.
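The following numerical sketch (illustrative only) places the "outputs of f" at equally spaced points with separation exactly ρ = √(2 ln(2n − 2))/n, forms the limiting kernel K_{g,ij} = exp(−n²|z_i − z_j|²/2), and compares its smallest eigenvalue with the Gershgorin bound, which equals exactly 1/2 at this separation.

```python
import numpy as np

# Gershgorin bound for the limiting Gaussian kernel K_g at the separation
# threshold rho = sqrt(2 ln(2n-2))/n used in the text.
n = 10
rho = np.sqrt(2 * np.log(2 * n - 2)) / n
z = np.arange(n) * rho                                  # equally spaced "outputs of f"

K = np.exp(-0.5 * n**2 * (z[:, None] - z[None, :]) ** 2)
bound = 1 - (n - 1) * np.exp(-0.5 * n**2 * rho**2)      # = 1 - (n-1)/(2n-2) = 1/2
print("lambda_min(K_g) =", np.linalg.eigvalsh(K).min(), ">=", bound)
```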
Within the ball B((u_0, a_0), R) := {(u, a) : ‖u − u_0‖² + ‖a − a_0‖² ≤ R²} with an arbitrary radius R > 0, the outputs of f are always well separated, given that the initial outputs of f are separated by 2R as already discussed above. This is because

|f(u; x) − f(u_0; x)| = |(1/√m) Σ_{k=1}^m ((u_k − u_{0,k}) cos(ω_k x) + (u_{k+m} − u_{0,k+m}) sin(ω_k x))| ≤ ‖u − u_0‖ ≤ R,

which leads to ρ(B((u_0, a_0), R)) > R by Eq.(48). By choosing R appropriately, to be specific R ≥ √(2 ln(2n − 2))/n, the uniform conditioning of K_g is satisfied.
Hence, we see that composing large non-linear models may make the tangent kernel no longer constant,
but the uniform conditioning of the tangent kernel can remain.

E Wide CNN and ResNet satisfy PL∗ condition


In this section, we will show that wide Convolutional Neural Networks (CNN) and Residual Neural
Networks (ResNet) also satisfy the PL∗ condition.
The CNN is defined as follows:

α^(0) = x,
α^(l) = σ_l( (1/√m_{l−1}) W^(l) ∗ α^(l−1) ),   ∀l = 1, 2, . . . , L,
f(W; x) = (1/√m_L) ⟨W^(L+1), α^(L)⟩,   (49)

where ∗ is the convolution operator (see the definition below) and ⟨·, ·⟩ is the standard matrix inner product.
Compared to the definition of fully-connected neural networks in Eq.(17), the l-th layer of hidden neurons α^(l) ∈ R^{m_l×Q} is a matrix, where m_l is the number of channels and Q is the number of pixels, and W^(l) ∈ R^{K×m_l×m_{l−1}} is an order-3 tensor, where K is the filter size, except that W^(L+1) ∈ R^{m_L×Q}.
For simplicity of notation, we give the definition of the convolution operation for a 1-D CNN in the following. It is not hard to extend it to higher-dimensional CNNs, and one will find that our analysis still applies.
(W^(l) ∗ α^(l−1))_{i,q} = Σ_{k=1}^K Σ_{j=1}^{m_{l−1}} W^(l)_{k,i,j} α^(l−1)_{j, q+k−(K+1)/2},   i ∈ [m_l], q ∈ [Q].   (50)
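For concreteness, here is a direct (unoptimized) sketch of the convolution in Eq.(50). The zero padding at the boundary and the assumption that K is odd are choices made here, since Eq.(50) leaves the out-of-range pixel indices unspecified.

```python
import numpy as np

def conv1d(W_l, alpha):
    """Eq.(50): W_l is a (K, m_l, m_prev) filter tensor, alpha a (m_prev, Q) layer input."""
    K, m_l, m_prev = W_l.shape
    _, Q = alpha.shape
    out = np.zeros((m_l, Q))
    for i in range(m_l):
        for q in range(Q):
            for k in range(K):
                p = q + k - (K - 1) // 2          # 0-based version of q + k - (K+1)/2
                if 0 <= p < Q:                    # zero padding outside the pixel range
                    out[i, q] += W_l[k, i] @ alpha[:, p]
    return out

# Example: m_prev = 3 channels, Q = 8 pixels, m_l = 4 output channels, filter size K = 3.
rng = np.random.default_rng(0)
print(conv1d(rng.normal(size=(3, 4, 3)), rng.normal(size=(3, 8))).shape)   # (4, 8)
```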

The ResNet is similarly defined as follows:

α^(0) = x,
α^(l) = σ_l( (1/√m_{l−1}) W^(l) α^(l−1) ) + α^(l−1),   ∀l = 1, 2, · · · , L + 1,
f(W; x) = α^(L+1).   (51)

We see that the ResNet is the same as a fully connected neural network, Eq.(17), except that each activation α^(l) has an extra additive term α^(l−1) from the previous layer, interpreted as a skip connection.
Remark 10. This definition of ResNet differs from the standard ResNet architecture in [15] in that the skip connections are at every layer, instead of every two layers. The same analysis can be easily generalized to cases where skip connections occur every two or more layers. The same definition, up to a scaling factor, was also theoretically studied in [11].
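A minimal forward pass for the ResNet of Eq.(51) might look as follows; this is a sketch with illustrative choices (equal widths at every layer so that the skip connection is well defined, tanh activations, standard Gaussian initialization), not the exact setting of Theorem 13.

```python
import numpy as np

def resnet_forward(W, x, sigma=np.tanh):
    a = x
    for W_l in W:
        a = sigma(W_l @ a / np.sqrt(W_l.shape[1])) + a   # Eq.(51): scaled layer + skip connection
    return a                                             # f(W; x) = alpha^(L+1)

rng = np.random.default_rng(0)
m, L = 16, 3
W = [rng.normal(size=(m, m)) for _ in range(L + 1)]      # W^(1), ..., W^(L+1), all m x m here
print(resnet_forward(W, rng.normal(size=m)))
```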

By the following theorem, we have an upper bound for the Hessian spectral norm of the CNN and ResNet, similar to Theorem 5 for fully-connected neural networks.
Theorem 12 (Theorem 3.3 of [23]). Consider a neural network f(W; x) of the form Eq.(49) or Eq.(51). Let m be the minimum of the hidden layer widths, i.e., m = min_{l∈[L]} m_l. Given any fixed R > 0, and any W ∈ B(W_0, R) := {W : ‖W − W_0‖ ≤ R}, with high probability over the initialization, the Hessian spectral norm satisfies the following:

‖H_f(W)‖ = Õ( R^{3L} / √m ).   (52)

Using the same analysis as in Section 4.2, we can obtain a result similar to Theorem 4 for the CNN and ResNet, showing that they also satisfy the PL∗ condition:

Theorem 13 (Wide CNN and ResNet satisfy the PL∗ condition). Consider the neural network f(W; x) in Eq.(49) or Eq.(51), and a random parameter setting W_0 such that each element of W_0^(l) for l ∈ [L + 1] follows N(0, 1). Suppose that the last layer activation σ_{L+1} satisfies |σ′_{L+1}(z)| ≥ ρ > 0 and that λ_0 := λ_min(K(W_0)) > 0. For any µ ∈ (0, λ_0ρ²), if the width of the network

m = Ω̃( nR^{6L+2} / (λ_0 − µρ^{−2})² ),   (53)

then the µ-PL∗ condition holds for the square loss in the ball B(W_0, R).

F Proof for convergence under PL∗ condition


In Theorem 6, the convergence of gradient descent is established for the square loss function. In fact, similar results (with slight modifications) hold for general loss functions. In the following theorem, we provide convergence under the PL∗ condition for general loss functions. We then prove Theorems 6 and 14 together.

Theorem 14. Suppose the loss function L(w) (not necessarily the square loss) is β-smooth and satisfies the µ-PL∗ condition in the ball B(w_0, R) := {w ∈ R^m : ‖w − w_0‖ ≤ R} with R = 2√(2βL(w_0))/µ. Then we have the following:
(a) Existence of a solution: There exists a solution (global minimizer of L) w∗ ∈ B(w_0, R), such that F(w∗) = y.
(b) Convergence of GD: Gradient descent with a step size η ≤ 1/ sup_{w∈B(w_0,R)} ‖H_L(w)‖_2 converges to a global solution in B(w_0, R), with an exponential (also known as linear) convergence rate:

L(w_t) ≤ (1 − κ_{L,F}^{−1}(B(w_0, R)))^t L(w_0),   (54)

where the condition number κ_{L,F}(B(w_0, R)) = 1/(ηµ).

Proof. Let’s start with Theorem 14. We prove this theorem √ by induction. The induction hypothesis is
2 2βL(w0 )
that, for all t ≥ 0, wt is within the ball B(w0 , R) with R = µ , and
t
L(wt ) ≤ (1 − ηµ) L(w0 ). (55)
0
In the base case, where t = 0, it is trivial that w0 ∈ B(w0 , R) and that L(wt ) ≤ (1 − ηµ) L(w0 ).
Suppose that, for a given t ≥ 0, w_t is in the ball B(w_0, R) and Eq.(55) holds. As we show separately below (in Eq.(59)), we have w_{t+1} ∈ B(w_0, R). Hence we see that L is (sup_{w∈B(w_0,R)} ‖H_L(w)‖_2)-smooth in B(w_0, R); by the definition of η = 1/ sup_{w∈B(w_0,R)} ‖H_L(w)‖_2, consequently, L is 1/η-smooth. Using this, we obtain the following:

L(w_{t+1}) − L(w_t) − ∇L(w_t)(w_{t+1} − w_t) ≤ (1/2) sup_{w∈B(w_0,R)} ‖H_L‖_2 ‖w_{t+1} − w_t‖²
  = (1/(2η)) ‖w_{t+1} − w_t‖².   (56)
Taking w_{t+1} − w_t = −η∇L(w_t) and using the µ-PL∗ condition at the point w_t, we have

L(w_{t+1}) − L(w_t) ≤ −(η/2) ‖∇L(w_t)‖² ≤ −ηµ L(w_t).   (57)

Therefore,

L(w_{t+1}) ≤ (1 − ηµ) L(w_t) ≤ (1 − ηµ)^{t+1} L(w_0).   (58)
To prove w_{t+1} ∈ B(w_0, R), by the fact that L is β-smooth, we have

‖w_{t+1} − w_0‖ = η ‖Σ_{i=0}^t ∇L(w_i)‖
  ≤ η Σ_{i=0}^t ‖∇L(w_i)‖
  ≤ η Σ_{i=0}^t √(2β(L(w_i) − L(w_{i+1})))
  ≤ η Σ_{i=0}^t √(2βL(w_i))
  ≤ η √(2β) (Σ_{i=0}^t (1 − ηµ)^{i/2}) √L(w_0)
  ≤ η √(2β) √L(w_0) · 1/(1 − √(1 − ηµ))
  ≤ 2√(2β) √L(w_0) / µ
  = R.   (59)
Thus, wt+1 resides in the ball B(w0 , R).
By the principle of induction, the hypothesis is true.
Now, we prove Theorem 6, i.e., the particular case of the square loss L(w) = (1/2)‖F(w) − y‖². In this case,

∇L(w) = (F(w) − y)^T DF(w).   (60)

Hence, in Eq.(59), we have the following instead:

‖w_{t+1} − w_0‖ ≤ η Σ_{i=0}^t ‖∇L(w_i)‖
  ≤ η Σ_{i=0}^t ‖DF(w_i)‖_2 ‖F(w_i) − y‖
  ≤ η L_F (Σ_{i=0}^t (1 − µ/β_F)^{i/2}) ‖F(w_0) − y‖
  ≤ η L_F · (2/(ηµ)) · ‖F(w_0) − y‖
  = 2 L_F ‖F(w_0) − y‖ / µ.   (61)

Also note that, for all t > 0, ‖H_L(w_t)‖_2 ≤ L_F² + β_F · ‖F(w_0) − y‖, since ‖F(w_t) − y‖ ≤ ‖F(w_0) − y‖. Hence, the step size η = 1/(L_F² + β_F · ‖F(w_0) − y‖) is valid.
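As a toy illustration of Theorem 14 (not from the paper), the sketch below runs gradient descent on the non-convex one-equation system F(w) = w_1 + 0.5 sin(w_2) = y, for which ‖∇F(w)‖² ≥ 1 everywhere, so the square loss satisfies the µ-PL∗ condition with µ = 1; the observed loss stays below the (1 − ηµ)^t envelope guaranteed by the theorem.

```python
import numpy as np

# Gradient descent on L(w) = 0.5*(F(w)-y)^2 with F(w) = w_1 + 0.5*sin(w_2):
# ||grad F||^2 = 1 + 0.25*cos^2(w_2) >= 1, hence mu-PL* holds with mu = 1.
y = 1.0
F = lambda w: w[0] + 0.5 * np.sin(w[1])
grad_F = lambda w: np.array([1.0, 0.5 * np.cos(w[1])])
loss = lambda w: 0.5 * (F(w) - y) ** 2

w = np.zeros(2)
eta, mu = 0.5, 1.0                 # eta <= 1/sup||H_L|| on the relevant ball (||H_L|| <= 1.75 here)
L0 = loss(w)
for t in range(1, 21):
    w = w - eta * (F(w) - y) * grad_F(w)           # gradient descent step
    envelope = (1 - eta * mu) ** t * L0            # rate from Theorem 14
    print(f"t={t:2d}  loss={loss(w):.3e}  envelope={envelope:.3e}")
```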

G Proof of Theorem 7
Proof. We first make the stronger assumption that the µ-PL∗ condition holds on the whole parameter space R^m. We will see that the condition can then be relaxed to hold only in the ball B(w_0, R).
Following a similar analysis to the proof of Theorem 1 in [4], by the µ-PL∗ condition, we can get that, for any mini-batch size s, the mini-batch SGD with step size η∗(s) := n/(λ(n²λ + µ(s − 1))) has an exponential convergence rate for all t > 0:

E[L(w_t)] ≤ (1 − µsη∗(s)/n)^t L(w_0).   (62)

Moreover, the expected length of each step is bounded by

E[‖w_{t+1} − w_t‖] = η∗ E[ ‖ Σ_{j=1}^s ∇ℓ_{i_t^{(j)}}(w_t) ‖ ]
  ≤ η∗ Σ_{j=1}^s E[ ‖∇ℓ_{i_t^{(j)}}(w_t)‖ ]
  ≤ η∗ Σ_{j=1}^s E[ √(2λ ℓ_{i_t^{(j)}}(w_t)) ]
  ≤ η∗ s E[ √(2λ L(w_t)) ]
  ≤ η∗ s √(2λ E[L(w_t)])
  ≤ η∗ s √(2λ) (1 − µsη∗/n)^{t/2} √L(w_0),

where we use {i_t^{(1)}, i_t^{(2)}, . . . , i_t^{(s)}} to denote a random mini-batch of the dataset at step t.
Then the expectation of the length of the whole optimization path is bounded by

E[ Σ_{i=0}^∞ ‖w_{i+1} − w_i‖ ] = Σ_{i=0}^∞ E‖w_{i+1} − w_i‖ ≤ Σ_{i=0}^∞ η∗ s √(2λ) (1 − µsη∗/n)^{i/2} √L(w_0) ≤ 2n√(2λ) √L(w_0) / µ.

By Markov's inequality, we have, with probability at least 1 − δ, that the length of the path is shorter than R, i.e.,

Σ_{i=0}^∞ ‖w_{i+1} − w_i‖ ≤ 2n√(2λL(w_0)) / (µδ) = R.

This means that, with probability at least 1 − δ, the whole path is covered by the ball B(w0 , R), namely,
for all t,
‖w_t − w_0‖ ≤ Σ_{i=0}^{t−1} ‖w_{i+1} − w_i‖ ≤ R.

On those events where the whole path is covered by the ball, the argument only uses the PL∗ condition inside the ball, so the assumption can be relaxed from the whole space R^m to B(w_0, R).