Calculus of Variations
Jeff Calder
University of Minnesota
School of Mathematics
[email protected]
February 6, 2024
Contents
1 Introduction
  1.1 Examples

3 Applications
  3.1 Shortest path
  3.2 The brachistochrone problem
  3.3 Minimal surfaces
  3.4 Minimal surface of revolution
  3.5 Isoperimetric inequality
  3.6 Image restoration
    3.6.1 Gradient descent and acceleration
    3.6.2 Primal dual approach
    3.6.3 The split Bregman method
    3.6.4 Edge preservation properties of Total Variation restoration
  3.7 Image segmentation
    3.7.1 Ginzburg-Landau approximation

5 Graph-based learning
  5.1 Calculus on graphs
    5.1.1 The Graph Laplacian and maximum principle
  5.2 Concentration of measure
  5.3 Random geometric graphs and basic setup
  5.4 The maximum principle approach
  5.5 The variational approach
    5.5.1 Consistency for smooth functions
    5.5.2 Discrete to nonlocal via transportation maps
    5.5.3 Nonlocal to local estimates
    5.5.4 Discrete to continuum
Chapter 1

Introduction
leave out some of the rigorous detail that is expected in a graduate course. These first
four chapters require some familiarity with ordinary differential equations (ODEs),
and multi-variable calculus. The appendix contains some mathematical preliminaries
and notational conventions that are important to review.
The level of rigour increases dramatically in Chapter 4, where we begin a study of
the direct method in the Calculus of Variations for proving existence of minimizers.
This chapter requires some basic background in functional analysis and, in particular,
Sobolev spaces. We provide a very brief overview in Section 4.2, and refer the reader
to [28] for more details.
Finally, in Chapter 5 we give an overview of applications of the calculus of varia-
tions to prove discrete to continuum results in graph-based semi-supervised learning.
The ideas in this chapter are related to Γ-convergence, which is a notion of conver-
gence for functionals that ensures minimizers converge to minimizers.
These notes are just a basic introduction. We refer the reader to [28, Chapter
8] and [22] for a more thorough exposition of the theory behind the Calculus of
Variations.
1.1 Examples
We begin with some examples.
Example 1.1 (Shortest path). Let A and B be two points in the plane. What
is the shortest path between A and B? The answer depends on how we measure
length! Suppose the length of a short line segment near (x, y) is the usual Euclidean
length multiplied by a positive scale factor g(x, y). For example, the length of a path
could correspond to the length of time it would take a robot to navigate the path,
and certain regions in space may be easier or harder to navigate, yielding larger or
smaller values of g. Robotic navigation is thus a special case of finding the shortest
path between two points.
Suppose A lies to the left of B and the path is a graph u(x) over the x axis. See
Figure 1.1. Then the “length” of the path between x and x + ∆x is approximately
L = g(x, u(x)) √(1 + u′(x)²) ∆x.
If we let A = (0, 0) and B = (a, b) where a > 0, then the length of a path (x, u(x))
connecting A to B is
I(u) = ∫₀ᵃ g(x, u(x)) √(1 + u′(x)²) dx.
The problem of finding the shortest path from A to B is equivalent to finding the
function u that minimizes the functional I(u) subject to u(0) = 0 and u(a) = b. △
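As a quick sanity check, the functional I(u) can be discretized by summing segment lengths along a polyline. The grid, the choice g ≡ 1, and the comparison path below are my own illustrative assumptions, not part of the text:

```python
import numpy as np

def path_length(u, x, g=None):
    """Discretize I(u) by summing polyline segment lengths, weighting each
    segment by g evaluated at its midpoint; g=None means g ≡ 1 (Euclidean)."""
    dx, du = np.diff(x), np.diff(u)
    seg = np.sqrt(dx**2 + du**2)                 # segment lengths
    if g is None:
        return np.sum(seg)
    xm, um = 0.5 * (x[1:] + x[:-1]), 0.5 * (u[1:] + u[:-1])
    return np.sum(g(xm, um) * seg)

x = np.linspace(0.0, 1.0, 2001)
line = x                                # straight line from (0,0) to (1,1)
detour = x + 0.3 * np.sin(np.pi * x)    # same endpoints, but curved

len_line = path_length(line, x)         # exactly sqrt(2) for this polyline
len_detour = path_length(detour, x)     # strictly longer
```

With g ≡ 1 any non-straight path comes out longer, consistent with the discussion of Example 3.1 later in the notes; a nonconstant g would change the comparison.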
Figure 1.1: In our version of the shortest path problem, all paths must be graphs of
functions u = u(x).
Example 1.2 (Brachistochrone problem). In 1696 Johann Bernoulli posed the fol-
lowing problem. Let A and B be two points in the plane with A lying above B.
Suppose we connect A and B with a thin wire and allow a bead to slide from A to
B under the influence of gravity. Assuming the bead slides without friction, what is
the shape of the wire that minimizes the travel time of the bead? Perhaps counter-
intuitively, it turns out that the optimal shape is not a straight line! The problem
is commonly referred to as the brachistochrone problem—the word brachistochrone
derives from ancient Greek meaning “shortest time”.
Let g denote the acceleration due to gravity. Suppose that A = (0, 0) and B =
(a, b) where a > 0 and b < 0. Let u(x) for 0 ≤ x ≤ a describe the shape of the
wire, so u(0) = 0 and u(a) = b. Let v(x) denote the speed of the bead when it is
at position x. When the bead is at position (x, u(x)) along the wire, the potential
energy stored in the bead is PE = mgu(x) (relative to height zero), and the kinetic
energy is KE = ½mv(x)², where m is the mass of the bead. By conservation of energy
½ m v(x)² + m g u(x) = 0,
since the bead starts with zero total energy at point A. Therefore
v(x) = √(−2g u(x)).
Between x and x + ∆x, the bead slides a distance of approximately √(1 + u′(x)²) ∆x
with a speed of v(x) = √(−2g u(x)). Hence it takes approximately

t = ( √(1 + u′(x)²) / √(−2g u(x)) ) ∆x
time for the bead to move from position x to x + ∆x. Therefore the total time taken
for the bead to slide down the wire is given by
I(u) = (1/√(2g)) ∫₀ᵃ √( (1 + u′(x)²)/(−u(x)) ) dx.
The problem of determining the optimal shape of the wire is therefore equivalent to
finding the function u(x) that minimizes I(u) subject to u(0) = 0 and u(a) = b. △
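A rough numerical experiment already shows that the straight line is not optimal. The setup is my own: the comparison curve u(x) = −√x is just an arbitrary curve with a steep initial drop, not the true minimizer, and the travel time is approximated by summing segment length over midpoint speed, with v = √(−2gu) from conservation of energy:

```python
import numpy as np

def travel_time(u, a=1.0, g=9.81, n=20000):
    """Approximate I(u) by summing (segment length)/(speed), with the speed
    sqrt(-2 g u) from conservation of energy evaluated at segment midpoints."""
    x = np.linspace(0.0, a, n + 1)
    y = u(x)
    seg = np.sqrt(np.diff(x)**2 + np.diff(y)**2)   # segment lengths
    ymid = 0.5 * (y[1:] + y[:-1])                  # midpoint heights (< 0)
    return np.sum(seg / np.sqrt(-2.0 * g * ymid))

t_line = travel_time(lambda x: -x)            # straight line to (1, -1)
t_drop = travel_time(lambda x: -np.sqrt(x))   # steeper initial drop, same endpoints

# t_drop comes out smaller: the straight line is not the fastest wire
```

The true minimizer (a cycloid, derived in Section 3.2) would do better still.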
Example 1.3 (Minimal surfaces). Suppose we bend a piece of wire into a loop of
any shape we wish, and then dip the wire loop into a solution of soapy water. A
soap bubble will form across the loop, and we may naturally wonder what shape the
bubble will take. Physics tells us that the soap bubble formed will be the one with the
least surface area, at least locally, compared to all other surfaces that span the wire loop.
Such a surface is called a minimal surface.
To formulate this mathematically, suppose the loop of wire is the graph of a
function g : ∂U → R, where U ⊂ R2 is open and bounded. We also assume that all
possible surfaces spanning the wire can be expressed as graphs of functions u : U → R.
To ensure the surface connects to the wire we ask that u = g on ∂U . The surface
area of a candidate soap film surface u is given by
I(u) = ∫_U √(1 + |∇u|²) dx.
Thus, the minimal surface problem is equivalent to finding a function u that minimizes
I subject to u = g on ∂U. △
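A minimal numerical sketch of the area functional (the grid, the boundary data g ≡ 0, and the comparison surface are my assumptions): any non-flat surface with the same boundary has a larger discretized area than the flat one, since the integrand √(1 + |∇u|²) ≥ 1 with equality only where ∇u = 0.

```python
import numpy as np

def surface_area(u, h):
    """Discretize I(u): average forward differences to cell centers,
    then sum sqrt(1 + |grad u|^2) * h^2 over all grid cells."""
    ux = (u[1:, :-1] - u[:-1, :-1] + u[1:, 1:] - u[:-1, 1:]) / (2.0 * h)
    uy = (u[:-1, 1:] - u[:-1, :-1] + u[1:, 1:] - u[1:, :-1]) / (2.0 * h)
    return np.sum(np.sqrt(1.0 + ux**2 + uy**2)) * h**2

n = 101
h = 1.0 / (n - 1)
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")

flat = np.zeros((n, n))                          # u ≡ 0, boundary data g ≡ 0
bump = np.sin(np.pi * xx) * np.sin(np.pi * yy)   # same zero boundary values

area_flat = surface_area(flat, h)   # = 1, the area of the unit square
area_bump = surface_area(bump, h)   # strictly larger
```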
Example 1.4 (Image restoration). A grayscale image is a function u : [0, 1]2 → [0, 1].
For x ∈ R2 , u(x) represents the brightness of the pixel at location x. In real-world
applications, images are often corrupted in the acquisition process or thereafter, and
we observe a noisy version of the image. The task of image restoration is to recover
the true noise-free image from a noisy observation.
Let f (x) be the observed noisy image. A widely used and very successful approach
to image restoration is the so-called total variation (TV) restoration, which minimizes
the functional

I(u) = ∫_U ½(u − f)² + λ|∇u| dx,
where λ > 0 is a parameter and U = (0, 1)2 . The restored image is the function u
that minimizes I (we do not impose boundary conditions on the minimizer). The
first term ½(u − f)² is called a fidelity term, and encourages the restored image to
be close to the observed noisy image f. The second term |∇u| measures the amount
of noise in the image, and minimizing this term encourages the removal of noise in
the restored image. The name TV restoration comes from the fact that ∫_U |∇u| dx
is called the total variation of u. Total variation image restoration was pioneered by
Rudin, Osher, and Fatemi [49]. △
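The following sketch minimizes a smoothed version of the TV functional on a 1-D signal by explicit gradient descent. The ε-regularization √((Du)² + ε²), all parameter values, and the test signal are my assumptions, chosen to keep the example short and differentiable; practical TV solvers are the subject of Section 3.6.

```python
import numpy as np

def tv_energy(u, f, lam=1.0, eps=0.1):
    """Smoothed TV energy: 0.5*sum (u-f)^2 + lam * sum sqrt((Du)^2 + eps^2),
    where D is the forward difference (grid spacing absorbed into lam)."""
    du = np.diff(u)
    return 0.5 * np.sum((u - f)**2) + lam * np.sum(np.sqrt(du**2 + eps**2))

def tv_denoise(f, lam=1.0, eps=0.1, tau=0.02, steps=3000):
    """Explicit gradient descent on tv_energy. The gradient is Lipschitz with
    constant at most 1 + 4*lam/eps, and tau below its reciprocal guarantees
    the (convex) energy does not increase at any step."""
    u = f.copy()
    for _ in range(steps):
        du = np.diff(u)
        p = du / np.sqrt(du**2 + eps**2)      # derivative of TV term w.r.t. Du
        div_p = np.concatenate(([p[0]], np.diff(p), [-p[-1]]))  # -D^T p
        u = u - tau * ((u - f) - lam * div_p)
    return u

rng = np.random.default_rng(0)
clean = np.where(np.linspace(0, 1, 200) < 0.5, 0.0, 1.0)   # a sharp edge
f = clean + 0.1 * rng.standard_normal(200)                  # noisy observation
u = tv_denoise(f)   # typically much closer to `clean` than f is
```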
Example 1.5 (Image segmentation). A common task in computer vision is the seg-
mentation of an image into meaningful regions. Let f : [0, 1]2 → [0, 1] be a grayscale
image we wish to segment. We represent possible segmentations of the image by the
level sets of functions u : [0, 1]2 → R. Each function u divides the domain [0, 1]2 into
two regions defined by
R+ (u) = {x ∈ [0, 1]2 : u(x) > 0} and R− (u) = {x ∈ [0, 1]2 : u(x) ≤ 0}.
The boundary between the two regions is the level set {x ∈ [0, 1]2 : u(x) = 0}.
At a very basic level, we might assume our image is composed of two regions
with different intensity levels f = a and f = b, corrupted by noise. Thus, we might
propose to segment the image by minimizing the functional
I(u, a, b) = ∫_{R+(u)} (f(x) − a)² dx + ∫_{R−(u)} (f(x) − b)² dx,
over all possible segmentations u and real numbers a and b. However, this turns
out not to work very well since it does not incorporate the geometry of the region
in any way. Intuitively, a semantically meaningful object in an image is usually
concentrated in some region of the image, and might have a rather smooth boundary.
The minimizers of I could be very pathological and oscillate rapidly, trying to capture
every pixel with intensity near a in one region and those near b in the other region. For example, if f
only takes the values 0 and 1, then minimizing I will try to put all the pixels in the
image where f is 0 into one region, and all those where f is 1 into the other region,
and will choose a = 0 and b = 1. This is true regardless of whether the region where
f is zero is a nice circle in the center of the image, or if we randomly choose each
pixel to be 0 or 1. In the latter case, the segmentation u will oscillate wildly and will
not give a meaningful result.
A common approach to fixing this issue is to include a penalty on the length
of the boundary between the two regions. Let us denote the length of the boundary
between R+ (u) and R− (u) (i.e., the zero level set of u) by L(u). Thus, we seek instead
to minimize the functional

I(u, a, b) = ∫_{R+(u)} (f(x) − a)² dx + ∫_{R−(u)} (f(x) − b)² dx + λL(u),

where λ > 0 is a parameter. To rewrite the functional in a more convenient form, let
H denote the Heaviside function, defined by H(z) = 1 for z > 0 and H(z) = 0 for z ≤ 0.
Then the region R+(u) is precisely the region where H(u(x)) = 1, and the region
R−(u) is precisely where H(u(x)) = 0. Therefore
∫_{R+(u)} (f(x) − a)² dx = ∫_U H(u(x))(f(x) − a)² dx,

and similarly ∫_{R−(u)} (f(x) − b)² dx = ∫_U (1 − H(u(x)))(f(x) − b)² dx. Therefore we have
I(u, a, b) = ∫_U H(u)(f − a)² + (1 − H(u))(f − b)² + λδ(u)|∇u| dx.

△
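For a fixed segmentation u, the functional I(u, a, b) is quadratic in a and in b separately, so the optimal constants are the mean values of f over R+(u) and R−(u). A small check on a synthetic signal (all values mine):

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.where(np.arange(100) < 40, 0.2, 0.8) + 0.05 * rng.standard_normal(100)
region = np.arange(100) < 40                  # a fixed segmentation R+(u)

def fidelity(a, b):
    """The two fidelity terms of I(u, a, b) for this fixed segmentation."""
    return np.sum((f[region] - a)**2) + np.sum((f[~region] - b)**2)

a_star, b_star = f[region].mean(), f[~region].mean()   # the region means
# fidelity(a_star, b_star) is minimal over all choices of a and b
```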
Chapter 2

The Euler-Lagrange Equation
L : U × R × Rᵈ → R.

L = L(x₁, x₂, . . . , xₙ, z, p₁, p₂, . . . , pₙ).

and

∇p L(x, z, p) = (L_{p₁}(x, z, p), . . . , L_{pₙ}(x, z, p)).
Each example from Section 1.1 involved a functional of the general form of (2.1).
For the shortest path problem d = 1 and
L(x, z, p) = g(x₁, z) √(1 + p₁²).
This means that h(t) := I(u + tφ) has a global minimum at t = 0, i.e., h(0) ≤ h(t)
for all t. It follows that h′(0) = 0, which is equivalent to
(2.4)    d/dt I(u + tφ) |_{t=0} = 0.
For notational simplicity, let us suppress the x arguments from u(x) and φ(x). By
the chain rule
d/dt L(x, u + tφ, ∇u + t∇φ) = Lz(x, u + tφ, ∇u + t∇φ) φ + ∇p L(x, u + tφ, ∇u + t∇φ) · ∇φ.

Therefore

d/dt L(x, u + tφ, ∇u + t∇φ) |_{t=0} = Lz(x, u, ∇u) φ + ∇p L(x, u, ∇u) · ∇φ,
and we have

d/dt I(u + tφ) |_{t=0} = d/dt |_{t=0} ∫_U L(x, u(x) + tφ(x), ∇u(x) + t∇φ(x)) dx
                      = ∫_U d/dt |_{t=0} L(x, u(x) + tφ(x), ∇u(x) + t∇φ(x)) dx
                      = ∫_U Lz(x, u, ∇u) φ + ∇p L(x, u, ∇u) · ∇φ dx.
It follows from the vanishing lemma (Lemma A.43 in the appendix) that
for all φ ∈ Cc∞ (U ). A function u ∈ C 1 (U ) satisfying the above for all φ ∈ Cc∞ (U ) is
called a weak solution of the Euler-Lagrange equation (2.3). Thus, weak solutions of
PDEs arise naturally in the calculus of variations.
In some of the examples presented in Section 1.1, such as the image segmentation
and restoration problems, we did not impose any boundary condition on the minimizer
u. For such problems, Theorem 2.1 still applies, but the Euler-Lagrange equation
(2.3) is not uniquely solvable without a boundary condition. Hence, we need some
additional information about minimizers in order for the Euler-Lagrange equation to
be useful for these problems.
Theorem 2.7. Suppose that u ∈ C²(U) satisfies I(u) ≤ I(v)
for all v ∈ C²(U). Then u satisfies the Euler-Lagrange equation (2.3) with boundary
condition

(2.9)    ∇p L(x, u, ∇u) · ν = 0 on ∂U,

where ν denotes the outward unit normal to ∂U.
Proof. By Theorem 2.1, u satisfies the Euler-Lagrange equation (2.3). We just need
to show that u also satisfies the boundary condition (2.9).
Let φ ∈ C ∞ (U ) be a smooth function that is not necessarily zero on ∂U . Then
by hypothesis I(u) ≤ I(u + tφ) for all t. Therefore, as in the proof of Theorem 2.1
we have
(2.10)    0 = d/dt I(u + tφ) |_{t=0} = ∫_U Lz(x, u, ∇u) φ + ∇p L(x, u, ∇u) · ∇φ dx.
Since u satisfies the Euler-Lagrange equation (2.3), the second term above vanishes
and we have

∫_{∂U} φ ∇p L(x, u, ∇u) · ν dS = 0
for any vector v ∈ Rᵈ. It is possible to take (2.11) as the definition of the gradient of
u. By this, we mean that w = ∇u(x) is the unique vector satisfying

d/dt u(x + tv) |_{t=0} = w · v

for all v ∈ Rᵈ.
In the case of functionals I(u), we showed in the proof of Theorem 2.1 that
d/dt I(u + tφ) |_{t=0} = ( Lz(x, u, ∇u) − div(∇p L(x, u, ∇u)), φ )_{L²(U)}
for all φ ∈ Cc∞ (U ). Here, the L2 -inner product plays the role of the dot product from
the finite dimensional case. Thus, it makes sense to define the gradient, also called
the functional gradient, to be

∇I(u) := Lz(x, u, ∇u) − div(∇p L(x, u, ∇u)).
The reader should compare this with the ordinary chain rule (2.11). Notice the
definition of the gradient ∇I depends on the choice of the L2 -inner product. Using
other inner products will result in different notions of gradient.
We consider the problem of minimizing I(u) subject to the constraint J(u) = 0. The
Lagrange multiplier method gives necessary conditions that must be satisfied by any
minimizer.
Theorem 2.8 (Lagrange multiplier). Suppose that u ∈ C²(U) satisfies J(u) = 0 and
I(u) ≤ I(v) for all v ∈ C²(U) with v = u on ∂U and J(v) = 0. Then there exists a real number
λ such that

(2.15)    ∇I(u) + λ∇J(u) = 0.
This says that ∇I(u) is orthogonal to everything that is orthogonal to ∇J(u). In-
tuitively this must imply that ∇I(u) and ∇J(u) are co-linear; we give the proof
below.
We now have three cases.
1. If ∇J(u) = 0 then (∇I(u), φ)L2 (U ) = 0 for all φ ∈ C ∞ (U ), and by the vanishing
lemma ∇I(u) = 0. Here we can choose any real number for λ.
2. If ∇I(u) = 0 then we can take λ = 0 to complete the proof.
3. Finally, suppose ∇I(u) ≠ 0 and ∇J(u) ≠ 0, and define

λ = − (∇I(u), ∇J(u))_{L²(U)} / ‖∇J(u)‖²_{L²(U)}    and    v = ∇I(u) + λ∇J(u).

Then

(∇J(u), v)_{L²(U)} = (∇J(u), ∇I(u))_{L²(U)} + λ‖∇J(u)‖²_{L²(U)} = 0.

Since ∇I(u) is orthogonal to everything that is orthogonal to ∇J(u), we have

0 = (∇I(u), v)_{L²(U)}
  = (v − λ∇J(u), v)_{L²(U)}
  = (v, v)_{L²(U)} − λ(∇J(u), v)_{L²(U)}
  = ‖v‖²_{L²(U)}.

Hence v = ∇I(u) + λ∇J(u) = 0, which completes the proof.
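A finite-dimensional analogue of Theorem 2.8 can be checked directly (the example is mine, with the L²-inner product replaced by the Euclidean dot product): minimizing I(x, y) = x + y on the circle J(x, y) = x² + y² − 1 = 0, the vectors ∇I and ∇J become collinear at the minimizer.

```python
import numpy as np

# Minimize I(x, y) = x + y subject to J(x, y) = x^2 + y^2 - 1 = 0
# by parametrizing the constraint as (cos t, sin t).
t = np.linspace(0.0, 2.0 * np.pi, 1_000_000)
i = np.argmin(np.cos(t) + np.sin(t))
x, y = np.cos(t[i]), np.sin(t[i])            # ≈ (-1/sqrt(2), -1/sqrt(2))

grad_I = np.array([1.0, 1.0])
grad_J = np.array([2.0 * x, 2.0 * y])
lam = -np.dot(grad_I, grad_J) / np.dot(grad_J, grad_J)   # the case-3 formula

residual = np.linalg.norm(grad_I + lam * grad_J)         # ≈ 0 at the minimizer
```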
Remark 2.10. Notice that (2.15) is equivalent to the necessary conditions satisfied by
minimizers of the augmented functional
compute

d/dt I(u) = d/dt ∫_U L(x, u(x, t), ∇u(x, t)) dx
          = ∫_U Lz(x, u, ∇u) u_t + ∇p L(x, u, ∇u) · ∇u_t dx
          = ∫_U (Lz(x, u, ∇u) − div(∇p L(x, u, ∇u))) u_t dx
          = (∇I(u), u_t)_{L²(U)}
          = (∇I(u), −∇I(u))_{L²(U)}
          = −‖∇I(u)‖²_{L²(U)}
          ≤ 0.
We used integration by parts in the third line, mimicking the proof of Theorem 2.1.
Notice that either boundary condition u = g or ∇p L(x, u, ∇u) · ν = 0 on ∂U ensures
the boundary terms vanish in the integration by parts.
Notice that by writing out the Euler-Lagrange equation we can write the gradient
descent PDE (2.16) as

(2.17)    u_t + Lz(x, u, ∇u) − div(∇p L(x, u, ∇u)) = 0   in U × (0, ∞),
          u = u₀   on U × {t = 0}.
Example 2.3. Gradient descent on the Dirichlet energy (2.6) is the heat equation

(2.18)    u_t − ∆u = f   in U × (0, ∞),
          u = u₀   on U × {t = 0}.

Thus, solving the heat equation is the fastest way to decrease the Dirichlet energy. △
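A 1-D illustration of this (my own discretization, with f = 0): one explicit finite-difference step of the heat equation is exactly a gradient-descent step on the discrete Dirichlet energy, and with a stable time step the energy is nonincreasing.

```python
import numpy as np

def dirichlet_energy(u, h):
    """Discrete Dirichlet energy (1/2) * sum ((u_{i+1}-u_i)/h)^2 * h."""
    return 0.5 * np.sum((np.diff(u) / h)**2) * h

n = 101
h = 1.0 / (n - 1)
u = np.sin(2.0 * np.pi * np.linspace(0.0, 1.0, n))  # initial data, zero boundary
dt = 0.25 * h**2                                     # stable: dt <= h^2 / 2

energies = [dirichlet_energy(u, h)]
for _ in range(500):
    u[1:-1] += dt * (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2   # u_t = u_xx
    energies.append(dirichlet_energy(u, h))
# `energies` is nonincreasing: each step is gradient descent on the energy
```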
We now prove that gradient descent converges to the unique solution of the Euler-
Lagrange equation when I is strongly convex. For simplicity, we consider functionals
of the form
(2.19)    I(u) = ∫_U Ψ(x, u) + Φ(x, ∇u) dx.
This includes the regularized variational problems used for image denoising in Exam-
ple 1.4 and Section 3.6, for example. The reader may want to review definitions and
properties of convex functions in Section A.8 before reading the following theorem.
Theorem 2.11. Assume I is given by (2.19), where z 7→ Ψ(x, z) is θ-strongly convex
for θ > 0 and p 7→ Φ(x, p) is convex. Let u∗ ∈ C 2 (U ) be any function satisfying
(2.20) ∇I(u∗ ) = 0 in U.
Let u(x, t) ∈ C²(U × [0, ∞)) be a solution of the gradient descent equation (2.16) and
assume that u = u∗ on ∂U for all t > 0. Then

(2.21)    e(t) ≤ e(0) e^{−2θt},

where

(2.22)    e(t) := ½ ∫_U (u(x, t) − u∗(x))² dx.

Proof. We differentiate e(t), use the equations solved by u and u∗ and the divergence theorem
to find that

e′(t) = ∫_U (u − u∗) u_t dx
Therefore

(2.23)    e′(t) ≤ −θ ∫_U (u − u∗)² dx = −2θ e(t).
Noting that, by (2.23),

d/dt ( e(t) e^{2θt} ) = ( e′(t) + 2θ e(t) ) e^{2θt} ≤ 0,

we have e(t) ≤ e(0) e^{−2θt}, which completes the proof.
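An ODE analogue of the theorem can be checked numerically (the setup is mine): explicit Euler for gradient descent x′ = −f′(x) on the θ-strongly convex function f(x) = (θ/2)x² decays at least as fast as the continuum bound e(t) ≤ e(0)e^{−2θt}, since 1 − θ∆t ≤ e^{−θ∆t}.

```python
import numpy as np

theta = 2.0                          # strong convexity parameter

def grad_f(x):
    """Gradient of the theta-strongly convex function f(x) = (theta/2) x^2."""
    return theta * x

dt, T = 1e-3, 3.0
x = 5.0
e0 = 0.5 * x**2                      # e(0), with minimizer x* = 0
for _ in range(int(T / dt)):
    x -= dt * grad_f(x)              # explicit Euler for x' = -grad f(x)
e_T = 0.5 * x**2

bound = e0 * np.exp(-2.0 * theta * T)   # the continuum bound e(0) e^{-2 theta T}
```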
Remark 2.12. Notice that the proof of Theorem 2.11 shows that solutions of ∇I(u) =
0, subject to Dirichlet boundary condition u = g on ∂U , are unique.
Remark 2.13. From an optimization or numerical analysis point of view, the expo-
nential convergence rate (2.21) is a linear convergence rate. Indeed, if we consider
the decay in the energy e(t) defined in (2.22) over a time interval of length ∆t, then (2.23)
yields

e(t + ∆t) − e(t) = ∫ₜ^{t+∆t} e′(s) ds ≤ −2θ ∫ₜ^{t+∆t} e(s) ds ≤ −2θ∆t e(t + ∆t),

since e is nonincreasing. Rearranging, we obtain

e(t + ∆t)/e(t) ≤ 1/(1 + 2θ∆t) =: µ.
where K is the kinetic energy and P is the potential energy. The total energy K +P is
conserved in Lagrangian mechanics, and momentum arises through the kinetic energy
K.
Example 2.4 (Mass on a spring). Consider the motion of a mass on a spring with
spring constant k > 0. Assume one end of the spring is fixed to a wall at position
x = 0 and the other end attached to a mass at position x(t) > 0. The spring is
relaxed (at rest) at position x = a > 0, and we assume the mass slides horizontally
along a frictionless surface. Here, the kinetic energy is ½m x′(t)², where m is the
attached mass. The potential energy is the energy stored in the spring, and is given by
½k(x(t) − a)², according to Hooke's law. Hence, the Lagrangian action is

∫ₐᵇ [ ½m x′(t)² − ½k(x(t) − a)² ] dt.
The corresponding Euler-Lagrange equation is

−d/dt (m x′(t)) − k(x(t) − a) = 0,

which reduces to the equation of motion m x′′(t) + k(x(t) − a) = 0. △
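The equation of motion can be checked against its exact solution x(t) = a + (x₀ − a)cos(√(k/m) t) with a symplectic (velocity-Verlet) integrator; all constants below are my own choices.

```python
import numpy as np

m, k, a = 1.0, 4.0, 1.0            # mass, spring constant, rest position
omega = np.sqrt(k / m)
x, v = 1.5, 0.0                    # released from rest, stretched by 0.5

dt, T = 1e-3, 5.0
for _ in range(int(round(T / dt))):
    acc = -k * (x - a) / m         # from m x'' + k (x - a) = 0
    v_half = v + 0.5 * dt * acc    # velocity-Verlet (symplectic) step
    x = x + dt * v_half
    v = v_half + 0.5 * dt * (-k * (x - a) / m)

exact = a + (1.5 - a) * np.cos(omega * T)   # exact solution of the ODE
```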
Exercise 2.14. Let f : R → R and suppose a bead of mass m > 0 is sliding
without friction along the graph of f under the force of gravity. Let g denote the
acceleration due to gravity, and write the trajectory of the bead as (x(t), f (x(t))).
What are the equations of motion satisfied by x(t). [Hint: Follow Example 2.4, using
the appropriate expressions for kinetic and potential energy. Note that kinetic energy
is not 12 mx′ (t)2 , since this only measures horizontal movement, and bead is moving
vertically as well. The potential energy is mgf (x(t)).] 4
In our setting, we may take the kinetic energy of u = u(x, t) to be
(2.24)    K(u) = ½ ∫_U u_t² dx,
and the potential energy to be the functional I(u) we wish to minimize, defined in
(2.1). Then our action functional is
(2.25)    ∫ₐᵇ ∫_U ½ u_t² − L(x, u, ∇u) dx dt.
The equations of motion are simply the Euler-Lagrange equations for (2.25), treating
the problem in Rn+1 with (x, t) as our variables, which are
−∂/∂t u_t − (Lz − div_x(∇p L(x, u, ∇u))) = 0.
This reduces to u_tt = −∇I(u), which is a nonlinear wave equation in our setting.
Unfortunately this wave equation conserves energy, and is thus useless for opti-
mization.
In order to ensure we descend on the energy I(u), we need to introduce some
dissipative agent into the problem, such as friction. In the Lagrangian setting, this
can be obtained by adding time-dependent functions into the action functional, in
the form
(2.26)    ∫ₐᵇ ∫_U e^{at} ( ½ u_t² − L(x, u, ∇u) ) dx dt.

The Euler-Lagrange equation for (2.26) is

−∂/∂t (e^{at} u_t) − e^{at} (Lz − div_x(∇p L(x, u, ∇u))) = 0,

which reduces to

(2.27)    u_tt + a u_t = −∇I(u).
Notice that in the descent equation (2.27), the gradient ∇I(u) acts as a forcing term
that causes acceleration, whereas in gradient descent u_t = −∇I(u), the gradient is the
velocity. The additional term a u_t is a damping term, with the physical interpretation
of friction for a rolling (or sliding) ball, with a > 0 the coefficient of friction. The
damping term dissipates energy, allowing the system to converge to a critical point of
I. We note it is possible to make other modifications to the action functional (2.26)
to obtain a more general class of descent equations (see, e.g., [17]).
The accelerated descent equation (2.27) does not necessarily descend on I(u).
However, since we derived the equation from a Lagrangian variational formulation,
we do have monotonicity of total energy.
Lemma 2.15 (Energy monotonicity [17, Lemma 1]). Assume a ≥ 0 and let u satisfy
(2.27). Suppose either u(x, t) = g(x) or ∇p L(x, u, ∇u) · ν = 0 for x ∈ ∂U and t > 0.
Then for all t > 0 we have

(2.28)    d/dt (K(u) + I(u)) = −2aK(u).
Proof.
(2.29)    d/dt I(u) = d/dt ∫_U L(x, u, ∇u) dx
            = ∫_U Lz(x, u, ∇u) u_t + ∇p L(x, u, ∇u) · ∇u_t dx
            = ∫_U (Lz(x, u, ∇u) − div_x(∇p L(x, u, ∇u))) u_t dx + ∫_{∂U} u_t ∇p L(x, u, ∇u) · ν dS.

The boundary term vanishes, due to the boundary conditions, and so

(2.30)    d/dt I(u) = ∫_U ∇I(u) u_t dx.
Similarly, we have
(2.31)    d/dt K(u) = ∫_U u_t u_tt dx
            = ∫_U u_t (−a u_t − ∇I(u)) dx
            = −2aK(u) − ∫_U ∇I(u) u_t dx,
where we used the equations of motion (2.27) in the second step above. Adding (2.30)
and (2.31) completes the proof.
It is possible to prove an exponential convergence rate for the accelerated descent
equations (2.27) to the solution u∗ of ∇I(u∗ ) = 0 in the strongly convex setting, as
we did in Theorem 2.11 for gradient descent. The proof uses energy methods, as in
Theorem 2.11, but is more involved, so we refer the reader to [17, Theorem 1].
There is a natural question the reader may have at this point. If both gradient
descent and accelerated gradient descent have similar exponential convergence rates,
why would acceleration converge faster than gradient descent? The answer is that
the accelerated convergence rate is invisible at the level of the continuum descent
equations (2.16) and (2.27). The acceleration is realized only when the equations are
discretized in space-time, which is necessary for computations. Recall from Remark
2.13 that the convergence rate for a discretization with time step ∆t is
µ ≈ 1 − c∆t,
where c > 0 is the exponential rate. Hence, the convergence rate µ depends crucially
on how large we can take the time step ∆t; larger time steps yield much faster con-
vergence rates. The time step is restricted by stability considerations in the discrete
schemes, also called Courant-Friedrichs-Lewy (CFL) conditions.
The gradient descent equation (2.16) is first order in time and second order in
space, so the ratio ∆t/∆x² appears in the discretization of the descent equation, where
∆x is the spatial resolution. Stability considerations require bounding this ratio, and
so ∆t ≤ C∆x² turns out to be the stability condition for gradient descent equations
in this setting. This is a very restrictive stability condition, and such equations are
called numerically stiff. This yields a convergence rate of the form

µ ≈ 1 − c∆x².

The accelerated descent equation (2.27), on the other hand, is second order in both
time and space, and here
stability conditions amount to bounding ∆t ≤ C∆x. Another way to see this is that
wave equations have bounded speed of propagation c, and the CFL condition amounts
to ensuring that the speed of propagation of the discrete scheme is at least as fast as
c, that is, ∆x/∆t ≥ c. Thus, accelerated gradient descent has a convergence rate of the
form

µ ≈ 1 − c∆x.
Thus, as a function of the grid resolution ∆x, the accelerated equation (2.27) has a
convergence rate that is a whole order of magnitude better than the gradient descent
equation (2.16). At a basic level, gradient descent is a diffusion equation, which is
numerically stiff, while accelerated descent is a wave equation, which does not exhibit
the same stiffness. The discussion above is only heuristic; we refer the reader to [17]
for more details.
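The heuristic can be checked on a small 1-D analogue (all parameters below are mine, not the 2-D setup of Table 2.1): iterate both descent equations for u′′ = 0 with u(0) = 0, u(1) = 1, and count iterations until the discrete residual is small. The damping a = 2π here plays the role of 2√λ₁ for the first Dirichlet eigenvalue λ₁ = π² on (0, 1).

```python
import numpy as np

def iterations_to_solve(accelerated, n=64, tol=1e-3, max_iter=200_000):
    """Iterate gradient descent u_t = u_xx or the damped wave equation
    u_tt + a u_t = u_xx to the solution of u'' = 0, u(0) = 0, u(1) = 1,
    counting iterations until max |u_xx| < tol."""
    h = 1.0 / n
    u = np.zeros(n + 1)
    u[-1] = 1.0                      # boundary conditions u(0)=0, u(1)=1
    u_prev = u.copy()
    dt = 0.5 * h if accelerated else 0.25 * h**2   # CFL: dt ~ h vs dt ~ h^2
    a = 2.0 * np.pi                  # damping: 2*sqrt(lambda_1), lambda_1 = pi^2
    for k in range(1, max_iter + 1):
        lap = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
        if np.max(np.abs(lap)) < tol:
            return k
        if accelerated:              # centered in time, damping term averaged
            new = (2.0 * u[1:-1] - (1.0 - 0.5 * a * dt) * u_prev[1:-1]
                   + dt**2 * lap) / (1.0 + 0.5 * a * dt)
            u_prev, u = u, u.copy()
            u[1:-1] = new
        else:                        # forward Euler in time
            u = u.copy()
            u[1:-1] += dt * lap
    return max_iter

it_gd = iterations_to_solve(accelerated=False)
it_acc = iterations_to_solve(accelerated=True)   # far fewer iterations
```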
We present below the results of a simple numerical simulation to show the differ-
ence between gradient descent and accelerated gradient descent. We consider solving
the Dirichlet problem
(2.34)    ∆u = 0 in U,    u = g on ∂U,

in which case ∇I(u) = −∆u in both (2.27) and (2.16). We choose a = 2π,¹ g(x₁, x₂) =
sin(2πx₁²) + sin(2πx₂²), and use initial condition u(x, 0) = g(x). We discretize using
the standard 5-point stencil for ∆u in n = 2 dimensions, and use forward differences
for u_t and centered differences for u_tt. We choose the largest stable time step for each
scheme and run the corresponding descent equations until the discretization of (2.34)
Table 2.1: Comparison of PDE acceleration (2.27) and gradient descent (2.16) for
solving the Dirichlet problem (2.34). Runtimes are for C code, and copied from [17].
is satisfied up to an O(∆x²) error. The iteration counts and computation times for
C code are shown in Table 2.1 for different grid resolutions.
We note that many other methods are faster for linear equations, such as multigrid
or Fast Fourier Transform. The point here is to contrast the difference between
gradient descent and accelerated gradient descent. The accelerated descent method
works equally well for non-linear problems where multigrid and FFT methods do not
work, or in the case of multigrid take substantial work to apply.
Remark 2.16. We have not taken into account discretization errors in our analysis
above. In practice, the solution is computed on a grid of spacing ∆x in the space
dimensions and ∆t in time. The discretization errors accumulate over time, and need
to be controlled to ensure convergence of the numerical scheme.
For gradient descent, the PDE is first order in time and second order in space,
and the discretization errors on U × [0, T ] are bounded by CT(∆t + ∆x²), which is
proved in standard numerical analysis books. Thus, by (2.21) the error between the
discrete solution and the minimizer u∗ is bounded by
where u∆t,∆x denotes any piecewise constant extension of the numerical solution to
U . Furthermore, since ∆t ≤ C∆x2 is necessary for stability and convergence of the
gradient descent equation, we have the bound
Hence, our numerical solution computes the true solution u∗ up to O(∆x²) accuracy
(excluding log factors) in k = T/∆t = C/∆x² iterations. The analysis for accelerated
gradient descent is identical, except here ∆t ≤ C∆x is sufficient for stability and
convergence of the scheme, and so only k = C/∆x iterations are needed for convergence.

¹The choice a = 2π is optimal. See Exercise 2.18 for an explanation in a simple setting.
Exercise 2.17. Consider the linear ordinary differential equation (ODE) version of
gradient descent
x(t) = Σᵢ dᵢ(t) vᵢ

for corresponding functions d₁(t), . . . , dₙ(t). Find the dᵢ(t) and complete the problem
from here.] △
Exercise 2.18. Consider the linear ordinary differential equation (ODE) version of
acceleration
Notes
We note that the presentation here is an adaptation of the variational interpretation of
acceleration due to Wibisono, Wilson, and Jordan [63] to the PDE setting. The
framework is called PDE acceleration; for more details see [7, 17, 56, 66]. We also
mention that the dynamical systems perspective is investigated in [3].
Chapter 3
Applications
We now continue the parade of examples by computing and solving the Euler-Lagrange
equations for the examples from Section 1.1.
3.1 Shortest path

For the shortest path problem, the Euler-Lagrange equation is

gz(x, u(x)) √(1 + u′(x)²) − d/dx [ g(x, u(x)) (1 + u′(x)²)^{−1/2} u′(x) ] = 0.
This is in general difficult to solve. In the special case that g(x, z) ≡ 1, we have gz = 0 and
this reduces to

d/dx ( u′(x) / √(1 + u′(x)²) ) = 0.

Computing the derivative yields

( √(1 + u′(x)²) u′′(x) − u′(x)(1 + u′(x)²)^{−1/2} u′(x) u′′(x) ) / (1 + u′(x)²) = 0.
Multiplying both sides by (1 + u′(x)²)^{3/2} we obtain

(1 + u′(x)²) u′′(x) − u′(x)² u′′(x) = 0.

This reduces to u′′(x) = 0, hence the solution is a straight line! This verifies our
intuition that the shortest path between two points is a straight line.
where the constant C is different from the one on the previous line. The constant C
should be chosen to ensure the boundary conditions hold.
Before solving (3.1), let us note that we can say quite a bit about the solution u
from the ODE it solves. First, since u(a) = b < 0, the left hand side must be negative
somewhere, hence C < 0. Solving for u′(x)² we have

u′(x)² = (C − u(x)) / u(x).
This tells us two things. First, since the left hand side is positive, so is the right
hand side. Hence C − u(x) ≤ 0 and so u(x) ≥ C. If u attains its maximum at an
interior point 0 < x < a then u′ (x) = 0 and u(x) = C, hence u is constant. This is
impossible, so the maximum of u is attained at x = 0, and we have C ≤ u(x) ≤ u(0) = 0.
Since x = 0 is a local maximum of u, we have u′(x) < 0 for x > 0 small. Differentiating (3.1) yields

1 + u′(x)² = −2u(x)u′′(x).
Since the left hand side is positive, so is the right hand side, and we deduce u′′ (x) > 0.
Therefore u′ (x) is strictly increasing, and u is a convex function of x. In fact, we can
say slightly more. Solving for u′′(x) and using (3.1) we have

u′′(x) = −(1 + u′(x)²)/(2u(x)) = −(1 + u′(x)²)²/(2C) ≥ −1/(2C) > 0.
This means that u is uniformly convex, and for large enough x we will have u′ (x) > 0,
so the optimal curve could eventually be increasing.
In fact, since u′ is strictly increasing, there exists (by the intermediate value
theorem) a unique point x∗ > 0 such that u′ (x∗ ) = 0. We claim that u is symmetric
about this point, that is
u(x∗ + x) = u(x∗ − x).
To see this, write w(x) = u(x∗ + x) and v(x) = u(x∗ − x). Then

w(x) + w(x)w′(x)² = u(x∗ + x) + u(x∗ + x)u′(x∗ + x)² = C

and

v(x) + v(x)v′(x)² = u(x∗ − x) + u(x∗ − x)u′(x∗ − x)² = C.
Differentiating the equations above, we find that both w and v satisfy the second
order ODE (3.2). We also note that w(0) = u(x∗ ) = v(0), and w′ (0) = u′ (x∗ ) = 0 and
v ′ (0) = −u′ (x∗ ) = 0, so w′ (0) = v ′ (0). Since solutions of the initial value problem for
(3.2) are unique, provided their values stay away from zero, we find that w(x) = v(x),
which establishes the claim. The point of this discussion is that without explicitly
computing the solution, we can say quite a bit both quantitatively and qualitatively
about the solutions.
Figure 3.1: Family of brachistochrone curves. The straight line is the line x + (π/2)y = 0
passing through the minima of all brachistochrone curves.
Now, suppose instead of releasing the bead from the top of the curve, we release
the bead from some position (x0 , u(x0 )) further down (but before the minimum) on
the brachistochrone curve. How long does it take the bead to reach the lowest point
on the curve? It turns out this time is the same regardless of where you place the
bead on the curve! To see why, we recall that conservation of energy gives us
½ m v(x)² + m g u(x) = m g u(x₀),
where v(x) is the velocity of the bead. Therefore

v(x) = √(2g(u(x₀) − u(x))).
Recall that

1 + u′(x)² = 1/sin²θ,    u(x) = y(θ) = C sin²θ,    and    dx = −2C sin²θ dθ.
Make the change of variables t = −cos θ/cos θ₀. Then cos θ₀ dt = sin θ dθ and

T = √(−2C/g) ∫₋₁⁰ 1/√(1 − t²) dt.
We can integrate this directly to obtain

T = √(−2C/g) (arcsin(0) − arcsin(−1)) = π √(−C/(2g)).
Notice this is independent of the initial position x0 at which the bead is released!
A curve with the property that the time taken by an object sliding down the curve
to its lowest point is independent of the starting position is called a tautochrone,
or isochrone curve. So it turns out that the tautochrone curve is the same as the
brachistochrone curve. The words tautochrone and isochrone are ancient Greek for
same-time and equal-time, respectively.
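As a sanity check, the tautochrone property can be verified numerically by integrating T = ∫ ds/v directly along the curve u = C sin²θ. The sketch below uses the illustrative choices C = −1 and g = 9.81; the substitution θ = θ0 + (π/2 − θ0)s² removes the integrable singularity at the release point so that simple midpoint quadrature converges.

```python
import math

def descent_time(theta0, C=-1.0, g=9.81, N=20000):
    """Time for a bead released at angle theta0 to reach the lowest point
    (theta = pi/2) of the brachistochrone u = C*sin^2(theta), computed by
    midpoint quadrature of T = integral ds/v."""
    a = math.pi / 2 - theta0
    T = 0.0
    for i in range(N):
        s = (i + 0.5) / N
        theta = theta0 + a * s * s            # substitution removes 1/sqrt singularity
        ds = -2 * C * math.sin(theta)         # ds/dtheta = -2C sin(theta), with C < 0
        v = math.sqrt(2 * g * (-C) * (math.sin(theta) ** 2 - math.sin(theta0) ** 2))
        T += (ds / v) * (2 * a * s) / N       # dtheta = 2*a*s ds
    return T

# the descent time pi*sqrt(-C/(2g)) is independent of the release point theta0
for theta0 in (0.3, 0.8, 1.3):
    print(theta0, descent_time(theta0))
```

All three printed times agree with π√(−C/(2g)), matching the computation above.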
Even though minimal surfaces are defined in dimension n = 2, it can still be mathe-
matically interesting to consider the general case of arbitrary dimension n ≥ 2.
From the form of L we see that Lz(x, z, p) = 0 and
∇p L(x, z, p) = p/√(1 + |p|²).
Therefore, the Euler-Lagrange equation for the minimal surface problem is
(3.3)  div( ∇u/√(1 + |∇u|²) ) = 0  in U
subject to u = g on ∂U . This is called the minimal surface equation. Using the chain
rule, we can rewrite the PDE as
∇( 1/√(1 + |∇u|²) ) · ∇u + (1/√(1 + |∇u|²)) div(∇u) = 0.
Notice that
∂/∂xj ( 1/√(1 + |∇u|²) ) = −(1/2)(1 + |∇u|²)^{−3/2} Σ_{i=1}^d 2 uxi uxixj.
Therefore the PDE becomes
−(1/(1 + |∇u|²)^{3/2}) Σ_{i,j=1}^d uxixj uxi uxj + ∆u/√(1 + |∇u|²) = 0.
Multiplying both sides by (1 + |∇u|²)^{3/2} we have
(3.4)  −Σ_{i,j=1}^d uxixj uxi uxj + (1 + |∇u|²)∆u = 0.
Notice that any linear function u(x) = a · x + b is a solution of (3.4), since all of its second derivatives vanish. In dimension n = 2, Eq. (3.4) reads
−ux1x1 u²x1 − 2ux1x2 ux1 ux2 − ux2x2 u²x2 + (1 + u²x1 + u²x2)(ux1x1 + ux2x2) = 0,
which reduces to
(1 + u²x2)ux1x1 − 2ux1 ux2 ux1x2 + (1 + u²x1)ux2x2 = 0.
Since u′(x)² ≥ 0, we deduce that c²u(x)² ≥ 1 for all x. Since u(L) = r we require
(3.9)  c ≥ 1/r.
To solve (3.8), we use a clever trick: We differentiate both sides to find
2u′(x)u′′(x) = 2c²u(x)u′(x),
or
u′′(x) = c²u(x).
Notice we have converted a nonlinear ODE into a linear ODE! Since c² > 0 the general solution is
u(x) = (A/2)e^{cx} + (B/2)e^{−cx}.
Figure 3.2: Depiction of some quantities from the derivation of the shape of a hanging chain. The shape of the curve is deduced by balancing the force of gravity and tension on the blue segment above.
Remark 3.3 (The shape of a hanging chain). We pause for a moment to point out
that the catenary curve is also the shape that an idealized hanging chain or cable
assumes. Due to this property, an inverted catenary, or catenary arch, has been used
in architectural designs since ancient times.
To see this, we assume the chain has length ℓ, with ℓ > 2L, and is supported at
x = ±L at a height of r, thus the shape of the chain w(x) is a positive even function
satisfying w(±L) = r. Consider the segment of the chain lying above the interval
[0, x] for x > 0. Let s(x) denote the arclength of this segment, given by
s(x) = ∫₀ˣ √(1 + w′(t)²) dt.
If the chain has density ρ, then the force of gravity acting on this segment of the
chain is F = −ρs(x)g in the vertical direction, where g is acceleration due to gravity.
We also have tension forces within the rope acting on the segment at both ends of the
interval. Let Tx denote the magnitude of the tension vector in the rope at position
x, noting that the tension vector is always directed tangential to the curve. Since
w′ (0) = 0, as w is even, the tension vector at 0 is (−T0 , 0). At the right point x of the
interval, the tension vector is Tx = (Tx cos θ, Tx sin θ), where θ is the angle between
the tangent vector to w and the x-axis. Since the chain is at rest, the sum of the
forces acting on the segment [0, x] of rope must vanish, which yields
Tx cos θ = T0  and  Tx sin θ = ρg s(x).
Therefore
w′(x) = tan θ = (Tx sin θ)/(Tx cos θ) = (ρg/T0) s(x).
Setting c = ρg/T0 and differentiating above we obtain w′′(x) = c√(1 + w′(x)²). Noting the similarity to (3.8), we can use the methods above to obtain the solution
w′(x) = (cA/2)e^{cx} + (cB/2)e^{−cx},
for some constants A and B. Since w(x) is even, w′ (x) is odd, and so B = −A and
we have w′ (x) = cA sinh(cx). Integrating again yields
w(x) = A cosh(cx) + C
for an integration constant C. The difference now with the minimal surface problem
is that A, C are the free parameters, while c is fixed.
Note that for fixed A, we can choose C so that w(L) = r, yielding C = r − A cosh(cL). We set the parameter A by asking that the curve has length ℓ, that is
(3.11)  ℓ = 2∫₀^L √(1 + w′(x)²) dx = 2∫₀^L √(1 + A²c² sinh²(cx)) dx.
Unless Ac = 1, the integral on the right is not expressible in terms of simple functions.
Nevertheless, we can see that the length of w is continuous and strictly increasing in
A, and tends to ∞ as A → ∞. Since the length of w for A = 0 is 2L, and ℓ > 2L,
there exists, by the intermediate value theorem, a unique value of A for which (3.11)
holds.
[Figure 3.3 shows the graph y = cosh(C) together with the lines y = θC for θ = 1, 3/2, and 3.]
Figure 3.3: When θ > θ0 there are two solutions of (3.12). When θ = θ0 there is
one solution, and when θ < θ0 there are no solutions. By numerical computations,
θ0 ≈ 1.509.
This equation is not always solvable, and depends on the value of θ = r/L, that is, on the ratio of the radius r of the rings to L, which is half of the separation distance.
There is a threshold value θ0 such that for θ > θ0 there are two solutions C1 < C2 of
(3.12). When θ = θ0 there is one solution C, and when θ < θ0 there are no solutions.
See Figure 3.3 for an illustration. To rephrase this, if θ < θ0 or r < Lθ0 , then the
rings are too far apart and there is no minimal surface spanning the two rings. If
r ≥ Lθ0 then the rings are close enough together and a minimal surface exists. From
numerical computations, θ0 ≈ 1.509.
Now, when there are two solutions C1 < C2 , which one gives the smallest surface
area? We claim it is C1 . To avoid complicated details, we give here a heuristic
argument to justify this claim. Let c1 < c2 such that C1 = c1 L and C2 = c2 L. So we
have two potential solutions
u1(x) = (1/c1) cosh(c1 x)  and  u2(x) = (1/c2) cosh(c2 x).
Figure 3.4: Simulation of the minimal surface of revolution for two rings being slowly
pulled apart. The rings are located at x = −L and x = L where L ranges from (left)
L = 0.095 to (right) L = 0.662, and both rings have radius r = 1. For larger L the
soap bubble will collapse.
Since u1(0) = 1/c1 ≥ 1/c2 = u2(0), we have u1 ≥ u2. In other words, as we increase c the
solution decreases. Now, as we pull the rings apart we expect the solution to decrease
(the soap bubble becomes thinner), so the value of c should increase as the rings are
pulled apart. As the rings are pulled apart L is increasing, so θ = r/L is decreasing.
From Figure 3.3 we see that C2 is decreasing as θ decreases. Since C2 = c2 L and L
is increasing, c2 must be decreasing as the rings are pulled apart. In other words,
u2 is increasing as the rings are pulled apart, so u2 is a non-physical solution. The
minimal surface is therefore given by u1 . Figure 3.4 shows the solutions u1 as the two
rings pull apart, and Figure 3.5 shows non-physical solutions u2 .
We can also explicitly compute the minimal surface area for c = c1 . We have
(3.13)  I(u) = 2π ∫_{−L}^{L} u(x)√(1 + u′(x)²) dx = 4πc ∫₀^L u(x)² dx,
where we used the Euler-Lagrange equation (3.7) and the fact that u is even in the
last step above. Substituting u′′ (x) = c2 u(x) and integrating by parts we have
I(u) = (4π/c) ∫₀^L u′′(x)u(x) dx
     = (4π/c) u′(x)u(x)|₀^L − (4π/c) ∫₀^L u′(x)² dx
     = (4π/c) u′(L)u(L) − (4π/c) ∫₀^L u′(x)² dx.
Using (3.8) we have u′(x)² = c²u(x)² − 1 and so
I(u) = (4πu(L)/c) √(c²u(L)² − 1) − (4π/c) ∫₀^L (c²u(x)² − 1) dx.
Figure 3.5: Illustration of solutions of the minimal surface equation that do not
minimize surface area. The details are identical to Figure 3.4, except that we select
c2 instead of c1 . Notice the soap bubble is growing as the rings are pulled apart,
which is the opposite of what we expect to occur.
Evaluating the integral using u(x) = (1/c)cosh(cx), u(L) = r, and cosh(cL) = cr, we have
I(u) = (2π/c)(r sinh(cL) + L).
While it is not possible to analytically solve for c1 and c2 , we can numerically com-
pute the values to arbitrarily high precision with our favorite root-finding algorithm.
Most root-finding algorithms require one to provide an initial interval in which the
solution is to be found. We already showed (see (3.9)) that c ≥ 1/r. For an upper
bound, recall we have the Taylor series
cosh(x) = 1 + x²/2 + x⁴/4! + ··· = Σ_{k=0}^∞ x^{2k}/(2k)!,
and so cosh(x) ≥ x²/2. Recall also that c1 and c2 are solutions of
cosh(cL) = cr.
Therefore, if c²L²/2 > cr we know that cosh(cL) > cr. This gives the bounds
1/r ≤ ci ≤ 2r/L²  (i = 1, 2).
Furthermore, the point c∗ where the slope of c ↦ cosh(cL) equals r lies between c1
and c2 . Therefore
c1 < c ∗ < c 2 ,
where L sinh(c∗L) = r, or
c∗ = (1/L) sinh⁻¹(r/L).
Thus, if cosh(c∗ L) = c∗ r, then there is exactly one solution c1 = c2 = c∗ . If
cosh(c∗ L) < c∗ r then there are two solutions
(3.15)  1/r ≤ c1 < c∗ < c2 ≤ 2r/L².
Otherwise, if cosh(c∗ L) > c∗ r then there are no solutions. Now that we have the
bounds (3.15) we can use any root-finding algorithm to determine the values of c1
and c2 . In the code I showed in class I used a simple bisection search.
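That bisection search can be sketched as follows (a minimal stdlib-only version; the bracketing intervals come directly from the bounds (3.15), with c∗ = (1/L) sinh⁻¹(r/L)):

```python
import math

def catenoid_constants(L, r, tol=1e-12):
    """Find the roots c1 <= c2 of cosh(c*L) = c*r by bisection on the
    brackets [1/r, c*] and [c*, 2r/L^2] from (3.15). Returns None when
    theta = r/L is below the threshold theta0 (rings too far apart)."""
    f = lambda c: math.cosh(c * L) - c * r
    cstar = math.asinh(r / L) / L        # slope of c -> cosh(cL) equals r here
    if f(cstar) > 0:
        return None                      # no minimal surface exists
    def bisect(lo, hi):
        flo = f(lo)                      # f changes sign on [lo, hi]
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if (f(mid) > 0) == (flo > 0):
                lo, flo = mid, f(mid)
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return bisect(1 / r, cstar), bisect(cstar, 2 * r / L ** 2)
```

For example, with r = 1 and L = 0.3 we have θ ≈ 3.33 > θ0 and two roots are returned, while r = L = 1 gives θ = 1 < θ0 ≈ 1.509 and the function returns None.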
where h < ℓ/2. The largest area for this rectangle is attained when ℓ/2 − 2h = 0, or h = ℓ/4. That is, the rectangle is a square! The area of the square is
A = ℓ²/16.
Can we do better? We can try regular polygons with more sides, such as a
pentagon, hexagon, etc., and we will find that the area increases with the number of
sides. In the limit as the number of sides tends to infinity we get a circle, so perhaps
the circle is a good guess for the optimal shape. If C is a circle of radius r > 0 then
2πr = ℓ, so r = ℓ/(2π) and
A = πr² = ℓ²/(4π).
This is clearly better than the square, since π < 4.
The question again is: Can we do better still? Is there some shape we have not
thought of that would give larger area than a circle while having the same perimeter?
We might expect the answer is no, but lack of imagination is not a convincing proof.
We will prove shortly that, as expected, the circle gives the largest area for a fixed
perimeter. Thus for any simple closed curve C we have the isoperimetric inequality
(3.16)  4πA ≤ ℓ²,
where equality holds only when C is a circle of radius r = ℓ/(2π).
We give here a short proof of the isoperimetric inequality (3.16) using the Lagrange
multiplier method in the calculus of variations. Let the curve C be parametrized by
(x(t), y(t)) for t ∈ [0, 1]. For notational simplicity we write x = x1 and y = x2 in
this section. Since C is a simple closed curve, we may also assume without loss of
generality that (x(0), y(0)) = (x(1), y(1)).
Let U denote the interior of C. Then the area enclosed by the curve C is
A(x, y) = ∫_U dx = ∫_U div(F) dx,
where F is the vector field F(x, y) = (1/2)(x, y). By the divergence theorem
A(x, y) = ∫_{∂U} F · ν dS,
where ν is the outward unit normal to ∂U. Provided we take the curve to have positive orientation, we have
ν = (y′(t), −x′(t))/√(x′(t)² + y′(t)²)  and  dS = √(x′(t)² + y′(t)²) dt.
Therefore
A(x, y) = (1/2)∫₀¹ (x(t), y(t)) · (y′(t), −x′(t)) dt = (1/2)∫₀¹ x(t)y′(t) − x′(t)y(t) dt.
Similarly
∇y A(x, y) = −(1/2)x′(t) − (d/dt)((1/2)x(t)) = −x′(t),
and
∇y ℓ(x, y) = −(d/dt)( y′(t)/√(x′(t)² + y′(t)²) ).
Then the gradients of A and ℓ are defined as
∇A(x, y) = (∇x A(x, y), ∇y A(x, y))  and  ∇ℓ(x, y) = (∇x ℓ(x, y), ∇y ℓ(x, y)).
Following (2.15), the necessary conditions for our constrained optimization problem
are
∇A(x, y) + λ∇ℓ(x, y) = 0,
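The parametric formulas for A and ℓ above are easy to check numerically; the sketch below (midpoint quadrature with centered finite differences, tested on a circle and an ellipse as illustrative curves) confirms the isoperimetric inequality 4πA ≤ ℓ², with equality only for the circle:

```python
import math

def area_and_length(x, y, n=20000):
    """A = (1/2) oint x y' - x' y dt and l = oint sqrt(x'^2 + y'^2) dt for a
    closed curve t -> (x(t), y(t)), t in [0, 1], with positive orientation."""
    h = 1.0 / n
    A = ell = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        xp = (x(t + h / 2) - x(t - h / 2)) / h   # centered difference x'(t)
        yp = (y(t + h / 2) - y(t - h / 2)) / h
        A += 0.5 * (x(t) * yp - xp * y(t)) * h
        ell += math.hypot(xp, yp) * h
    return A, ell

circle = (lambda t: math.cos(2 * math.pi * t), lambda t: math.sin(2 * math.pi * t))
ellipse = (lambda t: 2 * math.cos(2 * math.pi * t), lambda t: math.sin(2 * math.pi * t))
for name, (xf, yf) in (("circle", circle), ("ellipse", ellipse)):
    A, ell = area_and_length(xf, yf)
    print(name, A, ell)
```

For the circle 4πA agrees with ℓ² up to quadrature error, while for the ellipse the inequality is strict.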
is not differentiable at p = 0. This causes some minor issues with numerical sim-
ulations, so it is common to take an approximation of the TV functional that is
differentiable. One popular choice is
(3.19)  Iε(u) = ∫_U (1/2)(f − u)² + λ√(|∇u|² + ε²) dx,
for which
Lε,z(x, z, p) = z − f(x)  and  ∇p Lε(x, z, p) = λp/√(|p|² + ε²).
Gradient descent on (3.19) is the partial differential equation
(3.20)  ut + u − f − λ div( ∇u/√(|∇u|² + ε²) ) = 0  for x ∈ U, t > 0,
with initial condition u(x, 0) = f(x) and boundary conditions ∂u/∂ν = 0 on ∂U. This is a nonlinear heat equation where the thermal conductivity
κ = 1/√(|∇u|² + ε²)
Figure 3.6: Example of denoising a noisy signal with the total variation restoration algorithm. Notice the noise is removed while the edges are preserved.
depends on ∇u. Discretized on a grid with spacing ∆t and ∆x, the stability condition
for the scheme is
∆t ≤ ε∆x²/(4λ).
Hence, the scheme is numerically stiff when ∆x is small or when ε > 0 is small.
We can also solve (3.20) with accelerated gradient descent, described in Section
2.4, which leads to the accelerated descent equation
utt + a ut + u − λ div( ∇u/√(|∇u|² + ε²) ) = f  for x ∈ U, t > 0,
where a > 0 is the coefficient of friction (damping). The CFL stability condition for
accelerated descent is
∆t ≤ √( ε∆x²/(4λ + ε∆x²) ).
The scheme is only numerically stiff when ε > 0 is small or λ > 0 is large. However,
for accelerated descent, a nonlinear phenomenon allows us to violate the stability
condition and roughly set ∆t ≈ ∆x, independent of ε > 0. We refer the reader to [7]
for details.
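A minimal one-dimensional version of the explicit scheme for (3.20) can be sketched as follows (stdlib-only; Neumann boundary conditions, time step taken at the stability bound ∆t = ε∆x²/(4λ); the grid size, parameters, and test signal are illustrative choices):

```python
import math

def tv_descent_1d(f, lam=0.1, eps=0.1, dx=0.02, t_final=1.0):
    """Explicit gradient descent u_t = (f - u) + lam*(u_x/sqrt(u_x^2+eps^2))_x
    for the regularized TV energy (3.19), with dt = eps*dx^2/(4*lam)."""
    u = list(f)
    dt = eps * dx * dx / (4 * lam)
    for _ in range(int(t_final / dt)):
        q = [0.0]                          # zero flux at the left boundary
        for i in range(len(u) - 1):
            du = (u[i + 1] - u[i]) / dx    # flux q_{i+1/2} at cell interfaces
            q.append(du / math.sqrt(du * du + eps * eps))
        q.append(0.0)                      # zero flux at the right boundary
        u = [u[i] + dt * (f[i] - u[i] + lam * (q[i + 1] - q[i]) / dx)
             for i in range(len(u))]
    return u

def tv_energy(u, f, lam=0.1, eps=0.1, dx=0.02):
    fid = sum(0.5 * (fi - ui) ** 2 for fi, ui in zip(f, u)) * dx
    du = [(u[i + 1] - u[i]) / dx for i in range(len(u) - 1)]
    return fid + lam * sum(math.sqrt(d * d + eps * eps) for d in du) * dx

# noisy step signal: the descent removes oscillations but keeps the jump
n = 50
f = [(0.0 if i < n // 2 else 1.0) + 0.2 * math.sin(0.8 * i) for i in range(n)]
u = tv_descent_1d(f)
print(tv_energy(u, f), tv_energy(f, f))
```

Since u(·, 0) = f, the energy Iε decreases along the flow, which the printed values confirm.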
Figure 3.6 shows a one dimensional example of denoising a signal with the TV
restoration algorithm. Notice the algorithm can remove noise while preserving the
sharp discontinuities in the signal. This suggests that minimizers of the TV restora-
tion functional can have discontinuities. Figure 3.7 shows an example of TV image
restoration of a noisy image of Vincent hall. The top image is the noisy image, the
middle image is TV restoration with a small value of λ, and the bottom image is TV
restoration with a large value of λ. Notice that as λ is increased, the image becomes
Figure 3.7: Example of TV image restoration: (top) noisy image (middle) TV restora-
tion with small value for λ, and (bottom) TV restoration with large value for λ.
smoother, and many fine details are removed. We also observe that for small λ the
algorithm is capable of preserving edges and fine details while removing unwanted
noise.
If Φ is convex, then by convex duality we have Φ∗∗ := (Φ∗ )∗ = Φ, that is we have the
representation
(3.22)  Φ(x) := sup_{p∈ℝᵈ} {x · p − Φ∗(p)}.
Exercise 3.4. Prove the assertion above for d = 1 and Φ smooth and strongly
convex. 4
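For intuition, the biconjugation identity Φ∗∗ = Φ can be checked numerically by computing discrete Legendre-Fenchel transforms on a grid; the sketch below uses the illustrative choice Φ(x) = x², for which Φ∗(p) = p²/4.

```python
def conjugate(phi, grid):
    """Discrete Legendre-Fenchel transform: phi*(p) = max_x (x*p - phi(x))."""
    return lambda p: max(x * p - phi(x) for x in grid)

grid = [i * 0.01 - 5.0 for i in range(1001)]   # uniform grid on [-5, 5]
phi = lambda x: x * x                          # Phi(x) = x^2, so Phi*(p) = p^2/4
phi_star = conjugate(phi, grid)
phi_starstar = conjugate(phi_star, grid)
print(phi_star(2.0), phi_starstar(1.0))        # ~1.0 and ~1.0
```

The second printed value recovers Φ(1) = 1, as the representation (3.22) predicts.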
We assume Φ : Rd → R is convex and consider for simplicity the problem
(3.23)  min_u ∫_U Ψ(x, u) + Φ(∇u) dx.
Using the representation (3.22) of Φ, we can rewrite (3.23) as
(3.24)  min_u max_p ∫_U Ψ(x, u) + p · ∇u − Φ∗(p) dx,
where p : U → ℝᵈ. Here, u is the primal variable, p is the dual variable, and (3.24)
is called a saddle point formulation. Notice we are dualizing only in the gradient
variable ∇u, in which the original functional is nonsmooth, and leaving the variable
u, in which the functional is smooth, unchanged.
Provided u p · ν = 0 on ∂U, we can integrate by parts to express the problem as
(3.25)  min_u max_p ∫_U Ψ(x, u) − u div(p) − Φ∗(p) dx.
In general, primal dual methods work by alternatively taking small steps in the di-
rection of minimizing with respect to u and maximizing with respect to p, with either
gradient descent steps, or proximal updates, which corresponds to implicit gradient
descent. Either (3.24) or (3.25) may be used in the descent equations, and often
(3.24) is more convenient for the dual update while (3.25) is more convenient for the
primal update.
For some applications, things can be simplified significantly. We consider Total Variation image restoration where Ψ(x, z) = (1/2λ)(z − f(x))² and Φ(p) = |p|. The convex dual is easily computed to be
Φ∗(v) = sup_{p∈ℝᵈ} {p · v − |p|} = 0 if |v| ≤ 1, and ∞ if |v| > 1,
and the representation of Φ via convex duality in (3.22) is the familiar formula
|p| = max_{|v|≤1} p · v.
Therefore
∇T (p) = ∇(λ div(p) − f ).
To handle the constraint |p| ≤ 1 we can do a type of projected gradient ascent
(3.30)  pk+1 = ( pk + τ∇(div(pk) − f/λ) ) / ( 1 + τ|∇(div(pk) − f/λ)| ),
with time step τ > 0, starting from an initial guess p0 with |p0 | < 1, and taking
appropriate discretizations of div and ∇ on a grid with spacing ∆x above. Notice
we absorbed λ into the choice of time step, to be consistent with other presentations.
This primal dual approach, and its convergence properties, is studied closely in [18].
The iteration above converges when τ is sufficiently small, and the solution of the
Total Variation problem is obtained by evaluating (3.28).
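A one-dimensional sketch of the iteration (3.30) follows (unit grid spacing with λ absorbed into the time step, as in the text; since (3.28) is not reproduced in this excerpt, we recover the restored signal as u = f − λ div(p), the standard formula in Chambolle's dual method, which is an assumption here):

```python
def tv_dual_1d(f, lam=2.0, tau=0.1, iters=1000):
    """Projected dual ascent (3.30) for 1D TV restoration on a unit grid.
    p lives on cell interfaces; div(p)_i = p_i - p_{i-1}, with p = 0 outside."""
    n = len(f)
    p = [0.0] * (n - 1)
    def div(p):
        return [(p[i] if i < n - 1 else 0.0) - (p[i - 1] if i > 0 else 0.0)
                for i in range(n)]
    for _ in range(iters):
        w = [d - fi / lam for d, fi in zip(div(p), f)]
        g = [w[i + 1] - w[i] for i in range(n - 1)]   # gradient of div(p) - f/lam
        p = [(pi + tau * gi) / (1 + tau * abs(gi)) for pi, gi in zip(p, g)]
    # recover the primal solution: u = f - lam*div(p) (assumed form of (3.28))
    return [fi - lam * di for fi, di in zip(f, div(p))], p

# noisy step signal on a unit grid
f = [(0.0 if i < 15 else 1.0) + 0.3 * (-1) ** i for i in range(30)]
u, p = tv_dual_1d(f)
```

Note that the update keeps |p| ≤ 1 automatically, since (a + x)/(1 + x) ≤ 1 whenever a ≤ 1.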
The Bregman iteration for solving constrained problems was originally proposed in [9],
and the novelty of the split Bregman method is in splitting the terms |p| and Ψ and
applying the Bregman iteration only to |p|.
We will not prove convergence of the Bregman iteration here, and refer the reader
to [47]. However, it is easy to see that if the method converges to a triple (u∗ , p∗ , b∗ ),
then the pair (u∗ , p∗ ) is a solution of the constrained problem (3.31). To see this, we
note that since bk → b∗ we have ∇u∗ = p∗ , so the constraint is satisfied. The pair
(u∗ , p∗ ) is clearly a solution of
(3.33)  (u∗, p∗) = arg min_{u,p} ∫_U Ψ(x, u) + |p| + (µ/2)|p − ∇u − b∗|² dx.
Let (û, p̂) be any pair of functions satisfying ∇û = p̂, that is, a feasible pair for the
constrained problem (3.31). By definition of (u∗ , p∗ ) we have
∫_U Ψ(x, u∗) + |p∗| + (µ/2)|p∗ − ∇u∗ − b∗|² dx ≤ ∫_U Ψ(x, û) + |p̂| + (µ/2)|p̂ − ∇û − b∗|² dx.
Due to the constraints ∇û = p̂ and ∇u∗ = p∗ this simplifies to
∫_U Ψ(x, u∗) + |p∗| dx ≤ ∫_U Ψ(x, û) + |p̂| dx.
Thus, (u∗, p∗) is a solution of the constrained problem (3.31). This holds independently of the choice of µ > 0. One may think of the split Bregman method as an iteration scheme to find, for any µ > 0, a function b∗ so that the penalized problem (3.33) is equivalent to the original constrained problem (3.31).
The efficiency of the split Bregman method relies on the ability to solve the prob-
lem
(3.34)  arg min_{u,p} ∫_U Ψ(x, u) + |p| + (µ/2)|p − ∇u − bk|² dx
This is a linear Poisson-type equation that can be solved very efficiently with either
a fast Fourier transform (FFT) or the multigrid method. Alternatively, in [31], the
authors report using only a handful of steps of gradient descent and can get by with
only approximately solving the problem.
The efficiency of the split Bregman method stems largely from the fact that the
second problem in (3.35) is pointwise, and can be solved explicitly. Note the objective
is strongly convex in p, so the minimum is attained and unique, and the critical point,
if it exists, is the minimizer. We can differentiate in p to find that the minimizer pj+1
satisfies
pj+1/|pj+1| + µ pj+1 = µ(∇uj+1 + bk),
provided pj+1 ≠ 0. Taking norms of both sides gives 1 + µ|pj+1| = µ|∇uj+1 + bk|, and so
|pj+1| = |∇uj+1 + bk| − 1/µ,
provided |∇uj+1 + bk| > 1/µ. If |∇uj+1 + bk| ≤ 1/µ, then there are no critical points where the objective is smooth, and the minimizer is pj+1 = 0, i.e., the only non-smooth point. The formula for pj+1 can be expressed compactly as
pj+1 = shrink(∇uj+1 + bk, 1/µ),  where  shrink(x, r) = (x/|x|) max(|x| − r, 0).
The shrinkage operator is extremely fast and requires only a few operations per
element of pj , which is the key to the computational efficiency of the split Bregman
method.
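The shrinkage (soft-thresholding) operator takes only a few lines; the sketch below also checks optimality for the pointwise subproblem min_p |p| + (µ/2)|p − q|², with q playing the role of ∇uj+1 + bk and µ = 2 an illustrative choice:

```python
import math

def shrink(x, r):
    """shrink(x, r) = (x/|x|) * max(|x| - r, 0), with shrink(0, r) = 0."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm <= r:
        return [0.0] * len(x)
    s = (norm - r) / norm
    return [s * t for t in x]

# pointwise split Bregman subproblem with q = grad(u) + b and mu = 2
mu, q = 2.0, [3.0, 4.0]
p = shrink(q, 1.0 / mu)
print(p)   # [2.7, 3.6]
```

Since |q| = 5 and r = 1/2, the minimizer scales q by (5 − 1/2)/5 = 0.9, and the objective is strictly convex, so nearby perturbations only increase it.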
As an example, let us consider the one dimensional case and ignore the fidelity
term, since it does not involve the derivative of u. Hence, we consider the functional
Jp(u) = ∫_{−1}^{1} |u′(x)|^p dx,
Hence, the functional J1 (u) actually only depends on the boundary values u(1) and
u(−1), provided u is increasing. This is the reason why the Euler-Lagrange equation
is degenerate; every increasing function satisfying the boundary conditions u(−1) = 0
and u(1) = 1 is a minimizer. Thus, the TV functional does not care how the function
gets from u(−1) = 0 to u(1) = 1, provided the function is increasing. So the linear
function u(x) = (1/2)x + (1/2) is a minimizer, but so is the sequence of functions
un(x) = 0 for −1 ≤ x ≤ 0,  un(x) = nx for 0 ≤ x ≤ 1/n,  and  un(x) = 1 for 1/n ≤ x ≤ 1.
The function un has a sharp transition from zero to one with slope n between x = 0 and x = 1/n. For each n,
J1(un) = ∫₀^{1/n} n dx = 1,
and as n → ∞, un converges to a step function whose distributional derivative is δ(x),
where δ(x) is the Delta function. This explains why minimizers of the TV functional
can admit discontinuities.
Notice that if p > 1 then
Jp(un) = ∫₀^{1/n} n^p dx = n^{p−1},
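These computations are easy to reproduce with finite differences: J1(un) stays at 1 for every n, while Jp(un) with p > 1 blows up as n grows (a stdlib-only sketch; the grid size m is an illustrative choice):

```python
def J(p, n, m=200000):
    """Finite-difference approximation of J_p(u_n) = int_{-1}^1 |u_n'(x)|^p dx
    for the ramp function u_n (slope n on [0, 1/n], constant elsewhere)."""
    h = 2.0 / m
    def u(x):
        if x <= 0.0:
            return 0.0
        if x >= 1.0 / n:
            return 1.0
        return n * x
    total = 0.0
    for i in range(m):
        du = (u(-1.0 + (i + 1) * h) - u(-1.0 + i * h)) / h
        total += abs(du) ** p * h
    return total

print(J(1, 10), J(2, 10), J(1, 100), J(2, 100))
```

For p = 1 the sum telescopes to the total variation, which is exactly 1 regardless of n, while for p = 2 the value grows like n, as predicted.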
over u : U → R and real numbers a and b. Let us assume for the moment that a and
b are fixed, and I is a function of only u.
The Lagrangian is not even continuous, due to the presence of the Heaviside function H(z) and the delta function δ(z). This causes problems numerically, hence in practice we usually
Therefore
Lε,z(x, z, p) = δε(z)[ (f(x) − a)² − (f(x) − b)² ] + λδε′(z)|p|
and
∇p Lε(x, z, p) = λδε(z)p/|p|.
By the chain and product rules
div( ∇p L(x, u(x), ∇u(x)) ) = λ div( δε(u(x))∇u(x)/|∇u(x)| )
= λ (∇δε(u(x)) · ∇u(x))/|∇u(x)| + λ δε(u(x)) div( ∇u(x)/|∇u(x)| )
= λ δε′(u(x)) (∇u(x) · ∇u(x))/|∇u(x)| + λ δε(u(x)) div( ∇u(x)/|∇u(x)| )
= λ δε′(u(x))|∇u(x)| + λ δε(u(x)) div( ∇u(x)/|∇u(x)| ).
Therefore, the Euler-Lagrange equation is
(3.43)  δε(u)[ (f − a)² − (f − b)² − λ div( ∇u/|∇u| ) ] = 0  in U
Since it is easy to minimize over a and b, the idea now is to consider an alternating
minimization algorithm, whereby one fixes a, b ∈ R and takes a small gradient descent
step in the direction of minimizing Iε with respect to u, and then one freezes u and
updates a and b according to (3.44) and (3.45). We repeat this iteratively until the
values of a, b, and u remain unchanged with each new iteration.
Gradient descent on Iε with respect to u is the partial differential equation
(3.46)  ut + δε(u)[ (f − a)² − (f − b)² − λ div( ∇u/|∇u| ) ] = 0  for x ∈ U, t > 0
Figure 3.8: Illustration of gradient descent for segmenting the cameraman image. Top
left is the original image, and top right is the initialization of the gradient descent
algorithm. The lower four images show the evolution of the zero level set of u.
Figure 3.9: Segmentation of clean, blurry, and noisy versions of the cameraman image.
The energy (3.49) is called a phase-field approximation of (3.39). Notice that µ = λ⁻¹ from the original energy (3.39).
The phase-field approximation (3.49) can be efficiently solved by splitting into
two pieces, the double-well potential and the remaining portions, and alternatively
taking small steps of minimizing over each piece. Gradient descent on the double-well potential part of the energy (1/ε)W(u) would be ut = −(1/ε)W′(u). When ε > 0 is small, we approximate this step by sending t → ∞, and u converges to stable equilibrium points where W′(u) = 0 and W′′(u) > 0, i.e., u = 0 or u = 1. By examining W′(u) = 2u(1 − u)(1 − 2u), we see that W′(u) < 0 for 1/2 < u < 1 and W′(u) > 0 for 0 < u < 1/2, so the descent flow increases u above 1/2 and decreases it below 1/2. Thus any value of u larger than 1/2 gets sent to 1, while any value smaller than 1/2 is sent to 0. This step is thus thresholding.
The other part of the energy is
∫_U µ[ u²(f − a)² + (1 − u)²(f − b)² ] + ε|∇u|² dx.
The scheme runs the gradient descent equation above for a short time, alternating
with the thresholding step described above. The steps are outlined below.
Given: Image f, initial guess u0, and time step τ > 0. For k = 0, 1, 2, . . .
1. Starting from w(x, 0) = uk(x), solve the gradient descent equation above for 0 ≤ t ≤ τ to obtain w(x, τ).
2. Set uk+1(x) = 0 if w(x, τ) ≤ 1/2, and uk+1(x) = 1 if w(x, τ) > 1/2.
The constants a, b are updated after each thresholding step according to the formulas:
a = ∫_U u f dx / ∫_U u dx  and  b = ∫_U (1 − u) f dx / ∫_U (1 − u) dx.
The algorithm above is an example of threshold dynamics, which was invented in [41] for approximating motion by mean curvature, and has been widely used for other problems since. The scheme is unconditionally stable, which means that the scheme is stable and convergent for arbitrarily large time step τ > 0. This breaks the stiffness in the segmentation energy and allows very fast computations of image segmentations.
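A one-dimensional sketch of this threshold dynamics scheme follows (stdlib-only; the parameters µ = 3, ε = 0.2, τ = 1 and the two-valued test signal are illustrative choices, and the inner descent uses a few explicit Euler steps):

```python
def threshold_dynamics(f, u, mu=3.0, eps=0.2, tau=1.0, steps=5, iters=10):
    """Threshold dynamics for two-phase segmentation of a 1D signal f:
    alternate a short explicit gradient descent on
    mu*[w^2 (f-a)^2 + (1-w)^2 (f-b)^2] + eps*|w'|^2 with thresholding at 1/2."""
    n, dt = len(f), tau / steps
    for _ in range(iters):
        # region means from the current indicator u, as in the text
        a = sum(ui * fi for ui, fi in zip(u, f)) / max(sum(u), 1)
        b = sum((1 - ui) * fi for ui, fi in zip(u, f)) / max(n - sum(u), 1)
        w = [float(ui) for ui in u]
        for _ in range(steps):   # explicit Euler, Neumann boundary conditions
            lap = [w[min(i + 1, n - 1)] - 2 * w[i] + w[max(i - 1, 0)]
                   for i in range(n)]
            w = [w[i] + dt * (2 * eps * lap[i]
                              - 2 * mu * (w[i] * (f[i] - a) ** 2
                                          - (1 - w[i]) * (f[i] - b) ** 2))
                 for i in range(n)]
        u = [1 if wi > 0.5 else 0 for wi in w]
    return u

# two-valued signal; the initial guess puts the boundary in the wrong place
f = [0.2] * 30 + [0.8] * 30
u = threshold_dynamics(f, [0] * 40 + [1] * 20)
print(u)
```

Within a couple of iterations the interface snaps to the true jump in f, after which the region means and the segmentation are stationary.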
Chapter 4
The Direct Method
We now turn to the problem of determining conditions under which the functional
(4.1)  I(u) = ∫_U L(x, u(x), ∇u(x)) dx
attains its minimum value. That is, we want to understand under what conditions we
can construct a function u∗ so that I(u∗ ) ≤ I(u) for all u in an appropriate function
space. We will always assume L is a smooth function of all variables.
Example 4.1 (Optimization on Rd ). Suppose f : Rd → [0, ∞) is continuous. Then
f attains its minimum value over any closed and bounded set D ⊂ Rd . Indeed,
we can always take a sequence xm with f (xm ) → inf D f as m → ∞. Then by the
Bolzano-Weierstrass Theorem, there is a subsequence xmj and a point x0 ∈ D such
that xmj → x0. By continuity of f,
f(x0) = lim_{j→∞} f(xmj) = inf_D f.
subject to u(0) = 0 = u(1). We claim there is no solution to this problem. To see this,
let uk (x) be a sawtooth wave of period 2/k and amplitude 1/k. That is, uk (x) = x for
x ∈ [0, 1/k], uk (x) = 2/k − x for x ∈ [1/k, 2/k], uk (x) = x − 2/k for x ∈ [2/k, 3/k] and
so on. The function uk satisfies u′k (x) = 1 or u′k (x) = −1 at all but a finite number
of points of non-differentiability. Also 0 ≤ uk (x) ≤ 1/k. Therefore
I(uk) ≤ ∫₀¹ (1/k²) dx = 1/k².
If the minimum exists, then
0 ≤ min_u I(u) ≤ 1/k²
for all natural numbers k. Therefore minu I(u) would have to be zero. However, there
is no function u for which I(u) = 0. Such a function would have to satisfy u(x)2 = 0
and u′ (x)2 = 1 for all x, which are not compatible conditions. Therefore, there is no
minimizer of I, and hence no solution to this problem. 4
In contrast with Example 4.2, the minimizing sequence uk constructed for the Bolza problem converges to a function (the zero function) that cannot be easily interpreted as a feasible candidate for a minimizer of the variational problem. The gradient u′k is oscillating wildly and is nonconvergent everywhere in any strong sense.
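The estimate I(uk) ≤ 1/k² is easy to verify numerically. The sketch below assumes, as the conditions u² = 0 and u′² = 1 above suggest, the Bolza functional I(u) = ∫₀¹ (u′(x)² − 1)² + u(x)² dx, and evaluates it on the sawtooth by midpoint quadrature with the exact piecewise derivative:

```python
def bolza_energy(k, m=20000):
    """I(u_k) = int_0^1 (u_k'(x)^2 - 1)^2 + u_k(x)^2 dx for the sawtooth u_k
    of period 2/k and amplitude 1/k, by midpoint quadrature."""
    total = 0.0
    for i in range(m):
        x = (i + 0.5) / m
        y = x % (2.0 / k)
        u = y if y <= 1.0 / k else 2.0 / k - y   # sawtooth value
        up = 1.0 if y <= 1.0 / k else -1.0       # u' = +/-1 a.e.
        total += ((up * up - 1.0) ** 2 + u * u) / m
    return total

print(bolza_energy(5), bolza_energy(50))   # shrinking toward 0
```

The first term vanishes identically on the sawtooth, so the energy is just ∫ u², which is of order 1/k².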
The issue underlying both examples above is that the function space we are op-
timizing over is infinite dimensional, and so the notion of closed and bounded is not
relevant. We will need a stronger compactness criterion. Indeed, the functional I in
the Bolza example is a smooth function of u and u′ , and the minimizing sequence uk
constructed in Example 4.3 satisfies |uk | ≤ 1 and |u′k | ≤ 1, so we are minimizing I
over a closed and bounded set. These examples show that infinite dimensional spaces
are much different and we must proceed more carefully. In particular, we require
further structure assumptions on L beyond smoothness to ensure minimizers exist.
where X is a topological space and f : X → [0, ∞). A topological space is the most general space where we can make sense of limits limn→∞ xn = x. Beyond this, it is not so important that the reader knows what a topological space is, so we will not give a definition, other than to say that all spaces considered in these notes are topological spaces with some additional structure (e.g., normed linear space, Hilbert space, etc.).
The main question is: What conditions should we place on f to ensure its mini-
mum value is attained over X, that is, that a minimizer exists in (4.2)?
It is useful to consider how one may go about proving the existence of a minimizer.
Since f ≥ 0, we have inf_{x∈X} f(x) ≥ 0, and so there exists a minimizing sequence (xn) satisfying
(4.3)  lim_{n→∞} f(xn) = inf_{x∈X} f(x).
The limit of the sequence (xn ), or any subsequence thereof, is presumably a leading
candidate for a minimizer of f . If (xn ) is convergent in X, so there exists x0 ∈ X so
that xn → x0 as n → ∞, then we would like to show that f(x0) = inf_{x∈X} f(x). If f is continuous, then this follows directly from (4.3). However, functionals f of interest
are often not continuous in the correct topology. It turns out we can get by with
lower semicontinuity.
For topological spaces, there is another notion called compactness, which has a
definition in terms of open sets. In metric spaces, sequentially compact is equiva-
lent to compact, and in that setting we will just use the term compact. In general
topological spaces the notions are not the same, and for minimizing functionals we
Definition 4.3. We say that f : X → [0, ∞) is coercive if there exists λ > inf_{x∈X} f(x) such that the set
{x ∈ X : f(x) ≤ λ}
is sequentially precompact.
Theorem 4.4 is known as the direct method in the calculus of variations for prov-
ing existence of minimizers. It is important to understand how the minimizer is con-
structed as the limit of a minimizing sequence. In our setup, a minimizing sequence
uk has bounded energy, that is
(4.7)  sup_{k≥1} ∫_U L(x, uk, ∇uk) dx < ∞.
From this, we must prove estimates on appropriate norms of the sequence uk to obtain
coercivity—convergence of a subsequence ukj in some appropriate sense. Since (4.7)
only gives estimates on the integral of functions of uk and ∇uk , the best we can hope
for is that uk and ∇uk are uniformly bounded in Lp (U ) for some p ≥ 1. In the best
case we have p = ∞, and by the Arzelà-Ascoli Theorem (proof below) there exists a
subsequence ukj converging uniformly to a continuous (in fact Lipschitz) function u.
However, since u is constructed through a limit of a sequence of functions, even if we
ask that uk ∈ C 1 (U ), the uniform limit u need not be in C 1 (U ).
This means that coercivity cannot hold when X = C 1 (U ) in Theorem 4.4, and so
the classical function spaces C k (U ) are entirely incompatible with the direct method.
While we may expect minimizers to belong to C k (U ), proving this directly via the
direct method is intractable. Instead, we have to work on a larger space of functions
where it is easier to find minimizers, and treat the question of regularity of minimizers
separately from existence.
We give a brief sketch of the proof of the Arzelà-Ascoli Theorem in a special case.
for all k ≥ 1 and x ∈ (0, 1). Then there exists a subsequence ukj and a function
u ∈ C((0, 1)) satisfying
|u(x) − u(y)| ≤ C|x − y|
such that ukj → u uniformly on (0, 1) (that is, limj→∞ max0≤x≤1 |ukj (x) − u(x)| = 0).
Proof. We claim there exists a subsequence ukj such that ukj(x) is a convergent sequence for all rational coordinates x. To prove this, we enumerate the rationals r1, r2, r3, . . . , and note that each sequence (uk(ri))k is a bounded sequence, and by Bolzano-Weierstrass has a convergent subsequence. We construct the subsequence kj inductively as follows. Let (ℓ1i)i be a subsequence such that uℓ1i(r1) is convergent (by Bolzano-Weierstrass) and set k1 = ℓ11. Since (uℓ1i(r2))i is a bounded sequence, there is a convergent subsequence (uℓ1ij(r2))j, and we write ℓ2j = ℓ1ij and k2 = ℓ22. Continuing inductively, we obtain indices ki so that (ki)i≥j is a subsequence of (ℓji)i for all j. Since ki ≥ i we have that uki(rj) is convergent for all j, which establishes the claim. The argument above is called a diagonal argument, due to Cantor.
We now claim that u, defined on the rationals by u(r) = limj→∞ ukj(r), extends uniquely to a continuous function u : (0, 1) → R. To see this, let x ∈ (0, 1) be irrational. Let rij be an approximating rational sequence, so that limj→∞ rij = x. We have
|u(rij) − u(rik)| ≤ λ|rij − rik|.
It follows that the sequence (u(rij))j is a Cauchy sequence, and hence convergent. The value of the limit is independent of the choice of approximating rational sequence, since if rℓj is another such sequence with rℓj → x as j → ∞, an argument similar to the one above shows that
|u(rij) − u(rℓj)| ≤ λ|rij − rℓj| −→ 0
|ukj(x) − u(x)| = |ukj(x) − ukj(xi) + ukj(xi) − u(xi) + u(xi) − u(x)|
≤ |ukj(x) − ukj(xi)| + |ukj(xi) − u(xi)| + |u(xi) − u(x)|
≤ 2λ/M + max_{0≤i≤M} |ukj(xi) − u(xi)|.
Therefore
lim sup_{j→∞} |ukj(x) − u(x)| ≤ 2λ/M.
The space Lp(U) is a Banach space equipped with the norm ‖u‖Lp(U), provided we identify functions in Lp(U) that agree up to sets of measure zero (i.e., almost everywhere). As in any normed linear space, we say that uk → u in Lp(U) provided
‖u − uk‖Lp(U) −→ 0 as k → ∞.
If uk → u in Lp (U ), there exists a subsequence ukj such that limj→∞ ukj (x) = u(x)
for almost every x ∈ U .
When p = 2, the norm ‖u‖L²(U) arises from an inner product
(4.10)  (u, v)L²(U) = ∫_U uv dx,
in the sense that ‖u‖L²(U) = √((u, u)L²(U)). The space L²(U) is thus a Hilbert space.
The borderline cases p = 1 and p = ∞ deserve special treatment in the Calculus of
Variations, so we will not consider them here.
The norms on the Lp spaces (and any Banach space for that matter) satisfy the triangle inequality
(4.11)  ‖u + v‖Lp(U) ≤ ‖u‖Lp(U) + ‖v‖Lp(U).
Another important inequality is Hölder’s inequality
(4.12)  (u, v)L²(U) ≤ ‖u‖Lp(U) ‖v‖Lq(U),
which holds for any 1 < p, q < ∞ with 1/p + 1/q = 1. When p = q = 2, Hölder’s inequality is called the Cauchy-Schwarz inequality.
To define Sobolev spaces, we need to define the notion of a weak derivative. We
write u ∈ L1loc (U ) when u ∈ L1 (V ) for all V ⊂⊂ U .
Definition 4.7 (Weak derivative). Suppose u, v ∈ L1loc (U ) and i ∈ {1, . . . , n}. We
say that v is the xi -weak partial derivative of u, written v = uxi , provided
(4.13) ∫_U u φ_{x_i} dx = − ∫_U v φ dx

for all φ ∈ C_c^∞(U).
(4.15) W^{1,p}(U) = {weakly differentiable u ∈ L^p(U) for which ‖u‖_{W^{1,p}(U)} < ∞}.
The Sobolev spaces W 1,p (U ) are Banach spaces, and in particular, their norms
satisfy the triangle inequality (4.11). For p = 2 we write H 1 (U ) = W 1,2 (U ). The
space H 1 (U ) is a Hilbert space with inner product
(4.16) (u, v)_{H¹(U)} = ∫_U uv + ∇u · ∇v dx.
The smooth functions C^∞(U) are dense in W^{1,p}(U) and L^p(U). That is, for any u ∈ W^{1,p}(U), there exists a sequence u_m ∈ C^∞(U) such that u_m → u in W^{1,p}(U) (that is, lim_{m→∞} ‖u_m − u‖_{W^{1,p}(U)} = 0). However, a general function u ∈ W^{1,p}(U) need not be continuous, much less infinitely differentiable.
Example 4.4. Let u(x) = |x|^α for α ≠ 0. If α > −n then u is summable, and hence
a candidate for weak differentiability. Let us examine the values of α for which u is
weakly differentiable and belongs to W 1,p (U ), where U = B(0, 1). Provided x 6= 0,
u is classically differentiable and ∇u(x) = α|x|α−2 x. Define v(x) = α|x|α−2 xi . Let
φ ∈ Cc∞ (U ) and compute via the divergence theorem that
∫_U u φ_{x_i} dx = ∫_{U−B(0,ε)} u φ_{x_i} dx + ∫_{B(0,ε)} u φ_{x_i} dx
= − ∫_{U−B(0,ε)} u_{x_i} φ dx − ∫_{∂B(0,ε)} u φ ν_i dS + ∫_{B(0,ε)} u φ_{x_i} dx,
where ν is the outward normal to ∂B(0, ε). Noting that |uxi φ| ≤ C|x|α−1 and |x|α−1 ∈
L1 (U ) when α > 1 − n we have
lim_{ε→0} ∫_{U−B(0,ε)} u_{x_i} φ dx = ∫_U v φ dx
when α > 1 − n. Similarly, |uφνi | ≤ C|x|α = Cεα for x ∈ ∂B(0, ε), and so
| ∫_{∂B(0,ε)} u φ ν_i dS | ≤ C ε^{α+n−1} → 0
as ε → 0. Therefore

∫_U u φ_{x_i} dx = − ∫_U v φ dx
for all φ ∈ C_c^∞(U) provided α > 1 − n. This is exactly the definition of weak differentiability, so u is weakly differentiable with u_{x_i} = α|x|^{α−2} x_i in the weak sense when α > 1 − n. We have |u_{x_i}|^p ≤ C|x|^{p(α−1)}, and so u_{x_i} ∈ L^p(U) provided α > 1 − n/p. Therefore, when α > 1 − n/p we have u ∈ W^{1,p}(U). Notice that u is not continuous and is, in fact, unbounded at x = 0 when α < 0, which is allowed when p < n.
In fact, the situation can be much worse. Provided α > 1 − n/p, the function

u(x) = Σ_{k=1}^∞ (1/2^k) |x − r_k|^α,

where r_1, r_2, r_3, . . . is an enumeration of the points in U with rational coordinates, belongs to W^{1,p}(U); when α < 0 it is unbounded on every open subset of U.
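The computation in Example 4.4 can be checked numerically. The sketch below (the grid size, the exponent α = 1/2, and the test function φ are arbitrary illustrative choices) verifies the weak derivative identity (4.13) for u(x) = |x|^α in dimension n = 1 on U = (−1, 1):

```python
import numpy as np

alpha = 0.5
N = 1_000_000
x = -1 + 2 * (np.arange(N) + 0.5) / N        # midpoint nodes, never exactly 0
dx = 2.0 / N

u = np.abs(x) ** alpha
v = alpha * np.abs(x) ** (alpha - 2) * x     # candidate weak derivative of u
phi = (1 - x**2) ** 2 * np.exp(x)            # test function vanishing at x = -1, 1
dphi = np.exp(x) * ((1 - x**2) ** 2 - 4 * x * (1 - x**2))

lhs = np.sum(u * dphi) * dx                  # integral of u * phi'
rhs = -np.sum(v * phi) * dx                  # minus the integral of v * phi
```

The two sides agree to quadrature accuracy, even though v blows up at the origin (the singularity is integrable for α > 1 − n).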
Example 4.5. The closed unit ball B = {u ∈ L²((0, 1)) : ‖u‖_{L²((0,1))} ≤ 1} is not compact in L²((0, 1)). To see this, consider the sequence u_m(x) = √2 sin(πmx). We have

‖u_m‖²_{L²((0,1))} = ∫₀¹ 2 sin²(πmx) dx = ∫₀¹ 1 − cos(2πmx) dx = 1.
Therefore um ∈ B. However, um has no convergent subsequences. Indeed, by the
Riemann-Lebesgue Lemma in Fourier analysis
(4.19) lim_{m→∞} (u_m, v)_{L²((0,1))} = lim_{m→∞} √2 ∫₀¹ sin(πmx) v(x) dx = 0
for all v ∈ L2 ((0, 1)). Essentially, the sequence um is oscillating very rapidly, and
when measured on average against a test function v, the oscillations average out to
zero in the limit. Therefore, if some subsequence u_{m_j} were convergent in L²((0, 1)) to some u₀ ∈ L²((0, 1)), then we would have ‖u₀‖_{L²((0,1))} = 1 and

(u₀, v)_{L²((0,1))} = lim_{j→∞} (u_{m_j}, v)_{L²((0,1))} = 0

for all v ∈ L²((0, 1)). Taking v = u₀ yields ‖u₀‖_{L²((0,1))} = 0, a contradiction.
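The Riemann-Lebesgue averaging can be observed numerically. In the sketch below (the quadrature grid and the test function v are arbitrary choices), the norms ‖u_m‖_{L²} stay equal to 1 while the pairings (u_m, v)_{L²} decay as m grows:

```python
import numpy as np

N = 10_000
x = (np.arange(N) + 0.5) / N          # midpoint quadrature nodes on (0, 1)
dx = 1.0 / N

def u(m):
    return np.sqrt(2) * np.sin(np.pi * m * x)

v = x**2 * (1 - x)                    # a fixed test function in L^2((0,1))

norms = [np.sum(u(m)**2) * dx for m in (1, 10, 100)]
pairings = [np.sum(u(m) * v) * dx for m in (1, 10, 100)]
```

No subsequence of u_m can converge strongly, yet every pairing against a fixed v tends to zero.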
Theorem 4.12 (Banach-Alaoglu Theorem). Let 1 < p < ∞. The unit ball B = {u ∈
Lp (U ) : kukLp (U ) ≤ 1} is weakly sequentially compact.
Example 4.6. Theorem 4.12 does not hold for p = 1. We can take, for example,
U = (−1, 1) and the sequence of functions um (x) = m/2 for |x| ≤ 1/m and um (x) = 0
for |x| > 1/m. Note that ‖u_m‖_{L¹(U)} = 1, so the sequence is bounded in L¹(U), and
that um (x) → 0 pointwise for x 6= 0 (actually, uniformly away from x = 0). So any
weak limit in L1 (U ), if it exists, must be identically zero. However, for any smooth
function φ : U → R we have

lim_{m→∞} (u_m, φ)_{L²(U)} = lim_{m→∞} (m/2) ∫_{−1/m}^{1/m} φ(x) dx = φ(0).
Hence, the sequence um does not converge weakly to zero. In fact, the sequence
u_m converges to a measure, called the Dirac delta function. Note that ‖u_m‖_{L^p(U)} = (m/2)^{1−1/p}, and so the sequence is unbounded in L^p(U) for p > 1.
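The concentration of the sequence u_m can likewise be observed numerically (a sketch; the test function φ and the grid are arbitrary choices):

```python
import numpy as np

N = 200_000
x = -1 + 2 * (np.arange(N) + 0.5) / N     # midpoint nodes on U = (-1, 1)
dx = 2.0 / N

def pairing(m, phi):
    # (u_m, phi)_{L^2(U)} with u_m = m/2 on |x| <= 1/m and 0 otherwise
    u = np.where(np.abs(x) <= 1.0 / m, m / 2.0, 0.0)
    return np.sum(u * phi(x)) * dx

phi = lambda t: np.cos(3 * t)             # smooth test function with phi(0) = 1
vals = [pairing(m, phi) for m in (10, 100, 1000)]
```

The pairings approach φ(0) = 1, even though u_m → 0 pointwise away from x = 0.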
We note that Theorem 4.12 does hold when p = ∞, provided we interpret weak
convergence in the correct sense (which is weak-* convergence here). The issue with
these borderline cases is that L¹ and L^∞ are not reflexive Banach spaces. △
The examples above show that weak convergence does not behave well with func-
tion composition, even in the setting of the bilinear form given by the L2 inner-
product. The consequence is that our functional I defined in (4.1) is in general not
continuous in the weak topology. So by weakening the topology we have gained a
crucial compactness condition, but have lost continuity.
The issues above show how it is, in general, difficult to pass to limits in the weak
topology when one is dealing with nonlinear problems. This must be handled carefully
in our analysis of minimizers of Calculus of Variations problems. We record below a
useful lemma for passing to limits with weakly convergent sequences.
Lemma 4.13. Let 1 ≤ p < ∞. If u_m ⇀ u in L^p(U) and w_m → w in L^{p′}(U), then

lim_{m→∞} ∫_U u_m w_m dx = ∫_U u w dx.

Notice the difference with Example 4.8 is that w_m converges strongly (and not just weakly) to w in L^{p′}(U).
Proof. We write

| ∫_U u_m w_m dx − ∫_U u w dx | ≤ | ∫_U (u_m − u) w dx | + | ∫_U u_m (w_m − w) dx |
≤ | ∫_U (u_m − u) w dx | + ‖u_m‖_{L^p(U)} ‖w_m − w‖_{L^{p′}(U)},

where the last line follows from Hölder's inequality. Since u_m ⇀ u, the first term vanishes as m → ∞ by definition. Since ‖u_m‖_{L^p(U)} is bounded and w_m → w in L^{p′}(U), the second term vanishes as m → ∞ as well, which completes the proof.
Sets that are strongly closed and convex are also weakly closed. For example, the closed
linear subspace W01,p (U ) is weakly closed in W 1,p (U ). That is, if uk ⇀ u in W 1,p (U )
and uk ∈ W01,p (U ) for all k, then u ∈ W01,p (U ).
It follows from the Banach-Alaoglu Theorem that the unit ball in W 1,p (U ) is
weakly compact. However, we gain an additional, and very important, strong com-
pactness in Lp (U ) due to bounding the weak gradient in Lp (U ). This is essentially
what makes it possible to pass to limits with weak convergence, since we end up in
the setting where we can apply tools like Lemma 4.13.
(4.24) u_{m_j} → u in L^p(U).

(4.25) u_{m_j} ⇀ u in W^{1,p}(U).
Proof. Assume, by way of contradiction, that for each k ≥ 1 there exists u_k ∈ W₀^{1,p}(U) such that

‖u_k‖_{L^p(U)} > k ‖∇u_k‖_{L^p(U)}.

We may assume ‖u_k‖_{L^p(U)} = 1, so that ‖∇u_k‖_{L^p(U)} < 1/k and ∇u_k → 0 in L^p(U) as k → ∞. By Theorem 4.15 there exists a subsequence u_{k_j} and u ∈ L^p(U) such that u_{k_j} → u in L^p(U). Since ‖u_k‖_{L^p(U)} = 1 we have that ‖u‖_{L^p(U)} = 1.
and

∫_{B_r} |∇u|^p dx = r^{d−p} ∫_{B_1} |∇w|^p dx.
Therefore

‖u‖^p_{L^p(B_r)} = r^d ‖w‖^p_{L^p(B_1)} ≤ C^p r^d ‖∇w‖^p_{L^p(B_1)} = C^p r^p ‖∇u‖^p_{L^p(B_r)},
which completes the proof.
Poincaré-type inequalities are true in many other situations.
Exercise 4.18. Let 1 ≤ p ≤ ∞. Show that there exists a constant C > 0, depending on p and U, such that

(4.29) ‖u‖_{L^p(U)} ≤ C(‖Tu‖_{L^p(∂U)} + ‖∇u‖_{L^p(U)})

for all u ∈ W^{1,p}(U), where T : W^{1,p}(U) → L^p(∂U) is the trace operator. [Hint: Use a very similar argument to the proof of Theorem 4.16.] △
Exercise 4.19. Let 1 ≤ p ≤ ∞. Show that there exists a constant C > 0, depending
on p and U , such that
Exercise 4.20. Let 1 ≤ p ≤ ∞. Show that there exists a constant C > 0, depending
on p, such that
I(u) ≤ lim inf_{k→∞} I(u_k) whenever

u_k ⇀ u weakly in W^{1,q}(U).
Theorem 4.22 (Weak lower semicontinuity). Let 1 < q < ∞. Assume L is bounded below and that p ↦ L(x, z, p) is convex for each z ∈ R and x ∈ U. Then I is weakly lower semicontinuous on W^{1,q}(U).
We split up the proof of Theorem 4.22 into two parts. The first part, Lemma
4.23, contains the main ideas of the proof under the stronger condition of uniform
convergence, while the proof of Theorem 4.22, given afterwards, shows how to reduce
to this case using standard tools in analysis.
Lemma 4.23. Let 1 < q < ∞. Assume L is bounded below and convex in p. Let u_k ∈ W^{1,q}(U) and u ∈ W^{1,q}(U) such that u_k ⇀ u weakly in W^{1,q}(U), u_k → u uniformly on U, and |∇u| is uniformly bounded on U. Then

I(u) ≤ lim inf_{k→∞} I(u_k).
It follows from the Sobolev compact embedding (Theorem 4.15) that, upon passing to
a subsequence if necessary, uk → u strongly in Lq (U ). Passing to another subsequence
we have
uk → u a.e. in U.
Let m > 0. By Egoroff’s Theorem (Theorem A.29) there exists Em ⊂ U such that
|U − Em | ≤ 1/m and uk → u uniformly on Em . We define
Fm := {x ∈ U : |u(x)| + |∇u(x)| ≤ m},
and
Gm = Fm ∩ (E1 ∪ E2 ∪ · · · ∪ Em ).
Then
(4.35) uk → u uniformly on Gm
for each m. Furthermore, the sequence of sets Gm is monotone, i.e.,
G1 ⊂ G2 ⊂ G3 ⊂ · · · ⊂ Gm ⊂ Gm+1 ⊂ · · · ,
and

|U − G_m| ≤ |U − F_m| + |U − E_m| ≤ |U − F_m| + 1/m.
By the Dominated Convergence Theorem (Theorem A.28) we have

lim_{m→∞} |U − F_m| = lim_{m→∞} ∫_U 1 − χ_{F_m}(x) dx = 0,
where χFm (x) = 1 when x ∈ Fm and zero otherwise. Therefore |U − Gm | → 0 as
m → ∞.
Since L is bounded below, there exists β ∈ R such that L(x, z, p) + β ≥ 0 for all
x ∈ U , z ∈ R, and p ∈ Rd . We have
(4.36) I(u_k) + β|U| = ∫_U L(x, u_k, ∇u_k) + β dx ≥ ∫_{G_m} L(x, u_k, ∇u_k) + β dx.
We call (4.37) a coercivity condition, since it plays the role of coercivity from Section
4.1. We also need to allow for the imposition of constraints on the function u. Thus,
we write
Theorem 4.25. Let 1 < q < ∞. Assume that L is bounded below, satisfies the
coercivity condition (4.37), and is convex in p. Suppose also that the admissible set
A is nonempty. Then there exists u ∈ A such that

I(u) = min_{w∈A} I(w).
Proof. We may assume inf_{w∈A} I(w) < ∞. Select a minimizing sequence u_k ∈ A, so that I(u_k) → inf_{w∈A} I(w) as k → ∞. We can pass to a subsequence, if necessary, so that I(u_k) < ∞ for all k, and hence sup_{k≥1} I(u_k) < ∞. By the coercivity condition (4.37)

I(u_k) ≥ ∫_U α|∇u_k|^q − β dx,

and so

sup_{k≥1} ∫_U |∇u_k|^q dx < ∞.
By a Poincaré-type inequality it follows that sup_{k≥1} ‖u_k‖_{L^q(U)} < ∞. Therefore sup_{k≥1} ‖u_k‖_{W^{1,q}(U)} < ∞, and so there exists a subsequence u_{k_j} and a function u ∈ W^{1,q}(U) such that u_{k_j} ⇀ u weakly in W^{1,q}(U) as j → ∞. By Theorem 4.22 we have

I(u) ≤ lim inf_{j→∞} I(u_{k_j}) = inf_{w∈A} I(w).
are unique for 1 < q < ∞. [Hint: Note that while L(x, z, p) = |p|^q is not strongly convex in p, it is strictly convex. Use a proof similar to Theorem 4.26.] △
Since our candidate solutions live in W 1,q (U ), we need a weaker notion of solution
that does not require differentiating u twice. For motivation, recall when we derived
the Euler-Lagrange equation in Theorem 2.1 we arrived at Eq. (2.5), which reads

(4.43) ∫_U L_z(x, u, ∇u)φ + ∇_p L(x, u, ∇u) · ∇φ dx = 0,
and so ∇_p L(x, u, ∇u) ∈ L^{q′}(U), where q′ = q/(q − 1), i.e., 1/q + 1/q′ = 1. Hence, we need ∇v ∈ L^q(U) to ensure the integral in (4.44) exists (by Hölder's inequality). Similarly |L_z(x, u, ∇u)| = |u|^{q−1}, and so L_z(x, u, ∇u) ∈ L^{q′}(U), which also necessitates v ∈ L^q(U).
Following the discussion in Remark 4.30, we must place some assumptions on L so
that the weak form (4.44) of the Euler-Lagrange equation is well-defined. We assume
that
|L(x, z, p)| ≤ C(|p|^q + |z|^q + 1),
for every nonzero t ∈ [−1, 1]. Thus, we can apply the Dominated Convergence Theorem (Theorem A.28) to find that

0 = (d/dt) I(u + tv)|_{t=0} = lim_{t→0} ∫_U w_t(x) dx = ∫_U lim_{t→0} w_t(x) dx.
If L is jointly convex in (z, p), then it is easy to check that z ↦ L(x, z, p) and p ↦ L(x, z, p) are convex. The converse is not true.
For example, for u(x_1, x_2) = x_1 x_2 the quadratic form

Σ_{i=1}^{2} Σ_{j=1}^{2} u_{x_i x_j} v_i v_j = 2 v_1 v_2
is clearly not positive for, say, v_1 = 1 and v_2 = −1. However, x_1 ↦ u(x_1, x_2) is convex, since u_{x_1x_1} = 0, and x_2 ↦ u(x_1, x_2) is convex, since u_{x_2x_2} = 0. Therefore, convexity of x ↦ u(x) is not equivalent to convexity in each variable x_1 and x_2 independently. △
Proof. Let w ∈ A. Since (z, p) 7→ L(x, z, p) is convex, we can use Theorem A.38 (iii)
to obtain
L(x, w, ∇w) ≥ L(x, u, ∇u) + Lz (x, u, ∇u)(w − u) + ∇p L(x, u, ∇u) · (∇w − ∇u).
4.5.1 Regularity
It is natural to ask whether the weak solution we constructed to the Euler-Lagrange
equation (4.42) is in fact smooth, or has some additional regularity beyond W 1,q (U ).
In fact, this was one of the famous 23 problems posed by David Hilbert in 1900.
Hilbert’s 19th problem asked “Are the solutions of regular problems in the calculus
of variations necessarily analytic”? The term “regular problems” refers to strongly
convex Lagrangians L for which the Euler-Lagrange equation is uniformly elliptic,
and it is sufficient to interpret the term “analytic” as “smooth”.
The problem was resolved affirmatively by de Giorgi [23] and Nash [44], indepen-
dently. Moser later gave an alternative proof [43], and the resulting theory is often
called the de Giorgi-Nash-Moser theory. We should note that the special case of n = 2
is significantly easier, and was established earlier by Morrey [42].
We will not give proofs of regularity here, but will say a few words about the
difficulties and the main ideas behind the proofs. We consider the simple calculus of
variations problem
(4.47) min_u ∫_U L(∇u) dx,
where L is smooth and θ-strongly convex with θ > 0. The corresponding Euler-
Lagrange equation is
(4.48) div(∇_p L(∇u)) = Σ_{j=1}^{d} ∂_{x_j}(L_{p_j}(∇u)) = 0 in U.
The standard way to obtain regularity in PDE theory is to differentiate the equation
to get a PDE satisfied by the derivative uxk . For simplicity we assume u is smooth.
Differentiate (4.48) in xk to obtain
Σ_{i,j=1}^{d} ∂_{x_j}(L_{p_i p_j}(∇u) u_{x_k x_i}) = 0 in U.
Setting w = u_{x_k} and a_{ij} := L_{p_i p_j}(∇u), this is a linear equation for w:

(4.49) Σ_{i,j=1}^{d} ∂_{x_j}(a_{ij} w_{x_i}) = 0 in U.

By the θ-strong convexity of L, the coefficients satisfy

(4.50) θ|η|² ≤ Σ_{i,j=1}^{d} a_{ij} η_i η_j ≤ Θ|η|²,
for some Θ > 0. Thus (4.49) is uniformly elliptic. Notice we have converted our
problem of regularity of minimizers of (4.47) into an elliptic regularity problem.
However, since u ∈ W 1,q is not smooth, we at best have aij ∈ L∞ (U ), and in
particular, the coefficients aij are not a priori continuous, as is necessary to apply
elliptic regularity theory, such as the Schauder estimates [30]. All we know about the
coefficients is the ellipticity condition (4.50). Nonetheless, the remarkable conclusion
is that w ∈ C 0,α for some α > 0. This was proved separately by de Giorgi [23] and
Nash [44]. Since w = uxk we have u ∈ C 1,α and so aij ∈ C 0,α . Now we are in the
world of classical elliptic regularity. A version of the Schauder estimates shows that
u ∈ C 2,α , and so aij ∈ C 1,α . But another application of a different Schauder estimate
gives u ∈ C 3,α and so aij ∈ C 2,α . Iterating this argument is called bootstrapping, and
leads to the conclusion that u ∈ C ∞ is smooth.
for all p, q ∈ Rd and λ ∈ (0, 1), and consider minimizing L over the constraint set
(4.53) A = {u ∈ C 0,1 (U ) : u = g on ∂U },
The Lipschitz constant Lip(u) is the smallest number C > 0 for which
|u(x) − u(y)| ≤ C|x − y|
holds for all x, y ∈ U. By Rademacher's Theorem [28], every Lipschitz function is differentiable almost everywhere in U, and C^{0,1}(U) is continuously embedded into W^{1,q}(U) for every q ≥ 1, which means

(4.56) ‖u‖_{W^{1,q}(U)} ≤ C‖u‖_{C^{0,1}(U)}.
Our strategy to overcome the lack of coercivity will be to modify the constraint
set to one of the form
(4.57) Am = {u ∈ C 0,1 (U ) : Lip(u) ≤ m and u = g on ∂U }.
By (4.56), minimizing sequences are bounded in W 1,q (U ), which was all that coercivity
was used for. Then we will show that minimizers over A_m satisfy ‖u‖_{C^{0,1}(U)} < m by proving a priori gradient bounds on u, which will show that minimizers over A_m are the same as minimizers over A.
Theorem 4.35. Assume L is convex and bounded below. Then there exists u ∈ A_m such that I(u) = min_{w∈A_m} I(w).
Proof. Let uk ∈ Am be a minimizing sequence. By the continuous embedding (4.56),
the sequence uk is bounded in H 1 (U ) = W 1,2 (U ). Thus, we can apply the argument
from Theorem 4.25, and the Sobolev compact embeddings 4.15, to show there is a
subsequence ukj and a function u ∈ H 1 (U ) such that u = g on ∂U in the trace sense,
ukj ⇀ u weakly in H 1 (U ), ukj → u in L2 (U ), and
I(u) ≤ lim inf_{j→∞} I(u_{k_j}) = inf_{w∈A_m} I(w).
We now turn to proving a priori estimates on minimizers. For this, we need the
comparison principle and notions of sub and supersolutions. We recall Cc0,1 (U ) is the
subset of C 0,1 (U ) consisting of functions with compact support in U , that is, each
u ∈ Cc0,1 (U ) is identically zero in some neighborhood of ∂U .
Definition 4.36. We say that u ∈ C 0,1 (U ) is a subsolution if
and

v̄(x) = v(x) if x ∈ U \ K, and v̄(x) = u(x) if x ∈ K.
Since u ≤ v on ∂U , K is an open set compactly contained in U .
Note that ū ≤ u (here ū = u on U \ K and ū = v on K) and define φ := u − ū. Then ū = u − φ and φ ≥ 0 is supported on K, so φ ∈ C_c^{0,1}(U). By the definition of subsolution we have I(ū) ≥ I(u). By a similar argument we have I(v̄) ≥ I(v). Therefore

∫_U L(∇u) dx ≤ ∫_U L(∇ū) dx = ∫_{U\K} L(∇u) dx + ∫_K L(∇v) dx,

and so

∫_K L(∇u) dx ≤ ∫_K L(∇v) dx.

The opposite inequality is obtained by a similar argument and so we have

(4.60) ∫_K L(∇v) dx = ∫_K L(∇u) dx.
Define now

w(x) = u(x) if x ∈ U \ K, and w(x) = ½(u(x) + v(x)) if x ∈ K.
As above, the subsolution property yields I(w) ≥ I(u), and so we can use the argument above and (4.60) to obtain

∫_K ½(L(∇u) + L(∇v)) dx = ½ ∫_K L(∇u) dx + ½ ∫_K L(∇v) dx
= ∫_K L(∇u) dx
≤ ∫_K L((∇u + ∇v)/2) dx.
Therefore

∫_K L((∇u + ∇v)/2) − ½(L(∇u) + L(∇v)) dx ≥ 0.
By convexity of L we conclude that

L((∇u + ∇v)/2) = ½(L(∇u) + L(∇v))
a.e. in K. Since L is strictly convex, we have

L((p + q)/2) ≤ ½(L(p) + L(q))
with equality if and only if p = q. Therefore ∇u = ∇v a.e. in K, and so u − v is
constant in K. Since u = v on ∂K, we find that u ≡ v in K, which is a contradiction
to the definition of K.
Corollary 4.39. Let u and v be sub- and supersolutions, respectively. Then we have

(4.61) sup_U (u − v) ≤ sup_{∂U} (u − v).
Proof. Since I(u + λ) = I(u) for any constant λ, adding constants does not change
the property of being a sub- or supersolution. Therefore w := v + sup∂U (u − v) is a
supersolution satisfying u ≤ w on ∂U . By Theorem 4.37 we have u ≤ w in U , which
proves (4.61).
Note that for any φ ∈ C_c^{0,1}(U) we have by convexity of L that

∫_U L(0 ± ∇φ) dx ≥ ∫_U L(0) ± ∇_p L(0) · ∇φ dx = ∫_U L(0) dx,
where we used integration by parts in the last equality. Therefore, any constant
function is both a sub- and supersolution. Using (4.61) with v = 0 we have supU u ≤
sup∂U u. If u is also a supersolution, then the opposite inequality is also obtained,
yielding (4.62).
Both u and uτ are sub- and supersolutions in U ∩ Uτ . By Corollary 4.39 there exists
z ∈ ∂(U ∩ Uτ ) such that
The opposite inequality is obtained similarly. Since x1 , x2 ∈ U are arbitrary, the proof
is completed.
Lemma 4.40 reduces the problem of a priori estimates to estimates near the bound-
ary (or, rather, relative to boundary points). To complete the a priori estimates, we
need some further assumptions on ∂U and g. The simplest assumption is the bounded
slope condition, which is stated as follows:
(BSC) A function u ∈ A satisfies the bounded slope condition (BSC) if there exists m > 0 such that for every x₀ ∈ ∂U there exist affine functions w, v with |Dv| ≤ m and |Dw| ≤ m such that

u(x) − u(x₀) ≤ w(x) − w(x₀) ≤ m|x − x₀|

and

u(x) − u(x₀) ≥ v(x) − v(x₀) ≥ −m|x − x₀|.
Therefore

|u(x) − u(x₀)| / |x − x₀| ≤ m

for all x ∈ U, and so

sup_{x∈U, y∈∂U} |u(x) − u(y)| / |x − y| ≤ m.
Combining this with Lemma 4.40 completes the proof.
Finally, we prove existence and uniqueness of minimizers over A.
Theorem 4.42. Let L be strictly convex and assume g satisfies the bounded slope
condition (BSC) for a constant m > 0. Then there is a unique u ∈ A such that
I(u) = minw∈A I(w) and furthermore Lip(u) ≤ m.
Proof. By Theorem 4.35 there exists u ∈ A_{m+1} such that

I(u) = min_{w∈A_{m+1}} I(w).
By Remark 4.38, Theorems 4.37 and 4.41 apply to minimizers in Am+1 with similar
arguments, and so we find that Lip(u) ≤ m. Let w ∈ A. Then for small enough t > 0
we have that
(1 − t)u + tw ∈ Am+1
and so using convexity of L we have

I(u) ≤ I((1 − t)u + tw) = ∫_U L((1 − t)∇u + t∇w) dx ≤ (1 − t)I(u) + tI(w).
Discrete to continuum in
graph-based learning
Machine learning algorithms learn relationships between data and labels, which can
be used to automate everyday tasks that were once difficult for computers to perform,
such as image annotation or facial recognition. To describe the setup mathematically,
we denote by X the space in which our data lives, which is often Rd , and by Y the label
space, which may be Rk . Machine learning algorithms are roughly split into three
categories: (i) fully supervised, (ii) semi-supervised, and (iii) unsupervised. Fully
supervised algorithms use labeled training data (x1 , y1 ), . . . , (xn , yn ) ∈ X × Y to learn
a function f : X → Y that appropriately generalizes the rule x_i ↦ y_i. The learned
function f (x) generalizes well if it gives the correct label for datapoints x ∈ X that it
was not trained on (e.g., the testing or validation set). For a computer vision problem,
each datapoint xi would be a digital image, and the corresponding label yi could, for
example, indicate the class the image belongs to, what type of objects are in the image,
or encode a caption for the image. Convolutional neural networks trained to classify
images [37] are a modern example of fully supervised learning, though convolutional
nets are used in semi-supervised and unsupervised settings as well. Unsupervised
learning algorithms, such as clustering [36] or autoencoders [24], make use of only the
data x1 , . . . , xn and do not assume access to any label information.
An important and active field of current research is semi-supervised learning, which
refers to algorithms that learn from both labeled and unlabeled data. Here, we are
given some labeled data (x1 , y1 ), . . . , (xm , ym ) and some unlabeled data xm+1 , . . . , xn ,
where usually m ≪ n, and the task is to use properties of the unlabeled data to
obtain enhanced learning outcomes compared to using labeled data alone. In many
applications, such as medical image analysis or speech recognition, unlabeled data
is abundant, while labeled data is costly to obtain. Thus, it is important to have
efficient and reliable semi-supervised learning algorithms that are able to properly
utilize the ever-increasing amounts of unlabeled data that is available for learning
tasks.
Figure 5.1: Example of some of the handwritten digits from the MNIST dataset [40].
(5.1) Minimize E(u) over u : X → R^k, subject to u = g on Γ,
Figure 5.2: Error plots for MNIST experiment showing testing error versus number
of labels, averaged over 100 trials.
One of the most widely used smoothness functionals [73] is the graph Dirichlet energy
(5.3) E₂(u) := ½ Σ_{x,y∈X} w_{xy} |u(y) − u(x)|².
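For a graph stored as a symmetric weight matrix, the energy (5.3) can be evaluated either directly from the double sum or as a quadratic form (a minimal sketch; the random graph is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
W = rng.random((n, n))
W = (W + W.T) / 2                 # symmetric weight matrix, W[i, j] = w_xy
np.fill_diagonal(W, 0.0)
u = rng.standard_normal(n)

# direct evaluation of the double sum in (5.3)
E2_direct = 0.5 * np.sum(W * (u[None, :] - u[:, None]) ** 2)

# equivalent quadratic form: E2(u) = u^T (D - W) u with D = diag(degrees);
# the matrix D - W is the negative of the graph Laplacian in the text's
# sign convention, so this matches E2(u) = -(u, Lu)
D = np.diag(W.sum(axis=1))
E2_quadratic = u @ (D - W) @ u
```

Both evaluations agree, which is the discrete analogue of integration by parts.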
The p-Laplace problem was introduced originally in [70] for semi-supervised learning.
It was proposed more recently in [26] with large p as a superior regularizer in problems
with very few labels where Laplacian regularization gives poor results. The well-
posedness of the p-Laplacian models with very few labels was proved rigorously in [55]
for the variational graph p-Laplacian, and in [11] for the game-theoretic graph p-
Laplacian. In addition, in [11] discrete regularity properties of the graph p-harmonic
functions are proved, and the p = ∞ case, called Lipschitz learning [38], is studied in
[13]. Efficient algorithms for solving these problems are developed in recent work [29].
We also note there are alternative re-weighting approaches that have been successful
for problems with low label rates [15, 53].
Graph-based techniques are also widely used in unsupervised learning algorithms,
such as clustering. The binary clustering problem is to split X into two groups in
such a way that similar data points belong to the same group. Let A ⊂ X be one
group and Ac := X \ A be the other. One approach to clustering is to minimize the
graph cut energy

(5.4) MC(A) := Cut(A, A^c),

where

(5.5) Cut(A, B) = Σ_{x∈A, y∈B} w_{xy}.
The graph cut energy MC(A) is the sum of the weights of edges that have to be cut
to partition the graph into A and Ac . Minimizing the graph cut energy is known
as the min-cut problem and the solution can be obtained efficiently. However, the
min-cut problem often yields poor clusterings, since the energy is often minimized
with A = {x} for some x ∈ X , which is usually undesirable.
This can be addressed by normalizing the graph cut energy by the size of the
clusters. One approach is the normalized cut [52] energy
(5.6) NC(A) = Cut(A, A^c) / (V(A) V(A^c)),
where V (A) = Cut(A, X ) measures the size of the cluster A. There are many other
normalizations one can use, such as min{V (A), V (Ac )} [2]. Minimizing the normalized
cut energy NC(A) turns out to be an NP-hard problem.
Therefore,

V(A)V(A^c) = Σ_{x,y∈X} d(x) d(y) 1_A(x)(1 − 1_A(y)).
So far, we have changed nothing besides notation, so the problem is still NP-hard.
However, this reformulation allows us to view the problem as optimizing over binary
functions, in the form
(5.7) min_{u : X→{0,1}} [ Σ_{x,y∈X} w_{xy}(u(x) − u(y))² ] / [ Σ_{x,y∈X} d(x)d(y) u(x)(1 − u(y)) ],
since binary functions are exactly the characteristic functions. Later we will cover
relaxations of (5.7), which essentially proceed by allowing u to take on all real values
R, but this must be done carefully. These relaxations lead to spectral methods for
dimension reduction and clustering, such as spectral clustering [46] and Laplacian
eigenmaps [4].
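A tiny brute-force sketch (the random graph is an arbitrary illustration): evaluated at a characteristic function u = 1_A, the objective in (5.7) equals 2 NC(A), since the double sum counts each cut edge twice; and exact minimization requires enumerating subsets, reflecting the NP-hardness noted above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)                       # degrees d(x)

def cut(A, B):                          # Eq. (5.5)
    return sum(W[x, y] for x in A for y in B)

def nc(A):                              # Eq. (5.6)
    Ac = [x for x in range(n) if x not in A]
    return cut(A, Ac) / (d[list(A)].sum() * d[Ac].sum())

def ratio(u):                           # the objective in (5.7)
    num = np.sum(W * (u[:, None] - u[None, :]) ** 2)
    den = np.sum(np.outer(d, d) * np.outer(u, 1 - u))
    return num / den

A = (0, 2, 4)
u = np.zeros(n); u[list(A)] = 1.0       # u = 1_A, so ratio(u) = 2 * NC(A)

# brute force over all nontrivial subsets (exponential cost):
best = min(nc(S) for r in range(1, n)
           for S in itertools.combinations(range(n), r))
```

The constant factor 2 does not change the minimizers, so (5.7) and NC have the same optimal clusters.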
We remark that all of the methods above can be viewed as discrete calculus of
variations problems on graphs. A natural question to ask is what relationship they
have with their continuous counterparts. Such relationships can give us insights about
how much labeled data is necessary to get accurate classifiers, and which algorithms
and hyperparameters are appropriate in different situations. Answering these questions
requires a modeling assumption on our data. A standard assumption is that the data
points are a sequence of independent and identically distributed random variables
and the graph is constructed as a random geometric graph. The next few sections are
a review of calculus on graphs, and some basic probability, so that we can address
the random geometric graph model rigorously.
This induces a norm ‖u‖²_{ℓ²(X)} = (u, u)_{ℓ²(X)}. We define a vector field on the graph to
be an antisymmetric function V : X 2 → R (i.e., V (x, y) = −V (y, x)). Thus, vector
fields are defined along edges of the graph. The gradient of a function u ∈ ℓ²(X), denoted for simplicity as ∇u, is the vector field

∇u(x, y) = u(y) − u(x).

The divergence div V ∈ ℓ²(X) of a vector field V is defined so that the integration by parts formula

(5.11) (∇u, V)_{ℓ²(X²)} = −(u, div V)_{ℓ²(X)}

holds for all u ∈ ℓ²(X). Compare this with the continuous Divergence Theorem (Theorem A.32). To find an expression for div V(x), we compute
(∇u, V)_{ℓ²(X²)} = ½ Σ_{x,y∈X} w_{xy} (u(y) − u(x)) V(x, y)
= ½ Σ_{x,y∈X} w_{xy} u(y) V(x, y) − ½ Σ_{x,y∈X} w_{xy} u(x) V(x, y)
= ½ Σ_{x,y∈X} w_{xy} u(x) V(y, x) − ½ Σ_{x,y∈X} w_{xy} u(x) V(x, y)
= − Σ_{x,y∈X} w_{xy} u(x) V(x, y),

and so we may define

div V(x) := Σ_{y∈X} w_{xy} V(x, y).
Using this definition, the integration by parts formula (5.11) holds, and div is the
negative adjoint of ∇.
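The adjointness relation can be verified numerically on a random weighted graph (a sketch, assuming the ½-weighted inner product on ℓ²(X²) used in the computation above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
W = rng.random((n, n))
W = (W + W.T) / 2                         # symmetric weights w_xy
np.fill_diagonal(W, 0.0)

u = rng.standard_normal(n)
V = rng.standard_normal((n, n))
V = V - V.T                               # antisymmetric vector field

grad_u = u[None, :] - u[:, None]          # (grad u)(x, y) = u(y) - u(x)
ip_grad = 0.5 * np.sum(W * grad_u * V)    # (grad u, V) in l2(X^2)
div_V = np.sum(W * V, axis=1)             # div V(x) = sum_y w_xy V(x, y)
ip_div = np.sum(u * div_V)                # (u, div V) in l2(X)
# integration by parts: ip_grad == -ip_div
```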
In particular,

(5.16) E₂(u) = ‖∇u‖²_{ℓ²(X²)} = −(u, Lu)_{ℓ²(X)},
which shows that the graph Dirichlet energy E2 (u), defined in (5.3), is closely related
to the graph Laplacian. Eq. (5.16) also shows that −L is a positive semi-definite
operator.
Now, we consider the Laplacian learning problem

(5.17) min_{u∈A} E₂(u),

where

A = {u ∈ ℓ²(X) : u(x) = g(x) for all x ∈ Γ},

where Γ ⊂ X and g : Γ → R are given.
Lemma 5.1 (Existence). There exists u ∈ A such that E₂(u) ≤ E₂(w) for all w ∈ A.

Proof. We first note that if w := min(b, max(a, u)) is the truncation of u to the interval [a, b], for any a < b, then E₂(w) ≤ E₂(u). Indeed, this follows from

|min(b, max(a, u(y))) − min(b, max(a, u(x)))| ≤ |u(y) − u(x)|,

which holds for all x, y ∈ X. By setting a = min_{x∈Γ} g(x) and b = max_{x∈Γ} g(x), we have that w ∈ A provided u ∈ A. Therefore, we may look for a minimizer of E₂ over the constrained set

B := {u ∈ A : min_Γ g ≤ u ≤ max_Γ g}.
Since E2 (u) ≥ 0 for all u ∈ ℓ2 (X ), we have inf w∈B E2 (w) ≥ 0. For every integer k ≥ 1,
let uk ∈ B such that E2 (uk ) ≤ inf w∈A E2 (w) + 1/k. Then uk is a minimizing sequence,
meaning that lim_{k→∞} E₂(u_k) = inf_{w∈A} E₂(w). By the Bolzano-Weierstrass Theorem,
there exists a subsequence ukj and u ∈ B such that limj→∞ ukj (x) = u(x) for all
x ∈ X . Since E2 (u) is a continuous function of u(x) for all x ∈ X , we have
where we used the fact that L is self-adjoint (Eq. (5.15)) in the last line. Therefore,
any minimizer u of (5.17) must satisfy
We now choose

v(x) = Lu(x) if x ∈ X \ Γ, and v(x) = 0 if x ∈ Γ,

to find that

0 = (Lu, v)_{ℓ²(X)} = Σ_{x∈X\Γ} |Lu(x)|².
(5.20) u(x) = (1/d(x)) Σ_{y∈X} w_{xy} u(y).
Eq. (5.20) is a mean value property on the graph, and says that any graph harmonic
function (i.e., Lu = 0) is equal to its average value over graph neighbors.
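The mean value property (5.20) suggests a simple iterative solver for Laplacian learning: repeatedly replace u at unlabeled vertices by the weighted average of its neighbors. A minimal sketch (the point cloud, Gaussian weights, and labels are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = rng.random((n, 2))                                   # toy point cloud
dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-dist2 / 0.1)                                 # Gaussian weights
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)                                        # degrees d(x)

labeled = np.array([0, 1])                               # the set Gamma
g = np.array([0.0, 1.0])                                 # labels on Gamma

u = np.zeros(n)
u[labeled] = g
for _ in range(2000):                                    # Jacobi-type iteration
    u = (W @ u) / d                                      # mean value update (5.20)
    u[labeled] = g                                       # re-impose u = g on Gamma

residual = np.max(np.abs(u - (W @ u) / d)[2:])           # Lu ~ 0 off Gamma
```

After convergence the residual of (5.20) vanishes off Γ, and the computed values lie between the extreme label values, illustrating the maximum principle (Theorem 5.2).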
Uniqueness of solutions of the boundary value problem (5.19) follows from the following maximum principle.
Theorem 5.2 (Maximum Principle). Let u ∈ ℓ²(X) such that Lu(x) ≥ 0 for all x ∈ X \ Γ. If the graph G = (X, W) is connected to Γ, then

max_{x∈X} u(x) = max_{x∈Γ} u(x).

Note that the condition Lu(x) ≥ 0 is equivalent to the mean value inequality

(5.22) u(x) ≤ (1/d(x)) Σ_{y∈X} w_{xy} u(y),
In particular, we have

max_{x∈X} |u(x) − v(x)| ≤ max_{x∈Γ} |u(x) − v(x)|.

In particular, if u = v on Γ, then u ≡ v.
Proof. Since L(u − v)(x) = 0 for x ∈ X \ Γ, we have by Theorem 5.2 that

max_{x∈X}(u(x) − v(x)) = max_{x∈Γ}(u(x) − v(x)) ≤ max_{x∈Γ} |u(x) − v(x)|

and

max_{x∈X}(v(x) − u(x)) = max_{x∈Γ}(v(x) − u(x)) ≤ max_{x∈Γ} |u(x) − v(x)|.

Combining these two estimates completes the proof.
For use later on, we record another version of the maximum principle that does
not require connectivity of the graph.
Lemma 5.4 (Maximum principle). Let u ∈ ℓ2 (X ) such that Lu(x) > 0 for all x ∈
X \ Γ. Then
(5.25) max_{x∈X} u(x) = max_{x∈Γ} u(x).
Proof. Let x₀ ∈ X such that u(x₀) = max_{x∈X} u(x). Since u(x₀) ≥ u(y) for all y ∈ X, we have

Lu(x₀) = Σ_{y∈X} w_{x₀y} (u(y) − u(x₀)) ≤ 0.
Since Lu(x) > 0 for all x ∈ X \ Γ, we must have x0 ∈ Γ, which completes the
proof.
for any t > 0. Bounds of the form (5.26) and (5.27) are called concentration in-
equalities, or concentration of measure. In this section we describe the main ideas for
proving exponential bounds of the form (5.27), and prove the Hoeffding and Bernstein
inequalities, which are some of the most useful concentration inequalities. For more
details we refer the reader to [8].
One normally proves exponential bounds like (5.27) with the Chernoff bounding
trick. Let s > 0 and note that
P(S_n − µ ≥ t) = P(s(S_n − µ) ≥ st) = P(e^{s(S_n−µ)} ≥ e^{st}).
The random variable Y = es(Sn −µ) is nonnegative, so we can apply Markov’s inequality
(see Proposition A.39) to obtain
P(S_n − µ ≥ t) ≤ e^{−st} E[e^{s(S_n−µ)}]
= e^{−st} E[e^{(s/n) Σ_{i=1}^n (X_i−µ)}]
= e^{−st} E[ ∏_{i=1}^n e^{(s/n)(X_i−µ)} ].
Since the X_i are independent and identically distributed, the expectation of the product factors, and we obtain

(5.28) P(S_n − µ ≥ t) ≤ e^{−st} ∏_{i=1}^n E[e^{(s/n)(X_i−µ)}] = e^{−st} E[e^{(s/n)(X_1−µ)}]^n.
This bound is the main result of the Chernoff bounding trick. The key now is to
obtain bounds on the moment generating function
MX (λ) := E[eλ(X−µ) ],
where X = X1 .
In the case where the Xi are Bernoulli random variables, we can compute the
moment generating function explicitly, and this leads to the Chernoff bounds. Before giving them, we present some preliminary technical propositions regarding the function

h(δ) := (1 + δ) log(1 + δ) − δ.
f′(x) = δ − e^x + 1 = 0

when x = log(1 + δ). Since f″(x) = −e^x, the critical point is a maximum and we have

f(log(1 + δ)) = h(δ),

which completes the proof of (5.30). The proof of (5.31) is similar.
(5.32) h(δ) ≥ δ² / (2(1 + δ₊/3)),

Proof. We first note that h′(0) = h(0) = 0 and h″(δ) = 1/(1 + δ). We define

f(δ) = δ² / (2(1 + δ/3)),
for any s > 0. We use (5.30) from Proposition 5.5 to optimize over s > 0, yielding

P( Σ_{i=1}^n X_i ≥ (1 + δ)np ) ≤ exp(−np h(δ)).
Set t = δp to obtain

P( Σ_{i=1}^n X_i ≤ (1 − δ)np ) ≤ exp( −sδp + sp + np(e^{−s/n} − 1) )
= exp( −np( δ(s/n) − (e^{−s/n} − 1) − s/n ) ).
P(|P| ≥ 61 log n) ≤ P(Y ≥ 58 log n) ≤ 1/n².
Since there are n paths from leaves to the root, we union bound over all paths to find that

P(Z ≥ 61 log n) ≤ 1/n.
Therefore, with probability at least 1 − 1/n, quicksort takes at most O(n log n) operations to complete. △
In general, we cannot compute the moment generating function explicitly, and are
left to derive upper bounds. The first bound is due to Hoeffding.
Lemma 5.8 (Hoeffding Lemma). Let X be a real-valued random variable for which
|X − µ| ≤ b almost surely for some b > 0, where µ = E[X]. Then we have
(5.35) M_X(λ) ≤ e^{λ²b²/2}.
Proof. Since x ↦ e^{λx} is convex, we have

e^{λx} ≤ e^{−λb} + ((x + b)/b) sinh(λb)

provided |x| ≤ b (the right hand side is the secant line from (−b, e^{−λb}) to (b, e^{λb}); recall sinh(t) = (e^t − e^{−t})/2 and cosh(t) = (e^t + e^{−t})/2). Therefore we have

M_X(λ) = E[e^{λ(X−µ)}] ≤ E[ e^{−λb} + ((X − µ + b)/b) sinh(λb) ]
= e^{−λb} + ((E[X] − µ + b)/b) sinh(λb)
= e^{−λb} + sinh(λb) = cosh(λb).

The proof is completed by the elementary inequality cosh(x) ≤ e^{x²/2} (compare Taylor series).
provided |Xi − µ| ≤ b almost surely. Optimizing over s > 0 we find that s = nt/b², which yields the following result.

Theorem 5.9 (Hoeffding inequality). Let X1, X2, . . . , Xn be a sequence of i.i.d. real-valued random variables with mean µ = E[Xi], and write Sn = (1/n) Σ_{i=1}^n Xi. Assume there exists b > 0 such that |Xi − µ| ≤ b almost surely. Then for any t > 0 we have

(5.36)    P(Sn − µ ≥ t) ≤ exp( −nt²/(2b²) ).
Remark 5.10. Of course, the opposite inequality

P(Sn − µ ≤ −t) ≤ exp( −nt²/(2b²) )

holds by a similar argument. Thus, by the union bound we have

P(|Sn − µ| ≥ t) ≤ 2 exp( −nt²/(2b²) ).
The Hoeffding inequality is tight if σ² ≈ b², so that the right hand side looks like the Gaussian distribution in (5.27), up to constants. For example, if Xi are uniformly distributed on [−b, b] then

σ² = (1/(2b)) ∫_{−b}^{b} x² dx = b²/3.

However, if σ² ≪ b², then one would expect to see σ² in place of b² as in (5.27), and the presence of b² leads to a suboptimal bound.
Example 5.2. Let

Y = max( 1 − |X|/ε, 0 ),

where X is uniformly distributed on [−1, 1], as above, and 0 < ε ≪ 1. Then |Y| ≤ 1, so b = 1, but we compute

σ² ≤ (1/2) ∫_{−ε}^{ε} dx = ε.

Hence, σ² ≪ 1 when ε is small, and we expect to get sharper concentration bounds than are provided by the Hoeffding inequality. This example is similar to what we will see later in consistency of graph Laplacians. 4
The Bernstein inequality gives the sharper bounds that we desire, and follows
from Bernstein’s Lemma.
Lemma 5.11 (Bernstein Lemma). Let X be a real-valued random variable with finite mean µ = E[X] and variance σ² = Var(X), and assume that |X − µ| ≤ b almost surely for some b > 0. Then we have

(5.37)    M_X(λ) ≤ exp( (σ²/b²)( e^{λb} − 1 − λb ) ).
Proof. For |x| ≤ b we use |x|^k ≤ x² b^{k−2} for k ≥ 2 in the Taylor series of the exponential to obtain

e^{λx} = 1 + λx + Σ_{k=2}^∞ (λx)^k / k! ≤ 1 + λx + (x²/b²)( e^{λb} − 1 − λb ).

Therefore

M_X(λ) ≤ E[ 1 + λ(X − µ) + ((X − µ)²/b²)( e^{λb} − 1 − λb ) ] = 1 + (σ²/b²)( e^{λb} − 1 − λb ).

The proof is completed by using the inequality 1 + x ≤ e^x.
We now prove the Bernstein inequality.
Theorem 5.12 (Bernstein Inequality). Let X1, X2, . . . , Xn be a sequence of i.i.d. real-valued random variables with finite expectation µ = E[Xi] and variance σ² = Var(Xi), and write Sn = (1/n) Σ_{i=1}^n Xi. Assume there exists b > 0 such that |Xi − µ| ≤ b almost surely. Then for any t > 0 we have

P(Sn − µ ≥ t) ≤ exp( −nt² / (2(σ² + (1/3)bt)) ).

Remark 5.14. As with the Hoeffding inequality, we can obtain two-sided estimates of the form

P(|Sn − µ| ≥ t) ≤ 2 exp( −nt² / (2(σ² + (1/3)bt)) ).
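Since Bernstein replaces b² in the exponent by σ² + bt/3, the bound is far smaller than Hoeffding's when σ² ≪ b², as for the variable Y in Example 5.2. A quick numerical comparison of the two exponential bounds (the parameter values are illustrative choices):

```python
import numpy as np

def hoeffding_bound(n, t, b):
    """One-sided Hoeffding tail bound for P(Sn - mu >= t)."""
    return np.exp(-n * t**2 / (2 * b**2))

def bernstein_bound(n, t, b, sigma2):
    """One-sided Bernstein tail bound for P(Sn - mu >= t)."""
    return np.exp(-n * t**2 / (2 * (sigma2 + b * t / 3)))

n, t, b, sigma2 = 1000, 0.05, 1.0, 0.01   # sigma^2 << b^2
print(hoeffding_bound(n, t, b))           # exp(-1.25), a weak bound
print(bernstein_bound(n, t, b, sigma2))   # exp(-46.875), dramatically smaller
```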
Later in the notes, we will encounter double sums of the form

(5.40)    U_n = (1/(n(n − 1))) Σ_{i≠j} f(X_i, X_j),
where the last line follows from Jensen's inequality. Since Y_τ is a sum of k i.i.d. random variables with zero mean, absolute bound b, and variance σ², the same application of Bernstein's Lemma as used in Theorem 5.12 yields

E[e^{sY_τ}] ≤ exp( (kσ²/b²)( e^{sb/k} − 1 − sb/k ) ).

Therefore, we obtain

P(U_n − µ > t) ≤ exp( −(kσ²/b²)[ (bt/σ²)(sb/k) − ( e^{sb/k} − 1 − sb/k ) ] ),
and the proof is completed by applying Proposition 5.6 and noting that k ≥ n/3 for
all n ≥ 2.
We pause to give an application to Monte Carlo numerical integration, which is a randomized numerical method for approximating integrals of the form

(5.43)    I(u) = ∫_{[0,1]^d} u(x) dx.

The Monte Carlo approximation is

(5.44)    I_n(u) = (1/n) Σ_{i=1}^n u(X_i),

where X1, . . . , Xn are i.i.d. random variables uniformly distributed on [0,1]^d. Theorem 5.16 below shows that, with σ² = Var(u(X_i)),

(5.45)    |I(u) − I_n(u)| ≤ λσ/√n

with probability at least 1 − 2 exp(−λ²/4).
Remark 5.17. Let δ > 0 and choose λ > 0 so that δ = 2 exp(−λ²/4); that is, λ = √(4 log(2/δ)). Then Theorem 5.16 can be rewritten to say that

(5.46)    |I(u) − I_n(u)| ≤ √( 4σ² log(2/δ) / n )

holds with probability at least 1 − δ. This is a more common form to see Monte Carlo error estimates.
The reader should contrast this with the case where Xi form a uniform grid over
[0, 1]d . In this case, for Lipschitz functions the numerical integration error is O(∆x),
where ∆x is the grid spacing. For n points on a uniform grid in [0, 1]d the grid spacing
is ∆x ∼ n−1/d , which suffers from the curse of dimensionality when d is large. The
Monte Carlo error estimate (5.46), on the other hand, is remarkable in that it is
independent of dimension d! Thus, Monte Carlo integration overcomes the curse of
dimensionality by simply replacing a uniform discretization grid by random variables.
Monte Carlo based techniques have been used to solve PDEs in high dimensions via
sampling random walks or Brownian motions.
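The dimension independence is striking in practice. The sketch below (an illustration; the integrand is an arbitrary choice with a known integral) estimates ∫_{[0,1]^d} Π_{i=1}^d sin(πx_i) dx = (2/π)^d for several values of d:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_error(d, n):
    """Absolute Monte Carlo error for integrating prod_i sin(pi x_i) over [0,1]^d."""
    X = rng.random((n, d))
    estimate = np.prod(np.sin(np.pi * X), axis=1).mean()
    exact = (2 / np.pi) ** d
    return abs(estimate - exact)

n = 100000
for d in (1, 5, 20):
    print(d, mc_error(d, n))  # errors of size ~ sigma/sqrt(n) in every dimension
```

Note that the error does not grow with d; a uniform grid with the same number of points would have error ~ n^{−1/d}.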
Proof of Theorem 5.16. Let Yi = u(Xi). We apply Bernstein's inequality with Sn = (1/n) Σ_{i=1}^n Yi = I_n(u), σ² = Var(Yi) and b = 2‖u‖_{L∞([0,1]^d)} to find that

|I(u) − I_n(u)| ≤ t

with probability at least 1 − 2 exp( −nt²/(2(σ² + (1/3)bt)) ) for any t > 0. Set t = λσ/√n for λ > 0 to find that

|I(u) − I_n(u)| ≤ λσ/√n

with probability at least 1 − 2 exp( −λ²σ² / (2(σ² + bλσ/(3√n))) ). Restricting λ ≤ 3σ√n/b completes the proof.
We conclude this section with the Azuma/McDiarmid inequality. This is slightly
more advanced and is not used in the rest of these notes, so the reader may skip ahead
without any loss. It turns out that the Chernoff bounding method used to prove the
Chernoff, Hoeffding, and Bernstein inequalities does not use in any essential way the
linearity of the sum defining Sn . Indeed, what matters is that Sn does not depend
too much on any particular random variable Xi . Using these ideas leads to far more
general (and more useful) concentration inequalities for functions of the form
(5.47) Yn = f (X1 , X2 , . . . , Xn )
that may depend nonlinearly on the Xi . To express that Yn does not depend too
much on any of the Xi , we assume that f satisfies the following bounded differences
condition: There exists b > 0 such that
(5.48)    |f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x̃i, . . . , xn)| ≤ b

for all x1, . . . , xn and x̃i. In this case we have the following concentration inequality.
Theorem 5.18 (Azuma/McDiarmid inequality). Define Yn by (5.47), where X1 , . . . , Xn
are i.i.d. random variables satisfying |Xi | ≤ M almost surely, and assume f satisfies
(5.48). Then for any t > 0
(5.49)    P(Y_n − E[Y_n] ≥ t) ≤ exp( −t²/(2nb²) ).
Proof. The proof uses conditional probability, which we have not developed in these
notes, so we give a sketch of the proof. We define Z1 = E[Yn | X1] − E[Yn], and for 2 ≤ k ≤ n we define

Z_k = E[Y_n | X1, . . . , X_k] − E[Y_n | X1, . . . , X_{k−1}],

so that Σ_{k=1}^n Z_k = Y_n − E[Y_n]. The random variables Zk record how much the conditional expectation changes when we add information about Xk. While the Zk are not independent, they form a martingale difference sequence, which allows us to essentially treat them as independent and use a similar proof to Hoeffding's inequality. The useful martingale difference property is the identity

E[Z_k | X1, . . . , X_{k−1}] = 0

for k ≥ 2, and E[Z1] = 0, which follow from the law of iterated expectations.
We now follow the Chernoff bounding method and law of iterated expectations to obtain

P(Y_n − E[Y_n] ≥ t) = P( e^{s Σ_{k=1}^n Z_k} ≥ e^{st} )
≤ e^{−st} E[ e^{s Σ_{k=1}^n Z_k} ]
= e^{−st} E[ E[ e^{s Σ_{k=1}^n Z_k} | X1, . . . , X_{n−1} ] ]
= e^{−st} E[ e^{s Σ_{k=1}^{n−1} Z_k} E[ e^{sZ_n} | X1, . . . , X_{n−1} ] ].
Define

U_k = sup_{|x|≤M} E[ f(X1, . . . , X_{k−1}, x, X_{k+1}, . . . , X_n) − Y_n | X1, . . . , X_{k−1} ]

and

L_k = inf_{|x|≤M} E[ f(X1, . . . , X_{k−1}, x, X_{k+1}, . . . , X_n) − Y_n | X1, . . . , X_{k−1} ].
In this notation, E_{n,ε}(u) = ‖∇_{n,ε} u‖²_{ℓ²(X_n²)}. The L² norm of the gradient is of course induced by the corresponding inner product

(5.59)    (V, W)_{ℓ²(X_n²)} := (1/(σ_η n²)) Σ_{x,y ∈ X_n} η_ε(|x − y|) V(x, y) W(x, y),

and the H¹(X_n) norm by ‖u‖²_{H¹(X_n)} = (u, u)_{H¹(X_n)}. With these definitions, the graph Laplacian (5.54) is given by L_{n,ε} u = div_{n,ε}(∇_{n,ε} u) for u ∈ ℓ²(X_n), and the integration by parts formula

(L_{n,ε} u, v)_{ℓ²(X_n)} = −(∇_{n,ε} u, ∇_{n,ε} v)_{ℓ²(X_n²)}

holds for all u, v ∈ ℓ²(X_n).
Proof. For simplicity, we give the proof for a box U = [0, 1]^d; the extension to arbitrary domains U with smooth boundaries is straightforward. We assume 0 < ε ≤ 1.

Let Ω be the event that the graph G_{n,ε} is not connected. We partition the box U into M = h^{−d} sub-boxes B1, B2, . . . , B_M of side length h, and let n_i denote the number of points from X_n falling in B_i. If ε ≥ 4√d h, then every point in B_i is connected to all points in neighboring boxes B_j. Thus, if we set h = ε/(4√d) then the graph is connected if all boxes B_i contain at least one point from X_n, and paths between pairs of points are constructed by hopping between neighboring boxes. Hence, if the graph is not connected, then some box B_i must be empty. Letting A_i denote the event that n_i = 0, we have by the union bound

P(Ω) ≤ P( ∪_{i=1}^M A_i ) ≤ Σ_{i=1}^M P(A_i) = Σ_{i=1}^M P(n_i = 0).

Since X1, . . . , Xn are i.i.d. random variables with density ρ ≥ α > 0, we have that

P(n_i = 0) = P(∀j ∈ {1, . . . , n}, X_j ∉ B_i) = Π_{j=1}^n P(X_j ∉ B_i) = P(X1 ∉ B_i)^n ≤ (1 − αh^d)^n ≤ exp(−αnh^d) = exp(−cnε^d).
Therefore

P(Ω) ≤ M exp(−cnε^d) = Cε^{−d} exp(−cnε^d).

If nε^d ≥ 1, then ε^{−d} ≤ n, so

P(Ω) ≤ Cn exp(−cnε^d).

If nε^d ≤ 1, then

Cn exp(−cnε^d) ≥ Cn exp(−c) ≥ 1 ≥ P(Ω)
provided we choose C larger, if necessary. This completes the proof.
Remark 5.21. By Lemma 5.20, there exists C > 0 such that if nε^d ≥ C log(n), then the graph is connected with probability at least 1 − 1/n². Note we can rewrite this condition as

(5.64)    ε ≥ K ( log(n)/n )^{1/d},
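The sharpness of the length scale (5.64) can be observed experimentally. The sketch below (illustrative; the prefactors 0.2 and 3.0 are arbitrary choices below and above the threshold) builds the random geometric graph with a simple union-find and tests connectivity:

```python
import numpy as np

rng = np.random.default_rng(0)

def find(parent, i):
    # Union-find root lookup with path halving
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def is_connected(X, eps):
    """True if the graph connecting points within distance eps is connected."""
    n = len(X)
    parent = list(range(n))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # O(n^2), fine for small n
    ii, jj = np.where((D <= eps) & (D > 0))
    for i, j in zip(ii, jj):
        ri, rj = find(parent, i), find(parent, j)
        parent[ri] = rj
    return len({find(parent, i) for i in range(n)}) == 1

n, d = 1000, 2
X = rng.random((n, d))
scale = (np.log(n) / n) ** (1 / d)
print(is_connected(X, 0.2 * scale), is_connected(X, 3.0 * scale))
```

Below the threshold isolated points appear with high probability, so the graph is disconnected; above it the graph is connected.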
We also write Uε = U \ ∂ε U . Note we have chosen our labeled points to be all points
within ε of the boundary. Other choices of boundary conditions can be used, and
it is an interesting, and somewhat open, problem to determine the fewest number of
labeled points for which discrete to continuum convergence holds.
The continuum version of (5.65) is

(5.67)    div(ρ²∇u) = 0 in U,    u = g on ∂U.
The goal of this section is to use the maximum principle to prove convergence, with
a rate, of the solution of (5.65) to the solution of (5.67) as ε → 0 and n → ∞.
We first prove pointwise consistency of graph Laplacians. The proof utilizes the nonlocal operator

(5.68)    L_ε u(x) = (2/(σ_η ε²)) ∫_U η_ε(|x − y|)( u(y) − u(x) ) ρ(y) dy.
Lemma 5.22 (Discrete to nonlocal). Let u : U → R be Lipschitz continuous and ε > 0 with nε^d ≥ 1. Then for any 0 < λ ≤ ε^{−1},

max_{x ∈ X_n} |L_{n,ε} u(x) − L_ε u(x)| ≤ C Lip(u) λ

with probability at least 1 − C exp( −cnε^{d+2}λ² + 3 log(n) ).

Proof. Fix x ∈ U and set Y_i = η_ε(|x − X_i|)( u(X_i) − u(x) ), so that

L_{n,ε} u(x) = (2/(σ_η ε² n)) Σ_{i=1}^n Y_i.

We compute |Y_i| ≤ C Lip(u) ε^{1−d} and Var(Y_i) ≤ C Lip(u)² ε^{2−d}.
and

|L_{n,ε} u(x_i) − L_{n,ε} u(X_i)| ≤ (Ch/(nε^{d+2})) Σ_{y ∈ X_n ∩ B(x_i, 2ε)} 1 ≤ C Lip(u) h ε^{−2}.
for any 0 < λ ≤ ε−1 , provided u ∈ C 3 (U ). The proof is similar to Lemma 5.22,
but involves Taylor expanding u before applying concentration of measure so that
the application of Bernstein’s inequality does not depend on u. The uniformity over
u ∈ C 3 (U ) is required when using the viscosity solution machinery to prove discrete
to continuum convergence in the non-smooth setting. We refer the reader to [11,
Theorem 5] for more details. Also, when one uses conditional probability instead of
the covering argument in proving Lemma 5.22, the term 3 log(n) can be improved to
log(n). This is inconsequential for the results below.
We now turn to comparing the nonlocal operator L_ε to its continuum counterpart Δ_ρ u := ρΔu + 2∇ρ·∇u = ρ^{−1} div(ρ²∇u).
Lemma 5.24 (Nonlocal to local). There exists C > 0 such that for every u ∈ C³(U) and x ∈ U with dist(x, ∂U) ≥ ε we have

|L_ε u(x) − Δ_ρ u(x)| ≤ Cβε,

where β = ‖u‖_{C³(U)}.

Proof. We Taylor expand

u(x + zε) − u(x) = ∇u(x)·zε + (ε²/2) z^T ∇²u(x) z + O(βε³)

and

ρ(x + zε) = ρ(x) + ∇ρ(x)·zε + O(ε²),

for |z| ≤ 1, to obtain

L_ε u(x) = (2/(σ_η ε²)) ∫_{B(0,1)} η(|z|) ( ∇u(x)·zε + (ε²/2) z^T ∇²u(x) z + O(βε³) )( ρ(x) + ∇ρ(x)·zε + O(ε²) ) dz
= (2/σ_η) ∫_{B(0,1)} η(|z|) ( ρ(x)∇u(x)·z ε^{−1} + (1/2) ρ(x) z^T ∇²u(x) z + (∇u(x)·z)(∇ρ(x)·z) ) dz + O(βε)
= (2/σ_η) ∫_{B(0,1)} η(|z|) ( (1/2) ρ(x) z^T ∇²u(x) z + (∇u(x)·z)(∇ρ(x)·z) ) dz + O(βε)
=: A + B + O(βε),

where the ε^{−1} term integrates to zero since ∫_{B(0,1)} η(|z|) z dz = 0 by symmetry.
We compute

A = (1/σ_η) ρ(x) Σ_{i,j=1}^d u_{x_i x_j}(x) ∫_{B(0,1)} η(|z|) z_i z_j dz
= (1/σ_η) ρ(x) Σ_{i=1}^d u_{x_i x_i}(x) ∫_{B(0,1)} η(|z|) z_i² dz
= ρ(x) Δu(x),

since ∫_{B(0,1)} η(|z|) z_i z_j dz = 0 for i ≠ j and ∫_{B(0,1)} η(|z|) z_i² dz = σ_η. A similar computation gives

B = 2∇ρ(x) · ∇u(x).
Remark 5.25. Combining Lemmas 5.22 and 5.24, for any u ∈ C³(U) and 0 < λ ≤ ε^{−1} we have that

max_{x ∈ X_n \ ∂_ε U} |L_{n,ε} u(x) − Δ_ρ u(x)| ≤ C(λ + ε)

holds with probability at least 1 − C exp(−cnε^{d+2}λ² + 3 log(n)). This is called pointwise consistency for graph Laplacians. Notice that for the bound to be non-vacuous, we require nε^{d+2} ≫ log(n). To get a linear O(ε) pointwise consistency rate, we set λ = ε and require nε^{d+4} ≥ C log(n) for a large constant C > 0. Written a different way, we require

ε ≫ ( log(n)/n )^{1/(d+2)}
for pointwise consistency of graph Laplacians, and

ε ≥ K ( log(n)/n )^{1/(d+4)}

for pointwise consistency with a linear O(ε) rate. We note these are larger length scale restrictions than for connectivity of the graph (compare with (5.64)). In the length scale range

(5.74)    ( log(n)/n )^{1/d} ≪ ε ≪ ( log(n)/n )^{1/(d+2)},

the graph is connected, and Laplacian learning well-posed, but pointwise consistency does not hold and current PDE techniques cannot say anything about discrete to continuum convergence.
Remark 5.26. If u ∈ C⁴(U), then Lemma 5.24 can be sharpened to read

|L_ε u(x) − Δ_ρ u(x)| ≤ C‖u‖_{C⁴(U)} ε².

We leave this as an exercise for the reader. Combining this with Lemma 5.22 we have that

(5.75)    max_{x ∈ X_n \ ∂_ε U} |L_{n,ε} u(x) − Δ_ρ u(x)| ≤ C(λ + ε²)

holds with probability at least 1 − C exp(−cnε^{d+2}λ² + 3 log(n)) for all 0 < λ ≤ ε^{−1} and all u ∈ C⁴(U). To obtain the second order convergence rate, we must choose λ = ε², and so we require the stricter length scale restriction

ε ≥ C ( log(n)/n )^{1/(d+6)}.
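Pointwise consistency can be checked numerically in one dimension. The sketch below (illustrative; it assumes ρ ≡ 1 on [0,1] and the constant kernel η = 1_{[0,1]}, for which σ_η = ∫_{−1}^{1} z² dz = 2/3 and Δ_ρ u = u″) evaluates the graph Laplacian of u(x) = sin(πx) at an interior point:

```python
import numpy as np

rng = np.random.default_rng(0)

n, eps = 2_000_000, 0.1
X = rng.random(n)                      # rho = 1 (uniform density on [0,1])
u = lambda x: np.sin(np.pi * x)
sigma_eta = 2.0 / 3.0                  # int_{-1}^{1} z^2 dz for eta = 1_{[0,1]}

x0 = 0.5
mask = np.abs(X - x0) <= eps           # eta_eps(t) = eps^{-1} 1_{t <= eps}
# L_{n,eps} u(x0) = (2/(sigma_eta eps^2 n)) sum_i eta_eps(|x0 - X_i|)(u(X_i) - u(x0))
Lnu = 2.0 / (sigma_eta * eps**2 * n) * np.sum((u(X[mask]) - u(x0)) / eps)
print(Lnu, -np.pi**2 * np.sin(np.pi * x0))  # both close to -pi^2
```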
We now show how to use the maximum principle and pointwise consistency to
establish discrete to continuum convergence with a rate.
Theorem 5.27 (Discrete to continuum convergence). Let 0 < ε ≤ 1 and n ≥ 1. Let u_{n,ε} ∈ ℓ²(X_n) be a solution of (5.65) satisfying (5.18) with Γ = X_n ∩ ∂_ε U, and let u ∈ C³(U) be the solution of (5.67). Then for any 0 < λ ≤ 1,

(5.76)    max_{x ∈ X_n} |u_{n,ε}(x) − u(x)| ≤ C( ‖u‖_{C³(U)} + 1 )(λ + ε)

holds with probability at least 1 − C exp(−cnε^{d+2}λ² + 3 log(n)).
Since ∆ρ φ(x) ≥ 0 at any point x ∈ U where φ attains its minimum value, and
∆ρ φ(x) = −1 for all x ∈ U , the minimum value of φ is attained on the boundary ∂U ,
and so φ ≥ 0. Combining Lemmas 5.22 and 5.24, and recalling ∆ρ u = 0, we have
that
and
|u_{n,ε}(x) − u(x)| ≤ |u_{n,ε}(x)| + |u(x)| ≤ 2‖g‖_{L∞(U)} ≤ 4C₁‖g‖_{L∞(U)}(λ + ε),
for all x ∈ Xn . For the other direction, we can define w = un,ε − u − Kφ and make a
similar argument to obtain
Notes and references: Pointwise consistency for graph Laplacians was established
in [34, 35] for a random geometric graph on a manifold. Pointwise consistency for k-
nearest neighbor graphs was established without rates in [57], and with convergence
rates in [14]. The maximum principle is a well-established tool for passing to limits
with convergence rates, and has been well-used in numerical analysis. The theory
presented here requires regularity of the solution u of the continuum PDE. In some
problems, especially nonlinear problems, the solution is not sufficiently regular, and
the theory of viscosity solutions has been developed for precisely this purpose. The pa-
pers [11] and [13] use the maximum principle, pointwise consistency and the viscosity
solution machinery (see [12] for more on viscosity solutions) to prove discrete to con-
tinuum convergence for p-Laplace and ∞-Laplace semi-supervised learning problems
with very few labels. The maximum principle is used in [16, 54] to prove convergence
rates for Laplacian regularized semi-supervised learning, and in [59] to prove rates for
Laplacian regularized regression on graphs.
for u ∈ L²(U) and ε, t > 0. We write I_ε(u) = I_{ε,0}(u). We also define the local energy

(5.81)    I(u) = ∫_U |∇u|² ρ² dx

for u ∈ H¹(U).
Lemma 5.28 (Discrete to nonlocal). There exist C, c > 0 such that for any 0 < λ ≤ 1 and Lipschitz continuous u : U → R we have

(5.82)    |E_{n,ε}(u) − I_ε(u)| ≤ C Lip(u)² ( 1/n + λ )

with probability at least 1 − 2 exp(−cnε^d λ²).
Proof. Define

f(x, y) = η_ε(|x − y|) ( (u(x) − u(y))/ε )²,

and note that E_{n,ε}(u) = ((n − 1)/(σ_η n)) U_n, where U_n is given by (5.40). Since u is Lipschitz, |η_ε| ≤ Cε^{−d}, and η_ε(|x − y|) = 0 for |x − y| > ε, we have |f(x, y)| ≤ C Lip(u)² ε^{−d}.
We also have

E[f(X_i, X_j)] = ∫_U ∫_U η_ε(|x − y|) ( (u(x) − u(y))/ε )² ρ(x)ρ(y) dx dy = σ_η I_ε(u),

and

σ² := Var( f(X_i, X_j) )
≤ ∫_U ∫_U η_ε(|x − y|)² ( (u(x) − u(y))/ε )⁴ ρ(x)ρ(y) dx dy
≤ C Lip(u)⁴ ε^{−2d} ∫_U ∫_{B(y,ε)} dx dy
≤ C Lip(u)⁴ ε^{−d}.
Therefore

|E_{n,ε}(u) − I_ε(u)| = (1/σ_η) | (1 − 1/n) U_n − σ_η I_ε(u) |
= (1/σ_η) | (1 − 1/n)( U_n − σ_η I_ε(u) ) − (σ_η/n) I_ε(u) |
≤ (1/σ_η) |U_n − σ_η I_ε(u)| + (1/n) I_ε(u).

Since u is Lipschitz we have I_ε(u) ≤ Lip(u)². Therefore we can apply the result of Bernstein above to obtain that

|E_{n,ε}(u) − I_ε(u)| ≤ C Lip(u)² ( λ + 1/n )

holds with probability at least 1 − 2 exp(−cnε^d λ²), which completes the proof.
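The estimate (5.82), combined with the nonlocal-to-local comparison below, says the graph Dirichlet energy E_{n,ε}(u) approximates I(u) up to O(ε + λ + 1/n) errors. A one-dimensional numerical sketch (assuming ρ ≡ 1 on [0,1] and η = 1_{[0,1]}, so that I(u) = ∫₀¹ u′(x)² dx):

```python
import numpy as np

rng = np.random.default_rng(0)

n, eps = 1500, 0.05
X = rng.random(n)
u = np.sin(np.pi * X)
sigma_eta = 2.0 / 3.0                         # int_{-1}^{1} z^2 dz for eta = 1_{[0,1]}

D = np.abs(X[:, None] - X[None, :])
W = (D <= eps) / eps                          # eta_eps(|x - y|); diagonal terms are harmless
G = (u[:, None] - u[None, :])**2 / eps**2     # squared graph gradient entries
E = (W * G).sum() / (sigma_eta * n**2)        # graph Dirichlet energy E_{n,eps}(u)
I = np.pi**2 / 2                              # I(u) for u = sin(pi x), rho = 1
print(E, I)  # close, up to O(eps) boundary effects and sampling error
```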
Lemma 5.29 (Nonlocal to local). There exists C > 0 such that for all u ∈ C²(U) and 0 < ε ≤ 1,

|I_ε(u) − I(u)| ≤ C( Lip(u)² + β Lip(u) ) ε,

where β = ‖u‖_{C²(U)}.

Proof. We Taylor expand u(y) = u(x) + ∇u(x)·(y − x) + O(β|x − y|²) and ρ(y)² = ρ(x)²(1 + O(ε)) for |x − y| ≤ ε to obtain
I_ε(u) = (1/(σ_η ε²)) ∫_U ∫_{B(x,ε)∩U} η_ε(|x − y|) ( |∇u(x)·(y − x)|² + O(β Lip(u) ε³) ) ρ(x)² (1 + O(ε)) dy dx
= (1/(σ_η ε²)) ∫_U ∫_{B(x,ε)∩U} η_ε(|x − y|) ( |∇u(x)·(y − x)|² + O(β Lip(u) ε³) ) ρ(x)² dy dx
= (1/(σ_η ε²)) ∫_U ∫_{B(x,ε)∩U} η_ε(|x − y|) |∇u(x)·(y − x)|² ρ(x)² dy dx + O(β Lip(u) ε).
Fix x and choose a rotation matrix A with A∇u(x) = |∇u(x)| e₁. Therefore

I_ε(u) = (1/(σ_η ε²)) ∫_U |∇u(x)|² ρ(x)² ( ∫_{B(x,ε)∩V} η_ε(|x − z|) |z₁ − x₁|² dz ) dx + O(β Lip(u) ε),

where V denotes the rotation of U about x, and we used that B(x, ε) ∩ V = B(x, ε) when dist(x, ∂U) ≥ ε, together with the rotational invariance of η_ε under the change of variables z = A y. For all x ∈ U, a similar computation yields

∫_{B(x,ε)∩V} η_ε(|x − z|) |z₁ − x₁|² dz ≤ σ_η ε².
Therefore

|I_ε(u) − I(u)| ≤ 2 ∫_{∂_ε U} |∇u(x)|² ρ(x)² dx + Cβ Lip(u) ε
≤ 2 Lip(u)² ∫_{∂_ε U} ρ(x)² dx + Cβ Lip(u) ε
≤ C Lip(u)² ε + Cβ Lip(u) ε,
Proof. There exists a universal constant K > 0 such that for each h > 0 there is a partition B1, B2, . . . , B_M of U for which M ≤ Ch^{−d} and B_i ⊂ B(x_i, Kh) for some x_i ∈ B_i for all i. Let δ > 0 and set h = δ/(2K) so that B_i ⊂ B(x_i, δ/2). Let ρ_δ be the histogram density estimator

ρ_δ(x) = (1/n) Σ_{i=1}^M (n_i/|B_i|) 1_{B_i}(x),

and set

p_i = ∫_{B_i} ρ(x) dx ≤ C|B_i|.
Union bounding over i = 1, . . . , M, for any 0 < λ ≤ 1, with probability at least 1 − Cn exp(−cnδ^d λ²), we have

(5.87)    | n_i/n − ∫_{B_i} ρ(x) dx | ≤ Cλ|B_i|.
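The bound (5.87) is a statement about the histogram density estimator's relative accuracy on each box, and is easy to check with a uniform density (the parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n, M = 100000, 20
delta = 1.0 / M                       # box side length, so |B_i| = delta in 1-d
X = rng.random(n)                     # rho = 1, hence int_{B_i} rho dx = delta
counts = np.histogram(X, bins=M, range=(0.0, 1.0))[0]
# |n_i/n - int_{B_i} rho| / |B_i| plays the role of lambda in (5.87)
deviation = np.max(np.abs(counts / n - delta)) / delta
print(deviation)  # small, of order sqrt(1/(n delta))
```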
(5.88)    E_δ u(x) = Σ_{i=1}^n u(X_i) 1_{U_i}(x)

for every u ∈ ℓ²(X_n). Hence, the extension operator allows us to relate empirical discrete sums with their continuum integral counterparts uniformly over L²(U) and ℓ²(X_n), without using concentration of measure as we did for pointwise consistency of graph Laplacians.
It will often be more convenient in the analysis to use the associated transportation map T_δ : U → X_n, which is defined by T_δ(x) = X_i if and only if x ∈ U_i. Then the extension map E_δ : ℓ²(X_n) → L²(U) can be written as E_δ u = u ∘ T_δ. In this notation, (5.89) becomes

(5.90)    (1/n) Σ_{i=1}^n u(X_i) = ∫_U u(T_δ(x)) ρ_δ(x) dx.

Since T_δ(U_i) = X_i and U_i ⊂ B(X_i, δ), we have |T_δ(x) − x| ≤ δ. This implies that

|T_δ(x) − T_δ(y)| ≤ |x − y| + 2δ

for all x, y ∈ U.
To see why T_δ is called a transportation map, note that U_i = T_δ^{−1}({X_i}), and so property (5.85) becomes

∫_{T_δ^{−1}({X_i})} ρ_δ(x) dx = 1/n.

We define the empirical distribution µ_n := (1/n) Σ_{i=1}^n δ_{X_i}. Then for any A ⊂ U

µ_n(A) = (1/n) Σ_{i=1}^n 1_{X_i ∈ A},

and so

µ_n(A) = Σ_{i=1}^n 1_{X_i ∈ A} ∫_{T_δ^{−1}({X_i})} ρ_δ(x) dx = ∫_{T_δ^{−1}(A)} ρ_δ dx.
By definition, this means that Tδ pushes forward the probability density ρδ to the
empirical distribution µn , written Tδ # ρδ = µn . In other words, the map Tδ is a
transportation map, transporting the distribution ρδ to µn .
We can also extend this to double integrals. In the transportation map notation we have

(5.92)    (1/n²) Σ_{i,j=1}^n Φ(X_i, X_j) = ∫_U ∫_U Φ(T_δ(x), T_δ(y)) ρ_δ(x) ρ_δ(y) dx dy.
We could have taken the above as a definition of ψ_{ε,δ}. We also note that σ_η = σ_{η,0} and

(5.95)    σ_η − ( 2α(n) Lip(η)/(d + 2) ) t ≤ σ_{η,t} ≤ σ_η.

Recall the smoothing operator Λ_{ε,δ} u(x) = θ_{ε,δ}(x)^{−1} ∫_U ψ_{ε,δ}(|x − y|) u(y) dy, where

(5.98)    θ_{ε,δ}(x) = ∫_U ψ_{ε,δ}(|x − y|) dy.
Proposition 5.33. For all ε, δ > 0 with ε > 2δ and x ∈ U with dist(x, ∂U ) ≥ ε − 2δ
we have θε,δ (x) = 1.
Therefore

‖θ_{ε,δ}( Λ_{ε,δ} u − u )‖²_{L²(U)} ≤ (1/c) ∫_U ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² dx dy ≤ (σ_η/c) I_{ε,δ}(u) ε².

Applying Proposition 5.33 completes the proof.
We now turn to comparing the local energy I(u) to the nonlocal energy Iε,δ (u).
Lemma 5.35. There exists C > 0 such that for every u ∈ L²(U) and ε, δ > 0 with ε ≥ 4δ we have

∫_U |∇Λ_{ε,δ} u|² ρ² θ_{ε,δ}² dx ≤ ( σ_η(1 + Cε)/σ_{η,δ/ε} )( I_{ε,δ}(u) + I_{ε,δ}(u; ∂_{ε−2δ} U) ) + (C/ε²) ‖Λ_{ε,δ} u − u‖²_{L²(∂_{ε−2δ} U)}.
Proof. Write w = Λ_{ε,δ} u = v/θ_{ε,δ}, where

v(x) = ∫_U ψ_{ε,δ}(|x − y|) u(y) dy.

Then

∇w(x) = (1/θ_{ε,δ}(x)) ∇v(x) − ( w(x)/θ_{ε,δ}(x) ) ∇θ_{ε,δ}(x).

We now compute

∇v(x) = (1/ε^d) ∫_U ψ′_{1,δ/ε}( |x − y|/ε ) ( (x − y)/(ε|x − y|) ) u(y) dy
= (1/(σ_{η,δ/ε} ε^{d+2})) ∫_U η( (|x − y| + 2δ)/ε ) (y − x) u(y) dy
= (1/(σ_{η,δ/ε} ε²)) ∫_U η_ε(|x − y| + 2δ) (y − x) u(y) dy.

Therefore

(5.101)    ∇w(x) = (1/(σ_{η,δ/ε} θ_{ε,δ}(x) ε²)) [ ∫_U η_ε(|x − y| + 2δ)(y − x)( u(y) − u(x) ) dy + ( u(x) − w(x) ) ∫_U η_ε(|x − y| + 2δ)(y − x) dy ].
Similarly, for any unit vector ξ we have

| (1/(σ_{η,δ/ε} θ_{ε,δ}(x) ε²)) ∫_U η_ε(|x − y| + 2δ)(y − x)·ξ dy | ≤ (C/ε) 1_{x ∈ ∂_{ε−2δ} U},

since

∫_U η_ε(|x − y| + 2δ)(y − x)·ξ dy = 0

if dist(x, ∂U) ≥ ε − 2δ. Therefore

|∇w(x)| ≤ (1/( √(σ_{η,δ/ε} θ_{ε,δ}(x)) ε )) ( ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² dy )^{1/2} + (C/ε) |w(x) − u(x)| 1_{∂_{ε−2δ} U}(x).

Squaring and integrating against ρ² θ_{ε,δ}² over U we have

∫_U |∇w|² ρ² θ_{ε,δ}² dx ≤ (1/(σ_{η,δ/ε} ε²)) ∫_U ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² ρ(x)² dx dy
+ (1/(σ_{η,δ/ε} ε²)) ∫_{∂_{ε−2δ} U} ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² ρ(x)² dy dx
+ (C/ε²) ∫_{∂_{ε−2δ} U} ( w(x) − u(x) )² dx,
where we applied Cauchy’s inequality 2ab ≤ a2 + b2 to the cross terms. The proof is
completed by noting that ρ is Lipschitz continuous and ρ ≥ α > 0, and so ρ(x) ≤
ρ(y)(1 + Cε) for y ∈ B(x, ε).
We can sharpen Lemma 5.35 if we have some information about the regularity of
u near the boundary.
Theorem 5.36 (Nonlocal to local). Suppose that u ∈ L²(U) satisfies

(5.103)    |u(x) − u(y)| ≤ Cε

for all x ∈ ∂_{ε−2δ} U and y ∈ B(x, ε − 2δ) ∩ U. Then for ε > 0 and δ ≤ cε, with 0 < c ≤ 1/4 depending only on η and n, we have

(5.104)    I(Λ_{ε,δ} u) ≤ ( 1 + C(ε + δ/ε) ) I_{ε,δ}(u) + C( ε + δ/ε ).

Proof. For any x ∈ ∂_{ε−2δ} U we have by (5.103) that

|(Λ_{ε,δ} u)(x) − u(x)| ≤ (1/θ_{ε,δ}(x)) ∫_{B(x,ε−2δ)∩U} ψ_{ε,δ}(|x − y|) |u(y) − u(x)| dy
≤ (Cε/θ_{ε,δ}(x)) ∫_{B(x,ε−2δ)∩U} ψ_{ε,δ}(|x − y|) dy = Cε.

Since 2δ < ε we have

(5.105)    ‖Λ_{ε,δ} u − u‖²_{L²(∂_{ε−2δ} U)} ≤ Cε² |∂_ε U| ≤ Cε³.
We again use (5.103) to obtain

I_{ε,δ}(u; ∂_{ε−2δ} U) = (1/(σ_η ε²)) ∫_{∂_{ε−2δ} U} ∫_{B(x,ε−2δ)∩U} η_ε(|x − y| + 2δ)( u(x) − u(y) )² ρ(x)ρ(y) dx dy
≤ C ∫_{∂_{ε−2δ} U} ∫_{B(x,ε−2δ)∩U} η_ε(|x − y| + 2δ) dx dy ≤ C|∂_ε U| ≤ Cε.

By Proposition 5.33 we have θ_{ε,δ} = 1 on U_{ε−2δ}. Combining these facts with (5.105) and Lemma 5.35 we have

∫_{U_{ε−2δ}} |∇Λ_{ε,δ} u|² ρ² dx ≤ ( σ_η(1 + Cε)/σ_{η,δ/ε} )( I_{ε,δ}(u) + Cε ) + Cε.

Invoking (5.95), for δ ≤ cε, with 0 < c ≤ 1/4 depending only on η and n, we have

∫_{U_{ε−2δ}} |∇Λ_{ε,δ} u|² ρ² dx ≤ ( 1 + C(ε + δ/ε) )( I_{ε,δ}(u) + Cε ) + Cε.
We also need estimates in the opposite direction, bounding the nonlocal energy
Iε (u) by the local energy I(u). While a similar result was already established in
Lemma 5.29 under the assumption u ∈ C 2 (U ), we need a result that holds for Lips-
chitz u to prove discrete to continuum convergence.
Lemma 5.37 (Local to nonlocal). There exists C > 0 such that for every Lipschitz continuous u : U → R and ε > 0 we have

(5.106)    I_ε(u) ≤ (1 + Cε) I(u) + C Lip(u)² ε.

Proof. We write

I_ε(u) = I_ε(u; U_ε) + I_ε(u; ∂_ε U).

Note that

( u(x) − u(y) )² = ( ∫_0^1 (d/dt) u(x + t(y − x)) dt )² ≤ ∫_0^1 |∇u(x + t(y − x))·(y − x)|² dt,
where we used Jensen’s inequality in the final step. Since ρ is Lipschitz and bounded
below, we have ρ(y) ≤ ρ(x)(1 + Cε). Plugging these into the definition of Iε (u; Uε )
we have
I_ε(u; U_ε) ≤ ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{U_ε} ∫_{B(x,ε)} η_ε(|x − y|) |∇u(x + t(y − x))·(y − x)|² dy ρ(x)² dx dt.

Now make the change of variables z = y − x in the inner integral to find that

I_ε(u; U_ε) ≤ ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{U_ε} ∫_{B(0,ε)} η_ε(|z|) |∇u(x + tz)·z|² dz ρ(x)² dx dt
= ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{B(0,ε)} η_ε(|z|) ∫_{U_ε} |∇u(x + tz)·z|² ρ(x)² dx dz dt
≤ ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{B(0,ε)} η_ε(|z|) ∫_U |∇u(x)·z|² ρ(x)² dx dz dt
= ( (1 + Cε)/(σ_η ε²) ) ∫_U ∫_{B(0,ε)} η_ε(|z|) |∇u(x)·z|² dz ρ(x)² dx
= (1 + Cε) ∫_U |∇u|² ρ(x)² dx = (1 + Cε) I(u).
Notes and references: The material in this section roughly follows [58], though
we have made some modifications to simplify the presentation. In particular, our
smoothing operator Λε,δ is different than the one in [58], which essentially uses Λε,0
everywhere. Some of the core arguments in [58] appeared earlier in [10], which consid-
ered the non-random setting with constant kernel η. The spectral convergence rates
from [10, 58] were recently improved in [14] by incorporating pointwise consistency of
graph Laplacians.
and

(5.108)    A = { u ∈ H¹(U) : u = g on ∂U },

where we recall E_{n,ε} is defined in (5.53) and I(u) = ∫_U |∇u|² ρ² dx.

To prove discrete to continuum convergence with rates, we require stability for the limiting problem.

Proposition 5.38 (Stability). Let u ∈ C²(U) such that div(ρ²∇u) = 0 in U. There exists C > 0, depending only on U, such that for all w ∈ H¹(U)

(5.111)    ‖u − w‖²_{H¹(U)} ≤ C ( I(w) − I(u) + ‖u‖_{C¹(U)} ‖u − w‖_{L²(∂U)} + ‖u − w‖²_{L²(∂U)} ).
Proof. Write R = I(w) − I(u) and compute

∫_U ρ² |∇(u − w)|² dx = I(u) − 2 ∫_U ρ² ∇w·∇u dx + I(w)
= 2I(u) − 2 ∫_U ρ² ∇w·∇u dx + R
= 2 ∫_U ρ² ∇u·∇(u − w) dx + R
= 2 ∫_{∂U} (∂u/∂ν)(u − w) ρ² dS + R
≤ Cα^{−2} ‖u‖_{C¹(U)} ‖u − w‖_{L²(∂U)} + R,

where we used integration by parts, together with div(ρ²∇u) = 0, in the second to last line. We now use the Poincaré inequality proved in Exercise 4.18 to obtain

∫_U (u − w)² dx ≤ C ( ∫_U |∇u − ∇w|² dx + ∫_{∂U} (u − w)² dS ).
Proof. The proof is a discrete analog of Proposition 5.38. Write R = E_{n,ε}(w) − E_{n,ε}(u) and compute

‖∇_{n,ε}u − ∇_{n,ε}w‖²_{ℓ²(X_n²)} = ‖∇_{n,ε}u‖²_{ℓ²(X_n²)} − 2(∇_{n,ε}u, ∇_{n,ε}w)_{ℓ²(X_n²)} + ‖∇_{n,ε}w‖²_{ℓ²(X_n²)}
= 2( ‖∇_{n,ε}u‖²_{ℓ²(X_n²)} − (∇_{n,ε}u, ∇_{n,ε}w)_{ℓ²(X_n²)} ) + R
= 2(∇_{n,ε}u, ∇_{n,ε}u − ∇_{n,ε}w)_{ℓ²(X_n²)} + R
= −2(L_{n,ε}u, u − w)_{ℓ²(X_n)} + R = R,

since L_{n,ε}u = 0.
and

u_ε(x) = u(x) + φ( d(x)/(2ε) )( g(x) − u(x) ).

Since d and φ are Lipschitz continuous, u_ε is Lipschitz. Since u and g are Lipschitz and u = g on ∂U, we also have |g(x) − u(x)| ≤ Cε for x ∈ ∂_{4ε} U. Recalling |∇d| = 1 a.e., we have

|∇u_ε(x)| = | ∇u(x) + φ′( d(x)/(2ε) ) (∇d(x)/(2ε)) ( g(x) − u(x) ) + φ( d(x)/(2ε) )( ∇g(x) − ∇u(x) ) |
≤ |∇u(x)| + (1/(2ε)) | φ′( d(x)/(2ε) ) | |g(x) − u(x)| + |∇g(x) − ∇u(x)|
≤ 2|∇u(x)| + |∇g(x)| + (1/(2ε)) ‖g − u‖_{L∞(∂_{4ε} U)}
≤ 2|∇u(x)| + |∇g(x)| + C/2.
Therefore |∇u_ε| is bounded. Since u_ε = u on U_{4ε} we have

(5.114)    I(u_ε) − I(u) = ∫_U |∇u_ε|² ρ² dx − ∫_U |∇u|² ρ² dx ≤ ∫_{∂_{4ε} U} |∇u_ε|² ρ² dx ≤ Cε.
Now, since |∇u_ε| ≤ C a.e., u_ε is Lipschitz continuous with Lip(u_ε) bounded independently of ε. By Lemma 5.28 we have that

|E_{n,ε}(u_ε) − I_ε(u_ε)| ≤ C ( 1/n + λ )

with probability at least 1 − 2 exp(−cnε^d λ²). By Lemma 5.37 and (5.114) we have
|E_δ u_{n,ε}(x) − g(x)| = |u_{n,ε}(T_δ(x)) − g(x)| = |g(T_δ(x)) − g(x)| ≤ Cδ,

‖Λ_{ε,δ} E_δ u_{n,ε} − E_δ u_{n,ε}‖²_{L²(U_ε)} ≤ C I_{ε,δ}(E_δ u_{n,ε}) ε² ≤ Cε²,

and

‖u − E_δ u_{n,ε}‖²_{L²(U_ε)} ≤ 2‖u − Λ_{ε,δ} E_δ u_{n,ε}‖²_{L²(U_ε)} + 2‖Λ_{ε,δ} E_δ u_{n,ε} − E_δ u_{n,ε}‖²_{L²(U_ε)}
≤ C( ε + δ/ε + λ ) + Cε².

Next, note that

‖u − u_{n,ε}‖²_{ℓ²(X_n)} = (1/n) Σ_{x ∈ X_n} ( u(x) − u_{n,ε}(x) )².
Therefore
5. Define

w_ε(x) = Λ_{ε,δ} E_δ u_{n,ε}(x) + φ( 4d(x)/ε )( g(x) − Λ_{ε,δ} E_δ u_{n,ε}(x) ).

As in the proof of Theorem 5.36, and part 3, we have that |∇Λ_{ε,δ} E_δ u_{n,ε}(x)| ≤ C and

≤ C|∂_{ε/2} U| ≤ Cε.
Mathematical preliminaries
A.1 Inequalities
For x ∈ R^d the norm of x is

|x| := √( x₁² + x₂² + · · · + x_d² ),

and the dot product of x, y ∈ R^d is

x·y = Σ_{i=1}^d x_i y_i.

Notice that

|x|² = x·x.
Simple inequalities, when used in a clever manner, are very powerful tools in the
study of partial differential equations. We give a brief overview of some commonly
used inequalities here.
The Cauchy-Schwarz inequality states that

|x·y| ≤ |x||y|.

To prove it, one expands the nonnegative quadratic polynomial

h(t) := |x + ty|²

in t and observes that its discriminant must be nonpositive. For x, y ∈ R^d

|x + y|² = (x + y)·(x + y) = x·x + x·y + y·x + y·y.
Therefore

|x + y|² = |x|² + 2x·y + |y|².

Combining this with the Cauchy-Schwarz inequality yields the triangle inequality

|x + y| ≤ |x| + |y|.

Applying the triangle inequality to x = (x − y) + y gives

|x| = |x − y + y| ≤ |x − y| + |y|,

and hence

|x − y| ≥ |x| − |y|.

For a, b ∈ R we have

0 ≤ (a − b)² = a² − 2ab + b².

Therefore

(A.1)    ab ≤ (1/2)a² + (1/2)b².

More generally, for p, q > 1 with 1/p + 1/q = 1, Young's inequality states that

(A.2)    ab ≤ a^p/p + b^q/q  for all a, b > 0.

The proof follows from convexity of the exponential:

ab = e^{log(a)+log(b)} = e^{(1/p) log(a^p) + (1/q) log(b^q)} ≤ (1/p) e^{log(a^p)} + (1/q) e^{log(b^q)} = a^p/p + b^q/q.
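A quick numerical sanity check of Young's inequality (A.2) over random pairs (a, b):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 3.0
q = p / (p - 1)                        # conjugate exponent: 1/p + 1/q = 1
a = rng.uniform(0.01, 10.0, 10000)
b = rng.uniform(0.01, 10.0, 10000)
gap = a**p / p + b**q / q - a * b      # nonnegative by Young's inequality
print(gap.min())                        # >= 0, with equality only when a^p = b^q
```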
A.2 Topology
We will have to make use of basic point-set topology. We define the open ball of radius r > 0 centered at x₀ ∈ R^d by

B⁰(x₀, r) := { x ∈ R^d : |x − x₀| < r }.

The closed ball is defined as

B(x₀, r) := { x ∈ R^d : |x − x₀| ≤ r }.
Definition A.1. A set U ⊂ Rd is called open if for each x ∈ U there exists r > 0
such that B(x, r) ⊂ U .
Exercise A.2. Let U, V ⊂ Rd be open. Show that
W := U ∪ V := {x ∈ Rd : x ∈ U or x ∈ V }
is open. 4
Definition A.3. We say that a sequence {x_k}_{k=1}^∞ in R^d converges to x ∈ R^d, written x_k → x, if

lim_{k→∞} |x_k − x| = 0.
We defined the boundary only for open sets, but it can be defined for any set.

Definition A.8. The interior of a set U ⊂ R^d, denoted int(U), is the union of all open subsets of U, and the boundary of U is defined as

∂U := Ū \ int(U),

where Ū denotes the closure of U.
4
Definition A.12. We say a set U ⊂ Rd is bounded if there exists M > 0 such that
|x| ≤ M for all x ∈ U .
Definition A.13. We say a set U ⊂ Rd is compact if U is closed and bounded.
Definition A.14. For open sets V ⊂ U ⊂ Rd we say that V is compactly contained
in U if V is compact and V ⊂ U . If V is compactly contained in U we write V ⊂⊂ U .
A.3 Differentiation
A.3.1 Partial derivatives
The partial derivative of a function u = u(x₁, x₂, . . . , x_d) in the x_i variable is defined as

(∂u/∂x_i)(x) := lim_{h→0} ( u(x + h e_i) − u(x) ) / h,

provided the limit exists. Here e₁, e₂, . . . , e_d are the standard basis vectors in R^d, so e_i = (0, . . . , 0, 1, 0, . . . , 0) ∈ R^d has a one in the i-th entry. For simplicity of notation we will write

u_{x_i} = ∂u/∂x_i.
The gradient of a function u : R^d → R is the vector of partial derivatives

∇u := ( u_{x₁}, u_{x₂}, . . . , u_{x_d} ).

Higher derivatives are defined iteratively. The second derivatives of u are defined as

∂²u/(∂x_i ∂x_j) := (∂/∂x_i)( ∂u/∂x_j ).

This means that

( ∂²u/(∂x_i ∂x_j) )(x) = lim_{h→0} (1/h) ( u_{x_j}(x + h e_i) − u_{x_j}(x) ),

provided the limit exists. As before, we write

u_{x_i x_j} = ∂²u/(∂x_i ∂x_j)
for notational simplicity. When u_{x_i x_j} and u_{x_j x_i} exist and are continuous we have

u_{x_i x_j} = u_{x_j x_i},

that is, the second derivatives are the same regardless of which order we take them in.
We will generally always assume our functions are smooth (infinitely differentiable),
so equality of mixed partials is always assumed to hold.
The Hessian of u : R^d → R is the matrix of all second partial derivatives

∇²u(x) := ( u_{x_i x_j} )_{i,j=1}^d =
[ u_{x₁x₁}   u_{x₁x₂}   · · ·   u_{x₁x_d} ]
[ u_{x₂x₁}   u_{x₂x₂}   · · ·   u_{x₂x_d} ]
[    ⋮           ⋮        ⋱        ⋮     ]
[ u_{x_d x₁}  u_{x_d x₂}  · · ·  u_{x_d x_d} ].
Since we have equality of mixed partials, the Hessian is a symmetric matrix, i.e.,
(∇2 u)T = ∇2 u. Since we treat the gradient ∇u as a column vector, the product
∇2 u(x)∇u(x) denotes the Hessian matrix multiplied by the gradient vector. That is,
[∇²u(x)∇u(x)]_j = Σ_{i=1}^d u_{x_i x_j} u_{x_i}.

Chain rule: For u : R^d → R and a differentiable curve v : R → R^d we have

(A.3)    (d/dt) u(v(t)) = ∇u(v(t))·v′(t) = Σ_{i=1}^d u_{x_i}(v(t)) v_i′(t).

Similarly, for a mapping F : R^d → R^d we have

(∂/∂x_j) u(F(x)) = ∇u(F(x))·F_{x_j}(x) = Σ_{i=1}^d u_{x_i}(F(x)) F^i_{x_j}(x),

where F_{x_j} = (F¹_{x_j}, F²_{x_j}, . . . , F^d_{x_j}). This is a special case of (A.3) with t = x_j.
Product rule: Given two functions u, v : R^d → R, we have

∇(uv) = v∇u + u∇v.

This is entry-wise the usual product rule for single variable calculus. Given a vector field F : R^d → R^d and a function u : R^d → R we have

(∂/∂x_i)( uF^i ) = u_{x_i} F^i + u F^i_{x_i}.

Therefore

div(uF) = ∇u·F + u divF.
Exercise A.15. Let |x| = √( x₁² + · · · + x_d² ).
Since φ is a function of one variable t, we can use the one dimensional Taylor series to obtain

φ(t) = φ(0) + φ′(0)t + O(|t|²).

The constant in the O(|t|²) term depends on the maximum of |φ″(t)|. All that remains is to compute the derivatives of φ. By the chain rule

(A.9)    φ′(t) = Σ_{i=1}^d u_{x_i}(x + (y − x)t)(y_i − x_i),
and

φ″(t) = (d/dt) Σ_{i=1}^d u_{x_i}(x + (y − x)t)(y_i − x_i)
= Σ_{i=1}^d (d/dt) u_{x_i}(x + (y − x)t)(y_i − x_i)
(A.10)    = Σ_{i=1}^d Σ_{j=1}^d u_{x_i x_j}(x + (y − x)t)(y_i − x_i)(y_j − x_j).
In particular,

φ′(0) = Σ_{i=1}^d u_{x_i}(x)(y_i − x_i) = ∇u(x)·(y − x),
where R₂(x, y) satisfies |R₂(x, y)| ≤ (1/2) max_t |φ″(t)|. Let C > 0 denote the maximum value of |u_{x_i x_j}(z)| over all z, i and j. Then by (A.10)

|φ″(t)| ≤ C Σ_{i=1}^d Σ_{j=1}^d |y_i − x_i||y_j − x_j| ≤ Cd² |x − y|².

It follows that |R₂(x, y)| ≤ (C/2) d² |x − y|², hence R₂(x, y) ∈ O(|x − y|²) and we arrive at the first order Taylor series

u(y) = u(x) + ∇u(x)·(y − x) + O(|x − y|²).

This says that u can be locally approximated near x to order O(|x − y|²) by the affine function

L(y) = u(x) + ∇u(x)·(y − x).
We can continue this way to obtain the second order Taylor expansion. We assume
now that u is three times continuously differentiable. Using the one dimensional
second order Taylor expansion we have
1
(A.12) φ(t) = φ(0) + φ′ (0)t + φ′′ (0)t2 + O(|t|3 ).
2
The constant in the O(|t|3 ) term depends on the maximum of |φ′′′ (t)|. Notice also
that
$\displaystyle \varphi''(0) = \sum_{i=1}^{d}\sum_{j=1}^{d} u_{x_i x_j}(x)(y_i - x_i)(y_j - x_j) = (y - x)\cdot\nabla^2 u(x)(y - x),$
where ∇2 u(x) is the Hessian matrix. Plugging this into (A.12) with t = 1 yields
$\displaystyle u(y) = u(x) + \nabla u(x)\cdot(y - x) + \frac{1}{2}(y - x)\cdot\nabla^2 u(x)(y - x) + R_3(x,y),$
where $R_3(x,y)$ satisfies $|R_3(x,y)| \leq \frac{1}{6}\max_t |\varphi'''(t)|$. We compute
$\displaystyle \varphi'''(t) = \frac{d}{dt}\sum_{i=1}^{d}\sum_{j=1}^{d} u_{x_i x_j}(x + (y-x)t)(y_i - x_i)(y_j - x_j) = \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} u_{x_i x_j x_k}(x + (y-x)t)(y_i - x_i)(y_j - x_j)(y_k - x_k).$
A.5. FUNCTION SPACES 153
Let $C > 0$ denote the maximum value of $|u_{x_i x_j x_k}(z)|$ over all $z$, $i$, $j$, and $k$. Then we have
$\displaystyle |\varphi'''(t)| \leq C\sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d}|y_i - x_i||y_j - x_j||y_k - x_k| \leq Cn^3|x - y|^3.$
Therefore $|R_3(x,y)| \leq \frac{C}{6}n^3|x - y|^3$ and so $R_3 \in O(|x - y|^3)$. Finally we arrive at the second order Taylor expansion
(A.13) $\displaystyle u(y) = u(x) + \nabla u(x)\cdot(y - x) + \frac{1}{2}(y - x)\cdot\nabla^2 u(x)(y - x) + O(|x - y|^3).$
This says that u can be locally approximated near x to order $O(|x - y|^3)$ by the quadratic function
$\displaystyle L(y) = u(x) + \nabla u(x)\cdot(y - x) + \frac{1}{2}(y - x)\cdot\nabla^2 u(x)(y - x).$
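The cubic decay of the remainder in (A.13) can be observed the same way: halving $|y - x|$ should divide the error by about eight. Again, u is an arbitrary smooth test function.

```python
import numpy as np

# Check that the second order Taylor remainder in (A.13) decays like
# O(|x-y|^3): halving |y - x| divides the error by about 8.

def u(x):
    return np.sin(x[0]) * np.exp(x[1])

def grad_u(x):
    return np.array([np.cos(x[0]) * np.exp(x[1]),
                     np.sin(x[0]) * np.exp(x[1])])

def hess_u(x):
    s, c, e = np.sin(x[0]), np.cos(x[0]), np.exp(x[1])
    return np.array([[-s * e, c * e],
                     [ c * e, s * e]])

x = np.array([0.5, -0.2])
d = np.array([1.0, 2.0])
errors = []
for h in [0.1, 0.05, 0.025]:
    y = x + h * d
    taylor2 = u(x) + grad_u(x) @ (y - x) + 0.5 * (y - x) @ hess_u(x) @ (y - x)
    errors.append(abs(u(y) - taylor2))
ratios = [errors[i] / errors[i + 1] for i in range(2)]
# Cubic decay: the error drops by a factor of about 8 per halving.
assert all(7.0 < r < 9.0 for r in ratios)
```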
The terminology k-times continuously differentiable means that all k th -order partial
derivatives of u exist and are continuous on U . We write C 0 (U ) = C(U ) for the space
of functions that are continuous on U .
Exercise A.21. Show that the function $u(x) = x^2$ for $x > 0$ and $u(x) = -x^2$ for $x \leq 0$ belongs to $C^1(\mathbb{R})$ but not to $C^2(\mathbb{R})$.
We also define
$\displaystyle C^\infty(U) := \bigcap_{k=1}^{\infty} C^k(U).$
Notice that
$\|u\|_{L^2(U)}^2 = (u, u)_{L^2(U)}.$
We also define
$L^2(U) := \big\{\text{functions } u : U \to \mathbb{R} \text{ for which } \|u\|_{L^2(U)} < \infty\big\}.$
A.6 Analysis
A.6.1 The Riemann integral
Many students are accustomed to using different notation for integration in different
dimensions. For example, integration along the real line in R is usually written
$\displaystyle \int_a^b u(x)\,dx,$
A.6. ANALYSIS 155
In these notes we write $\int_U u(x)\,dx$ in any dimension, where $u(x) = u(x_1, x_2, \dots, x_n)$ and $dx = dx_1 dx_2 \cdots dx_n$. This notation has the advantage of being far more compact without losing the meaning.
Let us recall the interpretation of the integral $\int_U u\,dx$ in the Riemann sense. We
partition the domain into M rectangles and approximate the integral by a Riemann
sum
$\displaystyle \int_U u\,dx \approx \sum_{k=1}^{M} u(x_k)\,\Delta x_k,$
where $x_k \in \mathbb{R}^d$ is a point in the $k$-th rectangle, and $\Delta x_k := \Delta x_{k,1}\Delta x_{k,2}\cdots\Delta x_{k,n}$ is the $n$-dimensional volume (or measure) of the $k$-th rectangle ($\Delta x_{k,i}$ for $i = 1, \dots, n$ are the side lengths of the $k$-th rectangle). Then the Riemann integral is defined by taking the limit as the largest side length in the partition tends to zero (provided the limit exists and does not depend on the choice of partition or points $x_k$). Notice here that $x_k = (x_{k,1}, \dots, x_{k,n}) \in \mathbb{R}^d$ is a point in $\mathbb{R}^d$, and not the $k$-th entry of $x$. There is a slight abuse of notation here; the reader will have to discern from the context which is implied.
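For a concrete instance of this definition, the Riemann sum below approximates the integral of $u(x, y) = xy$ over $U = [0,1]^2$, whose exact value is $1/4$; the uniform partition into equal squares with midpoint sample points is one admissible choice.

```python
import numpy as np

# Riemann sum approximation of a 2-d integral over U = [0,1]^2:
# partition into M = m^2 equal squares, sample u at the midpoint of
# each, and sum u(x_k) * Delta x_k.  Exact value for u = x*y is 1/4.

def riemann_sum(u, m):
    h = 1.0 / m                    # side length of each square
    s = 0.0
    for i in range(m):
        for j in range(m):
            xk = ((i + 0.5) * h, (j + 0.5) * h)   # midpoint of square k
            s += u(*xk) * h * h                   # u(x_k) times its area
    return s

u = lambda x, y: x * y
approx = riemann_sum(u, 64)
assert abs(approx - 0.25) < 1e-4
```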
If $S \subset \mathbb{R}^d$ is an $(n-1)$-dimensional (or possibly lower dimensional) surface, we write the surface integral of u over S as
$\displaystyle \int_S u(x)\,dS(x).$
Some properties will hold on all of Rd except for a set with measure zero. In this
case we will say the property holds almost everywhere or abbreviated “a.e.”.
The following theorems are the most useful for passing to limits within integrals.
Lemma A.25 (Fatou's Lemma). If $u_k : \mathbb{R}^d \to [0,\infty)$ is a sequence of nonnegative measurable functions then
(A.14) $\displaystyle \int_{\mathbb{R}^d} \liminf_{k\to\infty} u_k(x)\,dx \leq \liminf_{k\to\infty} \int_{\mathbb{R}^d} u_k(x)\,dx.$
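The inequality in Fatou's Lemma can be strict. A standard example (not from the text) is a unit bump sliding off to infinity: each integral equals 1, but at every fixed point the sequence is eventually zero.

```python
import numpy as np

# Strict inequality in (A.14): u_k = indicator of [k, k+1], so the mass
# escapes to infinity.  Every integral is 1, but liminf_k u_k = 0
# pointwise, so the left side of (A.14) is 0 < 1.

def u_k(x, k):
    return np.where((x >= k) & (x < k + 1), 1.0, 0.0)

x, dx = np.linspace(0.0, 100.0, 1_000_001, retstep=True)
integrals = [u_k(x, k).sum() * dx for k in range(10)]
assert all(abs(I - 1.0) < 1e-3 for I in integrals)

# At a fixed point x0, u_k(x0) = 0 for every k > x0.
x0 = np.array([3.7])
assert all(u_k(x0, k)[0] == 0.0 for k in range(5, 50))
```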
Noting that lim inf(−uk ) = − lim sup(uk ), this can be rearranged to obtain (A.15),
provided g is summable.
The following two theorems are more or less direct applications of Fatou’s Lemma
and the Reverse Fatou Lemma, but present the results in a way that is very useful in
practice.
Proof. Let $u(x) = \lim_{k\to\infty} u_k(x)$. Note that $|u_k - u| \leq |u_k| + |u| \leq 2g$. Therefore, we can apply the Reverse Fatou Lemma to find that
$\displaystyle \limsup_{k\to\infty}\int_{\mathbb{R}^d} |u - u_k|\,dx \leq \int_{\mathbb{R}^d}\lim_{k\to\infty}|u_k - u|\,dx = 0.$
Therefore $\lim_{k\to\infty}\int_{\mathbb{R}^d}|u - u_k|\,dx = 0$ and we have
$\displaystyle \left|\int_{\mathbb{R}^d} u\,dx - \int_{\mathbb{R}^d} u_k\,dx\right| = \left|\int_{\mathbb{R}^d}(u - u_k)\,dx\right| \leq \int_{\mathbb{R}^d}|u_k - u|\,dx \longrightarrow 0$
as $k \to \infty$.
where A ⊂ Rd is measurable and |A| < ∞. Then for each ε > 0 there exists a
measurable subset E ⊂ A such that |A − E| ≤ ε and uk → u uniformly on E.
$\displaystyle \frac{\partial u}{\partial \nu}(x) := \nabla u(x)\cdot\nu(x).$
Integration by parts in Rd is based on the Gauss-Green Theorem.
Theorem A.31 (Integration by parts). Let U ⊂ Rd be an open and bounded set with
a smooth boundary $\partial U$. If $u, v \in C^2(U)$ then
(i) $\displaystyle \int_U u\Delta v\,dx = \int_{\partial U} u\frac{\partial v}{\partial \nu}\,dS - \int_U \nabla u\cdot\nabla v\,dx,$
(ii) $\displaystyle \int_U (u\Delta v - v\Delta u)\,dx = \int_{\partial U}\left(u\frac{\partial v}{\partial \nu} - v\frac{\partial u}{\partial \nu}\right)dS,$ and
(iii) $\displaystyle \int_U \Delta v\,dx = \int_{\partial U}\frac{\partial v}{\partial \nu}\,dS.$
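Identity (iii) can be verified numerically on a simple domain. The sketch below uses the unit square, which is only piecewise smooth, but Gauss-Green still applies there; $v(x, y) = x^2 y$ is an arbitrary test choice, so $\Delta v = 2y$ and both sides equal 1.

```python
import numpy as np

# Numerical check of identity (iii) on U = [0,1]^2 with v(x,y) = x^2*y,
# so Delta v = 2y and grad v = (2xy, x^2).  Both sides should equal 1.

n = 1000
t = (np.arange(n) + 0.5) / n       # midpoint quadrature nodes on (0,1)
h = 1.0 / n

# Volume side: integral over U of Delta v = 2y (tensor midpoint rule).
X, Y = np.meshgrid(t, t)
volume_side = np.sum(2.0 * Y) * h * h

# Boundary side: dv/dnu = grad v . nu, edge by edge (nu = outward normal).
boundary_side = (
    np.sum(2.0 * 1.0 * t) * h      # edge x = 1, nu = ( 1, 0): dv/dnu =  2y
    + 0.0                          # edge x = 0, nu = (-1, 0): dv/dnu = -2*0*y = 0
    + np.sum(t**2) * h             # edge y = 1, nu = ( 0, 1): dv/dnu =  x^2
    + np.sum(-(t**2)) * h          # edge y = 0, nu = ( 0,-1): dv/dnu = -x^2
)
assert abs(volume_side - boundary_side) < 1e-9
```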
Proof. It is enough to prove the result for $\theta = 0$: if any statement holds for $\theta > 0$, we can define $v(x) = u(x) - \frac{\theta}{2}|x|^2$ and use the results for v with $\theta = 0$.
The proof is split into four parts.
1. (i) =⇒ (ii): Assume u is convex. Let $x_0 \in \mathbb{R}$ and set $\lambda = \frac{1}{2}$, $x = x_0 - h$, and $y = x_0 + h$ for a real number h. Then
$\displaystyle \lambda x + (1-\lambda)y = \frac{1}{2}(x_0 - h) + \frac{1}{2}(x_0 + h) = x_0,$
and the convexity condition (A.22) yields
$\displaystyle u(x_0) \leq \frac{1}{2}u(x_0 - h) + \frac{1}{2}u(x_0 + h).$
Therefore
$u(x_0 - h) - 2u(x_0) + u(x_0 + h) \geq 0$
A.8. CONVEX FUNCTIONS 161
and
$u(x) \geq u(y) + u'(y)(x - y).$
Subtracting the two inequalities above proves that (iv) holds.
4. (iv) =⇒ (i): Assume (iv) holds. Then $u'$ is nondecreasing, and so $u''(x) \geq 0$ for all $x \in \mathbb{R}$; hence (ii) holds, and so does (iii). Let $x, y \in \mathbb{R}$ and $\lambda \in (0,1)$, and set $x_0 = \lambda x + (1-\lambda)y$. By (iii),
$u(x) \geq u(x_0) + u'(x_0)(x - x_0) \quad\text{and}\quad u(y) \geq u(x_0) + u'(x_0)(y - x_0).$
Multiplying the first inequality by $\lambda$, the second by $1 - \lambda$, and adding gives
$\lambda u(x) + (1-\lambda)u(y) \geq u(x_0) + u'(x_0)\big(\lambda x + (1-\lambda)y - x_0\big) = u(x_0),$
and so u is convex.
Lemma A.37 has a natural higher dimensional analog, but we first need some new notation. For a symmetric real-valued $d \times d$ matrix $A = (a_{ij})_{i,j=1}^d$, we write
(A.24) $\displaystyle A \geq 0 \iff v\cdot Av = \sum_{i,j=1}^{d} a_{ij}v_i v_j \geq 0 \quad \text{for all } v \in \mathbb{R}^d.$
Proof. Again, we may prove the result just for $\theta = 0$. The proof follows mostly from Lemma A.37, with some additional observations.
1. (i) =⇒ (ii): Assume u is convex. Since convexity is defined along lines, we see that $g(t) = u(x + tv)$ is convex for all $x, v \in \mathbb{R}^d$, and by Lemma A.37, $g''(t) \geq 0$ for all t. By (A.10) we have
(A.25) $\displaystyle g''(t) = \frac{d^2}{dt^2}u(x + tv) = \sum_{i=1}^{d}\sum_{j=1}^{d} u_{x_i x_j}(x + tv)\,v_i v_j = v\cdot\nabla^2 u(x + tv)\,v$
for all t. By Lemma A.37, g is convex for all $x, v \in \mathbb{R}^d$, from which it easily follows that u is convex.
A.9 Probability
Here, we give a brief overview of basic probability. For more details we refer the
reader to [25].
the union A ∪ B is the event that A or B occurred, and the intersection A ∩ B is the event that both A and B occurred. By subadditivity of measures we have
$\mathbb{P}_X(X \in B) := \mathbb{P}(X^{-1}(B)).$
$\mathbb{E}_X[g(X)]$ to be
$\displaystyle \mathbb{E}_X[g(X)] = \int_{\Omega_X} g(x)\,d\mathbb{P}_X(x) = \int_{\Omega} g(X(\omega))\,d\mathbb{P}(\omega).$
In particular,
$\displaystyle \mathbb{E}_X[X] = \int_{\Omega_X} x\,d\mathbb{P}_X(x) = \int_{\Omega} X(\omega)\,d\mathbb{P}(\omega).$
164 APPENDIX A. MATHEMATICAL PRELIMINARIES
(A.26) $\displaystyle \mathbb{P}_X(X \geq t) \leq \frac{\mathbb{E}_X[X]}{t}.$
Proof. By definition we have
$\displaystyle \mathbb{P}_X(X \geq t) = \int_t^{\infty} d\mathbb{P}_X(x) \leq \int_t^{\infty}\frac{x}{t}\,d\mathbb{P}_X(x) \leq \frac{1}{t}\int_0^{\infty} x\,d\mathbb{P}_X(x) = \frac{\mathbb{E}_X[X]}{t}.$
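A quick Monte Carlo check of (A.26), with X exponentially distributed (an arbitrary nonnegative example, not from the text):

```python
import numpy as np

# Empirical check of Markov's inequality (A.26): X ~ Exponential(1) is
# nonnegative with E[X] = 1, so P(X >= t) <= E[X]/t for every t > 0.

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=200_000)
ts = [0.5, 1.0, 2.0, 5.0]
tails = [np.mean(samples >= t) for t in ts]     # empirical P(X >= t)
bounds = [samples.mean() / t for t in ts]       # Markov bound E[X]/t
assert all(p <= b + 1e-3 for p, b in zip(tails, bounds))
```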
(A.28) $\displaystyle \mathbb{P}_X(|X - \mathbb{E}_X[X]| \geq t) \leq \frac{\mathrm{Var}(X)}{t^2}.$
Applying Markov's inequality (A.26) to $Y = (X - \mathbb{E}_X[X])^2$ gives
$\displaystyle \mathbb{P}_X(|X - \mathbb{E}_X[X]| \geq t) = \mathbb{P}_X(Y \geq t^2) \leq \frac{\mathbb{E}_X[Y]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}.$
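The same kind of empirical check works for Chebyshev's inequality (A.28); a normal sample is used as an arbitrary example, with mean and variance estimated from the data.

```python
import numpy as np

# Empirical check of Chebyshev's inequality (A.28) with X ~ N(3, 2^2),
# so Var(X) = 4.  The distribution is an arbitrary example.

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=200_000)
mu, var = x.mean(), x.var()
ts = [1.0, 2.0, 4.0]
tails = [np.mean(np.abs(x - mu) >= t) for t in ts]   # empirical tail
bounds = [var / t**2 for t in ts]                    # Chebyshev bound
assert all(p <= b + 1e-3 for p, b in zip(tails, bounds))
```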
$X_i(\omega_1, \omega_2, \dots, \omega_n) = X(\omega_i).$
(A.30) $\displaystyle \mathbb{E}_{(X_1, X_2, \dots, X_n)}[f_1(X_1)f_2(X_2)\cdots f_n(X_n)] = \prod_{i=1}^{n}\mathbb{E}_X[f_i(X)].$
for any i, so the choice of which expectation to use is irrelevant. Since we do not wish
to always specify the base random variable X on which the sequence is constructed,
we often write X1 or Xi in place of X.
$\displaystyle = \frac{1}{n^2}\sum_{i,j=1}^{n}\mathbb{E}[(X_i - \mu)(X_j - \mu)].$
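The double sum above is the expansion of the variance of the sample average $\frac{1}{n}\sum_i X_i$; by independence the cross terms with $i \neq j$ vanish, leaving $\mathrm{Var}(X)/n$. A quick Monte Carlo check, using uniform samples (for which $\mathrm{Var}(X) = 1/12$):

```python
import numpy as np

# By independence, E[(X_i - mu)(X_j - mu)] = 0 for i != j, so the
# double sum reduces to Var(sample mean) = Var(X)/n.  Empirical check
# with X ~ Uniform(0, 1), for which Var(X) = 1/12.

rng = np.random.default_rng(2)
n, trials = 50, 100_000
sample_means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
predicted = (1.0 / 12.0) / n
assert abs(sample_means.var() - predicted) < 1e-4
```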
A test function satisfying the requirements in the proof of Lemma A.43 is given by
$\varphi(x) = \begin{cases}\exp\left(-\dfrac{1}{\delta^2 - |x - x_0|^2}\right), & \text{if } |x - x_0| < \delta,\\ 0, & \text{if } |x - x_0| \geq \delta.\end{cases}$
Indeed, the equality is immediate, since for any φ we have $\nabla u\cdot\varphi \leq |\nabla u|$ due to the condition $|\varphi| \leq 1$, and we can take $\varphi = \frac{\nabla u}{|\nabla u|}$ to saturate the inequality. Since u has compact support in $\mathbb{R}^d$, we can integrate by parts to obtain the identity
(A.34) $\displaystyle \int_{\mathbb{R}^d} |\nabla u|\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{\mathbb{R}^d} u\,\mathrm{div}\,\varphi\,dx.$
The identity (A.34) is taken to be the definition of the total variation $\int_{\mathbb{R}^d}|\nabla u|\,dx$ when u is not differentiable.
Therefore we have
$\displaystyle \int_{\mathbb{R}^d} |\nabla \chi_U|\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{\mathbb{R}^d} \chi_U(x)\,\mathrm{div}\,\varphi(x)\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{U} \mathrm{div}\,\varphi(x)\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{\partial U} \varphi\cdot\nu\,dS.$
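The conclusion that the total variation of $\chi_U$ equals the perimeter of $\partial U$ can also be seen discretely: on a grid, a finite difference total variation of the indicator of an axis-aligned square recovers its perimeter. (For axis-aligned sets the simple anisotropic grid TV already gives the right value; for general sets it only approximates the isotropic perimeter.)

```python
import numpy as np

# Discrete total variation of the indicator of the square
# [0.25, 0.75]^2 on an n x n grid over [0,1]^2.  Each jump across a
# pixel edge contributes h, so the TV approximates the perimeter 2.

n = 400
h = 1.0 / n
xs = (np.arange(n) + 0.5) * h
X, Y = np.meshgrid(xs, xs, indexing="ij")
chi = ((X > 0.25) & (X < 0.75) & (Y > 0.25) & (Y < 0.75)).astype(float)

# TV = sum over pixels of (|D_x chi| + |D_y chi|) * h^2, with forward
# differences D chi / h; this collapses to (number of jumps) * h.
tv = (np.abs(np.diff(chi, axis=0)).sum()
      + np.abs(np.diff(chi, axis=1)).sum()) * h
assert abs(tv - 2.0) < 0.05
```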
A.10. MISCELLANEOUS RESULTS 169
[3] H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method,
I. The continuous dynamical system: global exploration of the local minima of
a real-valued function by asymptotic analysis of a dissipative dynamical system.
Communications in Contemporary Mathematics, 2(01):1–34, 2000.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In Advances in Neural Information Processing Systems,
pages 585–591, 2002.
[9] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
172 BIBLIOGRAPHY
[11] J. Calder. The game theoretic p-Laplacian and semi-supervised learning with
few labels. Nonlinearity, 32(1), 2018.
[13] J. Calder. Consistency of Lipschitz learning with infinite unlabeled data and
finite labeled data. SIAM Journal on Mathematics of Data Science, 1(4):780–
812, 2019.
[14] J. Calder and N. García Trillos. Improved spectral convergence rates for graph
Laplacians on ε-graphs and k-NN graphs. arXiv preprint, 2019.
[16] J. Calder, D. Slepčev, and M. Thorpe. Rates of convergence for Laplacian semi-
supervised learning with low labelling rates. In preparation, 2019.
[17] J. Calder and A. Yezzi. PDE Acceleration: A convergence rate analysis and
applications to obstacle problems. Research in the Mathematical Sciences, 6(35),
2019.
[19] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[20] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
[22] B. Dacorogna. Direct methods in the calculus of variations, volume 78. Springer
Science & Business Media, 2007.
[25] R. Durrett. Probability: theory and examples, volume 49. Cambridge University Press, 2019.
[27] S. Esedoğlu and Y.-H. R. Tsai. Threshold dynamics for the piecewise constant Mumford–Shah functional. Journal of Computational Physics, 211(1):367–384, 2006.
[31] T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.
[32] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang. Manifold-ranking based image
retrieval. In Proceedings of the 12th annual ACM International Conference on
Multimedia, pages 9–16. ACM, 2004.
[33] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang. Generalized manifold-ranking-
based image retrieval. IEEE Transactions on Image Processing, 15(10):3170–
3177, 2006.
[34] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.
[35] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds: weak and strong pointwise consistency of graph Laplacians. In International Conference on Computational Learning Theory, pages 470–485. Springer, 2005.
[43] J. Moser. A new proof of de Giorgi’s theorem concerning the regularity problem
for elliptic differential equations. Communications on Pure and Applied Mathe-
matics, 13(3):457–468, 1960.
[49] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise
removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.
[51] W. Rudin. Real and complex analysis. Tata McGraw-Hill Education, 2006.
[52] J. Shi and J. Malik. Normalized cuts and image segmentation. Departmental
Papers (CIS), page 107, 2000.
[54] Z. Shi, B. Wang, and S. J. Osher. Error estimation of weighted nonlocal Laplacian
on random point cloud. arXiv preprint arXiv:1809.08622, 2018.
[58] N. G. Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace–Beltrami operator. arXiv preprint arXiv:1801.10108, 2018.
[59] N. G. Trillos and R. Murray. A maximum principle argument for the uniform convergence of graph Laplacian regressors. arXiv preprint arXiv:1901.10089, 2019.
[61] N. G. Trillos and D. Slepčev. Continuum limit of total variation on point clouds. Archive for Rational Mechanics and Analysis, 220(1):193–241, 2016.
[64] B. Xu, J. Bu, C. Chen, D. Cai, X. He, W. Liu, and J. Luo. Efficient manifold
ranking for image retrieval. In Proceedings of the 34th International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 525–
534. ACM, 2011.
[65] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via
graph-based manifold ranking. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 3166–3173, 2013.
[69] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data
on a directed graph. In Proceedings of the 22nd International Conference on
Machine Learning, pages 1036–1043. ACM, 2005.
[72] X. Zhou, M. Belkin, and N. Srebro. An iterated graph Laplacian approach for
ranking on manifolds. In Proceedings of the 17th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 877–885. ACM,
2011.