Calculus of Variations
Jeff Calder
University of Minnesota
School of Mathematics
[email protected]
February 6, 2024
Contents
1 Introduction
  1.1 Examples

3 Applications
  3.1 Shortest path
  3.2 The brachistochrone problem
  3.3 Minimal surfaces
  3.4 Minimal surface of revolution
  3.5 Isoperimetric inequality
  3.6 Image restoration
    3.6.1 Gradient descent and acceleration
    3.6.2 Primal dual approach
    3.6.3 The split Bregman method
    3.6.4 Edge preservation properties of Total Variation restoration
  3.7 Image segmentation
    3.7.1 Ginzburg-Landau approximation

5 Graph-based learning
  5.1 Calculus on graphs
    5.1.1 The Graph Laplacian and maximum principle
  5.2 Concentration of measure
  5.3 Random geometric graphs and basic setup
  5.4 The maximum principle approach
  5.5 The variational approach
    5.5.1 Consistency for smooth functions
    5.5.2 Discrete to nonlocal via transportation maps
    5.5.3 Nonlocal to local estimates
    5.5.4 Discrete to continuum
Chapter 1

Introduction
leave out some of the rigorous detail that is expected in a graduate course. These first
four chapters require some familiarity with ordinary differential equations (ODEs),
and multi-variable calculus. The appendix contains some mathematical preliminaries
and notational conventions that are important to review.
The level of rigour increases dramatically in Chapter 4, where we begin a study of
the direct method in the Calculus of Variations for proving existence of minimizers.
This chapter requires some basic background in functional analysis and, in particular,
Sobolev spaces. We provide a very brief overview in Section 4.2, and refer the reader
to [28] for more details.
Finally, in Chapter 5 we give an overview of applications of the calculus of varia-
tions to prove discrete to continuum results in graph-based semi-supervised learning.
The ideas in this chapter are related to Γ-convergence, which is a notion of conver-
gence for functionals that ensures minimizers converge to minimizers.
These notes are just a basic introduction. We refer the reader to [28, Chapter
8] and [22] for a more thorough exposition of the theory behind the Calculus of
Variations.
1.1 Examples
We begin with some examples.
Example 1.1 (Shortest path). Let A and B be two points in the plane. What
is the shortest path between A and B? The answer depends on how we measure
length! Suppose the length of a short line segment near (x, y) is the usual Euclidean
length multiplied by a positive scale factor g(x, y). For example, the length of a path
could correspond to the length of time it would take a robot to navigate the path,
and certain regions in space may be easier or harder to navigate, yielding larger or
smaller values of g. Robotic navigation is thus a special case of finding the shortest
path between two points.
Suppose A lies to the left of B and the path is a graph u(x) over the x axis. See
Figure 1.1. Then the “length” of the path between x and x + ∆x is approximately
L = g(x, u(x)) √(1 + u′(x)²) ∆x.
If we let A = (0, 0) and B = (a, b) where a > 0, then the length of a path (x, u(x))
connecting A to B is
I(u) = ∫₀ᵃ g(x, u(x)) √(1 + u′(x)²) dx.
The problem of finding the shortest path from A to B is equivalent to finding the
function u that minimizes the functional I(u) subject to u(0) = 0 and u(a) = b. △
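As a quick sanity check, the functional I(u) can be discretized by summing segment lengths along a polyline. The grid, the choice g ≡ 1, and the comparison path below are my own illustrative assumptions, not part of the text:

```python
import numpy as np

def path_length(u, x, g=None):
    """Discretize I(u) by summing polyline segment lengths, weighting each
    segment by g evaluated at its midpoint; g=None means g ≡ 1 (Euclidean)."""
    dx, du = np.diff(x), np.diff(u)
    seg = np.sqrt(dx**2 + du**2)                 # segment lengths
    if g is None:
        return np.sum(seg)
    xm, um = 0.5 * (x[1:] + x[:-1]), 0.5 * (u[1:] + u[:-1])
    return np.sum(g(xm, um) * seg)

x = np.linspace(0.0, 1.0, 2001)
line = x                                # straight line from (0,0) to (1,1)
detour = x + 0.3 * np.sin(np.pi * x)    # same endpoints, but curved

len_line = path_length(line, x)         # exactly sqrt(2) for this polyline
len_detour = path_length(detour, x)     # strictly longer
```

With g ≡ 1 any non-straight path comes out longer, consistent with the discussion of Example 3.1 later in the notes; a nonconstant g would change the comparison.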
Figure 1.1: In our version of the shortest path problem, all paths must be graphs of
functions u = u(x).
Example 1.2 (Brachistochrone problem). In 1696 Johann Bernoulli posed the fol-
lowing problem. Let A and B be two points in the plane with A lying above B.
Suppose we connect A and B with a thin wire and allow a bead to slide from A to
B under the influence of gravity. Assuming the bead slides without friction, what is
the shape of the wire that minimizes the travel time of the bead? Perhaps counter-
intuitively, it turns out that the optimal shape is not a straight line! The problem
is commonly referred to as the brachistochrone problem—the word brachistochrone
derives from ancient Greek meaning “shortest time”.
Let g denote the acceleration due to gravity. Suppose that A = (0, 0) and B =
(a, b) where a > 0 and b < 0. Let u(x) for 0 ≤ x ≤ a describe the shape of the
wire, so u(0) = 0 and u(a) = b. Let v(x) denote the speed of the bead when it is
at position x. When the bead is at position (x, u(x)) along the wire, the potential
energy stored in the bead is PE = mgu(x) (relative to height zero), and the kinetic
energy is KE = ½mv(x)², where m is the mass of the bead. By conservation of energy
½ m v(x)² + m g u(x) = 0,
since the bead starts with zero total energy at point A. Therefore
v(x) = √(−2g u(x)).
Between x and x + ∆x, the bead slides a distance of approximately √(1 + u′(x)²) ∆x
with a speed of v(x) = √(−2g u(x)). Hence it takes approximately

t = ( √(1 + u′(x)²) / √(−2g u(x)) ) ∆x
time for the bead to move from position x to x + ∆x. Therefore the total time taken
for the bead to slide down the wire is given by
I(u) = (1/√(2g)) ∫₀ᵃ √( (1 + u′(x)²)/(−u(x)) ) dx.
The problem of determining the optimal shape of the wire is therefore equivalent to
finding the function u(x) that minimizes I(u) subject to u(0) = 0 and u(a) = b. △
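A rough numerical experiment already shows that the straight line is not optimal. The setup is my own: the comparison curve u(x) = −√x is just an arbitrary curve with a steep initial drop, not the true minimizer, and the travel time is approximated by summing segment length over midpoint speed, with v = √(−2gu) from conservation of energy:

```python
import numpy as np

def travel_time(u, a=1.0, g=9.81, n=20000):
    """Approximate I(u) by summing (segment length)/(speed), with the speed
    sqrt(-2 g u) from conservation of energy evaluated at segment midpoints."""
    x = np.linspace(0.0, a, n + 1)
    y = u(x)
    seg = np.sqrt(np.diff(x)**2 + np.diff(y)**2)   # segment lengths
    ymid = 0.5 * (y[1:] + y[:-1])                  # midpoint heights (< 0)
    return np.sum(seg / np.sqrt(-2.0 * g * ymid))

t_line = travel_time(lambda x: -x)            # straight line to (1, -1)
t_drop = travel_time(lambda x: -np.sqrt(x))   # steeper initial drop, same endpoints

# t_drop comes out smaller: the straight line is not the fastest wire
```

The true minimizer (a cycloid, derived in Section 3.2) would do better still.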
Example 1.3 (Minimal surfaces). Suppose we bend a piece of wire into a loop of
any shape we wish, and then dip the wire loop into a solution of soapy water. A
soap bubble will form across the loop, and we may naturally wonder what shape the
bubble will take. Physics tells us that the soap bubble formed will be the one with the
least surface area, at least locally, compared to all other surfaces that span the wire loop.
Such a surface is called a minimal surface.
To formulate this mathematically, suppose the loop of wire is the graph of a
function g : ∂U → R, where U ⊂ R2 is open and bounded. We also assume that all
possible surfaces spanning the wire can be expressed as graphs of functions u : U → R.
To ensure the surface connects to the wire we ask that u = g on ∂U . The surface
area of a candidate soap film surface u is given by
I(u) = ∫_U √(1 + |∇u|²) dx.
Thus, the minimal surface problem is equivalent to finding a function u that minimizes
I subject to u = g on ∂U. △
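A minimal numerical sketch of the area functional (the grid, the boundary data g ≡ 0, and the comparison surface are my assumptions): any non-flat surface with the same boundary has a larger discretized area than the flat one, since the integrand √(1 + |∇u|²) ≥ 1 with equality only where ∇u = 0.

```python
import numpy as np

def surface_area(u, h):
    """Discretize I(u): average forward differences to cell centers,
    then sum sqrt(1 + |grad u|^2) * h^2 over all grid cells."""
    ux = (u[1:, :-1] - u[:-1, :-1] + u[1:, 1:] - u[:-1, 1:]) / (2.0 * h)
    uy = (u[:-1, 1:] - u[:-1, :-1] + u[1:, 1:] - u[1:, :-1]) / (2.0 * h)
    return np.sum(np.sqrt(1.0 + ux**2 + uy**2)) * h**2

n = 101
h = 1.0 / (n - 1)
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")

flat = np.zeros((n, n))                          # u ≡ 0, boundary data g ≡ 0
bump = np.sin(np.pi * xx) * np.sin(np.pi * yy)   # same zero boundary values

area_flat = surface_area(flat, h)   # = 1, the area of the unit square
area_bump = surface_area(bump, h)   # strictly larger
```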
Example 1.4 (Image restoration). A grayscale image is a function u : [0, 1]2 → [0, 1].
For x ∈ R2 , u(x) represents the brightness of the pixel at location x. In real-world
applications, images are often corrupted in the acquisition process or thereafter, and
we observe a noisy version of the image. The task of image restoration is to recover
the true noise-free image from a noisy observation.
Let f (x) be the observed noisy image. A widely used and very successful approach
to image restoration is the so-called total variation (TV) restoration, which minimizes
the functional

I(u) = ∫_U ½(u − f)² + λ|∇u| dx,
where λ > 0 is a parameter and U = (0, 1)2 . The restored image is the function u
that minimizes I (we do not impose boundary conditions on the minimizer). The
first term ½(u − f)² is called a fidelity term, and encourages the restored image to
be close to the observed noisy image f. The second term |∇u| measures the amount
of noise in the image, and minimizing this term encourages the removal of noise in
the restored image. The name TV restoration comes from the fact that ∫_U |∇u| dx
is called the total variation of u. Total variation image restoration was pioneered by
Rudin, Osher, and Fatemi [49]. △
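The following sketch minimizes a smoothed version of the TV functional on a 1-D signal by explicit gradient descent. The ε-regularization √((Du)² + ε²), all parameter values, and the test signal are my assumptions, chosen to keep the example short and differentiable; practical TV solvers are the subject of Section 3.6.

```python
import numpy as np

def tv_energy(u, f, lam=1.0, eps=0.1):
    """Smoothed TV energy: 0.5*sum (u-f)^2 + lam * sum sqrt((Du)^2 + eps^2),
    where D is the forward difference (grid spacing absorbed into lam)."""
    du = np.diff(u)
    return 0.5 * np.sum((u - f)**2) + lam * np.sum(np.sqrt(du**2 + eps**2))

def tv_denoise(f, lam=1.0, eps=0.1, tau=0.02, steps=3000):
    """Explicit gradient descent on tv_energy. The gradient is Lipschitz with
    constant at most 1 + 4*lam/eps, and tau below its reciprocal guarantees
    the (convex) energy does not increase at any step."""
    u = f.copy()
    for _ in range(steps):
        du = np.diff(u)
        p = du / np.sqrt(du**2 + eps**2)      # derivative of TV term w.r.t. Du
        div_p = np.concatenate(([p[0]], np.diff(p), [-p[-1]]))  # -D^T p
        u = u - tau * ((u - f) - lam * div_p)
    return u

rng = np.random.default_rng(0)
clean = np.where(np.linspace(0, 1, 200) < 0.5, 0.0, 1.0)   # a sharp edge
f = clean + 0.1 * rng.standard_normal(200)                  # noisy observation
u = tv_denoise(f)   # typically much closer to `clean` than f is
```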
Example 1.5 (Image segmentation). A common task in computer vision is the seg-
mentation of an image into meaningful regions. Let f : [0, 1]2 → [0, 1] be a grayscale
image we wish to segment. We represent possible segmentations of the image by the
level sets of functions u : [0, 1]2 → R. Each function u divides the domain [0, 1]2 into
two regions defined by
R+ (u) = {x ∈ [0, 1]2 : u(x) > 0} and R− (u) = {x ∈ [0, 1]2 : u(x) ≤ 0}.
The boundary between the two regions is the level set {x ∈ [0, 1]2 : u(x) = 0}.
At a very basic level, we might assume our image is composed of two regions
with different intensity levels f = a and f = b, corrupted by noise. Thus, we might
propose to segment the image by minimizing the functional
I(u, a, b) = ∫_{R+(u)} (f(x) − a)² dx + ∫_{R−(u)} (f(x) − b)² dx,
over all possible segmentations u and real numbers a and b. However, this turns
out not to work very well since it does not incorporate the geometry of the region
in any way. Intuitively, a semantically meaningful object in an image is usually
concentrated in some region of the image, and might have a rather smooth boundary.
The minimizers of I could be very pathological and oscillate rapidly, trying to capture
every pixel with intensity near a in one region and those near b in the other region. For example, if f
only takes the values 0 and 1, then minimizing I will try to put all the pixels in the
image where f is 0 into one region, and all those where f is 1 into the other region,
and will choose a = 0 and b = 1. This is true regardless of whether the region where
f is zero is a nice circle in the center of the image, or if we randomly choose each
pixel to be 0 or 1. In the latter case, the segmentation u will oscillate wildly and will
not give a meaningful result.
A common approach to fixing this issue is to include a penalty on the length
of the boundary between the two regions. Let us denote the length of the boundary
between R+ (u) and R− (u) (i.e., the zero level set of u) by L(u). Thus, we seek instead
to minimize the functional

I(u, a, b) = ∫_{R+(u)} (f(x) − a)² dx + ∫_{R−(u)} (f(x) − b)² dx + λL(u),

where λ > 0 is a parameter. To rewrite the functional in a more convenient form, let
H denote the Heaviside function, defined by H(z) = 1 for z > 0 and H(z) = 0 for z ≤ 0.
Then the region R+(u) is precisely the region where H(u(x)) = 1, and the region
R−(u) is precisely where H(u(x)) = 0. Therefore
∫_{R+(u)} (f(x) − a)² dx = ∫_U H(u(x))(f(x) − a)² dx,

and similarly ∫_{R−(u)} (f(x) − b)² dx = ∫_U (1 − H(u(x)))(f(x) − b)² dx. Therefore we have
I(u, a, b) = ∫_U H(u)(f − a)² + (1 − H(u))(f − b)² + λδ(u)|∇u| dx.

△
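For a fixed segmentation u, the functional I(u, a, b) is quadratic in a and in b separately, so the optimal constants are the mean values of f over R+(u) and R−(u). A small check on a synthetic signal (all values mine):

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.where(np.arange(100) < 40, 0.2, 0.8) + 0.05 * rng.standard_normal(100)
region = np.arange(100) < 40                  # a fixed segmentation R+(u)

def fidelity(a, b):
    """The two fidelity terms of I(u, a, b) for this fixed segmentation."""
    return np.sum((f[region] - a)**2) + np.sum((f[~region] - b)**2)

a_star, b_star = f[region].mean(), f[~region].mean()   # the region means
# fidelity(a_star, b_star) is minimal over all choices of a and b
```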
Chapter 2

The Euler-Lagrange Equation
L : U × R × Rᵈ → R.

L = L(x₁, x₂, . . . , xₙ, z, p₁, p₂, . . . , pₙ).

and

∇p L(x, z, p) = (L_{p₁}(x, z, p), . . . , L_{pₙ}(x, z, p)).
Each example from Section 1.1 involved a functional of the general form of (2.1).
For the shortest path problem d = 1 and
L(x, z, p) = g(x₁, z) √(1 + p₁²).
This means that h(t) := I(u + tφ) has a global minimum at t = 0, i.e., h(0) ≤ h(t)
for all t. It follows that h′(0) = 0, which is equivalent to
(2.4)    d/dt I(u + tφ) |_{t=0} = 0.
For notational simplicity, let us suppress the x arguments from u(x) and φ(x). By
the chain rule
d/dt L(x, u + tφ, ∇u + t∇φ) = Lz(x, u + tφ, ∇u + t∇φ) φ + ∇p L(x, u + tφ, ∇u + t∇φ) · ∇φ.

Therefore

d/dt L(x, u + tφ, ∇u + t∇φ) |_{t=0} = Lz(x, u, ∇u) φ + ∇p L(x, u, ∇u) · ∇φ,
and we have

d/dt I(u + tφ) |_{t=0} = d/dt |_{t=0} ∫_U L(x, u(x) + tφ(x), ∇u(x) + t∇φ(x)) dx
                      = ∫_U d/dt |_{t=0} L(x, u(x) + tφ(x), ∇u(x) + t∇φ(x)) dx
                      = ∫_U Lz(x, u, ∇u) φ + ∇p L(x, u, ∇u) · ∇φ dx.
It follows from the vanishing lemma (Lemma A.43 in the appendix) that
for all φ ∈ Cc∞ (U ). A function u ∈ C 1 (U ) satisfying the above for all φ ∈ Cc∞ (U ) is
called a weak solution of the Euler-Lagrange equation (2.3). Thus, weak solutions of
PDEs arise naturally in the calculus of variations.
In some of the examples presented in Section 1.1, such as the image segmentation
and restoration problems, we did not impose any boundary condition on the minimizer
u. For such problems, Theorem 2.1 still applies, but the Euler-Lagrange equation
(2.3) is not uniquely solvable without a boundary condition. Hence, we need some
additional information about minimizers in order for the Euler-Lagrange equation to
be useful for these problems.
Theorem 2.7. Suppose that u ∈ C²(U) satisfies I(u) ≤ I(v)
for all v ∈ C²(U). Then u satisfies the Euler-Lagrange equation (2.3) with boundary
condition

(2.9)    ∇p L(x, u, ∇u) · ν = 0 on ∂U,

where ν denotes the outward unit normal to ∂U.
Proof. By Theorem 2.1, u satisfies the Euler-Lagrange equation (2.3). We just need
to show that u also satisfies the boundary condition (2.9).
Let φ ∈ C ∞ (U ) be a smooth function that is not necessarily zero on ∂U . Then
by hypothesis I(u) ≤ I(u + tφ) for all t. Therefore, as in the proof of Theorem 2.1
we have
(2.10)    0 = d/dt I(u + tφ) |_{t=0} = ∫_U Lz(x, u, ∇u) φ + ∇p L(x, u, ∇u) · ∇φ dx.
Since u satisfies the Euler-Lagrange equation (2.3), the second term above vanishes
and we have

∫_{∂U} φ ∇p L(x, u, ∇u) · ν dS = 0
for any vector v ∈ Rᵈ. It is possible to take (2.11) as the definition of the gradient of
u. By this, we mean that w = ∇u(x) is the unique vector satisfying

d/dt u(x + tv) |_{t=0} = w · v

for all v ∈ Rᵈ.
In the case of functionals I(u), we showed in the proof of Theorem 2.1 that
d/dt I(u + tφ) |_{t=0} = ( Lz(x, u, ∇u) − div(∇p L(x, u, ∇u)), φ )_{L²(U)}
for all φ ∈ Cc∞ (U ). Here, the L2 -inner product plays the role of the dot product from
the finite dimensional case. Thus, it makes sense to define the gradient, also called
the functional gradient, to be

∇I(u) := Lz(x, u, ∇u) − div(∇p L(x, u, ∇u)).
The reader should compare this with the ordinary chain rule (2.11). Notice the
definition of the gradient ∇I depends on the choice of the L2 -inner product. Using
other inner products will result in different notions of gradient.
We consider the problem of minimizing I(u) subject to the constraint J(u) = 0. The
Lagrange multiplier method gives necessary conditions that must be satisfied by any
minimizer.
Theorem 2.8 (Lagrange multiplier). Suppose that u ∈ C²(U) satisfies J(u) = 0 and
I(u) ≤ I(v) for all v ∈ C²(U) with v = u on ∂U and J(v) = 0. Then there exists a real number
λ such that

(2.15)    ∇I(u) + λ∇J(u) = 0.
This says that ∇I(u) is orthogonal to everything that is orthogonal to ∇J(u). In-
tuitively this must imply that ∇I(u) and ∇J(u) are co-linear; we give the proof
below.
We now have three cases.
1. If ∇J(u) = 0 then (∇I(u), φ)L2 (U ) = 0 for all φ ∈ C ∞ (U ), and by the vanishing
lemma ∇I(u) = 0. Here we can choose any real number for λ.
2. If ∇I(u) = 0 then we can take λ = 0 to complete the proof.
3. Finally, suppose ∇I(u) ≠ 0 and ∇J(u) ≠ 0, and define

λ = − (∇I(u), ∇J(u))_{L²(U)} / ‖∇J(u)‖²_{L²(U)}    and    v = ∇I(u) + λ∇J(u).

Then

(∇J(u), v)_{L²(U)} = (∇J(u), ∇I(u))_{L²(U)} + λ‖∇J(u)‖²_{L²(U)} = 0.

Since ∇I(u) is orthogonal to everything that is orthogonal to ∇J(u), we have

0 = (∇I(u), v)_{L²(U)}
  = (v − λ∇J(u), v)_{L²(U)}
  = (v, v)_{L²(U)} − λ(∇J(u), v)_{L²(U)}
  = ‖v‖²_{L²(U)}.

Hence v = ∇I(u) + λ∇J(u) = 0, which completes the proof.
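A finite-dimensional analogue of Theorem 2.8 can be checked directly (the example is mine, with the L²-inner product replaced by the Euclidean dot product): minimizing I(x, y) = x + y on the circle J(x, y) = x² + y² − 1 = 0, the vectors ∇I and ∇J become collinear at the minimizer.

```python
import numpy as np

# Minimize I(x, y) = x + y subject to J(x, y) = x^2 + y^2 - 1 = 0
# by parametrizing the constraint as (cos t, sin t).
t = np.linspace(0.0, 2.0 * np.pi, 1_000_000)
i = np.argmin(np.cos(t) + np.sin(t))
x, y = np.cos(t[i]), np.sin(t[i])            # ≈ (-1/sqrt(2), -1/sqrt(2))

grad_I = np.array([1.0, 1.0])
grad_J = np.array([2.0 * x, 2.0 * y])
lam = -np.dot(grad_I, grad_J) / np.dot(grad_J, grad_J)   # the case-3 formula

residual = np.linalg.norm(grad_I + lam * grad_J)         # ≈ 0 at the minimizer
```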
Remark 2.10. Notice that (2.15) is equivalent to the necessary conditions satisfied by
minimizers of the augmented functional
compute

d/dt I(u) = d/dt ∫_U L(x, u(x, t), ∇u(x, t)) dx
          = ∫_U Lz(x, u, ∇u) u_t + ∇p L(x, u, ∇u) · ∇u_t dx
          = ∫_U (Lz(x, u, ∇u) − div(∇p L(x, u, ∇u))) u_t dx
          = (∇I(u), u_t)_{L²(U)}
          = (∇I(u), −∇I(u))_{L²(U)}
          = −‖∇I(u)‖²_{L²(U)}
          ≤ 0.
We used integration by parts in the third line, mimicking the proof of Theorem 2.1.
Notice that either boundary condition u = g or ∇p L(x, u, ∇u) · ν = 0 on ∂U ensures
the boundary terms vanish in the integration by parts.
Notice that by writing out the Euler-Lagrange equation we can write the gradient
descent PDE (2.16) as

(2.17)    u_t + Lz(x, u, ∇u) − div(∇p L(x, u, ∇u)) = 0   in U × (0, ∞),
          u = u₀   on U × {t = 0}.
Example 2.3. Gradient descent on the Dirichlet energy (2.6) is the heat equation

(2.18)    u_t − ∆u = f   in U × (0, ∞),
          u = u₀   on U × {t = 0}.

Thus, solving the heat equation is the fastest way to decrease the Dirichlet energy. △
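A 1-D illustration of this (my own discretization, with f = 0): one explicit finite-difference step of the heat equation is exactly a gradient-descent step on the discrete Dirichlet energy, and with a stable time step the energy is nonincreasing.

```python
import numpy as np

def dirichlet_energy(u, h):
    """Discrete Dirichlet energy (1/2) * sum ((u_{i+1}-u_i)/h)^2 * h."""
    return 0.5 * np.sum((np.diff(u) / h)**2) * h

n = 101
h = 1.0 / (n - 1)
u = np.sin(2.0 * np.pi * np.linspace(0.0, 1.0, n))  # initial data, zero boundary
dt = 0.25 * h**2                                     # stable: dt <= h^2 / 2

energies = [dirichlet_energy(u, h)]
for _ in range(500):
    u[1:-1] += dt * (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2   # u_t = u_xx
    energies.append(dirichlet_energy(u, h))
# `energies` is nonincreasing: each step is gradient descent on the energy
```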
We now prove that gradient descent converges to the unique solution of the Euler-
Lagrange equation when I is strongly convex. For simplicity, we consider functionals
of the form
(2.19)    I(u) = ∫_U Ψ(x, u) + Φ(x, ∇u) dx.
This includes the regularized variational problems used for image denoising in Exam-
ple 1.4 and Section 3.6, for example. The reader may want to review definitions and
properties of convex functions in Section A.8 before reading the following theorem.
Theorem 2.11. Assume I is given by (2.19), where z 7→ Ψ(x, z) is θ-strongly convex
for θ > 0 and p 7→ Φ(x, p) is convex. Let u∗ ∈ C 2 (U ) be any function satisfying
(2.20) ∇I(u∗ ) = 0 in U.
Let u(x, t) ∈ C²(U × [0, ∞)) be a solution of the gradient descent equation (2.16) and
assume that u = u∗ on ∂U for all t > 0. Then

(2.21)    e(t) ≤ e(0) e^{−2θt},

where

(2.22)    e(t) := ½ ∫_U (u(x, t) − u∗(x))² dx.

Proof. We differentiate e(t), use the equations solved by u and u∗ and the divergence theorem
to find that

e′(t) = ∫_U (u − u∗) u_t dx
Therefore

(2.23)    e′(t) ≤ −θ ∫_U (u − u∗)² dx = −2θ e(t).
Noting that, by (2.23),

d/dt ( e(t) e^{2θt} ) = ( e′(t) + 2θ e(t) ) e^{2θt} ≤ 0,

we have e(t) ≤ e(0) e^{−2θt}, which completes the proof.
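An ODE analogue of the theorem can be checked numerically (the setup is mine): explicit Euler for gradient descent x′ = −f′(x) on the θ-strongly convex function f(x) = (θ/2)x² decays at least as fast as the continuum bound e(t) ≤ e(0)e^{−2θt}, since 1 − θ∆t ≤ e^{−θ∆t}.

```python
import numpy as np

theta = 2.0                          # strong convexity parameter

def grad_f(x):
    """Gradient of the theta-strongly convex function f(x) = (theta/2) x^2."""
    return theta * x

dt, T = 1e-3, 3.0
x = 5.0
e0 = 0.5 * x**2                      # e(0), with minimizer x* = 0
for _ in range(int(T / dt)):
    x -= dt * grad_f(x)              # explicit Euler for x' = -grad f(x)
e_T = 0.5 * x**2

bound = e0 * np.exp(-2.0 * theta * T)   # the continuum bound e(0) e^{-2 theta T}
```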
Remark 2.12. Notice that the proof of Theorem 2.11 shows that solutions of ∇I(u) =
0, subject to Dirichlet boundary condition u = g on ∂U , are unique.
Remark 2.13. From an optimization or numerical analysis point of view, the expo-
nential convergence rate (2.21) is a linear convergence rate. Indeed, if we consider
the decay in the energy e(t) defined in (2.22) over a time interval of length ∆t, then (2.23)
yields

e(t + ∆t) − e(t) = ∫ₜ^{t+∆t} e′(s) ds ≤ −2θ ∫ₜ^{t+∆t} e(s) ds ≤ −2θ∆t e(t + ∆t),

since e is nonincreasing. Rearranging, we obtain

e(t + ∆t)/e(t) ≤ 1/(1 + 2θ∆t) =: µ.
where K is the kinetic energy and P is the potential energy. The total energy K +P is
conserved in Lagrangian mechanics, and momentum arises through the kinetic energy
K.
Example 2.4 (Mass on a spring). Consider the motion of a mass on a spring with
spring constant k > 0. Assume one end of the spring is fixed to a wall at position
x = 0 and the other end attached to a mass at position x(t) > 0. The spring is
relaxed (at rest) at position x = a > 0, and we assume the mass slides horizontally
along a frictionless surface. Here, the kinetic energy is ½m x′(t)², where m is the
attached mass. The potential energy is the energy stored in the spring, and is given by
½k(x(t) − a)², according to Hooke's law. Hence, the Lagrangian action is

∫ₐᵇ [ ½m x′(t)² − ½k(x(t) − a)² ] dt.
The corresponding Euler-Lagrange equation is

−d/dt (m x′(t)) − k(x(t) − a) = 0,

which reduces to the equation of motion m x′′(t) + k(x(t) − a) = 0. △
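The equation of motion can be checked against its exact solution x(t) = a + (x₀ − a)cos(√(k/m) t) with a symplectic (velocity-Verlet) integrator; all constants below are my own choices.

```python
import numpy as np

m, k, a = 1.0, 4.0, 1.0            # mass, spring constant, rest position
omega = np.sqrt(k / m)
x, v = 1.5, 0.0                    # released from rest, stretched by 0.5

dt, T = 1e-3, 5.0
for _ in range(int(round(T / dt))):
    acc = -k * (x - a) / m         # from m x'' + k (x - a) = 0
    v_half = v + 0.5 * dt * acc    # velocity-Verlet (symplectic) step
    x = x + dt * v_half
    v = v_half + 0.5 * dt * (-k * (x - a) / m)

exact = a + (1.5 - a) * np.cos(omega * T)   # exact solution of the ODE
```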
Exercise 2.14. Let f : R → R and suppose a bead of mass m > 0 is sliding
without friction along the graph of f under the force of gravity. Let g denote the
acceleration due to gravity, and write the trajectory of the bead as (x(t), f (x(t))).
What are the equations of motion satisfied by x(t). [Hint: Follow Example 2.4, using
the appropriate expressions for kinetic and potential energy. Note that kinetic energy
is not 12 mx′ (t)2 , since this only measures horizontal movement, and bead is moving
vertically as well. The potential energy is mgf (x(t)).] 4
In our setting, we may take the kinetic energy of u = u(x, t) to be
(2.24)    K(u) = ½ ∫_U u_t² dx,
and the potential energy to be the functional I(u) we wish to minimize, defined in
(2.1). Then our action functional is
(2.25)    ∫ₐᵇ ∫_U ½ u_t² − L(x, u, ∇u) dx dt.
The equations of motion are simply the Euler-Lagrange equations for (2.25), treating
the problem in Rn+1 with (x, t) as our variables, which are
−∂/∂t u_t − (Lz − div_x(∇p L(x, u, ∇u))) = 0.
This reduces to u_tt = −∇I(u), which is a nonlinear wave equation in our setting.
Unfortunately this wave equation conserves energy, and is thus useless for opti-
mization.
In order to ensure we descend on the energy I(u), we need to introduce some
dissipative agent into the problem, such as friction. In the Lagrangian setting, this
can be obtained by adding time-dependent functions into the action functional, in
the form
(2.26)    ∫ₐᵇ ∫_U e^{at} ( ½ u_t² − L(x, u, ∇u) ) dx dt.

The Euler-Lagrange equation for (2.26) is

−∂/∂t (e^{at} u_t) − e^{at} (Lz − div_x(∇p L(x, u, ∇u))) = 0,

which reduces to

(2.27)    u_tt + a u_t = −∇I(u).
Notice that in the descent equation (2.27), the gradient ∇I(u) acts as a forcing term
that causes acceleration, whereas in gradient descent u_t = −∇I(u), the gradient is the
velocity. The additional term a u_t is a damping term, with the physical interpretation
of friction for a rolling (or sliding) ball, with a > 0 the coefficient of friction. The
damping term dissipates energy, allowing the system to converge to a critical point of
I. We note it is possible to make other modifications to the action functional (2.26)
to obtain a more general class of descent equations (see, e.g., [17]).
The accelerated descent equation (2.27) does not necessarily descend on I(u).
However, since we derived the equation from a Lagrangian variational formulation,
we do have monotonicity of total energy.
Lemma 2.15 (Energy monotonicity [17, Lemma 1]). Assume a ≥ 0 and let u satisfy
(2.27). Suppose either u(x, t) = g(x) or ∇p L(x, u, ∇u) · ν = 0 for x ∈ ∂U and t > 0.
Then for all t > 0 we have

(2.28)    d/dt (K(u) + I(u)) = −2aK(u).
Proof.
(2.29)    d/dt I(u) = d/dt ∫_U L(x, u, ∇u) dx
            = ∫_U Lz(x, u, ∇u) u_t + ∇p L(x, u, ∇u) · ∇u_t dx
            = ∫_U (Lz(x, u, ∇u) − div_x(∇p L(x, u, ∇u))) u_t dx + ∫_{∂U} u_t ∇p L(x, u, ∇u) · ν dS.

The boundary term vanishes, due to the boundary conditions, and so

(2.30)    d/dt I(u) = ∫_U ∇I(u) u_t dx.
Similarly, we have
(2.31)    d/dt K(u) = ∫_U u_t u_tt dx
            = ∫_U u_t (−a u_t − ∇I(u)) dx
            = −2aK(u) − ∫_U ∇I(u) u_t dx,
where we used the equations of motion (2.27) in the second step above. Adding (2.30)
and (2.31) completes the proof.
It is possible to prove an exponential convergence rate for the accelerated descent
equations (2.27) to the solution u∗ of ∇I(u∗ ) = 0 in the strongly convex setting, as
we did in Theorem 2.11 for gradient descent. The proof uses energy methods, as in
Theorem 2.11, but is more involved, so we refer the reader to [17, Theorem 1].
There is a natural question the reader may have at this point. If both gradient
descent and accelerated gradient descent have similar exponential convergence rates,
why would acceleration converge faster than gradient descent? The answer is that
the accelerated convergence rate is invisible at the level of the continuum descent
equations (2.16) and (2.27). The acceleration is realized only when the equations are
discretized in space-time, which is necessary for computations. Recall from Remark
2.13 that the convergence rate for a discretization with time step ∆t is
µ ≈ 1 − c∆t,
where c > 0 is the exponential rate. Hence, the convergence rate µ depends crucially
on how large we can take the time step ∆t; larger time steps yield much faster con-
vergence rates. The time step is restricted by stability considerations in the discrete
schemes, also called Courant-Friedrichs-Lewy (CFL) conditions.
The gradient descent equation (2.16) is first order in time and second order in
space, so the ratio ∆t/∆x² appears in the discretization of the descent equation, where
∆x is the spatial resolution. Stability considerations require bounding this ratio, and
so ∆t ≤ C∆x² turns out to be the stability condition for gradient descent equations
in this setting. This is a very restrictive stability condition, and such equations are
called numerically stiff. This yields a convergence rate of the form

µ ≈ 1 − c∆x².

The accelerated descent equation (2.27), on the other hand, is second order in both
time and space, and here
stability conditions amount to bounding ∆t ≤ C∆x. Another way to see this is that
wave equations have bounded speed of propagation c, and the CFL condition amounts
to ensuring that the speed of propagation of the discrete scheme is at least as fast as
c, that is, ∆x/∆t ≥ c. Thus, accelerated gradient descent has a convergence rate of the
form

µ ≈ 1 − c∆x.
Thus, as a function of the grid resolution ∆x, the accelerated equation (2.27) has a
convergence rate that is a whole order of magnitude better than the gradient descent
equation (2.16). At a basic level, gradient descent is a diffusion equation, which is
numerically stiff, while accelerated descent is a wave equation, which does not exhibit
the same stiffness. The discussion above is only heuristic; we refer the reader to [17]
for more details.
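The heuristic can be checked on a small 1-D analogue (all parameters below are mine, not the 2-D setup of Table 2.1): iterate both descent equations for u′′ = 0 with u(0) = 0, u(1) = 1, and count iterations until the discrete residual is small. The damping a = 2π here plays the role of 2√λ₁ for the first Dirichlet eigenvalue λ₁ = π² on (0, 1).

```python
import numpy as np

def iterations_to_solve(accelerated, n=64, tol=1e-3, max_iter=200_000):
    """Iterate gradient descent u_t = u_xx or the damped wave equation
    u_tt + a u_t = u_xx to the solution of u'' = 0, u(0) = 0, u(1) = 1,
    counting iterations until max |u_xx| < tol."""
    h = 1.0 / n
    u = np.zeros(n + 1)
    u[-1] = 1.0                      # boundary conditions u(0)=0, u(1)=1
    u_prev = u.copy()
    dt = 0.5 * h if accelerated else 0.25 * h**2   # CFL: dt ~ h vs dt ~ h^2
    a = 2.0 * np.pi                  # damping: 2*sqrt(lambda_1), lambda_1 = pi^2
    for k in range(1, max_iter + 1):
        lap = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
        if np.max(np.abs(lap)) < tol:
            return k
        if accelerated:              # centered in time, damping term averaged
            new = (2.0 * u[1:-1] - (1.0 - 0.5 * a * dt) * u_prev[1:-1]
                   + dt**2 * lap) / (1.0 + 0.5 * a * dt)
            u_prev, u = u, u.copy()
            u[1:-1] = new
        else:                        # forward Euler in time
            u = u.copy()
            u[1:-1] += dt * lap
    return max_iter

it_gd = iterations_to_solve(accelerated=False)
it_acc = iterations_to_solve(accelerated=True)   # far fewer iterations
```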
We present below the results of a simple numerical simulation to show the differ-
ence between gradient descent and accelerated gradient descent. We consider solving
the Dirichlet problem
(2.34)    ∆u = 0 in U,    u = g on ∂U,

in which case ∇I(u) = −∆u in both (2.27) and (2.16). We choose a = 2π,¹ g(x₁, x₂) =
sin(2πx₁²) + sin(2πx₂²), and use initial condition u(x, 0) = g(x). We discretize using
the standard 5-point stencil for ∆u in n = 2 dimensions, and use forward differences
for u_t and centered differences for u_tt. We choose the largest stable time step for each
scheme and run the corresponding descent equations until the discretization of (2.34)
Table 2.1: Comparison of PDE acceleration (2.27) and gradient descent (2.16) for
solving the Dirichlet problem (2.34). Runtimes are for C code, and copied from [17].
is satisfied up to an O(∆x²) error. The iteration counts and computation times for
C code are shown in Table 2.1 for different grid resolutions.
We note that many other methods are faster for linear equations, such as multigrid
or Fast Fourier Transform. The point here is to contrast the difference between
gradient descent and accelerated gradient descent. The accelerated descent method
works equally well for non-linear problems where multigrid and FFT methods do not
work, or in the case of multigrid take substantial work to apply.
Remark 2.16. We have not taken into account discretization errors in our analysis
above. In practice, the solution is computed on a grid of spacing ∆x in the space
dimensions and ∆t in time. The discretization errors accumulate over time, and need
to be controlled to ensure convergence of the numerical scheme.
For gradient descent, the PDE is first order in time and second order in space,
and the discretization errors on U × [0, T ] are bounded by CT(∆t + ∆x²), which is
proved in standard numerical analysis books. Thus, by (2.21) the error between the
discrete solution and the minimizer u∗ is bounded by
where u∆t,∆x denotes any piecewise constant extension of the numerical solution to
U . Furthermore, since ∆t ≤ C∆x2 is necessary for stability and convergence of the
gradient descent equation, we have the bound
Hence, our numerical solution computes the true solution u∗ up to O(∆x²) accuracy
(excluding log factors) in k = T/∆t = C/∆x² iterations. The analysis for accelerated
gradient descent is identical, except here ∆t ≤ C∆x is sufficient for stability and
convergence of the scheme, and so only k = C/∆x iterations are needed for convergence.

¹The choice a = 2π is optimal. See Exercise 2.18 for an explanation in a simple setting.
Exercise 2.17. Consider the linear ordinary differential equation (ODE) version of
gradient descent
x(t) = Σᵢ dᵢ(t) vᵢ

for corresponding functions d₁(t), . . . , dₙ(t). Find the dᵢ(t) and complete the problem
from here.] △
Exercise 2.18. Consider the linear ordinary differential equation (ODE) version of
acceleration
Notes
We note that the presentation here is an adaptation of the variational interpretation of
acceleration due to Wibisono, Wilson, and Jordan [63] to the PDE setting. The
framework is called PDE acceleration; for more details see [7, 17, 56, 66]. We also
mention that the dynamical systems perspective is investigated in [3].
Chapter 3
Applications
We now continue the parade of examples by computing and solving the Euler-Lagrange
equations for the examples from Section 1.1.
3.1 Shortest path

For the shortest path problem, the Euler-Lagrange equation is

gz(x, u(x)) √(1 + u′(x)²) − d/dx [ g(x, u(x)) (1 + u′(x)²)^{−1/2} u′(x) ] = 0.
This is in general difficult to solve. In the special case that g(x, z) ≡ 1, we have gz = 0 and
this reduces to

d/dx ( u′(x) / √(1 + u′(x)²) ) = 0.

Computing the derivative yields

( √(1 + u′(x)²) u′′(x) − u′(x)(1 + u′(x)²)^{−1/2} u′(x) u′′(x) ) / (1 + u′(x)²) = 0.
Multiplying both sides by (1 + u′(x)²)^{3/2} we obtain

(1 + u′(x)²) u′′(x) − u′(x)² u′′(x) = 0.

This reduces to u′′(x) = 0, hence the solution is a straight line! This verifies our
intuition that the shortest path between two points is a straight line.
where the constant C is different from the one on the previous line. The constant C
should be chosen to ensure the boundary conditions hold.
Before solving (3.1), let us note that we can say quite a bit about the solution u
from the ODE it solves. First, since u(a) = b < 0, the left hand side must be negative
somewhere, hence C < 0. Solving for u′(x)² we have

u′(x)² = (C − u(x)) / u(x).
This tells us two things. First, since the left hand side is positive, so is the right
hand side. Hence C − u(x) ≤ 0 and so u(x) ≥ C. If u attains its maximum at an
interior point 0 < x < a then u′ (x) = 0 and u(x) = C, hence u is constant. This is
impossible, so the maximum of u is attained at x = 0, and we have C ≤ u(x) ≤ u(0) = 0.
Since x = 0 is a local maximum of u, we have u′(x) < 0 for x > 0 small. Differentiating (3.1) yields

1 + u′(x)² = −2u(x)u′′(x).
Since the left hand side is positive, so is the right hand side, and we deduce u′′ (x) > 0.
Therefore u′ (x) is strictly increasing, and u is a convex function of x. In fact, we can
say slightly more. Solving for u′′(x) and using (3.1) we have

u′′(x) = −(1 + u′(x)²)/(2u(x)) = −(1 + u′(x)²)²/(2C) ≥ −1/(2C) > 0.
This means that u is uniformly convex, and for large enough x we will have u′ (x) > 0,
so the optimal curve could eventually be increasing.
In fact, since u′ is strictly increasing, there exists (by the intermediate value
theorem) a unique point x∗ > 0 such that u′ (x∗ ) = 0. We claim that u is symmetric
about this point, that is
u(x∗ + x) = u(x∗ − x).
To see this, write w(x) = u(x∗ + x) and v(x) = u(x∗ − x). Then

w(x) + w(x)w′(x)² = u(x∗ + x) + u(x∗ + x)u′(x∗ + x)² = C

and

v(x) + v(x)v′(x)² = u(x∗ − x) + u(x∗ − x)u′(x∗ − x)² = C.
Differentiating the equations above, we find that both w and v satisfy the second
order ODE (3.2). We also note that w(0) = u(x∗ ) = v(0), and w′ (0) = u′ (x∗ ) = 0 and
v ′ (0) = −u′ (x∗ ) = 0, so w′ (0) = v ′ (0). Since solutions of the initial value problem for
(3.2) are unique, provided their values stay away from zero, we find that w(x) = v(x),
which establishes the claim. The point of this discussion is that without explicitly
computing the solution, we can say quite a bit both quantitatively and qualitatively
about the solutions.
Figure 3.1: Family of brachistochrone curves. The straight line is the line x + (π/2)y = 0
passing through the minima of all brachistochrone curves.
Now, suppose instead of releasing the bead from the top of the curve, we release
the bead from some position (x0 , u(x0 )) further down (but before the minimum) on
the brachistochrone curve. How long does it take the bead to reach the lowest point
on the curve? It turns out this time is the same regardless of where you place the
bead on the curve! To see why, we recall that conservation of energy gives us
½ m v(x)² + m g u(x) = m g u(x₀),
where v(x) is the velocity of the bead. Therefore

v(x) = √(2g(u(x₀) − u(x))).
Recall that

1 + u′(x)² = 1/sin²θ,    u(x) = y(θ) = C sin²θ,    and    dx = −2C sin²θ dθ.
Make the change of variables t = −cos θ/cos θ₀. Then cos θ₀ dt = sin θ dθ and

T = √(−2C/g) ∫₋₁⁰ 1/√(1 − t²) dt.
We can integrate this directly to obtain

T = √(−2C/g) (arcsin(0) − arcsin(−1)) = π √(−C/(2g)).
Notice this is independent of the initial position x0 at which the bead is released!
A curve with the property that the time taken by an object sliding down the curve
to its lowest point is independent of the starting position is called a tautochrone,
or isochrone curve. So it turns out that the tautochrone curve is the same as the
brachistochrone curve. The words tautochrone and isochrone are ancient Greek for
same-time and equal-time, respectively.
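As a sanity check, the tautochrone property can be verified numerically by integrating T = ∫ ds/v directly along the curve u = C sin²θ. The sketch below uses the illustrative choices C = −1 and g = 9.81; the substitution θ = θ0 + (π/2 − θ0)s² removes the integrable singularity at the release point so that simple midpoint quadrature converges.

```python
import math

def descent_time(theta0, C=-1.0, g=9.81, N=20000):
    """Time for a bead released at angle theta0 to reach the lowest point
    (theta = pi/2) of the brachistochrone u = C*sin^2(theta), computed by
    midpoint quadrature of T = integral ds/v."""
    a = math.pi / 2 - theta0
    T = 0.0
    for i in range(N):
        s = (i + 0.5) / N
        theta = theta0 + a * s * s            # substitution removes 1/sqrt singularity
        ds = -2 * C * math.sin(theta)         # ds/dtheta = -2C sin(theta), with C < 0
        v = math.sqrt(2 * g * (-C) * (math.sin(theta) ** 2 - math.sin(theta0) ** 2))
        T += (ds / v) * (2 * a * s) / N       # dtheta = 2*a*s ds
    return T

# the descent time pi*sqrt(-C/(2g)) is independent of the release point theta0
for theta0 in (0.3, 0.8, 1.3):
    print(theta0, descent_time(theta0))
```

All three printed times agree with π√(−C/(2g)), matching the computation above.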
Even though minimal surfaces are defined in dimension n = 2, it can still be mathe-
matically interesting to consider the general case of arbitrary dimension n ≥ 2.
From the form of L we see that Lz(x, z, p) = 0 and
∇p L(x, z, p) = p/√(1 + |p|²).
Therefore, the Euler-Lagrange equation for the minimal surface problem is
(3.3)  div( ∇u/√(1 + |∇u|²) ) = 0  in U
subject to u = g on ∂U . This is called the minimal surface equation. Using the chain
rule, we can rewrite the PDE as
∇( 1/√(1 + |∇u|²) ) · ∇u + (1/√(1 + |∇u|²)) div(∇u) = 0.
Notice that
∂/∂xj ( 1/√(1 + |∇u|²) ) = −(1/2)(1 + |∇u|²)^{−3/2} Σ_{i=1}^d 2 uxi uxixj.
Therefore the PDE becomes
−(1/(1 + |∇u|²)^{3/2}) Σ_{i,j=1}^d uxixj uxi uxj + ∆u/√(1 + |∇u|²) = 0.
Multiplying both sides by (1 + |∇u|²)^{3/2} we have
(3.4)  −Σ_{i,j=1}^d uxixj uxi uxj + (1 + |∇u|²)∆u = 0.
Notice that any linear function u(x) = a · x + b is a solution of (3.4), since all of its second derivatives vanish. In dimension n = 2, Eq. (3.4) reads
−ux1x1 u²x1 − 2ux1x2 ux1 ux2 − ux2x2 u²x2 + (1 + u²x1 + u²x2)(ux1x1 + ux2x2) = 0,
which reduces to
(1 + u²x2)ux1x1 − 2ux1 ux2 ux1x2 + (1 + u²x1)ux2x2 = 0.
Since u′(x)² ≥ 0, we deduce that c²u(x)² ≥ 1 for all x. Since u(L) = r we require
(3.9)  c ≥ 1/r.
To solve (3.8), we use a clever trick: We differentiate both sides to find
2u′(x)u′′(x) = 2c²u(x)u′(x),
or
u′′(x) = c²u(x).
Notice we have converted a nonlinear ODE into a linear ODE! Since c² > 0 the general solution is
u(x) = (A/2)e^{cx} + (B/2)e^{−cx}.
Figure 3.2: Depiction of some quantities from the derivation of the shape of a hanging chain. The shape of the curve is deduced by balancing the force of gravity and tension on the blue segment above.
Remark 3.3 (The shape of a hanging chain). We pause for a moment to point out
that the catenary curve is also the shape that an idealized hanging chain or cable
assumes. Due to this property, an inverted catenary, or catenary arch, has been used
in architectural designs since ancient times.
To see this, we assume the chain has length ℓ, with ℓ > 2L, and is supported at
x = ±L at a height of r, thus the shape of the chain w(x) is a positive even function
satisfying w(±L) = r. Consider the segment of the chain lying above the interval
[0, x] for x > 0. Let s(x) denote the arclength of this segment, given by
s(x) = ∫₀ˣ √(1 + w′(t)²) dt.
If the chain has density ρ, then the force of gravity acting on this segment of the
chain is F = −ρs(x)g in the vertical direction, where g is acceleration due to gravity.
We also have tension forces within the rope acting on the segment at both ends of the
interval. Let Tx denote the magnitude of the tension vector in the rope at position
x, noting that the tension vector is always directed tangential to the curve. Since
w′ (0) = 0, as w is even, the tension vector at 0 is (−T0 , 0). At the right point x of the
interval, the tension vector is Tx = (Tx cos θ, Tx sin θ), where θ is the angle between
the tangent vector to w and the x-axis. Since the chain is at rest, the sum of the
forces acting on the segment [0, x] of rope must vanish, which yields
Tx cos θ = T0  and  Tx sin θ = ρg s(x).
Therefore
w′(x) = tan θ = (Tx sin θ)/(Tx cos θ) = (ρg/T0) s(x).
Setting c = ρg/T0 and differentiating above we obtain w′′(x) = c√(1 + w′(x)²). Noting the similarity to (3.8), we can use the methods above to obtain the solution
w′(x) = (cA/2)e^{cx} + (cB/2)e^{−cx},
for some constants A and B. Since w(x) is even, w′ (x) is odd, and so B = −A and
we have w′ (x) = cA sinh(cx). Integrating again yields
w(x) = A cosh(cx) + C
for an integration constant C. The difference now with the minimal surface problem
is that A, C are the free parameters, while c is fixed.
Note that for fixed A, we can choose C so that w(L) = r, yielding C = r − A cosh(cL). We set the parameter A by asking that the curve has length ℓ, that is
(3.11)  ℓ = 2∫₀^L √(1 + w′(x)²) dx = 2∫₀^L √(1 + A²c² sinh²(cx)) dx.
Unless Ac = 1, the integral on the right is not expressible in terms of simple functions.
Nevertheless, we can see that the length of w is continuous and strictly increasing in
A, and tends to ∞ as A → ∞. Since the length of w for A = 0 is 2L, and ℓ > 2L,
there exists, by the intermediate value theorem, a unique value of A for which (3.11)
holds.
[Figure 3.3 shows the graph y = cosh(C) together with the lines y = θC for θ = 1, 3/2, and 3.]
Figure 3.3: When θ > θ0 there are two solutions of (3.12). When θ = θ0 there is
one solution, and when θ < θ0 there are no solutions. By numerical computations,
θ0 ≈ 1.509.
This equation is not always solvable, and depends on the value of θ = r/L, that is, on the ratio of the radius r of the rings to L, which is half of the separation distance.
There is a threshold value θ0 such that for θ > θ0 there are two solutions C1 < C2 of
(3.12). When θ = θ0 there is one solution C, and when θ < θ0 there are no solutions.
See Figure 3.3 for an illustration. To rephrase this, if θ < θ0 or r < Lθ0 , then the
rings are too far apart and there is no minimal surface spanning the two rings. If
r ≥ Lθ0 then the rings are close enough together and a minimal surface exists. From
numerical computations, θ0 ≈ 1.509.
Now, when there are two solutions C1 < C2 , which one gives the smallest surface
area? We claim it is C1 . To avoid complicated details, we give here a heuristic
argument to justify this claim. Let c1 < c2 such that C1 = c1 L and C2 = c2 L. So we
have two potential solutions
u1(x) = (1/c1) cosh(c1 x)  and  u2(x) = (1/c2) cosh(c2 x).
Figure 3.4: Simulation of the minimal surface of revolution for two rings being slowly
pulled apart. The rings are located at x = −L and x = L where L ranges from (left)
L = 0.095 to (right) L = 0.662, and both rings have radius r = 1. For larger L the
soap bubble will collapse.
Since u1(0) = 1/c1 ≥ 1/c2 = u2(0), we have u1 ≥ u2. In other words, as we increase c the
solution decreases. Now, as we pull the rings apart we expect the solution to decrease
(the soap bubble becomes thinner), so the value of c should increase as the rings are
pulled apart. As the rings are pulled apart L is increasing, so θ = r/L is decreasing.
From Figure 3.3 we see that C2 is decreasing as θ decreases. Since C2 = c2 L and L
is increasing, c2 must be decreasing as the rings are pulled apart. In other words,
u2 is increasing as the rings are pulled apart, so u2 is a non-physical solution. The
minimal surface is therefore given by u1 . Figure 3.4 shows the solutions u1 as the two
rings pull apart, and Figure 3.5 shows non-physical solutions u2 .
We can also explicitly compute the minimal surface area for c = c1 . We have
(3.13)  I(u) = 2π ∫_{−L}^{L} u(x)√(1 + u′(x)²) dx = 4πc ∫₀^L u(x)² dx,
where we used the Euler-Lagrange equation (3.7) and the fact that u is even in the
last step above. Substituting u′′ (x) = c2 u(x) and integrating by parts we have
I(u) = (4π/c) ∫₀^L u′′(x)u(x) dx
     = (4π/c) u′(x)u(x)|₀^L − (4π/c) ∫₀^L u′(x)² dx
     = (4π/c) u′(L)u(L) − (4π/c) ∫₀^L u′(x)² dx.
Using (3.8) we have u′(x)² = c²u(x)² − 1 and so
I(u) = (4πu(L)/c) √(c²u(L)² − 1) − (4π/c) ∫₀^L (c²u(x)² − 1) dx.
Figure 3.5: Illustration of solutions of the minimal surface equation that do not
minimize surface area. The details are identical to Figure 3.4, except that we select
c2 instead of c1 . Notice the soap bubble is growing as the rings are pulled apart,
which is the opposite of what we expect to occur.
Evaluating the integral using u(x) = (1/c)cosh(cx), u(L) = r, and cosh(cL) = cr, we have
I(u) = (2π/c)(r sinh(cL) + L).
While it is not possible to analytically solve for c1 and c2 , we can numerically com-
pute the values to arbitrarily high precision with our favorite root-finding algorithm.
Most root-finding algorithms require one to provide an initial interval in which the
solution is to be found. We already showed (see (3.9)) that c ≥ 1/r. For an upper
bound, recall we have the Taylor series
cosh(x) = 1 + x²/2 + x⁴/4! + ··· = Σ_{k=0}^∞ x^{2k}/(2k)!,
and so cosh(x) ≥ x²/2. Recall also that c1 and c2 are solutions of
cosh(cL) = cr.
Therefore, if c²L²/2 > cr we know that cosh(cL) > cr. This gives the bounds
1/r ≤ ci ≤ 2r/L²  (i = 1, 2).
Furthermore, the point c∗ where the slope of c ↦ cosh(cL) equals r lies between c1
and c2 . Therefore
c1 < c ∗ < c 2 ,
where L sinh(c∗L) = r, or
c∗ = (1/L) sinh⁻¹(r/L).
Thus, if cosh(c∗ L) = c∗ r, then there is exactly one solution c1 = c2 = c∗ . If
cosh(c∗ L) < c∗ r then there are two solutions
(3.15)  1/r ≤ c1 < c∗ < c2 ≤ 2r/L².
Otherwise, if cosh(c∗ L) > c∗ r then there are no solutions. Now that we have the
bounds (3.15) we can use any root-finding algorithm to determine the values of c1
and c2 . In the code I showed in class I used a simple bisection search.
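That bisection search can be sketched as follows (a minimal stdlib-only version; the bracketing intervals come directly from the bounds (3.15), with c∗ = (1/L) sinh⁻¹(r/L)):

```python
import math

def catenoid_constants(L, r, tol=1e-12):
    """Find the roots c1 <= c2 of cosh(c*L) = c*r by bisection on the
    brackets [1/r, c*] and [c*, 2r/L^2] from (3.15). Returns None when
    theta = r/L is below the threshold theta0 (rings too far apart)."""
    f = lambda c: math.cosh(c * L) - c * r
    cstar = math.asinh(r / L) / L        # slope of c -> cosh(cL) equals r here
    if f(cstar) > 0:
        return None                      # no minimal surface exists
    def bisect(lo, hi):
        flo = f(lo)                      # f changes sign on [lo, hi]
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if (f(mid) > 0) == (flo > 0):
                lo, flo = mid, f(mid)
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return bisect(1 / r, cstar), bisect(cstar, 2 * r / L ** 2)
```

For example, with r = 1 and L = 0.3 we have θ ≈ 3.33 > θ0 and two roots are returned, while r = L = 1 gives θ = 1 < θ0 ≈ 1.509 and the function returns None.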
where h < ℓ/2. The largest area for this rectangle is attained when ℓ/2 − 2h = 0, or h = ℓ/4. That is, the rectangle is a square! The area of the square is
A = ℓ²/16.
Can we do better? We can try regular polygons with more sides, such as a
pentagon, hexagon, etc., and we will find that the area increases with the number of
sides. In the limit as the number of sides tends to infinity we get a circle, so perhaps
the circle is a good guess for the optimal shape. If C is a circle of radius r > 0 then
2πr = ℓ, so r = ℓ/(2π) and
A = πr² = ℓ²/(4π).
This is clearly better than the square, since π < 4.
The question again is: Can we do better still? Is there some shape we have not
thought of that would give larger area than a circle while having the same perimeter?
We might expect the answer is no, but lack of imagination is not a convincing proof.
We will prove shortly that, as expected, the circle gives the largest area for a fixed
perimeter. Thus for any simple closed curve C we have the isoperimetric inequality
(3.16)  4πA ≤ ℓ²,
where equality holds only when C is a circle of radius r = ℓ/(2π).
We give here a short proof of the isoperimetric inequality (3.16) using the Lagrange
multiplier method in the calculus of variations. Let the curve C be parametrized by
(x(t), y(t)) for t ∈ [0, 1]. For notational simplicity we write x = x1 and y = x2 in
this section. Since C is a simple closed curve, we may also assume without loss of
generality that (x(0), y(0)) = (x(1), y(1)).
Let U denote the interior of C. Then the area enclosed by the curve C is
A(x, y) = ∫_U dx = ∫_U div(F) dx,
where F is the vector field F(x, y) = (1/2)(x, y). By the divergence theorem
A(x, y) = ∫_{∂U} F · ν dS,
where ν is the outward unit normal to ∂U. Provided we take the curve to have positive orientation, we have
ν = (y′(t), −x′(t))/√(x′(t)² + y′(t)²)  and  dS = √(x′(t)² + y′(t)²) dt.
Therefore
A(x, y) = (1/2)∫₀¹ (x(t), y(t)) · (y′(t), −x′(t)) dt = (1/2)∫₀¹ x(t)y′(t) − x′(t)y(t) dt.
Similarly
∇y A(x, y) = −(1/2)x′(t) − (d/dt)((1/2)x(t)) = −x′(t),
and
∇y ℓ(x, y) = −(d/dt)( y′(t)/√(x′(t)² + y′(t)²) ).
Then the gradients of A and ℓ are defined as
∇A(x, y) = (∇x A(x, y), ∇y A(x, y))  and  ∇ℓ(x, y) = (∇x ℓ(x, y), ∇y ℓ(x, y)).
Following (2.15), the necessary conditions for our constrained optimization problem
are
∇A(x, y) + λ∇ℓ(x, y) = 0,
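The parametric formulas for A and ℓ above are easy to check numerically; the sketch below (midpoint quadrature with centered finite differences, tested on a circle and an ellipse as illustrative curves) confirms the isoperimetric inequality 4πA ≤ ℓ², with equality only for the circle:

```python
import math

def area_and_length(x, y, n=20000):
    """A = (1/2) oint x y' - x' y dt and l = oint sqrt(x'^2 + y'^2) dt for a
    closed curve t -> (x(t), y(t)), t in [0, 1], with positive orientation."""
    h = 1.0 / n
    A = ell = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        xp = (x(t + h / 2) - x(t - h / 2)) / h   # centered difference x'(t)
        yp = (y(t + h / 2) - y(t - h / 2)) / h
        A += 0.5 * (x(t) * yp - xp * y(t)) * h
        ell += math.hypot(xp, yp) * h
    return A, ell

circle = (lambda t: math.cos(2 * math.pi * t), lambda t: math.sin(2 * math.pi * t))
ellipse = (lambda t: 2 * math.cos(2 * math.pi * t), lambda t: math.sin(2 * math.pi * t))
for name, (xf, yf) in (("circle", circle), ("ellipse", ellipse)):
    A, ell = area_and_length(xf, yf)
    print(name, A, ell)
```

For the circle 4πA agrees with ℓ² up to quadrature error, while for the ellipse the inequality is strict.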
is not differentiable at p = 0. This causes some minor issues with numerical sim-
ulations, so it is common to take an approximation of the TV functional that is
differentiable. One popular choice is
(3.19)  Iε(u) = ∫_U (1/2)(f − u)² + λ√(|∇u|² + ε²) dx,
for which
Lε,z(x, z, p) = z − f(x)  and  ∇p Lε(x, z, p) = λp/√(|p|² + ε²).
Gradient descent on (3.19) is the partial differential equation
(3.20)  ut + u − f − λ div( ∇u/√(|∇u|² + ε²) ) = 0  for x ∈ U, t > 0,
with initial condition u(x, 0) = f(x) and boundary conditions ∂u/∂ν = 0 on ∂U. This is a nonlinear heat equation where the thermal conductivity
κ = 1/√(|∇u|² + ε²)
Figure 3.6: Example of denoising a noisy signal with the total variation restoration algorithm. Notice the noise is removed while the edges are preserved.
depends on ∇u. Discretized on a grid with spacing ∆t and ∆x, the stability condition
for the scheme is
∆t ≤ ε∆x²/(4λ).
Hence, the scheme is numerically stiff when ∆x is small or when ε > 0 is small.
We can also solve (3.20) with accelerated gradient descent, described in Section
2.4, which leads to the accelerated descent equation
utt + a ut + u − λ div( ∇u/√(|∇u|² + ε²) ) = f  for x ∈ U, t > 0,
where a > 0 is the coefficient of friction (damping). The CFL stability condition for
accelerated descent is
∆t ≤ √( ε∆x²/(4λ + ε∆x²) ).
The scheme is only numerically stiff when ε > 0 is small or λ > 0 is large. However,
for accelerated descent, a nonlinear phenomenon allows us to violate the stability
condition and roughly set ∆t ≈ ∆x, independent of ε > 0. We refer the reader to [7]
for details.
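A minimal one-dimensional version of the explicit scheme for (3.20) can be sketched as follows (stdlib-only; Neumann boundary conditions, time step taken at the stability bound ∆t = ε∆x²/(4λ); the grid size, parameters, and test signal are illustrative choices):

```python
import math

def tv_descent_1d(f, lam=0.1, eps=0.1, dx=0.02, t_final=1.0):
    """Explicit gradient descent u_t = (f - u) + lam*(u_x/sqrt(u_x^2+eps^2))_x
    for the regularized TV energy (3.19), with dt = eps*dx^2/(4*lam)."""
    u = list(f)
    dt = eps * dx * dx / (4 * lam)
    for _ in range(int(t_final / dt)):
        q = [0.0]                          # zero flux at the left boundary
        for i in range(len(u) - 1):
            du = (u[i + 1] - u[i]) / dx    # flux q_{i+1/2} at cell interfaces
            q.append(du / math.sqrt(du * du + eps * eps))
        q.append(0.0)                      # zero flux at the right boundary
        u = [u[i] + dt * (f[i] - u[i] + lam * (q[i + 1] - q[i]) / dx)
             for i in range(len(u))]
    return u

def tv_energy(u, f, lam=0.1, eps=0.1, dx=0.02):
    fid = sum(0.5 * (fi - ui) ** 2 for fi, ui in zip(f, u)) * dx
    du = [(u[i + 1] - u[i]) / dx for i in range(len(u) - 1)]
    return fid + lam * sum(math.sqrt(d * d + eps * eps) for d in du) * dx

# noisy step signal: the descent removes oscillations but keeps the jump
n = 50
f = [(0.0 if i < n // 2 else 1.0) + 0.2 * math.sin(0.8 * i) for i in range(n)]
u = tv_descent_1d(f)
print(tv_energy(u, f), tv_energy(f, f))
```

Since u(·, 0) = f, the energy Iε decreases along the flow, which the printed values confirm.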
Figure 3.6 shows a one dimensional example of denoising a signal with the TV
restoration algorithm. Notice the algorithm can remove noise while preserving the
sharp discontinuities in the signal. This suggests that minimizers of the TV restora-
tion functional can have discontinuities. Figure 3.7 shows an example of TV image
restoration of a noisy image of Vincent hall. The top image is the noisy image, the
middle image is TV restoration with a small value of λ, and the bottom image is TV
restoration with a large value of λ. Notice that as λ is increased, the image becomes
Figure 3.7: Example of TV image restoration: (top) noisy image (middle) TV restora-
tion with small value for λ, and (bottom) TV restoration with large value for λ.
smoother, and many fine details are removed. We also observe that for small λ the
algorithm is capable of preserving edges and fine details while removing unwanted
noise.
If Φ is convex, then by convex duality we have Φ∗∗ := (Φ∗ )∗ = Φ, that is we have the
representation
(3.22)  Φ(x) := sup_{p∈ℝᵈ} {x · p − Φ∗(p)}.
Exercise 3.4. Prove the assertion above for d = 1 and Φ smooth and strongly
convex. 4
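For intuition, the biconjugation identity Φ∗∗ = Φ can be checked numerically by computing discrete Legendre-Fenchel transforms on a grid; the sketch below uses the illustrative choice Φ(x) = x², for which Φ∗(p) = p²/4.

```python
def conjugate(phi, grid):
    """Discrete Legendre-Fenchel transform: phi*(p) = max_x (x*p - phi(x))."""
    return lambda p: max(x * p - phi(x) for x in grid)

grid = [i * 0.01 - 5.0 for i in range(1001)]   # uniform grid on [-5, 5]
phi = lambda x: x * x                          # Phi(x) = x^2, so Phi*(p) = p^2/4
phi_star = conjugate(phi, grid)
phi_starstar = conjugate(phi_star, grid)
print(phi_star(2.0), phi_starstar(1.0))        # ~1.0 and ~1.0
```

The second printed value recovers Φ(1) = 1, as the representation (3.22) predicts.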
We assume Φ : Rd → R is convex and consider for simplicity the problem
(3.23)  min_u ∫_U Ψ(x, u) + Φ(∇u) dx.
Using the representation (3.22) of Φ, we can rewrite (3.23) as
(3.24)  min_u max_p ∫_U Ψ(x, u) + p · ∇u − Φ∗(p) dx,
where p : U → ℝᵈ. Here, u is the primal variable, p is the dual variable, and (3.24)
is called a saddle point formulation. Notice we are dualizing only in the gradient
variable ∇u, in which the original functional is nonsmooth, and leaving the variable
u, in which the functional is smooth, unchanged.
Provided u p · ν = 0 on ∂U, we can integrate by parts to express the problem as
(3.25)  min_u max_p ∫_U Ψ(x, u) − u div(p) − Φ∗(p) dx.
In general, primal dual methods work by alternatively taking small steps in the di-
rection of minimizing with respect to u and maximizing with respect to p, with either
gradient descent steps, or proximal updates, which corresponds to implicit gradient
descent. Either (3.24) or (3.25) may be used in the descent equations, and often
(3.24) is more convenient for the dual update while (3.25) is more convenient for the
primal update.
For some applications, things can be simplified significantly. We consider Total Variation image restoration where Ψ(x, z) = (1/2λ)(z − f(x))² and Φ(p) = |p|. The convex dual is easily computed to be
Φ∗(v) = sup_{p∈ℝᵈ} {p · v − |p|} = 0 if |v| ≤ 1, and ∞ if |v| > 1,
and the representation of Φ via convex duality in (3.22) is the familiar formula
|p| = max_{|v|≤1} p · v.
Therefore
∇T (p) = ∇(λ div(p) − f ).
To handle the constraint |p| ≤ 1 we can do a type of projected gradient ascent
(3.30)  pk+1 = ( pk + τ∇(div(pk) − f/λ) ) / ( 1 + τ|∇(div(pk) − f/λ)| ),
with time step τ > 0, starting from an initial guess p0 with |p0 | < 1, and taking
appropriate discretizations of div and ∇ on a grid with spacing ∆x above. Notice
we absorbed λ into the choice of time step, to be consistent with other presentations.
This primal dual approach, and its convergence properties, is studied closely in [18].
The iteration above converges when τ is sufficiently small, and the solution of the
Total Variation problem is obtained by evaluating (3.28).
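A one-dimensional sketch of the iteration (3.30) follows (unit grid spacing with λ absorbed into the time step, as in the text; since (3.28) is not reproduced in this excerpt, we recover the restored signal as u = f − λ div(p), the standard formula in Chambolle's dual method, which is an assumption here):

```python
def tv_dual_1d(f, lam=2.0, tau=0.1, iters=1000):
    """Projected dual ascent (3.30) for 1D TV restoration on a unit grid.
    p lives on cell interfaces; div(p)_i = p_i - p_{i-1}, with p = 0 outside."""
    n = len(f)
    p = [0.0] * (n - 1)
    def div(p):
        return [(p[i] if i < n - 1 else 0.0) - (p[i - 1] if i > 0 else 0.0)
                for i in range(n)]
    for _ in range(iters):
        w = [d - fi / lam for d, fi in zip(div(p), f)]
        g = [w[i + 1] - w[i] for i in range(n - 1)]   # gradient of div(p) - f/lam
        p = [(pi + tau * gi) / (1 + tau * abs(gi)) for pi, gi in zip(p, g)]
    # recover the primal solution: u = f - lam*div(p) (assumed form of (3.28))
    return [fi - lam * di for fi, di in zip(f, div(p))], p

# noisy step signal on a unit grid
f = [(0.0 if i < 15 else 1.0) + 0.3 * (-1) ** i for i in range(30)]
u, p = tv_dual_1d(f)
```

Note that the update keeps |p| ≤ 1 automatically, since (a + x)/(1 + x) ≤ 1 whenever a ≤ 1.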
The Bregman iteration for solving constrained problems was originally proposed in [9],
and the novelty of the split Bregman method is in splitting the terms |p| and Ψ and
applying the Bregman iteration only to |p|.
We will not prove convergence of the Bregman iteration here, and refer the reader
to [47]. However, it is easy to see that if the method converges to a triple (u∗ , p∗ , b∗ ),
then the pair (u∗ , p∗ ) is a solution of the constrained problem (3.31). To see this, we
note that since bk → b∗ we have ∇u∗ = p∗ , so the constraint is satisfied. The pair
(u∗ , p∗ ) is clearly a solution of
(3.33)  (u∗, p∗) = arg min_{u,p} ∫_U Ψ(x, u) + |p| + (µ/2)|p − ∇u − b∗|² dx.
Let (û, p̂) be any pair of functions satisfying ∇û = p̂, that is, a feasible pair for the
constrained problem (3.31). By definition of (u∗ , p∗ ) we have
∫_U Ψ(x, u∗) + |p∗| + (µ/2)|p∗ − ∇u∗ − b∗|² dx ≤ ∫_U Ψ(x, û) + |p̂| + (µ/2)|p̂ − ∇û − b∗|² dx.
Due to the constraints ∇û = p̂ and ∇u∗ = p∗ this simplifies to
∫_U Ψ(x, u∗) + |p∗| dx ≤ ∫_U Ψ(x, û) + |p̂| dx.
Thus, (u∗, p∗) is a solution of the constrained problem (3.31). This holds independently of the choice of µ > 0. One may think of the split Bregman method as an iteration scheme to find, for any µ > 0, a function b∗ so that the penalized problem (3.33) is equivalent to the original constrained problem (3.31).
The efficiency of the split Bregman method relies on the ability to solve the prob-
lem
(3.34)  arg min_{u,p} ∫_U Ψ(x, u) + |p| + (µ/2)|p − ∇u − bk|² dx
This is a linear Poisson-type equation that can be solved very efficiently with either
a fast Fourier transform (FFT) or the multigrid method. Alternatively, in [31], the
authors report using only a handful of steps of gradient descent and can get by with
only approximately solving the problem.
The efficiency of the split Bregman method stems largely from the fact that the
second problem in (3.35) is pointwise, and can be solved explicitly. Note the objective
is strongly convex in p, so the minimum is attained and unique, and the critical point,
if it exists, is the minimizer. We can differentiate in p to find that the minimizer pj+1
satisfies
pj+1/|pj+1| + µ pj+1 = µ(∇uj+1 + bk),
provided pj+1 ≠ 0. Taking norms of both sides gives 1 + µ|pj+1| = µ|∇uj+1 + bk|, and so
|pj+1| = |∇uj+1 + bk| − 1/µ,
provided |∇uj+1 + bk| > 1/µ. If |∇uj+1 + bk| ≤ 1/µ, then there are no critical points where the objective is smooth, and the minimizer is pj+1 = 0, i.e., the only non-smooth point. The formula for pj+1 can be expressed compactly as
pj+1 = shrink(∇uj+1 + bk, 1/µ),  where  shrink(x, r) = (x/|x|) max(|x| − r, 0).
The shrinkage operator is extremely fast and requires only a few operations per
element of pj , which is the key to the computational efficiency of the split Bregman
method.
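The shrinkage (soft-thresholding) operator takes only a few lines; the sketch below also checks optimality for the pointwise subproblem min_p |p| + (µ/2)|p − q|², with q playing the role of ∇uj+1 + bk and µ = 2 an illustrative choice:

```python
import math

def shrink(x, r):
    """shrink(x, r) = (x/|x|) * max(|x| - r, 0), with shrink(0, r) = 0."""
    norm = math.sqrt(sum(t * t for t in x))
    if norm <= r:
        return [0.0] * len(x)
    s = (norm - r) / norm
    return [s * t for t in x]

# pointwise split Bregman subproblem with q = grad(u) + b and mu = 2
mu, q = 2.0, [3.0, 4.0]
p = shrink(q, 1.0 / mu)
print(p)   # [2.7, 3.6]
```

Since |q| = 5 and r = 1/2, the minimizer scales q by (5 − 1/2)/5 = 0.9, and the objective is strictly convex, so nearby perturbations only increase it.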
As an example, let us consider the one dimensional case and ignore the fidelity
term, since it does not involve the derivative of u. Hence, we consider the functional
Jp(u) = ∫_{−1}^{1} |u′(x)|^p dx,
Hence, the functional J1 (u) actually only depends on the boundary values u(1) and
u(−1), provided u is increasing. This is the reason why the Euler-Lagrange equation
is degenerate; every increasing function satisfying the boundary conditions u(−1) = 0
and u(1) = 1 is a minimizer. Thus, the TV functional does not care how the function
gets from u(−1) = 0 to u(1) = 1, provided the function is increasing. So the linear
function u(x) = (1/2)x + (1/2) is a minimizer, but so is the sequence of functions
un(x) = 0 for −1 ≤ x ≤ 0,  un(x) = nx for 0 ≤ x ≤ 1/n,  and  un(x) = 1 for 1/n ≤ x ≤ 1.
The function un has a sharp transition from zero to one with slope n between x = 0 and x = 1/n. For each n,
J1(un) = ∫₀^{1/n} n dx = 1,
and as n → ∞, un converges to a step function whose distributional derivative is δ(x),
where δ(x) is the Delta function. This explains why minimizers of the TV functional
can admit discontinuities.
Notice that if p > 1 then
Jp(un) = ∫₀^{1/n} n^p dx = n^{p−1},
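These computations are easy to reproduce with finite differences: J1(un) stays at 1 for every n, while Jp(un) with p > 1 blows up as n grows (a stdlib-only sketch; the grid size m is an illustrative choice):

```python
def J(p, n, m=200000):
    """Finite-difference approximation of J_p(u_n) = int_{-1}^1 |u_n'(x)|^p dx
    for the ramp function u_n (slope n on [0, 1/n], constant elsewhere)."""
    h = 2.0 / m
    def u(x):
        if x <= 0.0:
            return 0.0
        if x >= 1.0 / n:
            return 1.0
        return n * x
    total = 0.0
    for i in range(m):
        du = (u(-1.0 + (i + 1) * h) - u(-1.0 + i * h)) / h
        total += abs(du) ** p * h
    return total

print(J(1, 10), J(2, 10), J(1, 100), J(2, 100))
```

For p = 1 the sum telescopes to the total variation, which is exactly 1 regardless of n, while for p = 2 the value grows like n, as predicted.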
over u : U → R and real numbers a and b. Let us assume for the moment that a and
b are fixed, and I is a function of only u.
The Lagrangian is not even continuous, due to the presence of the Heaviside function H(z) and the delta function δ(z). This causes problems numerically, hence in practice we usually
Therefore
Lε,z(x, z, p) = δε(z)[ (f(x) − a)² − (f(x) − b)² ] + λδε′(z)|p|
and
∇p Lε(x, z, p) = λδε(z)p/|p|.
By the chain and product rules
div( ∇p L(x, u(x), ∇u(x)) ) = λ div( δε(u(x))∇u(x)/|∇u(x)| )
= λ (∇δε(u(x)) · ∇u(x))/|∇u(x)| + λ δε(u(x)) div( ∇u(x)/|∇u(x)| )
= λ δε′(u(x)) (∇u(x) · ∇u(x))/|∇u(x)| + λ δε(u(x)) div( ∇u(x)/|∇u(x)| )
= λ δε′(u(x))|∇u(x)| + λ δε(u(x)) div( ∇u(x)/|∇u(x)| ).
Therefore, the Euler-Lagrange equation is
(3.43)  δε(u)[ (f − a)² − (f − b)² − λ div( ∇u/|∇u| ) ] = 0  in U
Since it is easy to minimize over a and b, the idea now is to consider an alternating
minimization algorithm, whereby one fixes a, b ∈ R and takes a small gradient descent
step in the direction of minimizing Iε with respect to u, and then one freezes u and
updates a and b according to (3.44) and (3.45). We repeat this iteratively until the
values of a, b, and u remain unchanged with each new iteration.
Gradient descent on Iε with respect to u is the partial differential equation
(3.46)  ut + δε(u)[ (f − a)² − (f − b)² − λ div( ∇u/|∇u| ) ] = 0  for x ∈ U, t > 0
Figure 3.8: Illustration of gradient descent for segmenting the cameraman image. Top
left is the original image, and top right is the initialization of the gradient descent
algorithm. The lower four images show the evolution of the zero level set of u.
Figure 3.9: Segmentation of clean, blurry, and noisy versions of the cameraman image.
The energy (3.49) is called a phase-field approximation of (3.39). Notice that µ = λ⁻¹ from the original energy (3.39).
The phase-field approximation (3.49) can be efficiently solved by splitting into
two pieces, the double-well potential and the remaining portions, and alternatively
taking small steps of minimizing over each piece. Gradient descent on the double-well potential part of the energy (1/ε)W(u) would be ut = −(1/ε)W′(u). When ε > 0 is small, we approximate this step by sending t → ∞, and u converges to stable equilibrium points where W′(u) = 0 and W′′(u) > 0, i.e., u = 0 or u = 1. By examining W′(u) = 2u(1 − u)(1 − 2u), we see that W′(u) < 0 for 1/2 < u < 1 and W′(u) > 0 for 0 < u < 1/2, so the descent flow increases u above 1/2 and decreases it below 1/2. Thus any value of u larger than 1/2 gets sent to 1, while any value smaller than 1/2 is sent to 0. This step is thus thresholding.
The other part of the energy is
∫_U µ[ u²(f − a)² + (1 − u)²(f − b)² ] + ε|∇u|² dx.
The scheme runs the gradient descent equation above for a short time, alternating
with the thresholding step described above. The steps are outlined below.
Given: Image f, initial guess u0, and time step τ > 0. For k = 0, 1, 2, . . .
1. Starting from w(x, 0) = uk(x), solve the gradient descent equation above for 0 ≤ t ≤ τ to obtain w(x, τ).
2. Set uk+1(x) = 0 if w(x, τ) ≤ 1/2, and uk+1(x) = 1 if w(x, τ) > 1/2.
The constants a, b are updated after each thresholding step according to the formulas:
a = ∫_U u f dx / ∫_U u dx  and  b = ∫_U (1 − u) f dx / ∫_U (1 − u) dx.
The algorithm above is an example of threshold dynamics, which was invented in [41] for approximating motion by mean curvature, and has been widely used for other problems since. The scheme is unconditionally stable, which means that the scheme is stable and convergent for arbitrarily large time step τ > 0. This breaks the stiffness in the segmentation energy and allows very fast computations of image segmentations.
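A one-dimensional sketch of this threshold dynamics scheme follows (stdlib-only; the parameters µ = 3, ε = 0.2, τ = 1 and the two-valued test signal are illustrative choices, and the inner descent uses a few explicit Euler steps):

```python
def threshold_dynamics(f, u, mu=3.0, eps=0.2, tau=1.0, steps=5, iters=10):
    """Threshold dynamics for two-phase segmentation of a 1D signal f:
    alternate a short explicit gradient descent on
    mu*[w^2 (f-a)^2 + (1-w)^2 (f-b)^2] + eps*|w'|^2 with thresholding at 1/2."""
    n, dt = len(f), tau / steps
    for _ in range(iters):
        # region means from the current indicator u, as in the text
        a = sum(ui * fi for ui, fi in zip(u, f)) / max(sum(u), 1)
        b = sum((1 - ui) * fi for ui, fi in zip(u, f)) / max(n - sum(u), 1)
        w = [float(ui) for ui in u]
        for _ in range(steps):   # explicit Euler, Neumann boundary conditions
            lap = [w[min(i + 1, n - 1)] - 2 * w[i] + w[max(i - 1, 0)]
                   for i in range(n)]
            w = [w[i] + dt * (2 * eps * lap[i]
                              - 2 * mu * (w[i] * (f[i] - a) ** 2
                                          - (1 - w[i]) * (f[i] - b) ** 2))
                 for i in range(n)]
        u = [1 if wi > 0.5 else 0 for wi in w]
    return u

# two-valued signal; the initial guess puts the boundary in the wrong place
f = [0.2] * 30 + [0.8] * 30
u = threshold_dynamics(f, [0] * 40 + [1] * 20)
print(u)
```

Within a couple of iterations the interface snaps to the true jump in f, after which the region means and the segmentation are stationary.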
Chapter 4
The Direct Method
We now turn to the problem of determining conditions under which the functional
(4.1)  I(u) = ∫_U L(x, u(x), ∇u(x)) dx
attains its minimum value. That is, we want to understand under what conditions we
can construct a function u∗ so that I(u∗ ) ≤ I(u) for all u in an appropriate function
space. We will always assume L is a smooth function of all variables.
Example 4.1 (Optimization on Rd ). Suppose f : Rd → [0, ∞) is continuous. Then
f attains its minimum value over any closed and bounded set D ⊂ Rd . Indeed,
we can always take a sequence xm with f (xm ) → inf D f as m → ∞. Then by the
Bolzano-Weierstrass Theorem, there is a subsequence xmj and a point x0 ∈ D such
that xmj → x0. By continuity of f,
f(x0) = lim_{j→∞} f(xmj) = inf_D f.
subject to u(0) = 0 = u(1). We claim there is no solution to this problem. To see this,
let uk (x) be a sawtooth wave of period 2/k and amplitude 1/k. That is, uk (x) = x for
x ∈ [0, 1/k], uk (x) = 2/k − x for x ∈ [1/k, 2/k], uk (x) = x − 2/k for x ∈ [2/k, 3/k] and
so on. The function uk satisfies u′k (x) = 1 or u′k (x) = −1 at all but a finite number
of points of non-differentiability. Also 0 ≤ uk (x) ≤ 1/k. Therefore
I(uk) ≤ ∫₀¹ (1/k²) dx = 1/k².
If the minimum exists, then
0 ≤ min_u I(u) ≤ 1/k²
for all natural numbers k. Therefore minu I(u) would have to be zero. However, there
is no function u for which I(u) = 0. Such a function would have to satisfy u(x)2 = 0
and u′ (x)2 = 1 for all x, which are not compatible conditions. Therefore, there is no
minimizer of I, and hence no solution to this problem. 4
In contrast with Example 4.2, the minimizing sequence uk constructed for the Bolza problem converges to a function (the zero function) that cannot be easily interpreted as a feasible candidate for a minimizer of the variational problem. The gradient u′k is oscillating wildly and is nonconvergent everywhere in any strong sense.
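The estimate I(uk) ≤ 1/k² is easy to verify numerically. The sketch below assumes, as the conditions u² = 0 and u′² = 1 above suggest, the Bolza functional I(u) = ∫₀¹ (u′(x)² − 1)² + u(x)² dx, and evaluates it on the sawtooth by midpoint quadrature with the exact piecewise derivative:

```python
def bolza_energy(k, m=20000):
    """I(u_k) = int_0^1 (u_k'(x)^2 - 1)^2 + u_k(x)^2 dx for the sawtooth u_k
    of period 2/k and amplitude 1/k, by midpoint quadrature."""
    total = 0.0
    for i in range(m):
        x = (i + 0.5) / m
        y = x % (2.0 / k)
        u = y if y <= 1.0 / k else 2.0 / k - y   # sawtooth value
        up = 1.0 if y <= 1.0 / k else -1.0       # u' = +/-1 a.e.
        total += ((up * up - 1.0) ** 2 + u * u) / m
    return total

print(bolza_energy(5), bolza_energy(50))   # shrinking toward 0
```

The first term vanishes identically on the sawtooth, so the energy is just ∫ u², which is of order 1/k².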
The issue underlying both examples above is that the function space we are op-
timizing over is infinite dimensional, and so the notion of closed and bounded is not
relevant. We will need a stronger compactness criterion. Indeed, the functional I in
the Bolza example is a smooth function of u and u′ , and the minimizing sequence uk
constructed in Example 4.3 satisfies |uk | ≤ 1 and |u′k | ≤ 1, so we are minimizing I
over a closed and bounded set. These examples show that infinite dimensional spaces
are much different and we must proceed more carefully. In particular, we require
further structure assumptions on L beyond smoothness to ensure minimizers exist.
where X is a topological space and f : X → [0, ∞). A topological space is the most general space where we can make sense of limits limn→∞ xn = x. Beyond this, it is not so important that the reader knows what a topological space is, so we will not give a definition, other than to say that all spaces considered in these notes are topological spaces with some additional structure (e.g., normed linear space, Hilbert space, etc.).
The main question is: What conditions should we place on f to ensure its mini-
mum value is attained over X, that is, that a minimizer exists in (4.2)?
It is useful to consider how one may go about proving the existence of a minimizer.
Since f ≥ 0, we have inf_{x∈X} f(x) ≥ 0, and so there exists a minimizing sequence (xn) satisfying
(4.3)  lim_{n→∞} f(xn) = inf_{x∈X} f(x).
The limit of the sequence (xn ), or any subsequence thereof, is presumably a leading
candidate for a minimizer of f . If (xn ) is convergent in X, so there exists x0 ∈ X so
that xn → x0 as n → ∞, then we would like to show that f(x0) = inf_{x∈X} f(x). If f is continuous, then this follows directly from (4.3). However, functionals f of interest
are often not continuous in the correct topology. It turns out we can get by with
lower semicontinuity.
For topological spaces, there is another notion called compactness, which has a
definition in terms of open sets. In metric spaces, sequentially compact is equiva-
lent to compact, and in that setting we will just use the term compact. In general
topological spaces the notions are not the same, and for minimizing functionals we
Definition 4.3. We say that f : X → [0, ∞) is coercive if there exists λ > inf_{x∈X} f(x) such that the set
{x ∈ X : f(x) ≤ λ}
is sequentially precompact.
Theorem 4.4 is known as the direct method in the calculus of variations for prov-
ing existence of minimizers. It is important to understand how the minimizer is con-
structed as the limit of a minimizing sequence. In our setup, a minimizing sequence
uk has bounded energy, that is
(4.7)  sup_{k≥1} ∫_U L(x, uk, ∇uk) dx < ∞.
From this, we must prove estimates on appropriate norms of the sequence uk to obtain
coercivity—convergence of a subsequence ukj in some appropriate sense. Since (4.7)
only gives estimates on the integral of functions of uk and ∇uk , the best we can hope
for is that uk and ∇uk are uniformly bounded in Lp (U ) for some p ≥ 1. In the best
case we have p = ∞, and by the Arzelà-Ascoli Theorem (proof below) there exists a
subsequence ukj converging uniformly to a continuous (in fact Lipschitz) function u.
However, since u is constructed through a limit of a sequence of functions, even if we
ask that uk ∈ C 1 (U ), the uniform limit u need not be in C 1 (U ).
This means that coercivity cannot hold when X = C 1 (U ) in Theorem 4.4, and so
the classical function spaces C k (U ) are entirely incompatible with the direct method.
While we may expect minimizers to belong to C k (U ), proving this directly via the
direct method is intractable. Instead, we have to work on a larger space of functions
where it is easier to find minimizers, and treat the question of regularity of minimizers
separately from existence.
We give a brief sketch of the proof of the Arzelà-Ascoli Theorem in a special case.
for all k ≥ 1 and x ∈ (0, 1). Then there exists a subsequence ukj and a function
u ∈ C((0, 1)) satisfying
|u(x) − u(y)| ≤ C|x − y|
such that ukj → u uniformly on (0, 1) (that is, limj→∞ max0≤x≤1 |ukj (x) − u(x)| = 0).
Proof. We claim there exists a subsequence ukj such that ukj(x) is a convergent sequence for all rational coordinates x. To prove this, we enumerate the rationals r1, r2, r3, . . . , and note that each sequence (uk(ri))k is a bounded sequence, and by Bolzano-Weierstrass has a convergent subsequence. We construct the subsequence kj inductively as follows. Let (ℓ1i)i be a subsequence such that uℓ1i(r1) is convergent (by Bolzano-Weierstrass) and set k1 = ℓ11. Since (uℓ1i(r2))i is a bounded sequence, there is a convergent subsequence (uℓ1ij(r2))j, and we write ℓ2j = ℓ1ij and k2 = ℓ22. Continuing inductively, we obtain indices ki so that (ki)i≥j is a subsequence of (ℓji)i for all j. Since ki ≥ i we have that uki(rj) is convergent for all j, which establishes the claim. The argument above is called a diagonal argument, due to Cantor.
We now claim that u, defined on the rationals by u(r) = limj→∞ ukj(r), extends uniquely to a continuous function u : (0, 1) → R. To see this, let x ∈ (0, 1) be irrational. Let rij be an approximating rational sequence, so that limj→∞ rij = x. We have
|u(rij) − u(rik)| ≤ λ|rij − rik|.
It follows that the sequence (u(rij))j is a Cauchy sequence, and hence convergent. The value of the limit is independent of the choice of approximating rational sequence, since if rℓj is another such sequence with rℓj → x as j → ∞, an argument similar to the one above shows that
|u(rij) − u(rℓj)| ≤ λ|rij − rℓj| −→ 0
|ukj(x) − u(x)| = |ukj(x) − ukj(xi) + ukj(xi) − u(xi) + u(xi) − u(x)|
≤ |ukj(x) − ukj(xi)| + |ukj(xi) − u(xi)| + |u(xi) − u(x)|
≤ 2λ/M + max_{0≤i≤M} |ukj(xi) − u(xi)|.
Therefore
lim sup_{j→∞} |ukj(x) − u(x)| ≤ 2λ/M.
The space Lp(U) is a Banach space equipped with the norm ‖u‖Lp(U), provided we identify functions in Lp(U) that agree up to sets of measure zero (i.e., almost everywhere). As in any normed linear space, we say that uk → u in Lp(U) provided
‖u − uk‖Lp(U) −→ 0 as k → ∞.
If uk → u in Lp (U ), there exists a subsequence ukj such that limj→∞ ukj (x) = u(x)
for almost every x ∈ U .
When p = 2, the norm ‖u‖L²(U) arises from an inner product
(4.10)  (u, v)L²(U) = ∫_U uv dx,
in the sense that ‖u‖L²(U) = √((u, u)L²(U)). The space L²(U) is thus a Hilbert space.
The borderline cases p = 1 and p = ∞ deserve special treatment in the Calculus of
Variations, so we will not consider them here.
The norms on the Lp spaces (and any Banach space for that matter) satisfy the triangle inequality
(4.11)  ‖u + v‖Lp(U) ≤ ‖u‖Lp(U) + ‖v‖Lp(U).
Another important inequality is Hölder’s inequality
(4.12)  (u, v)L²(U) ≤ ‖u‖Lp(U) ‖v‖Lq(U),
which holds for any 1 < p, q < ∞ with 1/p + 1/q = 1. When p = q = 2, Hölder’s inequality is called the Cauchy-Schwarz inequality.
To define Sobolev spaces, we need to define the notion of a weak derivative. We
write u ∈ L1loc (U ) when u ∈ L1 (V ) for all V ⊂⊂ U .
Definition 4.7 (Weak derivative). Suppose u, v ∈ L1loc (U ) and i ∈ {1, . . . , n}. We
say that v is the xi -weak partial derivative of u, written v = uxi , provided
(4.13) ∫_U u φ_{x_i} dx = − ∫_U v φ dx

for all φ ∈ C_c^∞(U).
(4.15) W^{1,p}(U) = {weakly differentiable u ∈ L^p(U) for which ‖u‖_{W^{1,p}(U)} < ∞}.
The Sobolev spaces W 1,p (U ) are Banach spaces, and in particular, their norms
satisfy the triangle inequality (4.11). For p = 2 we write H 1 (U ) = W 1,2 (U ). The
space H 1 (U ) is a Hilbert space with inner product
(4.16) (u, v)_{H¹(U)} = ∫_U uv + ∇u · ∇v dx.
The smooth functions C^∞(U) are dense in W^{1,p}(U) and L^p(U). That is, for any u ∈ W^{1,p}(U), there exists a sequence u_m ∈ C^∞(U) such that u_m → u in W^{1,p}(U) (that is, lim_{m→∞} ‖u_m − u‖_{W^{1,p}(U)} = 0). However, a general function u ∈ W^{1,p}(U) need not be continuous, much less infinitely differentiable.
Example 4.4. Let u(x) = |x|^α for α ≠ 0. If α > −n then u is summable, and hence
a candidate for weak differentiability. Let us examine the values of α for which u is
weakly differentiable and belongs to W 1,p (U ), where U = B(0, 1). Provided x 6= 0,
u is classically differentiable and ∇u(x) = α|x|α−2 x. Define v(x) = α|x|α−2 xi . Let
φ ∈ Cc∞ (U ) and compute via the divergence theorem that
∫_U u φ_{x_i} dx = ∫_{U−B(0,ε)} u φ_{x_i} dx + ∫_{B(0,ε)} u φ_{x_i} dx
= − ∫_{U−B(0,ε)} u_{x_i} φ dx − ∫_{∂B(0,ε)} u φ ν_i dS + ∫_{B(0,ε)} u φ_{x_i} dx,
where ν is the outward normal to ∂B(0, ε). Noting that |uxi φ| ≤ C|x|α−1 and |x|α−1 ∈
L1 (U ) when α > 1 − n we have
lim_{ε→0} ∫_{U−B(0,ε)} u_{x_i} φ dx = ∫_U v φ dx
when α > 1 − n. Similarly, |uφνi | ≤ C|x|α = Cεα for x ∈ ∂B(0, ε), and so
| ∫_{∂B(0,ε)} u φ ν_i dS | ≤ C ε^{α+n−1} → 0
as ε → 0. Therefore

∫_U u φ_{x_i} dx = − ∫_U v φ dx
for all φ ∈ C_c^∞(U) provided α > 1 − n. This is exactly the definition of weak differentiability, so u is weakly differentiable with u_{x_i} = α|x|^{α−2} x_i in the weak sense when α > 1 − n. We have |u_{x_i}|^p ≤ C|x|^{p(α−1)}, and so u_{x_i} ∈ L^p(U) provided α > 1 − n/p. Therefore, when α > 1 − n/p we have u ∈ W^{1,p}(U). Notice that u is not continuous and is, in fact, unbounded at x = 0 when α < 0, which is allowed when p < n.
In fact, the situation can be much worse. Provided α > 1 − n/p, the function

u(x) = Σ_{k=1}^∞ (1/2^k) |x − r_k|^α,

where r_1, r_2, r_3, . . . is an enumeration of the points in U with rational coordinates, belongs to W^{1,p}(U); when α < 0 it is unbounded on every open subset of U.
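The computation in Example 4.4 can be checked numerically. The sketch below (the grid size, the exponent α = 1/2, and the test function φ are arbitrary illustrative choices) verifies the weak derivative identity (4.13) for u(x) = |x|^α in dimension n = 1 on U = (−1, 1):

```python
import numpy as np

alpha = 0.5
N = 1_000_000
x = -1 + 2 * (np.arange(N) + 0.5) / N        # midpoint nodes, never exactly 0
dx = 2.0 / N

u = np.abs(x) ** alpha
v = alpha * np.abs(x) ** (alpha - 2) * x     # candidate weak derivative of u
phi = (1 - x**2) ** 2 * np.exp(x)            # test function vanishing at x = -1, 1
dphi = np.exp(x) * ((1 - x**2) ** 2 - 4 * x * (1 - x**2))

lhs = np.sum(u * dphi) * dx                  # integral of u * phi'
rhs = -np.sum(v * phi) * dx                  # minus the integral of v * phi
```

The two sides agree to quadrature accuracy, even though v blows up at the origin (the singularity is integrable for α > 1 − n).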
Example 4.5. The closed unit ball B = {u ∈ L²((0, 1)) : ‖u‖_{L²((0,1))} ≤ 1} is not compact in L²((0, 1)). To see this, consider the sequence u_m(x) = √2 sin(πmx). We have

‖u_m‖²_{L²((0,1))} = ∫₀¹ 2 sin²(πmx) dx = ∫₀¹ 1 − cos(2πmx) dx = 1.
Therefore um ∈ B. However, um has no convergent subsequences. Indeed, by the
Riemann-Lebesgue Lemma in Fourier analysis
(4.19) lim_{m→∞} (u_m, v)_{L²((0,1))} = lim_{m→∞} √2 ∫₀¹ sin(πmx) v(x) dx = 0
for all v ∈ L2 ((0, 1)). Essentially, the sequence um is oscillating very rapidly, and
when measured on average against a test function v, the oscillations average out to
zero in the limit. Therefore, if some subsequence u_{m_j} were convergent in L²((0, 1)) to some u₀ ∈ L²((0, 1)), then we would have ‖u₀‖_{L²((0,1))} = 1 and

(u₀, v)_{L²((0,1))} = lim_{j→∞} (u_{m_j}, v)_{L²((0,1))} = 0

for all v ∈ L²((0, 1)). Taking v = u₀ yields ‖u₀‖_{L²((0,1))} = 0, a contradiction.
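The Riemann-Lebesgue averaging can be observed numerically. In the sketch below (the quadrature grid and the test function v are arbitrary choices), the norms ‖u_m‖_{L²} stay equal to 1 while the pairings (u_m, v)_{L²} decay as m grows:

```python
import numpy as np

N = 10_000
x = (np.arange(N) + 0.5) / N          # midpoint quadrature nodes on (0, 1)
dx = 1.0 / N

def u(m):
    return np.sqrt(2) * np.sin(np.pi * m * x)

v = x**2 * (1 - x)                    # a fixed test function in L^2((0,1))

norms = [np.sum(u(m)**2) * dx for m in (1, 10, 100)]
pairings = [np.sum(u(m) * v) * dx for m in (1, 10, 100)]
```

No subsequence of u_m can converge strongly, yet every pairing against a fixed v tends to zero.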
Theorem 4.12 (Banach-Alaoglu Theorem). Let 1 < p < ∞. The unit ball B = {u ∈
Lp (U ) : kukLp (U ) ≤ 1} is weakly sequentially compact.
Example 4.6. Theorem 4.12 does not hold for p = 1. We can take, for example,
U = (−1, 1) and the sequence of functions um (x) = m/2 for |x| ≤ 1/m and um (x) = 0
for |x| > 1/m. Note that ‖u_m‖_{L¹(U)} = 1, so the sequence is bounded in L¹(U), and
that um (x) → 0 pointwise for x 6= 0 (actually, uniformly away from x = 0). So any
weak limit in L1 (U ), if it exists, must be identically zero. However, for any smooth
function φ : U → R we have

lim_{m→∞} (u_m, φ)_{L²(U)} = lim_{m→∞} (m/2) ∫_{−1/m}^{1/m} φ(x) dx = φ(0).
Hence, the sequence um does not converge weakly to zero. In fact, the sequence
u_m converges to a measure, called the Dirac delta function. Note that ‖u_m‖_{L^p(U)} = (m/2)^{1−1/p}, and so the sequence is unbounded in L^p(U) for p > 1.
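The concentration of the sequence u_m can likewise be observed numerically (a sketch; the test function φ and the grid are arbitrary choices):

```python
import numpy as np

N = 200_000
x = -1 + 2 * (np.arange(N) + 0.5) / N     # midpoint nodes on U = (-1, 1)
dx = 2.0 / N

def pairing(m, phi):
    # (u_m, phi)_{L^2(U)} with u_m = m/2 on |x| <= 1/m and 0 otherwise
    u = np.where(np.abs(x) <= 1.0 / m, m / 2.0, 0.0)
    return np.sum(u * phi(x)) * dx

phi = lambda t: np.cos(3 * t)             # smooth test function with phi(0) = 1
vals = [pairing(m, phi) for m in (10, 100, 1000)]
```

The pairings approach φ(0) = 1, even though u_m → 0 pointwise away from x = 0.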
We note that Theorem 4.12 does hold when p = ∞, provided we interpret weak
convergence in the correct sense (which is weak-* convergence here). The issue with
these borderline cases is that L¹ and L^∞ are not reflexive Banach spaces. △
The examples above show that weak convergence does not behave well with func-
tion composition, even in the setting of the bilinear form given by the L2 inner-
product. The consequence is that our functional I defined in (4.1) is in general not
continuous in the weak topology. So by weakening the topology we have gained a
crucial compactness condition, but have lost continuity.
The issues above show how it is, in general, difficult to pass to limits in the weak
topology when one is dealing with nonlinear problems. This must be handled carefully
in our analysis of minimizers of Calculus of Variations problems. We record below a
useful lemma for passing to limits with weakly convergent sequences.
Lemma 4.13. Let 1 ≤ p < ∞. If u_m ⇀ u in L^p(U) and w_m → w in L^{p′}(U), then

lim_{m→∞} ∫_U u_m w_m dx = ∫_U u w dx.

Notice the difference with Example 4.8 is that w_m converges strongly (and not just weakly) to w in L^{p′}(U).
Proof. We write

| ∫_U u_m w_m dx − ∫_U u w dx | ≤ | ∫_U (u_m − u) w dx | + | ∫_U u_m (w_m − w) dx |
≤ | ∫_U (u_m − u) w dx | + ‖u_m‖_{L^p(U)} ‖w_m − w‖_{L^{p′}(U)},

where the last line follows from Hölder's inequality. Since u_m ⇀ u, the first term vanishes as m → ∞ by definition. Since ‖u_m‖_{L^p(U)} is bounded and w_m → w in L^{p′}(U), the second term vanishes as m → ∞ as well, which completes the proof.
Sets that are strongly closed and convex are also weakly closed. For example, the closed
linear subspace W01,p (U ) is weakly closed in W 1,p (U ). That is, if uk ⇀ u in W 1,p (U )
and uk ∈ W01,p (U ) for all k, then u ∈ W01,p (U ).
It follows from the Banach-Alaoglu Theorem that the unit ball in W 1,p (U ) is
weakly compact. However, we gain an additional, and very important, strong com-
pactness in Lp (U ) due to bounding the weak gradient in Lp (U ). This is essentially
what makes it possible to pass to limits with weak convergence, since we end up in
the setting where we can apply tools like Lemma 4.13.
(4.24) u_{m_j} → u in L^p(U).

(4.25) u_{m_j} ⇀ u in W^{1,p}(U).
Proof. Assume, by way of contradiction, that for each k ≥ 1 there exists u_k ∈ W₀^{1,p}(U) such that

‖u_k‖_{L^p(U)} > k ‖∇u_k‖_{L^p(U)}.

We may assume ‖u_k‖_{L^p(U)} = 1, so that ‖∇u_k‖_{L^p(U)} < 1/k and ∇u_k → 0 in L^p(U) as k → ∞. By Theorem 4.15 there exists a subsequence u_{k_j} and u ∈ L^p(U) such that u_{k_j} → u in L^p(U). Since ‖u_k‖_{L^p(U)} = 1 we have that ‖u‖_{L^p(U)} = 1.
and

∫_{B_r} |∇u|^p dx = r^{d−p} ∫_{B_1} |∇w|^p dx.
Therefore

‖u‖^p_{L^p(B_r)} = r^d ‖w‖^p_{L^p(B_1)} ≤ C^p r^d ‖∇w‖^p_{L^p(B_1)} = C^p r^p ‖∇u‖^p_{L^p(B_r)},
which completes the proof.
Poincaré-type inequalities are true in many other situations.
Exercise 4.18. Let 1 ≤ p ≤ ∞. Show that there exists a constant C > 0, depending on p and U, such that

(4.29) ‖u‖_{L^p(U)} ≤ C(‖Tu‖_{L^p(∂U)} + ‖∇u‖_{L^p(U)})

for all u ∈ W^{1,p}(U), where T : W^{1,p}(U) → L^p(∂U) is the trace operator. [Hint: Use a very similar argument to the proof of Theorem 4.16.] △
Exercise 4.19. Let 1 ≤ p ≤ ∞. Show that there exists a constant C > 0, depending
on p and U , such that
Exercise 4.20. Let 1 ≤ p ≤ ∞. Show that there exists a constant C > 0, depending
on p, such that
I(u) ≤ lim inf_{k→∞} I(u_k) whenever

u_k ⇀ u weakly in W^{1,q}(U).
Theorem 4.22 (Weak lower semicontinuity). Let 1 < q < ∞. Assume L is bounded below and that p ↦ L(x, z, p) is convex for each z ∈ R and x ∈ U. Then I is weakly lower semicontinuous on W^{1,q}(U).
We split up the proof of Theorem 4.22 into two parts. The first part, Lemma
4.23, contains the main ideas of the proof under the stronger condition of uniform
convergence, while the proof of Theorem 4.22, given afterwards, shows how to reduce
to this case using standard tools in analysis.
Lemma 4.23. Let 1 < q < ∞. Assume L is bounded below and convex in p. Let u_k ∈ W^{1,q}(U) and u ∈ W^{1,q}(U) such that u_k ⇀ u weakly in W^{1,q}(U), u_k → u uniformly on U, and |∇u| is uniformly bounded on U. Then

I(u) ≤ lim inf_{k→∞} I(u_k).
It follows from the Sobolev compact embedding (Theorem 4.15) that, upon passing to
a subsequence if necessary, uk → u strongly in Lq (U ). Passing to another subsequence
we have
uk → u a.e. in U.
Let m > 0. By Egoroff’s Theorem (Theorem A.29) there exists Em ⊂ U such that
|U − Em | ≤ 1/m and uk → u uniformly on Em . We define
Fm := {x ∈ U : |u(x)| + |∇u(x)| ≤ m},
and
Gm = Fm ∩ (E1 ∪ E2 ∪ · · · ∪ Em ).
Then
(4.35) uk → u uniformly on Gm
for each m. Furthermore, the sequence of sets Gm is monotone, i.e.,
G1 ⊂ G2 ⊂ G3 ⊂ · · · ⊂ Gm ⊂ Gm+1 ⊂ · · · ,
and

|U − G_m| ≤ |U − F_m| + |U − E_m| ≤ |U − F_m| + 1/m.
By the Dominated Convergence Theorem (Theorem A.28) we have

lim_{m→∞} |U − F_m| = lim_{m→∞} ∫_U 1 − χ_{F_m}(x) dx = 0,
where χFm (x) = 1 when x ∈ Fm and zero otherwise. Therefore |U − Gm | → 0 as
m → ∞.
Since L is bounded below, there exists β ∈ R such that L(x, z, p) + β ≥ 0 for all
x ∈ U , z ∈ R, and p ∈ Rd . We have
(4.36) I(u_k) + β|U| = ∫_U L(x, u_k, ∇u_k) + β dx ≥ ∫_{G_m} L(x, u_k, ∇u_k) + β dx.
We call (4.37) a coercivity condition, since it plays the role of coercivity from Section
4.1. We also need to allow for the imposition of constraints on the function u. Thus,
we write
Theorem 4.25. Let 1 < q < ∞. Assume that L is bounded below, satisfies the
coercivity condition (4.37), and is convex in p. Suppose also that the admissible set
A is nonempty. Then there exists u ∈ A such that

I(u) = min_{w∈A} I(w).
Proof. We may assume inf_{w∈A} I(w) < ∞. Select a minimizing sequence u_k ∈ A, so that I(u_k) → inf_{w∈A} I(w) as k → ∞. We can pass to a subsequence, if necessary, so that I(u_k) < ∞ for all k, and hence sup_{k≥1} I(u_k) < ∞. By the coercivity condition (4.37)

I(u_k) ≥ ∫_U α|∇u_k|^q − β dx,

and so

sup_{k≥1} ∫_U |∇u_k|^q dx < ∞.
By a Poincaré-type inequality it follows that sup_{k≥1} ‖u_k‖_{L^q(U)} < ∞. Therefore sup_{k≥1} ‖u_k‖_{W^{1,q}(U)} < ∞, and so there exists a subsequence u_{k_j} and a function u ∈ W^{1,q}(U) such that u_{k_j} ⇀ u weakly in W^{1,q}(U) as j → ∞. By Theorem 4.22 we have

I(u) ≤ lim inf_{j→∞} I(u_{k_j}) = inf_{w∈A} I(w).
are unique for 1 < q < ∞. [Hint: Note that while L(x, z, p) = |p|^q is not strongly convex in p, it is strictly convex. Use a proof similar to Theorem 4.26.] △
Since our candidate solutions live in W 1,q (U ), we need a weaker notion of solution
that does not require differentiating u twice. For motivation, recall when we derived
the Euler-Lagrange equation in Theorem 2.1 we arrived at Eq. (2.5), which reads

(4.43) ∫_U L_z(x, u, ∇u)φ + ∇_p L(x, u, ∇u) · ∇φ dx = 0,
and so ∇_p L(x, u, ∇u) ∈ L^{q′}(U), where q′ = q/(q − 1), i.e., 1/q + 1/q′ = 1. Hence, we need ∇v ∈ L^q(U) to ensure the integral in (4.44) exists (by Hölder's inequality). Similarly |L_z(x, u, ∇u)| = |u|^{q−1}, and so L_z(x, u, ∇u) ∈ L^{q′}(U), which also necessitates v ∈ L^q(U).
Following the discussion in Remark 4.30, we must place some assumptions on L so
that the weak form (4.44) of the Euler-Lagrange equation is well-defined. We assume
that
|L(x, z, p)| ≤ C(|p|^q + |z|^q + 1),
for every nonzero t ∈ [−1, 1]. Thus, we can apply the Dominated Convergence Theorem (Theorem A.28) to find that

0 = (d/dt) I(u + tv)|_{t=0} = lim_{t→0} ∫_U w_t(x) dx = ∫_U lim_{t→0} w_t(x) dx.
If L is jointly convex in (z, p), then it is easy to check that z ↦ L(x, z, p) and p ↦ L(x, z, p) are convex. The converse is not true.
For example, for u(x_1, x_2) = x_1 x_2 the quadratic form

Σ_{i=1}^{2} Σ_{j=1}^{2} u_{x_i x_j} v_i v_j = 2 v_1 v_2
is clearly not positive for, say, v_1 = 1 and v_2 = −1. However, x_1 ↦ u(x_1, x_2) is convex, since u_{x_1x_1} = 0, and x_2 ↦ u(x_1, x_2) is convex, since u_{x_2x_2} = 0. Therefore, convexity of x ↦ u(x) is not equivalent to convexity in each variable x_1 and x_2 independently. △
Proof. Let w ∈ A. Since (z, p) 7→ L(x, z, p) is convex, we can use Theorem A.38 (iii)
to obtain
L(x, w, ∇w) ≥ L(x, u, ∇u) + Lz (x, u, ∇u)(w − u) + ∇p L(x, u, ∇u) · (∇w − ∇u).
4.5.1 Regularity
It is natural to ask whether the weak solution we constructed to the Euler-Lagrange
equation (4.42) is in fact smooth, or has some additional regularity beyond W 1,q (U ).
In fact, this was one of the famous 23 problems posed by David Hilbert in 1900.
Hilbert’s 19th problem asked “Are the solutions of regular problems in the calculus
of variations necessarily analytic”? The term “regular problems” refers to strongly
convex Lagrangians L for which the Euler-Lagrange equation is uniformly elliptic,
and it is sufficient to interpret the term “analytic” as “smooth”.
The problem was resolved affirmatively by de Giorgi [23] and Nash [44], indepen-
dently. Moser later gave an alternative proof [43], and the resulting theory is often
called the de Giorgi-Nash-Moser theory. We should note that the special case of n = 2
is significantly easier, and was established earlier by Morrey [42].
We will not give proofs of regularity here, but will say a few words about the
difficulties and the main ideas behind the proofs. We consider the simple calculus of
variations problem
(4.47) min_u ∫_U L(∇u) dx,
where L is smooth and θ-strongly convex with θ > 0. The corresponding Euler-
Lagrange equation is
(4.48) div(∇_p L(∇u)) = Σ_{j=1}^{d} ∂_{x_j}(L_{p_j}(∇u)) = 0 in U.
The standard way to obtain regularity in PDE theory is to differentiate the equation
to get a PDE satisfied by the derivative uxk . For simplicity we assume u is smooth.
Differentiate (4.48) in xk to obtain
Σ_{i,j=1}^{d} ∂_{x_j}(L_{p_i p_j}(∇u) u_{x_k x_i}) = 0 in U.
Setting w = u_{x_k} and a_{ij} := L_{p_i p_j}(∇u), this is a linear equation for w:

(4.49) Σ_{i,j=1}^{d} ∂_{x_j}(a_{ij} w_{x_i}) = 0 in U.

By the θ-strong convexity of L, the coefficients satisfy

(4.50) θ|η|² ≤ Σ_{i,j=1}^{d} a_{ij} η_i η_j ≤ Θ|η|²,
for some Θ > 0. Thus (4.49) is uniformly elliptic. Notice we have converted our
problem of regularity of minimizers of (4.47) into an elliptic regularity problem.
However, since u ∈ W 1,q is not smooth, we at best have aij ∈ L∞ (U ), and in
particular, the coefficients aij are not a priori continuous, as is necessary to apply
elliptic regularity theory, such as the Schauder estimates [30]. All we know about the
coefficients is the ellipticity condition (4.50). Nonetheless, the remarkable conclusion
is that w ∈ C 0,α for some α > 0. This was proved separately by de Giorgi [23] and
Nash [44]. Since w = uxk we have u ∈ C 1,α and so aij ∈ C 0,α . Now we are in the
world of classical elliptic regularity. A version of the Schauder estimates shows that
u ∈ C 2,α , and so aij ∈ C 1,α . But another application of a different Schauder estimate
gives u ∈ C 3,α and so aij ∈ C 2,α . Iterating this argument is called bootstrapping, and
leads to the conclusion that u ∈ C ∞ is smooth.
for all p, q ∈ Rd and λ ∈ (0, 1), and consider minimizing L over the constraint set
(4.53) A = {u ∈ C 0,1 (U ) : u = g on ∂U },
The Lipschitz constant Lip(u) is the smallest number C > 0 for which
|u(x) − u(y)| ≤ C|x − y|
holds for all x, y ∈ U. By Rademacher's Theorem [28], every Lipschitz function is differentiable almost everywhere in U, and C^{0,1}(U) is continuously embedded into W^{1,q}(U) for every q ≥ 1, which means

(4.56) ‖u‖_{W^{1,q}(U)} ≤ C‖u‖_{C^{0,1}(U)}.
Our strategy to overcome the lack of coercivity will be to modify the constraint
set to one of the form
(4.57) Am = {u ∈ C 0,1 (U ) : Lip(u) ≤ m and u = g on ∂U }.
By (4.56), minimizing sequences are bounded in W 1,q (U ), which was all that coercivity
was used for. Then we will show that minimizers over A_m satisfy ‖u‖_{C^{0,1}(U)} < m by proving a priori gradient bounds on u, which will show that minimizers over A_m are the same as minimizers over A.
Theorem 4.35. Assume L is convex and bounded below. Then there exists u ∈ A_m such that I(u) = min_{w∈A_m} I(w).
Proof. Let uk ∈ Am be a minimizing sequence. By the continuous embedding (4.56),
the sequence uk is bounded in H 1 (U ) = W 1,2 (U ). Thus, we can apply the argument
from Theorem 4.25, and the Sobolev compact embeddings 4.15, to show there is a
subsequence ukj and a function u ∈ H 1 (U ) such that u = g on ∂U in the trace sense,
ukj ⇀ u weakly in H 1 (U ), ukj → u in L2 (U ), and
I(u) ≤ lim inf_{j→∞} I(u_{k_j}) = inf_{w∈A_m} I(w).
We now turn to proving a priori estimates on minimizers. For this, we need the
comparison principle and notions of sub and supersolutions. We recall Cc0,1 (U ) is the
subset of C 0,1 (U ) consisting of functions with compact support in U , that is, each
u ∈ Cc0,1 (U ) is identically zero in some neighborhood of ∂U .
Definition 4.36. We say that u ∈ C 0,1 (U ) is a subsolution if
and

v̄(x) = v(x) if x ∈ U \ K, and v̄(x) = u(x) if x ∈ K.
Since u ≤ v on ∂U , K is an open set compactly contained in U .
Note that ū ≤ u (here ū = u on U \ K and ū = v on K) and define φ := u − ū. Then ū = u − φ and φ ≥ 0 is supported on K, so φ ∈ C_c^{0,1}(U). By the definition of subsolution we have I(ū) ≥ I(u). By a similar argument we have I(v̄) ≥ I(v). Therefore

∫_U L(∇u) dx ≤ ∫_U L(∇ū) dx = ∫_{U\K} L(∇u) dx + ∫_K L(∇v) dx,

and so

∫_K L(∇u) dx ≤ ∫_K L(∇v) dx.

The opposite inequality is obtained by a similar argument and so we have

(4.60) ∫_K L(∇v) dx = ∫_K L(∇u) dx.
Define now

w(x) = u(x) if x ∈ U \ K, and w(x) = ½(u(x) + v(x)) if x ∈ K.
As above, the subsolution property yields I(w) ≥ I(u), and so we can use the argument above and (4.60) to obtain

∫_K ½(L(∇u) + L(∇v)) dx = ½ ∫_K L(∇u) dx + ½ ∫_K L(∇v) dx
= ∫_K L(∇u) dx
≤ ∫_K L((∇u + ∇v)/2) dx.
Therefore

∫_K L((∇u + ∇v)/2) − ½(L(∇u) + L(∇v)) dx ≥ 0.
By convexity of L we conclude that

L((∇u + ∇v)/2) = ½(L(∇u) + L(∇v))
a.e. in K. Since L is strictly convex, we have

L((p + q)/2) ≤ ½(L(p) + L(q))
with equality if and only if p = q. Therefore ∇u = ∇v a.e. in K, and so u − v is
constant in K. Since u = v on ∂K, we find that u ≡ v in K, which is a contradiction
to the definition of K.
Corollary 4.39. Let u and v be sub- and supersolutions, respectively. Then we have

(4.61) sup_U (u − v) ≤ sup_{∂U} (u − v).
Proof. Since I(u + λ) = I(u) for any constant λ, adding constants does not change
the property of being a sub- or supersolution. Therefore w := v + sup∂U (u − v) is a
supersolution satisfying u ≤ w on ∂U . By Theorem 4.37 we have u ≤ w in U , which
proves (4.61).
Note that for any φ ∈ C_c^{0,1}(U) we have by convexity of L that

∫_U L(0 ± ∇φ) dx ≥ ∫_U L(0) ± ∇_p L(0) · ∇φ dx = ∫_U L(0) dx,
where we used integration by parts in the last equality. Therefore, any constant
function is both a sub- and supersolution. Using (4.61) with v = 0 we have supU u ≤
sup∂U u. If u is also a supersolution, then the opposite inequality is also obtained,
yielding (4.62).
Both u and uτ are sub- and supersolutions in U ∩ Uτ . By Corollary 4.39 there exists
z ∈ ∂(U ∩ Uτ ) such that
The opposite inequality is obtained similarly. Since x1 , x2 ∈ U are arbitrary, the proof
is completed.
Lemma 4.40 reduces the problem of a priori estimates to estimates near the bound-
ary (or, rather, relative to boundary points). To complete the a priori estimates, we
need some further assumptions on ∂U and g. The simplest assumption is the bounded
slope condition, which is stated as follows:
(BSC) A function u ∈ A satisfies the bounded slope condition (BSC) if there exists m > 0 such that for every x₀ ∈ ∂U there exist affine functions w, v with |Dv| ≤ m and |Dw| ≤ m such that

u(x) − u(x₀) ≤ w(x) − w(x₀) ≤ m|x − x₀|

and

u(x) − u(x₀) ≥ v(x) − v(x₀) ≥ −m|x − x₀|.
Therefore

|u(x) − u(x₀)| / |x − x₀| ≤ m

for all x ∈ U, and so

sup_{x∈U, y∈∂U} |u(x) − u(y)| / |x − y| ≤ m.
Combining this with Lemma 4.40 completes the proof.
Finally, we prove existence and uniqueness of minimizers over A.
Theorem 4.42. Let L be strictly convex and assume g satisfies the bounded slope
condition (BSC) for a constant m > 0. Then there is a unique u ∈ A such that
I(u) = minw∈A I(w) and furthermore Lip(u) ≤ m.
Proof. By Theorem 4.35 there exists u ∈ A_{m+1} such that

I(u) = min_{w∈A_{m+1}} I(w).
By Remark 4.38, Theorems 4.37 and 4.41 apply to minimizers in Am+1 with similar
arguments, and so we find that Lip(u) ≤ m. Let w ∈ A. Then for small enough t > 0
we have that
(1 − t)u + tw ∈ Am+1
and so using convexity of L we have

I(u) ≤ I((1 − t)u + tw) = ∫_U L((1 − t)∇u + t∇w) dx ≤ (1 − t)I(u) + tI(w).
Discrete to continuum in
graph-based learning
Machine learning algorithms learn relationships between data and labels, which can
be used to automate everyday tasks that were once difficult for computers to perform,
such as image annotation or facial recognition. To describe the setup mathematically,
we denote by X the space in which our data lives, which is often Rd , and by Y the label
space, which may be Rk . Machine learning algorithms are roughly split into three
categories: (i) fully supervised, (ii) semi-supervised, and (iii) unsupervised. Fully
supervised algorithms use labeled training data (x1 , y1 ), . . . , (xn , yn ) ∈ X × Y to learn
a function f : X → Y that appropriately generalizes the rule x_i ↦ y_i. The learned
function f (x) generalizes well if it gives the correct label for datapoints x ∈ X that it
was not trained on (e.g., the testing or validation set). For a computer vision problem,
each datapoint xi would be a digital image, and the corresponding label yi could, for
example, indicate the class the image belongs to, what type of objects are in the image,
or encode a caption for the image. Convolutional neural networks trained to classify
images [37] are a modern example of fully supervised learning, though convolutional
nets are used in semi-supervised and unsupervised settings as well. Unsupervised
learning algorithms, such as clustering [36] or autoencoders [24], make use of only the
data x1 , . . . , xn and do not assume access to any label information.
An important and active field of current research is semi-supervised learning, which
refers to algorithms that learn from both labeled and unlabeled data. Here, we are
given some labeled data (x1 , y1 ), . . . , (xm , ym ) and some unlabeled data xm+1 , . . . , xn ,
where usually m ≪ n, and the task is to use properties of the unlabeled data to
obtain enhanced learning outcomes compared to using labeled data alone. In many
applications, such as medical image analysis or speech recognition, unlabeled data
is abundant, while labeled data is costly to obtain. Thus, it is important to have
efficient and reliable semi-supervised learning algorithms that are able to properly
utilize the ever-increasing amounts of unlabeled data that is available for learning
tasks.
Figure 5.1: Example of some of the handwritten digits from the MNIST dataset [40].
(5.1) Minimize E(u) over u : X → R^k, subject to u = g on Γ,
Figure 5.2: Error plots for MNIST experiment showing testing error versus number
of labels, averaged over 100 trials.
One of the most widely used smoothness functionals [73] is the graph Dirichlet energy
(5.3) E₂(u) := ½ Σ_{x,y∈X} w_{xy} |u(y) − u(x)|².
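For a graph stored as a symmetric weight matrix, the energy (5.3) can be evaluated either directly from the double sum or as a quadratic form (a minimal sketch; the random graph is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
W = rng.random((n, n))
W = (W + W.T) / 2                 # symmetric weight matrix, W[i, j] = w_xy
np.fill_diagonal(W, 0.0)
u = rng.standard_normal(n)

# direct evaluation of the double sum in (5.3)
E2_direct = 0.5 * np.sum(W * (u[None, :] - u[:, None]) ** 2)

# equivalent quadratic form: E2(u) = u^T (D - W) u with D = diag(degrees);
# the matrix D - W is the negative of the graph Laplacian in the text's
# sign convention, so this matches E2(u) = -(u, Lu)
D = np.diag(W.sum(axis=1))
E2_quadratic = u @ (D - W) @ u
```

Both evaluations agree, which is the discrete analogue of integration by parts.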
The p-Laplace problem was introduced originally in [70] for semi-supervised learning.
It was proposed more recently in [26] with large p as a superior regularizer in problems
with very few labels where Laplacian regularization gives poor results. The well-
posedness of the p-Laplacian models with very few labels was proved rigorously in [55]
for the variational graph p-Laplacian, and in [11] for the game-theoretic graph p-
Laplacian. In addition, in [11] discrete regularity properties of the graph p-harmonic
functions are proved, and the p = ∞ case, called Lipschitz learning [38], is studied in
[13]. Efficient algorithms for solving these problems are developed in recent work [29].
We also note there are alternative re-weighting approaches that have been successful
for problems with low label rates [15, 53].
Graph-based techniques are also widely used in unsupervised learning algorithms,
such as clustering. The binary clustering problem is to split X into two groups in
such a way that similar data points belong to the same group. Let A ⊂ X be one
group and Ac := X \ A be the other. One approach to clustering is to minimize the
graph cut energy

(5.4) MC(A) := Cut(A, A^c),

where

(5.5) Cut(A, B) = Σ_{x∈A, y∈B} w_{xy}.
The graph cut energy MC(A) is the sum of the weights of edges that have to be cut
to partition the graph into A and Ac . Minimizing the graph cut energy is known
as the min-cut problem and the solution can be obtained efficiently. However, the
min-cut problem often yields poor clusterings, since the energy is often minimized
with A = {x} for some x ∈ X , which is usually undesirable.
This can be addressed by normalizing the graph cut energy by the size of the
clusters. One approach is the normalized cut [52] energy
(5.6) NC(A) = Cut(A, A^c) / (V(A) V(A^c)),
where V (A) = Cut(A, X ) measures the size of the cluster A. There are many other
normalizations one can use, such as min{V (A), V (Ac )} [2]. Minimizing the normalized
cut energy NC(A) turns out to be an NP-hard problem.
Therefore,

V(A)V(A^c) = Σ_{x,y∈X} d(x) d(y) 1_A(x)(1 − 1_A(y)).
So far, we have changed nothing besides notation, so the problem is still NP-hard.
However, this reformulation allows us to view the problem as optimizing over binary
functions, in the form
(5.7) min_{u : X→{0,1}} [ Σ_{x,y∈X} w_{xy}(u(x) − u(y))² ] / [ Σ_{x,y∈X} d(x)d(y) u(x)(1 − u(y)) ],
since binary functions are exactly the characteristic functions. Later we will cover
relaxations of (5.7), which essentially proceed by allowing u to take on all real values
R, but this must be done carefully. These relaxations lead to spectral methods for
dimension reduction and clustering, such as spectral clustering [46] and Laplacian
eigenmaps [4].
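A tiny brute-force sketch (the random graph is an arbitrary illustration): evaluated at a characteristic function u = 1_A, the objective in (5.7) equals 2 NC(A), since the double sum counts each cut edge twice; and exact minimization requires enumerating subsets, reflecting the NP-hardness noted above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)                       # degrees d(x)

def cut(A, B):                          # Eq. (5.5)
    return sum(W[x, y] for x in A for y in B)

def nc(A):                              # Eq. (5.6)
    Ac = [x for x in range(n) if x not in A]
    return cut(A, Ac) / (d[list(A)].sum() * d[Ac].sum())

def ratio(u):                           # the objective in (5.7)
    num = np.sum(W * (u[:, None] - u[None, :]) ** 2)
    den = np.sum(np.outer(d, d) * np.outer(u, 1 - u))
    return num / den

A = (0, 2, 4)
u = np.zeros(n); u[list(A)] = 1.0       # u = 1_A, so ratio(u) = 2 * NC(A)

# brute force over all nontrivial subsets (exponential cost):
best = min(nc(S) for r in range(1, n)
           for S in itertools.combinations(range(n), r))
```

The constant factor 2 does not change the minimizers, so (5.7) and NC have the same optimal clusters.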
We remark that all of the methods above can be viewed as discrete calculus of
variations problems on graphs. A natural question to ask is what relationship they
have with their continuous counterparts. Such relationships can give us insights about
how much labeled data is necessary to get accurate classifiers, and which algorithms
and hyperparameters are appropriate in different situations. Answering these questions
requires a modeling assumption on our data. A standard assumption is that the data
points are a sequence of independent and identically distributed random variables
and the graph is constructed as a random geometric graph. The next few sections are
a review of calculus on graphs, and some basic probability, so that we can address
the random geometric graph model rigorously.
This induces a norm ‖u‖²_{ℓ²(X)} = (u, u)_{ℓ²(X)}. We define a vector field on the graph to
be an antisymmetric function V : X 2 → R (i.e., V (x, y) = −V (y, x)). Thus, vector
fields are defined along edges of the graph. The gradient of a function u ∈ ℓ²(X), denoted for simplicity as ∇u, is the vector field

∇u(x, y) = u(y) − u(x).

The divergence div V ∈ ℓ²(X) of a vector field V is defined so that the integration by parts formula

(5.11) (∇u, V)_{ℓ²(X²)} = −(u, div V)_{ℓ²(X)}

holds for all u ∈ ℓ²(X). Compare this with the continuous Divergence Theorem (Theorem A.32). To find an expression for div V(x), we compute
(∇u, V)_{ℓ²(X²)} = ½ Σ_{x,y∈X} w_{xy} (u(y) − u(x)) V(x, y)
= ½ Σ_{x,y∈X} w_{xy} u(y) V(x, y) − ½ Σ_{x,y∈X} w_{xy} u(x) V(x, y)
= ½ Σ_{x,y∈X} w_{xy} u(x) V(y, x) − ½ Σ_{x,y∈X} w_{xy} u(x) V(x, y)
= − Σ_{x,y∈X} w_{xy} u(x) V(x, y),

and so we may define

div V(x) := Σ_{y∈X} w_{xy} V(x, y).
Using this definition, the integration by parts formula (5.11) holds, and div is the
negative adjoint of ∇.
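The adjointness relation can be verified numerically on a random weighted graph (a sketch, assuming the ½-weighted inner product on ℓ²(X²) used in the computation above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
W = rng.random((n, n))
W = (W + W.T) / 2                         # symmetric weights w_xy
np.fill_diagonal(W, 0.0)

u = rng.standard_normal(n)
V = rng.standard_normal((n, n))
V = V - V.T                               # antisymmetric vector field

grad_u = u[None, :] - u[:, None]          # (grad u)(x, y) = u(y) - u(x)
ip_grad = 0.5 * np.sum(W * grad_u * V)    # (grad u, V) in l2(X^2)
div_V = np.sum(W * V, axis=1)             # div V(x) = sum_y w_xy V(x, y)
ip_div = np.sum(u * div_V)                # (u, div V) in l2(X)
# integration by parts: ip_grad == -ip_div
```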
In particular,

(5.16) E₂(u) = ‖∇u‖²_{ℓ²(X²)} = −(u, Lu)_{ℓ²(X)},
which shows that the graph Dirichlet energy E2 (u), defined in (5.3), is closely related
to the graph Laplacian. Eq. (5.16) also shows that −L is a positive semi-definite
operator.
Now, we consider the Laplacian learning problem

(5.17) min_{u∈A} E₂(u),

where

A = {u ∈ ℓ²(X) : u(x) = g(x) for all x ∈ Γ},

where Γ ⊂ X and g : Γ → R are given.
Lemma 5.1 (Existence). There exists u ∈ A such that E₂(u) ≤ E₂(w) for all w ∈ A.

Proof. We first note that if w := min(b, max(a, u)) is the truncation of u to the interval [a, b], for any a < b, then E₂(w) ≤ E₂(u). Indeed, this follows from

|min(b, max(a, u(y))) − min(b, max(a, u(x)))| ≤ |u(y) − u(x)|,

which holds for all x, y ∈ X. By setting a = min_{x∈Γ} g(x) and b = max_{x∈Γ} g(x), we have that w ∈ A provided u ∈ A. Therefore, we may look for a minimizer of E₂ over the constrained set

B := {u ∈ A : min_Γ g ≤ u ≤ max_Γ g}.
Since E2 (u) ≥ 0 for all u ∈ ℓ2 (X ), we have inf w∈B E2 (w) ≥ 0. For every integer k ≥ 1,
let uk ∈ B such that E2 (uk ) ≤ inf w∈A E2 (w) + 1/k. Then uk is a minimizing sequence,
meaning that lim_{k→∞} E₂(u_k) = inf_{w∈A} E₂(w). By the Bolzano-Weierstrass Theorem,
there exists a subsequence ukj and u ∈ B such that limj→∞ ukj (x) = u(x) for all
x ∈ X . Since E2 (u) is a continuous function of u(x) for all x ∈ X , we have
where we used the fact that L is self-adjoint (Eq. (5.15)) in the last line. Therefore,
any minimizer u of (5.17) must satisfy
We now choose

v(x) = Lu(x) if x ∈ X \ Γ, and v(x) = 0 if x ∈ Γ,

to find that

0 = (Lu, v)_{ℓ²(X)} = Σ_{x∈X\Γ} |Lu(x)|².
(5.20) u(x) = (1/d(x)) Σ_{y∈X} w_{xy} u(y).
Eq. (5.20) is a mean value property on the graph, and says that any graph harmonic
function (i.e., Lu = 0) is equal to its average value over graph neighbors.
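The mean value property (5.20) suggests a simple iterative solver for Laplacian learning: repeatedly replace u at unlabeled vertices by the weighted average of its neighbors. A minimal sketch (the point cloud, Gaussian weights, and labels are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
X = rng.random((n, 2))                                   # toy point cloud
dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-dist2 / 0.1)                                 # Gaussian weights
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)                                        # degrees d(x)

labeled = np.array([0, 1])                               # the set Gamma
g = np.array([0.0, 1.0])                                 # labels on Gamma

u = np.zeros(n)
u[labeled] = g
for _ in range(2000):                                    # Jacobi-type iteration
    u = (W @ u) / d                                      # mean value update (5.20)
    u[labeled] = g                                       # re-impose u = g on Gamma

residual = np.max(np.abs(u - (W @ u) / d)[2:])           # Lu ~ 0 off Gamma
```

After convergence the residual of (5.20) vanishes off Γ, and the computed values lie between the extreme label values, illustrating the maximum principle (Theorem 5.2).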
Uniqueness of solutions of the boundary value problem (5.19) follows from the following maximum principle.
Theorem 5.2 (Maximum Principle). Let u ∈ ℓ²(X) such that Lu(x) ≥ 0 for all x ∈ X \ Γ. If the graph G = (X, W) is connected to Γ, then

max_{x∈X} u(x) = max_{x∈Γ} u(x).

Note that the condition Lu(x) ≥ 0 is equivalent to the mean value inequality

(5.22) u(x) ≤ (1/d(x)) Σ_{y∈X} w_{xy} u(y),
In particular, we have

max_{x∈X} |u(x) − v(x)| ≤ max_{x∈Γ} |u(x) − v(x)|.

In particular, if u = v on Γ, then u ≡ v.
Proof. Since L(u − v)(x) = 0 for x ∈ X \ Γ, we have by Theorem 5.2 that

max_{x∈X}(u(x) − v(x)) = max_{x∈Γ}(u(x) − v(x)) ≤ max_{x∈Γ} |u(x) − v(x)|

and

max_{x∈X}(v(x) − u(x)) = max_{x∈Γ}(v(x) − u(x)) ≤ max_{x∈Γ} |u(x) − v(x)|.

Combining these two estimates completes the proof.
For use later on, we record another version of the maximum principle that does
not require connectivity of the graph.
Lemma 5.4 (Maximum principle). Let u ∈ ℓ2 (X ) such that Lu(x) > 0 for all x ∈
X \ Γ. Then
(5.25) max_{x∈X} u(x) = max_{x∈Γ} u(x).
Proof. Let x₀ ∈ X such that u(x₀) = max_{x∈X} u(x). Since u(x₀) ≥ u(y) for all y ∈ X, we have

Lu(x₀) = Σ_{y∈X} w_{x₀y} (u(y) − u(x₀)) ≤ 0.
Since Lu(x) > 0 for all x ∈ X \ Γ, we must have x0 ∈ Γ, which completes the
proof.
for any t > 0. Bounds of the form (5.26) and (5.27) are called concentration in-
equalities, or concentration of measure. In this section we describe the main ideas for
proving exponential bounds of the form (5.27), and prove the Hoeffding and Bernstein
inequalities, which are some of the most useful concentration inequalities. For more
details we refer the reader to [8].
One normally proves exponential bounds like (5.27) with the Chernoff bounding
trick. Let s > 0 and note that
P(S_n − µ ≥ t) = P(s(S_n − µ) ≥ st) = P(e^{s(S_n−µ)} ≥ e^{st}).
The random variable Y = es(Sn −µ) is nonnegative, so we can apply Markov’s inequality
(see Proposition A.39) to obtain
P(S_n − µ ≥ t) ≤ e^{−st} E[e^{s(S_n−µ)}]
= e^{−st} E[e^{(s/n) Σ_{i=1}^n (X_i−µ)}]
= e^{−st} E[ ∏_{i=1}^n e^{(s/n)(X_i−µ)} ].
Since the X_i are independent and identically distributed, the expectation of the product factors, and we obtain

(5.28) P(S_n − µ ≥ t) ≤ e^{−st} ∏_{i=1}^n E[e^{(s/n)(X_i−µ)}] = e^{−st} E[e^{(s/n)(X_1−µ)}]^n.
This bound is the main result of the Chernoff bounding trick. The key now is to
obtain bounds on the moment generating function
MX (λ) := E[eλ(X−µ) ],
where X = X1 .
In the case where the Xi are Bernoulli random variables, we can compute the
moment generating function explicitly, and this leads to the Chernoff bounds. Before giving them, we present some preliminary technical propositions regarding the function

h(δ) := (1 + δ) log(1 + δ) − δ.
f′(x) = δ − e^x + 1 = 0

when x = log(1 + δ). Since f″(x) = −e^x, the critical point is a maximum and we have

f(log(1 + δ)) = h(δ),

which completes the proof of (5.30). The proof of (5.31) is similar.
(5.32) h(δ) ≥ δ² / (2(1 + δ₊/3)),

Proof. We first note that h′(0) = h(0) = 0 and h″(δ) = 1/(1 + δ). We define

f(δ) = δ² / (2(1 + δ/3)),
for any s > 0. We use (5.30) from Proposition 5.5 to optimize over s > 0, yielding

P( Σ_{i=1}^n X_i ≥ (1 + δ)np ) ≤ exp(−np h(δ)).
Set t = δp to obtain

P( Σ_{i=1}^n X_i ≤ (1 − δ)np ) ≤ exp( −sδp + sp + np(e^{−s/n} − 1) )
= exp( −np( δ(s/n) − (e^{−s/n} − 1) − s/n ) ).
P(|P| ≥ 61 log n) ≤ P(Y ≥ 58 log n) ≤ 1/n².
Since there are n paths from leaves to the root, we union bound over all paths to find that

P(Z ≥ 61 log n) ≤ 1/n.
Therefore, with probability at least 1 − 1/n, quicksort takes at most O(n log n) operations to complete. △
In general, we cannot compute the moment generating function explicitly, and are
left to derive upper bounds. The first bound is due to Hoeffding.
Lemma 5.8 (Hoeffding Lemma). Let X be a real-valued random variable for which
|X − µ| ≤ b almost surely for some b > 0, where µ = E[X]. Then we have
(5.35) M_X(λ) ≤ e^{λ²b²/2}.
Proof. Since x ↦ e^{λx} is convex, we have

e^{λx} ≤ e^{−λb} + ((x + b)/b) sinh(λb)

provided |x| ≤ b (the right hand side is the secant line from (−b, e^{−λb}) to (b, e^{λb}); recall sinh(t) = (e^t − e^{−t})/2 and cosh(t) = (e^t + e^{−t})/2). Therefore we have

M_X(λ) = E[e^{λ(X−µ)}] ≤ E[ e^{−λb} + ((X − µ + b)/b) sinh(λb) ]
= e^{−λb} + ((E[X] − µ + b)/b) sinh(λb)
= e^{−λb} + sinh(λb) = cosh(λb).

The proof is completed by the elementary inequality cosh(x) ≤ e^{x²/2} (compare Taylor series).
provided |Xi − µ| ≤ b almost surely. Optimizing over s > 0 we find that s = nt/b², which yields the following result.

Theorem 5.9 (Hoeffding inequality). Let X1, X2, . . . , Xn be a sequence of i.i.d. real-valued random variables with mean µ = E[Xi], and write Sn = (1/n) Σ_{i=1}^n Xi. Assume there exists b > 0 such that |Xi − µ| ≤ b almost surely. Then for any t > 0 we have

(5.36)    P(Sn − µ ≥ t) ≤ exp( −nt²/(2b²) ).
Remark 5.10. Of course, the opposite inequality

P(Sn − µ ≤ −t) ≤ exp( −nt²/(2b²) )

holds by a similar argument. Thus, by the union bound we have

P(|Sn − µ| ≥ t) ≤ 2 exp( −nt²/(2b²) ).
The Hoeffding inequality is tight if σ² ≈ b², so that the right hand side looks like the Gaussian distribution in (5.27), up to constants. For example, if Xi are uniformly distributed on [−b, b] then

σ² = (1/(2b)) ∫_{−b}^{b} x² dx = b²/3.

However, if σ² ≪ b², then one would expect to see σ² in place of b² as in (5.27), and the presence of b² leads to a suboptimal bound.
Example 5.2. Let

Y = max( 1 − |X|/ε, 0 ),

where X is uniformly distributed on [−1, 1], as above, and 0 < ε ≪ 1. Then |Y| ≤ 1, so b = 1, but we compute

σ² ≤ (1/2) ∫_{−ε}^{ε} dx = ε.

Hence, σ² ≪ 1 when ε is small, and we expect to get sharper concentration bounds than are provided by the Hoeffding inequality. This example is similar to what we will see later in consistency of graph Laplacians. 4
The Bernstein inequality gives the sharper bounds that we desire, and follows
from Bernstein’s Lemma.
Lemma 5.11 (Bernstein Lemma). Let X be a real-valued random variable with finite mean µ = E[X] and variance σ² = Var(X), and assume that |X − µ| ≤ b almost surely for some b > 0. Then we have

(5.37)    M_X(λ) ≤ exp( (σ²/b²)( e^{λb} − 1 − λb ) ).
Proof. For |x| ≤ b we use |x|^k ≤ x² b^{k−2} for k ≥ 2 in the Taylor series of the exponential to obtain

e^{λx} = 1 + λx + Σ_{k=2}^∞ (λx)^k / k! ≤ 1 + λx + (x²/b²)( e^{λb} − 1 − λb ).

Therefore

M_X(λ) ≤ E[ 1 + λ(X − µ) + ((X − µ)²/b²)( e^{λb} − 1 − λb ) ] = 1 + (σ²/b²)( e^{λb} − 1 − λb ).

The proof is completed by using the inequality 1 + x ≤ e^x.
We now prove the Bernstein inequality.
Theorem 5.12 (Bernstein Inequality). Let X1, X2, . . . , Xn be a sequence of i.i.d. real-valued random variables with finite expectation µ = E[Xi] and variance σ² = Var(Xi), and write Sn = (1/n) Σ_{i=1}^n Xi. Assume there exists b > 0 such that |Xi − µ| ≤ b almost surely. Then for any t > 0 we have

P(Sn − µ ≥ t) ≤ exp( −nt² / (2(σ² + (1/3)bt)) ).

Remark 5.14. As with the Hoeffding inequality, we can obtain two-sided estimates of the form

P(|Sn − µ| ≥ t) ≤ 2 exp( −nt² / (2(σ² + (1/3)bt)) ).
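Since Bernstein replaces b² in the exponent by σ² + bt/3, the bound is far smaller than Hoeffding's when σ² ≪ b², as for the variable Y in Example 5.2. A quick numerical comparison of the two exponential bounds (the parameter values are illustrative choices):

```python
import numpy as np

def hoeffding_bound(n, t, b):
    """One-sided Hoeffding tail bound for P(Sn - mu >= t)."""
    return np.exp(-n * t**2 / (2 * b**2))

def bernstein_bound(n, t, b, sigma2):
    """One-sided Bernstein tail bound for P(Sn - mu >= t)."""
    return np.exp(-n * t**2 / (2 * (sigma2 + b * t / 3)))

n, t, b, sigma2 = 1000, 0.05, 1.0, 0.01   # sigma^2 << b^2
print(hoeffding_bound(n, t, b))           # exp(-1.25), a weak bound
print(bernstein_bound(n, t, b, sigma2))   # exp(-46.875), dramatically smaller
```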
Later in the notes, we will encounter double sums of the form

(5.40)    U_n = (1/(n(n − 1))) Σ_{i≠j} f(X_i, X_j),
where the last line follows from Jensen's inequality. Since Y_τ is a sum of k i.i.d. random variables with zero mean, absolute bound b, and variance σ², the same application of Bernstein's Lemma as used in Theorem 5.12 yields

E[e^{sY_τ}] ≤ exp( (kσ²/b²)( e^{sb/k} − 1 − sb/k ) ).

Therefore, we obtain

P(U_n − µ > t) ≤ exp( −(kσ²/b²)[ (bt/σ²)(sb/k) − ( e^{sb/k} − 1 − sb/k ) ] ),
and the proof is completed by applying Proposition 5.6 and noting that k ≥ n/3 for
all n ≥ 2.
We pause to give an application to Monte Carlo numerical integration, which is a randomized numerical method for approximating integrals of the form

(5.43)    I(u) = ∫_{[0,1]^d} u(x) dx.

The Monte Carlo approximation is

(5.44)    I_n(u) = (1/n) Σ_{i=1}^n u(X_i),

where X1, . . . , Xn are i.i.d. random variables uniformly distributed on [0,1]^d. Theorem 5.16 below shows that, with σ² = Var(u(X_i)),

(5.45)    |I(u) − I_n(u)| ≤ λσ/√n

with probability at least 1 − 2 exp(−λ²/4).
Remark 5.17. Let δ > 0 and choose λ > 0 so that δ = 2 exp(−λ²/4); that is, λ = √(4 log(2/δ)). Then Theorem 5.16 can be rewritten to say that

(5.46)    |I(u) − I_n(u)| ≤ √( 4σ² log(2/δ) / n )

holds with probability at least 1 − δ. This is a more common form to see Monte Carlo error estimates.
The reader should contrast this with the case where Xi form a uniform grid over
[0, 1]d . In this case, for Lipschitz functions the numerical integration error is O(∆x),
where ∆x is the grid spacing. For n points on a uniform grid in [0, 1]d the grid spacing
is ∆x ∼ n−1/d , which suffers from the curse of dimensionality when d is large. The
Monte Carlo error estimate (5.46), on the other hand, is remarkable in that it is
independent of dimension d! Thus, Monte Carlo integration overcomes the curse of
dimensionality by simply replacing a uniform discretization grid by random variables.
Monte Carlo based techniques have been used to solve PDEs in high dimensions via
sampling random walks or Brownian motions.
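The dimension independence is striking in practice. The sketch below (an illustration; the integrand is an arbitrary choice with a known integral) estimates ∫_{[0,1]^d} Π_{i=1}^d sin(πx_i) dx = (2/π)^d for several values of d:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_error(d, n):
    """Absolute Monte Carlo error for integrating prod_i sin(pi x_i) over [0,1]^d."""
    X = rng.random((n, d))
    estimate = np.prod(np.sin(np.pi * X), axis=1).mean()
    exact = (2 / np.pi) ** d
    return abs(estimate - exact)

n = 100000
for d in (1, 5, 20):
    print(d, mc_error(d, n))  # errors of size ~ sigma/sqrt(n) in every dimension
```

Note that the error does not grow with d; a uniform grid with the same number of points would have error ~ n^{−1/d}.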
Proof of Theorem 5.16. Let Yi = u(Xi). We apply Bernstein's inequality with Sn = (1/n) Σ_{i=1}^n Yi = I_n(u), σ² = Var(Yi) and b = 2‖u‖_{L∞([0,1]^d)} to find that

|I(u) − I_n(u)| ≤ t

with probability at least 1 − 2 exp( −nt²/(2(σ² + (1/3)bt)) ) for any t > 0. Set t = λσ/√n for λ > 0 to find that

|I(u) − I_n(u)| ≤ λσ/√n

with probability at least 1 − 2 exp( −λ²σ² / (2(σ² + bλσ/(3√n))) ). Restricting λ ≤ 3σ√n/b completes the proof.
We conclude this section with the Azuma/McDiarmid inequality. This is slightly
more advanced and is not used in the rest of these notes, so the reader may skip ahead
without any loss. It turns out that the Chernoff bounding method used to prove the
Chernoff, Hoeffding, and Bernstein inequalities does not use in any essential way the
linearity of the sum defining Sn . Indeed, what matters is that Sn does not depend
too much on any particular random variable Xi . Using these ideas leads to far more
general (and more useful) concentration inequalities for functions of the form
(5.47) Yn = f (X1 , X2 , . . . , Xn )
that may depend nonlinearly on the Xi . To express that Yn does not depend too
much on any of the Xi , we assume that f satisfies the following bounded differences
condition: There exists b > 0 such that
(5.48)    |f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x̃i, . . . , xn)| ≤ b

for all x1, . . . , xn and x̃i. In this case we have the following concentration inequality.
Theorem 5.18 (Azuma/McDiarmid inequality). Define Yn by (5.47), where X1 , . . . , Xn
are i.i.d. random variables satisfying |Xi | ≤ M almost surely, and assume f satisfies
(5.48). Then for any t > 0
(5.49)    P(Y_n − E[Y_n] ≥ t) ≤ exp( −t²/(2nb²) ).
Proof. The proof uses conditional probability, which we have not developed in these
notes, so we give a sketch of the proof. We define Z1 = E[Yn | X1] − E[Yn], and for 2 ≤ k ≤ n we define

Z_k = E[Y_n | X1, . . . , X_k] − E[Y_n | X1, . . . , X_{k−1}],

so that Σ_{k=1}^n Z_k = Y_n − E[Y_n]. The random variables Zk record how much the conditional expectation changes when we add information about Xk. While the Zk are not independent, they form a martingale difference sequence, which allows us to essentially treat them as independent and use a similar proof to Hoeffding's inequality. The useful martingale difference property is the identity

E[Z_k | X1, . . . , X_{k−1}] = 0

for k ≥ 2, and E[Z1] = 0, which follow from the law of iterated expectations.
We now follow the Chernoff bounding method and law of iterated expectations to obtain

P(Y_n − E[Y_n] ≥ t) = P( e^{s Σ_{k=1}^n Z_k} ≥ e^{st} )
≤ e^{−st} E[ e^{s Σ_{k=1}^n Z_k} ]
= e^{−st} E[ E[ e^{s Σ_{k=1}^n Z_k} | X1, . . . , X_{n−1} ] ]
= e^{−st} E[ e^{s Σ_{k=1}^{n−1} Z_k} E[ e^{sZ_n} | X1, . . . , X_{n−1} ] ].
Define

U_k = sup_{|x|≤M} E[ f(X1, . . . , X_{k−1}, x, X_{k+1}, . . . , X_n) − Y_n | X1, . . . , X_{k−1} ]

and

L_k = inf_{|x|≤M} E[ f(X1, . . . , X_{k−1}, x, X_{k+1}, . . . , X_n) − Y_n | X1, . . . , X_{k−1} ].
In this notation, E_{n,ε}(u) = ‖∇_{n,ε} u‖²_{ℓ²(X_n²)}. The L² norm of the gradient is of course induced by the corresponding inner product

(5.59)    (V, W)_{ℓ²(X_n²)} := (1/(σ_η n²)) Σ_{x,y ∈ X_n} η_ε(|x − y|) V(x, y) W(x, y),

and the H¹(X_n) norm by ‖u‖²_{H¹(X_n)} = (u, u)_{H¹(X_n)}. With these definitions, the graph Laplacian (5.54) is given by L_{n,ε} u = div_{n,ε}(∇_{n,ε} u) for u ∈ ℓ²(X_n), and the integration by parts formula

(L_{n,ε} u, v)_{ℓ²(X_n)} = −(∇_{n,ε} u, ∇_{n,ε} v)_{ℓ²(X_n²)}

holds for all u, v ∈ ℓ²(X_n).
Proof. For simplicity, we give the proof for a box U = [0, 1]^d; the extension to arbitrary domains U with smooth boundaries is straightforward. We assume 0 < ε ≤ 1.

Let Ω be the event that the graph G_{n,ε} is not connected. We partition the box U into M = h^{−d} sub-boxes B1, B2, . . . , B_M of side length h, and let n_i denote the number of points from X_n falling in B_i. If ε ≥ 4√d h, then every point in B_i is connected to all points in neighboring boxes B_j. Thus, if we set h = ε/(4√d) then the graph is connected if all boxes B_i contain at least one point from X_n, and paths between pairs of points are constructed by hopping between neighboring boxes. Hence, if the graph is not connected, then some box B_i must be empty. Letting A_i denote the event that n_i = 0, we have by the union bound

P(Ω) ≤ P( ∪_{i=1}^M A_i ) ≤ Σ_{i=1}^M P(A_i) = Σ_{i=1}^M P(n_i = 0).

Since X1, . . . , Xn are i.i.d. random variables with density ρ ≥ α > 0, we have that

P(n_i = 0) = P(∀j ∈ {1, . . . , n}, X_j ∉ B_i) = Π_{j=1}^n P(X_j ∉ B_i) = P(X1 ∉ B_i)^n ≤ (1 − αh^d)^n ≤ exp(−αnh^d) = exp(−cnε^d).
Therefore

P(Ω) ≤ M exp(−cnε^d) = Cε^{−d} exp(−cnε^d).

If nε^d ≥ 1, then ε^{−d} ≤ n, so

P(Ω) ≤ Cn exp(−cnε^d).

If nε^d ≤ 1, then

Cn exp(−cnε^d) ≥ Cn exp(−c) ≥ 1 ≥ P(Ω)
provided we choose C larger, if necessary. This completes the proof.
Remark 5.21. By Lemma 5.20, there exists C > 0 such that if nε^d ≥ C log(n), then the graph is connected with probability at least 1 − 1/n². Note we can rewrite this condition as

(5.64)    ε ≥ K ( log(n)/n )^{1/d},
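The sharpness of the length scale (5.64) can be observed experimentally. The sketch below (illustrative; the prefactors 0.2 and 3.0 are arbitrary choices below and above the threshold) builds the random geometric graph with a simple union-find and tests connectivity:

```python
import numpy as np

rng = np.random.default_rng(0)

def find(parent, i):
    # Union-find root lookup with path halving
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def is_connected(X, eps):
    """True if the graph connecting points within distance eps is connected."""
    n = len(X)
    parent = list(range(n))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # O(n^2), fine for small n
    ii, jj = np.where((D <= eps) & (D > 0))
    for i, j in zip(ii, jj):
        ri, rj = find(parent, i), find(parent, j)
        parent[ri] = rj
    return len({find(parent, i) for i in range(n)}) == 1

n, d = 1000, 2
X = rng.random((n, d))
scale = (np.log(n) / n) ** (1 / d)
print(is_connected(X, 0.2 * scale), is_connected(X, 3.0 * scale))
```

Below the threshold isolated points appear with high probability, so the graph is disconnected; above it the graph is connected.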
We also write Uε = U \ ∂ε U . Note we have chosen our labeled points to be all points
within ε of the boundary. Other choices of boundary conditions can be used, and
it is an interesting, and somewhat open, problem to determine the fewest number of
labeled points for which discrete to continuum convergence holds.
The continuum version of (5.65) is

(5.67)    div(ρ²∇u) = 0 in U,    u = g on ∂U.
The goal of this section is to use the maximum principle to prove convergence, with
a rate, of the solution of (5.65) to the solution of (5.67) as ε → 0 and n → ∞.
We first prove pointwise consistency of graph Laplacians. The proof utilizes the nonlocal operator

(5.68)    L_ε u(x) = (2/(σ_η ε²)) ∫_U η_ε(|x − y|)( u(y) − u(x) ) ρ(y) dy.
Lemma 5.22 (Discrete to nonlocal). Let u : U → R be Lipschitz continuous and ε > 0 with nε^d ≥ 1. Then for any 0 < λ ≤ ε^{−1},

max_{x ∈ X_n} |L_{n,ε} u(x) − L_ε u(x)| ≤ C Lip(u) λ

with probability at least 1 − C exp( −cnε^{d+2}λ² + 3 log(n) ).

Proof. Fix x ∈ U and set Y_i = η_ε(|x − X_i|)( u(X_i) − u(x) ), so that

L_{n,ε} u(x) = (2/(σ_η ε² n)) Σ_{i=1}^n Y_i.

We compute |Y_i| ≤ C Lip(u) ε^{1−d} and Var(Y_i) ≤ C Lip(u)² ε^{2−d}.
and

|L_{n,ε} u(x_i) − L_{n,ε} u(X_i)| ≤ (Ch/(nε^{d+2})) Σ_{y ∈ X_n ∩ B(x_i, 2ε)} 1 ≤ C Lip(u) h ε^{−2}.
for any 0 < λ ≤ ε−1 , provided u ∈ C 3 (U ). The proof is similar to Lemma 5.22,
but involves Taylor expanding u before applying concentration of measure so that
the application of Bernstein’s inequality does not depend on u. The uniformity over
u ∈ C 3 (U ) is required when using the viscosity solution machinery to prove discrete
to continuum convergence in the non-smooth setting. We refer the reader to [11,
Theorem 5] for more details. Also, when one uses conditional probability instead of
the covering argument in proving Lemma 5.22, the term 3 log(n) can be improved to
log(n). This is inconsequential for the results below.
We now turn to comparing the nonlocal operator L_ε to its continuum counterpart Δ_ρ u := ρΔu + 2∇ρ·∇u = ρ^{−1} div(ρ²∇u).
Lemma 5.24 (Nonlocal to local). There exists C > 0 such that for every u ∈ C³(U) and x ∈ U with dist(x, ∂U) ≥ ε we have

|L_ε u(x) − Δ_ρ u(x)| ≤ Cβε,

where β = ‖u‖_{C³(U)}.

Proof. We Taylor expand

u(x + zε) − u(x) = ∇u(x)·zε + (ε²/2) z^T ∇²u(x) z + O(βε³)

and

ρ(x + zε) = ρ(x) + ∇ρ(x)·zε + O(ε²),

for |z| ≤ 1, to obtain

L_ε u(x) = (2/(σ_η ε²)) ∫_{B(0,1)} η(|z|) ( ∇u(x)·zε + (ε²/2) z^T ∇²u(x) z + O(βε³) )( ρ(x) + ∇ρ(x)·zε + O(ε²) ) dz
= (2/σ_η) ∫_{B(0,1)} η(|z|) ( ρ(x)∇u(x)·z ε^{−1} + (1/2) ρ(x) z^T ∇²u(x) z + (∇u(x)·z)(∇ρ(x)·z) ) dz + O(βε)
= (2/σ_η) ∫_{B(0,1)} η(|z|) ( (1/2) ρ(x) z^T ∇²u(x) z + (∇u(x)·z)(∇ρ(x)·z) ) dz + O(βε)
=: A + B + O(βε),

where the ε^{−1} term integrates to zero since ∫_{B(0,1)} η(|z|) z dz = 0 by symmetry.
We compute

A = (1/σ_η) ρ(x) Σ_{i,j=1}^d u_{x_i x_j}(x) ∫_{B(0,1)} η(|z|) z_i z_j dz
= (1/σ_η) ρ(x) Σ_{i=1}^d u_{x_i x_i}(x) ∫_{B(0,1)} η(|z|) z_i² dz
= ρ(x) Δu(x),

since ∫_{B(0,1)} η(|z|) z_i z_j dz = 0 for i ≠ j and ∫_{B(0,1)} η(|z|) z_i² dz = σ_η. A similar computation gives

B = 2∇ρ(x) · ∇u(x).
Remark 5.25. Combining Lemmas 5.22 and 5.24, for any u ∈ C³(U) and 0 < λ ≤ ε^{−1} we have that

max_{x ∈ X_n \ ∂_ε U} |L_{n,ε} u(x) − Δ_ρ u(x)| ≤ C(λ + ε)

holds with probability at least 1 − C exp(−cnε^{d+2}λ² + 3 log(n)). This is called pointwise consistency for graph Laplacians. Notice that for the bound to be non-vacuous, we require nε^{d+2} ≫ log(n). To get a linear O(ε) pointwise consistency rate, we set λ = ε and require nε^{d+4} ≥ C log(n) for a large constant C > 0. Written a different way, we require

ε ≫ ( log(n)/n )^{1/(d+2)}
for pointwise consistency of graph Laplacians, and

ε ≥ K ( log(n)/n )^{1/(d+4)}

for pointwise consistency with a linear O(ε) rate. We note these are larger length scale restrictions than for connectivity of the graph (compare with (5.64)). In the length scale range

(5.74)    ( log(n)/n )^{1/d} ≪ ε ≪ ( log(n)/n )^{1/(d+2)},

the graph is connected, and Laplacian learning well-posed, but pointwise consistency does not hold and current PDE techniques cannot say anything about discrete to continuum convergence.
Remark 5.26. If u ∈ C⁴(U), then Lemma 5.24 can be sharpened to read

|L_ε u(x) − Δ_ρ u(x)| ≤ C‖u‖_{C⁴(U)} ε².

We leave this as an exercise for the reader. Combining this with Lemma 5.22 we have that

(5.75)    max_{x ∈ X_n \ ∂_ε U} |L_{n,ε} u(x) − Δ_ρ u(x)| ≤ C(λ + ε²)

holds with probability at least 1 − C exp(−cnε^{d+2}λ² + 3 log(n)) for all 0 < λ ≤ ε^{−1} and all u ∈ C⁴(U). To obtain the second order convergence rate, we must choose λ = ε², and so we require the stricter length scale restriction

ε ≥ C ( log(n)/n )^{1/(d+6)}.
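Pointwise consistency can be checked numerically in one dimension. The sketch below (illustrative; it assumes ρ ≡ 1 on [0,1] and the constant kernel η = 1_{[0,1]}, for which σ_η = ∫_{−1}^{1} z² dz = 2/3 and Δ_ρ u = u″) evaluates the graph Laplacian of u(x) = sin(πx) at an interior point:

```python
import numpy as np

rng = np.random.default_rng(0)

n, eps = 2_000_000, 0.1
X = rng.random(n)                      # rho = 1 (uniform density on [0,1])
u = lambda x: np.sin(np.pi * x)
sigma_eta = 2.0 / 3.0                  # int_{-1}^{1} z^2 dz for eta = 1_{[0,1]}

x0 = 0.5
mask = np.abs(X - x0) <= eps           # eta_eps(t) = eps^{-1} 1_{t <= eps}
# L_{n,eps} u(x0) = (2/(sigma_eta eps^2 n)) sum_i eta_eps(|x0 - X_i|)(u(X_i) - u(x0))
Lnu = 2.0 / (sigma_eta * eps**2 * n) * np.sum((u(X[mask]) - u(x0)) / eps)
print(Lnu, -np.pi**2 * np.sin(np.pi * x0))  # both close to -pi^2
```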
We now show how to use the maximum principle and pointwise consistency to
establish discrete to continuum convergence with a rate.
Theorem 5.27 (Discrete to continuum convergence). Let 0 < ε ≤ 1 and n ≥ 1. Let u_{n,ε} ∈ ℓ²(X_n) be a solution of (5.65) satisfying (5.18) with Γ = X_n ∩ ∂_ε U, and let u ∈ C³(U) be the solution of (5.67). Then for any 0 < λ ≤ 1,

(5.76)    max_{x ∈ X_n} |u_{n,ε}(x) − u(x)| ≤ C( ‖u‖_{C³(U)} + 1 )(λ + ε)

holds with probability at least 1 − C exp(−cnε^{d+2}λ² + 3 log(n)).
Since ∆ρ φ(x) ≥ 0 at any point x ∈ U where φ attains its minimum value, and
∆ρ φ(x) = −1 for all x ∈ U , the minimum value of φ is attained on the boundary ∂U ,
and so φ ≥ 0. Combining Lemmas 5.22 and 5.24, and recalling ∆ρ u = 0, we have
that
and
|u_{n,ε}(x) − u(x)| ≤ |u_{n,ε}(x)| + |u(x)| ≤ 2‖g‖_{L∞(U)} ≤ 4C₁‖g‖_{L∞(U)}(λ + ε),
for all x ∈ Xn . For the other direction, we can define w = un,ε − u − Kφ and make a
similar argument to obtain
Notes and references: Pointwise consistency for graph Laplacians was established
in [34, 35] for a random geometric graph on a manifold. Pointwise consistency for k-
nearest neighbor graphs was established without rates in [57], and with convergence
rates in [14]. The maximum principle is a well-established tool for passing to limits
with convergence rates, and has been well-used in numerical analysis. The theory
presented here requires regularity of the solution u of the continuum PDE. In some
problems, especially nonlinear problems, the solution is not sufficiently regular, and
the theory of viscosity solutions has been developed for precisely this purpose. The pa-
pers [11] and [13] use the maximum principle, pointwise consistency and the viscosity
solution machinery (see [12] for more on viscosity solutions) to prove discrete to con-
tinuum convergence for p-Laplace and ∞-Laplace semi-supervised learning problems
with very few labels. The maximum principle is used in [16, 54] to prove convergence
rates for Laplacian regularized semi-supervised learning, and in [59] to prove rates for
Laplacian regularized regression on graphs.
for u ∈ L²(U) and ε, t > 0. We write I_ε(u) = I_{ε,0}(u). We also define the local energy

(5.81)    I(u) = ∫_U |∇u|² ρ² dx

for u ∈ H¹(U).
Lemma 5.28 (Discrete to nonlocal). There exist C, c > 0 such that for any 0 < λ ≤ 1 and Lipschitz continuous u : U → R we have

(5.82)    |E_{n,ε}(u) − I_ε(u)| ≤ C Lip(u)² ( 1/n + λ )

with probability at least 1 − 2 exp(−cnε^d λ²).
Proof. Define

f(x, y) = η_ε(|x − y|) ( (u(x) − u(y))/ε )²,

and note that E_{n,ε}(u) = ((n − 1)/(σ_η n)) U_n, where U_n is given by (5.40). Since u is Lipschitz, |η_ε| ≤ Cε^{−d}, and η_ε(|x − y|) = 0 for |x − y| > ε, we have |f(x, y)| ≤ C Lip(u)² ε^{−d}.
We also have

E[f(X_i, X_j)] = ∫_U ∫_U η_ε(|x − y|) ( (u(x) − u(y))/ε )² ρ(x)ρ(y) dx dy = σ_η I_ε(u),

and

σ² := Var( f(X_i, X_j) )
≤ ∫_U ∫_U η_ε(|x − y|)² ( (u(x) − u(y))/ε )⁴ ρ(x)ρ(y) dx dy
≤ C Lip(u)⁴ ε^{−2d} ∫_U ∫_{B(y,ε)} dx dy
≤ C Lip(u)⁴ ε^{−d}.
Therefore

|E_{n,ε}(u) − I_ε(u)| = (1/σ_η) | (1 − 1/n) U_n − σ_η I_ε(u) |
= (1/σ_η) | (1 − 1/n)( U_n − σ_η I_ε(u) ) − (σ_η/n) I_ε(u) |
≤ (1/σ_η) |U_n − σ_η I_ε(u)| + (1/n) I_ε(u).

Since u is Lipschitz we have I_ε(u) ≤ Lip(u)². Therefore we can apply the result of Bernstein above to obtain that

|E_{n,ε}(u) − I_ε(u)| ≤ C Lip(u)² ( λ + 1/n )

holds with probability at least 1 − 2 exp(−cnε^d λ²), which completes the proof.
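The estimate (5.82), combined with the nonlocal-to-local comparison below, says the graph Dirichlet energy E_{n,ε}(u) approximates I(u) up to O(ε + λ + 1/n) errors. A one-dimensional numerical sketch (assuming ρ ≡ 1 on [0,1] and η = 1_{[0,1]}, so that I(u) = ∫₀¹ u′(x)² dx):

```python
import numpy as np

rng = np.random.default_rng(0)

n, eps = 1500, 0.05
X = rng.random(n)
u = np.sin(np.pi * X)
sigma_eta = 2.0 / 3.0                         # int_{-1}^{1} z^2 dz for eta = 1_{[0,1]}

D = np.abs(X[:, None] - X[None, :])
W = (D <= eps) / eps                          # eta_eps(|x - y|); diagonal terms are harmless
G = (u[:, None] - u[None, :])**2 / eps**2     # squared graph gradient entries
E = (W * G).sum() / (sigma_eta * n**2)        # graph Dirichlet energy E_{n,eps}(u)
I = np.pi**2 / 2                              # I(u) for u = sin(pi x), rho = 1
print(E, I)  # close, up to O(eps) boundary effects and sampling error
```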
Lemma 5.29 (Nonlocal to local). There exists C > 0 such that for all u ∈ C²(U) and 0 < ε ≤ 1,

|I_ε(u) − I(u)| ≤ C( Lip(u)² + β Lip(u) ) ε,

where β = ‖u‖_{C²(U)}.

Proof. We Taylor expand u(y) = u(x) + ∇u(x)·(y − x) + O(β|x − y|²) and ρ(y)² = ρ(x)²(1 + O(ε)) for |x − y| ≤ ε to obtain
I_ε(u) = (1/(σ_η ε²)) ∫_U ∫_{B(x,ε)∩U} η_ε(|x − y|) ( |∇u(x)·(y − x)|² + O(β Lip(u) ε³) ) ρ(x)² (1 + O(ε)) dy dx
= (1/(σ_η ε²)) ∫_U ∫_{B(x,ε)∩U} η_ε(|x − y|) ( |∇u(x)·(y − x)|² + O(β Lip(u) ε³) ) ρ(x)² dy dx
= (1/(σ_η ε²)) ∫_U ∫_{B(x,ε)∩U} η_ε(|x − y|) |∇u(x)·(y − x)|² ρ(x)² dy dx + O(β Lip(u) ε).
Fix x and choose a rotation matrix A with A∇u(x) = |∇u(x)| e₁. Therefore

I_ε(u) = (1/(σ_η ε²)) ∫_U |∇u(x)|² ρ(x)² ( ∫_{B(x,ε)∩V} η_ε(|x − z|) |z₁ − x₁|² dz ) dx + O(β Lip(u) ε),

where V denotes the rotation of U about x, and we used that B(x, ε) ∩ V = B(x, ε) when dist(x, ∂U) ≥ ε, together with the rotational invariance of η_ε under the change of variables z = A y. For all x ∈ U, a similar computation yields

∫_{B(x,ε)∩V} η_ε(|x − z|) |z₁ − x₁|² dz ≤ σ_η ε².
Therefore

|I_ε(u) − I(u)| ≤ 2 ∫_{∂_ε U} |∇u(x)|² ρ(x)² dx + Cβ Lip(u) ε
≤ 2 Lip(u)² ∫_{∂_ε U} ρ(x)² dx + Cβ Lip(u) ε
≤ C Lip(u)² ε + Cβ Lip(u) ε,
Proof. There exists a universal constant K > 0 such that for each h > 0 there is a partition B1, B2, . . . , B_M of U for which M ≤ Ch^{−d} and B_i ⊂ B(x_i, Kh) for some x_i ∈ B_i for all i. Let δ > 0 and set h = δ/(2K) so that B_i ⊂ B(x_i, δ/2). Let ρ_δ be the histogram density estimator

ρ_δ(x) = (1/n) Σ_{i=1}^M (n_i/|B_i|) 1_{B_i}(x),

and set

p_i = ∫_{B_i} ρ(x) dx ≤ C|B_i|.
Union bounding over i = 1, . . . , M, for any 0 < λ ≤ 1, with probability at least 1 − Cn exp(−cnδ^d λ²), we have

(5.87)    | n_i/n − ∫_{B_i} ρ(x) dx | ≤ Cλ|B_i|.
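The bound (5.87) is a statement about the histogram density estimator's relative accuracy on each box, and is easy to check with a uniform density (the parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

n, M = 100000, 20
delta = 1.0 / M                       # box side length, so |B_i| = delta in 1-d
X = rng.random(n)                     # rho = 1, hence int_{B_i} rho dx = delta
counts = np.histogram(X, bins=M, range=(0.0, 1.0))[0]
# |n_i/n - int_{B_i} rho| / |B_i| plays the role of lambda in (5.87)
deviation = np.max(np.abs(counts / n - delta)) / delta
print(deviation)  # small, of order sqrt(1/(n delta))
```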
(5.88)    E_δ u(x) = Σ_{i=1}^n u(X_i) 1_{U_i}(x)

for every u ∈ ℓ²(X_n). Hence, the extension operator allows us to relate empirical discrete sums with their continuum integral counterparts uniformly over L²(U) and ℓ²(X_n), without using concentration of measure as we did for pointwise consistency of graph Laplacians.
It will often be more convenient in the analysis to use the associated transportation map T_δ : U → X_n, which is defined by T_δ(x) = X_i if and only if x ∈ U_i. Then the extension map E_δ : ℓ²(X_n) → L²(U) can be written as E_δ u = u ∘ T_δ. In this notation, (5.89) becomes

(5.90)    (1/n) Σ_{i=1}^n u(X_i) = ∫_U u(T_δ(x)) ρ_δ(x) dx.

Since T_δ(U_i) = X_i and U_i ⊂ B(X_i, δ), we have |T_δ(x) − x| ≤ δ. This implies that

|T_δ(x) − T_δ(y)| ≤ |x − y| + 2δ

for all x, y ∈ U.
To see why T_δ is called a transportation map, note that U_i = T_δ^{−1}({X_i}), and so property (5.85) becomes

∫_{T_δ^{−1}({X_i})} ρ_δ(x) dx = 1/n.

We define the empirical distribution µ_n := (1/n) Σ_{i=1}^n δ_{X_i}. Then for any A ⊂ U

µ_n(A) = (1/n) Σ_{i=1}^n 1_{X_i ∈ A},

and so

µ_n(A) = Σ_{i=1}^n 1_{X_i ∈ A} ∫_{T_δ^{−1}({X_i})} ρ_δ(x) dx = ∫_{T_δ^{−1}(A)} ρ_δ dx.
By definition, this means that Tδ pushes forward the probability density ρδ to the
empirical distribution µn , written Tδ # ρδ = µn . In other words, the map Tδ is a
transportation map, transporting the distribution ρδ to µn .
We can also extend this to double integrals. In the transportation map notation we have

(5.92)    (1/n²) Σ_{i,j=1}^n Φ(X_i, X_j) = ∫_U ∫_U Φ(T_δ(x), T_δ(y)) ρ_δ(x) ρ_δ(y) dx dy.
We could have taken the above as a definition of ψ_{ε,δ}. We also note that σ_η = σ_{η,0} and

(5.95)    σ_η − ( 2α(n) Lip(η)/(d + 2) ) t ≤ σ_{η,t} ≤ σ_η.

Recall the smoothing operator Λ_{ε,δ} u(x) = θ_{ε,δ}(x)^{−1} ∫_U ψ_{ε,δ}(|x − y|) u(y) dy, where

(5.98)    θ_{ε,δ}(x) = ∫_U ψ_{ε,δ}(|x − y|) dy.
Proposition 5.33. For all ε, δ > 0 with ε > 2δ and x ∈ U with dist(x, ∂U ) ≥ ε − 2δ
we have θε,δ (x) = 1.
Therefore

‖θ_{ε,δ}( Λ_{ε,δ} u − u )‖²_{L²(U)} ≤ (1/c) ∫_U ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² dx dy ≤ (σ_η/c) I_{ε,δ}(u) ε².

Applying Proposition 5.33 completes the proof.
We now turn to comparing the local energy I(u) to the nonlocal energy Iε,δ (u).
Lemma 5.35. There exists C > 0 such that for every u ∈ L²(U) and ε, δ > 0 with ε ≥ 4δ we have

∫_U |∇Λ_{ε,δ} u|² ρ² θ_{ε,δ}² dx ≤ ( σ_η(1 + Cε)/σ_{η,δ/ε} )( I_{ε,δ}(u) + I_{ε,δ}(u; ∂_{ε−2δ} U) ) + (C/ε²) ‖Λ_{ε,δ} u − u‖²_{L²(∂_{ε−2δ} U)}.
Proof. Write w = Λ_{ε,δ} u = v/θ_{ε,δ}, where

v(x) = ∫_U ψ_{ε,δ}(|x − y|) u(y) dy.

Then

∇w(x) = (1/θ_{ε,δ}(x)) ∇v(x) − ( w(x)/θ_{ε,δ}(x) ) ∇θ_{ε,δ}(x).

We now compute

∇v(x) = (1/ε^d) ∫_U ψ′_{1,δ/ε}( |x − y|/ε ) ( (x − y)/(ε|x − y|) ) u(y) dy
= (1/(σ_{η,δ/ε} ε^{d+2})) ∫_U η( (|x − y| + 2δ)/ε ) (y − x) u(y) dy
= (1/(σ_{η,δ/ε} ε²)) ∫_U η_ε(|x − y| + 2δ) (y − x) u(y) dy.

Therefore

(5.101)    ∇w(x) = (1/(σ_{η,δ/ε} θ_{ε,δ}(x) ε²)) [ ∫_U η_ε(|x − y| + 2δ)(y − x)( u(y) − u(x) ) dy + ( u(x) − w(x) ) ∫_U η_ε(|x − y| + 2δ)(y − x) dy ].
Similarly, for any unit vector ξ we have

| (1/(σ_{η,δ/ε} θ_{ε,δ}(x) ε²)) ∫_U η_ε(|x − y| + 2δ)(y − x)·ξ dy | ≤ (C/ε) 1_{x ∈ ∂_{ε−2δ} U},

since

∫_U η_ε(|x − y| + 2δ)(y − x)·ξ dy = 0

if dist(x, ∂U) ≥ ε − 2δ. Therefore

|∇w(x)| ≤ (1/( √(σ_{η,δ/ε} θ_{ε,δ}(x)) ε )) ( ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² dy )^{1/2} + (C/ε) |w(x) − u(x)| 1_{∂_{ε−2δ} U}(x).

Squaring and integrating against ρ² θ_{ε,δ}² over U we have

∫_U |∇w|² ρ² θ_{ε,δ}² dx ≤ (1/(σ_{η,δ/ε} ε²)) ∫_U ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² ρ(x)² dx dy
+ (1/(σ_{η,δ/ε} ε²)) ∫_{∂_{ε−2δ} U} ∫_U η_ε(|x − y| + 2δ)( u(y) − u(x) )² ρ(x)² dy dx
+ (C/ε²) ∫_{∂_{ε−2δ} U} ( w(x) − u(x) )² dx,
where we applied Cauchy’s inequality 2ab ≤ a2 + b2 to the cross terms. The proof is
completed by noting that ρ is Lipschitz continuous and ρ ≥ α > 0, and so ρ(x) ≤
ρ(y)(1 + Cε) for y ∈ B(x, ε).
We can sharpen Lemma 5.35 if we have some information about the regularity of
u near the boundary.
Theorem 5.36 (Nonlocal to local). Suppose that u ∈ L²(U) satisfies

(5.103)    |u(x) − u(y)| ≤ Cε

for all x ∈ ∂_{ε−2δ} U and y ∈ B(x, ε − 2δ) ∩ U. Then for ε > 0 and δ ≤ cε, with 0 < c ≤ 1/4 depending only on η and n, we have

(5.104)    I(Λ_{ε,δ} u) ≤ ( 1 + C(ε + δ/ε) ) I_{ε,δ}(u) + C( ε + δ/ε ).

Proof. For any x ∈ ∂_{ε−2δ} U we have by (5.103) that

|(Λ_{ε,δ} u)(x) − u(x)| ≤ (1/θ_{ε,δ}(x)) ∫_{B(x,ε−2δ)∩U} ψ_{ε,δ}(|x − y|) |u(y) − u(x)| dy
≤ (Cε/θ_{ε,δ}(x)) ∫_{B(x,ε−2δ)∩U} ψ_{ε,δ}(|x − y|) dy = Cε.

Since 2δ < ε we have

(5.105)    ‖Λ_{ε,δ} u − u‖²_{L²(∂_{ε−2δ} U)} ≤ Cε² |∂_ε U| ≤ Cε³.
We again use (5.103) to obtain

I_{ε,δ}(u; ∂_{ε−2δ} U) = (1/(σ_η ε²)) ∫_{∂_{ε−2δ} U} ∫_{B(x,ε−2δ)∩U} η_ε(|x − y| + 2δ)( u(x) − u(y) )² ρ(x)ρ(y) dx dy
≤ C ∫_{∂_{ε−2δ} U} ∫_{B(x,ε−2δ)∩U} η_ε(|x − y| + 2δ) dx dy ≤ C|∂_ε U| ≤ Cε.

By Proposition 5.33 we have θ_{ε,δ} = 1 on U_{ε−2δ}. Combining these facts with (5.105) and Lemma 5.35 we have

∫_{U_{ε−2δ}} |∇Λ_{ε,δ} u|² ρ² dx ≤ ( σ_η(1 + Cε)/σ_{η,δ/ε} )( I_{ε,δ}(u) + Cε ) + Cε.

Invoking (5.95), for δ ≤ cε, with 0 < c ≤ 1/4 depending only on η and n, we have

∫_{U_{ε−2δ}} |∇Λ_{ε,δ} u|² ρ² dx ≤ ( 1 + C(ε + δ/ε) )( I_{ε,δ}(u) + Cε ) + Cε.
We also need estimates in the opposite direction, bounding the nonlocal energy
Iε (u) by the local energy I(u). While a similar result was already established in
Lemma 5.29 under the assumption u ∈ C 2 (U ), we need a result that holds for Lips-
chitz u to prove discrete to continuum convergence.
Lemma 5.37 (Local to nonlocal). There exists C > 0 such that for every Lipschitz continuous u : U → R and ε > 0 we have

(5.106)    I_ε(u) ≤ (1 + Cε) I(u) + C Lip(u)² ε.

Proof. We write

I_ε(u) = I_ε(u; U_ε) + I_ε(u; ∂_ε U).

Note that

( u(x) − u(y) )² = ( ∫_0^1 (d/dt) u(x + t(y − x)) dt )² ≤ ∫_0^1 |∇u(x + t(y − x))·(y − x)|² dt,
where we used Jensen’s inequality in the final step. Since ρ is Lipschitz and bounded
below, we have ρ(y) ≤ ρ(x)(1 + Cε). Plugging these into the definition of Iε (u; Uε )
we have
I_ε(u; U_ε) ≤ ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{U_ε} ∫_{B(x,ε)} η_ε(|x − y|) |∇u(x + t(y − x))·(y − x)|² dy ρ(x)² dx dt.

Now make the change of variables z = y − x in the inner integral to find that

I_ε(u; U_ε) ≤ ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{U_ε} ∫_{B(0,ε)} η_ε(|z|) |∇u(x + tz)·z|² dz ρ(x)² dx dt
= ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{B(0,ε)} η_ε(|z|) ∫_{U_ε} |∇u(x + tz)·z|² ρ(x)² dx dz dt
≤ ( (1 + Cε)/(σ_η ε²) ) ∫_0^1 ∫_{B(0,ε)} η_ε(|z|) ∫_U |∇u(x)·z|² ρ(x)² dx dz dt
= ( (1 + Cε)/(σ_η ε²) ) ∫_U ∫_{B(0,ε)} η_ε(|z|) |∇u(x)·z|² dz ρ(x)² dx
= (1 + Cε) ∫_U |∇u|² ρ(x)² dx = (1 + Cε) I(u).
Notes and references: The material in this section roughly follows [58], though
we have made some modifications to simplify the presentation. In particular, our
smoothing operator Λε,δ is different than the one in [58], which essentially uses Λε,0
everywhere. Some of the core arguments in [58] appeared earlier in [10], which consid-
ered the non-random setting with constant kernel η. The spectral convergence rates
from [10, 58] were recently improved in [14] by incorporating pointwise consistency of
graph Laplacians.
and

(5.108)    A = { u ∈ H¹(U) : u = g on ∂U },

where we recall E_{n,ε} is defined in (5.53) and I(u) = ∫_U |∇u|² ρ² dx.

To prove discrete to continuum convergence with rates, we require stability for the limiting problem.

Proposition 5.38 (Stability). Let u ∈ C²(U) such that div(ρ²∇u) = 0 in U. There exists C > 0, depending only on U, such that for all w ∈ H¹(U)

(5.111)    ‖u − w‖²_{H¹(U)} ≤ C ( I(w) − I(u) + ‖u‖_{C¹(U)} ‖u − w‖_{L²(∂U)} + ‖u − w‖²_{L²(∂U)} ).
Proof. Write R = I(w) − I(u) and compute

∫_U ρ² |∇(u − w)|² dx = I(u) − 2 ∫_U ρ² ∇w·∇u dx + I(w)
= 2I(u) − 2 ∫_U ρ² ∇w·∇u dx + R
= 2 ∫_U ρ² ∇u·∇(u − w) dx + R
= 2 ∫_{∂U} (∂u/∂ν)(u − w) ρ² dS + R
≤ Cα^{−2} ‖u‖_{C¹(U)} ‖u − w‖_{L²(∂U)} + R,

where we used integration by parts, together with div(ρ²∇u) = 0, in the second to last line. We now use the Poincaré inequality proved in Exercise 4.18 to obtain

∫_U (u − w)² dx ≤ C ( ∫_U |∇u − ∇w|² dx + ∫_{∂U} (u − w)² dS ).
Proof. The proof is a discrete analog of Proposition 5.38. Write R = E_{n,ε}(w) − E_{n,ε}(u) and compute

‖∇_{n,ε}u − ∇_{n,ε}w‖²_{ℓ²(X_n²)} = ‖∇_{n,ε}u‖²_{ℓ²(X_n²)} − 2(∇_{n,ε}u, ∇_{n,ε}w)_{ℓ²(X_n²)} + ‖∇_{n,ε}w‖²_{ℓ²(X_n²)}
= 2( ‖∇_{n,ε}u‖²_{ℓ²(X_n²)} − (∇_{n,ε}u, ∇_{n,ε}w)_{ℓ²(X_n²)} ) + R
= 2(∇_{n,ε}u, ∇_{n,ε}u − ∇_{n,ε}w)_{ℓ²(X_n²)} + R
= −2(L_{n,ε}u, u − w)_{ℓ²(X_n)} + R = R,

since L_{n,ε}u = 0.
and

u_ε(x) = u(x) + φ( d(x)/(2ε) )( g(x) − u(x) ).

Since d and φ are Lipschitz continuous, u_ε is Lipschitz. Since u and g are Lipschitz and u = g on ∂U, we also have |g(x) − u(x)| ≤ Cε for x ∈ ∂_{4ε} U. Recalling |∇d| = 1 a.e., we have

|∇u_ε(x)| = | ∇u(x) + φ′( d(x)/(2ε) ) (∇d(x)/(2ε)) ( g(x) − u(x) ) + φ( d(x)/(2ε) )( ∇g(x) − ∇u(x) ) |
≤ |∇u(x)| + (1/(2ε)) | φ′( d(x)/(2ε) ) | |g(x) − u(x)| + |∇g(x) − ∇u(x)|
≤ 2|∇u(x)| + |∇g(x)| + (1/(2ε)) ‖g − u‖_{L∞(∂_{4ε} U)}
≤ 2|∇u(x)| + |∇g(x)| + C/2.
Therefore |∇u_ε| is bounded. Since u_ε = u on U_{4ε} we have

(5.114)    I(u_ε) − I(u) = ∫_U |∇u_ε|² ρ² dx − ∫_U |∇u|² ρ² dx ≤ ∫_{∂_{4ε} U} |∇u_ε|² ρ² dx ≤ Cε.
Now, since |∇u_ε| ≤ C a.e., u_ε is Lipschitz continuous with Lip(u_ε) bounded independently of ε. By Lemma 5.28 we have that

|E_{n,ε}(u_ε) − I_ε(u_ε)| ≤ C ( 1/n + λ )

with probability at least 1 − 2 exp(−cnε^d λ²). By Lemma 5.37 and (5.114) we have
|E_δ u_{n,ε}(x) − g(x)| = |u_{n,ε}(T_δ(x)) − g(x)| = |g(T_δ(x)) − g(x)| ≤ Cδ,

‖Λ_{ε,δ} E_δ u_{n,ε} − E_δ u_{n,ε}‖²_{L²(U_ε)} ≤ C I_{ε,δ}(E_δ u_{n,ε}) ε² ≤ Cε²,

and

‖u − E_δ u_{n,ε}‖²_{L²(U_ε)} ≤ 2‖u − Λ_{ε,δ} E_δ u_{n,ε}‖²_{L²(U_ε)} + 2‖Λ_{ε,δ} E_δ u_{n,ε} − E_δ u_{n,ε}‖²_{L²(U_ε)}
≤ C( ε + δ/ε + λ ) + Cε².

Next, note that

‖u − u_{n,ε}‖²_{ℓ²(X_n)} = (1/n) Σ_{x ∈ X_n} ( u(x) − u_{n,ε}(x) )².
Therefore
5. Define

w_ε(x) = Λ_{ε,δ} E_δ u_{n,ε}(x) + φ( 4d(x)/ε )( g(x) − Λ_{ε,δ} E_δ u_{n,ε}(x) ).

As in the proof of Theorem 5.36, and part 3, we have that |∇Λ_{ε,δ} E_δ u_{n,ε}(x)| ≤ C and

≤ C|∂_{ε/2} U| ≤ Cε.
Mathematical preliminaries
A.1 Inequalities
For x ∈ R^d the norm of x is

|x| := √( x₁² + x₂² + · · · + x_d² ),

and the dot product of x, y ∈ R^d is

x·y = Σ_{i=1}^d x_i y_i.

Notice that

|x|² = x·x.
Simple inequalities, when used in a clever manner, are very powerful tools in the
study of partial differential equations. We give a brief overview of some commonly
used inequalities here.
The Cauchy-Schwarz inequality states that

|x·y| ≤ |x||y|.

To prove it, one expands the nonnegative quadratic polynomial

h(t) := |x + ty|²

in t and observes that its discriminant must be nonpositive. For x, y ∈ R^d

|x + y|² = (x + y)·(x + y) = x·x + x·y + y·x + y·y.
Therefore

|x + y|² = |x|² + 2x·y + |y|².

Combining this with the Cauchy-Schwarz inequality yields the triangle inequality

|x + y| ≤ |x| + |y|.

Applying the triangle inequality to x = (x − y) + y gives

|x| = |x − y + y| ≤ |x − y| + |y|,

and hence

|x − y| ≥ |x| − |y|.

For a, b ∈ R we have

0 ≤ (a − b)² = a² − 2ab + b².

Therefore

(A.1)    ab ≤ (1/2)a² + (1/2)b².

More generally, for p, q > 1 with 1/p + 1/q = 1, Young's inequality states that

(A.2)    ab ≤ a^p/p + b^q/q  for all a, b > 0.

The proof follows from convexity of the exponential:

ab = e^{log(a)+log(b)} = e^{(1/p) log(a^p) + (1/q) log(b^q)} ≤ (1/p) e^{log(a^p)} + (1/q) e^{log(b^q)} = a^p/p + b^q/q.
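A quick numerical sanity check of Young's inequality (A.2) over random pairs (a, b):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 3.0
q = p / (p - 1)                        # conjugate exponent: 1/p + 1/q = 1
a = rng.uniform(0.01, 10.0, 10000)
b = rng.uniform(0.01, 10.0, 10000)
gap = a**p / p + b**q / q - a * b      # nonnegative by Young's inequality
print(gap.min())                        # >= 0, with equality only when a^p = b^q
```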
A.2 Topology
We will have to make use of basic point-set topology. We define the open ball of radius r > 0 centered at x₀ ∈ R^d by

B⁰(x₀, r) := { x ∈ R^d : |x − x₀| < r }.

The closed ball is defined as

B(x₀, r) := { x ∈ R^d : |x − x₀| ≤ r }.
Definition A.1. A set U ⊂ Rd is called open if for each x ∈ U there exists r > 0
such that B(x, r) ⊂ U .
Exercise A.2. Let U, V ⊂ Rd be open. Show that
W := U ∪ V := {x ∈ Rd : x ∈ U or x ∈ V }
is open. 4
Definition A.3. We say that a sequence {x_k}_{k=1}^∞ in R^d converges to x ∈ R^d, written x_k → x, if

lim_{k→∞} |x_k − x| = 0.
We defined the boundary only for open sets, but it can be defined for any set.

Definition A.8. The interior of a set U ⊂ R^d, denoted int(U), is the union of all open subsets of U, and the boundary of U is defined as

∂U := Ū \ int(U),

where Ū denotes the closure of U.
4
Definition A.12. We say a set U ⊂ Rd is bounded if there exists M > 0 such that
|x| ≤ M for all x ∈ U .
Definition A.13. We say a set U ⊂ Rd is compact if U is closed and bounded.
Definition A.14. For open sets V ⊂ U ⊂ Rd we say that V is compactly contained
in U if V is compact and V ⊂ U . If V is compactly contained in U we write V ⊂⊂ U .
A.3 Differentiation
A.3.1 Partial derivatives
The partial derivative of a function u = u(x₁, x₂, . . . , x_d) in the x_i variable is defined as

(∂u/∂x_i)(x) := lim_{h→0} ( u(x + h e_i) − u(x) ) / h,

provided the limit exists. Here e₁, e₂, . . . , e_d are the standard basis vectors in R^d, so e_i = (0, . . . , 0, 1, 0, . . . , 0) ∈ R^d has a one in the i-th entry. For simplicity of notation we will write

u_{x_i} = ∂u/∂x_i.
The gradient of a function u : R^d → R is the vector of partial derivatives

∇u := ( u_{x₁}, u_{x₂}, . . . , u_{x_d} ).

Higher derivatives are defined iteratively. The second derivatives of u are defined as

∂²u/(∂x_i ∂x_j) := (∂/∂x_i)( ∂u/∂x_j ).

This means that

( ∂²u/(∂x_i ∂x_j) )(x) = lim_{h→0} (1/h) ( u_{x_j}(x + h e_i) − u_{x_j}(x) ),

provided the limit exists. As before, we write

u_{x_i x_j} = ∂²u/(∂x_i ∂x_j)
for notational simplicity. When u_{x_i x_j} and u_{x_j x_i} exist and are continuous we have

u_{x_i x_j} = u_{x_j x_i},

that is, the second derivatives are the same regardless of which order we take them in.
We will generally always assume our functions are smooth (infinitely differentiable),
so equality of mixed partials is always assumed to hold.
The Hessian of u : R^d → R is the matrix of all second partial derivatives

∇²u(x) := ( u_{x_i x_j} )_{i,j=1}^d =
[ u_{x₁x₁}   u_{x₁x₂}   · · ·   u_{x₁x_d} ]
[ u_{x₂x₁}   u_{x₂x₂}   · · ·   u_{x₂x_d} ]
[    ⋮           ⋮        ⋱        ⋮     ]
[ u_{x_d x₁}  u_{x_d x₂}  · · ·  u_{x_d x_d} ].
Since we have equality of mixed partials, the Hessian is a symmetric matrix, i.e.,
(∇2 u)T = ∇2 u. Since we treat the gradient ∇u as a column vector, the product
∇2 u(x)∇u(x) denotes the Hessian matrix multiplied by the gradient vector. That is,
[∇²u(x)∇u(x)]_j = Σ_{i=1}^d u_{x_i x_j} u_{x_i}.

Chain rule: For u : R^d → R and a differentiable curve v : R → R^d we have

(A.3)    (d/dt) u(v(t)) = ∇u(v(t))·v′(t) = Σ_{i=1}^d u_{x_i}(v(t)) v_i′(t).

Similarly, for a mapping F : R^d → R^d we have

(∂/∂x_j) u(F(x)) = ∇u(F(x))·F_{x_j}(x) = Σ_{i=1}^d u_{x_i}(F(x)) F^i_{x_j}(x),

where F_{x_j} = (F¹_{x_j}, F²_{x_j}, . . . , F^d_{x_j}). This is a special case of (A.3) with t = x_j.
Product rule: Given two functions u, v : R^d → R, we have

∇(uv) = v∇u + u∇v.

This is entry-wise the usual product rule for single variable calculus. Given a vector field F : R^d → R^d and a function u : R^d → R we have

(∂/∂x_i)( uF^i ) = u_{x_i} F^i + u F^i_{x_i}.

Therefore

div(uF) = ∇u·F + u divF.
Exercise A.15. Let |x| = √( x₁² + · · · + x_d² ).
Since φ is a function of one variable t, we can use the one dimensional Taylor series to obtain

φ(t) = φ(0) + φ′(0)t + O(|t|²).

The constant in the O(|t|²) term depends on the maximum of |φ″(t)|. All that remains is to compute the derivatives of φ. By the chain rule

(A.9)    φ′(t) = Σ_{i=1}^d u_{x_i}(x + (y − x)t)(y_i − x_i),
and

φ″(t) = (d/dt) Σ_{i=1}^d u_{x_i}(x + (y − x)t)(y_i − x_i)
= Σ_{i=1}^d (d/dt) u_{x_i}(x + (y − x)t)(y_i − x_i)
(A.10)    = Σ_{i=1}^d Σ_{j=1}^d u_{x_i x_j}(x + (y − x)t)(y_i − x_i)(y_j − x_j).
In particular,

φ′(0) = Σ_{i=1}^d u_{x_i}(x)(y_i − x_i) = ∇u(x)·(y − x),
where R₂(x, y) satisfies |R₂(x, y)| ≤ (1/2) max_t |φ″(t)|. Let C > 0 denote the maximum value of |u_{x_i x_j}(z)| over all z, i and j. Then by (A.10)

|φ″(t)| ≤ C Σ_{i=1}^d Σ_{j=1}^d |y_i − x_i||y_j − x_j| ≤ Cd² |x − y|².

It follows that |R₂(x, y)| ≤ (C/2) d² |x − y|², hence R₂(x, y) ∈ O(|x − y|²) and we arrive at the first order Taylor series

u(y) = u(x) + ∇u(x)·(y − x) + O(|x − y|²).

This says that u can be locally approximated near x to order O(|x − y|²) by the affine function

L(y) = u(x) + ∇u(x)·(y − x).
We can continue this way to obtain the second order Taylor expansion. We assume
now that u is three times continuously differentiable. Using the one dimensional
second order Taylor expansion we have
1
(A.12) φ(t) = φ(0) + φ′ (0)t + φ′′ (0)t2 + O(|t|3 ).
2
The constant in the O(|t|3 ) term depends on the maximum of |φ′′′ (t)|. Notice also
that
$\displaystyle \varphi''(0) = \sum_{i=1}^{d}\sum_{j=1}^{d} u_{x_i x_j}(x)(y_i - x_i)(y_j - x_j) = (y - x)\cdot\nabla^2 u(x)(y - x),$
where ∇2 u(x) is the Hessian matrix. Plugging this into (A.12) with t = 1 yields
$\displaystyle u(y) = u(x) + \nabla u(x)\cdot(y - x) + \frac{1}{2}(y - x)\cdot\nabla^2 u(x)(y - x) + R_3(x,y),$
where $R_3(x,y)$ satisfies $|R_3(x,y)| \leq \frac{1}{6}\max_t |\varphi'''(t)|$. We compute
$\displaystyle \varphi'''(t) = \frac{d}{dt}\sum_{i=1}^{d}\sum_{j=1}^{d} u_{x_i x_j}(x + (y-x)t)(y_i - x_i)(y_j - x_j) = \sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d} u_{x_i x_j x_k}(x + (y-x)t)(y_i - x_i)(y_j - x_j)(y_k - x_k).$
A.5. FUNCTION SPACES 153
Let $C > 0$ denote the maximum value of $|u_{x_i x_j x_k}(z)|$ over all $z$, $i$, $j$, and $k$. Then we have
$\displaystyle |\varphi'''(t)| \leq C\sum_{i=1}^{d}\sum_{j=1}^{d}\sum_{k=1}^{d}|y_i - x_i||y_j - x_j||y_k - x_k| \leq Cn^3|x - y|^3.$
Therefore $|R_3(x,y)| \leq \frac{C}{6}n^3|x - y|^3$ and so $R_3 \in O(|x - y|^3)$. Finally we arrive at the second order Taylor expansion
(A.13) $\displaystyle u(y) = u(x) + \nabla u(x)\cdot(y - x) + \frac{1}{2}(y - x)\cdot\nabla^2 u(x)(y - x) + O(|x - y|^3).$
This says that u can be locally approximated near x to order $O(|x - y|^3)$ by the quadratic function
$\displaystyle L(y) = u(x) + \nabla u(x)\cdot(y - x) + \frac{1}{2}(y - x)\cdot\nabla^2 u(x)(y - x).$
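The cubic decay of the remainder in (A.13) can be observed the same way: halving $|y - x|$ should divide the error by about eight. Again, u is an arbitrary smooth test function.

```python
import numpy as np

# Check that the second order Taylor remainder in (A.13) decays like
# O(|x-y|^3): halving |y - x| divides the error by about 8.

def u(x):
    return np.sin(x[0]) * np.exp(x[1])

def grad_u(x):
    return np.array([np.cos(x[0]) * np.exp(x[1]),
                     np.sin(x[0]) * np.exp(x[1])])

def hess_u(x):
    s, c, e = np.sin(x[0]), np.cos(x[0]), np.exp(x[1])
    return np.array([[-s * e, c * e],
                     [ c * e, s * e]])

x = np.array([0.5, -0.2])
d = np.array([1.0, 2.0])
errors = []
for h in [0.1, 0.05, 0.025]:
    y = x + h * d
    taylor2 = u(x) + grad_u(x) @ (y - x) + 0.5 * (y - x) @ hess_u(x) @ (y - x)
    errors.append(abs(u(y) - taylor2))
ratios = [errors[i] / errors[i + 1] for i in range(2)]
# Cubic decay: the error drops by a factor of about 8 per halving.
assert all(7.0 < r < 9.0 for r in ratios)
```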
The terminology k-times continuously differentiable means that all k th -order partial
derivatives of u exist and are continuous on U . We write C 0 (U ) = C(U ) for the space
of functions that are continuous on U .
Exercise A.21. Show that the function $u(x) = x^2$ for $x > 0$ and $u(x) = -x^2$ for $x \leq 0$ belongs to $C^1(\mathbb{R})$ but not to $C^2(\mathbb{R})$.
We also define
$\displaystyle C^\infty(U) := \bigcap_{k=1}^{\infty} C^k(U).$
Notice that
$\|u\|_{L^2(U)}^2 = (u, u)_{L^2(U)}.$
We also define
$L^2(U) := \big\{\text{functions } u : U \to \mathbb{R} \text{ for which } \|u\|_{L^2(U)} < \infty\big\}.$
A.6 Analysis
A.6.1 The Riemann integral
Many students are accustomed to using different notation for integration in different
dimensions. For example, integration along the real line in R is usually written
$\displaystyle \int_a^b u(x)\,dx,$
A.6. ANALYSIS 155
In these notes we write $\int_U u(x)\,dx$ in any dimension, where $u(x) = u(x_1, x_2, \dots, x_n)$ and $dx = dx_1 dx_2 \cdots dx_n$. This notation has the advantage of being far more compact without losing the meaning.
Let us recall the interpretation of the integral $\int_U u\,dx$ in the Riemann sense. We
partition the domain into M rectangles and approximate the integral by a Riemann
sum
$\displaystyle \int_U u\,dx \approx \sum_{k=1}^{M} u(x_k)\,\Delta x_k,$
where $x_k \in \mathbb{R}^d$ is a point in the $k$-th rectangle, and $\Delta x_k := \Delta x_{k,1}\Delta x_{k,2}\cdots\Delta x_{k,n}$ is the $n$-dimensional volume (or measure) of the $k$-th rectangle ($\Delta x_{k,i}$ for $i = 1, \dots, n$ are the side lengths of the $k$-th rectangle). Then the Riemann integral is defined by taking the limit as the largest side length in the partition tends to zero (provided the limit exists and does not depend on the choice of partition or points $x_k$). Notice here that $x_k = (x_{k,1}, \dots, x_{k,n}) \in \mathbb{R}^d$ is a point in $\mathbb{R}^d$, and not the $k$-th entry of $x$. There is a slight abuse of notation here; the reader will have to discern from the context which is implied.
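For a concrete instance of this definition, the Riemann sum below approximates the integral of $u(x, y) = xy$ over $U = [0,1]^2$, whose exact value is $1/4$; the uniform partition into equal squares with midpoint sample points is one admissible choice.

```python
import numpy as np

# Riemann sum approximation of a 2-d integral over U = [0,1]^2:
# partition into M = m^2 equal squares, sample u at the midpoint of
# each, and sum u(x_k) * Delta x_k.  Exact value for u = x*y is 1/4.

def riemann_sum(u, m):
    h = 1.0 / m                    # side length of each square
    s = 0.0
    for i in range(m):
        for j in range(m):
            xk = ((i + 0.5) * h, (j + 0.5) * h)   # midpoint of square k
            s += u(*xk) * h * h                   # u(x_k) times its area
    return s

u = lambda x, y: x * y
approx = riemann_sum(u, 64)
assert abs(approx - 0.25) < 1e-4
```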
If $S \subset \mathbb{R}^d$ is an $(n-1)$-dimensional (or possibly lower dimensional) surface, we write the surface integral of u over S as
$\displaystyle \int_S u(x)\,dS(x).$
Some properties will hold on all of Rd except for a set with measure zero. In this
case we will say the property holds almost everywhere or abbreviated “a.e.”.
The following theorems are the most useful for passing to limits within integrals.
Lemma A.25 (Fatou's Lemma). If $u_k : \mathbb{R}^d \to [0,\infty)$ is a sequence of nonnegative measurable functions then
(A.14) $\displaystyle \int_{\mathbb{R}^d} \liminf_{k\to\infty} u_k(x)\,dx \leq \liminf_{k\to\infty} \int_{\mathbb{R}^d} u_k(x)\,dx.$
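The inequality in Fatou's Lemma can be strict. A standard example (not from the text) is a unit bump sliding off to infinity: each integral equals 1, but at every fixed point the sequence is eventually zero.

```python
import numpy as np

# Strict inequality in (A.14): u_k = indicator of [k, k+1], so the mass
# escapes to infinity.  Every integral is 1, but liminf_k u_k = 0
# pointwise, so the left side of (A.14) is 0 < 1.

def u_k(x, k):
    return np.where((x >= k) & (x < k + 1), 1.0, 0.0)

x, dx = np.linspace(0.0, 100.0, 1_000_001, retstep=True)
integrals = [u_k(x, k).sum() * dx for k in range(10)]
assert all(abs(I - 1.0) < 1e-3 for I in integrals)

# At a fixed point x0, u_k(x0) = 0 for every k > x0.
x0 = np.array([3.7])
assert all(u_k(x0, k)[0] == 0.0 for k in range(5, 50))
```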
Noting that lim inf(−uk ) = − lim sup(uk ), this can be rearranged to obtain (A.15),
provided g is summable.
The following two theorems are more or less direct applications of Fatou’s Lemma
and the Reverse Fatou Lemma, but present the results in a way that is very useful in
practice.
Proof. Let $u(x) = \lim_{k\to\infty} u_k(x)$. Note that $|u_k - u| \leq |u_k| + |u| \leq 2g$. Therefore, we can apply the Reverse Fatou Lemma to find that
$\displaystyle \limsup_{k\to\infty}\int_{\mathbb{R}^d} |u - u_k|\,dx \leq \int_{\mathbb{R}^d}\lim_{k\to\infty}|u_k - u|\,dx = 0.$
Therefore $\lim_{k\to\infty}\int_{\mathbb{R}^d}|u - u_k|\,dx = 0$ and we have
$\displaystyle \left|\int_{\mathbb{R}^d} u\,dx - \int_{\mathbb{R}^d} u_k\,dx\right| = \left|\int_{\mathbb{R}^d}(u - u_k)\,dx\right| \leq \int_{\mathbb{R}^d}|u_k - u|\,dx \longrightarrow 0$
as $k \to \infty$.
where A ⊂ Rd is measurable and |A| < ∞. Then for each ε > 0 there exists a
measurable subset E ⊂ A such that |A − E| ≤ ε and uk → u uniformly on E.
$\displaystyle \frac{\partial u}{\partial \nu}(x) := \nabla u(x)\cdot\nu(x).$
Integration by parts in Rd is based on the Gauss-Green Theorem.
Theorem A.31 (Integration by parts). Let U ⊂ Rd be an open and bounded set with
a smooth boundary $\partial U$. If $u, v \in C^2(U)$ then
(i) $\displaystyle \int_U u\Delta v\,dx = \int_{\partial U} u\frac{\partial v}{\partial \nu}\,dS - \int_U \nabla u\cdot\nabla v\,dx,$
(ii) $\displaystyle \int_U (u\Delta v - v\Delta u)\,dx = \int_{\partial U}\left(u\frac{\partial v}{\partial \nu} - v\frac{\partial u}{\partial \nu}\right)dS,$ and
(iii) $\displaystyle \int_U \Delta v\,dx = \int_{\partial U}\frac{\partial v}{\partial \nu}\,dS.$
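Identity (iii) can be verified numerically on a simple domain. The sketch below uses the unit square, which is only piecewise smooth, but Gauss-Green still applies there; $v(x, y) = x^2 y$ is an arbitrary test choice, so $\Delta v = 2y$ and both sides equal 1.

```python
import numpy as np

# Numerical check of identity (iii) on U = [0,1]^2 with v(x,y) = x^2*y,
# so Delta v = 2y and grad v = (2xy, x^2).  Both sides should equal 1.

n = 1000
t = (np.arange(n) + 0.5) / n       # midpoint quadrature nodes on (0,1)
h = 1.0 / n

# Volume side: integral over U of Delta v = 2y (tensor midpoint rule).
X, Y = np.meshgrid(t, t)
volume_side = np.sum(2.0 * Y) * h * h

# Boundary side: dv/dnu = grad v . nu, edge by edge (nu = outward normal).
boundary_side = (
    np.sum(2.0 * 1.0 * t) * h      # edge x = 1, nu = ( 1, 0): dv/dnu =  2y
    + 0.0                          # edge x = 0, nu = (-1, 0): dv/dnu = -2*0*y = 0
    + np.sum(t**2) * h             # edge y = 1, nu = ( 0, 1): dv/dnu =  x^2
    + np.sum(-(t**2)) * h          # edge y = 0, nu = ( 0,-1): dv/dnu = -x^2
)
assert abs(volume_side - boundary_side) < 1e-9
```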
Proof. It is enough to prove the result for $\theta = 0$: if any statement holds for $\theta > 0$, we can define $v(x) = u(x) - \frac{\theta}{2}|x|^2$ and use the results for v with $\theta = 0$.
The proof is split into four parts.
1. (i) =⇒ (ii): Assume u is convex. Let $x_0 \in \mathbb{R}$ and set $\lambda = \frac{1}{2}$, $x = x_0 - h$, and $y = x_0 + h$ for a real number h. Then
$\displaystyle \lambda x + (1-\lambda)y = \frac{1}{2}(x_0 - h) + \frac{1}{2}(x_0 + h) = x_0,$
and the convexity condition (A.22) yields
$\displaystyle u(x_0) \leq \frac{1}{2}u(x_0 - h) + \frac{1}{2}u(x_0 + h).$
Therefore
$u(x_0 - h) - 2u(x_0) + u(x_0 + h) \geq 0$
A.8. CONVEX FUNCTIONS 161
and
$u(x) \geq u(y) + u'(y)(x - y).$
Subtracting the two inequalities above proves that (iv) holds.
4. (iv) =⇒ (i): Assume (iv) holds. Then $u'$ is nondecreasing, and so $u''(x) \geq 0$ for all $x \in \mathbb{R}$; hence (ii) holds, and so does (iii). Let $x, y \in \mathbb{R}$ and $\lambda \in (0,1)$, and set $x_0 = \lambda x + (1-\lambda)y$. By (iii),
$u(x) \geq u(x_0) + u'(x_0)(x - x_0) \quad\text{and}\quad u(y) \geq u(x_0) + u'(x_0)(y - x_0).$
Multiplying the first inequality by $\lambda$, the second by $1 - \lambda$, and adding gives
$\lambda u(x) + (1-\lambda)u(y) \geq u(x_0) + u'(x_0)\big(\lambda x + (1-\lambda)y - x_0\big) = u(x_0),$
and so u is convex.
Lemma A.37 has a natural higher dimensional analog, but we first need some new notation. For a symmetric real-valued $d \times d$ matrix $A = (a_{ij})_{i,j=1}^d$, we write
(A.24) $\displaystyle A \geq 0 \iff v\cdot Av = \sum_{i,j=1}^{d} a_{ij}v_i v_j \geq 0 \quad \text{for all } v \in \mathbb{R}^d.$
Proof. Again, we may prove the result just for $\theta = 0$. The proof follows mostly from Lemma A.37, with some additional observations.
1. (i) =⇒ (ii): Assume u is convex. Since convexity is defined along lines, we see that $g(t) = u(x + tv)$ is convex for all $x, v \in \mathbb{R}^d$, and by Lemma A.37, $g''(t) \geq 0$ for all t. By (A.10) we have
(A.25) $\displaystyle g''(t) = \frac{d^2}{dt^2}u(x + tv) = \sum_{i=1}^{d}\sum_{j=1}^{d} u_{x_i x_j}(x + tv)\,v_i v_j = v\cdot\nabla^2 u(x + tv)\,v$
for all t. By Lemma A.37, g is convex for all $x, v \in \mathbb{R}^d$, from which it easily follows that u is convex.
A.9 Probability
Here, we give a brief overview of basic probability. For more details we refer the
reader to [25].
the union A ∪ B is the event that A or B occurred, and the intersection A ∩ B is the event that both A and B occurred. By subadditivity of measures we have
$\mathbb{P}_X(X \in B) := \mathbb{P}(X^{-1}(B)).$
$\mathbb{E}_X[g(X)]$ to be
$\displaystyle \mathbb{E}_X[g(X)] = \int_{\Omega_X} g(x)\,d\mathbb{P}_X(x) = \int_{\Omega} g(X(\omega))\,d\mathbb{P}(\omega).$
In particular,
$\displaystyle \mathbb{E}_X[X] = \int_{\Omega_X} x\,d\mathbb{P}_X(x) = \int_{\Omega} X(\omega)\,d\mathbb{P}(\omega).$
164 APPENDIX A. MATHEMATICAL PRELIMINARIES
(A.26) $\displaystyle \mathbb{P}_X(X \geq t) \leq \frac{\mathbb{E}_X[X]}{t}.$
Proof. By definition we have
$\displaystyle \mathbb{P}_X(X \geq t) = \int_t^{\infty} d\mathbb{P}_X(x) \leq \int_t^{\infty}\frac{x}{t}\,d\mathbb{P}_X(x) \leq \frac{1}{t}\int_0^{\infty} x\,d\mathbb{P}_X(x) = \frac{\mathbb{E}_X[X]}{t}.$
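A quick Monte Carlo check of (A.26), with X exponentially distributed (an arbitrary nonnegative example, not from the text):

```python
import numpy as np

# Empirical check of Markov's inequality (A.26): X ~ Exponential(1) is
# nonnegative with E[X] = 1, so P(X >= t) <= E[X]/t for every t > 0.

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=200_000)
ts = [0.5, 1.0, 2.0, 5.0]
tails = [np.mean(samples >= t) for t in ts]     # empirical P(X >= t)
bounds = [samples.mean() / t for t in ts]       # Markov bound E[X]/t
assert all(p <= b + 1e-3 for p, b in zip(tails, bounds))
```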
(A.28) $\displaystyle \mathbb{P}_X(|X - \mathbb{E}_X[X]| \geq t) \leq \frac{\mathrm{Var}(X)}{t^2}.$
Applying Markov's inequality (A.26) to $Y = (X - \mathbb{E}_X[X])^2$ gives
$\displaystyle \mathbb{P}_X(|X - \mathbb{E}_X[X]| \geq t) = \mathbb{P}_X(Y \geq t^2) \leq \frac{\mathbb{E}_X[Y]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}.$
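The same kind of empirical check works for Chebyshev's inequality (A.28); a normal sample is used as an arbitrary example, with mean and variance estimated from the data.

```python
import numpy as np

# Empirical check of Chebyshev's inequality (A.28) with X ~ N(3, 2^2),
# so Var(X) = 4.  The distribution is an arbitrary example.

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=200_000)
mu, var = x.mean(), x.var()
ts = [1.0, 2.0, 4.0]
tails = [np.mean(np.abs(x - mu) >= t) for t in ts]   # empirical tail
bounds = [var / t**2 for t in ts]                    # Chebyshev bound
assert all(p <= b + 1e-3 for p, b in zip(tails, bounds))
```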
$X_i(\omega_1, \omega_2, \dots, \omega_n) = X(\omega_i).$
(A.30) $\displaystyle \mathbb{E}_{(X_1, X_2, \dots, X_n)}[f_1(X_1)f_2(X_2)\cdots f_n(X_n)] = \prod_{i=1}^{n}\mathbb{E}_X[f_i(X)].$
for any i, so the choice of which expectation to use is irrelevant. Since we do not wish
to always specify the base random variable X on which the sequence is constructed,
we often write X1 or Xi in place of X.
$\displaystyle = \frac{1}{n^2}\sum_{i,j=1}^{n}\mathbb{E}[(X_i - \mu)(X_j - \mu)].$
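The double sum above is the expansion of the variance of the sample average $\frac{1}{n}\sum_i X_i$; by independence the cross terms with $i \neq j$ vanish, leaving $\mathrm{Var}(X)/n$. A quick Monte Carlo check, using uniform samples (for which $\mathrm{Var}(X) = 1/12$):

```python
import numpy as np

# By independence, E[(X_i - mu)(X_j - mu)] = 0 for i != j, so the
# double sum reduces to Var(sample mean) = Var(X)/n.  Empirical check
# with X ~ Uniform(0, 1), for which Var(X) = 1/12.

rng = np.random.default_rng(2)
n, trials = 50, 100_000
sample_means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
predicted = (1.0 / 12.0) / n
assert abs(sample_means.var() - predicted) < 1e-4
```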
A test function satisfying the requirements in the proof of Lemma A.43 is given by
$\varphi(x) = \begin{cases}\exp\left(-\dfrac{1}{\delta^2 - |x - x_0|^2}\right), & \text{if } |x - x_0| < \delta,\\ 0, & \text{if } |x - x_0| \geq \delta.\end{cases}$
Indeed, the equality is immediate, since for any φ we have $\nabla u\cdot\varphi \leq |\nabla u|$ due to the condition $|\varphi| \leq 1$, and we can take $\varphi = \frac{\nabla u}{|\nabla u|}$ to saturate the inequality. Since u has compact support in $\mathbb{R}^d$, we can integrate by parts to obtain the identity
(A.34) $\displaystyle \int_{\mathbb{R}^d} |\nabla u|\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{\mathbb{R}^d} u\,\mathrm{div}\,\varphi\,dx.$
The identity (A.34) is taken to be the definition of the total variation $\int_{\mathbb{R}^d}|\nabla u|\,dx$ when u is not differentiable.
Therefore we have
$\displaystyle \int_{\mathbb{R}^d} |\nabla \chi_U|\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{\mathbb{R}^d} \chi_U(x)\,\mathrm{div}\,\varphi(x)\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{U} \mathrm{div}\,\varphi(x)\,dx = \sup_{\substack{\varphi\in C^\infty(\mathbb{R}^d;\mathbb{R}^d)\\ |\varphi|\leq 1}} \int_{\partial U} \varphi\cdot\nu\,dS.$
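The conclusion that the total variation of $\chi_U$ equals the perimeter of $\partial U$ can also be seen discretely: on a grid, a finite difference total variation of the indicator of an axis-aligned square recovers its perimeter. (For axis-aligned sets the simple anisotropic grid TV already gives the right value; for general sets it only approximates the isotropic perimeter.)

```python
import numpy as np

# Discrete total variation of the indicator of the square
# [0.25, 0.75]^2 on an n x n grid over [0,1]^2.  Each jump across a
# pixel edge contributes h, so the TV approximates the perimeter 2.

n = 400
h = 1.0 / n
xs = (np.arange(n) + 0.5) * h
X, Y = np.meshgrid(xs, xs, indexing="ij")
chi = ((X > 0.25) & (X < 0.75) & (Y > 0.25) & (Y < 0.75)).astype(float)

# TV = sum over pixels of (|D_x chi| + |D_y chi|) * h^2, with forward
# differences D chi / h; this collapses to (number of jumps) * h.
tv = (np.abs(np.diff(chi, axis=0)).sum()
      + np.abs(np.diff(chi, axis=1)).sum()) * h
assert abs(tv - 2.0) < 0.05
```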
A.10. MISCELLANEOUS RESULTS 169
[3] H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method,
I. The continuous dynamical system: global exploration of the local minima of
a real-valued function by asymptotic analysis of a dissipative dynamical system.
Communications in Contemporary Mathematics, 2(01):1–34, 2000.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In Advances in Neural Information Processing Systems,
pages 585–591, 2002.
[9] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
172 BIBLIOGRAPHY
[11] J. Calder. The game theoretic p-Laplacian and semi-supervised learning with
few labels. Nonlinearity, 32(1), 2018.
[13] J. Calder. Consistency of Lipschitz learning with infinite unlabeled data and
finite labeled data. SIAM Journal on Mathematics of Data Science, 1(4):780–
812, 2019.
[14] J. Calder and N. García Trillos. Improved spectral convergence rates for graph
Laplacians on ε-graphs and k-NN graphs. arXiv preprint, 2019.
[16] J. Calder, D. Slepčev, and M. Thorpe. Rates of convergence for Laplacian semi-
supervised learning with low labelling rates. In preparation, 2019.
[17] J. Calder and A. Yezzi. PDE Acceleration: A convergence rate analysis and
applications to obstacle problems. Research in the Mathematical Sciences, 6(35),
2019.
[19] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[20] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
[22] B. Dacorogna. Direct methods in the calculus of variations, volume 78. Springer
Science & Business Media, 2007.
[25] R. Durrett. Probability: theory and examples, volume 49. Cambridge University Press, 2019.
[27] S. Esedoğlu and Y.-H. R. Tsai. Threshold dynamics for the piecewise constant Mumford–Shah functional. Journal of Computational Physics, 211(1):367–384, 2006.
[31] T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.
[32] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang. Manifold-ranking based image
retrieval. In Proceedings of the 12th annual ACM International Conference on
Multimedia, pages 9–16. ACM, 2004.
[33] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang. Generalized manifold-ranking-
based image retrieval. IEEE Transactions on Image Processing, 15(10):3170–
3177, 2006.
[34] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(Jun):1325–1368, 2007.
[35] M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds: weak and strong pointwise consistency of graph Laplacians. In International Conference on Computational Learning Theory, pages 470–485. Springer, 2005.
[43] J. Moser. A new proof of de Giorgi’s theorem concerning the regularity problem
for elliptic differential equations. Communications on Pure and Applied Mathe-
matics, 13(3):457–468, 1960.
[49] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise
removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.
[51] W. Rudin. Real and complex analysis. Tata McGraw-Hill Education, 2006.
[52] J. Shi and J. Malik. Normalized cuts and image segmentation. Departmental
Papers (CIS), page 107, 2000.
[54] Z. Shi, B. Wang, and S. J. Osher. Error estimation of weighted nonlocal Laplacian
on random point cloud. arXiv preprint arXiv:1809.08622, 2018.
[58] N. G. Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace–Beltrami operator. arXiv preprint arXiv:1801.10108, 2018.
[59] N. G. Trillos and R. Murray. A maximum principle argument for the uniform convergence of graph Laplacian regressors. arXiv preprint arXiv:1901.10089, 2019.
[61] N. G. Trillos and D. Slepčev. Continuum limit of total variation on point clouds. Archive for Rational Mechanics and Analysis, 220(1):193–241, 2016.
[64] B. Xu, J. Bu, C. Chen, D. Cai, X. He, W. Liu, and J. Luo. Efficient manifold
ranking for image retrieval. In Proceedings of the 34th International ACM SIGIR
Conference on Research and Development in Information Retrieval, pages 525–
534. ACM, 2011.
[65] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang. Saliency detection via
graph-based manifold ranking. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 3166–3173, 2013.
[69] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data
on a directed graph. In Proceedings of the 22nd International Conference on
Machine Learning, pages 1036–1043. ACM, 2005.
[72] X. Zhou, M. Belkin, and N. Srebro. An iterated graph Laplacian approach for
ranking on manifolds. In Proceedings of the 17th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 877–885. ACM,
2011.