Tao An Epsilon of Room
We use $\mathbb{E}_{x \in X} f(x)$ to denote the average value of a function $f: X \to \mathbf{C}$ on a non-empty finite set $X$.
Acknowledgments
The author is supported by a grant from the MacArthur Foundation,
by NSF grant DMS-0649473, and by the NSF Waterman award.
Thanks to Blake Stacey and anonymous commenters for global
corrections to the text.
Chapter 1
Real analysis
1.1. A quick review of measure and integration theory
In this section we quickly review the basics of abstract measure theory
and integration theory, which was covered in the previous course but
will of course be relied upon in the current course. This is only a
brief summary of the material; of course, one should consult a real
analysis text for the full details of the theory.
1.1.1. Measurable spaces. Ideally, measure theory on a space $X$ should be able to assign a measure (or volume, or mass, etc.) to every set in $X$. Unfortunately, due to paradoxes such as the Banach-Tarski paradox, many natural notions of measure (e.g. Lebesgue measure) cannot be applied to measure all subsets of $X$; instead, one must restrict attention to certain measurable subsets of $X$. This turns out to suffice for most applications; for instance, just about any non-pathological subset of Euclidean space that one actually encounters will be Lebesgue measurable (as a general rule of thumb, any set which does not rely on the axiom of choice in its construction will be measurable).
To formalise this abstractly, we use

Definition 1.1.1 (Measurable spaces). A measurable space $(X, \mathcal{A})$ is a set $X$, together with a collection $\mathcal{A}$ of subsets of $X$ which form a σ-algebra, thus $\mathcal{A}$ contains the empty set and $X$, and is closed under countable intersections, countable unions, and complements. A subset of $X$ is said to be measurable with respect to the measurable space if it lies in $\mathcal{A}$.

A function $f: X \to Y$ from one measurable space $(X, \mathcal{A})$ to another $(Y, \mathcal{Y})$ is said to be measurable if $f^{-1}(E) \in \mathcal{A}$ for all $E \in \mathcal{Y}$.
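On a finite set, closure under countable unions reduces to closure under pairwise unions, so the axioms of Definition 1.1.1 can be checked mechanically. The following sketch (the helper name is ours, not from the text) verifies them:

```python
from itertools import chain, combinations

def is_sigma_algebra(X, A):
    """Check that a collection A of subsets of a finite set X is a
    sigma-algebra: contains the empty set and X, and is closed under
    complements and (finite, hence here countable) unions/intersections."""
    A = {frozenset(S) for S in A}
    X = frozenset(X)
    if frozenset() not in A or X not in A:
        return False
    for S in A:
        if X - S not in A:                        # closed under complements
            return False
        for T in A:
            if S | T not in A or S & T not in A:  # unions and intersections
                return False
    return True

X = {1, 2, 3, 4}
trivial = [set(), X]
discrete = [set(s) for s in chain.from_iterable(combinations(X, r) for r in range(5))]
partial = [set(), {1, 2}, X]          # missing the complement {3, 4}

print(is_sigma_algebra(X, trivial))   # True
print(is_sigma_algebra(X, discrete))  # True
print(is_sigma_algebra(X, partial))   # False
```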
Remark 1.1.2. The class of measurable spaces forms a category, with the measurable functions being the morphisms. The symbol σ stands for "countable union"; cf. σ-compact, σ-finite, $F_\sigma$ set.
Remark 1.1.3. The notion of a measurable space $(X, \mathcal{A})$ (and of a measurable function) is superficially similar to that of a topological space $(X, \mathcal{T})$ (and of a continuous function); the topology $\mathcal{T}$ contains $\emptyset$ and $X$ just as the σ-algebra $\mathcal{A}$ does, but is now closed under arbitrary unions and finite intersections, rather than countable unions, countable intersections, and complements. The two categories are linked to each other by the Borel algebra construction, see Example 1.1.5 below.
Example 1.1.4. We say that one σ-algebra $\mathcal{A}$ on a set $X$ is coarser than another $\mathcal{A}'$ (or that $\mathcal{A}'$ is finer than $\mathcal{A}$) if $\mathcal{A} \subset \mathcal{A}'$ (or equivalently, if the identity map from $(X, \mathcal{A}')$ to $(X, \mathcal{A})$ is measurable); thus every set which is measurable in the coarse space is also measurable in the fine space. The coarsest σ-algebra on a set $X$ is the trivial σ-algebra $\{\emptyset, X\}$, while the finest is the discrete σ-algebra $2^X := \{E : E \subset X\}$.
Example 1.1.5. The intersection $\bigwedge_{\alpha \in A} \mathcal{A}_\alpha := \bigcap_{\alpha \in A} \mathcal{A}_\alpha$ of an arbitrary family $(\mathcal{A}_\alpha)_{\alpha \in A}$ of σ-algebras on $X$ is another σ-algebra on $X$. Because of this, given any collection $\mathcal{T}$ of sets on $X$ we can define the σ-algebra $\mathcal{B}[\mathcal{T}]$ generated by $\mathcal{T}$, defined to be the intersection of all the σ-algebras containing $\mathcal{T}$, or equivalently the coarsest algebra for which all sets in $\mathcal{T}$ are measurable. (This intersection is non-vacuous, since it will always involve the discrete σ-algebra $2^X$.) In particular, the open sets $\mathcal{T}$ of a topological space $(X, \mathcal{T})$ generate a σ-algebra, known as the Borel σ-algebra of that space.
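On a finite set, the generated σ-algebra $\mathcal{B}[\mathcal{T}]$ of Example 1.1.5 can be computed by brute-force closure; closing under complements and pairwise unions suffices, since intersections then follow from De Morgan's laws. A small illustrative sketch, with hypothetical helper names:

```python
def generate_sigma_algebra(X, F):
    """Compute the sigma-algebra B[F] generated by a collection F of
    subsets of a finite set X, by closing under complements and pairwise
    unions until a fixed point is reached."""
    X = frozenset(X)
    A = {frozenset(), X} | {frozenset(S) for S in F}
    while True:
        new = {X - S for S in A} | {S | T for S in A for T in A}
        if new <= A:          # fixed point: A is closed
            return A
        A |= new

X = {1, 2, 3, 4}
B = generate_sigma_algebra(X, [{1}])
print(sorted(sorted(S) for S in B))
# the smallest sigma-algebra containing {1}: {}, {1}, {2,3,4}, X
```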
We can also define the join $\bigvee_{\alpha \in A} \mathcal{A}_\alpha$ of any family $(\mathcal{A}_\alpha)_{\alpha \in A}$ of σ-algebras on $X$ by the formula

(1.1) $\bigvee_{\alpha \in A} \mathcal{A}_\alpha := \mathcal{B}[\bigcup_{\alpha \in A} \mathcal{A}_\alpha]$.

For instance, the Lebesgue σ-algebra $\mathcal{L}$ of Lebesgue measurable sets on a Euclidean space $\mathbf{R}^n$ is the join of the Borel σ-algebra $\mathcal{B}$ and of the algebra of null sets and their complements (also called co-null sets).
Exercise 1.1.1. A function $f: X \to Y$ from one topological space to another is said to be Borel measurable if it is measurable once $X$ and $Y$ are equipped with their respective Borel σ-algebras. Show that every continuous function is Borel measurable. (The converse statement, of course, is very far from being true; for instance, the pointwise limit of a sequence of measurable functions, if it exists, is also measurable, whereas the analogous claim for continuous functions is completely false.)
Remark 1.1.6. A function $f: \mathbf{R}^n \to \mathbf{C}$ is said to be Lebesgue measurable if it is measurable from $\mathbf{R}^n$ (with the Lebesgue σ-algebra) to $\mathbf{C}$ (with the Borel σ-algebra), or equivalently if $f^{-1}(B)$ is Lebesgue measurable for every open ball $B$ in $\mathbf{C}$. Note the asymmetry between Lebesgue and Borel here; in particular, the composition of two Lebesgue measurable functions need not be Lebesgue measurable.
Example 1.1.7. Given a function $f: X \to Y$ from a set $X$ to a measurable space $(Y, \mathcal{Y})$, we can define the pullback $f^{-1}(\mathcal{Y})$ of $\mathcal{Y}$ to be the σ-algebra $f^{-1}(\mathcal{Y}) := \{f^{-1}(E) : E \in \mathcal{Y}\}$; this is the coarsest structure on $X$ that makes $f$ measurable. For instance, the pullback of the Borel σ-algebra from $[0,1]$ to $[0,1]^2$ under the map $(x, y) \mapsto x$ consists of all sets of the form $E \times [0,1]$, where $E \subset [0,1]$ is Borel-measurable.
More generally, given a family $(f_\alpha: X \to Y_\alpha)_{\alpha \in A}$ of functions into measurable spaces $(Y_\alpha, \mathcal{Y}_\alpha)$, we can define the σ-algebra $\bigvee_{\alpha \in A} f_\alpha^{-1}(\mathcal{Y}_\alpha)$ generated by the $f_\alpha$; this is the coarsest structure on $X$ that makes all the $f_\alpha$ simultaneously measurable.
Remark 1.1.8. In probability theory and information theory, the functions $f_\alpha: X \to Y_\alpha$ in Example 1.1.7 can be interpreted as observables, and the σ-algebra generated by these observables can similarly be interpreted as capturing the information one can obtain from them.
Example 1.1.10. A function $f: X \to A$ into some index set $A$ will partition $X$ into level sets $f^{-1}(\{\alpha\})$ for $\alpha \in A$; conversely, every partition $X = \bigcup_{\alpha \in A} E_\alpha$ of $X$ into disjoint sets arises from at least one function $f$ in this manner, and generates the σ-algebra $\mathcal{B}[(E_\alpha)_{\alpha \in A}]$.

Example 1.1.11 (Product spaces). Let $((X_\alpha, \mathcal{A}_\alpha))_{\alpha \in A}$ be a family of measurable spaces; then the Cartesian product $\prod_{\alpha \in A} X_\alpha$ can be given the product σ-algebra $\prod_{\alpha \in A} \mathcal{A}_\alpha$, generated by the coordinate projections $\pi_\beta: \prod_{\alpha \in A} X_\alpha \to X_\beta$ as in Example 1.1.7.
Exercise 1.1.4. Let $(X_\alpha)_{\alpha \in A}$ be an at most countable family of second countable topological spaces. Show that the Borel σ-algebra of the product space (with the product topology) is equal to the product of the Borel σ-algebras of the factor spaces. In particular, the Borel σ-algebra on $\mathbf{R}^n$ is the product of $n$ copies of the Borel σ-algebra on $\mathbf{R}$. (The claim can fail when the countability hypotheses are dropped, though in most applications in analysis, these hypotheses are satisfied.) We caution however that the Lebesgue σ-algebra on $\mathbf{R}^n$ is not the product of $n$ copies of the one-dimensional Lebesgue σ-algebra, as it contains some additional null sets; however, it is the completion of that product.
Exercise 1.1.5. Let $(X, \mathcal{A})$ and $(Y, \mathcal{Y})$ be measurable spaces. Show that if $E$ is measurable with respect to $\mathcal{A} \times \mathcal{Y}$, then for every $x \in X$, the set $\{y \in Y : (x, y) \in E\}$ is measurable in $\mathcal{Y}$, and similarly for every $y \in Y$, the set $\{x \in X : (x, y) \in E\}$ is measurable in $\mathcal{A}$. Thus, sections of Borel-measurable sets are again Borel-measurable. (The same is not true for Lebesgue-measurable sets.)
1.1.2. Measure spaces. Now we endow measurable spaces with a
measure, turning them into measure spaces.
Definition 1.1.12 (Measures). A (non-negative) measure μ on a measurable space $(X, \mathcal{A})$ is a function $\mu: \mathcal{A} \to [0, +\infty]$ such that $\mu(\emptyset) = 0$, and such that we have the countable additivity property $\mu(\bigcup_{n=1}^\infty E_n) = \sum_{n=1}^\infty \mu(E_n)$ whenever $E_1, E_2, \ldots$ are disjoint measurable sets. We refer to the triplet $(X, \mathcal{A}, \mu)$ as a measure space.

A measure space $(X, \mathcal{A}, \mu)$ is finite if $\mu(X) < \infty$; it is a probability space if $\mu(X) = 1$ (and then we call μ a probability measure). It is σ-finite if $X$ can be covered by countably many sets of finite measure.

A measurable set $E$ is a null set if $\mu(E) = 0$. A property on points $x$ in $X$ is said to hold for almost every $x \in X$ (or almost surely, for probability spaces) if it holds outside of a null set. We abbreviate "almost every" and "almost surely" as a.e. and a.s. respectively. The complement of a null set is said to be a co-null set or to have full measure.
Example 1.1.13 (Dirac measures). Given any measurable space $(X, \mathcal{A})$ and a point $x \in X$, we can define the Dirac measure (or Dirac mass) $\delta_x$ to be the measure such that $\delta_x(E) = 1$ when $x \in E$ and $\delta_x(E) = 0$ otherwise. This is a probability measure.
Example 1.1.14 (Counting measure). Given any measurable space $(X, \mathcal{A})$, we define counting measure $\#$ by defining $\#(E)$ to be the cardinality $|E|$ of $E$ when $E$ is finite, or $+\infty$ otherwise. This measure is finite when $X$ is finite, and σ-finite when $X$ is at most countable. If $X$ is finite and non-empty, we can define normalised counting measure $\frac{1}{|X|}\#$; this is a probability measure, also known as uniform probability measure on $X$ (especially if we give $X$ the discrete σ-algebra).
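The Dirac, counting, and uniform measures of Examples 1.1.13 and 1.1.14 are easy to realise concretely on a finite set; a small sketch (the function names are ours):

```python
def dirac(x):
    """Dirac mass at x: assigns 1 to sets containing x, 0 otherwise."""
    return lambda E: 1 if x in E else 0

def counting(E):
    """Counting measure: the cardinality of E."""
    return len(E)

def uniform(X):
    """Normalised counting measure on a finite non-empty set X."""
    return lambda E: len(E) / len(X)

X = {1, 2, 3, 4}
mu = uniform(X)
print(dirac(2)({1, 2}))      # 1
print(counting({1, 3, 4}))   # 3
print(mu({1, 2}))            # 0.5

# finite additivity on disjoint sets: mu(E u F) = mu(E) + mu(F)
E, F = {1}, {2, 3}
assert mu(E | F) == mu(E) + mu(F)
```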
Example 1.1.15. Any finite non-negative linear combination of measures is again a measure; any finite convex combination of probability measures is again a probability measure.
Example 1.1.16. If $f: X \to Y$ is a measurable map from one measurable space $(X, \mathcal{A})$ to another $(Y, \mathcal{Y})$, and μ is a measure on $\mathcal{A}$, we can define the push-forward $f_* \mu$ of μ by the formula $f_* \mu(E) := \mu(f^{-1}(E))$; this is a measure on $(Y, \mathcal{Y})$. Thus, for instance, $f_* \delta_x = \delta_{f(x)}$ for all $x \in X$.
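On finite sets the push-forward of Example 1.1.16 is just a preimage computation; the following sketch (hypothetical names) computes $f_* \mu$ for the uniform measure under a parity map:

```python
def pushforward(mu, f, domain):
    """Push-forward f_* mu of a measure mu on a finite set `domain`
    along f: (f_* mu)(E) = mu(f^{-1}(E))."""
    def fmu(E):
        preimage = {x for x in domain if f(x) in E}
        return mu(preimage)
    return fmu

X = {0, 1, 2, 3}
mu = lambda E: len(E) / len(X)     # uniform probability measure on X
f = lambda x: x % 2                # parity map into {0, 1}
nu = pushforward(mu, f, X)
print(nu({0}))     # mass of the even points: 0.5
print(nu({0, 1}))  # 1.0
```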
We record some basic properties of measures of sets:
Exercise 1.1.6. Let $(X, \mathcal{A}, \mu)$ be a measure space. Show the following statements:

(i) (Monotonicity) If $E \subset F$ are measurable sets, then $\mu(E) \le \mu(F)$. (In particular, any measurable subset of a null set is again a null set.)

(ii) (Countable subadditivity) If $E_1, E_2, \ldots$ are a countable sequence of measurable sets, then $\mu(\bigcup_{n=1}^\infty E_n) \le \sum_{n=1}^\infty \mu(E_n)$. (Of course, one also has subadditivity for finite sequences.) In particular, any countable union of null sets is again a null set.

(iii) (Monotone convergence for sets) If $E_1 \subset E_2 \subset \ldots$ are measurable, then $\mu(\bigcup_{n=1}^\infty E_n) = \lim_{n \to \infty} \mu(E_n)$.

(iv) (Dominated convergence for sets) If $E_1 \supset E_2 \supset \ldots$ are measurable, and $\mu(E_1)$ is finite, then $\mu(\bigcap_{n=1}^\infty E_n) = \lim_{n \to \infty} \mu(E_n)$. Show that the claim can fail if $\mu(E_1)$ is infinite.
Exercise 1.1.7. A measure space is said to be complete if every subset of a null set is measurable (and is thus again a null set). Show that every measure space $(X, \mathcal{A}, \mu)$ has a unique minimal complete refinement $(X, \overline{\mathcal{A}}, \overline{\mu})$, known as the completion of $(X, \mathcal{A}, \mu)$, and that a set is measurable in $\overline{\mathcal{A}}$ if and only if it is equal almost everywhere to a measurable set in $\mathcal{A}$. (The completion of the Borel σ-algebra with respect to Lebesgue measure is known as the Lebesgue σ-algebra.)
A powerful way to construct measures on σ-algebras $\mathcal{A}$ is to first construct them on a smaller Boolean algebra $\mathcal{A}_0$ that generates $\mathcal{A}$, and then extend them via the following result:
Theorem 1.1.17 (Carathéodory's extension theorem, special case). Let $(X, \mathcal{A})$ be a measurable space, and let $\mathcal{A}_0$ be a Boolean algebra (i.e. closed under finite unions, intersections, and complements) that generates $\mathcal{A}$. Let $\mu: \mathcal{A}_0 \to [0, +\infty]$ be a function such that

(i) $\mu(\emptyset) = 0$;

(ii) If $A_1, A_2, \ldots \in \mathcal{A}_0$ are disjoint and $\bigcup_{n=1}^\infty A_n \in \mathcal{A}_0$, then $\mu(\bigcup_{n=1}^\infty A_n) = \sum_{n=1}^\infty \mu(A_n)$.

Then μ can be extended to a measure $\mu: \mathcal{A} \to [0, +\infty]$ on $\mathcal{A}$, which we shall also call μ.
Remark 1.1.18. The conditions (i), (ii) in the above theorem are clearly necessary if μ has any hope to be extended to a measure on $\mathcal{A}$. Thus this theorem gives a necessary and sufficient condition for a function on a Boolean algebra to be extended to a measure. The extension can easily be shown to be unique when $X$ is σ-finite.
Proof. (Sketch) Define the outer measure $\mu^*(E)$ of any set $E \subset X$ as $\mu^*(E) := \inf \sum_{n=1}^\infty \mu(A_n)$, where $(A_n)_{n=1}^\infty$ ranges over all coverings of $E$ by elements in $\mathcal{A}_0$. It is not hard to see that $\mu^*$ agrees with μ on $\mathcal{A}_0$, so it will suffice to show that it is a measure on $\mathcal{A}$.

It is easy to check that

(1.2) $\mu^*(A) = \mu^*(A \cap E) + \mu^*(A \backslash E)$

for all $A \subset X$ and $E \in \mathcal{A}$. This can first be shown for $E \in \mathcal{A}_0$, and then one observes that the class of $E$ that obey (1.2) for all $A$ is a σ-algebra; we leave this as a (moderately lengthy) exercise.

The identity (1.2) already shows that $\mu^*$ is finitely additive on $\mathcal{A}$; combining this with countable subadditivity and monotonicity, we conclude that $\mu^*$ is countably additive on $\mathcal{A}$, as required.

Exercise 1.1.9. Let $(X_i, \mathcal{A}_i, \mu_i)$ for $i = 1, \ldots, n$ be a finite collection of σ-finite measure spaces, and let $(\prod_{i=1}^n X_i, \prod_{i=1}^n \mathcal{A}_i)$ be the product measurable space. Show that there exists a unique measure μ on this space such that $\mu(\prod_{i=1}^n A_i) = \prod_{i=1}^n \mu_i(A_i)$ for all $A_i \in \mathcal{A}_i$. The measure μ is referred to as the product measure of the $\mu_1, \ldots, \mu_n$ and is denoted $\prod_{i=1}^n \mu_i$.
Exercise 1.1.10. Let $E$ be a Lebesgue measurable subset of $\mathbf{R}^n$, and let $m$ be Lebesgue measure. Establish the inner regularity property

(1.3) $m(E) = \sup\{m(K) : K \subset E, K \text{ compact}\}$

and the outer regularity property

(1.4) $m(E) = \inf\{m(U) : E \subset U, U \text{ open}\}$.

Combined with the fact that $m$ is locally finite, this implies that $m$ is a Radon measure.
1.1.3. Integration. Now we define integration on a measure space $(X, \mathcal{A}, \mu)$.
Definition 1.1.19 (Integration). Let $(X, \mathcal{A}, \mu)$ be a measure space.

(i) If $f: X \to [0, +\infty]$ is a non-negative simple function (i.e. a measurable function that only takes on finitely many values $a_1, \ldots, a_n$), we define the integral $\int_X f\, d\mu$ of $f$ to be $\int_X f\, d\mu = \sum_{i=1}^n a_i \mu(f^{-1}(\{a_i\}))$ (with the convention that $\infty \cdot 0 = 0$). In particular, if $f = 1_A$ is the indicator function of a measurable set $A$, then $\int_X 1_A\, d\mu = \mu(A)$.
(ii) If $f: X \to [0, +\infty]$ is a non-negative measurable function, we define the integral $\int_X f\, d\mu$ to be the supremum of $\int_X g\, d\mu$, where $g$ ranges over all simple functions bounded between $0$ and $f$.

(iii) If $f: X \to [-\infty, +\infty]$ is a measurable function, whose positive and negative parts $f_+ := \max(f, 0)$, $f_- := \max(-f, 0)$ have finite integral, we say that $f$ is absolutely integrable and define $\int_X f\, d\mu := \int_X f_+\, d\mu - \int_X f_-\, d\mu$.

(iv) If $f: X \to \mathbf{C}$ is a measurable function with real and imaginary parts absolutely integrable, we say that $f$ is absolutely integrable and define $\int_X f\, d\mu := \int_X \operatorname{Re} f\, d\mu + i \int_X \operatorname{Im} f\, d\mu$.

We will sometimes show the variable of integration, e.g. writing $\int_X f(x)\, d\mu(x)$ for $\int_X f\, d\mu$, for sake of clarity.
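On a finite measure space every function is simple, so clause (i) of Definition 1.1.19 already computes the integral; a minimal sketch (the names are ours):

```python
def integrate_simple(f, mu, X):
    """Integrate f: X -> [0, inf) over a finite set X against a measure mu
    (given on subsets of X), using the simple-function formula
    sum_i a_i * mu(f^{-1}({a_i})) over the finitely many values a_i of f."""
    values = {f(x) for x in X}
    return sum(a * mu({x for x in X if f(x) == a}) for a in values)

X = {1, 2, 3, 4}
counting = len                           # counting measure on X
f = lambda x: 1 if x <= 2 else 0         # indicator of A = {1, 2}
print(integrate_simple(f, counting, X))  # mu(A) = 2
g = lambda x: x
print(integrate_simple(g, counting, X))  # 1 + 2 + 3 + 4 = 10
```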
The following results are standard, and the proofs are omitted:
Theorem 1.1.20 (Standard facts about integration). Let $(X, \mathcal{A}, \mu)$ be a measure space.

All the above integration notions are compatible with each other; for instance, if $f$ is both non-negative and absolutely integrable, then the definitions (ii) and (iii) (and (iv)) agree.

The functional $f \mapsto \int_X f\, d\mu$ is linear over $\mathbf{R}^+$ for simple functions or non-negative functions, is linear over $\mathbf{R}$ for real-valued absolutely integrable functions, and linear over $\mathbf{C}$ for complex-valued absolutely integrable functions. In particular, the set of (real or complex) absolutely integrable functions on $(X, \mathcal{A}, \mu)$ is a (real or complex) vector space.

A complex-valued measurable function $f: X \to \mathbf{C}$ is absolutely integrable if and only if $\int_X |f|\, d\mu < \infty$, in which case we have the triangle inequality $|\int_X f\, d\mu| \le \int_X |f|\, d\mu$. Of course, the same claim holds for real-valued measurable functions.
If $f: X \to [0, +\infty]$ is non-negative, then $\int_X f\, d\mu \ge 0$, with equality holding if and only if $f = 0$ a.e.

If one modifies an absolutely integrable function on a set of measure zero, then the new function is also absolutely integrable, and has the same integral as the original function. Similarly, two non-negative functions that agree a.e. have the same integral. (Because of this, we can meaningfully integrate functions that are only defined almost everywhere.)

If $f: X \to \mathbf{C}$ is absolutely integrable, then $f$ is finite a.e., and vanishes outside of a σ-finite set.

If $f: X \to \mathbf{C}$ is absolutely integrable, and $\varepsilon > 0$, then there exists a complex-valued simple function $g: X \to \mathbf{C}$ such that $\int_X |f - g|\, d\mu \le \varepsilon$. (This is a manifestation of Littlewood's second principle.)
(Change of variables formula) If $\phi: X \to Y$ is a measurable map to another measurable space $(Y, \mathcal{Y})$, and $g: Y \to \mathbf{C}$, then we have $\int_X g \circ \phi\, d\mu = \int_Y g\, d\phi_* \mu$.

Theorem 1.1.21 (Convergence theorems). Let $(X, \mathcal{A}, \mu)$ be a measure space.

(Monotone convergence for series) If $f_n: X \to [0, +\infty]$ are measurable, then $\int_X \sum_{n=1}^\infty f_n\, d\mu = \sum_{n=1}^\infty \int_X f_n\, d\mu$.
(Fatou's lemma) If $f_n: X \to [0, +\infty]$ are measurable, then $\int_X \liminf_{n \to \infty} f_n\, d\mu \le \liminf_{n \to \infty} \int_X f_n\, d\mu$.
(Dominated convergence for sequences) If $f_n: X \to \mathbf{C}$ are measurable functions converging pointwise a.e. to a limit $f$, and $|f_n| \le g$ a.e. for some absolutely integrable $g: X \to [0, +\infty]$, then $\int_X \lim_{n \to \infty} f_n\, d\mu = \lim_{n \to \infty} \int_X f_n\, d\mu$.
(Dominated convergence for series) If $f_n: X \to \mathbf{C}$ are measurable functions with $\sum_n \int_X |f_n|\, d\mu < \infty$, then $\sum_n f_n(x)$ is absolutely convergent for a.e. $x$ and $\int_X \sum_{n=1}^\infty f_n\, d\mu = \sum_{n=1}^\infty \int_X f_n\, d\mu$.
(Egorov's theorem) If $f_n: X \to \mathbf{C}$ are measurable functions converging pointwise a.e. to a limit $f$ on a subset $A$ of $X$ of finite measure, and $\varepsilon > 0$, then there exists a set of measure at most ε, outside of which $f_n$ converges uniformly to $f$ in $A$. (This is a manifestation of Littlewood's third principle.)
Remark 1.1.22. As a rule of thumb, if one does not have exact or approximate monotonicity or domination (where "approximate" means "up to an error $e$ whose $L^1$ norm $\int_X |e|\, d\mu$ goes to zero"), then one should not expect the integral of a limit to equal the limit of the integral in general; there is just too much room for oscillation.
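A standard instance of this failure, sketched here for illustration, is a bump escaping to infinity: there is no absolutely integrable dominating function, and the limit of the integrals differs from the integral of the limit.

```latex
% Moving bump on R with Lebesgue measure m:
f_n := 1_{[n, n+1]}.
% Pointwise limit:
\lim_{n \to \infty} f_n(x) = 0 \text{ for every } x,
\quad\text{so}\quad \int_{\mathbf{R}} \lim_{n \to \infty} f_n \, dm = 0,
% but each bump carries unit mass:
\int_{\mathbf{R}} f_n \, dm = 1 \text{ for every } n,
\quad\text{so}\quad \lim_{n \to \infty} \int_{\mathbf{R}} f_n \, dm = 1.
% Any dominating g with g \ge \sup_n f_n \ge 1_{[1, \infty)}
% fails to be absolutely integrable, so dominated convergence does not apply.
```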
Exercise 1.1.11. Let $f: X \to \mathbf{C}$ be an absolutely integrable function on a measure space $(X, \mathcal{A}, \mu)$. Show that $f$ is uniformly integrable, in the sense that for every $\varepsilon > 0$ there exists $\delta > 0$ such that $\int_E |f|\, d\mu \le \varepsilon$ whenever $E$ is a measurable set of measure at most δ. (The property of uniform integrability becomes more interesting, of course, when applied to a family of functions, rather than to a single function.)
With regard to product measures and integration, the fundamental theorem in this subject is
Theorem 1.1.23 (Fubini-Tonelli theorem). Let $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{Y}, \nu)$ be σ-finite measure spaces, with product space $(X \times Y, \mathcal{A} \times \mathcal{Y}, \mu \times \nu)$.

(Tonelli theorem) If $f: X \times Y \to [0, +\infty]$ is measurable, then $\int_{X \times Y} f\, d\mu \times \nu = \int_X (\int_Y f(x, y)\, d\nu(y))\, d\mu(x) = \int_Y (\int_X f(x, y)\, d\mu(x))\, d\nu(y)$.

(Fubini theorem) If $f: X \times Y \to \mathbf{C}$ is absolutely integrable, then we also have $\int_{X \times Y} f\, d\mu \times \nu = \int_X (\int_Y f(x, y)\, d\nu(y))\, d\mu(x) = \int_Y (\int_X f(x, y)\, d\mu(x))\, d\nu(y)$, with the inner integrals being absolutely integrable a.e. and the outer integrals all being absolutely integrable.

If $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{Y}, \nu)$ are complete measure spaces, then the same claims hold with the product σ-algebra $\mathcal{A} \times \mathcal{Y}$ replaced by its completion.
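For counting measure on finite sets, the Tonelli theorem reduces to the interchange of two finite sums, which can be sanity-checked directly (the function below is an arbitrary non-negative example, not from the text):

```python
# Double "integral" against counting measure on finite X and Y is a double
# sum; Tonelli says both orders of summation agree with the product sum.
X = range(3)
Y = range(4)
f = lambda x, y: (x + 1) * (y + 2)   # arbitrary non-negative function

sum_xy = sum(sum(f(x, y) for y in Y) for x in X)   # integrate in y, then x
sum_yx = sum(sum(f(x, y) for x in X) for y in Y)   # integrate in x, then y
total = sum(f(x, y) for x in X for y in Y)         # integral on the product
print(sum_xy, sum_yx, total)  # 84 84 84
assert sum_xy == sum_yx == total
```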
Remark 1.1.24. The theorem fails for non-σ-finite spaces, but virtually every measure space actually encountered in "hard analysis" applications will be σ-finite. (One should be cautious, however, with any space constructed using ultrafilters or the first uncountable ordinal.) It is also important that $f$ obey some measurability in the product space; there exist non-measurable $f$ for which the iterated integrals exist (and may or may not be equal to each other, depending on the properties of $f$ and even on which axioms of set theory one chooses), but the product integral (of course) does not.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/01. Thanks to Andy, PDEBeginner, Phil, Sune Kristian Jacobsen, wangtwo, and an anonymous commenter for corrections.
Several commenters noted Solovay's theorem, which asserts that there exist models of set theory without the axiom of choice in which all sets are measurable. This led to some discussion of the extent to which one could formalise the claim that any set which could be defined without the axiom of choice was necessarily measurable, but the discussion was inconclusive.
1.2. Signed measures and the Radon-Nikodym-Lebesgue theorem

In this section, $X = (X, \mathcal{A})$ is a fixed measurable space. We shall often omit the σ-algebra $\mathcal{A}$, and simply refer to elements of $\mathcal{A}$ as measurable sets. Unless otherwise indicated, all subsets of $X$ appearing below are restricted to be measurable, and all functions on $X$ appearing below are also restricted to be measurable.
We let $\mathcal{M}^+(X)$ denote the space of measures on $X$, i.e. functions $\mu: \mathcal{A} \to [0, +\infty]$ which are countably additive and send $\emptyset$ to 0. For reasons that will be clearer later, we shall refer to such measures as unsigned measures. In this section we investigate the structure of this space, together with the closely related spaces of signed measures and finite measures.
Suppose that we have already constructed one unsigned measure $m \in \mathcal{M}^+(X)$ on $X$ (e.g. think of $X$ as the real line with the Borel σ-algebra, and let $m$ be Lebesgue measure). Then we can obtain many further unsigned measures on $X$ by multiplying $m$ by a function $f: X \to [0, +\infty]$, to obtain a new unsigned measure $m_f$, defined by the formula

(1.5) $m_f(E) := \int_X 1_E f\, dm$.
If $f = 1_A$ is an indicator function, we write $m\restriction_A$ for $m_{1_A}$, and refer to this measure as the restriction of $m$ to $A$.
Exercise 1.2.1. Show (using the monotone convergence theorem, Theorem 1.1.21) that $m_f$ is indeed an unsigned measure, and for any $g: X \to [0, +\infty]$, we have $\int_X g\, dm_f = \int_X gf\, dm$. We will express this relationship symbolically as

(1.6) $dm_f = f\, dm$.
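On a finite set, the measure $m_f$ of (1.5) and the identity of Exercise 1.2.1 can be realised as weighted sums; a small sketch with hypothetical names:

```python
def density_measure(m, f, X):
    """The measure m_f with dm_f = f dm on a finite set X:
    m_f(E) = sum over x in E of f(x) * m({x})."""
    return lambda E: sum(f(x) * m({x}) for x in E)

X = {1, 2, 3}
m = len                         # counting measure
f = lambda x: x * x             # non-negative density
mf = density_measure(m, f, X)
print(mf({1, 2}))               # 1 + 4 = 5
print(mf(X))                    # 1 + 4 + 9 = 14

# Exercise 1.2.1's identity: integral of g dm_f equals integral of g*f dm
g = lambda x: 2 * x
lhs = sum(g(x) * mf({x}) for x in X)
rhs = sum(g(x) * f(x) * m({x}) for x in X)
assert lhs == rhs
```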
Exercise 1.2.2. Let $m$ be σ-finite. Given two functions $f, g: X \to [0, +\infty]$, show that $m_f = m_g$ if and only if $f(x) = g(x)$ for $m$-almost every $x$. (Hint: as usual, first do the case when $m$ is finite. The key point is that if $f$ and $g$ are not equal $m$-almost everywhere, then either $f > g$ on a set of positive measure, or $f < g$ on a set of positive measure.) Give an example to show that this uniqueness statement can fail if $m$ is not σ-finite. (Hint: the space $X$ can be very simple.)
In view of Exercises 1.2.1 and 1.2.2, let us temporarily call a measure μ differentiable with respect to $m$ if $d\mu = f\, dm$ (i.e. $\mu = m_f$) for some $f: X \to [0, +\infty]$, and call $f$ the Radon-Nikodym derivative of μ with respect to $m$, writing

(1.7) $f = \frac{d\mu}{dm}$;

by Exercise 1.2.2, we see if $m$ is σ-finite that this derivative is defined up to $m$-almost everywhere equivalence.
Exercise 1.2.3 (Relationship between Radon-Nikodym derivative and classical derivative). Let $m$ be Lebesgue measure on $[0, +\infty)$, and let μ be an unsigned measure that is differentiable with respect to $m$. If μ has a continuous Radon-Nikodym derivative $\frac{d\mu}{dm}$, show that the function $x \mapsto \mu([0, x])$ is differentiable, and $\frac{d}{dx}\mu([0, x]) = \frac{d\mu}{dm}(x)$ for all $x$.
Exercise 1.2.4. Let $X$ be at most countable. Show that every measure on $X$ is differentiable with respect to counting measure $\#$.
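For a finite (or countable) $X$, the derivative asked for in Exercise 1.2.4 can be written down explicitly: $f(x) = \mu(\{x\})$ works, since singletons are measurable with respect to counting measure. A sketch (the names are ours):

```python
def rn_derivative_counting(mu, X):
    """Radon-Nikodym derivative of a measure mu on a finite set X with
    respect to counting measure #: f(x) = mu({x}) works, since then
    mu(E) = sum_{x in E} f(x) = integral of f over E against #."""
    return {x: mu({x}) for x in X}

X = {0, 1, 2}
mu = lambda E: sum(2.0 ** -x for x in E)   # a measure given by point masses
f = rn_derivative_counting(mu, X)
print(f)  # {0: 1.0, 1: 0.5, 2: 0.25}

# check mu = #_f on a sample set: mu(E) equals the sum of f over E
E = {0, 2}
assert mu(E) == sum(f[x] for x in E)
```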
If every measure was differentiable with respect to $m$ (as is the case in Exercise 1.2.4), then we would have completely described the space of measures of $X$ in terms of the non-negative functions of $X$ (modulo $m$-almost everywhere equivalence). Unfortunately, not every measure is differentiable with respect to every other: for instance, if $x$ is a point in $X$, then the only measures that are differentiable with respect to the Dirac measure $\delta_x$ are the scalar multiples of that measure. We will explore the precise obstruction that prevents all measures from being differentiable, culminating in the Radon-Nikodym-Lebesgue theorem that gives a satisfactory understanding of the situation in the σ-finite case (which is the case of interest for most applications).

In order to establish this theorem, it will be important to first study some other basic operations on measures, notably the ability to subtract one measure from another. This will necessitate the study of signed measures, to which we now turn.
1.2.1. Signed measures. We have seen that if we fix a reference measure $m$, then non-negative functions $f: X \to [0, +\infty]$ (modulo $m$-almost everywhere equivalence) can be identified with unsigned measures $m_f: \mathcal{A} \to [0, +\infty]$. This motivates various operations on measures that are analogous to operations on functions (indeed, one could view measures as a kind of "generalised function" with respect to a fixed reference measure $m$). For instance, we can define the sum of two unsigned measures $\mu, \nu: \mathcal{A} \to [0, +\infty]$ as

(1.8) $(\mu + \nu)(E) := \mu(E) + \nu(E)$

and non-negative scalar multiples $c\mu$ for $c > 0$ by

(1.9) $(c\mu)(E) := c(\mu(E))$.

We can also say that one measure μ is less than another ν if

(1.10) $\mu(E) \le \nu(E)$ for all $E \in \mathcal{A}$.

These operations are all consistent with their functional counterparts, e.g. $m_{f+g} = m_f + m_g$, etc.
Next, we would like to define the difference $\mu - \nu$ of two unsigned measures. The obvious thing to do is to define

(1.11) $(\mu - \nu)(E) := \mu(E) - \nu(E)$

but we have a problem if $\mu(E)$ and $\nu(E)$ are both infinite: $\infty - \infty$ is undefined! To fix this problem, we will only define the difference of two unsigned measures μ, ν if at least one of them is a finite measure. Observe that in that case, $\mu - \nu$ takes values in $(-\infty, +\infty]$ or $[-\infty, +\infty)$, but not both.

Of course, we no longer expect $\mu - \nu$ to be monotone. However, it is still finitely additive, and even countably additive in the sense that the sum $\sum_{n=1}^\infty (\mu - \nu)(E_n)$ converges to $(\mu - \nu)(\bigcup_{n=1}^\infty E_n)$ whenever $E_1, E_2, \ldots$ are disjoint sets, and furthermore that the sum is absolutely convergent when $(\mu - \nu)(\bigcup_{n=1}^\infty E_n)$ is finite. This motivates
Definition 1.2.1 (Signed measure). A signed measure is a map $\mu: \mathcal{A} \to [-\infty, +\infty]$ such that

(i) $\mu(\emptyset) = 0$;

(ii) μ can take either the value $+\infty$ or $-\infty$, but not both;

(iii) If $E_1, E_2, \ldots \subset X$ are disjoint, then $\sum_{n=1}^\infty \mu(E_n)$ converges to $\mu(\bigcup_{n=1}^\infty E_n)$, with the former sum being absolutely convergent¹ if the latter expression is finite.

¹Actually, the absolute convergence is automatic from the Riemann rearrangement theorem. Another consequence of (iii) is that any subset of a finite measure set again has finite measure, and the finite union of finite measure sets again has finite measure.
Thus every unsigned measure is a signed measure, and the difference of two unsigned measures is a signed measure if at least one of the unsigned measures is finite; we will see shortly that the converse statement is also true, i.e. every signed measure is the difference of two unsigned measures (with one of the unsigned measures being finite). Another example of a signed measure are the measures $m_f$ defined by (1.5), where $f: X \to [-\infty, +\infty]$ is now signed rather than unsigned, but with the assumption that at least one of the signed parts $f_+ := \max(f, 0)$, $f_- := \max(-f, 0)$ of $f$ has finite integral.

Theorem 1.2.2 (Hahn decomposition theorem). Let μ be a signed measure. Then one can find a partition $X = X_+ \cup X_-$ such that $\mu\restriction_{X_+} \ge 0$ and $\mu\restriction_{X_-} \le 0$.
Proof. By replacing μ with $-\mu$ if necessary, we may assume that μ avoids the value $+\infty$.

Call a set $E$ totally positive if $\mu\restriction_E \ge 0$, and totally negative if $\mu\restriction_E \le 0$. The idea is to pick $X_+$ to be the totally positive set of maximal measure - a kind of greedy algorithm, if you will. More precisely, define $m_+$ to be the supremum of $\mu(E)$, where $E$ ranges over all totally positive sets. (The supremum is non-vacuous, since the empty set is totally positive.) We claim that the supremum is actually attained. Indeed, we can always find a maximising sequence $E_1, E_2, \ldots$ of totally positive sets with $\mu(E_n) \to m_+$. It is not hard to see that the union $X_+ := \bigcup_{n=1}^\infty E_n$ is also totally positive, and $\mu(X_+) = m_+$ as required. Since μ avoids $+\infty$, we see in particular that $m_+$ is finite.
Set $X_- := X \backslash X_+$. We claim that $X_-$ is totally negative. We do this as follows. Suppose for contradiction that $X_-$ is not totally negative; then there exists a set $E_1$ in $X_-$ of strictly positive measure. If $E_1$ is not totally positive, it contains a subset $E_2$ of strictly larger measure; iterating this (with the increment of measure at the $j$-th stage chosen comparable to the largest amount $1/n_j$ available), we obtain a nested sequence $E_1 \supset E_2 \supset \ldots$ whose intersection $E := \bigcap_j E_j$ then also has positive measure, hence finite, which implies that the $n_j$ go to infinity; it is then not difficult to see that $E$ itself cannot contain any subsets of strictly larger measure, and so $E$ is a totally positive set of positive measure in $X_-$, contradicting the maximality of $m_+$.

(In a quantitative variant of this argument, one chooses the maximising sequence so that $\mu(E_n) \ge m_+ - 2^{-n}$, in which case sets such as $\bigcup_{n=n_0}^{n_1} E_n \backslash E_{n_1}$ have measure $O(2^{-n})$ for any $n_0 \le n$. This allows one to control the unions $\bigcup_{n=n_0}^\infty E_n$, and thence the lim sup $X_+$ of the $E_n$, which one can then show to have the required properties.) One can in fact show that any signed measure that avoids $+\infty$ must have finite positive variation, but this turns out to require a certain amount of work.
Let us say that a set $E$ is null for a signed measure μ if $\mu\restriction_E = 0$. (This implies that $\mu(E) = 0$, but the converse is not true, since a set $E$ of signed measure zero could contain subsets of non-zero measure.) It is easy to see that the sets $X_-, X_+$ given by the Hahn decomposition theorem are unique modulo null sets.
Let us say that a signed measure μ is supported on $E$ if the complement of $E$ is null (or equivalently, if $\mu\restriction_E = \mu$). If two signed measures μ, ν can be supported on disjoint sets, we say that they are mutually singular (or that μ is singular with respect to ν) and write $\mu \perp \nu$. If we write $\mu_+ := \mu\restriction_{X_+}$ and $\mu_- := -\mu\restriction_{X_-}$, we thus soon establish
Exercise 1.2.5 (Jordan decomposition theorem). Every signed measure μ can be uniquely decomposed as $\mu = \mu_+ - \mu_-$, where $\mu_+, \mu_-$ are mutually singular unsigned measures. (The only claim not already established is the uniqueness.) We refer to $\mu_+, \mu_-$ as the positive and negative parts (or variations) of μ, and to $|\mu| := \mu_+ + \mu_-$ as the total variation of μ.
Exercise 1.2.6. Show that $|\mu|$ is the minimal unsigned measure such that $-|\mu| \le \mu \le |\mu|$. Furthermore, $|\mu|(E)$ is equal to the maximum value of $\sum_{n=1}^\infty |\mu(E_n)|$, where $(E_n)_{n=1}^\infty$ ranges over the partitions of $E$. (This may help explain the terminology "total variation".)
Exercise 1.2.7. Show that $\mu(E)$ is finite for every $E$ if and only if $|\mu|$ is a finite unsigned measure, if and only if $\mu_+, \mu_-$ are finite unsigned measures. If any of these properties hold, we call μ a finite measure. (In a similar spirit, we call a signed measure μ σ-finite if $|\mu|$ is σ-finite.)

The space of finite measures on $X$ is clearly a real vector space, and is denoted $\mathcal{M}(X)$.
1.2.2. The Lebesgue-Radon-Nikodym theorem. Let $m$ be a reference unsigned measure. We saw at the beginning of this section that the map $f \mapsto m_f$ is an embedding of the space $L^+(X, dm)$ of non-negative functions (modulo $m$-almost everywhere equivalence) into the space $\mathcal{M}^+(X)$ of unsigned measures. The same map is also an embedding of the space $L^1(X, dm)$ of absolutely integrable functions (again modulo $m$-almost everywhere equivalence) into the space $\mathcal{M}(X)$ of finite measures. (To verify this, one first makes the easy observation that the Jordan decomposition of a measure $m_f$ given by an absolutely integrable function $f$ is simply $m_f = m_{f_+} - m_{f_-}$.)
In the converse direction, one can ask if every finite measure μ in $\mathcal{M}(X)$ can be expressed as $m_f$ for some absolutely integrable $f$. Unfortunately, there are some obstructions to this. Firstly, from (1.5) we see that if $\mu = m_f$, then any set that has measure zero with respect to $m$, must also have measure zero with respect to μ. In particular, this implies that a non-trivial measure μ that is singular with respect to $m$ cannot be expressed in the form $m_f$.

In the σ-finite case, this turns out to be the only obstruction:
Theorem 1.2.4 (Lebesgue-Radon-Nikodym theorem). Let $m$ be an unsigned σ-finite measure, and let μ be a signed σ-finite measure. Then there exists a unique decomposition $\mu = m_f + \mu_s$, where $f \in L^1(X, dm)$ and $\mu_s \perp m$. If μ is unsigned, then $f$ and $\mu_s$ are also.
Proof. We prove this only for the case when μ, $m$ are finite rather than σ-finite, and leave the general case as an exercise. The uniqueness follows from Exercise 1.2.2 and the previous observation that $m_f$ cannot be mutually singular with $m$ for any non-zero $f$, so it suffices to prove existence. By the Jordan decomposition theorem, we may assume that μ is unsigned as well. (In this case, we expect $f$ and $\mu_s$ to be unsigned also.)

The idea is to select $f$ greedily. More precisely, let $M$ be the supremum of the quantity $\int_X f\, dm$, where $f$ ranges over all non-negative functions such that $m_f \le \mu$. Since μ is finite, $M$ is finite. We claim that the supremum is actually attained for some $f$. Indeed, if we let $f_n$ be a maximising sequence, thus $m_{f_n} \le \mu$ and $\int_X f_n\, dm \to M$, one easily checks that the function $f = \sup_n f_n$ attains the supremum.
The measure $\mu_s := \mu - m_f$ is a non-negative finite measure by construction. To finish the theorem, it suffices to show that $\mu_s \perp m$. It will suffice to show that $(\mu_s - \varepsilon m)_+ \perp m$ for all ε, as the claim then easily follows by letting ε be a countable sequence going to zero. But if $(\mu_s - \varepsilon m)_+$ were not singular with respect to $m$, we see from the Hahn decomposition theorem that there is a set $E$ with $m(E) > 0$ such that $(\mu_s - \varepsilon m)\restriction_E \ge 0$, and thus $\mu_s \ge \varepsilon m\restriction_E$. But then one could add $\varepsilon 1_E$ to $f$, contradicting the construction of $f$.

Exercise 1.2.8. Complete the proof of Theorem 1.2.4 for the σ-finite case.
We have the following corollary:
Corollary 1.2.5 (Radon-Nikodym theorem). Let $m$ be an unsigned σ-finite measure, and let μ be a signed σ-finite measure. Then the following are equivalent.

(i) $\mu = m_f$ for some $f \in L^1(X, dm)$.

(ii) $\mu(E) = 0$ whenever $m(E) = 0$.

(iii) For every $\varepsilon > 0$, there exists $\delta > 0$ such that $|\mu|(E) < \varepsilon$ whenever $m(E) \le \delta$.

When any of these statements occur, we say that μ is absolutely continuous with respect to $m$, and write $\mu \ll m$. As in the start of this section, we call $f$ the Radon-Nikodym derivative of μ with respect to $m$, and write $f = \frac{d\mu}{dm}$.
Proof. The implication of (iii) from (i) is Exercise 1.1.11. The implication of (ii) from (iii) is trivial. To deduce (i) from (ii), apply Theorem 1.2.4 to μ and observe that $\mu_s$ is supported on a set $E$ of $m$-measure zero by hypothesis. Since $E$ is null for $m$, it is null for $m_f$ and μ also, and so $\mu_s$ is trivial, giving (i).
Corollary 1.2.6 (Lebesgue decomposition theorem). Let $m$ be an
unsigned $\sigma$-finite measure, and let $\mu$ be a signed $\sigma$-finite measure.
Then there is a unique decomposition $\mu = \mu_{ac} + \mu_s$, where $\mu_{ac} \ll m$
and $\mu_s \perp m$. (We refer to $\mu_{ac}$ and $\mu_s$ as the absolutely continuous
and singular components of $\mu$ with respect to $m$.) If $\mu$ is unsigned,
then $\mu_{ac}$ and $\mu_s$ are also.
Exercise 1.2.9. If every point in $X$ is measurable, we call a signed
measure $\mu$ continuous if $\mu(\{x\}) = 0$ for all $x$. Let the hypotheses be
as in Corollary 1.2.6, but suppose also that every point is measurable
and $m$ is continuous. Show that there is a unique decomposition
$\mu = \mu_{ac} + \mu_{sc} + \mu_{pp}$, where $\mu_{ac} \ll m$, $\mu_{pp}$ is supported on an at most
countable set, and $\mu_{sc}$ is both singular with respect to $m$ and continuous.
Furthermore, if $\mu$ is unsigned, then $\mu_{ac}$, $\mu_{sc}$, $\mu_{pp}$ are also. We
call $\mu_{sc}$ and $\mu_{pp}$ the singular continuous and pure point components
of $\mu$ respectively.
Example 1.2.7. A Cantor measure is singular continuous with respect
to Lebesgue measure, while Dirac measures are pure point.
Lebesgue measure on a line is singular continuous with respect to
Lebesgue measure on a plane containing that line.
Remark 1.2.8. Suppose one is decomposing a measure $\mu$ on a Euclidean
space $\mathbf{R}^d$ with respect to Lebesgue measure $m$ on that space.
Very roughly speaking, a measure is pure point if it is supported
on a 0-dimensional subset of $\mathbf{R}^d$, it is absolutely continuous if its
support is spread out on a full dimensional subset, and is singular
continuous if it is supported on some set of dimension intermediate
between 0 and $d$. For instance, if $\mu$ is the sum of a Dirac mass at
$(0, 0) \in \mathbf{R}^2$, one-dimensional Lebesgue measure on the $x$-axis, and
two-dimensional Lebesgue measure on $\mathbf{R}^2$, then these are the pure
point, singular continuous, and absolutely continuous components of
$\mu$ respectively. This heuristic is not completely accurate (in part because
I have left the definition of dimension vague) but is not a
bad rule of thumb for a first approximation. We will study analytic
concepts of dimension in more detail in Section 1.15.
To motivate the terminology continuous and singular continuous,
we recall two definitions on an interval $I \subset \mathbf{R}$, and make a
third:

A function $f : I \to \mathbf{R}$ is continuous if for every $x \in I$ and
every $\varepsilon > 0$, there exists $\delta > 0$ such that $|f(y) - f(x)| \le \varepsilon$
whenever $y \in I$ is such that $|y - x| \le \delta$.

A function $f : I \to \mathbf{R}$ is uniformly continuous if for every
$\varepsilon > 0$, there exists $\delta > 0$ such that $|f(y) - f(x)| \le \varepsilon$ whenever
$[x, y] \subset I$ has length at most $\delta$.

A function $f : I \to \mathbf{R}$ is absolutely continuous if for every
$\varepsilon > 0$, there exists $\delta > 0$ such that $\sum_{i=1}^n |f(y_i) - f(x_i)| \le \varepsilon$
whenever $[x_1, y_1], \ldots, [x_n, y_n]$ are disjoint intervals in $I$ of
total length at most $\delta$.
Clearly, absolute continuity implies uniform continuity, which in turn
implies continuity. The significance of absolute continuity is that it
is the largest class of functions for which the fundamental theorem
of calculus holds (using the classical derivative, and the Lebesgue
integral), as can be seen in any introductory graduate real analysis
course.
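To see the gap between uniform and absolute continuity concretely, one can probe the standard Cantor function numerically. The sketch below is an illustration, not part of the text's formal development (the evaluation routine and the stage count are our own choices): the stage-$n$ intervals of the Cantor set have total length $(2/3)^n$, yet the function's increments over them still sum to essentially 1, so no $\delta$ can work for, say, $\varepsilon = 1/2$.

```python
def cantor(x, depth=48):
    # Evaluate the Cantor function via ternary digits: a digit 1 means x sits
    # under a removed middle third (value is locally constant); digits 0 and 2
    # map to binary digits 0 and 1.
    total, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3
        d = min(int(x), 2)
        x -= d
        if d == 1:
            return total + scale
        total += scale * (d // 2)
        scale /= 2
    return total

def stage_intervals(n):
    # The 2^n closed intervals making up the n-th stage of the Cantor set.
    ivs = [(0.0, 1.0)]
    for _ in range(n):
        ivs = [piece for a, b in ivs
               for piece in ((a, a + (b - a) / 3), (b - (b - a) / 3, b))]
    return ivs

n = 8
ivs = stage_intervals(n)
total_length = sum(b - a for a, b in ivs)                      # (2/3)^8, small
total_increment = sum(cantor(b) - cantor(a) for a, b in ivs)   # stays near 1
```

Since the total increment stays at 1 while the total length of the intervals shrinks to zero, the Cantor function is uniformly continuous but not absolutely continuous; correspondingly, the Cantor measure of Example 1.2.7 is singular with respect to Lebesgue measure.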
Exercise 1.2.10. Let $m$ be Lebesgue measure on the interval $[0, +\infty)$,
and let $\mu$ be a finite unsigned measure.
Show that $\mu$ is a continuous measure if and only if the function
$x \mapsto \mu([0, x])$ is continuous. Show that $\mu$ is an absolutely continuous
measure with respect to $m$ if and only if the function $x \mapsto \mu([0, x])$ is
absolutely continuous.
1.2.3. A finitary analogue of the Lebesgue decomposition
(optional). At first glance, the above theory is only non-trivial when
the underlying set $X$ is infinite. For instance, if $X$ is finite, and $m$ is
the uniform distribution on $X$, then every other measure on $X$ will
be absolutely continuous with respect to $m$, making the Lebesgue
decomposition trivial. Nevertheless, there is a non-trivial version of
the above theory that can be applied to finite sets (cf. Section 1.3 of
Structure and Randomness). The cleanest formulation is to apply it
to a sequence of (increasingly large) sets, rather than to a single set:
Theorem 1.2.9 (Finitary analogue of the Lebesgue-Radon-Nikodym
theorem). Let $X_n$ be a sequence of finite sets (and with the discrete
$\sigma$-algebra), and for each $n$, let $m_n$ be the uniform distribution on $X_n$,
and let $\mu_n$ be another probability measure on $X_n$. Then, after passing
to a subsequence, one has a decomposition

(1.12) $\mu_n = \mu_{n,ac} + \mu_{n,sc} + \mu_{n,pp}$

where

(i) (Uniform absolute continuity) For every $\varepsilon > 0$, there exists
$\delta > 0$ (independent of $n$) such that $\mu_{n,ac}(E) \le \varepsilon$ whenever
$m_n(E) \le \delta$, for all $n$ and all $E \subset X_n$.

(ii) (Asymptotic singular continuity) $\mu_{n,sc}$ is supported on a set
of $m_n$-measure $o(1)$, and we have $\mu_{n,sc}(\{x\}) = o(1)$ uniformly
for all $x \in X_n$, where $o(1)$ denotes an error that goes
to zero as $n \to \infty$.

(iii) (Uniform pure point) For every $\varepsilon > 0$ there exists $N > 0$ (independent
of $n$) such that for each $n$, there exists a set $E_n \subset X_n$
of cardinality at most $N$ such that $\mu_{n,pp}(X_n \setminus E_n) \le \varepsilon$.
Proof. Using the Radon-Nikodym theorem (or just working by hand,
since everything is finite), we can write $d\mu_n = f_n\, dm_n$ for some
$f_n : X_n \to [0, +\infty)$ with average value 1.

For each positive integer $k$, the sequence $\mu_n(f_n \ge k)$ is bounded
between 0 and 1, so by the Bolzano-Weierstrass theorem, it has a
convergent subsequence. Applying the usual diagonalisation argument
(as in the proof of the Arzela-Ascoli theorem, Theorem 1.8.23),
we may thus assume (after passing to a subsequence, and relabeling)
that $\mu_n(f_n \ge k)$ converges for each positive $k$ to some limit $c_k$.

Clearly, the $c_k$ are decreasing and range between 0 and 1, and so
converge as $k \to \infty$ to some limit $0 \le c \le 1$.

Since $\lim_{k \to \infty} \lim_{n \to \infty} \mu_n(f_n \ge k) = c$, we can find a sequence
$k_n$ going to infinity such that $\mu_n(f_n \ge k_n) \to c$ as $n \to \infty$. We
now set $\mu_{n,ac}$ to be the restriction of $\mu_n$ to the set $\{f_n < k_n\}$. We
claim the absolute continuity property (i). Indeed, for any $\varepsilon > 0$, we
can find a $k$ such that $c_k \le c + \varepsilon/10$. For $n$ sufficiently large, we thus
have

(1.13) $\mu_n(f_n \ge k) \le c + \varepsilon/5$

and

(1.14) $\mu_n(f_n \ge k_n) \ge c - \varepsilon/5$

and hence

(1.15) $\mu_{n,ac}(f_n \ge k) \le 2\varepsilon/5$.

If we take $\delta < \varepsilon/5k$, we thus see (for $n$ sufficiently large) that (i) holds.
(For the remaining $n$, one simply shrinks $\delta$ as much as is necessary.)

Write $\mu_{n,s} := \mu_n - \mu_{n,ac}$, thus $\mu_{n,s}$ is supported on a set of size
$|X_n|/k_n = o(|X_n|)$ by Markov's inequality. It remains to extract out
the pure point components. This we do by a similar procedure as
above. Indeed, by arguing as before we may assume (after passing
to a subsequence as necessary) that the quantities $\mu_n(\{x : \mu_n(\{x\}) \ge 1/j\})$
converge to a limit $d_j$ for each positive integer $j$, that the $d_j$
themselves converge to a limit $d$, and that there exists a sequence
$j_n$ such that $\mu_n(\{x : \mu_n(\{x\}) \ge 1/j_n\})$ converges to $d$. If one sets
$\mu_{n,sc}$ and $\mu_{n,pp}$ to be the restrictions of $\mu_{n,s}$ to the sets $\{x : \mu_n(\{x\}) <
1/j_n\}$ and $\{x : \mu_n(\{x\}) \ge 1/j_n\}$ respectively, one can verify the
remaining claims by arguments similar to those already given.
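For a single $X_n$, the splitting performed in this proof is just a double thresholding: cut the density $f_n$ at $k_n$, then cut the atoms of the remainder at $1/j_n$. The sketch below is a toy illustration only; the measure and the thresholds `k` and `j` are arbitrary stand-ins for the $k_n$ and $j_n$ produced by the subsequence argument.

```python
def lrn_split(mu, k, j):
    # Split a probability vector mu (relative to the uniform measure m on the
    # same index set) into (ac, sc, pp): the part where the density
    # f = d(mu)/dm is below k, then the small-atom and large-atom parts of
    # the remainder.
    n = len(mu)
    f = [x * n for x in mu]                        # density relative to m
    ac = [x if d < k else 0.0 for x, d in zip(mu, f)]
    rest = [x - a for x, a in zip(mu, ac)]
    sc = [r if x < 1.0 / j else 0.0 for r, x in zip(rest, mu)]
    pp = [r - s for r, s in zip(rest, sc)]
    return ac, sc, pp

mu = [0.4, 0.3, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]   # toy measure, mass 1
ac, sc, pp = lrn_split(mu, k=3.0, j=5)
recon = [a + s + b for a, s, b in zip(ac, sc, pp)]     # should recover mu
```

Here the heavy atom at the first point (density $3.2 \ge k$, atom size $0.4 \ge 1/j$) lands in the pure point part, while everything else has density below $k$ and stays in the absolutely continuous part.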
Exercise 1.2.11. Generalise Theorem 1.2.9 to the setting where the
$X_n$ can be infinite and non-discrete (but we still require every point to
be measurable), the $m_n$ are arbitrary probability measures, and the
$\mu_n$ are arbitrary finite measures of uniformly bounded total variation.
Remark 1.2.10. This result is still not fully finitary because it
deals with a sequence of finite structures, rather than with a single
finite structure. It appears in fact to be quite difficult (and perhaps
even impossible) to make a fully finitary version of the Lebesgue
decomposition (in the same way that the finite convergence principle
in Section 1.3 of Structure and Randomness was a fully finitary
analogue of the infinite convergence principle), though one can certainly
form some weaker finitary statements that capture a portion
of the strength of this theorem. For instance, one very cheap thing
to do, given two probability measures $\mu, m$, is to introduce a threshold
parameter $k$, and partition $\mu = \mu_{\le k} + \mu_{>k}$, where $\mu_{\le k} \le km$,
and $\mu_{>k}$ is supported on a set of $m$-measure at most $1/k$; such a
decomposition is automatic from Theorem 1.2.4 and Markov's inequality,
and has meaningful content even when the underlying space
$X$ is finite, but this type of decomposition is not as powerful as the
full Lebesgue decompositions (mainly because the size of the support
for $\mu_{>k}$ is relatively large compared to the threshold $k$). Using
the finite convergence principle, one can do a bit better, writing
$\mu = \mu_{\le k} + \mu_{k < \cdot \le F(k)} + \mu_{> F(k)}$ for any function $F$ and any $\varepsilon > 0$, where
$k = O_{F,\varepsilon}(1)$, $\mu_{\le k} \le km$, $\mu_{> F(k)}$ is supported on a set of $m$-measure
at most $1/F(k)$, and $\mu_{k < \cdot \le F(k)}$ has total mass at most $\varepsilon$, but this
still fails to capture the full strength of the infinitary decomposition,
because $\varepsilon$ needs to be fixed in advance. I have not been able to find
a fully finitary statement that is equivalent to, say, Theorem 1.2.9; I
suspect that if it does exist, it will have quite a messy formulation.
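The cheap threshold decomposition described in this remark is easy to realise concretely. The following sketch is illustrative only (the 10-point measure is made up): it splits $\mu$ into $\mu_{\le k} + \mu_{>k}$ relative to the uniform measure $m$, and checks the two claimed properties.

```python
from fractions import Fraction as F

def threshold_decompose(mu, k):
    # Split mu = mu_le + mu_gt relative to the uniform measure m on the index
    # set: mu_le is the restriction of mu to the set where the density
    # d(mu)/dm is at most k (so mu_le <= k*m pointwise); mu_gt is the rest.
    m = F(1, len(mu))
    mu_le = [x if x <= k * m else F(0) for x in mu]
    mu_gt = [x - y for x, y in zip(mu, mu_le)]
    return mu_le, mu_gt

mu = [F(1, 2)] + [F(1, 18)] * 9        # a probability measure on 10 points
k = 3
mu_le, mu_gt = threshold_decompose(mu, k)
m = F(1, len(mu))
support_mass = sum(m for x in mu_gt if x > 0)   # m-measure of supp(mu_gt)
```

By Markov's inequality, since $\mu$ has total mass 1, the support of $\mu_{>k}$ always has $m$-measure at most $1/k$; in this toy example it is $1/10 \le 1/3$.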
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/04.
The material here is largely based on Folland's text [Fo2000], except
for the last section. Thanks to Ke, Max Baroi, Xiaochuan Liu, and
several anonymous commenters for corrections.
1.3. $L^p$ spaces

Now that we have reviewed the foundations of measure theory, let us
now put it to work to set up the basic theory of one of the fundamental
families of function spaces in analysis, namely the $L^p$ spaces (also
known as Lebesgue spaces). These spaces serve as important model
examples for the general theory of topological and normed vector
spaces, which we will discuss a little bit in this lecture and then in
much greater detail in later lectures.
Just as scalar quantities live in the space of real or complex numbers,
and vector quantities live in vector spaces, functions $f : X \to \mathbf{C}$
(or other objects closely related to functions, such as measures) live
in function spaces. Like other spaces in mathematics (e.g. vector
spaces, metric spaces, topological spaces, etc.), a function space $V$ is
not just a mere set of objects (in this case, the objects are functions),
but also comes with various important structures that allow one
to do some useful operations inside these spaces, and from one space
to another. For example, function spaces tend to have several (though
usually not all) of the following types of structures, which are usually
related to each other by various compatibility conditions:

Vector space structure. One can often add two functions
$f, g$ in a function space $V$, and expect to get another
function $f + g$ in that space $V$; similarly, one can multiply
a function $f$ in $V$ by a scalar $c$ and get another function
$cf$ in $V$. Usually, these operations obey the axioms of a
vector space, though it is important to caution that the dimension
of a function space is typically infinite. (In some
cases, the space of scalars is a more complicated ring than
the real or complex field, in which case we need the notion
of a module rather than a vector space, but we will not use
this more general notion in this course.) Virtually all of the
function spaces we shall encounter in this course will be vector
spaces. Because the field of scalars is real or complex,
vector spaces also come with the notion of convexity, which
turns out to be crucial in many aspects of analysis. As a
consequence (and in marked contrast to algebra or number
theory), much of the theory in real analysis does not seem
to extend to other fields of scalars (in particular, real analysis
fails spectacularly in the finite characteristic setting).
Algebra structure. Sometimes (though not always), we also
wish to multiply two functions $f, g$ in $V$ and get another
function $fg$ in $V$; when combined with the vector space
structure and assuming some compatibility conditions (e.g.
the distributive law), this makes $V$ an algebra. This multiplication
operation is often just pointwise multiplication,
but there are other important multiplication operations on
function spaces too, such as convolution².
Norm structure. We often want to distinguish large
functions in $V$ from small ones, especially in analysis, in
which small terms in an expression are routinely discarded
or deemed to be acceptable errors. One way to do this is
to assign a magnitude or norm $\|f\|_V$ to each function that
measures its size. Unlike the situation with scalars, where
there is basically a single notion of magnitude, functions
have a wide variety of useful notions of size, each measuring
a different aspect (or combination of aspects) of the function,
such as height, width, oscillation, regularity, decay, and so
forth. Typically, each such norm gives rise to a separate
function space (although sometimes it is useful to consider a
single function space with multiple norms on it). We usually
require the norm to be compatible with the vector space
structure (and algebra structure, if present), for instance by
demanding that the triangle inequality hold.
Metric structure. We also want to tell whether two functions
$f, g$ in a function space $V$ are near together or far
apart. A typical way to do this is to impose a metric
$d : V \times V \to \mathbf{R}^+$ on the space $V$. If both a norm $\|\cdot\|_V$ and a
vector space structure are available, there is an obvious way
to do this: define the distance between two functions $f, g$ in
$V$ to be³ $d(f, g) := \|f - g\|_V$. It is often important to know
if the vector space is complete⁴ with respect to the given
metric; this allows one to take limits of Cauchy sequences,
and (with a norm and vector space structure) sum absolutely
convergent series, as well as use some useful results
from point set topology such as the Baire category theorem,
see Section 1.7. All of these operations are of course vital in
analysis.

² One sometimes sees other algebraic structures than multiplication appear in
function spaces, such as commutators and derivations, but again we will not encounter
those in this course. Another common algebraic operation for function spaces is conjugation
or adjoint, leading to the notion of a *-algebra.
Topological structure. It is often important to know
when a sequence (or, occasionally, nets) of functions $f_n$ in
$V$ converges in some sense to a limit $f$ (which, hopefully,
is still in $V$); there are often many distinct modes of convergence
(e.g. pointwise convergence, uniform convergence,
etc.) that one wishes to carefully distinguish from each
other. Also, in order to apply various powerful topological
theorems (or to justify various formal operations involving
limits, suprema, etc.), it is important to know when certain
subsets of $V$ enjoy key topological properties (most notably
compactness and connectedness), and to know which operations
on $V$ are continuous. For all of this, one needs a
topology on $V$. If one already has a metric, then one of
course has a topology generated by the open balls of that
metric; but there are many important topologies on function
spaces in analysis that do not arise from metrics. We also
often require the topology to be compatible with the other
structures on the function space; for instance, we usually
require the vector space operations of addition and scalar
multiplication to be continuous. In some cases, the topology
on $V$ extends to some natural superspace $W$ of more
general functions that contain $V$; in such cases, it is often
important to know whether $V$ is closed in $W$, so that limits
of sequences in $V$ stay in $V$.

³ This will be the only type of metric on function spaces encountered in this course.
But there are some nonlinear function spaces of importance in nonlinear analysis (e.g.
spaces of maps from one manifold to another) which have no vector space structure or
norm, but still have a metric.

⁴ Compactness would be an even better property than completeness to have, but
function spaces unfortunately tend to be non-compact in various rather nasty ways,
although there are useful partial substitutes for compactness that are available, see e.g.
Section 1.6 of Poincare's Legacies, Vol. I.
Functional structures. Since numbers are easier to understand
and deal with than functions, it is not surprising
that we often study functions $f$ in a function space $V$ by
first applying some functional $\lambda : V \to \mathbf{C}$ to $V$ to identify
some key numerical quantity $\lambda(f)$ associated to $f$. Norms
$f \mapsto \|f\|_V$ are of course one important example of a functional;
integration $f \mapsto \int_X f\, d\mu$ provides another; and evaluation
$f \mapsto f(x)$ at a point $x$ provides a third important
class. (Note, though, that while evaluation is the fundamental
feature of a function in set theory, it is often a quite minor
operation in analysis; indeed, in many function spaces, evaluation
is not even defined at all, for instance because the
functions in the space are only defined almost everywhere!)
An inner product $\langle \cdot, \cdot \rangle$ on $V$ (see below) also provides a large
family $f \mapsto \langle f, g \rangle$ of useful functionals. It is of particular
interest to study functionals that are compatible with the
vector space structure (i.e. are linear) and with the topological
structure (i.e. are continuous); this will give rise to
the important notion of duality on function spaces.
Inner product structure. One often would like to pair
a function $f$ in a function space $V$ with another object $g$
(which is often, though not always, another function in the
same function space $V$) and obtain a number $\langle f, g \rangle$, that typically
measures the amount of interaction or correlation
between $f$ and $g$. Typical examples include inner products
arising from integration, such as $\langle f, g \rangle := \int_X f \overline{g}\, d\mu$; integration
itself can also be viewed as a pairing, $\langle f, \mu \rangle := \int_X f\, d\mu$.
Of course, we usually require such inner products to be compatible
with the other structures present on the space (e.g.,
to be compatible with the vector space structure, we usually
require the inner product to be bilinear or sesquilinear).
Inner products, when available, are incredibly useful in understanding
the metric and norm geometry of a space, due
to such fundamental facts as the Cauchy-Schwarz inequality
and the parallelogram law. They also give rise to the
important notion of orthogonality between functions.
Group actions. We often expect our function spaces to
enjoy various symmetries; we might wish to rotate, reflect,
translate, modulate, or dilate our functions and expect to
preserve most of the structure of the space when doing so.
In modern mathematics, symmetries are usually encoded
by group actions (or actions of other group-like objects,
such as semigroups or groupoids; one also often upgrades
groups to more structured objects such as Lie groups). As
usual, we typically require the group action to preserve the
other structures present on the space, e.g. one often restricts
attention to group actions that are linear (to preserve the
vector space structure), continuous (to preserve topological
structure), unitary (to preserve inner product structure),
isometric (to preserve metric structure), and so forth. Besides
giving us useful symmetries to spend, the presence of
such group actions allows one to apply the powerful techniques
of representation theory, Fourier analysis, and ergodic
theory. However, as this is a foundational real analysis
class, we will not discuss these important topics much here
(and in fact will not deal with group actions much at all).
Order structure. In some cases, we want to utilise the
notion of a function $f$ being non-negative, or dominating
another function $g$. One might also want to take the
max or supremum of two or more functions in a function
space $V$, or split a function into positive and negative
components. Such order structures interact with the other
structures on a space in many useful ways (e.g. via the
Stone-Weierstrass theorem, Theorem 1.10.24). Much like
convexity, order structure is specific to the real line and is
another reason why much of real analysis breaks down over
other fields. (The complex plane is of course an extension
of the real line and so is able to exploit the order structure
of that line, usually by treating the real and imaginary
components separately.)
There are of course many ways to combine various flavours of
these structures together, and there are entire subfields of mathematics
that are devoted to studying particularly common and useful categories
of such combinations (e.g. topological vector spaces, normed
vector spaces, Banach spaces, Banach algebras, von Neumann algebras,
$C^*$-algebras, and so forth).

We let $L^\infty$ denote the
space of essentially bounded functions, quotiented out by equivalence,
and given the norm $\|\cdot\|_{L^\infty}$. It is not hard to see that this is also a
normed vector space. Observe that a sequence $f_n \in L^\infty$ converges to
a limit $f \in L^\infty$ if and only if $f_n$ converges essentially uniformly to
$f$, i.e. it converges uniformly to $f$ outside of a set of measure zero.
(Compare with Egorov's theorem (Theorem 1.1.21), which equates
pointwise convergence with uniform convergence outside of a set of
arbitrarily small measure.)

Now we explain why we call this norm the $L^\infty$ norm:
Example 1.3.4. Let $f$ be a (generalised) step function, thus $f = A 1_E$
for some amplitude $A > 0$ and some set $E$; let us assume that $E$
has positive finite measure. Then $\|f\|_{L^p} = A \mu(E)^{1/p}$ for all $0 < p <
\infty$, and also $\|f\|_{L^\infty} = A$. Thus in this case, at least, the $L^\infty$ norm is
the limit of the $L^p$ norms. This example illustrates also that the $L^p$
norms behave like combinations of the height $A$ of a function, and
the width $\mu(E)$ of such a function, though of course the concepts
of height and width are not formally defined for functions that are
not step functions.
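The computation in this example is easy to check numerically; the sketch below (with arbitrary illustrative values of $A$ and $\mu(E)$) shows the $L^p$ norms of a step function climbing towards the $L^\infty$ norm $A$ as $p$ grows.

```python
def step_lp_norm(A, measure_E, p):
    # L^p norm of f = A * 1_E: (integral of A^p over E)^(1/p) = A * mu(E)^(1/p).
    return A * measure_E ** (1.0 / p)

A, mE = 2.5, 0.01                      # height and width, arbitrary choices
ps = [1, 2, 10, 100, 1000]
norms = [step_lp_norm(A, mE, p) for p in ps]
```

For $\mu(E) < 1$, as here, the norms increase towards $A$; for $\mu(E) > 1$ they would instead decrease towards $A$.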
Exercise 1.3.5. If $f \in L^\infty \cap L^{p_0}$ for some $0 < p_0 < \infty$,
show that $\|f\|_{L^p} \to \|f\|_{L^\infty}$ as $p \to \infty$. (Hint: use the
monotone convergence theorem, Theorem 1.1.21.)
If $f \notin L^\infty$, show that $\|f\|_{L^p} \to \infty$ as $p \to \infty$.

Recall that a series $\sum_{n=1}^\infty a_n$ of scalars is convergent if it is
absolutely convergent (i.e. if $\sum_{n=1}^\infty |a_n| < \infty$). This fact turns out to
be closely related to the fact that the field of scalars $\mathbf{C}$ is complete.
This can be seen from the following result:
Exercise 1.3.7. Let $(V, \|\cdot\|)$ be a normed vector space (and hence
also a metric space and a topological space). Show that the following
are equivalent:

$V$ is a complete metric space (i.e. every Cauchy sequence
converges).

Every sequence $f_n \in V$ which is absolutely convergent (i.e.
$\sum_{n=1}^\infty \|f_n\| < \infty$), is also conditionally convergent (i.e.
$\sum_{n=1}^N f_n$ converges to a limit as $N \to \infty$).
Remark 1.3.5. The situation is more complicated for complete quasi-normed
vector spaces; not every absolutely convergent series is conditionally
convergent. On the other hand, if $\|f_n\|$ decays faster than
a sufficiently large negative power of $n$, one recovers conditional convergence;
see [Ta].

Remark 1.3.6. Let $X$ be a topological space, and let $BC(X)$ be the
space of bounded continuous functions on $X$; this is a vector space.
We can place the uniform norm $\|f\|_u := \sup_{x \in X} |f(x)|$ on this space;
this makes $BC(X)$ into a normed vector space. It is not hard to verify
that this space is complete, and so every absolutely convergent series
in $BC(X)$ is conditionally convergent. This fact is better known as
the Weierstrass $M$-test.
A space obeying the properties in Exercise 1.3.7 (i.e. a complete
normed vector space) is known as a Banach space. We will study
Banach spaces in more detail later in this course. For now, we give
one of the fundamental examples of Banach spaces.

Proposition 1.3.7. $L^p$ is a Banach space for every $1 \le p \le \infty$.
Proof. By Exercise 1.3.7, it suffices to show that any series $\sum_{n=1}^\infty f_n$
of functions in $L^p$ which is absolutely convergent, is also conditionally
convergent. This is easy in the case $p = \infty$ and is left as an exercise.
In the case $1 \le p < \infty$, we write $M := \sum_{n=1}^\infty \|f_n\|_{L^p}$, which is a
finite quantity by hypothesis. By the triangle inequality, we have
$\|\sum_{n=1}^N |f_n|\|_{L^p} \le M$ for all $N$. By monotone convergence (Theorem
1.1.21), we conclude $\|\sum_{n=1}^\infty |f_n|\|_{L^p} \le M$. In particular, $\sum_{n=1}^\infty f_n(x)$
is absolutely convergent for almost every $x$. Write the limit of this
series as $F(x)$. By dominated convergence (Theorem 1.1.21), we see
that $\sum_{n=1}^N f_n(x)$ converges in $L^p$ norm to $F$, and we are done.
An important fact is that functions in $L^p$ can be approximated
by simple functions:

Proposition 1.3.8. If $0 < p < \infty$, then the space of simple functions
with finite measure support is a dense subspace of $L^p$.
Remark 1.3.9. The concept of a non-trivial dense subspace is one
which only comes up in infinite dimensions, and is hard to visualise
directly. Very roughly speaking, the infinite number of degrees of
freedom in an infinite dimensional space gives a subspace an infinite
number of opportunities to come as close as one desires to any given
point in that space, which is what allows such spaces to be dense.
Proof. The only non-trivial thing to show is the density. An application
of the monotone convergence theorem (Theorem 1.1.21) shows
that the space of bounded $L^p$ functions is dense in $L^p$. Another
application of monotone convergence (and Exercise 1.3.4) then shows
that the space of bounded $L^p$ functions of finite measure support is
dense in the space of bounded $L^p$ functions. Finally, by discretising
the range of bounded $L^p$ functions, we see that the space of simple
functions with finite measure support is dense in the space of bounded
$L^p$ functions with finite measure support.
Remark 1.3.10. Since not every function in $L^p$ is a simple function
with finite measure support, we thus see that the space of simple
functions with finite measure support with the $L^p$ norm is an example
of a normed vector space which is not complete.

Exercise 1.3.8. Show that the space of simple functions (not necessarily
with finite measure support) is a dense subspace of $L^\infty$. Is the
same true if one reinstates the finite measure support restriction?
Exercise 1.3.9. Suppose that $\mu$ is $\sigma$-finite and $\mathcal{A}$ is separable (i.e.
countably generated). Show that $L^p$ is separable (i.e. has a countable
dense subset) for all $1 \le p < \infty$. Give a counterexample that shows
that $L^\infty$ need not be separable.

Exercise 1.3.11. If $f \in L^p$ for some $0 < p \le \infty$, and $X$ has finite
measure, show that $f \in L^q$ for all $0 < q < p$, with
$\|f\|_{L^q} \le \mu(X)^{\frac{1}{q} - \frac{1}{p}} \|f\|_{L^p}$. When does equality occur?
Exercise 1.3.12. If $f \in L^p$ for some $0 < p < \infty$, and every set of
positive measure in $X$ has measure at least $m$, show that $f \in L^q$
for all $p < q \le \infty$, with $\|f\|_{L^q} \le m^{\frac{1}{q} - \frac{1}{p}} \|f\|_{L^p}$. When does equality
occur? (This result is especially useful for the $\ell^p$ spaces, in which $\mu$
is counting measure and $m$ can be taken to be 1.)
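For the $\ell^p$ case mentioned in parentheses ($\mu$ counting measure, $m = 1$), the exercise says that $\ell^p$ norms are non-increasing in $p$. Here is a quick numerical sanity check (an illustration with an arbitrary test vector, not a proof):

```python
def lp_norm(xs, p):
    # The l^p norm of a finite sequence, with counting measure.
    return sum(abs(x) ** p for x in xs) ** (1.0 / p)

xs = [0.3, -1.2, 0.7, 2.0, -0.1]        # arbitrary test vector
ps = [0.5, 1, 2, 4, 8, 32]
norms = [lp_norm(xs, p) for p in ps]    # non-increasing in p
sup_norm = max(abs(x) for x in xs)      # the l^infinity norm
```

All the $\ell^p$ norms dominate the sup norm, and for large $p$ they approach it, consistent with Exercise 1.3.5.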
Exercise 1.3.13. If $f \in L^{p_0} \cap L^{p_1}$ for some $0 < p_0 < p_1 \le \infty$, show
that $f \in L^p$ for all $p_0 \le p \le p_1$, and that $\|f\|_{L^p} \le \|f\|_{L^{p_0}}^{1-\theta} \|f\|_{L^{p_1}}^{\theta}$,
where $0 < \theta < 1$ is such that $\frac{1}{p} = \frac{1-\theta}{p_0} + \frac{\theta}{p_1}$. Another way of saying
this is that the function $\frac{1}{p} \mapsto \log \|f\|_{L^p}$ is convex. When does equality
occur? This convexity is a prototypical example of interpolation,
about which we shall say more in Section 1.11.
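The interpolation inequality of this exercise can be sanity-checked numerically on a finite measure space (an illustration only; the function values and point masses below are random):

```python
import random

def Lp_norm(f, w, p):
    # L^p norm of f on a finite measure space with point masses w.
    return sum(wi * abs(v) ** p for v, wi in zip(f, w)) ** (1.0 / p)

random.seed(0)
f = [random.uniform(-3, 3) for _ in range(50)]
w = [random.uniform(0.01, 1.0) for _ in range(50)]   # arbitrary point masses

p0, p1, theta = 1.0, 4.0, 0.5
p = 1.0 / ((1 - theta) / p0 + theta / p1)            # intermediate exponent
lhs = Lp_norm(f, w, p)
rhs = Lp_norm(f, w, p0) ** (1 - theta) * Lp_norm(f, w, p1) ** theta
```

Since the inequality follows from Hölder's inequality, it holds for any choice of data; the random sample merely illustrates it.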
Exercise 1.3.14. If $f \in L^{p_0}$ for some $0 < p_0 \le \infty$, and its support
$E := \{x \in X : f(x) \ne 0\}$ has finite measure, show that $f \in L^p$ for all
$0 < p < p_0$, and that $\|f\|_{L^p}^p \to \mu(E)$ as $p \to 0$. (Because of this, the
measure of the support of $f$ is sometimes known as the $L^0$ norm of
$f$, or more precisely the $L^0$ norm raised to the power 0.)
1.3.2. Linear functionals on $L^p$. Given an exponent $1 \le p \le \infty$,
define the dual exponent $1 \le p' \le \infty$ by the formula $\frac{1}{p} + \frac{1}{p'} = 1$ (thus
$p' = p/(p-1)$ for $1 < p < \infty$, while 1 and $\infty$ are duals of each other).
From Hölder's inequality, we see that for any $g \in L^{p'}$, the functional
$\lambda_g : L^p \to \mathbf{C}$ defined by

(1.26) $\lambda_g(f) := \int_X fg\, d\mu$

is well-defined on $L^p$; the functional is also clearly linear. Furthermore,
Hölder's inequality also tells us that this functional is continuous.
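Concretely, the continuity bound coming from Hölder's inequality, $|\lambda_g(f)| \le \|f\|_{L^p} \|g\|_{L^{p'}}$, can be sanity-checked on a finite measure space (the data here are arbitrary illustrative choices):

```python
import random

def Lp_norm(v, w, p):
    # L^p norm with point masses w.
    return sum(wi * abs(x) ** p for x, wi in zip(v, w)) ** (1.0 / p)

random.seed(1)
n = 40
w = [random.uniform(0.1, 1.0) for _ in range(n)]    # point masses of mu
f = [random.uniform(-1, 1) for _ in range(n)]
g = [random.uniform(-1, 1) for _ in range(n)]

p = 3.0
p_dual = p / (p - 1)                                # 1/p + 1/p' = 1
pairing = sum(wi * fi * gi for fi, gi, wi in zip(f, g, w))   # lambda_g(f)
bound = Lp_norm(f, w, p) * Lp_norm(g, w, p_dual)
```

The pairing never exceeds the product of the dual norms, which is exactly the boundedness that Lemma 1.3.17 below identifies with continuity.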
A deep and important fact about $L^p$ spaces is that, in most cases,
the converse is true: the recipe (1.26) is the only way to create continuous
linear functionals on $L^p$.

Theorem 1.3.16 (Dual of $L^p$). Let $1 \le p < \infty$, and assume $\mu$ is
$\sigma$-finite. Let $\lambda : L^p \to \mathbf{C}$ be a continuous linear functional. Then
there exists a unique $g \in L^{p'}$ such that $\lambda = \lambda_g$.

This result should be compared with the Radon-Nikodym theorem
(Corollary 1.2.5). Both theorems start with an abstract function
$\mu : \mathcal{A} \to \mathbf{R}$ or $\lambda : L^p \to \mathbf{C}$, and create a function out of it. Indeed,
we shall see shortly that the two theorems are essentially equivalent
to each other. We will develop Theorem 1.3.16 further in Section 1.5,
once we introduce the notion of a dual space.

To prove Theorem 1.3.16, we first need a simple and useful lemma:
To prove Theorem 1.3.16, we rst need a simple and useful lemma:
Lemma 1.3.17 (Continuity is equivalent to boundedness for linear
operators). Let T : X Y be a linear transformation from one
normed vector space (X, ||
X
) to another (Y, ||
Y
). Then the following
are equivalent:
(i) T is continuous.
(ii) T is continuous at 0.
(iii) There exists a constant C such that |Tx|
Y
C|x|
X
for all
x X.
Proof. It is clear that (i) implies (ii), and that (iii) implies (ii). Next,
from linearity we have Tx = Tx
0
+T(xx
0
) for any x, x
0
X, which
(together with the continuity of addition, which follows from the tri-
angle inequality) shows that continuity of T at 0 implies continuity
of T at any x
0
, so that (ii) implies (i). The only remaining task is
to show that (i) implies (iii). By continuity, the inverse image of the
unit ball in Y must be an open neighbourhood of 0 in X, thus there
exists some radius r > 0 such that |Tx|
Y
< 1 whenever |x|
X
< r.
The claim then follows (with C := 1/r) by homogeneity. (Alterna-
tively, one can deduce (iii) from (ii) by contradiction. If (iii) failed,
then there exists a sequence x
n
of non-zero elements of X such that
|Tx
n
|
Y
/|x
n
|
X
goes to innity. By homogeneity, we can arrange
1.3. L
p
spaces 43
matters so that |x
n
|
X
goes to zero, but |Tx
n
|
Y
stays away from
zero, thus contradicting continuity at 0.)
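In finite dimensions the best constant $C$ in (iii) can often be computed explicitly; for instance, for a matrix acting between $\ell^1$ spaces, the operator norm is the largest $\ell^1$ norm of a column. The sketch below (an illustration with an arbitrarily chosen matrix) verifies the bound $\|Tx\|_1 \le C\|x\|_1$ on random inputs:

```python
import random

def l1(v):
    return sum(abs(x) for x in v)

def apply(T, x):
    # Matrix-vector product, rows of T against x.
    return [sum(r[j] * x[j] for j in range(len(x))) for r in T]

def op_norm_l1(T):
    # Operator norm of T : (R^n, l^1) -> (R^m, l^1): the max column l^1 sum.
    return max(sum(abs(row[j]) for row in T) for j in range(len(T[0])))

T = [[1.0, -2.0], [0.5, 3.0]]
C = op_norm_l1(T)                                   # max(1.5, 5.0) = 5.0
random.seed(2)
samples = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(100)]
ok = all(l1(apply(T, x)) <= C * l1(x) + 1e-12 for x in samples)
```

The bound follows from writing $Tx$ as a combination of columns weighted by the entries of $x$; equality is attained at a standard basis vector, so this $C$ is the smallest constant that works in (iii).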
Proof of Theorem 1.3.16. The uniqueness claim is similar to the
uniqueness claim in the Radon-Nikodym theorem (Exercise 1.2.2) and
is left as an exercise to the reader; the hard part is establishing existence.

Let us first consider the case when $\mu$ is finite. The linear functional
$\lambda : L^p \to \mathbf{C}$ induces a functional $\nu : \mathcal{A} \to \mathbf{C}$ on sets $E$ by the
formula

(1.27) $\nu(E) := \lambda(1_E)$.

Since $\lambda$ is linear, $\nu$ is finitely additive (and sends the empty set to
zero). Also, if $E_1, E_2, \ldots$ are a sequence of disjoint sets, then $1_{\bigcup_{n=1}^N E_n}$
converges in $L^p$ to $1_{\bigcup_{n=1}^\infty E_n}$ as $N \to \infty$ (by the dominated convergence
theorem and the finiteness of $\mu$), and thus (by continuity of $\lambda$ and finite
additivity of $\nu$), $\nu$ is countably additive as well. Finally, from
(1.27) we also see that $\nu(E) = 0$ whenever $\mu(E) = 0$, thus $\nu$ is absolutely
continuous with respect to $\mu$. Applying the Radon-Nikodym
theorem (Corollary 1.2.5) to both the real and imaginary components
of $\nu$, we conclude that $\nu = \mu_g$ for some $g \in L^1$; thus by (1.27) we
have

(1.28) $\lambda(1_E) = \lambda_g(1_E)$

for all measurable $E$. By linearity, this implies that $\lambda$ and $\lambda_g$ agree
on simple functions. Taking uniform limits (using Exercise 1.3.8) and
using continuity (and the finite measure of $X$) we conclude that $\lambda$ and
$\lambda_g$ agree on all bounded functions. Taking monotone limits (working
on the positive and negative supports of the real and imaginary parts
of $g$ separately) we conclude that $\lambda$ and $\lambda_g$ agree on all functions in
$L^p$, and in particular that $\int_X fg\, d\mu$ is absolutely convergent for all
$f \in L^p$.
To finish the theorem in this case, we need to establish that $g$ lies
in $L^{p'}$. The idea is to test $\lambda_g$ against $f := g^{p'-1}$, since we formally have
$\lambda_g(f) = \|g\|_{L^{p'}}^{p'}$ and $\|f\|_{L^p} = \|g\|_{L^{p'}}^{p'-1}$. (Not coincidentally, this is also the choice
that would make Hölder's inequality an equality, see Exercise 1.3.10.)
Cancelling the $\|g\|_{L^{p'}}$ factors would then give the desired finiteness
of $\|g\|_{L^{p'}}$.

We can't quite make that argument work, because it is circular:
it assumes $\|g\|_{L^{p'}}$ is finite in order to show that $\|g\|_{L^{p'}}$ is finite!
But this can be easily remedied. We test the inequality with
$f_N := \min(g, N)^{p'-1}$ for some large $N$; this lies in $L^p$. We have
$\lambda_g(f_N) \ge \|\min(g, N)\|_{L^{p'}}^{p'}$ and $\|f_N\|_{L^p} = \|\min(g, N)\|_{L^{p'}}^{p'-1}$, and hence
$\|\min(g, N)\|_{L^{p'}} \le C$ for all $N$. Letting $N$ go to infinity and using
monotone convergence (Theorem 1.1.21), we obtain the claim.

In the $p = 1$ case, we instead use $f := 1_{g > N}$ as the test functions,
to conclude that $g$ is bounded almost everywhere by $N$; we leave the
details to the reader.
This handles the case when $\mu$ is finite. When $\mu$ is $\sigma$-finite, we can
write $X$ as the union of an increasing sequence $E_n$ of sets of finite
measure. On each such set, the above arguments let us write $\lambda = \lambda_{g_n}$
for some $g_n \in L^{p'}(E_n)$. The uniqueness arguments tell us that the
$g_n$ are all compatible with each other, in particular if $n < m$, then
$g_n$ and $g_m$ agree on $E_n$. Thus all the $g_n$ are in fact restrictions of a
single function $g$ to $E_n$. The previous arguments also tell us that the
$L^{p'}$ norm of $g_n$ is bounded by the same constant $C$ uniformly in $n$, so
by monotone convergence (Theorem 1.1.21), $g$ has bounded $L^{p'}$ norm
also, and we are done.
Remark 1.3.18. When 1 < p < \infty, the hypothesis that \mu is \sigma-finite can be dropped, but not when p = 1; see e.g. [Fo2000, Section 6.2] for further discussion. In these lectures, though, we will be content with working in the \sigma-finite setting. On the other hand, the claim fails when p = \infty (except when X is finite); we will see this in Section 1.5, when we discuss the Hahn-Banach theorem.
Remark 1.3.19. We have seen how the Lebesgue-Radon-Nikodym theorem can be used to establish Theorem 1.3.16. The converse is also true: Theorem 1.3.16 can be used to deduce the Lebesgue-Radon-Nikodym theorem (a fact essentially observed by von Neumann). For simplicity, let us restrict attention to the unsigned finite case, thus \mu and m are unsigned and finite. This implies that the sum \mu + m is also unsigned and finite. We observe that the linear functional \lambda : f \mapsto \int_X f\,d\mu is continuous on L^2(\mu + m), hence by Theorem 1.3.16, there must exist a function g \in L^2(\mu + m) such that \int_X f\,d\mu = \int_X fg\,d(\mu + m) for all f \in L^2(\mu + m). One verifies that 0 \leq g \leq 1 holds (\mu + m)-almost everywhere; setting E := \{g = 1\}, one checks that the restriction \mu|_E is singular with respect to m, while away from E one has d\mu = \frac{g}{1-g}\,dm. This gives the desired Lebesgue-Radon-Nikodym decomposition \mu = m_{g/(1-g)} + \mu|_E.
Remark 1.3.20. The argument used in Remark 1.3.19 also shows
that the Radon-Nikodym theorem implies the Lebesgue-Radon-Nikodym
theorem.
Remark 1.3.21. One can give an alternate proof of Theorem 1.3.16, which relies on the geometry (and in particular, the uniform convexity) of L^p spaces rather than on the Radon-Nikodym theorem, and can thus be viewed as giving an independent proof of that theorem; see Exercise 1.4.14.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/09.
Thanks to Xiaochuan Li for corrections.
1.4. Hilbert spaces

In the next few lectures, we will be studying four major classes of function spaces. In decreasing order of generality, these classes are the topological vector spaces, the normed vector spaces, the Banach spaces, and the Hilbert spaces. In order to motivate the discussion of the more general classes of spaces, we will first focus on the most special class - that of (real and complex) Hilbert spaces. These spaces can be viewed as generalisations of (real and complex) Euclidean spaces such as R^n and C^n to infinite-dimensional settings, and indeed much of one's Euclidean geometry intuition concerning lengths, angles, orthogonality, subspaces, etc. will transfer readily to arbitrary Hilbert spaces; in contrast, this intuition is not always accurate in the more general vector spaces mentioned above. In addition to Euclidean spaces, another fundamental example of Hilbert spaces comes from the Lebesgue spaces L^2(X, A, \mu) of a measure space (X, A, \mu).
Hilbert spaces are the natural abstract framework in which to
study two important (and closely related) concepts: orthogonality
and unitarity, allowing us to generalise familiar concepts and facts
from Euclidean geometry such as the Cartesian coordinate system,
rotations and reflections, and the Pythagorean theorem to Hilbert
spaces. (For instance, the Fourier transform (Section 1.12) is a uni-
tary transformation and can thus be viewed as a kind of generalised
rotation.) Furthermore, the Hodge duality on Euclidean spaces has a
partial analogue for Hilbert spaces, namely the Riesz representation
theorem for Hilbert spaces, which makes the theory of duality and
adjoints for Hilbert spaces especially simple (when compared with
the more subtle theory of duality for, say, Banach spaces; see Section
1.5).
These notes are only the most basic introduction to the theory of Hilbert spaces. In particular, the theory of linear transformations between two Hilbert spaces, which is perhaps the most important aspect of the subject, is not covered much at all here.

(Footnote: There are of course many other Hilbert spaces of importance in complex analysis, harmonic analysis, and PDE, such as Hardy spaces H^2, Sobolev spaces H^s = W^{s,2}, and the space HS of Hilbert-Schmidt operators; see for instance Section 1.14 for a discussion of Sobolev spaces. Complex Hilbert spaces also play a fundamental role in the foundations of quantum mechanics, being the natural space to hold all the possible states of a quantum system (possibly after projectivising the Hilbert space), but we will not discuss this subject here.)
1.4.1. Inner product spaces. The Euclidean norm

(1.31) |(x_1, \ldots, x_n)| := \sqrt{x_1^2 + \ldots + x_n^2}

in real Euclidean space R^n can be expressed in terms of the dot product \cdot : R^n \times R^n \to R, defined as

(1.32) (x_1, \ldots, x_n) \cdot (y_1, \ldots, y_n) := x_1 y_1 + \ldots + x_n y_n

by the well-known formula

(1.33) |x| = (x \cdot x)^{1/2}.
In particular, we have the positivity property

(1.34) x \cdot x \geq 0

with equality if and only if x = 0. One reason why it is more advantageous to work with the dot product than the norm is that while the norm function is only sublinear, the dot product is bilinear, thus

(1.35) (cx + dy) \cdot z = c(x \cdot z) + d(y \cdot z); \quad z \cdot (cx + dy) = c(z \cdot x) + d(z \cdot y)

for all vectors x, y, z and scalars c, d, and also symmetric,

(1.36) x \cdot y = y \cdot x.

These properties make the inner product easier to manipulate algebraically than the norm.
The above discussion was for the real vector space R^n, but one can develop analogous statements for the complex vector space C^n, in which the norm

(1.37) \|(z_1, \ldots, z_n)\| := \sqrt{|z_1|^2 + \ldots + |z_n|^2}

can be represented in terms of the complex inner product \langle \cdot, \cdot \rangle : C^n \times C^n \to C defined by the formula

(1.38) \langle (z_1, \ldots, z_n), (w_1, \ldots, w_n) \rangle := z_1 \overline{w_1} + \ldots + z_n \overline{w_n}

by the analogue of (1.33), namely

(1.39) \|x\| = \langle x, x \rangle^{1/2}.
In particular, as before with (1.34), we have the positivity property

(1.40) \langle x, x \rangle \geq 0,

with equality if and only if x = 0. The bilinearity property (1.35) is modified to the sesquilinearity property

(1.41) \langle cx + dy, z \rangle = c\langle x, z \rangle + d\langle y, z \rangle; \quad \langle z, cx + dy \rangle = \overline{c}\langle z, x \rangle + \overline{d}\langle z, y \rangle,

while the symmetry property (1.36) needs to be replaced with

(1.42) \langle x, y \rangle = \overline{\langle y, x \rangle}

in order to be compatible with sesquilinearity.

We can formalise all these properties axiomatically as follows.

Definition 1.4.1 (Inner product space). A complex inner product space (V, \langle \cdot, \cdot \rangle) is a complex vector space V, together with an inner product \langle \cdot, \cdot \rangle : V \times V \to C which is sesquilinear (i.e. (1.41) holds for all x, y, z \in V and c, d \in C) and symmetric in the sesquilinear sense (i.e. (1.42) holds for all x, y \in V), and obeys the positivity property (1.40) for all x \in V, with equality if and only if x = 0. We will usually abbreviate (V, \langle \cdot, \cdot \rangle) as V.

A real inner product space is defined similarly, but with all references to C replaced by R (and all references to complex conjugation dropped).
Example 1.4.2. R^n with the standard dot product (1.32) is a real inner product space, and C^n with the complex inner product (1.38) is a complex inner product space.
Example 1.4.3. If (X, A, \mu) is a measure space, then the complex L^2 space L^2(X, A, \mu) = L^2(X, A, \mu; C) with the complex inner product

(1.43) \langle f, g \rangle := \int_X f \overline{g}\,d\mu

(which is well defined, by the Cauchy-Schwarz inequality) is easily verified to be a complex inner product space, and similarly for the real L^2 space (with the complex conjugate signs dropped, of course). Note that the finite dimensional examples R^n, C^n can be viewed as the special case of the L^2 examples in which X is \{1, \ldots, n\} with the discrete \sigma-algebra and counting measure.
Example 1.4.4. Any subspace of a (real or complex) inner product space is again a (real or complex) inner product space, simply by restricting the inner product to the subspace.

Example 1.4.5. Also, any real inner product space V can be complexified into the complex inner product space V_C, defined as the space of formal combinations x + iy of vectors x, y \in V (with the obvious complex vector space structure), and with inner product

(1.44) \langle a + ib, c + id \rangle := \langle a, c \rangle + i\langle b, c \rangle - i\langle a, d \rangle + \langle b, d \rangle.

Example 1.4.6. Fix a probability space (X, A, \mu). The space of square-integrable real-valued random variables of mean zero is an inner product space if one uses covariance as the inner product. (What goes wrong if one drops the mean zero assumption?)
Given a (real or complex) inner product space V, we can define the norm \|x\| of any vector x \in V by the formula (1.39), which is well defined thanks to the positivity property; in the case of the L^2 spaces, this norm of course corresponds to the usual L^2 norm. We have the following basic facts:
Lemma 1.4.7. Let V be a real or complex inner product space.

(i) (Cauchy-Schwarz inequality) For any x, y \in V, we have |\langle x, y \rangle| \leq \|x\| \|y\|.

(ii) The function x \mapsto \|x\| is a norm on V. (Thus every inner product space is a normed vector space.)

Proof. We shall just verify the complex case, as the real case is similar (and slightly easier). The positivity property tells us that the quadratic form \langle ax + by, ax + by \rangle is non-negative for all complex numbers a, b. Using sesquilinearity and symmetry, we can expand this form as

(1.45) |a|^2 \|x\|^2 + 2 \operatorname{Re}(a\overline{b}\langle x, y \rangle) + |b|^2 \|y\|^2.

Optimising in a, b (see also Section 1.10 of Structure and Randomness) we obtain the Cauchy-Schwarz inequality. To verify the norm property, the only non-trivial verification is that of the triangle inequality \|x + y\| \leq \|x\| + \|y\|. But on expanding \|x + y\|^2 = \langle x + y, x + y \rangle we see that

(1.46) \|x + y\|^2 = \|x\|^2 + 2 \operatorname{Re}\langle x, y \rangle + \|y\|^2,

and the claim then follows from the Cauchy-Schwarz inequality.
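The expansion (1.45) and the resulting Cauchy-Schwarz inequality can be checked numerically in C^n; the following Python sketch uses randomly chosen illustrative data.

```python
import math
import random

random.seed(0)

def inner(x, y):
    # complex inner product <x, y> = sum_i x_i conj(y_i), as in (1.38)
    return sum(a * b.conjugate() for a, b in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x).real)

n = 4
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
y = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
a, b = complex(0.3, -1.1), complex(-0.7, 0.2)

# Left side: <ax + by, ax + by> computed directly.
v = [a * xi + b * yi for xi, yi in zip(x, y)]
lhs = inner(v, v).real

# Right side: the expansion (1.45).
rhs = (abs(a) ** 2 * norm(x) ** 2
       + 2 * (a * b.conjugate() * inner(x, y)).real
       + abs(b) ** 2 * norm(y) ** 2)
```

The two sides agree up to rounding, and |⟨x, y⟩| ≤ ‖x‖‖y‖ holds as the lemma asserts.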
Observe from the Cauchy-Schwarz inequality that the inner product \langle \cdot, \cdot \rangle : V \times V \to C is continuous.
Exercise 1.4.1. Let T : V \to W be a linear map from one (real or complex) inner product space to another. Show that T preserves the inner product structure (i.e. \langle Tx, Ty \rangle = \langle x, y \rangle for all x, y \in V) if and only if T is an isometry (i.e. \|Tx\| = \|x\| for all x \in V). (Hint: in the real case, express \langle x, y \rangle in terms of \|x + y\|^2 and \|x - y\|^2. In the complex case, use x + y, x - y, x + iy, x - iy instead of x + y, x - y.)

Inspired by the above exercise, we say that two inner product spaces are isomorphic if there exists an invertible isometry from one space to the other; such invertible isometries are known as isomorphisms.
Exercise 1.4.2. Let V be a real or complex inner product space. If x_1, \ldots, x_n are a finite collection of vectors in V, show that the Gram matrix (\langle x_i, x_j \rangle)_{1 \leq i,j \leq n} is Hermitian and positive semi-definite, and is positive definite if and only if the x_1, \ldots, x_n are linearly independent. Conversely, given a Hermitian positive semi-definite matrix (a_{ij})_{1 \leq i,j \leq n} with real (resp. complex) entries, show that there exists a real (resp. complex) inner product space V and vectors x_1, \ldots, x_n such that \langle x_i, x_j \rangle = a_{ij} for all 1 \leq i, j \leq n.
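The first half of the exercise can be probed numerically: for any choice of vectors, the quadratic form attached to the Gram matrix is a squared norm and hence non-negative. A Python sketch of the real case, with random illustrative data:

```python
import random

random.seed(2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Three vectors in R^5 and their Gram matrix.
xs = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
G = [[dot(xi, xj) for xj in xs] for xi in xs]

def quad(z):
    # z^T G z = ||z_1 x_1 + z_2 x_2 + z_3 x_3||^2 >= 0
    return sum(z[i] * G[i][j] * z[j] for i in range(3) for j in range(3))

# Sample the quadratic form at many random points; it never goes negative.
min_quad = min(quad([random.uniform(-1, 1) for _ in range(3)])
               for _ in range(200))
```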
In analogy with the Euclidean case, we say that two vectors x, y in a (real or complex) inner product space are orthogonal if \langle x, y \rangle = 0. (With this convention, we see in particular that 0 is orthogonal to every vector, and is the only vector with this property.)
Exercise 1.4.3 (Pythagorean theorem). Let V be a real or complex inner product space. If x_1, \ldots, x_n are a finite set of pairwise orthogonal vectors, then \|x_1 + \ldots + x_n\|^2 = \|x_1\|^2 + \ldots + \|x_n\|^2. In particular, we see that \|x_1 + x_2\| \geq \|x_1\| whenever x_2 is orthogonal to x_1.
A (possibly infinite) collection (e_\alpha)_{\alpha \in A} of vectors in a (real or complex) inner product space is said to be orthonormal if they are pairwise orthogonal and all of unit length.

Exercise 1.4.4. Let (e_\alpha)_{\alpha \in A} be an orthonormal system of vectors in a real or complex inner product space. Show that this system is (algebraically) linearly independent (thus any non-trivial finite linear combination of vectors in this system is non-zero). If x lies in the algebraic span of this system (i.e. it is a finite linear combination of vectors in the system), establish the inversion formula

(1.47) x = \sum_{\alpha \in A} \langle x, e_\alpha \rangle e_\alpha

(with only finitely many of the terms non-zero) and the (finite) Plancherel formula

(1.48) \|x\|^2 = \sum_{\alpha \in A} |\langle x, e_\alpha \rangle|^2.
Exercise 1.4.5 (Gram-Schmidt theorem). Let e_1, \ldots, e_n be a finite orthonormal system in a real or complex inner product space, and let v be a vector not in the span of e_1, \ldots, e_n. Show that there exists a vector e_{n+1} with span(e_1, \ldots, e_n, e_{n+1}) = span(e_1, \ldots, e_n, v) such that e_1, \ldots, e_{n+1} is an orthonormal system. Conclude that an n-dimensional real or complex inner product space is isomorphic to R^n or C^n respectively. Thus, any statement about inner product spaces which only involves a finite-dimensional subspace of that space can be verified just by checking it on Euclidean spaces.
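The inductive step of this exercise is the classical Gram-Schmidt process; here is a minimal Python sketch of the real case, with vector data chosen at random purely for illustration.

```python
import math
import random

random.seed(1)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Orthonormalise a list of vectors in R^n, skipping dependent ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            # subtract the projection of w onto the span of the basis so far
            c = dot(w, e)
            w = [wi - c * ei for wi, ei in zip(w, e)]
        nrm = math.sqrt(dot(w, w))
        if nrm > 1e-12:
            basis.append([wi / nrm for wi in w])
    return basis

vecs = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
e = gram_schmidt(vecs)
```

By construction each new vector is orthogonal to the previous ones and of unit length, and the span is preserved at every step.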
Exercise 1.4.6 (Parallelogram law). For any inner product space V, establish the parallelogram law

(1.49) \|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2.

Show that this identity fails for L^p(X, A, \mu) for p \neq 2 as soon as X contains at least two disjoint sets of non-zero finite measure. On the other hand, establish the Hanner inequalities

(1.50) \|f + g\|_p^p + \|f - g\|_p^p \geq (\|f\|_p + \|g\|_p)^p + |\|f\|_p - \|g\|_p|^p

and

(1.51) (\|f + g\|_p + \|f - g\|_p)^p + |\|f + g\|_p - \|f - g\|_p|^p \geq 2^p(\|f\|_p^p + \|g\|_p^p)

for 1 \leq p \leq 2, with the inequalities being reversed for 2 \leq p < \infty. (Hint: (1.51) can be deduced from (1.50) by a simple substitution. For (1.50), reduce to the case when f, g are non-negative, and then exploit the inequality

(1.52) |x + y|^p + |x - y|^p \geq ((1 + r)^{p-1} + (1 - r)^{p-1})x^p + ((1 + r)^{p-1} - (1 - r)^{p-1})r^{1-p} y^p

for all non-negative x, y, 0 < r < 1, and 1 \leq p \leq 2, with the inequality being reversed for 2 \leq p < \infty, and with equality being attained when y < x and r = y/x.)
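A quick numerical illustration in Python: the parallelogram law holds in l^2, fails for the indicators of two disjoint sets in l^1, and the first Hanner inequality (1.50) holds for an illustrative exponent between 1 and 2. (All data below is an arbitrary choice.)

```python
import math

def norm_p(v, p):
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

x = [1.0, 2.0, -1.0]
y = [0.5, -1.0, 2.0]
xp = [a + b for a, b in zip(x, y)]
xm = [a - b for a, b in zip(x, y)]

# Parallelogram law (1.49) in l^2:
lhs2 = norm_p(xp, 2) ** 2 + norm_p(xm, 2) ** 2
rhs2 = 2 * norm_p(x, 2) ** 2 + 2 * norm_p(y, 2) ** 2

# Failure in l^1 with f, g the indicators of {0} and {1}:
f, g = [1.0, 0.0], [0.0, 1.0]
lhs1 = norm_p([1.0, 1.0], 1) ** 2 + norm_p([1.0, -1.0], 1) ** 2  # f+g, f-g
rhs1 = 2 * norm_p(f, 1) ** 2 + 2 * norm_p(g, 1) ** 2

# Hanner inequality (1.50) for p = 1.5:
p = 1.5
hanner_lhs = norm_p(xp, p) ** p + norm_p(xm, p) ** p
hanner_rhs = ((norm_p(x, p) + norm_p(y, p)) ** p
              + abs(norm_p(x, p) - norm_p(y, p)) ** p)
```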
1.4.2. Hilbert spaces. Thus far, our discussion of inner product spaces has been largely algebraic in nature; this is because we have not been able to take limits inside these spaces and do some actual analysis. This can be rectified by adding an additional axiom:

Definition 1.4.8 (Hilbert spaces). A (real or complex) Hilbert space is a (real or complex) inner product space which is complete (or equivalently, an inner product space which is also a Banach space).
Example 1.4.9. From Proposition 1.3.7, (real or complex) L^2(X, A, \mu) is a Hilbert space for any measure space (X, A, \mu). In particular, R^n and C^n are Hilbert spaces.
Exercise 1.4.7. Show that a subspace of a Hilbert space H will itself
be a Hilbert space if and only if it is closed. (In particular, proper
dense subspaces of Hilbert spaces are not Hilbert spaces.)
Example 1.4.10. By Example 1.4.9, the space l^2(Z) of doubly infinite square-summable sequences is a Hilbert space. Inside this space, the space c_c(Z) of sequences of finite support is a proper dense subspace (as can be seen for instance by Proposition 1.3.8, though this can also be seen much more directly), and so cannot be a Hilbert space.
Exercise 1.4.8. Let V be an inner product space. Show that there exists a Hilbert space \overline{V} which contains a dense subspace isomorphic to V; we refer to \overline{V} as a completion of V. Furthermore, this space is essentially unique in the sense that if \overline{V}, \overline{V}' are two such completions, then there exists an isomorphism between them which is the identity on V. Because of this fact, inner product spaces are sometimes known as pre-Hilbert spaces, and can always be identified with dense subspaces of actual Hilbert spaces.
Exercise 1.4.9. Let H, H' be two Hilbert spaces. Define the direct sum H \oplus H' of the two spaces, with inner product \langle (x, x'), (y, y') \rangle_{H \oplus H'} := \langle x, y \rangle_H + \langle x', y' \rangle_{H'}. Show that H \oplus H' is also a Hilbert space.
Proposition 1.4.12 has some importance in calculus of variations,
but we will not pursue those applications here.
Since every subspace is necessarily convex, we have a corollary:
Exercise 1.4.12 (Orthogonal projections). Let V be a closed subspace of a Hilbert space H. Then for every x \in H there exists a unique decomposition x = x_V + x_{V^\perp}, where x_V \in V and x_{V^\perp} is orthogonal to every element of V. Furthermore, x_V is the closest element of V to x.
Let \pi_V : H \to V be the map \pi_V : x \mapsto x_V, where x_V is given by the above exercise; we refer to \pi_V as the orthogonal projection from H onto V. It is not hard to see that \pi_V is linear, and from the Pythagorean theorem we see that \pi_V is a contraction (thus \|\pi_V x\| \leq \|x\| for all x \in H). In particular, \pi_V is continuous.
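For a concrete finite-dimensional illustration, one can compute \pi_V directly from an orthonormal basis of V and observe the defining properties. The Python sketch below uses an arbitrarily chosen plane V in R^4.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

s = 1 / math.sqrt(2)
e1 = [s, s, 0.0, 0.0]      # orthonormal basis of a plane V in R^4
e2 = [0.0, 0.0, 1.0, 0.0]

x = [3.0, 1.0, 2.0, 5.0]

# pi_V(x) = <x, e1> e1 + <x, e2> e2
c1, c2 = dot(x, e1), dot(x, e2)
proj = [c1 * a + c2 * b for a, b in zip(e1, e2)]
resid = [a - b for a, b in zip(x, proj)]   # this is x_{V^perp}

other = [1.0, 1.0, 1.0, 0.0]   # another point of V (= sqrt(2) e1 + e2)
```

The residual is orthogonal to V, the projection is a contraction, and `proj` is closer to x than the other sampled point of V, matching Exercise 1.4.12.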
Exercise 1.4.13 (Orthogonal complement). Given a subspace V of a Hilbert space H, define the orthogonal complement V^\perp of V to be the set of all vectors in H that are orthogonal to every element of V. Establish the following claims:

- V^\perp is a closed subspace of H, and (V^\perp)^\perp is the closure of V.

- If V, W are two closed subspaces of H, then (V + W)^\perp = V^\perp \cap W^\perp and (V \cap W)^\perp = \overline{V^\perp + W^\perp}.
Every vector v in a Hilbert space gives rise to a continuous linear functional \lambda_v : H \to C, defined by the formula \lambda_v(w) := \langle w, v \rangle (the continuity follows from the Cauchy-Schwarz inequality). The Riesz representation theorem for Hilbert spaces gives a converse:
Theorem 1.4.13 (Riesz representation theorem for Hilbert spaces). Let H be a complex Hilbert space, and let \lambda : H \to C be a continuous linear functional on H. Then there exists a unique v in H such that \lambda = \lambda_v. A similar claim holds for real Hilbert spaces (replacing C by R throughout).

Proof. We just show the claim for complex Hilbert spaces, as the claim for real Hilbert spaces is very similar. First, we show uniqueness: if \lambda_v = \lambda_{v'}, then \lambda_{v - v'} = 0, and in particular \langle v - v', v - v' \rangle = 0, and so v = v'.

Now we show existence. We may assume that \lambda is not identically zero, since the claim is obvious otherwise. Observe that the kernel V := \{x \in H : \lambda(x) = 0\} is then a proper subspace of H, which is closed since \lambda is continuous. By Exercise 1.4.13, the orthogonal complement V^\perp is then non-trivial, and thus contains a unit vector u. One then verifies that v := \overline{\lambda(u)} u has the required property \lambda = \lambda_v, since \lambda - \lambda_v vanishes on both V and u, which together span H.
Exercise 1.4.14. Using Exercise 1.4.11, give an alternate proof of the 1 < p < \infty case of Theorem 1.3.16.
One important consequence of the Riesz representation theorem is the existence of adjoints:

Exercise 1.4.15 (Existence of adjoints). Let T : H \to H' be a continuous linear transformation. Show that there exists a unique continuous linear transformation T^* : H' \to H with the property that \langle Tx, y \rangle_{H'} = \langle x, T^* y \rangle_H for all x \in H and y \in H'; we refer to T^* as the adjoint of T.

- Show that (T^*)^* = T.

- Show that T is an isometry if and only if T^* T = id_H.

- Show that T is an isomorphism if and only if T^* T = id_H and T T^* = id_{H'}.

- If S : H' \to H'' is another continuous linear transformation, show that (ST)^* = T^* S^*.
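In the familiar case H = H' = C^2, the adjoint of the transformation attached to a matrix A is, with respect to the inner product (1.38), given by the conjugate transpose of A; a Python check with arbitrary illustrative entries:

```python
# T : C^2 -> C^2 given by a matrix A; candidate adjoint: the conjugate
# transpose A^*.  (The entries of A, x, y are arbitrary illustrations.)
A = [[1 + 2j, 3 - 1j],
     [0 + 1j, 2 + 0j]]
Astar = [[A[j][i].conjugate() for j in range(2)] for i in range(2)]

def apply(M, v):
    return [sum(M[i][j] * v[j] for j in range(2)) for i in range(2)]

def inner(u, v):
    # complex inner product (1.38)
    return sum(a * b.conjugate() for a, b in zip(u, v))

x = [1 - 1j, 2 + 3j]
y = [-2 + 0.5j, 1 + 1j]

lhs = inner(apply(A, x), y)        # <Tx, y>
rhs = inner(x, apply(Astar, y))    # <x, T*y>
```

The two inner products agree, which is the defining property of the adjoint.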
Remark 1.4.16. An isomorphism of complex Hilbert spaces is also known as a unitary transformation. (For real Hilbert spaces, the term orthogonal transformation is used instead.) Note that unitary and orthogonal n \times n matrices generate unitary and orthogonal transformations on C^n and R^n respectively.
Exercise 1.4.17. Show that the projection map \pi_V : H \to V from a Hilbert space to a closed subspace V is the adjoint of the inclusion map \iota_V : V \to H.
1.4.3. Orthonormal bases. In the section on inner product spaces, we studied finite linear combinations of orthonormal systems. Now that we have completeness, we turn to infinite linear combinations. We begin with countable linear combinations:

Exercise 1.4.18. Suppose that e_1, e_2, e_3, \ldots is a countable orthonormal system in a complex Hilbert space H, and c_1, c_2, \ldots is a sequence of complex numbers. (As usual, similar statements will hold here for real Hilbert spaces and real numbers.)

(i) Show that the series \sum_{n=1}^\infty c_n e_n is conditionally convergent in H if and only if c_n is square-summable.

(ii) If c_n is square-summable, show that \sum_{n=1}^\infty c_n e_n is unconditionally convergent in H, i.e. every permutation of the c_n e_n sums to the same value.

(iii) Show that the map (c_n)_{n=1}^\infty \mapsto \sum_{n=1}^\infty c_n e_n is an isometry from the Hilbert space \ell^2(N) to H. The image V of this isometry is the smallest closed subspace of H that contains e_1, e_2, \ldots, and which we shall therefore call the (Hilbert space) span of e_1, e_2, \ldots.

(iv) Take adjoints of (iii) and conclude that for any x \in H, we have \pi_V(x) = \sum_{n=1}^\infty \langle x, e_n \rangle e_n and \|\pi_V(x)\| = (\sum_{n=1}^\infty |\langle x, e_n \rangle|^2)^{1/2}. Conclude in particular the Bessel inequality \sum_{n=1}^\infty |\langle x, e_n \rangle|^2 \leq \|x\|^2.
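Parts (iii) and (iv) admit a finite-dimensional Python illustration, using a three-element orthonormal system in R^4 (the vectors are an arbitrary illustrative choice):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# An orthonormal system e_1, e_2, e_3 in R^4.
e = [[0.5, 0.5, 0.5, 0.5],
     [0.5, 0.5, -0.5, -0.5],
     [0.5, -0.5, 0.5, -0.5]]

x = [1.0, 2.0, 3.0, 5.0]
coeffs = [dot(x, en) for en in e]

bessel = sum(c * c for c in coeffs)   # sum_n |<x, e_n>|^2
norm_sq = dot(x, x)                   # ||x||^2

# The map (c_n) -> sum_n c_n e_n is an isometry from l^2 into H:
c = [2.0, -1.0, 3.0]
v = [sum(cn * en[i] for cn, en in zip(c, e)) for i in range(4)]
```

The Bessel sum is bounded by ‖x‖², and the norm of v equals the l^2 norm of its coefficient sequence, as the exercise asserts.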
Remark 1.4.17. Note the contrast here between conditional and unconditional summability (which needs only square-summability of the coefficients c_n) and absolute summability (which requires the stronger condition that the c_n are absolutely summable). In particular there exist non-absolutely summable series that are still unconditionally summable, in contrast to the situation for scalars, in which one has the Riemann rearrangement theorem.
Now we can handle arbitrary orthonormal systems (e_\alpha)_{\alpha \in A}. If (c_\alpha)_{\alpha \in A} is square-summable, then at most countably many of the c_\alpha are non-zero (by Exercise 1.3.4). Using parts (i), (ii) of Exercise 1.4.18, we can then form the sum \sum_{\alpha \in A} c_\alpha e_\alpha in an unambiguous manner. It is not hard to use Exercise 1.4.18 to then conclude that this gives an isometric embedding of \ell^2(A) into H. The image of this isometry is the smallest closed subspace of H that contains the orthonormal system, which we call the (Hilbert space) span of that system. (It is the closure of the algebraic span of the system.)
Exercise 1.4.19. Let (e_\alpha)_{\alpha \in A} be an orthonormal system in H. Show that the following statements are equivalent:

(i) The Hilbert space span of (e_\alpha)_{\alpha \in A} is all of H.

(ii) The algebraic span of (e_\alpha)_{\alpha \in A} (i.e. the finite linear combinations of the e_\alpha) is dense in H.

(iii) One has the Parseval identity \|x\|^2 = \sum_{\alpha \in A} |\langle x, e_\alpha \rangle|^2 for all x \in H.

(iv) One has the inversion formula x = \sum_{\alpha \in A} \langle x, e_\alpha \rangle e_\alpha for all x \in H (in particular, the coefficients \langle x, e_\alpha \rangle are square-summable).

(v) The only vector orthogonal to all of the e_\alpha is the zero vector.

(vi) There is an isomorphism from \ell^2(A) to H that maps \delta_\alpha to e_\alpha for all \alpha \in A.

A system (e_\alpha)_{\alpha \in A} obeying any (and hence all) of the properties in Exercise 1.4.19 is known as an orthonormal basis of the Hilbert space H. All Hilbert spaces have such a basis:
Proposition 1.4.18. Every Hilbert space has at least one orthonormal basis.

Proof. We use the standard Zorn's lemma argument (see Section 2.4). Every Hilbert space has at least one orthonormal system, namely the empty system. We order the orthonormal systems by inclusion, and observe that the union of any totally ordered set of orthonormal systems is again an orthonormal system. By Zorn's lemma, there must exist a maximal orthonormal system (e_\alpha)_{\alpha \in A}. There cannot be any unit vector orthogonal to all the elements of this system, since otherwise one could add that vector to the system and contradict maximality. Applying Exercise 1.4.19 in the contrapositive, we obtain an orthonormal basis as claimed.
Exercise 1.4.20. Show that every vector space V has at least one algebraic basis, i.e. a set of basis vectors such that every vector in V can be expressed uniquely as a finite linear combination of basis vectors. (Such bases are also known as Hamel bases.)
Corollary 1.4.19. Every Hilbert space is isomorphic to \ell^2(A) for some set A.
Exercise 1.4.21. Let A, B be sets. Show that \ell^2(A) and \ell^2(B) are isomorphic if and only if A and B have the same cardinality. (Hint: the case when A or B is finite is easy, so suppose A and B are both infinite. If \ell^2(A) and \ell^2(B) are isomorphic, show that B can be covered by a family of at most countable sets indexed by A, and vice versa. Then apply the Schroder-Bernstein theorem (Section 3.13).)
We can now classify Hilbert spaces up to isomorphism by a single
cardinal, the dimension of that space:
Exercise 1.4.22. Show that all orthonormal bases of a given Hilbert
space H have the same cardinality. This cardinality is called the
(Hilbert space) dimension of the Hilbert space.
Exercise 1.4.23. Show that a Hilbert space is separable (i.e. has a countable dense subset) if and only if its dimension is at most countable. Conclude in particular that up to isomorphism, there is exactly one separable infinite-dimensional Hilbert space.
Exercise 1.4.24. Let H, H' be two Hilbert spaces. Show that there exists a Hilbert space H \otimes H' and a map \otimes : H \times H' \to H \otimes H' with the following properties:

(i) The map \otimes is bilinear, thus (cx + dy) \otimes x' = c(x \otimes x') + d(y \otimes x') and x \otimes (cx' + dy') = c(x \otimes x') + d(x \otimes y') for all x, y \in H, x', y' \in H', c, d \in C;

(ii) We have \langle x \otimes x', y \otimes y' \rangle_{H \otimes H'} = \langle x, y \rangle_H \langle x', y' \rangle_{H'} for all x, y \in H, x', y' \in H'.

(iii) The (algebraic) span of \{x \otimes x' : x \in H, x' \in H'\} is dense in H \otimes H'.

Furthermore, show that H \otimes H' and \otimes are essentially unique: if \tilde{H} and \tilde{\otimes} : H \times H' \to \tilde{H} are another pair of objects obeying the above properties, then there exists an isomorphism \Phi : H \otimes H' \to \tilde{H} such that x \mathbin{\tilde{\otimes}} x' = \Phi(x \otimes x') for all x \in H, x' \in H'. We refer to H \otimes H' as the tensor product of H and H', and x \otimes x' as the tensor product of x and x'.
Exercise 1.4.25. Let (X, A, \mu) and (Y, B, \nu) be measure spaces. Show that L^2(X \times Y, A \times B, \mu \times \nu) is the tensor product of L^2(X, A, \mu) and L^2(Y, B, \nu), if one defines the tensor product f \otimes g of f \in L^2(X, A, \mu) and g \in L^2(Y, B, \nu) as f \otimes g(x, y) := f(x)g(y).
We do not yet have enough theory in other areas to give the really useful applications of Hilbert space theory yet, but let us just illustrate a simple one, namely the development of Fourier series on the unit circle R/Z. We can give this space the usual Lebesgue measure (identifying the unit circle with [0, 1), if one wishes), giving rise to the complex Hilbert space L^2(R/Z). On this space we can form the characters e_n(x) := e^{2\pi i n x} for all integers n; one easily verifies that (e_n)_{n \in Z} is an orthonormal system. We claim that it is in fact an orthonormal basis. By Exercise 1.4.19, it suffices to show that the algebraic span of the e_n, i.e. the space of trigonometric polynomials, is dense in L^2(R/Z). But from an explicit computation (e.g. using Fejer kernels) one can show that the indicator function of any interval can be approximated to arbitrary accuracy in L^2 norm by trigonometric polynomials, and is thus in the closure of the trigonometric polynomials. (One can also use the Stone-Weierstrass theorem here; see Theorem 1.10.24.) By linearity, the same is then true of an indicator function of a finite union of intervals; since Lebesgue measurable sets in R/Z can be approximated to arbitrary accuracy by finite unions of intervals, the same is true for indicators of measurable sets. By linearity, the same is true for simple functions, and by density (Proposition 1.3.8) the same is true for arbitrary L^2 functions, and the claim follows.
The Fourier transform \hat{f} : Z \to C of a function f \in L^2(R/Z) is defined as

(1.56) \hat{f}(n) := \langle f, e_n \rangle = \int_0^1 f(x) e^{-2\pi i n x}\,dx.

From Exercise 1.4.19, we obtain the Parseval identity

\sum_{n \in Z} |\hat{f}(n)|^2 = \int_{R/Z} |f(x)|^2\,dx

(in particular, \hat{f} \in \ell^2(Z)) and the inversion formula

f = \sum_{n \in Z} \hat{f}(n) e_n,

where the right-hand side is unconditionally convergent. Indeed, the Fourier transform f \mapsto \hat{f} is a unitary transformation between L^2(R/Z) and \ell^2(Z). (These facts are collectively referred to as Plancherel's theorem for the unit circle.) We will develop Fourier analysis on other spaces than the unit circle in Section 1.12.
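For a trigonometric polynomial everything here reduces to finite sums, so Plancherel's theorem can be checked numerically. In the Python sketch below (with an arbitrarily chosen polynomial) the Riemann sum for (1.56) is exact, by discrete orthogonality of the characters at N equally spaced points.

```python
import cmath
import math

N = 64                          # sample points on R/Z
grid = [j / N for j in range(N)]

# f = 0.5 e_0 + e_1 + 2 e_{-2}, where e_n(x) = exp(2 pi i n x).
def f(x):
    return 0.5 + cmath.exp(2j * math.pi * x) + 2 * cmath.exp(-4j * math.pi * x)

def fourier_coeff(n):
    # Riemann sum for (1.56); exact here since all frequencies are < N/2.
    return sum(f(x) * cmath.exp(-2j * math.pi * n * x) for x in grid) / N

coeffs = {n: fourier_coeff(n) for n in range(-5, 6)}
parseval_lhs = sum(abs(c) ** 2 for c in coeffs.values())   # sum |f^(n)|^2
parseval_rhs = sum(abs(f(x)) ** 2 for x in grid) / N       # int |f|^2
```

The recovered coefficients match those used to build f, and the two sides of the Parseval identity agree.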
Remark 1.4.20. Of course, much of the theory here generalises the corresponding theory in finite-dimensional linear algebra; we will continue this theme much later in the course when we turn to the spectral theorem. However, not every aspect of finite-dimensional linear algebra will carry over so easily. For instance, it turns out to be quite difficult to take the determinant or trace of a linear transformation from a Hilbert space to itself in general (unless the transformation is particularly well behaved, e.g. of trace class). The Jordan normal form also does not translate to the infinite-dimensional setting, leading to the notorious invariant subspace problem in the subject. It is also worth cautioning that while the theory of orthonormal bases in finite-dimensional Euclidean spaces generalises very nicely to the Hilbert space setting, the more general theory of bases in finite dimensions becomes much more subtle in infinite dimensional Hilbert spaces, unless the basis is almost orthonormal in some sense (e.g. if it forms a frame).
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/17. Thanks to Americo Tavares, S, and Xiaochuan Liu for corrections.

Uhlrich Groh and Dmitriy raised the interesting open problem of whether any closed subset K of H, for which distance minimisers to every point x exist and are unique, is necessarily convex, thus providing a converse to Proposition 1.4.12. (Sets with this property are known as Chebyshev sets.)
1.5. Duality and the Hahn-Banach theorem

When studying a mathematical space X (e.g. a vector space, a topological space, a manifold, a group, an algebraic variety etc.), there are two fundamentally basic ways to try to understand the space:

(i) By looking at subobjects in X, or more generally maps f : Y \to X from some other space Y into X. For instance, a point in a space X can be viewed as a map from pt to X; a curve in a space X could be thought of as a map from [0, 1] to X; a group G can be studied via its subgroups K, and so forth.

(ii) By looking at objects on X, or more precisely maps f : X \to Y from X into some other space Y. For instance, one can study a topological space X via the real- or complex-valued continuous functions f \in C(X) on X; one can study a group G via its quotient groups \pi : G \to G/H; one can study an algebraic variety V by studying the polynomials on V (and in particular, the ideal of polynomials that vanish identically on V); and so forth.

(There are also more sophisticated ways to study an object via its maps, e.g. by studying extensions, joinings, splittings, universal lifts, etc. The general study of objects via the maps between them is formalised abstractly in modern mathematics as category theory, and is also closely related to homological algebra.)
A remarkable phenomenon in many areas of mathematics is that of (contravariant) duality: that the maps into and out of one type of mathematical object X can be naturally associated to the maps out of and into a dual object X^*; in many cases, the dual of the dual object is essentially identical to X itself. In these notes we discuss a third important case of duality, namely duality of normed vector spaces, which is of an intermediate nature to the previous two examples: the dual X^* of a normed vector space is again a normed vector space, and every continuous linear operator T : X \to Y gives rise to a transpose T^* : Y^* \to X^*.
A fundamental tool in understanding duality of normed vector
spaces will be the Hahn-Banach theorem, which is an indispensable
tool for exploring the dual of a vector space. (Indeed, without this
theorem, it is not clear at all that the dual of a non-trivial normed
vector space is non-trivial!) Thus, we shall study this theorem in
detail in this section concurrently with our discussion of duality.
1.5.1. Duality. In the category of normed vector spaces, the natural notion of a map (or morphism) between two such spaces is that of a continuous linear transformation T : X \to Y between two normed vector spaces X, Y. By Lemma 1.3.17, any such linear transformation is bounded, in the sense that there exists a constant C such that \|Tx\|_Y \leq C\|x\|_X for all x \in X. The least such constant C is known as the operator norm of T, and is denoted \|T\|_{op} or simply \|T\|.

Two normed vector spaces X, Y are equivalent if there is an invertible continuous linear transformation T : X \to Y from X to Y, thus T is bijective and there exist constants C, c > 0 such that c\|x\|_X \leq \|Tx\|_Y \leq C\|x\|_X for all x \in X. If one can take C = c = 1, then T is an isometry, and X and Y are called isomorphic. When one has two norms \|\cdot\|_1, \|\cdot\|_2 on the same vector space X, we say that the norms are equivalent if the identity from (X, \|\cdot\|_1) to (X, \|\cdot\|_2) is an invertible continuous transformation, i.e. that there exist constants C, c > 0 such that c\|x\|_1 \leq \|x\|_2 \leq C\|x\|_1 for all x \in X.
Exercise 1.5.1. Show that all linear transformations from a finite-dimensional space to a normed vector space are continuous. Conclude that all norms on a finite-dimensional space are equivalent.
Let B(X \to Y) denote the space of all continuous linear transformations from X to Y. (This space is also denoted by many other names, e.g. L(X, Y), Hom(X \to Y), etc.) This has the structure of a vector space: the sum S + T : x \mapsto Sx + Tx of two continuous linear transformations is another continuous linear transformation, as is the scalar multiple cT : x \mapsto cTx of a linear transformation.
Exercise 1.5.2. Show that B(X Y ) with the operator norm is a
normed vector space. If Y is complete (i.e. is a Banach space), show
that B(X Y ) is also complete (i.e. is also a Banach space).
Exercise 1.5.3. Let X, Y, Z be Banach spaces. Show that if T \in B(X \to Y) and S \in B(Y \to Z), then the composition ST : X \to Z lies in B(X \to Z) and \|ST\|_{op} \leq \|S\|_{op} \|T\|_{op}. (As a consequence of this inequality, we see that B(X \to X) is a Banach algebra.)
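The submultiplicativity \|ST\|_{op} \leq \|S\|_{op}\|T\|_{op} can be checked concretely for 2 x 2 real matrices, where the operator norm (on Euclidean R^2) is the largest singular value and admits a closed form. A Python sketch with arbitrary illustrative matrices:

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def op_norm(A):
    # Operator norm of a real 2x2 matrix on Euclidean R^2: the square
    # root of the largest eigenvalue of A^T A (the largest singular value).
    At = [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]
    M = matmul(At, A)
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    top = (tr + math.sqrt(max(tr * tr - 4 * det, 0.0))) / 2
    return math.sqrt(top)

S = [[1.0, 2.0], [-1.0, 0.5]]
T = [[0.0, 1.0], [3.0, -2.0]]
lhs = op_norm(matmul(S, T))
rhs = op_norm(S) * op_norm(T)
```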
Now we can define the notion of a dual space.
Definition 1.5.1 (Dual space). Let X be a normed vector space. The (continuous) dual space X^* of X is defined to be X^* := B(X \to R) if X is a real vector space, and X^* := B(X \to C) if X is a complex vector space. Elements of X^* are known as continuous linear functionals on X.

Exercise 1.5.4. Let X be a normed vector space, and let \overline{X} be a completion of X (thus X is isomorphic to a dense subspace of \overline{X}). Show that X^* and (\overline{X})^* are isomorphic to each other.
The next few exercises are designed to give some intuition as to
how dual spaces work.
Exercise 1.5.4. Let X be a normed vector space, and let X̄ be a
metric completion of X; show that the norm on X extends continuously
to X̄, making X̄ a Banach space. Show also that every λ ∈ X* has a
unique continuous extension to X̄ with the same operator norm, and
conclude that X* and (X̄)* are isomorphic.
Exercise 1.5.5. Let Rⁿ be given the Euclidean metric. Show that
(Rⁿ)* is isomorphic to Rⁿ. Establish the corresponding result for the
complex spaces Cⁿ.
Exercise 1.5.6. Let c_c(N) be the vector space of sequences (a_n)_{n∈N}
of real or complex numbers which are compactly supported (i.e. at
most finitely many of the a_n are non-zero). We give c_c(N) the uniform
norm ‖·‖_∞.
(i) Show that the dual space c_c(N)* is isomorphic to ℓ¹(N).
(ii) Show that the completion of c_c(N) is isomorphic to c₀(N),
the space of sequences on N that go to zero at infinity (again
with the uniform norm); thus, by Exercise 1.5.4, the dual
space of c₀(N) is isomorphic to ℓ¹(N) also.
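The pairing behind part (i) can be sketched as follows (our outline, not the exercise's full solution): a functional λ ∈ c_c(N)* is determined by the numbers b_n := λ(e_n), where the e_n are the standard basis sequences, and its operator norm is the ℓ¹ norm of (b_n):

```latex
% Testing \lambda against a_n := \operatorname{sgn}(\overline{b_n}) 1_{n \le N}
% (a vector of uniform norm at most 1) and letting N \to \infty gives
\|\lambda\|_{c_c(\mathbf{N})^*}
  = \sup_{\|a\|_{\infty} \le 1} \Bigl|\sum_{n} a_n b_n\Bigr|
  = \sum_{n} |b_n|
  = \|(b_n)_{n \in \mathbf{N}}\|_{\ell^1(\mathbf{N})}.
```

Conversely, every (b_n) ∈ ℓ¹(N) defines such a functional, which gives the claimed isomorphism.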
(iii) On the other hand, show that the dual of ℓ¹(N) is isomorphic
to ℓ∞(N).
Given a continuous linear transformation T : X → Y between normed
vector spaces, the transpose T* : Y* → X* is defined by T*λ := λ ∘ T
for all λ ∈ Y*.
Exercise 1.5.9. Show that the transpose T* of a continuous linear
transformation T between normed vector spaces is again a continuous
linear transformation with ‖T*‖_op ≤ ‖T‖_op, thus the transpose
operation is itself a linear map from B(X → Y) to B(Y* → X*).
(We will improve this result in Theorem 1.5.13 below.)
Exercise 1.5.10. An n × m matrix A with complex entries can be
identified with a linear transformation L_A : Cⁿ → Cᵐ. Identifying
the dual space of Cⁿ with itself as in Exercise 1.5.5, show that the
transpose (L_A)* : Cᵐ → Cⁿ is equal to L_{Aᵗ}, where Aᵗ is the transpose
matrix of A.
1.5. The Hahn-Banach theorem 67
Exercise 1.5.11. Show that the transpose of a surjective continuous
linear transformation between normed vector spaces is injective. Show
also that the condition of surjectivity can be relaxed to that of having
a dense image.
Remark 1.5.4. Observe that if T : X → Y and S : Y → Z are continuous
linear transformations between normed vector spaces, then
(ST)* = T*S*. Thus the operation of taking dual spaces, together with
the operation of taking transposes of continuous linear transformations,
form a contravariant functor from the category of normed vector
spaces (or Banach spaces) to itself.
Remark 1.5.5. The transpose T* : H* → H* of a continuous linear
transformation T : H → H on a Hilbert space H is closely related to
(but not quite the same as) the adjoint of that transformation, which
is defined using the identification between H* and H provided by the
Riesz representation theorem; as that identification is conjugate-linear
in the complex case, the two notions differ by a complex conjugation.
This is analogous to the linear algebra fact that the adjoint matrix is
the complex conjugate of the transpose matrix. One should note that
in the literature, the transpose operator T* is often itself referred to
as the adjoint of T.
We now come to the fundamental theorem of this section.
Theorem 1.5.6 (Hahn-Banach theorem). Let Y be a subspace of a
normed vector space X, and let λ : Y → R (or λ : Y → C) be a
continuous linear functional on Y. Then there exists an extension
λ̃ : X → R (resp. λ̃ : X → C) of λ which is still continuous, and with
‖λ̃‖_X* = ‖λ‖_Y*. (Note: the extension λ̃ is, in general,
not unique.)
We prove this important theorem in stages. We first handle the
codimension one real case:
Proposition 1.5.7. The Hahn-Banach theorem is true when X, Y
are real vector spaces, and X is spanned by Y and an additional vector
v.
Proof. We can assume that v lies outside Y, since the claim is trivial
otherwise. We can also normalise ‖λ‖_Y* = 1 (the claim is of course
trivial if ‖λ‖_Y* vanishes). To specify the extension λ̃ of λ, it suffices
by linearity to specify the value of λ̃(v). In order for the extension λ̃
to continue to have operator norm at most 1, we require that
|λ̃(y + tv)| ≤ ‖y + tv‖_X
for all t ∈ R and y ∈ Y. This is automatic for t = 0, so by homogeneity
it suffices to attain this bound for t = 1. We rearrange this a
bit as
sup_{y′ ∈ Y} λ(y′) − ‖y′ − v‖_X ≤ λ̃(v) ≤ inf_{y ∈ Y} ‖y + v‖_X − λ(y).
But as λ has operator norm 1, an application of the triangle inequality
shows that the infimum on the right-hand side is at least as large
as the supremum on the left-hand side, and so one can choose λ̃(v)
obeying the required properties.
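The triangle inequality step here can be expanded as follows (our elaboration of the one-line claim): for any y, y′ ∈ Y,

```latex
\lambda(y) + \lambda(y')
  = \lambda(y + y')
  \le \|y + y'\|_X
  \le \|y + v\|_X + \|y' - v\|_X,
```

so λ(y′) − ‖y′ − v‖_X ≤ ‖y + v‖_X − λ(y); taking the supremum over y′ and the infimum over y shows that the supremum never exceeds the infimum, leaving room to choose λ̃(v).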
Corollary 1.5.8. The Hahn-Banach theorem is true when X, Y are
real normed vector spaces.
Proof. This is a standard Zorn's lemma argument (see Section
2.4). Fix Y, X, λ. Define a partial extension of λ to be a pair
(Y′, λ′), where Y′ is a subspace of X containing Y, and λ′ : Y′ → R is a
continuous linear extension of λ with the same operator norm as λ.
We partially order the partial extensions by declaring (Y′, λ′) ≤ (Y″, λ″)
if Y″ contains Y′ and λ″ extends λ′. Since every chain of partial
extensions has an upper bound (formed by taking unions), Zorn's
lemma furnishes a maximal partial extension (Y′, λ′). If Y′ = X, we
are done; otherwise, one can find v ∈ X∖Y′. By Proposition 1.5.7, we
can then extend λ′ further to the space spanned by Y′ and v,
a contradiction; and the claim follows.
Remark 1.5.9. Of course, this proof of the Hahn-Banach theorem
relied on the axiom of choice (via Zorn's lemma) and is thus non-constructive.
It turns out that this is, to some extent, necessary: it
is not possible to prove the Hahn-Banach theorem if one deletes the
axiom of choice from the axioms of set theory (although it is possible
to deduce the theorem from slightly weaker versions of this axiom,
such as the ultrafilter lemma).
Finally, we establish the complex case by leveraging the real case.
Proof of Hahn-Banach theorem (complex case). Let λ : Y →
C be a continuous complex-linear functional, which we can normalise
to have operator norm 1. Then the real part ρ := Re(λ) : Y → R is a
continuous real-linear functional on Y (now viewed as a real normed
vector space rather than a complex one), which has operator norm
at most 1 (in fact, it is equal to 1, though we will not need this).
Applying Corollary 1.5.8, we can extend this real-linear functional
to a continuous real-linear functional ρ̃ : X → R on X (again viewed
now just as a real normed vector space) of norm at most 1.
To finish the job, we have to somehow complexify ρ̃ to a complex-linear
functional λ̃ : X → C of norm at most 1 that agrees with λ on
Y. Since λ(y) = ρ(y) − iρ(iy) for all y ∈ Y, it is reasonable to define
λ̃(x) := ρ̃(x) − iρ̃(ix); one easily verifies that λ̃ is complex-linear and
agrees with λ on Y. Since ρ̃ has norm at most 1, we have |Re λ̃(x)| ≤ ‖x‖_X
for all x ∈ X.
We can amplify this (cf. Section 1.9 of Structure and Randomness)
by exploiting phase rotation symmetry, thus |Re λ̃(e^{iθ}x)| ≤ ‖x‖_X for
all θ ∈ R. Optimising in θ we see that λ̃ has norm at most 1, as
required.
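The optimisation in θ can be written out explicitly (our elaboration): given x ∈ X, choose θ ∈ R so that e^{iθ}λ̃(x) is real and non-negative; then

```latex
|\widetilde{\lambda}(x)|
  = \operatorname{Re}\bigl(e^{i\theta}\,\widetilde{\lambda}(x)\bigr)
  = \operatorname{Re}\,\widetilde{\lambda}\bigl(e^{i\theta}x\bigr)
  \le \bigl\|e^{i\theta}x\bigr\|_X
  = \|x\|_X,
```

so ‖λ̃‖_op ≤ 1 as claimed.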
Exercise 1.5.12. In the special case when X is a Hilbert space, give
an alternate proof of the Hahn-Banach theorem, using the material
from Section 1.4, that avoids Zorn's lemma or the axiom of choice.
Now we put this Hahn-Banach theorem to work in the study of
duality and transposes.
Exercise 1.5.13. Let T : X → Y be a continuous linear transformation
which is bounded from below (i.e. there exists c > 0 such that
‖Tx‖ ≥ c‖x‖ for all x ∈ X); note that this ensures that X is equivalent
to some subspace of Y. Show that the transpose T* : Y* → X* is
surjective. Give an example to show that the claim fails if T is merely
assumed to be injective rather than bounded from below. (Hint:
consider the map (a_n)_{n=1}^∞ ↦ (a_n/n)_{n=1}^∞ on some suitable space of
sequences.) This should be compared with Exercise 1.5.11.
Exercise 1.5.14. Let x be an element of a normed vector space X.
Show that there exists λ ∈ X* such that ‖λ‖_X* = 1 and λ(x) = ‖x‖_X.
Conclude in particular that the dual of a non-trivial normed vector
space is again non-trivial.
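When X is a Hilbert space the functional in Exercise 1.5.14 can be written down explicitly as λ = ⟨·, x/‖x‖⟩; a numerical sketch (ours; Rⁿ with the Euclidean inner product, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(8)
u = x / np.linalg.norm(x)        # unit vector in the direction of x
lam = lambda y: np.dot(y, u)     # lambda(y) := <y, x/||x||>

# lambda attains ||x|| at x itself...
assert np.isclose(lam(x), np.linalg.norm(x))

# ...and by Cauchy-Schwarz |lambda(y)| <= ||y||, so ||lambda||_op = 1.
for _ in range(500):
    y = rng.standard_normal(8)
    assert abs(lam(y)) <= np.linalg.norm(y) + 1e-12
```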
Given a normed vector space X, we can form its double dual
(X*)*: the space of continuous linear functionals on X*. There is a
natural map ι : X → (X*)*, defined as
(1.58) ι(x)(λ) := λ(x)
for all x ∈ X and λ ∈ X*.
Theorem 1.5.10. The map ι is an isometry; thus X can be identified
with a subspace of (X*)*.
Exercise 1.5.15. Let Y be a subspace of a normed vector space X,
and let Y⊥ := {λ ∈ X* : λ(y) = 0 for all y ∈ Y} be the space of
continuous linear functionals which vanish on Y.
(i) Show that Y⊥ is a closed subspace of X*, and that the closure
of Y is equal to {x ∈ X : λ(x) = 0 for all λ ∈ Y⊥}.
(ii) Show that Y* is isomorphic to X*/Y⊥.
From Theorem 1.5.10, every normed vector space can be identified
with a subspace of its double dual (and every Banach space is
identified with a closed subspace of its double dual). If ι is surjective,
then we have an isomorphism X ≅ (X*)*, and X is said to be reflexive.
Not every Banach space is reflexive. Consider for instance the
subspace c(N) of ℓ∞(N)
consisting of bounded convergent sequences (equivalently, this is the
space spanned by c₀(N) and the constant sequence (1)_{n∈N}). The limit
functional (a_n)_{n=1}^∞ ↦ lim_{n→∞} a_n is a bounded linear functional on
c(N), with operator norm 1, and thus by the Hahn-Banach theorem
can be extended to a generalised limit functional Λ : ℓ∞(N) → C
which is a continuous linear functional of operator norm 1. As such
generalised limit functionals annihilate all of c₀(N) but are still non-trivial,
they do not correspond to any element of ℓ¹(N) ≅ c₀(N)*.
Exercise 1.5.16. Let Λ : ℓ∞(N) → C be a generalised limit functional
(i.e. a continuous extension of the limit functional on c(N) of operator
norm 1) which is also an algebra homomorphism, i.e. Λ((x_n y_n)_{n=1}^∞) =
Λ((x_n)_{n=1}^∞) Λ((y_n)_{n=1}^∞) for all sequences (x_n)_{n=1}^∞, (y_n)_{n=1}^∞ ∈ ℓ∞(N).
Show that there exists a unique non-principal ultrafilter p ∈ βN∖N
(as defined for instance in Section 1.5 of Structure and Randomness)
such that Λ((x_n)_{n=1}^∞) = lim_{n→p} x_n for all sequences (x_n)_{n=1}^∞ ∈ ℓ∞(N).
Conversely, show that every non-principal ultrafilter generates a generalised
limit functional that is also an algebra homomorphism. (This
exercise may become easier once one is acquainted with the Stone-Čech
compactification βN of N.)
Theorem 1.5.10 gives a classification of sorts for normed vector
spaces:
Corollary 1.5.11. Every normed vector space X is isomorphic to
a subspace of BC(Y ), the space of bounded continuous functions on
some bounded complete metric space Y , with the uniform norm.
Proof. Take Y to be the unit ball in X*; this is a bounded complete
metric space (with the metric induced from the norm on X*). Each
x ∈ X induces a function ι(x) : Y → R (or C) by the formula (1.58);
this function is bounded and continuous on Y, and by Exercise 1.5.14
the map x ↦ ι(x) is an isometry of X into BC(Y). The claim follows.
Next, we improve Exercise 1.5.9:
Theorem 1.5.13. Let T : X → Y be a continuous linear transformation
between normed vector spaces. Then ‖T*‖_op = ‖T‖_op; thus
the transpose operation is an isometric embedding of B(X → Y) into
B(Y* → X*).
Proof. By Exercise 1.5.9, it suffices to show that ‖T*‖_op ≥ ‖T‖_op.
Accordingly, let α be any number strictly less than ‖T‖_op; then we can
find x ∈ X such that ‖Tx‖_Y ≥ α‖x‖_X. By Exercise 1.5.14 we can then
find λ ∈ Y* such that ‖λ‖_Y* = 1 and λ(Tx) = T*λ(x) = ‖Tx‖_Y ≥
α‖x‖_X, and thus ‖T*λ‖_X* ≥ α. This implies that ‖T*‖_op ≥ α; taking
suprema over all α strictly less than ‖T‖_op we obtain the claim.
If we identify X and Y with subspaces of (X*)* and (Y*)* respectively,
we thus see that T** : (X*)* → (Y*)* is an extension of T : X → Y
with the same operator norm. In particular, if X and Y are reflexive,
we see that T** can be identified with T itself.
Given two metric spaces (X, d_X) and (Y, d_Y), one can form their
product X × Y, with the metric
d((x, y), (x′, y′)) := max(d_X(x, x′), d_Y(y, y′)).
(One can also pick slightly different metrics here, such as d_X(x, x′) +
d_Y(y, y′); but these metrics are equivalent to each other, and so generate
the same notions of convergence, continuity, and so forth.)
There is however an alternate approach to defining these concepts,
which takes the concept of an open set as a primitive, rather
than the distance function, and defines other terms in terms of open
sets. For instance:
Exercise 1.6.1. Let (X, d) be a metric space.
(i) Show that a sequence x_n of points in X converges to a limit
x ∈ X if and only if every open neighbourhood of x (i.e. an
open set containing x) contains x_n for all sufficiently large
n.
(ii) Show that a point x is an adherent point of a set E if and
only if every open neighbourhood of x intersects E.
(iii) Show that a set E is closed if and only if its complement is
open.
(iv) Show that the closure of a set E is the intersection of all the
closed sets containing E.
(v) Show that a set E is dense if and only if every non-empty
open set intersects E.
(vi) Show that the interior of a set E is the union of all the open
sets contained in E, and that x is an interior point of E if
and only if some neighbourhood of x is contained in E.
In the next section we will adopt this "open sets first" perspective
when defining topological spaces.
On the other hand, there are some other properties of subsets
of a metric space which require the metric structure more fully, and
cannot be defined purely in terms of open sets (see e.g. Example
1.6.24), although some of these concepts can still be defined using a
structure intermediate to metric spaces and topological spaces, such
as a uniform space. For instance:
Definition 1.6.7. Let (X, d) be a metric space.
• A sequence (x_n)_{n=1}^∞ of points in X is a Cauchy sequence if
d(x_n, x_m) → 0 as n, m → ∞ (i.e. for every ε > 0 there
exists N > 0 such that d(x_n, x_m) ≤ ε for all n, m ≥ N).
• A space X is complete if every Cauchy sequence is convergent.
• A set E in X is bounded if it is contained inside a ball.
• A set E is totally bounded in X if for every ε > 0, E can be
covered by finitely many balls of radius ε.
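As a quick numerical illustration of the Cauchy property (our example, not the text's), the partial sums x_n = Σ_{k=1}^n 1/k² satisfy |x_n − x_m| ≤ 1/min(n, m), and so cluster together as n, m → ∞:

```python
def x(n):
    # n-th partial sum of sum 1/k^2 (which converges to pi^2/6).
    return sum(1.0 / k**2 for k in range(1, n + 1))

# Tail estimate: for n >= m, x(n) - x(m) = sum_{k=m+1}^n 1/k^2 <= 1/m,
# so (x(n)) is a Cauchy sequence in R.
for m in range(10, 200, 37):
    for n in range(m, 200, 23):
        assert abs(x(n) - x(m)) <= 1.0 / min(n, m)
```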
1.6. Point set topology 79
Exercise 1.6.2. Show that any metric space X can be identified with
a dense subspace of a complete metric space X̄, known as a metric
completion or Cauchy completion of X. (For instance, R is a metric
completion of Q.) (Hint: one can define a real number to be an
equivalence class of Cauchy sequences of rationals. Once the reals are
defined, essentially the same construction works in arbitrary metric
spaces.) Furthermore, if X′ is another metric completion of X, show
that there is an isometry between X̄ and X′ which is the identity on X.
Theorem 1.6.8 (Heine-Borel theorem). Let (X, d) be a metric space.
Then the following are equivalent:
(i) X is sequentially compact: every sequence in X has a convergent
subsequence.
(ii) X is compact: every open cover (V_α)_{α∈A} of X (i.e. a collection
of open sets V_α whose union contains X) has a finite subcover.
(iii) X has the finite intersection property: if (F_α)_{α∈A} is a collection of
closed subsets of X such that any finite subcollection of sets
has non-empty intersection, then the entire collection has
non-empty intersection.
(iv) X is complete and totally bounded.
Proof. ((ii) ⟹ (i)) If there was an infinite sequence x_n with no
convergent subsequence, then given any point x in X there must exist
an open ball centred at x which contains x_n for only finitely many
n (since otherwise one could easily construct a subsequence of x_n
converging to x). By (ii), one can cover X with a finite number of
such balls. But then the sequence x_n would be finite, a contradiction.
((i) ⟹ (iv)) If X was not complete, then there would exist a
Cauchy sequence which is not convergent; one easily shows that this
sequence cannot have any convergent subsequences either, contradicting
(i). If X was not totally bounded, then there exists ε > 0 such
that X cannot be covered by any finite collection of balls of radius ε;
a standard greedy algorithm argument then gives a sequence x_n such
that d(x_n, x_m) ≥ ε for all distinct n, m. This sequence clearly has no
convergent subsequence, again a contradiction.
((ii) ⟺ (iii)) This follows from de Morgan's laws and Exercise
1.6.1(iii).
((iv) ⟹ (iii)) Let (F_α)_{α∈A} be as in (iii). Call a set E in X
rich if it intersects all of the F_α; since the F_α may be augmented by
all their finite intersections without affecting either the hypothesis or
the conclusion of (iii), we may assume that the collection of the F_α is
closed under finite intersections, and the hypothesis of (iii) then says
that X is rich. Observe that if E is rich and X is covered by finitely
many balls, then E ∩ B must be rich for at least one ball B in the
cover (otherwise, for each ball B_i in the cover one could find an index
α_i with E ∩ B_i ∩ F_{α_i} empty; but the intersection of the F_{α_i} is one of
the F_α, and any point of E in this set would have to lie in some B_i,
a contradiction). Using the total boundedness of X repeatedly, we
may thus find balls B₁, B₂, . . . with B_n of radius 2^{-n}, such that
B₁ ∩ … ∩ B_n is rich (and in particular non-empty) for every n. The
centres of the B_n then form a Cauchy sequence, which by completeness
converges to a limit x; by construction, x lies within distance O(2^{-n})
of each F_α for every n, and hence (as the F_α are closed) lies in every
F_α. Thus the entire collection has non-empty intersection, as desired.
Exercise 1.6.5. Let f : X → Y be a function between two metric
spaces. Show that the following are equivalent:
• (Metric continuity) For every x ∈ X and every ε > 0 there exists
δ > 0 such that d_Y(f(x), f(x′)) ≤ ε whenever d_X(x, x′) ≤ δ.
• (Sequential continuity) For every sequence x_n ∈ X that converges
to a limit x ∈ X, f(x_n) converges to f(x).
• (Topological continuity) The inverse image f^{-1}(V) of every
open set V in Y, is an open set in X.
• The inverse image f^{-1}(F) of every closed set F in Y, is a
closed set in X.
A function f obeying any one of the properties in Exercise 1.6.5
is known as a continuous map.
Exercise 1.6.6. Let X, Y, Z be metric spaces, and let f : X → Y
and g : X → Z be maps. Show that the combined map
f ⊕ g : X → Y × Z defined by f ⊕ g(x) := (f(x), g(x)) is continuous
if and only if f and g are continuous. Show also that the projection
maps π_Y : Y × Z → Y, π_Z : Y × Z → Z defined by π_Y(y, z) := y,
π_Z(y, z) := z are continuous.
Exercise 1.6.7. Show that the image of a compact set under a con-
tinuous map is again compact.
1.6.2. Topological spaces. Metric spaces capture many of the notions
of convergence and continuity that one commonly uses in real
analysis, but there are several such notions (e.g. pointwise convergence,
semi-continuity, or weak convergence) in the subject that turn
out to not be modeled by metric spaces. A very useful framework
to handle these more general modes of convergence and continuity
is that of a topological space, which one can think of as an abstract
generalisation of a metric space in which the metric and balls are
forgotten, and the open sets become the central object⁹.
Definition 1.6.10 (Topological space). A topological space X =
(X, T) is a set X, together with a collection T of subsets of X, known
as open sets, which obey the following axioms:
• ∅ and X are open.
• The intersection of any finite number of open sets is open.
• The union of any arbitrary number of open sets is open.
The collection T is called a topology on X.
Given two topologies T, T′ on the same space X, we say that T′ is a
finer (or stronger) topology than T (or equivalently, that T is a coarser
(or weaker) topology than T′), if T ⊂ T′ (informally, T′ has
more open sets than T).
⁹There are even more abstract notions, such as pointless topological spaces, in
which the collection of open sets has become an abstract lattice, in the spirit of Section
2.3, but we will not need such notions in this course.
Example 1.6.11. Every metric space (X, d) generates a topology T_d,
namely the space of sets which are open with respect to the metric
d. Observe that if two metrics d, d′ on X are equivalent in the sense
that cd(x, y) ≤ d′(x, y) ≤ Cd(x, y) for all x, y in X and some constants
c, C > 0, then they generate identical topologies.
Example 1.6.12. The finest (or strongest) topology on any set X is
the discrete topology 2^X = {E : E ⊂ X}, in which every set is open;
this is the topology generated by the discrete metric (Example 1.6.5).
The coarsest (or weakest) topology is the trivial topology {∅, X}, in
which only the empty set and the full set are open.
Example 1.6.13. Given any collection F of sets of X, we can define
the topology T[F] generated by F to be the intersection of all the
topologies that contain F; this is easily seen to be the coarsest topology
that makes all the sets in F open. For instance, the topology
generated by a metric space is the same as the topology generated by
its open balls.
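On a finite set the generated topology can be computed directly, by closing the generating collection under finite intersections and unions (a small sketch of ours, not the text's):

```python
from itertools import combinations

def generate_topology(X, gens):
    """Smallest topology on the finite set X containing the sets in gens."""
    opens = {frozenset(), frozenset(X)} | {frozenset(g) for g in gens}
    changed = True
    while changed:  # on a finite set, closing under pairwise unions/intersections suffices
        changed = False
        for A, B in list(combinations(opens, 2)):
            for C in (A | B, A & B):
                if C not in opens:
                    opens.add(C)
                    changed = True
    return opens

X = {1, 2, 3}
T = generate_topology(X, [{1, 2}, {2, 3}])
# The generators force {2} = {1,2} & {2,3} to be open as well.
assert frozenset({2}) in T
assert all((A | B) in T and (A & B) in T for A in T for B in T)
```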
Example 1.6.14. If (X, T) is a topological space, and Y is a subset
of X, then we can define the relative topology T|_Y := {E ∩ Y : E ∈ T}
to be the collection of all open sets in X, restricted to Y; this makes
(Y, T|_Y) a topological space, known as a subspace of (X, T).
Any notion in metric space theory which can be defined purely in
terms of open sets, can now be defined for topological spaces. Thus
for instance:
Definition 1.6.15. Let (X, T) be a topological space.
• A sequence x_n of points in X converges to a limit x ∈ X if
and only if every open neighbourhood of x (i.e. an open set
containing x) contains x_n for all sufficiently large n. In this
case we write x_n → x in the topological space (X, T), and
(if x is unique) we write x = lim_{n→∞} x_n.
• A point is a sequentially adherent point of a set E if it is the
limit of some sequence in E.
• A point x is an adherent point of a set E if and only if every
open neighbourhood of x intersects E.
• The set of all adherent points of E is called the closure of
E and is denoted E̅.
• A set E is closed if and only if its complement is open, or
equivalently if it contains all its adherent points.
• A set E is dense if and only if every non-empty open set
intersects E, or equivalently if its closure is X.
• The interior of a set E is the union of all the open sets
contained in E, and x is called an interior point of E if and
only if some neighbourhood of x is contained in E.
• A space X is sequentially compact if every sequence has a
convergent subsequence.
• A space X is compact if every open cover has a finite subcover.
• The concepts of being σ-compact, locally compact, and precompact
can be defined as before. (One could also define
sequential σ-compactness, etc., but these notions are rarely
used.)
• A map f : X → Y between topological spaces is sequentially
continuous if whenever x_n converges to a limit x in X, f(x_n)
converges to a limit f(x) in Y.
• A map f : X → Y between topological spaces is continuous
if the inverse image of every open set is open.
Remark 1.6.16. The stronger a topology becomes, the more open
and closed sets it will have; on the other hand, fewer sequences will
converge, there are fewer (sequentially) adherent points and (sequentially)
compact sets, closures become smaller, and interiors become larger.
There will be more (sequentially) continuous functions on this space,
but fewer (sequentially) continuous functions into the space. Note also
that the identity map from a space X with one topology T to the same
space X with a different topology T′ is continuous precisely when T
is finer than T′.
Example 1.6.17. In a metric space, these topological notions coin-
cide with their metric counterparts, and sequential compactness and
compactness are equivalent, as are sequential continuity and continu-
ity.
Exercise 1.6.8 (Urysohn's subsequence principle). Let x_n be a sequence
in a topological space X, and let x be another point in X.
Show that the following are equivalent:
• x_n converges to x.
• Every subsequence of x_n converges to x.
• Every subsequence of x_n has a further subsequence that converges
to x.
Exercise 1.6.9. Show that every sequentially adherent point is an
adherent point, and every continuous function is sequentially contin-
uous.
Remark 1.6.18. The converses to Exercise 1.6.9 are unfortunately
not always true in general topological spaces. For instance, if we endow
an uncountable set X with the cocountable topology (so that a
set is open if it is either empty, or its complement is at most countable)
then we see that the only convergent sequences are those which
are eventually constant. Thus, every subset of X contains its sequentially
adherent points, and every function from X to another
topological space is sequentially continuous, even though not every
set in X is closed and not every function on X is continuous. An example
of a set which is sequentially compact but not compact is the
first uncountable ordinal with the order topology (Exercise 1.6.10).
It is more tricky to give an example of a compact space which is
not sequentially compact; this will have to wait until we establish
Tychonoff's theorem (Theorem 1.8.14). However one can fix this
discrepancy between the sequential and non-sequential concepts by
replacing sequences with the more general notion of nets; see Section
1.6.3.
Remark 1.6.19. Metric space concepts such as boundedness, completeness,
Cauchy sequences, and uniform continuity do not have
counterparts for general topological spaces, because they cannot be
defined purely in terms of open sets. (They can however be extended
to some other types of spaces, such as uniform spaces or coarse
spaces.)
Now we give some important topologies that capture certain
modes of convergence or continuity that are difficult or impossible
to capture using metric spaces alone.
Example 1.6.20 (Zariski topology). This topology is important in
algebraic geometry, though it will not be used in this course. If F
is an algebraically closed field, we define the Zariski topology on the
vector space Fⁿ to be the topology generated by the complements
of proper algebraic varieties in Fⁿ; thus a set is Zariski open if it is
either empty, or is the complement of a finite union of proper algebraic
varieties. A set in Fⁿ is then Zariski dense if it is not contained in
any proper subvariety, and the Zariski closure of a set is the smallest
algebraic variety that contains that set.
Example 1.6.21 (Order topology). Any totally ordered set (X, <)
generates the order topology, defined as the topology generated by the
sets {x ∈ X : x > a} and {x ∈ X : x < a} for all a ∈ X. In particular,
the extended real line [−∞, +∞] can be given the order topology,
and the notion of convergence of sequences in this topology to either
finite or infinite limits is identical to the notion one is accustomed
to in undergraduate real analysis. (On the real line, of course, the
order topology corresponds to the usual topology.) Also observe that
a function n ↦ x_n from the extended natural numbers N ∪ {+∞}
(with the order topology) into a topological space X is continuous if
and only if x_n → x_{+∞} as n → ∞, so one can interpret convergence
of sequences as a special case of continuity.
Exercise 1.6.10. Let ω₁ be the first uncountable ordinal, endowed
with the order topology. Show that ω₁ is sequentially compact (Hint:
every sequence has a lim sup), but not compact (Hint: every point
has a countable neighbourhood).
Example 1.6.22 (Half-open topology). The right half-open topology
T_r on the real line R is the topology generated by the right half-open
intervals [a, b) for −∞ < a < b < ∞; this is a bit finer than the usual
topology on R. Observe that a sequence x_n converges to a limit x in
the right half-open topology if and only if it converges in the ordinary
topology T, and also x_n ≥ x for all sufficiently large n. Observe
that a map f : R → R is right-continuous iff it is a continuous map
from (R, T_r) to (R, T). One can of course model left-continuity via
a suitable left half-open topology in a similar fashion.
Example 1.6.23 (Upper topology). The upper topology T_u on the
real line is defined as the topology generated by the sets (a, +∞) for all
a ∈ R. Observe that (somewhat confusingly), a function f : R → R
is lower semi-continuous iff it is continuous from (R, T) to (R, T_u).
One can of course model upper semi-continuity via a suitable lower
topology in a similar fashion.
Example 1.6.24 (Product topology). Let Y^X be the space of all
functions f : X → Y from a set X to a topological space Y. We
define the product topology on Y^X to be the topology generated by
the sets {f ∈ Y^X : f(x) ∈ V} for all x ∈ X and all open V ⊂ Y.
Observe that a sequence of functions f_n : X → Y converges pointwise
to a limit f : X → Y iff it converges in the product topology. We will
study the product topology in more depth in Section 1.8.3.
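For instance (a numerical sketch of ours), the functions f_n(x) = xⁿ on [0, 1] converge in the product topology on R^{[0,1]} — i.e. pointwise — to the function that vanishes on [0, 1) and equals 1 at x = 1, even though the convergence is not uniform:

```python
def f(n, x):
    return x ** n

def limit(x):
    # Pointwise limit of x^n on [0, 1].
    return 1.0 if x == 1.0 else 0.0

# For each fixed x the values f(n, x) approach limit(x)...
for x in [0.0, 0.3, 0.9, 0.99, 1.0]:
    assert abs(f(10_000, x) - limit(x)) < 1e-6

# ...but no single n works for all x at once: near x = 1 the error stays large.
n = 10_000
assert f(n, 1.0 - 1.0 / (10 * n)) > 0.9
```

This is one way to see that pointwise convergence on an uncountable index set is genuinely weaker than uniform convergence.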
Example 1.6.25 (Product topology, again). If (X, T_X) and (Y, T_Y)
are two topological spaces, we can define the product space (X ×
Y, T_X × T_Y) to be the Cartesian product X × Y with the topology
generated by the product sets U × V, where U and V are open in X
and Y respectively. Observe that two functions f : Z → X, g : Z → Y
from a topological space Z are continuous if and only if their direct
sum f ⊕ g : Z → X × Y is continuous in the product topology, and also
that the projection maps π_X : X × Y → X and π_Y : X × Y → Y are
continuous (cf. Exercise 1.6.6).
We mention that not every topological space can be generated
from a metric; those topological spaces whose topology does arise
from a metric are called metrisable. One important obstruction to
metrisability arises from the Hausdorff property:
Definition 1.6.26. A topological space X is said to be a Hausdorff
space if for any two distinct points x, y in X, there exist disjoint
neighbourhoods V_x, V_y of x and y respectively.
Example 1.6.27. Every metric space is Hausdorff (one can use the
open balls B(x, d(x, y)/2) and B(y, d(x, y)/2) as the separating neighbourhoods).
On the other hand, the trivial topology (Example 1.6.12)
on two or more points is not Hausdorff, and neither is the cocountable
topology (Remark 1.6.18) on an uncountable set, or the upper
topology (Example 1.6.23) on the real line. Thus, these topologies do
not arise from a metric.
Exercise 1.6.11. Show that the half-open topology (Example 1.6.22)
is Hausdorff, but does not arise from a metric. (Hint: assume for
contradiction that the half-open topology did arise from a metric;
then show that for every real number x there exists a rational number
q and a positive integer n such that the ball of radius 1/n centred at
q has infimum x.) Thus there are more obstructions to metrisability
than just the Hausdorff property; a more complete answer is provided
by Urysohn's metrisation theorem (Theorem 2.5.7).
Exercise 1.6.12. Show that in a Hausdorff space, any sequence can
have at most one limit. (For a more precise statement, see Exercise
1.6.16 below.)
A homeomorphism (or topological isomorphism) between two topological
spaces is a continuous invertible map f : X → Y whose inverse
f^{-1} : Y → X is also continuous. Such a map identifies the topology
on X with the topology on Y, and so any topological concept of X
will be preserved by f to the corresponding topological concept of
Y. For instance, X is compact if and only if Y is compact, X is
Hausdorff if and only if Y is Hausdorff, x is adherent to E if and only
if f(x) is adherent to f(E), and so forth. When there is a homeomorphism
between two topological spaces, we say that X and Y are
homeomorphic (or topologically isomorphic).
Example 1.6.28. The tangent function is a homeomorphism between
(−π/2, π/2) and R (with the usual topologies), and thus preserves
all topological structures on these two spaces. Note however
that the former space is bounded as a metric space while the latter is
not, and the latter is complete while the former is not. Thus metric
properties such as boundedness or completeness are not purely topological
properties, since they are not preserved by homeomorphisms.
1.6.3. Nets (optional). A sequence (x_n)_{n=1}^∞ in a space X can be
viewed as a function from the natural numbers N to X. We can
generalise this concept as follows.
Definition 1.6.29 (Nets). A net in a space X is a tuple (x_α)_{α∈A},
where A = (A, <) is a directed set (i.e. a partially ordered set such
that any two elements have at least one upper bound), and x_α ∈ X
for each α ∈ A. We say that a statement P(α) holds for sufficiently
large α in a directed set A if there exists β ∈ A such that P(α) holds
for all α ≥ β. (Note in particular that if P(α) and Q(α) separately
hold for sufficiently large α, then their conjunction P(α) ∧ Q(α) also
holds for sufficiently large α.)
A net (x_α)_{α∈A} in a topological space X is said to converge to a
limit x ∈ X if for every neighbourhood V of x, we have x_α ∈ V for
all sufficiently large α.
A subnet of a net (x_α)_{α∈A} is a tuple of the form (x_{φ(β)})_{β∈B}, where
(B, <) is another directed set, and φ : B → A is a monotone map
(thus φ(β′) ≥ φ(β) whenever β′ ≥ β) whose image is cofinal in A (i.e.
for every α ∈ A there exists β ∈ B with φ(β) ≥ α).
If we adjoin to a directed set A a maximal element +∞, declaring
the tails {α : α ≥ β} ∪ {+∞} to be the neighbourhoods of +∞, then
a net (x_α)_{α∈A} in a topological space X converges
to a limit x_{+∞} if and only if the function α ↦ x_α is continuous on
A ∪ {+∞} (cf. Example 1.6.21). Also, if (x_{φ(β)})_{β∈B} is a subnet of
(x_α)_{α∈A}, then φ is a continuous map from B ∪ {+∞} to A ∪ {+∞}, if
we adopt the convention that φ(+∞) = +∞. In particular, a subnet
of a convergent net remains convergent to the same limit.
The point of working with nets instead of sequences is that one no
longer needs to worry about the distinction between sequential and
non-sequential concepts in topology, as the following exercises show:
Exercise 1.6.13. Let X be a topological space, let E be a subset of
X, and let x be an element of X. Show that x is an adherent point of
E if and only if there exists a net (x_α)_{α∈A} in E that converges to x.
(Hint: take A to be the directed set of neighbourhoods of x, ordered
by reverse set inclusion.)
Exercise 1.6.14. Let f : X → Y be a map between two topological
spaces. Show that f is continuous if and only if for every net (x_α)_{α∈A}
in X that converges to a limit x, the net (f(x_α))_{α∈A} converges in Y
to f(x).
Exercise 1.6.15. Let X be a topological space. Show that X is
compact if and only if every net has a convergent subnet. (Hint:
equate both properties of X with the finite intersection property, and
review the proof of Theorem 1.6.8.) Similarly, show that a subset E of
X is relatively compact if and only if every net in E has a subnet that
converges in X. (Note that as not every compact space is sequentially
compact, this exercise shows that we cannot enforce injectivity of φ
in the definition of a subnet.)
Exercise 1.6.16. Show that a space is Hausdorff if and only if every
net has at most one limit.
Exercise 1.6.17. In the product space Y^X in Example 1.6.24, show
that a net (f_α)_{α∈A} converges in Y^X to f ∈ Y^X if and only if for every
x ∈ X, the net (f_α(x))_{α∈A} converges in Y to f(x).
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/30.
Thanks to Franciscus Rebro, johan, Josh Zahl, Xiaochuan Liu, and
anonymous commenters for corrections.
An anonymous commenter pointed out that while the real line can
be viewed very naturally as the metric completion of the rationals,
this cannot quite be used to give a definition of the real numbers,
because the notion of a metric itself requires the real numbers in its
definition! However, K. P. Hart noted that Bourbaki resolves this
problem by defining the reals as the completion of the rationals as a
uniform space rather than as a metric space.
1.7. The Baire category theorem and its Banach
space consequences
The notion of what it means for a subset E of a space X to be "small"
varies from context to context. For instance, in measure theory, when
X = (X, A, μ) is a measure space, one useful notion of a "small" set
is that of a null set: a set E of measure zero (or at least contained in
a set of measure zero). By countable additivity, countable unions of
null sets are null. Taking contrapositives, we obtain
Lemma 1.7.1 (Pigeonhole principle for measure spaces). Let E₁, E₂, . . .
be an at most countable sequence of measurable subsets of a measure
space X. If ⋃_n E_n has positive measure, then at least one of the E_n
has positive measure.
Now suppose that X was a Euclidean space R^d with Lebesgue
measure m. The Lebesgue differentiation theorem easily implies that
having positive measure is equivalent to being "dense" in certain balls:
Proposition 1.7.2. Let E be a measurable subset of R^d. Then the
following are equivalent:
• E has positive measure.
• For any ε > 0, there exists a ball B such that m(E ∩ B) ≥
(1 − ε)m(B).
Thus one can think of a null set as a set which is "nowhere dense"
in some measure-theoretic sense.
It turns out that there are analogues of these results when the
measure space X = (X, A, ) is replaced instead by a complete metric
space X = (X, d). Here, the appropriate notion of a small set is
not a null set, but rather that of a nowhere dense set: a set E which
is not dense in any ball, or equivalently a set whose closure has empty
interior. (A good example of a nowhere dense set would be a proper
subspace, or smooth submanifold, of R^d, or a Cantor set; on the
other hand, the rationals are a dense subset of R and thus clearly not
nowhere dense.) We then have the following important result:
Theorem 1.7.3 (Baire category theorem). Let E_1, E_2, … be an at
most countable sequence of subsets of a complete metric space X. If
⋃_n E_n contains a ball B, then at least one of the E_n is dense in a
sub-ball B′ of B.

For instance, the set

⋃_{n=1}^∞ (q_n − 2^{−n}, q_n + 2^{−n}),

where q_1, q_2, … is an enumeration of the rationals, is open and dense,
but has Lebesgue measure at most 2; thus its complement has infinite
measure in R but is nowhere dense (hence meager). As a variant of
this, the set
this, the set
(1.62)
m=1
_
n=1
(q
n
2
n
/m, q
n
+ 2
n
/m),
is a null set, but is the intersection of countably many open dense
sets and is thus co-meager.
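As a quick numerical sanity check of the measure bound (an illustration only, not part of the text's argument), one can sum the interval lengths:

```python
# By countable subadditivity, the open dense set U = union of the intervals
# (q_n - 2^{-n}, q_n + 2^{-n}) has Lebesgue measure at most the sum of the
# interval lengths, which is a geometric series summing to 2.
total_length = sum(2 * 2.0 ** (-n) for n in range(1, 60))
print(total_length)  # -> 2.0
```

Note that this bound is independent of which enumeration q_1, q_2, … of the rationals is used.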
1.7. The Baire category theorem 95
Exercise 1.7.3. A real number x is Diophantine if for every ε > 0
there exists c > 0 such that |x − a/q| ≥ c/|q|^{2+ε} for every rational number
a/q. Show that the set of Diophantine real numbers has full measure
but is meager.
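As a numerical illustration (not a proof, and the constant 1/3 is our own convenient choice), one can check that √2 obeys such a bound with exponent exactly 2:

```python
import math

# For each denominator q, the best rational approximation a/q to sqrt(2)
# uses a = round(q*sqrt(2)); we check that q^2 * |sqrt(2) - a/q| stays
# bounded away from zero, so |sqrt(2) - a/q| >= c/q^2 with c = 1/3 here.
x = math.sqrt(2)
worst = min(abs(x - round(q * x) / q) * q * q for q in range(1, 2001))
print(worst > 1 / 3)  # -> True
```

In fact √2, like every quadratic irrational, is badly approximable, which is stronger than the Diophantine property.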
Remark 1.7.4. If one assumes some additional axioms of set theory
(e.g. the continuum hypothesis), it is possible to show that the col-
lection of meager subsets of R and the collection of null subsets of R
(viewed as σ-ideals of the collection of all subsets of R) are isomor-
phic; this is the Sierpiński-Erdős theorem, which we will not prove
here. Roughly speaking, this theorem tells us that any effective
first-order statement which is true about meager sets will also be true
about null sets, and conversely.
1.7.1. The uniform boundedness principle. As mentioned in
the introduction, the Baire category theorem implies various equiva-
lences between qualitative and quantitative properties of linear trans-
formations between Banach spaces. Note that Lemma 1.16 already
gave a prototypical such equivalence between a qualitative property
(continuity) and a quantitative one (boundedness).
Theorem 1.7.5 (Uniform boundedness principle). Let X be a Ba-
nach space, let Y be a normed vector space, and let (T_α)_{α∈A} be a
family of continuous linear operators T_α : X → Y. Then the following
are equivalent:
(i) (Pointwise boundedness) For every x ∈ X, the set {T_α x : α ∈ A} is bounded.
(ii) (Uniform boundedness) The operator norms {‖T_α‖_op : α ∈ A} are bounded.
The uniform boundedness principle is also known as the Banach-
Steinhaus theorem.
Proof. It is clear that (ii) implies (i); now assume (i) holds and let
us obtain (ii).

For each n = 1, 2, …, let E_n be the set

(1.63) E_n := {x ∈ X : ‖T_α x‖_Y ≤ n for all α ∈ A}.

The hypothesis (i) is nothing more than the assertion that the E_n
cover X, and thus by the Baire category theorem at least one of the E_n
must be dense in some ball B(x_0, r). Since the T_α are continuous,
this E_n is closed, and therefore contains B(x_0, r). Writing any x with
‖x‖_X ≤ r as the difference of two elements of B(x_0, r), we see that
‖T_α x‖_Y ≤ 2n whenever ‖x‖_X ≤ r; thus the operator norms ‖T_α‖_op
are bounded by 2n/r, and the claim follows.
Exercise 1.7.4. Give counterexamples to show that the uniform
boundedness principle fails if one relaxes the assumptions in any of the
following ways:
X is merely a normed vector space rather than a Banach
space (i.e. completeness is dropped).
The T_α are not assumed to be continuous.

Suppose for contradiction that the operator norms ‖T_α‖_op are
unbounded. We can then find a sequence α_n ∈ A such that

‖T_{α_{n+1}}‖_op > 100^n ‖T_{α_n}‖_op

(say) for all n. We can then find unit
vectors x_n such that ‖T_{α_n} x_n‖_Y ≥ (1/2) ‖T_{α_n}‖_op.
We can then form the absolutely convergent (and hence condi-
tionally convergent, by completeness) sum x = Σ_{n=1}^∞ ε_n 10^{−n} x_n for
some choice of signs ε_n = ±1, chosen recursively as follows: once
ε_1, …, ε_{n−1} have been chosen, choose the sign ε_n so that

(1.64) ‖Σ_{m=1}^n ε_m 10^{−m} T_{α_n} x_m‖_Y ≥ ‖10^{−n} T_{α_n} x_n‖_Y ≥ (1/2) 10^{−n} ‖T_{α_n}‖_op.

From the triangle inequality we soon conclude that

(1.65) ‖T_{α_n} x‖_Y ≥ (1/4) 10^{−n} ‖T_{α_n}‖_op.

But by hypothesis, the right-hand side of (1.65) is unbounded in n,
contradicting (i).
A common way to apply the uniform boundedness principle is via
the following corollary:
Corollary 1.7.7 (Uniform boundedness principle for norm conver-
gence). Let X be a Banach space, let Y be a normed vector space, and
let (T_n)_{n=1}^∞ be a family of continuous linear operators T_n : X → Y.
Then the following are equivalent:
(i) (Pointwise convergence) For every x ∈ X, T_n x converges
strongly in Y as n → ∞.
(ii) (Pointwise convergence to a continuous limit) There exists
a continuous linear T : X → Y such that for every x ∈ X,
T_n x converges strongly in Y to Tx as n → ∞.
(iii) (Uniform boundedness + dense subclass convergence) The
operator norms {‖T_n‖_op : n = 1, 2, …} are bounded, and for a
dense set of x in X, T_n x converges strongly in Y as n → ∞.
Proof. Clearly (ii) implies (i), and as convergent sequences are bounded,
we see from Theorem 1.7.5 that (i) implies (iii). The deduction of
(ii) from (iii) follows by a standard limiting argument and is left as
an exercise.
Remark 1.7.8. The same equivalences hold if one replaces the se-
quence (T_n)_{n=1}^∞ by a net (T_α)_{α∈A}.
Example 1.7.9 (Fourier inversion formula). For any f ∈ L²(R) and
N > 0, define the Dirichlet summation operator

(1.66) S_N f(x) := ∫_{−N}^{N} f̂(ξ) e^{2πixξ} dξ,

where f̂ is the Fourier transform of f, defined on smooth compactly
supported functions f ∈ C_0^∞(R) by the formula

f̂(ξ) := ∫_R f(x) e^{−2πixξ} dx

and then extended to L² by the Plancherel theorem (see Section 1.12).
Using the Plancherel identity, we can verify that the operator norms
‖S_N‖_op are uniformly bounded (indeed, they are all 1); also, one can
check that for f ∈ C_0^∞(R), S_N f converges in L² norm to f as
N → ∞. As C_0^∞(R) is known to be dense in L²(R), this implies that
S_N f converges in L² norm to f for every f ∈ L²(R).
This argument only used the easy implication of Corollary
1.7.7, namely the deduction of (ii) from (iii). The hard implication
using the Baire category theorem was not directly utilised. However,
from a metamathematical standpoint, that implication is important
because it tells us that the above strategy to prove convergence in
norm of the Fourier inversion formula on L², namely to obtain uniform
operator norms on the partial sums and to establish convergence on a
dense subclass of nice functions, is in some sense the only strategy
available to prove such a result.
Remark 1.7.10. There is a partial analogue of Corollary 1.7.7 for
the question of pointwise almost everywhere convergence rather than
norm convergence, known as Stein's maximal principle (discussed for
instance in Section 1.9 of Structure and Randomness). For instance, it
reduces Carleson's theorem on the pointwise almost everywhere con-
vergence of Fourier series to the boundedness of a certain maximal
function (the Carleson maximal operator) related to Fourier summa-
tion, although the latter task is again quite non-trivial. (As in Ex-
ample 1.7.9, the role of the maximal principle is meta-mathematical
rather than direct.)
Remark 1.7.11. Of course, if we omit some of the hypotheses, it is
no longer true that pointwise boundedness and uniform boundedness
are the same. For instance, if we let c_c(N) be the space of complex
sequences with only finitely many non-zero entries, with the uni-
form topology, and let λ_n : c_c(N) → C be the map (a_m)_{m=1}^∞ ↦ n a_n,
then the λ_n are pointwise bounded but not uniformly bounded; thus
completeness of X is important. Also, even in the one-dimensional
case X = Y = R, the uniform boundedness principle can easily be
seen to fail if the T_n are permitted to be non-linear rather than linear.
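The pointwise-versus-uniform distinction in this remark can be checked concretely (an illustrative sketch; the dictionary encoding of finitely supported sequences is our own device):

```python
# lambda_n(a) = n * a_n on finitely supported sequences with the sup norm:
# each fixed a has finite support, so sup_n |lambda_n(a)| is finite
# (pointwise boundedness), yet the operator norm of lambda_n equals n.
def lam(n, a):
    # a finitely supported sequence, encoded as a dict {index: value}
    return n * a.get(n, 0.0)

a = {1: 1.0, 2: 0.5, 3: -2.0}                      # sup norm 2
pointwise = max(abs(lam(n, a)) for n in range(1, 1000))
print(pointwise)  # -> 6.0: finite, since only n = 1, 2, 3 contribute

# operator norm of lambda_n: attained on the unit vector e_n
op_norms = [abs(lam(n, {n: 1.0})) for n in range(1, 6)]
print(op_norms)  # -> [1.0, 2.0, 3.0, 4.0, 5.0], unbounded as n grows
```

The incomplete space of finitely supported sequences is exactly where the Baire category argument breaks down.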
… f = Σ_{n=1}^∞ f_n of elements in E (since one can iteratively approximate
the residual f − Σ_{n=1}^{N−1} f_n to arbitrary accuracy by an element of E
for N = 1, 2, 3, …), and the claim easily follows. So it suffices to show
that (iii) implies (iv).
For each n, let E_n ⊆ Y be the set of all f ∈ Y for which there
exists a solution to Lu = f with ‖u‖_X ≤ n‖f‖_Y. From the hypothesis
(iii), we see that ⋃_n E_n = Y. Since Y is complete, the Baire category
theorem implies that there is some E_n which is dense in some ball
B(f_0, r) in Y. In other words, the problem Lu = f is approximately
quantitatively solvable in the ball B(f_0, r), in the sense that for every
ε > 0 and every f ∈ B(f_0, r), there exists an approximate solution u
with ‖Lu − f‖_Y ≤ ε and ‖u‖_X ≤ n‖Lu‖_Y, and thus ‖u‖_X ≤ nr + nε.

By subtracting two such approximate solutions, we conclude that
for any f ∈ B(0, 2r) and any ε > 0, there exists u ∈ X with ‖Lu −
f‖_Y ≤ 2ε and ‖u‖_X ≤ 2nr + 2nε.

Since L is homogeneous, we can rescale and conclude that for any
f ∈ Y and any ε > 0 there exists u ∈ X with ‖Lu − f‖_Y ≤ 2ε and
‖u‖_X ≤ 2n‖f‖_Y + 2nε.

In particular, setting ε = (1/4)‖f‖_Y (treating the case f = 0 sepa-
rately), we conclude that for any f ∈ Y, we may write f = Lu + f′,
where ‖f′‖_Y ≤ (1/2)‖f‖_Y and ‖u‖_X ≤ (5/2) n ‖f‖_Y.

We can iterate this procedure and then take limits (now using
the completeness of X rather than Y) to obtain a solution to Lu = f
for every f ∈ Y with ‖u‖_X ≤ 5n‖f‖_Y, and the claim follows.
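For the record, the limiting step just described can be written out as a geometric series (the labels f^{(k)} and u_k here are ours):

```latex
Set $f^{(0)} := f$, and recursively write $f^{(k)} = L u_{k+1} + f^{(k+1)}$ with
\[
\|f^{(k)}\|_Y \le 2^{-k} \|f\|_Y, \qquad
\|u_{k+1}\|_X \le \tfrac{5}{2} n \|f^{(k)}\|_Y \le \tfrac{5}{2} n\, 2^{-k} \|f\|_Y.
\]
Then $u := \sum_{k \ge 1} u_k$ converges absolutely in the complete space $X$, with
\[
\|u\|_X \le \tfrac{5}{2} n \|f\|_Y \sum_{k \ge 0} 2^{-k} = 5 n \|f\|_Y,
\qquad
Lu = \lim_{K \to \infty} \bigl( f - f^{(K)} \bigr) = f.
\]
```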
Remark 1.7.13. The open mapping theorem provides metamathe-
matical justification for the method of a priori estimates for solving
linear equations such as Lu = f for a given datum f ∈ Y and for an
unknown u ∈ X, which is of course a familiar problem in linear PDE.
The a priori method assumes that f is in some dense class of nice
functions (e.g. smooth functions) in which solvability of Lu = f is
presumably easy, and then proceeds to obtain the a priori estimate
‖u‖_X ≤ C‖f‖_Y for some constant C. Theorem 1.7.12 then assures
that Lu = f is solvable for all f in Y (with a similar bound). As be-
fore, this implication does not directly use the Baire category theorem,
but that theorem helps explain why this method is not wasteful.
A pleasant corollary of the open mapping theorem is that, as with
ordinary linear algebra or with arbitrary functions, invertibility is the
same thing as bijectivity:
Corollary 1.7.14. Let T : X Y be a continuous linear operator
between two Banach spaces X, Y . Then the following are equivalent:
(Qualitative invertibility) T is bijective.
(Quantitative invertibility) T is bijective, and T^{−1} : Y → X
is a continuous (hence bounded) linear transformation.
Remark 1.7.15. The claim fails without the completeness hypothe-
ses on X and Y. For instance, consider the operator T : c_c(N) →
c_c(N) defined by T(a_n)_{n=1}^∞ := (a_n/n)_{n=1}^∞, where we give c_c(N) the
uniform norm. Then T is continuous and bijective, but T^{−1} is un-
bounded.
Exercise 1.7.5. Show that Corollary 1.7.14 can still fail if we drop
the completeness hypothesis on just X, or just Y .
Exercise 1.7.6. Suppose that L : X → Y is a surjective continu-
ous linear transformation between Banach spaces. By combining the
open mapping theorem with the Hahn-Banach theorem, show that
the transpose map L* : Y* → X* is bounded from below, i.e. there
exists c > 0 such that ‖L*λ‖_{X*} ≥ c‖λ‖_{Y*} for all λ ∈ Y*. Conclude
that L* is an isomorphism between Y* and L*(Y*).
Let L be as in Theorem 1.7.12, so that the problem Lu = f is
both qualitatively and quantitatively solvable. A standard applica-
tion of Zorn's lemma (similar to that used to prove the Hahn-Banach
theorem) shows that the problem Lu = f is also qualitatively lin-
early solvable, in the sense that there exists a linear transformation
S : Y → X such that LSf = f for all f ∈ Y (i.e. S is a right-inverse
of L). In view of the open mapping theorem, it is then tempting to
conjecture that L must also be quantitatively linearly solvable, in the
sense that there exists a continuous linear transformation S : Y → X
such that LSf = f for all f ∈ Y. By Corollary 1.7.14, we see that
this conjecture is true when the problem Lu = f is determined, i.e.
there is exactly one solution u for each datum f. Unfortunately, the
conjecture can fail when Lu = f is underdetermined (more than one
solution u for each f); we discuss this in Section 1.7.4. On the other
hand, the situation is much better for Hilbert spaces:
Exercise 1.7.7. Suppose that L : H → H′ is a surjective continuous
linear transformation between Hilbert spaces. Show that there exists
a continuous linear transformation S : H′ → H such that LS =
I. Furthermore, show that we can ensure that the range of S is
orthogonal to the kernel of L, and that this condition determines S
uniquely.
Remark 1.7.16. In fact, Hilbert spaces are essentially the only type
of Banach space for which we have this nice property, due to the
Lindenstrauss-Tzafriri solution [LiTz1971] of the complemented sub-
spaces problem.
Exercise 1.7.8. Let M and N be closed subspaces of a Banach space
X. Show that the following statements are equivalent:
(i) (Qualitative complementation) Every x in X can be ex-
pressed in the form m + n for m ∈ M, n ∈ N in exactly
one way.
(ii) (Quantitative complementation) Every x in X can be ex-
pressed in the form m + n for m ∈ M, n ∈ N in exactly one
way. Furthermore, there exists C > 0 such that ‖m‖_X, ‖n‖_X ≤
C‖x‖_X for all x.
When either of these two properties holds, we say that M (or N) is
a complemented subspace, and that N is a complement of M (or vice
versa).
The property of being complemented is closely related to that of
quantitative linear solvability:
Exercise 1.7.9. Let L : X → Y be a surjective continuous linear map
between Banach spaces. Show that there exists a bounded linear map
S : Y → X such that LSf = f for all f ∈ Y if and only if the kernel
{u ∈ X : Lu = 0} is a complemented subspace of X.
Exercise 1.7.10. Show that any finite-dimensional or finite co-dimensional
subspace of a Banach space is complemented.
Remark 1.7.17. The problem of determining whether a given closed
subspace of a Banach space is complemented or not is, in general,
quite difficult. However, non-complemented subspaces do exist in
abundance; some examples are given in the appendix, and the Lindenstrauss-
Tzafriri theorem [LiTz1971] asserts that any Banach space not iso-
morphic to a Hilbert space contains at least one non-complemented
subspace. There is also a remarkable construction of Gowers and
Maurey [Go1993] of a Banach space such that every subspace, other
than those ruled out by Exercise 1.7.10, is uncomplemented.
1.7.3. The closed graph theorem. Recall that a map T : X → Y
between two metric spaces is continuous if and only if, whenever x_n
converges to x in X, Tx_n converges to Tx in Y. We can also define
the weaker property of being closed: a map T : X → Y is closed if
and only if whenever x_n converges to x in X, and Tx_n converges to
a limit y in Y, then y is equal to Tx; equivalently, T is closed if its
graph {(x, Tx) : x ∈ X} is a closed subset of X × Y. This is weaker
than continuity because it only requires y = Tx when the
sequence Tx_n is already convergent. (Despite the name, closed operators are
not directly related to open operators.)
Example 1.7.18. Let T : c_c(N) → c_c(N) be the transformation
T(a_m)_{m=1}^∞ := (m a_m)_{m=1}^∞. This transformation is unbounded and
hence discontinuous, but one easily verifies that it is closed.
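A finite sketch of why this T is unbounded (illustrative only; we truncate sequences to finite lists):

```python
# T(a_m) = (m * a_m) on finitely supported sequences: on the unit vector e_m
# (sup norm 1) it produces a vector of sup norm m, so ||T|| >= m for every m.
def T(a):
    return [m * x for m, x in enumerate(a, start=1)]

def sup_norm(a):
    return max(abs(x) for x in a)

norms = []
for m in (1, 10, 100):
    e_m = [0.0] * m
    e_m[m - 1] = 1.0
    norms.append(sup_norm(T(e_m)))
print(norms)  # -> [1.0, 10.0, 100.0]
# Closedness, by contrast: if a^(k) -> a and T a^(k) -> b in sup norm, then
# for each coordinate b_m = lim_k m * a^(k)_m = m * a_m, i.e. b = T a.
```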
As Example 1.7.18 shows, being closed is often a weaker property
than being continuous. However, the remarkable closed graph theorem
shows that as long as the domain and range of the operator are both
Banach spaces, the two statements are equivalent:
Theorem 1.7.19 (Closed graph theorem). Let T : X → Y be a
linear transformation between two Banach spaces. Then the following
are equivalent:
(i) T is continuous.
(ii) T is closed.
(iii) (Weak continuity) There exists some topology T on Y, weaker
than the norm topology (i.e. containing fewer open sets) but
still Hausdorff, for which T : X → (Y, T) is continuous.
Proof. It is clear that (i) implies (iii) (just take T to equal the norm
topology). To see why (iii) implies (ii), observe that if x_n → x in
X and Tx_n → y in norm, then Tx_n → y in the weaker topology T
as well; but by weak continuity Tx_n → Tx in T. Since Hausdorff
topological spaces have unique limits, we have Tx = y, and so T is
closed.
Now we show that (ii) implies (i). If T is closed, then the graph
Γ := {(x, Tx) : x ∈ X} is a closed linear subspace of the Banach
space X × Y and is thus also a Banach space. On the other hand, the
projection map π : (x, Tx) ↦ x from Γ to X is clearly a continuous
linear bijection. By Corollary 1.7.14, its inverse x ↦ (x, Tx) is also
continuous, and so T is continuous as desired.
We can reformulate the closed graph theorem in the following
fashion:
Corollary 1.7.20. Let X, Y be Banach spaces, and suppose we have
some continuous inclusion Y ⊆ Z of Y into a Hausdorff topological
vector space Z. Let T : X → Z be a continuous linear transformation.
Then the following are equivalent.
(i) (Qualitative regularity) For all x ∈ X, Tx ∈ Y.
(ii) (Quantitative regularity) For all x ∈ X, Tx ∈ Y, and fur-
thermore ‖Tx‖_Y ≤ C‖x‖_X for some C > 0 independent of
x.
(iii) (Quantitative regularity on a dense subclass) For all x in
a dense subset of X, Tx ∈ Y, and furthermore ‖Tx‖_Y ≤
C‖x‖_X for some C > 0 independent of x.

Proof. Clearly (ii) implies both (i) and (iii). If we have (iii), then T extends
uniquely to a bounded linear map from X to Y, which must agree with
the original continuous map from X to Z since limits in the Hausdorff
space Z are unique, and so (iii) implies (ii). Finally, if (i) holds, then
we can view T as a map from X to Y, which by Theorem 1.7.19 is
continuous, and the claim now follows from Lemma 1.3.17.
In practice, one should think of Z as some sort of low regularity
space with a weak topology, and Y as a high regularity subspace
with a stronger topology. Corollary 1.7.20 motivates the method of a
priori estimates to establish the Y -regularity of some linear transform
Tx of an arbitrary element x in a Banach space X, by first establishing
the a priori estimate ‖Tx‖_Y ≤ C‖x‖_X for a dense subclass of nice
elements of X, and then using the above corollary (and some weak
continuity of T in a low regularity space) to conclude. The closed
graph theorem provides the metamathematical explanation as to why
this approach is at least as powerful as any other approach to proving
regularity.
Example 1.7.21. Let 1 ≤ p ≤ 2, and let p′ be the dual exponent,
1/p + 1/p′ = 1. The Hausdorff-Young inequality asserts that ‖f̂‖_{L^{p′}(R)} ≤
C_p ‖f‖_{L^p(R)} for some constant C_p
and all f in some suitable dense subclass of
L^p(R) (e.g. the space C_0^∞(R) of smooth functions of compact sup-
port); together with the soft observation that the Fourier trans-
form is continuous from L^p(R) to the space of tempered distribu-
tions, which is a Hausdorff space into which L^{p′}(R) embeds continuously,
this extends the Hausdorff-Young inequality to all of L^p(R) by Corollary 1.7.20.

Let X = {0,1}^N := ∏_{m=1}^∞ {0,1} be the space of all 0-1 sequences,
and for each n let E_n := {x ∈ {0,1}^N : x_n = 1}.
(From a probabilistic viewpoint, one can think of X as the event
space for flipping a countably infinite number of coins, and E_n as the
event that the n-th coin lands as heads.)
Let M(X) be the space of finite Borel measures on X; this can
be verified to be a Banach space. There is a map L : M(X) → ℓ^∞(N)
defined by

(1.69) L(μ) := (μ(E_n))_{n=1}^∞.
This is a continuous linear transformation. The equation Lμ = f
is quantitatively solvable for every f ∈ ℓ^∞(N). Indeed, if f is an
indicator function f = 1_A, then f = L δ_{x_A}, where x_A ∈ {0,1}^N is the
sequence that equals 1 on A and 0 outside of A, and δ_{x_A} is the Dirac
mass at x_A. The general case then follows by expressing a bounded
sequence as an integral of indicator functions (e.g. if f takes values
in [0,1], we can write f = ∫_0^1 1_{f>t} dt). Note however that this is
a nonlinear operation, since the indicator 1_{f>t} depends nonlinearly
on f.
We now claim that the equation Lμ = f is not quantitatively
linearly solvable, i.e. there is no bounded linear map S : ℓ^∞(N) →
M(X) such that LSf = f for all f ∈ ℓ^∞(N). A key tool here is:

Theorem 1.7.22 (Nikodym convergence theorem). Let μ_1, μ_2, … be a
sequence of signed finite measures which is weakly convergent in the
sense that μ_n(E) converges to a limit μ(E) for every measurable set E.
Then:
(i) for any sequence E_1, E_2, … of disjoint measurable sets, the series
Σ_{m=1}^∞ |μ_n(E_m)| converges uniformly in n;
(ii) μ is a signed finite measure.
Proof. It suffices to prove the first claim, since this easily implies that
μ is also countably additive, and is thence a signed finite measure.
Suppose for contradiction that the claim failed; then one could find
disjoint E_1, E_2, … and ε > 0 such that one has

limsup_{n→∞} Σ_{m=M}^∞ |μ_n(E_m)| > ε

for all M. We now construct disjoint sets A_1, A_2, …, each consisting
of the union of a finite collection of the E_j, and an increasing sequence
n_1, n_2, … of positive integers, by the following recursive procedure:
0. Initialise k = 0.
1. Suppose recursively that n_1 < … < n_{2k} and A_1, …, A_k have
already been constructed for some k ≥ 0.
2. Choose n_{2k+1} > n_{2k} so large that for all n ≥ n_{2k+1},
μ_n(A_1 ∪ … ∪ A_k) differs from μ(A_1 ∪ … ∪ A_k) by at most ε/10.
3. Choose M_k so large that M_k is larger than j for any E_j ⊆
A_1 ∪ … ∪ A_k, and such that Σ_{m=M_k}^∞ |μ_{n_j}(E_m)| ≤ ε/100^{k+1}
for all 1 ≤ j ≤ 2k + 1.
4. Choose n_{2k+2} > n_{2k+1} such that Σ_{m=M_k}^∞ |μ_{n_{2k+2}}(E_m)| > ε.
5. Pick A_{k+1} to be a finite union of the E_j with j ≥ M_k such
that |μ_{n_{2k+2}}(A_{k+1})| > ε/2.
6. Increment k to k + 1 and then return to Step 2.

It is then a routine matter to show that if A := ⋃_{j=1}^∞ A_j, then
|μ_{n_{2k+2}}(A) − μ_{n_{2k+1}}(A)| ≥ ε/10 for all k, contradicting the hypoth-
esis that the μ_j are weakly convergent to μ.
Exercise 1.7.11 (Schur's property for ℓ¹). Show that if a sequence
in ℓ¹(N) is convergent in the weak topology, then it is convergent in
the strong topology.
We return now to the map S : ℓ^∞(N) → M(X). For each n, let
a_n ∈ ℓ^∞(N) be defined by a_n := (1_{m≤n})_{m=1}^∞, i.e. each
a_n is the sequence consisting of n 1s followed by an infinite number
of 0s. As the dual of c_0(N) is isomorphic to ℓ¹(N), we see from the
dominated convergence theorem that a_n is a weakly Cauchy sequence
in c_0(N), in the sense that λ(a_n) is Cauchy for any λ ∈ c_0(N)*.
Applying S, we conclude that S(a_n) is weakly Cauchy in M(X).
In particular, using the bounded linear functionals μ ↦ μ(E) on
M(X), we see that S(a_n)(E) converges to some limit μ(E) for all
measurable sets E. Applying the Nikodym convergence theorem we
see that μ is also a signed finite measure. We then see that S(a_n)
converges in the weak topology to μ. (One way to see this is to
define ν := Σ_{n=1}^∞ 2^{−n} |S(a_n)| + |μ|; then ν is finite and S(a_n), μ are all
absolutely continuous with respect to ν; now use the Radon-Nikodym
theorem (see Section 1.2) and the fact that L¹(ν)* = L^∞(ν).) On the
other hand, as LS = I and L and S are both bounded, S is a Banach
space isomorphism between c_0 and S(c_0). Thus S(c_0) is complete,
hence closed, hence weakly closed (by the Hahn-Banach theorem),
and so μ = S(a) for some a ∈ c_0. By the Hahn-Banach theorem
again, this implies that a_n converges weakly to a in c_0. But this is
easily seen to be impossible, since the constant sequence (1)_{m=1}^∞ does
not lie in c_0, and the claim follows.
Now we give the hard analysis proof. Let e_1, e_2, … be the
standard basis for ℓ^∞(N). Since for any signs ε_1, …, ε_N ∈ {−1, +1} the
ℓ^∞ norm of ε_1 e_1 + … + ε_N e_N is 1, we have

(1.71) ‖S(ε_1 e_1 + … + ε_N e_N)‖_{M(X)} ≤ C

for some constant C independent of N. On the other hand, we can
write S(e_n) = f_n dν for some finite measure ν and some f_n ∈ L¹(ν)
using Radon-Nikodym as in the previous proof, and then

(1.72) ‖ε_1 f_1 + … + ε_N f_N‖_{L¹(ν)} ≤ C.

Taking expectations over random signs and applying Khintchine's inequality we con-
clude

(1.73) ‖(Σ_{n=1}^N |f_n|²)^{1/2}‖_{L¹(ν)} ≤ C,

and hence by Cauchy-Schwarz

‖Σ_{n=1}^N |f_n|‖_{L¹(ν)} ≤ C √N.

But as ‖f_n‖_{L¹(ν)} = ‖S(e_n)‖_{M(X)} ≥ c for some constant c > 0
independent of n, we obtain a contradiction for N large enough, and
the claim follows.
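For reference, the Khintchine step can be displayed as follows (the absolute constant c_1 and the pointwise application are standard, though this particular formulation is ours):

```latex
Khintchine's inequality in $L^1$: for i.i.d. random signs $\epsilon_n = \pm 1$
and scalars $a_1, \dots, a_N$,
\[
c_1 \Bigl( \sum_{n=1}^N |a_n|^2 \Bigr)^{1/2}
\;\le\; \mathbb{E} \Bigl| \sum_{n=1}^N \epsilon_n a_n \Bigr|
\;\le\; \Bigl( \sum_{n=1}^N |a_n|^2 \Bigr)^{1/2}.
\]
Applying this pointwise with $a_n = f_n(x)$, integrating in $x$ with respect
to $\nu$, and using Fubini together with (1.72) gives
\[
c_1 \Bigl\| \bigl( \sum_{n=1}^N |f_n|^2 \bigr)^{1/2} \Bigr\|_{L^1(\nu)}
\;\le\; \mathbb{E} \Bigl\| \sum_{n=1}^N \epsilon_n f_n \Bigr\|_{L^1(\nu)}
\;\le\; C,
\]
which is (1.73) up to the value of the constant.
```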
Remark 1.7.23. The phenomenon of nonlinear quantitative solvabil-
ity actually comes up in many applications of interest. For instance,
consider the Fefferman-Stein decomposition theorem [FeSt1972], which
asserts that any f ∈ BMO(R) of bounded mean oscillation can be
decomposed as f = g + Hh for some g, h ∈ L^∞(R), where H is the
Hilbert transform.

1.8. Compactness

Exercise 1.8.2. Let X be a compact topological space. Show that
every closed subset of X is compact. Show that any weaker topology
T′ ⊆ T on X also yields a compact topological space (X, T′).
Show that the trivial topology on X is always compact.
Exercise 1.8.3. Let X be a Hausdorff topological space.
Show that every compact subset of X is closed.
Show that any stronger topology T′ ⊇ T on X also yields a
Hausdorff topological space (X, T′).
Show that the discrete topology on X is always Hausdorff.
The first exercise asserts that compact topologies tend to be weak,
while the second exercise asserts that Hausdorff topologies tend to be
strong. The next lemma asserts that the two concepts only barely
overlap:

Lemma 1.8.1. Let T ⊆ T′ be two topologies on a space X, with (X, T)
Hausdorff and (X, T′) compact. Then T = T′.
(In other words, a compact topology cannot be strictly stronger than
a Hausdorff one, and a Hausdorff topology cannot be strictly weaker
than a compact one.)

Proof. Since T ⊆ T′, every set which is compact in (X, T′) is compact in
(X, T). But from Exercises 1.8.2, 1.8.3, every set which is closed
in (X, T′) is compact in (X, T′), and every set which is compact in
(X, T) is closed in (X, T). Thus every set which is closed in (X, T′)
is closed in (X, T), and hence T′ ⊆ T, giving the claim.
Corollary 1.8.2. Any continuous bijection f : X → Y from a com-
pact topological space (X, T_X) to a Hausdorff topological space (Y, T_Y)
is a homeomorphism.

Proof. Consider the pullback f^#(T_Y) := {f^{−1}(U) : U ∈ T_Y} of the
topology on Y by f; this is a topology on X. As f is continuous,
this topology is weaker than T_X, and thus by Lemma 1.8.1 is equal
to T_X. As f is a bijection, this implies that f^{−1} is continuous, and
the claim follows.
One may wish to compare this corollary with Corollary 1.7.14.
Remark 1.8.3. Spaces which are both compact and Hausdorff (e.g.
the unit interval [0, 1] with the usual topology) have many nice prop-
erties and are moderately common, so much so that the two properties
are often concatenated as CH. Spaces that are locally compact and
Hausdorff (e.g. manifolds) are much more common and have nearly
as many nice properties, and so these two properties are often con-
catenated as LCH. One should caution that (somewhat confusingly)
in some older literature (particularly those in the French tradition),
"compact" is used for "compact Hausdorff".
(Optional) Another way to contrast compactness and the Haus-
dorff property is via the machinery of ultrafilters. Define a filter on
a space X to be a collection p of subsets of X which is closed under finite
intersection, is also monotone (i.e. if E ∈ p and E ⊆ F ⊆ X, then
F ∈ p), and does not contain the empty set. Define an ultrafilter to
be a filter with the additional property that for any E ⊆ X, exactly
one of E and X∖E lies in p. (See also Section 1.5 of Structure and
Randomness.)
Exercise 1.8.4 (Ultrafilter lemma). Show that every filter is con-
tained in at least one ultrafilter. (Hint: use Zorn's lemma, see Section
2.4.)
Exercise 1.8.5. A collection of subsets of X has the finite inter-
section property if every finite subcollection of sets in the collection
has non-empty intersection. Show that every filter has the finite in-
tersection property, and that every collection of sets with the finite
intersection property is contained in a filter (and hence contained in
an ultrafilter, by the ultrafilter lemma).
Given a point x ∈ X and an ultrafilter p on X, we say that p
converges to x if every neighbourhood of x belongs to p.

Exercise 1.8.6. Show that a space X is Hausdorff if and only if every
ultrafilter has at most one limit. (Hint: For the "if" part, observe
that if x, y cannot be separated by disjoint neighbourhoods, then
the neighbourhoods of x and y together enjoy the finite intersection
property.)

Exercise 1.8.7. Show that a space X is compact if and only if every
ultrafilter has at least one limit. (Hint: use the finite intersection
property formulation of compactness and Exercise 1.8.5.)
1.8.2. Compactness and bases. Compactness is the property that
every open cover has a finite subcover. This property can be difficult
to verify in practice, in part because the class of open sets is very
large. However, in many cases one can replace the class of open sets
with a much smaller class of sets. For instance, in metric spaces, a
set is open if and only if it is the union of open balls (note that the
union may be infinite or even uncountable). We can generalise this
notion as follows:
Definition 1.8.4 (Base). Let (X, T) be a topological space. A base
for this space is a collection B of open sets such that every open set
in X can be expressed as the union of sets in the base. The elements
of B are referred to as basic open sets.
Example 1.8.5. The collection of open balls B(x, r) in a metric
space forms a base for the topology of that space. As another (rather
trivial) example of a base: any topology T is a base for itself.

This concept should be compared with that of a basis of a vector
space: every vector in that space can be expressed as a linear combi-
nation of vectors in a basis. However, one difference between a base
and a basis is that the representation of an open set as the union of
basic open sets is almost certainly not unique.
Given a base B, define a basic open neighbourhood of a point
x ∈ X to be a basic open set that contains x. Observe that a set U is
open if and only if every point in U has a basic open neighbourhood
contained in U.
Exercise 1.8.8. Let B be a collection of subsets of a set X. Show
that B is a base for some topology T if and only if it covers X and
has the following additional property: given any x ∈ X and any two
basic open neighbourhoods U, V of x, there exists another basic open
neighbourhood W of x that is contained in U ∩ V. Furthermore, the
topology T is uniquely determined by B.
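The criterion in this exercise can be tested mechanically on a finite example (the toy family below is our own; for finite X, arbitrary unions reduce to finite ones):

```python
from itertools import chain, combinations

# Check the base criterion for a small family B on X = {1,2,3}, then
# generate the topology it determines: all unions of base sets
# (the empty union contributing the empty set).
X = frozenset({1, 2, 3})
B = [frozenset({1}), frozenset({2, 3}), frozenset({1, 2, 3})]

def covers(B, X):
    return frozenset().union(*B) == X

def base_criterion(B):
    # for any x and any U, V in B containing x, some W in B has x in W <= U & V
    return all(any(x in W and W <= U & V for W in B)
               for U in B for V in B for x in U & V)

def generated_topology(B):
    subfamilies = chain.from_iterable(combinations(B, r) for r in range(len(B) + 1))
    return {frozenset().union(*S) for S in subfamilies}

print(covers(B, X), base_criterion(B))           # -> True True
print(sorted(map(sorted, generated_topology(B))))
```

One can verify by hand that the resulting family is indeed closed under unions and finite intersections, i.e. it is the topology uniquely determined by B.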
To verify the compactness property, it suffices to do so for basic
open covers (i.e. coverings of the whole space by basic open sets):

Exercise 1.8.9. Let (X, T) be a topological space with a base B.
Then the following are equivalent:
Every open cover has a finite subcover (i.e. X is compact);
Every basic open cover has a finite subcover.
A useful fact about compact metric spaces is that they are in
some sense countably generated.

Lemma 1.8.6. Let X = (X, d_X) be a compact metric space.
(i) X is separable (i.e. it has an at most countably infinite
dense subset).
(ii) X is second-countable (i.e. it has an at most countably
infinite base).

Proof. By Theorem 1.6.8, X is totally bounded. In particular, for ev-
ery n ≥ 1, one can cover X by a finite number of balls B(x_{n,1}, 1/n), …, B(x_{n,k_n}, 1/n)
of radius 1/n. The set of points {x_{n,i} : n ≥ 1; 1 ≤ i ≤ k_n} is then easily
verified to be dense and at most countable, giving (i). Similarly, the
set of balls {B(x_{n,i}, 1/n) : n ≥ 1; 1 ≤ i ≤ k_n} can be easily verified to
be a base which is at most countable, giving (ii).
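The total-boundedness step can be illustrated by greedily extracting finite 1/n-nets, here on the unit circle with the arc-length metric (our own choice of compact space and discretisation):

```python
import math

# A compact metric space is totally bounded, so for each n a finite set of
# 1/n-balls covers it. We build such nets greedily: scan sample points and
# keep any point at distance >= eps from all previously kept centers; every
# sample point then lies within eps of some center.
def arc_dist(s, t):
    d = abs(s - t) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def net(eps, samples=10000):
    centers = []
    for k in range(samples):
        p = 2 * math.pi * k / samples
        if all(arc_dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers

sizes = [len(net(1.0 / n)) for n in (1, 2, 4)]
print(sizes)  # finitely many centers suffice at each scale, more as n grows
```

Taking the union of these nets over all n produces a countable dense set, as in part (i) of the lemma.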
Remark 1.8.7. One can easily generalise compactness here to σ-
compactness; thus for instance finite-dimensional vector spaces R^n
are separable and second-countable. The properties of separability
and second-countability are much weaker than σ-compactness in gen-
eral, but can still serve to provide some constraint as to the size
or complexity of a metric space or topological space in many situ-
ations.
We now weaken the notion of a base to that of a sub-base.

Definition 1.8.8 (Sub-base). Let (X, T) be a topological space. A
sub-base for this space is a collection B of subsets of X such that T is
the weakest topology that makes B open (i.e. T is generated by B).
Elements of B are referred to as sub-basic open sets.

Observe for instance that every base is a sub-base. The converse
is not true: for instance, the half-infinite intervals (−∞, a), (a, +∞) for
a ∈ R form a sub-base for the standard topology on R, but not
a base. In contrast to bases, which need to obey the property in
Exercise 1.8.8, no property is required on a collection B in order for it
to be a sub-base; every collection of sets generates a unique topology
with respect to which it is a sub-base.
The precise relationship between sub-bases and bases is given by
the following exercise.

Exercise 1.8.10. Let (X, T) be a topological space, and let B be a
collection of subsets of X. Then the following are equivalent:
B is a sub-base for (X, T).
The space B* := {B_1 ∩ … ∩ B_k : B_1, …, B_k ∈ B} of fi-
nite intersections of sets in B (including the whole space X, which
corresponds to the case k = 0) is a base for (X, T).

Thus a set is open iff it is the union of finite intersections of
sub-basic open sets.
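On a finite toy space one can carry out this construction explicitly (the sub-base below, mimicking two half-infinite rays, is our own example):

```python
from itertools import chain, combinations

# From a sub-base, form all finite intersections (the empty intersection
# giving X) to obtain a base B*, then take all unions to get the topology.
X = frozenset({1, 2, 3, 4})
subbase = [frozenset({1, 2}), frozenset({2, 3, 4})]  # two "rays" meeting at 2

def subsets(F):
    return chain.from_iterable(combinations(F, r) for r in range(len(F) + 1))

base = {X.intersection(*S) if S else X for S in subsets(subbase)}
topology = {frozenset().union(*S) for S in subsets(sorted(base, key=sorted))}
print(sorted(map(sorted, base)))
print(sorted(map(sorted, topology)))
```

Here the base acquires the new set {2} = {1,2} ∩ {2,3,4}, which no union of sub-basic sets produces, illustrating why the intersection step is needed.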
Many topological facts involving open sets can often be reduced to
verifications on basic or sub-basic open sets, as the following exercise
illustrates:

Exercise 1.8.11. Let (X, T) be a topological space, let B be a
sub-base of X, and let B* be a base of X.
Show that a sequence x_n ∈ X converges to a limit x ∈
X if and only if every sub-basic open neighbourhood of x
contains x_n for all sufficiently large n. (Optional: show
that an analogous statement is also true for nets.)
Show that a point x ∈ X is adherent to a set E if and only
if every basic open neighbourhood of x intersects E. Give
an example to show that the claim fails for sub-basic open
sets.
Show that a point x ∈ X is in the interior of a set U if and
only if U contains a basic open neighbourhood of x. Give
an example to show that the claim fails for sub-basic open
sets.
If Y is another topological space, show that a map f : Y →
X is continuous if and only if the inverse image of every
sub-basic open set is open.
There is a useful strengthening of Exercise 1.8.9 in the spirit of
the above exercise, namely the Alexander sub-base theorem:
Theorem 1.8.9 (Alexander sub-base theorem). Let $(X, \mathcal{T})$ be a topological space with a sub-base $\mathcal{B}$. Then the following are equivalent:
- Every open cover of $X$ has a finite subcover (i.e. $X$ is compact);
- Every sub-basic open cover of $X$ has a finite subcover.

Proof. Call an open cover bad if it has no finite subcover, and good otherwise. In view of Exercise 1.8.9, it suffices to show that if every sub-basic open cover is good, then every basic open cover is good also, where we use the base $\mathcal{B}^*$ of finite intersections of sub-basic sets given by Exercise 1.8.10. Suppose for contradiction that there existed a bad basic open cover. If we order the bad basic open covers by set inclusion, then every chain of bad basic open covers has an upper bound, namely its union (which is still bad, since any finite subcover of the union would already lie in a single cover of the chain). By Zorn's lemma, there thus exists a maximal bad basic open cover $\mathcal{C} = (U_\alpha)_{\alpha \in A}$. Thus this cover has no finite subcover, but if one adds any new basic open set to this cover, then there must now be a finite subcover.
Pick a basic open set $U_\alpha$ from this cover $\mathcal{C}$; thus $U_\alpha = B_1 \cap \ldots \cap B_k$ for some sub-basic open sets $B_1, \ldots, B_k$. We claim that at least one of the $B_1, \ldots, B_k$ also lies in the cover $\mathcal{C}$. To see this, suppose for contradiction that none of the $B_1, \ldots, B_k$ was in $\mathcal{C}$. Then adding any of the $B_i$ to $\mathcal{C}$ enlarges the basic open cover and thus creates a finite subcover; thus $B_i$ together with finitely many sets from $\mathcal{C}$ cover $X$, or equivalently one can cover $X \backslash B_i$ with finitely many sets from $\mathcal{C}$. Thus one can also cover $X \backslash U_\alpha = \bigcup_{i=1}^k (X \backslash B_i)$ with finitely many sets from $\mathcal{C}$, and thus $X$ itself can be covered by finitely many sets from $\mathcal{C}$, a contradiction.
From the above discussion and the axiom of choice, we see that for each basic set $U_\alpha$ in $\mathcal{C}$ we may choose a sub-basic set $B_\alpha \in \mathcal{C}$ containing $U_\alpha$. Since the $U_\alpha$ cover $X$, the $B_\alpha$ also cover $X$. By hypothesis, this sub-basic open cover has a finite subcover; but the $B_\alpha$ all lie in $\mathcal{C}$, so $\mathcal{C}$ has a finite subcover, and so $\mathcal{C}$ is good, which gives the desired contradiction. $\square$
Exercise 1.8.12. (Optional) Use Exercise 1.8.7 to give another proof
of the Alexander sub-base theorem.
Exercise 1.8.13. Use the Alexander sub-base theorem to show that
the unit interval [0, 1] (with the usual topology) is compact, without
recourse to the Heine-Borel or Bolzano-Weierstrass theorems.
Exercise 1.8.14. Let X be a well-ordered set, endowed with the
order topology (Exercise 1.6.10); such a space is known as an ordinal
space. Show that X is Hausdorff, and that X is compact if and only
if X has a maximal element.
One of the major applications of the sub-base theorem is to prove
Tychonoff's theorem, which we turn to next.
1.8.3. Compactness and product spaces. Given two topological spaces $X = (X, \mathcal{T}_X)$ and $Y = (Y, \mathcal{T}_Y)$, we can form the product space $X \times Y$, using the cylinder sets $\{U \times Y : U \in \mathcal{T}_X\} \cup \{X \times V : V \in \mathcal{T}_Y\}$ as a sub-base, or equivalently using the open boxes $\{U \times V : U \in \mathcal{T}_X, V \in \mathcal{T}_Y\}$ as a base (cf. Example 1.6.25). One easily verifies that the obvious projection maps $\pi_X : X \times Y \to X$, $\pi_Y : X \times Y \to Y$ are continuous, and that these maps also provide homeomorphisms between $X \times \{y\}$ and $X$, or between $\{x\} \times Y$ and $Y$, for every $x \in X$, $y \in Y$. Also observe that a sequence $(x_n, y_n)_{n=1}^\infty$ (or net $(x_\alpha, y_\alpha)_{\alpha \in A}$) converges to a limit $(x, y)$ in $X \times Y$ if and only if $(x_n)_{n=1}^\infty$ and $(y_n)_{n=1}^\infty$ (or $(x_\alpha)_{\alpha \in A}$ and $(y_\alpha)_{\alpha \in A}$) converge in $X$ and $Y$ to $x$ and $y$ respectively.
This operation preserves a number of useful topological properties, for instance:
Exercise 1.8.15. Prove that the product of two Hausdorff spaces is still Hausdorff.
Exercise 1.8.16. Prove that the product of two sequentially compact
spaces is still sequentially compact.
Proposition 1.8.10. The product of two compact spaces is compact.
Proof. By Exercise 1.8.9 it suffices to show that any basic open cover of $X \times Y$ by boxes $(U_\alpha \times V_\alpha)_{\alpha \in A}$ has a finite subcover. For any $x \in X$, this open cover covers $\{x\} \times Y$; by the compactness of $\{x\} \times Y \equiv Y$, we can thus cover $\{x\} \times Y$ by a finite number of open boxes $U_\alpha \times V_\alpha$. Intersecting the $U_\alpha$ together, we obtain a neighbourhood $U_x$ of $x$ such that $U_x \times Y$ is covered by finitely many of these boxes. By compactness of $X$, we can cover $X$ by finitely many of the $U_x$, and then $X \times Y$ is covered by finitely many of the boxes, as required. $\square$

Definition 1.8.11 (Product spaces). Given a family $(X_\alpha, \mathcal{T}_\alpha)_{\alpha \in A}$ of topological spaces, let $X := \prod_{\alpha \in A} X_\alpha$ be the Cartesian product, with the coordinate projections $\pi_\alpha : X \to X_\alpha$ that map $(x_\beta)_{\beta \in A}$ to $x_\alpha$.
- We define the product topology on $X$ to be the topology generated by the cylinder sets $\pi_\alpha^{-1}(U_\alpha)$ for $\alpha \in A$ and $U_\alpha$ open in $X_\alpha$; equivalently, it is the weakest topology that makes all the projections $\pi_\alpha$ continuous.
- We define the box topology on $X$ to be the topology generated by all the boxes $\prod_{\alpha \in A} U_\alpha$, where $U_\alpha$ is open in $X_\alpha$ for all $\alpha \in A$.

Unless otherwise specified, we assume the product space to be endowed with the product topology rather than the box topology.
When $A$ is finite, the product topology and the box topology coincide. When $A$ is infinite, the two topologies are usually different (as we shall see), but the box topology is always at least as strong as the product topology. Actually, in practice the box topology is too strong to be of much use: there are not enough convergent sequences in it. For instance, in the space $\mathbf{R}^{\mathbf{N}}$ of real-valued sequences $(x_n)_{n=1}^\infty$, even sequences such as $(\frac{1}{m!} e^{-nm})_{n=1}^\infty$ do not converge to the zero sequence as $m \to \infty$ (why?), despite converging in just about every other sense.
Exercise 1.8.18. Show that the arbitrary product of Hausdorff spaces remains Hausdorff in either the product or the box topology.
Exercise 1.8.19. Let $(X_n, d_n)$ be a sequence of metric spaces. Show that the function $d : X \times X \to \mathbf{R}^+$ on the product space $X := \prod_{n=1}^\infty X_n$ defined by
$$d((x_n)_{n=1}^\infty, (y_n)_{n=1}^\infty) := \sum_{n=1}^\infty 2^{-n} \frac{d_n(x_n, y_n)}{1 + d_n(x_n, y_n)}$$
is a metric on $X$ which generates the product topology on $X$.
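One can check the metric axioms for this formula numerically on truncated sequences. The sketch below (the helper name `product_metric` is ours) takes each factor to be $\mathbf{R}$ with the usual distance; the weights $2^{-n}$ and the transform $t \mapsto t/(1+t)$ keep the sum convergent and bounded by 1:

```python
def product_metric(x, y):
    """Truncated version of the product metric
        d((x_n), (y_n)) = sum_n 2^(-n) * d_n(x_n, y_n) / (1 + d_n(x_n, y_n)),
    with each factor taken to be R, d_n(a, b) = |a - b|, and with
    sequences represented by finite lists (agreeing beyond their length)."""
    total = 0.0
    for n, (a, b) in enumerate(zip(x, y), start=1):
        dn = abs(a - b)
        total += 2.0 ** (-n) * dn / (1.0 + dn)
    return total

x, y, z = [0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [0.5, 0.0, 1.0]
```

The triangle inequality holds term by term, since $t \mapsto t/(1+t)$ is increasing and subadditive; this is the same observation one uses in solving the exercise.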
Exercise 1.8.20. Let $X = \prod_{\alpha \in A} X_\alpha$ be a product of topological spaces, endowed with the product topology. Show that a sequence $x^{(m)}$ in $X$ converges to a limit $x \in X$ if and only if $\pi_\alpha(x^{(m)})$ converges in $X_\alpha$ to $\pi_\alpha(x)$ for every $\alpha \in A$; in other words, convergence in the product topology is the same as pointwise convergence.

Proposition 1.8.12 (Sequential Tychonoff theorem). Let $(X_n)_{n=1}^\infty$ be a sequence of sequentially compact spaces, and consider the product space $X = \prod_{n=1}^\infty X_n$ with the product topology. Then $X$ is also sequentially compact.

Proof. Let $x^{(1)}, x^{(2)}, \ldots$ be a sequence in $X$; thus each $x^{(m)}$ is itself a sequence $x^{(m)} = (x^{(m)}_n)_{n=1}^\infty$ with $x^{(m)}_n \in X_n$ for all $n$. Our objective is to find a subsequence $x^{(m_j)}$ which converges to some limit $x = (x_n)_{n=1}^\infty$ in the product topology, which by Exercise 1.8.20 is the same as pointwise convergence (i.e. $x^{(m_j)}_n \to x_n$ as $j \to \infty$ for each $n$).
Consider the first coordinates $x^{(m)}_1 \in X_1$ of the sequence $x^{(m)}$. As $X_1$ is sequentially compact, we can find a subsequence $(x^{(m_{1,j})})_{j=1}^\infty$ in $X$ such that $x^{(m_{1,j})}_1$ converges in $X_1$ to some limit $x_1 \in X_1$.

Now, in this subsequence, consider the second coordinates $x^{(m_{1,j})}_2 \in X_2$. As $X_2$ is sequentially compact, we can find a further subsequence $(x^{(m_{2,j})})_{j=1}^\infty$ in $X$ such that $x^{(m_{2,j})}_2$ converges in $X_2$ to some limit $x_2 \in X_2$. Also, we inherit from the preceding subsequence that $x^{(m_{2,j})}_1$ converges in $X_1$ to $x_1$.

We continue in this vein, creating nested subsequences $(x^{(m_{i,j})})_{j=1}^\infty$ for $i = 1, 2, 3, \ldots$ whose first $i$ components $x^{(m_{i,j})}_1, \ldots, x^{(m_{i,j})}_i$ converge to $x_1 \in X_1, \ldots, x_i \in X_i$ respectively.

None of these subsequences is, by itself, sufficient to finish the problem. But now we use the diagonalisation trick: we consider the diagonal sequence $(x^{(m_{j,j})})_{j=1}^\infty$. One easily verifies that $x^{(m_{j,j})}_n$ converges in $X_n$ to $x_n$ as $j \to \infty$ for every $n$, and so we have extracted a sequence that is convergent in the product topology. $\square$
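The diagonalisation trick can be carried out quite concretely when each factor is the (sequentially compact) two-point space $\{0,1\}$. In the sketch below (all names are ours), the $m$-th point of the product is the binary digit sequence of $m$; each refinement keeps only those indices whose next digit is $0$ (infinitely many indices survive each stage), and the diagonal takes one fresh index from each successive nested subsequence:

```python
def digit(m, n):
    """x^(m)_n: the n-th binary digit of m, viewed as a point of {0,1}^N."""
    return (m >> n) & 1

def refine(stream, coord):
    """Keep only those indices m whose coord-th digit is 0; for binary
    digits this still leaves infinitely many indices, mimicking the
    passage to a convergent subsequence in the coord-th factor."""
    for m in stream:
        if digit(m, coord) == 0:
            yield m

def naturals():
    m = 0
    while True:
        yield m
        m += 1

def diagonal_indices(k):
    """First k indices m_{j,j} of the diagonal subsequence: at stage j
    we refine once more and take one fresh index from the current
    (j-th) nested subsequence.  Since the underlying stream is
    increasing and shared, the picks are strictly increasing."""
    out, stream = [], naturals()
    for j in range(k):
        stream = refine(stream, j)
        out.append(next(stream))
    return out
```

By construction, the $j$-th diagonal index has its first $j+1$ binary digits equal to zero, so each coordinate of the diagonal subsequence is eventually constant; that is, the subsequence converges pointwise (hence in the product topology) to the zero sequence.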
Remark 1.8.13. In the converse direction, if a product of spaces
is sequentially compact, then each of the factor spaces must also be
sequentially compact, since they are continuous images of the product
space and one can apply Exercise 1.8.1.
The sequential Tychonoff theorem breaks down for uncountable products. Consider for instance the product space $X := \{0,1\}^{\{0,1\}^{\mathbf{N}}}$ of functions $f : \{0,1\}^{\mathbf{N}} \to \{0,1\}$. As $\{0,1\}$ (with the discrete topology) is sequentially compact, this is an (uncountable) product of sequentially compact spaces. On the other hand, for each $n \in \mathbf{N}$ we can define the evaluation function $f_n : \{0,1\}^{\mathbf{N}} \to \{0,1\}$ by $f_n : (a_m)_{m=1}^\infty \mapsto a_n$. This is a sequence in $X$; we claim that it has no convergent subsequence. Indeed, given any $n_j \to \infty$, we can find $x = (x_m)_{m=1}^\infty \in \{0,1\}^{\mathbf{N}}$ such that $x_{n_j} = f_{n_j}(x)$ does not converge to a limit as $j \to \infty$ (e.g. by choosing $x_{n_j}$ to alternate between $0$ and $1$), and so $f_{n_j}$ does not converge pointwise (i.e. does not converge in the product topology).
However, we can recover the result for uncountable products as long as we work with topological compactness rather than sequential compactness, leading to Tychonoff's theorem:

Theorem 1.8.14 (Tychonoff theorem). Any product of compact topological spaces is compact.
Proof. Write $X = \prod_{\alpha \in A} X_\alpha$. By the Alexander sub-base theorem (Theorem 1.8.9), it suffices to show that every open cover of $X$ by sub-basic open sets $(\pi_{\alpha_\beta}^{-1}(U_\beta))_{\beta \in B}$ has a finite sub-cover, where $B$ is some index set, and for each $\beta \in B$, $\alpha_\beta \in A$ and $U_\beta$ is open in $X_{\alpha_\beta}$.

For each $\alpha \in A$, consider the sub-basic open sets $\pi_\alpha^{-1}(U_\beta)$ that are associated to those $\beta \in B$ with $\alpha_\beta = \alpha$. If the $U_\beta$ here cover $X_\alpha$, then by compactness of $X_\alpha$, finitely many of the $U_\beta$ already suffice to cover $X_\alpha$, in which case finitely many of the $\pi_\alpha^{-1}(U_\beta)$ cover $X$, and we are done. So we may assume that the $U_\beta$ do not cover $X_\alpha$; thus, for each $\alpha$, there exists $x_\alpha \in X_\alpha$ which lies outside all of the $U_\beta$ with $\alpha_\beta = \alpha$. One then sees that the point $(x_\alpha)_{\alpha \in A}$ in $X$ avoids all of the $\pi_{\alpha_\beta}^{-1}(U_\beta)$, a contradiction. The claim follows. $\square$
Remark 1.8.15. The axiom of choice was used in several places in the proof (in particular, via the Alexander sub-base theorem). This turns out to be necessary, because one can use Tychonoff's theorem to establish the axiom of choice. This was first observed by Kelley, and can be sketched as follows. It suffices to show that the product $\prod_{\alpha \in A} X_\alpha$ of any family of non-empty sets $X_\alpha$ is non-empty. To each $X_\alpha$ adjoin an extra point $\infty_\alpha$, and place a compact topology on $X_\alpha \cup \{\infty_\alpha\}$, with $X_\alpha$ closed in $X_\alpha \cup \{\infty_\alpha\}$ (e.g. take the open sets to be $\emptyset$, $\{\infty_\alpha\}$, $X_\alpha$, and $X_\alpha \cup \{\infty_\alpha\}$). By Tychonoff's theorem, the product $X := \prod_{\alpha \in A} (X_\alpha \cup \{\infty_\alpha\})$ is compact. The sets $\pi_\alpha^{-1}(X_\alpha)$ in $X$, where $\pi_\alpha : X \to X_\alpha \cup \{\infty_\alpha\}$ is the coordinate projection, are closed and enjoy the finite intersection property (to intersect finitely many of them, pick an element from each of the finitely many relevant $X_\alpha$, and set the remaining coordinates to $\infty$); by compactness, their total intersection $\prod_{\alpha \in A} X_\alpha$ is therefore non-empty.
Exercise 1.8.21. Let $X$ be a first-countable topological space.
- Show that any net $(x_\alpha)_{\alpha \in A}$ which converges in $X$ to $x$ has a convergent subsequence $(x^{(n)})_{n=1}^\infty$ (i.e. a subnet whose index set is $\mathbf{N}$).
- Show that any compact space which is first-countable is also sequentially compact. (The converse is not true: Exercise 1.6.10 provides a counterexample.)
(Optional) There is an alternate proof of the Tychonoff theorem
that uses the machinery of universal nets. We sketch this approach
in a series of exercises.
Definition 1.8.17. A net $(x_\alpha)_{\alpha \in A}$ in a set $X$ is universal if for every function $f : X \to \{0,1\}$, the net $(f(x_\alpha))_{\alpha \in A}$ converges to either $0$ or $1$.
Exercise 1.8.22. Show that a universal net $(x_\alpha)_{\alpha \in A}$ in a compact topological space is necessarily convergent. (Hint: show that the collection of closed sets which contain $x_\alpha$ for all sufficiently large $\alpha$ enjoys the finite intersection property.)

Exercise 1.8.23. Show that every net $(x_\alpha)_{\alpha \in A}$ in a set $X$ has a universal subnet $(x_{\phi(\beta)})_{\beta \in B}$. (Hint: First use Exercise 1.8.5 to find an ultrafilter $p$ on $A$ that contains the upsets $\{\beta \in A : \beta \ge \alpha\}$ for all $\alpha \in A$. Now let $B$ be the space of all pairs $(U, \alpha)$, where $\alpha \in U \in p$, ordered by requiring $(U, \alpha) \le (U', \alpha')$ when $U \supseteq U'$ and $\alpha \le \alpha'$, and set $\phi(U, \alpha) := \alpha$.)

Definition 1.8.18 (Equicontinuity). Let $X$ be a topological space, let $Y = (Y, d_Y)$ be a metric space, and let $(f_\alpha)_{\alpha \in A}$ be a family of functions $f_\alpha \in BC(X \to Y)$.
- We say that this family is pointwise bounded if for every $x \in X$, the set $\{f_\alpha(x) : \alpha \in A\}$ is bounded in $Y$.
- We say that this family is pointwise precompact if for every $x \in X$, the set $\{f_\alpha(x) : \alpha \in A\}$ is precompact in $Y$.
- We say that this family is equicontinuous if for every $x \in X$ and every $\varepsilon > 0$, there exists an open neighbourhood $U$ of $x$ such that $d_Y(f_\alpha(x'), f_\alpha(x)) \le \varepsilon$ for all $\alpha \in A$ and all $x' \in U$.
- If $X = (X, d_X)$ is also a metric space, we say that the family is uniformly equicontinuous if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $d_Y(f_\alpha(x'), f_\alpha(x)) \le \varepsilon$ for all $\alpha \in A$ and all $x, x' \in X$ with $d_X(x, x') \le \delta$.
Remark 1.8.19. From the Heine-Borel theorem, the pointwise boundedness and pointwise precompactness properties are equivalent if $Y$ is a subset of $\mathbf{R}^n$ for some $n$. Any finite collection of continuous functions is automatically an equicontinuous family (why?), and any finite collection of uniformly continuous functions is automatically a uniformly equicontinuous family; the concept only acquires additional meaning once one considers infinite families of continuous functions.
Example 1.8.20. With $X = [0,1]$ and $Y = \mathbf{R}$, the family of functions $f_n(x) := x^n$ for $n = 1, 2, 3, \ldots$ is pointwise bounded (and thus pointwise precompact), but not equicontinuous. The family of functions $g_n(x) := n$ for $n = 1, 2, 3, \ldots$, on the other hand, is equicontinuous, but not pointwise bounded or pointwise precompact. The family of functions $h_n(x) := \sin nx$ for $n = 1, 2, 3, \ldots$ is pointwise bounded (even uniformly bounded), but not equicontinuous.
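The failure of equicontinuity of $f_n(x) = x^n$ at $x = 1$ can be seen numerically: within any fixed neighbourhood of $1$, some member of the family still oscillates by almost $1$, even though each individual member moves only slightly on a small enough neighbourhood. A quick illustrative check (names ours):

```python
def oscillation(f, x0, delta, ns):
    """Largest |f(n, x) - f(n, x0)| over the family ns, at distance delta."""
    return max(abs(f(n, x0 - delta) - f(n, x0)) for n in ns)

power = lambda n, x: x ** n      # the family f_n(x) = x^n on [0, 1]
ns = range(1, 2001)

# Shrinking the neighbourhood does not help uniformly in n: the
# oscillation near x0 = 1 stays close to 1, so equicontinuity fails there.
osc_small_delta = oscillation(power, 1.0, 0.001, ns)

# Each individual member is of course continuous at 1:
single = abs(power(5, 1.0 - 0.001) - power(5, 1.0))
```

The same experiment with $h_n(x) = \sin nx$ near any point exhibits the same behaviour, while for $g_n(x) = n$ the oscillation is identically zero (the family is equicontinuous but unbounded).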
Example 1.8.21. With $X = Y = \mathbf{R}$, the functions $f_n(x) := \arctan nx$ are pointwise bounded (even uniformly bounded) and are each individually uniformly continuous; the family is equicontinuous at every non-zero $x$, but is not equicontinuous at $x = 0$, and in particular is not uniformly equicontinuous.
Exercise 1.8.27. Show that the uniform boundedness principle (Theorem 1.7.5) can be restated as the assertion that any family of bounded linear operators from the unit ball of a Banach space to a normed vector space is pointwise bounded if and only if it is equicontinuous.
Example 1.8.22. A function $f : X \to Y$ between two metric spaces is said to be Lipschitz (or Lipschitz continuous) if there exists a constant $C$ such that $d_Y(f(x), f(x')) \le C d_X(x, x')$ for all $x, x' \in X$; the smallest constant $C$ one can take here is known as the Lipschitz constant of $f$. Observe that Lipschitz functions are automatically continuous, hence the name. Also observe that a family $(f_\alpha)_{\alpha \in A}$ of Lipschitz functions with uniformly bounded Lipschitz constant is equicontinuous.
One nice consequence of equicontinuity is that it equates uniform convergence with pointwise convergence, or even pointwise convergence on a dense subset.
Exercise 1.8.28. Let $X$ be a topological space, let $Y$ be a complete metric space, and let $f_1, f_2, \ldots \in BC(X \to Y)$ be an equicontinuous family of functions. Show that the following are equivalent:
- The sequence $f_n$ is pointwise convergent.
- The sequence $f_n$ is pointwise convergent on some dense subset of $X$.

If $X$ is compact, show that the above two statements are also equivalent to:
- The sequence $f_n$ is uniformly convergent.

(Compare with Corollary 1.7.7.) Show that no two of the three statements remain equivalent if the hypothesis of equicontinuity is dropped.
We can now use Proposition 1.8.12 to give a useful characterisation of precompactness in $C(X \to Y)$ when $X$ is compact, known as the Arzelà-Ascoli theorem:
Theorem 1.8.23 (Arzelà-Ascoli theorem). Let $Y$ be a metric space, $X$ be a compact metric space, and let $(f_\alpha)_{\alpha \in A}$ be a family of functions $f_\alpha \in BC(X \to Y)$. Then the following are equivalent:
(i) $(f_\alpha)_{\alpha \in A}$ is a precompact subset of $BC(X \to Y)$.
(ii) $(f_\alpha)_{\alpha \in A}$ is pointwise precompact and equicontinuous.
(iii) $(f_\alpha)_{\alpha \in A}$ is pointwise precompact and uniformly equicontinuous.
Proof. We first show that (i) implies (ii). For any $x \in X$, the evaluation map $f \mapsto f(x)$ is a continuous map from $C(X \to Y)$ to $Y$, and thus maps precompact sets to precompact sets. As a consequence, any precompact family in $C(X \to Y)$ is pointwise precompact. To show equicontinuity, suppose for contradiction that equicontinuity failed at some point $x$; thus there exists $\varepsilon > 0$, a sequence $\alpha_n \in A$, and points $x_n \to x$ such that $d_Y(f_{\alpha_n}(x_n), f_{\alpha_n}(x)) > \varepsilon$ for every $n$. One then verifies that no subsequence of $f_{\alpha_n}$ can converge uniformly to a continuous limit, contradicting precompactness. (Note that in the metric space $C(X \to Y)$, precompactness is equivalent to sequential precompactness.)
Now we show that (ii) implies (iii). It suffices to show that equicontinuity implies uniform equicontinuity. This is a straightforward generalisation of the more familiar argument that continuity implies uniform continuity on a compact domain, and we repeat it here. Namely, fix $\varepsilon > 0$. For every $x \in X$, equicontinuity provides a $\delta_x > 0$ such that $d_Y(f_\alpha(x), f_\alpha(x')) \le \varepsilon$ whenever $x' \in B(x, \delta_x)$ and $\alpha \in A$. The balls $B(x, \delta_x/2)$ cover $X$, thus by compactness some finite subcollection $B(x_i, \delta_{x_i}/2)$, $i = 1, \ldots, n$ of these balls cover $X$. One then easily verifies that $d_Y(f_\alpha(x), f_\alpha(x')) \le 2\varepsilon$ whenever $x, x' \in X$ with $d_X(x, x') \le \min_{1 \le i \le n} \delta_{x_i}/2$.
Finally, we show that (iii) implies (i). It suffices to show that any sequence $f_n \in BC(X \to Y)$, $n = 1, 2, \ldots$, which is pointwise precompact and uniformly equicontinuous, has a convergent subsequence. By embedding $Y$ in its metric completion $\overline{Y}$, we may assume without loss of generality that $Y$ is complete. (Note that for every $x \in X$, the set $\{f_n(x) : n = 1, 2, \ldots\}$ is precompact in $Y$, hence its closure in $\overline{Y}$ is complete and thus closed in $\overline{Y}$ also. Thus any pointwise limit of the $f_n$ in $\overline{Y}$ will take values in $Y$.) By Lemma 1.8.6, we can find a countable dense subset $x_1, x_2, \ldots$ of $X$. For each $x_m$, we can use pointwise precompactness to find a compact set $K_m \subset Y$ such that $f_n(x_m)$ takes values in $K_m$. For each $n$, the tuple $F_n := (f_n(x_m))_{m=1}^\infty$ can then be viewed as a point in the product space $\prod_{m=1}^\infty K_m$. By Proposition 1.8.12, this product space is sequentially compact, hence we may find a subsequence $n_j \to \infty$ such that $F_{n_j}$ is convergent in the product topology, or equivalently that $f_{n_j}$ converges pointwise on the countable dense set $x_1, x_2, \ldots$. The claim now follows from Exercise 1.8.28. $\square$
Remark 1.8.24. The above theorem characterises precompact subsets of $BC(X \to Y)$ when $X$ is a compact metric space. One can also characterise compact subsets by observing that a subset of a metric space is compact if and only if it is both precompact and closed. There are many variants of the Arzelà-Ascoli theorem with stronger or weaker hypotheses or conclusions; for instance, we have
Corollary 1.8.25 (Arzelà-Ascoli theorem, special case). Let $f_n : X \to \mathbf{R}^m$ be a sequence of functions from a compact metric space $X$ to a finite-dimensional vector space $\mathbf{R}^m$ which are equicontinuous and pointwise bounded. Then there is a subsequence $f_{n_j}$ of $f_n$ which converges uniformly to a limit (which is necessarily bounded and continuous).
Thus, for instance, any sequence of uniformly bounded and uniformly Lipschitz functions $f_n : [0,1] \to \mathbf{R}$ will have a uniformly convergent subsequence. This claim fails without the uniform Lipschitz assumption (consider, for instance, the functions $f_n(x) := \sin(nx)$).
Thus one needs a little extra uniform regularity, in addition to uniform boundedness, in order to force the existence of uniformly convergent subsequences. This is a general phenomenon in infinite-dimensional function spaces: compactness in a strong topology tends to require some sort of uniform control on regularity or decay, in addition to uniform bounds on the norm.
Exercise 1.8.29. Show that the equivalence of (i) and (ii) continues to hold if $X$ is assumed to be just a compact Hausdorff space rather than a compact metric space (the statement (iii) no longer makes sense in this setting). (Hint: $X$ need not be separable any more; however, one can still adapt the diagonalisation argument used to prove Proposition 1.8.12. The starting point is the observation that for every $\varepsilon > 0$ and every $x \in X$, one can find a neighbourhood $U$ of $x$ and some subsequence $f_{n_j}$ which only oscillates by at most $\varepsilon$ (or maybe $2\varepsilon$) on $U$.)
Exercise 1.8.30 (Locally compact Hausdorff version of Arzelà-Ascoli). Let $X$ be a locally compact Hausdorff space which is also $\sigma$-compact, and let $f_n \in C(X \to \mathbf{R})$ be an equicontinuous, pointwise bounded sequence of functions. Then there exists a subsequence $f_{n_j} \in C(X \to \mathbf{R})$ which converges uniformly on compact subsets of $X$ to a limit $f \in C(X \to \mathbf{R})$. (Hint: Express $X$ as a countable union of compact sets $K_n$, each one contained in the interior of the next. Apply the compact Hausdorff Arzelà-Ascoli theorem on each compact set (Exercise 1.8.29). Then apply the Arzelà-Ascoli argument one last time.)
Remark 1.8.26. The Arzelà-Ascoli theorem (and other compactness theorems of this type) is often used in partial differential equations, to demonstrate existence of solutions to various equations or variational problems. For instance, one may wish to solve some equation $F(u) = f$ for some function $u : X \to \mathbf{R}^m$. One way to do this is to first construct a sequence $u_n$ of approximate solutions, so that $F(u_n) \to f$ as $n \to \infty$ in some suitable sense. If one can also arrange these $u_n$ to be equicontinuous and pointwise bounded, then the Arzelà-Ascoli theorem allows one to pass to a subsequence that converges to a limit $u$. Given enough continuity (or semi-continuity) properties on $F$, one can then show that $F(u) = f$ as required.
More generally, the use of compactness theorems to demonstrate existence of solutions in PDE is known as the compactness method. It is applicable in a remarkably broad range of PDE problems, but often has the drawback that it is difficult to establish uniqueness of the solutions created by this method (compactness guarantees existence of a limit point, but not uniqueness). Also, in many cases one can only hope for compactness in rather weak topologies, and as a consequence it is often difficult to establish regularity of the solutions obtained via compactness methods.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/02/09.
Thanks to Nate Chandler, Emmanuel Kowalski, Eric, K. P. Hart, Ke,
Luca Trevisan, PDEBeginner, RR, Samir Chomsky, Xiaochuan Liu,
and anonymous commenters for corrections.
David Speyer and Eric pointed out that the axiom of choice was used in two different ways in the proof of Tychonoff's theorem; firstly to prove the sub-base theorem, and secondly to select an element $x_\alpha$ from each $X_\alpha$.

1.9. The strong and weak topologies

Exercise 1.9.3. Let $V$ be a vector space, and let $(\mathcal{T}_\alpha)_{\alpha \in A}$ be a (possibly infinite) family of topologies on $V$, each of which turning $V$ into a topological vector space. Let $\mathcal{T} := \bigvee_{\alpha \in A} \mathcal{T}_\alpha$ be the topology generated by $\bigcup_{\alpha \in A} \mathcal{T}_\alpha$. Show that $\mathcal{T}$ also turns $V$ into a topological vector space, and that if each $\mathcal{T}_\alpha$ is generated by a family of semi-norms, then $\mathcal{T}$ is generated by the union of these families of semi-norms $(\|\cdot\|_\alpha)_{\alpha \in A}$ on $V$.
Exercise 1.9.4. Let $T : V \to W$ be a linear map between vector spaces. Suppose that we give $V$ the topology induced by a family of semi-norms $(\|\cdot\|_{V_\alpha})_{\alpha \in A}$, and $W$ the topology induced by a family of semi-norms $(\|\cdot\|_{W_\beta})_{\beta \in B}$. Show that $T$ is continuous if and only if, for each $\beta \in B$, there exists a finite subset $A_\beta$ of $A$ and a constant $C_\beta > 0$ such that $\|Tf\|_{W_\beta} \le C_\beta \sum_{\alpha \in A_\beta} \|f\|_{V_\alpha}$ for all $f \in V$.
Example 1.9.3 (Pointwise convergence). Let $X$ be a set, and let $\mathbf{C}^X$ be the space of complex-valued functions $f : X \to \mathbf{C}$; this is a complex vector space. Each point $x \in X$ gives rise to a seminorm $\|f\|_x := |f(x)|$. The topology generated by all of these seminorms is the topology of pointwise convergence on $\mathbf{C}^X$ (and is also the product topology on this space); a sequence $f_n \in \mathbf{C}^X$ converges to $f$ in this topology if and only if it converges pointwise. Note that if $X$ has more than one point, then none of the semi-norms individually generate a Hausdorff topology, but when combined together, they do.
Example 1.9.4 (Uniform convergence). Let $X$ be a topological space, and let $C(X)$ be the space of complex-valued continuous functions $f : X \to \mathbf{C}$. If $X$ is not compact, then one does not expect functions in $C(X)$ to be bounded in general, and so the sup norm does not necessarily make $C(X)$ into a normed vector space. Nevertheless, one can still define balls $B(f, r)$ in $C(X)$ by
$$B(f, r) := \{g \in C(X) : \sup_{x \in X} |f(x) - g(x)| \le r\}$$
and verify that these form a base for a topological vector space. A sequence $f_n \in C(X)$ converges in this topology to a limit $f \in C(X)$ if and only if $f_n$ converges uniformly to $f$, thus $\sup_{x \in X} |f_n(x) - f(x)|$ is finite for sufficiently large $n$ and converges to zero as $n \to \infty$. More generally, one can make a topological vector space out of any norm, quasi-norm, or semi-norm which is infinite on some portion of the vector space.
Example 1.9.5 (Uniform convergence on compact sets). Let $X$ and $C(X)$ be as in the previous example. For every compact subset $K$ of $X$, we can define a seminorm $\|\cdot\|_{C(K)}$ on $C(X)$ by $\|f\|_{C(K)} := \sup_{x \in K} |f(x)|$. The topology generated by all of these seminorms (as $K$ ranges over all compact subsets of $X$) is called the topology of uniform convergence on compact sets; it is stronger than the topology of pointwise convergence but weaker than the topology of uniform convergence. Indeed, a sequence $f_n \in C(X)$ converges to $f \in C(X)$ in this topology if and only if $f_n$ converges uniformly to $f$ on each compact set.
Exercise 1.9.5. Show that an arbitrary product of topological vector spaces (endowed with the product topology) is again a topological vector space(10).
Exercise 1.9.6. Show that a topological vector space is Hausdorff if and only if the origin $\{0\}$ is closed. (Hint: first use the continuity of addition to prove the lemma that if $V$ is an open neighbourhood of $0$, then there exists another open neighbourhood $U$ of $0$ such that $U + U \subset V$, i.e. $u + u' \in V$ for all $u, u' \in U$.)
Example 1.9.6 (Smooth convergence). Let $C^\infty([0,1])$ be the space of smooth functions on $[0,1]$. For each $k \ge 0$, one can define the $C^k$ norm
$$\|f\|_{C^k} := \sum_{j=0}^k \sup_{x \in [0,1]} |f^{(j)}(x)|,$$
where $f^{(j)}$ is the $j$-th derivative of $f$. The topology generated by all the $C^k$ norms for $k = 0, 1, 2, \ldots$ is the smooth topology: a sequence $f_n$ converges in this topology to a limit $f$ if $f^{(j)}_n$ converges uniformly to $f^{(j)}$ for each $j \ge 0$.
Exercise 1.9.7 (Convergence in measure). Let $(X, \mathcal{A}, \mu)$ be a measure space, and let $L(X)$ be the space of measurable functions $f : X \to \mathbf{C}$. Show that the sets
$$B(f, \varepsilon, r) := \{g \in L(X) : \mu(\{x : |f(x) - g(x)| \ge \varepsilon\}) < r\}$$
for $f \in L(X)$, $\varepsilon > 0$, $r > 0$ form the base for a topology that turns $L(X)$ into a topological vector space, and that a sequence $f_n \in L(X)$ converges to a limit $f$ in this topology if and only if it converges in measure.
(10) I am not sure if the same statement is true for the box topology; I believe it is false.
Exercise 1.9.8. Let $[0,1]$ be given the usual Lebesgue measure. Show that the vector space $L^\infty([0,1])$ [...]

[...] from normed vector spaces to topological vector spaces in the obvious manner: the dual space $V^*$ [...]

[...] $\bigcup_{n=1}^\infty V_n$ is dense. Show that the following statement is equivalent to the first three:
(iv) $K$ is closed and bounded, and for every $\varepsilon > 0$ there exists an $n$ such that $K$ lies in the $\varepsilon$-neighbourhood of $V_n$.
Example 1.9.8. Let $1 \le p < \infty$. In order for a set $K \subset \ell^p(\mathbf{N})$ to be compact in the strong topology, it needs to be closed and bounded, and also uniformly $p$th-power integrable at spatial infinity in the sense that for every $\varepsilon > 0$ there exists $n > 0$ such that
$$\Big(\sum_{m > n} |f(m)|^p\Big)^{1/p} \le \varepsilon$$
for all $f \in K$. Thus, for instance, the moving bump example $e_1, e_2, e_3, \ldots$, where $e_n$ is the sequence which equals $1$ at $n$ and zero elsewhere, is not uniformly $p$th-power integrable and thus not a compact subset of $\ell^p(\mathbf{N})$, despite being closed and bounded.
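The moving bump example can be quantified directly. In the sketch below (names ours, working in a truncated $\ell^p$), the bumps stay at pairwise distance $2^{1/p}$, which rules out any Cauchy subsequence, and each bump carries all of its mass in the tail beyond its own position, defeating any uniform tail bound:

```python
def lp_norm(v, p):
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

def bump(n, dim):
    """e_n in a truncated l^p: 1 at position n, zero elsewhere."""
    return [1.0 if i == n else 0.0 for i in range(dim)]

p, dim = 2, 50

# The bumps are all at pairwise distance 2^(1/p), so the sequence has
# no Cauchy subsequence and cannot lie in a compact set...
gap = lp_norm([a - b for a, b in zip(bump(0, dim), bump(1, dim))], p)

# ...and the bump at position n carries all of its mass beyond any
# threshold m <= n, so the tails are not uniformly small:
tail = lp_norm(bump(25, dim)[25:], p)
```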
For continuous $L^p$ spaces, such as $L^p(\mathbf{R})$, uniform integrability at spatial infinity is not sufficient to force compactness in the strong topology; one also needs some uniform integrability at very fine scales, which can be described using harmonic analysis tools such as the Fourier transform (Section 1.12). We will not discuss this topic here.
Exercise 1.9.12. Let $V$ be a normed vector space.
- If $W$ is a finite-dimensional subspace of $V$, and $x \in V$, show that there exists $y \in W$ such that $\|x - y\| \le \|x - y'\|$ for all $y' \in W$.

Definition 1.9.9 (Weak and weak* topologies). Let $V$ be a topological vector space, and let $V^*$ be its dual.
- The weak topology on $V$ is the topology generated by the seminorms $\|x\|_\lambda := |\lambda(x)|$ for $\lambda \in V^*$.
- The weak* topology on $V^*$ is the topology generated by the seminorms $\|\lambda\|_x := |\lambda(x)|$ for $x \in V$.

Exercise 1.9.13. Show that the weak topology on $V$ and the weak* topology on $V^*$ are topological vector space structures on $V$ and $V^*$ respectively. When $V$ is reflexive, show that the weak and weak* topologies on $V^*$ are equivalent.
From the definition, we see that a sequence $x_n \in V$ converges in the weak topology, or converges weakly for short, to a limit $x \in V$ if and only if $\lambda(x_n) \to \lambda(x)$ for all $\lambda \in V^*$. Similarly, a sequence $\lambda_n \in V^*$ converges in the weak* topology to $\lambda \in V^*$ if $\lambda_n(x) \to \lambda(x)$ for all $x \in V$ (thus $\lambda_n$, viewed as a function on $V$, converges pointwise to $\lambda$).
Remark 1.9.11. If $V$ is a Hilbert space, then from the Riesz representation theorem for Hilbert spaces (Theorem 1.4.13) we see that a sequence $x_n \in V$ converges weakly (or in the weak* sense) to a limit $x \in V$ if and only if $\langle x_n, y \rangle \to \langle x, y \rangle$ for all $y \in V$.
Exercise 1.9.14. Show that if $V$ is a normed vector space, then the weak topology on $V$ and the weak* topology on $V^*$ are both Hausdorff. (Hint: You will need the Hahn-Banach theorem.) In particular, we conclude the important fact that weak and weak* limits, when they exist, are unique.
The following exercise shows that the strong, weak, and weak* topologies can all differ from each other.
Exercise 1.9.15. Let $V := c_0(\mathbf{N})$; thus $V^* \equiv \ell^1(\mathbf{N})$ and $V^{**} \equiv \ell^\infty(\mathbf{N})$. Let $e_1, e_2, \ldots$ be the standard basis of either $V$, $V^*$, or $V^{**}$.
- Show that the sequence $e_1, e_2, \ldots$ converges weakly in $V$ to zero, but does not converge strongly in $V$.
- Show that the sequence $e_1, e_2, \ldots$ converges in the weak* sense in $V^*$ to zero, but does not converge weakly in $V^*$.
- Show that the sequence $\sum_{m=n}^\infty e_m$ for $n = 1, 2, \ldots$ converges in the weak* topology of $V^{**}$ to zero, but does not converge weakly in $V^{**}$.

It is a theorem of Schur that sequences in $\ell^1(\mathbf{N})$ which converge in the weak topology also converge in the strong topology. We caution however that the two topologies are not quite equivalent; for instance, the open unit ball in $\ell^1(\mathbf{N})$ is open in the strong topology, but not in the weak.
Exercise 1.9.16. Let $V$ be a normed vector space, and let $E$ be a subset of $V$. Show that the following are equivalent:
- $E$ is strongly bounded (i.e. $E$ is contained in a ball).
- $E$ is weakly bounded (i.e. $\lambda(E)$ is bounded for all $\lambda \in V^*$).

(Hint: use the Hahn-Banach theorem and the uniform boundedness principle.) Similarly, if $F$ is a subset of $V^*$, and $V$ is a Banach space, show that $F$ is strongly bounded if and only if $F$ is weak* bounded (i.e. $\{\lambda(x) : \lambda \in F\}$ is bounded for each $x \in V$). Conclude in particular that any sequence which is weakly convergent in $V$ or weak* convergent in $V^*$ is necessarily bounded.
Exercise 1.9.17. Let $V$ be a Banach space, and let $x_n \in V$ converge weakly to a limit $x \in V$. Show that the sequence $x_n$ is bounded, and that
$$\|x\|_V \le \liminf_{n \to \infty} \|x_n\|_V.$$
Observe from Exercise 1.9.15 that strict inequality can hold (cf. Fatou's lemma, Theorem 1.1.21). Similarly, if $\lambda_n \in V^*$ converges in the weak* topology to a limit $\lambda \in V^*$, show that $\lambda_n$ is bounded and that $\|\lambda\|_{V^*} \le \liminf_{n \to \infty} \|\lambda_n\|_{V^*}$.
Exercise 1.9.19 (Cesàro convergence). Let $V$ be a Hilbert space. We say that a sequence $x_n \in V$ converges in the Cesàro sense to $x \in V$ if the averages $\frac{1}{N} \sum_{n=1}^N x_n$ converge strongly to $x$ as $N \to \infty$.
- Show that if $x_n$ converges strongly to $x$, then it also converges in the Cesàro sense to $x$.
- Give examples to show that weak convergence does not imply Cesàro convergence, and vice versa. On the other hand, if a sequence $x_n$ converges both weakly and in the Cesàro sense, show that the weak limit is necessarily equal to the Cesàro limit.
- Show that if a bounded sequence converges weakly to a limit $x$, then some subsequence converges in the Cesàro sense to $x$.
- Show that a sequence $x_n$ converges weakly to $x$ if and only if every subsequence has a further subsequence that converges in the Cesàro sense to $x$.
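The divergence between strong, weak, and Cesàro convergence is already visible for the basis vectors $e_n$ of $\ell^2$: they have norm one (so no strong convergence to zero), each coordinate eventually vanishes (weak convergence to zero), and their Cesàro averages go to zero in norm. A truncated numerical sketch (names ours):

```python
import math

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def cesaro_average(vectors):
    """(1/N) * sum of the first N vectors (componentwise)."""
    N = len(vectors)
    return [sum(v[i] for v in vectors) / N for i in range(len(vectors[0]))]

M = 100                                     # truncation dimension
e = [[1.0 if i == n else 0.0 for i in range(M)] for n in range(M)]

unit_norms = [norm(e[n]) for n in range(M)]  # all equal 1: no strong limit 0
coord_0 = [e[n][0] for n in range(M)]        # 1, 0, 0, ...: coordinates die out
cesaro_norm = norm(cesaro_average(e))        # = 1/sqrt(M) -> 0 as M grows
```

The orthogonality of the $e_n$ is what makes the averages small: the average of $N$ orthonormal vectors has norm $\sqrt{N}/N = 1/\sqrt{N}$.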
Exercise 1.9.20. Let $V$ be a Banach space. Show that the closed unit ball in $V$ is also closed in the weak topology, and the closed unit ball in $V^*$ is closed in the weak* topology.
Exercise 1.9.22. Let $V$ be a normed vector space, and let $W$ be a subspace of $V$ which is closed in the strong topology of $V$.
- Show that $W$ is closed in the weak topology of $V$.
- If $w_n \in W$ is a sequence and $w \in W$, show that $w_n$ converges to $w$ in the weak topology of $W$ if and only if it converges to $w$ in the weak topology of $V$. (Because of this fact, we can often refer to "the weak topology" without specifying the ambient space precisely.)
Exercise 1.9.23. Let $V := c_0(\mathbf{N})$ with the uniform (i.e. $\ell^\infty$) norm, and identify the dual space $V^*$ with $\ell^1(\mathbf{N})$ in the usual manner.
- Show that a sequence $x_n \in c_0(\mathbf{N})$ converges weakly to a limit $x \in c_0(\mathbf{N})$ if and only if the $x_n$ are bounded in $c_0(\mathbf{N})$ and converge pointwise to $x$.
- Show that a sequence $\lambda_n \in \ell^1(\mathbf{N})$ converges in the weak* topology to a limit $\lambda \in \ell^1(\mathbf{N})$ if and only if the $\lambda_n$ are bounded in $\ell^1(\mathbf{N})$ and converge pointwise to $\lambda$.
- Show that the weak topology in $c_0(\mathbf{N})$ is not complete.

(More generally, it may help to think of the weak and weak* topologies as being analogous to pointwise convergence topologies.)
One of the main reasons why we use the weak and weak* topologies in the first place is that they have much better compactness properties than the strong topology, thanks to the Banach-Alaoglu theorem:
Theorem 1.9.13 (Banach-Alaoglu theorem). Let $V$ be a normed vector space. Then the closed unit ball $B^*$ of $V^*$ is compact in the weak* topology.

Proof. Let $B$ be the closed unit ball of $V$, and let $D$ be the closed unit disk in $\mathbf{C}$; then any linear functional $\lambda \in B^*$ maps $B$ into $D$, and is determined by its restriction to $B$, so $B^*$ can be identified with a subset of $D^B$, the space of functions from $B$ to $D$. One easily verifies that the weak* topology on $B^*$ is then the topology induced by the product topology on $D^B$, and that $B^*$ is closed in $D^B$. But by Tychonoff's theorem, $D^B$ is compact, and so $B^*$ is compact also. $\square$

One should caution that the Banach-Alaoglu theorem does not imply that the space $V^*$, or even the closed unit ball of $V^*$, is sequentially compact in the weak* topology; for the latter, one needs the additional hypothesis that $V$ is separable:

Theorem 1.9.15 (Sequential Banach-Alaoglu theorem). Let $V$ be a separable normed vector space. Then the closed unit ball $B^*$ of $V^*$ is sequentially compact in the weak* topology.

Proof. The functionals in $B^*$ are uniformly bounded and uniformly equicontinuous on the closed unit ball $B$ of $V$, which is separable by hypothesis; by the Arzelà-Ascoli argument, every sequence in $B^*$ thus has a subsequence which converges pointwise on $B$, and hence in the weak* topology. $\square$

We thus have
Corollary 1.9.16. If $V$ is a reflexive normed vector space, then the closed unit ball in $V$ is weakly compact, and (if $V^*$ is separable) is also sequentially weakly compact.

Remark 1.9.17. If $V$ is a normed vector space that is not separable, then one can show (using the Hahn-Banach theorem) that $V^*$ contains an uncountable collection of functionals whose pairwise distances are uniformly bounded from below, which is not compatible with separability. As a consequence, a reflexive space is separable if and only if its dual is separable(11).
(11) On the other hand, separable spaces can have non-separable duals; consider $\ell^1(\mathbf{N})$, for instance.
In particular, any bounded sequence in a reflexive separable normed vector space has a weakly convergent subsequence. This fact leads to the very useful weak compactness method in PDE and calculus of variations, in which a solution to a PDE or variational problem is constructed by first constructing a bounded sequence of near-solutions or near-extremisers to the PDE or variational problem, and then extracting a weak limit. However, it is important to caution that weak compactness can fail for non-reflexive spaces; indeed, for such spaces the closed unit ball in $V$ may not even be weakly complete, let alone weakly compact, as already seen in Exercise 1.9.23. Thus, one should be cautious when applying the weak compactness method to a non-reflexive space such as $L^1$ or $L^\infty$.
Note that a sequence $T_n \in B(X \to Y)$ converges in the strong operator topology to a limit $T \in B(X \to Y)$ if and only if $T_n x \to Tx$ strongly in $Y$ for all $x \in X$, and converges in the weak operator topology to $T$ if and only if $T_n x \to Tx$ weakly in $Y$ for all $x \in X$. (In contrast, $T_n$ converges to $T$ in the operator norm topology if and only if $T_n x$ converges to $Tx$ uniformly on bounded sets.) One easily sees that the weak operator topology is weaker than the strong operator topology, which in turn is (somewhat confusingly) weaker than the operator norm topology.
Example 1.9.19. When X is the scalar field, then B(X → Y) is canonically isomorphic to Y. In this case, the operator norm and strong operator topologies coincide with the strong topology on Y, and the weak operator topology coincides with the weak topology on Y. Meanwhile, B(Y → X) coincides with Y*.
We can rephrase the uniform boundedness principle for conver-
gence (Corollary 1.7.7) as follows:
Proposition 1.9.20 (Uniform boundedness principle). Let T_n ∈ B(X → Y) be a sequence of bounded linear operators from a Banach space X to a normed vector space Y, let T ∈ B(X → Y) be another bounded linear operator, and let D be a dense subspace of X. Then the following are equivalent:

• T_n converges in the strong operator topology of B(X → Y) to T.
• T_n is bounded in the operator norm (i.e. ‖T_n‖_op is bounded), and the restriction of T_n to D converges in the strong operator topology of B(D → Y) to the restriction of T to D.
Exercise 1.9.25. Let the hypotheses be as in Proposition 1.9.20, but now assume that Y is also a Banach space. Show that the conclusion of Proposition 1.9.20 continues to hold if "strong operator topology" is replaced by "weak operator topology".
Exercise 1.9.26. Show that the operator norm topology, strong operator topology, and weak operator topology are all Hausdorff. As these topologies are nested, we thus conclude that it is not possible for a sequence of operators to converge to one limit in one of these topologies and to converge to a different limit in another.
Example 1.9.21. Let X = L^2(R), and for each t ∈ R, let T_t : X → X be the translation operator by t: T_t f(x) := f(x − t). If f is continuous and compactly supported, then (e.g. from dominated convergence) we see that T_t f → f in L^2 as t → 0. Since the space of continuous and compactly supported functions is dense in L^2(R), this implies (from the above proposition, with some obvious modifications to deal with the continuous parameter t instead of the discrete parameter n) that T_t converges in the strong operator topology (and hence weak operator topology) to the identity as t → 0. On the other hand, T_t does not converge to the identity in the operator norm topology. Indeed, observe for any t > 0 that ‖(T_t − I)1_[0,t]‖_{L^2(R)} = √2 ‖1_[0,t]‖_{L^2(R)}, and thus ‖T_t − I‖_op ≥ √2.
In a similar vein, T_t does not converge to anything in the strong operator topology (and hence does not converge in the operator norm topology either) in the limit t → ∞, since T_t 1_[0,1] (say) does not converge strongly in L^2. However, one easily verifies that ⟨T_t f, g⟩ → 0 as t → ∞ for any compactly supported f, g ∈ L^2(R), and hence for all f, g ∈ L^2(R) by the usual limiting argument; thus T_t converges in the weak operator topology to zero.
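The contrast in this example between weak and strong operator convergence of the translations T_t as t → ∞ can be checked numerically. The sketch below (all grid parameters are ad hoc choices, not from the text) discretises L^2(R) and approximates inner products by Riemann sums: the pairing ⟨T_t f, g⟩ decays to zero while ‖T_t f‖ stays constant.

```python
import numpy as np

# Discretise L^2(R): translation T_t f(x) = f(x - t) becomes an index shift.
dx = 0.01
x = np.arange(-50.0, 150.0, dx)
f = ((x >= 0) & (x <= 1)).astype(float)   # f = indicator of [0, 1]
g = f.copy()                              # test function g = same indicator

def translate(h, t):
    """Approximate T_t h(x) = h(x - t) by shifting grid indices."""
    shift = int(round(t / dx))
    out = np.zeros_like(h)
    if shift >= 0:
        out[shift:] = h[:len(h) - shift]
    else:
        out[:shift] = h[-shift:]
    return out

for t in [0.5, 5.0, 50.0]:
    Ttf = translate(f, t)
    weak = np.sum(Ttf * g) * dx            # <T_t f, g> -> 0: weak op. convergence
    norm = np.sqrt(np.sum(Ttf ** 2) * dx)  # ||T_t f|| stays ~1: no strong convergence
    print(t, round(weak, 3), round(norm, 3))
```

As t grows the pairing vanishes while the norm does not, matching the claim that T_t → 0 only in the weak operator topology.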
The following exercise may help clarify the relationship between
the operator norm, strong operator, and weak operator topologies.
Exercise 1.9.27. Let H be a Hilbert space, and let T_n ∈ B(H → H) be a sequence of bounded linear operators.

• Show that T_n → 0 in the operator norm topology if and only if ⟨T_n x_n, y_n⟩ → 0 for any bounded sequences x_n, y_n ∈ H.
• Show that T_n → 0 in the strong operator topology if and only if ⟨T_n x_n, y_n⟩ → 0 for any convergent sequence x_n ∈ H and any bounded sequence y_n ∈ H.
• Show that T_n → 0 in the weak operator topology if and only if ⟨T_n x_n, y_n⟩ → 0 for any convergent sequences x_n, y_n ∈ H.
• Show that T_n → 0 in the operator norm (resp. weak operator) topology if and only if T_n* → 0 in the operator norm (resp. weak operator) topology. Give an example to show that the corresponding claim for the strong operator topology is false.
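One standard family to test the last item against is the shift operators on l^2(N) (this is a spoiler for the requested counterexample, so skip it if you want to find one yourself). In the finite truncation sketched below (truncation dimension 2000 and the test vector are arbitrary choices), the left shifts L^n tend to zero strongly but not in norm, while their adjoints, the right shifts R^n = (L^n)*, tend to zero only weakly.

```python
import numpy as np

# Finite-dimensional truncation of l^2(N).
d = 2000
x = 1.0 / np.arange(1, d + 1)          # a fixed vector in l^2
x /= np.linalg.norm(x)

def left_shift(v, n):   # (L^n v)_k = v_{k+n}
    out = np.zeros_like(v); out[:len(v) - n] = v[n:]; return out

def right_shift(v, n):  # adjoint: (R^n v)_k = v_{k-n}
    out = np.zeros_like(v); out[n:] = v[:len(v) - n]; return out

for n in [1, 10, 100]:
    print(n,
          round(np.linalg.norm(left_shift(x, n)), 4),    # -> 0: strong convergence
          round(np.linalg.norm(right_shift(x, n)), 4),   # stays ~1: no strong conv.
          round(np.dot(right_shift(x, n), x), 6))        # -> 0: weak convergence
```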
There is a counterpart of the Banach-Alaoglu theorem (and its
sequential analogue), at least in the case of Hilbert spaces:
Exercise 1.9.28. Let H, H′ be Hilbert spaces. Show that the closed unit ball (in the operator norm) of B(H → H′) is compact in the weak operator topology, and that if H, H′ are separable, then this ball is sequentially compact in the weak operator topology.
The behaviour of convergence in various topologies with respect
to composition is somewhat complicated, as the following exercise
shows.
Exercise 1.9.29. Let H be a Hilbert space, let S_n, T_n ∈ B(H → H) be sequences of operators, and let S ∈ B(H → H) be another operator.

• If T_n → 0 in the operator norm (resp. strong operator or weak operator) topology, show that ST_n → 0 and T_n S → 0 in the operator norm (resp. strong operator or weak operator) topology.
• If T_n → 0 in the operator norm topology, and S_n is bounded in the operator norm topology, show that S_n T_n → 0 and T_n S_n → 0 in the operator norm topology.
• If T_n → 0 in the strong operator topology, and S_n is bounded in the operator norm topology, show that S_n T_n → 0 in the strong operator topology.
• Give an example where T_n → 0 in the strong operator topology, and S_n → 0 in the weak operator topology, but T_n S_n does not converge to zero even in the weak operator topology.
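The failure in the last item can be seen on the shift operators of l^2(N): T_n = L^n → 0 strongly and S_n = R^n → 0 weakly, yet T_n S_n = L^n R^n is the identity. The sketch below (dimension and vector are arbitrary; the vector is supported away from the truncation boundary so the finite model behaves like l^2) verifies this.

```python
import numpy as np

d = 1000
rng = np.random.default_rng(0)
v = np.zeros(d)
v[:500] = rng.standard_normal(500)     # support away from the boundary
v /= np.linalg.norm(v)

def left_shift(w, n):   # (L^n w)_k = w_{k+n}
    out = np.zeros_like(w); out[:len(w) - n] = w[n:]; return out

def right_shift(w, n):  # (R^n w)_k = w_{k-n}
    out = np.zeros_like(w); out[n:] = w[:len(w) - n]; return out

n = 200
w = left_shift(right_shift(v, n), n)       # T_n S_n v = L^n R^n v
print(np.allclose(w, v))                   # identity on v: no convergence to zero
print(np.linalg.norm(left_shift(v, 500)))  # while L^n v -> 0 strongly
```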
Exercise 1.9.30. Let H be a Hilbert space. An operator T ∈ B(H → H) is said to be finite rank if its image T(H) is finite dimensional. T is said to be compact if the image of the unit ball is precompact. Let K(H → H) denote the space of compact operators on H.

• Show that T ∈ B(H → H) is compact if and only if it is the limit of finite rank operators in the operator norm topology. Conclude in particular that K(H → H) is a closed subset of B(H → H) in the operator norm topology.
• Show that an operator T ∈ B(H → H) is compact if and only if T* is compact.
• If H is separable, show that every T ∈ B(H → H) is the limit of finite rank operators in the strong operator topology.
• If T ∈ K(H → H), show that T maps weakly convergent sequences to strongly convergent sequences. (This property is known as complete continuity.)
• Show that K(H → H) is a subspace of B(H → H), which is closed with respect to left and right multiplication by elements of B(H → H). (In other words, the space of compact operators is a two-sided ideal in the algebra of bounded operators.)
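A minimal numerical illustration of the first item, with an ad hoc diagonal operator: T e_k = e_k / k is compact, and truncating to the first m coordinates gives finite rank approximants whose operator-norm error is exactly 1/(m+1).

```python
import numpy as np

# Diagonal compact operator T e_k = e_k / k, truncated to dimension d.
d = 500
T = np.diag(1.0 / np.arange(1, d + 1))

for m in [5, 50, 200]:
    Tm = T.copy()
    Tm[m:, m:] = 0.0                 # rank-m truncation
    err = np.linalg.norm(T - Tm, 2)  # operator (spectral) norm of the error
    print(m, err)                    # equals 1/(m+1): norm-limit of finite rank
```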
The weak operator topology plays a particularly important role in the theory of von Neumann algebras, which we will not discuss here.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/02/21. Thanks to Eric, etale, less than epsilon, Matt Daws, PDEBeginner, Sebastian Scholtes, Xiaochuan Liu, Yasser Taima, and anonymous commenters for corrections.
1.10. Continuous functions on locally compact Hausdorff spaces

A key theme in real analysis is that of studying general functions f : X → R or f : X → C by first approximating them by "simpler" or "nicer" functions. But the precise class of "simple" or "nice" functions may vary from context to context. In measure theory, for instance, it is common to approximate measurable functions by indicator functions or simple functions. But in other parts of analysis, it is often more convenient to approximate rough functions by continuous or smooth functions (perhaps with compact support, or some other decay condition), or by functions in some algebraic class, such as the class of polynomials or trigonometric polynomials.

In order to approximate rough functions by more continuous ones, one of course needs tools that can generate continuous functions with some specified behaviour. The two basic tools for this are Urysohn's lemma, which approximates indicator functions by continuous functions, and the Tietze extension theorem, which extends continuous functions on a subdomain to continuous functions on a larger domain. An important consequence of these theorems is the Riesz representation theorem for linear functionals on the space of compactly supported continuous functions, which describes such functionals in terms of Radon measures.

Sometimes, approximation by continuous functions is not enough; one must approximate continuous functions in turn by an even smoother class of functions. A useful tool in this regard is the Stone-Weierstrass theorem, that generalises the classical Weierstrass approximation theorem to more general algebras of functions.

As an application of this theory (and of many of the results accumulated in previous lecture notes), we will present (in an optional section) the commutative Gelfand-Naimark theorem classifying all commutative unital C*-algebras.
1.10.1. Urysohn's lemma. Let X be a topological space. An indicator function 1_E in this space will not typically be a continuous function (indeed, if X is connected, this only happens when E is the empty set or the whole set). Nevertheless, for certain topological spaces, it is possible to approximate an indicator function by a continuous function, as follows.

Lemma 1.10.1 (Urysohn's lemma). Let X be a topological space. Then the following are equivalent:

(i) Every pair of disjoint closed sets K, L in X can be separated by disjoint open neighbourhoods U ⊃ K, V ⊃ L.
(ii) For every closed set K in X and every open neighbourhood U of K, there exists an open set V and a closed set L such that K ⊂ V ⊂ L ⊂ U.
(iii) For every pair of disjoint closed sets K, L in X, there exists a continuous function f : X → [0, 1] which equals 1 on K and 0 on L.
(iv) For every closed set K in X and every open neighbourhood U of K, there exists a continuous function f : X → [0, 1] such that 1_K(x) ≤ f(x) ≤ 1_U(x) for all x ∈ X.

A topological space which obeys any (and hence all) of (i)-(iv) is known as a normal space; definition (i) is traditionally taken to be the standard definition of normality. We will give some examples of normal spaces shortly.

Proof. The equivalence of (iii) and (iv) is clear, as the complement of a closed set is an open set and vice versa. The equivalence of (i) and (ii) follows similarly.

To deduce (i) from (iii), let K, L be disjoint closed sets, let f be as in (iii), and let U, V be the open sets U := {x ∈ X : f(x) > 2/3} and V := {x ∈ X : f(x) < 1/3}.
The only remaining task is to deduce (iv) from (ii). Suppose we have a closed set K = K_1 and an open set U = U_0 with K_1 ⊂ U_0. Applying (ii), we can find an open set U_{1/2} and a closed set K_{1/2} such that

K_1 ⊂ U_{1/2} ⊂ K_{1/2} ⊂ U_0.

Applying (ii) two more times, we can find more open sets U_{1/4}, U_{3/4} and closed sets K_{1/4}, K_{3/4} such that

K_1 ⊂ U_{3/4} ⊂ K_{3/4} ⊂ U_{1/2} ⊂ K_{1/2} ⊂ U_{1/4} ⊂ K_{1/4} ⊂ U_0.

Iterating this process, we can construct open sets U_q and closed sets K_q for every dyadic rational q = a/2^n in (0, 1) such that U_q ⊂ K_q for all 0 < q < 1, and K_q ⊂ U_{q′} for any 0 ≤ q′ < q ≤ 1.

If we now define f(x) := sup{q : x ∈ U_q} = inf{q : x ∉ K_q}, where q ranges over dyadic rationals between 0 and 1, and with the convention that the empty supremum is 0 and the empty infimum is 1, one easily verifies that the sets {f(x) > λ} = ⋃_{q>λ} U_q and {f(x) < λ} = ⋃_{q<λ} X∖K_q are open for every real number λ, and so f is continuous as required.
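When X happens to be a metric space, the dyadic construction above can be bypassed: the explicit quotient f(x) = d(x, L)/(d(x, K) + d(x, L)) already witnesses (iii). The sketch below checks this on two ad hoc finite point sets standing in for the disjoint closed sets K and L.

```python
import numpy as np

# Metric-space Urysohn function: f = 1 on K, f = 0 on L, values in [0, 1].
K = np.array([0.0, 0.5, 1.0])        # sample points of a closed set K
L = np.array([3.0, 4.0])             # sample points of a disjoint closed set L

def dist(x, S):
    return np.min(np.abs(x - S))

def urysohn(x):
    dK, dL = dist(x, K), dist(x, L)
    return dL / (dK + dL)            # continuous since d(x,K) + d(x,L) > 0

xs = np.linspace(-1, 5, 601)
vals = np.array([urysohn(x) for x in xs])
print(urysohn(0.5), urysohn(3.0), vals.min(), vals.max())
```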
Exercise 1.10.4. Let N^R be the space of tuples (n_x)_{x∈R} of natural numbers indexed by the real line, with the product topology. For j = 1, 2, let K_j be the set of tuples (n_x)_{x∈R} such that there do not exist distinct reals x ≠ x′ with n_x = n_{x′} ≠ j (i.e. the tuple is injective outside of the value j).

• Show that K_1, K_2 are disjoint and closed.
• Show that given any open neighbourhood U of K_1, there exist disjoint finite subsets A_1, A_2, ... of R and an injective function f : ⋃_{i=1}^∞ A_i → N such that for any j ≥ 0, any tuple (m_x)_{x∈R} which satisfies m_x = f(x) for all x ∈ A_1 ∪ ... ∪ A_j and is identically 1 on A_{j+1}, lies in U.
• Show that any open neighbourhood of K_1 and any open neighbourhood of K_2 necessarily intersect, and so N^R is not normal.
• Conclude that R^R with the product topology is not normal.
The property of being normal is a topological one, thus if one topological space is normal, then any other topological space homeomorphic to it is also normal. However (unlike, say, the Hausdorff property), the property of being normal is not preserved under passage to subspaces:

Exercise 1.10.5. Give an example of a subspace of a normal space which is not normal. (Hint: use Exercise 1.10.4, possibly after replacing R with a homeomorphic equivalent.)
Let C_c(X → R) be the space of real continuous compactly supported functions on X. Urysohn's lemma generates a large number of useful elements of C_c(X → R), in the case when X is locally compact Hausdorff (LCH):

Exercise 1.10.6. Let X be a locally compact Hausdorff space, let K be a compact set, and let U be an open neighbourhood of K. Show that there exists f ∈ C_c(X → R) such that 1_K(x) ≤ f(x) ≤ 1_U(x) for all x ∈ X. (Hint: First use the local compactness of X to find a neighbourhood of K with compact closure; then restrict U to this neighbourhood. The closure of U is now a compact set; restrict everything to this set, at which point the space becomes normal.)
One consequence of this exercise is that C_c(X → R) tends to be dense in many other function spaces. We give an important example here:
Definition 1.10.2 (Radon measure). Let X be a locally compact Hausdorff space that is also σ-compact, and let B be the Borel σ-algebra. An (unsigned) Radon measure is an unsigned measure μ : B → [0, +∞] with the following properties:

• (Local finiteness) For any compact subset K of X, μ(K) is finite.
• (Outer regularity) For any Borel set E of X, μ(E) = inf{μ(U) : U ⊃ E, U open}.
• (Inner regularity) For any Borel set E of X, μ(E) = sup{μ(K) : K ⊂ E, K compact}.

Example 1.10.3. Lebesgue measure m on R^n is a Radon measure, as is any absolutely continuous unsigned measure m_f := f dm, where 0 ≤ f ∈ L^1(R^n, dm). More generally, if μ is Radon and ν is a finite unsigned measure which is absolutely continuous with respect to μ, then ν is Radon. On the other hand, counting measure on R^n is not Radon (it is not locally finite). It is possible to define Radon measures on Hausdorff spaces that are not σ-compact or locally compact, but the theory is more subtle and will not be considered here. We will study Radon measures more thoroughly in the next section.
Proposition 1.10.4. Let X be a locally compact Hausdorff space which is also σ-compact, and let μ be a Radon measure on X. Then for any 0 < p < ∞, C_c(X → R) is a dense subset of (real-valued) L^p(X, μ). In other words, every element of L^p(X, μ) can be expressed as a limit (in L^p(X, μ)) of continuous functions of compact support.

Proof. Since continuous functions of compact support are bounded, and compact sets have finite measure, we see that C_c(X → R) is a subspace of L^p(X, μ). We need to show that the closure of this space contains all of L^p(X, μ).
Let K be a compact set, and let E ⊂ K be a Borel set; then E has finite measure. Applying inner and outer regularity, we can find a sequence of compact sets K_n ⊂ E and open sets U_n ⊃ E such that μ(E∖K_n), μ(U_n∖E) → 0. Applying Exercise 1.10.6, we can then find f_n ∈ C_c(X → R) such that 1_{K_n}(x) ≤ f_n(x) ≤ 1_{U_n}(x). In particular, this implies (by the squeeze theorem) that f_n converges in L^p(X, μ) to 1_E (here we use the finiteness of p); thus 1_E lies in the closure of C_c(X → R) for any measurable subset E of K. By linearity, all simple functions supported on K also lie in this closure; taking closures again, we see that any L^p function supported in K also lies in the closure of C_c(X → R). As X is σ-compact, one can express any non-negative L^p function as a monotone limit of compactly supported functions, and thus every non-negative L^p function lies in the closure of C_c(X → R); by linearity, all L^p functions lie in this space, and the claim follows.
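The approximation scheme in the proof can be mimicked numerically for μ = Lebesgue measure on R (ramp widths below are arbitrary choices): continuous "trapezoid" functions of compact support approximate 1_{[0,1]} in L^p, with the L^p error tending to zero even though the sup-norm error does not.

```python
import numpy as np

# Approximate 1_{[0,1]} in L^2(R, dx) by continuous trapezoids that ramp
# linearly from 0 to 1 over a width-w interval on each side.
dx = 1e-4
x = np.arange(-1, 2, dx)
target = ((x >= 0) & (x <= 1)).astype(float)

def trapezoid(w):
    return np.clip(np.minimum((x + w) / w, (1 + w - x) / w), 0, 1)

p = 2
for w in [0.5, 0.05, 0.005]:
    err = (np.sum(np.abs(trapezoid(w) - target) ** p) * dx) ** (1 / p)
    print(w, err)   # L^p error ~ (2w/3)^{1/2} -> 0 as the ramp width shrinks
```

By contrast, the sup-norm distance between the trapezoid and the indicator stays close to 1 near the jump, consistent with the failure of density in L^∞.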
Of course, the real-valued version of the above proposition immediately implies a complex-valued analogue. On the other hand, the claim fails when p = ∞:

Exercise 1.10.7. Let X be a locally compact Hausdorff space that is σ-compact, and let μ be a Radon measure. Show that the closure of C_c(X → R) in L^∞(X, μ) is C_0(X → R), the space of continuous real-valued functions which vanish at infinity (i.e. for every ε > 0 there exists a compact set K such that |f(x)| ≤ ε for all x ∉ K). Thus, in general, C_c(X → R) is not dense in L^∞(X, μ).
We now turn to the Tietze extension theorem, which extends continuous functions on a closed subset of a normal space to the whole space.

Theorem 1.10.5 (Tietze extension theorem). Let X be a normal topological space, let K be a closed subset of X, and let f ∈ BC(K → R) be a bounded continuous real-valued function on K. Then there exists f̃ ∈ BC(X → R) which extends f, i.e. f̃(x) = f(x) for all x ∈ K.

Proof. Let T : BC(X → R) → BC(K → R) be the restriction map Tg := g|_K; our task is to show that T is surjective. By homogeneity, it suffices to do this for f with ‖f‖_{BC(K→R)} ≤ 1.

By Urysohn's lemma, we can find a continuous function g : X → [−1/3, 1/3] such that g = 1/3 on the closed set {x ∈ K : f ≥ 1/3} and g = −1/3 on the closed set {x ∈ K : f ≤ −1/3}. Now, Tg is not quite equal to f; but observe from construction that f − Tg has sup norm at most 2/3.
Scaling this fact, we conclude that, given any f ∈ BC(K → R), we can find a decomposition f = Tg + f′, where ‖g‖_{BC(X→R)} ≤ (1/3)‖f‖_{BC(K→R)} and ‖f′‖_{BC(K→R)} ≤ (2/3)‖f‖_{BC(K→R)}.

Starting with any f = f_0 ∈ BC(K → R), we can now iterate this construction to express f_n = Tg_n + f_{n+1} for all n = 0, 1, 2, ..., where ‖f_n‖_{BC(K→R)} ≤ (2/3)^n ‖f‖_{BC(K→R)} and ‖g_n‖_{BC(X→R)} ≤ (1/3)(2/3)^n ‖f‖_{BC(K→R)}. As BC(X → R) is a Banach space, we see that Σ_{n=0}^∞ g_n converges absolutely to some limit g ∈ BC(X → R), and that Tg = f, as desired.
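The 1/3-2/3 iteration above can be carried out concretely on the metric space X = [0,2] with closed set K = [0, 0.8] ∪ [1.2, 2] (all choices below, including the sample data f, are arbitrary). Urysohn functions are built from distance quotients, which is legitimate since metric spaces are normal; after n steps the partial extension agrees with f on K up to (2/3)^n ‖f‖.

```python
import numpy as np

dx = 1e-3
X = np.arange(0.0, 2.0 + dx, dx)
maskK = (X <= 0.8) | (X >= 1.2)                      # closed set K inside [0, 2]
f = np.where(X <= 1.0, np.sin(5 * X), np.cos(3 * X))[maskK]   # data on K

def level_fn(A, B, c):
    """Continuous g : X -> [-c/3, c/3] with g = c/3 on A and g = -c/3 on B."""
    if len(A) == 0 and len(B) == 0:
        return np.zeros_like(X)
    if len(A) == 0:
        return np.full_like(X, -c / 3)
    if len(B) == 0:
        return np.full_like(X, c / 3)
    dA = np.min(np.abs(X[:, None] - A[None, :]), axis=1)
    dB = np.min(np.abs(X[:, None] - B[None, :]), axis=1)
    return (c / 3) * (dB - dA) / (dA + dB)

ext = np.zeros_like(X)   # running extension on all of X
rem = f.copy()           # running remainder on K; shrinks by 2/3 per step
for _ in range(60):
    c = np.max(np.abs(rem))
    if c == 0:
        break
    g = level_fn(X[maskK][rem >= c / 3], X[maskK][rem <= -c / 3], c)
    ext += g
    rem -= g[maskK]

print(np.max(np.abs(ext[maskK] - f)))   # extension matches f on K
```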
Remark 1.10.6. Observe that Urysohn's lemma can be viewed as the special case of the Tietze extension theorem when K is the union of two disjoint closed sets, and f is equal to 1 on one of these sets and equal to 0 on the other.
Remark 1.10.7. One can extend the Tietze extension theorem to functions taking values in finite-dimensional vector spaces: if K is a closed subset of a normal space X and f : K → R^n is bounded and continuous, then one has a bounded continuous extension f̃ : X → R^n. Indeed, one simply applies the Tietze extension theorem to each component of f separately. However, if the range space is replaced by a space with a non-trivial topology, then there can be topological obstructions to continuous extension. For instance, a map f : {0, 1} → Y from a two-point set into a topological space Y is always continuous, but can be extended to a continuous map f̃ : R → Y if and only if f(0) and f(1) lie in the same path-connected component of Y. Similarly, if f : S^1 → Y is a map from the unit circle into a topological space Y, then a continuous extension of f from S^1 to R^2 exists if and only if the closed curve f : S^1 → Y is contractible to a point in Y. These sorts of questions require the machinery of algebraic topology to answer them properly, and are beyond the scope of this course.
There are analogues of the Tietze extension theorem in some other categories of functions. For instance, in the Lipschitz category, we have

Exercise 1.10.8. Let X be a metric space, let K be a subset of X, and let f : K → R be a Lipschitz continuous map with some Lipschitz constant A (thus |f(x) − f(y)| ≤ A d(x, y) for all x, y ∈ K). Show that there exists an extension f̃ : X → R of f which is Lipschitz continuous with the same Lipschitz constant A. (Hint: A "greedy" algorithm will work here: pick f̃ to be as large as one can get away with (or as small as one can get away with).)
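One way to realise the "greedy" hint is McShane's explicit formula f̃(x) = inf_{k∈K} (f(k) + A d(x, k)), which is the largest A-Lipschitz extension of f. The sketch below checks, on an arbitrary three-point set K, that this formula restricts to f on K and never exceeds slope A.

```python
import numpy as np

# McShane extension: F(x) = min over k in K of f(k) + A * |x - k|.
K = np.array([0.0, 1.0, 3.0])
fK = np.array([0.0, 2.0, 1.0])

# Smallest Lipschitz constant of f on K:
A = max(abs(fK[i] - fK[j]) / abs(K[i] - K[j])
        for i in range(len(K)) for j in range(len(K)) if i != j)

def F(x):
    return np.min(fK + A * np.abs(x - K))

xs = np.linspace(-2, 5, 701)
vals = np.array([F(x) for x in xs])
slopes = np.abs(np.diff(vals) / np.diff(xs))
print(A, slopes.max())   # the extension's slope never exceeds A
```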
One can also remove the requirement that the function f be bounded in the Tietze extension theorem:

Exercise 1.10.9. Let X be a normal topological space, let K be a closed subset of X, and let f : K → R be a continuous map (not necessarily bounded). Then there exists an extension f̃ : X → R of f which is still continuous. (Hint: first compress f to be bounded by working with, say, arctan(f) (other choices are possible), and apply the usual Tietze extension theorem. There will be some sets in which one cannot invert the compression function, but one can deal with this by a further appeal to Urysohn's lemma to damp the extension out on such sets.)
There is also a locally compact Hausdorff version of the Tietze extension theorem:

Exercise 1.10.10. Let X be locally compact Hausdorff, let K be compact, and let f ∈ C(K → R). Then there exists f̃ ∈ C_c(X → R) which extends f.
Proposition 1.10.4 shows that measurable functions in L^p can be approximated by continuous functions of compact support (cf. Littlewood's second principle). Another approximation result in a similar spirit is Lusin's theorem:
Theorem 1.10.8 (Lusin's theorem). Let X be a locally compact Hausdorff space that is σ-compact, and let μ be a Radon measure. Let f : X → R be a measurable function supported on a set of finite measure, and let ε > 0. Then there exists g ∈ C_c(X → R) which agrees with f outside of a set of measure at most ε.

Proof. Observe that as f is finite everywhere, it is bounded outside of a set of arbitrarily small measure. Thus we may assume without loss of generality that f is bounded. Similarly, as X is σ-compact (or by inner regularity), the support of f differs from a compact set by a set of arbitrarily small measure; so we may assume that f is also supported on a compact set K. By Theorem 1.10.5, it then suffices to show that f is continuous on the complement of an open set of arbitrarily small measure; by outer regularity, we may delete the adjective "open" from the preceding sentence.

As f is bounded and compactly supported, f lies in L^p(X, μ) for every 0 < p < ∞, and using Proposition 1.10.4 and Chebyshev's inequality, it is not hard to find, for each n = 1, 2, ..., a function f_n ∈ C_c(X → R) which differs from f by at most 1/2^n outside of a set of measure at most ε/2^{n+2} (say). In particular, f_n converges uniformly to f outside of a set of measure at most ε/4, and f is therefore continuous outside this set. The claim follows.
Another very useful application of Urysohn's lemma is to create partitions of unity.

Lemma 1.10.9 (Partitions of unity). Let X be a normal topological space, and let (K_α)_{α∈A} be a collection of closed sets that cover X. For each α ∈ A, let U_α be an open neighbourhood of K_α, which are finitely overlapping in the sense that each x ∈ X belongs to at most finitely many of the U_α. Then there exist continuous functions f_α : X → [0, 1] supported on U_α such that Σ_{α∈A} f_α(x) = 1 for all x ∈ X. If X is locally compact Hausdorff instead of normal, and the K_α are compact, then the f_α can be chosen to be compactly supported.

Proof. Suppose first that X is normal. By Urysohn's lemma, one can find for each α ∈ A a continuous function g_α : X → [0, 1] which equals 1 on K_α and is supported on U_α. Observe that the function g := Σ_{α∈A} g_α is well-defined (by the finite overlap of the U_α), continuous, and bounded below by 1 (since the K_α cover X), so we may set f_α := g_α/g. The final claim follows by using Exercise 1.10.6 instead of Urysohn's lemma.
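The normalisation step of the proof is easy to visualise on X = [0,3] with the closed cover K_i = [i, i+1] and open neighbourhoods U_i = (i − 0.5, i + 1.5) (an ad hoc example): take trapezoidal Urysohn functions g_i equal to 1 on K_i and supported in U_i, then divide by their sum, which is bounded below by 1.

```python
import numpy as np

# Partition of unity on X = [0, 3] subordinate to U_i = (i - 0.5, i + 1.5).
x = np.linspace(0.0, 3.0, 301)

def g(i):
    # trapezoid: 0 outside (i - 0.5, i + 1.5), 1 on the closed set [i, i + 1]
    return np.clip(np.minimum((x - (i - 0.5)) / 0.5, ((i + 1.5) - x) / 0.5), 0, 1)

gs = [g(i) for i in range(3)]
total = sum(gs)                  # >= 1 everywhere, since the K_i cover X
fs = [gi / total for gi in gs]   # normalised partition of unity
print(np.allclose(sum(fs), 1.0), total.min())
```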
Exercise 1.10.11. Let X be a topological space. A function f : X → R is said to be upper semi-continuous if f^{-1}((−∞, a)) is open for all real a, and lower semi-continuous if f^{-1}((a, +∞)) is open for all real a.

• Show that an indicator function 1_E is upper semi-continuous if and only if E is closed, and lower semi-continuous if and only if E is open.
• If X is normal, show that a function f is upper semi-continuous if and only if f(x) = inf{g(x) : g ∈ C(X → R), g ≥ f} for all x ∈ X, and lower semi-continuous if and only if f(x) = sup{g(x) : g ∈ C(X → R), g ≤ f} for all x ∈ X, where we write f ≤ g if f(x) ≤ g(x) for all x ∈ X.
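For the second item, the supremum representation of a lower semi-continuous function can be seen concretely for the indicator of the open set E = (0,1) in R, using the continuous functions g_n(x) = min(1, n · dist(x, R∖E)); the grid below is deliberately chosen to avoid the boundary points 0 and 1, an artifact of discretisation.

```python
import numpy as np

# 1_{(0,1)} is lower semi-continuous: it is the pointwise sup of the
# continuous functions g_n(x) = clip(n * dist(x, complement of (0,1)), 0, 1).
x = np.linspace(-0.995, 1.995, 300)          # step 0.01; avoids x = 0 and x = 1
dist_to_complement = np.minimum(np.maximum(x, 0.0), np.maximum(1.0 - x, 0.0))
indicator = ((x > 0) & (x < 1)).astype(float)

sup_g = np.zeros_like(x)
for n in [1, 10, 100, 10000]:
    sup_g = np.maximum(sup_g, np.clip(n * dist_to_complement, 0, 1))

print(np.max(np.abs(sup_g - indicator)))     # sup over n recovers the indicator
```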
1.10.2. The Riesz representation theorem. Let X be a locally compact Hausdorff space which is also σ-compact. In Definition 1.10.2 we defined the notion of a Radon measure. Such measures are quite common in real analysis. For instance, we have the following result.

Theorem 1.10.10. Let μ be a non-negative finite Borel measure on a compact metric space X. Then μ is a Radon measure.

Proof. As μ is finite, it is locally finite, so it suffices to show inner and outer regularity. Let A be the collection of all Borel subsets E of X such that

sup{μ(K) : K ⊂ E, K closed} = inf{μ(U) : U ⊃ E, U open} = μ(E).

It will then suffice to show that every Borel set lies in A (note that as X is compact, a subset K of X is closed if and only if it is compact).
Clearly A contains the empty set and the whole set X, and is closed under complements. It is also closed under finite unions and intersections. Indeed, given two sets E, F ∈ A, we can find sequences K_n ⊂ E ⊂ U_n and L_n ⊂ F ⊂ V_n of closed sets K_n, L_n and open sets U_n, V_n such that μ(K_n), μ(U_n) → μ(E) and μ(L_n), μ(V_n) → μ(F). Since

μ(K_n ∪ L_n) + μ(K_n ∩ L_n) = μ(K_n) + μ(L_n) → μ(E) + μ(F) = μ(E ∪ F) + μ(E ∩ F),

we have (by monotonicity of μ) that

μ(K_n ∪ L_n) → μ(E ∪ F); μ(K_n ∩ L_n) → μ(E ∩ F)

and similarly

μ(U_n ∪ V_n) → μ(E ∪ F); μ(U_n ∩ V_n) → μ(E ∩ F),

and so E ∪ F, E ∩ F ∈ A.
One can also show that A is closed under countable disjoint unions and is thus a σ-algebra. Indeed, given disjoint sets E_n ∈ A and ε > 0, pick a closed K_n ⊂ E_n and an open U_n ⊃ E_n such that μ(E_n∖K_n), μ(U_n∖E_n) ≤ ε/2^n; then

μ(⋃_{n=1}^∞ E_n) ≤ μ(⋃_{n=1}^∞ U_n) ≤ Σ_{n=1}^∞ μ(E_n) + ε

and

μ(⋃_{n=1}^∞ E_n) ≥ μ(⋃_{n=1}^N K_n) ≥ Σ_{n=1}^N μ(E_n) − ε

for any N, and the claim follows from the squeeze test.
To finish the claim it suffices to show that every open set V lies in A. For this it will suffice to show that V is a countable union of closed sets. But as X is a compact metric space, it is separable (Lemma 1.8.6), and so V has a countable dense subset x_1, x_2, .... One then easily verifies that every point in the open set V is contained in a closed ball of rational radius centred at one of the x_i that is in turn contained in V; thus V is the countable union of closed sets as desired.
This result can be extended to more general spaces than compact metric spaces, for instance to Polish spaces (provided that the measure remains finite). For instance:

Exercise 1.10.12. Let X be a locally compact metric space which is σ-compact, and let μ be an unsigned Borel measure which is finite on every compact set. Show that μ is a Radon measure.

When the assumptions on X are weakened, then it is possible to find locally finite Borel measures that are not Radon measures, but they are somewhat pathological in nature.
Exercise 1.10.13. Let X be a locally compact Hausdorff space which is σ-compact, and let μ be a Radon measure. Define an F_σ set to be a countable union of closed sets, and a G_δ set to be a countable intersection of open sets. Show that every Borel set can be expressed as the union of an F_σ set and a set of μ-measure zero, and as a G_δ set with a set of μ-measure zero removed.

Every Radon measure μ generates a functional I_μ on C_c(X → R), defined as I_μ(f) := ∫_X f dμ for every f ∈ C_c(X → R); the integral is finite since μ assigns every compact set a finite measure. Furthermore, I_μ is a linear functional on C_c(X → R) which is positive in the sense that I_μ(f) ≥ 0 whenever f is non-negative. If we place the uniform norm on C_c(X → R), then I_μ is bounded if and only if μ is finite. The Riesz representation theorem asserts, conversely, that all positive linear functionals on C_c(X → R) arise in this fashion:

Theorem 1.10.11 (Riesz representation theorem). Let X be a locally compact Hausdorff space which is σ-compact, and let I : C_c(X → R) → R be a positive linear functional. Then there is a unique Radon measure μ on X such that I = I_μ.

Remark 1.10.12. The σ-compactness hypothesis can be dropped (after relaxing the inner regularity condition to only apply to open sets, rather than to all sets); but I will restrict attention here to the σ-compact case (which already covers a large fraction of the applications of this theorem) as the argument simplifies slightly.
Proof. We first prove the uniqueness, which is quite easy due to all the properties that Radon measures enjoy. Suppose we had two Radon measures μ, μ′ such that I = I_μ = I_{μ′}; in particular, we have

(1.75) ∫_X f dμ = ∫_X f dμ′

for all f ∈ C_c(X → R). Now let K be a compact set, and let U be an open neighbourhood of K. By Exercise 1.10.6, we can find f ∈ C_c(X → R) with 1_K ≤ f ≤ 1_U; applying this to (1.75), we conclude that

μ(U) ≥ μ′(K).

Taking suprema in K and using inner regularity, we conclude that μ(U) ≥ μ′(U); exchanging the roles of μ and μ′, we see that μ and μ′ agree on open sets; by outer regularity we then conclude that μ and μ′ agree on all Borel sets.
Now we prove existence, which is significantly trickier. We will initially make the simplifying assumption that X is compact (so in particular C_c(X → R) = C(X → R) = BC(X → R)), and remove this assumption at the end of the proof.

Observe that I is monotone on C(X → R), thus I(f) ≤ I(g) whenever f ≤ g.

We would like to define the measure μ on Borel sets E by defining μ(E) := I(1_E). This does not work directly, because 1_E is not continuous. To get around this problem we shall begin by extending the functional I to the class BC_lsc(X → R⁺) of bounded lower semi-continuous non-negative functions. We define I(f) for such functions by the formula

I(f) := sup{I(g) : g ∈ C_c(X → R); 0 ≤ g ≤ f}

(cf. Exercise 1.10.11). This definition agrees with the existing definition of I(f) in the case when f is continuous. Since I(1) is finite and I is monotone, one sees that I(f) is finite (and non-negative) for all f ∈ BC_lsc(X → R⁺). One also easily sees that I is monotone on BC_lsc(X → R⁺): I(f) ≤ I(g) whenever f, g ∈ BC_lsc(X → R⁺) and f ≤ g, and homogeneous in the sense that I(cf) = cI(f) for all f ∈ BC_lsc(X → R⁺) and c > 0. It is also easy to verify the super-additivity property I(f + f′) ≥ I(f) + I(f′) for f, f′ ∈ BC_lsc(X → R⁺); this simply reflects the linearity of I on C_c(X → R), together with the fact that if 0 ≤ g ≤ f and 0 ≤ g′ ≤ f′, then 0 ≤ g + g′ ≤ f + f′.
We now complement the super-additivity property with a countably sub-additive one: if f_n ∈ BC_lsc(X → R⁺) is a sequence, and f ∈ BC_lsc(X → R⁺) is such that f(x) ≤ Σ_{n=1}^∞ f_n(x) for all x ∈ X, then I(f) ≤ Σ_{n=1}^∞ I(f_n).

Pick a small 0 < ε < 1. It will suffice to show that I(g) ≤ Σ_{n=1}^∞ I(f_n) + O(ε^{1/2}) (say) whenever g ∈ C_c(X → R) is such that 0 ≤ g ≤ f, and O(ε^{1/2}) denotes a quantity bounded in magnitude by Cε^{1/2}, where C is a quantity that is independent of ε.
Fix g. For every x ∈ X, we can find a neighbourhood U_x of x such that |g(y) − g(x)| ≤ ε for all y ∈ U_x; we can also find N_x > 0 such that Σ_{n=1}^{N_x} f_n(x) ≥ f(x) − ε. By shrinking U_x if necessary, we see from the lower semicontinuity of the f_n and f that we can also ensure that f_n(y) ≥ f_n(x) − ε/2^n for all 1 ≤ n ≤ N_x and y ∈ U_x.

By normality, we can find open neighbourhoods V_x of x whose closure lies in U_x. The V_x form an open cover of X. Since we are assuming X to be compact, we can thus find a finite subcover V_{x_1}, ..., V_{x_k} of X. Applying Lemma 1.10.9, we can thus find a partition of unity 1 = Σ_{j=1}^k ψ_j, where each ψ_j is supported on U_{x_j}.
Let x ∈ X be such that g(x) ≥ ε^{1/2}. Then we can write g(x) = Σ_{j : x ∈ U_{x_j}} g(x) ψ_j(x). If j is in this sum, then |g(x_j) − g(x)| ≤ ε, and thus (for ε small enough) g(x_j) ≥ ε^{1/2}/2, and hence f(x_j) ≥ ε^{1/2}/2. We can then write

1 ≤ (Σ_{n=1}^{N_{x_j}} f_n(x_j)) / f(x_j) + O(ε^{1/2})

and thus

g(x) ≤ Σ_{n=1}^∞ Σ_{j : f(x_j) ≥ ε^{1/2}/2; N_{x_j} ≥ n} (f_n(x_j) / f(x_j)) g(x_j) ψ_j(x) + O(ε^{1/2})

(here we use the fact that Σ_j ψ_j(x) = 1 and that the continuous compactly supported function g is bounded). Observe that only finitely many summands are non-zero. We conclude that

I(g) ≤ Σ_{n=1}^∞ I( Σ_{j : f(x_j) ≥ ε^{1/2}/2; N_{x_j} ≥ n} (f_n(x_j) / f(x_j)) g(x_j) ψ_j ) + O(ε^{1/2})

(here we use that 1 ∈ C_c(X) and so I(1) is finite). On the other hand, for any x ∈ X and any n, the expression

Σ_{j : f(x_j) ≥ ε^{1/2}/2; N_{x_j} ≥ n} (f_n(x_j) / f(x_j)) g(x_j) ψ_j(x)

is bounded from above by

Σ_j f_n(x_j) ψ_j(x);

since f_n(x) ≥ f_n(x_j) − ε/2^n and Σ_j ψ_j(x) = 1, this is bounded above in turn by

ε/2^n + f_n(x).

We conclude that

I(g) ≤ Σ_{n=1}^∞ [I(f_n) + O(ε/2^n)] + O(ε^{1/2})

and the sub-additivity claim follows.

Combining sub-additivity and super-additivity we see that I is additive: I(f + g) = I(f) + I(g) for f, g ∈ BC_lsc(X → R⁺).
Now that we are able to integrate lower semi-continuous functions, we can start defining the Radon measure μ. When U is open, we define μ(U) by

μ(U) := I(1_U),

which is well-defined and non-negative since 1_U is bounded, non-negative and lower semi-continuous. When K is closed, we define μ(K) by complementation:

μ(K) := μ(X) − μ(X∖K);

this is compatible with the definition of μ on open sets by additivity of I, and is also non-negative. The monotonicity of I implies monotonicity of μ: in particular, if a closed set K lies in an open set U, then μ(K) ≤ μ(U).
Given any set E ⊂ X, define the outer measure

μ⁺(E) := inf{μ(U) : E ⊂ U, U open}

and the inner measure

μ⁻(E) := sup{μ(K) : K ⊂ E, K closed},

thus μ⁻(E) ≤ μ⁺(E) ≤ μ(X). We call a set E measurable if μ⁻(E) = μ⁺(E). By arguing as in the proof of Theorem 1.10.10, we see that the class of measurable sets is a Boolean algebra. Next, we claim that every open set U is measurable. Indeed, unwrapping all the definitions we see that

μ(U) = sup{I(f) : f ∈ C_c(X → R); 0 ≤ f ≤ 1_U}.

Each f in this supremum is supported in some closed subset K of U, and from this one easily verifies that μ⁺(U) = μ(U) = μ⁻(U). Similarly, every closed set K is measurable. We can now extend μ to measurable sets by declaring μ(E) := μ⁺(E) = μ⁻(E) when E is measurable; this is compatible with the previous definitions of μ.
Next, let E_1, E_2, ... be a countable sequence of disjoint measurable sets. Then for any ε > 0, we can find open neighbourhoods U_n of E_n and closed sets K_n in E_n such that μ(E_n) ≤ μ(U_n) ≤ μ(E_n) + ε/2^n and μ(E_n) − ε/2^n ≤ μ(K_n) ≤ μ(E_n). Using the sub-additivity of I on BC_lsc(X → R⁺), we have μ(⋃_{n=1}^∞ U_n) ≤ Σ_{n=1}^∞ μ(U_n) ≤ Σ_{n=1}^∞ μ(E_n) + ε. Similarly, from the additivity of I we have μ(⋃_{n=1}^N K_n) = Σ_{n=1}^N μ(K_n) ≥ Σ_{n=1}^N μ(E_n) − ε. Letting ε → 0, we conclude that ⋃_{n=1}^∞ E_n is measurable with μ(⋃_{n=1}^∞ E_n) = Σ_{n=1}^∞ μ(E_n). Thus the Boolean algebra of measurable sets is in fact a σ-algebra, and μ is a countably additive measure on it. From construction we also see that it is finite, outer regular, and inner regular, and is therefore a Radon measure. The only remaining thing to check is that I(f) = I_μ(f) for all f ∈ C_c(X → R); this follows by comparing I and I_μ on bounded lower semi-continuous functions, and we leave the routine verification to the reader.

Finally, we remove the assumption that X is compact. Using σ-compactness and local compactness, one can decompose 1 = Σ_{n=0}^∞ ψ_n into continuous compactly supported functions ψ_n ∈ C_c(X → R⁺), with each x ∈ X being contained in the support of finitely many ψ_n. (Indeed, from σ-compactness and the locally compact Hausdorff property one can find a nested sequence K_1 ⊂ K_2 ⊂ ... of compact sets, with each K_n in the interior of K_{n+1}, such that ⋃_n K_n = X. Using Exercise 1.10.6, one can find functions η_n ∈ C_c(X → R⁺) that equal 1 on K_n and are supported on K_{n+1}; now take ψ_n := η_{n+1} − η_n for n ≥ 1 and ψ_0 := η_1.) Observe that I(f) = Σ_n I(ψ_n f) for all f ∈ C_c(X → R). From the compact case we see that there exists a finite Radon measure μ_n such that I(ψ_n f) = I_{μ_n}(f) for all f ∈ C_c(X → R); setting μ := Σ_n μ_n, one can verify (using the monotone convergence theorem, Theorem 1.1.21) that μ obeys the required properties.
Remark 1.10.13. One can also construct the Radon measure μ using the Carathéodory extension theorem (Theorem 1.1.17); this proof of the Riesz representation theorem can be found in many real analysis texts. A third method is to first create the space L^1 by taking the completion of C_c(X → R) with respect to the L^1 norm ‖f‖_{L^1} := I(|f|), and then define μ(E) := ‖1_E‖_{L^1}. It seems to me that all three proofs are about equally lengthy, and ultimately rely on the same ingredients; they all seem to have their strengths and weaknesses, and involve at least one tricky computation somewhere (in the above argument, the most tricky thing is the countable subadditivity of I on lower semicontinuous functions). I have yet to find a proof of this theorem which is both clean and conceptual, and would be happy to learn of other proofs of this theorem.
Remark 1.10.14. One can use the Riesz representation theorem to provide an alternate construction of Lebesgue measure, say on R. Indeed, the Riemann integral already provides a positive linear functional on C_c(R → R), which by the Riesz representation theorem must come from a Radon measure, which can be easily verified to assign the value b − a to every interval [a, b] and thus must agree with Lebesgue measure. The same approach lets one define volume measures on manifolds with a volume form.
Exercise 1.10.14. Let X be a locally compact Hausdorff space which is σ-compact, and let μ be a Radon measure. For any non-negative Borel measurable function f, show that
∫_X f dμ = inf{∫_X g dμ : g ≥ f; g lower semi-continuous}
and
∫_X f dμ = sup{∫_X g dμ : 0 ≤ g ≤ f; g upper semi-continuous}.
Similarly, for any non-negative lower semi-continuous function g, show that
∫_X g dμ = sup{∫_X h dμ : 0 ≤ h ≤ g; h ∈ C_c(X → R)}.
Now we consider signed functionals on C_c(X → R), which we now turn into a normed vector space using the uniform norm. The key lemma here is the following variant of the Jordan decomposition theorem (Exercise 1.2.5).

Lemma 1.10.15 (Jordan decomposition for functionals). Let I ∈ C_c(X → R)* be a bounded linear functional. Then there exist positive bounded linear functionals I⁺, I⁻ ∈ C_c(X → R)* such that I = I⁺ − I⁻.
Proof. For f ∈ C_c(X → R⁺), we define
I⁺(f) := sup{I(g) : g ∈ C_c(X → R); 0 ≤ g ≤ f}.
Clearly I⁺(f) ≥ 0 and I⁺(f) ≥ I(f) for f ∈ C_c(X → R⁺); one also easily verifies the homogeneity property I⁺(cf) = cI⁺(f) and the super-additivity property I⁺(f_1 + f_2) ≥ I⁺(f_1) + I⁺(f_2) for c > 0 and f, f_1, f_2 ∈ C_c(X → R⁺). On the other hand, if g, f_1, f_2 ∈ C_c(X → R⁺) are such that g ≤ f_1 + f_2, then we can decompose g = g_1 + g_2 for some g_1, g_2 ∈ C_c(X → R⁺) with g_1 ≤ f_1 and g_2 ≤ f_2; for instance we can take g_1 := min(g, f_1) and g_2 := g − g_1. From this we can complement super-additivity with sub-additivity and conclude that I⁺(f_1 + f_2) = I⁺(f_1) + I⁺(f_2).

Every function in C_c(X → R) can be expressed as the difference of two functions in C_c(X → R⁺). From the additivity and homogeneity of I⁺ on C_c(X → R⁺) we may thus extend I⁺ uniquely to be a linear functional on C_c(X → R). Since I is bounded on C_c(X → R), we see that I⁺ is also. If we then define I⁻ := I⁺ − I, one quickly verifies all the required properties.
Exercise 1.10.15. Show that the functionals I⁺, I⁻ appearing in the above lemma are unique.
Define a signed Radon measure on a σ-compact, locally compact Hausdorff space X to be a signed Borel measure whose positive and negative variations are both Radon. It is easy to see that a signed Radon measure μ generates a bounded linear functional I_μ on C_c(X → R) as before.

Exercise 1.10.16 (Riesz representation theorem, signed version). Let X be a locally compact Hausdorff space which is σ-compact. Show that every bounded linear functional I ∈ C_c(X → R)* is of the form I = I_μ for exactly one signed Radon measure μ. (Hint: combine Theorem 1.10.11 with Lemma 1.10.15.)

The space of signed finite Radon measures on X is denoted M(X → R), or M(X) for short.
Exercise 1.10.17. Show that the space M(X), with the total variation norm ‖μ‖_{M(X)} := |μ|(X), is a real Banach space, which is isomorphic to the dual of both C_c(X → R) and its completion C_0(X → R), thus
C_c(X → R)* ≡ C_0(X → R)* ≡ M(X).
Remark 1.10.16. Note that the previous exercise generalises the identifications c_c(N)* ≡ c_0(N)* ≡ ℓ^1(N) from previous notes. For compact Hausdorff spaces X, we have C(X → R) = C_0(X → R), and thus C(X → R)* ≡ M(X). For bounded continuous functions on a locally compact Hausdorff space one instead has BC(X → R)* ≡ M(βX), where βX is the Stone-Čech compactification of X.

We say that a sequence μ_n of Radon measures converges vaguely (i.e. in the vague topology) to a Radon measure μ if ∫_X f dμ_n → ∫_X f dμ for all f ∈ C_c(X → R).

Exercise 1.10.19. Let X := R with the usual topology, and let δ_x denote the Dirac mass at x.
Show that the measures (1/n) ∑_{i=1}^n δ_{i/n} converge vaguely as n → ∞ to the measure m|_{[0,1]}. (Hint: continuous, compactly supported functions are Riemann integrable.)
Show that the measures δ_n converge vaguely as n → ∞ to the zero measure 0.
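The first limit in the preceding exercise lends itself to a quick numerical check (an illustration of ours, not part of the text; the helper names are hypothetical): pairing the measures (1/n) ∑_{i=1}^n δ_{i/n} with a continuous compactly supported test function produces exactly a Riemann sum, which converges to the pairing with m|_{[0,1]}.

```python
# Pair mu_n := (1/n) * sum_{i=1}^n delta_{i/n} with a continuous, compactly
# supported test function phi; the pairing is a Riemann sum converging to the
# integral of phi over [0,1], i.e. the pairing with Lebesgue measure on [0,1].

def pair_with_mu_n(phi, n):
    """Integral of phi against (1/n) * sum_{i=1}^n delta_{i/n}."""
    return sum(phi(i / n) for i in range(1, n + 1)) / n

def phi(x):
    """A continuous tent function supported on [0, 2]."""
    return max(0.0, 1.0 - abs(x - 1.0))

# On [0,1] we have phi(x) = x, so the limiting pairing is 1/2.
exact = 0.5
errs = [abs(pair_with_mu_n(phi, n) - exact) for n in (10, 100, 1000)]
```

The second limit reflects escape of mass to infinity: for compactly supported phi, the pairing with δ_n vanishes as soon as n leaves the support of phi.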
Exercise 1.10.20. Let X be locally compact Hausdorff and σ-compact. Show that for every unsigned Radon measure μ, the map from L^1(μ) to M(X) defined by sending f ∈ L^1(μ) to the measure f dμ is an isometry, thus L^1(μ) can be identified with a subspace of M(X). Show that this subspace is closed in the norm topology, but give an example to show that it need not be closed in the vague topology. Show that M(X) = ⋃_μ L^1(μ), where μ ranges over all unsigned Radon measures on X; thus one can think of M(X) as many L^1's glued together.
Exercise 1.10.21. Let X be a locally compact Hausdorff space which is σ-compact. Let f_n ∈ C_0(X → R) be a sequence of functions, and let f ∈ C_0(X → R) be another function. Show that f_n converges weakly to f in C_0(X → R) if and only if the f_n are uniformly bounded and converge pointwise to f.
Exercise 1.10.22. Let X be a locally compact metric space which is σ-compact.
Show that the space of finitely supported measures in M(X) is a dense subset of M(X) in the vague topology.
Show that a Radon probability measure in M(X) can be expressed as the vague limit of a sequence of discrete (i.e. finitely supported) probability measures.
1.10.3. The Stone-Weierstrass theorem. We have already seen how rough functions (e.g. L^p functions) can be approximated by continuous functions. Now we study in turn how continuous functions can be approximated by even more special functions, such as polynomials. The natural topology to work with here is the uniform topology (since uniform limits of continuous functions are continuous).

For non-compact spaces, such as R, it is usually not possible to approximate continuous functions uniformly by a smaller class of functions. For instance, the function sin(x) cannot be approximated uniformly by polynomials on R, since sin(x) is bounded, the only bounded polynomials are the constants, and constants cannot converge to anything other than another constant. On the other hand, on a compact domain such as [−1, 1], one can easily approximate sin(x) uniformly by polynomials, for instance by using Taylor series. So we will focus instead on compact Hausdorff spaces X such as [−1, 1], in which continuous functions are automatically bounded.
The space P([−1, 1]) of (real-valued) polynomials is a subspace of the Banach space C([−1, 1]). But it is also closed under pointwise multiplication f, g ↦ fg, making P([−1, 1]) an algebra, and not merely a vector space. We can then rephrase the classical Weierstrass approximation theorem as the assertion that P([−1, 1]) is dense in C([−1, 1]).

One can then ask the more general question of when a sub-algebra A of C(X) - i.e. a subspace closed under pointwise multiplication - is dense. Not every sub-algebra is dense: the algebra of constants, for instance, will not be dense in C(X) when X has at least two points. Another example in a similar spirit: given two distinct points x_1, x_2 in X, the space {f ∈ C(X) : f(x_1) = f(x_2)} is a sub-algebra of C(X), but it is not dense, because it is already closed, and cannot separate x_1 and x_2 in the sense that it cannot produce a function that assigns different values to x_1 and x_2.
The remarkable Stone-Weierstrass theorem shows that this inability to separate points is the only obstruction to density, at least for algebras with the identity.

Theorem 1.10.18 (Stone-Weierstrass theorem, real version). Let X be a compact Hausdorff space, and let A be a sub-algebra of C(X → R) which contains the constant function 1 and separates points (i.e. for every distinct x_1, x_2 ∈ X, there exists at least one f in A such that f(x_1) ≠ f(x_2)). Then A is dense in C(X → R).
Remark 1.10.19. Observe that this theorem contains the Weierstrass approximation theorem as a special case, since the algebra of polynomials clearly separates points. Indeed, we will use (a very special case of) the Weierstrass approximation theorem in the proof.

Proof. It suffices to verify the claim for algebras A which are closed in the C(X → R) topology, since the claim follows in the general case by replacing A with its closure (note that the closure of an algebra is still an algebra).
Observe from the Weierstrass approximation theorem that on any bounded interval [−K, K], the function |x| can be expressed as the uniform limit of polynomials P_n(x); one can even write down explicit formulae for such a P_n, though we will not need such formulae here. Since continuous functions on the compact space X are bounded, this implies that for any f ∈ A, the function |f| is the uniform limit of polynomial combinations P_n(f) of f. As A is an algebra, the P_n(f) lie in A; as A is closed, we see that |f| lies in A.

Using the identities max(f, g) = (f+g)/2 + |(f−g)/2| and min(f, g) = (f+g)/2 − |(f−g)/2|, we conclude that A is a lattice in the sense that one has max(f, g), min(f, g) ∈ A whenever f, g ∈ A.
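The polynomial approximants to |x| invoked above can be made explicit; the following sketch (ours, not the text's own construction) uses the classical iteration q ← q + (s − q²)/2, whose iterates are polynomials in s increasing uniformly to √s on [0, 1], so that substituting s = t² gives polynomials in t converging uniformly to |t| on [−1, 1]:

```python
# Polynomials converging uniformly to |t| on [-1,1], as used in the proof:
# q_{n+1}(s) = q_n(s) + (s - q_n(s)^2)/2 increases to sqrt(s) on [0,1],
# and composing with s = t^2 yields polynomial approximants to |t|.

def sqrt_approx(s, n):
    """Value at s of the n-th iterate q_n (a polynomial in s)."""
    q = 0.0
    for _ in range(n):
        q += (s - q * q) / 2.0
    return q

def abs_approx(t, n):
    """Polynomial approximation to |t| on [-1, 1], via |t| = sqrt(t^2)."""
    return sqrt_approx(t * t, n)

grid = [i / 100.0 for i in range(-100, 101)]

def uniform_error(n):
    return max(abs(abs(t) - abs_approx(t, n)) for t in grid)
```

Feeding an element f of the algebra into these polynomials gives |f| in the closure, and the max/min identities above then produce the lattice operations.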
Now let f ∈ C(X → R) and ε > 0. We would like to find g ∈ A such that |f(x) − g(x)| ≤ ε for all x ∈ X.

Given any two points x, y ∈ X, we can at least find a function g_{xy} ∈ A such that g_{xy}(x) = f(x) and g_{xy}(y) = f(y); this follows since the vector space A separates points and also contains the identity function (the case x = y needs to be treated separately). We now use these functions g_{xy} to build the approximant g. First, observe from continuity that for every x, y ∈ X there exists an open neighbourhood V_{xy} of y such that g_{xy}(y′) ≥ f(y′) − ε for all y′ ∈ V_{xy}. By compactness, for any fixed x we can cover X by a finite number of these V_{xy}. Taking the max of all the g_{xy} associated to this finite subcover, we create another function g_x ∈ A such that g_x(x) = f(x) and g_x(y) ≥ f(y) − ε for all y ∈ X. By continuity, we can find an open neighbourhood U_x of x such that g_x(x′) ≤ f(x′) + ε for all x′ ∈ U_x. Again applying compactness, we can cover X by a finite number of the U_x; taking the min of all the g_x associated to this finite subcover we obtain g ∈ A with f(x) − ε ≤ g(x) ≤ f(x) + ε for all x ∈ X, and the claim follows.
There is an analogue of the Stone-Weierstrass theorem for algebras that do not contain the identity:

Exercise 1.10.23. Let X be a compact Hausdorff space, and let A be a closed sub-algebra of C(X → R) which separates points but does not contain the identity. Show that there exists a unique x_0 ∈ X such that A = {f ∈ C(X → R) : f(x_0) = 0}.
The Stone-Weierstrass theorem is not true as stated in the complex case. For instance, the space C(D → C) of complex-valued continuous functions on the closed unit disk D := {z ∈ C : |z| ≤ 1} has a closed proper sub-algebra that separates points, namely the algebra H(D) of functions in C(D → C) that are holomorphic on the interior of this disk. Indeed, by Cauchy's theorem and its converse (Morera's theorem), a function f ∈ C(D → C) lies in H(D) if and only if ∮_γ f = 0 for every closed contour γ in D, and one easily verifies that this implies that H(D) is closed; meanwhile, the holomorphic function z ↦ z separates all points. However, the Stone-Weierstrass theorem can be recovered in the complex case by adding one further axiom, namely that the algebra be closed under conjugation:

Exercise 1.10.24 (Stone-Weierstrass theorem, complex version). Let X be a compact Hausdorff space, and let A be a complex sub-algebra of C(X → C) which contains the constant function 1, separates points, and is closed under the conjugation operation f ↦ f̄. Then A is dense in C(X → C).
Exercise 1.10.25. Let T ⊂ C(R/Z → C) be the space of trigonometric polynomials x ↦ ∑_{n=−N}^N c_n e^{2πinx}, where N ≥ 0 and the c_n are complex numbers. Show that T is dense in C(R/Z → C) (with the uniform topology), and that T is dense in L^p(R/Z → C) (with the L^p topology) for all 0 < p < ∞.
Exercise 1.10.26. Let X be a locally compact Hausdorff space that is σ-compact, and let A be a sub-algebra of C(X → R) which separates points and contains the identity function. Show that for every function f ∈ C(X → R) there exists a sequence f_n ∈ A which converges to f uniformly on compact subsets of X.
Exercise 1.10.27. Let X, Y be compact Hausdorff spaces. Show that every function f ∈ C(X × Y → R) can be expressed as the uniform limit of functions of the form (x, y) ↦ ∑_{j=1}^k f_j(x) g_j(y), where f_j ∈ C(X → R) and g_j ∈ C(Y → R).
Exercise 1.10.28. Let (X_α)_{α∈A} be a family of compact Hausdorff spaces, and let X := ∏_{α∈A} X_α be their product (with the product topology). Show that every f ∈ C(X → R) is the uniform limit of functions f_n, each of which depends on only finitely many of the coordinates, i.e. there exists a finite subset A_n of A and a function g_n ∈ C(∏_{α∈A_n} X_α → R) such that f_n((x_α)_{α∈A}) = g_n((x_α)_{α∈A_n}) for all (x_α)_{α∈A} ∈ X.
One useful application of the Stone-Weierstrass theorem is to demonstrate separability of spaces such as C(X).

Proposition 1.10.20. Let X be a compact metric space. Then C(X → C) and C(X → R) are separable.

Proof. It suffices to show that C(X → R) is separable. By Lemma 1.8.6, X has a countable dense subset x_1, x_2, …. By Urysohn's lemma, for each n, m ≥ 1 we can find a function φ_{n,m} ∈ C(X → R) which equals 1 on B(x_n, 1/m) and is supported on B(x_n, 2/m). The φ_{n,m} can then easily be verified to separate points, and so by the Stone-Weierstrass theorem, the algebra of polynomial combinations of the φ_{n,m} in C(X → R) is dense; this implies that the algebra of rational polynomial combinations of the φ_{n,m} is dense, and the claim follows.
Combining this with the Riesz representation theorem and the sequential Banach-Alaoglu theorem (Theorem 1.9.14), we obtain

Corollary 1.10.21. If X is a compact metric space, then the closed unit ball of M(X) is sequentially compact in the vague topology.

Combining this with Theorem 1.10.10, we conclude a special case of Prokhorov's theorem:

Corollary 1.10.22 (Prokhorov's theorem, compact case). Let X be a compact metric space, and let μ_n be a sequence of Borel (hence Radon) probability measures on X. Then there exists a subsequence of the μ_n which converges vaguely to another Borel probability measure μ.
Exercise 1.10.29 (Prokhorov's theorem, non-compact case). Let X be a locally compact metric space which is σ-compact, and let μ_n be a sequence of Borel probability measures. We assume that the sequence μ_n is tight, which means that for every ε > 0 there exists a compact set K such that μ_n(X∖K) ≤ ε for all n. Show that there is a subsequence of the μ_n which converges vaguely to another Borel probability measure μ. If tightness is not assumed, show that there is a subsequence which converges vaguely to a non-negative Borel measure μ, but give an example to show that this measure need not be a probability measure.
This theorem can be used to establish Helly's selection theorem:

Exercise 1.10.30 (Helly's selection theorem). Let f_n : R → R be a sequence of functions whose total variation is uniformly bounded in n, and which is bounded at one point x_0 ∈ R (i.e. {f_n(x_0) : n = 1, 2, …} is bounded). Show that there exists a subsequence of the f_n which converges pointwise. (Hint: one can deduce this from Prokhorov's theorem using the fundamental theorem of calculus for functions of bounded variation.)
1.10.4. The commutative Gelfand-Naimark theorem (optional). One particularly beautiful application of the machinery developed in the last few notes is the commutative Gelfand-Naimark theorem, which classifies commutative C*-algebras.

Definition 1.10.23 (C*-algebra). A C*-algebra is a complex Banach algebra A with an anti-linear map x ↦ x* from A to A which is an isometry (thus ‖x*‖ = ‖x‖ for all x ∈ A), an involution (thus (x*)* = x for all x ∈ A), and an anti-homomorphism (thus (xy)* = y*x* for all x, y ∈ A), and which obeys the C* identity
‖x*x‖ = ‖x‖²
for all x ∈ A.

A homomorphism φ : A → B between two C*-algebras is a continuous algebra homomorphism such that φ(x*) = φ(x)* for all x ∈ A. An isomorphism is a homomorphism whose inverse exists and is also a homomorphism; two C*-algebras are isomorphic when there is an isomorphism between them. For instance, if X is a compact Hausdorff space, then C(X → C) is a unital commutative C*-algebra with the involution f* := f̄ (Exercise 1.10.32).
The remarkable (unital commutative) Gelfand-Naimark theorem asserts the converse statement to Exercise 1.10.32:

Theorem 1.10.24 (Unital commutative Gelfand-Naimark theorem). Every unital commutative C*-algebra A is isomorphic to C(X → C) for some compact Hausdorff space X.

Exercise 1.10.33. Show that if A is a unital Banach algebra, then the set A^× of invertible elements of A is open.

Define the spectrum σ(x) of an element x ∈ A to be the set of all z ∈ C such that x − z·1 is not invertible.
Exercise 1.10.34. If A is a unital Banach algebra and x ∈ A, show that σ(x) is a compact subset of C that is contained inside the disk {z ∈ C : |z| ≤ ‖x‖}.
Exercise 1.10.35 (Beurling-Gelfand spectral radius formula). If A is a unital Banach algebra and x ∈ A, show that σ(x) is non-empty with sup{|z| : z ∈ σ(x)} = lim_{n→∞} ‖x^n‖^{1/n}. (Hint: To get the upper bound, observe that if x^n − z^n·1 is invertible for some n ≥ 1, then so is x − z·1, then use Exercise 1.10.34. To get the lower bound, first observe that for any λ ∈ A*, the function f_λ : z ↦ λ((x − z·1)^{−1}) is holomorphic on the complement of σ(x), which is already enough (with Liouville's theorem) to show that σ(x) is non-empty. Let r > sup{|z| : z ∈ σ(x)} be arbitrary, then use Laurent series to show that |λ(x^n)| ≤ C_{λ,r} r^n for all n and some C_{λ,r} independent of n. Then divide by r^n and use the uniform boundedness principle to conclude.)
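As a concrete illustration of the spectral radius formula (ours, not the text's), one can work in the Banach algebra of 2×2 real matrices with the submultiplicative Frobenius norm; the matrix below is triangular with eigenvalues 0.5 and 0.3, so the formula predicts ‖x^n‖^{1/n} → 0.5:

```python
# Illustration of the spectral radius formula in the Banach algebra of 2x2 real
# matrices with the (submultiplicative) Frobenius norm.  The matrix x below is
# triangular with eigenvalues 0.5 and 0.3, so sup{|z| : z in sigma(x)} = 0.5.
import math

def mat_mult(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def frob(a):
    return math.sqrt(sum(a[i][j] ** 2 for i in range(2) for j in range(2)))

x = [[0.5, 2.0], [0.0, 0.3]]

def root_norm(n):
    """Compute ||x^n||^(1/n) in the Frobenius norm."""
    p = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        p = mat_mult(p, x)
    return frob(p) ** (1.0 / n)
```

Note that ‖x‖ itself exceeds 2, so the bound of Exercise 1.10.34 is far from sharp here; it is the limit, not any single term, that recovers the spectral radius.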
Exercise 1.10.36. Let A be a C*-algebra. Show that
‖x‖ = ‖(x*x)^{2^n}‖^{1/2^{n+1}} = ‖(xx*)^{2^n}‖^{1/2^{n+1}}
for all n ≥ 1 and x ∈ A. Conclude that any homomorphism between C*-algebras has operator norm at most 1.

Now let A be a unital commutative C*-algebra. A character of A is an element λ ∈ A* which is also a unital algebra homomorphism from A to C (thus λ(1) = 1 and λ(xy) = λ(x)λ(y) for all x, y in the algebra). We let Â ⊂ A* denote the collection of all characters of A; this is known as the spectrum of A.

Exercise 1.10.37. If A is a unital commutative C*-algebra, show that Â is a compact Hausdorff subset of A* in the weak-* topology.

Exercise 1.10.38. Define an ideal of a unital commutative C*-algebra A to be a proper subspace I of A such that xy, yx ∈ I for all x ∈ I and y ∈ A. Show that if λ ∈ Â, then the kernel λ^{−1}(0) is a maximal ideal in A; conversely, if I is a maximal ideal in A, show that I is closed, and there is exactly one λ ∈ Â such that I = λ^{−1}(0). Thus the spectrum of A can be canonically identified with the space of maximal ideals in A.
Exercise 1.10.39. Let X be a compact Hausdorff space, and let A be the C*-algebra C(X → C). Show that the spectrum Â of A can be canonically identified with X.

Given a unital commutative C*-algebra A, define the Gelfand representation x ↦ x̂ from A to C(Â → C) by the formula x̂(λ) := λ(x).

Exercise 1.10.40. Show that if A is a unital commutative C*-algebra, then the Gelfand representation is a homomorphism of C*-algebras.

Exercise 1.10.41. Let x be a non-invertible element of a unital commutative C*-algebra A. Show that there exists a character λ ∈ Â such that λ(x) = 0.

Exercise 1.10.42. Show that if A is a unital commutative C*-algebra, then the Gelfand representation is an isometry. (Hint: use Exercise 1.10.36 and Exercise 1.10.41.)

Exercise 1.10.43. Use the complex Stone-Weierstrass theorem and Exercises 1.10.40, 1.10.42 to conclude the proof of Theorem 1.10.24.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/03/02.
Thanks to Anush Tserunyan, Haokun Xu, Max Baroi, mmailliw/william, PDEbeginner, and anonymous commenters for corrections.
Eric noted another example of a locally compact Hausdorff space which was not normal, namely (ω + 1) × (ω_1 + 1) ∖ {(ω, ω_1)}, where ω is the first infinite ordinal, and ω_1 is the first uncountable ordinal (endowed with the order topology, of course).
1.11. Interpolation of L^p spaces
In the previous sections, we have been focusing largely on the soft side of real analysis, which is primarily concerned with qualitative properties such as convergence, compactness, measurability, and so forth. In contrast, we will now emphasise the hard side of real analysis, in which we study estimates and upper and lower bounds of various quantities, such as norms of functions or operators. (Of course, the two sides of analysis are closely connected to each other; an understanding of both sides and their interrelationships is needed in order to get the broadest and most complete perspective for this subject; see Section 1.3 of Structure and Randomness for more discussion.)
One basic tool in hard analysis is that of interpolation, which allows one to start with a hypothesis of two (or more) upper bound estimates, e.g.

(1.76) A_0 ≤ B_0

and

(1.77) A_1 ≤ B_1,

and conclude a family of intermediate estimates

(1.78) A_θ ≤ B_θ

(or maybe A_θ ≤ CB_θ, where C is a constant) for all 0 ≤ θ ≤ 1, where

(1.79) A_θ := A_0^{1−θ} A_1^θ

and

(1.80) B_θ := B_0^{1−θ} B_1^θ;

indeed one simply raises (1.76) to the power 1 − θ, (1.77) to the power θ, and multiplies the two inequalities together. Thus for instance, when θ = 1/2 one obtains the geometric mean of (1.76) and (1.77):

A_0^{1/2} A_1^{1/2} ≤ B_0^{1/2} B_1^{1/2}.

One can view A_θ and B_θ as weighted geometric means of the endpoint quantities A_0, A_1 and B_0, B_1, so that log A_θ and log B_θ depend linearly on θ. A typical example of such a quantity is A_θ = A L^{1/p_θ}, where 0 < p_θ ≤ ∞ is the exponent given by 1/p_θ = (1−θ)/p_0 + θ/p_1.
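This deduction (raise the endpoint bounds to the powers 1 − θ and θ and multiply) can be checked mechanically on random data; the sketch below is ours, not from the text:

```python
# If 0 <= A0 <= B0 and 0 <= A1 <= B1 with B0, B1 > 0, then for 0 <= theta < 1
#     A0^(1-theta) * A1^theta <= B0^(1-theta) * B1^theta,
# since raising to a nonnegative power preserves order on the nonnegative reals.
import random

def geom_interp(a0, a1, theta):
    """The geometric-mean interpolant a0^(1-theta) * a1^theta."""
    return a0 ** (1 - theta) * a1 ** theta

random.seed(0)
checks = []
for _ in range(1000):
    b0 = random.uniform(0.1, 10.0)
    b1 = random.uniform(0.1, 10.0)
    a0 = random.uniform(0.0, b0)   # endpoint bound A0 <= B0
    a1 = random.uniform(0.0, b1)   # endpoint bound A1 <= B1
    theta = random.random()
    checks.append(geom_interp(a0, a1, theta) <= geom_interp(b0, b1, theta) + 1e-12)
```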
The deduction of (1.78) from (1.76), (1.77) is utterly trivial, but there are still some useful lessons to be drawn from it. For instance, let us take A_0 = A_1 = A for simplicity, so we are interpolating two upper bounds A ≤ B_0, A ≤ B_1 on the same quantity A to give a new bound

(1.81) A ≤ B_θ min(B_0/B_1, B_1/B_0)^ε

for any sufficiently small ε > 0 (indeed one can take any ε less than or equal to min(θ, 1 − θ)). Indeed one sees this simply by applying (1.78) with θ replaced by θ − ε and θ + ε and taking minima. Thus we see that (1.78) is only sharp when the two original bounds B_0, B_1 are comparable; if instead we have B_1 ∼ 2^n B_0 for some integer n, then (1.81) tells us that we can improve (1.78) by an exponentially decaying factor of 2^{−ε|n|}. The geometric series formula tells us that such factors are absolutely summable, and so in practice it is often a useful heuristic to pretend that the n = O(1) cases dominate so strongly that the other cases can be viewed as negligible by comparison.
Also, one can trivially extend the deduction of (1.78) from (1.76), (1.77) as follows: if the quantity A_θ, rather than being defined by (1.79), is merely log-convex in θ (so that A_θ ≤ A_0^{1−θ} A_1^θ), then the conclusion (1.78) is unaffected. Note also that any log-convex quantity is convex, thus in particular A_θ ≤ (1−θ)A_0 + θA_1 for all 0 ≤ θ ≤ 1 (note that this generalises the arithmetic mean-geometric mean inequality). Of course, the converse statement is not true.
Now we turn to the complex version of the interpolation of log-convex functions, a result known as Lindelöf's theorem:

Theorem 1.11.3 (Lindelöf's theorem). Let s ↦ f(s) be a holomorphic function on the strip S := {σ + it : 0 ≤ σ ≤ 1; t ∈ R} which obeys the bound

(1.82) |f(σ + it)| ≤ A exp(exp((π − δ)|t|))

for all σ + it ∈ S and some constants A, δ > 0. Suppose also that |f(0 + it)| ≤ B_0 and |f(1 + it)| ≤ B_1 for all t ∈ R. Then we have |f(θ + it)| ≤ B_θ for all θ + it ∈ S, where B_θ is of course defined by (1.80).

Remark 1.11.4. The hypothesis (1.82) is a qualitative hypothesis rather than a quantitative one, since the exact values of A, δ do not show up in the conclusion. It is quite a mild condition; any function of exponential growth in t, or even with such super-exponential growth as O(|t|^{|t|}) or O(e^{|t|^{O(1)}}), will obey (1.82). The principle however fails without this hypothesis, as one can see for instance by considering the holomorphic function f(s) := exp(−i exp(πis)).
Proof. Observe that the function s ↦ B_0^{1−s} B_1^s is holomorphic and non-zero on S, and has magnitude exactly B_θ on the vertical line {θ + it : t ∈ R} for each 0 ≤ θ ≤ 1. Dividing f by this function, we may thus normalise B_θ = 1 for all 0 ≤ θ ≤ 1.

Suppose we temporarily assume that f(σ + it) → 0 as |σ + it| → ∞. Then by the maximum modulus principle (applied to a sufficiently large rectangular portion of the strip), it must then attain a maximum on one of the two sides of the strip. But |f| ≤ 1 on these two sides, and so |f| ≤ 1 on the interior as well.

To remove the assumption that f goes to zero at infinity, we use the trick of giving ourselves an epsilon of room (Section 2.7). Namely, we multiply f(s) by a holomorphic factor g_ε(s) which decays at infinity in the strip fast enough to dominate the growth permitted by (1.82), and which tends to 1 locally uniformly as ε → 0; applying the previous argument to fg_ε and then letting ε → 0 gives the claim.
Exercise 1.11.2 (Phragmén-Lindelöf principle). Let f be as in Theorem 1.11.3, but with the bounds |f(0 + it)| ≤ B_0(1 + |t|)^{a_0} and |f(1 + it)| ≤ B_1(1 + |t|)^{a_1} for all t ∈ R and some a_0, a_1 ≥ 0. Show that |f(θ + it)| ≤ C B_θ (1 + |t|)^{(1−θ)a_0+θa_1} for all θ + it ∈ S and some constant C depending on a_0, a_1.

We now recall the L^p norms. Given a measure space (X, A, μ) and an exponent 0 < p < ∞, we set ‖f‖_{L^p(X)} := (∫_X |f|^p dμ)^{1/p}; the space L^∞(X) is defined similarly, but where ‖f‖_{L^∞(X)} is the essential supremum of |f| on X.
A simple test case in which to understand the L^p norms better is that of a step function f = A·1_E, where A is a non-negative number and E a set of finite measure. Then one has ‖f‖_{L^p(X)} = A μ(E)^{1/p} for 0 < p ≤ ∞. Observe that this is a log-convex function of 1/p.
This is a general phenomenon:
Lemma 1.11.5 (Log-convexity of L^p norms). Let 0 < p_0 < p_1 ≤ ∞, and let f ∈ L^{p_0}(X) ∩ L^{p_1}(X). Then f ∈ L^p(X) for all p_0 ≤ p ≤ p_1, and furthermore we have
‖f‖_{L^{p_θ}(X)} ≤ ‖f‖_{L^{p_0}(X)}^{1−θ} ‖f‖_{L^{p_1}(X)}^θ
for all 0 ≤ θ ≤ 1, where the exponent p_θ is defined by 1/p_θ := (1−θ)/p_0 + θ/p_1.
In particular, we see that the function 1/p ↦ ‖f‖_{L^p(X)} is log-convex whenever the right-hand side is finite (and is in fact log-convex for all 0 ≤ 1/p < ∞, if one extends the definition of log-convexity to functions that can take the value +∞). In other words, we can interpolate any two bounds ‖f‖_{L^{p_0}(X)} ≤ B_0 and ‖f‖_{L^{p_1}(X)} ≤ B_1 to obtain ‖f‖_{L^{p_θ}(X)} ≤ B_θ for all 0 ≤ θ ≤ 1.
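Before turning to the proofs, here is a numerical check of the lemma (ours, not from the text) on a finite measure space, namely counting measure on twenty points:

```python
# Check ||f||_{p_theta} <= ||f||_{p0}^(1-theta) * ||f||_{p1}^theta for counting
# measure, where 1/p_theta = (1-theta)/p0 + theta/p1.
import random

def lp_norm(f, p):
    """L^p norm of f with respect to counting measure."""
    return sum(abs(v) ** p for v in f) ** (1.0 / p)

random.seed(1)
f = [random.uniform(0.1, 5.0) for _ in range(20)]
p0, p1 = 1.0, 4.0
ok = True
for k in range(1, 10):
    theta = k / 10.0
    p_theta = 1.0 / ((1 - theta) / p0 + theta / p1)
    lhs = lp_norm(f, p_theta)
    rhs = lp_norm(f, p0) ** (1 - theta) * lp_norm(f, p1) ** theta
    ok = ok and lhs <= rhs + 1e-9
```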
Let us give several proofs of this lemma. We will focus on the case p_1 < ∞; the endpoint case p_1 = ∞ can be proven directly, or by modifying the arguments below, or by using an appropriate limiting argument, and we leave the details to the reader.
The first proof is to use Hölder's inequality
‖f‖_{L^{p_θ}(X)}^{p_θ} = ∫_X |f|^{(1−θ)p_θ} |f|^{θp_θ} dμ ≤ ‖ |f|^{(1−θ)p_θ} ‖_{L^{p_0/((1−θ)p_θ)}(X)} ‖ |f|^{θp_θ} ‖_{L^{p_1/(θp_θ)}(X)}
when p_1 is finite (with some minor modifications in the case p_1 = ∞).
Another (closely related) proof proceeds by using the log-convexity inequality
|f(x)|^{p_θ} ≤ (1 − λ)|f(x)|^{p_0} + λ|f(x)|^{p_1}
for all x, where 0 < λ < 1 is the quantity such that p_θ = (1 − λ)p_0 + λp_1. If one integrates this inequality in x, one already obtains the claim in the normalised case when ‖f‖_{L^{p_0}(X)} = ‖f‖_{L^{p_1}(X)} = 1. To obtain the general case, one can multiply the function f and the measure μ by appropriately chosen constants to obtain the above normalisation; we leave the details as an exercise to the reader. (The case when ‖f‖_{L^{p_0}(X)} or ‖f‖_{L^{p_1}(X)} vanishes is of course easy to handle separately.)
A third approach is more in the spirit of the real interpolation method, avoiding the use of convexity arguments. As in the second proof, we can reduce to the normalised case ‖f‖_{L^{p_0}(X)} = ‖f‖_{L^{p_1}(X)} = 1. We then split f = f·1_{|f|≤1} + f·1_{|f|>1}, where 1_{|f|≤1} is the indicator function of the set {x : |f(x)| ≤ 1}, and similarly for 1_{|f|>1}. Observe that
‖f·1_{|f|≤1}‖_{L^{p_θ}(X)}^{p_θ} = ∫_{|f|≤1} |f|^{p_θ} dμ ≤ ∫_X |f|^{p_0} dμ = 1
and similarly
‖f·1_{|f|>1}‖_{L^{p_θ}(X)}^{p_θ} = ∫_{|f|>1} |f|^{p_θ} dμ ≤ ∫_X |f|^{p_1} dμ = 1,
and so by the quasi-triangle inequality (or triangle inequality, when p_θ ≥ 1)
‖f‖_{L^{p_θ}(X)} ≤ C
for some constant C depending on p_θ. (The same splitting also demonstrates the inclusion L^{p_θ}(X) ⊂ L^{p_0}(X) + L^{p_1}(X).)
This is off by a constant factor from what we want. But one can eliminate this constant by using the tensor power trick (Section 1.9 of Structure and Randomness). Indeed, if one replaces X with a Cartesian power X^M (with the product σ-algebra A^M and product measure μ^M), and replaces f by the tensor power f^{⊗M} : (x_1, …, x_M) ↦ f(x_1) ⋯ f(x_M), we see from many applications of the Fubini-Tonelli theorem that
‖f^{⊗M}‖_{L^p(X^M)} = ‖f‖_{L^p(X)}^M
for all p. In particular, f^{⊗M} obeys the same normalisation hypotheses as f, and thus by applying the previous inequality to f^{⊗M}, we obtain
‖f‖_{L^{p_θ}(X)}^M ≤ C
for every M, where it is key to note that the constant C on the right is independent of M. Taking M-th roots and then sending M → ∞, we obtain the claim.
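The identity ‖f^{⊗M}‖_{L^p} = ‖f‖_{L^p}^M that powers the tensor power trick can itself be verified directly for counting measure on a finite set (an illustration of ours, not from the text):

```python
# For counting measure, the sum over tuples of prod |f(x_i)|^p factorises as
# (sum |f|^p)^M, giving ||f^{⊗M}||_p = ||f||_p^M exactly.
from itertools import product
from math import prod

def lp_norm(f, p):
    return sum(abs(v) ** p for v in f) ** (1.0 / p)

def tensor_power(f, M):
    """Values f(x_1) * ... * f(x_M) of the tensor power, over all M-tuples."""
    return [prod(tup) for tup in product(f, repeat=M)]

f = [0.5, 2.0, 3.0]
p = 1.7
lhs = lp_norm(tensor_power(f, 3), p)
rhs = lp_norm(f, p) ** 3
```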
Finally, we give a fourth proof in the spirit of the complex interpolation method. By replacing f by |f| we may assume f is non-negative. By expressing non-negative measurable functions as the monotone limit of simple functions and using the monotone convergence theorem (Theorem 1.1.21), we may assume that f is a simple function, which is then necessarily of finite measure support from the L^{p_0} finiteness hypothesis. Now consider the function
s ↦ ∫_X |f|^{(1−s)p_0 + sp_1} dμ.
Expanding f out in terms of step functions we see that this is an analytic function of s which grows at most exponentially in s; also, by the triangle inequality this function has magnitude at most ∫_X |f|^{p_0} dμ when s = 0 + it and magnitude at most ∫_X |f|^{p_1} dμ when s = 1 + it. Applying Theorem 1.11.3 and specialising to s := λ (with λ as in the second proof) we obtain the claim.
Exercise 1.11.5. If 0 < θ < 1, show that equality holds in Lemma 1.11.5 if and only if |f| is a step function.
Now we consider variants of interpolation in which the strong L^p spaces are replaced by their weak counterparts L^{p,∞}. Given a measurable function f : X → C, we define the distribution function λ_f : R⁺ → [0, +∞] by the formula
λ_f(t) := μ({x ∈ X : |f(x)| ≥ t}) = ∫_X 1_{|f|≥t} dμ.
This distribution function is closely connected to the L^p norms. Indeed, from the calculus identity
|f(x)|^p = p ∫_0^∞ 1_{|f|≥t} t^p dt/t
and the Fubini-Tonelli theorem, we obtain the formula

(1.84) ‖f‖_{L^p(X)}^p = p ∫_0^∞ λ_f(t) t^p dt/t

for all 0 < p < ∞; thus the L^p norms are essentially moments of the distribution function. The L^∞ norm is related to the distribution function by the formula
‖f‖_{L^∞(X)} = inf{t ≥ 0 : λ_f(t) = 0}.
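For counting measure on a finite set the distribution function is a step function and the integral in (1.84) can be evaluated exactly, so the formula can be checked directly (a sketch of ours, not from the text):

```python
# Check (1.84): ||f||_p^p = p * int_0^infty lambda_f(t) t^{p-1} dt, where
# lambda_f(t) = #{x : |f(x)| >= t} under counting measure.  The integrand is
# piecewise monomial, so the integral splits exactly at the jumps of lambda_f.

def dist_fn(f, t):
    return sum(1 for v in f if abs(v) >= t)

def layer_cake(f, p):
    """Evaluate p * int_0^infty lambda_f(t) t^{p-1} dt exactly."""
    jumps = sorted(set(abs(v) for v in f))
    total, prev = 0.0, 0.0
    for a in jumps:
        lam = dist_fn(f, (prev + a) / 2)     # lambda_f on the interval (prev, a)
        total += lam * (a ** p - prev ** p)  # p * int_prev^a t^{p-1} dt = a^p - prev^p
        prev = a
    return total

f = [0.5, 2.0, 2.0, 3.5]
p = 2.3
lhs = sum(abs(v) ** p for v in f)
rhs = layer_cake(f, p)
```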
Exercise 1.11.6. Show that we have the relationship
‖f‖_{L^p(X)}^p ∼_p ∑_{n∈Z} λ_f(2^n) 2^{np}
for any measurable f : X → C and 0 < p < ∞, where we use X ∼_p Y to denote a pair of inequalities of the form c_p Y ≤ X ≤ C_p Y for some constants c_p, C_p > 0 depending only on p. (Hint: λ_f(t) is non-increasing in t.) Thus we can relate the L^p norms of f to the dyadic values λ_f(2^n) of the distribution function; indeed, for any 0 < p ≤ ∞, ‖f‖_{L^p(X)} is comparable (up to constant factors depending on p) to the ℓ^p(Z) norm of the sequence n ↦ 2^n λ_f(2^n)^{1/p}.
Another relationship between the L^p norms and the distribution function is given by observing that
‖f‖_{L^p(X)}^p = ∫_X |f|^p dμ ≥ ∫_{|f|≥t} t^p dμ = t^p λ_f(t)
for any t > 0, leading to Chebyshev's inequality
λ_f(t) ≤ ‖f‖_{L^p(X)}^p / t^p.
(The p = 1 version of this inequality is also known as Markov's inequality. In probability theory, Chebyshev's inequality is often specialised to the case p = 2, and with f replaced by a normalised function f − Ef. Note that, as with many other Cyrillic names, there are also a large number of alternative spellings of Chebyshev in the Roman alphabet.)
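Chebyshev's inequality is easy to stress-test numerically (an illustration of ours, not from the text), here with counting measure on 500 Gaussian samples:

```python
# Check lambda_f(t) <= ||f||_p^p / t^p for several thresholds t and exponents p.
import random

def dist_fn(f, t):
    return sum(1 for v in f if abs(v) >= t)

def lp_norm_p(f, p):
    """||f||_p^p under counting measure."""
    return sum(abs(v) ** p for v in f)

random.seed(2)
f = [random.gauss(0.0, 1.0) for _ in range(500)]
ok = all(dist_fn(f, t) <= lp_norm_p(f, p) / t ** p
         for t in (0.5, 1.0, 2.0, 3.0)
         for p in (1, 2, 4))
```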
Chebyshev's inequality motivates one to define the weak L^p norm ‖f‖_{L^{p,∞}(X)} of a measurable function f : X → C for 0 < p < ∞ by the formula
‖f‖_{L^{p,∞}(X)} := sup_{t>0} t λ_f(t)^{1/p},
thus Chebyshev's inequality can be expressed succinctly as
‖f‖_{L^{p,∞}(X)} ≤ ‖f‖_{L^p(X)}.
It is also natural to adopt the convention that ‖f‖_{L^{∞,∞}(X)} = ‖f‖_{L^∞(X)}.
If f, g : X → C are two functions, we have the inclusion
{|f + g| ≥ t} ⊂ {|f| ≥ t/2} ∪ {|g| ≥ t/2}
and hence
λ_{f+g}(t) ≤ λ_f(t/2) + λ_g(t/2);
this easily leads to the quasi-triangle inequality
‖f + g‖_{L^{p,∞}(X)} ≲_p ‖f‖_{L^{p,∞}(X)} + ‖g‖_{L^{p,∞}(X)},
where we use X ≲_p Y as shorthand for the inequality X ≤ C_p Y for some constant C_p depending only on p (it can be a different constant at each use of the ≲_p notation). [Note: in analytic number theory, it is more customary to use ≪_p instead of ≲_p, following Vinogradov. However, in analysis ≪ is sometimes used instead to denote "much smaller than", e.g. X ≪ Y denotes the assertion X ≤ cY for some sufficiently small constant c.]
Let L^{p,∞}(X) be the space of all f : X → C which have finite L^{p,∞}(X) quasi-norm, modulo almost everywhere equivalence; this space is also known as weak L^p(X). The quasi-triangle inequality soon implies that L^{p,∞}(X) is a quasi-normed vector space with the L^{p,∞}(X) quasi-norm, and Chebyshev's inequality asserts that L^{p,∞}(X) contains L^p(X) as a subspace (though the L^p norm is not the restriction of the L^{p,∞}(X) norm).
Example 1.11.6. If X = R^n with the usual measure, and 0 < p < ∞, then the function f(x) := |x|^{−n/p} is in weak L^p, but not strong L^p. It is also not in strong or weak L^q for any other q. But the local component |x|^{−n/p} 1_{|x|≤1} of f is in strong and weak L^q for all q < p, and the global component |x|^{−n/p} 1_{|x|>1} of f is in strong and weak L^q for all q > p.
Exercise 1.11.7. For any 0 < p, q ≤ ∞ and f : X → C, define the (dyadic) Lorentz norm ‖f‖_{L^{p,q}(X)} to be the ℓ^q(Z) norm of the sequence n ↦ 2^n λ_f(2^n)^{1/p}, and define the Lorentz space L^{p,q}(X) to be the space of functions f with ‖f‖_{L^{p,q}(X)} finite, modulo almost everywhere equivalence. Show that L^{p,q}(X) is a quasi-normed space, which is equivalent to L^{p,∞}(X) when q = ∞ and to L^p(X) when q = p. Lorentz spaces arise naturally in more refined applications of the real interpolation method, and are useful in certain endpoint estimates that fail for Lebesgue spaces, but which can be rescued by using Lorentz spaces instead. However, we will not pursue these applications in detail here.
Exercise 1.11.8. Let $X$ be a finite set with counting measure, and let $f : X \to \mathbf{C}$ be a function. For any $0 < p < \infty$, show that
$$\|f\|_{L^{p,\infty}(X)} \leq \|f\|_{L^p(X)} \lesssim_p \log(1 + |X|)^{1/p}\, \|f\|_{L^{p,\infty}(X)}.$$
(Hint: to prove the second inequality, normalise $\|f\|_{L^{p,\infty}(X)} = 1$, and then manually dispose of the regions of $X$ where $f$ is too large or too small.) Thus, in some sense, weak $L^p$ and strong $L^p$ are equivalent up to logarithmic factors.
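As a numerical sanity check of both inequalities (the test function $f(j) = j^{-1/p}$ on $X = \{1, \ldots, N\}$ is our own choice, picked so that the strong norm picks up a harmonic series):

```python
import math

# X = {1,...,N} with counting measure; f(j) = j^{-1/p}, here with p = 2.
N, p = 10000, 2.0
vals = sorted((j ** (-1.0 / p) for j in range(1, N + 1)), reverse=True)

# Weak L^p quasi-norm: sup over thresholds t just below vals[k] of
# t * (#{|f| > t})^{1/p} = vals[k] * (k+1)^{1/p}.
weak = max(v * (k + 1) ** (1.0 / p) for k, v in enumerate(vals))

# Strong L^p norm: here ||f||_p^p = sum_j 1/j = H_N, of size ~ log N.
strong = sum(v ** p for v in vals) ** (1.0 / p)
print(weak, strong, math.log(1 + N) ** (1.0 / p))
```

Here `weak` is exactly $1$, while `strong` $= H_N^{1/p} \approx (\log N)^{1/p}$, matching the logarithmic loss in the exercise.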
One can interpolate weak $L^p$ bounds just as one can strong $L^p$ bounds: if $\|f\|_{L^{p_0,\infty}(X)} \leq B_0$ and $\|f\|_{L^{p_1,\infty}(X)} \leq B_1$, then

(1.85) $\|f\|_{L^{p_\theta,\infty}(X)} \leq B_\theta$.

Indeed, the hypotheses give the distribution function bounds $\lambda_f(t) \leq B_0^{p_0} t^{-p_0}$ and $\lambda_f(t) \leq B_1^{p_1} t^{-p_1}$ for all $t > 0$, and hence by scalar interpolation (using an interpolation parameter $0 < \theta < 1$ defined by $p_\theta = (1-\theta) p_0 + \theta p_1$, and after doing some algebra) we have

(1.86) $\lambda_f(t) \leq B_\theta^{p_\theta} t^{-p_\theta}$

for all $t > 0$, where $B_\theta^{p_\theta} := B_0^{(1-\theta)p_0} B_1^{\theta p_1}$; this gives (1.85). In fact, taking the minimum of the two hypotheses gives the stronger bound $\lambda_f(t) \leq B_\theta^{p_\theta} t^{-p_\theta} \min(t/t_0, t_0/t)^{c}$ for all $t > 0$, where $t_0$ is the crossover point at which the two hypotheses agree and $c := \min(p_\theta - p_0, p_1 - p_\theta) > 0$ (taking $p_0 < p_\theta < p_1$ finite for sake of discussion); integrating this in $t$, one concludes that $f$ in fact lies in strong $L^{p_\theta}(X)$, with $\|f\|_{L^{p_\theta}(X)} \lesssim_{p_0,p_1,\theta} B_\theta$. In particular, one has the inclusions $L^{p_\theta}(X), L^{p_\theta,\infty}(X) \subset L^{p_0}(X) + L^{p_1}(X) \subset L^{p_0,\infty}(X) + L^{p_1,\infty}(X)$.
Define the strong type diagram of a function $f : X \to \mathbf{C}$ to be the set of all $1/p$ for which $f$ lies in strong $L^p$, and the weak type diagram to be the set of all $1/p$ for which $f$ lies in weak $L^p$. Then both the strong and weak type diagrams are connected subsets of $[0, +\infty)$, and the strong type diagram is contained in the weak type diagram, and contains in turn the interior of the weak type diagram. By experimenting with linear combinations of the examples in Example 1.11.6 we see that this is basically everything one can say about the strong and weak type diagrams, without further information on $f$ or $X$.
Exercise 1.11.10. Let $f : X \to \mathbf{C}$ be a measurable function which is finite almost everywhere. Show that there exists a unique non-increasing left-continuous function $f^* : \mathbf{R}^+ \to \mathbf{R}^+$ such that $\lambda_{f^*}(t) = \lambda_f(t)$ for all $t \geq 0$, and in particular $\|f\|_{L^p(X)} = \|f^*\|_{L^p(\mathbf{R}^+)}$ for all $0 < p \leq \infty$, and $\|f\|_{L^{p,\infty}(X)} = \|f^*\|_{L^{p,\infty}(\mathbf{R}^+)}$. (Hint: first look for the formula that describes $f^*$ in terms of the distribution function $\lambda_f$.)

Exercise 1.11.11. Let $1 < p < \infty$, and let $f \in L^{p,\infty}(X)$. Show that $|\int_X f 1_E\ d\mu| \lesssim_p \mu(E)^{1/p'} \|f\|_{L^{p,\infty}(X)}$ for every set $E$ of finite measure, where $p'$ is the dual exponent of $p$; conversely, show that the quantity $\sup_E \mu(E)^{-1/p'} |\int_X f 1_E\ d\mu|$, where $E$ ranges over all sets of positive finite measure, is a genuine norm on $L^{p,\infty}(X)$ which is equivalent to the $L^{p,\infty}(X)$ quasinorm.
Exercise 1.11.12. Let $n > 1$ be an integer. Find a probability space $(X, \mathcal{X}, \mu)$ and functions $f_1, \ldots, f_n : X \to \mathbf{R}$ with $\|f_j\|_{L^{1,\infty}(X)} \leq 1$ for $j = 1, \ldots, n$ such that $\|\sum_{j=1}^n f_j\|_{L^{1,\infty}(X)} \geq cn \log n$ for some absolute constant $c > 0$. (Hint: exploit the logarithmic divergence of the harmonic series $\sum_{j=1}^\infty \frac{1}{j}$.) Conclude that there exists a probability space $X$ such that the $L^{1,\infty}(X)$ quasi-norm is not equivalent to an actual norm.
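One possible realisation of the hint (sketched numerically below; the specific space and functions are our own choice, not dictated by the exercise): take $X = \{0, \ldots, N-1\}$ with uniform probability measure and let the $f_j$ be the $N$ cyclic shifts of $g(x) = N/(x + \frac12)$. Each shift has weak $L^1$ norm $2$, while their sum is the constant $\sum_x g(x) \approx N \log N$:

```python
import math

N = 1024
# Uniform probability measure on X = {0,...,N-1}.
g = [N / (x + 0.5) for x in range(N)]

def weak_l1(values):
    # sup_t t * mu{|f| > t}; the sup is attained at thresholds just
    # below the sorted values of |f|.
    v = sorted(values, reverse=True)
    return max(v[k] * (k + 1) / N for k in range(N))

# Each cyclic shift f_j of g merely permutes its values, so every f_j has
# the same weak-L^1 norm as g (namely 2); their sum is the constant
# function sum_x g(x), whose weak-L^1 norm is that constant itself.
total = sum(g)
print(weak_l1(g), total / (N * math.log(N)))
```

Since $\sum_j \|f_j\|_{L^{1,\infty}} = 2N$ while $\|\sum_j f_j\|_{L^{1,\infty}} \approx N \log N$, no norm equivalent to the quasi-norm can exist.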
Exercise 1.11.13. Let $(X, \mathcal{X}, \mu)$ be a $\sigma$-finite measure space, let $0 < p < \infty$, and let $f : X \to \mathbf{C}$ be a measurable function. Show that the following are equivalent:
$f$ lies in $L^{p,\infty}(X)$.
There exists a constant $C$ such that for every set $E$ of finite measure, there exists a subset $E'$ of $E$ with $\mu(E') \geq \frac{1}{2}\mu(E)$ such that $|\int_X f 1_{E'}\ d\mu| \leq C \mu(E)^{1-1/p}$.
Exercise 1.11.14. Let $(X, \mathcal{X}, \mu)$ be a measure space of finite measure, and let $f : X \to \mathbf{C}$ be a measurable function. Show that the following two statements are equivalent:
There exists a constant $C > 0$ such that $\|f\|_{L^p(X)} \leq Cp$ for all $1 \leq p < \infty$.
There exists a constant $c > 0$ such that $\int_X e^{c|f|}\ d\mu < \infty$.
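As a concrete instance (our own example, not from the text): on $[0,1]$ with Lebesgue measure, $f(x) = \log(1/x)$ has $\|f\|_{L^p}^p = \Gamma(p+1)$ (substitute $x = e^{-u}$), so $\|f\|_{L^p} \leq p$ for $p \geq 1$, while $\int_0^1 e^{c|f|}\,dx = \int_0^1 x^{-c}\,dx = \frac{1}{1-c}$ is finite for $c < 1$:

```python
import math

# f(x) = log(1/x) on [0,1]: ||f||_{L^p}^p = Gamma(p+1).
norms = {p: math.gamma(p + 1) ** (1.0 / p) for p in (1.0, 2.0, 4.0, 8.0, 16.0)}
# Linear growth ||f||_p <= C p holds with C = 1:
ok = all(norms[p] <= p + 1e-12 for p in norms)
# Exponential integrability: int_0^1 e^{c f} dx = 1/(1-c) for c < 1.
c = 0.5
exp_integral = 1.0 / (1.0 - c)
print(norms, ok, exp_integral)
```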
1.11.3. Interpolation of operators. We turn at last to the central topic of these notes, which is interpolation of operators $T$ between functions on two fixed measure spaces $X = (X, \mathcal{X}, \mu)$ and $Y = (Y, \mathcal{Y}, \nu)$. To avoid some (very minor) technicalities we will make the mild assumption throughout that $X$ and $Y$ are both $\sigma$-finite, although much of the theory here extends to the non-$\sigma$-finite setting.

A typical situation is that of a linear operator $T$ which maps one $L^{p_0}(X)$ space to another $L^{q_0}(Y)$, and also maps $L^{p_1}(X)$ to $L^{q_1}(Y)$ for some exponents $0 < p_0, p_1, q_0, q_1 \leq \infty$; thus (by linearity) $T$ will map the larger vector space $L^{p_0}(X) + L^{p_1}(X)$ to $L^{q_0}(Y) + L^{q_1}(Y)$, and one has some estimates of the form

(1.88) $\|Tf\|_{L^{q_0}(Y)} \leq B_0 \|f\|_{L^{p_0}(X)}$

and

(1.89) $\|Tf\|_{L^{q_1}(Y)} \leq B_1 \|f\|_{L^{p_1}(X)}$

for all $f \in L^{p_0}(X)$, $f \in L^{p_1}(X)$ respectively, and some $B_0, B_1 > 0$. We would like to then interpolate to say something about how $T$ maps $L^{p_\theta}(X)$ to $L^{q_\theta}(Y)$.
The complex interpolation method gives a satisfactory result as long as the exponents allow one to use duality methods, a result known as the Riesz-Thorin theorem:

Theorem 1.11.7 (Riesz-Thorin theorem). Let $0 < p_0, p_1 \leq \infty$ and $1 \leq q_0, q_1 \leq \infty$. Let $T : L^{p_0}(X) + L^{p_1}(X) \to L^{q_0}(Y) + L^{q_1}(Y)$ be a linear operator obeying the bounds (1.88), (1.89) for all $f \in L^{p_0}(X)$, $f \in L^{p_1}(X)$ respectively, and some $B_0, B_1 > 0$. Then we have
$$\|Tf\|_{L^{q_\theta}(Y)} \leq B_\theta \|f\|_{L^{p_\theta}(X)}$$
for all $0 < \theta < 1$ and $f \in L^{p_\theta}(X)$, where $1/p_\theta := (1-\theta)/p_0 + \theta/p_1$, $1/q_\theta := (1-\theta)/q_0 + \theta/q_1$, and $B_\theta := B_0^{1-\theta} B_1^{\theta}$.

Remark 1.11.8. When $X$ is a point, this theorem essentially collapses to Lemma 1.11.5 (and when $Y$ is a point, this is a dual formulation of that lemma); and when $X$ and $Y$ are both points, this collapses to interpolation of scalars.
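A finite-dimensional special case worth keeping in mind: taking $X = Y$ to be finite sets with counting measure and $(p_0, q_0) = (1, 1)$, $(p_1, q_1) = (\infty, \infty)$, the theorem gives the classical matrix bound $\|A\|_{\ell^2 \to \ell^2} \leq \|A\|_{\ell^1 \to \ell^1}^{1/2}\, \|A\|_{\ell^\infty \to \ell^\infty}^{1/2}$, where the $1 \to 1$ norm is the largest absolute column sum and the $\infty \to \infty$ norm the largest absolute row sum. A quick numerical check on an arbitrarily chosen matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, -1.0],
              [2.0, -0.5, 1.5]])
norm_1   = np.abs(A).sum(axis=0).max()   # operator norm l^1 -> l^1 (max column sum)
norm_inf = np.abs(A).sum(axis=1).max()   # operator norm l^inf -> l^inf (max row sum)
norm_2   = np.linalg.norm(A, 2)          # spectral norm l^2 -> l^2
print(norm_2, (norm_1 * norm_inf) ** 0.5)
```

The printed spectral norm is dominated by the interpolated bound, as the theorem predicts.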
Proof. If $p_0 = p_1$ then the claim follows from Lemma 1.11.5, so we may assume $p_0 \neq p_1$, which in particular forces $p_\theta$ to be finite. By symmetry we can take $p_0 < p_1$. By multiplying the measures $\mu$ and $\nu$ (or the operator $T$) by various constants, we can normalise $B_0 = B_1 = 1$ (the case when $B_0 = 0$ or $B_1 = 0$ is trivial). Thus we have $B_\theta = 1$ also.

By Hölder's inequality, the bound (1.88) implies that

(1.90) $|\int_Y (Tf) g\ d\nu| \leq \|f\|_{L^{p_0}(X)} \|g\|_{L^{q_0'}(Y)}$

for all $f \in L^{p_0}(X)$ and $g \in L^{q_0'}(Y)$, where $q_0'$ is the dual exponent of $q_0$. Similarly we have

(1.91) $|\int_Y (Tf) g\ d\nu| \leq \|f\|_{L^{p_1}(X)} \|g\|_{L^{q_1'}(Y)}$

for all $f \in L^{p_1}(X)$ and $g \in L^{q_1'}(Y)$.
We now claim that

(1.92) $|\int_Y (Tf) g\ d\nu| \leq \|f\|_{L^{p_\theta}(X)} \|g\|_{L^{q_\theta'}(Y)}$

for all $f, g$ that are simple functions with finite measure support. To see this, we first normalise $\|f\|_{L^{p_\theta}(X)} = \|g\|_{L^{q_\theta'}(Y)} = 1$. Observe that we can write $f = |f| \operatorname{sgn}(f)$, $g = |g| \operatorname{sgn}(g)$ for some functions $\operatorname{sgn}(f), \operatorname{sgn}(g)$ of magnitude at most $1$. If we then introduce the quantity
$$F(s) := \int_Y \big(T[\,|f|^{(1-s)p_\theta/p_0 + s p_\theta/p_1} \operatorname{sgn}(f)\,]\big)\, \big[\,|g|^{(1-s)q_\theta'/q_0' + s q_\theta'/q_1'} \operatorname{sgn}(g)\,\big]\ d\nu$$
(with the conventions that $q_\theta'/q_0' = q_\theta'/q_1' = 1$ in the endpoint case $q_0' = q_1' = q_\theta' = \infty$), then $F$ is a bounded holomorphic function of $s$ on the strip $\{0 \leq \operatorname{Re}(s) \leq 1\}$ which equals $\int_Y (Tf) g\ d\nu$ at $s = \theta$, and which is bounded in magnitude by $1$ on the boundary lines $\operatorname{Re}(s) = 0$ and $\operatorname{Re}(s) = 1$, by (1.90) and (1.91) respectively; the claim (1.92) then follows from the three lines lemma.

One can then extend (1.92) to arbitrary $f \in L^{p_\theta}(X)$, by decomposing $f = f 1_{|f| < 1} + f 1_{|f| \geq 1}$, approximating the former in $L^{p_\theta}(X) \cap L^{p_1}(X)$ by simple functions of finite measure support, and approximating the latter in $L^{p_\theta}(X) \cap L^{p_0}(X)$ by simple functions of finite measure support, and taking limits using (1.90), (1.91) to justify the passage to the limit. One can then also allow arbitrary $g \in L^{q_\theta'}(Y)$ by using the monotone convergence theorem (Theorem 1.1.21). The claim now follows from the duality between $L^{q_\theta}(Y)$ and $L^{q_\theta'}(Y)$.
Suppose one has a linear operator $T$ that maps simple functions of finite measure support on $X$ to measurable functions on $Y$ (modulo almost everywhere equivalence). We say that such an operator is of strong type $(p, q)$ if it can be extended in a continuous fashion to an operator from $L^p(X)$ to $L^q(Y)$; this is equivalent to having an estimate of the form $\|Tf\|_{L^q(Y)} \leq B \|f\|_{L^p(X)}$ for all simple functions $f$ of finite measure support. (The extension is unique if $p$ is finite or if $X$ has finite measure, due to the density of simple functions of finite measure support in those cases. Annoyingly, uniqueness fails for $L^\infty$ of an infinite measure space.) By duality arguments similar to those used to prove (1.90) and (1.91), one sees that if a linear operator $T$ is of strong type $(p, q)$ for some $1 \leq p, q \leq \infty$, then its formal adjoint $T^*$ is of strong type $(q', p')$. Thus, taking formal adjoints reflects the strong type diagram around the line of duality $1/p + 1/q = 1$, at least inside the Banach space region $[0, 1]^2$.
Remark 1.11.9. There is a powerful extension of the Riesz-Thorin theorem known as the Stein interpolation theorem, in which the single operator $T$ is replaced by a family of operators $T_s$ for $s$ in the strip $S := \{s : 0 \leq \operatorname{Re}(s) \leq 1\}$ that vary holomorphically in $s$, in the sense that $\int_Y (T_s 1_E) 1_F\ d\nu$ is a holomorphic function of $s$ for any sets $E, F$ of finite measure. Roughly speaking, the Stein interpolation theorem asserts that if $T_{j+it}$ is of strong type $(p_j, q_j)$ for $j = 0, 1$ with a bound growing at most exponentially in $t$, and $T_s$ itself grows at most exponentially in $t$ in some sense, then $T_\theta$ is of strong type $(p_\theta, q_\theta)$ for each $0 < \theta < 1$; the Riesz-Thorin theorem corresponds to the special case in which the $T_s$ are all equal.

The interpolation theory discussed so far applies to linear operators, but it is also convenient to apply it to sublinear operators: operators $T$ taking functions $f$ on $X$ to non-negative measurable functions $Tf$ on $Y$, obeying the pointwise bounds $|T(f+g)| \leq |Tf| + |Tg|$ and $|T(cf)| = |c| |Tf|$. For instance, if $S$ is a linear operator, then its absolute value $|S| f := |Sf|$ is sublinear; and a maximal operator $Tf := \sup_{\alpha \in A} |S_\alpha f|$, where $(S_\alpha)_{\alpha \in A}$ is a family of linear operators, is also a non-negative sublinear operator; note that one can also replace the supremum here by any other norm in $\alpha$, e.g. one could take an $\ell^p$ norm $(\sum_{\alpha \in A} |S_\alpha f|^p)^{1/p}$ for any $1 \leq p \leq \infty$. (After $p = \infty$ and $p = 1$, a particularly common case is when $p = 2$, in which case $T$ is known as a square function.)
The basic theory of sublinear operators is similar to that of linear operators in some respects. For instance, continuity is still equivalent to boundedness:

Exercise 1.11.17. Let $T$ be a sublinear operator, and let $0 < p, q \leq \infty$. Then the following are equivalent:
$T$ can be extended to a continuous operator from $L^p(X)$ to $L^q(Y)$.
There exists a constant $B > 0$ such that $\|Tf\|_{L^q(Y)} \leq B \|f\|_{L^p(X)}$ for all simple functions $f$ of finite measure support.
$T$ can be extended to an operator from $L^p(X)$ to $L^q(Y)$ such that $\|Tf\|_{L^q(Y)} \leq B \|f\|_{L^p(X)}$ for all $f \in L^p(X)$ and some $B > 0$.
Show that the extension mentioned above is unique if $p$ is finite, or if $X$ has finite measure. Finally, show that the same equivalences hold if $L^q(Y)$ is replaced by $L^{q,\infty}(Y)$ throughout.
We say that $T$ is of strong type $(p, q)$ if any of the above equivalent statements (for $L^q(Y)$) hold, and of weak type $(p, q)$ if any of the above equivalent statements (for $L^{q,\infty}(Y)$) hold. We say that a linear operator $S$ is of strong or weak type $(p, q)$ if its non-negative counterpart $|S|$ is; note that this is compatible with our previous definition of strong type for such operators. Also, Chebyshev's inequality tells us that strong type $(p, q)$ implies weak type $(p, q)$.

We now give the real interpolation counterpart of the Riesz-Thorin theorem, namely the Marcinkiewicz interpolation theorem:
Theorem 1.11.10 (Marcinkiewicz interpolation theorem). Let $0 < p_0, p_1, q_0, q_1 \leq \infty$ and $0 < \theta < 1$ be such that $q_0 \neq q_1$, and $p_i \leq q_i$ for $i = 0, 1$. Let $T$ be a sublinear operator which is of weak type $(p_0, q_0)$ and of weak type $(p_1, q_1)$. Then $T$ is of strong type $(p_\theta, q_\theta)$.
Remark 1.11.11. Of course, the same claim applies to linear operators $S$ by setting $T := |S|$. One can also extend the argument to quasi-linear operators, in which the pointwise bound $|T(f+g)| \leq |Tf| + |Tg|$ is replaced by $|T(f + g)| \leq C(|Tf| + |Tg|)$ for some constant $C > 0$, but this generalisation only appears occasionally in applications. The conditions $p_0 \leq q_0$, $p_1 \leq q_1$ can be replaced by the variant condition $p_\theta \leq q_\theta$.

Proof. It suffices to establish the bound $\|Tf\|_{L^{q_\theta}(Y)} \lesssim \|f\|_{L^{p_\theta}(X)}$ for all simple functions $f$ of finite measure support, where we allow the implied constants to depend on $p_0, p_1, q_0, q_1, \theta$ and on the weak type bounds of $T$. Expressing the $L^{q_\theta}$ norm in terms of the distribution function, it suffices to show that
$$\int_0^\infty \lambda_{Tf}(t)\, t^{q_\theta}\ \frac{dt}{t} \lesssim \|f\|_{L^{p_\theta}(X)}^{q_\theta}.$$
By homogeneity we can normalise $\|f\|_{L^{p_\theta}(X)} = 1$.
Actually, it will be slightly more convenient to work with the dyadic version of the above estimate, namely

(1.95) $\sum_{n \in \mathbf{Z}} \lambda_{Tf}(2^n)\, 2^{q_\theta n} \lesssim 1$;

see Exercise 1.11.6. The hypothesis $\|f\|_{L^{p_\theta}(X)} = 1$ similarly implies that

(1.96) $\sum_{m \in \mathbf{Z}} \lambda_f(2^m)\, 2^{p_\theta m} \lesssim 1$.

The basic idea is then to get enough control on the numbers $\lambda_{Tf}(2^n)$ in terms of the numbers $\lambda_f(2^m)$ that one can deduce (1.95) from (1.96).
When $p_0 = p_1$, the claim follows from direct substitution of (1.93), (1.94) (see also the discussion in the previous section about interpolating strong $L^p$ bounds from weak ones), so let us assume $p_0 \neq p_1$; by symmetry we may take $p_0 < p_1$, and thus $p_0 < p_\theta < p_1$. In this case we cannot directly apply (1.93), (1.94) because we only control $f$ in $L^{p_\theta}$, not $L^{p_0}$ or $L^{p_1}$. To get around this, we use the basic real interpolation trick of decomposing $f$ into pieces. There are two basic choices for what decomposition to pick. On one hand, one could adopt a minimalistic approach and just decompose into two pieces
$$f = f_{\geq s} + f_{< s},$$
where $f_{\geq s} := f 1_{|f| \geq s}$ and $f_{< s} := f 1_{|f| < s}$, and the threshold $s$ is a parameter (depending on $n$) to be optimised later. Or we could adopt a maximalistic approach and perform the dyadic decomposition
$$f = \sum_{m \in \mathbf{Z}} f_m,$$
where $f_m := f 1_{2^m \leq |f| < 2^{m+1}}$. (Note that only finitely many of the $f_m$ are non-zero, as we are assuming $f$ to be a simple function.) We will adopt the latter approach, in order to illustrate the dyadic decomposition method; the former approach also works, but we leave it as an exercise to the interested reader.
From sublinearity we have the pointwise estimate
$$|Tf| \leq \sum_m |Tf_m|,$$
which implies that
$$\lambda_{Tf}(2^n) \leq \sum_m \lambda_{Tf_m}(c_{n,m} 2^n)$$
whenever the $c_{n,m}$ are positive constants such that $\sum_m c_{n,m} = 1$, but which we are otherwise at liberty to choose. We will set aside the problem of deciding what the optimal choice of $c_{n,m}$ is for now, and continue with the proof.
From (1.93), (1.94), we have two bounds for the quantity $\lambda_{Tf_m}(c_{n,m} 2^n)$, namely
$$\lambda_{Tf_m}(c_{n,m} 2^n) \lesssim c_{n,m}^{-q_0}\, 2^{-n q_0}\, \|f_m\|_{L^{p_0}(X)}^{q_0}$$
and
$$\lambda_{Tf_m}(c_{n,m} 2^n) \lesssim c_{n,m}^{-q_1}\, 2^{-n q_1}\, \|f_m\|_{L^{p_1}(X)}^{q_1}.$$
From the construction of $f_m$ we can bound
$$\|f_m\|_{L^{p_0}(X)} \lesssim 2^m \lambda_f(2^m)^{1/p_0}$$
and similarly for $p_1$, and thus we have
$$\lambda_{Tf_m}(c_{n,m} 2^n) \lesssim c_{n,m}^{-q_i}\, 2^{-n q_i}\, 2^{m q_i}\, \lambda_f(2^m)^{q_i/p_i}$$
for $i = 0, 1$. To prove (1.95), it thus suffices to show that
$$\sum_n 2^{n q_\theta} \sum_m \min_{i=0,1} c_{n,m}^{-q_i}\, 2^{-n q_i}\, 2^{m q_i}\, \lambda_f(2^m)^{q_i/p_i} \lesssim 1.$$
It is convenient to introduce the quantities $a_m := \lambda_f(2^m)\, 2^{m p_\theta}$ appearing in (1.96), thus
$$\sum_m a_m \lesssim 1,$$
and our task is to show that
$$\sum_n 2^{n q_\theta} \sum_m \min_{i=0,1} c_{n,m}^{-q_i}\, 2^{-n q_i}\, 2^{m q_i}\, 2^{-m q_i p_\theta / p_i}\, a_m^{q_i/p_i} \lesssim 1.$$
Since $p_i \leq q_i$, we have $a_m^{q_i/p_i} \lesssim a_m$, and so we are reduced to the purely numerical task of locating constants $c_{n,m}$ with $\sum_m c_{n,m} \lesssim 1$ for all $n$, such that

(1.97) $\sum_n 2^{n q_\theta} \min_{i=0,1} c_{n,m}^{-q_i}\, 2^{-n q_i}\, 2^{m q_i}\, 2^{-m q_i p_\theta / p_i} \lesssim 1$

for all $m$.
We can simplify this expression a bit by collecting terms and making some substitutions. The points $(1/p_0, 1/q_0)$, $(1/p_\theta, 1/q_\theta)$, $(1/p_1, 1/q_1)$ are collinear, and we can capture this by writing
$$\frac{1}{p_i} = \frac{1}{p_\theta} + x_i; \qquad \frac{1}{q_i} = \frac{1}{q_\theta} + \alpha x_i$$
for some $x_0 > 0 > x_1$ and some $\alpha \in \mathbf{R}$. Collecting terms, the left-hand side of (1.97) then simplifies to
$$\sum_n \min_{i=0,1} c_{n,m}^{-q_i}\, 2^{q_i x_i (\alpha q_\theta n - p_\theta m)}.$$
Note that $q_0 x_0$ is positive and $q_1 x_1$ is negative. If we then pick $c_{n,m}$ to be a sufficiently small multiple of $2^{-\varepsilon |\alpha q_\theta n - p_\theta m|}$ for a sufficiently small $\varepsilon > 0$ depending on the exponents (e.g. $\varepsilon := \frac{1}{2}\min(x_0, -x_1)$ will do), we obtain the claim by summing geometric series.
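The last step rests on the elementary fact that $\sum_{n \in \mathbf{Z}} \min(2^{an}, 2^{-bn})$ is finite for any $a, b > 0$: splitting at the crossover $n = 0$ yields two geometric series with total $\frac{1}{1 - 2^{-b}} + \frac{2^{-a}}{1 - 2^{-a}}$. A quick numerical confirmation:

```python
# sum_{n in Z} min(2^{a n}, 2^{-b n}) for a, b > 0: the summand decays
# geometrically in both directions away from the crossover at n = 0.
a, b = 0.7, 1.3
numeric = sum(min(2.0 ** (a * n), 2.0 ** (-b * n)) for n in range(-200, 201))
# Two geometric series: n >= 0 contributes 1/(1 - 2^{-b}),
# n <= -1 contributes 2^{-a}/(1 - 2^{-a}).
closed = 1.0 / (1.0 - 2.0 ** (-b)) + 2.0 ** (-a) / (1.0 - 2.0 ** (-a))
print(numeric, closed)
```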
Remark 1.11.12. A closer inspection of the proof (or a rescaling argument to reduce to the normalised case $B_0 = B_1 = 1$, as in preceding sections) reveals that one establishes the estimate
$$\|Tf\|_{L^{q_\theta}(Y)} \leq C_{p_0,p_1,q_0,q_1,\theta}\, B_0^{1-\theta} B_1^{\theta}\, \|f\|_{L^{p_\theta}(X)}$$
for all simple functions $f$ of finite measure support (or, after continuous extension, for all $f \in L^{p_\theta}(X)$), where $B_0, B_1$ are the weak type $(p_0, q_0)$ and $(p_1, q_1)$ bounds of $T$ respectively.
Exercise 1.11.20. For any $\alpha \in \mathbf{R}$, let $X_\alpha$ denote the natural numbers $\mathbf{N}$ equipped with the measure $\sum_{n \in \mathbf{N}} 2^{\alpha n} \delta_n$, thus each point $n$ has mass $2^{\alpha n}$. Show that if $\alpha > \beta > 0$, then the identity operator from $X_\alpha$ to $X_\beta$ [...]

Exercise 1.11.21. [...] show that one has an estimate of the form $\|Tf\|_{L^{q_\theta,r}(Y)} \leq B \|f\|_{L^{p_\theta,r}(X)}$ for all simple functions $f$ of finite measure support, where the Lorentz norms $L^{p,q}$ were defined in Exercise 1.11.7. (Hint: repeat the proof of the Marcinkiewicz interpolation theorem, but partition the sum $\sum_{n,m}$ into regions on which $\alpha q_\theta n - p_\theta m$ has a fixed order of magnitude.)

This Lorentz space version of the interpolation theorem is in some sense the right version of the theorem, but the Lorentz spaces are slightly more technical to deal with than the Lebesgue spaces, and the Lebesgue space version of Marcinkiewicz interpolation is largely sufficient for most applications.
Exercise 1.11.22. For $i = 1, 2$, let $X_i = (X_i, \mathcal{X}_i, \mu_i)$, $Y_i = (Y_i, \mathcal{Y}_i, \nu_i)$ be $\sigma$-finite measure spaces, and let $T_i$ be a linear operator from simple functions of finite measure support on $X_i$ to measurable functions on $Y_i$ (modulo almost everywhere equivalence, as always). Let $X = X_1 \times X_2$, $Y = Y_1 \times Y_2$ be the product spaces (with product $\sigma$-algebra and product measure). Show that there exists a unique (modulo a.e. equivalence) linear operator $T$ defined on linear combinations of indicator functions $1_{E_1 \times E_2}$ of product sets of sets $E_1 \subset X_1$, $E_2 \subset X_2$ of finite measure, such that
$$T 1_{E_1 \times E_2}(y_1, y_2) := T_1 1_{E_1}(y_1)\, T_2 1_{E_2}(y_2)$$
for a.e. $(y_1, y_2) \in Y$; we refer to $T$ as the tensor product of $T_1$ and $T_2$ and write $T = T_1 \otimes T_2$. Show that if $T_1, T_2$ are of strong-type $(p, q)$ for some $1 \leq p, q < \infty$ with operator norms $B_1, B_2$ respectively, then $T$ can be extended to a bounded linear operator from $L^p(X)$ to $L^q(Y)$ with operator norm exactly equal to $B_1 B_2$, thus
$$\|T_1 \otimes T_2\|_{L^p(X_1 \times X_2) \to L^q(Y_1 \times Y_2)} = \|T_1\|_{L^p(X_1) \to L^q(Y_1)}\, \|T_2\|_{L^p(X_2) \to L^q(Y_2)}.$$
(Hint: for the lower bound, show that $T_1 \otimes T_2 (f_1 \otimes f_2) = (T_1 f_1) \otimes (T_2 f_2)$ for all simple functions $f_1, f_2$. For the upper bound, express $T_1 \otimes T_2$ as the composition of two other operators $T_1 \otimes I_1$ and $I_2 \otimes T_2$ for some identity operators $I_1, I_2$, and establish operator norm bounds on these two operators separately.) Use this and the tensor power trick to deduce the Riesz-Thorin theorem (in the special case when $1 \leq p_i \leq q_i < \infty$ for $i = 0, 1$, and $q_0 \neq q_1$) from the Marcinkiewicz interpolation theorem. Thus one can (with some effort) avoid the use of complex variable methods to prove the Riesz-Thorin theorem, at least in some cases.
Exercise 1.11.23 (Hölder's inequality for Lorentz spaces). Let $f \in L^{p_1,r_1}(X)$ and $g \in L^{p_2,r_2}(X)$ for some $0 < p_1, p_2, r_1, r_2 \leq \infty$. Show that $fg \in L^{p_3,r_3}(X)$, where $1/p_3 = 1/p_1 + 1/p_2$ and $1/r_3 = 1/r_1 + 1/r_2$, with the estimate
$$\|fg\|_{L^{p_3,r_3}(X)} \leq C_{p_1,p_2,r_1,r_2}\, \|f\|_{L^{p_1,r_1}(X)}\, \|g\|_{L^{p_2,r_2}(X)}$$
for some constant $C_{p_1,p_2,r_1,r_2}$. (This estimate is due to O'Neil [ON1963].)
Remark 1.11.13. Just as interpolation of functions can be clarified by using step functions $f = A 1_E$ as a test case, it is instructive to use rank one operators such as
$$Tf := A \langle f, 1_E \rangle 1_F = A \Big(\int_E f\ d\mu\Big) 1_F,$$
where $E \subset X$, $F \subset Y$ are finite measure sets, as test cases for the real and complex interpolation methods. (After understanding the rank one case, I then recommend looking at the rank two case, e.g. $Tf := A_1 \langle f, 1_{E_1} \rangle 1_{F_1} + A_2 \langle f, 1_{E_2} \rangle 1_{F_2}$, where $E_2, F_2$ could be very different in size from $E_1, F_1$.)
1.11.4. Some examples of interpolation. Now we apply the interpolation theorems to some classes of operators. An important such class is given by the integral operators
$$Tf(y) := \int_X K(x, y) f(x)\ d\mu(x)$$
from functions $f : X \to \mathbf{C}$ to functions $Tf : Y \to \mathbf{C}$, where $K : X \times Y \to \mathbf{C}$ is a fixed measurable function, known as the kernel of the integral operator $T$. Of course, this integral is not necessarily convergent, so we will also need to study the sublinear analogue
$$|T| f(y) := \int_X |K(x, y)|\, |f(x)|\ d\mu(x),$$
which is well-defined (though it may be infinite).
The following useful lemma gives us strong-type bounds on $|T|$ and hence $T$, assuming certain $L^p$ type bounds on the rows and columns of $K$.

Lemma 1.11.14 (Schur's test). Let $K : X \times Y \to \mathbf{C}$ be a measurable function obeying the bounds
$$\|K(x, \cdot)\|_{L^{q_0}(Y)} \leq B_0$$
for almost every $x \in X$, and
$$\|K(\cdot, y)\|_{L^{p_1'}(X)} \leq B_1$$
for almost every $y \in Y$, where $1 \leq p_1, q_0 \leq \infty$ and $B_0, B_1 > 0$. Then for every $0 < \theta < 1$, $|T|$ and $T$ are of strong type $(p_\theta, q_\theta)$, with $Tf(y)$ well-defined for all $f \in L^{p_\theta}(X)$ and almost every $y \in Y$, and with
$$\|Tf\|_{L^{q_\theta}(Y)} \leq B_0^{1-\theta} B_1^{\theta}\, \|f\|_{L^{p_\theta}(X)}.$$
Here we adopt the convention that $p_0 := 1$ and $q_1 := \infty$, thus $q_\theta = q_0/(1 - \theta)$ and $p_\theta' = p_1'/\theta$.
Proof. The hypothesis $\|K(x, \cdot)\|_{L^{q_0}(Y)} \leq B_0$, combined with Minkowski's integral inequality, shows us that
$$\| |T| f \|_{L^{q_0}(Y)} \leq B_0 \|f\|_{L^1(X)}$$
for all $f \in L^1(X)$; in particular, for such $f$, $Tf$ is well-defined almost everywhere, and
$$\|Tf\|_{L^{q_0}(Y)} \leq B_0 \|f\|_{L^1(X)}.$$
Similarly, Hölder's inequality tells us that for $f \in L^{p_1}(X)$, $Tf$ is well-defined everywhere, and
$$\|Tf\|_{L^\infty(Y)} \leq B_1 \|f\|_{L^{p_1}(X)}.$$
Applying the Riesz-Thorin theorem we conclude that
$$\|Tf\|_{L^{q_\theta}(Y)} \leq B_\theta \|f\|_{L^{p_\theta}(X)}$$
for all simple functions $f$ with finite measure support; replacing $K$ with $|K|$ we also see that
$$\| |T| f \|_{L^{q_\theta}(Y)} \leq B_\theta \|f\|_{L^{p_\theta}(X)}$$
for all simple functions $f$ with finite measure support, and thus (by monotone convergence, Theorem 1.1.21) for all $f \in L^{p_\theta}(X)$. The claim then follows.
Example 1.11.15. Let $A = (a_{ij})_{1 \leq i \leq n, 1 \leq j \leq m}$ be a matrix such that the sum of the magnitudes of the entries in every row and column is at most $B$, i.e. $\sum_{i=1}^n |a_{ij}| \leq B$ for all $j$ and $\sum_{j=1}^m |a_{ij}| \leq B$ for all $i$. Then one has the bound
$$\|Ax\|_{\ell^p_m} \leq B \|x\|_{\ell^p_n}$$
for all vectors $x \in \mathbf{C}^n$ and all $1 \leq p \leq \infty$. Note the extreme cases $p = 1$, $p = \infty$ can be seen directly; the remaining cases then follow from interpolation.

A useful special case arises when $A$ is an $S$-sparse matrix, which means that at most $S$ entries in any row or column are non-zero (e.g. permutation matrices are $1$-sparse). We then conclude that the $\ell^p$ operator norm of $A$ is at most $S \sup_{i,j} |a_{i,j}|$.
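A quick numerical check of the bound $\|Ax\|_{\ell^p} \leq B \|x\|_{\ell^p}$ across several exponents $p$ (the matrix and vector below are arbitrary choices for illustration):

```python
import numpy as np

# Matrix whose row and column absolute sums are all at most B = 1.5:
A = np.array([[0.5, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 1.0]])
B = max(np.abs(A).sum(axis=0).max(), np.abs(A).sum(axis=1).max())

x = np.array([1.0, -2.0, 3.0])
ratios = []
for p in (1.0, 1.5, 2.0, 4.0):
    num = np.sum(np.abs(A @ x) ** p) ** (1 / p)   # ||Ax||_p
    den = np.sum(np.abs(x) ** p) ** (1 / p)       # ||x||_p
    ratios.append(num / den)
print(B, ratios)
```

Every ratio stays below $B$, as Schur's test guarantees for all $1 \leq p \leq \infty$ simultaneously.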
Exercise 1.11.24. Establish Schur's test by more direct means, taking advantage of the duality relationship
$$\|g\|_{L^p(Y)} := \sup\Big\{ \Big|\int_Y g h\ d\nu\Big| : \|h\|_{L^{p'}(Y)} \leq 1 \Big\}$$
for $1 \leq p \leq \infty$, as well as Young's inequality $xy \leq \frac{1}{r} x^r + \frac{1}{r'} y^{r'}$ for $1 < r < \infty$. (You may wish to first work out Example 1.11.15, say with $p = 2$, to figure out the logic.)
A useful corollary of Schur's test is Young's convolution inequality for the convolution $f * g$ of two functions $f : \mathbf{R}^n \to \mathbf{C}$, $g : \mathbf{R}^n \to \mathbf{C}$, defined as
$$f * g(x) := \int_{\mathbf{R}^n} f(y) g(x - y)\ dy,$$
provided of course that the integrand is absolutely convergent.

Exercise 1.11.25 (Young's inequality). Let $1 \leq p, q, r \leq \infty$ be such that $\frac{1}{p} + \frac{1}{q} = \frac{1}{r} + 1$. Show that if $f \in L^p(\mathbf{R}^n)$ and $g \in L^q(\mathbf{R}^n)$, then $f * g$ is well-defined almost everywhere and lies in $L^r(\mathbf{R}^n)$, and furthermore that
$$\|f * g\|_{L^r(\mathbf{R}^n)} \leq \|f\|_{L^p(\mathbf{R}^n)}\, \|g\|_{L^q(\mathbf{R}^n)}.$$
(Hint: Apply Schur's test to the kernel $K(x, y) := g(x - y)$.)
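Young's inequality also holds (with the same constant $1$) on the integers with counting measure, where convolution is the familiar discrete one; this makes it easy to test numerically. A sketch with an arbitrary pair of sequences and exponents $p = 2$, $q = 1$, $r = 2$ (so that $\frac{1}{p} + \frac{1}{q} = \frac{1}{r} + 1$):

```python
import numpy as np

def lp(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

# Discrete Young's inequality on Z with counting measure:
# ||f * g||_r <= ||f||_p ||g||_q when 1/p + 1/q = 1/r + 1.
f = np.array([1.0, -2.0, 0.5, 3.0])
g = np.array([0.25, 1.0, -0.5])
conv = np.convolve(f, g)       # full discrete convolution
p, q, r = 2.0, 1.0, 2.0        # 1/2 + 1 = 1/2 + 1
print(lp(conv, r), lp(f, p) * lp(g, q))
```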
Remark 1.11.16. There is nothing special about $\mathbf{R}^n$ here; one could in fact use any locally compact group $G$ with a bi-invariant Haar measure. On the other hand, if one specialises to $\mathbf{R}^n$, then it is possible to improve Young's inequality slightly, to
$$\|f * g\|_{L^r(\mathbf{R}^n)} \leq (A_p A_q A_{r'})^{n/2}\, \|f\|_{L^p(\mathbf{R}^n)}\, \|g\|_{L^q(\mathbf{R}^n)},$$
where $A_p := p^{1/p}/(p')^{1/p'}$.

Exercise 1.11.26. Let $f \in L^p(\mathbf{R}^n)$ and $g \in L^{p'}(\mathbf{R}^n)$ for some $1 < p < \infty$. Young's inequality tells us that $f * g \in L^\infty(\mathbf{R}^n)$. Refine this further by showing that $f * g \in C_0(\mathbf{R}^n)$, i.e. $f * g$ is continuous and goes to zero at infinity. (Hint: first show this when $f, g \in C_c(\mathbf{R}^n)$, then use a limiting argument.)
We now give a variant of Schur's test that allows for weak estimates.

Lemma 1.11.17 (Weak-type Schur's test). Let $K : X \times Y \to \mathbf{C}$ be a measurable function obeying the bounds
$$\|K(x, \cdot)\|_{L^{q_0,\infty}(Y)} \leq B_0$$
for almost every $x \in X$, and
$$\|K(\cdot, y)\|_{L^{p_1',\infty}(X)} \leq B_1$$
for almost every $y \in Y$, where $1 < p_1, q_0 < \infty$ and $B_0, B_1 > 0$ (note the endpoint exponents $1, \infty$ are now excluded). Then for every $0 < \theta < 1$, $|T|$ and $T$ are of strong type $(p_\theta, q_\theta)$, with $Tf(y)$ well-defined for all $f \in L^{p_\theta}(X)$ and almost every $y \in Y$, and with $\|Tf\|_{L^{q_\theta}(Y)} \lesssim_{p_1,q_0,\theta} B_0^{1-\theta} B_1^{\theta}\, \|f\|_{L^{p_\theta}(X)}$. Here we again adopt the convention that $p_0 := 1$ and $q_1 := \infty$.
Proof. From Exercise 1.11.11 we see that
$$\int_Y |K(x, y)| 1_E(y)\ d\nu(y) \lesssim B_0\, \nu(E)^{1/q_0'}$$
for any measurable $E \subset Y$ of finite measure, where we use $A \lesssim B$ to denote $A \leq C_{p_1,q_0,\theta} B$ for some $C_{p_1,q_0,\theta}$ depending on the indicated parameters. By the Fubini-Tonelli theorem, we conclude that
$$\int_Y |T| f(y)\, 1_E(y)\ d\nu(y) \lesssim B_0\, \nu(E)^{1/q_0'}\, \|f\|_{L^1(X)}$$
for any $f \in L^1(X)$; by Exercise 1.11.11 again we conclude that
$$\| |T| f \|_{L^{q_0,\infty}(Y)} \lesssim B_0 \|f\|_{L^1(X)},$$
thus $|T|$ is of weak type $(1, q_0)$. In a similar vein, from yet another application of Exercise 1.11.11 we see that
$$\| |T| f \|_{L^\infty(Y)} \lesssim B_1\, \mu(F)^{1/p_1}$$
whenever $0 \leq f \leq 1_F$ and $F \subset X$ has finite measure; thus $|T|$ is of restricted type $(p_1, \infty)$. Applying Exercise 1.11.18 we conclude that $|T|$ is of strong type $(p_\theta, q_\theta)$, and the claim follows.
This leads to a weak-type version of Young's inequality:

Exercise 1.11.27 (Weak-type Young's inequality). Let $1 < p, q, r < \infty$ be such that $\frac{1}{p} + \frac{1}{q} = \frac{1}{r} + 1$. Show that if $f \in L^p(\mathbf{R}^n)$ and $g \in L^{q,\infty}(\mathbf{R}^n)$, then $f * g$ is well-defined almost everywhere and lies in $L^r(\mathbf{R}^n)$, and furthermore that
$$\|f * g\|_{L^r(\mathbf{R}^n)} \leq C_{p,q}\, \|f\|_{L^p(\mathbf{R}^n)}\, \|g\|_{L^{q,\infty}(\mathbf{R}^n)}$$
for some constant $C_{p,q} > 0$.
Exercise 1.11.28. Refine the previous exercise by replacing $L^r(\mathbf{R}^n)$ with the Lorentz space $L^{r,p}(\mathbf{R}^n)$ throughout.
Recall that the function $1/|x|^\alpha$ will lie in $L^{n/\alpha,\infty}(\mathbf{R}^n)$ for $\alpha > 0$. We conclude

Corollary 1.11.18 (Hardy-Littlewood-Sobolev fractional integration inequality). Let $1 < p, r < \infty$ and $0 < \alpha < n$ be such that $\frac{1}{p} + \frac{\alpha}{n} = \frac{1}{r} + 1$. If $f \in L^p(\mathbf{R}^n)$, then the function $I_\alpha f$, defined as
$$I_\alpha f(x) := \int_{\mathbf{R}^n} \frac{f(y)}{|x - y|^\alpha}\ dy,$$
is well-defined almost everywhere and lies in $L^r(\mathbf{R}^n)$, and furthermore
$$\|I_\alpha f\|_{L^r(\mathbf{R}^n)} \leq C_{p,\alpha,n}\, \|f\|_{L^p(\mathbf{R}^n)}$$
for some constant $C_{p,\alpha,n} > 0$.

This inequality is of importance in the theory of Sobolev spaces, which we will discuss in Section 1.14.
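The exponent condition $\frac{1}{p} + \frac{\alpha}{n} = \frac{1}{r} + 1$ is forced by scaling: with $f_\lambda(x) := f(x/\lambda)$ one has $I_\alpha f_\lambda(x) = \lambda^{n-\alpha} (I_\alpha f)(x/\lambda)$, so $\|I_\alpha f_\lambda\|_{L^r}$ scales like $\lambda^{n - \alpha + n/r}$ while $\|f_\lambda\|_{L^p}$ scales like $\lambda^{n/p}$, and the inequality can only hold for all $\lambda$ if these powers agree. A bookkeeping check (with arbitrary sample values of $n, p, \alpha$):

```python
# Scaling exponents for Hardy-Littlewood-Sobolev: with f_lambda(x) = f(x/lambda),
# ||f_lambda||_p scales like lambda^{n/p} and ||I_alpha f_lambda||_r like
# lambda^{n - alpha + n/r}; scale invariance forces these to agree.
n, p, alpha = 3.0, 1.2, 1.0
r = 1.0 / (1.0 / p + alpha / n - 1.0)     # solve 1/p + alpha/n = 1/r + 1
lhs_power = n - alpha + n / r
rhs_power = n / p
print(r, lhs_power, rhs_power)
```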
Exercise 1.11.29. Show that Corollary 1.11.18 can fail at the endpoints $p = 1$, $r = \infty$, or $\alpha = n$.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/03/30. Thanks to PDEbeginner, Samir Chomsky, Spencer, Xiaochuan Liu and anonymous commenters for corrections.
1.12. The Fourier transform

In these notes we lay out the basic theory of the Fourier transform, which is of course the most fundamental tool in harmonic analysis and also of major importance in related fields (functional analysis, complex analysis, PDE, number theory, additive combinatorics, representation theory, signal processing, etc.). The Fourier transform, in conjunction with the Fourier inversion formula, allows one to take essentially arbitrary (complex-valued) functions on a group $G$ (or more generally, a space $X$ that $G$ acts on, e.g. a homogeneous space $G/H$), and decompose them as a (discrete or continuous) superposition of much more symmetric functions on the domain, such as characters $\chi : G \to S^1$; the precise superposition is given by Fourier coefficients $\hat{f}(\xi)$, where $\xi$ takes values in some dual object such as the Pontryagin dual $\hat{G}$ of $G$. Characters behave in a very simple manner with respect to translation (indeed, they are eigenfunctions of the translation action), and so the Fourier transform tends to simplify any mathematical problem which enjoys a translation invariance symmetry (or an approximation to such a symmetry), and is somehow linear (i.e. it interacts nicely with superpositions). In particular, Fourier analytic methods are particularly useful for studying operations such as convolution $f, g \mapsto f * g$ and set-theoretic addition $A, B \mapsto A + B$, or the closely related problem of counting solutions to additive problems such as $x = a_1 + a_2 + a_3$ or $x = a_1 - a_2$, where $a_1, a_2, a_3$ are constrained to lie in specific sets $A_1, A_2, A_3$. The Fourier transform is also a particularly powerful tool for solving constant-coefficient linear ODE and PDE (because of the translation invariance), and can also approximately solve some variable-coefficient (or slightly non-linear) equations if the coefficients vary smoothly enough and the nonlinear terms are sufficiently tame.
The Fourier transform $\hat{f}(\xi)$ also provides an important new way of looking at a function $f(x)$, as it highlights the distribution of $f$ in frequency space (the domain of the frequency variable $\xi$) rather than physical space (the domain of the physical variable $x$). A given property of $f$ in the physical domain may be transformed to a rather different-looking property of $\hat{f}$ in the frequency domain. For instance:

Smoothness of $f$ in the physical domain corresponds to decay of $\hat{f}$ in the Fourier domain, and conversely. (More generally, fine scale properties of $f$ tend to manifest themselves as coarse scale properties of $\hat{f}$, and conversely.)

Convolution in the physical domain corresponds to pointwise multiplication in the Fourier domain, and conversely.

Constant coefficient differential operators such as $d/dx$ in the physical domain correspond to multiplication by polynomials such as $2\pi i \xi$ in the Fourier domain, and conversely. More generally, translation invariant operators in the physical domain correspond to multiplication by symbols in the Fourier domain, and conversely.

Rescaling in the physical domain by an invertible linear transformation corresponds to an inverse (adjoint) rescaling in the Fourier domain.

Restriction to a subspace (or subgroup) in the physical domain corresponds to projection to the dual quotient space (or quotient group) in the Fourier domain, and conversely.

Frequency modulation in the physical domain corresponds to translation in the frequency domain, and conversely.

(We will make these statements more precise below.)
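The convolution correspondence is easy to witness numerically on the finite cyclic group $\mathbf{Z}/N\mathbf{Z}$, where the Fourier transform is the discrete Fourier transform: the DFT of a circular convolution equals the pointwise product of the DFTs. A sketch (arbitrary test vectors):

```python
import numpy as np

# Convolution in physical space <-> pointwise multiplication in frequency
# space, on the cyclic group Z/NZ via the discrete Fourier transform.
N = 8
f = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.0])
g = np.array([0.5, -1.0, 2.0, 0.0, 1.0, 0.0, 0.0, -0.5])

# Circular convolution computed directly from the definition:
circ = np.array([sum(f[m] * g[(k - m) % N] for m in range(N)) for k in range(N)])

lhs = np.fft.fft(circ)                 # DFT of the convolution
rhs = np.fft.fft(f) * np.fft.fft(g)    # product of the DFTs
print(np.max(np.abs(lhs - rhs)))
```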
On the other hand, some operations in the physical domain remain essentially unchanged in the Fourier domain. Most importantly, the $L^2$ norm (or energy) of a function $f$ is the same as that of its Fourier transform, and more generally the inner product $\langle f, g \rangle$ of two functions $f, g$ is the same as that of their Fourier transforms. Indeed, the Fourier transform is a unitary operator on $L^2$ (a fact which is variously known as the Plancherel theorem or the Parseval identity). This makes it easier to pass back and forth between the physical domain and frequency domain, so that one can combine techniques that are easy to execute in the physical domain with other techniques that are easy to execute in the frequency domain. (In fact, one can combine the physical and frequency domains together into a product domain known as phase space, and there are entire fields of mathematics (e.g. microlocal analysis, geometric quantisation, time-frequency analysis) devoted to performing analysis on these sorts of spaces directly, but this is beyond the scope of this course.)
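On $\mathbf{Z}/N\mathbf{Z}$, Plancherel's theorem is an exact finite-dimensional identity: with the unitary normalisation $\hat{f}(\xi) = \frac{1}{\sqrt{N}} \sum_x f(x) e^{-2\pi i x \xi / N}$, the $\ell^2$ norms of $f$ and $\hat{f}$ agree. A quick check on an arbitrary test vector:

```python
import numpy as np

# Plancherel on Z/NZ: the (1/sqrt(N))-normalised DFT is unitary,
# so it preserves the l^2 norm.
N = 16
f = np.cos(2 * np.pi * np.arange(N) / N) + 0.3 * np.arange(N)
fhat = np.fft.fft(f) / np.sqrt(N)
print(np.linalg.norm(f), np.linalg.norm(fhat))
```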
In these notes, we briefly discuss the general theory of the Fourier transform, but will mainly focus on the two classical domains for Fourier analysis: the torus $\mathbf{T}^d := (\mathbf{R}/\mathbf{Z})^d$, and the Euclidean space $\mathbf{R}^d$. For these domains one has the advantage of being able to perform very explicit algebraic calculations, involving concrete functions such as plane waves $x \mapsto e^{2\pi i x \cdot \xi}$ or Gaussians $x \mapsto A^{d/2} e^{-\pi A |x|^2}$.
1.12.1. Generalities. Let us begin with some generalities. An abelian topological group is an abelian group $G = (G, +)$ with a topological structure, such that the group operations of addition $+ : G \times G \to G$ and negation $- : G \to G$ are continuous. (One can of course also consider abelian multiplicative groups $G = (G, \cdot)$, but to fix the notation we shall restrict attention to additive groups.) For technical reasons (and in particular, in order to apply many of the results from the previous sections) it is convenient to restrict attention to abelian topological groups which are locally compact Hausdorff (LCH); these are known as locally compact abelian (LCA) groups.

Some basic examples of locally compact abelian groups are:

Finite additive groups (with the discrete topology), such as cyclic groups $\mathbf{Z}/N\mathbf{Z}$.

Finitely generated additive groups (with the discrete topology), such as the standard lattice $\mathbf{Z}^d$.

Tori, such as the standard $d$-dimensional torus $\mathbf{T}^d := (\mathbf{R}/\mathbf{Z})^d$ with the standard topology.

Euclidean spaces, such as the standard $d$-dimensional Euclidean space $\mathbf{R}^d$ (with the standard topology, of course).

The rationals $\mathbf{Q}$ are not locally compact with the usual topology; but if one uses the discrete topology instead, one recovers an LCA group.

Another example of an LCA group, of importance in number theory, is the adele ring $\mathbf{A}$, discussed in Section 1.5 of Poincaré's legacies, Vol. I.
Thus we see that locally compact abelian groups can be either discrete or continuous, and either compact or non-compact; all four combinations of these cases are of importance. The topology of course generates a Borel $\sigma$-algebra in the usual fashion, as well as a space $C_c(G)$ of continuous, compactly supported complex-valued functions. There is a translation action $x \mapsto \tau_x$ of $G$ on $C_c(G)$, where for every $x \in G$, $\tau_x : C_c(G) \to C_c(G)$ is the translation operation
$$\tau_x f(y) := f(y - x).$$

LCA groups need not be $\sigma$-compact (think of the free abelian group on uncountably many generators, with the discrete topology), but one has the following useful substitute:

Exercise 1.12.1. Show that every LCA group $G$ contains a $\sigma$-compact open subgroup $H$, and in particular is the disjoint union of $\sigma$-compact sets. (Hint: Take a compact symmetric neighbourhood $K$ of the identity, and consider the group $H$ generated by this neighbourhood.)
An important notion for us will be that of a Haar measure: a Radon measure $\mu$ on $G$ which is translation-invariant (i.e. $\mu(E + x) = \mu(E)$ for all Borel sets $E \subset G$ and all $x \in G$, where $E + x := \{y + x : y \in E\}$ is the translation of $E$ by $x$). From this and the definition of integration we see that integration $f \mapsto \int_G f\ d\mu$ against a Haar measure (an operation known as the Haar integral) is also translation-invariant, thus

(1.98) $\int_G f(y - x)\ d\mu(y) = \int_G f(y)\ d\mu(y)$

or equivalently

(1.99) $\int_G \tau_x f\ d\mu = \int_G f\ d\mu$

for all $f \in C_c(G)$ and $x \in G$. The trivial measure $0$ is of course a Haar measure; all other Haar measures are called non-trivial.
Haar measure; all other Haar measures are called non-trivial.
Let us note some non-trivial Haar measures in the four basic examples of locally compact abelian groups:

- For a finite additive group G, one can take either counting measure # or normalised counting measure #/#(G) as a Haar measure. (The former measure emphasises the discrete nature of G; the latter measure emphasises the compact nature of G.)
- For finitely generated additive groups such as Z^d, counting measure # is a Haar measure.
- For the standard torus (R/Z)^d, one can obtain a Haar measure by identifying this torus with [0, 1)^d in the usual manner and then taking Lebesgue measure on the latter space. This Haar measure is a probability measure.
- For the standard Euclidean space R^d, Lebesgue measure is a Haar measure.
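For a finite group the translation invariance (1.99) can be verified directly; a minimal numerical sketch for G = Z/NZ (NumPy assumed, with np.roll playing the role of the translation τ_x):

```python
import numpy as np

N = 12
rng = np.random.default_rng(0)
f = rng.standard_normal(N)  # a function on the finite group Z/NZ

# Haar integral against normalised counting measure #/#(G)
def haar(g):
    return g.mean()

# tau_x f(y) := f(y - x); on Z/NZ this is a cyclic shift
def tau(x, g):
    return np.roll(g, x)

# translation invariance (1.99): the Haar integral of tau_x f equals that of f
for x in range(N):
    assert abs(haar(tau(x, f)) - haar(f)) < 1e-12
```

The same check with counting measure (sums instead of means) illustrates that any constant multiple of a Haar measure is again a Haar measure.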
Of course, any non-negative constant multiple of a Haar measure is again a Haar measure. The converse is also true:

Exercise 1.12.2 (Uniqueness of Haar measure up to scalars). Let μ, ν be two non-trivial Haar measures on a locally compact abelian group G. Show that μ, ν are scalar multiples of each other, i.e. there exists a constant c > 0 such that μ = cν. (Hint: for any f, g ∈ C_c(G), compute the quantity ∫_G ∫_G g(y) f(x + y) dμ(x) dν(y) in two different ways.)
The above argument also implies a useful symmetry property of Haar measures:

Exercise 1.12.3 (Haar measures are symmetric). Let μ be a Haar measure on a locally compact abelian group G. Show that ∫_G f(x) dμ(x) = ∫_G f(−x) dμ(x) for all f ∈ C_c(G). (Hint: expand ∫_G ∫_G f(y) f(x + y) dμ(x) dμ(y) in two different ways.) Conclude that Haar measures on LCA groups are symmetric in the sense that μ(−E) = μ(E) for all measurable E, where −E := {−x : x ∈ E} is the reflection of E.
Exercise 1.12.4 (Open sets have positive measure). Let μ be a non-trivial Haar measure on a locally compact abelian group G. Show that μ(U) > 0 for any non-empty open set U. Conclude that if f ∈ C_c(G) is non-negative and not identically zero, then ∫_G f dμ > 0.
Exercise 1.12.5. If G is an LCA group with non-trivial Haar measure μ, show that L^1(G)* is identifiable with L^∞(G). (Unfortunately, G is not always σ-finite, and so the standard duality theorem from Section 1.3 does not directly apply. However, one can get around this using Exercise 1.12.1.)
It is a (not entirely trivial) theorem, due to André Weil, that all LCA groups have a non-trivial Haar measure. For discrete groups, one can of course take counting measure as a Haar measure. For compact groups, the result is due to Haar, and one can argue as follows:
Exercise 1.12.6 (Existence of Haar measure, compact case). Let G be a compact metrisable abelian group. For any real-valued f ∈ C_c(G), and any Borel probability measure μ on G, define the oscillation osc_f(μ) of μ with respect to f to be the quantity

osc_f(μ) := sup_{y∈G} ∫_G τ_y f dμ − inf_{y∈G} ∫_G τ_y f dμ.
(a) Show that a Borel probability measure μ is a Haar measure if and only if osc_f(μ) = 0 for all f ∈ C_c(G).

(b) If a sequence μ_n of Borel probability measures converges in the vague topology to another Borel probability measure μ, show that osc_f(μ_n) → osc_f(μ) for all f ∈ C_c(G).
(c) If μ is a Borel probability measure and f ∈ C_c(G) is such that osc_f(μ) > 0, show that there exists a Borel probability measure μ′ such that osc_f(μ′) < osc_f(μ) and osc_g(μ′) ≤ osc_g(μ) for all g ∈ C_c(G). (Hint: take μ′ to be an average of certain translations of μ.)
(d) Given any finite number of functions f_1, ..., f_n ∈ C_c(G), show that there exists a Borel probability measure μ such that osc_{f_i}(μ) = 0 for all i = 1, ..., n. (Hint: Use Prokhorov's theorem, see Corollary 1.10.22. Try the n = 1 case first.)
(e) Show that there exists a unique Haar probability measure μ on G. (Hint: One can identify each probability measure μ with the element (∫_G f dμ)_{f∈C_c(G)} of the product space Π_{f∈C_c(G)} [−sup_{x∈G} |f(x)|, sup_{x∈G} |f(x)|], which is compact by Tychonoff's theorem. Now use (d) and the finite intersection property.)

(The argument can be adapted to the case when G is not metrisable, but one has to replace the sequential compactness given by Prokhorov's theorem with the topological compactness given by the Banach-Alaoglu theorem.)
For general LCA groups, the proof is more complicated:

Exercise 1.12.7 (Existence of Haar measure, general case). Let G be an LCA group. Let C_c(G)^+ denote the space of non-negative functions f ∈ C_c(G) that are not identically zero. Given two f, g ∈ C_c(G)^+, define a g-cover of f to be an expression of the form a_1 τ_{x_1} g + ... + a_n τ_{x_n} g that pointwise dominates f, where a_1, ..., a_n are non-negative numbers and x_1, ..., x_n ∈ G. Let (f : g) denote the infimum of the quantity a_1 + ... + a_n over all g-covers of f.
(a) (Finiteness) Show that 0 < (f : g) < +∞ for all f, g ∈ C_c(G)^+.

(b) Let μ be a Haar measure on G. Show that ∫_G f dμ ≤ (f : g)(∫_G g dμ) for all f, g ∈ C_c(G)^+. Conversely, for every f ∈ C_c(G)^+ and ε > 0, show that there exists g ∈ C_c(G)^+ such that ∫_G f dμ ≥ (f : g)(∫_G g dμ) − ε. (Hint: f is uniformly continuous. Take g to be an approximation to the identity.) Thus Haar integrals are related to certain renormalised versions of the functionals f ↦ (f : g); this observation underlies the strategy for construction of Haar measure in the rest of this exercise.
(c) (Transitivity) Show that (f : h) ≤ (f : g)(g : h) for all f, g, h ∈ C_c(G)^+.

(d) (Translation invariance) Show that (τ_x f : g) = (f : g) for all f, g ∈ C_c(G)^+ and x ∈ G.

(e) (Sublinearity) Show that (f + g : h) ≤ (f : h) + (g : h) and (cf : g) = c(f : g) for all f, g, h ∈ C_c(G)^+ and c > 0.

(f) (Approximate superadditivity) If f, g ∈ C_c(G)^+ and ε > 0, show that there exists a neighbourhood U of the identity such that (f : h) + (g : h) ≤ (1 + ε)(f + g : h) whenever h ∈ C_c(G)^+ is supported in U. (Hint: f, g, f + g are all uniformly continuous. Take an h-cover of f + g and multiply the weight a_i at x_i by weights such as f(x_i)/(f(x_i) + g(x_i) + ε) and g(x_i)/(f(x_i) + g(x_i) + ε).)
Next, fix a reference function f_0 ∈ C_c(G)^+, and define the functional I_g : C_c(G)^+ → R^+ for all g ∈ C_c(G)^+ by the formula

I_g(f) := (f : g)/(f_0 : g).

(g) Show that for any fixed f, I_g(f) ranges in the compact interval [(f_0 : f)^{−1}, (f : f_0)]; thus I_g can be viewed as an element of the product space Π_{f∈C_c(G)^+} [(f_0 : f)^{−1}, (f : f_0)], which is compact by Tychonoff's theorem.
(h) From (d), (e) we have the translation-invariance property I_g(τ_x f) = I_g(f), the homogeneity property I_g(cf) = c I_g(f), and the sub-additivity property I_g(f + f′) ≤ I_g(f) + I_g(f′) for all g, f, f′ ∈ C_c(G)^+, x ∈ G, and c > 0; we also have the normalisation I_g(f_0) = 1. Now show that for all f_1, ..., f_n, f′_1, ..., f′_n ∈ C_c(G)^+ and ε > 0, there exists g ∈ C_c(G)^+ such that I_g(f_i + f′_i) ≥ I_g(f_i) + I_g(f′_i) − ε for all i = 1, ..., n.

(i) Show that there exists a unique Haar measure μ on G with ∫_G f_0 dμ = 1. (Hint: Use (h) and the finite intersection property to obtain a translation-invariant positive linear functional on C_c(G), then use the Riesz representation theorem.)
Now we come to a fundamental notion, that of a character.

Definition 1.12.1 (Characters). Let G be an LCA group. A multiplicative character is a continuous function χ : G → S^1 to the unit circle S^1 := {z ∈ C : |z| = 1} which is a homomorphism, i.e. χ(x + y) = χ(x)χ(y) for all x, y ∈ G. An additive character or frequency ξ : x ↦ ξ · x is a continuous function ξ : G → R/Z which is a homomorphism, thus ξ · (x + y) = ξ · x + ξ · y for all x, y ∈ G. The set of all frequencies ξ is called the Pontryagin dual of G and is denoted Ĝ.

Every continuous homomorphism φ : G → H of LCA groups induces a map φ̂ : Ĥ → Ĝ between their Pontryagin duals, defined by φ̂(η) · x := η · (φ(x)) for η ∈ Ĥ and x ∈ G.

(e) If H is a closed subgroup of an LCA group G (and is thus also LCA), show that Ĥ is identifiable with Ĝ/H^⊥, where H^⊥ := {ξ ∈ Ĝ : ξ · x = 0 for all x ∈ H}.

Given f ∈ L^1(G), the Fourier transform f̂ : Ĝ → C is defined by the formula

f̂(ξ) := ∫_G f(x) e^{−2πi ξ·x} dμ(x).

This is clearly a linear transformation, with the obvious bound

sup_{ξ∈Ĝ} |f̂(ξ)| ≤ ‖f‖_{L^1(G)}.
It converts translations into frequency modulations: indeed, one easily verifies that

(1.100) τ̂_{x_0} f(ξ) = e^{−2πi ξ·x_0} f̂(ξ)

for any f ∈ L^1(G), x_0 ∈ G, and ξ ∈ Ĝ. Conversely, it converts frequency modulations to translations: one has

(1.101) χ̂_{ξ_0} f(ξ) = f̂(ξ − ξ_0)

for any f ∈ L^1(G) and ξ_0, ξ ∈ Ĝ, where χ_{ξ_0} is the multiplicative character χ_{ξ_0} : x ↦ e^{2πi ξ_0·x}.
Exercise 1.12.11 (Riemann-Lebesgue lemma). If f ∈ L^1(G), show that f̂ : Ĝ → C is continuous. Furthermore, show that f̂ goes to zero at infinity in the sense that for every ε > 0 there exists a compact subset K of Ĝ such that |f̂(ξ)| ≤ ε for all ξ outside of K.

This identifies the dual of G ⊕ H with Ĝ ⊕ Ĥ as per Exercise 1.12.9(f). Informally, this asserts that the Fourier transform commutes with tensor products. (Because of this fact, the tensor power trick (Section 1.9 of Structure and Randomness) is often available when proving results about the Fourier transform on general groups.)
Exercise 1.12.15 (Convolution and Fourier transform of measures). If ν ∈ M(G) is a finite Radon measure on an LCA group G with non-trivial Haar measure μ, define the Fourier-Stieltjes transform ν̂ : Ĝ → C by the formula

ν̂(ξ) := ∫_G e^{−2πi ξ·x} dν(x)

(thus for instance the Fourier-Stieltjes transform of the measure f dμ is just f̂, for any f ∈ L^1(G)). Show that ν̂ is a bounded continuous function on Ĝ. Given any f ∈ L^1(G), define the convolution f ∗ ν : G → C to be the function

f ∗ ν(x) := ∫_G f(x − y) dν(y),

and given any finite Radon measure ρ, let ν ∗ ρ be the measure

ν ∗ ρ(E) := ∫_G ∫_G 1_E(x + y) dν(x) dρ(y).

Show that f ∗ ν ∈ L^1(G) with f̂∗ν(ξ) = f̂(ξ) ν̂(ξ) for all ξ ∈ Ĝ, and similarly that ν ∗ ρ is a finite measure with ν̂∗ρ(ξ) = ν̂(ξ) ρ̂(ξ) for all ξ ∈ Ĝ. Thus the convolution and Fourier structure on L^1(G) can be extended to the larger space M(G) of finite Radon measures.
1.12.2. The Fourier transform on compact abelian groups. In this section we specialise the Fourier transform to the case when the locally compact group G is in fact compact; thus we now have a compact abelian group G with non-trivial Haar measure μ. This case includes that of finite groups, together with that of the tori (R/Z)^d.

As μ is a Radon measure, compact groups G have finite measure. It is then convenient to normalise the Haar measure μ so that μ(G) = 1; thus μ is now a probability measure. For the remainder of this section, we will assume that G is a compact abelian group and μ is its (unique) Haar probability measure, as given by Exercise 1.12.6.

A key advantage of working in the compact setting is that multiplicative characters χ : G → S^1 now lie in L^2(G) and L^1(G). In particular, they can be integrated:
Lemma 1.12.2. Let χ be a multiplicative character. Then ∫_G χ dμ equals 1 when χ is trivial and 0 when χ is non-trivial. Equivalently, for ξ ∈ Ĝ, we have ∫_G e^{2πi ξ·x} dμ(x) = δ_0(ξ), where δ_0 is the Kronecker delta function at 0.

Proof. The claim is clear when χ is trivial. When χ is non-trivial, there exists x ∈ G such that χ(x) ≠ 1. If one then integrates the identity τ_x χ = \overline{χ(x)} χ using (1.99), one obtains the claim.
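Lemma 1.12.2 can be checked concretely for G = Z/NZ, where the characters are x ↦ e^{2πi ξx/N} and the Haar probability integral is the average over the group; a small numerical sketch (NumPy assumed):

```python
import numpy as np

N = 8
x = np.arange(N)
for xi in range(N):
    chi = np.exp(2j * np.pi * xi * x / N)   # multiplicative character of Z/NZ
    mean = chi.mean()                        # integral against Haar probability measure
    expected = 1.0 if xi == 0 else 0.0       # Kronecker delta at 0
    assert abs(mean - expected) < 1e-12
```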
Exercise 1.12.16. Show that the Pontryagin dual Ĝ of a compact abelian group G is discrete (compare with Exercise 1.12.8).

Exercise 1.12.17. Show that the Fourier transform of the constant function 1 is the Kronecker delta function δ_0 at 0. More generally, for any ξ_0 ∈ Ĝ, show that the Fourier transform of the multiplicative character x ↦ e^{2πi ξ_0·x} is the Kronecker delta function δ_{ξ_0} at ξ_0.
Since the pointwise product of two multiplicative characters is again a multiplicative character, and the conjugate of a multiplicative character is also a multiplicative character, we obtain

Corollary 1.12.3. The space of multiplicative characters is an orthonormal set in the complex Hilbert space L^2(G).

Actually, one can say more:

Theorem 1.12.4 (Plancherel theorem for compact abelian groups). Let G be a compact abelian group with probability Haar measure μ. Then the space of multiplicative characters is an orthonormal basis for the complex Hilbert space L^2(G).

The full proof of this theorem requires the spectral theorem and is not given here, though see Exercise 1.12.43 below. However, we can work out some important special cases here.
- When G is a torus G = T^d = (R/Z)^d, the multiplicative characters x ↦ e^{2πi ξ·x} separate points (given any two x, y ∈ G, there exists a character which takes different values at x and at y). The space of finite linear combinations of multiplicative characters (i.e. the space of trigonometric polynomials) is then an algebra closed under conjugation that separates points and contains the unit 1, and thus by the Stone-Weierstrass theorem is dense in C(G) in the uniform (and hence in L^2) topology, and is thus dense in L^2(G) (in the L^2 topology) also.
- The same argument works when G is a cyclic group Z/NZ, using the multiplicative characters x ↦ e^{2πi ξx/N} for ξ ∈ Z/NZ. As every finite abelian group is isomorphic to the product of cyclic groups, we also obtain the claim for finite abelian groups.
- Alternatively, when G is finite, one can argue by viewing the linear operators τ_x : C_c(G) → C_c(G) as |G| × |G| unitary matrices (in fact, they are permutation matrices) for each x ∈ G. The spectral theorem for unitary matrices allows each of these matrices to be diagonalised; as G is abelian, the matrices commute and so one can simultaneously diagonalise these matrices. It is not hard to see that each simultaneous eigenvector of these matrices is a multiple of a character, and so the characters span L^2(G), yielding the claim. (The same argument will in fact work for arbitrary compact abelian groups, once we obtain the spectral theorem for unitary operators.)
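The finite-group argument can be carried out numerically for G = Z/NZ: the (L^2-normalised) character matrix is unitary, and it simultaneously diagonalises the translation operators, which are permutation matrices. A sketch (NumPy assumed):

```python
import numpy as np

N = 6
omega = np.exp(2j * np.pi / N)
# columns of U are the characters chi_xi(x) = e^{2 pi i xi x / N},
# normalised to unit length for the standard inner product
U = omega ** np.outer(np.arange(N), np.arange(N)) / np.sqrt(N)

# U is unitary: the characters form an orthonormal basis of L^2(Z/NZ)
assert np.allclose(U.conj().T @ U, np.eye(N))

# the translation tau_1 is a permutation matrix: (P f)[y] = f[y - 1]
P = np.roll(np.eye(N), 1, axis=0)
D = U.conj().T @ P @ U                       # change to the character basis
assert np.allclose(D, np.diag(np.diag(D)))   # diagonal in that basis
assert np.allclose(np.abs(np.diag(D)), 1.0)  # with unimodular eigenvalues
```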
If f ∈ L^2(G), the inner product ⟨f, χ_ξ⟩_{L^2(G)} of f with any multiplicative character χ_ξ : x ↦ e^{2πi ξ·x} is just the Fourier coefficient f̂(ξ). From Theorem 1.12.4 and the theory of orthonormal bases we then have:

- (Plancherel identity) For any f ∈ L^2(G), ‖f‖²_{L^2(G)} = Σ_{ξ∈Ĝ} |f̂(ξ)|².
- (Parseval identity, II) For any f, g ∈ L^2(G), we have ⟨f, g⟩_{L^2(G)} = Σ_{ξ∈Ĝ} f̂(ξ) \overline{ĝ(ξ)}.
- (Unitarity) Thus the Fourier transform f ↦ f̂ is a unitary transformation from L^2(G) to ℓ^2(Ĝ).
- (Inversion formula) For any f ∈ L^2(G), the series x ↦ Σ_{ξ∈Ĝ} f̂(ξ) e^{2πi ξ·x} converges unconditionally in L^2(G) to f.
- (Inversion formula, II) For any sequence (c_ξ)_{ξ∈Ĝ} in ℓ^2(Ĝ), the series x ↦ Σ_{ξ∈Ĝ} c_ξ e^{2πi ξ·x} converges unconditionally in L^2(G) to a function f with the c_ξ as its Fourier coefficients.

In particular, the Fourier transform maps L^1(G) to ℓ^∞(Ĝ) and L^2(G) to ℓ^2(Ĝ) with norm 1. Applying the interpolation theorem, we conclude the Hausdorff-Young inequality

(1.103) ‖f̂‖_{ℓ^{p′}(Ĝ)} ≤ ‖f‖_{L^p(G)}

for all 1 ≤ p ≤ 2 and all f ∈ L^p(G); in particular, the Fourier transform maps L^p(G) to ℓ^{p′}(Ĝ), where p′ is the dual exponent of p, 1/p + 1/p′ = 1. One also has

(1.104) ‖f̂‖_{ℓ^q(Ĝ)} ≤ ‖f‖_{L^p(G)}

whenever 2 ≤ q ≤ ∞ and 1/p + 1/q ≤ 1. These are the optimal hypotheses on p, q for which (1.104) holds, though we will not establish this fact here.
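The Plancherel and inversion identities can be observed concretely on the finite group G = Z/NZ with Haar probability measure, where the Fourier coefficients are computable by the discrete Fourier transform; a sketch (NumPy assumed; np.fft.fft uses the convention Σ_x f(x) e^{−2πi ξx/N}, so we divide by N):

```python
import numpy as np

N = 16
rng = np.random.default_rng(1)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Fourier coefficients with respect to the Haar probability measure:
# fhat(xi) = (1/N) sum_x f(x) e^{-2 pi i xi x / N}
fhat = np.fft.fft(f) / N

# Plancherel: ||f||_{L^2(G)}^2 = sum_xi |fhat(xi)|^2
assert np.isclose(np.mean(np.abs(f) ** 2), np.sum(np.abs(fhat) ** 2))

# Inversion: f(x) = sum_xi fhat(xi) e^{2 pi i xi x / N}
assert np.allclose(np.fft.ifft(fhat) * N, f)
```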
Exercise 1.12.18. If f, g ∈ L^2(G), show that the Fourier transform of fg ∈ L^1(G) is given by the formula

f̂g(ξ) = Σ_{η∈Ĝ} f̂(η) ĝ(ξ − η).

Thus multiplication is converted via the Fourier transform to convolution; compare this with (1.102).
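A quick numerical sanity check of this formula on G = Z/NZ (NumPy assumed, with Fourier coefficients taken with respect to the Haar probability measure):

```python
import numpy as np

N = 8
rng = np.random.default_rng(2)
f = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)

def coeff(h):
    # Fourier coefficients w.r.t. the Haar probability measure on Z/NZ
    return np.fft.fft(h) / N

fhat, ghat = coeff(f), coeff(g)
# convolution on the dual group Z/NZ: sum_eta fhat(eta) ghat(xi - eta)
conv = np.array([sum(fhat[eta] * ghat[(xi - eta) % N] for eta in range(N))
                 for xi in range(N)])

# the transform of the pointwise product equals the convolution of transforms
assert np.allclose(coeff(f * g), conv)
```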
Exercise 1.12.19 (Hardy-Littlewood majorant property). Let p ≥ 2 be an even integer. If f, g ∈ L^p(G) are such that |f̂(ξ)| ≤ ĝ(ξ) for all ξ ∈ Ĝ, show that ‖f‖_{L^p(G)} ≤ ‖g‖_{L^p(G)}.

Exercise 1.12.20. In this exercise we work on the torus T := R/Z with Haar probability measure, so that the Fourier coefficients are given by

f̂(n) = ∫_{R/Z} f(x) e^{−2πi nx} dx

for all f ∈ L^1(T) and n ∈ Z. For every integer N > 0 and f ∈ L^1(T), define the partial Fourier series S_N f to be the expression

S_N f(x) := Σ_{n=−N}^{N} f̂(n) e^{2πi nx}.

- Show that S_N f = f ∗ D_N, where D_N is the Dirichlet kernel D_N(x) := sin((2N+1)πx)/sin(πx).
- Show that ‖D_N‖_{L^1(T)} ≥ c log N for some absolute constant c > 0. Conclude that the operator norm of S_N on C(T) (with the uniform norm) is at least c log N.
- Conclude that there exists a continuous function f such that the partial Fourier series S_N f do not converge uniformly. (Hint: use the uniform boundedness principle.) This is despite the fact that S_N f must converge to f in L^2 norm, by the Plancherel theorem. (Another example of non-uniform convergence of S_N f is given by the Gibbs phenomenon.)
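The logarithmic growth of the Lebesgue constants ‖D_N‖_{L^1(T)} can be observed numerically; a sketch by midpoint quadrature (NumPy assumed; the coefficient 4/π² below is the known asymptotic constant, not something established in the exercise itself):

```python
import numpy as np

def dirichlet_L1(N, M=200001):
    # ||D_N||_{L^1(T)} by midpoint quadrature on [0, 1)
    x = (np.arange(M) + 0.5) / M
    D = np.sin((2 * N + 1) * np.pi * x) / np.sin(np.pi * x)
    return np.mean(np.abs(D))

norms = {N: dirichlet_L1(N) for N in (10, 100, 1000)}

# the Lebesgue constants grow like (4/pi^2) log N + O(1)
for N, L in norms.items():
    assert L > (4 / np.pi ** 2) * np.log(N)
assert norms[10] < norms[100] < norms[1000]
```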
Exercise 1.12.21. We continue the notational conventions of the preceding exercise. For every integer N > 0 and f ∈ L^1(T), define the Cesàro-summed partial Fourier series C_N f to be the expression

C_N f(x) := (1/N) Σ_{n=0}^{N−1} S_n f(x).

- Show that C_N f = f ∗ F_N, where F_N is the Fejér kernel F_N(x) := (1/N) (sin(πNx)/sin(πx))².
- Show that ‖F_N‖_{L^1(T)} = 1. (Hint: what is the Fourier coefficient of F_N at zero?)
- Show that C_N f converges uniformly to f for every f ∈ C(T). (Thus we see that Cesàro averaging improves the convergence properties of Fourier series.)
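In contrast to the Dirichlet kernel, the Fejér kernel is non-negative with unit L^1 norm, which can be confirmed numerically; a sketch (NumPy assumed):

```python
import numpy as np

def fejer(N, x):
    # Fejer kernel F_N(x) = (1/N) (sin(pi N x) / sin(pi x))^2
    return (np.sin(np.pi * N * x) / np.sin(np.pi * x)) ** 2 / N

M = 100001
x = (np.arange(M) + 0.5) / M           # midpoint grid on [0, 1), avoiding x = 0
F = fejer(10, x)

assert np.all(F >= 0)                   # the kernel is non-negative ...
assert abs(np.mean(F) - 1.0) < 1e-6     # ... and has L^1(T) norm 1
```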
Exercise 1.12.22. Carleson's inequality asserts that for any f ∈ L^2(T), one has the weak-type inequality

‖ sup_{N>0} |S_N f(x)| ‖_{L^{2,∞}(T)} ≤ C ‖f‖_{L^2(T)}

for some absolute constant C. Assuming this (deep) inequality, establish Carleson's theorem that for any f ∈ L^2(T), the partial Fourier series S_N f(x) converge for almost every x to f(x). (Conversely, a general principle of Stein [St1961], analogous to the uniform boundedness principle, allows one to deduce Carleson's inequality from Carleson's theorem. A later result of Hunt [Hu1968] extends Carleson's theorem to L^p(T) for any p > 1, but a famous example of Kolmogorov shows that almost everywhere convergence can fail for L^1(T) functions; in fact the series may diverge pointwise everywhere.)
1.12.3. The Fourier transform on Euclidean spaces. We now turn to the Fourier transform on the Euclidean space R^d, where d ≥ 1 is a fixed integer. From Exercise 1.12.9 we can identify the Pontryagin dual of R^d with itself, and then the Fourier transform f̂ : R^d → C of a function f ∈ L^1(R^d) is given by the formula

(1.105) f̂(ξ) := ∫_{R^d} f(x) e^{−2πi x·ξ} dx.
Remark 1.12.6. One needs the Euclidean inner product structure on R^d in order to identify the dual of R^d with R^d. Without this structure, it is more natural to identify the dual of R^d with the dual space (R^d)* of R^d. (In the language of physics, one should interpret frequency as a covector rather than a vector.) However, we will not need to consider such subtleties here. In other areas of mathematics than harmonic analysis, the normalisation of the Fourier transform (particularly with regard to the positioning of the sign and the factor 2π) is sometimes slightly different from that presented here. For instance, in PDE, the factor of 2π is often omitted from the exponent in order to slightly simplify the behaviour of differential operators under the Fourier transform (at the cost of introducing factors of 2π in various identities, such as the Plancherel formula or inversion formula).
In Exercise 1.12.11 we saw that if f was in L^1(R^d), then f̂ was continuous and decayed to zero at infinity. One can improve both the regularity and decay on f̂ by strengthening the hypotheses on f. We need two basic facts:

Exercise 1.12.23 (Decay transforms to regularity). Let 1 ≤ j ≤ d, and suppose that f, x_j f both lie in L^1(R^d), where x_j is the j^th coordinate function. Show that f̂ is continuously differentiable in the ξ_j variable, with

(∂/∂ξ_j) f̂(ξ) = −2πi x̂_j f(ξ).

(Hint: The main difficulty is to justify differentiation under the integral sign. Use the fact that the function x ↦ e^{ix} has a derivative of magnitude 1, and is hence Lipschitz by the fundamental theorem of calculus. Alternatively, one can show first that f̂(ξ) is the indefinite integral of −2πi x̂_j f and then use the fundamental theorem of calculus.)
Exercise 1.12.24 (Regularity transforms to decay). Let 1 ≤ j ≤ d, and suppose that f ∈ L^1(R^d) has a derivative ∂f/∂x_j in L^1(R^d), for which one has the fundamental theorem of calculus

f(x_1, ..., x_n) = ∫_{−∞}^{x_j} (∂f/∂x_j)(x_1, ..., x_{j−1}, t, x_{j+1}, ..., x_n) dt

for almost every x_1, ..., x_n. (This is equivalent to f being absolutely continuous in x_j for almost every x_1, ..., x_{j−1}, x_{j+1}, ..., x_n.) Show that

(∂f/∂x_j)^(ξ) = 2πi ξ_j f̂(ξ).

In particular, conclude that |ξ_j| |f̂(ξ)| → 0 as ξ → ∞.

Define the adjoint Fourier transform ℱ* : 𝒮(R^d) → 𝒮(R^d) on the Schwartz space 𝒮(R^d) by the formula

ℱ*F(x) := ∫_{R^d} e^{2πi x·ξ} F(ξ) dξ

(note the sign change from (1.105)). We will shortly demonstrate that the adjoint Fourier transform is also the inverse Fourier transform: ℱ* = ℱ^{−1}.
From the identity

(1.107) ℱ*f = \overline{ℱ f̄}

we see that

⟨ℱf, g⟩_{L^2(R^d)} = ⟨f, ℱ*g⟩_{L^2(R^d)}

for all f, g ∈ 𝒮(R^d).

Now we show that ℱ*ℱ(f ∗ g) = f ∗ (ℱ*ℱg).
Next, we perform a computation:

Exercise 1.12.32 (Fourier transform of Gaussians). Let r > 0. Show that the Fourier transform of the gaussian function g_r(x) := r^{−d} e^{−π|x|²/r²} is ĝ_r(ξ) = e^{−π r²|ξ|²}. (Hint: Reduce to the case d = 1 and r = 1, then complete the square and use contour integration and the classical identity ∫_{−∞}^{∞} e^{−πx²} dx = 1.) Conclude that ℱ*ℱ g_r = g_r.
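The d = 1, r = 1 case of this computation can be sanity-checked by discretising the integral (1.105); a sketch (NumPy assumed):

```python
import numpy as np

# numerically check that the Fourier transform of e^{-pi x^2} is e^{-pi xi^2}
x = np.linspace(-10, 10, 40001)
dx = x[1] - x[0]
g = np.exp(-np.pi * x ** 2)

for xi in (0.0, 0.5, 1.0, 2.0):
    # Riemann-sum approximation to formula (1.105) in dimension d = 1
    ghat = np.sum(g * np.exp(-2j * np.pi * x * xi)) * dx
    assert abs(ghat - np.exp(-np.pi * xi ** 2)) < 1e-8
```

The rapid decay of the gaussian makes the truncation and discretisation errors negligible here.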
Exercise 1.12.33. With g_r as in the previous exercise, show that f ∗ g_r converges in the Schwartz space topology to f as r → 0 for all f ∈ 𝒮(R^d). (Hint: First show convergence in the uniform topology, then use the identities ∂_{x_j}(f ∗ g) = (∂_{x_j} f) ∗ g and x_j (f ∗ g) = (x_j f) ∗ g + f ∗ (x_j g) for f, g ∈ 𝒮(R^d).)
From Exercises 1.12.31, 1.12.32 we see that

ℱ*ℱ(f ∗ g_r) = f ∗ g_r

for all r > 0 and f ∈ 𝒮(R^d). Taking limits as r → 0 using Exercises 1.12.29, 1.12.33 we conclude that

ℱ*ℱf = f

for all f ∈ 𝒮(R^d), or in other words we have the Fourier inversion formula

(1.108) f(x) = ∫_{R^d} f̂(ξ) e^{2πi x·ξ} dξ

for all x ∈ R^d. From (1.107) we also have

ℱℱ*f = f.

Taking inner products with another Schwartz function g, we obtain Parseval's identity

⟨ℱf, ℱg⟩_{L^2(R^d)} = ⟨f, g⟩_{L^2(R^d)}

for all f, g ∈ 𝒮(R^d), and similarly for ℱ*. In particular, we obtain Plancherel's identity

‖ℱf‖_{L^2(R^d)} = ‖f‖_{L^2(R^d)} = ‖ℱ*f‖_{L^2(R^d)}

for all f ∈ 𝒮(R^d). We conclude that
Theorem 1.12.10 (Plancherel's theorem for R^d). The Fourier transform operator ℱ : 𝒮(R^d) → 𝒮(R^d) can be uniquely extended to a unitary transformation ℱ : L^2(R^d) → L^2(R^d).
Exercise 1.12.34. Show that the Fourier transform on L^2(R^d) given by Plancherel's theorem agrees with the Fourier transform on L^1(R^d) given by (1.105) on the common domain L^2(R^d) ∩ L^1(R^d). Thus we may define f̂ for f ∈ L^1(R^d) or f ∈ L^2(R^d) (or even f ∈ L^1(R^d) + L^2(R^d)) without any ambiguity (other than the usual identification of any two functions that agree almost everywhere).
Note that it is certainly possible for a function f to lie in L^2(R^d) but not in L^1(R^d) (e.g. the function (1 + |x|)^{−d}). In such cases, the integrand in (1.105) is not absolutely integrable, and so this formula does not define the Fourier transform of f directly. Nevertheless, one can recover the Fourier transform via a limiting version of (1.105):

Exercise 1.12.35. Let f ∈ L^2(R^d). Show that the partial Fourier integrals ∫_{|x|≤R} f(x) e^{−2πi x·ξ} dx converge in L^2(R^d) to f̂ as R → ∞.
Remark 1.12.11. It is a famous open question whether the partial Fourier integrals of an L^2(R^d) function also converge pointwise almost everywhere for d ≥ 2. For d = 1, this is essentially the celebrated theorem of Carleson mentioned in Exercise 1.12.22.
Exercise 1.12.36 (Heisenberg uncertainty principle). Let d = 1. Define the position operator X : 𝒮(R) → 𝒮(R) and momentum operator D : 𝒮(R) → 𝒮(R) by the formulae

Xf(x) := x f(x); Df(x) := (1/(2πi)) (d/dx) f(x).

Establish the identities

(1.109) ℱD = Xℱ; ℱX = −Dℱ; DX − XD = 1/(2πi)

and the formal self-adjointness relationships

⟨Xf, g⟩_{L^2(R)} = ⟨f, Xg⟩_{L^2(R)}; ⟨Df, g⟩_{L^2(R)} = ⟨f, Dg⟩_{L^2(R)}

and then establish the inequality

‖Xf‖_{L^2(R)} ‖Df‖_{L^2(R)} ≥ (1/(4π)) ‖f‖²_{L^2(R)}.

(Hint: start with the obvious inequality ⟨(aX + ibD)f, (aX + ibD)f⟩_{L^2(R)} ≥ 0 for real numbers a, b, and optimise in a and b.) If ‖f‖_{L^2(R)} = 1, deduce the Heisenberg uncertainty principle

[∫_R (ξ − ξ_0)² |f̂(ξ)|² dξ]^{1/2} [∫_R (x − x_0)² |f(x)|² dx]^{1/2} ≥ 1/(4π)

for any x_0, ξ_0 ∈ R. (Hint: one can use the translation and modulation symmetries (1.100), (1.101) of the Fourier transform to reduce to the case x_0 = ξ_0 = 0.) Classify precisely the f, x_0, ξ_0 for which equality occurs.
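The inequality ‖Xf‖ ‖Df‖ ≥ ‖f‖²/4π, with equality for gaussians, can be observed on a discretised line; a sketch (NumPy assumed; np.gradient stands in for d/dx):

```python
import numpy as np

# discretised check of ||Xf|| ||Df|| >= ||f||^2 / (4 pi)
x = np.linspace(-8, 8, 16001)
dx = x[1] - x[0]

def uncertainty_product(f):
    # returns ||Xf|| ||Df|| / ||f||^2 on the grid
    norm2 = np.sum(np.abs(f) ** 2) * dx
    Xf_norm = np.sqrt(np.sum(np.abs(x * f) ** 2) * dx)
    Df = np.gradient(f, dx) / (2j * np.pi)       # Df := (1/2 pi i) f'
    Df_norm = np.sqrt(np.sum(np.abs(Df) ** 2) * dx)
    return Xf_norm * Df_norm / norm2

# the gaussian e^{-pi x^2} attains (near-)equality
assert abs(uncertainty_product(np.exp(-np.pi * x ** 2)) - 1 / (4 * np.pi)) < 1e-4
# a non-gaussian profile exceeds the bound strictly
assert uncertainty_product(np.exp(-x ** 4)) > 1.05 / (4 * np.pi)
```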
Remark 1.12.12. For x_0, ξ_0 ∈ R^d and R > 0, define the gaussian wave packet g_{x_0,ξ_0,R} by the formula

g_{x_0,ξ_0,R}(x) := 2^{d/4} R^{−d/2} e^{2πi ξ_0·x} e^{−π|x−x_0|²/R²}.

These wave packets are normalised to have L^2 norm one, and their Fourier transform is given by

(1.110) ĝ_{x_0,ξ_0,R} = e^{2πi ξ_0·x_0} g_{ξ_0,−x_0,1/R}.

Informally, g_{x_0,ξ_0,R} is localised to the region x = x_0 + O(R) in physical space, and to the region ξ = ξ_0 + O(1/R) in frequency space; observe that this is consistent with the uncertainty principle. These packets almost diagonalise the position and momentum operators X, D in the sense that (taking d = 1 for simplicity)

X g_{x_0,ξ_0,R} ≈ x_0 g_{x_0,ξ_0,R}; D g_{x_0,ξ_0,R} ≈ ξ_0 g_{x_0,ξ_0,R}

(where the error terms are morally of the form O(R g_{x_0,ξ_0,R}) and O(R^{−1} g_{x_0,ξ_0,R}) respectively). Of course, the non-commutativity of D and X as evidenced by the last equation in (1.109) shows that exact diagonalisation is impossible. Nevertheless it is useful, at an intuitive level at least, to view these wave packets as a sort of (overdetermined) basis for L^2(R) that approximately diagonalises X and D (as well as other formal combinations a(X, D) of these operators, such as differential operators or pseudodifferential operators). Meanwhile, the Fourier transform morally maps the point (x_0, ξ_0) in phase space to (ξ_0, −x_0), as evidenced by (1.110) or (1.109); it is the model example of the more general class of Fourier integral operators, which morally move points in phase space around by canonical transformations. The study of these types of objects (which are of importance in linear PDE) is known as microlocal analysis, and is beyond the scope of this course.
The proof of the Hausdorff-Young inequality (1.103) carries over to the Euclidean space setting, and gives

(1.111) ‖f̂‖_{L^{p′}(R^d)} ≤ ‖f‖_{L^p(R^d)}

for all 1 ≤ p ≤ 2 and all f ∈ L^p(R^d); in particular the Fourier transform is bounded from L^p(R^d) to L^{p′}(R^d). The constant of 1 on the right-hand side of (1.111) turns out to not be optimal in the Euclidean setting, in contrast to the compact setting; the sharp constant is in fact (p^{1/p}/(p′)^{1/p′})^{d/2}, a result of Beckner [Be1975]. (The fact that this constant cannot be improved can be seen by using the gaussians from Exercise 1.12.32.)
Exercise 1.12.37 (Entropy uncertainty principle). For any f ∈ 𝒮(R^d) with ‖f‖_{L^2(R^d)} = 1, show that

∫_{R^d} |f(x)|² log(1/|f(x)|²) dx + ∫_{R^d} |f̂(ξ)|² log(1/|f̂(ξ)|²) dξ ≥ 0.

(Hint: differentiate (!) (1.104) in p at p = 2, where one has equality in (1.104).) Using Beckner's improvement to (1.103), improve the right-hand side to the optimal value of d log(e/2).
Exercise 1.12.38 (Fourier transform under linear changes of variable). Let L : R^d → R^d be an invertible linear transformation. If f ∈ 𝒮(R^d) and f_L(x) := f(Lx), show that the Fourier transform of f_L is given by the formula

f̂_L(ξ) = (1/|det L|) f̂((L*)^{−1} ξ)

where L* : R^d → R^d is the adjoint operator to L. Verify that this transformation is consistent with (1.104), and indeed shows that the exponent p′ there cannot be replaced by any other exponent.

For L > 0, define the Fourier coefficients of a function on the torus (R/LZ)^d by

f̂(ξ) := ∫_{(R/LZ)^d} f(x) e^{−2πi x·ξ} dx

for all f ∈ L^1((R/LZ)^d) and ξ ∈ (1/L) Z^d.

- Show that for any f ∈ L^2((R/LZ)^d), the Fourier series (1/L^d) Σ_{ξ ∈ (1/L)Z^d} f̂(ξ) e^{2πi x·ξ} converges unconditionally in L^2((R/LZ)^d).
- Use this to give an alternate proof of the Fourier inversion formula (1.108) in the case where f is smooth and compactly supported.
Exercise 1.12.41 (Poisson summation formula). Let f ∈ 𝒮(R^d). Show that the function F : (R/Z)^d → C defined by F(x + Z^d) := Σ_{n∈Z^d} f(x + n) has Fourier transform F̂(ξ) = f̂(ξ) for all ξ ∈ Z^d ⊂ R^d (note the two different Fourier transforms in play here). Conclude the Poisson summation formula

Σ_{n∈Z^d} f(n) = Σ_{m∈Z^d} f̂(m).
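The Poisson summation formula can be tested numerically with the gaussian f(x) = e^{−πx²/r²}, whose transform f̂(ξ) = r e^{−πr²ξ²} follows from Exercise 1.12.32 with d = 1; a sketch (NumPy assumed):

```python
import numpy as np

# Poisson summation: sum_n f(n) = sum_m fhat(m)
# for f(x) = e^{-pi x^2 / r^2}, with fhat(xi) = r e^{-pi r^2 xi^2}
n = np.arange(-50, 51)
for r in (0.5, 1.0, 2.0):
    lhs = np.sum(np.exp(-np.pi * n ** 2 / r ** 2))
    rhs = r * np.sum(np.exp(-np.pi * r ** 2 * n ** 2))
    assert abs(lhs - rhs) < 1e-12
```

The rapid decay of both sides makes truncation to |n| ≤ 50 harmless.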
Exercise 1.12.42. Let f : R^d → C be a compactly supported, absolutely integrable function. Show that the function f̂ is real-analytic. Conclude that it is not possible to find a non-trivial f ∈ L^1(R^d) such that f and f̂ are both compactly supported.
1.12.4. The Fourier transform on general groups (optional). The field of abstract harmonic analysis is concerned, among other things, with extensions of the above theory to more general groups, for instance arbitrary LCA groups. One of the ways to proceed is via Gelfand theory, which for instance can be used to show that the Fourier transform is at least injective:

Exercise 1.12.43 (Fourier analysis via Gelfand theory). (Optional) In this exercise we use the Gelfand theory of commutative Banach *-algebras (see Section 1.10.4) to establish some basic facts of Fourier analysis in general groups. Let G be an LCA group. We view L^1(G) as a commutative Banach *-algebra (see Exercise 1.12.13).
(a) If f ∈ L^1(G) is such that liminf_{n→∞} ‖f^{∗n}‖^{1/n}_{L^1(G)} > 0, where f^{∗n} = f ∗ ... ∗ f is the convolution of n copies of f, show that there exists a non-zero complex number z such that the map g ↦ f ∗ g − zg is not invertible on L^1(G). (Hint: If L^1(G) contains a unit, one can use Exercise 1.10.36; otherwise, adjoin a unit.)

(b) If f and z are as in (a), show that there exists a character λ : L^1(G) → C (in the sense of Banach *-algebras, see Definition 1.10.25) such that f ∗ g − zg lies in the kernel of λ for all g ∈ L^1(G). Conclude in particular that λ(f) is non-zero.

(c) If λ : L^1(G) → C is a character, show that there exists a multiplicative character χ : G → S^1 such that λ(f) = ⟨f, χ⟩ for all f ∈ L^1(G). (You will need Exercise 1.12.5 and Exercise 1.12.10.)
(d) For any f ∈ L^1(G) and g ∈ L^2(G), show that

|f ∗ g ∗ g̃(0)| ≤ |f ∗ f̃ ∗ g ∗ g̃(0)|^{1/2} |g ∗ g̃(0)|^{1/2},

where 0 is the group identity and f̃(x) := \overline{f(−x)}.

(e) Show that if f ∈ L^1(G) is not identically zero, then there exists g ∈ L^2(G) with f ∗ g ∗ g̃(0) ≠ 0 and g ∗ g̃(0) ≠ 0; deduce using (d) that liminf_{n→∞} ‖(f ∗ f̃)^{∗n}‖^{1/n}_{L^1(G)} > 0, then use (a), (b), (c). Conclude that the Fourier transform f ↦ f̂ is injective on L^1(G). (The image of L^1(G) under the Fourier transform is then a Banach *-algebra known as the Wiener algebra, and is denoted A(Ĝ).)
(f) Prove Theorem 1.12.4.

It is possible to use arguments similar to those in Exercise 1.12.43 to characterise positive measures on Ĝ in terms of continuous functions on G, leading to Bochner's theorem:

Theorem 1.12.14 (Bochner's theorem). Let φ ∈ C(G) be a continuous function on an LCA group G. Then the following are equivalent:

(a) Σ_{n=1}^{N} Σ_{m=1}^{N} c_n \overline{c_m} φ(x_n − x_m) ≥ 0 for all x_1, ..., x_N ∈ G and c_1, ..., c_N ∈ C.

(b) There exists a non-negative finite Radon measure ν on Ĝ such that φ(x) = ∫_Ĝ e^{2πi ξ·x} dν(ξ).

Functions obeying either (a) or (b) are known as positive-definite functions. The space of such functions is denoted B(G).
Exercise 1.12.44. Show that (b) implies (a) in Bochner's theorem. (The converse implication is significantly harder, reprising much of the machinery in Exercise 1.12.43, but with φ taking the place of g ∗ g̃.)

Bochner's theorem can in turn be used to build a Plancherel theorem for general LCA groups: given a Haar measure μ on G, there is a Haar measure ν on Ĝ for which one has the Plancherel identity

‖f‖²_{L^2(G,μ)} = ∫_Ĝ |f̂(ξ)|² dν(ξ)

for all f ∈ L^2(G), and the Parseval identity

∫_G f(x) \overline{g(x)} dμ(x) = ∫_Ĝ f̂(ξ) \overline{ĝ(ξ)} dν(ξ);

the inversion formula

f(x) = ∫_Ĝ f̂(ξ) e^{2πi ξ·x} dν(ξ)

is valid for f in a dense subclass of L^2(G) (in particular, it is valid for f ∈ L^1(G) ∩ B(G)).
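The implication (b) ⇒ (a) of Bochner's theorem can be illustrated numerically: for a measure ν supported on finitely many frequencies, the matrix (φ(x_n − x_m)) is Hermitian positive semi-definite. A sketch (NumPy assumed; the frequencies, weights, and sample points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
# a non-negative finite measure nu on a few frequencies
freqs = np.array([-1.3, 0.2, 2.7])
weights = np.array([0.5, 1.0, 0.25])

def phi(x):
    # phi(x) = int e^{2 pi i xi x} d nu(xi)
    return np.sum(weights * np.exp(2j * np.pi * freqs * x))

pts = rng.uniform(-3, 3, size=6)
# the Gram-type matrix (phi(x_n - x_m)) is Hermitian positive semi-definite
M = np.array([[phi(a - b) for b in pts] for a in pts])
assert np.allclose(M, M.conj().T)
assert np.linalg.eigvalsh(M).min() > -1e-10
```

This is just condition (a) specialised to c-vectors ranging over the eigenvectors of M.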
Again, see [Ru1962] for details. A related result is that of Pontryagin duality: if Ĝ is the Pontryagin dual of an LCA group G, then G is the Pontryagin dual of Ĝ. (Certainly, every element x ∈ G defines a character x̂ : ξ ↦ ξ · x on Ĝ, thus embedding G into the dual of Ĝ via the Gelfand transform (see Section 1.10.4); the non-trivial fact is that this embedding is in fact surjective.) One can use Pontryagin duality to convert various properties of LCA groups into other properties on LCA groups. For instance, we have already seen that Ĝ is compact (resp. discrete) if G is discrete (resp. compact); with Pontryagin duality, the implications can now also be reversed. As another example, one can show that Ĝ is connected (resp. torsion-free) if and only if G is torsion-free (resp. connected). We will not prove these assertions here.
It is natural to ask what happens for non-abelian locally compact
groups G = (G, ). One can still build non-trivial Haar measures (the
proof sketched out in Exercise 1.12.7 extends without diculty to the
non-abelian setting), though one must now distinguish between left-
invariant and right-invariant Haar measures. (The two notions are
equivalent for some classes of groups, notably compact groups, but
not in general. Groups for which the two notions of Haar measures
coincide are called unimodular.) However, when G is non-abelian
then there are not enough multiplicative characters : G S
1
to
have a satisfactory Fourier analysis. (Indeed, such characters must
annihilate the commutator group [G, G], and it is entirely possible for
this commutator group to be all of G, e.g. if G is simple and non-
abelian.) Instead, one must generalise the notion of a multiplicative
character to that of a unitary representation : G U(H) from G to
the group of unitary transformations on a complex Hilbert space H;
thus the Fourier coecients
f() of a function will now be operators
on thisl Hilbert space H, rather than complex numbers. When G
234 1. Real analysis
is a compact group, it turns out to be possible to restrict attention
to finite-dimensional representations (thus one can replace $U(H)$ by
the matrix group $U(n)$ for some $n$). The analogue of the Pontryagin
dual $\hat G$ is then the collection of (irreducible) finite-dimensional unitary
representations of $G$, up to isomorphism. There is an analogue of the
Plancherel theorem in this setting, closely related to the Peter-Weyl
theorem in representation theory. We will not discuss these topics
here, but refer the reader instead to any representation theory text.
The situation for non-compact non-abelian groups (e.g. $SL_2(\mathbf{R})$)
is significantly more subtle, as one must now consider infinite-dimensional
representations as well as finite-dimensional ones, and the inversion
formula can become quite non-trivial (one has to decide what weight
each representation should be assigned in that formula). At this
point it seems unprofitable to work in the full category of locally compact
groups, and one instead specialises to a more structured class of groups, e.g.
algebraic groups. The representation theory of such groups is a massive
subject and well beyond the scope of this course.
1.12.5. Relatives of the Fourier transform (optional). There
are a number of other Fourier-like transforms used in mathematics,
which we will briefly survey here. Firstly, there are some rather trivial
modifications one can make to the definition of the Fourier transform, for
instance by replacing the complex exponential $e^{2\pi i x \xi}$ by trigonometric
functions such as $\sin(2\pi x\xi)$ and $\cos(2\pi x\xi)$, or moving around the various
factors of $2\pi$, $i$, $-1$, etc. in the definition. In this spirit, we have the
Laplace transform

(1.112) $\mathcal{L}f(t) := \int_0^\infty f(s) e^{-st}\, ds$

of a measurable function $f: [0,+\infty) \to \mathbf{R}$ with some reasonable
growth at infinity, where $t > 0$. Roughly speaking, the Laplace transform
is "the Fourier transform without the $i$" (cf. Wick rotation),
and so has the (mild) advantage of being definable in the realm of
real-valued functions rather than complex-valued functions. It is particularly
well suited for studying ODE on the half-line $[0,+\infty)$ (e.g.
initial value problems for a finite-dimensional system). The Laplace
transform and Fourier transform can be unified by allowing the $t$ parameter
in (1.112) to vary in the right half-plane $\{t \in \mathbf{C}: \operatorname{Re}(t) \geq 0\}$.
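As a quick numerical illustration of (1.112), one can approximate the Laplace transform by a truncated quadrature and compare against a case with a closed form, e.g. $f(s) = e^{-as}$, for which $\mathcal{L}f(t) = 1/(t+a)$. This is only a sketch; the truncation point, step count, and choice of $f$ below are arbitrary.

```python
import math

def laplace(f, t, T=40.0, n=200_000):
    """Approximate Lf(t) = integral_0^infty f(s) e^{-st} ds by the
    trapezoid rule on the truncated interval [0, T]."""
    h = T / n
    total = 0.5 * (f(0.0) + f(T) * math.exp(-t * T))
    for k in range(1, n):
        s = k * h
        total += f(s) * math.exp(-t * s)
    return total * h

# f(s) = e^{-2s} has Laplace transform 1/(t + 2); at t = 1 this is 1/3.
approx = laplace(lambda s: math.exp(-2.0 * s), t=1.0)
```

The tail beyond $T = 40$ contributes only $e^{-120}/3$ here, so the truncation error is negligible compared with the quadrature error.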
When the Fourier transform is applied to a spherically symmetric
function $f(x) := F(|x|)$ on $\mathbf{R}^d$, then the Fourier transform is also
spherically symmetric, given by the formula $\hat f(\xi) = G(|\xi|)$, where $G$ is
the Fourier-Bessel transform (or Hankel transform)

$G(r) := 2\pi r^{-(d-2)/2} \int_0^\infty F(s) J_{(d-2)/2}(2\pi r s)\, s^{d/2}\, ds,$

where $J_{(d-2)/2}$ is a Bessel function of the first kind. In a similar vein,
the $z$-transform converts a sequence $(c_n)_{n=-\infty}^{\infty}$ of complex
numbers to a formal Laurent series

$\mathcal{Z}c(z) := \sum_{n=-\infty}^{\infty} c_n z^n$

(some authors use $z^{-n}$ instead of $z^n$ here). If one makes the substitution
$z = e^{2\pi i x}$, then this becomes a (formal) Fourier series expansion
then this becomes a (formal) Fourier series expansion
on the unit circle. If the sequence c
n
is restricted to only be non-zero
for non-negative n, and does not grow too quickly as n , then the
z-transform becomes holomorphic on the unit disk, thus providing a
link between Fourier analysis and complex analysis. For instance, the
standard formula
c
n
=
1
2i
_
|z|=1
f(z)
z
n+1
dz
for the Taylor coecients of a holomorphic function f(z) =
n=0
c
n
z
n
at the origin can be viewed as a version of the Fourier inversion for-
mula for the torus R/Z. Just as the Fourier or Laplace transforms
are useful for analysing dierential equations in continuous settings,
the z-transform is useful for analysing dierence equations in discrete
settings. The z-transform is of course also very similar to the method
of generating functions in combinatorics and probability.
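To see the link between the coefficient formula above and Fourier analysis concretely, note that discretising the contour integral at $N$ equally spaced points of the unit circle turns it into a finite Fourier sum. A small sketch, with an arbitrarily chosen holomorphic function and sample count:

```python
import cmath

def taylor_coeff(f, n, N=128):
    """Approximate c_n = (1/2 pi i) * contour integral of f(z) / z^{n+1}
    over |z| = 1 by sampling z_k = e^{2 pi i k / N}; the Riemann sum is
    exactly the finite Fourier sum (1/N) * sum_k f(z_k) e^{-2 pi i k n / N}."""
    return sum(f(cmath.exp(2j * cmath.pi * k / N))
               * cmath.exp(-2j * cmath.pi * k * n / N)
               for k in range(N)) / N

# f(z) = 1/(1 - z/2) = sum_n (1/2)^n z^n converges on the unit circle,
# so the n = 3 coefficient should be (1/2)^3 = 0.125 up to aliasing error.
c3 = taylor_coeff(lambda z: 1.0 / (1.0 - z / 2.0), 3)
```

The discretisation error here is the aliasing sum $\sum_{m \geq 1} c_{n+mN}$, which is astronomically small for this choice of $f$.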
In probability theory one also considers the characteristic function
$\mathbf{E}(e^{itX})$ of a real-valued random variable $X$; this is essentially
the Fourier transform of the probability distribution of $X$. Just as
the Fourier transform is useful for understanding convolutions $f * g$,
the characteristic function is useful for understanding sums $X_1 + X_2$
of independent random variables.
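For a discrete random variable the characteristic function is a finite sum, and its multiplicativity under independent sums can be checked exactly; a toy sketch (the two distributions below are arbitrary choices):

```python
import cmath
from itertools import product

def char_fn(dist, t):
    """Characteristic function E(e^{itX}) of a discrete distribution,
    given as a dict {value: probability}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

def sum_dist(d1, d2):
    """Distribution of X1 + X2 for independent X1 ~ d1, X2 ~ d2
    (a convolution of the two probability mass functions)."""
    out = {}
    for (x1, p1), (x2, p2) in product(d1.items(), d2.items()):
        out[x1 + x2] = out.get(x1 + x2, 0.0) + p1 * p2
    return out

d1 = {0: 0.5, 1: 0.5}      # a fair coin
d2 = {0: 0.25, 1: 0.75}    # a biased coin
t = 0.7
lhs = char_fn(sum_dist(d1, d2), t)          # char. fn. of the sum
rhs = char_fn(d1, t) * char_fn(d2, t)       # product of char. fns.
```

The two quantities agree (up to rounding), reflecting the fact that convolution of distributions corresponds to multiplication of characteristic functions.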
We have briefly touched upon the role of Gelfand theory in the
general theory of the Fourier transform. Indeed, one can view the
Fourier transform as the special case of the Gelfand transform for
Banach *-algebras, which we already discussed in Section 1.10.4.
The Fast Fourier Transform (FFT) is not, strictly speaking, a
variant of the Fourier transform, but rather is an efficient algorithm
for computing the Fourier transform

$\hat f(\xi) = \frac{1}{N} \sum_{x=0}^{N-1} f(x) e^{-2\pi i x \xi / N}$

on a cyclic group $\mathbf{Z}/N\mathbf{Z} \equiv \{0, \dots, N-1\}$, when $N$ is large but composite.
Note that a brute force computation of this transform for all
$N$ values of $\xi$ would require about $O(N^2)$ addition and multiplication
operations. The FFT algorithm, in contrast, takes only $O(N \log N)$
operations, and is based on reducing the FFT for a large $N$ to FFTs
for smaller $N$. For instance, suppose $N$ is even, say $N = 2M$; then
observe that
$\hat f(\xi) = \frac{1}{2}\left( \hat f_0(\xi) + e^{-2\pi i \xi / N} \hat f_1(\xi) \right),$

where $f_0, f_1: \mathbf{Z}/M\mathbf{Z} \to \mathbf{C}$ are the functions $f_j(x) := f(2x+j)$. Thus
one can obtain the Fourier transform of the length $N$ vector $f$ from
the Fourier transforms of the two length $M$ vectors $f_0, f_1$ after about
$O(N)$ operations. Iterating this, we see that we can indeed compute
$\hat f$ in $O(N \log N)$ operations, at least when $N$ is a power of two; a
similar (but more complicated) recursion is available for other composite $N$.

In a related spirit, if a compact abelian group $G$ acts on a measure
space $X$, then any function $f \in L^2(X)$ can be decomposed into a series
$\sum_{\xi \in \hat G} f_\xi$ of components $f_\xi(x) := \int_G e^{-2\pi i g \cdot \xi} f(gx)\, dg$, where
the series is unconditionally convergent in $L^2(X)$. The reason for
doing this is that each of the $f_\xi$ transforms in a simple fashion under
the group action: $f_\xi(gx) = e^{2\pi i g \cdot \xi} f_\xi(x)$ for
(almost) all $g \in G$, $x \in X$. This decomposition is closely related to
the decomposition in representation theory of a given representation
into irreducible components. Perhaps the most basic example of this
type of operation is the decomposition of a function $f: \mathbf{R} \to \mathbf{R}$ into
even and odd components $\frac{f(x)+f(-x)}{2}$, $\frac{f(x)-f(-x)}{2}$; here the underlying
group is $\mathbf{Z}/2\mathbf{Z}$, which acts on $\mathbf{R}$ by reflections, $gx := (-1)^g x$.
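Returning to the FFT: the radix-2 recursion described earlier, with the $1/N$ normalisation used in this section, can be sketched in a few lines of code. This is a toy illustration rather than an optimised implementation, and the test vector is an arbitrary choice.

```python
import cmath

def dft(f):
    """Direct O(N^2) evaluation of hat f(xi) = (1/N) sum_x f(x) e^{-2 pi i x xi / N}."""
    N = len(f)
    return [sum(f[x] * cmath.exp(-2j * cmath.pi * x * xi / N)
                for x in range(N)) / N
            for xi in range(N)]

def fft(f):
    """Radix-2 FFT via hat f(xi) = (hat f_0(xi) + e^{-2 pi i xi / N} hat f_1(xi)) / 2,
    where f_j(x) := f(2x + j). Requires len(f) to be a power of two."""
    N = len(f)
    if N == 1:
        return [complex(f[0])]
    M = N // 2
    F0, F1 = fft(f[0::2]), fft(f[1::2])  # transforms of the even/odd subsequences
    return [(F0[xi % M] + cmath.exp(-2j * cmath.pi * xi / N) * F1[xi % M]) / 2
            for xi in range(N)]
```

Both functions agree up to rounding error; only the operation count differs ($O(N^2)$ versus $O(N \log N)$).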
The operation of converting a square matrix $A = (a_{ij})_{1 \leq i,j \leq n}$ of
numbers into its eigenvalues $\lambda_1, \dots, \lambda_n$ or singular values $\sigma_1, \dots, \sigma_n$ can
be viewed as a sort of non-commutative generalisation of the Fourier
transform. (Note that the eigenvalues of a circulant matrix are essentially
the Fourier coefficients of the first row of that matrix.) For
instance, the identity

$\sum_{i=1}^n \sum_{j=1}^n |a_{ij}|^2 = \sum_{k=1}^n \sigma_k^2$

can be viewed as
a variant of the Plancherel identity. More generally, there are close
relationships between spectral theory and Fourier analysis (as one can
already see from the connection to Gelfand theory). For instance, in
$\mathbf{R}^d$ and $\mathbf{T}^d$, one can view Fourier analysis as the spectral theory of the
gradient operator $\nabla$ (note that the characters $e^{2\pi i x \cdot \xi}$ are joint eigenfunctions
of $\nabla$). As the gradient operator is closely related to the
Laplacian $\Delta$, it is not surprising that Fourier analysis is also closely
related to the spectral theory of the Laplacian, and in particular to
various operators built using the Laplacian (e.g. resolvents, heat kernels,
wave operators, Schrodinger operators, Littlewood-Paley projections,
etc.). Indeed, the spectral theory of the Laplacian can serve as
a partial substitute for the Fourier transform in situations in which
there is not enough symmetry to exploit Fourier-analytic techniques
(e.g. on a manifold with no translation symmetries).
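Since $\sum_k \sigma_k^2$ is the sum of the eigenvalues of $A^* A$, the Plancherel-type identity above can be checked by hand in small cases. Here is a sketch for an arbitrarily chosen real $2 \times 2$ matrix, with the eigenvalues of $A^T A$ computed from the quadratic formula:

```python
import math

def frobenius_sq(A):
    """The left-hand side: sum over i, j of |a_ij|^2 (real entries assumed)."""
    return sum(a * a for row in A for a in row)

def singular_values_sq_2x2(A):
    """Squared singular values of a real 2x2 matrix A, i.e. the eigenvalues
    of A^T A, obtained from its trace and determinant."""
    (a, b), (c, d) = A
    # A^T A = [[a^2 + c^2, ab + cd], [ab + cd, b^2 + d^2]]
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d
    tr, det = p + r, p * r - q * q
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return [(tr + disc) / 2.0, (tr - disc) / 2.0]

A = [[1.0, 2.0], [3.0, -1.0]]  # an arbitrary example matrix
```

Of course, for this identity the check reduces to the fact that the sum of the eigenvalues of $A^T A$ is its trace; the point of the sketch is only to make the two sides of the identity explicit.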
Finally, there is an analogue of the Fourier duality relationship
between an LCA group $G$ and its Pontryagin dual $\hat G$ in algebraic
geometry, known as the Fourier-Mukai transform, which relates an
abelian variety $X$ to its dual $\hat X$, and transforms coherent sheaves on
the former to coherent sheaves on the latter. This transform obeys
many of the algebraic identities that the Fourier transform does, although
it does not seem to have as much of the analytic structure.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/04/06.
Thanks to Hunter, Marco Frasca, Max Baroi, PDEbeginner, timur,
Xiaochuan Liu, and anonymous commenters for corrections.
1.13. Distributions
In set theory, a function $f: X \to Y$ is defined as an object that evaluates
every input $x$ to exactly one output $f(x)$. However, in various
branches of mathematics, it has become convenient to generalise this
classical concept of a function to a more abstract one. For instance, in
operator algebras, quantum mechanics, or non-commutative geometry,
one often replaces the commutative algebras of (real or complex-valued)
functions on some space $X$, such as $C(X)$ or $L^\infty(X)$, with a more
abstract (and possibly non-commutative) algebra (e.g. a $C^*$-algebra
or a von Neumann algebra). Elements in this more abstract algebra
are no longer definable as functions in the classical sense of assigning
a single value $f(x)$ to every point $x \in X$, but one can still define other
operations on these generalised functions (e.g. one can multiply or
take inner products between two such objects).
Generalisations of functions are also very useful in analysis. In
our study of $L^p$ spaces, we have already seen one such generalisation,
namely the concept of a function defined up to almost everywhere
equivalence. Such a function $f$ (or more precisely, an equivalence
class of classical functions) cannot be evaluated at any given point $x$,
if that point has measure zero. However, it is still possible to perform
algebraic operations on such functions (e.g. multiplying or adding
two functions together), and one can also integrate such functions
on measurable sets (provided, of course, that the function has some
suitable integrability condition). We also know that the $L^p$ spaces
can usually be described via duality, as the dual space of $L^{p'}$ (where
$p'$ is the dual exponent of $p$), except in some endpoint cases, namely
when $p = \infty$, or when $p = 1$ and the underlying space is not $\sigma$-finite.
We have also seen (via the Lebesgue-Radon-Nikodym theorem)
that locally integrable functions $f \in L^1_{loc}(\mathbf{R})$ on, say, the real line $\mathbf{R}$,
can be identified with locally finite absolutely continuous measures
$m_f$ on the line, formed by multiplying Lebesgue measure $m$ by the function
$f$. So another way to generalise the concept of a function is to consider
arbitrary locally finite Radon measures (not necessarily absolutely
continuous), such as the Dirac measure $\delta_0$. With this concept of generalised
function, one can still add and subtract two measures $\mu, \nu$,
and integrate any measure $\mu$ against a (bounded) measurable set $E$
to obtain a number $\mu(E)$, but one cannot evaluate a measure $\mu$ (or
more precisely, the Radon-Nikodym derivative $d\mu/dm$ of that measure)
at a single point $x$, and one also cannot multiply two measures
together to obtain another measure. From the Riesz representation
theorem, we also know that the space of (finite) Radon measures can
be described via duality, as linear functionals on $C_c(\mathbf{R})$.
There is an even larger class of generalised functions that is very
useful, particularly in linear PDE, namely the space of distributions,
say on a Euclidean space $\mathbf{R}^d$. In contrast to Radon measures $\mu$,
which can be defined by how they pair up against continuous,
compactly supported test functions $f \in C_c(\mathbf{R}^d)$ to create numbers
$\langle f, \mu \rangle := \int_{\mathbf{R}^d} f \, d\overline{\mu}$, a distribution $\lambda$ is defined by how it pairs up
against a smooth, compactly supported function $f \in C_c^\infty(\mathbf{R}^d)$ to create
a number $\langle f, \lambda \rangle$. As the space $C_c^\infty(\mathbf{R}^d)$ of smooth, compactly supported
functions is smaller than (but dense in) the space $C_c(\mathbf{R}^d)$ of
continuous, compactly supported functions (and has a stronger topology),
the space of distributions is larger than that of measures. But
the space $C_c^\infty(\mathbf{R}^d)$ is closed under more operations than $C_c(\mathbf{R}^d)$, and
in particular is closed under differential operators (with smooth coefficients).
Because of this, the space of distributions is similarly closed
under such operations; in particular, one can differentiate a distribution
and get another distribution, which is something that is not
always possible with measures or $L^p$ functions. But as measures or
functions can be interpreted as distributions, this leads to the notion
of a weak derivative for such objects, which makes sense (but only
as a distribution) even for functions that are not classically differentiable.
Thus the theory of distributions can allow one to rigorously
manipulate rough functions as if they were smooth, although one
must still be careful as some operations on distributions are not well-defined,
most notably the operation of multiplying two distributions
together. Nevertheless one can use this theory to justify many formal
computations involving derivatives, integrals, etc. (including several
computations used routinely in physics) that would be difficult to
formalise rigorously in a purely classical framework.
If one shrinks the space of distributions slightly, to the space
of tempered distributions (which is formed by enlarging the dual class
$C_c^\infty(\mathbf{R}^d)$ to the Schwartz class $\mathcal{S}(\mathbf{R}^d)$), then one obtains closure under
another important operation, namely the Fourier transform. This
allows one to define various Fourier-analytic operations (e.g. pseudo-differential
operators) on such distributions.
Of course, at the end of the day, one is usually not all that interested
in distributions in their own right, but would like to be able
to use them as a tool to study more classical objects, such as smooth
functions. Fortunately, one can recover facts about smooth functions
from facts about the (far rougher) space of distributions in a number
of ways. For instance, if one convolves a distribution with a smooth,
compactly supported function, one gets back a smooth function. This
is a particularly useful fact in the theory of constant-coefficient linear
partial differential equations such as $Lu = f$, as it allows one to
recover a smooth solution $u$ from smooth, compactly supported data
$f$ by convolving $f$ with a specific distribution $G$, known as the fundamental
solution of $L$. We will give some examples of this later in
this section.
It is this unusual and useful combination of both being able to
pass from classical functions to generalised functions (e.g. by differentiation)
and then back from generalised functions to classical
functions (e.g. by convolution) that sets the theory of distributions
apart from other competing theories of generalised functions, in particular
allowing one to justify many formal calculations in PDE and
Fourier analysis rigorously with relatively little additional effort. On
the other hand, being defined by linear duality, the theory of distributions
becomes somewhat less useful when one moves to more nonlinear
problems, such as nonlinear PDE. However, distributions still serve an
important supporting role in such problems as an ambient space of
functions, inside of which one carves out more useful function spaces,
such as Sobolev spaces, which we will discuss in the next set of notes.
1.13.1. Smooth functions with compact support. In the rest
of the notes we will work on a fixed Euclidean space $\mathbf{R}^d$. (One can
also define distributions on other domains related to $\mathbf{R}^d$, such as open
subsets of $\mathbf{R}^d$, or $d$-dimensional manifolds, but for simplicity we shall
restrict attention to Euclidean spaces in these notes.)

A test function is any smooth, compactly supported function $f:
\mathbf{R}^d \to \mathbf{C}$; the space of such functions is denoted $C_c^\infty(\mathbf{R}^d)$. (In some
texts, this space is denoted $C_0^\infty(\mathbf{R}^d)$ instead.)
From analytic continuation one sees that there are no real-analytic
test functions other than the zero function. Despite this negative result,
test functions actually exist in abundance:

Exercise 1.13.1.

(i) Show that there exists at least one test function that is not
identically zero. (Hint: it suffices to do this for $d = 1$. One
starting point is to use the fact that the function $f: \mathbf{R} \to \mathbf{R}$
defined by $f(x) := e^{-1/x}$ for $x > 0$ and $f(x) := 0$ otherwise
is smooth, even at the origin $0$.)
(ii) Show that if $f \in C_c^\infty(\mathbf{R}^d)$ and $g: \mathbf{R}^d \to \mathbf{C}$ is absolutely
integrable and compactly supported, then the convolution
$f * g$ is also in $C_c^\infty(\mathbf{R}^d)$. (Hint: first show that $f * g$ is
continuously differentiable with $\nabla(f * g) = (\nabla f) * g$.)

(iii) (Smooth Urysohn lemma) Let $K$ be a compact set, and let $U$
be an open neighbourhood of $K$. Show that there exists a
function in $C_c^\infty(\mathbf{R}^d)$ supported in $U$ which equals
1 on $K$. (Hint: use the ordinary Urysohn lemma to find a
function in $C_c(\mathbf{R}^d)$ that equals 1 on a neighbourhood of $K$
and is supported in a compact subset of $U$, then convolve
this function by a suitable test function.)
(iv) Show that $C_c^\infty(\mathbf{R}^d)$ is dense in $C_0(\mathbf{R}^d)$ (in the uniform topology),
and dense in $L^p(\mathbf{R}^d)$ (with the $L^p$ topology) for all
$0 < p < \infty$.
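As a concrete companion to part (i) of the exercise, one can evaluate the standard bump function $\psi(x) := e^{-1/(1-x^2)}$ for $|x| < 1$ (and $0$ otherwise) and observe numerically that its difference quotients stay small across the endpoint $x = 1$, consistent with all derivatives vanishing there. A numerical check is of course no substitute for the proof requested in the exercise.

```python
import math

def bump(x):
    """The standard test function: exp(-1/(1 - x^2)) on (-1, 1), zero outside."""
    if abs(x) >= 1.0:
        return 0.0
    return math.exp(-1.0 / (1.0 - x * x))

# Compact support and positivity at the centre:
assert bump(1.0) == bump(-2.0) == 0.0 and bump(0.0) == math.exp(-1.0)

# A first difference quotient at the edge of the support is tiny,
# consistent with the function vanishing to infinite order at x = 1:
h = 1e-3
slope_at_edge = (bump(1.0 - h) - bump(1.0)) / h
```

Here `bump(0.999)` is of size roughly $e^{-500}$, so the difference quotient at the edge is far below any fixed tolerance.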
The space $C_c^\infty(\mathbf{R}^d)$ is clearly a vector space. Now we place a (very
strong!) topology on it. We first observe that $C_c^\infty(\mathbf{R}^d) = \bigcup_K C_c^\infty(K)$,
where $K$ ranges over all compact subsets of $\mathbf{R}^d$ and $C_c^\infty(K)$ consists of
those functions $f \in C_c^\infty(\mathbf{R}^d)$ which are supported in $K$. Each $C_c^\infty(K)$
will be given a topology (called the smooth topology) generated by the
norms

$\|f\|_{C^k} := \sup_{x \in \mathbf{R}^d} \sum_{j=0}^{k} |\nabla^j f(x)|$

for $k = 0, 1, \dots$, where we view $\nabla^j f(x)$ as a $d^j$-dimensional vector
(or, if one wishes, a $d$-dimensional rank $j$ tensor); thus a sequence
$f_n \in C_c^\infty(K)$ converges to a limit $f \in C_c^\infty(K)$ if and only if $\nabla^j f_n$
converges uniformly to $\nabla^j f$ for all $j = 0, 1, \dots$. (This gives $C_c^\infty(K)$
the structure of a Frechet space, though we will not use this fact here.)
We make the trivial remark that if $K \subset K'$, then $C_c^\infty(K)$ is a
subspace of $C_c^\infty(K')$, and the topology of the former is the restriction
of the topology of the latter. We then give $C_c^\infty(\mathbf{R}^d)$ the final topology
induced by the topologies on the $C_c^\infty(K)$, defined as the strongest
topology on $C_c^\infty(\mathbf{R}^d)$ which restricts to the topologies on $C_c^\infty(K)$
for each $K$. Equivalently, a set is open in $C_c^\infty(\mathbf{R}^d)$ if and only if its
restriction to $C_c^\infty(K)$ is open for every compact $K$.
Exercise 1.13.2. Let $f_n$ be a sequence in $C_c^\infty(\mathbf{R}^d)$, and let $f$ be
another function in $C_c^\infty(\mathbf{R}^d)$. Show that $f_n$ converges in the topology
of $C_c^\infty(\mathbf{R}^d)$ to $f$ if and only if there exists a compact set $K$ such that
$f_n, f$ are all supported in $K$, and $f_n$ converges to $f$ in the smooth
topology of $C_c^\infty(K)$.
Exercise 1.13.3.

(i) Show that the topology of $C_c^\infty(K)$ is first countable for every
compact $K$.

(ii) Show that the topology of $C_c^\infty(\mathbf{R}^d)$ is not first countable.
(Hint: given any countable sequence of open neighbourhoods
of 0, build a new open neighbourhood that does not
contain any of the previous ones, using the $\sigma$-compact nature
of $\mathbf{R}^d$.)

(iii) Despite this, show that an element $f \in C_c^\infty(\mathbf{R}^d)$ is an adherent
point of a set $E \subset C_c^\infty(\mathbf{R}^d)$ if and only if there is a
sequence $f_n \in E$ that converges to $f$. (Hint: argue by contradiction.)
Conclude in particular that a subset of $C_c^\infty(\mathbf{R}^d)$
is closed if and only if it is sequentially closed. Thus while
first countability fails for $C_c^\infty(\mathbf{R}^d)$, we have a serviceable
substitute for this property.

There are plenty of continuous operations on $C_c^\infty(\mathbf{R}^d)$:
Exercise 1.13.4.

(i) Let $K$ be a compact set. Show that a linear map $T:
C_c^\infty(K) \to X$ into a normed vector space $X$ is continuous
if and only if there exist $k \geq 0$ and $C > 0$ such that
$\|Tf\|_X \leq C \|f\|_{C^k}$ for all $f \in C_c^\infty(K)$.

(ii) Let $K, K'$ be compact sets. Show that a linear map $T:
C_c^\infty(K) \to C_c^\infty(K')$ is continuous if and only if for every
$k \geq 0$ there exist $k' \geq 0$ and a constant $C_k > 0$ such that
$\|Tf\|_{C^k} \leq C_k \|f\|_{C^{k'}}$ for all $f \in C_c^\infty(K)$.

(iii) Show that a map $T: C_c^\infty(\mathbf{R}^d) \to X$ to a topological space $X$
is continuous if and only if for every compact set $K \subset \mathbf{R}^d$,
$T$ maps $C_c^\infty(K)$ continuously to $X$.

(iv) Show that the inclusion map from $C_c^\infty(\mathbf{R}^d)$ to $L^p(\mathbf{R}^d)$ is
continuous for every $0 < p \leq \infty$.

(v) Show that a map $T: C_c^\infty(\mathbf{R}^d) \to C_c^\infty(\mathbf{R}^d)$ is continuous if
and only if for every compact set $K \subset \mathbf{R}^d$ there exists a
compact set $K'$ such that $T$ maps $C_c^\infty(K)$ continuously to
$C_c^\infty(K')$.

(vi) Show that every linear differential operator with smooth coefficients
is a continuous operation on $C_c^\infty(\mathbf{R}^d)$.

(vii) Show that convolution with any absolutely integrable, compactly
supported function is a continuous operation on $C_c^\infty(\mathbf{R}^d)$.

(viii) Show that $C_c^\infty(\mathbf{R}^d)$ is a topological vector space.

(ix) Show that the product operation $f, g \mapsto fg$ is continuous
from $C_c^\infty(\mathbf{R}^d) \times C_c^\infty(\mathbf{R}^d)$ to $C_c^\infty(\mathbf{R}^d)$.
A sequence $\psi_n \in C_c(\mathbf{R}^d)$ of continuous, compactly supported
functions is said to be an approximation to the identity if the $\psi_n$ are
non-negative, have total mass $\int_{\mathbf{R}^d} \psi_n$ equal to 1, and converge uniformly
to zero away from the origin, thus $\sup_{|x| \geq r} |\psi_n(x)| \to 0$ for
all $r > 0$. One can generate such a sequence by starting with a single
non-negative continuous compactly supported function $\psi$ of total
mass 1, and then setting $\psi_n(x) := n^d \psi(nx)$; many other constructions
are possible also.
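As a sketch of how such a rescaled family concentrates mass at the origin, one can take $\psi$ to be the triangle function $\max(0, 1 - |x|)$ in one dimension (an arbitrary choice of unit-mass kernel) and check numerically that $(f * \psi_n)(0) \to f(0)$ for a continuous $f$:

```python
import math

def psi(x):
    """A unit-mass continuous kernel: the triangle function max(0, 1 - |x|)."""
    return max(0.0, 1.0 - abs(x))

def convolve_at_zero(f, n, steps=20_000):
    """Approximate (f * psi_n)(0) = integral of f(y) psi_n(-y) dy, where
    psi_n(x) := n psi(n x), by a midpoint rule on the support [-1/n, 1/n].
    (psi is even, so psi_n(-y) = psi_n(y).)"""
    a, b = -1.0 / n, 1.0 / n
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) * n * psi(n * (a + (k + 0.5) * h))
               for k in range(steps)) * h

# f(y) = cos(y): the mollified values at 0 approach f(0) = 1 as n grows.
vals = [convolve_at_zero(math.cos, n) for n in (1, 5, 50)]
```

For this kernel one can compute $(f * \psi_n)(0) = 2n^2(1 - \cos(1/n)) = 1 - O(1/n^2)$ exactly, which the numerics reproduce.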
One has the following useful fact:
Exercise 1.13.5. Let $\psi_n \in C_c(\mathbf{R}^d)$ be a sequence of approximations
to the identity.

(i) If $f \in C(\mathbf{R}^d)$ is continuous, show that $f * \psi_n$ converges
uniformly on compact sets to $f$.

(ii) If $f \in L^p(\mathbf{R}^d)$ for some $1 \leq p < \infty$, show that $f * \psi_n$ converges
in $L^p(\mathbf{R}^d)$ to $f$. (Hint: use (i), the density of $C_0(\mathbf{R}^d)$
in $L^p(\mathbf{R}^d)$, and Young's inequality, Exercise 1.11.25.)

(iii) If $f \in C_c^\infty(\mathbf{R}^d)$, show that $f * \psi_n$ converges in $C_c^\infty(\mathbf{R}^d)$ to $f$.
(Hint: use the identity $\nabla(f * \psi_n) = (\nabla f) * \psi_n$, cf. Exercise
1.13.1(ii).)
Exercise 1.13.6. Show that $C_c^\infty(\mathbf{R}^d)$ is separable. (Hint: it suffices
to show that $C_c^\infty(K)$ is separable for each compact $K$. There are
several ways to accomplish this. One is to begin with the Stone-Weierstrass
theorem, which will give a countable set which is dense
in the uniform topology, then use the fundamental theorem of calculus
to strengthen the topology. Another is to use Exercise 1.13.5 and then
discretise the convolution. Another is to embed $K$ into a torus and
use Fourier series, noting that the Fourier coefficients $\hat f(n)$ of a smooth
function $f: \mathbf{T}^d \to \mathbf{C}$ decay faster than any power of $|n|$.)
1.13.2. Distributions. Now we can define the concept of a distribution.

Definition 1.13.1 (Distribution). A distribution on $\mathbf{R}^d$ is a continuous
linear functional $\lambda: f \mapsto \langle f, \lambda \rangle$ from $C_c^\infty(\mathbf{R}^d)$ to $\mathbf{C}$. The space
of such distributions is denoted $C_c^\infty(\mathbf{R}^d)^*$, and is given the weak-* topology
induced by $C_c^\infty(\mathbf{R}^d)$.

A technical point: we endow the space $C_c^\infty(\mathbf{R}^d)^*$ with the conjugate
complex structure: if $\lambda \in C_c^\infty(\mathbf{R}^d)^*$ and $c$ is a complex
number, then $c\lambda$ is the distribution that maps a test function $f$ to
$\overline{c}\langle f, \lambda \rangle$, rather than $c\langle f, \lambda \rangle$; thus $\langle f, c\lambda \rangle = \overline{c}\langle f, \lambda \rangle$. This is to keep the
analogy between the evaluation of a distribution against a function,
and the usual Hermitian inner product $\langle f, g \rangle = \int_{\mathbf{R}^d} f \overline{g}$ of two test
functions.
From Exercise 1.13.4, we see that a linear functional $\lambda: C_c^\infty(\mathbf{R}^d) \to \mathbf{C}$
is a distribution if, for every compact set $K \subset \mathbf{R}^d$, there exist
$k \geq 0$ and $C > 0$ such that

(1.113) $|\langle f, \lambda \rangle| \leq C \|f\|_{C^k}$

for all $f \in C_c^\infty(K)$.
Exercise 1.13.7. Show that $C_c^\infty(\mathbf{R}^d)^*$ is a Hausdorff topological
vector space.
We note two basic examples of distributions:

Any locally integrable function $g \in L^1_{loc}(\mathbf{R}^d)$ can be viewed
as a distribution, by writing $\langle f, g \rangle := \int_{\mathbf{R}^d} f(x) \overline{g(x)}\, dx$ for
all test functions $f$.

Any complex Radon measure $\mu$ can be viewed as a distribution,
by writing $\langle f, \mu \rangle := \int_{\mathbf{R}^d} f(x)\, d\overline{\mu}$, where $\overline{\mu}$ is the complex
conjugate of $\mu$ (thus $\overline{\mu}(E) := \overline{\mu(E)}$). (Note that this
example generalises the preceding one, which corresponds
to the case when $\mu$ is absolutely continuous with respect to
Lebesgue measure.) Thus, for instance, the Dirac measure $\delta$
at the origin is a distribution, with $\langle f, \delta \rangle = f(0)$ for all
test functions $f$.
Exercise 1.13.8. Show that the above identifications of locally integrable
functions or complex Radon measures with distributions are
injective. (Hint: use Exercise 1.13.1(iv).)

From the above exercise, we may view locally integrable functions
and locally finite measures as a special type of distribution. In
particular, $C_c^\infty(\mathbf{R}^d)$ and $L^p(\mathbf{R}^d)$ are now contained in $C_c^\infty(\mathbf{R}^d)^*$ for
all $1 \leq p \leq \infty$.
Exercise 1.13.9. Show that if a sequence of locally integrable functions
converges in $L^1_{loc}$ to a limit, then it also converges in the sense
of distributions; similarly, if a sequence of complex Radon measures
converges in the vague topology to a limit, then it also converges in
the sense of distributions.

Thus we see that convergence in the sense of distributions is
among the weakest of the notions of convergence used in analysis;
however, from the Hausdorff property, distributional limits are still
unique.
Exercise 1.13.10. If $\psi_n$ is a sequence of approximations to the identity,
show that $\psi_n$ converges in the sense of distributions to the Dirac
distribution $\delta$.
More exotic examples of distributions can be given:
Exercise 1.13.11 (Derivative of the delta function). Let $d = 1$.
Show that the functional $\lambda: f \mapsto -f'(0)$ is a distribution which does
not arise from either a locally integrable function or a Radon measure.

Exercise 1.13.12 (Principal value of $1/x$). Let $d = 1$. Show that the
functional $f \mapsto \lim_{\varepsilon \to 0} \int_{|x| > \varepsilon} \frac{f(x)}{x}\, dx$ is well defined on test
functions and is a distribution, denoted $\mathrm{p.v.}\ \frac{1}{x}$.

Exercise 1.13.13. Let $d = 1$. For each $r > 0$, show that the functional
$\lambda_r: f \mapsto \int_{|x| \leq r} \frac{f(x) - f(0)}{|x|}\, dx + \int_{|x| > r} \frac{f(x)}{|x|}\, dx$ is a
distribution, and that the distributions $\lambda_r$, $\lambda_{r'}$ for different values of $r, r'$
differ by a constant multiple of the Dirac delta distribution.
Exercise 1.13.14. A distribution $\lambda$ is said to be real if $\langle f, \lambda \rangle$ is real
for every real-valued test function $f$. Show that every distribution $\lambda$
can be uniquely expressed as $\operatorname{Re}(\lambda) + i \operatorname{Im}(\lambda)$ for some real distributions
$\operatorname{Re}(\lambda), \operatorname{Im}(\lambda)$.

Exercise 1.13.15. A distribution $\lambda$ is said to be non-negative if
$\langle f, \lambda \rangle$ is non-negative for every non-negative test function $f$. Show
that a distribution is non-negative if and only if it is a non-negative
Radon measure. (Hint: use the Riesz representation theorem and
Exercise 1.13.1(iv).) Note that this implies that the analogue of the
Jordan decomposition fails for distributions; any distribution which
is not a Radon measure will not be the difference of non-negative
distributions.
We will now extend various operations on locally integrable functions
or Radon measures to distributions by arguing by analogy.
(Shortly we will give a more formal approach, based on density.)

We begin with the operation of multiplying a distribution $\lambda$ by a
smooth function $h: \mathbf{R}^d \to \mathbf{C}$. Observe that

$\langle f, gh \rangle = \langle f\overline{h}, g \rangle$

for all test functions $f, g, h$. Inspired by this formula, we define the
product $h\lambda = \lambda h$ of a distribution $\lambda$ with a smooth function $h$ by setting

$\langle f, h\lambda \rangle := \langle f\overline{h}, \lambda \rangle$

for all test functions $f$. It is easy to see (e.g. using Exercise 1.13.4(vi))
that this defines a distribution $h\lambda$, and that this operation is compatible
with existing definitions of products between a locally integrable
function (or Radon measure) and a smooth function. It is important
that $h$ is smooth (and not merely, say, continuous) because one needs
the product of a test function $f$ with $\overline{h}$ to still be a test function.
Exercise 1.13.16. Let $d = 1$. Establish the identity

$f\delta = f(0)\delta$

for any smooth function $f$. In particular,

$x\delta = 0,$

where we abuse notation slightly and write $x$ for the identity function
$x \mapsto x$. Conversely, if $\lambda$ is a distribution such that

$x\lambda = 0,$

show that $\lambda$ is a constant multiple of $\delta$. (Hint: use the identity
$f(x) = f(0) + x \int_0^1 f'(tx)\, dt$.)
Exercise 1.13.18. Show that every distribution is the limit of a sequence
of compactly supported distributions (using the weak-* topology,
of course). (Hint: approximate a distribution $\lambda$ by the truncated
distributions $\eta_n \lambda$ for some smooth cutoff functions $\eta_n$ constructed
using Exercise 1.13.1(iii).)
In a similar spirit, we can convolve a distribution $\lambda$ by an absolutely
integrable, compactly supported function $h \in L^1(\mathbf{R}^d)$. From
Fubini's theorem we observe the formula

$\langle f, g * h \rangle = \langle f * \tilde h, g \rangle$

for all test functions $f, g, h$, where $\tilde h(x) := \overline{h(-x)}$. Inspired by this
formula, we define the convolution $\lambda * h = h * \lambda$ of a distribution $\lambda$
with an absolutely integrable, compactly supported function $h$ by the
formula

(1.114) $\langle f, \lambda * h \rangle := \langle f * \tilde h, \lambda \rangle$

for all test functions $f$. This gives a well-defined distribution $\lambda * h$
(thanks to Exercise 1.13.4(vii)) which is compatible with previous
notions of convolution.
Example 1.13.3. One has $\delta * f = f * \delta = f$ for all test functions $f$.
In one dimension, we have $\delta' * f = f'$, where $\delta'$ is the distribution
from Exercise 1.13.11.

Lemma 1.13.4. Let $\lambda \in C_c^\infty(\mathbf{R}^d)^*$ be a distribution, and let $h \in
C_c^\infty(\mathbf{R}^d)$ be a test function. Then $\lambda * h$ is equal to a smooth function.
Proof. If $\lambda$ were itself a smooth function, then one could easily verify
the identity

(1.115) $\lambda * h(x) = \langle h_x, \overline{\lambda} \rangle,$

where $h_x(y) := h(x - y)$ and $\overline{\lambda} := \operatorname{Re}(\lambda) - i\operatorname{Im}(\lambda)$ (cf. Exercise
1.13.14). As $h$ is a test function, it is easy to see
that $h_x$ varies smoothly in $x$ in any $C^k$ norm (indeed, it has Taylor
expansions to any order in such norms) and so the right-hand side is a
smooth function of $x$. So it suffices to verify the identity (1.115). As
distributions are defined against test functions $f$, it suffices to show
that

$\langle f, \lambda * h \rangle = \int_{\mathbf{R}^d} f(x) \overline{\langle h_x, \overline{\lambda} \rangle}\, dx.$

On the other hand, we have from (1.114) that

$\langle f, \lambda * h \rangle = \langle f * \tilde h, \lambda \rangle = \left\langle \int_{\mathbf{R}^d} f(x) \overline{h_x}\, dx, \lambda \right\rangle.$

So the only issue is to justify the interchange of integral and inner
product:

$\int_{\mathbf{R}^d} f(x) \langle \overline{h_x}, \lambda \rangle\, dx = \left\langle \int_{\mathbf{R}^d} f(x) \overline{h_x}\, dx, \lambda \right\rangle.$

Certainly, (from the compact support of $f$) any Riemann sum can be
interchanged with the inner product:

$\sum_n f(x_n) \langle \overline{h_{x_n}}, \lambda \rangle\, \Delta x = \left\langle \sum_n f(x_n) \overline{h_{x_n}}\, \Delta x, \lambda \right\rangle,$

where $x_n$ ranges over some lattice and $\Delta x$ is the volume of the fundamental
domain. A modification of the argument that shows convergence
of the Riemann integral for smooth, compactly supported
functions then works here and allows one to take limits; we omit the
details.
This has an important corollary:

Lemma 1.13.5. Every distribution is the limit of a sequence of test
functions. In particular, $C_c^\infty(\mathbf{R}^d)$ is dense in $C_c^\infty(\mathbf{R}^d)^*$.

Proof. By Exercise 1.13.18, it suffices to verify this for compactly
supported distributions $\lambda$. We let $\psi_n$ be a sequence of approximations
to the identity. By Exercise 1.13.5(iii) and (1.114), we see that $\lambda * \psi_n$
converges in the sense of distributions to $\lambda$. By Lemma 1.13.4, $\lambda * \psi_n$
is a smooth function; as $\lambda$ and $\psi_n$ are both compactly supported,
$\lambda * \psi_n$ is compactly supported also. The claim follows.
Because of this lemma, we can formalise the previous procedure of
extending operations that were previously defined on test functions
to distributions, provided that these operations were continuous in
distributional topologies. However, we shall continue to proceed by
analogy, as this requires fewer verifications in order to motivate the
definition.

Exercise 1.13.19. Another consequence of Lemma 1.13.4 is that
it allows one to extend the definition (1.114) of convolution to the
case when $h$ is not an integrable function of compact support, but
is instead merely a distribution of compact support. Adopting this
convention, show that convolution of distributions of compact support
is both commutative and associative. (Hint: this can either be done
directly, or by carefully taking limits using Lemma 1.13.5.)
The next operation we will introduce is that of differentiation.
An integration by parts reveals the identity

$\langle f, \partial_{x_j} g \rangle = -\langle \partial_{x_j} f, g \rangle$

for any test functions $f, g$ and $j = 1, \dots, d$. Inspired by this, we define
the (distributional) partial derivative $\partial_{x_j} \lambda$ of a distribution $\lambda$ by the
formula

$\langle f, \partial_{x_j} \lambda \rangle := -\langle \partial_{x_j} f, \lambda \rangle.$

This can be verified to still be a distribution, and by Exercise 1.13.4(vi),
the operation of differentiation is a continuous one on distributions.
More generally, given any linear differential operator $P$ with smooth
coefficients, one can define $P\lambda$ for a distribution $\lambda$ by the formula

$\langle f, P\lambda \rangle := \langle P^* f, \lambda \rangle,$

where $P^*$ is the adjoint operator of $P$, defined via the identity
$\langle P^* f, g \rangle = \langle f, Pg \rangle$
for test functions $f, g$, or more explicitly by replacing all coefficients
with complex conjugates, replacing each partial derivative $\partial_{x_j}$ with
its negative, and reversing the order of operations (thus for instance
the adjoint of the first-order operator $a(x)\frac{d}{dx}: f \mapsto af'$ would be
$-\frac{d}{dx} \overline{a(x)}: f \mapsto -(\overline{a}f)'$).
Example 1.13.6. The distribution $\lambda: f \mapsto -f'(0)$ from Exercise
1.13.11 is the distributional derivative $\delta'$ of the Dirac delta distribution $\delta$.

Exercise 1.13.20. Let $\lambda \in C_c^\infty(\mathbf{R}^d)^*$ be a distribution,
and let $f: \mathbf{R}^d \to \mathbf{C}$ be smooth. Show that

$\partial_{x_j}(f\lambda) = (\partial_{x_j} \lambda) f + (\partial_{x_j} f)\lambda$

for all $j = 1, \dots, d$.
Exercise 1.13.21. Let $d = 1$. Show that $x\delta' = -\delta$ in three different
ways:

directly from the definitions;

using the product rule (Exercise 1.13.20);

writing $\delta$ as the limit of approximations $\psi_n$ to the identity.
Exercise 1.13.22. Let $d = 1$.

(i) Show that if $\lambda$ is a distribution and $n \geq 1$ is an integer, then
$x^n \lambda = 0$ if and only if $\lambda$ is a linear combination of $\delta$ and its
first $n-1$ derivatives $\delta', \dots, \delta^{(n-1)}$.

(ii) Show that a distribution $\lambda$ is supported on $\{0\}$ if and only
if it is a linear combination of $\delta$ and finitely many of its
derivatives.

(iii) Generalise (ii) to the case of general dimension $d$ (where of
course one now uses partial derivatives instead of derivatives).
Exercise 1.13.23. Let $d = 1$.

Show that the derivative of the Heaviside function $1_{[0,+\infty)}$
is equal to $\delta$.

Show that the derivative of the signum function $\operatorname{sgn}(x)$ is
equal to $2\delta$.

Show that the derivative of the locally integrable function
$\log|x|$ is equal to $\mathrm{p.v.}\ \frac{1}{x}$.

Show that the derivative of the locally integrable function
$\log|x| \operatorname{sgn}(x)$ is equal to the distribution $\lambda_1$ from Exercise
1.13.13.

Show that the derivative of the locally integrable function
$|x|$ is the locally integrable function $\operatorname{sgn}(x)$.

If a locally integrable function has a distributional derivative
which is also a locally integrable function, we refer to the latter as the
weak derivative of the former. Thus, for instance, the weak derivative
of $|x|$ is $\operatorname{sgn}(x)$ (as one would expect), but $\operatorname{sgn}(x)$ does not have a
weak derivative (despite being (classically) differentiable almost everywhere),
because the distributional derivative $2\delta$ of this function is
not itself a locally integrable function. Thus weak derivatives differ
in some respects from their classical counterparts, though of course
the two concepts agree for smooth functions.
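The defining identity $\langle f, \partial\lambda\rangle = -\langle \partial f, \lambda\rangle$ can be sanity-checked numerically for the Heaviside function: $-\int f'(x) 1_{[0,\infty)}(x)\, dx = f(0) = \langle f, \delta\rangle$ for any real test function $f$. This is only a numerical sketch (the test function, grid, and finite-difference step below are arbitrary choices), not a proof of the exercise above.

```python
import math

def f(x):
    """A smooth compactly supported test function (a bump on (-1, 1))."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

def fprime(x, h=1e-5):
    """Centred finite-difference approximation to f'."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def pairing_with_heaviside_derivative(steps=50_000):
    """-integral of f'(x) H(x) dx = -integral_0^1 f'(x) dx (the support of f
    ends at x = 1), approximated by a midpoint rule."""
    h = 1.0 / steps
    return -sum(fprime((k + 0.5) * h) for k in range(steps)) * h

# By the fundamental theorem of calculus this equals f(0) - f(1) = e^{-1}.
val = pairing_with_heaviside_derivative()
```

The computed pairing matches $f(0) = e^{-1}$ to within the quadrature and finite-difference error, illustrating that the distributional derivative of the Heaviside function acts like the Dirac delta.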
Exercise 1.13.24. Let $d \geq 1$. Show that for any $1 \leq i, j \leq d$ and
any distribution $\lambda \in C_c^\infty(\mathbf{R}^d)^*$, we have $\partial_{x_i} \partial_{x_j} \lambda = \partial_{x_j} \partial_{x_i} \lambda$; thus
weak derivatives commute with each other. (This is in contrast to
classical derivatives, which can fail to commute for non-smooth functions;
for instance, $\partial_x \partial_y \frac{xy^3}{x^2+y^2} \neq \partial_y \partial_x \frac{xy^3}{x^2+y^2}$ at the origin $(x,y) = (0,0)$,
despite both derivatives being defined. More generally, weak derivatives
tend to be less pathological than classical derivatives, but of
course the downside is that weak derivatives do not always have a
classical interpretation as a limit of a Newton quotient.)
Exercise 1.13.25. Let d = 1, and let k ≥ 0 be an integer. Let us say that a compactly supported distribution λ ∈ C_c^∞(R)^* has order at most k if the functional f ↦ ⟨f, λ⟩ is continuous in the C^k norm. Thus, for instance, δ has order at most 0, and
∫_0^∞ (f(x)/x) dx
whenever f is a bump function supported in (0, +∞). To show this, approximate f by the function
∫ f(e^{−t} x) ψ_n(t) dt = ∫_0^∞ (f(y)/y) ψ_n(log(x/y)) 1_{x>0} dy
for ψ_n an approximation to the identity.)
Remark 1.13.7. One can also compose distributions with diffeomorphisms. However, things become much more delicate if the map one is composing with contains stationary points; for instance, in one dimension, one cannot meaningfully make sense of δ(x²) (the composition of the Dirac delta distribution with x ↦ x²); this can be seen by first noting that for an approximation δ_n to the identity, δ_n(x²) does not converge to a limit in the distributional sense.
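One can see this divergence numerically. In the sketch below (an informal illustration; the bump function, the choice of test function, and the quadrature grid are our own), the pairings ⟨δ_n(x²), φ⟩ grow like √n (quadrupling n roughly doubles the pairing), so no distributional limit can exist.

```python
import math

def bump(x):
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

# normalise to unit mass so that delta_n(x) := n * psi(n x) approximates delta
M = 100000
h = 2.0 / M
xs = [-1.0 + (i + 0.5) * h for i in range(M)]
Z = sum(bump(x) for x in xs) * h

def pairing(n):
    # <delta_n(x^2), phi>, with the test function phi taken to be the bump itself
    return sum(n * bump(n * x * x) / Z * bump(x) for x in xs) * h

r = pairing(400) / pairing(100)
print(r)  # ratio ~ 2: the pairings grow like sqrt(n), so there is no limit
```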
Exercise 1.13.27 (Tensor product of distributions). Let d, d' ≥ 1 be integers. If λ ∈ C_c^∞(R^d)^* and ρ ∈ C_c^∞(R^{d'})^* are distributions, show that there is a unique distribution λ ⊗ ρ ∈ C_c^∞(R^{d+d'})^* with the property that
(1.116) ⟨f ⊗ g, λ ⊗ ρ⟩ = ⟨f, λ⟩⟨g, ρ⟩
for all test functions f ∈ C_c^∞(R^d), g ∈ C_c^∞(R^{d'}), where f ⊗ g ∈ C_c^∞(R^{d+d'}) is the tensor product (f ⊗ g)(x, x') := f(x)g(x') of f and g. (Hint: like many other constructions of tensor products, this is rather intricate. One way is to start by fixing two cutoff functions η, η' on R^d, R^{d'}, and then use Fourier series to define λ ⊗ ρ on functions of the form F(x, x')η(x)η'(x') for smooth F. Then show that these definitions of λ ⊗ ρ are compatible for different choices of η, η'.)
[ of the derivative
of the delta function. (One also cannot multiply by sgn(x) - why?)
Exercise 1.13.28. Let X be a normed vector space which contains C_c^∞(R^d) as a dense subspace (and such that the inclusion of C_c^∞(R^d) into X is continuous). The adjoint (or transpose) of this inclusion map is then an injection from X^* into the space of distributions C_c^∞(R^d)^*; thus X^* can be identified with a space of distributions.
⟨Ff, g⟩ = ⟨f, Fg⟩ for test functions f, g, so one would like to define the Fourier transform Fλ = λ̂ of a distribution λ by the formula
(1.117) ⟨f, Fλ⟩ := ⟨Ff, λ⟩.
Unfortunately this does not quite work, because the adjoint Fourier transform does not map the space of test functions to itself.
Since C_c^∞(R^d) embeds continuously into S(R^d) (with a dense image), we see that the space of tempered distributions can be embedded into the space of distributions. However, not every distribution is tempered:
Example 1.13.9. The distribution e^x is not tempered. Indeed, if φ is a bump function, observe that the sequence of functions e^{−n} φ(x − n) converges to zero in the Schwartz space topology, but ⟨e^{−n} φ(x − n), e^x⟩ does not go to zero, and so this distribution does not correspond to a tempered distribution.
On the other hand, distributions which avoid this sort of exponen-
tial growth, and instead only grow polynomially, tend to be tempered:
Exercise 1.13.29. Show that any Radon measure μ which is of polynomial growth in the sense that |μ|(B(0, R)) ≤ CR^k for all R ≥ 1 and some constants C, k > 0, where B(0, R) is the ball of radius R centred at the origin in R^d, is tempered.
Remark 1.13.10. As a zeroth approximation, one can roughly think of tempered as being synonymous with polynomial growth. However, this is not strictly true: for instance, the (weak) derivative of a function of polynomial growth will still be tempered, but need not be of polynomial growth (for instance, the derivative e^x cos(e^x) of sin(e^x) is a tempered distribution, despite having exponential growth). While one can eventually describe which distributions are tempered by measuring their growth in both physical space and in frequency space, we will not do so here.
Most of the operations that preserve the space of distributions also preserve the space of tempered distributions. For instance:
Exercise 1.13.30.
• Show that any derivative of a tempered distribution is again a tempered distribution.
• Show that any convolution of a tempered distribution with a compactly supported distribution is again a tempered distribution.
• Show that if f is a measurable function which is rapidly decreasing in the sense that |x|^k f(x) is an L^∞(R^d) function for each k = 0, 1, 2, ..., then a convolution of a tempered distribution with f can be defined, and is again a tempered distribution.
• Show that if f is a smooth function such that f and all its derivatives have at most polynomial growth (thus for each j ≥ 0 there exist C, k ≥ 0 such that |∇^j f(x)| ≤ C(1 + |x|)^k for all x ∈ R^d), then the product of a tempered distribution with f is again a tempered distribution. Give a counterexample to show that this statement fails if the polynomial growth hypotheses are dropped.
• Show that the translate of a tempered distribution is again a tempered distribution.
But we can now add a new operation to this list using (1.117): as the Fourier transform F maps Schwartz functions continuously to Schwartz functions, it also continuously maps the space of tempered distributions to itself. One can also define the inverse Fourier transform F^* = F^{−1} on tempered distributions in a similar manner.
It is not difficult to extend many of the properties of the Fourier transform from Schwartz functions to distributions. For instance:
Exercise 1.13.31. Let λ ∈ S(R^d)^* be a tempered distribution, and let f be a Schwartz function.
• (Inversion formula) Show that F^* F λ = F F^* λ = λ.
• (Multiplication intertwines with convolution) Show that F(fλ) = (Fλ) ∗ (Ff) and F(f ∗ λ) = (Fλ)(Ff).
• (Translation intertwines with modulation) For any x_0 ∈ R^d, show that F(τ_{x_0}λ) = e_{−x_0} Fλ, where τ_{x_0}λ(x) := λ(x − x_0) and e_{x_0}(ξ) := e^{2πi x_0·ξ}. Similarly, show that for any ξ_0 ∈ R^d, one has F(e_{ξ_0}λ) = τ_{ξ_0} Fλ.
• (Linear transformations) For any invertible linear transformation L : R^d → R^d, show that F(λ ∘ L) = (1/|det L|) (Fλ) ∘ (L^T)^{−1}.
• (Differentiation intertwines with polynomial multiplication) For any 1 ≤ j ≤ d, show that F(∂_{x_j}λ) = 2πi ξ_j Fλ, where x_j and ξ_j are the j-th coordinate functions in physical space and frequency space respectively, and similarly F(−2πi x_j λ) = ∂_{ξ_j} Fλ.
Exercise 1.13.32. Let d ≥ 1.
• (Inversion formula) Show that Fδ = 1 and F1 = δ.
• (Orthogonality) Let V be a subspace of R^d, and let μ be Lebesgue measure on V. Show that Fμ is Lebesgue measure on the orthogonal complement V^⊥.
• (Poisson summation formula) Let Σ_{k∈Z^d} δ_k be the distribution
⟨f, Σ_{k∈Z^d} δ_k⟩ := Σ_{k∈Z^d} f(k).
Show that this is a tempered distribution which is equal to its own Fourier transform.
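The Poisson summation identity of the last item can be tested numerically on dilated Gaussians f(x) = e^{−πax²}, whose Fourier transform is a^{−1/2} e^{−πξ²/a}; summing each over the integers gives the same value. (An informal Python check; the parameter a = 0.7 is an arbitrary choice of ours.)

```python
import math

# f(x) = exp(-pi a x^2) has Fourier transform a^{-1/2} exp(-pi xi^2 / a);
# Poisson summation says the integer sums of the two agree
a, K = 0.7, 50
lhs = sum(math.exp(-math.pi * a * k * k) for k in range(-K, K + 1))
rhs = sum(math.exp(-math.pi * k * k / a) for k in range(-K, K + 1)) / math.sqrt(a)
print(lhs, rhs)  # the two sums agree
```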
One can use these properties of tempered distributions to start solving constant-coefficient PDE. We first illustrate this by an ODE example, showing how the formal symbolic calculus for solving such ODE that you may have seen as an undergraduate can now be (sometimes) justified using tempered distributions.
Exercise 1.13.33. Let d = 1, let a, b be real numbers, and let D be the operator D = d/dx.
• If a ≠ b, use the Fourier transform to show that all tempered distribution solutions λ to the ODE (D − ia)(D − ib)λ = 0 are of the form λ = Ae^{iax} + Be^{ibx} for some constants A, B.
• If a = b, show that all tempered distribution solutions λ to the ODE (D − ia)(D − ib)λ = 0 are of the form λ = Ae^{iax} + Bxe^{iax} for some constants A, B.
Remark 1.13.11. More generally, one can solve any homogeneous constant-coefficient ODE using tempered distributions and the Fourier transform so long as the roots of the characteristic polynomial are purely imaginary. In all other cases, solutions can grow exponentially as x → +∞ or x → −∞ and so are not tempered. There are other theories of generalised functions that can handle these objects (e.g. hyperfunctions) but we will not discuss them here.
Now we turn to PDE. To illustrate the method, let us focus on solving Poisson's equation
(1.118) Δu = f
in R^d, where f is a Schwartz function and u is a distribution, where Δ = Σ_{j=1}^d ∂²/∂x_j² is the Laplacian. (In some texts, particularly those using spectral analysis, the Laplacian is occasionally defined instead as −Σ_{j=1}^d ∂²/∂x_j², to make it positive semi-definite, but we will eschew that sign convention here, though of course the theory is only changed in a trivial fashion if one adopts it.)
We first settle the question of uniqueness:
Exercise 1.13.34. Let d ≥ 1. Using the Fourier transform, show that the only tempered distributions λ ∈ S(R^d)^* with Δλ = 0 are the harmonic polynomials.
Taking Fourier transforms of (1.118) formally amounts to dividing f̂ by −4π²|ξ|², so we now need to figure out what the Fourier transform of a negative power of |x| (or the adjoint Fourier transform of a negative power of |ξ|) is.
Let us work formally at first, and consider the problem of computing the Fourier transform of the function |x|^{−α} in R^d for some exponent α. A direct attack, based on evaluating the (formal) Fourier integral
(1.120) (|x|^{−α})^∧(ξ) = ∫_{R^d} |x|^{−α} e^{−2πi x·ξ} dx
does not seem to make much sense (the integral is not absolutely integrable), although a change of variables (or dimensional analysis) heuristic can at least lead to the prediction that the integral (1.120) should be some multiple of |ξ|^{α−d}. But which multiple should it be?
To continue the formal calculation, we can write the non-integrable function |x|^{−α} as an average of integrable Gaussians, using the identity
(e^{−πt²|x|²})^∧(ξ) = t^{−d} e^{−π|ξ|²/t²}
for t > 0 (see Exercise 1.12.32). To get from Gaussians to |x|^{−α}, one can observe that |x|^{−α} is invariant under the scaling f(x) ↦ t^{α} f(tx) for t > 0. Thus, it is natural to average the standard Gaussian e^{−π|x|²} with respect to this scaling, thus producing the function t^{α} e^{−πt²|x|²}, and then integrate with respect to the multiplicative Haar measure dt/t. A straightforward change of variables then gives the identity
∫_0^∞ t^{α} e^{−πt²|x|²} dt/t = (1/2) π^{−α/2} Γ(α/2) |x|^{−α},
where
Γ(s) := ∫_0^∞ t^{s} e^{−t} dt/t
is the Gamma function. If we formally take Fourier transforms of this identity, we obtain
∫_0^∞ t^{α} t^{−d} e^{−π|ξ|²/t²} dt/t = (1/2) π^{−α/2} Γ(α/2) (|x|^{−α})^∧(ξ).
Another change of variables shows that
∫_0^∞ t^{α} t^{−d} e^{−π|ξ|²/t²} dt/t = (1/2) π^{−(d−α)/2} Γ((d−α)/2) |ξ|^{−(d−α)},
and so we conclude (formally) that
(1.121) (|x|^{−α})^∧(ξ) = [π^{−(d−α)/2} Γ((d−α)/2) / (π^{−α/2} Γ(α/2))] |ξ|^{−(d−α)},
thus solving the problem of what the constant multiple of |ξ|^{−(d−α)} should be.
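As an informal numerical check of the key averaging identity used above, with our own choice of exponent α and point x, and a crude midpoint rule for the t-integral:

```python
import math

alpha, x = 0.6, 1.3

# midpoint-rule approximation to int_0^T t^alpha e^{-pi t^2 x^2} dt/t
T, N = 40.0, 200000
h = T / N
lhs_val = sum(((i + 0.5) * h) ** (alpha - 1.0)
              * math.exp(-math.pi * ((i + 0.5) * h) ** 2 * x * x)
              for i in range(N)) * h

# the closed form (1/2) pi^{-alpha/2} Gamma(alpha/2) |x|^{-alpha}
rhs_val = 0.5 * math.pi ** (-alpha / 2) * math.gamma(alpha / 2) * abs(x) ** (-alpha)
print(lhs_val, rhs_val)  # agree up to quadrature error near t = 0
```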
Exercise 1.13.35. Give a rigorous proof of (1.121) for 0 < α < d (when both sides are locally integrable) in the sense of distributions. (Hint: basically, one needs to test the entire formal argument against an arbitrary Schwartz function.) The identity (1.121) can in fact be continued meromorphically in α, but the interpretation of distributions such as |x|^{−α} when |x|^{−α} is not locally integrable is more delicate.
Specialising (1.121) to d = 3 and α = 2, we see that
(|x|^{−2})^∧(ξ) = π|ξ|^{−1}
and similarly
F^*(1/|ξ|²) = π|x|^{−1},
and so from (1.119) we see that one choice of the fundamental solution K is the Newton potential
K = −1/(4π|x|),
leading to an explicit (and rigorously derived) solution
(1.122) u(x) := f ∗ K(x) = −(1/4π) ∫_{R³} f(y)/|x − y| dy
to the Poisson equation (1.118) in d = 3 for Schwartz functions f. (This is not quite the only fundamental solution K available; one can add a harmonic polynomial to K, which will end up adding a harmonic polynomial to u, since the convolution of a harmonic polynomial with a Schwartz function is easily seen to still be harmonic.)
Exercise 1.13.36. Without using the theory of distributions, give an alternate (and still rigorous) proof that the function u defined in (1.122) solves (1.118) in d = 3.
Exercise 1.13.37.
• Show that for any d ≥ 3, a fundamental solution K to the Poisson equation is given by the locally integrable function
K(x) = −[1/(d(d − 2)ω_d)] |x|^{−(d−2)},
where ω_d = π^{d/2}/Γ(d/2 + 1) is the volume of the unit ball in d dimensions.
• Show that for d = 1, a fundamental solution is given by the locally integrable function K(x) = |x|/2.
• Show that for d = 2, a fundamental solution is given by the locally integrable function K(x) = (1/2π) log |x|.
Thus we see that for the Poisson equation, d = 2 is a critical dimension, requiring a logarithmic correction to the usual formula.
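The d = 1 case is easy to test numerically: convolving a smooth compactly supported f with K(z) = |z|/2 and applying a second difference quotient should recover f. (A rough Python sketch; the source f, the evaluation point, and the step sizes are our own choices.)

```python
import math

def f(x):
    # a smooth source supported on (-1, 1) (our own choice)
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

N = 100000
h = 2.0 / N
ys = [-1.0 + (i + 0.5) * h for i in range(N)]

def u(x):
    # u = K * f with K(z) = |z|/2, the d = 1 fundamental solution
    return sum(abs(x - y) / 2.0 * f(y) for y in ys) * h

x0, d = 0.3, 1e-3
lap = (u(x0 + d) - 2.0 * u(x0) + u(x0 - d)) / (d * d)  # discrete u''
print(lap, f(x0))  # u'' = f, as a fundamental solution should give
```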
Similar methods can solve other constant coefficient linear PDE. We give some standard examples in the exercises below.
Exercise 1.13.38. Let d ≥ 1. Show that a smooth solution u : R⁺ × R^d → C to the heat equation ∂_t u = Δu with initial data u(0, x) = f(x) for some Schwartz function f is given by u(t) = f ∗ K_t for t > 0, where K_t is the heat kernel
K_t(x) = (4πt)^{−d/2} e^{−|x|²/4t}.
(This solution is unique assuming certain smoothness and decay conditions at infinity, but we will not pursue this issue here.)
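One can check the heat-kernel formula numerically in d = 1 by convolving Gaussian initial data with K_t and comparing difference quotients for ∂_t u and ∂_{xx} u. (An informal sketch; the initial data, evaluation point, and step sizes are our own choices.)

```python
import math

def f(y):
    return math.exp(-y * y)  # Schwartz initial data (our choice)

def K(t, x):
    # heat kernel in d = 1
    return math.exp(-x * x / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)

L, N = 8.0, 40000
h = 2.0 * L / N
ys = [-L + (i + 0.5) * h for i in range(N)]

def u(t, x):
    # u(t) = f * K_t by a midpoint rule
    return sum(K(t, x - y) * f(y) for y in ys) * h

t0, x0, d = 0.5, 0.7, 1e-3
ut = (u(t0 + d, x0) - u(t0 - d, x0)) / (2.0 * d)
uxx = (u(t0, x0 + d) - 2.0 * u(t0, x0) + u(t0, x0 - d)) / (d * d)
print(ut, uxx)  # the heat equation: the two quotients agree
```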
Exercise 1.13.39. Let d ≥ 1. Show that a smooth solution u : R × R^d → C to the Schrodinger equation ∂_t u = iΔu with initial data u(0, x) = f(x) for some Schwartz function f is given by u(t) = f ∗ K_t for t ≠ 0, where K_t is the Schrodinger kernel^{12}
K_t(x) = (4πit)^{−d/2} e^{i|x|²/4t}
and we use the standard branch of the complex logarithm (with cut on the negative real axis) to define (4πit)^{−d/2}. (Hint: You may wish to investigate the Fourier transform of e^{−z|x|²}, where z is a complex number with positive real part, and then let z approach the imaginary axis.)
Exercise 1.13.40. Let d = 3. Show that a smooth solution u : R × R³ → C to the wave equation ∂_{tt} u = Δu with initial data u(0, x) = f(x), ∂_t u(0, x) = g(x) for some Schwartz functions f, g is given by the formula
u(t) = f ∗ ∂_t K_t + g ∗ K_t
for t ≠ 0, where K_t is the distribution
⟨f, K_t⟩ := (t/4π) ∫_{S²} f(tω) dσ(ω),
where σ is Lebesgue measure on the sphere S², and the derivative ∂_t K_t is defined in the Newtonian sense lim_{dt→0} (K_{t+dt} − K_t)/dt, with the limit taken in the sense of distributions.
12 The close similarity here with the heat kernel is a manifestation of Wick rotation in action. However, from an analytical viewpoint, the two kernels are very different. For instance, the convergence of f ∗ K_t to f as t → 0 follows in the heat kernel case by the theory of approximations to the identity, whereas the convergence in the Schrodinger case is much more subtle, and is best seen via Fourier analysis.
Remark 1.13.12. The theory of (tempered) distributions is also highly effective for studying variable coefficient linear PDE, especially if the coefficients are fairly smooth, and particularly if one is primarily interested in the singularities of solutions to such PDE and how they propagate; here the Fourier transform must be augmented with more general transforms of this type, such as Fourier integral operators. A classic reference for this topic is [Ho1990]. For nonlinear PDE, subspaces of the space of distributions, such as Sobolev spaces, tend to be more useful; we will discuss these in the next section.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/04/19. Thanks to Dale Roberts, Max Baroi, and an anonymous commenter for corrections.
1.14. Sobolev spaces
As discussed in previous sections, a function space norm can be viewed as a means to rigorously quantify various statistics of a function f : X → C. For instance, the height and width can be quantified via the L^p(X, μ) norms (and their relatives, such as the Lorentz norms ‖f‖_{L^{p,q}(X,μ)}). Indeed, if f is a step function f = A 1_E, then the L^p norm of f is a combination ‖f‖_{L^p(X,μ)} = |A| μ(E)^{1/p} of the height (or amplitude) A and the width μ(E).
However, there are more features of a function f of interest than just its width and height. When the domain X is a Euclidean space R^d (or domains related to Euclidean spaces, such as open subsets of R^d, or manifolds), then another important feature of such functions (especially in PDE) is the regularity of a function, as well as the related concept of the frequency scale of a function. These terms are not rigorously defined; but roughly speaking, regularity measures how smooth a function is (or how many times one can differentiate the function before it ceases to be a function), while the frequency scale of a function measures how quickly the function oscillates (and would be inversely proportional to the wavelength). One can illustrate this informal concept with some examples:
• Let φ ∈ C_c^∞(R) be a test function that equals 1 near the origin, and N be a large number. Then the function f(x) := φ(x) sin(Nx) oscillates at a wavelength of about 1/N, and a frequency scale of about N. While f is, strictly speaking, a smooth function, it becomes increasingly less smooth in the limit N → ∞; for instance, the derivative f'(x) = φ'(x) sin(Nx) + Nφ(x) cos(Nx) grows in size at a roughly linear rate in N.
• A function such as e^{2πi ξ·x} φ(x), where φ ∈ C_c^∞(R^d) is a test function, would have a frequency scale of about |ξ|.
There are a variety of function space norms that can be used to capture frequency scale (or regularity) in addition to height and width. The most common and well-known examples of such spaces are the Sobolev space norms ‖f‖_{W^{s,p}(R^d)}, although there are a number of other norms with similar features (such as Holder norms, Besov norms, and Triebel-Lizorkin norms). Very roughly speaking, the W^{s,p} norm is like the L^p norm, but with s additional degrees of regularity. For instance, in one dimension, the function Aφ(x/R) sin(Nx), where φ is a fixed test function and R, N are large, will have a W^{s,p} norm of about |A| R^{1/p} N^s, thus combining the height |A|, the width R, and the frequency scale N of this function together. (Compare this with the L^p norm of the same function, which is about |A| R^{1/p}.)
To a large extent, the theory of the Sobolev spaces W^{s,p}(R^d) resembles that of their Lebesgue counterparts L^p(R^d) (which are the special case of Sobolev spaces when s = 0), but with the additional benefit of being able to interact very nicely with (weak) derivatives: a first derivative ∂f/∂x_j of a function in an L^p space usually leaves all Lebesgue spaces, but a first derivative of a function in the Sobolev space W^{s,p} will end up in another Sobolev space W^{s−1,p}. This compatibility with the differentiation operation begins to explain why Sobolev spaces are so useful in the theory of partial differential equations. Furthermore, the regularity parameter s in Sobolev spaces is not restricted to be a natural number; it can be any real number, and one can use fractional derivative or integration operators to move from one regularity to another. Despite the fact that most partial differential equations involve differential operators of integer order, fractional spaces are still of importance; for instance it often turns out that the Sobolev spaces which are critical (scale-invariant) for a certain PDE are of fractional order.
The uncertainty principle in Fourier analysis places a constraint between the width and frequency scale of a function; roughly speaking (and in one dimension for simplicity), the product of the two quantities has to be bounded away from zero (or to put it another way, a wave is always at least as wide as its wavelength). This constraint can be quantified as the very useful Sobolev embedding theorem, which allows one to trade regularity for integrability: a function in a Sobolev space W^{s,p} will automatically lie in a number of other Sobolev spaces W^{s̃,p̃} with s̃ < s and p̃ > p; in particular, one can often embed Sobolev spaces into Lebesgue spaces. The trade is not reversible: one cannot start with a function with a lot of integrability and no regularity, and expect to recover regularity in a space of lower integrability. (One can already see this with the most basic example of Sobolev embedding, coming from the fundamental theorem of calculus. If a (continuously differentiable) function f : R → R has f' in L¹(R), then we of course have f ∈ L^∞(R).)
This norm gives C⁰ the structure of a Banach space. More generally,
one can then define the spaces C^k(R^d) for any non-negative integer k as the space of all functions which are k times continuously differentiable, with all derivatives of order up to k bounded, and whose norm is given by the formula
‖f‖_{C^k(R^d)} := Σ_{j=0}^k sup_{x∈R^d} |∇^j f(x)| = Σ_{j=0}^k ‖∇^j f‖_{L^∞(R^d)},
where we view ∇^j f as a rank j, dimension d tensor with complex coefficients (or equivalently, as a vector of dimension d^j with complex coefficients), thus
|∇^j f(x)| = (Σ_{i_1,...,i_j=1,...,d} |∂_{x_{i_1}} ... ∂_{x_{i_j}} f(x)|²)^{1/2}.
(One does not have to use the ℓ² norm here, actually; since all norms on a finite-dimensional space are equivalent, any other means of taking norms here will lead to an equivalent definition of the C^k norm. More generally, all the norms discussed here tend to have several definitions which are equivalent up to constants, and in most cases the exact choice of norm one uses is just a matter of personal taste.)
Remark 1.14.1. In some texts, C^k(R^d) is used to denote the functions which are k times continuously differentiable, but whose derivatives up to k-th order are allowed to be unbounded, so for instance e^x would lie in C^k(R) for every k under this definition. Here, we will refer to such functions (with unbounded derivatives) as lying in C^k_loc(R^d) (i.e. they are locally in C^k), rather than C^k(R^d). Similarly, we make a distinction between C^∞_loc(R^d) = ∩_{k=1}^∞ C^k_loc(R^d) (smooth functions, with no bounds on derivatives) and C^∞(R^d) = ∩_{k=1}^∞ C^k(R^d) (smooth functions, all of whose derivatives are bounded). Thus, for instance, e^x lies in C^∞_loc(R) but not C^∞(R).
Exercise 1.14.1. Show that C^k(R^d) is a Banach space.
Exercise 1.14.2. Show that for every d ≥ 1 and k ≥ 0, the C^k(R^d) norm is equivalent to the modified norm
‖f‖'_{C^k(R^d)} := ‖f‖_{L^∞(R^d)} + ‖∇^k f‖_{L^∞(R^d)}
in the sense that there exists a constant C (depending on k and d) such that
C^{−1} ‖f‖_{C^k(R^d)} ≤ ‖f‖'_{C^k(R^d)} ≤ ‖f‖_{C^k(R^d)}
for all f ∈ C^k(R^d). (Hint: use Taylor series with remainder.) Thus when defining the C^k norms, one does not really need to bound all the intermediate derivatives ∇^j f for 0 < j < k; the two extreme terms j = 0, j = k suffice. (This is part of a more general interpolation phenomenon; the extreme terms in a sum often already suffice to control the intermediate terms.)
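The interpolation phenomenon can be seen concretely on f(x) = sin(Nx), for which sup|f| = 1 and sup|f''| = N², while the intermediate derivative obeys sup|f'| = N = (sup|f|)^{1/2}(sup|f''|)^{1/2}. (A quick numerical illustration; N = 7 and the sampling grid are arbitrary choices of ours.)

```python
import math

# f(x) = sin(Nx): sup|f| = 1, sup|f''| = N^2, and the intermediate derivative
# satisfies sup|f'| = N = (sup|f|)^{1/2} (sup|f''|)^{1/2}
N = 7
xs = [i * 2.0 * math.pi / 100000 for i in range(100000)]
sup_f = max(abs(math.sin(N * x)) for x in xs)
sup_f1 = max(abs(N * math.cos(N * x)) for x in xs)
sup_f2 = max(abs(N * N * math.sin(N * x)) for x in xs)
print(sup_f1, math.sqrt(sup_f * sup_f2))  # the two values agree
```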
Exercise 1.14.3. Let φ ∈ C_c^∞(R^d) be a bump function, and k ≥ 0. Show that if ξ ∈ R^d with |ξ| ≥ 1, R ≥ 1/|ξ|, and A > 0, then the function Aφ(x/R) sin(ξ · x) has a C^k norm of at most CA|ξ|^k, where C is a constant depending only on φ, d and k. Thus we see how the C^k norm relates to the height A, width R, and frequency scale |ξ| of the function, and in particular how the width R is largely irrelevant. What happens when the condition R ≥ 1/|ξ| is dropped?
We clearly have the inclusions
C⁰(R^d) ⊃ C¹(R^d) ⊃ C²(R^d) ⊃ ...
and for any constant-coefficient partial differential operator
L = Σ_{i_1,...,i_d ≥ 0 : i_1+...+i_d ≤ m} c_{i_1,...,i_d} ∂^{i_1+...+i_d} / ∂x_1^{i_1} ... ∂x_d^{i_d}
of some order m ≥ 0, it is easy to see that L is a bounded linear operator from C^{k+m}(R^d) to C^k(R^d) for any k ≥ 0.
The Holder spaces C^{k,α}(R^d) are designed to fill up the gaps between the discrete spectrum C^k(R^d) of the continuously differentiable spaces. For k = 0 and 0 ≤ α ≤ 1, these spaces are defined as the subspace of functions f ∈ C⁰(R^d) whose norm
‖f‖_{C^{0,α}(R^d)} := ‖f‖_{C⁰(R^d)} + sup_{x,y∈R^d : x≠y} |f(x) − f(y)| / |x − y|^α
is finite. (For instance, C¹(R^d) is contained in C^{0,1}(R^d), as can be seen from the fundamental theorem of calculus.)
Exercise 1.14.8. Let f ∈ C^{0,1}(R). Show that the distributional derivative f' of f (viewing f as an element of (C_c^∞(R))^*) also lies in L^∞(R), and that the C^{0,1}(R) norm of f is comparable to ‖f‖_{L^∞(R)} + ‖f'‖_{L^∞(R)}.
We can then define the C^{k,α}(R^d) spaces for natural numbers k ≥ 0 and 0 ≤ α ≤ 1 to be the subspace of C^k(R^d) whose norm
‖f‖_{C^{k,α}(R^d)} := Σ_{j=0}^k ‖∇^j f‖_{C^{0,α}(R^d)}
is finite. (As before, there are a variety of ways to define the C^{0,α} norm of the tensor-valued quantity ∇^j f, but they are all equivalent to each other.)
Exercise 1.14.9. Show that C^{k,α}(R^d) is a Banach space which contains C^{k+1}(R^d), and is contained in turn in C^k(R^d).
As before, C^{k,0}(R^d) is equal to C^k(R^d), and C^{k,α}(R^d) is contained in C^{k,β}(R^d) when 0 ≤ β ≤ α ≤ 1. The space C^{k,1}(R^d) is slightly larger than C^{k+1}(R^d), but is fairly close to it, thus providing a near-continuum of spaces between the sequence of spaces C^k(R^d). The following examples illustrate this:
Exercise 1.14.10. Let φ ∈ C_c^∞(R) be a test function, let k ≥ 0 be a natural number, and let 0 ≤ α ≤ 1.
• Show that the function |x|^s φ(x) lies in C^{k,α}(R) whenever s ≥ k + α.
• Conversely, if s is not an integer, φ(0) ≠ 0, and s < k + α, show that |x|^s φ(x) does not lie in C^{k,α}(R).
• Show that |x|^{k+1} φ(x) 1_{x>0} lies in C^{k,1}(R), but not in C^{k+1}(R).
This example illustrates that the quantity k + α can be viewed as measuring the total amount of regularity held by functions in C^{k,α}(R): k full derivatives, plus an additional α amount of Holder continuity.
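The role of the threshold s = k + α is visible already at k = 0 from the difference quotient at the origin: for f(x) = |x|^s one has |f(x) − f(0)|/|x|^α = |x|^{s−α}, which stays bounded as x → 0 exactly when s ≥ α. (A trivial numerical illustration with our own sample exponents.)

```python
# For f(x) = |x|^s near the origin, the C^{0,alpha} difference quotient
# |f(x) - f(0)| / |x|^alpha equals |x|^{s - alpha}: bounded iff s >= alpha.
s_ok, s_bad, alpha = 0.8, 0.3, 0.5
for x in [1e-2, 1e-4, 1e-6]:
    print(x, x ** (s_ok - alpha), x ** (s_bad - alpha))
# the first quotient stays bounded (s_ok >= alpha); the second blows up
```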
Exercise 1.14.11. Let φ ∈ C_c^∞(R^d) be a test function, let k ≥ 0 be a natural number, and let 0 ≤ α ≤ 1. Show that for ξ ∈ R^d with |ξ| ≥ 1, the function φ(x) sin(ξ · x) has a C^{k,α}(R^d) norm of at most C|ξ|^{k+α}, for some C depending on φ, d, k, α.
By construction, it is clear that constant-coefficient differential operators L of order m will map C^{k+m,α}(R^d) continuously to C^{k,α}(R^d).
Now we consider what happens with products.
Exercise 1.14.12. Let k, l ≥ 0 be natural numbers, and 0 ≤ α, β ≤ 1.
• If f ∈ C^k(R^d) and g ∈ C^l(R^d), show that fg ∈ C^{min(k,l)}(R^d), and that the multiplication map is continuous from C^k(R^d) × C^l(R^d) to C^{min(k,l)}(R^d). (Hint: reduce to the case k = l and use induction.)
• If f ∈ C^{k,α}(R^d) and g ∈ C^{l,β}(R^d), and k + α ≤ l + β, show that fg ∈ C^{k,α}(R^d), and that the multiplication map is continuous from C^{k,α}(R^d) × C^{l,β}(R^d) to C^{k,α}(R^d).
It is easy to see that the regularity in these results cannot be improved (just take g = 1). This illustrates a general principle, namely that a pointwise product fg tends to acquire the lower of the regularities of the two factors f, g.
As one consequence of this exercise, we see that any variable-coefficient differential operator L of order m with C^∞(R^d) coefficients will map C^{m+k,α}(R^d) to C^{k,α}(R^d) for any k ≥ 0 and 0 ≤ α ≤ 1.
We now briefly remark on Holder spaces on open domains Ω in Euclidean space R^d. Here, a new subtlety emerges; instead of having just one space C^{k,α} for each choice of exponents k, α, one actually has a range of spaces to choose from, depending on what kind of behaviour one wants to impose at the boundary of the domain. At one extreme, one has the space C^{k,α}(Ω), defined as the space of k times continuously differentiable functions f : Ω → C whose Holder norm
‖f‖_{C^{k,α}(Ω)} := Σ_{j=0}^k ( sup_{x∈Ω} |∇^j f(x)| + sup_{x,y∈Ω : x≠y} |∇^j f(x) − ∇^j f(y)| / |x − y|^α )
is finite.
Exercise 1.14.13. Show that C_c^∞(R^d) is a dense subset of C^{k,α}(R^d) for any k ≥ 0 and 0 ≤ α ≤ 1. (Hint: To approximate a C^{k,α} function by a C_c^∞ one, first smoothly truncate the function at a large spatial scale to be compactly supported, then convolve with a smooth, compactly supported approximation to the identity.)
Holder spaces are particularly useful in elliptic PDE, because tools such as the maximum principle lend themselves well to the suprema that appear inside the definition of the C^{k,α} norms; see [GiTr1998] for a thorough treatment. For simple examples of elliptic PDE, such as the Poisson equation Δu = f, one can also use the explicit fundamental solution, through lengthy but straightforward computations. We give a typical example here:
Exercise 1.14.14 (Schauder estimate). Let 0 < α < 1, and let f ∈ C^{0,α}(R³) be a function supported on the unit ball B(0, 1). Let u be the unique bounded solution to the Poisson equation Δu = f (where Δ = Σ_{j=1}^3 ∂²/∂x_j² is the Laplacian), given by convolution with the Newton kernel:
u(x) := −(1/4π) ∫_{R³} f(y)/|x − y| dy.
(i) Show that u ∈ C⁰(R³).
(ii) Show that u ∈ C¹(R³), and rigorously establish the formula
∂u/∂x_j (x) = (1/4π) ∫_{R³} (x_j − y_j) f(y)/|x − y|³ dy
for j = 1, 2, 3.
(iii) Show that u ∈ C²(R³), and rigorously establish the formula
∂²u/∂x_i ∂x_j (x) = −(1/4π) lim_{ε→0} ∫_{|x−y|≥ε} [3(x_i − y_i)(x_j − y_j)/|x − y|⁵ − δ_{ij}/|x − y|³] f(y) dy + (δ_{ij}/3) f(x)
for i, j = 1, 2, 3, where δ_{ij} is the Kronecker delta. (Hint: first establish this in the two model cases when f(x) = 0, and when f is constant near x.)
(iv) Show that u ∈ C^{2,α}(R³), and establish the Schauder estimate
‖u‖_{C^{2,α}(R³)} ≤ C_α ‖f‖_{C^{0,α}(R³)},
where C_α depends only on α.
(v) Show that the Schauder estimate fails when α = 0. Using this, conclude that there exists f ∈ C⁰(R³) supported in the unit ball such that the function u defined above fails to be in C²(R³). (Hint: use the closed graph theorem, Theorem 1.7.19.) This failure helps explain why it is necessary to introduce Holder spaces into elliptic theory in the first place (as opposed to the more intuitive C^k spaces).
Remark 1.14.2. Roughly speaking, the Schauder estimate asserts that if Δu has C^{0,α} regularity, then all other second derivatives of u have C^{0,α} regularity as well. This phenomenon - that control of a special derivative of u at some order implies control of all other derivatives of u at that order - is known as elliptic regularity, and relies crucially on Δ being an elliptic differential operator. We will discuss ellipticity a little bit more later in Exercise 1.14.36. The theory of Schauder estimates is by now extremely well developed, and applies to large classes of elliptic operators on quite general domains, but we will not discuss these estimates and their applications to various linear and nonlinear elliptic PDE here.
Exercise 1.14.15 (Rellich-Kondrakov type embedding theorem for Holder spaces). Let 0 ≤ β < α ≤ 1. Show that any bounded sequence of functions f_n ∈ C^{0,α}(R^d) that are all supported in the same compact subset of R^d will have a subsequence that converges in C^{0,β}(R^d). (Hint: use the Arzela-Ascoli theorem (Theorem 1.8.23) to first obtain uniform convergence, then upgrade this convergence.) This is part of a more general phenomenon: sequences bounded in a high regularity space, and constrained to lie in a compact domain, will tend to have convergent subsequences in low regularity spaces.
1.14.2. Classical Sobolev spaces. We now turn to the classical Sobolev spaces W^{k,p}(R^d), which involve only an integral amount k of regularity.
Definition 1.14.3. Let 1 ≤ p ≤ ∞, and let k ≥ 0 be a natural number. A function f is said to lie in W^{k,p}(R^d) if its weak derivatives ∇^j f exist and lie in L^p(R^d) for all j = 0, ..., k. If f lies in W^{k,p}(R^d), we define the W^{k,p} norm of f by the formula
‖f‖_{W^{k,p}(R^d)} := Σ_{j=0}^k ‖∇^j f‖_{L^p(R^d)}.
(As before, the exact choice of convention in which one measures the L^p norm of ∇^j f is not particularly relevant for most applications, as all such conventions are equivalent up to multiplicative constants.)
The space W^{k,p}(R^d) is also denoted L^p_k(R^d) in some texts.
Example 1.14.4. W^{0,p}(R^d) is of course the same space as L^p(R^d); thus the Sobolev spaces generalise the Lebesgue spaces. From Exercise 1.14.8 we see that W^{1,∞}(R) is the same space as C^{0,1}(R), with an equivalent norm. More generally, one can see from induction that W^{k+1,∞}(R) is the same space as C^{k,1}(R) for k ≥ 0, with an equivalent norm. It is also clear that W^{k,p}(R^d) contains W^{k+1,p}(R^d) for any k, p.
Example 1.14.5. The function |sin x| lies in W^{1,∞}(R), but is not everywhere differentiable in the classical sense; nevertheless, it has a bounded weak derivative cos x sgn(sin x). On the other hand, the Cantor function (a.k.a. the Devil's staircase) is not in W^{1,∞}(R), despite having a classical derivative of zero at almost every point; the weak derivative is a Cantor measure, which does not lie in any L^p space. Thus one really does need to work with weak derivatives rather than classical derivatives to define Sobolev spaces properly (in contrast to the C^{k,α} spaces).
Exercise 1.14.16. Let φ ∈ C_c^∞(R^d) be a bump function, k ≥ 0, and 1 ≤ p ≤ ∞. Show that if ξ ∈ R^d with |ξ| ≥ 1, R ≥ 1/|ξ|, and A > 0, then the function Aφ(x/R) sin(ξ · x) has a W^{k,p}(R^d) norm of at most CA|ξ|^k R^{d/p}, where C is a constant depending only on φ, p and k. (Compare this with Exercise 1.14.3 and Exercise 1.14.11.) What happens when the condition R ≥ 1/|ξ| is dropped?
Exercise 1.14.17. Show that W^{k,p}(R^d) is a Banach space for any 1 ≤ p ≤ ∞ and k ≥ 0.
The fact that Sobolev spaces are defined using weak derivatives is a technical nuisance, but in practice one can often end up working with classical derivatives anyway by means of the following lemma:
Lemma 1.14.6. Let 1 p < and k 0. Then the space C
c
(R
d
)
of test functions is a dense subspace of W
k,p
(R
d
).
Proof. It is clear that $C^\infty_c({\bf R}^d)$ is a subspace of $W^{k,p}({\bf R}^d)$. We first show that the space $C^\infty_{loc}({\bf R}^d) \cap W^{k,p}({\bf R}^d)$ of smooth functions in $W^{k,p}$ is a dense subspace of $W^{k,p}({\bf R}^d)$, and then show that $C^\infty_c({\bf R}^d)$ is dense in $C^\infty_{loc}({\bf R}^d) \cap W^{k,p}({\bf R}^d)$.

We begin with the former claim. Let $f \in W^{k,p}({\bf R}^d)$, and let $\phi_n$ be a sequence of smooth, compactly supported approximations to the identity. Since $f \in L^p({\bf R}^d)$, we see that $f * \phi_n$ converges to $f$ in $L^p({\bf R}^d)$. More generally, since $\nabla^j f$ is in $L^p({\bf R}^d)$ for $0 \leq j \leq k$, we see that $(\nabla^j f) * \phi_n = \nabla^j (f * \phi_n)$ converges to $\nabla^j f$ in $L^p({\bf R}^d)$. Thus we see that $f * \phi_n$ converges to $f$ in $W^{k,p}({\bf R}^d)$. On the other hand, as $\phi_n$ is smooth, $f * \phi_n$ is smooth; and the claim follows.
Now we prove the latter claim. Let $f$ be a smooth function in $W^{k,p}({\bf R}^d)$, thus $\nabla^j f \in L^p({\bf R}^d)$ for all $0 \leq j \leq k$. We let $\eta \in C^\infty_c({\bf R}^d)$ be a compactly supported function which equals $1$ near the origin, and consider the functions $f_R(x) := f(x) \eta(x/R)$ for $R > 0$. Clearly, each $f_R$ lies in $C^\infty_c({\bf R}^d)$. As $R \to \infty$, dominated convergence shows that $f_R$ converges to $f$ in $L^p({\bf R}^d)$. An application of the product rule then lets us write $\nabla f_R(x) = (\nabla f)(x) \eta(x/R) + \frac{1}{R} f(x) (\nabla \eta)(x/R)$. The first term converges to $\nabla f$ in $L^p({\bf R}^d)$ by dominated convergence, while the second term goes to zero in the same topology; thus $\nabla f_R$ converges to $\nabla f$ in $L^p({\bf R}^d)$. A similar argument shows that $\nabla^j f_R$ converges to $\nabla^j f$ in $L^p({\bf R}^d)$ for all $0 \leq j \leq k$, and so $f_R$ converges to $f$ in $W^{k,p}({\bf R}^d)$, and the claim follows.
As a corollary of this lemma we also see that the space $\mathcal{S}({\bf R}^d)$ of Schwartz functions is dense in $W^{k,p}({\bf R}^d)$.
Exercise 1.14.18. Let $k \geq 0$. Show that the closure of $C^\infty_c({\bf R}^d)$ in $W^{k,\infty}({\bf R}^d)$ is $C^k_0({\bf R}^d)$; thus Lemma 1.14.6 fails at the endpoint $p = \infty$.
Now we come to the important Sobolev embedding theorem, which allows one to trade regularity for integrability. We illustrate this phenomenon first with some very simple cases. First, we claim that the space $W^{1,1}({\bf R})$ embeds continuously into $W^{0,\infty}({\bf R}) = L^\infty({\bf R})$, thus trading in one degree of regularity to upgrade $L^1$ integrability to $L^\infty$ integrability. To prove this claim, it suffices to establish the bound

(1.123) $\|f\|_{L^\infty({\bf R})} \leq C \|f\|_{W^{1,1}({\bf R})}$

for all test functions $f \in C^\infty_c({\bf R})$ and some constant $C$, as the claim then follows by taking limits using Lemma 1.14.6. (Note that any limit in either the $L^\infty$ or $W^{1,1}$ topologies is also a limit in the sense of distributions, and such limits are necessarily unique; also, an $L^\infty$ limit of functions in $C^\infty_c({\bf R})$ remains in $L^\infty({\bf R})$.) To prove (1.123), observe from the fundamental theorem of calculus that $|f(x) - f(0)| = |\int_0^x f'(t)\, dt| \leq \|f'\|_{L^1({\bf R})} \leq \|f\|_{W^{1,1}({\bf R})}$ for all $x$; in particular, from the triangle inequality

$\|f\|_{L^\infty({\bf R})} \leq |f(0)| + \|f\|_{W^{1,1}({\bf R})}.$

Also, taking $x$ to be sufficiently large, we see (from the compact support of $f$) that

$|f(0)| \leq \|f\|_{W^{1,1}({\bf R})}$

and (1.123) follows.
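The fundamental-theorem-of-calculus inequality above can be sanity-checked numerically. The following sketch is an illustration only, not part of the argument: the particular bump function, grid resolution, and capping constant are arbitrary choices.

```python
import numpy as np

# Check sup |f| <= ||f'||_{L^1} for a compactly supported bump, which is the
# d = 1 endpoint Sobolev inequality proved above (the |f(0)| term is absorbed
# since f(0) = f(x) - integral of f' over [x, 0] for x outside the support).
x = np.linspace(-1.0, 1.0, 200001)
dx = x[1] - x[0]

# standard smooth bump supported on [-1, 1]; the denominator is capped to
# avoid division by zero at the endpoints (where f vanishes anyway)
f = np.where(np.abs(x) < 1,
             np.exp(-1.0 / (1.0 - np.minimum(x * x, 1 - 1e-9))), 0.0)

fprime = np.gradient(f, dx)                  # numerical derivative
sup_f = np.max(np.abs(f))                    # L^inf norm, equal to e^{-1} here
l1_fprime = np.sum(np.abs(fprime)) * dx      # L^1 norm of f'

assert sup_f <= l1_fprime
```

Since the bump increases monotonically to its maximum $e^{-1}$ and then decreases, $\|f'\|_{L^1} = 2e^{-1}$, so the inequality holds with room to spare.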
Since the closure of $C^\infty_c({\bf R})$ in $L^\infty({\bf R})$ is $C_0({\bf R})$, we actually obtain the stronger embedding, that $W^{1,1}({\bf R})$ embeds continuously into $C_0({\bf R})$.
Exercise 1.14.19. Show that $W^{d,1}({\bf R}^d)$ embeds continuously into $C_0({\bf R}^d)$, thus there exists a constant $C$ (depending only on $d$) such that

$\|f\|_{C_0({\bf R}^d)} \leq C \|f\|_{W^{d,1}({\bf R}^d)}$

for all $f \in W^{d,1}({\bf R}^d)$.
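One concrete instance of this exercise in $d = 2$ can be checked numerically: writing $f(x,y)$ as the integral of the mixed derivative $\partial^2 f/\partial x \partial y$ over $(-\infty, x] \times (-\infty, y]$ gives $\sup |f| \leq \|\partial^2 f/\partial x \partial y\|_{L^1({\bf R}^2)}$. The sketch below (with an arbitrary Gaussian test function and grid, and the derivative computed analytically) is illustrative only.

```python
import numpy as np

# Check sup |f| <= || d^2 f / dx dy ||_{L^1(R^2)} for f(x,y) = exp(-x^2 - y^2),
# an instance of the embedding W^{2,1}(R^2) -> C_0(R^2) of Exercise 1.14.19.
N, L = 1024, 8.0
x = (np.arange(N) - N // 2) * (2 * L / N)
dx = 2 * L / N
X, Y = np.meshgrid(x, x, indexing="ij")

f = np.exp(-X ** 2 - Y ** 2)
fxy = 4 * X * Y * f                       # exact mixed second derivative

sup_f = np.max(np.abs(f))
l1_fxy = np.sum(np.abs(fxy)) * dx * dx

assert sup_f <= l1_fxy
```

For this Gaussian the two sides are approximately $1$ and $4$, consistent with the claimed bound.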
Now we turn to Sobolev embedding for exponents other than $p = 1$ and $p = \infty$.
Theorem 1.14.7 (Sobolev embedding theorem for one derivative). Let $1 \leq p \leq q \leq \infty$ be such that $\frac{d}{p} - 1 \leq \frac{d}{q} \leq \frac{d}{p}$, but such that one is not in the endpoint cases $(p,q) = (d, \infty), (1, \frac{d}{d-1})$. Then $W^{1,p}({\bf R}^d)$ embeds continuously into $L^q({\bf R}^d)$.
Proof. By Lemma 1.14.6 and the same limiting argument as before, it suffices to establish the Sobolev embedding inequality

$\|f\|_{L^q({\bf R}^d)} \leq C_{p,q,d} \|f\|_{W^{1,p}({\bf R}^d)}$

for all test functions $f \in C^\infty_c({\bf R}^d)$, and some constant $C_{p,q,d}$ depending only on $p, q, d$, as the inequality will then extend to all $f \in W^{1,p}({\bf R}^d)$. To simplify the notation we shall use $X \lesssim Y$ to denote an estimate of the form $X \leq C_{p,q,d} Y$, where $C_{p,q,d}$ is a constant depending on $p, q, d$ (the exact value of this constant may vary from instance to instance).
The case $p = q$ is trivial. Now let us look at another extreme case, namely when $\frac{d}{p} - 1 = \frac{d}{q}$; by our hypotheses, this forces $1 < p < d$. Here, we use the fundamental theorem of calculus (and the compact support of $f$) to write

$f(x) = -\int_0^\infty \omega \cdot \nabla f(x + r\omega)\, dr$

for any $x \in {\bf R}^d$ and any direction $\omega \in S^{d-1}$. Taking absolute values, we conclude in particular that

$|f(x)| \leq \int_0^\infty |\nabla f(x + r\omega)|\, dr.$

We can average this over all directions $\omega$:

$|f(x)| \lesssim \int_{S^{d-1}} \int_0^\infty |\nabla f(x + r\omega)|\, dr\, d\omega.$

Switching from polar coordinates back to Cartesian (multiplying and dividing by $r^{d-1}$) we conclude that

$|f(x)| \lesssim \int_{{\bf R}^d} \frac{1}{|y|^{d-1}} |\nabla f(x - y)|\, dy,$

thus $f$ is pointwise controlled by the convolution of $|\nabla f|$ with the fractional integration kernel $\frac{1}{|x|^{d-1}}$. By the Hardy-Littlewood-Sobolev theorem on fractional integration (Corollary 1.11.18) we conclude that

$\|f\|_{L^q({\bf R}^d)} \lesssim \|\nabla f\|_{L^p({\bf R}^d)}$

and the claim follows. (Note that the hypotheses $1 < p < d$ are needed here in order to be able to invoke this theorem.)
Now we handle the intermediate cases, when $\frac{d}{p} - 1 < \frac{d}{q} < \frac{d}{p}$. (Many of these cases can be obtained from the endpoints already established by interpolation, but unfortunately not all such cases can be, so we will treat this case separately.) Here, the trick is not to integrate out to infinity, but instead to integrate out to a bounded distance. For instance, the fundamental theorem of calculus gives

$f(x) = f(x + R\omega) - \int_0^R \omega \cdot \nabla f(x + r\omega)\, dr$

for any $R > 0$, hence

$|f(x)| \leq |f(x + R\omega)| + \int_0^R |\nabla f(x + r\omega)|\, dr.$

What value of $R$ should one pick? If one picks any specific value of $R$, one would end up with an average of $f$ over spheres, which looks somewhat unpleasant. But what one can do here is average over a range of $R$'s, for instance between $1$ and $2$. This leads to

$|f(x)| \lesssim \int_1^2 |f(x + R\omega)|\, dR + \int_0^2 |\nabla f(x + r\omega)|\, dr;$

averaging over all directions $\omega$ and converting back to Cartesian coordinates, we see that

$|f(x)| \lesssim \int_{1 \leq |y| \leq 2} |f(x-y)|\, dy + \int_{|y| \leq 2} \frac{1}{|y|^{d-1}} |\nabla f(x-y)|\, dy.$

Thus one is bounding $|f|$ pointwise (up to constants) by the convolution of $|f|$ with the kernel $K_1(y) := 1_{1 \leq |y| \leq 2}$, plus the convolution of $|\nabla f|$ with the kernel $K_2(y) := 1_{|y| \leq 2} \frac{1}{|y|^{d-1}}$. A short computation shows that both kernels lie in $L^r({\bf R}^d)$, where $r$ is the exponent in Young's inequality, given by $\frac{1}{q} + 1 = \frac{1}{p} + \frac{1}{r}$ (and in particular $1 < r < \frac{d}{d-1}$). Applying Young's inequality (Exercise 1.11.25), we conclude that

$\|f\|_{L^q({\bf R}^d)} \lesssim \|f\|_{L^p({\bf R}^d)} + \|\nabla f\|_{L^p({\bf R}^d)}$

and the claim follows.
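The exponent relation $\frac{d}{q} = \frac{d}{p} - 1$ in the extreme case is forced by scaling: for $f_\lambda(x) := f(\lambda x)$ one has $\|f_\lambda\|_{L^q} = \lambda^{-d/q} \|f\|_{L^q}$ and $\|\nabla f_\lambda\|_{L^p} = \lambda^{1 - d/p} \|\nabla f\|_{L^p}$, so the ratio of the two sides is scale-invariant exactly at this exponent. The sketch below checks this invariance numerically in $d = 2$ with $p = 3/2$, $q = 6$; the Gaussian and grid parameters are arbitrary choices.

```python
import numpy as np

# Scaling check for ||f||_{L^q(R^2)} <~ ||grad f||_{L^p(R^2)} with p = 3/2,
# q = 6 (so that d/q = d/p - 1).  The ratio should not depend on the scale lam.
N, L = 512, 8.0
x = (np.arange(N) - N // 2) * (2 * L / N)
dx = 2 * L / N
X, Y = np.meshgrid(x, x, indexing="ij")

def ratio(lam, p=1.5, q=6.0):
    f = np.exp(-lam ** 2 * (X ** 2 + Y ** 2))       # f_lam(x) = f(lam x)
    fx = -2 * lam ** 2 * X * f                      # exact partial derivatives
    fy = -2 * lam ** 2 * Y * f
    grad = np.sqrt(fx ** 2 + fy ** 2)
    lq = (np.sum(f ** q) * dx * dx) ** (1 / q)
    lp = (np.sum(grad ** p) * dx * dx) ** (1 / p)
    return lq / lp

r1, r2 = ratio(1.0), ratio(2.0)
assert abs(r1 - r2) < 1e-3 * r1    # scale invariance of the ratio
```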
Remark 1.14.8. It is instructive to insert the example in Exercise 1.14.16 into the Sobolev embedding theorem. By replacing the $W^{1,p}({\bf R}^d)$ norm with the $L^q({\bf R}^d)$ norm, one trades one factor of the frequency scale $|\xi|$ for $\frac{1}{q} - \frac{1}{p}$ powers of the width $R^d$. This is consistent with the Sobolev embedding theorem so long as $R^d \gtrsim 1/|\xi|^d$, which is essentially one of the hypotheses in that exercise. Thus, one can view Sobolev embedding as an assertion that the width of a function must always be greater than or comparable to the wavelength scale (the reciprocal of the frequency scale), raised to the power of the dimension; this is a manifestation of the uncertainty principle (see Section 2.6 for further discussion).
Exercise 1.14.20. Let $d \geq 2$. Show that the Sobolev endpoint estimate fails in the case $(p,q) = (d, \infty)$. (Hint: experiment with functions $f$ of the form $f(x) := \sum_{n=1}^N \phi(2^n x)$, where $\phi$ is a test function supported on the annulus $1 \leq |x| \leq 2$.) Conclude in particular that $W^{1,d}({\bf R}^d)$ is not a subset of $L^\infty({\bf R}^d)$. (Hint: Either use the closed graph theorem, or use some variant of the function $f$ used in the first part of this exercise.) Note that when $d = 1$, the Sobolev endpoint theorem for $(p,q) = (1, \infty)$ follows from the fundamental theorem of calculus, as mentioned earlier. There are substitutes known for the endpoint Sobolev embedding theorem, but they involve more sophisticated function spaces, such as the space BMO of functions of bounded mean oscillation, which we will not discuss here.
The $p = 1$ case of the Sobolev inequality cannot be proven via the Hardy-Littlewood-Sobolev inequality; however, there are other proofs available. One of these (due to Gagliardo and Nirenberg) is based on the following inequality:

Exercise 1.14.21 (Loomis-Whitney inequality). Let $d \geq 1$, let $f_1, \ldots, f_d \in L^p({\bf R}^{d-1})$ for some $0 < p \leq \infty$, and let $F: {\bf R}^d \to {\bf C}$ be the function

$F(x_1, \ldots, x_d) := \prod_{i=1}^d f_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d).$

Show that

$\|F\|_{L^{p/(d-1)}({\bf R}^d)} \leq \prod_{i=1}^d \|f_i\|_{L^p({\bf R}^{d-1})}.$

(Hint: induct on $d$, using Hölder's inequality and Fubini's theorem.)
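The Loomis-Whitney inequality also holds for counting measure on a finite grid, which makes a quick numerical spot-check possible. The sketch below tests the $d = 3$, $p = 2$ case on random non-negative arrays; the grid size and seed are arbitrary.

```python
import numpy as np

# Discrete Loomis-Whitney check in d = 3:
#   ||F||_{L^{p/2}} <= ||f1||_p ||f2||_p ||f3||_p,
# where F(x1,x2,x3) = f1(x2,x3) f2(x1,x3) f3(x1,x2).
rng = np.random.default_rng(0)
n, p = 6, 2.0

f1 = rng.random((n, n))   # function of (x2, x3)
f2 = rng.random((n, n))   # function of (x1, x3)
f3 = rng.random((n, n))   # function of (x1, x2)

# build F by broadcasting over the omitted coordinate of each factor
F = f1[None, :, :] * f2[:, None, :] * f3[:, :, None]

lhs = (np.abs(F) ** (p / 2)).sum() ** (2 / p)
rhs = np.prod([(np.abs(f) ** p).sum() ** (1 / p) for f in (f1, f2, f3)])

assert lhs <= rhs + 1e-9
```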
Lemma 1.14.9 (Endpoint Sobolev inequality). $W^{1,1}({\bf R}^d)$ embeds continuously into $L^{d/(d-1)}({\bf R}^d)$.

Proof. It will suffice to show that

$\|f\|_{L^{d/(d-1)}({\bf R}^d)} \lesssim \|\nabla f\|_{L^1({\bf R}^d)}$

for all test functions $f \in C^\infty_c({\bf R}^d)$. From the fundamental theorem of calculus we see that

$|f(x_1, \ldots, x_d)| \leq \int_{\bf R} |\frac{\partial f}{\partial x_i}(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)|\, dt$

and thus

$|f(x_1, \ldots, x_d)| \leq f_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d),$

where

$f_i(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d) := \int_{\bf R} |\nabla f(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_d)|\, dt.$

From Fubini's theorem we have

$\|f_i\|_{L^1({\bf R}^{d-1})} = \|\nabla f\|_{L^1({\bf R}^d)},$

and hence by the Loomis-Whitney inequality (Exercise 1.14.21 with $p = 1$)

$\|f_1 \cdots f_d\|_{L^{1/(d-1)}({\bf R}^d)} \leq \|\nabla f\|_{L^1({\bf R}^d)}^d,$

and the claim follows, since $|f|^d \leq f_1 \cdots f_d$ pointwise.
Exercise 1.14.22 (Connection between Sobolev embedding and isoperimetric inequality). Let $d \geq 2$, and let $\Omega$ be an open subset of ${\bf R}^d$ whose boundary $\partial\Omega$ is a smooth $(d-1)$-dimensional manifold. Show that the surface area $|\partial\Omega|$ of $\partial\Omega$ is related to the volume $|\Omega|$ of $\Omega$ by the isoperimetric inequality

$|\Omega| \leq C_d |\partial\Omega|^{d/(d-1)}$

for some constant $C_d$ depending only on $d$. (Hint: Apply the endpoint Sobolev theorem to a suitably smoothed out version of $1_\Omega$.) It is also possible to reverse this implication and deduce the endpoint Sobolev embedding theorem from the isoperimetric inequality and the coarea formula, which we will do in later notes.
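In $d = 2$ the inequality reads $|\Omega| \leq C_2 |\partial\Omega|^2$, and the sharp constant $C_2 = 1/(4\pi)$ (with equality for disks) is classical. A few concrete shapes can be checked directly; the short sketch below does exactly that.

```python
import math

# Check |Omega| <= C_2 |dOmega|^2 in the plane, with the sharp constant
# C_2 = 1/(4*pi); equality holds for the disk.
C2 = 1.0 / (4.0 * math.pi)

shapes = [
    ("disk r=1",   math.pi, 2 * math.pi),   # (name, area, perimeter)
    ("square a=1", 1.0,     4.0),
    ("rect 3x1",   3.0,     8.0),
]
for name, area, perimeter in shapes:
    assert area <= C2 * perimeter ** 2 + 1e-12, name
```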
Exercise 1.14.23. Use dimensional analysis to argue why the Sobolev embedding theorem should fail when $\frac{d}{q} < \frac{d}{p} - 1$. Then create a rigorous counterexample to that theorem in this case.
Exercise 1.14.24. Show that $W^{k,p}({\bf R}^d)$ embeds into $W^{l,q}({\bf R}^d)$ whenever $k \geq l \geq 0$ and $1 < p < q \leq \infty$ are such that $\frac{d}{p} - k \leq \frac{d}{q} - l$, and such that at least one of the two inequalities $q \leq \infty$, $\frac{d}{p} - k \leq \frac{d}{q} - l$ is strict.
Exercise 1.14.25. Show that the Sobolev embedding theorem fails whenever $q < p$. (Hint: experiment with functions of the form $f(x) = \sum_{j=1}^n \phi(x - x_j)$, where $\phi$ is a test function and the $x_j$ are widely separated points in space.)
Exercise 1.14.26 (Hölder-Sobolev embedding). Let $d < p < \infty$. Show that $W^{1,p}({\bf R}^d)$ embeds continuously into $C^{0,\alpha}({\bf R}^d)$, where $0 < \alpha < 1$ is defined by the scaling relationship $\frac{d}{p} - 1 = -\alpha$. Use dimensional analysis to justify why one would expect this scaling relationship to arise naturally, and give an example to show that $\alpha$ cannot be improved to any higher exponent.

More generally, with the same assumptions on $p, \alpha$, show that $W^{k+1,p}({\bf R}^d)$ embeds continuously into $C^{k,\alpha}({\bf R}^d)$ for all natural numbers $k \geq 0$.
Exercise 1.14.27 (Sobolev product theorem, special case). Let $k \geq 1$, $1 < p, q < d/k$, and $1 < r < \infty$ be such that $\frac{1}{p} + \frac{1}{q} - \frac{k}{d} = \frac{1}{r}$. Show that whenever $f \in W^{k,p}({\bf R}^d)$ and $g \in W^{k,q}({\bf R}^d)$, then $fg \in W^{k,r}({\bf R}^d)$, and that

$\|fg\|_{W^{k,r}({\bf R}^d)} \leq C_{p,q,k,d,r} \|f\|_{W^{k,p}({\bf R}^d)} \|g\|_{W^{k,q}({\bf R}^d)}$

for some constant $C_{p,q,k,d,r}$ depending only on the subscripted parameters. (This is not the most general range of parameters for which this sort of product theorem holds, but it is an instructive special case.)
Exercise 1.14.28. Let $L$ be a differential operator of order $m$ whose coefficients lie in $C^\infty({\bf R}^d)$. Show that $L$ maps $W^{k+m,p}({\bf R}^d)$ continuously to $W^{k,p}({\bf R}^d)$ for all $1 \leq p \leq \infty$ and all integers $k \geq 0$.
1.14.3. $L^2$-based Sobolev spaces. It is possible to develop more general Sobolev spaces $W^{s,p}({\bf R}^d)$ than the integer-regularity spaces $W^{k,p}({\bf R}^d)$ defined above, in which $s$ is allowed to take any real number (including negative numbers) as a value, although the theory becomes somewhat pathological unless one restricts attention to the range $1 < p < \infty$, for reasons having to do with the theory of singular integrals.

As the theory of singular integrals is beyond the scope of this course, we will illustrate this theory only in the model case $p = 2$, in which Plancherel's theorem is available, which allows one to avoid dealing with singular integrals by working purely on the frequency space side.
To explain this, we begin with the Plancherel identity

$\int_{{\bf R}^d} |f(x)|^2\, dx = \int_{{\bf R}^d} |\hat{f}(\xi)|^2\, d\xi,$

which is valid for all $L^2({\bf R}^d)$ functions and in particular for Schwartz functions $f \in \mathcal{S}({\bf R}^d)$. Also, we know that the Fourier transform of any derivative $\frac{\partial f}{\partial x_j}$ of $f$ is $2\pi i \xi_j \hat{f}(\xi)$. From this we see that

$\int_{{\bf R}^d} |\frac{\partial f}{\partial x_j}(x)|^2\, dx = \int_{{\bf R}^d} (2\pi |\xi_j|)^2 |\hat{f}(\xi)|^2\, d\xi$

for all $f \in \mathcal{S}({\bf R}^d)$, and so on summing in $j$ we have

$\int_{{\bf R}^d} |\nabla f(x)|^2\, dx = \int_{{\bf R}^d} (2\pi |\xi|)^2 |\hat{f}(\xi)|^2\, d\xi.$
A similar argument then gives

$\int_{{\bf R}^d} |\nabla^j f(x)|^2\, dx = \int_{{\bf R}^d} (2\pi |\xi|)^{2j} |\hat{f}(\xi)|^2\, d\xi,$

and so on summing in $j$ we have

$\|f\|^2_{W^{k,2}({\bf R}^d)} = \int_{{\bf R}^d} \sum_{j=0}^k (2\pi |\xi|)^{2j} |\hat{f}(\xi)|^2\, d\xi$

for all $k \geq 0$ and all Schwartz functions $f \in \mathcal{S}({\bf R}^d)$. Since the Schwartz functions are dense in $W^{k,2}({\bf R}^d)$, a limiting argument (using the fact that $L^2$ is complete) then shows that the above formula also holds for all $f \in W^{k,2}({\bf R}^d)$.
Now observe that the quantity $\sum_{j=0}^k (2\pi |\xi|)^{2j}$ is comparable (up to constants depending on $k, d$) to the expression $\langle \xi \rangle^{2k}$, where $\langle x \rangle := (1 + |x|^2)^{1/2}$ (this quantity is sometimes known as the Japanese bracket of $x$). We thus conclude that

$\|f\|_{W^{k,2}({\bf R}^d)} \sim \|\langle \xi \rangle^k \hat{f}(\xi)\|_{L^2({\bf R}^d)},$
where we use $x \sim y$ here to denote the fact that $x$ and $y$ are comparable up to constants depending on $d, k$, and $\xi$ denotes the independent variable on the right-hand side. If we then define, for any real number $s$, the space $H^s({\bf R}^d)$ to be the space of all tempered distributions $f$ such that the distribution $\langle \xi \rangle^s \hat{f}(\xi)$ lies in $L^2$, and give this space the norm

$\|f\|_{H^s({\bf R}^d)} := \|\langle \xi \rangle^s \hat{f}(\xi)\|_{L^2({\bf R}^d)},$

then we see that $W^{k,2}({\bf R}^d)$ embeds into $H^k({\bf R}^d)$, and that the norms are equivalent.
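The $k = 1$, $d = 1$ case of the norm equivalence (in fact, the exact Plancherel-type identity preceding it) can be checked on a discrete grid, with $\hat f$ approximated by the FFT. The sketch below uses a Gaussian as an arbitrary rapidly decreasing test function; grid size and window are arbitrary choices.

```python
import numpy as np

# Check, for k = 1 and d = 1, the identity
#   ||f||_{L^2}^2 + ||f'||_{L^2}^2 = int (1 + (2*pi*|xi|)^2) |fhat(xi)|^2 dxi
# numerically (the Fourier convention here is e^{-2*pi*i*x*xi}).
N, L = 4096, 40.0
x = (np.arange(N) - N // 2) * (L / N)
dx = L / N

f = np.exp(-x ** 2)
fprime = -2 * x * f                       # exact derivative

space_side = (np.sum(f ** 2) + np.sum(fprime ** 2)) * dx

fhat = dx * np.fft.fft(f)                 # approximates the Fourier integral
xi = np.fft.fftfreq(N, d=dx)              # matching frequency grid
dxi = 1.0 / (N * dx)
freq_side = np.sum((1.0 + (2 * np.pi * np.abs(xi)) ** 2)
                   * np.abs(fhat) ** 2) * dxi

assert abs(space_side - freq_side) < 1e-8 * space_side
```

(The grid-offset phase drops out because only $|\hat f|^2$ enters.)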
Actually, the two spaces are equal:

Exercise 1.14.29. For any $s \in {\bf R}$, show that $\mathcal{S}({\bf R}^d)$ is a dense subspace of $H^s({\bf R}^d)$. Use this to conclude that $W^{k,2}({\bf R}^d) = H^k({\bf R}^d)$ for all non-negative integers $k$.
It is clear that $H^0({\bf R}^d) \equiv L^2({\bf R}^d)$, and that $H^s({\bf R}^d) \subset H^{s'}({\bf R}^d)$ whenever $s > s'$. The spaces $H^s({\bf R}^d)$ are also (complex) Hilbert spaces, with the Hilbert space inner product

$\langle f, g \rangle_{H^s({\bf R}^d)} := \int_{{\bf R}^d} \langle \xi \rangle^{2s} \hat{f}(\xi) \overline{\hat{g}(\xi)}\, d\xi.$

It is not hard to verify that this inner product does indeed give $H^s({\bf R}^d)$ the structure of a Hilbert space (indeed, it is isomorphic under the Fourier transform to the Hilbert space $L^2(\langle \xi \rangle^{2s}\, d\xi)$, which is isomorphic in turn under the map $F(\xi) \mapsto \langle \xi \rangle^s F(\xi)$ to the standard Hilbert space $L^2({\bf R}^d)$).
Being a Hilbert space, $H^s({\bf R}^d)$ is isomorphic to its dual $H^s({\bf R}^d)^*$. It can also be paired with the space $H^{-s}({\bf R}^d)$:

Exercise 1.14.30 (Duality between $H^s$ and $H^{-s}$). Let $s \in {\bf R}$. Show that for $f \in H^{-s}({\bf R}^d)$ and $g \in H^s({\bf R}^d)$, the pairing

$\langle f, g \rangle_{L^2({\bf R}^d)} := \int_{{\bf R}^d} \hat{f}(\xi) \overline{\hat{g}(\xi)}\, d\xi$

is absolutely convergent. Also show that

$\|f\|_{H^{-s}({\bf R}^d)} = \sup\{ |\langle f, g \rangle_{L^2({\bf R}^d)}| : g \in \mathcal{S}({\bf R}^d);\ \|g\|_{H^s({\bf R}^d)} \leq 1 \}$

for all $f \in H^{-s}({\bf R}^d)$.
The $H^s$ Sobolev spaces also enjoy the same type of embedding estimates as their classical counterparts:

Exercise 1.14.31 (Sobolev embedding for $H^s$, I). If $s > d/2$, show that $H^s({\bf R}^d)$ embeds continuously into $C^{0,\alpha}({\bf R}^d)$ whenever $0 < \alpha < \min(s - \frac{d}{2}, 1)$. (Hint: use the Fourier inversion formula and the Cauchy-Schwarz inequality.)
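The hinted argument in the simplest case $d = 1$, $s = 1$ gives the explicit bound $\sup |f| \leq \int |\hat f| \leq \|\langle \xi \rangle^{-1}\|_{L^2} \|f\|_{H^1} = \sqrt{\pi}\, \|f\|_{H^1}$, which can be verified numerically; the test function and grid below are arbitrary choices.

```python
import numpy as np

# Check the embedding bound sup |f| <= sqrt(pi) * ||f||_{H^1(R)}, which follows
# from Fourier inversion and Cauchy-Schwarz since int (1 + xi^2)^{-1} dxi = pi.
N, L = 4096, 40.0
x = (np.arange(N) - N // 2) * (L / N)
dx = L / N

f = np.exp(-x ** 2) * np.cos(5 * x)        # an arbitrary test function
fhat = dx * np.fft.fft(f)                  # discrete approximation to fhat
xi = np.fft.fftfreq(N, d=dx)
dxi = 1.0 / (N * dx)

h1_norm = np.sqrt(np.sum((1 + xi ** 2) * np.abs(fhat) ** 2) * dxi)
assert np.max(np.abs(f)) <= np.sqrt(np.pi) * h1_norm
```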
Exercise 1.14.32 (Sobolev embedding for $H^s$, II). If $0 < s < d/2$, show that $H^s({\bf R}^d)$ embeds continuously into $L^q({\bf R}^d)$ whenever $\frac{d}{2} - s \leq \frac{d}{q} \leq \frac{d}{2}$. (Hint: it suffices to handle the extreme case $\frac{d}{q} = \frac{d}{2} - s$. For this, first reduce establishing the bound $\|f\|_{L^q({\bf R}^d)} \leq C \|f\|_{H^s({\bf R}^d)}$ to the case when $f \in H^s({\bf R}^d)$ is a Schwartz function whose Fourier transform vanishes near the origin (where $C$ depends on $s, d, q$), and write $\hat{f}(\xi) = \hat{g}(\xi)/|\xi|^s$ for some $g$ which is bounded in $L^2({\bf R}^d)$. Then use Exercise 1.13.35 and Corollary 1.11.18.)
Exercise 1.14.33. In this exercise we develop a more elementary variant of Sobolev spaces, the $L^p$ Hölder spaces. For any $1 \leq p \leq \infty$ and $0 < \alpha < 1$, let $\Lambda^\alpha_p({\bf R}^d)$ be the space of functions $f$ whose norm

$\|f\|_{\Lambda^\alpha_p({\bf R}^d)} := \|f\|_{L^p({\bf R}^d)} + \sup_{x \in {\bf R}^d \setminus \{0\}} \frac{\|\tau_x f - f\|_{L^p({\bf R}^d)}}{|x|^\alpha}$

is finite, where $\tau_x f(y) := f(y - x)$ is the translation of $f$ by $x$. Note that $\Lambda^\alpha_\infty({\bf R}^d) = C^{0,\alpha}({\bf R}^d)$ (with equivalent norms).

(i) For any $0 < \alpha < 1$, establish the inclusions $\Lambda^{\alpha+\varepsilon}_2({\bf R}^d) \subset H^\alpha({\bf R}^d) \subset \Lambda^\alpha_2({\bf R}^d)$ for any $0 < \varepsilon < 1 - \alpha$. (Hint: take Fourier transforms and work in frequency space.)
(ii) Let $\phi \in C^\infty_c({\bf R}^d)$ be a bump function, and let $\phi_n$ be the approximations to the identity $\phi_n(x) := 2^{dn} \phi(2^n x)$. If $f \in \Lambda^\alpha_p({\bf R}^d)$, show that one has the equivalence

$\|f\|_{\Lambda^\alpha_p({\bf R}^d)} \sim \|f\|_{L^p({\bf R}^d)} + \sup_{n \geq 0} 2^{\alpha n} \|f * \phi_{n+1} - f * \phi_n\|_{L^p({\bf R}^d)},$

where we use $x \sim y$ to denote the assertion that $x$ and $y$ are comparable up to constants depending on $p, d, \alpha$. (Hint: To upper bound $\|\tau_x f - f\|_{L^p({\bf R}^d)}$ for $|x| \leq 1$, express $f$ as a telescoping sum of $f * \phi_{n+1} - f * \phi_n$ for $2^{-n} \lesssim |x|$, plus a final term $f * \phi_{n_0}$, where $2^{-n_0}$ is comparable to $|x|$.)
(iii) If $1 \leq p \leq q \leq \infty$ and $0 < \alpha < 1$ are such that $\frac{d}{p} - \alpha < \frac{d}{q}$, show that $\Lambda^\alpha_p({\bf R}^d)$ embeds continuously into $L^q({\bf R}^d)$. (Hint: express $f$ as $f * \phi_1 * \phi_0$ plus a telescoping series of $f * \phi_{n+1} * \phi_n - f * \phi_n * \phi_{n-1}$, where $\phi_n$ is as in the previous part of this exercise. The additional convolution is in place in order to apply Young's inequality.)

The functions $f * \phi_{n+1} - f * \phi_n$ are crude versions of Littlewood-Paley projections, which play an important role in harmonic analysis and nonlinear wave and dispersive equations.
Exercise 1.14.34 (Sobolev trace theorem, special case). Let $s > 1/2$. For any $f \in C^\infty_c({\bf R}^d)$, establish the Sobolev trace inequality

$\|f\restriction_{{\bf R}^{d-1}}\|_{H^{s-1/2}({\bf R}^{d-1})} \leq C \|f\|_{H^s({\bf R}^d)},$

where $C$ depends only on $d$ and $s$, and $f\restriction_{{\bf R}^{d-1}}$ is the restriction of $f$ to the standard hyperplane ${\bf R}^{d-1} \equiv {\bf R}^{d-1} \times \{0\} \subset {\bf R}^d$. (Hint: Convert everything to $L^2$-based statements involving the Fourier transform of $f$, and use Schur's test, see Lemma 1.11.14.)
Exercise 1.14.35. (i) Show that if $f \in H^s({\bf R}^d)$ for some $s \in {\bf R}$, and $g \in C^\infty({\bf R}^d)$, then $fg \in H^s({\bf R}^d)$ (note that this product has to be defined in the sense of tempered distributions if $s$ is negative), and the map $f \mapsto fg$ is continuous from $H^s({\bf R}^d)$ to $H^s({\bf R}^d)$. (Hint: As with the previous exercise, convert everything to $L^2$-based statements involving the Fourier transform of $f$, and use Schur's test.)

(ii) Let $L$ be a partial differential operator of order $m$ with coefficients in $C^\infty({\bf R}^d)$ for some $m \geq 0$. Show that $L$ maps $H^s({\bf R}^d)$ continuously to $H^{s-m}({\bf R}^d)$ for all $s \in {\bf R}$.
Now we consider a partial converse to Exercise 1.14.35.
Exercise 1.14.36 (Elliptic regularity). Let $m \geq 0$, and let

$L = \sum_{j_1, \ldots, j_d \geq 0;\ j_1 + \ldots + j_d = m} c_{j_1, \ldots, j_d} \frac{\partial^m}{\partial x_1^{j_1} \cdots \partial x_d^{j_d}}$

be a constant-coefficient homogeneous differential operator of order $m$. Define the symbol $l : {\bf R}^d \to {\bf C}$ of $L$ to be the homogeneous polynomial of degree $m$ defined by the formula

$l(\xi_1, \ldots, \xi_d) := \sum_{j_1, \ldots, j_d \geq 0;\ j_1 + \ldots + j_d = m} c_{j_1, \ldots, j_d} \xi_1^{j_1} \cdots \xi_d^{j_d}.$

We say that $L$ is elliptic if one has the lower bound

$|l(\xi)| \geq c |\xi|^m$

for all $\xi \in {\bf R}^d$ and some constant $c > 0$. Thus, for instance, the Laplacian $\Delta$ is elliptic. Another example of an elliptic operator is the Cauchy-Riemann operator $\partial_{x_1} + i \partial_{x_2}$ on ${\bf R}^2$. On the other hand, the heat operator $\partial_t - \Delta$, the Schrödinger operator $i \partial_t + \Delta$, and the wave operator $-\partial_t^2 + \Delta$ are not elliptic on ${\bf R}^{1+d}$.
(i) Show that if $L$ is elliptic of order $m$, and $f$ is a tempered distribution such that $f, Lf \in H^s({\bf R}^d)$, then $f \in H^{s+m}({\bf R}^d)$, and that one has the bound

(1.124) $\|f\|_{H^{s+m}({\bf R}^d)} \leq C (\|f\|_{H^s({\bf R}^d)} + \|Lf\|_{H^s({\bf R}^d)})$

for some $C$ depending on $s, m, d, L$. (Hint: Once again, rewrite everything in terms of the Fourier transform $\hat{f}$ of $f$.)

(ii) Show that if $L$ is a constant-coefficient differential operator of order $m$ which is not elliptic, then the estimate (1.124) fails.

(iii) Let $f \in L^2_{loc}({\bf R}^d)$ be a function which is locally in $L^2$, and let $L$ be an elliptic operator of order $m$. Show that if $Lf = 0$, then $f$ is smooth. (Hint: First show inductively that $\phi f \in H^k({\bf R}^d)$ for every test function $\phi$ and every natural number $k \geq 0$.)
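By homogeneity, the ellipticity bound $|l(\xi)| \geq c|\xi|^m$ only needs to be checked on the unit sphere. The sketch below does this numerically for the examples above, dropping constant factors such as $(2\pi i)^m$ (which do not affect ellipticity); the sampling density is an arbitrary choice.

```python
import numpy as np

# Sample |l(xi)| on the unit circle for the symbols of the example operators
# (spatial variables only, written as polynomials with constants dropped).
theta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
u, v = np.cos(theta), np.sin(theta)       # unit vectors (xi_1, xi_2)

symbols = {
    "Laplacian":      u ** 2 + v ** 2,    # l(xi) = |xi|^2, up to sign
    "Cauchy-Riemann": u + 1j * v,         # l(xi) = xi_1 + i xi_2
    "wave":           -u ** 2 + v ** 2,   # l(tau, xi) = -tau^2 + xi^2
}
min_abs = {name: np.min(np.abs(s)) for name, s in symbols.items()}

assert min_abs["Laplacian"] > 0.9         # elliptic: |l| = 1 on the sphere
assert min_abs["Cauchy-Riemann"] > 0.9    # elliptic: |l| = 1 on the sphere
assert min_abs["wave"] < 0.01             # vanishes on the light cone tau = ±xi
```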
Remark 1.14.10. The symbol $l$ of an elliptic operator (with real coefficients) tends to have level sets that resemble ellipsoids, hence the name. In contrast, the symbol of parabolic operators such as the heat operator $\partial_t - \Delta$ has level sets resembling paraboloids, and the symbol of hyperbolic operators such as the wave operator $-\partial_t^2 + \Delta$ has level sets resembling hyperboloids. The symbol in fact encodes many important features of linear differential operators, in particular controlling whether singularities can form, and how they must propagate in space and/or time; but this topic is beyond the scope of this course.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/04/30. Thanks to Antonio, bk, lutfu, PDEbeginner, Polam, timur, and anonymous commenters for corrections.
1.15. Hausdorff dimension
A fundamental characteristic of many mathematical spaces (e.g. vector spaces, metric spaces, topological spaces, etc.) is their dimension, which measures the complexity or degrees of freedom inherent in the space. There is no single notion of dimension; instead, there are a variety of different versions of this concept, with different versions being suitable for different classes of mathematical spaces. Typically, a single mathematical object may have several subtly different notions of dimension that one can place on it, which will be related to each other, and which will often agree with each other in non-pathological cases, but can also deviate from each other in many other situations. For instance:

One can define the dimension of a space $X$ by seeing how it compares to some standard reference spaces, such as ${\bf R}^n$ or ${\bf C}^n$; one may view a space as having dimension $n$ if it can be (locally or globally) identified with a standard $n$-dimensional space. The dimension of a vector space or a manifold can be defined in this fashion.

Another way to define the dimension of a space $X$ is as the largest number of independent objects one can place inside that space; this can be used to give an alternate notion of dimension for a vector space, or of an algebraic variety, as well as the closely related notion of the transcendence degree of a field. The concept of VC dimension in machine learning also broadly falls into this category.
One can also try to define dimension inductively, for instance declaring a space $X$ to be $n$-dimensional if it can be separated somehow by an $(n-1)$-dimensional object; thus an $n$-dimensional object will tend to have maximal chains of sub-objects of length $n$ (or $n+1$, depending on how one initialises the chain and how one defines length). This can give a notion of dimension for a topological space or of a commutative ring (Krull dimension).
The notions of dimension as defined above tend to necessarily take values in the natural numbers (or the cardinal numbers); there is no such space as ${\bf R}^{\sqrt{2}}$, for instance, nor can one talk about a basis consisting of $\pi$ linearly independent elements, or a chain of maximal ideals of length $e$. There is however a somewhat different approach to the concept of dimension which makes no distinction between integer and non-integer dimensions, and is suitable for studying rough sets such as fractals. The starting point is to observe that in the $d$-dimensional space ${\bf R}^d$, the volume $V$ of a ball of radius $R$ grows like $R^d$, thus giving the following heuristic relationship

(1.125) $\frac{\log V}{\log R} \approx d$

between volume, scale, and dimension. Formalising this heuristic leads to a number of useful notions of dimension for subsets of ${\bf R}^n$ (or more generally, for metric spaces), including (upper and lower) Minkowski dimension (also known as box-packing dimension or Minkowski-Bouligand dimension), and Hausdorff dimension.
Remark 1.15.1. In K-theory, it is also convenient to work with virtual vector spaces or vector bundles, such as formal differences of such spaces, which may therefore have a negative dimension; but as far as I am aware there is no connection between this notion of dimension and the metric ones given here.
Minkowski dimension can either be defined externally (relating the external volume of $\delta$-neighbourhoods of a set $E$ to the scale $\delta$) or internally (relating the internal $\delta$-entropy of $E$ to the scale). Hausdorff dimension is defined internally by first introducing the $d$-dimensional Hausdorff measure of a set $E$ for any parameter $0 \leq d < \infty$, which generalises the familiar notions of length, area, and volume to non-integer dimensions, or to rough sets, and is of interest in its own right. Hausdorff dimension has a lengthier definition than its Minkowski counterpart, but is more robust with respect to operations such as countable unions, and is generally accepted as the standard notion of dimension in metric spaces. We will compare these concepts against each other later in these notes.
One use of the notion of dimension is to create finer distinctions between various types of small subsets of spaces such as ${\bf R}^n$, beyond what can be achieved by the usual Lebesgue measure (or Baire category). For instance, a point, line, and plane in ${\bf R}^3$ all have zero measure with respect to three-dimensional Lebesgue measure (and are nowhere dense), but of course have different dimensions ($0$, $1$, and $2$ respectively). (Another good example is provided by Kakeya sets.) This can be used to clarify the nature of various singularities, such as that arising from non-smooth solutions to PDE; a function which is non-smooth on a set of large Hausdorff dimension can be considered less smooth than one which is non-smooth on a set of small Hausdorff dimension, even if both are smooth almost everywhere. While many properties of the singular set of such a function are worth studying (e.g. their rectifiability), understanding their dimension is often an important starting point. The interplay between these types of concepts is the subject of geometric measure theory.
1.15.1. Minkowski dimension. Before we study the more standard notion of Hausdorff dimension, we begin with the more elementary concept of the (upper and lower) Minkowski dimension of a subset $E$ of a Euclidean space ${\bf R}^n$.
There are several equivalent ways to approach Minkowski dimension. We begin with an external approach, based on a study of the $\delta$-neighbourhoods $E_\delta := \{x \in {\bf R}^n : \mathrm{dist}(x, E) < \delta\}$ of $E$, where $\mathrm{dist}(x, E) := \inf\{|x - y| : y \in E\}$ and we use the Euclidean metric on ${\bf R}^n$. These are open sets in ${\bf R}^n$ and therefore have an $n$-dimensional volume (or Lebesgue measure) $\mathrm{vol}^n(E_\delta)$. For instance, if $E = B^d(0,1) \times \{0\}^{n-d}$ is a $d$-dimensional unit ball in ${\bf R}^n$, then $E_\delta$ contains $B^d(0,1) \times B^{n-d}(0,\delta)$ and is contained in $B^d(0,2) \times B^{n-d}(0,\delta)$ for all $0 < \delta < 1$, which implies that

$c \delta^{n-d} \leq \mathrm{vol}^n(E_\delta) \leq C \delta^{n-d}$

for some constants $c, C > 0$ depending only on $n, d$. In particular, we have

$\lim_{\delta \to 0} \left( n - \frac{\log \mathrm{vol}^n(E_\delta)}{\log \delta} \right) = d$

(compare with (1.125)). This motivates our first definition of Minkowski dimension:

Definition 1.15.2. Let $E$ be a bounded subset of ${\bf R}^n$. The upper Minkowski dimension $\overline{\dim}_M(E)$ is defined as

$\overline{\dim}_M(E) := \limsup_{\delta \to 0} \left( n - \frac{\log \mathrm{vol}^n(E_\delta)}{\log \delta} \right)$

and the lower Minkowski dimension $\underline{\dim}_M(E)$ is defined as

$\underline{\dim}_M(E) := \liminf_{\delta \to 0} \left( n - \frac{\log \mathrm{vol}^n(E_\delta)}{\log \delta} \right).$

If the upper and lower Minkowski dimensions match, we refer to $\dim_M(E) := \overline{\dim}_M(E) = \underline{\dim}_M(E)$ as the Minkowski dimension of $E$. In particular, the empty set has a Minkowski dimension of $-\infty$.
Unwrapping all the definitions, we have the following equivalent formulation, where $E$ is a bounded subset of ${\bf R}^n$ and $d \in {\bf R}$:

We have $\overline{\dim}_M(E) \leq d$ iff, for every $\varepsilon > 0$, one has $\mathrm{vol}^n(E_\delta) \leq C \delta^{n - d - \varepsilon}$ for all sufficiently small $\delta > 0$ and some $C > 0$.

We have $\underline{\dim}_M(E) \leq d$ iff, for every $\varepsilon > 0$, one has $\mathrm{vol}^n(E_\delta) \leq C \delta^{n - d - \varepsilon}$ for arbitrarily small $\delta > 0$ and some $C > 0$.

We have $\overline{\dim}_M(E) \geq d$ iff, for every $\varepsilon > 0$, one has $\mathrm{vol}^n(E_\delta) \geq c \delta^{n - d + \varepsilon}$ for arbitrarily small $\delta > 0$ and some $c > 0$.

We have $\underline{\dim}_M(E) \geq d$ iff, for every $\varepsilon > 0$, one has $\mathrm{vol}^n(E_\delta) \geq c \delta^{n - d + \varepsilon}$ for all sufficiently small $\delta > 0$ and some $c > 0$.
Exercise 1.15.1. (i) Let $C \subset {\bf R}$ be the Cantor set consisting of all base $4$ strings $\sum_{i=1}^\infty a_i 4^{-i}$, where each $a_i$ takes values in $\{0, 3\}$. Show that $C$ has Minkowski dimension $1/2$. (Hint: approximate any small $\delta$ by a negative power of $4$.)

(ii) Let $C' \subset {\bf R}$ be the set of all sums $\sum_{i=1}^\infty a_i 4^{-i}$, where each $a_i$ takes values in $\{0, 3\}$ when $(2k)! \leq i < (2k+1)!$ for some integer $k \geq 0$, and $a_i$ is arbitrary for the other values of $i$. Show that $C'$ has a lower Minkowski dimension of $1/2$ and an upper Minkowski dimension of $1$.
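For the set in part (i), the box count can be computed exactly: at scale $\delta = 4^{-m}$ the set is covered by exactly $2^m$ intervals of length $4^{-m}$ (one per digit prefix), so $\log N(\delta)/\log(1/\delta) = m \log 2 / (m \log 4) = 1/2$. The short enumeration below confirms this for small $m$; the range of $m$ is an arbitrary choice.

```python
import math
from itertools import product

# Box-counting for C = { sum a_i 4^{-i} : a_i in {0, 3} } at scale 4^{-m}.
def box_count(m):
    # indices of the length-4^{-m} intervals hit by C: one per digit prefix
    prefixes = set()
    for digits in product((0, 3), repeat=m):
        val = sum(a * 4 ** -(i + 1) for i, a in enumerate(digits))
        prefixes.add(round(val * 4 ** m))
    return len(prefixes)

for m in (2, 4, 6, 8):
    N = box_count(m)
    assert N == 2 ** m
    dim_estimate = math.log(N) / math.log(4 ** m)
    assert abs(dim_estimate - 0.5) < 1e-12
```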
Exercise 1.15.2. Suppose that $E \subset {\bf R}^n$ is a compact set with the property that there exist $0 < r < 1$ and an integer $k > 1$ such that $E$ is equal to the union of $k$ disjoint translates of $r \cdot E := \{rx : x \in E\}$. (This is a special case of a self-similar fractal; the Cantor set is a typical example.) Show that $E$ has Minkowski dimension $\frac{\log k}{\log 1/r}$.

If the $k$ translates of $r \cdot E$ are allowed to overlap, establish the upper bound $\overline{\dim}_M(E) \leq \frac{\log k}{\log 1/r}$.
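The formula $\log k / \log(1/r)$ is easy to evaluate for standard examples, and the values agree with what one expects from the preceding exercise; the cases below are illustrative choices.

```python
import math

# The similarity dimension log k / log(1/r) of Exercise 1.15.2.
def similarity_dim(k, r):
    return math.log(k) / math.log(1.0 / r)

# middle-thirds Cantor set: 2 copies scaled by 1/3
assert abs(similarity_dim(2, 1 / 3) - math.log(2) / math.log(3)) < 1e-12
# the base-4 set of Exercise 1.15.1(i): 2 copies scaled by 1/4, dimension 1/2
assert abs(similarity_dim(2, 1 / 4) - 0.5) < 1e-12
# a square is 4 copies of itself scaled by 1/2: dimension 2
assert abs(similarity_dim(4, 1 / 2) - 2.0) < 1e-12
```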
It is clear that we have the inequalities

$0 \leq \underline{\dim}_M(E) \leq \overline{\dim}_M(E) \leq n$

for non-empty bounded $E \subset {\bf R}^n$, and the monotonicity properties

$\underline{\dim}_M(E) \leq \underline{\dim}_M(F); \quad \overline{\dim}_M(E) \leq \overline{\dim}_M(F)$

whenever $E \subset F \subset {\bf R}^n$ are bounded sets. It is thus natural to extend the definitions of lower and upper Minkowski dimension to unbounded sets $E$ by defining

(1.126) $\underline{\dim}_M(E) := \sup_{F \subset E,\ \text{bounded}} \underline{\dim}_M(F)$

and

(1.127) $\overline{\dim}_M(E) := \sup_{F \subset E,\ \text{bounded}} \overline{\dim}_M(F).$

In particular, we easily verify that $d$-dimensional subspaces of ${\bf R}^n$ have Minkowski dimension $d$.
Exercise 1.15.3. Show that any subset of ${\bf R}^n$ with lower Minkowski dimension less than $n$ has Lebesgue measure zero. In particular, any subset $E \subset {\bf R}^n$ of positive Lebesgue measure must have full Minkowski dimension $\dim_M(E) = n$.
Now we turn to other formulations of Minkowski dimension. Given a bounded set $E$ and $\delta > 0$, we make the following definitions:

$A^{\mathrm{ext}}_\delta(E)$ (the external covering number): the fewest number of open balls of radius $\delta$, with centres in ${\bf R}^n$, needed to cover $E$.

$A^{\mathrm{int}}_\delta(E)$ (the internal covering number): the fewest number of open balls of radius $\delta$, with centres in $E$, needed to cover $E$.

$A^{\mathrm{net}}_\delta(E)$ (the metric entropy): the cardinality of the largest $\delta$-net in $E$, i.e. the largest number of points one can place in $E$ whose pairwise distances are all at least $\delta$.

$A^{\mathrm{pack}}_\delta(E)$ (the packing number): the largest number of disjoint open balls of radius $\delta$ one can find with centres in $E$.

Exercise 1.15.4. For any bounded set $E \subset {\bf R}^n$ and any $\delta > 0$, show that

$A^{\mathrm{net}}_{2\delta}(E) = A^{\mathrm{pack}}_\delta(E) \leq \frac{\mathrm{vol}^n(E_\delta)}{\mathrm{vol}^n(B^n(0,\delta))} \leq A^{\mathrm{ext}}_\delta(E) \leq A^{\mathrm{int}}_\delta(E) \leq A^{\mathrm{net}}_\delta(E).$
As a consequence of this exercise, we see that

(1.128) $\overline{\dim}_M(E) = \limsup_{\delta \to 0} \frac{\log A^*_\delta(E)}{\log 1/\delta}$

and

(1.129) $\underline{\dim}_M(E) = \liminf_{\delta \to 0} \frac{\log A^*_\delta(E)}{\log 1/\delta},$

where $*$ is any of ext, int, net, pack.

One can now take the formulae (1.128), (1.129) as the definition of Minkowski dimension for bounded sets (and then use (1.126), (1.127) to extend to unbounded sets). The formulations (1.128), (1.129) for $* = $ int, net, pack have the advantage of being intrinsic: they only involve $E$, rather than the ambient space ${\bf R}^n$. For metric spaces, one still has a partial analogue of Exercise 1.15.4, namely

$A^{\mathrm{net}}_{2\delta}(E) \leq A^{\mathrm{pack}}_\delta(E) \leq A^{\mathrm{int}}_\delta(E) \leq A^{\mathrm{net}}_\delta(E).$

As such, these formulations of Minkowski dimension extend without any difficulty to arbitrary bounded metric spaces $(E, d)$ (at least when the spaces are locally compact), and then to unbounded metric spaces by (1.126), (1.127).
Exercise 1.15.5. If $\phi: (X, d_X) \to (Y, d_Y)$ is a Lipschitz map between metric spaces, show that $\overline{\dim}_M(\phi(E)) \leq \overline{\dim}_M(E)$ and $\underline{\dim}_M(\phi(E)) \leq \underline{\dim}_M(E)$ for all $E \subset X$. Conclude in particular that the graph $\{(x, \phi(x)) : x \in {\bf R}^d\}$ of any Lipschitz function $\phi: {\bf R}^d \to {\bf R}^{n-d}$ has Minkowski dimension $d$, and the graph of any measurable function $\phi: {\bf R}^d \to {\bf R}^{n-d}$ has Minkowski dimension at least $d$.
Note however that the dimension of graphs can become larger than that of the base in the non-Lipschitz case:

Exercise 1.15.6. Show that the graph $\{(x, \sin \frac{1}{x}) : 0 < x < 1\}$ has Minkowski dimension $3/2$.
Exercise 1.15.7. Let $(X, d)$ be a bounded metric space. For each $n \geq 0$, let $E_n$ be a maximal $2^{-n}$-net of $X$ (thus the cardinality of $E_n$ is $A^{\mathrm{net}}_{2^{-n}}(X)$). Show that for any continuous function $f: X \to {\bf R}$ and any $x_0 \in X$, one has the inequality

$\sup_{x \in X} f(x) \leq \sup_{x_0 \in E_0} f(x_0) + \sum_{n=0}^{+\infty} \sup_{x_n \in E_n, x_{n+1} \in E_{n+1}:\ |x_n - x_{n+1}| \leq \frac{3}{2} 2^{-n}} (f(x_n) - f(x_{n+1})).$

(Hint: For any $x \in X$, define $x_n \in E_n$ to be the nearest point in $E_n$ to $x$, and use a telescoping series.) This inequality (and variants thereof), which replaces a continuous supremum of a function $f(x)$ by a sum of discrete suprema of differences $f(x_n) - f(x_{n+1})$ of that function, is the basis of the generic chaining technique in probability, used to estimate the supremum of a continuous family of random processes. It is particularly effective when combined with bounds on the metric entropy $A^{\mathrm{net}}_{2^{-n}}(X)$, which of course is closely related to the Minkowski dimension of $X$, and with large deviation bounds on the differences $f(x_n) - f(x_{n+1})$. A good reference for generic chaining is [Ta2005].
Exercise 1.15.8. If $E \subset {\bf R}^n$ and $F \subset {\bf R}^m$ are bounded sets, show that

$\underline{\dim}_M(E) + \underline{\dim}_M(F) \leq \underline{\dim}_M(E \times F)$

and

$\overline{\dim}_M(E \times F) \leq \overline{\dim}_M(E) + \overline{\dim}_M(F).$

Give a counterexample that shows that either of the inequalities here can be strict. (Hint: There are many possible constructions; one of them is a modification of Exercise 1.15.1(ii).)
It is easy to see that Minkowski dimension reacts well to finite unions, and more precisely that
$$\underline{\dim}_M(E \cup F) = \max(\underline{\dim}_M(E), \underline{\dim}_M(F))$$
and
$$\overline{\dim}_M(E \cup F) = \max(\overline{\dim}_M(E), \overline{\dim}_M(F))$$
for any $E, F \subset \mathbf{R}^n$. However, it does not respect countable unions. For instance, the rationals $\mathbf{Q}$ have Minkowski dimension $1$, despite being the countable union of points, which of course have Minkowski dimension $0$. More generally, it is not difficult to see that any set $E \subset \mathbf{R}^n$ has the same upper or lower Minkowski dimension as its topological closure $\overline{E}$, since both sets have the same $\varepsilon$-neighbourhoods. Thus we see that the notion of Minkowski dimension misses some of the fine structure of a set $E$, in particular the presence of "holes" within the set. We now turn to the notion of Hausdorff dimension, which rectifies some of these defects.
1.15.2. Hausdorff measure. The Hausdorff approach to dimension begins by noting that $d$-dimensional objects in $\mathbf{R}^n$ tend to have a meaningful $d$-dimensional measure to assign to them. For instance, the $1$-dimensional boundary of a polygon has a perimeter, the $0$-dimensional vertices of that polygon have a cardinality, and the polygon itself has an area. So to define the notion of Hausdorff dimension, we will first define the notion of the $d$-dimensional Hausdorff measure $\mathcal{H}^d(E)$ of a set $E$.
To do this, let us quickly review one of the (many) constructions of $n$-dimensional Lebesgue measure, which we are denoting here by $\mathrm{vol}^n$. One way to build this measure is to work with half-open boxes $B = \prod_{i=1}^n [a_i, b_i)$ in $\mathbf{R}^n$, to which we assign a volume $|B| := \prod_{i=1}^n (b_i - a_i)$. Given this notion of volume for boxes, we can then define the outer Lebesgue measure
$$(\mathrm{vol}^n)^*(E) := \inf\Big\{ \sum_{k=1}^\infty |B_k| : B_k \hbox{ covers } E \Big\}$$
where the infimum ranges over all at most countable collections $B_1, B_2, \ldots$ of boxes that cover $E$. One easily verifies that $(\mathrm{vol}^n)^*$ is indeed an outer measure (i.e. it is monotone, countably subadditive, and assigns zero to the empty set). We then define a set $A \subset \mathbf{R}^n$ to be $(\mathrm{vol}^n)^*$-measurable if
$$(\mathrm{vol}^n)^*(E) = (\mathrm{vol}^n)^*(E \cap A) + (\mathrm{vol}^n)^*(E \backslash A)$$
for all $E \subset \mathbf{R}^n$. By Carathéodory's theorem, the space of $(\mathrm{vol}^n)^*$-measurable sets is a $\sigma$-algebra, and outer Lebesgue measure is a countably additive measure on this $\sigma$-algebra, which we denote $\mathrm{vol}^n$. Furthermore, one easily verifies that every box $B$ is $(\mathrm{vol}^n)^*$-measurable, which soon implies that every Borel set is also; thus Lebesgue measure is a Borel measure (though it can of course measure some non-Borel sets also).

Finally, one needs to verify that the Lebesgue measure $\mathrm{vol}^n(B)$ of a box is equal to its classical volume $|B|$; the above construction trivially gives $\mathrm{vol}^n(B) \le |B|$, but the converse is not as obvious. This is in fact a rather delicate matter, relying in particular on the completeness of the reals; if one replaced $\mathbf{R}$ by the rationals $\mathbf{Q}$, for instance, then all the above constructions go through but now boxes have Lebesgue measure zero (why?). See [Fo2000, Chapter 1], for instance, for details.
Anyway, we can use this construction of Lebesgue measure as a model for building $d$-dimensional Hausdorff measure. Instead of using half-open boxes as the building blocks, we will instead work with the open balls $B(x,r)$. For $d$-dimensional measure, we will assign each ball $B(x,r)$ a measure $r^d$ (cf. (1.125)). We can then define the unlimited Hausdorff content $h^{d,\infty}(E)$ of a set $E \subset \mathbf{R}^n$ by the formula
$$h^{d,\infty}(E) := \inf\Big\{ \sum_k r_k^d : B(x_k, r_k) \hbox{ covers } E \Big\}$$
where the infimum ranges over all at most countable families of balls that cover $E$. (Note that if $E$ is compact, then it would suffice to use finite coverings, since every open cover of $E$ has a finite subcover. But in general, for non-compact $E$ we must allow the use of infinitely many balls.)
As with Lebesgue measure, $h^{d,\infty}$ is easily seen to be an outer measure, and one could define the notion of an $h^{d,\infty}$-measurable set, on which Carathéodory's theorem applies to build a countably additive measure. Unfortunately, a key problem arises: once $d$ is less than $n$, most sets cease to be $h^{d,\infty}$-measurable! We illustrate this in the one-dimensional case with $n = 1$ and $d = 1/2$, and consider the problem of computing the unlimited Hausdorff content $h^{1/2,\infty}([a,b])$ of an interval. On the one hand, this content is at most $|\frac{b-a}{2}|^{1/2}$, since one can cover $[a,b]$ by the ball of radius $\frac{b-a}{2} + \varepsilon$ centred at $\frac{a+b}{2}$ for any $\varepsilon > 0$. On the other hand, the content is also at least $|\frac{b-a}{2}|^{1/2}$. To see this, suppose we cover $[a,b]$ by a finite or countable family of balls $B(x_k, r_k)$ (one can reduce to the finite case by compactness, though it isn't necessary to do so here). The total one-dimensional Lebesgue measure $\sum_k 2 r_k$ of these balls must equal or exceed the Lebesgue measure of the entire interval $|b - a|$, thus
$$\sum_k r_k \ge \frac{|b-a|}{2}.$$
From the inequality
$$\sum_k r_k \le \Big( \sum_k r_k^{1/2} \Big)^2$$
(which is obvious after expanding the right-hand side and discarding cross-terms) we see that
$$\sum_k r_k^{1/2} \ge \Big( \frac{|b-a|}{2} \Big)^{1/2}$$
and the claim follows.
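The lower bound argument above is easy to test empirically. A minimal sketch (my own, for illustration): take random partitions of $[0,1]$ into subintervals, view each subinterval as a ball of radius half its length, and verify that $\sum_k r_k^{1/2}$ never drops below $\sqrt{1/2}$:

```python
import math
import random

# Every cover of [0,1] by intervals with total length 1 has
# sum of sqrt(radii) >= sqrt(1/2), by the argument in the text.
random.seed(0)
min_s = float("inf")
for _ in range(200):
    cuts = sorted([0.0, 1.0] + [random.random() for _ in range(25)])
    radii = [(b - a) / 2 for a, b in zip(cuts, cuts[1:])]
    min_s = min(min_s, sum(math.sqrt(r) for r in radii))
print(min_s, math.sqrt(0.5))
```

Covers with many small balls overshoot the bound badly, which is exactly why the infimum is attained by a single large ball.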
We now see some serious breakdown of additivity: for instance, the unlimited $1/2$-dimensional content of $[0,2]$ is $1$, despite $[0,2]$ being the disjoint union of $[0,1]$ and $(1,2]$, which each have an unlimited content of $1/\sqrt{2}$ (and $1/\sqrt{2} + 1/\sqrt{2} = \sqrt{2} > 1$). To recover additivity, we restrict the balls in our covers to be small, defining for each $r > 0$ the Hausdorff content
$$h^{d,r}(E) := \inf\Big\{ \sum_k r_k^d : B(x_k, r_k) \hbox{ covers } E;\ r_k \le r \Big\}$$
where the balls $B(x_k, r_k)$ are now restricted to be less than or equal to $r$ in radius. This quantity increases as $r$ decreases (there are fewer covers available), and we then define the Hausdorff outer measure $(\mathcal{H}^d)^*(E) := \lim_{r \to 0} h^{d,r}(E)$.

(This is analogous to the Riemann integral approach to the volume of sets, covering them by balls, boxes, or rectangles of increasingly small size; this latter approach is also closely connected to the Minkowski dimension concept studied earlier. The key difference between the Lebesgue/Hausdorff approach and the Riemann/Minkowski approach is that in the former approach one allows the balls or boxes to be countable in number, and to be variable in size, whereas in the latter approach the cover is finite and uniform in size.)
Exercise 1.15.9. Show that if $d > n$, then $(\mathcal{H}^d)^*(E) = 0$ for every $E \subset \mathbf{R}^n$.

Exercise 1.15.10. Show that if $E, F \subset \mathbf{R}^n$ are separated by a positive distance, then $(\mathcal{H}^d)^*(E \cup F) = (\mathcal{H}^d)^*(E) + (\mathcal{H}^d)^*(F)$. (Hint: one inequality is easy. For the other, observe that any sufficiently small ball can intersect $E$ or intersect $F$, but not both.)
One consequence of this is that there is a large class of measurable sets:

Proposition 1.15.3. Let $d \ge 0$. Then every Borel subset of $\mathbf{R}^n$ is $(\mathcal{H}^d)^*$-measurable.
Proof. Since the collection of $(\mathcal{H}^d)^*$-measurable sets is a $\sigma$-algebra, it suffices to show that every closed set $A$ is $(\mathcal{H}^d)^*$-measurable, i.e. that
$$(\mathcal{H}^d)^*(E) = (\mathcal{H}^d)^*(E \cap A) + (\mathcal{H}^d)^*(E \backslash A)$$
for arbitrary $E$. We may assume that $(\mathcal{H}^d)^*(E \cap A)$ and $(\mathcal{H}^d)^*(E \backslash A)$ are finite, since the claim follows from subadditivity otherwise. Since $E \cap A$ and $E \backslash A_{1/m}$ are separated by a positive distance, and since $(\mathcal{H}^d)^*$ is an outer measure, we already have
$$(\mathcal{H}^d)^*(E \cap A) + (\mathcal{H}^d)^*(E \backslash A_{1/m}) \le (\mathcal{H}^d)^*(E) \le (\mathcal{H}^d)^*(E \cap A) + (\mathcal{H}^d)^*(E \backslash A),$$
where $A_{1/m}$ is the $1/m$-neighbourhood of $A$. So it suffices to show that
$$\lim_{m \to \infty} (\mathcal{H}^d)^*(E \backslash A_{1/m}) = (\mathcal{H}^d)^*(E \backslash A).$$
For any $m$, we have the telescoping decomposition $E \backslash A = (E \backslash A_{1/m}) \cup \bigcup_{l > m} F_l$, where $F_l := (E \backslash A_{1/(l+1)}) \cap A_{1/l}$ (here we use the fact that $A$ is closed, so that every point of $E \backslash A$ lies at a positive distance from $A$), and thus by countable subadditivity and monotonicity
$$(\mathcal{H}^d)^*(E \backslash A_{1/m}) \le (\mathcal{H}^d)^*(E \backslash A) \le (\mathcal{H}^d)^*(E \backslash A_{1/m}) + \sum_{l > m} (\mathcal{H}^d)^*(F_l),$$
so it suffices to show that the sum $\sum_{l=1}^\infty (\mathcal{H}^d)^*(F_l)$ is absolutely convergent.

Consider the even-indexed sets $F_2, F_4, F_6, \ldots$. These sets are separated from each other, so by many applications of Exercise 1.15.10 followed by monotonicity we have
$$\sum_{l=1}^L (\mathcal{H}^d)^*(F_{2l}) = (\mathcal{H}^d)^*\Big( \bigcup_{l=1}^L F_{2l} \Big) \le (\mathcal{H}^d)^*(E \backslash A) < \infty$$
for all $L$, and thus $\sum_{l=1}^\infty (\mathcal{H}^d)^*(F_{2l})$ is absolutely convergent. Similarly for $\sum_{l=1}^\infty (\mathcal{H}^d)^*(F_{2l-1})$, and the claim follows.
Writing $\mathcal{H}^d(E) := (\mathcal{H}^d)^*(E)$ for Borel $E$, we conclude that $\mathcal{H}^d$ is a Borel measure on $\mathbf{R}^n$. We now study what this measure looks like for various values of $d$. The case $d = 0$ is easy:
Exercise 1.15.11. Show that every subset of $\mathbf{R}^n$ is $(\mathcal{H}^0)^*$-measurable, and that $\mathcal{H}^0$ is counting measure.
Now we look at the opposite case $d = n$. It is easy to see that any Lebesgue-null set of $\mathbf{R}^n$ has $n$-dimensional Hausdorff measure zero (since it may be covered by balls of arbitrarily small total content). Thus $n$-dimensional Hausdorff measure is absolutely continuous with respect to Lebesgue measure, and we thus have $\frac{d\mathcal{H}^n}{d\,\mathrm{vol}^n} = c$ for some locally integrable function $c$. As Hausdorff measure and Lebesgue measure are clearly translation-invariant, $c$ must also be translation-invariant and thus constant. We therefore have
$$\mathcal{H}^n = c\, \mathrm{vol}^n$$
for some constant $c \ge 0$.
We now compute what this constant is. If $\omega_n$ denotes the volume of the unit ball $B(0,1)$, then we have
$$\sum_k r_k^n = \frac{1}{\omega_n} \sum_k \mathrm{vol}^n(B(x_k, r_k)) \ge \frac{1}{\omega_n} \mathrm{vol}^n\Big( \bigcup_k B(x_k, r_k) \Big)$$
for any at most countable collection of balls $B(x_k, r_k)$. Taking infima, we conclude that
$$\mathcal{H}^n \ge \frac{1}{\omega_n} \mathrm{vol}^n$$
and so $c \ge \frac{1}{\omega_n}$.

In the opposite direction, observe from Exercise 1.15.4 that given any $0 < r < 1$, one can cover the unit cube $[0,1]^n$ by at most $C_n r^{-n}$ balls of radius $r$, where $C_n$ depends only on $n$; thus
$$\mathcal{H}^n([0,1]^n) \le C_n$$
and so $c \le C_n$; in particular, $c$ is finite.
We can in fact compute $c$ explicitly (although knowing that $c$ is finite and non-zero already suffices for many applications):

Lemma 1.15.4. We have $c = \frac{1}{\omega_n}$, or in other words $\mathcal{H}^n = \frac{1}{\omega_n} \mathrm{vol}^n$. (In particular, a ball $B^n(x,r)$ has $n$-dimensional Hausdorff measure $r^n$.)
Proof. Let us consider the Hausdorff measure $\mathcal{H}^n([0,1]^n)$ of the unit cube. By definition, for any $\varepsilon > 0$ one can find an $0 < r < 1/2$ such that
$$h^{n,r}([0,1]^n) \ge \mathcal{H}^n([0,1]^n) - \varepsilon.$$
Observe (using Exercise 1.15.4) that we can find $k \ge c_n r^{-n}$ disjoint balls $B(x_1, r), \ldots, B(x_k, r)$ of radius $r$ inside the unit cube. We then observe that
$$h^{n,r}([0,1]^n) \le k r^n + \mathcal{H}^n\Big( [0,1]^n \backslash \bigcup_{i=1}^k B(x_i, r) \Big).$$
On the other hand,
$$\mathcal{H}^n\Big( [0,1]^n \backslash \bigcup_{i=1}^k B(x_i, r) \Big) = c\, \mathrm{vol}^n\Big( [0,1]^n \backslash \bigcup_{i=1}^k B(x_i, r) \Big) = c (1 - k \omega_n r^n);$$
putting all this together, we obtain
$$c = \mathcal{H}^n([0,1]^n) \le k r^n + c (1 - k \omega_n r^n) + \varepsilon,$$
which rearranges as
$$c\, \omega_n - 1 \le \frac{\varepsilon}{k r^n}.$$
Since $k r^n$ is bounded below by $c_n$, we can then send $\varepsilon \to 0$ and conclude that $c \le \frac{1}{\omega_n}$; since we already showed $c \ge \frac{1}{\omega_n}$, the claim follows.
Thus $n$-dimensional Hausdorff measure is an explicit constant multiple of $n$-dimensional Lebesgue measure. The same argument shows that for integers $0 < d < n$, the restriction of $d$-dimensional Hausdorff measure to any $d$-dimensional linear subspace (or affine subspace) $V$ is equal to the constant $\frac{1}{\omega_d}$ times $d$-dimensional Lebesgue measure on $V$. (This shows, by the way, that $\mathcal{H}^d$ is not a $\sigma$-finite measure on $\mathbf{R}^n$ in general, since one can partition $\mathbf{R}^n$ into uncountably many $d$-dimensional affine subspaces. In particular, it is not a Radon measure in general.)
One can then compute $d$-dimensional Hausdorff measure for sets other than subsets of $d$-dimensional affine subspaces by changes of variable. For instance:

Exercise 1.15.12. Let $0 \le d \le n$ be an integer, let $\Omega$ be an open subset of $\mathbf{R}^d$, and let $\phi: \Omega \to \mathbf{R}^n$ be a smooth injective map which is non-degenerate in the sense that the derivative $D\phi$ (which is a $d \times n$ matrix) has full rank at every point of $\Omega$. For any compact subset $E$ of $\Omega$, establish the formula
$$\mathcal{H}^d(\phi(E)) = \int_E J\, d\mathcal{H}^d = \frac{1}{\omega_d} \int_E J\, d\,\mathrm{vol}^d$$
where the Jacobian $J$ is the square root of the sum of squares of all the determinants of the $d \times d$ minors of the $d \times n$ matrix $D\phi$. (Hint: By working locally, one can assume that $\phi$ is the graph of some map from $\Omega$ to $\mathbf{R}^{n-d}$, and so can be inverted by the projection function; by working even more locally, one can assume that the Jacobian is within an epsilon of being constant. The image of a small ball in $\Omega$ then resembles a small ellipsoid in $\phi(\Omega)$, and conversely the projection of a small ball in $\phi(\Omega)$ is a small ellipsoid in $\Omega$. Use some linear algebra and several variable calculus to relate the content of these ellipsoids to the radius of the ball.) It is possible to extend this formula to Lipschitz maps $\phi: \Omega \to \mathbf{R}^n$ that are not necessarily injective, leading to the area formula
$$\int_{\phi(E)} \#(\phi^{-1}(y))\, d\mathcal{H}^d(y) = \frac{1}{\omega_d} \int_E J\, d\,\mathrm{vol}^d$$
for such maps, but we will not prove this formula here.
From this exercise we see that $d$-dimensional Hausdorff measure does coincide to a large extent with the $d$-dimensional notion of surface area; for instance, for a simple smooth curve $\gamma: [a,b] \to \mathbf{R}^n$ with everywhere non-vanishing derivative, the $\mathcal{H}^1$ measure of $\gamma([a,b])$ is equal to its classical length $|\gamma| = \int_a^b |\gamma'(t)|\, dt$.

Exercise 1.15.13. Let $0 \le d' < d \le n$, and let $E \subset \mathbf{R}^n$ be a Borel set. Show that if $\mathcal{H}^d(E)$ is positive, then $\mathcal{H}^{d'}(E)$ is infinite.

Example 1.15.5. Let $0 \le d \le n$ be integers. The unit ball $B^d(0,1) \subset \mathbf{R}^d \subset \mathbf{R}^n$ has a $d$-dimensional Hausdorff measure of $1$ (by Lemma 1.15.4), and so it has zero $d'$-dimensional Hausdorff measure for every $d' > d$ and infinite $d'$-dimensional Hausdorff measure for every $d' < d$.
On the other hand, we know from Exercise 1.15.11 that $\mathcal{H}^0(E)$ is positive for any non-empty set $E$, and that $\mathcal{H}^d(E) = 0$ for every $d > n$. We conclude (from the least upper bound property of the reals) that for any non-empty Borel set $E \subset \mathbf{R}^n$, there exists a unique number in $[0,n]$, called the Hausdorff dimension $\dim_H(E)$ of $E$, such that $\mathcal{H}^d(E) = 0$ for all $d > \dim_H(E)$ and $\mathcal{H}^d(E) = \infty$ for all $d < \dim_H(E)$. Note that at the critical dimension $d = \dim_H(E)$ itself, we allow $\mathcal{H}^d(E)$ to be zero, finite, or infinite, and we shall shortly see that in fact all three possibilities can occur. By convention, we give the empty set a Hausdorff dimension of $-\infty$. One can also assign Hausdorff dimension to non-Borel sets, but we shall not do so in order to avoid some (very minor) technicalities.
Example 1.15.6. The unit ball $B^d(0,1) \subset \mathbf{R}^d \subset \mathbf{R}^n$ has Hausdorff dimension $d$, as does $\mathbf{R}^d$ itself. Note that the former set has finite $d$-dimensional Hausdorff measure, while the latter has an infinite measure. More generally, any $d$-dimensional smooth manifold in $\mathbf{R}^n$ has Hausdorff dimension $d$.
Exercise 1.15.14. Show that the graph $\{(x, \sin \frac{1}{x}) : 0 < x < 1\}$ has Hausdorff dimension $1$; compare this with Exercise 1.15.6.
It is clear that Hausdorff dimension is monotone: if $E \subset F$ are Borel sets, then $\dim_H(E) \le \dim_H(F)$. Since Hausdorff measure is countably additive, it is also not hard to see that Hausdorff dimension interacts well with countable unions:
$$\dim_H\Big( \bigcup_{i=1}^\infty E_i \Big) = \sup_{1 \le i < \infty} \dim_H(E_i).$$
Thus for instance the rationals, being a countable union of $0$-dimensional points, have Hausdorff dimension $0$, in contrast to their Minkowski dimension of $1$. On the other hand, we at least have an inequality between Hausdorff and Minkowski dimension:
Exercise 1.15.15. For any Borel set $E \subset \mathbf{R}^n$, show that $\dim_H(E) \le \underline{\dim}_M(E) \le \overline{\dim}_M(E)$. (Hint: use (1.129). Which of the choices of $*$ is most convenient to use here?)
It is instructive to compare Hausdorff dimension and Minkowski dimension as follows.

Exercise 1.15.16. Let $E$ be a bounded Borel subset of $\mathbf{R}^n$, and let $d \ge 0$.

(i) Show that $\underline{\dim}_M(E) \le d$ if and only if, for every $\varepsilon > 0$ and arbitrarily small $r > 0$, one can cover $E$ by finitely many balls $B(x_1, r_1), \ldots, B(x_k, r_k)$ of radii $r_i$ all equal to $r$ such that $\sum_{i=1}^k r_i^{d+\varepsilon} \le \varepsilon$.

(ii) Show that $\overline{\dim}_M(E) \le d$ if and only if, for every $\varepsilon > 0$ and all sufficiently small $r > 0$, one can cover $E$ by finitely many balls $B(x_1, r_1), \ldots, B(x_k, r_k)$ of radii $r_i$ all equal to $r$ such that $\sum_{i=1}^k r_i^{d+\varepsilon} \le \varepsilon$.

(iii) Show that $\dim_H(E) \le d$ if and only if, for every $\varepsilon > 0$ and $r > 0$, one can cover $E$ by countably many balls $B(x_1, r_1), B(x_2, r_2), \ldots$ of radii $r_i$ at most $r$ such that $\sum_{i=1}^\infty r_i^{d+\varepsilon} \le \varepsilon$.
The previous two exercises give ways to upper-bound the Hausdorff dimension; for instance, we see from Exercise 1.15.2 that self-similar fractals $E$ of the type in that exercise (i.e. $E$ is $k$ translates of $r \cdot E$) have Hausdorff dimension at most $\frac{\log k}{\log 1/r}$. To lower bound the Hausdorff dimension of a set $E$, one convenient way to do so is to find a measure with a certain dimension property (analogous to (1.125)) that assigns a positive mass to $E$:
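The bound $\frac{\log k}{\log 1/r}$ can be watched appearing in a box count. The sketch below uses the standard middle-thirds Cantor set ($k = 2$ translates of a $1/3$-scaled copy, so the predicted dimension is $\log 2/\log 3$; note this is a different set from the one in Exercise 1.15.1(i)), and works in integer arithmetic: the depth-$m$ left endpoints are $a/3^m$ with the base-$3$ digits of $a$ in $\{0,2\}$, and the $3^{-j}$ cell containing $a/3^m$ has index $a \,/\!/\, 3^{m-j}$.

```python
import math

# Box-count the middle-thirds Cantor set at scales 3^{-j}.
def cantor_numerators(m):
    nums = [0]
    for _ in range(m):
        nums = [3 * a for a in nums] + [3 * a + 2 for a in nums]
    return nums

m = 12
nums = cantor_numerators(m)                 # 2^12 endpoints
counts = [len({a // 3 ** (m - j) for a in nums}) for j in range(1, 9)]
dim_est = math.log(counts[-1]) / math.log(3 ** 8)
print(counts, dim_est)  # counts are 2, 4, 8, ..., 256
```

The count at scale $3^{-j}$ is exactly $2^j$, so the dimension estimate is exactly $\log 2/\log 3 \approx 0.6309$.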
Exercise 1.15.17. Let $d \ge 0$. A Borel measure $\mu$ on $\mathbf{R}^n$ is said to be a Frostman measure of dimension at least $d$ if it is compactly supported and there exists a constant $C$ such that $\mu(B(x,r)) \le C r^d$ for all balls $B(x,r)$ of radius $0 < r < 1$. Show that if $\mu$ has dimension at least $d$, then any Borel set $E$ with $\mu(E) > 0$ has positive $d$-dimensional Hausdorff content; in particular, $\dim_H(E) \ge d$.
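A concrete Frostman measure is the natural measure on the middle-thirds Cantor set, which splits its mass evenly between the two children at each stage and obeys $\mu(B(x,r)) \le C r^d$ with $d = \log 2/\log 3$. The sketch below (my own check; the sample points, radii, and the constant $8$ are ad hoc) estimates $\sup \mu(B(x,r))/r^d$ using the self-similarity $\mu = \frac{1}{2}\mu(3\,\cdot) + \frac{1}{2}\mu(3\,\cdot - 2)$:

```python
import math

def mass(lo, hi):
    """mu([lo,hi] ∩ [0,1]) for the Cantor measure, via self-similarity."""
    lo, hi = max(lo, 0.0), min(hi, 1.0)
    if hi <= lo:
        return 0.0
    if lo == 0.0 and hi == 1.0:
        return 1.0
    return 0.5 * mass(3 * lo, 3 * hi) + 0.5 * mass(3 * lo - 2, 3 * hi - 2)

d = math.log(2) / math.log(3)
xs = [0.0, 1 / 9, 1 / 3, 0.5, 2 / 3, 1.0]
rs = [3.0 ** -k for k in range(1, 8)]
max_ratio = max(mass(x - r, x + r) / r ** d for x in xs for r in rs)
print(max_ratio)
```

At triadic scales around points of the set the ratio is essentially $1$ (e.g. $\mu([0, 3^{-k}]) = 2^{-k} = (3^{-k})^d$), and it stays bounded throughout, as the exercise requires.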
Note that this gives an alternate way to justify the fact that smooth $d$-dimensional manifolds have Hausdorff dimension $d$, since on the one hand they have Minkowski dimension $d$, and on the other hand they support a non-trivial $d$-dimensional measure, namely Lebesgue measure.
Exercise 1.15.18. Show that the Cantor set in Exercise 1.15.1(i) has Hausdorff dimension $1/2$. More generally, establish the analogue of the first part of Exercise 1.15.2 for Hausdorff measure.

Exercise 1.15.19. Construct a subset of $\mathbf{R}$ of Hausdorff dimension $1$ that has zero Lebesgue measure. (Hint: A modified Cantor set, vaguely reminiscent of Exercise 1.15.1(ii), can work here.)
A useful fact is that Exercise 1.15.17 can be reversed:

Lemma 1.15.7 (Frostman's lemma). Let $d \ge 0$, and let $E \subset \mathbf{R}^n$ be a compact set with $\mathcal{H}^d(E) > 0$. Then there exists a non-trivial Frostman measure $\mu$ of dimension at least $d$ supported on $E$ (thus $\mu(E) > 0$ and $\mu(\mathbf{R}^n \backslash E) = 0$).
Proof. Without loss of generality we may place the compact set $E$ in the half-open unit cube $[0,1)^n$. It is convenient to work dyadically. For each integer $k \ge 0$, we subdivide $[0,1)^n$ into $2^{kn}$ half-open cubes $Q_{k,1}, \ldots, Q_{k,2^{kn}}$ of sidelength $\ell(Q_{k,i}) = 2^{-k}$ in the usual manner, and refer to such cubes as dyadic cubes. For each $k$ and any $F \subset [0,1)^n$, we can define the dyadic Hausdorff content
$$h^{d,2^{-k}}_{\mathrm{dyad}}(F) := \inf\Big\{ \sum_j \ell(Q_{k_j, i_j})^d : Q_{k_j, i_j} \hbox{ cover } F;\ k_j \ge k \Big\}$$
where the $Q_{k_j, i_j}$ range over all at most countable families of dyadic cubes of sidelength at most $2^{-k}$ that cover $F$. By covering cubes by balls and vice versa, it is not hard to see that
$$c\, h^{d, C 2^{-k}}(F) \le h^{d,2^{-k}}_{\mathrm{dyad}}(F) \le C\, h^{d, c 2^{-k}}(F)$$
for some constants $c, C$ depending only on $d, n$. Thus, if we define the dyadic Hausdorff measure
$$(\mathcal{H}^d_{\mathrm{dyad}})^*(F) := \lim_{k \to \infty} h^{d,2^{-k}}_{\mathrm{dyad}}(F)$$
then we see that the dyadic and non-dyadic Hausdorff measures are comparable:
$$c\, (\mathcal{H}^d)^*(F) \le (\mathcal{H}^d_{\mathrm{dyad}})^*(F) \le C\, (\mathcal{H}^d)^*(F).$$

In particular, the quantity $\sigma := h^{d,1}_{\mathrm{dyad}}(E)$ is strictly positive: if it vanished, then $E$ would have covers by dyadic cubes with $\sum_j \ell(Q_{k_j,i_j})^d$ arbitrarily small, and as all the cubes in such a cover are automatically of small sidelength, the dyadic Hausdorff measure of $E$ would also vanish, contradicting the hypothesis $\mathcal{H}^d(E) > 0$ via the above comparability. For each dyadic cube $Q$ of sidelength $\ell(Q) = 2^{-k}$, define
$$\mu^+(Q) := h^{d,2^{-k}}_{\mathrm{dyad}}(E \cap Q).$$
Then $\mu^+([0,1)^n) \ge \sigma$. By covering $E \cap Q$ by $Q$, we also have the bound
$$\mu^+(Q) \le \ell(Q)^d.$$
Finally, by the subadditivity property of Hausdorff content, if we decompose $Q$ into $2^n$ cubes $Q'$ of sidelength $\ell(Q') = 2^{-k-1}$, we have
$$\mu^+(Q) \le \sum_{Q'} \mu^+(Q').$$

The quantity $\mu^+$ behaves like a measure, but is subadditive rather than additive. Nevertheless, one can easily find another quantity $\mu(Q)$ to assign to each dyadic cube such that
$$\mu([0,1)^n) = \mu^+([0,1)^n)$$
and
$$\mu(Q) \le \mu^+(Q)$$
for all dyadic cubes, and such that
$$\mu(Q) = \sum_{Q'} \mu(Q')$$
whenever a dyadic cube is decomposed into $2^n$ sub-cubes of half the sidelength. Indeed, such a $\mu$ can be constructed by a greedy algorithm starting at the largest cube $[0,1)^n$ and working downward; we omit the details. One can then use this $\mu$ to integrate any continuous compactly supported function on $\mathbf{R}^n$ (by approximating such a function by one which is constant on dyadic cubes of a certain scale), and so by the Riesz representation theorem, it extends to a Radon measure $\mu$ supported on $[0,1]^n$. (One could also have used the Carathéodory extension theorem at this point.) Since $\mu([0,1)^n) \ge \sigma$, $\mu$ is non-trivial; since $\mu(Q) \le \mu^+(Q) \le \ell(Q)^d$ for all dyadic cubes $Q$, it is not hard to see that $\mu$ is a Frostman measure of dimension at least $d$, as desired.
The study of Hausdorff dimension is then intimately tied to the study of the dimensional properties of various measures. We give some examples in the next few exercises.
Exercise 1.15.20. Let $0 < d \le n$, and let $E \subset \mathbf{R}^n$ be a compact set. Show that $\dim_H(E) \ge d$ if and only if, for every $0 < \varepsilon < d$, there exists a Borel probability measure $\mu$ supported on $E$ with
$$\int_{\mathbf{R}^n} \int_{\mathbf{R}^n} \frac{1}{|x-y|^{d-\varepsilon}}\, d\mu(x)\, d\mu(y) < \infty.$$
Show that this condition is also equivalent to $\mu$ lying in the Sobolev space $H^{-(n-d+\varepsilon)/2}(\mathbf{R}^n)$. Thus we see a link here between Hausdorff dimension and Sobolev norms: the lower the dimension of a set, the rougher the measures that it can support, where the Sobolev scale is used to measure roughness.
Exercise 1.15.21. Let $E$ be a compact subset of $\mathbf{R}^n$, and let $\mu$ be a Borel probability measure supported on $E$. Let $0 \le d \le n$.

(i) Suppose that for every $\varepsilon > 0$, every $0 < \delta < 1/10$, and every subset $E'$ of $E$ with $\mu(E') \ge \frac{1}{\log^2(1/\delta)}$, one could establish the bound $\mathcal{N}^*_\delta(E') \ge c_\varepsilon (\frac{1}{\delta})^{d-\varepsilon}$ for $*$ equal to any of ext, int, net, pack (the exact choice of $*$ is irrelevant thanks to Exercise 1.15.4). Show that $E$ has Hausdorff dimension at least $d$. (Hint: cover $E$ by small balls, then round the radius of each ball to the nearest power of $2$. Now use countable additivity and the observation that the sum of $\frac{1}{\log^2(1/\delta)}$ is small when $\delta$ ranges over sufficiently small powers of $2$.)

(ii) Show that one can replace $\mu(E') \ge \frac{1}{\log^2(1/\delta)}$ with $\mu(E') \ge \frac{1}{(\log \log(1/\delta))^2}$ in the previous statement. (Hint: instead of rounding the radius to the nearest power of $2$, round instead to radii of the form $1/2^{2^n}$ for integers $n$.) This trick of using a hyper-dyadic range of scales rather than a dyadic range of scales is due to Bourgain [Bo1999]. The exponent $2$ in the double logarithm can be replaced by any other exponent strictly greater than $1$.
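The convergence facts behind both rounding tricks are easy to compute, assuming the natural interpretation of the scales: for dyadic scales $\delta = 2^{-n}$ one has $\log(1/\delta) = n \log 2$, and for hyperdyadic scales $\delta = 2^{-2^n}$ one has $\log \log(1/\delta) = n \log 2 + \log \log 2$, so both allowance series have summable, eventually-small tails:

```python
import math

ln2 = math.log(2)

def dyadic_tail(start, stop=5000):
    # sum of 1/log^2(1/delta) over delta = 2^{-n}, start <= n < stop
    return sum(1 / (n * ln2) ** 2 for n in range(start, stop))

def hyperdyadic_tail(start, stop=200):
    # sum of 1/(log log(1/delta))^2 over delta = 2^{-2^n}
    return sum(1 / (n * ln2 + math.log(ln2)) ** 2 for n in range(start, stop))

print(dyadic_tail(1), dyadic_tail(100), hyperdyadic_tail(10))
```

The tail over sufficiently small scales (large starting index) is small, which is exactly what the countable-additivity step of the hint needs.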
This should be compared with the task of lower bounding the lower Minkowski dimension, which only requires control on the entropy of $E$ itself, rather than of large subsets $E'$ of $E$.

Exercise 1.15.22 (Coarea formula). Let $\phi: \mathbf{R}^n \to \mathbf{R}$ be a smooth function, and let $g: \mathbf{R}^n \to \mathbf{R}$ be continuous and compactly supported, with $\nabla \phi$ non-vanishing on the support of $g$. Establish the coarea formula
$$(1.130)\qquad \int_{\mathbf{R}^n} |\nabla \phi(x)|\, g(x)\, dx = \omega_{n-1} \int_{\mathbf{R}} \Big( \int_{\phi^{-1}(t)} g(x)\, d\mathcal{H}^{n-1}(x) \Big)\, dt.$$
(Hint: Subdivide the support of $g$ to be small, and then apply a change of variables to make $\phi$ linear, e.g. $\phi(x) = x_1$.) This formula is in fact valid for all absolutely integrable $g$ and Lipschitz $\phi$, but is difficult to prove at this level of generality, requiring a version of Sard's theorem.
The coarea formula (1.130) can be used to link geometric inequalities to analytic ones. For instance, the sharp isoperimetric inequality
$$\mathrm{vol}^n(\Omega)^{\frac{n-1}{n}} \le \frac{\omega_{n-1}}{n\, \omega_n^{1/n}}\, \mathcal{H}^{n-1}(\partial \Omega),$$
valid for bounded open sets $\Omega$ in $\mathbf{R}^n$, can be combined with the coarea formula (with $g := 1$) to give the sharp Sobolev inequality
$$\|\phi\|_{L^{\frac{n}{n-1}}(\mathbf{R}^n)} \le \frac{1}{n\, \omega_n^{1/n}} \int_{\mathbf{R}^n} |\nabla \phi(x)|\, dx$$
for any test function $\phi$, the main point being that $\phi^{-1}(t) \cup \phi^{-1}(-t)$ is the boundary of $\{|\phi| > t\}$ for almost every $t > 0$ (one also needs to do some manipulations relating the volume of those level sets to $\|\phi\|_{L^{n/(n-1)}(\mathbf{R}^n)}$). We omit the details.
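The coarea formula can be sanity-checked in one dimension, where $\mathcal{H}^0$ is counting measure (and, with the usual convention, $\omega_0 = 1$). Taking the arbitrary choices $\phi(x) = x^2$ and $g(x) = x^2$ on $[-1,1]$, the level set $\phi^{-1}(t)$ is $\{\pm\sqrt{t}\}$ for $0 < t < 1$, and both sides evaluate to $1$:

```python
# Numerical check of the coarea formula for n = 1, phi(x) = x^2, g(x) = x^2.
N = 100_000
h = 2.0 / N
lhs = h * sum(2 * abs(x) * x * x          # |phi'(x)| g(x) = 2|x| * x^2
              for i in range(N)
              for x in [-1.0 + (i + 0.5) * h])

k = 1.0 / N
rhs = k * sum(2 * t                       # g(sqrt(t)) + g(-sqrt(t)) = 2t
              for j in range(N)
              for t in [(j + 0.5) * k])
print(lhs, rhs)
```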
Notes. This lecture first appeared at terrytao.wordpress.com/2009/05/19.
Thanks to Vicky for corrections.
Further discussion of Hausdorff dimension can be found in [Fa2003], [Ma1995], [Wo2003], as well as in many other places.

There was some interesting discussion online as to whether there could be an analogue of K-theory for Hausdorff dimension, although the results of the discussion were inconclusive.
Chapter 2
Related articles
2.1. An alternate approach to the Carathéodory extension theorem

In this section, I would like to give an alternate proof of a (weak form of the) Carathéodory extension theorem (Theorem 1.1.17). This argument is restricted to the $\sigma$-finite case, and does not extend the measure to quite as large a $\sigma$-algebra as is provided by the standard proof of this theorem, but I find it conceptually clearer (in particular, hewing quite closely to Littlewood's principles, and the general Lebesgue philosophy of treating sets of small measure as negligible), and it suffices for many standard applications of this theorem, in particular the construction of Lebesgue measure.
Let us first state the precise statement of the theorem:

Theorem 2.1.1 (Weak Carathéodory extension theorem). Let $\mathcal{A}$ be a Boolean algebra of subsets of a set $X$, and let $\mu: \mathcal{A} \to [0, +\infty]$ be a function obeying the following three properties:

(i) $\mu(\emptyset) = 0$.

(ii) (Pre-countable additivity) If $A_1, A_2, \ldots \in \mathcal{A}$ are disjoint and such that $\bigcup_{n=1}^\infty A_n$ also lies in $\mathcal{A}$, then $\mu(\bigcup_{n=1}^\infty A_n) = \sum_{n=1}^\infty \mu(A_n)$.

(iii) ($\sigma$-finiteness) $X$ can be covered by at most countably many sets in $\mathcal{A}$, each of which has finite $\mu$-measure.

Let $\mathcal{B}$ be the $\sigma$-algebra generated by $\mathcal{A}$. Then $\mu$ can be uniquely extended to a countably additive measure on $\mathcal{B}$.
We will refer to sets in $\mathcal{A}$ as elementary sets and sets in $\mathcal{B}$ as measurable sets. A typical example is when $X = [0,1]$ and $\mathcal{A}$ is the collection of all sets that are unions of finitely many intervals; in this case, $\mathcal{B}$ is the collection of Borel-measurable subsets of $[0,1]$.
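This Boolean-algebra/premeasure pair can be modelled directly. The sketch below (a toy of mine, with half-open intervals $[a,b)$ inside $X = [0,1)$ so that complements stay in the algebra, and $\mu$ = total length) exhibits closure under complements and unions and finite additivity on disjoint elementary sets:

```python
# Elementary sets as finite unions of half-open intervals [a,b) in [0,1),
# stored as sorted disjoint lists of (a, b) pairs; mu = total length.

def normalize(ivs):
    """Sort intervals and merge overlapping or touching ones."""
    out = []
    for a, b in sorted(iv for iv in ivs if iv[0] < iv[1]):
        if out and a <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], b))
        else:
            out.append((a, b))
    return out

def complement(ivs):
    """Complement within [0,1); input assumed normalized."""
    out, prev = [], 0.0
    for a, b in ivs:
        if prev < a:
            out.append((prev, a))
        prev = b
    if prev < 1.0:
        out.append((prev, 1.0))
    return out

def union(xs, ys):
    return normalize(xs + ys)

def mu(ivs):
    return sum(b - a for a, b in ivs)

A = normalize([(0.0, 0.25), (0.5, 0.75)])
B = complement(A)
print(A, B, mu(A), mu(B), mu(union(A, B)))
```

Here $A$ and its complement $B$ are disjoint elementary sets with $\mu(A) + \mu(B) = \mu(A \cup B) = 1$, the finite additivity that the theorem's hypotheses upgrade to countable additivity on the Borel sets.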
2.1.1. Some basics. Let us first observe that the hypotheses on the premeasure $\mu$ imply some other basic and useful properties:

From properties (i) and (ii) we see that $\mu$ is finitely additive (thus $\mu(A_1 \cup \ldots \cup A_n) = \mu(A_1) + \ldots + \mu(A_n)$ whenever $A_1, \ldots, A_n$ are disjoint elementary sets).

As particular consequences of finite additivity, we have monotonicity ($\mu(A) \le \mu(B)$ whenever $A \subset B$ are elementary sets) and finite subadditivity ($\mu(A_1 \cup \ldots \cup A_n) \le \mu(A_1) + \ldots + \mu(A_n)$ for all elementary $A_1, \ldots, A_n$, not necessarily disjoint).

We also have pre-countable subadditivity: $\mu(A) \le \sum_{n=1}^\infty \mu(A_n)$ whenever the elementary sets $A_1, A_2, \ldots$ cover the elementary set $A$. To see this, first observe by replacing $A_n$ with $A_n \backslash \bigcup_{i=1}^{n-1} A_i$ and using monotonicity that we may take the $A_n$ to be disjoint; next, by restricting all the $A_n$ to $A$ and using monotonicity we may assume that $A$ is the union of the $A_n$, and now the claim is immediate from pre-countable additivity.
2.1.2. Existence. Let us first verify existence. As is standard in measure-theoretic proofs for $\sigma$-finite spaces, we first handle the finite case (when $\mu(X) < \infty$), and then rely on countable additivity or subadditivity to recover the $\sigma$-finite case.

The basic idea, following Littlewood's principles, is to view the measurable sets as lying in the completion of the elementary sets, or in other words to exploit the fact that measurable sets can be approximated to arbitrarily high accuracy by elementary sets.

Define the outer measure
$$\mu^*(A) := \inf \sum_{n=1}^\infty \mu(A_n),$$
where $A_1, A_2, \ldots$ range over all at most countable collections of elementary sets that cover $A$. It is clear that outer measure is monotone and countably subadditive. Also, since $\mu$ is pre-countably subadditive, we see that $\mu^*$ agrees with $\mu$ on elementary sets. With respect to the pseudometric $d(A,B) := \mu^*(A \triangle B)$, the function $A \mapsto \mu^*(A)$ is Lipschitz continuous. Since this function is finitely additive on elementary sets, we see on taking limits (using subadditivity to control error terms) that it must be finitely additive on measurable sets also. Finally, for uniqueness: by $\sigma$-finiteness, $X$ can be partitioned into countably many elementary sets of finite measure; since any two extensions $\mu_1, \mu_2$ of $\mu$ agree when restricted to each of these sets, the claim then follows by countable additivity. This proves Theorem 2.1.1.
Remark 2.1.2. The uniqueness claim fails when the $\sigma$-finiteness condition is dropped. Consider for instance the rational numbers $X = \mathbf{Q}$, and let the elementary sets be the finite unions of intervals $[a,b) \cap \mathbf{Q}$. Define the measure $\mu(A)$ of an elementary set to be zero if $A$ is empty, and $+\infty$ otherwise. As the rationals are countable, we easily see that every set of rationals is measurable. One easily verifies the pre-countable additivity condition (though the $\sigma$-finiteness condition fails horribly). However, $\mu$ has multiple extensions to the measurable sets; for instance, any positive scalar multiple of counting measure is such an extension.
Remark 2.1.3. It is not difficult to show that the measure completion $\overline{\mathcal{B}}$ of $\mathcal{B}$ with respect to $\mu$ is the same as the topological closure of $\mathcal{B}$ (or of $\mathcal{A}$) with respect to the above pseudometric. Thus, for instance, a subset of $[0,1]$ is Lebesgue measurable if and only if it can be approximated to arbitrary accuracy (with respect to outer measure) by a finite union of intervals.
A particularly simple case of Theorem 2.1.1 occurs when $X$ is a compact Hausdorff totally disconnected space (i.e. a Stone space), such as the infinite discrete cube $\{0,1\}^{\mathbf{N}}$ or any other Cantor space. Then (see forthcoming lecture notes) the Borel $\sigma$-algebra $\mathcal{B}$ is generated by the Boolean algebra $\mathcal{A}$ of clopen sets. Also, as clopen sets here are simultaneously compact and open, we see that any infinite cover of one clopen set by others automatically has a finite subcover. From this, we conclude

Corollary 2.1.4. Let $X$ be a compact Hausdorff totally disconnected space. Then any finitely additive $\sigma$-finite measure on the clopen sets uniquely extends to a countably additive measure on the Borel sets.

By identifying $\{0,1\}^{\mathbf{N}}$ with $[0,1]$ up to a countable set, this provides one means to construct Lebesgue measure on $[0,1]$; similar constructions are available for $\mathbf{R}$ or $\mathbf{R}^n$.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/03.
Thanks to Americo Tavares, JB, Max Menzies, and mmailliw/william for corrections.
2.2. Amenability, the ping-pong lemma, and the Banach-Tarski paradox

Notational convention: In this section (and in Section 2.4) only, I will colour a statement red if it assumes the axiom of choice. (For the rest of this text, the axiom of choice will be implicitly assumed throughout.)

The famous Banach-Tarski paradox asserts that one can take the unit ball in three dimensions, divide it up into finitely many pieces, and then translate and rotate each piece so that their union is now two disjoint unit balls. As a consequence of this paradox, it is not possible to create a finitely additive measure on $\mathbf{R}^3$ that is both translation and rotation invariant, which can measure every subset of $\mathbf{R}^3$, and which gives the unit ball a non-zero measure. This paradox helps explain why Lebesgue measure (which is countably additive and both translation and rotation invariant, and gives the unit ball a non-zero measure) cannot measure every set, instead being restricted to measuring sets that are Lebesgue measurable.

On the other hand, it is not possible to replicate the Banach-Tarski paradox in one or two dimensions; the unit interval in $\mathbf{R}$ or unit disk in $\mathbf{R}^2$ cannot be rearranged into two unit intervals or two unit disks using only finitely many pieces, translations, and rotations, and indeed there do exist non-trivial finitely additive measures on these spaces. However, it is possible to obtain a Banach-Tarski type paradox in one or two dimensions using countably many such pieces; this rules out the possibility of extending Lebesgue measure to a countably additive translation invariant measure on all subsets of $\mathbf{R}$ (or any higher-dimensional space).

In this section we will establish all of the above results, and tie them in with some important concepts and tools in modern group theory, most notably amenability and the ping-pong lemma.

2.2.1. One-dimensional equidecomposability. Before we study the three-dimensional situation, let us first review the simpler one-dimensional situation. To avoid having to say "X can be cut up into finitely many pieces, which can then be moved around to create Y" all the time, let us make a convenient definition:
Definition 2.2.1 (Equidecomposability). Let $G = (G, \cdot)$ be a group acting on a space $X$, and let $A, B$ be subsets of $X$.

(i) We say that $A, B$ are finitely $G$-equidecomposable if there exist finite partitions $A = \bigcup_{i=1}^n A_i$ and $B = \bigcup_{i=1}^n B_i$ and group elements $g_1, \ldots, g_n \in G$ such that $B_i = g_i A_i$ for all $1 \le i \le n$.

(ii) We say that $A, B$ are countably $G$-equidecomposable if there exist countable partitions $A = \bigcup_{i=1}^\infty A_i$ and $B = \bigcup_{i=1}^\infty B_i$ and group elements $g_1, g_2, \ldots \in G$ such that $B_i = g_i A_i$ for all $i$.

(iii) We say that $A$ is finitely $G$-paradoxical if it can be partitioned into two subsets, each of which is finitely $G$-equidecomposable with $A$.

(iv) We say that $A$ is countably $G$-paradoxical if it can be partitioned into two subsets, each of which is countably $G$-equidecomposable with $A$.

One can of course make similar definitions when $G = (G, +)$ is an additive group rather than a multiplicative one.
Clearly, finite $G$-equidecomposability implies countable $G$-equidecomposability, but the converse is not true. Observe that any finitely (resp. countably) additive and $G$-invariant measure on $X$ that measures every single subset of $X$ must give either a zero measure or an infinite measure to a finitely (resp. countably) $G$-paradoxical set. Thus, paradoxical sets provide significant obstructions to constructing additive measures that can measure all sets.
Example 2.2.2. If $\mathbf{R}$ acts on itself by translation, then $[0,2]$ is finitely $\mathbf{R}$-equidecomposable with $[10,11) \cup [21,22]$, and $\mathbf{R}$ is finitely $\mathbf{R}$-equidecomposable with $(-\infty, -10] \cup (10, +\infty)$.
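The first equidecomposition can be checked pointwise: split $[0,2]$ into the pieces $[0,1)$ and $[1,2]$ and translate them by $+10$ and $+20$ respectively. A quick sketch on a fine sample (illustration only; a sample cannot replace the set-theoretic argument):

```python
# Each x in [0,2] is moved by +10 if x < 1 and by +20 if x >= 1; the
# images should be distinct and land in [10,11) ∪ [21,22].

def move(x):
    return x + 10 if x < 1 else x + 20

sample = [k / 1000 for k in range(2001)]   # 0, 0.001, ..., 2.0
images = [move(x) for x in sample]
in_target = all(10 <= y < 11 or 21 <= y <= 22 for y in images)
injective = len(set(images)) == len(images)
print(in_target, injective)
```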
Example 2.2.3. If $G$ acts transitively on $X$, then any two finite subsets of $X$ are finitely $G$-equidecomposable if and only if they have the same cardinality, and any two countably infinite subsets of $X$ are countably $G$-equidecomposable. In particular, any countably infinite subset of $X$ is countably $G$-paradoxical.
Exercise 2.2.1. Show that finite $G$-equidecomposability and countable $G$-equidecomposability are both equivalence relations.

Exercise 2.2.2 (Banach-Schröder-Bernstein theorem). Let $G$ act on $X$, and let $A, B$ be subsets of $X$.

(i) If $A$ is finitely $G$-equidecomposable with a subset of $B$, and $B$ is finitely $G$-equidecomposable with a subset of $A$, show that $A$ and $B$ are finitely $G$-equidecomposable with each other. (Hint: adapt the proof of the Schröder-Bernstein theorem, see Section 3.13.)

(ii) If $A$ is finitely $G$-equidecomposable with a superset of $B$, and $B$ is finitely $G$-equidecomposable with a superset of $A$, show that $A$ and $B$ are finitely $G$-equidecomposable with each other. (Hint: use part (i).)

Show that claims (i) and (ii) also hold when "finitely" is replaced by "countably".

Exercise 2.2.3. Show that if $G$ acts on $X$, $A$ is a subset of $X$ which is finitely (resp. countably) $G$-paradoxical, and $x \in X$, then the recurrence set $\{g \in G : gx \in A\}$ is also finitely (resp. countably) $G$-paradoxical (where $G$ acts on itself by translation).
Let us first establish countable equidecomposability paradoxes in the reals.

Proposition 2.2.4. Let $\mathbf{R}$ act on itself by translations. Then $[0, 1]$ and $\mathbf{R}$ are countably $\mathbf{R}$-equidecomposable.

Proof. By Exercise 2.2.2, it will suffice to show that some set contained in $[0, 1]$ is countably $\mathbf{R}$-equidecomposable with $\mathbf{R}$. Consider the space $\mathbf{R}/\mathbf{Q}$ of all cosets $x + \mathbf{Q}$ of the rationals. By the axiom of choice, we can express each such coset as $x + \mathbf{Q}$ for some $x \in [0, 1/2]$, thus we can partition $\mathbf{R} = \bigcup_{x \in E} x + \mathbf{Q}$ for some $E \subset [0, 1/2]$. By Example 2.2.3, $\mathbf{Q} \cap [0, 1/2]$ is countably $\mathbf{Q}$-equidecomposable with $\mathbf{Q}$, which implies that $\bigcup_{x \in E} x + (\mathbf{Q} \cap [0, 1/2])$ is countably $\mathbf{R}$-equidecomposable with $\bigcup_{x \in E} x + \mathbf{Q}$. Since the latter set is $\mathbf{R}$ and the former set is contained in $[0, 1]$, the claim follows.
2.2. The Banach-Tarski paradox 321
Of course, the same proposition holds if $[0, 1]$ is replaced by any other interval. As a quick consequence of this proposition and Exercise 2.2.2, we see that any subset of $\mathbf{R}$ containing an interval is countably $\mathbf{R}$-equidecomposable with $\mathbf{R}$. In particular, we have

Corollary 2.2.5. Any subset of $\mathbf{R}$ containing an interval is countably $\mathbf{R}$-paradoxical.

In particular, we see that any countably additive translation-invariant measure that measures every subset of $\mathbf{R}$ must assign a zero or infinite measure to any set containing an interval. In particular, it is not possible to extend Lebesgue measure to measure all subsets of $\mathbf{R}$.
We now turn from countably paradoxical sets to finitely paradoxical sets. Here, the situation is quite different: we can rule out many sets from being finitely paradoxical. The simplest example is that of a finite set:

Proposition 2.2.6. If G acts on X, and A is a non-empty finite subset of X, then A is not finitely (or countably) G-paradoxical.

Proof. One easily sees that any two sets that are finitely or countably G-equidecomposable must have the same cardinality. The claim follows.
Now we consider the integers.
Proposition 2.2.7. Let the integers $\mathbf{Z}$ act on themselves by translation. Then $\mathbf{Z}$ is not finitely $\mathbf{Z}$-paradoxical.

Proof. The integers are of course infinite, and so Proposition 2.2.6 does not apply directly. However, the key point is that the integers can be efficiently truncated to be finite, and so we will be able to adapt the argument used to prove Proposition 2.2.6 to this setting.

Let's see how. Suppose for contradiction that we could partition $\mathbf{Z}$ into two sets A and B, which are in turn partitioned into finitely many pieces $A = \bigcup_{i=1}^n A_i$ and $B = \bigcup_{j=1}^m B_j$, such that $\mathbf{Z}$ can be partitioned as $\mathbf{Z} = \bigcup_{i=1}^n A_i + a_i$ and $\mathbf{Z} = \bigcup_{j=1}^m B_j + b_j$ for some integers $a_1, \ldots, a_n, b_1, \ldots, b_m$.
Now let N be a large integer (much larger than $n, m, a_1, \ldots, a_n, b_1, \ldots, b_m$). We truncate $\mathbf{Z}$ to the interval $[-N, N] := \{-N, \ldots, N\}$. Clearly

(2.1) $A \cap [-N, N] = \bigcup_{i=1}^n A_i \cap [-N, N]$

and

(2.2) $[-N, N] = \bigcup_{i=1}^n (A_i + a_i) \cap [-N, N]$.

From (2.2) we see that the set $\bigcup_{i=1}^n (A_i \cap [-N, N]) + a_i$ differs from $[-N, N]$ by only $O(1)$ elements, where the bound in the $O(1)$ expression can depend on $n, a_1, \ldots, a_n$ but does not depend on N. (The point here is that $[-N, N]$ is almost translation-invariant in some sense.) Comparing this with (2.1) we see that

(2.3) $|[-N, N]| \leq |A \cap [-N, N]| + O(1)$.

Similarly with A replaced by B. Summing, we obtain

(2.4) $2|[-N, N]| \leq |[-N, N]| + O(1)$,

but this is absurd for N sufficiently large, and the claim follows.
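The almost translation-invariance of the truncated window that drives this argument is easy to see concretely. The following sketch is our own illustration (the function names are ours): shifting $[-N, N]$ by a changes it by exactly $2|a|$ elements, a bound independent of N.

```python
# The window [-N, N] is almost invariant under translation: shifting by a
# changes it by 2|a| elements, independently of how large N is.
def window(N):
    return set(range(-N, N + 1))

def shift_defect(N, a):
    # Size of the symmetric difference between the shifted and original window.
    shifted = {x + a for x in window(N)}
    return len(shifted ^ window(N))

for N in (100, 1000, 10000):
    print(N, shift_defect(N, 7))  # always 14, regardless of N
```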
Exercise 2.2.4. Use the above argument to show that in fact no infinite subset of $\mathbf{Z}$ is finitely $\mathbf{Z}$-paradoxical; combining this with Example 2.2.3, we see that the only finitely $\mathbf{Z}$-paradoxical set of integers is the empty set.
The above argument can be generalised to an important class of groups:

Definition 2.2.8 (Amenability). Let $G = (G, \cdot)$ be a discrete, at most countable, group. A Følner sequence is a sequence $F_1, F_2, F_3, \ldots$ of finite subsets of G with $\bigcup_{N=1}^\infty F_N = G$, with the property that $\lim_{N \to \infty} \frac{|gF_N \triangle F_N|}{|F_N|} = 0$ for all $g \in G$, where $\triangle$ denotes symmetric difference. A discrete, at most countable, group G is amenable if it contains at least one Følner sequence. Of course, one can define the same concept for additive groups $G = (G, +)$.
Remark 2.2.9. One can define amenability for uncountable groups by replacing the notion of a Følner sequence with a Følner net. Similarly, one can define amenability for locally compact Hausdorff groups equipped with a Haar measure by using that measure in place of cardinality in the above definition. However, we will not need these more general notions of amenability here. The notion of amenability was first introduced (though not by this name, or by this definition) by von Neumann, precisely in order to study these sorts of decomposition paradoxes. We discuss amenability further in Section 2.8.
Example 2.2.10. The sequence $[-N, N]$ for $N = 1, 2, 3, \ldots$ is a Følner sequence for the integers $\mathbf{Z}$, which are hence an amenable group.

Exercise 2.2.5. Show that any abelian discrete group that is at most countable, is amenable.
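The Følner property of Example 2.2.10 can be checked numerically; the sketch below (our own illustration) computes the ratio $|(g + F_N) \triangle F_N| / |F_N|$ for $F_N = [-N, N]$ in $(\mathbf{Z}, +)$ and watches it tend to 0.

```python
# Følner property of F_N = [-N, N] in (Z, +):
# |(g + F_N) symmetric-difference F_N| / |F_N| -> 0 as N -> infinity.
def folner_ratio(N, g):
    F = set(range(-N, N + 1))
    shifted = {g + x for x in F}
    return len(shifted ^ F) / len(F)

ratios = [folner_ratio(N, 5) for N in (10, 100, 1000)]
print(ratios)  # decreasing towards 0
```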
Exercise 2.2.6. Show that any amenable discrete group G that is at most countable is not finitely G-paradoxical, when acting on itself. Combined with Exercise 2.2.3, we see that if such a group G acts on a non-empty space X, then X is not finitely G-paradoxical.
Remark 2.2.11. Exercise 2.2.6 suggests that an amenable group G should be able to support a non-trivial finitely additive measure which is invariant under left-translations, and can measure all subsets of G. Indeed, one can even create a finitely additive probability measure, for instance by selecting a non-principal ultrafilter $p \in \beta\mathbf{N} \setminus \mathbf{N}$ and a Følner sequence $(F_n)_{n=1}^\infty$ and defining $\mu(A) := \lim_{n \to p} |A \cap F_n| / |F_n|$ for all $A \subset G$.
The reals $\mathbf{R} = (\mathbf{R}, +)$ (which we will give the discrete topology!) are uncountable, and thus not amenable by the narrow definition of Definition 2.2.8. However, observe from Exercise 2.2.5 that any finitely generated subgroup of the reals is amenable (or equivalently, that the reals themselves with the discrete topology are amenable, using the Følner net generalisation of Definition 2.2.8). Also, we have the following easy observation:

Exercise 2.2.7. Let G act on X, and let A be a subset of X which is finitely G-paradoxical. Show that there exists a finitely generated subgroup H of G such that A is finitely H-paradoxical.

From this, we see that $\mathbf{R}$ is not finitely $\mathbf{R}$-paradoxical. But we can in fact say much more:
Proposition 2.2.12. Let A be a non-empty subset of $\mathbf{R}$. Then A is not finitely $\mathbf{R}$-paradoxical.

Proof. Suppose for contradiction that we can partition A into two sets $A = A_1 \cup A_2$ which are both finitely $\mathbf{R}$-equidecomposable with A. This gives us two maps $f_1 : A \to A_1$, $f_2 : A \to A_2$ which are piecewise given by a finite number of translations; thus there exists a finite set $g_1, \ldots, g_d \in \mathbf{R}$ such that $f_i(x) \in x + \{g_1, \ldots, g_d\}$ for all $x \in A$ and $i = 1, 2$.

For any integer $N \geq 1$, consider the $2^N$ composition maps $f_{i_1} \circ \ldots \circ f_{i_N} : A \to A$ for $i_1, \ldots, i_N \in \{1, 2\}$. From the disjointness of $A_1, A_2$ and an easy induction we see that the ranges of all these maps are disjoint, and so for any $x \in A$ the $2^N$ quantities $f_{i_1} \circ \ldots \circ f_{i_N}(x)$ are distinct. On the other hand, we have

(2.5) $f_{i_1} \circ \ldots \circ f_{i_N}(x) \in x + \{g_1, \ldots, g_d\} + \ldots + \{g_1, \ldots, g_d\}$.

Simple combinatorics (relying primarily on the abelian nature of $(\mathbf{R}, +)$) shows that the number of values on the right-hand side of (2.5) is at most $N^d$. But for sufficiently large N, we have $2^N > N^d$, giving the desired contradiction.
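The exponential-versus-polynomial counting at the heart of this proof can be illustrated numerically. The sketch below is our own: the translation amounts are arbitrary sample values, and because $(\mathbf{R}, +)$ is abelian only the multiset of chosen shifts matters, so the number of distinct N-fold sums grows polynomially while $2^N$ grows exponentially.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

# A finite set of d translation amounts (arbitrary sample values, d = 3).
g = [Fraction(1), Fraction(1, 3), Fraction(7, 2)]

def num_sums(N):
    # Number of distinct sums of N elements (with repetition) from g.
    # Order is irrelevant because (R, +) is abelian, so this grows only
    # polynomially in N (at most binomial(N + d - 1, d - 1)).
    return len({sum(c) for c in combinations_with_replacement(g, N)})

for N in (4, 8, 12):
    print(N, 2 ** N, num_sums(N))  # 2^N overtakes the number of sums
```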
Let us call a group G supramenable if every non-empty subset of G is not finitely G-paradoxical; thus $\mathbf{R}$ is supramenable. From Exercise 2.2.3 we see that if a supramenable group G acts on any space X, then the only finitely G-paradoxical subset of X is the empty set.
Exercise 2.2.8. We say that a group $G = (G, \cdot)$ has subexponential growth if for any finite subset S of G, we have $\lim_{n \to \infty} |S^n|^{1/n} = 1$, where $S^n = S \cdot \ldots \cdot S$ is the set of n-fold products of elements of S. Show that every group of subexponential growth is supramenable.

Exercise 2.2.9. Show that every abelian group has subexponential growth (and is thus supramenable). More generally, show that every nilpotent group has subexponential growth and is thus also supramenable.
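The contrast between subexponential and exponential growth can be made concrete. The following sketch is our own illustration: in the abelian group $\mathbf{Z}^2$ the product sets $S^n$ grow quadratically, whereas in the free semigroup on two letters the number of words of length n is exactly $2^n$.

```python
# Growth comparison (illustrative): Z^2 has polynomial growth, while the
# free semigroup on two letters has exponential growth.
def product_set_sizes_Z2(n_max):
    # S = {(1,0), (0,1), (-1,0), (0,-1)}; S^n is the set of n-fold sums.
    S = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    current = {(0, 0)}
    sizes = []
    for _ in range(n_max):
        current = {(x + dx, y + dy) for (x, y) in current for (dx, dy) in S}
        sizes.append(len(current))
    return sizes

def free_words_count(n):
    # In the free semigroup on {a, b}, all 2^n words of length n are distinct.
    return 2 ** n

print(product_set_sizes_Z2(6))                    # [4, 9, 16, 25, 36, 49]
print([free_words_count(n) for n in range(1, 7)])  # [2, 4, 8, 16, 32, 64]
```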
Exercise 2.2.10. Show that if two finite unions of intervals in $\mathbf{R}$ are finitely $\mathbf{R}$-equidecomposable, then they must have the same total length. (Hint: reduce to the case when both sets consist of a single interval. First show that the lengths of these intervals cannot differ by more than a factor of two, and then amplify this fact by iteration to conclude the result.)
Remark 2.2.13. We already saw that amenable groups G admit finitely additive translation-invariant probability measures that measure all subsets of G (Remark 2.2.11 can be extended to the uncountable case); in fact, this turns out to be an equivalent definition of amenability. It turns out that supramenable groups G enjoy a stronger property, namely that given any non-empty set A in G, there exists a finitely additive translation-invariant measure on G that assigns the measure 1 to A; this is basically a deep result of Tarski.
2.2.2. Two-dimensional equidecomposability. Now we turn to equidecomposability on the plane $\mathbf{R}^2$. The nature of equidecomposability depends on what group G of symmetries we wish to act on the plane.

Suppose first that we only allow ourselves to translate various sets in the plane, but not to rotate them; thus $G = \mathbf{R}^2$. As this group is abelian, it is supramenable by Exercise 2.2.9, and so any non-empty subset A of the plane will not be finitely $\mathbf{R}^2$-paradoxical; indeed, by Remark 2.2.13, there exists a finitely additive translation-invariant measure that gives A the measure 1. On the other hand, it is easy to adapt Corollary 2.2.5 to see that any subset of the plane containing a ball will be countably $\mathbf{R}^2$-paradoxical.
Now suppose we allow both translations and rotations, thus G is now the group $SO(2) \ltimes \mathbf{R}^2$ of (orientation-preserving) isometries $x \mapsto e^{i\theta} x + v$ for $v \in \mathbf{R}^2$ and $\theta \in \mathbf{R}/2\pi\mathbf{Z}$, where $e^{i\theta}$ denotes the anticlockwise rotation by $\theta$ around the origin. This group is no longer abelian, or even nilpotent, so Exercise 2.2.9 no longer applies. Indeed, it turns out that G is no longer supramenable. This is a consequence of the following three lemmas:

Lemma 2.2.14. Let G be a group which contains a free semigroup on two generators (in other words, there exist group elements $g, h \in G$ such that all the words involving g and h (but not $g^{-1}$ or $h^{-1}$) are distinct). Then G contains a non-empty finitely G-paradoxical set. In other words, G is not supramenable.
Proof. Let S be the semigroup generated by g and h (i.e. the set of all words formed by g and h, including the empty word (i.e. the group identity)). Observe that gS and hS are disjoint subsets of S that are clearly G-equidecomposable with S. The claim then follows from Exercise 2.2.2.

Lemma 2.2.15 (Semigroup ping-pong lemma). Let G act on a space X, let g, h be elements of G, and suppose that there exists a non-empty subset A of X such that gA and hA are disjoint subsets of A. Then g, h generate a free semigroup.
Proof. As in the proof of Proposition 2.2.12, we see from induction that for two different words w, w′

let $(A_n)_{n=1}^\infty$ and $(B_n)_{n=1}^\infty$ be Følner sequences for H and K respectively. Let $f : \mathbf{N} \to \mathbf{N}$ be a rapidly growing function, and let $(F_n)_{n=1}^\infty$ be the sequence $F_n := \bigcup_{x \in B_n} \phi(x) A_{f(n)}$. One easily verifies that this is a Følner sequence for G if f is sufficiently rapidly growing.
Exercise 2.2.11. Show that any finitely generated solvable group is amenable. More generally, show that any discrete, at most countable, solvable group is amenable.

Exercise 2.2.12. Show that any finitely generated subgroup of $SO(2) \ltimes \mathbf{R}^2$ is amenable. (Hint: use the short exact sequence $0 \to \mathbf{R}^2 \to SO(2) \ltimes \mathbf{R}^2 \to SO(2) \to 0$, which shows that $SO(2) \ltimes \mathbf{R}^2$ is solvable (in fact it is metabelian).) Conclude that $\mathbf{R}^2$ is not finitely $SO(2) \ltimes \mathbf{R}^2$-paradoxical.
Finally, we show a result of Banach.

Proposition 2.2.19. The unit disk D in $\mathbf{R}^2$ is not finitely $SO(2) \ltimes \mathbf{R}^2$-paradoxical.

Proof. If the claim failed, then D would be finitely $SO(2) \ltimes \mathbf{R}^2$-equidecomposable with a disjoint union of two copies of D, say D and D + v for some vector v of length greater than 2. By Exercise 2.2.7, we can then find a subgroup G of $SO(2) \ltimes \mathbf{R}^2$ generated by a finite number of rotations $x \mapsto e^{i\theta_j} x$ for $j = 1, \ldots, J$ and translations $x \mapsto x + v_k$ for $k = 1, \ldots, K$ such that D and $D \cup (D + v)$ are finitely G-equidecomposable. Indeed, we may assume that the rigid motions that move pieces of D to pieces of $D \cup (D + v)$ are of the form $x \mapsto e^{i\theta_j} x + v_k$ for some $1 \leq j \leq J$, $1 \leq k \leq K$, thus

(2.6) $D \cup (D + v) = \bigcup_{j=1}^J \bigcup_{k=1}^K e^{i\theta_j} D_{j,k} + v_k$

for some partition $D = \bigcup_{j=1}^J \bigcup_{k=1}^K D_{j,k}$ of the disk.
By amenability of the rotation group SO(2), one can find a finite set $\Phi \subset SO(2)$ of rotations such that $e^{i\theta_j}\Phi$ differs from $\Phi$ by at most $0.01|\Phi|$ elements for all $1 \leq j \leq J$. Let N be a large integer, and let $\Sigma_N \subset \mathbf{R}^2$ be the set of all linear combinations of $e^{i\phi} v_k$ for $\phi \in \Phi$ and $1 \leq k \leq K$ with coefficients in $\{-N, \ldots, N\}$. Observe that $\Sigma_N$ is a finite set whose cardinality grows at most polynomially in N. Thus, by the pigeonhole principle, one can find arbitrarily large N such that

(2.7) $|D \cap \Sigma_{N+10}| \leq 1.01 |D \cap \Sigma_N|$.

On the other hand, from (2.6) and the rotation-invariance of the disk we have

(2.8) $2|D \cap \Sigma_N| = 2|e^{i\phi}(D) \cap \Sigma_N| \leq |e^{i\phi}(D \cup (D + v)) \cap \Sigma_{N+5}| \leq \sum_{j=1}^J \sum_{k=1}^K |e^{i(\phi + \theta_j)} D_{j,k} \cap \Sigma_{N+10}|$

for all $\phi \in \Phi$. Averaging this over all $\phi \in \Phi$ we conclude

(2.9) $2|D \cap \Sigma_N| \leq 1.01 |D \cap \Sigma_{N+10}|$,

contradicting (2.7).
Remark 2.2.20. Banach in fact showed the slightly stronger statement that any two finite unions of polygons of differing area are not finitely $SO(2) \ltimes \mathbf{R}^2$-equidecomposable. (The converse is also true, and is known as the Bolyai-Gerwien theorem.)

Exercise 2.2.13. Show that all the claims in this section continue to hold if we replace $SO(2) \ltimes \mathbf{R}^2$ by the slightly larger group $\mathrm{Isom}(\mathbf{R}^2) = O(2) \ltimes \mathbf{R}^2$ of isometries (not necessarily orientation-preserving).

Remark 2.2.21. As a consequence of Remark 2.2.20, the unit square is not $SO(2) \ltimes \mathbf{R}^2$-paradoxical. However, it is $SL(2) \ltimes \mathbf{R}^2$-paradoxical; this is known as the von Neumann paradox.
2.2.3. Three-dimensional equidecomposability. We now turn to the three-dimensional setting. The new feature here is that the group $SO(3) \ltimes \mathbf{R}^3$ of rigid motions is no longer abelian (as in one dimension) or solvable (as in two dimensions), but now contains a free group on two generators (not just a free semigroup, as per Lemma 2.2.16). The significance of this fact comes from
Lemma 2.2.22. The free group $F_2$ on two generators is finitely $F_2$-paradoxical.

Proof. Let a, b be the two generators of $F_2$. We can partition $F_2 = \{1\} \cup W_a \cup W_b \cup W_{a^{-1}} \cup W_{b^{-1}}$, where $W_c$ is the collection of reduced words of $F_2$ that begin with c. From the identities

(2.10) $W_{a^{-1}} = a^{-1}(F_2 \setminus W_a); \quad W_{b^{-1}} = b^{-1}(F_2 \setminus W_b)$

we see that $F_2$ is finitely $F_2$-equidecomposable with both $W_a \cup W_{a^{-1}}$ and $W_b \cup W_{b^{-1}}$, and the claim now follows from Exercise 2.2.2.
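The identities (2.10) can be verified mechanically on words of bounded length. The sketch below is our own illustration: reduced words over the alphabet a, A, b, B (with A standing for $a^{-1}$, B for $b^{-1}$) are enumerated up to a length cutoff, and $a^{-1}(F_2 \setminus W_a)$ is compared with $W_{a^{-1}}$.

```python
from itertools import product

LETTERS = "aAbB"                      # A = a^{-1}, B = b^{-1}
INV = {"a": "A", "A": "a", "b": "B", "B": "b"}

def reduced_words(max_len):
    # All reduced words of length <= max_len (no adjacent inverse pair).
    words = {""}
    for n in range(1, max_len + 1):
        for w in product(LETTERS, repeat=n):
            s = "".join(w)
            if all(s[i] != INV[s[i + 1]] for i in range(n - 1)):
                words.add(s)
    return words

def mult(g, w):
    # Left-multiply the reduced word w by a single generator g, reducing.
    if w and w[0] == INV[g]:
        return w[1:]
    return g + w

L = 5
F = reduced_words(L)
W_a = {w for w in F if w.startswith("a")}
# W_{a^{-1}} up to length L+1, versus a^{-1}(F_2 \ W_a):
lhs = {w for w in reduced_words(L + 1) if w.startswith("A")}
rhs = {mult("A", w) for w in F - W_a}
print(lhs == rhs)  # True: this is the first identity of (2.10)
```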
Corollary 2.2.23. Suppose that $F_2$ acts freely on a space X (i.e. $gx \neq x$ whenever $x \in X$ and $g \in F_2$ is not the identity). Then X is finitely $F_2$-paradoxical.

Proof. Using the axiom of choice, we can partition X as $X = \bigcup_{x \in \Omega} F_2 x$ for some subset $\Omega$ of X. The claim now follows from Lemma 2.2.22.
Next, we embed the free group inside the rotation group SO(3) using the following useful lemma (cf. Lemma 2.2.15).

Exercise 2.2.14 (Ping-pong lemma). Let G be a group acting on a set X. Suppose that there exist disjoint subsets $A_+, A_-, B_+, B_-$ of X, whose union is not all of X, and elements $a, b \in G$, such that$^2$

(2.11) $a(X \setminus A_-) \subset A_+; \quad a^{-1}(X \setminus A_+) \subset A_-; \quad b(X \setminus B_-) \subset B_+; \quad b^{-1}(X \setminus B_+) \subset B_-$.

Show that a, b generate a free group.
Proposition 2.2.24. SO(3) contains a copy of the free group on two generators.

Proof. It suffices to find a space X that two elements of SO(3) act on in a way that Exercise 2.2.14 applies. There are many such constructions. One such construction$^3$ is based on passing from the reals to the 5-adics, where $-1$ has a square root, and so SO(3) becomes isomorphic to PSL(2). At the end of the day, one takes

(2.12) $a = \begin{pmatrix} 3/5 & -4/5 & 0 \\ 4/5 & 3/5 & 0 \\ 0 & 0 & 1 \end{pmatrix}; \quad b = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 3/5 & -4/5 \\ 0 & 4/5 & 3/5 \end{pmatrix}$

and

(2.13)
$A_\pm := 5^{\mathbf{Z}} \cdot \left\{ \begin{pmatrix} x \\ y \\ z \end{pmatrix} : x, y, z \in \mathbf{Z}, x = \pm 3y \bmod 5, z = 0 \bmod 5 \right\}$

$B_\pm := 5^{\mathbf{Z}} \cdot \left\{ \begin{pmatrix} x \\ y \\ z \end{pmatrix} : x, y, z \in \mathbf{Z}, z = \pm 3y \bmod 5, x = 0 \bmod 5 \right\}$

$X := A_- \cup A_+ \cup B_- \cup B_+ \cup \left\{ \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \right\}$,

where $5^{\mathbf{Z}}$ denotes the integer powers of 5 (which act on column vectors in the obvious manner). The verification of the ping-pong inclusions (2.11) is a routine application of modular arithmetic.

$^2$If drawn correctly, a diagram of the inclusions in (2.11) resembles a game of doubles ping-pong of $A_+, A_-$ versus $B_+, B_-$, hence the name.

$^3$See http://sbseminar.wordpress.com/2007/09/17/ for more details and motivation for this construction.
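Basic sanity checks on the matrices of (2.12) can be done with exact rational arithmetic. The sketch below is our own illustration (the sign convention on the 4/5 entries fixes a rotation direction): it confirms that a and b are genuine rotations, and that all words of length at most 3 in a, b give distinct matrices, consistent with (though of course far from a proof of) freeness.

```python
from fractions import Fraction as Fr
from itertools import product

# The rotations from (2.12), in exact rational arithmetic.
a = [[Fr(3, 5), Fr(-4, 5), Fr(0)],
     [Fr(4, 5), Fr(3, 5), Fr(0)],
     [Fr(0), Fr(0), Fr(1)]]
b = [[Fr(1), Fr(0), Fr(0)],
     [Fr(0), Fr(3, 5), Fr(-4, 5)],
     [Fr(0), Fr(4, 5), Fr(3, 5)]]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]

def det(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

I3 = [[Fr(int(i == j)) for j in range(3)] for i in range(3)]

# a and b are rotations: orthogonal with determinant 1.
print(matmul(a, transpose(a)) == I3, det(a) == 1)
print(matmul(b, transpose(b)) == I3, det(b) == 1)

# All words of length <= 3 in {a, b} give distinct matrices.
mats = []
for n in range(4):
    for w in product((0, 1), repeat=n):
        M = I3
        for c in w:
            M = matmul(M, a if c == 0 else b)
        mats.append(tuple(tuple(row) for row in M))
print(len(mats) == len(set(mats)))  # no collisions among these 15 words
```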
Remark 2.2.25. This is a special case of the Tits alternative.

Corollary 2.2.26 (Hausdorff paradox). There exists a countable subset E of the sphere $S^2$ such that $S^2 \setminus E$ is finitely SO(3)-paradoxical, where SO(3) of course acts on $S^2$ by rotations.

Proof. Let $F_2 \subset SO(3)$ be a copy of the free group on two generators, as given by Proposition 2.2.24. Each non-identity rotation in $F_2$ fixes exactly two points on the sphere. Let E be the union of all these points; this is countable since $F_2$ is countable. The action of $F_2$ on $S^2 \setminus E$ is free, and the claim now follows from Corollary 2.2.23.

Corollary 2.2.27 (Banach-Tarski paradox on the sphere). $S^2$ is finitely SO(3)-paradoxical.

Proof. (Sketch) Iterating the Hausdorff paradox, we see that $S^2 \setminus E$ is finitely SO(3)-equidecomposable with four copies of $S^2 \setminus E$, which can easily be used to cover two copies of $S^2$ (with some room to spare), by randomly rotating each of the copies. The claim now follows from Exercise 2.2.2.
2.3. Stone and Loomis-Sikorski 331
Exercise 2.2.15 (Banach-Tarski paradox on $\mathbf{R}^3$). Show that the unit ball in $\mathbf{R}^3$ is finitely $SO(3) \ltimes \mathbf{R}^3$-paradoxical.

Exercise 2.2.16. Extend these three-dimensional paradoxes to higher dimensions.

Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/08. Thanks to Harald Helfgott for corrections.
2.3. The Stone and Loomis-Sikorski representation theorems

A (concrete) Boolean algebra is a pair $(X, \mathcal{B})$, where X is a set, and $\mathcal{B}$ is a collection of subsets of X which contains the empty set $\emptyset$, and which is closed under unions $A, B \mapsto A \cup B$, intersections $A, B \mapsto A \cap B$, and complements $A \mapsto A^c := X \setminus A$. The subset relation $\subset$ also gives a relation on $\mathcal{B}$. Because $\mathcal{B}$ is concretely represented as subsets of a space X, these relations automatically obey various axioms; in particular, for any $A, B, C \in \mathcal{B}$,

(i) $\subset$ is a partial ordering on $\mathcal{B}$, and A and B have join $A \cup B$ and meet $A \cap B$.

(ii) We have the distributive laws $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$ and $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.

(iii) $\emptyset$ is the minimal element of the partial ordering $\subset$, and $\emptyset^c$ is the maximal element.

(iv) $A \cap A^c = \emptyset$ and $A \cup A^c = \emptyset^c$.

(More succinctly: $\mathcal{B}$ is a lattice which is distributive, bounded, and complemented.)

We can then define an abstract Boolean algebra $\mathcal{B} = (\mathcal{B}, \emptyset, \cdot^c, \cup, \cap, \subset)$ to be an abstract set $\mathcal{B}$ with the specified objects, operations, and relations that obey the axioms (i)-(iv). Of course, some of these operations are redundant; for instance, intersection can be defined in terms of complement and union by de Morgan's laws. In the literature, different authors select different initial operations and axioms when defining an abstract Boolean algebra, but they are all easily seen to be equivalent to each other. To emphasise the abstract nature of these algebras, the symbols $\emptyset, \cdot^c, \cup, \cap, \subset$ are often replaced with other symbols such as $0, \overline{\cdot}, \vee, \wedge, <$.
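For a small concrete Boolean algebra the axioms can be verified exhaustively. The sketch below (our own illustration) takes the power set of a three-point set and checks the distributive and complementation laws (ii) and (iv) over all elements.

```python
from itertools import chain, combinations

# Concrete Boolean algebra: the power set of X = {0, 1, 2}.
X = frozenset({0, 1, 2})
B = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(X), r) for r in range(len(X) + 1))]

def complement(A):
    return X - A

# Axiom (ii): both distributive laws, over all triples.
distributive = all(A | (C & D) == (A | C) & (A | D) and
                   A & (C | D) == (A & C) | (A & D)
                   for A in B for C in B for D in B)
# Axiom (iv): complementation, over all elements.
complemented = all(A & complement(A) == frozenset() and
                   A | complement(A) == X
                   for A in B)
print(distributive, complemented)  # True True
```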
Clearly, every concrete Boolean algebra is an abstract Boolean algebra. In the converse direction, we have Stone's representation theorem (see below), which asserts (among other things) that every abstract Boolean algebra is isomorphic to a concrete one (and even constructs this concrete representation of the abstract Boolean algebra canonically). So, up to (abstract) isomorphism, there is really no difference between a concrete Boolean algebra and an abstract one.
Now let us turn from Boolean algebras to σ-algebras.

A concrete σ-algebra (also known as a measurable space) is a pair $(X, \mathcal{B})$, where X is a set, and $\mathcal{B}$ is a collection of subsets of X which contains $\emptyset$ and is closed under countable unions, countable intersections, and complements; thus every concrete σ-algebra is a concrete Boolean algebra, but not conversely. As before, concrete σ-algebras come equipped with the structures $\emptyset, \cdot^c, \cup, \cap, \subset$ which obey axioms (i)-(iv), but they also come with the operations of countable union $(A_n)_{n=1}^\infty \mapsto \bigcup_{n=1}^\infty A_n$ and countable intersection $(A_n)_{n=1}^\infty \mapsto \bigcap_{n=1}^\infty A_n$, which obey an additional axiom:

(v) Any countable family $A_1, A_2, \ldots$ of elements of $\mathcal{B}$ has supremum $\bigcup_{n=1}^\infty A_n$ and infimum $\bigcap_{n=1}^\infty A_n$.

As with Boolean algebras, one can now define an abstract σ-algebra to be a set $\mathcal{B} = (\mathcal{B}, \emptyset, \cdot^c, \cup, \cap, \subset, \bigcup_{n=1}^\infty, \bigcap_{n=1}^\infty)$ with the indicated objects, operations, and relations, which obeys axioms (i)-(v). Again, every concrete σ-algebra is an abstract one; but is it still true that every abstract σ-algebra is representable as a concrete one?

The answer turns out to be no, but the obstruction can be described precisely (namely, one needs to quotient out an ideal of null sets from the concrete σ-algebra), and there is a satisfactory representation theorem, namely the Loomis-Sikorski representation theorem (see below). As a corollary of this representation theorem, one can also represent abstract measure spaces $(\mathcal{B}, \mu)$ (also known as measure algebras) by concrete measure spaces $(X, \mathcal{B}, \mu)$, after quotienting out by null sets.
In the rest of this section, I will state and prove these representation theorems. These theorems help explain why it is safe to focus attention primarily on concrete σ-algebras and measure spaces when doing measure theory, since the abstract analogues of these mathematical concepts are largely equivalent to their concrete counterparts. (The situation is quite different for non-commutative measure theories, such as quantum probability, in which there is basically no good representation theorem available to equate the abstract with the classically concrete, but I will not discuss these theories here.)
2.3.1. Stone's representation theorem. We first give the class of Boolean algebras the structure of a category:

Definition 2.3.1 (Boolean algebra morphism). A morphism $\phi : \mathcal{A} \to \mathcal{B}$ from one abstract Boolean algebra to another is a map which preserves the empty set, complements, unions, intersections, and the subset relation (e.g. $\phi(A \cup B) = \phi(A) \cup \phi(B)$ for all $A, B \in \mathcal{A}$). An isomorphism is a morphism $\phi : \mathcal{A} \to \mathcal{B}$ which has an inverse morphism $\phi^{-1} : \mathcal{B} \to \mathcal{A}$. Two Boolean algebras are isomorphic if there is an isomorphism between them.

Note that if $(X, \mathcal{A})$, $(Y, \mathcal{B})$ are concrete Boolean algebras, and if $f : X \to Y$ is a map which is measurable in the sense that $f^{-1}(B) \in \mathcal{A}$ for all $B \in \mathcal{B}$, then the inverse image map of f is a Boolean algebra morphism $f^{-1} : \mathcal{B} \to \mathcal{A}$ which goes in the reverse (i.e. contravariant) direction to that of f. To state Stone's representation theorem we need another definition.
Definition 2.3.2 (Stone space). A Stone space is a topological space $X = (X, \mathcal{T})$ which is compact, Hausdorff, and totally disconnected. Given a Stone space, define the clopen algebra Cl(X) of X to be the concrete Boolean algebra on X consisting of the clopen sets (i.e. sets that are both closed and open).

It is easy to see that Cl(X) is indeed a concrete Boolean algebra for any topological space X. The additional properties of being compact, Hausdorff, and totally disconnected are needed in order to recover the topology $\mathcal{T}$ of X uniquely from the clopen algebra. Indeed, we have

Lemma 2.3.3. If X is a Stone space, then the topology $\mathcal{T}$ of X is generated by the clopen algebra Cl(X). Equivalently, the clopen algebra forms an open base for the topology.
Proof. Let $x \in X$ be a point, and let K be the intersection of all the clopen sets containing x. Clearly, K is closed. We claim that $K = \{x\}$. If this is not the case, then (since X is totally disconnected) K must be disconnected, thus K can be separated non-trivially into two closed sets $K = K_1 \cup K_2$. Since compact Hausdorff spaces are normal, we can write $K_1 = K \cap U_1$ and $K_2 = K \cap U_2$ for some disjoint open $U_1, U_2$. Since the intersection of all the clopen sets containing x with the closed set $(U_1 \cup U_2)^c$ is empty, we see from the finite intersection property that there must be a finite intersection K′ of clopen sets containing x with $K' \subset U_1 \cup U_2$. The sets $K' \cap U_1$ and $K' \cap U_2$ are then clopen and do not contain K. But this contradicts the definition of K (since x is contained in one of $K' \cap U_1$ and $K' \cap U_2$). Thus $K = \{x\}$.

Another application of the finite intersection property then reveals that every open neighbourhood of x contains at least one clopen set containing x, and so the clopen sets form a base as required.
Exercise 2.3.1. Show that two Stone spaces have isomorphic clopen algebras if and only if they are homeomorphic.

Now we turn to the representation theorem.

Theorem 2.3.4 (Stone representation theorem). Every abstract Boolean algebra $\mathcal{B}$ is isomorphic to the clopen algebra Cl(X) of a Stone space X.
Proof. We will need the binary abstract Boolean algebra $\{0, 1\}$, with the usual Boolean logic operations. We define $X := \mathrm{Hom}(\mathcal{B}, \{0, 1\})$ to be the space of all morphisms from $\mathcal{B}$ to $\{0, 1\}$. Observe that each point $x \in X$ can be viewed as a finitely additive measure $\mu_x : \mathcal{B} \to \{0, 1\}$ that takes values in $\{0, 1\}$. In particular, this makes X a closed subset of $\{0, 1\}^{\mathcal{B}}$ (endowed with the product topology). The space $\{0, 1\}^{\mathcal{B}}$ is Hausdorff, totally disconnected, and (by Tychonoff's theorem, Theorem 1.8.14) compact, and so X is also; in other words, X is a Stone space. Every $B \in \mathcal{B}$ induces a cylinder set $C_B \subset \{0, 1\}^{\mathcal{B}}$, consisting of all maps $\phi : \mathcal{B} \to \{0, 1\}$ that map B to 1. If we define $\Phi(B) := C_B \cap X$, it is not hard to see that $\Phi$ is a morphism from $\mathcal{B}$ to Cl(X). Since the cylinder sets are clopen and generate the topology of $\{0, 1\}^{\mathcal{B}}$, we see that $\Phi(\mathcal{B})$ consists of clopen sets and generates the topology of X. Using compactness, we then conclude that every clopen set is the finite union of finite intersections of elements of $\Phi(\mathcal{B})$; since $\Phi(\mathcal{B})$ is an algebra, we thus see that $\Phi$ is surjective.

The only remaining task is to check that $\Phi$ is injective. It is sufficient to show that $\Phi(A)$ is non-empty whenever $A \in \mathcal{B}$ is not equal to $\emptyset$. But by Zorn's lemma (Section 2.4), we can place A inside a maximal proper filter (i.e. an ultrafilter) p. The indicator $1_p : \mathcal{B} \to \{0, 1\}$ of p can then be verified to be an element of $\Phi(A)$, and the claim follows.
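For a finite Boolean algebra the construction in this proof can be carried out exhaustively. The sketch below (our own illustration) takes $\mathcal{B} = 2^Y$ for a three-point set Y, enumerates all morphisms into $\{0, 1\}$ (checking preservation of $\emptyset$, complements, and unions; intersections then follow by de Morgan), and confirms that they are exactly the evaluations at points of Y, i.e. the principal ultrafilters, so here the Stone space is Y itself.

```python
from itertools import chain, combinations, product

Y = (0, 1, 2)
B = [frozenset(s) for s in chain.from_iterable(
    combinations(Y, r) for r in range(len(Y) + 1))]
full = frozenset(Y)

def is_hom(phi):
    # phi : B -> {0,1} preserving empty set, complements, and unions.
    if phi[frozenset()] != 0:
        return False
    return (all(phi[full - A] == 1 - phi[A] for A in B) and
            all(phi[A | C] == max(phi[A], phi[C]) for A in B for C in B))

homs = []
for values in product((0, 1), repeat=len(B)):
    phi = dict(zip(B, values))
    if is_hom(phi):
        homs.append(phi)

# Each morphism is evaluation at a single point y of Y (a principal
# ultrafilter), so Hom(2^Y, {0,1}) has exactly |Y| elements.
evaluations = [{A: int(y in A) for A in B} for y in Y]
print(len(homs) == len(Y) and all(e in homs for e in evaluations))
```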
Remark 2.3.5. If $\mathcal{B} = 2^Y$ is the power set of some set Y, then the Stone space given by Theorem 2.3.4 is the Stone-Čech compactification of Y (which we give the discrete topology); see Section 2.5.

Remark 2.3.6. Lemma 2.3.3 and Theorem 2.3.4 can be interpreted as giving a duality between the category of Boolean algebras and the category of Stone spaces, with the duality maps being $\mathcal{B} \mapsto \mathrm{Hom}(\mathcal{B}, \{0, 1\})$ and $X \mapsto \mathrm{Cl}(X)$. (The duality maps are (contravariant) functors which are inverses up to natural transformations.) It is the model example of the more general Stone duality between certain partially ordered sets and certain topological spaces. The idea of dualising a space X by considering the space of its morphisms to a fundamental space (in this case, $\{0, 1\}$) is a common one in mathematics; for instance, Pontryagin duality in the context of Fourier analysis on locally compact abelian groups provides another example (with the fundamental space in this case being the unit circle $\mathbf{R}/\mathbf{Z}$); see Section 1.12. Other examples include the Gelfand representation of $C^*$-algebras

contains x. Iterating this (starting with $[0, 1]$) we see that there exist Borel sets B of arbitrarily small measure with $\Phi(B)$ containing x. Taking countable intersections, we conclude that there exists a null set N whose image $\Phi(N)$ contains x; but $\Phi(N)$ is empty, a contradiction.
However, it turns out that quotienting out by ideals is the only obstruction to having a Stone-type representation theorem. Namely, we have

Theorem 2.3.10 (Loomis-Sikorski representation theorem). Let $\mathcal{B}$ be an abstract σ-algebra. Then there exists a concrete σ-algebra $(X, \mathcal{A})$ and a σ-ideal $\mathcal{N}$ of $\mathcal{A}$ such that $\mathcal{B}$ is isomorphic to $\mathcal{A}/\mathcal{N}$.

Proof. We use the argument of Loomis [Lo1946]. Applying Stone's representation theorem, we can find a Stone space X such that there is a Boolean algebra isomorphism $\Phi : \mathcal{B} \to \mathrm{Cl}(X)$ from $\mathcal{B}$ (viewed now only as a Boolean algebra rather than a σ-algebra) to the clopen algebra of X. Let $\mathcal{A}$ be the Baire σ-algebra of X, i.e. the σ-algebra generated by Cl(X). The map $\Phi$ need not be a σ-algebra isomorphism, being merely a Boolean algebra isomorphism instead; it preserves finite unions and intersections, but need not preserve countable ones. In particular, if $B_1, B_2, \ldots \in \mathcal{B}$ are such that $\bigcap_{n=1}^\infty B_n = \emptyset$, then $\bigcap_{n=1}^\infty \Phi(B_n) \in \mathcal{A}$ need not be empty.

Let us call sets $\bigcap_{n=1}^\infty \Phi(B_n)$ of this form basic null sets, and let $\mathcal{N}$ be the collection of sets in $\mathcal{A}$ which can be covered by at most countably many basic null sets.

It is not hard to see that $\mathcal{N}$ is a σ-ideal in $\mathcal{A}$. The map $\Phi$ then descends to a map $\tilde{\Phi} : \mathcal{B} \to \mathcal{A}/\mathcal{N}$. It is not hard to see that $\tilde{\Phi}$ is a Boolean algebra morphism. Also, if $B_1, B_2, \ldots \in \mathcal{B}$ are such that $\bigcap_{n=1}^\infty B_n = \emptyset$, then from construction we have $\bigcap_{n=1}^\infty \tilde{\Phi}(B_n) = \emptyset$. From these two facts one can easily show that $\tilde{\Phi}$ is in fact a σ-algebra morphism. Since $\Phi(\mathcal{B}) = \mathrm{Cl}(X)$ generates $\mathcal{A}$, $\tilde{\Phi}(\mathcal{B})$ must generate $\mathcal{A}/\mathcal{N}$, and so $\tilde{\Phi}$ is surjective.
The only remaining task is to show that $\tilde{\Phi}$ is injective. As before, it suffices to show that $\tilde{\Phi}(A) \neq \emptyset$ when $A \neq \emptyset$. Suppose for contradiction that $A \neq \emptyset$ and $\tilde{\Phi}(A) = \emptyset$; then $\Phi(A)$ can be covered by a countable family $\bigcap_{n=1}^\infty \Phi(A_n^{(i)})$ of basic null sets, where $\bigcap_{n=1}^\infty A_n^{(i)} = \emptyset$ for each i. Since $A \neq \emptyset$ and $\bigcap_{n=1}^\infty A_n^{(1)} = \emptyset$, we can find $n_1$ such that $A \setminus A_{n_1}^{(1)} \neq \emptyset$ (where of course $A \setminus B := A \cap B^c$). Iterating this, we can find $n_2, n_3, n_4, \ldots$ such that $A \setminus (A_{n_1}^{(1)} \cup \ldots \cup A_{n_k}^{(k)}) \neq \emptyset$ for all k. Since $\Phi$ is a Boolean algebra isomorphism, we conclude that $\Phi(A)$ is not covered by any finite subcollection of the $\Phi(A_{n_1}^{(1)}), \Phi(A_{n_2}^{(2)}), \ldots$. But all of these sets are clopen, so by compactness, $\Phi(A)$ is not covered by the entire collection $\Phi(A_{n_1}^{(1)}), \Phi(A_{n_2}^{(2)}), \ldots$. But this contradicts the fact that $\Phi(A)$ is covered by the $\bigcap_{n=1}^\infty \Phi(A_n^{(i)})$.
Remark 2.3.11. The proof above actually gives a little bit more structure on $X, \mathcal{A}$, namely it gives X the structure of a Stone space, with $\mathcal{A}$ being its Baire σ-algebra. Furthermore, the ideal $\mathcal{N}$ constructed in the proof is in fact the ideal of meager Baire sets. The only difficult step is to show that every closed Baire set S with empty interior is in $\mathcal{N}$, i.e. is a countable intersection of clopen sets. To see this, note that S is generated by a countable subalgebra of $\mathcal{B}$ which corresponds to a continuous map f from X to the Cantor set K (since K is dual to the free Boolean algebra on countably many generators). Then f(S) is closed in K and is hence a countable intersection of clopen sets in K, which pull back to countably many clopen sets on X whose intersection is $f^{-1}(f(S))$. But the fact that S is generated by the subalgebra defining f can easily be seen to imply that $f^{-1}(f(S)) = S$.
Remark 2.3.12. The Stone representation theorem relies in an es-
sential way on the axiom of choice (or at least the boolean prime ideal
theorem, which is slightly weaker than this axiom). However, it is
possible to prove the Loomis-Sikorski representation theorem with-
out choice; see for instance [BudePvaR2008].
Remark 2.3.13. The construction of (X, A, N) in the above proof was canonical, but it is not unique (in contrast to the situation with the Stone representation theorem, where Lemma 2.3.3 provides uniqueness up to homeomorphisms). Nevertheless, using Remark 2.3.11, one can make the Loomis-Sikorski representation functorial. Let A and B be σ-algebras with Stone spaces X and Y. A map Y → X induces a σ-homomorphism Bor(X) → Bor(Y), and if the inverse image of a Borel meager set is meager then it induces a σ-homomorphism A → B. Conversely a σ-homomorphism A → B induces a map Y → X under which the inverse image of a Borel meager set is meager (using the fact above that Borel meager sets are generated by countable intersections of clopen sets). The correspondence is bijective since it is just a restriction of the correspondence for ordinary Boolean algebras. This gives a duality between the category of σ-algebras and σ-homomorphisms and the category of σ-Stone spaces and continuous maps such that the inverse image of a Borel meager set is meager. In fact, σ-Stone spaces can be abstractly characterized as Stone spaces such that the closure of a countable union of clopen sets is clopen.
A (concrete) measure space (X, B, μ) is a concrete σ-algebra (X, B) together with a countably additive measure μ : B → [0, +∞]. One can similarly define an abstract measure space (B, μ) (or measure algebra) to be an abstract σ-algebra B with a countably additive measure μ : B → [0, +∞]. (Note that one does not need the concrete space X in order to define the notion of a countably additive measure.)

One can obtain an abstract measure space from a concrete one by deleting X and then quotienting out by some σ-ideal of null sets - sets of measure zero with respect to μ. (For instance, one could quotient out the space of all null sets, which is automatically a σ-ideal.) Thanks to the Loomis-Sikorski representation theorem, we have a converse:

Exercise 2.3.3. Show that every abstract measure space is isomorphic to a concrete measure space after quotienting out by a σ-ideal of null sets (where the notion of morphism, isomorphism, etc. on abstract measure spaces is defined in the obvious manner.)
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/12.
Thanks to Eric for Remark 2.3.11, and for the functoriality remark in Remark 2.3.13.
Eric and Tom Leinster pointed out a subtlety that two concrete Boolean algebras which are abstractly isomorphic need not be concretely isomorphic. In particular, the modifier "abstract" is essential in the statement that up to (abstract) isomorphism, there is no difference between a concrete Boolean algebra and an abstract one.
2.4. Well-ordered sets, ordinals, and Zorn's lemma

Notational convention: As in Section 2.2, I will colour a statement red in this post if it assumes the axiom of choice. We will, of course, rely on every other axiom of Zermelo-Fraenkel set theory here (and in the rest of the course).

In analysis, one often needs to iterate some sort of operation infinitely many times (e.g. to create an infinite basis by choosing one basis element at a time). In order to do this rigorously, we will rely on Zorn's lemma:
Lemma 2.4.1 (Zorn's Lemma). Let (X, ≤) be a non-empty partially ordered set, with the property that every chain (i.e. a totally ordered subset) in X has an upper bound. Then X contains a maximal element (i.e. an element with no strictly larger element).
Indeed, we have used this lemma several times already in previous
sections. Given the other standard axioms of set theory, this lemma
is logically equivalent to
Axiom 2.4.2 (Axiom of choice). Let X be a set, and let T be a collection of non-empty subsets of X. Then there exists a choice function f : T → X, i.e. a function such that f(A) ∈ A for all A ∈ T.
One implication is easy:
Proof of axiom of choice using Zorn's lemma. Define a partial choice function to be a pair (T′, f′), where T′ is a subset of T and f′ : T′ → X is a choice function for T′. We partially order the partial choice functions by declaring (T′, f′) ≤ (T″, f″) if T′ ⊆ T″ and f″ extends f′. Every chain of partial choice functions has an upper bound (formed by gluing the functions in the chain together), and the pair (∅, ∅) is a partial choice function, so this poset is non-empty; by Zorn's lemma, there is thus a maximal partial choice function (T′, f′). The domain T′ of this function must be all of T, since otherwise one could enlarge T′ by a single set A and extend f′ to A by choosing a single element of A, contradicting maximality. The claim follows.

Definition 2.4.3 (Well-ordered sets). A well-ordered set is a totally ordered set X = (X, ≤) such that every non-empty subset A of X has a least element min(A) (where, as usual, x < x′ denotes the assertion that x ≤ x′ and x ≠ x′, so that exactly one of x < x′, x = x′, x > x′ holds whenever x, x′ ∈ X).
Example 2.4.4. The natural numbers are well-ordered (this is the well-ordering principle), as is any finite totally ordered set (including the empty set), but the integers, rationals, or reals are not well-ordered.
Example 2.4.5. Any subset of a well-ordered set is again well-ordered. In particular, if a, b are two elements of a well-ordered set X, then intervals such as [a, b] := {c ∈ X : a ≤ c ≤ b}, [a, b) := {c ∈ X : a ≤ c < b}, etc. are also well-ordered.
Example 2.4.6. If X is a well-ordered set, then the ordered set X ∪ {+∞}, defined by adjoining a new element +∞ to X and declaring it to be larger than all the elements of X, is also well-ordered. More generally, if X and Y are well-ordered sets, then the ordered set X ⊎ Y, defined as the disjoint union of X and Y, with any element of Y declared to be larger than any element of X, is also well-ordered. Observe that the operation ⊎ is associative (up to isomorphism), but not commutative in general: for instance, {+∞} ⊎ N is isomorphic to N, but N ⊎ {+∞} is not isomorphic to N.
Example 2.4.7. If X, Y are well-ordered sets, then the ordered set X × Y, defined as the Cartesian product X × Y with the lexicographical ordering (thus (x, y) ≤ (x′, y′) if x < x′, or if x = x′ and y ≤ y′), is again a well-ordered set. Again, this operation is associative (up to isomorphism) but not commutative. Note that we have one-sided distributivity: (X ⊎ Y) × Z is isomorphic to (X × Z) ⊎ (Y × Z), but Z × (X ⊎ Y) is not isomorphic to (Z × X) ⊎ (Z × Y) in general.
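For concreteness, the lexicographical ordering of Example 2.4.7 can be checked mechanically on small finite sets. The following Python sketch (the helper names are my own) verifies by brute force that every non-empty subset of a small product has a least element, i.e. that the product is well-ordered:

```python
from itertools import combinations, product

# Sketch of Example 2.4.7: the lexicographical ordering on X x Y, and a
# brute-force check that every non-empty subset of a small finite product
# has a least element (i.e. the product is well-ordered).
def lex_le(p, q):
    """(x, y) <= (x', y') iff x < x', or x = x' and y <= y'."""
    (x, y), (x2, y2) = p, q
    return x < x2 or (x == x2 and y <= y2)

def has_least(subset):
    """True iff some member is <= every member of the subset."""
    return any(all(lex_le(m, other) for other in subset) for m in subset)

XY = list(product(range(3), range(2)))
assert all(has_least(s) for r in range(1, 4) for s in combinations(XY, r))

# The primary key is the first coordinate:
assert sorted(XY) == [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

Of course, any finite totally ordered set is well-ordered (Example 2.4.4); the interesting content of Example 2.4.7 is that well-ordering survives the product construction for infinite X, Y as well.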
Remark 2.4.8. The axiom of choice is trivially true in the case when X is well-ordered, since one can take the map A ↦ min(A) to be the choice function. Thus, the axiom of choice follows from the well-ordering theorem (every set has at least one well-ordering). Conversely, we will be able to deduce the well-ordering theorem from Zorn's lemma (and hence from the axiom of choice): see Exercise 2.4.11 below.
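The first observation in Remark 2.4.8 is concrete enough to execute: once the ambient set carries a well-ordering, "take the minimum" is an explicit, choice-free choice function. A small Python sketch (the sample collection T is an arbitrary choice of mine):

```python
# Sketch of Remark 2.4.8: on a well-ordered set, A -> min(A) is an explicit
# choice function, with no appeal to the axiom of choice.
T = [{3, 1, 4}, {1, 5}, {9, 2, 6}]  # non-empty subsets of N

def choice(A):
    return min(A)  # min(A) exists because N is well-ordered

assert all(choice(A) in A for A in T)
assert [choice(A) for A in T] == [1, 1, 2]
```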
One of the reasons that well-ordered sets are useful is that one
can perform induction on them. This is easiest to describe for the
principle of strong induction:
Exercise 2.4.1 (Strong induction on well-ordered sets). Let X be a well-ordered set, and let P : X → {true, false} be a property of elements of X. Suppose that whenever x ∈ X is such that P(y) is true for all y < x, then P(x) is true. Show that P(x) is then true for every x ∈ X; this is called the principle of strong induction. Conversely, show that a totally ordered set X enjoys the principle of strong induction if and only if it is well-ordered. (For partially ordered sets, the corresponding notion is that of being well-founded.)
To describe the analogue of the ordinary principle of induction for well-ordered sets, we need some more notation. Given a subset A of a non-empty well-ordered set X, we define the supremum sup(A) ∈ X ∪ {+∞} of A to be the least upper bound

(2.14) sup(A) := min({y ∈ X ∪ {+∞} : x ≤ y for all x ∈ A})

of A (thus for instance the supremum of the empty set is min(X)). If x ∈ X, we define the successor succ(x) ∈ X ∪ {+∞} of x by the formula

(2.15) succ(x) := min((x, +∞]).
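On a finite well-ordered set, the definitions (2.14) and (2.15) can be computed directly. The following Python sketch (helper names are my own) models X = {0, . . . , 4} with +∞ adjoined:

```python
import math

# A finite model of (2.14) and (2.15): X = {0,...,4} with +infinity adjoined.
X = list(range(5))
INF = math.inf
X_ext = X + [INF]

def sup(A):
    """sup(A) = least element of X + {+inf} that bounds A above (eq. 2.14)."""
    return min(y for y in X_ext if all(x <= y for x in A))

def succ(x):
    """succ(x) = least element of the interval (x, +inf] (eq. 2.15)."""
    return min(y for y in X_ext if y > x)

assert sup(set()) == 0   # the supremum of the empty set is min(X)
assert sup({1, 3}) == 3
assert succ(3) == 4
assert succ(4) == INF    # the largest element's successor is +infinity
```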
We have the following Peano-type axioms:

Exercise 2.4.2. If x is an element of a non-empty well-ordered set X, show that exactly one of the following statements holds:

(Limit case) x = sup([min(X), x)).
(Successor case) x = succ(y) for some y ∈ X.

In particular, min(X) is not the successor of any element in X.

Exercise 2.4.3. Show that if x, y are elements of a well-ordered set such that succ(x) = succ(y), then x = y.
Exercise 2.4.4 (Transfinite induction for well-ordered sets). Let X be a non-empty well-ordered set, and let P : X → {true, false} be a property of elements of X. Suppose that

(Base case) P(min(X)) is true.
(Successor case) If x ∈ X and P(x) is true, then P(succ(x)) is true.
(Limit case) If x = sup([min(X), x)) and P(y) is true for all y < x, then P(x) is true. [Note that this subsumes the base case.]

Then P(x) is true for all x ∈ X.
Remark 2.4.9. The usual Peano axioms for succession are the special case of Exercises 2.4.2-2.4.4 in which the limit case of Exercise 2.4.2 only occurs for min(X) (which is denoted 0), and the successor function never attains +∞. With these additional axioms, X is necessarily isomorphic to N.
Now we introduce two more key concepts.
Definition 2.4.10. An initial segment of a well-ordered set X is a subset Y of X such that [min(X), y] ⊆ Y for all y ∈ Y (i.e. whenever y lies in Y, all elements of X that are less than y also lie in Y).

A morphism from one well-ordered set X to another Y is a map φ : X → Y which is strictly monotone (thus φ(x) < φ(x′) whenever x < x′) and whose image φ(X) is an initial segment of Y.

Proposition 2.4.14. Let X, Y be well-ordered sets. Then at least one of the following statements holds: there is a morphism from X to Y, or there is a morphism from Y to X.

Proof. Call an element a ∈ X ∪ {+∞} good if there exists a (necessarily unique) morphism φ_a from [min(X), a) to Y; thus min(X) is good. If +∞ is good, then we are done. From uniqueness we see that if every element in a set A is good, then the supremum sup(A) is also good. Applying transfinite induction (Exercise 2.4.5), we thus see that we are done unless there exists a good a ∈ X such that succ(a) is not good. By Exercise 2.4.5, φ_a([min(X), a)) = [min(Y), b) for some b ∈ Y ∪ {+∞}. If b ∈ Y then we could extend the morphism φ_a to [min(X), a] = [min(X), succ(a)) by mapping a to b, contradicting the fact that succ(a) is not good; thus b = +∞ and so φ_a is surjective. It is then easy to check that φ_a^{-1} exists and is a morphism from Y to X, and the claim follows.
Remark 2.4.15. Formally, Proposition 2.4.13, Exercise 2.4.8, and Proposition 2.4.14 tell us that the collection of all well-ordered sets, modulo isomorphism, is totally ordered by declaring one well-ordered set X to be at least as large as another Y when there is a morphism from Y to X. However, this is not quite the case, because the collection of well-ordered sets is only a class rather than a set. Indeed, as we shall soon see, this is not a technicality, but is in fact a fundamental fact about well-ordered sets that lies at the heart of Zorn's lemma. (From Russell's paradox we know that the notions of class and set are necessarily distinct; see Section 3.15.)
2.4.2. Ordinals. As we learn very early on in our mathematics education, a finite set of a certain cardinality (e.g. a set {a, b, c, d, e}) can be put in one-to-one correspondence with a standard set of the same cardinality (e.g. the set {1, 2, 3, 4, 5}); two finite sets have the same cardinality if and only if they correspond to the same standard set {1, . . . , N}. (The same fact is true for infinite sets; see Exercise 2.4.12 below.) Similarly, we would like to place every well-ordered set in a standard form. This motivates
Definition 2.4.16. A representation of the well-ordered sets is an assignment of a well-ordered set R(X) to every well-ordered set X such that

R(X) is isomorphic to X for every well-ordered set X. (In particular, if R(X) and R(Y) are equal, then X and Y are isomorphic.)

If there exists a morphism from X to Y, then R(X) is a subset of R(Y) (and the order structure on R(X) is induced from that on R(Y)). (In particular, if X and Y are isomorphic, then R(X) and R(Y) are equal.)

Remark 2.4.17. In the language of category theory, a representation R is a covariant functor from the category of well-ordered sets to itself which turns all morphisms into inclusions, and which is naturally isomorphic to the identity functor.

Remark 2.4.18. Because the collection of all well-ordered sets is a class rather than a set, R is not actually a function (it is sometimes referred to as a class function).
It turns out that several representations of the well-ordered sets exist. The most commonly used one is that of the ordinals, defined by von Neumann as follows.

Definition 2.4.19 (Ordinals). An ordinal is a well-ordered set α with the property that x = {y ∈ α : y < x} for all x ∈ α. (In particular, each element of α is also a subset of α, and the strict order relation < on α is identical to the set membership relation ∈.)
Example 2.4.20. For each natural number n = 0, 1, 2, . . ., define the ordinal n^th recursively by setting 0^th := ∅ and n^th := {0^th, 1^th, . . . , (n−1)^th} for all n ≥ 1, thus for instance

(2.16)
0^th := ∅
1^th := {0^th} = {∅}
2^th := {0^th, 1^th} = {∅, {∅}}
3^th := {0^th, 1^th, 2^th} = {∅, {∅}, {∅, {∅}}}

and so forth. (Of course, to be compatible with the English language conventions for ordinals, we should write 1^st instead of 1^th, etc., but let us ignore this discrepancy.) One can easily check by induction that n^th is an ordinal for every n. Furthermore, if we define ω := {n^th : n ∈ N}, then ω is also an ordinal. (In the foundations of set theory, this construction, together with the axiom of infinity, is sometimes used to define the natural numbers (so that n = n^th for all natural numbers n), although this construction can lead to some conceptually strange-looking consequences that blur the distinction between numbers and sets, such as 3 ∈ 5 and 4 = {0, 1, 2, 3}.)
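The recursion of Example 2.4.20 is directly implementable. The following Python sketch builds the finite von Neumann ordinals as nested frozensets and checks that set membership agrees with strict order (for frozensets, the operator `<` is strict inclusion):

```python
# The finite von Neumann ordinals of Example 2.4.20, built as frozensets:
# 0 = {}, and n = {0, 1, ..., n-1}.  Membership coincides with strict order.
def ordinal(n):
    return frozenset(ordinal(k) for k in range(n))

three, five = ordinal(3), ordinal(5)

assert ordinal(0) == frozenset()
assert len(ordinal(4)) == 4   # 4 = {0, 1, 2, 3}, a set of four elements
assert three in five          # "3 < 5" becomes "3 is an element of 5"
assert three < five           # ... and also "3 is a proper subset of 5"
assert ordinal(4) == frozenset(ordinal(k) for k in range(4))
```

This recomputes shared sub-ordinals exponentially often, so it is only a toy for very small n; but it makes the "each element is also a subset" phenomenon of Definition 2.4.19 tangible.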
The fundamental theorem about ordinals is
Theorem 2.4.21. (i) Given any two ordinals α, β, one is a subset of the other (and the order structure on the smaller is then induced from that on the larger).

(ii) Every well-ordered set X is isomorphic to exactly one ordinal ord(X).

In particular, ord is a representation of the well-ordered sets.

Proof. We first prove (i). From Proposition 2.4.14 and symmetry, we may assume that there is a morphism φ from α to β. By strong induction (Exercise 2.4.1) and Definition 2.4.19, we see that φ(x) = x for all x ∈ α, and so φ is the inclusion map from α into β. The claim follows.

Now we prove (ii). If uniqueness failed, then we would have two distinct ordinals that are isomorphic to each other; but as one ordinal is a subset of the other, this would contradict Proposition 2.4.13 (the inclusion morphism is not an isomorphism); so it suffices to prove existence.
We use transfinite induction. It suffices to show that, for every a ∈ X ∪ {+∞}, the interval [min(X), a) is isomorphic to an ordinal ord(a) (which we know to be unique). This is of course true in the base case a = min(X). To handle the successor case a = succ(b), we set ord(a) := ord(b) ∪ {ord(b)}, which is easily verified to be an ordinal isomorphic to [min(X), a). To handle the limit case a = sup([min(X), a)), we take all the ordinals associated to elements in [min(X), a) and take their union (here we rely crucially on the axiom schema of replacement and the axiom of union); by use of (i) one can show that this union is an ordinal isomorphic to [min(X), a), as required.
Remark 2.4.22. Operations on well-ordered sets, such as the sum ⊎ and product × defined in Examples 2.4.6, 2.4.7, induce corresponding operations on ordinals, leading to ordinal arithmetic, which we will not discuss here. (Note that the convention for which order multiplication proceeds in is swapped in some of the literature, thus the product of ordinals α and β would there be the ordinal of β × α rather than α × β.)
Exercise 2.4.9 (Ordinals are themselves well-ordered). Let T be a
non-empty class of ordinals. Show that there is a least ordinal min(T)
in this class, which is a subset of all the other ordinals in this class.
In particular, this shows that any set of ordinals is well-ordered by
set inclusion.
Remark 2.4.23. Because of Exercise 2.4.9, we can meaningfully talk about the least ordinal obeying property P, as soon as we can exhibit at least one ordinal with that property P. For instance, once one can demonstrate the existence of an uncountable ordinal (which follows from Exercise 2.4.11 below^4), one can talk about the least uncountable ordinal.
Exercise 2.4.10 (Transfinite induction for ordinals). Let P(α) be a property pertaining to ordinals α. Suppose that

(Base case) P(∅) is true.
(Successor case) If β = α ∪ {α} for some ordinal α, and P(α) is true, then P(β) is true.
(Limit case) If β = ⋃_{α∈A} α for some set A of ordinals with P(α) true for all α ∈ A, then P(β) is true.

Show that P(α) is true for all ordinals α.

^4 One can also create an uncountable ordinal without the axiom of choice, by starting with all the well-orderings of subsets of the natural numbers and taking the union of their associated ordinals; this construction is due to Hartogs.

Theorem 2.4.24 (No set of all ordinals). There does not exist a set that contains all the ordinals.

Proof. Suppose for contradiction that such a set existed. Then (by the axiom of union) the union Ω_0 of all the ordinals would also be a set; one easily verifies that Ω_0 is itself an ordinal, and hence that succ(Ω_0) = Ω_0 ∪ {Ω_0} is an ordinal as well. Thus Ω_0 ∪ {Ω_0} ⊆ Ω_0, and in particular Ω_0 ∈ Ω_0, which contradicts the axiom of foundation.
Remark 2.4.25. It is also possible to prove Theorem 2.4.24 without the theory of ordinals, or the axiom of foundation. One first observes (by transfinite induction) that given two well-ordered sets X, X′, one of them is isomorphic to an initial segment of the other.

Proposition 2.4.27. Let X be a partially ordered set. Then there does not exist a function g that assigns to every well-ordered subset Y of X a strict upper bound g(Y) ∈ X.

Proof (sketch). Suppose for contradiction that such a g existed. Using g, one recursively assigns to each well-ordered set Y a well-ordered subset Φ(Y) of X, in such a manner that Φ(Y) is a subset of Φ(Y′) whenever Y is a subset of Y′. Thus Φ is a representation of the ordinals, whose range lies inside the set X. But this contradicts Theorem 2.4.24.
Remark 2.4.28. One can use transfinite induction on ordinals rather than well-ordered sets if one wishes here, using Remark 2.4.26 in place of Theorem 2.4.24.
Proof of Zorn's lemma. Suppose for contradiction that one had a non-empty partially ordered set X without maximal elements, such that every chain had an upper bound. As there are no maximal elements, every element in X must be bounded by a strictly larger element in X, and so every chain in fact has a strict upper bound; in particular every well-ordered subset has a strict upper bound. Applying the axiom of choice, we may thus find a function g : Ω → X from the space Ω of well-ordered subsets of X to X that maps every such subset to a strict upper bound. But this contradicts Proposition 2.4.27.
Remark 2.4.29. It is important for Zorn's lemma that X is a set, rather than a class. Consider for instance the class of all ordinals. Every chain of ordinals has an upper bound (namely, the union of the ordinals in that chain), and the class is certainly non-empty, but there is no maximal ordinal. (Compare also Theorem 2.4.21 and Theorem 2.4.24.)
Remark 2.4.30. It is also important that every chain have an upper bound, and not just countable chains. Indeed, the collection of countable subsets of an uncountable set (such as R) is non-empty, and every countable chain has an upper bound, but there is no maximal element.
Remark 2.4.31. The above argument shows that the hypothesis of Zorn's lemma can be relaxed slightly; one does not need every chain to have an upper bound, merely every well-ordered subset needs to have one. But I do not know of any application in which this apparently stronger version of Zorn's lemma dramatically simplifies an argument. (In practice, either Zorn's lemma can be applied routinely, or it fails utterly to be applicable at all.)
Exercise 2.4.11. Use Zorn's lemma to establish the well-ordering theorem (every set has at least one well-ordering).
Remark 2.4.32. By the above exercise, R can be well-ordered. However, if one drops the axiom of choice from the axioms of set theory, one can no longer prove that R is well-ordered. Indeed, given a well-ordering of R, it is not difficult (using Remark 2.4.8) to remove the axiom of choice from the Banach-Tarski constructions in Section 2.2, and thus obtain constructions of non-measurable subsets of R. But a deep theorem of Solovay gives a model of set theory (without the axiom of choice) in which every set of reals is measurable.
Exercise 2.4.12. Define a (von Neumann) cardinal to be an ordinal κ with the property that all smaller ordinals have strictly lesser cardinality (i.e. cannot be placed in one-to-one correspondence with κ). Show that every set can be placed in one-to-one correspondence with exactly one cardinal. (This gives a representation of the category of sets, similar to how ord gives a representation of the well-ordered sets.)
It seems appropriate to close these notes with a quote from Jerry Bona:

"The Axiom of Choice is obviously true, the well-ordering principle obviously false, and who can tell about Zorn's Lemma?"
Notes. This lecture first appeared at terrytao.wordpress.com/2009/01/28.
Thanks to an anonymous commenter for corrections.
Eric remarked that any application of Zorn's lemma can be equivalently rephrased as a transfinite induction, after using a choice function to decide where to go at each limit ordinal.
2.5. Compactification and metrisation

One way to study a general class of mathematical objects is to embed them into a more structured class of mathematical objects; for instance, one could study manifolds by embedding them into Euclidean spaces. In these notes we study two (related) embedding theorems for topological spaces:

The Stone-Čech compactification, which embeds a locally compact Hausdorff space X into a compact Hausdorff space βX; this is the finest compactification of X, in the sense that any other compactification ι : X → X̃ factors through it via a (unique) continuous surjection π : βX → X̃ with ι = π ∘ β.

Urysohn's metrisation theorem, which shows that every second countable normal Hausdorff space can be embedded into a metric space, and is thus metrisable.
Example 2.5.4. From the above exercise, we can define limits lim_{x→p} f(x) := βf(p) for any bounded continuous function f on X and any p ∈ βX, where βf is the unique continuous extension of f to βX. But for coarser compactifications, one can only take limits for special types of bounded continuous functions; for instance, using the one-point compactification R ∪ {∞} of R, lim_{x→∞} f(x) need not exist for a bounded continuous function f : R → R, e.g. lim_{x→∞} sin(x) or lim_{x→∞} arctan(x) do not exist. The finer the compactification, the more limits can be defined; for instance the two-point compactification [−∞, +∞] of R allows one to define the limits lim_{x→+∞} f(x) and lim_{x→−∞} f(x) for some additional functions f (e.g. lim_{x→−∞} arctan(x) is well-defined); and the Stone-Čech compactification βR, being the finest of all compactifications of R, allows one to define lim_{x→p} f(x) for every bounded continuous f and every p ∈ βR.
Exercise 2.5.3. Let X be a locally compact Hausdorff space. Let C(X → [0, 1]) be the space of continuous functions from X to the unit interval, let Q := [0, 1]^{C(X→[0,1])} be the space of tuples (y_f)_{f∈C(X→[0,1])} taking values in the unit interval, with the product topology, let ι : X → Q be the Gelfand transform ι(x) := (f(x))_{f∈C(X→[0,1])}, and let βX be the closure of ι(X) in Q.

Show that βX is a compactification of X. (Hint: Use Urysohn's lemma and Tychonoff's theorem.)

Show that βX is the Stone-Čech compactification of X. (Hint: If X̃ is any other compactification of X, we can identify C(X̃ → [0, 1]) with a subset of C(X → [0, 1]), and then project Q to [0, 1]^{C(X̃→[0,1])}. Meanwhile, we can embed X̃ inside [0, 1]^{C(X̃→[0,1])} by the Gelfand transform.)
Exercise 2.5.4. Let X be a discrete topological space, and let 2^X be the Boolean algebra of all subsets of X. By Stone's representation theorem (Theorem 1.2.2), 2^X is isomorphic to the clopen algebra of a Stone space βX.

Show that βX is a compactification of X.

Show that βX is the Stone-Čech compactification of X.

Identify βX with the space of ultrafilters on X. (See Section 1.5 of Structure and Randomness for further discussion of ultrafilters, and Section 2.3 of Poincaré's Legacies, Vol. I for further discussion of the relationship of ultrafilters to the Stone-Čech compactification.)
Exercise 2.5.5. Let X be a locally compact Hausdorff space, and let BC(X → C) be the space of bounded continuous complex-valued functions on X.

Show that BC(X → C) is a unital commutative C*-algebra (see Section 1.10.4).

By the commutative Gelfand-Naimark theorem (Theorem 1.10.24), BC(X → C) is isomorphic as a unital C*-algebra to C(βX → C) for some compact Hausdorff space βX (which is in fact the spectrum of BC(X → C)). Show that βX is the Stone-Čech compactification of X.

More generally, show that the compactifications of X correspond to the unital C*-subalgebras between C + C_0(X → C) and BC(X → C), which in turn correspond to the one-point and Stone-Čech compactifications at the two extremes.

In particular, the dual of BC(X → R) is isomorphic as a Banach space to the space M(βX) of real signed Radon measures on the Stone-Čech compactification βX of X.
Remark 2.5.5. The Stone-Čech compactification is almost never sequentially compact. For instance, it is not hard to show that N is sequentially closed in βN. In particular, these compactifications are usually not metrisable.
2.5.2. Urysohn's metrisation theorem. Recall that a topological space is metrisable if there exists a metric on that space which generates the topology. There are various necessary conditions for metrisability. For instance, we have already seen that metric spaces must be normal and Hausdorff. In the converse direction, we have

Theorem 2.5.7 (Urysohn's metrisation theorem). Let X be a normal Hausdorff space which is second countable. Then X is metrisable.

Proof. (Sketch) This will be a variant of the argument in Exercise 2.5.3, but with a countable family of continuous functions in place of C(X → [0, 1]).

Let U_1, U_2, . . . be a countable base for X. If U_i, U_j are in this base with the closure of U_i contained in U_j, we can apply Urysohn's lemma and find a continuous function f_{ij} : X → [0, 1] which equals 1 on the closure of U_i and vanishes outside of U_j. Let F be the collection of all such functions; this is a countable family. We can then embed X in [0, 1]^F using the Gelfand transform x ↦ (f(x))_{f∈F}. By modifying the proof of Exercise 2.5.3 one can show that this is an embedding. On the other hand, [0, 1]^F is a countable product of metric spaces and is thus metrisable (e.g. by enumerating F as f_1, f_2, . . . and using the metric d((x_n)_{n=1}^∞, (y_n)_{n=1}^∞) := Σ_{n=1}^∞ 2^{−n}|x_n − y_n|). Since a subspace of a metrisable space is clearly also metrisable, the claim follows.
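The metric used at the end of the proof is easy to experiment with numerically. The following Python sketch (truncating the product at 50 coordinates, an ad hoc cutoff of mine; the tail contributes at most 2^{−50}) illustrates how d weights the n-th coordinate by 2^{−n}:

```python
# Sketch: the metric d(x, y) = sum_{n>=1} 2^{-n} |x_n - y_n| on a countable
# product of copies of [0,1], truncated to 50 coordinates.
def d(x, y):
    return sum(2.0 ** -(n + 1) * abs(a - b) for n, (a, b) in enumerate(zip(x, y)))

zero = [0.0] * 50

def e_k(k):
    """The point whose k-th coordinate (0-indexed) is 1 and the rest are 0."""
    return [1.0 if n == k else 0.0 for n in range(50)]

# Moving only the k-th coordinate moves the point by exactly 2^{-(k+1)}:
assert abs(d(zero, e_k(3)) - 2.0 ** -4) < 1e-12
# Triangle inequality on a sample of points:
x, y, z = zero, e_k(1), e_k(2)
assert d(x, z) <= d(x, y) + d(y, z) + 1e-12
```

The geometric weights are what make convergence in d equivalent to coordinatewise convergence, i.e. convergence in the product topology: late coordinates can only contribute a geometrically small amount to the distance.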
Recalling that compact metric spaces are second countable (Lemma 1.8.6), we thus have
Corollary 2.5.8. A compact Hausdorff space is metrisable if and only if it is second countable.
Of course, non-metrisable compact Hausdorff spaces exist; βN is a standard example. Uncountable products of non-trivial compact metric spaces, such as {0, 1}^X for uncountable X, are always non-metrisable. Indeed, we already saw in Section 1.8 that {0, 1}^X is compact but not sequentially compact (and thus not metrisable) when X has the cardinality of the continuum; one can use the first uncountable ordinal to achieve a similar result for any uncountable X, and then by embedding one can obtain non-metrisability for any uncountable product of non-trivial compact metric spaces, thus complementing the metrisability of countable products of such spaces. Conversely, there also exist metrisable spaces which are not second countable (e.g. uncountable discrete spaces). So Urysohn's metrisation theorem does not completely classify the metrisable spaces; however, it already covers a large number of interesting cases.
Notes. This lecture first appeared at terrytao.wordpress.com/2009/03/02.
Thanks to Eric, Javier Lopez, Mark Meckes, Max Baroi, Paul Leopardi, Pete L. Clark, and anonymous commenters for corrections.
2.6. Hardy's uncertainty principle

Many properties of a (sufficiently nice) function f : R → C are reflected in its Fourier transform f̂ : R → C, defined by the formula

(2.17) f̂(ξ) := ∫_R f(x) e^{−2πixξ} dx.

For instance, decay properties of f are reflected in smoothness properties of f̂, as the following table shows:

If f is...                 then f̂ is...                 and this relates to...
Square-integrable          square-integrable            Plancherel's theorem
Absolutely integrable      continuous                   Riemann-Lebesgue lemma
Rapidly decreasing         smooth                       theory of Schwartz functions
Exponentially decreasing   analytic in a strip
Compactly supported        entire, exponential growth   Paley-Wiener theorem

(See Section 1.12 for further discussion of the Fourier transform.)
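The normalisation (2.17) can be sanity-checked numerically; with this convention the Gaussian e^{−πx²} is its own Fourier transform. The following Python sketch approximates the integral by a Riemann sum (the cutoff R and grid size are ad hoc choices of mine):

```python
import cmath
import math

# Numerical sanity check of (2.17): with the e^{-2 pi i x xi} convention,
# the Gaussian f(x) = exp(-pi x^2) is its own Fourier transform.
def fourier(f, xi, R=10.0, N=20001):
    h = 2 * R / (N - 1)  # uniform grid on [-R, R]; f decays fast enough
    return h * sum(f(-R + j * h) * cmath.exp(-2j * math.pi * (-R + j * h) * xi)
                   for j in range(N))

f = lambda x: math.exp(-math.pi * x ** 2)
for xi in (0.0, 0.5, 1.0):
    assert abs(fourier(f, xi) - math.exp(-math.pi * xi ** 2)) < 1e-6
```

Because the integrand is analytic and rapidly decreasing, this crude quadrature is extremely accurate, which is itself a shadow of the smoothness/decay duality in the table above.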
Another important relationship between a function f and its Fourier transform f̂ is the uncertainty principle, which roughly asserts that if a function f is highly localised in space, then its Fourier transform f̂ must be widely dispersed in space, or to put it another way, f and f̂ cannot both decay too strongly at infinity (except of course in the degenerate case f = 0). There are many ways to make this intuition precise. One of them is the Heisenberg uncertainty principle, which asserts that if we normalise

∫_R |f(x)|² dx = ∫_R |f̂(ξ)|² dξ = 1

then we must have

(∫_R |x|² |f(x)|² dx) · (∫_R |ξ|² |f̂(ξ)|² dξ) ≥ 1/(4π)²,

thus forcing at least one of f or f̂ to not be too concentrated near the origin. This principle can be proven (for sufficiently nice f, initially) by observing the integration by parts identity

Re⟨xf, f′⟩ = Re ∫_R x f(x) \overline{f′(x)} dx = −(1/2) ∫_R |f(x)|² dx

and then using Cauchy-Schwarz and the Plancherel identity.
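The Gaussian extremal case of the Heisenberg inequality can be checked numerically: for the normalised Gaussian f(x) = 2^{1/4}e^{−πx²}, which is its own Fourier transform, both second-moment factors equal 1/(4π), so the product attains the right-hand side 1/(4π)². A Python sketch with an ad hoc Riemann sum:

```python
import math

# Sketch: the normalised Gaussian f(x) = 2^{1/4} e^{-pi x^2} attains equality
# in the Heisenberg uncertainty principle.  Since f equals its own Fourier
# transform, both second-moment factors equal 1/(4 pi).
def integrate(g, R=10.0, N=20001):
    h = 2 * R / (N - 1)
    return h * sum(g(-R + j * h) for j in range(N))

f = lambda x: 2 ** 0.25 * math.exp(-math.pi * x ** 2)
norm = integrate(lambda x: f(x) ** 2)          # L^2 normalisation: should be 1
var = integrate(lambda x: x ** 2 * f(x) ** 2)  # second moment: should be 1/(4 pi)

assert abs(norm - 1) < 1e-6
assert abs(var - 1 / (4 * math.pi)) < 1e-6
# Since fhat = f here, the uncertainty product is var * var = (1/(4 pi))^2:
assert abs(var * var - (1 / (4 * math.pi)) ** 2) < 1e-6
```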
Another well known manifestation of the uncertainty principle is the fact that it is not possible for f and f̂ to both be compactly supported (unless of course they vanish entirely). This can in fact be seen from the above table: if f is compactly supported, then f̂ is an entire function; but the zeroes of a non-zero entire function are isolated, yielding a contradiction unless f vanishes. (Indeed, the table also shows that if one of f and f̂ is compactly supported, then the other cannot have exponential decay.)

On the other hand, we have the example of the Gaussian functions f(x) = e^{−πax²}, f̂(ξ) = (1/√a) e^{−πξ²/a}, which both decay faster than exponentially. The classical Hardy uncertainty principle asserts, roughly speaking, that this is the fastest that f and f̂ can simultaneously decay:
Theorem 2.6.1 (Hardy uncertainty principle). Suppose that f is a (measurable) function such that |f(x)| ≤ Ce^{−πax²} and |f̂(ξ)| ≤ C′e^{−πξ²/a} for all x, ξ, and some C, C′, a > 0. Then f is a scalar multiple of the Gaussian e^{−πax²}.

Theorem 2.6.2 (Weak Hardy uncertainty principle). Suppose that f is a non-zero (measurable) function such that |f(x)| ≤ Ce^{−πax²} and |f̂(ξ)| ≤ C′e^{−πbξ²} for all x, ξ, and some C, C′, a, b > 0. Then ab ≤ C_0 for some absolute constant C_0.

Note that the correct value of C_0 should be 1, as is implied by the true Hardy uncertainty principle. Despite the weaker statement, I thought the proof might still be of interest as it is a little less magical than the complex-variable one, and so I am giving it below.
2.6.1. The complex-variable proof. We first give the complex-variable proof. By dilating f by √a (and contracting f̂ by 1/√a) we may normalise a = 1. By multiplying f by a small constant we may also normalise C = C′ = 1.

The super-exponential decay of f allows us to extend the Fourier transform f̂ to the complex plane, thus

f̂(ξ + iη) = ∫_R f(x) e^{−2πixξ} e^{2πxη} dx

for all ξ, η ∈ R. We may differentiate under the integral sign and verify that f̂ is entire. Taking absolute values, we obtain the upper bound

|f̂(ξ + iη)| ≤ ∫_R e^{−πx²} e^{2πxη} dx;

completing the square, we obtain

(2.18) |f̂(ξ + iη)| ≤ e^{πη²}

for all ξ, η. We conclude that the entire function

F(z) := e^{πz²} f̂(z)
is bounded in magnitude by 1 on the imaginary axis; also, by hypothesis on f̂, we know that F is bounded in magnitude by 1 on the real axis. Formally applying the Phragmen-Lindelof principle (or maximum modulus principle), we conclude that F is bounded on the entire complex plane, which by Liouville's theorem implies that F is constant, and the claim follows.

Now let's go back and justify the Phragmen-Lindelof argument. Strictly speaking, Phragmen-Lindelof does not apply, since it requires exponential growth on the function F, whereas we have quadratic-exponential growth here. But we can tweak F a bit to solve this problem. Firstly, we pick 0 < θ < π/2 and work on the sector

Σ_θ := {re^{iα} : r > 0, 0 ≤ α ≤ θ}.

Using (2.18) we have

|F(ξ + iη)| ≤ e^{πξ²}.

Thus, if ε > 0, and θ is sufficiently close to π/2 depending on ε, the function e^{iεz²}F(z) is bounded in magnitude by 1 on the boundary of Σ_θ, and so is bounded by 1 in the interior by the maximum modulus principle. Sending ε → 0, and then θ → π/2, we obtain F bounded in magnitude by 1 on the upper right quadrant. Similar arguments work for the other quadrants, and the claim follows.
2.6.2. The real-variable proof. Now we turn to the real-variable proof of Theorem 2.6.2, which is based on the fact that polynomials of controlled degree do not resemble rapidly decreasing functions. Rather than use the complex analyticity of f̂, we will rely instead on a different relationship between the decay of f and the regularity of f̂, as follows:

Lemma 2.6.3 (Derivative bound). Suppose that |f(x)| ≤ Ce^{−πax²} for all x ∈ R, and some C, a > 0. Then f̂ is smooth, and furthermore one has the bound

|∂_ξ^k f̂(ξ)| ≤ C (k!/(k/2)!) π^{k/2} a^{−(k+1)/2}

for all ξ ∈ R and every even integer k.
Proof. The smoothness of f̂ follows from the rapid decrease of f. To get the bound, we differentiate under the integral sign (one can easily check that this is justified) to obtain

∂_ξ^k f̂(ξ) = ∫_R (−2πix)^k f(x) e^{−2πixξ} dx

and thus by the triangle inequality for integrals (and the hypothesis that k is even)

|∂_ξ^k f̂(ξ)| ≤ C ∫_R e^{−πax²} (2πx)^k dx.

On the other hand, by differentiating the Fourier analytic identity

(1/√a) e^{−πξ²/a} = ∫_R e^{−πax²} e^{−2πixξ} dx

k times at ξ = 0, we obtain

(d^k/dξ^k)((1/√a) e^{−πξ²/a})|_{ξ=0} = ∫_R e^{−πax²} (−2πix)^k dx;

expanding out (1/√a) e^{−πξ²/a} using Taylor series we conclude that

(1/√a) (k!/(k/2)!) (−π/a)^{k/2} = ∫_R e^{−πax²} (−2πix)^k dx,

and the claim follows.

Combining this bound with Stirling's formula, we conclude that

(2.19) |∂_ξ^k f̂(ξ)| ≤ (2π/(ea) + o(1))^{k/2} k^{k/2}

for all large even integers k (where the decay of o(1) can depend on a, C).
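The moment identity at the end of the proof can be tested numerically. The following Python sketch (Riemann-sum parameters are ad hoc choices of mine) compares ∫ e^{−πax²}(2πx)^k dx against (1/√a)(k!/(k/2)!)(π/a)^{k/2} for small even k:

```python
import math

# Numerical check of the moment identity from the proof of Lemma 2.6.3:
# int e^{-pi a x^2} (2 pi x)^k dx = a^{-1/2} (k!/(k/2)!) (pi/a)^{k/2}, k even.
def moment(a, k, R=10.0, N=20001):
    h = 2 * R / (N - 1)
    return h * sum(math.exp(-math.pi * a * (-R + j * h) ** 2)
                   * (2 * math.pi * (-R + j * h)) ** k for j in range(N))

for a in (1.0, 2.0):
    for k in (0, 2, 4, 6):
        exact = (a ** -0.5 * math.factorial(k) / math.factorial(k // 2)
                 * (math.pi / a) ** (k // 2))
        assert abs(moment(a, k) - exact) < 1e-4 * max(1.0, exact)
```

For instance, with a = 1 and k = 2 both sides equal 2π, and with k = 4 both sides equal 12π².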
We can combine (2.19) with Taylor's theorem with remainder, to conclude that on any interval I ⊂ R, we have an approximation

f̂(ξ) = P_I(ξ) + O((1/k!) (2π/(ea) + o(1))^{k/2} k^{k/2} |I|^k)

where |I| is the length of I and P_I is a polynomial of degree less than k. Using Stirling's formula again, we obtain

(2.20) f̂(ξ) = P_I(ξ) + O((2πe/a + o(1))^{k/2} k^{−k/2} |I|^k).
362 2. Related articles
Now we apply a useful bound.

Lemma 2.6.4 (Doubling bound). Let P be a polynomial of degree
at most k for some k ≥ 1, let I = [x_0 − r, x_0 + r] be an interval, and
suppose that |P(x)| ≤ A for all x ∈ I and some A > 0. Then for
any N ≥ 1 we have the bound |P(x)| ≤ (CN)^k A for all x ∈ NI :=
[x_0 − Nr, x_0 + Nr] and for some absolute constant C.

Proof. By translating we may take x_0 = 0; by dilating we may take
r = 1. By dividing P by A, we may normalise A = 1. Thus we have
|P(x)| ≤ 1 for all −1 ≤ x ≤ 1, and the aim is now to show that
|P(x)| ≤ (CN)^k for all −N ≤ x ≤ N.
Consider the trigonometric polynomial P(cos θ). By de Moivre's
formula, this function is a linear combination of cos(jθ) for 0 ≤ j ≤ k.
By Fourier analysis, we can thus write P(cos θ) = Σ_{j=0}^k c_j cos(jθ),
where
    c_j = (1/π) ∫_{−π}^{π} P(cos θ) cos(jθ) dθ.
Since P(cos θ) is bounded in magnitude by 1, we conclude that c_j is
bounded in magnitude by 2. Next, we use de Moivre's formula again
to expand cos(jθ) as a combination of powers of cos(θ) and sin²(θ), with
coefficients of size O(1)^k; expanding sin²(θ) further as 1 − cos²(θ),
we see that cos(jθ) is a polynomial in cos(θ) with coefficients O(1)^k.
Putting all this together, we conclude that the coefficients of P are
all of size O(1)^k, and the claim follows. ∎
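The doubling phenomenon can be illustrated concretely with the Chebyshev polynomials T_k, which are bounded by 1 on [−1, 1] and grow no faster than (2N)^k on [−N, N]. Here is a small standalone Python illustration (my own, not from the text):

```python
def cheb_eval(k, x):
    # Evaluate the Chebyshev polynomial T_k(x) by the three-term recurrence
    # T_0 = 1, T_1 = x, T_{j+1}(x) = 2 x T_j(x) - T_{j-1}(x).
    if k == 0:
        return 1.0
    t_prev, t_cur = 1.0, float(x)
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_cur
```

For example, T_3(x) = 4x³ − 3x, so T_3(2) = 26, comfortably below (2·2)³ = 64.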
Remark 2.6.5. One can get slightly sharper results by using the
theory of Chebyshev polynomials. (Is the best bound for C known?
I do not know the recent literature on this subject. I think though
that even the sharpest bound for C would not fully recover the sharp
Hardy uncertainty principle, at least with the argument given here.)
We return to the proof of Theorem 2.6.2. We pick a large even integer
k and a parameter r > 0 to be chosen later. From (2.20) we have
    f̂(ξ) = P_r(ξ) + O( Cr²/(ak) )^{k/2}
for ξ ∈ [r, 2r], some absolute constant C, and some polynomial P_r of
degree at most k. In particular, we have
    P_r(ξ) = O(e^{−πbr²}) + O( Cr²/(ak) )^{k/2}
for ξ ∈ [r, 2r]. Applying Lemma 2.6.4, we conclude that
    P_r(ξ) = O(1)^k e^{−πbr²} + O( Cr²/(ak) )^{k/2}
for ξ ∈ [−r, r]. Applying (2.20) again we conclude that
    f̂(ξ) = O(1)^k e^{−πbr²} + O( Cr²/(ak) )^{k/2}
for ξ ∈ [−r, r]. If we pick r := (k/(cb))^{1/2} for a sufficiently small
absolute constant c, we conclude that
    |f̂(ξ)| ≤ 2^{−k} + O( C′/(ab) )^{k/2}
(say) for ξ ∈ [−r, r] and some absolute constant C′. If ab ≥ C_0 for
a large enough C_0, the right-hand side goes to zero as k → ∞ (which
also implies r → ∞), and we conclude that f̂ (and hence f) vanishes
identically. ∎
Notes. This article first appeared at terrytao.wordpress.com/2009/02/18.
Pedro Lauridsen Ribeiro noted an old result of Schrödinger, that
the only minimisers of the Heisenberg uncertainty principle are the
Gaussians (up to scaling, translation, and modulation symmetries).
Fabrice Planchon and Philippe Jaming mentioned several related
results and generalisations, including a recent PDE-based proof of the
Hardy uncertainty principle (with the sharp constant) in [EsKePoVe2008].
2.7. Create an epsilon of room

In this article I would like to discuss a fundamental trick in "soft"
analysis, sometimes known as the "limiting argument" or "epsilon
regularisation argument".

A quick description of the trick is as follows. Suppose one wants
to prove some statement S_0 about some object x_0 (which could be
a number, a point, a function, a set, etc.). To do so, pick a small
ε > 0, and first prove a weaker statement S_ε about some perturbed or
regularised object x_ε. Then, take limits ε → 0. Provided that the
dependency and continuity of the weaker conclusion S_ε on ε are
sufficiently controlled, and that x_ε is converging to x_0 in an
appropriately strong sense, you will recover the original statement.
One can of course play a similar game when proving a statement
S_0 about some object x_0 by first proving a weaker statement S_ε
about some approximating object x_ε. Some typical instances of this
scheme are:

    S_0                                    |  S_ε
    f(x_0) = g(x_0)                        |  f(x_ε) = g(x_ε) + o(1)
    f(x_0) ≤ g(x_0)                        |  f(x_ε) ≤ g(x_ε) + o(1)
    f(x_0) > 0                             |  f(x_ε) ≥ c − o(1) for some c > 0 independent of ε
    f(x_0) is finite                       |  f(x_ε) is bounded uniformly in ε
    f(x_0) ≥ f(x) for all x ∈ X            |  f(x_ε) ≥ f(x) − o(1) for all x ∈ X
      (i.e. x_0 maximises f)               |    (i.e. x_ε nearly maximises f)
    f_n(x_0) converges as n → ∞            |  f_n(x_ε) converges uniformly in ε

Of course, in order for the limiting argument to pass from S_ε to S_0,
it is necessary that x_ε converge to x_0 (or f_ε converge to f_0, etc.)
in a suitably strong sense. (But for the purposes of proving just upper
bounds, such as f(x_0) ≤ M, one can often get by with quite weak forms of
convergence, thanks to tools such as Fatou's lemma or the weak closure
of the unit ball.) Similarly, we need some continuity (or at least
semi-continuity) hypotheses on the functions f, g appearing above.
It is also necessary in many cases that the control S_ε on the
approximating object x_ε is uniform in ε (or at least has
well-controlled dependence on ε).

Example 2.7.1 (Riemann-Lebesgue lemma). For any absolutely
integrable function f ∈ L¹(R), define the Fourier transform
f̂ : R → C by the formula
    f̂(ξ) := ∫_R f(x) e^{−2πixξ} dx.
The Riemann-Lebesgue lemma asserts that f̂(ξ) → 0 as ξ → ±∞. It
is difficult to prove this estimate for f directly, because this function
is too "rough": it is absolutely integrable (which is enough to ensure
that f̂ exists and is bounded), but need not be continuous, differ-
entiable, compactly supported, bounded, or otherwise "nice". But
suppose we give ourselves an epsilon of room. Then, as the space C_c^∞
of test functions is dense in L¹(R) (Exercise 1.13.5), we can approx-
imate f to any desired accuracy ε > 0 in the L¹ norm by a smooth,
compactly supported function f_ε : R → C, thus
(2.21)    ∫_R |f(x) − f_ε(x)| dx ≤ ε.
The point is that the lemma is easy to establish for the smooth,
compactly supported function f_ε: indeed, an integration by parts yields
    f̂_ε(ξ) = (1/(2πiξ)) ∫_R f_ε′(x) e^{−2πixξ} dx
for any non-zero ξ, and it is now clear (since f_ε′ is bounded and
compactly supported) that f̂_ε(ξ) → 0 as ξ → ±∞.

Now we need to take limits as ε → 0. It will be enough to have f̂_ε
converge uniformly to f̂. But from (2.21) and the basic estimate
(2.22)    sup_ξ |ĝ(ξ)| ≤ ∫_R |g(x)| dx
(which is the single "hard analysis" ingredient in the proof of the
lemma) applied to g := f − f_ε, we see that
    sup_ξ |f̂(ξ) − f̂_ε(ξ)| ≤ ε,
and we obtain the desired uniform convergence.
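For a concrete illustration, take the rough (discontinuous) function f = 1_{[0,1]}: its Fourier transform can be computed in closed form, and one can watch both the decay at infinity and the trivial bound (2.22) in action. A standalone Python sketch (not part of the text):

```python
import cmath, math

def hat_indicator(xi):
    # Fourier transform of f = 1_[0,1]: the integral of e^{-2 pi i x xi}
    # over 0 <= x <= 1, evaluated in closed form.
    if xi == 0:
        return 1.0 + 0j
    return (1 - cmath.exp(-2j * math.pi * xi)) / (2j * math.pi * xi)
```

Here |f̂(ξ)| ≤ 1/(π|ξ|) → 0 as ξ → ±∞, while (2.22) gives the uniform bound |f̂(ξ)| ≤ ∫|f| = 1.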
Remark 2.7.2. The same argument also shows that f̂ is continuous;
we leave this as an exercise to the reader. See also Exercise 1.12.11
for the generalisation of this lemma to other locally compact abelian
groups.

Remark 2.7.3. Example 2.7.1 is a model case of a much more gen-
eral instance of the limiting argument: in order to prove a convergence
or continuity theorem for all rough functions in a function space, it
suffices to first prove convergence or continuity for a dense subclass
of "smooth" functions, and combine that with some quantitative es-
timate in the function space (in this case, (2.22)) in order to justify
the limiting argument. See Corollary 1.7.7 for an important example
of this principle.
Example 2.7.4. The limiting argument in Example 2.7.1 relied on
the linearity of the Fourier transform f ↦ f̂. But, with more effort,
it is also possible to extend this type of argument to nonlinear set-
tings. We will sketch (omitting several technical details, which can be
found for instance in my PDE book [Ta2006]) a very typical instance.
Consider a nonlinear PDE, e.g. the cubic nonlinear wave equation
(2.23)    −u_tt + u_xx = u³
where u : R × R → R is some scalar field, and the t and x subscripts
denote differentiation of the field u(t, x). If u is sufficiently smooth,
and sufficiently decaying at spatial infinity, one can show that the
energy
(2.24)    E(u)(t) := ∫_R (1/2)|u_t(t, x)|² + (1/2)|u_x(t, x)|² + (1/4)|u(t, x)|⁴ dx
is conserved, thus E(u)(t) = E(u)(0) for all t. Indeed, this can be
formally justified by computing the derivative ∂_t E(u)(t) by differenti-
ating under the integral sign, integrating by parts, and then applying
the PDE (2.23); we leave this as an exercise for the reader^7. How-
ever, these justifications do require a fair amount of regularity on
the solution u; for instance, requiring u to be three-times continu-
ously differentiable in space and time, and compactly supported in
space on each bounded time interval, would be sufficient to make the
computations rigorous by applying "off the shelf" theorems about
differentiation under the integral sign, etc.
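As a numerical counterpart to this formal computation, one can discretise (2.23) and watch the energy (2.24) stay approximately constant over time. The following standalone Python sketch uses a standard velocity-Verlet (leapfrog) scheme on a periodic spatial domain; all names and parameter choices are mine, for illustration only:

```python
import math

def nlw_energy_drift(n=256, steps=400):
    # Velocity-Verlet (leapfrog) integration of  -u_tt + u_xx = u^3,
    # i.e. u_tt = u_xx - u^3, on the periodic interval [0, 2*pi).
    L = 2 * math.pi
    dx = L / n
    dt = 0.25 * dx                      # comfortably below the CFL limit dt < dx
    u = [math.sin(i * dx) for i in range(n)]
    v = [0.0] * n                       # u_t at time zero

    def accel(w):
        # discrete u_xx - u^3 with periodic boundary conditions
        return [(w[(i + 1) % n] - 2 * w[i] + w[(i - 1) % n]) / dx ** 2 - w[i] ** 3
                for i in range(n)]

    def energy(w, wt):
        # discretisation of (2.24): 1/2 u_t^2 + 1/2 u_x^2 + 1/4 u^4
        e = 0.0
        for i in range(n):
            ux = (w[(i + 1) % n] - w[i]) / dx
            e += (0.5 * wt[i] ** 2 + 0.5 * ux ** 2 + 0.25 * w[i] ** 4) * dx
        return e

    e0 = energy(u, v)
    a = accel(u)
    for _ in range(steps):
        v_half = [v[i] + 0.5 * dt * a[i] for i in range(n)]
        u = [u[i] + dt * v_half[i] for i in range(n)]
        a = accel(u)
        v = [v_half[i] + 0.5 * dt * a[i] for i in range(n)]
    return abs(energy(u, v) - e0) / abs(e0)
```

The relative drift of the discrete energy stays small (of size O(dt²)), mirroring the exact conservation law for the continuum equation.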
But suppose one only has a much rougher solution, for instance
an "energy class" solution which has finite energy (2.24), but for which
higher derivatives of u need not exist in the classical sense^8. Then
it is difficult to justify the energy conservation law directly. How-
ever, it is still possible to obtain energy conservation by the limiting
argument. Namely, one takes the energy class solution u at some ini-
tial time (e.g. t = 0) and approximates that initial data (the initial
position u(0) and initial velocity u_t(0)) by a much smoother (and com-
pactly supported) choice (u^{(ε)}(0), u^{(ε)}_t(0)) of initial data, which con-
verges back to (u(0), u_t(0)) in a suitable "energy topology" related to
(2.24), which we will not define here (it is based on Sobolev spaces,
which are discussed in Section 1.14). It then turns out (from the
existence theory of the PDE (2.23)) that one can extend the smooth

^7 There are also more fancy ways to see why the energy is conserved, using Hamil-
tonian or Lagrangian mechanics or the more general theory of stress-energy tensors,
but we will not discuss these here.
^8 There is a non-trivial issue regarding how to make sense of the PDE (2.23) when
u is only in the energy class, since the terms u_tt and u_xx do not then make sense
classically, but there are standard ways to deal with this, e.g. using weak derivatives;
see Section 1.13.
initial data (u^{(ε)}(0), u^{(ε)}_t(0)) to other times t, providing a smooth so-
lution u^{(ε)} to that data. For this solution, the energy conservation
law E(u^{(ε)})(t) = E(u^{(ε)})(0) can be justified.

Now we take limits as ε → 0 (keeping t fixed). Since (u^{(ε)}(0), u^{(ε)}_t(0))
converges in the energy topology to (u(0), u_t(0)), and the energy
functional E is continuous in this topology, E(u^{(ε)})(0) converges to
E(u)(0). To conclude the argument, we will also need E(u^{(ε)})(t) to
converge to E(u)(t), which will be possible if (u^{(ε)}(t), u^{(ε)}_t(t)) con-
verges in the energy topology to (u(t), u_t(t)). This in turn follows
from a fundamental fact (which requires a certain amount of effort
to prove) about the PDE (2.23), namely that it is well-posed in
the energy class. This means that not only do solutions exist and are
unique for initial data in the energy class, but they depend continu-
ously on the initial data in the energy topology; small perturbations
in the data lead to small perturbations in the solution, or more for-
mally the map (u(0), u_t(0)) ↦ (u(t), u_t(t)) from data to solution
(say, at some fixed time t) is continuous in the energy topology. This
final fact concludes the limiting argument and gives us the desired
conservation law E(u)(t) = E(u)(0).
Remark 2.7.5. It is important to have a suitable well-posedness
theory in order to make the limiting argument work for rough solu-
tions to a PDE; without such a well-posedness theory, it is possible
for quantities which are formally conserved to cease being conserved
when the solutions become too rough or otherwise "weak"; energy,
for instance, could disappear into a singularity and not come back.

Example 2.7.6 (Maximum principle). The maximum principle is a
fundamental tool in elliptic and parabolic PDE (for example, it is used
heavily in the proof of the Poincaré conjecture, discussed extensively
in Poincaré's legacies, Vol. II). Here is a model example of this
principle:

Proposition 2.7.7. Let u : D → R be a smooth harmonic function
on the closed unit disk D := {(x, y) : x² + y² ≤ 1}. If M is a bound
such that u(x, y) ≤ M on the boundary ∂D := {(x, y) : x² + y² = 1},
then u(x, y) ≤ M on the interior as well.
A naive attempt to prove Proposition 2.7.7 comes very close to
working, and goes like this: suppose for contradiction that the propo-
sition failed, thus u exceeds M somewhere in the interior of the disk.
Since u is continuous, and the disk is compact, there must then be
a point (x_0, y_0) in the interior of the disk where the maximum is
attained. Undergraduate calculus then tells us that u_xx(x_0, y_0) and
u_yy(x_0, y_0) are non-positive, which almost contradicts the harmonic-
ity hypothesis u_xx + u_yy = 0. However, it is still possible that u_xx
and u_yy both vanish at (x_0, y_0), so we don't yet get a contradiction.
But we can finish the proof by giving ourselves an epsilon of
room. The trick is to work not with the function u directly, but with
the modified function u^{(ε)}(x, y) := u(x, y) + ε(x² + y²), to boost the
harmonicity into subharmonicity. Indeed, we have u^{(ε)}_xx + u^{(ε)}_yy = 4ε >
0. The preceding argument now shows that u^{(ε)} cannot attain its
maximum in the interior of the disk; since it is bounded by M + ε
on the boundary of the disk, we conclude that u^{(ε)} is bounded by
M + ε on the interior of the disk as well. Sending ε → 0, we obtain
the claim.
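Proposition 2.7.7 is easy to probe numerically: for a sample harmonic function such as u(x, y) = x² − y², the interior values never exceed the boundary maximum. A standalone Python sketch (the grid scan is my own illustration, not from the text):

```python
import math

def max_on_disk(u, n=400):
    # Scan u over a grid on the closed unit disk, returning the max over
    # strictly interior grid points and the max over the boundary circle.
    interior = -float("inf")
    for i in range(n):
        for j in range(n):
            x = -1 + (2 * i + 1) / n
            y = -1 + (2 * j + 1) / n
            if x * x + y * y < 1:
                interior = max(interior, u(x, y))
    boundary = max(u(math.cos(2 * math.pi * k / (4 * n)),
                     math.sin(2 * math.pi * k / (4 * n))) for k in range(4 * n))
    return interior, boundary
```

For u(x, y) = x² − y² the boundary maximum is 1 (attained at (±1, 0)), and every interior sample lies strictly below it.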
Remark 2.7.8. Of course, Proposition 2.7.7 can also be proven by
much more direct means, for instance via the Green's function for the
disk. However, the argument given is extremely robust and applies
to a large class of both linear and nonlinear elliptic and parabolic
equations, including those with rough variable coefficients.

Exercise 2.7.1. Use the maximum modulus principle to prove the
Phragmén-Lindelöf principle: if f is complex analytic on the strip
{z : 0 ≤ Re(z) ≤ 1}, is bounded in magnitude by 1 on the boundary
of this strip, and obeys a growth condition |f(z)| ≤ Ce^{|z|^C} on the
interior of the strip, then show that f is bounded in magnitude by
1 throughout the strip. (Hint: multiply f by e^{εz^m} for some even
integer m.) See Section 1.11 for some applications of this principle to
interpolation theory.
Example 2.7.9 (Manipulating generalised functions). In PDE one
is primarily interested in smooth (classical) solutions; but for a vari-
ety of reasons it is useful to also consider rougher solutions. Some-
times, these solutions are so rough that they are no longer functions,
but are measures, distributions (see Section 1.13), or some other con-
cept of "generalised function" or "generalised solution". For instance,
the fundamental solution to a PDE is typically just a distribution
or measure, rather than a classical function. A typical example: a
(sufficiently smooth) solution to the three-dimensional wave equation
−u_tt + Δu = 0 with initial position u(0, x) = 0 and initial velocity
u_t(0, x) = g(x) is given by the classical formula
    u(t) = t g ∗ σ_t
for t > 0, where σ_t is the unique rotation-invariant probability mea-
sure on the sphere S_t := {(x, y, z) ∈ R³ : x² + y² + z² = t²} of radius
t, or equivalently, the area element dS on that sphere divided by the
surface area 4πt² of that sphere. (The convolution f ∗ μ of a smooth
function f and a (compactly supported) finite measure μ is defined
by f ∗ μ(x) := ∫ f(x − y) dμ(y); one can also use the distributional
convolution defined in Section 1.13.)
For this and many other reasons, it is important to manipulate
measures and distributions in various ways. For instance, in addition
to convolving functions with measures, it is also useful to convolve
measures with measures; the convolution μ ∗ ν of two finite measures
on R^n is defined as the measure which assigns to each measurable set
E in R^n, the measure
(2.25)    μ ∗ ν(E) := ∫∫ 1_E(x + y) dμ(x) dν(y).
For sake of concreteness, let's focus on a specific question, namely
to compute (or at least estimate) the measure σ ∗ σ, where σ is the nor-
malised rotation-invariant measure on the unit circle {x ∈ R² : |x| =
1}. It turns out that while σ is not absolutely continuous with respect
to Lebesgue measure m, the convolution σ ∗ σ is: d(σ ∗ σ) = f dm for some
absolutely integrable function f on R². But what is this function f?
It certainly is possible to compute it from the definition (2.25), or
by other methods (e.g. the Fourier transform), but I would like to
give one approach to computing these sorts of expressions involving
measures (or other generalised functions) based on epsilon regularisa-
tion, which requires a certain amount of geometric computation but
which I find to be rather visual and conceptual, compared to more
algebraic approaches (e.g. based on Fourier transforms). The idea is
to approximate a singular object, such as the singular measure σ, by
a smoother object
    σ_ε := (1/m(A_ε)) 1_{A_ε} dm
where A_ε := {x ∈ R² : 1 ≤ |x| ≤ 1 + ε} is a thin annular
neighbourhood of the unit circle. It is clear that σ_ε converges to σ in
the vague topology, which implies that σ_ε ∗ σ_ε converges to σ ∗ σ in
the vague topology also. Since
    σ_ε ∗ σ_ε = (1/m(A_ε)²) 1_{A_ε} ∗ 1_{A_ε} dm,
we will be able to understand the limit f by first considering the
function
    f_ε(x) := (1/m(A_ε)²) 1_{A_ε} ∗ 1_{A_ε}(x) = m(A_ε ∩ (x − A_ε)) / m(A_ε)²
and then taking (weak) limits as ε → 0 to recover f.
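One can also probe σ ∗ σ numerically before doing any geometry: adding two independent uniform points on the circle draws samples from σ ∗ σ, whose support and total mass are then easy to inspect. A standalone Python sketch (my own naming; the exact density is what the limiting computation in the text is after):

```python
import math, random

def circle_selfconvolution_radii(n=50_000, seed=0):
    # Sample from sigma * sigma by adding two independent uniform points
    # on the unit circle; return the radii |x| of the samples.
    rng = random.Random(seed)
    radii = []
    for _ in range(n):
        t1 = rng.uniform(0, 2 * math.pi)
        t2 = rng.uniform(0, 2 * math.pi)
        x = math.cos(t1) + math.cos(t2)
        y = math.sin(t1) + math.sin(t2)
        radii.append(math.hypot(x, y))
    return radii
```

The samples all land in the disk |x| ≤ 2, as they must, and the mean radius approaches 4/π (since |x| = 2|cos((t1 − t2)/2)| and the average of |cos| is 2/π).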
Up to constants, one can compute m(A_ε ∩ (x − A_ε)) from elementary
geometry, and taking limits of these smoothed-out approximations
recovers the density f. (In other situations the limits may become
unbounded, but they are at least still distributions.)

2.8. Amenability

All ℓ^p spaces in this section are real-valued. The cardinality of a
finite set A is denoted |A|. The symmetric difference of two sets A, B
is denoted A Δ B. Given a countable group G, we write
⟨f, g⟩ := Σ_{x∈G} f(x)g(x) whenever the right-hand side is convergent,
and for x ∈ G we let τ_x denote the left-translation operator
τ_x f(y) := f(x^{−1}y).

A finite mean is a non-negative, finitely supported function ν :
G → R^+ such that ‖ν‖_{ℓ¹(G)} = 1. A mean is a non-negative linear
functional λ : ℓ^∞(G) → R such that λ(1) = 1. Every finite mean ν
can be viewed as a mean, by the formula
    λ(f) := ⟨f, ν⟩.
The following equivalences were established by Følner [Fo1955]:

Theorem 2.8.1. Let G be a countable group. Then the following are
equivalent:

(i) There exists a left-invariant mean λ : ℓ^∞(G) → R, i.e. a
mean such that λ(τ_x f) = λ(f) for all f ∈ ℓ^∞(G) and x ∈ G.

(ii) For every finite set S ⊂ G and every ε > 0, there exists a
finite mean ν such that ‖τ_x ν − ν‖_{ℓ¹(G)} ≤ ε for all x ∈ S.

(iii) For every finite set S ⊂ G and every ε > 0, there exists a
non-empty finite set A ⊂ G such that |(x · A) Δ A|/|A| ≤ ε
for all x ∈ S.

(iv) There exists a sequence A_n of non-empty finite sets such that
|x · A_n Δ A_n|/|A_n| → 0 as n → ∞ for each x ∈ G. (Such a
sequence is called a Følner sequence.)
Proof. We shall use an argument of Namioka [Na1964].

(i) implies (ii): Suppose for contradiction that (ii) failed. Then
there exist S and ε > 0 such that sup_{x∈S} ‖τ_x ν − ν‖_{ℓ¹(G)} > ε for
all finite means ν. The set {(τ_x ν − ν)_{x∈S} : ν a finite mean} is then
a convex subset of (ℓ¹(G))^S that is bounded away from zero. Applying
the Hahn-Banach separation theorem (Theorem 1.5.14), there thus exists
a linear functional φ ∈ ((ℓ¹(G))^S)^* such that φ((τ_x ν − ν)_{x∈S}) ≥ 1
for all finite means ν. Since ((ℓ¹(G))^S)^* ≡ (ℓ^∞(G))^S, there thus exist
m_x ∈ ℓ^∞(G) for x ∈ S such that Σ_{x∈S} ⟨τ_x ν − ν, m_x⟩ ≥ 1 for all
finite means ν, thus
    ⟨ν, Σ_{x∈S} τ_{x^{−1}} m_x − m_x⟩ ≥ 1.
Specialising ν to the Kronecker means δ_y, we see that
Σ_{x∈S} τ_{x^{−1}} m_x − m_x ≥ 1 pointwise. Applying the mean λ,
we conclude that
    Σ_{x∈S} λ(τ_{x^{−1}} m_x) − λ(m_x) ≥ 1.
But this contradicts the left-invariance of λ.
(ii) implies (iii): Fix S (which we can take to be non-empty), and
let ε > 0. By (ii) we can find a finite mean ν such that
    ‖τ_x ν − ν‖_{ℓ¹(G)} < ε/|S|
for all x ∈ S.

Using the layer-cake decomposition, we can write ν = Σ_{i=1}^k c_i 1_{E_i}
for some nested non-empty sets E_1 ⊃ E_2 ⊃ ... ⊃ E_k and some
positive constants c_i. As ν is a mean, we have Σ_{i=1}^k c_i |E_i| = 1. On
the other hand, observe that |τ_x ν − ν| is at least c_i on (x · E_i) Δ E_i.
We conclude that
    Σ_{i=1}^k c_i |(x · E_i) Δ E_i| ≤ (ε/|S|) Σ_{i=1}^k c_i |E_i|
for all x ∈ S, and thus
    Σ_{i=1}^k c_i Σ_{x∈S} |(x · E_i) Δ E_i| ≤ ε Σ_{i=1}^k c_i |E_i|.
By the pigeonhole principle, there thus exists i such that
    Σ_{x∈S} |(x · E_i) Δ E_i| ≤ ε |E_i|,
and the claim follows.

(iii) implies (iv): Write the countable group G as the increasing
union of finite sets S_n, and apply (iii) with ε := 1/n and S := S_n to
create the set A_n.
(iv) implies (i): Use the Hahn-Banach theorem to select an in-
finite mean λ_∞ ∈ (ℓ^∞(N))^* \ ℓ¹(N), and define
    λ(m) := λ_∞( (⟨m, (1/|A_n|) 1_{A_n}⟩)_{n∈N} ).
(Alternatively, one can define λ(m) to be an ultralimit of the
⟨m, (1/|A_n|) 1_{A_n}⟩.) The Følner property of the A_n then guarantees
that λ is left-invariant. ∎
This machinery can be used to show, for example, that amenability
is preserved under group extensions: if H is an amenable normal
subgroup of G such that the quotient group K = G/H is also amenable
(with invariant means λ_H, λ_K respectively), then G itself is amenable.
Using invariant means, one defines
    λ_G(f) := λ_K(F)
for any f ∈ ℓ^∞(G), where F ∈ ℓ^∞(K) is the function F(xH) :=
λ_H(h ↦ f(xh)) (the left-invariance of λ_H shows that the exact choice
of coset representative x is irrelevant). (One can view λ_G as sort of
a product measure of the λ_H and λ_K.)
Now we argue using Følner sequences instead. Let E_n, F_n be
Følner sequences for H, K respectively. Let S be a finite subset of
G, and let ε > 0. We would like to find a finite non-empty subset
A ⊂ G such that |(xA) Δ A| ≤ ε|A| for all x ∈ S; this will demonstrate
amenability. (Note that by taking S to be symmetric, we can replace
|(x · A) Δ A| with |(x · A) \ A| without difficulty.)

By taking n large enough, we can find F_n such that π(x) · F_n
differs from F_n by at most ε|F_n|/2 elements for all x ∈ S, where
π : G → K is the projection map. Now, let F̃_n be a preimage of F_n
in G (one representative for each coset). Let T be the set of all group
elements t ∈ H such that S · F̃_n intersects F̃_n · t. Observe that T is
finite. Thus, by taking m large enough, we can find E_m such that
t · E_m differs from E_m by at most ε|E_m|/(2|T|) points for all t ∈ T.
Now set A := F̃_n · E_m = {zy : y ∈ E_m, z ∈ F̃_n}. Observe that
the sets z · E_m for z ∈ F̃_n lie in disjoint cosets of H, and so |A| =
|E_m||F̃_n| = |E_m||F_n|. Now take x ∈ S, and consider an element of
(x · A) \ A. This element must take the form xzy for some y ∈ E_m and
z ∈ F̃_n. The coset of H that xzy lies in is given by π(x)π(z). Suppose
first that π(x)π(z) lies outside of F_n. By construction, this occurs for
at most ε|F_n|/2 choices of z, leading to at most ε|E_m||F_n|/2 = ε|A|/2
elements in (x · A) \ A.

Now suppose instead that π(x)π(z) lies in F_n. Then we have
xz = z′t for some z′ ∈ F̃_n and t ∈ T, by construction of T, and
so xzy = z′(ty). Such an element lies outside of A only when ty lies
outside of E_m; by construction of E_m, this occurs for at most
ε|E_m|/(2|T|) choices of y for each z, leading to at most ε|E_m||F_n|/2 =
ε|A|/2 further elements in (x · A) \ A. Summing, we conclude that
|(x · A) \ A| ≤ ε|A| for all x ∈ S, as desired.

A similar circle of ideas shows that the increasing union G = ⋃_n G_n
of a sequence of amenable groups G_n is amenable; for instance, using
invariant means, an invariant mean on ℓ^∞(G_n) induces a mean on
ℓ^∞(G), and an appropriate limit of these means is left-invariant.
3.1. An explicitly solvable equation

Some time ago I was studying the nonlinear wave equation
(3.3)    −φ_tt + φ_xx = e^φ,
a nonlinear perturbation of the free wave equation
(3.1)    −φ_tt + φ_xx = 0,
whose general (formal) solution is given by
(3.2)    φ(t, x) = f(t + x) + g(t − x)
for arbitrary functions f, g, and I was surprised to discover that the
equation (3.3) can also be explicitly solved in closed form. (For the
reason why I was interested in this equation, see [Ta2010].)

A posteriori, I now know the reason for this explicit solvability;
(3.3) is the limiting case a = 0, b → −∞ of the more general equation
    −φ_tt + φ_xx = e^{φ+a} − e^{−φ+b}
which (after applying the simple transformation
φ(t, x) = (b − a)/2 + φ̃(√2 e^{(a+b)/4} t, √2 e^{(a+b)/4} x))
becomes the sinh-Gordon equation
    −φ̃_tt + φ̃_xx = sinh(φ̃)
(a close cousin of the more famous sine-Gordon equation −φ_tt + φ_xx =
sin(φ)), which is known to be completely integrable, and exactly solv-
able. However, I only realised this after the fact, and stumbled upon
the explicit solution to (3.3) by much more classical and elementary
means. I thought I might share the computations here, as I found
them somewhat cute, and they seem to serve as an example of how one
might go about finding explicit solutions to PDE in general; accord-
ingly, I will take a rather pedestrian approach to describing the hunt
for the solution, rather than presenting the shortest or slickest route
to the answer.

After the initial publishing of this post, Patrick Dorey pointed
out to me that (3.3) is extremely classical; it is known as Liouville's
equation and was solved by Liouville [Li1853], with essentially the
same solution as presented here.
3.1.1. Symmetries. To simplify the discussion let us ignore all is-
sues of regularity, division by zero, taking square roots and logarithms
of negative numbers, etc., and proceed for now in a purely formal fash-
ion, pretending that all functions are smooth and lie in the domain of
whatever algebraic operations are being performed. (It is not too dif-
ficult to go back after the fact and justify these formal computations,
but I do not wish to focus on that aspect of the problem here.)

Although not strictly necessary for solving the equation (3.3), I
find it convenient to bear in mind the various symmetries that (3.3)
enjoys, as this provides a useful reality check to guard against errors
(e.g. arriving at a class of solutions which is not invariant under
the symmetries of the original equation). These symmetries are also
useful to normalise various special families of solutions.

One easily sees that solutions to (3.3) are invariant under space-
time translations
(3.4)    φ(t, x) ↦ φ(t − t_0, x − x_0)
and also spacetime reflections
(3.5)    φ(t, x) ↦ φ(±t, ±x).
Being relativistic, the equation is also invariant under Lorentz trans-
formations
(3.6)    φ(t, x) ↦ φ( (t − vx)/√(1 − v²), (x − vt)/√(1 − v²) ).
Finally, one has the scaling symmetry
(3.7)    φ(t, x) ↦ φ(λt, λx) + 2 log λ.
3.1.2. Solution. Henceforth φ will be a solution to (3.3). In view
of the linear explicit solution (3.2), it is natural to move to null coor-
dinates
    u = t + x,    v = t − x,
thus
    ∂_u = (1/2)(∂_t + ∂_x);    ∂_v = (1/2)(∂_t − ∂_x),
and (3.3) becomes
(3.8)    φ_uv = −(1/4) e^φ.
The various symmetries (3.4)-(3.7) can of course be rephrased in terms
of null coordinates in a straightforward manner. The Lorentz sym-
metry (3.6) simplifies particularly nicely in null coordinates, to
(3.9)    φ(u, v) ↦ φ(λu, λ^{−1}v).
Motivated by the general theory of stress-energy tensors of relativistic
wave equations (of which (3.3) is a very simple example), we now look
at the null energy densities φ_u², φ_v². For the linear wave equation (3.1)
(or equivalently φ_uv = 0), these null energy densities are transported
in null directions:
(3.10)    ∂_v φ_u² = 0;    ∂_u φ_v² = 0.
(One can also see this from the explicit solution (3.2).)
The above transport law isn't quite true for the nonlinear wave
equation, of course, but we can hope to get some usable substitute.
Let us just look at the first null energy φ_u² for now. By two applica-
tions of (3.8), this density obeys the transport equation
    ∂_v φ_u² = 2φ_u φ_uv = −(1/2) φ_u e^φ = −(1/2) ∂_u(e^φ) = 2∂_u φ_uv = ∂_v(2φ_uu)
and thus we have the pointwise conservation law
    ∂_v( φ_u² − 2φ_uu ) = 0,
which implies that
(3.11)    −(1/2) φ_uu + (1/4) φ_u² = U(u)
for some function U : R → R depending only on u. Similarly we have
    −(1/2) φ_vv + (1/4) φ_v² = V(v)
for some function V : R → R depending only on v.
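The assertion that U depends only on u can be checked numerically against an explicit solution. The standalone Python sketch below (my own naming) uses a singular Liouville-type solution φ = −2 log(1 + uv/8) — of the kind constructed later in this section, for which U actually vanishes — and verifies by finite differences that (1/4)φ_u² − (1/2)φ_uu is independent of v:

```python
import math

def phi(u, v):
    # Singular Liouville-type solution in null coordinates:
    # phi = -2 log psi with psi = 1 + u v / 8 (valid where psi > 0)
    return -2.0 * math.log(1.0 + u * v / 8.0)

def null_energy_U(u, v, h=1e-4):
    # Finite-difference approximation to U = (1/4) phi_u^2 - (1/2) phi_uu,
    # which should depend on u only (and in fact vanishes for this solution).
    pu = (phi(u + h, v) - phi(u - h, v)) / (2 * h)
    puu = (phi(u + h, v) - 2 * phi(u, v) + phi(u - h, v)) / h ** 2
    return 0.25 * pu * pu - 0.5 * puu
```

Evaluating at a fixed u and two different values of v gives the same answer (zero) up to discretisation error.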
For any fixed v, (3.11) is a nonlinear ODE in u. To solve it, we can
first look at the homogeneous ODE
(3.12)    −(1/2) φ_uu + (1/4) φ_u² = 0.
Undergraduate ODE methods (e.g. separation of variables, after sub-
stituting w := φ_u) soon reveal that the general solution to this ODE
is given by φ(u) = −2 log(u + C) + D for arbitrary constants C, D
(ignoring the issue of singularities or degeneracies for now). Equiva-
lently, (3.12) is obeyed if and only if e^{−φ/2} is linear in u. Motivated
by this, we become tempted to rewrite (3.11) in terms of ψ := e^{−φ/2}.
One soon realises that
    ψ_uu = ( −(1/2) φ_uu + (1/4) φ_u² ) ψ
and hence (3.11) becomes
(3.13)    ( −∂_uu + U(u) ) ψ = 0;
thus ψ is a null (generalised) eigenfunction of the Schrödinger oper-
ator (or Hill operator) −∂_uu + U(u). If we let a(u) and b(u) be two
linearly independent solutions to the ODE
(3.14)    −f_uu + U f = 0,
we thus have
(3.15)    ψ = a(u)c(v) + b(u)d(v)
for some functions c, d (which one easily verifies to be smooth, since
ψ, a, b are smooth and a, b are linearly independent). Meanwhile,
by playing around with the second null energy density we have the
counterpart to (3.13),
    ( −∂_vv + V(v) ) ψ = 0,
and hence (by linear independence of a, b) c, d must be solutions to
the ODE
    −g_vv + V g = 0.
This would be a good time to pause and see whether our implications
are reversible, i.e. whether any ψ that obeys the relation (3.15) will
solve (3.3) or (3.8). It is of course natural to first write (3.8) in
terms of ψ. Since
    ψ_u = −(1/2) φ_u ψ;    ψ_v = −(1/2) φ_v ψ;    ψ_uv = ( (1/4) φ_u φ_v − (1/2) φ_uv ) ψ,
one soon sees that (3.8) is equivalent to
(3.16)    ψ_uv = ψ_u ψ_v / ψ + 1/(8ψ).
If we then insert the ansatz (3.15), we soon reformulate the above
equation as
    ( a(u)b′(u) − b(u)a′(u) ) ( c(v)d′(v) − d(v)c′(v) ) = 1/8.
It is at this time that one should remember the classical fact that if a,
b are two solutions to the ODE (3.14), then the Wronskian ab′ − ba′
is constant; similarly the Wronskian cd′ − dc′ is constant. Thus we
may normalise ab′ − ba′ = β and cd′ − dc′ = 1/(8β) for some non-zero
constant β; conversely, given a, one can then integrate (b/a)′ = β/a² to
recover b. (This doesn't quite work at the locations when a vanishes,
but there are a variety of ways to resolve that; as I said above, we are
ignoring this issue for the purposes of this discussion.)
This is not the only way to express solutions. Factoring out a(u)d(v)
(say) from (3.15), we see that ψ is the product of the function a(u)d(v)
— whose logarithm log a(u) + log d(v) solves the linear wave equation —
with the function G := c(v)/d(v) + b(u)/a(u), which also solves the
linear wave equation. Thus we may write φ = F − 2 log G, where
F := −2 log(a(u)d(v)) and G both solve the linear wave equation.
Inserting this ansatz back into (3.3) we obtain
    2( −G_t² + G_x² ) / G² = e^F / G²
and so we see that
(3.17)    φ = log( 2( −G_t² + G_x² ) / G² ) = log( −8 G_u G_v / G² )
for some solution G to the free wave equation; conversely, every
expression of the form (3.17) can be verified to solve (3.3) (since
log 2( −G_t² + G_x² ) = log( −8 G_u G_v ) does indeed solve the free wave
equation, thanks to (3.2)). Inserting (3.2) into (3.17) we thus obtain
the explicit solution
(3.18)    φ = log( −8 f′(t + x) g′(t − x) / (f(t + x) + g(t − x))² )
to (3.3), where f and g are arbitrary functions (recall that we are
neglecting issues such as whether the quotient and the logarithm are
well-defined).

I, for one, would not have expected the solution to take this form.
But it is instructive to check that (3.18) does at least respect all the
symmetries (3.4)-(3.7).
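Before trusting a formula of this shape, it is also reassuring to verify it numerically for a concrete choice of f and g. The standalone Python sketch below (my own naming and sample profiles, chosen so that f′g′ < 0 and the logarithm is defined) checks via centred finite differences that the residual of the equation vanishes to discretisation accuracy:

```python
import math

def phi(t, x):
    # phi = log(-8 f'(t+x) g'(t-x) / (f(t+x) + g(t-x))^2)
    # with the sample profiles f(u) = 2 + tanh(u) and g(v) = -v.
    u, v = t + x, t - x
    fp = 1.0 / math.cosh(u) ** 2      # f'(u) = sech^2(u) > 0
    gp = -1.0                         # g'(v) < 0
    return math.log(-8.0 * fp * gp / (2.0 + math.tanh(u) - v) ** 2)

def residual(t, x, h=1e-3):
    # centred-difference residual of  -phi_tt + phi_xx - e^phi
    phi_tt = (phi(t + h, x) - 2 * phi(t, x) + phi(t - h, x)) / h ** 2
    phi_xx = (phi(t, x + h) - 2 * phi(t, x) + phi(t, x - h)) / h ** 2
    return -phi_tt + phi_xx - math.exp(phi(t, x))
```

At any sample point away from the singularities of the formula, the residual is of size O(h²) rather than O(1), as it should be for a genuine solution.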
3.1.3. Some special solutions. If we set U = V = 0, then a, b, c, d
are linear functions, and so ψ is affine-linear in u, v. One also checks
that the uv term in ψ cannot vanish. After translating in u and v, we
end up with the ansatz ψ(u, v) = c_1 + c_2 uv for some constants c_1, c_2;
applying (3.16) we see that c_1 c_2 = 1/8, and by using the scaling
symmetry (3.7) we may normalise e.g. c_1 = 1, c_2 = 1/8, and so we
arrive at the (singular) solution
(3.19)    φ = −2 log( 1 + uv/8 ) = log( 64 / (8 + t² − x²)² ).
To express this solution in the form (3.18), one can take f(u) = 8/u
and g(v) = v; some other choices of f, g are also possible. (Determining
the extent to which f, g are uniquely determined by φ in general can
be established from a closer inspection of the previous arguments,
and is left as an exercise.)
We can also look at what happens when φ is constant in space,
i.e. it solves the ODE
    −φ_tt = e^φ.
Solving this ODE (and normalising the constants of integration using
the symmetries (3.4), (3.7)), one obtains solutions such as
(3.20)    φ = −2 log cosh( t/√2 ).
To express this solution in the form (3.18), one can take for instance
f(u) = e^{u/√2} and g(v) = e^{−v/√2}.

One can of course push around (3.19), (3.20) by the symmetries
(3.4)-(3.7) to generate a few more special solutions.
Notes. This article first appeared at terrytao.wordpress.com/2009/01/22.
Thanks to Jake K. for corrections.

There was some interesting discussion online regarding whether
the heat equation had a natural relativistic counterpart, and more
generally whether it was profitable to study non-relativistic equations
via relativistic approximations.
3.2. Infinite fields, finite fields, and the
Ax-Grothendieck theorem

Jean-Pierre Serre (whose papers are, of course, always worth reading)
recently wrote a lovely article [Se2009] in which he describes several
ways in which algebraic statements over fields of zero characteristic,
such as C, can be deduced from their positive characteristic counter-
parts such as F_{p^m}, despite the fact that there is no non-trivial field
homomorphism between the two types of fields. In particular, finitary
tools, including such basic concepts as cardinality, can now be de-
ployed to establish infinitary results. This leads to some simple and
elegant proofs of non-trivial algebraic results which are not easy to
establish by other means.

One deduction of this type is based on the idea that positive
characteristic fields can partially model zero characteristic fields, and
proceeds like this: if a certain algebraic statement failed over (say)
C, then there should be a "finitary algebraic" obstruction that wit-
nesses this failure over C. Because this obstruction is both finitary
and algebraic, it must also be definable in some (large) finite charac-
teristic, thus leading to a comparable failure over a finite character-
istic field. Taking contrapositives, one obtains the claim.

Algebra is definitely not my own field of expertise, but it is inter-
esting to note that similar themes have also come up in my own area
of additive combinatorics (and more generally arithmetic combina-
torics), because the combinatorics of addition and multiplication on
finite sets is definitely of a finitary algebraic nature. For instance,
a recent paper of Vu, Wood, and Wood [VuWoWo2010] establishes
a finitary "Freiman-type" homomorphism from (finite subsets of) the
complex numbers to large finite fields that allows them to pull back
many results in arithmetic combinatorics in finite fields (e.g. the sum-
product theorem) to the complex plane. Van Vu and I also used a
similar trick in [TaVu2007] to control the singularity property of ran-
dom sign matrices by first mapping them into finite fields in which
cardinality arguments became available. And I have a particular
fondness for correspondences between finitary and infinitary mathe-
matics; the correspondence Serre discusses is slightly different from
the one I discuss for instance in Section 1.3 of Structure and Random-
ness, although there seems to be a common theme of "compactness"
(or of model theory) tying these correspondences together.
As one of his examples, Serre cites one of my own favourite results in algebra, discovered independently by Ax [Ax1968] and by Grothendieck [Gr1966] (and then rediscovered many times since). Here is a special case of that theorem:

Theorem 3.2.1 (Ax-Grothendieck theorem, special case). Let P : C^n → C^n be a polynomial map from a complex vector space to itself. If P is injective, then P is bijective.
The full version of the theorem allows one to replace C^n by an algebraic variety X over any algebraically closed field, and for P to be a morphism from the algebraic variety X to itself, but for simplicity
I will just discuss the above special case. This theorem is not at all obvious; it is not too difficult (see Lemma 3.2.6 below) to show that the Jacobian of P is non-degenerate, but this does not come close to solving the problem since one would then be faced with the notorious Jacobian conjecture. Also, the claim fails if "polynomial" is replaced by "holomorphic", due to the existence of Fatou-Bieberbach domains.
In this post I would like to give the proof of Theorem 3.2.1 based on finite fields as mentioned by Serre, as well as another elegant proof of Rudin [Ru1995] that combines algebra with some elementary complex variable methods. (There are several other proofs of this theorem and its generalisations, for instance a topological proof by Borel [Bo1969], which I will not discuss here.)
3.2.1. Proof via finite fields. The first observation is that the theorem is utterly trivial in the finite field case:

Theorem 3.2.2 (Ax-Grothendieck theorem in F). Let F be a finite field, and let P : F^n → F^n be a polynomial. If P is injective, then P is bijective.

Proof. Any injection from a finite set to itself is necessarily bijective. (The hypothesis that P is a polynomial is not needed at this stage, but becomes crucial later on.)
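The triviality of the finite field case can be seen concretely. The following sketch (an illustration only, not part of the argument; the prime p = 7 and the map P are arbitrary choices, with P injective because it has the explicit inverse (u, v) ↦ (u − v², v)) checks the pigeonhole step by brute force:

```python
from itertools import product

p = 7  # a small prime; F_p = {0, ..., p-1}

def P(x, y):
    # A polynomial map on F_p^2 chosen (for illustration) to be injective:
    # it has the explicit inverse (u, v) -> (u - v^2, v).
    return ((x + y * y) % p, y)

domain = list(product(range(p), repeat=2))
images = [P(x, y) for x, y in domain]

# Injectivity of P ...
assert len(set(images)) == len(domain)
# ... forces surjectivity, since domain and codomain are the same finite set.
assert set(images) == set(domain)
```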
Next, we pass from a finite field F to its algebraic closure F̄.

Theorem 3.2.3 (Ax-Grothendieck theorem in F̄). Let F be a finite field, let F̄ be its algebraic closure, and let P : F̄^n → F̄^n be a polynomial. If P is injective, then P is bijective.
Proof. Our main tool here is Hilbert's nullstellensatz, which we interpret here as an assertion that if an algebraic problem is insoluble, then there exists a finitary algebraic obstruction that witnesses this lack of solution (see also Section 1.15 of Structure and Randomness). Specifically, suppose for contradiction that we can find a polynomial P : F̄^n → F̄^n which is injective but not surjective. Injectivity of P means that the algebraic system

P(x) = P(y); x ≠ y
has no solution over the algebraically closed field F̄; by the nullstellensatz, this implies that there must exist an algebraic identity of the form

(3.21) (P(x) − P(y)) · Q(x, y) = (x − y)^r

for some r ≥ 1 and some polynomial Q : F̄^n × F̄^n → F̄^n that specifically witnesses this lack of solvability. Similarly, lack of surjectivity means the existence of a z_0 ∈ F̄^n such that the algebraic system

P(x) = z_0

has no solution over the algebraically closed field F̄; by another application of the nullstellensatz, there must exist an algebraic identity of the form

(3.22) (P(x) − z_0) · R(x) = 1

for some polynomial R : F̄^n → F̄^n that specifically witnesses this lack of solvability.
Fix Q, z_0, R as above, and let k be the subfield of F̄ generated by F and the coefficients of P, Q, z_0, R. Then we observe (thanks to our explicit witnesses (3.21), (3.22)) that the counterexample P descends from F̄ to k; P is a polynomial from k^n to k^n which is injective but not surjective.

But k is finitely generated, and every element of k is algebraic over the finite field F, thus k is finite. But this contradicts Theorem 3.2.2.
Remark 3.2.4. As pointed out to me by L. Spice, there is a simpler proof of Theorem 3.2.3 that avoids the nullstellensatz: one observes from Theorem 3.2.2 that P is bijective over any finite extension of F that contains all of the coefficients of P, and the claim then follows by taking limits.
The complex case C follows by a slight extension of the argument used to prove Theorem 3.2.3. Indeed, suppose for contradiction that there is a polynomial P : C^n → C^n which is injective but not surjective. As C is algebraically closed (the fundamental theorem of algebra), we may invoke the nullstellensatz as before and find witnesses (3.21), (3.22) for some Q, z_0, R.
Now let k = Q[𝒞] be the subfield of C generated by the rationals Q and the set 𝒞 of coefficients of P, Q, z_0, R. Then we can descend the counterexample to k. This time, k is not finite, but we can descend it to a finite field (and obtain the desired contradiction) by a number of methods. One approach, which is the one taken by Serre, is to quotient the ring Z[𝒞] generated by the above coefficients by a maximal ideal, observing that this quotient is necessarily a finite field. Another is to use a general mapping theorem of Vu, Wood, and Wood [VuWoWo2010]. We sketch the latter approach as follows.
Being finitely generated, we know that k has a finite transcendence basis α_1, …, α_m over Q. Applying the primitive element theorem, we can then express k as the finite extension of Q[α_1, …, α_m] by an element θ which is algebraic over Q[α_1, …, α_m]; all the coefficients 𝒞 are thus rational combinations of α_1, …, α_m, θ. By rationalising, we can ensure that the denominators of the expressions of these coefficients are integers in Z[α_1, …, α_m]; dividing by an appropriate power of the product of these denominators we may assume that the coefficients in 𝒞 all lie in the commutative ring Z[α_1, …, α_m, θ], which can be identified with the commutative ring Z[a_1, …, a_m, b] generated by formal indeterminates a_1, …, a_m, b, quotiented by the ideal generated by the minimal polynomial f ∈ Z[a_1, …, a_m, b] of θ; the algebraic identities (3.21), (3.22) then transfer to this ring.
Now pick a large prime p, and map a_1, …, a_m to random elements of F_p. With high probability, the image of f (which is now in F_p[b]) is non-degenerate; we can then map b to a root of this image in a finite extension of F_p. (In fact, by using the Chebotarev density theorem (or the Frobenius density theorem), we can place b back in F_p for infinitely many primes p.) This descends the identities (3.21), (3.22) to this finite extension, as desired.
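The descent step can be illustrated in miniature. In the sketch below (a toy example, not taken from the text: we take θ = √2, so the minimal polynomial is f(b) = b² − 2, and the identity (1 + θ)(θ − 1) = 1 plays the role of the witnesses (3.21), (3.22)), reducing mod a prime p for which f has a root in F_p transfers the identity to a finite field:

```python
p = 7  # a prime for which 2 is a quadratic residue (3^2 = 9 = 2 mod 7)

# The minimal polynomial of theta = sqrt(2) is f(b) = b^2 - 2; its image
# in F_p[b] has a root, so theta can be mapped into F_p itself.
root = next(b for b in range(p) if (b * b - 2) % p == 0)

# The identity (1 + theta)(theta - 1) = 1 holds in Z[theta]; its image
# under theta -> root must therefore hold in F_p:
assert (1 + root) * (root - 1) % p == 1
```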
Remark 3.2.5. This argument can be generalised substantially; it can be used to show that any first-order sentence in the language of fields is true in all algebraically closed fields of characteristic zero if and only if it is true for all algebraically closed fields of sufficiently large characteristic. This result can be deduced from the famous result (proved by Tarski [Ta1951], and independently, in an equivalent formulation, by Chevalley) that the theory of algebraically closed fields (in the language of rings) admits elimination of quantifiers. See for instance [PCM, Section IV.23.4]. There are also analogues for real closed fields, starting with the paper of Bialynicki-Birula and Rosenlicht [BiRo1962], with a general result established by Kurdyka [Ku1999]. Ax-Grothendieck type properties in other categories have been studied by Gromov [Gr1999], who calls this property "surjunctivity".
3.2.2. Rudin's proof. Now we give Rudin's proof, which does not use the nullstellensatz, instead relying on some Galois theory and the topological structure of C. We first need a basic fact:

Lemma 3.2.6. Let Ω ⊂ C^n be an open set, and let f : Ω → C^n be an injective holomorphic map. Then the Jacobian of f is non-degenerate, i.e. det Df(z) ≠ 0 for all z ∈ Ω.

Actually, we only need the special case of this lemma when f is a polynomial.
Proof. We use an argument of Rosay [Ro1982]. For n = 1 the claim follows from Taylor expansion. Now suppose n > 1 and the claim is proven for n − 1. Suppose for contradiction that det Df(z_0) = 0 for some z_0 ∈ Ω. We claim that Df(z_0) in fact vanishes entirely. If not, then we can find 1 ≤ i, j ≤ n such that ∂f_i/∂z_j (z_0) ≠ 0; by permuting we may take i = j = 1. We can also normalise z_0 = f(z_0) = 0. Then the map h : z ↦ (f_1(z), z_2, …, z_n) is holomorphic with non-degenerate Jacobian at 0 and is thus locally invertible at 0. The map f ∘ h^{−1} is then holomorphic at 0 and preserves the z_1 coordinate, and thus descends to an injective holomorphic map on a neighbourhood of the origin in C^{n−1}, and so its Jacobian is non-degenerate by induction hypothesis, a contradiction.

We have just shown that the gradient of f vanishes on the zero set {det Df = 0}, which is an analytic variety of codimension 1 (if f is polynomial, it is of course an algebraic variety). Thus f is locally constant on this variety, which contradicts injectivity, and we are done.
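For a polynomial instance of Lemma 3.2.6 one can check the non-degeneracy numerically. The sketch below (illustrative only; the shear map f and the sample points are arbitrary choices) computes det Df by central finite differences for the injective map f(z₁, z₂) = (z₁ + z₂², z₂), whose Jacobian determinant is identically 1:

```python
import numpy as np

def f(z):
    # an injective polynomial map C^2 -> C^2 (a "triangular" shear)
    z1, z2 = z
    return np.array([z1 + z2 ** 2, z2])

def jac_det(z, h=1e-6):
    # numerical Jacobian via central differences in each coordinate;
    # central differences are exact for quadratic polynomials
    n = len(z)
    J = np.zeros((n, n), dtype=complex)
    for j in range(n):
        e = np.zeros(n, dtype=complex)
        e[j] = h
        J[:, j] = (f(z + e) - f(z - e)) / (2 * h)
    return np.linalg.det(J)

# Df = [[1, 2 z2], [0, 1]], so det Df = 1 at every point, never zero:
for z in [np.array([0j, 0j]), np.array([1 + 2j, -3 + 0.5j])]:
    assert abs(jac_det(z) - 1) < 1e-6
```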
= u^2 blows up in finite time.) These appear to be the best possible rates
rates for acceleration or deceleration using only air and water sails,
though I do not have a formal proof of this fact.
Notes. This article first appeared at terrytao.wordpress.com/2009/03/23.

Izabella Laba pointed out several real-world sailing features not covered by the above simplified model, notably the interaction between multiple sails, and noted that the model was closer in many ways to windsurfing (or ice-sailing) than to traditional sailing.

Meichenl pointed out the relevance of the drag equation.
404 3. Expository articles
Figure 6. By alternating between a pure-lift aerofoil (red) and a pure-lift hydrofoil (purple), one can in principle reach arbitrarily large speeds in any direction.
3.4. The completeness and compactness theorems of first-order logic

The famous Gödel completeness theorem in logic (not to be confused with the even more famous Gödel incompleteness theorem) roughly states the following:
Theorem 3.4.1 (Gödel completeness theorem, informal statement). Let Γ be a theory (a formal language L, together with a set of axioms, i.e. sentences assumed to be true), and let φ be a sentence in the formal language. Assume also that the language L has at most countably many symbols. Then the following are equivalent:

(i) (Syntactic consequence) φ can be deduced from the axioms in Γ by a finite number of applications of the laws of deduction in first order logic. (This property is abbreviated as Γ ⊢ φ.)

(ii) (Semantic consequence) Every structure U which satisfies or models Γ, also satisfies φ. (This property is abbreviated as Γ ⊨ φ.)

(iii) (Semantic consequence for at most countable models) Every structure U which is at most countable, and which models Γ, also satisfies φ.
One can also formulate versions of the completeness theorem for languages with uncountably many symbols, but I will not do so here. One can also force other cardinalities on the model U by using the Löwenheim-Skolem theorem.

To state this theorem even more informally, any (first-order) result which is true in all models of a theory, must be logically deducible from that theory, and vice versa. (For instance, any result which is true for all groups, must be deducible from the group axioms; any result which is true for all systems obeying Peano arithmetic, must be deducible from the Peano axioms; and so forth.) In fact, it suffices to check countable and finite models only; for instance, any first-order statement which is true for all finite or countable groups, is in fact true for all groups! Informally, a first-order language with only countably many symbols cannot detect whether a given structure is countably or uncountably infinite. Thus for instance even the Zermelo-Fraenkel-Choice (ZFC) axioms of set theory must have some at most countable model, even though one can use ZFC to prove the existence of uncountable sets; this is known as Skolem's paradox. (To resolve the paradox, one needs to carefully distinguish between an object in a set theory being "externally" countable in the structure that models that theory, and being "internally" countable within that theory.)
Of course, a theory may contain undecidable statements φ, sentences which are neither provable nor disprovable in the theory. By the completeness theorem, this is equivalent to saying that φ is satisfied by some models of Γ but not by other models. Thus the completeness theorem is compatible with the incompleteness theorem: recursively enumerable theories such as Peano arithmetic are modeled by the natural numbers N, but are also modeled by other structures, and there are sentences satisfied by N which are not satisfied by other models of Peano arithmetic, and are thus undecidable within that arithmetic.

An important corollary of the completeness theorem is the compactness theorem:
Corollary 3.4.2 (Compactness theorem, informal statement). Let Γ be a first-order theory whose language has at most countably many symbols. Then the following are equivalent:

(i) Γ is consistent, i.e. it is not possible to logically deduce a contradiction from the axioms in Γ.

(ii) Γ is satisfiable, i.e. there exists a structure U that models Γ.

(iii) There exists a structure U which is at most countable, that models Γ.

(iv) Every finite subset Γ′ of Γ is consistent.

(v) Every finite subset Γ′ of Γ is satisfiable.

(vi) For every finite subset Γ′ of Γ, there exists a structure U which is at most countable, that models Γ′.
It is easy to see that Theorem 3.4.8 will allow us to use the finite case of Theorem 3.4.6 to deduce the infinite case, so it remains to prove Theorem 3.4.8. The implication of (ii) from (i) is trivial; the interesting implication is the converse.

Observe that there is a one-to-one correspondence between truth assignments U and elements of the product space {0, 1}^A, where A is the set of propositional variables. For every sentence φ, let F_φ ⊂ {0, 1}^A be the collection of all truth assignments that satisfy φ; observe that this is a closed (and open) subset of {0, 1}^A in the product topology (basically because φ only involves finitely many of the propositional variables). If every finite subset Γ′ of Γ is satisfiable, then the family (F_φ)_{φ∈Γ} of closed sets enjoys the finite intersection property. On the other hand, from Tychonoff's theorem, {0, 1}^A is compact. We conclude that the intersection ⋂_{φ∈Γ} F_φ is non-empty, and any truth assignment in this intersection satisfies every sentence of Γ, as required.
(·)^U : G × G → G, and a unary operation ((·)^{−1})^U : G → G. At present, no group-type properties are assumed on these operations; the structure here is little more than a magma at this point.
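Returning to the compactness argument for propositional logic above: for finitely many propositional variables the truth sets F_φ can be enumerated directly. The sketch below (a toy illustration; the three sentences and the variable names are invented for the example) checks that the intersection of the F_φ is non-empty, i.e. that the theory is satisfiable:

```python
from itertools import product

variables = ["A1", "A2", "A3"]

def truth_set(phi):
    # F_phi: the set of all truth assignments (points of {0,1}^A) satisfying phi
    return {w for w in product([0, 1], repeat=len(variables))
            if phi(dict(zip(variables, w)))}

gamma = [
    lambda v: v["A1"] or v["A2"],       # A1 v A2
    lambda v: not v["A1"] or v["A3"],   # A1 -> A3
    lambda v: not v["A2"],              # not A2
]
sets = [truth_set(phi) for phi in gamma]

# Each F_phi depends on finitely many variables (clopen in the product
# topology); here the full intersection is non-empty, so Gamma is satisfiable.
inter = set.intersection(*sets)
assert (1, 0, 1) in inter  # A1 true, A2 false, A3 true satisfies all three
```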
Every sentence φ in a zeroth-order language L can be interpreted in a structure U for that language to give a truth value φ^U.

((x =′ y) ∧ (y =′ z)) ⟹ (x =′ z)

for each triple of terms x, y, z, as well as substitution axioms such as

(x =′ y) ⟹ (B(x, z) =′ B(y, z)).

Any contradiction derived in Γ can be translated to a contradiction derived in Γ′ simply by replacing = with =′; conversely, one can pass back from a model of Γ′ to a model of Γ by quotienting the domain by the interpretation (=′)^U of =′ (an equivalence relation on the domain of U), and also quotienting all the interpretations of the relations and operations of L; the axioms of equality ensure that this quotienting is possible, and that the quotiented structure U satisfies Γ; we omit the details.
Henceforth we assume that L does not contain the equality sign. We will then choose a tautological domain of discourse Dom(U), by setting this domain to be nothing more than the collection of all terms in the language L. For instance, in the language of groups on six generators, the domain Dom(U) is basically the free magma (with inverse) on six generators plus an identity, consisting of terms such as (a_1 · a_2)^{−1} · a_1, (e · a_3) · ((a_4)^{−1})^{−1}, etc. With this choice of domain, there is an obvious tautological interpretation of constants (c^U := c) and operations (e.g. B^U(t_1, t_2) := B(t_1, t_2) for binary operations B and terms t_1, t_2), which leads to every term t being interpreted as itself: t^U = t.
It remains to figure out how to interpret the propositional variables A_1, A_2, … and relations R_1, R_2, …. Actually, one can replace each relation with an equivalent collection of new propositional variables by substituting in all possible terms in the relation. For instance, if one has a binary relation R(·, ·), one can replace this single relation symbol in the language by a (possibly infinite) collection of propositional variables R(t_1, t_2), one for each pair of terms t_1, t_2, leading to a new (and probably much larger) language L̃ without any relation symbols. It is not hard to see that if the theory Γ is consistent in L, then the theory Γ̃ in L̃ formed by interpreting all atomic formulae such as R(t_1, t_2) as propositional variables is also consistent. If Γ̃ has a model, so does Γ, while any at most countable model for Γ̃ can be converted to a countable model for Γ (again by forgetting c). So one can eliminate the existential quantifier from this sentence also. Similar methods
work for any other prenex normal form; for instance with the sentence

∀x ∃y ∀z ∃w : P(x, y, z, w)

one can obtain a conservative extension of that theory by introducing a unary operator c and a binary operator d and replacing the above sentence with

∀x ∀z : P(x, c(x), z, d(x, z)).
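Over a finite domain of discourse, the equivalence between a prenex sentence and its Skolemisation can be checked by exhaustive search. The sketch below (illustrative; the domain, the formula P, and the Skolem functions c and d are invented for the example) verifies both forms of the sentence ∀x∃y∀z∃w : P(x, y, z, w):

```python
from itertools import product

D = range(5)  # a finite domain of discourse

def P(x, y, z, w):
    # a sample quantifier-free formula: y is the "successor" of x mod 5,
    # and w depends on both x and z
    return y == (x + 1) % 5 and w == (x + z) % 5

# original sentence: forall x exists y forall z exists w : P(x, y, z, w)
original = all(
    any(all(any(P(x, y, z, w) for w in D) for z in D) for y in D)
    for x in D
)

# Skolemised form: the existential quantifiers are witnessed by a unary
# operator c and a binary operator d added to the language
c = lambda x: (x + 1) % 5
d = lambda x, z: (x + z) % 5
skolemised = all(P(x, c(x), z, d(x, z)) for x, z in product(D, D))

assert original and skolemised  # the Skolem functions witness the sentence
```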
One can show that one can perform Skolemisation on all the sentences in Γ simultaneously, which has the effect of eliminating all existential quantifiers from Γ while still keeping the language L at most countable (since Γ is at most countable). (Intuitively, what is going on here is that we are interpreting all existential axioms in the theory as implicitly defining functions, which we then explicitly formalise as a new symbol in the language. For instance, if we had some theory of sets which contained the axiom of choice (every family of non-empty sets (X_α)_{α∈A} admits a choice function f : A → ⋃_{α∈A} X_α), then we can Skolemise this by adding a choice function T : (X_α)_{α∈A} ↦ T((X_α)_{α∈A}) that witnessed this axiom to the language. Note that we do not need uniqueness in the existential claim in order to be able to perform Skolemisation.)
After performing Skolemisation and adding all the witnesses to the language, we are reduced to the case in which all the sentences in Γ are in fact universal statements, i.e. of the form ∀x_1 … ∀x_k : P(x_1, …, x_k), where P(x_1, …, x_k) is a quantifier-free formula of k free variables. In this case one can repeat the zeroth-order arguments, selecting a structure U whose domain of discourse is the tautological one, indexed by all the terms with no free variables (in particular, this structure will be countable). One can then replace each first-order statement ∀x_1 … ∀x_k : P(x_1, …, x_k) in Γ by the family of zeroth-order statements P(t_1, …, t_k), where t_1, …, t_k ranges over all terms with no free variables, thus creating a zeroth-order theory Γ_0. As Γ is consistent, Γ_0 is also, so by the zeroth-order case we can find a model U for Γ_0 with the tautological domain of discourse, and it is clear that this structure will also be a model for the original theory Γ. The proof of the completeness theorem (and thus the compactness theorem) is now complete.
In summary: to create a countable model from a consistent first-order theory, one first replaces the equality sign = (if any) by a binary relation =′.
of V, then one observes that

dist(X, V)² = X · PX = ∑_{i=1}^n ∑_{j=1}^n x_i x_j p_{ij}

and so upon taking expectations we see that

(3.23) E dist(X, V)² = ∑_{i=1}^n p_{ii} = tr P = n − d

since P is a rank n − d orthogonal projection. So we expect dist(X, V) to be about √(n − d) on the average.

In fact, one has sharp concentration around this value, in the sense that

P(|dist(X, V) − √(n − d)| ≥ t) ≤ C exp(−ct²)

for some absolute constants C, c > 0.
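The second moment identity (3.23) is easy to test by simulation. The sketch below (a Monte Carlo illustration; the dimensions n = 20, d = 5, the random subspace, and the trial count are arbitrary choices) samples random sign vectors and compares the mean squared distance to n − d:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5

# orthonormal basis of a random d-dimensional subspace V of R^n
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))

trials = 20000
X = rng.choice([-1.0, 1.0], size=(trials, n))  # uniform random sign vectors
proj = X @ Q @ Q.T                              # orthogonal projection onto V
sq_dist = ((X - proj) ** 2).sum(axis=1)         # dist(X, V)^2

# (3.23): E dist(X, V)^2 = n - d exactly, for any fixed subspace V
assert abs(sq_dist.mean() - (n - d)) < 0.5
```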
In fact the constants C, c are very civilised; for large t one can basically take C = 4 and c = 1/16, for instance. This type of concentration, particularly for subspaces V of moderately large codimension n − d, is fundamental to much of my work on random matrices with Van Vu, starting with our first paper [TaVu2006] (in which this proposition first appears).
Proposition 3.5.1 is an easy consequence of the second moment computation and Talagrand's inequality [Ta1996], which among other things provides a sharp concentration result for convex Lipschitz functions on the cube {−1, +1}^n; since dist(x, V) is indeed a convex Lipschitz function, this inequality can be applied immediately. The proof of Talagrand's inequality is short and can be found in several textbooks (e.g. [AlSp2008]), but I thought I would reproduce the argument here (specialised to the convex case), mostly to force myself to learn the proof properly. Note the concentration of O(1) obtained by Talagrand's inequality is much stronger than what one would get from more elementary tools such as Azuma's inequality or McDiarmid's inequality, which would only give concentration of about O(√n) or so (which is in fact trivial, since the cube {−1, +1}^n has diameter 2√n); the point is that Talagrand's inequality is very effective at exploiting the convexity of the problem, as well as the Lipschitz nature of the function in all directions, whereas Azuma's inequality can only easily take advantage of the Lipschitz nature of the function in coordinate
directions. On the other hand, Azuma's inequality works just as well if the ℓ² metric is replaced with the larger ℓ¹ metric, and one can conclude that the ℓ¹ distance between X and V concentrates around its median to a width O(√n), to be compared with the ℓ² concentration bound given by that inequality. (The computation of the median of the ℓ¹ distance is more complicated than for the ℓ² distance, though, and depends on the orientation of V.)

(For subspaces of small codimension, such as hyperplanes, one has to use other tools to get good results, such as inverse Littlewood-Offord theory or the Berry-Esseen central limit theorem, but that is another story.)

3.5. Talagrand's concentration inequality 425
Remark 3.5.2. If one makes the coordinates of X iid Gaussian variables x_i ∼ N(0, 1) rather than random signs, then Proposition 3.5.1 is much easier to prove; the probability distribution of a Gaussian vector is rotation-invariant, so one can rotate V to be, say, R^d, at which point dist(X, V)² is clearly the sum of n − d independent squares of Gaussians (i.e. a chi-square distribution), and the claim follows from direct computation (or one can use the Chernoff inequality). The Gaussian counterpart of Talagrand's inequality is more classical, being essentially due to Levy, and will also be discussed later in this post.
3.5.1. Concentration on the cube. Proposition 3.5.1 follows easily from the following statement, that asserts that if a convex set A ⊂ R^n occupies a non-trivial fraction of the cube {−1, +1}^n, then the neighbourhood A_t := {x ∈ R^n : dist(x, A) ≤ t} will occupy almost all of the cube for t ≫ 1:

Proposition 3.5.3 (Talagrand's concentration inequality). Let A be a convex set in R^n. Then

P(X ∈ A) P(X ∉ A_t) ≤ exp(−ct²)

for all t > 0 and some absolute constant c > 0, where X ∈ {−1, +1}^n is chosen uniformly from {−1, +1}^n.
Remark 3.5.4. It is crucial that A is convex here. If instead A is, say, the set of all points in {−1, +1}^n with fewer than n/2 − √n +1s, then P(X ∈ A) is comparable to 1, but P(X ∉ A_t) only starts decaying once t ≫ √n, rather than t ≫ 1. Indeed, it is not hard to show that Proposition 3.5.3 implies the variant

P(X ∈ A) P(X ∉ A_t) ≤ exp(−ct²/n)

for non-convex A (by restricting A to {−1, +1}^n and then passing from A to the convex hull, noting that distances to A on {−1, +1}^n may be contracted by a factor of O(√n)).
(X′, x_n) for x_n = ±1. For each t ∈ R, we introduce the slice A_t := {x′ ∈ R^{n−1} : (x′, t) ∈ A}; then A_t is convex. We now try to bound the left-hand side of (3.24) in terms of X′, A_t rather than X, A. Clearly

P(X ∈ A) = (1/2) [P(X′ ∈ A_{−1}) + P(X′ ∈ A_{+1})].

By symmetry we may assume that P(X′ ∈ A_{+1}) ≥ P(X′ ∈ A_{−1}), thus we may write

(3.25) P(X′ ∈ A_{−1}) = p(1 − q)

where p := P(X ∈ A) and 0 ≤ q ≤ 1.
Now we look at dist(X, A)². For t = ±1, let Y_t ∈ R^{n−1} be the closest point of (the closure of) A_t to X′, thus

(3.26) |X′ − Y_t| = dist(X′, A_t).
Let 0 ≤ λ ≤ 1 be chosen later; then the point (1 − λ)(Y_{x_n}, x_n) + λ(Y_{−x_n}, −x_n) lies in A by convexity, and so

dist(X, A) ≤ |(1 − λ)(Y_{x_n}, x_n) + λ(Y_{−x_n}, −x_n) − (X′, x_n)|.

Squaring this and using Pythagoras, one obtains

dist(X, A)² ≤ 4λ² + |(1 − λ)(X′ − Y_{x_n}) + λ(X′ − Y_{−x_n})|².
As we will shortly be exponentiating the left-hand side, we need to linearise the right-hand side. Accordingly, we will exploit the convexity of the function x ↦ |x|² to bound

|(1 − λ)(X′ − Y_{x_n}) + λ(X′ − Y_{−x_n})|² ≤ (1 − λ)|X′ − Y_{x_n}|² + λ|X′ − Y_{−x_n}|²

and thus by (3.26)

dist(X, A)² ≤ 4λ² + (1 − λ) dist(X′, A_{x_n})² + λ dist(X′, A_{−x_n})².
We exponentiate this and take expectations in X′ (holding x_n fixed for now) to get

E_{X′} e^{c dist(X,A)²} ≤ e^{4cλ²} E_{X′} (e^{c dist(X′, A_{x_n})²})^{1−λ} (e^{c dist(X′, A_{−x_n})²})^{λ}.
Meanwhile, from the induction hypothesis and (3.25) we have

E_{X′} e^{c dist(X′, A_{x_n})²} ≤ 1/(p(1 + x_n q))

and similarly for A_{−x_n}. By Hölder's inequality, we conclude

E_{X′} e^{c dist(X,A)²} ≤ e^{4cλ²} (1/p) (1 + x_n q)^{−(1−λ)} (1 − x_n q)^{−λ}.
For x_n = +1, the optimal choice of λ here is λ = 0, obtaining

E_{X′} e^{c dist(X,A)²} ≤ 1/(p(1 + q));

for x_n = −1, the optimal choice of λ is to be determined. Averaging, we obtain

E e^{c dist(X,A)²} ≤ (1/2) [1/(p(1 + q)) + e^{4cλ²} (1/p) (1 − q)^{−(1−λ)} (1 + q)^{−λ}]

so to establish (3.24), it suffices to pick 0 ≤ λ ≤ 1 such that

1/(1 + q) + e^{4cλ²} (1 − q)^{−(1−λ)} (1 + q)^{−λ} ≤ 2.
If q is bounded away from zero, then by choosing λ = 1 we would obtain the claim if c is small enough, so we may take q to be small. But then a Taylor expansion allows us to conclude if we take λ to be a constant multiple of q, and again pick c to be small enough. The point is that λ = 0 already almost works up to errors of O(q²), and increasing λ from zero to a small non-zero quantity will decrease the LHS by about O(λq) − O(cλ²).

By optimising everything using first-year calculus, one eventually gets the constant c = 1/16 claimed earlier.
Remark 3.5.5. Talagrand's inequality is in fact far more general than this; it applies to arbitrary products of probability spaces, rather than just to {−1, +1}^n, and to non-convex A, but the notion of distance needed to define A_t becomes more complicated; the proof of the inequality, though, is essentially the same. Besides its applicability to convex Lipschitz functions, Talagrand's inequality is also very useful for controlling combinatorial Lipschitz functions F which are "locally certifiable" in the sense that whenever F(x) is larger than some threshold t, then there exist some bounded number f(t) of coefficients of x which certify this fact (in the sense that F(y) ≥ t for any other y which agrees with x on these coefficients). See e.g. [AlSp2008] for a more precise statement and some applications.
3.5.2. Gaussian concentration. As mentioned earlier, there are analogous results when the uniform distribution on the cube {−1, +1}^n is replaced by other distributions, such as the n-dimensional Gaussian distribution. In fact, in this case convexity is not needed:

Proposition 3.5.6 (Gaussian concentration inequality). Let A be a measurable set in R^n. Then

P(X ∈ A) P(X ∉ A_t) ≤ exp(−ct²)

for all t > 0 and some absolute constant c > 0, where X ∼ N(0, 1)^n is a random Gaussian vector.
This inequality can be deduced from Levy's classical concentration of measure inequality for the sphere (with the optimal constant), but we will give an alternate proof due to Maurey and Pisier. It suffices to prove the following variant of Proposition 3.5.6:
Proposition 3.5.7 (Gaussian concentration inequality for Lipschitz functions). Let f : R^n → R be a function which is Lipschitz with constant 1 (i.e. |f(x) − f(y)| ≤ |x − y| for all x, y ∈ R^n). Then we have

P(|f(X) − E f(X)| ≥ t) ≤ exp(−ct²)

for all t > 0 and some absolute constant c > 0, where X ∼ N(0, 1)^n is a random variable.

Indeed, if one sets f(x) := dist(x, A) one can soon deduce Proposition 3.5.6 from Proposition 3.5.7.
Informally, Proposition 3.5.7 asserts that Lipschitz functions of Gaussian variables concentrate as if they were Gaussian themselves; for comparison, Talagrand's inequality implies that convex Lipschitz functions of Bernoulli variables concentrate as if they were Gaussian.
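This dimension-free concentration can be observed numerically. The sketch below (illustrative; the choice f(x) = |x|, which is 1-Lipschitz, and the parameters n = 400 and 5000 trials are arbitrary) samples Gaussian vectors and checks that deviations of f from its mean stay O(1) even though the mean itself is about √n ≈ 20:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 400, 5000

X = rng.standard_normal((trials, n))
f = np.linalg.norm(X, axis=1)   # f(x) = |x| is 1-Lipschitz
dev = np.abs(f - f.mean())      # deviation from the (empirical) mean

# deviations are O(1): large deviations are rare, independently of n
assert (dev >= 1.0).mean() <= 0.5
assert (dev >= 3.0).mean() <= 0.01
```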
Now we prove Proposition 3.5.7. By the epsilon regularisation argument (Section 2.7) we may take f to be smooth, and so by the Lipschitz property we have

(3.27) |∇f(x)| ≤ 1

for all x. By subtracting off the mean we may assume Ef = 0. By replacing f with −f if necessary it suffices to control the upper tail probability P(f(X) ≥ t) for t > 0.

We again use the exponential moment method. It suffices to show that

E exp(tf(X)) ≤ exp(Ct²)

for some absolute constant C.
Now we use a variant of the square and rearrange trick. Let Y be an independent copy of X. Since Ef(Y) = 0, we see from Jensen's inequality that E exp(−tf(Y)) ≥ 1, and so

E exp(tf(X)) ≤ E exp(t(f(X) − f(Y))).

With an eye to exploiting (3.27), one might seek to use the fundamental theorem of calculus to write

f(X) − f(Y) = ∫_0^1 (d/dλ) f((1 − λ)Y + λX) dλ.
But actually it turns out to be smarter to use a circular arc of integration, rather than a line segment:

f(X) − f(Y) = ∫_0^{π/2} (d/dθ) f(Y cos θ + X sin θ) dθ.

The reason for this is that X_θ := Y cos θ + X sin θ is distributed in the same way as X, as is its derivative X′_θ := −Y sin θ + X cos θ; furthermore, and crucially, these two random variables are independent.
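The distributional facts underlying this trick are easy to verify by simulation: for fixed θ, both X_θ and X′_θ are standard Gaussians, and their covariance cos θ (−sin θ) + sin θ cos θ vanishes, which for jointly Gaussian variables implies independence. The sketch below (illustrative; the value θ = 0.7 and the sample size are arbitrary choices) checks this in one dimension:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.7
N = 200_000
X = rng.standard_normal(N)
Y = rng.standard_normal(N)

X_theta = Y * np.cos(theta) + X * np.sin(theta)         # the interpolant
X_theta_prime = -Y * np.sin(theta) + X * np.cos(theta)  # its theta-derivative

# both are standard Gaussian, and (being jointly Gaussian) uncorrelated,
# hence independent
assert abs(X_theta.var() - 1) < 0.02
assert abs(X_theta_prime.var() - 1) < 0.02
assert abs(np.mean(X_theta * X_theta_prime)) < 0.02
```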
To exploit this, we first use Jensen's inequality to bound

exp(t(f(X) − f(Y))) ≤ (2/π) ∫_0^{π/2} exp((π/2) t (d/dθ) f(X_θ)) dθ.

Applying the chain rule and taking expectations, we have

E exp(t(f(X) − f(Y))) ≤ (2/π) ∫_0^{π/2} E exp((π/2) t ∇f(X_θ) · X′_θ) dθ.

Let us condition X_θ to be fixed; then X′_θ ∼ N(0, 1)^n. Applying (3.27), we conclude that (π/2) t ∇f(X_θ) · X′_θ is a Gaussian random variable of standard deviation at most (π/2) t. As such we have

E exp((π/2) t ∇f(X_θ) · X′_θ) ≤ exp(Ct²)

for some absolute constant C; integrating out the conditioning on X_θ, the claim follows.
|I(P, L)| = ∑_{ℓ∈L} |P ∩ ℓ|, and thus by Cauchy-Schwarz

∑_{ℓ∈L} |P ∩ ℓ|² ≥ |I(P, L)|² / |L|.

On the other hand, observe that

∑_{ℓ∈L} |P ∩ ℓ|² − |P ∩ ℓ| = |{(p, q, ℓ) ∈ P × P × L : p ≠ q; p, q ∈ ℓ}|.

Because two distinct points p, q are incident to at most one line, the right-hand side is at most |P|², thus

∑_{ℓ∈L} |P ∩ ℓ|² ≤ |I(P, L)| + |P|².

Comparing this with the Cauchy-Schwarz bound and using a little high-school algebra we obtain (3.29). A dual argument (swapping the role of lines and points) gives (3.30).
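The incidence bound just derived can be sanity-checked by brute force on a small configuration. The sketch below (illustrative; the 10 × 10 grid of points and the family of lines y = mx + b with small integer slopes are arbitrary choices) counts incidences and compares with |P||L|^{1/2} + |L| (the clean constant-free form of the bound, which happens to hold in this example):

```python
from itertools import product

# points: an N x N integer grid; lines: y = m x + b with small slopes
N = 10
points = set(product(range(N), range(N)))
lines = [(m, b) for m in range(3) for b in range(N)]

# count point-line incidences by direct enumeration
incidences = sum((x, m * x + b) in points
                 for (m, b) in lines for x in range(N))

P_, L_ = len(points), len(lines)
assert incidences <= P_ * L_ ** 0.5 + L_  # |I| <= |P||L|^{1/2} + |L| here
```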
A more informal proof of (3.29) can be given as follows. Suppose for contradiction that |I(P, L)| was much larger than |P||L|^{1/2} + |L|. Since |I(P, L)| = ∑_{p∈P} |L_p|, this implies that the |L_p| are much larger than |L|^{1/2} on the average. By the birthday paradox, one then expects two randomly chosen L_p, L_q to intersect in at least two places ℓ, ℓ′; but this would mean that two lines intersect in two points, a contradiction. The use of Cauchy-Schwarz in the rigorous argument given above can thus be viewed as an assertion that the average intersection of L_p and L_q is at least as large as what random chance predicts.
As mentioned in the introduction, we now see (intuitively, at least) that if nearby p, q are such that L_p, L_q are drawn from a smaller pool of lines than L, then their intersection is likely to be higher, and so one should be able to improve upon (3.29).
3.6. The cell decomposition 435
3.6.2. The probabilistic bound. Now we start proving Lemma 3.6.1. We can assume that r < |L|, since the claim is trivial otherwise (we just use all the lines in L to subdivide the plane, and there are no lines left in L to intersect any of the cells). Similarly we may assume that r > 1, and that |L| is large. We can also perturb all the lines slightly and assume that the lines are in general position (no three are concurrent), as the general claim then follows from a limiting argument (note that this may cause some of the cells to become empty). (Of course, the Szemerédi-Trotter theorem is quite easy under the assumption of general position, but this theorem is not our current objective.)
We use the probabilistic method, i.e. we construct R by some
random recipe and aim to show that the conclusion of the lemma
holds with positive probaility.
The most obvious approach would be to choose the r lines R
at random from L, thus each line in L has a probability of r/|L|
of lying in R. Actually, for technical reasons it is slightly better
to use a Bernoulli process to select R, thus each line in L lies in
R with an independent probability of r/|L|. This can cause R to
occasionally have size much larger than r, but this probability can be
easily controlled (e.g. using the Chernoff inequality). So with high
probability, R consists of O(r) lines, which therefore carve out O(r^2)
cells. The remaining task is to show that each cell is incident to at
most O(|L|/r) lines from L.
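A quick simulation of the Bernoulli selection step, with illustrative sizes of my own: each of n lines is kept independently with probability r/n, and the resulting |R| concentrates around r, so |R| = O(r) with high probability, as the Chernoff inequality predicts.

```python
import random

random.seed(1)
n, r, trials = 20000, 100, 300  # n "lines", target sample size r
p = r / n

# Each trial: keep each line independently with probability p, record |R|.
sizes = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]
print(min(sizes), max(sizes), sum(sizes) / trials)  # |R| concentrates near r
```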
Observe that each cell is a (possibly unbounded) polygon, whose
edges come from lines in R. Note that (except in the degenerate case
when R consists of at most one line, which we can ignore) any line
which meets a cell must intersect at least one of the edges of that cell.
If we pretend for the moment that all cells have a bounded number
of edges, it would then suffice to show that each edge of each cell was
incident to O(|L|/r) lines.
Let's see how this would go. Suppose that one line ℓ was
picked for the set R, and consider all the other lines in L that inter-
sect ℓ; there are O(|L|) of these lines . . .
. . . of lines with |L| ≥ 1, one can partition the plane
into at most C_1 r^2 cells using at most C_0 r lines, each incident to at
most C_2 |L|/r lines, where
C_0, C_1, C_2 are absolute constants. (When using induction, asymptotic
notation becomes quite dangerous to use, and it is much safer to start
writing out the constants explicitly. To close the induction, one has
to end up with the same constants C_0, C_1, C_2 as one started with.)
For each k between C_2/C and O(log r) which is a power of two, one
can apply the induction hypothesis to all the cells which are incident
to between Ck|L|/r and 2Ck|L|/r lines (with L . . .
. . . Σ_k k^2 exp(−k) converges, especially if k is restricted to
powers of two) to close the induction if the constants C_0, C_1, C_2 are
chosen properly; we leave the details as an exercise.
Notes. This article first appeared at terrytao.wordpress.com/2009/06/12.
Thanks to Oded and vedadi for corrections.
Jozsef Solymosi noted that there is still no good characterisa-
tion of the point-line configurations for which the Szemerédi-Trotter
theorem is close to sharp; such a characterisation may well lead to
improvements to a variety of bounds which are currently proven using
this theorem.
Jordan Ellenberg raised the interesting possibility of using alge-
braic methods to attack the finite field analogue of this problem.
3.7. Benford's law, Zipf's law, and the Pareto
distribution
A remarkable phenomenon in probability theory is that of universal-
ity - that many seemingly unrelated probability distributions, which
ostensibly involve large numbers of unknown parameters, can end up
converging to a universal law that may only depend on a small handful
of parameters. One of the most famous examples of the universality
phenomenon is the central limit theorem; another rich source of ex-
amples comes from random matrix theory, which is one of the areas
of my own research.
Analogous universality phenomena also show up in empirical dis-
tributions - the distributions of a statistic X from a large population
of real-world objects. Examples include Benford's law, Zipf's law,
and the Pareto distribution (of which the Pareto principle or 80-20
law is a special case). These laws govern the asymptotic distribution
of many statistics X which
(i) take values as positive numbers;
(ii) range over many different orders of magnitude;
(iii) arise from a complicated combination of largely independent
factors (with different samples of X arising from different
independent factors); and
(iv) have not been artificially rounded, truncated, or otherwise
constrained in size.
Examples here include the population of countries or cities, the
frequency of occurrence of words in a language, the mass of astro-
nomical objects, or the net worth of individuals or corporations. The
laws are then as follows:
Benford's law: For k = 1, . . . , 9, the proportion of X
whose first digit is k is approximately log_10 ((k+1)/k). Thus, for
instance, X should have a first digit of 1 about 30% of the
time, but a first digit of 9 only about 5% of the time.
Zipf's law: The n-th largest value of X should obey an
approximate power law, i.e. it should be approximately
Cn^{−α} . . .
. . . for all integers 1 ≤ k < 10, where the left-hand side denotes
the proportion of data for which X lies between 10^n and k · 10^n for
some integer n. Suppose now that we generalise Benford's law to the
continuous Benford's law, which asserts that (3.32) is true for all real
numbers 1 ≤ k < 10. Then it is not hard to show that a statistic
X obeys the continuous Benford's law if and only if its dilate X̃ = 2X
does, and similarly with 2 replaced by any other constant growth
factor. (This is easiest seen by observing that (3.32) is equivalent to
asserting that the fractional part of log_10 X is uniformly distributed.)
In fact, the continuous Benford law is the only distribution for the
quantities on the left-hand side of (3.32) with this scale-invariance
property; this fact is a special case of the general fact that Haar
measures are unique (see Section 1.12).
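Both points are easy to check numerically. A small illustration of my own (sample sizes are arbitrary): the powers of 2, whose base-10 logarithms equidistribute modulo 1, follow Benford's law; and a statistic X with the fractional part of log_10 X uniform has the same first-digit frequencies as its dilate 2X.

```python
import math, random
from collections import Counter

def first_digit(x):
    return int(10 ** (math.log10(x) % 1))  # leading digit via the mantissa

# Powers of 2: frac(k*log10(2)) equidistributes, so first digits are Benford.
counts = Counter(int(str(2 ** k)[0]) for k in range(1, 10001))
print([round(counts[d] / 10000, 3) for d in range(1, 10)])
print([round(math.log10((d + 1) / d), 3) for d in range(1, 10)])

# Dilation invariance: X with frac(log10 X) uniform, compared with 2X.
random.seed(2)
xs = [10 ** random.uniform(0, 6) for _ in range(100000)]
f1 = sum(1 for x in xs if first_digit(x) == 1) / len(xs)
f2 = sum(1 for x in xs if first_digit(2 * x) == 1) / len(xs)
print(f1, f2, math.log10(2))
```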
It is also easy to see that Zipf's law and the Pareto distribu-
tion also enjoy this sort of scale-invariance property, as long as one
generalises the Pareto distribution
(3.33) P(X ≥ 10^m) = c · 10^{−m/α}
from integer m to real m, just as with Benford's law. Once one does
that, one can phrase the Pareto distribution law independently of any
base as
(3.34) P(X ≥ x) = c · x^{−1/α}
for any x much larger than the median value of X, at which point the
scale-invariance is easily seen.
One may object that the above thought-experiment was too ide-
alised, because it assumed uniform growth rates for all the statistics
at once. What happens if there are non-uniform growth rates? To
keep the computations simple, let us consider the following toy model,
where we take the same 2007 population statistics X as before, and
assume that half of the countries (the high-growth countries) will
experience a population doubling by 2067, while the other half (the
zero-growth countries) will keep their population constant, thus the
2067 population statistic X̃ is equal to 2X half the time and X half
the time. (We will assume that our sample sizes are large enough that
the law of large numbers kicks in, and we will therefore ignore issues
such as what happens to this half the time if the number of samples
is odd.) Furthermore, we make the plausible but crucial assumption
that the event that a country is a high-growth or a zero-growth coun-
try is independent of the first digit of the 2007 population; thus, for
instance, a country whose population begins with 3 is assumed to be
just as likely to be high-growth as one whose population begins with
7.
Now let's have a look again at the proportion of countries whose
2067 population X̃ begins with either 2 or 3. There are exactly two
ways in which a country can fall into this category: either it is a zero-
growth country whose 2007 population X also began with either 2 or
3, or it was a high-growth country whose population in 2007 began
with 1. Since all countries have a probability 1/2 of being high-
growth regardless of the first digit of their population, we conclude
the identity
(3.35) P(X̃ has first digit 2, 3) = (1/2) P(X has first digit 2, 3)
+ (1/2) P(X has first digit 1)
which is once again compatible with Benford's law for X̃, since
log_10 (4/2) = (1/2) log_10 (4/2) + (1/2) log_10 (2/1).
More generally, it is not hard to show that if X obeys the continuous
Benford's law (3.32), and one multiplies X by some positive multi-
plier Y which is independent of the first digit of X (and, a fortiori,
is independent of the fractional part of log_10 X), one obtains another
quantity X̃ = XY which also obeys the continuous Benford's law.
(Indeed, we have already seen this to be the case when Y is a deter-
ministic constant, and the case when Y is random then follows simply
by conditioning Y to be fixed.)
In particular, we see an absorptive property of Benford's law: if
X obeys Benford's law, and Y is any positive statistic independent
of X, then the product X̃ = XY also obeys Benford's law - even
if Y did not obey this law. Thus, if a statistic is the product of
many independent factors, then it only requires a single factor to
obey Benford's law in order for the whole product to obey the law
also. For instance, the population of a country is the product of
its area and its population density. Assuming that the population
density of a country is independent of the area of that country (which
is not a completely reasonable assumption, but let us take it for the
sake of argument), then we see that Benford's law for the population
would follow if just one of the area or population density obeyed this
law. It is also clear that Benford's law is the only distribution with
this absorptive property (if there was another law with this property,
what would happen if one multiplied a statistic with that law with
an independent statistic with Benford's law?). Thus we begin to get
a glimpse as to why Benford's law is universal for quantities which
are the product of many separate factors, in a manner that no other
law could be.
As an example: for any given number N, the uniform distribution
from 1 to N does not obey Benford's law; for instance, if one picks a
random number from 1 to 999,999 then each digit from 1 to 9 appears
as the first digit with an equal probability of 1/9 each. However, if N
is not fixed, but instead obeys Benford's law, then a random number
selected from 1 to N also obeys Benford's law (ignoring for now the
distinction between continuous and discrete distributions), as it can
be viewed as the product of N with an independent random number
selected from between 0 and 1.
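The last claim is easy to simulate (illustrative parameters of my own): draw N with log_10 N having uniform fractional part, then draw X uniformly from [1, N]; the product structure X ≈ N · U makes X Benford as well.

```python
import math, random

random.seed(3)

def first_digit(x):
    return int(10 ** (math.log10(x) % 1))

hits = 0
samples = 100000
for _ in range(samples):
    N = 10 ** random.uniform(3, 6)   # N obeys the continuous Benford law
    X = random.uniform(1, N)         # X uniform between 1 and N
    hits += first_digit(X) == 1

print(hits / samples, math.log10(2))  # frequency of leading digit 1
```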
Actually, one can say something even stronger than the absorp-
tion property. Suppose that the continuous Benford's law (3.32) for a
statistic X did not hold exactly, but instead held with some accuracy
ε > 0, thus
(3.36) log_10 k − ε ≤ P(10^n ≤ X < k · 10^n for some integer n)
≤ log_10 k + ε
for all 1 ≤ k < 10. Then it is not hard to see that any dilated
statistic, such as X̃ = 2X, or more generally X̃ = XY for any fixed
deterministic Y, also obeys (3.36) with exactly the same accuracy ε.
But now suppose one uses a variable multiplier; for instance, suppose
one uses the model discussed earlier in which X̃ is equal to 2X half
the time and X half the time. Then the relationship between the
distribution of the first digit of X̃ and the first digit of X is given by
formulae such as (3.35). Now, in the right-hand side of (3.35), each of
the two terms P(X has first digit 2, 3) and P(X has first digit 1) dif-
fers from the Benford's law predictions of log_10 (4/2) and log_10 (2/1) respec-
tively by at most ε. Since the left-hand side of (3.35) is the average
of these two terms, it also differs from the Benford law prediction by
at most ε. But the averaging opens up an opportunity for cancelling;
for instance, an overestimate of +ε for P(X has first digit 2, 3) could
cancel an underestimate of −ε for P(X has first digit 1) to produce
a spot-on prediction for X̃. Thus we see that variable multipliers
(or variable growth rates) not only preserve Benford's law, but in
fact stabilise it by averaging out the errors. In fact, if one started
with a distribution which did not initially obey Benford's law, and
then started applying some variable (and independent) growth rates
to the various samples in the distribution, then under reasonable as-
sumptions one can show that the resulting distribution will converge
to Benford's law over time. This helps explain the universality of
Benford's law for statistics such as populations, for which the inde-
pendent variable growth law is not so unreasonable (at least, until
the population hits some maximum capacity threshold).
Note that the independence property is crucial; if for instance
population growth always slowed down for some inexplicable reason
to a crawl whenever the first digit of the population was 6, then there
would be a noticeable deviation from Benford's law, particularly in
digits 6 and 7, due to this growth bottleneck. But this is not a par-
ticularly plausible scenario (being somewhat analogous to Maxwell's
demon in thermodynamics).
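The convergence claim can be illustrated with a toy simulation (my own construction, with arbitrary parameters): start every sample at the maximally non-Benford value 1, and repeatedly apply independent random growth factors; the first-digit distribution then drifts to Benford's law.

```python
import math, random

random.seed(4)

def first_digit(x):
    return int(10 ** (math.log10(x) % 1))

samples = [1.0] * 50000
for _ in range(40):  # 40 rounds of independent random growth
    samples = [x * random.uniform(1.0, 3.0) for x in samples]

freq1 = sum(1 for x in samples if first_digit(x) == 1) / len(samples)
print(freq1, math.log10(2))  # close to the Benford prediction
```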
The above analysis can also be carried over to some extent to the
Pareto distribution and Zipf's law; if a statistic X obeys these laws
approximately, then after multiplying by an independent variable Y,
the product X̃ = XY will obey the same laws with equal or higher
accuracy, so long as Y is small compared to the number of scales that
X typically ranges over. (One needs a restriction such as this be-
cause the Pareto distribution and Zipf's law must break down below
the median. Also, Zipf's law loses its stability at the very extreme
end of the distribution, because there are no longer enough samples
for the law of large numbers to kick in; this is consistent with the em-
pirical observation that Zipf's law tends to break down in extremis.)
These laws are also stable under other multiplicative processes, for
instance if some fraction of the samples in X spontaneously split into
two smaller pieces, or conversely if two samples in X spontaneously
merge into one; as before, the key is that the occurrence of these
events should be independent of the actual size of the objects being
split. If one considers a generalisation of the Pareto or Zipf law in
which the exponent α is not fixed, but varies with n or k, then the
effect of these sorts of multiplicative changes is to blur and average to-
gether the various values of α, thus flattening the α curve over time
and making the distribution approach Zipf's law and/or the Pareto
distribution. This helps explain why α eventually becomes constant;
however, I do not have a good explanation as to why α is often close
to 1.
3.7.2. Compatibility between laws. Another mathematical line
of support for Benford's law, Zipf's law, and the Pareto distribution
is that the laws are highly compatible with each other. For instance,
Zipf's law and the Pareto distribution are formally equivalent: if there
are N samples of X, then applying (3.34) with x equal to the n-th
largest value X_n of X gives
n/N = P(X ≥ X_n) = c · X_n^{−1/α}
which implies Zipf's law X_n = C · n^{−α} with C := (Nc)^{α}. Conversely
one can deduce the Pareto distribution from Zipf's law. These de-
ductions are only formal in nature, because the Pareto distribution
can only hold exactly for continuous distributions, whereas Zipf's law
only makes sense for discrete distributions, but one can generate more
rigorous variants of these deductions without much difficulty.
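The formal equivalence can be checked numerically: generate exact Zipf data X_n = C n^{−α} and compare the empirical tail proportions with the Pareto prediction c x^{−1/α}, where c = C^{1/α}/N. (The parameter values below are arbitrary choices of mine.)

```python
alpha, C, N = 0.8, 1000.0, 10000
X = [C * n ** (-alpha) for n in range(1, N + 1)]  # exact Zipf's law
c = C ** (1 / alpha) / N                          # implied Pareto constant

for x in [1.0, 5.0, 50.0]:
    emp = sum(1 for v in X if v >= x) / N         # proportion with X >= x
    pred = c * x ** (-1 / alpha)                  # Pareto prediction (3.34)
    print(x, emp, pred)
```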
In some literature, Zipf's law is applied primarily near the ex-
treme edge of the distribution (e.g. the top 0.1% of the sample space),
whereas the Pareto distribution is applied in regions closer to the bulk (e.g. be-
tween the top 0.1% and top 50%). But this is mostly a difference
of degree rather than of kind, though in some cases (such as with
the example of the 2007 country populations data set) the exponent α
for the Pareto distribution in the bulk can differ slightly from the
exponent for Zipf's law at the extreme edge.
The relationship between Zipf's law or the Pareto distribution
and Benford's law is more subtle. For instance Benford's law pre-
dicts that the proportion of X with initial digit 1 should equal the
proportion of X with initial digit 2 or 3. But if one formally uses
the Pareto distribution (3.34) to compare those X between 10^m and
2 · 10^m, and those X between 2 · 10^m and 4 · 10^m, it seems that
the former is larger by a factor of 2^{1/α}, which upon summing by m
appears inconsistent with Benford's law (unless α is extremely large).
A similar inconsistency is revealed if one uses Zipf's law instead.
However, the fallacy here is that the Pareto distribution (or Zipf's
law) does not apply on the entire range of X, but only on the upper
tail region when X is significantly higher than the median; it is a law
for the outliers of X only. In contrast, Benford's law concerns the
behaviour of typical values of X; the behaviour of the top 0.1% is of
negligible significance to Benford's law, though it is of prime impor-
tance for Zipf's law and the Pareto distribution. Thus the two laws
describe different components of the distribution and thus comple-
ment each other. Roughly speaking, Benford's law asserts that the
bulk distribution of log_10 X is locally uniform at unit scales, while the
Pareto distribution (or Zipf's law) asserts that the tail distribution of
log_10 X decays exponentially. Note that Benford's law only describes
the fine-scale behaviour of the bulk distribution; the coarse-scale dis-
tribution can be a variety of distributions (e.g. log-gaussian).
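The last remark is easy to test (an illustration of mine, with arbitrary parameters): a log-gaussian statistic with a wide spread is very close to Benford in its first digits, even though its tail is not a power law.

```python
import math, random

random.seed(5)

def first_digit(x):
    return int(10 ** (math.log10(x) % 1))

# log X normally distributed with a large standard deviation:
xs = [random.lognormvariate(0, 6.0) for _ in range(100000)]
freq1 = sum(1 for x in xs if first_digit(x) == 1) / len(xs)
print(freq1, math.log10(2))  # first digits are Benford to high accuracy
```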
Notes. This article first appeared at terrytao.wordpress.com/2009/07/03.
Thanks to Kevin O'Bryant for corrections. Several other derivations
of Benford's law and the Pareto distribution, such as those relying on
max-entropy principles, were also discussed in the comments.
3.8. Selberg's limit theorem for the Riemann
zeta function on the critical line
The Riemann zeta function ζ(s), defined for Re(s) > 1 by
(3.37) ζ(s) := Σ_{n=1}^∞ 1/n^s
and then continued meromorphically to other values of s by analytic
continuation, is a fundamentally important function in analytic num-
ber theory, as it is connected to the primes p = 2, 3, 5, . . . via the
Euler product formula
(3.38) ζ(s) = Π_p (1 − 1/p^s)^{−1}
(for Re(s) > 1, at least), where p ranges over primes. (The equiva-
lence between (3.37) and (3.38) is essentially the generating function
version of the fundamental theorem of arithmetic.) The function ζ
has a pole at 1 and a number of zeroes ρ. A formal application of the
factor theorem gives
(3.39) ζ(s) = (1/(s − 1)) Π_ρ (s − ρ) × . . .
where ρ ranges over zeroes of ζ, and we will be vague about what the
. . . factor is, how to make sense of the infinite product, and exactly
which zeroes of ζ are involved in the product. Equating (3.38) and
(3.39) and taking logarithms gives the formal identity
(3.40) log ζ(s) = − Σ_p log(1 − 1/p^s) = − log(s − 1) + Σ_ρ log(s − ρ) + . . . ;
using the Taylor expansion
(3.41) log(1 − 1/p^s) = − 1/p^s − 1/(2p^{2s}) − 1/(3p^{3s}) − . . .
and differentiating the above identity in s yields the formal identity
(3.42) − ζ′(s)/ζ(s) = Σ_n Λ(n)/n^s = 1/(s − 1) − Σ_ρ 1/(s − ρ) + . . .
where Λ(n) is the von Mangoldt function, defined to be log p when n
is a power of a prime p, and zero otherwise. Thus we see that the
behaviour of the primes (as encoded by the von Mangoldt function)
is intimately tied to the distribution of the zeroes ρ. For instance, if
we knew that the zeroes were far away from the axis Re(s) = 1, then
we would heuristically have
Σ_n Λ(n)/n^{1+it} ≈ 1/(it)
for real t. On the other hand, the integral test suggests that
Σ_n 1/n^{1+it} ≈ 1/(it)
and thus we see that Λ(n)/n and 1/n have essentially the same (multi-
plicative) Fourier transform:
Σ_n Λ(n)/n^{1+it} ≈ Σ_n 1/n^{1+it}.
Inverting the Fourier transform (or performing a contour integral
closely related to the inverse Fourier transform), one is led to the
prime number theorem
Σ_{n≤x} Λ(n) ≈ Σ_{n≤x} 1.
In fact, the standard proof of the prime number theorem basically
proceeds by making all of the above formal arguments precise and
rigorous.
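The statement Σ_{n≤x} Λ(n) ≈ x can be checked directly for moderate x by sieving out the prime powers (a small script of mine; the cutoff x is arbitrary):

```python
import math

def chebyshev_psi(x):
    """Compute psi(x) = sum of Lambda(n) for n <= x by sieving:
    Lambda(p^k) = log p for prime powers, and 0 otherwise."""
    is_prime = [True] * (x + 1)
    psi = 0.0
    for p in range(2, x + 1):
        if is_prime[p]:
            for q in range(2 * p, x + 1, p):
                is_prime[q] = False
            pk = p
            while pk <= x:       # add log p for each prime power p^k <= x
                psi += math.log(p)
                pk *= p
    return psi

x = 200000
print(chebyshev_psi(x) / x)  # tends to 1 by the prime number theorem
```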
Unfortunately, we don't know as much about the zeroes ρ of the
zeta function (and hence, about the function ζ itself) as we would
like. The Riemann hypothesis (RH) asserts that all the zeroes (ex-
cept for the trivial zeroes at the negative even numbers) lie on the
critical line Re(s) = 1/2; this hypothesis would make the error terms
in the above proof of the prime number theorem significantly more
accurate. Furthermore, the stronger GUE hypothesis asserts in addi-
tion to RH that the local distribution of these zeroes on the critical
line should behave like the local distribution of the eigenvalues of a
random matrix drawn from the gaussian unitary ensemble (GUE). I
will not give a precise formulation of this hypothesis here, except to
say that the adjective local in the context of distribution of zeroes
means something like at scale O(1/ log T) when Im(s) = O(T).
Nevertheless, we do know some reasonably non-trivial facts about
the zeroes ρ and the zeta function ζ, either unconditionally, or assum-
ing RH (or GUE). Firstly, there are no zeroes for Re(s) > 1 (as one
can already see from the convergence of the Euler product (3.38) in
this case) or for Re(s) = 1 (this is trickier, relying on (3.42) and the
elementary observation that
Re(3 Λ(n)/n^σ + 4 Λ(n)/n^{σ+it} + Λ(n)/n^{σ+2it}) = 2 (Λ(n)/n^σ)(1 + cos(t log n))^2 ≥ 0
. . . ). From the functional equation
π^{−s/2} Γ(s/2) ζ(s) = π^{−(1−s)/2} Γ((1 − s)/2) ζ(1 − s)
(which can be viewed as a consequence of the Poisson summation
formula, see e.g. Section 1.5 of Poincaré's Legacies, Vol. I ) we know
that there are no zeroes for Re(s) ≤ 0 either (except for the trivial
zeroes at negative even integers, corresponding to the poles of the
Gamma function). Thus all the non-trivial zeroes lie in the critical
strip 0 < Re(s) < 1.
We also know that there are infinitely many non-trivial zeroes,
and can approximately count how many zeroes there are in any large
bounded region of the critical strip. For instance, for large T, the
number of zeroes ρ in this strip with Im(ρ) = T + O(1) is O(log T).
This can be seen by applying (3.42) to s = 2 + iT (say); the trivial ze-
roes at the negative even integers end up giving a contribution of O(log T)
to this sum (this is a heavily disguised variant of Stirling's formula,
as one can view the trivial zeroes as essentially being poles of the
Gamma function), while the 1/(s − 1) and . . . terms end up being negli-
gible (of size O(1)), while each non-trivial zero ρ contributes a term
which has a non-negative real part, and furthermore has size compa-
rable to 1 if Im(ρ) = T + O(1). (Here I am glossing over a technical
renormalisation needed to make the infinite series in (3.42) converge
properly.) Meanwhile, the left-hand side of (3.42) is absolutely con-
vergent for s = 2 + iT and of size O(1), and the claim follows. A more
refined version of this argument shows that the number of non-trivial
zeroes with 0 ≤ Im(ρ) ≤ T is (T/2π) log (T/2π) − T/2π + O(log T), but we will
not need this more precise formula here. (A fair fraction - at least
40%, in fact - of these zeroes are known to lie on the critical line; see
[Co1989].)
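For concreteness, the main term of this counting formula can be compared against known zero counts. With the refined constant 7/8 from the standard Riemann-von Mangoldt formula (the constant is not stated in the text; I add it here as a known fact), the prediction at T = 100 is very close to the actual count of 29 zeroes with 0 < Im(ρ) ≤ 100.

```python
import math

def N_main(T):
    """Main term (T/2pi) log(T/2pi) - T/2pi of the zero-counting formula."""
    return (T / (2 * math.pi)) * math.log(T / (2 * math.pi)) - T / (2 * math.pi)

T = 100
print(N_main(T) + 7 / 8)  # ~29.0; the actual count N(100) is 29
```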
Another thing that we happen to know is how the magnitude
|ζ(1/2 + it)| of the zeta function is distributed as t → ∞; it turns
out to be log-normally distributed with log-variance about (1/2) log log t.
More precisely, we have the following result of Selberg:
Theorem 3.8.1. Let T be a large number, and let t be chosen uni-
formly at random from between T and 2T (say). Then the distribution
of (1/√((1/2) log log T)) log |ζ(1/2 + it)| converges (in distribution) to the nor-
mal distribution N(0, 1).
To put it more informally, log |ζ(1/2 + it)| behaves like √((1/2) log log t) ·
N(0, 1) plus lower order terms for typical large values of t. (Zeroes
of ζ are, of course, certainly not typical, but one can show that one
can usually stay away from these zeroes.) In fact, Selberg showed a
slightly more precise result, namely that for any fixed k ≥ 1, the k-th
moment of (1/√((1/2) log log T)) log |ζ(1/2 + it)| converges to the k-th moment
of N(0, 1).
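To get a feel for Theorem 3.8.1 numerically one needs to evaluate ζ on the critical line. The snippet below is my own, using Borwein's alternating-series algorithm for the eta function (a standard method, not one used in the text); it is only accurate for moderate |Im(s)|, and is checked against two known values of ζ.

```python
import math
from fractions import Fraction

def zeta(s, n=80):
    """Riemann zeta for Re(s) > 0, s != 1, via Borwein's algorithm for the
    alternating series eta(s) = sum_{k>=1} (-1)^(k-1)/k^s; accurate only
    for moderate |Im(s)| (the truncation error grows like exp(pi*|Im(s)|/2))."""
    acc = Fraction(0)
    d = [0.0] * (n + 1)
    for i in range(n + 1):
        acc += Fraction(math.factorial(n + i - 1) * 4 ** i,
                        math.factorial(n - i) * math.factorial(2 * i))
        d[i] = float(n * acc)
    eta = -sum((-1) ** k * (d[k] - d[n]) * (k + 1) ** (-s)
               for k in range(n)) / d[n]
    return eta / (1 - 2 ** (1 - s))

# Two sanity checks against known values:
print(abs(zeta(2) - math.pi ** 2 / 6))  # essentially zero
print(abs(zeta(0.5 + 14.134725j)))      # near zero: first zero on the critical line
```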
Remarkably, Selberg's result does not need RH or GUE, though it
is certainly consistent with such hypotheses. (For instance, the deter-
minant of a GUE matrix asymptotically obeys a remarkably similar
log-normal law to that given by Selberg's theorem.) Indeed, the net
effect of these hypotheses only affects some error terms in log |ζ(1/2 +
it)| of magnitude O(1), and are thus asymptotically negligible com-
pared to the main term, which has magnitude about O(√(log log T)) . . .
(3.43) log |ζ(s)| = Σ_ρ log |s − ρ| + . . . .
This formula turns out not to be directly useful because it requires
one to know much more about the distribution of the zeroes ρ than
we currently possess. On the other hand, from the first part of (3.40)
and (3.41) one also has the formula
(3.44) log |ζ(s)| = Σ_p Re (1/p^s) + . . . .
This formula also turns out not to be directly useful, because it re-
quires one to know much more about the distribution of the primes
p than we currently possess.
However, it turns out that we can split the difference between
(3.43) and (3.44), and get a formula for log |ζ(s)| which involves some
zeroes ρ and some primes p, in a manner that one can control them
both. Roughly speaking, the formula looks like this⁸:
(3.45)
log |ζ(s)| = Σ_{p ≤ T^ε} Re (1/p^s) + O( Σ_{ρ = s + O(1/ log T)} (1 + |log (|s − ρ|/(1/ log T))|) ) + . . .
for s = 1/2 + it and t = O(T), where ε is a small parameter that we
can choose (e.g. ε = 0.01); thus we have localised the prime sum to
the primes p of size O(T^{O(ε)}), and the zero sum to those zeroes ρ at a
distance O(1/ log T) from s.
It turns out that all of these expressions can be controlled. The
error term coming from the zeroes (as well as the . . . error term)
turn out to be of size O(1) for most values of t, so are a lower order
term. (As mentioned before, it is this error term that would be better
controlled if one had RH or GUE, but this is not necessary to establish
Selberg's result.) The main term is the one coming from the primes.
We can heuristically argue as follows. The expression X_p :=
Re (1/p^s) = (1/√p) cos(t log p), for t ranging between T and 2T, is a random
variable of mean zero and variance approximately 1/(2p) (if p ≤ T^ε and ε
is small). Making the heuristic assumption that the X_p behave as if
they were independent, the central limit theorem then suggests that
the sum Σ_{p ≤ T^ε} X_p should behave like a normal distribution of mean
zero and variance Σ_{p ≤ T^ε} 1/(2p). But the claim now follows from the
classical estimate
Σ_{p ≤ x} 1/p = log log x + O(1)
(which follows from the prime number theorem, but can also be de-
duced from the formula (3.44) for s = 1 + O(1/ log x), using the fact
that ζ has a simple pole at 1).
To summarise, there are three main tasks to establish Selberg's
theorem:
(1) Establish a formula along the lines of (3.45);
⁸This is an oversimplification; there is a tail coming from those zeroes that are
more distant from s than O(1/ log T), and also one has to smooth out the sum in p a
little bit, and allow the implied constants in the O() notation to depend on ε, but let
us ignore these technical issues here, as well as the issue of what exactly is hiding in
the . . . error.
(2) Show that the error terms arising from zeroes are O(1) on
the average;
(3) Justify the central limit calculation for Σ_p X_p.
I'll now briefly talk (informally) about each of the three steps in
turn.
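Task (3) is easy to explore numerically (an illustration of mine, with arbitrary cutoffs): sampling t uniformly from [T, 2T], the sum of X_p = cos(t log p)/√p over small primes has mean near zero and variance near Σ_p 1/(2p), as the independence heuristic predicts.

```python
import math, random

random.seed(7)

def primes_up_to(n):
    sieve = [True] * (n + 1)
    out = []
    for p in range(2, n + 1):
        if sieve[p]:
            out.append(p)
            for q in range(p * p, n + 1, p):
                sieve[q] = False
    return out

primes = primes_up_to(1000)
T, trials = 1.0e6, 3000
sums = []
for _ in range(trials):
    t = random.uniform(T, 2 * T)
    sums.append(sum(math.cos(t * math.log(p)) / math.sqrt(p) for p in primes))

mean = sum(sums) / trials
var = sum((v - mean) ** 2 for v in sums) / trials
print(mean, var, 0.5 * sum(1 / p for p in primes))  # empirical vs predicted
```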
3.8.2. The explicit formula. To get a formula such as (3.45), the
basic strategy is to take a suitable average of the formula (3.43) and
the formula (3.44). Traditionally, this is done by contour integration;
however, I prefer (perhaps idiosyncratically) to take a more Fourier-
analytic perspective, using convolutions rather than contour integrals.
(The two approaches are largely equivalent, though.) The basic point
is that the imaginary part Im(ρ) of the zeroes ρ inhabits the same space
as the imaginary part t = Im(s) of the s variable, which in turn is the
Fourier analytic dual of the variable that the logarithm log p of the
primes p live in; this can be seen by writing (3.43), (3.44) in a more
Fourier-like manner⁹ as
Σ_p (1/√p) e^{−it log p} + . . . .
The uncertainty principle then predicts that localising log p to the
scale O(log T^ε) . . .
(3.46) ∫_R log |ζ(s + iy/ log T^ε)| φ(y) dy
where φ is some bump function with total mass 1; informally, this is
log |ζ(s)| averaged out in the vertical direction at scale O(1/ log T^ε) =
O(1/ log T) (we allow implied constants to depend on ε).
We can express (3.46) in two different ways, one using (3.43),
and one using (3.44). Let's look at (3.43) first. If one modifies s by
O(1/ log T), then the quantity log |s − ρ| doesn't fluctuate very much,
⁹These sorts of Fourier-analytic connections are often summarised by the slogan
the zeroes of the zeta function are the music of the primes.
unless ρ is within O(1/ log T) of s, in which case it can move by about
O(1 + |log (|s − ρ|/(1/ log T))|). As a consequence, we see that
∫_R log |s + iy/ log T^ε − ρ| φ(y) dy ≈ log |s − ρ|
when |ρ − s| ≫ 1/ log T, and
∫_R log |s + iy/ log T^ε − ρ| φ(y) dy = . . .
(3.47) . . . + Σ_{ρ = s + O(1/ log T)} O(1 + |log (|s − ρ|/(1/ log T))|) + . . . .
Now let's compute (3.46) using (3.44) instead. Writing s = 1/2 +
it, we express (3.46) as
Σ_p Re (1/p^s) ∫_R e^{−iy log p/ log T^ε} φ(y) dy + . . . .
Introducing the Fourier transform φ̂(ξ) := ∫_R e^{−iyξ} φ(y) dy of φ, one
can write this as
Σ_p Re (1/p^s) φ̂(log p/ log T^ε) + . . . .
Now we took φ to be a bump function, so its Fourier transform should
also be like a bump function (or perhaps a Schwartz function). As a
first approximation, one can thus think of φ̂ as a smoothed truncation
to the region {ξ : ξ = O(1)}, thus the φ̂(log p/ log T^ε) weight is
morally restricting p to the region p ≤ T^{O(ε)} . . .
Σ_{p ≤ T^ε} Re (1/p^s) + . . . .
Comparing this with the other formula (3.47) we have for (3.46), we
obtain (3.45) as required (formally, at least).
3.8.3. Controlling the zeroes. Next, we want to show that the
quantity
Σ_{ρ = s + O(1/ log T)} (1 + |log (|s − ρ|/(1/ log T))|)
is O(1) on the average, when s = 1/2 + it and t is chosen uniformly
at random from T to 2T.
For this, we can use the first moment method. For each zero
ρ, let I_ρ . . .
. . . Σ_{p ≤ T^ε} X_p behaves like a normal distribution, as predicted by the
central limit theorem heuristic. The key is to show that the X_p behave
as if they were jointly independent. In particular, as the X_p all have
mean zero, one would like to show that products such as
(3.48) X_{p_1} . . . X_{p_k}
have a negligible expectation as long as at least one of the primes
in p_1, . . . , p_k occurs at most once. Once one has this (as well as a
similar formula for the case when all primes appear at least twice),
one can then do a standard moment computation of the k-th moment
(Σ_{p ≤ T^ε} X_p)^k and verify that this moment then matches the answer
predicted by the central limit theorem, which by standard arguments
(involving the Weierstrass approximation theorem) is enough to es-
tablish the distributional law. Note that to get close to the normal
distribution by a fixed amount of accuracy, it suffices to control a
bounded number of moments, which ultimately means that we can
treat k as being bounded, k = O(1).
If we expand out the product (3.48), we get
(1/√(p_1 . . . p_k)) cos(t log p_1) . . . cos(t log p_k).
Using the product formula for cosines (or Euler's formula), the prod-
uct of cosines here can be expressed as a linear combination of cosines
cos(tξ), where the frequency ξ takes the form
ξ = log p_1 ± log p_2 ± . . . ± log p_k.
Thus, ξ is the logarithm of a rational number, whose numerator and
denominator are the product of some of the p_1, . . . , p_k. Since all the
p_j are at most T^ε . . .
. . . Σ_{x ≤ n ≤ x+y} Λ(n) = y + O_ε(y^{1/2+ε})
in the prime number theorem holding asymptotically as x → ∞ for
all ε > 0 and all intervals [x, x + y] which are large in the sense that
y is comparable to x. Meanwhile, the pair correlation conjecture (the
simplest component of the GUE hypothesis) is equivalent (on RH)
to the square root error term holding (with the expected variance)
for all ε > 0 and almost all intervals [x, x + y] which are short in
the sense that y = x^ε . . .
. . . (and P to P′) with |A′|/|P′| ≥ |A|/|P| + c for
some c > 0). On the other hand, the density of A is clearly
bounded above by 1. As long as one has a sufficiently good
lower bound on the density increment at each stage, one
can conclude an upper bound on the number of iterations
in the algorithm. The prototypical example of this is Roth's
proof of his theorem [Ro1953] that every set of integers of
positive upper density contains an arithmetic progression of
length three. The general strategy here is to keep looking
for useful density fluctuations inside A, and then zoom in
to a region of increased density by reducing A and P appro-
priately. Eventually no further usable density fluctuation
remains (i.e. A is uniformly distributed), and this should
force some desirable property on A.
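The bookkeeping behind "a good density increment bounds the number of iterations" is elementary; here is a sketch of mine, using a constant increment c per stage for simplicity:

```python
def iterations_until_stuck(delta0, c):
    """Each stage raises the density by at least c, and density is at most 1,
    so the iteration must halt after at most (1 - delta0)/c stages."""
    density, steps = delta0, 0
    while density + c <= 1:
        density += c
        steps += 1
    return steps

print(iterations_until_stuck(0.1, 0.05))  # at most (1 - 0.1)/0.05 = 18 stages
```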
The energy increment argument. This is an L^2 analogue
of the L^1-based mass increment argument (or the L^∞-
based density increment argument), in which one seeks to
increment the amount of energy that A captures from
some reference object X, or (equivalently) to decrement the
amount of energy of X which is still orthogonal to A. Here
A and X are related somehow to a Hilbert space, and the
energy involves the norm on that space. A classic example
of this type of argument is the existence of orthogonal pro-
jections onto closed subspaces of a Hilbert space; this leads
among other things to the construction of conditional ex-
pectation in measure theory, which then underlies a number
of arguments in ergodic theory, as discussed for instance in
Section 2.8 of Poincaré's Legacies, Vol. I. Another basic
example is the standard proof of the Szemerédi regularity
lemma (where the energy is often referred to as the in-
dex). These examples are related; see Section 4.2 for fur-
ther discussion. The general strategy here is to keep looking
for useful pieces of energy orthogonal to A, and add them
to A to form A′ and . . .
3.10. Mosers entropy compression argument 471
R
. Thus, each
stage of the argument compresses the information-theoretic content
of the string A+R into the string A
+R
+H
in a lossless fashion.
However, a random variable such as $A + R$ cannot be compressed losslessly into a string of expected size smaller than the Shannon entropy of that variable. Thus, if one has a good lower bound on the entropy of $A + R$, and if the length of $A' + R' + H'$ is significantly less than that of $A + R$ (i.e. we need the marginal growth in the length of the history file $H'$ …

… of $A$ whose set $T$ … randomly; in particular, we will use $k$ fresh random bits to replace the $k$ bits of $A$ in the support of $s$. (By doing so, there is a small probability ($2^{-k}$) that we in fact do not change $A$ at all, but the argument is (very) slightly simpler if we do not bother to try to eliminate this case.)
If all the clauses had disjoint supports, then this strategy would work without difficulty. But when the supports are not disjoint, one has a problem: every time one modifies $A$ to fix a clause $s$ by modifying the variables on the support of $s$, one may cause other clauses $s'$ in $S$, whose supports intersect that of $s$, to become violated by $A$; this is a collection of at most $2^{k-C}$ clauses, possibly including $s$ itself. Order these clauses $s'$ … to each such clause in turn. (Thus the original algorithm Fix($s$) is put on hold on some CPU stack while all the child processes Fix($s'$) … $R'$ by appending to $R$ … $+ R'$) to the fix algorithm. But then $s$ is one of the at most $2^{k-C}$ clauses in $S$ whose support intersects that of $s'$. Let us call this number the index of the call Fix($s$).
Now imagine that while the Fix routine is called, a running log file (or history) $H$ of the routine is kept, which records $s$ each time one of the original $|T|$ calls Fix($s$) with $s \in T$ is invoked, and also records the index of any other call Fix($s'$) made during the recursive procedure. Finally, we assume that this log file records a termination symbol whenever a Fix routine terminates. By performing a stack trace, one sees that whenever a Fix routine is called, the clause $s$ that is being repaired by that routine can be deduced from an inspection of the log file $H$ up to that point.

As a consequence, at any intermediate stage in the process of all these fix calls, the original state $A + R$ of the assignment and the random string of bits can be deduced from the current state $A' + R' + H'$ up to that point.
Now suppose for contradiction that $S$ is not satisfiable; thus the stack of fix calls can never completely terminate. We trace through this stack for $M$ steps, where $M$ is some large number to be chosen later. After these steps, the random string $R$ has shortened by an amount of $Mk$; if we set $R$ to initially have length $Mk$, then the string $R'$ is now completely empty. Meanwhile, the history file $H'$ has size at most $O(|S|) + M(k - C + O(1))$, since it takes $O(|S|)$ bits to store the initial clauses in $T$, $O(|S|) + O(M)$ bits to record all the instances when Step 1 occurs, and every subsequent call to Fix generates a $(k - C)$-bit number, plus possibly a termination symbol of size $O(1)$. Thus we have a lossless compression algorithm $A + R \mapsto A' + H'$.
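The recursive repair procedure can be made concrete with a toy implementation. This is a hypothetical, simplified sketch: the clause encoding as (variable, wanted bit) pairs and the names `fix_all`/`violated` are inventions of this sketch, and for simplicity it rechecks every clause after a resampling rather than only the clauses whose supports intersect the resampled one.

```python
import random

def fix_all(clauses, n_vars, seed=0):
    """Moser-style repair: resample the variables of a violated clause with
    fresh random bits, then recursively repair any clauses broken in turn.
    A clause is a list of literals (var, wanted_bit); it is violated when
    no literal holds.  `history` plays the role of the log file H."""
    rng = random.Random(seed)
    assign = [rng.randrange(2) for _ in range(n_vars)]
    history = []

    def violated(clause):
        return all(assign[v] != b for (v, b) in clause)

    def fix(clause):
        history.append(clause)
        for (v, _) in clause:          # k fresh random bits on the support
            assign[v] = rng.randrange(2)
        for other in clauses:          # simplified: recheck every clause,
            if violated(other):        # not just neighbours of `clause`
                fix(other)

    for c in clauses:
        if violated(c):
            fix(c)
    return assign, history
```

For a satisfiable instance the recursion terminates (almost surely) with a satisfying assignment, and the length of `history` is the number of Fix calls, which is exactly the quantity the entropy argument bounds.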
Proof. For each $1 \le i \le \log_2^2 n$, the number $n^i - 1$ has at most $O(\log^{O(1)} n)$ prime divisors (by the fundamental theorem of arithmetic). If one picks $r$ to be the first prime not equal to any of these prime divisors, one obtains the claim. (One can use a crude version of the prime number theorem to get the upper bound on $r$.)
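The conclusion of the lemma, a small prime $r$ modulo which $n$ has large multiplicative order, can also be checked by direct search. A minimal sketch, using only trial division; the helper names `find_r` and `order_mod` are ad hoc, and the brute-force search replaces the proof's argument via the prime divisors of the $n^i - 1$.

```python
import math

def is_prime(m):
    """Trial-division primality test (adequate for small m)."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def order_mod(n, r):
    """Multiplicative order of n modulo r (assumes gcd(n, r) = 1)."""
    k, x = 1, n % r
    while x != 1:
        x = (x * n) % r
        k += 1
    return k

def find_r(n):
    """Smallest prime r, coprime to n, with ord_r(n) > log_2(n)^2."""
    target = math.log2(n) ** 2
    r = 2
    while True:
        if is_prime(r) and n % r != 0 and order_mod(n, r) > target:
            return r
        r += 1
```

For example, for $n = 31$ one needs a prime $r$ with $\mathrm{ord}_r(31) > \log_2^2 31 \approx 24.5$; the search returns $r = 29$, where $31 \equiv 2$ has order $28$.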
It is clear that Theorem 3.11.1 follows from Theorem 3.11.2 and Lemma 3.11.3, so it suffices now to prove Theorem 3.11.2.

Suppose for contradiction that Theorem 3.11.2 fails. Then $n$ is divisible by some smaller prime $p$, but is not a power of $p$. Since $n$ is coprime to all numbers of size $O(\log^{O(1)} n)$, we know that $p$ is not of polylogarithmic size; thus we may assume $p \ge \log^C n$ for any fixed $C$. As $r$ is coprime to $n$, we see that $r$ is not a multiple of $p$ (indeed, one should view $p$ as being much larger than $r$).
Let $F$ be a field extension of $\mathbf{F}_p$ by a primitive $r^{th}$ root of unity $X$, thus $F = \mathbf{F}_p[X]/h(X)$ for some factor $h(X)$ (in $\mathbf{F}_p[X]$) of the $r^{th}$ cyclotomic polynomial $\Phi_r(X)$. From the hypothesis (3.53), we see that
$$(X + a)^n = X^n + a$$
in $F$ for all $1 \le a \le A$, where $A = O(r \log^{O(1)} n)$. Note that $n$ is coprime to every integer less than $A$, and thus $A < p$.
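Congruences of the shape $(X+a)^n = X^n + a$ can be tested mechanically for small parameters. The sketch below works in $(\mathbf{Z}/n\mathbf{Z})[X]/(X^r - 1)$, a slightly larger quotient than the field $F$ of the text but the standard computational setting for this congruence; the helper names are invented for this illustration.

```python
def polymul_mod(f, g, r, n):
    """Multiply polynomials f, g (coefficient lists of length r) in
    (Z/nZ)[X] modulo X^r - 1."""
    h = [0] * r
    for i, a in enumerate(f):
        if a:
            for j, b in enumerate(g):
                h[(i + j) % r] = (h[(i + j) % r] + a * b) % n
    return h

def polypow_mod(f, e, r, n):
    """Square-and-multiply exponentiation of f modulo (X^r - 1, n)."""
    result = [1] + [0] * (r - 1)
    while e:
        if e & 1:
            result = polymul_mod(result, f, r, n)
        f = polymul_mod(f, f, r, n)
        e >>= 1
    return result

def aks_congruence(n, r, a):
    """Check (X + a)^n == X^n + a in (Z/nZ)[X]/(X^r - 1)."""
    lhs = polypow_mod([a % n, 1] + [0] * (r - 2), n, r, n)
    rhs = [0] * r
    rhs[n % r] = (rhs[n % r] + 1) % n
    rhs[0] = (rhs[0] + a) % n
    return lhs == rhs
```

For prime $n$ the congruence holds for every $a$ by the freshman's-dream identity $(X+a)^p = X^p + a^p$ and Fermat's little theorem; for typical composite $n$ it fails, which is the basis of the AKS test.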
Meanwhile, from (3.51) one has
$$(X + a)^p = X^p + a$$
in $F$ for all such $a$. The two equations give
$$(X^p + a)^{n/p} = (X^p)^{n/p} + a.$$
3. Expository articles

Note that the $p^{th}$ power $X^p$ of a primitive $r^{th}$ root of unity $X$ is again a primitive $r^{th}$ root of unity (and conversely, every primitive $r^{th}$ root arises in this fashion) and hence we also have
$$(X + a)^{n/p} = X^{n/p} + a$$
in $F$ for all $1 \le a \le A$.
Inspired by this, we define a key concept: a positive integer $m$ is said to be introspective if one has
$$(X + a)^m = X^m + a$$
in $F$ for all $1 \le a \le A$, or equivalently if $(X + a)^m = \sigma_m(X + a)$, where $\sigma_m : F \to F$ is the ring homomorphism that sends $X$ to $X^m$. We have just shown that $p$, $n$, $n/p$ are all introspective; $1$ is also trivially introspective. Furthermore, if $m$ and $m'$ are introspective, it is not hard to see that $mm'$ is also introspective … $\sigma_m(P(X))$). In particular, this shows that $X^{m_1}, \ldots, X^{m_t}$ are all roots of the polynomial $P - Q$. But this polynomial has degree less than $t$, and the $X^{m_1}, \ldots, X^{m_t}$ are distinct by hypothesis, and we obtain the desired contradiction by the factor theorem.
Proposition 3.11.5 (Upper bound on $|G|$). Suppose that there are exactly $t$ residue classes modulo $r$ of the form $p^i (n/p)^j \bmod r$ for $i, j \ge 0$. Then $|G| \le n^{\sqrt{t}}$.
Proof. By the pigeonhole principle, we must have a collision
$$p^i (n/p)^j = p^{i'} (n/p)^{j'} \bmod r$$
for some $0 \le i, j, i', j' \le \sqrt{t}$ with $(i, j) \neq (i', j')$. Setting $m := p^i (n/p)^j$ and $m' := p^{i'} (n/p)^{j'}$, we obtain two introspective numbers of size at most $n^{\sqrt{t}}$ which are equal modulo $r$. (To ensure that $m$, $m'$ are distinct … $t$, and the claim now follows from the factor theorem.
Since $n$ has order greater than $\log_2^2 n$ in $(\mathbf{Z}/r\mathbf{Z})^\times$, we see that the number $t$ of residue classes modulo $r$ of the form $p^i (n/p)^j$ exceeds $\log_2^2 n$. But then $2^t > n^{\sqrt{t}}$, and so Propositions 3.11.4, 3.11.5 are incompatible.
Notes. This article first appeared at terrytao.wordpress.com/2009/08/11. Thanks to Leandro, theoreticalminimum and windfarmmusic for corrections.

A thorough discussion of the AKS algorithm can be found at [Gr2005].
3.12. The prime number theorem in arithmetic
progressions, and dueling conspiracies
A fundamental problem in analytic number theory is to understand the distribution of the prime numbers $2, 3, 5, \ldots$. For technical reasons, it is convenient not to study the primes directly, but rather a proxy for the primes known as the von Mangoldt function $\Lambda : \mathbf{N} \to \mathbf{R}$, defined by setting $\Lambda(n)$ to equal $\log p$ when $n$ is a prime $p$ (or a power of that prime) and zero otherwise. The basic reason why the von Mangoldt function is useful is that it encodes the fundamental theorem of arithmetic (which in turn can be viewed as the defining property of the primes) very neatly via the identity
$$(3.54)\qquad \log n = \sum_{d|n} \Lambda(d)$$
for every natural number $n$.
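The identity (3.54) is easy to verify numerically. A small sketch with an ad hoc trial-division implementation of the von Mangoldt function:

```python
import math

def mangoldt(n):
    """Von Mangoldt function: log p if n = p^k for a prime p, else 0."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:   # strip out the smallest prime factor p
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n) if n > 1 else 0.0   # n itself prime, or n = 1

def divisor_sum_of_mangoldt(n):
    """Right-hand side of (3.54): sum of Lambda(d) over divisors d of n."""
    return sum(mangoldt(d) for d in range(1, n + 1) if n % d == 0)
```

The divisor sum reproduces $\log n$ exactly (up to floating-point error), whatever the factorisation of $n$.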
The most important result in this subject is the prime number theorem, which asserts that the number of prime numbers less than a large number $x$ is equal to $(1 + o(1)) \frac{x}{\log x}$:
$$\sum_{p \le x} 1 = (1 + o(1)) \frac{x}{\log x}.$$
Here, of course, $o(1)$ denotes a quantity that goes to zero as $x \to \infty$. It is not hard to see (e.g. by summation by parts) that this is equivalent to the asymptotic
$$(3.55)\qquad \sum_{n \le x} \Lambda(n) = (1 + o(1)) x$$
for the von Mangoldt function (the key point being that the squares, cubes, etc. of primes give a negligible contribution, so $\sum_{n \le x} \Lambda(n)$ is essentially the same quantity as $\sum_{p \le x} \log p$). Understanding the nature of the $o(1)$ term is a very important problem, with the conjectured optimal decay rate of $O(\ldots)$ …

$$\sum_{p \le x: p = a \bmod q} 1 = (1 + o_q(1)) \frac{1}{\phi(q)} \frac{x}{\log x}.$$
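The equivalent form (3.55) can be sampled numerically; the sketch below (`psi` is an ad hoc name for the partial sum of the von Mangoldt function) sums $\log p$ over the prime powers up to $x$:

```python
import math

def psi(x):
    """Chebyshev-type sum: sum of Lambda(n) for n <= x, computed by
    summing log p over all prime powers p^k <= x."""
    total = 0.0
    for p in range(2, x + 1):
        # trial-division primality test for p
        if all(p % q for q in range(2, math.isqrt(p) + 1)):
            pk = p
            while pk <= x:      # add log p once per prime power p^k <= x
                total += math.log(p)
                pk *= p
    return total
```

Already at $x = 1000$ the sum is within a few percent of $x$, in accordance with (3.55).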
(Of course, if $a$ is not coprime to $q$, the number of primes less than $x$ equal to $a \bmod q$ is $O(1)$. The subscript $q$ in the $o()$ and $O()$ notation denotes that the implied constants in that notation are allowed to depend on $q$.) This is a more quantitative version of Dirichlet's theorem, which asserts the weaker statement that the number of primes equal to $a \bmod q$ is infinite. This theorem is important in many applications in analytic number theory, for instance in Vinogradov's theorem that every sufficiently large odd number is the sum of three odd primes. (Imagine for instance if almost all of the primes were clustered in the residue class $2 \bmod 3$, rather than $1 \bmod 3$. Then almost all sums of three odd primes would be divisible by $3$, leaving dangerously few sums left to cover the remaining two residue classes. Similarly for other moduli than $3$. This does not fully rule out the possibility that Vinogradov's theorem could still be true, but it does indicate why the prime number theorem in arithmetic progressions is a relevant tool in the proof of that theorem.)
As before, one can rewrite the prime number theorem in arithmetic progressions in terms of the von Mangoldt function as the equivalent form
$$\sum_{n \le x: n = a \bmod q} \Lambda(n) = (1 + o_q(1)) \frac{1}{\phi(q)} x.$$
Philosophically, one of the main reasons why it is so hard to control the distribution of the primes is that we do not currently have too many tools with which one can rule out "conspiracies" between the primes, in which the primes (or the von Mangoldt function) decide to correlate with some structured object (and in particular, with a totally multiplicative function) which then visibly distorts the distribution of the primes. For instance, one could imagine a scenario in which the probability that a randomly chosen large integer $n$ is prime is not asymptotic to $\frac{1}{\log n}$ (as is given by the prime number theorem), but instead fluctuates depending on the phase of the complex number $n^{it}$ for some fixed real number $t$, thus for instance the probability might be significantly less than $1/\log n$ when $t \log n$ is close to an integer, and significantly more than $1/\log n$ when $t \log n$ is close to a half-integer. This would contradict the prime number theorem, and so this scenario would have to be somehow eradicated in the course of proving that theorem. In the language of Dirichlet series, this conspiracy is more commonly known as a zero of the Riemann zeta function at $1 + it$.
In the above scenario, the primality of a large integer $n$ was somehow sensitive to asymptotic or Archimedean information about $n$, namely the approximate value of its logarithm. In modern terminology, this information reflects the local behaviour of $n$ at the infinite place $\infty$. There are also potential conspiracies in which the primality of $n$ is sensitive to the local behaviour of $n$ at finite places, and in particular to the residue class of $n \bmod q$ for some fixed modulus $q$. For instance, given a Dirichlet character $\chi : \mathbf{Z} \to \mathbf{C}$ of modulus $q$, i.e. a completely multiplicative function on the integers which is periodic of period $q$ (and vanishes on those integers not coprime to $q$), one could imagine a scenario in which the probability that a randomly chosen large integer $n$ is prime is large when $\chi(n)$ is close to $+1$, and small when $\chi(n)$ is close to $-1$, which would contradict the prime number theorem in arithmetic progressions. (Note the similarity between this scenario at $q$ and the previous scenario at $\infty$; in particular, observe that the functions $n \mapsto \chi(n)$ and $n \mapsto n^{it}$ are both totally multiplicative.) In the language of Dirichlet series, this conspiracy is more commonly known as a zero of the $L$-function of $\chi$ at $1$.
An especially difficult scenario to eliminate is that of real characters, such as the Kronecker symbol $\chi(n) = \left(\frac{n}{q}\right)$, in which numbers $n$ which are quadratic nonresidues mod $q$ are very likely to be prime, and quadratic residues mod $q$ are unlikely to be prime. Indeed, there is a scenario of this form - the Siegel zero scenario - which we are still not able to eradicate (without assuming powerful conjectures such as the Generalised Riemann Hypothesis (GRH)), though fortunately Siegel zeroes are not quite strong enough to destroy the prime number theorem in arithmetic progressions.
It is difficult to prove that no conspiracy between the primes exists. However, it is not entirely impossible, because we have been able to exploit two important phenomena. The first is that there is often an "all or nothing" dichotomy (somewhat resembling the zero-one laws in probability) regarding conspiracies: in the asymptotic limit, the primes can either conspire totally (or more precisely, anti-conspire totally) with a multiplicative function, or fail to conspire at all, but there is no middle ground. (In the language of Dirichlet series, this is reflected in the fact that zeroes of a meromorphic function can have order $1$, or order $0$ (i.e. are not zeroes after all), but cannot have an intermediate order between $0$ and $1$.) As a corollary of this fact, the prime numbers cannot conspire with two distinct multiplicative functions at once (by having a partial correlation with one and another partial correlation with another); thus one can use the existence of one conspiracy to exclude all the others. In other words, there is at most one conspiracy that can significantly distort the distribution of the primes. Unfortunately, this argument is ineffective, because it doesn't give any control at all on what that conspiracy is, or even if it exists in the first place!
But now one can use the second important phenomenon, which is that because of symmetries, one type of conspiracy can lead to another. For instance, because the von Mangoldt function is real-valued rather than complex-valued, we have conjugation symmetry; if the primes correlate with, say, $n^{it}$, then they must also correlate with $n^{-it}$. (In the language of Dirichlet series, this reflects the fact that the zeta function and $L$-functions enjoy symmetries with respect to reflection across the real axis (i.e. complex conjugation).) Combining this observation with the all-or-nothing dichotomy, we conclude that the primes cannot correlate with $n^{it}$ for any non-zero $t$, which in fact leads directly to the prime number theorem (3.55), as we shall discuss below. Similarly, if the primes correlated with a Dirichlet character $\chi(n)$, then they would also correlate with the conjugate $\overline{\chi}(n)$, which also is inconsistent with the all-or-nothing dichotomy, except in the exceptional case when $\chi$ is real - which essentially means that $\chi$ is a quadratic character. In this one case (which is the only scenario which comes close to threatening the truth of the prime number theorem in arithmetic progressions), the above tricks fail and one has to instead exploit the algebraic number theory properties of these characters, which has so far led to weaker results than in the non-real case.
As mentioned previously in passing, these phenomena are usually presented using the language of Dirichlet series and complex analysis. This is a very slick and powerful way to do things, but I would like here to present the elementary approach to the same topics, which is slightly weaker but which I find to also be very instructive. (However, I will not be too dogmatic about keeping things elementary, if this comes at the expense of obscuring the key ideas; in particular, I will rely on multiplicative Fourier analysis (both at $\infty$ and at finite places) as a substitute for complex analysis in order to expedite various parts of the argument. Also, the emphasis here will be more on heuristics and intuition than on rigour.)
The material here is closely related to the theory of pretentious
characters developed in [GrSo2007], as well as the earlier paper
[Gr1992].
3.12.1. A heuristic elementary proof of the prime number theorem. To motivate some of the later discussion, let us first give a highly non-rigorous heuristic elementary proof of the prime number theorem (3.55). Since we clearly have
$$\sum_{n \le x} 1 = x + O(1),$$
one can view the prime number theorem as an assertion that the von Mangoldt function behaves like $1$ on the average,
$$(3.56)\qquad \Lambda(n) \approx 1,$$
where we will be deliberately vague as to what the $\approx$ symbol means. (One can think of this symbol as denoting some sort of proximity in the weak topology or vague topology, after suitable normalisation.)

To see why one would expect (3.56) to be true, we take divisor sums of (3.56) to heuristically obtain
$$(3.57)\qquad \sum_{d|n} \Lambda(d) \approx \sum_{d|n} 1.$$
By (3.54), the left-hand side is $\log n$; meanwhile, the right-hand side is the divisor function $\tau(n)$ of $n$, by definition. So we have a heuristic relationship between (3.56) and the informal approximation
$$(3.58)\qquad \tau(n) \approx \log n.$$
In particular, we expect
$$(3.59)\qquad \sum_{n \le x} \tau(n) \approx \sum_{n \le x} \log n.$$

The right-hand side of (3.59) can be approximated using the integral test as
$$(3.60)\qquad \sum_{n \le x} \log n = \int_1^x \log t \, dt + O(\log x) = x \log x - x + O(\log x)$$
(one can also use Stirling's formula to obtain a similar asymptotic). As for the left-hand side, we write $\tau(n) = \sum_{d|n} 1$ and then make the substitution $n = dm$ to obtain
$$\sum_{n \le x} \tau(n) = \sum_{d,m: dm \le x} 1.$$
The right-hand side is the number of lattice points underneath the hyperbola $dm = x$, and can be counted using the Dirichlet hyperbola method:
$$\sum_{d,m: dm \le x} 1 = \sum_{d \le \sqrt{x}} \sum_{m \le x/d} 1 + \sum_{m \le \sqrt{x}} \sum_{d \le x/m} 1 - \sum_{d \le \sqrt{x}} \sum_{m \le \sqrt{x}} 1.$$
The third sum is equal to $(\sqrt{x} + O(1))^2 = x + O(\sqrt{x})$. The first sum is equal to
$$\sum_{d \le \sqrt{x}} \Big( \frac{x}{d} + O(1) \Big) = x \sum_{d \le \sqrt{x}} \frac{1}{d} + O(\sqrt{x}),$$
and similarly for the second sum;
meanwhile, from the integral test and the definition of Euler's constant $\gamma = 0.577\ldots$ one has
$$(3.61)\qquad \sum_{d \le y} \frac{1}{d} = \log y + \gamma + O(1/y)$$
for any $y \ge 1$; combining all these estimates one obtains
$$(3.62)\qquad \sum_{n \le x} \tau(n) = x \log x + (2\gamma - 1) x + O(\sqrt{x}).$$
Comparing this with (3.60) we do see that $\tau(n)$ and $\log n$ are roughly equal to top order on average, thus giving some form of (3.58) and hence (3.57); if one could somehow invert the divisor sum operation, one could hope to get (3.56) and thus the prime number theorem.
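The Dirichlet hyperbola method above is short enough to implement directly. The sketch below (function names are invented) compares the naive lattice-point count with the hyperbola-method count and with the asymptotic (3.62):

```python
import math

def divisor_sum_direct(x):
    """Sum of tau(n) for n <= x, counted as lattice points with dm <= x."""
    return sum(x // d for d in range(1, x + 1))

def divisor_sum_hyperbola(x):
    """Dirichlet hyperbola method: two sums up to sqrt(x), minus the
    double-counted square block (an exact identity, not an approximation)."""
    s = math.isqrt(x)
    return 2 * sum(x // d for d in range(1, s + 1)) - s * s
```

The hyperbola version needs only $O(\sqrt{x})$ terms yet agrees exactly with the direct count, and both sit within the $O(\sqrt{x})$ error band around $x \log x + (2\gamma - 1)x$.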
(Looking at the next highest order terms in (3.60), (3.62), we see that we expect $\tau(n)$ to in fact be slightly larger than $\log n$ on the average, and so $\Lambda(n)$ should be slightly less than $1$ on the average. There is indeed a slight effect of this form; for instance, it is possible (using the prime number theorem) to prove
$$\sum_{d \le y} \frac{\Lambda(d)}{d} = \log y - \gamma + o(1),$$
which should be compared with (3.61).)
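This slight deficit is visible numerically. A sketch (`mertens_gap` is an ad hoc name) comparing the partial sums of $\Lambda(d)/d$ with $\log y$:

```python
import math

def mangoldt(n):
    """Von Mangoldt function by trial division: log p if n = p^k, else 0."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n) if n > 1 else 0.0

def mertens_gap(y):
    """sum_{d<=y} Lambda(d)/d - log y, which should approach -gamma."""
    return sum(mangoldt(d) / d for d in range(1, y + 1)) - math.log(y)
```

At $y = 10^4$ the gap is already close to $-\gamma \approx -0.577$, in contrast with the $+\gamma$ in (3.61).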
One can partially translate the above discussion into the language of Dirichlet series, by transforming various arithmetical functions $f(n)$ to their associated Dirichlet series
$$F(s) := \sum_{n=1}^\infty \frac{f(n)}{n^s},$$
ignoring for now the issue of convergence of this series. By definition, the constant function $1$ transforms to the Riemann zeta function $\zeta(s)$. Taking derivatives in $s$, we see (formally, at least) that if $f(n)$ has Dirichlet series $F(s)$, then $f(n) \log n$ has Dirichlet series $-F'(s)$; thus, for instance, $\log n$ has Dirichlet series $-\zeta'(s)$.

Most importantly, though, if $f(n), g(n)$ have Dirichlet series $F(s), G(s)$ respectively, then their Dirichlet convolution $f * g(n) := \sum_{d|n} f(d) g(\frac{n}{d})$ has Dirichlet series $F(s) G(s)$; this is closely related to the well-known ability of the Fourier transform to convert convolutions to pointwise multiplication. Thus, for instance, $\tau(n)$ has Dirichlet series $\zeta(s)^2$.
Also, from (3.54) and the preceding discussion, we see that $\Lambda(n)$ has Dirichlet series
$$\frac{-\zeta'(s)}{\zeta(s)} = \frac{1}{s-1} - \gamma + O(s-1),$$
thus giving support to (3.56); similarly, the Dirichlet series for $\log n$ and $\tau(n)$ have asymptotics
$$-\zeta'(s) = \frac{1}{(s-1)^2} + O(1)$$
and
$$\zeta(s)^2 = \frac{1}{(s-1)^2} + \frac{2\gamma}{s-1} + O(1),$$
which gives support to (3.58) (and is also consistent with (3.60), (3.62)).
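The convolution-to-product dictionary is also easy to experiment with. A sketch with invented helper names, forming $\tau = 1 * 1$ and truncated Dirichlet series:

```python
def dirichlet_convolve(f, g, N):
    """(f*g)(n) = sum_{d|n} f(d) g(n/d) for n = 1..N.
    Inputs and output are 1-indexed lists of length N+1 (index 0 unused)."""
    h = [0.0] * (N + 1)
    for d in range(1, N + 1):
        for m in range(1, N // d + 1):
            h[d * m] += f[d] * g[m]
    return h

def dseries(f, s, N):
    """Truncated Dirichlet series sum_{n<=N} f(n)/n^s."""
    return sum(f[n] / n ** s for n in range(1, N + 1))
```

For example, convolving the constant function $1$ with itself reproduces the divisor function, and `dseries(one, 2.0, N)` is a truncation of $\zeta(2)$.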
Remark 3.12.1. One can connect the properties of Dirichlet series $F(s)$ more rigorously to asymptotics of partial sums $\sum_{n \le x} f(n)$ by means of various transforms in Fourier analysis and complex analysis, in particular contour integration or the Hilbert transform, but this becomes somewhat technical and we will not do so here. I will remark, though, that asymptotics of $F(s)$ for $s$ close to $1$ are not enough, by themselves, to get really precise asymptotics for the sharply truncated partial sums $\sum_{n \le x} f(n)$, for reasons related to the uncertainty principle; in order to control such sums one also needs to understand the behaviour of $F$ far away from $s = 1$, and in particular for $s = 1 + it$ for large real $t$. On the other hand, the asymptotics for $F(s)$ for $s$ near $1$ are just about all one needs to control smoothly truncated partial sums such as $\sum_n f(n) \eta(n/x)$ for suitable cutoff functions $\eta$. Also, while Dirichlet series are very powerful tools, particularly with regards to understanding Dirichlet convolution identities, and controlling everything in terms of the zeroes and poles of such series, they do have the drawback that they do not easily encode such fundamental "physical space" facts as the pointwise inequalities $|\mu(n)| \le 1$ and $\Lambda(n) \ge 0$, which are also an important aspect of the theory.
3.12.2. Almost primes. One can hope to make the above heuristics precise by applying the Möbius inversion formula
$$1_{n=1} = \sum_{d|n} \mu(d),$$
where $\mu(d)$ is the Möbius function, defined as $(-1)^k$ when $d$ is the product of $k$ distinct primes for some $k \ge 0$, and zero otherwise. In terms of Dirichlet series, we thus see that $\mu$ has the Dirichlet series of $1/\zeta(s)$, and so can invert the divisor sum operation $f(n) \mapsto \sum_{d|n} f(d)$ (which corresponds to multiplication by $\zeta(s)$):
$$f(n) = \sum_{m|n} \mu(m) \Big( \sum_{d|n/m} f(d) \Big).$$
From (3.54) we then conclude
$$(3.63)\qquad \Lambda(n) = \sum_{d|n} \mu(d) \log \frac{n}{d},$$
while from $\tau(n) = \sum_{d|n} 1$ we have
$$(3.64)\qquad 1 = \sum_{d|n} \mu(d) \tau\Big(\frac{n}{d}\Big).$$
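The inversion formula (3.63) can be checked directly. A sketch with ad hoc trial-division implementations of $\mu$ and of the right-hand side of (3.63):

```python
import math

def mobius(n):
    """Mobius function via trial factorisation."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:      # square factor found
                return 0
            result = -result
        else:
            p += 1
    return -result if n > 1 else result   # leftover prime factor, if any

def mangoldt_via_inversion(n):
    """Lambda(n) = sum_{d|n} mu(d) log(n/d), the identity (3.63)."""
    return sum(mobius(d) * math.log(n / d)
               for d in range(1, n + 1) if n % d == 0)
```

For $n = 12$, say, the sum telescopes to $\log 12 - \log 6 - \log 4 + \log 2 = 0$, matching $\Lambda(12) = 0$.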
One can now hope to derive the prime number theorem (3.55) from the formulae (3.60), (3.62). Unfortunately, this doesn't quite work: the prime number theorem is equivalent to the assertion
$$(3.65)\qquad \sum_{n \le x} (\Lambda(n) - 1) = o(x),$$
but if one inserts (3.63), (3.64) into the left-hand side of (3.65), one obtains
$$\sum_{d \le x} \mu(d) \sum_{m \le x/d} (\log m - \tau(m)),$$
which if one then inserts (3.60), (3.62) and the trivial bound $\mu(d) = O(1)$, leads to
$$-2\gamma x \sum_{d \le x} \frac{\mu(d)}{d} + O(x).$$
Using the elementary inequality
$$(3.66)\qquad \Big| \sum_{d \le x} \frac{\mu(d)}{d} \Big| \le 1$$
(see [Ta2010b]), we only obtain a bound of $O(x)$ for (3.65) instead of $o(x)$. (A refinement of this argument, though, shows that the prime number theorem would follow if one had the asymptotic $\sum_{n \le x} \mu(n) = o(x)$, which is in fact equivalent to the prime number theorem.)
We remark that if one computed $\sum_{n \le x} \mu(n)$ or $\sum_{n \le x} \frac{\mu(n)}{n}$ by the above methods, one would eventually be led to a variant of (3.66), namely
$$(3.67)\qquad \sum_{d \le x} \frac{\mu(d)}{d} \log \frac{x}{d} = O(1),$$
which is an estimate which will be useful later.
So we see that when trying to sum the von Mangoldt function by elementary means, the error term $O(x)$ overwhelms the main term $x$. But there is a slight tweaking of the von Mangoldt function, the second von Mangoldt function $\Lambda_2$, that increases the size of the main term to $2x \log x$ while keeping the error term at $O(x)$, thus leading to a useful estimate; the price one pays for this is that this function is now a proxy for the almost primes rather than the primes. This function is defined by a variant of (3.63), namely
$$(3.68)\qquad \Lambda_2(n) = \sum_{d|n} \mu(d) \log^2 \frac{n}{d}.$$
It is not hard to see that $\Lambda_2(n)$ vanishes once $n$ has at least three distinct prime factors (basically because the quadratic function $x \mapsto x^2$ vanishes after being differentiated three or more times). Indeed, one can easily verify the identity
$$(3.69)\qquad \Lambda_2(n) = \Lambda(n) \log n + \Lambda * \Lambda(n)$$
(which corresponds to the Dirichlet series identity $\zeta''(s)/\zeta(s) = (\zeta'(s)/\zeta(s))' + (\zeta'(s)/\zeta(s))^2$); the first term $\Lambda(n) \log n$ is mostly concentrated on primes, while the second term $\Lambda * \Lambda(n)$ is mostly concentrated on semiprimes (products of two distinct primes).
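Both formulas (3.68) and (3.69) for $\Lambda_2$ are easy to compare numerically. In the sketch below (helper names invented), each is computed by brute force over divisors:

```python
import math

def mangoldt(n):
    """Von Mangoldt function by trial division: log p if n = p^k, else 0."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n) if n > 1 else 0.0

def mobius(n):
    """Mobius function via trial factorisation."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        else:
            p += 1
    return -result if n > 1 else result

def mangoldt2(n):
    """Lambda_2 via the identity (3.69): Lambda log + Lambda * Lambda."""
    conv = sum(mangoldt(d) * mangoldt(n // d)
               for d in range(1, n + 1) if n % d == 0)
    return mangoldt(n) * math.log(n) + conv

def mangoldt2_def(n):
    """Lambda_2 via the defining formula (3.68)."""
    return sum(mobius(d) * math.log(n / d) ** 2
               for d in range(1, n + 1) if n % d == 0)
```

The two agree for every $n$, and one can watch $\Lambda_2$ vanish as soon as a third distinct prime factor appears (e.g. at $n = 30$).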
Now let us sum $\Lambda_2(n)$. In analogy with the previous discussion, we will do so by comparing the function $\log^2 n$ with something involving the divisor function. In view of (3.58), it is reasonable to try the approximation
$$\log^2 n \approx \tau(n) \log n;$$
from the identity
$$(3.70)\qquad 2 \log n = \sum_{d|n} \mu(d) \tau\Big(\frac{n}{d}\Big) \log \frac{n}{d}$$
(which corresponds to the Dirichlet series identity $-2\zeta'(s) = \frac{1}{\zeta(s)} \big( -(\zeta^2(s))' \big)$) we thus expect
$$(3.71)\qquad \Lambda_2(n) \approx 2 \log n.$$
Now we make these heuristics more precise. From the integral test we have
$$\sum_{n \le x} \log^2 n = x \log^2 x + C_1 x \log x + C_2 x + O(\log^2 x),$$
while from (3.62) and summation by parts one has
$$\sum_{n \le x} \tau(n) \log n = x \log^2 x + C_3 x \log x + C_4 x + O(\sqrt{x} \log x),$$
where $C_1, C_2, C_3, C_4$ are explicit absolute constants whose exact value is not important here. Thus
$$(3.72)\qquad \sum_{n \le x} (\log^2 n - \tau(n) \log n) = C_5 x \log x + C_6 x + O(\sqrt{x} \log x)$$
for some other constants $C_5, C_6$.
Meanwhile, from (3.68), (3.70) one has
$$\sum_{n \le x} (\Lambda_2(n) - 2 \log n) = \sum_{d \le x} \mu(d) \sum_{m \le x/d} (\log^2 m - \tau(m) \log m);$$
applying (3.72), (3.66), (3.67) we see that the right-hand side is $O(x)$. Computing $\sum_{n \le x} \log n$ by the integral test, we deduce the Selberg symmetry formula
$$(3.73)\qquad \sum_{n \le x} \Lambda_2(n) = 2x \log x + O(x).$$
One can view (3.73) as the "almost prime number theorem" - the analogue of the prime number theorem for almost primes.

The fact that the almost primes have a relatively easy asymptotic, while the genuine primes do not, is a reflection of the parity problem in sieve theory; see Section 3.10 of Structure and Randomness for further discussion. The symmetry formula is however enough to get within a factor of two of the prime number theorem: if we discard the semiprimes from (3.69), we see that $\Lambda(n) \log n \le \Lambda_2(n)$, and thus
$$\sum_{n \le x} \Lambda(n) \log n \le 2x \log x + O(x),$$
which by a summation by parts argument leads to
$$0 \le \sum_{n \le x} \Lambda(n) \le 2x + O\Big(\frac{x}{\log x}\Big),$$
which is within a factor of $2$ of (3.55) in some sense.
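The Selberg symmetry formula (3.73) can be tested numerically straight from the divisor-sum definition (3.68), since $\sum_{n \le x} \Lambda_2(n) = \sum_{d \le x} \mu(d) \sum_{m \le x/d} \log^2 m$ exactly. In the sketch below (`selberg_sum` is an ad hoc name), the Möbius function is generated by a simple sieve:

```python
import math

def selberg_sum(x):
    """sum_{n<=x} Lambda_2(n), computed exactly via (3.68):
    sum_{d<=x} mu(d) * sum_{m<=x/d} log^2 m."""
    mu = [1] * (x + 1)
    is_p = [True] * (x + 1)
    for p in range(2, x + 1):          # sieve for the Mobius function
        if is_p[p]:
            for q in range(p, x + 1, p):
                if q > p:
                    is_p[q] = False
                mu[q] = -mu[q]         # one sign flip per distinct prime
            for q in range(p * p, x + 1, p * p):
                mu[q] = 0              # kill non-squarefree numbers
    return sum(mu[d] * sum(math.log(m) ** 2 for m in range(1, x // d + 1))
               for d in range(1, x + 1))
```

The ratio of the sum to $x \log x$ approaches $2$ as $x$ grows, slowly, because of the $O(x)$ term in (3.73).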
One can twist all of the above arguments by a Dirichlet character $\chi$. For instance, (3.68) twists to
$$\Lambda_2(n) \chi(n) = \sum_{d|n} \mu(d) \chi(d) \log^2\Big(\frac{n}{d}\Big)\, \chi\Big(\frac{n}{d}\Big).$$
On the other hand, if $\chi$ is a non-principal character of modulus $q$, then it has mean zero on any interval with length $q$, and it is then not hard to establish the asymptotic
$$\sum_{n \le y} \log^2 n \, \chi(n) = O_q(\log^2 y).$$
This soon leads to the twisted version of (3.73):
$$(3.74)\qquad \sum_{n \le x} \Lambda_2(n) \chi(n) = O_q(x),$$
thus almost primes are asymptotically unbiased with respect to non-principal characters.
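The unbiasedness (3.74) can be illustrated with the non-principal character mod 4. The sketch below computes the twisted sum using $\Lambda_2 = \Lambda \log + \Lambda * \Lambda$ (helper names invented):

```python
import math

def mangoldt(n):
    """Von Mangoldt function by trial division: log p if n = p^k, else 0."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            while n % p == 0:
                n //= p
            return math.log(p) if n == 1 else 0.0
        p += 1
    return math.log(n) if n > 1 else 0.0

def chi4(n):
    """The non-principal Dirichlet character of modulus 4."""
    return 0 if n % 2 == 0 else (1 if n % 4 == 1 else -1)

def twisted_selberg(x):
    """sum_{n<=x} Lambda_2(n) chi(n), via Lambda_2 = Lambda log + Lambda*Lambda."""
    lam = [mangoldt(n) for n in range(x + 1)]
    total = sum(lam[n] * math.log(n) * chi4(n) for n in range(1, x + 1))
    for d in range(1, x + 1):              # the Lambda * Lambda part
        if lam[d]:
            for m in range(1, x // d + 1):
                if lam[m]:
                    total += lam[d] * lam[m] * chi4(d * m)
    return total
```

The twisted sum comes out far smaller than the untwisted $2x \log x$ of (3.73), as (3.74) predicts.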
From the multiplicative Fourier analysis of Dirichlet characters modulo $q$ (and the observation that $\Lambda_2$ is quite small on residue classes not coprime to $q$) one then has an almost prime number theorem in arithmetic progressions:
$$\sum_{n \le x: n = a \bmod q} \Lambda_2(n) = \frac{2}{\phi(q)} x \log x + O_q(x).$$
As before, this lets us come within a factor of two of the actual prime number theorem in arithmetic progressions:
$$\sum_{n \le x: n = a \bmod q} \Lambda(n) \le \frac{2}{\phi(q)} x + O_q\Big(\frac{x}{\log x}\Big).$$
One can also twist things by the completely multiplicative function $n \mapsto n^{it}$, but with the caveat that the approximation $2 \log n$ to $\Lambda_2(n)$ can locally correlate with $n^{it}$. Thus for instance one has
$$\sum_{n \le x} (\Lambda_2(n) - 2 \log n) \chi(n) n^{it} = O_q(x)$$
for any fixed $t$ and $\chi$; in particular, if $\chi$ is non-principal, one has
$$\sum_{n \le x} \Lambda_2(n) \chi(n) n^{it} = O_q(x).$$
3.12.3. The all-or-nothing dichotomy. To summarise so far, the almost primes (as represented by $\Lambda_2$) are quite uniformly distributed. These almost primes can be split up into the primes (as represented by $\Lambda(n) \log n$) and the semiprimes (as represented by $\Lambda * \Lambda(n)$), thanks to (3.69).

One can rewrite (3.69) as a recursive formula for $\Lambda$:
$$(3.75)\qquad \Lambda(n) = \frac{1}{\log n} \Lambda_2(n) - \frac{1}{\log n} \Lambda * \Lambda(n).$$
One can also twist this formula by a character $\chi$ and/or a completely multiplicative function $n \mapsto n^{it}$, thus for instance
$$(3.76)\qquad \Lambda(n) \chi(n) = \frac{1}{\log n} \Lambda_2(n) \chi(n) - \frac{1}{\log n} (\Lambda\chi) * (\Lambda\chi)(n).$$
This recursion, combined with the uniform distribution properties of $\Lambda_2$, leads to various all-or-nothing dichotomies for $\Lambda$. Suppose, for instance, that $\Lambda$ behaves like a constant $c$ on the average for some non-principal character $\chi$:
$$\Lambda(n) \chi(n) \approx c.$$
Then (from (3.58)) we expect $(\Lambda\chi) * (\Lambda\chi)$ to behave like $c^2 \log n$, thus
$$\frac{1}{\log n} (\Lambda\chi) * (\Lambda\chi)(n) \approx c^2.$$
On the other hand, from (3.74), $\frac{1}{\log n} \Lambda_2(n)$ is asymptotically uncorrelated with $\chi$:
$$\frac{1}{\log n} \Lambda_2(n) \chi(n) \approx 0.$$
Putting all this together, one obtains
$$c \approx -c^2,$$
which suggests that $c$ must be either close to $0$, or close to $-1$.

Basically, the point is that there are only two equilibria for the recursion (3.76). One equilibrium occurs when $\Lambda$ is asymptotically uncorrelated with $\chi$; the other is when it is completely anti-correlated with $\chi$, so that $\Lambda(n)$ is supported primarily on those $n$ for which $\chi(n)$ is close to $-1$. Note in the latter case $\chi(n) \approx -1$ for most primes $n$, and thus $\chi(n) \approx +1$ for most semiprimes $n$, thus leading to an equidistribution of $\chi(n)$ for almost primes (weighted by $\Lambda_2$). Any intermediate distribution of $\chi$ would be inconsistent with the distribution of $\Lambda_2$. (In terms of Dirichlet series, this assertion corresponds to the fact that the $L$-function of $\chi$ either has a zero of order $1$, or a zero of order $0$ (i.e. not a zero at all) at $s = 1$.)
A similar phenomenon occurs when twisting by $n^{it}$; basically, the average value of $(\Lambda(n) - 1) n^{it}$ must asymptotically either be close to $0$, or close to $-1$; no other asymptotic ends up being compatible with the distribution of $(\Lambda_2(n) - 2 \log n) n^{it}$. (Again, this corresponds to the fact that the Riemann zeta function has a zero of order $1$ or $0$ at $1 + it$.) More generally, the average value of $(\Lambda(n) - 1) \chi(n) n^{it}$ must asymptotically approach either $0$ or $-1$.
Remark 3.12.2. One can make the above heuristics precise either by using Dirichlet series (and analytic continuation, and the theory of zeroes of meromorphic functions), or by smoothing out arithmetic functions such as $\Lambda$ by a suitable multiplicative convolution with a mollifier (as is basically done in elementary proofs of the prime number theorem); see also [GrSo2007] for a closely related theory. We will not pursue these details here, however.
3.12.4. Dueling conspiracies. In the previous section we have seen (heuristically, at least) that the von Mangoldt function $\Lambda(n)$ (or more precisely, $\Lambda(n) - 1$) will either have no correlation, or a maximal amount of anti-correlation, with a completely multiplicative function such as $\chi(n)$, $n^{it}$, or $\chi(n) n^{it}$. On the other hand, it is not possible for this function to maximally anti-correlate (or to "conspire") with two such functions; thus the presence of one conspiracy excludes the presence of all others.
Suppose for instance that we had two distinct non-principal characters $\chi, \chi'$ for which one had maximal anti-correlation:
$$\Lambda(n) \chi(n), \ \Lambda(n) \chi'(n) \approx -1.$$
One could then combine the two statements to obtain
$$\Lambda(n) (\chi(n) + \chi'(n)) \approx -2.$$
Meanwhile, $\frac{1}{\log n} \Lambda_2(n)$ doesn't correlate with either $\chi$ or $\chi'$. It will be convenient to exploit this to normalise $\Lambda$, obtaining
$$\Big( \Lambda(n) - \frac{1}{2 \log n} \Lambda_2(n) \Big) (\chi(n) + \chi'(n)) \approx -2.$$
(Note from (3.56), (3.71) that we expect $\Lambda(n) - \frac{1}{2 \log n} \Lambda_2(n)$ to have mean zero.)
On the other hand, since $0 \le \Lambda(n) \log n \le \Lambda_2(n)$, one has
$$\Big| \Lambda(n) - \frac{1}{2 \log n} \Lambda_2(n) \Big| \le \frac{1}{2 \log n} \Lambda_2(n),$$
and hence by the triangle inequality
$$\Lambda_2(n) |\chi(n) + \chi'(n)| \gtrapprox 4 \log n,$$
in the sense that averages of the left-hand side should be at least as large as averages of the right-hand side. From this, (3.71), and Cauchy-Schwarz, one thus expects
$$\Lambda_2(n) |\chi(n) + \chi'(n)|^2 \gtrapprox 8 \log n.$$
But if one expands out the left-hand side using (3.71), (3.74), one only ends up with $4 \log n + O_q(1)$ on the average, a contradiction for $n$ sufficiently large.
Remark 3.12.3. The above argument belongs to a family of $L^2$-based arguments which go by various names (almost orthogonality, $TT^*$, …) … $\sum_{n \le x} \Lambda(n) \chi(n)$ as $\chi$ varies, but we will not make this precise here.

As one consequence of the above arguments, one can show that $\Lambda(n)$ cannot maximally anti-correlate with any non-real character $\chi$, since (by the reality of $\Lambda$) it would then also maximally anti-correlate with the complex conjugate $\overline{\chi}$, which is distinct from $\chi$. A similar argument shows that $\Lambda(n)$ cannot maximally anti-correlate with $n^{it}$ for any non-zero $t$, a fact which can soon lead to the prime number theorem, either by Dirichlet series methods, by Fourier-analytic means, or by elementary means. (Sketch of Fourier-analytic proof: $L^2$ methods provide $L^2$-type bounds on the averages of $\Lambda(n) n^{it}$ in $t$, while the above arguments show that these averages are also small in $L^\infty$. Then …) … if $\Lambda(n) n^{it} \approx -1$, then $\Lambda(n) n^{2it} \approx +1$, which is also incompatible with the zero-one law; this is essentially the method underlying the standard proof of the prime number theorem (which relates $\zeta(1 + it)$ with $\zeta(1 + 2it)$).
3.12.5. Quadratic characters. The one difficult scenario to eliminate is that of maximal anti-correlation with a real non-principal (i.e. quadratic) character $\chi$, thus
$$\Lambda(n)\chi(n) \sim -1.$$
This scenario implies that the quantity
$$L(1,\chi) := \sum_{n=1}^\infty \frac{\chi(n)}{n}$$
vanishes. Indeed, if one starts with the identity
$$\chi(n)\log n = \sum_{d|n} \Lambda(d)\chi(d)\, \chi(\frac{n}{d})$$
and sums in $n$, one sees that
$$\sum_{n \le x} \chi(n)\log n = \sum_{d,m:\ dm \le x} \Lambda(d)\chi(d)\, \chi(m).$$
The left-hand side is $O_q(\log x)$ by the mean zero and periodicity properties of $\chi$. To estimate the right-hand side, we use the hyperbola method and rewrite it as
$$\sum_{m \le M} \chi(m) \sum_{d \le x/m} \Lambda(d)\chi(d) + \sum_{d \le x/M} \Lambda(d)\chi(d) \sum_{M < m \le x/d} \chi(m)$$
for some parameter $M$ (sufficiently slowly growing in $x$) to be optimised later. Writing $\sum_{d \le x/m} \Lambda(d)\chi(d) = -(1+o_q(1))x/m$ and $\sum_{M < m \le x/d} \chi(m) = O_q(1)$, we can express this as
$$-x(\sum_{m \le M} \frac{\chi(m)}{m} + o_q(1)) + O_q(x/M);$$
sending $x \to \infty$ (and $M \to \infty$ at a slower rate) we conclude $L(1,\chi) = 0$ as required.
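For at least one concrete character, the non-vanishing of $L(1,\chi)$ is a classical fact one can check directly: for the non-principal character mod 4 one has $L(1,\chi) = 1 - \frac13 + \frac15 - \dots = \pi/4$ by the Leibniz formula (a standard fact, not part of the text above). A quick numerical check of the partial sums:

```python
import math

def chi4(n):
    """The non-principal Dirichlet character mod 4."""
    if n % 2 == 0:
        return 0
    return 1 if n % 4 == 1 else -1

N = 10**6
partial = sum(chi4(n) / n for n in range(1, N + 1))
print(partial, math.pi / 4)  # partial sums converge to pi/4 = 0.785...
```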
It is remarkably difficult to show that $L(1,\chi)$ does not, in fact, vanish. One way to do this is to use the class number formula, which relates this quantity to the class number of the quadratic number field $\mathbf{Q}(\sqrt{d})$. An alternate, more elementary route exploits the observation that the function $1*\chi(n) = \sum_{d|n} \chi(d)$ (which can be interpreted as counting representations of $n$ as the norm of an element of the ring of integers of that field, or more generally, as the norm of an ideal in that ring) is non-negative, and is at least 1 on the squares. In particular we have
$$\sum_{n \le x} \frac{1*\chi(n)}{\sqrt{n}} \ge \frac{1}{2}\log x + O(1).$$
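The non-negativity of $1*\chi$, and the lower bound of 1 on the squares, are easy to probe numerically for a specific quadratic character. A brute-force sketch (the character mod 4 and the cutoff are my choices, for illustration):

```python
def chi4(n):
    """The non-principal Dirichlet character mod 4."""
    if n % 2 == 0:
        return 0
    return 1 if n % 4 == 1 else -1

def one_star_chi(n):
    """(1*chi)(n) = sum of chi(d) over the divisors d of n."""
    return sum(chi4(d) for d in range(1, n + 1) if n % d == 0)

for n in range(1, 2001):
    v = one_star_chi(n)
    assert v >= 0                      # non-negativity
    r = int(n**0.5)
    if r * r == n:
        assert v >= 1                  # at least 1 on perfect squares
print("checked n <= 2000")
```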
On the other hand, from the hyperbola method we can express the left-hand side as
$$(3.77)\qquad \sum_{d \le \sqrt{x}} \frac{\chi(d)}{\sqrt{d}} \sum_{m \le x/d} \frac{1}{\sqrt{m}} + \sum_{m < \sqrt{x}} \frac{1}{\sqrt{m}} \sum_{\sqrt{x} < d \le x/m} \frac{\chi(d)}{\sqrt{d}}.$$
From the mean zero and periodicity properties of $\chi$ we have $\sum_{\sqrt{x} < d \le x/m} \frac{\chi(d)}{\sqrt{d}} = O_q(x^{-1/4})$, so the second term in (3.77) is $O_q(1)$. Meanwhile, from
the midpoint rule,
$$\sum_{m \le y} \frac{1}{\sqrt{m}} = 2\sqrt{y} + c + O(1/\sqrt{y})$$
for an absolute constant $c$, so the first term in (3.77) is
$$2\sqrt{x} \sum_{d \le \sqrt{x}} \frac{\chi(d)}{d} + O(|\sum_{d \le \sqrt{x}} \frac{\chi(d)}{\sqrt{d}}|) + O(1) = 2\sqrt{x}\, L(1,\chi) + O_q(1).$$
Putting all this together we have
$$\frac{1}{2}\log x + O(1) \le 2\sqrt{x}\, L(1,\chi) + O_q(1),$$
which leads to a contradiction as $x \to \infty$ if $L(1,\chi)$ vanishes.
Note in fact that the above argument shows that $L(1,\chi)$ is positive. If one carefully computes the dependence of the above argument on the modulus $q$, one obtains a lower bound of the form $L(1,\chi) \gg \exp(-q^{1/2+o(1)})$, which is quite poor. Using a non-trivial improvement on the error term in counting lattice points under the hyperbola (or better still, by smoothing the sum $\sum_{n \le x}$), one can improve this a bit, to $L(1,\chi) \gg q^{-O(1)}$. In contrast, the class number method gives a bound $L(1,\chi) \gg q^{-1/2+o(1)}$.
We can improve this even further for all but at most one real primitive character $\chi$:

Theorem 3.12.5 (Siegel's theorem). For every $\varepsilon > 0$, there exists $c = c(\varepsilon) > 0$ such that $L(1,\chi) \ge c q^{-\varepsilon}$ for all real primitive characters $\chi$ of modulus $q$, with at most one exception.

To prove this, suppose for contradiction that we had two distinct real primitive characters $\chi, \chi'$ with $L(1,\chi) < cq^{-\varepsilon}$ and $L(1,\chi') < c(q')^{-\varepsilon}$, of (large) modulus $q, q'$ respectively.

We begin by modifying the proof that $L(1,\chi)$ was positive, which relied (among other things) on the observation that $1*\chi$ is non-negative, and equals 1 at 1. In particular, one has
$$(3.78)\qquad \sum_{n \le x} \frac{1*\chi(n)}{n^s} \ge 1$$
for any $x \ge 1$ and any real $s$. (One can get slightly better bounds by exploiting that $1*\chi$ is also at least 1 on square numbers, as before, but this is really only useful for $s$ near $1/2$, and we are now going to take $s$ much closer to 1.)
On the other hand, one has the asymptotics
$$\sum_{n \le x} \frac{1}{n^s} = \zeta(s) + \frac{x^{1-s}}{1-s} + O(x^{-s})$$
for any real $s$ close (but not equal) to 1, and similarly
$$\sum_{n \le x} \frac{\chi(n)}{n^s} = L(s,\chi) + O(q^{O(1)} x^{-s})$$
for any real $s$ close to 1; similarly for $\chi'$ and $\chi\chi'$. Finally, one has
$$(3.79)\qquad \sum_{n \le x} \frac{1*\chi(n)}{n^s} = \zeta(s)L(s,\chi) + \frac{x^{1-s}}{1-s}L(1,\chi) + O(q^{O(1)} x^{0.5-s})$$
for all real $s$ sufficiently close to 1. Indeed, one can expand the left-hand side of (3.79) as
$$\sum_{d \le \sqrt{x}} \frac{\chi(d)}{d^s} \sum_{m \le x/d} \frac{1}{m^s} + \sum_{m < \sqrt{x}} \frac{1}{m^s} \sum_{\sqrt{x} < d \le x/m} \frac{\chi(d)}{d^s}$$
and the claim then follows from the previous asymptotics. (One can improve the error term by smoothing the summation, but we will not need to do so here.)
Now set $x = Cq^C$ for a large absolute constant $C$. If $0.99 \le s < 1$, the error term $O(q^{O(1)} x^{0.5-s})$ in (3.79) is then at most $1/2$ (say) if $C$ is large enough. We conclude from (3.78) that
$$\zeta(s)L(s,\chi) \ge \frac{1}{2} - O(\frac{q^{O(1-s)}}{1-s} L(1,\chi))$$
for $0.99 \le s < 1$. Since $L(1,\chi) < cq^{-\varepsilon}$, and since $\zeta(s)$ is negative for $s$ just below 1, this forces $L(s,\chi)$ to be negative when $1-s$ is somewhat larger than $L(1,\chi)$ (but still $O(1/\log q)$); as $L(1,\chi)$ is positive, the intermediate value theorem then produces a zero of $L(s,\chi)$ with $1-s \ll L(1,\chi)$. On the other hand, one has $L'(s,\chi) = O(\log^2 q)$ for $s = 1 - O(1/\log q)$, and so by the mean value theorem we see that the zero of $L(s,\chi)$ must also obey $1-s \gg L(1,\chi)/\log^2 q$. Thus $L(s,\chi)$ has a zero for some $s < 1$ with
$$(3.80)\qquad L(1,\chi)/\log^2 q \ll 1-s \ll L(1,\chi).$$
Similarly, $L(s,\chi')$ has a zero for some $s' < 1$ with
$$(3.81)\qquad L(1,\chi')/\log^2 q' \ll 1-s' \ll L(1,\chi').$$
Now, we consider the function
$$f := 1 * \chi * \chi' * \chi\chi'.$$
One can also show that $f$ is non-negative and equals 1 at 1, thus
$$\sum_{n \le x} \frac{f(n)}{n^s} \ge 1.$$
(The algebraic number theory interpretation of this positivity is that $f(n)$ is the number of representations of $n$ as the norm of an ideal in the biquadratic field generated by $\sqrt{q}$ and $\sqrt{q'}$.)
Also, by (a more complicated version of) the derivation of (3.79), one has
$$\sum_{n \le x} \frac{f(n)}{n^s} = \zeta(s)L(s,\chi)L(s,\chi')L(s,\chi\chi') + \frac{x^{1-s}}{1-s}L(1,\chi)L(1,\chi')L(1,\chi\chi') + O((qq')^{O(1)} x^{0.9-s})$$
(say). Arguing as before, we conclude that
$$\zeta(s)L(s,\chi)L(s,\chi')L(s,\chi\chi') \ge \frac{1}{2} - O(\frac{(qq')^{O(1-s)}}{1-s} L(1,\chi)L(1,\chi')L(1,\chi\chi'))$$
for $0.99 \le s < 1$. Using the bound $L(1,\chi\chi') \ll \log(qq')$ (which can be established by summation by parts), we conclude that $\zeta(s)L(s,\chi)L(s,\chi')L(s,\chi\chi')$ is positive in the range
$$L(1,\chi)L(1,\chi')\log(qq') \ll 1-s.$$
Since we already know $L(s,\chi)$ and $L(s,\chi')$ have zeroes obeying (3.80), (3.81), and the product vanishes at these zeroes, they must avoid this range, so that
$$\frac{L(1,\chi)}{\log^2 q} \ll L(1,\chi)L(1,\chi')\log(qq') \quad\text{and}\quad \frac{L(1,\chi')}{\log^2 q'} \ll L(1,\chi)L(1,\chi')\log(qq');$$
taking geometric means and rearranging we obtain
$$L(1,\chi)L(1,\chi') \gg \log(qq')^{-O(1)}.$$
But this contradicts the hypotheses $L(1,\chi) < cq^{-\varepsilon}$, $L(1,\chi') < c(q')^{-\varepsilon}$ if $c$ is small enough.
Remark 3.12.6. Siegel's theorem leads to a version of the prime number theorem in arithmetic progressions known as the Siegel-Walfisz theorem. As with Siegel's theorem, the bounds are ineffective unless one is allowed to exclude a single exceptional modulus q (and its multiples), in which case one has a modified prime number theorem which favours the quadratic nonresidues mod q; see [Gr1992].
Remark 3.12.7. One can improve the effective bounds in Siegel's theorem if one is allowed to exclude a larger set of bad moduli. For instance, the arguments in Section 3.12.4 allow one to establish a bound of the form $L(1,\chi) \gg \log^{-O(1)} q$ after excluding at most one $q$ in each hyper-dyadic range $2^{100^k} \le q \le 2^{100^{k+1}}$ for each $k$; one can of course replace 100 by other exponents here, but at the cost of worsening the $O(1)$ term. (This is essentially an observation of Landau.)
Notes. This article first appeared at terrytao.wordpress.com/2009/09/24. Thanks to anonymous commenters for corrections.
David Speyer noted the connection between Siegel's theorem and the classification of imaginary quadratic fields with unique factorisation.
3.13. Mazur's swindle

Let d be a natural number. A basic operation in the topology of oriented, connected, compact, d-dimensional manifolds (hereby referred to simply as manifolds for short) is that of connected sum: given two manifolds M, N, the connected sum M#N is formed by removing a small ball from each manifold and then gluing the boundary together (in the orientation-preserving manner). This gives another oriented, connected, compact manifold, and the exact nature of the balls removed and their gluing is not relevant for topological purposes (any two such procedures give homeomorphic manifolds). It is easy to see that this operation is associative and commutative up to homeomorphism, thus $M\#N \cong N\#M$ and $(M\#N)\#O \cong M\#(N\#O)$, where we use $M \cong N$ to denote the assertion that M is homeomorphic to N.
(It is important that the orientation is preserved; if, for instance, d = 3, and M is an irreducible 3-manifold which is chiral (thus $M \not\cong -M$, where $-M$ is the orientation reversal of M), then the connected sum $M\#M$ of M with itself is also chiral (by the prime decomposition; in fact one does not even need the irreducibility hypothesis for this claim), but $M\#-M$ is not. A typical example of an irreducible chiral manifold is the complement of a trefoil knot. Thanks to Danny Calegari for this example.)
The d-dimensional sphere $S^d$ is an identity (up to homeomorphism) of connected sum: $M\#S^d \cong M$ for any M. A basic result in the subject is that the sphere is itself irreducible:

Theorem 3.13.1 (Irreducibility of the sphere). If $S^d \cong M\#N$, then $M, N \cong S^d$.
For d = 1 (curves), this theorem is trivial because the only connected 1-manifolds are homeomorphic to circles. For d = 2 (surfaces), the theorem is also easy by considering the genus of M, N, M#N. For d = 3 the result follows from the prime decomposition. But for higher d, these ad hoc methods no longer work. Nevertheless, there is an elegant proof of Theorem 3.13.1, due to Mazur [Ma1959], and known as Mazur's swindle. The reason for this name should become clear when one sees the proof, which I reproduce below.
Suppose $M\#N \cong S^d$. Now consider the infinite connected sum
$$(M\#N)\#(M\#N)\#(M\#N)\#\dots.$$
This is an infinite connected sum of spheres, and can thus be viewed as a half-open cylinder, which is topologically equivalent to a sphere with a small ball removed; alternatively, one can contract the boundary at infinity to a point to recover the sphere $S^d$. On the other hand, by using the associativity of connected sum (which will still work for the infinite connected sum, if one thinks about it carefully), the above manifold is also homeomorphic to
$$M\#(N\#M)\#(N\#M)\#\dots$$
which is the connected sum of M with an infinite sequence of spheres, or equivalently M with a small ball removed. Contracting the small balls to a point, we conclude that $M \cong S^d$, and a similar argument gives $N \cong S^d$.
A typical corollary of Theorem 3.13.1 is a generalisation of the Jordan curve theorem: any locally flat embedded copy of $S^{d-1}$ in $S^d$ divides the sphere $S^d$ into two regions homeomorphic to balls $B^d$. (Some sort of regularity hypothesis, such as local flatness, is essential, thanks to the counterexample of the Alexander horned sphere. If one assumes smoothness instead of local flatness, the problem is known as the Schönflies problem, and is apparently quite subtle, especially in the four-dimensional case d = 4.)
One can ask whether there is a way to prove Theorem 3.13.1 for general d without recourse to the infinite sum swindle. I do not know the complete answer to this, but some evidence against this hope can be seen by noting that if one works in the smooth category instead of the topological category (i.e. working with smooth manifolds, and only equating manifolds that are diffeomorphic, and not merely homeomorphic), then the exotic spheres in five and higher dimensions provide a counterexample to the smooth version of Theorem 3.13.1: it is possible to find two exotic spheres whose connected sum is diffeomorphic to the standard sphere. (Indeed, in five and higher dimensions, the exotic sphere structures on $S^d$ form a finite abelian group under connected sum, with the standard sphere being the identity element. The situation in four dimensions is much less well understood.) The problem with the swindle here is that the homeomorphism generated by the infinite number of applications of the associativity law is not smooth when one identifies the boundary with a point.
The basic idea of the swindle - grouping an alternating infinite sum in two different ways - also appears in a few other contexts. Most classically, it is used to show that the sum $1 - 1 + 1 - 1 + \dots$ does not converge in any sense which is consistent with the infinite associative law, since this would then imply that 1 = 0; indeed, one can view the swindle as a dichotomy between the infinite associative law and the presence of non-trivial cancellation. (In the topological manifold category, one has the former but not the latter, whereas in the case of $1 - 1 + 1 - 1 + \dots$, one has the latter but not the former.) The alternating series test can also be viewed as a variant of the swindle.
Another variant of the swindle arises in the proof of the Cantor-Schröder-Bernstein theorem. Suppose one has two sets A, B, together with injections from A to B and from B to A. The first injection leads to an identification $B \cong C \uplus A$ for some set C, while the second injection leads to an identification $A \cong D \uplus B$. Iterating this leads to identifications
$$A \cong (D \uplus C \uplus D \uplus \dots) \uplus X$$
and
$$B \cong (C \uplus D \uplus C \uplus \dots) \uplus X$$
for some additional set X. Using the identification $D \uplus C \cong C \uplus D$ then yields an explicit bijection between A and B.
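This back-and-forth analysis can be made completely explicit in simple cases. The sketch below is my own illustration (not from the text): take A = B = the natural numbers with injections f(n) = 2n and g(n) = 2n + 1, classify each element by which set its backward chain terminates in, and read off the resulting bijection:

```python
def stopper(a):
    """Trace a's backward chain; return 'A' or 'B' according to where it starts.

    Here f(n) = 2n maps A -> B and g(n) = 2n + 1 maps B -> A, so an element
    of A lies outside g(B) iff it is even, and an element of B lies outside
    f(A) iff it is odd."""
    while True:
        if a % 2 == 0:          # a not in g(B): chain starts in A
            return 'A'
        b = (a - 1) // 2        # b = g^{-1}(a)
        if b % 2 == 1:          # b not in f(A): chain starts in B
            return 'B'
        a = b // 2              # a = f^{-1}(b); keep tracing backwards

def h(a):
    """The Cantor-Schroeder-Bernstein bijection built from f and g."""
    return 2 * a if stopper(a) == 'A' else (a - 1) // 2

values = [h(a) for a in range(2000)]
assert len(set(values)) == len(values)          # injective on this window
assert set(range(500)).issubset(set(values))    # hits every small target
print("h is a bijection (sampled check)")
```

Elements whose chain bottoms out in A are sent forward by f; those bottoming out in B are sent back by $g^{-1}$, exactly mirroring the $D \uplus C \cong C \uplus D$ regrouping above.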
Notes. This article first appeared at terrytao.wordpress.com/2009/10/05. Thanks to Jan, Peter, and an anonymous commenter for corrections. Thanks to Danny Calegari for telling me about the swindle, while we were both waiting to catch an airplane.
Several commenters provided further examples of swindle-type arguments. Scott Morrison noted that Mazur's argument also shows that non-trivial knots do not have inverses: one cannot untie a knot by tying another one. Qiaochu Yuan provided a swindle argument that showed that GL(H) is contractible for any infinite-dimensional Hilbert space H. In a similar spirit, Pace Nielsen recalled the Eilenberg swindle that shows that for every projective module P, there exists a free module F with $P \oplus F \cong F$. Tim Gowers also mentioned Pelczynski's decomposition method in the theory of Banach spaces as a similar argument.
3.14. Grothendieck's definition of a group

In his wonderful article [Th1994], Bill Thurston describes (among many other topics) how one's understanding of a given concept in mathematics (such as that of the derivative) can be vastly enriched by viewing it simultaneously from many subtly different perspectives; in the case of the derivative, he gives seven standard such perspectives (infinitesimal, symbolic, logical, geometric, rate, approximation, microscopic) and then mentions a much later perspective in the sequence (as describing a flat connection for a graph).
One can of course do something similar for many other fundamental notions in mathematics. For instance, the notion of a group G can be thought of in a number of (closely related) ways, such as the following:
(0) Motivating examples: A group is an abstraction of the
operations of addition/subtraction or multiplication/division
in arithmetic or linear algebra, or of composition/inversion
of transformations.
(1) Universal algebraic: A group is a set G with an identity element e, a unary inverse operation ${}^{-1}: G \to G$, and a binary multiplication operation $\cdot: G \times G \to G$ obeying the relations (or axioms) $e \cdot x = x \cdot e = x$, $x \cdot x^{-1} = x^{-1} \cdot x = e$, $(x \cdot y) \cdot z = x \cdot (y \cdot z)$ for all $x, y, z \in G$.
(2) Symmetric: A group is all the ways in which one can transform a space V to itself while preserving some object or structure O on this space.
(3) Representation theoretic: A group is identifiable with a collection of transformations on a space V which is closed under composition and inverse, and contains the identity transformation.
(4) Presentation theoretic: A group can be generated by a
collection of generators subject to some number of relations.
(5) Topological: A group is the fundamental group $\pi_1(X)$ of a connected topological space X.
(6) Dynamic: A group represents the passage of time (or of
some other variable(s) of motion or action) on a (reversible)
dynamical system.
(7) Category theoretic: A group is a category with one ob-
ject, in which all morphisms have inverses.
(8) Quantum: A group is the classical limit $q \to 0$ of a quantum group.
etc.
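Perspective (1) above is the one most directly amenable to machine checking: for a small concrete candidate, every listed axiom can be tested by brute force. A quick sketch (the choice of $\mathbf{Z}/5$ under addition is mine, for illustration):

```python
# Brute-force check of the group axioms of perspective (1) for G = Z/5.
G = list(range(5))
op = lambda x, y: (x + y) % 5
e = 0
inv = lambda x: (-x) % 5

assert all(op(e, x) == x == op(x, e) for x in G)             # identity
assert all(op(x, inv(x)) == e == op(inv(x), x) for x in G)   # inverses
assert all(op(op(x, y), z) == op(x, op(y, z))
           for x in G for y in G for z in G)                 # associativity
print("Z/5 satisfies the group axioms")
```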
One can view a large part of group theory (and related subjects, such as representation theory) as exploring the interconnections between various of these perspectives. As one's understanding of the subject matures, many of these formerly distinct perspectives slowly merge into a single unified perspective.
From a recent talk by Ezra Getzler, I learned a more sophisticated perspective on a group, somewhat analogous to Thurston's example of a sophisticated perspective on a derivative (and coincidentally, flat connections play a central role in both):
(37) Sheaf theoretic: A group is identifiable with a (set-valued) sheaf on the category of simplicial complexes such that the morphisms associated to collapses of d-simplices are bijective for d > 1 (and merely surjective for d ≤ 1).
This interpretation of the group concept is apparently due to Grothendieck, though it is motivated also by homotopy theory. One of the key advantages of this interpretation is that it generalises easily to the notion of an n-group (simply by replacing 1 with n in (37)), whereas the other interpretations listed earlier require a certain amount of subtlety in order to generalise correctly (in particular, they usually themselves require higher-order notions, such as n-categories).
The connection of (37) with any of the other perspectives of a
group is elementary, but not immediately obvious; I enjoyed working
out exactly what the connection was, and thought it might be of
interest to some readers here, so I reproduce it below the fold.
3.14.1. Flat connections. To see the relationship between (37) and more traditional concepts of a group, such as (1), we will begin by recalling the machinery of flat connections.

Let G be a group, and let X be a topological space. A principal G-connection on X can be thought of as an assignment of a group element $\Phi(\gamma) \in G$ to every path $\gamma$ in X which obeys the following four properties:
- Invariance under reparameterisation: if $\gamma'$ is a reparameterisation of $\gamma$, then $\Phi(\gamma) = \Phi(\gamma')$.
- Identity: If $\gamma$ is a constant path, then $\Phi(\gamma)$ is the identity element.
- Inverse: If $-\gamma$ is the reversal of a path $\gamma$, then $\Phi(-\gamma)$ is the inverse of $\Phi(\gamma)$.
- Groupoid homomorphism: If $\gamma_2$ starts where $\gamma_1$ ends (so that one can define the concatenation $\gamma_1 + \gamma_2$), then $\Phi(\gamma_1 + \gamma_2) = \Phi(\gamma_2)\Phi(\gamma_1)$. (Depending on one's conventions, one may wish to reverse the order of the group multiplication on the right-hand side.)
Intuitively, $\Phi(\gamma)$ represents a way to use the group G to connect (or parallel transport) the fibre at the initial point of $\gamma$ to the fibre at the final point; see Section 1.4 of Poincaré's Legacies, Vol. II for more discussion. Note that the identity property is redundant, being implied by the other three properties.

We say that a connection $\Phi$ is flat if $\Phi(\gamma)$ is the identity element for every short closed loop $\gamma$, thus strengthening the identity property. One could define short rigorously (e.g. one could use contractible as a substitute), but we will prefer here to leave the concept intentionally vague.
Typically, one studies connections when the structure group G and the base space X are continuous rather than discrete. However, there is a combinatorial model for connections which is suitable for discrete groups, in which the base space X is now an (abstract) simplicial complex $\Delta$ - a vertex set V, together with a number of simplices in V, by which we mean ordered (d+1)-tuples $(x_0, \dots, x_d)$ of distinct vertices in V for various integers d (with d being the dimension of the simplex $(x_0, \dots, x_d)$). In our definition of a simplicial complex, we add the requirement that if a simplex lies in the complex, then all faces of that simplex (formed by removing one of the vertices, but leaving the order of the remaining vertices unchanged) also lie in the complex. We also assume a well defined orientation, in the sense that every (d+1)-tuple $x_0, \dots, x_d$ is represented by at most one simplex (thus, for instance, a complex cannot contain both an edge (0,1) and its reversal (1,0)). Though it will not matter too much here, one can think of the vertex set V here as being restricted to be finite.
A path $\gamma$ in a simplicial complex $\Delta$ is then a sequence of 1-simplices $(x_i, x_{i+1})$ or their formal reverses $-(x_i, x_{i+1})$, with the final point of each 1-simplex being the initial point of the next. If G is a (discrete) group, a principal G-connection on $\Delta$ is then an assignment of a group element $\Phi(\gamma) \in G$ to each such path $\gamma$, obeying the groupoid homomorphism property and the inverse property (and hence the identity property). Note that the reparameterisation property is no longer needed in this abstract combinatorial model. Note also that a connection can be determined by the group elements $\Phi(b \leftarrow a)$ it assigns to each 1-simplex (a,b). (I have written the simplex $b \leftarrow a$ from right to left, as this makes the composition law cleaner.)
So far, only the 1-skeleton (i.e. the simplices of dimension at most 1) of the complex has been used. But one can use the 2-skeleton to define the notion of a flat connection: we say that a principal G-connection $\Phi$ on $\Delta$ is flat if the boundary of every 2-simplex (a,b,c), oriented appropriately, is assigned the identity element, or more precisely that $\Phi(c \leftarrow a)^{-1}\Phi(c \leftarrow b)\Phi(b \leftarrow a) = e$, or in other words that $\Phi(c \leftarrow a) = \Phi(c \leftarrow b)\Phi(b \leftarrow a)$; thus, in this context, a short loop means a loop that is the boundary of a 2-simplex. Note that this corresponds closely to the topological concept of a flat connection when applied to, say, a triangulated manifold.
Fix a group G. Given any simplicial complex $\Delta$, let $\mathcal{O}(\Delta)$ be the set of flat connections on $\Delta$. One can get some feeling for this set by considering some basic examples:

- If $\Delta$ is a single 0-dimensional simplex (i.e. a point), then there is only the trivial path, which must be assigned the identity element e of the group. Thus, in this case, $\mathcal{O}(\Delta)$ can be identified with $\{e\}$.
- If $\Delta$ is a 1-dimensional simplex, say (0,1), then the path from 0 to 1 can be assigned an arbitrary group element $\Phi(1 \leftarrow 0) \in G$, and this is the only degree of freedom in the connection. So in this case, $\mathcal{O}(\Delta)$ can be identified with G.
- Now suppose $\Delta$ is a 2-dimensional simplex, say (0,1,2). Then the group elements $\Phi(1 \leftarrow 0)$ and $\Phi(2 \leftarrow 1)$ are arbitrary elements of G, but $\Phi(2 \leftarrow 0)$ is constrained to equal $\Phi(2 \leftarrow 1)\Phi(1 \leftarrow 0)$. This determines the entire flat connection, so $\mathcal{O}(\Delta)$ can be identified with $G^2$.
- Generalising this example, if $\Delta$ is a k-dimensional simplex, then $\mathcal{O}(\Delta)$ can be identified with $G^k$.
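For a finite group one can verify the count $|\mathcal{O}(\Delta)| = |G|^k$ by brute force. Here is a sketch for $G = \mathbf{Z}/3$ on the 2-simplex (0,1,2), writing the group additively, with `e01` standing for $\Phi(1 \leftarrow 0)$ and so on:

```python
from itertools import product

G = range(3)  # the cyclic group Z/3, written additively

# Flat connections on the 2-simplex (0,1,2): assign group elements to the
# edges (0,1), (1,2), (0,2), subject to Phi(2<-0) = Phi(2<-1) + Phi(1<-0).
flat = [
    (e01, e12, e02)
    for e01, e12, e02 in product(G, G, G)
    if e02 == (e12 + e01) % 3
]
print(len(flat))  # 9 = |G|^2, matching O(Delta) = G^2
```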
An important operation one can do on flat connections is that of pullback. Let $\phi: \Delta \to \Delta'$ be a morphism from one simplicial complex to another (a map on vertices which takes simplices to simplices); then every flat connection $\Phi$ on $\Delta'$ has a pullback $\phi^*\Phi$ on $\Delta$, defined by the formula
$$(\phi^*\Phi)(w \leftarrow v) := \Phi(\phi(w) \leftarrow \phi(v))$$
for any 1-simplex (v,w) in $\Delta$, with the convention that $\Phi(u \leftarrow u)$ is the identity for any u. It is easy to see that this is still a flat connection. Also, if $\phi: \Delta \to \Delta'$ and $\psi: \Delta' \to \Delta''$ are morphisms, then the operations of pullback by $\psi$ and then by $\phi$ compose to equal the operation of pullback by $\psi \circ \phi$: $\phi^* \psi^* = (\psi \circ \phi)^*$. In the language of category theory, pullback is a contravariant functor from the category of simplicial complexes to the category of sets (with each simplicial complex being mapped to its set of flat connections).
A special case of a morphism is an inclusion morphism $\iota: \Delta \to \Delta'$ from a subcomplex $\Delta$ to a simplicial complex $\Delta'$; the pullback $\iota^*$ then sends a flat connection on $\Delta'$ to its restriction to $\Delta$.
3.14.2. Sheaves. We currently have a set $\mathcal{O}(\Delta)$ of flat connections assigned to each simplicial complex $\Delta$, together with pullback maps (and in particular, restriction maps) connecting these sets to each other. One can easily observe that this system of structures obeys the following axioms:

- (Identity) There is only one flat connection on a point.
- (Locality) If $\Delta = \Delta_1 \cup \Delta_2$ is the union of two simplicial complexes, then a flat connection on $\Delta$ is determined by its restrictions to $\Delta_1$ and $\Delta_2$. In other words, the map $\Phi \mapsto (\Phi|_{\Delta_1}, \Phi|_{\Delta_2})$ is an injection from $\mathcal{O}(\Delta)$ to $\mathcal{O}(\Delta_1) \times \mathcal{O}(\Delta_2)$.
- (Gluing) If $\Delta = \Delta_1 \cup \Delta_2$, and $\Phi_1, \Phi_2$ are flat connections on $\Delta_1, \Delta_2$ which agree when restricted to $\Delta_1 \cap \Delta_2$ (and if the orientations of $\Delta_1, \Delta_2$ on the intersection $\Delta_1 \cap \Delta_2$ agree), then there exists a flat connection $\Phi$ on $\Delta$ which agrees with $\Phi_1, \Phi_2$ on $\Delta_1, \Delta_2$. (Note that this gluing of $\Phi_1$ and $\Phi_2$ is unique, by the previous axiom. It is important that the orientations match; we cannot glue (0,1) to (1,0), for instance.)
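One can see locality and gluing at work numerically: glue the two 2-simplices (0,1,2) and (1,2,3) along the edge (1,2) and count flat $\mathbf{Z}/3$-connections on the union. (The counting identity $|\mathcal{O}(\Delta)| = |\mathcal{O}(\Delta_1)| \cdot |\mathcal{O}(\Delta_2)| / |\mathcal{O}(\Delta_1 \cap \Delta_2)|$ in the final assertion is my own way of summarising locality-plus-gluing for this example, not a formula from the text.)

```python
from itertools import product

MOD = 3  # work with G = Z/3, written additively

# Edges of the complex made of the 2-simplices (0,1,2) and (1,2,3).
edges = [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)]

count = 0
for vals in product(range(MOD), repeat=len(edges)):
    phi = dict(zip(edges, vals))
    ok1 = phi[(0, 2)] == (phi[(1, 2)] + phi[(0, 1)]) % MOD  # flat on (0,1,2)
    ok2 = phi[(1, 3)] == (phi[(2, 3)] + phi[(1, 2)]) % MOD  # flat on (1,2,3)
    if ok1 and ok2:
        count += 1

print(count)                 # 27 flat connections on the glued complex
assert count == 9 * 9 // 3   # |O(D1)| * |O(D2)| / |O(D1 n D2)|
```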
One can consider more abstract assignments $\Delta \mapsto \mathcal{O}(\Delta)$ of sets to simplicial complexes, together with pullback maps, which obey these three axioms. A system which obeys the first two axioms is known as a pre-sheaf, while a system that obeys all three is known as a sheaf. (One can also consider pre-sheaves and sheaves on more general topological spaces than simplicial complexes; for instance, the spaces of smooth (or continuous, or holomorphic, etc.) functions (or forms, sections, etc.) on open subsets of a manifold form a sheaf.)
Thus, flat connections associated to a group G form a sheaf. But flat connections form a special type of sheaf that obeys an additional property (listed above as (37)). To explain this property, we first consider a key example, when $\Delta = (0,1,2)$ is the standard 2-simplex (together with its subsimplices), and define a k-dimensional collapse $\Delta'$ of $\Delta$ to be a simplicial complex obtained from $\Delta$ by removing the interior of a k-simplex, together with one of its faces; thus for instance the complex consisting of (0,1), (1,2) (and subsimplices) is a 2-dimensional collapse of the 2-simplex (0,1,2) (and subsimplices). We then see that the sheaf of flat connections obeys an additional axiom:

(Grothendieck's axiom) If $\Delta'$ is a k-dimensional collapse of $\Delta$, then the restriction map from $\mathcal{O}(\Delta)$ to $\mathcal{O}(\Delta')$ is surjective for all k, and bijective for $k \ge 2$.
This axiom is trivial for k = 0. For k = 1, it is true because if an edge (and one of its vertices) can be removed from a complex, then it is not the boundary of any 2-simplex, and the value of a flat connection on that edge is thus completely unconstrained. (In any event, the k = 1 case of this axiom can be deduced from the sheaf axioms.) For k = 2, it follows because if one can remove a 2-simplex and one of its edges from a complex, then that edge is not the boundary of any other 2-simplex, and thus the connection on that edge is constrained precisely to be the product of the connection on the other two edges of the 2-simplex. For k = 3, it follows because if one removes a 3-simplex and one of its 2-simplex faces, the constraint associated to that 2-simplex is implied by the constraints coming from the other three faces of the 3-simplex (I recommend drawing a tetrahedron and chasing some loops around to see this), and so one retains bijectivity. For $k \ge 4$, the axiom becomes trivial again because the k-simplices and (k-1)-simplices have no impact on the definition of a flat connection.
Grothendieck's beautiful observation is that the converse holds: if a (concrete) sheaf $\Delta \mapsto \mathcal{O}(\Delta)$ obeys Grothendieck's axiom, then it is equivalent to the sheaf of flat connections of some group G defined canonically from the sheaf. Let's see how this works. Suppose we have a sheaf $\Delta \mapsto \mathcal{O}(\Delta)$, which is concrete in the sense that each $\mathcal{O}(\Delta)$ is a set, and the morphisms between these sets are given by functions. In analogy with the preceding discussion, we'll refer to elements of $\mathcal{O}(\Delta)$ as (abstract) flat connections, though a priori we do not assume there is a group structure behind these connections.

By the sheaf axioms, there is only one flat connection on a point, which we will call the trivial connection. Now consider the space $\mathcal{O}((0,1))$ of flat connections on the standard 1-simplex (0,1). If the sheaf was indeed the sheaf of flat connections on a group G, then $\mathcal{O}((0,1))$ is canonically identifiable with G. Inspired by this, we will define G to equal the space $\mathcal{O}((0,1))$ of flat connections on (0,1). The flat connections on any other 1-simplex (u,v) can then be placed in one-to-one correspondence with elements of G by the morphism $u \mapsto 0, v \mapsto 1$, so flat connections on (u,v) can be viewed as being equivalent to an element of G.

At present, G is merely a set, not a group. To make it into a group, we need to introduce an identity element, an inverse operation, and a multiplication operation, and verify the group axioms.
To obtain an identity element, we look at the morphism from (0,1) to a point, and pull back the trivial connection on that point to obtain a flat connection e on (0,1), which we will declare to be the identity element. (Note from the functorial nature of pullback that it does not matter which point we choose for this.)

Now we define the multiplication operation. Let $g, h \in G$; then g and h are flat connections on (0,1). By using the morphism $i \mapsto i - 1$ from (1,2) to (0,1), we can pull back h to (1,2) to create a flat connection $\tilde{h}$ on (1,2) that is equivalent to h. The restriction of g and
$\alpha_n$ is also a countable ordinal, as is the set $\alpha$. But $\alpha$ is not equal to any of the $\alpha_n$ (by the axiom of foundation), a contradiction.
Remark 3.15.16. One can show the existence of uncountable ordinals (e.g. by considering all the well-orderings of subsets of the natural numbers, up to isomorphism), and then there exists a least uncountable ordinal $\omega_1$. By construction, this ordinal consists precisely of all the countable ordinals, but is itself uncountable, much as $\mathbf{N}$ consists precisely of all the finite natural numbers, but is itself infinite (Proposition 3.15.2). The least uncountable ordinal is notorious, among other things, for providing a host of counterexamples to various intuitively plausible assertions in point set topology, and in particular in showing that the topology of sufficiently uncountable spaces cannot always be adequately explored by countable objects such as sequences.
Remark 3.15.17. The existence of the least uncountable ordinal can explain why one cannot contradict Cantor's theorem on the uncountability of the reals simply by iterating the diagonal argument (or any other algorithm) in an attempt to exhaust the reals. From transfinite induction we see that the diagonal argument allows one to assign a different real number to each countable ordinal, but this does not establish countability of the reals, because the set of all countable ordinals is itself uncountable. (This is similar to how one cannot contradict Proposition 3.15.5 by iterating the $N \mapsto N+1$ map, as the set of all finite natural numbers is itself infinite.) In any event, even once one reaches the first uncountable ordinal, one may not yet have completely exhausted the reals; for instance, using the diagonal argument given in the proof of Proposition 3.15.8, only the real numbers in the interval [0,1] will ever be enumerated by this procedure. (Also, the question of whether all real numbers in [0,1] can be enumerated by the iterated diagonal algorithm requires the continuum hypothesis, and even with this hypothesis I am not sure whether the statement is decidable.)
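The diagonal step itself is easy to express in code: given any finite (or countable) list of binary sequences, flipping the diagonal produces a sequence that differs from every listed one. A sketch, with sequences represented as finite lists of bits (the representation and the cutoff 8 are my choices, for illustration):

```python
def diagonalise(seqs):
    """Given a list of n binary sequences (each of length >= n),
    return a sequence differing from the i-th one in position i."""
    return [1 - seqs[i][i] for i in range(len(seqs))]

# Any attempted enumeration of binary sequences misses its own diagonal.
seqs = [[(n >> k) & 1 for k in range(8)] for n in range(8)]
d = diagonalise(seqs)
assert all(d[i] != seqs[i][i] for i in range(8))
print("diagonal sequence differs from every listed sequence")
```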
3.15.2. Logic. The no self-defeating object argument leads to a number of important non-existence results in logic. Again, the basic idea is to show that a sufficiently overpowered logical structure will eventually lead to the existence of a self-contradictory statement, such as the liar paradox. To state examples of this properly, one unfortunately has to invest a fair amount of time in first carefully setting up the language and theory of logic. I will not do so here, and instead use informal English sentences as a proxy for precise logical statements to convey a taste (but not a completely rigorous description) of these logical results here.

The liar paradox itself - the inability to assign a consistent truth value to "this sentence is false" - can be viewed as an argument demonstrating that there is no consistent way to interpret (i.e. assign a truth value to) sentences, when the sentences are (a) allowed to be self-referential, and (b) allowed to invoke the very notion of truth given by this interpretation. One's first impulse is to say that the
526 3. Expository articles
difficulty here lies more with (a) than with (b), but there is a clever
trick, known as Quining (or indirect self-reference), which allows one
to modify the liar paradox to produce a non-self-referential statement
to which one still cannot assign a consistent truth value. The idea is
to work not with fully formed sentences S, which have a single truth
value, but instead with predicates S, whose truth value depends on
a variable x in some range. For instance, S may be "x is thirty-two
characters long.", and the range of x may be the set of strings (i.e.
finite sequences of characters); then for every string T, the statement
S(T) (formed by replacing every appearance of x in S with T) is
either true or false. For instance, S("a") is false.
Crucially, predicates are themselves strings, and can thus be fed into
themselves as input; for instance, S(S) is false. If however U is the
predicate "x is sixty-five characters long.", observe that U(U) is true.
Now consider the Quine predicate Q given by

"x is a predicate whose range is the set of strings, and x(x) is false."

whose range is the set of strings. Thus, for any string T, Q(T) is
the sentence

"T is a predicate whose range is the set of strings, and T(T) is false."

This predicate is defined non-recursively, but the sentence Q(Q)
captures the essence of the liar paradox: it is true if and only if
it is false. This shows that there is no consistent way to interpret
sentences in which the sentences are allowed to come from predicates,
are allowed to use the concept of a string, and also allowed to use the
concept of truth as given by that interpretation.
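The mechanics of feeding a predicate to itself can be illustrated with a small Python sketch. This is our own toy encoding, not anything from the text: a predicate is represented by the source text of a boolean expression in the free variable x, so that a predicate is itself a string and can be applied to itself.

```python
# Toy model of predicates-as-strings: a predicate is the source text of a
# boolean expression in the free variable x, so predicates can be fed to
# themselves as input.
s_source = 'len(x) == 32'   # a predicate "x is thirty-two characters long"
u_source = 'len(x) == 12'   # a predicate that happens to be 12 characters long

def evaluate(pred_source, t):
    """Form the sentence pred(t) and return its truth value."""
    return eval(pred_source, {}, {"x": t})

print(evaluate(s_source, "a"))       # False: "a" is not 32 characters long
print(evaluate(s_source, s_source))  # False: s_source is only 12 characters long
print(evaluate(u_source, u_source))  # True: u_source is exactly 12 characters long
```

A Quine-style predicate would instead substitute the input string into itself before evaluating; the paradox only arises once the predicate is also allowed to invoke the truth of the resulting sentence, which this sketch deliberately cannot do.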
Note that the proof of Proposition 3.15.5 is basically the set-theoretic
analogue of the above argument, with the connection being
that one can identify a predicate T(x) with the set {x : T(x) true}.
By making one small modification to the above argument - replacing
the notion of truth with the related notion of provability -
one obtains the celebrated Gödel (second) incompleteness theorem:

Theorem 3.15.18 (Gödel's incompleteness theorem). (Informal statement)
No consistent logical system which has the notion of a string,
can provide a proof of its own logical consistency. (Note that a proof
can be viewed as a certain type of string.)
3.15. No self-defeating objects 527
Remark 3.15.19. Because one can encode strings in numerical form
(e.g. using the ASCII code), it is also true (informally speaking) that
no consistent logical system which has the notion of a natural number,
can provide a proof of its own logical consistency.
Proof. (Informal sketch only) Suppose for contradiction that one
had a consistent logical system inside of which its consistency could
be proven. Now let Q be the predicate given by

"x is a predicate whose range is the set of strings, and x(x) is not provable."

and whose range is the set of strings. Define the Gödel sentence
G to be the string G := Q(Q). Then G is logically equivalent to the
assertion "G is not provable". Thus, if G were false, then G would
be provable, which (by the consistency of the system) implies that G
is true, a contradiction; thus, G must be true. Because the system
is provably consistent, the above argument can be placed inside the
system itself, to prove inside that system that G must be true; thus
G is provable and G is then false, a contradiction. (It becomes quite
necessary to carefully distinguish the notions of truth and provability
(both inside a system and externally to that system) in order to get
this argument straight!)
Remark 3.15.20. It is not hard to show that a consistent logical
system which can model the standard natural numbers cannot disprove
its own consistency either (i.e. it cannot establish the statement
that one can deduce a contradiction from the axioms of the
system in n steps for some natural number n); thus the consistency
of such a system is undecidable within that system. Thus this theorem
strengthens the (more well known) first Gödel incompleteness
theorem, which asserts the existence of undecidable statements inside
a consistent logical system which contains the concept of a string (or
a natural number). On the other hand, the incompleteness theorem
does not preclude the possibility that the consistency of a weak theory
could be proven in a strictly stronger theory (e.g. the consistency
of Peano arithmetic is provable in Zermelo-Fraenkel set theory).
Remark 3.15.21. One can use the incompleteness theorem to establish
the undecidability of other overpowered problems. For instance,
Matiyasevich's theorem demonstrates that the problem of determining
the solvability of a system of Diophantine equations is, in general,
undecidable, because one can encode statements such as the consistency
of set theory inside such a system.
3.15.3. Computability. One can adapt these arguments in logic
to analogous arguments in the theory of computation; the basic idea
here is to show that a sufficiently overpowered computer program
cannot exist, by feeding the source code for that program into the
program itself (or some slight modification thereof) to create a contradiction.
As with logic, a properly rigorous formalisation of the
theory of computation would require a fair amount of preliminary
machinery, for instance to define the concept of a Turing machine (or
some other universal computer), and so I will once again use informal
English sentences as an informal substitute for a precise programming
language.
A fundamental no self-defeating object argument in the subject,
analogous to the other liar paradox type arguments encountered
previously, is the Turing halting theorem:

Theorem 3.15.22 (Turing halting theorem). There does not exist a
program P which takes a string S as input, and determines in finite
time whether S is a program (with no input) that halts in finite time.
Proof. Suppose for contradiction that such a program P existed.
Then one could easily modify P to create a variant program Q, which
takes a string S as input, and halts if and only if S is a program (with
S itself as input) that does not halt in finite time. Indeed, all Q has
to do is call P with the string S(S), defined as the program (with no
input) formed by declaring S to be the input string for the program
S. If P determines that S(S) does not halt, then Q halts; otherwise,
if P determines that S(S) does halt, then Q performs an infinite loop
and does not halt. Then observe that Q(Q) will halt if and only if it
does not halt, a contradiction.
Remark 3.15.23. As one can imagine from the proofs, this result
is closely related to, but not quite identical with, the Gödel incompleteness
theorem. The latter theorem implies that the question of
whether a given program halts or not in general is undecidable (consider
a program designed to search for proofs of the inconsistency
of set theory). By contrast, the halting theorem (roughly speaking)
shows that this question is uncomputable (i.e. there is no algorithm
to decide halting in general) rather than undecidable (i.e. there are
programs whose halting can neither be proven nor disproven).

On the other hand, the halting theorem can be used to establish
the incompleteness theorem. Indeed, suppose that all statements in
a suitably strong and consistent logical system were either provable
or disprovable. Then one could build a program that determined
whether an input string S, when run as a program, halts in finite
time, simply by searching for all proofs or disproofs of the statement
"S halts in finite time"; this program is guaranteed to terminate with
a correct answer by hypothesis.
Remark 3.15.24. While it is not possible for the halting problem
for a given computing language to be computable in that language,
it is certainly possible that it is computable in a strictly stronger
language. When that is the case, one can then invoke Newcomb's
paradox to argue that the weaker language does not have unlimited
"free will" in some sense.
Remark 3.15.25. In the language of recursion theory, the halting
theorem asserts that the set of programs that do not halt is not a
decidable set (or a recursive set). In fact, one can make the slightly
stronger assertion that the set of programs that do not halt is not
even a semi-decidable set (or a recursively enumerable set), i.e. there
is no algorithm which takes a program as input and halts in finite time
if and only if the input program does not halt. This is because the
complementary set of programs that do halt is clearly semi-decidable
(one simply runs the program until it halts, running forever if it does
not), and so if the set of programs that do not halt were also semi-decidable,
then it would be decidable (by running both algorithms in parallel;
this observation is a special case of Post's theorem).
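The semi-decidable half of this remark can be made concrete with a toy interpreter. The miniature counter language below is our own illustrative invention, not anything from the text: running a program step by step certifies halting whenever it occurs, but exhausting a step budget never certifies non-halting.

```python
def run_steps(prog, max_steps):
    """Step-bounded interpreter for a toy counter language.
    Instructions: ('inc',), ('dec',), ('jnz', target), ('halt',).
    Returns the number of steps taken if the program halts within
    max_steps, or None ("don't know") otherwise."""
    pc, counter, steps = 0, 0, 0
    while steps < max_steps:
        if pc >= len(prog):
            return steps          # fell off the end: halted
        op = prog[pc]
        steps += 1
        if op[0] == 'halt':
            return steps
        if op[0] == 'inc':
            counter += 1; pc += 1
        elif op[0] == 'dec':
            counter -= 1; pc += 1
        else:                     # ('jnz', target): jump if counter nonzero
            pc = op[1] if counter != 0 else pc + 1
    return None                   # budget exhausted: halting status unknown

halting = [('inc',), ('dec',), ('halt',)]
looping = [('inc',), ('jnz', 0)]  # counter stays positive: loops forever
print(run_steps(halting, 100))  # 3: halting is certified simply by running
print(run_steps(looping, 100))  # None: no finite budget certifies non-halting
```

Running the program is the semi-decision procedure for the halting set; the theorem says there is no analogous procedure for the non-halting set.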
Remark 3.15.26. One can use the halting theorem to exclude overly
general theories for certain types of mathematical objects. For instance,
one cannot hope to find an algorithm to determine the existence
of smooth solutions to arbitrary nonlinear partial differential
equations, because it is possible to simulate a Turing machine using
the laws of classical physics, which in turn can be modeled using (a
moderately complicated system of) nonlinear PDE. Instead, progress
in nonlinear PDE has proceeded by focusing on much more
specific classes of such PDE (e.g. elliptic PDE, parabolic PDE, hyperbolic
PDE, gauge theories, etc.).
One can place the halting theorem in a more quantitative form.
Call a function f : N → N computable if there exists a computer
program which, when given a natural number n as input, returns
f(n) as output in finite time. Define the Busy Beaver function BB :
N → N by setting BB(n) to equal the largest output of any program
of at most n characters in length (and taking no input), which halts
in finite time. Note that there are only finitely many such programs
for any given n, so BB(n) is well-defined. On the other hand, it is
uncomputable, even to upper bound:

Proposition 3.15.27. There does not exist a computable function f
such that one has BB(n) ≤ f(n) for all n.
Proof. Suppose for contradiction that there existed a computable
function f(n) such that BB(n) ≤ f(n) for all n. We can use this to
contradict the halting theorem, as follows. First observe that once
the Busy Beaver function can be upper bounded by a computable
function, then for any n, the run time of any halting program of
length at most n can also be upper bounded by a computable function.
This is because if a program of length n halts in finite time, then a
trivial modification of that program (of length larger than n, but
by a computable factor) is capable of outputting the run time of that
program (by keeping track of a suitable clock variable, for instance).
Applying the upper bound for Busy Beaver to that increased length,
one obtains the bound on run time.

Now, to determine whether a given program S halts in finite time
or not, one simply simulates (runs) that program for time up to the
computable upper bound of the possible running time of S, given
by the length of S. If the program has not halted by then, then it
never will. This provides a program P obeying the hypotheses of the
halting theorem, a contradiction.
Remark 3.15.28. A variant of the argument shows that BB(n)
grows faster than any computable function: thus if f is computable,
then BB(n) > f(n) for all sufficiently large n. We leave the proof of
this result as an exercise to the reader.
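For a feel of the definition, one can compute a busy-beaver-style function over a miniature counter language of our own devising (instructions rather than characters, and a counter value as "output"). Note the deliberate cheat flagged in the comments: we impose a step cap, precisely because true halting cannot be decided in advance.

```python
from itertools import product

OPS = [('inc',), ('dec',), ('jnz', 0), ('jnz', 1), ('halt',)]

def run_output(prog, max_steps=1000):
    """Run a toy counter program; return its final counter value if it
    halts within max_steps, else None. (The step cap is a cheat: a true
    Busy Beaver computation cannot bound running times in advance.)"""
    pc, counter, steps = 0, 0, 0
    while steps < max_steps:
        if pc >= len(prog):
            return counter
        op = prog[pc]
        steps += 1
        if op[0] == 'halt':
            return counter
        if op[0] == 'inc':
            counter += 1; pc += 1
        elif op[0] == 'dec':
            counter -= 1; pc += 1
        else:                     # ('jnz', target): jump if counter nonzero
            pc = op[1] if counter != 0 else pc + 1
    return None

def toy_bb(n):
    """Largest output of any halting program with at most n instructions."""
    best = 0
    for length in range(1, n + 1):
        for prog in product(OPS, repeat=length):
            out = run_output(list(prog))
            if out is not None:
                best = max(best, out)
    return best

print([toy_bb(n) for n in range(1, 4)])  # [1, 2, 3]: chains of 'inc' are optimal here
```

In this tiny instruction set the function grows only linearly; the explosive growth of the real BB comes from programs whose loops run for enormous but finite times, exactly the ones a step cap would misclassify.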
Remark 3.15.29. Sadly, the most important unsolved problem in
complexity theory, namely the P ≠ NP problem, does not seem to
be susceptible to the no-self-defeating-object argument, basically because
such arguments tend to be relativisable whereas the P ≠ NP
problem is not; see Section 3.9 for more discussion. On the other
hand, one has the curious feature that many proposed proofs that
P ≠ NP appear to be self-defeating; this is most strikingly captured
in the celebrated work of Razborov and Rudich [RaRu1997], who
showed (very roughly speaking) that any sufficiently natural proof
that P ≠ NP could be used to disprove the existence of an object
closely related to the belief that P ≠ NP, namely the existence of
pseudorandom number generators. (I am told, though, that diagonalisation
arguments can be used to prove some other inclusions or
non-inclusions in complexity theory that are not subject to the relativisation
barrier, though I do not know the details.)
3.15.4. Game theory. Another basic example of the no-self-defeating-objects
argument arises from game theory, namely the strategy stealing
argument. Consider for instance a generalised version of naughts
and crosses (tic-tac-toe), in which two players take turns placing
naughts and crosses on some game board (not necessarily square,
and not necessarily two-dimensional), with the naughts player going
first, until a certain pattern of all naughts or all crosses is obtained
as a subpattern, with the naughts player winning if the pattern is all
naughts, and the crosses player winning if the pattern is all crosses.
(If all positions are filled without either pattern occurring, the game
is a draw.) We assume that the winning patterns for the crosses player
are exactly the same as the winning patterns for the naughts player
(but with naughts replaced by crosses, of course).

Proposition 3.15.30. In any generalised version of naughts and
crosses, there is no strategy for the second player (i.e. the crosses
player) which is guaranteed to ensure victory.
Proof. Suppose for contradiction that the second player had such a
winning strategy W. The first player can then steal that strategy by
placing a naught arbitrarily on the board, and then pretending to be
the second player and using W accordingly. Note that occasionally,
the W strategy will compel the naughts player to place a naught on
the square that he or she has already occupied, but in such cases the
naughts player may simply place the naught somewhere else instead.
(It is not possible that the naughts player would run out of places,
thus forcing a draw, because this would imply that W could lead to
a draw as well, a contradiction.) If we denote this stolen strategy by
W′, then W′ is a guaranteed winning strategy for the naughts player.
Applying the W′ strategy for the naughts player against the W
strategy for the crosses player, we obtain a contradiction.
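For the ordinary 3×3 game, one can confirm the conclusion of Proposition 3.15.30 exhaustively by a minimax search (a standard computation of ours, not from the text): the game value under best play is a draw, so in particular the crosses player has no winning strategy.

```python
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    """Return 'N' or 'C' if that player has completed a line, else None."""
    for a, b, c in LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value with `player` to move: +1 means naughts ('N', moving
    first) can force a win, -1 means crosses ('C') can, 0 means a draw."""
    w = winner(board)
    if w == 'N':
        return 1
    if w == 'C':
        return -1
    if '.' not in board:
        return 0
    moves = [value(board[:i] + player + board[i+1:], 'C' if player == 'N' else 'N')
             for i, cell in enumerate(board) if cell == '.']
    return max(moves) if player == 'N' else min(moves)

print(value('.' * 9, 'N'))  # 0: best play draws, so crosses cannot force a win
```

Of course the strategy-stealing argument gives this conclusion for every generalised board, where exhaustive search is hopeless; the search merely checks the smallest case.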
Remark 3.15.31. The key point here is that in naughts and crosses
games, it is possible to play a harmless move - a move which gives
up the turn of play, but does not actually decrease one's chance of
winning. In games such as chess, there does not appear to be any
analogue of the harmless move, and so it is not known whether black
actually has a strategy guaranteed to win or not in chess, though it
is suspected that this is not the case.
Remark 3.15.32. The Hales-Jewett theorem shows that for any fixed
board length, an n-dimensional game of naughts and crosses is unable
to end in a draw if n is sufficiently large. An induction argument
shows that for any two-player game that terminates in bounded time
in which draws are impossible, one player must have a guaranteed
winning strategy; by the above proposition, this strategy must be a
win for the naughts player. Note, however, that Proposition 3.15.30
provides no information as to how to locate this winning strategy,
other than that this strategy belongs to the naughts player. Nevertheless,
this gives a second example in which the no-self-defeating-object
argument can be used to ensure the existence of some object,
rather than the non-existence of an object. (The first example was
the prime number theorem, discussed earlier.)
The strategy-stealing argument can be applied to real-world economics
and finance, though as with any other application of mathematics
to the real world, one has to be careful as to the implicit assumptions
one is making about reality and how it conforms to one's
mathematical model when doing so. For instance, one can argue that
in any market or other economic system in which the net amount of
money is approximately constant, it is not possible to locate a universal
trading strategy which is guaranteed to make money for the
user of that strategy, since if everyone applied that strategy then the
net amount of money in the system would increase, a contradiction.
Note however that there are many loopholes here; it may be that the
strategy is difficult to copy, or relies on exploiting some other group
of participants who are unaware or unable to use the strategy, and
would then lose money (though in such a case, the strategy is not
truly universal as it would stop working once enough people used
it). Unfortunately, there can be strong psychological factors that can
cause people to override the conclusions of such strategy-stealing arguments
with their own rationalisations, as can be seen, for instance,
in the perennial popularity of pyramid schemes, or to a lesser extent,
market bubbles (though one has to be careful about applying the
strategy-stealing argument in the latter case, since it is possible to
have net wealth creation through external factors such as advances in
technology).
Note also that the strategy-stealing argument also limits the universal
predictive power of technical analysis to provide predictions
other than that the prices obey a martingale, though again there are
loopholes in the case of markets that are illiquid or highly volatile.
3.15.5. Physics. In a similar vein, one can try to adapt the no-self-defeating-object
argument from mathematics to physics, but again
one has to be much more careful with various physical and metaphysical
assumptions that may be implicit in one's argument. For
instance, one can argue that under the laws of special relativity, it is
not possible to construct a totally immovable object. The argument
would be that if one could construct an immovable object O in one
inertial reference frame, then by the principle of relativity it should
be possible to construct an object O
3.16. Bose-Einstein condensates 537

∂_t p_i = −∂H/∂q_i;    ∂_t q_i = ∂H/∂p_i;

more abstractly, just from the symplectic form ω on the
phase space, the equations of motion can be written as

(3.83)    ∂_t x(t) = ∇_ω H(x(t)),

where ∇_ω is the symplectic gradient. It follows that

∂_t A(x(t)) = {H, A}(x(t))

for any observable A, where {H, A} := ∇_ω H · ∇A is the Poisson
bracket. One can express Poisson's equation more abstractly as

(3.84)    ∂_t A = {H, A}.
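Hamilton's equations can be checked numerically. The sketch below is our own example for the unit-mass harmonic oscillator H = p²/2 + q²/2 (a choice not made in the text): it integrates ∂_t q = ∂H/∂p, ∂_t p = −∂H/∂q with a symplectic (leapfrog) scheme and recovers the exact solution q = cos t, p = −sin t.

```python
import math

def hamiltonian_flow(q, p, dt, steps):
    """Leapfrog integration of dq/dt = dH/dp = p, dp/dt = -dH/dq = -q
    for H = p^2/2 + q^2/2; leapfrog respects the symplectic structure,
    so the energy H stays nearly constant along the computed flow."""
    for _ in range(steps):
        p -= 0.5 * dt * q   # half kick: dp/dt = -q
        q += dt * p         # drift:     dq/dt = p
        p -= 0.5 * dt * q   # half kick
    return q, p

q, p = hamiltonian_flow(1.0, 0.0, 1e-3, 1000)  # evolve (q,p) = (1,0) to time t = 1
print(abs(q - math.cos(1.0)) < 1e-4, abs(p + math.sin(1.0)) < 1e-4)
```

Conservation of H along the flow is the simplest instance of (3.84): since {H, H} = 0, the observable A = H is constant in time.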
In the above formalism, we are assuming that the system is in a
pure state at each time t, which means that it only occupies a single
point x(t) in phase space. One can also consider mixed states in which
the state of the system at a time t is not fully known, but is instead
given by a probability distribution ρ(t, x) dx on phase space. The
act of measuring an observable A at a time t will thus no longer be
deterministic, but will itself be a random variable, whose expectation
⟨A⟩ is given by

(3.85)    ⟨A⟩(t) = ∫ A(x) ρ(t, x) dx.

The equation of motion of a mixed state ρ is given by the advection
equation

∂_t ρ = −div(ρ ∇_ω H)

using the same vector field ∇_ω H that appears in (3.83). One can thus
think of mixed states as continuous averages of pure states, or
equivalently the space of mixed states is the convex hull of the space
of pure states.
Suppose one had a 2-particle system, in which the joint phase
space Ω = Ω_1 × Ω_2 is the product of the two one-particle phase
spaces. A pure joint state is then a point x = (x_1, x_2) in Ω, where
x_1 represents the state of the first particle, and x_2 is the state of the
second particle. If the joint Hamiltonian H : Ω → R split as

H(x_1, x_2) = H_1(x_1) + H_2(x_2)
(Footnote 11: We ignore for now the formal issues of how to perform operations such as
derivatives on Dirac masses; this can be accomplished using the theory of distributions
in Section 1.13 (or, equivalently, by working in the dual setting of observables) but this
is not our concern here.)
then the equations of motion for the first and second particles would
be completely decoupled, with no interactions between the two particles.
However, in practice, the joint Hamiltonian contains coupling
terms between x_1, x_2 that prevent one from totally decoupling the
system; for instance, one may have

H(x_1, x_2) = |p_1|²/2m_1 + |p_2|²/2m_2 + V(q_1 − q_2),

where x_1 = (q_1, p_1), x_2 = (q_2, p_2) are written using position coordinates
q_i and momentum coordinates p_i, m_1, m_2 > 0 are constants
(representing mass), and V(q_1 − q_2) is some interaction potential that
depends on the spatial separation q_1 − q_2 between the two particles.
In a similar spirit, a mixed joint state is a joint probability distribution
ρ(x_1, x_2) dx_1 dx_2 on the product state space. To recover the
(mixed) state of an individual particle, one must consider a marginal
distribution such as

ρ_1(x_1) := ∫_{Ω_2} ρ(x_1, x_2) dx_2

(for the first particle) or

ρ_2(x_2) := ∫_{Ω_1} ρ(x_1, x_2) dx_1

(for the second particle). Similarly for N-particle systems: if the joint
distribution of N distinct particles is given by ρ(x_1, . . . , x_N) dx_1 . . . dx_N,
then the distribution of the first particle (say) is given by

ρ_1(x_1) = ∫_{Ω_2 × · · · × Ω_N} ρ(x_1, x_2, . . . , x_N) dx_2 . . . dx_N,

the distribution of the first two particles is given by

ρ_{12}(x_1, x_2) = ∫_{Ω_3 × · · · × Ω_N} ρ(x_1, x_2, . . . , x_N) dx_3 . . . dx_N,

and so forth.
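On a finite toy phase space (a discrete stand-in of our own for the integrals above), the marginal distributions are just sums over the remaining coordinates:

```python
# Joint distribution rho(x1, x2) on a toy discrete phase space
# {0,1,2} x {0,1}; uniform, for concreteness.
joint = {(x1, x2): 1.0 / 6.0 for x1 in range(3) for x2 in range(2)}

# Marginal of the first particle: rho_1(x1) = sum over x2 of rho(x1, x2),
# the discrete analogue of integrating out dx2.
rho1 = {}
for (x1, x2), mass in joint.items():
    rho1[x1] = rho1.get(x1, 0.0) + mass

print(rho1)  # each value of x1 carries mass 1/3, and the masses sum to 1
```

The same loop with more coordinates held fixed produces the higher marginals such as rho_12.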
A typical Hamiltonian in this case may take the form

H(x_1, . . . , x_N) = Σ_{j=1}^N |p_j|²/2m_j + Σ_{1≤j<k≤N} V_{jk}(q_j − q_k)

which is a combination of single-particle Hamiltonians H_j and interaction
perturbations. If the momenta p_j and masses m_j are normalised
to be of size O(1), and the potential V_{jk} has an average value (i.e.
an L¹ norm) of O(1) also, then the former sum has size O(N) and
the latter sum has size O(N²), so the latter will dominate. In order
to balance the two components and get a more interesting limiting
dynamics when N → ∞, we shall therefore insert a normalising factor
of 1/N on the right-hand side, giving a Hamiltonian

H(x_1, . . . , x_N) = Σ_{j=1}^N |p_j|²/2m_j + (1/N) Σ_{1≤j<k≤N} V_{jk}(q_j − q_k).
Now imagine a system of N indistinguishable particles. By this,
we mean that all the state spaces Ω_1 = . . . = Ω_N are identical, and
all observables (including the Hamiltonian) are symmetric functions
of the product space Ω = Ω_1^N (i.e. invariant under the action of the
symmetric group S_N). In such a case, one may as well average over
this group (since this does not affect any physical observable), and
assume that all mixed states are also symmetric. (One cost of doing
this, though, is one has to largely give up pure states (x_1, . . . , x_N),
since such states will not be symmetric except in the very exceptional
case x_1 = . . . = x_N.)
A typical example of a symmetric Hamiltonian is

H(x_1, . . . , x_N) = Σ_{j=1}^N |p_j|²/2m + (1/N) Σ_{1≤j<k≤N} V(q_j − q_k)

where V is even (thus all particles have the same individual Hamiltonian,
and interact with the other particles using the same interaction
potential). In many physical systems, it is natural to consider only
short-range interaction potentials, in which the interaction between
q_j and q_k is localised to the region q_j − q_k = O(r) for some small r.
We model this by considering Hamiltonians of the form

H(x_1, . . . , x_N) = Σ_{j=1}^N H(x_j) + (1/N) Σ_{1≤j<k≤N} (1/r^d) V((x_j − x_k)/r)

where d is the ambient dimension of each particle (thus in physical
models, d would usually be 3); the factor of 1/r^d is a normalisation
factor designed to keep the L¹ norm of the interaction potential of
size O(1). It turns out that an interesting limit occurs when r goes to
zero as N goes to infinity by some power law r = N^{−β}; imagine for
instance N particles of radius r bouncing around in a box, which
is a basic model for classical gases.
An important example of a symmetric mixed state is a factored
state

ρ(x_1, . . . , x_N) = ρ_1(x_1) . . . ρ_1(x_N)

where ρ_1 is a single-particle probability density function; thus ρ is the
tensor product of N copies of ρ_1. If there are no interaction terms in
the Hamiltonian, then Hamilton's equation of motion will preserve
the property of being a factored state (with ρ_1 evolving according to
the one-particle equation); but with interactions, the factored nature
may be lost over time.
3.16.2. A quick review of quantum mechanics. Now we turn
to quantum mechanics. This theory is fundamentally rather different
in nature than classical mechanics (in the sense that the basic
objects, such as states and observables, are a different type of mathematical
object than in the classical case), but shares many features
in common also, particularly those relating to the Hamiltonian and
other observables. (This relationship is made more precise via the
correspondence principle, and more precise still using semi-classical
analysis.)

The formalism of quantum mechanics for a given physical system
can be summarised briefly as follows:

- The physical system has a phase space H of states |ψ⟩ (which
is often parameterised as a complex-valued function of the
position space). Mathematically, it has the structure of a
complex Hilbert space, which is traditionally manipulated
using bra-ket notation.

- The complete state of the system at any given time t is
given (in the case of pure states) by a unit vector |ψ(t)⟩ in
the phase space H.

- Every physical observable A is associated to a linear operator
on H; real-valued observables are associated to self-adjoint
linear operators. If one measures the observable A
at time t, one will obtain the random variable whose expectation
⟨A⟩ is given by ⟨ψ(t)|A|ψ(t)⟩. (The full distribution
of A is given by the spectral measure of A relative to |ψ(t)⟩.)

- There is a special observable, the Hamiltonian H : H → H,
which governs the evolution of the state |ψ(t)⟩ through time,
via Schrödinger's equation of motion

(3.86)    iℏ ∂_t |ψ(t)⟩ = H|ψ(t)⟩.
Schrödinger's equation of motion can also be expressed in a dual
form in terms of observables A, as Heisenberg's equation of motion

∂_t ⟨ψ|A|ψ⟩ = (i/ℏ) ⟨ψ|[H, A]|ψ⟩

or more abstractly as

(3.87)    ∂_t A = (i/ℏ) [H, A]

where [ , ] is the commutator or Lie bracket (compare with (3.84)).

The states |ψ⟩ are pure states, analogous to the pure states x
in Hamiltonian mechanics. One also has mixed states in quantum
mechanics. Whereas in classical mechanics, a mixed state is a probability
distribution (a non-negative function of total mass ∫ ρ = 1),
in quantum mechanics a mixed state ρ is a non-negative (i.e. positive
semi-definite) operator on H of total trace tr ρ = 1. If one measures
an observable A at a mixed state ρ, one obtains a random variable
with expectation tr Aρ. From (3.87) and duality, one can infer that
the correct equation of motion for mixed states must be given by

(3.88)    ∂_t ρ = −(i/ℏ) [H, ρ].

One can view pure states as the special case of mixed states which
are rank one projections,

ρ = |ψ⟩⟨ψ|.

Morally speaking, the space of mixed states is the convex hull of the
space of pure states (just as in the classical case), though things are a
little trickier than this when the phase space H is infinite dimensional,
due to the presence of continuous spectrum in the spectral theorem.

Pure states suffer from a phase ambiguity: a phase rotation e^{iθ}|ψ⟩
of a pure state |ψ⟩ leads to the same mixed state, and the two states
cannot be distinguished by any physical observable.
In a single particle system, modeling a (scalar) quantum particle
in a d-dimensional position space R^d, one can identify the Hilbert
space H with L²(R^d → C), and describe the pure state |ψ⟩ as a wave
function ψ : R^d → C, which is normalised as

∫_{R^d} |ψ(x)|² dx = 1

as |ψ⟩ has to be a unit vector. (If the quantum particle has additional
features such as spin, then one needs a fancier wave function, but let's
ignore this for now.) A mixed state is then a function ρ : R^d × R^d → C
which is Hermitian (i.e. ρ(x, x') = \overline{ρ(x', x)}), with a pure state ψ
identified with the mixed state ρ(x, x') = ψ(x) \overline{ψ(x')}.
A typical Hamiltonian in this setting is given by the operator

Hψ(x) := (|p|²/2m) ψ(x) + V(x) ψ(x)

where m > 0 is a constant, p is the momentum operator p := (ℏ/i) ∇_x,
and ∇_x is the gradient in the x variable (so |p|² = −ℏ² Δ_x, where Δ_x
is the Laplacian; note that ∇_x is skew-adjoint and should thus
be thought of as being imaginary rather than real), and V : R^d → R
is some potential. Physically, this depicts a particle of mass m in a
potential well given by the potential V.
Now suppose one has an N-particle system of scalar particles. A
pure state of such a system can then be given by an N-particle wave
function ψ : (R^d)^N → C, normalised so that

∫_{(R^d)^N} |ψ(x_1, . . . , x_N)|² dx_1 . . . dx_N = 1
and a mixed state is a Hermitian positive semi-definite function ρ :
(R^d)^N × (R^d)^N → C with trace

∫_{(R^d)^N} ρ(x_1, . . . , x_N; x_1, . . . , x_N) dx_1 . . . dx_N = 1,

with a pure state being identified with the mixed state

ρ(x_1, . . . , x_N; x'_1, . . . , x'_N) := ψ(x_1, . . . , x_N) \overline{ψ(x'_1, . . . , x'_N)}.
In classical mechanics, the state of a single particle was the marginal
distribution of the joint state. In quantum mechanics, the state of
a single particle is instead obtained as the partial trace of the joint
state. For instance, the state of the first particle is given as

ρ_1(x_1; x'_1) := ∫_{(R^d)^{N−1}} ρ(x_1, x_2, . . . , x_N; x'_1, x_2, . . . , x_N) dx_2 . . . dx_N,

the state of the first two particles is given as

ρ_{12}(x_1, x_2; x'_1, x'_2) := ∫_{(R^d)^{N−2}} ρ(x_1, x_2, x_3, . . . , x_N; x'_1, x'_2, x_3, . . . , x_N) dx_3 . . . dx_N,

and so forth. (These formulae can be justified by considering observables
of the joint state that only affect, say, the first two position
coordinates x_1, x_2 and using duality.)
A typical Hamiltonian in this setting is given by the operator

Hψ(x_1, . . . , x_N) = Σ_{j=1}^N (|p_j|²/2m_j) ψ(x_1, . . . , x_N)
+ (1/N) Σ_{1≤j<k≤N} V_{jk}(x_j − x_k) ψ(x_1, . . . , x_N)

where we normalise just as in the classical case, and p_j := (ℏ/i) ∇_{x_j}.
An interesting feature of quantum mechanics - not present in the
classical world - is that even if the N-particle system is in a pure state,
individual particles may be in a mixed state: the partial trace of a
pure state need not remain pure. Because of this, when considering
a subsystem of a larger system, one cannot always assume that the
subsystem is in a pure state, but must work instead with mixed states
throughout, unless there is some reason (e.g. a lack of coupling) to
assume that pure states are somehow preserved.
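This phenomenon is easy to exhibit on a two-point position space (a toy discretisation of ours, essentially a Bell state, not an example from the text): the partial trace of the pure state ψ = (|00⟩ + |11⟩)/√2 is the maximally mixed one-particle state, with purity tr ρ₁² = 1/2 rather than 1.

```python
import math

# Pure two-particle wave function psi(x1, x2) on the two-point space {0, 1}.
s = 1.0 / math.sqrt(2.0)
psi = {(0, 0): s, (0, 1): 0.0, (1, 0): 0.0, (1, 1): s}

# Partial trace over the second particle:
# rho1(x1; x1') = sum_{x2} psi(x1, x2) * conj(psi(x1', x2))
rho1 = {(a, b): sum(psi[(a, x2)] * psi[(b, x2)].conjugate() for x2 in (0, 1))
        for a in (0, 1) for b in (0, 1)}

# Purity tr(rho1^2); a pure one-particle state would give exactly 1.
purity = sum(rho1[(a, b)] * rho1[(b, a)] for a in (0, 1) for b in (0, 1))
print(purity)  # close to 0.5: the one-particle state is genuinely mixed
```

The off-diagonal entries of ρ₁ vanish: all the coherence of the joint pure state lives in the correlations between the two particles, which the partial trace discards.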
Now consider a system of N indistinguishable quantum particles.
As in the classical case, this means that all observables (including the
Hamiltonian) for the joint system are invariant with respect to the
action of the symmetric group S_N. Because of this, one may as well
assume that the (mixed) state of the joint system is also symmetric
with respect to this action. In the special case when the particles are
bosons, one can also assume that pure states |ψ⟩ are also symmetric
with respect to this action (in contrast to fermions, where the action
on pure states is anti-symmetric). A typical Hamiltonian in this
setting is given by the operator

Hψ(x_1, . . . , x_N) = Σ_{j=1}^N (|p_j|²/2m) ψ(x_1, . . . , x_N)
+ (1/N) Σ_{1≤j<k≤N} V(x_j − x_k) ψ(x_1, . . . , x_N)
for some even potential V; if one wants to model short-range interactions,
one might instead pick the variant

(3.89)    Hψ(x_1, . . . , x_N) = Σ_{j=1}^N (|p_j|²/2m) ψ(x_1, . . . , x_N)
+ (1/N) Σ_{1≤j<k≤N} (1/r^d) V((x_j − x_k)/r) ψ(x_1, . . . , x_N)

for some r > 0. This is a typical model for an N-particle Bose-Einstein
condensate. (Longer-range models can lead to more non-local
variants of NLS for the limiting equation, such as the Hartree
equation.)
3.16.3. NLS. Suppose we have a Bose-Einstein condensate given by
a (symmetric) mixed state

ρ(t, x_1, . . . , x_N; x'_1, . . . , x'_N)

evolving according to the equation of motion (3.88) using the Hamiltonian
(3.89). One can take a partial trace of the equation of motion
(3.88) to obtain an equation for the state ρ_1(t, x_1; x'_1) of the first
particle (note from symmetry that all the other particles will have the
same state function). If one does take this trace, one soon finds that
the equation of motion becomes

∂_t ρ_1(t, x_1; x'_1) = (i/ℏ) [ (|p'_1|²/2m − |p_1|²/2m) ρ_1(t, x_1; x'_1)
+ (1/N) Σ_{j=2}^N ∫_{R^d} (1/r^d) [V((x'_1 − x_j)/r) − V((x_1 − x_j)/r)] ρ_{1j}(t, x_1, x_j; x'_1, x_j) dx_j ]
where ρ_{1j} is the partial trace to the 1, j particles. Using symmetry,
we see that all the summands in the j summation are identical, so we
can simplify this as

∂_t ρ_1(t, x_1; x'_1) = (i/ℏ) [ (|p'_1|²/2m − |p_1|²/2m) ρ_1(t, x_1; x'_1)
+ ((N − 1)/N) ∫_{R^d} (1/r^d) [V((x'_1 − x_2)/r) − V((x_1 − x_2)/r)] ρ_{12}(t, x_1, x_2; x'_1, x_2) dx_2 ].
This does not completely describe the dynamics of $\rho_1$, as one also
needs an equation for $\rho_{12}$. But one can repeat the same argument to
get an equation for $\rho_{12}$ involving $\rho_{123}$, and so forth, leading to a system
of equations known as the BBGKY hierarchy. But for simplicity
we shall just look at the first equation in this hierarchy.
Let us now formally take two limits in the above equation, sending
the number of particles $N$ to infinity and the interaction scale $r$ to
zero. The effect of sending $N$ to infinity should simply be to eliminate
the $\frac{N-1}{N}$ factor. The effect of sending $r$ to zero should be to send
$\frac{1}{r^d} V(\frac{x}{r})$ to the Dirac mass $\lambda \delta(x)$, where $\lambda := \int_{\mathbf{R}^d} V$ is the total mass
of $V$. Formally performing these two limits, one is led to the equation
$$\partial_t \rho_1(t, x_1; x'_1) = \frac{i}{\hbar}\Big[\Big(\frac{|p_1|^2}{2m} - \frac{|p'_1|^2}{2m}\Big)\rho_1(t, x_1; x'_1)$$
$$+ \lambda\big(\rho_{12}(t, x_1, x_1; x'_1, x_1) - \rho_{12}(t, x_1, x'_1; x'_1, x'_1)\big)\Big].$$
One can perform a similar formal limiting procedure for the other
equations in the BBGKY hierarchy, obtaining a system of equations
known as the Gross-Pitaevskii hierarchy.
We next make an important simplifying assumption, which is
that in the limit $N \to \infty$ any two particles in this system become
decoupled, which means that the two-particle mixed state factors as
the tensor product of two one-particle states:
$$\rho_{12}(t, x_1, x_2; x'_1, x'_2) \approx \rho_1(t, x_1; x'_1)\,\rho_1(t, x_2; x'_2).$$
One can view this as a mean field approximation, modeling the interaction
of one particle $x_1$ with all the other particles by the mean field $\rho_1$.
Making this assumption, the previous equation simplifies to
$$\partial_t \rho_1(t, x_1; x'_1) = \frac{i}{\hbar}\Big[\Big(\frac{|p_1|^2}{2m} - \frac{|p'_1|^2}{2m}\Big) + \lambda\big(\rho_1(t, x_1; x_1) - \rho_1(t, x'_1; x'_1)\big)\Big]\rho_1(t, x_1; x'_1).$$
546 3. Expository articles
If we assume furthermore that $\rho_1$ is a pure state, thus
$$\rho_1(t, x_1; x'_1) = \psi(t, x_1)\,\overline{\psi(t, x'_1)},$$
then (up to the phase ambiguity mentioned earlier), $\psi(t,x)$ obeys the
Gross-Pitaevskii equation
$$\partial_t \psi(t,x) = \frac{i}{\hbar}\Big[\frac{|p|^2}{2m} + \lambda|\psi(t,x)|^2\Big]\psi(t,x),$$
which (up to some factors of $\hbar$ and $m$, which can be renormalised
away) is essentially (3.82).
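In one spatial dimension the limiting equation is easy to simulate. The sketch below (not from the text; the grid, the coupling `lam=1.0`, the time step, and units with $\hbar = m = 1$ are all illustrative assumptions) evolves $i\partial_t \psi = -\frac{1}{2}\psi_{xx} + \lambda|\psi|^2\psi$ by Strang-split spectral steps; since each factor is a unitary multiplier, the mass $\int |\psi|^2$ is conserved, mirroring the $L^2$ conservation of the NLS equation:

```python
import numpy as np

def split_step_nls(psi, lam, dx, dt, steps):
    """Evolve i psi_t = -(1/2) psi_xx + lam |psi|^2 psi by Strang splitting."""
    k = 2 * np.pi * np.fft.fftfreq(psi.size, d=dx)   # angular wavenumbers
    kinetic = np.exp(-0.5j * k ** 2 * dt)            # exact kinetic full step
    for _ in range(steps):
        psi = psi * np.exp(-0.5j * lam * np.abs(psi) ** 2 * dt)  # half nonlinear step
        psi = np.fft.ifft(kinetic * np.fft.fft(psi))             # kinetic step
        psi = psi * np.exp(-0.5j * lam * np.abs(psi) ** 2 * dt)  # half nonlinear step
    return psi

x = np.linspace(-10, 10, 256, endpoint=False)
dx = x[1] - x[0]
psi = np.exp(-x ** 2).astype(complex)
psi /= np.sqrt(np.sum(np.abs(psi) ** 2) * dx)        # normalise mass to 1
psi = split_step_nls(psi, lam=1.0, dx=dx, dt=1e-3, steps=200)
mass = np.sum(np.abs(psi) ** 2) * dx                 # conserved by unitarity
```

The split-step scheme is a standard choice here because each sub-step (kinetic in Fourier space, nonlinear in physical space) is solved exactly.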
An alternate derivation of (3.82), using a slight variant of the
above mean eld approximation, comes from studying the Hamilton-
ian (3.89). Let us make the (very strong) assumption that at some
fixed time $t$, one is in a completely factored pure state
$$\Psi(x_1, \ldots, x_N) = \psi_1(x_1) \cdots \psi_1(x_N),$$
where $\psi_1$ is a one-particle wave function, in particular obeying the
normalisation
$$\int_{\mathbf{R}^d} |\psi_1(x)|^2\,dx = 1.$$
(This is an unrealistically strong version of the mean field approximation.
In practice, one only needs the two-particle partial traces to
be completely factored for the discussion below.) The expected value
of the Hamiltonian,
$$\langle \Psi | H | \Psi \rangle = \int_{(\mathbf{R}^d)^N} \overline{\Psi(x_1,\ldots,x_N)}\, H\Psi(x_1,\ldots,x_N)\, dx_1 \ldots dx_N,$$
can then be simplified as
$$N\int_{\mathbf{R}^d} \overline{\psi_1(x)}\,\frac{|p_1|^2}{2m}\,\psi_1(x)\,dx + \frac{N-1}{2}\int_{\mathbf{R}^d}\int_{\mathbf{R}^d} \frac{1}{r^d}\, V\Big(\frac{x_1 - x_2}{r}\Big)\, |\psi_1(x_1)|^2\, |\psi_1(x_2)|^2\, dx_1\, dx_2.$$
Again sending $r \to 0$, this formally becomes
$$N\int_{\mathbf{R}^d} \overline{\psi_1(x)}\,\frac{|p_1|^2}{2m}\,\psi_1(x)\,dx + \frac{\lambda(N-1)}{2}\int_{\mathbf{R}^d} |\psi_1(x_1)|^4\, dx_1$$
which in the limit $N \to \infty$ is asymptotically
$$N\int_{\mathbf{R}^d} \overline{\psi_1(x)}\,\frac{|p_1|^2}{2m}\,\psi_1(x) + \frac{\lambda}{2}\,|\psi_1(x)|^4\, dx.$$
Up to some normalisations, this is the Hamiltonian for the NLS equa-
tion (3.82).
There has been much progress recently in making the above deriva-
tions precise, see e.g. [Sc2006], [KlMa2008], [KiScSt2008], [ChPa2009].
A key step is to show that the Gross-Pitaevskii hierarchy necessar-
ily preserves the property of being a completely factored state. This
requires a uniqueness theory for this hierarchy, which is surprisingly
delicate, due to the fact that it is a system of infinitely many coupled
equations over an unbounded number of variables.
Remark 3.16.1. Interestingly, the above heuristic derivation only
works when the interaction scale $r$ is much larger than $N^{-1}$. For
$r \sim N^{-1}$, the coupling constant $\lambda$ acquires a nonlinear correction,
becoming essentially the scattering length of the potential rather than
its mean. (Thanks to Bob Jerrard for pointing out this subtlety.)
Notes. This article first appeared at terrytao.wordpress.com/2009/11/26.
Thanks to CJ, liuyao, Mio and M.S. for corrections.
Bob Jerrard provided a heuristic argument as to why the coupling
constant becomes nonlinear in the regime $r \sim N^{-1}$.
Chapter 4
Technical articles
549
550 4. Technical articles
4.1. Polymath1 and three new proofs of the
density Hales-Jewett theorem
During the first few months of 2009, I was involved in the Polymath1
project, a massively collaborative mathematical project whose pur-
pose was to investigate the viability of various approaches to proving
the density Hales-Jewett theorem. For simplicity I will focus attention
here on the model case k = 3 of a three-letter alphabet, in which case
the theorem reads as follows:
Theorem 4.1.1 ($k = 3$ density Hales-Jewett theorem). Let $0 < \delta \le 1$.
Then if $n$ is a sufficiently large integer, any subset $A$ of the cube
$[3]^n = \{1, 2, 3\}^n$ of density $|A|/3^n$ at least $\delta$ contains at least one
combinatorial line $\{\ell(1), \ell(2), \ell(3)\}$, where $\ell \in \{1, 2, 3, x\}^n \backslash [3]^n$ is a
string of 1s, 2s, 3s, and $x$s containing at least one wildcard $x$, and
$\ell(i)$ is the string formed from $\ell$ by replacing all $x$s with $i$s.
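To make the notation concrete, here is a small illustrative script (hypothetical code, not part of the text) that enumerates the combinatorial lines of $[3]^n$ and tests a set for containing one:

```python
from itertools import product

def combinatorial_lines(n):
    """Yield all combinatorial lines of [3]^n, one per wildcard template."""
    for template in product((1, 2, 3, 'x'), repeat=n):
        if 'x' not in template:
            continue                     # a line needs at least one wildcard
        yield tuple(tuple(i if ch == 'x' else ch for ch in template)
                    for i in (1, 2, 3))

def has_line(A, n):
    """Does the subset A of [3]^n contain an entire combinatorial line?"""
    A = set(A)
    return any(all(pt in A for pt in line) for line in combinatorial_lines(n))

cube = list(product((1, 2, 3), repeat=2))            # all of [3]^2
assert has_line(cube, 2)
line_free = [(1, 2), (2, 1), (2, 3), (3, 2)]         # a line-free subset
assert not has_line(line_free, 2)
```

The theorem asserts that once the density threshold is exceeded for large $n$, `has_line` must return `True`; the brute-force search above is of course only feasible for tiny $n$.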
The full density Hales-Jewett theorem is the same statement, but
with $[3]$ replaced by $[k]$ for some $k \ge 1$. (The case $k = 1$ is trivial,
and the case $k = 2$ follows from Sperner's theorem.) As a result of the
project, three new proofs of this theorem were established, at least
one of which has been extended [Po2009] to cover the case of general $k$.
This theorem was first proven by Furstenberg and Katznelson [FuKa1989],
by first converting it to a statement in ergodic theory; the original
Furstenberg-Katznelson argument was for the $k = 3$ case
only, and gave only part of the proof in detail, but in a subsequent
paper [FuKa1991] a full proof for general $k$ was provided. The remaining
components of the original $k = 3$ argument were later completed
in unpublished notes of McCutcheon¹. One of the new proofs is
essentially a finitary translation of this $k = 3$ argument; in principle
one could also finitise the significantly more complicated argument
of Furstenberg and Katznelson for general $k$, but this has not been
properly carried out yet (the other two proofs are likely to generalise
much more easily to higher $k$). The result is considered quite deep;
for instance, the general $k$ case of the density Hales-Jewett theorem
already implies Szemerédi's theorem, which is a highly non-trivial
theorem in its own right, as a special case.
¹http://www.msci.memphis.edu/~randall/preprints/HJk3.pdf
4.1. Polymath1 551
Another of the proofs is based primarily on the density increment
method that goes back to Roth, and also incorporates some ideas from
a paper of Ajtai and Szemerédi [AjSz1974] establishing what we have
called the corners theorem (and which is also implied by the $k = 3$
case of the density Hales-Jewett theorem). A key new idea involved
studying the correlations of the original set $A$ with special subsets of
$[3]^n$, such as $ij$-insensitive sets, or intersections of $ij$-insensitive and
$ik$-insensitive sets.
This correlations idea inspired a new ergodic proof of the density
Hales-Jewett theorem for all values of $k$ by Austin [Au2009b], which
is in the spirit of the triangle removal lemma (or hypergraph removal
lemma) proofs of Roth's theorem (or the multidimensional Szemerédi
theorem). A finitary translation of this argument in the $k = 3$ case
has been sketched out; I believe it also extends in a relatively straightforward
manner to the higher $k$ case (in analogy with some proofs of
the hypergraph removal lemma).
4.1.1. Simpler cases of density Hales-Jewett. In order to motivate
the known proofs of the density Hales-Jewett theorem, it is
instructive to consider some simpler theorems which are implied by
this theorem. The first is the corners theorem of Ajtai and Szemerédi:
Theorem 4.1.2 (Corners theorem). Let $0 < \delta \le 1$. Then if $n$ is
a sufficiently large integer, any subset $A$ of the square $[n]^2$ of density
$|A|/n^2$ at least $\delta$ contains at least one right-angled triangle (or
"corner") $(x, y), (x + r, y), (x, y + r)$ with $r \neq 0$.
The $k = 3$ density Hales-Jewett theorem implies the corners theorem;
this is proven by utilising the map $c: [3]^n \to [n]^2$ from the cube
to the square, defined by mapping a string $x \in [3]^n$ to a pair $(a, b)$,
where $a, b$ are the number of 1s and 2s respectively in $x$. The key point
is that $c$ maps combinatorial lines to corners. (Strictly speaking, this
mapping only establishes the corners theorem for dense subsets of
$[n/3 - \sqrt{n}, n/3 + \sqrt{n}]^2$, but it is not difficult to obtain the general
case from this by replacing $n$ by $n^2$ and using translation-invariance.)
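This projection is easy to check by machine. In the sketch below (illustrative code; the projection is written as `c` here, and the wildcard templates are this sketch's own encoding) every combinatorial line of $[3]^3$ is verified to project to a corner:

```python
from itertools import product

def c(x):
    """Map a string in [3]^n to (number of 1s, number of 2s)."""
    return (sum(1 for v in x if v == 1), sum(1 for v in x if v == 2))

def is_corner(p, q, s):
    """Is {p, q, s} of the form (a,b), (a+r,b), (a,b+r) with r != 0?"""
    for base, u, v in ((p, q, s), (q, p, s), (s, p, q)):
        a, b = base
        for first, second in ((u, v), (v, u)):
            r = first[0] - a
            if r != 0 and first == (a + r, b) and second == (a, b + r):
                return True
    return False

n = 3
for template in product((1, 2, 3, 'x'), repeat=n):
    if 'x' not in template:
        continue
    line = [tuple(i if ch == 'x' else ch for ch in template) for i in (1, 2, 3)]
    assert is_corner(*(c(pt) for pt in line))    # every line maps to a corner
```

A line with $w$ wildcards projects to the corner $(a, b), (a + w, b), (a, b + w)$, where $(a, b)$ counts the fixed 1s and 2s, which is exactly what the loop confirms.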
The corners theorem is also closely related to the problem of
finding dense sets of points in a triangular grid without any equilateral
triangles, a problem which we have called Fujimura's problem.
The corners theorem in turn implies
Theorem 4.1.3 (Roth's theorem). Let $0 < \delta \le 1$. Then if $n$ is a
sufficiently large integer, any subset $A$ of the interval $[n]$ of density
$|A|/n$ at least $\delta$ contains at least one arithmetic progression $a, a + r, a + 2r$ of length three.
Roth's theorem can be deduced from the corners theorem by considering
the map $s: [n]^2 \to [3n]$ defined by $s(a, b) := a + 2b$; the key
point is that $s$ maps corners to arithmetic progressions of length
three.
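This map is equally mechanical to check (illustrative code; the map $(a,b) \mapsto a + 2b$ is called `s` here):

```python
def s(p):
    """The map (a, b) -> a + 2b from [n]^2 to [3n]."""
    a, b = p
    return a + 2 * b

# a corner (x,y), (x+r,y), (x,y+r) maps to the three numbers
# x+2y, x+2y+r, x+2y+2r: an arithmetic progression of spacing r
for x, y, r in [(1, 1, 2), (5, 3, -2), (4, 7, 1)]:
    images = sorted([s((x, y)), s((x + r, y)), s((x, y + r))])
    assert images[1] - images[0] == images[2] - images[1] != 0
```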
There are higher $k$ analogues of these implications; the general $k$
version of the density Hales-Jewett theorem implies a general $k$ version
of the corners theorem known as the multidimensional Szemerédi
theorem, which in turn implies a general version of Roth's theorem
known as Szemerédi's theorem.
4.1.2. The density increment argument. The strategy of the
density increment argument, which goes back to Roth's proof [Ro1953]
of Theorem 4.1.3, is to perform a downward induction on the density
$\delta$. Indeed, the theorem is obvious for high enough values of $\delta$; for
instance, if $\delta > 2/3$, then partitioning the cube $[3]^n$ into lines and
applying the pigeonhole principle will already give a combinatorial
line. So the idea is to deduce the claim for a fixed density $\delta$ from that
of a higher density $\delta' > \delta$.
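The shape of this downward induction can be sketched in a few lines (schematic code; the increment function `c` below is a hypothetical stand-in for the one produced by the actual argument): each failure to find a line raises the density, and densities above $2/3$ yield a line by pigeonhole, so the iteration must terminate.

```python
def passes_until_line(delta, c):
    """Bound the number of density-increment steps before a line appears."""
    steps = 0
    while delta <= 2 / 3:        # above density 2/3, pigeonhole gives a line
        delta += c(delta)        # no line => higher density on a subspace
        steps += 1
    return steps

# with a toy increment c(delta) = 0.1 (hypothetical), starting from 0.1:
steps = passes_until_line(0.1, lambda d: 0.1)
```

Since the density can never exceed 1, the number of passes is at most roughly $(2/3 - \delta)$ divided by the smallest increment encountered.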
A key concept here is that of an $m$-dimensional combinatorial subspace
of $[3]^n$ - a set of the form $\phi([3]^m)$, where $\phi \in \{1, 2, 3, *_1, \ldots, *_m\}^n$
is a string formed using the base alphabet and $m$ wildcards $*_1, \ldots, *_m$
(with each wildcard appearing at least once), and $\phi(a_1 \ldots a_m)$ is the
string formed by substituting $a_i$ for $*_i$ for each $i$. (Thus, for instance,
a combinatorial line is a combinatorial subspace of dimension
1.) The identification between $[3]^m$ and the combinatorial subspace
$\phi([3]^m)$ maps combinatorial lines to combinatorial lines. Thus, to
prove Theorem 4.1.1, it suffices to show
Proposition 4.1.4 (Lack of lines implies density increment). Let
$0 < \delta \le 1$. Then if $n$ is a sufficiently large integer, and $A \subset [3]^n$ has
density at least $\delta$ and has no combinatorial lines, then there exists
an $m$-dimensional subspace $\phi([3]^m)$ of $[3]^n$ on which $A$ has density at
least $\delta + c(\delta)$, where $c(\delta) > 0$ depends only on $\delta$ (and is bounded away
from zero on any compact range of $\delta$), and $m \ge m_0(n, \delta)$ for some
function $m_0(n, \delta)$ that goes to infinity as $n \to \infty$ for fixed $\delta$.
It is easy to see that Proposition 4.1.4 implies Theorem 4.1.1
(for instance, one could consider the infimum of all $\delta$ for which the
theorem holds, and show that having this infimum non-zero would
lead to a contradiction).
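The wildcard templates used to define combinatorial subspaces can also be made concrete (illustrative code; representing the wildcards by the strings `'*1'`, `'*2'` is an encoding chosen for this sketch):

```python
from itertools import product

def embed(template, point):
    """Substitute point[i-1] for each wildcard '*i' in the template."""
    return tuple(point[int(ch[1:]) - 1] if isinstance(ch, str) else ch
                 for ch in template)

phi = (1, '*1', 3, '*2', '*1')       # a 2-dimensional subspace of [3]^5
subspace = [embed(phi, p) for p in product((1, 2, 3), repeat=2)]
assert len(set(subspace)) == 9       # an embedded copy of [3]^2

# lines of [3]^2 map to lines of [3]^5: the line ('x', 2) maps to the
# line with template (1, 'x', 3, 2, 'x')
image = [embed(phi, (i, 2)) for i in (1, 2, 3)]
assert image == [(1, i, 3, 2, i) for i in (1, 2, 3)]
```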
Now we have to figure out how to get that density increment. The
original argument of Roth relied on Fourier analysis, which in turn
relies on an underlying translation-invariant structure which is not
present in the density Hales-Jewett setting. (Arithmetic progressions
are translation-invariant, but combinatorial lines are not.) It turns
out that one can proceed instead by adapting (a modification of)
an argument of Ajtai and Szemerédi, which gave the first proof of
Theorem 4.1.2.
The (modified) Ajtai-Szemerédi argument uses the density increment
method, assuming that $A$ has no right-angled triangles and
showing that $A$ has an increased density on a subgrid - a product
$P \times Q$ of fairly long arithmetic progressions with the same spacing.
The argument proceeds in two stages, which we describe slightly informally
(in particular, glossing over some technical details regarding
quantitative parameters such as $\delta$) as follows:
Step 1. If $A \subset [n]^2$ is dense but has no right-angled triangles,
then $A$ has an increased density on a Cartesian product $U \times V$
of dense sets $U, V \subset [n]$ (which are not necessarily arithmetic
progressions).
Step 2. Any Cartesian product $U \times V$ in $[n]^2$ can be partitioned
into reasonably large grids $P \times Q$, plus a remainder
term of small density.
From Step 1, Step 2 and the pigeonhole principle we obtain the
desired density increment of $A$ on a grid $P \times Q$, and then the density
increment argument gives us the corners theorem.
Step 1 is actually quite easy. If $A$ is dense, then it must also
be dense on some diagonal $D = \{(x, y) : x + y = \mathrm{const}\}$, by the
pigeonhole principle. Let $U$ and $V$ denote the rows and columns that
$A \cap D$ occupies. Every pair of points in $A \cap D$ forms the hypotenuse
of some corner, whose third vertex lies in $U \times V$. Thus, if $A$ has no
corners, then $A$ must avoid all the points formed by $U \times V$ (except for
those of the diagonal $D$). Thus $A$ has a significant density decrease
on the Cartesian product $U \times V$. Dividing the remainder $[n]^2 \backslash (U \times V)$
into three further Cartesian products $U \times ([n] \backslash V)$, $([n] \backslash U) \times V$,
$([n] \backslash U) \times ([n] \backslash V)$ and using the pigeonhole principle we obtain the
claim (after redefining $U, V$ appropriately).
Step 2 can be obtained by iterating a one-dimensional version:
Step 2a. Any set $U \subset [n]$ can be partitioned into reasonably
long arithmetic progressions $P$, plus a remainder term of
small density.
Indeed, from Step 2a, one can partition $U \times [n]$ into products
$P \times [n]$ (plus a small remainder), which can be easily repartitioned
into grids $P \times Q$ (plus small remainder). This partitions $U \times V$ into
sets $P \times (V \cap Q)$ (plus small remainder). Applying Step 2a again,
each $V \cap Q$ can be partitioned further into progressions $Q'$ (plus small
remainder), which allows us to partition each $P \times (V \cap Q)$ into grids
$P_i \times Q_j$ of non-trivial size.
Counting lemma step: By exploiting the pseudorandomness
property, one shows that if $G$ has a triple $G_{ij}, G_{jk}, G_{ki}$
of dense pseudorandom graphs between cells $V_i, V_j, V_k$ of
non-trivial size, then this triple must generate a large number
of triangles; hence, if $G$ has very few triangles, then one
cannot find such a triple of dense pseudorandom graphs.
Cleaning step: If one then removes all components of $G$
which are too sparse or insufficiently pseudorandom, one can
thus eliminate all triangles.
Pulling this argument back to the corners theorem, we see that
cells such as $V_i, V_j, V_k$ will correspond either to horizontally insensitive
sets, vertically insensitive sets, or diagonally insensitive sets.
Thus this proof of the corners theorem proceeds by partitioning $[n]^2$
in three different ways into insensitive sets in such a way that $A$ is
pseudorandom with respect to many of the cells created by any two
of these partitions, counting the corners generated by any triple of
large cells in which $A$ is pseudorandom and dense, and cleaning out
all the other cells.
It turns out that a variant of this argument can give Theorem
4.1.1; this was in fact the original approach studied by the Polymath1
project, though it was only after a detour through ergodic theory
(as well as the development of the density-increment argument discussed
above) that the triangle-removal approach could be properly
executed. In particular, an ergodic argument based on the infinitary
analogue of the triangle removal lemma (and its hypergraph generalisations)
was developed by Austin [Au2009b], which then inspired
the combinatorial version sketched here.
The analogue of the vertex cells $V_i$ are given by certain 12-insensitive
sets $E^a_{12}$, 13-insensitive sets $E^b_{13}$, and 23-insensitive sets $E^c_{23}$. Roughly
speaking, a set $A \subset [3]^n$ would be said to be pseudorandom with respect
to a cell $E^a_{12} \cap E^b_{13}$ if $A \cap E^a_{12} \cap E^b_{13}$ has no further density increment
on any smaller cell $E_{12} \cap E_{13}$, with $E_{12}$ a 12-insensitive subset
of $E^a_{12}$, and $E_{13}$ a 13-insensitive subset of $E^b_{13}$. (This is an oversimplification,
glossing over an important refinement of the concept of
pseudorandomness involving the discrepancy between global densities
in $[3]^n$ and local densities in subspaces of $[3]^n$.) There is a similar
notion of $A$ being pseudorandom with respect to a cell $E^b_{13} \cap E^c_{23}$ or
$E^c_{23} \cap E^a_{12}$.
We briefly describe the regularity lemma step. By modifying
the proof of the regularity lemma, one can obtain three partitions
$$[3]^n = E^1_{12} \cup \ldots \cup E^{M_{12}}_{12} = E^1_{13} \cup \ldots \cup E^{M_{13}}_{13} = E^1_{23} \cup \ldots \cup E^{M_{23}}_{23}$$
into 12-insensitive, 13-insensitive, and 23-insensitive components respectively,
where $M_{12}, M_{13}, M_{23}$ are not too large, and $A$ is pseudorandom
with respect to most cells $E^a_{12} \cap E^b_{13}$, $E^b_{13} \cap E^c_{23}$, and $E^c_{23} \cap E^a_{12}$.
In order for the counting step to work, one also needs an additional
"stationarity" reduction, which is difficult to state precisely,
but roughly speaking asserts that the "local" statistics of sets such as
$E^a_{12}$ on medium-dimensional subspaces are close to the corresponding
"global" statistics of such sets; this can be achieved by an additional
pigeonholing argument. We will gloss over this issue, pretending that
there is no distinction between local statistics and global statistics.
(Thus, for instance, if $E^a_{12}$ has large global density in $[3]^n$, we shall
assume that $E^a_{12}$ also has large density on most medium-sized subspaces
of $[3]^n$.)
Now for the counting lemma step. Suppose we can find $a, b, c$
such that the cells $E^a_{12}, E^b_{13}, E^c_{23}$ are large, and that $A$ intersects
$E^a_{12} \cap E^b_{13}$, $E^b_{13} \cap E^c_{23}$, and $E^c_{23} \cap E^a_{12}$ in a dense pseudorandom manner. We
claim that this will force $A$ to have a large number of combinatorial
lines $\ell$, with $\ell(1)$ in $A \cap E^a_{12} \cap E^b_{13}$, $\ell(2)$ in $A \cap E^c_{23} \cap E^a_{12}$, and $\ell(3)$
in $A \cap E^b_{13} \cap E^c_{23}$. Because of the dense pseudorandom nature of
$A$ in these cells, it turns out that it will suffice to show that there
are a lot of lines $\ell$ with $\ell(1) \in E^a_{12} \cap E^b_{13}$, $\ell(2) \in E^c_{23} \cap E^a_{12}$, and
$\ell(3) \in E^b_{13} \cap E^c_{23}$.
One way to generate a line $\ell$ is by taking the triple $x, \pi_{12}(x), \pi_{13}(x)$,
where $x \in [3]^n$ is a generic point. (Actually, as we will see below, we
would have to pass to a subspace of $[3]^n$ before using this recipe to generate
lines.) Then we need to find many $x$ obeying the constraints
$$x \in E^a_{12} \cap E^b_{13}; \quad \pi_{12}(x) \in E^c_{23} \cap E^a_{12}; \quad \pi_{13}(x) \in E^b_{13} \cap E^c_{23}.$$
Because of the various insensitivity properties, many of these conditions
are redundant, and we can simplify to
$$x \in E^a_{12} \cap E^b_{13}; \quad \pi_{12}(x) \in E^c_{23}.$$
Now note that the property $\pi_{12}(x) \in E^c_{23}$ is "123-insensitive"; it is
simultaneously 12-insensitive, 23-insensitive, and 13-insensitive. As
$E^c_{23}$ is assumed to be large, there will be large combinatorial subspaces
on which (a suitably localised version of) this property $\pi_{12}(x) \in E^c_{23}$
will always be true. Localising to this subspace (taking advantage
of the stationarity properties alluded to earlier), we are now looking
for solutions to
$$x \in E^a_{12} \cap E^b_{13}.$$
We'll pick $x$ to be of the form $\pi_{21}(y)$ for some $y$. We can then
rewrite the constraints on $y$ as
$$y \in E^a_{12}; \quad \pi_{21}(y) \in E^b_{13}.$$
The property $\pi_{21}(y) \in E^b_{13}$ is 123-invariant, and $E^b_{13}$ is large,
so by arguing as before we can pass to a large subspace where this
property is always true. The largeness of $E^a_{12}$ then gives us a large
number of solutions.
Taking contrapositives, we conclude that if $A$ in fact has no combinatorial
lines, then there do not exist any triple $E^a_{12}, E^b_{13}, E^c_{23}$ of
large cells with respect to which $A$ is dense and pseudorandom. This
forces $A$ to be confined either to very small cells, or to very sparse
subsets of cells, or to the rare cells which fail to be pseudorandom.
None of these cases can contribute much to the density of $A$, and
so $A$ itself is very sparse - contradicting the hypothesis in Theorem
4.1.1 that $A$ is dense (this is the cleaning step). This concludes the
sketch of the triangle-removal proof of this theorem.
The ergodic version of this argument in [Au2009b] works for all
values of k, so I expect the combinatorial version to do so as well.
4.1.4. The finitary Furstenberg-Katznelson argument. In [FuKa1989],
Furstenberg and Katznelson gave the first proof of Theorem 4.1.1, by
translating it into a recurrence statement about a certain type of stationary
process indexed by an infinite cube $[3]^\omega := \bigcup_{n=1}^\infty [3]^n$. This
argument was inspired by a long string of other successful proofs of
density Ramsey theorems via ergodic means, starting with the initial
paper of Furstenberg [Fu1977] giving an ergodic theory proof of
Szemerédi's theorem. The latter proof was transcribed into a finitary
language in [Ta2006b], so it was reasonable to expect that the
Furstenberg-Katznelson argument could similarly be translated into
a combinatorial framework.
Let us first briefly describe the original strategy of Furstenberg
to establish Roth's theorem, but phrased in an informal, and vaguely
combinatorial, language. The basic task is to get a non-trivial lower
bound on averages of the form
$$(4.1)\qquad \mathbf{E}_{a,r}\, f(a) f(a + r) f(a + 2r)$$
where we will be a bit vague about what $a, r$ are ranging over, and
where $f$ is some non-negative function of positive mean. It is then
natural to study more general averages of the form
$$(4.2)\qquad \mathbf{E}_{a,r}\, f(a) g(a + r) h(a + 2r).$$
Now, it turns out that certain types of functions $f, g, h$ give a negligible
contribution to expressions such as (4.2). In particular, if $f$ is
weakly mixing, which roughly means that the pair correlations
$$\mathbf{E}_a\, f(a) f(a + r)$$
are small for most $r$, then the average (4.2) is small no matter what
$g, h$ are (so long as they are bounded). This can be established
by some applications of the Cauchy-Schwarz inequality (or its close
cousin, the van der Corput lemma). As a consequence of this, all
weakly mixing components of $f$ can essentially be discarded when
considering an average such as (4.1).
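The contrast between weakly mixing and almost periodic behaviour is easy to see numerically (an illustrative experiment, not from the text; the length `N = 4096`, the random seed, and the frequency $\alpha = \sqrt{2}$ are arbitrary choices): random signs have small pair correlations for every nonzero shift, while an eigenfunction has pair correlations of modulus close to one.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096
a = np.arange(N)

def pair_corr(f, r):
    """E_a f(a) * conj(f(a + r)), with a + r taken cyclically mod N."""
    return np.mean(f * np.conj(np.roll(f, -r)))

f_mix = rng.choice([-1.0, 1.0], size=N)        # a "weakly mixing"-like function
f_ap = np.exp(2j * np.pi * np.sqrt(2) * a)     # an eigenfunction (almost periodic)

mix_corrs = [abs(pair_corr(f_mix, r)) for r in range(1, 100)]
ap_corrs = [abs(pair_corr(f_ap, r)) for r in range(1, 100)]
# the first list is uniformly small (roughly 1/sqrt(N)); the second stays near 1
```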
After getting rid of the weakly mixing components, what is left?
Being weakly mixing is like saying that almost all the shifts $f(\cdot + r)$
of $f$ are close to orthogonal to each other. At the other extreme is
that of periodicity - the shifts $f(\cdot + r)$ periodically recur to become
equal to $f$ again. There is a slightly more general notion of almost
periodicity - roughly, this means that the shifts $f(\cdot + r)$ don't have to
recur exactly to $f$ again, but they are forced to range in a precompact
set, which basically means that for every $\varepsilon > 0$, $f(\cdot + r)$ lies
within $\varepsilon$ (in some suitable norm) of some finite-dimensional space. A
good example of an almost periodic function is an eigenfunction, in
which we have $f(a + r) = \lambda_r f(a)$ for each $r$ and some quantity $\lambda_r$
independent of $a$ (e.g. one can take $f(a) = e^{2\pi i \alpha a}$ for some $\alpha \in \mathbf{R}$).
In this case, the finite-dimensional space is simply the scalar multiples
of $f$ (and one can even take $\varepsilon = 0$ in this special case).
It is easy to see that non-trivial almost periodic functions are not
weakly mixing; more generally, any function which correlates non-
trivially with an almost periodic function can also be seen to not
be weakly mixing. In the converse direction, it is also fairly easy to
show that any function which is not weakly mixing must have non-
trivial correlation with an almost periodic function. Because of this, it
turns out that one can basically decompose any function into almost
periodic and weakly mixing components. For the purposes of getting
lower bounds on (4.1), this allows us to essentially reduce matters
to the special case when $f$ is almost periodic. But then the shifts
$f(\cdot + r)$ are almost ranging in a finite-dimensional set, which allows
one to essentially assign each shift $r$ a colour from a finite range of
colours. If one then applies the van der Waerden theorem, one can
find many arithmetic progressions $a, a + r, a + 2r$ which have the same
colour, and this can be used to give a non-trivial lower bound on (4.1).
(Thus we see that the role of a compactness property such as almost
periodicity is to reduce density Ramsey theorems to colouring Ramsey
theorems.)
This type of argument can be extended to more advanced recurrence
theorems, but certain things become more complicated. For
instance, suppose one wanted to count progressions of length 4; this
amounts to lower bounding expressions such as
$$(4.3)\qquad \mathbf{E}_{a,r}\, f(a) f(a + r) f(a + 2r) f(a + 3r).$$
It turns out that $f$ being weakly mixing is no longer enough to give
a negligible contribution to expressions such as (4.3). For that, one
needs the stronger property of being weakly mixing relative to almost
periodic functions; roughly speaking, this means that for most $r$, the
expression $f(\cdot)f(\cdot + r)$ is not merely of small mean (which is what weak
mixing would mean), but that this expression furthermore does not
correlate strongly with any almost periodic function (i.e. $\mathbf{E}_a\, f(a) f(a + r) g(a)$ is small for any almost periodic $g$). Once one has this stronger
weak mixing property, then one can discard all components of $f$ which
are weakly mixing relative to almost periodic functions.
One then has to figure out what is left after all these components
are discarded. Because we strengthened the notion of weak mixing, we
have to weaken the notion of almost periodicity to compensate. The
correct notion is no longer that of almost periodicity - in which the
shifts $f(\cdot + r)$ almost take values in a finite-dimensional vector space -
but that of almost periodicity relative to almost periodic functions, in
which the shifts almost take values in a finite-dimensional module over
the algebra of almost periodic functions. A good example of such a
beast is that of a quadratic eigenfunction, in which we have $f(a + r) = \lambda_r(a) f(a)$, where $\lambda_r(a)$ is itself an ordinary eigenfunction, and thus
almost periodic in the ordinary sense; here, the relative module is
the one-dimensional module formed by almost periodic multiples of
$f$. (A typical example of a quadratic eigenfunction is $f(a) = e^{2\pi i \alpha a^2}$
for some $\alpha \in \mathbf{R}$.)
It turns out that one can relativise all of the previous arguments
to the almost periodic factor, and decompose an arbitrary $f$
into a component which is weakly mixing relative to almost periodic
functions, and another component which is almost periodic relative
to almost periodic functions. The former type of component can
be discarded. For the latter, we can once again start colouring the
shifts $f(\cdot + r)$ with a finite number of colours, but with the caveat
that the colour assigned is no longer independent of $a$, but depends
in an almost periodic fashion on $a$. Nevertheless, it is still possible
to combine the van der Waerden colouring Ramsey theorem with the
theory of recurrence for ordinary almost periodic functions to get a
lower bound on (4.3) in this case. One can then iterate this argument
to deal with arithmetic progressions of longer length, but one now
needs to consider even more intricate notions of almost periodicity,
e.g. almost periodicity relative to (almost periodic functions relative
to almost periodic functions), etc.
It turns out that these types of ideas can be adapted (with some
effort) to the density Hales-Jewett setting. It's simplest to begin
with the $k = 2$ situation rather than the $k = 3$ situation. Here, we
are trying to obtain non-trivial lower bounds for averages of the form
$$(4.4)\qquad \mathbf{E}_\ell\, f(\ell(1)) f(\ell(2))$$
where $\ell$ ranges in some fashion over combinatorial lines in $[2]^n$, and
$f$ is some non-negative function with large mean.
The analogues of weakly mixing and almost periodic in this setting
are the 12-uniform and 12-low influence functions respectively.
Roughly speaking, a function is 12-low influence if its value usually
doesn't change much if a 1 is flipped to a 2 or vice versa (e.g. the indicator
function of a 12-insensitive set is 12-low influence); conversely,
a 12-uniform function is a function $g$ such that $\mathbf{E}_\ell\, f(\ell(1)) g(\ell(2))$ is
small for all (bounded) $f$. One can show that any function can be
decomposed, more or less orthogonally, into a 12-uniform function
and a 12-low influence function, with the upshot being that one can
basically reduce the task of lower bounding (4.4) to the case when $f$
is 12-low influence. But then $f(\ell(1))$ and $f(\ell(2))$ are approximately
equal to each other, and it is straightforward to get a lower bound in
this case.
Now we turn to the $k = 3$ setting, where we are looking at lower-bounding
expressions such as
$$(4.5)\qquad \mathbf{E}_\ell\, f(\ell(1)) g(\ell(2)) h(\ell(3))$$
with $f = g = h$.
It turns out that $g$ (say) being 12-uniform is no longer enough
to give a negligible contribution to the average (4.5). Instead, one
needs the more complicated notion of $g$ being 12-uniform relative
to 23-low influence functions; this means that not only are the averages
$$\sum_{i=1}^{2^t}\sum_{j=1}^{2^t} d(V^t_i, V^t_j)\,|A \cap V^t_i|\,|B \cap V^t_j| + O(\varepsilon|V|^2).$$
This weaker lemma only lets us count "macroscopic" edge densities
$d(A, B)$, when $A, B$ are dense subsets of $V$, whereas the full
regularity lemma is stronger in that it also controls "microscopic"
edge densities $d(A, B)$ where $A, B$ are now dense subsets of the cells
$V^{M_r}_i, V^{M_r}_j$. Nevertheless this weaker lemma is easier to prove and
already illustrates many of the ideas.
4.2. Regularity via random partitions 569
Let's now prove this lemma. Fix $\varepsilon > 0$, let $M$ be chosen later,
let $G = (V, E)$ be a graph, and select $v_1, \ldots, v_M$ at random. (There
can of course be many vertices selected more than once; this will not
bother us.) Let $A_t$ and $V^t_1, \ldots, V^t_{2^t}$ be as in the above lemma. For
notational purposes it is more convenient to work with the (random)
$\sigma$-algebra $\mathcal{B}_t$ generated by the $A_1, \ldots, A_t$ (i.e. the collection of all
sets that can be formed from $A_1, \ldots, A_t$ by boolean operations); this
is an atomic $\sigma$-algebra whose atoms are precisely the (non-empty)
cells $V^t_1, \ldots, V^t_{2^t}$ in the partition. Observe that these $\sigma$-algebras are
nested: $\mathcal{B}_t \subset \mathcal{B}_{t+1}$.
We will use the trick of turning sets into functions, and view
the graph as a function $1_E : V \times V \to \mathbf{R}$. One can then form the
conditional expectation $\mathbf{E}(1_E | \mathcal{B}_t \times \mathcal{B}_t) : V \times V \to \mathbf{R}$ of this function
to the product $\sigma$-algebra $\mathcal{B}_t \times \mathcal{B}_t$, whose value on $V^t_i \times V^t_j$ is simply
the average value of $1_E$ on the product set $V^t_i \times V^t_j$. (When $i$ and
$j$ are different, this is simply the edge density $d(V^t_i, V^t_j)$.) One can
view $\mathbf{E}(1_E | \mathcal{B}_t \times \mathcal{B}_t)$ more combinatorially, as a weighted graph on $V$
such that all edges between two distinct cells $V^t_i, V^t_j$ have the same
constant weight of $d(V^t_i, V^t_j)$.
We give $V$ (and $V \times V$) the uniform probability measure, and
define the energy $e_t$ at time $t$ to be the (random) quantity
$$e_t := \|\mathbf{E}(1_E | \mathcal{B}_t \times \mathcal{B}_t)\|^2_{L^2(V \times V)} = \frac{1}{|V|^2}\sum_{v,w \in V} \mathbf{E}(1_E | \mathcal{B}_t \times \mathcal{B}_t)(v, w)^2.$$
One can interpret this as the mean square of the edge densities $d(V^t_i, V^t_j)$,
weighted by the size of the cells $V^t_i, V^t_j$. From Pythagoras' theorem
we have the identity
$$e_{t'} = e_t + \|\mathbf{E}(1_E | \mathcal{B}_{t'} \times \mathcal{B}_{t'}) - \mathbf{E}(1_E | \mathcal{B}_t \times \mathcal{B}_t)\|^2_{L^2(V \times V)}$$
for all $t' \ge t$.
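The monotonicity of the energy under refinement is easy to see experimentally. The sketch below (illustrative code; it assumes, as the section title suggests, that $A_t$ is the neighbourhood of the random vertex $v_t$, and the graph size and sample count are arbitrary) builds the cells of $\mathcal{B}_t$ as equivalence classes of adjacency patterns to $v_1, \ldots, v_t$ and checks that $e_t$ is non-decreasing in $t$, as the Pythagoras identity predicts:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 120
adj = np.triu(rng.random((n, n)) < 0.3, 1)
adj = adj | adj.T                                # a random undirected graph

def cells(vs):
    """Atoms of B_t: classes of vertices with the same adjacency to v_1..v_t."""
    atoms = {}
    for v in range(n):
        atoms.setdefault(tuple(bool(adj[v, u]) for u in vs), []).append(v)
    return list(atoms.values())

def energy(vs):
    """e_t = || E(1_E | B_t x B_t) ||^2: mean square of the cell densities."""
    cs = cells(vs)
    return sum(adj[np.ix_(Vi, Vj)].mean() ** 2 * len(Vi) * len(Vj)
               for Vi, Vj in product(cs, repeat=2)) / n ** 2

vs = [int(v) for v in rng.integers(0, n, size=7)]
energies = [energy(vs[:t]) for t in range(8)]
# refining the partition never decreases the energy
```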
$$\frac{1}{|V|^4}\sum_{v,w,v',w' \in V} f_{U^\perp}(v, w)\, f_{U^\perp}(v, w')\, f_{U^\perp}(v', w)\, f_{U^\perp}(v', w')$$
is of size $O(\frac{1}{\sqrt{\varepsilon M}})$.
Proof. The left-hand side can be rewritten as
$$\mathbf{E}\, \frac{1}{|V|^2}\sum_{v,w \in V} f_{U^\perp}(v, w)\, f_{U^\perp}(v, v_{t+2})\, f_{U^\perp}(v_{t+1}, w)\, f_{U^\perp}(v_{t+1}, v_{t+2}).$$
Observe that the function $(v, w) \mapsto f_{U^\perp}(v, v_{t+2})\, f_{U^\perp}(v_{t+1}, w)\, f_{U^\perp}(v_{t+1}, v_{t+2})$
is measurable with respect to $\mathcal{B}_{t+2} \times \mathcal{B}_{t+2}$, so we can rewrite this
expression as
$$\mathbf{E}\, \frac{1}{|V|^2}\sum_{v,w \in V} \mathbf{E}(f_{U^\perp} | \mathcal{B}_{t+2} \times \mathcal{B}_{t+2})(v, w)\, f_{U^\perp}(v, v_{t+2})\, f_{U^\perp}(v_{t+1}, w)\, f_{U^\perp}(v_{t+1}, v_{t+2}).$$
Applying Cauchy-Schwarz, one can bound this by
$$\mathbf{E}\, \|\mathbf{E}(f_{U^\perp} | \mathcal{B}_{t+2} \times \mathcal{B}_{t+2})\|_{L^2(V \times V)}.$$
But from Pythagoras we have
$$\|\mathbf{E}(f_{U^\perp} | \mathcal{B}_{t+2} \times \mathcal{B}_{t+2})\|^2_{L^2(V \times V)} = e_{t+2} - e_t$$
and so the claim follows from (4.8) and another application of Cauchy-Schwarz.
Now we can prove Lemma 4.2.4. Observe that
$$|E \cap (A \times B)| - \sum_{i=1}^{2^t}\sum_{j=1}^{2^t} d(V^t_i, V^t_j)\,|A \cap V^t_i|\,|B \cap V^t_j| = \sum_{v,w \in V} 1_A(v)\, 1_B(w)\, f_{U^\perp}(v, w).$$
Applying Cauchy-Schwarz twice in $v, w$ and using Lemma 4.2.6, we
see that the RHS is $O((\varepsilon M)^{-1/8}|V|^2)$; choosing $M \ge \varepsilon^{-9}$ we obtain the
claim.
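The conclusion of the weaker lemma can be checked empirically (an illustrative experiment; the random graph, the sets `A`, `B`, and the number of sampled vertices are arbitrary choices of this sketch): already after a few random neighbourhoods, the cell-by-cell prediction $\sum_{i,j} d(V_i, V_j)|A \cap V_i||B \cap V_j|$ approximates $|E \cap (A \times B)|$ up to a small fraction of $|V|^2$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 300
adj = np.triu(rng.random((n, n)) < 0.4, 1)
adj = adj | adj.T                                   # a random undirected graph

vs = [int(v) for v in rng.integers(0, n, size=4)]   # a few random vertices
atoms = {}
for v in range(n):                                  # cells of B_t
    atoms.setdefault(tuple(bool(adj[v, u]) for u in vs), []).append(v)
cells = list(atoms.values())

A = [v for v in range(n) if v % 2 == 0]             # two arbitrary vertex sets
B = [v for v in range(n) if v % 3 == 0]
setA, setB = set(A), set(B)
nA = [len(setA.intersection(Vi)) for Vi in cells]   # |A cap V_i|
nB = [len(setB.intersection(Vj)) for Vj in cells]   # |B cap V_j|

exact = int(adj[np.ix_(A, B)].sum())                # |E cap (A x B)|
pred = sum(adj[np.ix_(cells[i], cells[j])].mean() * nA[i] * nB[j]
           for i, j in product(range(len(cells)), repeat=2))
err_fraction = abs(pred - exact) / n ** 2
# err_fraction is small, in the spirit of the O(eps |V|^2) error term
```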
4.2.2. Strong regularity via random neighbourhoods. We now prove Lemma 4.2.3, which of course implies Lemma 4.2.2.

Fix $\epsilon > 0$ and a graph $G = (V, E)$ on $n$ vertices. We randomly select an infinite sequence $v_1, v_2, \ldots \in V$ of vertices in $V$, drawn uniformly and independently at random. We define $A_t$, $V^t_i$, $\mathcal{B}_t$, $e_t$ as before.
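The cells $V^t_i$ are the atoms of the Boolean algebra generated by the neighbourhoods of the sampled vertices; concretely, each vertex is classified by the $0$-$1$ pattern of its adjacency to $v_1, \ldots, v_t$. A minimal sketch of this construction (function names are hypothetical):

```python
import random

def neighbourhood_cells(adj, sample):
    """Partition the vertices into the atoms generated by the
    neighbourhoods A_s = {v : (v, v_s) in E} of the sampled vertices:
    vertices are grouped by their membership pattern, giving at most
    2^t cells for t sampled vertices (empty patterns are not listed)."""
    cells = {}
    for v in range(len(adj)):
        pattern = tuple(adj[v][vs] for vs in sample)
        cells.setdefault(pattern, []).append(v)
    return list(cells.values())

def random_neighbourhood_partition(adj, t, rng=random):
    # draw t vertices uniformly and independently (with replacement)
    sample = [rng.randrange(len(adj)) for _ in range(t)]
    return neighbourhood_cells(adj, sample)
```

For a 4-cycle, sampling the single vertex $v_1 = 0$ splits the vertices into its neighbourhood $\{1,3\}$ and its complement $\{0,2\}$.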
Now let $m$ be a large number depending on $\epsilon > 0$ to be chosen later, let $F: \mathbf{Z}^+ \to \mathbf{Z}^+$ be a rapidly growing function (also to be chosen later), and set $M_1 := F(1)$ and $M_r := 2(M_{r-1} + F(M_{r-1}))$ for all $1 < r \leq m+1$, thus $M_1 < M_2 < \ldots < M_{m+1}$ grows rapidly to infinity. The expected energies $\mathbf{E} e_{M_r}$ are increasing from $0$ to $1$, thus if we pick $1 \leq r \leq m$ uniformly at random, the expectation of
$$\mathbf{E} e_{M_{r+1}} - \mathbf{E} e_{M_r}$$
telescopes to be $O(1/m)$. Thus, by Markov's inequality, with probability $1 - O(\epsilon)$ we will have
$$\mathbf{E} e_{M_{r+1}} - \mathbf{E} e_{M_r} = O\Big(\frac{1}{\epsilon m}\Big).$$
Assume that $r$ is chosen to obey this. Then, by another application of the pigeonhole principle, we can find $M_{r+1}/2 \leq t < M_{r+1}$ such
that
$$\mathbf{E}(e_{t+2} - e_t) = O\Big(\frac{1}{\epsilon m M_{r+1}}\Big) = O\Big(\frac{1}{\epsilon m F(M_r)}\Big).$$
Fix this $t$. We have
$$\mathbf{E}(e_t - e_{M_r}) = O\Big(\frac{1}{\epsilon m}\Big),$$
so by Markov's inequality, with probability $1 - O(\epsilon)$, $v_1, \ldots, v_t$ are such that
$$(4.9)\qquad e_t - e_{M_r} = O\Big(\frac{1}{\epsilon^2 m}\Big)$$
and also obey the conditional expectation bound
$$(4.10)\qquad \mathbf{E}(e_{t+2} - e_t \,|\, v_1, \ldots, v_t) = O\Big(\frac{1}{\epsilon^2 m F(M_r)}\Big).$$
Assume that this is the case. We split
$$1_E = f_{U^\perp} + f_{err} + f_U$$
where
$$f_{U^\perp} := \mathbf{E}(1_E|\mathcal{B}_{M_r} \otimes \mathcal{B}_{M_r})$$
$$f_{err} := \mathbf{E}(1_E|\mathcal{B}_t \otimes \mathcal{B}_t) - \mathbf{E}(1_E|\mathcal{B}_{M_r} \otimes \mathcal{B}_{M_r})$$
$$f_U := 1_E - \mathbf{E}(1_E|\mathcal{B}_t \otimes \mathcal{B}_t).$$
We now assert that the partition $V^{M_r}_1, \ldots, V^{M_r}_{2^{M_r}}$ induced by $\mathcal{B}_{M_r}$ obeys the conclusions of Lemma 4.2.2. For this, we observe various properties of the three components of $1_E$:

Lemma 4.2.7 ($f_{U^\perp}$ locally constant). $f_{U^\perp}$ is constant on each product set $V^{M_r}_i \times V^{M_r}_j$.

Proof. This is clear from construction.
Lemma 4.2.8 ($f_{err}$ small). We have $\|f_{err}\|_{L^2(V \times V)}^2 = O\big(\frac{1}{\epsilon^2 m}\big)$.

Proof. This follows from (4.9) and Pythagoras' theorem.

Lemma 4.2.9 ($f_U$ uniform). The expression
$$\frac{1}{|V|^4} \sum_{v,w,v',w' \in V} f_U(v,w)\, f_U(v,w')\, f_U(v',w)\, f_U(v',w')$$
is of size $O\big(\frac{1}{\epsilon^2 m F(M_r)}\big)$.
Proof. This follows by repeating the proof of Lemma 4.2.6, but using (4.10) instead of (4.8).

Now we verify the regularity. First, we eliminate small atoms: the pairs $(V^{M_r}_i, V^{M_r}_j)$ for which $|V^{M_r}_i| \leq \epsilon |V|/2^{M_r}$ clearly give a net contribution of at most $O(\epsilon |V|^2)$ and are acceptable; similarly for those pairs for which $|V^{M_r}_j| \leq \epsilon |V|/2^{M_r}$. So we may henceforth assume that
$$(4.11)\qquad |V^{M_r}_i|, |V^{M_r}_j| \geq \epsilon |V|/2^{M_r}.$$
Now, let $A \subset V^{M_r}_i$, $B \subset V^{M_r}_j$ have densities
$$\alpha := |A|/|V^{M_r}_i|; \qquad \beta := |B|/|V^{M_r}_j|;$$
then
$$\alpha\beta\, d(A,B) = \frac{1}{|V^{M_r}_i|\,|V^{M_r}_j|} \sum_{v \in V^{M_r}_i} \sum_{w \in V^{M_r}_j} 1_A(v)\, 1_B(w)\, 1_E(v,w).$$
We divide $1_E$ into the three pieces $f_{U^\perp}$, $f_{err}$, $f_U$.

The contribution of $f_{U^\perp}$ is exactly $\alpha\beta\, d(V^{M_r}_i, V^{M_r}_j)$.

The contribution of $f_{err}$ can be bounded using Cauchy-Schwarz as
$$O\Big( \frac{1}{|V^{M_r}_i|\,|V^{M_r}_j|} \sum_{v \in V^{M_r}_i} \sum_{w \in V^{M_r}_j} |f_{err}(v,w)|^2 \Big)^{1/2}.$$
Using Lemma 4.2.8 and Chebyshev's inequality, we see that the pairs $(V^{M_r}_i, V^{M_r}_j)$ for which this quantity exceeds $\epsilon^3$ will contribute at most $\epsilon^{-8}/m$ to (4.6), which is acceptable if we choose $m$ so that $m \geq \epsilon^{-9}$.
Let us now discard these bad pairs.
Finally, the contribution of $f_U$ can be bounded by two applications of Cauchy-Schwarz and Lemma 4.2.9 as
$$O\Big( \frac{|V|^2}{|V^{M_r}_i|\,|V^{M_r}_j|}\, \frac{1}{(\epsilon^2 m F(M_r))^{1/8}} \Big)$$
which by (4.11) is bounded by
$$O\big(2^{2 M_r} \epsilon^{-2} / (\epsilon^2 m F(M_r))^{1/8}\big).$$
This can be made $O(\epsilon^3)$ by selecting $F$ sufficiently rapidly growing depending on $\epsilon$. Putting this all together we see that
$$\alpha\beta\, d(A,B) = \alpha\beta\, d(V^{M_r}_i, V^{M_r}_j) + O(\epsilon^3)$$
which (since $\alpha, \beta \geq \epsilon$) gives the desired regularity.
Remark 4.2.10. Of course, this argument gives tower-exponential
bounds (as F is exponential and needs to be iterated m times), which
will be familiar to any reader already acquainted with the regularity
lemma.
Remark 4.2.11. One can take the partition induced by random
neighbourhoods here and carve it up further to be both equitable
and (mostly) regular, thus recovering a proof of Lemma 1, by follow-
ing the arguments in [Ta2006c]. Of course, when one does so, one no
longer has a partition created purely from random neighbourhoods,
but it is pretty clear that one is not going to be able to make an equi-
table partition just from boolean operations applied to a few random
neighbourhoods.
Notes. This article first appeared at terrytao.wordpress.com/2009/04/26.
Thanks to Anup for corrections.
Asaf Shapira noted that in [FiMaSh2007] a similar (though not
identical) regularisation algorithm was given which explicitly regu-
larises a graph or hypergraph in linear time.
4.3. Szemerédi's regularity lemma via the correspondence principle

In the previous section, we discussed the Szemerédi regularity lemma, and how a given graph could be regularised by partitioning the vertex set into random neighbourhoods. More precisely, we gave a proof of
Lemma 4.3.1 (Regularity lemma via random neighbourhoods). Let $\epsilon > 0$. Then there exist integers $M_1, \ldots, M_m$ with the following property: whenever $G = (V, E)$ is a graph on finitely many vertices, if one selects one of the integers $M_r$ at random from $M_1, \ldots, M_m$, then selects $M_r$ vertices $v_1, \ldots, v_{M_r} \in V$ uniformly from $V$ at random, then the $2^{M_r}$ vertex cells $V^{M_r}_1, \ldots, V^{M_r}_{2^{M_r}}$ (some of which can be empty) generated by the vertex neighbourhoods $A_t := \{v \in V : (v, v_t) \in E\}$ for $1 \leq t \leq M_r$, will obey the regularity property
(4.12) with probability at least $1 - \epsilon$.

$\epsilon_n \to 0$, and $p_n \to p$ as $n \to \infty$. Then any graph limit $\hat{G} = (\mathbf{Z}, \hat{E})$ of this sequence will be an Erdős-Rényi graph $\hat{G} = G(\infty, p)$, where each edge $(i,j)$ lies in $\hat{E}$ with an independent probability of $p$.
Example 4.3.5 (Structured example). Let $G_n = (V_n, E_n)$ be a sequence of complete bipartite graphs, where the two cells of the bipartite graph have vertex density $q_n$ and $1 - q_n$ respectively, with $|V_n| \to \infty$ and $q_n \to q$. Then any graph limit $\hat{G} = (\mathbf{Z}, \hat{E})$ of this sequence will be a random complete bipartite graph, constructed as follows: first, randomly colour each vertex $n$ of $\mathbf{Z}$ red with probability $q$ and blue with probability $1 - q$, independently for each vertex. Then define $\hat{G}$ to be the complete bipartite graph between the red vertices and the blue vertices.
Example 4.3.6 (Random+structured example). Let $G_n = (V_n, E_n)$ be a sequence of incomplete bipartite graphs, where the two cells of the bipartite graph have vertex density $q_n$ and $1 - q_n$ respectively, and the graph $G_n$ is $\epsilon_n$-regular between these two cells with edge density $p_n$, with $|V_n| \to \infty$, $\epsilon_n \to 0$, $p_n \to p$, and $q_n \to q$. Then any graph limit $\hat{G} = (\mathbf{Z}, \hat{E})$ of this sequence will be a random bipartite graph, constructed as follows: first, randomly colour each vertex $n$ of $\mathbf{Z}$ red with probability $q$ and blue with probability $1 - q$, independently for each vertex. Then define $\hat{G}$ to be the bipartite graph between the red vertices and the blue vertices, with each edge between red and blue having an independent probability of $p$ of lying in $\hat{E}$.
One can use the graph correspondence principle to prove statements about finite deterministic graphs by the usual compactness and contradiction approach: argue by contradiction, create a sequence of finite deterministic graph counterexamples, use the correspondence principle to pass to an infinite random exchangeable limit, and obtain the desired contradiction in the infinitary setting. This will be how we shall approach the proof of Lemma 4.3.2.
4.3.2. The infinitary regularity lemma. To prove the finitary regularity lemma via the correspondence principle, one must first develop an infinitary counterpart. We will present this infinitary regularity lemma (first introduced in this paper) shortly, but let us motivate it by a discussion based on the three model examples of infinite exchangeable graphs $\hat{G} = (\mathbf{Z}, \hat{E})$ from the previous section.

First, consider the random graph $\hat{G}$ from Example 4.3.4. Here, we observe that the events $(i,j) \in \hat{E}$ are jointly independent of each other, thus for instance
$$\mathbf{P}\big((1,2), (2,3), (3,1) \in \hat{E}\big) = \prod_{(i,j) = (1,2), (2,3), (3,1)} \mathbf{P}\big((i,j) \in \hat{E}\big).$$
More generally, we see that the factors $\mathcal{B}_{\{i,j\}}$ for all distinct $i, j \in \mathbf{Z}$ are independent, which means that
$$\mathbf{P}(E_1 \cap \ldots \cap E_n) = \mathbf{P}(E_1) \ldots \mathbf{P}(E_n)$$
whenever $E_1 \in \mathcal{B}_{\{i_1, j_1\}}, \ldots, E_n \in \mathcal{B}_{\{i_n, j_n\}}$ and the $i_1, j_1, \ldots, i_n, j_n$ are distinct.
Next, we consider the structured graph $\hat{G}$ from Example 4.3.5, where we take $0 < q < 1$ to avoid degeneracies. In contrast to the preceding example, the events $(i,j) \in \hat{E}$ are now highly dependent; for instance, if $(1,2) \in \hat{E}$ and $(1,3) \in \hat{E}$, then this forces $(2,3)$ to lie outside of $\hat{E}$, despite the fact that the events $(i,j) \in \hat{E}$ each occur with a non-zero probability of $2q(1-q)$. In particular, the factors $\mathcal{B}_{\{1,2\}}, \mathcal{B}_{\{1,3\}}, \mathcal{B}_{\{2,3\}}$ are not jointly independent.
However, one can recover a conditional independence by introducing some new factors. Specifically, let $\mathcal{B}_i$ be the factor generated by the event that the vertex $i$ is coloured red. Then we see that the factors $\mathcal{B}_{\{1,2\}}, \mathcal{B}_{\{1,3\}}, \mathcal{B}_{\{2,3\}}$ now become conditionally jointly independent, relative to the base factor $\mathcal{B}_1 \vee \mathcal{B}_2 \vee \mathcal{B}_3$, which means that we have conditional independence identities such as
$$\mathbf{P}\big((1,2), (2,3), (3,1) \in \hat{E} \,\big|\, \mathcal{B}_1 \vee \mathcal{B}_2 \vee \mathcal{B}_3\big) = \prod_{(i,j) = (1,2), (2,3), (3,1)} \mathbf{P}\big((i,j) \in \hat{E} \,\big|\, \mathcal{B}_1 \vee \mathcal{B}_2 \vee \mathcal{B}_3\big).$$
Indeed, once one fixes (conditions) the information in $\mathcal{B}_1 \vee \mathcal{B}_2 \vee \mathcal{B}_3$ (i.e. once one knows what colour the vertices $1, 2, 3$ are), the events $(i,j) \in \hat{E}$ for $(i,j) = (1,2), (2,3), (3,1)$ either occur with probability $1$ (if $i, j$ have distinct colours) or probability $0$ (if $i, j$ have the same colour), and so the conditional independence is trivially true.
A similar phenomenon holds for the random+structured graph of Example 4.3.6.

is a $\bigvee_{i \in e} \mathcal{B}_i$-measurable event for all $e \in E$, then
$$\mathbf{P}\Big( \bigcap_{e \in E} E_e \,\Big|\, \bigvee_{i \in I} \mathcal{B}_i \Big) = \prod_{e \in E} \mathbf{P}\Big( E_e \,\Big|\, \bigvee_{i \in I} \mathcal{B}_i \Big).$$
Proof. By induction on $E$, it suffices to show that for any $e_0 \in E$, the event $E_{e_0}$ and the event $\bigwedge_{e \in E \setminus \{e_0\}} E_e$ are independent relative to $\bigvee_{i \in I} \mathcal{B}_i$.

By relabeling we may take $I = \{1, \ldots, n\}$ and $e_0 = \{1, 2\}$ for some $n \geq 2$. We use the exchangeability of $\hat{G}$ (and Hilbert's hotel) to observe that the random variables
$$\mathbf{E}\big(1_{E_{e_0}} \,\big|\, \mathcal{B}_{\{1,2,\ldots\} \setminus \{1\}} \vee \mathcal{B}_{\{1,2,\ldots\} \setminus \{2\}}\big)$$
and
$$\mathbf{E}\big(1_{E_{e_0}} \,\big|\, \mathcal{B}_{\{1,2,\ldots\} \setminus \{1\} \setminus \{3,\ldots,n\}} \vee \mathcal{B}_{\{1,2,\ldots\} \setminus \{2\} \setminus \{3,\ldots,n\}}\big)$$
have the same distribution; in particular, they have the same $L^2$ norm. By Pythagoras' theorem, they must therefore be equal almost surely; furthermore, for any intermediate $\sigma$-algebra $\mathcal{B}$ between $\mathcal{B}_{\{1,2,\ldots\} \setminus \{1\}} \vee \mathcal{B}_{\{1,2,\ldots\} \setminus \{2\}}$ and $\mathcal{B}_{\{1,2,\ldots\} \setminus \{1\} \setminus \{3,\ldots,n\}} \vee \mathcal{B}_{\{1,2,\ldots\} \setminus \{2\} \setminus \{3,\ldots,n\}}$, the quantity $\mathbf{E}(1_{E_{e_0}}|\mathcal{B})$ is also equal almost surely to the above two expressions.
(The astute reader will observe that we have just run the energy increment argument; in the infinitary world, it is somewhat slicker than in the finitary world, due to the convenience of the Hilbert's hotel trick, and the fact that the existence of orthogonal projections (and in particular, conditional expectation) is itself encoding an energy increment argument.)

As a special case of the above observation, we see that
$$\mathbf{E}\Big(1_{E_{e_0}} \,\Big|\, \bigvee_{i \in I} \mathcal{B}_i\Big) = \mathbf{E}\Big(1_{E_{e_0}} \,\Big|\, \bigvee_{i \in I} \mathcal{B}_i \vee \bigvee_{e \in E \setminus \{e_0\}} \mathcal{B}_e\Big).$$
In particular, this implies that $E_{e_0}$ is conditionally independent of every event measurable in $\bigvee_{i \in I} \mathcal{B}_i \vee \bigvee_{e \in E \setminus \{e_0\}} \mathcal{B}_e$, relative to $\bigvee_{i \in I} \mathcal{B}_i$, and the claim follows.
Remark 4.3.8. The same argument also allows one to easily regularise infinite exchangeable hypergraphs; see [Ta2007]. In fact one can go further and obtain a structural theorem for these hypergraphs generalising de Finetti's theorem, and also closely related to the graphons of Lovász and Szegedy; see [Au2008] for details.
4.3.3. Proof of finitary regularity lemma. Having proven the infinitary regularity lemma, we now use the correspondence principle and the compactness and contradiction argument to recover the finitary regularity lemma, Lemma 4.3.2.

Suppose this lemma failed. Carefully negating all the quantifiers, this means that there exists $\epsilon > 0$, a sequence $M_n$ going to infinity, and a sequence of finite deterministic graphs $G_n = (V_n, E_n)$ such that for every $1 \leq M \leq M_n$, if one selects vertices $v_1, \ldots, v_M \in V_n$ uniformly from $V_n$, then the $2^M$ vertex cells $V^M_1, \ldots, V^M_{2^M}$ generated by the vertex neighbourhoods $A_t := \{v \in V : (v, v_t) \in E\}$ for $1 \leq t \leq M$, will obey the regularity property (4.12) with probability less than $1 - \epsilon$.
We convert each of the finite deterministic graphs $G_n = (V_n, E_n)$ to an infinite random exchangeable graph $\hat{G}_n = (\mathbf{Z}, \hat{E}_n)$; invoking the correspondence principle and passing to a subsequence if necessary, we can assume that this graph converges in the vague topology to an exchangeable limit $\hat{G} = (\mathbf{Z}, \hat{E})$. Applying the infinitary regularity lemma to this graph, we see that the edge factors $\mathcal{B}_{\{i,j\}} \vee \mathcal{B}_i \vee \mathcal{B}_j$ for natural numbers $i, j$ are conditionally jointly independent relative to the vertex factors $\mathcal{B}_i$.
Now for any distinct natural numbers $i, j$, let $f(i,j)$ be the indicator of the event that $(i,j)$ lies in $\hat{E}$, thus $f = 1$ when $(i,j)$ lies in

$\mathcal{B}_{\{1,2,\ldots,M\} \setminus \{2\}}$ as $M$ increases. Thus, given any $\sigma > 0$ (to be chosen later), one can find an approximation $\tilde{f}_{U^\perp}(1,2)$ to $f_{U^\perp}(1,2)$, bounded between $0$ and $1$, which is $\mathcal{B}_{\{1,2,\ldots,M\} \setminus \{1\}} \vee \mathcal{B}_{\{1,2,\ldots,M\} \setminus \{2\}}$-measurable for some $M$, and such that
$$\mathbf{E}\,|\tilde{f}_{U^\perp}(1,2) - f_{U^\perp}(1,2)| \leq \sigma.$$
We can also impose the symmetry condition $\tilde{f}_{U^\perp}(1,2) = \tilde{f}_{U^\perp}(2,1)$.

Now let

$\mathcal{B}_{\{1,2,\ldots,M'\} \setminus \{1\}} \vee \mathcal{B}_{\{1,2,\ldots,M'\} \setminus \{2\}}$-measurable for some $M'$, and such that
$$\mathbf{E}\,|\tilde{f}_U(1,2) - f_U(1,2)| \leq \sigma'.$$
Again we can impose the symmetry condition $\tilde{f}_U(1,2) = \tilde{f}_U(2,1)$. We can then extend $\tilde{f}_U$ by exchangeability, so that
$$\mathbf{E}\,|\tilde{f}_U(i,j) - f_U(i,j)| \leq \sigma'$$
for all distinct natural numbers $i, j$. By the triangle inequality we then have
$$(4.13)\qquad \mathbf{E}\, \tilde{f}_U(1,2)\, \tilde{f}_U(3,2)\, \tilde{f}_U(1,4)\, \tilde{f}_U(3,4) = O(\sigma')$$
and by a separate application of the triangle inequality
$$(4.14)\qquad \mathbf{E}\,|f(i,j) - \tilde{f}_{U^\perp}(i,j) - \tilde{f}_U(i,j)| = O(\sigma + \sigma').$$
The bounds (4.13), (4.14) apply to the limiting infinite random graph $\hat{G} = (\mathbf{Z}, \hat{E})$. On the other hand, all the random variables appearing in (4.13), (4.14) involve at most finitely many of the edges of the graph. Thus, by vague convergence, the bounds (4.13), (4.14) also apply to the graph $\hat{G}_n = (\mathbf{Z}, \hat{E}_n)$ for sufficiently large $n$.
Now we unwind the definitions to move back to the finite graphs $G_n = (V_n, E_n)$. Observe that, when applied to the graph $\hat{G}_n$, one has
$$\tilde{f}_{U^\perp}(1,2) = F_{U^\perp,n}(v_1, v_2)$$
where $F_{U^\perp,n}: V_n \times V_n \to [0,1]$ is a symmetric function which is constant on the pairs of cells $V^M_1, \ldots, V^M_{2^M}$ generated by the vertex neighbourhoods of $v_1, \ldots, v_M$. Similarly,
$$\tilde{f}_U(1,2) = F_{U,n}(v_1, v_2)$$
for some symmetric function $F_{U,n}: V_n \times V_n \to [-1,1]$. The estimate
(4.13) can then be converted to a uniformity estimate on $F_{U,n}$:
$$\mathbf{E}\, F_{U,n}(v_1, v_2)\, F_{U,n}(v_3, v_2)\, F_{U,n}(v_1, v_4)\, F_{U,n}(v_3, v_4) = O(\sigma')$$
while the estimate (4.14) can be similarly converted to
$$\mathbf{E}\,|1_{E_n}(v_1, v_2) - F_{U^\perp,n}(v_1, v_2) - F_{U,n}(v_1, v_2)| = O(\sigma + \sigma').$$
If one then repeats the arguments in the previous section, we conclude (if $\sigma$ is sufficiently small depending on $\epsilon$, and $\sigma'$ is sufficiently small depending on $\sigma$, $\epsilon$, $M$) that for $1 - O(\epsilon)$ of the choices for $v_1, \ldots, v_M$, the partition $V^M_1, \ldots, V^M_{2^M}$ induced by the corresponding vertex neighbourhoods will obey (4.12). But this contradicts the construction of the $G_n$, and the claim follows.
Notes. This article first appeared at terrytao.wordpress.com/2009/05/08.
4.4. The two-ends reduction for the Kakeya maximal conjecture

In this article I would like to make some technical notes on a standard reduction used in the (Euclidean, maximal) Kakeya problem, known as the two ends reduction. This reduction (which takes advantage of the approximate scale-invariance of the Kakeya problem) was introduced by Wolff [Wo1995], and has since been used many times, both for the Kakeya problem and in other similar problems (e.g. in [TaWr2003] to study curved Radon-like transforms). I was asked about it recently, so I thought I would describe the trick here. As an application I give a proof of the $d = \frac{n+1}{2}$ case of the Kakeya maximal conjecture.
The Kakeya maximal function conjecture in $\mathbf{R}^n$ can be formulated as follows:

Conjecture 4.4.1 (Kakeya maximal function conjecture). If $0 < \delta < 1$, $1 \leq d \leq n$, and $T_1, \ldots, T_N$ is a collection of $\delta \times 1$ tubes oriented in a $\delta$-separated set of directions, then
$$(4.15)\qquad \Big\| \sum_{i=1}^N 1_{T_i} \Big\|_{L^{d/(d-1)}(\mathbf{R}^n)} \lesssim_\epsilon \Big(\frac{1}{\delta}\Big)^{\frac{n}{d} - 1 + \epsilon}$$
for any $\epsilon > 0$.
A standard duality argument shows that (4.15) is equivalent to the estimate
$$\sum_{i=1}^N \int_{T_i} F \lesssim_\epsilon \Big(\frac{1}{\delta}\Big)^{\frac{n}{d} - 1 + \epsilon} \|F\|_{L^d(\mathbf{R}^n)}$$
for arbitrary non-negative measurable functions $F$; breaking $F$ up into level sets via dyadic decomposition, this estimate is in turn equivalent to the estimate
$$(4.16)\qquad \sum_{i=1}^N |E \cap T_i| \lesssim_\epsilon \Big(\frac{1}{\delta}\Big)^{\frac{n}{d} - 1 + \epsilon} |E|^{1/d}$$
for arbitrary measurable sets $E$. This estimate is then equivalent to the following:
the following:
Conjecture 4.4.2 (Kakeya maximal function conjecture, second ver-
sion). If 0 < , < 1, 1 d n, T
1
, . . . , T
N
is a collection of 1
tubes oriented in a -separated set of directions, and E is a measurable
set such that [E T
i
[ [T
i
[ for all i, then
[E[
(N
n1
)
d
nd+
for all > 0.
Indeed, to deduce (4.16) from Conjecture 4.4.2 one can perform
another dyadic decomposition, this time based on the dyadic range of
4.4. The two-ends reduction 587
the densities [ET
i
[/[T
i
[. Conversely, (4.16) implies Conjecture 4.4.2
in the case N
n1
1, and the remaining case N
n1
1 can then
be deduced by the random rotations trick (see e.g. [ElObTa2009]).
We can reformulate the conjecture again slightly:

Conjecture 4.4.3 (Kakeya maximal function conjecture, third version). Let $0 < \delta, \lambda < 1$, $1 \leq d \leq n$, and let $T_1, \ldots, T_N$ be a collection of $\delta \times 1$ tubes oriented in a $\delta$-separated set of directions with $N \gtrsim \delta^{1-n}$. For each $1 \leq i \leq N$, let $E_i \subset T_i$ be a set with $|E_i| \geq \lambda |T_i|$. Then
$$\Big| \bigcup_{i=1}^N E_i \Big| \gtrsim_\epsilon \lambda^d\, \delta^{n-d+\epsilon}$$
for all $\epsilon > 0$.
We remark that (the Minkowski dimension version of) the Kakeya set conjecture essentially corresponds to the $\lambda = 1$ case of Conjecture 4.4.3, while the Hausdorff dimension version can be shown to be implied by the case where $\lambda \gtrsim \frac{1}{\log^2 1/\delta}$ (actually any lower bound here which is dyadically summable in $\delta$ would suffice). Thus, while the Kakeya set conjecture is concerned with how small one can make unions of tubes $T_i$, the Kakeya maximal function conjecture is concerned with how small one can make unions of portions $E_i$ of tubes $T_i$, where the density $\lambda$ of the tubes is fixed.
A key technical problem in the Euclidean setting (which is not present in the finite field case) is that the portions $E_i$ of $T_i$ may be concentrated in only a small portion of the tube; e.g. they could fill up a $\lambda \times \delta \times \cdots \times \delta$ subtube, rather than being dispersed uniformly throughout the tube. Because of this, the set $\bigcup_{i=1}^N E_i$ could be crammed into a far tighter space than one would ideally like. Fortunately, the two ends reduction allows one to eliminate this possibility, letting one only consider portions $E_i$ which are not concentrated on just one end of the tube or another, but occupy both ends of the tube in some sense. A more precise version of this is as follows.

Definition 4.4.4 (Two ends condition). Let $E$ be a subset of $\mathbf{R}^n$, and let $\epsilon > 0$. We say that $E$ obeys the two ends condition with exponent $\epsilon$ if one has the bound
$$|E \cap B(x,r)| \lesssim r^{\epsilon}\, |E|$$
for all balls $B(x,r)$ in $\mathbf{R}^n$ (note that the bound is only nontrivial when $r \ll 1$).

Informally, the two ends condition asserts that $E$ cannot concentrate in a small ball; it implies for instance that the diameter of $E$ is $\gtrsim 1$.
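For a finite discretisation of $E$ one can test this sort of concentration directly, by measuring the fraction of the mass captured by balls of a given radius. The rough sketch below (helper names hypothetical, thresholds arbitrary choices for the illustration) distinguishes a set spread along a tube from one crammed into a single end:

```python
def two_ends_ratio(points, centre, r):
    """Fraction of the point set lying in the ball B(centre, r)."""
    inside = sum(1 for p in points
                 if sum((a - b) ** 2 for a, b in zip(p, centre)) <= r * r)
    return inside / len(points)

def concentrated(points, r, threshold):
    """True if some ball of radius r centred at a point of the set
    captures more than `threshold` of the mass; a set obeying the
    two-ends condition with exponent eps would keep this fraction
    down to O(r**eps) for every such ball."""
    return any(two_ends_ratio(points, c, r) > threshold for c in points)
```

A set of points spread evenly along a unit segment passes the test at radius $0.3$, while the same number of points squeezed into one end fails it, which is exactly the scenario the two-ends reduction rules out.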
We now have

Proposition 4.4.5 (Two ends reduction). To prove Conjecture 4.4.3 for a fixed value of $d$ and $n$, it suffices to prove it under the assumption that the sets $E_i$ all obey the two ends condition with exponent $\epsilon$, for any fixed value of $\epsilon > 0$.

The key tool used to prove this proposition is

Lemma 4.4.6 (Every set has a large rescaled two-ends piece). Let $E \subset \mathbf{R}^n$ be a set of positive measure and diameter $O(1)$, and let $0 < \epsilon < n$. Then there exists a ball $B(x,r)$ of radius $r = O(1)$ such that
$$|E \cap B(x,r)| \gtrsim r^{\epsilon}\, |E|$$
and
$$|E \cap B(x',r')| \lesssim (r'/r)^{\epsilon}\, |E \cap B(x,r)|$$
for all other balls $B(x',r')$.
Proof. Consider the problem of maximising the quantity $|E \cap B(x,r)|/r^{\epsilon}$

$\bigcup_{i=1}^N E_i$ from below by the volume $\lambda |T_i| \sim \lambda \delta^{n-1}$ of a single tube. So we may assume that $\lambda$ is much greater than $\delta$.
Let $\epsilon > 0$ be arbitrary. We apply Lemma 4.4.6 to each $E_i$, to find a ball $B(x_i, r_i)$ such that
$$(4.17)\qquad |E_i \cap B(x_i, r_i)| \gtrsim r_i^{\epsilon}\, |E_i|$$
and
$$|E_i \cap B(x', r')| \lesssim (r'/r_i)^{\epsilon}\, |E_i \cap B(x_i, r_i)|$$
for all balls $B(x', r')$. Using (4.17), the bound $|E_i| \geq \lambda |T_i| \sim \lambda \delta^{n-1}$, as well as the trivial bound $|E_i \cap B(x_i, r_i)| \leq |T_i \cap B(x_i, r_i)| \lesssim r_i \delta^{n-1}$, we obtain the lower bound $r_i \gtrsim \lambda^{1+O(\epsilon)}$. Thus there are only about $O(\log \frac{1}{\lambda})$ possible dyadic ranges for the $r_i$; refining the collection of tubes by this factor and applying the pigeonhole principle, we may assume that there is a single $\lambda^{1+O(\epsilon)} \lesssim \rho \lesssim 1$ such that all of the $r_i$ lie in the same dyadic range $[\rho, 2\rho]$.
The intersection of $T_i$ with $B(x_i, r_i)$ is then contained in a $O(\delta) \times O(\rho)$ tube $\tilde{T}_i$, and $\tilde{E}_i := E_i \cap \tilde{T}_i$ occupies a fraction
$$|\tilde{E}_i|/|\tilde{T}_i| \gtrsim r_i^{\epsilon}\, |E_i|/|\tilde{T}_i| \gtrsim \lambda^{1+O(\epsilon)}/\rho$$
of $\tilde{T}_i$. If we then rescale each of the $\tilde{E}_i$ and $\tilde{T}_i$ by $O(1/\rho)$, we can locate subsets $E'_i$ of $O(\delta/\rho) \times 1$-tubes $T'_i$ of density $\gtrsim \lambda^{1+O(\epsilon)}/\rho$. These tubes $T'_i$ have cardinality $\gtrsim \delta^{1-n+O(\epsilon)}$ (the loss here is due to the use of the pigeonhole principle earlier) and occupy a $\delta$-separated set of directions, but after refining these tubes a bit we may assume that they instead occupy a $\delta/\rho$-separated set of directions, at the expense of cutting the cardinality down to $\gtrsim \lambda^{O(\epsilon)} (\delta/\rho)^{1-n}$ or so. Furthermore, by construction the $E'_i$ obey the two-ends condition at exponent $\epsilon$. Applying the hypothesis that Conjecture 4.4.3 holds for such sets, we conclude that
$$\Big| \bigcup_i E'_i \Big| \gtrsim \lambda^{O(\epsilon)}\, [\lambda/\rho]^d\, [\delta/\rho]^{n-d},$$
which on undoing the rescaling by $1/\rho$ gives
$$\Big| \bigcup_i E_i \Big| \gtrsim \lambda^{O(\epsilon)}\, \lambda^d\, \delta^{n-d}.$$
Since $\epsilon > 0$ was arbitrary, the claim follows.
To give an idea of how this two-ends reduction is used, we give a quick application of it:

Proposition 4.4.7. The Kakeya maximal function conjecture is true for $d \leq \frac{n+1}{2}$.

Proof. We use the bush argument of Bourgain. By the above reductions, it suffices to establish the bound
$$\Big| \bigcup_{i=1}^N E_i \Big| \gtrsim \lambda^{\frac{n+1}{2}}\, \delta^{\frac{n-1}{2}}$$
whenever $N \sim \delta^{1-n}$, and $E_i \subset T_i$ are subsets of $\delta \times 1$ tubes $T_i$ in $\delta$-separated directions with density $\lambda$ and obeying the two-ends condition with exponent $\epsilon$.
Let $\mu$ be the maximum multiplicity of the $E_i$, i.e. $\mu := \|\sum_{i=1}^N 1_{E_i}\|_{L^\infty(\mathbf{R}^n)}$. On the one hand, we clearly have
$$\Big| \bigcup_{i=1}^N E_i \Big| \geq \frac{1}{\mu} \Big\| \sum_{i=1}^N 1_{E_i} \Big\|_{L^1(\mathbf{R}^n)} \gtrsim \frac{1}{\mu} N \lambda \delta^{n-1} \sim \frac{\lambda}{\mu}.$$
This bound is good when $\mu$ is small. What if $\mu$ is large? Then there exists a point $x_0$ which is contained in $\mu$ of the $E_i$, and hence also contained in (at least) $\mu$ of the tubes $T_i$. These tubes form a bush centred at $x_0$, but the portions of the tubes near the centre $x_0$ of the bush have high overlap. However, the two-ends condition can be used to finesse this issue. Indeed, that condition ensures that for each $E_i$ involved in this bush, we have
$$|E_i \cap B(x_0, r)| \leq \frac{1}{2}\, |E_i|$$
for some $r \sim 1$, and thus
$$|E_i \setminus B(x_0, r)| \geq \frac{1}{2}\, |E_i| \gtrsim \lambda \delta^{n-1}.$$
The $\delta$-separated nature of the tubes $T_i$ implies that the maximum overlap of the portions $T_i \setminus B(x_0, r)$ of the tubes in the bush away from the origin is $O(1)$, and so
$$\Big| \bigcup_i E_i \setminus B(x_0, r) \Big| \gtrsim \mu \lambda \delta^{n-1}.$$
Thus we have two different lower bounds for $|\bigcup_i E_i|$, namely $\lambda/\mu$ and $\mu \lambda \delta^{n-1}$. Taking the geometric mean of these bounds to eliminate the unknown multiplicity $\mu$, we obtain
$$\Big| \bigcup_i E_i \Big| \gtrsim \lambda\, \delta^{(n-1)/2},$$
which certainly implies the desired bound since $\lambda \leq 1$.
Remark 4.4.8. Note that the two-ends condition actually gave a better bound than what was needed for the Kakeya conjecture, in that the power of $\lambda$ was more favourable than necessary. However, this gain disappears under the rescaling argument used in the proof of Proposition 4.4.5. Nevertheless, this does illustrate one of the advantages of employing the two-ends reduction; the bounds one gets upon doing so tend to be better (especially for small values of $\lambda$) than what one would have had without it, and so getting the right bound tends to be a bit easier in such cases. Note though that for the Kakeya set problem, where $\lambda$ is essentially $1$, the two-ends reduction is basically redundant.
Remark 4.4.9. One technical drawback to using the two-ends reduction is that if at some later stage one needs to refine the sets $E_i$ to smaller sets, then one may lose the two-ends property. However, one could invoke the arguments used in Proposition 4.4.5 to recover this property again by refining $E_i$ further. One may then lose some other property by this further refinement, but one convenient trick that allows one to take advantage of multiple refinements simultaneously is to iteratively refine the various sets involved and use the pigeonhole principle to find some place along this iteration where all relevant statistics of the system (e.g. the width $r$ of the $E_i$) stabilise (here
one needs some sort of monotonicity property to obtain this stabilisation). This type of trick was introduced in [Wo1998] and has been used in several subsequent papers, for instance in [LaTa2001].
Notes. This article first appeared at terrytao.wordpress.com/2009/05/15. Thanks to Arie Israel, Josh Zahl, Shuanglin Shao and an anonymous commenter for corrections.
4.5. The least quadratic nonresidue, and the
square root barrier
A large portion of analytic number theory is concerned with the dis-
tribution of number-theoretic sets such as the primes, or quadratic
residues in a certain modulus. At a local level (e.g. on a short in-
terval [x, x + y]), the behaviour of these sets may be quite irregular.
However, in many cases one can understand the global behaviour of
such sets on very large intervals, (e.g. [1, x]), with reasonable accuracy
(particularly if one assumes powerful additional conjectures, such as
the Riemann hypothesis and its generalisations). For instance, in the
case of the primes, we have the prime number theorem, which asserts
that the number of primes in a large interval [1, x] is asymptotically
equal to x/ log x; in the case of quadratic residues modulo a prime
p, it is clear that there are exactly $(p-1)/2$ such residues in $[1, p]$.
With elementary arguments, one can also count statistics such as the
number of pairs of consecutive quadratic residues; and with the aid
of deeper tools such as the Weil sum estimates, one can count more
complex patterns in these residues also (e.g. k-point correlations).
One is often interested in converting this sort of global infor-
mation on long intervals into local information on short intervals.
If one is interested in the behaviour on a generic or average short
interval, then the question is still essentially a global one, basically
because one can view a long interval as an average of a long sequence
of short intervals. (This does not mean that the problem is automati-
cally easy, because not every global statistic about, say, the primes is
understood. For instance, we do not know how to rigorously establish
the conjectured asymptotic for the number of twin primes n, n + 2
in a long interval [1, N], and so we do not fully understand the local
distribution of the primes in a typical short interval [n, n + 2].)
However, suppose that instead of understanding the average-case behaviour of short intervals, one wants to control the worst-case behaviour of such intervals (i.e. to establish bounds that hold for all short intervals, rather than most short intervals). Then it becomes substantially harder to convert global information to local information. In many cases one encounters a square root barrier, in which global information at scale $x$ (e.g. statistics on $[1, x]$) cannot be used to say anything non-trivial about a fixed (and possibly worst-case) short interval at scales $x^{1/2}$ or below. (Here we ignore factors of $\log x$ for simplicity.) The basic reason for this is that even randomly distributed sets in $[1, x]$ (which are basically the most uniform type of global distribution one could hope for) exhibit random fluctuations of size $x^{1/2}$ or so in their global statistics (as can be seen for instance from the central limit theorem). Because of this, one could take a random (or pseudorandom) subset of $[1, x]$ and delete all the elements in a short interval of length $o(x^{1/2})$, without anything suspicious showing up on the global statistics level; the edited set still has essentially the same global statistics as the original set. On the other hand, the worst-case behaviour of this set on a short interval has been drastically altered.

One stark example of this arises when trying to control the largest gap between consecutive prime numbers in a large interval $[x, 2x]$. There are convincing heuristics that suggest that this largest gap is of size $O(\log^2 x)$ (Cramér's conjecture). But even assuming the Riemann hypothesis, the best upper bound on this gap is only of size $O(x^{1/2} \log x)$, basically because of this square root barrier.
On the other hand, in some cases one can use additional tricks to get past the square root barrier. The key point is that many number-theoretic sequences have special structure that distinguishes them from exactly random sets. For instance, quadratic residues have the basic but fundamental property that the product of two quadratic residues is again a quadratic residue. One can use this sort of structure to amplify bad behaviour in a single short interval into bad behaviour across many short intervals (cf. Section 1.9 of Structure and Randomness). Because of this amplification, one can sometimes get new worst-case bounds by tapping the average-case bounds.

In this article I would like to indicate a classical example of this type of amplification trick, namely Burgess's bound on short character sums. To narrow the discussion, I would like to focus primarily on the following classical problem:
Problem 4.5.1. What are the best bounds one can place on the first quadratic non-residue $n_p$ in the interval $[1, p-1]$ for a large prime $p$? (The first quadratic residue is, of course, $1$; the more interesting problem is the first quadratic non-residue.)

Probabilistic heuristics (presuming that each non-square integer has a 50-50 chance of being a quadratic residue) suggest that $n_p$ should have size $O(\log p)$, and indeed Vinogradov conjectured that $n_p = O_\epsilon(p^\epsilon)$ for every $\epsilon > 0$; a classical argument of Vinogradov gives the unconditional bound $n_p = O(p^{\frac{1}{2\sqrt{e}}} \log^2 p)$. Inserting Burgess's amplification trick one can boost this to $n_p = O_\epsilon(p^{\frac{1}{4\sqrt{e}}+\epsilon})$ for any $\epsilon > 0$. Apart from refinements to the $\epsilon$ factor, this bound has stood for five decades as the world record for this problem, which is a testament to the difficulty in breaching the square root barrier.

Note: in order not to obscure the presentation with technical details, I will be using asymptotic notation $O()$ in a somewhat informal manner.
4.5.1. Character sums. To approach the problem, we begin by fixing the large prime $p$ and introducing the Legendre symbol $\chi(n) = \big(\frac{n}{p}\big)$, defined to equal $0$ when $n$ is divisible by $p$, $+1$ when $n$ is an invertible quadratic residue modulo $p$, and $-1$ when $n$ is an invertible quadratic non-residue modulo $p$. Thus, for instance, $\chi(n) = +1$ for all $1 \leq n < n_p$. One of the main reasons one wants to work with the function $\chi$ is that it enjoys two easily verified properties:

$\chi$ is periodic with period $p$.

One has the total multiplicativity property $\chi(nm) = \chi(n)\chi(m)$ for all integers $n, m$.

In the jargon of number theory, $\chi$ is a Dirichlet character with conductor $p$. Another important property of this character is of course the law of quadratic reciprocity, but this law is more important for the average-case behaviour in $p$, whereas we are concerned here with the worst-case behaviour in $p$, and so we will not actually use this law here.
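Both properties are easy to check numerically. The Legendre symbol can be computed via Euler's criterion, $\chi(n) \equiv n^{(p-1)/2} \pmod{p}$ (a standard fact not stated above); the sketch below (hypothetical function names) also locates the first non-residue $n_p$ by direct search:

```python
def legendre(n, p):
    """Legendre symbol (n/p) for an odd prime p, via Euler's
    criterion: n^((p-1)/2) mod p equals 1 for invertible quadratic
    residues, p - 1 (i.e. -1) for non-residues, and 0 when p | n."""
    r = pow(n, (p - 1) // 2, p)
    return r - p if r == p - 1 else r  # normalise to -1, 0, +1

def first_nonresidue(p):
    """Least n >= 1 with (n/p) = -1."""
    n = 2
    while legendre(n, p) != -1:
        n += 1
    return n
```

For example, for $p = 97$ one finds $n_p = 5$, and both periodicity and total multiplicativity can be confirmed by brute force over a range of $n, m$.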
An obvious way to control $n_p$ is via the character sum
$$(4.18)\qquad \sum_{1 \leq n \leq x} \chi(n).$$
From the triangle inequality, we see that this sum has magnitude at most $x$. If we can then obtain a non-trivial bound of the form
$$(4.19)\qquad \sum_{1 \leq n \leq x} \chi(n) = o(x)$$
for some $x$, this forces the existence of a quadratic non-residue less than or equal to $x$, thus $n_p \leq x$. So one approach to the problem is to bound the character sum (4.18).
As there are just as many residues as non-residues, the sum (4.18) is periodic with period $p$ and we obtain a trivial bound of $p$ for the magnitude of the sum. One can achieve a non-trivial bound by Fourier analysis. One can expand
$$\chi(n) = \sum_{a=0}^{p-1} \hat{\chi}(a)\, e^{2\pi i a n / p}$$
where $\hat{\chi}(a)$ are the Fourier coefficients of $\chi$:
$$\hat{\chi}(a) := \frac{1}{p} \sum_{n=0}^{p-1} \chi(n)\, e^{-2\pi i a n / p}.$$
As there are just as many quadratic residues as non-residues, $\hat{\chi}(0) = 0$, so we may drop the $a = 0$ term. From summing the geometric series we see that
$$(4.20)\qquad \sum_{1 \leq n \leq x} e^{2\pi i a n / p} = O(1/\|a/p\|),$$
where $\|a/p\|$ is the distance from $a/p$ to the nearest integer ($0$ or $1$); inserting these bounds into (4.18) and summing what is essentially a harmonic series in $a$ we obtain
$$\sum_{1 \leq n \leq x} \chi(n) = O\big(p \log p \sup_{a \neq 0} |\hat{\chi}(a)|\big).$$
Now, how big is $\hat\chi(a)$? Taking absolute values, we get a bound of
1, but this gives us something worse than the trivial bound. To do
better, we use the Plancherel identity
$$\sum_{a=0}^{p-1} |\hat\chi(a)|^2 = \frac{1}{p} \sum_{n=0}^{p-1} |\chi(n)|^2,$$
which tells us that
$$\sum_{a=0}^{p-1} |\hat\chi(a)|^2 = O(1).$$
This tells us that $\hat\chi$ is small on the average, but does not immediately
tell us anything new about the worst-case behaviour of $\hat\chi$, which is
what we need here. But now we use the multiplicative structure of $\chi$
to relate average-case and worst-case behaviour. Note that if b is co-
prime to p, then $\chi(bn)$ is a scalar multiple of $\chi(n)$ by a quantity $\chi(b)$ of
magnitude 1; taking Fourier transforms, this implies that $\hat\chi(a/b)$ and
$\hat\chi(a)$ also differ by this factor. In particular, $|\hat\chi(a/b)| = |\hat\chi(a)|$. As b
was arbitrary, we thus see that $|\hat\chi(a)|$ is constant for all a coprime to p;
in other words, the worst case is the same as the average case. Com-
bining this with the Plancherel bound one obtains $|\hat\chi(a)| = O(1/\sqrt{p})$,
leading to the Polya-Vinogradov inequality
$$\sum_{1 \le n \le x} \chi(n) = O(\sqrt{p} \log p).$$
(In fact, a more careful computation reveals the slightly sharper
bound $|\sum_{1 \le n \le x} \chi(n)| \le \sqrt{p} \log p$; this is non-trivial for $x >
\sqrt{p} \log p$.)
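The sharp form of the Polya-Vinogradov inequality is easy to test numerically. The following sketch (my own illustration, not from the text; the prime is chosen arbitrarily) computes $\chi$ via Euler's criterion and compares the worst partial sum against $\sqrt{p}\log p$:

```python
import math

def legendre(n, p):
    """Legendre symbol chi(n) = (n|p), computed by Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r  # map p-1 to -1

def max_partial_sum(p):
    """max over 1 <= x < p of |sum_{1 <= n <= x} chi(n)|."""
    s, worst = 0, 0
    for n in range(1, p):
        s += legendre(n, p)
        worst = max(worst, abs(s))
    return worst

p = 10007  # an arbitrary five-digit prime
worst = max_partial_sum(p)
print(p, worst, round(math.sqrt(p) * math.log(p), 1))
# the worst-case partial sum stays below sqrt(p) log p
assert worst <= math.sqrt(p) * math.log(p)
```

In practice the observed maximum is much closer to $\sqrt{p}$ than to $\sqrt{p}\log p$, consistent with the random-sign heuristic of the following remark.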
Remark 4.5.2. Up to logarithmic factors, this is consistent with
what one would expect if $\chi$ fluctuated like a random sign pattern
(at least for x comparable to p; for smaller values of x, one expects
instead a bound of the order of $\sqrt{x}$, again up to logarithmic factors).

Vinogradov observed that one can amplify the bound (4.19) using
the multiplicativity of $\chi$. Since $\chi$ is totally multiplicative, every
$1 \le n \le x$ with $\chi(n) = -1$ must have at least one prime factor q with
$\chi(q) = -1$, and hence $q \ge n_p$; each such n contributes $-1$ rather than
$+1$ to the sum (4.18), a loss of at most 2, and so
$$\sum_{1 \le n \le x} \chi(n) \ge \sum_{1 \le n \le x} 1 - \sum_{n_p \le q \le x} \sum_{1 \le n \le x: q | n} 2,$$
where q ranges over primes.
Clearly, $\sum_{1 \le n \le x} 1 = x + O(1)$ and $\sum_{1 \le n \le x: q|n} 2 = 2\frac{x}{q} + O(1)$. The
total number of primes less than x is $O(\frac{x}{\log x}) = o(x)$ by the prime
number theorem, thus
$$\sum_{1 \le n \le x} \chi(n) \ge x - \sum_{n_p \le q \le x} \frac{2x}{q} + o(x).$$
Using the classical asymptotic $\sum_{q \le y} \frac{1}{q} = \log\log y + C + o(1)$ (with q
ranging over primes) for some
absolute constant C (which basically follows from the prime number
theorem, but also has an elementary proof), we conclude that
$$\sum_{1 \le n \le x} \chi(n) \ge x\Bigl[1 - 2\log\frac{\log x}{\log n_p} + o(1)\Bigr].$$
If $n_p \ge x^{\frac{1}{\sqrt{e}}+\varepsilon}$ for some fixed $\varepsilon > 0$, then the expression in brackets is
bounded away from zero for x large; in particular, this is incompatible
with (4.19) for x large enough. As a consequence, we see that if
we have a bound of the form (4.19), then we can conclude $n_p =
O_\varepsilon(x^{\frac{1}{\sqrt{e}}+\varepsilon})$ for all $\varepsilon > 0$; in particular, from the Polya-Vinogradov
inequality one has
$$n_p = O_\varepsilon\bigl(p^{\frac{1}{2\sqrt{e}}+\varepsilon}\bigr)$$
for all $\varepsilon > 0$, or equivalently that $n_p \le p^{\frac{1}{2\sqrt{e}}+o(1)}$. (By being a bit
more careful, one can refine this to $n_p = O(p^{\frac{1}{2\sqrt{e}}} \log^{2/\sqrt{e}} p)$.)
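For comparison, one can compute $n_p$ directly for small primes; this quick brute-force sketch (my own illustration, with the primes picked by hand) confirms that the least non-residue is tiny in practice, far below even the exponents discussed here:

```python
def least_nonresidue(p):
    """Least quadratic non-residue n_p of an odd prime p, by brute force."""
    residues = {pow(n, 2, p) for n in range(1, p)}
    return next(n for n in range(2, p) if n not in residues)

# 23 and 71 are the smallest primes whose least non-residue is 5 and 7
assert least_nonresidue(23) == 5
assert least_nonresidue(71) == 7

for p in [10007, 10009, 10037]:  # a few arbitrary five-digit primes
    n_p = least_nonresidue(p)
    print(p, n_p)
    assert n_p < p ** 0.25  # far below the square root barrier in practice
```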
Remark 4.5.3. The estimates on the Gauss-type sums $\hat\chi(a) :=
\frac{1}{p} \sum_{n=0}^{p-1} \chi(n) e^{-2\pi i a n/p}$ are sharp; nevertheless, they fail to penetrate
the square root barrier in the sense that no non-trivial estimates are
provided below the scale $\sqrt{p}$. One can also see this barrier using the
Poisson summation formula (Exercise 1.12.41), which basically gives
a formula that (very roughly) takes the form
$$\sum_{n = O(x)} \chi(n) \approx \frac{x}{\sqrt{p}} \sum_{n = O(p/x)} \chi(n)$$
for any $1 < x < p$, and is basically the limit of what one can say
about character sums using Fourier analysis alone. In particular, we
see that the Polya-Vinogradov bound is basically the Poisson dual
of the trivial bound. The scale $x = \sqrt{p}$ is the crossing point where
Poisson summation does not achieve any non-trivial modification of
the scale parameter.
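The sharpness claim here is quite concrete: for the Legendre symbol, the classical Gauss sum evaluation gives $|\hat\chi(a)| = p^{-1/2}$ exactly for every a coprime to p, which is easy to confirm numerically (a quick sketch of my own, prime chosen arbitrarily):

```python
import cmath
import math

p = 101  # an arbitrary prime

def chi(n):
    """Legendre symbol mod p via Euler's criterion."""
    r = pow(n % p, (p - 1) // 2, p)
    return r - p if r == p - 1 else r

def chi_hat(a):
    """Fourier coefficient (1/p) sum_n chi(n) e^{-2 pi i a n / p}."""
    return sum(chi(n) * cmath.exp(-2j * cmath.pi * a * n / p)
               for n in range(p)) / p

assert abs(chi_hat(0)) < 1e-12  # as many residues as non-residues
for a in [1, 2, 50, 100]:
    # |chi_hat(a)| = 1/sqrt(p) exactly, for every a coprime to p
    assert abs(abs(chi_hat(a)) - 1 / math.sqrt(p)) < 1e-9
print(abs(chi_hat(1)), 1 / math.sqrt(p))
```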
4.5.2. Average-case bounds. The Polya-Vinogradov bound estab-
lishes a non-trivial estimate on (4.18) for x significantly larger than
$\sqrt{p} \log p$. One can also consider the shifted character sums
(4.21) $\sum_{a \le n \le a+y} \chi(n)$,
where a is a parameter. The analogue of (4.19) for such intervals
would be
(4.22) $\sum_{a \le n \le a+y} \chi(n) = o(y)$.
For y very small (e.g. $y = p^{o(1)}$), no bound of this form is known
for every single a; however, one can establish such a bound for almost
all a, using the moment estimates
(4.23) $\frac{1}{p} \sum_{a=0}^{p-1} \Bigl| \sum_{a \le n \le a+y} \chi(n) \Bigr|^k = O_k\bigl(y^{k/2} + y^k p^{-1/2}\bigr)$
for any positive even integer $k = 2, 4, \dots$. If y is not too tiny, say
$y \ge p^{\delta}$ for some fixed $\delta > 0$, then by choosing k sufficiently large
depending on $\delta$, the moment bound (4.23) together with Chebyshev's
inequality shows that $|\sum_{a \le n \le a+y} \chi(n)| \le \varepsilon y$ for all but at most
$O_{\varepsilon,\delta}(\sqrt{p})$ values of $a \in \{0, \dots, p-1\}$.

We now establish (4.23) in the simplest case $k = 2$, in which it
suffices to show that
$$\frac{1}{p} \sum_{a=0}^{p-1} \sum_{a \le n, m \le a+y} \chi(n)\chi(m) = O(y) + O(y^2 p^{-1/2}).$$
We can write $\chi(n)\chi(m)$ as $\chi(nm)$. Writing $m = n + h$, and using the
periodicity of $\chi$, we can rewrite the left-hand side as
$$\sum_{h=-y}^{y} (y - |h|) \Bigl[\frac{1}{p} \sum_{n \in F_p} \chi(n(n+h))\Bigr],$$
where we have abused notation and identified the finite field $F_p$ with
$\{0, 1, \dots, p-1\}$.
For $h = 0$, the inner average is $O(1)$. For h non-zero, we claim
the bound
(4.24) $\sum_{n \in F_p} \chi(n(n+h)) = O(\sqrt{p})$,
which is consistent with (and is in fact slightly stronger than) what
one would get if $\chi$ was a random sign pattern; assuming this bound
gives (4.23) for $k = 2$ as required.

The bound (4.24) can be established by quite elementary means
(as it comes down to counting points on the hyperbola $y^2 = x(x+h)$,
which can be done by transforming the hyperbola to be rectangular),
but for larger values of k we will need the more general estimate
(4.25) $\sum_{n \in F_p} \chi(P(n)) = O_k(\sqrt{p})$
whenever P is a polynomial over $F_p$ of degree k which is not a constant
multiple of a perfect square; this can be easily seen to give (4.23) for
general k.
An equivalent form of (4.25) is that the hyperelliptic curve
(4.26) $\{(x, y) \in F_p \times F_p : y^2 = P(x)\}$
contains $p + O_k(\sqrt{p})$ points. (To see the equivalence, note that for
each $x \in F_p$ the number of y with $y^2 = P(x)$ is $1 + \chi(P(x))$ when $P(x)$
is invertible.) It thus suffices to show that $P(x)$ is an invertible
quadratic residue for at most $p/2 + O_k(\sqrt{p})$ values of $x \in F_p$. Multiplying
P by a quadratic non-residue and running the same
argument, we also see that P(x) is an invertible quadratic non-residue
for at most $p/2 + O_k(\sqrt{p})$ values of $x \in F_p$, and (4.25) (or the asymptotic
for the number of points in (4.26)) follows.
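The point-count form (4.26) is easy to verify experimentally; the following brute-force sketch (my own illustration, with P and the primes picked arbitrarily) counts solutions of $y^2 = P(x)$ and checks that the deviation from p is of size $O(\sqrt{p})$:

```python
def curve_points(p, coeffs):
    """Number of (x, y) in F_p x F_p with y^2 = P(x); coeffs = [c0, c1, ...]."""
    # precompute the number of square roots of each t mod p
    sqrt_count = [0] * p
    for y in range(p):
        sqrt_count[y * y % p] += 1
    total = 0
    for x in range(p):
        t = sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p
        total += sqrt_count[t]
    return total

# P(x) = x^3 + x + 1 (squarefree, so not a constant multiple of a square)
for p in [101, 1009, 10007]:
    N = curve_points(p, [1, 1, 0, 1])
    print(p, N, abs(N - p))
    assert abs(N - p) <= 2 * p ** 0.5  # Hasse bound for a nonsingular cubic
```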
The bound $p/2 + O_k(\sqrt{p})$ in turn follows from a lemma of Stepanov
type: for a parameter l comparable to $\sqrt{p}$ (and with P normalised,
after translation, so that $P(0) \ne 0$), there exists a non-zero polynomial
Q over $F_p$ of degree at most $\frac{lp}{2} + O_k(p)$ which vanishes to order at
least l at every $x \in F_p$ for which $P(x)$ is an invertible quadratic
residue. Since a non-zero polynomial of degree D can vanish to order
l at no more than $D/l$ points, the number of such x is then at most
$p/2 + O_k(p/l) = p/2 + O_k(\sqrt{p})$, as required.
We now prove the lemma. The polynomial Q will be chosen to
be of the form
$$Q(x) = P^l(x)\bigl(R(x, x^p) + P^{\frac{p-1}{2}}(x)\, S(x, x^p)\bigr)$$
where $R(x, z), S(x, z)$ are polynomials of degree at most $\frac{p-k-1}{2}$ in x,
and degree at most $\frac{l}{2} + C$ in z, where C is a large constant (depending
on k) to be chosen later (these parameters have been optimised for
the argument that follows). Since P has degree at most k, such a Q
will have degree at most
$$kl + \frac{p-k-1}{2} + \frac{p-1}{2}k + p\Bigl(\frac{l}{2} + C\Bigr) = \frac{lp}{2} + O_k(p)$$
as required. We claim (for a suitable choice of C) that

(a) The degrees are small enough that Q(x) is a non-zero poly-
nomial whenever $R(x, z), S(x, z)$ are non-zero polynomials;
and

(b) The degrees are large enough that there exists a non-trivial
choice of $R(x, z)$ and $S(x, z)$ such that Q(x) vanishes to order at
least l whenever $x \in F_p$ is such that P(x) is a quadratic
residue.

Claims (a) and (b) together establish the lemma.
We first verify (a). We can cancel off the initial $P^l$ factor, so that
we need to show that $R(x, x^p) + P^{\frac{p-1}{2}}(x) S(x, x^p)$ does not vanish when
at least one of $R(x, z), S(x, z)$ is not vanishing. We may assume that
R, S are not both divisible by z, since we could cancel out a common
factor of $x^p$ otherwise.

Suppose for contradiction that the polynomial $R(x, x^p) + P^{\frac{p-1}{2}}(x) S(x, x^p)$
vanished, which implies that $R(x, 0) = -P^{\frac{p-1}{2}}(x) S(x, 0)$ modulo $x^p$.
Squaring and multiplying by P, we see that
$$R(x, 0)^2 P(x) = P(x)^p S(x, 0)^2 \mod x^p.$$
But over $F_p$ and modulo $x^p$, $P(x)^p = P(0)$ by Fermat's little theo-
rem. Observe that $R(x, 0)^2 P(x)$ and $P(0) S(x, 0)^2$ both have degree
at most $p - 1$, and so we can remove the $x^p$ modulus and conclude
that $R(x, 0)^2 P(x) = P(0) S(x, 0)^2$ over $F_p$. But this implies (by the
fundamental theorem of arithmetic for $F_p[X]$) that P is a constant
multiple of a square, a contradiction. (Recall that P(0) is non-zero,
and that $R(x, 0)$ and $S(x, 0)$ are not both zero.)
Now we prove (b). Let $x \in F_p$ be such that P(x) is a quadratic
residue, thus $P(x)^{\frac{p-1}{2}} = +1$ by Fermat's little theorem. To get van-
ishing to order l, we need
(4.27) $\frac{d^j}{dx^j}\Bigl[P^l(x)\bigl(R(x, x^p) + P^{\frac{p-1}{2}}(x) S(x, x^p)\bigr)\Bigr] = 0$
for all $0 \le j < l$. (Of course, we cannot define derivatives using lim-
its and Newton quotients in this finite characteristic setting, but we
can still define derivatives of polynomials formally, thus for instance
$\frac{d}{dx} x^n := n x^{n-1}$, and enjoy all the usual rules of calculus, such as the
product rule and chain rule.)

Over $F_p$, the polynomial $x^p$ has derivative zero. If we then com-
pute the derivative in (4.27) using many applications of the product
and chain rule, we see that the left-hand side of (4.27) can be ex-
pressed in the form
$$P^{l-j}(x)\bigl[R_j(x, x^p) + P^{\frac{p-1}{2}}(x) S_j(x, x^p)\bigr]$$
where $R_j(x, z), S_j(x, z)$ are polynomials of degree at most $\frac{p-k-1}{2} +
O_k(j)$ in x and at most $\frac{l}{2} + C$ in z, whose coefficients depend in some
linear fashion on the coefficients of R and S. (The exact nature of this
linear relationship will depend on k, p, P, but this will not concern us.)
Since we only need to evaluate this expression when $P(x)^{\frac{p-1}{2}} = +1$
and $x^p = x$ (by Fermat's little theorem), we thus see that we can
verify (4.27) provided that the polynomial
$$P^{l-j}(x)\bigl[R_j(x, x) + S_j(x, x)\bigr]$$
vanishes identically. This is a polynomial of degree at most
$$O_k(l - j) + \frac{p-k-1}{2} + O_k(j) + \frac{l}{2} + C = \frac{p}{2} + O_k(p^{1/2}) + C,$$
and there are at most $l + 1$ possible values of j, so this leads to
$$\frac{lp}{2} + O_k(p) + O(C\sqrt{p})$$
linear constraints on the coefficients of R and S to satisfy. On the
other hand, the total number of these coefficients is
$$2 \times \Bigl(\frac{p-k-1}{2} + O(1)\Bigr) \times \Bigl(\frac{l}{2} + C + O(1)\Bigr) = \frac{lp}{2} + Cp + O_k(p).$$
For C large enough, there are more coefficients than constraints, and
so one can find a non-trivial choice of coefficients obeying the con-
straints (4.27), and (b) follows.
Remark 4.5.5. If one optimises all the constants here, one gets an
upper bound of basically $8k\sqrt{p}$ for the sum in (4.25).

To see how these average-case bounds bear on the original problem,
suppose for contradiction that (4.19) failed, so that
$|\sum_{n \in [1,x]} \chi(n)| \ge \varepsilon x$ for some fixed $\varepsilon > 0$;
then by taking a small y (a suitable small power of p) and covering [1, x] by intervals
of length y, we see (from a first moment method argument) that
$$\Bigl|\sum_{a \le n \le a+y} \chi(n)\Bigr| \gg \varepsilon y$$
for a positive fraction of the a in [1, x]. But this contradicts the results
of the previous section, once x is significantly larger than $p^{1/2}$.
Burgess observed that by exploiting the multiplicativity of $\chi$ one
last time to amplify the above argument, one can extend the range
for which (4.19) can be proved from $x > p^{1/2+\varepsilon}$ to also cover the
range $p^{1/4+\varepsilon} < x < p^{1/2}$. The idea is not to cover [1, x] by intervals of
length y, but rather by arithmetic progressions $a, a + r, \dots, a + yr$
of length y, where $a = O(x)$ and $r = O(x/y)$. Another application of
the first moment method then shows that
$$\Bigl|\sum_{0 \le j \le y} \chi(a + jr)\Bigr| \gg \varepsilon y$$
for a positive fraction of the a in [1, x] and r in [1, x/y] (i.e. $\gg x^2/y$
such pairs (a, r)).
For technical reasons, it will be inconvenient if a and r have too
large of a common factor, so we pause to remove this possibility.
Observe that for any $d \ge 1$, the number of pairs (a, r) which have d
as a common factor is $O(\frac{1}{d^2} x^2/y)$. As $\sum_{d=1}^{\infty} \frac{1}{d^2}$ is convergent, we may
thus remove those pairs which have too large of a common factor, and
assume that all pairs (a, r) have common factor O(1) at most (so are
almost coprime).
Now we exploit the multiplicativity of $\chi$ to write $\chi(a + jr)$ as
$\chi(r)\chi(b + j)$, where b is a residue which is equal to $a/r \bmod p$. Dividing
out by $\chi(r)$, we conclude that
(4.28) $\Bigl|\sum_{0 \le j \le y} \chi(b + j)\Bigr| \gg \varepsilon y$
for $\gg x^2/y$ pairs (a, r).
Now for a key observation: the $\gg x^2/y$ values of b arising from
the pairs (a, r) are mostly distinct. Indeed, suppose that two pairs
(a, r), (a', r') generated the same residue b, thus $a/r = a'/r' \bmod p$.
This implies that $ar' = a'r \bmod p$. But $ar'$ and $a'r$
do not exceed p, so we may remove the modulus and conclude that
$ar' = a'r$. Since all pairs (a, r), (a', r') are almost
coprime, we see that for each (a, r) there are at most O(1) values of
$a', r'$ for which $ar' = a'r$. Thus (4.28) holds for $\gg x^2/y$ distinct
values of b; for $x > p^{1/4+\varepsilon}$ and y a suitably small power of p, the
number of such b exceeds what the average-case bounds of the previous
section permit, a contradiction. Hence (4.19) holds in this range as
well, and inserting this into the Vinogradov argument one concludes
that
$$n_p = O_\varepsilon\bigl(p^{\frac{1}{4\sqrt{e}}+\varepsilon}\bigr)$$
for all $\varepsilon > 0$, which is the
best bound known today on the least quadratic non-residue except
for refinements of the $p^{\varepsilon}$ error term.
Remark 4.5.7. There are many generalisations of this result, for
instance to more general characters (with possibly composite conduc-
tor), or to the shifted sums (4.21). However, the $p^{1/4}$-type exponent has
not been improved except with the assistance of powerful conjectures
such as the generalised Riemann hypothesis.
Notes. This article rst appeared at terrytao.wordpress.com/2009/08/18.
Thanks to Efthymios Sofos, Joshua Zelinsky, K, and Seva Lev for cor-
rections.
Boris noted the similarity between the use of the Frobenius map
$x \mapsto x^p$ in Stepanov's argument and Thue's trick from the proof of
his famous result on the Diophantine approximations to algebraic
numbers, where instead of the exact equality $x = x^p$ that is used
here, he used two very good approximations to the same algebraic
number.
4.6. Determinantal processes

Given a set S, a (simple) point process is a random subset A of S.
(A non-simple point process would allow multiplicity; more formally,
A is no longer a subset of S, but is a Radon measure on S, where we
give S the structure of a locally compact Polish space, but I do not
wish to dwell on these sorts of technical issues here.) Typically, A will
be finite or countable, even when S is uncountable. Basic examples
of point processes include:

• (Bernoulli point process) S is an at most countable set,
$0 \le p \le 1$ is a parameter, and A is a random set such that
the events $x \in A$ for each $x \in S$ are jointly independent and
occur with a probability of p each. This process is automat-
ically simple.
• (Discrete Poisson point process) S is an at most countable
space, $\lambda$ is a measure on S (i.e. an assignment of a non-
negative number $\lambda(x)$ to each $x \in S$), and A is a multiset
where the multiplicity of x in A is a Poisson random variable
with intensity $\lambda(x)$, and the multiplicities of $x \in A$ as x
varies in S are jointly independent. This process is usually
not simple.

• (Continuous Poisson point process) S is a locally compact
Polish space with a Radon measure $\mu$, and for each $\Omega \subset S$ of
finite measure, the number of points $|A \cap \Omega|$ that A contains
inside $\Omega$ is a Poisson random variable with intensity $\mu(\Omega)$.
Furthermore, if $\Omega_1, \dots, \Omega_n$ are disjoint sets, then the ran-
dom variables $|A \cap \Omega_1|, \dots, |A \cap \Omega_n|$ are jointly independent.
(The fact that Poisson processes exist at all requires a non-
trivial amount of measure theory, and will not be discussed
here.) This process is almost surely simple iff all points in
S have measure zero.
• (Spectral point processes) The spectrum of a random matrix
is a point process in C (or in R, if the random matrix is
Hermitian). If the spectrum is almost surely simple, then
the point process is almost surely simple. In a similar spirit,
the zeroes of a random polynomial are also a point process.
A remarkable fact is that many natural (simple) point processes
are determinantal processes. Very roughly speaking, this means that
there exists a positive semi-definite kernel $K : S \times S \to R$ such that,
for any $x_1, \dots, x_n \in S$, the probability that $x_1, \dots, x_n$ all lie in the
random set A is proportional to the determinant $\det\bigl((K(x_i, x_j))_{1 \le i,j \le n}\bigr)$.
Examples of processes known to be determinantal include non-intersecting
random walks, spectra of random matrix ensembles such as GUE, and
zeroes of polynomials with gaussian coefficients.
I would be interested in finding a good explanation (even at the
heuristic level) as to why determinantal processes are so prevalent
in practice. I do have a very weak explanation, namely that deter-
minantal processes obey a large number of rather pretty algebraic
identities, and so it is plausible that any other process which has a
very algebraic structure (in particular, any process involving gaus-
sians, characteristic polynomials, etc.) would be connected in some
way with determinantal processes. I'm not particularly satisfied with
this explanation, but I thought I would at least describe some of these
identities below to support this case. (This is partly for my own bene-
fit, as I am trying to learn about these processes, particularly in
connection with the spectral distribution of random matrices.) The
material here is partly based on [HoKrPeVi2006].
4.6.1. Discrete determinantal processes. In order to ignore all
measure-theoretic distractions and focus on the algebraic structure of
determinantal processes, we will first consider the discrete case when
the space S is just a finite set $S = \{1, \dots, N\}$ of cardinality $|S| = N$.
We say that a process $A \subset S$ is a determinantal process with kernel
K, where K is an $N \times N$ symmetric real matrix, if one has
(4.29) $P(i_1, \dots, i_k \in A) = \det\bigl(K(i_a, i_b)\bigr)_{1 \le a,b \le k}$
for all distinct $i_1, \dots, i_k \in S$.
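The definition (4.29) can be checked by brute force in a small example. The sketch below (my own illustration) uses the projection processes constructed later in this section, for which $P(A = B) = \det(K_B)$ for each n-element set B when K is a rank-n orthogonal projection; it enumerates the induced probabilities and verifies (4.29) for singletons and pairs:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 3
# random n-dimensional subspace V of R^N: columns of Q are an orthonormal basis
Q, _ = np.linalg.qr(rng.standard_normal((N, n)))
K = Q @ Q.T  # orthogonal projection onto V

# the projection determinantal process: P(A = B) = det(K restricted to B)
prob = {B: np.linalg.det(K[np.ix_(B, B)])
        for B in itertools.combinations(range(N), n)}
assert abs(sum(prob.values()) - 1) < 1e-9  # probabilities sum to 1 (Cauchy-Binet)

def incl(I):
    """P(I subset of A), computed from the explicit distribution."""
    return sum(p for B, p in prob.items() if set(I) <= set(B))

# verify (4.29): inclusion probabilities are determinants of minors of K
for I in itertools.chain(itertools.combinations(range(N), 1),
                         itertools.combinations(range(N), 2)):
    assert abs(incl(I) - np.linalg.det(K[np.ix_(I, I)])) < 1e-9
print("ok")
```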
To build determinantal processes, let us first consider point pro-
cesses of a fixed cardinality n, thus $0 \le n \le N$ and A is a random
subset of S of size n, or in other words a random variable taking
values in the set $\binom{S}{n} := \{B \subset S : |B| = n\}$.

In this simple model, an n-element point process is basically
just a collection of $\binom{N}{n}$ probabilities $p_B = P(A = B)$, one for each
$B \in \binom{S}{n}$, which are non-negative numbers which add up to 1. For
instance, in the uniform point process where A is drawn uniformly at
random from $\binom{S}{n}$, each of these probabilities $p_B$ would equal $1/\binom{N}{n}$.
How would one generate other interesting examples of n-element point
processes?
For this, we can borrow the idea from quantum mechanics that
probabilities can arise as the square of coefficients of unit vectors,
though unlike quantum mechanics it will be slightly more convenient
here to work with real vectors rather than complex ones. To formalise
this, we work with the $n^{th}$ exterior power $\bigwedge^n R^N$ of the Euclidean
space $R^N$; this space is sort of a quantisation of $\binom{S}{n}$, and is anal-
ogous to the space of quantum states of n identical fermions, if each
fermion can exist classically in one of N states (or spins). (The
requirement that the process be simple is then analogous to the Pauli
exclusion principle.)
This space of n-vectors in $R^N$ is spanned by the wedge products
$e_{i_1} \wedge \dots \wedge e_{i_n}$ with $1 \le i_1 < \dots < i_n \le N$, where $e_1, \dots, e_N$ is the
standard basis of $R^N$. There is a natural inner product to place on
$\bigwedge^n R^N$ by declaring all the $e_{i_1} \wedge \dots \wedge e_{i_n}$ to be orthonormal.
Lemma 4.6.1. If $f_1, \dots, f_N$ is any orthonormal basis of $R^N$, then
the $f_{i_1} \wedge \dots \wedge f_{i_n}$ for $1 \le i_1 < \dots < i_n \le N$ are an orthonormal basis
for $\bigwedge^n R^N$.

Proof. By definition, this is true when $(f_1, \dots, f_N) = (e_1, \dots, e_N)$.
If the claim is true for some orthonormal basis $f_1, \dots, f_N$, it is not
hard to see that the claim also holds if one rotates $f_i$ and $f_j$ in the
plane that they span by some angle $\theta$, where $1 \le i < j \le N$ are
arbitrary. But any orthonormal basis can be rotated into any other
by a sequence of such rotations (e.g. by using Euler angles), and the
claim follows.
Corollary 4.6.2. If $v_1, \dots, v_n$ are vectors in $R^N$, then the magnitude
of $v_1 \wedge \dots \wedge v_n$ is equal to the n-dimensional volume of the parallelop-
iped spanned by $v_1, \dots, v_n$.

Proof. Observe that applying row operations to the $v_i$ (i.e. modifying
one $v_i$ by a scalar multiple of another $v_j$) does not affect either the
wedge product or the volume of the parallelopiped. Thus by using the
Gram-Schmidt process, we may assume that the $v_i$ are orthogonal; by
normalising we may assume they are orthonormal. The claim now
follows from the preceding lemma.
From this and the ordinary Pythagorean theorem in the inner
product space $\bigwedge^n R^N$, we conclude the multidimensional Pythagorean
theorem: the square of the n-dimensional volume of a parallelopiped
in $R^N$ is the sum of squares of the n-dimensional volumes of the pro-
jection of that parallelopiped to each of the $\binom{N}{n}$ coordinate subspaces
$\mathrm{span}(e_{i_1}, \dots, e_{i_n})$. (I believe this theorem was first observed in this
generality by Donchian and Coxeter.) We also note another related
fact:
Lemma 4.6.3 (Gram identity). If $v_1, \dots, v_n$ are vectors in $R^N$, then
the square of the magnitude of $v_1 \wedge \dots \wedge v_n$ is equal to the determinant
of the Gram matrix $(v_i \cdot v_j)_{1 \le i,j \le n}$.

Proof. Again, the statement is invariant under row operations, and
one can reduce as before to the case of an orthonormal set, in which
case the claim is clear. (Alternatively, one can proceed via the Cauchy-
Binet formula.)
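Both the Gram identity and the multidimensional Pythagorean theorem are easy to confirm numerically: the coefficients of $v_1 \wedge \dots \wedge v_n$ in the standard basis are the $n \times n$ minors, and the sum of their squares should equal the Gram determinant. (A small self-contained sketch of my own, not from the text.)

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N, n = 5, 3
V = rng.standard_normal((n, N))  # rows v_1, ..., v_n

# sum of squared n x n minors = squared volumes of the coordinate projections
minors_sq = sum(np.linalg.det(V[:, cols]) ** 2
                for cols in itertools.combinations(range(N), n))

gram = np.linalg.det(V @ V.T)  # determinant of the Gram matrix (v_i . v_j)
assert abs(minors_sq - gram) < 1e-8
print(minors_sq, gram)
```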
If we define $e_{\{i_1, \dots, i_n\}} := e_{i_1} \wedge \dots \wedge e_{i_n}$, then we have identified
the standard basis of $\bigwedge^n R^N$ with $\binom{S}{n}$ by identifying $e_B$ with B. As a
consequence of this and the multidimensional Pythagorean theorem,
every unit n-vector $\omega$ in $\bigwedge^n R^N$ determines an n-element point pro-
cess A on S, by declaring the probability $p_B$ of A taking the value B
to equal $|\omega \cdot e_B|^2$ for each $B \in \binom{S}{n}$. Note that multiple n-vectors can
generate the same point process, because only the magnitudes of the
coefficients $\omega \cdot e_B$ are of interest; in particular, $\omega$ and $-\omega$ generate the
same point process. (This is analogous to how multiplying the wave
function in quantum mechanics by a complex phase has no effect on
any physical observable.)
Now we can introduce determinantal processes. If V is an n-
dimensional subspace of $R^N$, we can define the (projection) deter-
minantal process $A = A_V$ associated to V to be the point process
associated to the volume form of V, i.e. to the wedge product of an
orthonormal basis of V. (This volume form is only determined up to
sign, because the orientation of V has not been fixed, but as observed
previously, the sign of the form has no impact on the resulting point
process.)

By construction, the probability that the point process A is equal
to a set $\{i_1, \dots, i_n\}$ is equal to the square of the determinant of the
n × n matrix consisting of the $i_1, \dots, i_n$ coordinates of an arbitrary
orthonormal basis of V. By extending such an orthonormal basis to
the rest of $R^N$, and representing $e_{i_1}, \dots, e_{i_n}$ in this basis, it is not
hard to see that $P(A = \{i_1, \dots, i_n\})$ can be interpreted geometri-
cally as the square of the volume of the parallelopiped generated by
$Pe_{i_1}, \dots, Pe_{i_n}$, where P is the orthogonal projection onto V.
In fact we have the more general fact:

Lemma 4.6.4. If $k \ge 1$ and $i_1, \dots, i_k$ are distinct elements of S,
then $P(i_1, \dots, i_k \in A)$ is equal to the square of the k-dimensional
volume of the parallelopiped generated by the orthogonal projections
$Pe_{i_1}, \dots, Pe_{i_k}$ of $e_{i_1}, \dots, e_{i_k}$ to V.
Proof. We can assume that $k \le n$, since both expressions in the
lemma vanish otherwise.

By (anti-)symmetry we may assume that $\{i_1, \dots, i_k\} = \{1, \dots, k\}$.
By the Gram-Schmidt process we can find an orthonormal basis $v_1, \dots, v_n$
of V such that each $v_i$ is orthogonal to $e_1, \dots, e_{i-1}$.

Now consider the n × N matrix M with rows $v_1, \dots, v_n$; thus
M vanishes below the diagonal. The probability $P(1, \dots, k \in A)$ is
equal to the sum of squares of the determinants of all the n × n minors
of M that contain the first k columns. As M vanishes below the diagonal,
we see from cofactor expansion that this is equal to the product of
the squares of the first k diagonal entries, times the sum of squares
of the determinants of all the $(n-k) \times (n-k)$ minors of the bottom
$n - k$ rows. But by the generalised Pythagorean theorem, this latter
factor is the square of the volume of the parallelopiped generated by
$v_{k+1}, \dots, v_n$, which is 1. Meanwhile, by the base times height formula,
we see that the product of the first k diagonal entries of M is equal in
magnitude to the k-dimensional volume of the orthogonal projections
of $e_1, \dots, e_k$ to V. The claim follows.
As a special case of Lemma 4.6.4, we have $P(i \in A) = \|Pe_i\|^2$ for
any i. In particular, if $e_i$ lies in V, then i almost surely lies in A, and
when $e_i$ is orthogonal to V, i almost surely does not lie in A.

Let $K(i, j) := Pe_i \cdot e_j = Pe_i \cdot Pe_j$ denote the matrix coefficients
of the orthogonal projection P. From Lemma 4.6.4 and the Gram
identity, we conclude that A is a determinantal process (see (4.29))
with kernel K. Also, by combining Lemma 4.6.4 with the generalised
Pythagorean theorem, we conclude a monotonicity property:
Lemma 4.6.5 (Monotonicity property). If $V \subset W$ are nested sub-
spaces of $R^N$, then $P(B \subset A_V) \le P(B \subset A_W)$ for every $B \subset S$.
This seems to suggest that there is some way of representing $A_W$
as the union of $A_V$ with another process coupled with $A_V$, but I was
not able to build a non-artificial example of such a representation.
On the other hand, if $V \subset R^N$ and $V' \subset R^{N'}$, then the process
associated to the direct sum $V \oplus V' \subset R^{N+N'}$ is the union of
independent copies of $A_V$ and $A_{V'}$ (as the kernel is block-diagonal).
There is also a duality: the complement $S \setminus A_V$ has the same
distribution as $A_{V^\perp}$, where $V^\perp$ is the orthogonal complement of V.
To see this, take the rows of an orthonormal basis of V followed by
those of an orthonormal basis of $V^\perp$,
and let M be the resulting N × N orthogonal matrix; then the task is
to show that the top n × n minor X of M has the same determinant
squared as the bottom $(N-n) \times (N-n)$ minor Y. But if one splits $M =
\begin{pmatrix} X & Z \\ W & Y \end{pmatrix}$, we see from the orthogonality property that $XX^* = I_n -
ZZ^*$ and $Y^* Y = I_{N-n} - Z^* Z$, where $I_n$ is the n × n identity matrix.
But from the singular value decomposition we see that $I_n - ZZ^*$ and
$I_{N-n} - Z^* Z$ have the same determinant, and the claim follows.
Now let K be an arbitrary symmetric N × N matrix whose eigenvalues
$\lambda_1, \dots, \lambda_N$ all lie in the interval [0, 1], with orthonormal eigenvectors
$\phi_1, \dots, \phi_N$. Let J be a random subset of $\{1, \dots, N\}$, with the events
$j \in J$ jointly independent and each occurring with probability $\lambda_j$, and
let $A = A_{V_J}$ be the projection determinantal process associated to the
(random) space $V_J$ spanned by the $\phi_j$ for $j \in J$. Conditioning on J,
this process has kernel $K_{V_J}(i_a, i_b) = \sum_{j \in J} \phi_j(i_a) \phi_j(i_b)$, where $\phi_j(i)$
is the $i^{th}$ coordinate of $\phi_j$. Thus we can write
$$\bigl(K_{V_J}(i_a, i_b)\bigr)_{1 \le a,b \le k} = \sum_{j=1}^{N} I(j \in J)\, R_j$$
where $I(j \in J)$ is the indicator of the event $j \in J$, and $R_j$ is the
rank one matrix $\bigl(\phi_j(i_a) \phi_j(i_b)\bigr)_{1 \le a,b \le k}$. Using multilinearity of the
determinant, and the fact that any determinant involving two or more
rows of the same rank one matrix automatically vanishes, we see that
we can express
$$\det\bigl((K_{V_J}(i_a, i_b))_{1 \le a,b \le k}\bigr) = \sum_{1 \le j_1, \dots, j_k \le N, \mathrm{distinct}} I(j_1, \dots, j_k \in J) \det(R_{j_1, \dots, j_k})$$
where $R_{j_1, \dots, j_k}$ is the matrix whose first row is the same as that of
$R_{j_1}$, the second row is the same as that of $R_{j_2}$, and so forth. Taking
expectations in J, the quantity $I(j_1, \dots, j_k \in J)$ becomes $\lambda_{j_1} \cdots \lambda_{j_k}$.
Undoing the multilinearity step, we conclude that
$$E_J \det\bigl(K_{V_J}(i_a, i_b)\bigr)_{1 \le a,b \le k} = \det\Bigl(\sum_{j=1}^{N} \lambda_j R_j\Bigr)$$
and thus A is a determinantal process with kernel
$$K(x, y) := \sum_{j=1}^{N} \lambda_j \phi_j(x) \phi_j(y).$$
To summarise, we have created a determinantal process A whose
kernel K is now an arbitrary symmetric matrix with eigenvalues
$\lambda_1, \dots, \lambda_N \in [0, 1]$, and it is a mixture of constant-size processes $A_{V_J}$.
In particular, the cardinality $|A|$ of this process has the same distri-
bution as the cardinality $|J|$ of the random subset of $\{1, \dots, N\}$, or in
other words $|A| \equiv I_1 + \dots + I_N$, where $I_1, \dots, I_N$ are independent
Bernoulli variables with expectation $\lambda_1, \dots, \lambda_N$ respectively.
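The mixture identity above, $E_J \det(K_{V_J}(i_a, i_b)) = \det(K(i_a, i_b))$, can be verified numerically by averaging over all $2^N$ Bernoulli choices of J in a small example (a sketch of my own, with an arbitrary kernel):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N = 4
# an arbitrary symmetric kernel with eigenvalues in [0, 1]
lam = rng.uniform(0, 1, N)
phi, _ = np.linalg.qr(rng.standard_normal((N, N)))  # orthonormal eigenvectors
K = (phi * lam) @ phi.T  # K = sum_j lam_j phi_j phi_j^T

def K_VJ(J):
    """Kernel of the projection onto the span of the phi_j, j in J."""
    P = phi[:, list(J)]
    return P @ P.T

# E_J det(K_{V_J}(i_a, i_b)) = det(K(i_a, i_b)), checked for the pair I = (0, 1)
I = (0, 1)
expected = sum(np.prod([lam[j] if j in J else 1 - lam[j] for j in range(N)])
               * np.linalg.det(K_VJ(J)[np.ix_(I, I)])
               for r in range(N + 1)
               for J in itertools.combinations(range(N), r))
assert abs(expected - np.linalg.det(K[np.ix_(I, I)])) < 1e-9
print("ok")
```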
Observe that if one takes a determinantal process $A \subset S$ with
kernel K, and restricts it to a subset $S'$ of S, then the resulting
process $A \cap S'$ is again determinantal, with kernel given by
the restriction of K to $S' \times S'$. In particular, $|A \cap S'|$
has the same distribution as the sum of $|S'|$ independent Bernoulli
variables, whose expectations are the eigenvalues of the restriction of
K to $S' \times S'$; in particular, one has
(4.30) $P(A \cap S' = \emptyset) = \det(I - K|_{S'}).$
Also, if the difference $K' - K$ of the kernels of two determinantal
processes $A', A$ is positive semi-definite, then $A'$ dominates A in the
sense that $P(B \subset A') \ge P(B \subset A)$ for all $B \subset S$. A particularly
nice special case is when $K = cK'$ for some $0 \le c \le 1$; then $P(B \subset A) =
c^{|B|} P(B \subset A')$, and A has the same distribution as the process
obtained from $A'$ by deleting each element of $A'$ independently with
probability $1 - c$.
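The avoidance identity $P(A \cap S' = \emptyset) = \det(I - K|_{S'})$ can be spot-checked against the explicit projection-process distribution (my own sketch; subspace and subset chosen arbitrarily):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N, n = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((N, n)))
K = Q @ Q.T  # projection kernel: P(A = B) = det(K_B) for |B| = n

Sp = (0, 2, 5)  # an arbitrary subset S' of S
avoid = sum(np.linalg.det(K[np.ix_(B, B)])
            for B in itertools.combinations(range(N), n)
            if not set(B) & set(Sp))  # total probability that A avoids S'
target = np.linalg.det(np.eye(len(Sp)) - K[np.ix_(Sp, Sp)])
assert abs(avoid - target) < 1e-9
print(avoid, target)
```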
4.6.2. Continuous determinantal processes. Now we consider
point processes A on the real line R. Such a process is said to have
correlation functions $\rho_k : R^k \to R$ for $k \ge 1$ if the $\rho_k$ are symmetric, non-negative, and
locally integrable, and one has the formula
$$E \sum_{x_1, \dots, x_k \in A, \mathrm{distinct}} f(x_1, \dots, x_k) = \int_{R^k} f(x_1, \dots, x_k)\, \rho_k(x_1, \dots, x_k)\, dx_1 \dots dx_k$$
for any bounded measurable symmetric f with compact support,
where the left-hand side is summed over all k-tuples of distinct points
in A (this sum is of course empty if $|A| < k$). Intuitively, the
probability that A contains an element in the infinitesimal interval
$[x_i, x_i + dx_i]$ for all $1 \le i \le k$ and distinct $x_1, \dots, x_k$ is equal to
$\rho_k(x_1, \dots, x_k)\, dx_1 \dots dx_k$. The $\rho_k$ are not quite probability distribu-
tions; instead, the integral $\int_{R^k} \rho_k$ is equal to $k!\, E\binom{|A|}{k}$. Thus, for
instance, if A is a constant-size process of cardinality n, then $\rho_k$ has
integral $\frac{n!}{(n-k)!}$ on $R^k$ for $1 \le k \le n$ and vanishes for $k > n$.
If the correlation functions exist, it is easy to see that they are
unique (up to almost everywhere equivalence), and can be used to
compute various statistics of the process. For instance, an applica-
tion of the inclusion-exclusion principle shows that for any bounded
measurable set $\Omega$, the probability that $A \cap \Omega = \emptyset$ is (formally) equal
to
$$\sum_{k=0}^{\infty} \frac{(-1)^k}{k!} \int_{\Omega^k} \rho_k(x_1, \dots, x_k)\, dx_1 \dots dx_k.$$
A process is determinantal with some symmetric measurable ker-
nel $K : R \times R \to R$ if it has correlation functions $\rho_k$ given by the
formula
(4.31) $\rho_k(x_1, \dots, x_k) = \det\bigl(K(x_i, x_j)\bigr)_{1 \le i,j \le k}$.
Informally, the probability that A intersects the infinitesimal intervals
$[x_i, x_i + dx_i]$ for distinct $x_1, \dots, x_k$ is $\det\bigl(K(x_i, x_j)\, dx_i^{1/2} dx_j^{1/2}\bigr)_{1 \le i,j \le k}$.
(Thus, K is most naturally interpreted as a half-density, or as an
integral operator from $L^2(R)$ to $L^2(R)$.)
There are analogues of the discrete theory in this continuous set-
ting. For instance, one can show that a symmetric measurable kernel
K generates a determinantal process if and only if the associated inte-
gral operator $\mathcal{K}$ has spectrum lying in the interval [0, 1]. The analogue
of (4.30) is the formula
$$P(A \cap \Omega = \emptyset) = \det(I - \mathcal{K}|_\Omega);$$
more generally, the distribution of $|A \cap \Omega|$ is the sum of independent
Bernoulli variables, whose expectations are the eigenvalues of $\mathcal{K}|_\Omega$.
Finally, if $\mathcal{K}$ is an orthogonal projection onto an n-dimensional space,
then the process has a constant size of n. Conversely, if A is a pro-
cess of constant size n, whose $n^{th}$ correlation function $\rho_n(x_1, \dots, x_n)$
is given by (4.31), where $\mathcal{K}$ is an orthogonal projection onto an n-
dimensional space, then (4.31) holds for all other values of k as well,
and so A is a determinantal process with kernel K. (This is roughly
the analogue of Lemma 4.6.4.)

These facts can be established either by approximating a contin-
uous process as the limit of discrete ones, or by obtaining alternate
proofs of several of the facts in the previous section which do not
rely as heavily on the discrete hypotheses. See [HoKrPeVi2006] for
details.

A Poisson process can be viewed as the limiting case of a deter-
minantal process in which $\mathcal{K}$ degenerates to (a normalisation of) a
multiplication operator $f \mapsto \lambda f$, where $\lambda$ is the intensity function.
4.6.3. The spectrum of GUE. Now we turn to a specific example
of a continuous point process, namely the spectrum $A = \{\lambda_1, \dots, \lambda_n\}
\subset R$ of the Gaussian unitary ensemble $M_n = (\xi_{ij})_{1 \le i,j \le n}$, where the
$\xi_{ij}$ are independent for $1 \le i \le j \le n$ with mean zero and variance 1,
with $\xi_{ij}$ being the standard complex gaussian for $i < j$ and the stan-
dard real gaussian N(0, 1) for $i = j$ (and $\xi_{ji} := \overline{\xi_{ij}}$ for $i > j$). The
probability distribution of $M_n$ can be expressed as
$$c_n \exp\bigl(-\tfrac{1}{2} \operatorname{trace}(M_n^2)\bigr)\, dM_n$$
where $dM_n$ is Lebesgue measure on the space of Hermitian n × n
matrices, and $c_n > 0$ is some explicit normalising constant.
The n-point correlation function of A can be computed explicitly:

Lemma 4.6.9 (Ginibre formula). The n-point correlation function
$\rho_n(x_1, \dots, x_n)$ of the GUE spectrum A is given by
(4.32) $\rho_n(x_1, \dots, x_n) = c'_n \Bigl(\prod_{1 \le i < j \le n} |x_i - x_j|^2\Bigr) \exp\Bigl(-\sum_{i=1}^{n} x_i^2/2\Bigr)$
where the normalising constant $c'_n$ is chosen so that $\rho_n$ has integral
$n!$ (the normalisation appropriate for the n-point correlation function
of an n-point process, as noted above).

The constant $c'_n > 0$ is essentially the reciprocal of the partition
function for this ensemble, and can be computed explicitly, but we
will not do so here.
Proof. Let D be a diagonal random matrix $D = \mathrm{diag}(x_1, \dots, x_n)$
whose entries are drawn using the symmetric probability distribution
proportional to $\rho_n(x_1, \dots, x_n)$ defined
by (4.32), and let $U \in U(n)$ be a unitary matrix drawn uniformly at
random (with respect to Haar measure on U(n)) and independently
of D. It will suffice to show that the GUE $M_n$ has the same prob-
ability distribution as $U^* D U$. Both distributions are invariant under
unitary conjugation, so it suffices to compare their densities at a diag-
onal matrix $D_0 = \mathrm{diag}(x_1^0, \dots, x_n^0)$ with distinct entries; the density
of $M_n$ there is proportional to $\exp(-\sum_{i=1}^{n} (x_i^0)^2/2)$. On the other hand, a short compu-
tation shows that the density of $U^* D U$ at $D_0$ is a constant multiple of
$$\rho_n(x_1^0, \dots, x_n^0) \prod_{1 \le i < j \le n} \frac{1}{|x_i^0 - x_j^0|^2}$$
(the square here arising because of the complex nature of the ij co-
efficient of U) and the claim follows.
One can also represent the k-point correlation functions as a de-
terminant:

Lemma 4.6.10 (Gaudin-Mehta formula). The k-point correlation
function $\rho_k(x_1, \dots, x_k)$ of the GUE spectrum A is given by
(4.33) $\rho_k(x_1, \dots, x_k) = \det\bigl(K_n(x_i, x_j)\bigr)_{1 \le i,j \le k}$
where $K_n(x, y)$ is the kernel of the orthogonal projection $\mathcal{K}$ in $L^2(R)$
onto the space spanned by the functions $x^i e^{-x^2/4}$ for $i = 0, \dots, n-1$.
In other words, A is the n-point determinantal process with kernel
$K_n$.
Proof. By the material in the preceding section, it suces to estab-
lish this for k = n. As K is the kernel of an orthogonal projection to
an n-dimensional space, it generates an n-point determinantal pro-
cess and so det(K
n
(x
i
, x
j
))
1i<jn
has integral
_
n
n
_
= 1. Thus it
4.7. The Cohen-Lenstra distribution 619
will suce to show that
n
and det(K
n
(x
i
, x
j
))
1i<jn
agree up to
multiplicative constants.
By Gram-Schmidt, one can find an orthonormal basis $\phi_i(x) e^{-x^2/4}$, $i = 0,\ldots,n-1$ for the range of $\pi$, with each $\phi_i$ a polynomial of degree $i$ (these are essentially the Hermite polynomials). Then we can write
$$K_n(x_i,x_j) = \sum_{k=0}^{n-1} \phi_k(x_i)\phi_k(x_j) e^{-(x_i^2+x_j^2)/4}.$$
Cofactor expansion then shows that $\det(K_n(x_i,x_j))_{1 \le i,j \le n}$ is equal to $\exp(-\sum_{i=1}^n x_i^2/2)$ times a polynomial $P(x_1,\ldots,x_n)$ in $x_1,\ldots,x_n$ of degree at most $2\sum_{k=0}^{n-1} k = n(n-1)$. On the other hand, this determinant is always non-negative, and vanishes whenever $x_i = x_j$ for any $1 \le i < j \le n$, and so must contain $(x_i - x_j)^2$ as a factor for all $1 \le i < j \le n$. As the total degree of all these (relatively prime) factors is $n(n-1)$, the claim follows.
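The factorisation at the heart of this proof can be checked numerically; the following sketch (not from the text, and using the probabilists' Hermite polynomials as the assumed orthonormalisation) verifies that $\det(K_n(x_i,x_j))$ is a constant multiple of $\exp(-\sum_i x_i^2/2)\prod_{i<j}(x_i-x_j)^2$ by comparing the ratio at two random point configurations:

```python
import math
import numpy as np

def phi_matrix(xs, n):
    """Rows k = 0..n-1: the orthonormal functions phi_k(x) e^{-x^2/4},
    built from the probabilists' Hermite polynomials He_k, which are
    orthogonal for the weight e^{-x^2/2} with norm sqrt(sqrt(2*pi)*k!)."""
    xs = np.asarray(xs, dtype=float)
    He = np.zeros((n, len(xs)))
    He[0] = 1.0
    if n > 1:
        He[1] = xs
    for k in range(1, n - 1):
        He[k + 1] = xs * He[k] - k * He[k - 1]   # three-term recurrence
    norms = np.array([math.sqrt(math.sqrt(2 * math.pi) * math.factorial(k))
                      for k in range(n)])
    return (He / norms[:, None]) * np.exp(-xs ** 2 / 4)

def det_kernel(xs, n):
    """det(K_n(x_i, x_j))_{1<=i,j<=n} for the projection kernel K_n."""
    A = phi_matrix(xs, n)           # A[k, i] = phi_k(x_i) e^{-x_i^2/4}
    return np.linalg.det(A.T @ A)   # K = A^T A

def vandermonde_gaussian(xs):
    """exp(-sum x_i^2/2) * prod_{i<j} (x_i - x_j)^2."""
    xs = np.asarray(xs, dtype=float)
    v = 1.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            v *= (xs[i] - xs[j]) ** 2
    return v * math.exp(-float(np.sum(xs ** 2)) / 2)

rng = np.random.default_rng(0)
n = 4
# the ratio should be the same constant for any choice of points
ratios = [det_kernel(x, n) / vandermonde_gaussian(x)
          for x in (rng.normal(size=n), rng.normal(size=n))]
assert np.isclose(ratios[0], ratios[1], rtol=1e-6)
```

The constancy of the ratio is exactly the statement that the two expressions agree up to a multiplicative constant.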
This formula can be used to obtain asymptotics for the (renormalised) GUE eigenvalue spacings in the limit $n \to \infty$, by using asymptotics for (renormalised) Hermite polynomials; this was first established by Dyson [Dy1970].

Notes. This article first appeared at terrytao.wordpress.com/2009/08/23. Thanks to anonymous commenters for corrections.
Craig Tracy noted that some non-determinantal processes, such as TASEP, still enjoy many of the same spacing distributions as their determinantal counterparts.
Manju Krishnapur raised the relevant question of how one could
determine quickly whether a given process is determinantal.
Russell Lyons noted that the open problem of coupling determinantal processes together was also raised in Question 10.1 of [Ly2003] (which also covers most of the other material in this article).
4.7. The Cohen-Lenstra distribution
At a conference recently, I learned of the recent work of Ellenberg, Venkatesh, and Westerland [ElVeWe2009], which concerned the conjectural behaviour of class groups of quadratic fields, and in particular to explain the numerically observed phenomenon that about 75.4% of all real quadratic fields $\mathbf{Q}[\sqrt{d}]$ enjoy unique factorisation. The Cohen-Lenstra heuristics predict, among other things, that the cokernel of a random homomorphism from $(\mathbf{Z}/p^n\mathbf{Z})^d$ to $(\mathbf{Z}/p^n\mathbf{Z})^d$ should, in the limit $n, d \to \infty$, be isomorphic to a given $p$-group $G$ with probability
$$\frac{1}{|\operatorname{Aut}(G)|} \prod_{j=1}^\infty \Big(1 - \frac{1}{p^j}\Big), \qquad (4.34)$$
where $\operatorname{Aut}(G)$ is the group of automorphisms of $G$. In particular this leads to the strange identity
$$\sum_G \frac{1}{|\operatorname{Aut}(G)|} = \prod_{j=1}^\infty \Big(1 - \frac{1}{p^j}\Big)^{-1} \qquad (4.35)$$
where $G$ ranges over all $p$-groups; I do not know how to prove this identity other than via the above probability computation, the proof of which I give below.

Based on the heuristic that the class group should behave randomly subject to some obvious constraints, it is expected that a randomly chosen real quadratic field $\mathbf{Q}[\sqrt{d}]$ has unique factorisation with probability
$$\prod_{p\ \mathrm{odd}} \prod_{j=2}^\infty \Big(1 - \frac{1}{p^j}\Big) \approx 0.754,$$
whereas a randomly chosen imaginary quadratic field $\mathbf{Q}[\sqrt{-d}]$ has unique factorisation with probability
$$\prod_{p\ \mathrm{odd}} \prod_{j=1}^\infty \Big(1 - \frac{1}{p^j}\Big) = 0.$$
The former claim is conjectural, whereas the latter claim follows from (for instance) Siegel's theorem on the size of the class group, as discussed in Section 3.12.4. The work in [ElVeWe2009] establishes some partial results towards the function field analogues of these heuristics.
4.7.1. $p$-groups. Henceforth the prime $p$ will be fixed. We will abbreviate "finite abelian $p$-group" as "$p$-group" for brevity. Thanks to the classification of finite abelian groups, the $p$-groups are all isomorphic to products
$$(\mathbf{Z}/p^{n_1}\mathbf{Z}) \times \ldots \times (\mathbf{Z}/p^{n_d}\mathbf{Z})$$
of cyclic $p$-groups.
The cokernel of a random homomorphism from $(\mathbf{Z}/p^n\mathbf{Z})^d$ to $(\mathbf{Z}/p^n\mathbf{Z})^d$ can be written as the quotient of the $p$-group $(\mathbf{Z}/p^n\mathbf{Z})^d$ by the subgroup generated by $d$ randomly chosen elements $x_1,\ldots,x_d$ from that $p$-group. One can view this quotient as a $d$-fold iterative process, in which one starts with the $p$-group $(\mathbf{Z}/p^n\mathbf{Z})^d$, and then one iterates $d$ times the process of starting with a $p$-group $G$, and quotienting out by a randomly chosen element $x$ of that group $G$. From induction, one sees that at the $j^{th}$ stage of this process ($0 \le j \le d$), one ends up with a $p$-group isomorphic to $(\mathbf{Z}/p^n\mathbf{Z})^{d-j} \times G_j$ for some $p$-group $G_j$.
Let's see how the group $(\mathbf{Z}/p^n\mathbf{Z})^{d-j} \times G_j$ transforms to the next group $(\mathbf{Z}/p^n\mathbf{Z})^{d-j-1} \times G_{j+1}$. We write a random element of $(\mathbf{Z}/p^n\mathbf{Z})^{d-j} \times G_j$ as $(x,y)$, where $x \in (\mathbf{Z}/p^n\mathbf{Z})^{d-j}$ and $y \in G_j$. Observe that for any $0 \le i < n$, $x$ is a multiple of $p^i$ (but not $p^{i+1}$) with probability $(1 - p^{-(d-j)}) p^{-i(d-j)}$. (The remaining possibility is that $x$ is zero, but this event will have negligible probability in the limit $n \to \infty$.) If $x$ is indeed divisible by $p^i$ but not $p^{i+1}$, and $i$ is not too close to $n$, a little thought will then reveal that $|G_{j+1}| = p^i |G_j|$. Thus the size of the $p$-groups $G_j$ only grows as $j$ increases. (Things go wrong when $i$ gets close to $n$, e.g. $p^i \ge p^n/|G_j|$, but the total probability of this event as $j$ ranges from $0$ to $d$ sums to $o(1)$ as $n \to \infty$ (uniformly in $d$), by using the tightness bounds on $|G_j|$ mentioned below. Alternatively, one can avoid a lot of technicalities by taking the limit $n \to \infty$ before taking the limit $d \to \infty$ (instead of studying the double limit $n, d \to \infty$), or equivalently by replacing the cyclic group $\mathbf{Z}/p^n\mathbf{Z}$ with the $p$-adics $\mathbf{Z}_p$.)
The exponentially decreasing nature of the probability $(1 - p^{-(d-j)}) p^{-i(d-j)}$ in $i$ (and in $d-j$) furthermore implies that the distribution of $|G_j|$ forms a tight sequence in $n, j, d$: for every $\varepsilon > 0$, one has an $R > 0$ such that the probability that $|G_j| \ge R$ is less than $\varepsilon$ for all choices of $n, j, d$. (This tightness is necessary to prove the equality in (4.35), rather than just an inequality (from Fatou's lemma).) Indeed, the probability that $|G_j| = p^m$ converges as $n, d \to \infty$ to the $t^m$ coefficient in the generating function
$$\prod_{k=1}^\infty \sum_{i=0}^\infty t^i (1 - p^{-k}) p^{-ik} = \prod_{k=1}^\infty \frac{1 - p^{-k}}{1 - t p^{-k}}. \qquad (4.36)$$
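The low-order coefficients of (4.36) can be compared numerically with the Cohen-Lenstra prediction (4.34); a quick sketch (not from the text, using truncated products in place of the infinite ones):

```python
def gf_coeffs(p, deg, K=400):
    """Coefficients of t^0..t^deg in the truncated product
    prod_{k=1}^K sum_i t^i (1 - p^-k) p^{-ik}; the tail k > K is negligible."""
    poly = [1.0] + [0.0] * deg
    for k in range(1, K + 1):
        w = p ** (-k)
        factor = [(1 - w) * w ** i for i in range(deg + 1)]
        new = [0.0] * (deg + 1)
        for a in range(deg + 1):
            for b in range(deg + 1 - a):
                new[a + b] += poly[a] * factor[b]
        poly = new
    return poly

p = 3
c0, c1, c2 = gf_coeffs(p, 2)
Z = 1.0                      # truncation of prod_{k>=1} (1 - p^-k)
for k in range(1, 401):
    Z *= 1 - p ** (-k)

# t^0: only the trivial group, |Aut| = 1
assert abs(c0 - Z) < 1e-9
# t^1: only G = Z/pZ, |Aut| = p - 1
assert abs(c1 - Z / (p - 1)) < 1e-9
# t^2: G = Z/p^2Z or (Z/pZ)^2, |Aut| = p^2 - p and (p^2-1)(p^2-p)
assert abs(c2 - Z * (1 / (p**2 - p) + 1 / ((p**2 - 1) * (p**2 - p)))) < 1e-9
```

Each assertion matches a $t^m$ coefficient of (4.36) against the sum of (4.34) over the $p$-groups of order $p^m$, as discussed below.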
In particular, this claim is true for the final cokernel $G_d$. Note that this (and the geometric series formula) already yields (4.34) in the case of the trivial group $G = 0$ and the order $p$ group $G = \mathbf{Z}/p\mathbf{Z}$ (note that $\operatorname{Aut}(G)$ has order $1$ and $p-1$ in these respective cases). But it is not enough to deal with higher groups. For instance, up to isomorphism there are two $p$-groups of order $p^2$, namely $\mathbf{Z}/p^2\mathbf{Z}$ and $(\mathbf{Z}/p\mathbf{Z})^2$, whose automorphism groups have order $p^2 - p$ and $(p^2-1)(p^2-p)$ respectively. Summing up the corresponding two expressions (4.34) one can observe that this matches the $t^2$ coefficient of (4.36) (after some applications of the geometric series formula). Thus we see that (4.36) is consistent with the claim (4.34), but does not fully imply that claim.
To get the full asymptotic (4.34) we try a slightly different tack. Fix a $p$-group $G$, and consider the event that the cokernel of a random map $T: (\mathbf{Z}/p^n\mathbf{Z})^d \to (\mathbf{Z}/p^n\mathbf{Z})^d$ is isomorphic to $G$. We assume $n$ so large that all elements in $G$ have order at most $p^n$. If this is the case, then there must be a surjective homomorphism $\phi: (\mathbf{Z}/p^n\mathbf{Z})^d \to G$ such that the range of $T$ is equal to the kernel of $\phi$. The number of homomorphisms from $(\mathbf{Z}/p^n\mathbf{Z})^d$ to $G$ is $|G|^d$ (one has to pick $d$ generators in $G$). If $d$ is large, it is easy to see that most of these homomorphisms are surjective (the proportion of such homomorphisms is $1 - o(1)$ as $d \to \infty$). On the other hand, there is some multiplicity; the range of $T$ can emerge as the kernel of $\phi$ in $|\operatorname{Aut}(G)|$ different ways (since any two surjective homomorphisms $\phi, \phi': (\mathbf{Z}/p^n\mathbf{Z})^d \to G$ with the same kernel arise from an automorphism of $G$). So to prove (4.34), it suffices to show that for any surjective homomorphism $\phi: (\mathbf{Z}/p^n\mathbf{Z})^d \to G$, the probability that the range of $T$ equals the kernel of $\phi$ is
$$(1 + o(1))\, |G|^{-d} \prod_{j=1}^\infty \Big(1 - \frac{1}{p^j}\Big).$$
The range of $T$ is the same thing as the subgroup of $(\mathbf{Z}/p^n\mathbf{Z})^d$ generated by $d$ random elements $x_1,\ldots,x_d$ of that group. The kernel of $\phi$ has index $|G|$ inside $(\mathbf{Z}/p^n\mathbf{Z})^d$, so the probability that all of those random elements lie in the kernel of $\phi$ is $|G|^{-d}$. So it suffices to prove the following claim: if $\phi$ is a fixed surjective homomorphism from $(\mathbf{Z}/p^n\mathbf{Z})^d$ to $G$, and $x_1,\ldots,x_d$ are chosen randomly from the kernel of $\phi$, then $x_1,\ldots,x_d$ will generate that kernel with probability
$$(1 + o(1)) \prod_{j=1}^\infty \Big(1 - \frac{1}{p^j}\Big). \qquad (4.37)$$
But from the classification of $p$-groups, the kernel of $\phi$ (which has bounded index inside $(\mathbf{Z}/p^n\mathbf{Z})^d$) is isomorphic to
$$(\mathbf{Z}/p^{n-O(1)}\mathbf{Z}) \times \ldots \times (\mathbf{Z}/p^{n-O(1)}\mathbf{Z}) \qquad (4.38)$$
where $O(1)$ means bounded uniformly in $n$, and there are $d$ factors here. As in the previous argument, one can now imagine starting with the group (4.38), and then iterating $d$ times the operation of quotienting out by the group generated by a randomly chosen element; our task is to compute the probability that one ends up with the trivial group by applying this process.

As before, at the $j^{th}$ stage of the iteration, one ends up with a group of the form
$$(\mathbf{Z}/p^{n-O(1)}\mathbf{Z}) \times \ldots \times (\mathbf{Z}/p^{n-O(1)}\mathbf{Z}) \times G_j \qquad (4.39)$$
where there are $d-j$ factors of $(\mathbf{Z}/p^{n-O(1)}\mathbf{Z})$. The group $G_j$ is increasing in size, so the only way in which one ends up with the trivial group is if all the $G_j$ are trivial. But if $G_j$ is trivial, the only way that $G_{j+1}$ is trivial is if the randomly chosen element from (4.39) has a $(\mathbf{Z}/p^{n-O(1)}\mathbf{Z}) \times \ldots \times (\mathbf{Z}/p^{n-O(1)}\mathbf{Z})$ component which is invertible (i.e. not a multiple of $p$), which occurs with probability $1 - p^{-(d-j)}$ (assuming $n$ is large enough). Multiplying all these probabilities together gives (4.37).
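For finite $d$ (before taking $d \to \infty$), the case $G = 0$ of this computation says that a random $d \times d$ matrix over $\mathbf{Z}/p^n\mathbf{Z}$ has trivial cokernel with probability exactly $\prod_{j=1}^d (1 - p^{-j})$, since the cokernel is trivial precisely when the matrix is surjective, i.e. its determinant is a unit mod $p$. A brute-force check for small parameters (a sketch, not from the text):

```python
from itertools import product
from fractions import Fraction

def trivial_cokernel_fraction(p, n):
    """Exact fraction of 2x2 matrices over Z/p^nZ with trivial cokernel.
    Surjectivity over Z/p^nZ is equivalent to det != 0 mod p."""
    q = p ** n
    good = sum(1 for a, b, c, d in product(range(q), repeat=4)
               if (a * d - b * c) % p != 0)
    return Fraction(good, q ** 4)

p, n, d = 3, 2, 2
frac = trivial_cokernel_fraction(p, n)
predicted = Fraction(1)
for j in range(1, d + 1):
    predicted *= 1 - Fraction(1, p ** j)
assert frac == predicted == Fraction(16, 27)
```

As $d \to \infty$ the product $\prod_{j=1}^d (1 - p^{-j})$ converges to the $G = 0$ case of (4.34), where $|\operatorname{Aut}(G)| = 1$.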
Notes. This article first appeared at terrytao.wordpress.com/2009/10/02. Thanks to David Speyer and an anonymous commenter for corrections.
4.8. An entropy Plünnecke-Ruzsa inequality
A handy inequality in additive combinatorics is the Plünnecke-Ruzsa inequality [Ru1989]:

Theorem 4.8.1 (Plünnecke-Ruzsa inequality). Let $A, B_1, \ldots, B_m$ be finite non-empty subsets of an additive group $G$, such that $|A + B_i| \le K_i |A|$ for all $1 \le i \le m$ and some scalars $K_1,\ldots,K_m \ge 1$. Then there exists a subset $A'$ of $A$ such that $|A' + B_1 + \ldots + B_m| \le K_1 \cdots K_m |A'|$.

The proof uses graph-theoretic techniques. Setting $A = B_1 = \ldots = B_m$, we obtain a useful corollary: if $A$ has small doubling in the sense that $|A + A| \le K|A|$, then we have $|mA| \le K^m |A|$ for all $m \ge 1$, where $mA = A + \ldots + A$ is the sum of $m$ copies of $A$.
In a recent paper [Ta2010c], I adapted a number of sum set estimates to the entropy setting, in which finite sets such as $A$ in $G$ are replaced with discrete random variables $X$ taking values in $G$, and (the logarithm of) the cardinality $|A|$ of a set $A$ is replaced by the Shannon entropy $H(X)$ of a random variable $X$. (Throughout this note I assume all entropies to be finite.) However, at the time, I was unable to find an entropy analogue of the Plünnecke-Ruzsa inequality, because I did not know how to adapt the graph theory argument to the entropy setting.

I recently discovered, however, that buried in a classic paper [KaVe1983] of Kaimanovich and Vershik (implicitly in Proposition 1.3, to be precise) there was the following analogue of Theorem 4.8.1:
Theorem 4.8.2 (Entropy Plünnecke-Ruzsa inequality). Let $X, Y_1, \ldots, Y_m$ be independent random variables of finite entropy taking values in an additive group $G$, such that $H(X + Y_i) \le H(X) + \log K_i$ for all $1 \le i \le m$ and some scalars $K_1,\ldots,K_m \ge 1$. Then $H(X + Y_1 + \ldots + Y_m) \le H(X) + \log K_1 \cdots K_m$.
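Theorem 4.8.2 is easy to spot-check numerically on random distributions over $\mathbf{Z}/N\mathbf{Z}$; a sketch (not from the text), taking $K_i$ to be exactly $\exp(H(X+Y_i) - H(X))$, which is at least $1$ since adding independent noise cannot decrease entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = p[p > 1e-15]
    return -float(np.sum(p * np.log(p)))

def cyclic_convolve(p, q):
    """Distribution of X + Y mod N for independent X ~ p, Y ~ q."""
    N = len(p)
    r = np.zeros(N)
    for i in range(N):
        r[(i + np.arange(N)) % N] += p[i] * q
    return r

rng = np.random.default_rng(1)
N, m = 12, 3
X = rng.random(N); X /= X.sum()
Ys = [y / y.sum() for y in (rng.random(N) for _ in range(m))]

HX = entropy(X)
# define log K_i := H(X + Y_i) - H(X) >= 0, so the hypothesis holds with equality
logK = [entropy(cyclic_convolve(X, y)) - HX for y in Ys]

s = X.copy()
for y in Ys:
    s = cyclic_convolve(s, y)
lhs = entropy(s)
assert lhs <= HX + sum(logK) + 1e-10   # the conclusion of Theorem 4.8.2
```

Of course a numerical check on one group is no substitute for the proof, but it illustrates how the hypothesis and conclusion fit together.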
In fact Theorem 4.8.2 is a bit better than Theorem 4.8.1, in the sense that Theorem 4.8.1 needed to refine the original set $A$ to a subset $A'$ [...]

4.9. Noncommutative Freiman theorem

[...] $y' \in X$ such that $x^{-1} x' = y^{-1} y'$. Thus there are more than $\frac{1}{2}|X|$ elements $y' \in X$ such that $a = x'(y')^{-1}$ for some $x' \in X$ (since $x'$ is uniquely determined by $y'$), and similarly more than $\frac{1}{2}|X|$ elements $y' \in X$ such that $b = y'(z')^{-1}$ for some $z' \in X$. One can therefore find a common $y' \in X$ with $a = x'(y')^{-1}$ and $b = y'(z')^{-1}$, and so $ab = x'(z')^{-1} \in X \cdot X^{-1}$, and the claim follows.
In the course of the above argument we showed that every element of the group $S$ has more than $\frac{1}{2}|X|$ representations of the form $xy^{-1}$ for $x, y \in X$. But there are only $|X|^2$ pairs $(x,y)$ available, and thus $|S| < 2|X|$.
Now let $x$ be any element of $X$. Since $X \cdot x^{-1} \subset S$, we have $X \subset S \cdot x$, and so $X^{-1} \cdot X \subset x^{-1} \cdot S \cdot x$. Conversely, every element of $x^{-1} \cdot S \cdot x$ has exactly $|S|$ representations of the form $z^{-1} w$ where $z, w \in S \cdot x$. Since $X$ occupies more than half of $S \cdot x$, we see from the inclusion-exclusion principle that there is at least one representation $z^{-1} w$ for which $z, w$ both lie in $X$. In other words, $x^{-1} \cdot S \cdot x = X^{-1} \cdot X$, and the claim follows.
To relate this to the classical doubling constants $|X \cdot X|/|X|$, we first make an easy observation:

Lemma 4.9.3. If $|X \cdot X| < 2|X|$, then $X \cdot X^{-1} = X^{-1} \cdot X$.

Again, this is sharp; consider $X$ equal to $\{x, y\}$ where $x, y$ generate a free group.
Proof. Suppose that $xy^{-1}$ is an element of $X \cdot X^{-1}$ for some $x, y \in X$. Then the sets $X \cdot x$ and $X \cdot y$ have cardinality $|X|$ and lie in $X \cdot X$, so by the inclusion-exclusion principle, the two sets intersect. Thus there exist $z, w \in X$ such that $zx = wy$, thus $xy^{-1} = z^{-1}w \in X^{-1} \cdot X$. This shows that $X \cdot X^{-1}$ is contained in $X^{-1} \cdot X$. The converse inclusion is proven similarly.
Proposition 4.9.4. If $|X \cdot X| < \frac{3}{2}|X|$, then $S := X \cdot X^{-1}$ is a finite group of order $|X \cdot X|$, and $X \subset S \cdot x = x \cdot S$ for some $x$ in the normaliser of $S$.

The factor $\frac{3}{2}$ is sharp, by the same example used to show sharpness of Proposition 4.9.1. However, there seems to be some room for further improvement if one weakens the conclusion a bit; see below the fold.
Proof. Let $S = X^{-1} \cdot X = X \cdot X^{-1}$ (the two sets being equal by Lemma 4.9.3). By the argument used to prove Lemma 4.9.3, every element of $S$ has more than $\frac{1}{2}|X|$ representations of the form $xy^{-1}$ for $x, y \in X$. By the argument used to prove Proposition 4.9.1, this shows that $S$ is a group; also, since there are only $|X|^2$ pairs $(x,y)$, we also see that $|S| < 2|X|$.

Pick any $x \in X$; then $x^{-1} \cdot X, X \cdot x^{-1} \subset S$, and so $X \subset x \cdot S, S \cdot x$. Because every element of $x \cdot S \cdot x$ has $|S|$ representations of the form $yz$ with $y \in x \cdot S$, $z \in S \cdot x$, and $X$ occupies more than half of $x \cdot S$ and of $S \cdot x$, we conclude that each element of $x \cdot S \cdot x$ lies in $X \cdot X$, and so $X \cdot X = x \cdot S \cdot x$ and $|S| = |X \cdot X|$.

The intersection of the groups $S$ and $x \cdot S \cdot x^{-1}$ contains $X \cdot x^{-1}$, which is more than half the size of $S$, and so we must have $S = x \cdot S \cdot x^{-1}$, i.e. $x$ normalises $S$, and the proposition follows.
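Because the statements here are finite and elementary, Lemma 4.9.3 and Proposition 4.9.4 can be verified exhaustively on a small non-abelian group; a sketch (not from the text) over the symmetric group $S_3$:

```python
from itertools import combinations, permutations

# the symmetric group S_3: elements as tuples, composition as multiplication
G = list(permutations(range(3)))
def mul(a, b):                     # (a*b)(i) = a(b(i))
    return tuple(a[b[i]] for i in range(3))
def inv(a):
    r = [0] * 3
    for i, ai in enumerate(a):
        r[ai] = i
    return tuple(r)
def setmul(A, B):
    return {mul(a, b) for a in A for b in B}

checked = 0
for r in range(1, len(G) + 1):
    for Xt in combinations(G, r):
        X = set(Xt)
        XX = setmul(X, X)
        if 2 * len(XX) < 3 * len(X):          # hypothesis |XX| < (3/2)|X|
            Xinv = {inv(x) for x in X}
            S = setmul(X, Xinv)
            assert S == setmul(Xinv, X)       # Lemma 4.9.3 (a fortiori)
            assert setmul(S, S) == S          # S closed, hence a group
            assert len(S) == len(XX)          # Proposition 4.9.4
            assert any(X <= setmul(S, {x}) for x in G)  # X inside a coset of S
            checked += 1
assert checked > 0
```

Every subset of $S_3$ satisfying the doubling hypothesis passes all the conclusions, as the proposition guarantees.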
Because the arguments here are so elementary, they extend easily to the infinitary setting in which $X$ is now an infinite set, but has finite measure with respect to some translation-invariant Keisler measure $\mu$. We omit the details. (I am hoping that this observation may help simplify some of the theory in that setting.)
4.9.1. Beyond the 3/2 barrier. It appears that one can push the arguments a bit beyond the $3/2$ barrier, though of course one has to weaken the conclusion in view of the counterexample in Remark 4.9.2. Here I give a result that increases $3/2 = 1.5$ to the golden ratio $\phi := (1+\sqrt{5})/2 = 1.618\ldots$:
Proposition 4.9.5 (Weak non-commutative Kneser theorem). If $|X^{-1} \cdot X|, |X \cdot X^{-1}| \le K|X|$ for some $1 < K < \phi$, then $X \cdot X^{-1} = H \cdot Z$ for some finite subgroup $H$, and some finite set $Z$ with $|Z| \le C(K)$ for some $C(K)$ depending only on $K$.
Proof. Write $S := X \cdot X^{-1}$. Let us say that $h$ symmetrises $S$ if $h \cdot S = S$, and let $H$ be the set of all $h$ that symmetrise $S$. It is clear that $H$ is a finite group with $H \cdot S = S$ and thus $S \cdot H = S$ also.

For each $z \in S$, let $r(z)$ be the number of representations of $z$ of the form $z = xy^{-1}$ with $x, y \in X$. Double counting shows that $\sum_{z \in S} r(z) = |X|^2$, while by hypothesis $|S| \le K|X|$; thus the average value of $r(z)$ is at least $|X|/K$. Since $1 < K < \phi$, $1/K > K-1$. Since $r(z) \le |X|$ for all $z$, we conclude that $r(z) > (K-1)|X|$ for at least $c(K)|X|$ values of $z \in S$, for some explicitly computable $c(K) > 0$.

Suppose $z, w \in S$ are such that $r(z) > (K-1)|X|$, thus $z$ has more than $(K-1)|X|$ representations of the form $xy^{-1}$ with $x, y \in X$. On the other hand, the argument used to prove Proposition 4.9.1 shows that $w$ has at least $(2-K)|X|$ representations of the form $x'(y')^{-1}$ with $x', y'$
[...]

4.11. Sunflowers

[...] $\lfloor\sqrt{p}\rfloor$ without using rational numbers of unbounded height.

For similar reasons, the notion of linear independence over the rationals doesn't initially look very interesting over $\mathbf{Z}/p\mathbf{Z}$: any two non-zero elements of $\mathbf{Z}/p\mathbf{Z}$ are of course rationally dependent. But again, if one restricts attention to rational numbers of bounded height, then independence begins to emerge: for instance, $1$ and $\lfloor\sqrt{p}\rfloor$ are independent in this sense.
Thus, it becomes natural to ask whether there is a quantitative analogue of Theorem 4.11.1, with non-trivial content in the case of vector spaces over the bounded height rationals such as $\mathbf{Z}/p\mathbf{Z}$, which asserts that given any bounded collection $v_1, \ldots, v_n$ of elements, one can find another set $w_1, \ldots, w_k$ which is linearly independent over the rationals up to some height, such that the $v_1, \ldots, v_n$ can be generated by the $w_1, \ldots, w_k$ over the rationals up to some height. Of course, to make this rigorous, one needs to quantify the two heights here, the one giving the independence, and the one giving the generation. In order to be useful for applications, it turns out that one often needs the former height to be much larger than the latter; exponentially larger, for instance, is not an uncommon request. Fortunately, one can accomplish this, at the cost of making the height somewhat large:
Theorem 4.11.2 (Finite generation implies finite basis, finitary version). Let $n \ge 1$ be an integer, and let $F: \mathbf{N} \to \mathbf{N}$ be a function. Let $V$ be an abelian group which admits a well-defined division operation by any natural number of size at most $C(F,n)$ for some constant $C(F,n)$ depending only on $F, n$; for instance one can take $V = \mathbf{Z}/p\mathbf{Z}$ for $p$ a prime larger than $C(F,n)$. Let $v_1, \ldots, v_n$ be a finite collection of vectors in $V$. Then there exists a collection $w_1, \ldots, w_k$ of vectors in $V$, with $1 \le k \le n$, as well as an integer $M \ge 1$, such that

- (Complexity bound) $M \le C(F,n)$ for some $C(F,n)$ depending only on $F, n$.
- ($w$ generates $v$) Every $v_j$ can be expressed as a rational linear combination of the $w_1, \ldots, w_k$ of height at most $M$ (i.e. the numerator and denominator of the coefficients are at most $M$).
- ($w$ independent) There is no non-trivial linear relation $a_1 w_1 + \ldots + a_k w_k = 0$ among the $w_1, \ldots, w_k$ in which the $a_1, \ldots, a_k$ are rational numbers of height at most $F(M)$.

In fact, one can take $w_1, \ldots, w_k$ to be a subset of the $v_1, \ldots, v_n$.
Proof. We perform the same rank reduction argument as before, but translated to the finitary setting. Start with $w_1, \ldots, w_k$ initialised to $v_1, \ldots, v_n$ (so initially we have $k = n$), and initialise $M = 1$. Clearly $w$ generates $v$ at this height. If the $w_i$ are linearly independent up to rationals of height $F(M)$ then we are done. Otherwise, there is a non-trivial linear relation between them; after shuffling things around, we see that one of the $w_i$, say $w_k$, is a rational linear combination of the $w_1, \ldots, w_{k-1}$, whose height is bounded by some function depending on $F(M)$ and $k$. In such a case, $w_k$ becomes redundant, and we may delete it (reducing the rank $k$ by one), but note that in order for the remaining $w_1, \ldots, w_{k-1}$ to generate $v_1, \ldots, v_n$ we need to raise the height upper bound for the rationals involved from $M$ to some quantity $M'$
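The rank reduction loop can be illustrated in a toy model where $V = \mathbf{Q}^m$ and the height bookkeeping (the whole point of the finitary theorem) is suppressed; a sketch, not the theorem's algorithm itself, with `find_relation` a hypothetical helper doing exact Gaussian elimination over the rationals:

```python
from fractions import Fraction

def find_relation(vectors):
    """Coefficients of a non-trivial rational relation sum a_i v_i = 0
    among vectors in Q^m, or None if they are linearly independent."""
    k, m = len(vectors), len(vectors[0])
    # augment each row with the standard basis to track the coefficients
    rows = [[Fraction(x) for x in v] + [Fraction(int(i == j)) for j in range(k)]
            for i, v in enumerate(vectors)]
    pivots = 0
    for col in range(m):
        piv = next((i for i in range(pivots, k) if rows[i][col] != 0), None)
        if piv is None:
            continue
        rows[pivots], rows[piv] = rows[piv], rows[pivots]
        for i in range(k):
            if i != pivots and rows[i][col] != 0:
                f = rows[i][col] / rows[pivots][col]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[pivots])]
        pivots += 1
    for row in rows:
        if all(x == 0 for x in row[:m]) and any(x != 0 for x in row[m:]):
            return row[m:]
    return None

def reduce_to_basis(vectors):
    """Delete one vector per detected relation, as in the rank reduction."""
    w = [list(map(Fraction, v)) for v in vectors]
    while True:
        rel = find_relation(w)
        if rel is None:
            return w
        drop = next(j for j in range(len(w)) if rel[j] != 0)
        del w[drop]

vs = [(1, 2), (2, 4), (0, 1), (3, 5)]
basis = reduce_to_basis(vs)
assert len(basis) == 2 and find_relation(basis) is None
```

In the finitary setting one additionally tracks, at each deletion, how much the generation height $M$ must grow, which is what forces the tower-type bound $C(F, n)$.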
$\prod_{K \in \mathbf{N}} V_K / p$, defined as the space of sequences $v = (v_K)_{K \in \mathbf{N}}$ with $v_K \in V_K$ for each $K$, modulo equivalence by $p$; thus two sequences $v = (v_K)_{K \in \mathbf{N}}$ and $v' = (v'_K)_{K \in \mathbf{N}}$ are considered equal if $v_K = v'_K$ for a $p$-large set of $K$ (i.e. for a set of $K$ that lies in $p$).

Now that non-standard objects are in play, we will need to take some care to distinguish between standard objects (e.g. standard natural numbers) and their nonstandard counterparts.
Since each of the $V_K$ is an abelian group, $V$ is also an abelian group (an easy special case of the transfer principle). Since each $V_K$ is divisible up to height $K$, $V$ is divisible up to all (standard) heights; in other words, $V$ is actually a vector space over the (standard) rational numbers $\mathbf{Q}$. The point is that while none of the $V_K$ are, strictly speaking, vector spaces over $\mathbf{Q}$, they increasingly behave as if they were such spaces, and in the limit one recovers genuine vector space structure.
For each $1 \le i \le n$, one can take an ultralimit of the elements $v_{i,K} \in V_K$ to generate an element $v_i := (v_{i,K})_{K \in \mathbf{N}}$ of the ultraproduct $V$. So now we have $n$ vectors $v_1, \ldots, v_n$ of a vector space $V$ over $\mathbf{Q}$ - precisely the setting of Theorem 4.11.1! So we apply that theorem and obtain a subcollection $w_1, \ldots, w_k \in V$ of the $v_1, \ldots, v_n$, such that each $v_i$ can be generated from the $w_1, \ldots, w_k$ using (standard) rationals, and such that the $w_1, \ldots, w_k$ are linearly independent over the (standard) rationals.
Since all (standard) rationals have a finite height, one can find a (standard) natural number $M$ such that each of the $v_i$ can be generated from the $w_1, \ldots, w_k$ using (standard) rationals of height at most $M$. Undoing the ultralimit, we conclude that for a $p$-large set of $K$'s, all of the $v_{i,K}$ can be generated from the $w_{1,K}, \ldots, w_{k,K}$ using rationals of height at most $M$. But by hypothesis, this implies that for all sufficiently large $K$ in this $p$-large set, the $w_{1,K}, \ldots, w_{k,K}$ contain a non-trivial rational dependence of height at most $F(M)$, thus
$$\frac{a_{1,K}}{q_{1,K}} w_{1,K} + \ldots + \frac{a_{k,K}}{q_{k,K}} w_{k,K} = 0$$
for some integers $a_{i,K}, q_{i,K}$ of magnitude at most $F(M)$, with the $a_{i,K}$ not all zero.
By the pigeonhole principle (and the finiteness of $F(M)$), each of the $a_{i,K}, q_{i,K}$ is constant in $K$ on a $p$-large set of $K$. So if we take an ultralimit again to go back to the nonstandard world, the quantities $a_i := (a_{i,K})_{K \in \mathbf{N}}$, $q_i := (q_{i,K})_{K \in \mathbf{N}}$ are standard integers (rather than merely nonstandard integers). Thus we have
$$\frac{a_1}{q_1} w_1 + \ldots + \frac{a_k}{q_k} w_k = 0$$
with the $a_i$ not all zero, i.e. we have a linear dependence amongst the $w_1, \ldots, w_k$. But this contradicts Theorem 4.11.1.
4.11.2. Polynomials over finite fields. Let $\mathbf{F}$ be a fixed finite field (e.g. the field $\mathbf{F}_2$ of two elements), and consider a high-dimensional finite vector space $V$ over $\mathbf{F}$. A polynomial $P: \mathbf{F}^n \to \mathbf{F}$ of degree $d$ can then be defined as a combination of monomials each of degree at most $d$, or alternatively as a function whose $d+1^{th}$ derivative vanishes; see Section 1.12 of Poincaré's Legacies, Vol. I for some discussion of this equivalence.

We define the rank $\operatorname{rank}_{d-1}(P)$ of a degree $d$ polynomial $P$ to be the least number $k$ of degree $d-1$ polynomials $Q_1, \ldots, Q_k$, such that $P$ is completely determined by $Q_1, \ldots, Q_k$, i.e. $P = f(Q_1, \ldots, Q_k)$ for some function $f: \mathbf{F}^k \to \mathbf{F}$. In the case when $P$ has degree $2$, this concept is very close to the familiar rank of a quadratic form or matrix.
A generalisation of the notion of linear independence is that of linear independence modulo low rank. Let us call a collection $P_1, \ldots, P_n$ of degree $d$ polynomials $M$-linearly independent if every non-trivial linear combination $a_1 P_1 + \ldots + a_n P_n$ with $a_1, \ldots, a_n \in \mathbf{F}$ not all zero, has rank at least $M$:
$$\operatorname{rank}_{d-1}(a_1 P_1 + \ldots + a_n P_n) \ge M.$$
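For degree $d = 2$ over $\mathbf{F}_2$ the rank $\operatorname{rank}_{d-1}$ can be computed by brute force on a tiny space: represent a polynomial by its truth table on $\mathbf{F}_2^3$, and search for the fewest affine forms that determine it. A sketch (not from the text):

```python
from itertools import combinations, product

n = 3
points = list(product((0, 1), repeat=n))

def affine_table(coeffs):
    """Truth table of a0 + a1 x1 + ... + an xn over F_2."""
    a0, *a = coeffs
    return tuple((a0 + sum(ai * xi for ai, xi in zip(a, x))) % 2 for x in points)

linears = [affine_table(c) for c in product((0, 1), repeat=n + 1)]

def determined_by(P, Qs):
    """True iff P(x) is a function of the tuple (Q_1(x), ..., Q_k(x))."""
    seen = {}
    for idx in range(len(points)):
        key = tuple(Q[idx] for Q in Qs)
        if seen.setdefault(key, P[idx]) != P[idx]:
            return False
    return True

def rank_low(P):
    """rank_{d-1}(P) for d = 2: least k with P = f(Q_1,...,Q_k), Q_i affine."""
    for k in range(n + 1):
        if any(determined_by(P, Qs) for Qs in combinations(linears, k)):
            return k
    return n   # unreachable: the n coordinate functions always suffice

P = tuple((x[0] * x[1]) % 2 for x in points)           # x1 x2
P2 = tuple((x[0] * x[1] + x[2]) % 2 for x in points)   # x1 x2 + x3
assert rank_low(P) == 2    # matches the rank of the quadratic form x1 x2
assert rank_low(P2) == 3
```

This matches the remark that for quadratics the notion is close to the rank of a quadratic form: $x_1 x_2$ needs the two linear forms $x_1, x_2$.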
There is then the following analogue of Theorem 4.11.2:

Theorem 4.11.5 (Polynomial regularity lemma at one degree, finitary version). Let $n, d \ge 1$ be integers, let $\mathbf{F}$ be a finite field, and let $F: \mathbf{N} \to \mathbf{N}$ be a function. Let $V$ be a vector space over $\mathbf{F}$, and let $P_1, \ldots, P_n: V \to \mathbf{F}$ be polynomials of degree $d$. Then there exists a collection $Q_1, \ldots, Q_k: V \to \mathbf{F}$ of polynomials of degree $d$, with $1 \le k \le n$, as well as an integer $M \ge 1$, such that

- (Complexity bound) $M \le C(\mathbf{F}, n, d, F)$ for some $C(\mathbf{F}, n, d, F)$ depending only on $\mathbf{F}, n, d, F$.
- ($Q$ generates $P$) Every $P_j$ can be expressed as an $\mathbf{F}$-linear combination of the $Q_1, \ldots, Q_k$, plus an error $E$ which has rank $\operatorname{rank}_{d-1}(E)$ at most $M$.
- ($Q$ independent) There is no non-trivial linear relation $a_1 Q_1 + \ldots + a_k Q_k = E$ among the $Q_1, \ldots, Q_k$ in which $E$ has rank $\operatorname{rank}_{d-1}(E)$ at most $F(M)$.

In fact, one can take $Q_1, \ldots, Q_k$ to be a subset of the $P_1, \ldots, P_n$.
This theorem can be proven in much the same way as Theorem 4.11.2, and the reader is invited to do so as an exercise. The constant $C(\mathbf{F}, n, d, F)$ can in fact be taken to be independent of $d$ and $\mathbf{F}$, but this is not important to us here.
Roughly speaking, Theorem 4.11.5 asserts that a finite family of degree $d$ polynomials can be expressed as a linear combination of degree $d$ polynomials which are linearly independent modulo low rank errors, plus some lower rank objects. One can think of this as regularising the degree $d$ polynomials, modulo combinations of lower degree polynomials. For applications (and in particular, for understanding the equidistribution) one also needs to regularise the degree $d-1$ polynomials that arise this way, and so forth for increasingly lower degrees until all polynomials are regularised. (A similar phenomenon occurs for the hypergraph regularity lemma.)
When working with theorems like this, it is helpful to think conceptually of quotienting out by all polynomials of low rank. Unfortunately, in the finitary setting, the polynomials of low rank do not form a group, and so the quotient is ill-defined. However, this can be rectified by passing to the infinitary setting. Indeed, once one does so, one can quotient out the low rank polynomials, and Theorem 4.11.5 follows directly from Theorem 4.11.1 (or more precisely, the analogue of that theorem in which the field of rationals $\mathbf{Q}$ is replaced by the finite field $\mathbf{F}$).
Let's see how this works. To prove Theorem 4.11.5, suppose for contradiction that the theorem failed. Then one can find $\mathbf{F}, n, d, F$, such that for every natural number $K$, one can find a vector space $V_K$ and polynomials $P_{1,K}, \ldots, P_{n,K}: V_K \to \mathbf{F}$ of degree $d$, for which there do not exist polynomials $Q_{1,K}, \ldots, Q_{k,K}$ with $k \le n$ and an integer $M \le K$ such that each $P_{j,K}$ can be expressed as a linear combination of the $Q_{i,K}$ modulo an error of rank at most $M$, and such that there are no nontrivial linear relations amongst the $Q_{i,K}$ modulo errors of rank at most $F(M)$.

Taking an ultralimit as before, we end up with a (nonstandard) vector space $V$ over $\mathbf{F}$ (which is likely to be infinite), and (nonstandard) polynomials $P_1, \ldots, P_n: V \to \mathbf{F}$ of degree $d$ (here it is best to use the local definition of a polynomial of degree $d$, as a (nonstandard) function whose $d+1^{th}$ derivative vanishes, but one can also view this as a (nonstandard) sum of monomials if one is careful).
The space $\operatorname{Poly}_{\le d}(V)$ of (nonstandard) degree $\le d$ polynomials on $V$ is a (nonstandard) vector space over $\mathbf{F}$. Inside this vector space, one has the subspace $\operatorname{Lowrank}_{\le d}(V)$ consisting of all polynomials $P \in \operatorname{Poly}_{\le d}(V)$ whose rank $\operatorname{rank}_{d-1}(P)$ is a standard integer (as opposed to a nonstandard integer); call these the bounded rank polynomials. This is easily seen to be a subspace of $\operatorname{Poly}_{\le d}(V)$ (although it is not a nonstandard or internal subspace, i.e. the ultralimit of subspaces of the $\operatorname{Poly}_{\le d}(V_K)$). As such, one can rigorously form the quotient space $\operatorname{Poly}_{\le d}(V)/\operatorname{Lowrank}_{\le d}(V)$ of degree $\le d$ polynomials, modulo bounded rank $\le d$ polynomials.
The polynomials $P_1, \ldots, P_n$ then have representatives $P_1 \bmod \operatorname{Lowrank}_{\le d}(V), \ldots, P_n \bmod \operatorname{Lowrank}_{\le d}(V)$ in this quotient space. Applying Theorem 4.11.1 (for the field $\mathbf{F}$), one can then find a subcollection $Q_1 \bmod \operatorname{Lowrank}_{\le d}(V), \ldots, Q_k \bmod \operatorname{Lowrank}_{\le d}(V)$ which are linearly independent in this space, and which generate $P_1 \bmod \operatorname{Lowrank}_{\le d}(V), \ldots, P_n \bmod \operatorname{Lowrank}_{\le d}(V)$. Undoing the quotient, we see that the $P_1, \ldots, P_n$ are linear combinations of the $Q_1, \ldots, Q_k$ plus a bounded rank error, while no nontrivial linear combination of $Q_1, \ldots, Q_k$ has bounded rank. Undoing the ultralimit as in the previous section, we obtain the desired contradiction.
We thus see that in the nonstandard world, the somewhat non-rigorous concepts of "low rank" and "high rank" can be formalised as "bounded rank" and "unbounded rank" respectively. Furthermore, the former space forms a subspace, so in the nonstandard world one can rigorously talk about quotienting out by bounded rank errors. Thus we see that the algebraic machinery of quotient spaces can be applied in the nonstandard world directly, whereas in the finitary world it can only be applied heuristically. In principle, one could also start deploying more advanced tools of abstract algebra (e.g. exact sequences, cohomology, etc.) in the nonstandard setting, although this has not yet seriously begun to happen in additive combinatorics (although there are strong hints of some sort of "additive cohomology" emerging in the body of work surrounding the inverse conjecture for the Gowers norm, especially on the ergodic theory side of things).
4.11.3. Sunflowers. Now we return to vector spaces (or approximate vector spaces) $V$ over the rationals, such as $V = \mathbf{Z}/p\mathbf{Z}$ for a large prime $p$. Instead of working with a single (small) tuple $v_1, \ldots, v_n$ of vectors in $V$, we now consider a family $(v_{1,h}, \ldots, v_{n,h})_{h \in H}$ of such vectors in $V$, where $H$ ranges over a large set, for instance a dense subset of the interval $X := [-N, N] = \{-N, \ldots, N\}$ for some large $N$. This situation happens to show up in our recent work on the inverse conjecture for the Gowers norm, where the $v_{1,h}, \ldots, v_{n,h}$ represent the various frequencies that arise in a derivative $\Delta_h f$ of a function $f$ with respect to the shift $h$. (This need to consider families is an issue that also comes up in the finite field ergodic theory analogue [BeTaZi2009] of the inverse conjectures, due to the unbounded number of generators in that case, but interestingly can be avoided in the ergodic theory over $\mathbf{Z}$.)
In Theorem 4.11.2, the main distinction was between linear dependence and linear independence of the tuple $v_1, \ldots, v_n$ (or some reduction of this tuple, such as $w_1, \ldots, w_k$). We will continue to be interested in the linear dependence or independence of the tuples $v_{1,h}, \ldots, v_{n,h}$ for various $h$. But we also wish to understand how the $v_{i,h}$ vary with $h$ as well. At one extreme (the "structured" case), there is no dependence on $h$: $v_{i,h} = v_i$ for all $i$ and all $h$. At the other extreme (the "pseudorandom" case), the $v_{i,h}$ are basically independent as $h$ varies; in particular, for (almost) all of the pairs $h, h' \in H$, the tuples $v_{1,h}, \ldots, v_{n,h}$ and $v_{1,h'}, \ldots, v_{n,h'}$ are not just separately independent, but are jointly independent. One can think of $v_{1,h}, \ldots, v_{n,h}$ and $v_{1,h'}, \ldots, v_{n,h'}$ as being in "general position" relative to each other.
The sunflower lemma asserts that any family $(v_{1,h}, \ldots, v_{n,h})_{h \in H}$ is basically a combination of the above scenarios; thus one can divide the family into a linearly independent core collection of vectors $(w_1, \ldots, w_m)$ that do not depend on $h$, together with petals $(v'_{1,h}, \ldots, v'_{k,h})_{h \in H'}$, which are in general position in the above sense relative to the core. However, as a price one pays for this, one has to refine $H$ to a dense subset $H'$. More precisely, one can find core vectors $w_1, \ldots, w_m$, a dense subset $H'$ of $H$, petals $(v'_{1,h}, \ldots, v'_{k,h})$ for each $h \in H'$, as well as an integer $M \ge 1$, such that

- (Complexity bound) $M \le C(F, n)$ for some $C(F, n)$ depending only on $F, n$.
- ($H'$ generates $v$) Every $v_{j,h}$ with $1 \le j \le n$ and $h \in H'$ can be expressed as a rational linear combination of the $w_1, \ldots, w_m$ and $v'_{1,h}, \ldots, v'_{k,h}$ of height at most $M$.
- ($w$ independent) There is no non-trivial rational linear relation among the $w_1, \ldots, w_m$ of height at most $F(M)$.
- ($v'$ in general position) For all pairs $(h, h')$ in $H' \times H'$ outside of an exceptional set of at most $|H'|^2/F(M)$ pairs, there is no non-trivial linear relation among $w_1, \ldots, w_m, v'_{1,h}, \ldots, v'_{k,h}, v'_{1,h'}, \ldots, v'_{k,h'}$ of height at most $F(M)$.

In fact, one can take the $v'_{1,h}, \ldots, v'_{k,h}$ to be a subcollection of the $v_{1,h}, \ldots, v_{n,h}$, though this is not particularly useful in applications.
Proof. We perform a two-parameter "rank reduction argument", where the rank is indexed by the pair $(k, m)$ (ordered lexicographically). We initially set $m = 0$, $k = n$, $H' = H$, $M = 1$, and $v'_{i,h} = v_{i,h}$ for $h \in H$.

At each stage of the iteration, we test whether the $w$ are independent and whether the $v'$ are in general position relative to $w$. If there is a linear relation of $w$ at height $F(M)$, then one can use this to reduce the size $m$ of the core by one, leaving the petal size $k$ unchanged, just as in the proof of Theorem 4.11.2. So let us move on, and suppose that there is no linear relation of $w$ at height $F(M)$, but instead there is a failure of the general position hypothesis. In other words, for at least $|H'|^2/F(M)$ pairs $(h, h') \in H' \times H'$, one can find a relation of the form
$$a_{1,h,h'} w_1 + \ldots + a_{m,h,h'} w_m + b_{1,h,h'} v'_{1,h} + \ldots + b_{k,h,h'} v'_{k,h} + c_{1,h,h'} v'_{1,h'} + \ldots + c_{k,h,h'} v'_{k,h'} = 0$$
where the $a_{i,h,h'}, b_{i,h,h'}, c_{i,h,h'}$ are rationals of height at most $F(M)$, not all zero. The number of possible values for such rationals is bounded by some quantity depending on $m, k, F(M)$. Thus, by the pigeonhole principle, we can find $\gg_{F(M),m,k} |H'|^2$ pairs (i.e. at least $c(F(M),m,k)|H'|^2$ pairs for some $c(F(M),m,k) > 0$ depending only on $F(M), m, k$) such that
$$a_1 w_1 + \ldots + a_m w_m + b_1 v'_{1,h} + \ldots + b_k v'_{k,h} + c_1 v'_{1,h'} + \ldots + c_k v'_{k,h'} = 0$$
for some fixed rationals $a_i, b_i, c_i$ of height at most $F(M)$. By the pigeonhole principle again, we can then find a fixed $h_0 \in H'$ such that
$$a_1 w_1 + \ldots + a_m w_m + b_1 v'_{1,h} + \ldots + b_k v'_{k,h} = u_{h_0}$$
for all $h$ in some subset $H''$ of $H'$ with $|H''| \gg_{F(M),m,k} |H'|$, where
$$u_{h_0} := -c_1 v'_{1,h_0} - \ldots - c_k v'_{k,h_0}.$$
If the $b_i$ and $c_i$ all vanished, then we would have a linear dependence amongst the core vectors, which we already know how to deal with. So suppose that we have at least one active petal coefficient, say $b_k$. Then upon rearranging, we can express $v'_{k,h}$ as a rational linear combination of the original core vectors $w_1, \ldots, w_m$, a new core vector $u_{h_0}$, and the other petals $v'_{1,h}, \ldots, v'_{k-1,h}$, with heights bounded by $\ll_{F(M),k,m} 1$. We may thus refine $H'$ to $H''$, delete the petal vector $v'_{k,h}$, and add the vector $u_{h_0}$ to the core, thus decreasing $k$ by one and increasing $m$ by one. One still has the generation property so long as one replaces $M$ with a larger $M'$ depending on $M, F(M), k, m$.
Since each iteration of this process either reduces $m$ by one keeping $k$ fixed, or reduces $k$ by one while increasing $m$ by one, we see that after at most $2n$ steps the process must terminate, when we have both the linear independence property of the $w$ and the general position property of the $v'$, and the claim follows.

We now turn to the nonstandard version of this result. Let $H$ be a nonstandardly finite set (i.e. the ultralimit of finite sets $H_K$), let $V$ be a vector space, and let $(v_{1,h}, \ldots, v_{n,h})_{h \in H}$ be vectors in $V$ depending internally on $h$, where $n$ is standard. Call a subset $H'$ of $H$ dense if $|H'| \geq \varepsilon |H|$ for some standard $\varepsilon > 0$ (recall that $|H'|, |H|$ are nonstandard natural numbers). The claim is then that one can find a dense subset $H'$ of $H$, a bounded-dimensional subspace $W$ of $V$, a (standard) integer $k \geq 0$ with $\dim(W) + k \leq n$, and a collection of petal vectors $(v'_{1,h}, \ldots, v'_{k,h})$ for each $h \in H'$, with the maps $h \mapsto v'_{i,h}$ being internal for all $1 \leq i \leq k$, such that:

(W, v' generates v) Every $v_{j,h}$ with $1 \leq j \leq n$ and $h \in H'$ lies in the rational span of $W$ and $v'_{1,h}, \ldots, v'_{k,h}$.

(v' in general position) For almost all pairs $(h, h') \in H' \times H'$, the vectors $v'_{1,h}, \ldots, v'_{k,h}, v'_{1,h'}, \ldots, v'_{k,h'}$ are linearly independent modulo $W$ over the rationals.

To prove this claim, define a partial representation to be a dense subset $H'$ of $H$, a bounded-dimensional subspace $W$ of $V$, a standard integer $k \geq 0$ with $\dim(W) + k \leq n$, and a collection of petal vectors $(v'_{1,h}, \ldots, v'_{k,h})_{h \in H'}$ depending internally on $h$ that obeys the generation property (but not necessarily the general position property).
Clearly we have at least one partial representation, namely the trivial one where $W$ is trivial, $k = n$, $H' := H$, and $v'_{i,h} := v_{i,h}$. Now, among
all such partial representations, let us take a representation with the
minimal value of k. (Here we are of course using the well-ordering
property of the standard natural numbers.) We claim that this rep-
resentation enjoys the general position property, which will give the
claim.
Indeed, suppose this was not the case. Then, for many pairs $(h, h') \in H' \times H'$, the vectors $v'_{1,h}, \ldots, v'_{k,h}, v'_{1,h'}, \ldots, v'_{k,h'}$ have a linear dependence modulo $W$ over Q. (Actually, there is a technical
measurability issue to address here, which I will return to later.)
By symmetry and pigeonholing, we may assume that the $v'_{k,h}$ coefficient (say) of this dependence is non-zero. (Again, there is a
measurability issue here.) Applying the pigeonhole principle, one can
find $h_0 \in H'$ such that the vectors
$$v'_{1,h}, \ldots, v'_{k,h}, v'_{1,h_0}, \ldots, v'_{k,h_0}$$
have a linear dependence over Q modulo W for many h. (Again,
there is a measurability issue here.)
Fix $h_0$. The number of possible rational linear combinations of $v'_{1,h_0}, \ldots, v'_{k,h_0}$ is countable. Because of this (and a countable pigeonhole principle that I will address below), we can find a fixed rational linear combination $u_{h_0}$ of the $v'_{1,h_0}, \ldots, v'_{k,h_0}$ such that
$$v'_{1,h}, \ldots, v'_{k,h}, u_{h_0}$$
have a linear dependence over Q modulo W for all h in some dense
subset $H''$ of $H'$. One can then refine $H'$ to $H''$, delete the petal vector $v'_{k,h}$, and add the vector $u_{h_0}$ to the core space $W$, thus creating a partial representation with a smaller value of $k$, contradicting minimality, and we are done.
We remark here that whereas the finitary analogue of this result was proven using the method of infinite descent, the nonstandard version could instead be proven using the (equivalent) well-ordering principle. One could easily recast the nonstandard version in descent form also, but it is somewhat more difficult to cast the finitary argument using well-ordering due to the extra parameters and quantifiers in play.
Let us now address the measurability issues. The main prob-
lem here is that the property of having a linear dependence over the
standard rationals Q is not an internal property, because it requires
knowledge of what the standard rationals are, which is not an in-
ternal concept in the language of vector spaces. However, for each
fixed choice of rational coefficients, the property of having a specific linear dependence with those selected coefficients is an internal concept (here we crucially rely on the hypothesis that the maps $h \mapsto v'_{i,h}$ were internal), so really what we have here is a sort of $\sigma$-internal property (a countable union of internal properties). But this is good
enough for many purposes. In particular, we have
Lemma 4.11.8 (Countable pigeonhole principle). Let $H$ be a nonstandardly finite set (i.e. the ultralimit of finite sets $H_K$), and for each standard natural number $n$, let $E_n$ be an internal subset of $H$.
Then one of the following holds:
(Positive density) There exists a natural number $n$ such that $h \in E_n$ for many $h \in H$ (i.e. $E_n$ is a dense subset of $H$).

(Zero density) For almost all $h \in H$, one has $h \notin E_n$ for all $n$. (In other words, the (external) set $\bigcup_{n \in \mathbf{N}} E_n$ is contained in a sparse subset of $H$.)
This lemma is sufficient to resolve all the measurability issues
raised in the previous proof. It is analogous to the trivial statement
in measure theory that given a countable collection of measurable
subsets of a space of positive measure, either one of the measurable
sets has positive measure, or else their union has measure zero (i.e.
the sets fail to cover almost all of the space).
Proof. If any of the $E_n$ are dense, we are done. So suppose this is not the case. Since each $E_n$ is a definable subset of $H$ which is not dense, it is sparse, thus $|E_n| = o(|H|)$. Now it is convenient to undo the ultralimit and work in the finite sets $H_K$ that $H$ is the ultralimit of. Note that each $E_n$, being internal, is also an ultralimit of some finite subsets $E_{n,K}$ of $H_K$.
For each standard integer $M > 0$, the set $E_1 \cup \ldots \cup E_M$ is sparse in $H$, and in particular has density less than $1/M$. Thus, one can find a p-large set $S_M \subset \mathbf{N}$ such that
$$|E_{1,K} \cup \ldots \cup E_{M,K}| \leq |H_K|/M$$
for all $K \in S_M$. One can arrange matters so that the $S_M$ are decreasing in $M$. One then sets the set $E_K$ to equal $E_{1,K} \cup \ldots \cup E_{M,K}$, where $M$ is the largest integer for which $K \in S_M$ (or $E_K$ is empty if $K$ lies in all the $S_M$, or in none), and lets $E$ be the ultralimit of the $E_K$. Then we see that $|E| \leq |H|/M$ for every standard $M$, and so $E$ is a sparse subset of $H$. Furthermore, $E$ contains $E_M$ for every standard $M$, and so we are in the zero density conclusion of the argument.
Remark 4.11.9. Curiously, I don't see how to prove this lemma without unpacking the limit; it doesn't seem to follow just from, say,
the overspill principle. Instead, it seems to be exploiting the weak
countable saturation property I mentioned in Section 4.10. But per-
haps I missed a simple argument.
4.11.4. Summary. Let me summarise with a brief list of pros and
cons of switching to a nonstandard framework. First, the pros:
Many first-order parameters such as $\varepsilon$ or $N$ disappear from view, as do various negligible errors. More importantly, second-order parameters, such as the function $F$ appearing in Theorem 4.11.2, also disappear from view. (In principle, third-order and higher parameters would also disappear, though I do not yet know of an actual finitary argument in my fields of study which would have used such parameters (with the exception of Ramsey theory, where such parameters must come into play in order to generate such enormous quantities as Graham's number).) As such,
a lot of tedious epsilon management disappears.
Iterative (and often parameter-heavy) arguments can often
be replaced by minimisation (or more generally, extremi-
sation) arguments, taking advantage of such properties as
the well-ordering principle, the least upper bound axiom, or
compactness.
The transfer principle lets one use for free any (first-order)
statement about standard mathematics in the non-standard
setting (provided that all objects involved are internal; see
below).
Mature and powerful theories from infinitary mathematics
(e.g. linear algebra, real analysis, representation theory,
topology, functional analysis, measure theory, Lie theory,
ergodic theory, model theory, etc.) can be used rigorously
in a nonstandard setting (as long as one is aware of the usual
infinitary pitfalls, of course; see below).
One can formally define terms that correspond to what would otherwise only be heuristic (or heavily parameterised and quantified) concepts such as "small", "large", "low rank", "independent", "uniformly distributed", etc.
The conversion from a standard result to its nonstandard counterpart, or vice versa, is fairly quick (but see below), and generally needs to be done only once or twice per paper.
Next, the cons:
Often requires the axiom of choice, as well as a certain
amount of set theory. (There are however weakened ver-
sions of nonstandard analysis that can avoid choice that are
still suitable for many applications.)
One needs the machinery of ultralimits and ultraproducts to
set up the conversion from standard to nonstandard struc-
tures.
The conversion usually proceeds by a proof by contradiction,
which (in conjunction with the use of ultralimits) may not
be particularly intuitive.
One cannot efficiently discern what quantitative bounds emerge
from a nonstandard argument (other than by painstakingly
converting it back to a standard one, or by applying the
tools of proof mining). (On the other hand, in particularly
convoluted standard arguments, the quantitative bounds are
already so poor - e.g. of iterated tower-exponential type -
that letting go of these bounds is no great loss.)
One has to take some care to distinguish between standard
and nonstandard objects (and also between internal and
external sets and functions, which are concepts somewhat
analogous to measurable and non-measurable sets and functions in measure theory). More generally, all the usual pitfalls of infinitary analysis (e.g. interchanging limits, or the need to ensure measurability or continuity) emerge in this setting, in contrast to the finitary setting where they are usually completely trivial.
It can be difficult at first to conceptually visualise what nonstandard objects look like (although this becomes easier once one maps nonstandard analysis concepts to heuristic concepts such as "small" and "large" as mentioned earlier;
thus for instance one can think of an unbounded nonstan-
dard natural number as being like an incredibly large stan-
dard natural number).
It is inefficient for both nonstandard and standard argu-
ments to coexist within a paper; this makes things a little
awkward if one for instance has to cite a result from a stan-
dard mathematics paper in a nonstandard mathematics one.
There are philosophical objections to using mathematical
structures that only exist abstractly, rather than correspond-
ing to the real world. (Note though that similar objections
were also raised in the past with regard to the use of, say,
complex numbers, non-Euclidean geometries, or even nega-
tive numbers.)
Formally, there is no increase in logical power gained by
using nonstandard analysis (at least if one accepts the axiom
of choice); anything which can be proven by nonstandard
methods can also be proven by standard ones. In practice,
though, the length and clarity of the nonstandard proof may
be substantially better than the standard one.
In view of the pros and cons, I would not say that nonstandard
analysis is suitable in all situations, nor is it unsuitable in all situ-
ations, but one needs to carefully evaluate the costs and benets in
a given setting; also, in some cases having both a finitary and infinitary proof side by side for the same result may be more valuable
than just having one of the two proofs. My rule of thumb is that if
a finitary argument is already spitting out iterated tower-exponential type bounds or worse, this is a sign that the argument "wants" to be infinitary, and it may be simpler to move over to an infinitary setting (such as the nonstandard setting).
Notes. This article first appeared at terrytao.wordpress.com/2009/12/13.
4.12. The double Duhamel trick and the in/out
decomposition
This is a technical post inspired by separate conversations with Jim
Colliander and with Soonsik Kwon on the relationship between two
techniques used to control non-radiating solutions to dispersive non-
linear equations, namely the double Duhamel trick and the in/out
decomposition. See for instance [KiVi2009] for a survey of these
two techniques and other related methods in the subject. (I should
caution that this article is likely to be unintelligible to anyone not
already working in this area.)
For sake of discussion we shall focus on solutions to a nonlinear
Schrödinger equation
$$iu_t + \Delta u = F(u)$$
and we will not concern ourselves with the specific regularity of the so-
lution u, or the specic properties of the nonlinearity F here. We will
also not address the issue of how to justify the formal computations
being performed here.
Solutions to this equation enjoy the forward Duhamel formula
$$u(t) = e^{i(t-t_0)\Delta} u(t_0) - i \int_{t_0}^{t} e^{i(t-t')\Delta} F(u(t'))\, dt'$$
for times $t$ to the future of $t_0$ in the lifespan of the solution, as well as the backward Duhamel formula
$$u(t) = e^{i(t-t_1)\Delta} u(t_1) + i \int_{t}^{t_1} e^{i(t-t')\Delta} F(u(t'))\, dt'$$
for times $t$ to the past of $t_1$ in the lifespan. For non-radiating solutions, whose linear components converge weakly to zero at the endpoints of the lifespan, these become
$$(4.42)\qquad u(t) = -i \int_{t_0}^{t} e^{i(t-t')\Delta} F(u(t'))\, dt'$$
and
$$(4.43)\qquad u(t) = i \int_{t}^{t_1} e^{i(t-t')\Delta} F(u(t'))\, dt'$$
where $t_0, t_1$ are the endpoint times of existence. (This type of situa-
tion comes up for instance in the Kenig-Merle approach to critical reg-
ularity problems, by reducing to a minimal blowup solution which is
almost periodic modulo symmetries, and hence non-radiating.) These
types of non-radiating solutions are propelled solely by their own non-
linear self-interactions from the immediate past or immediate future;
they are generalisations of nonlinear bound states such as solitons.
A key task is then to somehow combine the forward represen-
tation (4.42) and the backward representation (4.43) to obtain new
information on u(t) itself, that cannot be obtained from either rep-
resentation alone; it seems that the immediate past and immediate
future can collectively exert more control on the present than they
each do separately. This type of problem can be abstracted as fol-
lows. Let $\|u(t)\|_{Y_+}$ be the infimal value of $\|F_+\|_N$ over all forward representations of $u(t)$ of the form
$$(4.44)\qquad u(t) = \int_{t_0}^{t} e^{i(t-t')\Delta} F_+(t')\, dt',$$
and similarly let $\|u(t)\|_{Y_-}$ be the infimal value of $\|F_-\|_N$ over all backward representations of $u(t)$ of the form
$$(4.45)\qquad u(t) = \int_{t}^{t_1} e^{i(t-t')\Delta} F_-(t')\, dt'.$$
Typically, one already has (or is willing to assume as a bootstrap hypothesis) control on $F(u)$ in the norm $N$, which gives control of $u(t)$ in the norms $Y_+, Y_-$. The task is then to use this control to control $u(t)$ in some other norm $X$, and in particular to establish an estimate of the form
$$(4.46)\qquad \|u(t)\|_X^2 \lesssim \|u(t)\|_{Y_+} \|u(t)\|_{Y_-}.$$
Expanding $\|u(t)\|_X^2 = \langle u(t), u(t) \rangle_X$ using a forward representation (4.44) for one copy of $u(t)$ and a backward representation (4.45) for the other, we see that (4.46) would follow from the bound
$$(4.47)\qquad \int_{t_0}^{t} \int_{t}^{t_1} |\langle e^{i(t-t')\Delta} F_+(t'), e^{i(t-t'')\Delta} F_-(t'') \rangle_X|\, dt''\, dt' \lesssim \|F_+\|_N \|F_-\|_N.$$
The dispersive nature of the linear Schrödinger equation often causes $\langle e^{i(t-t')\Delta} F_+(t'), e^{i(t-t'')\Delta} F_-(t'') \rangle_X$ to decay in $|t' - t''|$, especially in high dimensions. In high enough dimension (typically one needs five or higher dimensions, unless one already has some spacetime control on the solution), the decay is stronger than $1/|t'-t''|^2$, so that the integrand becomes absolutely integrable and one recovers (4.47).
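As a numerical aside (my own illustration, not part of the text: the one-dimensional Gaussian profile and the function names are assumptions of the example), one can watch this dispersive decay in a model case by evaluating the free-evolution correlation $\int e^{-ik^2 t} |\hat{f}(k)|^2\, dk$, whose modulus decays like $|t|^{-d/2}$ by stationary phase; the requirement of decay faster than $1/|t'-t''|^2$ then corresponds to $d/2 > 2$, i.e. dimension five or higher.

```python
import numpy as np

def correlation(t, k):
    """Riemann-sum approximation to the 1D correlation
    <e^{it Delta} f, f> = int e^{-i k^2 t} |fhat(k)|^2 dk
    for Gaussian data with |fhat(k)|^2 = e^{-k^2}."""
    dk = k[1] - k[0]
    return np.sum(np.exp(-1j * k * k * t - k * k)) * dk

# A fine frequency grid; the Gaussian tail beyond |k| = 10 is negligible.
k = np.linspace(-10.0, 10.0, 400001)

# Stationary phase at k = 0 predicts decay ~ t^{-1/2} in dimension d = 1.
t1, t2 = 100.0, 400.0
exponent = np.log(abs(correlation(t2, k)) / abs(correlation(t1, k))) / np.log(t2 / t1)
print(exponent)  # close to -0.5
```

In $d$ dimensions the same computation produces the exponent $-d/2$, so absolute integrability of the double Duhamel integrand requires $d \geq 5$, matching the remark above.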
Unfortunately it appears that estimates of the form (4.47) fail
in low dimensions (for the type of norms N that actually show up
in applications); there is just too much interaction between past and
future to hope for any reasonable control of this inner product. But
one can try to obtain (4.46) by other means. By the Hahn-Banach theorem (and ignoring various issues related to reflexivity), (4.46) is equivalent to the assertion that every $u \in X$ can be decomposed as $u = u_+ + u_-$, where $\|u_+\|_{Y_+^*} \lesssim \|u\|_X$ and $\|u_-\|_{Y_-^*} \lesssim \|u\|_X$. Indeed, once one has such a decomposition, one obtains (4.46) by computing the inner product of $u(t)$ with $u = u_+ + u_-$, using duality to write $\|u_+\|_{Y_+^*}$ as $\|e^{i(\cdot - t)\Delta} u_+\|_{N^*([t_0,t])}$ and similarly to write $\|u_-\|_{Y_-^*}$ as $\|e^{i(\cdot - t)\Delta} u_-\|_{N^*([t,t_1])}$.
So one can dualise the task of proving (4.46) as that of obtaining a decomposition of an arbitrary initial state $u$ into two components $u_+$ and $u_-$, where the former disperses into the past and the latter disperses into the future under the linear evolution. We do not know how to achieve this type of task efficiently in general - and doing so would likely lead to a significant advance in the subject (perhaps
one of the main areas in this topic where serious harmonic analysis is likely to play a major role). But in the model case of spherically symmetric data $u$, one can perform such a decomposition quite easily: one uses microlocal projections to set $u_+$ to be the inward pointing component of $u$, which propagates towards the origin in the future and away from the origin in the past, and $u_-$ to similarly be the outward component of $u$. As spherical symmetry significantly di-
lutes the amplitude of the solution (and hence the strength of the
nonlinearity) away from the origin, this decomposition tends to work
quite well for applications, and is one of the main reasons (though
not the only one) why we have a global theory for low-dimensional
nonlinear Schrödinger equations in the radial case, but not in general.
The in/out decomposition is a linear one, but the Hahn-Banach
argument gives no reason why the decomposition needs to be linear.
(Note that other well-known decompositions in analysis, such as the
Fefferman-Stein decomposition of BMO, are necessarily nonlinear, a
fact which is ultimately equivalent to the non-complemented nature
of a certain subspace of a Banach space; see Section 1.7.) So one could
imagine a sophisticated nonlinear decomposition as a general substi-
tute for the in/out decomposition. See for instance [BoBr2003] for
some of the subtleties of decomposition even in very classical func-
tion spaces such as $H^{1/2}(\mathbf{R})$. Alternatively, there may well be a third
way to obtain estimates of the form (4.46) that do not require either
decomposition or the double Duhamel trick; such a method may well
clarify the relative relationship between past, present, and future for
critical nonlinear dispersive equations, which seems to be a key aspect
of the theory that is still only partially understood. (In particular, it
seems that one needs a fairly strong decoupling of the present from
both the past and the future to get the sort of elliptic-like regularity
results that allow us to make further progress with such equations.)
Notes. This article first appeared at terrytao.wordpress.com/2009/12/17.
Thanks to Kareem Carr, hezhigang, and anonymous commenters for
corrections.
4.13. The free nilpotent group
In a multiplicative group $G$, the commutator of two group elements $g, h$ is defined as $[g, h] := g^{-1} h^{-1} g h$ (other conventions are also in use, though they are largely equivalent for the purposes of this discussion). A group is said to be nilpotent of step $s$ (or more precisely, step at most $s$),
if all iterated commutators of order s+1 or higher necessarily vanish.
For instance, a group is nilpotent of order 1 if and only if it is abelian, and it is nilpotent of order 2 if and only if $[[g_1, g_2], g_3] = \mathrm{id}$ for all $g_1, g_2, g_3$ (i.e. all commutator elements $[g_1, g_2]$ are central), and so forth. A good example of an $s$-step nilpotent group is the group of $(s+1) \times (s+1)$ upper-triangular unipotent matrices (i.e. matrices with 1s on the diagonal and zero below the diagonal), taking values in some ring (e.g. reals, integers, complex numbers, etc.).
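The matrix example can be checked by direct computation. The sketch below (my own illustration; the helper names are arbitrary) works with 4 × 4 integer unipotent matrices, i.e. the step $s = 3$ case, and confirms that an iterated commutator of order $s + 1 = 4$ collapses to the identity while an order-3 commutator need not.

```python
def eye(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]

def inv(A):
    # Inverse of a unipotent matrix A = I + N (N strictly upper triangular):
    # the Neumann series I - N + N^2 - ... terminates because N is nilpotent.
    n = len(A)
    N = [[A[i][j] - int(i == j) for j in range(n)] for i in range(n)]
    out, term, sign = eye(n), eye(n), 1
    for _ in range(n - 1):
        term = matmul(term, N)
        sign = -sign
        out = [[out[i][j] + sign * term[i][j] for j in range(n)] for i in range(n)]
    return out

def comm(A, B):  # [A, B] = A^{-1} B^{-1} A B, the convention used in the text
    return matmul(matmul(inv(A), inv(B)), matmul(A, B))

# 4 x 4 unipotent matrices form a 3-step nilpotent group (s = 3).
E12 = eye(4); E12[0][1] = 1
E23 = eye(4); E23[1][2] = 1
E34 = eye(4); E34[2][3] = 1
g4 = [[1, 5, 0, 2], [0, 1, 1, 3], [0, 0, 1, 2], [0, 0, 0, 1]]

c3 = comm(comm(E12, E23), E34)   # an order-3 commutator: nontrivial here
c4 = comm(c3, g4)                # any order-4 commutator must vanish when s = 3
print(c3 != eye(4), c4 == eye(4))  # True True
```

Replacing 4 by $s + 1$ in the matrix size illustrates the general step-$s$ claim.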
Another important example of a nilpotent group arises from operations on polynomials. For instance, if $V_s$ is the vector space of real polynomials of one variable of degree at most $s$, then there are two natural affine actions on $V_s$. Firstly, every polynomial $Q$ in $V_s$ gives rise to a vertical shift $P \mapsto P + Q$. Secondly, every $h \in \mathbf{R}$ gives rise to a horizontal shift $P \mapsto P(\cdot + h)$. The group generated by these two shifts is a nilpotent group of step $s$; this reflects the well-known fact that a polynomial of degree $s$ vanishes once one differentiates more than $s$ times. Because of this link between nilpotency and polynomials, one can view nilpotent algebra as a generalisation of polynomial algebra.
Suppose one has a finite number $g_1, \ldots, g_n$ of generators. Using abstract algebra, one can then construct the free nilpotent group $T_s(g_1, \ldots, g_n)$ of step $s$, defined as the group generated by the $g_1, \ldots, g_n$ subject to the relations that all commutators of order $s + 1$ involving the generators are trivial. This is the universal object in the category of nilpotent groups of step $s$ with $n$ marked elements $g_1, \ldots, g_n$. In other words, given any other $s$-step nilpotent group $G$ with $n$ marked elements $g'_1, \ldots, g'_n$, there is a unique homomorphism from the free nilpotent group to $G$ that maps $g_j$ to $g'_j$ for $1 \leq j \leq n$. In particular, the free nilpotent group is well-defined up
to isomorphism in this category.
In many applications, one wants to have a more concrete descrip-
tion of the free nilpotent group, so that one can perform computations
more easily (and in particular, be able to tell when two words in the
group are equal or not). This is easy for small values of $s$. For instance, when $s = 1$, $T_1(g_1, \ldots, g_n)$ is simply the free abelian group generated by $g_1, \ldots, g_n$, and so every element $g$ of $T_1(g_1, \ldots, g_n)$ can be described uniquely as
$$(4.48)\qquad g = \prod_{j=1}^{n} g_j^{m_j} := g_1^{m_1} \cdots g_n^{m_n}$$
for some integers $m_1, \ldots, m_n$, with the obvious group law. Indeed, to obtain existence of this representation, one starts with any representation of $g$ in terms of the generators $g_1, \ldots, g_n$, and then uses the abelian property to push the $g_1$ factors to the far left, followed by the $g_2$ factors, and so forth. To show uniqueness, we observe that
the group $G$ of formal abelian products $\{g_1^{m_1} \cdots g_n^{m_n} : m_1, \ldots, m_n \in \mathbf{Z}\} \equiv \mathbf{Z}^n$ is already a 1-step nilpotent group with marked elements $g_1, \ldots, g_n$, and so there must be a homomorphism from the free group to $G$. Since $G$ distinguishes all the products $g_1^{m_1} \cdots g_n^{m_n}$ from each other, the free group must also.
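Computationally, the existence argument above (push all the $g_1$ factors to the far left, then the $g_2$ factors, and so forth) is just exponent collection. A minimal sketch, with a word encoding of my own choosing:

```python
from collections import Counter

def abelian_normal_form(word, n):
    # A word is a list of pairs (j, e): the generator g_j raised to the power e.
    # In the free abelian group T_1(g_1, ..., g_n), only the total exponent of
    # each generator matters, so the normal form (4.48) is the exponent vector.
    m = Counter()
    for j, e in word:
        m[j] += e
    return tuple(m[j] for j in range(1, n + 1))

# g_2 g_1^3 g_2^{-1} g_3 g_1^{-1}  ->  g_1^2 g_3
print(abelian_normal_form([(2, 1), (1, 3), (2, -1), (3, 1), (1, -1)], 3))  # (2, 0, 1)
```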
It is only slightly more tricky to describe the free nilpotent group $T_2(g_1, \ldots, g_n)$ of step 2. Using the identities
$$gh = hg[g,h]; \quad gh^{-1} = ([g,h]^{-1})^{g^{-1}} h^{-1} g; \quad g^{-1}h = h[g,h]^{-1}g^{-1}; \quad g^{-1}h^{-1} = [g,h]\, h^{-1} g^{-1}$$
(where $g^h := h^{-1} g h$ is the conjugate of $g$ by $h$) we see that whenever
$1 \leq i < j \leq n$, one can push a positive or negative power of $g_i$ past a positive or negative power of $g_j$, at the cost of creating a positive or negative power of $[g_i, g_j]$, or one of its conjugates. Meanwhile,
in a 2-step nilpotent group, all the commutators are central, and
one can pull all the commutators out of a word and collect them as
in the abelian case. Doing all this, we see that every element g of
$T_2(g_1, \ldots, g_n)$ has a representation of the form
$$(4.49)\qquad g = \Big(\prod_{j=1}^{n} g_j^{m_j}\Big)\Big(\prod_{1 \leq i < j \leq n} [g_i, g_j]^{m_{[i,j]}}\Big)$$
for some integers $m_j$ for $1 \leq j \leq n$ and $m_{[i,j]}$ for $1 \leq i < j \leq n$. Note
that we don't need to consider commutators $[g_i, g_j]$ for $i \geq j$, since $[g_i, g_i] = \mathrm{id}$ and $[g_i, g_j] = [g_j, g_i]^{-1}$.
It is possible to show also that this representation is unique, by repeating the previous argument, i.e. by showing that the set of formal products
$$G := \Big\{\Big(\prod_{j=1}^{n} g_j^{m_j}\Big)\Big(\prod_{1 \leq i < j \leq n} [g_i, g_j]^{m_{[i,j]}}\Big) : m_j, m_{[i,j]} \in \mathbf{Z}\Big\}$$
forms a 2-step nilpotent group, after using the above rules to define the group operations. This can be done, but verifying the group axioms (particularly the associative law) for $G$ is unpleasantly tedious.
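For two generators one can at least verify such a group law mechanically. The sketch below (my own illustration; the encoding of the normal form (4.49) as an exponent triple is an assumption of the example) implements the product law obtained by pushing generators past one another and collecting the central commutator, and checks that it agrees with multiplication of 3 × 3 unipotent matrices under $g_1 \mapsto I + e_{12}$, $g_2 \mapsto I + e_{23}$.

```python
import itertools

def nf_mul(x, y):
    # Normal forms g1^m1 g2^m2 [g1,g2]^q in the free 2-step nilpotent group on
    # two generators, with [g,h] = g^{-1} h^{-1} g h central.  Moving g1^{n1}
    # from the right factor past g2^{m2} creates [g1,g2]^{-n1 m2}.
    (m1, m2, q), (n1, n2, r) = x, y
    return (m1 + n1, m2 + n2, q + r - n1 * m2)

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(3)) for j in range(3)] for i in range(3)]

def rep(x):
    # The matrix realisation g1 -> I + e12, g2 -> I + e23, [g1,g2] -> I + e13.
    m1, m2, q = x
    return [[1, m1, m1 * m2 + q], [0, 1, m2], [0, 0, 1]]

# Check the homomorphism property rep(x * y) = rep(x) rep(y) on a range of elements.
vals = range(-2, 3)
ok = all(rep(nf_mul(x, y)) == matmul(rep(x), rep(y))
         for x in itertools.product(vals, vals, vals)
         for y in itertools.product(vals, vals, vals))
print(ok)  # True
```

Since the exponents can be read off the matrix entries, the matrix model also distinguishes distinct normal forms, which is the uniqueness argument of the text in miniature.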
Once one sees this, one rapidly loses an appetite for trying to
obtain a similar explicit description for free nilpotent groups for higher
step, especially once one starts seeing that higher commutators obey
some non-obvious identities such as the Hall-Witt identity
$$(4.50)\qquad [[g, h^{-1}], k]^h \, [[h, k^{-1}], g]^k \, [[k, g^{-1}], h]^g = 1$$
(a nonlinear version of the Jacobi identity in the theory of Lie alge-
bras), which make one less certain as to the existence or uniqueness
of various proposed generalisations of the representations (4.48) or
(4.49). For instance, in the free 3-step nilpotent group, it turns out
that for representations of the form
$$g = \Big(\prod_{j=1}^{n} g_j^{m_j}\Big)\Big(\prod_{1 \leq i < j \leq n} [g_i, g_j]^{m_{[i,j]}}\Big)\Big(\prod_{1 \leq i < j < k \leq n} [[g_i, g_j], g_k]^{n_{[[i,j],k]}}\Big)$$
one has uniqueness but not existence (e.g. even in the simplest case $n = 3$, there is no place in this representation for, say, $[[g_1, g_3], g_2]$ or $[[g_1, g_2], g_2]$), but if one tries to insert more triple commutators into the representation to make up for this, one has to be careful not
to lose uniqueness due to identities such as (4.50). One can paste
these in by ad hoc means in the s = 3 case, but the s = 4 case
looks more fearsome still, especially now that the quadruple commu-
tators split into several distinct-looking species such as $[[g_i, g_j], [g_k, g_l]]$ and $[[[g_i, g_j], g_k], g_l]$, which are nevertheless still related to each other
by identities such as (4.50). While one can eventually disentangle
this mess for any fixed $n$ and $s$ by a finite amount of combinatorial
computation, it is not immediately obvious how to give an explicit description of $T_s(g_1, \ldots, g_n)$ uniformly in $n$ and $s$.
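Since (4.50) is an identity valid in any group whatsoever, it can be sanity-checked mechanically before one trusts it in computations. The sketch below (my own illustration) verifies it for random permutations, using the same conventions $[g,h] = g^{-1}h^{-1}gh$ and $g^h := h^{-1}gh$ as above.

```python
import random

def mul(p, q):  # composition of permutations given as tuples: (p*q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(len(p)))

def inv(p):
    out = [0] * len(p)
    for i, pi in enumerate(p):
        out[pi] = i
    return tuple(out)

def comm(g, h):  # [g, h] = g^{-1} h^{-1} g h
    return mul(mul(inv(g), inv(h)), mul(g, h))

def conj(g, h):  # g^h = h^{-1} g h
    return mul(mul(inv(h), g), h)

rng = random.Random(0)
ident = tuple(range(6))
for _ in range(100):
    g, h, k = (tuple(rng.sample(range(6), 6)) for _ in range(3))
    # Hall-Witt: [[g,h^{-1}],k]^h [[h,k^{-1}],g]^k [[k,g^{-1}],h]^g = 1
    lhs = mul(mul(conj(comm(comm(g, inv(h)), k), h),
                  conj(comm(comm(h, inv(k)), g), k)),
              conj(comm(comm(k, inv(g)), h), g))
    assert lhs == ident
print("Hall-Witt identity verified in S_6")
```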
Nevertheless, it turns out that one can give a reasonably tractable
description of this group if one takes a polycyclic perspective rather
than a nilpotent one - i.e. one views the free nilpotent group as a
tower of group extensions of the trivial group by the cyclic group Z.
This seems to be a fairly standard observation in group theory - I
found it in [MaKaSo2004] and [Le2009] - but it seems not to be so widely known outside of that field, so I wanted to record it here.
4.13.1. Generalisation. The first step is to generalise the concept of a free nilpotent group to one where the generators have different degrees. Define a graded sequence to be a finite ordered sequence $(g_\alpha)_{\alpha \in A}$ of formal group elements $g_\alpha$, each assigned a positive integer $\deg(g_\alpha)$, which we call the degree of $g_\alpha$ (the degree of an iterated commutator being the sum of the degrees of its entries). We then define the free $s$-step nilpotent group $T_s((g_\alpha)_{\alpha \in A})$ generated by a graded sequence $(g_\alpha)_{\alpha \in A}$ to be the group generated by the $g_\alpha$, subject to the constraint that any iterated commutator of the $g_\alpha$ of degree greater than $s$ is trivial. Thus the free group $T_s(g_1, \ldots, g_k)$ corresponds to the case when all the $g_i$ are assigned a degree of 1.
Note that any element of a graded sequence of degree greater
than s is automatically trivial (we view it as a 0-fold commutator of
itself) and so can be automatically discarded from that sequence.
We will recursively define the free $s$-step nilpotent group of some graded sequence $(g_\alpha)_{\alpha \in A}$ in terms of simpler sequences, which
have fewer low-degree terms at the expense of introducing higher-
degree terms, though as mentioned earlier there is no need to intro-
duce terms of degree larger than s. Eventually this process exhausts
the sequence, and at that point the free nilpotent group will be com-
pletely described.
4.13.2. Shift. It is convenient to introduce the iterated commuta-
tors [g, mh] for m = 0, 1, 2, . . . by declaring [g, 0h] := g and [g, (m +
1)h] := [[g, mh], h], thus for instance [g, 3h] = [[[g, h], h], h].
Definition 4.13.1 (Shift). Let $s \geq 1$ be an integer, let $(g_\alpha)_{\alpha \in A}$ be a non-empty graded sequence, and let $\alpha_0$ be the minimal element of $A$. We define the (degree $s$) shift $(g_\beta)_{\beta \in A'}$ of $(g_\alpha)_{\alpha \in A}$ by defining $A'$ to be the collection of all pairs $\beta = [\alpha, m\alpha_0]$ with $\alpha \in A \setminus \{\alpha_0\}$ and $m \geq 0$ such that $\deg(g_\alpha) + m \deg(g_{\alpha_0}) \leq s$, ordered by declaring $[\alpha, m\alpha_0] > [\alpha', m'\alpha_0]$ if $\alpha > \alpha'$, or if $\alpha = \alpha'$ and $m > m'$, and setting $g_{[\alpha, m\alpha_0]} := [g_\alpha, m g_{\alpha_0}]$, with degree $\deg(g_\alpha) + m \deg(g_{\alpha_0})$.
Example 4.13.2. If $s = 3$, and the graded sequence $g_a, g_b, g_c$ consists entirely of elements of degree 1, then the shift of this sequence is given by
$$g_b, g_c, g_{[b,a]}, g_{[b,2a]}, g_{[c,a]}, g_{[c,2a]}$$
where $[b,a], [c,a]$ have degree 2, and $[b,2a], [c,2a]$ have degree 3, and $g_{[b,a]} = [g_b, g_a]$, $g_{[b,2a]} = [g_b, 2g_a]$, etc.
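The shift is concrete enough to mechanise. The sketch below (my own encoding; the labels and the list order are illustrative conventions) represents a graded element as a (degree, label) pair, reproduces the shift of Example 4.13.2, and iterates shifts until the sequence is exhausted, anticipating the completion introduced below; the resulting counts are the number of exponents in the normal form of Corollary 4.13.4.

```python
def shift(seq, s):
    # Degree-s shift of a graded sequence (Definition 4.13.1): remove the first
    # (minimal) element a0 and replace each later element g by the iterated
    # commutators [g, m a0] of degree at most s, for m = 0, 1, 2, ...
    (d0, a0), rest = seq[0], seq[1:]
    out = []
    for d, a in rest:
        m = 0
        while d + m * d0 <= s:
            if m == 0:
                out.append((d, a))
            else:
                out.append((d + m * d0, "[%s,%s%s]" % (a, "" if m == 1 else m, a0)))
            m += 1
    return out

def completion(seq, s):
    # Iterate the shift until the sequence is exhausted, collecting the
    # minimal elements removed along the way.
    full = []
    while seq:
        full.append(seq[0])
        seq = shift(seq, s)
    return full

gens = [(1, "a"), (1, "b"), (1, "c")]
print(sorted(x for _, x in shift(gens, 3)))
# ['[b,2a]', '[b,a]', '[c,2a]', '[c,a]', 'b', 'c']  (matches Example 4.13.2)
print(len(completion([(1, "a"), (1, "b")], 2)),
      len(completion([(1, "a"), (1, "b")], 3)),
      len(completion(gens, 2)))  # 3 5 6
```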
The key lemma is then

Lemma 4.13.3 (Recursive description of free group). Let $s \geq 1$ be an integer, let $(g_\alpha)_{\alpha \in A}$ be a non-empty graded sequence, and let $\alpha_0$ be the minimal element of $A$. Let $(g_\beta)_{\beta \in A'}$ be the shift of $(g_\alpha)_{\alpha \in A}$. Then $T_s((g_\alpha)_{\alpha \in A})$ is generated by $g_{\alpha_0}$ and $T_s((g_\beta)_{\beta \in A'})$, and furthermore the latter group is a normal subgroup of $T_s((g_\alpha)_{\alpha \in A})$ that does not contain $g_{\alpha_0}$. In other words, we have a semi-direct product representation
$$T_s((g_\alpha)_{\alpha \in A}) = \mathbf{Z} \ltimes T_s((g_\beta)_{\beta \in A'})$$
with $g_{\alpha_0}$ being identified with $(1, \mathrm{id})$ and the action of $\mathbf{Z}$ being given by the conjugation action of $g_{\alpha_0}$. In particular, every element $g$ in $T_s((g_\alpha)_{\alpha \in A})$ can be uniquely expressed as $g = g_{\alpha_0}^n g'$, where $n \in \mathbf{Z}$ and $g' \in T_s((g_\beta)_{\beta \in A'})$.
Proof. It is clear that $T_s((g_\beta)_{\beta \in A'})$ is a subgroup of $T_s((g_\alpha)_{\alpha \in A})$, and that it together with $g_{\alpha_0}$ generates $T_s((g_\alpha)_{\alpha \in A})$. To show that this subgroup is normal, it thus suffices to show that the conjugation actions of $g_{\alpha_0}$ and $g_{\alpha_0}^{-1}$ preserve $T_s((g_\beta)_{\beta \in A'})$. It suffices to check this on generators. But this is clear from the identity
$$g_{\alpha_0}^{-1} [g_\alpha, m g_{\alpha_0}] g_{\alpha_0} = [g_\alpha, m g_{\alpha_0}]\, [g_\alpha, (m+1) g_{\alpha_0}]$$
and its inverse
$$g_{\alpha_0} [g_\alpha, m g_{\alpha_0}] g_{\alpha_0}^{-1} = [g_\alpha, m g_{\alpha_0}]\, [g_\alpha, (m+1) g_{\alpha_0}]^{-1}\, [g_\alpha, (m+2) g_{\alpha_0}] \cdots$$
(note that the product terminates in finite time due to nilpotency).
Finally, we need to show that $g_{\alpha_0}$ is not contained in $T_s((g_\beta)_{\beta \in A'})$. But because the conjugation action of $g_{\alpha_0}$ preserves the latter group,
we can form the semidirect product $G := \mathbf{Z} \ltimes T_s((g_\beta)_{\beta \in A'})$. By the universal nature of the free group, there must thus be a homomorphism from $T_s((g_\alpha)_{\alpha \in A})$ to $G$ which maps $g_{\alpha_0}$ to $(1, \mathrm{id})$ and maps $T_s((g_\beta)_{\beta \in A'})$ to $\{0\} \times T_s((g_\beta)_{\beta \in A'})$. This implies that $g_{\alpha_0}$ cannot lie in $T_s((g_\beta)_{\beta \in A'})$, and the claim follows.
We can now iterate this. Observe that every time one shifts a non-empty graded sequence, one removes one element (the minimal element $g_{\alpha_0}$) but replaces it with zero or more elements of higher degree. Iterating this process, we eventually run out of elements of degree one, then degree two, and so forth, until the sequence becomes completely empty. We glue together all the elements encountered this way and refer to the full sequence as the completion $(g_\alpha)_{\alpha \in \overline{A}}$ of the original sequence $(g_\alpha)_{\alpha \in A}$. As a corollary of the above lemma we thus have
Corollary 4.13.4 (Explicit description of free nilpotent group). Let $s \geq 1$ be an integer, and let $(g_\alpha)_{\alpha \in A}$ be a graded sequence. Then every element $g$ of $T_s((g_\alpha)_{\alpha \in A})$ can be represented uniquely as
$$g = \prod_{\alpha \in \overline{A}} g_\alpha^{n_\alpha}$$
where the $n_\alpha$ are integers, and the product is taken in the order given by the completion $\overline{A}$.

A similar argument places the bracket polynomials generated by a sequence $(x_\alpha)_{\alpha \in A}$ into a canonical basis $(x_\alpha)_{\alpha \in \overline{A}}$, where $\overline{A}$ is the same completion of $A$ that was encountered here. This was used to show a close connection between such bracket polynomials and nilpotent groups (or more precisely, nilsequences).
Notes. This article first appeared at terrytao.wordpress.com/2009/12/21.
Thanks to Dylan Thurston for corrections.
Bibliography
[AgKaSa2004] M. Agrawal, N. Kayal, N. Saxena, PRIMES is in P, Annals
of Mathematics 160 (2004), no. 2, 781–793.
[AjSz1974] M. Ajtai, E. Szemerédi, Sets of lattice points that form no squares, Stud. Sci. Math. Hungar. 9 (1974), 9–11 (1975).
[AlDuLeRoYu1994] N. Alon, R. Duke, H. Lefmann, V. Rödl, R. Yuster, The algorithmic aspects of the regularity lemma, J. Algorithms 16 (1994), no. 1, 80–109.
[AlSh2008] N. Alon, A. Shapira, Every monotone graph property is testable,
SIAM J. Comput. 38 (2008), no. 2, 505–522.
[AlSp2008] N. Alon, J. Spencer, The probabilistic method. Third edi-
tion. With an appendix on the life and work of Paul Erdős. Wiley-
Interscience Series in Discrete Mathematics and Optimization. John
Wiley & Sons, Inc., Hoboken, NJ, 2008.
[Au2008] T. Austin, On exchangeable random variables and the statistics
of large graphs and hypergraphs, Probab. Surv. 5 (2008), 80–145.
[Au2009] T. Austin, Deducing the multidimensional Szemerédi Theorem from an infinitary removal lemma, preprint.
[Au2009b] T. Austin, Deducing the Density Hales-Jewett Theorem from an
infinitary removal lemma, preprint.
[AuTa2010] T. Austin, T. Tao, On the testability and repair of hereditary
hypergraph properties, preprint.
[Ax1968] J. Ax, The elementary theory of finite fields, Ann. of Math. 88 (1968), 239–271.
[BaGiSo1975] T. Baker, J. Gill, R. Solovay, Relativizations of the P =?NP
question, SIAM J. Comput. 4 (1975), no. 4, 431–442.
[Be1975] W. Beckner, Inequalities in Fourier analysis, Ann. of Math. 102
(1975), no. 1, 159–182.
[BeTaZi2009] V. Bergelson, T. Tao, T. Ziegler, An inverse theorem for the
uniformity seminorms associated with the action of $F_p^\omega$, preprint.
[BeLo1976] J. Bergh, J. Löfström, Interpolation spaces. An introduction.
Grundlehren der Mathematischen Wissenschaften, No. 223. Springer-
Verlag, Berlin-New York, 1976.
[BiRo1962] A. Bialynicki-Birula, M. Rosenlicht, Injective morphisms of real
algebraic varieties, Proc. Amer. Math. Soc. 13 (1962), 200–203.
[BoKe1996] E. Bogomolny, J. Keating, Random matrix theory and the Riemann
zeros. II. n-point correlations, Nonlinearity 9 (1996), no. 4, 911–935.
[Bo1969] A. Borel, Injective endomorphisms of algebraic varieties, Arch.
Math. (Basel) 20 (1969), 531–537.
[Bo1999] J. Bourgain, On the dimension of Kakeya sets and related maximal
inequalities, Geom. Funct. Anal. 9 (1999), no. 2, 256–282.
[BoBr2003] J. Bourgain, H. Brezis, On the equation div Y = f and ap-
plication to control of phases, J. Amer. Math. Soc. 16 (2003), no. 2,
393–426.
[BudePvaR2008] G. Buskes, B. de Pagter, A. van Rooij, The Loomis-
Sikorski theorem revisited, Algebra Universalis 58 (2008), 413–426.
[ChPa2009] T. Chen, N. Pavlović, The quintic NLS as the mean field limit
of a boson gas with three-body interactions, preprint.
[ClEdGuShWe1990] K. Clarkson, H. Edelsbrunner, L. Guibas, M. Sharir,
E. Welzl, Combinatorial complexity bounds for arrangements of curves
and spheres, Discrete Comput. Geom. 5 (1990), no. 2, 99–160.
[Co1989] J. B. Conrey, More than two fifths of the zeros of the Riemann zeta
function are on the critical line, J. Reine Angew. Math. 399 (1989),
1–26.
[Dy1970] F. Dyson, Correlations between eigenvalues of a random matrix,
Comm. Math. Phys. 19 (1970), 235–250.
[ElSz2008] G. Elek, B. Szegedy, A measure-theoretic approach to the theory
of dense hypergraphs, preprint.
[ElObTa2009] J. Ellenberg, R. Oberlin, T. Tao, The Kakeya set and maximal
conjectures for algebraic varieties over finite fields, preprint.
[ElVeWe2009] J. Ellenberg, A. Venkatesh, C. Westerland, Homological stability
for Hurwitz spaces and the Cohen-Lenstra conjecture over function
fields, preprint.
[ErKa1940] P. Erdős, M. Kac, The Gaussian Law of Errors in the Theory
of Additive Number Theoretic Functions, Amer. J. Math. 62 (1940),
no. 1/4, 738–742.
[EsKePoVe2008] L. Escauriaza, C. E. Kenig, G. Ponce, L. Vega, Hardy's
uncertainty principle, convexity and Schrödinger evolutions, J. Eur.
Math. Soc. (JEMS) 10 (2008), no. 4, 883–907.
[Fa2003] K. Falconer, Fractal geometry, Mathematical foundations and ap-
plications. Second edition. John Wiley & Sons, Inc., Hoboken, NJ,
2003.
[FeSt1972] C. Fefferman, E. M. Stein, H^p spaces of several variables,
Acta Math. 129 (1972), no. 3-4, 137–193.
[FiMaSh2007] E. Fischer, A. Matsliach, A. Shapira, Approximate Hypergraph
Partitioning and Applications, Proc. of FOCS 2007, 579–589.
[Fo2000] G. Folland, Real Analysis, Modern techniques and their applica-
tions. Second edition. Pure and Applied Mathematics (New York). A
Wiley-Interscience Publication. John Wiley & Sons, Inc., New York,
1999.
[Fo1955] E. Følner, On groups with full Banach mean value, Math. Scand.
3 (1955), 243–254.
[Fo1974] J. Fournier, Majorants and L^p norms, Israel J. Math. 18 (1974),
157–166.
[Fr1973] G. Freiman, Groups and the inverse problems of additive num-
ber theory, Number-theoretic studies in the Markov spectrum and in
the structural theory of set addition, pp. 175–183. Kalinin. Gos. Univ.,
Moscow, 1973.
[Fu1977] H. Furstenberg, Ergodic behavior of diagonal measures and a theorem
of Szemerédi on arithmetic progressions, J. Analyse Math. 31
(1977), 204–256.
[FuKa1989] H. Furstenberg, Y. Katznelson, A density version of the Hales-
Jewett theorem for k = 3, Graph theory and combinatorics (Cambridge,
1988). Discrete Math. 75 (1989), no. 1-3, 227–241.
[FuKa1991] H. Furstenberg, Y. Katznelson, A density version of the Hales-
Jewett theorem, J. Anal. Math. 57 (1991), 64–119.
[GiTr1998] D. Gilbarg, N. Trudinger, Elliptic partial differential equations
of second order. Reprint of the 1998 edition. Classics in Mathematics.
Springer-Verlag, Berlin, 2001.
[GoMo1987] D. Goldston, H. Montgomery, Pair correlation of zeros and
primes in short intervals, Analytic number theory and Diophantine
problems (Stillwater, OK, 1984), 183–203, Progr. Math., 70, Birkhäuser
Boston, Boston, MA, 1987.
[Go1993] W. T. Gowers, B. Maurey, The unconditional basic sequence problem,
J. Amer. Math. Soc. 6 (1993), no. 4, 851–874.
[Gr1992] A. Granville, On elementary proofs of the prime number theorem
for arithmetic progressions, without characters, Proceedings of the
Amalfi Conference on Analytic Number Theory (Maiori, 1989), 157–194,
Univ. Salerno, Salerno, 1992.
[Gr2005] A. Granville, It is easy to determine whether a given integer is
prime, Bull. Amer. Math. Soc. (N.S.) 42 (2005), no. 1, 3–38.
[GrSo2007] A. Granville, K. Soundararajan, Large character sums: pretentious
characters and the Pólya-Vinogradov theorem, J. Amer. Math.
Soc. 20 (2007), no. 2, 357–384.
[GrTa2007] B. Green, T. Tao, The distribution of polynomials over finite
fields, with applications to the Gowers norms, preprint.
[GrTaZi2009] B. Green, T. Tao, T. Ziegler, An inverse theorem for the
Gowers U^4 norm, preprint.
[Gr1999] M. Gromov, Endomorphisms of symbolic algebraic varieties, J.
Eur. Math. Soc. (JEMS) 1 (1999), no. 2, 109–197.
[Gr1966] A. Grothendieck, Éléments de géométrie algébrique. IV. Étude
locale des schémas et des morphismes de schémas. III., Inst. Hautes