
Lecture 1

Introduction to Fourier Analysis


Jan 7, 2005
Lecturer: Nati Linial
Notes: Atri Rudra & Ashish Sabharwal
1.1 Text
The main text for the first part of this course is

T. W. Körner, Fourier Analysis.

The following textbooks are also fun:

H. Dym and H. P. McKean, Fourier Series and Integrals.

A. Terras, Harmonic Analysis on Symmetric Spaces and Applications, Vols. I, II.

The following text follows a more terse exposition:

Y. Katznelson, An Introduction to Harmonic Analysis.
1.2 Introduction and Motivation
Consider a vector space $V$ (which may be of finite dimension). From linear algebra we know that, at least in the finite-dimensional case, $V$ has a basis. Moreover, there is more than one basis, and in general no basis is preferable to any other. However, when the vector space has some additional structure, some bases might be preferable over others. To give a more concrete example, consider the vector space $V = \{f : X \to \mathbb{R} \text{ (or } \mathbb{C})\}$ where $X$ is some universe. If $X = \{1, \dots, n\}$ then one can see that $V$ is simply the space $\mathbb{R}^n$ or $\mathbb{C}^n$ respectively, in which case we have no reason to prefer any particular basis. However, if $X$ is an abelian group¹ there may be a reason to prefer one basis $B$ over others. As an example, let us consider $X = \mathbb{Z}/n\mathbb{Z}$, $V = \{(y_0, \dots, y_{n-1}) \mid y_i \in \mathbb{R}\} = \mathbb{R}^n$. We now give some scenarios (mostly inspired by the engineering applications of Fourier transforms) where we want certain properties for $B$, a.k.a. our wish list:
¹An abelian group is a pair $(S, +)$ where the set $S$ is closed under the commutative operation $+$. Further, there exists an identity element $0$ and every element of $S$ has an inverse.
1. Think of the elements of $\mathbb{Z}/n\mathbb{Z}$ as time units and let the vector $y = (y_0, \dots, y_{n-1})$ denote some measurements taken at the different time units. Now consider another vector $z = (z_0, \dots, z_{n-1})$ such that for all $i$, $z_i = y_{(i+1) \bmod n}$. Note that the vectors $y$ and $z$ are different, though from the measurement point of view they are not much different: they correspond to the same physical situation when our clock is shifted by one unit of time. Thus, with this application in mind, we might want to look for a basis of $V$ such that the representation of $z$ is closely related to that of $y$.
2. In a setting more general than the previous one, if $f : X \to \mathbb{R}$ is a given member of $V$, $a \in X$, and $g : X \to \mathbb{R}$ is such that $g(x) = f(x + a)$, then we would like the representations of $f$ and $g$ in $B$ to be close, for every $f$ and $a$. Note that in the previous example $x$ corresponds to the index $i$ and $a = 1$.
3. In situations where derivatives (or discrete analogues thereof) are well-defined, we would like $f'$ to have a representation similar to that of $f$.
4. In real life, signals are never nice and smooth but suffer from noise. One way to reduce the noise is to average out the signal. As a concrete example, let the signal samples be given by $f_0, \dots, f_{n-1}$; then the smoothed-out signal could be given by the samples $g_i = \frac{1}{4}f_{i-1} + \frac{1}{2}f_i + \frac{1}{4}f_{i+1}$. Define $g_{-1} = \frac{1}{4}$, $g_0 = \frac{1}{2}$, $g_1 = \frac{1}{4}$. We now look at a new operator, convolution, which is defined as follows: $h = f * g$, where $h(x) = \sum_y f(y)g(x - y)$. The smoothing above is exactly $f * g$. So another natural property of $B$ is that the representation of $f * g$ should be related to those of $f$ and $g$ (a numerical sketch follows).
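As a concrete illustration of item 4 (this sketch and its names are ours, not part of the original notes), the following Python snippet smooths a noisy signal by cyclic convolution with the kernel $(\frac14, \frac12, \frac14)$ over $\mathbb{Z}/n\mathbb{Z}$, and checks that in the Fourier basis convolution becomes coordinate-wise multiplication, precisely the behavior the wish list asks of $B$:

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
t = np.arange(n)
f = np.sin(2 * np.pi * t / n) + 0.3 * rng.standard_normal(n)  # noisy signal

# Smoothing kernel g over Z/nZ: g[-1] = 1/4, g[0] = 1/2, g[1] = 1/4.
g = np.zeros(n)
g[0], g[1], g[-1] = 0.5, 0.25, 0.25

# Cyclic convolution h(x) = sum_y f(y) g(x - y), indices mod n.
h = np.array([sum(f[y] * g[(x - y) % n] for y in range(n)) for x in range(n)])

# In the Fourier basis, convolution is coordinate-wise multiplication:
assert np.allclose(np.fft.fft(h), np.fft.fft(f) * np.fft.fft(g))
print("max |h - f| =", np.abs(h - f).max())  # the smoothed signal stays close to f
```

The assertion is exactly the convolution property that a good basis should provide: representing $f$ and $g$ in the character basis reduces $*$ to a pointwise product.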
Before we go ahead, here are some frequently used instantiations of $X$:

• $X = \mathbb{Z}/n\mathbb{Z}$. This is the Discrete Fourier Transform (DFT).

• $X = T = (\text{reals mod } 1, \text{addition mod } 1) \cong (\{e^{i\theta}\}, \text{multiplication})$. The isomorphism exists because multiplication of elements in the second group is the same as addition mod $2\pi$ of the angles $\theta$. This is the most classical case of the theory, covered by Zygmund's book Trigonometric Series.

• $X = (\mathbb{R}, +)$. This is the real Fourier transform. In this case, in order to get meaningful analysis, one has to restrict the family of functions $f : X \to \mathbb{R}$ under consideration, e.g. to ones with converging integrals or those with compact support. The more general framework is that of locally compact abelian groups.

• $X = \{0,1\}^n$ where the operations are done mod 2. Note that the $f : X \to \{0,1\}$ are simply the boolean functions.
1.3 A good basis
As before let $X$ be an abelian group and define the vector space $V = \{f : X \to \mathbb{R}\}$.

Definition 1.1. The characters of $X$ are the set $\{\chi : X \to \mathbb{C} \mid \chi \text{ is a homomorphism}\}$.

By homomorphism we mean that the following relationship holds: $\chi(x + y) = \chi(x)\cdot\chi(y)$ for any $x, y \in X$. As a concrete example, $X = \mathbb{Z}/n\mathbb{Z}$ has $n$ distinct characters, and the $j$-th character is given by $\chi_j(x) = \omega^{jx}$ for any $x \in X$, where $\omega = e^{2\pi i/n}$.

We now state a general theorem without proof (we will soon prove a special case):

Theorem 1.1. Distinct characters (considered as functions from $X$ to $\mathbb{C}$) are orthogonal².

²There is a natural notion of inner product among functions $f, g : X \to \mathbb{R}$: $\langle f, g \rangle = \sum_{x\in X} f(x)g(x)$ in the discrete case, and $\langle f, g \rangle = \int f(x)g(x)\,dx$ in the continuous case. If the functions map into $\mathbb{C}$, then $g(x)$ is replaced by its conjugate $\overline{g(x)}$ in these expressions. Finally, $f$ and $g$ are orthogonal if $\langle f, g \rangle = 0$.

We now have the following fact:

Fact 1.1. If $X$ is a finite abelian group of $n$ elements, then $X$ has $n$ distinct characters which form an orthogonal basis for $V = \{f : X \to \mathbb{R}\}$.
Consider the special case of $X = \mathbb{Z}/n\mathbb{Z}$. We will show that the characters are orthogonal. Recall that in this case $\chi_j(x) = \omega^{jx}$. By definition,
$$\langle \chi_j, \chi_k \rangle = \sum_{x=0}^{n-1}\omega^{jx}\,\overline{\omega^{kx}} = \sum_{x=0}^{n-1}\omega^{(j-k)x}.$$
If $j = k$ then each term is one and the inner product evaluates to $n$. If $j \ne k$, then summing up the geometric series, we have
$$\langle \chi_j, \chi_k \rangle = \frac{(\omega^{j-k})^n - 1}{\omega^{j-k} - 1} = 0.$$
The last equality follows from the fact that $\omega^n = 1$.
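Since the claim is completely finite, it is easy to verify numerically. Here is a minimal sketch (ours, for illustration) checking the orthogonality relation $\langle\chi_j, \chi_k\rangle = n\,\delta_{jk}$ for $\mathbb{Z}/8\mathbb{Z}$:

```python
import numpy as np

n = 8
omega = np.exp(2j * np.pi / n)
x = np.arange(n)
chi = lambda j: omega ** (j * x)   # the j-th character chi_j(x) = omega^{jx}

# <chi_j, chi_k> = sum_x chi_j(x) * conj(chi_k(x)) should equal n * delta_{jk}.
G = np.array([[np.vdot(chi(k), chi(j)) for k in range(n)] for j in range(n)])
assert np.allclose(G, n * np.eye(n))
print("characters of Z/8Z are pairwise orthogonal")
```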
We will take a quick detour and mention some applications where Fourier analysis has had some measure of success:

• Coding theory. A code $\mathcal{C}$ is a subset of $\{0,1\}^n$ where we want the elements to be as far from each other as possible ("far" being measured in terms of the Hamming distance). We would like $\mathcal{C}$ to be as large as possible while keeping the distance as large as possible. Note that these are two opposing goals.

• Influence of variables on boolean functions. Say you have an array of sensors and some function which computes an answer from their readings. If a few of the sensors fail, the answer should not change: in other words, we need to find functions that are not too influenced by any single one of their variables.

• Numerical integration / discrepancy. Say you want to integrate over some domain $\Omega$. Of course one cannot find the exact integral if one does not have an analytical expression for the function. So one samples measurements at some discrete points and tries to approximate the integral. Suppose that we further know that certain subdomains of $\Omega$ are significant for the computation. The main question is how to spread $n$ points in $\Omega$ such that every significant region is sampled with the correct number of points.
1.4 A Rush Course in Classical Fourier Analysis
Let $X = T = \left(\{e^{i\theta} \mid 0 \le \theta < 2\pi\}, \text{multiplication}\right)$. Let $f : T \to \mathbb{C}$, which can alternatively be thought of as a periodic function $f : \mathbb{R} \to \mathbb{C}$. What do characters of $X$ look like?

There are infinitely many characters and each is a periodic function from $\mathbb{R}$ to $\mathbb{C}$. In fact, every character of $X$ is a function $\chi : X \to T$, i.e. $\chi : T \to T$. Being a homomorphism, it must also satisfy $\chi(x\cdot y) = \chi(x)\cdot\chi(y)$. This implies that the only continuous characters of $X$ are $\chi_k(x) = x^k$, $k \in \mathbb{Z}$. Note that if $k \notin \mathbb{Z}$, then $x^k$ can have multiple values, discontinuities, etc. It is an easy check to see that $\langle \chi_k, \chi_l \rangle = \delta_{kl}$:
$$\langle \chi_k, \chi_l \rangle = \frac{1}{2\pi}\int_T \chi_k(x)\overline{\chi_l(x)}\,dx = \frac{1}{2\pi}\int_T x^k x^{-l}\,dx = \frac{1}{2\pi}\int_T x^{k-l}\,dx = \frac{1}{2\pi}\int_0^{2\pi} e^{i\theta(k-l)}\,d\theta = \begin{cases} 1 & \text{if } k = l\\ 0 & \text{if } k \ne l \end{cases} = \delta_{kl}$$
How do we express a given $f : T \to \mathbb{C}$ in the basis of the characters of $X$? Recall that if $V$ is a finite-dimensional vector space over a field $F$ with an inner product, and $u_1, \dots, u_n$ is an orthonormal basis for $V$, then every $f \in V$ can be expressed as
$$f = \sum_{j=1}^{n} a_j u_j, \qquad a_j \in F,\quad a_j = \langle f, u_j \rangle. \tag{1.1}$$
We would like to obtain a similar representation of $f$ in the basis of the characters $\chi_k$, $k \in \mathbb{Z}$.
Definition 1.2 (Fourier coefficients). For $r \in \mathbb{Z}$, the $r$-th Fourier coefficient of $f$ is
$$\hat{f}(r) = \frac{1}{2\pi}\int_T f(t)\,e^{-irt}\,dt.$$
The analogue of Equation 1.1 now becomes
$$S_n(f, t) = \sum_{r=-n}^{n}\hat{f}(r)\,e^{irt}; \qquad \text{does } \lim_{n\to\infty} S_n(f, t) = f(t)? \tag{1.2}$$
Here $\hat{f}(r)$ replaces $a_j$ and $\chi_r(e^{it}) = e^{irt}$ replaces $u_j$ in Equation 1.1. In a dream world, we would simply ask whether $\sum_{r=-\infty}^{\infty}\hat{f}(r)e^{irt} \stackrel{?}{=} f(t)$ holds. We are, however, being more careful and asking the question by making this sum go from $-n$ to $n$ and considering the limit as $n \to \infty$.
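Equation 1.2 can be explored numerically. The sketch below (our illustration, with a Riemann sum standing in for the integral) computes $\hat f(r)$ and the partial sums $S_n(f, \cdot)$ for $f(t) = |\sin t|$, for which the partial sums do converge uniformly:

```python
import numpy as np

# Approximate hat f(r) = (1/2pi) int_T f(t) e^{-irt} dt by a Riemann sum,
# then form the partial sum S_n(f, t) = sum_{r=-n}^{n} hat f(r) e^{irt}.
M = 4096
t = np.linspace(0, 2 * np.pi, M, endpoint=False)
f = np.abs(np.sin(t))             # a continuous, piecewise-smooth test function

def fhat(r):
    return np.mean(f * np.exp(-1j * r * t))    # (1/2pi) integral as an average

def S(n):
    return sum(fhat(r) * np.exp(1j * r * t) for r in range(-n, n + 1)).real

for n in (1, 4, 16, 64):
    print(n, np.abs(S(n) - f).max())           # sup-norm error shrinks as n grows
```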
1.4.1 Notions of Convergence
Before attempting to answer the question of representing $f$ in terms of its Fourier coefficients, we must formalize what it means for two functions defined over a domain $A$ to be close. Three commonly studied notions of distance between functions (and hence of convergence of functions) are as follows.
$L^\infty$ distance: $\|f - g\|_\infty = \sup_{x\in A}|f(x) - g(x)|$. Recall that convergence in the sense of $L^\infty$ is called uniform convergence.

$L^1$ distance: $\|f - g\|_1 = \int_A |f(x) - g(x)|\,dx$.

$L^2$ distance: $\|f - g\|_2 = \sqrt{\int_A |f(x) - g(x)|^2\,dx}$.
In Fourier Analysis, all three measures of proximity are used at different times and in different contexts.
1.4.2 Fourier Expansion and Fejér's Theorem
The first correct proof (under appropriate assumptions) of the validity of Equation 1.2 was given by Dirichlet:

Theorem 1.2 (Dirichlet). Let $f : T \to \mathbb{C}$ be a continuous function whose first derivative is continuous with the possible exception of finitely many points. Then Equation 1.2 holds for every $t \in T$ at which $f$ is continuous.

Later, du Bois-Reymond gave an example of a continuous $f$ for which $\limsup_{n\to\infty} S_n(f, 0) = \infty$. This ruled out the possibility that continuity alone is sufficient for Equation 1.2 to hold. The difficulty in answering the question affirmatively lies in proving convergence of $S_n(f, t)$ as $n \to \infty$. Fejér answered a more relaxed version of the problem, namely: when can $f$ be reconstructed from $\hat{f}(r)$ in possibly other ways? He showed that if $f$ satisfies certain conditions even weaker than continuity, then it can be reconstructed from $\hat{f}(r)$ by taking averages.
Definition 1.3 (Cesàro means). Let $a_1, a_2, \dots$ be a sequence of real numbers. Their $k$-th Cesàro mean is $b_k = \frac{1}{k}\sum_{j=1}^{k} a_j$.

Proposition 1.3. Let $a_1, a_2, \dots$ be a sequence of real numbers that converges to $a$. Then the sequence $b_1, b_2, \dots$ of its Cesàro means converges to $a$ as well. Moreover, the sequence $(b_i)$ can converge even when the sequence $(a_i)$ does not (e.g. $a_{2j} = 1$, $a_{2j+1} = 0$).
Let us apply the idea of Cesàro means to $S_n$. Define
$$\sigma_n(f, t) = \frac{1}{n+1}\sum_{k=0}^{n} S_k(f, t) = \frac{1}{n+1}\sum_{k=0}^{n}\sum_{r=-k}^{k}\hat{f}(r)e^{irt} = \sum_{r=-n}^{n}\frac{n+1-|r|}{n+1}\,\hat{f}(r)\,e^{irt}.$$
Theorem 1.4 (Fejér). Let $f : T \to \mathbb{C}$ be Riemann integrable. If $f$ is continuous at $t \in T$, then $\lim_{n\to\infty}\sigma_n(f, t) = f(t)$. Further, if $f$ is continuous everywhere then the above holds uniformly.

Proof. Note that $\lim_{n\to\infty}\sigma_n(f, t) = f(t)$ means that $\forall\epsilon > 0\ \exists n_0 : \forall n > n_0,\ |\sigma_n(f, t) - f(t)| < \epsilon$. The convergence is uniform if the same $n_0$ works for all $t$ simultaneously. The proof of the theorem uses Fejér's kernels $K_n$, which behave as continuous approximations to the Dirac delta function:
$$\sigma_n(f, t) = \sum_{r=-n}^{n}\frac{n+1-|r|}{n+1}\hat{f}(r)e^{irt} = \sum_{r=-n}^{n}\frac{n+1-|r|}{n+1}\left(\frac{1}{2\pi}\int_T f(x)e^{-irx}\,dx\right)e^{irt}$$
$$= \frac{1}{2\pi}\int_T f(x)\sum_{r=-n}^{n}\frac{n+1-|r|}{n+1}e^{ir(t-x)}\,dx = \frac{1}{2\pi}\int_T f(x)K_n(t - x)\,dx \quad\text{for } K_n(z) \stackrel{\text{def}}{=} \sum_{r=-n}^{n}\frac{n+1-|r|}{n+1}e^{irz}$$
$$= \frac{1}{2\pi}\int_T f(t - y)K_n(y)\,dy \quad\text{for } y = t - x,$$
which is the convolution of $f$ with the kernel $K_n$. Note that if $K_n$ were the Dirac delta function, then $\frac{1}{2\pi}\int_T f(t - y)K_n(y)\,dy$ would evaluate exactly to $f(t)$. Fejér's kernels $K_n$ approximate this behavior.
Proposition 1.5. $K_n$ satisfies
$$K_n(s) = \begin{cases}\dfrac{1}{n+1}\left(\dfrac{\sin\frac{n+1}{2}s}{\sin\frac{s}{2}}\right)^2 & \text{if } s \ne 0\\[2mm] n+1 & \text{if } s = 0\end{cases}$$
The kernels $K_n$ have three useful properties:

1. $\forall u : K_n(u) \ge 0$;

2. $\forall\delta > 0$: $K_n(s) \to 0$ uniformly outside the interval $[-\delta, \delta]$, i.e. $\forall\epsilon > 0\ \exists n_0 : \forall n > n_0,\ \forall s \notin [-\delta, \delta],\ |K_n(s)| < \epsilon$;

3. $\frac{1}{2\pi}\int_T K_n(s)\,ds = 1$.
Given any $\epsilon > 0$, we seek a large enough $n_0$ such that for all $n > n_0$, $\left|\frac{1}{2\pi}\int_T f(t - y)K_n(y)\,dy - f(t)\right| < \epsilon$. Divide the integral into two parts:
$$\int_T f(t - y)K_n(y)\,dy = \int_{-\delta}^{\delta} f(t - y)K_n(y)\,dy + \int_{T\setminus[-\delta,\delta]} f(t - y)K_n(y)\,dy.$$
The first integral on the RHS converges to $2\pi f(t)$: for small $\delta$, $f(t - y)$ is almost constant and equal to $f(t)$ in the range $y \in [-\delta, \delta]$ (here we use continuity of $f$ at $t$), and $\int_{-\delta}^{\delta} K_n(s)\,ds$ converges to $2\pi$ by properties 2 and 3 of $K_n$. The second integral converges to $0$ because $f$ is bounded and because of property 2 of $K_n$. Hence the LHS converges to $2\pi f(t)$, finishing the proof.
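A quick numerical sanity check of the three kernel properties (a sketch of ours, not from the notes):

```python
import numpy as np

# Fejer kernel K_n(s) = sum_{r=-n}^{n} (n+1-|r|)/(n+1) e^{irs}: it is nonnegative,
# has (1/2pi) int_T K_n = 1, and its mass concentrates near s = 0 as n grows.
M = 4096
s = np.linspace(-np.pi, np.pi, M, endpoint=False)

def fejer(n):
    return sum((n + 1 - abs(r)) / (n + 1) * np.exp(1j * r * s)
               for r in range(-n, n + 1)).real

for n in (4, 16, 64):
    K = fejer(n)
    assert K.min() > -1e-9                    # property 1: K_n >= 0
    print(n, "total mass:", K.mean(),         # property 3: (1/2pi) int K_n = 1
          " mass outside [-0.5, 0.5]:",       # property 2: tail mass -> 0
          K[np.abs(s) > 0.5].sum() / M)
```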
Corollary 1.6. If $f, g : T \to \mathbb{C}$ are continuous functions and $\forall r \in \mathbb{Z} : \hat{f}(r) = \hat{g}(r)$, then $f = g$.

Proof. Let $h \stackrel{\text{def}}{=} f - g$; $h$ is also continuous, and $\forall r : \hat{h}(r) = \hat{f}(r) - \hat{g}(r) = 0$. Hence every $\sigma_n(h, \cdot) \equiv 0$, and by Fejér's theorem $h \equiv 0$.
1.4.3 Connection with Weierstrass Theorem
Because of the uniform convergence part of Fejér's theorem, we have proved that for every continuous $f : T \to \mathbb{C}$ and every $\epsilon > 0$ there exists a trigonometric polynomial $P$ such that $|f(t) - P(t)| < \epsilon$ for all $t \in T$. This implies Weierstrass's theorem, which states that under the $L^\infty[a,b]$ norm, polynomials are dense in $C[a,b]$; i.e., for every continuous $f : [a,b] \to \mathbb{R}$ and every $\epsilon > 0$ there exists a polynomial $P$ such that $|f(x) - P(x)| < \epsilon$ for all $x \in [a,b]$.

Informally, Weierstrass's theorem says that given any continuous function over a finite interval and an arbitrarily small envelope around it, we can find a polynomial that fits inside that envelope in that interval. To see why this is implied by Fejér's theorem, simply convert the given function $f : [a,b] \to \mathbb{C}$ into a symmetric function $g$ over an interval of size $2(b-a)$, identify the end points of the new interval so that it is isomorphic to $T$, and use Fejér's theorem to conclude that $\sigma_n(g, \cdot)$ is a trigonometric polynomial close to $g$ (and hence to $f$). To see why Weierstrass's theorem in turn yields approximation by trigonometric polynomials, recall that $\cos rt$ can be expressed as a degree $r$ polynomial in $\cos t$. Use this to express the promised trigonometric polynomial $P(t)$ as a linear combination of $\cos rt$ and $\sin rt$ with $-n \le r \le n$.
Remark. Weierstrass's theorem can alternatively be proved using Bernstein polynomials, even though ordinary interpolation polynomials do not work well for this purpose. Consider $f : [0,1] \to \mathbb{R}$. The $n$-th Bernstein polynomial is
$$B_n(f, x) \stackrel{\text{def}}{=} \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right)\binom{n}{k}x^k(1-x)^{n-k}.$$
The key idea is that the binomial distribution $P(k) = \binom{n}{k}x^k(1-x)^{n-k}$ is highly concentrated around $k = xn$ and thus approximates the behavior of the Dirac delta function.
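A short sketch (ours) of the Bernstein approximation in action, for a continuous but non-differentiable $f$:

```python
import numpy as np
from math import comb

# Bernstein approximation B_n(f, x) = sum_k f(k/n) C(n,k) x^k (1-x)^{n-k}.
def bernstein(f, n, x):
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)                  # continuous, not differentiable at 1/2
x = np.linspace(0, 1, 501)
for n in (8, 32, 128):
    err = np.abs(bernstein(f, n, x) - f(x)).max()
    print(n, err)                           # sup-norm error decreases with n
```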
Lecture 2
Introduction to Some Convergence Theorems

Jan 14, 2005
Lecturer: Nati Linial
Notes: Mukund Narasimhan and Chris Ré
2.1 Recap
Recall that for $f : T \to \mathbb{C}$ we had defined
$$\hat{f}(r) = \frac{1}{2\pi}\int_T f(t)e^{-irt}\,dt$$
and we were trying to reconstruct $f$ from $\hat{f}$. The classical theory tries to determine if/when the following is true (for an appropriate definition of equality):
$$f(t) \stackrel{??}{=} \sum_{r\in\mathbb{Z}}\hat{f}(r)e^{irt}$$
In the last lecture we proved Fejér's theorem: $f * k_n \to f$, where $*$ denotes convolution and the $k_n$ (Fejér kernels) are trigonometric polynomials that satisfy

1. $k_n \ge 0$;

2. $\frac{1}{2\pi}\int_T k_n = 1$;

3. $k_n(s) \to 0$ uniformly as $n \to \infty$ outside $[-\delta, \delta]$, for any $\delta > 0$.
If $X$ is a finite abelian group, then the space of all functions $f : X \to \mathbb{C}$ forms an algebra with the operations $(+, *)$, where $+$ is the usual pointwise sum and $*$ is convolution. If instead of a finite abelian group we take $X$ to be $T$, then there is no unit in this algebra (i.e., no element $h$ with the property that $h * f = f$ for all $f$). However, the $k_n$ behave as approximate units and play an important role in this theory. If we let
$$S_n(f, t) = \sum_{r=-n}^{n}\hat{f}(r)e^{irt}$$
then $S_n(f, \cdot) = f * D_n$, where $D_n$ is the Dirichlet kernel given by
$$D_n(s) = \frac{\sin\left(n + \frac{1}{2}\right)s}{\sin\frac{s}{2}}.$$
The Dirichlet kernel does not have all the nice properties of the Fejér kernel. In particular,
1. $D_n$ changes sign;

2. $D_n$ does not converge uniformly to $0$ outside arbitrarily small $[-\delta, \delta]$ intervals.

Remark. The choice of an appropriate kernel can simplify applications and proofs tremendously.
2.2 The Classical Theory
Let G be a locally compact abelian group.
Definition 2.1. A character on $G$ is a homomorphism $\chi : G \to T$, namely a mapping satisfying $\chi(g_1 + g_2) = \chi(g_1)\chi(g_2)$ for all $g_1, g_2 \in G$.
If $\chi_1, \chi_2$ are any two characters of $G$, then it is easily verified that $\chi_1\chi_2$ is also a character of $G$, and so the set of characters of $G$ forms a commutative group under multiplication. An important role is played by $\hat{G}$, the group of all continuous characters. For example, $\hat{T} = \mathbb{Z}$ and $\hat{\mathbb{R}} = \mathbb{R}$.

For any function $f : G \to \mathbb{C}$, associate with it a function $\hat{f} : \hat{G} \to \mathbb{C}$ where $\hat{f}(\chi) = \langle f, \chi \rangle$. For example, if $G = T$ then $\chi_r(t) = e^{irt}$ for $r \in \mathbb{Z}$, and we have $\hat{f}(\chi_r) = \hat{f}(r)$. We call $\hat{f} : \hat{G} \to \mathbb{C}$ the Fourier transform of $f$. Now $\hat{G}$ is also a locally compact abelian group, and we can play the same game backwards to construct $\hat{\hat{f}}$. Pontryagin's theorem asserts that $\hat{\hat{G}} = G$, and so we can ask the question: does $\hat{\hat{f}} = f$? While for $T$ Fejér's theorem answers the question of when $\hat{f}$ uniquely determines $f$, in the general setting this question remains to be addressed.

For the general theory we will also require a normalized nonnegative measure $\mu$ on $G$ that is translation invariant: $\mu(S) = \mu(a + S) = \mu(\{a + s \mid s \in S\})$ for every $S \subseteq G$ and $a \in G$. There exists a unique such measure, called the Haar measure.
2.3 $L^p$ spaces
Definition 2.2. If $(X, \mathcal{B}, \mu)$ is a measure space, then $L^p(X, \mathcal{B}, \mu)$ is the space of all measurable functions $f : X \to \mathbb{R}$ such that
$$\|f\|_p = \left(\int_X |f|^p\,d\mu\right)^{1/p} < \infty.$$
For example, if $X = \mathbb{N}$, $\mathcal{B}$ is the set of all finite subsets of $X$, and $\mu$ is the counting measure, then $\|(x_1, x_2, \dots, x_n, \dots)\|_p = \left(\sum |x_i|^p\right)^{1/p}$. For $p = \infty$ we define
$$\|x\|_\infty = \sup_{i\in\mathbb{N}}|x_i|.$$
Symmetrization is a technique that we will find useful. Loosely, the idea is that we average over all the group elements. Given a function $f : G \to \mathbb{C}$, we symmetrize it by defining $g : G \to \mathbb{C}$ as follows:
$$g(x) = \int_G f(x + a)\,d\mu(a).$$
We will use this concept in the proof of the following result.

Proposition 2.1. If $G$ is a locally compact abelian group with a normalized Haar measure $\mu$, and if $\chi_1, \chi_2 \in \hat{G}$ are two distinct characters, then $\langle \chi_1, \chi_2 \rangle = 0$. I.e.,
$$I = \int_G \chi_1(x)\overline{\chi_2(x)}\,d\mu(x) = \langle \chi_1, \chi_2 \rangle = \begin{cases} 0 & \chi_1 \ne \chi_2\\ 1 & \chi_1 = \chi_2 \end{cases}$$
Proof. For any fixed $a \in G$, translation invariance of $\mu$ gives $I = \int_G \chi_1(x)\overline{\chi_2(x)}\,d\mu(x) = \int_G \chi_1(x+a)\overline{\chi_2(x+a)}\,d\mu(x)$. Therefore,
$$I = \int_G \chi_1(x+a)\overline{\chi_2(x+a)}\,d\mu(x) = \int_G \chi_1(x)\chi_1(a)\,\overline{\chi_2(x)}\,\overline{\chi_2(a)}\,d\mu(x) = \chi_1(a)\overline{\chi_2(a)}\int_G \chi_1(x)\overline{\chi_2(x)}\,d\mu(x) = \chi_1(a)\overline{\chi_2(a)}\,I.$$
This can only be true if either $I = 0$ or $\chi_1(a) = \chi_2(a)$. If $\chi_1 \ne \chi_2$, then there is at least one $a$ such that $\chi_1(a) \ne \chi_2(a)$. It follows that either $\chi_1 = \chi_2$ or $I = 0$.

By letting $\chi_2$ be the character that is identically $1$, we conclude that for any $\chi \in \hat{G}$ with $\chi \ne 1$,
$$\int_G \chi(x)\,d\mu(x) = 0.$$
2.4 Approximation Theory
Weierstrass's theorem states that the polynomials are dense in $C[a,b]$ under the $L^\infty[a,b]$ norm.¹ Fejér's theorem is about approximating functions using trigonometric polynomials.

Proposition 2.2. $\cos nx$ can be expressed as a degree $n$ polynomial in $\cos x$.

Proof. Use the identity $\cos(u+v) + \cos(u-v) = 2\cos u\cos v$ and induction on $n$.
The polynomial $T_n(x)$ defined by $T_n(\cos x) = \cos(nx)$ is called the $n$-th Chebyshev polynomial. One checks that $T_0(s) = 1$, $T_1(s) = s$, $T_2(s) = 2s^2 - 1$, and in general $T_n(s) = 2^{n-1}s^n$ plus lower order terms.
Theorem 2.3 (Chebyshev). The monic degree $n$ polynomial $p(x) = x^n + \dots$ that approximates the function $f(x) = 0$ on $[-1, 1]$ as well as possible in the $L^\infty[-1,1]$ norm is $2^{-(n-1)}T_n(x)$; i.e.,
$$\min_{p\ \text{monic of degree } n}\ \max_{-1\le x\le 1}|p(x)| = \frac{1}{2^{n-1}}.$$
This theorem can be proved using linear programming.
¹This notation is intended to imply that the norm on this space is the sup-norm (clearly $C[a,b] \subseteq L^\infty[a,b]$).
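The minimax property is easy to probe numerically. In this sketch (ours), random monic competitors never beat the normalized Chebyshev polynomial:

```python
import numpy as np

# Compare max_{[-1,1]} |p| for the normalized Chebyshev polynomial 2^{-(n-1)} T_n
# against a few other monic polynomials of the same degree.
n = 6
x = np.linspace(-1, 1, 20001)
T_n = np.cos(n * np.arccos(x))              # T_n(cos theta) = cos(n theta)
print("normalized Chebyshev:", np.abs(T_n / 2**(n - 1)).max(),
      " (theory: 2^{-(n-1)} =", 2.0**(1 - n), ")")

rng = np.random.default_rng(1)
for _ in range(3):                          # random monic competitors of degree n
    c = np.concatenate(([1.0], rng.standard_normal(n)))
    print("random monic poly:  ", np.abs(np.polyval(c, x)).max())
```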
2.4.1 Moment Problems
Suppose that $X$ is a random variable. The simplest pieces of information about $X$ are its moments: expressions of the form $\mu_r = \int f(x)x^r\,dx$, where $f$ is the probability density function of $X$. A moment problem asks: suppose I know all (or some of) the moments $\{\mu_r\}_{r\in\mathbb{N}}$; do I know the distribution of $X$?

Theorem 2.4 (Hausdorff moment theorem). If $f, g : [a,b] \to \mathbb{C}$ are two continuous functions and for all $r = 0, 1, 2, \dots$ we have
$$\int_a^b f(x)x^r\,dx = \int_a^b g(x)x^r\,dx,$$
then $f = g$. Equivalently, if $h : [a,b] \to \mathbb{C}$ is a continuous function with $\int_a^b h(x)x^r\,dx = 0$ for all $r \in \mathbb{N}$, then $h \equiv 0$.
Proof. Consider the real case. By Weierstrass's theorem, for every $\epsilon > 0$ there is a polynomial $P$ such that $\|h - P\|_\infty < \epsilon$. If $\int_a^b h(x)x^r\,dx = 0$ for all $r \in \mathbb{N}$, then it follows that $\int_a^b h(x)Q(x)\,dx = 0$ for every polynomial $Q$, and so in particular $\int_a^b h(x)P(x)\,dx = 0$. Therefore,
$$0 = \int_a^b h(x)P(x)\,dx = \int_a^b h(x)h(x)\,dx + \int_a^b h(x)\big(P(x) - h(x)\big)\,dx,$$
so that
$$\langle h, h \rangle = -\int_a^b h(x)\big(P(x) - h(x)\big)\,dx.$$
Since $h$ is continuous, it is bounded on $[a,b]$ by some constant $c$, and so on $[a,b]$ we have $\big|h(x)\big(P(x) - h(x)\big)\big| \le c\,\epsilon$, whence $\|h\|_2^2 \le c\,\epsilon\,|b - a|$. Therefore, for any $\delta > 0$ we can pick $\epsilon > 0$ so that $\|h\|_2^2 \le \delta$. Hence $h \equiv 0$.
2.4.2 A little Ergodic Theory
Theorem 2.5. Let $f : T \to \mathbb{C}$ be continuous and $\alpha$ be irrational. Then
$$\lim_{n\to\infty}\frac{1}{n}\sum_{r=1}^{n} f\big(e^{2\pi i r\alpha}\big) = \frac{1}{2\pi}\int_T f(t)\,dt.$$
Proof. We show that this result holds when $f(t) = e^{ist}$ for a fixed integer $s \ne 0$; using Fejér's theorem, it follows that the result holds for any continuous function. Now, clearly $\frac{1}{2\pi}\int_T e^{ist}\,dt = 0$. Therefore,
$$\left|\frac{1}{n}\sum_{r=1}^{n} e^{2\pi i rs\alpha} - \frac{1}{2\pi}\int_T e^{ist}\,dt\right| = \left|\frac{1}{n}\sum_{r=1}^{n} e^{2\pi i rs\alpha}\right| = \frac{1}{n}\left|e^{2\pi i s\alpha}\,\frac{1 - e^{2\pi i ns\alpha}}{1 - e^{2\pi i s\alpha}}\right| \le \frac{2}{n\left|1 - e^{2\pi i s\alpha}\right|}.$$
Since $\alpha$ is irrational, $1 - e^{2\pi i s\alpha}$ is nonzero. Therefore this quantity goes to zero, and hence the result follows.
This result has applications in the evaluation of integrals and of volumes of convex bodies. It is also used in the proof of the following result.

Theorem 2.6 (Weyl). Let $\alpha$ be an irrational number. For $x \in \mathbb{R}$, denote by $\langle x \rangle = x - \lfloor x \rfloor$ the fractional part of $x$. For any $0 < a < b < 1$ we have
$$\lim_{n\to\infty}\frac{\left|\{1 \le r \le n : a \le \langle r\alpha \rangle < b\}\right|}{n} = b - a.$$
Proof. We would like to use Theorem 2.5 with the function $f = 1_{[a,b]}$. However, this function is not continuous. To get around this, we define continuous functions $f^+ \ge 1_{[a,b]} \ge f^-$ approximating the indicator from above and below (picture a trapezoid hugging the rectangle; the figure is omitted here). We let them approach $f$ and pass to the limit.
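Weyl's theorem is easy to watch numerically; a sketch of ours, using $\alpha = \sqrt{2}$:

```python
import numpy as np

# Empirical check of Weyl's theorem: the fractional parts <r*alpha> equidistribute.
alpha = np.sqrt(2)                        # an irrational rotation number
a, b = 0.2, 0.7
for n in (100, 10_000, 1_000_000):
    frac = (np.arange(1, n + 1) * alpha) % 1.0
    print(n, np.mean((a <= frac) & (frac < b)))   # should approach b - a = 0.5
```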
This is related to a more general ergodic theorem of Birkhoff.

Theorem 2.7 (Birkhoff, 1931). Let $(\Omega, \mathcal{F}, p)$ be a probability space and $T : \Omega \to \Omega$ a measure preserving transformation. Let $X \in L^1(\Omega, \mathcal{F}, p)$ be a random variable. Then
$$\frac{1}{n}\sum_{k=1}^{n} X \circ T^k \longrightarrow \mathbb{E}[X \mid \mathcal{I}],$$
where $\mathcal{I}$ is the $\sigma$-field of $T$-invariant sets.
2.5 Some Convergence Theorems
We seek conditions under which $S_n(f, t) \to f(t)$ (preferably uniformly). Some history:

• Du Bois-Reymond gave an example of a continuous function such that $\limsup S_n(f, 0) = \infty$.

• Kolmogorov [1] found a Lebesgue measurable function $f : T \to \mathbb{R}$ such that for all $t$, $\limsup S_n(f, t) = \infty$.

• Carleson [2] showed that if $f : T \to \mathbb{C}$ is a continuous function (even just Riemann integrable), then $S_n(f, t) \to f(t)$ almost everywhere.

• Kahane and Katznelson [3] showed that for every $E \subseteq T$ with $\mu(E) = 0$, there exists a continuous function $f : T \to \mathbb{C}$ such that $S_n(f, t) \not\to f(t)$ if and only if $t \in E$.
Definition 2.3. $\ell_p = L_p(\mathbb{N}, \text{finite sets}, \text{counting measure}) = \{x = (x_0, x_1, \dots) \mid \|x\|_p < \infty\}$.
Theorem 2.8. Let $f : T \to \mathbb{C}$ be continuous and suppose that $\sum_{r\in\mathbb{Z}}|\hat{f}(r)| < \infty$ (so $\hat{f} \in \ell_1$). Then $S_n(f, \cdot) \to f$ uniformly on $T$.

Proof. See Lecture 3, Theorem 3.1.
2.6 The $L^2$ theory
The fact that $e_s(t) = e^{ist}$, $s \in \mathbb{Z}$, is an orthonormal family of functions allows one to develop a very satisfactory theory. Given a function $f$, the best coefficients $\lambda_1, \lambda_2, \dots, \lambda_n$, i.e. those minimizing $\|f - \sum_{j=1}^{n}\lambda_j e_j\|_2$, are given by $\lambda_j = \langle f, e_j \rangle$. This answer applies just as well in any inner product space (Hilbert space) whenever $\{e_j\}$ forms an orthonormal system.
Theorem 2.9 (Bessel's inequality). For every $\lambda_1, \lambda_2, \dots, \lambda_n$,
$$\left\|f - \sum_{i=1}^{n}\lambda_i e_i\right\|^2 \ \ge\ \|f\|^2 - \sum_{i=1}^{n}\langle f, e_i \rangle^2,$$
with equality when $\lambda_i = \langle f, e_i \rangle$.
Proof. We offer a proof here for the real case; in the next lecture the complex case will be done as well.
$$\left\|f - \sum_{i=1}^{n}\lambda_i e_i\right\|^2 = \left\|\left(f - \sum_{i=1}^{n}\langle f, e_i\rangle e_i\right) + \left(\sum_{i=1}^{n}\langle f, e_i\rangle e_i - \sum_{i=1}^{n}\lambda_i e_i\right)\right\|^2$$
$$= \left\|f - \sum_{i=1}^{n}\langle f, e_i\rangle e_i\right\|^2 + \left\|\sum_{i=1}^{n}\big(\langle f, e_i\rangle - \lambda_i\big)e_i\right\|^2 + \text{cross terms},$$
where
$$\text{cross terms} = 2\left\langle f - \sum_{i=1}^{n}\langle f, e_i\rangle e_i,\ \sum_{i=1}^{n}\big(\langle f, e_i\rangle - \lambda_i\big)e_i\right\rangle.$$
The cross terms vanish: expanding by linearity, each summand is a multiple of
$$\left\langle f - \sum_{j=1}^{n}\langle f, e_j\rangle e_j,\ e_i\right\rangle = \langle f, e_i\rangle - \langle f, e_i\rangle = 0,$$
using orthonormality $\langle e_j, e_i \rangle = \delta_{ij}$.

We want to make the left-hand side as small as possible and have control only over the $\lambda_i$'s. Since the second term is a sum of squares $(\langle f, e_i\rangle - \lambda_i)^2$ and therefore non-negative, the expression is minimized when we set $\lambda_i = \langle f, e_i\rangle$. With this choice,
$$\left\|f - \sum_{i=1}^{n}\lambda_i e_i\right\|^2 = \left\langle f - \sum_{i=1}^{n}\lambda_i e_i,\ f - \sum_{i=1}^{n}\lambda_i e_i\right\rangle = \langle f, f\rangle - 2\sum_{i=1}^{n}\lambda_i\langle f, e_i\rangle + \sum_{i=1}^{n}\lambda_i^2 = \|f\|^2 - \sum_{i=1}^{n}\langle f, e_i\rangle^2.$$
References

[1] A. N. Kolmogorov, Une série de Fourier-Lebesgue divergente partout, C. R. Acad. Sci. Paris 183, pp. 1327-1328, 1926.

[2] L. Carleson, On convergence and growth of partial sums of Fourier series, Acta Math. 116, pp. 135-157, 1966.

[3] J.-P. Kahane and Y. Katznelson, Sur les ensembles de divergence des séries trigonométriques, Studia Mathematica 26, pp. 305-306, 1966.
Lecture 3
Harmonic Analysis on the Cube and Parseval's Identity
Jan 28, 2005
Lecturer: Nati Linial
Notes: Pete Couperus and Neva Cherniavsky
3.1 Where we can use this
During the past weeks we developed general machinery which we will apply to problems in discrete math and computer science in the following weeks. In the general setting we can ask how much information we can determine about a function $f$ given its Fourier coefficients $\hat{f}$, or, given $f$, what we can say about $\hat{f}$. There is some distinction between properties which hold in the general setting and those that make sense for the specific spaces we have dealt with. So far, we have looked at:

1. $T$ (the unit circle / Fourier series);

2. $\mathbb{Z}/n\mathbb{Z}$ (discrete Fourier transform);

3. $\mathbb{R}$ (real Fourier transform);

4. $\{0,1\}^n = GF(2)^n = (\mathbb{Z}/2\mathbb{Z})^n$ (the $n$-cube).
For the $n$-cube (or for any space on which we wish to do harmonic analysis) we need to determine the characters. We can view elements of $\{0,1\}^n$ as subsets of $[n] = \{1, \dots, n\}$, and then to each subset $S \subseteq [n]$ associate
$$\chi_S(T) = (-1)^{|S\cap T|}.$$
Then:
$$\langle \chi_{S_1}, \chi_{S_2} \rangle = \frac{1}{2^n}\sum_{T\subseteq[n]}(-1)^{|S_1\cap T| + |S_2\cap T|}.$$
To see that the $\chi_S$ form an orthonormal basis, suppose that $x \in S_1 \triangle S_2$. Then the function
$$\varphi(A) = \begin{cases} A \cup \{x\} & x \notin A\\ A \setminus \{x\} & x \in A \end{cases}$$
gives a bijection between $\{A : |S_1\cap A| \equiv |S_2\cap A| \pmod 2\}$ and $\{A : |S_1\cap A| \not\equiv |S_2\cap A| \pmod 2\}$. So $\langle \chi_{S_1}, \chi_{S_2} \rangle = 0$ for $S_1 \ne S_2$. If $S_1 = S_2$, then $|T\cap S_1| + |T\cap S_2|$ is always even, so $\langle \chi_S, \chi_S \rangle = 1$.
Hence the $\chi_S$ form an orthonormal basis for the functions $\{0,1\}^n \to \mathbb{R}$. (This is, of course, true in general, but it is useful to see it explicitly in this special case.) Then for any $f : \{0,1\}^n \to \mathbb{R}$ we can write $f = \sum_{S\subseteq[n]}\hat{f}(S)\chi_S$, where
$$\hat{f}(S) = \langle f, \chi_S \rangle = \frac{1}{2^n}\sum_{T\subseteq[n]} f(T)\,(-1)^{|S\cap T|}.$$
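The transform on the cube is a finite sum, so it can be computed directly. The following sketch (ours; a brute-force transform rather than the fast Walsh-Hadamard algorithm) computes $\hat{f}$ for the parity function, whose Fourier weight sits entirely on $S = \emptyset$ and $S = [n]$:

```python
from itertools import product

n = 4
cube = list(product([0, 1], repeat=n))     # elements of {0,1}^n as tuples

def chi(S, T):                             # chi_S(T) = (-1)^{|S cap T|}
    return (-1) ** sum(s & t for s, t in zip(S, T))

def fourier(f):                            # hat f(S) = (1/2^n) sum_T f(T) chi_S(T)
    return {S: sum(f[T] * chi(S, T) for T in cube) / 2**n for S in cube}

f = {T: float(sum(T) % 2) for T in cube}   # parity of all n bits
fh = fourier(f)
# Parity equals 1/2 - (1/2) chi_{[n]}: the only nonzero coefficients are at
# S = (0,...,0) (value 1/2) and S = (1,...,1) (value -1/2).
print({S: c for S, c in fh.items() if abs(c) > 1e-12})
```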
There is an equivalent and often useful way of viewing this. We can also view the $n$-cube as $\{-1,1\}^n$ with coordinate-wise multiplication. In this case, any function $f : \{-1,1\}^n \to \mathbb{R}$ can be uniquely expressed as a multilinear polynomial
$$f = \sum_{S\subseteq[n]} a_S\prod_{i\in S} x_i,$$
where $\prod_{i\in S} x_i$ corresponds to $\chi_S$.

There is an advantage to the fact that we now deal with a finite group. Note that $f = \sum_{S\subseteq[n]}\hat{f}(S)\chi_S$ always holds for functions over the $n$-cube, unlike when working over $T$: over $T$, we had to make some assumptions on $f$ to have a similar formula recovering $f$ from its Fourier coefficients.
Now we can ask: what can be said about $\hat{f}$ when $f$ is boolean (when the range of $f$ is $\{0,1\}$)? More specifically, how do the properties of $f$ get reflected in $\hat{f}$? In general this is too hard a question to tackle. But what sorts of relationships between properties are we looking for? In the case of $T$, the smoothness of $f$ roughly corresponds to its Fourier coefficients $\hat{f}(r)$ decaying rapidly as $r \to \infty$:
$$f : T \to \mathbb{C}\ \longleftrightarrow\ \{\hat{f}(r) \mid r \in \mathbb{Z}\}, \qquad f \text{ smooth}\ \longleftrightarrow\ \hat{f}(r) \text{ decays rapidly}.$$
An instance of this relationship can be seen from the following theorems.

Theorem 3.1. Let $f : T \to \mathbb{C}$ be continuous, and suppose that $\sum_{r=-\infty}^{\infty}|\hat{f}(r)|$ converges. Then $S_n(f) \to f$ uniformly.
We can derive this theorem from another.
Theorem 3.2. Suppose that the sequence $\sum_{r=-n}^{n}|a_r|$ converges (as $n \to \infty$). Then $g_n(t) = \sum_{r=-n}^{n} a_r e^{irt}$ converges uniformly as $n \to \infty$ on $T$ to a function $g : T \to \mathbb{C}$, where $g$ is continuous and $\hat{g}(r) = a_r$ for all $r$.

This (roughly) says that if we have a sequence that decreases rapidly enough (its series converges absolutely), then we can choose its entries to be the Fourier coefficients of some continuous function.
To see that Theorem 3.2 implies Theorem 3.1: if $\hat{f}(r) = \hat{g}(r) = a_r$ for all $r$, and both $f$ and $g$ are continuous, then $f = g$. This is based on Fejér's theorem (or Weierstrass's). So to prove Theorem 3.1, all that remains is to prove Theorem 3.2.
Proof. The underlying idea for the proof of Theorem 3.2 is that $C(T)$ with the $\infty$-norm is a complete metric space, meaning that all Cauchy sequences converge. Recall that a sequence $(a_n)$ is Cauchy if for every $\epsilon > 0$ there is some $N$ so that for $n, m \ge N$ we have $d(a_n, a_m) < \epsilon$ (where $d$ is whatever metric we are using).

So, to prove the theorem, we only need to check that $g_n(t) = \sum_{r=-n}^{n} a_r e^{irt}$ is a Cauchy sequence in the $\infty$-norm. Since $s_n := \sum_{r=-n}^{n}|a_r|$ converges, for $\epsilon > 0$ there is some $N$ so that $|s_m - s_n| < \epsilon$ for $m \ge n \ge N$ (basically, the tail end is small); hence
$$\left\|\sum_{m\ge|r|>n} a_r e^{irt}\right\|_\infty \le \sum_{m\ge|r|>n}|a_r| < \epsilon.$$
So the $g_n$ form a Cauchy sequence, and hence $\sum_{r=-n}^{n} a_r e^{irt} \to g$ uniformly for some continuous $g$. Uniform convergence lets us integrate term by term: multiplying by $e^{-ikt}$,
$$e^{-ikt}\sum_{r=-n}^{n} a_r e^{irt} \to e^{-ikt}g(t) \text{ uniformly, so } \frac{1}{2\pi}\int_T e^{-ikt}\sum_{r=-n}^{n} a_r e^{irt}\,dt \to \frac{1}{2\pi}\int_T e^{-ikt}g(t)\,dt, \tag{3.1}$$
and the left-hand side equals $a_k$ for all $n \ge |k|$, giving $\hat{g}(k) = a_k$.
Recall that du Bois-Reymond gives an example of $f : T \to \mathbb{C}$ such that $\limsup|S_n(f, 0)| = +\infty$. However, if the first derivative is somewhat controlled, we can say more.

Theorem 3.3. Let $f : T \to \mathbb{C}$ be continuous and suppose that $f'$ is defined for all but a finite subset of $T$. Then $S_n(f) \to f$ uniformly.
$$f \text{ smooth}\ \Longrightarrow\ \hat{f} \text{ decays rapidly}\ \Longrightarrow\ S_n f \to f.$$
Recall from basic analysis: if the $f_n$ are continuously differentiable, $f_n \to f$ uniformly, and $f'_n \to g$ uniformly, then $f' = g$ and $g$ is continuous. This will allow us to show that the Fourier series of $f'$ is obtained by termwise differentiation of the Fourier series of $f$.
Theorem 3.4. Let $f : T \to \mathbb{C}$ be continuous and suppose that $\sum_{r=-\infty}^{\infty}|r||\hat{f}(r)|$ converges. Then $f$ is continuously differentiable and $\sum_{r=-n}^{n} ir\hat{f}(r)e^{irt} \to f'$ uniformly.
Proof. We would like to apply the fact above with $f_n = S_n f$. If $\sum_{r=-\infty}^{\infty}|r||\hat{f}(r)|$ converges, then $\sum_{r=-\infty}^{\infty}|\hat{f}(r)|$ converges as well (for $|r| \ge 1$ each term is no larger, and $r = 0$ contributes a single summand):
$$\sum_{r=-n}^{n}|\hat{f}(r)| \le |\hat{f}(0)| + \sum_{1\le|r|\le n}|r\,\hat{f}(r)| \quad\text{converges.}$$
So by Theorem 3.1, $f_n = S_n f \to f$ uniformly. By the same theorem, $f'_n = \sum_{r=-n}^{n} ir\hat{f}(r)e^{irt} \to g$ uniformly, where $g$ is continuous. By the statement above from basic analysis, this implies that $f$ is continuously differentiable and $f' = g$.

A similar argument provides a stronger form of the connection: the faster $\hat{f}(r)$ decreases, the smoother $f$ is.
Proposition 3.5. Let $f : T \to \mathbb{C}$ be such that $f^{(n-1)}$ is continuously differentiable except possibly at a finite set of points $X$, and $|f^{(n)}(x)| \le M$ for $x \notin X$. Then for all $r \ne 0$, $|\hat{f}(r)| \le M|r|^{-n}$.
Proof (integration by parts). $\hat{f}(r) = \frac{1}{2\pi}\int_T f(t)e^{-irt}\,dt$. Let $u = f(t)$, $dv = e^{-irt}\,dt$; then $du = f'(t)\,dt$, $v = \frac{e^{-irt}}{-ir}$, and
$$\hat{f}(r) = \frac{1}{2\pi}\int_T f(t)e^{-irt}\,dt = \frac{1}{2\pi}\left(f(t)\frac{e^{-irt}}{-ir}\Big|_{-\pi}^{\pi} + \frac{1}{ir}\int_T f'(t)e^{-irt}\,dt\right) = \frac{1}{2\pi}\cdot\frac{1}{ir}\int_T f'(t)e^{-irt}\,dt = \cdots = \frac{1}{2\pi(ir)^n}\int_T f^{(n)}(t)e^{-irt}\,dt, \tag{3.2}$$
where the boundary term vanishes since $f$ is periodic. So
$$|\hat{f}(r)| \le \frac{1}{2\pi|r|^n}\int_T\left|f^{(n)}(t)e^{-irt}\right|dt = O\!\left(\frac{1}{r^n}\right).$$
Corollary 3.6. If $f : T \to \mathbb{C}$ is in $C^2$ (twice continuously differentiable), then $S_n f \to f$ uniformly.

Proof. $\hat{f}(r) = O(1/r^2)$, so $\sum_{r=-\infty}^{\infty}|\hat{f}(r)|$ converges. So $S_n f \to f$ uniformly by Theorem 3.1.
3.2 Rate of Convergence
Until now we haven't really addressed the rate of convergence: when $S_n(f)$ does converge to $f$, how fast does it converge? Examine $g(x) = |x|$ for $x \in [-\pi, \pi]$, and extend $g$ periodically to $h(x)$. Direct calculation gives $|S_n(h, 0) - h(0)| > \frac{1}{n+2}$. Using $L^2$ theory, it can further be shown that every trigonometric polynomial $P$ of degree $n$ has the property $\|P - h\|_\infty = \Omega(n^{-3/2})$. Kolmogorov showed the following.
Theorem 3.7 (Kolmogorov). For all $A > 0$ there is a trigonometric polynomial $f$ such that:

1. $f \ge 0$;

2. $\frac{1}{2\pi}\int_T f(t)\,dt \le 1$;

3. for every $x \in T$, $\sup_n |S_n(f, x)| \ge A$.

Hence there is a Lebesgue integrable function $f$ such that for all $x \in T$, $\limsup|S_n(f, x)| = +\infty$.
3.2.1 Convergence Results
In 1966, Carleson proved the following.

Theorem 3.8 (Carleson). If $f$ is continuous (or only Riemann integrable), then $S_n f \to f$ almost everywhere.

Later, Kahane and Katznelson proved that this result is tight.

Theorem 3.9. For all $E \subseteq T$ with $\mu(E) = 0$, there is a continuous $f$ such that $S_n f \to f$ exactly on $T\setminus E$.

Notice that these results make rather weak assumptions on $f$. We will now work on seeing how things improve when $f$ is an $L^2$ function.
3.3 $L^2$ theory for Fourier Series
Recall that part of our original question was: how are $f$ and $\hat{f}$ related? Our immediate goal is to show that in the $L^2$ case their norms are identical, which is the Parseval identity. Here $\|f\|_2 = \sqrt{\frac{1}{2\pi}\int_T |f(t)|^2\,dt}$, and the Parseval identity states $\|f\|_2 = \|\hat{f}\|_2$. For the discrete Fourier transform, this essentially means that the transform matrix is orthonormal.

We will proceed by focusing on Hilbert spaces. A Hilbert space $H$ is a complete normed ($\mathbb{C}$-linear) space with an inner product $\langle\cdot,\cdot\rangle$ satisfying the following axioms:

1. $\langle ax + by, z \rangle = a\langle x, z \rangle + b\langle y, z \rangle$;

2. $\langle x, y \rangle = \overline{\langle y, x \rangle}$;

3. $\langle x, x \rangle = \|x\|^2 \ge 0$, with equality iff $x = 0$.
There are a number of facts that we know about familiar Hilbert spaces (like $\mathbb{R}^n$) that hold for general Hilbert spaces as well.

Theorem 3.10. If $H$ is a Hilbert space, then the Cauchy-Schwarz inequality holds: if $f, g \in H$, then $\|f\|\cdot\|g\| \ge |\langle f, g \rangle|$.

Proof. We show the proof for real Hilbert spaces. For every $\lambda \in \mathbb{R}$,
$$0 \le \langle f - \lambda g, f - \lambda g \rangle = \|f\|^2 - 2\lambda\langle f, g \rangle + \lambda^2\|g\|^2. \tag{3.3}$$
Viewing this as a degree 2 polynomial in $\lambda$ which is non-negative, it has at most one real root. Hence the discriminant $(2\langle f, g \rangle)^2 - 4\|f\|^2\|g\|^2 \le 0$, i.e. $\langle f, g \rangle^2 \le \|f\|^2\|g\|^2$.
One may ask: given an element $f \in H$, how can we best approximate $f$ with respect to some basis? Specifically, let $e_{-n}, \dots, e_0, e_1, \dots, e_n$ be an orthonormal system in $H$ (meaning $\langle e_i, e_j \rangle = \delta_{ij}$). Given $f \in H$, the question is to find $\lambda_i \in \mathbb{C}$ such that $\|f - \sum\lambda_i e_i\|$ is minimized.

Theorem 3.11. Let $H$, $e_i$, $f$ be as above. Set $g = \sum_{j=-n}^{n}\lambda_j e_j$ and $g_0 = \sum_{j=-n}^{n}\langle f, e_j\rangle e_j$. Then
$$\|f\|_2^2 \ge \sum_{j=-n}^{n}|\langle f, e_j\rangle|^2, \qquad \|f - g\|_2 \ge \|f - g_0\|_2 = \sqrt{\|f\|_2^2 - \sum_{j=-n}^{n}|\langle f, e_j\rangle|^2}, \tag{3.4}$$
with equality iff $\lambda_j = \langle f, e_j \rangle$ for all $j$.
Proof.
$$\|f - g\|_2^2 = \langle f - g, f - g \rangle = \left\langle f - \sum_j\lambda_j e_j,\ f - \sum_j\lambda_j e_j\right\rangle = \|f\|_2^2 - \sum_j\left(\bar\lambda_j\langle f, e_j\rangle + \lambda_j\overline{\langle f, e_j\rangle}\right) + \sum_j|\lambda_j|^2$$
$$= \langle f, f \rangle + \sum_j\left|\lambda_j - \langle f, e_j\rangle\right|^2 - \sum_j|\langle f, e_j\rangle|^2 \ \ge\ \langle f, f \rangle - \sum_j|\langle f, e_j\rangle|^2 = \|f - g_0\|_2^2. \tag{3.5}$$
Note that equality in the last step occurs exactly when $\lambda_j = \langle f, e_j \rangle$ for all $j$.
Corollary 3.12 (Approximation and Bessel's inequality).

1. $S_n f$ is the closest (in the $L^2$ sense) degree $n$ trigonometric polynomial approximation to $f$.

2. (Bessel's inequality.) If $f \in L^2(T)$, then
$$\|f\|_2^2 = \frac{1}{2\pi}\int_T |f(t)|^2\,dt \ \ge\ \sum_{r=-n}^{n}|\hat{f}(r)|^2, \qquad\text{and}\qquad \|f\|_2^2 \ \ge\ \sum_{r=-\infty}^{\infty}|\hat{f}(r)|^2.$$
This shows one side of the Parseval identity, namely $\|f\|_2 \ge \|\hat{f}\|_2$.
Recall from Theorem 3.1 that if $f$ is continuous and $\hat{f} \in \ell_1$ (meaning $\sum_r|\hat{f}(r)|$ converges), then $S_n f \to f$ uniformly. We will show that $f$ having a continuous first derivative in fact implies that $\hat{f} \in \ell_1$.

Corollary 3.13. If $f \in C^1$, then $S_n f \to f$ uniformly.
Proof.
$$\sum_{r=-n}^{n}|\hat{f}(r)| = |\hat{f}(0)| + \sum_{1\le|r|\le n}|r\hat{f}(r)|\cdot\frac{1}{|r|} \le |\hat{f}(0)| + \sqrt{\sum_{1\le|r|\le n}\frac{1}{r^2}}\cdot\sqrt{\sum_{1\le|r|\le n}\big|\hat{f'}(r)\big|^2} \le |\hat{f}(0)| + \sqrt{\frac{\pi^2}{3}}\cdot\sqrt{\frac{1}{2\pi}\int_T|f'(t)|^2\,dt},$$
using Cauchy-Schwarz, the identity $\hat{f'}(r) = ir\hat{f}(r)$, the identity $\sum_{n\ge1}\frac{1}{n^2} = \frac{\pi^2}{6}$ (counted once for each sign of $r$), and Bessel's inequality applied to $f'$. This is bounded since the first derivative is continuous on $T$, hence bounded.
3.3.1 Parseval's Identity
We are now ready to complete the proof of the Parseval identity.

Theorem 3.14. If $f : T \to \mathbb{C}$ is continuous, then $\|f - S_n f\|_2 \to 0$.

Proof. By Weierstrass (or Fejér) approximation, for any $\epsilon > 0$ there is some trigonometric polynomial $P$ such that $\|f - P\|_\infty < \epsilon$. So, for $n$ at least the degree of $P$,
$$\|f - S_n f\|_2 \le \|f - P\|_2 + \|S_n P - S_n f\|_2 \le \|f - P\|_\infty + \|S_n(P - f)\|_2,$$
where we use the fact that $S_n P = P$ for every trigonometric polynomial $P$ of degree at most $n$. Bessel's inequality tells us $\|S_n(P - f)\|_2 \le \|P - f\|_2$. Since $\|P - f\|_2 \le \|P - f\|_\infty < \epsilon$, we see that $\|f - S_n f\|_2 < 2\epsilon$. This completes the proof.
Hence it is easy to see that $\|f\|_2 = \|\hat{f}\|_2$, and we have the Parseval identity.

Corollary 3.15 (Parseval). If $f : T \to \mathbb{C}$ is continuous, then
$$\frac{1}{2\pi}\int_T|f(t)|^2\,dt = \|f\|_2^2 = \sum_{r=-\infty}^{\infty}|\hat{f}(r)|^2.$$
Proof. Since $\|f - S_n f\|_2^2 = \|f\|_2^2 - \sum_{r=-n}^{n}|\hat{f}(r)|^2$ goes to $0$ as $n \to \infty$, we conclude that $\sum_{r=-n}^{n}|\hat{f}(r)|^2 \to \|f\|_2^2$ as $n \to \infty$.

In other words, $f \mapsto \hat{f}$ is an isometry in $L^2$.
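A numerical check of this isometry (our sketch; integrals again approximated by Riemann sums):

```python
import numpy as np

# Check Parseval: (1/2pi) int_T |f|^2 dt = sum_r |hat f(r)|^2, for f(t) = |sin t|.
M = 8192
t = np.linspace(0, 2 * np.pi, M, endpoint=False)
f = np.abs(np.sin(t))

lhs = np.mean(np.abs(f) ** 2)                              # (1/2pi) int |f|^2 = 1/2
coeffs = [np.mean(f * np.exp(-1j * r * t)) for r in range(-200, 201)]
rhs = sum(abs(c) ** 2 for c in coeffs)                     # truncated coefficient sum
print(lhs, rhs)                                            # agree to high accuracy
```

Here $\hat f(r)$ decays like $1/r^2$, so truncating the sum at $|r| = 200$ loses a negligible tail.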
3.4 Geometric Proof of the Isoperimetric Inequality
We will complete this next time. Here we present Steiner's idea for resolving the following question: what is the largest area of a planar region with fixed circumference $L$? Suppose that $C$ is a curve such that the enclosed area is optimal.

$C$ encloses a convex region. If not, then there are points $A$, $B$ on $C$ such that the line segment joining $A$ and $B$ lies outside the region. By replacing the arc from $A$ to $B$ with the line segment from $A$ to $B$, we increase the area and decrease the circumference. See Figure 3.1.

Figure 3.1: Joining A and B yields more area.

$C$ encloses a centrally symmetric region. If not, pick points $A$, $B$ such that the two arcs $L$ and $L'$ from $A$ to $B$ have the same length. Suppose that the region enclosed by $AB \cup L$ has area at least that of the region enclosed by $AB \cup L'$. We can then replace the latter by a mirror copy of the former. This can only increase the total area, and it yields a region that is centrally symmetric with respect to the midpoint of the segment $[A, B]$. See Figure 3.2.

Figure 3.2: Reflecting the larger area yields more area.

$C$ is a circle.
Recall the following fact from Euclidean geometry: a circle is the locus of all points $x$ such that $xA$ is perpendicular to $xB$, where $AB$ is some fixed segment (the diameter of the circle). Therefore, if $C$ is not a circle, there is some parallelogram $a, b, c, d$ inscribed in $C$, with $ac$ passing through the center (since $C$ is centrally symmetric), and with the angle at $b$ not equal to $\frac{\pi}{2}$. Now move sides $a, b$ and $c, d$ to sides $a', b'$ and $c', d'$ such that $a', b', c', d'$ forms a rectangle. See Figure 3.3.

Figure 3.3: Changing the parallelogram to a rectangle yields more area.

We obtain a new curve $C'$ such that the area outside the rectangle $R = [a', b', c', d']$ is the same as the area outside the parallelogram $P = [a, b, c, d]$. Since the side lengths of $R$ and $P$ are the same, the area enclosed by $R$ must exceed the area enclosed by $P$, so the area enclosed by $C'$ must exceed the area enclosed by $C$. Hence $C$ was not optimal, and our parallelogram $P$ must have angles equal to $\frac{\pi}{2}$.

Although these ideas are pretty and useful, this is still not a proof of the isoperimetric inequality: we do not know that an optimal $C$ exists, only that if it does, it must be a circle.
Lecture 4
Applications of Harmonic Analysis
February 4, 2005
Lecturer: Nati Linial
Notes: Matthew Cary
4.1 Useful Facts
Most of our applications of harmonic analysis to computer science will involve only Parsevals identity.
Theorem 4.1 (Parsevals Identity).
|f|
2
= |

f|
2
Corollary 4.2.
f, g) =

f, g).
Proof. Note that f + g, f + g) = |f + g|
2
= |

f +g|
2
= |

f + g|
2
. Now as f + g, f + g) =
|f|
2
2
+ |g|
2
2
+ 2f, g), and similarly |

f + g|
2
2
= |

f|
2
2
+ | g|
2
2
+ 2

f, g), applying Parseval to |f|


2
and
|g|
2
and equating nishes the proof.
The other basic identity is the following.

Lemma 4.3. $\widehat{f * g} = \hat{f}\cdot\hat{g}$.

Proof. We will show this for the unit circle $T$, but one should note that it is true more generally. Recall that by definition $h = f * g$ means
$$h(t) = \frac{1}{2\pi}\int_T f(s)g(t - s)\,ds.$$
Now to calculate $\widehat{f * g}$ we manipulate $\hat{h}$:
$$\hat{h}(r) = \frac{1}{2\pi}\int_T h(x)e^{-irx}\,dx = \frac{1}{4\pi^2}\iint_{T^2} f(s)g(x - s)e^{-irx}\,ds\,dx = \frac{1}{4\pi^2}\iint_{T^2} f(s)g(x - s)e^{-irs}e^{-ir(x - s)}\,dx\,ds,$$
using $e^{-irx} = e^{-irs}e^{-ir(x - s)}$ and interchanging the order of integration. Then, taking $u = x - s$, we have
$$\hat{h}(r) = \frac{1}{4\pi^2}\iint_{T^2} f(s)g(u)e^{-irs}e^{-iru}\,du\,ds = \left(\frac{1}{2\pi}\int_T f(s)e^{-irs}\,ds\right)\left(\frac{1}{2\pi}\int_T g(u)e^{-iru}\,du\right) = \hat{f}(r)\,\hat{g}(r).$$
4.2 Hurwitz's Proof of the Isoperimetric Inequality

Recall from last lecture that the isoperimetric problem is to show that among all curves of a fixed length, the circle encloses the largest area. Formally, if $L$ is the length of a curve and $A$ the area enclosed, then we want to show that $L^2 - 4\pi A \ge 0$, with equality if and only if the curve is a circle. We will prove the following stronger theorem.

Theorem 4.4. Let $(x, y) : T \to \mathbb{R}^2$ be an anticlockwise arc-length parametrization of a non-self-intersecting curve $\Gamma$ of length $L$ enclosing an area $A$. If $x, y \in C^1$, then
$$L^2 - 4\pi A = 2\pi^2\sum_{n\ne0}\left[\,|n\hat{x}(n) - i\hat{y}(n)|^2 + |n\hat{y}(n) + i\hat{x}(n)|^2 + (n^2 - 1)\left(|\hat{x}(n)|^2 + |\hat{y}(n)|^2\right)\right].$$
In particular, $L^2 \ge 4\pi A$, with equality if and only if $\Gamma$ is a circle.

We will not define arc-length parametrization formally; we only remark that, intuitively, if one views the parametrization as describing the motion of a particle in the plane, then an arc-length parametrization is one for which the speed of the particle is constant. In our context, where we view time as the unit circle $T$ of circumference $2\pi$, we have that $(\dot{x})^2 + (\dot{y})^2$ is a constant, namely $\left(\frac{L}{2\pi}\right)^2$, so that the total distance covered is $L$.
Proof. First we use our identity about the parametrization to relate the length to the transform of the parametrization:
$$\left(\frac{L}{2\pi}\right)^2 = \frac{1}{2\pi}\int_T\left(\dot{x}(s)^2 + \dot{y}(s)^2\right)ds = \|\hat{\dot{x}}\|_2^2 + \|\hat{\dot{y}}\|_2^2 \quad\text{by Parseval}$$
$$= \sum_n|in\hat{x}(n)|^2 + |in\hat{y}(n)|^2 \quad\text{by the Fourier differentiation identities} \;=\; \sum_n n^2\left(|\hat{x}(n)|^2 + |\hat{y}(n)|^2\right). \tag{4.1}$$
Figure 4.1: Computing the area enclosed by a curve.
Now we compute the area. As the curve is anticlockwise,
$$A = -\oint y\,\frac{dx}{ds}\,ds,$$
where the negative sign comes from the orientation; see Figure 4.1. This area integral looks like an inner product, so we write
$$\frac{A}{2\pi} = -\langle y, \dot{x} \rangle = -\langle \hat{y}, \hat{\dot{x}} \rangle.$$
By symmetry, considering the area integral from the other direction, we also have $\frac{A}{2\pi} = \langle x, \dot{y} \rangle = \langle \hat{x}, \hat{\dot{y}} \rangle$; note there is no negative sign in this expression. Hence, by adding, we have
$$\frac{A}{\pi} = \langle \hat{x}, \hat{\dot{y}} \rangle - \langle \hat{y}, \hat{\dot{x}} \rangle = \sum_n in\left(\hat{x}(n)\overline{\hat{y}(n)} - \overline{\hat{x}(n)}\hat{y}(n)\right), \tag{4.2}$$
using the Fourier differentiation identities and writing $\bar{a}$ for the complex conjugate of $a$. Now compute $L^2 - 4\pi A = 4\pi^2\left((L/2\pi)^2 - A/\pi\right)$ from (4.1) and (4.2) and complete the squares to prove the theorem.

To see why the right-hand side is zero if and only if $\Gamma$ is a circle, consider when it vanishes. As it is a sum of squares, $\hat{x}(n)$ and $\hat{y}(n)$ must vanish for all $n \notin \{-1, 0, 1\}$, while the first two squares constrain the $n = \pm1$ coefficients. Looking carefully at what those constraints mean shows that they hold if and only if $\Gamma$ is a circle.
4.3 Harmonic Analysis on the Cube for Coding Theory
The theory of error-correcting codes is broad and has numerous practical applications. We will look at the asymptotic theory of block coding which, like many problems in coding theory, is well known, has a long history, and is still not well understood. The boolean or Hamming cube $\{0,1\}^n$ is the set of all $n$-bit strings over $\{0,1\}$. The usual distance on $\{0,1\}^n$ is the Hamming distance $d_H(x, y)$, defined for $x, y \in \{0,1\}^n$ as the number of positions where $x$ and $y$ differ: $d_H(x, y) = |\{i : x_i \ne y_i\}|$. A code $\mathcal{C}$ is a subset of $\{0,1\}^n$. The minimum distance of $\mathcal{C}$ is the minimum distance between any two distinct elements of $\mathcal{C}$:
$$\text{dist}(\mathcal{C}) = \min\{d_H(x, y) : x, y \in \mathcal{C},\ x \ne y\}.$$
The asymptotic question is to estimate the size of the largest code with a given minimum distance,
$$A(n, d) = \max\{|\mathcal{C}| : \mathcal{C} \subseteq \{0,1\}^n,\ \text{dist}(\mathcal{C}) \ge d\},$$
as $n \to \infty$. The problem is easier if we restrict the parameter space by fixing $d$ to be a constant fraction of the bit-length $n$, that is, consider $A(n, \delta n)$. Simple constructions show for $1/2 > \delta > 0$ that $A(n, \delta n)$ is exponential in $n$, so the interesting quantity is the bit-rate of the code. Accordingly, we define the rate of a code as $\frac{1}{n}\log|\mathcal{C}|$, and then define the asymptotic rate limit as
$$R(\delta) = \limsup_{n\to\infty}\ \max\left\{\frac{1}{n}\log|\mathcal{C}| : \mathcal{C} \subseteq \{0,1\}^n,\ \text{dist}(\mathcal{C}) \ge \delta n\right\}.$$
It is a sign of our poor knowledge of the area that we do not even know whether the limsup above can be replaced by a lim, i.e., whether the limit exists. If $|\mathcal{C}| = 2^k$, we may think of the code as mapping $k$-bit strings into $n$-bit strings which are then communicated over a channel. The rate is then the ratio $k/n$, and it measures the efficiency with which we utilize the channel.
A code is linear if $\mathcal{C}$ is a linear subspace of $\{0,1\}^n$, viewed as a vector space over $GF(2)$. In a linear code, if the minimum distance is realized by two codewords $x$ and $y$, then $x - y$ is a codeword whose Hamming weight equals the minimum distance. Hence for linear codes we have
$$\text{dist}(\mathcal{C}) = \min\left\{|w| : w \in \mathcal{C}\setminus\{0\}\right\}.$$
Here we use $|\cdot|$ to denote the Hamming weight of a codeword, the number of nonzero positions. Note that this agrees with several other common norms on $GF(2)^n$.

A useful entity is the orthogonal (dual) code of a given code. If $\mathcal{C}$ is a linear code, we define
$$\mathcal{C}^\perp = \{y : \forall x \in \mathcal{C},\ \langle x, y \rangle = 0\},$$
where we compute the inner product $\langle\cdot,\cdot\rangle$ over $GF(2)$, that is, $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i \pmod 2$.
4.3.1 Distance Distributions and the MacWilliams Identities
Our first concrete study of codes concerns the distance distribution, i.e. the probabilities
$$\Pr\big[|x - y| = k\big], \qquad x, y \text{ chosen randomly from } \mathcal{C},$$
for $0 \le k \le n$. If $\mathcal{C}$ is linear, our discussion above shows that the question of the distance distribution is identical to that of the weight distribution of the code: the probabilities that a randomly selected codeword has a specified weight.

The MacWilliams identities are important identities about this distribution that are easily derived using Parseval's identity. Let $f = 1_{\mathcal{C}}$, the indicator function of the code. We first need the following lemma.
Lemma 4.5.
$$\hat{f} = \frac{|\mathcal{C}|}{2^n}\,1_{\mathcal{C}^\perp}$$
Proof.
$$\hat{f}(u) = \frac{1}{2^n}\sum_v f(v)\chi_v(u) = \frac{1}{2^n}\sum_v f(v)(-1)^{\langle u, v\rangle} = \frac{1}{2^n}\sum_{v\in\mathcal{C}}(-1)^{\langle u, v\rangle}.$$
If $u \in \mathcal{C}^\perp$, then $\langle u, v \rangle = 0$ for all $v \in \mathcal{C}$, so that $\hat{f}(u) = |\mathcal{C}|/2^n$. Suppose otherwise, so that $\sum_{\mathcal{C}}(-1)^{\langle u, v\rangle} = |\mathcal{C}_0| - |\mathcal{C}_1|$, where $\mathcal{C}_0$ consists of the codewords of $\mathcal{C}$ that are perpendicular to $u$, and $\mathcal{C}_1 = \mathcal{C}\setminus\mathcal{C}_0$. As $u \notin \mathcal{C}^\perp$, $\mathcal{C}_1$ is nonempty. Pick an arbitrary $w \in \mathcal{C}_1$. Then the map $x \mapsto w + x$ sends $\mathcal{C}_0$ into $\mathcal{C}_1$ and $\mathcal{C}_1$ into $\mathcal{C}_0$, and is its own inverse; hence $|\mathcal{C}_0| = |\mathcal{C}_1|$ and $\sum_{\mathcal{C}}(-1)^{\langle u, v\rangle} = 0$. Therefore
$$\hat{f}(u) = \begin{cases}|\mathcal{C}|/2^n & \text{if } u \in \mathcal{C}^\perp\\ 0 & \text{otherwise}\end{cases}$$
which proves the lemma.
We now define the weight enumerator of a code to be
$$P_{\mathcal{C}}(x, y) = \sum_{w\in\mathcal{C}} x^{|w|}y^{n-|w|}.$$
The MacWilliams identity connects the weight enumerators of $\mathcal{C}$ and $\mathcal{C}^\perp$ for linear codes.
Theorem 4.6 (The MacWilliams identity).
$$P_{\mathcal{C}}(x, y) = \frac{|\mathcal{C}|}{2^n}\,P_{\mathcal{C}^\perp}(y - x,\ y + x)$$
Proof. Harmonic analysis provides a nice proof of the identity by viewing it as an inner product. Define $f = 1_{\mathcal{C}}$ and $g(w) = x^{|w|}y^{n-|w|}$. Then, using Parseval,
$$P_{\mathcal{C}}(x, y) = 2^n\langle f, g \rangle = 2^n\langle \hat{f}, \hat{g} \rangle.$$
$\hat{f}$ has already been computed in Lemma 4.5, so we turn our attention to $\hat{g}$:
$$\hat{g}(u) = \frac{1}{2^n}\sum_v g(v)(-1)^{\langle u, v\rangle} = \frac{1}{2^n}\sum_v x^{|v|}y^{n-|v|}(-1)^{\langle u, v\rangle}.$$
Let $u$ have $k$ ones and $n - k$ zeros. For a given $v$, let $s$ be the number of ones of $v$ that coincide with the ones of $u$, and let $t$ be the number of ones of $v$ coinciding with the zeros of $u$. Then we rewrite the sum as
$$\hat{g}(u) = \frac{1}{2^n}\sum_{s,t}\binom{k}{s}\binom{n-k}{t}x^{s+t}y^{n-s-t}(-1)^s = \frac{y^n}{2^n}\sum_s\binom{k}{s}\left(-\frac{x}{y}\right)^s\sum_t\binom{n-k}{t}\left(\frac{x}{y}\right)^t$$
$$= \frac{y^n}{2^n}\left(1 - \frac{x}{y}\right)^k\left(1 + \frac{x}{y}\right)^{n-k} = \frac{1}{2^n}(y - x)^k(y + x)^{n-k} = \frac{1}{2^n}(y - x)^{|u|}(y + x)^{n-|u|}.$$
Now, as $P_{\mathcal{C}}(x, y) = 2^n\langle \hat{f}, \hat{g} \rangle = 2^n\sum_u \hat{f}(u)\hat{g}(u)$ (on the transform side the inner product is the plain sum), we plug in our expressions for $\hat{f}$ and $\hat{g}$ to get
$$P_{\mathcal{C}}(x, y) = 2^n\sum_{u\in\mathcal{C}^\perp}\frac{|\mathcal{C}|}{2^n}\cdot\frac{1}{2^n}(y - x)^{|u|}(y + x)^{n-|u|} = \frac{|\mathcal{C}|}{2^n}\,P_{\mathcal{C}^\perp}(y - x,\ y + x).$$
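The identity is easy to test on a small code. Here is a sketch of ours for the 3-bit repetition code, whose dual is the even-weight code:

```python
from itertools import product

# Numerical check of the MacWilliams identity
#   P_C(x, y) = (|C| / 2^n) * P_{C_perp}(y - x, y + x)
# for the repetition code {000, 111}; values compared at a sample point (x, y).
n = 3
C = [(0, 0, 0), (1, 1, 1)]
space = list(product([0, 1], repeat=n))
Cperp = [u for u in space
         if all(sum(ui * vi for ui, vi in zip(u, v)) % 2 == 0 for v in C)]

def P(code, x, y):                # weight enumerator sum_w x^{|w|} y^{n-|w|}
    return sum(x ** sum(w) * y ** (n - sum(w)) for w in code)

x, y = 0.3, 1.7
print(P(C, x, y), len(C) / 2**n * P(Cperp, y - x, y + x))   # the two agree
```

For this code both sides equal $x^3 + y^3$, which can also be checked by hand.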
4.3.2 Upper and Lower Bounds on the Rate of Codes
We now turn our attention to upper and lower bounds for codes. We remind any complexity theorists reading these notes that the senses of upper bound and lower bound are reversed from their usage in complexity theory: a lower bound on $R(\delta)$ shows that good codes exist, and an upper bound shows that superb codes don't.

In the remainder of this lecture we show several simple upper and lower bounds, and then set the stage for essentially the strongest known upper bound on the rate of codes, the McEliece, Rodemich, Rumsey and Welch (MRRW) upper bound. This is also referred to as the JPL bound, after the lab the authors worked in, or the linear programming (LP) bound, after its proof method.

Our first bound is a lower bound. Recall the binary entropy function
$$H(x) = -x\log x - (1 - x)\log(1 - x).$$
Theorem 4.7 (Gilbert-Varshamov bound).
$$R(\delta) \ge 1 - H(\delta),$$
and there exists a linear code satisfying the bound.
Proof. We sequentially pick codewords, where each new point avoids the $\delta n$-spheres around previously selected points. The resulting code $\mathcal{C}$ satisfies
$$|\mathcal{C}| \ge \frac{2^n}{\text{vol(sphere of radius } \delta n)} = \frac{2^n}{\sum_{j=0}^{\delta n}\binom{n}{j}}.$$
Now note that $\frac{1}{n}\log\binom{n}{\delta n} \to H(\delta)$ as $n \to \infty$, so that $2^n/\sum_{j\le\delta n}\binom{n}{j} \ge 2^{n(1 - H(\delta)) - o(n)}$; take logs to prove the first part of the theorem.

We now show that there is a linear code satisfying this rate bound. (This proof is different from the one given in class, as I couldn't get that one to work out; the presentation is taken from Trevisan's survey of coding theory for computational complexity.) We can describe a linear $k$-dimensional code $\mathcal{C}_A$ by a $k \times n$ 0-1 matrix $A$ via $\mathcal{C}_A = \{xA : x \in \{0,1\}^k\}$. We'll show that if $k/n \le 1 - H(\delta) - o(1)$, then with positive probability $\text{dist}(\mathcal{C}_A) \ge \delta n$. As the code is linear, it suffices to show that the weight of all nonzero codewords is at least $\delta n$. For a given $x \ne 0$, $xA$ is uniformly distributed over $\{0,1\}^n$ (over the random choice of $A$), so we have
$$\Pr\big[|xA| < \delta n\big] = 2^{-n}\sum_{i=0}^{\delta n - 1}\binom{n}{i} \le 2^{-n}\,2^{nH(\delta) + o(n)},$$
using our approximation to the binomial sum. Now we take a union bound over all $2^k$ choices of $x$ to get
$$\Pr\big[\exists x \ne 0 : |xA| < \delta n\big] \le 2^k\,2^{-n}\,2^{nH(\delta) + o(n)} = 2^{k + n(H(\delta) - 1) + o(n)} < 1$$
by our choice of $k \le n(1 - H(\delta)) - o(n)$.
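The random-code argument can be watched directly on small parameters (a sketch of ours; $n$ is kept tiny so the $2^k$ enumeration is cheap):

```python
import numpy as np
from itertools import product

# Empirical version of the random-linear-code argument: sample a random k x n
# generator matrix A over GF(2) and measure the minimum weight of C_A \ {0}.
rng = np.random.default_rng(0)
n, delta = 20, 0.2
H = -delta * np.log2(delta) - (1 - delta) * np.log2(1 - delta)
k = int(n * (1 - H))                       # dimension roughly n(1 - H(delta))
A = rng.integers(0, 2, size=(k, n))        # random generator matrix over GF(2)
min_wt = min(int((np.array(x) @ A % 2).sum())
             for x in product([0, 1], repeat=k) if any(x))
print("k =", k, " min weight =", min_wt, " delta*n =", delta * n)
```

With positive probability the minimum weight is at least $\delta n$, which is what the union bound guarantees asymptotically.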
We now turn to upper bounds on $R(\delta)$.

Theorem 4.8 (Sphere-packing bound).
$$R(\delta) \le 1 - H(\delta/2)$$
Proof. The theorem follows from noting that balls of radius $\delta n/2$ around codewords must be disjoint, and applying the approximations used above for the volume of spheres in the cube.

We note in Figure 4.2 that the sphere-packing bound is far from the GV bound. In particular, the GV bound reaches zero at $\delta = 1/2$, while the sphere-packing bound is positive until $\delta = 1$. However, we have the following simple claim.
Claim 4.1. $R(\delta) = 0$ for $\delta > 1/2$.

Proof. We will show the stronger statement that if $|\mathcal{C}|$ is substantial, then not only is it impossible to have $d_H(x, y) > n/2$ for all $x, y \in \mathcal{C}$, but even the average of $d_H(x, y)$ over all pairs $x, y \in \mathcal{C}$ is at most roughly $n/2$. This average distance is
$$\frac{1}{\binom{|\mathcal{C}|}{2}}\sum_{\{x,y\}\subseteq\mathcal{C}} d(x, y),$$
Figure 4.2: The GV bound contrasted with the sphere-packing and Elias bounds.
and we expand the distance as $d(x, y) = |\{i : x_i \ne y_i\}|$. Reversing the order of summation,
$$\text{Average distance} = \frac{1}{\binom{|\mathcal{C}|}{2}}\sum_{\{x,y\}}\sum_i 1_{x_i\ne y_i} = \frac{1}{\binom{|\mathcal{C}|}{2}}\sum_i z_i\left(|\mathcal{C}| - z_i\right) \le \frac{1}{\binom{|\mathcal{C}|}{2}}\cdot n\,\frac{|\mathcal{C}|^2}{4} = \frac{1}{2}\,n\,\frac{|\mathcal{C}|}{|\mathcal{C}| - 1},$$
where $z_i$ is the number of zeros in the $i$-th position over all the codewords of $\mathcal{C}$. So unless $\mathcal{C}$ is very small, the average distance is essentially at most $n/2$.
Our next upper bound improves on the sphere-packing bound, at least achieving $R(\delta) = 0$ for $\delta > 1/2$. It still leaves a substantial gap with the GV bound.

Theorem 4.9 (Elias bound).
$$R(\delta) \le 1 - H\!\left(\frac{1 - \sqrt{1 - 2\delta}}{2}\right)$$
Proof. The proof begins by reconsidering the average-distance calculation from the previous theorem. It follows from Jensen's inequality that if the average weight of the vectors in $\mathcal{C}$ is $\lambda n$, then the maximum of $\sum_i z_i(|\mathcal{C}| - z_i)$ is obtained when $z_i = (1 - \lambda)|\mathcal{C}|$ for all $i$. We sketch the argument for those not familiar with Jensen's inequality: the inequality states that if $f$ is concave, then for $x_1, \dots, x_n$, $\frac{1}{n}\sum f(x_i) \le f\left(\sum x_i/n\right)$, with equality if and only if $x_1 = \dots = x_n$. For our case, the function $f(z) = z(|\mathcal{C}| - z)$ is easily verified concave, so the maximum of $\sum_i z_i(|\mathcal{C}| - z_i)$ for a given total $\sum_i z_i$ is achieved when the $z_i$ are all equal. This makes the average distance in $\mathcal{C}$ at most $2\lambda(1 - \lambda)n$.

With this calculation in mind, choose a spherical shell $S$ in $\{0,1\}^n$ centered at some $x_0$ with radius $r$ such that
$$|S \cap \mathcal{C}| \ge |\mathcal{C}|\,\frac{|S|}{2^n}.$$
Such a shell exists, as the right-hand side is the expected intersection size when the shell is chosen randomly. Set $r = pn$, so that $|S| \ge 2^{nH(p) - o(n)}$, which means
$$|S \cap \mathcal{C}| \ge \frac{|\mathcal{C}|}{2^{n(1 - H(p)) + o(n)}}.$$
Now apply the averaging argument above to $x_0 + (\mathcal{C} \cap S)$, a code lying inside the Hamming sphere of radius $pn$ around $0$. Its vectors have a $p$ fraction of ones on average, so its average distance is at most about $2p(1 - p)n$; hence if $\delta > 2p(1 - p)$, then $|S \cap \mathcal{C}|$ must be subexponential. Since $|S \cap \mathcal{C}| \ge |\mathcal{C}|\,2^{-n(1 - H(p)) - o(n)}$, this implies
$$\frac{1}{n}\log|\mathcal{C}| \le 1 - H(p) + o(1).$$
Let us rewrite our condition $\delta > 2p(1 - p)$ as follows:
$$1 - 2p > \sqrt{1 - 2\delta}\ \Longleftrightarrow\ p < \frac{1 - \sqrt{1 - 2\delta}}{2}.$$
This is the critical value of $p$: when $p$ is below this, the code is insubstantially small.
Figure 4.2 shows how the Elias bound improves the sphere-packing bound to something reasonable. The
gap between it and the GV bound is still large, however.
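Since all three bounds are explicit functions of $\delta$, they are easy to tabulate; the following sketch of ours is a numeric companion to Figure 4.2 (all logs in base 2, matching the rate definition):

```python
import numpy as np

# Evaluate the bounds discussed here at a few values of delta:
# Gilbert-Varshamov (lower), sphere-packing (upper), and Elias (upper).
def H(x):
    x = np.asarray(x, dtype=float)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

delta = np.array([0.05, 0.1, 0.2, 0.3, 0.4])
gv = 1 - H(delta)                                 # R(delta) >= 1 - H(delta)
sphere = 1 - H(delta / 2)                         # R(delta) <= 1 - H(delta/2)
elias = 1 - H((1 - np.sqrt(1 - 2 * delta)) / 2)   # Elias upper bound
for row in zip(delta, gv, sphere, elias):
    print("delta=%.2f  GV=%.3f  sphere=%.3f  Elias=%.3f" % row)
```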
4.4 Aside: The Erdős-Ko-Rado Theorem

The proof of the Elias bound that we just saw is based on the following clever idea: we investigate an unknown object (the code $\mathcal{C}$) by intersecting it with random elements of a cleverly chosen family (the spheres). This method of a randomly chosen fish-net is also the basis for the following beautiful proof, due to Katona, of the Erdős-Ko-Rado theorem.

Definition 4.1. An intersecting family is a family $\mathcal{F}$ of $k$-sets in $\{1, \dots, n\}$ (compactly, $\mathcal{F} \subseteq \binom{[n]}{k}$), with $2k \le n$, such that for any $A, B \in \mathcal{F}$, $A \cap B \ne \emptyset$.

Informally, an intersecting family is a collection of sets which are pairwise intersecting. One way to construct such a family is to pick a common point of intersection and then choose all possible $(k-1)$-sets to fill out the sets. The Erdős-Ko-Rado theorem says that this easy construction is the best possible.

Theorem 4.10 (Erdős-Ko-Rado). If $\mathcal{F} \subseteq \binom{[n]}{k}$ is an intersecting family with $2k \le n$, then
$$|\mathcal{F}| \le \binom{n-1}{k-1}.$$
Proof (Katona). Given an intersecting family $\mathcal{F}$, arrange $1, \dots, n$ in a random permutation $\pi$ along a circle, and count the number of sets $A \in \mathcal{F}$ such that $A$ appears as an arc of $\pi$. This will be our random fish-net. There are $(n-1)!$ cyclic permutations: $n!$ permutations, divided by $n$ since rotations of the circle are identical. For a given $k$-set to appear consecutively on the circle, there are $k!$ ways to arrange it as an arc and $(n-k)!$ ways to arrange the other elements so as not to interfere with that arc. Hence the probability that a given $k$-set appears as an arc is
$$\frac{k!\,(n-k)!}{(n-1)!} = \frac{n}{\binom{n}{k}},$$
which by linearity of expectation implies
$$\mathbb{E}\big[\#\text{ arcs belonging to }\mathcal{F}\big] = \frac{n\,|\mathcal{F}|}{\binom{n}{k}}.$$
Now, as $2k \le n$, at most $k$ members of an intersecting family can appear as arcs on the circle; otherwise two of the arcs wouldn't intersect. Hence
$$\frac{n\,|\mathcal{F}|}{\binom{n}{k}} \le k,$$
implying
$$|\mathcal{F}| \le \frac{k}{n}\binom{n}{k} = \binom{n-1}{k-1}.$$
Lecture 5
Isoperimetric Problems
Feb 11, 2005
Lecturer: Nati Linial
Notes: Yuhan Cai & Ioannis Giotis
• Codes: densest sphere packing in $\{0,1\}^n$:
$$A(n, d) = \max\{|\mathcal{C}| : \mathcal{C} \subseteq \{0,1\}^n,\ \text{dist}(\mathcal{C}) \ge d\}, \qquad R(\delta) = \limsup_{n\to\infty}\max\left\{\frac{1}{n}\log_2|\mathcal{C}| : \mathcal{C} \subseteq \{0,1\}^n,\ \text{dist}(\mathcal{C}) \ge \delta n\right\}.$$

• "Majority is the stablest."

• Gaussian measure: density $\frac{1}{(2\pi)^{n/2}}e^{-\|x\|^2/2}$.

• Borell: the Gaussian isoperimetric problem is solved by a half-space.

• Isoperimetric questions on the cube (Harper): vertex and edge isoperimetric questions.

The edge problem is defined as follows: given that $S \subseteq \{0,1\}^n$, $|S| = R$, how small can $e(S, \bar{S})$ be? Equivalently, how large can the number of internal edges $e(S)$ be? Answer: for every $S \subseteq \{0,1\}^n$, $e(S) \le \frac{1}{2}|S|\log_2|S|$, with equality when $|S| = 2^k$ and $S$ is a subcube ($*\cdots*0\cdots0$ with $k$ $*$'s).
Proof (induction on the dimension): splitting the cube into two halves along one coordinate, with $S_0, S_1$ the two parts of $S$,
$$e(S) \le e(S_0) + e(S_1) + \min(|S_0|, |S_1|).$$
With $|S| = x$ and $|S_0| = \lambda x$, $\lambda \le 1/2$, it suffices to show
$$\frac{1}{2}x\log_2 x \ \ge\ \frac{1}{2}(\lambda x)\log_2(\lambda x) + \frac{1}{2}\big((1-\lambda)x\big)\log_2\big((1-\lambda)x\big) + \lambda x,$$
which simplifies to $0 \ge \lambda\log_2\lambda + (1-\lambda)\log_2(1-\lambda) + 2\lambda$, i.e. $H(\lambda) \ge 2\lambda$, true for $0 \le \lambda \le 1/2$ with equality at $\lambda = 0, 1/2$.
The vertex isoperimetric problem asks to determine
$$\min_{S\subseteq\{0,1\}^n,\ |S|\ge k}\ \left|\left\{y \notin S \mid \exists x \in S \text{ such that } xy \in E(\{0,1\}^n)\right\}\right|.$$
The answer is attained by an optimal Hamming ball. Specifically, if $k = |S| = \sum_{j=0}^{t}\binom{n}{j}$, then $|\partial S| \ge \binom{n}{t+1}$.
We will use the Kruskal-Katona theorem. If $\mathcal{F} \subseteq \binom{[n]}{k}$, then the shadow of $\mathcal{F}$ is
$$\partial(\mathcal{F}) = \left\{y \in \binom{[n]}{k-1}\ \middle|\ \exists x \in \mathcal{F},\ y \subset x\right\}.$$
We wish to minimize $|\partial(\mathcal{F})|$.
To do this, take $\mathcal{F}$ to be an initial segment in the reverse lexicographic order. The lexicographic order is defined by
$$A < B \quad\text{if}\quad \min(A\setminus B) < \min(B\setminus A),$$
while the reverse lexicographic order is
$$A <_{RL} B \quad\text{if}\quad \max(A\setminus B) < \max(B\setminus A).$$
For example:
Lex: $\{1,2\}, \{1,3\}, \{1,4\}, \dots$
RLex: $\{1,2\}, \{1,3\}, \{2,3\}, \dots$
Margulis and Talagrand considered the following quantity for $S \subseteq \{0,1\}^n$ and $x \in S$:
$$h(x) = \left|\{y \notin S \mid xy \in E\}\right|.$$
We now have the two problems:

• vertex isoperimetry: $\min_{|S|=k}\sum_{x\in S}(h(x))^0$ (with the convention $0^0 = 0$, this counts the inner boundary);

• edge isoperimetry: $\min_{|S|=k}\sum_{x\in S}(h(x))^1$.

Talagrand: for $|S| = 2^{n-1}$ (i.e. $p = 1/2$), $\sum_{x\in S}\sqrt{h(x)} = \Omega(2^n)$.
Kleitman: if $S \subseteq \{0,1\}^n$ satisfies $|S| \ge \sum_{j=0}^{t}\binom{n}{j}$ with $t < n/2$, then $\text{diam}(S) \ge 2t$. Can you show that such an $S$ necessarily contains a large code?

Question (answered by Friedgut): suppose that $|S| \approx 2^{n-1}$ and $e(S, S^c) \le 2^{n-1}$; is $S$ then roughly a dictatorship? Answer: yes. (A dictatorship is exemplified by the subcube $x_1 = 0$, i.e. $f(x_1, \dots, x_n) = x_1$.)
5.1 Delsarte's LP

Let $g = 1_{\mathcal{C}}$ and $f = 2^n\,g * g/|\mathcal{C}|$. Delsarte's LP is
$$A(n, d) \le \max\sum_{x\in\{0,1\}^n} f(x) \quad\text{subject to}\quad f \ge 0,\quad f(0) = 1,\quad \hat{f} \ge 0,\quad f\big|_{\{x : 1\le|x|\le d-1\}} = 0.$$
(The particular $f$ built from a code $\mathcal{C}$ with $\text{dist}(\mathcal{C}) \ge d$ is feasible and has $\sum_x f(x) = |\mathcal{C}|$.) Some useful equations:
$$g * g(0) = \frac{1}{2^n}\sum_y g(y)g(y) = \frac{|\mathcal{C}|}{2^n}, \qquad g * g(S) = \frac{1}{2^n}\left|\{(x, y) \in \mathcal{C}^2 \mid x \oplus y = S\}\right|.$$
We start with an observation: without loss of generality $f$ is symmetric, in other words $f(x)$ depends only on $|x|$. We therefore look for
$$\beta_0 = 1, \quad \beta_1 = \dots = \beta_{d-1} = 0, \quad \beta_d, \dots, \beta_n \ge 0,$$
where $f = \sum_{j=0}^{n}\beta_j 1_{L_j}$ and $L_j = \{x \in \{0,1\}^n : |x| = j\}$. We have thus expressed $f \ge 0$ and $f(\vec{0}) = 1$, and we are trying to maximize $\sum_j\binom{n}{j}\beta_j$. By linearity,
$$\hat{f} = \sum_j\beta_j\,\widehat{1_{L_j}}.$$
Note that $L_j$ is symmetric; it follows that $\widehat{1_{L_j}}$ is symmetric as well, so we need to know $\widehat{1_{L_j}}(T)$ only as a function of $|T| = t$. Since $\hat{\mu}(T) = \sum_S\mu(S)(-1)^{|S\cap T|}$ (up to the $\frac{1}{2^n}$ normalization, which does not affect the sign constraints),
$$\widehat{1_{L_j}}(T) = \sum_{|S|=j}(-1)^{|S\cap T|} = K^{(n)}_j(t), \qquad K^{(n)}_j(x) = \sum_i(-1)^i\binom{x}{i}\binom{n-x}{j-i}.$$
This is the Krawtchouk polynomial presented in the next section.
5.2 Orthogonal Polynomials on $\mathbb{R}$
Interesting books for this section are Interpolation and Approximation by Davis and Orthogonal Polynomials by Szegő.
The weights for orthogonal polynomials on $\mathbb{R}$ are given by a function
$$w : \mathbb{R} \to \mathbb{R}_+, \qquad \int_{\mathbb{R}} w(x)\,dx < \infty.$$
The inner product on functions $f : \mathbb{R} \to \mathbb{R}$ is
$$\langle f, g \rangle = \int_{\mathbb{R}} f(x)g(x)w(x)\,dx,$$
and in the discrete case, with weights $w_1, w_2, \ldots$ and points $x_1, x_2, \ldots$,
$$\langle f, g \rangle = \sum_i w_i f(x_i)g(x_i).$$
Let's now talk about orthogonality. Start from the functions $1, x, x^2, \ldots$ and carry out the Gram-Schmidt orthogonalization process. You'll end up with a sequence of polynomials $P_0(x), P_1(x), \ldots$ such that $P_i$ has degree $i$ and $\langle P_i, P_j \rangle = \delta_{ij}$.
One instance of orthogonal polynomials are the Krawtchouk polynomials, defined on the discrete points $x_0 = 0, x_1 = 1, \ldots, x_n = n$ with weights $w_j = \binom{n}{j}/2^n$. The $j$-th Krawtchouk polynomial $K_j(x)$ is a degree-$j$ polynomial in $x$. It is also the value of $\widehat{1_{L_j}}(T)$ whenever $|T| = x$:
$$K^{(n)}_j(x) = \sum_{i=0}^{n} (-1)^i \binom{x}{i}\binom{n-x}{j-i}.$$
Let's see why they are orthogonal, or in other words why
$$\frac{1}{2^n}\sum_{i=0}^{n} K_p(i)K_q(i)\binom{n}{i} = \delta_{pq}\binom{n}{p}.$$
Starting from
$$\langle 1_{L_p}, 1_{L_q} \rangle = \frac{1}{2^n}\binom{n}{p}\delta_{pq}$$
and using Parseval's identity, we get
$$\langle \widehat{1_{L_p}}, \widehat{1_{L_q}} \rangle = \frac{1}{2^n}\sum_S K_p(|S|)K_q(|S|) = \frac{1}{2^n}\sum_{i=0}^{n} K_p(i)K_q(i)\binom{n}{i}.$$
The first $K_j$'s are
$$K_0(x) = 1, \qquad K_1(x) = n - 2x, \qquad K_2(x) = \binom{x}{2} - x(n-x) + \binom{n-x}{2} = \frac{(n-2x)^2 - n}{2}.$$
We also have the following identity:
$$K_j(n - x) = (-1)^j K_j(x).$$
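These identities are easy to confirm by direct computation. Here is a small Python sketch (added for illustration; the function names are mine) checking the orthogonality relation, the symmetry identity, and the 3-term recurrence $(n-2x)K_j = (j+1)K_{j+1} + (n-j+1)K_{j-1}$ derived in Lemma 5.1 below.

```python
# Numeric verification of the basic Krawtchouk identities for a small n.
from math import comb

def K(n, j, x):
    """Krawtchouk polynomial K_j^{(n)}(x) at an integer point x."""
    return sum((-1) ** i * comb(x, i) * comb(n - x, j - i) for i in range(j + 1))

n = 8
# Orthogonality: (1/2^n) sum_i K_p(i) K_q(i) C(n,i) = delta_{pq} C(n,p)
for p in range(n + 1):
    for q in range(n + 1):
        s = sum(K(n, p, i) * K(n, q, i) * comb(n, i) for i in range(n + 1)) / 2 ** n
        assert abs(s - (comb(n, p) if p == q else 0)) < 1e-9

# Symmetry K_j(n-x) = (-1)^j K_j(x) and the 3-term recurrence
for j in range(1, n):
    for x in range(n + 1):
        assert K(n, j, n - x) == (-1) ** j * K(n, j, x)
        assert (n - 2 * x) * K(n, j, x) == (j + 1) * K(n, j + 1, x) + (n - j + 1) * K(n, j - 1, x)
print("all Krawtchouk identities verified for n =", n)
```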
Lemma 5.1. Every system of orthogonal polynomials satisfies a 3-term recurrence
$$x P_j = \alpha_j P_{j+1} + \beta_j P_j + \gamma_j P_{j-1}.$$
Proof. For the Krawtchouk case this can be seen directly through convolution:
$$1_{L_1} * 1_{L_j}(S) = \frac{1}{2^n}\sum_{i} 1_{L_j}(S \oplus \{i\}) = \frac{1}{2^n}\Big( (j+1)\, 1_{L_{j+1}}(S) + (n-j+1)\, 1_{L_{j-1}}(S) \Big).$$
For the Krawtchouk polynomials this gives
$$K_1 K_j = (j+1)K_{j+1} + (n-j+1)K_{j-1},$$
that is,
$$(n - 2x)K_j = (j+1)K_{j+1} + (n-j+1)K_{j-1}.$$
Theorem 5.2. For every family of orthogonal polynomials:
1. there is a 3-term recurrence relation
$$x P_j = \alpha_j P_{j+1} + \beta_j P_j + \gamma_j P_{j-1};$$
2. $P_j$ has $j$ real roots, all in $\mathrm{conv}(\mathrm{supp}\, w)$.

Proof. Observe that $P_0, P_1, \ldots, P_t$ form a basis for the space of all polynomials of degree $\le t$, which means that $\langle P_j, Q \rangle = 0$ for every polynomial $Q$ of degree $< j$. Write
$$x P_j = \sum_{i=0}^{j+1} \lambda_i P_i. \tag{5.1}$$
We now claim that $\lambda_0 = \lambda_1 = \cdots = \lambda_{j-2} = 0$. Take in (5.1) the inner product with $P_l$, $l < j-1$:
$$\langle x P_j, P_l \rangle = \sum_{i=0}^{j+1} \lambda_i \langle P_i, P_l \rangle = \lambda_l \|P_l\|^2.$$
On the other hand, $\langle x P_j, P_l \rangle = \langle P_j, x P_l \rangle$, which is $0$ since $x P_l$ has degree $\le j-1$. Hence $\lambda_l = 0$ for all $l < j-1$.

For the second part, let $u_1, \ldots, u_m$ be the zeros of $P_j$ of odd multiplicity inside $\mathrm{conv}(\mathrm{supp}\, w)$. If $m < j$, then $\prod_i (x - u_i)$ has degree $< j$, so
$$0 = \Big\langle P_j, \prod_i (x - u_i) \Big\rangle = \int P_j \prod_i (x - u_i)\, w > 0,$$
a contradiction: the integrand $P_j \prod_i (x - u_i)$ does not change sign on $\mathrm{supp}\, w$, so the integral is strictly positive. Hence $P_j$ has $j$ real roots in $\mathrm{conv}(\mathrm{supp}\, w)$.
Lecture 6
MRRW Bound and Isoperimetric Problems
Feb 18, 2005
Lecturer: Nati Linial
Notes: Ethan Phelps-Goodman and Ashish Sabharwal
6.1 Preliminaries
First we recall the main ideas from the last lecture. Let
$$g = 1_C, \qquad f = \frac{2^n\, g * g}{|C|}.$$
Then we can bound the code size $A(n,d)$ using Delsarte's linear program:
$$A(n,d) \le \max_f \sum_{x \in \{0,1\}^n} f(x)$$
subject to
$$f \ge 0, \qquad f(0) = 1, \qquad \hat f \ge 0, \qquad f|_{\{1,\ldots,d-1\}} = 0.$$
By averaging over a solution $f$, we can get an equivalent solution that is symmetric under permutations of the input bits. That is, we can assume w.l.o.g. that $f$ depends only on the Hamming weight of the input. Then $f$ is determined by $n+1$ coordinate weights $A_j$, where
$$A_j = \sum_{x \,:\, |x| = j} f(x),$$
or equivalently,
$$f = \sum_{j=0}^{n} \frac{A_j}{\binom{n}{j}}\, 1_{L_j}.$$
Central to our proof will be the Krawtchouk polynomials, which are related to our linear program by
$$\widehat{1_{L_r}} = K_r(x) = \sum_{j=0}^{r} (-1)^j \binom{x}{j}\binom{n-x}{r-j}, \qquad \hat f = \sum_{j=0}^{n} \frac{A_j}{\binom{n}{j}}\, K_j.$$
6.2 Primal and Dual Programs
Making the substitutions above, we can now write Delsarte's program in terms of Krawtchouk polynomials and the symmetrized $f$:
$$A(n,d) \le \max_{A_0, \ldots, A_n} \sum_{i=0}^{n} A_i$$
subject to
$$A_0 = 1, \qquad A_1 = \cdots = A_{d-1} = 0, \qquad \forall k \in \{0,\ldots,n\}:\ \sum_{i=0}^{n} \frac{A_i}{\binom{n}{i}}\, K_i(k) \ge 0.$$
This can be further simplified with the following identity for Krawtchouk polynomials.

Fact 6.1.
$$\frac{K_i(k)}{\binom{n}{i}} = \frac{K_k(i)}{\binom{n}{k}}.$$
Proof.
$$\frac{1}{\binom{n}{i}}\sum_{j=0}^{i} (-1)^j \binom{k}{j}\binom{n-k}{i-j} = \sum_{j} (-1)^j\, \frac{i!\,(n-i)!\,k!\,(n-k)!}{n!\, j!\,(k-j)!\,(i-j)!\,(n-k-i+j)!} = \frac{1}{\binom{n}{k}}\sum_{j} (-1)^j \binom{i}{j}\binom{n-i}{k-j}.$$
Using this in the last constraint, and removing the $1/\binom{n}{k}$ factor, which pulls out of the sum and doesn't affect the sign, we get the constraints
$$\forall k \in \{0,\ldots,n\}: \quad \sum_{i=0}^{n} A_i K_k(i) \ge 0.$$
Our approach will be to use LP duality to give a bound on the maximum of this program. Recall that duality tells us that the maximum value of the primal is at most the minimum value of the dual. Strong duality states that the optima are exactly equal, but we will not use this.

Start by multiplying each of the constraints $\sum_{i=0}^{n} A_i K_k(i) \ge 0$ by some $\beta_k \ge 0$, and summing all of the constraints. This gives
$$\sum_{k=1}^{n} \beta_k \sum_{i=0}^{n} A_i K_k(i) = \sum_{i=0}^{n} A_i \sum_{k=1}^{n} \beta_k K_k(i) \ge 0.$$
Let $\Lambda(x) = \sum_{k=1}^{n} \beta_k K_k(x)$. If we add the constraint that $\Lambda(x) \le -1$ for all $x \in \{d, \ldots, n\}$, then using $A_0 = 1$ and $A_1 = \cdots = A_{d-1} = 0$, we get
$$\sum_{i=0}^{n} A_i \Lambda(i) = \Lambda(0) + \sum_{i=d}^{n} A_i \Lambda(i) \ge 0,$$
$$\Lambda(0) \ge -\sum_{i=d}^{n} A_i \Lambda(i) \ge \sum_{i=d}^{n} A_i,$$
$$\Lambda(0) + 1 \ge \sum_{i=0}^{n} A_i \ge A(n,d).$$
What we have done here is just an explicit construction of the dual. The reader can check that this dual can be arrived at by any standard method for computing the dual.

Let $\gamma(x) = 1 + \sum_{k=1}^{n} \beta_k K_k(x)$. Then our final program is given by
$$A(n,d) \le \min_\beta \gamma(0)$$
subject to:
$$\beta_k \ge 0 \quad \text{for } k = 1, \ldots, n; \qquad \gamma(j) \le 0 \quad \text{for } j = d, \ldots, n.$$
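To make the primal concrete, here is a sketch that solves the symmetrized Delsarte LP numerically for small parameters. It assumes SciPy is available; the variables $A_i$ and the constraints are exactly the ones displayed above, and the printed value is an upper bound on $A(n,d)$ (for instance, $A(5,3) = 4$, and the LP value is not far above it).

```python
# Solve the symmetrized Delsarte LP: max sum A_i s.t. the displayed constraints.
from math import comb
import numpy as np
from scipy.optimize import linprog

def K(n, j, x):
    return sum((-1) ** i * comb(x, i) * comb(n - x, j - i) for i in range(j + 1))

def delsarte_bound(n, d):
    c = -np.ones(n + 1)                      # maximize sum A_i = minimize -sum A_i
    A_eq = np.zeros((d, n + 1))              # A_0 = 1, A_1 = ... = A_{d-1} = 0
    b_eq = np.zeros(d)
    for j in range(d):
        A_eq[j, j] = 1.0
    b_eq[0] = 1.0
    # For all k: sum_i A_i K_k(i) >= 0, rewritten as -sum_i K_k(i) A_i <= 0.
    A_ub = np.array([[-K(n, k, i) for i in range(n + 1)] for k in range(n + 1)])
    b_ub = np.zeros(n + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + 1))
    return -res.fun

print(delsarte_bound(5, 3))   # an upper bound on A(5,3)
```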
6.3 The Upper Bound
To show an upper bound on $A(n,d)$ we need to demonstrate a feasible solution and bound $\gamma(0)$. First we need two additional facts about Krawtchouk polynomials.

Fact 6.2 (Christoffel-Darboux). Let $P_1, P_2, \ldots$ be a family of orthonormal polynomials, and let $a_i$ be the leading coefficient of $P_i$. Then
$$\frac{P_k(x)P_{k+1}(y) - P_{k+1}(x)P_k(y)}{y - x} = \frac{a_{k+1}}{a_k}\sum_{i=0}^{k} P_i(x)P_i(y).$$
For the case of Krawtchouk polynomials, the leading term of $K_r(x)$ is $(-2)^r/r!$. Also, to normalize we need to divide $K_r$ by $\sqrt{\binom{n}{r}}$. Putting these together, we get
$$\frac{K_{r+1}(x)K_r(y) - K_r(x)K_{r+1}(y)}{y - x} = \frac{2}{r+1}\binom{n}{r}\sum_{i=0}^{r} \frac{K_i(x)K_i(y)}{\binom{n}{i}}.$$
The second fact we need is that the product of two Krawtchouk polynomials can be expressed as a non-negative combination of Krawtchouk polynomials.

Fact 6.3. For any $p, q$, there exist $\alpha_0, \ldots, \alpha_{p+q} \ge 0$ such that
$$K_p K_q = \sum_{j=0}^{p+q} \alpha_j K_j.$$
This can be seen easily from the harmonic analysis perspective, since $K_p K_q = \widehat{1_{L_p}}\,\widehat{1_{L_q}} = \widehat{1_{L_p} * 1_{L_q}}$, and the convolution $1_{L_p} * 1_{L_q}$ is a non-negative symmetric function, hence a non-negative combination of the $1_{L_j}$.
We can now present the feasible solution for the dual. Let
$$\Lambda(x) = \frac{\big( K_t(a)K_{t+1}(x) - K_{t+1}(a)K_t(x) \big)^2}{a - x}.$$
Then set $\gamma(x) = \Lambda(x)/\lambda_0$, where $\lambda_0$ is chosen to make the constant term (the $K_0$-coefficient) equal $1$. Now we need to set values for $a$ and $t$. Denote by $x^{(l)}_r$ the leftmost root of $K_r$. We know from last lecture that the roots of the Krawtchouk polynomials are real, lie in $[0,n]$, and interleave with one another. Therefore we can pick a $t$ such that $0 < x^{(l)}_{t+1} < x^{(l)}_t < d$. In the region $(x^{(l)}_{t+1}, x^{(l)}_t)$, $K_{t+1}$ is negative and $K_t$ is positive, so we can pick an $a$ there such that $K_t(a) = -K_{t+1}(a)$.

Now we need to show that $\Lambda(x)$ satisfies the two constraints from the dual. First, note that $\Lambda(x) \le 0$ for all $x \ge d$, since the numerator is a square and $a < d \le x$. Then we just need to show that $\Lambda(x)$ is a non-negative combination of Krawtchouk polynomials. Using the above settings, and Christoffel-Darboux, we can factor $\Lambda(x)$ as
$$\Lambda(x) = \big( K_t(a)K_{t+1}(x) - K_{t+1}(a)K_t(x) \big)\cdot\frac{K_t(a)K_{t+1}(x) - K_{t+1}(a)K_t(x)}{a - x}$$
$$= K_t(a)\big( K_{t+1}(x) + K_t(x) \big)\cdot\frac{K_t(a)K_{t+1}(x) - K_{t+1}(a)K_t(x)}{a - x}$$
$$= K_t(a)\big( K_{t+1}(x) + K_t(x) \big)\cdot\frac{2}{t+1}\binom{n}{t}\sum_{i=0}^{t} \frac{K_i(x)K_i(a)}{\binom{n}{i}}.$$
Since all the coefficients appearing here are positive, this can be expanded (using Fact 6.3) as a non-negative combination of Krawtchouk polynomials.

Now that we have a feasible solution to the dual, we just need to find the value of $\gamma(0)$. We can use the fact that for $t \approx \tau n$, the leftmost root is at $x^{(l)}_t = (1 + o(1))\big( \tfrac{1}{2} - \sqrt{\tau(1-\tau)} \big)n$. Given this we can conclude that
$$R(\delta) \le H\Big( \tfrac{1}{2} - \sqrt{\delta(1-\delta)} \Big).$$
Both the lecture and van Lint [1] seem to imply that this step is obvious, but your scribe has been unable to see any connection.
6.4 More on Isoperimetric Problems on the Cube
We now turn our attention to isoperimetric problems. In a previous lecture, we studied isoperimetric questions on the $n$-dimensional cube, namely the vertex isoperimetric problem and the edge isoperimetric problem. Why is the study of such problems important? The reason is that Computer Science deals with Boolean functions, which are simply partitions of the $n$-dimensional cube into two parts. Understanding the geometry of the cube is therefore critical to understanding Boolean functions. Here is one more isoperimetric problem that is open.

Open Problem 6.1 (Chung-Füredi-Graham-Seymour, 1988, J.C.T.A.). What is the largest $d = d(n)$ such that for all $S \subseteq \{0,1\}^n$ with $|S| > 2^{n-1}$, there exists $x \in S$ with $d_S(x) \ge d$?

Here $d_S(x)$ denotes the number of neighbors of $x$ in $S$. Note that for $|S| \le 2^{n-1}$, $S$ can be an independent set, i.e., $\forall x \in S,\ d_S(x) = 0$. Further, for $|S| > 2^{n-1}$, $S$ can no longer be independent. In general, all we know is that $d(n)$ is both $O(\sqrt{n})$ and $\Omega(\log n)$. This leaves a huge gap open.
Consider any Boolean function $f : \{0,1\}^n \to \{0,1\}$ represented as a $\{0,1\}$-labeling of the $n$-dimensional cube seen as a layered lattice. This lattice has four types of edges, according to the labels of the two endpoints, as depicted in Figure 6.1. Let $S = f^{-1}(0)$. The edges of the two mixed types, from a $0$-vertex to a $1$-vertex and vice versa, belong to the cut $E(S, S^c)$ and thus contribute to the cut size $e(S, S^c)$.

Figure 6.1: The cut defined in terms of the four types of edges in the lattice.

If $|S| = 2^{n-1}$, then $e(S, S^c) \ge 2^{n-1}$. This is sharp for $S = \{x : x_1 = 0\}$. In the edge isoperimetry problem, given $|S|$, we want to minimize the cut size $e(S, S^c)$. What about trying to maximize the cut size instead? Without further restrictions the answer is trivial: when $f$ is the parity function, every edge is cut and $e(S, S^c) = n\, 2^{n-1}$.
6.4.1 Maximizing Edge Cut Size for Monotone Functions
Consider the setting of the previous section. How can we maximize the edge cut when $f$ is monotone, i.e., $x \succeq y \Rightarrow f(x) \ge f(y)$, where $x \succeq y$ means $\forall i,\ x_i \ge y_i$? In the following, we use Parseval's identity to answer this question.

Theorem 6.1. Let $S \subseteq \{0,1\}^n$ correspond to a monotone Boolean function $f : \{0,1\}^n \to \{0,1\}$. Then $f = \text{majority}$ maximizes the edge cut size $e(S, S^c)$.

Proof. It is clear from the lattice corresponding to $f = \text{majority}$ (see Figure 6.2) that the cut consists of the edges between the two middle layers, and its size is $(n - \lfloor n/2 \rfloor)\binom{n}{\lfloor n/2 \rfloor} = \Theta(\sqrt{n}\, 2^n)$. We will use Parseval's identity to prove that this is optimal.

Figure 6.2: The lattice corresponding to the majority function; the middle layer has $\binom{n}{\lfloor n/2 \rfloor}$ points.

Let $f$ be any monotone Boolean function in $n$ dimensions. Recall that for the characters $\chi_T(Z) = (-1)^{|Z \cap T|}$, the function $f$ can be represented as $f = \sum_T \hat f(T)\chi_T$, where $\hat f(T) = \langle f, \chi_T \rangle$. What is $\hat f(\{i\})$?

$\chi_{\{i\}}(Z) = (-1)^{|Z \cap \{i\}|}$, which is $+1$ if $i \notin Z$ and $-1$ if $i \in Z$. Therefore
$$\hat f(\{i\}) = \langle f, \chi_{\{i\}} \rangle = \frac{1}{2^n}\sum_Z f(Z)\chi_{\{i\}}(Z) = \frac{1}{2^n}\Big( \sum_{Z \not\ni i} f(Z) - \sum_{Z \ni i} f(Z) \Big) = -\frac{1}{2^n}\,(\text{number of cut edges in the } i\text{-direction}).$$
For ease of computation, convert everything from the $\{0,1\}$ basis to the $\{-1,+1\}$ basis; this quantity is then, in absolute value, $(2/2^n)$ times the number of cut edges in the $i$-direction. Using Parseval's identity and the Cauchy-Schwarz inequality,
$$1 = \|f\|_2^2 = \sum_S \big( \hat f(S) \big)^2 \ge \sum_i \big( \hat f(\{i\}) \big)^2 \ge \frac{1}{n}\Big( \sum_i \big|\hat f(\{i\})\big| \Big)^2.$$
Hence $\sqrt{n} \ge \sum_i |\hat f(\{i\})| = (2/2^n)\, e(S, S^c)$, i.e., $e(S, S^c) \le \sqrt{n}\, 2^{n-1}$, which finishes the proof.
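For very small $n$ the theorem can also be checked exhaustively. The following sketch (mine, not from the notes) enumerates all monotone Boolean functions on $\{0,1\}^3$ and confirms that none beats the majority cut.

```python
# Brute-force check of Theorem 6.1 for n = 3.
from itertools import product

n = 3
points = list(product([0, 1], repeat=n))

def is_monotone(f):
    return all(f[x] <= f[y] for x in points for y in points
               if all(a <= b for a, b in zip(x, y)))

def cut(f):
    """Number of cube edges whose endpoints get different f-values."""
    e = 0
    for x in points:
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                e += f[x] != f[y]
    return e

best = max(cut(dict(zip(points, bits)))
           for bits in product([0, 1], repeat=2 ** n)
           if is_monotone(dict(zip(points, bits))))
maj = {x: int(sum(x) >= 2) for x in points}
print(best, cut(maj))   # both print 6 for n = 3
```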
We give an alternative combinatorial proof of the fact that $e(S, S^c) = 2^{n-1}\sum_i \hat f(\{i\})$ (up to sign), based on the following claim.

Claim 6.1. Let $f$ be a monotone Boolean function. If the expectation of $f$ is given and fixed, then to maximize $e(f^{-1}(0), f^{-1}(1))$, it is best to take $f$ symmetric.

Proof of claim. Consider $\sum_{x : f(x)=0}(n - 2|x|)$. This is a sum of values of the first Krawtchouk polynomial, and it equals the cut size $e(f^{-1}(0), f^{-1}(1))$: the $(n - |x|)$ edges in the lattice that go upwards from $x$ contribute $+1$ each, while the $|x|$ edges going downward from $x$ contribute $-1$ each (see Figure 6.3); edges with both endpoints in $f^{-1}(0)$ cancel, and by monotonicity every cut edge goes up from $f^{-1}(0)$ and is counted exactly once. Maximizing this quantity means minimizing $\sum_{x : f(x)=0}|x|$, which happens exactly when $f^{-1}(0)$ is pushed down as much as possible.

Formally, let us change the basis from $\{0,1\}$ to $\{-1,+1\}$ and reinterpret the summation. We have
$$\sum_{x : f(x)=1}(n - 2|x|) - \sum_{x : f(x)=-1}(n - 2|x|) = 2^n \langle f, K_1 \rangle.$$
Observe however that $\sum_x (n - 2|x|) = 2^n\langle K_1, K_0 \rangle = 0$. Therefore $\sum_{x : f(x)=1}(n - 2|x|) = 2^{n-1}\langle f, K_1 \rangle$, which is the same as $2^{n-1}\sum_i \hat f(\{i\})$ by the properties of Krawtchouk polynomials ($K_1 = \sum_i \chi_{\{i\}}$).

Figure 6.3: Contribution of $f$ to the cut: $(n-|x|)$ edges contribute $+1$ each and $|x|$ edges contribute $-1$ each at a point $x$.
6.4.2 The Brunn-Minkowski Inequality
Let $v$ be the volume measure on subsets of $\mathbb{R}^n$.

Theorem 6.2 (Brunn-Minkowski [2]). For $A, B$ measurable subsets of $\mathbb{R}^n$,
$$\big( v(A+B) \big)^{1/n} \ge \big( v(A) \big)^{1/n} + \big( v(B) \big)^{1/n}.$$
Moreover, equality holds if and only if $A$ and $B$ are homothetic, i.e., $B = \lambda A + C$ for some $\lambda \in \mathbb{R}$ and a translation vector $C$.

Here $A + B$ is the Minkowski sum, defined as $\{a + b : a \in A,\ b \in B\}$, where $a + b$ is the standard vector sum over $\mathbb{R}^n$. For $\lambda \in \mathbb{R}$, $\lambda A$ is similarly defined as $\{\lambda a : a \in A\}$. We will not be using the second part of the theorem.

Let us try to understand what this inequality says. Take a convex body $K$ in $\mathbb{R}^n$ and slide a hyperplane $A_t$, $t \in \mathbb{R}$, through it (see Figure 6.4). What can we say about the function $f(t) = \mathrm{vol}_{n-1}(A_t \cap K)$, the volume of the intersection of the body with the hyperplane? The Brunn-Minkowski inequality says that $(f(t))^{1/(n-1)}$ is concave on its support.

Figure 6.4: Sliding a hyperplane $A_t$ through a convex body $K$.

Theorem 6.3. The Brunn-Minkowski inequality implies the classical $n$-dimensional isoperimetric inequality.

Proof. We want to show that if $K \subseteq \mathbb{R}^n$ and $B$ is the unit ball in $\mathbb{R}^n$, then
$$\left( \frac{v(K)}{v(B)} \right)^{\frac{1}{n}} \le \left( \frac{S(K)}{S(B)} \right)^{\frac{1}{n-1}},$$
where $S$ denotes the surface area. In the $2$-dimensional plane, the LHS equals $\sqrt{A/\pi}$ while the RHS equals $L/(2\pi)$; to prove LHS $\le$ RHS, we need $L^2 \ge 4\pi A$, which we know to be true. Let's try to generalize this to higher dimensions.

The surface area of $K$ is, by definition,
$$S(K) = \lim_{\epsilon \to 0} \frac{v(K + \epsilon B) - v(K)}{\epsilon}.$$
By the Brunn-Minkowski inequality (and $v(\epsilon B) = \epsilon^n v(B)$),
$$S(K) \ge \lim_{\epsilon \to 0} \frac{\big( (v(K))^{\frac 1n} + \epsilon (v(B))^{\frac 1n} \big)^n - v(K)}{\epsilon} = \lim_{\epsilon \to 0} \frac{n\,(v(K))^{\frac{n-1}{n}}(v(B))^{\frac 1n}\,\epsilon + O(\epsilon^2)}{\epsilon}$$
$$= n\,(v(K))^{\frac{n-1}{n}}(v(B))^{\frac 1n} = S(B)\left( \frac{v(K)}{v(B)} \right)^{\frac{n-1}{n}}\cdot \frac{n\, v(B)}{S(B)}.$$
The last factor $n\, v(B)/S(B)$ is, however, always $1$ in any number of dimensions. We have therefore proved the isoperimetric inequality.
References
[1] J. H. van Lint. Introduction to Coding Theory. Springer, 1999.
[2] J. Matoušek. Lectures on Discrete Geometry. Springer-Verlag, 2002.
Lecture 7
The Brunn-Minkowski Theorem and Influences of Boolean Variables
Feb 25, 2005
Lecturer: Nati Linial
Notes: Mukund Narasimhan
Theorem 7.1 (Brunn-Minkowski). If $A, B \subseteq \mathbb{R}^n$ satisfy some mild assumptions (in particular, convexity suffices), then
$$[\mathrm{vol}(A+B)]^{\frac 1n} \ge [\mathrm{vol}(A)]^{\frac 1n} + [\mathrm{vol}(B)]^{\frac 1n},$$
where $A + B = \{a + b : a \in A \text{ and } b \in B\}$.

Proof. First, suppose that $A$ and $B$ are axis-aligned boxes, say $A = \prod_{j=1}^{n} I_j$ and $B = \prod_{i=1}^{n} J_i$, where each $I_j$ and $J_i$ is an interval with $|I_j| = x_j$ and $|J_i| = y_i$. We may assume WLOG that $I_j = [0, x_j]$ and $J_i = [0, y_i]$, and hence $A + B = \prod_{i=1}^{n}[0, x_i + y_i]$. For this case, the BM inequality asserts that
$$\prod_{i=1}^{n}(x_i + y_i)^{\frac 1n} \ge \prod_{i=1}^{n} x_i^{\frac 1n} + \prod_{i=1}^{n} y_i^{\frac 1n} \iff 1 \ge \prod_i \left( \frac{x_i}{x_i + y_i} \right)^{\frac 1n} + \prod_i \left( \frac{y_i}{x_i + y_i} \right)^{\frac 1n}.$$
Now, since the geometric mean of $n$ numbers is bounded above by their arithmetic mean, we have $\big( \prod \lambda_i \big)^{\frac 1n} \le \frac{\sum \lambda_i}{n}$ and $\big( \prod (1 - \lambda_i) \big)^{\frac 1n} \le \frac{\sum (1 - \lambda_i)}{n}$. Taking $\lambda_i = \frac{x_i}{x_i + y_i}$, and hence $1 - \lambda_i = \frac{y_i}{x_i + y_i}$, the two bounds sum to $1$, so the above inequality always holds. Hence the BM inequality holds whenever $A$ and $B$ are axis-aligned boxes.
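The box case is easy to test numerically; the following sketch (an illustration added here, not part of the notes) checks the inequality $\prod(x_i + y_i)^{1/n} \ge \prod x_i^{1/n} + \prod y_i^{1/n}$ on random side lengths.

```python
# Numeric check of Brunn-Minkowski for pairs of axis-aligned boxes.
import random
from math import prod

random.seed(0)
for _ in range(10000):
    n = random.randint(1, 6)
    x = [random.uniform(0.1, 5.0) for _ in range(n)]   # side lengths of box A
    y = [random.uniform(0.1, 5.0) for _ in range(n)]   # side lengths of box B
    lhs = prod(xi + yi for xi, yi in zip(x, y)) ** (1 / n)
    rhs = prod(x) ** (1 / n) + prod(y) ** (1 / n)
    assert lhs >= rhs - 1e-12
print("BM verified on 10000 random pairs of boxes")
```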
Now, suppose that $A$ and $B$ are disjoint unions of axis-aligned boxes, say $A = \bigcup_{\alpha \in \mathcal{A}} A_\alpha$ and $B = \bigcup_{\beta \in \mathcal{B}} B_\beta$. We proceed by induction on $|\mathcal{A}| + |\mathcal{B}|$. We may assume WLOG that $|\mathcal{A}| > 1$. Since the boxes are disjoint, there is an axis-parallel hyperplane separating two boxes in $\mathcal{A}$. We may assume WLOG that this hyperplane is $x_1 = 0$.

[Figure: the hyperplane splits $A$ into $A^+$ and $A^-$.]

Let $A^+ = \{x \in A : x_1 \ge 0\}$ and $A^- = \{x \in A : x_1 \le 0\}$, as shown in the figure above. It is clear that both $A^+$ and $A^-$ are disjoint unions of axis-aligned boxes. In fact, we may write $A^+ = \bigcup_{\alpha \in \mathcal{A}^+} A_\alpha$ and $A^- = \bigcup_{\alpha \in \mathcal{A}^-} A_\alpha$, where $|\mathcal{A}^+| < |\mathcal{A}|$ and $|\mathcal{A}^-| < |\mathcal{A}|$. Suppose that
$$\frac{\mathrm{vol}(A^+)}{\mathrm{vol}(A)} = \lambda.$$
Pick $\rho$ so that
$$\frac{\mathrm{vol}(\{x \in B : x_1 \ge \rho\})}{\mathrm{vol}(B)} = \lambda.$$
We can always do this by the intermediate value theorem, because the function $f(\rho) = \frac{\mathrm{vol}(\{x \in B : x_1 \ge \rho\})}{\mathrm{vol}(B)}$ is continuous, $f(\rho) \to 0$ as $\rho \to \infty$, and $f(\rho) \to 1$ as $\rho \to -\infty$.

Let $B^+ = \{x \in B : x_1 \ge \rho\}$ and $B^- = \{x \in B : x_1 \le \rho\}$. By induction, we may apply BM to both $(A^+, B^+)$ and $(A^-, B^-)$, obtaining
$$\mathrm{vol}(A^+ + B^+)^{\frac 1n} \ge \mathrm{vol}(A^+)^{\frac 1n} + \mathrm{vol}(B^+)^{\frac 1n},$$
$$\mathrm{vol}(A^- + B^-)^{\frac 1n} \ge \mathrm{vol}(A^-)^{\frac 1n} + \mathrm{vol}(B^-)^{\frac 1n}.$$
Now,
$$\mathrm{vol}(A^+)^{\frac 1n} + \mathrm{vol}(B^+)^{\frac 1n} = \lambda^{\frac 1n}\Big( [\mathrm{vol}(A)]^{\frac 1n} + [\mathrm{vol}(B)]^{\frac 1n} \Big),$$
$$\mathrm{vol}(A^-)^{\frac 1n} + \mathrm{vol}(B^-)^{\frac 1n} = (1 - \lambda)^{\frac 1n}\Big( [\mathrm{vol}(A)]^{\frac 1n} + [\mathrm{vol}(B)]^{\frac 1n} \Big).$$
Since $A^+ + B^+$ lies in the half-space $\{x_1 \ge \rho\}$ and $A^- + B^-$ in $\{x_1 \le \rho\}$, their interiors are disjoint, and hence
$$\mathrm{vol}(A+B) \ge \mathrm{vol}(A^+ + B^+) + \mathrm{vol}(A^- + B^-) \ge \big( \lambda + (1 - \lambda) \big)\Big( [\mathrm{vol}(A)]^{\frac 1n} + [\mathrm{vol}(B)]^{\frac 1n} \Big)^n = \Big( [\mathrm{vol}(A)]^{\frac 1n} + [\mathrm{vol}(B)]^{\frac 1n} \Big)^n.$$
The general case follows by a limiting argument (without the analysis for the case where equality holds).
Suppose that $f : S^n \to \mathbb{R}$ is a mapping with Lipschitz constant $1$, i.e.,
$$|f(x) - f(y)| \le \|x - y\|_2.$$
Let $M$ be a median of $f$, so
$$\Pr[x \in S^n : f(x) < M] = \tfrac{1}{2}.$$
We assume that the probability distribution always admits such an $M$ (at least approximately). The following inequality holds for every $\epsilon > 0$ as a simple consequence of the isoperimetric inequality on the sphere:
$$\Pr\big[ x \in S^n : |f(x) - M| > \epsilon \big] < 2e^{-\epsilon^2 n/2}.$$
For $A \subseteq S^n$ and $\epsilon > 0$, let
$$A_\epsilon = \{x \in S^n : \mathrm{dist}(x, A) < \epsilon\}.$$
Question 7.1. Find a set $A \subseteq S^n$ with $\mu(A) = a$ for which $\mu(A_\epsilon)$ is the smallest.

The probability used here is the (normalized) Haar measure. The answer is always a spherical cap, and in particular if $a = \frac{1}{2}$, then the best $A$ is a hemisphere (and so $A_\epsilon = \{x \in S^n : x_1 < \epsilon\}$). We will show that for $A \subseteq S^n$ with $\mu(A) = \frac{1}{2}$, we have $\mu(A_\epsilon) \ge 1 - 2e^{-\epsilon^2 n/4}$. If $A$ is the hemisphere, then $\mu(A_\epsilon) = 1 - \Theta(e^{-\epsilon^2 n/2})$, and so the hemisphere is essentially the best possible set.

But first, a small variation on BM:
$$\mathrm{vol}\left( \frac{A+B}{2} \right) \ge \sqrt{\mathrm{vol}(A)\,\mathrm{vol}(B)}.$$
This follows from BM (together with the AM-GM inequality) because
$$\mathrm{vol}\left( \frac{A+B}{2} \right)^{\frac 1n} \ge \mathrm{vol}\left( \frac{A}{2} \right)^{\frac 1n} + \mathrm{vol}\left( \frac{B}{2} \right)^{\frac 1n} = \frac{1}{2}\Big( \mathrm{vol}(A)^{\frac 1n} + \mathrm{vol}(B)^{\frac 1n} \Big) \ge \sqrt{\mathrm{vol}(A)^{\frac 1n}\,\mathrm{vol}(B)^{\frac 1n}}.$$
For $A \subseteq S^n$, let $\tilde A = \{\lambda a : a \in A,\ 0 \le \lambda \le 1\}$, the union of the segments from the origin to the points of $A$. Then $\mu(A) = \nu_{n+1}(\tilde A)$, where $\nu_{n+1}$ denotes the normalized volume in the unit ball. Let $B = S^n \setminus A_\epsilon$.

Lemma 7.2. If $x \in \tilde A$ and $y \in \tilde B$, then
$$\left\| \frac{x + y}{2} \right\| \le 1 - \frac{\epsilon^2}{8}.$$
It follows that $\frac{\tilde A + \tilde B}{2}$ is contained in a ball of radius at most $1 - \frac{\epsilon^2}{8}$. Hence
$$\left( 1 - \frac{\epsilon^2}{8} \right)^{n+1} \ge \nu_{n+1}\left( \frac{\tilde A + \tilde B}{2} \right) \ge \sqrt{\nu_{n+1}(\tilde A)\,\nu_{n+1}(\tilde B)} = \sqrt{\frac{\nu_{n+1}(\tilde B)}{2}}.$$
Therefore $\mu(S^n \setminus A_\epsilon) = \nu_{n+1}(\tilde B) \le 2\left( 1 - \frac{\epsilon^2}{8} \right)^{2(n+1)} \le 2e^{-\epsilon^2 n/4}$.
7.1 Boolean Influences
Let $f : \{0,1\}^n \to \{0,1\}$ be a boolean function. For a set $S \subseteq [n]$, the influence of $S$ on $f$, $\mathrm{Inf}_f(S)$, is defined as follows. When we pick $\{x_i\}_{i \notin S}$ uniformly at random, three things can happen:
1. $f = 0$ regardless of $\{x_i\}_{i \in S}$ (suppose that this happens with probability $q_0$).
2. $f = 1$ regardless of $\{x_i\}_{i \in S}$ (suppose that this happens with probability $q_1$).
3. With probability $\mathrm{Inf}_f(S) := 1 - q_0 - q_1$, $f$ is still undetermined.

Some examples:

(Dictatorship) $f(x_1, x_2, \ldots, x_n) = x_1$. In this case
$$\mathrm{Inf}_{\text{dictatorship}}(S) = \begin{cases} 1 & \text{if } 1 \in S \\ 0 & \text{if } 1 \notin S \end{cases}$$
(Majority) For $n = 2k+1$, $f(x_1, x_2, \ldots, x_n)$ is $1$ if and only if a majority of the $x_i$ are $1$. For example, if $S = \{1\}$,
$$\mathrm{Inf}_{\text{majority}}(\{1\}) = \Pr(x_1 \text{ is the tie breaker}) = \frac{\binom{2k}{k}}{2^{2k}} = \Theta\left( \frac{1}{\sqrt{k}} \right).$$
For fairly small sets $S$,
$$\mathrm{Inf}_{\text{majority}}(S) = \Theta\left( \frac{|S|}{\sqrt{n}} \right).$$
(Parity) $f(x_1, x_2, \ldots, x_n) = 1$ if and only if an even number of the $x_i$'s are $1$. In this case
$$\mathrm{Inf}_{\text{parity}}(x_i) = 1$$
for every $1 \le i \le n$.

Question 7.2. What is the smallest $\delta = \delta(n)$ such that there exists a function $f : \{0,1\}^n \to \{0,1\}$ which is balanced (i.e., $\mathbb{E}f = \frac12$) and for which $\mathrm{Inf}_f(x_i) < \delta$ for all $x_i$?

Consider the following example, called tribes. The set of inputs $x_1, x_2, \ldots, x_n$ is partitioned into tribes of size $b$ each. Here, $f(x_1, x_2, \ldots, x_n) = 1$ if and only if there is a tribe that unanimously votes $1$.
Since we want $\mathbb{E}f = \frac12$, we must have
$$\Pr(f = 0) = \left( 1 - \frac{1}{2^b} \right)^{\frac nb} = \frac12.$$
Therefore, $\frac nb \ln\left( 1 - \frac{1}{2^b} \right) = -\ln 2$. We use the Taylor series expansion $\ln(1 - \epsilon) = -\epsilon - \epsilon^2/2 - \cdots = -\epsilon + O(\epsilon^2)$ to get
$$\frac nb\left( \frac{1}{2^b} + O\left( \frac{1}{4^b} \right) \right) = \ln 2.$$
This yields $n = b\, 2^b \ln 2\,(1 + o(1))$. Hence $b = \log_2 n - \log_2 \ln n + \Theta(1)$.

A variable $x$ is influential exactly when its $b-1$ tribe-mates all vote $1$ while no other tribe is unanimous. Hence
$$\mathrm{Inf}_{\text{tribes}}(x) = \left( 1 - \frac{1}{2^b} \right)^{\frac nb - 1}\left( \frac12 \right)^{b-1} = \frac{\left( 1 - \frac{1}{2^b} \right)^{\frac nb}}{1 - \frac{1}{2^b}}\cdot\frac{1}{2^{b-1}} = \frac12\cdot\frac{1}{1 - \frac{1}{2^b}}\cdot\frac{1}{2^{b-1}} = \Theta\left( \frac{1}{2^b} \right) = \Theta\left( \frac{\log n}{n} \right).$$
In this example, each individual variable has influence $\Theta(\log n / n)$. It was later shown that this is the lowest possible influence.
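The asymptotics above can be seen concretely. The sketch below (with tribe sizes chosen only for illustration) computes the exact influence of a single variable in the tribes function and compares it to $\log n / n$.

```python
# Exact tribes influence vs log(n)/n for a few tribe sizes.
from math import log

def tribes_influence(b, m):
    """Influence of one variable in tribes with m tribes of size b:
    its b-1 tribe-mates are all 1, and no other tribe is unanimously 1."""
    return (1 - 2.0 ** (-b)) ** (m - 1) * 2.0 ** (-(b - 1))

for b in (10, 15, 20):
    # choose m so that Pr(f = 0) = (1 - 2^-b)^m is as close to 1/2 as possible
    m = round(log(2) / -log(1 - 2.0 ** (-b)))
    n = b * m
    print(n, tribes_influence(b, m), log(n, 2) / n)   # same order of magnitude
```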
Proposition 7.3. If $\mathbb{E}f = \frac12$, then $\sum_x \mathrm{Inf}_f(x) \ge 1$.

This is a special case of the edge isoperimetric inequality for the cube, and the inequality is tight if $f$ is a dictatorship.

[Figure: the cube split into the halves $x = 0$ and $x = 1$, with $f$-values marked on matching vertices.]

The variable $x$ is influential in the cases indicated by the solid lines (the mixed edges), and hence
$$\mathrm{Inf}_f(x) = \frac{\#\text{ of mixed edges in the } x\text{-direction}}{2^{n-1}}.$$
Let $S = f^{-1}(0)$. Then
$$\sum_x \mathrm{Inf}_f(x) = \frac{1}{2^{n-1}}\, e(S, S^c).$$
One can use $\hat f$ to compute influences. For example, if $f$ is monotone (so $x \succeq y \Rightarrow f(x) \ge f(y)$), then from
$$\hat f(S) = \frac{1}{2^n}\sum_T f(T)(-1)^{|S \cap T|}$$
we get
$$\hat f(\{i\}) = \frac{1}{2^n}\sum_{T \not\ni i} f(T) - \frac{1}{2^n}\sum_{T \ni i} f(T) = \frac{1}{2^n}\sum_{T \not\ni i}\big( f(T) - f(T \cup \{i\}) \big) = -\frac{1}{2^n}\cdot\#\{\text{mixed edges in the direction of } i\} = -\frac12\,\mathrm{Inf}_f(x_i).$$
Hence $\mathrm{Inf}_f(x_i) = 2\,|\hat f(\{i\})|$ for monotone $f$. What can be done to express $\mathrm{Inf}_f(x)$ for a general $f$? Define
$$f^{(i)}(z) = f(z) - f(z \oplus e_i).$$
[Figure: the cube labeled with $f$, and the induced values of $f^{(i)} \in \{-1, 0, 1\}$.]

Then, since $f^{(i)}$ takes values in $\{-1, 0, 1\}$,
$$\mathrm{Inf}_f(x_i) = \mu\big( \mathrm{support}\, f^{(i)} \big) = \frac{1}{2^n}\sum_w \big( f^{(i)}(w) \big)^2.$$
The last term will be evaluated using Parseval. For this, we need to compute the Fourier expansion of $f^{(i)}$ (expressed in terms of $\hat f$):
$$\widehat{f^{(i)}}(S) = \frac{1}{2^n}\sum_T f^{(i)}(T)(-1)^{|S \cap T|} = \frac{1}{2^n}\sum_T \big( f(T) - f(T \oplus \{i\}) \big)(-1)^{|S \cap T|}$$
$$= \frac{1}{2^n}\sum_{T \not\ni i}\Big[ \big( f(T) - f(T \cup \{i\}) \big)(-1)^{|S \cap T|} + \big( f(T \cup \{i\}) - f(T) \big)(-1)^{|S \cap (T \cup \{i\})|} \Big]$$
$$= \frac{1}{2^n}\sum_{T \not\ni i}\big( f(T) - f(T \cup \{i\}) \big)\Big[ (-1)^{|S \cap T|} - (-1)^{|S \cap (T \cup \{i\})|} \Big] = \begin{cases} 0 & \text{if } i \notin S \\ 2\hat f(S) & \text{if } i \in S \end{cases}$$
Using Parseval on $f^{(i)}$, along with the fact that $f^{(i)}$ takes on only the values $-1, 0, 1$, we conclude that
$$\mathrm{Inf}_f(x_i) = 4\sum_{S \ni i}\big| \hat f(S) \big|^2.$$
Next time, we will show that if $\mathbb{E}f = \frac12$, then there exists an $i$ such that $\sum_{S \ni i}\big( \hat f(S) \big)^2 \ge \Omega(\log n / n)$.
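The formula just derived is exact and easy to verify by brute force. The following sketch (added here; it uses the normalization $\hat f(S) = 2^{-n}\sum_T f(T)(-1)^{|S \cap T|}$ from this lecture) checks it on a random function.

```python
# Check Inf_f(x_i) = 4 * sum_{S: i in S} fhat(S)^2 on a random f : {0,1}^n -> {0,1}.
import random
from itertools import product

n = 4
random.seed(1)
f = {x: random.randint(0, 1) for x in product([0, 1], repeat=n)}

def fhat(S):
    return sum(f[x] * (-1) ** sum(x[j] for j in S) for x in f) / 2 ** n

def influence(i):
    """Pr over the other coordinates that flipping x_i changes f."""
    cnt = 0
    for x in f:
        if x[i] == 0:
            y = x[:i] + (1,) + x[i + 1:]
            cnt += f[x] != f[y]
    return cnt / 2 ** (n - 1)

subsets = [tuple(i for i in range(n) if (m >> i) & 1) for m in range(2 ** n)]
for i in range(n):
    fourier = 4 * sum(fhat(S) ** 2 for S in subsets if i in S)
    assert abs(influence(i) - fourier) < 1e-12
print("Inf_f(x_i) = 4 * sum_{S ni i} fhat(S)^2 holds")
```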
Lemma 7.4. For every $f : \{0,1\}^n \to \{0,1\}$, there is a monotone $g : \{0,1\}^n \to \{0,1\}$ such that
- $\mathbb{E}g = \mathbb{E}f$;
- for every $S \subseteq [n]$, $\mathrm{Inf}_g(S) \le \mathrm{Inf}_f(S)$.

Proof. We use a shifting argument, illustrated in the sketch after this paragraph.

[Figure: one shifting step in a fixed direction, taking $f$ to $\tilde f$.]

Clearly $\mathbb{E}\tilde f = \mathbb{E}f$. We will show that for all $S$, $\mathrm{Inf}_{\tilde f}(S) \le \mathrm{Inf}_f(S)$. We may keep repeating the shifting step until we obtain a monotone function $g$. It is clear that the process will terminate, by considering the progress measure $\sum_x f(x)|x|$, which strictly increases with each shift. Therefore, we only need to show that $\mathrm{Inf}_{\tilde f}(S) \le \mathrm{Inf}_f(S)$.
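The shifting step itself is a simple algorithm. Here is a sketch (mine, with arbitrary test parameters) that performs one shift in a fixed direction and verifies on random functions that the expectation is preserved and no influence $\mathrm{Inf}_f(S)$ increases.

```python
# One monotonizing shift in direction i, plus an influence check.
import random
from itertools import product, combinations

n = 3
points = list(product([0, 1], repeat=n))

def shift(f, i):
    """Swap values on each i-pair violating monotonicity: f(z)=1, f(z+e_i)=0."""
    g = dict(f)
    for z in points:
        if z[i] == 0:
            w = z[:i] + (1,) + z[i + 1:]
            if f[z] == 1 and f[w] == 0:
                g[z], g[w] = 0, 1
    return g

def influence(f, S):
    """Pr over assignments outside S that f restricted to the fiber is non-constant."""
    others = [i for i in range(n) if i not in S]
    undet = 0
    for outside in product([0, 1], repeat=len(others)):
        vals = set()
        for inside in product([0, 1], repeat=len(S)):
            z = [0] * n
            for i, v in zip(others, outside): z[i] = v
            for i, v in zip(S, inside): z[i] = v
            vals.add(f[tuple(z)])
        undet += len(vals) > 1
    return undet / 2 ** len(others)

random.seed(2)
for _ in range(200):
    f = {x: random.randint(0, 1) for x in points}
    for i in range(n):
        g = shift(f, i)
        assert sum(g.values()) == sum(f.values())          # E g = E f
        for k in range(1, n + 1):
            for S in combinations(range(n), k):
                assert influence(g, S) <= influence(f, S) + 1e-12
print("shifting never increased any Inf_f(S)")
```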
Lecture 8
More on the influence of variables on boolean functions
March 4, 2005
Lecturer: Nati Linial
Notes: Neva Cherniavsky & Atri Rudra
In this lecture we will look at the following natural question: do there exist balanced boolean functions $f$ on $n$ variables such that for every variable $x$ the influence of $x$ on $f$ is small, and how small can this bound be made? (Balanced means that $\Pr(f = 0) = \Pr(f = 1) = \frac12$, but $\mathbb{E}f = \mu$ for some $\mu$ bounded away from $0$ and $1$ is just as interesting.) In the last lecture we showed that for the tribes function (which was defined by Ben-Or and Linial in [1]), every variable has influence $\Theta(\frac{\log n}{n})$. Today, we will prove the result of Kahn, Kalai and Linial [2] which shows that this quantity is indeed the best one can hope for. In the process we will look into the Bonami-Beckner inequality and will also look at threshold phenomena in random graphs.

8.1 The Kahn Kalai Linial Result
Recall the definition of influence. Let $f : \{0,1\}^n \to \{0,1\}$ be a boolean function and let $S \subseteq [n]$. The influence of the set of variables $S$ on the function $f$, denoted by $\mathrm{Inf}_f(S)$, is the probability that $f$ is still undetermined when all variables in $[n] \setminus S$ have been assigned values at random.

We also talk about influences in the case when the function is defined on a solid cube, $f : [0,1]^n \to \{0,1\}$. This formulation has connections to game theory: variables are controlled by the players. Note that in this case we can talk about things like the influence of a subset of variables towards $0$.

The situation for the case when $|S| = 1$ is relatively better understood. As we saw in the last class, the situation for calculating $\mathrm{Inf}_f(x)$ looks like Figure 8.1. In particular, calculating the influence is the same as counting the number of non-solid edges in Figure 8.1.

Figure 8.1: Influence of a variable.

The situation is much less understood for the more general case, for example when $|S| = 2$. In a nutshell, we are interested in situations other than those in Figure 8.2. This scenario is not well understood and is still a mysterious object. Unlike the case of a single variable, the number of zeroes and ones in a mixed 2-dimensional subcube can vary, and the whole situation is consequently more complex. As an interesting special case, consider the tribes example and the case when $S$ is an entire tribe. It is easy to see that the influence of $S$ towards $1$ is exactly $1$ (as any tribe can force the result to be $1$), while it is not hard to see that the influence of $S$ towards $0$ is only $O(\frac{\log n}{n})$. As another example, consider the case when $S$ consists of one element from each tribe (one can consider each element of $S$ as a spy in a tribe). Here the influence of $S$ towards $0$ is exactly $1$ (as each spy can force its tribe to be $0$). Further, its influence towards $1$ is bounded away from $0$ and $1$; a short computation gives $3/4 + o(1)$.
Let us now turn back to the motivating question for this lecture.

Question 8.1. Find boolean functions $f$ with $\mathbb{E}f = \frac12$ for which $\mathrm{Inf}_f(x)$ is small for each variable $x$.

For any variable $x_i$ define $\beta_i = \mathrm{Inf}_f(x_i)$ and let $\beta = (\beta_1, \ldots, \beta_n)$. Thus, the quantity we are interested in is $\|\beta\|_\infty$. Note that the edge isoperimetric inequality on the cube implies that $\sum_{i=1}^{n} \beta_i \ge 1$, which by an averaging argument gives $\|\beta\|_\infty \ge \frac1n$. Also note that for the tribes example, $\|\beta\|_\infty = \Theta(\frac{\log n}{n})$. The following result, due to Kahn, Kalai and Linial, shows that this is the best possible.

Theorem 8.1. For any $f : \{-1,1\}^n \to \{-1,1\}$ with $\mathbb{E}f = 0$, there exists a variable $x$ such that $\mathrm{Inf}_f(x) \ge \Omega\left( \frac{\log n}{n} \right)$.

Before we start to prove the theorem, let us collect some facts that we know, or that follow trivially from what we have covered in previous lectures:
$$\sum_{S \subseteq [n]} \big( \hat f(S) \big)^2 = 1 \tag{8.1}$$
$$\hat f(\emptyset) = 0 \tag{8.2}$$
$$\beta_i = 4\sum_{S \ni i} \big( \hat f(S) \big)^2 \tag{8.3}$$
$$\sum_{i=1}^{n} \beta_i = 4\sum_{S \subseteq [n]} |S|\big( \hat f(S) \big)^2 \tag{8.4}$$
Equation (8.1) follows from Parseval's identity and the fact that $f$ takes values in $\{-1,1\}$. Equation (8.2) follows from the fact that $\chi_\emptyset \equiv 1$, which implies $\hat f(\emptyset) = \langle f, \chi_\emptyset \rangle = \mathbb{E}f = 0$. Equation (8.3) was proved in the last lecture. Equation (8.4) follows from summing Equation (8.3) over all $i$.
Figure 8.2: Influence of a general subset of variables.

We will first show that if most of $\sum_{S \subseteq [n]}(\hat f(S))^2$ comes from large sets $S$, then the conclusion of Theorem 8.1 holds. In fact, in this case even the average influence is $\Omega(\frac{\log n}{n})$. To be more precise, let $T = \frac{\log n}{10}$ and $H = \sum_{|S| \ge T}(\hat f(S))^2$, and assume first that $H \ge \frac{1}{10}$. Then by (8.4),
$$\sum_{i=1}^{n} \beta_i \ge 4\sum_{|S| \ge T} |S|\big( \hat f(S) \big)^2 \ge 4HT \ge \frac{\log n}{25}.$$
It follows that it suffices to prove the theorem under the complementary assumption that $H < \frac{1}{10}$. In view of (8.1) this is the same as assuming $\sum_{|S| < T}(\hat f(S))^2 > 0.9$. In the proof we will need to estimate the quantity
$$\sum_{|S| < T}\big( \hat\varphi(S) \big)^2 = \sum_S W_T(S)\big( \hat\varphi(S) \big)^2$$
for $\varphi = f^{(i)}$ (recall from the last lecture that $f^{(i)}(z) = f(z) - f(z \oplus e_i)$). Here $W_T(\cdot)$ is the step function which takes the value $1$ on any set $S \subseteq [n]$ with $|S| < T$, and $0$ otherwise. We use two ideas to solve the problem at hand:
- Try to approximate the step function $W_T(\cdot)$ by functions which are sufficiently close to it and are easy to analyze.
- As the bound $\|\beta\|_1 \ge 1$ is tight (which is insufficient for our purposes), and since it is difficult to work directly with $\|\beta\|_\infty$, we could perhaps estimate $\|\beta\|_p$ for some $\infty > p > 1$. In the proof we will use $p = 4/3$, but this is quite arbitrary.

We focus on the second alternative and use the Bonami-Beckner inequality, which we consider in the next section.
8.2 Bonami-Beckner Inequality
Let $T_\epsilon$ be a linear operator which maps real functions on $\{-1,1\}^n$ to real functions on $\{-1,1\}^n$. By linear, we mean that the following holds for functions $f$ and $g$ and scalars $a$ and $b$: $T_\epsilon(a f + b g) = a\, T_\epsilon(f) + b\, T_\epsilon(g)$.

As $T_\epsilon(\cdot)$ is a linear operator, one can fully determine it by just specifying it on the basis of the functions $\chi_S$. We define the operator as follows:
$$T_\epsilon(\chi_S) = \epsilon^{|S|}\chi_S.$$
By linearity, $T_\epsilon(f) = \sum_{S \subseteq [n]} \epsilon^{|S|}\hat f(S)\chi_S(\cdot)$. Note that $T_1(f) = f$; in other words, $T_1$ is the identity operator.

We will now state the main result of this section.

Theorem 8.2. Let $0 < \epsilon < 1$ and consider $T_\epsilon$ as an operator from $L_p$ to $L_2$, where $p = 1 + \epsilon^2$. Then its operator norm is $1$.¹

Let us first explain the terms used in the statement above. Let $T : (X, \|\cdot\|_X) \to (Y, \|\cdot\|_Y)$ be a linear operator; here $X$ and $Y$ are normed spaces and $\|\cdot\|_X$ and $\|\cdot\|_Y$ are their respective norms. The operator norm of $T$ is defined as
$$\|T\|_{op} = \sup_{x \in X} \frac{\|Tx\|_Y}{\|x\|_X}.$$
This quantity measures how much the length (norm) of an element $x \in X$ can grow by an application of the operator $T$. We now turn to the proof.

What is the size of $f$? How expanding is the operator? These are very narrow passages; we have no wiggle room. We can only use Parseval in $L_2$, so one side of the inequality must carry the $L_2$ norm; the other side carries the $L_p$ norm, which is usually very difficult to calculate. But because our functions (the $f^{(i)}$) only take on the values $-1, 0, 1$, we can calculate the necessary $L_p$ norms.

That the operator norm of $T_\epsilon$ is at least $1$ is obvious. Let $f$ be identically $1$ everywhere. Then $\hat f(T) = 0$ for $T \neq \emptyset$ and $\hat f(\emptyset) = 1$, so $\|T_\epsilon f\|_2 = 1 = \|f\|_p$. What the Bonami-Beckner inequality says is that for every $f : \{-1,1\}^n \to \mathbb{R}$, $\|T_\epsilon f\|_2 \le \|f\|_p$.

We'll do a part of the proof only. The proof is via induction on $n$, the dimension of the cube. The base case is the main part of the proof; the method for inducting is standard in analysis and we'll skip it.

¹This is called a hypercontractive inequality.
Again, the surprising thing here is that $p < 2$ and still $\|T\|_{op} \le 1$.

For $n = 1$, every function $f$ has the form $f = a + bx$, and then $T_\epsilon f = a + \epsilon bx$. (There are only two characters, $\chi_\emptyset$ and $\chi_{\{x\}}$.) At $x = -1$ we have $f = a - b$ and $T_\epsilon f = a - \epsilon b$; at $x = 1$, $f = a + b$ and $T_\epsilon f = a + \epsilon b$. So
$$\|f\|_p = \left( \frac{|a+b|^p + |a-b|^p}{2} \right)^{\frac 1p}, \qquad \|T_\epsilon f\|_2 = \sqrt{\frac{(a + \epsilon b)^2 + (a - \epsilon b)^2}{2}} = \sqrt{a^2 + \epsilon^2 b^2}.$$
We want to prove $\|T_\epsilon f\|_2 \le \|f\|_p$, i.e., we want to prove
$$\left( \frac{|a+b|^p + |a-b|^p}{2} \right)^{\frac 1p} \ge \sqrt{a^2 + \epsilon^2 b^2}. \tag{8.5}$$
Suppose $|a| \ge |b|$. Let $b = ta$ and divide by $|a|$:
$$\left( \frac{|a+b|^p + |a-b|^p}{2} \right)^{\frac 1p} = |a|\left( \frac{|1+t|^p + |1-t|^p}{2} \right)^{\frac 1p}, \qquad \sqrt{a^2 + \epsilon^2 b^2} = |a|\sqrt{1 + \epsilon^2 t^2}.$$
So we will prove
$$\frac{|1+t|^p + |1-t|^p}{2} \ge \big( 1 + \epsilon^2 t^2 \big)^{\frac p2} \quad\text{when } |t| \le 1, \tag{8.6}$$
and (8.5) will follow. Note that if $|a| < |b|$, we'd substitute $a = bt$ and divide by $|b|$, and would want to prove
$$\frac{|t+1|^p + |t-1|^p}{2} \ge \big( \epsilon^2 + t^2 \big)^{\frac p2} \quad\text{when } |t| \le 1. \tag{8.7}$$
But since $1 + \epsilon^2 t^2 - (\epsilon^2 + t^2) = (1 - \epsilon^2)(1 - t^2) \ge 0$, once we prove equation (8.6), (8.7) will follow immediately.

The proof of (8.6) is via the Taylor expansion. For the left hand side, terms in odd places cancel out, and terms in even places double. Recall $p = 1 + \epsilon^2$ and $|t| \le 1$. The left hand side becomes
$$\sum_{j \ge 0} t^{2j}\binom{p}{2j}$$
and the right hand side becomes
$$\sum_{j \ge 0} \epsilon^{2j} t^{2j}\binom{p/2}{j}.$$
Let's examine the first two terms of each side. On both sides, the first term is $1$. On the left hand side, the second term is $t^2\, p(p-1)/2$, and on the right hand side the second term is $\epsilon^2 t^2\, p/2$; since $\epsilon^2 = p - 1$, the second terms are also equal. Therefore it is sufficient to compare the terms from $j \ge 2$.

What we discover is that on the left hand side all terms are positive, whereas on the right hand side the $j = 2k$ and $j = 2k+1$ terms have a negative sum for all $k \ge 1$. First we show the left hand side is positive. The $(2j)$-th coefficient equals $p(p-1)(p-2)\cdots(p-2j+1)$ divided by some positive constant. Note that $p(p-1)$ is positive and all the terms $(p-2), \ldots, (p-2j+1)$ are negative. But since there is an even number of these negative terms, the product as a whole is positive. Therefore, on the left hand side, all terms are positive.

Now consider the right hand side. We will show that the $j = 2k$ and $j = 2k+1$ terms have a negative sum for all $k \ge 1$. Consider the sum
$$\epsilon^{4k}t^{4k}\binom{p/2}{2k} + \epsilon^{4k+2}t^{4k+2}\binom{p/2}{2k+1}.$$
We can divide out $\epsilon^{4k}t^{4k}$ without affecting the sign. Since the second term is the positive one, and $|t| \le 1$, we can replace $t^2$ by $1$ without loss of generality. So now we have
$$\binom{p/2}{2k} + \epsilon^2\binom{p/2}{2k+1} = \binom{p/2}{2k} + (p-1)\binom{p/2}{2k+1}$$
$$= \frac{\frac p2(\frac p2 - 1)\cdots(\frac p2 - 2k + 1)}{(2k)!} + (p-1)\,\frac{\frac p2(\frac p2 - 1)\cdots(\frac p2 - 2k)}{(2k+1)!}$$
$$= \Big[ (2k+1)\,\tfrac p2(\tfrac p2 - 1)\cdots(\tfrac p2 - 2k + 1) + (p-1)\,\tfrac p2(\tfrac p2 - 1)\cdots(\tfrac p2 - 2k) \Big]\Big/(2k+1)!$$
$$= \Big[ \tfrac p2(\tfrac p2 - 1)\cdots(\tfrac p2 - 2k + 1) \Big]\cdot\Big[ 2k + 1 + (p-1)\big( \tfrac p2 - 2k \big) \Big]\Big/(2k+1)!$$
Notice that the first term in brackets is negative and the second term in brackets is positive. Thus the sum of the $k$-th even and odd terms is negative for all $k \ge 1$, and we've proved equation (8.6). Equation (8.6) implies equation (8.7); equations (8.6) and (8.7) together imply equation (8.5); and equation (8.5) says $\|f\|_p \ge \|T_\epsilon f\|_2$ for $p = 1 + \epsilon^2$. Thus we've proved the base case of the Bonami-Beckner inequality.
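Inequality (8.6) can also be probed numerically; the following sketch checks it on a grid of values of $t$ and $\epsilon$ (the grid resolution is arbitrary).

```python
# Numeric check of (8.6): (|1+t|^p + |1-t|^p)/2 >= (1 + eps^2 t^2)^(p/2), p = 1 + eps^2.
import numpy as np

for eps in np.linspace(0.05, 0.95, 19):
    p = 1 + eps ** 2
    t = np.linspace(-1, 1, 2001)
    lhs = (np.abs(1 + t) ** p + np.abs(1 - t) ** p) / 2
    rhs = (1 + (eps * t) ** 2) ** (p / 2)
    assert (lhs >= rhs - 1e-12).all()
print("inequality (8.6) holds on the sampled grid")
```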
8.3 Back to Kahn Kalai Linial
In general it is not obvious how to utilize this inequality, since it's hard to compute the $p$-norm. But we're looking at an easy case. Specifically, if $g : \{0,1\}^n \to \{-1, 0, 1\}$, the inequality says that
$$\Pr(g \neq 0) = t \quad\Longrightarrow\quad t^{\frac{2}{1+\epsilon^2}} \ge \sum_S \epsilon^{2|S|}\big( \hat g(S) \big)^2. \tag{8.8}$$
To see this, let $p = 1 + \epsilon^2$. We know $\|g\|_p \ge \|T_\epsilon g\|_2$, and
$$\|g\|_p = \left( \frac{1}{2^n}\sum_x |g(x)|^p \right)^{\frac 1p} = \left( \frac{1}{2^n}\sum_{g(x) \neq 0} 1 \right)^{\frac 1p} = t^{\frac 1p} = t^{\frac{1}{1+\epsilon^2}}.$$
So the Bonami-Beckner inequality tells us
$$t^{\frac{1}{1+\epsilon^2}} \ge \sqrt{\sum_S \epsilon^{2|S|}\big( \hat g(S) \big)^2},$$
and squaring both sides gives equation (8.8).
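Equation (8.8) can be stress-tested directly on random $\{-1,0,1\}$-valued functions; the sketch below does so by computing the Fourier coefficients by brute force (the parameters are illustrative).

```python
# Check (8.8): t^(2/(1+eps^2)) >= sum_S eps^(2|S|) ghat(S)^2, t = Pr(g != 0).
import random
from itertools import product

n, eps = 4, 0.5
random.seed(3)
points = list(product([0, 1], repeat=n))
subsets = [tuple(i for i in range(n) if (m >> i) & 1) for m in range(2 ** n)]

for _ in range(500):
    g = {x: random.choice([-1, 0, 1]) for x in points}
    t = sum(v != 0 for v in g.values()) / 2 ** n
    rhs = 0.0
    for S in subsets:
        ghat = sum(g[x] * (-1) ** sum(x[j] for j in S) for x in points) / 2 ** n
        rhs += eps ** (2 * len(S)) * ghat ** 2
    assert t ** (2 / (1 + eps ** 2)) >= rhs - 1e-12
print("(8.8) verified on 500 random functions")
```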
We apply this inequality to $g = f^{(i)}$. Then $t = \beta_i$, the influence of the $i$-th variable, which is exactly what we're looking for. Recall that
$$\widehat{f^{(i)}}(S) = \begin{cases} 0 & i \notin S \\ 2\hat f(S) & i \in S \end{cases}$$
We want to prove that $\max_i \beta_i \ge \Omega(\log n / n)$. Substituting the new values into (8.8) (and dropping the harmless factor of $4$), we get
$$\beta_i^{\frac{2}{1+\epsilon^2}} \ge \sum_{S \ni i} \epsilon^{2|S|}\big( \hat f(S) \big)^2.$$
The $\epsilon^{2|S|}$ terms are the weights that we want to behave sufficiently like the step function $W_T$. Recall, we also know that $\sum_{0 \le |S| < T}(\hat f(S))^2 > 0.9$, where $T = \log n/10$. Since we assume $f$ to be balanced (i.e., $\Pr(f = 1) = 1/2$), we can ignore the $\emptyset$ term, because $\hat f(S) = 0$ when $S = \emptyset$; we cannot ignore it for imbalanced functions, but it won't matter, and we'll come back to this point later.

So we know $\sum_{0 < |S| < T}(\hat f(S))^2 > 0.9$. Let $\epsilon^2 = 1/2$ (the choice is arbitrary), so that $p = 3/2$ and $\frac{2}{1+\epsilon^2} = \frac43$. Then
$$n \max_i \beta_i^{4/3} \ge \sum_i \beta_i^{4/3} \ge \sum_i \sum_{S \ni i} \left( \frac12 \right)^{|S|}\big( \hat f(S) \big)^2 = \sum_S \left( \frac12 \right)^{|S|}|S|\,\big( \hat f(S) \big)^2$$
$$\ge \sum_{0 < |S| < T} \left( \frac12 \right)^{|S|}|S|\,\big( \hat f(S) \big)^2 \ge \left( \frac12 \right)^T \sum_{0 < |S| < T}\big( \hat f(S) \big)^2 \ge n^{-\frac{1}{10}}\cdot 0.9.$$
Therefore $n \max_i \beta_i^{4/3} \ge 0.9/n^{1/10}$, and so
$$\max_i \beta_i \ge \left( \frac{c}{n^{11/10}} \right)^{3/4} = \Omega\big( n^{-33/40} \big) \gg \frac{\log n}{n}.$$
There is some progress in understanding what the vector of influences can look like in general, but we have a long way to go.

We return to an issue we had left open: what happens when we deal with functions that are imbalanced? What if $f : \{0,1\}^n \to \{0,1\}$ with $\mathbb{E}f = p$ (not necessarily $p = 1/2$)? Now we cannot ignore $\hat f(\emptyset)$ in the previous argument. Indeed, $\hat f(\emptyset) = p$ and $\sum_S (\hat f(S))^2 = p$. We'd have to subtract $(\hat f(\emptyset))^2 = p^2$ off of $p$ in general. But this is fine as long as $0 < p < 1$. We mention this because this technique is used often.
8.4 Sharp thresholds in graphs
Theorem 8.3. Every monotone graph property has a sharp threshold.

This theorem is also the title of the paper, by Friedgut and Kalai. The background is that in the late 1950s, Erdős and Rényi defined random graphs, which became an important ingredient for many investigations in modern discrete mathematics and theoretical computer science. A random graph $G(n,p)$ is a graph on $n$ vertices in which the probability that $(x,y)$ is an edge in $G$ is $p$. That is, for each pair of vertices, flip a $p$-weighted coin, and put an edge in the graph if the coin comes up heads. We do this independently for each pair of vertices. So
$$\Pr(G) = p^{e(G)}(1-p)^{\binom{n}{2} - e(G)}.$$
Already Erdős and Rényi had noticed that some properties of graphs have special behavior. For example, take the property that $G$ is connected. Write $\Pr(G \text{ is connected})$ as $f(p)$. When $p$ is small, the graph is unlikely to be connected; when $p$ is big, it is almost certain to be connected. The shape of $f(p)$ is shown in Figure 8.3.

Figure 8.3: Sharp threshold for connectivity, around $p = \log n/n$.

The transition from disconnected to connected is very rapid. This is related to a class of physical phenomena called phase transitions: there is some physical system that depends on a single parameter, such as temperature or pressure, and the system goes from one phase to the other very sharply (for instance, the system goes from not magnetized to magnetized).

There are other graph properties that also exhibit this behavior (e.g. having a Hamiltonian cycle, planarity, etc). But there was no satisfactory general theorem until Friedgut-Kalai.

To have a precise form of the theorem, we define the terms:

Definition 8.1. A graph property is a class of labeled graphs that is closed under permutations of the vertex labels.

Intuitively, a graph property holds or does not hold regardless of the labeling of the vertices; examples are connectedness, "the graph contains a triangle", "the graph is 17-colorable", etc.

Definition 8.2. A graph property is called monotone if it continues to hold after adding more edges to the graph.

Again, connectedness is monotone; non-planarity is also. An example of a non-monotone graph property is "the number of edges in the graph is even".

Let $A$ be a monotone graph property, and let
$$\mu_p(A) = \Pr(A \mid G \in_R G(n,p)),$$
where $G$ is sampled randomly. Clearly $\mu_p(A)$ is an increasing function of $p$. The theorem says that $p_0$, where $\mu_{p_0}(A) = \epsilon$, and $p_1$, where $\mu_{p_1}(A) = 1 - \epsilon$, are very close. Namely,
$$p_1 - p_0 \le O\left( \frac{\log(1/\epsilon)}{\log n} \right).$$
In the connectivity example, this is not very interesting. The transition from almost surely disconnected to almost surely connected takes place around $p = \log n/n$, and $1/\log n$ is much bigger. In other words, the gap is much bigger than the critical value where the threshold occurs.

Later, Bourgain and Kalai showed that for all monotone graph properties, the bound holds with $p_1 - p_0 \le O\big( 1/\log^{2-\delta} n \big)$, for every $\delta > 0$ and $n$ large enough. This theorem is nearly tight, since there exist examples of monotone graph properties where the width of the gap is $\Omega\big( 1/\log^2 n \big)$; for instance, "$G$ contains a $k$-clique" for a specific $k = \Theta(\log n)$, where the critical $p = 1/2$.

So what can we do about the problem in the connectivity example, where the threshold comes at a point much smaller than the gap? We can ask a tougher question: is it true that the transition from $\mu_{p_0} = \epsilon$ to $\mu_{p_1} = 1 - \epsilon$ occurs between $(1 \pm o(1))q$, where $q$ is the critical probability (i.e., $\mu_q(A) = 1/2$)?

If the answer were yes, we'd have a more satisfactory gap in relation to our critical value in the connectivity case. However, the answer is negative for certain graph properties; this is asking too much. For example, suppose the property is that $G$ contains a triangle. Here the critical $p = c/n$. The expected number of triangles in $G$ is
$$\binom{n}{3}p^3 = \frac{c^3}{6}\left( 1 - \frac1n \right)\left( 1 - \frac2n \right).$$
Figure 8.4: "$G$ contains a triangle": the probability climbs gradually as $p$ moves from $c/n$ to $c'/n$.

The picture looks like Figure 8.4. At this smaller scale, the threshold isn't as sharp. As we vary $c$ in $p = c/n$, the probability that $G$ contains no triangle changes from one constant to another constant, both bounded away from zero and from one. The reason behind this picture is that the number of triangles is a random variable with a Poisson distribution; therefore, the probability that there is no triangle is $\Pr(X = 0) \approx e^{-\lambda}$, where $\lambda$ is the expectation of $X$.

One of Friedgut's major contributions was to characterize which graph properties have a sharp threshold and which don't in this stronger sense. It's a little complicated to even state his result precisely, but the spirit of the theorem is this: properties like "$G$ contains a triangle" are considered local, and Friedgut's theorem says that having a sharp threshold (in the strong sense) is roughly equivalent to the property being non-local.
References
[1] M. Ben-Or and N. Linial. Collective coin flipping. In Randomness and Computation, S. Micali, ed., Academic Press, New York, 1990.
[2] J. Kahn, G. Kalai and N. Linial. The influence of variables on Boolean functions. In Proceedings of the 29th Symposium on Foundations of Computer Science, White Plains, 1988, pages 68-80.
Lecture 9
Threshold Phenomena
March 7, 2005
Lecturer: Nati Linial
Notes: Chris Ré
9.1 Asides
There is a good survey of this area by Gil Kalai and Muli Safra, called Threshold Phenomena and Influence, due out very soon.

9.1.1 Percolation
Though our main technical result concerns random graphs in the $G(n,p)$ model, let us mention other contexts in which threshold phenomena occur. One classical example is percolation, an area started in physics. A typical question here is this: given the planar grid and $0 < p < 1$, create a graph by keeping each edge of the planar grid with probability $p$ and removing it with probability $1-p$. The inclusion of edges is done independently. Our question is then: in the resulting graph, is the origin in an infinite connected component?

It turns out that there is a critical probability $p_c$ such that
- $p < p_c$: with probability $1$, the origin is not in an infinite component;
- $p > p_c$: with probability $> 0$, the origin is in an infinite component.

You can imagine considering other similar questions on higher-dimensional grids. For the planar grid it turns out that $p_c = \frac12$.

This problem comes up in mining in the following idealized model. Somewhere underground is a deposit of oil. It is surrounded by rocks whose structure is that of a random sponge, a solid with randomly placed cavities. The question is how far the oil is likely to flow away from its original location. Percolation in a 3-dimensional setting is a good abstraction of the above physical situation.

Now imagine graphing the probability of the property holding versus the value of $p$ from above. As an example see Figure 9.1. The interesting questions concern how this function behaves around, or slightly to the right of, $p_c$. For example, is it a smooth function? Is it differentiable? How large is its derivative? Figure 9.1 illustrates some curves that could happen: the probability could be discontinuous at $p_c$, or continuous but not smooth at $p_c$.

Figure 9.1: Probability of the property holding vs. $p$.
9.2 Monotone Graph Properties
The main theorem we want to prove is:

Theorem 9.1 (Friedgut and Kalai). Every monotone graph property has a sharp threshold.

To make this precise, we need some definitions. Let $P$ be a graph property, that is, a property invariant under vertex relabeling. A property $P$ is monotone if $P(G_0)$ implies $P(G)$ for all $G$ such that $G_0$ is a subgraph of $G$. A property $A$ has a sharp threshold if whenever $\Pr[A \mid G(n,p_1)] = \epsilon$ and $\Pr[A \mid G(n,p_2)] = 1 - \epsilon$, we have $p_2 - p_1 = o(1)$.

Theorem 9.2 (Erdős and Rényi). The threshold for graph connectivity is at $p = \frac{\log n}{n}$:
$$p < (1-\epsilon)\frac{\log n}{n} \implies G \text{ almost surely disconnected},$$
$$p > (1+\epsilon)\frac{\log n}{n} \implies G \text{ almost surely connected}.$$
There is a counterpoint to our deletion model, where we throw in edges one at a time. There are some surprising facts in this model. For example, when you throw in the edge that reaches the last isolated vertex, with near certainty you also connect the graph, at that exact same stage. A similar hitting-time statement holds for Hamiltonicity: the graph becomes Hamiltonian at the moment the minimum degree reaches $2$.

It may be illustrative to see the form of these arguments.

Proof (sketch). Let $p < (1-\epsilon)\frac{\log n}{n}$. Let $X$ be a random variable counting the number of isolated vertices. Then $\mathbb{E}[X] \to \infty$, since $\mathbb{E}[X] = n(1-p)^{n-1}$. We also need a second-moment argument, such as Chebyshev's inequality, to deduce that $X > 0$ almost surely. In particular, when $X > 0$, the graph is disconnected.

Proof. Let $Y_k$ be a random variable that counts the number of sets $S \subseteq V$ with $|S| = k$ that have no edges between $S$ and its complement. Then $\mathbb{E}[Y_k] = \binom{n}{k}(1-p)^{k(n-k)}$. It can be checked that if $p > (1+\epsilon)\frac{\log n}{n}$, then $\sum_{k \le n/2} \mathbb{E}[Y_k] = o(1)$. It follows that with probability $1 - o(1)$ no such sets exist. Clearly, when no such sets exist, the graph is connected.
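The sharpness of the connectivity threshold is easy to observe in simulation. The sketch below (parameters chosen only for speed) estimates the probability of connectivity at $p = c\log n/n$ for a few values of $c$, using union-find.

```python
# Simulating the Erdos-Renyi connectivity threshold at p = log n / n.
import random
from math import log

def connected(n, p, rng):
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                parent[find(u)] = find(v)
    return len({find(u) for u in range(n)}) == 1

rng = random.Random(4)
n, trials = 200, 100
for c in (0.7, 1.0, 1.3):
    p = c * log(n) / n
    freq = sum(connected(n, p, rng) for _ in range(trials)) / trials
    print(c, freq)   # climbs steeply from near 0 to near 1 as c passes 1
```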
9.2.1 Relation to KKL
Why should we expect KKL to work like these examples? If $f : \{0,1\}^n \to \{0,1\}$ with $\mathbb{E}[f] = \frac12$,¹ then by KKL there exists $x \in [n]$ with $\mathrm{Inf}_f(x) > \Omega(\frac{\log n}{n})$. Let $N = \binom{n}{2}$; then each $z \in \{0,1\}^N$ is a description of an $n$-vertex graph, and the variables correspond to edges.

We can now view a graph property as an $N$-variable boolean function. Notice also that, by symmetry, if one edge (variable) is influential, then all edges (variables) are influential. As we will see later, large influence entails a sharp threshold.

To generalize, we need to understand the role of $p$ in $G(n,p)$. We have to work with $\{0,1\}^N$ not under the uniform distribution but under the following product distribution:
$$\Pr[U] = p^{|U|}(1-p)^{N - |U|} = p^{|E(G)|}(1-p)^{\binom{n}{2} - |E(G)|}.$$
We are denoting the Hamming weight of $U$ by $|U|$, and $E(G)$ is the edge set of the graph $G$.

¹Choosing $\mathbb{E}[f] = \frac12$ is not critical; anything bounded away from $0, 1$ will do.
9.3 BK$^3$L
9.3.1 A relation between influence and the derivative of $\mu_p(A)$
The new B and K in our theory are Bourgain and Katznelson. By $\mu_p(A)$ we denote the probability that the property $A$ holds under the $(p, 1-p)$ product measure.

Lemma 9.3 (Margulis & Russo). Let $A \subseteq \{0,1\}^n$ be a monotone subset and let $\mu_p(A)$ be the $p$-measure of $A$. For $x \in A$ let $h(x) = |\{\, y \notin A \mid xy \in E(\text{cube}) \,\}|$ (the number of neighbors of $x$ outside of $A$). Let
$$\partial_p(A) = \sum_{x \in A} h(x)\,\mu_p(x),$$
the weighted sum of these $h$'s. Additionally, let $I_p(A)$ be the sum of the influences of the individual variables. Then
$$I_p(A) \stackrel{1}{=} \frac{1}{p}\,\partial_p(A) \stackrel{2}{=} \frac{d}{dp}\,\mu_p(A).$$
The labels on the equalities are only for convenience in the proof.

Definition 9.1. We will say $x \sim y$ if $x$ and $y$ differ in exactly one coordinate, say the $i$-th, with $x_i = 1$ and $y_i = 0$.

Influences, more generally. In general, if $X$ is a probability space and $f : X^n \to \{0,1\}$ (i.e., $f$ can be viewed as an indicator function of a subset of $X^n$), then for $1 \le k \le n$ we can set
$$\mathrm{Inf}_f(k) = \Pr_{X^{n-1}}[\text{we obtain a non-constant fiber}].$$
Here we are randomly choosing the $n-1$ coordinates other than the $k$-th, and checking whether the resulting fiber is non-constant for $f$; namely, whether the value of $f$ is not fixed regardless of the choice of the $k$-th variable.
Figure 9.2: Cube with a fiber.

Proof. We prove equality 1. $I_p(A)$ is the sum of all the influences. The influence of the $i$-th variable is the weighted sum over all edges $x \sim y$ with $x \in A$, $y \notin A$ (so $x_i = 1$ and $y_i = 0$, by monotonicity). The probability of the relevant event is this: we have selected all coordinates except the $i$-th, and the outcome coincides with $x$; there are $|x| - 1$ coordinates which are $1$ among those, and $n - |x|$ coordinates which are $0$. So we can rewrite the formula as follows:
$$I_p(A) = \sum_{x \in A,\, y \notin A,\, x \sim y} p^{|x|-1}(1-p)^{n-|x|} = \frac{1}{p}\sum_{x \in A,\, y \notin A,\, x \sim y} p^{|x|}(1-p)^{n-|x|}$$
$$= \frac{1}{p}\sum_{x \in A} p^{|x|}(1-p)^{n-|x|}\,\big|\{\, y \mid y \notin A,\ x \sim y \,\}\big| = \frac{1}{p}\sum_{x \in A} \mu_p(x)\,h(x) = \frac{1}{p}\,\partial_p(A).$$
Proof. Equality 2. Differentiating $\mu_p(A) = \sum_{x \in A} p^{|x|}(1-p)^{n-|x|}$,
$$\frac{d}{dp}\mu_p(A) = \sum_{x \in A} |x|\,p^{|x|-1}(1-p)^{n-|x|} - \sum_{x \in A} (n - |x|)\,p^{|x|}(1-p)^{n-|x|-1},$$
$$p\,\frac{d}{dp}\mu_p(A) = \sum_{x \in A} |x|\,p^{|x|}(1-p)^{n-|x|} - \frac{p}{1-p}\sum_{x \in A} (n - |x|)\,p^{|x|}(1-p)^{n-|x|}.$$
For a fixed vertex $x$ of the cube and an edge $e$ incident with $x$, define
$$w_{x,e} = \begin{cases} 1 & e \text{ goes down from } x \\ -\frac{p}{1-p} & e \text{ goes up from } x \end{cases}$$
So we can rewrite (summing over $x \in A$ and the edges $e$ incident with $x$):
$$p\,\frac{d}{dp}\mu_p(A) = \sum_{x \in A,\, e \ni x} w_{x,e}\,\mu_p(x).$$
This is because there are $|x|$ edges going down from $x$ and $n - |x|$ edges going up from it. Notice that if $x \sim y$ are both in $A$ and $e = (x,y)$, then $w_{x,e}\,\mu_p(x) + w_{y,e}\,\mu_p(y) = 0$. It follows that we can restrict the sum to the edges in $E(A, A^c)$; by monotonicity these all go down from some $x \in A$ and carry weight $1$. In other words,
$$p\,\frac{d}{dp}\mu_p(A) = \sum_{x \in A,\, y \notin A,\, e = (x,y)} w_{x,e}\,\mu_p(x) = \sum_{x \in A,\, y \notin A,\, e = (x,y)} \mu_p(x) = \sum_{x \in A} h(x)\,\mu_p(x) = \partial_p(A).$$
Figure 9.3: Partitioning the cube to derive KKL from BK$^3$L.
Figure 9.4: The function $f$ in the $n = 2$ case.
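The lemma can be verified numerically; the sketch below (mine) builds a random monotone up-set and compares a finite-difference estimate of $\frac{d}{dp}\mu_p(A)$ with the sum of influences $I_p(A)$ (the particular set and value of $p$ are arbitrary).

```python
# Numeric check of Margulis-Russo: d/dp mu_p(A) = I_p(A) for a monotone A.
import random
from itertools import product

n = 4
points = list(product([0, 1], repeat=n))
random.seed(5)
gens = random.sample(points, 4)    # A = up-closure of a few random generators
A = {x for x in points if any(all(a <= b for a, b in zip(g, x)) for g in gens)}

def mu(p):
    return sum(p ** sum(x) * (1 - p) ** (n - sum(x)) for x in A)

def total_influence(p):
    s = 0.0
    for i in range(n):
        for x in points:
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if (x in A) != (y in A):
                    # probability that the other n-1 coordinates match this pair
                    w = sum(x[j] for j in range(n) if j != i)
                    s += p ** w * (1 - p) ** (n - 1 - w)
    return s

p, h = 0.37, 1e-6
print((mu(p + h) - mu(p - h)) / (2 * h), total_influence(p))  # nearly equal
```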
Returning to the proof that every monotone graph property has a sharp threshold: let $A$ be a monotone graph property, and let us operate in the probability space $G(n,p)$. We will show that the $p$ value where the property holds with probability $\epsilon$ is very close to the $p$ value where the property holds with probability $\frac12$. A symmetric argument for $1 - \epsilon$ will give us the full desired result.
9.3.2 Words about BKKKL
Theorem 9.4 (BKKKL). Let $f : [0,1]^n \to \{0,1\}$ with $\mathbb{E}[f] = t$, and let $t' = \min(t, 1-t)$. Then there exists $k$, $1 \le k \le n$, such that $\mathrm{Inf}_f(k) \ge \Omega\big( t'\,\frac{\log n}{n} \big)$.

Set version of KKL: for $f : \{0,1\}^n \to \{0,1\}$ with $\mathbb{E}[f] \approx \frac12$, and for every $\omega(n) \to \infty$ as $n \to \infty$, there exists $S \subseteq [n]$ with $|S| \le \frac{n\,\omega(n)}{\log n}$ and $\mathrm{Inf}_f(S) = 1 - o(1)$. This result follows from a repeated application of KKL.

Remark. It is interesting to note that the analogous set statement for $f : [0,1]^n \to \{0,1\}$ does not hold. Consider the following $f$, represented in Figure 9.4. Let $f(x_1, \ldots, x_n) = 0$ iff $0 \le x_i \le c/n$ for some $i$, where $c = \log_e 2$; in other words, $f^{-1}(1) = \prod_{i=1}^{n}[c/n,\, 1]$. Let $|S| = \epsilon n$. In this example,
$$\mathrm{Inf}(S) = \Pr[f \text{ is still undetermined when all variables outside of } S \text{ are set at random}].$$
The function is still undetermined iff all the variables outside the set are at least $c/n$. This happens with probability $(1 - c/n)^{n(1-\epsilon)} \approx e^{-c(1-\epsilon)}$, which is bounded away from $1$.

This is a close cousin of the tribes example. Recall that in the tribes example we broke the variables into tribes of size roughly $\log n - \log\log n$. Each tribe contributes iff all of its variables take the value $1$; that is, there is one assignment out of the $2^{\log n - \log\log n} = \frac{n}{\log n}$ possible ones for which the tribe has value $1$. In our setting, we can identify tribes with single variables: the small interval $[0, c/n]$ of a continuous variable corresponds to the single determining assignment of a discrete tribe.
Proof (of Theorem 9.1). By BK$^3$L there exist influential variables; by symmetry, all variables are equally influential. The sum of all the individual influences is therefore at least
$$I_p(A) \ge \Omega\big( \mu_p(A)\,\log N \big) = \Omega\big( \mu_p(A)\,\log n \big).$$
By the Margulis-Russo Lemma we know $I_p(A) = \frac{d}{dp}\mu_p(A)$. Hence
$$\frac{d}{dp}\mu_p(A) \ge \Omega\big( \mu_p(A)\,\log n \big),$$
$$\Big( \frac{d}{dp}\mu_p(A) \Big)\Big/\mu_p(A) \ge \Omega(\log n),$$
$$\frac{d}{dp}\big( \log \mu_p(A) \big) \ge \Omega(\log n).$$
Let $p_1, p_2$ be defined by $\Pr_{G(n,p_1)}[A] = \epsilon$ and $\Pr_{G(n,p_2)}[A] = \frac12$. From the above we know that $\frac{d}{dp}\log\mu_p(A) > \Omega(\log n)$, so
$$p_2 - p_1 < O\left( \frac{\log\frac1\epsilon}{\log n} \right).$$
Remark. We will not give a proof here, but note that Friedgut showed, using standard measure theory, how to derive BK$^3$L from KKL; namely, how to reach the same conclusion for any $f : X^n \to \{0,1\}$, where $X$ is any probability space. To derive KKL from BK$^3$L is easy: given $f : \{0,1\}^n \to \{0,1\}$, define $F : [0,1]^n \to \{0,1\}$ by breaking the cube into $2^n$ subcubes and letting $F$ be constant on each subcube, equal to $f$ at the corresponding vertex of the cube. For a simple illustration of the case $n = 2$, see Figure 9.3.