Matrix Analysis LN
Cite as:
Joel A. Tropp, ACM 204: Matrix Analysis, Caltech CMS Lecture Notes 2022-01, Pasadena,
Winter 2022. doi:10.7907/nwsv-df59.
Cover image: Sample paths of a randomized block Krylov method for estimating the
largest eigenvalue of a symmetric matrix.
Lecture images: Falling text in the style of The Matrix was created by Jamie Zawinski,
©1999–2003.
Contents
Preface
Notation

I lectures
1 Tensor Products
1.1 Tensor product: Motivation
1.2 The space of tensor products
1.3 Isomorphism to bilinear forms
1.4 Inner products
1.5 Theory of linear operators
1.6 Spectral theory
2 Multilinear Algebra
2.1 Multivariate tensor product
2.2 Permutations
2.3 Wedge products
2.4 Wedge operators
2.5 Spectral theory of wedge operators
2.6 Determinants
2.7 Symmetric tensor product
3 Majorization
3.1 Majorization
3.2 Doubly stochastic matrices
3.3 T-transforms
3.4 Majorization and doubly stochastic matrices
3.5 The Schur–Horn theorem
4 Isotone Functions
4.1 Recap
4.2 Weyl majorant theorem
4.3 Isotonicity
4.4 Schur “convexity”

II problem sets

III projects
Projects

IV back matter
Bibliography
Preface
“The Matrix is everywhere. It is all around us. Even now in this very room.”
Course overview
The topics of this course vary from term to term, depending on the audience. This
term, we covered the following material:
• Basics of multilinear algebra
• Majorization and doubly stochastic matrices
• Symmetric and unitarily invariant norms
• Uniform smoothness of matrix spaces
• Complex interpolation methods for matrix inequalities
• Variational principles for eigenvalues
• Perturbation theory for eigenvalues
• Angles between subspaces and perturbation theory for eigenspaces
• Tensor products and matrix equations
• Positive and completely positive linear maps
• Matrix monotonicity and convexity
• Differentiation of standard matrix functions
• Loewner’s theorems on matrix monotone functions
• Matrix means
• Convexity of matrix trace functions
• Positive-definite functions and Bochner’s theorem
• Entrywise positivity preservers and Schoenberg’s theorem
The course had three optional problem sets, which helped to cement some of the
foundational material. The problem sets are attached to the notes.
The primary assignment was a project, where each student read some classic or
modern papers in matrix analysis and wrote a synthetic treatment of the material. A
selection of the projects is attached to the lecture notes.
Prerequisites
ACM 204 is designed for G2 and G3 students in the mathematical sciences. The
prerequisites for this course are differential and integral calculus (e.g., Caltech Ma 1ac),
ordinary differential equations (e.g., Ma 2), and intermediate linear algebra (e.g., Ma
1b and ACM 104). Exposure to linear analysis (e.g., CMS 107), functional analysis (e.g.,
ACM 105), and optimization theory (e.g., CMS 122) is also valuable.
Supplemental textbooks
There is no required textbook for the course. Several recent books cover related material;
Bhatia’s books [Bha97; Bha07b] are the primary sources for this course.
These notes
These lecture notes document ACM 204 as taught in Winter 2022, and they are primarily
intended as a reference for students who have taken the class. The notes are prepared
by student scribes with feedback from the instructor. The notes have been edited by
the instructor to try to correct his own failures of presentation. All remaining errors
and omissions are the fault of the instructor.
Please be aware that these notes reflect material presented in a classroom, rather
than a formal scholarly publication. In some places, the notes may lack appropriate
citations to the literature. There is no claim that the arrangement or presentation of
the material is primarily due to the instructor.
The notes also contain the projects of students who wished to share their work.
They received feedback and made revisions, but the projects have not been edited.
They represent the students’ individual work.
Acknowledgements
These notes were transcribed by students taking the course in Winter 2022. They are
Eray Atay, Jag Boddapati, Edoardo Calvello, Ruizhi Cao, Anthony (Chi-Fang) Chen,
Nicolas Christianson, Matthieu Darcy, Rohit Dilip, Ethan Epperly, Salvador Gomez,
Taylan Kargin, Eitan Levin, Elvira Moreno, Nicholas Nelson, Roy (Yixuan) Wang, and
Jing Yu. Without their care and attention, we would not have such an excellent record.
Joel A. Tropp
Steele Family Professor of Applied & Computational Mathematics
California Institute of Technology
[email protected]
http://users.cms.caltech.edu/~jtropp
Pasadena, California
March 2022
Notation
I have selected notation that is common in the linear algebra and probability literature.
I have tried to be consistent in using the symbols that are presented below. There
are some minor variations in different lectures, including the letter that indicates the
dimension of a matrix and the indexing of sums. Scribes are expected to use this same
notation!
Set theory
The Pascal notations ≔ and ≕ generate definitions. Sets without any particular
internal structure are denoted with sans serif capitals: A, B, E. Collections of sets are
written in a calligraphic font: A, B, F. The power set (that is, the collection containing
all subsets) of a set E is written as P(E).
The symbol ∅ is reserved for the empty set. We use braces to denote a set. The
character ∈ (or, rarely, ∋) is the member-of relation. The set-builder notation
{𝑥 ∈ A : 𝑃(𝑥)}
carves out the (unique) set of elements that belong to a set A and that satisfy the
predicate 𝑃. Basic set operations include union (∪), intersection (∩), symmetric
difference (△), set difference (\), and the complement (ᶜ) with respect to a fixed set.
The relations ⊆ and ⊇ indicate set containment.
The natural numbers ℕ ≔ {1, 2, 3, . . . }. Ordered tuples and sequences are written
with parentheses, e.g.,
(𝑎 1 , 𝑎 2 , 𝑎 3 , . . . , 𝑎 𝑛 ) or (𝑎 1 , 𝑎 2 , 𝑎 3 , . . . )
Alternative notations include things like (𝑎𝑖 : 𝑖 ∈ ℕ) or (𝑎𝑖 )𝑖 ∈ℕ or simply (𝑎𝑖 ) .
Real analysis
We mainly work in the field ℝ of real numbers, equipped with the absolute value
|·|. The extended real numbers ℝ̄ ≔ ℝ ∪ {±∞} are defined with the usual rules
of arithmetic and order. In particular, we instate the conventions that 0/0 = 0 and
0 · (±∞) = 0. We use the standard (American) notation for open and closed intervals;
e.g., (𝑎, 𝑏) ≔ {𝑥 ∈ ℝ : 𝑎 < 𝑥 < 𝑏} and [𝑎, 𝑏] ≔ {𝑥 ∈ ℝ : 𝑎 ≤ 𝑥 ≤ 𝑏}.
Linear algebra
We work in a real or complex linear space. The letters 𝑑 and 𝑛 (and occasionally
others) are used to denote the dimension of this space, which is always finite. For
example, we write ℝ𝑑 or ℂ𝑛 . We may write 𝔽 to refer to either field, or we may omit
the field entirely if it is not important.
We use the delta notation for standard basis vectors: 𝜹 𝑖 has a one in the 𝑖 th
coordinate and zeros elsewhere. The vector 1 has ones in each entry. The dimension
of these vectors is determined by context.
The symbol ∗ denotes the (conjugate) transpose of a vector or a matrix. In particular,
𝑧∗ is the complex conjugate of a complex number 𝑧 ∈ ℂ. We may also write ᵀ for the
ordinary transpose to emphasize that no conjugation is performed.
We equip 𝔽^𝑑 with the standard inner product ⟨𝒙, 𝒚⟩ ≔ 𝒙∗𝒚. The inner product
generates the Euclidean norm ‖𝒙‖² ≔ ⟨𝒙, 𝒙⟩.
We write ℍ_𝑑(𝔽) for the real-linear space of 𝑑 × 𝑑 self-adjoint matrices with entries
in the field 𝔽. Recall that a matrix is self-adjoint when 𝑨 = 𝑨∗. The symbols 0 and
I denote the zero matrix and the identity matrix; their dimensions are determined by
context or by an explicit subscript.
We equip the space ℍ_𝑑 with the trace inner product ⟨𝑿, 𝒀⟩ ≔ tr(𝑿𝒀), which
generates the Frobenius norm ‖𝑿‖²_F ≔ ⟨𝑿, 𝑿⟩. The map tr(·) returns the trace of
a square matrix; the parentheses are often omitted. We instate the convention that
nonlinear functions bind before the trace.
The spectral theorem states that every self-adjoint matrix 𝑨 ∈ ℍ_𝑛 admits a spectral
resolution:

𝑨 = ∑_{𝑖=1}^{𝑚} 𝜆_𝑖 𝑷_𝑖   where   ∑_{𝑖=1}^{𝑚} 𝑷_𝑖 = I_𝑛   and   𝑷_𝑖 𝑷_𝑗 = 𝛿_{𝑖𝑗} 𝑷_𝑖.

Here, 𝜆_1, . . . , 𝜆_𝑚 are the distinct (real) eigenvalues of 𝑨. The range of the orthogonal
projector 𝑷_𝑖 is the invariant subspace associated with 𝜆_𝑖. In this context, 𝛿_{𝑖𝑗} is the
Kronecker delta.
The maps 𝜆_min(·) and 𝜆_max(·) return the minimum and maximum eigenvalues of
a self-adjoint matrix. The ℓ₂ operator norm ‖·‖ of a self-adjoint matrix satisfies the
relation

‖𝑨‖ ≔ max{ |𝜆_max(𝑨)|, |𝜆_min(𝑨)| }   for 𝑨 ∈ ℍ_𝑛.

A self-adjoint matrix is positive semidefinite (psd) if its eigenvalues are nonnegative; a
self-adjoint matrix is positive definite (pd) if its eigenvalues are positive. The symbol ≼
refers to the psd order: 𝑨 ≼ 𝑯 if and only if 𝑯 − 𝑨 is psd.
We can define a standard matrix function for a self-adjoint matrix using the spectral
resolution. For an interval I ⊆ ℝ and for a function 𝑓 : I → ℝ,

𝑨 = ∑_{𝑖=1}^{𝑚} 𝜆_𝑖 𝑷_𝑖   implies   𝑓(𝑨) = ∑_{𝑖=1}^{𝑚} 𝑓(𝜆_𝑖) 𝑷_𝑖.
Implicitly, we assume that the eigenvalues of the matrix 𝑨 lie within the domain I of
the function 𝑓 . When we apply a real function to a self-adjoint matrix, we are always
referring to the associated standard matrix function. In particular, we often encounter
powers, exponentials, and logarithms.
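For readers who want to experiment, the following minimal NumPy sketch (illustrative code, not part of the original notes; the function name is ours) evaluates a standard matrix function by diagonalizing a self-adjoint matrix and applying 𝑓 to its eigenvalues.

```python
import numpy as np

def standard_matrix_function(A, f):
    """Apply a real function f to a self-adjoint matrix A via its spectral resolution.

    Assumes A is (Hermitian) self-adjoint and f is defined on the eigenvalues of A.
    """
    eigvals, eigvecs = np.linalg.eigh(A)           # A = V diag(lambda) V*
    return eigvecs @ np.diag(f(eigvals)) @ eigvecs.conj().T

# Example: the matrix exponential of a random symmetric matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A = (X + X.T) / 2                                   # real symmetric, hence self-adjoint
expA = standard_matrix_function(A, np.exp)          # exp(A) = V diag(exp(lambda)) V*
print(np.allclose(expA, expA.T))                    # exp(A) is again symmetric
```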
We write 𝕄_𝑛(𝔽) for the linear space of 𝑛 × 𝑛 matrices over the field 𝔽. We also
define the linear space 𝕄_{𝑚×𝑛}(𝔽) of 𝑚 × 𝑛 matrices over the field 𝔽. We can extend
the trace inner product and Frobenius norm to this setting: ⟨𝑿, 𝒀⟩ ≔ tr(𝑿∗𝒀) and
‖𝑿‖²_F ≔ ⟨𝑿, 𝑿⟩ for 𝑿, 𝒀 ∈ 𝕄_{𝑚×𝑛}(𝔽).
Each matrix 𝑩 ∈ 𝕄_{𝑚×𝑛}(𝔽) admits a singular value decomposition (SVD) 𝑩 = 𝑼𝚺𝑽∗.
The matrices 𝑼 and 𝑽 are orthogonal (or unitary). The rectangular matrix 𝚺 ∈ 𝔽^{𝑚×𝑛}
is diagonal, in the sense that the entries (𝚺)𝑖 𝑗 = 0 whenever 𝑖 ≠ 𝑗 . The diagonal
entries of 𝚺 are called singular values. They are conventionally arranged in decreasing
order and written with the following notation.
𝜎_max(𝑩) ≔ 𝜎_1(𝑩) ≥ 𝜎_2(𝑩) ≥ · · · ≥ 𝜎_𝑟(𝑩) ≕ 𝜎_min(𝑩)   where 𝑟 ≔ min{𝑚, 𝑛}.

The symbol ‖·‖ always refers to the ℓ₂ operator norm; it returns the maximum singular
value of its argument.
We write lin (·) for the linear hull of a family of vectors. The operators range (·)
and null (·) extract the range and null space of a matrix. The operator † extracts the
pseudoinverse.
Probability
The map ℙ {·} returns the probability of an event. The operator 𝔼[·] returns the
expectation of a random variable taking values in a linear space. We only include the
brackets when it is necessary for clarity, and we impose the convention that nonlinear
functions bind before the expectation.
The symbol ∼ means “has the distribution.” We abbreviate (statistically) independent
and identically distributed (iid). Named distributions, such as normal and uniform,
are written with small capitals.
We say that a random vector 𝒙 ∈ 𝔽 𝑛 is centered when 𝔼[𝒙 ] = 0. A random vector
is isotropic when 𝔼[𝒙𝒙 ∗ ] = I𝑛 . A random vector that is both centered and isotropic is
standardized.
An important property of the standard normal distribution, which we use heavily, is
the fact that it is rotationally invariant. If 𝒙 ∼ normal ( 0, I) , then 𝑸 𝒙 is also standard
normal for every matrix 𝑸 that is orthogonal (in the real case) or unitary (in the
complex case).
Order notation
We sometimes use the familiar order notation from computer science. The symbol Θ(·)
refers to asymptotic equality. The symbol 𝑂 (·) refers to an asymptotic upper bound.
I. lectures
1. Tensor Products

In this lecture, we develop the theory of tensor products, which provides us with a way
to “multiply” vector spaces. We begin by discussing the axioms that such a construction
should satisfy; then we develop a rigorous way to implement these axioms. We show
that the tensor product of two Hilbert spaces is itself a Hilbert space, equipped with an
induced inner product. By developing an isomorphism with bilinear forms, we show
how to regard tensor products as matrices. Finally, we show that we can construct
linear operators on a tensor product space in a very natural way, and we develop their
spectral theory in a transparent way.

Agenda:
1. Tensor product: Motivation
2. Tensor product spaces
3. Bilinear forms
4. Theory of linear operators
1.1.1 Setting
Throughout this chapter, we assume that H is an 𝑛-dimensional Hilbert space over a field
𝔽 (either ℝ or ℂ). A Hilbert space is endowed with an inner product, denoted by ⟨·, ·⟩.
By convention, we assume the inner product is conjugate linear in the first coordinate
and linear in the second coordinate. We fix an orthonormal basis {𝒆_1, 𝒆_2, . . . , 𝒆_𝑛};
that is, ⟨𝒆_𝑗, 𝒆_𝑘⟩ = 𝛿_{𝑗𝑘}. (Recall that the Kronecker delta satisfies 𝛿_{𝑗𝑗} = 1 for all 𝑗,
while 𝛿_{𝑗𝑘} = 0 when 𝑗 ≠ 𝑘.) Finally, we denote the space of linear operators acting on H
by L(H). Since H is finite-dimensional, we can regard every element of L(H) as a
matrix with dimension dim(H).
Definition 1.1 (Tensor product: Axioms). The tensor product operation ⊗ maps a pair of
vectors 𝒙 , 𝒚 ∈ H to an object called the tensor product 𝒙 ⊗ 𝒚 . We can add tensors
and scale them. The product should satisfy the following properties:
1. Additivity. The product should distribute across the addition operation. For
all 𝒙, 𝒚, 𝒛 ∈ H,

(𝒙 + 𝒚) ⊗ 𝒛 = 𝒙 ⊗ 𝒛 + 𝒚 ⊗ 𝒛;
𝒙 ⊗ (𝒚 + 𝒛) = 𝒙 ⊗ 𝒚 + 𝒙 ⊗ 𝒛.
2. Homogeneity. We would also like the product operation to behave well with
scalar multiplication in the field 𝔽 of the Hilbert space. In particular, for all
𝒙, 𝒚 ∈ H,

𝛼(𝒙 ⊗ 𝒚) = (𝛼𝒙) ⊗ 𝒚 = 𝒙 ⊗ (𝛼𝒚)   for all 𝛼 ∈ 𝔽.

(The homogeneity property is independent of the argument index in the product;
that is, there is no conjugation of the field element 𝛼.)
3. Interaction with the zero vector. The zero vector 0 is the identity element for
addition; i.e., 𝒙 + 0 = 𝒙 for all 𝒙 ∈ H. We require that the tensor product
with zero be absorbing. Thus, for vectors 𝒙, 𝒚 ∈ H,

𝒙 ⊗ 0 = 0 ⊗ 0 = 0 ⊗ 𝒚.

4. Faithfulness. The tensor product of two nonzero vectors should be nonzero;
that is, 𝒙 ⊗ 𝒚 = 0 only if 𝒙 = 0 or 𝒚 = 0.
Example 1.3 (Schur product). The Schur product between vectors 𝒙 and 𝒚 in ℝⁿ is
denoted by 𝒙 ∘ 𝒚 and is defined by elementwise multiplication of the vectors 𝒙 and 𝒚.
That is,

∘ : ℝⁿ × ℝⁿ → ℝⁿ;   (𝒙, 𝒚) ↦→ (𝑥_𝑖 𝑦_𝑖 : 𝑖 = 1, . . . , 𝑛).

Similar to the inner product, the Schur product does not satisfy faithfulness. For
instance, when H = ℝ², one can pick 𝒙 = (1, 0)ᵀ and 𝒚 = (0, 1)ᵀ. Their Schur
product is 𝒙 ∘ 𝒚 = (0, 0)ᵀ, but both vectors are nonzero.
Example 1.4 (Outer product). We will denote the outer product between vectors 𝒙 and 𝒚
in ℂ𝑛 by 𝒙 ⊗ 𝒚 , for reasons that will be clear in the following discussion. It is defined
by

⊗ : ℂⁿ × ℂⁿ → ℂ^{𝑛×𝑛};   (𝒙, 𝒚) ↦→ 𝒙𝒚ᵀ = (𝑥_𝑖 𝑦_𝑗)_{𝑖,𝑗=1}^{𝑛}.
We can verify that the outer product realizes the tensor product by checking all four
axioms.
1. Additivity. Since the outer product is a particular case of matrix multiplication
(between an 𝑛 × 1 matrix and a 1 × 𝑛 matrix) and matrix multiplication is
additive, the outer product is also additive.
The outer product thus satisfies all four axioms of the tensor product.
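As a quick numerical illustration (an informal sketch, not part of the original notes), one can spot-check the additivity, homogeneity, and faithfulness axioms for the outer product with NumPy; `np.outer(x, y)` realizes 𝒙𝒚ᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y, z = rng.standard_normal((3, 5))
alpha = 2.5

# Additivity: (x + y) ⊗ z = x ⊗ z + y ⊗ z
print(np.allclose(np.outer(x + y, z), np.outer(x, z) + np.outer(y, z)))

# Homogeneity: alpha (x ⊗ y) = (alpha x) ⊗ y = x ⊗ (alpha y)
print(np.allclose(alpha * np.outer(x, y), np.outer(alpha * x, y)))
print(np.allclose(np.outer(alpha * x, y), np.outer(x, alpha * y)))

# Faithfulness: the outer product of nonzero vectors is nonzero
print(np.any(np.outer(x, y) != 0))
```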
The space of tensor products consists of all (finite) linear combinations ∑_𝑖 𝛼_𝑖 𝒙_𝑖 ⊗ 𝒚_𝑖,
where 𝒙_𝑖 and 𝒚_𝑖 are elements of H. This space is evidently closed under linear
combinations, since given tensors T_1 = ∑_{𝑖=1}^{𝑟} 𝛼_𝑖 𝒙_𝑖 ⊗ 𝒚_𝑖 and T_2 = ∑_{𝑖=1}^{𝑝} 𝛽_𝑖 𝒙_𝑖 ⊗ 𝒚_𝑖,

𝛾_1 T_1 + 𝛾_2 T_2 = 𝛾_1 ∑_{𝑖=1}^{𝑟} 𝛼_𝑖 𝒙_𝑖 ⊗ 𝒚_𝑖 + 𝛾_2 ∑_{𝑗=1}^{𝑝} 𝛽_𝑗 𝒙_𝑗 ⊗ 𝒚_𝑗 = ∑_{𝑖=1}^{max{𝑟,𝑝}} 𝜆_𝑖 𝒙_𝑖 ⊗ 𝒚_𝑖,

where we have concatenated the two sums. Some intuition about non-elementary
tensors can be obtained by considering the specific case of an outer product, as in the
following example. (One can view the tensor product as a generalization of the outer
product.)
Example 1.7 (Elementary tensors). As a concrete example of a non-elementary tensor,
consider the standard orthonormal basis {𝜹_1, 𝜹_2} in ℝ². (For the particular case of
vectors in ℂⁿ, the tensor product produces matrices in ℂ^{𝑛×𝑛}.) Two elementary tensors are

𝜹_1 ⊗ 𝜹_1 = [ 1 0 ; 0 0 ]   and   𝜹_2 ⊗ 𝜹_2 = [ 0 0 ; 0 1 ].

Their sum 𝜹_1 ⊗ 𝜹_1 + 𝜹_2 ⊗ 𝜹_2 is the identity matrix, which has rank two and therefore
cannot be written as a single outer product. In this way, we can view an elementary
tensor as a rank-1 matrix, while a non-elementary tensor is any higher-rank matrix.
is equal to the sum in the previous example, but has an ostensibly different representa-
tion. We can show that these representations are equivalent by using the axioms in
Definition 1.1 to distribute the sums in the first term of Equation (1.1) and simplify.
Once these equivalence classes have been established, we can rigorously define a
tensor product space as follows.
Definition 1.9 (Tensor product space). Let H be a finite-dimensional Hilbert space over a
field 𝔽. The tensor product space H ⊗ H is defined by all linear combinations of
expressions 𝒙 ⊗ 𝒚 with 𝒙, 𝒚 ∈ H, modulo the axioms presented in Section 1.1.2.
Bilinear forms can be viewed more concretely by noting their correspondence with
matrices.
Proposition 1.12 (Bilinear forms are isomorphic to matrices). The linear space Bil(H) is
isomorphic to 𝕄_𝑛(𝔽). (Recall that 𝕄_𝑛(𝔽) is the linear space of 𝑛 × 𝑛 matrices over
the field 𝔽.)

Proof. Every matrix 𝑨 ∈ 𝕄_𝑛(𝔽) has an associated bilinear form defined by

𝐵_𝑨(𝒙, 𝒚) = ∑_{𝑖,𝑗=1}^{𝑛} 𝑥_𝑖 𝑎_{𝑖𝑗} 𝑦_𝑗 = 𝒙ᵀ𝑨𝒚.
Definition 1.13 (Tensor product space). The tensor product space H ⊗ H is the algebraic
dual space of the space Bil(H) of bilinear forms. (Recall that the algebraic dual space of
a linear space V is the space of all linear functionals on V.)

Concretely, we identify each elementary tensor in H ⊗ H as a linear functional on
Bil(H) via the following mapping:

𝒙 ⊗ 𝒚 : 𝐵 ↦→ 𝐵(𝒙, 𝒚).
This can be easily extended to all tensors by linearity as follows:

∑_𝑖 𝛼_𝑖 𝒙_𝑖 ⊗ 𝒚_𝑖 : 𝐵 ↦→ ∑_𝑖 𝛼_𝑖 𝐵(𝒙_𝑖, 𝒚_𝑖).
Now, we must show that this construction satisfies the axioms in Definition 1.1.
In the following discussion,¹ let us consider bilinear forms 𝐵 with underlying matrix
representations 𝑩.

¹ For brevity, we do not check every case for each axiom, but all the other cases follow similarly.

For additivity, observe that

𝐵(𝒙 + 𝒚, 𝒛) = (𝒙ᵀ + 𝒚ᵀ)𝑩𝒛 = 𝒙ᵀ𝑩𝒛 + 𝒚ᵀ𝑩𝒛 = 𝐵(𝒙, 𝒛) + 𝐵(𝒚, 𝒛),

so the tensors (𝒙 + 𝒚) ⊗ 𝒛 and 𝒙 ⊗ 𝒛 + 𝒚 ⊗ 𝒛 act identically on every bilinear form.
For homogeneity,

𝐵(𝛼𝒙, 𝒚) = 𝛼 𝒙ᵀ𝑩𝒚 = 𝒙ᵀ𝑩(𝛼𝒚) = 𝐵(𝒙, 𝛼𝒚),

so (𝛼𝒙) ⊗ 𝒚 and 𝒙 ⊗ (𝛼𝒚) agree as functionals. For faithfulness, suppose that 𝒙 and 𝒚
are both nonzero but that the tensor 𝒙 ⊗ 𝒚 annihilates every bilinear form. Applying it
to the bilinear form with matrix 𝑩 = 𝒙𝒚ᵀ (say, in the real case), we find

𝒙ᵀ𝑩𝒚 = 𝒙ᵀ𝒙 𝒚ᵀ𝒚 = ‖𝒙‖² ‖𝒚‖² > 0,

which would violate the assumption that the tensor maps all bilinear forms to
zero.
Example 1.14 (Matrix associations in 𝔽ⁿ). If H ≅ 𝔽ⁿ, then the space of bilinear forms
Bil(H) ≅ 𝕄_𝑛(𝔽), and H ⊗ H = 𝕄_𝑛(𝔽). We can then identify 𝒙 ⊗ 𝒚 with 𝒙𝒚ᵀ. Because
bilinear forms are isomorphic to matrices and tensors are isomorphic to the dual of the
space of bilinear forms, tensors are also isomorphic to matrices.
Definition 1.15 (Tensor inner product). Given two elementary tensors 𝒙_1 ⊗ 𝒙_2 and
𝒚_1 ⊗ 𝒚_2, let the inner product between the two elementary tensors be given by

⟨𝒙_1 ⊗ 𝒙_2, 𝒚_1 ⊗ 𝒚_2⟩ ≔ ⟨𝒙_1, 𝒚_1⟩⟨𝒙_2, 𝒚_2⟩.
Exercise 1.16 (The tensor inner product is well-defined). Show that the tensor inner product
is well-defined. In this case, this means that two different representatives of the same
equivalence class (i.e., they are related by the axioms in Definition 1.1) should lead to
the same inner product.
Proposition 1.17 (Tensor orthonormal basis). Given an orthonormal basis {𝒆 1 , 𝒆 2 , . . . , 𝒆 𝑛 }
for H, the basis {𝒆 𝑖 ⊗ 𝒆 𝑗 : 𝑖 , 𝑗 = 1, . . . , 𝑛} is an orthonormal basis for H ⊗ H. In
particular, the dimension of H ⊗ H is ( dim H) 2 .
Proof. For normalization, observe that

⟨𝒆_𝑖 ⊗ 𝒆_𝑗, 𝒆_𝑖 ⊗ 𝒆_𝑗⟩ = ⟨𝒆_𝑖, 𝒆_𝑖⟩⟨𝒆_𝑗, 𝒆_𝑗⟩ = 1.

For orthogonality, when (𝑖, 𝑗) ≠ (𝑘, 𝑙), at least one of the factors below vanishes:

⟨𝒆_𝑖 ⊗ 𝒆_𝑗, 𝒆_𝑘 ⊗ 𝒆_𝑙⟩ = ⟨𝒆_𝑖, 𝒆_𝑘⟩⟨𝒆_𝑗, 𝒆_𝑙⟩ = 0.

For completeness, expand each vector in an elementary tensor with respect to the
orthonormal basis of H, where we have applied Definition 1.1 to reduce all the terms
to a linear combination of elements from the tensor orthonormal basis. This shows that
H ⊗ H is in the span of the set of elementary tensors. Every linear combination of
elementary tensors is trivially in H ⊗ H, which completes the proof.
Exercise 1.18 (The tensor orthonormal basis is complete). Repeat the previous proof of
completeness using the equivalence of tensor product spaces with bilinear forms. Hint:
Remember the correspondence between bilinear forms and matrices.
We will often choose to order the orthonormal basis lexicographically; that is, we
say 𝒆_𝑗 ⊗ 𝒆_𝑘 precedes 𝒆_𝑚 ⊗ 𝒆_𝑛 if and only if 𝑗 < 𝑚, or 𝑗 = 𝑚 and 𝑘 < 𝑛. In other
words, we sort by the first index, then the second.
Definition 1.19 (Elementary tensor operators). Given linear operators 𝑨, 𝑩 ∈ L(H)
acting on H and vectors 𝒙, 𝒚 ∈ H, define a linear operator acting on H ⊗ H by the
following action on the elementary tensor 𝒙 ⊗ 𝒚:

(𝑨 ⊗ 𝑩)(𝒙 ⊗ 𝒚) = (𝑨𝒙) ⊗ (𝑩𝒚).

(In the finite-dimensional case, every linear operator can of course be described by a matrix.)
Proof. As an example, let us prove the case for normal operators. If 𝑨, 𝑩 are normal,
then
(𝑨 ⊗ 𝑩) (𝑨 ⊗ 𝑩) ∗ = (𝑨 ⊗ 𝑩)(𝑨 ∗ ⊗ 𝑩 ∗ )
= (𝑨𝑨 ∗ ⊗ 𝑩𝑩 ∗ )
= (𝑨 ∗ 𝑨) ⊗ (𝑩 ∗ 𝑩)
= (𝑨 ∗ ⊗ 𝑩 ∗ )(𝑨 ⊗ 𝑩).
Therefore, 𝑨 ⊗ 𝑩 is also normal.
Example 1.24 (Quantum computing). Quantum computers, devices that use properties
of quantum states to perform calculations, are studied using a circuit model where
each quantum gate acts on two qubits (the quantum equivalent of a bit). Formally, the
Hilbert space of a quantum circuit is the tensor product of the Hilbert spaces of all the
qubits. For physical reasons, only unitary operations are permitted in quantum circuits.
Because each two-qubit gate is a unitary operation, the unitary case of persistence
shows that the action on the entire space is also unitary. For further references, see
[NC00].
Proof. We prove the expression for the eigenvalues using the Schur decomposition.
(The Schur decomposition expresses an arbitrary complex square matrix 𝑨 as 𝑨 = 𝑸𝑻𝑸∗,
where 𝑸 is unitary and 𝑻 is an upper triangular matrix whose diagonal entries are the
eigenvalues of 𝑨.) We compute

𝑨 ⊗ 𝑨 = (𝑸𝑻𝑸∗) ⊗ (𝑸𝑻𝑸∗) = (𝑸 ⊗ 𝑸)(𝑻 ⊗ 𝑻)(𝑸 ⊗ 𝑸)∗.

In the first equality, we have applied the Schur decomposition; in the second, the
composition rule for tensor product operators. The center term 𝑻 ⊗ 𝑻 is an upper
triangular matrix with diagonal elements {𝜆_𝑖 𝜆_𝑗 : 𝑖, 𝑗 = 1, . . . , 𝑛}, whose ordering is
given by the lexicographic ordering of the orthonormal basis for the tensor product
space. Reading off these entries proves the result.
For the eigenvectors, we can simply note that
(𝑨 ⊗ 𝑨) (𝒖 𝑗 ⊗ 𝒖 𝑘 ) = (𝑨𝒖 𝑗 ) ⊗ (𝑨𝒖 𝑘 ) = 𝜆 𝑗 𝜆𝑘 (𝒖 𝑗 ⊗ 𝒖 𝑘 ),
which proves the result.
A similar construction holds for singular values.
Proposition 1.26 (Singular value decomposition). If a linear operator 𝑨 ∈ L( H) has a
singular value decomposition (SVD) given by 𝑨 = 𝑼 𝚺𝑽 ∗ , where 𝑼 and 𝑽 are unitary
and 𝚺 = diag (𝜎1 , . . . , 𝜎𝑛 ) , then the singular values of 𝑨 ⊗ 𝑨 are {𝜎𝑖 𝜎 𝑗 : 𝑖 , 𝑗 =
1, . . . , 𝑛}.
Proof. We compute

𝑨 ⊗ 𝑨 = (𝑼𝚺𝑽∗) ⊗ (𝑼𝚺𝑽∗) = (𝑼 ⊗ 𝑼)(𝚺 ⊗ 𝚺)(𝑽 ⊗ 𝑽)∗,

which has the same form as the SVD. As the center term is diagonal, we can read off
the entries in lexicographic order to prove the result.
The spectral decomposition and singular value decomposition therefore provide
a method to construct tensor operators based on relations to the underlying linear
operators with particular eigenvalues and eigenvectors.
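The tensor operator 𝑨 ⊗ 𝑨 is realized concretely by the Kronecker product in the lexicographic basis. The following NumPy sketch (illustrative, not from the notes) checks that the singular values of `np.kron(A, A)` are products of pairs of singular values of 𝑨, and runs a consistency check on the eigenvalue product via the determinant.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
AkA = np.kron(A, A)                            # matrix of A ⊗ A in the lexicographic basis

# Singular values of A ⊗ A are the products sigma_i(A) * sigma_j(A)
sig = np.linalg.svd(A, compute_uv=False)
expected = np.sort(np.kron(sig, sig))[::-1]    # all pairwise products, decreasing order
print(np.allclose(np.linalg.svd(AkA, compute_uv=False), expected))

# Consistency check for the eigenvalues: det(A ⊗ A) is the product of all lambda_i * lambda_j
print(np.allclose(np.linalg.det(AkA), np.linalg.det(A) ** (2 * n)))
```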
Notes
Much of the material in the section is discussed briefly in [Bha97, Chap. I]. The
book [Hal74] contains a more mathematical description of bilinear forms and tensor
products.
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Hal74] P. R. Halmos. Finite-dimensional vector spaces. 2nd ed. Springer-Verlag, New York-
Heidelberg, 1974.
[NC00] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information.
Cambridge University Press, Cambridge, 2000.
2. Multilinear Algebra
Starting from the algebraic dual of the space ML𝑘 ( H) , denoted by ( ML𝑘 ( H)) ∗ , we
define the 𝑘 -fold tensor product space ⊗𝑘 H.
Definition 2.2 (Multivariate tensor product space). Given any Hilbert space H, we define
the 𝑘-fold tensor product space ⊗^𝑘 H as the algebraic dual of the space ML_𝑘(H).
Indeed, we identify ⊗^𝑘 H ≅ (ML_𝑘(H))∗.
Following this definition, we can identify each elementary tensor with a linear
functional on the space ML_𝑘(H). Indeed, for any 𝒙_1, . . . , 𝒙_𝑘 ∈ H, we define the
action of 𝒙_1 ⊗ · · · ⊗ 𝒙_𝑘 on ML_𝑘(H) as

𝒙_1 ⊗ · · · ⊗ 𝒙_𝑘 : 𝑀 ↦→ 𝑀(𝒙_1, . . . , 𝒙_𝑘)   for 𝑀 ∈ ML_𝑘(H),

and the action extends to general tensors by linearity.
Definition 2.6 (Inner product for multivariate tensor space). The inner product ⟨·, ·⟩_{⊗𝑘} is
defined on elementary tensors in ⊗^𝑘 H by

⟨𝒙_1 ⊗ · · · ⊗ 𝒙_𝑘, 𝒚_1 ⊗ · · · ⊗ 𝒚_𝑘⟩_{⊗𝑘} ≔ ∏_{𝑖=1}^{𝑘} ⟨𝒙_𝑖, 𝒚_𝑖⟩,

for any 𝒙_1, . . . , 𝒙_𝑘, 𝒚_1, . . . , 𝒚_𝑘 ∈ H.
Exercise 2.7 (Tensor inner product). Verify that the function h·, ·i ⊗𝑘 as introduced in
Definition 2.6 is an inner product.
Indeed, this inner product can be extended to all tensors in ⊗𝑘 H by linearity. Given
the inner product h·, ·i ⊗𝑘 , it is possible to find an orthonormal basis for the space ⊗𝑘 H.
Proposition 2.8 (Orthonormal basis for ⊗^𝑘 H). The set

{𝒆_{𝑖_1} ⊗ · · · ⊗ 𝒆_{𝑖_𝑘} : 𝑖_𝑗 = 1, . . . , 𝑛 for 𝑗 = 1, . . . , 𝑘}

is an orthonormal basis for ⊗^𝑘 H. In particular, the dimension of ⊗^𝑘 H is 𝑛^𝑘.
Definition 2.10 (𝑘-fold tensor product operator). Given 𝑨 ∈ L(H), we define the
(elementary) tensor product operator ⊗^𝑘 𝑨 : ⊗^𝑘 H → ⊗^𝑘 H via

⊗^𝑘 𝑨 (𝒙_1 ⊗ · · · ⊗ 𝒙_𝑘) = (𝑨𝒙_1) ⊗ · · · ⊗ (𝑨𝒙_𝑘).

(Note that it is sufficient to define the action of the tensor product operator on elementary
tensors, as the definition can be extended to general tensors by linearity.)
Similarly to the bivariate case, the 𝑘 -fold tensor product operator ⊗𝑘 𝑨 possesses
the following important properties.
Proposition 2.11 (Tensor operator: Properties). Let 𝑨 ∈ L(H). Let ⊗^𝑘 𝑨 : ⊗^𝑘 H → ⊗^𝑘 H be
the 𝑘-fold tensor product operator associated to 𝑨.
1. Composition. For any 𝑨, 𝑩 ∈ L(H), it holds that ⊗^𝑘 (𝑨𝑩) = (⊗^𝑘 𝑨)(⊗^𝑘 𝑩).
2. Inverses. For any invertible 𝑨 ∈ L(H), we have (⊗^𝑘 𝑨)^{−1} = ⊗^𝑘 (𝑨^{−1}).
3. Adjoint. For any 𝑨 ∈ L(H), it holds that (⊗^𝑘 𝑨)∗ = ⊗^𝑘 (𝑨∗).
4. Persistence. If 𝑨 ∈ L(H) is self-adjoint, unitary, normal, or positive semidefinite,
then ⊗^𝑘 𝑨 inherits the respective property.
Exercise 2.12 (Tensor product operators). Provide a proof for Proposition 2.11.
Proof. As before, the proof of Proposition 2.13 follows directly from the Schur decom-
position of 𝑨 .
Proposition 2.14 (Singular values of 𝑘-fold tensor product operator). Let 𝑨 ∈ L(H), and
let ⊗^𝑘 𝑨 : ⊗^𝑘 H → ⊗^𝑘 H be the 𝑘-fold tensor product operator associated to 𝑨.
The operator ⊗^𝑘 𝑨 has singular values {𝜎_{𝑖_1} · · · 𝜎_{𝑖_𝑘} : 1 ≤ 𝑖_𝑗 ≤ 𝑛, 𝑗 = 1, . . . , 𝑘}, where
𝜎_1, . . . , 𝜎_𝑛 are the singular values of the linear operator 𝑨.
Proof. The proof of Proposition 2.14 again follows directly from the singular value
decomposition of 𝑨 .
Exercise 2.15 (Tensor spectral theory). Elaborate on the details of the proofs of Propositions
2.13 and 2.14 by generalizing the arguments of Lecture 1 to the 𝑘 -fold tensor product
case.
2.2 Permutations
Before continuing our discussion on multilinear algebra, we first outline some back-
ground material on permutations which will be of use when defining antisymmetric
and symmetric tensor products. We present the definition of a permutation as follows.

Definition 2.16 (Permutation). A permutation on 𝑘 symbols is a bijective function
𝜋 : {1, . . . , 𝑘} → {1, . . . , 𝑘}. The set of all permutations on 𝑘 symbols is denoted S_𝑘.
Exercise 2.17 (Counting permutations). Argue that there are 𝑘 ! permutations on 𝑘 symbols.
We next provide two informative examples of permutations.
Example 2.18 (Some permutations). Both of the following functions 𝜋_1, 𝜋_2 : {1, 2, 3} →
{1, 2, 3} represent permutations.
1. Let 𝜋_1 : {1, 2, 3} → {1, 2, 3} be defined by the tableau

( 1 2 3
  1 3 2 ).

In other words, 𝜋_1 fixes 1 and exchanges 2 and 3.
2. Let 𝜋_2 : {1, 2, 3} → {1, 2, 3} be defined by the tableau

( 1 2 3
  2 3 1 ).

In other words, 𝜋_2 shifts each symbol cyclically.
Example 2.20 (Transposition). Referring back to Example 2.18, we can observe that the
permutation 𝜋 1 is a transposition, while 𝜋 2 is not.
Exercise 2.22 (*Transpositions). Sketch a proof for Theorem 2.21. Hint: For a more intuitive
argument, appeal to a “sorting” scheme. For a more group-theoretic argument, appeal
to the “cycle representation.” First, show that any permutation can be written as
a product of disjoint cycles, and then show that a product of disjoint cycles can be
expressed as a product of 2-cycles.
Furthermore, let us recall the following result.

Theorem 2.23. A permutation can be written as a product of transpositions in many
ways, but the parity (even or odd) of the number of transpositions is the same in every
such representation.

The signature 𝜀_𝜋 of a permutation 𝜋 is defined by

𝜀_𝜋 = +1 if 𝜋 is even,   and   𝜀_𝜋 = −1 if 𝜋 is odd.

It is important to note that Theorem 2.23 ensures that the signature of a permutation
is well-defined.
Definition 2.26 (Wedge product). The wedge product of vectors 𝒙_1, . . . , 𝒙_𝑘 ∈ H is the
tensor

𝒙_1 ∧ · · · ∧ 𝒙_𝑘 ≔ (1/√(𝑘!)) ∑_{𝜋∈S_𝑘} 𝜀_𝜋 𝒙_{𝜋(1)} ⊗ · · · ⊗ 𝒙_{𝜋(𝑘)}.

The purpose of normalizing the wedge product by (𝑘!)^{−1/2} is to ensure that
the wedge product of orthonormal vectors yields a unit vector. We will turn to this
matter later when we introduce the inner product.
Example 2.27 (Wedge product for 𝑘 = 2). When 𝑛 = 1, the wedge product 𝒙 ∧ 𝒚 = 0
always. In case 𝑘 = 2, the wedge product of 𝒙, 𝒚 ∈ H is the tensor

𝒙 ∧ 𝒚 = (1/√2)(𝒙 ⊗ 𝒚 − 𝒚 ⊗ 𝒙).

When 𝑛 = 3 and 𝔽 = ℝ, the wedge product is equivalent to the well-known cross
product. For general 𝑛 and 𝑘 = 2, one can find an analogy between the wedge product
and the skew part of a matrix. (Why?)
We observe that the signature of the permutation in the summand makes the wedge
product antisymmetric. Indeed, for any 𝑖 , 𝑗 ∈ {1, . . . , 𝑘 } we have that
𝒙 1 ∧ · · · ∧ 𝒙 𝑖 ∧ · · · ∧ 𝒙 𝑗 ∧ · · · ∧ 𝒙 𝑘 = −𝒙 1 ∧ · · · ∧ 𝒙 𝑗 ∧ · · · ∧ 𝒙 𝑖 ∧ · · · ∧ 𝒙 𝑘 .
Indeed, exchanging any two vectors introduces an additional transposition into each
permutation in Definition 2.26, which flips the signature. As a consequence, the wedge
product is also known as the antisymmetric tensor product.
The important observation above leads to the conclusion that if 𝒙 𝑖 = 𝒙 𝑗 for some
𝑖 ≠ 𝑗 in {1, . . . , 𝑘 }, then 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = 0. Exchanging 𝒙 𝑖 and 𝒙 𝑗 does not change
the product, but flips its sign. Therefore, since this product is an element of a vector
space, 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = 0.
Proof. Without loss of generality, by the linear dependence of the set, it is possible
to write 𝒙_1 = ∑_{𝑗=2}^{𝑘} 𝛼_𝑗 𝒙_𝑗 for some 𝛼_2, . . . , 𝛼_𝑘 ∈ 𝔽. This means in particular that, by
linearity,

𝒙_1 ∧ · · · ∧ 𝒙_𝑘 = (∑_{𝑗=2}^{𝑘} 𝛼_𝑗 𝒙_𝑗) ∧ 𝒙_2 ∧ · · · ∧ 𝒙_𝑘 = ∑_{𝑗=2}^{𝑘} 𝛼_𝑗 (𝒙_𝑗 ∧ 𝒙_2 ∧ · · · ∧ 𝒙_𝑘) = 0,

where each summand vanishes because the vector 𝒙_𝑗 appears twice in the wedge product.
The converse is true by a straightforward induction argument.
Exercise 2.29 (Wedge: Linear dependence). Recalling that any permutation can be written
as a composition of transpositions, complete the induction argument in the proof of
Proposition 2.28.
We can now proceed by constructing the wedge product space.
Definition 2.30 (Wedge product space). We define the wedge product space ∧𝑘 H by
∧𝑘 H := lin {𝒙 1 ∧ · · · ∧ 𝒙 𝑘 : 𝒙 𝑖 ∈ H and 𝑖 = 1, . . . , 𝑘 } .
As a linear subspace of ⊗𝑘 H, the space ∧𝑘 H thus inherits the inner product h·, ·i ⊗𝑘 ,
from which it is possible to construct an orthonormal basis.
Proposition 2.31 (Orthonormal basis for ∧^𝑘 H). The set

{𝒆_{𝑖_1} ∧ · · · ∧ 𝒆_{𝑖_𝑘} : 1 ≤ 𝑖_1 < · · · < 𝑖_𝑘 ≤ 𝑛}

is an orthonormal basis for ∧^𝑘 H. In particular, the dimension of ∧^𝑘 H is the binomial
coefficient (𝑛 choose 𝑘).
As such, we can identify the wedge operator ∧^𝑘 𝑨 as the restriction of the tensor
operator ⊗^𝑘 𝑨 to the wedge product space ∧^𝑘 H. Indeed, appealing to Definition 2.26,
one can conclude that for any 𝒙_1, . . . , 𝒙_𝑘 ∈ H,

⊗^𝑘 𝑨 (𝒙_1 ∧ · · · ∧ 𝒙_𝑘) = (1/√(𝑘!)) ∑_{𝜋∈S_𝑘} 𝜀_𝜋 (𝑨𝒙_{𝜋(1)}) ⊗ · · · ⊗ (𝑨𝒙_{𝜋(𝑘)})
 = (𝑨𝒙_1) ∧ · · · ∧ (𝑨𝒙_𝑘)
 = ∧^𝑘 𝑨 (𝒙_1 ∧ · · · ∧ 𝒙_𝑘),
where the last equality follows from Definition 2.33. The wedge product space ∧𝑘 H is
thus an invariant subspace of the operator ⊗𝑘 𝑨 . As a consequence, the wedge operator
∧𝑘 𝑨 inherits all the properties of ⊗𝑘 𝑨 .
Proposition 2.34 (Wedge operator: Properties). Let 𝑨 ∈ L(H), and let ∧^𝑘 𝑨 : ∧^𝑘 H → ∧^𝑘 H
be the wedge operator associated to 𝑨.
1. Composition. For any 𝑨, 𝑩 ∈ L(H), it holds that ∧^𝑘 (𝑨𝑩) = (∧^𝑘 𝑨)(∧^𝑘 𝑩).
2. Inverses. For any invertible 𝑨 ∈ L(H), we have (∧^𝑘 𝑨)^{−1} = ∧^𝑘 (𝑨^{−1}).
3. Adjoint. For any 𝑨 ∈ L(H), it holds that (∧^𝑘 𝑨)∗ = ∧^𝑘 (𝑨∗).
4. Persistence. If 𝑨 ∈ L(H) is self-adjoint, unitary, normal, or positive semidefinite,
then ∧^𝑘 𝑨 inherits the respective property.
Proof. As in the case of the tensor product operator, the proof of Proposition 2.35
follows directly from the Schur decomposition of 𝑨 . This time, we use the fact that
the orthonormal basis for the wedge space consists of tensors 𝒆 𝑖 1 ∧ · · · ∧ 𝒆 𝑖𝑘 with no
repeated indices.
Proposition 2.36 (Singular values of wedge operators). Let 𝑨 ∈ L(H), and let ∧^𝑘 𝑨 :
∧^𝑘 H → ∧^𝑘 H be the wedge operator associated to 𝑨. The operator ∧^𝑘 𝑨 has singular
values {𝜎_{𝑖_1} · · · 𝜎_{𝑖_𝑘} : 1 ≤ 𝑖_1 < · · · < 𝑖_𝑘 ≤ 𝑛}, where 𝜎_1, . . . , 𝜎_𝑛 are the singular values
of the linear operator 𝑨.

Proof. As in the case of the tensor product operator, the proof of Proposition 2.36
follows directly from the singular value decomposition of 𝑨.
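A concrete matrix realization of ∧^𝑘 𝑨 is the 𝑘th compound matrix, whose entries are the 𝑘 × 𝑘 minors of 𝑨 indexed by increasing row and column subsets. The sketch below (illustrative Python, not code from the notes) builds this matrix and checks that its singular values are products of 𝑘 distinct singular values of 𝑨, in line with Proposition 2.36.

```python
import numpy as np
from itertools import combinations

def compound(A, k):
    """k-th compound matrix: entries are the k x k minors det(A[rows, cols]),
    with row/column subsets listed in lexicographic (increasing) order."""
    n = A.shape[0]
    subsets = list(combinations(range(n), k))
    C = np.empty((len(subsets), len(subsets)), dtype=A.dtype)
    for a, rows in enumerate(subsets):
        for b, cols in enumerate(subsets):
            C[a, b] = np.linalg.det(A[np.ix_(rows, cols)])
    return C

rng = np.random.default_rng(3)
n, k = 4, 2
A = rng.standard_normal((n, n))
C = compound(A, k)

# Singular values of the wedge operator are products of k distinct singular values of A.
sig = np.linalg.svd(A, compute_uv=False)
expected = sorted((np.prod(sig[list(idx)]) for idx in combinations(range(n), k)), reverse=True)
print(np.allclose(np.linalg.svd(C, compute_uv=False), expected))

# With k = n, the compound matrix is the 1 x 1 matrix [det(A)], previewing Section 2.6.
print(np.allclose(compound(A, n), np.linalg.det(A)))
```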
2.6 Determinants
In this section, we develop the theory of determinants of matrices as an immediate
consequence of the theory of wedge products.
Suppose that dim H = 𝑛 . Then we may identify the wedge operator ∧𝑛 𝑨 with the de-
terminant of 𝑨 . Indeed, the operator ∧𝑛 𝑨 acts on the space ∧𝑛 H = lin {𝒆 1 ∧ · · · ∧ 𝒆 𝑛 },
which is a one-dimensional space. Recall that a linear operator on a one-dimensional
space acts by scalar multiplication. In particular, the operator ∧𝑛 𝑨 scales by its
own single eigenvalue, 𝜆 1 · · · 𝜆𝑛 , where 𝜆 1 , . . . , 𝜆𝑛 are the eigenvalues of 𝑨 . These
considerations lead us to the following definition of the determinant.
Definition 2.37 (Determinant). For any linear operator 𝑨 ∈ L(H), the determinant of
𝑨 is defined as

det(𝑨) ≔ ∧^𝑛 𝑨 = 𝜆_1 · · · 𝜆_𝑛,

where 𝜆_1, . . . , 𝜆_𝑛 are the eigenvalues of 𝑨. (This definition is a little abusive because
the wedge operator ∧^𝑛 𝑨 is actually a scalar operator 𝛼I acting on a one-dimensional
subspace of ⊗^𝑛 H.)
Proposition 2.38 (Determinant: Properties). The determinant function det : 𝕄_𝑛(𝔽) → 𝔽
has the following properties.
1. Multiplicativity. det(𝑨𝑩) = det(𝑨) det(𝑩) for all 𝑨, 𝑩 ∈ 𝕄_𝑛(𝔽).
2. Multilinearity. Writing 𝑨 = [𝒂_1, . . . , 𝒂_𝑛] in terms of its columns, the determinant
is a linear function of the column 𝒂_𝑗 when the other columns are held fixed,
for all 𝑗 = 1, . . . , 𝑛.
3. Antisymmetry. The determinant reverses the sign if two columns of its argument
are swapped. Namely, letting 𝑨 = [𝒂_1, . . . , 𝒂_𝑛], for any 𝑖, 𝑗 ∈ {1, . . . , 𝑛},

det[𝒂_1, . . . , 𝒂_𝑖, . . . , 𝒂_𝑗, . . . , 𝒂_𝑛] = −det[𝒂_1, . . . , 𝒂_𝑗, . . . , 𝒂_𝑖, . . . , 𝒂_𝑛].

4. Normalization. det(I) = 1.
Exercise 2.39 (Determinant). Using Definition 2.37, provide a proof for Proposition 2.38.
Finally, we note the following theorem.
Theorem 2.40 (Uniqueness of the determinant). The determinant is the unique function
from 𝕄𝑛 (𝔽) to 𝔽 with the above properties of multiplicativity, multilinearity,
antisymmetry and normalization.
Proof. For a proof of this statement see the first chapter of [Art11].
Definition 2.41 (Symmetric tensor product). The symmetric tensor product 𝒙_1 ∨ · · · ∨ 𝒙_𝑘
of vectors 𝒙_1, . . . , 𝒙_𝑘 ∈ H is defined as

𝒙_1 ∨ · · · ∨ 𝒙_𝑘 ≔ (1/√(𝑘!)) ∑_{𝜋∈S_𝑘} 𝒙_{𝜋(1)} ⊗ · · · ⊗ 𝒙_{𝜋(𝑘)}.

In contrast with the wedge product, the symmetric tensor product is invariant under
exchanging any two of its factors. For any 𝑖, 𝑗 ∈ {1, . . . , 𝑘},

𝒙_1 ∨ · · · ∨ 𝒙_𝑖 ∨ · · · ∨ 𝒙_𝑗 ∨ · · · ∨ 𝒙_𝑘 = 𝒙_1 ∨ · · · ∨ 𝒙_𝑗 ∨ · · · ∨ 𝒙_𝑖 ∨ · · · ∨ 𝒙_𝑘,

since the sum ranges over all permutations.
Notes
The material of this chapter can largely be found in [Bha97], albeit to varying degrees
of detail. In particular, the above discussions represent a more in-depth exploration
of the foundational elements of multilinear algebra. The interested reader can find
a further exploration of these topics in [Bha97]. For a refresher on topics relating to
linear algebra, we refer to [Art11].
Lecture bibliography
[Art11] M. Artin. Algebra. Pearson Prentice Hall, 2011.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
3. Majorization
3.1 Majorization
In this section, we introduce the concept of majorization for vectors. Informally
speaking, a vector 𝒙 is majorized by a vector 𝒚 , written 𝒙 ≺ 𝒚 , if 𝒚 is “spikier” (or
more concentrated) than 𝒙 .
3.1.1 Setting
In this lecture, we will work in ℝⁿ equipped with the standard basis (𝜹_𝑖 : 𝑖 =
1, 2, . . . , 𝑛) and the canonical inner product ⟨𝒂, 𝒃⟩ ≔ 𝒂ᵀ𝒃. We define the vector of
ones:

1 ≔ (1, 1, . . . , 1)ᵀ ∈ ℝⁿ.

The trace of a vector is

tr(𝒙) ≔ ⟨1, 𝒙⟩ = ∑_{𝑖=1}^{𝑛} 𝑥_𝑖.
Much later, when we study positive linear maps, we will see that the trace of a vector
and the trace of a matrix are analogous.
3.1.2 Rearrangements
We now define rearrangements, a key concept in analysis.
𝑥 1↓ ≥ 𝑥 2↓ ≥ · · · ≥ 𝑥𝑛↓ .
𝒙 ↓ = ( 3, 2, 1) ;
𝒙 ↑ = ( 1, 2, 3).
Observe that the vector 𝒚 = ( 2, 3, 1) has the same rearrangements.
The next proposition provides us with an upper bound and a lower bound on the
inner product of two vectors from the inner products of their rearrangements.
Proposition 3.3 (Chebyshev rearrangement). For all 𝒙, 𝒚 ∈ ℝⁿ,

⟨𝒙↓, 𝒚↑⟩ ≤ ⟨𝒙, 𝒚⟩ ≤ ⟨𝒙↓, 𝒚↓⟩;
⟨𝒙↑, 𝒚↓⟩ ≤ ⟨𝒙, 𝒚⟩ ≤ ⟨𝒙↑, 𝒚↑⟩.
Proof. The proof is left as an exercise to the reader. Hint: Assume that 𝒙 , 𝒚 ≥ 0, and
apply summation by parts. Otherwise, see [HLP88, p. 261].
For vectors 𝒙, 𝒚 ∈ ℝⁿ, we say that 𝒚 majorizes 𝒙, written 𝒙 ≺ 𝒚, when

∑_{𝑖=1}^{𝑘} 𝑥↓_𝑖 ≤ ∑_{𝑖=1}^{𝑘} 𝑦↓_𝑖   for each 𝑘 = 1, . . . , 𝑛,   and   ∑_{𝑖=1}^{𝑛} 𝑥↓_𝑖 = ∑_{𝑖=1}^{𝑛} 𝑦↓_𝑖.

Alternatively, the majorization order can be stated using the increasing rearrangements.
For 𝒙, 𝒚 ∈ ℝⁿ, we have 𝒙 ≺ 𝒚 if and only if

∑_{𝑖=1}^{𝑘} 𝑥↑_𝑖 ≥ ∑_{𝑖=1}^{𝑘} 𝑦↑_𝑖   for each 𝑘 = 1, . . . , 𝑛,   and   ∑_{𝑖=1}^{𝑛} 𝑥↑_𝑖 = ∑_{𝑖=1}^{𝑛} 𝑦↑_𝑖.
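The majorization relation is easy to test numerically. The small helper below (an illustrative sketch, not part of the notes; the function name is ours) compares partial sums of the decreasing rearrangements.

```python
import numpy as np

def majorizes(y, x, tol=1e-10):
    """Return True if x ≺ y, i.e., y majorizes x."""
    xs = np.sort(np.asarray(x, dtype=float))[::-1]   # decreasing rearrangement of x
    ys = np.sort(np.asarray(y, dtype=float))[::-1]   # decreasing rearrangement of y
    partial_ok = np.all(np.cumsum(xs) <= np.cumsum(ys) + tol)
    totals_match = abs(xs.sum() - ys.sum()) <= tol
    return bool(partial_ok and totals_match)

# The "flattest" and "spikiest" probability vectors bracket every other one:
n = 5
uniform = np.full(n, 1.0 / n)
spike = np.eye(n)[0]
p = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
print(majorizes(p, uniform), majorizes(spike, p))    # True True
```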
Definition 3.9 (Doubly stochastic matrices). A matrix 𝑺 ∈ 𝕄_𝑛(ℝ) with entries 𝑺 = [𝑠_{𝑖𝑗}]
is doubly stochastic if it has the following three properties:
1. Nonnegative entries: 𝑠_{𝑖𝑗} ≥ 0 for all 𝑖, 𝑗 = 1, . . . , 𝑛;
2. Unit row sums: ∑_{𝑗=1}^{𝑛} 𝑠_{𝑖𝑗} = 1 for each row index 𝑖;
3. Unit column sums: ∑_{𝑖=1}^{𝑛} 𝑠_{𝑖𝑗} = 1 for each column index 𝑗.

Example 3.10 (Doubly stochastic matrices). Every permutation matrix is doubly stochastic,
as is the uniform averaging matrix 𝑛^{−1}11ᵀ.
Doubly stochastic matrices naturally arise in a variety of applications, such as the study
of reversible Markov chains.
In fact, a matrix is doubly stochastic if and only if, viewed as a linear map on ℝⁿ, it
preserves positivity (𝒙 ≥ 0 implies 𝑺𝒙 ≥ 0), it is unital (𝑺1 = 1), and it is trace
preserving (tr(𝑺𝒙) = tr(𝒙) for all 𝒙 ∈ ℝⁿ).

Proof. The positivity property follows from the positivity of each coordinate: (𝑺𝒙)_𝑖 =
∑_{𝑗=1}^{𝑛} 𝑠_{𝑖𝑗} 𝑥_𝑗 ≥ 0 for 𝑖 = 1, . . . , 𝑛.
The unital property follows easily from the observation that (𝑺1)_𝑖 = ∑_{𝑗=1}^{𝑛} 𝑠_{𝑖𝑗} = 1.
To prove the trace preservation property, note that 𝑺 is doubly stochastic if and
only if 𝑺ᵀ is also doubly stochastic. Therefore

tr(𝑺𝒙) = ⟨1, 𝑺𝒙⟩ = ⟨𝑺ᵀ1, 𝒙⟩ = ⟨1, 𝒙⟩ = tr(𝒙).
Proposition 3.12 (Properties of DS_𝑛). The set DS_𝑛 has the following properties:
1. It is a compact (closed and bounded) subset of 𝕄_𝑛(ℝ).
2. It is convex.
3. It is closed under multiplication: if 𝑺, 𝑻 ∈ DS_𝑛, then 𝑺𝑻 ∈ DS_𝑛.
Proof. First, observe that the set DS_𝑛 is bounded. Indeed, for all 𝑺 ∈ DS_𝑛, we have
‖𝑺‖_∞ = max_{𝑖𝑗} |𝑠_{𝑖𝑗}| ≤ 1. To show that DS_𝑛 is closed and convex, simply observe that
each row and each column of a doubly stochastic matrix can be defined as a vector
belonging to the intersection of closed halfspaces.
In detail, we can write 𝑺 = [𝒔_1, 𝒔_2, . . . , 𝒔_𝑛], where 𝒔_1, 𝒔_2, . . . , 𝒔_𝑛 are the columns
of 𝑺. Then, for each index 𝑖 = 1, . . . , 𝑛, the constraints on the column 𝒔_𝑖 can be expressed as

⟨1, 𝒔_𝑖⟩ = 1   and   𝒔_𝑖 ∈ {𝒙 ∈ ℝⁿ : ⟨𝜹_𝑗, 𝒙⟩ ≥ 0} for 𝑗 = 1, . . . , 𝑛,

where the equality constraint is the intersection of the two closed halfspaces ⟨1, 𝒔_𝑖⟩ ≥ 1
and ⟨1, 𝒔_𝑖⟩ ≤ 1. The same holds for the rows of 𝑺. The set DS_𝑛 is therefore the
intersection of (a finite number of) closed and convex sets and hence is itself closed
and convex.
To prove the composition property, consider two doubly stochastic matrices 𝑺, 𝑻 and
their product 𝑺𝑻. The entries of the product are clearly nonnegative:

(𝑺𝑻)_{𝑖𝑗} = ∑_{𝑘=1}^{𝑛} 𝑠_{𝑖𝑘} 𝑡_{𝑘𝑗} ≥ 0.

The row and column sums are also preserved in the product. For each row index 𝑖,

∑_{𝑗=1}^{𝑛} (𝑺𝑻)_{𝑖𝑗} = ∑_{𝑗=1}^{𝑛} ∑_{𝑘=1}^{𝑛} 𝑠_{𝑖𝑘} 𝑡_{𝑘𝑗} = ∑_{𝑘=1}^{𝑛} 𝑠_{𝑖𝑘} ∑_{𝑗=1}^{𝑛} 𝑡_{𝑘𝑗} = 1,

and, for each column index 𝑗,

∑_{𝑖=1}^{𝑛} (𝑺𝑻)_{𝑖𝑗} = ∑_{𝑖=1}^{𝑛} ∑_{𝑘=1}^{𝑛} 𝑠_{𝑖𝑘} 𝑡_{𝑘𝑗} = ∑_{𝑘=1}^{𝑛} 𝑡_{𝑘𝑗} ∑_{𝑖=1}^{𝑛} 𝑠_{𝑖𝑘} = 1.
Proof. As was seen in Example 3.10, permutation matrices are doubly stochastic
matrices and, by the previous theorem, the set DS𝑛 is a convex set.
The converse of the corollary above is also true: any doubly stochastic matrix is a
convex combination of permutation matrices. This is the content of the Birkhoff–von
Neumann theorem.
3.3 T-transforms
We now introduce a special class of doubly stochastic matrices, the T-transforms.
A T-transform takes a convex combination of two entries of a vector while leaving all
other coordinates unchanged; in other words, it averages two coordinates. Formally, a
T-transform has the form 𝑻 = 𝜏I + (1 − 𝜏)𝑸, where 𝜏 ∈ [0, 1] and 𝑸 is a permutation
matrix that transposes two coordinates. These matrices will allow us to relate
majorization to the set DS_𝑛 of doubly stochastic matrices.
For vectors in ℝ², we will argue that 𝒙 ≺ 𝒚 if and only if 𝒙 = 𝑻𝒚 for some choice of 𝜏.
Indeed, let 𝒙 = (𝑥_1, 𝑥_2) and 𝒚 = (𝑦_1, 𝑦_2), and assume (without loss of generality) that
the entries of each vector are listed in decreasing order. The majorization relation 𝒙 ≺ 𝒚
states that

𝑥_1 ≤ 𝑦_1   and   𝑥_1 + 𝑥_2 = 𝑦_1 + 𝑦_2.   (3.2)

Hence, 𝑦_2 ≤ 𝑥_1 ≤ 𝑦_1. As a consequence, there is a 𝜏 ∈ [0, 1] such that

𝑥_1 = 𝜏𝑦_1 + (1 − 𝜏)𝑦_2.

Plugging the last display back into (3.2), we obtain

𝑥_2 = (1 − 𝜏)𝑦_1 + 𝜏𝑦_2.
We conclude that 𝒙 = 𝑻 𝒚 for the T-transform with this distinguished parameter 𝜏 .
Conversely, we can reverse this chain of argument to see that 𝒙 = 𝑻 𝒚 implies that
𝒙 ≺ 𝒚 for vectors in ℝ2 .
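To see a T-transform in action, here is a small NumPy sketch (illustrative, not from the notes) that averages two coordinates of a vector and confirms that the result is majorized by the original.

```python
import numpy as np

def t_transform(n, i, j, tau):
    """T-transform on R^n: convex combination of the identity and the (i, j) transposition."""
    Q = np.eye(n)
    Q[[i, j]] = Q[[j, i]]                   # permutation matrix swapping coordinates i and j
    return tau * np.eye(n) + (1 - tau) * Q

y = np.array([5.0, 1.0, 2.0, 4.0])
x = t_transform(4, 0, 3, 0.6) @ y           # average the first and last coordinates
print(x)                                    # [4.6, 1.0, 2.0, 4.4]

# Check x ≺ y: partial sums of the decreasing rearrangements, and equal totals.
xs, ys = np.sort(x)[::-1], np.sort(y)[::-1]
print(np.all(np.cumsum(xs) <= np.cumsum(ys) + 1e-12), np.isclose(xs.sum(), ys.sum()))
```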
Theorem 3.18 (Majorization and DS𝑛 ). Fix two vectors 𝒙 , 𝒚 ∈ ℝ𝑛 . The following
statements are equivalent:
1. 𝒙 ≺𝒚
2. 𝒙 = 𝑻 𝑛 · · ·𝑻 1 𝒚 for certain T-transforms 𝑻 𝑖 .
3. 𝒙 = 𝑺 𝒚 for some 𝑺 ∈ DS𝑛 .
4. 𝒙 ∈ conv {𝑷 𝒚 : 𝑷 is a permutation matrix on ℝ𝑛 }.
We have used the unital property and Jensen’s inequality. Taking the trace shows that
the totals of the two vectors agree, since doubly stochastic matrices are trace-preserving.
This is the equivalent formulation of majorization from Proposition 3.6.
(1 ⇒ 2). This is the hard part. Assume that 𝒙 ≺ 𝒚 . We will construct a sequence of
T-transforms 𝑻 1 ,𝑻 2 , . . . ,𝑻 𝑛 such that
𝒙 = 𝑻 𝑛𝑻 𝑛−1 · · ·𝑻 1 𝒚 and
𝑻 𝑘 +1𝑻 𝑘 . . .𝑻 1 𝑦 ≺ 𝑻 𝑘 . . .𝑻 1 𝒚 for all 𝑘 = 0, . . . , 𝑛 − 1.
𝑥_1 = 𝜏𝑦_1 + (1 − 𝜏)𝑦_𝑘,

where we use the convention that the indices of 𝒘 range from 2 to 𝑛. For 𝑛 < 𝑘, since
𝑥_1 < 𝑦_𝑖 for 𝑖 = 1, . . . , 𝑛, we have that

∑_{𝑖=2}^{𝑛} 𝑥_𝑖 ≤ ∑_{𝑖=2}^{𝑛} 𝑦_𝑖 = ∑_{𝑖=2}^{𝑛} 𝑤_𝑖.

For 𝑛 ≥ 𝑘,

∑_{𝑖=1}^{𝑛} 𝑥_𝑖 ≤ ∑_{𝑖=1}^{𝑛} 𝑦_𝑖.

Therefore,

∑_{𝑖=2}^{𝑛} 𝑥_𝑖 ≤ ∑_{𝑖=1}^{𝑛} 𝑦_𝑖 − 𝑥_1 = ∑_{𝑖=2}^{𝑛} 𝑤_𝑖.

As required, (𝑥_2, 𝑥_3, . . . , 𝑥_𝑛) ≺ 𝒘.
By our induction hypothesis, there is a T-transform on 𝒘 satisfying the requirements.
This transformation can be extended to a transformation 𝑻 2 on (𝑥 1 , 𝒘 ) by leaving the
first coordinate unchanged, completing the induction step.
The next exercise completes the above proof without resorting to the Birkhoff–von
Neumann theorem.
Exercise 3.19 (Direct proof of majorization and DS𝑛 ). Prove that (2 ⇒ 4) in Theorem 3.18
by expanding the product of T-transforms.
Proof. We present a proof of Schur’s theorem only. For a proof of Horn’s theorem, see
[MOA11, p.302].
By the spectral theorem, we can write 𝑨 = 𝑼∗𝚲𝑼, where 𝑼 is unitary and
𝚲 is a diagonal matrix whose diagonal entries are listed in 𝝀↓(𝑨). Write 𝑼 =
[𝒖_1, 𝒖_2, . . . , 𝒖_𝑛], where 𝒖_1, 𝒖_2, . . . , 𝒖_𝑛 ∈ ℂⁿ are the columns of 𝑼. Then

𝑎_{𝑖𝑖} = 𝜹_𝑖∗ 𝑨 𝜹_𝑖 = 𝜹_𝑖∗ 𝑼∗𝚲𝑼 𝜹_𝑖 = (𝑼𝜹_𝑖)∗ 𝚲 (𝑼𝜹_𝑖) = 𝒖_𝑖∗ 𝚲 𝒖_𝑖 = ∑_{𝑗=1}^{𝑛} |𝑢_{𝑖𝑗}|² 𝜆↓_𝑗(𝑨).

Define the orthostochastic matrix 𝑺 with entries 𝑠_{𝑖𝑗} = |𝑢_{𝑖𝑗}|². The last display tells us
that

𝑎_{𝑖𝑖} = (𝑺 𝝀↓(𝑨))_𝑖.

In other words, diag(𝑨) = 𝑺 𝝀↓(𝑨), where 𝑺 is doubly stochastic. By Theorem 3.18, we
may conclude that diag(𝑨) ≺ 𝝀↓(𝑨).
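Schur’s theorem is easy to confirm numerically. The sketch below (illustrative code, not from the notes) draws a random self-adjoint matrix and verifies that its diagonal is majorized by its eigenvalue vector.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 6
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (X + X.conj().T) / 2                   # self-adjoint

d = np.real(np.diag(A))                    # diagonal entries (real for self-adjoint A)
lam = np.linalg.eigvalsh(A)                # real eigenvalues

# diag(A) ≺ λ(A): compare partial sums of the decreasing rearrangements.
ds, ls = np.sort(d)[::-1], np.sort(lam)[::-1]
print(np.all(np.cumsum(ds) <= np.cumsum(ls) + 1e-10))   # partial-sum inequalities
print(np.isclose(ds.sum(), ls.sum()))                   # equal traces
```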
Notes
Majorization is a foundational topic in analysis. It plays a key role in the classic
book Inequalities by Hardy, Littlewood, and Pólya [HLP88]. Majorization is also
the core topic in the well-known book Inequalities by Marshall & Olkin, updated by
Arnold [MOA11]. The presentation in this lecture is adapted from Bhatia’s book [Bha97,
Chap. II], which owes a heavy debt to Ando’s arrangement of the material [And89].
Indeed, Ando understood that similar ideas are also at the heart of the theory of
positive linear maps, which we will explore in the second half of the course.
Lecture bibliography
[And89] T. Ando. “Majorization, doubly stochastic matrices, and comparison of eigenvalues”.
In: Linear Algebra and its Applications 118 (1989), pages 163–248.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[HLP88] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Reprint of the 1952 edition.
Cambridge University Press, Cambridge, 1988.
[MOA11] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and
its applications. Second. Springer, New York, 2011. doi: 10.1007/978- 0- 387-
68276-1.
4. Isotone Functions
4.1 Recap
For vectors 𝒙, 𝒚 ∈ ℝⁿ, the majorization relation 𝒙 ≺ 𝒚 is defined by the following
conditions:

∑_{𝑖=1}^{𝑘} 𝑥↓_𝑖 ≤ ∑_{𝑖=1}^{𝑘} 𝑦↓_𝑖   for each 𝑘 = 1, . . . , 𝑛;
tr[𝒙] ≔ ∑_{𝑖=1}^{𝑛} 𝑥↓_𝑖 = ∑_{𝑖=1}^{𝑛} 𝑦↓_𝑖 ≕ tr[𝒚].
In other words, majorization consists of 𝑛 − 1 inequalities and one equality among the
(sorted) entries of the vectors. Intuitively, this captures the idea that vector 𝒚 is “spikier”
than vector 𝒙 , or vector 𝒙 is “flatter” than vector 𝒚 . Majorization plays a basic role in
matrix analysis. For example, we saw an important theorem of Schur that relates the
diagonal entries of a matrix to its decreasingly ordered eigenvalues.
Theorem 4.2 (Weyl majorant theorem). For every matrix 𝑨 ∈ 𝕄_𝑛(ℂ), we have

∏_{𝑖=1}^{𝑘} |𝜆↓_𝑖(𝑨)| ≤ ∏_{𝑖=1}^{𝑘} 𝜎↓_𝑖(𝑨)   for each 𝑘 = 1, . . . , 𝑛;
∏_{𝑖=1}^{𝑛} |𝜆_𝑖(𝑨)| = ∏_{𝑖=1}^{𝑛} 𝜎_𝑖(𝑨).

We can express these relations with the shorthand

log(|𝝀↓|) ≺ log(𝝈).

(When there are zero eigenvalues and zero singular values, the product formulas give a
precise meaning to the log-majorization.)
Proof. Let us express the determinant in terms of both the eigenvalues and the singular
values:

|det(𝑨)| = ∏_{𝑖=1}^{𝑛} |𝜆_𝑖(𝑨)|,   (4.1)
|det(𝑨)| = ∏_{𝑖=1}^{𝑛} 𝜎_𝑖(𝑨).   (4.2)

In equation (4.1), we use the Schur decomposition and the multiplicativity of the determi-
nant, and then we evaluate the determinant of the upper triangular matrix 𝑻. Note that
the determinant of any unitary matrix has magnitude one. Equation (4.2) is analogous
but uses the singular value decomposition instead.
Lemma 4.4 (Spectral radius and spectral norm). For every matrix 𝑴 ∈ 𝕄_𝑛(ℂ), the maximal
singular value bounds the magnitude of each eigenvalue:

|𝜆_𝑖(𝑴)| ≤ 𝜎_max(𝑴)   for each 𝑖 = 1, . . . , 𝑛.
Proof of Weyl majorant theorem. The equality is Lemma 4.3; for the inequalities, we
cleverly use multilinear algebra. (Without invoking multilinear algebra, the proof gets
horrible.) For any 𝑘 = 1, . . . , 𝑛, consider the antisymmetric subspace ∧^𝑘 ℂⁿ and the
induced operator ∧^𝑘 𝑨 ≕ 𝑴. In Lecture 2, we showed that the eigenvalues of the
operator 𝑴 are products of 𝑘 eigenvalues of 𝑨 with no repeated indices, and that its
singular values are the analogous products of singular values. By Lemma 4.4, we can
bound the largest eigenvalue magnitude of 𝑴 by the maximal singular value of 𝑴.
Indeed,

∏_{𝑖=1}^{𝑘} |𝜆↓_𝑖(𝑨)| = |𝜆↓_1(𝑴)| ≤ 𝜎_max(𝑴) = ∏_{𝑖=1}^{𝑘} 𝜎↓_𝑖(𝑨).
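The Weyl inequalities can also be checked numerically. The following sketch (illustrative, not from the notes) compares partial products of sorted eigenvalue magnitudes with partial products of singular values for a random complex matrix.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

abs_eigs = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]   # |lambda|, decreasing
sigmas = np.linalg.svd(A, compute_uv=False)              # sigma, decreasing

# Partial products: prod |lambda_i| <= prod sigma_i for each k, with equality at k = n.
print(np.all(np.cumprod(abs_eigs) <= np.cumprod(sigmas) * (1 + 1e-10)))
print(np.isclose(np.prod(abs_eigs), np.prod(sigmas)))
```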
4.3 Isotonicity
Today, we move on to study isotone functions that respect the majorization order.
We will introduce easy-to-check conditions for isotonicity and see that many existing
measures of “inequality” fit into this framework.
4.3.1 Motivation
Let us begin with an economics example. Suppose we wish to summarize how equal
or unequal a society is. Consider a vector 𝒙 ∈ ℝ+𝑛 with normalization tr [𝒙 ] = 1 that
describes the distribution of wealth of each individual in some society 𝒙 . As we have
already seen, the majorization relation between societies 𝒙 ≺ 𝒚 is a way to quantify
that “society 𝒚 is more unequal than society 𝒙 ”.
Motivated by the above, can we find an even simpler summary of “inequality”? For
example, can we reduce the distribution to a single number? Formally, we may ask if
there is a function Φ : ℝⁿ → ℝ such that 𝒙 ≺ 𝒚 implies Φ(𝒙) ≤ Φ(𝒚). Here are some
candidates.
• k-max. The sum of the largest 𝑘 entries obviously preserves the majorization
order:

Φ(𝒙) = ∑_{𝑖=1}^{𝑘} 𝑥↓_𝑖,

where 1 ≤ 𝑘 ≤ 𝑛.
• Negative entropy. The negative entropy reflects the amount of randomness or
uniformity in a distribution:

Φ(𝒙) = negent(𝒙) = ∑_{𝑖=1}^{𝑛} 𝑥_𝑖 log(𝑥_𝑖).
• Variance. The variance measures how far the entries spread about their mean:

Φ(𝒙) = Var[𝒙] = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑥_𝑖 − 𝑥̄)²,

where 𝑥̄ ≔ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥_𝑖 is the mean.
By the end of this lecture, we will develop new tools to confirm that the negative
entropy and the variance also respect the majorization order.
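Before developing the general criteria, it is instructive to test these candidate summaries numerically. The sketch below (illustrative code, not from the notes) evaluates the 𝑘-max, the negative entropy, and the variance on a pair of distributions with 𝒙 ≺ 𝒚 and confirms that each value increases from 𝒙 to 𝒚.

```python
import numpy as np

def k_max(x, k):
    return np.sort(x)[::-1][:k].sum()

def negentropy(x):
    x = np.asarray(x, dtype=float)
    return np.sum(x * np.log(x))              # assumes strictly positive entries

def variance(x):
    return np.var(x)                          # (1/n) * sum (x_i - mean)^2

x = np.array([0.30, 0.30, 0.25, 0.15])        # "flatter" distribution
y = np.array([0.60, 0.20, 0.15, 0.05])        # "spikier" distribution; x ≺ y

for phi in (lambda v: k_max(v, 2), negentropy, variance):
    print(phi(x) <= phi(y))                   # True for each isotone summary
```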
4.3.2 Definitions
To prepare for our treatment of isotone functions, we will need some additional
definitions.
Definition 4.6 (Weak majorization). For two vectors 𝒙, 𝒚 ∈ ℝⁿ, we say that 𝒚 subma-
jorizes 𝒙 and write 𝒙 ≺_𝑤 𝒚 when

∑_{𝑖=1}^{𝑘} 𝑥↓_𝑖 ≤ ∑_{𝑖=1}^{𝑘} 𝑦↓_𝑖   for each 𝑘 = 1, . . . , 𝑛.   (4.3)
Definition 4.8 (Isotone function). We say a vector-valued function Φ : ℝⁿ → ℝ^𝑚 is
isotone if for all 𝒙, 𝒚 ∈ dom(Φ),

𝒙 ≺ 𝒚   implies   Φ(𝒙) ≺_𝑤 Φ(𝒚).

(In the definition, we also allow for functions defined only on a convex subset of vectors,
e.g., those with positive entries.)
One may wonder: Why not demand the full majorization relation Φ(𝒙 ) ≺ Φ(𝒚 ) ?
This condition turns out to be too limiting. For example, in the case of scalar-valued
functions (𝑚 = 1), the trace constraint in majorization would force the function to be
a constant.
Exercise 4.9 (Majorization on ℝ). For real numbers 𝑎, 𝑏 , verify that
𝑎 ≺ 𝑏 if and only if 𝑎 = 𝑏,
𝑎 ≺𝑤 𝑏 if and only if 𝑎 ≤ 𝑏.
In other words, the majorization relation for numbers is very rigid.
Definition 4.11 (Permutation covariance). A function Φ : ℝⁿ → ℝ^𝑚 is permutation
covariant if for each vector 𝒙 ∈ ℝⁿ and each permutation 𝑷 on ℝⁿ, there is a
permutation 𝑷′ on ℝ^𝑚 such that

Φ(𝑷𝒙) = 𝑷′Φ(𝒙).

(The permutation 𝑷′ may depend on both 𝒙 and 𝑷.)
Example 4.12 (The real case). When the output dimension 𝑚 = 1, the preceding defini-
tions simplify. Permutation covariant functions must be permutation invariant:
Φ(𝑷 𝒙 ) = Φ(𝒙 ) for each permutation 𝑷 .
Convexity reduces to the familiar notion for real-valued functions.
Convexity and permutation covariance give sufficient (but not necessary) conditions
for isotonicity.
The first relation, 𝒙 = 𝑺𝒚 for some 𝑺 ∈ DS_𝑛, holds by Theorem 3.18. The second
relation expresses the doubly stochastic matrix as a convex combination of permutations
𝑷_𝑖 with 𝛼_𝑖 ≥ 0 and ∑_{𝑖=1}^{𝑟} 𝛼_𝑖 = 1. By convexity of the function Φ,

Φ(𝒙) = Φ(∑_{𝑖=1}^{𝑟} 𝛼_𝑖 𝑷_𝑖 𝒚) ≤ ∑_{𝑖=1}^{𝑟} 𝛼_𝑖 Φ(𝑷_𝑖 𝒚) = ∑_{𝑖=1}^{𝑟} 𝛼_𝑖 𝑷′_𝑖 Φ(𝒚) ≕ 𝒛,

where the inequality acts entrywise. Since the vector 𝒛 is a convex combination
of permutations of the vector Φ(𝒚), we arrive at the majorization relation 𝒛 ≺ Φ(𝒚).
The combined inequalities Φ(𝒙) ≤ 𝒛 ≺ Φ(𝒚) imply the submajorization relation
Φ(𝒙) ≺_𝑤 Φ(𝒚), which is the advertised result.
Exercise 4.14 (Submajorization deduction). Check that the following implication holds.
• Absolute powers. Convex power functions, applied entrywise, are isotone. For
example, for 𝑝 ≥ 1, the map Φ(𝒙) = (|𝑥_1|^𝑝, . . . , |𝑥_𝑛|^𝑝) is isotone.
Corollary 4.17 (Log majorization). The log-majorization log (𝒙 ) ≺ log (𝒚 ) implies the
submajorization 𝒙 ≺𝑤 𝒚 .
Proof. This result follows immediately from the exponential example (4.6).
This corollary connects nicely with the Weyl majorant theorem (Theorem 4.2). We
maintain the same notation for the decreasingly ordered eigenvalues and singular
values.
Corollary 4.18 (Eigenvalue and singular value majorization). For every matrix 𝑨 ∈ 𝕄𝑛 , we
have the submajorization relation 𝝀 ↓ (𝑨) ≺𝑤 𝝈 (𝑨) .
Unfortunately, all the above examples rely on convexity. In the next section, we
will see that convexity is not required to attain isotonicity.
Proof of Theorem 4.22. (2 ⇒ 1). We begin with the reverse implication. Consider
vectors 𝒖, 𝒙 that satisfy the majorization relation 𝒖 ≺ 𝒙 . To check isotonicity of the
function Φ, we want to prove the inequality Φ(𝒖) ≤ Φ(𝒙 ) . In Lecture 3, we showed
that majorization can be expressed using a sequence of T-transforms:
𝒖 = 𝑻 𝑛 . . .𝑻 1 𝒙 .
Recall that a T-transform is a convex combination of the identity and a transposition:
𝑻 = 𝜏𝑰 + ( 1 − 𝜏)𝑸 for 𝜏 ∈ [0, 1] and 𝑸 a transposition.
Therefore, it suffices to obtain the inequality Φ(𝒖) ≤ Φ(𝒙 ) for a single transition
𝒖 = 𝑻 𝒙.
Without loss of generality, permutation invariance allows us to assume that the
T-transform 𝑻 averages the first two coordinates 𝑥 1 , 𝑥 2 only. We can write explicitly
𝒖 = ((1 − 𝑠)𝑥_1 + 𝑠𝑥_2, 𝑠𝑥_1 + (1 − 𝑠)𝑥_2, 𝑥_3, . . . , 𝑥_𝑛)   for 𝑠 ∈ [0, 1/2].   (4.8)

The key idea is to interpolate from the vector 𝒙 to the vector 𝒖. Define the function

𝒙(𝜏) = ((1 − 𝜏)𝑥_1 + 𝜏𝑥_2, 𝜏𝑥_1 + (1 − 𝜏)𝑥_2, 𝑥_3, . . . , 𝑥_𝑛)   for 𝜏 ∈ [0, 𝑠].

Note that 𝒙(0) = 𝒙 and 𝒙(𝑠) = 𝒖. By the fundamental theorem of calculus (and the
assumption that the function Φ is differentiable),

Φ(𝒖) − Φ(𝒙) = ∫_0^𝑠 (d/d𝜏)[Φ(𝒙(𝜏))] d𝜏
 = ∫_0^𝑠 (𝑥_2 − 𝑥_1) [𝜕_1Φ(𝒙(𝜏)) − 𝜕_2Φ(𝒙(𝜏))] d𝜏
 = −∫_0^𝑠 [(𝑥_1(𝜏) − 𝑥_2(𝜏)) / (1 − 2𝜏)] [𝜕_1Φ(𝒙(𝜏)) − 𝜕_2Φ(𝒙(𝜏))] d𝜏
 ≤ 0.
The second equality is the chain rule, and the last inequality is an assumption.
(1 ⇒ 2). Now we confirm the forward implication. Fix any vector 𝒙 . Observe that
𝒙 is majorized by any permutation 𝑷 𝒙 because they have the identical sorted entries.
Therefore, we can write down the two-sided majorization relation
𝒙 ≺ 𝑷 𝒙 ≺ 𝑷 −1 (𝑷 𝒙 ) = 𝒙 .
This is analogous to (4.8), but the T-transform 𝑻 𝑖 𝑗 acts on the (𝑖 , 𝑗 ) pair of indices.
Take the derivative as 𝑠 → 0 to obtain

0 ≥ lim_{𝑠→0} [Φ(𝒖_{𝑖𝑗}(𝑠)) − Φ(𝒙)] / 𝑠 = (d/d𝑠)[Φ(𝒖_{𝑖𝑗}(𝑠))]|_{𝑠=0} = (𝑥_𝑗 − 𝑥_𝑖) [𝜕_𝑖Φ(𝒙) − 𝜕_𝑗Φ(𝒙)].
The first inequality holds because of the majorization relation between the vectors and
the Schur convexity of the function Φ. Rearrange to obtain the advertised result.
Notes
The material in this lecture is adapted from Bhatia [Bha97, Chap. II], which is based
on Ando’s vision [And89].
Lecture bibliography
[And89] T. Ando. “Majorization, doubly stochastic matrices, and comparison of eigenvalues”.
In: Linear Algebra and its Applications 118 (1989), pages 163–248.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
5. Birkhoff and von Neumann
In this lecture, we study the geometry of DS_𝑛, the set of 𝑛 × 𝑛 doubly stochastic
matrices. We establish the classic result of Birkhoff & von Neumann, which states
that the set of doubly stochastic matrices can be expressed as the convex hull of the
permutation matrices. We use this result to prove the von Neumann trace theorem,
which plays a basic role in understanding unitarily invariant norms and convex trace
functions.

Agenda:
1. Doubly stochastic matrices
2. The Birkhoff–von Neumann theorem
3. Minkowski theorem
4. Proof of Birkhoff theorem
5. The von Neumann trace theorem
Definition 5.2 (Birkhoff polytope). The set DS_𝑛 collects all of the 𝑛 × 𝑛 doubly
stochastic matrices:

DS_𝑛 ≔ {𝑺 ∈ ℝ^{𝑛×𝑛} : 𝑺 is doubly stochastic}.

As we will discuss, the set DS_𝑛 is a convex polytope known as the Birkhoff polytope.
Recall from Theorem 3.18 that 𝒙 ≺ 𝒚 if and only if 𝒙 lies in the convex hull of the
permutations of 𝒚. That is, when 𝒙 ≺ 𝒚, the entries of 𝒙 are “more average” than the
entries of 𝒚. This observation leads us to explore what role the permutation matrices
play in the structure of DS_𝑛.
Exercise 5.4 (The convex hull of permutations). Deduce that the convex hull of the permu-
tation matrices is a subset of the doubly stochastic matrices:
conv 𝑷 ∈ ℝ𝑛×𝑛 : 𝑷 is a permutation matrix ⊆ DS𝑛 .
5.2 The Birkhoff–von Neumann theorem
In this section, we will discuss the Birkhoff–von Neumann theorem, which relates the
convex hull of permutation matrices to the set of doubly stochastic matrices DS_𝑛. To
prepare for our treatment, we will need some additional definitions.

Figure 5.2: Example of a polytope constructed as the convex hull of five points shown in red.
Theorem 5.6 (Birkhoff 1946; von Neumann 1953). The extreme points (i.e., vertices) of
DS_𝑛 are precisely the permutation matrices. In particular, DS_𝑛 can be written as the
convex hull of the permutation matrices:

DS_𝑛 = conv{𝑷 ∈ ℝ^{𝑛×𝑛} : 𝑷 is a permutation matrix}.
the full set of extreme points of the Birkhoff polytope. First, we will present an important
result from convex geometry that shows that the extreme points play a key role in the
structure of convex sets.
Exercise 5.7 (Number of permutations). How many permutations suffice to express a
doubly stochastic matrix in DS𝑛 ? Hint: Use the Carathéodory theorem.
A nice corollary of Theorem 5.6 is an independent proof of the geometric character-
ization of majorization from the last lecture.
Figure 5.3 (Examples of extreme points). Points shown in red are extreme
while the points shown in blue are not extreme.
Proof. We shall prove this theorem by induction on the dimension of K. (The dimension
of a convex set K is defined as dim K ≔ dim aff(K), the dimension of the smallest
affine space containing the set K.) If dim(K) = 0, then K = {𝒙} is a singleton and the
result follows. Assume that the result holds true for all compact convex sets of dimension
strictly smaller than dim(K).
Figure 5.5 (Two cases for proof on Minkowski theorem by induction). The
left figure demonstrates the case when 𝒙 belongs to boundary ( K) .
The right figure demonstrates the case when 𝒙 does not belong to
boundary ( K).
𝒙 = 𝜏𝒚 + 𝜏̄𝒛   where 𝜏 ∈ [0, 1] and 𝜏̄ ≔ 1 − 𝜏.

Thus,

𝒙 ∈ conv{𝒚, 𝒛} ⊆ conv ext(K).

Indeed, by the induction hypothesis, each of 𝒚, 𝒛 ∈ conv ext(K).
Please refer to [Sch14] for more details on this theorem and its context. Let us
discuss a few consequences and generalization of the Minkowski theorem.
Corollary 5.12 (Extreme point). Every nonempty, convex, compact set has an extreme
point.
Exercise 5.13 (Bauer maximum principle). Show that every linear functional on a convex,
compact set attains its maximum at an extreme point. In particular, if the maximizer
is unique , it must be an extreme point of the set.
The Minkowski theorem has a generalization to infinite-dimensional spaces, a result
that has far-reaching implications in mathematical analysis.
Fact 5.14 (Krein–Milman). Suppose X is a locally convex topological vector space (for
example, a normed space). Suppose that K is a compact and convex subset of X. Then
K is equal to the closed convex hull of its extreme points:

K = cl conv(ext(K)).
Definition 5.16 (Perturbation). Let 𝑺 ∈ DS𝑛 belong to the set of doubly stochastic
matrices. A perturbation 𝑬 ∈ ℝ𝑛×𝑛 of the matrix 𝑺 is a matrix with the property
that 𝑺 ± 𝑬 ∈ DS𝑛 .
Exercise 5.17 (Extreme point and perturbation). Show that 𝑺 is an extreme point of DS𝑛
if and only if 𝑺 admits no perturbation 𝑬 , except the zero matrix. In particular, if 𝑺
admits a nontrivial perturbation, then 𝑺 is not an extreme point.
Proof of Theorem 5.6. We already know that the permutation matrices are extreme
points of the Birkhoff polytope. We will argue that the permutation matrices compose
the full set of extreme points. An application of Minkowski's theorem completes the
argument.
Figure 5.6 Example of a perturbation.
To that end, let us suppose that 𝑺 ∈ DS𝑛 is not a permutation. We will produce a
nonzero perturbation 𝑬 . Therefore, 𝑺 is not an extreme point of DS𝑛 .
Observe that every doubly stochastic matrix that is not a permutation contains at
least one entry that is not an integer. We will find a “cycle” consisting of nonintegral
entries. First, find an entry in row 𝑖1 and column 𝑗1 such that 𝑠𝑖1𝑗1 lies strictly between 0 and
1. Now, select another entry in row 𝑖1 such that 𝑠𝑖1𝑗2 lies strictly between 0 and 1. We
can do this because the entries in the row have to add up to 1. Find another entry in
column 𝑗2 such that 𝑠𝑖2𝑗2 lies strictly between 0 and 1. We keep moving horizontally
along the rows and vertically along the columns in sequence until we encounter an
index pair that we have already seen. Figure 5.8 illustrates this process. This process
has to terminate because there are only finitely many positions in the matrix.
Among all such sequences, we choose the one with fewest steps. This sequence
must be a cycle:
(𝑖 1 , 𝑗 1 ) = (𝑖𝑟 , 𝑗 𝑟 ).
Note that this cycle must have an even number of steps. Indeed, in order to complete a
cycle, we need to move from a row and a column in sequence. An odd number of steps
in a given row or in a given column can be combined to form a shorter connection.
Given the indices in the cycle, we can find 𝜀 > 0 and form a matrix 𝑬 that alternately
adds and subtracts 𝜀 at the successive positions of the cycle and is zero elsewhere.
Then the row sums and column sums of 𝑬 are equal to zero. We can conclude that
𝑺 ± 𝑬 ∈ DS𝑛.
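Aside: (Numerical illustration). The decomposition promised by Theorem 5.6 can also be computed greedily. The sketch below is illustrative only: it assumes scipy is available, the function name birkhoff_decomposition is ours, and scipy's assignment solver stands in for the combinatorial cycle argument of the proof.
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decomposition(S, tol=1e-9):
    # Greedily express a doubly stochastic S as a convex combination of
    # permutation matrices. Each round selects a permutation supported on the
    # currently positive entries (one exists by Birkhoff's theorem) and
    # subtracts as much of it as possible.
    S = S.astype(float).copy()
    weights, perms = [], []
    while S.max() > tol:
        cost = np.where(S > tol, -S, np.inf)      # forbid (near-)zero entries
        rows, cols = linear_sum_assignment(cost)
        theta = S[rows, cols].min()
        P = np.zeros_like(S)
        P[rows, cols] = 1.0
        weights.append(theta)
        perms.append(P)
        S = S - theta * P
    return weights, perms

rng = np.random.default_rng(0)
n = 4
S = sum(np.eye(n)[rng.permutation(n)] for _ in range(6)) / 6
w, Ps = birkhoff_decomposition(S)
print(np.isclose(sum(w), 1.0),
      np.allclose(S, sum(wi * Pi for wi, Pi in zip(w, Ps))))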
In other words, the theorem solves a matching problem. What is the best way we can
rotate 𝑨 to align it with 𝑩 ? Part of the assertion is that the maximum is attained.
Proof. We prove the theorem by establishing upper and lower bounds on the trace.
First, let us introduce the eigenvalue decompositions of 𝑨 and 𝑩 .
max𝑼 tr(𝑼∗𝑨𝑼𝑩) ≥ tr(𝑼0∗𝑨𝑼0𝑩) = tr(diag(𝝀) diag(𝝁)) = Σ_{𝑖=1}^{𝑛} 𝜆𝑖𝜇𝑖.
It remains to show that this lower bound is indeed the largest possible value.
We can compute the upper bound of the trace as follows. Recall that 𝜹 𝑖 is the 𝑖 th standard
basis vector.
max𝑼 tr(𝑼∗𝑨𝑼𝑩) = max𝑼 tr(𝑼∗ diag(𝝀) 𝑼 diag(𝝁))
 = max𝑼 Σ_{𝑖=1}^{𝑛} 𝜹𝑖∗ 𝑼∗ diag(𝝀) 𝑼 diag(𝝁) 𝜹𝑖
 = max𝑼 Σ_{𝑖=1}^{𝑛} (𝑼𝜹𝑖)∗ diag(𝝀) (𝑼𝜹𝑖) 𝜇𝑖 .
Here, the first equality is obtained by absorbing 𝑸 𝑨 , 𝑸 𝑩 into 𝑼 . The second equality
comes from the fact that the trace is the sum of diagonal entries. Next, we represent
diag (𝝀) as a sum of rank-one matrices:
diag(𝝀) = Σ_𝑗 𝜆𝑗 𝜹𝑗 𝜹𝑗∗ .
Note that we have exposed the squared magnitudes of the entries of the unitary matrix,
which compose an orthostochastic matrix 𝑺 ∈ DS𝑛 with entries 𝑠𝑖 𝑗 = |𝑢 𝑖 𝑗 | 2 .
We see that the maximum only increases if we pass to the full set of doubly stochastic
matrices:
max𝑼 tr(𝑼∗𝑨𝑼𝑩) ≤ max_{𝑺∈DS𝑛} Σ_{𝑖,𝑗=1}^{𝑛} 𝜇𝑖 𝜆𝑗 𝑠𝑖𝑗 .
Invoke Bauer’s maximum principle to see that this linear function attains its maximum
at an extreme point of the set. Then use Birkhoff ’s theorem to recognize that the
extreme points of DS𝑛 are precisely the permutation matrices 𝑷 ∈ ℝ𝑛×𝑛 . Thus,
max𝑼 tr(𝑼∗𝑨𝑼𝑩) ≤ max𝑷 Σ_{𝑖,𝑗=1}^{𝑛} 𝜇𝑖 𝜆𝑗 𝑝𝑖𝑗 = max𝜋 Σ_{𝑖=1}^{𝑛} 𝜆𝜋(𝑖) 𝜇𝑖 = Σ_{𝑖=1}^{𝑛} 𝜆𝑖 𝜇𝑖 .
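Aside: (Numerical check). The two halves of this argument can be confirmed numerically. The sketch below is illustrative only; the aligning unitary 𝑼0 = 𝑸𝑨𝑸𝑩∗ and the helper names are our choices. Random unitaries never exceed the sorted-eigenvalue pairing, while 𝑼0 attains it.
import numpy as np

rng = np.random.default_rng(1)
n = 5
def rand_herm(n):
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

A, B = rand_herm(n), rand_herm(n)
lam, QA = np.linalg.eigh(A)          # ascending eigenvalues and eigenvectors
mu, QB = np.linalg.eigh(B)
bound = np.sum(lam * mu)             # same pairing as sorting both decreasingly

def rand_unitary(n):
    Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Q, _ = np.linalg.qr(Z)
    return Q

vals = []
for _ in range(2000):
    U = rand_unitary(n)
    vals.append(np.real(np.trace(U.conj().T @ A @ U @ B)))

U0 = QA @ QB.conj().T                # aligns the eigenbases of A and B
print(max(vals) <= bound + 1e-9,
      np.isclose(np.real(np.trace(U0.conj().T @ A @ U0 @ B)), bound))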
Notes
Birkhoff and von Neumann independently established the result that shares their
name. There is an influential geometric proof of the result due to Hoffmann &
Wielandt [HW53], which also connects the result with perturbation theory for eigenval-
ues of a normal matrix. The material on convex geometry is adapted from Barvinok’s
book [Bar02] and from Schneider’s treatise [Sch14]. We have extracted this direct
proof of Birkhoff ’s theorem from a note by Glenn Hurlbert, which appears in [Hur10].
The proof of Richter’s trace theorem is drawn from Mirsky’s work [Mir59].
Lecture bibliography
[Bar02] A. Barvinok. A course in convexity. American Mathematical Society, Providence, RI,
2002. doi: 10.1090/gsm/054.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bir46] G. Birkhoff. “Three observations on linear algebra”. In: Univ. Nac. Tacuman, Rev.
Ser. A 5 (1946), pages 147–151.
[HW53] A. J. Hoffman and H. W. Wielandt. “The variation of the spectrum of a normal
matrix”. In: Duke J. Math. 20 (1953), pages 37–39.
[Hur10] G. H. Hurlbert. Linear optimization. The simplex workbook. Springer, New York,
2010. doi: 10.1007/978-0-387-79148-7.
[Mir59] L. Mirsky. “On the trace of matrix products”. In: Mathematische Nachrichten 20.3-6
(1959), pages 171–174. doi: 10.1002/mana.19590200306.
[Sch14] R. Schneider. Convex bodies: the Brunn–Minkowski theory. 151. Cambridge university
press, 2014. doi: 10.1017/CBO9781139003858.
6. Unitarily Invariant Norms
Example 6.4 (Ky Fan norm). Fix 𝑘 ∈ {1, . . . , 𝑛}. For each vector 𝒙 ∈ ℝ𝑛, define
‖𝒙‖(𝑘) ≔ max_{|I|=𝑘} Σ_{𝑖∈I} |𝑥𝑖| .
That is, k𝒙 k (𝑘 ) is the sum of the 𝑘 largest entries of the vector |𝒙 | . The functions
k·k (𝑘 ) are norms on ℝ𝑛 , which are known by the name of Ky Fan norms. It is also easy
to check that these norms are symmetric gauge functions.
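In code, the Ky Fan norm of a vector reduces to a sort; a minimal numpy sketch (the helper name ky_fan is ours):
import numpy as np

def ky_fan(x, k):
    # Sum of the k largest absolute entries of the vector x.
    return np.sort(np.abs(x))[::-1][:k].sum()

x = np.array([3.0, -1.0, 0.5, -4.0])
print([ky_fan(x, k) for k in range(1, 5)])   # [4.0, 7.0, 8.0, 8.5]
# k = 1 recovers the sup-norm and k = n recovers the l1-norm.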
There are many other norms on ℝ𝑛 that are symmetric gauge functions. For
instance, one could form combinations of ℓ𝑝 and Ky Fan norms, or one could form
weighted sums of the ordered entries of the vector.
Proposition 6.5 (Monotonicity). If Φ : ℝ𝑛 → ℝ₊ is a symmetric gauge function, then
|𝒙| ≤ |𝒚| implies that Φ(𝒙) ≤ Φ(𝒚) for all 𝒙, 𝒚 ∈ ℝ𝑛. Recall that 𝒙 ≤ 𝒚 for 𝒙, 𝒚 ∈ ℝ𝑛 is
interpreted entrywise and that |𝒙| ≔ (|𝑥1|, . . . , |𝑥𝑛|) denotes the entrywise modulus.
Proof. Let 𝒙, 𝒚 ∈ ℝ𝑛 be such that |𝒙| ≤ |𝒚|. By sign invariance of Φ, we can assume
that both 𝒙 and 𝒚 are positive and that 0 ≤ 𝑥𝑖 = 𝑡𝑖𝑦𝑖 for some values 𝑡𝑖 ∈ [0, 1] for
each 𝑖 = 1, . . . , 𝑛. By permutation invariance and iteration, it suffices to check the
case where 𝑡 2 = 𝑡 3 = · · · = 𝑡𝑛 = 1. Indeed, if the result holds for the case were 𝒙 and
𝒚 differ by a single entry, it can be obtained for the general case by applying it to one
coordinate at a time. Write 𝑥 1 = 𝑡 𝑦1 for 𝑡 ∈ [ 0, 1] . Then
Φ((𝑥1, 𝑥2, . . . , 𝑥𝑛)) = Φ((𝑡𝑦1, 𝑦2, . . . , 𝑦𝑛))
 = Φ( ((1+𝑡)/2) 𝑦1 − ((1−𝑡)/2) 𝑦1 , ((1+𝑡)/2) 𝑦2 + ((1−𝑡)/2) 𝑦2 , . . . , ((1+𝑡)/2) 𝑦𝑛 + ((1−𝑡)/2) 𝑦𝑛 )
 = Φ( ((1+𝑡)/2) (𝑦1, 𝑦2, . . . , 𝑦𝑛) + ((1−𝑡)/2) (−𝑦1, 𝑦2, . . . , 𝑦𝑛) )
 ≤ ((1+𝑡)/2) Φ((𝑦1, 𝑦2, . . . , 𝑦𝑛)) + ((1−𝑡)/2) Φ((−𝑦1, 𝑦2, . . . , 𝑦𝑛))
 = Φ((𝑦1, 𝑦2, . . . , 𝑦𝑛)).
The inequality follows from convexity of Φ, while the last equality uses the fact that Φ
is sign invariant.
Next, we prove a theorem that demonstrates that the Ky Fan norms play an essential
role in the theory of symmetric gauge functions.
Theorem 6.6 (Fan dominance: Vector case). Fix 𝒙 , 𝒚 ∈ ℝ𝑛 . The following statements
are equivalent:
1. k𝒙 k (𝑘 ) ≤ k𝒚 k (𝑘 ) for each 𝑘 = 1, . . . , 𝑛 .
2. Φ(𝒙 ) ≤ Φ(𝒚 ) for every symmetric gauge function Φ on ℝ𝑛 .
Proof. Statement 1 follows immediately from 2, as Ky Fan norms are symmetric gauge
functions. For the other implication, note that 1 is equivalent to the condition that
|𝒙| ≺𝜔 |𝒚|. As a consequence, there exists a vector 𝒖 ∈ ℝ𝑛 such that |𝒙| ≤ 𝒖 ≺ |𝒚|.
Therefore,
Φ(𝒙 ) ≤ Φ(|𝒙 |) ≤ Φ(𝒖) ≤ Φ(|𝒚 |) ≤ Φ(𝒚 ).
The first and last inequalities follow from sign invariance of Φ. Monotonicity of Φ
(Proposition 6.5) yields the second inequality. Finally, the third inequality follows from
the fact that Φ is convex and permutation invariant, hence isotone.
The Fan dominance theorem has a striking meaning. Given vectors 𝒙 and 𝒚 in ℝ𝑛 ,
we can check Φ(𝒙 ) ≤ Φ(𝒚 ) for every symmetric gauge function Φ, by checking the
inequality for only 𝑛 functions, namely the Ky Fan 𝑘 -norms for 𝑘 = 1, . . . , 𝑛 . In this
sense, the theorem reduces a problem involving an infinite number of inequalities to
the verification of a finite number.
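Aside: (Numerical check). The reduction can be tested numerically. The sketch below is illustrative only; it manufactures a weakly majorized vector by averaging with a doubly stochastic matrix, verifies the Ky Fan inequalities, and then spot-checks several ℓ𝑝 norms.
import numpy as np

rng = np.random.default_rng(2)
n = 6
y = rng.standard_normal(n)
# Averaging |y| by a doubly stochastic matrix yields a vector x with
# |x| weakly majorized by |y|, so every Ky Fan norm of x is dominated.
T = sum(np.eye(n)[rng.permutation(n)] for _ in range(5)) / 5
x = T @ np.abs(y)
ky_fan = lambda v, k: np.sort(np.abs(v))[::-1][:k].sum()
assert all(ky_fan(x, k) <= ky_fan(y, k) + 1e-12 for k in range(1, n + 1))
# Fan dominance predicts Phi(x) <= Phi(y) for every symmetric gauge function;
# we spot-check a few l_p norms.
print(all(np.linalg.norm(x, p) <= np.linalg.norm(y, p) + 1e-12
          for p in (1, 1.5, 2, 3, np.inf)))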
Exercise 6.7 (Submajorization). Given |𝒙 | ≺𝜔 |𝒚 | , explain how to construct a vector
𝒖 ∈ ℝ𝑛 such that |𝒙 | ≤ 𝒖 ≺ 𝒚 .
Definition 6.8 (Duality). The dual norm Φ∗ of a symmetric gauge function Φ is given
by
Φ∗(𝒚) ≔ max{⟨𝒙, 𝒚⟩ : Φ(𝒙) ≤ 1}.
We now continue with two norm inequalities that hold for all symmetric gauge
functions and specialize to familiar inequalities that frequently appear in analysis.
Proposition 6.13 (Dual norm inequality). For each symmetric gauge function Φ, we have ⟨𝒙, 𝒚⟩ ≤ Φ(𝒙) Φ∗(𝒚) for all 𝒙, 𝒚 ∈ ℝ𝑛.
The theorem has interesting implications when applied to other symmetric gauge
functions.
Definition 6.15 (Unitarily invariant norm). A norm |||·||| on 𝕄𝑛 is unitarily invariant (UI)
if
|||𝑼 𝑨𝑽 ||| = |||𝑨 ||| for all 𝑨, 𝑼 ,𝑽 ∈ 𝕄𝑛 with 𝑼 ,𝑽 unitary.
We also insist that unitarily invariant norms be normalized in an appropriate
fashion: ||| diag ( 1, 0, . . . , 0)||| = 1.
For the remainder of this lecture, |||·||| will denote an arbitrary unitarily invariant
norm. Next, let us consider some familiar examples.
Example 6.16 (ℓ2 operator norm). The ℓ2 operator norm, also known as the spectral norm,
is defined by
‖𝑨‖2 ≔ 𝜎1(𝑨) for 𝑨 ∈ 𝕄𝑛.
The ℓ2 operator norm is unitarily invariant. The singular values of a matrix are numbered
in decreasing order; i.e., by convention 𝜎1 ≥ 𝜎2 ≥ · · · ≥ 𝜎𝑛.
Example 6.17 (Frobenius norm). The Frobenius norm, also known as the Hilbert–Schmidt
norm, is defined by
‖𝑨‖F ≔ ( Σ_{𝑖=1}^{𝑛} 𝜎𝑖(𝑨)² )^{1/2} for 𝑨 ∈ 𝕄𝑛.
The functions k·k 𝑆𝑝 , commonly known as Schatten norms, define unitarily invariant
norms on the space 𝕄𝑛 . The Schatten 1-norm, also known as the trace norm or the
nuclear norm, corresponds to the sum of the singular values of the matrix. The norm
k·k 𝑆∞ coincides with the spectral norm (Example 6.16).
The functions k·k (𝑘 ) define unitarily invariant norms on 𝕄𝑛 , known as the Ky Fan
matrix norms.
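Aside: (Numerical illustration). All of these norms are functions of the singular values, and unitary invariance is easy to test; a minimal numpy sketch (helper names ours):
import numpy as np

def schatten(A, p):
    # Schatten p-norm: the l_p norm of the singular values.
    return np.linalg.norm(np.linalg.svd(A, compute_uv=False), p)

def ky_fan_matrix(A, k):
    # Ky Fan k-norm: sum of the k largest singular values.
    return np.linalg.svd(A, compute_uv=False)[:k].sum()

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V, _ = np.linalg.qr(rng.standard_normal((5, 5)))
# Unitary invariance: the values agree for A and U A V.
print(np.isclose(schatten(A, 3), schatten(U @ A @ V, 3)),
      np.isclose(ky_fan_matrix(A, 2), ky_fan_matrix(U @ A @ V, 2)))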
Exercise 6.20 (Why unitarily invariant?). Provide an explanation as to why each of the
examples above is unitarily invariant.
As in the case for symmetric gauge functions, combinations of Schatten and Ky Fan
matrix norms are also unitarily invariant.
Proof. (1). Let Φ : ℝ𝑛 → ℝ be a symmetric gauge function, and let |||·||| Φ be defined as
in (1). Positive definiteness of |||·||| Φ follows from the facts that Φ is positive definite
and the zero matrix is the only matrix whose singular values are all zero. Being a
norm, Φ is positive homogeneous, and since multiplying a matrix by a real number 𝛼
scales its singular values by |𝛼 | , the function |||·||| Φ is also positive homogeneous.
It remains to show that |||·||| Φ satisfies the triangle inequality and is unitarily
invariant. Let 𝑨, 𝑩 ∈ 𝕄𝑛 be 𝑛 × 𝑛 real matrices. By Exercise 6.22, we know that
𝝈(𝑨 + 𝑩) ≺𝜔 𝝈(𝑨) + 𝝈(𝑩). Since the symmetric gauge function Φ is isotone,
Φ(𝝈 (𝑨 + 𝑩)) ≤ Φ(𝝈 (𝑨) + 𝝈 (𝑩)) ≤ Φ(𝝈 (𝑨)) + Φ(𝝈 (𝑩)).
We conclude that |||·||| Φ is a norm. Note that the first and second inequalities above
follow from monotonicity and convexity of Φ, respectively.
Finally, observe that multiplying any given matrix by a unitary matrix does not alter
its singular values. Indeed, for any 𝑨 ∈ 𝕄𝑛 , its norm |||𝑨 ||| Φ is completely determined
by its singular values. It follows that |||·||| Φ is unitarily invariant.
(2). Let |||·||| be a unitarily invariant norm, and let Φ be defined as in (2). It is
immediate that Φ inherits all the norm properties from |||·|||. To show that Φ is a
symmetric gauge function, we first let 𝑫 = diag(±1, . . . , ±1). The matrix 𝑫 is unitary, so we have that
Φ(𝑫𝒙) = |||diag(𝑫𝒙)||| = |||𝑫 diag(𝒙)||| = |||diag(𝒙)|||.
Similarly, since permutation matrices are unitary, we have that Φ(𝑷𝒙) = |||diag(𝑷𝒙)||| = |||𝑷 diag(𝒙) 𝑷∗||| = |||diag(𝒙)||| = Φ(𝒙) for each permutation matrix 𝑷.
Problem 6.22 (Singular values of the sum). Let 𝑨, 𝑩 ∈ 𝕄𝑛 (ℂ) . Show that the vectors of
singular values 𝝈(𝑨) and 𝝈(𝑩) satisfy the submajorization inequality
𝝈(𝑨 + 𝑩) ≺𝜔 𝝈(𝑨) + 𝝈(𝑩).
Next, we state the Fan dominance theorem for unitarily invariant norms, analo-
gous to Theorem 6.6. As in the vector case, this theorem emphasizes the importance
of Ky Fan norms in the theory of unitarily invariant norms.
Theorem 6.23 (Fan dominance: Matrix case). Fix 𝑨, 𝑩 ∈ 𝕄𝑛 . The following statements
are equivalent:
1. k𝑨 k (𝑘 ) ≤ k𝑩 k (𝑘 ) for each 𝑘 = 1, . . . , 𝑛 .
2. |||𝑨 ||| ≤ |||𝑩 ||| for every unitarily invariant norm.
Proof. The theorem follows from Theorem 6.6 and from the characterization of unitarily
invariant norms in terms of symmetric gauge functions.
Definition 6.24 The dual |||·|||∗ of a unitarily invariant norm |||·||| on 𝕄𝑛 is given by
|||𝑩|||∗ ≔ max{⟨𝑩, 𝑨⟩ : |||𝑨||| ≤ 1} for each 𝑩 ∈ 𝕄𝑛.
Recall that ⟨𝑨, 𝑩⟩ ≔ tr(𝑩∗𝑨) for 𝑨, 𝑩 ∈ 𝕄𝑛.
Exercise 6.25 (Involution). Let Φ : ℝ𝑛 → ℝ be a symmetric gauge function. Prove that
(Φ∗)∗ = Φ.
Exercise 6.26 (Dual unitarily invariant norms). Prove that |||·||| ∗ is unitarily invariant if and
only if |||·||| is unitarily invariant.
Exercise 6.27 (Dual norm inequality). For each unitarily invariant norm |||·|||, show that |⟨𝑨, 𝑩⟩| ≤ |||𝑨||| · |||𝑩|||∗ for all 𝑨, 𝑩 ∈ 𝕄𝑛.
Theorem 6.28 (Von Neumann duality). The dual |||·||| ∗Φ of the unitarily invariant norm
|||·||| Φ associated to a symmetric gauge function Φ is the unitarily invariant norm
|||·|||Φ∗ associated to the dual Φ∗ of the symmetric gauge function. That is, (|||·|||Φ)∗ = |||·|||Φ∗.
Proof. To prove this result, we use von Neumann’s trace theorem. Let 𝑨, 𝑩 ∈ 𝕄𝑛 .
Then
max{tr(𝑼∗𝑨𝑽𝑩) : 𝑼, 𝑽 ∈ 𝕄𝑛 are unitary} = Σ_{𝑖=1}^{𝑛} 𝜎𝑖(𝑨) 𝜎𝑖(𝑩).
‖𝑩‖∗_{𝑆𝑝} = ‖𝑩‖_{𝑆𝑞}.
Recall that the dual of the ℓ𝑝 norm is the ℓ𝑞 norm, where 𝑞 is the Hölder conjugate of 𝑝.
Therefore, the dual norm of the Schatten 𝑝-norm is the Schatten 𝑞-norm. Indeed, the
Schatten 𝑝- and 𝑞-norms coincide with the ℓ𝑝 and ℓ𝑞 norms of the vector of singular
values. In particular, the Schatten 2-norm is self-dual, while the Schatten norms ‖·‖_{𝑆1}
and ‖·‖_{𝑆∞} form a dual pair.
Example 6.31 (Ky Fan norms). For each 𝑘 = 1, . . . , 𝑛, the dual norm of the Ky Fan 𝑘-norm
is given by
‖𝑩‖∗_{(𝑘)} = max{ ‖𝑩‖_{(1)} , (1/𝑘) ‖𝑩‖_{(𝑛)} }.
Note that the Ky Fan 1-norm ‖·‖_{(1)} corresponds to the spectral norm, while ‖·‖_{(𝑛)}
corresponds to the trace norm.
Theorem 6.32 (Generalized Hölder inequality). Let |||·||| be unitarily invariant, and let
𝑝, 𝑞 > 1 be such that 1/𝑝 + 1/𝑞 = 1. Then
|||𝑨𝑩||| ≤ ||| |𝑨|^𝑝 |||^{1/𝑝} · ||| |𝑩|^𝑞 |||^{1/𝑞} for all 𝑨, 𝑩 ∈ 𝕄𝑛,
where |𝑴| ≔ (𝑴∗𝑴)^{1/2}.
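Aside: (Numerical check). Specializing |||·||| to the trace norm, Theorem 6.32 can be tested on random data; the sketch below is illustrative only, and mat_abs_pow is our helper for |𝑴|^𝑟.
import numpy as np

def mat_abs_pow(M, r):
    # |M|^r = (M* M)^{r/2}, computed from an eigendecomposition of M* M.
    w, V = np.linalg.eigh(M.conj().T @ M)
    w = np.clip(w, 0.0, None)
    return (V * w ** (r / 2)) @ V.conj().T

trace_norm = lambda M: np.linalg.svd(M, compute_uv=False).sum()

rng = np.random.default_rng(4)
n, p, q = 4, 3.0, 1.5                      # 1/p + 1/q = 1
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
lhs = trace_norm(A @ B)
rhs = trace_norm(mat_abs_pow(A, p)) ** (1 / p) * trace_norm(mat_abs_pow(B, q)) ** (1 / q)
print(lhs <= rhs + 1e-9)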
Exercise 6.33 (Singular values of the product). Let 𝑨, 𝑩 ∈ 𝕄𝑛 . Show that 𝝈 (𝑨𝑩) ≺𝜔
𝝈 (𝑨) 𝝈 (𝑩) .
In this lecture, we discussed two important and closely related families of norms,
symmetric gauge functions on ℝ𝑛 and unitarily invariant norms on the space of 𝑛 × 𝑛
real matrices 𝕄𝑛 . We established generalizations of Hölder-type inequalities for these
two families of norms by exploiting their defining properties and their connection to
each other. These results exemplify how considering families of norms with invariance
properties allows us to prove results for a broad class of norms which can then specialize
into results of interest when applied to particular examples.
Notes
This material is adapted from Bhatia’s book [Bha97, Chap. IV].
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
7. Matrix Inequalities via Complex Analysis
In this lecture, we develop a new approach for proving matrix inequalities based
on the theory of complex interpolation. As motivation, we begin with the duality
theorem for Schatten 𝑝-norms, and we recall some of the difficulties that arose in the
proof. We recast the duality theorem as a theorem about interpolation, which suggests
the possibility of deriving this result using general tools for studying interpolation
problems. We develop a powerful approach based on the Hadamard three-lines
theorem, a quantitative version of the maximum principle for analytic functions. As an
illustration of these ideas, we provide an alternative derivation of the duality theorem
for Schatten 𝑝-norms.
Agenda:
1. Motivation
2. Interpolation inequalities
3. Maximum modulus principle
4. The three-lines theorem
5. Example: Duality of Schatten norms
For powers 𝑝 ∈ [1, ∞), we can also write the Schatten 𝑝-norm as a trace:
‖𝑨‖𝑝 = (tr |𝑨|^𝑝)^{1/𝑝} for 𝑨 ∈ 𝕄𝑛(ℂ).
Recall that the power binds before the trace.
Theorem 7.2 (Duality relation for Schatten 𝑝 -norms). For all indices 𝑝, 𝑞 ∈ [ 1, ∞] with
1/𝑝 + 1/𝑞 = 1, the dual norm of the Schatten 𝑝 -norm is the Schatten 𝑞 -norm. The
duality is equivalent to a Hölder-type inequality.
Theorem 7.3 (A Schwarz-type inequality). Fix 𝑝 ≥ 1. For all 𝑩 ∈ 𝕄𝑛(ℂ) and all
𝑨𝑗 ∈ 𝕄𝑛(ℂ) for 𝑗 = 1, 2, . . . , 𝑚, we have
‖ Σ_{𝑗=1}^{𝑚} 𝑨𝑗∗ 𝑩 𝑨𝑗 ‖𝑝 ≤ ‖ Σ_{𝑗=1}^{𝑚} 𝑨𝑗∗ 𝑨𝑗 ‖_{2𝑝} · ‖𝑩‖_{2𝑝}.
Proof. We will establish (7.1) for positive-semidefinite matrices 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) ; Exer-
cise 7.6 asks you to derive the extension for general matrices.
First, fix 𝑝 ∈ ( 1, ∞) . We show that (7.2) implies (7.1) by a change of variables.
Consider the bijections 𝑨 ↦→ 𝑨 𝑝 and 𝑩 ↦→ 𝑩 𝑞 where 𝜃 = 1/𝑝 and 1 − 𝜃 = 1/𝑞 . We
obtain the inequality
k𝑨𝑩 k 1 ≤ k𝑨 k𝑝 · k𝑩 k 𝑞 for all psd 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) .
Exercise 7.5 asks you to check that | tr (𝑨𝑩)| ≤ k𝑨𝑩 k 1 , which yields the inequality (7.1).
The boundary cases 𝑝 = 1 and 𝑝 = ∞ follow from the last display when we take
limits.
Exercise 7.5 (The trace norm). Prove that the trace is bounded by the trace norm: |tr(𝑨)| ≤ ‖𝑨‖1 for all 𝑨 ∈ 𝕄𝑛(ℂ).
Exercise 7.6 (Schatten duality: General case). For the general case when 𝑨, 𝑩 ∈ 𝕄𝑛 (ℂ) ,
derive inequality (7.1) from inequality (7.2). Hint: Introduce polar factorizations.
In view of this discussion, we may focus on proving the interpolation inequality
(7.2). A surprising and powerful approach to this problem is to allow the interpolation
parameter 𝜃 to take on complex values. This insight leads us to consider a complex
interpolation inequality, which we can establish using the miracle of complex analysis.
The key tool in this argument is a quantitative extension of the maximum modulus
principle, called the Hadamard three-lines theorem. The rest of this lecture develops
the required background and establishes the interpolation inequality (7.2).
D𝑟 (𝑎) B {𝑧 ∈ ℂ : |𝑧 − 𝑎 | ≤ 𝑟 } ⊂ Ω.
We can show that the Taylor expansion about each point 𝑎 ∈ Ω is uniquely
determined, and it converges absolutely on the largest (open) disc D𝑟 (𝑎) that is
contained in Ω. Moreover, when D𝑟 (𝑎) ⊆ D𝑅 (𝑎) ⊂ Ω, the coefficients in the
expansion coincide.
From this definition, we can easily confirm that analytic functions are continuous
inside the domain Ω. In fact, a complex function is analytic on a domain if and only if
it is differentiable within the domain (also called holomorphic).
Example 7.9 (Exponentials). For any complex number 𝑐 ∈ ℂ, the function 𝑧 ↦→ e𝑐 𝑧 is
analytic on the complex plane ℂ. Moreover, the familiar Taylor series expansion for
the exponential converges in the whole complex plane ℂ.
Proof. The proof of this proposition is straightforward. We can just integrate the
power series expansion of 𝑓 at 𝑎 ∈ ℂ term by term. The constant order (𝑘 = 0) term
contributes 𝑓 (𝑎) ; the higher monomial terms vanish.
We can apply the mean value formula to prove a special case of the maximum
modulus principle: the version for a disc.
Proposition 7.11 (Maximum modulus principle: Disc). Let 𝑓 : Ω → ℂ be analytic, and
consider a closed disc contained in the domain: D𝑟 (𝑎) ⊂ Ω. If the function 𝑧 ↦→ |𝑓 (𝑧)|
achieves its maximum over D𝑟 (𝑎) at the point 𝑎 , then the function 𝑓 must be constant
on the disc D𝑟 (𝑎) . That is, 𝑓 (𝑧) = 𝑓 (𝑎) for each 𝑧 ∈ D𝑟 (𝑎) .
Proof. Fix the center 𝑎 ∈ Ω and the radius 𝑟 > 0 of the disc. By the triangle inequality
applied to the mean value formula (7.3), we compute
|𝑓(𝑎)| ≤ (1/2𝜋) ∫₀^{2𝜋} |𝑓(𝑎 + 𝑟e^{i𝜃})| d𝜃 ≤ (1/2𝜋) ∫₀^{2𝜋} |𝑓(𝑎)| d𝜃 = |𝑓(𝑎)|.
Therefore, both inequalities hold with equality. Since |𝑓 | is continuous, the value
𝑓 (𝑎 + 𝑟 ei𝜃 ) has constant phase for all 𝜃 ∈ [0, 2𝜋) . Furthermore, the magnitude
|𝑓 (𝑎 + 𝑟 ei𝜃 )| = |𝑓 (𝑎)| for all 𝜃 . In other words, 𝜃 ↦→ 𝑓 (𝑎 + 𝑟 ei𝜃 ) is a constant
function. By the mean value formula, the constant value of 𝑓 (𝑎 + 𝑟 ei𝜃 ) equals 𝑓 (𝑎) .
Finally, we apply the same argument for each 𝑟0 < 𝑟 and its associated disc D𝑟0 (𝑎) .
We conclude that 𝑓 (𝑧) = 𝑓 (𝑎) for each 𝑧 ∈ D𝑟 (𝑎) , because it is on the boundary of
some disc D𝑟0 (𝑎) for 𝑟0 = |𝑧 − 𝑎 | ≤ 𝑟 .
With the maximum modulus principle for the disc at hand, we are in position to
prove the general maximum modulus principle on bounded domains.
Aside: (Maximum modulus principle). In fact, a stronger result is valid. Assume that
𝑓 : Ω → ℂ is analytic on a bounded domain Ω and continuous on Ω. If the
function |𝑓 | attains maximum at any point inside the domain Ω, then the function
𝑓 is constant on the domain.
Proof. We will only prove the interpolation inequality (7.4); the general log-convexity
statement follows by a scaling argument. Without loss of generality, we can also
assume that 𝑀 ( 0) and 𝑀 ( 1) are strictly positive; otherwise we can add a small
positive constant 𝛿 to 𝑓 and take the limit 𝛿 ↓ 0.
Consider the auxiliary function
|𝐹(𝜃 + i𝑡)| = |𝑓(𝜃 + i𝑡)| · 𝑀(0)^{𝜃−1} 𝑀(1)^{−𝜃}. (7.5)
Exercise 7.15 (Three-lines: Weaker regularity condition). From the proof of Theorem 7.13, we
may infer that the regularity of 𝑓 plays an important role. The boundedness condition
can be relaxed, however, as long as we have the property that 𝑓(𝑧) = 𝑜(e^{𝑐|𝑧|²}) for
any positive constant 𝑐. That is, 𝑓 should increase more slowly than any quadratic
exponential function. Adapt the proof to address this scenario.
The interpolation inequality in (7.2) resembles the conclusion (7.4) of the three-lines
theorem. This parallel suggests an argument based on complex analysis. As a first
step, we explain what it means to apply a complex power to a positive matrix.
The first step in the argument is to establish the duality between the trace norm
(Schatten-1) and the operator norm (Schatten-∞).
Exercise 7.17 (Trace norm and operator norm). By an independent argument, prove that |tr(𝑨𝑩)| ≤ ‖𝑨‖1 · ‖𝑩‖∞ for all 𝑨, 𝑩 ∈ 𝕄𝑛(ℂ).
Hint: There are many possible arguments. The easiest approach is to introduce an SVD
of 𝑨 , cycle the trace, and use the variational definition of the largest singular value.
Proof of Proposition 7.4. Let 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) be positive-semidefinite matrices, and let
𝜃 ∈ ( 0, 1) . We intend to bound the quantity
tr(𝑨^𝜃 𝑩^{1−𝜃} 𝑪) where ‖𝑪‖∞ = 1.
To do so, consider the function 𝑓(𝑧) ≔ tr(𝑨^𝑧 𝑩^{1−𝑧} 𝑪) on the strip 0 ≤ Re 𝑧 ≤ 1, and define
𝑀(𝜃) ≔ sup_{𝑡∈ℝ} |𝑓(𝜃 + i𝑡)| for 0 ≤ 𝜃 ≤ 1.
We need to check that 𝑓 is bounded before we invoke the theorem. In fact, each term
of the linear combination in the expression for 𝑓 (𝑧) is bounded since
Therefore, 𝑓 is indeed bounded. We can apply Theorem 7.13 to obtain the inequality
We can easily bound the terms 𝑀 ( 0) and 𝑀 ( 1) . Using Exercise 7.17 and Exercise 7.5,
we see that
|𝑓(i𝑡)| ≤ ‖𝑨^{i𝑡} 𝑩^{1−i𝑡}‖1 · ‖𝑪‖∞ = ‖𝑨^{i𝑡} 𝑩 𝑩^{−i𝑡}‖1 = ‖𝑩‖1.
Indeed, 𝑨^{i𝑡} and 𝑩^{−i𝑡} are unitary matrices and the Schatten norms are unitarily invariant.
We recognize that 𝑀(0) = sup_{𝑡∈ℝ} |𝑓(i𝑡)| ≤ ‖𝑩‖1. Similarly, we have 𝑀(1) ≤ ‖𝑨‖1.
Finally, we can introduce these bounds into (7.7):
The first relation follows from Exercise 7.17. This is the required result.
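Aside: (Numerical check). The unitarity of the imaginary powers 𝑨^{i𝑡}, which drives the bound on 𝑀(0), is easy to confirm numerically; a minimal sketch (herm_pow is our illustrative helper):
import numpy as np

def herm_pow(A, z):
    # A^z for a positive-definite Hermitian A and complex z, via eigh.
    w, V = np.linalg.eigh(A)
    return (V * w.astype(complex) ** z) @ V.conj().T

rng = np.random.default_rng(5)
n, t = 4, 0.7
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M @ M.conj().T + np.eye(n)             # positive definite
Ait = herm_pow(A, 1j * t)
# A^{it} is unitary, so it preserves every Schatten norm.
print(np.allclose(Ait @ Ait.conj().T, np.eye(n)))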
Notes
The idea of using complex interpolation to prove convexity inequalities is due to
Thorin. Littlewood described this approach as “the most impudent in mathematics,
and brilliantly successful” [Gar07, qtd. p. 135]. The application to proving matrix
inequalities is familiar to operator theorists, but it does not appear in the standard
books on matrix analysis. The presentation here is due to the instructor.
We have given an independent proof of the maximum modulus principle, adapted
to our purpose. The proof of the Hadamard three-lines theorem is also standard; for
example, see [Gar07, Prop. 9.1.1]. You will find more information about maximum
modulus principles on bounded and unbounded domains in any complex analysis text,
such as Ahlfors [Ahl66].
The proof of duality for the Schatten norms is intended as an illustration of these
ideas. Theorem 7.3 is due to Lust-Piquard, and the proof via complex interpolation
is reported by Pisier & Xu [PX97, Lem. 1.1]. There is a beautiful application of
complex interpolation to proving multivariate Golden–Thompson inequalities [SBT17].
The instructor has also used complex interpolation to develop new types of matrix
concentration inequalities [Tro18].
Lecture bibliography
[Ahl66] L. V. Ahlfors. Complex analysis: An introduction of the theory of analytic functions of
one complex variable. Second. McGraw-Hill Book Co., New York-Toronto-London,
1966.
[Gar07] D. J. H. Garling. Inequalities: a journey into linear analysis. Cambridge University
Press, Cambridge, 2007. doi: 10.1017/CBO9780511755217.
[PX97] G. Pisier and Q. Xu. “Non-commutative martingale inequalities”. In: Comm. Math.
Phys. 189.3 (1997), pages 667–698. doi: 10.1007/s002200050224.
[Ste56] E. M. Stein. “Interpolation of linear operators”. In: Transactions of the American
Mathematical Society 83.2 (1956), pages 482–492.
[SBT17] D. Sutter, M. Berta, and M. Tomamichel. “Multivariate trace inequalities”. In: Comm.
Math. Phys. 352.1 (2017), pages 37–58. doi: 10.1007/s00220-016-2778-5.
[Tro18] J. A. Tropp. “Second-order matrix concentration inequalities”. In: Appl. Comput.
Harmon. Anal. 44.3 (2018), pages 700–736. doi: 10.1016/j.acha.2016.07.005.
8. Uniform Smoothness and Convexity
In the previous two lectures, we studied a class of unitarily invariant norms known as
the Schatten 𝑝-norms. Recall that the Schatten 𝑝-norm (1 ≤ 𝑝 < ∞) is
‖𝑨‖𝑝 ≔ ( Σ_{𝑖=1}^{𝑛} 𝜎𝑖(𝑨)^𝑝 )^{1/𝑝} for 𝑨 ∈ 𝕄𝑛(𝔽).
Today, we will study the geometry of the space of matrices equipped with the Schatten
𝑝-norms. In particular, we shall answer the question “How smooth is the unit ball
of the Schatten 𝑝-norm?” In answering this question, we will develop a powerful
uniform smoothness inequality for Schatten 𝑝-norms, which has important applications
in random matrix theory and other areas.
Agenda:
1. Convexity and smoothness
2. Uniform smoothness for Schatten norms
3. Proof: Scalar case
4. Proof: Matrix case
Theorem 8.1 (Lindenstrauss). Let X be a normed linear space and X∗ its dual. Then
Informally, this theorem states that if X is very convex, then its dual X∗ is very
smooth. The converse is valid as well.
Theorem 8.2 (Tomczak-Jaegermann 1974; Ball–Carlen–Lieb 1994). For 𝑿, 𝒀 ∈ 𝕄𝑛(𝔽)
and 𝑝 ≥ 2, we have
( (1/2) ‖𝑿 + 𝒀‖𝑝^𝑝 + (1/2) ‖𝑿 − 𝒀‖𝑝^𝑝 )^{2/𝑝} ≤ ‖𝑿‖𝑝² + (𝑝 − 1) ‖𝒀‖𝑝². (8.3)
Theorem 8.2, and indeed all of the results in this lecture, also hold for rectangular
matrices 𝑿, 𝒀 ∈ 𝕄_{𝑚×𝑛}(𝔽).
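Aside: (Numerical check). Inequality (8.3) can be tested directly on random matrices; a minimal numpy sketch (illustrative only):
import numpy as np

rng = np.random.default_rng(6)
n, p = 4, 6.0
sp = lambda M: np.linalg.norm(np.linalg.svd(M, compute_uv=False), p)  # Schatten p-norm
X, Y = rng.standard_normal((n, n)), rng.standard_normal((n, n))
lhs = ((sp(X + Y) ** p + sp(X - Y) ** p) / 2) ** (2 / p)
rhs = sp(X) ** 2 + (p - 1) * sp(Y) ** 2
print(lhs <= rhs + 1e-9)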
This result was first established by Tomczak-Jaegermann in the more general setting
of trace-class operators on Hilbert spaces [Tom74]; she obtained the optimal constant
𝑝 − 1 for all even integers 𝑝 . Ball, Carlen, and Lieb [BCL94] obtained the optimal
constant of 𝑝 − 1 for all 𝑝 ≥ 2. We will present an alternative approach to the theorem,
developed in 2021 by the instructor (unpublished).
Several remarks are in order:
• By considering diagonal matrices, we see that the same inequality holds for
vectors 𝒙, 𝒚 ∈ ℓ𝑝^𝑛 with 𝑝 ≥ 2:
( (1/2) ‖𝒙 + 𝒚‖𝑝^𝑝 + (1/2) ‖𝒙 − 𝒚‖𝑝^𝑝 )^{2/𝑝} ≤ ‖𝒙‖𝑝² + (𝑝 − 1) ‖𝒚‖𝑝².
In fact, all of the other results in this lecture have parallels for ℓ𝑝^𝑛 spaces and for
L𝑝 spaces.
• By Lyapunov's inequality, the left-hand side of (8.3) dominates the 2-average, so
(1/2) ‖𝑿 + 𝒀‖𝑝² + (1/2) ‖𝑿 − 𝒀‖𝑝² ≤ ‖𝑿‖𝑝² + (𝑝 − 1) ‖𝒀‖𝑝². (8.4)
Compare this with the parallelogram law
(1/2) ‖𝑿 + 𝒀‖2² + (1/2) ‖𝑿 − 𝒀‖2² = ‖𝑿‖2² + ‖𝒀‖2²,
which holds for vectors 𝑿 and 𝒀 in a Hilbert space with norm ‖·‖. See Figure 8.5
for an illustration.
• For 𝑝 ≈ log 𝑛, we have ‖·‖𝑝 ≈ ‖·‖∞. See Exercise 8.5 for a quantitative
statement. Plugging this relation into (8.4),
(1/2) ‖𝑿 + 𝒀‖∞² + (1/2) ‖𝑿 − 𝒀‖∞² ≲ ‖𝑿‖∞² + log 𝑛 · ‖𝒀‖∞².
The next set of exercises supports these observations.
Figure 8.5 Geometry of the parallelogram law.
𝑎 − 1 ≤ (1/2) (|𝑎|² − 1) for 𝑎 ∈ ℝ.
Applying this inequality for 𝑎 = ‖𝑿 + 𝜏𝒀‖𝑝 and 𝑎 = ‖𝑿 − 𝜏𝒀‖𝑝 gives
𝜚X(𝜏) ≤ (1/2) (‖𝑿 + 𝜏𝒀‖𝑝 − 1) + (1/2) (‖𝑿 − 𝜏𝒀‖𝑝 − 1)
 ≤ (1/2) [ (1/2) ‖𝑿 + 𝜏𝒀‖𝑝² + (1/2) ‖𝑿 − 𝜏𝒀‖𝑝² − 1 ].
Invoking the uniform smoothness inequality (8.4) with 𝒀 replaced by 𝜏𝒀 and using ‖𝑿‖𝑝 = ‖𝒀‖𝑝 = 1, we obtain
𝜚X(𝜏) ≤ (1/2) [ ‖𝑿‖𝑝² + (𝑝 − 1) 𝜏² ‖𝒀‖𝑝² − 1 ] = (1/2) (𝑝 − 1) 𝜏².
This is the promised conclusion. The optimality for small 𝜏 can be proven using a
second-order expansion for small 𝜏 , as in Exercise 8.3.
Qualitatively, this result shows that the unit ball of the Schatten 𝑝 -norm becomes
less smooth as 𝑝 increases. By analogy to the ℓ𝑝 spaces, this result is unsurprising
as the ℓ2 unit ball is round and smooth, whereas the ℓ∞ unit ball has rough corners.
Corollary 8.6 confirms that this intuition carries over to Schatten 𝑝 -norms. See
Figure 8.6 for a schematic drawing.
Exercise 8.7 (Modulus of convexity). Use Theorem 8.1 and Corollary 8.6 to bound the
modulus of convexity for the Schatten 𝑝-norm (1 ≤ 𝑝 ≤ 2).
Figure 8.6 The unit ball of the Schatten 𝑝-norm becomes rougher as 𝑝 increases.
This calculation carries over immediately to Hilbert spaces. For instance, for an
independent family (𝑿 1 , . . . , 𝑿 𝑁 ) of centered random variables in the space of matrices
𝕄𝑛 (𝔽) equipped with the Schatten 2-norm,
𝔼 ‖ Σ_{𝑖=1}^{𝑁} 𝑿𝑖 ‖2² = Σ_{𝑖=1}^{𝑁} 𝔼 ‖𝑿𝑖‖2². (8.7)
This statement can be verified using the bilinearity of the trace inner product h𝑿 , 𝒀 i :=
tr (𝑿 ∗𝒀 ) .
Unfortunately, Schatten 2-norm bounds are often uninformative for computational
applications of random matrices [Tro15, pp. 122–123]. With the help of the uniform
smoothness bound (8.4), we can obtain a Schatten 𝑝 -norm inequality version of (8.7).
By taking 𝑝 ≈ log 𝑛 , the result of Exercise 8.5 yields a Schatten ∞-norm bound (i.e., a
bound in the spectral norm).
To begin, observe that the uniform smoothness bound (8.4) can be interpreted
in probabilistic language. Introduce a random variable 𝜀 ∼ uniform{±1}. (A random
variable 𝜀 ∼ uniform{±1} is referred to as a Rademacher random variable.) The
conclusion of (8.4) can be reformulated as an expectation:
Proof. We compute
(1/2) 𝔼 ‖𝑿 + 𝒁‖𝑝² + (1/2) ‖𝑿‖𝑝² ≤ (1/2) 𝔼 ‖𝑿 + 𝒁‖𝑝² + (1/2) 𝔼 ‖𝑿 − 𝒁‖𝑝²
 ≤ ‖𝑿‖𝑝² + (𝑝 − 1) · 𝔼 ‖𝒁‖𝑝².
The first inequality is Jensen’s inequality, and the second is uniform smoothness (8.4).
Rearranging gives the advertised result with a suboptimal constant 2 (𝑝 − 1) .
Applying this result iteratively yields inequalities for the expected Schatten norm
of a sum of independent random matrices.
Corollary 8.9 (Sums of independent random matrices). Consider an independent family
(𝑿1, . . . , 𝑿𝑁) of centered random matrices in 𝕄𝑛(𝔽). For each 𝑝 ≥ 2,
𝔼 ‖ Σ_{𝑖=1}^{𝑁} 𝑿𝑖 ‖𝑝² ≤ (𝑝 − 1) Σ_{𝑖=1}^{𝑁} 𝔼 ‖𝑿𝑖‖𝑝². (8.8)
Proof sketch. The first conclusion follows from Proposition 8.8 applied iteratively to
each 𝑿𝑘, conditional on the realization of (𝑿1, . . . , 𝑿𝑘−1). For the second conclusion,
we choose 𝑝 = 2 log 𝑛 to obtain
( 𝔼 ‖ Σ_{𝑖=1}^{𝑁} 𝑿𝑖 ‖∞ )² ≤ 𝔼 ‖ Σ_{𝑖=1}^{𝑁} 𝑿𝑖 ‖∞² ≤ 𝔼 ‖ Σ_{𝑖=1}^{𝑁} 𝑿𝑖 ‖𝑝²
 ≤ (𝑝 − 1) Σ_{𝑖=1}^{𝑁} 𝔼 ‖𝑿𝑖‖𝑝² ≤ 2e log 𝑛 · Σ_{𝑖=1}^{𝑁} 𝔼 ‖𝑿𝑖‖∞².
The first inequality is Jensen, the second and fourth are the norm comparison (8.6),
and the third is (8.8). Taking square roots gives the desired conclusion.
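Aside: (Numerical check). A Monte Carlo experiment illustrates (8.8) for a Rademacher series; the sketch below is illustrative only, and the left-hand side is merely a sampling estimate of the expectation.
import numpy as np

rng = np.random.default_rng(7)
n, N, p, trials = 6, 10, 4.0, 500
sp = lambda M: np.linalg.norm(np.linalg.svd(M, compute_uv=False), p)
A = [rng.standard_normal((n, n)) for _ in range(N)]
# Monte Carlo estimate of E || sum_i eps_i A_i ||_p^2 with Rademacher signs.
lhs = np.mean([sp(sum(e * Ai for e, Ai in zip(rng.choice([-1, 1], size=N), A))) ** 2
               for _ in range(trials)])
rhs = (p - 1) * sum(sp(Ai) ** 2 for Ai in A)
print(lhs <= rhs)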
Uniform smoothness can be used to derive matrix concentration inequalities for
other random matrix models, such as products of independent random matrices in
[Hua+21]. See [Tro15] for a survey of exponential matrix concentration
√︁ inequalities,
based on more sophisticated tools from matrix analysis. The log 𝑛 prefactor in (8.9)
is known to be necessary [Tro15, pp. 114–115], but it can sometimes be removed with
still more difficult tools [BBv21].
Aside: (Uniform smoothness and quantum computation). As another application, recent
work has used uniform smoothness to obtain improved error bounds for simulating the time-
evolution of quantum systems using Trotter formulas, a workhorse algorithm in
quantum computation [CB21].
(𝔼 |𝑥 + 𝜀𝑦|^𝑝)^{2/𝑝} ≤ |𝑥|² + (𝑝 − 1) · |𝑦|². (8.10)
Proof of Theorem 8.2 (scalar case). Assume that 𝑝 is an even natural number. Define
the interpolant
𝑢(𝑡) ≔ 𝔼 |𝑥 + 𝜀√𝑡 𝑦|^𝑝 for 𝑡 ∈ [0, 1]. (8.11)
Observe that
𝑢(0) = |𝑥|^𝑝 and 𝑢(1) = 𝔼 |𝑥 + 𝜀𝑦|^𝑝.
The value 𝑢(1) at the endpoint is the left-hand side of Gross's inequality (8.10), raised
to the 𝑝/2 power.
Why define the interpolant 𝑢 as (8.11) with the square-root dependence √𝑡 instead of
the more natural-seeming linear interpolant 𝑢(𝑡) ≔ 𝔼 |𝑥 + 𝜀𝑡𝑦|^𝑝? In fact, the linear
choice also works, but it gives a suboptimal constant of 2(𝑝 − 1) in place of 𝑝 − 1.
With the help of the mean-value inequality (Lemma 8.10) for the function 𝜑 : 𝑎 ↦ |𝑎|^{𝑝−1},
we compute and bound the derivative of the interpolant 𝑢:
𝑢̇(𝑡) = 𝑝 · (1/(2√𝑡)) · 𝔼 [(𝜀𝑦)(𝑥 + 𝜀𝑦√𝑡)^{𝑝−1}]
 = 𝑝 · (1/(4𝑡)) · (2𝑦√𝑡) · [ (1/2)(𝑥 + 𝑦√𝑡)^{𝑝−1} − (1/2)(𝑥 − 𝑦√𝑡)^{𝑝−1} ]
 ≤ (1/2) 𝑝 (𝑝 − 1) 𝑦² · [ (1/2)(𝑥 + 𝑦√𝑡)^{𝑝−2} + (1/2)(𝑥 − 𝑦√𝑡)^{𝑝−2} ]
 = (1/2) 𝑝 (𝑝 − 1) 𝑦² · 𝔼 (𝑥 + 𝜀𝑦√𝑡)^{𝑝−2}
 ≤ (1/2) 𝑝 (𝑝 − 1) 𝑦² · [ 𝔼 (𝑥 + 𝜀𝑦√𝑡)^𝑝 ]^{1−2/𝑝}
 = (1/2) 𝑝 (𝑝 − 1) 𝑦² · 𝑢(𝑡)^{1−2/𝑝}.
The first inequality is the mean-value inequality Lemma 8.10, and the second inequality
is Lyapunov’s inequality. We have obtained a differential inequality for the function 𝑢 .
To solve this inequality, define the function 𝑣 (𝑡 ) = 𝑢 (𝑡 ) 2/𝑝 for 𝑡 ∈ [ 0, 1] . Then
𝑣 ( 0) = |𝑥 | 2 , while 𝑣 ( 1) is the left-hand side of Gross’ inequality (8.10). Finally, we
use the fundamental theorem of calculus to compute
(𝔼 |𝑥 + 𝜀𝑦|^𝑝)^{2/𝑝} − |𝑥|² = 𝑣(1) − 𝑣(0) = ∫₀¹ 𝑣̇(𝑡) d𝑡
 = ∫₀¹ (2/𝑝) 𝑢(𝑡)^{2/𝑝−1} · 𝑢̇(𝑡) d𝑡 (8.12)
 ≤ ∫₀¹ (𝑝 − 1) 𝑦² d𝑡 = (𝑝 − 1) 𝑦².
The inequality is transferred from the last display. Rearranging gives the Gross
two-point inequality (8.10).
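Aside: (Numerical check). Gross's two-point inequality (8.10) is easy to test pointwise; a minimal numpy sketch (illustrative only):
import numpy as np

rng = np.random.default_rng(8)
p = 3.5
x, y = rng.standard_normal(1000), rng.standard_normal(1000)
lhs = ((np.abs(x + y) ** p + np.abs(x - y) ** p) / 2) ** (2 / p)
rhs = x ** 2 + (p - 1) * y ** 2
print(np.all(lhs <= rhs + 1e-12))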
Exercise 8.11 (Gross: 𝑝 ≥ 3). Modify the proof of Gross’ two-point inequality (8.10) to
handle the case 𝑝 ≥ 3. Hint: This is essentially the same proof; the only change is that
signum functions arise from the derivative.
Problem 8.12 (Gross: 𝑝 ∈ ( 2, 3) ). Prove Gross’ two-point inequality (8.10) for 𝑝 ∈ ( 2, 3) .
Hint: The derivative 𝜑 0 is now concave. Modify Lemma 8.10, bounding the integral by
the midpoint rule.
8.5.1 Setup
The first step is a standard tool in matrix analysis; we transform possibly non-Hermitian
matrices 𝑿, 𝒀 to Hermitian matrices by passing to the Hermitian dilation:
𝑴 ∈ 𝕄𝑛(𝔽) maps to H(𝑴) ≔ [ 0 𝑴 ; 𝑴∗ 0 ] ∈ ℍ_{2𝑛}(𝔽),
a 2 × 2 block matrix with zero diagonal blocks.
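Aside: (Numerical illustration). The dilation is easy to form in numpy, and its spectrum consists of plus and minus the singular values of 𝑴, which is why it converts singular-value questions into eigenvalue questions. A minimal sketch (the helper name dilation is ours):
import numpy as np

def dilation(M):
    # Hermitian dilation H(M) = [[0, M], [M*, 0]].
    n = M.shape[0]
    Z = np.zeros((n, n), dtype=complex)
    return np.block([[Z, M], [M.conj().T, Z]])

rng = np.random.default_rng(9)
M = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
eig = np.sort(np.linalg.eigvalsh(dilation(M)))
sv = np.sort(np.linalg.svd(M, compute_uv=False))
# The spectrum of H(M) is {+/- sigma_i(M)}, so Schatten norms of M can be
# read off from the eigenvalues of the dilation.
print(np.allclose(eig[-3:], sv))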
As the following exercise shows, this allows us to assume without loss of generality
that 𝑿 and 𝒀 are Hermitian.
Exercise 8.13 (Hermitian dilation). Let 𝑨, 𝑩 ∈ 𝕄𝑛 (𝔽) . Prove the following properties of
the Hermitian dilation:
1. The mapping H : 𝕄𝑛 (𝔽) → ℍ2𝑛 (𝔽) is a real-linear function. That is, for
𝑨, 𝑩 ∈ 𝕄𝑛(𝔽) and 𝛼, 𝛽 ∈ ℝ, we have H(𝛼𝑨 + 𝛽𝑩) = 𝛼 H(𝑨) + 𝛽 H(𝑩).
Then
𝜑(𝑨) ≔ Σ_𝑖 𝜑(𝜆𝑖) 𝑷𝑖 ∈ ℍ𝑛(𝔽).
Then calculate
Σ_𝑘 tr [𝑓𝑘(𝑨) 𝑔𝑘(𝑩)] = Σ_𝑘 tr [ ( Σ_𝑖 𝑓𝑘(𝜆𝑖) 𝑷𝑖 ) ( Σ_𝑗 𝑔𝑘(𝜇𝑗) 𝑸𝑗 ) ]
 = Σ_{𝑖,𝑗} [ Σ_𝑘 𝑓𝑘(𝜆𝑖) 𝑔𝑘(𝜇𝑗) ] · tr(𝑷𝑖 𝑸𝑗).
Each summand is nonnegative because the bracketed quantity is nonnegative by hypothesis
and tr(𝑷𝑖𝑸𝑗) ≥ 0 (Exercise 8.18), as claimed.
Exercise 8.18 (Trace of psd product). Prove that tr (𝑨𝑩) ≥ 0 for positive semidefinite
matrices 𝑨 and 𝑩 . Hint: Write 𝑩 = 𝑩 1/2 𝑩 1/2 and cycle the trace.
As a corollary, we obtain a trace version of the mean-value inequality, Lemma 8.10.
Corollary 8.19 (Mean-value trace inequality). Let 𝜑 : ℝ → ℝ be such that 𝜓 ≔ 𝜑′ is
convex. Then for 𝑨, 𝑩 ∈ ℍ𝑛(𝔽),
tr [(𝑩 − 𝑨)(𝜑(𝑩) − 𝜑(𝑨))] ≤ (1/2) tr [ (𝑩 − 𝑨)² (𝜓(𝑨) + 𝜓(𝑩)) ].
Exercise 8.20 (Mean-value trace inequality). Prove Corollary 8.19 using the mean-value
inequality (Lemma 8.10) and the generalized Klein inequality (Proposition 8.17).
Exercise 8.21 (Derivative of a power). Derive the result in Proposition 8.15 using the
generalized Klein inequality (Proposition 8.17). Extend this argument to compute the
derivative of tr |·|𝑝 for all 𝑝 > 0.
Proof of Theorem 8.2. By passing to the Hermitian dilation if necessary, we can assume
that 𝑿 and 𝒀 are Hermitian. We again restrict attention to 𝑝 an even integer. We leave
the proof for general 𝑝 > 2 as an exercise.
Define the interpolant
𝑢(𝑡) ≔ 𝔼 tr (𝑿 + 𝜀𝒀√𝑡)^𝑝 for 𝑡 ∈ [0, 1].
We now compute and bound the derivative of 𝑢 using our computation for the
derivative of the trace power (Proposition 8.15) and the mean-value trace inequality
(Corollary 8.19):
𝑢̇(𝑡) = 𝑝 · (1/(2√𝑡)) · 𝔼 tr [(𝜀𝒀)(𝑿 + 𝜀𝒀√𝑡)^{𝑝−1}]
 = 𝑝 · (1/(4𝑡)) · tr [ (2√𝑡 𝒀) ( (1/2)(𝑿 + 𝒀√𝑡)^{𝑝−1} − (1/2)(𝑿 − 𝒀√𝑡)^{𝑝−1} ) ]
 ≤ (1/2) 𝑝 · tr [ 𝒀² (𝑝 − 1) ( (1/2)(𝑿 + 𝒀√𝑡)^{𝑝−2} + (1/2)(𝑿 − 𝒀√𝑡)^{𝑝−2} ) ]
 = (1/2) 𝑝 (𝑝 − 1) · 𝔼 tr [ 𝒀² (𝑿 + 𝜀𝒀√𝑡)^{𝑝−2} ]
 ≤ (1/2) 𝑝 (𝑝 − 1) (tr 𝒀^𝑝)^{2/𝑝} · 𝔼 [ tr (𝑿 + 𝜀𝒀√𝑡)^𝑝 ]^{1−2/𝑝}
 ≤ (1/2) 𝑝 (𝑝 − 1) (tr 𝒀^𝑝)^{2/𝑝} · [ 𝔼 tr (𝑿 + 𝜀𝒀√𝑡)^𝑝 ]^{1−2/𝑝}
 = (1/2) 𝑝 (𝑝 − 1) ‖𝒀‖𝑝² · 𝑢(𝑡)^{1−2/𝑝}.
By convention, the trace binds after nonlinear operations. For example, tr 𝒀^𝑝 = tr(𝒀^𝑝).
The first equality is Proposition 8.15. The three inequalities are Corollary 8.19, the
Hölder inequality for Schatten norms, and Lyapunov’s inequality. The remainder of
the proof follows from the same calculation (8.12) as the scalar case.
Problems
Problem 8.22 (Matrix Khintchine inequality). The uniform smoothness inequalities in this
lecture can be substantially improved using the same method. Consider a self-adjoint
random matrix of the form
𝑿 = 𝑩 + Σ_{𝑖=1}^{𝑛} 𝜀𝑖 𝑨𝑖 where the 𝜀𝑖 are i.i.d. Rademacher.
2. For even 𝑝 , show that the constant can be improved to [(𝑝 − 1) !!] 1/𝑝 .
Notes
Leonard Gross, for whom the Gross two-point inequality (8.10) is named, is a math-
ematician and mathematical physicist who did foundational work on quantum field
theory and statistical physics. He is among the early developers of logarithmic Sobolev
inequalities, which can be used to prove concentration inequalities for nonlinear
functions of independent random variables [van14, §3].
Interpolation is a very effective tool. Another relatively accessible application uses
interpolation to prove comparison inequalities for Gaussian processes, which can be
used to prove sharp bounds on the expected spectral norm of a standard Gaussian
random matrix [Ver18, §§7.2–7.3]. An advanced application of these ideas can be
used to develop matrix concentration inequalities which sometimes circumvent the
logarithmic factor we obtained in (8.9); see [BBv21].
The Hermitian dilation has an old history in matrix analysis and operator theory.
Jordan first used the Hermitian dilation in 1874 in his discovery of the singular value
decomposition. (The singular value decomposition was also discovered independently
the year prior by Beltrami.) The Hermitian dilation was popularized by Wielandt
and is often referred to as the Jordan–Wielandt matrix for this reason. See [SS90,
pp. 34–35] for a discussion of this history.
Lecture bibliography
[BCL94] K. Ball, E. A. Carlen, and E. H. Lieb. “Sharp Uniform Convexity and Smoothness
Inequalities for Trace Norms”. In: Invent Math 115.1 (Dec. 1994), pages 463–482.
doi: 10.1007/BF01231769.
[BBv21] A. S. Bandeira, M. T. Boedihardjo, and R. van Handel. “Matrix Concentration
Inequalities and Free Probability”. In: arXiv:2108.06312 [math] (Aug. 2021). arXiv:
2108.06312 [math].
[CB21] C.-F. Chen and F. G. S. L. Brandão. “Concentration for Trotter Error”. Available
at https : / / arXiv . org / abs / 2111 . 05324. Nov. 2021. arXiv: 2111 . 05324
[math-ph, physics:quant-ph].
[Hua+21] D. Huang et al. “Matrix Concentration for Products”. In: Foundations of Computa-
tional Mathematics (2021), pages 1–33.
[Lin63] J. Lindenstrauss. “On the Modulus of Smoothness and Divergent Series in Banach
Spaces.” In: Michigan Mathematical Journal 10.3 (1963), pages 241–252.
[Nao12] A. Naor. “On the Banach-Space-Valued Azuma Inequality and Small-Set Isoperime-
try of Alon–Roichman Graphs”. In: Combinatorics, Probability and Computing 21.4
(July 2012), pages 623–634. doi: 10.1017/S0963548311000757.
[RX16] É. Ricard and Q. Xu. “A Noncommutative Martingale Convexity Inequality”. In: The
Annals of Probability 44.2 (Mar. 2016), pages 867–882. doi: 10.1214/14-AOP990.
[SS90] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1st Edition. Academic
Press, 1990.
9. Additive Perturbation Theory
Matrix perturbation theory studies how a function of a matrix changes as the matrix
changes. In particular, we may ask how the eigenvalues of an Hermitian matrix change
under additive perturbation. That is, for Hermitian matrices 𝑨, 𝑬 of the same size,
how do the eigenvalues of the perturbed matrix 𝑨 + 𝑬 compare with the eigenvalues
of 𝑨? Today, we will develop Lidskii's theorem, a result that provides a very detailed
answer to this question. This is one of the most important applications of majorization
theory in matrix analysis.
Agenda:
1. Poincaré theorem
2. Courant–Fischer–Weyl minimax principle
3. Weyl monotonicity principle
4. Lidskii theorem
Proof. The proof relies on dimension counting. Let (𝜆𝑖↓, 𝒖𝑖) be the eigenpairs of 𝑨.
The set (𝒖𝑖 : 𝑖 = 1, . . . , 𝑛) comprises an orthonormal basis of ℂ𝑛. In other words,
⟨𝒖𝑖, 𝒖𝑗⟩ = 𝛿𝑖𝑗.
Consider the subspace S = span {𝒖 𝑘 , . . . , 𝒖 𝑛 }. Observe that dim L + dim S = 𝑛 + 1.
Since the total dimension is greater than 𝑛 , the intersection L ∩ S is a nontrivial
subspace. Since the subspace is nontrivial, it contains a unit-norm vector 𝒙 ∈ L ∩ S.
We may express this vector in the distinguished basis for the subspace:
𝒙 = Σ_{𝑖=𝑘}^{𝑛} 𝛼𝑖 𝒖𝑖 where 𝛼𝑖 ∈ ℂ.
Therefore,
⟨𝒙, 𝑨𝒙⟩ = Σ_{𝑖=𝑘}^{𝑛} 𝜆𝑖↓ |𝛼𝑖|² ≤ 𝜆𝑘↓ Σ_{𝑖=𝑘}^{𝑛} |𝛼𝑖|² = 𝜆𝑘↓.
The final inequality comes from the fact that the eigenvalues 𝜆 𝑗 are arranged in
descending order. This proves the first inequality in the statement of the theorem.
In order to prove the second inequality, we simply replace 𝑨 with −𝑨 . This
operation yields a unit vector 𝒙 ∈ L with the property that
In the first line, we used the fact that the eigenvalues of −𝑨 are the negatives of the eigenvalues
of 𝑨 . Because of the change in sign, the order reverses. This is the second statement
in the theorem.
Theorem 9.1 yields a corollary that provides a minimax representation for the
ordered eigenvalues.
Corollary 9.2 (Courant–Fischer–Weyl minimax principle). Let 𝑨 ∈ ℍ𝑛. Then, for each 𝑘 = 1, . . . , 𝑛,
𝜆𝑘↓(𝑨) = max_{dim L = 𝑘} min_{𝒙 ∈ L, ‖𝒙‖ = 1} ⟨𝒙, 𝑨𝒙⟩ = min_{dim S = 𝑛−𝑘+1} max_{𝒙 ∈ S, ‖𝒙‖ = 1} ⟨𝒙, 𝑨𝒙⟩.
This result represents eigenvalues as quadratic forms. Let us emphasize that these
expressions are linear in the matrix 𝑨 .
Proof. As before, let (𝜆𝑖↓(𝑨), 𝒖𝑖) denote the eigenpairs of 𝑨. By Theorem 9.1, each
subspace L ⊆ ℂ𝑛 with dimension 𝑘 contains a unit vector 𝒙 ∈ L such that
Proof. To prove this result, simply let 𝑘 = 𝑛 and 𝑘 = 1 in the first and second statement
of Corollary 9.2, respectively.
Corollary 9.3 has some striking implications:
Indeed, we observe that the maximum eigenvalue is the maximum of linear functions
of the matrix, so is a convex function. Likewise, the minimum eigenvalue is a minimum
of linear functions, so is concave.
Exercise 9.4 (Spectral condition number convex cone). For 𝑐 ≥ 1, consider the set
K(𝑐) ≔ {𝑨 ∈ ℍ𝑛 : 𝜅2(𝑨) ≤ 𝑐} of matrices whose spectral condition number is bounded by
𝑐. Show that K(𝑐) is a closed convex cone. Recall that the spectral condition number is
𝜅2(𝑨) ≔ 𝜆max(𝑨)/𝜆min(𝑨) for each Hermitian matrix 𝑨.
Exercise 9.5 (Rayleigh–Ritz via diagonalization). Prove Corollary 9.3 by diagonalizing 𝑨
and computing the quadratic form explicitly.
Exercise 9.6 (Rayleigh quotient). Recall that the Rayleigh quotient is defined as
𝑅(𝑨; 𝒙) ≔ ⟨𝒙, 𝑨𝒙⟩ / ⟨𝒙, 𝒙⟩ for 𝒙 ≠ 0.
Prove Corollary 9.3 by differentiating the Rayleigh quotient with respect to 𝒙 and
finding the second-order stationary points.
Warning 9.10 (Eigenvalue increase). The converse of the Weyl monotonicity principle
is not true! The conditions 𝜆𝑘↓(𝑩) ≥ 𝜆𝑘↓(𝑨) do not imply that 𝑩 ≽ 𝑨.
Proof. Let S_{𝑛−𝑘+1} be the subspace that represents 𝜆𝑘↓(𝑨 + 𝑯) in the second (minimax)
relation in Corollary 9.2. Then
𝜆𝑘↓(𝑨 + 𝑯) = max_{𝒙 ∈ S_{𝑛−𝑘+1}, ‖𝒙‖=1} ⟨𝒙, (𝑨 + 𝑯)𝒙⟩ ≥ max_{𝒙 ∈ S_{𝑛−𝑘+1}, ‖𝒙‖=1} ⟨𝒙, 𝑨𝒙⟩
 ≥ min_{dim S = 𝑛−𝑘+1} max_{𝒙 ∈ S, ‖𝒙‖=1} ⟨𝒙, 𝑨𝒙⟩ = 𝜆𝑘↓(𝑨).
The first inequality is valid because 𝑯 is psd. The second inequality occurs because
S_{𝑛−𝑘+1} represents 𝜆𝑘↓(𝑨 + 𝑯) via the minimax principle, so the minimum over all
subspaces S ⊆ ℂ𝑛 with dimension 𝑛 − 𝑘 + 1 has to be smaller. The final equality is
the minimax representation of 𝜆𝑘↓(𝑨) from Corollary 9.2.
We give a simple proof of Lidskii’s theorem due to Li & Mathias [LM99]. This
argument reduces the full result to two applications of the Weyl monotonicity principle.
See [Bha97] for three alternative proofs of Lidskii’s theorem.
Proof. Fix a number 𝑘, and choose indices 𝑖1 < 𝑖2 < · · · < 𝑖𝑘. Without loss of
generality, we may assume that 𝜆𝑘↓(𝑬) = 0. If not, we simply replace 𝑬 with
𝑬 − 𝜆𝑘↓(𝑬) I. Indeed, this transformation reduces the 𝑘th eigenvalue of 𝑬 to zero, and
it shifts both sides of the inequality by an equal amount, namely −𝑘𝜆𝑘↓(𝑬).
Next, introduce the Jordan decomposition 𝑬 = 𝑬 + − 𝑬 − , where the summands 𝑬 +
and 𝑬 − are commuting psd matrices. Indeed, 𝑬 admits a spectral representation
𝑬 = Σ_{𝑗=1}^{𝑛} 𝜆𝑗↓(𝑬) 𝑷𝑗, and we may take 𝑬₊ = Σ_{𝑗=1}^{𝑛} (𝜆𝑗↓(𝑬))₊ 𝑷𝑗 and 𝑬₋ = Σ_{𝑗=1}^{𝑛} (𝜆𝑗↓(𝑬))₋ 𝑷𝑗.
The two functions (𝑎)± ≔ max{±𝑎, 0} return the positive (resp., negative) part of a
number 𝑎 ∈ ℝ. Note that 𝑬 ≼ 𝑬₊ and so 𝑨 + 𝑬 ≼ 𝑨 + 𝑬₊.
We make the following calculation, invoking the Weyl monotonicity principle twice.
Σ_{𝑗=1}^{𝑘} [ 𝜆_{𝑖𝑗}↓(𝑨 + 𝑬) − 𝜆_{𝑖𝑗}↓(𝑨) ] ≤ Σ_{𝑗=1}^{𝑘} [ 𝜆_{𝑖𝑗}↓(𝑨 + 𝑬₊) − 𝜆_{𝑖𝑗}↓(𝑨) ]
 ≤ Σ_{𝑗=1}^{𝑛} [ 𝜆𝑗↓(𝑨 + 𝑬₊) − 𝜆𝑗↓(𝑨) ]
 = tr(𝑨 + 𝑬₊) − tr(𝑨) = tr(𝑬₊)
 = Σ_{𝑗=1}^{𝑛} 𝜆𝑗↓(𝑬₊)
 = Σ_{𝑗=1}^{𝑘} 𝜆𝑗↓(𝑬).
The first inequality follows from Corollary 9.9 because 𝑨 + 𝑬 ≼ 𝑨 + 𝑬₊. The second
inequality introduces the missing indices; it relies on Corollary 9.9 and the fact that
𝑨 ≼ 𝑨 + 𝑬₊. Next, we recall that the trace is the sum of the eigenvalues, and we rely on the
fact that the trace is linear. Finally, note that 𝜆𝑘↓(𝑬) = 0. Therefore, 𝜆𝑗↓(𝑬₊) = 0 for
𝑗 ≥ 𝑘. Meanwhile, the largest 𝑘 eigenvalues of 𝑬₊ and 𝑬 coincide. This observation
completes the proof.
𝝀 ↓ (𝑨 + 𝑬 ) − 𝝀 ↓ (𝑨) ≺ 𝝀 ↓ (𝑬 ). (9.1)
In the majorization relation, the trace equality follows from the linearity of the trace
and the fact that tr(𝑨 + 𝑬) = Σ_{𝑗=1}^{𝑛} 𝜆𝑗(𝑨 + 𝑬). In particular, we may consider the
𝑘 = 1 case of the majorization relation:
𝜆 1↓ (𝑨 + 𝑬 ) − 𝜆 1↓ (𝑨) ≤ 𝜆1↓ (𝑬 ).
These two pairs of majorization relations give detailed information about how the
eigenvalues behave when we add or subtract Hermitian matrices.
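Aside: (Numerical check). The majorization relation (9.1) can be verified on random Hermitian matrices by comparing sorted partial sums; a minimal numpy sketch (illustrative only):
import numpy as np

rng = np.random.default_rng(10)
n = 6
def rand_herm(n):
    M = rng.standard_normal((n, n))
    return (M + M.T) / 2

A, E = rand_herm(n), rand_herm(n)
lam = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]     # decreasing order
d = lam(A + E) - lam(A)
e = lam(E)
# Majorization (9.1): partial sums of the sorted difference vector never
# exceed those of lam(E), and the total sums agree (both equal tr E).
print(np.all(np.cumsum(np.sort(d)[::-1]) <= np.cumsum(e) + 1e-9),
      np.isclose(d.sum(), e.sum()))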
In other words, the maximum change in each eigenvalue is controlled by the spectral-
norm difference between the two matrices. This last relation implies that the eigenvalue
function
𝝀↓ : (ℍ𝑛, 𝑆∞) → (ℝ𝑛, ℓ∞)
is a 1-Lipschitz map between two metric spaces. In particular, 𝝀 ↓ is a continuous
function on the space of Hermitian matrices.
It is sometimes more convenient to make another change of variables. One obtains
max𝑗 |𝜆𝑗↓(𝑩) − 𝜆𝑗↓(𝑨)| ≤ ‖𝑩 − 𝑨‖∞.
This statement describes how the eigenvalues of two Hermitian matrices reflect the
difference between the two matrices.
Aside: The eigenvalues of a general 𝑛 × 𝑛 matrix also change continuously. Nevertheless,
this statement requires some care because there is no natural way to order the
eigenvalues. Furthermore, the proofs involve tools from algebra or complex analysis.
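Aside: (Numerical check). The spectral-norm Lipschitz bound above is easy to test; a minimal numpy sketch (illustrative only):
import numpy as np

rng = np.random.default_rng(11)
n = 6
def rand_herm(n):
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

A, B = rand_herm(n), rand_herm(n)
lam = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]     # decreasing order
print(np.max(np.abs(lam(B) - lam(A))) <= np.linalg.norm(B - A, 2) + 1e-9)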
Example 9.15 (Hoffman–Wielandt theorem). Consider the symmetric gauge function
Φ(·) = ‖·‖2. Then
Σ_{𝑗=1}^{𝑛} |𝜆𝑗↓(𝑨 + 𝑬) − 𝜆𝑗↓(𝑨)|² ≤ Σ_{𝑗=1}^{𝑛} |𝜆𝑗↓(𝑬)|² = ‖𝑬‖F².
In other words, the map
𝝀↓ : (ℍ𝑛, 𝑆2) → (ℝ𝑛, ℓ2)
is 1-Lipschitz between two Euclidean spaces. Equivalently, by a change of variables,
Σ_{𝑗=1}^{𝑛} |𝜆𝑗↓(𝑩) − 𝜆𝑗↓(𝑨)|² ≤ ‖𝑩 − 𝑨‖F².
This is the form in which the result is usually presented.
Aside: The Hoffman–Wielandt theorem can be extended to normal matrices, but it
requires matching the eigenvalues of the two matrices more carefully. For details, see
[Bha97, Sec. VI and Eqn. VI.36].
9.4.3 Beyond norms
By using other isotone functions, we may derive many more results on perturbation of
eigenvalues. Here is a picayune example.
Example 9.16 (Entropy of eigenvalues). Consider two psd matrices 𝑨 and 𝑯 with the
same dimension.
ent(𝝀↓(𝑨 + 𝑯) − 𝝀↓(𝑨)) ≥ ent(𝝀↓(𝑯)).
Recall that the entropy of a positive vector is defined as ent(𝒙) ≔ −Σ𝑗 𝑥𝑗 log 𝑥𝑗, and
the negation of the entropy is an isotone function.
Notes
Variational principles for eigenvalues and perturbation theory for eigenvalues are
perennial topics in matrix analysis. Good references include the books of Bhatia [Bha97;
Bha07a], Parlett [Par98], Saad [Saa11b], and Stewart & Sun [SS90]. The classic
reference is Kato’s magnum opus [Kat95].
Our approach to variational principles is drawn from [Bha97, Chap. III]. Historically,
Lidskii’s theorem was regarded as a very difficult result, but Li & Mathias [LM99] have
really cut to the heart of the matter.
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[Kat95] T. Kato. Perturbation theory for linear operators. Reprint of the 1980 edition.
Springer-Verlag, Berlin, 1995.
[LM99] C.-K. Li and R. Mathias. “The Lidskii-Mirsky-Wielandt theorem–additive and
multiplicative versions”. In: Numerische Mathematik 81.3 (1999), pages 377–413.
[Par98] B. N. Parlett. The symmetric eigenvalue problem. Corrected reprint of the 1980
original. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA,
1998. doi: 10.1137/1.9781611971163.
[Saa11b] Y. Saad. Numerical methods for large eigenvalue problems. Revised edition of the
1992 original [ 1177405]. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2011. doi: 10.1137/1.9781611970739.ch1.
[SS90] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1st Edition. Academic
Press, 1990.
10. Multiplicative Perturbation Theory
In the last lecture, we studied how the eigenvalues of a Hermitian matrix 𝑨 change
under an additive perturbation 𝑨 + 𝑬. This discussion culminated with Lidskii's
theorem. In this lecture, we prove a multiplicative analogue of Lidskii's theorem
due to Li and Mathias. They consider multiplicative perturbations of the form 𝑺∗𝑨𝑺.
The proof of the Li–Mathias theorem proceeds in parallel with their proof of Lidskii's
theorem, which we recall below.
Agenda:
1. Recap of Lidskii's theorem
2. Li–Mathias theorem
3. Consequences
4. Sylvester inertia theorem
5. Ostrowski monotonicity
6. Proof of Li–Mathias
Throughout this lecture, eigenvalues and singular values are always sorted in
decreasing order, so we write 𝝀, 𝝈 instead of 𝝀↓, 𝝈↓ to lighten our notation.
In this lecture, we prove multiplicative analogues of the above results, where differences
are replaced by ratios, positivity is replaced by expansivity, and the additivity of the
trace is replaced by the multiplicativity of the determinant.
Π_{𝑗=1}^{𝑘} 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺) / 𝜆_{𝑖𝑗}(𝑨) ≤ Π_{𝑗=1}^{𝑘} 𝜆𝑗(𝑺∗𝑺). (10.2)
In the setting of Theorem 10.1, Theorem 10.4 implies that if 𝑺 is invertible then
sgn(𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺)) = sgn(𝜆_{𝑖𝑗}(𝑨)) for all 𝑗, hence each factor on the left-hand side
of (10.2) is positive. If 𝑺 is not invertible, then each factor on the left-hand side of (10.2)
is only guaranteed to be nonnegative, by continuity of each such factor in 𝑺.
Proof of Theorem 10.4. Suppose 𝑨 and 𝑩 are congruent, so 𝑨 = 𝑺∗𝑩𝑺 for some
invertible 𝑺 ∈ 𝕄𝑛. Note that 𝑨𝒙 = 0 for 𝒙 ∈ ℝ𝑛 if and only if 𝑩(𝑺𝒙) = 0 because
𝑺 ∗ is invertible, hence 𝑺 maps ker (𝑨) isomorphically onto ker (𝑩) . In particular,
𝑛 0 (𝑨) = dim ker (𝑨) = dim ker (𝑩) = 𝑛 0 (𝑩) .
Next, let (𝜆𝑖 , 𝒖 𝑖 ) be orthogonal eigenpairs of 𝑨 , where 𝜆𝑖 > 0 for 𝑖 = 1, . . . , 𝑛 + (𝑨)
by definition of 𝑛 + . Define L+ = lin{𝒖 1 , . . . , 𝒖 𝑛 + (𝑨) }. Any nonzero 𝒙 ∈ L+ can be
written as 𝒙 = Σ_{𝑖=1}^{𝑛₊(𝑨)} 𝛼𝑖 𝒖𝑖 where Σ_𝑖 |𝛼𝑖|² = ‖𝒙‖² > 0. Hence,
⟨𝑺𝒙, 𝑩(𝑺𝒙)⟩ = ⟨𝒙, 𝑺∗𝑩𝑺𝒙⟩ = ⟨𝒙, 𝑨𝒙⟩ = Σ_{𝑖=1}^{𝑛₊(𝑨)} 𝜆𝑖 |𝛼𝑖|² > 0.
We conclude that h𝒚 , 𝑩𝒚 i > 0 for all nonzero 𝒚 ∈ 𝑺 L+ . Since 𝑺 is invertible and the
eigenvectors 𝒖 𝑖 are orthogonal, dim (𝑺 L+ ) = dim L+ = 𝑛 + (𝑨) . By the Courant–Fischer–
Weyl minimax principle (see Lecture 9),
𝜆_{𝑛₊(𝑨)}↓(𝑩) ≥ min_{𝒚 ∈ 𝑺L₊, ‖𝒚‖=1} ⟨𝒚, 𝑩𝒚⟩ > 0,
which implies 𝑛₊(𝑩) ≥ 𝑛₊(𝑨). Interchanging the roles of 𝑨 and 𝑩, we also get
𝑛 + (𝑨) ≥ 𝑛 + (𝑩) and thus 𝑛 + (𝑨) = 𝑛 + (𝑩) .
Finally, 𝑛 − (𝑨) = 𝑛 − 𝑛 0 (𝑨) − 𝑛 + (𝑨) = 𝑛 − (𝑩) , so we obtain inertia (𝑨) =
inertia (𝑩) .
The converse is not needed for this lecture, and is left as an exercise.
Exercise 10.5 (Sylvester inertia theorem: Converse). Prove the remaining direction in
Theorem 10.4. That is, two Hermitian matrices with the same inertia are congruent.
( log (𝜆𝑗(𝑺∗𝑨𝑺) / 𝜆𝑗(𝑨)) )_{𝑗=1}^{𝑛} ≺ ( log 𝜆𝑗(𝑺∗𝑺) )_{𝑗=1}^{𝑛}. (10.3)
For a positive-definite matrix 𝑨 ≻ 0, we can further rewrite this expression in the form
log 𝜆(𝑺 ∗ 𝑨𝑺 ) − log 𝜆(𝑨) ≺ log 𝜆(𝑺 ∗𝑺 ) .
Just as in the additive case, we can obtain many scalar inequalities from (10.3) by
applying isotone functions to both sides. For example,
max_{𝑗=1,...,𝑛} | log (𝜆𝑗(𝑺∗𝑨𝑺) / 𝜆𝑗(𝑨)) | ≤ max_{𝑗=1,...,𝑛} | log 𝜆𝑗(𝑺∗𝑺) |,
Σ_{𝑗=1}^{𝑛} 𝜆𝑗(𝑺∗𝑨𝑺) / 𝜆𝑗(𝑨) ≤ Σ_{𝑗=1}^{𝑛} 𝜆𝑗(𝑺∗𝑺) = ‖𝑺‖F².
Theorem 10.1 also implies multiplicative perturbation bounds for singular values.
Corollary 10.6 (Li–Mathias for singular values). Choose 𝑺 ,𝑻 ∈ 𝕄𝑛 and 𝑘 ∈ {1, . . . , 𝑛}. For
any distinct indices 1 ≤ 𝑖 1 < . . . < 𝑖𝑘 ≤ 𝑛 such that 𝜎𝑖 𝑗 (𝑻 ) > 0 for all 𝑗 , we have
Π_{𝑗=1}^{𝑘} 𝜎_{𝑖𝑗}(𝑻𝑺) / 𝜎_{𝑖𝑗}(𝑻) ≤ Π_{𝑗=1}^{𝑘} 𝜎𝑗(𝑺).
Moreover, if 𝑘 = 𝑛 and 𝑻 is invertible then the above inequality holds with equality.
Proof. Set 𝑨 = 𝑻 ∗𝑻 in Theorem 10.1. Take the square roots of both sides of (10.2).
Once again, Corollary 10.6 can be restated in terms of majorization. This leads to a
classic result from operator theory.
Corollary 10.7 (Gel’fand–Naimark). If 𝑺 ,𝑻 ∈ 𝕄𝑛 are invertible, then
𝜆𝑗(𝑺∗𝑨𝑺) / 𝜆𝑗(𝑨) ≥ 1.
If 𝜆𝑗(𝑨) > 0, then the upper bound in Weyl's perturbation inequality (see Lecture 9)
gives
Similarly, if 𝜆 𝑗 (𝑨) < 0 then the lower bound in Weyl’s inequality gives
Dividing by 𝜆 𝑗 (𝑨) and using the fact 𝜆𝑛 (𝑺 ∗𝑺 ) ≥ 1, we arrive at the desired result.
Exercise 10.9 (Ostrowski: Quantitative version). Prove the following strengthening of
Theorem 10.8. For any invertible 𝑺 ∈ 𝕄𝑛 and any 𝑨 ∈ ℍ𝑛 , we have 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) =
𝜃 𝑗 𝜆 𝑗 (𝑨) for some 𝜃 𝑗 ∈ [𝜆𝑛 (𝑺 ∗𝑺 ), 𝜆1 (𝑺 ∗𝑺 )] .
This relation can be viewed as a quantitative version of Sylvester’s law of inertia,
which is how Ostrowski originally presented this result in [Ost59].
We begin with a series of reductions. First, we may assume both 𝑨 and 𝑺 are
invertible because both sides of the inequality (10.2) are continuous at (𝑨, 𝑺) whenever
𝜆_{𝑖𝑗}(𝑨) ≠ 0 for all 𝑗, and any square matrix is a limit of invertible matrices. Here we use
the continuity of the eigenvalues, which follows from Weyl's perturbation inequality in Lecture 9.
Second, let 𝑺 = 𝑼 𝚺𝑽 ∗ be an SVD of 𝑺 , where 𝚺 = diag (𝑠 1 , . . . , 𝑠𝑛 ) is the diagonal
matrix of singular values of 𝑺 , arranged in descending order. We may then assume
𝑺 = 𝚺 because eigenvalues are invariant under conjugation by unitary matrices, so
𝜆𝑗(𝑺∗𝑨𝑺) / 𝜆𝑗(𝑨) = 𝜆𝑗(𝑽 [𝚺∗(𝑼∗𝑨𝑼)𝚺] 𝑽∗) / 𝜆𝑗(𝑨) = 𝜆𝑗(𝚺∗(𝑼∗𝑨𝑼)𝚺) / 𝜆𝑗(𝑼∗𝑨𝑼).
Thus, if we can prove the result for 𝑺 = 𝚺 and arbitrary invertible 𝑨 , then after
changing variables 𝑨 ↦→ 𝑼 ∗ 𝑨𝑼 we obtain the result for the original 𝑺 and arbitrary
invertible 𝑨 .
Third, we may assume 𝑠𝑘 = 1. Indeed, if we replace 𝑺 by 𝑠𝑘^{−1} 𝑺 in the inequality (10.2),
then both sides are scaled by 𝑠𝑘^{−2𝑘}. Note that 𝑠𝑘 > 0 because 𝑺 is invertible.
To summarize, it suffices to prove (10.2) for invertible 𝑨 and 𝑺 = diag (𝑠 1 , . . . , 𝑠𝑛 )
where 𝑠1 ≥ · · · ≥ 𝑠𝑘 = 1 ≥ · · · ≥ 𝑠𝑛 > 0. To that end, define the matrix
b𝑺 = diag(𝑠1, . . . , 𝑠𝑘, 1, . . . , 1), with 𝑛 − 𝑘 trailing ones.
The matrix b𝑺 is the “expansive part” of 𝑺, analogous to the positive part 𝑬₊ of an
additive perturbation 𝑬. Observe that
I ≼ (𝑺^{−1} b𝑺)∗ (𝑺^{−1} b𝑺) = diag(1, . . . , 1, 𝑠_{𝑘+1}^{−2}, . . . , 𝑠𝑛^{−2}).
Now, Ostrowski monotonicity gives
1 ≤ 𝜆_{𝑖𝑗}( (𝑺^{−1} b𝑺)∗ (𝑺∗𝑨𝑺) (𝑺^{−1} b𝑺) ) / 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺) = 𝜆_{𝑖𝑗}(b𝑺∗𝑨b𝑺) / 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺).
Because 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺)/𝜆_{𝑖𝑗}(𝑨) > 0 by Sylvester's law of inertia, we deduce that
Π_{𝑗=1}^{𝑘} 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺) / 𝜆_{𝑖𝑗}(𝑨) ≤ Π_{𝑗=1}^{𝑘} [ 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺) / 𝜆_{𝑖𝑗}(𝑨) ] · [ 𝜆_{𝑖𝑗}(b𝑺∗𝑨b𝑺) / 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺) ] = Π_{𝑗=1}^{𝑘} 𝜆_{𝑖𝑗}(b𝑺∗𝑨b𝑺) / 𝜆_{𝑖𝑗}(𝑨).
Since b𝑺 ≽ I, Ostrowski monotonicity also gives 𝜆𝑗(b𝑺∗𝑨b𝑺)/𝜆𝑗(𝑨) ≥ 1 for all 𝑗.
Therefore,
Π_{𝑗=1}^{𝑘} 𝜆_{𝑖𝑗}(b𝑺∗𝑨b𝑺) / 𝜆_{𝑖𝑗}(𝑨) ≤ Π_{𝑗=1}^{𝑛} 𝜆𝑗(b𝑺∗𝑨b𝑺) / 𝜆𝑗(𝑨) = det(b𝑺∗𝑨b𝑺) / det(𝑨) = det(b𝑺∗b𝑺)
 = Π_{𝑗=1}^{𝑘} 𝑠𝑗² = Π_{𝑗=1}^{𝑘} 𝜆𝑗(𝑺∗𝑺),
where we used the multiplicativity of the determinant similarly to the proof of the
equality case above. This establishes the desired inequality (10.2).
Exercise 10.10 (Li–Mathias: Lower bound). In the setting of Theorem 10.1, prove the lower
bound
Π_{𝑗=1}^{𝑘} 𝜆_{𝑖𝑗}(𝑺∗𝑨𝑺) / 𝜆_{𝑖𝑗}(𝑨) ≥ Π_{𝑗=1}^{𝑘} 𝜆_{𝑛−𝑗+1}(𝑺∗𝑺).
Note that this implies a corresponding lower bound in Corollary 10.6. Hint: First
assume 𝑺 is invertible and apply Theorem 10.1.
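Aside: (Numerical check). Both the upper bound (10.2) and the lower bound from Exercise 10.10 can be tested on random data; a minimal numpy sketch (illustrative only):
import numpy as np

rng = np.random.default_rng(12)
n, k = 5, 3
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
S = rng.standard_normal((n, n))
lam = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]     # decreasing order
idx = np.sort(rng.choice(n, size=k, replace=False))
ratios = lam(S.T @ A @ S)[idx] / lam(A)[idx]
upper = np.prod(lam(S.T @ S)[:k])                        # k largest eigenvalues
lower = np.prod(lam(S.T @ S)[-k:])                       # k smallest eigenvalues
print(lower - 1e-9 <= np.prod(ratios) <= upper + 1e-9)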
In Lectures 9–10, we studied the change in the eigenvalues and singular values
under perturbations. In the next lecture, we expand our scope to consider the change
in the eigenspaces of an Hermitian matrix under perturbation.
Notes
This lecture is adapted from the paper [LM99] of Li & Mathias. The proofs of Sylvester’s
theorem and Ostrowski’s theorem are drawn from Horn & Johnson [HJ13]. See Bhatia’s
books [Bha97; Bha07a] for some additional discussion of multiplicative perturbation.
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[LM99] C.-K. Li and R. Mathias. “The Lidskii-Mirsky-Wielandt theorem–additive and
multiplicative versions”. In: Numerische Mathematik 81.3 (1999), pages 377–413.
[Ost59] A. M. Ostrowski. “A quantitative formulation of Sylvester’s law of inertia”. In:
Proceedings of the National Academy of Sciences of the United States of America 45.5
(1959), page 740.
[Syl52] J. Sylvester. “A demonstration of the theorem that every homogeneous quadratic
polynomial is reducible by real orthogonal substitutions to the form of a sum of
positive and negative squares”. In: The London, Edinburgh, and Dublin Philosophical
Magazine and Journal of Science 4.23 (1852), pages 138–142.
11. Perturbation of Eigenspaces
Agenda:
1. Motivation
2. Principal angles
3. Sylvester equations
4. Eigenspace perturbation

The last two lectures covered the perturbation theory for eigenvalues of Hermitian matrices. We will continue our discussion with the perturbation of the eigenspaces of Hermitian matrices, which has a rather different character. We will start out with a motivating example that demonstrates the major challenge in understanding the behavior of eigenspaces under perturbation. This observation leads us to study the principal angles between subspaces and the solutions of Sylvester equations. Finally, we will use these tools to develop a result on the perturbation of eigenspaces associated with well-separated eigenvalues.
11.1 Motivation
Let 𝑨, 𝑩 ∈ ℍ𝑛 be Hermitian matrices of the same dimension. From Lidskii’s theorem
(Lecture 9), if 𝑨 and 𝑩 are close, then the vectors 𝝀(𝑨) and 𝝀(𝑩) of decreasingly
ordered eigenvalues are also close. One might ask the following question. Are the
eigenspaces of 𝑨 and 𝑩 close as well? Let us investigate by posing a simple example.
Example 11.1 (Bad eigenspaces). For small 𝜀 > 0, consider the matrices
𝑨 = [ 1+𝜀  0 ; 0  1−𝜀 ]  and  𝑩 = [ 1  𝜀 ; 𝜀  1 ].
These matrices represent two different perturbations of the identity matrix. Each of
the matrices 𝑨 , 𝑩 has eigenvalues 1 ± 𝜀 . Furthermore, the two matrices are close with
respect to every UI norm.
One might expect that the eigenspace of 𝑨 associated with the 1 + 𝜀 eigenvalue
is close to the eigenspace of 𝑩 associated with the 1 + 𝜀 eigenvalue and similarly for
eigenspaces associated with 1 − 𝜀 eigenvalue. The eigenpairs of the matrices are easily
determined.
𝑨 has eigenpairs (1 + 𝜀, [1, 0]ᵀ) and (1 − 𝜀, [0, 1]ᵀ);
𝑩 has eigenpairs (1 + 𝜀, (1/√2)[1, 1]ᵀ) and (1 − 𝜀, (1/√2)[1, −1]ᵀ).
For each eigenvalue, the eigenvectors of 𝑨 and 𝑩 are very far apart. This happens in
spite of the fact that the matrices 𝑨 and 𝑩 are close to each other in every UI norm.
As seen from the preceding example, two matrices can be close without having similar eigenspaces. To appreciate why, observe that the 2 × 2 identity matrix has only one distinct eigenvalue, 1, and the associated eigenspace is all of ℂ². Both 𝑨 and 𝑩 are perturbations of the identity matrix, and we are looking at eigenvectors associated with very close eigenvalues. When their eigenvalues coalesce at 𝜀 = 0, the 1-dimensional eigenspaces combine to form a 2-dimensional eigenspace with a continuous family of eigenvectors.
Definition 11.2 (Acute angle between vectors). Let 𝒖, 𝒗 ∈ ℂ𝑛 be vectors with unit
2 -norm. The acute angle 𝜃 ∈ [ 0, 𝜋/2] between 𝒖 and 𝒗 is determined by the
relation cos 𝜃 (𝒖, 𝒗 ) = |h𝒖, 𝒗 i| .
Figure 11.1 demonstrates that the acute angle between vectors is simply the smaller
angle between the 1-dimensional subspaces spanned by 𝒖 and 𝒗 . Now, the question
that we have to pose, which goes back to work of [Jor75], is how to extend this idea of
finding the smallest angle to subspaces.
Figure 11.1: Acute angle between vectors.
We begin with a mechanism for parameterizing the subspaces. Let 𝑿, 𝒀 ∈ ℂ^{𝑛×𝑘} be tall matrices, each one with orthonormal columns. We emphasize that the columns of the concatenation [𝑿, 𝒀] do not need to be orthonormal. We can parameterize unit vectors in the ranges of these matrices as follows.
{𝑿𝒖 : ‖𝒖‖ = 1} and {𝒀𝒗 : ‖𝒗‖ = 1}.
These sets parametrize the unit spheres in the range of 𝑿 and in the range of 𝒀. Now, the acute angle between two unit vectors 𝑿𝒖 and 𝒀𝒗 from these ranges satisfies cos 𝜃(𝑿𝒖, 𝒀𝒗) = |⟨𝑿𝒖, 𝒀𝒗⟩| = |𝒖*(𝑿*𝒀)𝒗|. The first principal angle 𝜃₁ between range(𝑿) and range(𝒀) is defined as the minimum acute angle between unit vectors in range(𝑿) and range(𝒀). To minimize the acute angle, we maximize the cosine of this angle:

cos 𝜃₁ = max{|𝒖*(𝑿*𝒀)𝒗| : ‖𝒖‖ = ‖𝒗‖ = 1} = 𝜎₁(𝑿*𝒀).

In short, the closest pair of vectors in the two subspaces range(𝑿) and range(𝒀) is obtained by taking the first singular value of 𝑿*𝒀.
To continue, let (𝒖 1 , 𝒗 1 ) be a pair of unit vectors where the maximum is achieved
in the definition of 𝜃₁. The second principal angle 𝜃₂ can be defined recursively by finding the smallest angle between unit vectors in range(𝑿) and range(𝒀), excluding the directions 𝑿𝒖₁ and 𝒀𝒗₁.
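For concreteness, here is a short NumPy sketch (with randomly generated subspaces; the helper code is an ad hoc illustration, not a prescribed algorithm) that computes all the principal angles at once from the singular values of 𝑿*𝒀 and checks the first one against the projector product discussed below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3

# Orthonormal bases for two random k-dimensional subspaces of R^n.
X, _ = np.linalg.qr(rng.standard_normal((n, k)))
Y, _ = np.linalg.qr(rng.standard_normal((n, k)))

# Cosines of the principal angles are the singular values of X* Y.
cosines = np.linalg.svd(X.T @ Y, compute_uv=False)
angles = np.arccos(np.clip(cosines, -1.0, 1.0))
print("principal angles:", angles)

# Consistency with the projector formulation: ||P Q|| = cos(theta_1).
P, Q = X @ X.T, Y @ Y.T
print(np.isclose(np.linalg.norm(P @ Q, 2), cosines[0]))
```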
The principal angles only depend on the two subspaces, even though the matrices 𝑿 and 𝒀 in the definition are not uniquely determined. To see why, it is helpful to rewrite this definition in terms of the orthogonal projectors onto the subspaces, as the orthogonal projectors are unique.
Exercise 11.5 (Principal angles: Equivalence). Check that Definitions 11.3 and 11.4 give the same result.
‖𝑷𝑸‖² = cos² 𝜃₁(E, F).
When the cosine of the first principal angle is relatively large (that is, cos 𝜃₁ ≈ 1), then the subspaces are very similar. Conversely, if the cosine of the angle is small (that is, cos 𝜃₁ ≈ 0), then the subspaces are very different. Thus, the spectral norm ‖𝑷𝑸‖ of the projector product gives one measure of how close the two subspaces are. Other UI norms lead to other measures of similarity.
In parallel, we can measure the distance between subspaces by considering the sines of the principal angles. For example, sin² 𝜃₁(E, F) is a measure of the distance between E and F. That is, sin 𝜃₁ ≈ 0 when the subspaces are similar, and sin 𝜃₁ ≈ 1 when the subspaces are very different. It turns out that these distances can also be expressed in terms of projector products. For subspaces E, F with the same dimension,

‖(I − 𝑷)𝑸‖² = sin² 𝜃₁(E, F).

Other UI norms lead to other notions of distance. This insight allows us to develop a metric geometry for subspaces.
Exercise 11.6 (Distances between subspaces). PS3 has some exercises that describe various
ways to measure the distance between subspaces.
This expression converts the matrix equation into an ordinary linear system, which
may be easier to understand.
Exercise 11.7 (Sylvester equation: Tensor form). Check that the equation above is equivalent to (SYL).
Proposition 11.8 (Solvability of (SYL)). Let 𝑨 ∈ ℍ_𝑚 and 𝑩 ∈ ℍ_𝑛. The Sylvester equation (SYL) has a unique solution for all 𝒀 ∈ ℂ^{𝑚×𝑛} if and only if 𝜆_𝑖(𝑨) ≠ 𝜆_𝑗(𝑩) for all 𝑖, 𝑗.
Proof. The Sylvester equation (SYL) has a unique solution for each right-hand side 𝒀 if and only if the matrix in the Kronecker product formulation (11.1) is nonsingular. According to PS1, the eigenvalues of the tensor I ⊗ 𝑨ᵀ − 𝑩 ⊗ I are the numbers 𝜆_𝑖(𝑨) − 𝜆_𝑗(𝑩) for 𝑖 = 1, …, 𝑚 and 𝑗 = 1, …, 𝑛. Therefore, the tensor is nonsingular if and only if 𝜆_𝑖(𝑨) ≠ 𝜆_𝑗(𝑩) for all 𝑖, 𝑗.
Warning 11.9 (Equal coefficients). If 𝑨 = 𝑩 , then (SYL) never has a unique solution.
By placing additional assumptions on the solution matrix (e.g., Hermiticity), we
can sometimes recover the unique solvability.
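To make the Kronecker reformulation concrete, here is a minimal NumPy/SciPy sketch. It uses a column-stacking vec convention (the convention determines which factor gets transposed) and illustrative diagonal data, and it compares against SciPy’s Sylvester solver.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(2)
m, n = 4, 3

# Hermitian coefficients with disjoint spectra, so (SYL) is uniquely solvable.
A = np.diag([3.0, 4.0, 5.0, 6.0])
B = np.diag([-1.0, 0.0, 1.0])
Y = rng.standard_normal((m, n))

# Column-stacking vec: vec(A X - X B) = (I_n kron A - B^T kron I_m) vec(X).
K = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))
X = np.linalg.solve(K, Y.reshape(-1, order="F")).reshape(m, n, order="F")

# SciPy's solver handles A X + X B = Q, so pass -B to match A X - X B = Y.
print(np.allclose(A @ X - X @ B, Y), np.allclose(X, solve_sylvester(A, -B, Y)))
```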
The power of this formulation is that we can extend it to the matrix case by capitalizing
the letters, which is a surprisingly useful technique in matrix analysis.
Proof. For a square matrix 𝑴 ∈ 𝕄𝑛 , we can use the Taylor series expansion of the
exponential function to see that
d/d𝑡 e^{𝑡𝑴} = 𝑴 e^{𝑡𝑴} = e^{𝑡𝑴} 𝑴.

Introduce this ansatz into (SYL):

𝑨𝑿 − 𝑿𝑩 = ∫_0^∞ (𝑨 e^{−𝑡𝑨} 𝒀 e^{𝑡𝑩} − e^{−𝑡𝑨} 𝒀 e^{𝑡𝑩} 𝑩) d𝑡
         = ∫_0^∞ −(d/d𝑡)(e^{−𝑡𝑨} 𝒀 e^{𝑡𝑩}) d𝑡
         = 𝒀 − lim_{𝑡→∞} e^{−𝑡𝑨} 𝒀 e^{𝑡𝑩} = 𝒀.
The limit in the second to the last equality vanishes. Indeed, 𝑨 has positive eigenvalues,
so e−𝑡 𝑨 is bounded. Since 𝑩 has strictly negative eigenvalues, e𝑡 𝑩 → 0.
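The integral formula is also easy to test numerically. The sketch below (NumPy/SciPy, with a crude truncated trapezoid rule and illustrative diagonal data chosen for this example) reproduces the solution of (SYL) up to quadrature error.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)

# A has strictly positive eigenvalues; B has strictly negative eigenvalues.
A = np.diag([1.0, 2.0, 3.0, 4.0])
B = np.diag([-1.0, -2.0, -3.0])
Y = rng.standard_normal((4, 3))

# X = integral_0^infty exp(-tA) Y exp(tB) dt, truncated at t = 40 and discretized.
ts = np.linspace(0.0, 40.0, 4001)
vals = np.array([expm(-t * A) @ Y @ expm(t * B) for t in ts])
dt = ts[1] - ts[0]
X = (0.5 * (vals[0] + vals[-1]) + vals[1:-1].sum(axis=0)) * dt   # trapezoid rule

print(np.allclose(A @ X - X @ B, Y, atol=1e-3))   # X solves (SYL) up to quadrature error
```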
Figure 11.2 (Spectral separation). An 𝜀 -spectral gap between sp (𝑨) ∩ S𝑨 and sp (𝑩) ∩ S𝑩
Let us take a moment to reinterpret this corollary. Fix the data 𝑨 ∈ ℍ_𝑚 and 𝑩 ∈ ℍ_𝑛 for the Sylvester equation (SYL), and assume that the spectral separation criterion from Corollary 11.12 holds. We can define the linear solution operator 𝚽 : 𝒀 ↦ 𝑿 that maps the right-hand side 𝒀 of the Sylvester equation to the unique solution 𝑿. Then Corollary 11.12 shows that

‖𝚽‖ ≔ max{‖𝚽(𝒀)‖ : ‖𝒀‖ ≤ 1} ≤ 𝜀^{−1}.

In other words, the norm of the solution operator is controlled by the spectral separation of the data 𝑨, 𝑩.
|||𝑷_𝑨 𝑷_𝑩||| ≤ (1/𝜀) |||𝑷_𝑨 (𝑨 − 𝑩) 𝑷_𝑩||| ≤ (1/𝜀) |||𝑨 − 𝑩|||.
Let us sketch the idea behind the argument and then give a more rigorous treatment. Since a matrix commutes with its spectral projectors, observe that

𝒀₀ = 𝑷_𝑨 (𝑨 − 𝑩) 𝑷_𝑩 = 𝑨 (𝑷_𝑨 𝑷_𝑩) − (𝑷_𝑨 𝑷_𝑩) 𝑩.

For the rigorous version, set 𝒀 = 𝑸_𝑨*(𝑨 − 𝑩)𝑸_𝑩. Then

𝒀 = (𝑸_𝑨* 𝑸_𝑨) 𝑸_𝑨* 𝑨 𝑸_𝑩 − 𝑸_𝑨* 𝑩 𝑸_𝑩 (𝑸_𝑩* 𝑸_𝑩)
  = 𝑸_𝑨* 𝑷_𝑨 𝑨 𝑸_𝑩 − 𝑸_𝑨* 𝑩 𝑷_𝑩 𝑸_𝑩
  = 𝑸_𝑨* 𝑨 𝑷_𝑨 𝑸_𝑩 − 𝑸_𝑨* 𝑷_𝑩 𝑩 𝑸_𝑩
  = (𝑸_𝑨* 𝑨 𝑸_𝑨)(𝑸_𝑨* 𝑸_𝑩) − (𝑸_𝑨* 𝑸_𝑩)(𝑸_𝑩* 𝑩 𝑸_𝑩).
|||𝑷_𝑨 𝑷_𝑩||| ≤ (1/𝜀) |||𝑷_𝑨 (𝑨 − 𝑩) 𝑷_𝑩||| ≤ (1/𝜀) ‖𝑷_𝑨‖ · |||𝑨 − 𝑩||| · ‖𝑷_𝑩‖ = (1/𝜀) |||𝑨 − 𝑩|||.
The second inequality depends on the operator ideal property of UI norms.
The result of Theorem 11.13 verifies our intuition from the beginning of the lecture.
When 𝑨 and 𝑩 are close to each other, the spectral projectors for well-separated
eigenvalues are almost orthogonal to each other. Quantitatively, the UI norm of the
projector product 𝑷 𝑨 𝑷 𝑩 is very small, which reflects dissimilarity of the subspaces
range (𝑷 𝑨 ) and range (𝑷 𝑩 ) . In particular, when 𝑨 = 𝑩 , the eigenvectors associated
with different eigenvalues are orthogonal.
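As a sanity check on Theorem 11.13, the sketch below (NumPy, with random illustrative data) builds spectral projectors for two separated spectral sets and verifies the bound ‖𝑷_𝑨 𝑷_𝑩‖ ≤ 𝜀^{−1}‖𝑨 − 𝑩‖ in the spectral norm.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8

# Hermitian A whose spectrum avoids the band (0, 1), and a small Hermitian perturbation B.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag([-3.0, -2.0, -1.5, -0.5, 1.5, 2.0, 2.5, 3.0]) @ U.T
E = 0.1 * rng.standard_normal((n, n))
B = A + (E + E.T) / 2

def spectral_projector(M, keep):
    w, V = np.linalg.eigh(M)
    cols = V[:, keep(w)]
    return cols @ cols.T

P_A = spectral_projector(A, lambda w: w >= 1.0)   # eigenvalues of A in S_A = [1, inf)
P_B = spectral_projector(B, lambda w: w <= 0.0)   # eigenvalues of B in S_B = (-inf, 0]
eps = 1.0                                         # the sets S_A and S_B are separated by 1

lhs = np.linalg.norm(P_A @ P_B, 2)
rhs = np.linalg.norm(A - B, 2) / eps
print(lhs <= rhs, (lhs, rhs))
```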
Notes
James Joseph Sylvester (1814–1897) was an English mathematician who made important
contributions in matrix theory, number theory and combinatorics. His work includes
the proof of Sylvester’s law of inertia and the study of Sylvester equations.
The idea of using Sylvester equations to study the perturbation of eigenspaces is
usually attributed to Davis & Kahan, who worked out these ideas in an important
paper [DK70]. The presentation in this lecture is more modern. It is modeled on
Bhatia’s book [Bha97, Chap. VII]. We have chosen to emphasize the variational
perspective on principal angles because it seems more intuitive.
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[DK70] C. Davis and W. M. Kahan. “The Rotation of Eigenvectors by a Perturbation. III”.
In: SIAM Journal on Numerical Analysis 7.1 (1970), pages 1–46. eprint: https:
//doi.org/10.1137/0707001. doi: 10.1137/0707001.
[Jor75] C. Jordan. “Essai sur la géométrie à 𝑛 dimensions”. In: Bulletin de la Société
mathématique de France 3 (1875), pages 103–174.
[Syl84] J. J. Sylvester. “Sur l’équation en matrices px= xq”. In: CR Acad. Sci. Paris 99.2
(1884), pages 67–71.
12. Positive Linear Maps
Agenda:
1. Positive-semidefinite order
2. Positive linear maps
3. Examples of positive maps
4. Basic properties
5. Convexity inequalities
6. Russo–Dye theorem

In the first half of the course, we discussed majorization and its consequences, and then we turned to the study of perturbation theory for eigenvalues and eigenspaces. In the second half of the course, we are going to talk about positive-semidefinite matrices and operations on positive-semidefinite matrices. This lecture introduces the positive-semidefinite order, and it defines a class of linear maps that respect the order. We will study the properties of these positive linear maps.
For 𝑨 ∈ 𝕄𝑛 (ℂ) , we say that 𝑨 is positive definite (pd) if h𝒖, 𝑨𝒖i > 0 for all
nonzero vectors 𝒖 ∈ ℂ𝑛 .
The most basic fact about psd matrices is the conjugation rule, which states that conjugation preserves the psd property.
Proposition 12.2 (Conjugation rule). Let 𝑨 ∈ 𝕄_𝑛 be psd, and let 𝑿 ∈ ℂ^{𝑛×𝑘}. Then 𝑿*𝑨𝑿 is psd.
Definition 12.5 (Semidefinite order). For two self-adjoint matrices 𝑨, 𝑩 ∈ 𝕄_𝑛^sa, we write 𝑨 ≼ 𝑩 when 𝑩 − 𝑨 is psd. The relation “≼” is a partial order on 𝕄_𝑛^sa since the cone of psd matrices is proper. We call this the psd order, also known as the semidefinite order or Loewner order. Similarly, we write 𝑨 ≺ 𝑩 when 𝑩 − 𝑨 is pd. Recall that a matrix 𝑨 ∈ 𝕄_𝑛 is self-adjoint (s.a.) when 𝑨* = 𝑨.
It is natural to ask what kind of functions respect the psd order. That is, we are interested in functions 𝑭 : 𝕄_𝑛^sa → 𝕄_𝑘^sa for which

𝑨 ≼ 𝑩 implies 𝑭(𝑨) ≼ 𝑭(𝑩).

A function that satisfies this type of relation is said to be monotone with respect to the psd order.
We can also ask what kind of functions are “convex” with respect to the psd order. This amounts to the relation

𝑭(½𝑨 + ½𝑩) ≼ ½𝑭(𝑨) + ½𝑭(𝑩) for all 𝑨, 𝑩 ∈ 𝕄_𝑛^sa.  (12.1)

In other words, the function of an average is bounded by the average of the function values.
In this lecture, we will consider linear functions on matrices that are monotone with respect to the psd order. For a linear function 𝑭, the convexity relation (12.1) is always valid, so we will focus on monotonicity for now. In the next lecture, we will study nonlinear functions that are monotone or convex.
Exercise 12.7 (Positive maps are monotone). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a linear map. Show
that the following conditions are equivalent.
1. 𝚽 is positive.
2. 𝑨 < 𝑩 implies 𝚽(𝑨) < 𝚽(𝑩) for all 𝑨, 𝑩 ∈ 𝕄𝑛sa .
Exercise 12.9 (Unital, trace preserving). Show that a linear map 𝚽 : 𝕄_𝑛 → 𝕄_𝑘 is unital if and only if its adjoint 𝚽* : 𝕄_𝑘 → 𝕄_𝑛 is trace preserving. We equip matrices with the trace inner product ⟨𝑨, 𝑩⟩ = tr(𝑨*𝑩); adjoints of linear maps are defined with respect to this inner product.
Compare the definition of a unital linear map and a trace-preserving linear map with the analogous definitions for linear functions on vectors. Recall that these concepts arose in our study of majorization.
𝜑(𝑨) ≔ 𝑛^{−1} tr 𝑨, the normalized trace of 𝑨.
The normalized trace is positive and unital.
𝜑(𝑨) ≔ tr(𝑷𝑨) where 𝑷 ∈ 𝕄_𝑛.

The linear functional 𝜑 is positive if and only if 𝑷 is psd. It is unital if and only if tr 𝑷 = 1. Positive-semidefinite matrices with trace one are called density matrices.
Are there other distinguished positive linear functionals?
Example 12.11 (Matrix-valued positive linear maps). Next, we consider positive linear maps from matrices to matrices.

𝚽(𝑨) ≔ (tr 𝑨) I_𝑘

𝚽(𝑨) ≔ 𝑿*𝑨𝑿

is positive. Furthermore, if 𝑿 has orthonormal columns, then 𝚽 is unital. In this case, the conjugation operation extracts a principal submatrix (with respect to some basis). The conjugation map is the primitive example of a completely positive linear map; see Problem Set 3 for definitions and discussion.
4. Pinching. Let (𝑷_𝑖) be a family of orthogonal projectors in 𝕄_𝑛 that are mutually orthogonal (𝑷_𝑖 𝑷_𝑗 = 𝛿_{𝑖𝑗} 𝑷_𝑖) and that partition the identity (∑_𝑗 𝑷_𝑗 = I_𝑛). The function 𝚽 : 𝕄_𝑛 → 𝕄_𝑛 given by

𝚽(𝑨) ≔ ∑_𝑗 𝑷_𝑗 𝑨 𝑷_𝑗
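Here is a small NumPy illustration of a pinching (with an arbitrary coordinate partition chosen just for the example): it is unital, trace preserving, and positivity preserving.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
blocks = [np.arange(0, 2), np.arange(2, 3), np.arange(3, 6)]   # a partition of {0, ..., 5}

# Coordinate projectors P_i: mutually orthogonal and summing to the identity.
projectors = []
for idx in blocks:
    P = np.zeros((n, n))
    P[idx, idx] = 1.0
    projectors.append(P)

pinch = lambda A: sum(P @ A @ P for P in projectors)

G = rng.standard_normal((n, n))
A = G @ G.T                                                     # a psd test matrix

print(np.allclose(pinch(np.eye(n)), np.eye(n)))                 # unital
print(np.isclose(np.trace(pinch(A)), np.trace(A)))              # trace preserving
print(np.all(np.linalg.eigvalsh(pinch(A)) >= -1e-12))           # positivity preserved
```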
Example 12.12 (Tensors and positive linear maps). There are also several natural examples
associated with tensor product operations.
𝚽(𝑨) B 𝑩 ⊗ 𝑨
Exercise 12.13 (Cone of positive linear maps). Show that the class of positive linear maps
𝚽 : 𝕄𝑛 → 𝕄𝑘 is a closed convex cone.
Proposition 12.15 (Adjoints). Let 𝚽 be a positive linear map. Then 𝚽(𝑨 ∗ ) = 𝚽(𝑨) ∗ for
each matrix 𝑨 .
Proof. Each matrix 𝑨 ∈ 𝕄_𝑛 has a Cartesian decomposition

𝑨 = 𝑯 + i𝑺 where 𝑯, 𝑺 ∈ 𝕄_𝑛^sa.

The s.a. matrices 𝑯 and 𝑺 are given by the expressions 𝑯 = (𝑨 + 𝑨*)/2 and 𝑺 = (𝑨 − 𝑨*)/(2i). By linearity of 𝚽 and Proposition 12.14,

𝚽(𝑨*) = 𝚽(𝑯 − i𝑺) = 𝚽(𝑯) − i𝚽(𝑺) = (𝚽(𝑯) + i𝚽(𝑺))* = 𝚽(𝑨)*.
The matrix 𝑯 − 𝑩 ∗ 𝑨 −1 𝑩 is called the Schur complement of the block matrix with
respect to the block 𝑨 .
[ I  0 ; −𝑩*𝑨^{−1}  I ] [ 𝑨  𝑩 ; 𝑩*  𝑯 ] [ I  −𝑨^{−1}𝑩 ; 0  I ] = [ 𝑨  0 ; 0  𝑯 − 𝑩*𝑨^{−1}𝑩 ].  (12.2)
The right hand side of (12.2) is positive semidefinite. Apply the conjugation rule to
complete the proof.
Schur complements arise from partial Gaussian elimination, from Cholesky decom-
position, and from partial least-squares. They also describe conditioning of jointly
Gaussian random variables.
Warning 12.18 (Larger powers?). The analog of Kadison’s inequality is false for powers
greater than two. On the other hand, there is a version of Lyapunov’s inequality
that holds for higher powers.
As usual, the spectral projectors are mutually orthogonal and decompose the identity.
With this representation, the square 𝑨 2 satisfies
𝑨² = ∑_𝑗 𝜆_𝑗² 𝑷_𝑗.
By linearity, we have
𝚽(𝑨) = ∑_𝑗 𝜆_𝑗 𝚽(𝑷_𝑗) and 𝚽(𝑨²) = ∑_𝑗 𝜆_𝑗² 𝚽(𝑷_𝑗).
Note that 𝑷_𝑗 ≽ 0 implies that 𝚽(𝑷_𝑗) ≽ 0. Since 𝚽 is unital, we also have

∑_𝑗 𝚽(𝑷_𝑗) = 𝚽(∑_𝑗 𝑷_𝑗) = 𝚽(I) = I.
The matrices 𝚽(𝑷 𝑗 ) are also psd, and tensor products preserve positivity.
Now, by the Schur complement theorem, the Schur complement of the block I in the block matrix is also psd. That is, 𝚽(𝑨²) − 𝚽(𝑨)𝚽(𝑨) ≽ 0, which is Kadison’s inequality.
Note that this result only holds for positive definite matrices.
Warning 12.20 (Smaller powers?). The analog of Choi’s inequality is false for powers
that are less than −1.
The rest of the details are similar to the proof of Kadison’s inequality.
Example 12.21 (Diagonals). Using the theorems above, we have the following results.
This result follows from the Choi inequality because diag is strictly positive and
unital.
You may wish to develop analogous results for other unital positive linear maps from
our list of examples.
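For instance, taking 𝚽 = diag, a quick NumPy check (with a random positive-definite test matrix; an illustration only) confirms both the Kadison-type and the Choi-type inequality for the diagonal map.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5

G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)             # positive-definite test matrix

is_psd = lambda M: np.all(np.linalg.eigvalsh((M + M.T) / 2) >= -1e-10)
Phi = lambda M: np.diag(np.diag(M))     # the diagonal map is unital and positive

print(is_psd(Phi(A @ A) - Phi(A) @ Phi(A)))                      # Kadison: Phi(A)^2 <= Phi(A^2)
print(is_psd(Phi(np.linalg.inv(A)) - np.linalg.inv(Phi(A))))     # Choi: Phi(A)^{-1} <= Phi(A^{-1})
```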
‖𝚽‖ ≔ sup{‖𝚽(𝑨)‖ : ‖𝑨‖ ≤ 1} = 1,
Corollary 12.23 (Russo–Dye). Let 𝚽 be a positive linear map. Then k𝚽k = k𝚽( I)k .
We will prove these results in the upcoming subsections. There are many interesting
applications of these results. See Problem Set 3 for examples.
12.6.1 Contractions
To begin, we need a basic fact from operator theory.
Proposition 12.24 (Contractions). Each contraction 𝑲 ∈ 𝕄_𝑛 can be written as an average of two unitary matrices: 𝑲 = ½(𝑼 + 𝑽) for some unitary 𝑼, 𝑽 ∈ 𝕄_𝑛. (A contraction is a matrix 𝑲 ∈ 𝕄_𝑛 that satisfies ‖𝑲‖ ≤ 1.)
For this purpose, recall that each unitary matrix has a spectral resolution where the
eigenvalues are complex numbers with modulus one. By the Schur complement
theorem, we have
𝚽(𝑼)*𝚽(𝑼) ≼ I.
This is equivalent to k𝚽(𝑼 ) k ≤ 1.
For a general matrix 𝑨 ∈ 𝕄_𝑛 with ‖𝑨‖ ≤ 1, note that 𝑨 is a contraction. Thus, we can write

𝑨 = ½(𝑼 + 𝑽)
for some unitary matrices 𝑼 and 𝑽 . By linearity of 𝚽 and the triangle inequality for
the norm, we have
‖𝚽‖ = sup{‖𝚽(𝑨)‖ : ‖𝑨‖ ≤ 1} ≤ 1.
To finish, note that 𝚽( I) = I because 𝚽 is unital. Therefore, we may conclude that
‖𝚽‖ ≥ ‖𝚽(I)‖ = 1.
By continuity of the norm, we can let 𝜀 ↓ 0 to conclude that ‖𝚽‖ = ‖𝚽(I)‖ for any positive linear map 𝚽.
Notes
This lecture is adapted from Bhatia’s book on positive-definite matrices [Bha07b,
Chap. 2].
Lecture bibliography
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
13. Matrix Monotonicity and Convexity
Agenda:
1. Monotonicity and Convexity
2. Examples
3. Matrix Jensen

In the last lecture, we examined positive linear maps, a class of matrix-valued linear maps that respect the positive-semidefinite order. Now, we go beyond linear maps, establishing notions of monotonicity and convexity for nonlinear matrix-valued functions with respect to the semidefinite order. It is found that these properties are more restrictive than their scalar counterparts. Yet, this restriction brings with it considerable added structure. In particular, we demonstrate that matrix convex functions satisfy a remarkable generalization of Jensen’s inequality to non-commutative “matrix convex combinations.”
Note that standard matrix functions are unitarily equivariant. That is, if 𝑼 ∈ 𝕄𝑛 is
unitary, then 𝑓 (𝑼 ∗ 𝑨𝑼 ) = 𝑼 ∗ 𝑓 (𝑨)𝑼 .
Proof sketch. The first two properties can be verified easily. The third property requires
use of the fact that the cone of psd matrices is closed.
Exercise 13.6 (Matrix convexity). Verify analogous properties for the class of matrix convex
functions. That is, characterize the scalar case, and check that the class of matrix
convex functions forms a closed, convex cone.
13.2 Examples
In this section, we detail a variety of functions that are (or are not) matrix monotone
and convex.
Clearly 𝑨 ≼ 𝑩. Yet

𝑨² = [ 2  2 ; 2  2 ] ⋠ [ 5  3 ; 3  2 ] = 𝑩².
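The failure of monotonicity is easy to see numerically. Here is a minimal NumPy check using one standard choice of a pair with exactly the squares displayed above (the notes’ own example pair may be written differently).

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0]])
B = np.array([[2.0, 1.0], [1.0, 1.0]])

print(np.linalg.eigvalsh(B - A))          # [0, 1]: nonnegative, so A <= B
print(np.linalg.eigvalsh(B @ B - A @ A))  # one negative eigenvalue, so A^2 is not <= B^2
```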
On the other hand, the square function 𝑓 is matrix convex on the entire real line. More generally, the quadratics 𝑓(𝑡) = 𝛼𝑡² + 𝛽𝑡 + 𝛾 with 𝛼 ≥ 0 and 𝛽, 𝛾 ∈ ℝ make up the full class of matrix convex functions on ℝ.
We prove the matrix convexity of the square function in the next proposition.
Proposition 13.7 (Square is matrix convex). The square function 𝑡 ↦→ 𝑡 2 is matrix convex
on ℝ.
𝑨𝑩 + 𝑩𝑨 ≼ 𝑨² + 𝑩².  (13.2)
where the semidefinite relation follows via (13.2). This calculation establishes midpoint
convexity.
Exercise 13.8 (Square: Matrix convexity). Extend the preceding proof to obtain general
matrix convexity of the square function. That is, establish that
(𝜏𝑨 + 𝜏̄𝑩)² ≼ 𝜏𝑨² + 𝜏̄𝑩² for all 𝜏 ∈ [0, 1],

where 𝜏̄ = 1 − 𝜏.
13.2.3 Inverse
The inverse function 𝑓 (𝑡 ) = 𝑡 −1 is matrix convex on ℝ++ B ( 0, ∞) , and its negative
𝑔 (𝑡 ) = −𝑡 −1 is matrix monotone on ℝ++ . We prove these properties separately in the
next two propositions.
Proposition 13.9 (Inverse: Monotonicity). The negative inverse function 𝑡 ↦→ −𝑡 −1 is matrix
monotone on ℝ++ .
(𝜏𝑨 + 𝜏̄𝑩)^{−1} ≼ 𝜏𝑨^{−1} + 𝜏̄𝑩^{−1},

which establishes matrix convexity.
Fact 13.12 (Powers: Convexity). The following power functions (and only these) are matrix
convex.
• The power function 𝑓 (𝑡 ) = 𝑡 𝑝 is matrix convex on ℝ+ for 𝑝 ∈ [ 1, 2] .
• The power function 𝑓 (𝑡 ) = −𝑡 𝑝 is matrix convex on ℝ+ for 𝑝 ∈ [ 0, 1] .
• The power function 𝑓 (𝑡 ) = 𝑡 𝑝 is matrix convex on ℝ++ for 𝑝 ∈ [−1, 0] .
These results follow as a consequence of Fact 13.11 by general considerations that will
be discussed in Lecture 15.
13.2.5 Logarithm
The function 𝑓 (𝑡 ) = log 𝑡 is matrix monotone and concave on ℝ++ . We leave the proof
of this fact as a series of exercises.
Exercise 13.13 (Logarithm: Integral representation). For each 𝑎 > 0, verify that

log 𝑎 = ∫_0^∞ [(1 + 𝜆)^{−1} − (𝑎 + 𝜆)^{−1}] d𝜆.
Show that this integral representation extends to positive definite matrices via capitalization. For any positive definite 𝑨 ≻ 0,

log 𝑨 = ∫_0^∞ [(1 + 𝜆)^{−1} I − (𝑨 + 𝜆I)^{−1}] d𝜆.
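Since (𝑨 + 𝜆I)^{−1} diagonalizes in the same eigenbasis as 𝑨, the matrix identity reduces to the scalar identity applied to each eigenvalue. The following NumPy/SciPy sketch (with a random positive-definite test matrix, illustrative only) verifies it that way.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(7)
n = 4

G = rng.standard_normal((n, n))
A = G @ G.T + np.eye(n)                     # positive-definite test matrix
w, V = np.linalg.eigh(A)

def log_via_integral(a):
    # integral_0^infty [(1 + lam)^{-1} - (a + lam)^{-1}] d lam
    val, _ = quad(lambda lam: 1.0 / (1.0 + lam) - 1.0 / (a + lam), 0.0, np.inf)
    return val

logA = V @ np.diag([log_via_integral(a) for a in w]) @ V.T    # "capitalized" integral
print(np.allclose(logA, V @ np.diag(np.log(w)) @ V.T))        # agrees with log A
```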
Exercise 13.14 (Inverse: Monotonicity and Convexity). For 𝜆 ≥ 0, show that the map
𝑓 (𝑡 ) = (𝜆 + 𝑡 ) −1 is matrix convex on ℝ++ , and that its negative −𝑓 is matrix monotone
on ℝ++ . Hint: The arguments follow those for the inverse.
Exercise 13.15 (Logarithm: Monotonicity and Concavity). Show that the logarithm 𝑓(𝑡) = log 𝑡 is matrix monotone and matrix concave on ℝ₊₊. Hint: Apply the previous exercises and the fact that the matrix monotone and matrix convex functions form convex cones that are closed under pointwise limits.
13.2.6 Entropy
The negative entropy 𝑓 (𝑡 ) = 𝑡 log 𝑡 is matrix convex on ℝ+ . We leave the proof of this
fact as an exercise.
Exercise 13.16 (Entropy: Convexity). Prove that the negative entropy is matrix convex on ℝ₊. Hint: Use the identity

𝑡 log 𝑡 = ∫_0^∞ [𝑡(1 + 𝜆)^{−1} − 𝑡(𝑡 + 𝜆)^{−1}] d𝜆

and follow the route charted by Exercises 13.13, 13.14, and 13.15.
13.2.7 Exponential
The exponential 𝑓 (𝑡 ) = e𝑡 is neither matrix monotone nor matrix convex on any
interval. The exponential’s failure to be matrix monotone or convex implies that the
classes of matrix monotone and convex functions are strictly smaller than their scalar
counterparts.
𝑓(𝜏𝑎 + 𝜏̄𝑏) ≤ 𝜏𝑓(𝑎) + 𝜏̄𝑓(𝑏) for all 𝜏 ∈ [0, 1]

and all 𝑎, 𝑏 ∈ I, where 𝜏̄ = 1 − 𝜏, we can obtain Jensen’s inequality. For any collection of 𝑎_𝑖 ∈ I and weights (𝑝_𝑖)_{𝑖=1}^{𝑚} with ∑_{𝑖=1}^{𝑚} 𝑝_𝑖 = 1 and 𝑝_𝑖 ≥ 0, we have

𝑓(∑_{𝑖=1}^{𝑚} 𝑝_𝑖 𝑎_𝑖) ≤ ∑_{𝑖=1}^{𝑚} 𝑝_𝑖 𝑓(𝑎_𝑖).
The remarkable main result of this section is that matrix convexity is self-improving
to an extent far beyond the scalar case. That is, matrix convexity of a function
𝑓 : I → ℝ implies a more general form of Jensen’s inequality that holds for arbitrary
matrix convex combinations [HP82; HP03].
Theorem 13.19 (Matrix Jensen inequality; Hansen–Pedersen 1982, 2003). Fix a matrix
convex function 𝑓 : I → ℝ. Let (𝑨 𝑖 )𝑖𝑚=1 be a collection of matrices, each residing
in ℍ_𝑛(I), and let (𝑲_𝑖)_{𝑖=1}^{𝑚} consist of matrices in 𝕄_𝑛 that satisfy the normalization condition ∑_{𝑖=1}^{𝑚} 𝑲_𝑖*𝑲_𝑖 = I. Then

𝑓(∑_{𝑖=1}^{𝑚} 𝑲_𝑖*𝑨_𝑖𝑲_𝑖) ≼ ∑_{𝑖=1}^{𝑚} 𝑲_𝑖* 𝑓(𝑨_𝑖) 𝑲_𝑖.
Proof of Theorem 13.19. We present the proof in the case that 𝑚 = 2. That is, we will
prove that
𝑓 (𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ) 4 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2
when 𝑲 1 ∗ 𝑲 1 + 𝑲 2 ∗ 𝑲 2 = I. The case of general 𝑚 follows a similar argument, and in
fact follows as a corollary of this case.
To begin, we introduce the block matrix

𝑨 ≔ [ 𝑨₁  0 ; 0  𝑨₂ ] ∈ ℍ_{2𝑛}(I).
The key idea in the proof is to lift our attention to this block matrix. We will reinterpret
the matrix convex combinations in the matrix Jensen inequality in terms of operations
on the block matrix that involve simple averages, unitary conjugation, and positive
linear maps. By this mechanism, we can access the definition of matrix convexity and
exploit the unitary equivariance of standard matrix functions.
We preface the proof with four tricks that will be employed in the execution of the
proof. First, note that it is straightforward to apply a standard matrix function to a
block diagonal matrix. If 𝑻 , 𝑴 ∈ ℍ𝑛 ( I) and 𝑓 : I → ℝ, we simply have
𝑓([ 𝑻  0 ; 0  𝑴 ]) = [ 𝑓(𝑻)  0 ; 0  𝑓(𝑴) ].  (13.4)
In view of this fact, it is helpful to develop methods for extracting the block-diagonal
part of a block matrix.
Indeed, we can represent the block diagonal pinching of a matrix as a simple
average of unitary conjugates. That is, defining the unitary block diagonal matrix
𝑼 ≔ [ I  0 ; 0  −I ],
𝑸*𝑨𝑸 = [ 𝑲₁*𝑨₁𝑲₁ + 𝑲₂*𝑨₂𝑲₂  ∗ ; ∗  ∗ ],  (13.6)

where the asterisks indicate blocks of the matrix that are irrelevant to the argument.
Last, recall that the map [·] 11 that extracts the top left ( 1, 1) block of a block matrix
is a positive linear map, and hence it preserves the positive-semidefinite order. For
our purposes, this map extracts the top left block of 𝑸 ∗ 𝑨𝑸 specified in (13.6). For
example, it holds that
[𝑸 ∗ 𝑨𝑸 ] 11 = 𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ;
[𝑸 ∗ 𝑓 (𝑨)𝑸 ] 11 = 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2 .
The second relation exploits the fact that 𝑨 is a block diagonal matrix so we can
identify 𝑓 (𝑨) .
Armed with these tricks, we may now proceed with the proof. We will maintain
the quantities of interest in the (1, 1) block of the block matrix, while the remaining
blocks remain at our discretion. To begin, observe that
𝑓 (𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ) = 𝑓 ([𝑸 ∗ 𝑨𝑸 ] 11 )
by the identity (13.6). Since each block matrix in the pinching identity (13.5) has the
same first diagonal block, it then follows that
𝑓([𝑸*𝑨𝑸]₁₁) = 𝑓([½ 𝑸*𝑨𝑸 + ½ 𝑼*(𝑸*𝑨𝑸)𝑼]₁₁) = [𝑓(½ 𝑸*𝑨𝑸 + ½ 𝑼*(𝑸*𝑨𝑸)𝑼)]₁₁,

where the second equality holds because the pinched matrix is block diagonal, so the rule (13.4) applies. Now, matrix convexity of 𝑓 gives 𝑓(½ 𝑸*𝑨𝑸 + ½ 𝑼*(𝑸*𝑨𝑸)𝑼) ≼ ½ 𝑓(𝑸*𝑨𝑸) + ½ 𝑓(𝑼*(𝑸*𝑨𝑸)𝑼). This inequality transmits through the positive linear map [·]₁₁. Thus,

[𝑓(½ 𝑸*𝑨𝑸 + ½ 𝑼*(𝑸*𝑨𝑸)𝑼)]₁₁ ≼ [½ 𝑓(𝑸*𝑨𝑸) + ½ 𝑓(𝑼*(𝑸*𝑨𝑸)𝑼)]₁₁.
By the unitary equivariance of standard matrix functions,

[½ 𝑓(𝑸*𝑨𝑸) + ½ 𝑓(𝑼*(𝑸*𝑨𝑸)𝑼)]₁₁ = [½ 𝑸*𝑓(𝑨)𝑸 + ½ 𝑼*𝑸*𝑓(𝑨)𝑸𝑼]₁₁.
To complete the argument, we reverse our course to undo each of the steps. Once
more, using the fact that each block matrix in the pinching identity (13.5) has the same
first diagonal block, we conclude that
𝑓 (𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ) 4 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2 ,
which is the desired result.
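A direct numerical check of the two-term case is straightforward. The sketch below (NumPy, with 𝑓(𝑡) = 𝑡², which is matrix convex on ℝ, and randomly generated data chosen for illustration) verifies the inequality for one matrix convex combination.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4

is_psd = lambda M: np.all(np.linalg.eigvalsh((M + M.T) / 2) >= -1e-10)

def psd_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# Hermitian arguments and a normalized pair K1, K2 with K1* K1 + K2* K2 = I.
A1 = rng.standard_normal((n, n)); A1 = (A1 + A1.T) / 2
A2 = rng.standard_normal((n, n)); A2 = (A2 + A2.T) / 2
M = rng.standard_normal((n, n))
K1 = M / (2 * np.linalg.norm(M, 2))        # ensure ||K1|| < 1
K2 = psd_sqrt(np.eye(n) - K1.T @ K1)       # then K1* K1 + K2* K2 = I

f = lambda X: X @ X                        # f(t) = t^2 is matrix convex on R
lhs = f(K1.T @ A1 @ K1 + K2.T @ A2 @ K2)
rhs = K1.T @ f(A1) @ K1 + K2.T @ f(A2) @ K2
print(is_psd(rhs - lhs))                   # the matrix Jensen inequality holds
```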
Notes
This lecture is adapted from Bhatia’s book [Bha97, Chap. V] and from the instructor’s
monograph [Tro15] on matrix concentration.
The proof of the matrix Jensen inequality is drawn from Hansen & Pedersen’s
second paper [HP03]. Using a similar type of argument, Davis [Dav57] had long since
proved a weaker version of the matrix Jensen inequality (Theorem 13.19) under the
extra assumptions that 0 ∈ I and 𝑓 ( 0) = 0. By bringing in deeper tools, Choi [Cho74]
strengthened the result further by removing the conditions, and he extended it to a
wider class of averaging operations.
The significance of the Hansen–Pedersen result is that it admits a direct (although
clever) proof. They developed this argument as the first step in their proof of Loewner’s
integral theorem, a deep result on matrix monotone and matrix convex functions.
In fact, Loewner’s theorem is the key ingredient in the proof of Choi’s generalization.
We will return to these matters in Lecture 15.
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Cho74] M.-D. Choi. “A Schwarz inequality for positive linear maps on 𝐶 ∗ -algebras”. In:
Illinois Journal of Mathematics 18.4 (1974), pages 565 –574. doi: 10.1215/ijm/
1256051007.
[Dav57] C. Davis. “A Schwarz inequality for convex operator functions”. In: Proc. Amer.
Math. Soc. 8 (1957), pages 42–44. doi: 10.2307/2032808.
[HP82] F. Hansen and G. K. Pedersen. “Jensen’s Inequality for Operators and Löwner’s
Theorem.” In: Mathematische Annalen 258 (1982), pages 229–241.
[HP03] F. Hansen and G. K. Pedersen. “Jensen’s Operator Inequality”. In: Bulletin of the
London Mathematical Society 35.4 (2003), pages 553–564.
[Kra36] F. Kraus. “Über konvexe Matrixfunktionen.” ger. In: Mathematische Zeitschrift 41
(1936), pages 18–42. url: http://eudml.org/doc/168648.
[Löw34] K. Löwner. “Über monotone Matrixfunktionen”. In: Mathematische Zeitschrift 38
(1934), pages 177–216. url: http://eudml.org/doc/168495.
[Tro15] J. A. Tropp. “An Introduction to Matrix Concentration Inequalities”. In: Foundations
and Trends in Machine Learning 8.1-2 (2015), pages 1–230.
14. Monotonicity: Differential Characterization
14.1 Recap
The previous two lectures introduced matrix monotonicity and matrix convexity, first
in the context of positive linear maps (i.e., linear functions on matrices) and then in
the context of standard matrix functions (which have a more rigid structure but are
nonlinear). We obtained important convexity inequalities, such as Choi’s inequality for
positive linear maps and the matrix Jensen inequality for matrix convex functions.
To set the stage for this lecture, we next recall the following definitions. For an interval I ⊆ ℝ, we define ℍ_𝑛(I) ≔ {𝑨 ∈ ℍ_𝑛 : 𝜆_𝑖(𝑨) ∈ I for all 𝑖}. This set I usually serves as the domain of the standard matrix functions we will consider.

Definition 14.1 (Matrix monotonicity; Loewner 1934). A function 𝑓 : I → ℝ is matrix monotone on I if

𝑨 ≼ 𝑩 implies 𝑓(𝑨) ≼ 𝑓(𝑩)

for every 𝑨, 𝑩 ∈ ℍ_𝑛(I) and every 𝑛 ∈ ℕ. The action of 𝑓 on a self-adjoint matrix is interpreted in the sense of standard matrix functions.
Intuitively, the first divided difference records the slopes of secants and tangents to 𝑓. Tabulating the bivariate function 𝑓^{[1]} at a set of points produces Loewner’s matrix.

Definition 14.4 (Loewner matrix). Let 𝑓 : I → ℝ be continuously differentiable on the open interval I ⊆ ℝ. For 𝑛 ∈ ℕ, let 𝚲 = diag(𝜆₁, …, 𝜆_𝑛) ∈ ℍ_𝑛(I) be any diagonal matrix with entries in I. Then the Loewner matrix of 𝑓 at 𝚲 is

𝑳_𝑓(𝚲) ≔ 𝑓^{[1]}(𝚲) ≔ [𝑓^{[1]}(𝜆_𝑖, 𝜆_𝑗)]_{𝑖,𝑗=1,…,𝑛} ∈ ℍ_𝑛.

This is essentially the “kernel matrix” of the function 𝑓^{[1]} on the dataset of points (𝜆_𝑘 : 𝑘 = 1, …, 𝑛). This observation leads to connections with the theory of positive-definite functions, discussed in Lecture 18.
The first divided difference and Loewner’s matrix play a key role in generalizing scalar
differential characterizations of monotonicity and convexity to the matrix setting.
We see that 𝑳_𝑓 in the matrix setting plays the same role as does 𝑓′ in the scalar setting.
We will prove Loewner’s theorem later in Section 14.4.
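Loewner matrices are cheap to tabulate. The helper below is a NumPy sketch (the function name and interface are ad hoc, chosen for illustration); for the matrix monotone square root on the positive reals, the resulting matrix is psd, as Loewner’s theorem predicts.

```python
import numpy as np

def loewner_matrix(f, fprime, lam):
    """Tabulate the first divided differences f^[1](lam_i, lam_j) on the points lam."""
    lam = np.asarray(lam, dtype=float)
    L = np.empty((len(lam), len(lam)))
    for i, a in enumerate(lam):
        for j, b in enumerate(lam):
            L[i, j] = fprime(a) if np.isclose(a, b) else (f(a) - f(b)) / (a - b)
    return L

# The square root is matrix monotone on the positive reals, so its Loewner matrix is psd.
L = loewner_matrix(np.sqrt, lambda t: 0.5 / np.sqrt(t), [0.5, 1.0, 2.0, 5.0])
print(np.linalg.eigvalsh(L))   # all nonnegative
```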
There is an analogous result [AV95] for matrix convexity, albeit established much
later than Loewner’s result.
𝜆 ↦ 𝑔_𝜇(𝜆) ≔ 𝑓^{[1]}(𝜇, 𝜆)

is matrix monotone.
The proof may be found in [Bha97, Theorem V.3.10].
‖𝑭(𝑨 + 𝑯) − 𝑭(𝑨) − D𝑭(𝑨)𝑯‖ / ‖𝑯‖ → 0

as 𝑯 → 0 in any norm on ℍ_𝑛, provided that such an object D𝑭(𝑨) exists.
Note that although the Fréchet derivative of 𝑭 at a point 𝑨 is linear, the mapping
D𝑭 : 𝑨 ↦→ D𝑭 (𝑨)
is nonlinear in general. This parallels the usual notion of scalar derivative from calculus. Indeed, for 𝑓 : I → ℝ continuously differentiable, 𝑓′ is a nonlinear function, but the Fréchet derivative

D𝑓(𝑡) : ℎ ↦ 𝑓′(𝑡)ℎ

is just the linear operator of scalar multiplication by 𝑓′(𝑡), for 𝑡 ∈ I, which can be identified with 𝑓′(𝑡) itself.
Exercise 14.10 (Derivative). Show that, if it exists, the Fréchet derivative is unique.
Exercise 14.11 (Linear map). Show that if 𝑭 : 𝑨 ↦→ 𝚽𝑨 is linear, that is, 𝚽 : ℍ𝑛 → ℍ𝑛 is
a linear map, then D𝑭 (𝑨) = 𝚽 for every 𝑨 ∈ ℍ𝑛 .
It is useful to relate the Fréchet derivative to another kind of derivative, the Gâteaux derivative, that generalizes directional derivatives to Banach space. Indeed, if it exists, the Fréchet derivative of 𝑭 at 𝑨 parametrizes all the Gâteaux derivatives of 𝑭 at 𝑨 in the sense that

D𝑭(𝑨)𝑯 = (d/d𝑡) 𝑭(𝑨 + 𝑡𝑯) |_{𝑡=0}  (14.1)

for every 𝑯. The right-hand side is called the Gâteaux derivative of 𝑭 at 𝑨 in the direction 𝑯. Aside: If 𝑭 is real-valued, this is usually called the first variation or variational derivative in the calculus of variations.
Warning 14.12 (Directional derivatives). The converse is false. A map can be Gâteaux
differentiable in every direction without being Fréchet differentiable. This phe-
nomenon already arises for functions on ℝ2 .
The following lemma gives the derivative of a polynomial standard matrix function
and is used in a limiting argument in the proof of the Daleckii–Krein formula.
Lemma 14.14 (Polynomial: Derivative). Let 𝑓 : ℝ → ℝ be a polynomial. Then for 𝑯 ∈ ℍ𝑛
and diagonal 𝚲 ∈ ℍ𝑛 ,
D𝑓(𝚲)𝑯 = 𝑓^{[1]}(𝚲) ⊙ 𝑯,  (14.4)

where ⊙ denotes the entrywise (Schur) product.
Proof. Notice that both sides of (14.4) are linear in 𝑓 since differentiation and point
evaluation are linear operations. Hence, without loss of generality, we may consider
the reduction to the monomial 𝑓 : 𝑥 ↦→ 𝑥 𝑝 for each 𝑝 ∈ ℤ+ .
To this end, recall the algebraic identity
𝑩^𝑝 − 𝑪^𝑝 = ∑_{𝑘=0}^{𝑝−1} 𝑩^𝑘 (𝑩 − 𝑪) 𝑪^{𝑝−1−𝑘}
𝑓_𝑛^{[1]}(𝜆_𝑖, 𝜆_𝑗) → 𝑓^{[1]}(𝜆_𝑖, 𝜆_𝑗) as 𝑛 → ∞, and hence

[D𝑓_𝑛(𝚲)𝑯]_{𝑖𝑗} = 𝑓_𝑛^{[1]}(𝜆_𝑖, 𝜆_𝑗) [𝑯]_{𝑖𝑗} → 𝑓^{[1]}(𝜆_𝑖, 𝜆_𝑗) [𝑯]_{𝑖𝑗} as 𝑛 → ∞
for each 𝑖 , 𝑗 . Since the convergence holds entrywise, it also holds in any UI matrix
norm by a characteristic polynomial argument. By [Sim19, Lemma 5.5(a)], the Gâteaux
derivative of 𝑓 exists and equals the limit of the left-hand side above, which yields
(14.2) if it can be shown that the Fréchet derivative exists. This step is found in the
proof of [Bha97, Theorem V.3.3].
Equation (14.3) follows from unitary equivariance of standard matrix functions.
More explicitly, it is easy to verify using the definition of polynomial standard matrix
functions and the calculation in the proof of Lemma 14.14 that
D𝑓_𝑛(𝑨)𝑯 = 𝑼 [𝑓_𝑛^{[1]}(𝚲) ⊙ (𝑼*𝑯𝑼)] 𝑼*.
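The Daleckii–Krein formula can also be verified by a finite-difference experiment. The NumPy sketch below (with 𝑓 = exp and randomly generated data; purely illustrative) compares a difference quotient against the Schur product 𝑓^{[1]}(𝚲) ⊙ 𝑯.

```python
import numpy as np

rng = np.random.default_rng(9)
lam = np.array([0.3, 0.9, 1.7, 2.4, 3.1])
n = len(lam)
Lam = np.diag(lam)
H = rng.standard_normal((n, n)); H = (H + H.T) / 2

def expm_sym(M):                     # standard matrix function via eigendecomposition
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.exp(w)) @ V.T

# Loewner matrix of exp at Lam: f^[1](a, b) = (e^a - e^b)/(a - b), with f'(a) on the diagonal.
L = np.array([[np.exp(a) if i == j else (np.exp(a) - np.exp(b)) / (a - b)
               for j, b in enumerate(lam)] for i, a in enumerate(lam)])

t = 1e-7
finite_diff = (expm_sym(Lam + t * H) - expm_sym(Lam)) / t
print(np.allclose(finite_diff, L * H, atol=1e-4))   # Df(Lam)H = f^[1](Lam) (Schur) H
```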
𝚲 + 𝑡𝑯 ≽ 𝚲 for 𝑡 ≥ 0
by definition of the psd order. Since I is open and the eigenvalue map 𝝀(·) is continuous,
there exists 𝛿 > 0 such that 𝚲 + 𝑡 𝑯 ∈ ℍ𝑛 ( I) for all 0 ≤ 𝑡 < 𝛿 . By matrix monotonicity
of 𝑓 on I,
𝑓 (𝚲 + 𝑡 𝑯 ) < 𝑓 (𝚲) when 0 ≤ 𝑡 < 𝛿 .
Since the psd cone is closed, we further deduce
D𝑓(𝚲)𝑯 = lim_{𝑡↓0} [𝑓(𝚲 + 𝑡𝑯) − 𝑓(𝚲)]/𝑡 ≽ 0,
where we are allowed to equate the Fréchet and Gâteaux derivatives of 𝑓 above because
existence of the Fréchet derivative is guaranteed by Theorem 14.13. Again by the
Daleckii–Krein formula,
0 ≼ D𝑓(𝚲)𝑯 = 𝑓^{[1]}(𝚲) ⊙ 𝑯 = 𝑓^{[1]}(𝚲).
The second equality is the chain rule; the third equality is from the Daleckii–Krein formula (14.3); and the last inequality follows from the Schur product theorem. Indeed, by hypothesis, 𝑓^{[1]}(𝚲_𝜏) ≽ 0, while 𝑼_𝜏* 𝑨̇_𝜏 𝑼_𝜏 ≽ 0 by the conjugation rule. Therefore, 𝑓^{[1]}(𝚲_𝜏) ⊙ (𝑼_𝜏* 𝑨̇_𝜏 𝑼_𝜏) ≽ 0. This is what we needed to show.
14.5 Examples
We conclude with two illustrative examples where Loewner’s theorem may be applied.
Example 14.16 (Rational function). For I ⊆ ℝ and real coefficients 𝑎, 𝑏, 𝑐, 𝑑 ∈ ℝ such that 𝑎𝑑 − 𝑏𝑐 > 0 and −𝑑/𝑐 ∉ I, define the rational function 𝑓 : I → ℝ by

𝑡 ↦ 𝑓(𝑡) ≔ (𝑎𝑡 + 𝑏)/(𝑐𝑡 + 𝑑).

Aside: Such rational functions arise, for example, in the aptly named Loewner framework for interpolatory model reduction of dynamical systems [Ben+17, Chapter 8].

Notice that 𝑓 is continuous on I, and

𝑡 ↦ 𝑓′(𝑡) = (𝑎𝑑 − 𝑏𝑐)/(𝑐𝑡 + 𝑑)²

is also continuous on I. Hence Loewner’s result (Theorem 14.5) applies. For 𝚲 = diag(𝜆₁, …, 𝜆_𝑛) and 𝑛 ∈ ℕ, the Loewner matrix of 𝑓 is

𝑓^{[1]}(𝚲) = [(𝑎𝑑 − 𝑏𝑐)/((𝑐𝜆_𝑖 + 𝑑)(𝑐𝜆_𝑗 + 𝑑))]_{𝑖,𝑗=1,…,𝑛}.
Observe that

𝑓^{[1]}(𝚲) = 𝑺*(11*)𝑺 = (1*𝑺)*(1*𝑺) ≽ 0.
We may conclude that the rational function 𝑓 is matrix monotone on I.
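The conclusion can also be checked directly on matrices. Here is a small NumPy sketch with one admissible choice of coefficients and random psd test matrices; it is an illustration only.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5

is_psd = lambda M: np.all(np.linalg.eigvalsh((M + M.T) / 2) >= -1e-10)

# f(t) = (2t + 1)/(t + 3): here ad - bc = 5 > 0 and -d/c = -3 lies outside I = R_+.
def f(A):
    I = np.eye(A.shape[0])
    return (2 * A + I) @ np.linalg.inv(A + 3 * I)

G, P = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A = G @ G.T                       # psd
B = A + P @ P.T                   # B >= A in the psd order

print(is_psd(f(B) - f(A)))        # f is matrix monotone on the positive reals
```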
Using Loewner’s theorem, by verifying that the Loewner matrix is psd we were able to conclude matrix monotonicity (in this case, of rational functions). The reverse direction is also of interest, where monotonicity can be informative about positive semidefiniteness. This observation is useful in the study of kernel methods and positive-definite functions [Bha07b, Chapter 5].
Example 14.17 (Logarithm). Since 𝑡 ↦ log 𝑡 is continuously differentiable and matrix monotone on (0, ∞), it follows by Theorem 14.5 that its Loewner matrix log^{[1]}(𝚲) is psd for every diagonal 𝚲 ∈ ℍ_𝑛((0, ∞)) and every 𝑛 ∈ ℕ.
Notes
This lecture is adapted from Bhatia’s book [Bha97, Chap. V] with some elements
from [Bha07b, Chap. 5].
Lecture bibliography
[AV95] J. S. Aujla and H. Vasudeva. “Convex and monotone operator functions”. In: Annales
Polonici Mathematici. Volume 62. 1. 1995, pages 1–11.
[Ben+17] P. Benner et al. Model reduction and approximation: theory and algorithms. SIAM,
2017.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[CP77] “CHAPTER 6. Calculus in Banach Spaces”. In: Functional Analysis in Modern Applied
Mathematics. Volume 132. Mathematics in Science and Engineering. Elsevier, 1977,
pages 87–105. doi: https://doi.org/10.1016/S0076-5392(08)61248-5.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
15. Monotonicity: Integral Characterization
Agenda:
1. Integral representation of matrix monotone functions
2. Uniqueness
3. Geometric approach
4. Monotonicity on the positive reals
5. Filtering with a positive linear map
6. Integral representation of matrix convex functions

In the last lecture, we showed that smooth matrix monotone functions are characterized by a differential property. The Loewner matrix, which packages divided differences of the function, must be a psd matrix. In this lecture, we will discuss a complementary characterization of matrix monotone functions as those that can be expressed using a certain integral representation. This approach allows us to identify several new examples of matrix monotone functions. Moreover, the description yields some general theorems on matrix convexity, including some matrix Lyapunov inequalities that go well beyond the Kadison and Choi inequalities.
15.1 Recap
Last time, we developed characterizations of continuously differentiable matrix mono-
tone and matrix convex functions in terms of derivative properties. Let I ⊂ ℝ be an
open interval. Each continuously differentiable function 𝑓 : I → ℝ induces a family of
Loewner matrices. For a diagonal matrix 𝚲 ∈ ℍ𝑛 ( I) and each 𝑛 ∈ ℕ, we define
𝑳_𝑓(𝚲) ≔ 𝑓^{[1]}(𝚲) ≔ [𝑓^{[1]}(𝜆_𝑗, 𝜆_𝑘)]_{𝑗,𝑘=1,…,𝑛}.
Since the function 𝑡 ↦→ −(𝑡 + 𝜆) −1 is matrix monotone on ℝ++ , the integral repre-
sentation shows that the logarithm is the limit of positive sums of matrix monotone
functions. Since the cone of matrix monotone functions is closed under pointwise
limits, the logarithm also is matrix monotone.
This argument may seem concocted. In fact, this approach is based on a penetrating
insight that every matrix monotone function admits an integral representation. This
is a famous result of Loewner, which is regarded as one of the deepest theorems in
matrix analysis. We present the result and some immediate consequences.
The claim (2) is called the “easy” direction of Loewner’s theorem. We verify this
result in Section 15.2.2. As we will see, the easy direction already offers a powerful
tool. Section 15.2.3 contains several examples.
Meanwhile, the claim (1) is called the “hard” direction of Loewner’s theorem. In
particular, this result implies that every matrix monotone function 𝑓 on (−1, +1)
is analytic in the interval (−1, +1) . This fact is striking because scalar monotone
functions do not even need to be continuous. Introducing deeper ideas from complex
analysis, one may prove that the function 𝑓 continues analytically into the upper
half-plane, where it defines a Herglotz function; see [Bha97, Chap. V] or [Sim19] for
more discussion about this point.
Simon [Sim19] outlines 11 different proofs (!) of the hard part of Loewner’s theorem.
Each one approaches the summit from a different direction, and each one requires a
long and arduous climb. In Section 15.2.4, we give the easy proof of the uniqueness
statement. The existence claim is the difficult step. In Section 15.3, we will summarize
a geometric existence proof, but there is no space here for all of the details.
Simplify the function algebraically so that you can invoke the fact that the negative
inverse is operator monotone.
The constant function 𝑡 ↦ 𝛼 is matrix monotone on any interval. For each 𝜆 ∈ [−1, +1], the function 𝑡 ↦ 𝑡/(1 − 𝜆𝑡) is matrix monotone on the interval (−1, +1)
by Exercise 15.2. By the usual limiting arguments (simple functions, dominated
convergence), we see that the integral in (15.1) represents a matrix monotone function.
Indeed, matrix monotone functions are closed under positive linear combinations
and pointwise limits. Altogether, the expression (15.1) represents a matrix monotone
function on the standard open interval (−1, +1) .
Problem 15.4 (Tangent). Show that the function 𝑡 ↦→ tan (𝜋𝑡 /2) has an integral repre-
sentation of the form (15.1), hence is matrix monotone on (−1, +1) . Hint: You can
derive this fact from the definite integral
(𝜋/2) tan(𝜋𝑡/2) = ∫_0^∞ (𝜆^𝑡 − 1)/(𝜆 − 𝜆^{−1}) · d𝜆/𝜆 for 𝑡 ∈ (−1, +1).
This non-obvious statement appears as [GR07, Formula 3.274(3)]. There are more
elementary ways to prove that the tangent is matrix monotone; cf. Lecture 18.
Suppose that there are two Borel probability measures 𝜇, 𝜈 on [−1, +1] for which

𝑓(𝑡) = ∫_{−1}^{+1} 𝑡/(1 − 𝜆𝑡) d𝜇(𝜆) = ∫_{−1}^{+1} 𝑡/(1 − 𝜆𝑡) d𝜈(𝜆) for all 𝑡 ∈ (−1, +1).
If these two measures are different, they can be separated by a continuous function
ℎ : [−1, +1] → ℝ. That is,
∫_{−1}^{+1} ℎ(𝜆) d𝜇(𝜆) < ∫_{−1}^{+1} ℎ(𝜆) d𝜈(𝜆).
Since 𝜇 and 𝜈 are both probability measures, we may shift ℎ to ensure that it has zero integral (with respect to Lebesgue measure): ∫_{−1}^{+1} ℎ(𝜆) d𝜆 = 0.
By Exercise 15.5, the functions 𝜆 ↦ 𝑡(1 − 𝜆𝑡)^{−1} are total in the space of continuous functions on [−1, +1] with zero integral. Therefore, we can approximate ℎ in the supremum norm by a linear combination. For example, with some real coefficients 𝑐₁, …, 𝑐_𝑛 and points 𝑡₁, …, 𝑡_𝑛 ∈ (−1, +1), we have

|ℎ(𝜆) − ∑_{𝑘=1}^{𝑛} 𝑐_𝑘 𝑡_𝑘/(1 − 𝜆𝑡_𝑘)| ≤ 𝜀 for all 𝜆 ∈ [−1, +1].

Integrating the approximant against 𝜇 and against 𝜈 produces the same value, because the two measures assign the same integral to each function 𝜆 ↦ 𝑡_𝑘/(1 − 𝜆𝑡_𝑘). Hence the integrals of ℎ against 𝜇 and 𝜈 differ by at most 2𝜀, and 𝜀 > 0 is arbitrary. This contradiction forces us to conclude that the measures are the same.
B ≔ {𝑓 : (−1, +1) → ℝ : 𝑓 is matrix monotone, 𝑓(0) = 0, and 𝑓′(0) = 1}.

Figure 15.1 (Base of cone): A base B of the cone of matrix monotone functions on (−1, +1).

Furthermore, one may confirm that each nonconstant matrix monotone function 𝑔 satisfies 𝑔′(0) > 0. Therefore, the function (𝑔(𝑡) − 𝑔(0))/𝑔′(0) ∈ B.
In general, the closure is required, and there is a possibility that the extreme points
are uninformative (e.g., the extreme points could be dense in B!). Nevertheless, there
are many settings where we can explicitly identify the extreme points of the set B. The
limit of convex combinations can often be realized as an integral over the extreme
points, which leads to an integral representation of the elements of B.
Theorem 15.7 (Elementary monotone functions: Extremality). For each 𝜆 ∈ [−1, +1] ,
introduce the function
𝜑_𝜆(𝑡) = 𝑡/(1 − 𝜆𝑡) for 𝑡 ∈ (−1, +1).
The function 𝜑𝜆 is an extreme point of B.
In fact, the family ( 𝜑𝜆 : 𝜆 ∈ [−1, +1]) exhausts the set of extreme points of B.
This claim is significantly harder to prove, so we must take it for granted.
Proof. Fix 𝜆 ∈ [−1, +1], and note that 𝜑_𝜆 ∈ B. The key to the proof is to observe that the 2 × 2 Loewner matrix associated with 𝜑_𝜆 always has rank one:

𝜑_𝜆^{[1]}(𝑠, 𝑡) = [ (1 − 𝜆𝑠)^{−1} ; (1 − 𝜆𝑡)^{−1} ] [ (1 − 𝜆𝑠)^{−1} ; (1 − 𝜆𝑡)^{−1} ]* for all 𝑠, 𝑡 ∈ (−1, +1).
Therefore, the Loewner matrix lies in an extreme ray of the psd cone.
Suppose now that we can write 𝜑_𝜆 = ½𝑓 + ½𝑔 where 𝑓, 𝑔 ∈ B. The Loewner matrix is linear in the function, so

𝜑_𝜆^{[1]}(𝑠, 𝑡) = ½ 𝑓^{[1]}(𝑠, 𝑡) + ½ 𝑔^{[1]}(𝑠, 𝑡) for all 𝑠, 𝑡 ∈ (−1, +1).
But the Loewner matrix 𝜑_𝜆^{[1]}(𝑠, 𝑡) lies in an extreme ray of the psd cone, so the two matrices on the right are both positive scalar multiples of it. In particular,

𝑓^{[1]}(𝑠, 𝑡) = 𝛼(𝑠, 𝑡) · 𝜑_𝜆^{[1]}(𝑠, 𝑡) for some 𝛼(𝑠, 𝑡) ≥ 0.
To complete the argument, we select 𝑠 = 0 and write out the entries of the matrices. From the normalization 𝑓(0) = 0 and 𝑓′(0) = 1, it emerges that

[ 1  𝑓(𝑡)/𝑡 ; 𝑓(𝑡)/𝑡  𝑓′(𝑡) ] = 𝛼(0, 𝑡) · [ 1  (1 − 𝜆𝑡)^{−1} ; (1 − 𝜆𝑡)^{−1}  (1 − 𝜆𝑡)^{−2} ] for all 𝑡 ∈ (−1, +1).

Comparing the (1, 1) entries forces 𝛼(0, 𝑡) = 1, and comparing the off-diagonal entries then gives 𝑓(𝑡) = 𝑡/(1 − 𝜆𝑡) = 𝜑_𝜆(𝑡). The same argument applies to 𝑔, so 𝜑_𝜆 is an extreme point of B.
For a nonconstant matrix monotone function 𝑔 on (−1, +1) , we apply this statement
to the function 𝑓 (𝑡 ) = (𝑔 (𝑡 ) − 𝑔 ( 0))/𝑔 0 ( 0) to complete our sketch of the proof of the
“hard” direction of Theorem 15.1.
Proof sketch. Let us introduce the fractional linear transformation 𝜓 that maps the
strictly positive real line ℝ++ onto the standard open interval (−1, +1) . The map 𝜓
and its inverse 𝜓 −1 are increasing functions of the form
𝜓(𝑡) = (𝑡 − 1)/(𝑡 + 1) for 𝑡 ∈ ℝ₊₊, where 𝜓^{−1}(𝑠) = (1 + 𝑠)/(1 − 𝑠) for 𝑠 ∈ (−1, +1).
We can construct the matrix monotone function 𝑓 (𝑠 ) = 𝑔 (𝜓 −1 (𝑠 )) on (−1, +1) . Apply
Theorem 15.1 to 𝑓 to obtain an integral representation. Then make the change of
variables 𝑠 = 𝜓 (𝑡 ) to transfer this representation back to the function 𝑔 . Make the
affine change of variables 𝜆 ↦→ 2𝜆 − 1 to shift the domain of integration from [−1, +1]
to [ 0, 1] . The result of this process is quoted above.
Example 15.12 (Logarithm, again). We can use the “easy” direction of Corollary 15.11 to
verify that particular functions are matrix monotone. As an example, observe that
log 𝑡 = ∫_0^1 (𝑡 − 1)/(𝜆 + (1 − 𝜆)𝑡) d𝜆 for all 𝑡 > 0.
Therefore, the logarithm is matrix monotone. The representation of the logarithm that
we saw before is related to this one by a further change of variables in the integral.
Exercise 15.13 (Loewner: Strictly positive reals). It is sometimes convenient to change
variables in Corollary 15.11. For a matrix monotone function 𝑔 : ℝ++ → ℝ, show that
𝑔(𝑡) = 𝛼 + 𝛾𝑡 + ∫_{ℝ₊₊} 𝜆(𝑡 − 1)/(𝜆 + 𝑡) d𝜇(𝜆) for all 𝑡 > 0,
where 𝛼 ∈ ℝ and 𝛾 ≥ 0 and 𝜇 is a finite, positive Borel measure on ℝ++ . This is closer
to our original presentation of the logarithm in terms of an integral.
Proof. For the second part, assume that 𝑓 : ℝ+ → ℝ+ is matrix concave and positive .
First, suppose that 𝑨 ≺ 𝑩 . For scalars 𝜏 ∈ ( 0, 1) and 𝜏¯ = 1 − 𝜏 , since 𝑓 is concave and
positive,
𝑓(𝜏𝑩) = 𝑓(𝜏𝑨 + 𝜏̄ · (𝜏/𝜏̄)(𝑩 − 𝑨)) ≽ 𝜏 · 𝑓(𝑨) + 𝜏̄ · 𝑓((𝜏/𝜏̄)(𝑩 − 𝑨)) ≽ 𝜏 · 𝑓(𝑨).
Since 𝑓 is continuous, we may take 𝜏 ↑ 1. We conclude that 𝑨 ≺ 𝑩 implies
𝑓 (𝑨) 4 𝑓 (𝑩) . To handle the case where 𝑨 4 𝑩 , observe that 𝑓 (𝑨) 4 𝑓 (𝑩 + 𝜀 I) for
𝜀 > 0, and take limits as 𝜀 ↓ 0 to resolve that 𝑓 is matrix monotone.
For the first part, the key to this argument is the following claim, which we will
establish in a moment.
Claim 15.19 (Monotonicity: Submatrix). Let 𝑓 : ℝ₊ → ℝ be a continuous matrix monotone function. For a psd block matrix 𝑨, we have 𝑓([𝑨]₁₁) ≽ [𝑓(𝑨)]₁₁. Recall that [·]₁₁ extracts the top-left (1, 1) block of a block matrix, and it is a positive linear map.
We assume that 𝑓 : ℝ+ → ℝ is matrix monotone. Granted the claim, we introduce
a unitary block matrix that generates convex combinations:
𝑼_𝜏 ≔ [ 𝜏^{1/2} I  𝜏̄^{1/2} I ; −𝜏̄^{1/2} I  𝜏^{1/2} I ] for 𝜏 ∈ [0, 1] with 𝜏̄ = 1 − 𝜏.
For psd matrices 𝑨 1 , 𝑨 2 , construct the psd block diagonal matrix 𝑨 = 𝑨 1 ⊕ 𝑨 2 . Using
the claim, we calculate that
𝑓(𝜏𝑨₁ + 𝜏̄𝑨₂) = 𝑓([𝑼_𝜏 𝑨 𝑼_𝜏*]₁₁) ≽ [𝑓(𝑼_𝜏 𝑨 𝑼_𝜏*)]₁₁ = [𝑼_𝜏 𝑓(𝑨) 𝑼_𝜏*]₁₁ = 𝜏 𝑓(𝑨₁) + 𝜏̄ 𝑓(𝑨₂).
For a psd block matrix 𝑨, a short calculation reveals that

𝑨 ≼ 𝑨 + 𝑻_𝜀* 𝑨 𝑻_𝜀 = [ (1 + 𝜀) · [𝑨]₁₁  0 ; 0  (1 + 𝜀^{−1}) · [𝑨]₂₂ ].

Of course, [·]₂₂ extracts the (2, 2) block of a block matrix.
Apply the monotone function 𝑓 to this relation. Since [·] 11 is a positive linear map, it
preserves the psd order on self-adjoint matrices. Therefore,
[𝑓(𝑨)]₁₁ ≼ 𝑓((1 + 𝜀) · [𝑨]₁₁).
Proof. Assume that 0 ≺ 𝑨 4 𝑩 . Since 𝑩 −1/2 𝑨𝑩 −1/2 4 I, we see that the matrix
𝑲 = 𝑩 −1/2 𝑨 1/2 is a contraction. Let 𝑳 = ( I − 𝑲 ∗ 𝑲 ) 1/2 . The matrix Jensen inequality
(Lecture 13) implies that
Exercise 15.24 (Entropy). Using an integral representation for the logarithm, find an
integral representation for the negative entropy function negent (𝑡 ) B 𝑡 log 𝑡 for 𝑡 ≥ 0.
Exercise 15.25 (Matrix convexity: Other intervals). Consider a matrix convex function
𝑓 : I → ℝ on an open interval I. Bendat & Sherman proved that the divided difference
𝑓 [1 ] (𝑠 , ·) : I → ℝ is matrix monotone for each 𝑠 ∈ I. Use this fact to derive an integral
representation for the matrix convex function 𝑓 .
Theorem 15.26 (Matrix Jensen: Positive linear maps). Consider a matrix convex function 𝑓 : ℝ₊ → ℝ, and let 𝚽 : 𝕄_𝑛 → 𝕄_𝑘 be a unital positive linear map. Then 𝑓(𝚽(𝑨)) ≼ 𝚽(𝑓(𝑨)) for all psd 𝑨 ∈ ℍ_𝑛⁺.
We can trace this kind of result to work of Davis [Dav57] and Choi [Cho74]. See
also Ando’s paper [And79].
Exercise 15.27 (Choi’s convexity theorem). Theorem 15.26 holds for every matrix convex
function 𝑓 , regardless of its domain. Prove it. Hint: Use Exercise 15.25.
Exercise 15.28 (Matrix Jensen). Deduce that the matrix Jensen inequality from Lecture 13
is a special case of Choi’s convexity theorem (Exercise 15.27). Hint: Consider the block
diagonal matrix 𝑨 = 𝑨 1 ⊕ 𝑨 2 ⊕ · · · ⊕ 𝑨 𝑛 . Show that the matrix convex combination is
a unital, positive linear map on this matrix.
Hint: Note that 𝑡 ↦→ log (𝜀 + 𝑡 ) is matrix concave for 𝑡 ≥ 0 and 𝜀 > 0. Choose an
appropriate value of 𝜀 ; there is no need to take a limit.
Exercise 15.31 (Matrix Lyapunov: More powers). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital, positive
linear map. For 𝑝 ∈ [ 1/2, 1] , derive the matrix Lyapunov inequality
[𝚽(𝑨^𝑝)]^{1/𝑝} ≼ 𝚽(𝑨) for all psd 𝑨 ∈ ℍ_𝑛⁺.
Exercise 15.32 (Matrix entropy: Filtering). Recall that the entropy ent (𝑡 ) B −𝑡 log 𝑡 for
𝑡 ≥ 0 is matrix concave. Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital positive linear map. Show that
In other words, entropy decreases when we filter by a unital, positive linear map.
Notes
Loewner’s theorem is a classic result that has attracted a significant amount of attention,
and it has been the subject of several books, notably [Sim19]. Nevertheless, there does
not seem to be a short, accessible treatment of these ideas. Nor does there appear
to be a self-contained proof of Loewner’s theorem that can be presented in a single
lecture. This lecture is the instructor’s attempt to set out the main facts about integral
representation of matrix monotone and matrix convex functions in a way that might
Lecture bibliography
[And79] T. Ando. “Concavity of certain maps on positive definite matrices and applications
to Hadamard products”. In: Linear Algebra and its Applications 26 (1979), pages 203–
241.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[Cho74] M.-D. Choi. “A Schwarz inequality for positive linear maps on 𝐶 ∗ -algebras”. In:
Illinois Journal of Mathematics 18.4 (1974), pages 565 –574. doi: 10.1215/ijm/
1256051007.
[Dav57] C. Davis. “A Schwarz inequality for convex operator functions”. In: Proc. Amer.
Math. Soc. 8 (1957), pages 42–44. doi: 10.2307/2032808.
[GR07] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Seventh.
Translated from the Russian, Translation edited and with a preface by Alan
Jeffrey and Daniel Zwillinger, With one CD-ROM (Windows, Macintosh and UNIX).
Elsevier/Academic Press, Amsterdam, 2007.
[HP82] F. Hansen and G. K. Pedersen. “Jensen’s Inequality for Operators and Löwner’s
Theorem.” In: Mathematische Annalen 258 (1982), pages 229–241.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
16. Matrix Means
Agenda:
1. Scalar means
2. Axioms for matrix means
3. Representers for scalar means
4. Representers for matrix means
5. Matrix perspective transformations
6. Matrix means from representers
7. Integral representations

In this lecture, we consider the question of how to “average” a pair of psd matrices. We introduce a class of matrix means, and we give a complete characterization of these functions. This topic may seem like a departure from the recent lectures. In fact, the means that we study are in one-to-one correspondence with a class of matrix monotone functions.

We begin by formalizing the notion of a scalar mean, which describes an average of two positive numbers. Following the ideas of Kubo & Ando [KA79], we explain how to extend this construction to matrices. Then we show that (bivariate) means can be described by (univariate) representer functions. This observation leads to a satisfying theory in both the scalar and matrix settings.
The second formula is valid for numbers that are positive, but not necessarily
strictly positive.
Other examples include the family of binomial means and the family of power means.
Have you seen other types of means?
What are the properties common to these examples that can justify their interpre-
tation as the mean of two numbers? By extracting the key features, we may define a
class of scalar means.
𝑎 ↦ 𝑀(𝑎, 𝑏) is increasing for each 𝑏 ≥ 0;
𝑏 ↦ 𝑀(𝑎, 𝑏) is increasing for each 𝑎 ≥ 0.
We say that a mean is symmetric if it also satisfies 𝑀 (𝑎, 𝑏) = 𝑀 (𝑏, 𝑎) for all
𝑎, 𝑏 ≥ 0. This property is typical, but we will not insist on it.
We remark that these properties are interrelated, so they are not fully independent
from each other. For instance, the ordering property already implies strict positivity.
For a symmetric mean, we do not need to make separate hypotheses about the behavior
of the first and second argument. It is also common to define the mean for strictly
positive numbers only, since we can use continuity to obtain the value of the mean
when one of the arguments is zero.
Exercise 16.3 (Scalar mean: Examples). Confirm that each item in Example 16.1 is a
symmetric scalar mean in the sense of Definition 16.2.
numbers is replaced by the psd order. The only difficulty arises from the generalization
of the positive homogeneity property, which we will discuss after the definition.
Definition 16.4 (Matrix mean). Fix 𝑛 ∈ ℕ. A function 𝑴 : ℍ𝑛+ × ℍ𝑛+ → ℍ𝑛+ on pairs of
psd matrices is called a matrix mean on ℍ𝑛+ if it has the following properties.
0 ≼ 𝑨 ≼ 𝑩 implies 𝑨 ≼ 𝑴(𝑨, 𝑩) ≼ 𝑩;
0 ≼ 𝑩 ≼ 𝑨 implies 𝑩 ≼ 𝑴(𝑨, 𝑩) ≼ 𝑨.
We say that a matrix mean is symmetric if it also satisfies the identity 𝑴 (𝑨, 𝑩) =
𝑴 (𝑩, 𝑨) for all 𝑨, 𝑩 < 0.
As in the case of scalar means, we often define a matrix mean for strictly positive
matrices. The continuity requirement allows us to extend the matrix mean to all psd
matrices. For brevity, we will not give any details on these continuity arguments.
We can think about the conjugation axiom for a matrix mean as a counterpart to the
positive homogeneity property of a scalar mean. For each 𝜆 > 0, the function 𝑎 ↦→ 𝜆𝑎
is a bijection on the positive numbers. Likewise, for each nonsingular 𝑿 ∈ 𝕄𝑛 , the
congruence 𝑨 ↦→ 𝑿 ∗ 𝑨𝑿 is a bijection on the psd matrices. Therefore, it is natural to
ask that the mean preserve simultaneous congruence. The extension to all 𝑿 ∈ 𝕄𝑛
follows from a short continuity argument.
The conjugation axiom has some striking implications. For example, it ensures that
the mean of two scalar matrices must also be a scalar matrix (i.e., a multiple of the
identity matrix).
Exercise 16.5 (Matrix mean: Scalar matrices). Let 𝑴 be a matrix mean on ℍ𝑛+ , as in
Definition 16.4. For all scalars 𝛼, 𝛽 ≥ 0, prove that
Definition 16.6 (Matrix arithmetic mean). The matrix arithmetic mean is the function
Exercise 16.7 (Matrix arithmetic mean). Confirm that the matrix arithmetic mean is
symmetric, and it satisfies all the axioms in Definition 16.4.
There are two very basic, but not so obvious, examples of a matrix mean that merit
explicit mention.
Definition 16.8 (Left and right matrix means). The left matrix mean and right matrix
mean are respectively given by the expressions
Exercise 16.9 (Matrix left and right means). Confirm that the left and right matrix means
satisfy the axioms in Definition 16.4, but they are not symmetric.
Definition 16.10 (Matrix harmonic mean). The matrix harmonic mean is defined as
1 −1
−1
𝑴 (𝑨, 𝑩) B 2
𝑨 + 12 𝑩 −1 for all 𝑨, 𝑩 ∈ ℍ𝑛++ .
Exercise 16.11 (Harmonic mean: Projectors). Compute the harmonic mean 𝑴 (𝑷 , 𝑩) for an
orthogonal projector 𝑷 ∈ ℍ𝑛+ and a positive-definite matrix 𝑩 ∈ ℍ𝑛++ .
Exercise 16.12 (Matrix harmonic mean). Note that the matrix harmonic mean is symmetric.
Confirm that the matrix harmonic mean satisfies all the axioms in Definition 16.4. Hint:
Assume that the arguments are positive definite, and recall that the matrix inverse is
order reversing. Use continuity to extend the axioms to all psd matrices and to the
case of psd conjugation.
It is valuable to rewrite the harmonic mean using a related function called the
parallel sum [AJD69].
Definition 16.13 (Parallel sum). The parallel sum of two positive-definite matrices is
defined as −1
(𝑨 : 𝑩) B 𝑨 −1 + 𝑩 −1 where 𝑨, 𝑩 ∈ ℍ𝑛++ .
We extend this definition to psd matrices via continuity. It is evident that 2 (𝑨 : 𝑩)
coincides with the harmonic mean of 𝑨 and 𝑩 .
𝑨 : 𝑩 = (𝑨 −1 + 𝑩 −1 ) −1 = 𝑩 (𝑨 + 𝑩) −1 𝑨 = 𝑨 − 𝑨 (𝑨 + 𝑩) −1 𝑨.
This representation extends to psd matrices. By symmetry, we also have the relation
𝑨 : 𝑩 = 𝑩 − 𝑩 (𝑨 + 𝑩) −1 𝑩 , and we can recognize the Schur complement of another
block matrix.
The representation of the parallel sum as a Schur complement gives an alternative
proof of the fact that the parallel sum is monotone. Indeed, Schur complements of a
psd matrix are increasing with respect to the matrix. We also discover that the parallel
sum is concave.
Exercise 16.14 (Parallel sum: Concavity). For psd matrices 𝑨 𝑖 , 𝑩 𝑖 ∈ ℍ𝑛+ for 𝑖 = 1, 2, use
the connection with Schur complements to prove that
(𝜏𝑨 1 + 𝜏𝑨
¯ 2 ) : (𝜏𝑩 1 + 𝜏𝑩
¯ 2 ) < 𝜏 (𝑨 1 : 𝑩 1 ) + 𝜏¯ (𝑨 2 : 𝑩 2 ) for 𝜏 ∈ [ 0, 1] .
As usual, 𝜏¯ B 1 − 𝜏 . In other words, the parallel sum is jointly concave on pairs of psd
matrices.
We have used the fact that the identity matrix commutes with everything. This
expression appears complicated, but we have no choice about it once we accept that
matrix means satisfy the conjugation axiom.
These considerations lead to the following definition of the matrix geometric mean.
Definition 16.15 (Matrix geometric mean). For psd matrices 𝑨, 𝑩 ∈ ℍ𝑛+ , the matrix
geometric mean is defined as
1/2
𝑨 ♯ 𝑩 B 𝑨 1/2 · 𝑨 −1/2 𝑩 𝑨 −1/2 · 𝑨 1/2 where 𝑨, 𝑩 ∈ ℍ𝑛++ .
It is clear that the matrix geometric mean is positive. Using the fact that the
square-root is a matrix monotone function, it is not hard to show that the matrix
geometric mean satisfies the order property. On the other hand, it is not clear that the
matrix geometric mean has the other required properties (congruence, monotonicity).
Although one may prove these results directly, we will instead develop them as a
consequence of a more general theory of matrix means.
𝑓 : ℝ+ → ℝ+ given by 𝑓 (𝑡 ) B 𝑀 ( 1, 𝑡 ) for 𝑡 ≥ 0.
The axioms for a scalar mean induce several structural constraints on its representer
function.
Exercise 16.18 (Scalar mean: Representer properties). Consider a scalar mean 𝑀 : ℝ+ ×
ℝ+ → ℝ+ . Introduce the representer function 𝑓 (𝑡 ) = 𝑀 ( 1, 𝑡 ) for 𝑡 ≥ 0. Prove that
the representer enjoys the following properties.
Most of these results are straightforward. The subadditivity and symmetry properties
may take a moment of thought.
Lecture 16: Matrix Means 139
Proof. We can easily verify that 𝑀 𝑓 enjoys the properties of a scalar mean directly
from the analogous properties of the representer 𝑓 . The only technical difficulty arises
in verifying the existence of the limit of 𝑀 𝑓 (𝑎, 𝑏) as 𝑎 ↓ 0 and 𝑏 > 0.
To see that representers and scalar means are in one-to-one correspondence, note
that 𝑓 (𝑡 ) = 𝑀 𝑓 ( 1, 𝑡 ) for all 𝑡 > 0, so that 𝑓 is the representer of 𝑀 𝑓 .
Definition 16.20 (Matrix mean: Representer). Let 𝑴 : ℍ𝑛+ × ℍ𝑛+ → ℍ𝑛+ be a matrix
mean on ℍ𝑛+ , as in Definition 16.4. The representer of the matrix mean is the scalar
function
This definition uses Exercise 16.5 to ensure that the mean of two scalar matrices is
itself a scalar matrix.
Exercise 16.21 (Matrix mean: Representers). Compute the representers for the matrix
arithmetic mean, left and right means, the matrix harmonic mean, and the matrix
geometric mean.
The axioms for a matrix mean ensure that the representer has many of the same
properties as in the scalar setting. The next result collects the easy facts.
Exercise 16.22 (Matrix mean: Representer properties). Consider a matrix mean 𝑴 on
ℍ𝑛+ . Introduce the representer function 𝑓 (𝑡 ) I = 𝑴 ( I, 𝑡 I) for 𝑡 ≥ 0. Prove that the
representer enjoys the following properties.
Lecture 16: Matrix Means 140
Theorem 16.23 (Matrix mean representer). Consider a matrix mean 𝑴 on ℍ𝑛+ with
representer 𝑓 : ℝ+ → ℝ+ . Then the representer function satisfies
Proof. First, we show that 𝑷 commutes with the mean 𝑴 (𝑨, 𝑩) . Indeed, since 𝑷
commutes with 𝑨 , we have the relation 𝑨𝑷 = 𝑨 1/2𝑷 𝑨 1/2 4 𝑨 . Likewise, 𝑩𝑷 4 𝑩 . By
monotonicity of the mean, 𝑴 (𝑨𝑷 , 𝑩𝑷 ) 4 𝑴 (𝑨, 𝑩) . Conjugate by the projector to
see that
We have used the conjugation property of the mean at the first step.
Equivalently, we have shown that To see what is going on, assume that
range (𝑷 ) is a subspace spanned by
𝑴 (𝑨, 𝑩) − 𝑷 𝑴 (𝑨, 𝑩)𝑷 < 0. coordinates. Then we are removing a
“diagonal block” from the psd matrix
Since 𝑴 (𝑨, 𝑩) is psd, this statement implies that the restriction of the matrix 𝑴 (𝑨, 𝑩) 𝑴 (𝑨, 𝑩) , and yet we are left with a
psd matrix. This can only happen if
to range (𝑷 ) is zero. In particular, the “off-diagonal blocks” are zero too.
The last relation is the conjugate transpose of the first. Rearrange this expression to
see that 𝑷 commutes with 𝑴 (𝑨, 𝑩) .
By a similar argument, 𝑷 also commutes with 𝑴 (𝑨𝑷 , 𝑩𝑷 ) . These two facts
complete the proof.
Lecture 16: Matrix Means 141
Proof of Theorem 16.23. Fix a matrix 𝑩 ∈ ℍ𝑛+ , and introduce the spectral resolution
Í
𝑩 = 𝑖 𝜆𝑖 𝑷 𝑖 . Since the projectors decompose the identity,
∑︁ ∑︁
𝑴 ( I, 𝑩) = 𝑴 ( I, 𝑩)𝑷 𝑖 = 𝑴 (𝑷 𝑖 , 𝑨𝑷 𝑖 )𝑷 𝑖
∑︁𝑖 𝑖
∑︁
= 𝑴 (𝑷 𝑖 , 𝜆𝑖 𝑷 𝑖 )𝑷 𝑖 = 𝑴 ( I, 𝜆𝑖 I)𝑷 𝑖
∑︁𝑖 𝑖
= 𝑓 (𝜆𝑖 )𝑷 𝑖 = 𝑓 (𝑩).
𝑖
We have used Lemma 16.24 in the second and third lines. To pass from the second line
to the third, we used the fact that the the range of the projector 𝑷 𝑖 is an eigenspace of
𝑩 with eigenvalue 𝜆𝑖 . Last, we have recognized the matrix mean representer, and we
applied the definition of a standard matrix function.
Finally, consider 𝑛 × 𝑛 matrices with 0 4 𝑩 1 4 𝑩 2 . Then
𝑓 (𝑩 1 ) = 𝑴 ( I, 𝑩 1 ) 4 𝑴 ( I, 𝑩 2 ) = 𝑓 (𝑩 2 ).
We have invoked the axiom that the matrix mean is monotone on ℍ𝑛+ .
In particular,
Regardless of the choice of the standard matrix function 𝑓 , the perspective trans-
formation 𝑴 𝑓 interacts nicely with conjugation. When the function 𝑓 has additional
properties, the perspective 𝑴 𝑓 may inherit some of these features. This section
contains some elaboration, and we will continue this discussion in the next lecture.
Although the form (16.1) of the perspective transformation is motivated by the
conjugation axiom, it is not immediate that the perspective satisfies the conjugation
property. The first proposition guarantees that it does.
Proposition 16.26 (Matrix perspective: Conjugation). Let 𝑓 : ℝ+ → ℝ+ be a (continuous)
function. Then the perspective transformation 𝑴 𝑓 satisfies the conjugation axiom. For
Lecture 16: Matrix Means 142
all 𝑨, 𝑩 ∈ ℍ𝑛++ ,
Proof. Assume that 𝑨, 𝑩 ∈ ℍ𝑛++ are positive definite, and fix a nonsingular matrix
𝑿 ∈ 𝕄𝑛 . The remaining cases will follow from continuity.
The quantity of interest takes the unwieldy form
𝑴 𝑓 (𝑿 ∗ 𝑨𝑿 , 𝑿 ∗ 𝑩 𝑿 )
= (𝑿 ∗ 𝑨𝑿 ) 1/2 · 𝑓 (𝑿 ∗ 𝑨𝑿 ) −1/2 (𝑿 ∗ 𝑩 𝑿 )(𝑿 ∗ 𝑨𝑿 ) −1/2 · (𝑿 ∗ 𝑨𝑿 ) 1/2 .
To tame this expression, introduce the matrix 𝒀 = 𝑿 (𝑿 ∗ 𝑨𝑿 ) −1/2 . It has the polar
factorization 𝒀 = 𝑷𝑼 where 𝑷 = (𝒀 𝒀 ∗ ) 1/2 and 𝑼 is unitary. After a short calculation,
we find that 𝑷 = 𝑨 −1/2 . Therefore,
𝑴 𝑓 (𝑿 ∗ 𝑨𝑿 , 𝑿 ∗ 𝑩 𝑿 ) = 𝑿 ∗𝒀 −∗ · 𝑓 (𝒀 ∗ 𝑩𝒀 ) · 𝒀 −1 𝑿
= 𝑿 ∗𝑷 −1𝑼 · 𝑓 (𝑼 ∗ (𝑷 𝑩𝑷 )𝑼 ) · 𝑼 ∗𝑷 −1 𝑿
= 𝑿 ∗ 𝑨 1/2 · 𝑓 (𝑨 −1/2 𝑩 𝑨 −1/2 ) · 𝑨 1/2 𝑿 = 𝑿 ∗ 𝑴 𝑓 (𝑨, 𝑩)𝑿 .
We have used the unitary equivariance of the standard matrix function to eliminate
the unitary matrices.
An important corollary of the conjugation invariance property is that the perspective
transformation of a matrix monotone function is monotone in both arguments.
Corollary 16.27 (Matrix perspective: Monotonicity). Let 𝑓 : ℝ+ → ℝ+ be a continuous
matrix monotone function. Then each variable of the perspective transformation 𝑴 𝑓
is matrix monotone on ℍ𝑛+ :
Proof. First, we remark that a positive matrix monotone function is always subadditive. Indeed, a matrix monotone function
Therefore, we can take limits to extend the perspective 𝑴 𝑓 from positive-definite 𝑓 : ℝ+ → ℝ+ is concave, so it must
be subadditive.
matrices to psd matrices.
The matrix monotonicity of 𝑩 ↦→ 𝑴 𝑓 (𝑨, 𝑩) is an easy consequence of the expres-
sion (16.1) for the perspective 𝑴 𝑓 , the conjugation rule, and the matrix monotonicity
of 𝑓 .
As for the other variable, we may assume that 𝑨, 𝑩 are positive definite. Then the
conjugation property (Proposition 16.26) ensures that
Since 𝑓 is matrix monotone, the function 𝑡 ↦→ 𝑡 · 𝑓 ( 1/𝑡 ) is also matrix monotone. (See
Problem 16.28.) Therefore, 𝑨 ↦→ 𝑴 𝑓 (𝑩 −1/2 𝑨𝑩 −1/2 , I) is matrix monotone. By the
conjugation rule, we conclude that 𝑨 ↦→ 𝑴 𝑓 (𝑨, 𝑩) is also matrix monotone.
Lecture 16: Matrix Means 143
Deduce that 𝑔 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) is matrix monotone on ℝ++ . Hint: If you exploit the fact
that 𝑓 is matrix concave, then the proof is easy. But this argument may be circular, so
you should give an independent proof; see Exercise ??.
In general, the perspective function 𝑴 𝑓 treats its two arguments rather differently.
In the context of matrix means, it is valuable to understand when the perspective is
symmetric in its arguments. The next exercise states the result.
Exercise 16.29 (Matrix perspective: Symmetry). Suppose that 𝑓 : ℝ+ → ℝ+ has the
property that 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) for 𝑡 > 0. Prove that
Hint: This point follows easily from the conjugation property, much like the result of
Corollary 16.27.
Our main result states that the matrix perspective of a matrix mean representer
generates a family of matrix means. We have already laid most of the groundwork, so
this result will follow quickly.
Theorem 16.32 (Matrix representers yield matrix means). Let 𝑓 be a matrix mean
representer, as in Definition 16.30, and let 𝑴 𝑓 be the associated family of matrix
means, as in Definition 16.31.
Then the 𝑴 𝑓 is a matrix mean on ℍ𝑛+ for each 𝑛 ∈ ℕ with matrix mean
representer 𝑓 . Indeed, 𝑓 ↔ 𝑴 𝑓 is a bijection between matrix representer
functions and families of matrix means.
Moreover, if 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) for 𝑡 > 0, then the mean 𝑴 𝑓 is symmetric in its
arguments.
Proof. When 𝑓 is a matrix mean representer, we can argue that 𝑴 𝑓 is a matrix mean
on matrices of any dimension 𝑛 ∈ ℕ.
Lecture 16: Matrix Means 144
Positivity. Strict positivity of 𝑴 𝑓 follows from the fact that a matrix monotone
function 𝑓 : ℝ+ → ℝ+ with 𝑓 ( 1) = 1 is strictly positive.
Order. The order properties are straightforward. For example, to see that 0 ≺ 𝑨 4 𝑩
implies that 𝑨 4 𝑴 𝑓 (𝑨, 𝑩) , we just need to check that I 4 𝑓 (𝑨 −1/2 𝑩 𝑨 −1/2 ) . But this
is a consequence of the fact that 𝑨 −1/2 𝑩 𝑨 −1/2 4 I and the normalization 𝑓 ( 1) = 1.
To check that 0 ≺ 𝑨 4 𝑩 implies that 𝑴 𝑓 (𝑨, 𝑩) 4 𝑩 , we invoke the conjugation
property (Proposition 16.26) to obtain the equivalent relation 𝑓 (𝑨 −1/2 𝑩 𝑨 −1/2 ) 4 I,
which we have already verified. The other cases are essentially the same.
Monotonicity. Corollary 16.27 already establishes the monotonicity property.
Conjugation. The conjugation axiom was obtained in Proposition 16.26.
Continuity. The function 𝑴 𝑓 is continuous on positive-definite matrices since 𝑓 is
continuous. It is continuous for psd matrices by construction.
Symmetry. Exercise 16.29 requests the proof of the symmetry property.
Bijection. Finally, note that 𝑴 𝑓 ( I, 𝑡 I) = 𝑓 (𝑡 ) · I for all 𝑡 > 0. In other words, 𝑓 is
the matrix mean representer of 𝑴 𝑓 . Since 𝑴 𝑓 is a matrix mean for each dimension
𝑛 ∈ ℕ, the function 𝑓 must be matrix monotone, continuous, and normalized with
𝑓 ( 1) = 1. We conclude that 𝑓 ↔ 𝑴 𝑓 is a bijection.
16.5.3 Examples
Examples of families of matrix means include the matrix arithmetic mean, the left and
right matrix mean, the matrix harmonic mean, and the matrix geometric mean. There
are other important examples.
Example 16.33 (Weighted matrix geometric means). For a parameter 𝑟 ∈ [ 0, 1] , we may
consider the matrix monotone function 𝑓 (𝑡 ) = 𝑡 𝑟 . This function induces a family of
weighted geometric means:
We can extend to all psd matrices by taking limits. The weighted geometric means
interpolate between the left and right mean. With respect to an appropriate geometry
on psd matrices, the weighted geometric means 𝑟 ↦→ 𝑨 ♯𝑟 𝑩 trace out a geodesic
between 𝑨 and 𝑩 . We recognize that ordinary matrix geometric mean 𝑨 ♯ 𝑩 as the
midpoint of this geodesic.
∫1
Example 16.34 (Matrix logarithmic mean). The function 𝑓 (𝑡 ) = 0
𝑡 𝑟 d𝑟 = (𝑡 − 1)/log (𝑡 )
represents the scalar logarithmic mean. This function is matrix monotone and satisfies
the normalization 𝑓 ( 1) = 1, so it induces a family of symmetric matrix means, most
easily written using the weighted geometric mean:
∫ 1
𝑴 𝑓 (𝑨, 𝑩) = (𝑨 ♯𝑟 𝑩) d𝑟 for 𝑨, 𝑩 ∈ ℍ𝑛+ and 𝑛 ∈ ℕ.
0
Is there an alternative expression that makes the role of the logarithm clear?
for all psd 𝑨, 𝑩 with the same dimension. Here, 𝑨 : 𝑩 denotes the parallel sum.
The coefficients 𝛼, 𝛽 ≥ 0 and 𝜇 is a finite, positive Borel measure on ℝ++ . The
normalization ensures that 𝛼 + 𝛽 + 𝜇(ℝ++ ) = 1.
Find a simplification for the symmetric case where 𝑴 𝑓 (𝑨, 𝑩) = 𝑴 𝑓 (𝑩, 𝑨) .
Exercise 16.35 has the remarkable interpretation that every family 𝑴 𝑓 of matrix
means consists of a conic combination of the left mean, the right mean, and a family
of harmonic means. We can deduce other significant results as well.
Exercise 16.36 (Minimal and maximal means). The harmonic mean is the minimal matrix
mean, while the arithmetic mean is the maximal matrix mean. That is, for a family
𝑴 𝑓 of symmetric matrix means,
1
2 (𝑨 : 𝑩) 4 𝑴 𝑓 (𝑨, 𝑩) 4 2
(𝑨 + 𝑩)
𝑴 𝑓 (𝜏𝑨 1 + 𝜏𝑨 ¯ 2 ) < 𝜏𝑴 𝑓 (𝑨 1 , 𝑩 1 ) + 𝜏𝑴
¯ 2 , 𝜏𝑩 1 + 𝜏𝑩 ¯ 𝑓 (𝑨 2 , 𝑩 2 ) for all 𝜏 ∈ [ 0, 1] .
Hint: The parallel sum satisfies the same concavity property. First, show that this claim
holds for a unital, strictly positive linear map by invoking Choi’s inequality. Remove
the extra conditions by emulating the proof of the Russo–Dye theorem (Lecture 12).
Notes
The results in this lecture are drawn from Kubo & Ando [KA79] and from Bhatia’s
book [Bha07b, Chap. 4]. The arrangement of material is somewhat different from
these sources. It may be conceptually simpler to regard the basic object as a family of
means induced by a function and then to derive the properties from this representation.
Lecture bibliography
[AJD69] W. N. Anderson Jr and R. J. Duffin. “Series and parallel addition of matrices”. In:
Journal of Mathematical Analysis and Applications 26.3 (1969), pages 576–594.
Lecture 16: Matrix Means 146
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
17. Quantum Relative Entropy
In this lecture, we study matrix entropy functions, also known as quantum entropies.
Agenda:
Quantum entropies arise in quantum information theory and quantum statistical
1. Scalar entropies
physics. They also have remarkable applications in random matrix theory and data 2. Matrix entropies
science. Quantum entries have deep roots in matrix analysis because matrix monotone 3. Matrix perspectives
and convex functions play an important role in the analysis. 4. Tensors and logarithms
We start with some background from information theory in the scalar setting. 5. Convexity of quantum relative
entropy
We introduce the (scalar) entropy function and the relative entropy, and we discuss
their basic properties. These definitions extend to the matrix setting, and we will
see that matrix entropies have many properties that parallel those of scalar entropies.
Nevertheless, the properties of matrix entropies are far more difficult to establish. We
show how to develop these results using matrix perspective transformations, combined
with the idea of lifting a matrix problem to tensors.
We will present two major results in this lecture. The first is the convexity of the
matrix perspective transformation of a matrix convex function. The second result is
the convexity of quantum relative entropy, which represents a nontrivial application of
matrix perspectives. This theorem is one of the crown jewels of matrix analysis.
The entropy is a function on the probability simplex Δ𝑛 that measures the amount
of randomness in a probability distribution.
Recall that negative entropy is an isotone (i.e., Schur convex) function. The theory
of isotone functions implies that
0 ≤ ent (𝒑) ≤ log 𝑛 for each 𝒑 ∈ Δ𝑛 . (17.1)
The minimum in (17.1) is achieved when 𝒑 = 𝜹 𝑖 for some 𝑖 ∈ {1, . . . , 𝑛}. These
probability distributions describe constant (i.e., deterministic) random variables. The
maximum is achieved when 𝒑 = 𝑛 −1 1, which models a uniform random variable.
In other words, the entropy increases as the disorder of a probability distribution
increases .
Let us provide some historical background. The concept of entropy emerged in
statistical physics and thermodynamics, due to work by Gibbs and Boltzmann. Later,
entropy became one of the core tools in information theory after Shannon [Sha48]
found operational interpretations.
Fact 17.3 (Shannon 1948). The entropy ent (𝒑) of a probability distribution 𝒑 ∈ Δ𝑛 is
(proportional to) the average number of bits per symbol to encode a sequence of iid
random variables that are distributed according to 𝒑 .
Definition 17.4 (Relative entropy). The relative entropy or Kullback–Leibler (KL) diver-
gence between two probability distributions 𝒑, 𝒒 ∈ Δ𝑛 is given by the expression
∑︁𝑛
D (𝒑 ; 𝒒 ) B 𝑝𝑖 ( log 𝑝𝑖 − log 𝑞𝑖 ).
𝑖 =1
Exercise 17.5 (Relative entropy). Verify that the relative entropy has the following basic
properties.
Explain why D (𝒑 ; 𝒒 ) ≥ 0 for all 𝒑, 𝒒 ∈ Δ𝑛 .
1. Positivity.
2. Unboundedness.Check that D (𝒑 ; 𝒒 ) can take the value +∞.
3. Asymmetry. Show that D (𝒑 ; 𝒒 ) is not symmetric in its arguments, so it is not a
metric.
Relative entropy also has important operational interpretations in information
theory and statistics.
Fact 17.6 (Stein’s lemma). Fix a distribution 𝒒 ∈ Δ𝑛 . If the relative entropy D (𝒑 ; 𝒒 ) < +∞,
then D (𝒑 ; 𝒒 ) −1 is roughly the number of independent samples we need from the
distribution 𝒑 in order to decide with high probability that 𝒑 ≠ 𝒒 .
The result in Exercise 17.7 is the primary motivation for this lecture. It has important
applications in optimization theory. For example, it plays a key role in the study of
geometric programming and relative entropy programming.
Exercise 17.8 (Information projections). For a closed, convex set C ⊆ Δ𝑛 , characterize the
solution of the minimization problem min𝒑 ∈C D (𝒑 ; 𝒒 ) . The result is analogous with
the characterization of the Euclidean projection onto the convex set C.
The quantum entropy is given by the eigenvalues of the density matrix, which
compose a probability distribution. Quantum entropy then addresses the question
“How disordered are the eigenvalues of a density matrix?” In other words, quantum
entropy offers a way to measure the disorder in a quantum system with state 𝝔.
From our discussion, we immediately obtain the following bounds on the quantum Notice that the bounds on entropy
entropy. are the same in the scalar case in
(17.1) and in the matrix case in (17.2).
0 ≤ ent (𝝔) ≤ log 𝑛 for each 𝝔 ∈ 𝚫𝑛 . (17.2)
The minimum occurs if and only if 𝝔 is a rank-1 matrix (that is, a pure state). The
maximum occurs if and only if 𝝔 = 𝑛 −1 I𝑛 .
There are many operational interpretations of the quantum entropy in quantum
information theory, but they are outside the scope of this lecture.
Lecture 17: Quantum Relative Entropy 150
Definition 17.12 (Umegaki). The (Umegaki) quantum relative entropy of two density
matrices 𝝔, 𝝂 ∈ 𝚫𝑛 is Aside: There are several other
notions of relative entropy in the
S (𝝔; 𝝂) B tr [𝝔( log 𝝔 − log 𝝂)]. quantum setting.
Warning 17.13 (Relative entropy and eigenvectors). Unless 𝝔 and 𝝂 commute, their
eigenvalues alone do not determine the value of the quantum relative entropy
S (𝝔; 𝝂) ! The interactions between the eigenvectors also play a role.
Exercise 17.14 (Quantum relative entropy is positive). For all 𝝔, 𝝂 ∈ 𝚫𝑛 , prove that
S (𝝔; 𝝂) ≥ 0. Hint: Use the generalized Klein inequality (Lecture 8).
Quantum relative entropy also has operational interpretations in quantum informa-
tion theory. For example, we have the quantum extension of Stein’s lemma due to Hiai
& Petz [HP91] and to Ogawa & Nagaoka [ON00].
Fact 17.15 (Quantum Stein’s lemma). Let 𝝂 ∈ 𝚫𝑛 be a density matrix. If S (𝝔; 𝝂) < +∞,
then S (𝝔; 𝝂) −1 is (roughly) the number of unentangled quantum systems, prepared in
state 𝝔, that we must measure to determine that 𝝔 ≠ 𝝂 with high probability.
Having introduced the matrix generalizations of the relative entropy, we can state
a major theorem that extends the convexity property of scalar entropy.
Theorem 17.16 (Convexity of quantum relative entropy). The map (𝝔; 𝝂) ↦→ S (𝝔; 𝝂) is
convex on 𝚫𝑛 × 𝚫𝑛 .
Exercise 17.17 (Quantum versus scalar). Show that the convexity of quantum relative
entropy (Theorem 17.16) implies the convexity of scalar relative entropy (Exercise 17.7).
Hint: Consider diagonal density matrices.
Unlike the scalar result in Exercise 17.7, Theorem 17.16 is quite hard to prove. It
was first obtained by Lindblad [Lin73] using results of Lieb [Lie73a]. In this lecture, we
present a proof due to Effros [Eff09] that is based on matrix perspective transformations.
A key step in this argument is to lift the problem to tensor products, an idea that
first appeared in Ando’s beautiful paper [And79] of the convexity of quantum relative
entropy and related functions.
Theorem 17.16 has many applications in quantum information theory and quantum
statistical physics. In particular, it plays the starring role in the proof that quantum
entropy is strongly subadditive [LR73]. The convexity of quantum relative entropy is
also the main ingredient in the theory of exponential matrix concentration developed
by the lecturer [Tro11; Tro15].
Definition 17.18 may look familiar from Lecture 16. Indeed, if we assume that 𝑓 is
matrix monotone, positive, and satisfies some other inessential properties, then the
construction of 𝚿 𝑓 agrees with the Kubo–Ando matrix mean [KA79].
In this lecture, we study the matrix perspective transformation of a function 𝑓 that
is matrix convex. Let us take a moment to see what these functions look like.
Example 17.19 (Matrix perspectives). Here are some basic examples of matrix perspective
transformations for several matrix convex functions 𝑓 : ℝ++ → ℝ.
We remark that 𝑓 (𝑡 ) = 1 and 𝑓 (𝑡 ) = 𝑡 are both matrix monotone, and they yield
asymmetric matrix means that Kubo & Ando call the “left mean” and the “right mean.”
The other two examples, 𝑓 (𝑡 ) = 𝑡 2 and 𝑓 (𝑡 ) = 𝑡 −1 , are matrix convex but not matrix
monotone. These two functions are the extremal examples of matrix convex functions,
and they yield perspectives that are attractively symmetrical with each other.
Theorem 17.20 (Matrix perspectives are matrix convex). Let 𝑓 : ℝ++ → ℝ be a matrix
convex function. Consider strictly positive-definite matrices 𝑨 𝑖 , 𝑯 𝑖 ∈ ℍ𝑛++ for
𝑖 = 1, 2. Then, for all 𝜏 ∈ [0, 1] with 𝜏¯ B 1 − 𝜏 , we have
𝚿 𝑓 (𝜏𝑨 1 + 𝜏𝑨
¯ 2 ; 𝜏𝑯 1 + 𝜏𝑯
¯ 2 ) 4 𝜏𝚿 𝑓 (𝑨 1 ; 𝑯 1 ) + 𝜏𝚿
¯ 𝑓 (𝑨 2 ; 𝑯 2 ). (17.3)
This theorem tells that the perspective of the averages is “smaller” than the average
of the perspectives. It is the key to our proof of Theorem 17.16.
Proof. We will represent the left-hand side of (17.3) as a matrix convex combination,
which allows us to invoke the matrix Jensen inequality. For notational convenience,
Lecture 17: Quantum Relative Entropy 152
define
𝑨 B 𝜏𝑨 1 + 𝜏𝑨
¯ 2 and 𝑯 B 𝜏𝑯 1 + 𝜏𝑯
¯ 2.
Introduce the coefficient matrices
(𝜏𝑨 1 + 𝜏𝑨 ¯ 2 ) −1 (𝜏𝑨 1 + 𝜏𝑨
¯ 2 ) (𝜏𝑯 1 + 𝜏𝑯 ¯ 2)
4 𝜏𝑨 1𝑯 −1 1 𝑨 1 + 𝜏𝑨
¯ 2𝑯 − 1
2 𝑨 2 for 𝜏 ∈ [ 0, 1] .
This type of expression is called a matrix Schwarz inequality. In some sense, this
quadratic over linear function is the extremal example of a matrix convex function.
The elementary tensor operators of the form 𝑨 ⊗ 𝑯 span the real linear space
L(ℍ𝑛 ⊗ ℍ𝑛 ) of (self-adjoint) linear operators on 𝕄𝑛 .
(𝑨 ⊗ I)( I ⊗ 𝑯 ) = ( I ⊗ 𝑯 )(𝑨 ⊗ I) = (𝑨 ⊗ 𝑯 ).
This commutativity property makes these elementary tensor operators quite pleasant
to work with.
We need to see what the perspective 𝚿 𝑓 looks like. Since 𝑨 ⊗ I and I ⊗ 𝑯 commute,
the perspective can be expressed simply as
𝚿 𝑓 (𝑨 ⊗ I; I ⊗ 𝑯 ) = (𝑨 ⊗ I) · 𝑓 ( I ⊗ 𝑯 )(𝑨 ⊗ I) −1
= −(𝑨 ⊗ I) · log 𝑨 −1 ⊗ 𝑯
You should confirm that 𝜑 𝑿 is a convex function of the pair (𝑨, 𝑯 ) . Indeed, observe
that the psd order on L(ℍ𝑛 ⊗ ℍ𝑛 ) corresponds to increases in this type of quadratic
form.
Write out the quadratic form explicitly using the interpretation of the tensor product
operator in terms of left- and right-multiplication. We find that
Theorem 17.26 (Lieb 1973). Fix an arbitrary matrix 𝑿 ∈ 𝕄𝑛 . For any real 𝑟 ∈ ( 0, 1) ,
the function
(𝑨, 𝑯 ) ↦→ tr 𝑿 ∗ 𝑨 𝑟 𝑿 𝑯 1−𝑟
Proof sketch. The function 𝑓 (𝑡 ) = −𝑡 𝑟 is matrix convex on ℝ++ . Pursue the same
reasoning as in the proof of Theorem 17.16.
Notes
This lecture is based on the instructor’s monograph [Tro15] with some contributions by
Prof. Richard Kueng that appeared in the lecture notes for a previous version of this
course. Ando [And79] developed the technique of proving matrix convexity inequalities
by lifting to tensor products. We use an implementation of the argument proposed by
Effros [Eff09], and extended by Ebadian et al. [ENG11].
Lecture bibliography
[And79] T. Ando. “Concavity of certain maps on positive definite matrices and applications
to Hadamard products”. In: Linear Algebra and its Applications 26 (1979), pages 203–
241.
Lecture 17: Quantum Relative Entropy 155
In this lecture, we will turn to another topic involving the analysis of positive-
Agenda:
semidefinite matrices. We will develop the notion of a positive-definite kernel and
1. Positive-definite kernels
make a connection between these kernels and positive-definite matrices. Positive- 2. Positive-definite functions
definite kernels play an important role in approximation theory, statistics, machine 3. Examples
learning, and physics. Our focus in this lecture is an important class of examples 4. Bochner’s theorem
called translation-invariant kernels, which are associated with convolution operators. 5. Fourier analysis
6. Extensions
Positive-definite, translation-invariant kernels are generated by positive-definite func-
tions. We will develop some examples of positive-definite functions. Then we will state
and prove Bochner’s theorem, which characterizes the continuous positive-definite
functions.
We would like to generalize these notions to functions. This section presents the
definitions, basic examples, and some applications.
18.1.1 Kernels
A bivariate function is often called a kernel, and we may think about them as acting
on functions via integration. In parallel with the concept of a psd matrix, we can
introduce the concept of a positive-definite kernel.
We are being informal in stating this
Definition 18.1 (Positive-definite kernel). A measurable function 𝐾 : ℝ𝑑 × ℝ𝑑 → ℂ definition because it does not play a
is called a kernel on ℝ𝑑 . The kernel acts on a (suitable) function ℎ : ℝ𝑑 → ℂ by central role in this lecture. One must
take care to pose appropriate
integration: ∫ regularity assumptions on the kernel
(𝐾 ℎ) (𝒙 ) B 𝐾 (𝒙 , 𝒚 )ℎ (𝒚 ) d𝒚 for 𝒙 ∈ ℝ𝑑 . function 𝐾 and the test functions ℎ .
ℝ𝑑
We say that 𝐾 is a positive-definite (pd) kernel if it satisfies the condition In this lecture only, we use the
overline for complex conjugation to
avoid confusion with convolution
∫
hℎ, 𝐾 ℎi B ℎ (𝒙 )𝐾 (𝒙 , 𝒚 )ℎ (𝒚 ) d𝒙 d𝒚 ≥ 0 (18.2) operators.
ℝ𝑑 ×ℝ𝑑
Exercise 18.2 (Pd kernels are Hermitian). If 𝐾 is a pd kernel, prove that the kernel is also
Hermitian: 𝐾 (𝒙 , 𝒚 ) = 𝐾 (𝒚 , 𝒙 ) for all 𝒙 , 𝒚 .
Lecture 18: Positive-Definite Functions 157
Definition 18.3 (Kernel matrix). Let 𝐾 be a kernel on ℝ𝑑 . For each finite point set
{𝒙 1 , . . . , 𝒙 𝑛 } ⊂ ℝ𝑑 , we associate a kernel matrix:
𝑲 B 𝑲 (𝒙 1 , . . . , 𝒙 𝑛 ) B 𝐾 (𝒙 𝑖 , 𝒙 𝑗 ) 𝑖 ,𝑗 =1,...,𝑛 .
1. The kernel 𝐾 is bounded and continuous and positive definite for continuous test
functions ℎ that are in L1 (ℝ𝑑 ) .
2. For each finite point set {𝒙 1 , . . . , 𝒙 𝑛 } , the associated kernel matrix 𝑲 (𝒙 1 , . . . , 𝒙 𝑛 )
is positive semidefinite.
Hint: The direction (1 ⇒ 2) follows when we choose a sequence of functions that tends
toward a sum of point masses, located as the data points. The direction (2 ⇒ 1) follows
when we truncate the integrals to a compact set and approximate by a Riemann sum.
1. Inner-product kernel. The inner-product kernel is, perhaps, the simplest positive-
definite kernel on ℝ𝑑 . It is defined as
𝐾 (𝒙 , 𝒚 ) B h𝒙 , 𝒚 i.
The associated kernel matrix is usually called the Gram matrix of the data points. The Gram matrix 𝑮 of points
To see that the kernel is positive-definite, note that 𝒙 1 , . . . , 𝒙 𝑛 ∈ ℝ𝑑 takes the form
∫ ∫ 2 𝑮 = [ h𝒙 𝑗 , 𝒙 𝑘 i ] 𝑗 ,𝑘 =1,...,𝑛 .
hℎ, 𝐾 ℎi = ℎ (𝒙 )h𝒙 , 𝒚 iℎ (𝒚 ) d𝒙 d𝒚 = ℎ (𝒚 )𝒚 d𝒚 ≥ 0.
ℝ𝑑 ×ℝ𝑑 ℝ𝑑
The test function ℎ must have sufficient decay to ensure that the integral is
defined.
2. Correlation kernel. The correlation kernel is the positive-definite kernel that
tabulates the correlations between pairs of vectors:
h𝒙 , 𝒚 i
𝐾 (𝒙 , 𝒚 ) B
k𝒙 kk𝒚 k
Lecture 18: Positive-Definite Functions 158
There are many other examples of kernels. Our discussion here is limited because we
will be focusing on the translation-invariant case.
The kernel trick posits that we can develop alternative methods for data analysis by
replacing the Gram matrix with a pd kernel matrix associated with the data.
For instance, let us summarize how kernel PCA works. Suppose we use a pd kernel
to compute a kernel matrix. To find features for the data, we can compute the leading
eigenvectors of the kernel matrix. Each eigenvector 𝒖 ∈ ℝ𝑑 leads to a feature of the
Í
form 𝜑 (𝒙 ) = 𝑖 𝑢 𝑖 𝐾 (𝒙 , 𝒙 𝑖 ) . This methodology is powerful and widely used.
In summary, we can think about the kernel matrix of a set of data points as a
far-reaching generalization of the Gram matrix. The main criticism of kernel methods
is that the computational cost of the linear algebra can interfere with the application
to large data sets.
Lecture 18: Positive-Definite Functions 159
18.1.4 Generalizations
A few more remarks are in order.
Remark 18.9 (Bounded + continuous). We often assume that the kernel is bounded and
continuous to simplify our exposition. Nevertheless, there are many applications where
we must discard these hypotheses.
In physics, kernels often reflect forces of repulsion between particles. For example,
the Coulomb (electrical potential) kernel takes the form
1
𝐾 (𝒙 , 𝒚 ) B k𝒙 − 𝒚 k −1 for 𝒙 , 𝒚 ∈ ℝ3 .
4𝜋
This form reflects the fact that two negatively charged particles repel each other, and
the force increases as the charges approach each other. The integral operator
∫
(𝐾 𝜎) (𝒙 ) = 𝐾 (𝒙 , 𝒚 )𝜎 (𝒚 ) d𝒚
ℝ3
This generality can be valuable when working with data that may not take vector
values. Using the kernel trick, we can develop methodology for analysis of general
data sets by adapting techniques from the Euclidean setting.
18.2.1 Definitions
A translation-invariant kernel depends only on the difference between its arguments,
so it is spatially homogeneous.
𝐾 𝑓 (𝒙 , 𝒚 ) B 𝑓 (𝒙 − 𝒚 ) for all 𝒙 , 𝒚 ∈ ℝ𝑑 .
Definition 18.13 (Positive-definite function). A function 𝑓 : ℝ𝑑 → ℂ is called a positive- We do not require a pd function 𝑓 to
be continuous, but we will typically
definite (pd) function if it has the following property. For all data 𝒙 1 , . . . , 𝒙 𝑛 ∈
ℝ𝑑 ,
enforce this property. We will see
the kernel matrix 𝑲 𝑓 associated with the convolution operator 𝐾 𝑓 is positive that boundedness follows as a
semidefinite: consequence of the definition.
𝑲 𝑓 B 𝑲 𝑓 (𝒙 1 , . . . , 𝒙 𝑛 ) B 𝑓 (𝒙 𝑗 − 𝒙 𝑘 ) 𝑗 ,𝑘 =1,...,𝑛 < 0.
Exercise 18.14 (Young’s inequality). Suppose that 𝑓 ∈ L1 (ℝ𝑑 ) . Fix 𝑝 ∈ [ 1, ∞] . Prove that
18.2.2 Properties
Individual positive-definite functions have a number of nice features.
Proposition 18.17 (Positive-definite function). Let 𝑓 : ℝ𝑑 → ℂ be a pd function.
At the origin, 𝑓 ( 0) ≥ 0.
1. Positivity.
2. Symmetry. We have the relation 𝑓 (−𝒙 ) = 𝑓 (𝒙 ) for each 𝒙 ∈ ℝ𝑑 .
3. Boundedness. The value | 𝑓 (𝒙 )| ≤ 𝑓 ( 0) for each 𝒙 ∈ ℝ𝑑 .
The class of positive-definite functions has some important stability properties. The
last one is distinctive.
Proposition 18.19 (Class of positive-definite functions). The class of pd functions on ℝ𝑑
satisfies the following properties.
1. Convex cone. If 𝑓 , 𝑔 are pd functions on ℝ𝑑 , then 𝛼 𝑓 + 𝛽 𝑔 is pd for all 𝛼, 𝛽 ≥ 0.
2. Closedness. If ( 𝑓 𝑛 : 𝑛 ∈ ℕ) is a sequence of pd functions on ℝ𝑑 that converges
pointwise to 𝑓 , then 𝑓 is pd.
3. Multiplication. If 𝑓 , 𝑔 are pd functions on ℝ𝑑 , then 𝑓 𝑔 is pd.
Proof. Point (1) holds because the psd cone is convex, and point (2) holds because the
psd cone is closed. To prove (3), we must argue that the entrywise product of two psd
matrices remains psd. But this is the statement of Schur’s product theorem.
Proof. The identity cos (𝑡 𝑥) = 21 ( ei𝑡 𝑥 +e−i𝑡 𝑥 ) expresses the cosine as a conic combination
of two complex exponentials. The claim follows because each complex exponential is
pd, and the class of pd functions is closed under conic combinations.
Exercise 18.23 (Sine is not pd). Show that 𝑥 ↦→ sin (𝑥) is not pd on ℝ. Hint: Sine is an
odd function.
Lecture 18: Positive-Definite Functions 162
To conclude that sinc is pd, we write the integral as a limit of Riemann sums. Each
Riemann sum is pd because it is a conic combination of complex exponentials. The
limit is pd because pd functions are stable under pointwise limits.
As an aside, let us frame an exercise that makes a link between the theory of
positive-definite functions and matrix monotone functions. This is just one of many
such examples; see [Bha07b, Chap. 5].
Exercise 18.25 (Tangent is matrix monotone). Show that 𝑥 ↦→ tan (𝑥) is matrix monotone on
(−𝜋/2, +𝜋/2) by writing the Loewner matrix of the tangent in terms of a kernel matrix
associated with the sinc function. Hint: sin (𝛼 − 𝛽) = sin (𝛼) cos (𝛽) − sin (𝛽) cos (𝛼) .
More generally, let 𝜇 be a finite, complex Borel measure on ℝ. Its Fourier transform
b : ℝ → ℂ is the function
𝜇
∫
b(𝑥) B
𝜇 e−i𝑡 𝑥 d𝜇(𝑡 ) for 𝑥 ∈ ℝ.
ℝ
For our purposes, the key insight is that Fourier transforms induce pd functions.
This is a natural outcome, but the proof requires a little thought.
Proposition 18.27 (Fourier transforms are pd). Let 𝜇 be a finite, positive Borel measure on
ℝ. Then its Fourier transform 𝜇
b is a pd function on ℝ.
Proof. We return to the definition of a pd function. For points 𝑥 1 , . . . , 𝑥𝑛 ∈ ℝ, form
the kernel matrix
∫
e−i𝑡 (𝑥 𝑗 −𝑥𝑘 )
b(𝑥 𝑗 − 𝑥𝑘 )
𝑲 𝜇b = 𝜇 𝑗 ,𝑘
= 𝑗 ,𝑘
d𝜇(𝑡 ).
ℝ
For each 𝑡 ∈ ℝ, the matrix in the integrand is psd since the complex exponential is
a pd function. Therefore, for each unit vector 𝒗 ∈ ℂ𝑛 ,
∫
∗
𝒗 ∗ e−i𝑡 (𝑥 𝑗 −𝑥𝑘 ) 𝑗 ,𝑘 𝒗 d𝜇(𝑡 ) ≥ 0.
𝒗 𝑲 𝜇b𝒗 =
ℝ
Indeed, the integral of a positive function is positive. The integral is finite because the
integrand is bounded by 𝑛 .
Positive-definite functions that arise from Fourier transforms enjoy some additional
regularity properties.
Exercise 18.28 (Fourier transforms: Continuity). Let 𝜇 be a finite, positive Borel measure
on ℝ. Show that its Fourier transform 𝜇 b is a continuous function. Hint: Complex
exponentials are bounded and continuous; truncate the integral to a compact set.
Problem 18.29 (Riemann–Lebesgue lemma). For an integrable function 𝑓 ∈ L1 (ℝ) , prove
that 𝑓b is a continuous function that vanishes at infinity. Hint: Approximate 𝑓 by simple A function 𝑔 : ℝ → ℝ vanishes at
functions, and note that the sinc function is a continuous function that vanishes at infinity if |𝑔 (𝑥) | → 0 as |𝑥 | → ∞.
infinity.
18.3.5 Gaussians
The Fourier transform provides us with a powerful tool for detecting other examples of
pd functions. Here is a critical example.
2
Proposition 18.30 (The Gaussian kernel is pd). The function 𝑥 ↦→ e−𝑥 /2 for 𝑥 ∈ ℝ is pd.
This result is a consequence of the following fundamental fact, which expresses the
Gaussian as a Fourier transform. As we will discuss, Gaussians play a central role in
Fourier analysis because they serve as approximate identities.
Fact 18.31 (Gaussian: Fourier transform). The density 𝜑𝑏,𝑣 of a Gaussian random variable
with mean 𝑏 ∈ ℝ and variance 𝑣 > 0 is the function
2
e−(𝑡 −𝑏) /( 2𝑣 )
𝜑𝑏,𝑣 (𝑡 ) B √ d𝑡 for 𝑡 ∈ ℝ.
2𝜋𝑣
∫
The density is normalized: ℝ
𝜑𝑏,𝑣 (𝑡 ) d𝑡 = 1. Its Fourier transform is the function
2
b𝑏,𝑣 (𝑥) = ei𝑏𝑥 · e−𝑣𝑥
𝜑 /2
for 𝑥 ∈ ℝ.
For example, use symmetry about 𝑡 = 0 to pass to a cosine integral, and integrate by
parts twice.
Theorem 18.33 (Bochner 1933). A continuous function 𝑓 on the real line is positive
definite if and only if 𝑓 is the Fourier transform of a finite, positive Borel measure
𝜇 on the real line: ∫
𝑓 (𝑥) = 𝜇
b(𝑥) = e−i𝑡 𝑥 d𝜇(𝑡 ).
ℝ
Proposition 18.27 establishes the “easy” direction: the Fourier transform of a finite,
positive measure is a pd function. This result has already paid dividends because it
allowed us to identify a number of interesting positive-definite functions.
In this section, we will give a proof the “hard” direction. This result provides a
powerful representation for positive-definite functions. It serves as a building block for
establishing other difficult theorems in matrix analysis and other fields. In particular,
Bochner’s theorem has a consequence for probability theory: The characteristic function
of a random variable is a continuous pd function and vice versa. This can be used (in a
somewhat roundabout way) to prove Lévy’s continuity theorem, which is a key step in
the standard proof of the central limit theorem.
Beyond that, Bochner’s theorem has striking applications in machine learning,
where it serves as the foundation of the method of random Fourier features [RR08].
This representation tells us that every extreme point of B must be a complex exponential.
Let us offer an independent proof of the fact that every complex exponential is an
extreme point. We argue in the spirit of Boutet de Monvel [Sim19, Thm. 28.12].
Proposition 18.34 (Complex exponentials are extreme). For each 𝑡 ∈ ℝ, the complex
exponential 𝑒𝑡 : 𝑥 ↦→ e−i𝑡 𝑥 is an extreme point of B.
Lecture 18: Positive-Definite Functions 165
The Riemann–Lebesgue lemma states that 𝑓b, 𝑓q ∈ C0 (ℝ) , the space of continuous
functions on ℝ that vanish at infinity, equipped with the supremum norm.
A few basic properties merit comment. The Fourier transform satisfies an elegant
duality property:
∫ ∫
𝑓 (𝑥)ℎ
b(𝑥) d𝑥 = 𝑓b(𝑥)ℎ (𝑥) d𝑥 for 𝑓 , ℎ ∈ L1 (ℝ) . (18.4)
ℝ ℝ
The Fourier transform also converts convolution into pointwise multiplication:
∗ ℎ = 𝑓b · ℎ
𝑓 b ∈ C0 (ℝ) for 𝑓 , ℎ ∈ L1 (ℝ) . (18.5)
Young’s inequality ensures that the convolution 𝑓 ∗ ℎ is an integrable function.
Let us introduce the class of twisted Gaussian functions:
2
G B {𝑡 ↦→ 𝑟 ei𝑡 𝑎 e−(𝑡 −𝑏) /( 2𝑣 )
where 𝑟 , 𝑎, 𝑏 ∈ ℝ and 𝑣 > 0}.
Using Fact 18.31, we can quickly confirm that
𝑔 ∈ G implies b
𝑔 , 𝑔q ∈ G and b
𝑔q = 𝑔 .
In particular, the Fourier transform is a bijection on G.
The main result that we require is a precursor of Plancherel’s identity:
∫ ∫
1
𝑓 (𝑡 )𝑔 (𝑡 ) d𝑡 = 𝑓b(𝑥)b
𝑔 (𝑥) d𝑥 when 𝑓 ∈ L1 (ℝ) and 𝑔 ∈ G . (18.6)
ℝ 2𝜋 ℝ
Indeed, we note that 𝑔 = 𝑔q , and apply the duality (18.4) to move the Fourier transform
b
to 𝑓 . Then we pass the complex conjugate through the inverse Fourier transform to
obtain the complex conjugate of the Fourier transform and a scale factor.
Lecture 18: Positive-Definite Functions 166
That is, Gaussians approximate the identity element for the convolution operation. Thus,
we can isolate the value of a continuous function by integration against increasingly
√
localized Gaussians. See Figure 18.2. Hint: Truncate the integral to the interval ±𝑐 𝑣 ,
where 𝑐 depends on sup 𝑓 . Apply the mean value theorem to 𝑓 on this interval.
We will produce the measure 𝜇 that represents 𝑓 as the limit of the measures 𝜇𝑛 that
represent each 𝐹𝑛 .
Positive definiteness. To begin, note that 𝐹𝑛 is positive definite and continuous
because it is the product of two pd, continuous functions. The function 𝐹𝑛 ∈ L1 (ℝ)
because 𝑓 is bounded, and the Gaussian belongs to L1 (ℝ) .
Let us extract the consequence of the pd property that we will need. Choose a test
function 𝑔 ∈ G. By Exercise 18.15, the convolution kernel induced by 𝐹𝑛 is positive
definite for this class of test functions. Therefore, we calculate that
∫ ∫
0≤ 𝑔 (𝑠 )𝐹𝑛 (𝑠 − 𝑡 )𝑔 (𝑡 ) d𝑠 d𝑡 =
𝑔 (𝑠 ) · (𝐹𝑛 ∗ 𝑔 )(𝑠 ) d𝑠
ℝ×ℝ ℝ
∫ ∫ (18.7)
1 1
= 𝑔 (𝑥) · (𝐹
b 𝑛 ∗ 𝑔 ) (𝑥) d𝑥 = 𝑔 (𝑥)| 2 · 𝐹b𝑛 (𝑥) d𝑥.
|b
2𝜋 ℝ 2𝜋 ℝ
To pass to the second line, we invoked Plancherel’s identity (18.6). The last relation is
the convolution theorem (18.5).
Positivity. The pd property of 𝐹𝑛 implies that its Fourier transform 𝐹
b𝑛 is positive:
To prove this, recall that 𝐹b𝑛 is bounded and continuous because of the Riemann–
Lebesgue lemma. Fix a point 𝑏 ∈ ℝ, and select a twisted Gaussian function 𝑔𝑣 ∈ G
whose Fourier transform satisfies
1 2
𝑔𝑣 (𝑥)| 2 = √
|b · e−(𝑥−𝑏) /2𝑣
for 𝑣 > 0.
2𝜋𝑣
Lecture 18: Positive-Definite Functions 167
√
The function |b𝑔𝑣 | 2 is an approximate identity, centered at 𝑏 , with width 𝑣 . Introduce
2
|b
𝑔𝑣 | into the relation (18.7), and take the limit as 𝑣 ↓ 0 using the fact that 𝐹b𝑛 is
continuous. This demonstrates that 𝐹b𝑛 (𝑏) ≥ 0.
Duality. The rest of the proof depends on the complex conjugate of Plancherel’s
identity (18.6):
∫ ∫
1
𝑔 (𝑡 )𝐹𝑛 (−𝑡 ) d𝑡 = 𝑔 (𝑥) 𝐹b𝑛 (𝑥) d𝑥
b for 𝑔 ∈ G. (18.8)
ℝ 2𝜋 ℝ
The last two displays establish the claim because of the identity (18.8).
Measures and limits. Let us introduce a sequence of Borel probability measures on
the real line:
1
d𝜇𝑛 (𝑥) B 𝐹b𝑛 (𝑥) d𝑥 for each 𝑛 ∈ ℕ.
2𝜋
To verify that 𝜇𝑛 is a probability measure, we rely on the facts that 𝐹b𝑛 is positive and its
integral is normalized. Passing to a subsequence if needed, the sequence (𝜇𝑛 : 𝑛 ∈ ℕ)
of probability measures has a weak-∗ limit 𝜇 . Therefore, This claim follows from the
Banach–Alaoglu theorem, applied to
the dual of the space C0 (ℝ) of
∫ ∫
lim ℎ (𝑥) d𝜇𝑛 (𝑥) = ℎ (𝑥) d𝜇(𝑥) for all ℎ ∈ C0 (ℝ) . continuous functions that vanish at
𝑛→∞ ℝ 𝑅 infinity.
1 2 2
𝑔𝑣 (𝑡 ) = √ e−(𝑥+𝑏) /( 2𝑣 )
with 𝑔𝑣 (𝑥) = e−i𝑏𝑥 · e−𝑣𝑡
b /2
for 𝑣 > 0.
2𝜋𝑣
Lecture 18: Positive-Definite Functions 168
Using the duality identity (18.8) and taking the limit as 𝑛 → ∞, we can relate the
function 𝑓 to the measure 𝜇 . Indeed,
∫ ∫
𝑔𝑣 (𝑡 ) 𝑓 (−𝑡 ) d𝑡 = lim 𝑔𝑣 (𝑡 )𝐹𝑛 (−𝑡 ) d𝑡
ℝ 𝑛→∞ ℝ
∫ ∫
= lim 𝑔𝑣 (𝑥) d𝜇𝑛 (𝑥) =
b 𝑔𝑣 (𝑥) d𝜇(𝑥).
b
𝑛→∞ ℝ ℝ
18.4.5 Extensions
Bochner’s theorem holds in far more general settings. In particular, it is valid for pd
functions on ℝ𝑑 . The proof is essentially the same as the proof of Theorem 18.33.
More generally, the theorem holds in abstract settings. We give one such result
without proper definitions, and we illustrate it with an example.
See [Rud90] for the setting of LCA groups, and see [Rud91] for Raikov’s generaliza-
tion to the setting of commutative Banach algebras.
Example 18.38 (Positive-definite sequences). Consider the LCA group (ℤ, +) , comprising
the integers with addition. A function on ℤ is called a sequence, and a sequence
(𝑎𝑘 : 𝑘 ∈ ℤ) is positive definite when
∑︁
𝑢 𝑗 𝑎 𝑗 −𝑘 𝑢 𝑘 ≥ 0 for 𝒖 : ℤ → ℂ with finite support.
𝑗 ,𝑘
Notes
This lecture contains a new presentation of this material. Applications of kernels
are extracted from [SSB18]. The procession of examples is drawn from Bhatia’s
book [Bha07b, Chap. 5]. The proof that complex exponentials are extreme rays of the
cone of pd functions seems to be new. The self-contained proof of Bochner’s theorem is
a significant revision of the proof in Simon’s real analysis text [Sim15]. The material on
Fourier analysis is adapted from Arbogast & Bona [AB08]. See Rudin’s books [Rud90;
Rud91] for generalizations of Bochner’s theorem.
Lecture bibliography
[AB08] T. Arbogast and J. L. Bona. Methods of applied mathematics. ICES Report. UT-Austin,
2008.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[Boc33] S. Bochner. “Monotone Funktionen, Stieltjessche Integrale und harmonische
Analyse”. In: Math. Ann. 108.1 (1933), pages 378–410. doi: 10.1007/BF01452844.
[RR08] A. Rahimi and B. Recht. “Random Features for Large-Scale Kernel Machines”.
In: Advances in Neural Information Processing Systems 20. Curran Associates, Inc.,
2008, pages 1177–1184. url: http://papers.nips.cc/paper/3182-random-
features-for-large-scale-kernel-machines.pdf.
[Rud90] W. Rudin. Fourier analysis on groups. Reprint of the 1962 original, A Wiley-
Interscience Publication. John Wiley & Sons, Inc., New York, 1990. doi: 10.1002/
9781118165621.
[Rud91] W. Rudin. Functional analysis. Second. McGraw-Hill, Inc., New York, 1991.
[SSB18] B. Schlkopf, A. J. Smola, and F. Bach. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.
[Sim15] B. Simon. Real analysis. With a 68 page companion booklet. American Mathematical
Society, Providence, RI, 2015. doi: 10.1090/simon/001.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
19. Entrywise PSD Preservers
In this lecture, we discuss several classes of kernel functions that are defined on
Agenda:
Euclidean spaces of every dimension: radial kernels and inner-product kernels. In the
1. Radial kernels
first case, Bochner’s theorem leads to a characterization of all radial kernels. To study 2. Inner-product kernels
the second case, we must investigate a new concept. Suppose that we apply a function 3. Entrywise psd preservers
to each entry of a matrix to produce a new matrix. This function is called an entrywise 4. Examples
psd preserver if it maps each psd matrix to another psd matrix. These functions also 5. Vasudeva’s theorem
6. Extensions
enjoy a beautiful theory that complements the Loewner theory of matrix monotone
functions and the Bochner theory of positive-definite functions.
In other words, we consider the convolution kernel 𝐾 𝑓 induced by the function 𝑓 . The
function 𝑓 is pd when the kernel matrices 𝑲 𝑓 associated with the convolution kernel
are all psd.
Under mild regularity assumptions, the eigenfunctions of a convolution operator
are complex exponentials. This observation suggests that the complex exponentials will
play a basic role in characterizing convolution kernels. Indeed, we have the following
fundamental result.
Theorem 19.3 (Schoenberg 1938). A continuous function 𝜑 : ℝ+ → ℝ+ is positive- Every pd radial function must take
definite radial if and only if it is given by the Laplace transform of a finite, positive positive values! (Why?)
Borel measure 𝜇 on ℝ+ . More precisely,
∫
2 2
𝜑 (𝑡 ) = e−𝑟 𝑡 /2
d𝜇(𝑟 ) for 𝑡 ∈ ℝ+ .
ℝ+
Proof sketch. The reverse direction is simple. We write the Gaussian as a characteristic We use random variables and
expectations to simplify some of the
function (i.e., a Fourier transform). For 𝒙 ∈ ℝ𝑑 ,
formulas here.
2
k𝒙 k 2 /2
e−𝑟 = 𝔼[ e−i h𝒛 , 𝒙 i/𝑟 ] where 𝒛 ∼ normal ( 0, I𝑑 ).
Using dominated convergence,
∫ ∫
−𝑟 2 k𝒙 k 2 /2 −i h𝒛 , 𝒙 i/𝑟
𝜑 (k𝒙 k) = e d𝜇(𝑟 ) = 𝔼 e d𝜇(𝑟 ) .
ℝ𝑑 ℝ𝑑
That is, the function 𝜑 is an entrywise psd preserver. Hint: A matrix in 𝕄𝑑 (𝔽) is psd if
and only if it is the Gram matrix of 𝑑 points in 𝔽 𝑑 .
This seems a bit perverse: we are always told that entrywise operations on matrices
are unorthodox. Nevertheless, this exercise suggests that entrywise functions that
preserve the psd property may merit study. This is the primary object of this lecture.
19.1.4 Applications
Before turning to our work, let us mention some motivating applications of psd
inner-product kernels and entrywise psd preservers.
Example 19.6 (Kernel methods). As we briefly discussed in Lecture 18, we can use the
kernel trick to design new methods for data analysis. In a method that only uses the
Gram matrix of Euclidean data, we can substitute a psd kernel matrix to try to find
alternative (non-Euclidean) structure in the data. Some commonly used inner-product
kernels include polynomial kernels of the form
𝐾 (𝒙 , 𝒚 ) = h𝒙 , 𝒚 i𝑝 or 𝐾 (𝒙 , 𝒚 ) = ( 1 + h𝒙 , 𝒚 i)𝑝 for 𝑝 ∈ ℕ.
Example 19.7 (Covariance regularization). Suppose that we have acquired a psd covariance
matrix 𝑨 < 0 whose entries tabulate (estimated) covariances among a family of random
variables. In practice, it is common that covariance estimates are inaccurate, so we
may wish to process the matrix to mitigate the effects of noise. In some settings, it is
important to ensure that the procedure respects the psd property so that the processed
matrix is still a covariance matrix.
One class of inexpensive methods applies a scalar function 𝑓 to each entry of the
covariance matrix to obtain [𝑓 (𝑎 𝑗 𝑘 )] . In this context, it is natural to insist that the
function 𝑓 is an entrywise psd preserver.
Lecture 19: Entrywise PSD Preservers 173
Definition 19.8 (Entrywise matrix function). Let E ⊆ 𝔽 . Define the set of 𝑛 × 𝑛 matrices In this lecture, we always use brackets
with entries in E: to denote entrywise behavior.
If this claim holds true for all 𝑛 ∈ ℕ, we simply say that 𝑓 is an entrywise psd
preserver (epp) on D.
Exercise 19.10 (Epps and positivity). If D ∩ ℝ+ = ∅, show that the definition of an epp on
𝕄𝑛 [ D] is vacuous. Otherwise, if D ∩ ℝ+ ≠ ∅, exhibit an example of an epp on 𝕄𝑛 [D] .
As with positive linear maps and standard matrix functions, a differentiable epp is
also monotone with respect to the psd order.
Problem 19.11 (Differentiable epps are monotone). For simplicity, assume that D ⊆ 𝔽 is an
open set. Suppose that 𝑓 : D → 𝔽 is differentiable. Show that 𝑓 is an epp on 𝕄𝑛 [ D]
if and only if
Hint: The proof is essentially the same as the proof that a differentiable function is
matrix monotone if and only if the Loewner matrix is psd. In this case, the argument is
much easier because we have no need of the Daleckii–Krein formula.
Lecture 19: Entrywise PSD Preservers 174
19.2.3 Properties
Entrywise psd preservers enjoy some of the same properties as positive-definite
functions.
Proposition 19.12 (Entrywise psd preserver). For a disc D ⊆ 𝔽 , let 𝑓 : D → 𝔽 be an epp.
Proof. Choose numbers 𝑎, 𝑐 ∈ D ∩ℝ++ subject to the relation 0 < 𝑐 < 𝑎 . The positivity
and boundedness properties ensure that 𝑓 (𝑐 ) = |𝑓 (𝑐 )| ≤ 𝑓 (𝑎) . Thus, 𝑓 is increasing
on D ∩ ℝ++ .
Define the function 𝑔 (𝑥) B log 𝑓 √︁
( e𝑥 ) whenever e𝑥 ∈ D ∩ ℝ++ . The boundedness
√
property guarantees that 𝑓 ( 𝑎𝑐 ) ≤ 𝑓 (𝑎) 𝑓 (𝑐 ) . Write 𝑎 = e𝑥 and 𝑐 = e𝑦 . Take the
logarithm, and recognize the function 𝑔 :
𝑔 ( 12 𝑥 + 12 𝑦 ) ≤ 12 𝑔 (𝑥) + 12 𝑔 (𝑦 ).
In other words, 𝑔 is midpoint convex, so it is continuous on the interior of its domain.
Thus, 𝑓 is continuous on D ∩ ℝ++ .
Proposition 19.14 (Epp: Stability properties). Let D ⊂ 𝔽 be a disc.
Proof. The claims (1) and (2) are valid because the psd matrices form a convex cone
that is closed. Meanwhile, claim (3) is a consequence of the Schur product theorem.
There is only one setting where epps have been completely characterized: the case
of 2 × 2 matrices with strictly positive entries [Vas79, Thm. 2]. We mention this result
here because we will use this result to prove a weaker characterization theorem.
Exercise 19.15 (Epps: A special characterization). Show that 𝑓 is an epp on 𝕄2 [ℝ++ ] if and
only if 𝑥 ↦→ log 𝑓 ( e𝑥 ) is increasing and midpoint convex.
Lecture 19: Entrywise PSD Preservers 175
19.3.1 Monomials
The simplest example of an epp is the function that always returns one.
Proposition 19.16 (Constants are epps). The function 𝑓 (𝑡 ) = 1 is an epp on ℝ.
Proof. For a psd matrix 𝑨 ∈ 𝕄𝑛 [ℝ] , we see that 𝑓 [𝑨] = 11ᵀ , which is psd.
Next, we turn our attention to the monomials.
Proposition 19.17 (Monomials are epps). For each 𝑝 ∈ ℕ, the function 𝑓 (𝑡 ) = 𝑡 𝑝 is an epp
on ℝ.
As usual, denotes the Schur product. Since 𝑨 < 0, an iterative application of the
Schur product theorem guarantees that the product is psd.
Problem 19.18 (Other powers). Choose a positive number 𝑟 that is not an integer. Prove
that the function 𝑡 ↦→ 𝑡 𝑟 is not an epp on ℝ+ . Hint: This is hard. One approach is to
argue that every epp on 𝕄𝑛 [ℝ+ ] must have 𝑛 − 1 continuous derivatives, and choose
𝑛 > 𝑟 . See the proofs of Vasudeva’s theorem and Bernstein’s theorems (below) for
some relevant techniques.
Proof. The monomials are epps on ℝ, and the class of epps on ℝ is a convex cone.
Therefore, each polynomial with positive coefficients is an epp on ℝ. Within the domain
D of convergence, the partial sums of 𝑓 are polynomials with positive coefficients that
converge pointwise to 𝑓 . Since the class of epps on D is closed under pointwise limits,
we see that 𝑓 is an epp on D.
Later, we will discuss the class of power series with positive coefficients in greater
detail. For the moment, we note a few basic properties.
Exercise 19.20 (Power series with positive coefficients). Suppose that 𝑓 : D → ℝ has
the form (19.1). Then 𝑓 ∈ C∞ ( D) , the set of infinitely differentiable functions on D.
Furthermore, for each 𝑝 ∈ ℕ, the derivative 𝑓 (𝑝) is also an epp on D.
Lecture 19: Entrywise PSD Preservers 176
These series have positive coefficients, but they converge only in the open interval
D = (−1, +1) . Proposition 19.19 implies that these two functions are epps on the
interval (−1, +1) .
Definition 19.25 (Absolutely monotone function). An infinitely differentiable function We allow 𝑎, 𝑏 ∈ {±∞} in this
𝑓 : (𝑎, 𝑏) → ℝ on an open interval of the real line is called absolutely monotone if definition.
its derivatives are positive:
(𝑝)
𝑓 (𝑡 ) ≥ 0 for all 𝑡 ∈ (𝑎, 𝑏) and all 𝑝 ∈ ℤ+ .
Proof. Exercise 19.26 shows that 𝑓 has right-derivatives of all orders at 𝑡 = 𝑎 . For each
𝑛 ∈ ℕ, we may expand 𝑓 as a Taylor series with an integral remainder:
∑︁𝑛−1 𝑓 (𝑝) (𝑎)
𝑓 (𝑡 ) = (𝑡 − 𝑎)𝑝 + 𝑅 𝑛 (𝑡 ) when 𝑎 ≤ 𝑡 ≤ 𝑐 < 𝑏 .
𝑝=0 𝑝!
We may express the remainder in the form
1
∫ 𝑡 𝑡 − 𝑠 𝑛−1
(𝑛)
𝑅 𝑛 (𝑡 ) = 𝑓 (𝑠 )(𝑐 − 𝑠 ) 𝑛−1 d𝑠 .
(𝑛 − 1) ! 𝑎 𝑐 −𝑠
Since the fraction in the integrand decreases as a function of 𝑡 and the derivatives 𝑓 (𝑛)
Theorem 19.30 (Entrywise psd preservers on ℝ++ ; Vasudeva 1979). A function 𝑓 : ℝ++ →
ℝ is an epp if and only if it is absolutely monotone. That is,
∑︁∞
𝑓 (𝑡 ) = 𝑐𝑝 𝑡 𝑝 where 𝑐𝑝 ≥ 0 for all 𝑝 ∈ ℕ.
𝑝=0
This statement includes the claim that the series converges for all 𝑡 > 0.
Indeed, when 𝑡 is sufficiently small, 𝑎 11ᵀ + 𝑡 𝒄 𝒄 ∗ is a psd matrix with strictly positive
entries. By the 𝑝 th order Taylor expansion of 𝑓 about 𝑎 with a derivative remainder,
we find that
(𝑡 𝑐 𝑗 𝑐 𝑘 )𝑟 (𝑡 𝑐 𝑗 𝑐 𝑘 )𝑝
∑︁
∑︁𝑑 𝑝−1 (𝑟 ) (𝑝)
𝑢 𝑗 𝑢𝑘 𝑓 (𝑎) + 𝑓 (𝑎 + 𝑡 𝜃 𝑗 𝑘 𝑐 𝑗 𝑐 𝑘 ) ≥ 0.
𝑗 ,𝑘 =1 𝑟 =0 𝑟! 𝑝!
Vandermonde matrix
1 𝑐 1 𝑐 2 . . . 𝑐 𝑝
1 1
1 𝑐 2 𝑐 2 . . . 𝑐 𝑝
2 2
𝑽 = . . (19.3)
. . . .
.. .. .. .. ..
𝑝
1 𝑐 𝑑
𝑐 𝑑2 . . . 𝑐 𝑑
Since the entries of 𝒄 are distinct, the Vandermonde matrix 𝑽 is nonsingular (Problem 19.33). Therefore, we can find a vector 𝒖 ∈ ℝ^𝑑 that is orthogonal to the first 𝑝 columns of 𝑽 but not orthogonal to the last column. Equivalently, Σ_𝑗 𝑢_𝑗 𝑐_𝑗^𝑟 = 0 for 0 ≤ 𝑟 < 𝑝, while Σ_𝑗 𝑢_𝑗 𝑐_𝑗^𝑝 ≠ 0.
With these choices, the sum in the penultimate display collapses to the form
(𝑡^𝑝 / 𝑝!) Σ_{𝑗,𝑘=1}^{𝑑} 𝑢_𝑗 𝑢_𝑘 · 𝑐_𝑗^𝑝 𝑐_𝑘^𝑝 · 𝑓^(𝑝)(𝑎 + 𝑡 𝜃_{𝑗𝑘} 𝑐_𝑗 𝑐_𝑘) ≥ 0.
19.5.2 Smoothing
Next, we show that it is possible to take an arbitrary epp on ℝ++ and smooth it to
obtain an infinitely differentiable epp. This result requires the use of a mollifier, a nice
function that can confer its nice properties onto another function.
Fact 19.34 (Mollifier). Fix 𝛿 ∈ ( 0, 1) . There is a probability density function ℎ 𝛿 : ℝ++ →
ℝ+ that has compact support, is infinitely differentiable, has expectation 1, and has
variance 𝛿 .
Hint: The first three parts follow from dominated convergence, perhaps after the change
of variables 𝑢 = 𝑡 /𝑥 . For the last part, approximate the integral by a Riemann sum.
Uniformly in 𝛿, the coefficients are absolutely summable because Σ_{𝑝=0}^{∞} 𝑐_𝑝(𝛿) = 𝑓_𝛿(1) → 𝑓(1). We will apply some standard results from complex analysis to complete the argument.
Exercise 19.35 implies that 𝑓_𝛿 → 𝑓 pointwise as 𝛿 ↓ 0. We need to upgrade the convergence. The family (𝑓_𝛿 : 0 < 𝛿 < 1) is uniformly bounded on each compact set in ℂ because the coefficients in the power series for 𝑓_𝛿 are uniformly absolutely summable. Passing to a subsequence if necessary, 𝑓_𝛿 → 𝑓 uniformly on each compact set in ℂ [Ahl66, Thm. 12, p. 217]. (This result follows from the Arzelà–Ascoli theorem.)
𝑓(𝑡) = Σ_{𝑝=0}^{∞} 𝑐_𝑝 𝑡^𝑝   with 𝑐_𝑝 ≥ 0, for all 𝑡 ∈ ℝ.
Indeed, the coefficients must be positive because we have taken the limit of series with
positive coefficients. This is the required result.
19.5.4 Variations
Vasudeva’s theorem is, perhaps, the simplest of several results that characterize
entrywise psd preservers. A key feature of these results is that they do not make
strong prior assumptions on the smoothness properties of the epp, but rather extract
smoothness as a consequence of the structural assumption.
The first theorem of this type was derived by Schoenberg [Sch38].
Theorem 19.37 (Rudin 1959). A function 𝑓 : (−1, +1) → ℝ is an epp if and only if it
has a representation as a power series about zero with positive coefficients.
Rudin conjectured that there was an extension of his result to the complex setting,
and this result was obtained soon after by Herz [Her63].
See the survey [Bel+18] for more discussion of these results, as well as recent
advances.
Exercise 19.40 (Absolute versus complete). Show that 𝑓 is completely monotone on (𝑎, 𝑏)
if and only if its reversal 𝑡 ↦→ 𝑓 (−𝑡 ) is absolutely monotone on (−𝑏, −𝑎) .
In many ways, it is more natural to study completely monotone functions, and we
will see that they enjoy a remarkable integral representation.
Definition 19.41 (Right difference). For ℎ > 0, the right-difference operator of width ℎ
is defined on functions 𝑓 : ℝ+ → ℝ via the rule
Δℎ 𝑓 : 𝑎 ↦→ 𝑓 (𝑎 + ℎ) − 𝑓 (𝑎) for 𝑎 ∈ ℝ+ .
Exercise 19.42 (Higher-order differences). Prove that the iterated differences satisfy
Δ_ℎ^𝑝 𝑓(𝑎) = Σ_{𝑘=0}^{𝑝} (−1)^{𝑝−𝑘} \binom{𝑝}{𝑘} 𝑓(𝑎 + 𝑘ℎ).
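As a small illustration (a Python sketch, not part of the original notes), the following code compares the binomial-sum formula with 𝑝-fold application of the right-difference operator; the test function e^{−2𝑡} is an arbitrary choice.

    import math

    def delta(f, h):
        """Right-difference operator: (Δ_h f)(a) = f(a + h) - f(a)."""
        return lambda a: f(a + h) - f(a)

    def delta_power(f, h, p):
        """Apply Δ_h to f a total of p times."""
        g = f
        for _ in range(p):
            g = delta(g, h)
        return g

    def delta_power_formula(f, h, p, a):
        """Binomial-sum formula for the iterated difference."""
        return sum((-1) ** (p - k) * math.comb(p, k) * f(a + k * h) for k in range(p + 1))

    f = lambda t: math.exp(-2.0 * t)     # an arbitrary completely monotone test function
    h, p, a = 0.3, 4, 1.0
    print(delta_power(f, h, p)(a), delta_power_formula(f, h, p, a))  # the two values agree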
With this concept, we can give another definition of complete monotonicity that
does not rely on differentiability properties.
Exercise 19.47 (Completely monotone functions: Convex cone). Prove that the completely
monotone functions on ℝ+ compose a convex cone that is closed under pointwise
limits.
Proof sketch. The reverse direction simply asks us to verify that the integral represents
a completely monotone function. This is an easy calculation.
Consider the convex set B of normalized completely monotone functions.
Since completely monotone functions are positive and decreasing, every function 𝑓 ∈ B
satisfies 0 ≤ 𝑓 (𝑡 ) ≤ 1 for 𝑡 ∈ ℝ+ . After some argument, it follows that the set B is
compact in the topology of pointwise convergence.
Let 𝑒 be an extreme point of B. If 𝑒(𝑡) = 1 for some 𝑡 > 0, then 𝑒(𝑡) = 1 for every 𝑡 ≥ 0 because 𝑒 is convex and decreasing. If 𝑒(𝑡) = 0 for all 𝑡 > 0, then 𝑒(𝑡) = 𝛿_0(𝑡).
Let us exclude these two cases. Then 𝑒 must be continuous on ℝ+ . Indeed, a
completely monotone function is continuous on ℝ++ . But 𝑒 cannot have a discontinuity
at zero, or else it is a convex combination of 𝛿 0 and another function in B.
To continue, observe that there is a point 𝑎 0 > 0 where 0 < 𝑒 (𝑎 0 ) < 1. Since 𝑒 is
continuous, we have 0 < 𝑒 (𝑎) < 1 on the interval 0 < 𝑎 ≤ 𝑎 0 . Fix a point 0 < 𝑎 ≤ 𝑎 0 ,
and define two functions
𝑓_𝑎(𝑡) ≔ 𝑒(𝑡 + 𝑎)/𝑒(𝑎)   and   𝑔_𝑎(𝑡) ≔ (𝑒(𝑡) − 𝑒(𝑡 + 𝑎))/(1 − 𝑒(𝑎))   for 𝑡 ∈ ℝ+.
Both 𝑓_𝑎 and 𝑔_𝑎 belong to B, and 𝑒(𝑡) = 𝑒(𝑎) 𝑓_𝑎(𝑡) + (1 − 𝑒(𝑎)) 𝑔_𝑎(𝑡). Since 𝑒 is an extreme point, 𝑒 = 𝑓_𝑎; that is, 𝑒(𝑡 + 𝑎) = 𝑒(𝑡) 𝑒(𝑎) for all 𝑡 ∈ ℝ+ and 0 < 𝑎 ≤ 𝑎_0. The only continuous solution to these equations with 𝑒(0) = 1 and 0 < 𝑒(𝑡) ≤ 1 takes the form 𝑒(𝑡) = e^{−𝑡𝑥} for some 𝑥 > 0.
The integral representation follows from a routine application of the Krein–Milman
theorem.
Notes
The presentation in this lecture is new. We have used the survey article [Bel+18] as a guide to the literature, and we have adapted some of its organization. Widder’s book [Wid41] remains an excellent source for material on Laplace transforms.
Lecture bibliography
[AS64] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas,
graphs, and mathematical tables. For sale by the Superintendent of Documents. U.
S. Government Printing Office, Washington, D.C., 1964.
[Ahl66] L. V. Ahlfors. Complex analysis: An introduction of the theory of analytic functions of
one complex variable. Second. McGraw-Hill Book Co., New York-Toronto-London,
1966.
[Bel+18] A. Belton et al. “A panorama of positivity”. Available at https://arXiv.org/abs/1812.05482. 2018.
[Cha13] D. Chafaï. A probabilistic proof of the Schoenberg theorem. 2013. url: https://djalil.chafai.net/blog/2013/02/09/a-probabilistic-proof-of-the-schoenberg-theorem/.
[Her63] C. S. Herz. “Fonctions opérant sur les fonctions définies-positives”. In: Ann. Inst. Fourier (Grenoble) 13 (1963), pages 161–180. url: http://aif.cedram.org/item?id=AIF_1963__13__161_0.
[Hor69] R. A. Horn. “The theory of infinitely divisible matrices and kernels”. In: Trans.
Amer. Math. Soc. 136 (1969), pages 269–286. doi: 10.2307/1994714.
[Lax02] P. D. Lax. Functional analysis. Wiley-Interscience, 2002.
[Olv+10] F. W. J. Olver et al., editors. NIST handbook of mathematical functions. With 1
CD-ROM (Windows, Macintosh and UNIX). National Institute of Standards and
Technology, 2010.
[Rud59] W. Rudin. “Positive definite sequences and absolutely monotonic functions”. In: Duke Math. J. 26 (1959), pages 617–622. url: http://projecteuclid.org/euclid.dmj/1077468771.
[Sch38] I. J. Schoenberg. “Metric spaces and positive definite functions”. In: Trans. Amer.
Math. Soc. 44.3 (1938), pages 522–536. doi: 10.2307/1989894.
[Vas79] H. Vasudeva. “Positive definite matrices and absolutely monotonic functions”. In:
Indian J. Pure Appl. Math. 10.7 (1979), pages 854–858.
[Wid41] D. V. Widder. The Laplace Transform. Princeton University Press, Princeton, N. J.,
1941.
II. problem sets
This assignment covers multilinear algebra and majorization. The problems are
optional, but keep in mind that you have to do mathematics to learn mathematics!
1. Kron Job. Let H be an 𝑛-dimensional inner-product space over the field 𝔽 with orthonormal basis (e_1, . . . , e_𝑛). We write the matrix of a linear operator on H with respect to this distinguished basis. We arrange the family (e_𝑖 ⊗ e_𝑗 : 1 ≤ 𝑖, 𝑗 ≤ 𝑛) of elementary tensors in H ⊗ H in lexicographic order. That is, e_𝑖 ⊗ e_𝑗 precedes e_𝑘 ⊗ e_ℓ when 𝑖 < 𝑘 or else 𝑖 = 𝑘 and 𝑗 < ℓ.
2. Show that 𝒙 ⊗ 𝒚 + 𝒛 ⊗ 𝒘 is an elementary tensor if and only if 𝒘 = 𝛼𝒚 for some scalar 𝛼 ∈ 𝔽.
3. Let 𝑨 and 𝑩 be linear operators on H, so 𝑨 ⊗ 𝑩 is a linear operator on
H ⊗ H. Show that the matrix representation of this operator with respect to
the ordered basis for H ⊗ H takes the form
ℳ(𝑨 ⊗ 𝑩) = [ 𝑎_11 𝑩  ⋯  𝑎_1𝑛 𝑩 ; ⋮ ⋱ ⋮ ; 𝑎_𝑛1 𝑩  ⋯  𝑎_𝑛𝑛 𝑩 ] ∈ 𝔽^{𝑛²×𝑛²}.
The block matrix is called the Kronecker product of 𝑨 and 𝑩 . It gives a
concrete representation of the tensor product 𝑨 ⊗ 𝑩 . We typically omit the
ℳ in this notation.
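A brief numerical sketch (Python with NumPy; not part of the original problem set) illustrating that the Kronecker product acts on elementary tensors as the tensor product should, that is, (𝑨 ⊗ 𝑩)(𝒙 ⊗ 𝒚) = (𝑨𝒙) ⊗ (𝑩𝒚).

    import numpy as np

    rng = np.random.default_rng(1)
    n = 3
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    x, y = rng.standard_normal(n), rng.standard_normal(n)

    # The Kronecker product acts on elementary tensors the way A ⊗ B should:
    lhs = np.kron(A, B) @ np.kron(x, y)     # (A ⊗ B)(x ⊗ y)
    rhs = np.kron(A @ x, B @ y)             # (Ax) ⊗ (By)
    print(np.allclose(lhs, rhs))            # True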
4. The operator vec : 𝔽^{𝑛×𝑛} → 𝔽^{𝑛²} maps a matrix to a vector by concatenating the columns in order from left to right. Show that
2. (*) Verify that the symmetric tensor product is related to the permanent:
𝐾 = {𝒙 ∈ ℝ^𝑛 : 𝑥_1 ≥ 𝑥_2 ≥ · · · ≥ 𝑥_𝑛 ≥ 0}.

3. Fan minimum principles. Let Φ_(𝑘)(𝒙) = Σ_{𝑖=1}^{𝑘} |𝒙|_𝑖^↓ be the 𝑘-max norm on ℝ^𝑛, where 1 ≤ 𝑘 ≤ 𝑛. Define the Ky Fan matrix norm ‖·‖_(𝑘) = Φ_(𝑘) ∘ 𝝈. Prove that
Φ_(𝑘)(𝒙) = min{Φ_(𝑛)(𝒚) + 𝑘 Φ_(1)(𝒛) : 𝒙 = 𝒚 + 𝒛}.
1. Let 𝑨, 𝑩 ∈ ℂ𝑛×𝑛 . Use the von Neumann trace theorem and the Hermitian
dilation to demonstrate that
max{Re tr(𝑼* 𝑨𝑽𝑩) : 𝑼, 𝑽 ∈ ℂ^{𝑛×𝑛} are unitary} = Σ_{𝑖=1}^{𝑛} 𝜎_𝑖(𝑨) 𝜎_𝑖(𝑩).
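As an illustration (a Python sketch, not part of the original problem set), the unitaries built below from the singular value decompositions of 𝑨 and 𝑩 attain the value Σ_𝑖 𝜎_𝑖(𝑨)𝜎_𝑖(𝑩); this construction is one standard way to achieve the maximum.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

    P, sA, Qh = np.linalg.svd(A)        # A = P diag(sA) Qh
    R, sB, Sh = np.linalg.svd(B)        # B = R diag(sB) Sh

    V = Qh.conj().T @ R.conj().T        # V = Q R*
    U = P @ Sh                          # U = P S*
    value = np.real(np.trace(U.conj().T @ A @ V @ B))
    print(np.isclose(value, np.sum(sA * sB)))   # True: the maximum is attained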
‖𝑨^𝑠 𝑩^𝑠‖ ≤ ‖𝑨𝑩‖^𝑠 for 0 ≤ 𝑠 ≤ 1.
(tr (Σ_{𝑖=1}^{𝑚} 𝑨_𝑖^* 𝑩 𝑨_𝑖)^𝑝)^{1/𝑝} ≤ (tr (Σ_{𝑖=1}^{𝑚} 𝑨_𝑖^* 𝑨_𝑖)^{2𝑝})^{1/(2𝑝)} (tr 𝑩^{2𝑝})^{1/(2𝑝)}.
⟨𝒚, 𝒙⟩ ≤ 𝑓*(𝒚) + 𝑓(𝒙) for all 𝒙, 𝒚 ∈ ℝ^𝑛.
The subdifferential of 𝑓 at the point 𝒙 is the set
∂𝑓(𝒙) = {𝒚 ∈ ℝ^𝑛 : ⟨𝒚, 𝒙⟩ = 𝑓*(𝒚) + 𝑓(𝒙)}.
If 𝑓 is differentiable at 𝒙, then ∂𝑓(𝒙) = {∇𝑓(𝒙)}. Similarly, for a function 𝐹 : ℍ_𝑛 → ℝ ∪ {+∞}, the Fenchel conjugate is
𝐹*(𝒀) = sup_{𝑿 ∈ ℍ_𝑛} ⟨𝒀, 𝑿⟩ − 𝐹(𝑿).
2. Davis–Kahan. In this problem, we will develop the proof of the most famous
perturbation theorem for invariant subspaces. Let 𝑨 ∈ 𝕄𝑚 and 𝑩 ∈ 𝕄𝑛 be
normal matrices. Assume that |𝜆𝑖 (𝑨)| ≤ 𝜚 and |𝜆 𝑗 (𝑩)| > 𝜚 for all 𝑖 , 𝑗 . Consider
the Sylvester equation 𝑨𝑿 − 𝑿 𝑩 = 𝒀 .
Hint: The spectral radius formula states that max_𝑖 |𝜆_𝑖(𝑨)| = lim_{𝑝→∞} ‖𝑨^𝑝‖^{1/𝑝}.
3. For each unitarily invariant norm |||·|||, deduce that the solution satisfies
|||𝑿||| ≤ (1/𝛿) |||𝒀|||.
4. Now, let 𝑨, 𝑩 be normal matrices of the same dimension. Let S𝑨 and S𝑩 be
subsets of the complex plane, separated by an annulus of width 𝛿 . Consider
the spectral projector 𝑷 𝑨 of 𝑨 onto the subspace spanned by the eigenvalues
listed in S𝑨 , as well as the analog 𝑷 𝑩 for 𝑩 . Prove that
|||𝑷_𝑨 𝑷_𝑩||| ≤ (1/𝛿) |||𝑨 − 𝑩|||.
5. Specialize the last result to the case where 𝑨, 𝑩 are Hermitian. Consider
S𝑨 = [𝑎, 𝑏] and S𝑩 = (−∞, 𝑎 − 𝛿 ] ∪ [𝑏 + 𝛿 , +∞) . Interpret the left-hand
side as a measure of the size of the sines of the principal angles between
range (𝑷 𝑨 ) and range (𝑷 𝑩 ) ⊥ . This is called the sine-theta theorem.
1. For each Hermitian 𝑨, show that 𝝀(𝚿(𝑨)) ≺ 𝝀(𝑨). Pinching flattens out the eigenvalues. (A numerical sketch appears after part 3 below.)
2. For each square matrix 𝑨 , show that 𝝈 (𝚿(𝑨)) ≺𝑤 𝝈 (𝑨) .
3. Suppose that {𝑷_𝑗} is a family of orthogonal projectors with the property that 𝑷_𝑖 𝑷_𝑗 = 𝛿_{𝑖𝑗} 𝑷_𝑖 and Σ_𝑗 𝑷_𝑗 = I. Consider the (general) pinching map
𝚿(𝑨) = Σ_𝑗 𝑷_𝑗 𝑨 𝑷_𝑗.
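The following numerical sketch (Python with NumPy; not part of the original problem set) applies a simple two-block pinching to a random Hermitian matrix and checks the majorization claim of part 1.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 6
    G = rng.standard_normal((n, n))
    A = (G + G.T) / 2                         # random Hermitian matrix

    # Coordinate projectors onto {0,1,2} and {3,4,5}: a simple pinching
    P1, P2 = np.zeros((n, n)), np.zeros((n, n))
    P1[:3, :3] = np.eye(3)
    P2[3:, 3:] = np.eye(3)
    PsiA = P1 @ A @ P1 + P2 @ A @ P2          # Ψ(A) zeroes the off-diagonal blocks

    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    mu = np.sort(np.linalg.eigvalsh(PsiA))[::-1]
    # Majorization check: partial sums of μ never exceed those of λ, totals agree
    print(np.all(np.cumsum(mu) <= np.cumsum(lam) + 1e-10), np.isclose(lam.sum(), mu.sum()))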
2. (*) It is not always the case that 𝚽(𝑨) is normal, even if 𝑨 is normal. Use
this observation to construct an example where the two inequalities above
in (a) are different.
3. Let 𝑩 be positive semidefinite. Consider the Schur multiplication operator 𝚽(𝑨) = 𝑨 ⊙ 𝑩. Use the Russo–Dye Theorem to compute ‖𝚽‖.
4. Let 𝑨 be a normal matrix whose eigenvalues are in the (strict) right
half-plane. Consider the Lyapunov equation
𝑨 ∗ 𝑿 + 𝑿 𝑨 = 𝑩.
2. Let 𝑨 and 𝑩 be Hermitian matrices of the same size, and suppose that
𝑨 = 𝚽(𝑩) for a doubly stochastic map 𝚽. Show that
𝑨 = Σ_{𝑖=1}^{𝑁} 𝑝_𝑖 𝑼_𝑖 𝑩 𝑼_𝑖^*
where 𝑝𝑖 are nonnegative numbers that sum to one and 𝑼 𝑖 are unitary.
These objects depend on both 𝚽 and 𝑩 . Hints: Use Birkhoff ’s Theorem to
prove the result in case 𝑨 and 𝑩 are diagonal matrices. Then introduce
eigenvalue decompositions of 𝑨 and 𝑩 .
7. *Complete Positivity. Let 𝚽 : 𝕄_𝑛 → 𝕄_𝑘 be a linear map. Consider the extension 𝚽_𝑚 : 𝕄_𝑚(𝕄_𝑛) → 𝕄_𝑚(𝕄_𝑘) defined by
𝚽_𝑚 : [ 𝑨_11 ⋯ 𝑨_1𝑚 ; ⋮ ⋱ ⋮ ; 𝑨_𝑚1 ⋯ 𝑨_𝑚𝑚 ] ↦−→ [ 𝚽(𝑨_11) ⋯ 𝚽(𝑨_1𝑚) ; ⋮ ⋱ ⋮ ; 𝚽(𝑨_𝑚1) ⋯ 𝚽(𝑨_𝑚𝑚) ].
That is, we obtain 𝚽𝑚 by applying 𝚽 to each block of an 𝑚 × 𝑚 block matrix
whose blocks have size 𝑛 × 𝑛. Since 𝕄_𝑚(𝕄_𝑛) is isomorphic to 𝕄_{𝑚𝑛}, we can
think about 𝚽𝑚 as a linear map on matrices. We say that 𝚽 is completely positive
when 𝚽𝑚 is a positive map for each 𝑚 = 1, 2, 3, . . . . This statement is equivalent
to the condition that 𝚽 ⊗ I𝑚 is a positive linear map for each 𝑚 = 1, 2, 3, . . . .
Completely positive maps play an important role in operator theory.
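As a concrete illustration (not part of the original problem; Python with NumPy assumed), the transpose map on 𝕄_2 is positive but not completely positive: applying 𝚽_2 blockwise to a psd block matrix produces a negative eigenvalue.

    import numpy as np

    def phi_m(Phi, X, m, n):
        """Apply Phi to each n-by-n block of an (m*n)-by-(m*n) matrix X."""
        Y = np.zeros_like(X)
        for i in range(m):
            for j in range(m):
                blk = X[i*n:(i+1)*n, j*n:(j+1)*n]
                Y[i*n:(i+1)*n, j*n:(j+1)*n] = Phi(blk)
        return Y

    transpose = lambda B: B.T       # the transpose map: positive, not completely positive

    # A psd 2x2 block matrix with 2x2 blocks (unnormalized maximally entangled state)
    E = np.zeros((4, 4))
    for i in range(2):
        for j in range(2):
            E[2*i + i, 2*j + j] = 1.0   # E = sum_{i,j} e_i e_j^T ⊗ e_i e_j^T

    print(np.linalg.eigvalsh(E))                          # psd: eigenvalues 0, 0, 0, 2
    print(np.linalg.eigvalsh(phi_m(transpose, E, 2, 2)))  # eigenvalue -1: Φ₂ is not positive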
III. projects
We may also ask how much the eigenvalues or eigenvectors of a matrix change
under perturbations. This course covers some of the most important results, but this is
just the tip of an iceberg. For more information, see Bhatia’s books [Bha97; Bha07a] or
Saad’s book [Saa11b]. The classic reference is Kato [Kat95], which is both wide and
deep.
Matrix functions
What does it mean to apply a (scalar) function to a matrix? For normal matrices, we
can simply apply the function to the eigenvalues, leaving the eigenvectors alone. For
general matrices, we can invoke the Cauchy integral formula. These two approaches
coincide whenever they are both valid. This type of spectral function has a beautiful
and rich theory. Higham’s book [Hig08] develops the foundations, focusing especially
on the numerical aspects. Bhatia’s book [Bha97, Chap. X] discusses perturbation of
matrix functions. There is also a nice series of papers by Lewis on spectral functions,
such as [Lew96], which forms the basis for a homework problem.
Another possible approach is to apply a scalar function to each entry of the matrix.
Although this approach seems naïve, it also has a rich theory that dates back to
Schoenberg’s work [Sch38] on Euclidean distance matrices in the 1930s. See Horn
& Johnson [HJ94, Chap. 6] and Bhatia [Bha07b] for a smattering of results, plus
additional references. The survey of Belton et al. [Bel+18] also covers this topic.
Integral representations
In analysis, it is often the case that an interesting class of functions can be seen as a
convex cone. Within the cone, the extremal functions play a key role because they
frame the boundary of the cone. As a consequence, every function in the cone can
be expressed as an average of the extremal functions. These averages are typically
written as integrals against a probability measure, and they provide a powerful tool
for understanding the behavior of the class of functions. The key results in this area
include the (strong) Krein–Milman theorem and the Choquet theory.
Examples of this principle include the theorem of Bochner on positive-definite
functions, the theorem of Bernstein on completely monotone functions, and the
theorem of Loewner on matrix monotone functions. See Simon’s book [Sim11] for an
introduction to this geometric perspective.
Integral representation theorems are also connected with ideas from complex
analysis (e.g., the Herglotz theorem), with interpolation (e.g., Nevanlinna–Pick), and
with the theory of moments (e.g., Stieltjes). For an introduction to these connections,
see the survey of Berg [Ber08]. For a more comprehensive classical treatment, see the
book [BCR84]. Simon [Sim19] has also written an entire book on Loewner’s theorem.
A surprising and beautiful application of integral representations arises in the
theory of matrix means. Kubo & Ando [KA79] develop a family of matrix means
that are induced by matrix monotone functions. Hiai & Kosaki [HK99] develop a
different type of matrix mean that is induced by a positive-definite function. See also
Bhatia [Bha07b, Chap. 4].
Semidefinite programming
Optimization over the cone of positive-semidefinite matrices plays a central role
in various disciplines, including data science, combinatorial optimization, control
theory, and quantum information. For an introduction to this field, see Boyd &
Vandenberghe [VB96]. For a more in-depth treatment, consider the book of Ben Tal &
Nemirovski [BTN01].
The Hausdorff–Toeplitz theorem asserts that the field of values of a matrix com-
poses a convex set in the complex plane. This is the simplest example of a class
of “hidden convexity” theorems, which describe convexity that appears in surprising
circumstances. Brickman’s theorem is a significant generalization, which states that
pairs of (positive) quadratic forms always induce a convex set of points in the plane.
Barvinok [Bar02, Sec. II.14] describes this result, along with some generalizations. See
also the paper [BT10] of Beck & Teboulle.
The results in the last paragraph assert that certain quadratic optimization problems
and their Lagrangian dual exhibit strong duality. The S-Lemma is a manifestation
of this same idea in control theory. In linear algebra, this result means that we can
compute the maximum eigenvalue of an Hermitian matrix, even though the problem is
not convex. See [BV04, App. B] for this perspective.
Semidefinite programming also plays a key role in algorithms for solving combi-
natorial optimization problems by relaxation and rounding. This idea goes back, at
least, to the work of Lovász [Lov79], and it became widely known after Goemans &
Williamson [GW95] developed their method for solving the maxcut problem. For a
survey with applications in signal processing, see [Luo+10]. Related results appear in
the books [Bar02; BTN01].
In fact, the idea of relaxation and rounding implicitly appears in Grothendieck’s
work on tensor products, which leads to the famous Grothendieck inequality for
matrices. This result was cast in the language of matrices by Lindenstrauss. This is a
major topic in functional analysis, but most of the references are technical. There is
an elementary treatment of Krivine’s proof of Grothendieck’s inequality in Vershynin’s
book [Ver18, Secs. 3.5–3.7]. For some algorithmic work, see [AN06; Tro09].
There is a “noncommutative” extension of the Grothendieck inequality due to
Pisier and Haagerup. For an algorithmic proof of this result, see the paper of Naor et
al. [NRV14]. Closely related optimization problems arise in the theory of quantum
optimal transport [Col+21].
As a consequence, there are rich interactions between matrix analysis and quantum
information. See Watrous’s book for an introduction [Wat18].
Because of this type of connection, mathematical physics has indeed been a
major driver of research on matrix analysis. The joint convexity of quantum relative
entropy was derived as part of an effort to understand subadditivity properties of
quantum entropy under tensorization. There is also a remarkable duality between
subadditivity of entropy and (noncommutative) Brascamp–Lieb inequalities. See
Carlen’s notes [Car10] for a survey of these ideas.
The joint convexity of quantum relative entropy implies (in fact, is equivalent to)
a concavity property of the trace exponential [Tro12]. De Huang developed some
striking generalizations of this result using techniques from majorization and complex
interpolation; see [Hua19] and the works cited. These results play a key role in the
theory of matrix concentration.
Inequalities of Golden–Thompson type compare the matrix exponential of a sum
with the product of matrix exponentials. Some results of this type appear in [Bha97,
Chap. IX]. For far reaching generalizations with beautiful proofs, see the paper of
Sutter et al. [SBT17]. These results also have been used to study matrix concentration.
Hyperbolic polynomials
A hyperbolic polynomial 𝑝 : ℝ𝑛 → ℝ has the property that its restrictions 𝑡 ↦→
𝑝 (𝒙 −𝑡 e) to a fixed direction e have only real roots. This turns out to be a fundamental
concept in the geometry of polynomials, which is becoming a topic of increasing interest
in computational mathematics. For a digestible introduction, see the paper [Bau+01].
For a classic analysis reference, see the book of Hörmander [Hör94].
The Newton inequalities are, perhaps, the most elementary result that stems from
the notion of hyperbolicity. They state that the sequence of elementary symmetric
polynomials compose an ultra-log-concave sequence. This result has many striking
generalizations. In matrix analysis, the deepest theorem of this type is Alexandrov’s
inequality for mixed discriminants. See the paper of Shenfeld and Van Handel [SH19]
for an easy proof of the latter result. For some applications of mixed discriminants in
combinatorics, see Barvinok’s book [Bar16].
The geometry of polynomials also stands at the center of some other recent advances
in operator theory, combinatorics, and other areas. Marcus, Spielman, & Srivastava
used these ideas in their construction of bipartite Ramanujan graphs, their sharp
version of the restricted invertibility theorem, and their solution of the Kadison–Singer
problem. See [MSS14] for an introduction to this circle of ideas and references to the
original work. These results all have implications in matrix analysis.
In a related direction, Anari and collaborators have been using log-concave polynomials to remarkable effect in their work on counting bases of matroids and related
problems. This research has implications for Monte Carlo Markov chain methods
for sampling from determinantal point processes. The first paper in the sequence
is [AGV21].
Projects bibliography
[AN06] N. Alon and A. Naor. “Approximating the cut-norm via Grothendieck’s inequality”.
In: SIAM J. Comput. 35.4 (2006), pages 787–803. doi: 10.1137/S0097539704441629.
[AP20] J. Altschuler and P. Parrilo. “Approximating Min-Mean-Cycle for low-diameter
graphs in near-optimal time and memory”. Available at https://arxiv.org/
abs/2004.03114. 2020.
[SH15] S. Sra and R. Hosseini. “Conic geometric optimization on the manifold of positive
definite matrices”. In: SIAM J. Optim. 25.1 (2015), pages 713–739. doi: 10.1137/
140978168.
[SBT17] D. Sutter, M. Berta, and M. Tomamichel. “Multivariate trace inequalities”. In: Comm.
Math. Phys. 352.1 (2017), pages 37–58. doi: 10.1007/s00220-016-2778-5.
[Tro09] J. A. Tropp. “Column subset selection, matrix factorization, and eigenvalue op-
timization”. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on
Discrete Algorithms. SIAM, Philadelphia, PA, 2009, pages 978–986.
[Tro12] J. A. Tropp. “From joint convexity of quantum relative entropy to a concavity
theorem of Lieb”. In: Proc. Amer. Math. Soc. 140.5 (2012), pages 1757–1760. doi:
10.1090/S0002-9939-2011-11141-9.
[VB96] L. Vandenberghe and S. Boyd. “Semidefinite programming”. In: SIAM Rev. 38.1
(1996), pages 49–95. doi: 10.1137/1038003.
[Ver18] R. Vershynin. High-dimensional probability. An introduction with applications in
data science, With a foreword by Sara van de Geer. Cambridge University Press,
Cambridge, 2018. doi: 10.1017/9781108231596.
[Wat18] J. Watrous. The theory of quantum information. Cambridge University Press, 2018.
[Zha05] F. Zhang, editor. The Schur complement and its applications. Springer-Verlag, New
York, 2005. doi: 10.1007/b105056.
1. Means for Matrices and Norm Inequalities
Agenda:
1. A simple proof for the matrix arithmetic–geometric mean inequality
2. From scalar means to matrix means
3. A unified analysis of means for matrices
4. Norm inequalities

In this project we broadly investigate means of positive matrices and norm inequalities involving them. We begin with the goal of extending the arithmetic–geometric mean inequality √(𝜆𝜈) ≤ 2⁻¹(𝜆 + 𝜈) for positive reals 𝜆, 𝜈 to the case of positive semidefinite matrices and unitarily invariant norms. In the search for a proof of the matrix arithmetic–geometric mean inequality, we present a simple method introduced by Bhatia [Bha07b]. A more general analysis of means for matrices, due to Hiai and Kosaki [HK99], leads to a unified framework for comparing norms of these matrix means. This perspective also yields an alternative proof of the matrix arithmetic–geometric mean inequality and a variety of other notable norm inequalities. In this presentation, primarily based on the material from [Bha07b, Chap. 5] and [HK99], we outline this general analysis and the important consequences that arise.
Theorem 1.1 (Matrix arithmetic–geometric mean inequality). For any 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ)
and any 𝑿 ∈ 𝕄𝑛 (ℂ) it holds that
|||𝑯^{1/2} 𝑿 𝑲^{1/2}||| ≤ (1/2) |||𝑯𝑿 + 𝑿𝑲|||   (1.1)
for any unitarily invariant norm ||| · ||| .
Before we present a unified analysis of means for matrices and derive norm inequalities,
we discuss a simple proof for the matrix arithmetic–geometric mean inequality (1.1).
We can prove Theorem 1.1 by establishing the inequality in the special case 𝑯 = 𝑲 ,
a key insight that appears in Bhatia [Bha07b]. Indeed, it is enough to prove the
apparently weaker inequality
|||𝑫^{1/2} 𝒀 𝑫^{1/2}||| ≤ (1/2) |||𝑫𝒀 + 𝒀𝑫|||,   (1.2)
for any 𝑫 ∈ ℍ𝑛+ (ℂ),𝒀 ∈ 𝕄𝑛 (ℂ) . In fact, by suitably choosing 𝑫 and 𝒀 in inequality
(1.2) as
𝑯 0 0 𝑿
𝑫B and 𝒀 B ,
0 𝑲 0 0
one can recover the more general inequality (1.1). Before we present a straightforward
proof of Theorem 1.1, we first note an auxiliary lemma.
Lemma 1.2 (Unitarily invariant norm of a Schur product). If 𝑨 ∈ ℍ_𝑛^+(ℂ), then for any unitarily invariant norm ||| · ||| on 𝕄_𝑛(ℂ),
|||𝑨 ⊙ 𝑿||| ≤ (max_𝑖 𝑎_𝑖𝑖) |||𝑿|||   for all 𝑿 ∈ 𝕄_𝑛(ℂ).
Proof of Theorem 1.1. We consider |||𝑫^{1/2} 𝒀 𝑫^{1/2}||| and note that, since the norm in question is unitarily invariant, the expression reduces to the case where 𝑫 is a diagonal matrix of the form diag(𝜆_1, . . . , 𝜆_𝑛). It is easily checked that in this case
𝑫^{1/2} 𝒀 𝑫^{1/2} = 𝚲 ⊙ (𝑫𝒀 + 𝒀𝑫)/2.   (1.3)
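As a quick numerical sanity check (a Python sketch with NumPy, not part of the project write-up), the following code tests inequality (1.1) in the trace norm for random psd matrices.

    import numpy as np

    def psd_sqrt(H):
        """Square root of a positive semidefinite matrix via its eigendecomposition."""
        w, U = np.linalg.eigh(H)
        return (U * np.sqrt(np.clip(w, 0, None))) @ U.conj().T

    rng = np.random.default_rng(4)
    n = 5
    G1, G2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    H, K = G1 @ G1.T, G2 @ G2.T                # random psd matrices
    X = rng.standard_normal((n, n))

    lhs = np.linalg.norm(psd_sqrt(H) @ X @ psd_sqrt(K), 'nuc')   # trace norm
    rhs = 0.5 * np.linalg.norm(H @ X + X @ K, 'nuc')
    print(lhs <= rhs + 1e-9, lhs, rhs)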
Definition 1.3 (Scalar mean). A map 𝑀 : ℝ++ × ℝ++ → ℝ++ is a scalar mean if for any 𝜆, 𝜈 > 0 we have the following properties. (Here ℝ++ ≔ (0, ∞).)
We now recall two propositions that establish a bijection between the class of scalar
means 𝔐 and what we define as representing functions.
Proposition 1.4 (Representing functions for scalar means). Let 𝑀 ∈ 𝔐 . Let the function 𝑓
be defined as 𝑓 (𝑡 ) B 𝑀 (𝑡 , 1) for 𝑡 > 0. Note that 𝑓 ( 0) can be defined by continuity
and hence by taking limits. The function 𝑓 then satisfies the following properties.
Then 𝑀 𝑓 is a scalar mean and we say that scalar mean 𝑀 𝑓 has representation 𝑓 .
Proof. The proof of this proposition follows from direct application of the properties of
the function 𝑓 .
Indeed, together Propositions 1.4 and 1.5 establish a bijection between scalar
means and functions with the properties in Proposition 1.4, which we call representing
functions.
Definition 1.6 (Matrix mean; Hiai & Kosaki 1999). Let 𝑀 ∈ 𝔐 and 𝑯, 𝑲 ∈ ℍ_𝑛^+(ℂ). Let 𝑯 and 𝑲 have spectral decompositions 𝑯 = Σ_{𝑖=1}^{𝑛} 𝜆_𝑖 𝑷_𝑖 and 𝑲 = Σ_{𝑗=1}^{𝑛} 𝜈_𝑗 𝑸_𝑗, respectively. We define the operator 𝑀(𝑯, 𝑲) : 𝕄_𝑛(ℂ) → 𝕄_𝑛(ℂ) by
𝑀(𝑯, 𝑲)𝑿 = Σ_{𝑖,𝑗} 𝑀(𝜆_𝑖, 𝜈_𝑗) 𝑷_𝑖 𝑿 𝑸_𝑗   for 𝑿 ∈ 𝕄_𝑛(ℂ).
𝑀(𝑯, 𝑲)𝑿 = 𝑼 ([𝑀(𝜆_𝑖, 𝜈_𝑗)] ⊙ (𝑼* 𝑿 𝑽)) 𝑽*,
where [𝑀(𝜆_𝑖, 𝜈_𝑗)] is the matrix defined (𝑖, 𝑗) entrywise and ⊙ denotes the usual Schur product.
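The following sketch (Python with NumPy; not part of the project) evaluates 𝑀(𝑯, 𝑲)𝑿 via the Schur-product formula above and checks the arithmetic-mean case 𝑀_2(𝑯, 𝑲)𝑿 = (𝑯𝑿 + 𝑿𝑲)/2; compare Exercise 1.16.

    import numpy as np

    def matrix_mean(M, H, K, X):
        """Evaluate M(H, K) X as in Definition 1.6, via eigendecompositions of H and K."""
        lam, U = np.linalg.eigh(H)          # H = U diag(lam) U*
        nu, V = np.linalg.eigh(K)           # K = V diag(nu) V*
        Mmat = np.array([[M(li, nj) for nj in nu] for li in lam])
        return U @ (Mmat * (U.conj().T @ X @ V)) @ V.conj().T   # Schur-product form

    rng = np.random.default_rng(5)
    n = 4
    G1, G2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    H, K = G1 @ G1.T + np.eye(n), G2 @ G2.T + np.eye(n)
    X = rng.standard_normal((n, n))

    arithmetic = lambda l, v: (l + v) / 2
    print(np.allclose(matrix_mean(arithmetic, H, K, X), (H @ X + X @ K) / 2))   # True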
Exercise 1.8 (Alternative formulation of the matrix mean). Starting from Definition 1.6,
provide a proof for Proposition 1.7.
Theorem 1.9 (Comparison of norms of matrix means). For any 𝑀 , 𝐿 ∈ 𝔐 the following
are equivalent.
[ 𝑀(𝜆_𝑖, 𝜆_𝑗) / 𝐿(𝜆_𝑖, 𝜆_𝑗) ]_{1≤𝑖,𝑗≤𝑛}
for any 𝑛 × 𝑛 matrices 𝑯, 𝑲, 𝑿 with 𝑯, 𝑲 ≻ 0, where the last inequality follows from 𝜇 being a probability measure. To extend to 𝑯, 𝑲 ≽ 0 it suffices to take the limit of the inequality for 𝑯 + 𝜖I and 𝑲 + 𝜖I as 𝜖 → 0+.
(2)⇒(3) By taking 𝑲 = 𝑯 , we have that (2) clearly implies (3).
(3)⇒(4) Consider arbitrary 𝜆 1 , . . . , 𝜆𝑛 > 0 and the matrix 𝑨 defined entrywise by
𝑎𝑖 𝑗 = 𝑀 (𝜆𝑖 , 𝜆 𝑗 )/𝐿 (𝜆𝑖 , 𝜆 𝑗 ) for any 1 ≤ 𝑖 , 𝑗 ≤ 𝑛 . By the symmetry property of matrix
means 𝑀 and 𝐿 , the matrix 𝑨 is Hermitian with all diagonal entries equal to 1. Hence,
it also holds that tr (𝑨) = 𝑛 .
Letting 𝑯 = diag (𝜆 1 , . . . , 𝜆𝑛 ) and applying (3), given what we know from Proposi-
tion 1.7, we obtain
‖[𝑀(𝜆_𝑖, 𝜆_𝑗)] ⊙ 𝑿‖ ≤ ‖[𝐿(𝜆_𝑖, 𝜆_𝑗)] ⊙ 𝑿‖
for each 𝑿 ∈ 𝕄𝑛 (ℂ) . This implies
‖𝑨 ⊙ 𝑿‖ ≤ ‖𝑿‖,
for each 𝑿 ∈ 𝕄𝑛 (ℂ) .
Now, considering that ⟨𝑨 ⊙ 𝑿, 𝒀⟩ = ⟨𝑿, 𝑨 ⊙ 𝒀⟩, it is possible to show from the above that ‖𝑨 ⊙ 𝑿‖_1 ≤ ‖𝑿‖_1. Since this holds for any 𝑿 ∈ 𝕄_𝑛(ℂ), if we take 𝑿 to be the matrix with all entries 1, we obtain the inequality ‖𝑨‖_1 ≤ 𝑛. (We recall that ‖·‖_1 denotes the trace norm, so ‖𝑨‖_1 = tr((𝑨*𝑨)^{1/2}).)
If 𝛼_1, . . . , 𝛼_𝑛 are the real eigenvalues of 𝑨, to prove (4) it suffices to show that 𝛼_1, . . . , 𝛼_𝑛 are all positive. We observe
Σ_{𝑖=1}^{𝑛} |𝛼_𝑖| = ‖𝑨‖_1 ≤ 𝑛,
but 𝑛 = tr(𝑨) = Σ_{𝑖=1}^{𝑛} 𝛼_𝑖. Hence Σ_{𝑖=1}^{𝑛} |𝛼_𝑖| ≤ Σ_{𝑖=1}^{𝑛} 𝛼_𝑖, which proves positivity of the eigenvalues of 𝑨 and thus (4).
(4)⇒(5) For any 𝑡_𝑖, 𝑡_𝑗 ∈ ℝ, 1 ≤ 𝑖, 𝑗 ≤ 𝑛,
ℎ(𝑡_𝑖 − 𝑡_𝑗) = 𝑀(e^{𝑡_𝑖 − 𝑡_𝑗}, 1) / 𝐿(e^{𝑡_𝑖 − 𝑡_𝑗}, 1) = 𝑀(e^{𝑡_𝑖}, e^{𝑡_𝑗}) / 𝐿(e^{𝑡_𝑖}, e^{𝑡_𝑗}),
which by (4) defines a positive semidefinite matrix; hence, by definition, ℎ(𝑡) is a positive definite function on ℝ.
(5)⇒(1) Since the function ℎ(𝑡) ≔ 𝑀(e^𝑡, 1)/𝐿(e^𝑡, 1) is positive definite on ℝ, by Bochner's theorem there exists a probability measure 𝜇 on ℝ such that ℎ(𝑡) = ∫_{−∞}^{∞} e^{i𝑡𝑠} d𝜇(𝑠) for any 𝑡 ∈ ℝ (see Lecture 18). Since, by the properties of scalar means, ℎ(𝑡) = ℎ(−𝑡), we have that 𝜇 is a symmetric measure. Now for any 𝑯, 𝑲 ≻ 0 we compute 𝑀(𝑯, 𝑲)𝑿. Indeed, using Definition 1.6, we obtain
𝑀(𝑯, 𝑲)𝑿 = Σ_{𝑘,𝑙} 𝑀(𝜆_𝑘, 𝜈_𝑙) 𝑷_𝑘 𝑿 𝑸_𝑙
          = Σ_{𝑘,𝑙} 𝜈_𝑙 𝑀(e^{log(𝜆_𝑘/𝜈_𝑙)}, 1) 𝑷_𝑘 𝑿 𝑸_𝑙
          = Σ_{𝑘,𝑙} 𝜈_𝑙 𝐿(e^{log(𝜆_𝑘/𝜈_𝑙)}, 1) ∫_{−∞}^{∞} (𝜆_𝑘/𝜈_𝑙)^{i𝑠} d𝜇(𝑠) 𝑷_𝑘 𝑿 𝑸_𝑙
          = ∫_{−∞}^{∞} Σ_{𝑘,𝑙} (𝜆_𝑘/𝜈_𝑙)^{i𝑠} 𝐿(𝜆_𝑘, 𝜈_𝑙) 𝑷_𝑘 𝑿 𝑸_𝑙 d𝜇(𝑠)
          = ∫_{−∞}^{∞} 𝑯^{i𝑠} (𝐿(𝑯, 𝑲)𝑿) 𝑲^{−i𝑠} d𝜇(𝑠),
where the second equality follows from the properties of scalar means, the third equality
from using the definition of ℎ and the fourth equality again from the properties of
scalar means.
As pointed out in [HK99], Theorem 1.9 suggests that to obtain norm inequalities
between matrix means it is important to establish positive definiteness of the related
function on ℝ. To this end we note the following proposition before we continue our
discussion.
Proposition 1.10 (Some positive definite functions). The functions of 𝑡,
sinh(𝑎𝑡)/sinh(𝑏𝑡)   and   cosh(𝑎𝑡)/cosh(𝑏𝑡),
are positive definite whenever 0 ≤ 𝑎 < 𝑏. The function of 𝑡,
𝑡 / sinh(𝑡/2),
is also positive definite.
Proof. We refer to [Kos98] and [HK99] for a proof of these facts and in particular to
[Kos98] for a more comprehensive analysis of useful positive definite functions.
Exercise 1.12 (𝛼 -scalar mean is a scalar mean.). Using Definition 1.3, check that the 𝛼 -scalar
mean as defined in Definition 1.11 is indeed a scalar mean.
In particular we have the following means.
Example 1.13 (Some 𝛼-scalar means). The following are examples of 𝛼-scalar means.
1. Arithmetic mean For 𝛼 = 2, it holds that
𝑀_2(𝜆, 𝜈) = (𝜆 + 𝜈)/2   for any 𝜆, 𝜈 > 0.
2. Geometric mean For 𝛼 = 1/2, it holds that
𝑀_{1/2}(𝜆, 𝜈) = √(𝜆𝜈)   for any 𝜆, 𝜈 > 0.
Logarithmic mean For 𝛼 = 1, it holds that
𝑀_1(𝜆, 𝜈) = (𝜆 − 𝜈)/(log 𝜆 − log 𝜈)   for any 𝜆, 𝜈 > 0.
5. Zero-scalar mean Taking the limit as 𝛼 → 0, it holds that
𝑀_0(𝜆, 𝜈) = (log 𝜆 − log 𝜈)/(𝜈^{−1} − 𝜆^{−1})   for any 𝜆, 𝜈 > 0.
(It is interesting to note that the zero-scalar mean is the reciprocal of the logarithmic mean of the reciprocals.)
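For concreteness, a short numerical sketch (Python; not part of the project) evaluates the four means above at sample inputs; for distinct 𝜆, 𝜈 the values increase from 𝑀_0 to 𝑀_2, consistent with the classical geometric–logarithmic–arithmetic mean inequalities.

    import numpy as np

    def M2(l, v):    return (l + v) / 2                         # arithmetic mean
    def Mhalf(l, v): return np.sqrt(l * v)                      # geometric mean
    def M1(l, v):    return (l - v) / (np.log(l) - np.log(v))   # logarithmic mean
    def M0(l, v):    return (np.log(l) - np.log(v)) / (1/v - 1/l)

    l, v = 3.0, 0.7
    vals = [M0(l, v), Mhalf(l, v), M1(l, v), M2(l, v)]
    print(vals, all(a <= b + 1e-12 for a, b in zip(vals, vals[1:])))   # increasing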
Exercise 1.14 (Infinity-scalar mean). Characterize the 𝛼 -scalar mean as defined in Definition
1.11 when 𝛼 → ±∞.
The next theorem due to Hiai and Kosaki [HK99] establishes an attractive norm
inequality between matrix means.
Proof. For 1/2 ≤ 𝛼 < 𝛽 < ∞, by using Definition 1.11 we can write the ratio 𝑀_𝛼(e^{2𝑡}, 1)/𝑀_𝛽(e^{2𝑡}, 1) in terms of hyperbolic sines, where we set
(𝛼 − 1)/sinh((𝛼 − 1)𝑡) = 1/𝑡 at 𝛼 = 1,   and   sinh((𝛽 − 1)𝑡)/(𝛽 − 1) = 𝑡 at 𝛽 = 1.
Now, if 1/2 ≤ 𝛼 < 𝛽 ≤ 1, then
If 1 < 𝛼 < 𝛽 < 2𝛼 − 1 and so 0 < 𝛽 − 𝛼 < 𝛼 − 1, then the above computations
show that by Proposition 1.10, 𝑀 𝛼 ( e2𝑡 , 1)/𝑀 𝛽 ( e2𝑡 , 1) is positive definite. Therefore by
Theorem 1.9, the claim of this theorem holds. In the general case of 1 < 𝛼 < 𝛽 < ∞,
it is possible to choose 𝛼 = 𝛼 0 < 𝛼 1 < · · · < 𝛼𝑚 = 𝛽 that satisfies 𝛼𝑘 < 2𝛼𝑘 −1 − 1
for 1 ≤ 𝑘 ≤ 𝑚 , which leads to the conclusion. The special cases 1 = 𝛼 < 𝛽 < ∞
and 1 < 𝛼 < 𝛽 = ∞ can be obtained by taking the limit as 𝛼 → 1 and 𝛽 → ∞
respectively. We refer to [HK99] for the details on why the inequalities are preserved
under taking limits. Furthermore, the results extend straightforwardly to the case −∞ ≤ 𝛼 < 𝛽 ≤ 1/2; for conciseness, we refer to [HK99] for that argument.
Theorem 1.15 allows us to directly obtain a proof of the matrix arithmetic–geometric mean inequality, Theorem 1.1.
Proof of Theorem 1.1. For any 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ) and any 𝑿 ∈ 𝕄𝑛 (ℂ) , we have
𝑀_{1/2}(𝑯, 𝑲)𝑿 = 𝑯^{1/2} 𝑿 𝑲^{1/2}   and   𝑀_2(𝑯, 𝑲)𝑿 = (1/2)(𝑯𝑿 + 𝑿𝑲).   (1.4)
Therefore by directly applying Theorem 1.15 we obtain that for any 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ)
and any 𝑿 ∈ 𝕄𝑛 (ℂ) ,
|||𝑯^{1/2} 𝑿 𝑲^{1/2}||| ≤ (1/2) |||𝑯𝑿 + 𝑿𝑲|||,
for any unitarily invariant norm ||| · ||| .
Exercise 1.16 (Proof of equalities in (1.4)). By using Proposition 1.7 show that the equalities
in (1.4) indeed hold.
Together Theorems 1.9 and 1.15 are very powerful and a plethora of important results
follow from these. We refer the reader to [HK99] for a more in depth investigation of
these topics and derivations of further norm inequalities. We also refer the reader to
[ABY20] for an example of an interesting application of these results.
1.5 Conclusions
In this investigation we began with the objective of proving a matrix arithmetic–geometric mean inequality. This aim led us to systematically construct a notion of matrix mean, which led to the fundamental Theorem 1.9. That theorem enables a more general analysis that makes it possible to derive inequalities between norms of means for matrices and thus many different norm inequalities. These insights and the corresponding analysis are due to Hiai and Kosaki [HK99], where the interested reader can find a further exploration of these topics.
Lecture bibliography
[AHJ87] T. Ando, R. A. Horn, and C. R. Johnson. “The singular values of a Hadamard product: a basic inequality”. In: Linear and Multilinear Algebra 21.4 (1987), pages 345–365. doi: 10.1080/03081088708817810.
[ABY20] R. Aoun, M. Banna, and P. Youssef. “Matrix Poincaré inequalities and concentration”.
In: Advances in Mathematics 371 (2020), page 107251. doi: https://doi.org/10.
1016/j.aim.2020.107251.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[Kos98] H. Kosaki. “Arithmetic–geometric mean and related inequalities for operators”. In:
Journal of Functional Analysis 156.2 (1998), pages 429–451.
[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
2. The Eigenvector–Eigenvalue Identity
Agenda:
1. Cauchy interlacing theorem
2. First-order perturbation theorem
3. Eigenvector–eigenvalue identity
4. Proof of the identity
5. Application

Eigenvectors and eigenvalues are widely used to simplify equations, appear throughout data science, and are powerful analysis tools. For an 𝑛 × 𝑛 Hermitian matrix 𝑨 ∈ ℍ_𝑛(ℂ), an eigenpair (𝜆, 𝒗) is a solution of
𝑨𝒗 = 𝜆𝒗.
Here, 𝜆 is an eigenvalue of 𝑨, and 𝒗 is an eigenvector associated with 𝜆. Moreover, the eigenvalues of a Hermitian matrix are real, and its unit-norm eigenvectors form an orthonormal basis for ℂ^𝑛. One might ask whether there is an intrinsic connection between the eigenvalues and the eigenvectors. Over the past few decades, an identity between the components of the eigenvectors of a Hermitian matrix and its eigenvalues has been rediscovered repeatedly in different contexts across many different fields [Den+22]. Although it is a simple yet extremely important result, it began to attract broader interest with the publication of Wolchover's article [Wol19]. In this section, we will establish an identity between the eigenvectors and eigenvalues of a Hermitian matrix. We will write the eigenvalues in weakly increasing order throughout the following sections: 𝜆_𝑖 = 𝜆_𝑖^↑.
L = {𝑗 1 , 𝑗 2 , . . . , 𝑗 𝑘 }
𝜆𝑘 ≤ 𝛽𝑘 ≤ 𝜆𝑘 +𝑛−𝑚 for 𝑘 = 1, . . . , 𝑚.
If 𝑚 = 𝑛 − 1, then
𝜆 1 ≤ 𝛽 1 ≤ 𝜆 2 ≤ 𝛽 2 ≤ . . . ≤ 𝛽𝑛−1 ≤ 𝜆𝑛 .
Proof. We can use the Courant–Fischer–Weyl minimax principle to prove the theorem [Bha97]. Without loss, assume
𝑨 = [ 𝑩  𝑪* ; 𝑪  𝒁 ],
W = span(𝒚_1, . . . , 𝒚_𝑘) ⊆ ℂ^𝑚,
M = { [𝒘 ; 0] ∈ ℂ^𝑛 : 𝒘 ∈ W } ⊂ ℂ^𝑛.
Thus, we have
𝒘̃* 𝑨 𝒘̃ = 𝒘* 𝑩 𝒘   and   𝒘̃* 𝒘̃ = 𝒘* 𝒘.
Using the Courant–Fischer–Weyl minimax principle, the following inequality holds:
𝛽_𝑘 = max_{𝒘 ∈ W} (𝒘* 𝑩 𝒘)/(𝒘* 𝒘) = max_{𝒘̃ ∈ M} (𝒘̃* 𝑨 𝒘̃)/(𝒘̃* 𝒘̃) ≥ min_{M′ ⊂ ℂ^𝑛, dim M′ = 𝑘} max_{𝒘̃ ∈ M′} (𝒘̃* 𝑨 𝒘̃)/(𝒘̃* 𝒘̃) = 𝜆_𝑘.
This gives one side of the inequalities in the theorem.
For the other side, we can use a similar technique and apply the minimax principle to −𝑨 and −𝑩. Define the following vector spaces:
W′ = span(𝒚_𝑘, . . . , 𝒚_𝑚) ⊆ ℂ^𝑚,
N = { [𝒘′ ; 0] ∈ ℂ^𝑛 : 𝒘′ ∈ W′ } ⊂ ℂ^𝑛.
We have
dim N = dim W′ = 𝑚 − 𝑘 + 1.
Then, for each 𝒘′ ∈ W′, we can find 𝒘̃′ ∈ N by choosing
𝒘̃′ = [𝒘′ ; 0] ∈ N.
(𝒘̃′)* (−𝑨) 𝒘̃′ = (𝒘′)* (−𝑩) 𝒘′   and   (𝒘̃′)* 𝒘̃′ = (𝒘′)* 𝒘′.
−𝛽_𝑘 = max_{𝒘′ ∈ W′} ((𝒘′)* (−𝑩) 𝒘′)/((𝒘′)* 𝒘′)
     = max_{𝒘̃′ ∈ N} ((𝒘̃′)* (−𝑨) 𝒘̃′)/((𝒘̃′)* 𝒘̃′)
     ≥ min_{N′ ⊂ ℂ^𝑛, dim N′ = 𝑚−𝑘+1} max_{𝒘̃′ ∈ N′} ((𝒘̃′)* (−𝑨) 𝒘̃′)/((𝒘̃′)* 𝒘̃′)
     = −𝜆_{𝑘+𝑛−𝑚}.
That is
𝜆𝑘 +𝑛−𝑚 ≥ 𝛽𝑘 ,
which gives the desired inequality.
The Cauchy interlacing theorem allows us to bound the eigenvalues of a principal
submatrix of 𝑨 by the eigenvalues of the original matrix. Here, we derived the Cauchy
interlacing theorem from the Courant–Fischer–Weyl minimax principle. It is also
worthwhile to note that the Cauchy interlacing theorem, the Poincaré inequality and
the Courant–Fischer–Weyl minimax principle have independent proofs and they can
be derived from each other [Bha97].
𝜆(𝑡) = 𝜆 + ⟨𝒗, 𝑬𝒗⟩𝑡 + 𝑂(𝑡²),
Proof. For a Hermitian matrix 𝑨 ∈ ℍ_𝑛(ℂ), the left and right eigenvectors associated with the (real) eigenvalue 𝜆 coincide. That is,
𝑨* 𝒗 = 𝑨𝒗 = 𝜆𝒗.
(Recall that 𝒗 is a right eigenvector of 𝑨 associated with 𝜆 if 𝑨𝒗 = 𝜆𝒗. Similarly, 𝒘 is a left eigenvector associated with 𝜆 if 𝒘*𝑨 = 𝜆𝒘*.)
When 𝑡 is small, the eigenvalue 𝜆(𝑡 ) is analytic with respect to 𝑡 . As 𝜆(𝑡 ) is the
eigenvalue of 𝑨 (𝑡 ) , by definition we have
𝑨 (𝑡 )𝒗 (𝑡 ) = 𝜆(𝑡 )𝒗 (𝑡 ).
Let us calculate the Euclidean inner product of both sides with respect to 𝒗; then
⟨𝒗, 𝑨(𝑡)𝒗(𝑡)⟩ = ⟨𝒗, 𝜆(𝑡)𝒗(𝑡)⟩.
Writing it out more explicitly, we get
⟨𝒗, (𝑨 + 𝑡𝑬)𝒗(𝑡)⟩ = 𝜆(𝑡)⟨𝒗, 𝒗(𝑡)⟩.
We calculate the left-hand side (LHS) of the above equation:
⟨𝒗, (𝑨 + 𝑡𝑬)𝒗(𝑡)⟩ = ⟨𝒗, 𝑨𝒗(𝑡)⟩ + ⟨𝒗, 𝑡𝑬𝒗(𝑡)⟩ = ⟨𝑨*𝒗, 𝒗(𝑡)⟩ + 𝑡⟨𝒗, 𝑬𝒗(𝑡)⟩ = 𝜆⟨𝒗, 𝒗(𝑡)⟩ + 𝑡⟨𝒗, 𝑬𝒗(𝑡)⟩.
So, we have
(𝜆(𝑡) − 𝜆)⟨𝒗, 𝒗(𝑡)⟩ = 𝑡⟨𝒗, 𝑬𝒗(𝑡)⟩.
This is equivalent to
(𝜆(𝑡) − 𝜆(0))/(𝑡 − 0) = ⟨𝒗, 𝑬𝒗(𝑡)⟩ / ⟨𝒗, 𝒗(𝑡)⟩
for a non-zero 𝑡 . Because (𝜆, 𝒗 ) is a simple eigenpair of 𝑨 , the eigenvector 𝒗 and the
eigenvalue 𝜆 are both continuous with respect to 𝑨 . Thus, 𝒗 (𝑡 ) is continuous with
respect to 𝑡 . Then, we have
𝜆′(0) = lim_{𝑡→0} (𝜆(𝑡) − 𝜆(0))/(𝑡 − 0) = lim_{𝑡→0} ⟨𝒗, 𝑬𝒗(𝑡)⟩/⟨𝒗, 𝒗(𝑡)⟩ = ⟨𝒗, 𝑬𝒗⟩/⟨𝒗, 𝒗⟩ = ⟨𝒗, 𝑬𝒗⟩.
Thus, the Taylor expansion of 𝜆(𝑡) about 0 is
𝜆(𝑡) = 𝜆(0) + 𝜆′(0)(𝑡 − 0) + 𝑂(𝑡²) = 𝜆 + ⟨𝒗, 𝑬𝒗⟩𝑡 + 𝑂(𝑡²),
which gives the first-order perturbation theorem.
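A quick numerical check of the first-order expansion (Python with NumPy; not part of the project): the eigenvalue of 𝑨 + 𝑡𝑬 agrees with 𝜆 + 𝑡⟨𝒗, 𝑬𝒗⟩ up to 𝑂(𝑡²).

    import numpy as np

    rng = np.random.default_rng(6)
    n = 5
    G = rng.standard_normal((n, n)); A = (G + G.T) / 2
    F = rng.standard_normal((n, n)); E = (F + F.T) / 2   # Hermitian perturbation

    lam, V = np.linalg.eigh(A)
    i = 2                                  # pick a (generically simple) eigenvalue
    v = V[:, i]
    t = 1e-5
    lam_t = np.sort(np.linalg.eigvalsh(A + t * E))[i]
    print(lam_t, lam[i] + t * (v @ E @ v))   # agree up to O(t^2)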
and
𝜆_𝑖(𝑴_𝑗) = 𝜆_𝑖^↑(𝑴_𝑗) for 1 ≤ 𝑖 ≤ 𝑛 − 1.
Let 𝒗_1, . . . , 𝒗_𝑛 be the unit-norm eigenvectors of 𝑨 associated with the ordered eigenvalues 𝜆_1(𝑨), . . . , 𝜆_𝑛(𝑨), respectively. The eigenvector–eigenvalue identity for 𝑨 is
|𝑣_{𝑖𝑗}|² ∏_{𝑘=1, 𝑘≠𝑖}^{𝑛} (𝜆_𝑖(𝑨) − 𝜆_𝑘(𝑨)) = ∏_{𝑘=1}^{𝑛−1} (𝜆_𝑖(𝑨) − 𝜆_𝑘(𝑴_𝑗)),
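The identity is easy to test numerically; the following sketch (Python with NumPy, not part of the project) checks it for a random Hermitian matrix and an arbitrary choice of indices 𝑖 and 𝑗.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 5
    G = rng.standard_normal((n, n)); A = (G + G.T) / 2

    lam, V = np.linalg.eigh(A)             # ascending eigenvalues, unit-norm eigenvectors
    i, j = 1, 3
    Mj = np.delete(np.delete(A, j, axis=0), j, axis=1)   # principal minor: delete row/col j
    mu = np.linalg.eigvalsh(Mj)

    lhs = abs(V[j, i]) ** 2 * np.prod([lam[i] - lam[k] for k in range(n) if k != i])
    rhs = np.prod(lam[i] - mu)
    print(np.isclose(lhs, rhs))            # True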
deleting the 𝑗′th row and 𝑗th column of the identity matrix and the matrix 𝑨, respectively. Then, we have the following identity:
(−1)^{𝑗+𝑗′} det(𝜆_𝑖(𝑨)(I_𝑛)_{𝑗′𝑗} − 𝑴_{𝑗′𝑗}) = ∏_{𝑘=1, 𝑘≠𝑖}^{𝑛} (𝜆_𝑖(𝑨) − 𝜆_𝑘(𝑨)) · 𝑣_{𝑖𝑗} 𝑣_{𝑖𝑗′}^*
for 1 ≤ 𝑗, 𝑗′ ≤ 𝑛.
𝜆𝑖 +1 (𝑨) = 𝜆𝑖 (𝑨).
By the Cauchy interlacing theorem, the inequality
𝜆𝑖 (𝑨) ≤ 𝜆𝑖 (𝑴 𝑗 ) ≤ 𝜆𝑖 +1 (𝑨)
holds as 𝑴 𝑗 ∈ ℍ𝑛−1 is a principal minor of 𝑨 ∈ ℍ𝑛 . Thus, for any 𝑗 , the 𝑖 th smallest
eigenvalue of 𝑴 𝑗 equals 𝜆𝑖 (𝑨) :
𝜆𝑖 (𝑨) − 𝜆𝑖 (𝑴 𝑗 ) = 0.
In this case, the identity is trivial, as the left-hand side of the identity (2.4) is
LHS = |𝑣_{𝑖𝑗}|² ∏_{𝑘=1, 𝑘≠𝑖}^{𝑛} (𝜆_𝑖(𝑨) − 𝜆_𝑘(𝑨))
    = |𝑣_{𝑖𝑗}|² (𝜆_𝑖(𝑨) − 𝜆_{𝑖+1}(𝑨)) ∏_{𝑘=1, 𝑘≠𝑖,𝑖+1}^{𝑛} (𝜆_𝑖(𝑨) − 𝜆_𝑘(𝑨)) = 0,
and the right-hand side also vanishes because 𝜆_𝑖(𝑨) − 𝜆_𝑖(𝑴_𝑗) = 0.
Thus the identity holds when the multiplicity of 𝜆_𝑖(𝑨) is greater than one.
In the second half of the proof, we will consider the case when 𝜆𝑖 (𝑨) is a simple
eigenvalue of 𝑨 . As the eigenvalues are continuous with respect to 𝑨 , the eigenvalues
of a principal minor of 𝑨 should also be continuous with respect to 𝑨 . That is, 𝜆𝑖 (𝑨)
and 𝜆𝑖 (𝑴 𝑗 ) are continuous with respect to 𝑨 . Because 𝜆𝑖 (𝑨) is a simple eigenvalue
of 𝑨 , the eigenprojector 𝑷 𝜆𝑖 (𝑨) associated with 𝜆𝑖 (𝑨) is given by
𝑷 𝜆𝑖 (𝑨) = 𝒗 𝑖 𝒗 𝑖∗ .
Thus, the eigenvector 𝒗 𝑖 associated with a simple eigenvalue 𝜆𝑖 (𝑨) is continuous with
respect to 𝑨 . As the left-hand side and the right-hand side consist of products of the
eigenvalues and 𝑣𝑖 𝑗 , they are both continuous with respect to 𝑨 . Then, it suffices
to show that the identity holds for 𝑨 with simple eigenvalue 𝜆𝑖 . Let 𝜀 be a small
parameter, and consider the rank one perturbation 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 , where 𝜹 1 , . . . , 𝜹 𝑛 is
the standard basis. The characteristic polynomial of 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 is given by
𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆) = det(𝜆I_𝑛 − (𝑨 + 𝜀𝜹_𝑗𝜹_𝑗^*)).
Write 𝑨̃^𝜀 ≔ 𝜆I_𝑛 − (𝑨 + 𝜀𝜹_𝑗𝜹_𝑗^*) and 𝑨̃ ≔ 𝜆I_𝑛 − 𝑨. Furthermore, we write 𝑴̃_{𝑖,𝑘}^𝜀 and 𝑴̃_{𝑖,𝑘} for the submatrices of 𝑨̃^𝜀 and 𝑨̃ obtained by deleting the 𝑖th row and 𝑘th column of 𝑨̃^𝜀 and 𝑨̃, respectively. Note that 𝜹_𝑗𝜹_𝑗^* is a matrix in which only the 𝑗th row and 𝑗th column can be non-zero, so we have
𝑴̃_{𝑖𝑗}^𝜀 = 𝑴̃_{𝑖𝑗} for any 1 ≤ 𝑖 ≤ 𝑛,
𝑴̃_{𝑗𝑗} = 𝜆I_{𝑛−1} − 𝑴_𝑗.
In the third line of the above calculation, we use (2.4) and the fact that
Thus, we have
𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆) = 𝑝_𝑨(𝜆) − 𝜀 𝑝_{𝑴_𝑗}(𝜆).
On the other hand, based on first-order perturbation theory, the eigenvalue 𝜆_𝑖(𝑨 + 𝜀𝜹_𝑗𝜹_𝑗^*) can also be expanded as
𝜆_𝑖(𝑨 + 𝜀𝜹_𝑗𝜹_𝑗^*) = 𝜆_𝑖(𝑨) + 𝜀⟨𝒗_𝑖, 𝜹_𝑗𝜹_𝑗^*𝒗_𝑖⟩ + 𝑂(𝜀²) = 𝜆_𝑖(𝑨) + 𝜀|𝑣_{𝑖𝑗}|² + 𝑂(𝜀²),
and
𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨 + 𝜀𝜹_𝑗𝜹_𝑗^*)) = 0.
Write Δ𝜆 ≔ 𝜆 − 𝜆_𝑖(𝑨). Expanding the characteristic polynomial about 𝜆_𝑖(𝑨),
𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆) = 𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨)) + 𝑝′_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨)) · Δ𝜆 + 𝑂(Δ𝜆²)
  ≈ 𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨)) + 𝑝′_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨)) · Δ𝜆
  = 𝑝_𝑨(𝜆_𝑖(𝑨)) − 𝜀𝑝_{𝑴_𝑗}(𝜆_𝑖(𝑨)) + 𝑝′_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨)) · Δ𝜆
  = −𝜀𝑝_{𝑴_𝑗}(𝜆_𝑖(𝑨)) + 𝑝′_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆_𝑖(𝑨)) · Δ𝜆
  = −𝜀𝑝_{𝑴_𝑗}(𝜆_𝑖(𝑨)) + 𝑝′_𝑨(𝜆_𝑖(𝑨)) · Δ𝜆 − 𝜀𝑝′_{𝑴_𝑗}(𝜆_𝑖(𝑨)) · Δ𝜆.
Note that we use (2.4) again in the last step of the above calculation. Based on (2.4), the quantity 𝑝_{𝑨+𝜀𝜹_𝑗𝜹_𝑗^*}(𝜆) is zero when evaluated at 𝜆 = 𝜆_𝑖(𝑨 + 𝜀𝜹_𝑗𝜹_𝑗^*). Using the expansion (2.4), we get Δ𝜆 = 𝜀|𝑣_{𝑖𝑗}|² + 𝑂(𝜀²). Thus, the term 𝜀𝑝′_{𝑴_𝑗}(𝜆_𝑖(𝑨)) · Δ𝜆 is a second-order term with respect to 𝜀. By matching the linear term in 𝜀, we get
It is interesting that both sides of the identity (2.4) equal zero when 𝜆_𝑖(𝑨) is a repeated eigenvalue of 𝑨. This also matches our intuition: if the eigenspace associated with 𝜆_𝑖(𝑨) has dimension greater than one, there are infinitely many orthonormal bases for this eigenspace, so it is impossible to pinpoint each component of the eigenvector.
2.5 Application
The eigenvector-eigenvalue identity allows one to reconstruct the magnitude of each
component of eigenvectors of 𝑨 ∈ ℍ𝑛 from eigenvalues of 𝑨 and its principal minor
𝑴 𝑗 (for some 𝑗 ∈ ℕ, 1 ≤ 𝑗 ≤ 𝑛 ). However, the phase information of each component
is not given by the identity. Nonetheless, Proposition 2.6 shows that the relative phase 𝑣_{𝑖𝑗} 𝑣_{𝑖𝑗′}^* can be determined.
As discussed in [Den+22], for a symmetric real matrix, the only ambiguity is the
sign of each component, which may be recovered by direct inspection of the eigenvector
equation 𝑨𝒗 𝑖 = 𝜆𝑖 (𝑨)𝒗 𝑖 . Given the eigenvalues and eigenvectors associated with 𝑨 ,
one can obtain the original matrix 𝑨 via the spectral decomposition.
Here, we briefly show an application of the eigenvector–eigenvalue identity to neutrino oscillation [DPZ20]. We omit much of the physical background needed to understand the physics of neutrinos, and simply focus on the mathematical expression that models the transition among different types of neutrinos. In neutrino oscillation, the Pontecorvo–Maki–Nakagawa–Sakata matrix (PMNS matrix) describes the “probability” of transformation between mass eigenstates (denoted by 1, 2, 3, which record the mass of the neutrino and form the basis in which neutrinos propagate in vacuum) and flavor eigenstates (denoted by 𝑒, 𝜇, 𝜏, which record the species of the neutrino and form the basis in which neutrinos propagate in matter). Although it is not clear
whether or not the three-neutrino model is correct, we proceed by assuming the PMNS
matrix 𝑼_PMNS is a 3 × 3 unitary matrix. In this case, the PMNS matrix has the following decomposition
𝑼_PMNS = [ 𝑈_{𝑒1} 𝑈_{𝑒2} 𝑈_{𝑒3} ; 𝑈_{𝜇1} 𝑈_{𝜇2} 𝑈_{𝜇3} ; 𝑈_{𝜏1} 𝑈_{𝜏2} 𝑈_{𝜏3} ]
        = [ 1  0  0 ; 0  𝑐_23  𝑠_23 e^{i𝛿} ; 0  −𝑠_23 e^{−i𝛿}  𝑐_23 ] [ 𝑐_13  0  𝑠_13 ; 0  1  0 ; −𝑠_13  0  𝑐_13 ] [ 𝑐_12  𝑠_12  0 ; −𝑠_12  𝑐_12  0 ; 0  0  1 ],
where 𝑠_{𝑖𝑗} = sin 𝜃_{𝑖𝑗}, 𝑐_{𝑖𝑗} = cos 𝜃_{𝑖𝑗}, and 𝑈_{𝛼𝑖} is the probability of 𝑖 becoming 𝛼 for 𝛼 ∈ {𝑒, 𝜇, 𝜏} and 𝑖 ∈ {1, 2, 3}. The full solution for the entries of the PMNS matrix can be obtained by solving a hard cubic equation. However, it is possible to obtain the entries via 𝑈̂_{𝛼𝑖}, for 𝛼 ∈ {𝑒, 𝜇, 𝜏} and 𝑖 ∈ {1, 2, 3}, which are components of eigenvectors of the Hamiltonian in the flavor basis.
In matter, the oscillation among these three flavors of neutrino is given by the following matrix 𝑯 (the Hamiltonian in flavor basis):
𝑯 = [ 𝐻_{𝑒𝑒} 𝐻_{𝜇𝑒} 𝐻_{𝜏𝑒} ; 𝐻_{𝑒𝜇} 𝐻_{𝜇𝜇} 𝐻_{𝜏𝜇} ; 𝐻_{𝑒𝜏} 𝐻_{𝜇𝜏} 𝐻_{𝜏𝜏} ]
   = (1/(2𝐸)) [ 𝑼_PMNS diag(0, Δ𝑚²_21, Δ𝑚²_31) 𝑼_PMNS^* + diag(𝑎, 0, 0) ],
Let the eigenvalues of 𝑯 be
𝜆_𝑗 / (2𝐸)   for 𝑗 = 1, 2, 3.
For each principal minor of 𝑯, let the eigenvalues of the minor obtained by deleting the row and column labeled by 𝛼, for 𝛼 ∈ {𝑒, 𝜇, 𝜏}, be
𝜉_{𝑘𝛼} / (2𝐸)   for 𝑘 = 1, 2.
Then, the eigenvector–eigenvalue identity tells us that the squared magnitude of 𝑈̂_{𝛼𝑗} is given by
|𝑈̂_{𝛼𝑗}|² = ∏_{𝑘=1}^{2} (𝜆_𝑗 − 𝜉_{𝑘𝛼}) / ∏_{𝑖≠𝑗} (𝜆_𝑗 − 𝜆_𝑖)
for 𝛼 ∈ {𝑒, 𝜇, 𝜏} and 𝑗 ∈ {1, 2, 3}. As one can find the entries of the PMNS matrix via 𝑈̂_{𝛼𝑗}, the PMNS matrix can be obtained by measuring the eigenvalues of 𝑯 and its principal minors. This is often easier than measuring each entry of the PMNS matrix itself. Furthermore, if experiments show that the three-neutrino theory is incorrect, the above formula can be applied, with trivial modifications, when 𝑼_PMNS is a 4 × 4 unitary matrix.
Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[DPZ20] P. B. Denton, S. J. Parke, and X. Zhang. “Neutrino oscillations in matter via
eigenvalues”. In: Phys. Rev. D 101 (2020), page 093001. doi: 10.1103/PhysRevD.
101.093001.
[Den+22] P. B. Denton et al. “Eigenvectors from eigenvalues: a survey of a basic identity in
linear algebra”. In: Bull. Amer. Math. Soc. (N.S.) 59.1 (2022), pages 31–58. doi:
10.1090/bull/1722.
[GLO20] A. Greenbaum, R.-C. Li, and M. L. Overton. “First-Order Perturbation Theory for
Eigenvalues and Eigenvectors”. In: SIAM Review 62.2 (2020), pages 463–482. doi:
10.1137/19M124784X.
[Kat95] T. Kato. Perturbation theory for linear operators. Reprint of the 1980 edition.
Springer-Verlag, Berlin, 1995.
[Saa11a] Y. Saad. Numerical Methods for Large Eigenvalue Problems. 2nd edition. Society for
Industrial and Applied Mathematics, 2011. doi: 10.1137/1.9781611970739.
[Wol19] N. Wolchover. “Neutrinos lead to unexpected discovery in basic math”. In: Quanta
Magazine (2019). url: https://www.quantamagazine.org/neutrinos-lead-
to-unexpected-discovery-in-basic-math-20191113.
3. Bipartite Ramanujan Graphs of Any Degree
Agenda:
1. Bipartite Ramanujan graphs
2. Reductions
3. Interlacing polynomials and real stability
4. Proof of Theorem 3.13

We review the work of Marcus, Spielman, and Srivastava on the existence of infinite families of bipartite Ramanujan graphs of any degree, closely following the arguments laid out in the series of works [MSS14; MSS15a; MSS15b]. In particular, we present a general perspective that relies on the notion of real stability of multivariate polynomials and the resulting implications for interlacing of univariate polynomials. These techniques were successfully applied by the authors of the aforementioned works to several other problems, including the Kadison–Singer problem. The results highlight the power of approaching problems in matrix theory via direct consideration of the characteristic polynomial.
A bipartite graph is one whose vertices V can be partitioned into disjoint subsets
S, T ⊆ V so that any edge bridges S and T. That is, for any edge (𝑢, 𝑣 ) ∈ E, either
𝑢 ∈ S and 𝑣 ∈ T, or else 𝑢 ∈ T and 𝑣 ∈ S. The eigenvalues of a bipartite graph are
symmetric about the origin; we leave this fact as an exercise.
Exercise 3.1 (Bipartite graph spectrum). Show that if 𝐺 = ( V, E) is bipartite, its spectrum
is symmetric about the origin. Hint: Argue that there is a vertex ordering under which
the adjacency matrix of a bipartite graph takes the form
[ 0  𝑩 ; 𝑩*  0 ].
Definition 3.3 (Ramanujan graph). A connected 𝑑-regular graph 𝐺 is called Ramanujan if all of its nontrivial eigenvalues have magnitude at most 2√(𝑑 − 1). In particular, a bipartite 𝑑-regular graph 𝐺 is Ramanujan if all of its eigenvalues, aside from ±𝑑, are bounded in magnitude by 2√(𝑑 − 1).
Ramanujan graphs are notable in that they are the ideal expander graphs. That is,
no infinite family of graphs can have (asymptotically) smaller nontrivial eigenvalues. A
lower bound due to Alon and Boppana [Alo86; Nil91] establishes that for any 𝜀 > 0 and any infinite family (𝐺_𝑖)_{𝑖=1}^{∞} of 𝑑-regular graphs, there exists some 𝑁 ∈ ℕ for which 𝐺_𝑖 has a nontrivial eigenvalue of magnitude at least 2√(𝑑 − 1) − 𝜀 when 𝑖 ≥ 𝑁.
For fixed 𝑑 , it is easy to construct 𝑑 -regular Ramanujan graphs with few vertices,
as demonstrated in the following example.
Example 3.4 (Complete bipartite graph). Consider the bipartite graph 𝐾 𝑑,𝑑 with vertex set
V = S ∪ T, where S = {1, . . . , 𝑑 } and T = {𝑑 + 1, . . . , 2𝑑 }, each vertex 𝑠 ∈ S has an
edge to every vertex in T, and these are the only edges. This is known as a complete
bipartite graph. Its adjacency matrix 𝑨 is of the form
𝑨 = [ 0  1 ; 1  0 ],
where 1 denotes the 𝑑 × 𝑑 matrix of ones. Since 𝑨 has rank 2, it follows that all of the
nontrivial eigenvalues of 𝐾𝑑,𝑑 are zero. Thus 𝐾𝑑,𝑑 is Ramanujan.
In general, it is not known whether there exist 𝑑 -regular Ramanujan graphs with
arbitrarily many vertices. Our task in this lecture will be to show that this is the case
when we restrict our focus to bipartite Ramanujan graphs. Specifically, we prove the
following result of Marcus, Spielman, and Srivastava, following the works [MSS15a;
MSS15b].
Theorem 3.5 (Marcus, Spielman, and Srivastava 2015). For every 𝑑 ≥ 3, there is an
infinite family of 𝑑 -regular bipartite Ramanujan graphs.
Our treatment of this result draws from the papers [MSS15a; MSS15b], as well as
the survey [MSS14] and Spielman’s course notes [Spi18b].
3.2 Reductions
In this section, we present several reductions of Theorem 3.5. We begin by reducing the
existence of infinite families of 𝑑 -regular bipartite Ramanujan graphs to the existence
of a signed adjacency matrix with a certain eigenvalue bound. Subsequently, we shift
our focus to the characteristic polynomial of this matrix, and we show that we can
equivalently phrase the problem as a bound relating the root of a random characteristic
polynomial with the root of its expectation. In the next section, these reductions enable
us to deploy techniques from [MSS15b] on interlacing families of polynomials.
3.2.1 2-lifts
To prove the existence of infinite families of bipartite Ramanujan graphs, it would
be very useful if, given any bipartite Ramanujan graph on 𝑛 vertices, we could use it
to construct another Ramanujan graph with more than 𝑛 vertices. To this end, we
introduce the following definitions.
where the sign vector 𝝈 ∈ {±1}𝑚 is indexed by edges. That is, 𝑺 has the same
nonzero pattern as the adjacency matrix 𝑨 , but both entries corresponding to an
edge (𝑖 , 𝑗 ) are given a sign 𝜎 (𝑖 ,𝑗 ) ∈ {±1}.
• For every vertex 𝑣 ∈ V, the 2-lift 𝐺 𝑺 has two vertices, which we denote 𝑣 0 , 𝑣 1 .
• If 𝑠𝑢𝑣 = 1, then 𝐺 𝑺 has the two edges (𝑢 0 , 𝑣 0 ) and (𝑢 1 , 𝑣 1 ) .
• If 𝑠𝑢𝑣 = −1, then 𝐺 𝑺 has the two edges (𝑢 0 , 𝑣 1 ) and (𝑢 1 , 𝑣 0 ) .
2-lifts give a means of constructing larger graphs out of smaller graphs in a manner
that preserves certain desirable properties. In particular, any 2-lift of a bipartite,
𝑑 -regular graph is bipartite and 𝑑 -regular, and the spectrum of a 2-lift 𝐺 𝑺 depends
only on the spectra of 𝐺 and 𝑺 . We leave these facts as exercises.
Exercise 3.8 (2-lift: Preservation of properties). Show that the following properties are
preserved by 2-lifts.
Exercise 3.9 (2-lift: Spectrum). Let 𝑺 be a signing of a graph 𝐺 . Show that the eigenvalues
of 𝐺 𝑺 are exactly the union of the eigenvalues of 𝐺 and the eigenvalues of 𝑺 .
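The spectral statement of Exercise 3.9 can be verified numerically; the sketch below (Python with NumPy; not part of the project) builds the 2-lift of a random signed graph and compares spectra.

    import numpy as np

    def two_lift(A, S):
        """Adjacency matrix of the 2-lift G_S: blocks follow the +1/-1 edge signs."""
        P = (A > 0) & (S > 0)      # edges with sign +1
        M = (A > 0) & (S < 0)      # edges with sign -1
        return np.block([[P, M], [M, P]]).astype(float)

    rng = np.random.default_rng(8)
    n = 6
    A = (rng.random((n, n)) < 0.5).astype(float)
    A = np.triu(A, 1); A = A + A.T                          # random simple graph
    signs = np.sign(rng.standard_normal((n, n)))
    signs = np.triu(signs, 1); signs = signs + signs.T
    S = A * signs                                           # a signing of A

    lift_spec = np.sort(np.linalg.eigvalsh(two_lift(A, S)))
    union_spec = np.sort(np.concatenate([np.linalg.eigvalsh(A), np.linalg.eigvalsh(S)]))
    print(np.allclose(lift_spec, union_spec))               # True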
These exercises imply that, if 𝐺 is a 𝑑-regular Ramanujan graph and it has a signing 𝑺 satisfying ‖𝑺‖ ≤ 2√(𝑑 − 1), then 𝐺_𝑺 is also Ramanujan. Thus, if it were the case that any 𝑑-regular graph had a signing 𝑺 with ‖𝑺‖ ≤ 2√(𝑑 − 1), this would immediately give a means for constructing infinite families of 𝑑-regular Ramanujan graphs. It was conjectured by Bilu and Linial [BL06] that such a signing can always be found; however, we will prove the following weaker result from [MSS15a].
Since the eigenvalues of a bipartite graph are symmetric about the origin, Theo-
rem 3.10 immediately implies (via the preceding argument) the existence of infinite
families of 𝑑 -regular bipartite Ramanujan graphs.
Theorem 3.11 (Roots: Random signing vs. expected characteristic polynomial). Suppose
𝐺 is a 𝑑 -regular graph and 𝑺 is a uniformly random signing. Then, with strictly
positive probability,
𝜆𝑛 (𝜒(𝑺 )) ≥ −𝜆 1 (𝔼 [𝜒(𝑺 )]) .
Proof that Theorem 3.12 implies Theorem 3.11. Suppose that the theorem's conclusion (3.2) holds. Since 𝑑I − 𝑺 = Σ_{𝑒∈E} 𝒗_𝑒 𝒗_𝑒^*, it follows that
𝑑 − 𝜆_𝑛(𝑺) = 𝜆_1(𝑑I − 𝑺) ≤ 𝜆_1(𝔼[𝜒(𝑑I − 𝑺)])
(We only prove that Theorem 3.12 implies Theorem 3.11, as this is sufficient to establish the reduction. However, the reverse implication can be seen by a nearly identical argument.)
Theorem 3.13 (Roots: Independent rank-one sum vs. expected char. poly.). Suppose that
𝒗 1 , . . . , 𝒗 𝑚 ∈ ℂ𝑛 are independent random vectors, each with finite support. Then,
𝜇𝑛 ≤ 𝜆𝑛 ≤ 𝜇𝑛−1 ≤ · · · ≤ 𝜇1 ≤ 𝜆 1 ,
𝑖 , 𝑗 ∈ {1, . . . , 𝑚}, the polynomials 𝑓𝑖 and 𝑓 𝑗 have a common interlacing. Show that
𝑓1 , . . . , 𝑓𝑚 have a common interlacing. Hint: This is an easy consequence of the
combinatorial Helly theorem on the real line, which states that the intersection of a
finite collection of non-pairwise-disjoint intervals in ℝ is nonempty; for example, see
[Pak10, Chapter 1].
If a family of polynomials has a common interlacing, then we can relate each root
of the average with the corresponding root of a member of the family. We provide the
following lemma without proof, though we recommend that the interested reader give
it some thought; its source and proof can be found in [MSS15a, Lemma 4.2].
Lemma 3.17 (Roots of family with common interlacing). Let 𝑓_1, . . . , 𝑓_𝑚 be a family of real-rooted, degree-𝑛 polynomials with positive leading coefficients, and let 𝜇 be a probability distribution on {1, . . . , 𝑚}. If 𝑓_1, . . . , 𝑓_𝑚 have a common interlacing, then there exists some 𝑗 ∈ {1, . . . , 𝑚} for which
𝜆_1(𝑓_𝑗) ≤ 𝜆_1(𝔼_{𝑘∼𝜇} 𝑓_𝑘).
(This existential result actually extends beyond the leading root to all other roots 𝜆_𝑖, 𝑖 ∈ {2, . . . , 𝑛}; see [MSS14, Theorem 2.2]. For our purposes, we only need the stated result involving the leading root 𝜆_1.)
It is evident that Lemma 3.17 reduces the proof of Theorem 3.13 to the problem of showing that the characteristic polynomials 𝜒(Σ_{𝑖=1}^{𝑚} 𝒗_𝑖 𝒗_𝑖^*) have a common interlacing (when considered as a family of polynomials under realizations of the random vectors 𝒗_𝑖). To this end, it will be useful to establish a sufficient condition for the existence of
𝒗 𝑖 ). To this end, it will be useful to establish a sufficient condition for the existence of
a common interlacing of a family of polynomials. As it turns out, existence of such
an interlacing is equivalent to the real-rootedness of arbitrary convex combinations of
polynomials within the family. This basic idea has appeared in various forms a number
of times, including [Ded92, Theorem 2.1], [Fel80, Theorem 2’], and [CS07, Theorem
3.6]. Our proof follows that of [MSS14].
Proof. By Exercise 3.16, it suffices to show that any pair of polynomials 𝑓𝑖 , 𝑓 𝑗 with 𝑖 ≠ 𝑗
has a common interlacing under the given assumptions. Define the polynomial 𝑔𝑡 as
the convex combination
𝑔𝑡 (𝑥) B ( 1 − 𝑡 ) 𝑓𝑖 (𝑥) + 𝑡 𝑓 𝑗 (𝑥) where 𝑡 ∈ [0, 1] .
We will proceed under the assumption that 𝑓𝑖 and 𝑓 𝑗 have no roots in common. If they
do share some root 𝜆 ∈ ℝ, then 𝜆 will also be a root of any convex combination 𝑔𝑡 .
Then it suffices to find a common interlacing ℎ (𝑥) for the polynomials 𝑓𝑖 (𝑥)/(𝑥 − 𝜆)
and 𝑓 𝑗 (𝑥)/(𝑥 −𝜆) ; once we have done this, it is straightforward to see that (𝑥 −𝜆) ·ℎ (𝑥)
will be a common interlacing for 𝑓𝑖 and 𝑓 𝑗 .
Let 𝜆𝑘 (𝑡 ) denote the 𝑘 th largest root of 𝑔𝑡 . From complex analysis, we know that
the roots of a polynomial are continuous in its coefficients; thus the image of the unit
interval under 𝜆𝑘 is a continuous curve in the complex plane that begins at the 𝑘 th
largest root of 𝑓𝑖 at 𝑡 = 0, and ends at the 𝑘 th largest root of 𝑓 𝑗 at 𝑡 = 1. In particular,
this curve must reside on the real line, owing to the assumption that expectations (i.e.,
convex combinations) of polynomials in the family have real roots. As such, the image
𝜆𝑘 ( [ 0, 1]) is a closed interval in ℝ. Moreover, for any 𝑡 ∈ ( 0, 1) , it cannot be the case
that 𝜆𝑘 (𝑡 ) is a root of 𝑓𝑖 , for if this were the case, then the assumption of no common
roots would be violated:
0 = 𝑔𝑡 (𝜆𝑘 (𝑡 )) = ( 1 − 𝑡 ) 𝑓𝑖 (𝜆𝑘 (𝑡 )) + 𝑡 𝑓 𝑗 (𝜆𝑘 (𝑡 )) = 𝑡 𝑓 𝑗 (𝜆𝑘 (𝑡 )).
Proof. We sketch a proof in the case that each matrix 𝑨 𝑖 is strictly positive definite;
the general result then follows from complex-analytic considerations. (See the proof in
[BB08] for further details.) We consider the restriction of the polynomial 𝑝 to a line.
Define 𝒛(𝑡) = 𝜶 + 𝑡𝜷, where 𝜶 ∈ ℝ^𝑚 and 𝜷 ∈ ℝ_{++}^𝑚 are arbitrary, and where 𝑡 ∈ ℂ.
Writing 𝑷 := ∑_{𝑖=1}^𝑚 𝛽_𝑖 𝑨_𝑖 ≻ 0, we have 𝑝(𝒛(𝑡)) = det(𝑷) · det(𝑡I + 𝑷^{−1/2}(∑_{𝑖=1}^𝑚 𝛼_𝑖 𝑨_𝑖)𝑷^{−1/2}).
In particular, det(𝑷) is constant and det(𝑡I + 𝑷^{−1/2}(∑_{𝑖=1}^𝑚 𝛼_𝑖 𝑨_𝑖)𝑷^{−1/2}) is the characteristic
polynomial (in the variable 𝑡) of the Hermitian matrix −𝑷^{−1/2}(∑_{𝑖=1}^𝑚 𝛼_𝑖 𝑨_𝑖)𝑷^{−1/2}.
By the spectral theorem, it follows that 𝑝(𝒛(𝑡)) has real roots, implying real stabil-
ity.
We next present two results on the closure of the class of real-stable polynomials
under two operations: the first is the difference of the identity and a partial derivative
operator, and the second is partial evaluation of the polynomial at a real number. We
state these without proof, instead referring the reader to [MSS15b, Corollary 3.8 and
Proposition 3.9] for a more detailed overview.
Lemma 3.21 (Closure: Identity minus derivative). Let 𝑝 ∈ ℝ[𝑧 1 , . . . , 𝑧𝑚 ] be real stable. For
any 𝑖 ∈ {1, . . . , 𝑚}, it is the case that ( 1 − 𝜕𝑧𝑖 )𝑝 is real stable.
Lemma 3.22 (Closure: Partial evaluation). Let 𝑝 ∈ ℝ[𝑧 1 , . . . , 𝑧𝑚 ] be real stable. For any
fixed 𝑎 ∈ ℝ, the (𝑚 − 1) -variate polynomial 𝑝 (𝑎, 𝑧 2 , . . . , 𝑧𝑚 ) is real stable.
We now define the mixed characteristic polynomial of a collection of matrices.
    𝜇[𝑨_1, . . . , 𝑨_𝑚; 𝑡] := [∏_{𝑖=1}^𝑚 (1 − 𝜕_{𝑧_𝑖})] det(𝑡I + ∑_{𝑖=1}^𝑚 𝑧_𝑖 𝑨_𝑖) |_{𝑧_1=···=𝑧_𝑚=0}.    (3.5)
Like with the characteristic polynomial, we refer to the mixed characteristic polynomial
as 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ] , using 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ; 𝑡 ] instead when it is useful to emphasize the
name of its variable 𝑡 .
Finally, we present the following result from [MSS15b, Theorem 4.1] relating the
expected characteristic polynomial of the sum of independent rank-one matrices to a
mixed characteristic polynomial of the corresponding covariance matrices.
Proof sketch. For the sake of brevity, we will not prove the equality (3.6); we refer
the reader to [MSS15b, Section 4] for a proof of this result. On the other hand,
real-rootedness of 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ] follows immediately from the definition of the
mixed characteristic polynomial in (3.5), Proposition 3.20, positive semidefiniteness of
covariance matrices, the closure properties in Lemmas 3.21 and 3.22, and the fact that
real stability is equivalent to real-rootedness in the univariate case.
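The equality (3.6), which identifies the expected characteristic polynomial of ∑_𝑖 𝒗_𝑖𝒗_𝑖^* with the mixed characteristic polynomial of the covariances 𝑨_𝑖 = 𝔼 𝒗_𝑖𝒗_𝑖^*, can be verified symbolically on small instances. A sketch in sympy, where the two-point supports with integer entries are hypothetical toy data:

```python
import itertools
import sympy as sp

# Two independent random vectors in R^2, each uniform on a two-point support.
supports = [
    [sp.Matrix([1, 0]), sp.Matrix([1, 2])],
    [sp.Matrix([0, 1]), sp.Matrix([2, 1])],
]
n, m = 2, len(supports)
t = sp.symbols("t")
z = sp.symbols("z1:3")

# Left-hand side: E chi[ sum_i v_i v_i^* ](t), by enumerating the supports.
lhs = sp.Integer(0)
for choice in itertools.product(range(2), repeat=m):
    S = sum((supports[i][k] * supports[i][k].T for i, k in enumerate(choice)),
            sp.zeros(n))
    lhs += (t * sp.eye(n) - S).det()
lhs = sp.expand(lhs / 2**m)

# Right-hand side: mixed characteristic polynomial (3.5) with A_i = E v_i v_i^*.
A = [(sup[0] * sup[0].T + sup[1] * sup[1].T) / 2 for sup in supports]
expr = (t * sp.eye(n) + sum((z[i] * A[i] for i in range(m)), sp.zeros(n))).det()
for i in range(m):
    expr = expr - sp.diff(expr, z[i])       # apply the operator (1 - d/dz_i)
rhs = sp.expand(expr.subs({zi: 0 for zi in z}))

print(sp.simplify(lhs - rhs))               # prints 0
```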
The mixed characteristic polynomial in the last line has real roots, by Theorem 3.24
(which we also employed in the second equality). The distribution 𝜈 was arbitrary, so
we may invoke Theorem 3.18 to discover that the family of polynomials (𝑞_{𝒖_1,...,𝒖_ℓ,𝒘_𝑘})_{𝑘=1}^𝑟
has a common interlacing. Then by Lemma 3.17, there is some 𝑘 ∈ {1, . . . , 𝑟} for
which

    𝜆_1(𝑞_{𝒖_1,...,𝒖_ℓ,𝒘_𝑘}) ≤ 𝜆_1(𝑞_{𝒖_1,...,𝒖_ℓ}).

Indeed, 𝑞_{𝒖_1,...,𝒖_ℓ} = 𝔼_{𝒗_{ℓ+1}}[𝑞_{𝒖_1,...,𝒖_ℓ,𝒗_{ℓ+1}}]. Applying this argument repeatedly at each level
ℓ ∈ {1, . . . , 𝑚} and chaining the inequalities, we eventually obtain an assignment
𝒖_1, . . . , 𝒖_𝑚 satisfying

    𝜆_1(𝜒[∑_{𝑖=1}^𝑚 𝒖_𝑖 𝒖_𝑖^*]) ≤ 𝜆_1(𝔼 𝜒[∑_{𝑖=1}^𝑚 𝒗_𝑖 𝒗_𝑖^*]),
Lecture bibliography
[Alo86] N. Alon. “Eigenvalues and Expanders”. In: Combinatorica 6.2 (June 1986), pages 83–
96. doi: 10.1007/BF02579166.
[BL06] Y. Bilu and N. Linial. “Lifts, Discrepancy and Nearly Optimal Spectral Gap”. In:
Combinatorica 26.5 (2006), pages 495–519. doi: 10.1007/s00493-006-0029-7.
[BB08] J. Borcea and P. Brändén. “Applications of stable polynomials to mixed determi-
nants: Johnson’s conjectures, unimodality, and symmetrized Fischer products”. In:
Duke Mathematical Journal 143.2 (2008), pages 205 –223. doi: 10.1215/00127094-
2008-018.
[CS07] M. Chudnovsky and P. Seymour. “The roots of the independence polynomial
of a clawfree graph”. In: Journal of Combinatorial Theory, Series B 97.3 (2007),
pages 350–357. doi: 10.1016/j.jctb.2006.06.001.
[Ded92] J. P. Dedieu. “Obreschkoff ’s theorem revisited: what convex sets are contained in the
set of hyperbolic polynomials?” In: Journal of Pure and Applied Algebra 81.3 (1992),
pages 269–278. doi: 10.1016/0022-4049(92)90060-S.
[Fel80] H. J. Fell. “On the zeros of convex combinations of polynomials.” In: Pacific Journal
of Mathematics 89.1 (1980), pages 43 –50. doi: pjm/1102779366.
[GG78] C. Godsil and I. Gutman. “On the matching polynomial of a graph”. In: Algebraic
Methods in Graph Theory 25 (Jan. 1978).
[HL72] O. J. Heilmann and E. H. Lieb. “Theory of monomer-dimer systems”. In: Communi-
cations in Mathematical Physics 25.3 (1972), pages 190 –232. doi: cmp/1103857921.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. “Expander graphs and their applications”.
In: Bulletin of the American Mathematical Society 43.4 (2006), pages 439–561.
[Kow19] E. Kowalski. An introduction to expander graphs. Société mathématique de France,
2019.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
h𝑩𝒙 , 𝒚 i = h𝒚 𝒙 ∗ , 𝑩i.
One can verify that a matrix 𝒁 ∈ 𝕄𝑚+𝑛 (ℝ) has the form in (4.2) if and only if 𝒁 has
rank one and has all ones on its diagonal. Thus, (4.1) can be exactly reformulated as
the rank-constrained matrix optimization problem
Since any feasible solution to (4.3) is also feasible for (4.4), we must have k𝑩 k ∞ →1 ≤
SDP (𝑩) . The content of Grothendieck’s inequality is that SDP (𝑩) is no larger than a
modest multiple of k𝑩 k ∞ →1 . That is, the semidefinite programming relaxation (4.4)
provides a constant-factor approximation to the Grothendieck problem (4.1).
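A minimal computational sketch of the relaxation (4.4), assuming cvxpy is available and using a hypothetical 2 × 2 test matrix; the brute-force value ‖𝑩‖_{∞→1} is compared against the SDP value:

```python
import itertools
import numpy as np
import cvxpy as cp

B = np.array([[1.0, -2.0], [3.0, 0.5]])     # hypothetical test matrix
m, n = B.shape

# Brute force: ||B||_{infty->1} = max over x in {+-1}^n, y in {+-1}^m of <Bx, y>.
cube = lambda k: (np.array(s) for s in itertools.product((-1.0, 1.0), repeat=k))
cut_norm = max(y @ B @ x for x in cube(n) for y in cube(m))

# SDP relaxation (4.4): maximize <B~, Z> over psd Z with unit diagonal, where B~
# carries B/2 in the off-diagonal blocks of an (m+n) x (m+n) symmetric matrix.
Btil = np.zeros((m + n, m + n))
Btil[:m, m:] = B / 2
Btil[m:, :m] = B.T / 2

Z = cp.Variable((m + n, m + n), PSD=True)
prob = cp.Problem(cp.Maximize(cp.sum(cp.multiply(Btil, Z))), [cp.diag(Z) == 1])
sdp_val = prob.solve()

print(cut_norm, sdp_val)   # cut_norm <= SDP(B) <= K_G * cut_norm
```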
Theorem 4.1 (Grothendieck's inequality I). There exists a constant K_G^ℝ such that

    SDP(𝑩) ≤ K_G^ℝ · ‖𝑩‖_{∞→1}   for every matrix 𝑩.
We indulge in one final reformulation. Note that a matrix 𝒁 is feasible for (4.4) if
and only if it is the Gram matrix of unit vectors 𝒖_1, . . . , 𝒖_𝑚, 𝒗_1, . . . , 𝒗_𝑛 in a Hilbert
space (H, ⟨·, ·⟩). That is,
        ⎡ ⟨𝒖_1, 𝒖_1⟩ ⋯ ⟨𝒖_1, 𝒖_𝑚⟩   ⟨𝒖_1, 𝒗_1⟩ ⋯ ⟨𝒖_1, 𝒗_𝑛⟩ ⎤
        ⎢     ⋮      ⋱      ⋮           ⋮      ⋱      ⋮     ⎥
        ⎢ ⟨𝒖_𝑚, 𝒖_1⟩ ⋯ ⟨𝒖_𝑚, 𝒖_𝑚⟩   ⟨𝒖_𝑚, 𝒗_1⟩ ⋯ ⟨𝒖_𝑚, 𝒗_𝑛⟩ ⎥
    𝒁 = ⎢ ⟨𝒗_1, 𝒖_1⟩ ⋯ ⟨𝒗_1, 𝒖_𝑚⟩   ⟨𝒗_1, 𝒗_1⟩ ⋯ ⟨𝒗_1, 𝒗_𝑛⟩ ⎥ .
        ⎢     ⋮      ⋱      ⋮           ⋮      ⋱      ⋮     ⎥
        ⎣ ⟨𝒗_𝑛, 𝒖_1⟩ ⋯ ⟨𝒗_𝑛, 𝒖_𝑚⟩   ⟨𝒗_𝑛, 𝒗_1⟩ ⋯ ⟨𝒗_𝑛, 𝒗_𝑛⟩ ⎦
With this observation, another equivalent statement of Grothendieck’s inequality
becomes evident. In premonition of what will come, we shall also generalize to the
field 𝔽 of either real or complex numbers.
Theorem 4.2 (Grothendieck's inequality II). There exists a constant K_G^𝔽 such that

    ∑_{𝑖=1}^𝑚 ∑_{𝑗=1}^𝑛 𝑏_{𝑖𝑗} ⟨𝒖_𝑖, 𝒗_𝑗⟩ ≤ K_G^𝔽 · ‖𝑩‖_{∞→1},   where ‖𝑩‖_{∞→1} = max |⟨𝑩𝒙, 𝒚⟩| over 𝒙, 𝒚 with entries in 𝕋,

where 𝕋 denotes the elements of 𝔽 with modulus one, 𝑩 is a matrix over 𝔽 of any size 𝑚 × 𝑛, and 𝒖_1, . . . , 𝒖_𝑚, 𝒗_1, . . . , 𝒗_𝑛 are unit vectors in a Hilbert space (H, ⟨·, ·⟩). (For 𝔽 = ℂ, this is the analog of (4.1), with 𝕋 the complex numbers of unit modulus.)
For algorithmic applications, we are not only interested in the optimal value
of the Grothendieck problem (4.1) but also in the optimal solutions 𝒙 ∈ {±1}^𝑚 and
𝒚 ∈ {±1}^𝑛. Fortunately, there are efficient ways of "rounding" the output of the
semidefinite programming relaxation (4.4) to approximate optimal solutions. The
following is a result of Alon and Naor [AN06].
Matrices 𝑼 ∈ 𝕄_𝑛(H) such that 𝑼^*𝑼 = 𝑼𝑼^* = I are called unitary; the set of such matrices is
denoted 𝕌_𝑛(H). For a sesquilinear form T on 𝕄_𝑛(ℂ) expressed in coordinates as

    T(𝑿, 𝒀) = ∑_{𝑖,𝑗,𝑘,ℓ=1}^𝑛 𝑡_{𝑖𝑗𝑘ℓ} 𝑥_{𝑖𝑗} 𝑦_{𝑘ℓ},
A result of this form was conjectured by Grothendieck and proved by Pisier [Pis78].
Haagerup obtained the sharp constant using a more streamlined proof [Haa85], and
Haagerup and Itoh established the optimality of the constant [HI95].
bution:

    ℙ{𝑡 ∈ [𝑎, 𝑏]} = ∫_𝑎^𝑏 ½ sech(π𝑥/2) d𝑥 ;

2  Set 𝑼_𝒛 ← ⟨𝒛, 𝑼⟩ ∈ 𝕄_𝑛(ℂ) and 𝑽_𝒛 ← ⟨𝒛, 𝑽⟩ ∈ 𝕄_𝑛(ℂ), where the inner product is taken entrywise.
3  Compute polar decompositions 𝑼_𝒛 = 𝚽_𝒛|𝑼_𝒛| and 𝑽_𝒛 = 𝚿_𝒛|𝑽_𝒛|.
4  return 𝑿 ← 𝚽_𝒛 |𝑼_𝒛/√2|^{i𝑡},  𝒀 ← 𝚿_𝒛 |𝑽_𝒛/√2|^{i𝑡}
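A numpy/scipy sketch of the rounding above. The data layout (each entry of 𝑼, 𝑽 stored as a 𝑑-dimensional vector, i.e., arrays of shape (𝑛, 𝑛, 𝑑)) and the conjugation convention for ⟨𝒛, ·⟩ are assumptions made for illustration; the sampler draws 𝑡 by inverting the CDF of the density ½ sech(π𝑥/2):

```python
import numpy as np
from scipy.linalg import polar

rng = np.random.default_rng(0)

def sample_sech():
    # Inverse-CDF sample from the density (1/2) sech(pi x / 2).
    u = rng.uniform()
    return (2.0 / np.pi) * np.log(np.tan(np.pi * u / 2.0))

def herm_power(M, exponent):
    # M^exponent for a Hermitian psd matrix M, via its spectral resolution.
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 1e-12, None)
    return (V * w.astype(complex) ** exponent) @ V.conj().T

def round_once(U, V):
    # U, V: arrays of shape (n, n, d) holding the Gram vectors from the relaxation.
    d = U.shape[-1]
    z = rng.choice(np.array([1, -1, 1j, -1j]), size=d)
    t = sample_sech()
    Uz = U @ np.conj(z)                     # entrywise inner products <z, U_ij>
    Vz = V @ np.conj(z)
    PhiU, absU = polar(Uz)                  # Uz = PhiU @ absU, with absU psd
    PhiV, absV = polar(Vz)
    X = PhiU @ herm_power(absU / np.sqrt(2), 1j * t)
    Y = PhiV @ herm_power(absV / np.sqrt(2), 1j * t)
    return X, Y
```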
Let 𝑿 and 𝒀 be the outputs of Algorithm 4.1 applied to these inputs. Then

    𝔼 Re T(𝑿, 𝒀) ≥ (½ − 𝜀) · OPT(T).

    𝔼[𝑼_𝒛𝑼_𝒛^*], 𝔼[𝑼_𝒛^*𝑼_𝒛], 𝔼[𝑽_𝒛𝑽_𝒛^*], 𝔼[𝑽_𝒛^*𝑽_𝒛] ≼ 2I.
The proof is a short and rather pedestrian calculation, which we omit. We refer the
interested reader to [NRV14, Lem. 2.2].
For our second lemma, it will be helpful to “upgrade” a pair of vector-valued
matrices 𝑩, 𝑪 ∈ 𝕄𝑛 (ℂ𝑑 ) which are subunitary in the sense that
k𝑩 ∗ 𝑩 k, k𝑩𝑩 ∗ k, k𝑪 ∗𝑪 k, k𝑪𝑪 ∗ k ≤ 1 (4.7)
to full unitary vector-valued matrices 𝑹, 𝑺 ∈ 𝕌_𝑛(ℂ^{𝑑̃}), possibly of a larger dimension
𝑑̃ ≥ 𝑑, while preserving the sesquilinear form T. The next lemma shows this is
possible.
Lemma 4.8 (Subunitary to unitary). Suppose 𝑩, 𝑪 ∈ 𝕄_𝑛(ℂ^𝑑) satisfy the subunitarity
bound (4.7). Then there exist 𝑹, 𝑺 ∈ 𝕌_𝑛(ℂ^{𝑑+2𝑛²}) such that
T(𝑹 , 𝑺 ) = T(𝑩, 𝑪 ).
Proof. Define residuals 𝑫 = I − 𝑩𝑩 ∗ and 𝑬 = I − 𝑩 ∗ 𝑩 . Define 𝜎 := tr 𝑫 = tr 𝑬 and
consider spectral decompositions
    𝑫 = ∑_{𝑖=1}^𝑛 𝜆_𝑖 𝒖^{(𝑖)} 𝒖^{(𝑖)*}   and   𝑬 = ∑_{𝑖=1}^𝑛 𝜇_𝑖 𝒗^{(𝑖)} 𝒗^{(𝑖)*}.

Define 𝑹 ∈ 𝕌_𝑛(ℂ^{𝑑+2𝑛²}) with (𝑖, 𝑗)th entry

    𝒓_{𝑖𝑗} = 𝒃_{𝑖𝑗} ⊕ ( √(𝜆_𝑘 𝜇_ℓ / 𝜎) · 𝑢_𝑖^{(𝑘)} 𝑣_𝑗^{(ℓ)*} : 𝑘, ℓ = 1, . . . , 𝑛 ) ⊕ 0,    (4.8)
Proof of Theorem 4.6. Let 𝔼𝒛 , 𝔼𝑡 , and 𝔼𝒛 ,𝑡 denote expectations over the randomness
in 𝒛 , 𝑡 , or both 𝒛 and 𝑡 , respectively. Since T is sesquilinear and 𝒛 is isotropic in the
sense that 𝔼 𝒛 𝒛 ∗ = I, we compute
    𝔼_𝑡[𝑎^{i𝑡}] = 2𝑎 − 𝔼_𝑡[𝑎^{2+i𝑡}].    (4.11)
We shall show precisely this. For the remainder of this proof, fix 𝑡 ∈ ℝ.
To show (4.13), we "derandomize" the expectation over 𝒛 by passing from random
scalar-valued matrices 𝚽_𝒛|𝑼_𝒛|^{2+i𝑡}, 𝚿_𝒛|𝑽_𝒛|^{2−i𝑡} ∈ 𝕄_𝑛(ℂ) to deterministic vector-valued
matrices 𝑩, 𝑪 ∈ 𝕄_𝑛(ℂ^{{1,−1,i,−i}^𝑑}), which we define as

    𝑩(𝒛) := (1/2^𝑑) 𝚽_𝒛|𝑼_𝒛|^{2+i𝑡}   and   𝑪(𝒛) := (1/2^𝑑) 𝚿_𝒛|𝑽_𝒛|^{2−i𝑡}.

Here, ℂ^{{1,−1,i,−i}^𝑑} denotes the Hilbert space of functions 𝑓 : {1, −1, i, −i}^𝑑 → ℂ.
    𝑩𝑩^*, 𝑩^*𝑩, 𝑪𝑪^*, 𝑪^*𝑪 ≼ 2I.

Thus, 𝑩/√2 and 𝑪/√2 are subunitary in the sense of Lemma 4.8, so there exist
𝑹, 𝑺 ∈ 𝕌_𝑛(ℂ^{𝑑̃}) for some dimension 𝑑̃ such that

    2 T(𝑹, 𝑺) = T(𝑩, 𝑪) = 𝔼_𝒛 T(𝚽_𝒛|𝑼_𝒛|^{2+i𝑡}, 𝚿_𝒛|𝑽_𝒛|^{2−i𝑡}).
But since 𝑹 and 𝑺 are unitary, Re T(𝑹 , 𝑺 ) ≤ SDP ( T) . Plugging in and rearranging
yields (4.13), completing the proof.
4.4.1 ℓ₁ PCA
Consider a data set encoded in the columns of a matrix 𝑩 ∈ ℂ𝑚×𝑛 . The PCA problem
is to find a 𝑘 -dimensional subspace capturing most of the “energy” or “variance” in the
data matrix 𝑩 . In classical PCA, this can be formulated as an optimization problem
where 𝕌𝑚×𝑘 (ℂ) consists of the 𝑚 × 𝑘 matrices with orthonormal columns. An optimal
solution to this problem is readily determined by choosing 𝑾 to consist of the 𝑘
dominant left singular vectors of 𝑩 .
To make this procedure more robust to outliers, we can replace the Frobenius
norm in (4.14) with a different norm. For instance, Kwak [Kwa08] proposes the
following "ℓ₁ PCA" problem
4.4.2 Reformulation
We now reformulate the ℓ₁ PCA problem as a bilinear optimization problem over
suitably defined unitary matrices. First, use the duality of ℓ₁ and ℓ_∞ to write the ℓ₁
norm as a maximization:
One can further encode 𝑾 and 𝒁 as unitary matrices 𝑼 and 𝑽 with specific entries
set to zero:

    𝑼 = 𝑾 0 ∈ 𝕌_𝑚(ℂ);
    𝑽 = diag(𝑧_{𝑖𝑗} : 1 ≤ 𝑖 ≤ 𝑘, 1 ≤ 𝑗 ≤ 𝑛) ∈ 𝕌_{𝑘𝑛}(ℂ).    (4.16)
Thus, one can reformulate the ℓ₁ PCA problem (4.15) as a bilinear optimization
problem over unitary matrices 𝑼 and 𝑽 by designing the sesquilinear form so
that optimal solutions 𝑼 and 𝑽 take the forms (4.16); see [NRV14, §5.1] for details.
The noncommutative Grothendieck inequality ensures that the semidefinite programming
relaxation of this problem provides a constant-factor approximation. We conclude that the ℓ₁
PCA problem has a polynomial-time constant-factor approximation algorithm.
Notes
A survey on various Grothendieck inequalities and their applications is given by Pisier
[Pis12].
The optimal (commutative) Grothendieck constants satisfy the bounds

    1.67696 ≤ K_G^ℝ < 1.7822… = π / (2 log(1 + √2)),   1.338 < K_G^ℂ ≤ 1.4049.
Davie established the lower bounds in unpublished work; Krivine showed the upper
bound on KGℝ [Kri78]; Braverman, Makarychev, Makarychev, and Naor showed the
slight suboptimality of Krivine’s bound [Bra+11]; and Haagerup showed the upper
bound on KGℂ in [Haa85]. Vershynin provides an accessible exposition on Krivine’s
bound [Ver18, §3.7].
The semidefinite programming interpretation of Grothendieck’s inequality was
popularized by Alon and Naor [AN06]. Their rounding procedure (Theorem 4.3) is
based on Krivine’s bound for the Grothendieck constant. Thus, in the notation of
Theorem 4.3,
    K_G^{ℝ,rnd} ≤ π / (2 log(1 + √2)).
There are many ways to round answers from semidefinite programs to approximate so-
lutions of combinatorial optimization problems; among these, the Goemans–Williamson
algorithm for MaxCut is a distinguished example [GW95]. Assuming the Unique
Games Conjecture, it is NP-hard to approximate k𝑩 k ∞ →1 to within a factor of KGℝ − 𝜀
for any 𝜀 > 0 [RS09]; in this sense, the semidefinite programming relaxation (4.4) provides an
essentially optimal approximation guarantee.
Lecture bibliography
[AN06] N. Alon and A. Naor. “Approximating the cut-norm via Grothendieck’s inequality”.
In: SIAM J. Comput. 35.4 (2006), pages 787–803. doi: 10.1137/S0097539704441629.
[Bra+11] M. Braverman et al. “The Grothendieck Constant Is Strictly Smaller than Krivine’s
Bound”. In: 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.
Oct. 2011, pages 453–462. doi: 10.1109/FOCS.2011.77.
[BRS17] J. Briët, O. Regev, and R. Saket. “Tight Hardness of the Non-Commutative
Grothendieck Problem”. In: Theory of Computing 13.15 (Dec. 2017), pages 1–
24. doi: 10.4086/toc.2017.v013a015.
[Din+06] C. Ding et al. “R1 -PCA: Rotational Invariant L1 -Norm Principal Component Analysis
for Robust Subspace Factorization”. In: Proceedings of the 23rd International
Conference on Machine Learning. ICML ’06. Association for Computing Machinery,
June 2006, pages 281–288. doi: 10.1145/1143844.1143880.
[GW95] M. X. Goemans and D. P. Williamson. “Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming”. In:
J. Assoc. Comput. Mach. 42.6 (1995), pages 1115–1145. doi: 10.1145/227683.
227684.
[Gro53] A. Grothendieck. Résumé de La Théorie Métrique Des Produits Tensoriels Topologiques.
Soc. de Matemática de São Paulo, 1953.
[Haa85] U. Haagerup. “The Grothendieck Inequality for Bilinear Forms on C*-Algebras”.
In: Advances in Mathematics 56.2 (May 1985), pages 93–116. doi: 10.1016/0001-
8708(85)90026-X.
[HI95] U. Haagerup and T. Itoh. “Grothendieck Type Norms For Bilinear Forms On
C*-Algebras”. In: Journal of Operator Theory 34.2 (1995), pages 263–283.
[Kai83] S. Kaijser. “A Simple-Minded Proof of the Pisier-Grothendieck Inequality”. In:
Banach Spaces, Harmonic Analysis, and Probability Theory. Springer, 1983, pages 33–
55.
[Kri78] J.-L. Krivine. “Constantes de Grothendieck et Fonctions de Type Positif Sur Les
Spheres”. In: Séminaire d’Analyse fonctionnelle (dit “Maurey-Schwartz”) (1978),
pages 1–17.
    Find 𝑿 ∈ ℍ_𝑛 :  𝑨^*𝑿 + 𝑿𝑨 − 𝑿𝑺𝑿 + 𝑸 = 0,

is known as the continuous algebraic Riccati equation (CARE), whereas the equation

    Find 𝑿 ∈ ℍ_𝑛 :  𝑿 = 𝑸 + 𝑨^*(𝑺 + 𝑿^{−1})^{−1}𝑨,    (DARE)

is known as the discrete algebraic Riccati equation (DARE). These equations are classified
as continuous and discrete since they naturally arise from studying continuous-time
and discrete-time dynamical systems in control and filtering problems. This manuscript
omits the continuous version and deals only with the discrete version, (DARE).
The rest of this manuscript is organized as follows. In Section 5.1, a central problem
in control theory is introduced to motivate the study of AREs. Section 5.2 follows
with a digression on the metric geometry of positive-definite (pd) matrices in order
to develop the tools to study AREs later in this manuscript. In Section 5.3, stability
of matrices and a closely related linear matrix equation are studied in the context of
linear dynamical systems. Section 5.4 concludes the manuscript with an analysis of the
solution of (DARE) by applying the tools and techniques developed in the previous
sections.
5.1 Motivation
In this section, the linear-quadratic regulator (LQR) problem from optimal control
theory is introduced. The LQR problem is central to optimal control theory due to its
wide applicability in various real-world settings as well as its rich mathematical structure,
including AREs.
Example 5.1 (Discrete-time infinite-horizon LQRs). Consider the following linear dynamical
system,
𝒙 𝑡 +1 = 𝑨𝒙 𝑡 + 𝑩𝒖 𝑡 , for 𝑡 ≥ 0,
(LIN)
𝒙0 = 𝒛
where {𝒙_𝑡}_{𝑡∈ℕ} ⊂ ℂ^𝑛 is the sequence of states, 𝒛 ∈ ℂ^𝑛 is the initial state, {𝒖_𝑡}_{𝑡∈ℕ} ⊂ ℂ^𝑚
is the sequence of control inputs, 𝑨 ∈ 𝕄𝑛 is the state evolution matrix, and 𝑩 ∈ ℂ𝑛×𝑚
is the input matrix.
At each time step 𝑡 ∈ ℕ, a controller observes the current state 𝒙 𝑡 , exerts a control
input 𝒖 𝑡 to the dynamics, and then suffers from an instantaneous cost 𝑐 (𝒙 𝑡 , 𝒖 𝑡 ) as
a function of the state and the control input. In many real-world applications, it
is assumed that the instantaneous cost is a quadratic function of the state and the
control input, that is, 𝑐 (𝒙 𝑡 , 𝒖 𝑡 ) B 𝒙 𝑡∗𝑸 𝒙 𝑡 + 𝒖 𝑡∗𝑹𝒖 𝑡 where 𝑸 ∈ ℍ𝑛++ and 𝑹 ∈ ℍ𝑛++
are given positive-definite matrices. The objective of the controller is to reduce the
cumulative cost by deploying a control policy to design control inputs based on past
state observations. One straightforward approach is to deploy a linear state-feedback
controller 𝑲 ∈ ℂ𝑚×𝑛 such that the inputs are designed as 𝒖 𝑡 = 𝑲 𝒙 𝑡 for all 𝑡 ∈ ℕ.
In fact, the optimal control of the linear dynamical system (LIN) can be achieved
solely by a state-feedback policy as long as the pair (𝑨, 𝑩) is stabilizable (see
Definition 5.11) [Ber12]. The optimal state-feedback controller can be found by
minimizing the infinite-horizon cumulative cost among all state-feedback controllers
subject to linear dynamical system (LIN) as formulated below.
    min_{𝑲∈ℂ^{𝑚×𝑛}}  𝑉(𝒛; 𝑲) := ∑_{𝑡=0}^∞ ( 𝒙_𝑡^*𝑸𝒙_𝑡 + 𝒖_𝑡^*𝑹𝒖_𝑡 ),

    subject to  𝒙_{𝑡+1} = 𝑨𝒙_𝑡 + 𝑩𝒖_𝑡,  for 𝑡 ∈ ℕ,
                𝒖_𝑡 = 𝑲𝒙_𝑡,  for 𝑡 ∈ ℕ,                    (OPT 1)
                𝒙_0 = 𝒛.
Here, 𝒛 ∈ ℂ𝑛 is a given initial state and 𝑉 (𝒛 ; 𝑲 ) is the value function. The solution to
this problem is known as the infinite-horizon linear-quadratic regulator (LQR).
The value function gives the total cost of controlling the linear dynamical sys-
tem (LIN) with state-feedback controller 𝑲 ∈ ℂ𝑚×𝑛 starting from an initial state
𝒛 ∈ ℂ𝑛 . Note that the value function is not guaranteed to take a finite value for
all 𝑲 ∈ ℂ𝑚×𝑛 . In fact, a state-feedback controller 𝑲 ∈ ℂ𝑚×𝑛 is called stabilizing if
𝑉 (𝒛 ; 𝑲 ) takes a finite value for any initial state 𝒛 ∈ ℂ𝑛 . Assuming that a state-feedback
controller 𝑲 ∈ ℂ𝑚×𝑛 is stabilizing, the value function can be expressed recursively as
    𝑉(𝒛; 𝑲) = ∑_{𝑡=0}^∞ ( 𝒙_𝑡^*𝑸𝒙_𝑡 + (𝑲𝒙_𝑡)^*𝑹(𝑲𝒙_𝑡) )
            = ∑_{𝑡=0}^∞ 𝒙_𝑡^*(𝑸 + 𝑲^*𝑹𝑲)𝒙_𝑡
            = 𝒙_0^*(𝑸 + 𝑲^*𝑹𝑲)𝒙_0 + ∑_{𝑡=1}^∞ 𝒙_𝑡^*(𝑸 + 𝑲^*𝑹𝑲)𝒙_𝑡
            = 𝒛^*(𝑸 + 𝑲^*𝑹𝑲)𝒛 + 𝑉((𝑨 + 𝑩𝑲)𝒛; 𝑲),    (5.1)
for any 𝒛 ∈ ℂ𝑛 where (5.1) follows from taking the next step at time 𝑡 = 1 as the new
initial state.
Furthermore, it can be argued that the value function is quadratic in the initial
state 𝒛 by noting that the current state at time 𝑡 ≥ 0 is a linear function of the initial
state as 𝒙 𝑡 = (𝑨 + 𝑩𝑲 )𝑡 𝒛 and the instantaneous cost is quadratic in the current state.
Thus, one can write 𝑉 (𝒛 ; 𝑲 ) = 𝒛 ∗𝑷 𝒛 where 𝑷 ∈ ℍ𝑛+ is a psd matrix which depends on
the feedback gain matrix, 𝑲 . Inserting the quadratic form into the recursive relation
in (5.1), one obtains the following condition
(𝑨 + 𝑩𝑲 ) ∗𝑷 (𝑨 + 𝑩𝑲 ) − 𝑷 + 𝑸 + 𝑲 ∗𝑹𝑲 = 0. (5.2)
The equation (5.2) is known as the Lyapunov equation with state-feedback controller
𝑲. This relationship makes it possible to rewrite the optimal control problem (OPT 1) as
    min_{𝑲∈ℂ^{𝑚×𝑛}, 𝑷∈ℍ_𝑛}  𝒛^*𝑷𝒛,

    subject to  (𝑨 + 𝑩𝑲)^*𝑷(𝑨 + 𝑩𝑲) − 𝑷 + 𝑸 + 𝑲^*𝑹𝑲 = 0,    (OPT 2)
                𝑷 ≽ 0.
Noticing that the Lyapunov equation (5.2) is quadratic with respect to 𝑲 and that
𝑹 + 𝑩^*𝑷𝑩 ≻ 0, one can rewrite the equation (5.2) by completion of squares as

    𝑷 = 𝑸 + 𝑨^*𝑷𝑨 − 𝑨^*𝑷𝑩(𝑹 + 𝑩^*𝑷𝑩)^{−1}𝑩^*𝑷𝑨 + (𝑲 − 𝑳)^*(𝑹 + 𝑩^*𝑷𝑩)(𝑲 − 𝑳),    (5.3)

where 𝑳 := −(𝑹 + 𝑩^*𝑷𝑩)^{−1}𝑩^*𝑷𝑨. Since the last term in (5.3) is positive semidefinite, any such 𝑷 satisfies

    𝑷 ≽ 𝑸 + 𝑨^*𝑷𝑨 − 𝑨^*𝑷𝑩(𝑹 + 𝑩^*𝑷𝑩)^{−1}𝑩^*𝑷𝑨.    (5.4)
Conversely, for any 𝑷 ∈ ℍ𝑛+ satisfying the inequality (5.4), one can find a matrix 𝑲 ∈
ℂ𝑚×𝑛 such that the equation (5.3) holds. Therefore, the optimization problem (OPT 2)
is equivalent to
    min_{𝑷∈ℍ_𝑛}  𝒛^*𝑷𝒛,

    subject to  𝑷 ≽ 𝑸 + 𝑨^*𝑷𝑨 − 𝑨^*𝑷𝑩(𝑹 + 𝑩^*𝑷𝑩)^{−1}𝑩^*𝑷𝑨,
                𝑷 ≽ 0.
It can be argued that an optimal solution 𝑷_★ ≽ 0 to the problem above is attained
when the inequality holds with equality, that is,

    𝑷_★ = 𝑸 + 𝑨^*𝑷_★𝑨 − 𝑨^*𝑷_★𝑩(𝑹 + 𝑩^*𝑷_★𝑩)^{−1}𝑩^*𝑷_★𝑨.
The space of Hermitian matrices equipped with this norm, (ℍ_𝑛, ‖·‖_Φ), is a real (complete) normed
vector space. The exponential map, defined as

    exp : ℍ_𝑛 → ℙ_𝑛,   𝑯 ↦→ ∑_{𝑘=0}^∞ (1/𝑘!) 𝑯^𝑘,
is a diffeomorphism, that is, a smooth bijection with smooth inverse, denoted
log : ℙ_𝑛 → ℍ_𝑛. Thus, the open convex cone of pd matrices constitutes a smooth
manifold. It can be shown that the tangent space of ℙ_𝑛 at a point 𝑿 ∈ ℙ_𝑛 is isomorphic
to the space of Hermitian matrices, that is, 𝑇_𝑿 ℙ_𝑛 ≅ ℍ_𝑛.
A Riemannian manifold is defined as a smooth manifold whose tangent space
at every point is equipped with a smooth inner product. Finsler manifolds generalize
Riemannian manifolds by instead equipping the tangent space at every point with a smooth
norm. The following theorem shows that ℙ_𝑛 admits a metric, called
a Finsler metric, induced by a symmetric gauge function (sgf) defined on its tangent
space.
Theorem 5.2 (Metric induced by an sgf, [Bha03, p. 216]). Let Φ : ℝ^𝑛 → ℝ_+ be a
symmetric gauge function (sgf). The nonnegative function

    δ_Φ : ℙ_𝑛 × ℙ_𝑛 → ℝ_+,   (𝑿, 𝒀) ↦→ ‖log(𝑿^{−1/2} 𝒀 𝑿^{−1/2})‖_Φ,

defines a metric on ℙ_𝑛 that is invariant under matrix inversion and under the congruence action 𝑿 ↦→ 𝑴𝑿𝑴^* of GL(𝑛).

Aside: GL(𝑛) is the general linear group of invertible square matrices of size 𝑛 ∈ ℕ.
Remark 5.3 The special case of the Euclidean sgf results in a Riemannian manifold
(ℙ_𝑛, δ_2), since the Euclidean norm admits an inner product. The metric δ_∞ induced by
the ℓ_∞ sgf (the spectral norm) is called the Thompson metric.
The last part of Theorem 5.2 simply says that the inversion operation and the group action
of GL(𝑛) on ℙ_𝑛 are isometries with respect to δ_Φ. This invariance property will play
a role in the convergence analysis of fixed-point iterations of the Riccati operator.
The metric δ_Φ is related to the geometric mean of matrices in a natural way. The
following proposition shows that δ_Φ is in fact the geodesic distance on ℙ_𝑛 and that
the geodesics can be parameterized by the geometric mean of matrices.

Aside: A geodesic is a shortest-distance curve connecting two points on a smooth manifold.

Proposition 5.4 (Geodesics, [Bha03, p. 216]). Let 𝑿, 𝒀 ∈ ℙ_𝑛 be pd matrices. Define the
𝑡-weighted geometric mean of two pd matrices as

    𝑿 ♯_𝑡 𝒀 := 𝑿^{1/2} (𝑿^{−1/2} 𝒀 𝑿^{−1/2})^𝑡 𝑿^{1/2},

for 𝑡 ∈ [0, 1]. The curve γ : 𝑡 ↦→ 𝑿 ♯_𝑡 𝒀 starting at 𝑿 and ending at 𝒀 is a geodesic
with respect to the metric δ_Φ for any sgf Φ : ℝ^𝑛 → ℝ_+. In other words,

    δ_Φ(γ(𝑠), γ(𝑡)) = δ_Φ(𝑿 ♯_𝑠 𝒀, 𝑿 ♯_𝑡 𝒀) = |𝑠 − 𝑡| δ_Φ(𝑿, 𝒀) = |𝑠 − 𝑡| δ_Φ(γ(0), γ(1)),

for 𝑠, 𝑡 ∈ [0, 1].
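Proposition 5.4 is easy to probe numerically. A small numpy sketch with two randomly generated pd matrices (hypothetical data), using the spectral-norm sgf:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
rand_pd = lambda: (lambda M: M @ M.T + np.eye(n))(rng.standard_normal((n, n)))
X, Y = rand_pd(), rand_pd()

def inv_sqrt(P):
    w, V = np.linalg.eigh(P)
    return (V / np.sqrt(w)) @ V.T

def geo_mean(X, Y, t):
    # X #_t Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^t X^{1/2}
    w, V = np.linalg.eigh(X)
    Xh, Xmh = (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T
    w2, V2 = np.linalg.eigh(Xmh @ Y @ Xmh)
    return Xh @ ((V2 * w2**t) @ V2.T) @ Xh

def delta(P, Q):
    # delta_Phi for the spectral-norm sgf: max_i |log lambda_i(P^{-1/2} Q P^{-1/2})|
    w = np.linalg.eigvalsh(inv_sqrt(P) @ Q @ inv_sqrt(P))
    return np.max(np.abs(np.log(w)))

s, t = 0.2, 0.7
print(delta(geo_mean(X, Y, s), geo_mean(X, Y, t)), abs(s - t) * delta(X, Y))
```

The two printed numbers should agree, illustrating that the curve is a constant-speed geodesic.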
This section ends with some key definitions and theorems relevant to the conver-
gence analysis in Section 5.4.
    𝐿_Φ(𝑓) := sup_{𝑿,𝒀∈ℙ_𝑛, 𝑿≠𝒀}  δ_Φ(𝑓(𝑿), 𝑓(𝒀)) / δ_Φ(𝑿, 𝒀)
1. nonexpansive if 𝐿 Φ ( 𝑓 ) ≤ 1,
The following lemma states a sufficient condition for a self-map to have a unique fixed
point as well as a simple iterative algorithm to approximate it.
Lemma 5.6 (Banach’s fixed-point theorem). Let 𝑓 : (ℙ𝑛 , 𝛿 Φ ) → (ℙ𝑛 , 𝛿 Φ ) be a strict
contraction on ℙ𝑛 . Then, there exists a unique fixed point 𝑿 ★ = 𝑓 (𝑿 ★ ) . Furthermore,
the iterates {𝑿 𝑘 }𝑘 ∈ℕ defined recursively as 𝑿 𝑘 +1 = 𝑓 (𝑿 𝑘 ) converge to the fixed point
𝑿 ★ for any 𝑿 0 ∈ ℙ𝑛 .
Finally, the next lemma shows that translations do not increase the distance in the
Finsler metric.
Lemma 5.7 (Translations are nonexpansive, [LL08, Prop. 4.1]). Let 𝑷 ∈ ℍ𝑛+ be a psd matrix.
For any 𝑿 ,𝒀 ∈ ℙ𝑛 , one has that
𝑥 ★ = 0 for any starting point 𝒙 0 ∈ ℂ𝑛 . A necessary and sufficient condition for this to
happen is lim𝑡 →∞ 𝑨 𝑡 = 0.
The limit condition for stability stated in the example above could be impractical
to verify in many settings. Therefore, alternative ways of feasibly certifying the stability of
linear dynamical systems are needed.
The spectral radius serves as an equivalent criterion for certifying stability, as stated by the
following lemma.
Lemma 5.10 The limit lim𝑡 →∞ 𝑨 𝑡 = 0 holds if and only if all the eigenvalues of 𝑨 are
inside the open unit disk of the complex plane, that is, 𝜌 (𝑨) < 1.
It can be shown that the limit lim_{𝑡→∞} 𝑱^𝑡 = 0 holds if and only if, for all eigenvalues
𝜆 ∈ sp(𝑨), the limit lim_{𝑡→∞} 𝑱_𝜆^𝑡 = 0 holds for each Jordan block corresponding to
eigenvalue 𝜆. Since the entries of 𝑱_𝜆^𝑡 consist only of powers of 𝜆 up to 𝑡, the limit
condition is satisfied if and only if all the eigenvalues have modulus less than 1.
The notion of stability of dynamical systems can be extended to matrices by applying
the spectral radius criterion.
Definition 5.11 (Stable matrices). A matrix 𝑨 ∈ 𝕄𝑛 is called (Schur) stable if all the
eigenvalues of 𝑨 are strictly inside the unit circle, that is, 𝜌 (𝑨) < 1. A pair of
matrices (𝑨, 𝑩) is said to be stabilizable if there exists a matrix 𝑲 ∈ ℂ𝑚×𝑛 such
that 𝑨 + 𝑩𝑲 is stable.
As discussed earlier in Section 5.1 and at the beginning of this section, Lyapunov
equations are foundational in the analysis of Riccati equations. Lyapunov equations
are named after Russian mathematician Aleksandr Mikhailovich Lyapunov (1857-1918)
who pioneered the mathematical theory of stability of dynamical systems and worked
in the fields of mathematical physics, probability theory and potential theory [SMI92].
He originally proposed the Lyapunov function method to certify stability of equilibrium
points of ordinary differential equations.
When applied to discrete-time linear dynamical systems, the Lyapunov function method
leads to the Lyapunov stability theorem, which certifies stability by exhibiting a
solution to a Lyapunov equation, as stated in Theorem 5.14. The details of the Lyapunov
function method are omitted in this manuscript for the sake of brevity. However, the
interested reader is referred to [Kha96] for detailed derivations.
The proof of Theorem 5.14 relies on the following lemmas.
Lemma 5.12 (Gelfand’s formula, [Lax02, Thm. 17.4]). Suppose a consistent matrix norm
k·k is given. For any matrix 𝑨 ∈ 𝕄𝑛 , the spectral radius is bounded from above as
𝜌 (𝑨) ≤ k𝑨 𝑘 k 1/𝑘 for all 𝑘 ∈ ℕ. Furthermore, k𝑨 𝑘 k 1/𝑘 ↓ 𝜌 (𝑨) as 𝑘 → ∞.
Gelfand’s formula provides a way to approximate spectral radius from matrix
norms.
Lemma 5.13 (Bounded matrix norm, [HJ13, Lem. 5.6.10]). Let 𝑨 ∈ 𝕄_𝑛 and 𝜖 > 0 be given.
There exists a submultiplicative matrix norm ‖·‖_{𝑨,𝜖} such that 𝜌(𝑨) ≤ ‖𝑨‖_{𝑨,𝜖} ≤
𝜌(𝑨) + 𝜖.
Theorem 5.14 (Lyapunov stability theorem). Let 𝑨 ∈ 𝕄𝑛 be given. The following are
equivalent.
Proof. Suppose that the equation 𝑷 − 𝑨^*𝑷𝑨 = 𝑸 holds for 𝑷 ≻ 0 and 𝑸 ≻ 0.
Multiplying both sides with 𝑷^{−1/2} from the left and from the right, as

    I_𝑛 − (𝑷^{−1/2}𝑨^*𝑷^{1/2})(𝑷^{1/2}𝑨𝑷^{−1/2}) = 𝑷^{−1/2}𝑸𝑷^{−1/2} ≻ 0,

Aside: The spectral norm ‖𝑨‖_2 is equal to the maximum singular value.
one obtains the bound k𝑷 1/2 𝑨𝑷 −1/2 k 2 < 1 on the spectral norm. Since the mapping
𝑨 ↦→ 𝑷 1/2 𝑨𝑷 −1/2 is a similarity transformation, the characteristic polynomial of
𝑷 1/2 𝑨𝑷 −1/2 and its spectrum are unchanged. Thus, one can certify stability of 𝑨 by
noticing
𝜌 (𝑨) = 𝜌 (𝑷 1/2 𝑨𝑷 −1/2 ) ≤ k𝑷 1/2 𝑨𝑷 −1/2 k 2 < 1,
where the first inequality is due to Lemma 5.12.
For the converse, suppose 𝜌(𝑨) < 1. Consider the sequence {𝑷_𝑡}_{𝑡∈ℕ} generated
by the iterations 𝑷_{𝑡+1} = 𝑸 + 𝑨^*𝑷_𝑡𝑨 with 𝑷_0 = 0, for a given pd matrix 𝑸 ≻ 0. The
iterate at time 𝑡 ≥ 0 can be expressed in closed form as

    𝑷_𝑡 = ∑_{𝑠=0}^{𝑡−1} (𝑨^*)^𝑠 𝑸 𝑨^𝑠 ≽ 0.
Notice that the iterates are nondecreasing, that is, 𝑷_{𝑡+1} ≽ 𝑷_𝑡 ≽ 0 for all 𝑡 ≥ 0. Let
𝜖 > 0 be small enough such that 𝜌 (𝑨) + 𝜖 < 1. By Lemma 5.12, there exists a constant
𝑇𝜖 ∈ ℕ such that 𝜌 (𝑨) ≤ k𝑨 𝑡 k 1/𝑡 < 𝜌 (𝑨) + 𝜖 < 1 for all 𝑡 ≥ 𝑇𝜖 . Therefore, for any
𝑡 ≥ 𝑠 ≥ 𝑇𝜖 + 1, the distance between two iterates can be bounded as
    ‖𝑷_𝑡 − 𝑷_𝑠‖_2 = ‖ ∑_{𝑘=𝑠}^{𝑡−1} (𝑨^*)^𝑘 𝑸 𝑨^𝑘 ‖_2
                 ≤ ∑_{𝑘=𝑠}^{𝑡−1} ‖(𝑨^*)^𝑘 𝑸 𝑨^𝑘‖_2
                 ≤ ‖𝑸‖_2 ∑_{𝑘=𝑠}^{𝑡−1} ‖𝑨^𝑘‖_2^2
                 ≤ ‖𝑸‖_2 ∑_{𝑘=𝑠}^{𝑡−1} (𝜌(𝑨) + 𝜖)^{2𝑘}
                 = ‖𝑸‖_2 [(𝜌(𝑨) + 𝜖)^{2𝑠} − (𝜌(𝑨) + 𝜖)^{2𝑡}] / [1 − (𝜌(𝑨) + 𝜖)^2]
                 ≤ ‖𝑸‖_2 (𝜌(𝑨) + 𝜖)^{2𝑠} / [1 − (𝜌(𝑨) + 𝜖)^2],
where the first inequality is due to the triangle inequality and the second is due to the
submultiplicativity of the spectral norm. Notice that the bound above is independent
of 𝑡, which leads to the bound

    sup_{𝑡≥𝑠} ‖𝑷_𝑡 − 𝑷_𝑠‖_2 ≤ ‖𝑸‖_2 (𝜌(𝑨) + 𝜖)^{2𝑠} / [1 − (𝜌(𝑨) + 𝜖)^2].
The right-hand side converges to zero as 𝑠 → ∞ since 𝜌(𝑨) + 𝜖 < 1. Therefore, the
sequence of iterates, {𝑷_𝑡}_{𝑡∈ℕ}, is Cauchy in ℍ_𝑛^+ and there exists a limit lim_{𝑡→∞} 𝑷_𝑡 =
𝑷 ∈ ℍ_𝑛^+ such that 𝑷 ≽ 𝑷_𝑡 ≽ 0. The limit can be expressed as an infinite sum,
𝑷 = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡. Substituting this limit in the Lyapunov equation yields

    𝑷 − 𝑨^*𝑷𝑨 = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 − ∑_{𝑡=0}^∞ (𝑨^*)^{𝑡+1} 𝑸 𝑨^{𝑡+1}
             = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 − ∑_{𝑡=1}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 = 𝑸.
In order to show uniqueness of the solution of the Lyapunov equation, suppose
𝑷_1, 𝑷_2 ≻ 0 are two different solutions. Subtracting the equation 𝑷_1 = 𝑨^*𝑷_1𝑨 + 𝑸
from 𝑷_2 = 𝑨^*𝑷_2𝑨 + 𝑸, one obtains

    𝑷_1 − 𝑷_2 = 𝑨^*(𝑷_1 − 𝑷_2)𝑨.

Observe that multiplying the equality above repeatedly with 𝑨^* from the left and with 𝑨
from the right yields

    𝑷_1 − 𝑷_2 = (𝑨^*)^𝑡 (𝑷_1 − 𝑷_2) 𝑨^𝑡,

for any 𝑡 ≥ 0. Let 𝜖 > 0 be given such that 𝜌(𝑨) + 𝜖 < 1. By Lemma 5.13, there
exists a submultiplicative norm ‖·‖_{𝑨,𝜖} such that ‖𝑨^𝑠‖_{𝑨,𝜖} ≤ (𝜌(𝑨) + 𝜖)^𝑠. By the
submultiplicative property, the distance between the two solutions can be bounded as

    ‖𝑷_1 − 𝑷_2‖_{𝑨,𝜖} ≤ ‖𝑨^𝑡‖_{𝑨,𝜖}^2 ‖𝑷_1 − 𝑷_2‖_{𝑨,𝜖} ≤ (𝜌(𝑨) + 𝜖)^{2𝑡} ‖𝑷_1 − 𝑷_2‖_{𝑨,𝜖},

which holds for all 𝑡 ≥ 0. Taking the limit as 𝑡 → ∞, one gets that ‖𝑷_1 − 𝑷_2‖_{𝑨,𝜖} = 0,
which is a contradiction. Hence, the solution must be unique.
Theorem 5.16 (Solution operator, [Lan70, Thm. 3]). Let 𝑨 ∈ 𝕄𝑛 be a stable matrix. The
solution operator L𝑨−1 : ℍ𝑛 → ℍ𝑛 can be expressed as
    L_𝑨^{−1}(𝑸) = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡,    (5.6)
for any 𝑸 ∈ ℍ𝑛 .
Proof. Since 𝑨 is stable, all of its eigenvalues are inside the open unit disk. Hence, the
series ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 converges for any 𝑸 ∈ ℍ_𝑛 (see the proof of Theorem 5.14).
By evaluating the Lyapunov operator at the candidate solution (5.6), one gets
    L_𝑨[ ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 ] = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 − 𝑨^*[ ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 ]𝑨
                                 = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 − ∑_{𝑡=0}^∞ (𝑨^*)^{𝑡+1} 𝑸 𝑨^{𝑡+1}
                                 = ∑_{𝑡=0}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡 − ∑_{𝑡=1}^∞ (𝑨^*)^𝑡 𝑸 𝑨^𝑡
                                 = 𝑸.
Uniqueness of the solution operator for stable 𝑨 can be argued in a similar fashion as
in the proof of Theorem 5.14.
Remark 5.17 (Solution in Kronecker product form). The Lyapunov equation can also be
solved for general 𝑨, 𝑸 ∈ 𝕄_𝑛 by vectorizing and using Kronecker products as

    (I ⊗ I − 𝑨^⊤ ⊗ 𝑨^*) vec(𝑷) = vec(𝑸).

If the eigenvalues of the matrix 𝑨 are such that 𝜆_𝑖 𝜆_𝑗^* ≠ 1 for all 𝑖, 𝑗, then the unique solution exists
for any 𝑸 and is given by

    vec(𝑷) = (I ⊗ I − 𝑨^⊤ ⊗ 𝑨^*)^{−1} vec(𝑸).

Notice that the special case 𝜌(𝑨) < 1 satisfies the spectrum condition since all
eigenvalues have modulus less than 1.
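A short numpy illustration of Remark 5.17 and Theorem 5.16, with a hypothetical stable test matrix: solve the Lyapunov equation by vectorization and compare with a truncation of the series solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n)) / (2 * np.sqrt(n))      # typically rho(A) < 1
assert np.max(np.abs(np.linalg.eigvals(A))) < 1
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                                  # positive definite

# Vectorized Lyapunov equation: (I kron I - A^T kron A^*) vec(P) = vec(Q); A is real here.
K = np.eye(n * n) - np.kron(A.T, A.T)
P = np.linalg.solve(K, Q.flatten(order="F")).reshape((n, n), order="F")

# Series solution P = sum_t (A^*)^t Q A^t from Theorem 5.16, truncated.
P_series, At = np.zeros((n, n)), np.eye(n)
for _ in range(300):
    P_series += At.T @ Q @ At
    At = At @ A

print(np.linalg.norm(P - P_series))            # ~ 0
print(np.linalg.norm(P - A.T @ P @ A - Q))     # residual of the Lyapunov equation, ~ 0
```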
R(𝑿 ) = 𝑸 + 𝑨 ∗ 𝑿 𝑨 − 𝑨 ∗ 𝑿 𝑩 (𝑹 + 𝑩 ∗ 𝑿 𝑩) −1 𝑩 ∗ 𝑿 𝑨,
for any 𝑿 ∈ ℍ𝑛+ . Lemma 5.19 establishes the formal link between Riccati and Lyapunov
equations as hinted in Section 5.1. The following lemma is needed in order to prove
Lemma 5.19.
Lemma 5.18 (Block upper-diagonal-lower (UDL) decomposition). Let 𝑨 ∈ 𝕄_𝑛, 𝑩 ∈ ℂ^{𝑛×𝑚},
𝑪 ∈ ℂ^{𝑚×𝑛}, and 𝑫 ∈ 𝕄_𝑚 be given complex matrices. As long as 𝑫 is invertible, we have the
decomposition

    ⎡ 𝑨  𝑩 ⎤   ⎡ I_𝑛    𝑩𝑫^{−1} ⎤ ⎡ 𝑨 − 𝑩𝑫^{−1}𝑪   0_{𝑛𝑚} ⎤ ⎡ I_𝑛      0_{𝑛𝑚} ⎤
    ⎣ 𝑪  𝑫 ⎦ = ⎣ 0_{𝑚𝑛}  I_𝑚    ⎦ ⎣ 0_{𝑚𝑛}          𝑫     ⎦ ⎣ 𝑫^{−1}𝑪  I_𝑚    ⎦ .
Proof. The proof follows directly by multiplying the matrices in the right-hand side.
Lemma 5.19 (A useful upper bound). Consider the function Γ : ℂ𝑚×𝑛 × ℍ𝑛+ → ℍ𝑛+ defined
as
Γ(𝑲 , 𝑿 ) B (𝑨 + 𝑩𝑲 ) ∗ 𝑿 (𝑨 + 𝑩𝑲 ) + 𝑸 + 𝑲 ∗𝑹𝑲 .
For any 𝑲 ∈ ℂ𝑚×𝑛 and 𝑿 ∈ ℍ𝑛+ , it holds that R(𝑿 ) 4 Γ(𝑲 , 𝑿 ) . Furthermore, the
equality R(𝑿 ) = Γ(𝑲 ★ (𝑿 ), 𝑿 ) holds for
𝑲 ★ (𝑿 ) B −(𝑹 + 𝑩 ∗ 𝑿 𝑩) −1 𝑩 ∗ 𝑿 𝑨.
    ⎡ 𝑨^*𝑿𝑨 + 𝑸    𝑨^*𝑿𝑩      ⎤
    ⎣ 𝑩^*𝑿𝑨        𝑹 + 𝑩^*𝑿𝑩 ⎦ = 𝑼𝑫𝑼^*,

where

    𝑼 = ⎡ I_𝑛   𝑨^*𝑿𝑩(𝑹 + 𝑩^*𝑿𝑩)^{−1} ⎤
        ⎣ 0     I_𝑚                    ⎦ ,

and

    𝑫 = ⎡ 𝑨^*𝑿𝑨 + 𝑸 − 𝑨^*𝑿𝑩(𝑹 + 𝑩^*𝑿𝑩)^{−1}𝑩^*𝑿𝑨   0          ⎤
        ⎣ 0                                          𝑹 + 𝑩^*𝑿𝑩 ⎦ .
Notice that the first block of 𝑫 is the same as R(𝑿), and the upper off-diagonal block of 𝑼 is
simply −𝑲_★(𝑿)^*. Using this decomposition, Γ(𝑲, 𝑿) can be rewritten as

    Γ(𝑲, 𝑿) = [I_𝑛  𝑲^*] 𝑼𝑫𝑼^* [I_𝑛; 𝑲]
            = [I_𝑛  (𝑲 − 𝑲_★(𝑿))^*] ⎡ R(𝑿)  0          ⎤ ⎡ I_𝑛          ⎤
                                     ⎣ 0     𝑹 + 𝑩^*𝑿𝑩 ⎦ ⎣ 𝑲 − 𝑲_★(𝑿) ⎦
            = R(𝑿) + (𝑲 − 𝑲_★(𝑿))^*(𝑹 + 𝑩^*𝑿𝑩)(𝑲 − 𝑲_★(𝑿)).
Thus, Γ(𝑲, 𝑿) ≽ R(𝑿) for any 𝑲 ∈ ℂ^{𝑚×𝑛} and 𝑿 ∈ ℍ_𝑛^+, with equality only if
𝑲 = 𝑲_★(𝑿).
Proposition 5.20 (Properties of Riccati operator). Suppose 𝑿, 𝑿′ ∈ ℍ_𝑛^+. Then, we have
that
In order to show concavity, Lemma 5.19 and linearity of Γ(𝑲 , 𝑿 ) wrt 𝑿 can be
used to write
Theorem 5.21 (Existence and uniqueness of stabilizing fixed point). If the pair (𝑨, 𝑩) is
stabilizable, then there exists a unique positive-definite fixed point 𝑿_★ ≻ 0 of the
Riccati operator such that 𝑨 + 𝑩𝑲_★(𝑿_★) is a stable matrix. Conversely, if there
exists a positive-definite fixed point 𝑿_★ ≻ 0, then it is uniquely determined and
𝑨 + 𝑩𝑲_★(𝑿_★) is a stable matrix.
𝑿 ★ = Γ(𝑲 ★ , 𝑿 ★ ) = (𝑨 + 𝑩𝑲 ★ ) ∗ 𝑿 ★ (𝑨 + 𝑩𝑲 ★ ) + 𝑸 + 𝑲 ∗★𝑹𝑲 ★
    (𝑿_0 − 𝑿_1) − (𝑨 + 𝑩𝑲_1)^*(𝑿_0 − 𝑿_1)(𝑨 + 𝑩𝑲_1) ≽ 0.

By the second item of Proposition 5.15, it can be seen that 𝑿_0 ≽ 𝑿_1, which implies
R(𝑿_0) ≽ R(𝑿_1) by monotonicity.
Recursively repeating this process, a nonincreasing sequence of pd matrices
{𝑿_𝑘}_{𝑘∈ℕ} can be constructed such that 𝑿_𝑘 ≽ 𝑿_{𝑘+1}. The stabilizing matrix of the next
step is 𝑲_{𝑘+1} := 𝑲_★(𝑿_𝑘), chosen so that 𝑨 + 𝑩𝑲_{𝑘+1} is stable. Each iterate also satisfies the
Lyapunov equation Γ(𝑲_𝑘, 𝑿_𝑘) = 𝑿_𝑘 ≻ 0.
    R(𝑿) = 𝑸 + 𝑨^*(𝑿 − 𝑿𝑩(𝑹 + 𝑩^*𝑿𝑩)^{−1}𝑩^*𝑿)𝑨
          = 𝑸 + 𝑨^*(𝑺 + 𝑿^{−1})^{−1}𝑨,

where 𝑺 := 𝑩𝑹^{−1}𝑩^* and the second equality follows from the Woodbury identity.
Theorem 5.23 (Strict contraction of Riccati map, [LL08, Thm. 4.4]). Suppose that 𝑸, 𝑺 ≻ 0
are positive definite. Let 𝑨 ∈ 𝕄_𝑛 be a complex-valued matrix. Then, the Riccati
map is a strict contraction with respect to δ_Φ with the Lipschitz constant

    𝐿_Φ(R) ≤ 𝜆_max(𝑨^*𝑺^{−1}𝑨) / (𝜆_min(𝑸) + 𝜆_max(𝑨^*𝑺^{−1}𝑨)) < 1,

for any sgf Φ.
Notice that 𝛼 ≤ 𝜆 max (𝑨 ∗𝑺 −1 𝑨) . Then, for any sgf Φ, the Finsler distance between
    δ_Φ(R_𝑘(𝑿), R_𝑘(𝒀)) ≤ [𝜆_max(𝑨_𝑘^*𝑺^{−1}𝑨_𝑘) / (𝜆_min(𝑸) + 𝜆_max(𝑨_𝑘^*𝑺^{−1}𝑨_𝑘))] δ_Φ(𝑿, 𝒀).
Taking the limit of both sides and using the continuity of the metric (𝑿 ,𝒀 ) ↦→ 𝛿 Φ (𝑿 ,𝒀 )
and the eigenvalue function 𝑨 ↦→ 𝜆 max (𝑨 ∗𝑺 −1 𝑨) , one obtains the final bound
    δ_Φ(R(𝑿), R(𝒀)) ≤ [𝜆_max(𝑨^*𝑺^{−1}𝑨) / (𝜆_min(𝑸) + 𝜆_max(𝑨^*𝑺^{−1}𝑨))] δ_Φ(𝑿, 𝒀),
where the coefficient is strictly less than 1.
Strict contraction of the Riccati operator enables one to approximate the unique fixed point
by fixed-point iteration starting from any initial point. The rate of convergence is
geometric and controlled by the Lipschitz constant of the Riccati map, as shown in
the following corollary.
Corollary 5.24 (Convergence rate of Riccati iterates, [LL08, Thm. 4.6]). Suppose 𝑸, 𝑺 ≻ 0
and let 𝑨 ∈ 𝕄_𝑛 be arbitrary. Given an initial seed 𝑿_0 ∈ ℙ_𝑛, define the recursion
𝑿_{𝑘+1} := R(𝑿_𝑘) for 𝑘 ∈ ℕ. Then, for any sgf Φ, the distance of the iterate 𝑿_𝑘 to the
unique fixed point 𝑿_★ = R(𝑿_★) is controlled as

    δ_Φ(𝑿_★, 𝑿_𝑘) ≤ [𝐿^𝑘 / (1 − 𝐿)] δ_Φ(𝑿_1, 𝑿_0),

where

    𝐿 = 𝜆_max(𝑨^*𝑺^{−1}𝑨) / (𝜆_min(𝑸) + 𝜆_max(𝑨^*𝑺^{−1}𝑨)) < 1.
By the fixed-point property and Theorem 5.23,

    δ_Φ(𝑿_★, 𝑿_𝑘) = δ_Φ(R(𝑿_★), R(𝑿_{𝑘−1})) ≤ 𝐿 δ_Φ(𝑿_★, 𝑿_{𝑘−1}).

Iterating this inequality gives

    δ_Φ(𝑿_★, 𝑿_𝑘) ≤ 𝐿^{𝑘−1} δ_Φ(𝑿_★, 𝑿_1).

Moreover, by the triangle inequality,

    δ_Φ(𝑿_★, 𝑿_1) ≤ 𝐿 δ_Φ(𝑿_★, 𝑿_0) ≤ 𝐿 (δ_Φ(𝑿_★, 𝑿_1) + δ_Φ(𝑿_1, 𝑿_0)),  so that  δ_Φ(𝑿_★, 𝑿_1) ≤ [𝐿/(1 − 𝐿)] δ_Φ(𝑿_1, 𝑿_0),

which implies

    δ_Φ(𝑿_★, 𝑿_𝑘) ≤ 𝐿^{𝑘−1} δ_Φ(𝑿_★, 𝑿_1) ≤ [𝐿^𝑘/(1 − 𝐿)] δ_Φ(𝑿_1, 𝑿_0).
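The contraction estimate of Theorem 5.23 and the resulting convergence in Corollary 5.24 can be observed directly. A numpy sketch with hypothetical 𝑨, 𝑸 ≻ 0, 𝑺 ≻ 0, using the spectral-norm sgf:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
Q = (lambda M: M @ M.T + np.eye(n))(rng.standard_normal((n, n)))
S = (lambda M: M @ M.T + np.eye(n))(rng.standard_normal((n, n)))

riccati = lambda X: Q + A.T @ np.linalg.inv(S + np.linalg.inv(X)) @ A   # DARE map

def inv_sqrt(P):
    w, V = np.linalg.eigh(P)
    return (V / np.sqrt(w)) @ V.T

def delta(P, R):
    # Finsler distance for the spectral-norm sgf.
    w = np.linalg.eigvalsh(inv_sqrt(P) @ R @ inv_sqrt(P))
    return np.max(np.abs(np.log(w)))

lam = np.linalg.eigvalsh(A.T @ np.linalg.inv(S) @ A).max()
L = lam / (np.linalg.eigvalsh(Q).min() + lam)     # Lipschitz bound from Theorem 5.23

X, steps = np.eye(n), []
for _ in range(20):
    X_next = riccati(X)
    steps.append(delta(X_next, X))
    X = X_next
ratios = [b / a for a, b in zip(steps, steps[1:])]
print("L =", L, "  max observed contraction ratio =", max(ratios))
```

The observed ratios of successive step lengths should stay below the bound 𝐿, illustrating the geometric convergence.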
Lecture bibliography
[Ber12] D. Bertsekas. Dynamic Programming and Optimal Control: Volume I. v. 1. Athena
Scientific, 2012. url: https://books.google.com/books?id=qVBEEAAAQBAJ.
[Bha03] R. Bhatia. “On the exponential metric increasing property”. In: Linear Algebra and its
Applications 375 (2003), pages 211–220. doi: 10.1016/S0024-3795(03)00647-5.
[BH06] R. Bhatia and J. Holbrook. “Riemannian geometry and matrix geometric means”.
In: Linear Algebra and its Applications 413.2 (2006). Special Issue on the 11th
Conference of the International Linear Algebra Society, Coimbra, 2004, pages 594–
618. doi: 10.1016/j.laa.2005.08.025.
[Bou93] P. Bougerol. “Kalman Filtering with Random Coefficients and Contractions”. In:
SIAM Journal on Control and Optimization 31.4 (1993), pages 942–959. eprint:
https://doi.org/10.1137/0331041. doi: 10.1137/0331041.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[KSH00] T. Kailath, A. H. Sayed, and B. Hassibi. Linear estimation. Prentice Hall, 2000.
[Kal60] R. E. Kalman. “A New Approach to Linear Filtering and Prediction Problems”.
In: Journal of Basic Engineering 82.1 (Mar. 1960), pages 35–45. eprint: https:
//asmedigitalcollection.asme.org/fluidsengineering/article-pdf/
82/1/35/5518977/35\_1.pdf. doi: 10.1115/1.3662552.
[Kha96] H. Khalil. “Nonlinear Systems, Printice-Hall”. In: Upper Saddle River, NJ 3 (1996).
[Lan70] P. Lancaster. “Explicit Solutions of Linear Matrix Equations”. In: SIAM Review 12.4
(1970), pages 544–566. url: http://www.jstor.org/stable/2028490.
[LL07] J. Lawson and Y. Lim. “A Birkhoff Contraction Formula with Applications to
Riccati Equations”. In: SIAM Journal on Control and Optimization 46.3 (2007),
pages 930–951. eprint: https://doi.org/10.1137/050637637. doi: 10.1137/
050637637.
[Lax02] P. D. Lax. Functional analysis. Wiley-Interscience, 2002.
[LL08] H. Lee and Y. Lim. “Invariant metrics, contractions and nonlinear matrix equations”.
In: Nonlinearity 21.4 (2008), pages 857–878. doi: 10.1088/0951-7715/21/4/
011.
[LW94] C. Liverani and M. P. Wojtkowski. “Generalization of the Hilbert metric to the
space of positive definite matrices.” In: Pacific Journal of Mathematics 166.2 (1994),
pages 339 –355. doi: pjm/1102621142.
[Ric24] J. Riccati. “Animadversiones in aequationes differentiales secundi gradus”. In:
Actorum Eruditorum quae Lipsiae publicantur Supplementa. Actorum Eruditorum
quae Lipsiae publicantur Supplementa v. 8. prostant apud Joh. Grossii haeredes
& J.F. Gleditschium, 1724. url: https : / / books . google . com / books ? id =
UjTw1w7tZsEC.
[SMI92] V. I. SMIRNOV. “Biography of A. M. Lyapunov”. In: International Journal of
Control 55.3 (1992), pages 775–784. eprint: https : / / doi . org / 10 . 1080 /
00207179208934258. doi: 10.1080/00207179208934258.
6. Hyperbolic Polynomials
In this note, we survey the basic properties of hyperbolic polynomials and their
consequences. These polynomials generalize the determinant, and the roots of their
analogously-defined characteristic polynomials share remarkably many of the properties
of eigenvalues. Hyperbolic polynomials were originally introduced in the study of
(hyperbolic) PDEs by Gårding [Går51], and have since found applications in several
other areas, most famously in the resolution of the Kadison–Singer problem by Marcus,
Spielman, and Srivastava [MSS14].

Agenda:
1. Hyperbolic polynomials
2. Differentiation
3. Quadratics and Alexandrov's inequality
4. SDPs and perturbation theory
5. Compositions
6. Euclidean structure
Note that the characteristic polynomial always has degree 𝑑 , and hence has 𝑑 roots
counting multiplicities, since the leading term of 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒆 ) is (−1) 𝑑 𝑝 (𝒆 )𝑡 𝑑 and
we assume 𝑝 (𝒆 ) ≠ 0. See also Remark 6.8 below.
It is common to normalize hyperbolic polynomials so that 𝑝 (𝒆 ) = 1, but we do not
require this.
Example 6.2 The following are three of the most basic examples of hyperbolic polyno-
mials.
Note that the eigenvalues 𝜆𝑖 (𝒙 ) are continuous in 𝒙 , since the roots of a polynomial
are continuous functions of its coefficients (which follows from the argument principle
in complex analysis), and the coefficients of 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒆 ) are themselves continuous
(in fact, polynomial) functions of 𝒙 . Note also that
In particular, 𝝀(𝒆 ) = 1.
As usual in this class, we begin by characterizing the geometry of the set of
hyperbolic polynomials.
Proposition 6.3 (Basic geometry). If 𝑝 ∈ Hyp_𝑑(𝒆) then 𝛼𝑝 ∈ Hyp_𝑑(𝒆) for any nonzero
𝛼 ∈ ℝ. In particular, Hyp_𝑑(𝒆) is a cone, but it is not convex. If 𝑝 ∈ Hyp_𝑑(𝒆) and
𝑞 ∈ Hyp_𝑚(𝒆), then 𝑝𝑞 ∈ Hyp_{𝑑+𝑚}(𝒆).

Aside: In this note, a cone is a set closed under multiplication by strictly positive numbers.

Proof. The first and last claims are easy to verify directly from Definition 6.1.
The cone Hyp_𝑑(𝒆) is not convex since if 𝑝 ∈ Hyp_𝑑(𝒆) then −𝑝 ∈ Hyp_𝑑(𝒆) but
(𝑝 − 𝑝)/2 = 0 ∉ Hyp_𝑑(𝒆) because the zero polynomial vanishes at 𝒆.

Aside: We do not get convexity even if we restrict to polynomials satisfying 𝑝(𝒆) = 1, as can be seen from Proposition 6.13(d) below.

Next, we introduce the generalization of the psd cone.
Definition 6.4 The hyperbolicity cone for 𝑝 with respect to 𝒆 is the cone Λ++ = {𝒙 ∈
E : 𝜆 min (𝒙 ) > 0}. Its closure is denoted by Λ+ .
Note that Λ++ is indeed a cone, since 𝜆 min is positively homogeneous by (6.1). Note
also that
Λ+ = {𝒙 ∈ E : 𝜆 min (𝒙 ) ≥ 0}.
Indeed, the inclusion ⊆ follows by continuity of 𝜆 min , while the reverse inclusion follows
since if 𝜆 min (𝒙 ) = 0 then for 𝑡 > 0 we have 𝒙 + 𝑡 𝒆 ∈ Λ++ by (6.1) and 𝒙 + 𝑡 𝒆 → 𝒙 as
𝑡 ↓ 0. Equation (6.1) also shows that 𝑝/𝑝 (𝒆 ) > 0 on Λ++ and 𝑝/𝑝 (𝒆 ) ≥ 0 on Λ+ .
We have Λ_{++} = ℝ_{++}^𝑑, Λ_+ = ℝ_+^𝑑 in Example 6.2(a), and Λ_{++} = ℍ_{++}^𝑑, Λ_+ = ℍ_+^𝑑 in
Example 6.2(b).
We proceed to show that hyperbolicity cones are convex.
Lemma 6.5 The set Λ_{++} is the connected component containing 𝒆 in {𝒙 ∈ E : 𝑝(𝒙) ≠ 0}.
Moreover, it is star-shaped with center 𝒆.

Aside: A set S is star-shaped with center 𝑠_0 ∈ S if the line segment between 𝑠_0 and any 𝑠 ∈ S is contained in S.

Proof. We have 𝒆 ∈ Λ_{++} since 𝜆_min(𝒆) = 1, and Λ_{++} ⊆ {𝒙 ∈ E : 𝑝(𝒙) ≠ 0} since
𝑝/𝑝(𝒆) > 0 on Λ_{++}. If 𝒙 ∈ Λ_{++} then 𝜆_min(𝜏𝒙 + 𝜏̄𝒆) = 𝜏𝜆_min(𝒙) + 𝜏̄ > 0 for all
𝜏 ∈ [0, 1] and 𝜏̄ = 1 − 𝜏. This shows that the line segment between 𝒙 and 𝒆 is
contained in Λ_{++}, and hence Λ_{++} is connected, and moreover star-shaped with center 𝒆.
Therefore, Λ_{++} is contained in the connected component of 𝒆 in {𝒙 ∈ E : 𝑝(𝒙) ≠ 0}.
Conversely, if 𝒙 is in that connected component, then 𝒙 ∈ Λ_{++} because 𝜆_min varies
continuously on any path between 𝒙 and 𝒆 contained in {𝒙 ∈ E : 𝑝(𝒙) ≠ 0} and is
never zero on it.
The following theorem gives an important property of hyperbolicity cones, from
which their convexity readily follows. The proof below is a slight simplification of
Renegar’s proof in [Ren06, Thm. 3], which in turn is a simplification of Gårding’s
original proof in [Går51, Lemma 2.7].
Proof. Fix arbitrary 𝒚 ∈ E. For the first claim, we need to show that 𝑡 ↦→ 𝑝 (𝒚 − 𝑡 𝒙 )
has only real roots. Since this is a polynomial with real coefficients, and hence its
complex roots come in conjugate pairs, it is equivalent to show that all of its roots have
nonnegative imaginary parts. Suppose for contradiction that this is not the case.
Define 𝑞(𝛼, 𝑠, 𝑡) = 𝑝(i𝛼𝒆 + 𝑠𝒚 − 𝑡𝒙). All the roots of 𝑡 ↦→ 𝑞(1, 0, 𝑡) = 𝑝(𝒆) ∏_𝑗 (i −
𝑡𝜆_𝑗(𝒙)) have positive imaginary parts (namely, 𝜆_𝑗(𝒙)^{−1}), while some root of 𝑡 ↦→
𝑞 ( 0, 1, 𝑡 ) has a negative imaginary part. Using the continuous dependence of the roots
on the coefficients, we conclude that there must exist 𝛽 ∈ ( 0, 1) such that some root
of 𝑡 ↦→ 𝑞 ( 1 − 𝛽, 𝛽, 𝑡 ) is real, say that root is 𝑡 0. Then 𝒛 = 𝛽𝒚 − 𝑡 0𝒙 ∈ E is such that
𝑡 ↦→ 𝑝 (𝒛 − 𝑡 𝒆 ) has a purely imaginary root (namely, 𝑡 = −i ( 1 − 𝛽) ), contradicting the
hyperbolicity of 𝑝 with respect to 𝒆 .
The second claim follows from Lemma 6.5.
Corollary 6.7 The cones Λ++ and Λ+ are convex.
Proof. The cone Λ++ is star-shaped and every 𝒙 ∈ Λ++ is a center by Lemma 6.5, hence
it is convex. The cone Λ+ is the closure of a convex set, hence convex as well.
Remark 6.8 Corollary 6.7 can fail without the requirement that 𝑝 (𝒆 ) ≠ 0 in Defini-
tion 6.1, see [Ren06].
We can now generalize the convexity of the largest eigenvalue.
Theorem 6.9 The function 𝜆 max is convex and 𝜆 min is concave on all of E.
and note that 𝜆 min (𝒙 ) = 𝜆 min (𝒛 ) . Therefore, for any 𝜏 ∈ [ 0, 1] and 𝜏¯ = 1 −𝜏 , we have
𝜏𝒙 + 𝜏𝒛
¯ ∈ {𝒘 ∈ E : 𝜆 min (𝒘 ) ≥ 𝜆 min (𝒙 )}, so
    𝑝_{𝒚_1}^{(1)}(𝒙) = (d/d𝑡)|_{𝑡=0} 𝑝(𝒙 + 𝑡𝒚_1) = ⟨∇𝑝(𝒙), 𝒚_1⟩,

    𝑝_{𝒚_1,...,𝒚_𝑘}^{(𝑘)}(𝒙) = (d/d𝑡)|_{𝑡=0} 𝑝_{𝒚_1,...,𝒚_{𝑘−1}}^{(𝑘−1)}(𝒙 + 𝑡𝒚_𝑘).
Proof. It suffices to prove the case 𝑘 = 1, since then the general case follows by
induction. If 𝒚 1 ∈ Λ++ , then 𝑝 is hyperbolic with respect to 𝒚 1 by Theorem 6.6,
so 𝑡 ↦→ 𝑝(𝒙 − 𝑡𝒚_1) has 𝑑 real roots. By Rolle's theorem, the roots 𝝀^{(1)}(𝒙) of
𝑡 ↦→ 𝑝_{𝒚_1}^{(1)}(𝒙 − 𝑡𝒚_1) are nested between the roots 𝝀(𝒙) of 𝑡 ↦→ 𝑝(𝒙 − 𝑡𝒚_1), meaning
that

    𝜆_1(𝒙) ≥ 𝜆_1^{(1)}(𝒙) ≥ 𝜆_2(𝒙) ≥ · · · ≥ 𝜆_{𝑑−1}(𝒙) ≥ 𝜆_{𝑑−1}^{(1)}(𝒙) ≥ 𝜆_𝑑(𝒙).

In particular, this shows that all 𝑑 − 1 roots of 𝑡 ↦→ 𝑝_{𝒚_1}^{(1)}(𝒙 − 𝑡𝒚_1) are real. By Euler's
homogeneous function theorem, we have 𝑝_{𝒚_1}^{(1)}(𝒚_1) = 𝑑 · 𝑝(𝒚_1) ≠ 0. Thus, 𝑝_{𝒚_1}^{(1)} is
hyperbolic with respect to 𝒚_1. Moreover, for any 𝒚 ∈ Λ_{++} we have 𝜆_𝑑(𝒚) > 0, hence
𝜆_{𝑑−1}^{(1)}(𝒚) > 0, so 𝒚 is in the hyperbolicity cone of 𝑝_{𝒚_1}^{(1)}. Theorem 6.6 then shows that
𝑝_{𝒚_1}^{(1)} is hyperbolic with respect to 𝒚.
An important special case of derivative polynomials is the multilinearization of a
homogeneous polynomial. If 𝑝 is a homogeneous polynomial of degree 𝑑 on E, one
defines polynomials 𝑝𝑖 1 ,...,𝑖𝑑 by
    𝑝(∑_{𝑖=1}^𝑑 𝛼_𝑖 𝒙_𝑖) = ∑_{𝑖_1,...,𝑖_𝑑} 𝑝_{𝑖_1,...,𝑖_𝑑}(𝒙_1, . . . , 𝒙_𝑑) 𝛼_1^{𝑖_1} · · · 𝛼_𝑑^{𝑖_𝑑},

    𝑝̃(𝒙, . . . , 𝒙, 𝒙_𝑘, . . . , 𝒙_𝑑) = (𝑘 − 1)! · 𝑝_{𝒙_𝑘,...,𝒙_𝑑}^{(𝑑−𝑘+1)}(𝒙),
    0 ≥ 𝑝(𝒛) = 𝑝(𝒙) − 2𝑡𝑄(𝒙, 𝒚) + 𝑡²𝑝(𝒚) ≥ 𝑝(𝒙) − 𝑄(𝒙, 𝒚)²/𝑝(𝒚),

where the last inequality is obtained by minimizing over 𝑡. This shows (b) holds.
Example 6.14 (The second-order cone). Let E = ℝ^{𝑛+1} and 𝑝(𝒙) = 𝑥_0² − ∑_{𝑖=1}^𝑛 𝑥_𝑖² = 𝒙^⊤𝑨𝒙
for 𝑨 = diag(1, −1, . . . , −1). Then 𝑝 is hyperbolic with respect to 𝒆 = (1, 0, . . . , 0)^⊤
by Proposition 6.13(d), and its corresponding hyperbolicity cone is

    Λ_+ = { 𝒙 ∈ ℝ^{𝑛+1} : ∑_{𝑖=1}^𝑛 𝑥_𝑖² ≤ 𝑥_0²,  𝑥_0 ≥ 0 }.
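For this example the eigenvalue map can be computed explicitly: the roots of 𝑡 ↦ 𝑝(𝒙 − 𝑡𝒆) are 𝑥_0 ± ‖(𝑥_1, . . . , 𝑥_𝑛)‖_2. A short numpy check with a hypothetical point 𝒙:

```python
import numpy as np

x = np.array([0.7, 0.3, -0.2, 0.5])          # hypothetical point in R^{n+1}, n = 3

# p(x - t e) = (x_0 - t)^2 - ||x_bar||^2 is a quadratic in t; find its roots.
coeffs = [1.0, -2.0 * x[0], x[0] ** 2 - np.sum(x[1:] ** 2)]
roots = np.sort(np.roots(coeffs))

x_bar = np.linalg.norm(x[1:])
print(roots, np.array([x[0] - x_bar, x[0] + x_bar]))   # the two outputs agree
```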
    𝑝̃(𝒙, 𝒚, 𝒙_3, . . . , 𝒙_𝑑)² ≥ 𝑝̃(𝒙, 𝒙, 𝒙_3, . . . , 𝒙_𝑑) · 𝑝̃(𝒚, 𝒚, 𝒙_3, . . . , 𝒙_𝑑).
The function D is called the mixed discriminant, and inequality (6.4) is called
Alexandrov’s mixed discriminant inequality. One can also derive the equality cases
for (6.4) from hyperbolicity, see [Sch14, Thm. 5.5.4].
The inequality (6.4) can be used to derive the Alexandrov–Fenchel inequality for
mixed volumes, a fundamental result in convex geometry. See [SH19], who also give a
more elementary proof of (6.4).
Such cones Λ+ are said to be semidefinite-representable, and they are significant be-
cause optimizing a linear function over Λ+ subject to linear equality constraints is
a semidefinite program (SDP), which can be solved efficiently using interior-point
methods. Those same methods can be used to efficiently solve linear optimization
over any hyperbolicity cone [Gül97] (because log 1/𝑝 is a self-concordant barrier
function for Λ+ ), leading to so-called hyperbolic programs. A natural question is then
whether the class of hyperbolic programs is more expressive than SDPs, that is, can
every hyperbolic program be written as an SDP? This is the content of the general Lax
conjecture [Ren06], which conjectures that for any hyperbolicity cone Λ_+ there exist
𝑛 ∈ ℕ, a subspace 𝑆 ⊂ ℍ_𝑛, and a linear isomorphism E ≅ 𝑆 sending Λ_+ to 𝑆 ∩ ℍ_𝑛^+.
The special case of hyperbolicity cones in ℝ3 (which was the original Lax conjecture)
has been proved in [LPR05] using, according to [Ser09], a “deep result about Riemann
surfaces by Helton–Vinnikov [HV07]”. They show that for any hyperbolic polynomial
𝑝 on ℝ3 with respect to ( 1, 0, 0) ᵀ , there exist 𝑨, 𝑩 ∈ ℍ𝑑 (where 𝑑 = deg 𝑝 ) satisfying
(b) The sum of all the eigenvalues, 1^⊤𝝀, is linear.

Proof. For (a), instantiate Lemma 6.18 on Example 6.2(a). For (b), note that 1^⊤𝝀 =
𝐸_1 ∘ 𝝀, which is a polynomial of degree 1 by Lemma 6.18.
The following theorem, taken from [Bau+01, Thm. 3.1], shows that the composition
of a symmetric hyperbolic polynomial with the eigenvalue map is hyperbolic.
Theorem 6.20 (Hyperbolicity of composition). Let 𝑝 be a hyperbolic polynomial on
E with respect to 𝒆 of degree 𝑑 , with eigenvalue map 𝝀 . Let 𝑞 be a symmetric
hyperbolic polynomial on ℝ𝑑 with respect to 1 of degree 𝑘 , with eigenvalue map
𝝁 . Then 𝑞 ◦ 𝝀 is a hyperbolic polynomial on E with respect to 𝒆 of degree 𝑘 , with
eigenvalue map 𝝁 ◦ 𝝀 .
    𝝀(𝜏𝒙 + 𝜏̄𝒚) ≺ 𝝀(𝜏𝒙) + 𝝀(𝜏̄𝒚) = 𝜏𝝀(𝒙) + 𝜏̄𝝀(𝒚).

    (𝑓 ∘ 𝝀)(𝜏𝒙 + 𝜏̄𝒚) ≤ 𝑓(𝜏𝝀(𝒙) + 𝜏̄𝝀(𝒚)) ≤ 𝜏(𝑓 ∘ 𝝀)(𝒙) + 𝜏̄(𝑓 ∘ 𝝀)(𝒚),
    ⟨𝒙, 𝒚⟩ := ¼‖𝒙 + 𝒚‖² − ¼‖𝒙 − 𝒚‖².
Proposition 6.26 ([Bau+01, Thm. 4.2]). The function ‖·‖ is a seminorm on E, and ⟨·, ·⟩ is
a positive-semidefinite bilinear form on E.

Aside: A seminorm is sublinear but may be zero on non-zero vectors.
(a) 𝑝 is complete.
(b) Λ_+ is pointed.
(c) ‖𝒙‖ = ‖𝝀(𝒙)‖_2 is a norm.

Aside: A closed cone K is pointed if K ∩ (−K) = {0}, or equivalently, if K contains no lines.

Proof. If 𝝀(𝒙) = 0 then 𝝀(−𝒙) = 0 as well by (6.1), so 𝒙 ∈ Λ_+ ∩ (−Λ_+). Conversely,
if 𝒙 ∈ Λ_+ ∩ (−Λ_+) then 𝜆_min(𝒙) ≥ 0 and 𝜆_min(−𝒙) = −𝜆_max(𝒙) ≥ 0, hence 0 ≤
𝜆_min(𝒙) ≤ 𝜆_max(𝒙) ≤ 0 and 𝝀(𝒙) = 0. This shows (a) and (b) are equivalent. Finally,
(a) and (c) are equivalent because ‖𝒙‖ is always a seminorm, and ‖𝒙‖ = 0 if and only
if 𝝀(𝒙) = 0.
The polynomials in Example 6.2 and Example 6.14 are complete, since ℝ+𝑛 , ℍ𝑛+ , and
the second-order cone are pointed. The above norm is also the Hessian norm used in
interior-point methods with the barrier log 1/𝑝 [Bau+01, Rmk. 4.3].
The inner-product in Definition 6.25 satisfies a sharpened Cauchy–Schwarz inequal-
ity [Bau+01, Prop. 4.4].
Proposition 6.29 (Refined Cauchy–Schwarz). For any 𝒙 , 𝒚 ∈ E,
Proof. Corollary 6.21 shows that 𝝀(𝒙 + 𝒚) ≺ 𝝀(𝒙) + 𝝀(𝒚). Since the ℓ_2 norm is convex
and symmetric, it is isotone, hence ‖𝝀(𝒙 + 𝒚)‖_2 ≤ ‖𝝀(𝒙) + 𝝀(𝒚)‖_2. Squaring both
sides, this is equivalent to
giving the first claimed inequality. The second is Cauchy–Schwarz for the ℓ_2 norm.
Lecture bibliography
[Bau+01] H. H. Bauschke et al. “Hyperbolic polynomials and convex analysis”. In: Canad. J.
Math. 53.3 (2001), pages 470–488. doi: 10.4153/CJM-2001-020-6.
[Går51] L. Gårding. “Linear hyperbolic partial differential equations with constant co-
efficients”. In: Acta Mathematica 85.none (1951), pages 1 –62. doi: 10 . 1007 /
BF02395740.
[Gül97] O. Güler. “Hyperbolic Polynomials and Interior Point Methods for Convex Program-
ming”. In: Mathematics of Operations Research 22.2 (1997), pages 350–377.
[HV07] J. W. Helton and V. Vinnikov. “Linear matrix inequality representation of sets”.
In: Communications on Pure and Applied Mathematics 60.5 (2007), pages 654–674.
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.20155.
doi: 10.1002/cpa.20155.
[LPR05] A. Lewis, P. Parrilo, and M. Ramana. “The Lax conjecture is true”. In: Proceedings
of the American Mathematical Society 133.9 (2005), pages 2495–2499.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
[Ren06] J. Renegar. “Hyperbolic Programs, and Their Derivative Relaxations”. In: Found.
Comput. Math. 6.1 (2006), pages 59–79. doi: 10.1007/s10208-004-0136-z.
[Sch14] R. Schneider. Convex bodies: the Brunn–Minkowski theory. 151. Cambridge university
press, 2014. doi: 10.1017/CBO9781139003858.
[Ser09] D. Serre. “Weyl and Lidskiı̆ inequalities for general hyperbolic polynomials”. In:
Chinese Annals of Mathematics, Series B 30.6 (2009), pages 785–802.
[SH19] Y. Shenfeld and R. van Handel. “Mixed volumes and the Bochner method”. In: Proc.
Amer. Math. Soc. 147.12 (2019), pages 5385–5402. doi: 10.1090/proc/14651.
7. The Laplace Transform Method for Matrices
Definition 7.1 (Mgf and cgf). Let 𝑋 be a real random variable. The moment generating
function (mgf) of 𝑋 is defined by
Next, we state the Laplace transform method, which illustrates how the cgf of a
real random variable can be used to establish bounds on its tail probabilities.
Theorem 7.2 (Laplace transform method). Let 𝑋 be a real random variable. Then, for
each 𝑡 ∈ ℝ,
Exercise 7.3 Provide a proof for Theorem 7.2. Hint: Markov’s inequality.
The previous result can be readily employed to produce tail bounds for the sum of
independent random variables.
Corollary 7.4 (Laplace transform method for the independent sum). Let {𝑋 𝑖 }𝑘𝑖=1 be an
independent family of real random variables in L∞ . Then
    ℙ{ ∑_{𝑖=1}^𝑘 𝑋_𝑖 ≥ 𝑡 } ≤ inf_{𝜃>0} exp( −𝜃𝑡 + ∑_{𝑖=1}^𝑘 𝜉_{𝑋_𝑖}(𝜃) );

    ℙ{ ∑_{𝑖=1}^𝑘 𝑋_𝑖 ≤ 𝑡 } ≤ inf_{𝜃<0} exp( −𝜃𝑡 + ∑_{𝑖=1}^𝑘 𝜉_{𝑋_𝑖}(𝜃) ).
Proof. Let 𝑌 = ∑_{𝑖=1}^𝑘 𝑋_𝑖. By independence of the random variables 𝑋_𝑖, we have that

    log 𝔼 e^{𝜃 ∑_{𝑖=1}^𝑘 𝑋_𝑖} = log 𝔼 ∏_{𝑖=1}^𝑘 e^{𝜃𝑋_𝑖} = log ∏_{𝑖=1}^𝑘 𝔼 e^{𝜃𝑋_𝑖} = ∑_{𝑖=1}^𝑘 log 𝔼 e^{𝜃𝑋_𝑖}

for all 𝜃 ∈ ℝ. So 𝜉_𝑌 = ∑_{𝑖=1}^𝑘 𝜉_{𝑋_𝑖}, and the result follows by replacing 𝑌 and 𝜉_𝑌 in
Theorem 7.2.
In the next section, we extend these concepts and results to the matrix case.
Note that when 𝑛 = 1, the definitions for the mgf and the cgf of a random matrix
𝑿 ∈ 𝕄𝑛 coincide with their classical definition in the setting of real random variables.
As in the scalar case, the cgf of a random matrix can be employed to provide bounds
on the tail probabilities of its largest eigenvalue. An extension of Theorem 7.2 to the
matrix setting was first provided by Ahlswede and Winter in [AW01]. We will instead
consider the following variant by Oliveira [Oli10]. The proof we present is that by
Tropp [Tro11].
Theorem 7.6 (Laplace transform method: Matrix case). Let 𝑿 be a random self-adjoint
e𝜆max (𝜃 𝑿 ) = 𝜆 max ( e𝜃 𝑿 ) ≤ tr e𝜃 𝑿 .
The inequality is due to the fact that all eigenvalues of the matrix e𝜃 𝑿 are strictly
positive, so its trace dominates its largest eigenvalue. Hence, for 𝜃 > 0,
    ℙ{𝜆_max(𝑿) ≥ 𝑡} ≤ 𝔼 e^{𝜆_max(𝜃𝑿)} / e^{𝜃𝑡} ≤ 𝔼 tr e^{𝜃𝑿} / e^{𝜃𝑡}.
The result is obtained by taking the infimum over all strictly positive 𝜃 .
Exercise 7.7 (Laplace transform method: Lower bound). Provide a proof for the lower bound
of Theorem 7.2.
Paralleling the scalar case, we would like to employ the Laplace transform method
for random matrices, Theorem 7.6, to establish tail bounds for the largest eigenvalue
of the sum of independent random matrices. It is here that we encounter the biggest
challenge. Indeed, as the proof of Corollary 7.4 shows, the passage from the Laplace
transform bound to a tail bound for the sum of random variables depends crucially on
the additivity of the cgf. This property of the scalar cgf does not transfer to the matrix
case. The reason is that, unlike the scalar exponential, the matrix exponential does not
convert sums of matrices into products. That is, the identity

e^{𝑿₁+𝑿₂} = e^{𝑿₁} e^{𝑿₂}

fails in general unless 𝑿₁ and 𝑿₂ commute.
Before proceeding with the proof of Lieb’s Theorem, it is important to recall the
definition of the quantum relative entropy function.
In Lecture 17, we defined the (Umegaki) quantum relative entropy of two density
matrices 𝝔, 𝝂 ∈ 𝚫_n as

D(𝝔; 𝝂) := tr( 𝝔 log 𝝔 − 𝝔 log 𝝂 ),

and we proved that it is both nonnegative and jointly convex. These two properties
also hold for the quantum relative entropy function as defined in (7.9), and they play a
central role in the proof of Theorem 7.8.
In particular, nonnegativity of the relative entropy yields

tr(𝒀) ≥ tr( 𝑿 log 𝒀 − 𝑿 log 𝑿 + 𝑿 ).

Observe that 𝒀 ≻ 0, as matrix functions are defined via the spectral resolution and
e^λ > 0 for all λ ∈ ℝ, so the substitution is valid. Furthermore, joint convexity
of D and linearity of the trace imply joint concavity of the function (𝑿, 𝑨) ↦→
tr(𝑿𝑯) − D(𝑿; 𝑨) − tr(𝑨). The result follows from the fact that the partial maximization
of a jointly concave function is concave (see Exercise 7.10).
Exercise 7.10 (Partial maximization of a jointly concave function). Let 𝑓(·, ·) be a jointly
concave function. Show that the map 𝑦 ↦→ sup_𝑥 𝑓(𝑥, 𝑦) is concave.
Corollary 7.11 (Expectation of the trace exponential). Let 𝑯 be a fixed self-adjoint matrix.
Then, for any random self-adjoint matrix 𝑿,

𝔼 tr exp( 𝑯 + 𝑿 ) ≤ tr exp( 𝑯 + log 𝔼 e^𝑿 ).

Exercise 7.12 Provide a proof for Corollary 7.11. Hint: Employ Theorem 7.8 with 𝑨 = e^𝑿
and invoke Jensen's inequality.
Recall from the end of Section 1 that nonadditivity of the cgf was the major obstacle
to establishing a matrix parallel of Corollary 7.4. Tropp puts this issue to rest by
proving the following subadditivity result for matrix cgfs [Tro11].
Lemma 7.13 (Subadditivity of Matrix cgfs). Consider an independent sequence {𝑿 𝑖 }𝑘𝑖=1 of
random, self-adjoint matrices. Then
𝔼 tr exp( ∑_{i=1}^k θ𝑿_i ) ≤ tr exp( ∑_{i=1}^k log 𝔼 e^{θ𝑿_i} )   for θ ∈ ℝ.
Proof. Without loss of generality, take θ = 1; otherwise replace each 𝑿_i by θ𝑿_i. Write
𝔼_i for the expectation conditional on 𝑿_1, ..., 𝑿_i, and abbreviate Ξ_i := log 𝔼 e^{𝑿_i}.
Invoking Corollary 7.11 with 𝑯 = ∑_{i=1}^{k−1} 𝑿_i yields

𝔼_0 ⋯ 𝔼_{k−1} tr exp( ∑_{i=1}^{k−1} 𝑿_i + 𝑿_k )
  ≤ 𝔼_0 ⋯ 𝔼_{k−2} tr exp( ∑_{i=1}^{k−1} 𝑿_i + log 𝔼_{k−1} e^{𝑿_k} )
  = 𝔼_0 ⋯ 𝔼_{k−2} tr exp( ∑_{i=1}^{k−2} 𝑿_i + 𝑿_{k−1} + Ξ_k ),

where the equality uses independence: log 𝔼_{k−1} e^{𝑿_k} = log 𝔼 e^{𝑿_k} = Ξ_k. A second
application of Corollary 7.11, now with 𝑯 = ∑_{i=1}^{k−2} 𝑿_i + Ξ_k, gives

𝔼_0 ⋯ 𝔼_{k−2} tr exp( ∑_{i=1}^{k−2} 𝑿_i + 𝑿_{k−1} + Ξ_k )
  ≤ 𝔼_0 ⋯ 𝔼_{k−3} tr exp( ∑_{i=1}^{k−2} 𝑿_i + Ξ_{k−1} + Ξ_k )
  = 𝔼_0 ⋯ 𝔼_{k−3} tr exp( ∑_{i=1}^{k−3} 𝑿_i + 𝑿_{k−2} + Ξ_{k−1} + Ξ_k ).

Repeating this procedure recursively with 𝑯 = ∑_{i=1}^{m−1} 𝑿_i + ∑_{j=m+1}^{k} Ξ_j for
m = k, k−1, ..., 1 yields

𝔼 tr exp( ∑_{i=1}^{k} 𝑿_i ) ≤ tr exp( ∑_{i=1}^{k} Ξ_i ),
Theorem 7.14 (Tail bounds for independent sums of random matrices). Consider an
independent sequence {𝑿 𝑖 }𝑘𝑖=1 of random, self-adjoint matrices. Then, for all
𝑡 ∈ ℝ,
ℙ{ λ_max( ∑_{i=1}^k 𝑿_i ) ≥ 𝑡 } ≤ inf_{θ>0} e^{−θ𝑡} tr exp( ∑_{i=1}^k log 𝔼 e^{θ𝑿_i} );

ℙ{ λ_min( ∑_{i=1}^k 𝑿_i ) ≤ 𝑡 } ≤ inf_{θ<0} e^{−θ𝑡} tr exp( ∑_{i=1}^k log 𝔼 e^{θ𝑿_i} ).
Proof. Follows directly from subadditivity of matrix cgfs, Lemma 7.13, and Theorem
7.6.
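To see Theorem 7.14 in action, the following sketch (my own illustration; the Rademacher-times-fixed-matrix model and all sizes are arbitrary choices) evaluates the master bound numerically and compares it with a Monte Carlo estimate of the tail.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(1)
n, k, t = 5, 30, 6.0

# Fixed symmetric directions A_i with ||A_i||_2 = 1; the summands are X_i = eps_i * A_i
# with independent Rademacher signs, so E e^{theta X_i} = cosh(theta A_i) exactly.
A = []
for _ in range(k):
    G = rng.standard_normal((n, n)); G = (G + G.T) / 2
    A.append(G / np.linalg.norm(G, 2))

def master_bound(theta):
    # e^{-theta t} * tr exp( sum_i log E e^{theta X_i} ), as in Theorem 7.14.
    L = sum(logm((expm(theta * Ai) + expm(-theta * Ai)) / 2) for Ai in A)
    return np.exp(-theta * t) * np.trace(expm(L)).real

bound = min(master_bound(th) for th in np.linspace(0.1, 3.0, 30))

# Monte Carlo estimate of the actual tail probability.
trials, hits = 5000, 0
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=k)
    S = sum(e * Ai for e, Ai in zip(eps, A))
    hits += np.linalg.eigvalsh(S).max() >= t
print(f"master bound   : {bound:.3e}")
print(f"empirical tail : {hits / trials:.3e}")
```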
We also consider the following Corollary, as it will be useful in the next section
when we prove an extension of the Chernoff bound to the matrix setting. We refer the
reader to [Tro11, Corollary 3.9] for a proof.
Corollary 7.15 Consider an independent sequence {𝑿 𝑖 }𝑘𝑖=1 of random, self-adjoint
matrices in 𝕄𝑛 . For all 𝑡 ∈ ℝ,
ℙ{ λ_max( ∑_{i=1}^k 𝑿_i ) ≥ 𝑡 } ≤ n · inf_{θ>0} exp( −θ𝑡 + k log λ_max( (1/k) ∑_{i=1}^k 𝔼 e^{θ𝑿_i} ) ).
For an independent family {X_i}_{i=1}^k of random variables taking values in [0, 1], with
μ := ∑_{i=1}^k 𝔼 X_i, the classical Chernoff bounds state that

ℙ{ ∑_{i=1}^k X_i ≥ (1 + δ)μ } ≤ [ e^δ / (1 + δ)^{1+δ} ]^μ   for δ ≥ 0;

ℙ{ ∑_{i=1}^k X_i ≤ (1 − δ)μ } ≤ [ e^{−δ} / (1 − δ)^{1−δ} ]^μ   for δ ∈ [0, 1].
Next, we establish a similar result for controlling the extremal eigenvalues of the
sum of independent random matrices that satisfy some additional properties. For this
purpose, we first need a simple lemma.
Lemma 7.17 (Chernoff mgf). Let 𝑿 be a random positive-definite matrix such that
𝜆 max (𝑿 ) ≤ 1. Then,
𝔼 e^{θ𝑿} ⪯ I + ( e^θ − 1 ) 𝔼 𝑿 .
Exercise 7.18 Provide a proof for Lemma 7.17. Hint: First establish the result for the
scalar case by bounding the exponential on [ 0, 1] by a straight line.
The following theorem is presented as Corollary 5.2 in [Tro11], where the reader
can find a stronger version of the Chernoff bound [Tro11, Theorem 5.1].
Theorem 7.19 (Matrix Chernoff bound). Consider an independent sequence {𝑿_i}_{i=1}^k of
random positive-semidefinite matrices in 𝕄_n that satisfy λ_max(𝑿_i) ≤ R almost surely,
and set μ_max := λ_max( ∑_{i=1}^k 𝔼 𝑿_i ) and μ_min := λ_min( ∑_{i=1}^k 𝔼 𝑿_i ). Then

ℙ{ λ_max( ∑_{i=1}^k 𝑿_i ) ≥ (1 + δ)μ_max } ≤ n [ e^δ / (1 + δ)^{1+δ} ]^{μ_max/R}   for δ ≥ 0;

ℙ{ λ_min( ∑_{i=1}^k 𝑿_i ) ≤ (1 − δ)μ_min } ≤ n [ e^{−δ} / (1 − δ)^{1−δ} ]^{μ_min/R}   for δ ∈ [0, 1].
Proof. For each i = 1, ..., k, let 𝒀_i = (1/R) 𝑿_i. Then λ_max(𝒀_i) ≤ 1 for every i ∈
{1, ..., k}, and for t > 0 we have that

ℙ{ λ_max( ∑_{i=1}^k 𝑿_i ) ≥ t } = ℙ{ λ_max( ∑_{i=1}^k 𝒀_i ) ≥ t/R }.
In words, the adjacency matrix of G is the ( 0, 1) -matrix whose (𝑢, 𝑣 ) -th entry
indicates whether there is an edge from node 𝑢 to node 𝑣 .
The Laplacian matrix of G is defined as 𝑳_G := 𝑫 − 𝑨,
where 𝑫 is the diagonal matrix whose entries are the degrees of the nodes in V.
The Laplacian of a graph is often understood as its matrix representation.
So far we have mentioned that any graph can be approximated by a sparse graph
with high probability. Nonetheless, we have not yet specified what we mean by an
approximation. The notion of “closeness” between two graphs that we will consider is
based on the spectrum of their Laplacian matrices.
We say that a graph H (on the same node set) is a spectral ε-approximation of G if

(1 − ε) 𝑳_G ⪯ 𝑳_H ⪯ (1 + ε) 𝑳_G.
Our main objective is to show, by means of the probabilistic method, that every
undirected, connected graph has a sparse spectral 𝜀 -approximation. This result is due
to Spielman and Srivastava [SS11] and the analysis presented in this section was based
on that by Spielman in his lecture notes [Spi19]. The argument consists of two steps.
First, we describe an algorithm that, given 𝜀 > 0 and a graph G, produces a sparse
graph H whose Laplacian coincides in expectation to that of G. Then, we invoke the
matrix Chernoff bound, Theorem 7.19, to show that H, the graph resulting from the
algorithm, is a spectral 𝜀 -approximation of G with high probability.
For the remainder of this section, we denote by G = ( V, E) an undirected, connected
graph with | V | = 𝑛 nodes. We also denote by 𝑤_(𝑢,𝑣) the weight of the edge (𝑢, 𝑣) ∈ E.
Finally, for (𝑢, 𝑣 ) ∈ E, we denote by 𝑳 (𝑢,𝑣 ) the Laplacian matrix of the graph with 𝑛
nodes and a single edge connecting 𝑢 and 𝑣 . Note that 𝑳 (𝑢,𝑣 ) can alternatively be
written as (𝜹 𝑢 − 𝜹 𝑣 ) (𝜹 𝑢 − 𝜹 𝑣 ) ᵀ , where 𝜹 𝑢 is the standard basis vector associated to
the position of 𝑢 .
1. For each edge (𝑢, 𝑣) ∈ E, compute the probability

p_(𝑢,𝑣) := (1/R) · 𝑤_(𝑢,𝑣) (𝜹_𝑢 − 𝜹_𝑣)ᵀ 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣),

where 𝑳_G^† denotes the generalized inverse of 𝑳_G.
2. Add the edge (𝑢, 𝑣 ) with weight 𝑤 (𝑢,𝑣 ) /𝑝 (𝑢,𝑣 ) to H with probability 𝑝 (𝑢,𝑣 ) .
In particular, a direct computation shows that 𝔼 𝑳_H = 𝑳_G, as desired.
The substance of Algorithm 7.1 lies in the choice of the probabilities 𝑝 (𝑢,𝑣 ) , which
have been cleverly selected so as to produce a sparse approximator H.
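For concreteness, here is a minimal Python sketch of the sampling step of Algorithm 7.1 (an illustration, not code from [SS11] or [Spi19]; the example graph, the value of R, and the clamping of the probabilities at 1 are my own choices).

```python
import numpy as np

rng = np.random.default_rng(2)

def laplacian(n, edges):
    """Weighted graph Laplacian from a list of (u, v, w) triples."""
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    return L

def sparsify(n, edges, R):
    """One run of Algorithm 7.1: keep edge (u,v) with prob. p_uv, reweighted by 1/p_uv."""
    L_pinv = np.linalg.pinv(laplacian(n, edges))   # generalized inverse of L_G
    kept = []
    for u, v, w in edges:
        d = np.zeros(n); d[u], d[v] = 1.0, -1.0
        p = min(1.0, (w / R) * d @ L_pinv @ d)     # p_uv = w_uv * effective resistance / R
        if rng.random() < p:
            kept.append((u, v, w / p))
    return kept

# Small example: a cycle plus a few chords on 8 nodes, unit weights.
n = 8
edges = [(i, (i + 1) % n, 1.0) for i in range(n)] + [(0, 4, 1.0), (1, 5, 1.0), (2, 6, 1.0)]
H = sparsify(n, edges, R=0.5)
print(f"kept {len(H)} of {len(edges)} edges")
```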
Proposition 7.24 (Sparsity of H). Given a graph G = (V, E), the expected number of edges
of the graph H returned by Algorithm 7.1 is (n − 1)/R.
The last identity is due to the fact that G is connected, so 𝐿 G has 0 as an eigenvalue
with multiplicity exactly one. For more information on the relationship between
connectivity of a graph and the eigenvalues of its Laplacian matrix we refer the reader
to Chapter 2 of [MP93].
Exercise 7.25 (Sparsity with high probability). Show that choosing R = Θ(ε² / ln n) in
Algorithm 7.1 produces a graph H whose number of edges is O(ε^{−2} n ln n) with
high probability.
Lemma 7.26 The graph H is a spectral ε-approximation of G if and only if

(1 − ε) 𝑳_G^{†/2} 𝑳_G 𝑳_G^{†/2} ⪯ 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2} ⪯ (1 + ε) 𝑳_G^{†/2} 𝑳_G 𝑳_G^{†/2},

where 𝑳_G^{†/2} denotes the square root of the generalized inverse of 𝑳_G.
Exercise 7.27 Provide a proof for Lemma 7.26.
We are now ready to prove the main theorem of this section.
Proof. Let 𝚷 := 𝑳_G^{†/2} 𝑳_G 𝑳_G^{†/2} and define

𝑿_(𝑢,𝑣) := (𝑤_(𝑢,𝑣) / p_(𝑢,𝑣)) 𝑳_G^{†/2} 𝑳_(𝑢,𝑣) 𝑳_G^{†/2}  if (𝑢, 𝑣) is added to H,  and  𝑿_(𝑢,𝑣) := 0 otherwise.

Then 𝔼 ∑_{(𝑢,𝑣)∈E} 𝑿_(𝑢,𝑣) = 𝑳_G^{†/2} (𝔼 𝑳_H) 𝑳_G^{†/2} = 𝚷.
Therefore, it suffices to show that the extremal eigenvalues of the sum ∑_{(𝑢,𝑣)∈E} 𝑿_(𝑢,𝑣)
stay within a factor of (1 ± ε) of those of 𝚷 with high probability. It is easy to check that

λ_max( 𝑳_G^{†/2} 𝑳_(𝑢,𝑣) 𝑳_G^{†/2} ) = (𝜹_𝑢 − 𝜹_𝑣)ᵀ 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣).
So λ_max(𝑿_(𝑢,𝑣)) ≤ R for all (𝑢, 𝑣) ∈ E. Furthermore, the equality above implies that
μ_max = μ_min = 1. Therefore, choosing R = ε² / (3.5 ln n) and applying the
matrix Chernoff bound yields
ℙ{ ∑_{(𝑢,𝑣)∈E} 𝑿_(𝑢,𝑣) ⪰ (1 + ε)𝚷 } ≤ n [ e^ε / (1 + ε)^{1+ε} ]^{1/R}
  ≤ n ( e^{−ε²/3} )^{3.5 ln n / ε²}
  = n^{−1/6}.

The second inequality uses the relation e^ε (1 + ε)^{−1−ε} ≤ e^{−ε²/3} for ε ∈ [0, 1]. Similarly,
applying the matrix Chernoff lower bound and using the fact that e^{−ε} (1 − ε)^{ε−1} ≤ e^{−ε²/2}
for ε ∈ [0, 1], we obtain

ℙ{ ∑_{(𝑢,𝑣)∈E} 𝑿_(𝑢,𝑣) ⪯ (1 − ε)𝚷 } ≤ n^{−3/4}.
We conclude that 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2} approximates 𝚷, and hence 𝑳_H approximates 𝑳_G, suffi-
ciently well with high probability. Finally, observe that, in light of Proposition 7.24 and
the given choice of R, the expected number of edges of the resulting approximating
graph H is 3.5 (n − 1) ε^{−2} ln n.
To understand the significance of this result, first recall that the spectrum of a graph
contains relevant information about the graph’s connectivity, underlying modularity,
and other invariants. As indicated by their name, spectral approximations of a graph G
have eigenvalues that are similar to those of G, and thus preserve many of G’s structural
properties. Furthermore, the solutions to linear systems of equations associated to the
Laplacian matrices of graphs that approximate each other are similar [Spi19]. This
property allows for more efficient computation of the solutions of Laplacian linear
systems. A well-known testament to the utility of graph sparsification is the family of expander
graphs, which are sparse spectral approximations of complete graphs.
The greatest potential application of matrix sparsification lies in its utility for
reducing the computational burden of otherwise complex tasks. The perspicacious
reader may thus wonder about the computational feasibility of implementing the
algorithm, and particularly of computing the probabilities 𝑝 (𝑢,𝑣 ) for each (𝑢, 𝑣 ) ∈ E.
In [SS11], Spielman and Srivastava provide an efficient way of estimating the effective
resistances of every edge on a graph, hence the probabilities 𝑝 (𝑢,𝑣 ) .
7.5 Conclusion
In this lecture we extended the Laplace transform method for random variables to
the matrix setting, obtaining powerful tail bounds for the extremal eigenvalues of
random matrices and their independent sums. As an example of the results that can be
deduced from the matrix extension of the Laplace transform method, we established a
Chernoff inequality for random matrices and exhibited an application of the latter in
spectral graph theory.
Lecture bibliography
[AW01] R. Ahlswede and A. Winter. Strong Converse for Identification via Quantum Channels.
2001. arXiv: quant-ph/0012127 [quant-ph].
[Lie73b] E. H. Lieb. “Convex trace functions and the Wigner-Yanase-Dyson conjecture”. In:
Advances in Mathematics 11.3 (1973), pages 267–288. doi: https://doi.org/10.
1016/0001-8708(73)90011-X.
[MP93] B. Mohar and S. Poljak. “Eigenvalues in Combinatorial Optimization”. In: Com-
binatorial and Graph-Theoretical Problems in Linear Algebra. Springer New York,
1993, pages 107–151.
[Oli10] R. I. Oliveira. “Sums of random Hermitian matrices and an inequality by Rudelson”.
In: Electronic Communications in Probability 15 (2010), pages 203–212.
[Spi19] D. Spielman. Spectral and Algebraic Graph Theory. 2019. url: http://cs-www.cs.
yale.edu/homes/spielman/sagt/sagt.pdf.
[SS11] D. A. Spielman and N. Srivastava. “Graph Sparsification by Effective Resistances”.
In: SIAM Journal on Computing 40.6 (2011), pages 1913–1926. eprint: https :
//doi.org/10.1137/080734029. doi: 10.1137/080734029.
[Tro11] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12.4 (2011), pages 389–434. doi: 10.1007/
s10208-011-9099-z.
[Tro12] J. A. Tropp. “From joint convexity of quantum relative entropy to a concavity
theorem of Lieb”. In: Proc. Amer. Math. Soc. 140.5 (2012), pages 1757–1760. doi:
10.1090/S0002-9939-2011-11141-9.
8. Operator-Valued Kernels
Recall that a function k : 𝔽^d × 𝔽^d → 𝔽 is a positive-definite (psd) kernel if, for every
finite collection of points 𝒙_1, ..., 𝒙_n ∈ 𝔽^d, the kernel matrix is psd:

𝑲 := [ k(𝒙_i, 𝒙_j) ]_{ij} ⪰ 0.
These kernels naturally model scalar-valued data. Hence, they may be referred to
as scalar-valued kernels. Kernels implicitly define functions from 𝔽 𝑑 into 𝔽 , and it is
often the goal in machine learning, statistics, and approximation theory to exploit this
space of functions for modeling, inference, prediction, and other tasks. We now briefly
introduce the notion of reproducing kernel Hilbert space in the scalar setting.
Definition 8.1 (Reproducing kernel Hilbert space). A Hilbert space ( H, h·, ·i) of functions
mapping X to 𝔽 is a reproducing kernel Hilbert space if for every 𝒙 ∈ X, the linear
functional Φ𝒙 : H → 𝔽 defined by
𝑓 ↦→ Φ𝒙 𝑓 B 𝑓 (𝒙 ) is continuous.
This definition ensures that functions in an RKHS have a pointwise meaning. Intuitively,
it follows that functions in an RKHS are smoother than those in L². However, this abstract
property does not explain the terminology “reproducing” or “kernel” in the name.
The following alternative definition of a RKHS makes clear the connection to
positive-definite kernels.
Definition 8.2 (RKHS). A Hilbert space ( H, ⟨·, ·⟩) of functions mapping the input
space X to 𝔽 is said to be the reproducing kernel Hilbert space of the positive-definite
kernel k : X × X → 𝔽 if

1. k(·, 𝒙) ∈ H for every 𝒙 ∈ X; and
2. f(𝒙) = ⟨f, k(·, 𝒙)⟩ for every f ∈ H and every 𝒙 ∈ X (the reproducing property).
That is, an operator-valued kernel takes a pair of vectors in the input space into the
space of bounded linear maps between the output space and itself. Not only does an
operator-valued kernel measure similarity between inputs, but it also accounts for
relationships between the outputs. When Y is finite-dimensional, we use the term
matrix-valued kernel.
Notice that the second condition (8.1) in Definition 8.3 generalizes the corresponding
condition in the scalar definition from n-by-n psd kernel matrices to n-by-n psd block
kernel operators 𝑲(X, X) := [𝑲(𝒙_i, 𝒙_j)]_{i,j} ∈ L(Y^n). Here X := {𝒙_1, ..., 𝒙_n} and
Y^n := Y × ⋯ × Y is the n-fold product of Y (which is itself a Hilbert space).
Definition 8.4 (Vector-valued reproducing kernel Hilbert space). A Hilbert space ( H, ⟨·, ·⟩)
of functions mapping X to the output Hilbert space Y is a vector-valued reproducing
kernel Hilbert space if, for every 𝒙 ∈ X and 𝒚 ∈ Y, the linear functional Φ_{𝒙,𝒚} : H → 𝔽
defined by

𝒇 ↦→ Φ_{𝒙,𝒚} 𝒇 := ⟨𝒚, 𝒇(𝒙)⟩_Y is continuous.
While seemingly benign, the implications of Definition 8.4 are quite profound. Indeed,
let H be a vector-valued RKHS. For 𝒙 ∈ X and 𝒚 ∈ Y, the linear functional Φ𝒙 ,𝒚 is
bounded because it is continuous. Hence, the Riesz representation theorem ensures
the existence of a unique element 𝑲_{𝒙,𝒚} ∈ H, parametrized by 𝒙 and 𝒚, such that

Φ_{𝒙,𝒚} 𝒇 = ⟨𝒚, 𝒇(𝒙)⟩_Y = ⟨𝑲_{𝒙,𝒚}, 𝒇⟩_H  for all 𝒇 ∈ H.  (8.2)

The left-hand side of (8.2) is linear in 𝒚. It follows that 𝑲_{𝒙,𝒚} on the right-hand side is
also linear in 𝒚. We can therefore define a linear operator 𝑲_𝒙 : Y → H via the rule
𝑲_𝒙 𝒚 := 𝑲_{𝒙,𝒚}.
Furthermore, we can define an operator-valued kernel 𝑲(𝒙, 𝒙′) : Y → Y by

𝑲(𝒙, 𝒙′) 𝒚 := (𝑲_{𝒙′} 𝒚)(𝒙)  for all 𝒚 ∈ Y.  (8.3)
We sometimes write 𝑲_𝒙 = 𝑲(·, 𝒙) to emphasize that the first argument of the kernel
is left open. Provided that 𝑲_𝒙 : Y → H is bounded (which we prove in Proposition 8.5),
it follows from (8.2) and (8.3) that

𝒇(𝒙) = 𝑲_𝒙^∗ 𝒇  and  𝑲(𝒙, 𝒙′) = 𝑲_𝒙^∗ 𝑲_{𝒙′}  for each 𝒙, 𝒙′ ∈ X,

because Y is a Hilbert space. This is a form of the vector-valued reproducing property
(cf. Definition 8.2).

Aside: The second equality parallels the well-known scalar identity k(𝒙, 𝒙′) = ⟨k(·, 𝒙), k(·, 𝒙′)⟩.
Using some basic functional analysis, it is possible to deduce the following facts
about the kernel 𝑲 in (8.3).
Proposition 8.5 (Operator-valued kernel: Properties). The operator-valued kernel 𝑲 on X
from (8.3) satisfies the following properties for all 𝒙 and 𝒙 0 ∈ X.
1. (Kernel) 𝑲(𝒙, 𝒙′) ∈ L(Y), 𝑲(𝒙, 𝒙′)^∗ ∈ L(Y), and 𝑲(𝒙, 𝒙) ∈ L₊(Y).
2. (psd) 𝑲 is positive-definite.
3. (Kernel sections) 𝑲_𝒙 ∈ L(Y; H) with ‖𝑲_𝒙‖_{L(Y;H)} = ‖𝑲(𝒙, 𝒙)‖_{L(Y)}^{1/2}.
4. (Cauchy–Schwarz) ‖𝑲(𝒙, 𝒙′)‖_{L(Y)} ≤ ‖𝑲(𝒙, 𝒙)‖_{L(Y)}^{1/2} ‖𝑲(𝒙′, 𝒙′)‖_{L(Y)}^{1/2}.
Proof. The only hard item is (1), which is left as an exercise; see Problem 8.8. We will
prove (3), noting that the remaining items are similar or trivial given (1). Applying the
vector-valued reproducing property and the Cauchy–Schwarz inequality, we obtain
‖𝑲_𝒙‖² = sup_{‖𝒚‖_Y=1} ‖𝑲_𝒙 𝒚‖²_H = sup_{‖𝒚‖_Y=1} ⟨𝑲_𝒙 𝒚, 𝑲_𝒙 𝒚⟩_H
  = sup_{‖𝒚‖_Y=1} ⟨𝒚, 𝑲(𝒙, 𝒙)𝒚⟩_Y
  ≤ ‖𝑲(𝒙, 𝒙)‖,

where all unlabeled norms are the induced operator norms. For the other direction,

⟨𝑲(𝒙, 𝒙)𝒚, 𝑲(𝒙, 𝒙)𝒚⟩_Y = ⟨𝑲_𝒙 𝑲(𝒙, 𝒙)𝒚, 𝑲_𝒙 𝒚⟩_H
  ≤ ‖𝑲_𝒙‖ ‖𝑲(𝒙, 𝒙)𝒚‖_Y ‖𝑲_𝒙 𝒚‖_H
  ≤ ‖𝑲_𝒙‖² ‖𝑲(𝒙, 𝒙)𝒚‖_Y ‖𝒚‖_Y.

These two bounds imply that ‖𝑲(𝒙, 𝒙)𝒚‖_Y ≤ ‖𝑲_𝒙‖² ‖𝒚‖_Y, as desired.
Problem 8.8 (Kernel bounds). Complete the proof of Proposition 8.5. Hint: For item (1),
use the adjoint and apply the uniform boundedness principle.
Definition 8.10 (Vector-valued RKHS [Kad+16]). A Hilbert space ( H, ⟨·, ·⟩) of functions
mapping the input space X to the output Hilbert space ( Y, ⟨·, ·⟩_Y) is said to be the
vector-valued reproducing kernel Hilbert space of the positive-definite operator-valued
kernel 𝑲 : X × X → L(Y) if

1. 𝑲(·, 𝒙)𝒚 ∈ H for every 𝒙 ∈ X and 𝒚 ∈ Y; and
2. ⟨𝒚, 𝒇(𝒙)⟩_Y = ⟨𝑲(·, 𝒙)𝒚, 𝒇⟩ for every 𝒇 ∈ H, 𝒙 ∈ X, and 𝒚 ∈ Y (the vector-valued
reproducing property).
We conclude this section with a converse result [Sch64]. In the scalar setting, the
analogous result is called the Moore–Aronszajn theorem [Aro50].
8.3 Examples
Operator-valued kernels are much more complex than their scalar counterparts. This
abstraction may make it difficult to intuit canonical ways to map vectors into operators
in an expressive way. To build this intuition, we develop examples of operator-valued
kernels that are related to scalar formulations. Catalogs of explicit matrix-valued and
operator-valued kernels may be found in [ARL+12; BHB16; Car+10; MP05].
The most common examples are the separable scalar-type kernels in the next two
examples. This is not surprising given the far-reaching implications of the vector-valued
Bochner and Schoenberg characterization theorems of Section 8.2.3.
Example 8.14 (Separable scalar operator-valued kernel). Consider the function 𝑲 : X × X →
L(Y) given by

(𝒙, 𝒙′) ↦→ 𝑲(𝒙, 𝒙′) = k(𝒙, 𝒙′) 𝑻  (8.4)

for some scalar-valued kernel k and some self-adjoint 𝑻 ∈ L₊(Y). Then 𝑲 is called
a separable scalar operator-valued kernel. In supervised learning applications, the
map 𝑻 can be learned jointly as a hyperparameter along with the regression function
[ARL+12]. Alternatively, 𝑻 can be chosen as the empirical covariance operator of the
labeled data (output space PCA), for example.

Aside: Notice that the input space is decoupled from the output space, which may be a
strong limitation in practical settings. However, the block kernel matrix for (8.4) has a
simple tensor product structure that allows for its efficient inversion.
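To illustrate the tensor product structure mentioned in the aside, the following sketch (my own, with an RBF kernel and a random psd 𝑻 standing in for the objects in (8.4)) assembles the block kernel matrix as a Kronecker product and checks that its spectrum factorizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 6, 3, 2                          # n inputs in R^d, outputs in R^p

X = rng.standard_normal((n, d))
T = rng.standard_normal((p, p)); T = T @ T.T          # self-adjoint psd output operator

def rbf(X, lengthscale=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * lengthscale ** 2))

K_scalar = rbf(X)                          # n-by-n scalar kernel matrix
K_block = np.kron(K_scalar, T)             # (np)-by-(np) block kernel matrix for (8.4)

# The Kronecker structure gives cheap linear algebra: the eigenvalues of the block
# matrix are exactly the pairwise products of those of K_scalar and T.
eig_block = np.sort(np.linalg.eigvalsh(K_block))
eig_kron = np.sort(np.outer(np.linalg.eigvalsh(K_scalar), np.linalg.eigvalsh(T)).ravel())
print(np.allclose(eig_block, eig_kron))    # True
```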
Here is a concrete instantiation of (8.4) in infinite dimensions. For any invertible
and self-adjoint 𝑪 ∈ L₊(X), define 𝑲 by

(𝒙, 𝒙′) ↦→ exp( −½ ‖𝑪^{−1}(𝒙 − 𝒙′)‖²_X ) 𝑻,  where  (𝑻𝒚)(𝒕) = ∫_D e^{−‖𝒕−𝒔‖_{ℝ^d}} 𝒚(𝒔) d𝒔.
Exercise 8.15 (Separable). Show that separable kernels 𝑲 of the form (8.4) are positive-
definite whenever 𝑘 is positive-definite.
In Lecture 19 we encountered inner-product kernels on 𝔽 𝑑 . These kernels take the
form
k(𝒙, 𝒙′) = φ( ⟨𝒙, 𝒙′⟩_{𝔽^d} )
for a scalar function 𝜑 : 𝔽 → 𝔽 . We saw that these objects played a crucial role in
the theory of entrywise psd preservers. Moreover, 𝜑 sometimes admits a convergent
power series expansion via Vasudeva’s theorem and its consequences.
We now demonstrate a dramatic extension of inner-product kernels to the operator-
valued setting.
Example 8.16 (Kernel inner-product). Let X = ℂ^d for some d ∈ ℕ. Let 𝜶 ∈ ℤ₊^d denote a
multi-index (which is a succinct notation to deal with polynomials in arbitrary input
dimension; for example, see [Eva10]). Let 𝝋 : ℂ^d → L(Y) be an operator-valued
map with power series coefficients 𝑨_𝜶 ∈ L₊(Y), and let 𝒌 = (k₁, ..., k_d) be a d-tuple of
scalar positive-definite kernels on ℂ^d. Then

𝝋 ∘ 𝒌 : ℂ^d × ℂ^d → L(Y),  (𝒙, 𝒙′) ↦→ ∑_{𝜶 ∈ ℤ₊^d} k₁(𝒙, 𝒙′)^{α₁} ⋯ k_d(𝒙, 𝒙′)^{α_d} 𝑨_𝜶  (8.6)

is an operator-valued kernel.
To verify positive-definiteness, expand the quadratic form

∑_{i,j=1}^n ⟨𝒚_i, (𝝋 ∘ 𝒌)(𝒙_i, 𝒙_j) 𝒚_j⟩_Y = ∑_{𝜶 ∈ ℤ₊^d} ∑_{i,j=1}^n ( ∏_{ℓ=1}^d k_ℓ(𝒙_i, 𝒙_j)^{α_ℓ} ) ⟨𝒚_i, 𝑨_𝜶 𝒚_j⟩_Y

for any {𝒙_i}_{i=1,...,n}, {𝒚_j}_{j=1,...,n}, and n ∈ ℕ. Since each k_ℓ is positive-definite, so is
each power k_ℓ^{α_ℓ}, by entrywise psd preservation of the power function (as seen in Lecture
19). By the Schur product theorem,

[ ∏_{ℓ=1}^d k_ℓ(𝒙_i, 𝒙_j)^{α_ℓ} ]_{ij} ⪰ 0.
Example 8.18 (Random features). Nonseparable kernels can be constructed with vector-
valued random features. In their most general form [NS21], they are defined as a pair
(𝝋, 𝜇) , where 𝝋 : X × Θ → Y is jointly square Bochner integrable and 𝜇 is a Borel
probability measure on Θ. It follows that

(𝒙, 𝒙′) ↦→ 𝑲(𝒙, 𝒙′) := 𝔼_{𝜽∼𝜇} [ 𝝋(𝒙, 𝜽) ⊗_Y 𝝋(𝒙′, 𝜽) ]  (8.7)

defines an operator-valued kernel.
Exercise 8.19 (Random feature). Prove that the operator-valued kernel (8.7) defined in
terms of random features is positive-definite.
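The Monte Carlo structure of (8.7) is easy to emulate in finite dimensions. In the sketch below, the feature map 𝝋 and the Gaussian measure μ are hypothetical stand-ins chosen purely for illustration; the empirical average of outer products is psd by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
d, p, m = 4, 2, 10_000                     # input dim, output dim, Monte Carlo samples

W = rng.standard_normal((p, d))            # fixed linear "readout" used by the feature map

def phi(x, theta):
    # Illustrative vector-valued random feature: a random-Fourier-style cosine
    # modulating the fixed output direction W @ x.  (Not canonical; just an example.)
    return np.cos(x @ theta) * (W @ x)

def K_hat(x, xp, thetas):
    # Monte Carlo estimate of K(x, x') = E_theta [ phi(x, theta) outer phi(x', theta) ].
    feats_x = np.array([phi(x, th) for th in thetas])
    feats_xp = np.array([phi(xp, th) for th in thetas])
    return feats_x.T @ feats_xp / len(thetas)

thetas = rng.standard_normal((m, d))
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(K_hat(x, xp, thetas).shape)          # (p, p) matrix-valued kernel evaluation
print(np.linalg.eigvalsh(K_hat(x, x, thetas)).min() >= -1e-12)   # K(x, x) is psd
```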
Every positive-definite operator-valued kernel can also be viewed through the lens of
Gaussian processes (GPs), barring some measurability issues in infinite dimensions.
GPs are fully specified by their finite-dimensional distributions. Here, 𝒇_𝑲 ∼ GP(𝒎, 𝑲)
signifies that 𝒇_𝑲 has mean function 𝒎 = 𝔼 𝒇_𝑲 ∈ H_𝑲 and covariance function

(𝒙, 𝒙′) ↦→ 𝑲(𝒙, 𝒙′) = 𝔼[ 𝒇_𝑲(𝒙) ⊗_Y 𝒇_𝑲(𝒙′) ],

which is just the operator-valued kernel! The covariance kernel is in the random
feature form (8.7) with the feature map given by the GP itself. Therefore, the kernel is
positive-definite.

Aside: Although this GP connection is conceptually useful, it is not advisable to use such
covariance kernels in practice unless they are known in closed form or the GP is fast to
sample from and evaluate.
8.4.1 GP examples
We now present some examples of operator-valued kernels defined from a single scalar
kernel [NS21, Example 2.9] and their associated RKHSs. We explicitly make use of the
GP perspective and will see how output space correlations (or lack thereof) influence
sample paths of the corresponding GP. For simplicity, in all that follows the input space
is set to X := (0, 1).
Example 8.21 (Brownian bridge: Scalar). Take Y B ℝ and define the function 𝑘 : X× X →
ℝ by
k(x, x′) := min{x, x′} − x x′.  (8.8)
The kernel (8.8) is the covariance function of the standard Brownian bridge on (0, 1),
and its RKHS H_k is the Sobolev space

H¹₀(X; ℝ) = { f ∈ L²(X; ℝ) : f′ ∈ L²(X; ℝ), f(0) = f(1) = 0 }.

This is the Hilbert space of real-valued functions on (0, 1) that vanish at the endpoints,
with one weak derivative, equipped with the inner product

⟨f, g⟩_{H¹₀(X;ℝ)} = ⟨f′, g′⟩_{L²(X;ℝ)} = ∫₀¹ f′(t) g′(t) dt.

Aside: Without appealing to the RKHS formalism, one can deduce that functions in
H¹₀(X; ℝ) are pointwise well defined by the Sobolev embedding theorem, since here
1 > d/2 = 1/2.
To see this connection, we perform a direct calculation in the spirit of Definition 8.2.
For any x ∈ X, the function k(·, x) ∈ L²(X; ℝ) because it is bounded, vanishes at the
endpoints, and has a bounded weak derivative t ↦→ k′(t, x) = 𝟙{t < x} − x. Hence
k(·, x) ∈ H_k. Last, we verify the reproducing property:

⟨k(·, x), f⟩_{H¹₀(X;ℝ)} = ∫₀¹ k′(·, x)(t) f′(t) dt = ∫₀ˣ f′(t) dt − x ∫₀¹ f′(t) dt = f(x).
Therefore, H_k = H¹₀(X; ℝ).
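The identification is easy to visualize: the sketch below (assuming a uniform interior grid of (0, 1) and a small jitter for the Cholesky factorization) draws one sample path of a mean-zero GP with covariance (8.8), i.e., a Brownian bridge.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 500
x = np.linspace(0, 1, m + 2)[1:-1]                    # interior grid points of (0, 1)

K = np.minimum.outer(x, x) - np.outer(x, x)           # k(x, x') = min(x, x') - x x'
L = np.linalg.cholesky(K + 1e-12 * np.eye(m))         # jitter for numerical stability
path = L @ rng.standard_normal(m)                     # one Brownian bridge sample path

print(path[:3], path[-3:])                            # values near 0 at both endpoints
```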
In the vector-valued analogue, take Y := ℝ^p and define 𝑲 : X × X → L(ℝ^p) by

𝑲(x, x′) := k(x, x′) I,

where k is as in (8.8). Then 𝑲 is the covariance function of 𝒃 := (b₁, ..., b_p), where
each b_i(·) is an independent Brownian bridge.
The associated RKHS is

H¹₀(X; ℝ^p) = { 𝒇 ∈ L²(X; ℝ^p) : 𝒇′ ∈ L²(X; ℝ^p), 𝒇(0) = 𝒇(1) = 0 },

as the following verification of the reproducing property shows:
⟨𝒚, 𝒇(x)⟩_{ℝ^p} = ∑_{i=1}^p y_i f_i(x) = ∑_{i=1}^p ⟨k(·, x) y_i, f_i⟩_{H¹₀(X;ℝ)}
  = ∑_{i=1}^p ⟨[𝑲(·, x)𝒚]_i, f_i⟩_{H¹₀(X;ℝ)}
  = ⟨𝑲(·, x)𝒚, 𝒇⟩_{H¹₀(X;ℝ^p)}.
where each b_𝒔(·) is an independent Brownian bridge. This is because, for any 𝒚 ∈ Y and
𝒔 ∈ D, the covariance operator satisfies
Calculations similar to those in the previous two examples show that 𝑲 is a positive-
definite operator-valued kernel with RKHS H𝑲 given by
H¹₀(X; Y) = { 𝒇 ∈ L²(X; Y) : 𝒇′ ∈ L²(X; Y), 𝒇(0) = 𝒇(1) = 0 }.  (8.10)

This is the Hilbert space of L²(D; ℝ)-valued functions on (0, 1) that vanish in the
L²-sense at the endpoints, with one weak Fréchet derivative, equipped with the
inner product

⟨𝒇, 𝒈⟩_{H¹₀(X;Y)} = ∫₀¹ ⟨𝒇′(t), 𝒈′(t)⟩_Y dt = ∫_D ∫₀¹ [f′(t)](𝒔) [g′(t)](𝒔) dt d𝒔.

Aside: Lebesgue–Bochner spaces such as L²((0, 1); L²) here arise frequently in the
analysis of time-dependent partial differential equations.
Exercise 8.24 (Lebesgue–Bochner). Complete the calculations in Example 8.23 that prove
𝑲 in (8.9) is positive-definite with RKHS (8.10).
Warning 8.25 (Sample path regularity). Using a separable operator-valued kernel with
isotropic output operator such as I ∈ L( Y) in (8.9) can be detrimental in infinite
dimensions. Not only does this completely uncorrelate the outputs, but also I
does not have finite trace on Y whenever Y is infinite-dimensional. Thus, the
function-valued GP associated to the operator-valued kernel does not take its values
in Y with probability one! In the language of GPs, this is a bad prior distribution.
Indeed, following Example 8.23, for 𝑥 ∈ ( 0, 1) ,
This is not a Gaussian random function in Y but instead Gaussian white noise
on Y; it belongs to a much “rougher” space than Y itself. Instead, taking
𝑲 : (x, x′) ↦→ k(x, x′) 𝑻 with self-adjoint and psd trace-class 𝑻 ∈ S₁(Y) addresses
this issue.

Aside: Here S₁ denotes the Schatten-1 class of operators.
In kernel ridge regression with an operator-valued kernel, the representer coefficients 𝜷
for training inputs X and stacked outputs 𝒚 solve the regularized linear system

(𝑲(X, X) + λI) 𝜷 = 𝒚.  (8.11)
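Equation (8.11) appears here without its surrounding derivation; assuming it is the usual kernel ridge regression system for a vector-valued RKHS, the following sketch (with a separable kernel and arbitrary synthetic data) shows how such a system can be assembled and solved.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, p, lam = 30, 2, 3, 1e-2

X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, p))                       # training outputs, one row per input

def rbf(A, B):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

T = np.eye(p)                                         # separable kernel K = k(x, x') T
K_block = np.kron(rbf(X, X), T)                       # np-by-np block kernel matrix

# Solve (K(X, X) + lam I) beta = y with y the stacked outputs, as in (8.11).
beta = np.linalg.solve(K_block + lam * np.eye(n * p), Y.ravel())

def predict(x_new):
    k_block = np.kron(rbf(x_new[None, :], X), T)      # p-by-np cross-kernel block row
    return k_block @ beta

print(predict(X[0]), Y[0])                            # fitted values track the data
```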
Notes
The presentation of operator-valued kernels in this lecture is new. The theory of RKHS
is largely drawn from [MP05] while many of the exercises scattered throughout are
original. The examples of separable kernels are from [Kad+16] and [MP05]. The
random feature and GP Brownian bridge examples in the operator-valued setting are
new, as is the characterization of their RKHSs.
A complete theory of operator-valued kernels in a much more general setting than
that presented here was developed in the 1960s [Sch64]. Since then, many refinements
Project 8: Operator-Valued Kernels 297
have been made, especially with learning theoretic considerations in mind [CDV07;
Cap+08; Car+10; MP05]. There is a rich but technical literature on universal kernels
in both the scalar- and operator-valued setting [Cap+08; Car+10; MXZ06]. These are
kernels whose associated RKHS is dense in the space of continuous functions equipped
with the topology of uniform convergence over compact subsets. This universal
approximation property is of direct relevance to well-posed learning algorithms. Under
fairly weak assumptions the translation-invariant and radial kernels characterized by
Bochner’s and Schoenberg’s theorems are universal. Random feature developments
along similar lines in the operator-valued setting may be found in [BHB16; Min16].
Lecture bibliography
[ARL+12] M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al. “Kernels for vector-valued
functions: A review”. In: Foundations and Trends® in Machine Learning 4.3 (2012),
pages 195–266.
[Aro50] N. Aronszajn. “Theory of reproducing kernels”. In: Transactions of the American
mathematical society 68.3 (1950), pages 337–404.
[BHB16] R. Brault, M. Heinonen, and F. Buc. “Random fourier features for operator-valued
kernels”. In: Asian Conference on Machine Learning. PMLR. 2016, pages 110–125.
[CDV07] A. Caponnetto and E. De Vito. “Optimal rates for the regularized least-squares
algorithm”. In: Foundations of Computational Mathematics 7.3 (2007), pages 331–
368.
[Cap+08] A. Caponnetto et al. “Universal multi-task kernels”. In: The Journal of Machine
Learning Research 9 (2008), pages 1615–1646.
[Car+10] C. Carmeli et al. “Vector valued reproducing kernel Hilbert spaces and universality”.
In: Analysis and Applications 8.01 (2010), pages 19–61.
[Car97] R. Caruana. “Multitask learning”. In: Machine learning 28.1 (1997), pages 41–75.
[Eva10] L. C. Evans. Partial differential equations. American Mathematical Soc., 2010.
[Evg+05] T. Evgeniou et al. “Learning multiple tasks with kernel methods.” In: Journal of
machine learning research 6.4 (2005).
[Kad+16] H. Kadri et al. “Operator-valued kernels for learning from functional response
data”. In: Journal of Machine Learning Research 17.20 (2016), pages 1–54.
[MP05] C. A. Micchelli and M. Pontil. “On learning vector-valued functions”. In: Neural
computation 17.1 (2005), pages 177–204.
[MXZ06] C. A. Micchelli, Y. Xu, and H. Zhang. “Universal Kernels.” In: Journal of Machine
Learning Research 7.12 (2006).
[Min16] H. Q. Minh. “Operator-valued Bochner theorem, Fourier feature maps for operator-
valued kernels, and vector-valued learning”. In: arXiv preprint arXiv:1608.05639
(2016).
[NS21] N. H. Nelsen and A. M. Stuart. “The random feature model for input-output maps
between banach spaces”. In: SIAM Journal on Scientific Computing 43.5 (2021),
A3212–A3243.
[RR08] A. Rahimi and B. Recht. “Random Features for Large-Scale Kernel Machines”.
In: Advances in Neural Information Processing Systems 20. Curran Associates, Inc.,
2008, pages 1177–1184. url: http://papers.nips.cc/paper/3182-random-
features-for-large-scale-kernel-machines.pdf.
[Sch64] L. Schwartz. “Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux
associés (noyaux reproduisants)”. In: Journal d’analyse mathématique 13.1 (1964),
pages 115–256.
9. Spectral Radius Perturbation and Stability
𝒙 𝑡 +1 = 𝑨𝒙 𝑡 + 𝑩𝒖 𝑡 , (9.1)
The next definition connects stability of (9.2) with the spectral radius of the system
matrices.

Definition 9.2 (Stability). We say the closed-loop dynamics is stable under feedback
gain 𝑲 if ρ(𝑨 − 𝑩𝑲) < 1.

Aside: Another common stability definition states that (9.2) is stable if, for all 𝒙₀ ∈ ℝⁿ,
𝒙_t → 0 as t → ∞. Using the Jordan decomposition of 𝑨, it can be readily checked that
the two definitions of stability are equivalent.

A natural question in control design is the following.
For a feedback gain 𝑲 designed for 𝑨 and 𝑩 , how much can we perturb 𝑨 with
𝚫 ∈ 𝕄𝑛 such that 𝜌 (𝑨 + 𝚫 − 𝑩𝑲 ) < 1?
When 𝑨, 𝑩, 𝑸, 𝑹 are all diagonal, it can be shown that 𝑲★ is diagonal as well.
These diagonal structures arise when one considers dynamical systems that are made
up of multiple independent scalar subsystems. The generalized setting considers
block-diagonal matrices instead of diagonal matrices. Examples of such systems are
networks of swarm robots.
Now suppose that 𝑨 is not perfectly diagonal but instead has nonzero off-diagonal
entries. How large can these off-diagonal entries be before closed-loop stability under
𝑲★ is lost? This question corresponds to the case where we design an optimal controller
for each subsystem in the network independently, while in reality the subsystems are
loosely dynamically coupled. This type of analysis quantifies how dynamically coupled
the subsystems can be before 𝑲★ (designed assuming no dynamical coupling) no longer
stabilizes the closed loop.
D_i(𝑨) := { z ∈ ℂ : |z − a_ii| ≤ ∑_{j≠i} |a_ij| }   for i = 1, ..., n.

In other words, the Geršgorin discs are balls in the complex plane with a_ii as the
center and ∑_{j≠i} |a_ij| as the radius. It turns out the Geršgorin discs can be used to
locate the eigenvalues of a matrix.

Aside: Both Definition 9.3 and Theorem 9.4 are valid for complex matrices as well.
Theorem 9.4 (Geršgorin [HJ13]). The eigenvalues of 𝑨 ∈ 𝕄_n are in the union of its
Geršgorin discs ∪_{i=1}^n D_i(𝑨). Furthermore, if the union of k discs is disjoint from
the remaining n − k discs, then the union contains exactly k eigenvalues of 𝑨,
counted according to their algebraic multiplicities.
where N(t) is an integer-valued function of t that counts the zeros of p_t(z) inside Γ.
Since the coefficients of p_t depend polynomially on t, N(t) is a continuous function of t
on the bounded domain [0, 1]; being integer-valued, it must therefore be constant on
[0, 1]. This means that the number of zeros of p_t(z) (eigenvalues of 𝑨_t) inside Γ remains
the same regardless of our choice of t. Since N(0) = k, we know N(1) = k. Therefore,
there are k eigenvalues of 𝑨 inside Γ. By the first part of the theorem, any eigenvalue of
𝑨 has to lie in a Geršgorin disc. Therefore, we conclude that there are exactly k
eigenvalues of 𝑨 in G_k(𝑨).
Theorem 9.4 provides a nice quantitative bound on the eigenvalues. One may
wonder if there are refinements or alternatives that can help improve our estimates of
the locations of the eigenvalues. The answer is affirmative. Below we list a few such
results.
Corollary 9.5 (Transpose discs). The eigenvalues of a matrix 𝑨 also lie in the union of
the Geršgorin discs of 𝑨ᵀ.
Proof. Observe that 𝑨 and 𝑨 ᵀ have the same eigenvalues. Apply Theorem 9.4 to 𝑨 ᵀ to
obtain the result.
Corollary 9.6 (Scaling discs). Let 𝑨 = [𝑎 𝑖 𝑗 ] ∈ 𝕄𝑛 and let 𝑫 ∈ 𝕄𝑛 be a diagonal matrix
with positive real numbers 𝑝 1 , 𝑝 2 , . . . , 𝑝𝑛 on the main diagonal. The eigenvalues 𝜆𝑖
of 𝑨 are in the union of the Geršgorin discs of 𝑫^{−1}𝑨𝑫, that is,

λ_i(𝑨) ∈ ∪_{i=1}^n { z ∈ ℂ : |z − a_ii| ≤ (1/p_i) ∑_{j≠i} p_j |a_ij| },

for all i = 1, ..., n.
Together, Theorem 9.4 and Corollaries 9.5 and 9.6 provide three different approxima-
tions of the eigenvalues of 𝑨, with Corollary 9.6 being especially useful in some cases.
Example 9.7 Consider the matrix

𝑨 = [ 10    5    8
       0  −11    1
       0   −1   13 ].

The eigenvalues of 𝑨 are approximately −10.96, 10, and 12.96. The Geršgorin discs of 𝑨
are plotted in Figure 9.1.
9.1. In particular, the union of the blue and the yellow disc is disjoint from the red disc.
Corroborating Theorem 9.4, we see that there are exactly 2 eigenvalues of 𝑨 inside the
union of the blue and the yellow disc.
Now consider a scaling matrix 𝑫 = diag(13, 1, 1). The Geršgorin discs for the
matrix 𝑫^{−1}𝑨𝑫 are shown in Figure 9.2. This particular choice of scaling significantly
improves the estimate for the eigenvalue 10. For this particular example, we can actually
compute the optimal scaling matrix. To see this, note that the radii of the scaled
discs are (5p₂ + 8p₃)/p₁, p₃/p₂, and p₂/p₃. Therefore, the best we can do without
making any estimate worse than for the original discs is to set p₂ = p₃, and then we can
shrink the radius of the first disc, 13p₂/p₁, arbitrarily by making p₁ large, improving our
estimate of the eigenvalue 10. We can achieve this improvement while maintaining the
same estimates for the other two eigenvalues.
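The computations in Example 9.7 are easy to reproduce; the sketch below uses the same matrix and scaling and prints the eigenvalues along with the disc centers and radii before and after scaling.

```python
import numpy as np

A = np.array([[10.0, 5.0, 8.0],
              [0.0, -11.0, 1.0],
              [0.0, -1.0, 13.0]])

def gershgorin(M):
    """Return the centers and radii of the Gershgorin discs of M."""
    centers = np.diag(M)
    radii = np.abs(M).sum(axis=1) - np.abs(centers)
    return centers, radii

print("eigenvalues:", np.sort(np.linalg.eigvals(A).real))
print("discs of A :", gershgorin(A))

D = np.diag([13.0, 1.0, 1.0])                 # the scaling used in Example 9.7
print("discs of D^{-1} A D:", gershgorin(np.linalg.inv(D) @ A @ D))
```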
Figure 9.1 The eigenvalues and Geršgorin discs of a matrix plotted on the complex plane.
The crosses are the eigenvalues of the matrix, while the circles denote the Geršgorin
discs of the matrix.
Figure 9.2 The scaled Geršgorin discs of the same matrix plotted on the complex plane.
The crosses are the eigenvalues of the matrix, while the circles are the Geršgorin discs.
Compared to the original discs in Figure 9.1, the scaling significantly improves the
estimate given by the blue disc while maintaining the same radii for the other discs.
for two independently evolving scalar states x¹_t := 𝒙_t(1) and x²_t := 𝒙_t(2), with their
own control inputs u¹_t := 𝒖_t(1) and u²_t := 𝒖_t(2), respectively. Now suppose that we
have used LQ optimal control theory to design the control gain

𝑲 = [ 0.2  0 ;  0  0.2 ]

via (9.4) and (9.5) for the quadratic cost 𝑸 = I and 𝑹 = 20 · I. Let

𝚫 = [ 0  α ;  β  0 ]

be the unaccounted-for dynamic coupling between x¹ and x², so that the true system
matrices are 𝑨 + 𝚫, 𝑩 instead of 𝑨, 𝑩. A simple calculation reveals that

𝑨 − 𝑩𝑲 = [ 0.8  0 ;  0  0.8 ].

According to the sufficient condition (9.6), as long as α and β individually do not exceed
0.2, the gain 𝑲 designed for the diagonal matrices 𝑨 and 𝑩 continues to stabilize 𝑨 + 𝚫, 𝑩. We
can numerically check how tight this sufficient condition is in this example for varying
values of α and β. In Figure 9.3, each horizontal curve on the surface corresponds to
a fixed β value as we sweep over different α values. Any point on the surface with a
y-axis value greater than 1 means that the corresponding closed-loop system is
unstable. The sufficient condition (9.6) is loose, because for the top curve (the top
boundary of the surface), which corresponds to β = 0.3, there are clearly α values small
enough that the spectral radius of the resulting closed loop remains below 1. On the
other hand, the sufficient condition (9.6) is tight in the following sense. Although not
explicitly shown in the plot, when α = 0.2 and β = 0.2, i.e., when (9.6) holds with equality,
the spectral radius of the closed loop is 1, which is the crossover point between
stability and instability. Combined with the plot, we see that if we fix one of the two
variables to be 0.2, then the closed-loop system becomes unstable as soon as the other
variable exceeds 0.2.
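A few lines of Python reproduce the experiment behind Figure 9.3; this sketch is my own and takes the closed-loop matrix 𝑨 − 𝑩𝑲 = 0.8 · I from the example above, sweeping over α for a few fixed β values.

```python
import numpy as np

F = 0.8 * np.eye(2)                                   # nominal closed loop A - B K

def rho(alpha, beta):
    """Spectral radius of the perturbed closed loop A - B K + Delta."""
    Delta = np.array([[0.0, alpha], [beta, 0.0]])
    return np.abs(np.linalg.eigvals(F + Delta)).max()

for beta in [0.0, 0.1, 0.2, 0.3]:
    alphas = np.linspace(0.0, 0.5, 6)
    radii = ["%.2f" % rho(a, beta) for a in alphas]
    print(f"beta={beta:.1f}:", radii)                 # values above 1.00 are unstable
```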
Figure 9.3 The spectral radius of the true closed loop 𝜌 (𝑨 − 𝑩𝑲 + 𝚫) for fixed 𝛽 values
varying from 0 to 0.3.
Proof. Let 𝑫 ∈ 𝕄_n be an invertible matrix. Then ρ(𝑨) = ρ(𝑫𝑨𝑫^{−1}) ≤ ‖𝑫𝑨𝑫^{−1}‖₂,
where the first equality comes from the fact that similarity transformations do not
affect eigenvalues. This bound holds for any invertible 𝑫 ∈ 𝕄_n, and therefore
ρ(𝑨) ≤ inf_{𝑫 ∈ 𝕄_n invertible} ‖𝑫𝑨𝑫^{−1}‖₂.
Now suppose 𝑨 has Jordan decomposition 𝑨 = 𝑻^{−1}𝑱𝑻, where 𝑱 is in Jordan
normal form with the eigenvalues of 𝑨 on the diagonal and 1's on the superdiagonal. Let
𝑷 = diag(1, k, k², ..., k^{n−1}) where k > 0. Observe that 𝑷𝑱𝑷^{−1} is a matrix
with the eigenvalues of 𝑨 on the diagonal and 1/k on the superdiagonal. Therefore, as
k → ∞, 𝑷𝑱𝑷^{−1} → diag(𝝀(𝑨)), where 𝝀(𝑨) is the vector of eigenvalues of 𝑨 repeated
according to algebraic multiplicity. Therefore, ‖𝑷𝑱𝑷^{−1}‖₂ → ρ(𝑨). Letting 𝑫 = 𝑷𝑻, we have
‖𝑫𝑨𝑫^{−1}‖₂ = ‖𝑷𝑱𝑷^{−1}‖₂ → ρ(𝑨), and therefore ρ(𝑨) ≥ inf_{𝑫 ∈ 𝕄_n invertible} ‖𝑫𝑨𝑫^{−1}‖₂.

Aside: Here we use ‖·‖₂ to denote the spectral norm.
The variational characterization of the spectral radius reminds us of Corollary 9.6,
where we bounded the eigenvalues of a matrix 𝑨 with the entries of 𝑫 𝑨𝑫 −1 for 𝑫 a
diagonal matrix with positive real diagonal entries. Indeed, we can bound the spectral
radius of 𝑨 also using the entries of 𝑫 𝑨𝑫 −1 .
Corollary 9.11 (Variational bound on spectral radius via Geršgorin). Let 𝑨 = [𝑎 𝑖 𝑗 ] ∈ 𝕄𝑛 .
Then
ρ(𝑨) ≤ min_{p₁,...,pₙ>0} max_{1≤i≤n} (1/p_i) ∑_{j=1}^n p_j |a_ij|

and

ρ(𝑨) ≤ min_{p₁,...,pₙ>0} max_{1≤j≤n} (1/p_j) ∑_{i=1}^n p_i |a_ij|.  (9.7)
Proof. By Corollary 9.6, all eigenvalues of 𝑨 belong to the union of the Geršgorin
discs ∪_{i=1}^n { z ∈ ℂ : |z − a_ii| ≤ (1/p_i) ∑_{j≠i} p_j |a_ij| }. For each Geršgorin disc of
the scaled matrix 𝑫𝑨𝑫^{−1}, where 𝑫 = diag(p₁, ..., pₙ) with p₁, ..., pₙ > 0, the furthest
point from the origin in the complex plane has modulus at most (1/p_i) ∑_{j=1}^n p_j |a_ij|.
Therefore every eigenvalue of 𝑨, including one of largest modulus, is bounded in
modulus by the maximum of this quantity over all Geršgorin discs of 𝑫𝑨𝑫^{−1}.
Applying the same reasoning to 𝑨ᵀ using Corollary 9.5, we obtain (9.7).
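As a quick numerical illustration of Corollary 9.11 (my own sketch, on a random test matrix), one can compare the unscaled Geršgorin bound with the bound obtained by scaling with the Perron vector of |𝑨|; by the Collatz–Wielandt formula, that scaling achieves ρ(|𝑨|), the best value attainable within this family.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

def gersh_bound(A, p):
    # Right-hand side of the first bound in Corollary 9.11 for a fixed positive vector p.
    return (np.abs(A) @ p / p).max()

rho = np.abs(np.linalg.eigvals(A)).max()
print("spectral radius        :", rho)
print("unscaled (p = 1) bound :", gersh_bound(A, np.ones(5)))

# Choosing p as the Perron vector of |A| makes the bound equal to rho(|A|).
vals, vecs = np.linalg.eig(np.abs(A))
p = np.abs(vecs[:, np.argmax(np.abs(vals))].real)
print("Perron-scaled bound    :", gersh_bound(A, p))
print("rho(|A|)               :", np.abs(vals).max())
```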
9.4 Notes
The majority of the results presented here are from Chapter 6 of [HJ13]. Some corollaries
are exercises from [HJ13]. The examples are inspired by, and adapted from, [Mar+16].
The variational characterization of the spectral radius is presented in [Hua+21] as a
fact and proved here by JY.
Lecture bibliography
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[Hua+21] D. Huang et al. “Matrix Concentration for Products”. In: Foundations of Computa-
tional Mathematics (2021), pages 1–33.
[Mar+16] D. Marquis et al. “Gershgorin's Circle Theorem for Estimating the Eigenvalues of a
Matrix with Known Error Bounds”. 2016.
IV. back matter
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 307
Bibliography
[ABY20] R. Aoun, M. Banna, and P. Youssef. “Matrix Poincaré inequalities and concentration”.
In: Advances in Mathematics 371 (2020), page 107251. doi: https://doi.org/10.
1016/j.aim.2020.107251.
[AB08] T. Arbogast and J. L. Bona. Methods of applied mathematics. ICES Report. UT-Austin,
2008.
[Aro50] N. Aronszajn. “Theory of reproducing kernels”. In: Transactions of the American
mathematical society 68.3 (1950), pages 337–404.
[Art11] M. Artin. Algebra. Pearson Prentice Hall, 2011.
[AV95] J. S. Aujla and H. Vasudeva. “Convex and monotone operator functions”. In: Annales
Polonici Mathematici. Volume 62. 1. 1995, pages 1–11.
[BCL94] K. Ball, E. A. Carlen, and E. H. Lieb. “Sharp Uniform Convexity and Smoothness
Inequalities for Trace Norms”. In: Invent Math 115.1 (Dec. 1994), pages 463–482.
doi: 10.1007/BF01231769.
[BBv21] A. S. Bandeira, M. T. Boedihardjo, and R. van Handel. “Matrix Concentration
Inequalities and Free Probability”. In: arXiv:2108.06312 [math] (Aug. 2021). arXiv:
2108.06312 [math].
[BR97] R. B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications. Cambridge
University Press, Cambridge, 1997. doi: 10.1017/CBO9780511529979.
[Bar02] A. Barvinok. A course in convexity. American Mathematical Society, Providence, RI,
2002. doi: 10.1090/gsm/054.
[Bar16] A. Barvinok. Combinatorics and complexity of partition functions. Springer, Cham,
2016. doi: 10.1007/978-3-319-51829-9.
[Bau+01] H. H. Bauschke et al. “Hyperbolic polynomials and convex analysis”. In: Canad. J.
Math. 53.3 (2001), pages 470–488. doi: 10.4153/CJM-2001-020-6.
[BT10] A. Beck and M. Teboulle. “On minimizing quadratically constrained ratio of two
quadratic functions”. In: J. Convex Anal. 17.3-4 (2010), pages 789–804.
[Bel+18] A. Belton et al. “A panorama of positivity”. Available at https://arXiv.org/
abs/1812.05482. 2018.
[BTN01] A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. Analysis,
algorithms, and engineering applications. Society for Industrial and Applied
Mathematics (SIAM), 2001. doi: 10.1137/1.9780898718829.
[Ben+17] P. Benner et al. Model reduction and approximation: theory and algorithms. SIAM,
2017.
[Ber08] C. Berg. “Stieltjes-Pick-Bernstein-Schoenberg and their connection to complete
monotonicity”. In: Positive Definite Functions: From Schoenberg to Space-Time
Challenges. Castellón de la Plana: University Jaume I, 2008, pages 15–45.
[BCR84] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups. Theory
of positive definite and related functions. Springer-Verlag, New York, 1984. doi:
10.1007/978-1-4612-1128-0.
[BP94] A. Berman and R. J. Plemmons. Nonnegative matrices in the mathematical sciences.
Revised reprint of the 1979 original. Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1994. doi: 10.1137/1.9781611971262.
[Ber12] D. Bertsekas. Dynamic Programming and Optimal Control: Volume I. v. 1. Athena
Scientific, 2012. url: https://books.google.com/books?id=qVBEEAAAQBAJ.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha03] R. Bhatia. “On the exponential metric increasing property”. In: Linear Algebra and its
Applications 375 (2003), pages 211–220. doi: https://doi.org/10.1016/S0024-
3795(03)00647-5.
[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[BH06] R. Bhatia and J. Holbrook. “Riemannian geometry and matrix geometric means”.
In: Linear Algebra and its Applications 413.2 (2006). Special Issue on the 11th
Conference of the International Linear Algebra Society, Coimbra, 2004, pages 594–
618. doi: https://doi.org/10.1016/j.laa.2005.08.025.
[BJL19] R. Bhatia, T. Jain, and Y. Lim. “On the Bures-Wasserstein distance between positive
definite matrices”. In: Expo. Math. 37.2 (2019), pages 165–191. doi: 10.1016/j.
exmath.2018.01.002.
[BL06] Y. Bilu and N. Linial. “Lifts, Discrepancy and Nearly Optimal Spectral Gap”. In:
Combinatorica 26.5 (2006), pages 495–519. doi: 10.1007/s00493-006-0029-7.
[Bir46] G. Birkhoff. “Three observations on linear algebra”. In: Univ. Nac. Tacuman, Rev.
Ser. A 5 (1946), pages 147–151.
[Boc33] S. Bochner. “Monotone Funktionen, Stieltjessche Integrale und harmonische
Analyse”. In: Math. Ann. 108.1 (1933), pages 378–410. doi: 10.1007/BF01452844.
[BB08] J. Borcea and P. Brändén. “Applications of stable polynomials to mixed determi-
nants: Johnson’s conjectures, unimodality, and symmetrized Fischer products”. In:
Duke Mathematical Journal 143.2 (2008), pages 205 –223. doi: 10.1215/00127094-
2008-018.
[Bou93] P. Bougerol. “Kalman Filtering with Random Coefficients and Contractions”. In:
SIAM Journal on Control and Optimization 31.4 (1993), pages 942–959. eprint:
https://doi.org/10.1137/0331041. doi: 10.1137/0331041.
[BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press,
Cambridge, 2004. doi: 10.1017/CBO9780511804441.
[BHB16] R. Brault, M. Heinonen, and F. Buc. “Random fourier features for operator-valued
kernels”. In: Asian Conference on Machine Learning. PMLR. 2016, pages 110–125.
[Bra+11] M. Braverman et al. “The Grothendieck Constant Is Strictly Smaller than Krivine’s
Bound”. In: 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.
Oct. 2011, pages 453–462. doi: 10.1109/FOCS.2011.77.
[BRS17] J. Briët, O. Regev, and R. Saket. “Tight Hardness of the Non-Commutative
Grothendieck Problem”. In: Theory of Computing 13.15 (Dec. 2017), pages 1–
24. doi: 10.4086/toc.2017.v013a015.
[CDV07] A. Caponnetto and E. De Vito. “Optimal rates for the regularized least-squares
algorithm”. In: Foundations of Computational Mathematics 7.3 (2007), pages 331–
368.
[Cap+08] A. Caponnetto et al. “Universal multi-task kernels”. In: The Journal of Machine
Learning Research 9 (2008), pages 1615–1646.
[Car10] E. Carlen. “Trace inequalities and quantum entropy: an introductory course”.
In: Entropy and the quantum. Volume 529. Contemp. Math. Amer. Math. Soc.,
Providence, RI, 2010, pages 73–140. doi: 10.1090/conm/529/10428.
[Car+10] C. Carmeli et al. “Vector valued reproducing kernel Hilbert spaces and universality”.
In: Analysis and Applications 8.01 (2010), pages 19–61.
[Car97] R. Caruana. “Multitask learning”. In: Machine learning 28.1 (1997), pages 41–75.
[Cha13] D. Chafaï. A probabilistic proof of the Schoenberg theorem. 2013. url: https :
//djalil.chafai.net/blog/2013/02/09/a- probabilistic- proof- of-
the-schoenberg-theorem/.
[CB21] C.-F. Chen and F. G. S. L. Brandão. “Concentration for Trotter Error”. Available
at https : / / arXiv . org / abs / 2111 . 05324. Nov. 2021. arXiv: 2111 . 05324
[math-ph, physics:quant-ph].
[Cho74] M.-D. Choi. “A Schwarz inequality for positive linear maps on 𝐶 ∗ -algebras”. In:
Illinois Journal of Mathematics 18.4 (1974), pages 565 –574. doi: 10.1215/ijm/
1256051007.
[CS07] M. Chudnovsky and P. Seymour. “The roots of the independence polynomial
of a clawfree graph”. In: Journal of Combinatorial Theory, Series B 97.3 (2007),
pages 350–357. doi: https://doi.org/10.1016/j.jctb.2006.06.001.
[Chu97] F. R. K. Chung. Spectral graph theory. American Mathematical Society, 1997.
[Col+21] S. Cole et al. “Quantum optimal transport”. Available at https://arXiv.org/
abs/2105.06922. 2021.
[CP77] “CHAPTER 6. Calculus in Banach Spaces”. In: Functional Analysis in Modern Applied
Mathematics. Volume 132. Mathematics in Science and Engineering. Elsevier, 1977,
pages 87–105. doi: https://doi.org/10.1016/S0076-5392(08)61248-5.
[Dav57] C. Davis. “A Schwarz inequality for convex operator functions”. In: Proc. Amer.
Math. Soc. 8 (1957), pages 42–44. doi: 10.2307/2032808.
[DK70] C. Davis and W. M. Kahan. “The Rotation of Eigenvectors by a Perturbation. III”.
In: SIAM Journal on Numerical Analysis 7.1 (1970), pages 1–46. eprint: https:
//doi.org/10.1137/0707001. doi: 10.1137/0707001.
[DKW82] C. Davis, W. M. Kahan, and H. F. Weinberger. “Norm-preserving dilations and
their applications to optimal error bounds”. In: SIAM J. Numer. Anal. 19.3 (1982),
pages 445–469. doi: 10.1137/0719029.
[Ded92] J. P. Dedieu. “Obreschkoff ’s theorem revisited: what convex sets are contained in the
set of hyperbolic polynomials?” In: Journal of Pure and Applied Algebra 81.3 (1992),
pages 269–278. doi: https://doi.org/10.1016/0022-4049(92)90060-S.
[DPZ20] P. B. Denton, S. J. Parke, and X. Zhang. “Neutrino oscillations in matter via
eigenvalues”. In: Phys. Rev. D 101 (2020), page 093001. doi: 10.1103/PhysRevD.
101.093001.
[Den+22] P. B. Denton et al. “Eigenvectors from eigenvalues: a survey of a basic identity in
linear algebra”. In: Bull. Amer. Math. Soc. (N.S.) 59.1 (2022), pages 31–58. doi:
10.1090/bull/1722.
[DT07] I. S. Dhillon and J. A. Tropp. “Matrix nearness problems with Bregman divergences”.
In: SIAM J. Matrix Anal. Appl. 29.4 (2007), pages 1120–1146. doi: 10 . 1137 /
060649021.
[Din+06] C. Ding et al. “R1 -PCA: Rotational Invariant L1 -Norm Principal Component Analysis
for Robust Subspace Factorization”. In: Proceedings of the 23rd International
Conference on Machine Learning. ICML ’06. Association for Computing Machinery,
June 2006, pages 281–288. doi: 10.1145/1143844.1143880.
[ENG11] A. Ebadian, I. Nikoufar, and M. E. Gordji. “Perspectives of matrix convex functions”.
In: Proceedings of the National Academy of Sciences 108.18 (2011), pages 7313–7314.
doi: 10.1073/pnas.1102518108.
[EY39] C. Eckart and G. Young. “A principal axis transformation for non-hermitian
matrices”. In: Bull. Amer. Math. Soc. 45.2 (1939), pages 118–121. doi: 10.1090/
S0002-9904-1939-06910-3.
[Eff09] E. G. Effros. “A matrix convexity approach to some celebrated quantum inequali-
ties”. In: Proceedings of the National Academy of Sciences 106.4 (2009), pages 1006–
1008. doi: 10.1073/pnas.0807965106.
[Eva10] L. C. Evans. Partial differential equations. American Mathematical Soc., 2010.
[Evg+05] T. Evgeniou et al. “Learning multiple tasks with kernel methods.” In: Journal of
machine learning research 6.4 (2005).
[Fel80] H. J. Fell. “On the zeros of convex combinations of polynomials.” In: Pacific Journal
of Mathematics 89.1 (1980), pages 43 –50. doi: pjm/1102779366.
[Går51] L. Gårding. “Linear hyperbolic partial differential equations with constant co-
efficients”. In: Acta Mathematica 85.none (1951), pages 1 –62. doi: 10 . 1007 /
BF02395740.
[Gar+20] A. Garg et al. “Operator scaling: theory and applications”. In: Found. Comput. Math.
20.2 (2020), pages 223–290. doi: 10.1007/s10208-019-09417-z.
[Gar07] D. J. H. Garling. Inequalities: a journey into linear analysis. Cambridge University
Press, Cambridge, 2007. doi: 10.1017/CBO9780511755217.
[GG78] C. Godsil and I. Gutman. “On the matching polynomial of a graph”. In: Algebraic
Methods in Graph Theory 25 (Jan. 1978).
[GR01] C. Godsil and G. Royle. Algebraic graph theory. Springer-Verlag, New York, 2001.
doi: 10.1007/978-1-4613-0163-9.
[GW95] M. X. Goemans and D. P. Williamson. “Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming”. In:
J. Assoc. Comput. Mach. 42.6 (1995), pages 1115–1145. doi: 10.1145/227683.
227684.
[GM10] G. H. Golub and G. Meurant. Matrices, moments and quadrature with applications.
Princeton University Press, Princeton, NJ, 2010.
[GVL13] G. H. Golub and C. F. Van Loan. Matrix computations. Fourth. Johns Hopkins
University Press, Baltimore, MD, 2013.
[GR07] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Seventh.
Translated from the Russian, Translation edited and with a preface by Alan
Jeffrey and Daniel Zwillinger, With one CD-ROM (Windows, Macintosh and UNIX).
Elsevier/Academic Press, Amsterdam, 2007.
[GLO20] A. Greenbaum, R.-C. Li, and M. L. Overton. “First-Order Perturbation Theory for
Eigenvalues and Eigenvectors”. In: SIAM Review 62.2 (2020), pages 463–482. doi:
10.1137/19M124784X.
[Gro53] A. Grothendieck. Résumé de La Théorie Métrique Des Produits Tensoriels Topologiques.
Soc. de Matemática de São Paulo, 1953.
[Gu15] M. Gu. “Subspace iteration randomization and singular value problems”. In: SIAM
J. Sci. Comput. 37.3 (2015), A1139–A1173. doi: 10.1137/130938700.
[Gül97] O. Güler. “Hyperbolic Polynomials and Interior Point Methods for Convex Program-
ming”. In: Mathematics of Operations Research 22.2 (1997), pages 350–377.
[Gur04] L. Gurvits. “Classical complexity and quantum entanglement”. In: J. Comput.
System Sci. 69.3 (2004), pages 448–484. doi: 10.1016/j.jcss.2004.06.003.
[Haa85] U. Haagerup. “The Grothendieck Inequality for Bilinear Forms on C*-Algebras”.
In: Advances in Mathematics 56.2 (May 1985), pages 93–116. doi: 10.1016/0001-
8708(85)90026-X.
[HI95] U. Haagerup and T. Itoh. “Grothendieck Type Norms For Bilinear Forms On
C*-Algebras”. In: Journal of Operator Theory 34.2 (1995), pages 263–283.
[Hal74] P. R. Halmos. Finite-dimensional vector spaces. 2nd ed. Springer-Verlag, New York-
Heidelberg, 1974.
[HP82] F. Hansen and G. K. Pedersen. “Jensen’s Inequality for Operators and Löwner’s
Theorem.” In: Mathematische Annalen 258 (1982), pages 229–241.
[HP03] F. Hansen and G. K. Pedersen. “Jensen’s Operator Inequality”. In: Bulletin of the
London Mathematical Society 35.4 (2003), pages 553–564.
[HLP88] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Reprint of the 1952 edition.
Cambridge University Press, Cambridge, 1988.
[HL72] O. J. Heilmann and E. H. Lieb. “Theory of monomer-dimer systems”. In: Communi-
cations in Mathematical Physics 25.3 (1972), pages 190 –232. doi: cmp/1103857921.
[HV07] J. W. Helton and V. Vinnikov. “Linear matrix inequality representation of sets”.
In: Communications on Pure and Applied Mathematics 60.5 (2007), pages 654–674.
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.20155.
doi: https://doi.org/10.1002/cpa.20155.
[Her63] C. S. Herz. “Fonctions opérant sur les fonctions définies-positives”. In: Ann. Inst.
Fourier (Grenoble) 13 (1963), pages 161–180. url: http://aif.cedram.org/
item?id=AIF_1963__13__161_0.
[Hia10] F. Hiai. “Matrix analysis: matrix monotone functions, matrix means, and majoriza-
tion”. In: Interdiscip. Inform. Sci. 16.2 (2010), pages 139–248. doi: 10.4036/iis.
2010.139.
[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[HP91] F. Hiai and D. Petz. “The proper formula for relative entropy and its asymptotics
in quantum probability”. In: Communications in Mathematical Physics 143 (1991),
pages 99–114.
[HP14] F. Hiai and D. Petz. Introduction to matrix analysis and applications. Springer, Cham;
Hindustan Book Agency, New Delhi, 2014. doi: 10.1007/978-3-319-04150-6.
[Hig89] N. J. Higham. “Matrix nearness problems and applications”. In: Applications of
matrix theory (Bradford, 1988). Volume 22. Inst. Math. Appl. Conf. Ser. New Ser.
Oxford Univ. Press, New York, 1989, pages 1–27. doi: 10.1093/imamat/22.1.1.
[Hig08] N. J. Higham. Functions of matrices. Theory and computation. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2008. doi: 10 . 1137 / 1 .
9780898717778.
[HW53] A. J. Hoffman and H. W. Wielandt. “The variation of the spectrum of a normal
matrix”. In: Duke J. Math. 20 (1953), pages 37–39.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. “Expander graphs and their applications”.
In: Bulletin of the American Mathematical Society 43.4 (2006), pages 439–561.
[Hör94] L. Hörmander. Notions of convexity. Birkhäuser Boston, Inc., 1994.
[Hor69] R. A. Horn. “The theory of infinitely divisible matrices and kernels”. In: Trans.
Amer. Math. Soc. 136 (1969), pages 269–286. doi: 10.2307/1994714.
[HJ94] R. A. Horn and C. R. Johnson. Topics in matrix analysis. Corrected reprint of the
1991 original. Cambridge University Press, Cambridge, 1994.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[Hua19] D. Huang. “Improvement on a Generalized Lieb’s Concavity Theorem”. Available
at https://arXiv.org/abs/1902.02194. 2019.
[Hua+21] D. Huang et al. “Matrix Concentration for Products”. In: Foundations of Computa-
tional Mathematics (2021), pages 1–33.
[Hur10] G. H. Hurlbert. Linear optimization. The simplex workbook. Springer, New York,
2010. doi: 10.1007/978-0-387-79148-7.
[Jor75] C. Jordan. “Essai sur la géométrie à 𝑛 dimensions”. In: Bulletin de la Société
mathématique de France 3 (1875), pages 103–174.
[Kad+16] H. Kadri et al. “Operator-valued kernels for learning from functional response
data”. In: Journal of Machine Learning Research 17.20 (2016), pages 1–54.
[Lie73b] E. H. Lieb. “Convex trace functions and the Wigner-Yanase-Dyson conjecture”. In:
Advances in Mathematics 11.3 (1973), pages 267–288. doi: 10.1016/0001-8708(73)90011-X.
[LR73] E. H. Lieb and M. B. Ruskai. “Proof of the strong subadditivity of quantum-
mechanical entropy”. In: J. Mathematical Phys. 14 (1973). With an appendix by B.
Simon, pages 1938–1941. doi: 10.1063/1.1666274.
[LS13] J. Liesen and Z. Strakoš. Krylov subspace methods. Principles and analysis. Oxford
University Press, Oxford, 2013.
[Lin73] G. Lindblad. “Entropy, information and quantum measurements”. In: Communica-
tions in Mathematical Physics 33 (1973), pages 305–322.
[Lin63] J. Lindenstrauss. “On the Modulus of Smoothness and Divergent Series in Banach
Spaces.” In: Michigan Mathematical Journal 10.3 (1963), pages 241–252.
[LW94] C. Liverani and M. P. Wojtkowski. “Generalization of the Hilbert metric to the
space of positive definite matrices.” In: Pacific Journal of Mathematics 166.2 (1994),
pages 339–355. doi: pjm/1102621142.
[Lov79] L. Lovász. “On the Shannon capacity of a graph”. In: IEEE Trans. Inform. Theory
25.1 (1979), pages 1–7.
[Löw34] K. Löwner. “Über monotone Matrixfunktionen”. In: Mathematische Zeitschrift 38
(1934), pages 177–216. url: http://eudml.org/doc/168495.
[Luo+10] Z. Q. Luo et al. “Semidefinite relaxation of quadratic optimization problems”. In:
IEEE Signal Process. Mag. 27.3 (2010), pages 20–34.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
[MSS15a] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families I: Bipartite
Ramanujan graphs of all degrees”. In: Annals of Mathematics 182.1 (2015), pages 307–
325. url: http://www.jstor.org/stable/24523004.
[MSS15b] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families II: Mixed char-
acteristic polynomials and the Kadison-Singer problem”. In: Annals of Mathematics
182.1 (2015), pages 327–350. url: http://www.jstor.org/stable/24523005.
[MSS21] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families III: Sharper
restricted invertibility estimates”. In: Israel Journal of Mathematics (2021), pages 1–
28.
[Mar+16] D. Marquis et al. “Gershgorin’s Circle Theorem for Estimating the Eigenvalues of a
Matrix with Known Error Bounds”. 2016.
[MOA11] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and
its applications. Second. Springer, New York, 2011. doi: 10.1007/978-0-387-68276-1.
[MT11] M. McCoy and J. A. Tropp. “Two Proposals for Robust PCA Using Semidefinite
Programming”. In: Electronic Journal of Statistics 5 (Jan. 2011), pages 1123–
1160. doi: 10.1214/11-EJS636.
[MP05] C. A. Micchelli and M. Pontil. “On learning vector-valued functions”. In: Neural
computation 17.1 (2005), pages 177–204.
[MXZ06] C. A. Micchelli, Y. Xu, and H. Zhang. “Universal Kernels.” In: Journal of Machine
Learning Research 7.12 (2006).
[Min88] H. Minc. Nonnegative matrices. A Wiley-Interscience Publication. John Wiley &
Sons, Inc., New York, 1988.
[Min16] H. Q. Minh. “Operator-valued Bochner theorem, Fourier feature maps for operator-
valued kernels, and vector-valued learning”. In: arXiv preprint arXiv:1608.05639
(2016).
[Mir60] L. Mirsky. “Symmetric gauge functions and unitarily invariant norms”. In: Quart.
J. Math. Oxford Ser. (2) 11 (1960), pages 50–59. doi: 10.1093/qmath/11.1.50.
[Mir59] L. Mirsky. “On the trace of matrix products”. In: Mathematische Nachrichten 20.3-6
(1959), pages 171–174. doi: 10.1002/mana.19590200306.
[MP93] B. Mohar and S. Poljak. “Eigenvalues in Combinatorial Optimization”. In: Com-
binatorial and Graph-Theoretical Problems in Linear Algebra. Springer New York,
1993, pages 107–151.
[Nao12] A. Naor. “On the Banach-Space-Valued Azuma Inequality and Small-Set Isoperime-
try of Alon–Roichman Graphs”. In: Combinatorics, Probability and Computing 21.4
(July 2012), pages 623–634. doi: 10.1017/S0963548311000757.
[NRV14] A. Naor, O. Regev, and T. Vidick. “Efficient rounding for the noncommutative
Grothendieck inequality”. In: Theory Comput. 10 (2014), pages 257–295. doi:
10.4086/toc.2014.v010a011.
[NS21] N. H. Nelsen and A. M. Stuart. “The random feature model for input-output maps
between banach spaces”. In: SIAM Journal on Scientific Computing 43.5 (2021),
A3212–A3243.
[NC00] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information.
Cambridge University Press, Cambridge, 2000.
[Nil91] A. Nilli. “On the Second Eigenvalue of a Graph”. In: Discrete Math. 91.2 (1991),
pages 207–210. doi: 10.1016/0012-365X(91)90112-F.
[ON00] T. Ogawa and H. Nagaoka. “Strong converse and Stein’s lemma in quantum
hypothesis testing”. In: IEEE Transactions on Information Theory 46.7 (2000),
pages 2428–2433. doi: 10.1109/18.887855.
[Oli10] R. I. Oliveira. “Sums of random Hermitian matrices and an inequality by Rudelson”.
In: Electronic Communications in Probability 15 (2010), pages 203–212.
[Olv+10] F. W. J. Olver et al., editors. NIST handbook of mathematical functions. With 1
CD-ROM (Windows, Macintosh and UNIX). National Institute of Standards and
Technology, 2010.
[Ost59] A. M. Ostrowski. “A quantitative formulation of Sylvester’s law of inertia”. In:
Proceedings of the National Academy of Sciences of the United States of America 45.5
(1959), page 740.
[Pak10] I. Pak. Lectures on Discrete and Polyhedral Geometry. 2010. url: https://www.
math.ucla.edu/~pak/book.htm.
[Par98] B. N. Parlett. The symmetric eigenvalue problem. Corrected reprint of the 1980
original. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA,
1998. doi: 10.1137/1.9781611971163.
[Par78] S. Parrott. “On a quotient norm and the Sz.-Nagy–Foiaş lifting theorem”. In: J.
Functional Analysis 30.3 (1978), pages 311–328. doi: 10.1016/0022-1236(78)
90060-5.
[Pin10] A. Pinkus. Totally positive matrices. Cambridge University Press, Cambridge, 2010.
[Pis78] G. Pisier. “Grothendieck’s Theorem for Noncommutative C*-Algebras, with an
Appendix on Grothendieck’s Constants”. In: Journal of Functional Analysis 29.3
(Sept. 1978), pages 397–415. doi: 10.1016/0022-1236(78)90038-1.
[Pis12] G. Pisier. “Grothendieck’s Theorem, Past and Present”. In: Bulletin of the American
Mathematical Society 49.2 (Apr. 2012), pages 237–323. doi: 10.1090/S0273-0979-
2011-01348-9.
[PX97] G. Pisier and Q. Xu. “Non-commutative martingale inequalities”. In: Comm. Math.
Phys. 189.3 (1997), pages 667–698. doi: 10.1007/s002200050224.
[RS09] P. Raghavendra and D. Steurer. “Towards Computing the Grothendieck Constant”.
In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Society for Industrial and Applied Mathematics, Jan. 2009, pages 525–534. doi:
10.1137/1.9781611973068.58.
[RR08] A. Rahimi and B. Recht. “Random Features for Large-Scale Kernel Machines”.
In: Advances in Neural Information Processing Systems 20. Curran Associates, Inc.,
2008, pages 1177–1184. url: http://papers.nips.cc/paper/3182-random-
features-for-large-scale-kernel-machines.pdf.
[Ren06] J. Renegar. “Hyperbolic Programs, and Their Derivative Relaxations”. In: Found.
Comput. Math. 6.1 (2006), pages 59–79. doi: 10.1007/s10208-004-0136-z.
[RX16] É. Ricard and Q. Xu. “A Noncommutative Martingale Convexity Inequality”. In: The
Annals of Probability 44.2 (Mar. 2016), pages 867–882. doi: 10.1214/14-AOP990.
[Ric24] J. Riccati. “Animadversiones in aequationes differentiales secundi gradus”. In:
Actorum Eruditorum quae Lipsiae publicantur Supplementa, v. 8. Prostant apud
Joh. Grossii haeredes & J. F. Gleditschium, 1724. url: https://books.google.com/books?id=UjTw1w7tZsEC.
[Rud59] W. Rudin. “Positive definite sequences and absolutely monotonic functions”. In:
Duke Math. J. 26 (1959), pages 617–622. url: http://projecteuclid.org/
euclid.dmj/1077468771.
[Rud90] W. Rudin. Fourier analysis on groups. Reprint of the 1962 original, A Wiley-
Interscience Publication. John Wiley & Sons, Inc., New York, 1990. doi: 10.1002/
9781118165621.
[Rud91] W. Rudin. Functional analysis. Second. McGraw-Hill, Inc., New York, 1991.
[Saa11a] Y. Saad. Numerical Methods for Large Eigenvalue Problems. 2nd edition. Society for
Industrial and Applied Mathematics, 2011. doi: 10.1137/1.9781611970739.
[Saa11b] Y. Saad. Numerical methods for large eigenvalue problems. Revised edition of the
1992 original. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2011. doi: 10.1137/1.9781611970739.ch1.
[SSB18] B. Schölkopf, A. J. Smola, and F. Bach. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.
[Sch14] R. Schneider. Convex bodies: the Brunn–Minkowski theory. Volume 151. Cambridge
University Press, 2014. doi: 10.1017/CBO9781139003858.
[Sch38] I. J. Schoenberg. “Metric spaces and positive definite functions”. In: Trans. Amer.
Math. Soc. 44.3 (1938), pages 522–536. doi: 10.2307/1989894.
[SS17] L. J. Schulman and A. Sinclair. “Analysis of a Classical Matrix Preconditioning
Algorithm”. In: J. Assoc. Comput. Mach. (JACM) 64.2 (2017), 9:1–9:23.
[Sch64] L. Schwartz. “Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux
associés (noyaux reproduisants)”. In: Journal d’analyse mathématique 13.1 (1964),
pages 115–256.
[Sen81] E. Seneta. Nonnegative matrices and Markov chains. Second. Springer-Verlag, New
York, 1981. doi: 10.1007/0-387-32792-4.
[Ser09] D. Serre. “Weyl and Lidskiĭ inequalities for general hyperbolic polynomials”. In:
Chinese Annals of Mathematics, Series B 30.6 (2009), pages 785–802.
[Sha48] C. E. Shannon. “A mathematical theory of communication”. In: The Bell System
Technical Journal 27.3 (1948), pages 379–423. doi: 10.1002/j.1538-7305.1948.
tb01338.x.
[SH19] Y. Shenfeld and R. van Handel. “Mixed volumes and the Bochner method”. In: Proc.
Amer. Math. Soc. 147.12 (2019), pages 5385–5402. doi: 10.1090/proc/14651.
[Sim11] B. Simon. Convexity. An analytic viewpoint. Cambridge University Press, Cambridge,
2011. doi: 10.1017/CBO9780511910135.
[Sim15] B. Simon. Real analysis. With a 68 page companion booklet. American Mathematical
Society, Providence, RI, 2015. doi: 10.1090/simon/001.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
[Sin64] R. Sinkhorn. “A relationship between arbitrary positive matrices and doubly
stochastic matrices”. In: Ann. Math. Statist. 35 (1964), pages 876–879. doi: 10.
1214/aoms/1177703591.
[SMI92] V. I. Smirnov. “Biography of A. M. Lyapunov”. In: International Journal of
Control 55.3 (1992), pages 775–784. eprint: https://doi.org/10.1080/00207179208934258.
doi: 10.1080/00207179208934258.
[So09] A. M. So. “Improved Approximation Bound for Quadratic Optimization Problems
with Orthogonality Constraints”. In: Proceedings of the Twentieth Annual ACM-SIAM
Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics,
Jan. 2009, pages 1201–1209. doi: 10.1137/1.9781611973068.130.
[Spi18a] D. Spielman. “Yale CPSC 662/AMTH 561: Spectral Graph Theory”. Available at
http://www.cs.yale.edu/homes/spielman/561/561schedule.html. 2018.
[Spi18b] D. Spielman. Bipartite Ramanujan Graphs. 2018. url: http://www.cs.yale.
edu/homes/spielman/561/lect25-18.pdf.
[Spi19] D. Spielman. Spectral and Algebraic Graph Theory. 2019. url: http://cs-www.cs.yale.edu/homes/spielman/sagt/sagt.pdf.
[SS11] D. A. Spielman and N. Srivastava. “Graph Sparsification by Effective Resistances”.
In: SIAM Journal on Computing 40.6 (2011), pages 1913–1926. eprint: https://doi.org/10.1137/080734029. doi: 10.1137/080734029.
[SH15] S. Sra and R. Hosseini. “Conic geometric optimization on the manifold of positive
definite matrices”. In: SIAM J. Optim. 25.1 (2015), pages 713–739. doi: 10.1137/
140978168.
[Ste56] E. M. Stein. “Interpolation of linear operators”. In: Transactions of the American
Mathematical Society 83.2 (1956), pages 482–492.
[SS90] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1st Edition. Academic
Press, 1990.
[SBT17] D. Sutter, M. Berta, and M. Tomamichel. “Multivariate trace inequalities”. In: Comm.
Math. Phys. 352.1 (2017), pages 37–58. doi: 10.1007/s00220-016-2778-5.
[Syl84] J. J. Sylvester. “Sur l’équation en matrices px = xq”. In: CR Acad. Sci. Paris 99.2
(1884), pages 67–71.
[Syl52] J. Sylvester. “A demonstration of the theorem that every homogeneous quadratic
polynomial is reducible by real orthogonal substitutions to the form of a sum of
positive and negative squares”. In: The London, Edinburgh, and Dublin Philosophical
Magazine and Journal of Science 4.23 (1852), pages 138–142.
[Tom74] N. Tomczak-Jaegermann. “The Moduli of Smoothness and Convexity and the
Rademacher Averages of the Trace Classes 𝑆𝑝 (1 ≤ 𝑝 < ∞)”. In: Studia Mathe-
matica 50.2 (1974), pages 163–182.
[Tro09] J. A. Tropp. “Column subset selection, matrix factorization, and eigenvalue op-
timization”. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on
Discrete Algorithms. SIAM, Philadelphia, PA, 2009, pages 978–986.
[Tro11] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12.4 (2011), pages 389–434. doi: 10.1007/
s10208-011-9099-z.
[Tro12] J. A. Tropp. “From joint convexity of quantum relative entropy to a concavity
theorem of Lieb”. In: Proc. Amer. Math. Soc. 140.5 (2012), pages 1757–1760. doi:
10.1090/S0002-9939-2011-11141-9.
[Tro15] J. A. Tropp. “An Introduction to Matrix Concentration Inequalities”. In: Foundations
and Trends in Machine Learning 8.1-2 (2015), pages 1–230.
[Tro18] J. A. Tropp. “Second-order matrix concentration inequalities”. In: Appl. Comput.
Harmon. Anal. 44.3 (2018), pages 700–736. doi: 10.1016/j.acha.2016.07.005.
[van14] R. van Handel. Probability in High Dimension. Technical report. Princeton University,
June 2014. doi: 10.21236/ADA623999.
[VB96] L. Vandenberghe and S. Boyd. “Semidefinite programming”. In: SIAM Rev. 38.1
(1996), pages 49–95. doi: 10.1137/1038003.
[Vas79] H. Vasudeva. “Positive definite matrices and absolutely monotonic functions”. In:
Indian J. Pure Appl. Math. 10.7 (1979), pages 854–858.
[Ver18] R. Vershynin. High-dimensional probability. An introduction with applications in
data science. With a foreword by Sara van de Geer. Cambridge University Press,
Cambridge, 2018. doi: 10.1017/9781108231596.
[Wat18] J. Watrous. The theory of quantum information. Cambridge University Press, 2018.
[Wid41] D. V. Widder. The Laplace Transform. Princeton University Press, Princeton, N. J.,
1941.
[Wol19] N. Wolchover. “Neutrinos lead to unexpected discovery in basic math”. In: Quanta
Magazine (2019). url: https://www.quantamagazine.org/neutrinos-lead-
to-unexpected-discovery-in-basic-math-20191113.
[Zha05] F. Zhang, editor. The Schur complement and its applications. Springer-Verlag, New
York, 2005. doi: 10.1007/b105056.