
MATRIX ANALYSIS

ACM 204 / Caltech / Winter 2022

Prof. Joel A. Tropp


Typeset on August 22, 2022

Copyright ©2022. All rights reserved.

Cite as:
Joel A. Tropp, ACM 204: Matrix Analysis, Caltech CMS Lecture Notes 2022-01, Pasadena,
Winter 2022. doi:10.7907/nwsv-df59.

These lecture notes are composed using an adaptation of a template designed by


Mathias Legrand, licensed under CC BY-NC-SA 3.0.

Cover image: Sample paths of a randomized block Krylov method for estimating the
largest eigenvalue of a symmetric matrix.

Lecture images: Falling text in the style of The Matrix was created by Jamie Zawinski,
©1999–2003.
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

I lectures
1 Tensor Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Tensor product: Motivation 2
1.2 The space of tensor products 4
1.3 Isomorphism to bilinear forms 5
1.4 Inner products 7
1.5 Theory of linear operators 8
1.6 Spectral theory 10

2 Multilinear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 Multivariate tensor product 12
2.2 Permutations 14
2.3 Wedge products 16
2.4 Wedge operators 17
2.5 Spectral theory of wedge operators 18
2.6 Determinants 18
2.7 Symmetric tensor product 19

3 Majorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Majorization 21
3.2 Doubly stochastic matrices 24
3.3 T-transforms 25
3.4 Majorization and doubly stochastic matrices 27
3.5 The Schur–Horn theorem 28

4 Isotone Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 Recap 30
4.2 Weyl majorant theorem 30
4.3 Isotonicity 32
4.4 Schur “convexity” 35

5 Birkhoff and von Neumann . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


5.1 Doubly stochastic matrices 38
5.2 The Birkhoff–von Neumann theorem 39
5.3 The Minkowski theorem on extreme points 40
5.4 Proof of Birkhoff theorem 42
5.5 The Richter trace theorem 43

6 Unitarily Invariant Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46


6.1 Symmetric gauge functions 46
6.2 Duality for symmetric gauge functions 48
6.3 Unitarily invariant norms 49
6.4 Characterization of unitarily invariant norms 50
6.5 Duality for unitarily invariant norms 51

7 Matrix Inequalities via Complex Analysis . . . . . . . . . . . . . . . . 54


7.1 Motivation: Real analysis is not always enough 54
7.2 The maximum modulus principle 56
7.3 Interpolation: The three-lines theorem 58
7.4 Example: Duality for Schatten norms 59

8 Uniform Smoothness and Convexity . . . . . . . . . . . . . . . . . . . . 62


8.1 Convexity and smoothness 62
8.2 Uniform smoothness for Schatten norms 63
8.3 Application: Sum of independent random matrices 65
8.4 Proof: Scalar case 67
8.5 Proof: Matrix case 69

9 Additive Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 74


9.1 Variational principles 74
9.2 Weyl monotonicity principle 76
9.3 The Lidskii theorem 77
9.4 Consequences of Lidskii’s theorem 78

10 Multiplicative Perturbation Theory . . . . . . . . . . . . . . . . . . . . . . 81


10.1 Recap of Lidskii’s theorem 81
10.2 The theorem of Li & Mathias 81
10.3 Ostrowski monotonicity 84
10.4 Proof of the Li–Mathias theorem 84
11 Perturbation of Eigenspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
11.1 Motivation 87
11.2 Principal angles between subspaces 88
11.3 Sylvester equations 90
11.4 Perturbation theory for eigenspaces 92

12 Positive Linear Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


12.1 Positive-semidefinite order 95
12.2 Positive linear maps 96
12.3 Examples of positive linear maps 96
12.4 Properties of positive linear maps 98
12.5 Convexity inequalities 99
12.6 Russo–Dye theorem 101

13 Matrix Monotonicity and Convexity . . . . . . . . . . . . . . . . . . . 104


13.1 Basic definitions and properties 104
13.2 Examples 105
13.3 The matrix Jensen inequality 108

14 Monotonicity: Differential Characterization . . . . . . . . . . . . . . 112


14.1 Recap 112
14.2 Differential characterizations 112
14.3 Derivatives of standard matrix functions 114
14.4 Proof of Loewner’s theorem 117
14.5 Examples 118

15 Monotonicity: Integral Characterization . . . . . . . . . . . . . . . . . 120


15.1 Recap 120
15.2 Integral representations of matrix monotone functions 120
15.3 The geometric approach to Loewner’s theorem 123
15.4 Matrix monotone functions on the positive real line 125
15.5 Integral representations of matrix convex functions 128
15.6 Application: Matrix Jensen and Lyapunov inequalities 130

16 Matrix Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133


16.1 Scalar means 133
16.2 Matrix means 134
16.3 Representer functions for scalar means 138
16.4 Representation of matrix means 139
16.5 Matrix means from matrix representers 141
17 Quantum Relative Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
17.1 Entropy and relative entropy 147
17.2 Quantum entropy and quantum relative entropy 149
17.3 The matrix perspective transformation 150
17.4 Tensors and logarithms 152
17.5 Convexity of matrix trace functions 153

18 Positive-Definite Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 156


18.1 Positive-definite kernels 156
18.2 Positive-definite functions 159
18.3 Examples of positive-definite functions 161
18.4 Bochner’s theorem 164

19 Entrywise PSD Preservers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170


19.1 Families of kernels 170
19.2 Entrywise functions that preserve the psd property 173
19.3 Examples of entrywise psd preservers 175
19.4 Absolutely monotone and completely monotone functions 177
19.5 Vasudeva’s theorem 178
19.6 *Completely monotone functions 181

II problem sets

1 Multilinear Algebra & Majorization . . . . . . . . . . . . . . . . . . . . . . 186

2 UI Norms & Variational Principles . . . . . . . . . . . . . . . . . . . . . . 190

3 Perturbation Theory & Positive Maps . . . . . . . . . . . . . . . . . . . 194

III projects

Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

1 Hiai–Kosaki Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208


by Edoardo Calvello
1.1 A simple proof for the matrix arithmetic–geometric mean inequality 208
1.2 From scalar means to matrix means 209
1.3 A unified analysis of means for matrices 211
1.4 Norm inequalities 213
1.5 Conclusions 215
2 The Eigenvector–Eigenvalue Identity . . . . . . . . . . . . . . . . . . . 217
by Ruizhi Cao
2.1 Cauchy interlacing theorem 217
2.2 First-order perturbation theorem 219
2.3 Eigenvector–eigenvalue identity 220
2.4 Proof of the identity 222
2.5 Application 224

3 Bipartite Ramanujan Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 227


by Nico Christianson
3.1 Bipartite Ramanujan graphs 227
3.2 Reductions 229
3.3 Interlacing polynomials and real stability 232
3.4 Proof of Theorem 3.13 235

4 The NC Grothendieck Problem . . . . . . . . . . . . . . . . . . . . . . . . 238


by Ethan Epperly
4.1 Grothendieck’s inequality 238
4.2 The noncommutative Grothendieck problem 240
4.3 Noncommutative Grothendieck efficient rounding: Proof 242
4.4 Application: Robust PCA 244

5 Algebraic Riccati Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . 248


by Taylan Kargin
5.1 Motivation 248
5.2 Metric Geometry of Positive-Definite Cone 250
5.3 Stability of Matrices 252
5.4 Discrete Algebraic Riccati Equations 256

6 Hyperbolic Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263


by Eitan Levin
6.1 Basic definitions and properties 263
6.2 Derivatives and multilinearization 265
6.3 Hyperbolic quadratics and Alexandrov’s mixed discriminant inequality 267
6.4 Semidefinite representability, and additive perturbation theory 268
6.5 Hyperbolicity and convexity of compositions 269
6.6 Euclidean structure 271

7 Matrix Laplace Transform Method . . . . . . . . . . . . . . . . . . . . . . 274


by Elvira Moreno
7.1 The Laplace transform method 274
7.2 Laplace transform tail bound for sums of random matrices 276
7.3 The matrix Chernoff bound 279
7.4 Application: Sparsification via random sampling 280
7.5 Conclusion 284

8 Operator-Valued Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286


by Nicholas H. Nelsen
8.1 Scalar kernels and reproducing kernel Hilbert space 286
8.2 Operator-valued kernels 287
8.3 Examples 291
8.4 Vector-valued Gaussian processes 293

9 Spectral Radius and Stability . . . . . . . . . . . . . . . . . . . . . . . . . 298


by Jing Yu
9.1 System stability and spectral radius 298
9.2 Geršgorin disks 299
9.3 Bounding spectral radius 303
9.4 Notes 305

IV back matter
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Preface

“The Matrix is everywhere. It is all around us. Even now in this very room.”

—Morpheus, The Matrix, 1999

Matrices are a foundational tool in the mathematical sciences, in statistics, in


engineering, and in computer science. The purpose of this course is to develop a
deeper understanding of matrices, their structure, and function using tools from linear
algebra, convexity theory, and analysis.

Course overview
The topics of this course vary from term to term, depending on the audience. This
term, we covered the following material:
• Basics of multilinear algebra
• Majorization and doubly stochastic matrices
• Symmetric and unitarily invariant norms
• Uniform smoothness of matrix spaces
• Complex interpolation methods for matrix inequalities
• Variational principles for eigenvalues
• Perturbation theory for eigenvalues
• Angles between subspaces and perturbation theory for eigenspaces
• Tensor products and matrix equations
• Positive and completely positive linear maps
• Matrix monotonicity and convexity
• Differentiation of standard matrix functions
• Loewner’s theorems on matrix monotone functions
• Matrix means
• Convexity of matrix trace functions
• Positive-definite functions and Bochner’s theorem
• Entrywise positivity preservers and Schoenberg’s theorem
The course had three optional problem sets, which helped to cement some of the
foundational material. The problem sets are attached to the notes.
The primary assignment was a project, where each student read some classic or
modern papers in matrix analysis and wrote a synthetic treatment of the material. A
selection of the projects is attached to the lecture notes.

Prerequisites
ACM 204 is designed for G2 and G3 students in the mathematical sciences. The
prerequisites for this course are differential and integral calculus (e.g., Caltech Ma 1ac),
ordinary differential equations (e.g., Ma 2), and intermediate linear algebra (e.g., Ma
1b and ACM 104). Exposure to linear analysis (e.g., CMS 107), functional analysis (e.g.,
ACM 105), and optimization theory (e.g., CMS 122) is also valuable.

Supplemental textbooks
There is no required textbook for the course. Some recent books that cover related
material include

• [Bha97] Bhatia, Matrix Analysis, Springer, 1997.


• [Bha07a] Bhatia, Perturbation Bounds for Matrix Eigenvalues, SIAM, 2007.
• [Bha07b] Bhatia, Positive-Definite Matrices, Princeton, 2007.
• [Car10] Carlen, Trace Inequalities and Quantum Entropy, 2010.
• [Hia10] Hiai, Matrix Analysis: Matrix Monotone Functions, Matrix Means, and
Majorization, 2010.
• [HP14] Hiai & Petz, Introduction to Matrix Analysis and Applications, Springer,
2014.
• [Hig08] Higham, Functions of Matrices, SIAM, 2008.
• [HJ13] Horn & Johnson, Matrix Analysis, 2nd ed., Cambridge, 2013.
• [HJ94] Horn & Johnson, Topics in Matrix Analysis, Cambridge, 1994.
• [Kat95] Kato, Perturbation Theory for Linear Operators, 2nd ed., Springer, 1995.
• [MOA11] Marshall et al., Inequalities: Theory of Majorization and Its Applications,
2nd ed., Springer 2011.

Bhatia’s books [Bha97; Bha07b] are the primary sources for this course.

These notes
These lecture notes document ACM 204 as taught in Winter 2022, and they are primarily
intended as a reference for students who have taken the class. The notes are prepared
by student scribes with feedback from the instructor. The notes have been edited by
the instructor to try to correct his own failures of presentation. All remaining errors
and omissions are the fault of the instructor.
Please be aware that these notes reflect material presented in a classroom, rather
than a formal scholarly publication. In some places, the notes may lack appropriate
citations to the literature. There is no claim that the arrangement or presentation of
the material is primarily due to the instructor.
The notes also contain the projects of students who wished to share their work.
They received feedback and made revisions, but the projects have not been edited.
They represent the students’ individual work.

Acknowledgements
These notes were transcribed by students taking the course in Winter 2022. They are
Eray Atay, Jag Boddapati, Edoardo Calvello, Ruizhi Cao, Anthony (Chi-Fang) Chen,
Nicolas Christianson, Matthieu Darcy, Rohit Dilip, Ethan Epperly, Salvador Gomez,
Taylan Kargin, Eitan Levin, Elvira Moreno, Nicholas Nelsen, Roy (Yixuan) Wang, and
Jing Yu. Without their care and attention, we would not have such an excellent record.

Joel A. Tropp
Steele Family Professor of Applied & Computational Mathematics
California Institute of Technology

[email protected]
http://users.cms.caltech.edu/~jtropp

Pasadena, California
March 2022
Notation

I have selected notation that is common in the linear algebra and probability literature.
I have tried to be consistent in using the symbols that are presented below. There
are some minor variations in different lectures, including the letter that indicates the
dimension of a matrix and the indexing of sums. Scribes are expected to use this same
notation!

Set theory
The Pascal notation ≔ and ≕ generates a definition. Sets without any particular
internal structure are denoted with sans serif capitals: A, B, E. Collections of sets are
written in a calligraphic font: A, B, F. The power set (that is, the collection containing
all subsets) of a set E is written as P(E).
The symbol ∅ is reserved for the empty set. We use braces to denote a set. The
character ∈ (or, rarely, ∋) is the member-of relation. The set-builder notation

{𝑥 ∈ A : 𝑃 (𝑥)}

carves out the (unique) set of elements that belong to a set A and that satisfy the
predicate 𝑃 . Basic set operations include union (∪), intersection (∩), symmetric
difference (△), set difference (\), and the complement (ᶜ) with respect to a fixed set.
The relations ⊆ and ⊇ indicate set containment.
The natural numbers ℕ ≔ {1, 2, 3, . . . }. Ordered tuples and sequences are written
with parentheses, e.g.,

(𝑎 1 , 𝑎 2 , 𝑎 3 , . . . , 𝑎 𝑛 ) or (𝑎 1 , 𝑎 2 , 𝑎 3 , . . . )
Alternative notations include things like (𝑎𝑖 : 𝑖 ∈ ℕ) or (𝑎𝑖 )𝑖 ∈ℕ or simply (𝑎𝑖 ) .

Real analysis
We mainly work in the field ℝ of real numbers, equipped with the absolute value
|·| . The extended real numbers ℝ̄ ≔ ℝ ∪ {±∞} are defined with the usual rules
of arithmetic and order. In particular, we instate the conventions that 0/0 = 0 and
0 · (±∞) = 0. We use the standard (American) notation for open and closed intervals;
e.g.,

(𝑎, 𝑏) ≔ {𝑥 ∈ ℝ : 𝑎 < 𝑥 < 𝑏 } and [𝑎, 𝑏] ≔ {𝑥 ∈ ℝ : 𝑎 ≤ 𝑥 ≤ 𝑏 }.


Occasionally, we may visit the rational field ℚ, and we very commonly use the complex
field ℂ. The imaginary unit, i, is written in an upright font.
We use modern conventions for words describing order; these may be slightly
different from what you are used to. In this course, we enforce the definition that positive
means ≥ 0 and negative means ≤ 0. (Warning: Positive means ≥ 0!) For example, the
positive integers compose the set ℤ+ ≔ {0, 1, 2, 3, . . . } and the positive reals compose
the set ℝ+ ≔ {𝑥 ∈ ℝ : 𝑥 ≥ 0}.
When required, we may deploy the phrase strictly positive to mean > 0 and strictly
negative to mean < 0. Similarly, increasing means “never going down” and decreasing
means “never going up.”

Linear algebra
We work in a real or complex linear space. The letters 𝑑 and 𝑛 (and occasionally
others) are used to denote the dimension of this space, which is always finite. For
example, we write ℝ𝑑 or ℂ𝑛 . We may write 𝔽 to refer to either field, or we may omit
the field entirely if it is not important.
We use the delta notation for standard basis vectors: 𝜹 𝑖 has a one in the 𝑖 th
coordinate and zeros elsewhere. The vector 1 has ones in each entry. The dimension
of these vectors is determined by context.
The symbol ∗ denotes the (conjugate) transpose of a vector or a matrix. In particular,
𝑧 ∗ is the complex conjugate of a complex number 𝑧 ∈ ℂ. We may also write ᵀ for the
ordinary transpose to emphasize that no conjugation is performed.
We equip 𝔽 𝑑 with the standard inner product ⟨𝒙 , 𝒚 ⟩ ≔ 𝒙 ∗ 𝒚 . The inner product
generates the Euclidean norm ‖𝒙 ‖² ≔ ⟨𝒙 , 𝒙 ⟩ .
We write ℍ𝑑 (𝔽) for the real-linear space of 𝑑 × 𝑑 self-adjoint matrices with entries
in the field 𝔽 . Recall that a matrix is self-adjoint when 𝑨 = 𝑨 ∗ . The symbols 0 and
I denote the zero matrix and the identity matrix; their dimensions are determined by
context or by an explicit subscript.
We equip the space ℍ𝑑 with the trace inner product ⟨𝑿 , 𝒀 ⟩ ≔ tr (𝑿𝒀 ) , which
generates the Frobenius norm ‖𝑿 ‖²_F ≔ ⟨𝑿 , 𝑿 ⟩ . The map tr (·) returns the trace of
a square matrix; the parentheses are often omitted. We instate the convention that
nonlinear functions bind before the trace.
The spectral theorem states that every self-adjoint matrix 𝑨 ∈ ℍ𝑛 admits a spectral
resolution:
𝑨 = ∑_{𝑖 =1}^{𝑚} 𝜆𝑖 𝑷 𝑖   where   ∑_{𝑖 =1}^{𝑚} 𝑷 𝑖 = I𝑛   and   𝑷 𝑖 𝑷 𝑗 = 𝛿𝑖 𝑗 𝑷 𝑖 .

Here, 𝜆 1 , . . . , 𝜆𝑚 are the distinct (real) eigenvalues of 𝑨 . The range of the orthogonal
projector 𝑷 𝑖 is the invariant subspace associated with 𝜆𝑖 . In this context, 𝛿𝑖 𝑗 is the
Kronecker delta.
The maps 𝜆 min (·) and 𝜆 max (·) return the minimum and maximum eigenvalues of
a self-adjoint matrix. The ℓ2 operator norm ‖·‖ of a self-adjoint matrix satisfies the
relation
‖𝑨 ‖ ≔ max{ |𝜆 max (𝑨)|, |𝜆 min (𝑨)| } for 𝑨 ∈ ℍ𝑛 .
A self-adjoint matrix is positive semidefinite (psd) if its eigenvalues are nonnegative; a
self-adjoint matrix is positive definite (pd) if its eigenvalues are strictly positive. The symbol
≼ refers to the psd order: 𝑨 ≼ 𝑯 if and only if 𝑯 − 𝑨 is psd.
We can define a standard matrix function for a self-adjoint matrix using the spectral
resolution. For an interval I ⊆ ℝ and for a function 𝑓 : I → ℝ,
𝑨 = ∑_{𝑖 =1}^{𝑚} 𝜆𝑖 𝑷 𝑖   implies   𝑓 (𝑨) = ∑_{𝑖 =1}^{𝑚} 𝑓 (𝜆𝑖 ) 𝑷 𝑖 .

Implicitly, we assume that the eigenvalues of the matrix 𝑨 lie within the domain I of
the function 𝑓 . When we apply a real function to a self-adjoint matrix, we are always
referring to the associated standard matrix function. In particular, we often encounter
powers, exponentials, and logarithms.
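
As a quick numerical companion (an addition to these notes, not part of the original text; the test matrix and the function are arbitrary illustrative choices), the following Python/NumPy sketch evaluates a standard matrix function through the eigendecomposition, exactly as in the display above.

import numpy as np

def standard_matrix_function(A, f):
    # Spectral resolution A = sum_i lambda_i P_i, so f(A) = sum_i f(lambda_i) P_i.
    eigvals, Q = np.linalg.eigh(A)            # valid for self-adjoint A
    return (Q * f(eigvals)) @ Q.conj().T      # Q diag(f(lambda)) Q*

A = np.array([[2.0, 1.0], [1.0, 3.0]])
# Sanity check: for f(x) = x^2, the standard matrix function is just A @ A.
print(np.allclose(standard_matrix_function(A, lambda x: x**2), A @ A))   # True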
We write 𝕄𝑛 (𝔽) for the linear space of 𝑛 × 𝑛 matrices over the field 𝔽 . We also
define the linear space 𝕄𝑚×𝑛 (𝔽) of 𝑚 × 𝑛 matrices over the field 𝔽 . We can extend
the trace inner-product and Frobenius norm to this setting:

⟨𝑩, 𝑪 ⟩ ≔ tr (𝑩 ∗𝑪 ) and ‖𝑩 ‖²_F ≔ ⟨𝑩, 𝑩⟩ for 𝑩, 𝑪 ∈ 𝕄𝑚×𝑛 .



A square matrix 𝑸 ∈ 𝕄𝑛 that satisfies 𝑸 ∗𝑸 = I𝑛 is called orthogonal (resp. unitary)


in the real (resp. complex) case. A tall, rectangular matrix 𝑩 ∈ 𝕄𝑚×𝑛 with 𝑛 ≤ 𝑚 that
satisfies 𝑩 ∗ 𝑩 = I𝑛 is called orthonormal; this terminology is common in the numerical
literature. More generally, a rectangular matrix 𝑩 ∈ 𝕄𝑚×𝑛 is called a partial isometry
if 𝑩 ∗ 𝑩 is an orthogonal projector.
Every matrix 𝑩 ∈ 𝕄𝑚×𝑛 (𝔽) admits a singular value decomposition:

𝑩 = 𝑼 𝚺𝑽 ∗ where 𝑼 ∈ 𝔽 𝑚×𝑚 and 𝑽 ∈ 𝔽 𝑛×𝑛 .

The matrices 𝑼 and 𝑽 are orthogonal (or unitary). The rectangular matrix 𝚺 ∈ 𝔽 𝑚×𝑛
is diagonal, in the sense that the entries (𝚺)𝑖 𝑗 = 0 whenever 𝑖 ≠ 𝑗 . The diagonal
entries of 𝚺 are called singular values. They are conventionally arranged in decreasing
order and written with the following notation.

𝜎max (𝑩) ≔ 𝜎1 (𝑩) ≥ 𝜎2 (𝑩) ≥ · · · ≥ 𝜎𝑟 (𝑩) ≕ 𝜎min (𝑩) where 𝑟 ≔ min{𝑚, 𝑛}.

The symbol ‖·‖ always refers to the ℓ2 operator norm; it returns the maximum singular
value of its argument.
We write lin (·) for the linear hull of a family of vectors. The operators range (·)
and null (·) extract the range and null space of a matrix. The operator † extracts the
pseudoinverse.

Probability
The map ℙ {·} returns the probability of an event. The operator 𝔼[·] returns the
expectation of a random variable taking values in a linear space. We only include the
brackets when it is necessary for clarity, and we impose the convention that nonlinear
functions bind before the expectation.
The symbol ∼ means “has the distribution.” We abbreviate (statistically) independent
and identically distributed (iid). Named distributions, such as normal and uniform,
are written with small capitals.
We say that a random vector 𝒙 ∈ 𝔽 𝑛 is centered when 𝔼[𝒙 ] = 0. A random vector
is isotropic when 𝔼[𝒙𝒙 ∗ ] = I𝑛 . A random vector that is both centered and isotropic is
standardized.
An important property of the standard normal distribution, which we use heavily, is
the fact that it is rotationally invariant. If 𝒙 ∼ normal ( 0, I) , then 𝑸 𝒙 is also standard
normal for every matrix 𝑸 that is orthogonal (in the real case) or unitary (in the
complex case).

Order notation
We sometimes use the familiar order notation from computer science. The symbol Θ(·)
refers to asymptotic equality. The symbol 𝑂 (·) refers to an asymptotic upper bound.
I. lectures

1 Tensor Products . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Multilinear Algebra . . . . . . . . . . . . . . . . . . . . . 12

3 Majorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Isotone Functions . . . . . . . . . . . . . . . . . . . . . 30

5 Birkhoff and von Neumann . . . . . . . . . . . . . 38

6 Unitarily Invariant Norms . . . . . . . . . . . . . . . 46

7 Matrix Inequalities via Complex Analysis . . 54

8 Uniform Smoothness and Convexity . . . . . . 62

9 Additive Perturbation Theory . . . . . . . . . . . . . 74

10 Multiplicative Perturbation Theory . . . . . . . . 81

11 Perturbation of Eigenspaces . . . . . . . . . . . . . . 87

12 Positive Linear Maps . . . . . . . . . . . . . . . . . . . . 95

13 Matrix Monotonicity and Convexity . . . . . . 104

14 Monotonicity: Differential Characterization 112

15 Monotonicity: Integral Characterization . . . 120

16 Matrix Means . . . . . . . . . . . . . . . . . . . . . . . . . 133

17 Quantum Relative Entropy . . . . . . . . . . . . . . 147

18 Positive-Definite Functions . . . . . . . . . . . . . 156

19 Entrywise PSD Preservers . . . . . . . . . . . . . . . 170


1. Tensor Products

Date: 4 January 2022 Scribe: Rohit Dilip

Agenda:
1. Tensor product: Motivation
2. Tensor product spaces
3. Bilinear forms
4. Theory of linear operators

In this lecture, we develop the theory of tensor products, which provides us with a way
to “multiply” vector spaces. We begin by discussing the axioms that such a construction
should satisfy; then we develop a rigorous way to implement these axioms. We show
that the tensor product of two Hilbert spaces is itself a Hilbert space, equipped with an
induced inner product. By developing an isomorphism with bilinear forms, we show
how to regard tensor products as matrices. Finally, we show that we can construct
linear operators on a tensor product space in a very natural way, and we develop their
spectral theory in a transparent way.

1.1 Tensor product: Motivation


We begin with some background and motivation.

1.1.1 Setting
Throughout this chapter, we will assume H is an 𝑛 -dimensional Hilbert space over a field
𝔽 (either ℝ or ℂ). A Hilbert space is endowed with an inner product denoted by h·, ·i .
By convention, we assume the inner product is conjugate linear in the first coordinate
and linear in the second coordinate. We fix an orthonormal basis {𝒆 1 , 𝒆 2 , . . . , 𝒆 𝑛 }.
That is, h𝒆 𝑗 , 𝒆 𝑘 i = 𝛿 𝑗 𝑘 . Finally, we denote the space of linear operators acting on H The Kronecker delta 𝛿 𝑗 𝑗 = 1 for all 𝑗 ,
by L( H) . Since H is finite-dimensional, we can regard every element of L( H) is as a while 𝛿 𝑗 𝑘 = 0 when 𝑗 ≠ 𝑘 .
matrix with dimension dim ( H) .

1.1.2 Axioms for the tensor product


Vector spaces admit scaling and addition, but we do not usually talk about how
to “multiply” two vectors together. Intuitively, this should be a fundamental binary
operation that distributes over addition, and it should return an object living in a larger
vector space. The latter condition can be understood via example: if we multiply two
sides of a rectangle to find its area, we understand that the product lives in a different
space because it has different units. This space is also larger in some sense than either
of the constituent spaces (i.e., 2D instead of 1D).
In this section, we enumerate the axioms that a reasonable interpretation of vector
“multiplication” might satisfy. We call this operation a tensor product.

Definition 1.1 (Tensor product: Axioms). The tensor product operation ⊗ maps a pair of
vectors 𝒙 , 𝒚 ∈ H to an object called the tensor product 𝒙 ⊗ 𝒚 . We can add tensors
and scale them. The product should satisfy the following properties:

1. Additivity. The product should distribute across the addition operation. For

all 𝒙 , 𝒚 , 𝒛 ∈ H, we require the following two equalities to hold.

(𝒙 + 𝒚 ) ⊗ 𝒛 = 𝒙 ⊗ 𝒛 + 𝒚 ⊗ 𝒛 ;
𝒙 ⊗ (𝒚 + 𝒛 ) = 𝒙 ⊗ 𝒚 + 𝒙 ⊗ 𝒛 .

2. Homogeneity. We would also like the product operation to behave well with
scalar multiplication in the field 𝔽 of the Hilbert space. In particular, for all
𝒙 , 𝒚 ∈ H,

𝛼 (𝒙 ⊗ 𝒚 ) = (𝛼𝒙 ) ⊗ 𝒚 = 𝒙 ⊗ (𝛼𝒚 ) for all 𝛼 ∈ 𝔽 .

(The homogeneity property is independent of the argument index in the product;
that is, there is no conjugation of the field element 𝛼 .)
3. Interaction with the zero vector. The zero vector 0 is the identity element for
addition; i.e., 𝒙 + 0 = 𝒙 for all 𝒙 ∈ H. We require the tensor product
with zero to be absorbing. Thus, for vectors 𝒙 , 𝒚 ∈ H,

𝒙 ⊗ 0 = 0 ⊗ 0 = 0 ⊗ 𝒚.

4. Faithfulness. Finally, the tensor product should be faithful, i.e., multiplying
two nonzero vectors produces a nonzero vector. Put differently, for vectors
𝒙 , 𝒚 ∈ H, if 𝒙 ⊗ 𝒚 = 0 ⊗ 0, then either 𝒙 = 0 or 𝒚 = 0. (Intuitively, faithfulness
guarantees that for a fixed nonzero vector 𝒚 , the mapping 𝒙 ↦→ 𝒙 ⊗ 𝒚 is one-to-one.)
In the next section, we will show that there is a construction that is consistent with
the tensor product axioms. First, let us consider some familiar binary vector operations
to see whether these satisfy our desired axioms of a tensor product.
Example 1.2 (Inner product). The inner product does not generally satisfy the axioms
of the tensor product. Consider the case where H = ℝ2 . Then the inner product for
vectors 𝒙 and 𝒚 is defined by
⟨𝒙 , 𝒚 ⟩ : (𝒙 , 𝒚 ) ↦→ 𝒙 ᵀ 𝒚 .
However, it is not faithful, since one can easily find vectors 𝒙 and 𝒚 that are not equal
to 0, but where ⟨𝒙 , 𝒚 ⟩ = 0. For instance, 𝒙 = ( 1, 1) ᵀ and 𝒚 = ( 1, −1) ᵀ . 

Example 1.3 (Schur product). The Schur product between vectors 𝒙 and 𝒚 in ℝ𝑛 is
denoted by 𝒙 ∘ 𝒚 and is defined by elementwise multiplication of the vectors 𝒙 and 𝒚 .
That is,
∘ : ℝ𝑛 × ℝ𝑛 → ℝ𝑛 ;
∘ : (𝒙 , 𝒚 ) ↦→ (𝑥𝑖 𝑦𝑖 : 𝑖 = 1, . . . , 𝑛).
Similar to the inner product, the Schur product does not satisfy faithfulness. For
instance, when H = ℝ2 , one can pick 𝒙 = ( 1, 0) ᵀ and 𝒚 = ( 0, 1) ᵀ . Their Hadamard
product is 𝒙 ∘ 𝒚 = ( 0, 0) ᵀ , but both vectors are nonzero. 

Example 1.4 (Outer product). We will denote the outer product between vectors 𝒙 and 𝒚
in ℂ𝑛 by 𝒙 ⊗ 𝒚 , for reasons that will be clear in the following discussion. It is defined
by
⊗ : ℂ𝑛 × ℂ𝑛 → ℂ𝑛×𝑛 ;
⊗ : (𝒙 , 𝒚 ) ↦→ 𝒙 𝒚 ᵀ = (𝑥𝑖 𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} .
We can prove this by checking all four axioms for the tensor product.
1. Additivity. Since the outer product is a particular case of matrix multiplication
(between an 𝑛 × 1 matrix and a 1 × 𝑛 matrix) and matrix multiplication is
additive, the outer product is also additive.

2. Homogeneity. Given 𝛼 ∈ ℂ, homogeneity follows from explicitly constructing the
outer product using the indices:

𝒙 (𝛼𝒚 ) ᵀ = (𝑥𝑖 𝛼𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} = (𝛼𝑥𝑖 𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} ;
(𝛼𝒙 )𝒚 ᵀ = (𝛼𝑥𝑖 𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} ;
𝛼 (𝒙 𝒚 ᵀ ) = 𝛼 (𝑥𝑖 𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} = (𝛼𝑥𝑖 𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} .

(Note that if we had used the conjugate transpose in our definition of the outer
product, the outer product would not satisfy homogeneity, since (𝛼𝒙 )𝒚 ∗ ≠ 𝒙 (𝛼𝒚 ) ∗
in general.)

3. Interaction with zero. If either 𝒙 or 𝒚 is 0, then either 𝑥𝑖 = 0 for all 𝑖 or 𝑦 𝑗 = 0 for
all 𝑗 . Then it must hold that

𝒙 𝒚 ᵀ = (𝑥𝑖 𝑦 𝑗 )_{𝑖 ,𝑗 =1}^{𝑛} = ( 0)_{𝑖 ,𝑗 =1}^{𝑛} = 00ᵀ .

4. Faithfulness. If 𝒙 𝒚 ᵀ = 0 (the matrix composed entirely of 0 elements), then


𝑥𝑖 𝑦 𝑗 = 0 for all 𝑖 , 𝑗 . Fix the index 𝑖 = 1; then, either 𝑥 1 = 0 or 𝑥 1 ≠ 0.
1. If 𝑥 1 ≠ 0, then 𝑦 𝑗 = 0 for all 𝑗 , and 𝒚 = 0. This satisfies faithfulness.
2. If 𝑥 1 = 0, then proceed to 𝑖 = 2 and repeat this analysis.
After proceeding in this way, either 𝒚 = 0, or 𝑥𝑖 = 0 for all 𝑖 . In the latter case,
𝒙 = 0, and faithfulness is satisfied.

The outer product thus satisfies all four axioms of the tensor product. 

Exercise 1.5 (Extension to different spaces). Let H1 and H2 be finite-dimensional Hilbert


spaces, perhaps with different dimensions. Describe axioms for a tensor product
operation for vectors 𝒙 ∈ H1 and 𝒚 ∈ H2 .
Exercise 1.6 (Extension to multiple arguments). Generalize the tensor product to a product
of 𝑘 vectors from H. Hint: Each of the axioms in Definition 1.1 needs to be adjusted
slightly, but we should still be able to draw on our intuition from multiplying numbers
in ℂ. For instance, when multiplying together a series of numbers 𝑥 1 , 𝑥 2 , . . . , 𝑥𝑘
in ℂ, we expect homogeneity in every argument so that for 𝛼 ∈ ℂ, the product
𝑥 1 × · · · × 𝛼𝑥 𝑗 × · · · × 𝑥𝑘 = 𝛼 (𝑥 1 × · · · × 𝑥 𝑗 × · · · × 𝑥𝑘 ) . How should we adjust the
other axioms so that we preserve our intuitions from the H = ℂ case?

1.2 The space of tensor products


Having defined the tensor product operator ⊗ as a binary operator following the
axioms in Definition 1.1, we will now show how to define the tensor product space, a
linear space that contains the tensor products. We know this is a reasonable object to
examine because the outer product is a valid tensor product, so there is at least one
case where this object exists.
We first define an elementary tensor to be an object of the form 𝒙 ⊗ 𝒚 , where 𝒙 and
𝒚 are in a vector space H. We then construct the linear space H ⊗ H by taking all linear
combinations of elementary tensors. We agree that two tensors are the same if we can
reduce their difference to zero by repeatedly applying the axioms.
Mathematically, an element T ∈ H ⊗ H can be expressed as a sum of 𝑟 ∈ ℕ
elementary tensors weighted by values 𝛼𝑖 ∈ 𝔽 . That is,
T = ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖

where 𝒙 𝑖 and 𝒚 𝑖 are elements of H. This space is evidently closed under linear
combinations, since given tensors T1 = ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 and T2 = ∑_{𝑖 =1}^{𝑝} 𝛽𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 ,

𝛾1 T1 + 𝛾2 T2 = 𝛾1 ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 + 𝛾2 ∑_{𝑗 =1}^{𝑝} 𝛽 𝑗 𝒙 𝑗 ⊗ 𝒚 𝑗 = ∑_{𝑖 =1}^{max{𝑟 ,𝑝 }} 𝜆𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 ,

where we have merged the two sums. Some intuition about non-elementary
tensors can be obtained by considering the specific case of an outer product, as in the
following example. (One can view the tensor product as a generalization of the outer
product.)
Example 1.7 (Elementary tensors). As a concrete example of a non-elementary tensor,
consider the standard orthonormal basis {𝜹 1 , 𝜹 2 } in ℝ2 . Two elementary tensors are

𝜹 1 ⊗ 𝜹 1 = [ 1 0 ; 0 0 ]   and   𝜹 2 ⊗ 𝜹 2 = [ 0 0 ; 0 1 ],

where ⊗ specifically references the outer product. (For the particular case of vectors in
ℂ𝑛 , the tensor product produces matrices in ℂ𝑛×𝑛 .) Their sum is

𝜹 1 ⊗ 𝜹 1 + 𝜹 2 ⊗ 𝜹 2 = [ 1 0 ; 0 1 ].

In this way, we can view an elementary tensor as a rank-1 matrix, while a non-elementary
tensor is any higher-rank matrix. 

However, clearly tensor decompositions are not unique. To impose uniqueness, we


define equivalence classes where two vectors in H ⊗ H are equivalent if and only if
they are related by the axioms of vector multiplication, as can be seen through the
following example.
Example 1.8 (Non-uniqueness of representation). Take the vectors 𝜹 1 , 𝜹 2 , and 𝜹 1 + 𝜹 2 in
ℝ2 . Then the linear combination

(𝜹 1 + 𝜹 2 ) ⊗ (𝜹 1 + 𝜹 2 ) − 𝜹 1 ⊗ 𝜹 2 − 𝜹 2 ⊗ 𝜹 1 = [ 1 0 ; 0 1 ]    (1.1)

is equal to the sum in the previous example, but has an ostensibly different representa-
tion. We can show that these representations are equivalent by using the axioms in
Definition 1.1 to distribute the sums in the first term of Equation (1.1) and simplify. 
Once these equivalence classes have been established, we can rigorously define a
tensor product space as follows.

Definition 1.9 (Tensor product space). Let H be a finite-dimensional Hilbert space over a field 𝔽 .
The tensor product space H ⊗ H is defined by all linear combinations of expressions
𝒙 ⊗ 𝒚 with 𝒙 , 𝒚 ∈ H modulo the axioms presented in Section 1.1.2.

In summary: 𝒙 ⊗𝒚 is an expression formed from two vectors; taking linear combinations


of such expressions forms a new space, which is well defined if we impose the equivalence
relation defined by the axioms in Section 1.1.2.
Exercise 1.10 (Extension to different spaces). Explain how to construct the tensor product
of two different finite-dimensional Hilbert spaces H1 and H2 .

1.3 Isomorphism to bilinear forms


Although we have completely defined a tensor product space, the preceding discussions
are somewhat abstract. In this section, we will present an alternate way to view a

tensor product space by describing an isomorphism between tensor product spaces


and bilinear forms. This is particularly useful because bilinear forms are isomorphic to
matrices, so tensor products can be identified with matrices.

1.3.1 Bilinear forms


First, we need the machinery of bilinear forms.

Definition 1.11 (Bilinear forms). A bilinear form on a Hilbert space H is a scalar-valued


function 𝐵 : H × H → 𝔽 that maps two arguments from H to a field 𝔽 and satisfies
the following properties.

1.For all 𝒙 ∈ H, the map 𝐵 (𝒙 , ·) is a linear functional on H.


2. For all 𝒚 ∈ H, the map 𝐵 (·, 𝒚 ) is a linear functional on H.

We will denote the space of bilinear forms acting on H × H by Bil ( H) .

Bilinear forms can be viewed more concretely by noting their correspondence with
matrices.
Proposition 1.12 (Bilinear forms are isomorphic to matrices). The linear space Bil ( H) is
isomorphic to 𝕄𝑛 (𝔽) . 𝕄𝑛 (𝔽) is the linear space of 𝑛 × 𝑛
matrices over the field 𝔽 .
Proof. Every matrix 𝑨 ∈ 𝕄𝑛 (𝔽) has an associated bilinear form defined by
𝐵 𝑨 (𝒙 , 𝒚 ) = ∑_{𝑖 ,𝑗 =1}^{𝑛} 𝑥𝑖 𝑎𝑖 𝑗 𝑦 𝑗 = 𝒙 ᵀ 𝑨𝒚 .

Conversely, every bilinear form 𝐵 on H has an associated matrix 𝑨 defined by 𝑎𝑖 𝑗 =


𝐵 (𝒆 𝑖 , 𝒆 𝑗 ) for 𝑖 , 𝑗 = 1, . . . , 𝑛 . A bilinear form can thus be viewed as equivalent to a
matrix. 
This observation echoes a connection to tensor product spaces; the outer product
of two vectors is, after all, a matrix. This observation also provides a straightforward
way to see that the linear space Bil ( H) of bilinear forms on H has dimension ( dim H) 2 ,
since this is the dimension of 𝕄𝑛 (𝔽) .
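
As a small numerical illustration (added here; the 2 × 2 matrix below is an arbitrary choice), the two directions of this correspondence can be checked in NumPy: a matrix defines a bilinear form through 𝒙 ᵀ 𝑨𝒚 , and evaluating the form on pairs of standard basis vectors recovers the matrix.

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = lambda x, y: x @ A @ y                    # bilinear form B_A(x, y) = x^T A y

n = A.shape[0]
E = np.eye(n)                                 # standard basis vectors as rows
A_recovered = np.array([[B(E[i], E[j]) for j in range(n)] for i in range(n)])
print(np.allclose(A_recovered, A))            # True: a_ij = B(e_i, e_j)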

1.3.2 Connection to tensor product spaces


Having defined bilinear forms, we can present an alternate definition for tensor product
spaces.

Definition 1.13 (Tensor product space). The tensor product space H ⊗ H is the algebraic
dual space of the space Bil ( H) of bilinear forms. Recall that the algebraic dual space of
a linear space V is the space of all
linear functionals on V.
Concretely, we identify each elementary tensor in H ⊗ H as a linear functional on
Bil ( H) via the following mapping:
𝒙 ⊗ 𝒚 : 𝐵 ↦→ 𝐵 (𝒙 , 𝒚 ).
This can be easily extended to all tensors by linearity as follows:
∑_𝑖 𝛼𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 : 𝐵 ↦→ ∑_𝑖 𝛼𝑖 𝐵 (𝒙 𝑖 , 𝒚 𝑖 ).
Now, we must show that this construction satisfies the axioms in Definition 1.1.
In the following discussion,1 let us consider bilinear forms 𝐵 with underlying matrix
representations 𝑩 .
1For brevity, we do not check every case for each axiom, but all the other cases follow similarly.

1. Additivity. Additivity follows from the linearity of matrix–vector multiplication.


In particular, the mapping defined by (𝒙 + 𝒚 ) ⊗ 𝒛 acts on 𝐵 to output

𝐵 (𝒙 + 𝒚 , 𝒛 ) = (𝒙 ᵀ + 𝒚 ᵀ )𝑩𝒛
= 𝒙 ᵀ 𝑩𝒛 + 𝒚 ᵀ 𝑩𝒛
= 𝐵 (𝒙 , 𝒛 ) + 𝐵 (𝒚 , 𝒛 ),

which corresponds to the mapping defined by 𝒙 ⊗ 𝒛 + 𝒚 ⊗ 𝒛


2. Homogeneity. Homogeneity follows similarly to additivity. In particular, the
mapping defined by 𝛼𝒙 ⊗ 𝒚 acts on 𝐵 to output

𝐵 (𝛼𝒙 , 𝒚 ) = 𝛼𝒙 ᵀ 𝑩𝒚
= 𝒙 ᵀ 𝑩 (𝛼𝒚 )
= 𝐵 (𝒙 , 𝛼𝒚 ),

which is the mapping defined by 𝒙 ⊗ 𝛼𝒚 .


3. Interaction with zero. A zero vector in either argument will make terms like 𝒙 ᵀ 𝑩𝒚
equal to 0, so
𝐵 (𝒙 , 0) = 𝐵 ( 0, 𝒚 ) = 𝐵 ( 0, 0) = 0,
which implies that the mappings defined by 𝒙 ⊗ 0, and 0 ⊗ 𝒚 , and 0 ⊗ 0 are all
identical.
4. Faithfulness. If 𝒙 ⊗ 𝒚 = 0 ⊗ 0, then every bilinear form is mapped to 0, since
0ᵀ 𝑩 0 = 0. The only way for every matrix to be mapped to 0 is if 𝒙 = 0 or
𝒚 = 0; if 𝒙 and 𝒚 were both nonzero, we could always consider 𝑩 = 𝒙 𝒚 ᵀ ; then

𝒙 ᵀ 𝑩𝒚 = 𝒙 ᵀ 𝒙 𝒚 ᵀ 𝒚 = ‖𝒙 ‖² ‖𝒚 ‖² > 0,

which would violate the assumption that the tensor maps all bilinear forms to
zero.

Example 1.14 (Matrix associations in 𝔽 𝑛 ). If H ≅ 𝔽 𝑛 , then the space of bilinear forms
Bil ( H) ≅ 𝕄𝑛 (𝔽) , and H ⊗ H = 𝕄𝑛 (𝔽) . We can then identify 𝒙 ⊗ 𝒚 with 𝒙 𝒚 ᵀ . Because
bilinear forms are isomorphic to matrices and tensors are isomorphic to the dual of the
space of bilinear forms, tensors are also isomorphic to matrices. 

1.4 Inner products


Since the constituent Hilbert spaces have an inner product, it is natural to want an
inner product that acts on vectors from the tensor product space and that interacts
well with the inner products of the underlying Hilbert spaces. We define such an inner
product as follows.

Definition 1.15 (Tensor inner product). Given two elementary tensors 𝒙 1 ⊗ 𝒙 2 and
𝒚 1 ⊗ 𝒚 2 , let the inner product between the two elementary tensors be given by

⟨𝒙 1 ⊗ 𝒙 2 , 𝒚 1 ⊗ 𝒚 2 ⟩ ≔ ⟨𝒙 1 , 𝒚 1 ⟩⟨𝒙 2 , 𝒚 2 ⟩.

Extend to all tensors by (semi)linearity.



Concretely, the inner product between two general tensors in H ⊗ H is required to
obey

⟨ ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 𝒙 1^{(𝑖 )} ⊗ 𝒙 2^{(𝑖 )} , ∑_{𝑗 =1}^{𝑝} 𝛽 𝑗 𝒚 1^{(𝑗 )} ⊗ 𝒚 2^{(𝑗 )} ⟩
   = ∑_{𝑖 =1}^{𝑟} ∑_{𝑗 =1}^{𝑝} 𝛼𝑖 ∗ 𝛽 𝑗 ⟨𝒙 1^{(𝑖 )} ⊗ 𝒙 2^{(𝑖 )} , 𝒚 1^{(𝑗 )} ⊗ 𝒚 2^{(𝑗 )} ⟩
   = ∑_{𝑖 =1}^{𝑟} ∑_{𝑗 =1}^{𝑝} 𝛼𝑖 ∗ 𝛽 𝑗 ⟨𝒙 1^{(𝑖 )} , 𝒚 1^{(𝑗 )} ⟩ ⟨𝒙 2^{(𝑖 )} , 𝒚 2^{(𝑗 )} ⟩.
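
In coordinates, one may identify the elementary tensor 𝒙 1 ⊗ 𝒙 2 with the length-𝑛² Kronecker vector kron(𝒙 1 , 𝒙 2 ) . The following NumPy sketch (an addition for illustration, with randomly generated test vectors) checks that the standard inner product then reproduces the tensor inner product.

import numpy as np

rng = np.random.default_rng(0)
x1, x2, y1, y2 = [rng.standard_normal(3) + 1j * rng.standard_normal(3) for _ in range(4)]

# np.vdot conjugates its first argument, matching the convention that the inner
# product is conjugate linear in its first coordinate.
lhs = np.vdot(np.kron(x1, x2), np.kron(y1, y2))
rhs = np.vdot(x1, y1) * np.vdot(x2, y2)
print(np.allclose(lhs, rhs))                  # True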

Exercise 1.16 (The tensor inner product is well-defined). Show that the tensor inner product
is well-defined. In this case, this means that two different representatives of the same
equivalence class (i.e., they are related by the axioms in Definition 1.1) should lead to
the same inner product.
Proposition 1.17 (Tensor orthonormal basis). Given an orthonormal basis {𝒆 1 , 𝒆 2 , . . . , 𝒆 𝑛 }
for H, the basis {𝒆 𝑖 ⊗ 𝒆 𝑗 : 𝑖 , 𝑗 = 1, . . . , 𝑛} is an orthonormal basis for H ⊗ H. In
particular, the dimension of H ⊗ H is ( dim H) 2 .

Proof. Using the tensor inner product, we note that

⟨𝒆 𝑖 ⊗ 𝒆 𝑗 , 𝒆 𝑖 ⊗ 𝒆 𝑗 ⟩ = ⟨𝒆 𝑖 , 𝒆 𝑖 ⟩⟨𝒆 𝑗 , 𝒆 𝑗 ⟩ = 1.

However, for index pairs (𝑘 , 𝑙 ) ≠ (𝑖 , 𝑗 ) ,

⟨𝒆 𝑖 ⊗ 𝒆 𝑗 , 𝒆 𝑘 ⊗ 𝒆 𝑙 ⟩ = ⟨𝒆 𝑖 , 𝒆 𝑘 ⟩⟨𝒆 𝑗 , 𝒆 𝑙 ⟩ = 0,

using the orthonormality of the basis on H.


It remains to be shown that this basis is total; that is, there is no tensor linearly
independent from the span of basis vectors. For the finite-dimensional case, this
is straightforward. Any vector in H ⊗ H can definitionally be expressed as a linear
combination of elementary tensors, which can then be decomposed into the underlying
bases. If 𝛼, 𝛽, 𝛾 ∈ 𝔽 and 𝑟 ∈ ℕ, then

∑_{𝑖 =1}^{𝑟} 𝛼𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 = ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 ( ∑_{𝑗 =1}^{𝑛} 𝛽 𝑗 𝒆 𝑗 ) ⊗ ( ∑_{𝑘 =1}^{𝑛} 𝛾𝑘 𝒆 𝑘 )
   = ∑_{𝑖 =1}^{𝑟} ∑_{𝑗 ,𝑘 =1}^{𝑛} 𝜆𝑖 𝑗 𝑘 𝒆 𝑗 ⊗ 𝒆 𝑘 ,

where we have applied Definition 1.1 to reduce all the terms to a linear combination
of elements from the tensor orthonormal basis. This shows that H ⊗ H is in the span
of the set of elementary tensors. Every linear combination of elementary tensors is
trivially in H ⊗ H, which completes the proof. 
Exercise 1.18 (The tensor orthonormal basis is complete). Repeat the previous proof of
completeness using the equivalence of tensor product spaces with bilinear forms. Hint:
Remember the correspondence between bilinear forms and matrices.
We will often choose to order the orthonormal basis lexicographically; that is, we
say 𝒆 𝑗 ⊗ 𝒆 𝑘 precedes 𝒆 𝑚 ⊗ 𝒆 𝑛 if and only if 𝑗 < 𝑚 , or 𝑗 = 𝑚 and 𝑘 < 𝑛 . (In other
words, we sort by the first index, then the second.)

1.5 Theory of linear operators


Having found that H ⊗ H is a vector space in its own right, a natural further step is to
define linear operators acting on H ⊗ H.

Definition 1.19 (Elementary tensor operators). Given linear operators 𝑨, 𝑩 ∈ L( H)
acting on H and vectors 𝒙 , 𝒚 ∈ H, define a linear operator acting on H ⊗ H by the
following action on the elementary tensor 𝒙 ⊗ 𝒚 :

(𝑨 ⊗ 𝑩) (𝒙 ⊗ 𝒚 ) = (𝑨𝒙 ) ⊗ (𝑩𝒚 ).

(For the finite-dimensional case, every linear operator can of course be described by a
matrix.) We extend the action to a general tensor in H ⊗ H by linearity; that is,

(𝑨 ⊗ 𝑩) ( ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 𝒙 𝑖 ⊗ 𝒚 𝑖 ) = ∑_{𝑖 =1}^{𝑟} 𝛼𝑖 (𝑨𝒙 𝑖 ) ⊗ (𝑩𝒚 𝑖 ).

A general linear operator can be constructed as a sum of elementary linear operators.


This construction of linear operators interacts smoothly with familiar operations from
linear algebra. The following example on compositions of linear operators is quite
useful to derive further properties of tensored linear operators using properties of the
constituent operators.
Proposition 1.20 (Composition). Given linear operators 𝑨, 𝑩, 𝑪 and 𝑫 acting on a Hilbert
space H and vectors 𝒙 , 𝒚 ∈ H, the composition of the linear operators 𝑨 ⊗ 𝑩 and 𝑪 ⊗ 𝑫
is given by (𝑨 ⊗ 𝑩) (𝑪 ⊗ 𝑫) = (𝑨𝑪 ) ⊗ (𝑩𝑫) .

Proof. By considering the action of linear operators on elementary tensors, we have

(𝑨 ⊗ 𝑩) (𝑪 ⊗ 𝑫) (𝒙 ⊗ 𝒚 ) = (𝑨 ⊗ 𝑩) ((𝑪 𝒙 ) ⊗ (𝑫𝒚 )) = (𝑨𝑪 𝒙 ) ⊗ (𝑩𝑫𝒚 ),

which implies that (𝑨 ⊗ 𝑩) (𝑪 ⊗ 𝑫) = (𝑨𝑪 ) ⊗ (𝑩𝑫) . 
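
In coordinates, 𝑨 ⊗ 𝑩 is realized by the Kronecker product, and the composition rule becomes the familiar mixed-product property. A short NumPy check (an addition for illustration, with random test matrices):

import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = [rng.standard_normal((3, 3)) for _ in range(4)]

# (A tensor B)(C tensor D) = (AC) tensor (BD), realized with np.kron
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))   # True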


Corollary 1.21 (Identity and inverses). The identity operator on H ⊗ H is given by I ⊗ I,
where I is the identity operator on the Hilbert space H. Furthermore, the inverse of an
elementary operator is given by (𝑨 ⊗ 𝑩) −1 = (𝑨 −1 ⊗ 𝑩 −1 ) provided that both 𝑨 and
𝑩 are invertible.
Proof. Use Proposition 1.20. 
We can also consider other operations familiar from linear algebra.
Proposition 1.22 (Adjoint). The adjoint of an elementary linear operator 𝑨 ⊗ 𝑩 is given
by (𝑨 ⊗ 𝑩) ∗ = 𝑨 ∗ ⊗ 𝑩 ∗ . (Recall that the adjoint of a linear map 𝑨 on a Hilbert space H
is the map 𝑨 ∗ such that ⟨𝑨𝒙 , 𝒚 ⟩ = ⟨𝒙 , 𝑨 ∗ 𝒚 ⟩ for all 𝒙 , 𝒚 ∈ H.)

Proof. Given vectors 𝒙 𝑖 , 𝒚 𝑖 ∈ H, we can express the condition defining the adjoint as

⟨(𝑨 ⊗ 𝑩) (𝒙 1 ⊗ 𝒚 1 ), 𝒙 2 ⊗ 𝒚 2 ⟩ = ⟨𝑨𝒙 1 ⊗ 𝑩𝒚 1 , 𝒙 2 ⊗ 𝒚 2 ⟩
   = ⟨𝑨𝒙 1 , 𝒙 2 ⟩⟨𝑩𝒚 1 , 𝒚 2 ⟩
   = ⟨𝒙 1 , 𝑨 ∗ 𝒙 2 ⟩⟨𝒚 1 , 𝑩 ∗ 𝒚 2 ⟩
   = ⟨𝒙 1 ⊗ 𝒚 1 , (𝑨 ∗ ⊗ 𝑩 ∗ )(𝒙 2 ⊗ 𝒚 2 )⟩.

Comparing the first and final lines, (𝑨 ⊗ 𝑩) ∗ = 𝑨 ∗ ⊗ 𝑩 ∗ . 


As seen above, this construction of linear operators makes it obvious when a
tensored linear operator preserves properties of linear operators on the constituent
spaces. A particularly important property is persistence.
Proposition 1.23 (Persistence). For all 𝑨, 𝑩 ∈ L( H) , the following statements hold.

1. If 𝑨 and 𝑩 are self-adjoint, then 𝑨 ⊗ 𝑩 is self-adjoint.


2. If 𝑨 and 𝑩 are unitary, then 𝑨 ⊗ 𝑩 is unitary.

3. If 𝑨 and 𝑩 are normal, then 𝑨 ⊗ 𝑩 is normal. (Recall that a matrix 𝑨 is normal if it
commutes with its conjugate transpose; i.e., 𝑨 ∗ 𝑨 = 𝑨𝑨 ∗ .)
4. If 𝑨 and 𝑩 are positive semidefinite, then 𝑨 ⊗ 𝑩 is positive semidefinite.
The converses do not necessarily hold.

Proof. As an example, let us prove the case for normal operators. If 𝑨, 𝑩 are normal,
then
(𝑨 ⊗ 𝑩) (𝑨 ⊗ 𝑩) ∗ = (𝑨 ⊗ 𝑩)(𝑨 ∗ ⊗ 𝑩 ∗ )
= (𝑨𝑨 ∗ ⊗ 𝑩𝑩 ∗ )
= (𝑨 ∗ 𝑨) ⊗ (𝑩 ∗ 𝑩)
= (𝑨 ∗ ⊗ 𝑩 ∗ )(𝑨 ⊗ 𝑩).
Therefore, 𝑨 ⊗ 𝑩 is also normal. 
Example 1.24 (Quantum computing). Quantum computers, devices that use properties
of quantum states to perform calculations, are studied using a circuit model where
each quantum gate acts on two qubits (the quantum equivalent of a bit). Formally, the
Hilbert space of a quantum circuit is the tensor product of the Hilbert spaces of all the
qubits. For physical reasons, only unitary operations are permitted in quantum circuits.
Because each two-qubit gate is a unitary operation, the unitary case of persistence
shows that the action on the entire space is also unitary. For further references, see
[NC00]. 

1.6 Spectral theory


Results describing the spectra of linear operators are fundamentally important in linear
algebra, so it makes sense to ask about the eigenvalues of a linear operator acting on a
tensor product space. We can use the construction developed in the previous section to
easily answer this question.
Proposition 1.25 (Spectral theory). If 𝑨 ∈ L( H) has eigenvalues 𝜆 1 , . . . , 𝜆𝑛 ∈ ℂ and
associated eigenvectors 𝒖 1 , . . . , 𝒖 𝑛 ∈ H, then 𝑨 ⊗ 𝑨 has eigenvalues {𝜆𝑖 𝜆 𝑗 : 𝑖 , 𝑗 =
1, . . . , 𝑛} with associated eigenvectors {𝒖 𝑖 ⊗ 𝒖 𝑗 : 𝑖 , 𝑗 = 1, . . . , 𝑛}.

Proof. We prove the expression for the eigenvalues using the Schur decomposition. (The
Schur decomposition expresses an arbitrary complex square matrix 𝑨 as 𝑨 = 𝑸𝑻 𝑸 ∗ ,
where 𝑸 is unitary and 𝑻 is an upper triangular matrix whose diagonal entries are the
eigenvalues of 𝑨 .) We compute

𝑨 ⊗ 𝑨 = (𝑸𝑻 𝑸 ∗ ) ⊗ (𝑸𝑻 𝑸 ∗ ) = (𝑸 ⊗ 𝑸 ) (𝑻 ⊗ 𝑻 )(𝑸 ⊗ 𝑸 ) ∗ .

The first equality applies the Schur decomposition; the second follows from Propositions
1.20 and 1.22. The center term is an upper triangular matrix with diagonal elements
{𝜆𝑖 𝜆 𝑗 : 𝑖 , 𝑗 = 1, . . . , 𝑛} whose ordering is given by the lexicographic ordering of the
orthonormal basis for the tensor product space. Reading off these entries proves the result.
For the eigenvectors, we can simply note that

(𝑨 ⊗ 𝑨) (𝒖 𝑗 ⊗ 𝒖 𝑘 ) = (𝑨𝒖 𝑗 ) ⊗ (𝑨𝒖 𝑘 ) = 𝜆 𝑗 𝜆𝑘 (𝒖 𝑗 ⊗ 𝒖 𝑘 ),
which proves the result. 
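
A numerical sanity check (an addition for illustration; the random symmetric test matrix keeps the spectra real and easy to compare): realizing 𝑨 ⊗ 𝑨 as the Kronecker product, its eigenvalues are the pairwise products of the eigenvalues of 𝑨 .

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M + M.T                                     # self-adjoint, so the spectrum is real

eigs_A = np.linalg.eigvalsh(A)
products = np.sort(np.multiply.outer(eigs_A, eigs_A).ravel())
eigs_kron = np.linalg.eigvalsh(np.kron(A, A))   # already sorted ascending
print(np.allclose(products, eigs_kron))         # True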
A similar construction holds for singular values.
Proposition 1.26 (Singular value decomposition). If a linear operator 𝑨 ∈ L( H) has a
singular value decomposition (SVD) given by 𝑨 = 𝑼 𝚺𝑽 ∗ , where 𝑼 and 𝑽 are unitary
and 𝚺 = diag (𝜎1 , . . . , 𝜎𝑛 ) , then the singular values of 𝑨 ⊗ 𝑨 are {𝜎𝑖 𝜎 𝑗 : 𝑖 , 𝑗 =
1, . . . , 𝑛}.

Proof. The tensor operator 𝑨 ⊗ 𝑨 can be written as

𝑨 ⊗ 𝑨 = (𝑼 𝚺𝑽 ∗ ) ⊗ (𝑼 𝚺𝑽 ∗ )
= (𝑼 ⊗ 𝑼 ) (𝚺 ⊗ 𝚺)(𝑽 ⊗ 𝑽 ) ∗ ,

which has the same form as the SVD. As the center term is diagonal, we can read off
the entries in lexicographic order to prove the result. 
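
The analogous check for singular values (again an illustrative addition, with a random test matrix):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))

s = np.linalg.svd(A, compute_uv=False)                     # descending order
products = np.sort(np.multiply.outer(s, s).ravel())[::-1]
s_kron = np.linalg.svd(np.kron(A, A), compute_uv=False)
print(np.allclose(products, s_kron))                       # True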
The spectral decomposition and singular value decomposition therefore provide
a method to construct tensor operators based on relations to the underlying linear
operators with particular eigenvalues and eigenvectors.

Notes
Much of the material in the section is discussed briefly in [Bha97, Chap. I]. The
book [Hal74] contains a more mathematical description of bilinear forms and tensor
products.

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Hal74] P. R. Halmos. Finite-dimensional vector spaces. 2nd ed. Springer-Verlag, New York-
Heidelberg, 1974.
[NC00] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information.
Cambridge University Press, Cambridge, 2000.
2. Multilinear Algebra

Date: 6 January 2022 Scribe: Edoardo Calvello

Agenda:
1. Multivariate tensor product
2. Permutations
3. Wedge products
4. Wedge operators
5. Spectral theory of wedge operators
6. Determinants
7. Symmetric tensor product

In the first lecture, we considered an arbitrary 𝑛 -dimensional Hilbert space H
equipped with the sesquilinear inner product ⟨·, ·⟩ and a distinguished orthonormal basis
(𝒆 1 , . . . , 𝒆 𝑛 ) . We defined the bivariate tensor product space H ⊗ H as the algebraic
dual of Bil ( H) , the space of bilinear forms taking H × H to 𝔽 , and we characterized the
general form of its elements, called tensors. Extending the definition of L( H) to linear
operators on H ⊗ H leads to the space L( H ⊗ H) of linear operators on tensors. We
developed a spectral theory for elementary tensor operators.
It is thus natural to ask whether these constructions can be extended to the
multivariate case. In this lecture, we introduce the 𝑘 -variate multilinear functionals
as a mechanism to construct the 𝑘 -fold tensor product space ⊗𝑘 H and develop the
relevant theory. Starting from the observation that the tensor product does not
commute, we consider the idea of summing over all permutations of the factors of the
product. After recalling some background material on permutations, we define the
antisymmetric tensor product, more briefly called the wedge product. It holds that
the associated wedge product space ∧𝑘 H forms a linear subspace of ⊗𝑘 H. Drawing
from our discussion on tensor product operators, this leads us to the theory of wedge
operators and their relationship to the determinant function. Finally, we mention the
symmetric tensor product and investigate its relation to the permanent.

2.1 Multivariate tensor product


In this section, we aim to extend the discussion on bivariate tensor products from the
previous chapter to the multivariate setting and develop the relevant theory.

2.1.1 Construction of multivariate tensor product


To extend the concept of a bilinear form to a higher-order product space, we appeal to
the following definition.

Definition 2.1 (𝑘 -variate multilinear functional). For 𝑘 ∈ ℕ, a function 𝐵 : H𝑘 → 𝔽


that is linear in each coordinate is called a multilinear functional. We denote by
ML𝑘 ( H) the space of 𝑘 -variate multilinear functionals.

Starting from the algebraic dual of the space ML𝑘 ( H) , denoted by ( ML𝑘 ( H)) ∗ , we
define the 𝑘 -fold tensor product space ⊗𝑘 H.

Definition 2.2 (Multivariate tensor product space). Given any Hilbert space H, we define
the 𝑘 -fold tensor product space ⊗𝑘 H as the algebraic dual of the space ML𝑘 ( H) .
Indeed, we identify ⊗𝑘 H ≅ ( ML𝑘 ( H)) ∗ .

Let us next define a 𝑘 -variate elementary tensor.



Definition 2.3 (𝑘 -variate elementary tensor). For any 𝒙 1 , . . . , 𝒙 𝑘 ∈ H, we call 𝒙 1 ⊗


· · · ⊗ 𝒙 𝑘 a 𝑘 -variate elementary tensor.

Following this definition, we can identify each elementary tensor with a linear
functional on the space ML𝑘 ( H) . Indeed, for any 𝒙 1 , . . . , 𝒙 𝑘 ∈ H, we define the
action of 𝒙 1 ⊗ · · · ⊗ 𝒙 𝑘 on ML𝑘 ( H) by

(𝒙 1 ⊗ · · · ⊗ 𝒙 𝑘 ) (𝐵) = 𝐵 (𝒙 1 , . . . , 𝒙 𝑘 ) for any 𝐵 ∈ ML𝑘 ( H) .

Thus, the elementary tensor 𝒙 1 ⊗ · · · ⊗ 𝒙 𝑘 determines a linear functional on ML𝑘 ( H) .


For any 𝒙 1 , . . . , 𝒙 𝑘 ∈ H, we have that 𝒙 1 ⊗ · · · ⊗ 𝒙 𝑘 ∈ ⊗𝑘 H. Now, since any
linear combination of linear functionals is a linear functional, it holds that any linear
combination of elementary tensors belongs to the space ⊗𝑘 H. The following proposition
completes the description of the 𝑘 -fold tensor product space ⊗𝑘 H.
Proposition 2.4 (Multivariate tensors). Any element 𝑥 ∈ ⊗𝑘 H is a linear combination of
elementary tensors. We call an arbitrary element of ⊗𝑘 H a 𝑘 -variate tensor.
Exercise 2.5 (Multivariate tensors). Provide a proof for Proposition 2.4.
The space of multilinear functionals satisfies dim ( ML𝑘 ( H)) = ( dim ( H)) 𝑘 . In the
finite-dimensional setting, the dual space has the same dimension. We may conclude
that dim ( ⊗𝑘 H) = ( dim ( H)) 𝑘 .

2.1.2 Inner product


Since H is a Hilbert space with the inner product ⟨·, ·⟩ , we can equip the 𝑘 -fold tensor
product space ⊗𝑘 H with its own inner product ⟨·, ·⟩ ⊗𝑘 .

Definition 2.6 (Inner product for multivariate tensor space). The inner product ⟨·, ·⟩ ⊗𝑘 is
defined on elementary tensors in ⊗𝑘 H by

⟨𝒙 1 ⊗ · · · ⊗ 𝒙 𝑘 , 𝒚 1 ⊗ · · · ⊗ 𝒚 𝑘 ⟩ ⊗𝑘 ≔ ∏_{𝑖 =1}^{𝑘} ⟨𝒙 𝑖 , 𝒚 𝑖 ⟩,

for any 𝒙 1 , . . . , 𝒙 𝑘 , 𝒚 1 , . . . , 𝒚 𝑘 ∈ H.

Exercise 2.7 (Tensor inner product). Verify that the function ⟨·, ·⟩ ⊗𝑘 as introduced in
Definition 2.6 is an inner product.
Indeed, this inner product can be extended to all tensors in ⊗𝑘 H by linearity. Given
the inner product ⟨·, ·⟩ ⊗𝑘 , it is possible to find an orthonormal basis for the space ⊗𝑘 H.
Proposition 2.8 (Orthonormal basis for ⊗𝑘 H). The set

{ 𝒆 𝑖1 ⊗ · · · ⊗ 𝒆 𝑖𝑘 : 𝑖 𝑗 = 1, . . . , 𝑛 ; 𝑗 = 1, . . . , 𝑘 }

forms an orthonormal basis for the space ⊗𝑘 H.


Exercise 2.9 (Tensor orthonormal basis). Using the fact that ( dim ( H)) 𝑘 = 𝑛 𝑘 , prove
Proposition 2.8.

2.1.3 Linear operators on multivariate tensor product


We can also introduce a 𝑘 -fold tensor product operator as follows.

Definition 2.10 (𝑘 -fold tensor product operator). Given 𝑨 ∈ L( H) , we define the
(elementary) tensor product operator ⊗𝑘 𝑨 : ⊗𝑘 H → ⊗𝑘 H via

( ⊗𝑘 𝑨 ) (𝒙 1 ⊗ · · · ⊗ 𝒙 𝑘 ) = (𝑨𝒙 1 ) ⊗ · · · ⊗ (𝑨𝒙 𝑘 ),

for any 𝒙 1 , . . . , 𝒙 𝑘 ∈ H. We extend this definition to all tensors via linearity. (Note
that it is sufficient to define the action of the tensor product operator on elementary
tensors, as the definition can be extended to general tensors by linearity.)

Similarly to the bivariate case, the 𝑘 -fold tensor product operator ⊗𝑘 𝑨 possesses
the following important properties.
Proposition 2.11 (Tensor operator: Properties). Let 𝑨 ∈ L( H) . Let ⊗𝑘 𝑨 : ⊗𝑘 H → ⊗𝑘 H be
the 𝑘 -fold tensor product operator associated to 𝑨 .
 
1. Composition. For any 𝑨, 𝑩 ∈ L( H) , it holds that ⊗𝑘 (𝑨𝑩) = ⊗𝑘 𝑨 ⊗𝑘 𝑩 .
 −1
2. Inverses. For any 𝑨 ∈ L( H) , we have ⊗𝑘 𝑨 = ⊗𝑘 𝑨 −1 .

= ⊗𝑘 𝑨 ∗ .

3. Adjoint. For any 𝑨 ∈ L( H) , it holds that ⊗𝑘 𝑨
4. Persistence. If 𝑨 ∈ L( H) is self-adjoint, unitary, normal, or positive semidefinite,
then ⊗𝑘 𝑨 inherits the respective property.

Exercise 2.12 (Tensor product operators). Provide a proof for Proposition 2.11.

2.1.4 Spectral theory


As in the previous chapter, it is possible to generalize to a spectral theory of 𝑘 -fold
tensor product operators. We thus formulate the following two propositions.
Proposition 2.13 (Spectrum of 𝑘 -fold tensor product operator). Let 𝑨 ∈ L( H) , and let ⊗𝑘 𝑨 :
⊗𝑘 H → ⊗𝑘 H be the 𝑘 -fold tensor product operator associated to 𝑨 . The operator
⊗𝑘 𝑨 has eigenvalues {𝜆𝑖1 · · · 𝜆𝑖𝑘 : 1 ≤ 𝑖 𝑗 ≤ 𝑛, 𝑗 = 1, . . . , 𝑘 }, where 𝜆 1 , . . . , 𝜆𝑛 are
the eigenvalues of the linear operator 𝑨 .

Proof. As before, the proof of Proposition 2.13 follows directly from the Schur decom-
position of 𝑨 . 
Proposition 2.14 (Singular values of 𝑘 -fold tensor product operator). Let 𝑨 ∈ L( H) , and
let ⊗𝑘 𝑨 : ⊗𝑘 H → ⊗𝑘 H be the 𝑘 -fold tensor product operator associated to 𝑨 .
The operator ⊗𝑘 𝑨 has singular values {𝜎𝑖1 · · · 𝜎𝑖𝑘 : 1 ≤ 𝑖 𝑗 ≤ 𝑛, 𝑗 = 1, . . . , 𝑘 }, where
𝜎1 , . . . , 𝜎𝑛 are the singular values of the linear operator 𝑨 .
Proof. The proof of Proposition 2.14 again follows directly from the singular value
decomposition of 𝑨 . 
Exercise 2.15 (Tensor spectral theory). Elaborate on the details of the proofs of Propositions
2.13 and 2.14 by generalizing the arguments of Lecture 1 to the 𝑘 -fold tensor product
case.

2.2 Permutations
Before continuing our discussion on multilinear algebra, we first outline some back-
ground material on permutations which will be of use when defining antisymmetric
and symmetric tensor products. We present the definition of a permutation as follows.

Definition 2.16 (Permutation on 𝑘 symbols). A bijective function 𝜋 : {1, . . . , 𝑘 } →


{1, . . . , 𝑘 } is called a permutation on 𝑘 symbols.

Exercise 2.17 (Counting permutations). Argue that there are 𝑘 ! permutations on 𝑘 symbols.
We next provide two informative examples of permutations.
Example 2.18 (Some permutations). Both of the following functions 𝜋 1 , 𝜋 2 : {1, 2, 3} →
{1, 2, 3} represent permutations.
1. Let 𝜋 1 : {1, 2, 3} → {1, 2, 3} be defined by the tableau

   1 2 3
   1 3 2 .

2. Let 𝜋 2 : {1, 2, 3} → {1, 2, 3} be defined by

   1 2 3
   2 3 1 .

Each map 𝜋 1 , 𝜋 2 is a permutation on 3 symbols. 

We can define a particular type of permutation, known as a transposition.

Definition 2.19 (Transposition). A transposition is a permutation that interchanges


two symbols and maps all other symbols to themselves.

Example 2.20 (Transposition). Referring back to Example 2.18, we can observe that the
permutation 𝜋 1 is a transposition, while 𝜋 2 is not. 

We proceed by illustrating a well-known result regarding transpositions.

Theorem 2.21 (Representation of permutations by transpositions). Any permutation can


be expressed as a composition of transpositions.

Exercise 2.22 (*Transpositions). Sketch a proof for Theorem 2.21. Hint: For a more intuitive
argument, appeal to a “sorting” scheme. For a more group-theoretic argument, appeal
to the “cycle representation.” First, show that any permutation can be written as
a product of disjoint cycles, and then show that a product of disjoint cycles can be
expressed as a product of 2-cycles.
Furthermore, let us recall the following result.

Theorem 2.23 (Parity of a permutation). Fix a permutation 𝜋 . Either every repre-


sentation of 𝜋 as a composition of transpositions contains an even number of
transpositions or else every representation of 𝜋 as a composition of transpositions
contains an odd number of transpositions.

Exercise 2.24 (Parity). Provide a proof for Theorem 2.23.


A permutation is even when it is represented by an even number of transpositions,
and it is odd when it is represented by an odd number of transpositions. As a
consequence of Theorem 2.23, each permutation is either even or odd. Finally, we
define the signature of a permutation.

Definition 2.25 (Signature of a permutation). The signature 𝜀𝜋 of a permutation 𝜋 is defined by

𝜀𝜋 = +1 if 𝜋 is even, and 𝜀𝜋 = −1 if 𝜋 is odd.

It is important to note that Theorem 2.23 ensures that the signature of a permutation
is well-defined.
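For concreteness, here is a small computational sketch (ours, not from the notes) that evaluates the signature by counting inversions, which amounts to counting the transpositions performed by a naive sort, and checks the multiplicativity 𝜀 of composed permutations implied by Theorem 2.23.

```python
from itertools import permutations

def signature(perm):
    """Signature via the inversion count; each inversion corresponds to one transposition."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return 1 if inversions % 2 == 0 else -1

# Multiplicativity of the signature over all pairs of permutations on 3 symbols.
for p in permutations(range(3)):
    for q in permutations(range(3)):
        composed = tuple(p[q[i]] for i in range(3))   # (p o q)(i) = p(q(i))
        assert signature(composed) == signature(p) * signature(q)

print(signature((0, 2, 1)))   # pi_1 from Example 2.18 (a transposition): prints -1
print(signature((1, 2, 0)))   # pi_2 from Example 2.18 (a 3-cycle): prints +1
```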

2.3 Wedge products


The fact that the tensor product does not commute (in general, 𝒙 ⊗ 𝒚 ≠ 𝒚 ⊗ 𝒙 !) leads to the observation that it may be of interest to associate tensors whose factors are the same up to order. This can be accomplished by summing over all permutations of the factors of the product. In addition, it turns out to be fruitful to weight each term of the sum by the signature of the corresponding permutation. This weighting confers on the so-defined product the significant property of antisymmetry. Considering this idea of a weighted sum over permutations leads to the definition of the wedge product.

Definition 2.26 (Wedge product). The wedge product 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 of vectors 𝒙 1 , . . . , 𝒙 𝑘 ∈ H is defined as

𝒙 1 ∧ · · · ∧ 𝒙 𝑘 := (1/√(𝑘 !)) ∑_{𝜋 ∈ S𝑘} 𝜀𝜋 · 𝒙 𝜋(1) ⊗ · · · ⊗ 𝒙 𝜋(𝑘) ,

where S𝑘 is the symmetric group consisting of all permutations on 𝑘 symbols and where 𝜀𝜋 denotes the signature of 𝜋 . As such, 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 is clearly an element of the space ⊗𝑘 H.

The purpose of normalizing the wedge product by (𝑘 !) −1/2 is to ensure that the wedge product of orthonormal vectors yields a unit vector. We will turn to this matter later when we introduce the inner product.
Example 2.27 (Wedge product for 𝑘 = 2). When 𝑛 = 1, the wedge product 𝒙 ∧ 𝒚 = 0 always. In the case 𝑘 = 2, the wedge product of 𝒙 , 𝒚 ∈ H is the tensor

𝒙 ∧ 𝒚 = (1/√2) (𝒙 ⊗ 𝒚 − 𝒚 ⊗ 𝒙 ).

When 𝑛 = 3 and 𝔽 = ℝ, the wedge product is equivalent to the well-known cross product. For general 𝑛 and 𝑘 = 2, one can find an analogy between the wedge product and the skew part of a matrix. (Why?) ■

We observe that the signature of the permutation in the summand makes the wedge
product antisymmetric. Indeed, for any 𝑖 , 𝑗 ∈ {1, . . . , 𝑘 } we have that

𝒙 1 ∧ · · · ∧ 𝒙 𝑖 ∧ · · · ∧ 𝒙 𝑗 ∧ · · · ∧ 𝒙 𝑘 = −𝒙 1 ∧ · · · ∧ 𝒙 𝑗 ∧ · · · ∧ 𝒙 𝑖 ∧ · · · ∧ 𝒙 𝑘 .

Indeed, exchanging any two vectors introduces an additional transposition into each
permutation in Definition 2.26, which flips the signature. As a consequence, the wedge
product is also known as the antisymmetric tensor product.
The important observation above leads to the conclusion that if 𝒙 𝑖 = 𝒙 𝑗 for some
𝑖 ≠ 𝑗 in {1, . . . , 𝑘 }, then 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = 0. Exchanging 𝒙 𝑖 and 𝒙 𝑗 does not change
the product, but flips its sign. Therefore, since this product is an element of a vector
space, 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = 0.

This statement has a remarkable generalization to the case in which (𝒙 1 , . . . , 𝒙 𝑘 )


is a linearly dependent set.
Proposition 2.28 (Linear dependence and wedge product). The set (𝒙 1 , . . . , 𝒙 𝑘 ) is linearly
dependent if and only if 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = 0.

Proof. Suppose first that the set is linearly dependent. Without loss of generality, it is possible to write 𝒙 1 = ∑_{𝑗=2}^{𝑘} 𝛼 𝑗 𝒙 𝑗 for some 𝛼 2 , . . . , 𝛼𝑘 ∈ 𝔽 . By linearity,

𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = ( ∑_{𝑗=2}^{𝑘} 𝛼 𝑗 𝒙 𝑗 ) ∧ 𝒙 2 ∧ · · · ∧ 𝒙 𝑘 = ∑_{𝑗=2}^{𝑘} 𝛼 𝑗 ( 𝒙 𝑗 ∧ 𝒙 2 ∧ · · · ∧ 𝒙 𝑘 ) = 0,

since each wedge product in the sum contains a repeated vector. The converse is true by a straightforward induction argument. ■
Exercise 2.29 (Wedge: Linear dependence). Recalling that any permutation can be written
as a composition of transpositions, complete the induction argument in the proof of
Proposition 2.28.
We can now proceed by constructing the wedge product space.

Definition 2.30 (Wedge product space). We define the wedge product space ∧𝑘 H by

∧𝑘 H := lin {𝒙 1 ∧ · · · ∧ 𝒙 𝑘 : 𝒙 𝑖 ∈ H and 𝑖 = 1, . . . , 𝑘 } .

As such, ∧𝑘 H is a linear subspace of the tensor product space ⊗𝑘 H.

As a linear subspace of ⊗𝑘 H, the space ∧𝑘 H thus inherits the inner product h·, ·i ⊗𝑘 ,
from which it is possible to construct an orthonormal basis.
Proposition 2.31 (Orthonormal basis for ∧𝑘 H). The set

𝒆 𝑖 1 ∧ · · · ∧ 𝒆 𝑖𝑘 : 1 ≤ 𝑖 1 < · · · < 𝑖 𝑘 ≤ 𝑛

forms an orthonormal basis for the space ∧𝑘 H.


Exercise 2.32 (Wedge: Orthonormal basis). Provide a proof for Proposition 2.31.

Since the orthonormal basis for the linear subspace ∧𝑘 H contains $\binom{n}{k}$ elements, we conclude that dim (∧𝑘 H) = $\binom{n}{k}$. In particular, we have dim (∧𝑛 H) = 1.

2.4 Wedge operators


We continue our discussion on wedge product spaces by introducing the wedge
operator.

Definition 2.33 (Wedge operator). Given 𝑨 ∈ L( H) , we define the (elementary)


wedge operator ∧𝑘 𝑨 : ∧𝑘 H → ∧𝑘 H via

∧𝑘 𝑨 (𝒙 1 ∧ · · · ∧ 𝒙 𝑘 ) = (𝑨𝒙 1 ) ∧ · · · ∧ (𝑨𝒙 𝑘 ),

for any 𝒙 1 , . . . , 𝒙 𝑘 ∈ H. We extend to the wedge space by linearity.



As such, we can identify the wedge operator ∧𝑘 𝑨 as the restriction of the tensor
operator ⊗𝑘 𝑨 to the wedge product space ∧𝑘 H. Indeed, appealing to Definition 2.26
one can conclude that for any 𝒙 1 , . . . , 𝒙 𝑘 ∈ H,

(⊗𝑘 𝑨)(𝒙 1 ∧ · · · ∧ 𝒙 𝑘 ) = (1/√(𝑘 !)) ∑_{𝜋 ∈ S𝑘} 𝜀𝜋 (𝑨𝒙 𝜋(1) ) ⊗ · · · ⊗ (𝑨𝒙 𝜋(𝑘) )
= (𝑨𝒙 1 ) ∧ · · · ∧ (𝑨𝒙 𝑘 )
= (∧𝑘 𝑨)(𝒙 1 ∧ · · · ∧ 𝒙 𝑘 ),

where the last equality follows from Definition 2.33. The wedge product space ∧𝑘 H is
thus an invariant subspace of the operator ⊗𝑘 𝑨 . As a consequence, the wedge operator
∧𝑘 𝑨 inherits all the properties of ⊗𝑘 𝑨 .
Proposition 2.34 (Wedge operator: Properties). Let 𝑨 ∈ L( H) , and let ∧𝑘 𝑨 : ∧𝑘 H → ∧𝑘 H
be the wedge operator associated to 𝑨 .
 
1. Composition. For any 𝑨, 𝑩 ∈ L( H), it holds that ∧𝑘 (𝑨𝑩) = (∧𝑘 𝑨)(∧𝑘 𝑩).
2. Inverses. For any invertible 𝑨 ∈ L( H), we have (∧𝑘 𝑨) −1 = ∧𝑘 (𝑨 −1 ).
3. Adjoint. For any 𝑨 ∈ L( H), it holds that (∧𝑘 𝑨) ∗ = ∧𝑘 (𝑨 ∗ ).
4. Persistence. If 𝑨 ∈ L( H) is self-adjoint, unitary, normal, or positive semidefinite, then ∧𝑘 𝑨 inherits the respective property.

Proof. Proposition 2.34 is a straightforward consequence of the wedge space ∧𝑘 H


being an invariant subspace of the operator ⊗𝑘 𝑨 . 

2.5 Spectral theory of wedge operators


Much like in the case of the tensor product operator, it is possible to develop a spectral
theory of wedge product operators. We thus formulate the following two propositions.
Proposition 2.35 (Spectrum of wedge operators). Let 𝑨 ∈ L( H) , and let ∧𝑘 𝑨 : ∧𝑘 H →
∧𝑘 H be the wedge operator associated  to 𝑨 . The operator ∧𝑘 𝑨 has eigenvalues
𝜆𝑖1 · · · 𝜆𝑖𝑘 : 1 ≤ 𝑖 1 < · · · < 𝑖𝑘 ≤ 𝑛 , where 𝜆 1 , . . . , 𝜆𝑛 are the eigenvalues of the linear
operator 𝑨 .

Proof. As in the case of the tensor product operator, the proof of Proposition 2.35
follows directly from the Schur decomposition of 𝑨 . This time, we use the fact that
the orthonormal basis for the wedge space consists of tensors 𝒆 𝑖 1 ∧ · · · ∧ 𝒆 𝑖𝑘 with no
repeated indices. 
Proposition 2.36 (Singular values of wedge operators). Let 𝑨 ∈ L( H) , and let ∧𝑘 𝑨 :
∧𝑘 H → ∧𝑘 H be the wedge operator associated  to 𝑨 . The operator ∧𝑘 𝑨 has singular
values 𝜎𝑖 1 · · · 𝜎𝑖𝑘 : 1 ≤ 𝑖 1 < · · · < 𝑖𝑘 ≤ 𝑛 , where 𝜎1 , . . . , 𝜎𝑛 are the singular values
of the linear operator 𝑨 .

Proof. As in the case of the tensor product operator, the proof of Proposition 2.36 follows directly from the singular value decomposition of 𝑨 . ■
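In coordinates, the matrix of ∧𝑘 𝑨 with respect to the orthonormal basis of Proposition 2.31 is the 𝑘 -th compound matrix, whose entries are the 𝑘 × 𝑘 minors of 𝑨 . The sketch below is an illustrative check under that identification (it is not part of the notes): it assembles the second compound of a random symmetric matrix and confirms Proposition 2.35 numerically.

```python
import numpy as np
from itertools import combinations

def compound(A, k):
    """k-th compound matrix: entries are k-by-k minors of A, indexed by increasing index sets."""
    n = A.shape[0]
    subsets = list(combinations(range(n), k))
    C = np.empty((len(subsets), len(subsets)))
    for a, rows in enumerate(subsets):
        for b, cols in enumerate(subsets):
            C[a, b] = np.linalg.det(A[np.ix_(rows, cols)])
    return C

rng = np.random.default_rng(2)
n, k = 4, 2
G = rng.standard_normal((n, n))
A = (G + G.T) / 2                                  # symmetric, so real eigenvalues

lam = np.linalg.eigvalsh(A)
wedge_eigs = np.sort(np.linalg.eigvalsh(compound(A, k)))
# Eigenvalues of the wedge operator: products lam[i] * lam[j] over strictly increasing pairs i < j.
products = np.sort([lam[i] * lam[j] for i, j in combinations(range(n), 2)])

assert np.allclose(wedge_eigs, products)
print("eigenvalues of the 2nd compound = pairwise eigenvalue products with i < j")
```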

2.6 Determinants
In this section, we develop the theory of determinants of matrices as an immediate consequence of the theory of wedge products.

Suppose that dim H = 𝑛 . Then we may identify the wedge operator ∧𝑛 𝑨 with the de-
terminant of 𝑨 . Indeed, the operator ∧𝑛 𝑨 acts on the space ∧𝑛 H = lin {𝒆 1 ∧ · · · ∧ 𝒆 𝑛 },
which is a one-dimensional space. Recall that a linear operator on a one-dimensional
space acts by scalar multiplication. In particular, the operator ∧𝑛 𝑨 scales by its
own single eigenvalue, 𝜆 1 · · · 𝜆𝑛 , where 𝜆 1 , . . . , 𝜆𝑛 are the eigenvalues of 𝑨 . These
considerations lead us to the following definition of the determinant.

Definition 2.37 (Determinant). For any linear operator 𝑨 ∈ L( H), the determinant of 𝑨 is defined as

det (𝑨) := ∧𝑛 𝑨 = 𝜆 1 · · · 𝜆𝑛 ,

where 𝜆 1 , . . . , 𝜆𝑛 are the eigenvalues of 𝑨 . (This definition is a little abusive because the wedge operator ∧𝑛 𝑨 is actually a scalar operator 𝛼 I acting on a one-dimensional subspace of ⊗𝑛 H.)

With this definition, all properties of the determinant of a matrix follow. In


particular, considering L( H) = 𝕄𝑛 (𝔽) , we have the following important result.
Proposition 2.38 (Properties of the determinant of a matrix). Given the space 𝕄𝑛 (𝔽) , the
determinant has the following properties.

1. Multiplicativity. For any 𝑨, 𝑩 ∈ 𝕄𝑛 (𝔽) ,

det (𝑨𝑩) = det (𝑨) · det (𝑩).

2. Multilinearity. The determinant is multilinear in the columns. Namely, letting 𝑨 = [𝒂 1 , . . . , 𝒂 𝑛 ], for any 𝛼 ∈ 𝔽 and 𝒃 𝑗 ∈ 𝔽 𝑛 ,

det ([𝒂 1 , . . . , 𝛼 · 𝒂 𝑗 + 𝒃 𝑗 , . . . , 𝒂 𝑛 ]) = 𝛼 · det ([𝒂 1 , . . . , 𝒂 𝑗 , . . . , 𝒂 𝑛 ]) + det ([𝒂 1 , . . . , 𝒃 𝑗 , . . . , 𝒂 𝑛 ]),

for all 𝑗 = 1, . . . , 𝑛 .
3. Antisymmetry. The determinant reverses the sign if two columns of its argument
are swapped. Namely, letting 𝑨 = [𝒂 1 , . . . , 𝒂 𝑛 ] , for any 𝑖 , 𝑗 ∈ {1, . . . , 𝑛},
 
det [𝒂 1 , . . . , 𝒂 𝑖 , . . . , 𝒂 𝑗 , . . . , 𝒂 𝑛 ] = −det [𝒂 1 , . . . , 𝒂 𝑗 , . . . , 𝒂 𝑖 , . . . , 𝒂 𝑛 ] .

4. Normalization. For the identity I ∈ 𝕄𝑛 (𝔽) , we have det ( I) = 1 .

Exercise 2.39 (Determinant). Using Definition 2.37, provide a proof for Proposition 2.38.
Finally, we note the following theorem.

Theorem 2.40 (Uniqueness of the determinant). The determinant is the unique function
from 𝕄𝑛 (𝔽) to 𝔽 with the above properties of multiplicativity, multilinearity,
antisymmetry and normalization.

Proof. For a proof of this statement see the first chapter of [Art11]. 

2.7 Symmetric tensor product


We recall that the antisymmetry of the wedge product arises from weighting the sum over all permutations by each permutation's signature. We omit the signature to define the symmetric tensor product.

Definition 2.41 (Symmetric tensor product). The symmetric tensor product 𝒙 1 ∨ · · · ∨ 𝒙 𝑘 of vectors 𝒙 1 , . . . , 𝒙 𝑘 ∈ H is defined as

𝒙 1 ∨ · · · ∨ 𝒙 𝑘 := (1/√(𝑘 !)) ∑_{𝜋 ∈ S𝑘} 𝒙 𝜋(1) ⊗ · · · ⊗ 𝒙 𝜋(𝑘) ,

where S𝑘 is the symmetric group consisting of all permutations on 𝑘 symbols. As such, 𝒙 1 ∨ · · · ∨ 𝒙 𝑘 is clearly an element of the space ⊗𝑘 H.

Indeed, for any 𝑖 , 𝑗 ∈ {1, . . . , 𝑘 } we have that

𝒙1 ∨ · · · ∨ 𝒙𝑖 ∨ · · · ∨ 𝒙 𝑗 ∨ · · · ∨ 𝒙𝑘 = 𝒙1 ∨ · · · ∨ 𝒙 𝑗 ∨ · · · ∨ 𝒙𝑖 ∨ · · · ∨ 𝒙𝑘 ,

hence the symmetry.


Similarly as for the wedge product, we can define the symmetric tensor product
space.
Exercise 2.42 (Symmetric tensor product space). By using Definition 2.41, define the symmetric tensor product space ∨𝑘 H. Establish that its dimension is $\binom{n+k-1}{k}$. How quickly does the dimension grow as 𝑘 → ∞ for fixed 𝑛 ?
Furthermore, it is possible to define the symmetric tensor product operator and
develop the relevant theory.
Exercise 2.43 (Symmetric tensor product operator). By using Definition 2.41 and the notion
of symmetric tensor product space ∨𝑘 H, define and develop the theory for the
symmetric tensor product operator.
Let us next define the permanent of a matrix.

Definition 2.44 (Permanent of a matrix). Let 𝑨 = [𝑎 𝑖 𝑗 ] be any matrix in 𝕄𝑛 (𝔽). The permanent of 𝑨 is defined as

perm (𝑨) := ∑_{𝜋 ∈ S𝑛} 𝑎 1𝜋(1) · · · 𝑎 𝑛𝜋(𝑛) ,

where S𝑛 is the symmetric group consisting of all permutations on 𝑛 symbols.

We refer to [Bha97] for an in depth discussion on the connections between the


symmetric tensor product operator and the permanent of a matrix.
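For completeness, here is a brute-force evaluation of Definition 2.44 (an illustrative snippet of ours, not from the notes). It runs in 𝑛 ! time, which reflects the fact that, unlike the determinant, the permanent admits no known efficient algorithm.

```python
import numpy as np
from itertools import permutations
from math import prod

def permanent(A):
    """Permanent via the definition: sum over permutations, with no signature weights."""
    n = A.shape[0]
    return sum(prod(A[i, pi[i]] for i in range(n)) for pi in permutations(range(n)))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(permanent(A))          # 1*4 + 2*3 = 10
print(np.linalg.det(A))      # 1*4 - 2*3 = -2, for comparison
```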

Notes
The material of this chapter can largely be found in [Bha97], albeit to varying degrees
of detail. In particular, the above discussions represent a more in-depth exploration
of the foundational elements of multilinear algebra. The interested reader can find
a further exploration of these topics in [Bha97]. For a refresher on topics relating to
linear algebra, we refer to [Art11].

Lecture bibliography
[Art11] M. Artin. Algebra. Pearson Prentice Hall, 2011.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
3. Majorization

Date: 11 January 2022 Scribe: Matthieu Darcy

This lecture introduces the concept of majorization of vectors, a way of comparing how “spiky” they are. Next, we explore the space of doubly stochastic matrices and a particular class of doubly stochastic matrices, the T-transforms. We relate both concepts by characterizing majorization via doubly stochastic matrices. We conclude with an application: the Schur–Horn theorem, which relates the diagonal entries and the eigenvalues of an Hermitian matrix.

Agenda:
1. Majorization
2. Doubly stochastic matrices
3. Characterization of majorization
4. Schur–Horn theorem

3.1 Majorization
In this section, we introduce the concept of majorization for vectors. Informally
speaking, a vector 𝒙 is majorized by a vector 𝒚 , written 𝒙 ≺ 𝒚 , if 𝒚 is “spikier” (or
more concentrated) than 𝒙 .

3.1.1 Setting
In this lecture, we will work in ℝ𝑛 equipped with the standard basis (𝜹 𝑖 : 𝑖 = 1, 2, . . . , 𝑛) and the canonical inner product ⟨𝒂, 𝒃⟩ := 𝒂 ᵀ𝒃 . We define the vector of ones:

1 := ( 1, 1, . . . , 1) ᵀ ∈ ℝ𝑛 .

The trace of a vector is

tr (𝒙 ) := ⟨1, 𝒙 ⟩ = ∑_{𝑖=1}^{𝑛} 𝑥𝑖 .
Much later, when we study positive linear maps, we will see that the trace of a vector
and the trace of a matrix are analogous.

3.1.2 Rearrangements
We now define rearrangements, a key concept in analysis.

Definition 3.1 (Rearrangement). Let 𝒙 ∈ ℝ𝑛 . The decreasing rearrangement 𝒙 ↓ ∈ ℝ𝑛


of 𝒙 is a vector with the same entries as 𝒙 , but placed in weakly decreasing order:

𝑥 1↓ ≥ 𝑥 2↓ ≥ · · · ≥ 𝑥𝑛↓ .

Likewise, the increasing rearrangement 𝒙 ↑ ∈ ℝ𝑛 has the same entries as 𝒙 , placed in weakly increasing order:

𝑥 1↑ ≤ 𝑥 2↑ ≤ · · · ≤ 𝑥𝑛↑ .

In situations where the order of the entries of a vector is unimportant, it may be


useful to replace the vector by its decreasing rearrangement. More formally, we can
define an equivalence relation between vectors whose decreasing rearrangements are
the same and work with the equivalence classes.

Example 3.2 (Rearrangements). If 𝒙 = ( 1, 3, 2) ∈ ℝ3 , then its decreasing and increasing


rearrangements are

𝒙 ↓ = ( 3, 2, 1) ;
𝒙 ↑ = ( 1, 2, 3).
Observe that the vector 𝒚 = ( 2, 3, 1) has the same rearrangements. 

The next proposition provides us with an upper bound and a lower bound on the
inner product of two vectors from the inner products of their rearrangements.
Proposition 3.3 (Chebyshev rearrangement). For all 𝒙 , 𝒚 ∈ ℝ𝑛 ,

⟨𝒙 ↓ , 𝒚 ↑ ⟩ ≤ ⟨𝒙 , 𝒚 ⟩ ≤ ⟨𝒙 ↓ , 𝒚 ↓ ⟩ ;
⟨𝒙 ↑ , 𝒚 ↓ ⟩ ≤ ⟨𝒙 , 𝒚 ⟩ ≤ ⟨𝒙 ↑ , 𝒚 ↑ ⟩.
Proof. The proof is left as an exercise to the reader. Hint: Assume that 𝒙 , 𝒚 ≥ 0, and
apply summation by parts. Otherwise, see [HLP88, p. 261]. 

3.1.3 Majorization order


Majorization compares two vectors by considering their decreasing rearrangements.

Definition 3.4 (Majorization order). Let 𝒙 , 𝒚 ∈ ℝ𝑛 . We say that 𝒚 majorizes 𝒙 , written 𝒙 ≺ 𝒚 , when

∑_{𝑖=1}^{𝑘} 𝑥𝑖↓ ≤ ∑_{𝑖=1}^{𝑘} 𝑦𝑖↓ for each 𝑘 = 1, . . . , 𝑛, and
∑_{𝑖=1}^{𝑛} 𝑥𝑖↓ = ∑_{𝑖=1}^{𝑛} 𝑦𝑖↓ .

The second condition can be stated as tr (𝒙 ) = tr (𝒚 ).

Alternatively, the majorization order can be stated using the increasing rearrangements. For 𝒙 , 𝒚 ∈ ℝ𝑛 , we have 𝒙 ≺ 𝒚 if and only if

∑_{𝑖=1}^{𝑘} 𝑥𝑖↑ ≥ ∑_{𝑖=1}^{𝑘} 𝑦𝑖↑ for each 𝑘 = 1, . . . , 𝑛, and
∑_{𝑖=1}^{𝑛} 𝑥𝑖↑ = ∑_{𝑖=1}^{𝑛} 𝑦𝑖↑ .

Intuitively, a vector 𝒙 is majorized by 𝒚 if 𝒙 is “flatter”: its mass is more uniformly


distributed over its entries than 𝒚 .
Example 3.5 (Finite probability distribution). Let 𝒙 ∈ ℝ+𝑛 with tr (𝒙 ) = 1. We can interpret
any such vector as a discrete probability distribution on a finite sample space. The
following relations hold:

( 1/𝑛, 1/𝑛, . . . , 1/𝑛 ) ≺ 𝒙 ≺ ( 1, 0, . . . , 0).

We can interpret ( 1/𝑛, . . . , 1/𝑛 ) as the maximally flat vector: its mass is uniformly distributed and therefore it takes the longest to accumulate to the total. On the other
distributed and therefore it takes the longest to accumulate to the total. On the other
hand, ( 1, 0, . . . , 0) is the spikiest vector: it is maximally concentrated and therefore
accumulates to the total as quickly as possible. These examples are illustrated in Figure
3.1 with increasing rearrangements. 
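Definition 3.4 translates directly into a short computational test, which can be useful for sanity checks later in the course. The helper below is an illustrative sketch of ours (not from the notes); it reproduces the two relations of Example 3.5 for a random probability vector.

```python
import numpy as np

def majorizes(y, x, tol=1e-10):
    """Return True if x is majorized by y in the sense of Definition 3.4."""
    xs = np.sort(x)[::-1]        # decreasing rearrangement of x
    ys = np.sort(y)[::-1]        # decreasing rearrangement of y
    partial_sums_ok = np.all(np.cumsum(xs) <= np.cumsum(ys) + tol)
    traces_equal = abs(xs.sum() - ys.sum()) <= tol
    return bool(partial_sums_ok and traces_equal)

n = 5
x = np.random.default_rng(3).dirichlet(np.ones(n))   # a random probability vector
uniform = np.full(n, 1.0 / n)
spike = np.eye(n)[0]                                  # (1, 0, ..., 0)

assert majorizes(x, uniform)      # the uniform vector is majorized by x
assert majorizes(spike, x)        # x is majorized by the spike
```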

We also have the following equivalent definition of majorization:



Figure 3.1 (Majorization for probability distributions). Illustration of ma-


jorization for increasing rearrangements of probability vectors on ℝ𝑛 .

Proposition 3.6 (Majorization without rearrangement). For vectors 𝒙 , 𝒚 ∈ ℝ𝑛 , we have


𝒙 ≺ 𝒚 if and only if
tr (|𝒙 − 𝑡 1 |) ≤ tr (|𝒚 − 𝑡 1 |) for all 𝑡 ∈ ℝ. (3.1)
The symbol | · | denotes the entrywise absolute value. Note that (3.1) is exactly the
statement that
‖𝒙 − 𝑡 1 ‖ 1 ≤ ‖𝒚 − 𝑡 1 ‖ 1 for all 𝑡 ∈ ℝ.
Proof. See Problem Set 1. 
Remark 3.7 (Partial order). The relation ≺ does not define a partial order because the
relations 𝒚 ≺ 𝒙 and 𝒙 ≺ 𝒚 do not imply that 𝒙 = 𝒚 . It does however define a partial
order on the set of decreasing rearrangements; that is the set of equivalence classes of
vectors where 𝒙 ∼ 𝒚 if and only if 𝒙 ↓ = 𝒚 ↓ .
Remark 3.8 (Notation and conventions). Majorization and rearrangements arise in other
fields, where one may encounter differing notations and conventions:
1. In statistics, rearrangements are called order statistics. They are usually denoted as 𝑥 [𝑖 ] = 𝑥𝑖↓ and 𝑥 (𝑖 ) = 𝑥𝑖↑ .
2. In analysis, rearrangements are sometimes written as 𝑥𝑖∗ = 𝑥𝑖↓ .
3. In some fields, especially physics, 𝒚  𝒙 means that 𝒚 is more chaotic than 𝒙 .
Hence uniform vectors are the highest in the partial order, which is the opposite
of our definition.
When reading papers that use majorization, one must take care to note which conven-
tions are being used!

3.2 Doubly stochastic matrices


In this section, we introduce the class of doubly stochastic matrices. We will see that
these matrices enact the majorization relation.

Definition 3.9 (Doubly stochastic matrices). A matrix 𝑺 ∈ 𝕄𝑛 (ℝ) with entries 𝑺 = [𝑠𝑖 𝑗 ]
is doubly stochastic if it has the following three properties:

1. Positivity. The entries 𝑠𝑖 𝑗 ≥ 0 for all 𝑖 , 𝑗 = 1, 2, . . . , 𝑛 .
2. Rows. The row sums ∑_{𝑗=1}^{𝑛} 𝑠𝑖 𝑗 = 1 for all 𝑖 = 1, 2, . . . , 𝑛 .
3. Columns. The column sums ∑_{𝑖=1}^{𝑛} 𝑠𝑖 𝑗 = 1 for all 𝑗 = 1, 2, . . . , 𝑛 .

The set DS𝑛 collects all of the 𝑛 × 𝑛 doubly stochastic matrices.

Example 3.10 (Doubly stochastic matrices). The following are examples of doubly stochastic
matrices:

1. The 𝑛 × 𝑛 identity matrix I𝑛 is a doubly stochastic matrix.


2. The 𝑛 × 𝑛 matrix (1/𝑛) 11ᵀ with constant entries is doubly stochastic.
3. Permutation matrices are doubly stochastic. Recall that 𝑷 is a permutation matrix
if it is a square matrix with 0–1 entries, and there is exactly one 1 per row and
per column.
4. Orthostochastic matrices are doubly stochastic. A matrix 𝑺 is orthostochastic
when its entries 𝑠𝑖 𝑗 = |𝑢 𝑖 𝑗 | 2 for a unitary matrix 𝑼 .

Doubly stochastic matrices naturally arise in a variety of applications, such as the study
of reversible Markov chains. 
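As a small illustration (ours, not from the notes), one can manufacture doubly stochastic matrices by averaging random permutation matrices (anticipating the Birkhoff–von Neumann theorem of Lecture 5) and verify the three defining properties numerically.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 8

# Convex combination of m random permutation matrices with Dirichlet weights.
weights = rng.dirichlet(np.ones(m))
S = sum(w * np.eye(n)[rng.permutation(n)] for w in weights)

assert np.all(S >= 0)                              # positivity
assert np.allclose(S.sum(axis=1), np.ones(n))      # every row sums to one
assert np.allclose(S.sum(axis=0), np.ones(n))      # every column sums to one
print(np.round(S, 3))
```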

Intuitively, we think of doubly stochastic matrices acting on vectors by averaging.


Interpreting each row (and each column) of a doubly stochastic matrix 𝑻 as a probability
distribution, we can view the entries of 𝑻 𝒗 as the expectation of the (discrete) function
𝒗 with respect to that probability distribution.
The next two propositions present some important properties of doubly stochastic
matrices.
Proposition 3.11 (Doubly stochastic matrices: Characterization). Each matrix 𝑺 ∈ DS𝑛 has
the following properties:

1. Positive.For all 𝒙 ∈ ℝ𝑛 , the relation 𝒙 ≥ 0 implies that 𝑺𝒙 ≥ 0.


2. Trace preserving. For all 𝒙 ∈ ℝ𝑛 , we have tr (𝑺𝒙 ) = tr (𝒙 ) .
3. Unital. We have 𝑺 1 = 1.

In fact, a matrix is doubly stochastic if and only if it satisfies these three properties.

Proof. The positivity property follows from the positivity of each coordinate: (𝑺𝒙 )𝑖 = ∑_{𝑗=1}^{𝑛} 𝑠𝑖 𝑗 𝑥 𝑗 ≥ 0 for 𝑖 = 1, . . . , 𝑛 .
The unital property follows easily from the observation that (𝑺 1)𝑖 = ∑_{𝑗=1}^{𝑛} 𝑠𝑖 𝑗 = 1.
To prove the trace preservation property, note that 𝑺 is doubly stochastic if and
only if 𝑺 ᵀ is also doubly stochastic. Therefore

tr (𝑺𝒙 ) = ⟨1, 𝑺𝒙 ⟩ = ⟨𝑺 ᵀ 1, 𝒙 ⟩ = ⟨1, 𝒙 ⟩ = tr (𝒙 ),


where we have used properties of the adjoint and the unital property of 𝑺 ᵀ . 
The next result describes structural properties of the full set DS𝑛 of doubly stochastic
matrices.

Proposition 3.12 (Properties of DS𝑛 ). The set DSn has the following properties:

1. Convexity.The set DS𝑛 is a compact, convex set of ℝ𝑛×𝑛 .


2. Composition. If 𝑺 ,𝑻 ∈ DS𝑛 then 𝑺𝑻 ∈ DS𝑛 .

Proof. First, observe that the set DS𝑛 is bounded. Indeed, for all 𝑺 ∈ DS𝑛 , we have ‖𝑺 ‖ ∞ = max_{𝑖,𝑗} |𝑠𝑖 𝑗 | ≤ 1 . To show that DS𝑛 is closed and convex, simply observe that each row and each column of a doubly stochastic matrix can be defined as a vector belonging to the intersection of closed halfspaces.
In detail, we can write 𝑺 = [𝒔 1 , 𝒔 2 , . . . , 𝒔 𝑛 ] where 𝒔 1 , 𝒔 2 , . . . , 𝒔 𝑛 are the columns of 𝑺 . Then, for each index 𝑖 = 1, . . . , 𝑛 , the condition ⟨1, 𝒔 𝑖 ⟩ = 1 can be expressed as

𝒔 𝑖 ∈ {𝒙 ∈ ℝ𝑛 : ⟨1, 𝒙 ⟩ ≥ 1} ∩ {𝒙 ∈ ℝ𝑛 : ⟨1, 𝒙 ⟩ ≤ 1}.

Likewise, for each index 𝑖 = 1, . . . , 𝑛 , the condition 𝒔 𝑖 ≥ 0 can be expressed as

𝒔 𝑖 ∈ {𝒙 ∈ ℝ𝑛 : ⟨𝜹 𝑗 , 𝒙 ⟩ ≥ 0} for 𝑗 = 1, . . . , 𝑛 .

The same holds for the rows of 𝑺 . The set DS𝑛 is therefore the intersection of (a finite number of) closed and convex sets and hence is itself closed and convex.
To prove the composition property, consider two doubly stochastic matrices 𝑺 , 𝑻 and their product 𝑺𝑻 . The entries of the product are clearly positive:

(𝑺𝑻 )𝑖 𝑗 = ∑_{𝑘=1}^{𝑛} 𝑠𝑖𝑘 𝑡𝑘 𝑗 ≥ 0.

The row and column sums are also preserved in the product. For each row index 𝑖 and each column index 𝑗 ,

∑_{𝑗=1}^{𝑛} (𝑺𝑻 )𝑖 𝑗 = ∑_{𝑗=1}^{𝑛} ∑_{𝑘=1}^{𝑛} 𝑠𝑖𝑘 𝑡𝑘 𝑗 = ∑_{𝑘=1}^{𝑛} 𝑠𝑖𝑘 ( ∑_{𝑗=1}^{𝑛} 𝑡𝑘 𝑗 ) = 1 ;
∑_{𝑖=1}^{𝑛} (𝑺𝑻 )𝑖 𝑗 = ∑_{𝑖=1}^{𝑛} ∑_{𝑘=1}^{𝑛} 𝑠𝑖𝑘 𝑡𝑘 𝑗 = ∑_{𝑘=1}^{𝑛} 𝑡𝑘 𝑗 ( ∑_{𝑖=1}^{𝑛} 𝑠𝑖𝑘 ) = 1.

Therefore, 𝑺𝑻 is also doubly stochastic. ■


Corollary 3.13 (Convex combinations of permutation matrices). Any convex combination of
permutation matrices is a doubly stochastic matrix.

Proof. As was seen in Example 3.10, permutation matrices are doubly stochastic
matrices and, by the previous theorem, the set DS𝑛 is a convex set. 
The converse of the corollary above is also true: any doubly stochastic matrix is a
convex combination of permutation matrices. This is the content of the Birkhoff–von
Neumann theorem.

3.3 T-transforms
We now introduce a special class of doubly stochastic matrices, the T-transforms.
T-transforms allow us to take convex combinations of two entries of a vector while
leaving all other coordinates unchanged. In other words, it averages two coordinates.
These matrices will allow us to relate majorization to the space DS𝑛 of doubly stochastic
matrices.

Definition 3.14 (T-transform). A T-transform is a doubly stochastic matrix 𝑻 of the


form
𝑻 = 𝜏 I + ( 1 − 𝜏)𝑷
where 𝜏 ∈ [ 0, 1] and 𝑷 is a permutation that transposes two coordinates. For example, if the transposition acts on the coordinates (𝑘 , ℓ ), then

(𝑷𝒗 )𝑖 = 𝑣ℓ if 𝑖 = 𝑘 ,   (𝑷𝒗 )𝑖 = 𝑣𝑘 if 𝑖 = ℓ ,   and (𝑷𝒗 )𝑖 = 𝑣𝑖 otherwise.

Thus, 𝑻 only acts nontrivially on the coordinates (𝑘 , ℓ ).

By Proposition 3.12, each T-transform is a doubly stochastic matrix because it is


a convex combination of two permutation matrices. The next example shows that
T-transforms and majorization are equivalent in ℝ2 .
Example 3.15 (T-transforms and majorization in ℝ2 ). In ℝ2 , T-transforms take a particularly
simple form because they act on both coordinates:

$$\boldsymbol{T} = \tau \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + (1-\tau) \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \quad \text{for a parameter } \tau \in [0,1].$$

For vectors in ℝ2 , we will argue that 𝒙 ≺ 𝒚 if and only if 𝒙 = 𝑻 𝒚 for some choice of 𝜏 .
Indeed, let 𝒙 = (𝑥 1 , 𝑥 2 ), and 𝒚 = (𝑦1 , 𝑦2 ) . The majorization relation 𝒙 ≺ 𝒚 states
that
𝑥 1 ≤ 𝑦1 and 𝑥 1 + 𝑥 2 = 𝑦1 + 𝑦2 . (3.2)
Hence, 𝑦2 ≤ 𝑥 1 ≤ 𝑦1 . As a consequence, there is a 𝜏 ∈ [ 0, 1] such that
𝑥 1 = 𝜏𝑦1 + ( 1 − 𝜏)𝑦2 .
Plugging the last display back in to (3.2), we obtain
𝑥 2 = ( 1 − 𝜏)𝑦1 + 𝜏𝑦2 .
We conclude that 𝒙 = 𝑻 𝒚 for the T-transform with this distinguished parameter 𝜏 .
Conversely, we can reverse this chain of argument to see that 𝒙 = 𝑻 𝒚 implies that
𝒙 ≺ 𝒚 for vectors in ℝ2 . 

Example 3.16 (T-transforms in ℝ3 ). In ℝ3 , we consider the T-transform that acts on the


pair ( 1, 3) of coordinates. Then the transposition matrix has the form

$$\boldsymbol{P} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}.$$

The T-transform becomes

$$\boldsymbol{T} = \begin{bmatrix} \tau & 0 & (1-\tau) \\ 0 & 1 & 0 \\ (1-\tau) & 0 & \tau \end{bmatrix} \quad \text{for some } \tau \in [0,1].$$
Observe that (𝑻 𝒙 )2 = 𝑥 2 so that the second coordinate is unchanged by this transfor-
mation.
This observation generalizes to ℝ𝑛 for 𝑛 ∈ ℕ and an arbitrary choice (𝑘 , ℓ ) of coordinates. If 𝑻 is a T-transform acting nontrivially on the coordinates (𝑘 , ℓ ), then (𝑻 𝒙 )𝑖 = 𝑥𝑖 for each 𝑖 ≠ 𝑘 , ℓ . ■

Observe that the action of a T-transform induces a majorization relation.


Exercise 3.17 (T-transforms). Let 𝒚 ∈ ℝ𝑛 be a vector, and let 𝑻 be a T-transform on ℝ𝑛 . Show that 𝑻 𝒚 ≺ 𝒚 .
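The exercise is easy to confirm numerically. The sketch below (illustrative, not from the notes; the helper names are ours) builds a T-transform explicitly and checks the majorization relation with a direct test.

```python
import numpy as np

def majorizes(y, x, tol=1e-10):
    xs, ys = np.sort(x)[::-1], np.sort(y)[::-1]
    return np.all(np.cumsum(xs) <= np.cumsum(ys) + tol) and abs(xs.sum() - ys.sum()) <= tol

def t_transform(n, k, ell, tau):
    """T = tau*I + (1 - tau)*P, where P transposes coordinates k and ell."""
    P = np.eye(n)
    P[[k, ell]] = P[[ell, k]]       # swap two rows of the identity
    return tau * np.eye(n) + (1 - tau) * P

rng = np.random.default_rng(5)
n = 6
y = rng.standard_normal(n)
T = t_transform(n, k=1, ell=4, tau=rng.uniform())

assert majorizes(y, T @ y)          # T y is majorized by y, as claimed in Exercise 3.17
```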

3.4 Majorization and doubly stochastic matrices


The main theorem of this lecture relates majorization to the set of doubly stochastic
matrices. Recall that a vector 𝒙 is majorized by a vector 𝒚 if the vector 𝒙 is “more
average.” Likewise, doubly stochastic matrices act on a vector by averaging its entries
with respect to discrete probability distributions defined by its rows. The next result
shows that these concepts are equivalent: 𝒙 ≺ 𝒚 if and only if 𝒙 is obtained through a
transformation of 𝒚 by a doubly stochastic matrix.

Theorem 3.18 (Majorization and DS𝑛 ). Fix two vectors 𝒙 , 𝒚 ∈ ℝ𝑛 . The following
statements are equivalent:

1. 𝒙 ≺𝒚
2. 𝒙 = 𝑻 𝑛 · · ·𝑻 1 𝒚 for certain T-transforms 𝑻 𝑖 .
3. 𝒙 = 𝑺 𝒚 for some 𝑺 ∈ DS𝑛 .
4. 𝒙 ∈ conv {𝑷 𝒚 : 𝑷 is a permutation matrix on ℝ𝑛 }.

Since each matrix 𝑻 1 ,𝑻 2 , . . . ,𝑻 𝑛 is a T-transform, observe that 𝑻 𝑛−1 · · ·𝑻 1 𝒚 and


𝑻 𝑛 · · ·𝑻 1 𝒚 only differ in two coordinates.
Proof. (3 ⇔ 4). Corollary 3.13 gives the reverse implication. The forward implication
follows from the Birkhoff–von Neumann theorem, which we will establish in Lecture 5.
It can also be established directly; see Exercise 3.19.
(2 ⇒ 3). Since 𝒙 = 𝑻 𝑛 · · ·𝑻 1 𝒚 = 𝑺 𝒚 it suffices to show that 𝑻 𝑛 · · ·𝑻 1 is doubly
stochastic. This point holds because T-transforms are doubly stochastic and the set of
doubly stochastic matrices is stable under products (Proposition 3.12).
(3 ⇒ 1). Let 𝒙 = 𝑺 𝒚 . For each 𝑡 ∈ ℝ, we may calculate that

|𝒙 − 𝑡 1 | = |𝑺 𝒚 − 𝑡 1 | = |𝑺 (𝒚 − 𝑡 1)| ≤ 𝑺 (|𝒚 − 𝑡 1 |).

We have used the unital property and Jensen’s inequality. Taking the trace,

tr (|𝒙 − 𝑡 1 |) ≤ tr (𝑺 (|𝒚 − 𝑡 1 |)) = tr (|𝒚 − 𝑡 1 |),

since doubly stochastic matrices are trace-preserving. This is the equivalent formulation
of majorization from Proposition 3.6.
(1 ⇒ 2). This is the hard part. Assume that 𝒙 ≺ 𝒚 . We will construct a sequence of
T-transforms 𝑻 1 ,𝑻 2 , . . . ,𝑻 𝑛 such that

𝒙 = 𝑻 𝑛𝑻 𝑛−1 · · ·𝑻 1 𝒚 and
𝑻 𝑘 +1𝑻 𝑘 . . .𝑻 1 𝑦 ≺ 𝑻 𝑘 . . .𝑻 1 𝒚 for all 𝑘 = 0, . . . , 𝑛 − 1.

Without loss of generality, we may assume that 𝒙 = 𝒙 ↓ and 𝒚 = 𝒚 ↓ because reordering


the vectors does not affect the majorization relation.
We proceed by induction on the dimension 𝑛 . For the base case 𝑛 = 2, see Example
3.15. For the induction step, observe that the two conditions

𝑥 1 ≤ 𝑦1 and ∑_{𝑖=1}^{𝑛} 𝑥𝑖 = ∑_{𝑖=1}^{𝑛} 𝑦𝑖

imply that there is an index 𝑘 ∈ {2, . . . , 𝑛} such that 𝑦𝑘 ≤ 𝑥 1 ≤ 𝑦𝑘 −1 . Thus,

𝑥 1 = 𝜏𝑦1 + ( 1 − 𝜏)𝑦𝑘 .

We can find a T-transform 𝑻 1 on the set of coordinates ( 1, 𝑘 ) such that (𝑻 1 𝒚 )1 =


𝑥 1 , leaving all other coordinates of 𝒚 unchanged. By the implication (3 ⇒ 1) or
Exercise 3.17,
(𝑥 1 , 𝒘 ) B 𝑻 1 𝒚 ≺ 𝒚 ,
where 𝒘 denotes the last 𝑛 − 1 coordinates of 𝑻 1 𝒚 .
It remains to check that (𝑥 2 , . . . , 𝑥𝑛 ) ≺ 𝒘 . Observe that

𝑤𝑖 = 𝑦𝑖 if 𝑖 ≠ 𝑘 , and 𝑤𝑘 = 𝜏𝑦𝑘 + ( 1 − 𝜏)𝑦1 ,

where we use the convention that the indices of 𝒘 range from 2 to 𝑛 . Fix an index 𝑚 ∈ {2, . . . , 𝑛}. For 𝑚 < 𝑘 , since 𝑥𝑖 ≤ 𝑥 1 ≤ 𝑦𝑖 for each 𝑖 = 2, . . . , 𝑚 , we have that

∑_{𝑖=2}^{𝑚} 𝑥𝑖 ≤ ∑_{𝑖=2}^{𝑚} 𝑦𝑖 = ∑_{𝑖=2}^{𝑚} 𝑤𝑖 .

For 𝑚 ≥ 𝑘 ,

∑_{𝑖=1}^{𝑚} 𝑥𝑖 ≤ ∑_{𝑖=1}^{𝑚} 𝑦𝑖 .

Therefore,

∑_{𝑖=2}^{𝑚} 𝑥𝑖 ≤ ∑_{𝑖=1}^{𝑚} 𝑦𝑖 − 𝑥 1 = ∑_{𝑖=2}^{𝑚} 𝑤𝑖

because 𝑤𝑘 = 𝑦1 + 𝑦𝑘 − 𝑥 1 and 𝑤𝑖 = 𝑦𝑖 for 𝑖 ≠ 𝑘 by definition. Finally,

∑_{𝑖=2}^{𝑛} 𝑤𝑖 = ∑_{𝑖=1}^{𝑛} 𝑦𝑖 − 𝑥 1 = ∑_{𝑖=2}^{𝑛} 𝑥𝑖 .

As required, (𝑥 2 , 𝑥 3 , . . . , 𝑥𝑛 ) ≺ 𝒘 .
By the induction hypothesis, there is a sequence of T-transforms on 𝒘 satisfying the requirements. Each of these transformations can be extended to a T-transform on (𝑥 1 , 𝒘 ) by leaving the first coordinate unchanged, completing the induction step.

The next exercise completes the above proof without resorting to the Birkhoff–von
Neumann theorem.
Exercise 3.19 (Direct proof of majorization and DS𝑛 ). Prove that (2 ⇒ 4) in Theorem 3.18
by expanding the product of T-transforms.
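Statement 4 of Theorem 3.18 can also be checked computationally with a small linear program: 𝒙 lies in the permutahedron of 𝒚 exactly when some convex weights on the permutations of 𝒚 reproduce 𝒙 . The sketch below is an illustration of ours (not from the notes; it enumerates all 𝑛 ! permutations, so it only scales to tiny 𝑛 ) and uses scipy.optimize.linprog as a feasibility oracle.

```python
import numpy as np
from itertools import permutations
from scipy.optimize import linprog

def in_permutahedron(x, y):
    """Is x a convex combination of coordinate permutations of y? (Theorem 3.18, item 4.)"""
    perms = np.array([np.asarray(y)[list(p)] for p in permutations(range(len(y)))])
    # One weight per permutation; weights are nonnegative, sum to 1, and combine to x.
    A_eq = np.vstack([perms.T, np.ones(len(perms))])
    b_eq = np.concatenate([np.asarray(x, dtype=float), [1.0]])
    result = linprog(c=np.zeros(len(perms)), A_eq=A_eq, b_eq=b_eq,
                     bounds=[(0, None)] * len(perms))
    return result.success

y = np.array([5.0, 2.0, 1.0, 0.0])
x = np.array([3.0, 3.0, 1.0, 1.0])     # x is majorized by y

assert in_permutahedron(x, y)          # consistent with the majorization relation
assert not in_permutahedron(y, x)      # y is not majorized by x
```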

3.5 The Schur–Horn theorem


We now prove the important Schur–Horn theorem on Hermitian matrices. On the one
hand, Schur’s theorem states that the diagonal entries of any Hermitian matrix are
majorized by its eigenvalues. On the other, Horn’s theorem tells us that we can go
in the other direction: from a pair of vectors with a majorization relation there is a
matrix with the corresponding diagonal and eigenvalues.

Theorem 3.20 (Schur–Horn). Let 𝑨 ∈ ℍ𝑛 be a Hermitian matrix. The following two


claims hold.

1. Schur.We have diag (𝑨) ≺ 𝝀 ↓ (𝑨) .


2. Horn. If 𝒖 ≺ 𝒗 , then there exists a Hermitian matrix 𝑩 such that diag (𝑩) = 𝒖
and 𝝀 ↓ (𝑩) = 𝒗 ↓ .

The map 𝝀 ↓ : ℍ𝑛 ↦→ ℝ𝑛 returns the vector of eigenvalues arranged in decreasing


order.

Proof. We present a proof of Schur’s theorem only. For a proof of Horn’s theorem, see
[MOA11, p.302].
By the spectral theorem, we can write 𝑨 = 𝑼 ∗ 𝚲𝑼 where 𝑼 is unitary and 𝚲 is a diagonal matrix whose diagonal entries are the decreasingly ordered eigenvalues 𝜆𝑖↓ (𝑨) . Write 𝑼 = [𝒖 1 , 𝒖 2 , . . . , 𝒖 𝑛 ] where 𝒖 1 , 𝒖 2 , . . . , 𝒖 𝑛 ∈ ℂ𝑛 are the columns of 𝑼 . Then

𝑎𝑖𝑖 = 𝜹 𝑖∗ 𝑨 𝜹 𝑖 = 𝜹 𝑖∗𝑼 ∗ 𝚲𝑼 𝜹 𝑖 = (𝑼 𝜹 𝑖 ) ∗ 𝚲 (𝑼 𝜹 𝑖 ) = 𝒖 𝑖∗ 𝚲 𝒖 𝑖 = ∑_{𝑗=1}^{𝑛} |𝑢 𝑗 𝑖 | 2 𝜆 𝑗↓ (𝑨).

Define the orthostochastic matrix 𝑺 with entries 𝑠𝑖 𝑗 = |𝑢 𝑗 𝑖 | 2 . The last display tells us that

𝑎𝑖𝑖 = (𝑺 𝝀 ↓ (𝑨))𝑖 .

In other words, diag (𝑨) = 𝑺 𝝀 ↓ (𝑨) where 𝑺 is doubly stochastic. By Theorem 3.18, we may conclude that diag (𝑨) ≺ 𝝀 ↓ (𝑨) . ■
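A short numerical illustration of Schur's theorem (ours, not part of the notes; it uses NumPy's convention 𝑨 = 𝑼 𝚲𝑼 ∗ with the eigenvectors in the columns of 𝑼): form the orthostochastic matrix from the eigenvectors and confirm both the identity diag (𝑨) = 𝑺 𝝀 (𝑨) and the majorization.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (G + G.conj().T) / 2                      # Hermitian

lam, U = np.linalg.eigh(A)                    # A = U diag(lam) U*
S = np.abs(U) ** 2                            # orthostochastic matrix
diag_A = np.real(np.diag(A))

assert np.allclose(diag_A, S @ lam)           # diag(A) = S * (vector of eigenvalues)

# Majorization: compare partial sums of decreasing rearrangements.
d_sorted, l_sorted = np.sort(diag_A)[::-1], np.sort(lam)[::-1]
assert np.all(np.cumsum(d_sorted) <= np.cumsum(l_sorted) + 1e-10)
assert np.isclose(d_sorted.sum(), l_sorted.sum())
```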

Notes
Majorization is a foundational topic in analysis. It plays a key role in the classic
book Inequalities by Hardy, Littlewood, and Pólya [HLP88]. Majorization is also
the core topic in the well-known book Inequalities by Marshall & Olkin, updated by
Arnold [MOA11]. The presentation in this lecture is adapted from Bhatia’s book [Bha97,
Chap. II], which owes a heavy debt to Ando’s arrangement of the material [And89].
Indeed, Ando understood that similar ideas are also at the heart of the theory of
positive linear maps, which we will explore in the second half of the course.

Lecture bibliography
[And89] T. Ando. “Majorization, doubly stochastic matrices, and comparison of eigenvalues”.
In: Linear Algebra and its Applications 118 (1989), pages 163–248.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[HLP88] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Reprint of the 1952 edition.
Cambridge University Press, Cambridge, 1988.
[MOA11] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and
its applications. Second. Springer, New York, 2011. doi: 10.1007/978- 0- 387-
68276-1.
4. Isotone Functions

Date: 13 January 2022 Scribe: Anthony (Chi-Fang) Chen

We continue our discussion on majorization. A function 𝑓 : ℝ → ℝ that respects the ordering on the real line is called a monotone function. In the same spirit, functions that respect the majorization order on vectors are called isotone functions. These functions provide convenient ways to transfer majorization properties from one space to another. We can identify isotone functions by checking simple properties, such as permutation invariance and convexity, and we will see plenty of examples.

Agenda:
1. Weyl majorant theorem
2. Isotonicity
3. Sufficient conditions and examples
4. Schur “convexity”

4.1 Recap
For vectors 𝒙 , 𝒚 ∈ ℝ𝑛 , the majorization relation 𝒙 ≺ 𝒚 is defined by the following conditions:

∑_{𝑖=1}^{𝑘} 𝑥𝑖↓ ≤ ∑_{𝑖=1}^{𝑘} 𝑦𝑖↓ for each 𝑘 = 1, . . . , 𝑛 ;
tr [𝒙 ] := ∑_{𝑖=1}^{𝑛} 𝑥𝑖↓ = ∑_{𝑖=1}^{𝑛} 𝑦𝑖↓ =: tr [𝒚 ].

In other words, majorization consists of 𝑛 − 1 inequalities and one equality among the
(sorted) entries of the vectors. Intuitively, this captures the idea that vector 𝒚 is “spikier”
than vector 𝒙 , or vector 𝒙 is “flatter” than vector 𝒚 . Majorization plays a basic role in
matrix analysis. For example, we saw an important theorem of Schur that relates the
diagonal entries of a matrix to its decreasingly ordered eigenvalues.

Theorem 4.1 (Schur). Consider a self-adjoint matrix 𝑨 ∈ ℍ𝑛 (ℂ) written in the


standard coordinate basis. The vector of diagonal entries of the matrix is majorized
by the vector of eigenvalues:

diag (𝑨) ≺ 𝝀 ↓ (𝑨).

The function 𝝀 ↓ : ℍ𝑛 → ℝ𝑛 reports the (real) eigenvalues in decreasing order.

4.2 Weyl majorant theorem


Today, we begin with another important theorem that establishes a majorization
relationship between the eigenvalues and singular values of a square matrix. To state
this result, we define a function 𝝀 ↓ : 𝕄𝑛 (ℂ) → ℝ𝑛 that lists the complex eigenvalues
in decreasing order of magnitude, followed by increasing order of phase (modulo 2𝜋 ).
The function 𝝈 : 𝕄𝑛 (ℂ) → ℝ𝑛 reports the singular values in decreasing order.

Theorem 4.2 (Weyl majorant theorem). For every matrix 𝑨 ∈ 𝕄𝑛 (ℂ) , we have

∏_{𝑖=1}^{𝑘} |𝜆𝑖↓ (𝑨)| ≤ ∏_{𝑖=1}^{𝑘} 𝜎𝑖↓ (𝑨) for each 𝑘 = 1, . . . , 𝑛 ;
∏_{𝑖=1}^{𝑛} |𝜆𝑖 (𝑨)| = ∏_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨).

We can express these relations with the shorthand

log (𝝀 ↓ ) ≺ log (𝝈 ).

(When there are zero eigenvalues and zero singular values, the product formulas give a precise meaning to the log-majorization.)

The proof requires two lemmas.


Lemma 4.3 (Products of eigenvalues and singular values). For every matrix 𝑨 ∈ 𝕄𝑛 (ℂ) , we have

∏_{𝑖=1}^{𝑛} |𝜆𝑖↓ (𝑨)| = ∏_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨).

Proof. Let us express the determinant in terms of both the eigenvalues and the singular
values.

| det (𝑨)| = | det (𝑸𝑻 𝑸 ∗ )| = | det (𝑸 )| · | det (𝑻 )| · | det (𝑸 ∗ )| = ∏_{𝑖=1}^{𝑛} |𝜆𝑖↓ (𝑨)| ;   (4.1)
| det (𝑨)| = | det (𝑼 𝚺𝑽 ∗ )| = | det (𝑼 )| · | det (𝚺)| · | det (𝑽 ∗ )| = ∏_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨).   (4.2)

In equation (4.1), we use Schur decomposition and the multiplicativity of the determi-
nant, and then we evaluate the determinant of the upper triangular matrix 𝑻 . Note
the determinant of any unitary matrix has magnitude one. Equation (4.2) is analogous
but uses the singular value decomposition instead. 
Lemma 4.4 (Spectral radius and spectral norm). For every matrix 𝑴 ∈ 𝕄𝑛 (ℂ) , the maximal
singular value bounds the magnitude of each eigenvalue:

|𝜆𝑖 (𝑴 )| ≤ 𝜎max (𝑴 ) for each index 𝑖 .


Proof. Consider any eigenvalue 𝜆𝑖 (𝑴 ) of the matrix with unit-norm eigenvector 𝒖 𝑖 .
Then

|𝜆𝑖 (𝑴 )| = |𝒖 𝑖∗ 𝑴 𝒖 𝑖 | ≤ max{|𝒖 ∗ 𝑴 𝒗 | : ‖𝒖 ‖ 2 = ‖𝒗 ‖ 2 = 1} = 𝜎max (𝑴 ).


This is the advertised result. 
We are now prepared to establish Theorem 4.2.

Proof of Weyl majorant theorem. The equality is Lemma 4.3; for the inequalities, we
cleverly use multilinear algebra. (Without invoking multilinear algebra, the proof gets horrible.) For any 𝑘 = 1, . . . , 𝑛 , consider the antisymmetric subspace ∧𝑘 ℂ𝑛 and the induced operator ∧𝑘 𝑨 =: 𝑴 . In Lecture 2, we showed that the eigenvalues of the operator 𝑴 are products of 𝑘 eigenvalues of 𝑨 with no repeated indices. By Lemma 4.4, we can bound the eigenvalues of 𝑴 above by the maximal singular value of 𝑴 . Indeed,

∏_{𝑖=1}^{𝑘} |𝜆𝑖↓ (𝑨)| = |𝜆 1↓ (𝑴 )| ≤ 𝜎max (𝑴 ) = ∏_{𝑖=1}^{𝑘} 𝜎𝑖 (𝑨).

This is the advertised result. 
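A numerical spot-check of the Weyl majorant theorem (an illustration, not part of the notes): compare running products of the sorted eigenvalue magnitudes and the singular values of a random matrix.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 6
A = rng.standard_normal((n, n))

abs_eigs = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]      # |eigenvalues|, decreasing
sing = np.linalg.svd(A, compute_uv=False)                   # singular values, decreasing

prod_eigs, prod_sing = np.cumprod(abs_eigs), np.cumprod(sing)
assert np.all(prod_eigs <= prod_sing * (1 + 1e-10))         # partial products
assert np.isclose(prod_eigs[-1], prod_sing[-1])             # equality at k = n
print(np.round(prod_eigs, 4))
print(np.round(prod_sing, 4))
```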



4.3 Isotonicity
Today, we move on to study isotone functions that respect the majorization order.
We will introduce easy-to-check conditions for isotonicity and see that many existing
measures of “inequality” fit into this framework.

4.3.1 Motivation
Let us begin with an economics example. Suppose we wish to summarize how equal
or unequal a society is. Consider a vector 𝒙 ∈ ℝ+𝑛 with normalization tr [𝒙 ] = 1 that
describes the distribution of wealth of each individual in some society 𝒙 . As we have
already seen, the majorization relation between societies 𝒙 ≺ 𝒚 is a way to quantify
that “society 𝒚 is more unequal than society 𝒙 ”.
Motivated by the above, can we find an even simpler summary of “inequality”? For
example, can we reduce the distribution to a single number? Formally, we may ask if
there is a function Φ : ℝ𝑛 → ℝ such that

𝒙 ≺𝒚 implies Φ(𝒙 ) ≤ Φ(𝒚 ).

In fact, we will see that many familiar functions qualify.


Example 4.5 (Isotone functions). The following functions respect the majorization order:

• k-max. The sum of the largest 𝑘 entries obviously preserves the majorization order:

Φ(𝒙 ) = ∑_{𝑖=1}^{𝑘} 𝑥𝑖↓ , where 1 ≤ 𝑘 ≤ 𝑛 .

• Negative entropy. The negative entropy reflects the amount of randomness or uniformity in a distribution:

Φ(𝒙 ) = negent (𝒙 ) = ∑_{𝑖=1}^{𝑛} 𝑥𝑖 log (𝑥𝑖 ).

• Variance. The variance is another measure of dispersion of a distribution:

Φ(𝒙 ) = Var [𝒙 ] = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝑥¯ ) 2 , where 𝑥¯ := (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 is the mean.

By the end of this lecture, we will develop new tools to confirm that the negative
entropy and the variance also respect the majorization order. 

4.3.2 Definitions
To prepare for our treatment of isotone functions, we will need some additional
definitions.

Definition 4.6 (Weak majorization). For two vectors 𝒙 , 𝒚 ∈ ℝ𝑛 , we say that 𝒚 submajorizes 𝒙 and write 𝒙 ≺𝑤 𝒚 when

∑_{𝑖=1}^{𝑘} 𝑥𝑖↓ ≤ ∑_{𝑖=1}^{𝑘} 𝑦𝑖↓ for each 𝑘 = 1, . . . , 𝑛 . (4.3)

Similarly, we say that 𝒚 supermajorizes 𝒙 and write 𝒙 ≺^𝑤 𝒚 when

∑_{𝑖=1}^{𝑘} 𝑥𝑖↑ ≥ ∑_{𝑖=1}^{𝑘} 𝑦𝑖↑ for each 𝑘 = 1, . . . , 𝑛 . (4.4)

In comparison with majorization, the submajorization and supermajorization conditions


both lack the trace equality constraint. This grants us more flexibility, but it also
renders the two conditions (4.3) and (4.4) inequivalent.
Exercise 4.7 (Majorization and weak majorization). Check the following equivalence:

𝒙 ≺ 𝒚 if and only if 𝒙 ≺𝑤 𝒚 and 𝒙 ≺^𝑤 𝒚 .


We now introduce the main concept of this lecture.

Definition 4.8 (Isotone function). We say a vector-valued function Φ : ℝ𝑛 → ℝ𝑚 is isotone if for all 𝒙 , 𝒚 ∈ dom (Φ) ,

𝒙 ≺ 𝒚 implies Φ(𝒙 ) ≺𝑤 Φ(𝒚 ).

(In the definition, we also allow for functions defined only on a convex subset of vectors, e.g., those with positive entries.)

One may wonder: Why not demand the full majorization relation Φ(𝒙 ) ≺ Φ(𝒚 ) ?
This condition turns out to be too limiting. For example, in the case of scalar-valued
functions (𝑚 = 1), the trace constraint in majorization would force the function to be
a constant.
Exercise 4.9 (Majorization on ℝ). For real numbers 𝑎, 𝑏 , verify that

𝑎 ≺ 𝑏 if and only if 𝑎 = 𝑏,
𝑎 ≺𝑤 𝑏 if and only if 𝑎 ≤ 𝑏.
In other words, the majorization relation for numbers is very rigid.

4.3.3 Sufficient conditions


To study which functions are isotone, we first introduce some definitions that will be
useful to formulate the sufficient conditions.

Definition 4.10 (Convex function: Vector-valued case). A vector-valued function Φ :


ℝ𝑛 → ℝ𝑚 is convex when

Φ (( 1 − 𝜏)𝒙 + 𝜏𝒚 ) ≤ ( 1 − 𝜏) Φ(𝒙 ) + 𝜏 Φ(𝒚 ) for all 𝜏 ∈ [0, 1]

and for all 𝒙 , 𝒚 ∈ dom (Φ) . The inequality ≤ acts entrywise.

Definition 4.11 (Permutation covariance). A function Φ : ℝ𝑛 → ℝ𝑚 is permutation covariant if for each vector 𝒙 ∈ ℝ𝑛 and each permutation 𝑷 on ℝ𝑛 , there is a permutation 𝑷 ′ on ℝ𝑚 such that

Φ(𝑷 𝒙 ) = 𝑷 ′ Φ(𝒙 ).

(The permutation 𝑷 ′ may depend on both 𝒙 and 𝑷 .)

Example 4.12 (The real case). When the output dimension 𝑚 = 1, the preceding defini-
tions simplify. Permutation covariant functions must be permutation invariant:
Φ(𝑷 𝒙 ) = Φ(𝒙 ) for each permutation 𝑷 .
Convexity reduces to the familiar notion for real-valued functions. 

Convexity and permutation covariance give sufficient (but not necessary) conditions
for isotonicity.

Theorem 4.13 (Isotonicity: Sufficient condition). If a function Φ : ℝ𝑛 → ℝ𝑚 is convex


and permutation covariant, then it is isotone.

Proof. Fix vectors 𝒙 , 𝒚 ∈ ℝ𝑛 that satisfy the majorization relation 𝒙 ≺ 𝒚 . In Lecture 3,


we showed that there exists a doubly stochastic matrix 𝑺 such that

𝒙 = 𝑺 𝒚 = ∑_{𝑖=1}^{𝑟} 𝛼𝑖 𝑷 𝑖 𝒚 .

The second relation expresses the doubly stochastic matrix as a convex combination of permutations 𝑷 𝑖 with 𝛼𝑖 ≥ 0 and ∑_{𝑖=1}^{𝑟} 𝛼𝑖 = 1. By convexity of the function Φ,

Φ(𝒙 ) = Φ( ∑_{𝑖=1}^{𝑟} 𝛼𝑖 𝑷 𝑖 𝒚 ) ≤ ∑_{𝑖=1}^{𝑟} 𝛼𝑖 Φ(𝑷 𝑖 𝒚 ) = ∑_{𝑖=1}^{𝑟} 𝛼𝑖 𝑷 𝑖′ Φ(𝒚 ) =: 𝒛 ,

where the inequality acts entrywise. Since the vector 𝒛 is a convex combination of permutations of the vector Φ(𝒚 ) , we arrive at the majorization relation 𝒛 ≺ Φ(𝒚 ) . The combined inequalities Φ(𝒙 ) ≤ 𝒛 ≺ Φ(𝒚 ) imply the submajorization relation Φ(𝒙 ) ≺𝑤 Φ(𝒚 ) , which is the advertised result. ■
Exercise 4.14 (Submajorization deduction). Check that the following implication holds.

Φ(𝒙 ) ≤ 𝒛 ≺ Φ(𝒚 ) implies Φ(𝒙 ) ≺𝑤 Φ(𝒚 ).

4.3.4 Isotone functions: Vector examples


First, let us give some examples of vector-valued functions that are isotone because of
the condition in Theorem 4.13.
Exercise 4.15 (Isotonicity: Vectorized scalar functions). Consider a scalar convex function
𝜑 : ℝ → ℝ, and extend it to vectors Φ : ℝ𝑛 → ℝ𝑛 by applying the scalar function 𝜑
entrywise:

Φ(𝑥 1 , · · · , 𝑥𝑛 ) B ( 𝜑 (𝑥 1 ), · · · , 𝜑 (𝑥𝑛 )). (4.5)

The constructed function Φ is convex and permutation covariant. Therefore, it is


isotone.
Example 4.16 (Vectorized scalar functions). Here are some particular cases of the construc-
tion in Exercise 4.15:

• Absolute powers. Convex power functions, applied entrywise, are isotone. For
example,

Φ(𝒙 ) B (|𝑥 1 |𝑝 , · · · , |𝑥𝑛 |𝑝 ), for 𝑝 ≥ 1.

• Positive parts. The positive part of a vector is an isotone function:

Φ(𝒙 ) B ((𝑥 1 )+ , · · · , (𝑥𝑛 )+ ).

• Exponential. The entrywise exponential of a vector is an isotone function:

Φ(𝒙 ) B ( e𝑥1 , · · · , e𝑥𝑛 ). (4.6)



Note that each of these scalar functions is already convex. 

Corollary 4.17 (Log majorization). The log-majorization log (𝒙 ) ≺ log (𝒚 ) implies the
submajorization 𝒙 ≺𝑤 𝒚 .

Proof. This result follows immediately from the exponential example (4.6). 
This corollary connects nicely with the Weyl majorant theorem (Theorem 4.2). We
maintain the same notation for the decreasingly ordered eigenvalues and singular
values.
Corollary 4.18 (Eigenvalue and singular value majorization). For every matrix 𝑨 ∈ 𝕄𝑛 , we
have the submajorization relation 𝝀 ↓ (𝑨) ≺𝑤 𝝈 (𝑨) .

4.3.5 Isotone functions: Scalar examples


Next, let us consider the application of Theorem 4.13 in the special case when the
function is scalar-valued (𝑚 = 1).
Corollary 4.19 (Scalar-valued isotone functions: Sufficient conditions). Suppose that the
function Φ : ℝ𝑛 → ℝ is convex and permutation invariant. Then it is isotone. That is,
𝒙 ≺𝒚 implies Φ(𝒙 ) ≤ Φ(𝒚 ).
Example 4.20 (Isotonicity: Coordinate sums). If a function 𝜑 : ℝ → ℝ is convex, then the function Φ(𝒙 ) = ∑_{𝑖=1}^{𝑛} 𝜑 (𝑥𝑖 ) is isotone. Here are some examples.

• Negative entropy. This construction allows us to check that the negative entropy function is isotone:

𝒙 ≺ 𝒚 implies negent (𝒙 ) ≤ negent (𝒚 ),

where negent (𝒙 ) := ∑_{𝑖} 𝑥𝑖 log (𝑥𝑖 ) is the negative entropy.

• ℓ𝑝 -norms. Each ℓ𝑝 norm (𝑝 ≥ 1) is an isotone function:

𝒙 ≺ 𝒚 implies ‖𝒙 ‖𝑝 ≤ ‖𝒚 ‖𝑝 .

These functions can also be viewed as tracing over the entries of the construction (4.5). Indeed, convexity and permutation covariance remain after taking the trace. ■

Unfortunately, all the above examples rely on convexity. In the next section, we
will see that convexity is not required to attain isotonicity.

4.4 Schur “convexity”


We present the necessary and sufficient conditions for isotonicity for a real-valued,
differentiable function.

Definition 4.21 (Schur “convex” function). Consider an isotone real-valued function Φ : ℝ𝑛 → ℝ. That is, the majorization relation 𝒙 ≺ 𝒚 implies the inequality Φ(𝒙 ) ≤ Φ(𝒚 ) . Then we say the function Φ is Schur convex or S-convex. (Warning: Despite the name, Schur “convex” functions need not be convex!)

Theorem 4.22 (Schur “convexity”: Characterization). Suppose that the function Φ :


ℝ𝑛 → ℝ is differentiable on its convex domain. Then the following statements
are equivalent.

1. The function Φ is isotone (Schur-convex).



2. The function Φ is permutation invariant. In addition, for all pairs (𝑖 , 𝑗 ) of coordinate indices and all vectors 𝒙 ∈ dom (Φ) ,

(𝑥𝑖 − 𝑥 𝑗 ) ( 𝜕𝑖 Φ(𝒙 ) − 𝜕 𝑗 Φ(𝒙 ) ) ≥ 0. (4.7)

(4.7)

For smooth functions, isotonicity is characterized locally by the derivative. This


theorem provides more examples of isotone functions.
Example 4.23 (Schur “convex” functions). The following real-valued functions Φ : ℝ𝑛 →
ℝ are isotone:
• Product. The negative product of the entries of a positive vector is isotone:

Φ(𝒙 ) = − ∏_{𝑖=1}^{𝑛} 𝑥𝑖 .

• Variance. The variance function is isotone:

Φ(𝒙 ) = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝑥¯ ) 2 , where 𝑥¯ := (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑥𝑖 is the mean.
These examples are harder to verify directly from the definition of isotonicity, but the
theorem makes short work of them. 
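As a sanity check on Theorem 4.22 (an illustration of ours, not from the notes), one can verify condition (4.7) for the variance directly, using the explicit gradient 𝜕𝑖 Var(𝒙 ) = (2/𝑛)(𝑥𝑖 − 𝑥¯ ), together with permutation invariance.

```python
import numpy as np

def variance(x):
    return np.mean((x - x.mean()) ** 2)

def variance_grad(x):
    # partial_i Var(x) = (2/n) * (x_i - mean(x))
    return 2.0 * (x - x.mean()) / len(x)

rng = np.random.default_rng(9)
for _ in range(100):
    x = rng.standard_normal(6)
    assert np.isclose(variance(x), variance(x[rng.permutation(6)]))   # permutation invariant
    g = variance_grad(x)
    # Condition (4.7): (x_i - x_j)(g_i - g_j) >= 0 for every pair (i, j).
    assert np.all((x[:, None] - x[None, :]) * (g[:, None] - g[None, :]) >= -1e-12)

print("the variance satisfies the Schur-Ostrowski condition (4.7)")
```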

Let us prove the theorem.

Proof of Theorem 4.22. (2 ⇒ 1). We begin with the reverse implication. Consider
vectors 𝒖, 𝒙 that satisfy the majorization relation 𝒖 ≺ 𝒙 . To check isotonicity of the
function Φ, we want to prove the inequality Φ(𝒖) ≤ Φ(𝒙 ) . In Lecture 3, we showed
that majorization can be expressed using a sequence of T-transforms:
𝒖 = 𝑻 𝑛 . . .𝑻 1 𝒙 .
Recall that a T-transform is a convex combination of the identity and a transposition:
𝑻 = 𝜏𝑰 + ( 1 − 𝜏)𝑸 for 𝜏 ∈ [0, 1] and 𝑸 a transposition.
Therefore, it suffices to obtain the inequality Φ(𝒖) ≤ Φ(𝒙 ) for a single transition
𝒖 = 𝑻 𝒙.
Without loss of generality, permutation invariance allows us to assume that the
T-transform 𝑻 averages the first two coordinates 𝑥 1 , 𝑥 2 only. We can write explicitly
 
𝒖 = ( 1 − 𝑠 )𝑥 1 + 𝑠𝑥 2 , 𝑠𝑥 1 + ( 1 − 𝑠 )𝑥 2 , . . . , 𝑥𝑛 for 𝑠 ∈ [0, 0.5]. (4.8)

The key idea is to interpolate from the vector 𝒙 to the vector 𝒖 . Define the function
 
𝒙 (𝜏) = ( 1 − 𝜏)𝑥 1 + 𝜏𝑥2 , 𝜏𝑥1 + ( 1 − 𝜏)𝑥 2 , . . . , 𝑥𝑛 for 𝜏 ∈ [0, 𝑠 ].
Note that 𝒙 ( 0) = 𝒙 and 𝒙 (𝑠 ) = 𝒖 . By the fundamental theorem of calculus (and the
assumption that the function Φ is differentiable),

Φ(𝒖) − Φ(𝒙 ) = ∫_0^𝑠 (d/d𝜏) [Φ(𝒙 (𝜏))] d𝜏
= ∫_0^𝑠 (𝑥 2 − 𝑥 1 ) ( 𝜕1 Φ(𝒙 (𝜏)) − 𝜕2 Φ(𝒙 (𝜏)) ) d𝜏
= − ∫_0^𝑠 [ (𝑥 1 (𝜏) − 𝑥 2 (𝜏)) / ( 1 − 2𝜏) ] ( 𝜕1 Φ(𝒙 (𝜏)) − 𝜕2 Φ(𝒙 (𝜏)) ) d𝜏
≤ 0.

The second equality is the chain rule, and the final inequality follows from the assumed condition (4.7).
(1 ⇒ 2). Now we confirm the forward implication. Fix any vector 𝒙 . Observe that
𝒙 is majorized by any permutation 𝑷 𝒙 because they have the identical sorted entries.
Therefore, we can write down the two-sided majorization relation

𝒙 ≺ 𝑷 𝒙 ≺ 𝑷 −1 (𝑷 𝒙 ) = 𝒙 .

Apply Schur convexity to obtain permutation invariance of the function Φ:

Φ(𝒙 ) ≤ Φ(𝑷 𝒙 ) ≤ Φ(𝒙 ).

That is, Φ(𝑷 𝒙 ) = Φ(𝒙 ) .


To establish the differential condition (4.7), for every vector in the domain 𝒙 ∈
dom (Φ) , we consider the vector
 
𝒖 𝑖 𝑗 (𝑠 ) B 𝑻 𝑖 𝑗 𝒙 = . . . , ( 1 − 𝑠 )𝑥𝑖 + 𝑠𝑥 𝑗 , . . . , 𝑠𝑥𝑖 + ( 1 − 𝑠 )𝑥 𝑗 , . . . .

This is analogous to (4.8), but the T-transform 𝑻 𝑖 𝑗 acts on the (𝑖 , 𝑗 ) pair of indices.
Take the derivative as 𝑠 → 0 to obtain

0 ≥ lim_{𝑠 → 0} [Φ(𝒖 𝑖 𝑗 (𝑠 )) − Φ(𝒙 )] / 𝑠 = (d/d𝑠 ) [Φ(𝒖 𝑖 𝑗 (𝑠 ))] |_{𝑠 = 0} = (𝑥 𝑗 − 𝑥𝑖 ) ( 𝜕𝑖 Φ(𝒙 ) − 𝜕 𝑗 Φ(𝒙 ) ).


The first inequality holds because of the majorization relation between the vectors and
the Schur convexity of the function Φ. Rearrange to obtain the advertised result. 

Notes
The material in this lecture is adapted from Bhatia [Bha97, Chap. II], which is based
on Ando’s vision [And89].

Lecture bibliography
[And89] T. Ando. “Majorization, doubly stochastic matrices, and comparison of eigenvalues”.
In: Linear Algebra and its Applications 118 (1989), pages 163–248.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
5. Birkhoff and von Neumann

Date: 18 January 2022 Scribe: Jagannadh Boddapati

In this lecture, we study the geometry of DS𝑛 , the set of 𝑛 × 𝑛 doubly stochastic matrices. We establish the classic result of Birkhoff & von Neumann, which states that the set of doubly stochastic matrices can be expressed as the convex hull of the permutation matrices. We use this result to prove the von Neumann trace theorem, which plays a basic role in understanding unitarily invariant norms and convex trace functions.

Agenda:
1. Doubly stochastic matrices
2. The Birkhoff–von Neumann theorem
3. Minkowski theorem
4. Proof of Birkhoff theorem
5. The von Neumann trace theorem

5.1 Doubly stochastic matrices


In Lecture 3, we introduced the doubly stochastic matrices. We saw that doubly
stochastic matrices act on a vector by averaging its entries. We also saw that doubly
stochastic matrices arise in connection with majorization.

Definition 5.1 (Doubly stochastic matrices). A matrix 𝑺 ∈ ℝ𝑛×𝑛 is called doubly


stochastic if it satisfies the following properties.

1. Positive. For all 𝒙 ∈ ℝ𝑛 , the relation 𝒙 ≥ 0 implies 𝑺𝒙 ≥ 0. It is equivalent to


say that all the entries of the matrix are positive: 𝑠𝑖 𝑗 ≥ 0 for all 𝑖 , 𝑗 = 1, . . . , 𝑛 .
2. Trace preserving. For all 𝒙 ∈ ℝ𝑛 , we have tr (𝑺𝒙 ) = tr (𝒙 ) . This is equivalent to
say that each column adds up to one.
3. Unital. 𝑺 1 = 1. It is equivalent to say that each row adds up to one.

Definition 5.2 (Birkhoff polytope). The set DS𝑛 collects all of the 𝑛 × 𝑛 doubly
stochastic matrices:

DS𝑛 B 𝑺 ∈ ℝ𝑛×𝑛 : 𝑺 is doubly stochastic .

As we will discuss, the set DS𝑛 is a convex polytope known as the Birkhoff polytope.

The Birkhoff polytope arises in the study of majorization.

Theorem 5.3 (Doubly stochastic matrices: Characterization). If 𝒙 , 𝒚 ∈ ℝ𝑛 , then 𝒙 ≺ 𝒚


holds if and only if 𝒙 = 𝑺 𝒚 for some 𝑺 ∈ DS𝑛 .

That is, when 𝒙 ≺ 𝒚 , the entries of 𝒙 are “more average” than the entries of 𝒚 .

5.1.1 Properties of doubly stochastic matrices


In this section, we are going to explore the geometry of the set of doubly stochastic
matrices DS𝑛 in more detail. We make the following observations.

1. Polyhedron. The set DS𝑛 is a closed polyhedron, i.e., a finite intersection of


closed halfspaces. Indeed, the positivity constraints restrict each entry to be

positive, which is a halfspace constraint. Each equality constraint in Definition


5.1 can be written as two halfspace constraints. Being a finite intersection of
closed halfspaces, the set DS𝑛 is convex and closed. An example of a polyhedron
appears in Figure 5.1.
2. Bounded. All the entries of a doubly stochastic matrix lie between 0 and 1. Hence
the set DS𝑛 is bounded. In particular, DS𝑛 is compact.
3. Polytope. The set DS𝑛 is a polytope, i.e., is the convex hull of finitely many
points. This claim follows from a major result in convex geometry known as
the Weyl–Minkowski theorem, which states that every bounded polyhedron is a
polytope. See Figure 5.2 for an example of a polytope. Please refer to [Bha97] for more details on doubly stochastic matrices and the Weyl–Minkowski theorem.

Figure 5.1 Example of a polyhedron formed by a finite intersection of closed halfspaces.

Note that the Birkhoff polytope DS𝑛 contains all the 𝑛 × 𝑛 permutation matrices.

This observation leads us to explore what role the permutation matrices play in the
structure of DS𝑛 .
Exercise 5.4 (The convex hull of permutations). Deduce that the convex hull of the permu-
tation matrices is a subset of the doubly stochastic matrices:

conv 𝑷 ∈ ℝ𝑛×𝑛 : 𝑷 is a permutation matrix ⊆ DS𝑛 .

We will prove that this inclusion can be upgraded to a set equality.

5.2 The Birkhoff–von Neumann theorem

Figure 5.2 Example of a polytope constructed as a convex hull of five points shown in red.

In this section, we will discuss the Birkhoff–von Neumann theorem, which relates the convex hull of permutation matrices to the set of doubly stochastic matrices DS𝑛 . To prepare for our treatment, we will need some additional definitions.

Definition 5.5 (Extreme point). Let K ⊆ ℝ𝑑 be a (nonempty) convex set. An extreme point of K is a point 𝒙 ∈ K such that 𝒚 , 𝒛 ∈ K and 𝒙 = (1/2)(𝒚 + 𝒛 ) together imply that 𝒙 = 𝒚 = 𝒛 .

In other words, we cannot represent an extreme point 𝒙 as an average of distinct


points in K. See Figure 5.3 for examples of extreme points.
Now, we state the Birkhoff–von Neumann theorem, which originally appeared in
the paper [Bir46].

Theorem 5.6 (Birkhoff 1946; von Neumann 1953). The extreme points (i.e., vertices) of
DS𝑛 are precisely the permutation matrices. In particular, DS𝑛 can be written as
convex hull of permutation matrices,

DS𝑛 = conv {𝑷 ∈ ℝ𝑛×𝑛 : 𝑷 is a permutation matrix}.

To prove this theorem, we must demonstrate that the permutation matrices compose
the full set of extreme points of the Birkhoff polytope. First, we will present an important
result from convex geometry that shows that the extreme points play a key role in the
structure of convex sets.
Exercise 5.7 (Number of permutations). How many permutations suffice to express a
doubly stochastic matrix in DS𝑛 ? Hint: Use the Carathéodory theorem.
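The theorem is also constructive in spirit: one can peel permutation matrices off a doubly stochastic matrix one at a time. The sketch below is an illustration of ours (not from the notes); it uses scipy.optimize.linear_sum_assignment to locate, at every step, a permutation supported on the positive entries of the remaining matrix, then subtracts the largest feasible multiple of it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decomposition(S, tol=1e-12):
    """Greedily write a doubly stochastic S as a convex combination of permutation matrices."""
    R = np.array(S, dtype=float)
    terms = []
    while R.max() > tol:
        # A permutation supported on positive entries exists (Birkhoff/Koenig); find one
        # by penalizing zero entries heavily in the assignment cost.
        cost = np.where(R > tol, -np.log(np.maximum(R, tol)), 1e6)
        rows, cols = linear_sum_assignment(cost)
        P = np.zeros_like(R)
        P[rows, cols] = 1.0
        weight = R[rows, cols].min()
        terms.append((weight, P))
        R = R - weight * P
        R[R < tol] = 0.0
    return terms

rng = np.random.default_rng(10)
n = 4
S = sum(w * np.eye(n)[rng.permutation(n)] for w in rng.dirichlet(np.ones(3)))

terms = birkhoff_decomposition(S)
print("permutations used:", len(terms), " total weight:", sum(w for w, _ in terms))
assert np.allclose(S, sum(w * P for w, P in terms))
```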
A nice corollary of Theorem 5.6 is an independent proof of the geometric character-
ization of majorization from the last lecture.

Figure 5.3 (Examples of extreme points). Points shown in red are extreme
while the points shown in blue are not extreme.

Corollary 5.8 (Permutation matrix). Let 𝒙 , 𝒚 ∈ ℝ𝑛 . The condition 𝒙 = 𝑺 𝒚 for some 𝑺 ∈ DS𝑛 is equivalent to 𝒙 ∈ conv {𝑷 𝒚 : 𝑷 is a permutation matrix}.
To understand the meaning of this corollary, it is fruitful to define a new class of
convex polytopes.

Definition 5.9 (Permutahedron). Let 𝒚 ∈ ℝ𝑛 . A permutahedron is a polytope of the


form
conv {𝑷 𝒚 : 𝑷 ∈ ℝ𝑛×𝑛 a permutation matrix}.

As a consequence, we recognize that the set {𝒙 ∈ ℝ𝑛 : 𝒙 ≺ 𝒚 } of vectors that are majorized by 𝒚 forms a convex polytope. This fact assigns a geometric meaning to the majorization relation.

Figure 5.4 Example of a permutahedron in ℝ3 .
Exercise 5.10 (Permutahedra). By construction, the permutahedron is a polytope. By
direct argument, show that a permutahedron is also a bounded polyhedron.
In addition to their role in majorization, permutahedra also arise from matching
problems, network science, transportation problems, and convex optimization.

5.3 The Minkowski theorem on extreme points


We now present the Minkowski theorem, which explains how to represent a compact,
convex set in terms of its extreme points. We will use this result to prove the Birkhoff
theorem in the next section.

Theorem 5.11 (Minkowski). Let K ⊆ ℝ𝑑 be a nonempty, compact, and convex set.


Then K is the convex hull of the set of its extreme points:

K = conv ( ext ( K)),

where ext ( K) is the set of extreme points of K.

Proof. We shall prove this theorem by induction on the dimension of K. (The dimension of a convex set K is defined as dim K := dim aff ( K) , the dimension of the smallest affine space containing the set K.) If dim ( K) = 0, then K = {𝒙 } is a singleton and the result follows.

Figure 5.5 (Two cases for proof on Minkowski theorem by induction). The
left figure demonstrates the case when 𝒙 belongs to boundary ( K) .
The right figure demonstrates the case when 𝒙 does not belong to
boundary ( K).

Assume that the result holds true for all nonempty, compact, convex sets with dimension at most 𝑑 − 1.


Now, let us consider a nonempty, compact, convex set K with dimension 𝑑 . We must
show that every point 𝒙 ∈ K can be represented as a convex combination of extreme
points of K. There are two cases as shown in Figure 5.5.
First, consider the case when 𝒙 ∈ boundary ( K) . Then we can separate 𝒙 weakly from K via a supporting hyperplane H at 𝒙 . The intersection of K and H is a nonempty, compact, and convex set. Hence, dim ( K ∩ H) is less than or equal to 𝑑 − 1, so the induction hypothesis applies.
We deduce that
𝒙 ∈ K ∩ H = conv ext ( K ∩ H) ⊆ conv ext ( K).
We have used the fact that the extreme points of a face of a convex set are also extreme
points of the set.
Second, consider the case when 𝒙 ∉ boundary ( K) . Choose a line L containing 𝒙 .
Since K is compact and convex, this line hits the boundary at two points, say 𝒚 , 𝒛 .
Therefore, we can write 𝒙 as a convex combination of 𝒚 , 𝒛 :

𝒙 = 𝜏𝒚 + ( 1 − 𝜏)𝒛 where 𝜏 ∈ [ 0, 1] .

Thus,
𝒙 ∈ conv {𝒚 , 𝒛 } ⊆ conv ext ( K).
Indeed, the boundary points 𝒚 and 𝒛 belong to conv ext ( K) by the first case.
Please refer to [Sch14] for more details on this theorem and its context. Let us
discuss a few consequences and generalizations of the Minkowski theorem.
Corollary 5.12 (Extreme point). Every nonempty, convex, compact set has an extreme
point.
Exercise 5.13 (Bauer maximum principle). Show that every linear functional on a convex,
compact set attains its maximum at an extreme point. In particular, if the maximizer
is unique , it must be an extreme point of the set.
The Minkowski theorem has a generalization to infinite-dimensional spaces, a result
that has far-reaching implications in mathematical analysis.
Fact 5.14 (Krein–Milman). Suppose X is a locally convex topological vector space (for
example, a normed space). Suppose that K is a compact and convex subset of X. Then
K is equal to the closed convex hull of its extreme points:

K = conv ( ext ( K)).



Moreover, if B ⊆ K, then K is equal to the closed convex hull of B if and only if


ext ( K) ⊆ cl ( B) , where cl ( B) is the closure of B. 

5.4 Proof of Birkhoff theorem


In this section, we give a proof of the Birkhoff theorem, which states that the extreme
points (that is, the vertices) of DS𝑛 are precisely the permutation matrices. The first
step is an exercise.
Exercise 5.15 (Permutations and DS𝑛 ). Show that each permutation matrix 𝑷 ∈ ℝ𝑛×𝑛 is
an extreme point of DS𝑛 . Hint: Argue that there is a linear functional that achieves a
unique maximum on DS𝑛 at 𝑷 . Invoke the Bauer maximum principle.
The following notion will be helpful in the proof that the permutation matrices are
the only possible extreme points of the Birkhoff polytope.

Definition 5.16 (Perturbation). Let 𝑺 ∈ DS𝑛 belong to the set of doubly stochastic
matrices. A perturbation 𝑬 ∈ ℝ𝑛×𝑛 of the matrix 𝑺 is a matrix with the property
that 𝑺 ± 𝑬 ∈ DS𝑛 .

Exercise 5.17 (Extreme point and perturbation). Show that 𝑺 is an extreme point of DS𝑛
if and only if 𝑺 admits no perturbation 𝑬 , except the zero matrix. In particular, if 𝑺
admits a nontrivial perturbation, then 𝑺 is not an extreme point.

Proof of Theorem 5.6. We already know that the permutation matrices are extreme points of the Birkhoff polytope. We will argue that the permutation matrices compose the full set of extreme points. An application of Minkowski’s theorem completes the argument.

Figure 5.6 Example of a perturbation.
To that end, let us suppose that 𝑺 ∈ DS𝑛 is not a permutation. We will produce a
nonzero perturbation 𝑬 . Therefore, 𝑺 is not an extreme point of DS𝑛 .
Observe that every doubly stochastic matrix that is not a permutation contains at least one entry that is not an integer. We will find a “cycle” consisting of nonintegral entries. First, find an entry in row 𝑖 1 and column 𝑗 1 such that 𝑠𝑖 1 𝑗 1 lies strictly between 0 and 1. Now, select another entry in row 𝑖 1 such that 𝑠𝑖 1 𝑗 2 lies strictly between 0 and 1. We can do this because the entries in the row have to add up to 1. Find another entry in column 𝑗 2 such that 𝑠𝑖 2 𝑗 2 lies strictly between 0 and 1. We keep moving horizontally along the rows and vertically along the columns in sequence until we encounter an index pair that we have already seen. Figure 5.8 illustrates this process. This process has to terminate because there are only finitely many positions in the matrix.
Among all such sequences, we choose one with the fewest steps. This sequence must be a cycle:

(𝑖 1 , 𝑗 1 ) = (𝑖𝑟 , 𝑗 𝑟 ).

Note that this cycle must have an even number of steps. Indeed, in order to complete a cycle, we must alternate between moves along a row and moves along a column. An odd number of steps within a given row or a given column could be combined to form a shorter connection, contradicting minimality.
Given the indices in the cycle, we can find 𝜀 > 0 such that

𝑠𝑖𝑎 𝑗 𝑎 ± 𝜀 ∈ ( 0, 1) for each index 𝑎 .

We can construct a nontrivial perturbation 𝑬 whose nonzero entries are

𝑒𝑖1 𝑗 1 = 𝜀 ; 𝑒𝑖1 𝑗 2 = −𝜀 ; 𝑒𝑖2 𝑗 2 = 𝜀 ; 𝑒𝑖2 𝑗 3 = −𝜀 ; etc.

Figure 5.7 Perturbation matrix 𝑬 used in proving the Birkhoff theorem.

Figure 5.8 (Reordering indices to prove the Birkhoff theorem). We move from one index where the entry is nonintegral to another index where the entry is nonintegral, alternating between rows and columns. We repeat the process until we end up at the same index. We consider only the shortest path and discard any indices that attach to the loop but do not close the loop (indicated in transparent colors).

Then the row sums and column sums of 𝑬 are equal to zero. It follows that

𝑺 ± 𝑬 ∈ DS𝑛 .

Therefore, 𝑺 admits a nontrivial perturbation, so 𝑺 is not an extreme point.
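
The Birkhoff theorem also suggests a constructive procedure: repeatedly extract a permutation supported on positive entries of a doubly stochastic matrix and subtract off as much of it as possible. The greedy sketch below is my own illustration (not code from the notes); it relies on scipy.optimize.linear_sum_assignment and on a tolerance guard, so it is a heuristic demonstration rather than a certified implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decomposition(S, tol=1e-12):
    """Greedily write a doubly stochastic S as a convex combination of permutations."""
    S = np.array(S, dtype=float)   # work on a copy
    terms = []
    while S.max() > tol:
        # A maximum-weight assignment tends to avoid the zero entries of S.
        rows, cols = linear_sum_assignment(S, maximize=True)
        P = np.zeros_like(S)
        P[rows, cols] = 1.0
        w = S[rows, cols].min()    # largest weight we can peel off
        if w <= tol:               # guard; should not trigger for exact DS input
            break
        terms.append((w, P))
        S = S - w * P
    return terms

S = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.5, 0.2],
              [0.2, 0.2, 0.6]])
terms = birkhoff_decomposition(S)
print("weights:", [round(w, 3) for w, _ in terms])             # weights sum to 1
print("error:", np.abs(sum(w * P for w, P in terms) - S).max())
```

The number of permutations produced by such a decomposition relates to Exercise 5.7.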

5.5 The Richter trace theorem


We may now prove the following theorem of Richter on a maximization problem that
arises in the study of matrix traces. This is a close relative of an older result due to von
Neumann (Exercise 5.20). The proof is due to Mirsky [Mir59].

Theorem 5.18 (Richter trace theorem). Let 𝑨, 𝑩 ∈ ℍ𝑛 . Then


max{tr (𝑼 ∗ 𝑨𝑼 𝑩) : 𝑼 ∈ 𝕄𝑛 is unitary} = ∑_{𝑖=1}^{𝑛} 𝜆𝑖↓ (𝑨) 𝜆𝑖↓ (𝑩).

In other words, the theorem solves a matching problem. What is the best way we can
rotate 𝑨 to align it with 𝑩 ? Part of the assertion is that the maximum is attained.

Proof. We prove the theorem by establishing upper and lower bounds on the trace.
First, let us introduce the eigenvalue decompositions of 𝑨 and 𝑩 .

𝑨 = 𝑸 𝑨 diag (𝝀)𝑸 ∗𝑨 where 𝝀 = 𝝀 ↓ (𝑨) ;
𝑩 = 𝑸 𝑩 diag (𝝁)𝑸 ∗𝑩 where 𝝁 = 𝝀 ↓ (𝑩).

We can obtain a lower bound on the trace by choosing 𝑼 0 = 𝑸 𝑨 𝑸 ∗𝑩 . Indeed,

max𝑼 tr (𝑼 ∗ 𝑨𝑼 𝑩) ≥ tr (𝑼 0∗ 𝑨𝑼 0 𝑩) = tr ( diag (𝝀) diag (𝝁)) = ∑_{𝑖=1}^{𝑛} 𝜆𝑖 𝜇𝑖 .

It remains to show that this lower bound is indeed the largest possible value.

We can establish a matching upper bound on the trace as follows. Recall that 𝜹 𝑖 is the 𝑖 th standard basis vector. Then

max𝑼 tr (𝑼 ∗ 𝑨𝑼 𝑩) = max𝑼 tr (𝑼 ∗ diag (𝝀)𝑼 diag (𝝁))
  = max𝑼 ∑_{𝑖=1}^{𝑛} 𝜹 𝑖∗ 𝑼 ∗ diag (𝝀)𝑼 diag (𝝁)𝜹 𝑖
  = max𝑼 ∑_{𝑖=1}^{𝑛} (𝑼 𝜹 𝑖 ) ∗ diag (𝝀)(𝑼 𝜹 𝑖 ) 𝜇𝑖 .

Here, the first equality is obtained by absorbing 𝑸 𝑨 , 𝑸 𝑩 into 𝑼 . The second equality
comes from the fact that the trace is the sum of diagonal entries. Next, we represent
diag (𝝀) as a sum of rank-one matrices:
diag (𝝀) = ∑_{𝑗=1}^{𝑛} 𝜆 𝑗 𝜹 𝑗 𝜹 ∗𝑗 .

Combine the last two displays and simplify:


max𝑼 tr (𝑼 ∗ 𝑨𝑼 𝑩) = max𝑼 ∑_{𝑖,𝑗=1}^{𝑛} 𝜇𝑖 𝜆 𝑗 (𝑼 𝜹 𝑖 ) ∗ 𝜹 𝑗 𝜹 ∗𝑗 (𝑼 𝜹 𝑖 ) = max𝑼 ∑_{𝑖,𝑗=1}^{𝑛} 𝜇𝑖 𝜆 𝑗 |𝜹 ∗𝑗 𝑼 𝜹 𝑖 |^2 .

Note that we have exposed the squared magnitudes of the entries of the unitary matrix,
which compose an orthostochastic matrix 𝑺 ∈ DS𝑛 with entries 𝑠𝑖 𝑗 = |𝑢 𝑖 𝑗 | 2 .
We see that the maximum only increases if we pass to the full set of doubly stochastic
matrices:

max𝑼 tr (𝑼 ∗ 𝑨𝑼 𝑩) ≤ max_{𝑺 ∈ DS𝑛} ∑_{𝑖,𝑗=1}^{𝑛} 𝜇𝑖 𝜆 𝑗 𝑠𝑖 𝑗 .

Invoke Bauer’s maximum principle to see that this linear function attains its maximum
at an extreme point of the set. Then use Birkhoff ’s theorem to recognize that the
extreme points of DS𝑛 are precisely the permutation matrices 𝑷 ∈ ℝ𝑛×𝑛 . Thus,
max𝑼 tr (𝑼 ∗ 𝑨𝑼 𝑩) ≤ max𝑷 ∑_{𝑖,𝑗=1}^{𝑛} 𝜇𝑖 𝜆 𝑗 𝑝𝑖 𝑗 = max𝜋 ∑_{𝑖=1}^{𝑛} 𝜆𝜋 (𝑖 ) 𝜇𝑖 = ∑_{𝑖=1}^{𝑛} 𝜆𝑖 𝜇𝑖 .

We have passed from the permutation matrix to the associated permutation: 𝜋 (𝑖 ) = 𝑗


if and only if 𝑝𝑖 𝑗 = 1. Last, we invoke the Chebyshev rearrangement inequality to see
that the maximum is attained when the two vectors are both arranged in decreasing
order.
Thus, the upper and lower bounds on the trace coincide, and we have established the result.
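
As a quick sanity check on Theorem 5.18 (my own illustration, not part of the notes), one can compare the value attained by the aligning unitary 𝑼 0 = 𝑸 𝑨 𝑸 ∗𝑩 against random unitaries. The random-unitary construction via a QR factorization is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def random_hermitian(n):
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (M + M.conj().T) / 2

def random_unitary(n):
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    return Q

A, B = random_hermitian(n), random_hermitian(n)

# eigh returns eigenvalues in increasing order; flip for the decreasing arrangement.
lamA, QA = np.linalg.eigh(A)
lamB, QB = np.linalg.eigh(B)
target = np.sum(lamA[::-1] * lamB[::-1])      # sum of decreasingly ordered products

U0 = QA[:, ::-1] @ QB[:, ::-1].conj().T       # align the decreasing eigenbases
attained = np.trace(U0.conj().T @ A @ U0 @ B).real

best_random = max(np.trace(U.conj().T @ A @ U @ B).real
                  for U in (random_unitary(n) for _ in range(2000)))

print(f"target  {target:.6f}")
print(f"U0      {attained:.6f}")   # matches the target
print(f"random  {best_random:.6f}")  # never exceeds the target
```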
Here are some related results that often prove useful.
Exercise 5.19 (Richter trace theorem). Let 𝑨, 𝑩 ∈ ℍ𝑛 . Prove that
min{tr (𝑼 ∗ 𝑨𝑼 𝑩) : 𝑼 ∈ 𝕄𝑛 is unitary} = ∑_{𝑖=1}^{𝑛} 𝜆𝑖↓ (𝑨) 𝜆𝑖↑ (𝑩).

Exercise 5.20 (von Neumann trace theorem). Let 𝑨, 𝑩 ∈ 𝕄𝑛 . Prove that


max{tr (𝑼 ∗ 𝑨𝑽 𝑩) : 𝑼 ,𝑽 ∈ 𝕄𝑛 are unitary} = ∑_{𝑖=1}^{𝑛} 𝜎𝑖↓ (𝑨) 𝜎𝑖↓ (𝑩).

Notes
Birkhoff and von Neumann independently established the result that shares their
name. There is an influential geometric proof of the result due to Hoffman &
Wielandt [HW53], which also connects the result with perturbation theory for eigenval-
ues of a normal matrix. The material on convex geometry is adapted from Barvinok’s
book [Bar02] and from Schneider’s treatise [Sch14]. We have extracted this direct
proof of Birkhoff ’s theorem from a note by Glenn Hurlbert, which appears in [Hur10].
The proof of Richter’s trace theorem is drawn from Mirsky’s work [Mir59].

Lecture bibliography
[Bar02] A. Barvinok. A course in convexity. American Mathematical Society, Providence, RI,
2002. doi: 10.1090/gsm/054.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bir46] G. Birkhoff. “Three observations on linear algebra”. In: Univ. Nac. Tacuman, Rev.
Ser. A 5 (1946), pages 147–151.
[HW53] A. J. Hoffman and H. W. Wielandt. “The variation of the spectrum of a normal
matrix”. In: Duke J. Math. 20 (1953), pages 37–39.
[Hur10] G. H. Hurlbert. Linear optimization. The simplex workbook. Springer, New York,
2010. doi: 10.1007/978-0-387-79148-7.
[Mir59] L. Mirsky. “On the trace of matrix products”. In: Mathematische Nachrichten 20.3-6
(1959), pages 171–174. doi: 10.1002/mana.19590200306.
[Sch14] R. Schneider. Convex bodies: the Brunn–Minkowski theory. 151. Cambridge university
press, 2014. doi: 10.1017/CBO9781139003858.
6. Unitarily Invariant Norms

Date: 20 January 2022 Scribe: Elvira Moreno

Agenda:
1. Symmetric gauge functions
2. Duality for symmetric gauge functions
3. Unitarily invariant norms
4. Duality for unitarily invariant norms

In this lecture, we first introduce symmetric gauge functions, an important family of norms over ℝ𝑛 that have the property of being sign and permutation invariant. As part of our analysis, we develop duality theory for symmetric gauge functions and show that they satisfy some widely used norm inequalities. We then discuss a closely related family of matrix norms that are invariant under coordinate changes. By building on their connection to symmetric gauge functions, we develop duality theory for this important family of norms and provide a generalization of the well-known Hölder inequality.

6.1 Symmetric gauge functions


We begin by introducing an important class of norms on ℝ𝑛 .
Definition 6.1 (Symmetric gauge function). A symmetric gauge function is a map Φ : ℝ𝑛 → ℝ+ that satisfies four properties:
1. Norm: Φ is a norm.
2. Permutation invariance: The equation Φ(𝑷 𝒙 ) = Φ(𝒙 ) holds for any vector 𝒙 ∈ ℝ𝑛 and any permutation matrix 𝑷 ∈ 𝕄𝑛 .
3. Sign invariance: The equation Φ(𝑫𝒙 ) = Φ(𝒙 ) holds for any 𝒙 ∈ ℝ𝑛 and for any 𝑛 × 𝑛 diagonal matrix of the form 𝑫 = diag (±1, . . . , ±1) .
4. Normalization: Φ is normalized as Φ(( 1, 0, . . . , 0)) = 1.

Recall that a norm on ℝ𝑛 is a function Φ : ℝ𝑛 → ℝ satisfying three properties: (1) positive definiteness: Φ(𝒙 ) ≥ 0 for all 𝒙 ∈ ℝ𝑛 , and Φ(𝒙 ) = 0 if and only if 𝒙 = 0; (2) positive homogeneity: Φ(𝛼𝒙 ) = |𝛼 |Φ(𝒙 ) for all 𝒙 ∈ ℝ𝑛 and 𝛼 ∈ ℝ; (3) triangle inequality: Φ(𝒙 + 𝒚 ) ≤ Φ(𝒙 ) + Φ(𝒚 ) for all 𝒙 , 𝒚 ∈ ℝ𝑛 .
In lecture 4, we defined the family of isotone functions on ℝ𝑛 as maps that respect
the majorization preorder on ℝ𝑛 , and we proved that convexity and permutation
invariance are sufficient conditions for isotonicity. It follows from properties (1) and
(2) that symmetric gauge functions are isotone, so they constitute a family of norms on
ℝ𝑛 that preserve the majorization preorder.

Definition 6.2 (Symmetric gauge function on ℂ𝑛 ). We can extend the concept of a


symmetric gauge function on ℝ𝑛 to the complex vector space ℂ𝑛 via

Φ(𝒛 ) B Φ(|𝒛 |) for 𝒛 ∈ ℂ𝑛 ,

where |𝒛 | B (|𝑧 1 |, . . . , |𝑧𝑛 |) is the entry-wise modulus of the vector 𝒛 .

Let us consider some examples of symmetric gauge functions.


Example 6.3 (𝑝 norms). For a vector 𝒙 ∈ ℝ𝑛 , define

k𝒙 k𝑝 B ( ∑_{𝑖=1}^{𝑛} |𝑥𝑖 |^𝑝 )^{1/𝑝} for each 𝑝 ∈ [ 1, ∞) ;
k𝒙 k ∞ B max𝑖 |𝑥𝑖 |.

The functions k·k𝑝 for 𝑝 ∈ [ 1, ∞] define norms on ℝ𝑛 , commonly known as 𝑝 norms.


It is easy to check that 𝑝 norms are symmetric gauge functions. 

Example 6.4 (Ky Fan norm). Fix 𝑘 ∈ {1, . . . , 𝑛}. For each vector 𝒙 ∈ ℝ𝑛 , define

k𝒙 k (𝑘 ) B max_{|I| = 𝑘} ∑_{𝑖 ∈ I} |𝑥𝑖 |.

That is, k𝒙 k (𝑘 ) is the sum of the 𝑘 largest entries of the vector |𝒙 | . The functions
k·k (𝑘 ) are norms on ℝ𝑛 , which are known by the name of Ky Fan norms. It is also easy
to check that these norms are symmetric gauge functions. 

There are many other norms on ℝ𝑛 that are symmetric gauge functions. For
instance, one could form combinations of 𝑝 and Ky Fan norms, or one could form
weighted sums of the ordered entries of the vector.
Proposition 6.5 (Monotonicity). If Φ : ℝ𝑛 → ℝ+ is a symmetric gauge function, then |𝒙 | ≤ |𝒚 | implies that Φ(𝒙 ) ≤ Φ(𝒚 ) for all 𝒙 , 𝒚 ∈ ℝ𝑛 . Recall that 𝒙 ≤ 𝒚 for 𝒙 , 𝒚 ∈ ℝ𝑛 is interpreted entrywise and that |𝒙 | B ( |𝑥 1 |, . . . , |𝑥𝑛 |) denotes the entrywise modulus.

Proof. Let 𝒙 , 𝒚 ∈ ℝ𝑛 be such that |𝒙 | ≤ |𝒚 | . By sign invariance of Φ, we can assume that both 𝒙 and 𝒚 are nonnegative and that 0 ≤ 𝑥𝑖 = 𝑡𝑖 𝑦𝑖 for some values 𝑡𝑖 ∈ [ 0, 1] for each 𝑖 = 1, . . . , 𝑛 . By permutation invariance and iteration, it suffices to check the
case where 𝑡 2 = 𝑡 3 = · · · = 𝑡𝑛 = 1. Indeed, if the result holds for the case where 𝒙 and
𝒚 differ by a single entry, it can be obtained for the general case by applying it to one
coordinate at a time. Write 𝑥 1 = 𝑡 𝑦1 for 𝑡 ∈ [ 0, 1] . Then

Φ((𝑥 1 , 𝑥 2 , . . . , 𝑥𝑛 )) = Φ((𝑡 𝑦1 , 𝑦2 , . . . , 𝑦𝑛 ))
  = Φ( ((1+𝑡)/2) 𝑦1 − ((1−𝑡)/2) 𝑦1 , ((1+𝑡)/2) 𝑦2 + ((1−𝑡)/2) 𝑦2 , . . . , ((1+𝑡)/2) 𝑦𝑛 + ((1−𝑡)/2) 𝑦𝑛 )
  = Φ( ((1+𝑡)/2) (𝑦1 , 𝑦2 , . . . , 𝑦𝑛 ) + ((1−𝑡)/2) (−𝑦1 , 𝑦2 , . . . , 𝑦𝑛 ) )
  ≤ ((1+𝑡)/2) Φ((𝑦1 , 𝑦2 , . . . , 𝑦𝑛 )) + ((1−𝑡)/2) Φ((−𝑦1 , 𝑦2 , . . . , 𝑦𝑛 ))
  = Φ((𝑦1 , 𝑦2 , . . . , 𝑦𝑛 )).
The inequality follows from convexity of Φ, while the last equality uses the fact that Φ
is sign invariant. 
Next, we prove a theorem that demonstrates that the Ky Fan norms play an essential
role in the theory of symmetric gauge functions.

Theorem 6.6 (Fan dominance: Vector case). Fix 𝒙 , 𝒚 ∈ ℝ𝑛 . The following statements
are equivalent:
1. k𝒙 k (𝑘 ) ≤ k𝒚 k (𝑘 ) for each 𝑘 = 1, . . . , 𝑛 .
2. Φ(𝒙 ) ≤ Φ(𝒚 ) for every symmetric gauge function Φ on ℝ𝑛 .
Proof. Statement 1 follows immediately from 2, as Ky Fan norms are symmetric gauge
functions. For the other implication, note that 1 is equivalent to the condition that
|𝒙 | ≺𝜔 |𝒚 | . As a consequence, there exists a vector 𝒖 ∈ ℝ𝑛 such that |𝒙 | ≤ 𝒖 ≺ 𝒚 .
Therefore,
Φ(𝒙 ) ≤ Φ(|𝒙 |) ≤ Φ(𝒖) ≤ Φ(|𝒚 |) ≤ Φ(𝒚 ).
The first and last inequalities follow from sign invariance of Φ. Monotonicity of Φ
(Proposition 6.5) yields the second inequality. Finally, the third inequality follows from
the fact that Φ is convex and permutation invariant, hence isotone. 

The Fan dominance theorem has a striking meaning. Given vectors 𝒙 and 𝒚 in ℝ𝑛 ,
we can check Φ(𝒙 ) ≤ Φ(𝒚 ) for every symmetric gauge function Φ, by checking the
inequality for only 𝑛 functions, namely the Ky Fan 𝑘 -norms for 𝑘 = 1, . . . , 𝑛 . In this
sense, the theorem reduces a problem involving an infinite number of inequalities to
the verification of a finite number.
Exercise 6.7 (Submajorization). Given |𝒙 | ≺𝜔 |𝒚 | , explain how to construct a vector
𝒖 ∈ ℝ𝑛 such that |𝒙 | ≤ 𝒖 ≺ 𝒚 .
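
The following throwaway numerical check (my own illustration; the helper names are ad hoc) implements the Ky Fan 𝑘 -norms and, for a pair of vectors satisfying statement 1 of Theorem 6.6, spot-checks statement 2 on a few members of the symmetric gauge family.

```python
import numpy as np

def ky_fan(x, k):
    """Ky Fan k-norm: sum of the k largest entries of |x|."""
    return np.sort(np.abs(x))[::-1][:k].sum()

def lp_norm(x, p):
    return np.max(np.abs(x)) if np.isinf(p) else (np.abs(x) ** p).sum() ** (1 / p)

x = np.array([1.0, -2.0, 0.5, 1.5])
y = np.array([3.0, -2.5, 1.0, 0.5])
n = len(x)

fan_dominated = all(ky_fan(x, k) <= ky_fan(y, k) for k in range(1, n + 1))
print("Ky Fan dominance:", fan_dominated)

# Theorem 6.6 then predicts Phi(x) <= Phi(y) for every symmetric gauge function;
# we spot-check the p-norms and the Ky Fan norms themselves.
for p in (1, 1.5, 2, 4, np.inf):
    assert lp_norm(x, p) <= lp_norm(y, p)
for k in range(1, n + 1):
    assert ky_fan(x, k) <= ky_fan(y, k)
print("spot checks passed")
```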

6.2 Duality for symmetric gauge functions


In this section, we develop the duality theory for symmetric gauge functions. More
specifically, we define the dual norm of a symmetric gauge function and establish a
generalization of the Hölder inequality.

Definition 6.8 (Duality). The dual norm Φ∗ of a symmetric gauge function Φ is given
by
Φ∗ (𝑦 ) B max{h𝒙 , 𝒚 i : Φ(𝒙 ) ≤ 1}.

Exercise 6.9 (Involution). Let Φ : ℝ𝑛 → ℝ be a symmetric gauge function. Prove that


(Φ∗ ) ∗ = Φ.
Exercise 6.10 (Dual symmetric gauge function). Prove that Φ∗ is a symmetric gauge function
if and only if Φ is a symmetric gauge function.
Let us consider the duality pairings for the examples studied in Section 6.1.
Recall that two real numbers 𝑝, 𝑞 ∈ [ 1, ∞) are said to be Hölder conjugates if 1/𝑝 + 1/𝑞 = 1. Also, recall that the Hölder conjugate of 𝑝 = 1 is 𝑞 = ∞.

Example 6.11 (𝑝 duality pairs). Let 𝑝, 𝑞 ∈ [ 1, ∞] be Hölder conjugates. Then

k𝒚 k𝑝∗ = k𝒚 k 𝑞 for all 𝒚 ∈ ℝ𝑛 .

In particular, the 2 norm is self-dual, while the 1 and ∞ norms form a dual pair.
Example 6.12 (Ky Fan duality pair). For each 𝑘 = 1 . . . 𝑛 , the dual norm of the Ky Fan
𝑘 -norm is given by
 
k𝒙 k ∗(𝑘 ) = max{ k𝒙 k ( 1) , (1/𝑘 ) k𝒙 k (𝑛) } for all 𝒙 ∈ ℝ𝑛 .
Note that the Ky Fan 1-norm k·k ( 1) corresponds to the ∞ norm, while k·k (𝑛) corre-
sponds to the 1 norm. 

We now continue with two norm inequalities that hold for all symmetric gauge
functions and specialize to familiar inequalities that frequently appear in analysis.
Proposition 6.13 (Dual norm inequality). For each symmetric gauge function Φ,

|h𝒙 , 𝒚 i| ≤ Φ∗ (𝒙 ) Φ(𝒚 ) for all 𝒙 , 𝒚 ∈ ℝ𝑛 .


Proof. This follows immediately from the definition of the dual norm. 
Notice that when Φ = k·k 2 , the inequality in Proposition 6.13 corresponds to the
Cauchy–Schwarz inequality.
Recall that ⊙ denotes the entrywise product, also known as the Hadamard product or the Schur product. Also recall the notation |𝒙 | B ( |𝑥 1 |, . . . , |𝑥𝑛 |) , with powers applied entrywise.

Theorem 6.14 (Generalized Hölder inequality for symmetric gauge functions). Let Φ : ℝ𝑛 → ℝ+ be a symmetric gauge function. Then

Φ(|𝒙 ⊙ 𝒚 |) ≤ [Φ(|𝒙 |^𝑝 )]^{1/𝑝} · [Φ(|𝒚 |^𝑞 )]^{1/𝑞} ,

where 𝑝 > 1 and 𝑞 is Hölder conjugate to 𝑝 .

Proof. Refer to [Bha97, Thm. IV.1.6] for a proof of this theorem. 


When Φ is the 1 norm, the inequality reduces to Hölder’s inequality
∑_{𝑖=1}^{𝑛} |𝑥𝑖 𝑦𝑖 | ≤ ( ∑_{𝑖=1}^{𝑛} |𝑥𝑖 |^𝑝 )^{1/𝑝} ( ∑_{𝑖=1}^{𝑛} |𝑦𝑖 |^𝑞 )^{1/𝑞} .

The theorem has interesting implications when applied to other symmetric gauge
functions.

6.3 Unitarily invariant norms


In this section, we introduce an important family of matrix norms, called unitarily
invariant norms, which are intimately related to symmetric gauge functions.

Definition 6.15 (Unitarily invariant norm). A norm |||·||| on 𝕄𝑛 is unitarily invariant (UI)
if
|||𝑼 𝑨𝑽 ||| = |||𝑨 ||| for all 𝑨, 𝑼 ,𝑽 ∈ 𝕄𝑛 with 𝑼 ,𝑽 unitary.
We also insist that unitarily invariant norms be normalized in an appropriate
fashion: ||| diag ( 1, 0, . . . , 0)||| = 1.

For the remainder of this lecture, |||·||| will denote an arbitrary unitarily invariant
norm. Next, let us consider some familiar examples.
The singular values of a matrix are numbered in decreasing order; i.e., by convention 𝜎1 ≥ 𝜎2 ≥ · · · ≥ 𝜎𝑛 .

Example 6.16 (2 operator norm). The 2 operator norm, also known as the spectral norm, is defined by

k𝑨 k 2 B 𝜎1 (𝑨) for 𝑨 ∈ 𝕄𝑛 .

The 2 operator norm is unitarily invariant.

Example 6.17 (Frobenius norm). The Frobenius norm, also known as the Hilbert–Schmidt
norm, is defined by
k𝑨 k F B ( ∑_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨)^2 )^{1/2} for 𝑨 ∈ 𝕄𝑛 .

The Frobenius norm is unitarily invariant. 

Example 6.18 (Schatten 𝑝 -norms). For 1 ≤ 𝑝 < ∞, define


k𝑨 k 𝑆𝑝 B ( ∑_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨)^𝑝 )^{1/𝑝} for 𝑨 ∈ 𝕄𝑛 .

The functions k·k 𝑆𝑝 , commonly known as Schatten norms, define unitarily invariant
norms on the space 𝕄𝑛 . The Schatten 1-norm, also known as the trace norm or the
nuclear norm, corresponds to the sum of the singular values of the matrix. The norm
k·k 𝑆∞ coincides with the spectral norm (Example 6.16). 

Example 6.19 (Ky Fan matrix norm). For each 𝑘 = 1, . . . , 𝑛 , define


k𝑨 k (𝑘 ) B ∑_{𝑖=1}^{𝑘} 𝜎𝑖 (𝑨) for 𝑨 ∈ 𝕄𝑛 .

The functions k·k (𝑘 ) define unitarily invariant norms on 𝕄𝑛 , known as the Ky Fan
matrix norms. 

Exercise 6.20 (Why unitarily invariant?). Provide an explanation as to why each of the
examples above is unitarily invariant.
As in the case for symmetric gauge functions, combinations of Schatten and Ky Fan
matrix norms are also unitarily invariant.

6.4 Characterization of unitarily invariant norms


Observe that all the matrix norms in the examples above are completely determined
by the singular values of the matrix. This is a joint consequence of the singular value
decomposition (SVD) and unitary invariance. Indeed, if 𝑨 = 𝑼 𝚺𝑽 ∗ is an SVD of the
matrix 𝑨 ∈ 𝕄𝑛 , then |||𝑨 ||| = |||𝑼 𝚺𝑽 ∗ ||| = |||𝚺||| . This observation sets the ground for us
to establish the strong connection that exists between symmetric gauge functions and
unitarily invariant norms.

Theorem 6.21 (Unitarily invariant norms: Characterization). There is a one-to-one cor-


respondence between symmetric gauge functions and unitarily invariant norms.
This correspondence is specified by the following statements:

1. Given a symmetric gauge function Φ : ℝ𝑛 → ℝ+ , the function |||·||| Φ : 𝕄𝑛 →


ℝ+ defined by |||𝑨 ||| Φ B Φ(𝝈 (𝑨)) is a unitarily invariant norm on 𝕄𝑛 .
2. Given a unitarily invariant norm |||·||| on 𝕄𝑛 , the map Φ : ℝ𝑛 → ℝ defined
by Φ(𝒙 ) B ||| diag (𝒙 )||| is a symmetric gauge function on ℝ𝑛 .

In fact, these operations are mutual inverses.

Proof. (1). Let Φ : ℝ𝑛 → ℝ be a symmetric gauge function, and let |||·||| Φ be defined as
in (1). Positive definiteness of |||·||| Φ follows from the facts that Φ is positive definite
and the zero matrix is the only matrix whose singular values are all zero. Being a
norm, Φ is positive homogeneous, and since multiplying a matrix by a real number 𝛼
scales its singular values by |𝛼 | , the function |||·||| Φ is also positive homogeneous.
It remains to show that |||·||| Φ satisfies the triangle inequality and is unitarily
invariant. Let 𝑨, 𝑩 ∈ 𝕄𝑛 . By Problem 6.22, we know that
𝝈 (𝑨 + 𝑩) ≺𝜔 𝝈 (𝑨) + 𝝈 (𝑩) . Since the symmetric gauge function Φ is isotone,
Φ(𝝈 (𝑨 + 𝑩)) ≤ Φ(𝝈 (𝑨) + 𝝈 (𝑩)) ≤ Φ(𝝈 (𝑨)) + Φ(𝝈 (𝑩)).
We conclude that |||·||| Φ is a norm. Note that the first and second inequalities above
follow from monotonicity and convexity of Φ, respectively.
Finally, observe that multiplying any given matrix by a unitary matrix does not alter
its singular values. Indeed, for any 𝑨 ∈ 𝕄𝑛 , its norm |||𝑨 ||| Φ is completely determined
by its singular values. It follows that |||·||| Φ is unitarily invariant.
(2). Let |||·||| be a unitarily invariant norm, and let Φ be defined as in (2). It is
immediate that Φ inherits all the norm properties from |||·||| . To verify sign invariance of Φ, we first let 𝑫 = diag (±1, . . . , ±1) . The matrix 𝑫 is unitary, so we have that

Φ(𝑫𝒙 ) = ||| diag (𝑫𝒙 )||| = |||𝑫 diag (𝒙 )||| = ||| diag (𝒙 )|||.
Similarly, since permutation matrices are unitary, we have that

Φ(𝑷 𝒙 ) = ||| diag (𝑷 𝒙 )||| = |||𝑷 diag (𝒙 )𝑷 ∗ ||| = ||| diag (𝒙 )|||


for any permutation matrix 𝑷 ∈ 𝕄𝑛 . Finally, note that Φ is normalized, as 𝝈 ( diag ( 1, 0, . . . , 0)) = ( 1, 0, . . . , 0) and |||·||| is normalized.
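
Here is a small numerical illustration (not from the notes) of direction (1) of Theorem 6.21: applying a symmetric gauge function to the vector of singular values produces a matrix functional that is unchanged by unitary multiplications. The particular gauge functions and the QR-based random unitary are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

def random_unitary(n):
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    return Q

def ui_norm(A, gauge):
    """|||A|||_Phi := Phi(sigma(A)) as in Theorem 6.21(1)."""
    return gauge(np.linalg.svd(A, compute_uv=False))

gauges = {
    "Schatten-1 (trace norm)": lambda s: s.sum(),
    "Schatten-3":              lambda s: (s ** 3).sum() ** (1 / 3),
    "Ky Fan 2":                lambda s: np.sort(s)[::-1][:2].sum(),
}

A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
U, V = random_unitary(n), random_unitary(n)

for name, gauge in gauges.items():
    a, b = ui_norm(A, gauge), ui_norm(U @ A @ V, gauge)
    print(f"{name:25s}  |||A||| = {a:.6f}   |||UAV||| = {b:.6f}")
    assert abs(a - b) < 1e-8
```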

Problem 6.22 (Singular values of the sum). Let 𝑨, 𝑩 ∈ 𝕄𝑛 (ℂ) . Show that the vectors of singular values satisfy the submajorization inequality 𝝈 (𝑨 + 𝑩) ≺𝜔 𝝈 (𝑨) + 𝝈 (𝑩).
Next, we state the Fan dominance theorem for unitarily invariant norms, analogous to Theorem 6.6. As in the vector case, this theorem emphasizes the importance
of Ky Fan norms in the theory of unitarily invariant norms.

Theorem 6.23 (Fan dominance: Matrix case). Fix 𝑨, 𝑩 ∈ 𝕄𝑛 . The following statements
are equivalent:

1. k𝑨 k (𝑘 ) ≤ k𝑩 k (𝑘 ) for each 𝑘 = 1, . . . , 𝑛 .
2. |||𝑨 ||| ≤ |||𝑩 ||| for every unitarily invariant norm.

Proof. The theorem follows from Theorem 6.6 and from the characterization of unitarily
invariant norms in terms of symmetric gauge functions. 

6.5 Duality for unitarily invariant norms


In the last section, we defined unitarily invariant norms and established a characteri-
zation in terms of symmetric gauge functions. Now, we build on this connection to
develop duality theory for unitarily invariant norms and establish a generalization of
the Hölder inequality.

Recall that h𝑨, 𝑩 i B Tr (𝑩 ∗ 𝑨) for 𝑨, 𝑩 ∈ 𝕄𝑛 .

Definition 6.24 The dual |||·||| ∗ of a unitarily invariant norm |||·||| on 𝕄𝑛 is given by

|||𝑩 ||| ∗ B max{h𝑩, 𝑨i : |||𝑨 ||| ≤ 1} for each 𝑩 ∈ 𝕄𝑛 .

Exercise 6.25 (Involution). Let |||·||| be a unitarily invariant norm on 𝕄𝑛 . Prove that
(|||·||| ∗ ) ∗ = |||·||| .
Exercise 6.26 (Dual unitarily invariant norms). Prove that |||·||| ∗ is unitarily invariant if and
only if |||·||| is unitarily invariant.
Exercise 6.27 (Dual norm inequality). For each unitarily invariant norm |||·||| ,

|h𝑩, 𝑨i| ≤ |||𝑩 ||| ∗ · |||𝑨 ||| for all 𝑨, 𝑩 ∈ 𝕄𝑛 .

Verify this statement.


In Section 6.3 we saw that, given a symmetric gauge function Φ on ℝ𝑛 , we can
define a unitarily invariant norm on 𝕄𝑛 via |||𝑨 ||| Φ B Φ(𝝈 (𝑨)) . The following theorem
describes how the dual of the unitarily invariant norm |||·||| Φ relates to the dual Φ∗ of
the symmetric gauge function used to define it.

Theorem 6.28 (Von Neumann duality). The dual |||·||| ∗Φ of the unitarily invariant norm
|||·||| Φ associated to a symmetric gauge function Φ is the unitarily invariant norm
|||·||| Φ∗ associated to the dual Φ∗ of the symmetric gauge function. That is,

|||𝑩 ||| ∗Φ = |||𝑩 ||| Φ∗ for all 𝑩 ∈ 𝕄𝑛 .

Proof. To prove this result, we use von Neumann’s trace theorem. Let 𝑨, 𝑩 ∈ 𝕄𝑛 .
Then
max{tr (𝑼 ∗ 𝑨𝑽 𝑩) : 𝑼 ,𝑽 ∈ 𝕄𝑛 are unitary} = ∑_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨)𝜎𝑖 (𝑩).

Exercise 6.29 invites you to prove this result.


As a consequence, we may calculate that

|||𝑩 ||| ∗Φ = max {tr (𝑩 ∗ 𝑨) : |||𝑨 ||| Φ ≤ 1}
  = max {tr (𝑩 ∗𝑼 𝑨𝑽 ) : 𝑼 ,𝑽 ∈ 𝕄𝑛 are unitary, |||𝑨 ||| Φ ≤ 1}
  = max { ∑_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨)𝜎𝑖 (𝑩) : Φ(𝝈 (𝑨)) ≤ 1 }
  = max {h𝝈 (𝑨), 𝝈 (𝑩)i : Φ(𝝈 (𝑨)) ≤ 1}
  ≤ Φ∗ (𝝈 (𝑩))
  = |||𝑩 ||| Φ∗ .

The second equality above is justified by unitary invariance of |||·||| ∗Φ .


The inequality |||𝑩 ||| Φ∗ ≤ |||𝑩 ||| ∗Φ follows from a similar line of reasoning and is left
as an exercise for the reader. 
Exercise 6.29 (von Neumann trace theorem: Singular values). Provide a proof for the von
Neumann trace theorem for singular values used in the proof of Theorem 6.28.
In light of the previous theorem, we can easily find expressions for the dual norms
of our examples from section 6.3.
Example 6.30 (Schatten norms). Let 𝑝, 𝑞 ∈ [ 1, ∞] be Hölder conjugates. Then

k𝑩 k 𝑆∗𝑝 = k𝑩 k 𝑆𝑞 .

Recall that the dual to the 𝑝 norm is the 𝑞 norm, where 𝑞 is the Hölder conjugate of 𝑝 .
Therefore, the dual norm to the Schatten 𝑝 -norm is the Schatten 𝑞 -norm. Indeed, the
Schatten 𝑝 - and 𝑞 -norms coincide with the 𝑝 and 𝑞 norms of the vector of singular
values. In particular, the Schatten 2-norm is self dual, while the Schatten norms k·k 𝑆1
and k·k 𝑆∞ form a dual pair. 
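
A numerical spot-check of the duality pairing in Example 6.30 (my own sketch, not part of the notes): for random matrices the trace pairing is bounded by the product of conjugate Schatten norms, and the bound becomes an equality when the matrices share singular vectors and the singular values are conjugately aligned.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5

def schatten(A, p):
    s = np.linalg.svd(A, compute_uv=False)
    return s.max() if np.isinf(p) else (s ** p).sum() ** (1 / p)

p = 3.0
q = p / (p - 1)                      # Hoelder conjugate: 1/p + 1/q = 1

# |tr(A* B)| <= ||A||_p ||B||_q for random matrices.
for _ in range(5):
    A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    assert abs(np.trace(A.conj().T @ B)) <= schatten(A, p) * schatten(B, q) + 1e-10

# Equality case: share singular vectors and take sigma(B) = sigma(A)**(p-1).
A = rng.standard_normal((n, n))
U, s, Vt = np.linalg.svd(A)
B = U @ np.diag(s ** (p - 1)) @ Vt
print(abs(np.trace(A.conj().T @ B)), schatten(A, p) * schatten(B, q))  # nearly equal
```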

Example 6.31 (Ky Fan norms). For each 𝑘 = 1, . . . , 𝑛 , the dual norm of the Ky Fan 𝑘 -norm
is given by

k𝑩 k ∗(𝑘 ) = max{ k𝑩 k ( 1) , (1/𝑘 ) k𝑩 k (𝑛) }.
Note that the Ky Fan 1-norm k·k ( 1) corresponds to the spectral norm, while k·k (𝑛)
corresponds to trace norm. 

We end this lecture with a generalization of Hölder’s inequality for unitarily


invariant norms.

Theorem 6.32 (Generalized Hölder inequality). Let |||·||| be unitarily invariant, and let 𝑝, 𝑞 > 1 be such that 1/𝑝 + 1/𝑞 = 1. Then

||| |𝑨𝑩 | ||| ≤ ||| |𝑨 |^𝑝 |||^{1/𝑝} · ||| |𝑩 |^𝑞 |||^{1/𝑞} for all 𝑨, 𝑩 ∈ 𝕄𝑛 ,

where |𝑴 | B (𝑴 ∗ 𝑴 ) 1/2 .

Proof sketch. Let 𝑨, 𝑩 ∈ 𝕄𝑛 . By Exercise 6.33 and isotonicity of symmetric gauge


functions, we have that Φ(𝝈 (𝑨𝑩)) ≤ Φ(𝝈 (𝑨) ⊙ 𝝈 (𝑩)) for all symmetric gauge
functions Φ. The result then follows from the characterization of unitarily invariant
norms (Theorem 6.21) and Theorem 6.14. 
Refer to [Bha97, Cor. IV.2.6] for a more detailed proof.

Exercise 6.33 (Singular values of the product). Let 𝑨, 𝑩 ∈ 𝕄𝑛 . Show that 𝝈 (𝑨𝑩) ≺𝜔 𝝈 (𝑨) ⊙ 𝝈 (𝑩) .
In this lecture, we discussed two important and closely related families of norms,
symmetric gauge functions on ℝ𝑛 and unitarily invariant norms on the space of 𝑛 × 𝑛
real matrices 𝕄𝑛 . We established generalizations of Hölder-type inequalities for these
two families of norms by exploiting their defining properties and their connection to
each other. These results exemplify how considering families of norms with invariance
properties allows us to prove results for a broad class of norms which can then specialize
into results of interest when applied to particular examples.

Notes
This material is adapted from Bhatia’s book [Bha97, Chap. IV].

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
7. Matrix Inequalities via Complex Analysis

Date: 25 January 2022 Scribe: Yixuan (Roy) Wang

Agenda:
1. Motivation
2. Interpolation inequalities
3. Maximum modulus principle
4. The three-lines theorem
5. Example: Duality of Schatten norms

In this lecture, we develop a new approach for proving matrix inequalities based on the theory of complex interpolation. As motivation, we begin with the duality theorem for Schatten 𝑝 -norms, and we recall some of the difficulties that arose in the proof. We recast the duality theorem as a theorem about interpolation, which suggests the possibility of deriving this result using general tools for studying interpolation problems. We develop a powerful approach based on the Hadamard three-lines theorem, a quantitative version of the maximum principle for analytic functions. As an illustration of these ideas, we provide an alternative derivation of the duality theorem for Schatten 𝑝 -norms.

7.1 Motivation: Real analysis is not always enough


To begin, let us present some basic inequalities for Schatten norms, including the main
duality theorem. We will see that this result can be framed as a kind of interpolation
inequality. This observation opens the door to using complex interpolation theory.

7.1.1 Schatten norm inequalities


We first recall the definition of the Schatten 𝑝 -norms of a complex matrix.

Definition 7.1 (Schatten norms). The Schatten 𝑝 -norms k · k𝑝 for 𝑝 ∈ [ 1, ∞] are defined as

k𝑨 k𝑝 B ( ∑_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨)^𝑝 )^{1/𝑝} for 1 ≤ 𝑝 < ∞ ;
k𝑨 k ∞ B 𝜎1 (𝑨) .
Here, 𝑨 ∈ 𝕄𝑛 (ℂ) is an 𝑛 × 𝑛 complex matrix. The function 𝜎𝑖 returns the 𝑖 th
largest singular value of a matrix.

For powers 𝑝 ∈ [ 1, ∞) , we can also write the Schatten 𝑝 -norm as a trace (recall that the power binds before the trace):

k𝑨 k𝑝 = ( tr |𝑨 |𝑝 ) 1/𝑝 for 𝑨 ∈ 𝕄𝑛 (ℂ) .

As usual, |𝑨 | B (𝑨 ∗ 𝑨) 1/2 is the matrix absolute value. The 𝑝 th power of a positive-


semidefinite matrix is defined via the usual functional calculus (i.e., by raising the
eigenvalues to the 𝑝 th power).
In Lecture 6, we characterized the unitarily invariant norms on the matrix space
𝕄𝑛 (ℂ) , and we established a one-to-one correspondence with the symmetric gauge
functions on the vector space ℝ𝑛 . In particular, the Schatten 𝑝 -norms are analogous
with the vector 𝑙𝑝 norms. By the duality for 𝑝 vector norms and the von Neumann
duality theorem, we derived a duality relation for Schatten norms: k · k𝑝∗ = k · k 𝑞
where the indices satisfy the conjugacy relation 1/𝑝 + 1/𝑞 = 1.

Theorem 7.2 (Duality relation for Schatten 𝑝 -norms). For all indices 𝑝, 𝑞 ∈ [ 1, ∞] with
1/𝑝 + 1/𝑞 = 1, the dual norm of the Schatten 𝑝 -norm is the Schatten 𝑞 -norm. The
duality is equivalent to a Hölder-type inequality.

| tr (𝑨 ∗ 𝑩)| ≤ k𝑨 k𝑝 · k𝑩 k 𝑞 for all 𝑨, 𝑩 ∈ 𝕄𝑛 (ℂ) . (7.1)

As unspooled in Lecture 6, the proof of the duality of Schatten 𝑝 -norms involves


Birkhoff ’s theorem on doubly stochastic matrices, the von Neumann trace theorem,
the von Neumann duality theorem, and finally duality for 𝑝 vector norms. It is fairly
complicated, and we spent four lectures building up to the theorem.
In addition to its complexity, this type of reasoning does not offer a very flexible
approach to proving matrix inequalities. We may encounter related problems that
we cannot solve directly from these ideas. For example, consider the following
Schwarz-type inequality.

Theorem 7.3 (A Schwarz-type inequality). Fix 𝑝 ≥ 1. For all 𝑩 ∈ 𝕄𝑛 (ℂ) and all
𝑨 𝑗 ∈ 𝕄𝑛 (ℂ) for 𝑗 = 1, 2, · · · , 𝑚 , we have
k ∑_{𝑗=1}^{𝑚} 𝑨 ∗𝑗 𝑩 𝑨 𝑗 k𝑝 ≤ k ∑_{𝑗=1}^{𝑚} 𝑨 ∗𝑗 𝑨 𝑗 k 2𝑝 · k𝑩 k 2𝑝 .

Proof. See Problem Set 2. 


To prove results of this type, it is valuable to have additional tools. This lecture will
show how to use basic methods from complex analysis to derive matrix inequalities,
including results like Theorem 7.3.

7.1.2 Interpolation inequalities


As a first step, let us rephrase the duality statement (7.1) as an interpolation inequality.
Proposition 7.4 (Schatten duality: Interpolation form). Theorem 7.2 follows from the state-
ment below. For all positive-semidefinite matrices 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) , we have
k𝑨 𝜃 𝑩 1−𝜃 k 1 ≤ k𝑨 k 1𝜃 · k𝑩 k 11−𝜃 for 0 < 𝜃 < 1. (7.2)
You can think about k𝑨 𝜃 𝑩 1−𝜃 k 1 as the trace norm of a weighted geometric mean
of the two positive-semidefinite matrices. The right-hand side is a weighted geometric
mean of their trace norms.

Proof. We will establish (7.1) for positive-semidefinite matrices 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) ; Exer-
cise 7.6 asks you to derive the extension for general matrices.
First, fix 𝑝 ∈ ( 1, ∞) . We show that (7.2) implies (7.1) by a change of variables.
Consider the bijections 𝑨 ↦→ 𝑨 𝑝 and 𝑩 ↦→ 𝑩 𝑞 where 𝜃 = 1/𝑝 and 1 − 𝜃 = 1/𝑞 . We
obtain the inequality
k𝑨𝑩 k 1 ≤ k𝑨 k𝑝 · k𝑩 k 𝑞 for all psd 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) .
Exercise 7.5 asks you to check that | tr (𝑨𝑩)| ≤ k𝑨𝑩 k 1 , which yields the inequality (7.1).
The boundary cases 𝑝 = 1 and 𝑝 = ∞ follow from the last display when we take
limits. 
Exercise 7.5 (The trace norm). Prove that the trace is bounded by the trace norm:

| tr 𝑴 | ≤ k𝑴 k 1 for all 𝑴 ∈ 𝕄𝑛 (ℂ) .


Hint: Among many proofs, the easiest one introduces an SVD of the matrix.

Exercise 7.6 (Schatten duality: General case). For the general case when 𝑨, 𝑩 ∈ 𝕄𝑛 (ℂ) ,
derive inequality (7.1) from inequality (7.2). Hint: Introduce polar factorizations.
In view of this discussion, we may focus on proving the interpolation inequality
(7.2). A surprising and powerful approach to this problem is to allow the interpolation
parameter 𝜃 to take on complex values. This insight leads us to consider a complex
interpolation inequality, which we can establish using the miracle of complex analysis.
The key tool in this argument is a quantitative extension of the maximum modulus
principle, called the Hadamard three-lines theorem. The rest of this lecture develops
the required background and establishes the interpolation inequality (7.2).
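
Before developing the complex-analytic machinery, it may help to see the target inequality (7.2) numerically. The sketch below is my own illustration (not from the notes): it builds random positive-semidefinite matrices, forms the fractional powers through an eigendecomposition, and compares the two sides; "nuc" is NumPy's name for the trace norm.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

def random_psd(n):
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return M @ M.conj().T            # positive semidefinite

def psd_power(A, t):
    """A**t for a psd matrix A via the spectral decomposition."""
    w, Q = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)        # guard against tiny negative eigenvalues
    return (Q * w ** t) @ Q.conj().T

A, B = random_psd(n), random_psd(n)
for theta in (0.1, 0.25, 0.5, 0.75, 0.9):
    lhs = np.linalg.norm(psd_power(A, theta) @ psd_power(B, 1 - theta), ord="nuc")
    rhs = np.trace(A).real ** theta * np.trace(B).real ** (1 - theta)
    print(f"theta = {theta:4.2f}:  lhs = {lhs:10.4f}  <=  rhs = {rhs:10.4f}")
    assert lhs <= rhs + 1e-8
```

Here the trace of a psd matrix equals its trace norm, which is why the right-hand side uses np.trace.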

7.2 The maximum modulus principle


In this section, we present the definition of a complex analytic function and prove an
essential fact called the mean value formula. Then we establish the simplest version of
the maximum modulus principle.

7.2.1 Domains and analytic functions


We begin with some definitions on domains and analytic functions.

Definition 7.7 (Domain). A domain Ω ⊆ ℂ is an open and connected set.

Definition 7.8 (Analytic function). A function 𝑓 : Ω → ℂ on a domain Ω ⊆ ℂ is


(complex) analytic if it has a locally convergent power series expansion at each point
of the domain. More precisely, consider a closed disc inside the domain:

D𝑟 (𝑎) B {𝑧 ∈ ℂ : |𝑧 − 𝑎 | ≤ 𝑟 } ⊂ Ω.

Then there exists a convergent Taylor expansion


𝑓 (𝑧) = ∑_{𝑘=0}^{∞} 𝑐 𝑘 (𝑧 − 𝑎)^𝑘 for all 𝑧 ∈ D𝑟 (𝑎) .

The coefficients 𝑐 𝑘 ∈ ℂ. In particular, 𝑐 0 = 𝑓 (𝑎) .

We can show that the Taylor expansion about each point 𝑎 ∈ Ω is uniquely
determined, and it converges absolutely on the largest (open) disc D𝑟 (𝑎) that is
contained in Ω. Moreover, when D𝑟 (𝑎) ⊆ D𝑅 (𝑎) ⊂ Ω, the coefficients in the
expansion coincide.
From this definition, we can easily confirm that analytic functions are continuous
inside the domain Ω. In fact, a complex function is analytic on a domain if and only if
it is differentiable within the domain (also called holomorphic).
Example 7.9 (Exponentials). For any complex number 𝑐 ∈ ℂ, the function 𝑧 ↦→ e𝑐 𝑧 is
analytic on the complex plane ℂ. Moreover, the familiar Taylor series expansion for
the exponential converges in the whole complex plane ℂ. 

7.2.2 Mean value formula and maximum modulus principle


In this section, we shall establish the maximum modulus principle for analytic functions.
Our key tool is the mean value formula.
Proposition 7.10 (Mean value formula). Let 𝑓 : Ω → ℂ be analytic, and let D𝑟 (𝑎) ⊂ Ω.

Then we have the formula

𝑓 (𝑎) = (1/2𝜋) ∫_{0}^{2𝜋} 𝑓 (𝑎 + 𝑟 e^{i𝜃} ) d𝜃 . (7.3)

Proof. The proof of this proposition is straightforward. We can just integrate the
power series expansion of 𝑓 at 𝑎 ∈ ℂ term by term. The constant order (𝑘 = 0) term
contributes 𝑓 (𝑎) ; the higher monomial terms vanish. 
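
The mean value formula is easy to verify numerically; the snippet below (an illustration of mine, not from the notes) averages an entire function over a circle with a simple Riemann sum and compares with the value at the center.

```python
import numpy as np

def circle_average(f, a, r, m=4096):
    """Approximate (1/2pi) * integral of f(a + r e^{it}) dt over [0, 2pi)."""
    t = np.linspace(0.0, 2 * np.pi, m, endpoint=False)
    return np.mean(f(a + r * np.exp(1j * t)))

f = lambda z: np.exp(2.5 * z) + 3 * z**2 - 1j * z   # entire, hence analytic on any disc
a, r = 0.3 + 0.7j, 2.0

print(f(a))                      # value at the center
print(circle_average(f, a, r))   # boundary average agrees to high accuracy
```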
We can apply the mean value formula to prove a special case of the maximum
modulus principle: the version for a disc.
Proposition 7.11 (Maximum modulus principle: Disc). Let 𝑓 : Ω → ℂ be analytic, and
consider a closed disc contained in the domain: D𝑟 (𝑎) ⊂ Ω. If the function 𝑧 ↦→ |𝑓 (𝑧)|
achieves its maximum over D𝑟 (𝑎) at the point 𝑎 , then the function 𝑓 must be constant
on the disc D𝑟 (𝑎) . That is, 𝑓 (𝑧) = 𝑓 (𝑎) for each 𝑧 ∈ D𝑟 (𝑎) .

Proof. Fix the center 𝑎 ∈ Ω and the radius 𝑟 > 0 of the disc. By the triangle inequality
applied to the mean value formula (7.3), we compute
∫ 2𝜋 ∫ 2𝜋
1 i𝜃 1
| 𝑓 (𝑎)| ≤ | 𝑓 (𝑎 + 𝑟 e )| d𝜃 ≤ |𝑓 (𝑎)| d𝜃 = |𝑓 (𝑎)| .
2𝜋 0 2𝜋 0

Therefore, both inequalities hold with equality. Since |𝑓 | is continuous, the value
𝑓 (𝑎 + 𝑟 ei𝜃 ) has constant phase for all 𝜃 ∈ [0, 2𝜋) . Furthermore, the magnitude
|𝑓 (𝑎 + 𝑟 ei𝜃 )| = |𝑓 (𝑎)| for all 𝜃 . In other words, 𝜃 ↦→ 𝑓 (𝑎 + 𝑟 ei𝜃 ) is a constant
function. By the mean value formula, the constant value of 𝑓 (𝑎 + 𝑟 ei𝜃 ) equals 𝑓 (𝑎) .
Finally, we apply the same argument for each 𝑟0 < 𝑟 and its associated disc D𝑟0 (𝑎) .
We conclude that 𝑓 (𝑧) = 𝑓 (𝑎) for each 𝑧 ∈ D𝑟 (𝑎) , because it is on the boundary of
some disc D𝑟0 (𝑎) for 𝑟0 = |𝑧 − 𝑎 | ≤ 𝑟 . 
With the maximum modulus principle for the disc at hand, we are in position to
prove the general maximum modulus principle on bounded domains.

Theorem 7.12 (Maximum modulus principle: Bounded domain). Let Ω ⊆ ℂ be a bounded


domain. Assume that 𝑓 : Ω → ℂ is analytic on Ω and continuous on Ω. Then
𝑧 ↦→ |𝑓 (𝑧)| achieves its maximum on the boundary 𝜕Ω.

Proof. Since the closure Ω is compact and |𝑓 | is continuous on Ω, we know that |𝑓 |


attains its maximum on Ω by the extreme value theorem.
We argue by contradiction. Suppose | 𝑓 | does not achieve its maximum on 𝜕Ω. Let 𝑎 ∈ Ω be a point where | 𝑓 (𝑎)| ≥ |𝑓 (𝑧)| for all 𝑧 ∈ Ω. Let 𝑏 ∈ 𝜕Ω be the closest point in 𝜕Ω to 𝑎 ; such a point exists since the boundary 𝜕Ω is compact.

Figure 7.1 Identifying the disc to apply the maximum modulus principle on a disc.
By assumption, | 𝑓 (𝑏)| < | 𝑓 (𝑎)| . Since |𝑓 | is continuous on the line segment
[𝑎, 𝑏] , there must exist a point 𝑧 ∈ (𝑎, 𝑏) ⊂ Ω in the interior of the domain such that
|𝑓 (𝑧)| < | 𝑓 (𝑎)| ; see Figure 7.1 for an illustration.
Consider a disc D𝑟 (𝑎) that contains 𝑧 and sits inside Ω. Since |𝑓 | achieves its
maximum at 𝑎 , we have that | 𝑓 (𝑎)| ≥ |𝑓 (𝑤 )| for all 𝑤 ∈ D𝑟 (𝑎) . By the maximum
modulus principle for a disc, 𝑓 is a constant on D𝑟 (𝑎) . But this is a contradiction to
the fact that | 𝑓 (𝑧)| < | 𝑓 (𝑎)| . 

Aside: (Maximum modulus principle). In fact, a stronger result is valid. Assume that
𝑓 : Ω → ℂ is analytic on a bounded domain Ω and continuous on Ω. If the
function |𝑓 | attains maximum at any point inside the domain Ω, then the function
𝑓 is constant on the domain.

7.3 Interpolation: The three-lines theorem


The maximum modulus principle (Theorem 7.12) applies to general bounded domains.
However, the result can fail for unbounded domains without additional assumptions. For
one thing, we do not even know if | 𝑓 | attains its maximal value; it could be unbounded.
Nevertheless, for sufficiently regular functions, we can establish a maximum modulus
principle for particular unbounded domains (e.g., strips and wedges).
The next result provides a quantitative version of the maximum modulus principle
on the strip. This theorem is attributed to Hadamard.

Theorem 7.13 (Hadamard three-lines). Let Ω = {𝑧 : 0 < Re 𝑧 < 1} be a vertical strip


in the complex plane. Assume that the function 𝑓 : Ω → ℂ is analytic on Ω, and
assume that 𝑓 is bounded and continuous on Ω. Define the quantity

𝑀 (𝜃 ) B sup𝑡 ∈ℝ | 𝑓 (𝜃 + i𝑡 )| for 𝜃 ∈ [0, 1] .

Then the function 𝑀 is log-convex. In particular, we have

𝑀 (𝜃 ) ≤ 𝑀 ( 0) 1−𝜃 · 𝑀 ( 1) 𝜃 for 𝜃 ∈ [ 0, 1] . (7.4)

Proof. We will only prove the interpolation inequality (7.4); the general log-convexity
statement follows by a scaling argument. Without loss of generality, we can also
assume that 𝑀 ( 0) and 𝑀 ( 1) are strictly positive; otherwise we can add a small
positive constant 𝛿 to 𝑓 and take the limit 𝛿 ↓ 0.
Consider the auxiliary function

𝐹 (𝑧) B 𝑓 (𝑧) · 𝑀 ( 0) 𝑧−1 𝑀 ( 1) −𝑧 for 𝑧 ∈ Ω.

Then 𝐹 is analytic on Ω and continuous on Ω. Moreover, for 𝑧 = 𝜃 + i𝑡 with 𝜃 , 𝑡 ∈ ℝ,


we can compute

|𝐹 (𝜃 + i𝑡 )| = | 𝑓 (𝜃 + i𝑡 )| · 𝑀 ( 0) 𝜃 −1 𝑀 ( 1) −𝜃 . (7.5)

Therefore 𝐹 is bounded on Ω, and |𝐹 (𝑧)| ≤ 1 on 𝜕Ω by construction.


Claim 7.14 (Auxiliary function is bounded). We claim that |𝐹 (𝑧)| ≤ 1 for all 𝑧 ∈ Ω.
Granted that Claim 7.14 holds, we may quickly complete the argument:

1 ≥ sup𝑡 ∈ℝ |𝐹 (𝜃 + i𝑡 )| = [ sup𝑡 ∈ℝ |𝑓 (𝜃 + i𝑡 )|] · 𝑀 ( 0) 𝜃 −1 𝑀 ( 1) −𝜃


= 𝑀 (𝜃 ) · 𝑀 ( 0) 𝜃 −1 𝑀 ( 1) −𝜃 .
The first equality is (7.5). Rearrange to reach the bound (7.4).
To prove Claim 7.14, we will use a regularization argument. For 𝜀 > 0, consider
the function
𝐹 𝜀 (𝑧) = 𝐹 (𝑧) · e^{𝜀 (𝑧^2 −1)} for 𝑧 ∈ Ω.
Once again, we recognize that 𝐹 𝜀 is analytic on Ω and continuous on Ω. Moreover, for
𝑧 = 𝜃 + i𝑡 with 𝜃 ∈ [0, 1] and 𝑡 ∈ ℝ, we can compute
|𝐹 𝜀 (𝑧)| = |𝐹 (𝑧)| · e^{𝜀 (𝜃^2 −1−𝑡^2 )} ≤ |𝐹 (𝑧)| · e^{−𝜀𝑡^2} ≤ |𝐹 (𝑧)| . (7.6)

Figure 7.2 Identifying the bounded domain to apply the maximum modulus principle.

Therefore, 𝐹 𝜀 is bounded on Ω, and |𝐹 𝜀 (𝑧)| ≤ 1 on 𝜕Ω.


Identify a positive 𝑟 > 0 such that e^{𝜀𝑟^2} ≥ sup_{𝑧 ∈ Ω} |𝐹 (𝑧)| ; see Figure 7.2. Consider the bounded rectangular domain R ⊂ Ω with sides defined by the lines Re 𝑧 = 1 and Re 𝑧 = 0 and Im 𝑧 = 𝑟 and Im 𝑧 = −𝑟 . By the computation in (7.6), we deduce that

|𝐹 𝜀 (𝑧)| ≤ |𝐹 (𝑧)| · e^{−𝜀𝑟^2} ≤ 1 for 𝑧 ∈ Ω with | Im 𝑧 | ≥ 𝑟 .

In other words, |𝐹 𝜀 | ≤ 1 on the part of the strip outside the rectangle R.


What about the rectangle R itself? By the maximum modulus principle, the
maximum of |𝐹 𝜀 | on R must be achieved on the boundary 𝜕R. According to the last
display, |𝐹 𝜀 | ≤ 1 on the top and bottom of the rectangle. By construction of the
auxiliary function and the regularizer, we have |𝐹 𝜀 | ≤ |𝐹 | ≤ 1 on the left and right
sides of the rectangle.
From the last two paragraphs, we determine that |𝐹 𝜀 (𝑧)| ≤ 1 on Ω. Finally, note
that
1 ≥ |𝐹 𝜀 (𝑧)| = |𝐹 (𝑧)| · e^{𝜀 (𝜃^2 −1−𝑡^2 )} → |𝐹 (𝑧)| pointwise as 𝜀 ↓ 0.
This observation completes the proof of Claim 7.14. 

Exercise 7.15 (Three-lines: Weaker regularity condition). From the proof of Theorem 7.13, we
may infer that the regularity of 𝑓 plays an important role. The boundedness condition can be relaxed, however, as long as we have the property that 𝑓 (𝑧) = 𝑜 ( e^{𝑐 |𝑧 |^2} ) for any positive constant 𝑐 . That is, 𝑓 should increase more slowly than any quadratic
exponential function. Adapt the proof to address this scenario.

Aside: (Analytic functions on the strip: Integral representations). There is an alternative


proof of the theorem via solving the Poisson equation for the strip [Ste56]. Indeed,
analytic functions are harmonic, satisfying Δ𝑓 = 0. Therefore, we can pass to the
Green’s function kernel of the strip and use an integral of the boundary values to
represent the function 𝑓 on the whole domain. We can establish the interpolation
inequality by bounding this integral.

7.4 Example: Duality for Schatten norms


Finally, with the Hadamard three-lines theorem at hand, we may give an alternative
proof of the duality for Schatten norms by establishing Proposition 7.4. Observe that

the interpolation inequality in (7.2) resembles the conclusion (7.4) of the three-lines
theorem. This parallel suggests an argument based on complex analysis. As a first
step, we explain what it means to apply a complex power to a positive matrix.

Definition 7.16 (Complex power). For a positive-semidefinite matrix 𝑨 ∈ 𝕄𝑛 (ℂ) , consider the spectral resolution 𝑨 = ∑_{𝑗} 𝜆 𝑗 𝑷 𝑗 , where 𝜆 𝑗 > 0 and the 𝑷 𝑗 are orthoprojectors. For a complex number 𝑧 ∈ ℂ, we define the complex power

𝑨^𝑧 B ∑_{𝑗} 𝜆 𝑗^𝑧 𝑷 𝑗 = ∑_{𝑗} e^{𝑧 log 𝜆 𝑗 } 𝑷 𝑗 .

This is a standard example of the functional calculus (for normal matrices).
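
In code, Definition 7.16 amounts to applying the scalar map 𝜆 ↦ 𝜆^𝑧 to the eigenvalues. The helper below is my own sketch (the function name is arbitrary); note that for purely imaginary 𝑧 = i𝑡 the result is a unitary matrix, a fact used at the end of this lecture.

```python
import numpy as np

def psd_complex_power(A, z):
    """A**z for a positive-definite A and complex z, via the spectral resolution."""
    lam, Q = np.linalg.eigh(A)                 # lam > 0 for positive-definite A
    return (Q * np.exp(z * np.log(lam))) @ Q.conj().T

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)                        # positive definite

# Consistency with integer powers, and unitarity of A**(it).
print(np.allclose(psd_complex_power(A, 2.0), A @ A))
Ait = psd_complex_power(A, 1j * 0.7)
print(np.allclose(Ait.conj().T @ Ait, np.eye(4)))
```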

The first step in the argument is to establish the duality between the trace norm
(Schatten-1) and the operator norm (Schatten-∞).
Exercise 7.17 (Trace norm and operator norm). By an independent argument, prove that

| tr (𝑨 ∗ 𝑩)| ≤ k𝑨 k 1 · k𝑩 k ∞ for all 𝑨, 𝑩 ∈ 𝕄𝑛 (ℂ) .

Hint: There are many possible arguments. The easiest approach is to introduce an SVD
of 𝑨 , cycle the trace, and use the variational definition of the largest singular value.

Proof of Proposition 7.4. Let 𝑨, 𝑩 ∈ ℍ𝑛+ (ℂ) be positive-semidefinite matrices, and let
𝜃 ∈ ( 0, 1) . We intend to bound the quantity

tr (𝑨 𝜃 𝑩 1−𝜃 𝑪 ) where k𝑪 k ∞ = 1.

By allowing the parameter 𝜃 to take on complex values, we can obtain an analytic


function for input to the Hadamard theorem.
Consider the function

𝑓 (𝑧) B tr (𝑨 𝑧 𝑩 1−𝑧 𝑪 ) for 0 ≤ Re 𝑧 ≤ 1.

By the definition of complex power, we can compute


𝑓 (𝑧) = ∑_{𝑗 ,𝑘} 𝜆 𝑗^𝑧 𝜇𝑘^{1−𝑧} tr (𝑷 𝑗 𝑸 𝑘 𝑪 ),

where we have introduced the spectral resolutions 𝑨 = ∑_{𝑗} 𝜆 𝑗 𝑷 𝑗 and 𝑩 = ∑_{𝑘} 𝜇𝑘 𝑸 𝑘 . We see that 𝑓 is a linear combination of functions that are analytic on all of ℂ. Therefore, 𝑓 is analytic on the strip Ω = {𝑧 : 0 < Re 𝑧 < 1} and continuous on the closure of the strip.
As in the statement of Hadamard’s three-lines theorem, define

𝑀 (𝜃 ) B sup𝑡 ∈ℝ | 𝑓 (𝜃 + i𝑡 )| for 0 ≤ 𝜃 ≤ 1.

We need to check that 𝑓 is bounded before we invoke the theorem. In fact, each term
of the linear combination in the expression for 𝑓 (𝑧) is bounded since

|𝜆 𝑗^𝑧 𝜇𝑘^{1−𝑧} | = 𝜆 𝑗^{Re 𝑧} 𝜇𝑘^{1−Re 𝑧} ≤ max_{𝑗 ,𝑘} max{𝜆 𝑗 , 𝜇𝑘 }.

Therefore, 𝑓 is indeed bounded. We can apply Theorem 7.13 to obtain the inequality

| tr (𝑨 𝜃 𝑩 1−𝜃 𝑪 )| ≤ 𝑀 (𝜃 ) ≤ 𝑀 ( 0) 1−𝜃 · 𝑀 ( 1) 𝜃 . (7.7)

Let us see why this inequality gives us what we need.



We can easily bound the terms 𝑀 ( 0) and 𝑀 ( 1) . Using Exercise 7.17 and Exercise 7.5,
we see that

| 𝑓 ( i𝑡 )| ≤ k𝑨 i𝑡 𝑩 1−i𝑡 k 1 · k𝑪 k ∞ = k𝑨 i𝑡 𝑩𝑩 −i𝑡 k 1 = k𝑩 k 1 .

Indeed, 𝑨 i𝑡 and 𝑩 −i𝑡 are unitary matrices and the Schatten norms are unitarily invariant.
We recognize that 𝑀 ( 0) = sup𝑡 ∈ℝ | 𝑓 ( i𝑡 )| ≤ k𝑩 k 1 . Similarly, we have 𝑀 ( 1) ≤ k𝑨 k 1 .
Finally, we can introduce these bounds into (7.7):

k𝑨 𝜃 𝑩 1−𝜃 k 1 = max{tr (𝑨 𝜃 𝑩 1−𝜃 𝑪 ) : k𝑪 k ∞ = 1} ≤ k𝑨 k 1𝜃 · k𝑩 k 11−𝜃 .

The first relation follows from Exercise 7.17. This is the required result. 

Notes
The idea of using complex interpolation to prove convexity inequalities is due to
Thorin. Littlewood described this approach as “the most impudent in mathematics,
and brilliantly successful” [Gar07, qtd. p. 135]. The application to proving matrix
inequalities is familiar to operator theorists, but it does not appear in the standard
books on matrix analysis. The presentation here is due to the instructor.
We have given an independent proof of the maximum modulus principle, adapted
to our purpose. The proof of the Hadamard three-lines theorem is also standard; for
example, see [Gar07, Prop. 9.1.1]. You will find more information about maximum
modulus principles on bounded and unbounded domains in any complex analysis text,
such as Ahlfors [Ahl66].
The proof of duality for the Schatten norms is intended as an illustration of these
ideas. Theorem 7.3 is due to Lust-Piquard, and the proof via complex interpolation
is reported by Pisier & Xu [PX97, Lem. 1.1]. There is a beautiful application of
complex interpolation to proving multivariate Golden–Thompson inequalities [SBT17].
The instructor has also used complex interpolation to develop new types of matrix
concentration inequalities [Tro18].

Lecture bibliography
[Ahl66] L. V. Ahlfors. Complex analysis: An introduction of the theory of analytic functions of
one complex variable. Second. McGraw-Hill Book Co., New York-Toronto-London,
1966.
[Gar07] D. J. H. Garling. Inequalities: a journey into linear analysis. Cambridge University
Press, Cambridge, 2007. doi: 10.1017/CBO9780511755217.
[PX97] G. Pisier and Q. Xu. “Non-commutative martingale inequalities”. In: Comm. Math.
Phys. 189.3 (1997), pages 667–698. doi: 10.1007/s002200050224.
[Ste56] E. M. Stein. “Interpolation of linear operators”. In: Transactions of the American
Mathematical Society 83.2 (1956), pages 482–492.
[SBT17] D. Sutter, M. Berta, and M. Tomamichel. “Multivariate trace inequalities”. In: Comm.
Math. Phys. 352.1 (2017), pages 37–58. doi: 10.1007/s00220-016-2778-5.
[Tro18] J. A. Tropp. “Second-order matrix concentration inequalities”. In: Appl. Comput.
Harmon. Anal. 44.3 (2018), pages 700–736. doi: 10.1016/j.acha.2016.07.005.
8. Uniform Smoothness and Convexity

Date: 27 January 2022 Scribe: Ethan Epperly

Agenda:
1. Convexity and smoothness
2. Uniform smoothness for Schatten norms
3. Proof: Scalar case
4. Proof: Matrix case

In the previous two lectures, we studied a class of unitarily invariant norms known as the Schatten 𝑝 -norms. Recall that the Schatten 𝑝 -norm (1 ≤ 𝑝 < ∞) is

k𝑨 k𝑝 := ( ∑_{𝑖=1}^{𝑛} 𝜎𝑖 (𝑨)^𝑝 )^{1/𝑝} for 𝑨 ∈ 𝕄𝑛 (𝔽) .

Today, we will study the geometry of the space of matrices equipped with the Schatten 𝑝 -norms. In particular, we shall answer the question “How smooth is the unit ball of the Schatten 𝑝 -norm?” In answering this question, we will develop a powerful uniform smoothness inequality for Schatten 𝑝 -norms, which has important applications in random matrix theory and other areas.

8.1 Convexity and smoothness


Let ( X, k·k) be a normed linear space. Our first task is to define a concept of smoothness
and a dual concept of convexity for the space X. To gain intuition for these concepts,
consider the pictoral depiction of a unit ball in Figure 8.1. At the point in orange
labeled “very convex”, the ball is more pointed and convex. The ball is more smooth
and flat at the blue point labeled “very smooth”.
As we suggested, the notions of convexity and smoothness are dual to each other.
If a unit ball is very convex, then the dual unit ball will be very smooth. This duality is
particularity apparent for polytopes, where the pointed vertices of polytope correspond
to the flat facets of its dual. See Figure 8.2 for an illustration with the 1 and ∞ balls.
Figure 8.1 The unit ball of a normed linear space ( X, k · k) and two points on its boundary where the ball is smooth and convex.

Let us develop a quantitative notion of convexity for the space X. Consider unit-norm vectors 𝒙 , 𝒚 ∈ X. Then, by convexity of k·k ,

k (1/2)(𝒙 + 𝒚 ) k ≤ (1/2) k𝒙 k + (1/2) k𝒚 k = 1.
We expect that if X is “very convex”, then this inequality will be far from saturated.
This motivates us to introduce the modulus of convexity:

𝛿 X (𝑠 ) B inf { 1 − k (1/2)(𝒙 + 𝒚 ) k : k𝒙 k = k𝒚 k = 1, k𝒙 − 𝒚 k = 2𝑠 }. (8.1)
The modulus of convexity is defined for all 𝑠 ∈ [ 0, 1] . We illustrate the modulus of
convexity in Figure 8.3.
Now, we turn our attention to defining a modulus of smoothness. For unit-norm
vectors 𝒙 , 𝒚 ∈ X, convexity implies
(1/2) k𝒙 + 𝜏𝒚 k + (1/2) k𝒙 − 𝜏𝒚 k ≤ 1 + 𝜏.

Figure 8.2 Duality of vertices and facets for the 1 and ∞ unit balls.
Once again, we expect this inequality to be far from saturated if the unit ball is “very
smooth” at 𝒙 . We quantify the discrepancy by the modulus of smoothness:
 
𝜚X (𝜏) B sup { (1/2) k𝒙 + 𝜏𝒚 k + (1/2) k𝒙 − 𝜏𝒚 k − 1 : k𝒙 k = k𝒚 k = 1 }. (8.2)

Figure 8.3 (Modulus of convexity). Graphical illustration of 𝛿 X (𝑠 ) , the


modulus of convexity (8.1).

Figure 8.4 (Modulus of smoothness). Graphical illustration of 𝜚X (𝜏) , the


modulus of smoothness (8.2).

The modulus of smoothness is defined for all 𝜏 ∈ [ 0, +∞) .


The anticipated duality between convexity and smoothness is encapsulated in the
following theorem. The basic qualitative result was obtained by M. M. Day, while the
quantitative form here is due to J. Lindenstrauss [Lin63].

Theorem 8.1 (Lindenstrauss). Let X be a normed linear space and X∗ its dual. Then

𝜚X∗ (𝜏) = sup {𝜏𝑠 − 𝛿X (𝑠 ) : 𝑠 ∈ [0, 1]} .

In other words, the modulus of smoothness of the dual of X is the Legendre–Fenchel


conjugate of the modulus of convexity of X.

Informally, this theorem states that if X is very convex, then its dual X∗ is very
smooth. The converse is valid as well.

8.2 Uniform smoothness for Schatten norms


The main result of this lecture is an inequality that fully captures the uniform smoothness
properties of matrices equipped with the Schatten 𝑝 -norm (𝑝 ≥ 2). We begin with the
statement and its consequences, and then we outline some applications before turning
to the proof.

Theorem 8.2 (Tomczak-Jaegermann 1974; Ball–Carlen–Lieb 1994). For 𝑿 ,𝒀 ∈ 𝕄𝑛 (𝔽) and 𝑝 ≥ 2, we have

[ (1/2) k𝑿 + 𝒀 k𝑝^𝑝 + (1/2) k𝑿 − 𝒀 k𝑝^𝑝 ]^{2/𝑝} ≤ k𝑿 k𝑝^2 + (𝑝 − 1) k𝒀 k𝑝^2 . (8.3)

The constant 𝑝 − 1 is optimal, even for 𝑛 = 1.

Theorem 8.2, and indeed all of the results in this lecture, also hold for rectangular matrices 𝑿 ,𝒀 ∈ 𝕄𝑚×𝑛 (𝔽) .

This result was first established by Tomczak-Jaegermann in the more general setting
of trace-class operators on Hilbert spaces [Tom74]; she obtained the optimal constant
𝑝 − 1 for all even integers 𝑝 . Ball, Carlen, and Lieb [BCL94] obtained the optimal
constant of 𝑝 − 1 for all 𝑝 ≥ 2. We will present an alternative approach to the theorem,
developed in 2021 by the instructor (unpublished).
Several remarks are in order:

• By considering diagonal matrices, we see that the same inequality holds for vectors 𝒙 , 𝒚 ∈ ℓ𝑝^𝑛 with 𝑝 ≥ 2:

  [ (1/2) k𝒙 + 𝒚 k𝑝^𝑝 + (1/2) k𝒙 − 𝒚 k𝑝^𝑝 ]^{2/𝑝} ≤ k𝒙 k𝑝^2 + (𝑝 − 1) k𝒚 k𝑝^2 .

  In fact, all of the other results in this lecture have parallels for ℓ𝑝^𝑛 spaces and for L𝑝 spaces.
• By Lyapunov’s inequality,

  (1/2) k𝑿 + 𝒀 k𝑝^2 + (1/2) k𝑿 − 𝒀 k𝑝^2 ≤ k𝑿 k𝑝^2 + (𝑝 − 1) k𝒀 k𝑝^2 . (8.4)

Equivalently, invoke the concavity of 𝑡 ↦→ 𝑡 2/𝑝 and Jensen’s inequality.


• The uniform smoothness bound (8.4) is reversed for 𝑝 ∈ [ 1, 2] , yielding a uniform
convexity bound:
  (1/2) k𝑿 + 𝒀 k𝑝^2 + (1/2) k𝑿 − 𝒀 k𝑝^2 ≥ k𝑿 k𝑝^2 + (𝑝 − 1) k𝒀 k𝑝^2 for 𝑝 ∈ [ 1, 2] . (8.5)
• For the Schatten 2-norm (i.e., the Frobenius norm), the bounds (8.3) and (8.4) hold with equality. This is the parallelogram law:

  (1/2) k𝑿 + 𝒀 k 2^2 + (1/2) k𝑿 − 𝒀 k 2^2 = k𝑿 k 2^2 + k𝒀 k 2^2 ,

  which holds for vectors 𝑿 and 𝒀 in a Hilbert space with norm k·k . See Figure 8.5 for an illustration.
• For 𝑝 ≈ log 𝑛 , we have k·k𝑝 ≈ k·k ∞ . See Exercise 8.5 for a quantitative statement. Plugging this relation into (8.4),

  (1/2) k𝑿 + 𝒀 k ∞^2 + (1/2) k𝑿 − 𝒀 k ∞^2 ≲ k𝑿 k ∞^2 + log 𝑛 · k𝒀 k ∞^2 .

Figure 8.5 Geometry of the parallelogram law.

The next set of exercises supports these observations.

Exercise 8.3 (Optimality). Prove that (8.3) cannot be improved by replacing 𝑝 − 1 by a


smaller constant. (Hint: Consider 𝑛 = 1. Use a second order expansion for small 𝒀 .)
Problem 8.4 (Duality). Deduce that the uniform convexity result (8.5) follows as a formal
consequence of (8.4).

Exercise 8.5 (Approximating 𝑆 ∞ ). For 𝑿 ∈ 𝕄𝑛 (𝔽) , show that

k𝑿 k ∞ ≤ k𝑿 k𝑝 ≤ 𝑛 1/𝑝 k𝑿 k ∞ for all 𝑝 ≥ 1.

For all 𝑛 ≥ 2, conclude that



k𝑿 k ∞ ≤ k𝑿 k𝑝 ≤ e k𝑿 k ∞ for 𝑝 = 2 log 𝑛 . (8.6)
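
The following quick experiment (my own sketch, not part of the notes) evaluates both sides of the uniform smoothness inequality (8.3) for random matrices and several values of 𝑝 , with the Schatten norm computed from singular values.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6

def schatten(A, p):
    s = np.linalg.svd(A, compute_uv=False)
    return (s ** p).sum() ** (1 / p)

X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Y = 0.3 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

for p in (2.0, 3.0, 4.0, 8.0):
    lhs = (0.5 * schatten(X + Y, p) ** p + 0.5 * schatten(X - Y, p) ** p) ** (2 / p)
    rhs = schatten(X, p) ** 2 + (p - 1) * schatten(Y, p) ** 2
    print(f"p = {p:4.1f}:  lhs = {lhs:9.4f}  <=  rhs = {rhs:9.4f}")
    assert lhs <= rhs + 1e-8
```

For 𝑝 = 2 the two sides agree, reflecting the parallelogram law noted above.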

As a corollary of the uniform smoothness bound Theorem 8.2, we obtain the


modulus of smoothness for the Schatten 𝑝 -norm:
Corollary 8.6 (Modulus of smoothness). For the normed space X = (𝕄𝑛 (𝔽), k·k𝑝 ) with
𝑝 ≥ 2, the modulus of smoothness satisfies the bound
𝜚(𝜏) ≤ (1/2) (𝑝 − 1) 𝜏^2 .

The result is sharp as 𝜏 ↓ 0 in the sense that lim_{𝜏 ↓ 0} (1/2)(𝑝 − 1)𝜏^2 / 𝜚(𝜏) = 1.

Proof. Fix 𝑝 ≥ 2 and consider matrices 𝑿 and 𝒀 with k𝑿 k𝑝 = k𝒀 k𝑝 = 1. By convexity


of the function 𝑎 ↦→ |𝑎 | 2 − 1, we obtain the numeric inequality

𝑎 − 1 ≤ (1/2) (|𝑎 |^2 − 1) for 𝑎 ∈ ℝ.
Applying this inequality with 𝑎 = k𝑿 + 𝜏𝒀 k𝑝 and 𝑎 = k𝑿 − 𝜏𝒀 k𝑝 gives

(1/2) k𝑿 + 𝜏𝒀 k𝑝 + (1/2) k𝑿 − 𝜏𝒀 k𝑝 − 1 ≤ (1/2) [ (1/2) k𝑿 + 𝜏𝒀 k𝑝^2 + (1/2) k𝑿 − 𝜏𝒀 k𝑝^2 − 1 ].

Now apply the uniform smoothness bound (8.4). Since k𝑿 k𝑝 = k𝒀 k𝑝 = 1, the right-hand side is at most

(1/2) [ k𝑿 k𝑝^2 + (𝑝 − 1) 𝜏^2 k𝒀 k𝑝^2 − 1 ] = (1/2) (𝑝 − 1) 𝜏^2 .

Taking the supremum over unit-norm 𝑿 and 𝒀 , we obtain 𝜚X (𝜏) ≤ (1/2)(𝑝 − 1)𝜏^2 .

This is the promised conclusion. The optimality for small 𝜏 can be proven using a
second-order expansion for small 𝜏 , as in Exercise 8.3. 
Qualitatively, this result shows that the unit ball of the Schatten 𝑝 -norm becomes
less smooth as 𝑝 increases. By analogy to the ℓ_p spaces, this result is unsurprising
as the ℓ_2 unit ball is round and smooth, whereas the ℓ_∞ unit ball has rough corners.
Corollary 8.6 confirms that this intuition carries over to Schatten 𝑝 -norms. See
Figure 8.6 for a schematic drawing.

Figure 8.6  The unit ball of the Schatten 𝑝 -norm becomes rougher as 𝑝 increases.

Exercise 8.7 (Modulus of convexity). Use Theorem 8.1 and Corollary 8.6 to bound the
modulus of convexity for the Schatten 𝑝 -norm (1 ≤ 𝑝 ≤ 2).

8.3 Application: Sum of independent random matrices


The uniform smoothness bound Theorem 8.2 has important implications for random
matrix theory and related areas. We shall focus on one application: bounding the
expected spectral norm of a sum of independent random matrices.

First, consider an independent family (𝑋_1 , . . . , 𝑋_𝑁) of centered real random
variables. Then, by the additivity of the variance,

    𝔼 ( Σ_{𝑖=1}^𝑁 𝑋_𝑖 )^2 = Σ_{𝑖=1}^𝑁 𝔼 |𝑋_𝑖|^2 .

Recall that a random variable 𝑋 is centered if it has expectation zero: 𝔼 𝑋 = 0.

This calculation carries over immediately to Hilbert spaces. For instance, for an
independent family (𝑿 1 , . . . , 𝑿 𝑁 ) of centered random variables in the space of matrices
𝕄𝑛 (𝔽) equipped with the Schatten 2-norm,
    𝔼 ‖ Σ_{𝑖=1}^𝑁 𝑿_𝑖 ‖_2^2 = Σ_{𝑖=1}^𝑁 𝔼 ‖𝑿_𝑖‖_2^2 .   (8.7)

This statement can be verified using the bilinearity of the trace inner product h𝑿 , 𝒀 i :=
tr (𝑿 ∗𝒀 ) .
Unfortunately, Schatten 2-norm bounds are often uninformative for computational
applications of random matrices [Tro15, pp. 122–123]. With the help of the uniform
smoothness bound (8.4), we can obtain a Schatten 𝑝 -norm inequality version of (8.7).
By taking 𝑝 ≈ log 𝑛 , the result of Exercise 8.5 yields a Schatten ∞-norm bound (i.e., a
bound in the spectral norm).
To begin, observe that the uniform smoothness bound (8.4) can be interpreted
in probabilistic language. Introduce a random variable 𝜀 ∼ uniform{±1}; such a
variable is referred to as a Rademacher random variable. The conclusion of (8.4) can
be reformulated as an expectation:

    𝔼 ‖𝑿 + 𝜀𝒀‖_p^2 ≤ ‖𝑿‖_p^2 + (𝑝 − 1) · 𝔼 ‖𝜀𝒀‖_p^2 .


We can improve this inequality in the same way that we can upgrade midpoint convexity
to the full statement of Jensen’s inequality. Indeed, the same bound holds if we replace
𝜀𝒀 by a general centered random matrix 𝒁 .
Proposition 8.8 (Ricard–Xu 2016). Let 𝑿 ∈ 𝕄𝑛 (𝔽) be a fixed matrix and 𝒁 ∈ 𝕄𝑛 (𝔽) a
centered random matrix. For 𝑝 ≥ 2,

𝔼 k𝑿 + 𝒁 k𝑝2 ≤ k𝑿 k𝑝2 + (𝑝 − 1) · 𝔼 k𝒁 k𝑝2 .


This result was originally proven by Ricard and Xu [RX16] in the setting of von
Neumann algebras. An elementary proof appears in the paper [Hua+21, Lem. A.1]. We
present an even simpler proof adapted from [Nao12], which yields a slightly suboptimal
constant of 2 (𝑝 − 1) in place of 𝑝 − 1.

Proof. We compute

    (1/2) ( 𝔼 ‖𝑿 + 𝒁‖_p^2 + ‖𝑿‖_p^2 ) ≤ (1/2) ( 𝔼 ‖𝑿 + 𝒁‖_p^2 + 𝔼 ‖𝑿 − 𝒁‖_p^2 )
                                      ≤ ‖𝑿‖_p^2 + (𝑝 − 1) · 𝔼 ‖𝒁‖_p^2 .

The first inequality is Jensen’s inequality, and the second is uniform smoothness (8.4).
Rearranging gives the advertised result with a suboptimal constant 2(𝑝 − 1) . 
Applying this result iteratively yields inequalities for the expected Schatten norm
of a sum of independent random matrices.
Corollary 8.9 (Sums of independent random matrices). Consider an independent family
(𝑿_1 , . . . , 𝑿_𝑁) of centered random matrices in 𝕄𝑛 (𝔽) . For each 𝑝 ≥ 2,

    𝔼 ‖ Σ_{𝑖=1}^𝑁 𝑿_𝑖 ‖_p^2 ≤ (𝑝 − 1) Σ_{𝑖=1}^𝑁 𝔼 ‖𝑿_𝑖‖_p^2 .   (8.8)

Consequently, for each 𝑛 ≥ 3,

    𝔼 ‖ Σ_{𝑖=1}^𝑁 𝑿_𝑖 ‖_∞ ≤ √( 2e log 𝑛 · Σ_{𝑖=1}^𝑁 𝔼 ‖𝑿_𝑖‖_∞^2 ) .   (8.9)

Proof sketch. The first conclusion follows from Proposition 8.8 applied iteratively to
each 𝑿_𝑘 , conditional on the realization of (𝑿_1 , . . . , 𝑿_{𝑘−1}) . For the second conclusion,
we choose 𝑝 = 2 log 𝑛 to obtain

    ( 𝔼 ‖ Σ_{𝑖=1}^𝑁 𝑿_𝑖 ‖_∞ )^2 ≤ 𝔼 ‖ Σ_{𝑖=1}^𝑁 𝑿_𝑖 ‖_∞^2 ≤ 𝔼 ‖ Σ_{𝑖=1}^𝑁 𝑿_𝑖 ‖_p^2
                                ≤ (𝑝 − 1) Σ_{𝑖=1}^𝑁 𝔼 ‖𝑿_𝑖‖_p^2 ≤ 2e log 𝑛 · Σ_{𝑖=1}^𝑁 𝔼 ‖𝑿_𝑖‖_∞^2 .

The first inequality is Jensen, the second and fourth are the norm comparison (8.6),
and the third is (8.8). Taking square roots gives the desired conclusion. 
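
Aside: (Numerical illustration of (8.9)). The following Python sketch, assuming NumPy,
gives a rough Monte Carlo comparison between the expected spectral norm of a sum of
i.i.d. centered Wigner-type summands and the right-hand side of (8.9). The random matrix
model and all parameters are illustrative choices, not prescribed by the lecture.

    import numpy as np

    rng = np.random.default_rng(1)
    n, N, trials = 50, 30, 200

    def random_summand():
        G = rng.standard_normal((n, n)) / np.sqrt(n)
        return (G + G.T) / 2            # centered Hermitian summand

    lhs = np.mean([
        np.linalg.norm(sum(random_summand() for _ in range(N)), 2)
        for _ in range(trials)
    ])
    var_term = N * np.mean([np.linalg.norm(random_summand(), 2) ** 2
                            for _ in range(trials * N)])
    rhs = np.sqrt(2 * np.e * np.log(n) * var_term)
    print(f"E||sum X_i||_inf ~ {lhs:.3f}  <=  bound {rhs:.3f}")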
Uniform smoothness can be used to derive matrix concentration inequalities for
other random matrix models, such as products of independent random matrices in
[Hua+21]. See [Tro15] for a survey of exponential matrix concentration inequalities,
based on more sophisticated tools from matrix analysis. The √(log 𝑛) prefactor in (8.9)
is known to be necessary [Tro15, pp. 114–115], but it can sometimes be removed with
still more difficult tools [BBv21].
Aside: (Uniform smoothness and quantum computation). As another application, recent
work has used uniform smoothness to obtain improved error bounds for simulating the
time evolution of quantum systems using Trotter formulas, a workhorse algorithm in
quantum computation [CB21].

8.4 Proof: Scalar case


Having discussed applications of the uniform smoothness bound (8.3), we now turn
to the task of proving it. To reduce notational clutter, we shall introduce probabilistic
notation, as we already did in Section 8.3. Let 𝑥, 𝑦 ∈ ℝ be real numbers, and let 𝜀 be a
Rademacher random variable. Our task is to prove the following inequality:

(𝔼 |𝑥 + 𝜀𝑦 |𝑝 ) 2/𝑝 ≤ |𝑥 | 2 + (𝑝 − 1) · |𝑦 | 2 . (8.10)

This is known as the Gross two-point inequality. To simplify, we shall assume 𝑝 is an


even natural number. Lifting this restriction is left to the reader (Exercise 8.11 and
Problem 8.12).
Our approach will be based on interpolation. The idea is straightforward. We define
a path between a simple object that we understand and a more complicated object.
We obtain inequalities by controlling the derivative along the interpolation path. There
are other simpler ways of proving the Gross two-point inequality (8.10), but the proof
by interpolation generalizes directly to matrices. In this task, the following lemma will
be helpful.
Lemma 8.10 (Mean-value inequality). Let 𝜑 : ℝ → ℝ be such that 𝜑′ is convex. Then,
for all 𝑎, 𝑏 ∈ ℝ,

    (𝑏 − 𝑎) (𝜑(𝑏) − 𝜑(𝑎)) ≤ (1/2) (𝑏 − 𝑎)^2 (𝜑′(𝑏) + 𝜑′(𝑎)) .

Proof. By the fundamental theorem of calculus and convexity,

    (𝑏 − 𝑎) (𝜑(𝑏) − 𝜑(𝑎)) = (𝑏 − 𝑎)^2 ∫_0^1 𝜑′((1 − 𝜏)𝑏 + 𝜏𝑎) d𝜏
                          ≤ (𝑏 − 𝑎)^2 ∫_0^1 [ (1 − 𝜏) 𝜑′(𝑏) + 𝜏 𝜑′(𝑎) ] d𝜏
                          = (1/2) (𝑏 − 𝑎)^2 (𝜑′(𝑏) + 𝜑′(𝑎)) .
This is the desired conclusion. 
With this result in hand, we prove the Gross two-point inequality (8.10).

Proof of Theorem 8.2 (scalar case). Assume that 𝑝 is an even natural number. Define
the interpolant

    𝑢(𝑡) := 𝔼 |𝑥 + 𝜀√𝑡 𝑦|^p   for 𝑡 ∈ [0, 1].   (8.11)

Observe that

    𝑢(0) = |𝑥|^p   and   𝑢(1) = 𝔼 |𝑥 + 𝜀𝑦|^p .

The value 𝑢(1) at the endpoint is the left-hand side of Gross’s inequality (8.10), raised
to the 𝑝/2 power.

Aside: Why define the interpolant 𝑢 as (8.11) with the square-root dependence √𝑡 instead
of the more natural-seeming linear interpolant 𝑢(𝑡) := 𝔼 |𝑥 + 𝜀𝑡𝑦|^p ? In fact, the linear
choice also works, but it gives a suboptimal constant of 2(𝑝 − 1) in place of 𝑝 − 1.

With the help of the mean-value inequality (Lemma 8.10) for the function 𝜑 : 𝑎 ↦ 𝑎^{p−1},
we compute and bound the derivative of the interpolant 𝑢 :

    𝑢̇(𝑡) = 𝑝 · (1/(2√𝑡)) · 𝔼 [ (𝜀𝑦)(𝑥 + 𝜀𝑦√𝑡)^{p−1} ]
          = 𝑝 · (1/(4𝑡)) · (2𝑦√𝑡) · [ (1/2)(𝑥 + 𝑦√𝑡)^{p−1} − (1/2)(𝑥 − 𝑦√𝑡)^{p−1} ]
          ≤ (1/2) 𝑝(𝑝 − 1) 𝑦^2 · [ (1/2)(𝑥 + 𝑦√𝑡)^{p−2} + (1/2)(𝑥 − 𝑦√𝑡)^{p−2} ]
          = (1/2) 𝑝(𝑝 − 1) 𝑦^2 · 𝔼 (𝑥 + 𝜀𝑦√𝑡)^{p−2}
          ≤ (1/2) 𝑝(𝑝 − 1) 𝑦^2 · [ 𝔼 (𝑥 + 𝜀𝑦√𝑡)^p ]^{1−2/p}
          = (1/2) 𝑝(𝑝 − 1) 𝑦^2 · 𝑢(𝑡)^{1−2/p} .
The first inequality is the mean-value inequality Lemma 8.10, and the second inequality
is Lyapunov’s inequality. We have obtained a differential inequality for the function 𝑢 .
To solve this inequality, define the function 𝑣(𝑡) = 𝑢(𝑡)^{2/p} for 𝑡 ∈ [0, 1]. Then
𝑣(0) = |𝑥|^2 , while 𝑣(1) is the left-hand side of Gross’ inequality (8.10). Finally, we
use the fundamental theorem of calculus to compute

    (𝔼 |𝑥 + 𝜀𝑦|^p)^{2/p} − |𝑥|^2 = 𝑣(1) − 𝑣(0) = ∫_0^1 𝑣̇(𝑡) d𝑡
                                 = ∫_0^1 (2/𝑝) 𝑢(𝑡)^{2/p−1} · 𝑢̇(𝑡) d𝑡   (8.12)
                                 ≤ ∫_0^1 (𝑝 − 1) 𝑦^2 d𝑡 = (𝑝 − 1) 𝑦^2 .

The inequality is transferred from the last display. Rearranging gives the Gross
two-point inequality (8.10). 

Exercise 8.11 (Gross: 𝑝 ≥ 3). Modify the proof of Gross’ two-point inequality (8.10) to
handle the case 𝑝 ≥ 3. Hint: This is essentially the same proof; the only change is that
signum functions arise from the derivative.
Problem 8.12 (Gross: 𝑝 ∈ ( 2, 3) ). Prove Gross’ two-point inequality (8.10) for 𝑝 ∈ ( 2, 3) .
Hint: The derivative 𝜑 0 is now concave. Modify Lemma 8.10, bounding the integral by
the midpoint rule.

8.5 Proof: Matrix case


We shall now extend the proof of Gross’ two-point inequality for scalars (8.10) to prove
the uniform smoothness bound (8.3) for matrices.

8.5.1 Setup
The first step is a standard tool in matrix analysis; we transform possibly non-Hermitian
matrices 𝑿 ,𝒀 to Hermitian matrices by passing to the Hermitian dilation:

    𝑴 ∈ 𝕄𝑛 (𝔽)   maps to   H(𝑴) := [ 0   𝑴 ; 𝑴^∗   0 ] ∈ ℍ_{2𝑛} (𝔽).

As the following exercise shows, this allows us to assume without loss of generality
that 𝑿 and 𝒀 are Hermitian.
Exercise 8.13 (Hermitian dilation). Let 𝑨, 𝑩 ∈ 𝕄𝑛 (𝔽) . Prove the following properties of
the Hermitian dilation:

1. The mapping H : 𝕄𝑛 (𝔽) → ℍ2𝑛 (𝔽) is a real-linear function. That is, for
𝑨, 𝑩 ∈ 𝕄𝑛 (𝔽) and 𝛼, 𝛽 ∈ ℝ,

H(𝛼𝑨 + 𝛽𝑩) = 𝛼 H(𝑨) + 𝛽 H(𝑩).

2. The eigenvalues of H(𝑨) are equal to ±𝜎𝑖 (𝑨) for 𝑖 = 1, 2, . . . , 𝑛 .


3. Schatten norms of 𝑨 and H(𝑨) are proportional: ‖H(𝑨)‖_p = 2^{1/p} ‖𝑨‖_p .

For a Hermitian matrix 𝑴 , the Schatten 𝑝 -norm is given by a trace:

k𝑴 k𝑝 = ( tr |𝑴 |𝑝 ) 1/𝑝 where |𝑴 | = (𝑴 ∗ 𝑴 ) 1/2 .

For even 𝑝 and Hermitian 𝑴 , this expression simplifies further: k𝑴 k𝑝 = ( tr 𝑴 𝑝 ) 1/𝑝 .


This reformulation is useful because it will allow us to leverage the linearity and
cyclicity of the trace in our calculations.
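
Aside: (Numerical check of the dilation). A small Python sketch, assuming NumPy, that
constructs the Hermitian dilation and confirms the properties listed in Exercise 8.13 for an
arbitrary random test matrix.

    import numpy as np

    def hermitian_dilation(M):
        """H(M) = [[0, M], [M*, 0]]."""
        m, n = M.shape
        return np.block([[np.zeros((m, m), dtype=complex), M],
                         [M.conj().T, np.zeros((n, n), dtype=complex)]])

    rng = np.random.default_rng(2)
    M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
    H = hermitian_dilation(M)

    sigma = np.linalg.svd(M, compute_uv=False)
    eigs = np.linalg.eigvalsh(H)
    # Eigenvalues of H(M) are +/- the singular values of M.
    assert np.allclose(np.sort(eigs), np.sort(np.concatenate([sigma, -sigma])))

    p = 4.0
    norm_H = np.linalg.norm(np.abs(eigs), ord=p)      # ||H(M)||_p from eigenvalues
    norm_M = np.linalg.norm(sigma, ord=p)             # ||M||_p from singular values
    assert np.isclose(norm_H, 2 ** (1 / p) * norm_M)  # ||H(M)||_p = 2^{1/p} ||M||_p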

8.5.2 Trace functions and inequalities


The linearity and cyclicity of the trace allow for many scalar inequalities to be “upgraded”
to trace inequalities by systematic arguments. We introduce one such technique, the
generalized Klein inequality, in this section. In order to do this, we begin by defining
matrix functions:

Definition 8.14 (Standard matrix function). Let 𝜑 : ℝ → ℝ. We extend this function
to Hermitian matrices 𝜑 : ℍ𝑛 → ℍ𝑛 as follows. Let 𝑨 ∈ ℍ𝑛 be a matrix with
spectral resolution

    𝑨 = Σ_𝑖 𝜆_𝑖 𝑷_𝑖 .

Then

    𝜑(𝑨) := Σ_𝑖 𝜑(𝜆_𝑖) 𝑷_𝑖 ∈ ℍ𝑛 (𝔽) .
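
Aside: (Computing a standard matrix function). Definition 8.14 translates directly into
code. The following Python sketch, assuming NumPy, evaluates 𝜑(𝑨) through the spectral
resolution of a random symmetric test matrix and sanity-checks it against a polynomial
identity; the test matrix and the choice 𝜑(𝑡) = 𝑡^3 are arbitrary.

    import numpy as np

    def matrix_function(phi, A):
        # Apply a scalar function phi spectrally to a Hermitian matrix A.
        lam, U = np.linalg.eigh(A)             # A = U diag(lam) U*
        return (U * phi(lam)) @ U.conj().T     # sum_i phi(lam_i) P_i

    rng = np.random.default_rng(3)
    G = rng.standard_normal((5, 5))
    A = (G + G.T) / 2
    # Sanity check: for phi(t) = t**3, the spectral definition agrees with A @ A @ A.
    assert np.allclose(matrix_function(lambda t: t**3, A), A @ A @ A)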

We will need the following standard derivative computation.


Proposition 8.15 (Derivative of trace power). For each natural number 𝑝 ∈ ℕ and all
𝑨, 𝑯 ∈ ℍ𝑛 (𝔽) ,

    (d/d𝑡) tr (𝑨 + 𝑡𝑯)^p |_{𝑡=0} = 𝑝 tr [𝑨^{p−1} 𝑯] .

Proof sketch. The key observation is the identity


    𝑩^p − 𝑨^p = Σ_{𝑘=0}^{p−1} 𝑩^𝑘 (𝑩 − 𝑨) 𝑨^{p−1−𝑘}   for all 𝑨, 𝑩 ∈ 𝕄𝑛 .

This formula can be verified by direct algebraic manipulation. 


Exercise 8.16 (Derivative of trace power). Prove Proposition 8.15.
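
Aside: (Finite-difference check of Proposition 8.15). The derivative formula can be
sanity-checked numerically. The following Python sketch, assuming NumPy, compares a
central finite difference of 𝑡 ↦ tr (𝑨 + 𝑡𝑯)^p at 𝑡 = 0 with 𝑝 tr [𝑨^{p−1}𝑯] on random
symmetric test matrices (an arbitrary example, not part of the notes).

    import numpy as np

    rng = np.random.default_rng(4)
    n, p, h = 6, 4, 1e-6
    G1, G2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    A, H = (G1 + G1.T) / 2, (G2 + G2.T) / 2

    def tr_power(M, k):
        return np.trace(np.linalg.matrix_power(M, k))

    fd = (tr_power(A + h * H, p) - tr_power(A - h * H, p)) / (2 * h)
    exact = p * np.trace(np.linalg.matrix_power(A, p - 1) @ H)
    assert np.isclose(fd, exact, rtol=1e-5)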
There are a variety of techniques for deriving trace inequalities from scalar inequal-
ities. We shall make use of the following result.
Proposition 8.17 (Generalized Klein inequality). Suppose that 𝑓_𝑘 , 𝑔_𝑘 : ℝ → ℝ are functions
such that

    Σ_𝑘 𝑓_𝑘(𝑎) 𝑔_𝑘(𝑏) ≥ 0   for all 𝑎, 𝑏 ∈ ℝ.

Then

    Σ_𝑘 tr [𝑓_𝑘(𝑨) 𝑔_𝑘(𝑩)] ≥ 0   for all 𝑨, 𝑩 ∈ ℍ𝑛 (𝔽).

Proof. Introduce spectral resolutions

    𝑨 = Σ_𝑖 𝜆_𝑖 𝑷_𝑖   and   𝑩 = Σ_𝑗 𝜇_𝑗 𝑸_𝑗 .

Then calculate

    Σ_𝑘 tr [𝑓_𝑘(𝑨) 𝑔_𝑘(𝑩)] = Σ_𝑘 tr [ ( Σ_𝑖 𝑓_𝑘(𝜆_𝑖) 𝑷_𝑖 ) ( Σ_𝑗 𝑔_𝑘(𝜇_𝑗) 𝑸_𝑗 ) ]
                           = Σ_{𝑖,𝑗} [ Σ_𝑘 𝑓_𝑘(𝜆_𝑖) 𝑔_𝑘(𝜇_𝑗) ] · tr (𝑷_𝑖 𝑸_𝑗) .

The bracketed quantity is positive by assumption and tr (𝑷 𝑖 𝑸 𝑗 ) is positive because the


trace of the product of two positive-semidefinite matrices is positive. Therefore,
∑︁
tr [𝑓𝑘 (𝑨) 𝑔𝑘 (𝑩)] ≥ 0,
𝑘

as claimed. 
Exercise 8.18 (Trace of psd product). Prove that tr (𝑨𝑩) ≥ 0 for positive semidefinite
matrices 𝑨 and 𝑩 . Hint: Write 𝑩 = 𝑩 1/2 𝑩 1/2 and cycle the trace.
As a corollary, we obtain a trace version of the mean-value inequality, Lemma 8.10.
Corollary 8.19 (Mean-value trace inequality). Let 𝜑 : ℝ → ℝ be such that 𝜓 := 𝜑′ is
convex. Then for 𝑨, 𝑩 ∈ ℍ𝑛 (𝔽) ,

    tr [(𝑩 − 𝑨)(𝜑(𝑩) − 𝜑(𝑨))] ≤ (1/2) tr [ (𝑩 − 𝑨)^2 (𝜓(𝑩) + 𝜓(𝑨)) ] .

Exercise 8.20 (Mean-value trace inequality). Prove Corollary 8.19 using the mean-value
inequality (Lemma 8.10) and the generalized Klein inequality (Proposition 8.17).
Exercise 8.21 (Derivative of a power). Derive the result in Proposition 8.15 using the
generalized Klein inequality (Proposition 8.17). Extend this argument to compute the
derivative of tr |·|𝑝 for all 𝑝 > 0.

8.5.3 Proof of Theorem 8.2


With these ingredients, we may complete the proof of the uniform smoothness bound
for Schatten 𝑝 -norms.

Proof of Theorem 8.2. By passing to the Hermitian dilation if necessary, we can assume
that 𝑿 and 𝒀 are Hermitian. We again restrict attention to 𝑝 an even integer. We leave
the proof for general 𝑝 > 2 as an exercise.
Define the interpolant
    𝑢(𝑡) := 𝔼 tr (𝑿 + 𝜀𝒀√𝑡)^p   for 𝑡 ∈ [0, 1].

We now compute and bound the derivative of 𝑢 using our computation for the
derivative of the trace power (Proposition 8.15) and the mean-value trace inequality
(Corollary 8.19):

    𝑢̇(𝑡) = 𝑝 · (1/(2√𝑡)) · 𝔼 tr [ (𝜀𝒀)(𝑿 + 𝜀𝒀√𝑡)^{p−1} ]
          = 𝑝 · (1/(4𝑡)) · tr [ (2√𝑡 𝒀) ( (1/2)(𝑿 + 𝒀√𝑡)^{p−1} − (1/2)(𝑿 − 𝒀√𝑡)^{p−1} ) ]
          ≤ (1/2) 𝑝 · tr [ 𝒀^2 ( (1/2)(𝑝 − 1)(𝑿 + 𝒀√𝑡)^{p−2} + (1/2)(𝑝 − 1)(𝑿 − 𝒀√𝑡)^{p−2} ) ]
          = (1/2) 𝑝(𝑝 − 1) · 𝔼 tr [ 𝒀^2 (𝑿 + 𝜀𝒀√𝑡)^{p−2} ]
          ≤ (1/2) 𝑝(𝑝 − 1) (tr 𝒀^p)^{2/p} · 𝔼 [ tr (𝑿 + 𝜀𝒀√𝑡)^p ]^{1−2/p}
          ≤ (1/2) 𝑝(𝑝 − 1) (tr 𝒀^p)^{2/p} · [ 𝔼 tr (𝑿 + 𝜀𝒀√𝑡)^p ]^{1−2/p}
          = (1/2) 𝑝(𝑝 − 1) ‖𝒀‖_p^2 · 𝑢(𝑡)^{1−2/p} .

Aside: By convention, the trace binds after nonlinear operations. For example,
tr 𝒀^p = tr (𝒀^p) .
The first equality is Proposition 8.15. The three inequalities are Corollary 8.19, the
Hölder inequality for Schatten norms, and Lyapunov’s inequality. The remainder of
the proof follows from the same calculation (8.12) as the scalar case. 

Problems
Problem 8.22 (Matrix Khintchine inequality). The uniform smoothness inequalities in this
lecture can be substantially improved using the same method. Consider a self-adjoint
random matrix of the form
    𝑿 = 𝑩 + Σ_{𝑖=1}^𝑛 𝜀_𝑖 𝑨_𝑖   where the 𝜀_𝑖 are i.i.d. Rademacher.

The self-adjoint matrices 𝑩, 𝑨 1 , . . . , 𝑨 𝑛 ∈ ℍ𝑑 are fixed.



1. For 𝑝 ≥ 2, prove the matrix Khintchine inequality:

       𝔼 ‖𝑿‖_p^2 ≤ ‖𝑩‖_p^2 + (𝑝 − 1) · ‖ Σ_{𝑖=1}^𝑛 𝑨_𝑖^2 ‖_p .

2. For even 𝑝 , show that the constant can be improved to [(𝑝 − 1) !!] 1/𝑝 .

Notes
Leonard Gross, for whom the Gross two-point inequality (8.10) is named, is a math-
ematician and mathematical physicist who did foundational work on quantum field
theory and statistical physics. He is among the early developers of logarithmic Sobolev
inequalities, which can be used to prove concentration inequalities for nonlinear
functions of independent random variables [van14, §3].
Interpolation is a very effective tool. Another relatively accessible application uses
interpolation to prove comparison inequalities for Gaussian processes, which can be
used to prove sharp bounds on the expected spectral norm of a standard Gaussian
random matrix [Ver18, §§7.2–7.3]. An advanced application of these ideas can be
used to develop matrix concentration inequalities which sometimes circumvent the
logarithmic factor we obtained in (8.9); see [BBv21].
The Hermitian dilation has an old history in matrix analysis and operator theory.
Jordan first used the Hermitian dilation in 1874 in his discovery of the singular value
decomposition. (The singular value decomposition was also discovered independently
the year prior by Beltrami.) The Hermitian dilation was popularized by Wielandt
and is often referred to as the Jordan–Wielandt matrix for this reason. See [SS90,
pp. 34–35] for a discussion of this history.

Lecture bibliography
[BCL94] K. Ball, E. A. Carlen, and E. H. Lieb. “Sharp Uniform Convexity and Smoothness
Inequalities for Trace Norms”. In: Invent Math 115.1 (Dec. 1994), pages 463–482.
doi: 10.1007/BF01231769.
[BBv21] A. S. Bandeira, M. T. Boedihardjo, and R. van Handel. “Matrix Concentration
Inequalities and Free Probability”. In: arXiv:2108.06312 [math] (Aug. 2021). arXiv:
2108.06312 [math].
[CB21] C.-F. Chen and F. G. S. L. Brandão. “Concentration for Trotter Error”. Available
at https://arXiv.org/abs/2111.05324. Nov. 2021. arXiv: 2111.05324
[math-ph, physics:quant-ph].
[Hua+21] D. Huang et al. “Matrix Concentration for Products”. In: Foundations of Computa-
tional Mathematics (2021), pages 1–33.
[Lin63] J. Lindenstrauss. “On the Modulus of Smoothness and Divergent Series in Banach
Spaces.” In: Michigan Mathematical Journal 10.3 (1963), pages 241–252.
[Nao12] A. Naor. “On the Banach-Space-Valued Azuma Inequality and Small-Set Isoperime-
try of Alon–Roichman Graphs”. In: Combinatorics, Probability and Computing 21.4
(July 2012), pages 623–634. doi: 10.1017/S0963548311000757.
[RX16] É. Ricard and Q. Xu. “A Noncommutative Martingale Convexity Inequality”. In: The
Annals of Probability 44.2 (Mar. 2016), pages 867–882. doi: 10.1214/14-AOP990.
[SS90] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1st Edition. Academic
Press, 1990.

[Tom74] N. Tomczak-Jaegermann. “The Moduli of Smoothness and Convexity and the


Rademacher Averages of the Trace Classes 𝑆𝑝 (1 ≤ 𝑝 < ∞)”. In: Studia Mathe-
matica 50.2 (1974), pages 163–182.
[Tro15] J. A. Tropp. “An Introduction to Matrix Concentration Inequalities”. In: Foundations
and Trends in Machine Learning 8.1-2 (2015), pages 1–230.
[van14] R. van Handel. Probability in High Dimension. Technical report. Princeton University,
June 2014. doi: 10.21236/ADA623999.
[Ver18] R. Vershynin. High-dimensional probability. An introduction with applications in
data science, With a foreword by Sara van de Geer. Cambridge University Press,
Cambridge, 2018. doi: 10.1017/9781108231596.
9. Additive Perturbation Theory

Date: 1 February 2022   Scribe: Salvador Rey Gomez De La Cruz

Agenda:
1. Poincaré theorem
2. Courant–Fischer–Weyl minimax principle
3. Weyl monotonicity principle
4. Lidskii theorem

Matrix perturbation theory studies how a function of a matrix changes as the matrix
changes. In particular, we may ask how the eigenvalues of an Hermitian matrix change
under additive perturbation. That is, for Hermitian matrices 𝑨, 𝑬 of the same size,
how do the eigenvalues of the perturbed matrix 𝑨 + 𝑬 compare with the eigenvalues
of 𝑨 ? Today, we will develop Lidskii’s theorem, a result that provides a very detailed
answer to this question. This is one of the most important applications of majorization
theory in matrix analysis.

9.1 Variational principles


The first step toward Lidskii’s theorem is a set of variational principles, which describe
the eigenvalues of a matrix as solutions to optimization problems. Our first result is
due to Poincaré. This theorem gives upper and lower bounds on eigenvalues in terms of
quadratic forms. Afterward, we will see that these bounds lead to exact representations
of the eigenvalues.

Theorem 9.1 (Poincaré). Let 𝑨 ∈ ℍ𝑛 , and let L ⊆ ℂ𝑛 be a subspace with dimension
𝑘 . There are unit vectors 𝒙 , 𝒚 ∈ L such that

    ⟨𝒙 , 𝑨𝒙⟩ ≤ 𝜆𝑘↓ (𝑨)   and   ⟨𝒚 , 𝑨𝒚⟩ ≥ 𝜆𝑘↑ (𝑨).

Aside: The larger the dimension of the subspace L, the smaller the quadratic form can
be. Similarly, the larger the dimension, the larger the quadratic form can be.


Proof. The proof relies on dimension counting. Let (𝜆𝑖 , 𝒖 𝑖 ) be the eigenpairs of 𝑨 .
The set (𝒖 𝑖 : 𝑖 = 1, . . . 𝑛) comprises an orthonormal basis of ℂ𝑛 . In other words,
h𝒖 𝑖 , 𝒖 𝑗 i = 𝛿𝑖 𝑗 .
Consider the subspace S = span {𝒖 𝑘 , . . . , 𝒖 𝑛 }. Observe that dim L + dim S = 𝑛 + 1.
Since the total dimension is greater than 𝑛 , the intersection L ∩ S is a nontrivial
subspace. Since the subspace is nontrivial, it contains a unit-norm vector 𝒙 ∈ L ∩ S.
We may express this vector in the distinguished basis for the subspace:
    𝒙 = Σ_{𝑖=𝑘}^𝑛 𝛼_𝑖 𝒖_𝑖   where 𝛼_𝑖 ∈ ℂ.

Because 𝒙 is a unit vector and the basis is orthonormal,


    1 = ‖𝒙‖^2 = Σ_{𝑖=𝑘}^𝑛 |𝛼_𝑖|^2 .

Because 𝒖 𝑖 are eigenvectors of 𝑨 , we find that


    𝑨𝒙 = Σ_{𝑖=𝑘}^𝑛 𝛼_𝑖 𝜆𝑖↓ 𝒖_𝑖 .

We may now evaluate the quadratic form ⟨𝒙 , 𝑨𝒙⟩. Indeed,

    ⟨𝒙 , 𝑨𝒙⟩ = Σ_{𝑖=𝑘}^𝑛 Σ_{𝑗=𝑘}^𝑛 ⟨𝛼_𝑗 𝒖_𝑗 , 𝛼_𝑖 𝜆𝑖↓ 𝒖_𝑖⟩
             = Σ_{𝑗=𝑘}^𝑛 |𝛼_𝑗|^2 𝜆𝑗↓
             ≤ 𝜆𝑘↓ Σ_{𝑗=𝑘}^𝑛 |𝛼_𝑗|^2 = 𝜆𝑘↓ .


The final inequality comes from the fact that the eigenvalues 𝜆 𝑗 are arranged in
descending order. This proves the first inequality in the statement of the theorem.
In order to prove the second inequality, we simply replace 𝑨 with −𝑨 . This
operation yields a unit vector 𝒙 ∈ L with the property that

    −⟨𝒙 , 𝑨𝒙⟩ ≤ 𝜆𝑘↓ (−𝑨) = −𝜆𝑘↑ (𝑨).

In the first line, we used the fact that the eigenvalues of −𝑨 are the negatives of the eigenvalues
of 𝑨 . Because of the change in sign, the order reverses. This is the second statement
in the theorem. 
Theorem 9.1 yields a corollary that provides a minimax representation for the
ordered eigenvalues.
Corollary 9.2 (Courant–Fischer–Weyl minimax principle). Let 𝑨 ∈ ℍ𝑛 . Then

    𝜆𝑘↓ (𝑨) = max_{L ⊆ ℂ^𝑛, dim L = 𝑘}  min_{𝒙 ∈ L, ‖𝒙‖_2 = 1} ⟨𝒙 , 𝑨𝒙⟩
            = min_{S ⊆ ℂ^𝑛, dim S = 𝑛−𝑘+1}  max_{𝒚 ∈ S, ‖𝒚‖_2 = 1} ⟨𝒚 , 𝑨𝒚⟩ .

This result represents eigenvalues as quadratic forms. Let us emphasize that these
expressions are linear in the matrix 𝑨 .

Proof. As before, let (𝜆_𝑖 (𝑨), 𝒖_𝑖) denote the eigenpairs of 𝑨 . By Theorem 9.1, each
subspace L ⊆ ℂ𝑛 with dimension 𝑘 contains a unit vector 𝒙 ∈ L such that

    𝜆𝑘↓ (𝑨) ≥ ⟨𝒙 , 𝑨𝒙⟩ ≥ min_{𝒚 ∈ L, ‖𝒚‖_2 = 1} ⟨𝒚 , 𝑨𝒚⟩ .

By choosing L = span {𝒖_1 , . . . , 𝒖_𝑘}, we see that equality can be attained.
For the second statement in Corollary 9.2, apply the first statement to the matrix −𝑨
instead of 𝑨 . Recall that 𝜆𝑘↓ (−𝑨) = −𝜆_{𝑛−𝑘+1}↓ (𝑨) and that min(−𝑎) = − max(𝑎) . 
A direct consequence of Corollary 9.2 is the Rayleigh–Ritz theorem, a fundamental
result with wide impact in computational mathematics. In particular, it offers a way of
approaching eigenvalue estimates by means of optimization tools.
Corollary 9.3 (Rayleigh–Ritz). Let 𝑨 ∈ ℍ𝑛 . Then

    𝜆1↓ (𝑨) = max_{‖𝒙‖_2 = 1} ⟨𝒙 , 𝑨𝒙⟩ ,   and
    𝜆𝑛↓ (𝑨) = min_{‖𝒚‖_2 = 1} ⟨𝒚 , 𝑨𝒚⟩ .

Proof. To prove this result, simply let 𝑘 = 𝑛 and 𝑘 = 1 in the first and second statement
of Corollary 9.2, respectively. 
Corollary 9.3 has some striking implications:

• The maximum eigenvalue map 𝑨 ↦→ 𝜆 max (𝑨) is a convex function on ℍ𝑛 .


• The minimum eigenvalue map 𝑨 ↦→ 𝜆 min (𝑨) is a concave function on ℍ𝑛 .

Indeed, we observe that the maximum eigenvalue is the maximum of linear functions
of the matrix, so is a convex function. Likewise, the minimum eigenvalue is a minimum
of linear functions, so is concave.
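
Aside: (Numerical illustration). The convexity of the maximum eigenvalue map can be
observed directly. The following Python sketch, assuming NumPy, checks midpoint
convexity on random pairs of symmetric test matrices; the matrices and dimensions are
arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(5)
    def rand_herm(n):
        G = rng.standard_normal((n, n))
        return (G + G.T) / 2

    lam_max = lambda A: np.linalg.eigvalsh(A)[-1]   # largest eigenvalue
    for _ in range(1000):
        A, B = rand_herm(6), rand_herm(6)
        # Midpoint convexity of A -> lambda_max(A).
        assert lam_max((A + B) / 2) <= (lam_max(A) + lam_max(B)) / 2 + 1e-12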
Exercise 9.4 (Spectral condition number convex cone). For 𝑐 ≥ 1, consider the set
K(𝑐) := {𝑨 ∈ ℍ𝑛 : 𝜅_2 (𝑨) ≤ 𝑐} of matrices whose spectral condition number is bounded by
𝑐 . Show that K(𝑐) is a closed convex cone. Recall that the spectral condition number is
𝜅_2 (𝑨) := 𝜆_max (𝑨)/𝜆_min (𝑨) for each Hermitian matrix 𝑨 .
Exercise 9.5 (Rayleigh–Ritz via diagonalization). Prove Corollary 9.3 by diagonalizing 𝑨
and computing the quadratic form explicitly.
Exercise 9.6 (Rayleigh quotient). Recall that the Rayleigh quotient is defined as

    𝑅(𝑨 ; 𝒙) := ⟨𝒙 , 𝑨𝒙⟩ / ⟨𝒙 , 𝒙⟩   for 𝒙 ≠ 0.
Prove Corollary 9.3 by differentiating the Rayleigh quotient with respect to 𝒙 and
finding the second-order stationary points.

9.2 Weyl monotonicity principle


The first step toward analyzing additive perturbations is to study a special case where
the perturbation is a positive-semidefinite matrix. This result is called the Weyl
monotonicity principle.

9.2.1 Positive-semidefinite matrices


First, we recall the the definition of a positive-semidefinite matrix and the concept of
the positive-semidefinite order on Hermitian matrices.

Definition 9.7 (Positive semidefinite matrix). A complex Hermitian matrix 𝑨 ∈ ℍ𝑛 (ℂ)


is positive semidefinite (psd) when h𝒙 , 𝑨𝒙 i ≥ 0 for all 𝒙 ∈ ℂ𝑛 .

Definition 9.8 (Positive semidefinite order). For Hermitian matrices 𝑨, 𝑩 ∈ ℍ𝑛 , we


write 𝑩 ≽ 𝑨 when 𝑩 − 𝑨 is a positive semidefinite matrix.

9.2.2 Weyl monotonicity


With these definitions out of the way, we can give a rigorous statement of the Weyl
monotonicity principle.
Corollary 9.9 (Weyl monotonicity principle). Let 𝑨, 𝑯 ∈ ℍ𝑛 , and assume that 𝑯 is psd.
Then
𝜆𝑘↓ (𝑨 + 𝑯 ) ≥ 𝜆𝑘↓ (𝑨) for each 𝑘 = 1, . . . , 𝑛 .
Equivalently, 𝑩 ≽ 𝑨 implies that 𝜆𝑘↓ (𝑩) ≥ 𝜆𝑘↓ (𝑨) for each index 𝑘 .

Warning 9.10 (Eigenvalue increase). The converse of the Weyl monotonicity principle
is not true! The conditions 𝜆𝑘↓ (𝑩) ≥ 𝜆𝑘↓ (𝑨) do not imply that 𝑩 ≽ 𝑨 . 

Proof. Let S_{𝑛−𝑘+1} be the subspace that represents 𝜆𝑘↓ (𝑨 + 𝑯) in the second (minimax)
relation in Corollary 9.2. Then

    𝜆𝑘↓ (𝑨 + 𝑯) = max_{𝒚 ∈ S_{𝑛−𝑘+1}, ‖𝒚‖_2 = 1} ⟨𝒚 , (𝑨 + 𝑯)𝒚⟩ = max_{𝒚 ∈ S_{𝑛−𝑘+1}, ‖𝒚‖_2 = 1} ( ⟨𝒚 , 𝑨𝒚⟩ + ⟨𝒚 , 𝑯𝒚⟩ )
                ≥ max_{𝒚 ∈ S_{𝑛−𝑘+1}, ‖𝒚‖_2 = 1} ⟨𝒚 , 𝑨𝒚⟩
                ≥ min_{S ⊆ ℂ^𝑛, dim S = 𝑛−𝑘+1}  max_{𝒚 ∈ S, ‖𝒚‖_2 = 1} ⟨𝒚 , 𝑨𝒚⟩
                = 𝜆𝑘↓ (𝑨).
The first inequality is valid because 𝑯 is psd. The second inequality occurs because
S_{𝑛−𝑘+1} represents 𝜆𝑘↓ (𝑨 + 𝑯) via the minimax principle, so the minimum over all
subspaces S ⊆ ℂ^𝑛 with dimension 𝑛 − 𝑘 + 1 has to be smaller. The final equality
follows from Corollary 9.2.
In order to prove the equivalence, simply note that 𝑩 ≽ 𝑨 if and only if 𝑯 := 𝑩 − 𝑨 ≽ 0.
Then invoke the first result. 
Exercise 9.11 (Eigenvalue increase). Consider the family of relations 𝜆𝑘 (𝑩) ≥ 𝜆𝑘 (𝑨) for
each index 𝑘 . Show that these relations do not determine a partial order on Hermitian
matrices. Hint: Equality of eigenvalues does not imply equality of matrices.

9.3 The Lidskii theorem


Corollary 9.9 states that perturbation of an Hermitian matrix by a psd matrix increases
each of the eigenvalues. Lidskii’s theorem describes how the eigenvalues change under
an arbitrary Hermitian perturbation.

Theorem 9.12 (Lidskii). Let 𝑨, 𝑬 ∈ ℍ𝑛 . Let 1 ≤ 𝑖_1 < 𝑖_2 < · · · < 𝑖_𝑘 ≤ 𝑛 be distinct
indices. Then

    Σ_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗}↓ (𝑨 + 𝑬) − 𝜆_{𝑖_𝑗}↓ (𝑨) ] ≤ Σ_{𝑗=1}^𝑘 𝜆𝑗↓ (𝑬) .

We give a simple proof of Lidskii’s theorem due to Li & Mathias [LM99]. This
argument reduces the full result to two applications of the Weyl monotonicity principle.
See [Bha97] for three alternative proofs of Lidskii’s theorem.

Proof. Fix a number 𝑘 , and choose indices 𝑖_1 < 𝑖_2 < · · · < 𝑖_𝑘 . Without loss of

generality, we may assume that 𝜆𝑘 (𝑬 ) = 0. If not, we simply replace 𝑬 with
𝑬 − 𝜆𝑘↓ (𝑬 ) I. Indeed, this transformation reduces the 𝑘 th eigenvalue of 𝑬 to zero, and

it shifts both sides of the inequality by an equal amount, namely −𝑘𝜆𝑘 (𝑬 ) .
Next, introduce the Jordan decomposition 𝑬 = 𝑬 + − 𝑬 − , where the summands 𝑬 +
and 𝑬 − are commuting psd matrices. Indeed, 𝑬 admits a spectral representation
    𝑬 = Σ_{𝑗=1}^𝑛 𝜆𝑗↓ 𝑷_𝑗 .

The elements 𝑬_± of the Jordan decomposition are the matrices

    𝑬_+ = Σ_{𝑗=1}^𝑛 (𝜆𝑗↓)_+ 𝑷_𝑗   and   𝑬_− = Σ_{𝑗=1}^𝑛 (𝜆𝑗↓)_− 𝑷_𝑗 .

The two functions (𝑎)± B max{±𝑎, 0} return the positive (resp., negative) part of a
number 𝑎 ∈ ℝ. Note that 𝑬 ≼ 𝑬_+ and so 𝑨 + 𝑬 ≼ 𝑨 + 𝑬_+ .
We make the following calculation, invoking the Weyl monotonicity principle twice.
    Σ_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗}↓ (𝑨 + 𝑬) − 𝜆_{𝑖_𝑗}↓ (𝑨) ] ≤ Σ_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗}↓ (𝑨 + 𝑬_+) − 𝜆_{𝑖_𝑗}↓ (𝑨) ]
                                                ≤ Σ_{𝑗=1}^𝑛 [ 𝜆𝑗↓ (𝑨 + 𝑬_+) − 𝜆𝑗↓ (𝑨) ]
                                                = tr (𝑨 + 𝑬_+) − tr (𝑨) = tr (𝑬_+)
                                                = Σ_{𝑗=1}^𝑛 𝜆𝑗↓ (𝑬_+)
                                                = Σ_{𝑗=1}^𝑘 𝜆𝑗↓ (𝑬) .

The first inequality follows from Corollary 9.9 because 𝑨 + 𝑬 ≼ 𝑨 + 𝑬_+ . The second
inequality introduces the missing indices; it relies on Corollary 9.9 and the fact that
𝑨 ≼ 𝑨 + 𝑬_+ . Next, we recall that the trace is the sum of the eigenvalues, and we rely on
the fact that the trace is linear. Finally, note that 𝜆𝑘↓ (𝑬) = 0. Therefore, 𝜆𝑗↓ (𝑬_+) = 0 for
𝑗 ≥ 𝑘 . Meanwhile, the largest 𝑘 eigenvalues of 𝑬_+ and 𝑬 coincide. This observation
completes the proof. 

9.4 Consequences of Lidskii’s theorem


Theorem 9.12 has many remarkable applications. In this section, we will explore a few
of the immediate consequences.

9.4.1 Majorization relations


First, let us rewrite Lidskii’s theorem as a majorization relation. If we choose 𝑖 𝑗 = 𝑗 for
each 𝑗 = 1, . . . , 𝑘 , then the following majorization relation arises.

𝝀 ↓ (𝑨 + 𝑬 ) − 𝝀 ↓ (𝑨) ≺ 𝝀 ↓ (𝑬 ). (9.1)

In the majorization relation, the trace equality follows from the linearity of the trace
and the fact that tr (𝑨 + 𝑬) = Σ_{𝑗=1}^𝑛 𝜆_𝑗 (𝑨 + 𝑬) . In particular, we may consider the
𝑘 = 1 case of the majorization relation:

𝜆 1↓ (𝑨 + 𝑬 ) − 𝜆 1↓ (𝑨) ≤ 𝜆1↓ (𝑬 ).

This gives an elegant additive bound for the largest eigenvalue.


From the majorization relation (9.1), we may perform two changes of variables to
arrive at a pair of relations

𝝀 ↓ (𝑩) − 𝝀 ↓ (𝑨) ≺ 𝝀 ↓ (𝑩 − 𝑨) ≺ 𝝀 ↓ (𝑩) − 𝝀 ↑ (𝑨).

The first majorization relation is accomplished by rewriting (9.1) with 𝑩 = 𝑨 + 𝑬 . The


second majorization relation follows upon rearranging (9.1), setting 𝑩 = 𝑬 , changing
the sign of 𝑨 , and noting that 𝜆 ↓ (−𝑨) = −𝜆 ↑ (𝑨) . By another change of variables in
the last display, we arrive at the equivalent relation

𝝀 ↓ (𝑩) + 𝝀 ↑ (𝑨) ≺ 𝝀 ↓ (𝑩 + 𝑨) ≺ 𝝀 ↓ (𝑩) + 𝝀 ↓ (𝑨).

These two pairs of majorization relations give detailed information about how the
eigenvalues behave when we add or subtract Hermitian matrices.

9.4.2 Norm comparisons


The appearance of majorization relations allows us to activate tools from our study
of majorization to derive further results. In particular, we may use symmetric gauge
functions to summarize how the eigenvalues change under additive perturbation.
Corollary 9.13 (Eigenvalue norm comparisons). Let Φ : ℝ𝑛 → ℝ+ be a symmetric gauge
function. Then
Φ(𝝀 ↓ (𝑨 + 𝑬 ) − 𝝀 ↓ (𝑨)) ≤ Φ(𝝀 ↓ (𝑬 ))
Proof. Each symmetric gauge function is isotone. For any vectors 𝒙 , 𝒚 ∈ ℝ𝑛 such that
𝒙 ≺ 𝒚 , we have Φ(𝒙 ) ≤ Φ(𝒚 ) because of the isotonicity. Apply this result to (9.1). 
Example 9.14 (Weyl perturbation theorem). Consider the symmetric gauge function Φ(·) =
‖·‖_∞ . One obtains the inequality

    max_𝑗 |𝜆𝑗↓ (𝑨 + 𝑬) − 𝜆𝑗↓ (𝑨)| ≤ max_𝑗 |𝜆𝑗↓ (𝑬)| = ‖𝑬‖_∞ .

In other words, the maximum change in each eigenvalue is controlled by the spectral-
norm difference between the two matrices. This last relation implies that the eigenvalue
function

    𝝀↓ : (ℍ𝑛 , 𝑆_∞) → (ℝ^𝑛 , ℓ_∞)

is a 1-Lipschitz map between two metric spaces. In particular, 𝝀↓ is a continuous
function on the space of Hermitian matrices.
It is sometimes more convenient to make another change of variables. One obtains

    max_𝑗 |𝜆𝑗↓ (𝑩) − 𝜆𝑗↓ (𝑨)| ≤ ‖𝑩 − 𝑨‖_∞ .

This statement describes how the eigenvalues of two Hermitian matrices reflect the
difference between the two matrices. 

Aside: The eigenvalues of a general 𝑛 × 𝑛 matrix also change continuously. Nevertheless,
this statement requires some care because there is no natural way to order the eigenvalues.
Furthermore, the proofs involve tools from algebra or complex analysis.
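
Aside: (Numerical check). The Weyl perturbation bound and the Lidskii majorization
relation (9.1) are easy to verify on random examples. The following Python sketch,
assuming NumPy, does so for an arbitrary pair of random symmetric test matrices.

    import numpy as np

    rng = np.random.default_rng(6)
    def rand_herm(n):
        G = rng.standard_normal((n, n))
        return (G + G.T) / 2

    n = 7
    A, E = rand_herm(n), rand_herm(n)
    lam = lambda M: np.sort(np.linalg.eigvalsh(M))[::-1]    # eigenvalues, decreasing

    diff = lam(A + E) - lam(A)
    # Weyl perturbation: max_j |lambda_j(A+E) - lambda_j(A)| <= ||E|| (spectral norm).
    assert np.max(np.abs(diff)) <= np.linalg.norm(E, 2) + 1e-12
    # Majorization (9.1): sorted partial sums of diff are dominated by those of lam(E) ...
    diff_sorted = np.sort(diff)[::-1]
    for k in range(1, n + 1):
        assert np.sum(diff_sorted[:k]) <= np.sum(lam(E)[:k]) + 1e-12
    # ... with equality of the total sums (both equal tr E).
    assert np.isclose(np.sum(diff), np.trace(E))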
Example 9.15 (Hoffman–Wielandt theorem). Consider the symmetric gauge function
Φ(·) = ‖·‖_2 . Then

    Σ_{𝑗=1}^𝑛 |𝜆𝑗↓ (𝑨 + 𝑬) − 𝜆𝑗↓ (𝑨)|^2 ≤ Σ_{𝑗=1}^𝑛 |𝜆𝑗↓ (𝑬)|^2 = ‖𝑬‖_F^2 .

This implies that the eigenvalue map

    𝝀↓ : (ℍ𝑛 , 𝑆_2) → (ℝ^𝑛 , ℓ_2)

is 1-Lipschitz between two Euclidean spaces. Equivalently, by a change of variables,

    Σ_{𝑗=1}^𝑛 |𝜆𝑗↓ (𝑩) − 𝜆𝑗↓ (𝑨)|^2 ≤ ‖𝑩 − 𝑨‖_F^2 .

This is the form in which the result is usually presented. 

Aside: The Hoffman–Wielandt theorem can be extended to normal matrices, but it requires
matching the eigenvalues of the two matrices more carefully. For details, see [Bha97,
Sec. VI and Eqn. VI.36].

9.4.3 Beyond norms
By using other isotone functions, we may derive many more results on perturbation of
eigenvalues. Here is a picayune example.
Example 9.16 (Entropy of eigenvalues). Consider two psd matrices 𝑨 and 𝑯 with the
same dimension. Then

    ent (𝝀↓ (𝑨 + 𝑯) − 𝝀↓ (𝑨)) ≥ ent (𝝀↓ (𝑯)).

Recall that the entropy of a positive vector is defined as ent (𝒙) := − Σ_𝑗 𝑥_𝑗 log 𝑥_𝑗 , and
the negation of the entropy is an isotone function. 

Notes
Variational principles for eigenvalues and perturbation theory for eigenvalues are
perennial topics in matrix analysis. Good references include the books of Bhatia [Bha97;
Bha07a], Parlett [Par98], Saad [Saa11b], and Stewart & Sun [SS90]. The classic
reference is Kato’s magnum opus [Kat95].
Our approach to variational principles is drawn from [Bha97, Chap. III]. Historically,
Lidskii’s theorem was regarded as a very difficult result, but Li & Mathias [LM99] have
really cut to the heart of the matter.

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[Kat95] T. Kato. Perturbation theory for linear operators. Reprint of the 1980 edition.
Springer-Verlag, Berlin, 1995.
[LM99] C.-K. Li and R. Mathias. “The Lidskii-Mirsky-Wielandt theorem–additive and
multiplicative versions”. In: Numerische Mathematik 81.3 (1999), pages 377–413.
[Par98] B. N. Parlett. The symmetric eigenvalue problem. Corrected reprint of the 1980
original. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA,
1998. doi: 10.1137/1.9781611971163.
[Saa11b] Y. Saad. Numerical methods for large eigenvalue problems. Revised edition of the
1992 original [ 1177405]. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2011. doi: 10.1137/1.9781611970739.ch1.
[SS90] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1st Edition. Academic
Press, 1990.
10. Multiplicative Perturbation Theory

Date: 3 February 2022 Scribe: Eitan Levin

Agenda:
1. Recap of Lidskii’s theorem
2. Li–Mathias theorem
3. Consequences
4. Sylvester inertia theorem
5. Ostrowski monotonicity
6. Proof of Li–Mathias

In the last lecture, we studied how the eigenvalues of a Hermitian matrix 𝑨 change
under an additive perturbation 𝑨 + 𝑬 . This discussion culminated with Lidskii’s
theorem. In this lecture, we prove a multiplicative analogue of Lidskii’s theorem
due to Li and Mathias. They consider multiplicative perturbations of the form 𝑺^∗ 𝑨𝑺 .
The proof of the Li–Mathias theorem proceeds in parallel with their proof of Lidskii’s
theorem, which we recall below.
Throughout this lecture, eigenvalues and singular values are always sorted in
decreasing order, so we write 𝝀, 𝝈 instead of 𝝀 ↓ , 𝝈 ↓ to lighten our notation.

10.1 Recap of Lidskii’s theorem


We begin by recalling Lidskii’s theorem and the proof strategy from Lecture 9. For all
Hermitian matrices 𝑨, 𝑬 ∈ ℍ𝑛 and distinct indices 1 ≤ 𝑖 1 < . . . < 𝑖𝑘 ≤ 𝑛 , Lidskii’s
theorem states that
    Σ_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑨 + 𝑬) − 𝜆_{𝑖_𝑗} (𝑨) ] ≤ Σ_{𝑗=1}^𝑘 𝜆_𝑗 (𝑬) .   (10.1)

Equivalently, this can be stated in terms of majorization as 𝜆(𝑨 + 𝑬 ) − 𝜆(𝑨) ≺ 𝜆(𝑬 ) .


By applying isotone functions to this majorization inequality, one can obtain many
more scalar and vector inequalities.
In Lecture 9, we gave a proof of Lidskii’s theorem that proceeds along the following
lines:

1. Shift the matrix 𝑨 so that the 𝑘 th eigenvalue is zero.


2. Extract the positive part 𝑬_+ of 𝑬 , and reduce to the case 𝑬 = 𝑬_+ ≽ 0.
3. Apply Weyl’s monotonicity theorem, which states that 𝜆_𝑖 (𝑨 + 𝑬) − 𝜆_𝑖 (𝑨) ≥ 0
   for all 𝑖 = 1, . . . , 𝑛 if 𝑬 ≽ 0.
4. Bound the left-hand side of (10.1) by tr (𝑨 + 𝑬 ) − tr (𝑨) , and invoke the additivity
of the trace.

In this lecture, we prove multiplicative analogues of the above results, where differences
are replaced by ratios, positivity is replaced by expansivity, and the additivity of the
trace is replaced by the multiplicativity of the determinant.

10.2 The theorem of Li & Mathias


We consider multiplicative perturbations of a Hermitian matrix 𝑨 ∈ ℍ𝑛 of the form
𝑺 ∗ 𝑨𝑺 for 𝑺 ∈ 𝕄𝑛 , in order for the perturbed matrix to also be Hermitian. Our
main goal in this lecture is to prove the following multiplicative analogue of Lidskii’s
theorem (10.1), due to Li and Mathias [LM99, Thm. 2.3].

Theorem 10.1 (Li–Mathias). Choose 𝑨 ∈ ℍ𝑛 , and 𝑺 ∈ 𝕄𝑛 , and 𝑘 ∈ {1, . . . , 𝑛}. For
any 𝑘 distinct indices 1 ≤ 𝑖_1 < . . . < 𝑖_𝑘 ≤ 𝑛 such that 𝜆_{𝑖_𝑗} (𝑨) ≠ 0 for all 𝑗 , we have

    Π_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) / 𝜆_{𝑖_𝑗} (𝑨) ] ≤ Π_{𝑗=1}^𝑘 𝜆_𝑗 (𝑺^∗𝑺) .   (10.2)

Moreover, if 𝑘 = 𝑛 and 𝑨 is invertible, then (10.2) holds with equality.

If the multiplier 𝑺 is approximately unitary, i.e., 𝑺 ∗𝑺 ≈ I, then 𝜆 𝑗 (𝑺 ∗𝑺 ) ≈ 1 for all 𝑗 .


In this case, Theorem 10.1 shows that 𝜆(𝑺 ∗ 𝑨𝑺 ) ≈ 𝜆(𝑨) .
Before proving Theorem 10.1, we will note some of its consequences. First, it would
be desirable to express Theorem 10.1 as a majorization inequality, similarly to Lidskii’s
theorem. To that end, we would like to take the logarithm of both sides of (10.2) to
convert the product into a sum. However, this is only well-defined if each factor in the
product on the left-hand side of (10.2) is positive. To show that this is indeed the case,
at least when 𝑺 is invertible, we appeal to Sylvester’s law of inertia.

10.2.1 Sylvester’s inertia theorem


We proceed to state and prove Sylvester’s law of inertia after two preliminary definitions.

Definition 10.2 (Inertia). The inertia of a Hermitian matrix 𝑨 ∈ ℍ𝑛 is the triplet


of integers inertia (𝑨) B (𝑛 + (𝑨), 𝑛 0 (𝑨), 𝑛 − (𝑨)) which equal, respectively, the
number of positive, zero, and negative eigenvalues of 𝑨 .

Definition 10.3 (Congruence). Two Hermitian matrices 𝑨, 𝑩 ∈ ℍ𝑛 are congruent if


there exists an invertible matrix 𝑺 ∈ 𝕄𝑛 satisfying 𝑨 = 𝑺 ∗ 𝑩𝑺 .

Congruence is an equivalence relation on ℍ𝑛 . It arises from the effect of a change


of basis on the symmetric matrix representing a quadratic form. In particular, note that
𝑨 is congruent to 𝑩 if and only if 𝑩 is congruent to 𝑨 , since 𝑨 = 𝑺 ∗ 𝑩𝑺 for invertible
𝑺 ∈ 𝕄𝑛 holds if and only if 𝑩 = (𝑺 −1 ) ∗ 𝑨𝑺 −1 .
Sylvester’s theorem [Syl52] states that the inertia is invariant under congruence
transformation.
Theorem 10.4 (Sylvester’s law of inertia). Two Hermitian matrices 𝑨, 𝑩 ∈ ℍ𝑛 are
congruent if and only if inertia (𝑨) = inertia (𝑩) .

In the setting of Theorem 10.1, Theorem 10.4 implies that if 𝑺 is invertible then
sgn (𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺)) = sgn (𝜆_{𝑖_𝑗} (𝑨)) for all 𝑗 , hence each factor on the left-hand side
of (10.2) is positive. If 𝑺 is not invertible, then each factor on the left-hand side of (10.2)
is only guaranteed to be nonnegative, by continuity of each such factor in 𝑺 .

Proof of Theorem 10.4. Suppose 𝑨 and 𝑩 are congruent, so 𝑨 = 𝑺^∗ 𝑩𝑺 for some
invertible 𝑺 ∈ 𝕄𝑛 . Note that 𝑨𝒙 = 0 for 𝒙 ∈ ℂ𝑛 if and only if 𝑩 (𝑺𝒙) = 0 because

𝑺 ∗ is invertible, hence 𝑺 maps ker (𝑨) isomorphically onto ker (𝑩) . In particular,
𝑛 0 (𝑨) = dim ker (𝑨) = dim ker (𝑩) = 𝑛 0 (𝑩) .
Next, let (𝜆𝑖 , 𝒖 𝑖 ) be orthogonal eigenpairs of 𝑨 , where 𝜆𝑖 > 0 for 𝑖 = 1, . . . , 𝑛 + (𝑨)
by definition of 𝑛 + . Define L+ = lin{𝒖 1 , . . . , 𝒖 𝑛 + (𝑨) }. Any nonzero 𝒙 ∈ L+ can be
written as 𝒙 = Σ_{𝑖=1}^{𝑛_+(𝑨)} 𝛼_𝑖 𝒖_𝑖 where Σ_𝑖 |𝛼_𝑖|^2 = ‖𝒙‖^2 > 0. Hence,

    ⟨𝑺𝒙 , 𝑩 (𝑺𝒙)⟩ = ⟨𝒙 , 𝑺^∗ 𝑩𝑺 𝒙⟩ = ⟨𝒙 , 𝑨𝒙⟩ = Σ_{𝑖=1}^{𝑛_+(𝑨)} 𝜆_𝑖 |𝛼_𝑖|^2 > 0.

We conclude that h𝒚 , 𝑩𝒚 i > 0 for all nonzero 𝒚 ∈ 𝑺 L+ . Since 𝑺 is invertible and the
eigenvectors 𝒖 𝑖 are orthogonal, dim (𝑺 L+ ) = dim L+ = 𝑛 + (𝑨) . By the Courant–Fischer–
Weyl minimax principle (see Lecture 9),

    𝜆_{𝑛_+(𝑨)} (𝑩) = max_{L ⊆ 𝔽^𝑛, dim L = 𝑛_+(𝑨)}  min_{𝒚 ∈ L, ‖𝒚‖=1} ⟨𝒚 , 𝑩𝒚⟩ ≥ min_{𝒚 ∈ 𝑺 L_+, ‖𝒚‖=1} ⟨𝒚 , 𝑩𝒚⟩ > 0,

which implies 𝑛 + (𝑩) ≥ 𝑛 + (𝑨) . Interchanging the roles of 𝑨 and 𝑩 , we also get
𝑛 + (𝑨) ≥ 𝑛 + (𝑩) and thus 𝑛 + (𝑨) = 𝑛 + (𝑩) .
Finally, 𝑛 − (𝑨) = 𝑛 − 𝑛 0 (𝑨) − 𝑛 + (𝑨) = 𝑛 − (𝑩) , so we obtain inertia (𝑨) =
inertia (𝑩) .
The converse is not needed for this lecture, and is left as an exercise. 
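
Aside: (Numerical illustration). Sylvester’s law of inertia is easy to observe numerically.
The following Python sketch, assuming NumPy, checks that a random congruence preserves
the inertia of a random symmetric test matrix (an arbitrary example, not part of the notes).

    import numpy as np

    rng = np.random.default_rng(7)
    n = 6
    G = rng.standard_normal((n, n))
    A = (G + G.T) / 2
    S = rng.standard_normal((n, n))                  # generically invertible

    def inertia(M, tol=1e-10):
        lam = np.linalg.eigvalsh(M)
        return (int(np.sum(lam > tol)),
                int(np.sum(np.abs(lam) <= tol)),
                int(np.sum(lam < -tol)))

    # Congruence A -> S* A S leaves the inertia unchanged.
    assert inertia(A) == inertia(S.T @ A @ S)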
Exercise 10.5 (Sylvester inertia theorem: Converse). Prove the remaining direction in
Theorem 10.4. That is, two Hermitian matrices with the same inertia are congruent.

10.2.2 Consequences of the Li–Mathias theorem


We may now derive some consequences of Theorem 10.1. First, when 𝑺 and 𝑨 are
invertible, then Theorem 10.1 is equivalent to the majorization inequality

    ( log [ 𝜆_𝑗 (𝑺^∗ 𝑨𝑺) / 𝜆_𝑗 (𝑨) ] )_{𝑗=1}^𝑛 ≺ ( log 𝜆_𝑗 (𝑺^∗𝑺) )_{𝑗=1}^𝑛 .   (10.3)

For a positive-definite matrix 𝑨 ≻ 0, we can further rewrite this expression in the form
log 𝜆(𝑺 ∗ 𝑨𝑺 ) − log 𝜆(𝑨) ≺ log 𝜆(𝑺 ∗𝑺 ) .
Just as in the additive case, we can obtain many scalar inequalities from (10.3) by
applying isotone functions to both sides. For example,

𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 )
max 𝑗 =1,...,𝑛 log ≤ max𝑗 =1,...,𝑛 | log 𝜆 𝑗 (𝑺 ∗𝑺 )| = k𝑺 k 2 ,
𝜆 𝑗 (𝑨)
∑︁𝑛 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) ∑︁𝑛
≤ 𝜆 𝑗 (𝑺 ∗𝑺 ) = k𝑺 k 𝐹2 .
𝑗 =1 𝜆 𝑗 (𝑨) 𝑗 =1

Theorem 10.1 also implies multiplicative perturbation bounds for singular values.
Corollary 10.6 (Li–Mathias for singular values). Choose 𝑺 ,𝑻 ∈ 𝕄𝑛 and 𝑘 ∈ {1, . . . , 𝑛}. For
any distinct indices 1 ≤ 𝑖 1 < . . . < 𝑖𝑘 ≤ 𝑛 such that 𝜎𝑖 𝑗 (𝑻 ) > 0 for all 𝑗 , we have

    Π_{𝑗=1}^𝑘 [ 𝜎_{𝑖_𝑗} (𝑻𝑺) / 𝜎_{𝑖_𝑗} (𝑻) ] ≤ Π_{𝑗=1}^𝑘 𝜎_𝑗 (𝑺) .

Moreover, if 𝑘 = 𝑛 and 𝑻 is invertible then the above inequality holds with equality.

Proof. Set 𝑨 = 𝑻 ∗𝑻 in Theorem 10.1. Take the square roots of both sides of (10.2). 
Once again, Corollary 10.6 can be restated in terms of majorization. This leads to a
classic result from operator theory.
Corollary 10.7 (Gel’fand–Naimark). If 𝑺 ,𝑻 ∈ 𝕄𝑛 are invertible, then

log 𝜎 (𝑻 𝑺 ) − log 𝜎 (𝑻 ) ≺ log 𝜎 (𝑺 ).

As before, we can draw many further consequences by applying isotone functions.
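
Aside: (Numerical check). The Gel’fand–Naimark majorization in Corollary 10.7 can be
checked directly. The following Python sketch, assuming NumPy, uses random (generically
invertible) test matrices; the sizes and the shift by the identity are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 6
    S = rng.standard_normal((n, n)) + np.eye(n)     # generically invertible
    T = rng.standard_normal((n, n)) + np.eye(n)

    logsv = lambda M: np.log(np.linalg.svd(M, compute_uv=False))  # decreasing order
    x = logsv(T @ S) - logsv(T)
    y = logsv(S)
    x_sorted = np.sort(x)[::-1]
    # Majorization: partial sums of x (sorted decreasingly) are dominated by those of y ...
    for k in range(1, n + 1):
        assert np.sum(x_sorted[:k]) <= np.sum(y[:k]) + 1e-10
    # ... and the total sums agree (both equal log |det S|).
    assert np.isclose(np.sum(x), np.sum(y))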



10.3 Ostrowski monotonicity


To prove Theorem 10.1, we need a multiplicative analogue of the Weyl monotonicity
theorem, which played a core role in the proof of Lidskii’s theorem. Weyl monotonicity
shows that additive perturbation by a positive semidefinite matrix increases all the
eigenvalues. Analogously, multiplicative perturbation by an expansive matrix stretches
all the eigenvalues:

Theorem 10.8 (Ostrowski monotonicity). Suppose 𝑺 ∈ 𝕄𝑛 satisfies 𝑺^∗𝑺 ≽ I. For any
𝑨 ∈ ℍ𝑛 and any 𝑗 ∈ {1, . . . , 𝑛} such that 𝜆_𝑗 (𝑨) ≠ 0, we have

    𝜆_𝑗 (𝑺^∗ 𝑨𝑺) / 𝜆_𝑗 (𝑨) ≥ 1.

Proof. The hypothesis 𝑺^∗𝑺 ≽ I implies that 𝜆_𝑛 (𝑺^∗𝑺) ≥ 1 and, in particular, that 𝑺 is
invertible.
To begin, note that 𝜆 𝑗 (𝑨 − 𝜆 𝑗 (𝑨) I) = 0. Sylvester’s law of inertia (Theorem 10.4)
then implies

0 = 𝜆 𝑗 (𝑺 ∗ (𝑨 − 𝜆 𝑗 (𝑨) I)𝑺 ) = 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 − 𝜆 𝑗 (𝑨)𝑺 ∗𝑺 ).

If 𝜆 𝑗 (𝑨) > 0, then the upper bound in Weyl’s perturbation inequality (see Lecture 9)
gives

0 = 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 − 𝜆 𝑗 (𝑨)𝑺 ∗𝑺 ) ≤ 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) + 𝜆 1 (−𝜆 𝑗 (𝑨)𝑺 ∗𝑺 )


= 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) − 𝜆 𝑗 (𝑨)𝜆𝑛 (𝑺 ∗𝑺 ).

Similarly, if 𝜆 𝑗 (𝑨) < 0 then the lower bound in Weyl’s inequality gives

0 = 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 − 𝜆 𝑗 (𝑨)𝑺 ∗𝑺 ) ≥ 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) + 𝜆𝑛 (−𝜆 𝑗 (𝑨)𝑺 ∗𝑺 )


= 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) − 𝜆 𝑗 (𝑨)𝜆𝑛 (𝑺 ∗𝑺 ).

Dividing by 𝜆 𝑗 (𝑨) and using the fact 𝜆𝑛 (𝑺 ∗𝑺 ) ≥ 1, we arrive at the desired result. 
Exercise 10.9 (Ostrowski: Quantitative version). Prove the following strengthening of
Theorem 10.8. For any invertible 𝑺 ∈ 𝕄𝑛 and any 𝑨 ∈ ℍ𝑛 , we have 𝜆 𝑗 (𝑺 ∗ 𝑨𝑺 ) =
𝜃 𝑗 𝜆 𝑗 (𝑨) for some 𝜃 𝑗 ∈ [𝜆𝑛 (𝑺 ∗𝑺 ), 𝜆1 (𝑺 ∗𝑺 )] .
This relation can be viewed as a quantitative version of Sylvester’s law of inertia,
which is how Ostrowski originally presented this result in [Ost59].

10.4 Proof of the Li–Mathias theorem


We are now ready to prove Theorem 10.1 by substituting multiplicative analogues for
their additive counterparts in the proof of Lidskii’s theorem (after a few reductions
specific to the multiplicative case).

Proof of Theorem 10.1. If 𝑘 = 𝑛 and 𝑨 is invertible, then


    Π_{𝑗=1}^𝑛 [ 𝜆_𝑗 (𝑺^∗ 𝑨𝑺) / 𝜆_𝑗 (𝑨) ] = det (𝑺^∗ 𝑨𝑺) / det (𝑨) = det (𝑺^∗) det (𝑨) det (𝑺) / det (𝑨) = det (𝑺^∗𝑺),

where we used the multiplicativity of the determinant. It remains to prove the


inequality (10.2).
Lecture 10: Multiplicative Perturbation Theory 85

We begin with a series of reductions. First, we may assume both 𝑨 and 𝑺 are
invertible because both sides of the inequality (10.2) are continuous at (𝑨, 𝑺) whenever
𝜆_{𝑖_𝑗} (𝑨) ≠ 0 for all 𝑗 , and any square matrix is a limit of invertible matrices. (Here we
use the continuity of the eigenvalues, which follows from Weyl’s perturbation inequality
in Lecture 9.)
Second, let 𝑺 = 𝑼 𝚺𝑽 ∗ be an SVD of 𝑺 , where 𝚺 = diag (𝑠 1 , . . . , 𝑠𝑛 ) is the diagonal
matrix of singular values of 𝑺 , arranged in descending order. We may then assume
𝑺 = 𝚺 because eigenvalues are invariant under conjugation by unitary matrices, so
 

    𝜆_𝑗 (𝑺^∗ 𝑨𝑺) / 𝜆_𝑗 (𝑨) = 𝜆_𝑗 ( 𝑽 [𝚺^∗ (𝑼^∗ 𝑨𝑼 )𝚺] 𝑽^∗ ) / 𝜆_𝑗 (𝑨) = 𝜆_𝑗 ( 𝚺^∗ (𝑼^∗ 𝑨𝑼 )𝚺 ) / 𝜆_𝑗 (𝑼^∗ 𝑨𝑼 ) .
Thus, if we can prove the result for 𝑺 = 𝚺 and arbitrary invertible 𝑨 , then after
changing variables 𝑨 ↦→ 𝑼 ∗ 𝑨𝑼 we obtain the result for the original 𝑺 and arbitrary
invertible 𝑨 .
Third, we may assume 𝑠_𝑘 = 1. Indeed, if we replace 𝑺 by 𝑠_𝑘^{−1} 𝑺 in the inequal-
ity (10.2), then both sides are scaled by 𝑠_𝑘^{−2𝑘} . Note that 𝑠_𝑘 > 0 because 𝑺 is invertible.
To summarize, it suffices to prove (10.2) for invertible 𝑨 and 𝑺 = diag (𝑠_1 , . . . , 𝑠_𝑛)
where 𝑠_1 ≥ . . . ≥ 𝑠_𝑘 = 1 ≥ · · · ≥ 𝑠_𝑛 > 0. To that end, define the matrix

    𝑺̂ = diag (𝑠_1 , . . . , 𝑠_𝑘 , 1, . . . , 1)   (with 𝑛 − 𝑘 trailing ones).

Aside: 𝑺̂ is the “expansive part” of 𝑺 , analogous to the positive part 𝑬_+ of an additive
perturbation 𝑬 .

Observe that

    I ≼ (𝑺^{−1}𝑺̂)^∗ (𝑺^{−1}𝑺̂) = diag (1, . . . , 1, 𝑠_{𝑘+1}^{−2} , . . . , 𝑠_𝑛^{−2}) .
Now, Ostrowski monotonicity gives

    1 ≤ 𝜆_{𝑖_𝑗} ( (𝑺^{−1}𝑺̂)^∗ (𝑺^∗ 𝑨𝑺) (𝑺^{−1}𝑺̂) ) / 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) = 𝜆_{𝑖_𝑗} (𝑺̂^∗ 𝑨𝑺̂) / 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) .
Because 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺)/𝜆_{𝑖_𝑗} (𝑨) > 0 by Sylvester’s law of inertia, we deduce that

    Π_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) / 𝜆_{𝑖_𝑗} (𝑨) ] ≤ Π_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) / 𝜆_{𝑖_𝑗} (𝑨) ] · [ 𝜆_{𝑖_𝑗} (𝑺̂^∗ 𝑨𝑺̂) / 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) ]
                                               = Π_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑺̂^∗ 𝑨𝑺̂) / 𝜆_{𝑖_𝑗} (𝑨) ] .

Since 𝑺̂^∗𝑺̂ ≽ I, Ostrowski monotonicity also gives 𝜆_𝑗 (𝑺̂^∗ 𝑨𝑺̂)/𝜆_𝑗 (𝑨) ≥ 1 for all 𝑗 .
Therefore,

    Π_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑺̂^∗ 𝑨𝑺̂) / 𝜆_{𝑖_𝑗} (𝑨) ] ≤ Π_{𝑗=1}^𝑛 [ 𝜆_𝑗 (𝑺̂^∗ 𝑨𝑺̂) / 𝜆_𝑗 (𝑨) ] = det (𝑺̂^∗ 𝑨𝑺̂) / det (𝑨) = det (𝑺̂^∗𝑺̂)
                                                 = Π_{𝑗=1}^𝑘 𝑠_𝑗^2 = Π_{𝑗=1}^𝑘 𝜆_𝑗 (𝑺^∗𝑺) ,

where we used the multiplicativity of the determinant similarly to the proof of the
equality case above. This establishes the desired inequality (10.2). 
Exercise 10.10 (Li–Mathias: Lower bound). In the setting of Theorem 10.1, prove the lower
bound
    Π_{𝑗=1}^𝑘 [ 𝜆_{𝑖_𝑗} (𝑺^∗ 𝑨𝑺) / 𝜆_{𝑖_𝑗} (𝑨) ] ≥ Π_{𝑗=1}^𝑘 𝜆_{𝑛−𝑗+1} (𝑺^∗𝑺) .

Note that this implies a corresponding lower bound in Corollary 10.6. Hint: First
assume 𝑺 is invertible and apply Theorem 10.1.

In Lectures 9–10, we studied the change in the eigenvalues and singular values
under perturbations. In the next lecture, we expand our scope to consider the change
in the eigenspaces of an Hermitian matrix under perturbation.

Notes
This lecture is adapted from the paper [LM99] of Li & Mathias. The proofs of Sylvester’s
theorem and Ostrowski’s theorem are drawn from Horn & Johnson [HJ13]. See Bhatia’s
books [Bha97; Bha07a] for some additional discussion of multiplicative perturbation.

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[LM99] C.-K. Li and R. Mathias. “The Lidskii-Mirsky-Wielandt theorem–additive and
multiplicative versions”. In: Numerische Mathematik 81.3 (1999), pages 377–413.
[Ost59] A. M. Ostrowski. “A quantitative formulation of Sylvester’s law of inertia”. In:
Proceedings of the National Academy of Sciences of the United States of America 45.5
(1959), page 740.
[Syl52] J. Sylvester. “A demonstration of the theorem that every homogeneous quadratic
polynomial is reducible by real orthogonal substitutions to the form of a sum of
positive and negative squares”. In: The London, Edinburgh, and Dublin Philosophical
Magazine and Journal of Science 4.23 (1852), pages 138–142.
11. Perturbation of Eigenspaces

Date: 8 February 2022 Scribe: Taylan Kargin

Agenda:
1. Motivation
2. Principle angles
3. Sylvester equations
4. Eigenspace perturbation

The last two lectures covered the perturbation theory for eigenvalues of Hermitian
matrices. We will continue our discussion with perturbation of the eigenspaces of
Hermitian matrices, which has a rather different character. We will start out with
some motivation demonstrating the major challenge in understanding the behavior of
eigenspaces under perturbation. This observation leads us to study the principle
angles between subspaces and the solutions of Sylvester equations. Finally, we will
use these tools to develop a result on the perturbation of eigenspaces associated with
well-separated eigenvalues.

11.1 Motivation
Let 𝑨, 𝑩 ∈ ℍ𝑛 be Hermitian matrices of the same dimension. From Lidskii’s theorem
(Lecture 9), if 𝑨 and 𝑩 are close, then the vectors 𝝀(𝑨) and 𝝀(𝑩) of decreasingly
ordered eigenvalues are also close. One might ask the following question. Are the
eigenspaces of 𝑨 and 𝑩 close as well? Let us investigate by posing a simple example.
Example 11.1 (Bad eigenspaces). For small 𝜀 > 0, consider the matrices

    𝑨 = [ 1+𝜀   0 ; 0   1−𝜀 ]   and   𝑩 = [ 1   𝜀 ; 𝜀   1 ] .

These matrices represent two different perturbations of the identity matrix. Each of
the matrices 𝑨 , 𝑩 has eigenvalues 1 ± 𝜀 . Furthermore, the two matrices are close with
respect to every UI norm.
One might expect that the eigenspace of 𝑨 associated with the 1 + 𝜀 eigenvalue
is close to the eigenspace of 𝑩 associated with the 1 + 𝜀 eigenvalue and similarly for
eigenspaces associated with 1 − 𝜀 eigenvalue. The eigenpairs of the matrices are easily
determined:

    𝑨 has eigenpairs ( 1 + 𝜀, [1; 0] ) and ( 1 − 𝜀, [0; 1] ) ;
    𝑩 has eigenpairs ( 1 + 𝜀, (1/√2)[1; 1] ) and ( 1 − 𝜀, (1/√2)[1; −1] ) .

For each eigenvalue, the eigenvectors of 𝑨 and 𝑩 are very far apart. This happens in
spite of the fact that the matrices 𝑨 and 𝑩 are close to each other in every UI norm. 
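
Aside: (Numerical illustration). The example above can be reproduced in a few lines.
The following Python sketch, assuming NumPy, shows that 𝑨 and 𝑩 are within about 𝜀√2
in the spectral norm while their top eigenvectors sit 45° apart.

    import numpy as np

    eps = 1e-6
    A = np.array([[1 + eps, 0.0], [0.0, 1 - eps]])
    B = np.array([[1.0, eps], [eps, 1.0]])

    _, UA = np.linalg.eigh(A)        # columns ordered by increasing eigenvalue
    _, UB = np.linalg.eigh(B)
    u_top_A, u_top_B = UA[:, -1], UB[:, -1]
    print(np.linalg.norm(A - B, 2))              # ~ eps * sqrt(2): the matrices are close
    print(np.abs(u_top_A @ u_top_B))             # ~ cos(45 deg) = 0.707...: eigenvectors are not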
As seen from the preceding example, two matrices can be close without having
similar eigenspaces. To appreciate why, observe that the 2 × 2 identity matrix has only one
distinct eigenvalue, 1, and the associated eigenspace spans all of ℂ2 . Both 𝑨 and 𝑩
are perturbations of the identity matrix, and we are looking at eigenvectors associated
to very close eigenvalues. When their eigenvalues coalesce at 𝜀 = 0, the 1-dimensional
eigenspaces combine to form a 2-dimensional eigenspace with a continuous family of

eigenvectors. By making perturbations of the identity in different directions, we can


force the two matrices to have strikingly different eigenvectors.
This observation suggests the idea that eigenspaces associated to well-separated
eigenvalues may be insensitive to perturbations. To illustrate this effect, let us consider
the matrices

    𝑨 = [ 1+𝜀   0   0 ; 0   1−𝜀   0 ; 0   0   𝜆 ]   and   𝑩 = [ 1   𝜀   0 ; 𝜀   1   0 ; 0   0   𝜆 ] ,

where 𝜆 ≫ 1. In that case, the eigenvector associated with the eigenvalue 𝜆 is not
affected by the instability of the eigenspaces with eigenvalues 1 ± 𝜀 . Perturbation
theory for eigenspaces builds on this insight.

11.2 Principle angles between subspaces


In this section, we introduce the concept of principle angles between subspaces to
quantify what it means for eigenspaces to be similar to each other or different from
each other.

11.2.1 Geometric approach


We start by defining the acute angle between a pair of vectors and then we will build
up from there to notions of closeness between subspaces.

Definition 11.2 (Acute angle between vectors). Let 𝒖, 𝒗 ∈ ℂ𝑛 be vectors with unit
ℓ_2 -norm. The acute angle 𝜃 ∈ [0, 𝜋/2] between 𝒖 and 𝒗 is determined by the
relation cos 𝜃 (𝒖, 𝒗) = |⟨𝒖, 𝒗⟩| .

Figure 11.1 demonstrates that the acute angle between vectors is simply the smaller
angle between the 1-dimensional subspaces spanned by 𝒖 and 𝒗 . Now, the question
that we have to pose, which goes back to work of [Jor75], is how to extend this idea of
finding the smallest angle to subspaces.
We begin with a mechanism for parameterizing the subspaces. Let 𝑿 ,𝒀 ∈ ℂ^{𝑛×𝑘}
be tall matrices, each one with orthonormal columns. We emphasize that the columns
of the concatenation [𝑿 ,𝒀 ] do not need to be orthonormal. We can parameterize unit
vectors in the ranges of these matrices as follows.

Figure 11.1  Acute angle between vectors.

{𝑿 𝒖 : k𝒖 k = 1} and {𝒀 𝒗 : k𝒗 k = 1}.
These sets parametrize the unit spheres in the range of 𝑿 and in the range of 𝒀 . Now,
the acute angle between two unit vectors from the ranges 𝑿 𝒖 and 𝒀 𝒗 satisfies

cos 𝜃 (𝑿 𝒖,𝒀 𝒗 ) = |h𝑿 𝒖, 𝒀 𝒗 i| = |𝒖 ∗ (𝑿 ∗𝒀 )𝒗 |.

The first principle angle 𝜃 1 between range (𝑿 ) and range (𝒀 ) is defined as the minimum
acute angle between the unit vectors in range (𝑿 ) and range (𝒀 ) . In order to minimize
the acute angle, we need to maximize the cosine of this angle:

    cos 𝜃_1 ( range (𝑿 ), range (𝒀 )) := max_{‖𝒖‖=1, ‖𝒗‖=1} |𝒖^∗ (𝑿^∗𝒀 )𝒗 | = 𝜎_1 (𝑿^∗𝒀 ) .

In short, the closest pair of vectors in these two subspaces, range (𝑿 ) and range (𝒀 ) , is
obtained by taking the first singular value of 𝑿 ∗𝒀 .
To continue, let (𝒖 1 , 𝒗 1 ) be a pair of unit vectors where the maximum is achieved
in the definition of 𝜃 1 . The second principle angle 𝜃 2 can be defined recursively by

finding the smallest angle between unit vectors in range (𝑿 ) and range (𝒀 ) , excluding
the directions 𝑿 𝒖 1 and 𝒀 𝒗 1 .

    cos 𝜃_2 ( range (𝑿 ), range (𝒀 )) := max_{‖𝒖‖=1, ‖𝒗‖=1, 𝒖⊥𝒖_1 , 𝒗⊥𝒗_1} |𝒖^∗ (𝑿^∗𝒀 )𝒗 | = 𝜎_2 (𝑿^∗𝒀 ).

The second relation is a consequence of the Courant–Fischer–Weyl minimax principle.

11.2.2 Algebraic approach


By extending this approach, we arrive at the general definition of principle angles
between a pair of subspaces.

Definition 11.3 (Principle angles). Let E, F ⊂ ℂ𝑛 be subspaces, possibly with different


dimensions. Let 𝑿 ,𝒀 be orthonormal matrices such that range (𝑿 ) = E and
range (𝒀 ) = F. The 𝑖 th principle angle 𝜃𝑖 ( E, F) ∈ [ 0, 𝜋/2] between the subspaces E
and F is determined by the relation

cos 𝜃𝑖 ( E, F) B 𝜎𝑖 (𝑿 ∗𝒀 ) for 𝑖 = 1, . . . , min{dim E, dim F}.

The principle angles only depend on the two subspaces, even though the matrices
𝑿 and 𝒀 in the definition are not uniquely determined. To see why, it is helpful to
rewrite this definition in terms of the orthogonal projectors onto the subspaces, as the
orthogonal projectors are unique.

Definition 11.4 (Principle angles II). Let E, F ⊂ ℂ𝑛 be subspaces. Let 𝑷 , 𝑸 ∈ ℍ𝑛 be


the orthogonal projectors onto E and F, respectively. Then the 𝑖 th principle angle
𝜃𝑖 ( E, F) ∈ [0, 𝜋/2] is determined by

cos 𝜃𝑖 ( E, F) = 𝜎𝑖 (𝑷 𝑸 ) for 𝑖 = 1, . . . , min{dim E, dim F}.

Exercise 11.5 (Principle angles: Equivalence). Check that Definitions 11.3 and 11.4 give the
same result.
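
Aside: (Computing principle angles). Both definitions translate directly into code. The
following Python sketch, assuming NumPy, computes the principle angles of two random
test subspaces from orthonormal bases (Definition 11.3) and from orthogonal projectors
(Definition 11.4) and confirms that the two computations agree.

    import numpy as np

    rng = np.random.default_rng(9)
    n, k = 8, 3
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of E
    Y, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of F

    cos_basis = np.linalg.svd(X.conj().T @ Y, compute_uv=False)

    P, Q = X @ X.conj().T, Y @ Y.conj().T              # orthogonal projectors onto E, F
    cos_proj = np.linalg.svd(P @ Q, compute_uv=False)[:k]

    assert np.allclose(cos_basis, cos_proj)
    angles = np.degrees(np.arccos(np.clip(cos_basis, 0.0, 1.0)))
    print(angles)                                      # principle angles in degrees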

11.2.3 Similarity and distance between subspaces


The concept of principle angles allows us to define various notions of the similarity
between a pair of subspaces. Two subspaces are similar when the principle angles are
small. Equivalently, subspaces are similar when the cosines of the principle angles are
large. We can quantify this effect by applying UI norms to the product of projectors.
For example, let 𝑷 , 𝑸 ∈ ℍ𝑛 be orthogonal projectors, and define the subspaces
E = range (𝑷 ) and F = range (𝑸 ) . The operator norm of the projector product satisfies

k𝑷 𝑸 k 2 = cos2 𝜃 1 ( E, F).

When the cosine of the first principle angle is relatively large (that is, cos 𝜃 1 ≈ 1), then
the subspaces are very similar. Conversely, if the cosine of the angle is small (that is,
cos 𝜃 1 ≈ 0), then the subspaces are very different. Thus, the spectral norm k𝑷 𝑸 k of
the projector product gives one measure of how close the two subspaces are. Other UI
norms lead to other measures of similarity.
In parallel, we can measure the distance between subspaces by considering the
sines of principle angles. For example, sin2 𝜃 1 ( E, F) is a measure of distance between
E and F. That is, sin 𝜃 1 ≈ 0 when the subspaces are similar, and sin 𝜃 1 ≈ 1 when the

subspaces are very different. It turns out that the these distances can also be expressed
in terms of projector products. For subspaces E, F with the same dimension,

k( I − 𝑷 )𝑸 k 2 = sin2 𝜃 1 ( E, F).

Other UI norms lead to other notions of distance. This insight allows us to develop
metric geometry for subspaces.
Exercise 11.6 (Distances between subspaces). PS3 has some exercises that describe various
ways to measure the distance between subspaces.

11.3 Sylvester equations


Before we study perturbation theory for eigenspaces, we need to take another detour to
develop the basic theory of Sylvester equations [Syl84]. For now, this digression might
seem perplexing. By the end of the lecture, you will see the wonderful connection
between these ideas.

11.3.1 Formulation and solvability


We will start with the form and solvability properties of the Sylvester equation. Let 𝑨 ∈ ℍ𝑚
and 𝑩 ∈ ℍ𝑛 be Hermitian matrices, perhaps with different dimensions. Fix a matrix
𝒀 ∈ ℂ𝑚×𝑛 . The Sylvester equation with this data is the linear system

Find 𝑿 ∈ ℂ𝑚×𝑛 : 𝑨𝑿 − 𝑿 𝑩 = 𝒀 . (SYL)

It is fruitful to rewrite this linear equation using Kronecker products.

( I ⊗ 𝑨 ᵀ − 𝑩 ⊗ I) vec (𝑿 ) = vec (𝒀 ). (11.1)

This expression converts the matrix equation into an ordinary linear system, which
may be easier to understand.
Exercise 11.7 (Sylvester equation: Tensor form). Check that the equation above is equivalent
to (SYL).
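As a quick illustration (not part of the notes), the following numpy sketch solves a small Sylvester equation through a Kronecker-product formulation. It uses the standard column-stacking convention for vec, under which the coefficient matrix reads I ⊗ 𝑨 − 𝑩 ᵀ ⊗ I; depending on the tensor identification you adopt in Exercise 11.7, your formula may look slightly different.

```python
# Sketch: solving AX - XB = Y by vectorization (column-stacking convention assumed).
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
A = rng.standard_normal((m, m)); A = (A + A.T) / 2    # real symmetric stand-ins for Hermitian
B = rng.standard_normal((n, n)); B = (B + B.T) / 2
Y = rng.standard_normal((m, n))

# vec(AX) = (I_n kron A) vec(X) and vec(XB) = (B^T kron I_m) vec(X).
K = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))
x = np.linalg.solve(K, Y.flatten(order="F"))          # order="F" stacks columns
X = x.reshape((m, n), order="F")

print(np.allclose(A @ X - X @ B, Y))                  # unique solution when spectra are disjoint
```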
Proposition 11.8 (Solvability of (SYL)). Let 𝑨 ∈ ℍ𝑚 and 𝑩 ∈ ℍ𝑛 . The Sylvester equa-
tion (SYL) has a unique solution for all 𝒀 ∈ ℂ𝑚×𝑛 if and only if 𝜆𝑖 (𝑨) ≠ 𝜆 𝑗 (𝑩) for all
𝑖, 𝑗.
Proof. The Sylvester equation (SYL) has a unique solution for each right-hand side 𝒀
if and only if the matrix in the Kronecker product formulation (11.1) is nonsingular.
According to PS1, the eigenvalues of the tensor I ⊗ 𝑨 ᵀ − 𝑩 ⊗ I are the numbers

𝜆𝑖 (𝑨 ᵀ ) − 𝜆 𝑗 (𝑩) = 𝜆𝑖 (𝑨) − 𝜆 𝑗 (𝑩) for all 𝑖 , 𝑗 .

Therefore, the tensor is nonsingular if and only if 𝜆𝑖 (𝑨) ≠ 𝜆 𝑗 (𝑩) for all 𝑖 , 𝑗 . 

Warning 11.9 (Equal coefficients). If 𝑨 = 𝑩 , then (SYL) never has a unique solution.
By placing additional assumptions on the solution matrix (e.g., Hermiticity), we
can sometimes recover the unique solvability. 

Exercise 11.10 (General Sylvester equations). Extend this discussion to 𝑨 ∈ 𝕄𝑚 and


𝑩 ∈ 𝕄𝑛 that may not be Hermitian.

11.3.2 Integral representation of the solution


In this section, we will develop a remarkable representation for the solution of the
Sylvester equation under a stronger spectral separation condition. This approach gives
a direct path to bound the norm of the solution operator.
For motivation, we begin with the trivial scalar case of (SYL). Suppose that
𝑏 < 0 ≤ 𝑎 . In this case, the equation (SYL) takes the form 𝑎𝑥 − 𝑥𝑏 = 𝑦 , and it has
the unique solution 𝑥 = 𝑦 (𝑎 − 𝑏) −1 . We can rewrite this solution in a bizarre way:
𝑥 = 𝑦 ∫_0^∞ e^{−𝑡 (𝑎−𝑏)} d𝑡 = ∫_0^∞ e^{−𝑡 𝑎} 𝑦 e^{𝑡 𝑏} d𝑡 .

The power of this formulation is that we can extend it to the matrix case by capitalizing
the letters, which is a surprisingly useful technique in matrix analysis.

Theorem 11.11 (Sylvester with sign conditions). Let 𝑨 ∈ ℍ𝑚 and 𝑩 ∈ ℍ𝑛 . Assume


𝜆𝑖 (𝑨) ≥ 0 and 𝜆 𝑗 (𝑩) < 0 for all 𝑖 , 𝑗 . Then, for an arbitrary 𝒀 ∈ ℂ𝑚×𝑛 , the
solution 𝑿 to (SYL) can be written as
𝑿 = ∫_0^∞ e^{−𝑡 𝑨} 𝒀 e^{𝑡 𝑩} d𝑡 .

Proof. For a square matrix 𝑴 ∈ 𝕄𝑛 , we can use the Taylor series expansion of the
exponential function to see that

(d/d𝑡 ) e^{𝑡 𝑴 } = 𝑴 e^{𝑡 𝑴 } = e^{𝑡 𝑴 } 𝑴 .
Introduce this ansatz into (SYL):
𝑨𝑿 − 𝑿 𝑩 = ∫_0^∞ [ 𝑨 e^{−𝑡 𝑨} 𝒀 e^{𝑡 𝑩} − e^{−𝑡 𝑨} 𝒀 e^{𝑡 𝑩} 𝑩 ] d𝑡
          = ∫_0^∞ −(d/d𝑡 ) [ e^{−𝑡 𝑨} 𝒀 e^{𝑡 𝑩} ] d𝑡
          = 𝒀 − lim_{𝑡 →∞} e^{−𝑡 𝑨} 𝒀 e^{𝑡 𝑩} = 𝒀 .

The limit in the second-to-last equality vanishes. Indeed, 𝑨 has nonnegative eigenvalues,
so e^{−𝑡 𝑨} is bounded. Since 𝑩 has strictly negative eigenvalues, e^{𝑡 𝑩} → 0 as 𝑡 → ∞. 

11.3.3 Norm of the solution operator


Using Theorem 11.11, we can quickly bound the norm of the solution to the Sylvester
equation (SYL).
Corollary 11.12 (Norm of solution operator). Let 𝑨 ∈ ℍ𝑚 and 𝑩 ∈ ℍ𝑛 . For 𝑠 ∈ ℝ and
𝜀 > 0, assume 𝜆𝑖 (𝑨) ≥ 𝑠 and 𝜆 𝑗 (𝑩) ≤ 𝑠 − 𝜀 for all 𝑖 , 𝑗 . For each UI norm |||·||| , the
solution 𝑿 to (SYL) satisfies the inequality |||𝑿 ||| ≤ 𝜀 −1 |||𝒀 ||| .

Proof. Without loss of generality, we can take 𝑠 = 0. Indeed, the mappings 𝑨 ↦→ 𝑨 − 𝑠 I,


and 𝑩 ↦→ 𝑩 − 𝑠 I, leave the left-hand side of the Sylvester equation (SYL) invariant.
Figure 11.2 (Spectral separation). An 𝜀 -spectral gap between sp (𝑨) ∩ S𝑨 and sp (𝑩) ∩ S𝑩 .

Take the norm of the integral representation in Theorem 11.11, and invoke the triangle
inequality. This step yields the bound


|||𝑿 ||| ≤ ∫_0^∞ ||| e^{−𝑡 𝑨} 𝒀 e^{𝑡 𝑩} ||| d𝑡
       ≤ ∫_0^∞ k e^{−𝑡 𝑨} k · |||𝒀 ||| · k e^{𝑡 𝑩} k d𝑡
       ≤ |||𝒀 ||| ∫_0^∞ 1 · e^{−𝑡 𝜀} d𝑡 = (1/𝜀) |||𝒀 |||.
The second inequality holds because each UI norm is an operator ideal norm. The third
inequality depends on the assumptions that the spectra of 𝑨 and 𝑩 are separated. 

Let us take a moment to reinterpret this corollary. Fix the data 𝑨 ∈ ℍ𝑚 and
𝑩 ∈ ℍ𝑛 for the Sylvester equation (SYL), and assume that spectral separation criterion
from Corollary 11.12 holds. We can define the linear solution operator 𝚽 : 𝒀 ↦→ 𝑿 that
maps the right-hand side 𝒀 of the Sylvester equation to the unique solution 𝑿 . Then
Corollary 11.12 shows that

k𝚽k B max{k𝚽(𝒀 )k : k𝒀 k ≤ 1} ≤ 𝜀 −1 .
In other words, the norm of the solution operator is controlled by the spectral separation
of the data 𝑨, 𝑩 .

11.4 Perturbation theory for eigenspaces


At last, we are prepared to describe how eigenspaces change under perturbation. These
results hinge on the theory of Sylvester equations.
Let 𝑨 ∈ ℍ𝑛 be a Hermitian matrix with spectral resolution 𝑨 = ∑_{𝜆∈sp (𝑨)} 𝜆 𝑷 𝜆 .
Recall that sp (𝑨) = {𝜆 ∈ ℝ : det (𝑨 − 𝜆I) = 0} denotes the set of eigenvalues of
𝑨 , and 𝑷 𝜆 ∈ ℍ𝑛 is the unique orthogonal projector onto the eigenspace associated
with the eigenvalue 𝜆. For a set S ⊂ ℝ, define the restricted spectral projector
𝑷 𝑨 ( S) = ∑_{𝜆∈sp (𝑨)∩S} 𝑷 𝜆 , which acts as the orthogonal projector onto the invariant
subspace spanned by the eigenspaces of all the eigenvalues in S.
Our goal is to argue that if S𝑨 and S𝑩 are well-separated sets and 𝑨 and 𝑩 are close,
then 𝑷 𝑨 ( S𝑨 ) and 𝑷 𝑩 ( S𝑩 ) are nearly orthogonal. The following theorem formalizes
this idea. This result is essentially due to Davis & Kahan [DK70].

Theorem 11.13 (Perturbation of eigenspaces). Let 𝑨, 𝑩 ∈ ℍ𝑛 . Let S𝑨 , S𝑩 ⊂ ℝ be


intervals with dist ( S𝑨 , S𝑩 ) > 𝜀 > 0 as in Figure 11.2. Introduce the restricted
spectral projectors 𝑷 𝑨 = 𝑷 𝑨 ( S𝑨 ) and 𝑷 𝑩 = 𝑷 𝑩 ( S𝑩 ) . Then

|||𝑷 𝑨 𝑷 𝑩 ||| ≤ (1/𝜀) |||𝑷 𝑨 (𝑨 − 𝑩)𝑷 𝑩 ||| ≤ (1/𝜀) |||𝑨 − 𝑩 |||

for any UI norm |||·||| .

Let us sketch the idea behind the argument and then give a more rigorous treatment.
Since a matrix commutes with its spectral projectors, observe that

𝒀 0 = 𝑷 𝑨 (𝑨 − 𝑩)𝑷 𝑩 = 𝑨 (𝑷 𝑨 𝑷 𝑩 ) − (𝑷 𝑨 𝑷 𝑩 )𝑩.

Therefore, 𝑿 0 = 𝑷 𝑨 𝑷 𝑩 is the solution to a Sylvester equation. Aside from the zero


eigenvalue, the coefficient matrices have spectral separation because the sets S𝑨 and S𝑩
are separated. Heuristically, we should be able to invoke Corollary 11.12 to bound the
norm of 𝑿 0 . This gives a bound on the similarity between the two spectral subspaces
in terms of the norm of 𝒀 0 , which reflects the discrepancy 𝑨 − 𝑩 between the matrices.

Proof. To make this argument solid, we introduce factorizations 𝑷 𝑨 = 𝑸 𝑨 𝑸 ∗𝑨 and


𝑷 𝑩 = 𝑸 𝑩 𝑸 𝑩∗ where 𝑸 𝑨 has orthonormal columns and 𝑸 𝑩 has orthonormal columns.
Define the matrix

𝒀 = 𝑸 ∗𝑨 (𝑨 − 𝑩)𝑸 𝑩 .

Since 𝑸 𝑨 and 𝑸 𝑩 are orthonormal matrices, we can write

𝒀 = (𝑸 ∗𝑨 𝑸 𝑨 )𝑸 ∗𝑨 𝑨𝑸 𝑩 − 𝑸 ∗𝑨 𝑩𝑸 𝑩 (𝑸 𝑩∗ 𝑸 𝑩 )
= 𝑸 ∗𝑨 𝑷 𝑨 𝑨𝑸 𝑩 − 𝑸 ∗𝑨 𝑩𝑷 𝑩 𝑸 𝑩
= 𝑸 ∗𝑨 𝑨𝑷 𝑨 𝑸 𝑩 − 𝑸 ∗𝑨 𝑷 𝑩 𝑩𝑸 𝑩
= (𝑸 ∗𝑨 𝑨𝑸 𝑨 ) (𝑸 ∗𝑨 𝑸 𝑩 ) − (𝑸 ∗𝑨 𝑸 𝑩 )(𝑸 𝑩∗ 𝑩𝑸 𝑩 ).

Indeed, the spectral projectors commute with their matrices 𝑷 𝑨 𝑨 = 𝑨𝑷 𝑨 and 𝑷 𝑩 𝑩 =


𝑩𝑷 𝑩 . Therefore, the matrix 𝑿 = 𝑸 ∗𝑨 𝑸 𝑩 solves the Sylvester equation with data
𝑸 ∗𝑨 𝑨𝑸 𝑨 and 𝑸 𝑩∗ 𝑩𝑸 𝑩 .
By construction, all eigenvalues of 𝑸 ∗𝑨 𝑨𝑸 𝑨 belong to S𝑨 , and all eigenvalues of

𝑸 𝑩∗ 𝑩𝑸 𝑩 belong to S𝑩 . Therefore, the eigenvalues are separated by at least 𝜀 . By Corollary 11.12,
we have that
|||𝑸 ∗𝑨 𝑸 𝑩 ||| = |||𝑿 ||| ≤ (1/𝜀) |||𝒀 ||| = (1/𝜀) |||𝑸 ∗𝑨 (𝑨 − 𝑩)𝑸 𝑩 |||. (11.2)
Since |||·||| is UI, the left- and right-hand sides of (11.2) do not change if we introduce
the orthonormal matrices 𝑸 𝑨 and 𝑸 𝑩∗ . By doing so, the spectral projectors appear, and
we recognize that

|||𝑷 𝑨 𝑷 𝑩 ||| ≤ (1/𝜀) |||𝑷 𝑨 (𝑨 − 𝑩)𝑷 𝑩 |||
            ≤ (1/𝜀) k𝑷 𝑨 k · |||𝑨 − 𝑩 ||| · k𝑷 𝑩 k = (1/𝜀) |||𝑨 − 𝑩 |||.
The second inequality depends on the operator ideal property of UI norms. 
The result of Theorem 11.13 verifies our intuition from the beginning of the lecture.
When 𝑨 and 𝑩 are close to each other, the spectral projectors for well-separated
eigenvalues are almost orthogonal to each other. Quantitatively, the UI norm of the
projector product 𝑷 𝑨 𝑷 𝑩 is very small, which reflects dissimilarity of the subspaces
range (𝑷 𝑨 ) and range (𝑷 𝑩 ) . In particular, when 𝑨 = 𝑩 , the eigenvectors associated
with different eigenvalues are orthogonal.
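Before moving on, here is a small numerical spot-check of Theorem 11.13 in the spectral norm (a sketch, not part of the notes; the random matrices, intervals, and variable names are ad hoc). We separate the top eigenvalue of 𝑨 from the rest, perturb 𝑨 slightly, and compare the two sides of the bound.

```python
# Sketch: Davis-Kahan style bound ||P_A P_B|| <= ||A - B|| / eps in the spectral norm.
import numpy as np

rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
E = rng.standard_normal((n, n)); E = 1e-2 * (E + E.T) / 2
B = A + E

wA, VA = np.linalg.eigh(A)
wB, VB = np.linalg.eigh(B)
cut = (wA[-1] + wA[-2]) / 2              # midpoint of the top spectral gap of A
selA, selB = wA > cut, wB <= cut         # S_A above the cut, S_B below it
P_A = VA[:, selA] @ VA[:, selA].T
P_B = VB[:, selB] @ VB[:, selB].T

eps = wA[selA].min() - wB[selB].max()    # separation between the two spectral sets
lhs = np.linalg.norm(P_A @ P_B, 2)
rhs = np.linalg.norm(A - B, 2) / eps
print(f"||P_A P_B|| = {lhs:.2e}  <=  ||A - B||/eps = {rhs:.2e}")
```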

Notes
James Joseph Sylvester (1814–1897) was an English mathematician who made important
contributions in matrix theory, number theory and combinatorics. His work includes
the proof of Sylvester’s law of inertia and the study of Sylvester equations.
The idea of using Sylvester equations to study the perturbation of eigenspaces is
usually attributed to Davis & Kahan, who worked out these ideas in an important
paper [DK70]. The presentation in this lecture is more modern. It is modeled on
Bhatia’s book [Bha97, Chap. VII]. We have chosen to emphasize the variational
perspective on principal angles because it seems more intuitive.

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[DK70] C. Davis and W. M. Kahan. “The Rotation of Eigenvectors by a Perturbation. III”.
In: SIAM Journal on Numerical Analysis 7.1 (1970), pages 1–46. doi: 10.1137/0707001.
[Jor75] C. Jordan. “Essai sur la géométrie à 𝑛 dimensions”. In: Bulletin de la Société
mathématique de France 3 (1875), pages 103–174.
[Syl84] J. J. Sylvester. “Sur l’équation en matrices px= xq”. In: CR Acad. Sci. Paris 99.2
(1884), pages 67–71.
12. Positive Linear Maps

Date: 10 February 2022 Scribe: Ruizhi Cao

Agenda:
1. Positive-semidefinite order
2. Positive linear maps
3. Examples of positive maps
4. Basic properties
5. Convexity inequalities
6. Russo–Dye theorem

In the first half of the course, we discussed majorization and its consequences, and
then we turned to the study of perturbation theory for eigenvalues and eigenspaces.
In the second half of the course, we are going to talk about positive-semidefinite
matrices and operations on positive-semidefinite matrices. This lecture introduces the
positive-semidefinite order, and it defines a class of linear maps that respect the order.
We will study the properties of these positive linear maps.

12.1 Positive-semidefinite order


We will first remind the reader of the definition of a positive-semidefinite (psd) matrix.
Then we introduce the psd order, a natural partial order relation on the self-adjoint
matrices.
Definition 12.1 (Positive semidefinite; positive definite). For an 𝑛 × 𝑛 complex matrix
𝑨 ∈ 𝕄𝑛 (ℂ) , we say that 𝑨 is positive semidefinite (psd) if we have

h𝒖, 𝑨𝒖i ≥ 0 for all 𝒖 ∈ ℂ𝑛 .

For 𝑨 ∈ 𝕄𝑛 (ℂ) , we say that 𝑨 is positive definite (pd) if h𝒖, 𝑨𝒖i > 0 for all
nonzero vectors 𝒖 ∈ ℂ𝑛 .

Warning: The situation is more delicate for matrices over the real field. In that case,
the definition of a psd matrix must also include a symmetry assumption. 

The most basic fact about psd matrices is the conjugation rule, which states that
conjugation preserves the psd property.
Proposition 12.2 (Conjugation rule). Let 𝑨 ∈ 𝕄𝑛 , and let 𝑿 ∈ ℂ𝑛×𝑘 .

1. If 𝑨 is psd, then the matrix 𝑿 ∗ 𝑨𝑿 is also psd.



2. If 𝑿 ∗ 𝑨𝑿 is a psd matrix and 𝑿 is surjective (has rank 𝑛 ), then 𝑨 is psd.

Exercise 12.3 (Conjugation rule). Prove Proposition 12.2.


Exercise 12.4 (The psd cone). Show that the set 𝕄𝑛+ of psd matrices forms a proper cone.
Recall that a proper cone is closed, convex, pointed, and solid.

Definition 12.5 (Semidefinite order). For two self-adjoint matrices 𝑨, 𝑩 ∈ 𝕄𝑛sa , we
write 𝑨 4 𝑩 when 𝑩 − 𝑨 is psd. The relation “ 4 ” is a partial order on 𝕄𝑛sa since
the cone 𝕄𝑛+ of psd matrices is proper. We call this the psd order, also known as the
semidefinite order or Loewner order. Similarly, we write 𝑨 ≺ 𝑩 when 𝑩 − 𝑨 is pd.

Recall that a matrix 𝑨 ∈ 𝕄𝑛 is self-adjoint (s.a.) when 𝑨 ∗ = 𝑨 .

It is natural to ask what kind of functions respect the psd order. That is, we are
interested in functions 𝑭 : 𝕄𝑛sa → 𝕄𝑘sa for which

𝑨4𝑩 implies 𝑭 (𝑨) 4 𝑭 (𝑩).



A function that satisfies this type of relation is said to be monotone with respect to the
psd order.
We can also ask what kind of functions are “convex” with respect to the psd order.
This amounts to the relation
𝑭 ( (1/2) 𝑨 + (1/2) 𝑩 ) 4 (1/2) 𝑭 (𝑨) + (1/2) 𝑭 (𝑩) for all 𝑨, 𝑩 ∈ 𝕄𝑛sa . (12.1)

In other words, the function of an average is bounded by the average of the function
values.
In this lecture, we will consider linear functions on matrices that are monotone
with respect to the psd order. For a linear function 𝑭 , the convexity relation (12.1) is
always valid, so we will focus on monotonicity for now. In the next lecture, we will
study nonlinear functions that are monotone or convex.

12.2 Positive linear maps


For today, the key concept is a positive linear map, which is a linear function that maps
each psd matrix to another psd matrix, perhaps with a different dimension.

Definition 12.6 (Positive linear map). A linear map 𝚽 : 𝕄𝑛 → 𝕄𝑘 is positive when


𝑨 < 0 implies that 𝚽(𝑨) < 0 for all 𝑨 ∈ 𝕄𝑛 . We say that 𝚽 is strictly positive
when 𝑨 ≻ 0 implies that 𝚽(𝑨) ≻ 0.

Exercise 12.7 (Positive maps are monotone). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a linear map. Show
that the following conditions are equivalent.
1. 𝚽 is positive.
2. 𝑨 < 𝑩 implies 𝚽(𝑨) < 𝚽(𝑩) for all 𝑨, 𝑩 ∈ 𝕄𝑛sa .

Definition 12.8 (Unital, trace preserving). A linear map 𝚽 : 𝕄𝑛 → 𝕄𝑘 is unital if


𝚽( I𝑛 ) = I𝑘 . The map is trace preserving (TP) if tr 𝚽(𝑨) = tr 𝑨 for all 𝑨 ∈ 𝕄𝑛 .

Exercise 12.9 (Unital, trace preserving). Show that a linear map 𝚽 : 𝕄𝑛 → 𝕄𝑘 is unital if
and only if its adjoint 𝚽∗ : 𝕄𝑘 → 𝕄𝑛 is trace preserving. We equip matrices with the trace
inner product h𝑨, 𝑩 i = tr (𝑨 ∗ 𝑩) ; adjoints of linear maps are defined with respect to this
inner product.

Compare the definition of a unital linear map and a trace-preserving linear map
with the analogous definitions for linear functions on vectors. Recall that these concepts
arose in our study of majorization.

12.3 Examples of positive linear maps


There are numerous examples for positive linear maps. Some of the common positive
linear maps are listed below.
Example 12.10 (Scalar-valued positive linear maps). Here are some examples of positive
linear maps that take numerical values.
1. Trace. The trace functional 𝜑 : 𝕄𝑛 → ℂ given by 𝜑 (𝑨) B tr 𝑨 is positive and
trace preserving.
2. Normalized trace. The normalized trace functional 𝜑 : 𝕄𝑛 → ℂ is defined as

𝜑 (𝑨) B (1/𝑛) tr 𝑨 .
The normalized trace is positive and unital.

3. Sum of entries. The linear functional 𝜑 : 𝕄𝑛 → ℂ given by


𝜑 (𝑨) B ∑_{𝑖 ,𝑗 } 𝑎𝑖 𝑗

is positive. As usual, 𝑎𝑖 𝑗 ∈ ℂ are the entries of 𝑨 . Indeed, we can write


𝜑 (𝑨) = h1, 𝑨 1i , where 1 ∈ ℂ𝑛 is the vector of ones. The scalar h1, 𝑨 1i is
positive whenever 𝑨 is psd.
4. Linear functionals. Each linear functional 𝜑 : 𝕄𝑛 → ℂ can be parameterized as

𝜑 (𝑨) B tr (𝑷 𝑨) where 𝑷 ∈ 𝕄𝑛 .

The linear functional 𝜑 is positive if and only if 𝑷 is psd. It is unital if and only
if tr 𝑷 = 1. Positive semidefinite matrices with
trace one are called density matrices.
Are there other distinguished positive linear functionals? 

Example 12.11 (Matrix-valued positive linear maps). Next, we consider positive linear maps
from matrices to matrices.

1. Scalar matrices. The linear function 𝚽 : 𝕄𝑛 → 𝕄𝑘 given by

𝚽(𝑨) B ( (1/𝑛) tr 𝑨) I𝑘

is positive and unital. It is trace preserving if and only if 𝑘 = 𝑛 .


2. Diagonal. The linear function 𝚽 : 𝕄𝑛 → 𝕄𝑛 given by

𝚽(𝑨) B diag (𝑨) = ∑_𝑗 𝑎 𝑗 𝑗 E𝑗 𝑗

is positive, unital, and trace preserving.


3. Conjugation. Let 𝑿 ∈ ℂ𝑛×𝑘 . The linear function 𝚽 : 𝕄𝑛 → 𝕄𝑘 given by

𝚽(𝑨) B 𝑿 ∗ 𝑨𝑿

is positive. Furthermore, if 𝑿 is orthonormal, then 𝚽 is unital. In this case,
the conjugation operation extracts a principal submatrix (with respect to some
basis). (The conjugation map is the primitive example of a completely positive linear
map. See Problem Set 3 for definitions and discussion.)
4. Pinching. Let (𝑷 𝑖 ) be a family of orthogonal projectors on 𝕄𝑛 that are mutually
orthogonal (𝑷 𝑖 𝑷 𝑗 = 𝛿𝑖 𝑗 𝑷 𝑖 ) and that partition the identity ( ∑_𝑗 𝑷 𝑗 = I𝑛 ). The
function 𝚽 : 𝕄𝑛 → 𝕄𝑛 given by

𝚽(𝑨) B ∑_𝑗 𝑷 𝑗 𝑨𝑷 𝑗

is called a pinching. The pinching operation 𝚽 is positive, unital, and trace


preserving.

Can you think of more examples? 

Example 12.12 (Tensors and positive linear maps). There are also several natural examples
associated with tensor product operations.

1. Tensors. Fix a psd matrix 𝑩 ∈ 𝕄𝑘 . The function 𝚽 : 𝕄𝑛 → 𝕄𝑛𝑘 given by

𝚽(𝑨) B 𝑩 ⊗ 𝑨

is positive. Likewise, 𝚽(𝑨) B 𝑨 ⊗ 𝑩 is positive.



2. Schur products. Let 𝑩 ∈ 𝕄𝑛 be psd. The Schur product map 𝚽 : 𝕄𝑛 → 𝕄𝑛 given


by
𝚽(𝑨) B 𝑩 ⊙ 𝑨
is positive, where ⊙ denotes the Schur product (entrywise product). This result
is a consequence of the Schur product theorem, which states that the Schur
product of two psd matrices is a psd matrix. To prove this claim, note that 𝑩 ⊙ 𝑨
is a principal submatrix of 𝑩 ⊗ 𝑨 .
3. Partial traces. Consider the linear space 𝕄𝑛 ⊗ 𝕄𝑘 of tensor products of matrices.
For an elementary tensor 𝑩 ⊗ 𝑨 , we define the partial trace with respect to the
first factor:
tr1 (𝑩 ⊗ 𝑨) B ( tr 𝑩) 𝑨.
We extend this map to all tensor products by linearity. The partial trace is a
positive linear map. The partial trace tr2 with respect to the second factor is
defined in an analogous fashion.

Can you think of more examples? 
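To make a couple of these maps concrete, here is a short numerical sketch (not part of the notes; the helper functions and dimensions are ad hoc) of the pinching map and the partial trace, together with a check that each sends a psd input to a psd output.

```python
# Sketch: pinching and partial trace as positive linear maps.
import numpy as np

def pinching(A, projectors):
    """Sum of P_j A P_j over a family of mutually orthogonal projectors."""
    return sum(P @ A @ P for P in projectors)

def partial_trace_first(M, k, n):
    """tr_1 on M_k (x) M_n, with M stored as a (k*n)-by-(k*n) matrix in Kronecker order."""
    return np.einsum("iaib->ab", M.reshape(k, n, k, n))

rng = np.random.default_rng(3)
k, n = 2, 3
G = rng.standard_normal((k * n, k * n))
M = G @ G.T                                            # psd matrix on the product space
print(np.linalg.eigvalsh(partial_trace_first(M, k, n)).min() >= -1e-12)

P1 = np.diag([1.0, 0.0, 0.0]); P2 = np.eye(3) - P1     # projectors that partition I_3
A = rng.standard_normal((3, 3)); A = A @ A.T           # psd input
print(np.linalg.eigvalsh(pinching(A, [P1, P2])).min() >= -1e-12)
```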

Exercise 12.13 (Cone of positive linear maps). Show that the class of positive linear maps
𝚽 : 𝕄𝑛 → 𝕄𝑘 is a closed convex cone.

12.4 Properties of positive linear maps


Positive linear maps satisfy many elegant properties. We begin with some results which
show that positive linear maps preserve basic algebraic properties of matrices.
Proposition 12.14 (Self-adjointness). Let 𝚽 be a positive linear map. For each self-adjoint
𝑨 , the image 𝚽(𝑨) is also self-adjoint.
Proof. The self-adjoint matrix 𝑨 has a Jordan decomposition:

𝑨 = 𝑨 + − 𝑨 − where each of 𝑨 ± is psd.

By linearity of the map 𝚽, we can decompose 𝚽(𝑨) as

𝚽(𝑨) = 𝚽(𝑨 + ) − 𝚽(𝑨 − ) ∈ 𝕄𝑛sa .

Indeed, the difference of two psd matrices is self-adjoint. 

Recall that if the spectral resolution of 𝑨 is 𝑨 = ∑_𝑗 𝜆 𝑗 𝑷 𝑗 , then 𝑨 + = ∑_𝑗 (𝜆 𝑗 ) + 𝑷 𝑗
and 𝑨 − = ∑_𝑗 (𝜆 𝑗 ) − 𝑷 𝑗 . As usual, (𝑎) + B max {𝑎, 0 } and (𝑎) − B max {−𝑎, 0 } for
𝑎 ∈ ℝ, and 𝑷 𝑗 is the spectral projector.

Proposition 12.15 (Adjoints). Let 𝚽 be a positive linear map. Then 𝚽(𝑨 ∗ ) = 𝚽(𝑨) ∗ for
each matrix 𝑨 .

Proof. Each matrix 𝑨 ∈ 𝕄𝑛 has a Cartesian decomposition:

𝑨 = 𝑯 + i𝑺 where each 𝑯 , 𝑺 ∈ 𝕄𝑛sa .

The s.a. matrices 𝑯 and 𝑺 are given by the expressions 𝑯 = (1/2)(𝑨 + 𝑨 ∗ ) and
𝑺 = (1/(2i))(𝑨 − 𝑨 ∗ ). By linearity of 𝚽 and Proposition 12.14,

𝚽(𝑨 ∗ ) = 𝚽(𝑯 − i𝑺 ) = 𝚽(𝑯 ) − i𝚽(𝑺 ) = (𝚽(𝑯 ) + i𝚽(𝑺 )) ∗ = 𝚽(𝑨) ∗ .

This is the required result. 



12.5 Convexity inequalities


It is fruitful to think about a unital positive linear map as a generalization of the
expectation of a random variable. Here are several parallels.

Properties Expectation Unital, positive map


Linearity 𝔼 [𝛼𝑋 + 𝑌 ] = 𝛼 𝔼 [𝑋 ] + 𝔼 [𝑌 ] 𝚽(𝛼𝑨 + 𝑩) = 𝛼𝚽(𝑨) + 𝚽(𝑩)
Unital 𝔼 [1] = 1 𝚽( I) = I
Positive 𝑋 ≥ 0 implies 𝔼 [𝑋 ] ≥ 0 𝑨 < 0 implies 𝚽(𝑨) < 0

Similarly, a unital, trace-preserving, positive linear map is an analogous to a doubly


stochastic matrix, which is a particularly nice kind of averaging operation.
The key fact about expectation is Jensen’s inequality, which describes how expecta-
tion interacts with convex functions. Likewise, unital positive linear maps satisfy some
important convexity theorems. This section develops these ideas.

12.5.1 Schur complements


Before we begin, we must remind the reader about the definition of the Schur
complement and the core theorem on Schur complements of psd matrices.

Theorem 12.16 (Schur complements). Assume that 𝑨 ≻ 0 is a pd matrix. Then


 
[ 𝑨 𝑩 ; 𝑩 ∗ 𝑯 ] < 0 if and only if 𝑯 − 𝑩 ∗ 𝑨 −1 𝑩 < 0.

The matrix 𝑯 − 𝑩 ∗ 𝑨 −1 𝑩 is called the Schur complement of the block matrix with
respect to the block 𝑨 .

Proof. By block Gaussian elimination, we have

[ I 0 ; −𝑩 ∗ 𝑨 −1 I ] [ 𝑨 𝑩 ; 𝑩 ∗ 𝑯 ] [ I −𝑨 −1 𝑩 ; 0 I ] = [ 𝑨 0 ; 0 𝑯 − 𝑩 ∗ 𝑨 −1 𝑩 ] . (12.2)

The right hand side of (12.2) is positive semidefinite. Apply the conjugation rule to
complete the proof. 
Schur complements arise from partial Gaussian elimination, from Cholesky decom-
position, and from partial least-squares. They also describe conditioning of jointly
Gaussian random variables.

12.5.2 The Kadison inequality


Our first result describes how the square function interacts with a unital positive linear
map. This result is analogous to the application of Jensen’s inequality to the square
function.

Theorem 12.17 (Kadison inequality). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital positive linear


map. Then
𝚽(𝑨) 2 4 𝚽(𝑨 2 ) for all s.a. 𝑨 ∈ 𝕄𝑛sa .
Note that this result only applies to self-adjoint matrices.

Warning 12.18 (Larger powers?). The analog of Kadison’s inequality is false for powers
greater than two. On the other hand, there is a version of Lyapunov’s inequality
that holds for higher powers. 

Proof. Each s.a. matrix 𝑨 admits a spectral resolution:


𝑨 = ∑_𝑗 𝜆 𝑗 𝑷 𝑗 where 𝜆 𝑗 ∈ ℝ.

As usual, the spectral projectors are mutually orthogonal and decompose the identity.
With this representation, the square 𝑨 2 satisfies
𝑨² = ∑_𝑗 𝜆 𝑗 ² 𝑷 𝑗 .

By linearity, we have
𝚽(𝑨) = ∑_𝑗 𝜆 𝑗 𝚽(𝑷 𝑗 ) and 𝚽(𝑨²) = ∑_𝑗 𝜆 𝑗 ² 𝚽(𝑷 𝑗 ).

Note that 𝑷 𝑗 < 0 implies that 𝚽(𝑷 𝑗 ) < 0. Since 𝚽 is unital, we also have
∑_𝑗 𝚽(𝑷 𝑗 ) = 𝚽( ∑_𝑗 𝑷 𝑗 ) = 𝚽( I) = I.

These facts play a key role in the argument.


The key idea is to form a block matrix and to argue that it is psd. By linearity of 𝚽
and the preceding displays, we may calculate that
  ∑︁   ∑︁  
[ I 𝚽(𝑨) ; 𝚽(𝑨) 𝚽(𝑨²) ] = ∑_𝑗 [ 𝚽(𝑷 𝑗 ) 𝜆 𝑗 𝚽(𝑷 𝑗 ) ; 𝜆 𝑗 𝚽(𝑷 𝑗 ) 𝜆 𝑗 ² 𝚽(𝑷 𝑗 ) ] = ∑_𝑗 [ 1 𝜆 𝑗 ; 𝜆 𝑗 𝜆 𝑗 ² ] ⊗ 𝚽(𝑷 𝑗 ) < 0.

Indeed, the matrix formed from the eigenvalues is psd because


     ∗
[ 1 𝜆 ; 𝜆 𝜆² ] = [ 1 ; 𝜆 ] [ 1 ; 𝜆 ] ∗ < 0 for 𝜆 ∈ ℝ.

The matrices 𝚽(𝑷 𝑗 ) are also psd, and tensor products preserve positivity.
Now, by the Schur complement theorem, the Schur complement of the block I in
the block matrix is also psd. That is,

0 4 𝚽(𝑨 2 ) − 𝚽(𝑨) I−1 𝚽(𝑨) = 𝚽(𝑨 2 ) − 𝚽(𝑨) 2 .

This is equivalent to the statement of Kadison’s inequality. 

12.5.3 The Choi inequality


Choi’s inequality describes how unital positive linear maps interact with the matrix
inverse. This result is analogous to the application of Jensen’s inequality to the
reciprocal.

Theorem 12.19 (Choi inequality). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital, strictly positive


linear map. Then

𝚽(𝑨) −1 4 𝚽(𝑨 −1 ) for all pd 𝑨 ∈ 𝕄𝑛 .

Note that this result only holds for positive definite matrices.

Warning 12.20 (Smaller powers?). The analog of Choi’s inequality is false for powers
that are less than −1. 

Proof sketch. We form a block matrix and argue that it is psd:


 
[ 𝚽(𝑨) I ; I 𝚽(𝑨 −1 ) ] .

The key new observation is that


∗
[ 𝜆 1 ; 1 𝜆 −1 ] = [ 𝜆 ^{1/2} ; 𝜆 ^{−1/2} ] [ 𝜆 ^{1/2} ; 𝜆 ^{−1/2} ] ∗ < 0 for 𝜆 > 0.

The rest of the details are similar to the proof of Kadison’s inequality. 
Example 12.21 (Diagonals). Using the theorems above, we have the following results.

1. For a self-adjoint matrix 𝑨 ∈ 𝕄𝑛sa , the square of the diagonal entries of 𝑨 is


entrywise bounded by the diagonal entries of 𝑨 2 . That is, we have

diag (𝑨) 2 4 diag (𝑨 2 ) for each s.a. 𝑨 .

Indeed, diag : 𝕄𝑛 → 𝕄𝑛 is a positive unital map. Apply Kadison’s inequality.


2. For a pd matrix 𝑨 ∈ 𝕄𝑛 , the inverse of the diagonal entries of 𝑨 is entrywise
bounded by the diagonal entries of 𝑨 −1 . That is,

diag (𝑨) −1 4 diag (𝑨 −1 ) for pd 𝑨 .

This result follows from the Choi inequality because diag is strictly positive and
unital.

You may wish to develop analogous results for other unital positive linear maps from
our list of examples. 
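Both diagonal inequalities are easy to test numerically. The sketch below (not part of the notes; random test matrices and ad hoc names) checks that each difference matrix has nonnegative eigenvalues.

```python
# Sketch: Kadison and Choi inequalities for the diagonal pinching.
import numpy as np

rng = np.random.default_rng(4)
n = 5
H = rng.standard_normal((n, n)); H = (H + H.T) / 2     # self-adjoint
Dg = lambda M: np.diag(np.diag(M))                     # the diag map

gap_kadison = Dg(H @ H) - Dg(H) @ Dg(H)                # diag(H^2) - diag(H)^2
print(np.linalg.eigvalsh(gap_kadison).min() >= -1e-12)

A = H @ H + np.eye(n)                                  # positive definite
gap_choi = Dg(np.linalg.inv(A)) - np.linalg.inv(Dg(A)) # diag(A^{-1}) - diag(A)^{-1}
print(np.linalg.eigvalsh(gap_choi).min() >= -1e-12)
```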

12.6 Russo–Dye theorem


In the previous section, we studied some basic properties of positive linear maps. Apart
from these properties, we also want to know how much a positive map can dilate a
matrix. This question leads us to equip linear maps with a norm and to find expressions
for the norm of a positive linear map.

Theorem 12.22 (Russo–Dye). Let 𝚽 be a unital positive linear map. Then

k𝚽k B sup{k𝚽(𝑨)k : k𝑨 k ≤ 1} = 1,

where k·k is the spectral norm (Schatten ∞-norm).

Corollary 12.23 (Russo–Dye). Let 𝚽 be a positive linear map. Then k𝚽k = k𝚽( I)k .
We will prove these results in the upcoming subsections. There are many interesting
applications of these results. See Problem Set 3 for examples.

12.6.1 Contractions
To begin, we need a basic fact from operator theory.
Proposition 12.24 (Contractions). Each contraction 𝑲 ∈ 𝕄𝑛 can be written as an average
of two unitary matrices:

𝑲 = (1/2) (𝑼 + 𝑽 ) where 𝑼 ,𝑽 ∈ 𝕄𝑛 are unitary.

(A contraction is a matrix 𝑲 ∈ 𝕄𝑛 that satisfies k𝑲 k ≤ 1.)


Exercise 12.25 (Contractions). Prove Proposition 12.24. Hint: Each singular value 𝜎 𝑗 of
𝑲 satisfies 𝜎 𝑗 ∈ [0, 1] . Observe that 𝜎 𝑗 = cos (𝜃 𝑗 ) = (1/2) ( e^{i𝜃 𝑗 } + e^{−i𝜃 𝑗 } ) for a number
𝜃 𝑗 ∈ [0, 2𝜋) .
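The hint is constructive, and it is easy to carry out numerically. The sketch below (not part of the notes; variable names are ad hoc) builds the two unitaries from an SVD of a contraction and verifies the average.

```python
# Sketch: writing a contraction as the average of two unitaries via its SVD.
import numpy as np

rng = np.random.default_rng(5)
n = 4
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
K = G / np.linalg.norm(G, 2)                # a contraction: ||K|| <= 1

W, s, Vh = np.linalg.svd(K)                 # K = W diag(s) Vh with s_j in [0, 1]
theta = np.arccos(np.clip(s, 0.0, 1.0))     # s_j = cos(theta_j)
U = W @ np.diag(np.exp(1j * theta)) @ Vh    # unitary
V = W @ np.diag(np.exp(-1j * theta)) @ Vh   # unitary

print(np.allclose(K, (U + V) / 2))
print(np.allclose(U.conj().T @ U, np.eye(n)), np.allclose(V.conj().T @ V, np.eye(n)))
```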

12.6.2 Proof of Theorem 12.22


Let us establish the Russo–Dye theorem. We begin with the case of a unitary matrix.
We will show that k𝚽(𝑼 ) k ≤ 1 for each unitary 𝑼 . Adapting the proof of the Kadison
inequality, we can show that the following block matrix is psd:
 
[ I 𝚽(𝑼 ) ; 𝚽(𝑼 ) ∗ I ] < 0.

For this purpose, recall that each unitary matrix has a spectral resolution where the
eigenvalues are complex numbers with modulus one. By the Schur complement
theorem, we have
𝚽(𝑼 ) ∗ 𝚽(𝑼 ) 4 I.
This is equivalent to k𝚽(𝑼 ) k ≤ 1.
For a general matrix 𝑨 ∈ 𝕄𝑛 with k𝑨 k ≤ 1, note that 𝑨 is a contraction. Thus, we
can write
𝑨 = (1/2) (𝑼 + 𝑽 )
for some unitary matrices 𝑼 and 𝑽 . By linearity of 𝚽 and the triangle inequality for
the norm, we have

k𝚽(𝑨) k = k (1/2) 𝚽(𝑼 ) + (1/2) 𝚽(𝑽 ) k ≤ (1/2) k𝚽(𝑼 )k + (1/2) k𝚽(𝑽 )k ≤ 1.


We have shown that

k𝚽k = sup{k𝚽(𝑨)k : k𝑨 k ≤ 1} ≤ 1.
To finish, note that 𝚽( I) = I because 𝚽 is unital. Therefore, we may conclude that
k𝚽k ≥ k𝚽( I)k = 1.

12.6.3 Proof of Corollary 12.23


We now turn to the proof of the corollary. First, assume that 𝚽 is strictly positive. In
this case, the matrix 𝑷 B 𝚽( I) is positive definite. We can form another linear map 𝚿
by the expression
𝚿(𝑨) B 𝑷 −1/2 𝚽(𝑨)𝑷 −1/2 .
By the conjugation rule, the map 𝚿 is positive and linear. It is also unital because

𝚿( I) B 𝑷 −1/2 𝚽( I)𝑷 −1/2 = 𝑷 −1/2𝑷 𝑷 −1/2 = I.


Thus, by the Russo–Dye theorem, we conclude that k𝚿k = 1. For any contraction
𝑨 ∈ 𝕄𝑛 with k𝑨 k ≤ 1, we have
k𝚽(𝑨) k = k𝑷 1/2 𝚿(𝑨)𝑷 1/2 k ≤ k𝑷 kk𝚿(𝑨)k ≤ k𝑷 k = k𝚽( I)k.

Considering 𝑨 = I, we realize that k𝚽k = k𝚽( I)k .


To complete the argument, we assume that 𝚽 is positive, but maybe not strictly
positive. Consider the following family of strictly positive linear maps:

𝚽𝜀 (𝑨) B 𝚽(𝑨) + 𝜀 ( tr 𝑨) I for 𝜀 > 0.

We can apply the result of the last paragraph to see that

k𝚽𝜀 k = k𝚽𝜀 ( I)k = kΦ( I) + 𝜀 I k.

By continuity of the norm, we can let 𝜀 ↓ 0 to conclude that k𝚽k = k𝚽( I)k for any
positive linear map 𝚽.

Notes
This lecture is adapted from Bhatia’s book on positive-definite matrices [Bha07b,
Chap. 2].

Lecture bibliography
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
13. Matrix Monotonicity and Convexity

Date: 15 February 2022 Scribe: Nico Christianson

Agenda:
1. Monotonicity and Convexity
2. Examples
3. Matrix Jensen

In the last lecture, we examined positive linear maps, a class of matrix-valued linear
maps that respect the positive semidefinite order. Now, we go beyond linear maps,
establishing notions of monotonicity and convexity for nonlinear matrix-valued functions
with respect to the semidefinite order. It is found that these properties are more
restrictive than their scalar counterparts. Yet, this restriction brings with it considerable
added structure. In particular, we demonstrate that matrix convex functions satisfy a
remarkable generalization of Jensen’s inequality to non-commutative “matrix convex
combinations.”

13.1 Basic definitions and properties


We begin by recalling some notions from previous lectures. A linear map 𝚽 : 𝕄𝑛 → 𝕄𝑚
is said to be positive if and only if it enjoys the monotonicity property
𝑨4𝑩 implies 𝚽(𝑨) 4 𝚽(𝑩)
for all 𝑨, 𝑩 ∈ ℍ𝑛 . It follows trivially by linearity that a positive linear map is convex;
i.e.,
𝚽(𝜏𝑨 + 𝜏¯𝑩) 4 𝜏𝚽(𝑨) + 𝜏¯𝚽(𝑩) for all 𝜏 ∈ [ 0, 1] ,
where 𝑨, 𝑩 ∈ ℍ𝑛 and 𝜏¯ B 1 −𝜏 . In this lecture, we examine the class of nonlinear func-
tions that exhibit monotonicity and convexity with respect to the positive semidefinite
order.

13.1.1 Standard matrix functions


We begin with several definitions.
Definition 13.1 Let I ⊆ ℝ be an interval of the real line. We define ℍ𝑛 ( I) as the set
of Hermitian matrices with all eigenvalues lying in the interval I. That is,

ℍ𝑛 ( I) = {𝑨 ∈ 𝕄𝑛 : 𝑨 = 𝑨 ∗ and 𝜆𝑖 (𝑨) ∈ I for each 𝑖 = 1, . . . , 𝑛} .

It follows via Rayleigh–Ritz that ℍ𝑛 ( I) is convex.

With ℍ𝑛 ( I) defined, we can provide a definition of standard matrix functions that


extends to functions whose domain is a proper subset of the real line, such as the
inverse, the square root, the logarithm, and entropy.

Definition 13.2 (Standard matrix function). Let 𝑓 : I → ℝ be a function on an interval


I ⊆ ℝ. For each 𝑛 ∈ ℕ, define a matrix function 𝑓 : ℍ𝑛 ( I) → ℍ𝑛 via the spectral
resolution. That is, for any 𝑛 ∈ ℕ and each matrix 𝑨 ∈ ℍ𝑛 ( I) , we define
𝑓 (𝑨) = ∑_{𝑖 =1}^{𝑛} 𝑓 (𝜆𝑖 )𝑷 𝑖 where 𝑨 = ∑_{𝑖 =1}^{𝑛} 𝜆𝑖 𝑷 𝑖 .

Note that standard matrix functions are unitarily equivariant. That is, if 𝑼 ∈ 𝕄𝑛 is
unitary, then 𝑓 (𝑼 ∗ 𝑨𝑼 ) = 𝑼 ∗ 𝑓 (𝑨)𝑼 .

13.1.2 Monotonicity and Convexity


This broadened definition of standard matrix functions enables a natural generalization
of the properties of monotonicity [Löw34] and convexity [Kra36] to matrices. These properties are sometimes called
operator monotonicity and convexity.
Definition 13.3 (Matrix monotonicity; Loewner 1934). A function 𝑓 : I → ℝ is matrix
monotone on I when 𝑨 4 𝑩 implies 𝑓 (𝑨) 4 𝑓 (𝑩) for all 𝑨, 𝑩 ∈ ℍ𝑛 ( I) and all
𝑛 ∈ ℕ.

Definition 13.4 (Matrix convexity; Kraus 1936). A function 𝑓 : I → ℝ is matrix convex


on I when
𝑓 (𝜏𝑨 + 𝜏¯𝑩) 4 𝜏 𝑓 (𝑨) + 𝜏¯ 𝑓 (𝑩) for all 𝜏 ∈ [0, 1],
for all matrices 𝑨, 𝑩 ∈ ℍ𝑛 ( I) and all 𝑛 ∈ ℕ, where 𝜏¯ = 1 − 𝜏 . We say that a
function 𝑔 : I → ℝ is matrix concave when −𝑔 is matrix convex.

13.1.3 Basic properties


We briefly report some of the properties that are immediate from the definitions of
matrix monotonicity and convexity.
Proposition 13.5 (Matrix monotonicity). The following statements hold true.
1. Scalar case. If 𝑓 is matrix monotone on I, then 𝑓 is increasing on I.
2. Convex cone. If 𝑓 , 𝑔 are matrix monotone on I, then for all 𝛼, 𝛽 ≥ 0, the weighted
combination 𝛼 𝑓 + 𝛽 𝑔 is matrix monotone on I.
3. Closure. If 𝑓𝑘 → 𝑓 pointwise on I and each 𝑓𝑘 is matrix monotone on I, then the
limit 𝑓 is also matrix monotone on I.

The key takeaway of Proposition 13.5 is that the class of matrix monotone functions
forms a closed, convex cone. As we will see later, matrix monotone functions are in
fact a proper subset of the cone of scalar monotone functions.
limit 𝑓 is also matrix monotone on I.

Proof sketch. The first two properties can be verified easily. The third property requires
use of the fact that the cone of psd matrices is closed. 
Exercise 13.6 (Matrix convexity). Verify analogous properties for the class of matrix convex
functions. That is, characterize the scalar case, and check that the class of matrix
convex functions forms a closed, convex cone.

13.2 Examples
In this section, we detail a variety of functions that are (or are not) matrix monotone
and convex.

13.2.1 Affine functions


Consider the affine function 𝑓 (𝑡 ) = 𝛼𝑡 + 𝛽 with 𝛼, 𝛽 ∈ ℝ. The function 𝑓 is matrix
convex on ℝ regardless of the choice of 𝛼, 𝛽 . Moreover, 𝑓 is matrix monotone on ℝ so
long as it is increasing, i.e., if 𝛼 ≥ 0. (Increasing affine functions are the unique examples
of matrix monotone functions on the entire real line!)

13.2.2 Square function


Consider the square function 𝑓 (𝑡 ) = 𝑡 2 . This function is not matrix monotone on ℝ+ .
To see this, consider the matrices 𝑨, 𝑩 ∈ ℍ2 (ℝ+ ) defined as
   
𝑨 = [ 1 1 ; 1 1 ] and 𝑩 = [ 2 1 ; 1 1 ] .

Clearly 𝑨 4 𝑩 . Yet

𝑨² = [ 2 2 ; 2 2 ] is not 4 [ 5 3 ; 3 2 ] = 𝑩² ,

since 𝑩² − 𝑨² = [ 3 1 ; 1 0 ] has a negative eigenvalue.
On the other hand, the square function 𝑓 is matrix convex on the entire real line.
More generally, positive quadratics 𝑓 (𝑡 ) = 𝛼𝑡 2 + 𝛽𝑡 + 𝛾 with 𝛼 ≥ 0 and 𝛽, 𝛾 ∈ ℝ
compose the full class of matrix convex functions on ℝ.
We prove the matrix convexity of the square function in the next proposition.
Proposition 13.7 (Square is matrix convex). The square function 𝑡 ↦→ 𝑡 2 is matrix convex
on ℝ.

Proof. We establish midpoint matrix convexity, which can be extended in a straightfor-


ward manner to general matrix convexity. Consider arbitrary 𝑨, 𝑩 ∈ ℍ𝑛 , and observe
that
0 4 (𝑨 − 𝑩) 2 = 𝑨 2 + 𝑩 2 − 𝑨𝑩 − 𝑩 𝑨 (13.1)
since the eigenvalues of a squared matrix are positive. Rearranging (13.1), we obtain

𝑨𝑩 + 𝑩 𝑨 4 𝑨 2 + 𝑩 2 . (13.2)

Therefore, it holds that


( (1/2) 𝑨 + (1/2) 𝑩 )² = (1/4) ( 𝑨² + 𝑩² + 𝑨𝑩 + 𝑩 𝑨 ) 4 (1/2) 𝑨² + (1/2) 𝑩²

where the semidefinite relation follows via (13.2). This calculation establishes midpoint
convexity. 
Exercise 13.8 (Square: Matrix convexity). Extend the preceding proof to obtain general
matrix convexity of the square function. That is, establish that

(𝜏𝑨 + 𝜏¯𝑩)² 4 𝜏𝑨² + 𝜏¯𝑩² for all 𝜏 ∈ [ 0, 1] ,

where 𝜏¯ = 1 − 𝜏 .

13.2.3 Inverse
The inverse function 𝑓 (𝑡 ) = 𝑡 −1 is matrix convex on ℝ++ B ( 0, ∞) , and its negative
𝑔 (𝑡 ) = −𝑡 −1 is matrix monotone on ℝ++ . We prove these properties separately in the
next two propositions.
Proposition 13.9 (Inverse: Monotonicity). The negative inverse function 𝑡 ↦→ −𝑡 −1 is matrix
monotone on ℝ++ .

Proof. By the conjugation rule, 0 ≺ 𝑨 4 𝑩 implies that I 4 𝑨 −1/2 𝑩 𝑨 −1/2 . Since


all eigenvalues of the identity are one, the inverse reverses this relation, so that
I < (𝑨 −1/2 𝑩 𝑨 −1/2 ) −1 = 𝑨 1/2 𝑩 −1 𝑨 1/2 . Applying the conjugation rule once more gives
𝑨 −1 < 𝑩 −1 . Last, negation reverses the positive-semidefinite order. 
Proposition 13.10 (Inverse: Convexity). The inverse function 𝑡 ↦→ 𝑡 −1 is matrix convex on
ℝ++ .
Proof. For positive definite matrices 𝑨, 𝑩 ≻ 0, it follows from the Schur complement
theorem that

[ 𝑨 I ; I 𝑨 −1 ] < 0 and [ 𝑩 I ; I 𝑩 −1 ] < 0.

Convexity of the positive-semidefinite cone implies that, for any 𝜏 ∈ [ 0, 1] and


𝜏¯ = 1 − 𝜏 ,

[ 𝜏𝑨 + 𝜏¯𝑩 I ; I 𝜏𝑨 −1 + 𝜏¯𝑩 −1 ] < 0. (13.3)
Applying the Schur complement theorem to (13.3), we obtain that

(𝜏𝑨 + 𝜏¯𝑩) −1 4 𝜏𝑨 −1 + 𝜏¯𝑩 −1 ,
which establishes matrix convexity. 

13.2.4 Power functions


We report (without proof) the matrix monotonicity and convexity statuses of the family
of power functions 𝑓 (𝑡 ) = 𝑡 𝑝 . Proof sketches for these facts will be presented in a
Lecture 15, on integral representations of matrix monotone and convex functions.
Fact 13.11 (Powers: Monotonicity). The following power functions (and only these) are
matrix monotone.
• The power function 𝑓 (𝑡 ) = 𝑡 𝑝 is matrix monotone on ℝ+ for 𝑝 ∈ [ 0, 1] .
• The power function 𝑓 (𝑡 ) = −𝑡 𝑝 is matrix monotone on ℝ++ for 𝑝 ∈ [−1, 0] .
The former result is due to Löwner [Löw34], though it is sometimes referred to as the
Löwner–Heinz theorem. 

Fact 13.12 (Powers: Convexity). The following power functions (and only these) are matrix
convex.
• The power function 𝑓 (𝑡 ) = 𝑡 𝑝 is matrix convex on ℝ+ for 𝑝 ∈ [ 1, 2] .
• The power function 𝑓 (𝑡 ) = −𝑡 𝑝 is matrix convex on ℝ+ for 𝑝 ∈ [ 0, 1] .
• The power function 𝑓 (𝑡 ) = 𝑡 𝑝 is matrix convex on ℝ++ for 𝑝 ∈ [−1, 0] .
These results follow as a consequence of Fact 13.11 by general considerations that will
be discussed in Lecture 15. 

13.2.5 Logarithm
The function 𝑓 (𝑡 ) = log 𝑡 is matrix monotone and concave on ℝ++ . We leave the proof
of this fact as a series of exercises.
Exercise 13.13 (Logarithm: Integral representation). For each 𝑎 > 0, verify that
log 𝑎 = ∫_0^∞ [ ( 1 + 𝜆) −1 − (𝑎 + 𝜆) −1 ] d𝜆.

Show that this integral representation extends to positive definite matrices via capital-
ization. For any positive definite 𝑨  0,
log 𝑨 = ∫_0^∞ [ ( 1 + 𝜆) −1 I − (𝑨 + 𝜆I) −1 ] d𝜆.

Exercise 13.14 (Inverse: Monotonicity and Convexity). For 𝜆 ≥ 0, show that the map
𝑓 (𝑡 ) = (𝜆 + 𝑡 ) −1 is matrix convex on ℝ++ , and that its negative −𝑓 is matrix monotone
on ℝ++ . Hint: The arguments follow those for the inverse.
Exercise 13.15 (Logarithm: Monotonicity and Concavity). Show that the logarithm 𝑓 (𝑡 ) =
log 𝑡 is matrix monotone and matrix concave on ℝ++ . Hint: apply the previous exercises
and the fact that matrix monotone and convex functions compose convex cones that
are closed under pointwise limits.

13.2.6 Entropy
The negative entropy 𝑓 (𝑡 ) = 𝑡 log 𝑡 is matrix convex on ℝ+ . We leave the proof of this
fact as an exercise.
Exercise 13.16 (Entropy: Convexity). Prove that the negative entropy is matrix convex on
ℝ+ . Hint: Use the identity
𝑡 log 𝑡 = ∫_0^∞ [ 𝑡 ( 1 + 𝜆) −1 − 𝑡 (𝑡 + 𝜆) −1 ] d𝜆

and follow the route charted by Exercises 13.13, 13.14, and 13.15.

13.2.7 Exponential
The exponential 𝑓 (𝑡 ) = e𝑡 is neither matrix monotone nor matrix convex on any
interval. The exponential’s failure to be matrix monotone or convex implies that the
classes of matrix monotone and convex functions are strictly smaller than their scalar
counterparts.

13.3 The matrix Jensen inequality


A striking fact about convexity in the scalar setting is that it is a self-improving property.
That is, by simply assuming that

𝑓 (𝜏𝑎 + 𝜏¯𝑏) ≤ 𝜏 𝑓 (𝑎) + 𝜏¯ 𝑓 (𝑏) for all 𝜏 ∈ [ 0, 1]

and all 𝑎, 𝑏 ∈ I, where 𝜏¯ = 1 − 𝜏 , we can obtain Jensen’s inequality. For any collection
of 𝑎𝑖 ∈ I and (𝑝𝑖 )_{𝑖 =1}^{𝑚} with ∑_{𝑖 =1}^{𝑚} 𝑝𝑖 = 1 and 𝑝𝑖 ≥ 0, we have

𝑓 ( ∑_{𝑖 =1}^{𝑚} 𝑝𝑖 𝑎𝑖 ) ≤ ∑_{𝑖 =1}^{𝑚} 𝑝𝑖 𝑓 (𝑎𝑖 ).

Even more, it holds that


𝑓 (𝔼 𝑋 ) ≤ 𝔼 𝑓 (𝑋 )
for all integrable random variables 𝑋 taking values in I.
As it turns out, and as we shall prove in this section, convexity is self-improving in
the matrix setting. Moreover, this self-improvement is even more dramatic, extend-
ing beyond simple scalar convex combinations to noncommutative “matrix convex
combinations,” which we define as follows.

Definition 13.17 (Matrix convex combination). Let (𝑨 𝑖 )𝑖𝑚=1 be a collection of self-adjoint


matrices in ℍ𝑛 . Let (𝑲 𝑖 )𝑖𝑚=1 consist of general matrices in 𝕄𝑛 that satisfy
∑_{𝑖 =1}^{𝑚} 𝑲 𝑖 ∗ 𝑲 𝑖 = I. In analogy to the scalar case, the sum

∑_{𝑖 =1}^{𝑚} 𝑲 𝑖 ∗ 𝑨𝑖 𝑲 𝑖

is called a matrix convex combination of the matrices (𝑨 𝑖 )𝑖𝑚=1 .

The condition ∑_{𝑖 =1}^{𝑚} 𝑲 𝑖 ∗ 𝑲 𝑖 = I on the “coefficients” of a matrix convex combination
is analogous to the normalization condition ∑_{𝑖 =1}^{𝑚} 𝑝𝑖 = 1 in the scalar case. Moreover,
scalar convex combinations can be recovered by selecting 𝑲 𝑖 = 𝑝𝑖 ^{1/2} I for each 𝑖 =
1, . . . , 𝑚 . However, matrix convex combinations significantly generalize the scalar
case, as illustrated in the following example.

Example 13.18 (A matrix convex combination). Consider the matrices


   
𝑲 1 = [ 0 1 ; 0 0 ] and 𝑲 2 = [ 0 0 ; 1 0 ] .

Clearly it holds that 𝑲 1 ∗ 𝑲 1 + 𝑲 2 ∗ 𝑲 2 = I. For arbitrary matrices 𝑨, 𝑩 ∈ ℍ2 , it can be


verified that

𝑲 1 ∗ 𝑨𝑲 1 + 𝑲 2 ∗ 𝑩𝑲 2 = [ 𝑏 22 0 ; 0 𝑎 11 ] ,
which is obviously not a scalar convex combination of the matrices 𝑨 and 𝑩 . 
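Looking ahead to Theorem 13.19, this example is also a convenient numerical test case. The sketch below (not part of the notes; random test matrices) checks the matrix Jensen inequality for the matrix convex function 𝑓 (𝑡 ) = 𝑡 ² with these coefficients 𝑲 1 , 𝑲 2 .

```python
# Sketch: matrix Jensen inequality for f(t) = t^2 with the coefficients K_1, K_2 above.
import numpy as np

K1 = np.array([[0.0, 1.0], [0.0, 0.0]])
K2 = np.array([[0.0, 0.0], [1.0, 0.0]])
assert np.allclose(K1.T @ K1 + K2.T @ K2, np.eye(2))   # normalization condition

rng = np.random.default_rng(6)
A = rng.standard_normal((2, 2)); A = (A + A.T) / 2
B = rng.standard_normal((2, 2)); B = (B + B.T) / 2

f = lambda M: M @ M                                    # the standard matrix function t -> t^2
lhs = f(K1.T @ A @ K1 + K2.T @ B @ K2)
rhs = K1.T @ f(A) @ K1 + K2.T @ f(B) @ K2
print(np.linalg.eigvalsh(rhs - lhs).min() >= -1e-12)   # rhs - lhs should be psd
```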

The remarkable main result of this section is that matrix convexity is self-improving
to an extent far beyond the scalar case. That is, matrix convexity of a function
𝑓 : I → ℝ implies a more general form of Jensen’s inequality that holds for arbitrary
matrix convex combinations [HP82; HP03].

Theorem 13.19 (Matrix Jensen inequality; Hansen–Pedersen 1982, 2003). Fix a matrix
convex function 𝑓 : I → ℝ. Let (𝑨 𝑖 )𝑖𝑚=1 be a collection of matrices, each residing
in ℍ𝑛 ( I) , and let (𝑲 𝑖 )𝑖𝑚=1 consist of matrices in 𝕄𝑛 that satisfy the normalization
condition ∑_{𝑖 =1}^{𝑚} 𝑲 𝑖 ∗ 𝑲 𝑖 = I. Then

𝑓 ( ∑_{𝑖 =1}^{𝑚} 𝑲 𝑖 ∗ 𝑨𝑖 𝑲 𝑖 ) 4 ∑_{𝑖 =1}^{𝑚} 𝑲 𝑖 ∗ 𝑓 (𝑨 𝑖 )𝑲 𝑖 .

Proof of Theorem 13.19. We present the proof in the case that 𝑚 = 2. That is, we will
prove that

𝑓 (𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ) 4 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2
when 𝑲 1 ∗ 𝑲 1 + 𝑲 2 ∗ 𝑲 2 = I. The case of general 𝑚 follows a similar argument, and in
fact follows as a corollary of this case.
To begin, we introduce the block matrix

𝑨 B [ 𝑨 1 0 ; 0 𝑨 2 ] ∈ ℍ2𝑛 ( I).
The key idea in the proof is to lift our attention to this block matrix. We will reinterpret
the matrix convex combinations in the matrix Jensen inequality in terms of operations
on the block matrix that involve simple averages, unitary conjugation, and positive
linear maps. By this mechanism, we can access the definition of matrix convexity and
exploit the unitary equivariance of standard matrix functions.
We preface the proof with four tricks that will be employed in the execution of the
proof. First, note that it is straightforward to apply a standard matrix function to a
block diagonal matrix. If 𝑻 , 𝑴 ∈ ℍ𝑛 ( I) and 𝑓 : I → ℝ, we simply have
   
𝑓 ( [ 𝑻 0 ; 0 𝑴 ] ) = [ 𝑓 (𝑻 ) 0 ; 0 𝑓 (𝑴 ) ] . (13.4)
In view of this fact, it is helpful to develop methods for extracting the block-diagonal
part of a block matrix.
Indeed, we can represent the block diagonal pinching of a matrix as a simple
average of unitary conjugates. That is, defining the unitary block diagonal matrix
 
𝑼 B [ I 0 ; 0 −I ] ,

we have the identity


     
(1/2) [ 𝑻 𝑩 ; 𝑩 ∗ 𝑴 ] + (1/2) 𝑼 ∗ [ 𝑻 𝑩 ; 𝑩 ∗ 𝑴 ] 𝑼 = [ 𝑻 0 ; 0 𝑴 ] (13.5)

where 𝑩 ∈ 𝕄𝑛 is an arbitrary matrix.


Third, we shall represent the matrix convex combination via unitary conjugation.
Observe that 𝑲 1 ∗ 𝑲 1 + 𝑲 2 ∗ 𝑲 2 = I implies that the block matrix [ 𝑲 1 ∗ 𝑲 2 ∗ ] ∗ has
orthonormal columns; thus by extending the columns into an orthonormal basis, we
can extend it into a unitary matrix 𝑸 :

𝑸 = [ 𝑲 1 𝑳 1 ; 𝑲 2 𝑳 2 ] where 𝑸 ∗𝑸 = 𝑸𝑸 ∗ = I.

Conjugating 𝑨 with 𝑸 , we have

𝑸 ∗ 𝑨𝑸 = [ 𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ∗ ; ∗ ∗ ] (13.6)

where the asterisks indicate blocks of the matrix that are irrelevant to the argument.
Last, recall that the map [·] 11 that extracts the top left ( 1, 1) block of a block matrix
is a positive linear map, and hence it preserves the positive-semidefinite order. For
our purposes, this map extracts the top left block of 𝑸 ∗ 𝑨𝑸 specified in (13.6). For
example, it holds that

[𝑸 ∗ 𝑨𝑸 ] 11 = 𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ;
[𝑸 ∗ 𝑓 (𝑨)𝑸 ] 11 = 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2 .

The second relation exploits the fact that 𝑨 is a block diagonal matrix so we can
identify 𝑓 (𝑨) .
Armed with these tricks, we may now proceed with the proof. We will maintain
the quantities of interest in the (1, 1) block of the block matrix, while the remaining
blocks remain at our discretion. To begin, observe that

𝑓 (𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ) = 𝑓 ([𝑸 ∗ 𝑨𝑸 ] 11 )

by the identity (13.6). Since each block matrix in the pinching identity (13.5) has the
same first diagonal block, it then follows that
  
𝑓 ( [𝑸 ∗ 𝑨𝑸 ] 11 ) = 𝑓 ( [ (1/2) 𝑸 ∗ 𝑨𝑸 + (1/2) 𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ] 11 ) .

The identity (13.5) ensures that 12 𝑸 ∗ 𝑨𝑸 + 12 𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 is block diagonal. Then


using the first identity (13.4), the application of a standard matrix function to a block
diagonal matrix, we can pull the map [·] 11 outside 𝑓 :
  
𝑓 ( [ (1/2) 𝑸 ∗ 𝑨𝑸 + (1/2) 𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ] 11 ) = [ 𝑓 ( (1/2) 𝑸 ∗ 𝑨𝑸 + (1/2) 𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ) ] 11 .

By matrix convexity of 𝑓 , it holds that


𝑓 ( (1/2) 𝑸 ∗ 𝑨𝑸 + (1/2) 𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ) 4 (1/2) 𝑓 (𝑸 ∗ 𝑨𝑸 ) + (1/2) 𝑓 (𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ).

This inequality transmits through the positive linear map [·] 11 . Thus,
[ 𝑓 ( (1/2) 𝑸 ∗ 𝑨𝑸 + (1/2) 𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ) ] 11 4 [ (1/2) 𝑓 (𝑸 ∗ 𝑨𝑸 ) + (1/2) 𝑓 (𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ) ] 11 .

Since standard matrix functions commute with unitary conjugation, we have

[ (1/2) 𝑓 (𝑸 ∗ 𝑨𝑸 ) + (1/2) 𝑓 (𝑼 ∗ (𝑸 ∗ 𝑨𝑸 )𝑼 ) ] 11 = [ (1/2) 𝑸 ∗ 𝑓 (𝑨)𝑸 + (1/2) 𝑼 ∗𝑸 ∗ 𝑓 (𝑨)𝑸𝑼 ] 11 .
To complete the argument, we reverse our course to undo each of the steps. Once
more, using the fact that each block matrix in the pinching identity (13.5) has the same
first diagonal block, we conclude that

𝑸 ∗ 𝑓 (𝑨)𝑸 + 21 𝑼 ∗𝑸 ∗ 𝑓 (𝑨)𝑸𝑼 = [𝑸 ∗ 𝑓 (𝑨)𝑸 ] 11


1 
2 11
= 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2 .
Sequencing the preceding displays, we have obtained that

𝑓 (𝑲 1 ∗ 𝑨 1 𝑲 1 + 𝑲 2 ∗ 𝑨 2 𝑲 2 ) 4 𝑲 1 ∗ 𝑓 (𝑨 1 )𝑲 1 + 𝑲 2 ∗ 𝑓 (𝑨 2 )𝑲 2 ,
which is the desired result. 

Notes
This lecture is adapted from Bhatia’s book [Bha97, Chap. V] and from the instructor’s
monograph [Tro15] on matrix concentration.
The proof of the matrix Jensen inequality is drawn from Hansen & Pedersen’s
second paper [HP03]. Using a similar type of argument, Davis [Dav57] had long since
proved a weaker version of the matrix Jensen inequality (Theorem 13.19) under the
extra assumptions that 0 ∈ I and 𝑓 ( 0) = 0. By bringing in deeper tools, Choi [Cho74]
strengthened the result further by removing the conditions, and he extended it to a
wider class of averaging operations.
The significance of the Hansen–Pedersen result is that it admits a direct (although
clever) proof. They developed this argument as the first step in their proof of Loewner’s
integral theorem, a deep result on matrix monotone and matrix convex functions.
In fact, Loewner’s theorem is the key ingredient in the proof of Choi’s generalization.
We will return to these matters in Lecture 15.

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Cho74] M.-D. Choi. “A Schwarz inequality for positive linear maps on 𝐶 ∗ -algebras”. In:
Illinois Journal of Mathematics 18.4 (1974), pages 565 –574. doi: 10.1215/ijm/
1256051007.
[Dav57] C. Davis. “A Schwarz inequality for convex operator functions”. In: Proc. Amer.
Math. Soc. 8 (1957), pages 42–44. doi: 10.2307/2032808.
[HP82] F. Hansen and G. K. Pedersen. “Jensen’s Inequality for Operators and Löwner’s
Theorem.” In: Mathematische Annalen 258 (1982), pages 229–241.
[HP03] F. Hansen and G. K. Pedersen. “Jensen’s Operator Inequality”. In: Bulletin of the
London Mathematical Society 35.4 (2003), pages 553–564.
[Kra36] F. Kraus. “Über konvexe Matrixfunktionen.” ger. In: Mathematische Zeitschrift 41
(1936), pages 18–42. url: http://eudml.org/doc/168648.
[Löw34] K. Löwner. “Über monotone Matrixfunktionen”. In: Mathematische Zeitschrift 38
(1934), pages 177–216. url: http://eudml.org/doc/168495.
[Tro15] J. A. Tropp. “An Introduction to Matrix Concentration Inequalities”. In: Foundations
and Trends in Machine Learning 8.1-2 (2015), pages 1–230.
14. Monotonicity: Differential Characterization

Date: 17 February 2022 Scribe: Nicholas H. Nelsen

Agenda:
1. Recap
2. Derivatives and differences
3. Loewner's matrix
4. Fréchet derivatives
5. Daleckii–Krein theorem
6. Proof of Loewner theorem
7. Examples

In this lecture, we continue to analyze the monotonicity and convexity of nonlinear
standard matrix functions. Motivated by scalar tests for monotonicity and convexity,
we seek analogous results in the matrix function setting. Loewner's theorem, the
main result of the lecture, gives a precise characterization of matrix monotonicity; it
is necessary and sufficient to check that a particular kernel matrix—associated with
the nonlinear function in question and its derivatives—is positive semidefinite. The
remainder of the lecture is devoted to proving this result. To that end, we take a detour
to first discuss differentiation in normed vector spaces, explicitly characterize these
derivatives for standard matrix functions, and finally give some concrete examples
where Loewner's result may be applied.
where Loewner’s result may be applied.

14.1 Recap
The previous two lectures introduced matrix monotonicity and matrix convexity, first
in the context of positive linear maps (i.e., linear functions on matrices) and then in
the context of standard matrix functions (which have a more rigid structure but are
nonlinear). We obtained important convexity inequalities, such as Choi’s inequality for
positive linear maps and the matrix Jensen inequality for matrix convex functions.
To set the stage for this lecture, we next recall the following definitions. For an
interval I ⊆ ℝ, we define ℍ𝑛 ( I) B {𝑨 ∈ ℍ𝑛 : 𝜆𝑖 (𝑨) ∈ I for all 𝑖 }. This set I usually
serves as the domain of the standard matrix functions we will consider.

Definition 14.1 (Matrix monotonicity; Loewner 1934). A function 𝑓 : I → ℝ is matrix
monotone on I if

𝑨 4 𝑩 implies 𝑓 (𝑨) 4 𝑓 (𝑩)

for every 𝑨, 𝑩 ∈ ℍ𝑛 ( I) and every 𝑛 ∈ ℕ. The action of 𝑓 on a self-adjoint matrix is
interpreted in the sense of standard matrix functions.

Definition 14.2 (Matrix convexity; Kraus 1936). A function 𝑓 : I → ℝ is matrix convex


on I if
𝑓 (𝜏𝑨 + 𝜏¯𝑩) 4 𝜏 𝑓 (𝑨) + 𝜏¯ 𝑓 (𝑩) for all 𝜏 ∈ [0, 1] ,
where 𝜏¯ B 1 − 𝜏 , for every 𝑨, 𝑩 ∈ ℍ𝑛 ( I) and every 𝑛 ∈ ℕ.

Today, we will primarily focus on analyzing matrix monotonicity.

14.2 Differential characterizations


Many properties of a differentiable, scalar function may be deduced directly from
its derivative. Motivated by this observation, we first recap scalar derivative tests
for monotonicity and convexity. These tests are generalized to the matrix function
setting, through the machinery of divided differences, in the main results of the lecture:
Loewner’s theorem and the theorem of Aujla & Vasudeva.

14.2.1 Scalar case


We begin our study of monotonicity and convexity by drawing intuition from the scalar
setting. Recall that, if 𝑓 : I → ℝ is differentiable, then
1. 𝑓 is increasing on I if and only if 𝑓 ′ (𝑡 ) ≥ 0 for all 𝑡 ∈ I ,
2. 𝑓 is convex on I if and only if 𝑓 (𝑡 ) − 𝑓 (𝑠 ) ≥ 𝑓 ′ (𝑠 )(𝑡 − 𝑠 ) for all 𝑡 , 𝑠 ∈ I .
These conditions are useful in practice because they can be easily verified, provided the
first derivative exists. However, it turns out that the first (scalar) derivative does not
carry enough information to characterize monotonicity and convexity when univariate
functions are lifted to the space of self-adjoint matrices. It is then natural to ask:
Are there analogous differential characterizations in the matrix setting?
The answer is both remarkable and beautiful.

14.2.2 Divided differences


To answer this question, we turn to an object initiated by Newton that arises frequently
in numerical analysis and approximation theory.

Definition 14.3 (Divided difference). Let 𝑓 : I → ℝ be continuously differentiable on


the open interval I ⊆ ℝ. Then the first divided difference 𝑓 [1 ] : I × I → ℝ is the
bivariate function
𝑓 [1] (𝑠 , 𝑡 ) B ( 𝑓 (𝑠 ) − 𝑓 (𝑡 ) ) / ( 𝑠 − 𝑡 ) when 𝑠 ≠ 𝑡 , and 𝑓 [1] (𝑡 , 𝑡 ) B 𝑓 ′ (𝑡 ).

Intuitively, the first divided difference records the slopes of secants and tangents to 𝑓 .
Tabulating the bivariate function 𝑓 [1 ] at a set of points produces Loewner’s matrix.
Definition 14.4 (Loewner Matrix). Let 𝑓 : I → ℝ be continuously differentiable on the
open interval I ⊆ ℝ. For 𝑛 ∈ ℕ, let 𝚲 = diag (𝜆 1 , . . . , 𝜆𝑛 ) ∈ ℍ𝑛 ( I) be any diagonal
matrix with entries in I. Then the Loewner matrix of 𝑓 at 𝚲 is

𝑳 𝑓 (𝚲) B 𝑓 [1] (𝚲) B [ 𝑓 [1] (𝜆𝑖 , 𝜆 𝑗 ) ]_{𝑖 ,𝑗 =1,...,𝑛} ∈ ℍ𝑛 .

This is essentially the “kernel matrix” of the function 𝑓 [1] on the dataset of points
(𝜆𝑘 : 𝑘 = 1, . . . , 𝑛) . This observation leads to connections with the theory of
positive-definite functions, discussed in Lecture 18.

The first divided difference and Loewner’s matrix play a key role in generalizing scalar
differential characterizations of monotonicity and convexity to the matrix setting.
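Loewner's matrix is simple to tabulate, which makes the criterion in the next theorem easy to probe experimentally. The sketch below (not part of the notes; the helper and sample points are ad hoc) builds 𝑓 [1] (𝚲) for 𝑓 (𝑡 ) = √𝑡 , which is matrix monotone on ℝ+ , and for 𝑓 (𝑡 ) = 𝑡 ², which is not.

```python
# Sketch: tabulating the Loewner matrix and testing positive semidefiniteness.
import numpy as np

def loewner_matrix(f, fprime, lam):
    lam = np.asarray(lam, dtype=float)
    L = np.empty((len(lam), len(lam)))
    for i, s in enumerate(lam):
        for j, t in enumerate(lam):
            L[i, j] = fprime(t) if s == t else (f(s) - f(t)) / (s - t)
    return L

lam = [0.5, 1.0, 2.0, 5.0]
L_sqrt = loewner_matrix(np.sqrt, lambda t: 0.5 / np.sqrt(t), lam)
L_sq = loewner_matrix(lambda t: t**2, lambda t: 2.0 * t, lam)

print(np.linalg.eigvalsh(L_sqrt).min() >= -1e-12)   # True: consistent with monotonicity
print(np.linalg.eigvalsh(L_sq).min() >= -1e-12)     # False: t^2 is not matrix monotone
```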

14.2.3 Matrix setting


We are now ready to present the main results of this lecture.

Theorem 14.5 (Loewner 1934). Assume that 𝑓 : I → ℝ is continuously differentiable


on an open interval I ⊆ ℝ. Then
𝑓 is matrix monotone on I if and only if 𝑓 [1] (𝚲) < 0

for every diagonal 𝚲 ∈ ℍ𝑛 ( I) and for every 𝑛 ∈ ℕ.

We see that 𝑳 𝑓 in the matrix setting plays the same role as does 𝑓 ′ in the scalar setting.
We will prove Loewner’s theorem later in Section 14.4.
There is an analogous result [AV95] for matrix convexity, albeit established much
later than Loewner’s result.

Theorem 14.6 (Aujla–Vasudeva 1995). Assume that 𝑓 : I → ℝ is continuously differen-


tiable on an open interval I ⊆ ℝ. Then

𝑓 is matrix convex on I if and only if 𝑓 (𝑩) − 𝑓 (𝑨) < 𝑓 [1] (𝑨) ⊙ (𝑩 − 𝑨)

for every 𝑩 ∈ ℍ𝑛 ( I) , every diagonal 𝑨 ∈ ℍ𝑛 ( I) , and every 𝑛 ∈ ℕ. (Recall that “⊙”
is the Schur or Hadamard entrywise product with respect to the standard basis.)

This theorem strongly parallels the differential characterization of scalar convexity.


Exercise 14.7 (Convexity: Differential characterization). Prove Theorem 14.6. Hint: Imitate
the proof of Theorem 14.5.
We conclude by stating another result that relates matrix convexity to matrix
monotonicity. This theorem plays an important role in more thorough treatments of
the Loewner theory.

Theorem 14.8 (Bendat–Sherman). Assume that 𝑓 : I → ℝ is twice continuously


differentiable on the open interval I ⊆ ℝ and matrix convex. Then for every 𝜇 ∈ I,
the function 𝑔 𝜇 : I → ℝ defined by

𝜆 ↦→ 𝑔 𝜇 (𝜆) B 𝑓 [1] (𝜇, 𝜆)

is matrix monotone.
The proof may be found in [Bha97, Theorem V.3.10].

14.3 Derivatives of standard matrix functions


Standard matrix functions are obtained by lifting a scalar function to matrices. It is
important to understand the differentiability properties of the standard matrix function.
To do this, we first define the Fréchet and Gâteaux derivative of maps between spaces
of matrices. The Daleckii–Krein formula (Theorem 14.13) instantiates these derivatives
explicitly for continuously differentiable standard matrix functions, and it will be used
in the proof of Loewner’s theorem in the next section.

14.3.1 Fréchet derivatives


To prove Theorem 14.5, we need to precisely describe what it means to differentiate
functions on spaces of matrices, and more generally, vector-valued maps. The key idea
is that the derivative is a linear approximation.

Definition 14.9 (Fréchet derivative). Let I be an open interval of ℝ, and let 𝑭 : ℍ𝑛 ( I) →


ℍ𝑛 be a matrix-valued function on self-adjoint matrices. The Fréchet derivative of
𝑭 at a point 𝑨 ∈ ℍ𝑛 ( I) is the linear map

D𝑭 (𝑨) : ℍ𝑛 → ℍ𝑛

defined by the property

k𝑭 (𝑨 + 𝑯 ) − 𝑭 (𝑨) − D𝑭 (𝑨)𝑯 k / k𝑯 k → 0

as 𝑯 → 0 in any norm on ℍ𝑛 , provided that such an object D𝑭 (𝑨) exists.

Aside: Although specialized to our self-adjoint matrix setting, the definition of the
Fréchet derivative easily extends to general infinite-dimensional Banach spaces [CP77].

Note that although the Fréchet derivative of 𝑭 at a point 𝑨 is linear, the mapping

D𝑭 : 𝑨 ↦→ D𝑭 (𝑨)

is nonlinear in general. This parallels the usual notion of scalar derivative from calculus.
Indeed, for 𝑓 : I → ℝ continuously differentiable, 𝑓 ′ is a nonlinear function but the
Fréchet derivative

D 𝑓 (𝑡 ) : ℎ ↦→ 𝑓 ′ (𝑡 )ℎ

is just the linear operator of scalar multiplication by 𝑓 ′ (𝑡 ) , for 𝑡 ∈ I, which can be
identified with 𝑓 ′ (𝑡 ) itself.
Exercise 14.10 (Derivative). Show that, if it exists, the Fréchet derivative is unique.
Exercise 14.11 (Linear map). Show that if 𝑭 : 𝑨 ↦→ 𝚽𝑨 is linear, that is, 𝚽 : ℍ𝑛 → ℍ𝑛 is
a linear map, then D𝑭 (𝑨) = 𝚽 for every 𝑨 ∈ ℍ𝑛 .
It is useful to relate the Fréchet derivative to another kind of derivative, the Gâteaux
derivative, that generalizes directional derivatives to Banach space. Indeed, if it exists ,
the Fréchet derivative of 𝑭 at 𝑨 parametrizes all the Gâteaux derivatives of 𝑭 at 𝑨 in
the sense that
D𝑭 (𝑨)𝑯 = (d/d𝑡 ) 𝑭 (𝑨 + 𝑡 𝑯 ) |_{𝑡 =0} (14.1)

for every 𝑯 . The right hand side is called the Gâteaux derivative of 𝑭 at 𝑨 in the
direction 𝑯 . (Aside: If 𝑭 is real-valued, this is usually called the first variation or
variational derivative in the calculus of variations.)

Warning 14.12 (Directional derivatives). The converse is false. A map can be Gâteaux
differentiable in every direction without being Fréchet differentiable. This phe-
nomenon already arises for functions on ℝ2 . 

14.3.2 Daleckii–Krein formula


The next result gives a concrete characterization of the Fréchet derivative of a standard
matrix function.
Theorem 14.13 (Daleckii–Krein). Assume that 𝑓 : I → ℝ is continuously differentiable
on the open interval I ⊆ ℝ. For diagonal 𝚲 ∈ ℍ𝑛 ( I) ,
D 𝑓 (𝚲)𝑯 = 𝑓 [1] (𝚲) ⊙ 𝑯 for every 𝑯 ∈ ℍ𝑛 . (14.2)

Moreover, for 𝑨 ∈ ℍ𝑛 ( I) with spectral decomposition 𝑨 = 𝑼 𝚲𝑼 ∗ ,


D 𝑓 (𝑨)𝑯 = 𝑓 [1] (𝑨) ⊙_𝑨 𝑯 C 𝑼 ( 𝑓 [1] (𝚲) ⊙ (𝑼 ∗𝑯𝑼 ) ) 𝑼 ∗ . (14.3)

The following lemma gives the derivative of a polynomial standard matrix function
and is used in a limiting argument in the proof of the Daleckii–Krein formula.
Lemma 14.14 (Polynomial: Derivative). Let 𝑓 : ℝ → ℝ be a polynomial. Then for 𝑯 ∈ ℍ𝑛 and diagonal 𝚲 ∈ ℍ𝑛 ,

    D𝑓(𝚲)𝑯 = 𝑓^[1](𝚲) ⊙ 𝑯 .    (14.4)

Proof. Notice that both sides of (14.4) are linear in 𝑓 since differentiation and point
evaluation are linear operations. Hence, without loss of generality, we may consider
the reduction to the monomial 𝑓 : 𝑥 ↦→ 𝑥 𝑝 for each 𝑝 ∈ ℤ+ .
To this end, recall the algebraic identity

    𝑩^𝑝 − 𝑪^𝑝 = ∑_{𝑘=0}^{𝑝−1} 𝑩^𝑘 (𝑩 − 𝑪) 𝑪^{𝑝−1−𝑘}

that holds for every 𝑩, 𝑪 ∈ 𝕄𝑛 . Using this, for 𝑨, 𝑯 ∈ ℍ𝑛 we compute

    [ (𝑨 + 𝑡𝑯)^𝑝 − 𝑨^𝑝 ] / 𝑡 = (1/𝑡) ∑_{𝑘=0}^{𝑝−1} (𝑨 + 𝑡𝑯)^𝑘 (𝑡𝑯) 𝑨^{𝑝−1−𝑘} → ∑_{𝑘=0}^{𝑝−1} 𝑨^𝑘 𝑯 𝑨^{𝑝−1−𝑘}

as 𝑡 ↓ 0. The right-hand side of the display is the Gâteaux derivative of 𝑓 at 𝑨 in the direction 𝑯 , and it is possible to verify by Taylor's theorem in Banach space [CP77] that the Fréchet derivative of 𝑓 exists and hence must agree with the Gâteaux derivative as in (14.1). In particular, this limit is zero when 𝑝 = 0 (the constant function) since the left-hand side in the above display is identically equal to the zero matrix in this case.
Let 𝚲 = diag(𝜆_1, . . . , 𝜆_𝑛). As 𝑓^[1](𝚲) = 0 for 𝑝 = 0, we have already verified (14.4) for 𝑝 = 0. Now let 𝑝 ∈ ℕ. Since the diagonal entries of 𝚲 may repeat, we must consider two cases, depending on whether the entries are the same or different.
For the first case, consider pairs 𝑖, 𝑗 ∈ {1, . . . , 𝑛} such that 𝜆_𝑖 ≠ 𝜆_𝑗 . We compute

    [ D𝑓(𝚲)𝑯 ]_{𝑖,𝑗} = [ ∑_{𝑘=0}^{𝑝−1} 𝚲^𝑘 𝑯 𝚲^{𝑝−1−𝑘} ]_{𝑖,𝑗}
                     = ∑_{𝑘=0}^{𝑝−1} ⟨ 𝜹_𝑖 , 𝚲^𝑘 𝑯 𝚲^{𝑝−1−𝑘} 𝜹_𝑗 ⟩
                     = ∑_{𝑘=0}^{𝑝−1} ⟨ 𝚲^𝑘 𝜹_𝑖 , 𝑯 (𝚲^{𝑝−1−𝑘} 𝜹_𝑗) ⟩
                     = ∑_{𝑘=0}^{𝑝−1} 𝜆_𝑖^𝑘 ℎ_{𝑖𝑗} 𝜆_𝑗^{𝑝−1−𝑘}
                     = ( ∑_{𝑘=0}^{𝑝−1} 𝜆_𝑖^𝑘 𝜆_𝑗^{𝑝−1−𝑘} ) ℎ_{𝑖𝑗}
                     = [ (𝜆_𝑖^𝑝 − 𝜆_𝑗^𝑝)/(𝜆_𝑖 − 𝜆_𝑗) ] ℎ_{𝑖𝑗} .

The last equality follows from an algebraic identity and uses the convention 0^0 = 1. Notice that since 𝜆_𝑖 ≠ 𝜆_𝑗 ,

    (𝜆_𝑖^𝑝 − 𝜆_𝑗^𝑝)/(𝜆_𝑖 − 𝜆_𝑗) = [ 𝑓^[1](𝚲) ]_{𝑖𝑗} .

We have attained the desired result.


For the second case, consider pairs 𝑖, 𝑗 ∈ {1, . . . , 𝑛} such that 𝜆_𝑖 = 𝜆_𝑗 . In particular, this includes the diagonal 𝑖 = 𝑗 . If 𝜆_𝑖 = 𝜆_𝑗 = 0, then 𝑓′(𝜆_𝑗) = 𝑓′(0) = 𝟙{𝑝=1}. Using this observation and continuing from the calculation above,

    [ D𝑓(𝚲)𝑯 ]_{𝑖,𝑗} = ∑_{𝑘=0}^{𝑝−1} 𝜆_𝑗^𝑘 𝜆_𝑗^{𝑝−1−𝑘} ℎ_{𝑖𝑗}
                     = ∑_{𝑘=0}^{𝑝−1} 𝜆_𝑗^{𝑝−1} ℎ_{𝑖𝑗}
                     = 𝑝 𝜆_𝑗^{𝑝−1} ℎ_{𝑖𝑗}
                     = 𝑓′(𝜆_𝑗) ℎ_{𝑖𝑗}
                     = [ 𝑓^[1](𝚲) ]_{𝑖𝑗} ℎ_{𝑖𝑗}
as asserted. 
Exercise 14.15 (Power: Divided differences). Verify the algebraic identity

    ∑_{𝑘=0}^{𝑝−1} 𝑎^𝑘 𝑏^{𝑝−1−𝑘} = (𝑎^𝑝 − 𝑏^𝑝)/(𝑎 − 𝑏)

for 𝑎 ≠ 𝑏 ∈ ℝ and 𝑝 ∈ ℕ, with the convention that 0^0 = 1.


We can now prove Theorem 14.13; a full treatment may be found in [Sim19, Chapter
5] or in [Bha97, Chapter V].

Proof sketch: Daleckii–Krein Formula. The argument proceeds by approximation. We


focus on the case where the eigenvalues of diagonal 𝚲 ∈ ℍ𝑛 ( I) are distinct. By an
extension of the Stone–Weierstrass theorem [Sim19, Proposition 5.4], there exists
a sequence of polynomials (𝑓𝑛 : 𝑛 ∈ ℕ) that converge to 𝑓 uniformly on compact
subintervals of I. Furthermore, the sequence ( 𝑓𝑛0 : 𝑛 ∈ ℕ) of derivatives converges to
𝑓 0 uniformly on compact subintervals of I. It follows that

    𝑓_𝑛^[1](𝜆_𝑖, 𝜆_𝑗) → 𝑓^[1](𝜆_𝑖, 𝜆_𝑗)   as 𝑛 → ∞

for each 𝑖, 𝑗 . By Lemma 14.14, for each 𝑯 ∈ ℍ𝑛 ,

    [ D𝑓_𝑛(𝚲)𝑯 ]_{𝑖𝑗} = 𝑓_𝑛^[1](𝜆_𝑖, 𝜆_𝑗) [𝑯]_{𝑖𝑗} → 𝑓^[1](𝜆_𝑖, 𝜆_𝑗) [𝑯]_{𝑖𝑗}   as 𝑛 → ∞

for each 𝑖 , 𝑗 . Since the convergence holds entrywise, it also holds in any UI matrix
norm by a characteristic polynomial argument. By [Sim19, Lemma 5.5(a)], the Gâteaux
derivative of 𝑓 exists and equals the limit of the left-hand side above, which yields
(14.2) if it can be shown that the Fréchet derivative exists. This step is found in the
proof of [Bha97, Theorem V.3.3].
Equation (14.3) follows from unitary equivariance of standard matrix functions.
More explicitly, it is easy to verify using the definition of polynomial standard matrix
functions and the calculation in the proof of Lemma 14.14 that
    D𝑓_𝑛(𝑨)𝑯 = 𝑼 [ 𝑓_𝑛^[1](𝚲) ⊙ (𝑼^∗𝑯𝑼) ] 𝑼^∗ .

A similar limiting argument as above concludes the proof sketch. 

14.4 Proof of Loewner’s theorem


We may now prove Theorem 14.5. Under the hypotheses, let us first assume that 𝑓 is matrix monotone on I. We need to show that 𝑓^[1](𝚲) ≽ 0 for every diagonal 𝚲 ∈ ℍ𝑛 (I), where 𝑛 ∈ ℕ. To this end, consider a specific perturbation matrix 𝑯 ≔ 𝟏𝟏^∗ ≽ 0. Then

    𝚲 + 𝑡𝑯 ≽ 𝚲   for 𝑡 ≥ 0

by definition of the psd order. Since I is open and the eigenvalue map 𝝀(·) is continuous, there exists 𝛿 > 0 such that 𝚲 + 𝑡𝑯 ∈ ℍ𝑛 (I) for all 0 ≤ 𝑡 < 𝛿 . By matrix monotonicity of 𝑓 on I,

    𝑓(𝚲 + 𝑡𝑯) ≽ 𝑓(𝚲)   when 0 ≤ 𝑡 < 𝛿 .

Since the psd cone is closed, we further deduce

    D𝑓(𝚲)𝑯 = lim_{𝑡↓0} [ 𝑓(𝚲 + 𝑡𝑯) − 𝑓(𝚲) ] / 𝑡 ≽ 0,

where we are allowed to equate the Fréchet and Gâteaux derivatives of 𝑓 above because existence of the Fréchet derivative is guaranteed by Theorem 14.13. Again by the Daleckii–Krein formula,

    0 ≼ D𝑓(𝚲)𝑯 = 𝑓^[1](𝚲) ⊙ 𝑯 = 𝑓^[1](𝚲) .

The forward implication is valid.


For the reverse implication, we assume 𝑓^[1](𝚲) ≽ 0 for every diagonal 𝚲 ∈ ℍ𝑛 (I). Let 𝑨_0, 𝑨_1 ∈ ℍ𝑛 (I) be arbitrary and satisfy 𝑨_1 ≽ 𝑨_0 . We need to show that 𝑓(𝑨_1) ≽ 𝑓(𝑨_0). The argument proceeds by interpolation. Introduce

    𝑨_𝜏 ≔ (1 − 𝜏)𝑨_0 + 𝜏𝑨_1 ∈ ℍ𝑛 (I)   for 𝜏 ∈ [0, 1] .

(The set ℍ𝑛 (I) is convex by the Rayleigh–Ritz representation of eigenvalues.) By assumption, the 𝜏-derivative satisfies

    𝑨̇_𝜏 ≔ (d/d𝜏) 𝑨_𝜏 = 𝑨_1 − 𝑨_0 ≽ 0 .

Write the eigenvalue decompositions of the parameterized matrix as 𝑨_𝜏 = 𝑼_𝜏 𝚲_𝜏 𝑼_𝜏^∗ . We use the fundamental theorem of calculus to obtain

    𝑓(𝑨_1) − 𝑓(𝑨_0) = ∫_0^1 (d/d𝜏) [ 𝑓(𝑨_𝜏) ] d𝜏
                    = ∫_0^1 D𝑓(𝑨_𝜏) 𝑨̇_𝜏 d𝜏
                    = ∫_0^1 𝑼_𝜏 [ 𝑓^[1](𝚲_𝜏) ⊙ (𝑼_𝜏^∗ 𝑨̇_𝜏 𝑼_𝜏) ] 𝑼_𝜏^∗ d𝜏 ≽ 0 .

The second equality is the chain rule; the third equality is from the Daleckii–Krein formula (14.3); and the last inequality follows from the Schur product theorem. Indeed, by hypothesis, 𝑓^[1](𝚲_𝜏) ≽ 0, while 𝑼_𝜏^∗ 𝑨̇_𝜏 𝑼_𝜏 ≽ 0 by the conjugation rule. Therefore, 𝑓^[1](𝚲_𝜏) ⊙ (𝑼_𝜏^∗ 𝑨̇_𝜏 𝑼_𝜏) ≽ 0. This is what we needed to show.
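A small numerical illustration of the criterion (again, an aside that is not part of the original notes): the Loewner matrices of the matrix monotone function √𝑡 are psd, while those of 𝑡 ↦→ 𝑡³, which is monotone but not matrix monotone on (0, ∞), can fail to be psd. The helper `loewner` and the sample points below are ad hoc choices.

```python
import numpy as np

def loewner(f, fprime, pts):
    """Loewner matrix [f^[1](x_i, x_j)] at the given points."""
    x = np.asarray(pts, dtype=float)
    D = x[:, None] - x[None, :]
    return np.where(D == 0.0, fprime(x)[:, None],
                    (f(x)[:, None] - f(x)[None, :]) / np.where(D == 0.0, 1.0, D))

pts = [0.5, 1.0, 2.0, 4.0, 8.0]

# sqrt is matrix monotone on (0, inf): its Loewner matrix is psd.
L_sqrt = loewner(np.sqrt, lambda t: 0.5 / np.sqrt(t), pts)
print(np.linalg.eigvalsh(L_sqrt).min())    # >= 0, up to roundoff

# t -> t^3 is monotone but not matrix monotone: a negative eigenvalue appears.
L_cube = loewner(lambda t: t**3, lambda t: 3 * t**2, pts)
print(np.linalg.eigvalsh(L_cube).min())    # < 0
```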

14.5 Examples
We conclude with two illustrative examples where Loewner’s theorem may be applied.
Example 14.16 (Rational function). For I ⊆ ℝ and real coefficients 𝑎, 𝑏, 𝑐, 𝑑 ∈ ℝ such that 𝑎𝑑 − 𝑏𝑐 > 0 and −𝑑/𝑐 ∉ I, define the rational function 𝑓 : I → ℝ by

    𝑡 ↦→ 𝑓(𝑡) ≔ (𝑎𝑡 + 𝑏)/(𝑐𝑡 + 𝑑) .

Aside: Such rational functions arise, for example, in the aptly named Loewner framework for interpolatory model reduction of dynamical systems [Ben+17, Chapter 8].

Notice that 𝑓 is continuous on I, and

    𝑡 ↦→ 𝑓′(𝑡) = (𝑎𝑑 − 𝑏𝑐)/(𝑐𝑡 + 𝑑)^2

is also continuous on I. Hence Loewner's result (Theorem 14.5) applies. For 𝚲 = diag(𝜆_1, . . . , 𝜆_𝑛) and 𝑛 ∈ ℕ, the Loewner matrix of 𝑓 is

    𝑓^[1](𝚲) = [ (𝑎𝑑 − 𝑏𝑐) / ((𝑐𝜆_𝑖 + 𝑑)(𝑐𝜆_𝑗 + 𝑑)) ]_{𝑖,𝑗=1,...,𝑛} .

This formula agrees with 𝑓′(𝜆_𝑗) when 𝜆_𝑖 = 𝜆_𝑗 . Introduce the matrix

    𝑺 ≔ √(𝑎𝑑 − 𝑏𝑐) · diag( (𝑐𝜆_𝑘 + 𝑑)^{−1} : 𝑘 = 1, . . . , 𝑛 ) .

Observe that

    𝑓^[1](𝚲) = 𝑺^∗ (𝟏𝟏^∗) 𝑺 = (𝟏^∗𝑺)^∗ (𝟏^∗𝑺) ≽ 0 .
We may conclude that the rational function 𝑓 is matrix monotone on I. 
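The rank-one factorization above is easy to confirm numerically. The following sketch is illustrative only; the coefficients 𝑎, 𝑏, 𝑐, 𝑑 and the test points are arbitrary choices.

```python
import numpy as np

# Check the rank-one factorization of the Loewner matrix in Example 14.16.
a, b, c, d = 2.0, 1.0, 1.0, 3.0           # ad - bc = 5 > 0
lam = np.array([0.3, 1.0, 2.5, 4.0])      # points in an interval avoiding -d/c
f = lambda t: (a * t + b) / (c * t + d)

# Loewner matrix of divided differences (diagonal = derivative values).
X, Y = np.meshgrid(lam, lam, indexing="ij")
L = np.where(X == Y, (a * d - b * c) / (c * X + d) ** 2,
             (f(X) - f(Y)) / np.where(X == Y, 1.0, X - Y))

s = np.sqrt(a * d - b * c) / (c * lam + d)    # the vector S 1
print(np.allclose(L, np.outer(s, s)))         # True
print(np.linalg.eigvalsh(L).min() >= -1e-12)  # psd, up to roundoff
```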

Using Loewner’s theorem, we were able to conclude matrix monotonicity (in this case, of rational functions) by verifying that the Loewner matrix is psd. The reverse direction is also of interest: monotonicity can be informative about positive semidefiniteness. This is useful in the study of kernel methods and positive definite functions [Bha07b, Chapter 5].
Example 14.17 (Logarithm). Since 𝑡 ↦→ log 𝑡 is continuously differentiable and matrix monotone on (0, ∞), it follows by Theorem 14.5 that

    [ ( 𝟙{𝜆_𝑖=𝜆_𝑗} + log 𝜆_𝑖 − log 𝜆_𝑗 ) / ( 𝜆_𝑖 𝟙{𝜆_𝑖=𝜆_𝑗} + 𝜆_𝑖 − 𝜆_𝑗 ) ]_{𝑖,𝑗=1,...,𝑛} ≽ 0

for any {𝜆_1, . . . , 𝜆_𝑛} ⊂ (0, ∞) and 𝑛 ∈ ℕ. These matrices arise in the study of logarithmic means, for example. 

Notes
This lecture is adapted from Bhatia’s book [Bha97, Chap. V] with some elements
from [Bha07b, Chap. 5].

Lecture bibliography
[AV95] J. S. Aujla and H. Vasudeva. “Convex and monotone operator functions”. In: Annales
Polonici Mathematici. Volume 62. 1. 1995, pages 1–11.
[Ben+17] P. Benner et al. Model reduction and approximation: theory and algorithms. SIAM,
2017.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[CP77] “Chapter 6: Calculus in Banach Spaces”. In: Functional Analysis in Modern Applied Mathematics. Volume 132. Mathematics in Science and Engineering. Elsevier, 1977, pages 87–105. doi: 10.1016/S0076-5392(08)61248-5.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
15. Monotonicity: Integral Characterization

Date: 22 February 2022 Scribe: Joel A. Tropp

Agenda:
1. Integral representation of matrix monotone functions
2. Uniqueness
3. Geometric approach
4. Monotonicity on the positive reals
5. Filtering with a positive linear map
6. Integral representation of matrix convex functions

In the last lecture, we showed that smooth matrix monotone functions are characterized by a differential property. The Loewner matrix, which packages divided differences of the function, must be a psd matrix. In this lecture, we will discuss a complementary characterization of matrix monotone functions as those that can be expressed using a certain integral representation. This approach allows us to identify several new examples of matrix monotone functions. Moreover, the description yields some general theorems on matrix convexity, including some matrix Lyapunov inequalities that go well beyond the Kadison and Choi inequalities.
15.1 Recap
Last time, we developed characterizations of continuously differentiable matrix mono-
tone and matrix convex functions in terms of derivative properties. Let I ⊂ ℝ be an
open interval. Each continuously differentiable function 𝑓 : I → ℝ induces a family of
Loewner matrices. For a diagonal matrix 𝚲 ∈ ℍ𝑛 (I) and each 𝑛 ∈ ℕ, we define

    𝑳_𝑓(𝚲) ≔ 𝑓^[1](𝚲) ≔ [ 𝑓^[1](𝜆_𝑗, 𝜆_𝑘) ]_{𝑗,𝑘=1,...,𝑛} .

As usual, 𝑓 [1 ] (𝑎, 𝑏) denotes the divided difference of 𝑓 at the points 𝑎, 𝑏 ∈ I.


In the theory of matrix monotonicity and convexity, the Loewner matrix plays
a role similar to the derivative in the scalar setting. Indeed, under the smoothness
assumption on 𝑓 , we have presented two theorems.
1. Monotonicity. The function 𝑓 is matrix monotone on I if and only if

    𝑳_𝑓(𝑨) ≽ 0   for each diagonal 𝑨 ∈ ℍ𝑛 (I).

We require this condition to hold for each 𝑛 ∈ ℕ. The 𝑛 = 1 case simply states that 𝑓′ is positive on I. This result is due to Loewner.
2. Convexity. The function 𝑓 is matrix convex on I if and only if

    𝑓(𝑩) − 𝑓(𝑨) ≽ 𝑳_𝑓(𝑨) ⊙ (𝑩 − 𝑨)   for all 𝑨, 𝑩 ∈ ℍ𝑛 (I) with 𝑨 diagonal.


As before, this condition is required for each 𝑛 ∈ ℕ. The 𝑛 = 1 case corresponds
with the characterization of convexity for a smooth, scalar function. This result
is due to Aujla & Vasudeva.
Later in this lecture, we will also discuss some other relationships between matrix
monotone and matrix convex functions that do not have scalar counterparts.

15.2 Integral representations of matrix monotone functions


In Lecture 13, we established that the logarithm is a matrix monotone function on ℝ++ ≔ (0, ∞) by the following argument. First, we observed that

    log 𝑎 = ∫_0^∞ [ (1 + 𝜆)^{−1} − (𝑎 + 𝜆)^{−1} ] d𝜆   for 𝑎 > 0.

We have a similar integral representation for the logarithm of a pd matrix:

    log 𝑨 = ∫_0^∞ [ (1 + 𝜆)^{−1} I − (𝑨 + 𝜆I)^{−1} ] d𝜆   for 𝑨 ≻ 0.

Since the function 𝑡 ↦→ −(𝑡 + 𝜆) −1 is matrix monotone on ℝ++ , the integral repre-
sentation shows that the logarithm is the limit of positive sums of matrix monotone
functions. Since the cone of matrix monotone functions is closed under pointwise
limits, the logarithm also is matrix monotone.
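For readers who like to see such identities checked by machine, here is a minimal quadrature sketch of the matrix identity above. The test matrix and the use of scipy.integrate.quad_vec are assumptions of the sketch, not part of the notes.

```python
import numpy as np
from scipy.integrate import quad_vec

# Quadrature check of the integral representation of the matrix logarithm.
rng = np.random.default_rng(1)
n = 4
G = rng.standard_normal((n, n))
A = G @ G.T + 0.5 * np.eye(n)                 # a positive-definite matrix
I = np.eye(n)

integrand = lambda lam: I / (1 + lam) - np.linalg.inv(A + lam * I)
val, err = quad_vec(integrand, 0.0, np.inf)

w, Q = np.linalg.eigh(A)
logA = Q @ np.diag(np.log(w)) @ Q.T           # log(A) via the spectral theorem
print(np.max(np.abs(val - logA)))             # small, up to quadrature error
```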
This argument may seem concocted. In fact, this approach is based on a penetrating
insight that every matrix monotone function admits an integral representation. This
is a famous result of Loewner, which is regarded as one of the deepest theorems in
matrix analysis. We present the result and some immediate consequences.

15.2.1 Loewner’s integral theorem


The form of the integral representation of a matrix monotone function depends superficially on the interval where it is defined. We begin with the standard open interval (−1, +1). Later, we will see that results for this interval can be transferred directly to other intervals.

Warning: In this theory, we must take care about whether the endpoints of the interval are included because matrix monotone functions need not be continuous at the endpoints. This result is framed for an open interval to avoid this problem. 

Theorem 15.1 (Matrix monotonicity: Standard interval; Loewner 1934). Let 𝑓 : (−1, +1) → ℝ be a nonconstant function.

1. If 𝑓 is matrix monotone on (−1, +1), then 𝑓 admits a unique representation of the form

    𝑓(𝑡) = 𝛼 + 𝛽 ∫_{[−1,+1]} 𝑡 / (1 − 𝜆𝑡) d𝜇(𝜆),    (15.1)

where 𝛼 ∈ ℝ and 𝛽 > 0 and 𝜇 is a Borel probability measure on [−1, +1].
2. Conversely, suppose that 𝑓 satisfies (15.1). Then 𝑓 is a matrix monotone function on (−1, +1).

The claim (2) is called the “easy” direction of Loewner’s theorem. We verify this
result in Section 15.2.2. As we will see, the easy direction already offers a powerful
tool. Section 15.2.3 contains several examples.
Meanwhile, the claim (1) is called the “hard” direction of Loewner’s theorem. In
particular, this result implies that every matrix monotone function 𝑓 on (−1, +1)
is analytic in the interval (−1, +1) . This fact is striking because scalar monotone
functions do not even need to be continuous. Introducing deeper ideas from complex
analysis, one may prove that the function 𝑓 continues analytically into the upper
half-plane, where it defines a Herglotz function; see [Bha97, Chap. V] or [Sim19] for
more discussion about this point.
Simon [Sim19] outlines 11 different proofs (!) of the hard part of Loewner’s theorem.
Each one approaches the summit from a different direction, and each one requires a
long and arduous climb. In Section 15.2.4, we give the easy proof of the uniqueness
statement. The existence claim is the difficult step. In Section 15.3, we will summarize
a geometric existence proof, but there is no space here for all of the details.

15.2.2 Proof of Theorem 15.1: “Easy” direction


Let us supply the short proof of claim (2) in Loewner’s theorem.
Exercise 15.2 (Matrix monotone functions: Elementary examples). For each 𝜆 ∈ [−1, +1] ,
confirm that 𝑓 : 𝑡 ↦→ 𝑡 /( 1 − 𝜆𝑡 ) is a matrix monotone function on (−1, +1) . Hint:

Simplify the function algebraically so that you can invoke the fact that the negative
inverse is operator monotone.
The constant function 𝑡 ↦→ 𝛼 is matrix monotone on any interval. For each
𝜆 ∈ [−1, +1], the function 𝑡 ↦→ 𝑡/(1 − 𝜆𝑡) is matrix monotone on the interval (−1, +1)
by Exercise 15.2. By the usual limiting arguments (simple functions, dominated
convergence), we see that the integral in (15.1) represents a matrix monotone function.
Indeed, matrix monotone functions are closed under positive linear combinations
and pointwise limits. Altogether, the expression (15.1) represents a matrix monotone
function on the standard open interval (−1, +1) .

15.2.3 First examples


Once we recognize that matrix monotone functions admit integral representations of
the type (15.1), we can seek out representations for particular functions. The “easy”
direction of Loewner’s theorem confirms that these integrals describe matrix monotone
functions.
Example 15.3 (Logarithms and friends). Using basic tools from integral calculus, we quickly
confirm that

    log(1 + 𝑡) = ∫_{−1}^{0} 𝑡 / (1 − 𝜆𝑡) d𝜆   for 𝑡 ∈ (−1, +1).
Theorem 15.1(2) now implies that 𝑡 ↦→ log ( 1 + 𝑡 ) is matrix monotone on (−1, +1) .
A similar calculation reveals that 𝑡 ↦→ − log ( 1 − 𝑡 ) is matrix monotone on (−1, +1) .
From here, we can derive further results of interest:
 
    arctanh(𝑡) = (1/2) log( (1 + 𝑡)/(1 − 𝑡) )   is matrix monotone on (−1, +1).

Indeed, matrix monotone functions compose a convex cone. 

Problem 15.4 (Tangent). Show that the function 𝑡 ↦→ tan (𝜋𝑡 /2) has an integral repre-
sentation of the form (15.1), hence is matrix monotone on (−1, +1) . Hint: You can
derive this fact from the definite integral
    (𝜋/2) tan(𝜋𝑡/2) = ∫_0^∞ (𝜆^𝑡 − 1)/(𝜆 − 𝜆^{−1}) · d𝜆/𝜆   for 𝑡 ∈ (−1, +1).
This non-obvious statement appears as [GR07, Formula 3.274(3)]. There are more
elementary ways to prove that the tangent is matrix monotone; cf. Lecture 18.

15.2.4 Proof of Theorem 15.1: Uniqueness


To prove that integral representations are unique, we must argue that the integrands
are rich enough to approximate all continuous functions.
Exercise 15.5 (Elementary monotone functions: Totality). Recall that, in a normed space, a subset is total if its linear span is dense. Establish that the collection
{𝜆 ↦→ 1} ∪ {𝜆 ↦→ 𝑡 ( 1 − 𝜆𝑡 ) −1 : 𝑡 ∈ (−1, +1)}
is total in the space C [−1, +1] of continuous functions, equipped with the supremum
norm. Hint: This is a direct application of the Stone–Weierstrass theorem because the
linear span is an algebra.
To prove the uniqueness claim, let 𝑓 be a matrix monotone function on (−1, +1)
with a representation of the form (15.1). Since 𝛼 = 𝑓 ( 0) and 𝛽 = 𝑓 0 ( 0) > 0, we can
shift and scale 𝑓 to make 𝛼 = 0 and 𝛽 = 1.

Suppose that there are two Borel probability measures 𝜇, 𝜈 on [−1, +1] for which

    𝑓(𝑡) = ∫_{−1}^{+1} 𝑡/(1 − 𝜆𝑡) d𝜇(𝜆) = ∫_{−1}^{+1} 𝑡/(1 − 𝜆𝑡) d𝜈(𝜆)   for all 𝑡 ∈ (−1, +1).

If these two measures are different, they can be separated by a continuous function
ℎ : [−1, +1] → ℝ. That is,
    ∫_{−1}^{+1} ℎ(𝜆) d𝜇(𝜆) < ∫_{−1}^{+1} ℎ(𝜆) d𝜈(𝜆).

Since 𝜇 and 𝜈 are both probability measures, we may shift ℎ to ensure that it has zero integral (with respect to Lebesgue measure): ∫_{−1}^{+1} ℎ(𝜆) d𝜆 = 0.
By Exercise 15.5, the functions 𝜆 ↦→ 𝑡 ( 1 − 𝜆𝑡 ) −1 are total in the space of continuous
functions on [−1, +1] with a zero integral. Therefore, we can approximate ℎ in the
supremum norm by a linear combination. For example, with some real coefficients
𝑐_1, . . . , 𝑐_𝑛, we have

    | ℎ(𝜆) − ∑_{𝑘=1}^{𝑛} 𝑐_𝑘 𝑡_𝑘/(1 − 𝜆𝑡_𝑘) | ≤ 𝜀   for all 𝜆 ∈ [−1, +1] .

By choosing 𝜀 sufficiently small, we find that


    ∑_{𝑘=1}^{𝑛} 𝑐_𝑘 𝑓(𝑡_𝑘) = ∑_{𝑘=1}^{𝑛} ∫_{−1}^{+1} 𝑐_𝑘 𝑡_𝑘/(1 − 𝜆𝑡_𝑘) d𝜇(𝜆)
                          < ∑_{𝑘=1}^{𝑛} ∫_{−1}^{+1} 𝑐_𝑘 𝑡_𝑘/(1 − 𝜆𝑡_𝑘) d𝜈(𝜆) = ∑_{𝑘=1}^{𝑛} 𝑐_𝑘 𝑓(𝑡_𝑘).

This contradiction forces us to conclude that the measures are the same.

15.3 The geometric approach to Loewner’s theorem


Many theorems on integral representation have an elegant geometric interpretation
as averages of extreme points of a compact, convex set. Hansen & Pedersen [HP82]
developed a proof of Theorem 15.1 based on this strategy. The argument relies on
a complicated interplay between the smoothness and convexity properties of matrix
monotone functions, so we cannot give all of the details here. See [Bha97, Chap. V.4]
or [Sim19, Chap. 28] for complete arguments.

15.3.1 Slicing the cone


As we have discussed, the matrix monotone functions on (−1, +1) compose a convex
cone, closed under pointwise limits. It is convenient to normalize these functions,
which amounts to taking a slice through the cone; see Figure 15.1. It can be shown
(not easy!) that every matrix monotone function is differentiable. Therefore, we may
introduce the set

    B ≔ { 𝑓 : (−1, +1) → ℝ : 𝑓 is matrix monotone, 𝑓(0) = 0, and 𝑓′(0) = 1 }.

Figure 15.1 (Base of cone). A base B of the cone of matrix monotone functions on (−1, +1).

Furthermore, one may confirm that each nonconstant matrix monotone function 𝑔 satisfies 𝑔′(0) > 0. Therefore, the function (𝑔(𝑡) − 𝑔(0))/𝑔′(0) ∈ B.

As it happens, the set B is compact with respect to the topology of pointwise


convergence. Roughly speaking, this point follows from the fact that matrix monotone
functions in B are bounded above and below:
    𝑡/(1 + 𝑡) ≤ 𝑓(𝑡) ≤ 0   when −1 < 𝑡 ≤ 0;
    0 ≤ 𝑓(𝑡) ≤ 𝑡/(1 − 𝑡)   when 0 ≤ 𝑡 < 1.
This claim also requires a fair amount of argument.

15.3.2 The Krein–Milman theorem


Since B is a compact and convex set, we can activate a famous representation theorem
that generalizes Minkowski’s theorem on extreme points (Lecture 5).

Theorem 15.6 (Krein–Milman). Let B be a compact, convex subset of a locally convex


topological linear space. Then the set coincides with the closed convex hull of its
extreme points:
B = conv ( ext ( B)).

In general, the closure is required, and there is a possibility that the extreme points
are uninformative (e.g., the extreme points could be dense in B!). Nevertheless, there
are many settings where we can explicitly identify the extreme points of the set B. The
limit of convex combinations can often be realized as an integral over the extreme
points, which leads to an integral representation of the elements of B.

15.3.3 Elementary monotone functions are extreme


As it happens, the elementary matrix monotone functions we have been studying are
all extreme points of the set B. We will prove this claim directly using an argument
attributed to Boutet de Monvel [Sim19, Thm. 28.12].

Theorem 15.7 (Elementary monotone functions: Extremality). For each 𝜆 ∈ [−1, +1] ,
introduce the function
    𝜑_𝜆(𝑡) = 𝑡/(1 − 𝜆𝑡)   for 𝑡 ∈ (−1, +1).
The function 𝜑𝜆 is an extreme point of B.

In fact, the family ( 𝜑𝜆 : 𝜆 ∈ [−1, +1]) exhausts the set of extreme points of B.
This claim is significantly harder to prove, so we must take it for granted.

Proof. Fix 𝜆 ∈ [−1, +1] , and note that 𝜑𝜆 ∈ B. The key to the proof is to observe that
the 2 × 2 Loewner matrix associated with 𝜑𝜆 always has rank one:
    𝜑_𝜆^[1](𝑠, 𝑡) = 𝒖𝒖^∗   where 𝒖 = ( (1 − 𝜆𝑠)^{−1}, (1 − 𝜆𝑡)^{−1} )^⊤,   for all 𝑠, 𝑡 ∈ (−1, +1).

Therefore, the Loewner matrix lies in an extreme ray of the psd cone.
Suppose now that we can write 𝜑_𝜆 = (1/2) 𝑓 + (1/2) 𝑔 where 𝑓, 𝑔 ∈ B. The Loewner matrix is linear in the function, so

    𝜑_𝜆^[1](𝑠, 𝑡) = (1/2) 𝑓^[1](𝑠, 𝑡) + (1/2) 𝑔^[1](𝑠, 𝑡)   for all 𝑠, 𝑡 ∈ (−1, +1).
But the Loewner matrix 𝜑_𝜆^[1](𝑠, 𝑡) lies in an extreme ray of the psd cone, so the two matrices on the right are both positive scalar multiples of it. In particular,

    𝑓^[1](𝑠, 𝑡) = 𝛼(𝑠, 𝑡) · 𝜑_𝜆^[1](𝑠, 𝑡)   for some 𝛼(𝑠, 𝑡) ≥ 0.

To complete the argument, we select 𝑠 = 0 and write out the entries of the matrices. From the normalization 𝑓(0) = 0 and 𝑓′(0) = 1, it emerges that

    [ 1        𝑓(𝑡)/𝑡 ]                [ 1               (1 − 𝜆𝑡)^{−1} ]
    [ 𝑓(𝑡)/𝑡   𝑓′(𝑡)  ]  =  𝛼(0, 𝑡) ·  [ (1 − 𝜆𝑡)^{−1}   (1 − 𝜆𝑡)^{−2} ]

for all 𝑡 ∈ (−1, +1).

Inspecting the top-left entry, we realize that 𝛼 ( 0, 𝑡 ) = 1, regardless of the choice of 𝑡 .


As a consequence, the top-right entry yields 𝑓 (𝑡 ) = 𝑡 ( 1 − 𝜆𝑡 ) −1 for all 𝑡 ∈ (−1, +1) .
In other terms, 𝑓 = 𝜑𝜆 , and we conclude that 𝜑𝜆 is an extreme point of B. 
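The rank-one structure of the 2 × 2 Loewner matrix that drives this proof can be confirmed directly. A tiny numerical sketch follows; the specific values of 𝜆, 𝑠, 𝑡 are arbitrary.

```python
import numpy as np

# Check that the 2x2 Loewner matrix of phi_lambda(t) = t/(1 - lambda*t)
# is the outer product used in the proof of Theorem 15.7.
lam = 0.6
phi = lambda t: t / (1 - lam * t)
s, t = -0.4, 0.7                                   # two points in (-1, +1)

u = np.array([1 / (1 - lam * s), 1 / (1 - lam * t)])
L = np.array([[1 / (1 - lam * s) ** 2, (phi(s) - phi(t)) / (s - t)],
              [(phi(t) - phi(s)) / (t - s), 1 / (1 - lam * t) ** 2]])
# diagonal entries are phi'(s), phi'(t); off-diagonal is the divided difference
print(np.allclose(L, np.outer(u, u)))              # True
```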

15.3.4 From extreme points to integrals


Granted the claim that ( 𝜑𝜆 : 𝜆 ∈ [−1, +1]) is the complete family of extreme points
of B, we can establish the integral representation. Let 𝑓 ∈ B be a function. The
Krein–Milman theorem yields a sequence of approximations
    ∫_{−1}^{+1} 𝜑_𝜆(𝑡) d𝜇_𝑛(𝜆) → 𝑓(𝑡)   as 𝑛 → ∞, for each 𝑡 ∈ (−1, +1).

In this expression, 𝜇𝑛 is a probability measure supported on 𝑛 points in [−1, +1] , so it


describes a convex combination of 𝑛 functions 𝜑𝜆𝑘 with 𝜆𝑘 ∈ [−1, +1] . By weak-∗
compactness of the simplex of probability measures on the compact space [−1, +1] ,
we can extract a subsequence that converges to a probability measure 𝜇 on [−1, +1] .
For this limiting measure,
    ∫_{−1}^{+1} 𝜑_𝜆(𝑡) d𝜇(𝜆) = 𝑓(𝑡)   for each 𝑡 ∈ (−1, +1).

For a nonconstant matrix monotone function 𝑔 on (−1, +1) , we apply this statement
to the function 𝑓 (𝑡 ) = (𝑔 (𝑡 ) − 𝑔 ( 0))/𝑔 0 ( 0) to complete our sketch of the proof of the
“hard” direction of Theorem 15.1.

15.4 Matrix monotone functions on the positive real line


Loewner’s result gives an integral representation of the matrix monotone functions
on the interval (−1, +1) . As we will see, integral representations for other classes of
matrix monotone functions follow via simple transformations. We will focus on the
most useful case: matrix monotone functions on the positive real line.

15.4.1 Fractional linear transformations


We can move among open intervals in the real line using fractional linear transfor-
mations. In this section, we will show that fractional linear transformations preserve
matrix monotonicity.
Exercise 15.8 (Fractional linear transformations). A fractional linear transformation is a
function of the form
    𝜓(𝑡) ≔ (𝛼𝑡 + 𝛽)/(𝛾𝑡 + 𝛿)   for 𝑡 ≠ −𝛿/𝛾 and where 𝛼, 𝛽, 𝛾, 𝛿 ∈ ℝ.

When the determinant 𝛼𝛿 − 𝛽𝛾 > 0, check that 𝜓 is increasing on the intervals


(−∞, −𝛿 /𝛾 ) and (−𝛿 /𝛾 , +∞) .
Show that there is an increasing fractional linear transformation that maps a
finite or semi-infinite interval (𝑐 , 𝑑) onto a finite or semi-infinite interval (𝑎, 𝑏) for
𝑎, 𝑏, 𝑐 , 𝑑 ∈ ℝ. We do not allow both endpoints of an interval to be infinite.
Show that an increasing, surjective fractional linear transformation 𝜓 : (𝑐 , 𝑑) →
(𝑎, 𝑏) is matrix monotone.
Exercise 15.9 (Matrix monotonicity: Composition). Suppose that 𝑓 , 𝑔 are matrix monotone.
Check that ℎ (𝑡 ) = 𝑓 (𝑔 (𝑡 )) is matrix monotone, provided that the domains and
codomains are compatible.
Exercise 15.10 (Matrix monotonicity: Change of domain). Let 𝑓 : (𝑎, 𝑏) → ℝ be a matrix
monotone function on a finite or semi-infinite interval. Let 𝜓 : (𝑐 , 𝑑) → (𝑎, 𝑏) be an
increasing fractional linear transformation that is surjective. Confirm that 𝑓 ◦ 𝜓 is
matrix monotone on (𝑐 , 𝑑) .

15.4.2 Loewner’s theorem for the strictly positive reals


Using these results, we can transfer Theorem 15.1 to obtain a parallel result for matrix
monotone functions on ℝ++ .
Corollary 15.11 (Loewner: Strictly positive reals). Consider a function 𝑔 : ℝ++ → ℝ. The
function 𝑔 is matrix monotone on ℝ++ if and only if it admits a (unique) representation
of the form

    𝑔(𝑡) = 𝛼 + ∫_{[0,1]} (𝑡 − 1)/(𝜆 + (1 − 𝜆)𝑡) d𝜈(𝜆)   for all 𝑡 > 0.

In this expression, 𝛼 ∈ ℝ and 𝜈 is a finite, positive Borel measure on [0, 1]. (If the measure 𝜈 has an atom at 0, it contributes a pole at zero. If the measure 𝜈 has an atom at 1, it contributes a linear component.)

Proof sketch. Let us introduce the fractional linear transformation 𝜓 that maps the
strictly positive real line ℝ++ onto the standard open interval (−1, +1) . The map 𝜓
and its inverse 𝜓 −1 are increasing functions of the form
𝑡 −1 1+𝑠
𝜓 (𝑡 ) = for 𝑡 ∈ ℝ++ where 𝜓 −1 (𝑠 ) = for 𝑠 ∈ (−1, +1).
𝑡 +1 1−𝑠
We can construct the matrix monotone function 𝑓 (𝑠 ) = 𝑔 (𝜓 −1 (𝑠 )) on (−1, +1) . Apply
Theorem 15.1 to 𝑓 to obtain an integral representation. Then make the change of
variables 𝑠 = 𝜓 (𝑡 ) to transfer this representation back to the function 𝑔 . Make the
affine change of variables 𝜆 ↦→ 2𝜆 − 1 to shift the domain of integration from [−1, +1]
to [ 0, 1] . The result of this process is quoted above. 
Example 15.12 (Logarithm, again). We can use the “easy” direction of Corollary 15.11 to
verify that particular functions are matrix monotone. As an example, observe that
    log 𝑡 = ∫_0^1 (𝑡 − 1)/(𝜆 + (1 − 𝜆)𝑡) d𝜆   for all 𝑡 > 0.
Therefore, the logarithm is matrix monotone. The representation of the logarithm that
we saw before is related to this one by a further change of variables in the integral. 
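A one-line quadrature check of this representation (illustrative only; SciPy is assumed):

```python
import numpy as np
from scipy.integrate import quad

# Quadrature check of the representation in Example 15.12.
for t in [0.25, 1.0, 3.0, 10.0]:
    val, _ = quad(lambda lam: (t - 1) / (lam + (1 - lam) * t), 0.0, 1.0)
    print(t, val, np.log(t), abs(val - np.log(t)) < 1e-10)
```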
Exercise 15.13 (Loewner: Strictly positive reals). It is sometimes convenient to change
variables in Corollary 15.11. For a matrix monotone function 𝑔 : ℝ++ → ℝ, show that

    𝑔(𝑡) = 𝛼 + 𝛾𝑡 + ∫_{ℝ++} 𝜆(𝑡 − 1)/(𝜆 + 𝑡) d𝜇(𝜆)   for all 𝑡 > 0,

where 𝛼 ∈ ℝ and 𝛾 ≥ 0 and 𝜇 is a finite, positive Borel measure on ℝ++ . This is closer
to our original presentation of the logarithm in terms of an integral.

15.4.3 Loewner’s theorem for the positive reals


So far, we have studied matrix monotone functions on an open interval. In this setting,
the function need not have a limiting value at the endpoints. This is the case for the
logarithm and the tangent function. When the function does take a limiting value, we
can obtain integral representations that are valid up to and including the endpoint.
Corollary 15.14 (Loewner: Positive reals). Consider a function 𝑔 : ℝ+ → ℝ that is
continuous at 𝑡 = 0. The function 𝑔 is matrix monotone on ℝ+ if and only if it admits
a (unique) representation of the form

    𝑔(𝑡) = 𝛽 + 𝛾𝑡 + ∫_{(0,1)} 𝑡/(𝜆 + (1 − 𝜆)𝑡) · d𝜈(𝜆)/𝜆   for all 𝑡 ≥ 0.

In this expression, 𝛽 ∈ ℝ and 𝛾 ≥ 0, and 𝜈 is a finite, positive Borel measure on ( 0, 1) .

Proof sketch. By continuity, 𝑔 ( 0) = lim𝑡 ↓0 𝑔 (𝑡 ) . Taking a limit of the integral repre-


sentation in Corollary 15.11, we find that
    𝑔(0) = 𝛼 − ∫_0^1 d𝜈(𝜆)/𝜆 ≕ 𝛽 > −∞.
Adding and subtracting this quantity in the representation in Corollary 15.11, we obtain
the statement after some simplification. 
Exercise 15.15 (Loewner: Positive reals). It is often convenient to make another change of
variables in Corollary 15.14. For a continuous matrix monotone function 𝑔 : ℝ+ → ℝ,
show that

    𝑔(𝑡) = 𝛽 + 𝛾𝑡 + ∫_{ℝ++} 𝜆𝑡/(𝜆 + 𝑡) d𝜇(𝜆)   for all 𝑡 ≥ 0.
Here, 𝛽 ∈ ℝ and 𝛾 ≥ 0 and 𝜇 is a finite, positive Borel measure on ℝ++ .
Exercise 15.15 suggests that we look for functions that have integral representations
with this form. Here are two examples.
Example 15.16 (Shifted logarithm: Monotonicity). Beginning with Example 15.12, we can
make the change of variables 𝑡 ↦→ 1 + 𝑡 and 1 − 𝜆 ↦→ 𝜆^{−1} to obtain the representation

    log(1 + 𝑡) = ∫_1^∞ 𝜆𝑡/(𝜆 + 𝑡) · 𝜆^{−2} d𝜆.
Therefore, the shifted logarithm 𝑡 ↦→ log ( 1 + 𝑡 ) is matrix monotone on ℝ+ . 

Example 15.17 (Powers: Monotonicity). For 𝑟 ∈ ( 0, 1] , we can use contour integration to


show that

    𝑡^𝑟 = (sin(𝜋𝑟)/𝜋) ∫_0^∞ 𝜆𝑡/(𝜆 + 𝑡) · 𝜆^{𝑟−2} d𝜆.

Therefore, the power functions 𝑡 ↦→ 𝑡^𝑟 for 𝑟 ∈ (0, 1] are matrix monotone on ℝ+ .
We can deduce that negative powers are matrix monotone using the composition
rule (Exercise 15.9). Indeed, since 𝑡 ↦→ −𝑡 −1 is matrix monotone on ℝ++ , the power
function 𝑡 ↦→ −𝑡 −𝑟 is matrix monotone on ℝ++ for each 𝑟 ∈ ( 0, 1] . 
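The integral representation of the power functions can also be checked numerically. The sketch below (with arbitrary choices 𝑟 = 1/2 and 𝑡 = 2) is only a sanity check, not a proof.

```python
import numpy as np
from scipy.integrate import quad

# Quadrature check of the integral representation of t^r in Example 15.17.
r, t = 0.5, 2.0
f = lambda lam: (lam * t / (lam + t)) * lam ** (r - 2)
val = quad(f, 0.0, 1.0)[0] + quad(f, 1.0, np.inf)[0]
print(np.sin(np.pi * r) / np.pi * val, t ** r)   # both approximately sqrt(2)
```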

15.5 Integral representations of matrix convex functions


Just as we developed integral representations for matrix monotone functions, we can
develop integral representations for matrix convex functions. These results follow from
the representations of matrix monotone functions by means of some general principles.
For brevity, we will only consider matrix convex functions on the positive real line.

15.5.1 Relations between matrix monotonicity and convexity


In the scalar setting, monotone functions and convex functions are independent concepts. The main connection is that a differentiable convex function has a monotone derivative. In the matrix setting, however, there is an intricate web of relations between matrix monotone and matrix convex functions. This section outlines some of the basic facts, along with direct proofs.

Warning: Some proofs of Loewner’s theorem (including the one we have outlined) depend on the convexity and concavity results in this section, so it may be circular to use integral representations to establish these statements. 

Theorem 15.18 (Matrix monotonicity and concavity). Let 𝑓 : ℝ+ → ℝ be a (continuous) function on the positive real line.
function on the positive real line.

1. If 𝑓 is matrix monotone, then 𝑓 is matrix concave.


2. If 𝑓 is matrix concave and takes positive values, then 𝑓 is matrix monotone.

Proof. For the second part, assume that 𝑓 : ℝ+ → ℝ+ is matrix concave and positive. First, suppose that 𝑨 ≺ 𝑩. For scalars 𝜏 ∈ (0, 1) and 𝜏̄ = 1 − 𝜏, since 𝑓 is concave and positive,

    𝑓(𝜏𝑩) = 𝑓( 𝜏𝑨 + 𝜏̄ · (𝜏/𝜏̄)(𝑩 − 𝑨) ) ≽ 𝜏 · 𝑓(𝑨) + 𝜏̄ · 𝑓( (𝜏/𝜏̄)(𝑩 − 𝑨) ) ≽ 𝜏 · 𝑓(𝑨).

Since 𝑓 is continuous, we may take 𝜏 ↑ 1. We conclude that 𝑨 ≺ 𝑩 implies 𝑓(𝑨) ≼ 𝑓(𝑩). To handle the case where 𝑨 ≼ 𝑩, observe that 𝑓(𝑨) ≼ 𝑓(𝑩 + 𝜀I) for 𝜀 > 0, and take limits as 𝜀 ↓ 0 to resolve that 𝑓 is matrix monotone.
For the first part, the key to this argument is the following claim, which we will
establish in a moment.
Claim 15.19 (Monotonicity: Submatrix). Let 𝑓 : ℝ+ → ℝ be a continuous matrix monotone function. For a psd block matrix 𝑨, we have 𝑓([𝑨]_11) ≽ [𝑓(𝑨)]_11. (Recall that [·]_11 extracts the top-left (1, 1) block of a block matrix, and it is a positive linear map.)
We assume that 𝑓 : ℝ+ → ℝ is matrix monotone. Granted the claim, we introduce
a unitary block matrix that generates convex combinations:

    𝑼_𝜏 ≔ [ 𝜏^{1/2} I     𝜏̄^{1/2} I
            −𝜏̄^{1/2} I    𝜏^{1/2} I ]   for 𝜏 ∈ [0, 1] with 𝜏̄ = 1 − 𝜏.

For psd matrices 𝑨 1 , 𝑨 2 , construct the psd block diagonal matrix 𝑨 = 𝑨 1 ⊕ 𝑨 2 . Using
the claim, we calculate that

    𝑓(𝜏𝑨_1 + 𝜏̄𝑨_2) = 𝑓( [𝑼_𝜏 𝑨𝑼_𝜏^∗]_11 ) ≽ [ 𝑓(𝑼_𝜏 𝑨𝑼_𝜏^∗) ]_11 = [ 𝑼_𝜏 𝑓(𝑨)𝑼_𝜏^∗ ]_11 = 𝜏 𝑓(𝑨_1) + 𝜏̄ 𝑓(𝑨_2).

In other words, 𝑓 is matrix concave.


To prove Claim 15.19, we introduce another block matrix that helps diagonalize a
block matrix:

    𝑻_𝜀 ≔ [ 𝜀^{1/2} I    0
            0            −𝜀^{−1/2} I ]   for 𝜀 > 0.

For a psd block matrix 𝑨, a short calculation reveals that

    𝑨 ≼ 𝑨 + 𝑻_𝜀^∗ 𝑨 𝑻_𝜀 = [ (1 + 𝜀) · [𝑨]_11    0
                            0                   (1 + 𝜀^{−1}) · [𝑨]_22 ] .

(Of course, [·]_22 extracts the (2, 2) block of a block matrix.) Apply the monotone function 𝑓 to this relation. Since [·]_11 is a positive linear map, it preserves the psd order on self-adjoint matrices. Therefore,

    [𝑓(𝑨)]_11 ≼ 𝑓( (1 + 𝜀) · [𝑨]_11 ).

Take the limit as 𝜀 ↓ 0 using the assumption that 𝑓 is continuous. 
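Claim 15.19 is easy to test numerically for a concrete matrix monotone function such as the square root. The following sketch uses a random psd block matrix and an ad hoc helper for the matrix square root; it is an illustration, not part of the notes.

```python
import numpy as np

def msqrt(S):
    """Square root of a symmetric psd matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

rng = np.random.default_rng(2)
n, k = 6, 3
G = rng.standard_normal((n, n))
A = G @ G.T                                 # a random psd block matrix
A11 = A[:k, :k]                             # the (1,1) block

lhs = msqrt(A11)                            # f([A]_11) with f = sqrt
rhs = msqrt(A)[:k, :k]                      # [f(A)]_11
print(np.linalg.eigvalsh(lhs - rhs).min())  # >= 0, up to roundoff
```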


Next, we point out that a matrix convex function on the positive real line induces
another matrix monotone function.
Proposition 15.20 (Monotonicity from convexity). Let 𝑓 : ℝ+ → ℝ be a matrix convex
function with 𝑓 ( 0) ≤ 0. Then 𝑔 (𝑡 ) = 𝑓 (𝑡 )/𝑡 is matrix monotone on ℝ++ .

Proof. Assume that 0 ≺ 𝑨 ≼ 𝑩. Since 𝑩^{−1/2} 𝑨 𝑩^{−1/2} ≼ I, we see that the matrix 𝑲 = 𝑩^{−1/2} 𝑨^{1/2} is a contraction. Let 𝑳 = (I − 𝑲^∗𝑲)^{1/2}. The matrix Jensen inequality (Lecture 13) implies that

    𝑓(𝑨) = 𝑓(𝑲^∗𝑩𝑲) = 𝑓(𝑲^∗𝑩𝑲 + 𝑳^∗ 0 𝑳) ≼ 𝑲^∗ 𝑓(𝑩)𝑲 + 𝑳^∗ 𝑓(0)𝑳 ≼ 𝑲^∗ 𝑓(𝑩)𝑲 .

Conjugate both sides by 𝑨^{−1/2} to see that

    𝑔(𝑨) = 𝑨^{−1/2} 𝑓(𝑨) 𝑨^{−1/2} ≼ 𝑩^{−1/2} 𝑓(𝑩) 𝑩^{−1/2} = 𝑔(𝑩).

Indeed, standard matrix functions of the same matrix commute. 


Problem 15.21 (Convexity from monotonicity). A converse of Proposition 15.20 also holds.
Assume that 𝑓 : ℝ+ → ℝ is continuous and 𝑓 ( 0) ≤ 0. If 𝑔 (𝑡 ) = 𝑓 (𝑡 )/𝑡 is matrix
monotone on ℝ++ , then 𝑓 : ℝ+ → ℝ is matrix convex. Hint: Using a dilation argument,
show that 𝑓 is matrix convex if and only if 𝑓 (𝑷 𝑨𝑷 ) 4 𝑷 𝑓 (𝑨)𝑷 for all orthoprojectors
𝑷 and all psd matrices 𝑨 . Show that monotonicity of 𝑔 implies this condition.
Exercise 15.22 (Powers: Convexity). We can identify convexity properties of the power
functions using Proposition 15.20 and Problem 15.21. Determine whether 𝑡 ↦→ 𝑡 𝑟 is
matrix convex or matrix concave for each 𝑟 ∈ [−1, 2] .

15.5.2 Integral representation


We are now prepared to give an integral representation of matrix convex functions on
the positive real line. Results of this type are usually attributed to Bendat & Sherman.
Corollary 15.23 (Matrix convexity: Integral representation). Consider a (continuous) matrix
convex function 𝑓 : ℝ+ → ℝ. Then 𝑓 admits the representation

    𝑓(𝑡) = 𝛼 + 𝛽𝑡 + 𝛾𝑡^2 + ∫_{ℝ++} 𝜆𝑡(𝑡 − 1)/(𝜆 + 𝑡) d𝜇(𝜆)   for all 𝑡 ≥ 0.
The coefficient 𝛾 ≥ 0, and the Borel measure 𝜇 is finite and positive.

Proof. We may shift 𝑓 so that 𝑓 ( 0) = 0. Proposition 15.20 implies that 𝑔 (𝑡 ) = 𝑓 (𝑡 )/𝑡


is matrix monotone on ℝ++ . Exercise 15.13 yields an integral representation for 𝑔 on
ℝ++ . Multiply through by 𝑡 , and remove the shift from 𝑓 . Since 𝑓 is continuous, the
representation is also valid at 𝑡 = 0. 

Exercise 15.24 (Entropy). Using an integral representation for the logarithm, find an
integral representation for the negative entropy function negent (𝑡 ) B 𝑡 log 𝑡 for 𝑡 ≥ 0.
Exercise 15.25 (Matrix convexity: Other intervals). Consider a matrix convex function
𝑓 : I → ℝ on an open interval I. Bendat & Sherman proved that the divided difference
𝑓 [1 ] (𝑠 , ·) : I → ℝ is matrix monotone for each 𝑠 ∈ I. Use this fact to derive an integral
representation for the matrix convex function 𝑓 .

15.6 Application: Matrix Jensen and Lyapunov inequalities


The integral representations from this lecture have remarkable consequences for matrix
analysis. In this section, we will use them to derive a stronger matrix Jensen inequality.
In turn, this inequality allows us to develop a satisfactory extension of Lyapunov’s
inequality.

15.6.1 Matrix Jensen for a unital, positive linear maps


In Lecture 13, we established the matrix Jensen inequality, which describes how matrix
convex functions interact with matrix convex combinations. Earlier, we argued that
all unital, positive linear maps can be regarded as averaging operators. The integral
representations for matrix convex functions now permit us to derive a Jensen inequality
for any unital, positive map.

Theorem 15.26 (Matrix Jensen: Positive linear maps). Consider a matrix convex function
𝑓 : ℝ+ → ℝ, and let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital positive linear map. Then

    𝑓(𝚽(𝑨)) ≼ 𝚽(𝑓(𝑨))   for all psd 𝑨 ∈ ℍ𝑛+ .

In particular, this inequality is valid when −𝑓 is matrix monotone.

We can trace this kind of result to work of Davis [Dav57] and Choi [Cho74]. See
also Ando’s paper [And79].

Proof. Corollary 15.23 implies that

    𝑓(𝑡) = 𝛼 + 𝛽𝑡 + 𝛾𝑡^2 + ∫_{ℝ++} [ 𝑡 − (𝜆 + 1) + 𝜆(𝜆 + 1)/(𝜆 + 𝑡) ] · 𝜆 d𝜇(𝜆).

The coefficient 𝛾 ≥ 0. In particular, for a psd matrix 𝑨,

    𝑓(𝑨) = 𝛼I + 𝛽𝑨 + 𝛾𝑨^2 + ∫_{ℝ++} [ 𝑨 − (𝜆 + 1)I + 𝜆(𝜆 + 1)(𝜆I + 𝑨)^{−1} ] · 𝜆 d𝜇(𝜆).

Kadison’s inequality and Choi’s inequality yield

    𝚽(𝑨^2) ≽ 𝚽(𝑨)^2   and   𝚽((𝜆I + 𝑨)^{−1}) ≽ (𝜆I + 𝚽(𝑨))^{−1}.

Apply 𝚽 to the integral representation of 𝑓(𝑨) to obtain

    𝚽(𝑓(𝑨)) ≽ 𝛼I + 𝛽𝚽(𝑨) + 𝛾𝚽(𝑨)^2
              + ∫_{ℝ++} [ 𝚽(𝑨) − (𝜆 + 1)I + 𝜆(𝜆 + 1)(𝜆I + 𝚽(𝑨))^{−1} ] · 𝜆 d𝜇(𝜆)
            = 𝑓(𝚽(𝑨)).
We have repeatedly used the linearity, the unital property, and the continuity of 𝚽.
The semidefinite inequality follows from Kadison’s and Choi’s inequalities. 
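As an illustration (not from the original notes), one can test Theorem 15.26 for the compression map 𝚽(𝑨) = 𝑽^∗𝑨𝑽, which is unital and positive when 𝑽 has orthonormal columns, and the matrix convex function 𝑓(𝑡) = 𝑡². The random test matrices below are arbitrary.

```python
import numpy as np

# Check f(Phi(A)) <= Phi(f(A)) for a compression map and f(t) = t^2.
rng = np.random.default_rng(3)
n, k = 7, 3
G = rng.standard_normal((n, n))
A = G @ G.T                                          # psd input
V = np.linalg.qr(rng.standard_normal((n, k)))[0]     # V^T V = I_k

lhs = (V.T @ A @ V) @ (V.T @ A @ V)                  # f(Phi(A))
rhs = V.T @ (A @ A) @ V                              # Phi(f(A))
print(np.linalg.eigvalsh(rhs - lhs).min())           # >= 0, up to roundoff
```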

Exercise 15.27 (Choi’s convexity theorem). Theorem 15.26 holds for every matrix convex
function 𝑓 , regardless of its domain. Prove it. Hint: Use Exercise 15.25.
Exercise 15.28 (Matrix Jensen). Deduce that the matrix Jensen inequality from Lecture 13
is a special case of Choi’s convexity theorem (Exercise 15.27). Hint: Consider the block
diagonal matrix 𝑨 = 𝑨 1 ⊕ 𝑨 2 ⊕ · · · ⊕ 𝑨 𝑛 . Show that the matrix convex combination is
a unital, positive linear map on this matrix.

15.6.2 Matrix Lyapunov inequalities


Theorem 15.26 has remarkable consequences when applied to the most common matrix
convex functions [And79, Cor. 4.2]. These results give additional credence to the idea
that unital, positive maps are averaging operators.
Corollary 15.29 (Matrix Lyapunov). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital, positive linear map. For each 𝑝 ≥ 1,

    𝚽(𝑨) ≼ ( 𝚽(𝑨^𝑝) )^{1/𝑝}   for all psd 𝑨 ∈ ℍ𝑛+ .

Warning: It is tempting to raise both sides of this inequality to the 𝑝th power, but the resulting statement is false for 𝑝 > 2. 

Proof. For 𝑟 ∈ ( 0, 1) , the function 𝑡 ↦→ 𝑡 𝑟 is matrix concave on ℝ++ . This statement


follows when we combine Example 15.17 and Theorem 15.18; alternatively, you can
make a direct inspection of the integral representation for the power. Theorem 15.26
now implies that
    𝚽(𝑨^𝑟) ≽ 𝚽(𝑨)^𝑟 .
Make the change of variables 𝑨 ↦→ 𝑨 1/𝑟 , and select 𝑟 = 1/𝑝 where 𝑝 ≥ 1. 
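A numerical spot check of Corollary 15.29 (illustrative only; the compression map, the exponent 𝑝 = 3, and the helper mpow are ad hoc choices):

```python
import numpy as np

def mpow(S, p):
    """Power of a symmetric psd matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(np.clip(w, 0, None) ** p) @ Q.T

rng = np.random.default_rng(4)
n, k, p = 6, 3, 3.0
G = rng.standard_normal((n, n)); A = G @ G.T
V = np.linalg.qr(rng.standard_normal((n, k)))[0]
Phi = lambda M: V.T @ M @ V                    # unital, positive linear map

gap = mpow(Phi(mpow(A, p)), 1 / p) - Phi(A)    # (Phi(A^p))^{1/p} - Phi(A)
print(np.linalg.eigvalsh(gap).min())           # >= 0, up to roundoff
```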
Exercise 15.30 (Matrix Lyapunov: Exponential). Let 𝚽 : 𝕄𝑛 → 𝕄𝑛 be a unital, strictly
positive linear map. Prove that

    𝚽(𝑨) ≼ log(𝚽(exp(𝑨)))   for all 𝑨 ∈ ℍ𝑛 .

Hint: Note that 𝑡 ↦→ log (𝜀 + 𝑡 ) is matrix concave for 𝑡 ≥ 0 and 𝜀 > 0. Choose an
appropriate value of 𝜀 ; there is no need to take a limit.
Exercise 15.31 (Matrix Lyapunov: More powers). Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital, positive
linear map. For 𝑝 ∈ [ 1/2, 1] , derive the matrix Lyapunov inequality
    ( 𝚽(𝑨^𝑝) )^{1/𝑝} ≼ 𝚽(𝑨)   for all psd 𝑨 ∈ ℍ𝑛+ .

Exercise 15.32 (Matrix entropy: Filtering). Recall that the entropy ent (𝑡 ) B −𝑡 log 𝑡 for
𝑡 ≥ 0 is matrix concave. Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital positive linear map. Show that

    ent(𝚽(𝑨)) ≽ 𝚽(ent(𝑨))   for all psd 𝑨 ∈ ℍ𝑛+ .

In other words, entropy decreases when we filter by a unital, positive linear map.

Notes
Loewner’s theorem is a classic result that has attracted a significant amount of attention,
and it has been the subject of several books, notably [Sim19]. Nevertheless, there does
not seem to be a short, accessible treatment of these ideas. Nor does there appear
to be a self-contained proof of Loewner’s theorem that can be presented in a single
lecture. This lecture is the instructor’s attempt to set out the main facts about integral
representation of matrix monotone and matrix convex functions in a way that might

be independently useful or support further study. It draws heavily on Bhatia [Bha97,


Chap. V] and Simon [Sim19], but the organization is new.
The proof of uniqueness in Loewner’s theorem is adapted from Simon’s book [Sim19,
Chap. 1]. The geometric proof, via Krein–Milman, is due to Hansen & Pedersen [HP82].
We have sketched some of the ideas from the argument, following Bhatia [Bha97,
Chap. V] and Simon [Sim19, Chap. 28]. The proof that elementary monotone functions
are extreme is adapted from an argument of Boutet de Monvel, which appears in
Simon’s book.
Simon [Sim19, Chap. 1] makes it clear that Loewner’s theorem can be transferred
from one open interval to another. We have expanded on this point to show how the
formulation for the strictly positive real line relates to the formulation on the standard
open interval. The method of condensing the result for the positive real line appears in
Bhatia’s book [Bha97, pp. 144–145].
Some of the examples in this lecture appear in Bhatia’s books [Bha97, Chap. V]
and [Bha07b, Chaps. 4, 5], while some of them were identified by the instructor to
support the presentation.
The results on filtering with a positive linear map date back to work of Davis [Dav57],
Choi [Cho74], and Ando [And79].

Lecture bibliography
[And79] T. Ando. “Concavity of certain maps on positive definite matrices and applications
to Hadamard products”. In: Linear Algebra and its Applications 26 (1979), pages 203–
241.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[Cho74] M.-D. Choi. “A Schwarz inequality for positive linear maps on 𝐶 ∗ -algebras”. In:
Illinois Journal of Mathematics 18.4 (1974), pages 565 –574. doi: 10.1215/ijm/
1256051007.
[Dav57] C. Davis. “A Schwarz inequality for convex operator functions”. In: Proc. Amer.
Math. Soc. 8 (1957), pages 42–44. doi: 10.2307/2032808.
[GR07] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Seventh.
Translated from the Russian, Translation edited and with a preface by Alan
Jeffrey and Daniel Zwillinger, With one CD-ROM (Windows, Macintosh and UNIX).
Elsevier/Academic Press, Amsterdam, 2007.
[HP82] F. Hansen and G. K. Pedersen. “Jensen’s Inequality for Operators and Löwner’s
Theorem.” In: Mathematische Annalen 258 (1982), pages 229–241.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
16. Matrix Means

Date: 24 February 2022 Scribe: Jing Yu

Agenda:
1. Scalar means
2. Axioms for matrix means
3. Representers for scalar means
4. Representers for matrix means
5. Matrix perspective transformations
6. Matrix means from representers
7. Integral representations

In this lecture, we consider the question of how to “average” a pair of psd matrices. We introduce a class of matrix means, and we give a complete characterization of these functions. This topic may seem like a departure from the recent lectures. In fact, the means that we study are in one-to-one correspondence with a class of matrix monotone functions.
We begin by formalizing the notion of a scalar mean, which describes an average of two positive numbers. Following the ideas of Kubo & Ando [KA79], we explain how to extend this construction to matrices. Then we show that (bivariate) means can be described by (univariate) representer functions. This observation leads to a satisfying theory in both the scalar and matrix settings.

16.1 Scalar means


What does it mean to “average” two positive numbers? Although the arithmetic mean
is the most widely used notion, you may have encountered several other ways to
compute an average of numbers. To draw a clear distinction with the arithmetic mean,
it is common to call these functions “means” rather than averages.
Example 16.1 (Scalar means). Here are several important means defined for strictly
positive numbers 𝑎, 𝑏 > 0.
• Arithmetic mean: The arithmetic mean is a familiar example that arises in statistics
and probability:
    (𝑎, 𝑏) ↦→ (1/2)(𝑎 + 𝑏).
The arithmetic mean is the best constant approximation to a random variable
that takes values 𝑎, 𝑏 with equal probability.
• Geometric mean: The formula for the area of a rectangle leads to the notion of
the geometric mean:

    (𝑎, 𝑏) ↦→ √(𝑎𝑏).
This mean also appears in the study of inequalities and in functional analysis
because it is connected to the convexity of the exponential function.
• Harmonic mean: The harmonic mean arises in the analysis of electrical circuits
because of Kirchhoff ’s law. It computes an average via the rule
    (𝑎, 𝑏) ↦→ ( (1/2) 𝑎^{−1} + (1/2) 𝑏^{−1} )^{−1} .
We can extend from strictly positive numbers to positive numbers by taking
limits.
• Logarithmic mean: The logarithmic mean is a less common example, although it
sometimes arises in the study of heat transfer. It takes the form
    (𝑎, 𝑏) ↦→ (𝑎 − 𝑏)/(log 𝑎 − log 𝑏) = ∫_0^1 𝑎^𝑟 𝑏^{1−𝑟} d𝑟 .

The second formula is valid for numbers that are positive, but not necessarily
strictly positive.

Other examples include the family of binomial means and the family of power means.
Have you seen other types of means? 

What are the properties common to these examples that can justify their interpre-
tation as the mean of two numbers? By extracting the key features, we may define a
class of scalar means.

Definition 16.2 (Scalar mean). A function 𝑀 : ℝ+ × ℝ+ → ℝ+ on pairs of positive


numbers is called a scalar mean if it has the following properties.

1. Strict positivity. The mean 𝑀(𝑎, 𝑏) > 0 for all 𝑎, 𝑏 > 0.
2. Ordering. The mean lies between the values of its arguments:

    0 ≤ 𝑎 ≤ 𝑏   implies   𝑎 ≤ 𝑀(𝑎, 𝑏) ≤ 𝑏 ;
    0 ≤ 𝑏 ≤ 𝑎   implies   𝑏 ≤ 𝑀(𝑎, 𝑏) ≤ 𝑎.

3. Monotonicity. The sections of the mean are increasing:

    𝑎 ↦→ 𝑀(𝑎, 𝑏) is increasing for each 𝑏 ≥ 0;
    𝑏 ↦→ 𝑀(𝑎, 𝑏) is increasing for each 𝑎 ≥ 0.

4. Positive homogeneity. A positive scalar can pass through the mean:

𝑀 (𝜆𝑎, 𝜆𝑏) = 𝜆𝑀 (𝑎, 𝑏) for each 𝜆 ≥ 0 and all 𝑎, 𝑏 ≥ 0.

5. Continuity. The mean (𝑎, 𝑏) ↦→ 𝑀 (𝑎, 𝑏) is continuous on ℝ+ × ℝ+ .

We say that a mean is symmetric if it also satisfies 𝑀 (𝑎, 𝑏) = 𝑀 (𝑏, 𝑎) for all
𝑎, 𝑏 ≥ 0. This property is typical, but we will not insist on it.

We remark that these properties are interrelated, so they are not fully independent
from each other. For instance, the ordering property already implies strict positivity.
For a symmetric mean, we do not need to make separate hypotheses about the behavior
of the first and second argument. It is also common to define the mean for strictly
positive numbers only, since we can use continuity to obtain the value of the mean
when one of the arguments is zero.
Exercise 16.3 (Scalar mean: Examples). Confirm that each item in Example 16.1 is a
symmetric scalar mean in the sense of Definition 16.2.

16.2 Matrix means


We may now define the concept of a matrix mean by generalizing the axioms for a
scalar mean. This approach to matrix means is a hybrid between the classic paper of
Aside: This approach is not the
Kubo & Ando [KA79] and the presentation in Bhatia [Bha07b, Chap. 4.1]. After giving
only way to construct a sensi-
the definition, we look at some of the most basic examples. ble notion of a matrix mean. In
particular, Hiai & Kosaki [HK99]
have developed another elegant
16.2.1 Axioms
theory.
We begin with an axiomatization of matrix means. For the most part, this task is
straightforward. Let us emphasize that we only consider means of psd matrices in
the same way that we only consider means of positive numbers. The ordering of real

numbers is replaced by the psd order. The only difficulty arises from the generalization
of the positive homogeneity property, which we will discuss after the definition.

Definition 16.4 (Matrix mean). Fix 𝑛 ∈ ℕ. A function 𝑴 : ℍ𝑛+ × ℍ𝑛+ → ℍ𝑛+ on pairs of
psd matrices is called a matrix mean on ℍ𝑛+ if it has the following properties.

1. Strict positivity. The mean 𝑴(𝑨, 𝑩) ≻ 0 for all 𝑨, 𝑩 ≻ 0.
2. Ordering. The mean lies between the values of its arguments:

    0 ≼ 𝑨 ≼ 𝑩   implies   𝑨 ≼ 𝑴(𝑨, 𝑩) ≼ 𝑩 ;
    0 ≼ 𝑩 ≼ 𝑨   implies   𝑩 ≼ 𝑴(𝑨, 𝑩) ≼ 𝑨.

3. Monotonicity. The sections of the mean are increasing:

    𝑨_1 ≼ 𝑨_2   implies   𝑴(𝑨_1, 𝑩) ≼ 𝑴(𝑨_2, 𝑩)   for each 𝑩 ≽ 0;
    𝑩_1 ≼ 𝑩_2   implies   𝑴(𝑨, 𝑩_1) ≼ 𝑴(𝑨, 𝑩_2)   for each 𝑨 ≽ 0.

4. Conjugation. For all 𝑨, 𝑩 ≽ 0, conjugation passes through the mean:

    𝑴(𝑿^∗𝑨𝑿, 𝑿^∗𝑩𝑿) = 𝑿^∗ 𝑴(𝑨, 𝑩) 𝑿   for each 𝑿 ∈ 𝕄𝑛 .

5. Continuity. The mean (𝑨, 𝑩) ↦→ 𝑴 (𝑨, 𝑩) is continuous on ℍ𝑛+ × ℍ𝑛+ .

We say that a matrix mean is symmetric if it also satisfies the identity 𝑴 (𝑨, 𝑩) =
𝑴 (𝑩, 𝑨) for all 𝑨, 𝑩 < 0.

As in the case of scalar means, we often define a matrix mean for strictly positive
matrices. The continuity requirement allows us to extend the matrix mean to all psd
matrices. For brevity, we will not give any details on these continuity arguments.
We can think about the conjugation axiom for a matrix mean as a counterpart to the
positive homogeneity property of a scalar mean. For each 𝜆 > 0, the function 𝑎 ↦→ 𝜆𝑎
is a bijection on the positive numbers. Likewise, for each nonsingular 𝑿 ∈ 𝕄𝑛 , the
congruence 𝑨 ↦→ 𝑿 ∗ 𝑨𝑿 is a bijection on the psd matrices. Therefore, it is natural to
ask that the mean preserve simultaneous congruence. The extension to all 𝑿 ∈ 𝕄𝑛
follows from a short continuity argument.
The conjugation axiom has some striking implications. For example, it ensures that
the mean of two scalar matrices must also be a scalar matrix (i.e., a multiple of the
identity matrix).
Exercise 16.5 (Matrix mean: Scalar matrices). Let 𝑴 be a matrix mean on ℍ𝑛+ , as in
Definition 16.4. For all scalars 𝛼, 𝛽 ≥ 0, prove that

𝑴 (𝛼 I𝑛 , 𝛽 I𝑛 ) = 𝛾 I𝑛 for some 𝛾 = 𝛾 (𝛼, 𝛽) ≥ 0.

Hint: Assume 𝑴 (𝛼 I, 𝛽 I) = 𝑨 . Apply the congruence property by writing I = 𝑸 ∗𝑸 for


a unitary matrix 𝑸 ∈ 𝕄𝑛 . Deduce that 𝑨 must be a scalar matrix by averaging over all
unitary 𝑸 .
Definition 16.4 provokes several questions. For example, do matrix means exist?
What are some interesting examples? Are matrix means interpretable as “averages”
of matrices? Can we characterize matrix means? To answer these questions, we first
present some examples of matrix means.

16.2.2 Basic examples


The most transparent example of a matrix mean is the matrix extension of the arithmetic
mean.

Definition 16.6 (Matrix arithmetic mean). The matrix arithmetic mean is the function

    𝑴(𝑨, 𝑩) ≔ (1/2)(𝑨 + 𝑩)   for 𝑨, 𝑩 ∈ ℍ𝑛+ .

Exercise 16.7 (Matrix arithmetic mean). Confirm that the matrix arithmetic mean is
symmetric, and it satisfies all the axioms in Definition 16.4.
There are two very basic, but not so obvious, examples of a matrix mean that merit
explicit mention.

Definition 16.8 (Left and right matrix means). The left matrix mean and right matrix
mean are respectively given by the expressions

𝑴 (𝑨, 𝑩) B 𝑨 for all 𝑨, 𝑩 ∈ ℍ𝑛+ ;


𝑴 (𝑨, 𝑩) B 𝑩 for all 𝑨, 𝑩 ∈ ℍ𝑛+ .

Exercise 16.9 (Matrix left and right means). Confirm that the left and right matrix means
satisfy the axioms in Definition 16.4, but they are not symmetric.

16.2.3 Matrix harmonic mean


Next, we turn to another example that also turns out to be truly fundamental.

Definition 16.10 (Matrix harmonic mean). The matrix harmonic mean is defined as

    𝑴(𝑨, 𝑩) ≔ ( (1/2) 𝑨^{−1} + (1/2) 𝑩^{−1} )^{−1}   for all 𝑨, 𝑩 ∈ ℍ𝑛++ .

We extend this definition to all psd matrices by taking limits:

𝑴 (𝑨, 𝑩) B lim𝜀 ↓0 𝑴 (𝑨 + 𝜀 I, 𝑩 + 𝜀 I) for all 𝑨, 𝑩 ∈ ℍ𝑛+ .

Exercise 16.11 (Harmonic mean: Projectors). Compute the harmonic mean 𝑴 (𝑷 , 𝑩) for an
orthogonal projector 𝑷 ∈ ℍ𝑛+ and a positive-definite matrix 𝑩 ∈ ℍ𝑛++ .
Exercise 16.12 (Matrix harmonic mean). Note that the matrix harmonic mean is symmetric.
Confirm that the matrix harmonic mean satisfies all the axioms in Definition 16.4. Hint:
Assume that the arguments are positive definite, and recall that the matrix inverse is
order reversing. Use continuity to extend the axioms to all psd matrices and to the
case of psd conjugation.
It is valuable to rewrite the harmonic mean using a related function called the
parallel sum [AJD69].

Definition 16.13 (Parallel sum). The parallel sum of two positive-definite matrices is
defined as

    (𝑨 : 𝑩) ≔ (𝑨^{−1} + 𝑩^{−1})^{−1}   where 𝑨, 𝑩 ∈ ℍ𝑛++ .
We extend this definition to psd matrices via continuity. It is evident that 2 (𝑨 : 𝑩)
coincides with the harmonic mean of 𝑨 and 𝑩 .

The parallel sum is intimately connected to Schur complements. Indeed, for



positive-definite matrices 𝑨, 𝑩 ∈ ℍ𝑛++ ,

𝑨 : 𝑩 = (𝑨 −1 + 𝑩 −1 ) −1 = 𝑩 (𝑨 + 𝑩) −1 𝑨 = 𝑨 − 𝑨 (𝑨 + 𝑩) −1 𝑨.

We can recognize the latter matrix as a Schur complement:


 
    𝑨 : 𝑩 = [ 𝑨   𝑨
              𝑨   𝑨 + 𝑩 ] / (𝑨 + 𝑩).

This representation extends to psd matrices. By symmetry, we also have the relation
𝑨 : 𝑩 = 𝑩 − 𝑩 (𝑨 + 𝑩) −1 𝑩 , and we can recognize the Schur complement of another
block matrix.
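These identities are convenient to verify numerically; a short sketch (with random positive-definite test matrices, an aside that is not part of the notes) follows.

```python
import numpy as np

# Check A : B = (A^{-1} + B^{-1})^{-1} = A - A (A + B)^{-1} A
# for positive-definite A, B; the harmonic mean is 2 (A : B).
rng = np.random.default_rng(5)
n = 4
G1, G2 = rng.standard_normal((2, n, n))
A = G1 @ G1.T + np.eye(n)
B = G2 @ G2.T + np.eye(n)

par1 = np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
par2 = A - A @ np.linalg.inv(A + B) @ A
print(np.allclose(par1, par2))   # True
```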
The representation of the parallel sum as a Schur complement gives an alternative
proof of the fact that the parallel sum is monotone. Indeed, Schur complements of a
psd matrix are increasing with respect to the matrix. We also discover that the parallel
sum is concave.
Exercise 16.14 (Parallel sum: Concavity). For psd matrices 𝑨 𝑖 , 𝑩 𝑖 ∈ ℍ𝑛+ for 𝑖 = 1, 2, use
the connection with Schur complements to prove that

    (𝜏𝑨_1 + 𝜏̄𝑨_2) : (𝜏𝑩_1 + 𝜏̄𝑩_2) ≽ 𝜏(𝑨_1 : 𝑩_1) + 𝜏̄(𝑨_2 : 𝑩_2)   for 𝜏 ∈ [0, 1].

As usual, 𝜏¯ B 1 − 𝜏 . In other words, the parallel sum is jointly concave on pairs of psd
matrices.

16.2.4 Matrix geometric mean


Beyond the most elementary examples, we may attempt to develop a matrix extension
of the geometric mean. We begin with the special case where the two matrices
commute. In this instance, it is natural to define the geometric mean as

𝑨 ♯ 𝑩 = 𝑨 1/2 𝑩 1/2 for commuting 𝑨, 𝑩 ∈ ℍ𝑛+ .

Indeed, commuting matrices are simultaneously diagonalizable, so this expression


amounts to computing the geometric mean of two diagonal matrices, entry by entry.
To extend this formula to all matrices, we realize that the matrix geometric mean ♯
must satisfy the conjugation axiom. For positive-definite 𝑨, 𝑩 ∈ ℍ𝑛++ , this condition
implies that

    𝑨 ♯ 𝑩 = (𝑨^{1/2} I 𝑨^{1/2}) ♯ (𝑨^{1/2} 𝑨^{−1/2}𝑩𝑨^{−1/2} 𝑨^{1/2})
          = 𝑨^{1/2} [ I ♯ (𝑨^{−1/2}𝑩𝑨^{−1/2}) ] 𝑨^{1/2} = 𝑨^{1/2} (𝑨^{−1/2}𝑩𝑨^{−1/2})^{1/2} 𝑨^{1/2} .

We have used the fact that the identity matrix commutes with everything. This
expression appears complicated, but we have no choice about it once we accept that
matrix means satisfy the conjugation axiom.
These considerations lead to the following definition of the matrix geometric mean.

Definition 16.15 (Matrix geometric mean). For psd matrices 𝑨, 𝑩 ∈ ℍ𝑛+ , the matrix
geometric mean is defined as
    𝑨 ♯ 𝑩 ≔ 𝑨^{1/2} · (𝑨^{−1/2} 𝑩 𝑨^{−1/2})^{1/2} · 𝑨^{1/2}   where 𝑨, 𝑩 ∈ ℍ𝑛++ .

We extend this definition to all psd matrices by continuity.
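The conjugation property that motivated this formula can be checked numerically from the definition. In the sketch below, the helper names and the random test matrices are arbitrary choices; it is an illustration only.

```python
import numpy as np

def msqrt(S):
    """Square root of a symmetric psd matrix via its eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

def geomean(A, B):
    """The formula of Definition 16.15 (A assumed positive definite)."""
    Ah = msqrt(A)
    Ahinv = np.linalg.inv(Ah)
    M = Ahinv @ B @ Ahinv
    M = (M + M.T) / 2            # symmetrize against roundoff
    return Ah @ msqrt(M) @ Ah

rng = np.random.default_rng(6)
n = 4
G1, G2, X = rng.standard_normal((3, n, n))
A = G1 @ G1.T + np.eye(n)
B = G2 @ G2.T + np.eye(n)

# Conjugation axiom: X*(A # B)X = (X*AX) # (X*BX) for nonsingular X.
lhs = X.T @ geomean(A, B) @ X
rhs = geomean(X.T @ A @ X, X.T @ B @ X)
print(np.allclose(lhs, rhs))     # True
```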



It is clear that the matrix geometric mean is positive. Using the fact that the
square-root is a matrix monotone function, it is not hard to show that the matrix
geometric mean satisfies the order property. On the other hand, it is not clear that the
matrix geometric mean has the other required properties (congruence, monotonicity).
Although one may prove these results directly, we will instead develop them as a
consequence of a more general theory of matrix means.

16.3 Representer functions for scalar means


Our goal is to develop a characterization of matrix means. To that end, we first return
to the scalar case, where we argue that scalar means are in one-to-one correspondence
with a class of univariate functions.

16.3.1 Representers from scalar means


When we fix one of its arguments, each scalar mean yields a univariate function. This
function has attractive properties of its own.

Definition 16.16 (Scalar mean: Representer). Consider a scalar mean 𝑀 : ℝ+ × ℝ+ →


ℝ+ . The representer of the mean is the function

𝑓 : ℝ+ → ℝ+ given by 𝑓 (𝑡 ) B 𝑀 ( 1, 𝑡 ) for 𝑡 ≥ 0.

It is easy to provide examples.


Example 16.17 (Scalar mean: Representers). Here are the representer functions of some
basic scalar means.

• Arithmetic mean. The representer of the arithmetic mean is 𝑓 (𝑡 ) = ( 1 + 𝑡 )/2.


• Geometric mean. The representer of the geometric mean is 𝑓 (𝑡 ) = 𝑡 1/2 .
• Harmonic mean. The representer of the harmonic mean is 𝑓 (𝑡 ) = 2𝑡 /(𝑡 + 1) .
• Logarithmic mean. The representer of the logarithmic mean is the function
𝑓 (𝑡 ) = (𝑡 − 1)/log (𝑡 ) .
You may wish to compute the representer functions for general power means and for
binomial means. 

The axioms for a scalar mean induce several structural constraints on its representer
function.
Exercise 16.18 (Scalar mean: Representer properties). Consider a scalar mean 𝑀 : ℝ+ ×
ℝ+ → ℝ+ . Introduce the representer function 𝑓 (𝑡 ) = 𝑀 ( 1, 𝑡 ) for 𝑡 ≥ 0. Prove that
the representer enjoys the following properties.

1. Strict positivity. The representer 𝑓 (𝑡 ) > 0 for 𝑡 > 0.


2. Normalization. The representer takes the value 𝑓 ( 1) = 1.
3. Monotonicity. The representer 𝑡 ↦→ 𝑓 (𝑡 ) is increasing.
4. Subadditivity. The function 𝑡 ↦→ 𝑓 (𝑡 )/𝑡 is decreasing for 𝑡 > 0.
5. Continuity. The representer 𝑓 is continuous.
6. Symmetry. If the mean 𝑀 is symmetric , then 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) for all 𝑡 > 0.

Most of these results are straightforward. The subadditivity and symmetry properties
may take a moment of thought.

16.3.2 Scalar means from representers


Of course, each scalar mean generates a unique representer function. The central
question is whether we can reverse the process. That is, can we reconstruct a scalar
mean from its representer? The answer is positive, as the next result shows.
Proposition 16.19 (Representers yield scalar means). Let 𝑓 : ℝ+ → ℝ be a function that
satisfies properties (1)–(5) in Exercise 16.18. Define the perspective function of 𝑓 :

𝑀 𝑓 (𝑎, 𝑏) B 𝑎 · 𝑓 (𝑏/𝑎) for all 𝑎, 𝑏 > 0.

We extend to a function 𝑀 𝑓 : ℝ+ × ℝ+ → ℝ+ by taking limits.


Then 𝑀 𝑓 is a scalar mean with representer 𝑓 . Indeed, 𝑓 ↔ 𝑀 𝑓 is a bijection
between representer functions and scalar means.
Moreover, if 𝑓 satisfies the symmetry property (6), then the mean 𝑀 𝑓 is symmetric
in its arguments.

Proof. We can easily verify that 𝑀 𝑓 enjoys the properties of a scalar mean directly
from the analogous properties of the representer 𝑓 . The only technical difficulty arises
in verifying the existence of the limit of 𝑀 𝑓 (𝑎, 𝑏) as 𝑎 ↓ 0 and 𝑏 > 0.
To see that representers and scalar means are in one-to-one correspondence, note
that 𝑓 (𝑡 ) = 𝑀 𝑓 ( 1, 𝑡 ) for all 𝑡 > 0, so that 𝑓 is the representer of 𝑀 𝑓 . 
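
To see the bijection of Proposition 16.19 in action, here is a small Python sketch (the function names are ours) that rebuilds the basic scalar means of Example 16.17 from their representers via the perspective 𝑀 𝑓 (𝑎, 𝑏) = 𝑎 · 𝑓 (𝑏/𝑎).

    import math

    def mean_from_representer(f):
        """Build the scalar mean M_f(a, b) = a * f(b / a) from a representer f."""
        return lambda a, b: a * f(b / a)

    arithmetic = mean_from_representer(lambda t: (1 + t) / 2)
    geometric  = mean_from_representer(lambda t: math.sqrt(t))
    harmonic   = mean_from_representer(lambda t: 2 * t / (t + 1))

    a, b = 3.0, 7.0
    assert math.isclose(arithmetic(a, b), (a + b) / 2)
    assert math.isclose(geometric(a, b), math.sqrt(a * b))
    assert math.isclose(harmonic(a, b), 2 / (1 / a + 1 / b))
    print("representers reproduce the scalar means")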

16.4 Representation of matrix means


We would like to undertake the same project in the matrix setting. In other words, we
wish to express matrix means in terms of representer functions. Although the approach
has a lot in common with the scalar setting, the conditions on a matrix representer
function are more stringent.

16.4.1 Representers from matrix means


One might imagine that a matrix mean would have more complicated behavior than a
scalar mean, and so it might not be possible to characterize it so simply. In fact, we
can reduce each matrix mean to a scalar function.

Definition 16.20 (Matrix mean: Representer). Let 𝑴 : ℍ𝑛+ × ℍ𝑛+ → ℍ𝑛+ be a matrix
mean on ℍ𝑛+ , as in Definition 16.4. The representer of the matrix mean is the scalar
function

𝑓 : ℝ+ → ℝ+ for which 𝑓 (𝑡 ) · I𝑛 = 𝑴 ( I𝑛 , 𝑡 · I𝑛 ) for 𝑡 ≥ 0.

This definition uses Exercise 16.5 to ensure that the mean of two scalar matrices is
itself a scalar matrix.

Exercise 16.21 (Matrix mean: Representers). Compute the representers for the matrix
arithmetic mean, left and right means, the matrix harmonic mean, and the matrix
geometric mean.
The axioms for a matrix mean ensure that the representer has many of the same
properties as in the scalar setting. The next result collects the easy facts.
Exercise 16.22 (Matrix mean: Representer properties). Consider a matrix mean 𝑴 on
ℍ𝑛+ . Introduce the representer function 𝑓 (𝑡 ) I = 𝑴 ( I, 𝑡 I) for 𝑡 ≥ 0. Prove that the
representer enjoys the following properties.

1. Strict positivity.The representer 𝑓 (𝑡 ) > 0 for 𝑡 > 0.


2. Normalization. The representer takes the value 𝑓 ( 1) = 1.
3. Monotonicity. The representer 𝑡 ↦→ 𝑓 (𝑡 ) is monotone.
4. Subadditivity. The function 𝑡 ↦→ 𝑓 (𝑡 )/𝑡 is decreasing for 𝑡 > 0.
5. Continuity. The representer 𝑓 is continuous.
6. Symmetry. If the mean 𝑴 is symmetric , then 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) for all 𝑡 > 0.

The arguments are essentially the same as in the scalar case.

16.4.2 Monotonicity of the matrix mean representer


Although we have defined the matrix mean representer in terms of scalar matrices, it
actually captures the action of the mean between the identity and any psd matrix. As
a consequence, we discover that the matrix mean representer is a matrix monotone
function on matrices of an appropriate dimension.

Theorem 16.23 (Matrix mean representer). Consider a matrix mean 𝑴 on ℍ𝑛+ with
representer 𝑓 : ℝ+ → ℝ+ . Then the representer function satisfies

𝑓 (𝑩) = 𝑴 ( I, 𝑩) for all 𝑩 ∈ ℍ𝑛+ .

In particular, the representer function is matrix monotone on ℍ𝑛+ .

As a consequence, a scalar mean representer need not be a matrix mean representer


because scalar monotonicity (𝑛 = 1) does not imply matrix monotonicity (𝑛 > 1).
We begin with a lemma that describes how matrix means interact with orthogonal
projectors.
Lemma 16.24 (Matrix mean: Commuting projectors). Under the assumptions of Theo-
rem 16.23, let 𝑷 ∈ ℍ𝑛+ be an orthogonal projector that commutes with 𝑨, 𝑩 ∈ ℍ𝑛+ .
Then
𝑷 𝑴 (𝑨, 𝑩) = 𝑴 (𝑨, 𝑩)𝑷 = 𝑴 (𝑨𝑷 , 𝑩𝑷 )𝑷 .
That is, the projector commutes with the mean and satisfies a conjugation property.

Proof. First, we show that 𝑷 commutes with the mean 𝑴 (𝑨, 𝑩). Indeed, since 𝑷 commutes with 𝑨, we have the relation 𝑨𝑷 = 𝑨^{1/2}𝑷𝑨^{1/2} ≼ 𝑨. Likewise, 𝑩𝑷 ≼ 𝑩. By monotonicity of the mean, 𝑴 (𝑨𝑷, 𝑩𝑷) ≼ 𝑴 (𝑨, 𝑩). Conjugate by the projector to see that

    𝑷𝑴 (𝑨, 𝑩)𝑷 = 𝑴 (𝑷𝑨𝑷, 𝑷𝑩𝑷) = 𝑴 (𝑨𝑷, 𝑩𝑷) ≼ 𝑴 (𝑨, 𝑩).

We have used the conjugation property of the mean at the first step.
Equivalently, we have shown that

    𝑴 (𝑨, 𝑩) − 𝑷𝑴 (𝑨, 𝑩)𝑷 ≽ 0.

This difference is a psd matrix, and its compression to range(𝑷) equals zero; for a psd matrix, this forces the corresponding “off-diagonal blocks” to vanish as well. (To see what is going on, assume that range(𝑷) is a subspace spanned by coordinates. Then we are removing a “diagonal block” from the psd matrix 𝑴 (𝑨, 𝑩), and yet we are left with a psd matrix. This can only happen if the “off-diagonal blocks” are zero too.) In particular,

    (I − 𝑷)𝑴 (𝑨, 𝑩)𝑷 = 0 = 𝑷𝑴 (𝑨, 𝑩)(I − 𝑷).

The last relation is the conjugate transpose of the first. Rearrange this expression to
see that 𝑷 commutes with 𝑴 (𝑨, 𝑩) .
By a similar argument, 𝑷 also commutes with 𝑴 (𝑨𝑷 , 𝑩𝑷 ) . These two facts
complete the proof. 

With this lemma at hand, we may now establish Theorem 16.23.

Proof of Theorem 16.23. Fix a matrix 𝑩 ∈ ℍ𝑛+ , and introduce the spectral resolution
𝑩 = Σᵢ 𝜆ᵢ 𝑷ᵢ. Since the projectors decompose the identity,

    𝑴 (I, 𝑩) = Σᵢ 𝑴 (I, 𝑩)𝑷ᵢ
             = Σᵢ 𝑴 (𝑷ᵢ, 𝑩𝑷ᵢ)𝑷ᵢ
             = Σᵢ 𝑴 (𝑷ᵢ, 𝜆ᵢ𝑷ᵢ)𝑷ᵢ = Σᵢ 𝑴 (I, 𝜆ᵢ I)𝑷ᵢ
             = Σᵢ 𝑓 (𝜆ᵢ)𝑷ᵢ = 𝑓 (𝑩).

We have used Lemma 16.24 in the second and third lines. To pass from the second line
to the third, we used the fact that the range of the projector 𝑷ᵢ is an eigenspace of
𝑩 with eigenvalue 𝜆𝑖 . Last, we have recognized the matrix mean representer, and we
applied the definition of a standard matrix function.
Finally, consider 𝑛 × 𝑛 matrices with 0 ≼ 𝑩₁ ≼ 𝑩₂. Then

    𝑓 (𝑩₁) = 𝑴 (I, 𝑩₁) ≼ 𝑴 (I, 𝑩₂) = 𝑓 (𝑩₂).

We have invoked the axiom that the matrix mean is monotone on ℍ𝑛+ . 
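
Theorem 16.23 can be spot-checked numerically. The sketch below (assuming NumPy; helper names ours) uses the matrix harmonic mean, whose representer is 𝑓 (𝑡) = 2𝑡/(1 + 𝑡), and confirms that 𝑴 (I, 𝑩) coincides with the standard matrix function 𝑓 (𝑩).

    import numpy as np

    def harmonic_mean(A, B):
        """Matrix harmonic mean 2 (A : B) = 2 (A^{-1} + B^{-1})^{-1}."""
        return 2 * np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))

    def matrix_function(f, B):
        """Standard matrix function of a symmetric matrix via eigendecomposition."""
        w, V = np.linalg.eigh(B)
        return (V * f(w)) @ V.T

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 5))
    B = X @ X.T + np.eye(5)            # positive definite
    I = np.eye(5)

    f = lambda t: 2 * t / (1 + t)      # representer of the harmonic mean
    print(np.allclose(harmonic_mean(I, B), matrix_function(f, B)))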

16.5 Matrix means from matrix representers


As in the scalar case, our next task is to reverse the process and try to construct a
matrix mean from a matrix representer function.

16.5.1 Matrix perspective transformations


Our goal is to use the univariate matrix mean representer to construct a bivariate matrix
mean. As with the matrix geometric mean, we recognize that the conjugation axiom
forces our hand. Therefore, the structure of the matrix mean is already determined by
its representer.

Definition 16.25 (Matrix perspective transformation). Let 𝑓 : ℝ+ → ℝ be a (continuous)


function. The matrix perspective of 𝑓 is the bivariate matrix function

𝑴 𝑓 (𝑨, 𝑩) B 𝑨 1/2 · 𝑓 (𝑨 −1/2 𝑩 𝑨 −1/2 ) · 𝑨 1/2 for 𝑨, 𝑩 ∈ ℍ𝑛++ . (16.1)

In particular,

𝑴 𝑓 (𝑨, 𝑩) = 𝑨 · 𝑓 (𝑩 𝑨 −1 ) for commuting 𝑨, 𝑩 ∈ ℍ𝑛++ .

If 𝑓 is subadditive, we may extend the perspective to ℍ𝑛+ × ℍ𝑛+ by taking limits.

Regardless of the choice of the standard matrix function 𝑓 , the perspective trans-
formation 𝑴 𝑓 interacts nicely with conjugation. When the function 𝑓 has additional
properties, the perspective 𝑴 𝑓 may inherit some of these features. This section
contains some elaboration, and we will continue this discussion in the next lecture.
Although the form (16.1) of the perspective transformation is motivated by the
conjugation axiom, it is not immediate that the perspective satisfies the conjugation
property. The first proposition guarantees that it does.
Proposition 16.26 (Matrix perspective: Conjugation). Let 𝑓 : ℝ+ → ℝ+ be a (continuous)
function. Then the perspective transformation 𝑴 𝑓 satisfies the conjugation axiom. For

all 𝑨, 𝑩 ∈ ℍ𝑛++ ,

𝑴 𝑓 (𝑿 ∗ 𝑨𝑿 , 𝑿 ∗ 𝑩 𝑿 ) = 𝑿 ∗ 𝑴 𝑓 (𝑨, 𝑩)𝑿 for all 𝑿 ∈ 𝕄𝑛 .

If 𝑓 is subadditive, this expression extends to all psd matrices 𝑨, 𝑩 ∈ ℍ𝑛+ .

Proof. Assume that 𝑨, 𝑩 ∈ ℍ𝑛++ are positive definite, and fix a nonsingular matrix
𝑿 ∈ 𝕄𝑛 . The remaining cases will follow from continuity.
The quantity of interest takes the unwieldy form

𝑴 𝑓 (𝑿^∗𝑨𝑿, 𝑿^∗𝑩𝑿)
    = (𝑿^∗𝑨𝑿)^{1/2} · 𝑓 ((𝑿^∗𝑨𝑿)^{−1/2}(𝑿^∗𝑩𝑿)(𝑿^∗𝑨𝑿)^{−1/2}) · (𝑿^∗𝑨𝑿)^{1/2}.


To tame this expression, introduce the matrix 𝒀 = 𝑿 (𝑿 ∗ 𝑨𝑿 ) −1/2 . It has the polar
factorization 𝒀 = 𝑷𝑼 where 𝑷 = (𝒀 𝒀 ∗ ) 1/2 and 𝑼 is unitary. After a short calculation,
we find that 𝑷 = 𝑨 −1/2 . Therefore,

𝑴 𝑓 (𝑿^∗𝑨𝑿, 𝑿^∗𝑩𝑿) = 𝑿^∗𝒀^{−∗} · 𝑓 (𝒀^∗𝑩𝒀) · 𝒀^{−1}𝑿
    = 𝑿^∗𝑷^{−1}𝑼 · 𝑓 (𝑼^∗(𝑷𝑩𝑷)𝑼) · 𝑼^∗𝑷^{−1}𝑿
    = 𝑿^∗𝑨^{1/2} · 𝑓 (𝑨^{−1/2}𝑩𝑨^{−1/2}) · 𝑨^{1/2}𝑿 = 𝑿^∗𝑴 𝑓 (𝑨, 𝑩)𝑿.

We have used the unitary equivariance of the standard matrix function to eliminate
the unitary matrices. 
An important corollary of the conjugation invariance property is that the perspective
transformation of a matrix monotone function is monotone in both arguments.
Corollary 16.27 (Matrix perspective: Monotonicity). Let 𝑓 : ℝ+ → ℝ+ be a continuous
matrix monotone function. Then each variable of the perspective transformation 𝑴 𝑓
is matrix monotone on ℍ𝑛+ :

0 ≼ 𝑨₁ ≼ 𝑨₂ implies 𝑴 𝑓 (𝑨₁, 𝑩) ≼ 𝑴 𝑓 (𝑨₂, 𝑩) for all 𝑩 ≽ 0;
0 ≼ 𝑩₁ ≼ 𝑩₂ implies 𝑴 𝑓 (𝑨, 𝑩₁) ≼ 𝑴 𝑓 (𝑨, 𝑩₂) for all 𝑨 ≽ 0.

Proof. First, we remark that a positive matrix monotone function is always subadditive. (Indeed, a matrix monotone function 𝑓 : ℝ+ → ℝ+ is concave, so it must be subadditive.) Therefore, we can take limits to extend the perspective 𝑴 𝑓 from positive-definite matrices to psd matrices.
The matrix monotonicity of 𝑩 ↦→ 𝑴 𝑓 (𝑨, 𝑩) is an easy consequence of the expres-
sion (16.1) for the perspective 𝑴 𝑓 , the conjugation rule, and the matrix monotonicity
of 𝑓 .
As for the other variable, we may assume that 𝑨, 𝑩 are positive definite. Then the
conjugation property (Proposition 16.26) ensures that

𝑴 𝑓 (𝑨, 𝑩) = 𝑩 1/2 𝑴 𝑓 (𝑩 −1/2 𝑨𝑩 −1/2 , I)𝑩 1/2 .

From the definition (16.1) of the perspective, we find that

𝑴 𝑓 (𝑩^{−1/2}𝑨𝑩^{−1/2}, I) = (𝑩^{−1/2}𝑨𝑩^{−1/2}) · 𝑓 ((𝑩^{−1/2}𝑨𝑩^{−1/2})^{−1}).




Since 𝑓 is matrix monotone, the function 𝑡 ↦→ 𝑡 · 𝑓 ( 1/𝑡 ) is also matrix monotone. (See
Problem 16.28.) Therefore, 𝑨 ↦→ 𝑴 𝑓 (𝑩 −1/2 𝑨𝑩 −1/2 , I) is matrix monotone. By the
conjugation rule, we conclude that 𝑨 ↦→ 𝑴 𝑓 (𝑨, 𝑩) is also matrix monotone. 

Problem 16.28 (Matrix perspective: Monotonicity property). Suppose that 𝑓 : ℝ++ → ℝ+ is


matrix monotone. For each contraction 𝑲 , prove that

𝑲^∗ 𝑓 (𝑨)𝑲 ≼ 𝑓 (𝑲^∗𝑨𝑲) for all psd 𝑨 .

Deduce that 𝑔 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) is matrix monotone on ℝ++ . Hint: If you exploit the fact
that 𝑓 is matrix concave, then the proof is easy. But this argument may be circular, so
you should give an independent proof; see Exercise ??.
In general, the perspective function 𝑴 𝑓 treats its two arguments rather differently.
In the context of matrix means, it is valuable to understand when the perspective is
symmetric in its arguments. The next exercise states the result.
Exercise 16.29 (Matrix perspective: Symmetry). Suppose that 𝑓 : ℝ+ → ℝ+ has the
property that 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) for 𝑡 > 0. Prove that

𝑴 𝑓 (𝑨, 𝑩) = 𝑴 𝑓 (𝑩, 𝑨) for all 𝑨, 𝑩 ∈ ℍ𝑛++ .

Hint: This point follows easily from the conjugation property, much like the result of
Corollary 16.27.

16.5.2 Families of matrix means


In this lecture, we will be interested in what happens when we take the perspective
transformation of a matrix mean representer. The next definition collects the properties
that we need.

Definition 16.30 (Matrix mean representer). We say that 𝑓 : ℝ+ → ℝ+ is a matrix


mean representer when 𝑓 is a (continuous) matrix monotone function that satisfies
the normalization 𝑓 ( 1) = 1.

Definition 16.31 (Matrix mean family). Let 𝑓 : ℝ+ → ℝ+ be a matrix mean representer,


as in Definition 16.30. Using the matrix perspective, we can define a bivariate
function on matrices of any dimension 𝑛 ∈ ℕ:

𝑴 𝑓 (𝑨, 𝑩) B 𝑨 1/2 · 𝑓 (𝑨 −1/2 𝑩 𝑨 −1/2 ) · 𝑨 1/2 for all 𝑨, 𝑩 ∈ ℍ𝑛++ .

We extend to psd matrices by continuity. We call 𝑴 𝑓 a family of matrix means.

Our main result states that the matrix perspective of a matrix mean representer
generates a family of matrix means. We have already laid most of the groundwork, so
this result will follow quickly.

Theorem 16.32 (Matrix representers yield matrix means). Let 𝑓 be a matrix mean
representer, as in Definition 16.30, and let 𝑴 𝑓 be the associated family of matrix
means, as in Definition 16.31.
Then 𝑴 𝑓 is a matrix mean on ℍ𝑛+ for each 𝑛 ∈ ℕ, with matrix mean
representer 𝑓 . Indeed, 𝑓 ↔ 𝑴 𝑓 is a bijection between matrix representer
functions and families of matrix means.
Moreover, if 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) for 𝑡 > 0, then the mean 𝑴 𝑓 is symmetric in its
arguments.

Proof. When 𝑓 is a matrix mean representer, we can argue that 𝑴 𝑓 is a matrix mean
on matrices of any dimension 𝑛 ∈ ℕ.

Positivity. Strict positivity of 𝑴 𝑓 follows from the fact that a matrix monotone
function 𝑓 : ℝ+ → ℝ+ with 𝑓 ( 1) = 1 is strictly positive.
Order. The order properties are straightforward. For example, to see that 0 ≺ 𝑨 ≼ 𝑩 implies that 𝑨 ≼ 𝑴 𝑓 (𝑨, 𝑩), we just need to check that I ≼ 𝑓 (𝑨^{−1/2}𝑩𝑨^{−1/2}). But this follows from the relation I ≼ 𝑨^{−1/2}𝑩𝑨^{−1/2}, the matrix monotonicity of 𝑓, and the normalization 𝑓 (1) = 1. To check that 0 ≺ 𝑨 ≼ 𝑩 implies that 𝑴 𝑓 (𝑨, 𝑩) ≼ 𝑩, we invoke the conjugation property (Proposition 16.26) to obtain the equivalent relation 𝑓 (𝑨^{−1/2}𝑩𝑨^{−1/2}) ≼ 𝑨^{−1/2}𝑩𝑨^{−1/2}. This relation holds because a positive matrix monotone function with 𝑓 (1) = 1 is concave, so 𝑓 (𝑡) ≤ 𝑡 for 𝑡 ≥ 1. The other cases are essentially the same.
Monotonicity. Corollary 16.27 already establishes the monotonicity property.
Conjugation. The conjugation axiom was obtained in Proposition 16.26.
Continuity. The function 𝑴 𝑓 is continuous on positive-definite matrices since 𝑓 is
continuous. It is continuous for psd matrices by construction.
Symmetry. Exercise 16.29 requests the proof of the symmetry property.
Bijection. Finally, note that 𝑴 𝑓 ( I, 𝑡 I) = 𝑓 (𝑡 ) · I for all 𝑡 > 0. In other words, 𝑓 is
the matrix mean representer of 𝑴 𝑓 . Since 𝑴 𝑓 is a matrix mean for each dimension
𝑛 ∈ ℕ, the function 𝑓 must be matrix monotone, continuous, and normalized with
𝑓 ( 1) = 1. We conclude that 𝑓 ↔ 𝑴 𝑓 is a bijection. 

16.5.3 Examples
Examples of families of matrix means include the matrix arithmetic mean, the left and
right matrix mean, the matrix harmonic mean, and the matrix geometric mean. There
are other important examples.
Example 16.33 (Weighted matrix geometric means). For a parameter 𝑟 ∈ [ 0, 1] , we may
consider the matrix monotone function 𝑓 (𝑡 ) = 𝑡 𝑟 . This function induces a family of
weighted geometric means:

𝑨 ♯𝑟 𝑩 ≔ 𝑨^{1/2} · (𝑨^{−1/2} 𝑩 𝑨^{−1/2})^{𝑟} · 𝑨^{1/2} for 𝑨, 𝑩 ∈ ℍ𝑛++ and 𝑛 ∈ ℕ.
We can extend to all psd matrices by taking limits. The weighted geometric means
interpolate between the left and right mean. With respect to an appropriate geometry
on psd matrices, the weighted geometric means 𝑟 ↦→ 𝑨 ♯𝑟 𝑩 trace out a geodesic
between 𝑨 and 𝑩 . We recognize the ordinary matrix geometric mean 𝑨 ♯ 𝑩 as the
midpoint of this geodesic. 
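
A brief numerical sketch of Example 16.33 (assuming NumPy; helper names ours): the weighted geometric mean has endpoints 𝑨 ♯₀ 𝑩 = 𝑨 and 𝑨 ♯₁ 𝑩 = 𝑩, and its midpoint satisfies the Riccati relation that characterizes 𝑨 ♯ 𝑩.

    import numpy as np

    def pd_power(A, p):
        w, V = np.linalg.eigh(A)
        return (V * w**p) @ V.T

    def weighted_geometric_mean(A, B, r):
        """Weighted matrix geometric mean A #_r B for positive-definite A, B."""
        Ah, Aih = pd_power(A, 0.5), pd_power(A, -0.5)
        return Ah @ pd_power(Aih @ B @ Aih, r) @ Ah

    rng = np.random.default_rng(1)
    X = rng.standard_normal((4, 4)); A = X @ X.T + np.eye(4)
    Y = rng.standard_normal((4, 4)); B = Y @ Y.T + np.eye(4)

    print(np.allclose(weighted_geometric_mean(A, B, 0.0), A))   # left endpoint
    print(np.allclose(weighted_geometric_mean(A, B, 1.0), B))   # right endpoint
    G = weighted_geometric_mean(A, B, 0.5)
    print(np.allclose(G @ np.linalg.inv(A) @ G, B))             # midpoint is A # B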
Example 16.34 (Matrix logarithmic mean). The function 𝑓 (𝑡) = ∫₀¹ 𝑡^𝑟 d𝑟 = (𝑡 − 1)/log(𝑡)
represents the scalar logarithmic mean. This function is matrix monotone and satisfies
the normalization 𝑓 ( 1) = 1, so it induces a family of symmetric matrix means, most
easily written using the weighted geometric mean:
𝑴 𝑓 (𝑨, 𝑩) = ∫₀¹ (𝑨 ♯𝑟 𝑩) d𝑟 for 𝑨, 𝑩 ∈ ℍ𝑛+ and 𝑛 ∈ ℕ.

Is there an alternative expression that makes the role of the logarithm clear? 

16.5.4 Integral representations


Theorem 16.32 allows us to draw on Loewner’s theory of matrix monotone functions.
Recall that every (continuous) matrix monotone function 𝑓 : ℝ+ → ℝ+ with 𝑓 ( 1) = 1
has an integral representation:
𝑓 (𝑨) = 𝛼 I + 𝛽𝑨 + ∫₀^∞ 𝜆𝑨(𝑨 + 𝜆 I)^{−1} · (1 + 𝜆)/𝜆 d𝜇(𝜆) for all psd 𝑨 .

The coefficients 𝛼, 𝛽 ≥ 0 and 𝜇 is a finite, positive Borel measure on ℝ++ . The


normalization ensures that 𝛼 + 𝛽 +𝜇(ℝ++ ) = 1. This fact has a spectacular consequence
for matrix means.
Exercise 16.35 (Matrix mean: Integral representation). Let 𝑓 : ℝ+ → ℝ+ be a continuous
matrix monotone function with 𝑓 ( 1) = 1. Then the associated family 𝑴 𝑓 of matrix
means takes the form
𝑴 𝑓 (𝑨, 𝑩) = 𝛼𝑨 + 𝛽𝑩 + ∫₀^∞ 2(𝜆𝑨 : 𝑩) · (1 + 𝜆)/(2𝜆) d𝜇(𝜆)

for all psd 𝑨, 𝑩 with the same dimension. Here, 𝑨 : 𝑩 denotes the parallel sum.
The coefficients 𝛼, 𝛽 ≥ 0 and 𝜇 is a finite, positive Borel measure on ℝ++ . The
normalization ensures that 𝛼 + 𝛽 + 𝜇(ℝ++ ) = 1.
Find a simplification for the symmetric case where 𝑴 𝑓 (𝑨, 𝑩) = 𝑴 𝑓 (𝑩, 𝑨) .
Exercise 16.35 has the remarkable interpretation that every family 𝑴 𝑓 of matrix
means consists of a conic combination of the left mean, the right mean, and a family
of harmonic means. We can deduce other significant results as well.
Exercise 16.36 (Minimal and maximal means). The harmonic mean is the minimal matrix
mean, while the arithmetic mean is the maximal matrix mean. That is, for a family
𝑴 𝑓 of symmetric matrix means,
2(𝑨 : 𝑩) ≼ 𝑴 𝑓 (𝑨, 𝑩) ≼ ½(𝑨 + 𝑩)

for all psd 𝑨, 𝑩 with the same dimension.
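
Exercise 16.36 can be checked on examples. The sketch below (assuming NumPy; helper names ours) uses the symmetric family generated by 𝑓 (𝑡) = 𝑡^{1/2}, namely the matrix geometric mean, and verifies the harmonic–geometric–arithmetic ordering on random positive-definite inputs by inspecting eigenvalues of the differences.

    import numpy as np

    def pd_power(A, p):
        w, V = np.linalg.eigh(A)
        return (V * w**p) @ V.T

    def geometric_mean(A, B):
        Ah, Aih = pd_power(A, 0.5), pd_power(A, -0.5)
        return Ah @ pd_power(Aih @ B @ Aih, 0.5) @ Ah

    def is_psd(M, tol=1e-10):
        return np.linalg.eigvalsh((M + M.T) / 2).min() >= -tol

    rng = np.random.default_rng(2)
    X = rng.standard_normal((5, 5)); A = X @ X.T + np.eye(5)
    Y = rng.standard_normal((5, 5)); B = Y @ Y.T + np.eye(5)

    harmonic   = 2 * np.linalg.inv(np.linalg.inv(A) + np.linalg.inv(B))
    geometric  = geometric_mean(A, B)
    arithmetic = (A + B) / 2

    print(is_psd(geometric - harmonic))    # harmonic mean below geometric mean
    print(is_psd(arithmetic - geometric))  # geometric mean below arithmetic mean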


Exercise 16.37 (Matrix means: Concavity). Let 𝑴 𝑓 be a family of matrix means. For all
psd 𝑨 𝑖 , 𝑩 𝑖 with the same dimension (𝑖 = 1, 2),

𝑴 𝑓 (𝜏𝑨₁ + 𝜏̄𝑨₂, 𝜏𝑩₁ + 𝜏̄𝑩₂) ≽ 𝜏𝑴 𝑓 (𝑨₁, 𝑩₁) + 𝜏̄𝑴 𝑓 (𝑨₂, 𝑩₂) for all 𝜏 ∈ [0, 1].

As usual, 𝜏¯ B 1 − 𝜏 . Hint: The parallel sum is matrix concave.


We can strengthen the last exercise substantially.
Problem 16.38 (Matrix means: Filtering). Consider a family 𝑴 𝑓 of matrix means. Let
𝚽 : 𝕄𝑛 → 𝕄𝑛 be a positive linear map (not necessarily unital!). Prove that

𝑴 𝑓 (𝚽(𝑨), 𝚽(𝑩)) ≽ 𝚽(𝑴 𝑓 (𝑨, 𝑩)).

Hint: The parallel sum satisfies the same concavity property. First, show that this claim
holds for a unital, strictly positive linear map by invoking Choi’s inequality. Remove
the extra conditions by emulating the proof of the Russo–Dye theorem (Lecture 12).

Notes
The results in this lecture are drawn from Kubo & Ando [KA79] and from Bhatia’s
book [Bha07b, Chap. 4]. The arrangement of material is somewhat different from
these sources. It may be conceptually simpler to regard the basic object as a family of
means induced by a function and then to derive the properties from this representation.

Lecture bibliography
[AJD69] W. N. Anderson Jr and R. J. Duffin. “Series and parallel addition of matrices”. In:
Journal of Mathematical Analysis and Applications 26.3 (1969), pages 576–594.

[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
17. Quantum Relative Entropy

Date: 1 March 2022 Scribe: Eray Unsal Atay

In this lecture, we study matrix entropy functions, also known as quantum entropies. Quantum entropies arise in quantum information theory and quantum statistical physics. They also have remarkable applications in random matrix theory and data science. Quantum entropies have deep roots in matrix analysis because matrix monotone and convex functions play an important role in the analysis.

Agenda:
1. Scalar entropies
2. Matrix entropies
3. Matrix perspectives
4. Tensors and logarithms
5. Convexity of quantum relative entropy

We start with some background from information theory in the scalar setting.
We introduce the (scalar) entropy function and the relative entropy, and we discuss
their basic properties. These definitions extend to the matrix setting, and we will
see that matrix entropies have many properties that parallel those of scalar entropies.
Nevertheless, the properties of matrix entropies are far more difficult to establish. We
show how to develop these results using matrix perspective transformations, combined
with the idea of lifting a matrix problem to tensors.
We will present two major results in this lecture. The first is the convexity of the
matrix perspective transformation of a matrix convex function. The second result is
the convexity of quantum relative entropy, which represents a nontrivial application of
matrix perspectives. This theorem is one of the crown jewels of matrix analysis.

17.1 Entropy and relative entropy


This section provides the definitions of entropy and relative entropy in the scalar
setting. The entropy measures the disorder in a probability distribution, while the
relative entropy reflects the difference between two probability distributions.

17.1.1 Probability distributions and entropy


We begin with the definition of the probability simplex.
Definition 17.1 (Probability simplex). Define the probability simplex Δ𝑛 in ℝ𝑛 to be the set of positive vectors whose entries add up to one:

    Δ𝑛 ≔ {𝒑 ∈ ℝ𝑛 : 𝒑 ≥ 0 and tr(𝒑) = 1}.

In this expression, ≥ is the entrywise inequality. Each vector 𝒑 ∈ Δ𝑛 models a probability distribution on {1, . . . , 𝑛}.

The set Δ𝑛 is closed and convex in ℝ𝑛. Its extreme points are the standard basis vectors 𝜹ᵢ for 𝑖 = 1, . . . , 𝑛.

The entropy is a function on the probability simplex Δ𝑛 that measures the amount
of randomness in a probability distribution.

Definition 17.2 (Entropy). The entropy of a probability distribution 𝒑 ∈ Δ𝑛 is


ent(𝒑) ≔ − Σᵢ₌₁ⁿ 𝑝ᵢ log 𝑝ᵢ .

We instate the convention that 0 log 0 = 0.

Recall that negative entropy is an isotone (i.e., Schur convex) function. The theory
of isotone functions implies that
0 ≤ ent (𝒑) ≤ log 𝑛 for each 𝒑 ∈ Δ𝑛 . (17.1)
The minimum in (17.1) is achieved when 𝒑 = 𝜹 𝑖 for some 𝑖 ∈ {1, . . . , 𝑛}. These
probability distributions describe constant (i.e., deterministic) random variables. The
maximum is achieved when 𝒑 = 𝑛 −1 1, which models a uniform random variable.
In other words, the entropy increases as the disorder of a probability distribution
increases .
Let us provide some historical background. The concept of entropy emerged in
statistical physics and thermodynamics, due to work by Gibbs and Boltzmann. Later,
entropy became one of the core tools in information theory after Shannon [Sha48]
found operational interpretations.
Fact 17.3 (Shannon 1948). The entropy ent (𝒑) of a probability distribution 𝒑 ∈ Δ𝑛 is
(proportional to) the average number of bits per symbol to encode a sequence of iid
random variables that are distributed according to 𝒑 . 

In other words, assume that we draw an infinite sequence of independent random


variables, each distributed according to 𝒑 . If the distribution is deterministic, we
can specify the whole sequence by providing its constant value; asymptotically, this
amounts to 0 bits per element of the sequence. On the other hand, for the uniform
distribution, it takes log2 𝑛 bits on average to represent each value in the sequence.

17.1.2 Relative entropy


The key object in today’s lecture is the relative entropy, which is a measure of the
difference between two probability distributions.

Definition 17.4 (Relative entropy). The relative entropy or Kullback–Leibler (KL) diver-
gence between two probability distributions 𝒑, 𝒒 ∈ Δ𝑛 is given by the expression
D(𝒑 ; 𝒒) ≔ Σᵢ₌₁ⁿ 𝑝ᵢ (log 𝑝ᵢ − log 𝑞ᵢ).

Exercise 17.5 (Relative entropy). Verify that the relative entropy has the following basic
properties.
1. Positivity. Explain why D(𝒑 ; 𝒒) ≥ 0 for all 𝒑, 𝒒 ∈ Δ𝑛 .
2. Unboundedness. Check that D(𝒑 ; 𝒒) can take the value +∞.
3. Asymmetry. Show that D (𝒑 ; 𝒒 ) is not symmetric in its arguments, so it is not a
metric.
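
A small numerical illustration of Definition 17.4 and Exercise 17.5 (a Python sketch assuming NumPy; the function name is ours): positivity, unboundedness, and asymmetry are all visible on simple examples.

    import numpy as np

    def relative_entropy(p, q):
        """Kullback-Leibler divergence D(p; q), with the convention 0 log 0 = 0."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0
        if np.any(q[mask] == 0):
            return np.inf
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

    p = np.array([0.6, 0.3, 0.1])
    q = np.array([0.2, 0.3, 0.5])
    u = np.array([1.0, 0.0, 0.0])

    print(relative_entropy(p, q) >= 0)                      # positivity
    print(relative_entropy(u, np.array([0.0, 0.5, 0.5])))   # +inf: unboundedness
    print(relative_entropy(p, q), relative_entropy(q, p))   # asymmetry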
Relative entropy also has important operational interpretations in information
theory and statistics.
Fact 17.6 (Stein’s lemma). Fix a distribution 𝒒 ∈ Δ𝑛 . If the relative entropy D (𝒑 ; 𝒒 ) < +∞,
then D (𝒑 ; 𝒒 ) −1 is roughly the number of independent samples we need from the
distribution 𝒑 in order to decide with high probability that 𝒑 ≠ 𝒒 . 

The key fact about relative entropy is that it is a convex function.


Exercise 17.7 (Relative entropy is convex). The function (𝒑 ; 𝒒 ) ↦→ D (𝒑 ; 𝒒 ) is convex on
Δ𝑛 × Δ𝑛 . Hint: Interpret (𝑎, 𝑏) ↦→ 𝑎 log (𝑎/𝑏) , defined on ℝ++ × ℝ++ , as a perspective
transformation of − log.

The result in Exercise 17.7 is the primary motivation for this lecture. It has important
applications in optimization theory. For example, it plays a key role in the study of
geometric programming and relative entropy programming.
Exercise 17.8 (Information projections). For a closed, convex set C ⊆ Δ𝑛 , characterize the
solution of the minimization problem min𝒑 ∈C D (𝒑 ; 𝒒 ) . The result is analogous with
the characterization of the Euclidean projection onto the convex set C.

17.2 Quantum entropy and quantum relative entropy


In this section, we generalize the notions we introduced in the scalar setting to matrices.
The quantum entropy reflects the variability of the eigenvalues of a (normalized)
psd matrix. The quantum relative entropy describes the discrepancy between two
(normalized) psd matrices.

17.2.1 Density matrices and matrix entropy


We start by defining density matrices, which generalize probability distributions.

Definition 17.9 (Density matrices). An 𝑛 × 𝑛 density matrix is a psd matrix 𝝔 ∈ ℍ𝑛


with trace one. Define the set 𝚫𝑛 of 𝑛 × 𝑛 density matrices:

𝚫𝑛 ≔ {𝝔 ∈ ℍ𝑛 : 𝝔 ≽ 0 and tr(𝝔) = 1}.

We can think about a density matrix as a “quantum” probability distribution, which


describes the state of a quantum system. The set of density matrices is the quantum
version of the probability simplex.
Exercise 17.10 (Density matrices). If 𝝔 ∈ 𝚫𝑛 , show that 𝝀 ↓ (𝝔) ∈ Δ𝑛 . Confirm that the
extreme points of 𝚫𝑛 are the rank-one density matrices, which are called pure states.
Pure states take the form 𝝔 = 𝒖𝒖^∗, where ‖𝒖‖ = 1.
Next, we introduce the entropy of a density matrix.

Definition 17.11 (von Neumann). The quantum entropy of a density matrix 𝝔 ∈ 𝚫𝑛 is


ent(𝝔) ≔ − tr(𝝔 log 𝝔) = − Σᵢ₌₁ⁿ 𝜆ᵢ log 𝜆ᵢ ,

where 𝜆 1 , . . . , 𝜆𝑛 are the (decreasingly ordered) eigenvalues of 𝝔.

The quantum entropy is given by the eigenvalues of the density matrix, which
compose a probability distribution. Quantum entropy then addresses the question
“How disordered are the eigenvalues of a density matrix?” In other words, quantum
entropy offers a way to measure the disorder in a quantum system with state 𝝔.
From our discussion, we immediately obtain the following bounds on the quantum entropy. Notice that the bounds are the same in the scalar case (17.1) and in the matrix case (17.2).

    0 ≤ ent(𝝔) ≤ log 𝑛 for each 𝝔 ∈ 𝚫𝑛 .    (17.2)
The minimum occurs if and only if 𝝔 is a rank-1 matrix (that is, a pure state). The
maximum occurs if and only if 𝝔 = 𝑛 −1 I𝑛 .
There are many operational interpretations of the quantum entropy in quantum
information theory, but they are outside the scope of this lecture.

17.2.2 Matrix relative entropy


Next, we extend the notion of relative entropy to the matrix setting. The next definition
is due to Umegaki. It describes one way of comparing the discrepancy between two
quantum states.

Definition 17.12 (Umegaki). The (Umegaki) quantum relative entropy of two density
matrices 𝝔, 𝝂 ∈ 𝚫𝑛 is

    S(𝝔; 𝝂) ≔ tr[𝝔(log 𝝔 − log 𝝂)].

Aside: There are several other notions of relative entropy in the quantum setting.

Warning 17.13 (Relative entropy and eigenvectors). Unless 𝝔 and 𝝂 commute, their
eigenvalues alone do not determine the value of the quantum relative entropy
S (𝝔; 𝝂) ! The interactions between the eigenvectors also play a role. 
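
Definition 17.12 and Warning 17.13 are easy to explore numerically. In the sketch below (assuming NumPy; the helpers are ours, and the logarithms are computed by eigendecomposition), rotating the eigenvectors of 𝝂 while keeping its eigenvalues fixed changes the value of S (𝝔; 𝝂); positivity, which Exercise 17.14 asks you to prove, also shows up.

    import numpy as np

    def logm_sym(A):
        """Matrix logarithm of a positive-definite symmetric matrix."""
        w, V = np.linalg.eigh(A)
        return (V * np.log(w)) @ V.T

    def quantum_relative_entropy(rho, nu):
        """Umegaki relative entropy S(rho; nu) = tr[rho (log rho - log nu)]."""
        return float(np.trace(rho @ (logm_sym(rho) - logm_sym(nu))))

    def random_density(n, rng):
        X = rng.standard_normal((n, n))
        M = X @ X.T + np.eye(n)
        return M / np.trace(M)

    rng = np.random.default_rng(0)
    rho, nu = random_density(3, rng), random_density(3, rng)

    print(quantum_relative_entropy(rho, nu) >= 0)     # positivity

    # Rotating nu preserves its eigenvalues but changes the relative entropy.
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    print(quantum_relative_entropy(rho, nu),
          quantum_relative_entropy(rho, Q @ nu @ Q.T))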

Exercise 17.14 (Quantum relative entropy is positive). For all 𝝔, 𝝂 ∈ 𝚫𝑛 , prove that
S (𝝔; 𝝂) ≥ 0. Hint: Use the generalized Klein inequality (Lecture 8).
Quantum relative entropy also has operational interpretations in quantum informa-
tion theory. For example, we have the quantum extension of Stein’s lemma due to Hiai
& Petz [HP91] and to Ogawa & Nagaoka [ON00].
Fact 17.15 (Quantum Stein’s lemma). Let 𝝂 ∈ 𝚫𝑛 be a density matrix. If S (𝝔; 𝝂) < +∞,
then S (𝝔; 𝝂) −1 is (roughly) the number of unentangled quantum systems, prepared in
state 𝝔, that we must measure to determine that 𝝔 ≠ 𝝂 with high probability. 

Having introduced the matrix generalizations of the relative entropy, we can state
a major theorem that extends the convexity property of scalar entropy.

Theorem 17.16 (Convexity of quantum relative entropy). The map (𝝔; 𝝂) ↦→ S (𝝔; 𝝂) is
convex on 𝚫𝑛 × 𝚫𝑛 .

Exercise 17.17 (Quantum versus scalar). Show that the convexity of quantum relative
entropy (Theorem 17.16) implies the convexity of scalar relative entropy (Exercise 17.7).
Hint: Consider diagonal density matrices.
Unlike the scalar result in Exercise 17.7, Theorem 17.16 is quite hard to prove. It
was first obtained by Lindblad [Lin73] using results of Lieb [Lie73a]. In this lecture, we
present a proof due to Effros [Eff09] that is based on matrix perspective transformations.
A key step in this argument is to lift the problem to tensor products, an idea that
first appeared in Ando’s beautiful paper [And79] on the convexity of quantum relative
entropy and related functions.
Theorem 17.16 has many applications in quantum information theory and quantum
statistical physics. In particular, it plays the starring role in the proof that quantum
entropy is strongly subadditive [LR73]. The convexity of quantum relative entropy is
also the main ingredient in the theory of exponential matrix concentration developed
by the lecturer [Tro11; Tro15].

17.3 The matrix perspective transformation


In the scalar setting, the convexity of the relative entropy can be established using
perspective transformations (Exercise 17.7). This motivates us to develop a matrix
extension of the perspective transformation and to investigate its properties.

17.3.1 Definition and examples


We begin with the definition and some examples.

Definition 17.18 (Matrix perspective transformation). Let 𝑓 : ℝ++ → ℝ. Let 𝑨, 𝑯 ∈ ℍ𝑛++


be strictly positive-definite matrices. The matrix perspective of 𝑓 is the bivariate
function
𝚿 𝑓 (𝑨 ; 𝑯 ) B 𝑨 1/2 𝑓 (𝑨 −1/2𝑯 𝑨 −1/2 )𝑨 1/2 .
In particular, if 𝑨 and 𝑯 commute , then 𝚿 𝑓 (𝑨 ; 𝑯 ) = 𝑨 𝑓 (𝑯 𝑨 −1 ) .

Definition 17.18 may look familiar from Lecture 16. Indeed, if we assume that 𝑓 is
matrix monotone, positive, and satisfies some other inessential properties, then the
construction of 𝚿 𝑓 agrees with the Kubo–Ando matrix mean [KA79].
In this lecture, we study the matrix perspective transformation of a function 𝑓 that
is matrix convex. Let us take a moment to see what these functions look like.
Example 17.19 (Matrix perspectives). Here are some basic examples of matrix perspective
transformations for several matrix convex functions 𝑓 : ℝ++ → ℝ.

• Constant. The function 𝑓 (𝑡 ) = 1 yields 𝚿 𝑓 (𝑨 ; 𝑯 ) = 𝑨 .


• Identity. The function 𝑓 (𝑡 ) = 𝑡 yields 𝚿 𝑓 (𝑨 ; 𝑯 ) = 𝑯 .
• Square. The function 𝑓 (𝑡 ) = 𝑡 2 yields 𝚿 𝑓 (𝑨 ; 𝑯 ) = 𝑯 𝑨 −1 𝑯 .
• Inverse. The function 𝑓 (𝑡 ) = 𝑡 −1 yields 𝚿 𝑓 (𝑨 ; 𝑯 ) = 𝑨𝑯 −1 𝑨 .

We remark that 𝑓 (𝑡 ) = 1 and 𝑓 (𝑡 ) = 𝑡 are both matrix monotone, and they yield
asymmetric matrix means that Kubo & Ando call the “left mean” and the “right mean.”
The other two examples, 𝑓 (𝑡 ) = 𝑡 2 and 𝑓 (𝑡 ) = 𝑡 −1 , are matrix convex but not matrix
monotone. These two functions are the extremal examples of matrix convex functions,
and they yield perspectives that are attractively symmetrical with each other. 

This lecture involves matrix perspective transformations of the negative logarithm


and some matrix convex power functions. We omit explicit expressions for now because
we will apply the perspective in a particular setting where matters are simpler.

17.3.2 Convexity properties


If 𝑓 : ℝ++ → ℝ is a scalar convex function, then you may recall that the scalar
perspective (𝑎, ℎ) ↦→ 𝑎 𝑓 (ℎ/𝑎) is a (jointly) convex function on ℝ++ × ℝ++ . This result
extends to matrix perspectives, provided that 𝑓 is matrix convex. The basic result is
due to Effros [Eff09]; we present a generalization due to Ebadian et al. [ENG11].

Theorem 17.20 (Matrix perspectives are matrix convex). Let 𝑓 : ℝ++ → ℝ be a matrix
convex function. Consider strictly positive-definite matrices 𝑨 𝑖 , 𝑯 𝑖 ∈ ℍ𝑛++ for
𝑖 = 1, 2. Then, for all 𝜏 ∈ [0, 1] with 𝜏¯ B 1 − 𝜏 , we have

𝚿 𝑓 (𝜏𝑨₁ + 𝜏̄𝑨₂; 𝜏𝑯₁ + 𝜏̄𝑯₂) ≼ 𝜏𝚿 𝑓 (𝑨₁; 𝑯₁) + 𝜏̄𝚿 𝑓 (𝑨₂; 𝑯₂).    (17.3)

That is, the perspective is (jointly) operator convex on ℍ𝑛++ × ℍ𝑛++ .

This theorem tells that the perspective of the averages is “smaller” than the average
of the perspectives. It is the key to our proof of Theorem 17.16.

Proof. We will represent the left-hand side of (17.3) as a matrix convex combination,
which allows us to invoke the matrix Jensen inequality. For notational convenience,

define
𝑨 ≔ 𝜏𝑨₁ + 𝜏̄𝑨₂ and 𝑯 ≔ 𝜏𝑯₁ + 𝜏̄𝑯₂.

Introduce the coefficient matrices

    𝑲₁ ≔ 𝜏^{1/2} 𝑨₁^{1/2} 𝑨^{−1/2} and 𝑲₂ ≔ 𝜏̄^{1/2} 𝑨₂^{1/2} 𝑨^{−1/2}.


By a short calculation, we find that

𝑲₁^∗𝑲₁ = 𝜏𝑨^{−1/2}𝑨₁𝑨^{−1/2} and 𝑲₂^∗𝑲₂ = 𝜏̄𝑨^{−1/2}𝑨₂𝑨^{−1/2}.

Adding these expressions,


𝑲₁^∗𝑲₁ + 𝑲₂^∗𝑲₂ = I.
Thus, the coefficient matrices model a matrix convex combination.
Invoking the matrix Jensen inequality, we determine that

𝚿 𝑓 (𝑨; 𝑯) = 𝑨^{1/2} 𝑓 (𝑨^{−1/2}𝑯𝑨^{−1/2}) 𝑨^{1/2}
    = 𝑨^{1/2} 𝑓 ( 𝑲₁^∗(𝑨₁^{−1/2}𝑯₁𝑨₁^{−1/2})𝑲₁ + 𝑲₂^∗(𝑨₂^{−1/2}𝑯₂𝑨₂^{−1/2})𝑲₂ ) 𝑨^{1/2}
    ≼ 𝑨^{1/2} [ 𝑲₁^∗ 𝑓 (𝑨₁^{−1/2}𝑯₁𝑨₁^{−1/2})𝑲₁ + 𝑲₂^∗ 𝑓 (𝑨₂^{−1/2}𝑯₂𝑨₂^{−1/2})𝑲₂ ] 𝑨^{1/2}
    = 𝜏𝚿 𝑓 (𝑨₁; 𝑯₁) + 𝜏̄𝚿 𝑓 (𝑨₂; 𝑯₂).

This is just what we wanted to prove. 


Example 17.21 (Schwarz inequalities). Applying Theorem 17.20 to the perspective of
𝑓 : 𝑡 ↦→ 𝑡^{−1} already yields an interesting statement. For 𝑨ᵢ, 𝑯ᵢ ≻ 0,

    (𝜏𝑨₁ + 𝜏̄𝑨₂)(𝜏𝑯₁ + 𝜏̄𝑯₂)^{−1}(𝜏𝑨₁ + 𝜏̄𝑨₂) ≼ 𝜏𝑨₁𝑯₁^{−1}𝑨₁ + 𝜏̄𝑨₂𝑯₂^{−1}𝑨₂ for 𝜏 ∈ [0, 1].

This type of expression is called a matrix Schwarz inequality. In some sense, this
quadratic over linear function is the extremal example of a matrix convex function. 

17.4 Tensors and logarithms


Ando [And79] had the magnificent idea to prove matrix convexity theorems by lifting
formulas involving noncommuting matrices to formulas involving commuting tensors.
This mechanism is easy to implement, and it eliminates most of the difficulty from the
argument.

17.4.1 Tensors and multiplication operators


In this lecture, we take an approach to tensor product operators that is similar to
the abstract approach in Lecture 1. The details are also slightly different from the
concrete approach based on Kronecker products. The tensor product we describe here
is isomorphic to the other constructions.
We will define elementary tensor product operators as multiplication operators
acting on matrices.

Definition 17.22 (Multiplication operators). Let 𝑨, 𝑯 ∈ ℍ𝑛 . We define the tensor


operator 𝑨 ⊗ 𝑯 : 𝕄𝑛 → 𝕄𝑛 via left-multiplication with 𝑨 and right-multiplication
with 𝑯 . That is,
(𝑨 ⊗ 𝑯 ) (𝑴 ) = 𝑨𝑴 𝑯 for 𝑴 ∈ 𝕄𝑛 .

The elementary tensor operators of the form 𝑨 ⊗ 𝑯 span the real linear space
L(ℍ𝑛 ⊗ ℍ𝑛 ) of (self-adjoint) linear operators on 𝕄𝑛 .

Regardless of the choice of 𝑨, 𝑯 , it is evident that the tensor operators 𝑨 ⊗ I and


I ⊗ 𝑯 commute. That is,

(𝑨 ⊗ I)( I ⊗ 𝑯 ) = ( I ⊗ 𝑯 )(𝑨 ⊗ I) = (𝑨 ⊗ 𝑯 ).

This commutativity property makes these elementary tensor operators quite pleasant
to work with.

17.4.2 The logarithm of a tensor operator


Since 𝑨 ⊗𝑯 is a self-adjoint matrix acting on a Hilbert space, it has a spectral resolution,
and we can define standard matrix functions in the usual way. In particular, there is
an elegant expression for the logarithm of this tensor.
Exercise 17.23 (Logarithm of an elementary tensor operator). Let 𝑨, 𝑯 ∈ ℍ𝑛++ be strictly
positive-definite matrices. Then

log (𝑨 ⊗ 𝑯 ) = ( log 𝑨) ⊗ I + I ⊗ ( log 𝑯 ).

Hint: Introduce the spectral resolutions of 𝑨 and 𝑯 .
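
Exercise 17.23 can be confirmed numerically. In the sketch below (assuming NumPy and SciPy; the helper names are ours), the tensor operator 𝑨 ⊗ 𝑯 is represented concretely as the Kronecker matrix that implements 𝑴 ↦ 𝑨𝑴𝑯 on row-major-flattened matrices, and the logarithm identity is checked with scipy.linalg.logm.

    import numpy as np
    from scipy.linalg import logm

    def random_pd(n, rng):
        X = rng.standard_normal((n, n))
        return X @ X.T + n * np.eye(n)

    rng = np.random.default_rng(0)
    n = 3
    A, H = random_pd(n, rng), random_pd(n, rng)

    # The map M -> A M H acts on row-major-flattened matrices as kron(A, H.T);
    # here H is symmetric, so this is simply kron(A, H).
    T = np.kron(A, H)
    M = rng.standard_normal((n, n))
    print(np.allclose(T @ M.flatten(), (A @ M @ H).flatten()))

    # Exercise 17.23: log(A (x) H) = (log A) (x) I + I (x) (log H).
    I = np.eye(n)
    print(np.allclose(logm(T), np.kron(logm(A), I) + np.kron(I, logm(H))))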


This result provides a satisfactory substitute for the familiar properties of the scalar
logarithm. There is an analogous result for exponentials.
Exercise 17.24 (Exponential of a tensor sum). Let 𝑨, 𝑯 ∈ ℍ𝑛 be self-adjoint matrices.
Prove that
exp (𝑨 ⊗ I + I ⊗ 𝑯 ) = exp (𝑨) ⊗ exp (𝑯 ).

17.5 Convexity of matrix trace functions


At this stage, we can provide the proof of Theorem 17.16, which states that the quantum
relative entropy is convex. We will also see that this approach yields other remarkable
convexity theorems.

17.5.1 Proof of Theorem 17.16


We will show that the quantum relative entropy (𝑨, 𝑯 ) ↦→ S (𝑨 ; 𝑯 ) is convex on pairs
of strictly positive-definite matrices. The general result follows from a continuity
argument.
We will condense the convexity of the quantum relative entropy from the matrix
convexity of 𝑓 (𝑡 ) B − log 𝑡 . The perspective theorem implies that

(𝑨, 𝑯 ) ↦→ 𝚿 𝑓 (𝑨 ⊗ I; I ⊗ 𝑯 ) is jointly matrix convex.

We need to see what the perspective 𝚿 𝑓 looks like. Since 𝑨 ⊗ I and I ⊗ 𝑯 commute,
the perspective can be expressed simply as

𝚿 𝑓 (𝑨 ⊗ I; I ⊗ 𝑯) = (𝑨 ⊗ I) · 𝑓 ((I ⊗ 𝑯)(𝑨 ⊗ I)^{−1})
    = −(𝑨 ⊗ I) · log(𝑨^{−1} ⊗ 𝑯)
    = −(𝑨 ⊗ I) [(−log 𝑨) ⊗ I + I ⊗ (log 𝑯)]
    = (𝑨 log 𝑨) ⊗ I − 𝑨 ⊗ (log 𝑯).

We have used basic algebraic properties of tensor product operators, as well as


Exercise 17.23.
Fix an arbitrary matrix 𝑿 ∈ 𝕄𝑛. Define a (scalar-valued) quadratic form:

    𝜑 𝑿 (𝑨; 𝑯) ≔ ⟨𝑿, 𝚿 𝑓 (𝑨 ⊗ I; I ⊗ 𝑯)(𝑿)⟩.

As usual, ⟨·, ·⟩ is the trace inner product on matrices.

You should confirm that 𝜑 𝑿 is a convex function of the pair (𝑨, 𝑯 ) . Indeed, observe
that the psd order on L(ℍ𝑛 ⊗ ℍ𝑛 ) corresponds to increases in this type of quadratic
form.
Write out the quadratic form explicitly using the interpretation of the tensor product
operator in terms of left- and right-multiplication. We find that

𝜑 𝑿 (𝑨; 𝑯) = ⟨𝑿, (𝑨 log 𝑨)𝑿 − 𝑨𝑿 log 𝑯⟩
    = tr[𝑿^∗(𝑨 log 𝑨)𝑿 − 𝑿^∗𝑨𝑿 log 𝑯] is convex.

Choosing 𝑿 = I, we may conclude that

(𝑨, 𝑯 ) ↦→ tr [𝑨 log 𝑨 − 𝑨 log 𝑯 ] = S (𝑨 ; 𝑯 ) is convex.

This is the statement of Theorem 17.16.
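
Although Theorem 17.16 is now established, a numerical spot check is still instructive. The sketch below (assuming NumPy; helper names ours) verifies midpoint convexity of S on random pairs of positive-definite density matrices.

    import numpy as np

    def logm_sym(A):
        w, V = np.linalg.eigh(A)
        return (V * np.log(w)) @ V.T

    def S(rho, nu):
        """Umegaki relative entropy of positive-definite density matrices."""
        return float(np.trace(rho @ (logm_sym(rho) - logm_sym(nu))))

    def random_density(n, rng):
        X = rng.standard_normal((n, n))
        M = X @ X.T + np.eye(n)
        return M / np.trace(M)

    rng = np.random.default_rng(1)
    for _ in range(100):
        r1, r2 = random_density(4, rng), random_density(4, rng)
        n1, n2 = random_density(4, rng), random_density(4, rng)
        lhs = S((r1 + r2) / 2, (n1 + n2) / 2)
        rhs = (S(r1, n1) + S(r2, n2)) / 2
        assert lhs <= rhs + 1e-10
    print("midpoint convexity holds on all sampled pairs")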


Exercise 17.25 (Alternative perspectives). Prove Theorem 17.16 by applying the same
argument to the matrix convex function 𝑓 (𝑡 ) = 𝑡 log 𝑡 .

17.5.2 Joint concavity of powers


The strategy that we used to prove Theorem 17.16 pays further dividends. By changing
the matrix convex function 𝑓 , we can establish more convexity and concavity theorems.
Here is another important example, obtained by Lieb [Lie73a] using a rather difficult
argument.

Theorem 17.26 (Lieb 1973). Fix an arbitrary matrix 𝑿 ∈ 𝕄𝑛 . For any real 𝑟 ∈ ( 0, 1) ,
the function
(𝑨, 𝑯) ↦→ tr[𝑿^∗ 𝑨^𝑟 𝑿 𝑯^{1−𝑟}]
is concave on ℍ𝑛+ × ℍ𝑛+ .

Proof sketch. The function 𝑓 (𝑡 ) = −𝑡 𝑟 is matrix convex on ℝ++ . Pursue the same
reasoning as in the proof of Theorem 17.16. 

Notes
This lecture is based on the instructor’s monograph [Tro15] with some contributions by
Prof. Richard Kueng that appeared in the lecture notes for a previous version of this
course. Ando [And79] developed the technique of proving matrix convexity inequalities
by lifting to tensor products. We use an implementation of the argument proposed by
Effros [Eff09], and extended by Ebadian et al. [ENG11].

Lecture bibliography
[And79] T. Ando. “Concavity of certain maps on positive definite matrices and applications
to Hadamard products”. In: Linear Algebra and its Applications 26 (1979), pages 203–
241.

[ENG11] A. Ebadian, I. Nikoufar, and M. E. Gordji. “Perspectives of matrix convex functions”.


In: Proceedings of the National Academy of Sciences 108.18 (2011), pages 7313–7314.
doi: 10.1073/pnas.1102518108.
[Eff09] E. G. Effros. “A matrix convexity approach to some celebrated quantum inequali-
ties”. In: Proceedings of the National Academy of Sciences 106.4 (2009), pages 1006–
1008. doi: 10.1073/pnas.0807965106.
[HP91] F. Hiai and D. Petz. “The proper formula for relative entropy and its asymptotics
in quantum probability”. In: Communications in Mathematical Physics 143 (1991),
pages 99–114.
[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
[Lie73a] E. H. Lieb. “Convex trace functions and the Wigner-Yanase-Dyson conjecture”. In:
Advances in Math. 11 (1973), pages 267–288. doi: 10.1016/0001-8708(73)90011-
X.
[LR73] E. H. Lieb and M. B. Ruskai. “Proof of the strong subadditivity of quantum-
mechanical entropy”. In: J. Mathematical Phys. 14 (1973). With an appendix by B.
Simon, pages 1938–1941. doi: 10.1063/1.1666274.
[Lin73] G. Lindblad. “Entropy, information and quantum measurements”. In: Communica-
tions in Mathematical Physics 33 (1973), pages 305–322.
[ON00] T. Ogawa and H. Nagaoka. “Strong converse and Stein’s lemma in quantum
hypothesis testing”. In: IEEE Transactions on Information Theory 46.7 (2000),
pages 2428–2433. doi: 10.1109/18.887855.
[Sha48] C. E. Shannon. “A mathematical theory of communication”. In: The Bell System
Technical Journal 27.3 (1948), pages 379–423. doi: 10.1002/j.1538-7305.1948.
tb01338.x.
[Tro11] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12.4 (2011), pages 389–434. doi: 10.1007/
s10208-011-9099-z.
[Tro15] J. A. Tropp. “An Introduction to Matrix Concentration Inequalities”. In: Foundations
and Trends in Machine Learning 8.1-2 (2015), pages 1–230.
18. Positive-Definite Functions

Date: 3 March 2022 Scribe: Joel A. Tropp

In this lecture, we will turn to another topic involving the analysis of positive-semidefinite matrices. We will develop the notion of a positive-definite kernel and make a connection between these kernels and positive-definite matrices. Positive-definite kernels play an important role in approximation theory, statistics, machine learning, and physics. Our focus in this lecture is an important class of examples called translation-invariant kernels, which are associated with convolution operators. Positive-definite, translation-invariant kernels are generated by positive-definite functions. We will develop some examples of positive-definite functions. Then we will state and prove Bochner's theorem, which characterizes the continuous positive-definite functions.

Agenda:
1. Positive-definite kernels
2. Positive-definite functions
3. Examples
4. Bochner's theorem
5. Fourier analysis
6. Extensions

18.1 Positive-definite kernels


As we have seen, positive-semidefinite (psd) matrices are a subject that rewards study.
The psd property for matrices is defined in terms of quadratic forms:

𝑨 ∈ 𝕄𝑛(ℂ) is psd if and only if ⟨𝒖, 𝑨𝒖⟩ ≥ 0 for all 𝒖 ∈ ℂ𝑛 .    (18.1)

We would like to generalize these notions to functions. This section presents the
definitions, basic examples, and some applications.

18.1.1 Kernels
A bivariate function is often called a kernel, and we may think about it as acting
on functions via integration. In parallel with the concept of a psd matrix, we can
introduce the concept of a positive-definite kernel.
Definition 18.1 (Positive-definite kernel). A measurable function 𝐾 : ℝ𝑑 × ℝ𝑑 → ℂ is called a kernel on ℝ𝑑. The kernel acts on a (suitable) function ℎ : ℝ𝑑 → ℂ by integration:

    (𝐾ℎ)(𝒙) ≔ ∫_{ℝ𝑑} 𝐾(𝒙, 𝒚)ℎ(𝒚) d𝒚 for 𝒙 ∈ ℝ𝑑.

We say that 𝐾 is a positive-definite (pd) kernel if it satisfies the condition

    ⟨ℎ, 𝐾ℎ⟩ ≔ ∬_{ℝ𝑑×ℝ𝑑} \overline{ℎ(𝒙)} 𝐾(𝒙, 𝒚) ℎ(𝒚) d𝒙 d𝒚 ≥ 0    (18.2)

for all (suitable) functions ℎ : ℝ𝑑 → ℂ.

We are being informal in stating this definition because it does not play a central role in this lecture. One must take care to pose appropriate regularity assumptions on the kernel function 𝐾 and the test functions ℎ. In this lecture only, we use the overline for complex conjugation to avoid confusion with convolution operators.

Exercise 18.2 (Pd kernels are Hermitian). If 𝐾 is a pd kernel, prove that the kernel is also
Hermitian: 𝐾(𝒙, 𝒚) = \overline{𝐾(𝒚, 𝒙)} for all 𝒙, 𝒚.

Definition 18.3 (Kernel matrix). Let 𝐾 be a kernel on ℝ𝑑 . For each finite point set
{𝒙 1 , . . . , 𝒙 𝑛 } ⊂ ℝ𝑑 , we associate a kernel matrix:
 
𝑲 ≔ 𝑲(𝒙₁, . . . , 𝒙ₙ) ≔ [𝐾(𝒙ᵢ, 𝒙ⱼ)]_{𝑖,𝑗=1,...,𝑛}.

Under appropriate regularity assumptions, positive-definite kernels induce positive-


semidefinite kernel matrices and vice versa.
Exercise 18.4 (Kernels and matrices). Show that the following claims are equivalent.

1. The kernel 𝐾 is bounded and continuous and positive definite for continuous test
functions ℎ that are in L1 (ℝ𝑑 ) .
2. For each finite point set {𝒙 1 , . . . , 𝒙 𝑛 } , the associated kernel matrix 𝑲 (𝒙 1 , . . . , 𝒙 𝑛 )
is positive semidefinite.

Hint: The direction (1 ⇒ 2) follows when we choose a sequence of functions that tends
toward a sum of point masses, located as the data points. The direction (2 ⇒ 1) follows
when we truncate the integrals to a compact set and approximate by a Riemann sum.

Warning 18.5 (Positive definite?). As with the definition (18.1) of a positive-semidefinite


matrix, the definition (18.2) of a positive-definite kernel involves a weak inequality.
Nevertheless, it is customary to call these kernels positive definite, rather than
“positive semidefinite”. The analog of a strictly positive-definite matrix is called a
strictly positive-definite kernel. 

18.1.2 Examples of positive-definite kernels


It is often helpful to think about the value 𝐾 (𝒙 , 𝒚 ) of a positive-definite kernel as a
measure of the similarity between the points 𝒙 and 𝒚 . This heuristic is supported by
the fact that 𝐾 (𝒙 , 𝒙 ) ≥ 0, so a point is always positively similar with itself. The kernel
matrix 𝑲 defined in the Exercise 18.4 tabulates the similarities among a family of data
points. The next set of examples helps support this point.
Example 18.6 (Positive-definite kernels). Here are a few basic examples of pd kernels that
commonly arise.

1. Inner-product kernel. The inner-product kernel is, perhaps, the simplest positive-
definite kernel on ℝ𝑑 . It is defined as

𝐾(𝒙, 𝒚) ≔ ⟨𝒙, 𝒚⟩.

The associated kernel matrix is usually called the Gram matrix of the data points. (The Gram matrix 𝑮 of points 𝒙₁, . . . , 𝒙ₙ ∈ ℝ𝑑 takes the form 𝑮 = [⟨𝒙ⱼ, 𝒙ₖ⟩]_{𝑗,𝑘=1,...,𝑛}.) To see that the kernel is positive-definite, note that

    ⟨ℎ, 𝐾ℎ⟩ = ∬_{ℝ𝑑×ℝ𝑑} \overline{ℎ(𝒙)} ⟨𝒙, 𝒚⟩ ℎ(𝒚) d𝒙 d𝒚 = ‖ ∫_{ℝ𝑑} ℎ(𝒚)𝒚 d𝒚 ‖² ≥ 0.

The test function ℎ must have sufficient decay to ensure that the integral is
defined.
2. Correlation kernel. The correlation kernel is the positive-definite kernel that
tabulates the correlations between pairs of vectors:

𝐾(𝒙, 𝒚) ≔ ⟨𝒙, 𝒚⟩ / (‖𝒙‖ ‖𝒚‖)

with the understanding that 𝐾 (𝒙 , 0) = 𝐾 ( 0, 𝒚 ) = 0. The associated kernel


matrix is the correlation matrix of the data points. The correlation kernel is a
positive-definite kernel on ℝ𝑑 by the same argument as the inner-product kernel.
3. Gaussian kernel. For a bandwidth parameter 𝛼 > 0, the Gaussian kernel on ℝ𝑑 is
the positive-definite kernel
𝐾(𝒙, 𝒚) ≔ e^{−‖𝒙−𝒚‖²/𝛼}.
Note the Gaussian kernel 𝐾 (𝒙 , 𝒚 ) only depends on the difference 𝒙 − 𝒚 between
its arguments, so it is translation invariant. It is not obvious that this kernel is
positive definite; we will establish this fact a little later (Proposition 18.30). The
Gaussian kernel is widely used in applications.

There are many other examples of kernels. Our discussion here is limited because we
will be focusing on the translation-invariant case. 
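
Each of these kernels can be checked on random data in a few lines (a Python sketch assuming NumPy; the function names are ours): form the kernel matrix and confirm that its smallest eigenvalue is nonnegative up to roundoff.

    import numpy as np

    def gram_kernel(X):
        """Inner-product kernel matrix (Gram matrix) for the rows of X."""
        return X @ X.T

    def correlation_kernel(X):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return (X / norms) @ (X / norms).T

    def gaussian_kernel(X, alpha=1.0):
        sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
        return np.exp(-sq / alpha)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((20, 3))          # 20 points in R^3

    for K in (gram_kernel(X), correlation_kernel(X), gaussian_kernel(X)):
        print(np.linalg.eigvalsh(K).min() >= -1e-10)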

18.1.3 Applications of kernels


Kernels arise in a wide variety of settings, including approximation theory, statistics,
and machine learning. Here are a couple basic applications.
Example 18.7 (Interpolation). Consider scattered data (𝒙 𝑖 , 𝑦𝑖 ) ∈ ℝ𝑑 × ℝ for 𝑖 = 1, . . . , 𝑛
where the sample points 𝒙ᵢ are distinct. The interpolation problem asks us to identify a
function 𝑓 : ℝ𝑑 → ℝ that satisfies 𝑓 (𝒙 𝑖 ) = 𝑦𝑖 for each index 𝑖 .
Let 𝐾 be a (real-valued) kernel on ℝ𝑑 . Kernel interpolation poses a model of the
form

    𝑓(𝒙) = Σᵢ 𝛼ᵢ 𝐾(𝒙, 𝒙ᵢ) where each 𝛼ᵢ ∈ ℝ.
See Figure 18.1 for an illustration. The interpolation problem has a unique solution if
and only if the kernel matrix 𝑲 = [𝐾 (𝒙 𝑗 , 𝒙 𝑘 )] is nonsingular. In this case, we obtain
the coefficients (𝛼𝑖 ) for the model by solving a set of linear equations.
Strictly positive-definite kernels play a role here because they guarantee that the kernel matrix 𝑲 is strictly positive, hence nonsingular, for every choice of (distinct) data points. Therefore, the kernel provides a unique interpolant 𝑓 for any set of scattered data. The Gaussian kernel is often used for this purpose.

Figure 18.1 (Kernel interpolation). Interpolating real data with a Gaussian kernel. As the bandwidth increases, the kernel interpolant becomes smoother.
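
Here is a minimal sketch of kernel interpolation in one dimension (assuming NumPy; the data points and bandwidth below are invented for illustration): solve the linear system 𝑲𝜶 = 𝒚 for the coefficients and evaluate the model at new points.

    import numpy as np

    def gaussian_kernel(x, y, alpha=0.5):
        return np.exp(-np.subtract.outer(x, y)**2 / alpha)

    # Scattered data (x_i, y_i); the sample points must be distinct.
    x = np.array([0.0, 0.7, 1.3, 2.1, 3.0])
    y = np.array([1.0, 0.2, -0.5, 0.4, 1.1])

    K = gaussian_kernel(x, x)                # kernel matrix
    coef = np.linalg.solve(K, y)             # coefficients alpha_i of the model

    def interpolant(t):
        """f(t) = sum_i alpha_i K(t, x_i)."""
        return gaussian_kernel(t, x) @ coef

    print(np.allclose(interpolant(x), y))      # exact fit at the data points
    print(interpolant(np.array([0.5, 1.7])))   # predictions at new points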
Example 18.8 (The kernel trick). Suppose that we have a data analysis method for
Euclidean data (𝒙 𝑖 : 𝑖 = 1, . . . , 𝑛) in ℝ𝑑 that only depends on the Gram matrix. For
example, principal component analysis (PCA) is an unsupervised learning method
that computes features from the data by finding the leading eigenvectors of the Gram
matrix and using these eigenvectors to weight the data points. Related examples of
Euclidean data analysis methods include canonical correlation analysis (CCA), linear
discriminant analysis (LDA), and ridge regression (RR).

The kernel trick posits that we can develop alternative methods for data analysis by
replacing the Gram matrix with a pd kernel matrix associated with the data.

For instance, let us summarize how kernel PCA works. Suppose we use a pd kernel
to compute a kernel matrix. To find features for the data, we can compute the leading
eigenvectors of the kernel matrix. Each eigenvector 𝒖 ∈ ℝⁿ leads to a feature of the form 𝜑(𝒙) = Σᵢ 𝑢ᵢ 𝐾(𝒙, 𝒙ᵢ). This methodology is powerful and widely used.
In summary, we can think about the kernel matrix of a set of data points as a
far-reaching generalization of the Gram matrix. The main criticism of kernel methods
is that the computational cost of the linear algebra can interfere with the application
to large data sets. 

18.1.4 Generalizations
A few more remarks are in order.
Remark 18.9 (Bounded + continuous). We often assume that the kernel is bounded and
continuous to simplify our exposition. Nevertheless, there are many applications where
we must discard these hypotheses.
In physics, kernels often reflect forces of repulsion between particles. For example,
the Coulomb (electrical potential) kernel takes the form

𝐾(𝒙, 𝒚) ≔ (1/(4𝜋)) ‖𝒙 − 𝒚‖^{−1} for 𝒙, 𝒚 ∈ ℝ³.
This form reflects the fact that two negatively charged particles repel each other, and
the force increases as the charges approach each other. The integral operator

(𝐾𝜎)(𝒙) = ∫_{ℝ³} 𝐾(𝒙, 𝒚)𝜎(𝒚) d𝒚

describes the electrical potential induced at a point 𝒙 ∈ ℝ3 by a distribution 𝜎 : ℝ3 →


ℝ of charge.
Remark 18.10 (Other domains). On an abstract measure space ( X, 𝜇) , a positive-definite
kernel 𝐾 : X × X → ℂ is a (measurable) function that satisfies

∬_{X×X} \overline{ℎ(𝑥)} 𝐾(𝑥, 𝑦) ℎ(𝑦) d𝜇(𝑥) d𝜇(𝑦) ≥ 0 for all (suitable) ℎ : X → ℂ.

This generality can be valuable when working with data that may not take vector
values. Using the kernel trick, we can develop methodology for analysis of general
data sets by adapting techniques from the Euclidean setting.

18.2 Positive-definite functions


In this lecture, we will make a detailed study of positive-definite, translation-invariant
kernels. These kernels are associated with convolution operators. They lead us to
introduce and investigate the class of positive-definite functions.

18.2.1 Definitions
A translation-invariant kernel depends only on the difference between its arguments,
so it is spatially homogeneous.

Definition 18.11 (Translation-invariant kernel). Consider a measurable function 𝑓 :


ℝ𝑑 → ℂ. Introduce a kernel 𝐾 𝑓 on ℝ𝑑 via the formula

𝐾 𝑓 (𝒙 , 𝒚 ) B 𝑓 (𝒙 − 𝒚 ) for all 𝒙 , 𝒚 ∈ ℝ𝑑 .

A kernel from this class is called translation-invariant. The associated integral


operator is called a convolution:

(𝑓 ∗ ℎ)(𝒙) ≔ (𝐾 𝑓 ℎ)(𝒙) = ∫_{ℝ𝑑} 𝑓(𝒙 − 𝒚)ℎ(𝒚) d𝒚 for 𝒙 ∈ ℝ𝑑.

This operator is defined for all suitable functions ℎ : ℝ𝑑 → ℂ.



Exercise 18.12 (Convolution: Eigenfunctions). Let 𝑓 : ℝ𝑑 → ℂ be an integrable function.


For each 𝒕 ∈ ℝ𝑑 , show that the wave 𝒙 ↦→ e^{i⟨𝒕, 𝒙⟩} is an eigenfunction of the convolution
operator 𝐾 𝑓 .
The functions that lead to positive-definite, translation-invariant kernels have a
special name. In consonance with the literature, we present this concept in terms of
kernel matrices, rather than kernel functions.

Definition 18.13 (Positive-definite function). A function 𝑓 : ℝ𝑑 → ℂ is called a positive-definite (pd) function if it has the following property. For all data 𝒙₁, . . . , 𝒙ₙ ∈ ℝ𝑑, the kernel matrix 𝑲 𝑓 associated with the convolution operator 𝐾 𝑓 is positive semidefinite:

    𝑲 𝑓 ≔ 𝑲 𝑓 (𝒙₁, . . . , 𝒙ₙ) ≔ [𝑓(𝒙ⱼ − 𝒙ₖ)]_{𝑗,𝑘=1,...,𝑛} ≽ 0.

We do not require a pd function 𝑓 to be continuous, but we will typically enforce this property. We will see that boundedness follows as a consequence of the definition.

Exercise 18.14 (Young’s inequality). Suppose that 𝑓 ∈ L1 (ℝ𝑑 ) . Fix 𝑝 ∈ [ 1, ∞] . Prove that

‖𝑓 ∗ ℎ‖_𝑝 ≤ ‖𝑓‖_1 ‖ℎ‖_𝑝 for each ℎ ∈ L𝑝 (ℝ𝑑 ) .

Therefore, the convolution 𝐾 𝑓 is a bounded operator on L𝑝 . Hint: This is an application


of Hölder’s inequality.
Exercise 18.15 (Positive-definite convolution kernels). Let 𝑓 : ℝ𝑑 → ℂ be a positive-
definite function that is bounded and continuous . Show that the associated convolution
operator 𝐾 𝑓 is positive definite for all continuous test functions ℎ that are in L1 (ℝ) .
Hint: Truncate the integral to a compact set, and approximate by a Riemann sum.
Exercise 18.16 (Discontinuous pd functions). Show that the 0–1 indicator function 𝟙ℤ of
the integers ℤ is a pd function on ℝ.

18.2.2 Properties
Individual positive-definite functions have a number of nice features.
Proposition 18.17 (Positive-definite function). Let 𝑓 : ℝ𝑑 → ℂ be a pd function.

1. Positivity. At the origin, 𝑓(0) ≥ 0.
2. Symmetry. We have the relation 𝑓(−𝒙) = \overline{𝑓(𝒙)} for each 𝒙 ∈ ℝ𝑑 .
3. Boundedness. The value | 𝑓 (𝒙 )| ≤ 𝑓 ( 0) for each 𝒙 ∈ ℝ𝑑 .

Proof. For an arbitrary point 𝒙 ∈ ℝ𝑑 , we may form the 2 × 2 kernel matrix 𝑲 𝑓 ( 0, 𝒙 ) .


This matrix is psd:

    𝑲 𝑓 (0, 𝒙) = [ 𝑓(0), 𝑓(−𝒙) ; 𝑓(𝒙), 𝑓(0) ] ≽ 0.

To prove (1), note that the diagonal entries of a psd matrix are positive. To prove (2), recall that every psd matrix is Hermitian. To obtain (3), we invoke Hadamard's psd criterion, which ensures that |𝑓(𝒙)|² ≤ 𝑓(0)². We may take the square-root because 𝑓(0) ≥ 0. (Hadamard's psd criterion states that 𝑨 ≽ 0 implies |𝑎ᵢⱼ|² ≤ 𝑎ᵢᵢ 𝑎ⱼⱼ for all 𝑖, 𝑗. To prove this, note that each 2 × 2 principal submatrix of 𝑨 is psd, so its determinant is positive.)
The next exercise shows that continuity of a pd function at the origin is tantamount
to continuity everywhere.
Problem 18.18 (Continuity). Prove that a pd function 𝑓 : ℝ𝑑 → ℂ is uniformly continuous
if and only if the real part of 𝑓 is continuous at the origin. Hint: Consider the 3 × 3 psd
kernel matrix 𝑲 𝑓 ( 0, 𝒙 , 𝒚 ) . Compute the quadratic form in the vector 𝒖 = (𝑧, 1, −1) ∗
for 𝑧 ∈ ℂ, and optimize over 𝑧 .

The class of positive-definite functions has some important stability properties. The
last one is distinctive.
Proposition 18.19 (Class of positive-definite functions). The class of pd functions on ℝ𝑑
satisfies the following properties.
1. Convex cone. If 𝑓 , 𝑔 are pd functions on ℝ𝑑 , then 𝛼 𝑓 + 𝛽 𝑔 is pd for all 𝛼, 𝛽 ≥ 0.
2. Closedness. If ( 𝑓 𝑛 : 𝑛 ∈ ℕ) is a sequence of pd functions on ℝ𝑑 that converges
pointwise to 𝑓 , then 𝑓 is pd.
3. Multiplication. If 𝑓 , 𝑔 are pd functions on ℝ𝑑 , then 𝑓 𝑔 is pd.

Proof. Point (1) holds because the psd cone is convex, and point (2) holds because the
psd cone is closed. To prove (3), we must argue that the entrywise product of two psd
matrices remains psd. But this is the statement of Schur’s product theorem. 

18.3 Examples of positive-definite functions


In this section, we will present several examples of pd functions. To lighten notation,
we will restrict our attention to pd functions on the real line, but these examples have
straightforward generalizations to higher dimensions.

18.3.1 Complex exponentials


Recall that the eigenfunctions of a convolution operator are the complex exponentials.
As a consequence, it should not come as a surprise that complex exponentials provide
our first example of a pd function. The theme of our discussion is that the complex
exponentials are the building blocks for designing other pd functions.
Proposition 18.20 (Complex exponentials are pd). For each fixed 𝑡 ∈ ℝ, the function
𝑥 ↦→ e−i𝑡 𝑥 for 𝑥 ∈ ℝ is pd.
Proof. To see that the complex exponential is a pd function, we examine the kernel
matrix associated with arbitrary points 𝑥 1 , . . . , 𝑥𝑛 ∈ ℝ. We find that

𝑲 = [ e−i𝑡 (𝑥 𝑗 −𝑥𝑘 ) ] 𝑗 ,𝑘 =1,...,𝑛 = [ e−i𝑡 𝑥 𝑗 ] 𝑗 =1,...,𝑛 · ( [ e−i𝑡 𝑥𝑘 ] 𝑘 =1,...,𝑛 ) ∗ < 0.    (18.3)
In the penultimate expression, the factors are a column vector and its conjugate
transpose. This outer product is a psd matrix with rank one. 
Exercise 18.21 (Waves are pd). For each fixed 𝒕 ∈ ℝ𝑑 , show that 𝒙 ↦→ e−i h𝒕 , 𝒙 i for 𝒙 ∈ ℝ𝑑
is a pd function on ℝ𝑑 .

18.3.2 Cosine and sine


Our next example provides a hint about how we can build pd functions from complex
exponentials.
Proposition 18.22 (Cosine is pd). For each 𝑡 ∈ ℝ, the function 𝑥 ↦→ cos (𝑡 𝑥) is pd on ℝ.

Proof. The identity cos (𝑡 𝑥) = ½ ( ei𝑡 𝑥 + e−i𝑡 𝑥 ) expresses the cosine as a conic combination
of two complex exponentials. The claim follows because each complex exponential is
pd, and the class of pd functions is closed under conic combinations. 
Exercise 18.23 (Sine is not pd). Show that 𝑥 ↦→ sin (𝑥) is not pd on ℝ. Hint: Sine is an
odd function.

18.3.3 The sine cardinal (sinc) function


We can extend the principle behind the cosine example by allowing for more elaborate
combinations, expressed as integrals.
Proposition 18.24 (Sinc is pd). The function sinc (𝑥) B sin (𝑥)/𝑥 for 𝑥 ∈ ℝ is pd. We define sinc ( 0) = 1 to ensure
continuity.
Proof. We can write the sinc function as an integral:

sinc (𝑥) = ½ ∫[−1,+1] e−i𝑡 𝑥 d𝑡 .

To conclude that sinc is pd, we write the integral as a limit of Riemann sums. Each
Riemann sum is pd because it is a conic combination of complex exponentials. The
limit is pd because pd functions are stable under pointwise limits. 
As an aside, let us frame an exercise that makes a link between the theory of
positive-definite functions and matrix monotone functions. This is just one of many
such examples; see [Bha07b, Chap. 5].
Exercise 18.25 (Tangent is matrix monotone). Show that 𝑥 ↦→ tan (𝑥) is matrix monotone on
(−𝜋/2, +𝜋/2) by writing the Loewner matrix of the tangent in terms of a kernel matrix
associated with the sinc function. Hint: sin (𝛼 − 𝛽) = sin (𝛼) cos (𝛽) − sin (𝛽) cos (𝛼) .

18.3.4 Fourier transforms


Now, let us take a massive jump in abstraction before returning to earth. In this section,
we consider fully general conic combinations of complex exponentials.

Definition 18.26 (Fourier transform: Positive). Let 𝑓 : ℝ → ℂ be an integrable function. Its Fourier transform 𝑓b : ℝ → ℂ is the function

𝑓b(𝑥) B ∫ℝ e−i𝑡 𝑥 𝑓 (𝑡 ) d𝑡 for 𝑥 ∈ ℝ.

More generally, let 𝜇 be a finite, complex Borel measure on ℝ. Its Fourier transform 𝜇b : ℝ → ℂ is the function

𝜇b(𝑥) B ∫ℝ e−i𝑡 𝑥 d𝜇(𝑡 ) for 𝑥 ∈ ℝ.

The first definition involves the measure d𝜇(𝑡 ) = 𝑓 (𝑡 ) d𝑡 with density 𝑓 .

For our purposes, the key insight is that Fourier transforms induce pd functions.
This is a natural outcome, but the proof requires a little thought.
Proposition 18.27 (Fourier transforms are pd). Let 𝜇 be a finite, positive Borel measure on
ℝ. Then its Fourier transform 𝜇b is a pd function on ℝ.

Proof. We return to the definition of a pd function. For points 𝑥 1 , . . . , 𝑥𝑛 ∈ ℝ, form the kernel matrix

𝑲 𝜇b = [ 𝜇b(𝑥 𝑗 − 𝑥𝑘 ) ] 𝑗 ,𝑘 = ∫ℝ [ e−i𝑡 (𝑥 𝑗 −𝑥𝑘 ) ] 𝑗 ,𝑘 d𝜇(𝑡 ).

We will show that the kernel matrix 𝑲 𝜇b is a well-defined psd matrix, so 𝜇b is a pd function.

For each 𝑡 ∈ ℝ, the matrix in the integrand is psd since the complex exponential is a pd function. Therefore, for each unit vector 𝒗 ∈ ℂ𝑛 ,

𝒗 ∗ 𝑲 𝜇b 𝒗 = ∫ℝ 𝒗 ∗ [ e−i𝑡 (𝑥 𝑗 −𝑥𝑘 ) ] 𝑗 ,𝑘 𝒗 d𝜇(𝑡 ) ≥ 0.

Indeed, the integral of a positive function is positive. The integral is finite because the integrand is bounded by 𝑛 . 
Positive-definite functions that arise from Fourier transforms enjoy some additional
regularity properties.
Exercise 18.28 (Fourier transforms: Continuity). Let 𝜇 be a finite, positive Borel measure
on ℝ. Show that its Fourier transform 𝜇 b is a continuous function. Hint: Complex
exponentials are bounded and continuous; truncate the integral to a compact set.
Problem 18.29 (Riemann–Lebesgue lemma). For an integrable function 𝑓 ∈ L1 (ℝ) , prove that 𝑓b is a continuous function that vanishes at infinity. (A function 𝑔 : ℝ → ℝ vanishes at infinity if |𝑔 (𝑥) | → 0 as |𝑥 | → ∞.) Hint: Approximate 𝑓 by simple functions, and note that the sinc function is a continuous function that vanishes at infinity.

18.3.5 Gaussians
The Fourier transform provides us with a powerful tool for detecting other examples of
pd functions. Here is a critical example.
Proposition 18.30 (The Gaussian kernel is pd). The function 𝑥 ↦→ e−𝑥²/2 for 𝑥 ∈ ℝ is pd.
This result is a consequence of the following fundamental fact, which expresses the
Gaussian as a Fourier transform. As we will discuss, Gaussians play a central role in
Fourier analysis because they serve as approximate identities.
Fact 18.31 (Gaussian: Fourier transform). The density 𝜑𝑏,𝑣 of a Gaussian random variable with mean 𝑏 ∈ ℝ and variance 𝑣 > 0 is the function

𝜑𝑏,𝑣 (𝑡 ) B e−(𝑡 −𝑏)²/( 2𝑣 ) / √( 2𝜋𝑣 ) for 𝑡 ∈ ℝ.

The density is normalized: ∫ℝ 𝜑𝑏,𝑣 (𝑡 ) d𝑡 = 1. Its Fourier transform is the function

𝜑b𝑏,𝑣 (𝑥) = ei𝑏𝑥 · e−𝑣𝑥²/2 for 𝑥 ∈ ℝ.

In other words, the Fourier transform of a Gaussian is a twisted Gaussian. 

Proof sketch. By a change of variables, we may assume that 𝑏 = 0 and 𝑣 = 1. To prove that 𝐼 B ∫ℝ 𝜑 0,1 (𝑡 ) d𝑡 = 1, write 𝐼 ² as a double integral and change to polar coordinates. To evaluate the Fourier transform 𝜑b0,1 , we expand 𝑡 ↦→ e−i𝑡 𝑥 as a Taylor
series. The integrals of the odd terms vanish. The integrals of the even terms are
the moments of a standard normal density, which may be evaluated with repeated
integration by parts. 
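As a numerical sanity check on Fact 18.31 (an aside, not from the formal text), one can approximate the Fourier integral by a Riemann sum for the standard normal density and compare against e−𝑥²/2 ; the truncation window below is an arbitrary choice.

import numpy as np

t = np.linspace(-12, 12, 20001)                    # truncate the line; the density is negligible beyond
dt = t[1] - t[0]
phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)       # standard normal density (b = 0, v = 1)

for x in [0.0, 0.7, 1.5, 3.0]:
    ft = np.sum(np.exp(-1j * t * x) * phi) * dt    # Riemann sum for the Fourier transform at x
    print(x, abs(ft - np.exp(-x**2 / 2)))          # tiny error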

18.3.6 Inverse quadratics


As a final example, we consider another Fourier transform that arises in probability
theory.
Proposition 18.32 (Inverse quadratic is pd). The function 𝑥 ↦→ ( 1 + 𝑥 2 ) −1 is pd on ℝ.

Proof. We confirm that the function is pd by writing it as the Fourier transform of the Laplace density ½ e−|𝑡 | :

1/( 1 + 𝑥²) = ½ ∫ℝ e−i𝑡 𝑥 e− |𝑡 | d𝑡 .

To verify the identity, use symmetry about 𝑡 = 0 to pass to a cosine integral, and integrate by parts twice. 

18.4 Bochner’s theorem


All our examples of pd functions derive in one way or another from complex exponentials.
This repetition may suggest a lack of creativity, but there is a deeper reason. In fact,
every continuous pd function can be represented as a Fourier transform [Boc33].

Theorem 18.33 (Bochner 1933). A continuous function 𝑓 on the real line is positive definite if and only if 𝑓 is the Fourier transform of a finite, positive Borel measure 𝜇 on the real line:

𝑓 (𝑥) = 𝜇b(𝑥) = ∫ℝ e−i𝑡 𝑥 d𝜇(𝑡 ).

Proposition 18.27 establishes the “easy” direction: the Fourier transform of a finite,
positive measure is a pd function. This result has already paid dividends because it
allowed us to identify a number of interesting positive-definite functions.
In this section, we will give a proof of the “hard” direction. This result provides a
powerful representation for positive-definite functions. It serves as a building block for
establishing other difficult theorems in matrix analysis and other fields. In particular,
Bochner’s theorem has a consequence for probability theory: The characteristic function
of a random variable is a continuous pd function and vice versa. This can be used (in a
somewhat roundabout way) to prove Lévy’s continuity theorem, which is a key step in
the standard proof of the central limit theorem.
Beyond that, Bochner’s theorem has striking applications in machine learning,
where it serves as the foundation of the method of random Fourier features [RR08].
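To make the connection concrete, here is a minimal sketch of random Fourier features in the spirit of [RR08]; the specific feature map and sampling scheme below are the standard ones and are not developed in these notes. Bochner's theorem represents the Gaussian kernel as the Fourier transform of a Gaussian measure, so sampling frequencies from that measure yields a randomized low-rank approximation of the kernel matrix that is psd by construction.

import numpy as np

rng = np.random.default_rng(1)
n, D = 200, 4000                                   # number of points, number of random features

x = rng.uniform(-3, 3, size=n)                     # one-dimensional data
K = np.exp(-(x[:, None] - x[None, :])**2 / 2)      # target kernel e^{-(x-y)^2/2}

# Bochner: e^{-(x-y)^2/2} = E[cos(w (x - y))] with w ~ normal(0, 1).
w = rng.standard_normal(D)
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(np.outer(x, w) + b)  # random feature map z(x)

K_hat = Z @ Z.T                                    # approximation of the kernel matrix
print(np.abs(K - K_hat).max())                     # small, and it shrinks as D grows

The approximation error decays on the order of D^(-1/2), which is what makes the method attractive for large kernel matrices.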

18.4.1 Complex exponentials are extreme points


Before we turn to the proof, let us recall the geometric perspective on integral
representations. The pd functions on the real line form a convex cone, closed under
pointwise limits. By normalizing the functions, we obtain a compact, convex base of
this cone:
B B {𝑓 : 𝑓 is pd on ℝ and 𝑓 ( 0) = 1}.
According to the Krein–Milman theorem, we can represent every function 𝑓 ∈ B as a
limit of convex combinations of the extreme points. Bochner’s theorem states that

𝑓 (𝑥) = 𝜇b(𝑥) = ∫ℝ e−i𝑡 𝑥 d𝜇(𝑡 ) for a probability measure 𝜇 .

This representation tells us that every extreme point of B must be a complex exponential.
Let us offer an independent proof of the fact that every complex exponential is an
extreme point. We argue in the spirit of Boutet de Monvel [Sim19, Thm. 28.12].
Proposition 18.34 (Complex exponentials are extreme). For each 𝑡 ∈ ℝ, the complex
exponential 𝑒𝑡 : 𝑥 ↦→ e−i𝑡 𝑥 is an extreme point of B.

Proof. Fix the frequency 𝑡 ∈ ℝ. We already know that 𝑒𝑡 is an element of B. Moreover,


the expression (18.3) shows that the 2 × 2 kernel matrix 𝑲 𝑒𝑡 ( 0, 𝑥) has rank one for
each 𝑥 ∈ ℝ. This observation is the key to the proof.
Suppose that 𝑒𝑡 = ½ 𝑓 + ½ 𝑔 where 𝑓 , 𝑔 ∈ B. The kernel matrix is linear in the kernel function, so

𝑲 𝑒𝑡 ( 0, 𝑥) = ½ 𝑲 𝑓 ( 0, 𝑥) + ½ 𝑲 𝑔 ( 0, 𝑥).

But this expression represents the rank-one matrix 𝑲 𝑒𝑡 ( 0, 𝑥) as an average of two psd matrices (because 𝑓 , 𝑔 are pd functions). Recall that each rank-one matrix lies in an extreme ray of the psd cone. Therefore, 𝑲 𝑓 = 𝛼 (𝑥)𝑲 𝑒𝑡 for some 𝛼 (𝑥) ≥ 0. That is,

𝑲 𝑓 ( 0, 𝑥) = [ 1  𝑓 (−𝑥) ; 𝑓 (𝑥)  1 ] = 𝛼 (𝑥) [ 1  ei𝑡 𝑥 ; e−i𝑡 𝑥  1 ] = 𝛼 (𝑥)𝑲 𝑒𝑡 ( 0, 𝑥).
We deduce that 𝛼 (𝑥) = 1, and so 𝑓 (𝑥) = e−i𝑡 𝑥 for all 𝑥 ∈ ℝ. Likewise, 𝑔 (𝑥) = e−i𝑡 𝑥
for all 𝑥 ∈ ℝ. We conclude that 𝑒𝑡 cannot be written as the average of two distinct
functions in B, so it is an extreme point. 

18.4.2 Fourier analysis with Gaussians


To prove Bochner’s theorem, we introduce some rudiments of Fourier analysis. This
task will be easier because we only need to use the main formulas when one of the
functions involved is a Gaussian.
For an integrable function 𝑓 : ℝ → ℂ, the Fourier transform 𝑓b and inverse Fourier transform 𝑓q are defined explicitly:

𝑓b(𝑥) B ∫ℝ e−i𝑡 𝑥 𝑓 (𝑡 ) d𝑡 and 𝑓q(𝑡 ) B (1/2𝜋) ∫ℝ ei𝑡 𝑥 𝑓 (𝑥) d𝑥 for 𝑥 , 𝑡 ∈ ℝ.

Among several possible definitions, we have chosen the probabilists' Fourier transform.

The Riemann–Lebesgue lemma states that 𝑓b, 𝑓q ∈ C0 (ℝ) , the space of continuous
functions on ℝ that vanish at infinity, equipped with the supremum norm.
A few basic properties merit comment. The Fourier transform satisfies an elegant duality property:

∫ℝ 𝑓 (𝑥) ℎb(𝑥) d𝑥 = ∫ℝ 𝑓b(𝑥) ℎ (𝑥) d𝑥 for 𝑓 , ℎ ∈ L1 (ℝ) . (18.4)

The Fourier transform also converts convolution into pointwise multiplication:

( 𝑓 ∗ ℎ)b = 𝑓b · ℎb ∈ C0 (ℝ) for 𝑓 , ℎ ∈ L1 (ℝ) . (18.5)

Young's inequality ensures that the convolution 𝑓 ∗ ℎ is an integrable function.
Let us introduce the class of twisted Gaussian functions:

G B {𝑡 ↦→ 𝑟 ei𝑡 𝑎 e−(𝑡 −𝑏)²/( 2𝑣 ) where 𝑟 , 𝑎, 𝑏 ∈ ℝ and 𝑣 > 0}.

Using Fact 18.31, we can quickly confirm that

𝑔 ∈ G implies 𝑔b, 𝑔q ∈ G and (𝑔q)b = 𝑔 .

In particular, the Fourier transform is a bijection on G.
The main result that we require is a precursor of Plancherel's identity:

∫ℝ 𝑓 (𝑡 ) 𝑔 (𝑡 ) ∗ d𝑡 = (1/2𝜋) ∫ℝ 𝑓b(𝑥) 𝑔b(𝑥) ∗ d𝑥 when 𝑓 ∈ L1 (ℝ) and 𝑔 ∈ G , (18.6)

where ∗ denotes the complex conjugate. Indeed, we note that 𝑔 = (𝑔q)b, and apply the duality (18.4) to move the Fourier transform to 𝑓 . Then we pass the complex conjugate through the inverse Fourier transform to obtain the complex conjugate of the Fourier transform and a scale factor.

18.4.3 Approximate identities


The reason that Gaussians suffice for our purposes is that they serve as “approximate
identities”.
Exercise 18.35 (Gaussian: Approximate identity). Consider centered Gaussian densities:

𝑔𝑣 (𝑡 ) B ( 1/√( 2𝜋𝑣 ) ) · e−𝑡²/( 2𝑣 ) for 𝑣 > 0.

As 𝑣 ↓ 0, the Fourier transform of 𝑔𝑣 satisfies 𝑔b𝑣 (𝑥) = e−𝑣𝑥²/2 → 1.

Let 𝑓 : ℝ → ℂ be a bounded, continuous function. Prove that



𝑓 (𝑏) = lim𝑣 ↓0 ( 𝑓 ∗ 𝑔𝑣 ) (𝑏) = lim𝑣 ↓0 ∫ℝ 𝑓 (𝑏 − 𝑡 ) · 𝑔𝑣 (𝑡 ) d𝑡 .

That is, Gaussians approximate the identity element for the convolution operation. Thus,
we can isolate the value of a continuous function by integration against increasingly

localized Gaussians. See Figure 18.2. Hint: Truncate the integral to the interval ±𝑐 𝑣 ,
where 𝑐 depends on sup 𝑓 . Apply the mean value theorem to 𝑓 on this interval.
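Here is a small numerical sketch of Exercise 18.35 (an illustration, with an arbitrary choice of test function): convolve a bounded, continuous function with Gaussians of shrinking variance and watch the values at a fixed point converge.

import numpy as np

f = lambda t: np.cos(3 * t) + 0.5 * np.sin(t)      # bounded, continuous test function
b = 0.8

t = np.linspace(-10, 10, 400001)
dt = t[1] - t[0]

for v in [1.0, 0.1, 0.01, 0.001]:
    g = np.exp(-t**2 / (2 * v)) / np.sqrt(2 * np.pi * v)  # centered Gaussian density g_v
    conv = np.sum(f(b - t) * g) * dt                      # (f * g_v)(b) by quadrature
    print(v, abs(conv - f(b)))                            # error shrinks as v -> 0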

18.4.4 Proof of Bochner’s theorem


Let us turn to the proof of Theorem 18.33. This argument is a significant revision of the proofs in [Bha07b, Thm. 5.5.3] and [Sim15, Thm. 6.6.6].

Figure 18.2 (Approximate identity): Using an approximate identity to isolate a function value.
Let 𝑓 : ℝ → ℂ be the target pd function. Without loss of generality, we may rescale 𝑓 so that 𝑓 ( 0) = 1. By hypothesis, the function 𝑓 is continuous, but it may not be integrable. To repair this defect, we consider a sequence of regularized functions:

𝐹𝑛 (𝑡 ) B 𝑓 (𝑡 ) · e−𝑡²/𝑛 for 𝑡 ∈ ℝ and each 𝑛 ∈ ℕ.

We will produce the measure 𝜇 that represents 𝑓 as the limit of the measures 𝜇𝑛 that
represent each 𝐹𝑛 .
Positive definiteness. To begin, note that 𝐹𝑛 is positive definite and continuous
because it is the product of two pd, continuous functions. The function 𝐹𝑛 ∈ L1 (ℝ)
because 𝑓 is bounded, and the Gaussian belongs to L1 (ℝ) .
Let us extract the consequence of the pd property that we will need. Choose a test
function 𝑔 ∈ G. By Exercise 18.15, the convolution kernel induced by 𝐹𝑛 is positive
definite for this class of test functions. Therefore, we calculate that
0 ≤ ∫ℝ×ℝ 𝑔 (𝑠 ) ∗ 𝐹𝑛 (𝑠 − 𝑡 ) 𝑔 (𝑡 ) d𝑠 d𝑡 = ∫ℝ 𝑔 (𝑠 ) ∗ · (𝐹𝑛 ∗ 𝑔 )(𝑠 ) d𝑠
  = (1/2𝜋) ∫ℝ 𝑔b(𝑥) ∗ · ( 𝐹𝑛 ∗ 𝑔 )b(𝑥) d𝑥 = (1/2𝜋) ∫ℝ |𝑔b(𝑥)| ² · 𝐹b𝑛 (𝑥) d𝑥.    (18.7)
To pass to the second line, we invoked Plancherel’s identity (18.6). The last relation is
the convolution theorem (18.5).
Positivity. The pd property of 𝐹𝑛 implies that its Fourier transform 𝐹b𝑛 is positive:

𝐹b𝑛 (𝑥) ≥ 0 for all 𝑥 ∈ ℝ.

To prove this, recall that 𝐹b𝑛 is bounded and continuous because of the Riemann–Lebesgue lemma. Fix a point 𝑏 ∈ ℝ, and select a twisted Gaussian function 𝑔𝑣 ∈ G whose Fourier transform satisfies

|𝑔b𝑣 (𝑥)| ² = ( 1/√( 2𝜋𝑣 ) ) · e−(𝑥−𝑏)²/( 2𝑣 ) for 𝑣 > 0.

The function |𝑔b𝑣 | ² is an approximate identity, centered at 𝑏 , with width 𝑣 . Introduce |𝑔b𝑣 | ² into the relation (18.7), and take the limit as 𝑣 ↓ 0 using the fact that 𝐹b𝑛 is continuous. This demonstrates that 𝐹b𝑛 (𝑏) ≥ 0.
Duality. The rest of the proof depends on the complex conjugate of Plancherel’s
identity (18.6):
∫ℝ 𝑔 (𝑡 ) 𝐹𝑛 (−𝑡 ) d𝑡 = (1/2𝜋) ∫ℝ 𝑔b(𝑥) 𝐹b𝑛 (𝑥) d𝑥 for 𝑔 ∈ G. (18.8)

What happened to the complex conjugates? Since 𝐹𝑛 is pd, we know that 𝐹𝑛 (−𝑡 ) = 𝐹𝑛 (𝑡 ) ∗ and that 𝐹b𝑛 ≥ 0. To exploit this relation, we can choose 𝑔 to be an approximate identity. This allows us to transfer information between 𝐹𝑛 and 𝐹b𝑛 .
Integrability. The next step is to argue that 𝐹b𝑛 is integrable, which is not obvious a priori. More precisely, we show that

(1/2𝜋) ∫ℝ 𝐹b𝑛 (𝑥) d𝑥 = 1.

To do so, we select the test function

𝑔𝑣 (𝑡 ) = ( 1/√( 2𝜋𝑣 ) ) e−𝑡²/( 2𝑣 ) with 𝑔b𝑣 (𝑥) = e−𝑣𝑥²/2 .

Since 𝐹𝑛 is bounded and continuous and 𝑔𝑣 is an approximate identity localized at the origin,

lim𝑣 ↓0 ∫ℝ 𝑔𝑣 (𝑡 ) 𝐹𝑛 (−𝑡 ) d𝑡 = 𝐹𝑛 ( 0) = 1.

Since 𝐹b𝑛 ≥ 0 and 𝑔b𝑣 ↑ 1 as 𝑣 ↓ 0, monotone convergence implies that

lim𝑣 ↓0 (1/2𝜋) ∫ℝ 𝑔b𝑣 (𝑥) 𝐹b𝑛 (𝑥) d𝑥 = (1/2𝜋) ∫ℝ 𝐹b𝑛 (𝑥) d𝑥.

The last two displays establish the claim because of the identity (18.8).
Measures and limits. Let us introduce a sequence of Borel probability measures on
the real line:
d𝜇𝑛 (𝑥) B (1/2𝜋) 𝐹b𝑛 (𝑥) d𝑥 for each 𝑛 ∈ ℕ.

To verify that 𝜇𝑛 is a probability measure, we rely on the facts that 𝐹b𝑛 is positive and its integral is normalized. Passing to a subsequence if needed, the sequence (𝜇𝑛 : 𝑛 ∈ ℕ) of probability measures has a weak-∗ limit 𝜇 . (This claim follows from the Banach–Alaoglu theorem, applied to the dual of the space C0 (ℝ) of continuous functions that vanish at infinity.) Therefore,

lim𝑛→∞ ∫ℝ ℎ (𝑥) d𝜇𝑛 (𝑥) = ∫ℝ ℎ (𝑥) d𝜇(𝑥) for all ℎ ∈ C0 (ℝ) .

The limit 𝜇 is a positive Borel measure with 𝜇(ℝ) ≤ 1.


Fourier transforms. It remains to show that the function 𝑓 is the Fourier transform of
the measure 𝜇 . For fixed 𝑏 ∈ ℝ, we choose the test functions

𝑔𝑣 (𝑡 ) = ( 1/√( 2𝜋𝑣 ) ) e−(𝑡 +𝑏)²/( 2𝑣 ) with 𝑔b𝑣 (𝑥) = e−i𝑏𝑥 · e−𝑣𝑥²/2 for 𝑣 > 0.

Using the duality identity (18.8) and taking the limit as 𝑛 → ∞, we can relate the
function 𝑓 to the measure 𝜇 . Indeed,
∫ℝ 𝑔𝑣 (𝑡 ) 𝑓 (−𝑡 ) d𝑡 = lim𝑛→∞ ∫ℝ 𝑔𝑣 (𝑡 ) 𝐹𝑛 (−𝑡 ) d𝑡 = lim𝑛→∞ ∫ℝ 𝑔b𝑣 (𝑥) d𝜇𝑛 (𝑥) = ∫ℝ 𝑔b𝑣 (𝑥) d𝜇(𝑥).

The first limit is a consequence of monotone convergence because 𝐹𝑛 ↑ 𝑓 and 𝑔𝑣 ≥ 0. The second limit is a consequence of the weak-∗ convergence of 𝜇𝑛 to 𝜇 because 𝑔b𝑣 ∈ C0 (ℝ) . Last, we take limits as 𝑣 ↓ 0. Indeed,

𝑓 (𝑏) = lim𝑣 ↓0 ∫ℝ 𝑔𝑣 (𝑡 ) 𝑓 (−𝑡 ) d𝑡 = lim𝑣 ↓0 ∫ℝ 𝑔b𝑣 (𝑥) d𝜇(𝑥) = ∫ℝ e−i𝑏𝑥 d𝜇(𝑥).

The first limit is valid because 𝑔𝑣 is an approximate identity, localized at −𝑏 , and 𝑓


is bounded and continuous. The second limit follows from dominated convergence
because 𝑔b𝑣 is bounded and 𝜇 is a finite measure. We have established Bochner's
theorem.

18.4.5 Extensions
Bochner’s theorem holds in far more general settings. In particular, it is valid for pd
functions on ℝ𝑑 . The proof is essentially the same as the proof of Theorem 18.33.

Theorem 18.36 (Bochner). A continuous function 𝑓 on ℝ𝑑 is positive definite if and only if 𝑓 is the Fourier transform of a finite, positive Borel measure 𝜇 on ℝ𝑑 . That is,

𝑓 (𝒙 ) = ∫ℝ𝑑 e−i h𝒕 , 𝒙 i d𝜇(𝒕 ).

More generally, the theorem holds in abstract settings. We give one such result
without proper definitions, and we illustrate it with an example.

Theorem 18.37 (Bochner–Weil). A continuous function 𝑓 on a locally compact abelian


(LCA) group is positive-definite if and only if 𝑓 is the Fourier transform of a finite,
positive Borel measure 𝜇 on the dual group.

See [Rud90] for the setting of LCA groups, and see [Rud91] for Raikov’s generaliza-
tion to the setting of commutative Banach algebras.
Example 18.38 (Positive-definite sequences). Consider the LCA group (ℤ, +) , comprising
the integers with addition. A function on ℤ is called a sequence, and a sequence
(𝑎𝑘 : 𝑘 ∈ ℤ) is positive definite when
∑𝑗 ,𝑘 𝑢 𝑗 ∗ 𝑎 𝑗 −𝑘 𝑢 𝑘 ≥ 0 for 𝒖 : ℤ → ℂ with finite support.

Positive-definite sequences arise from positive-definite convolution operators on the


integers. These operators correspond with (bi-infinite) Toeplitz matrices.
The dual group (𝕋, +) is the torus with modular addition. The Fourier transform
computes the Fourier series of a measure on the torus. Thus, the Bochner–Weil theorem
guarantees that

𝑎𝑘 = ∫[−𝜋,+𝜋) e−i𝑡 𝑘 d𝜇(𝑡 ),

where 𝜇 is a positive, finite Borel measure on [−𝜋, +𝜋) .


This result is called the Carathéodory–Herglotz–Toeplitz theorem. It is actually a
precursor to Bochner’s work, and it can be established with a rather more elementary
argument; for example, see [Bha07b, Thm. 5.5.2]. 
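Here is a small numerical companion to Example 18.38 (our own illustration, with a measure chosen for convenience). Take d𝜇(𝑡 ) = ( 1 + cos 𝑡 ) /( 2𝜋) d𝑡 on [−𝜋, +𝜋) . Its Fourier coefficients are 𝑎 0 = 1, 𝑎±1 = ½ , and 𝑎𝑘 = 0 otherwise, and the resulting Toeplitz matrices are psd, as the Carathéodory–Herglotz–Toeplitz theorem predicts.

import numpy as np

# Fourier coefficients a_k of the positive measure (1 + cos t)/(2*pi) dt on [-pi, +pi).
def a(k):
    k = abs(k)
    return 1.0 if k == 0 else (0.5 if k == 1 else 0.0)

n = 12
T = np.array([[a(j - k) for k in range(n)] for j in range(n)])  # Toeplitz kernel matrix [a_{j-k}]
print(np.linalg.eigvalsh(T).min())                 # nonnegative: the matrix is psd, as predicted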

Notes
This lecture contains a new presentation of this material. Applications of kernels
are extracted from [SSB18]. The procession of examples is drawn from Bhatia’s
book [Bha07b, Chap. 5]. The proof that complex exponentials are extreme rays of the
cone of pd functions seems to be new. The self-contained proof of Bochner’s theorem is
a significant revision of the proof in Simon’s real analysis text [Sim15]. The material on
Fourier analysis is adapted from Arbogast & Bona [AB08]. See Rudin’s books [Rud90;
Rud91] for generalizations of Bochner’s theorem.

Lecture bibliography
[AB08] T. Arbogast and J. L. Bona. Methods of applied mathematics. ICES Report. UT-Austin,
2008.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[Boc33] S. Bochner. “Monotone Funktionen, Stieltjessche Integrale und harmonische
Analyse”. In: Math. Ann. 108.1 (1933), pages 378–410. doi: 10.1007/BF01452844.
[RR08] A. Rahimi and B. Recht. “Random Features for Large-Scale Kernel Machines”.
In: Advances in Neural Information Processing Systems 20. Curran Associates, Inc.,
2008, pages 1177–1184. url: http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.
[Rud90] W. Rudin. Fourier analysis on groups. Reprint of the 1962 original, A Wiley-
Interscience Publication. John Wiley & Sons, Inc., New York, 1990. doi: 10.1002/
9781118165621.
[Rud91] W. Rudin. Functional analysis. Second. McGraw-Hill, Inc., New York, 1991.
[SSB18] B. Schölkopf, A. J. Smola, and F. Bach. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.
[Sim15] B. Simon. Real analysis. With a 68 page companion booklet. American Mathematical
Society, Providence, RI, 2015. doi: 10.1090/simon/001.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
19. Entrywise PSD Preservers

Date: 8 March 2022 Scribe: Joel A. Tropp

Agenda:
1. Radial kernels
2. Inner-product kernels
3. Entrywise psd preservers
4. Examples
5. Vasudeva's theorem
6. Extensions

In this lecture, we discuss several classes of kernel functions that are defined on Euclidean spaces of every dimension: radial kernels and inner-product kernels. In the first case, Bochner's theorem leads to a characterization of all radial kernels. To study the second case, we must investigate a new concept. Suppose that we apply a function to each entry of a matrix to produce a new matrix. This function is called an entrywise psd preserver if it maps each psd matrix to another psd matrix. These functions also enjoy a beautiful theory that complements the Loewner theory of matrix monotone functions and the Bochner theory of positive-definite functions.

19.1 Families of kernels


Our first object is to extend the work from the last lecture to construct a family
of positive-definite convolution kernels that is defined for Euclidean spaces of each
dimension. Then we will turn to another family of kernels, based on inner products,
that also extend to every dimension.

19.1.1 Positive-definite functions


Recall that a function 𝑓 : ℝ𝑑 → ℂ is positive definite (pd) on ℝ𝑑 when the matrix
 
𝑲 𝑓 (𝒙 1 , . . . , 𝒙 𝑛 ) B 𝑓 (𝒙 𝑗 − 𝒙 𝑘 ) 𝑗 ,𝑘 =1,...,𝑛 < 0 for all 𝒙 1 , . . . , 𝒙 𝑛 ∈ ℝ𝑑 .

In other words, we consider the convolution kernel 𝐾 𝑓 induced by the function 𝑓 . The
function 𝑓 is pd when the kernel matrices 𝑲 𝑓 associated with the convolution kernel
are all psd.
Under mild regularity assumptions, the eigenfunctions of a convolution operator
are complex exponentials. This observation suggests that the complex exponentials will
play a basic role in characterizing convolution kernels. Indeed, we have the following
fundamental result.

Theorem 19.1 (Bochner). A continuous function 𝑓 : ℝ𝑑 → ℂ is positive definite on


ℝ𝑑 if and only if 𝑓 is the Fourier transform of a finite, positive Borel measure 𝜇.
That is,

𝑓 (𝒙 ) = ∫ℝ𝑑 e−i h𝝃, 𝒙 i d𝜇(𝝃).

Bochner’s theorem describes an individual pd function, defined on a Euclidean


space of a particular dimension. One may also wonder whether there is a way to
extend this result to obtain a family of pd convolution kernels in each dimension.

19.1.2 Radial kernels


The key idea is to study a class of convolution kernels that are rotationally invariant.
These kernels are spatially and directionally homogeneous.

Definition 19.2 (Radial kernel). A (translation-invariant) kernel 𝐾 on ℝ𝑑 is radial if it takes the form

𝐾 (𝒙 , 𝒚 ) B 𝜑 (k𝒙 − 𝒚 k) for all 𝒙 , 𝒚 ∈ ℝ𝑑 ,

where 𝜑 : ℝ+ → ℂ is a function. (In this lecture, k · k is the Euclidean norm.) We say that the function 𝜑 is positive-definite radial on ℝ𝑑 when the associated kernel matrices

𝑲 𝜑 (𝒙 1 , . . . , 𝒙 𝑛 ) B [ 𝜑 (k𝒙 𝑗 − 𝒙 𝑘 k) ] 𝑗 ,𝑘 =1,...,𝑛 < 0

for all choices of 𝒙 1 , . . . , 𝒙 𝑛 ∈ ℝ𝑑 and each 𝑛 ∈ ℕ.

A radial kernel is defined on a Euclidean space of a particular dimension 𝑑 .


Nevertheless, we can also ask when 𝜑 is positive-definite radial on ℝ𝑑 for every 𝑑 ∈ ℕ.
In this case, we simply say that 𝜑 is “positive-definite radial” without additional
qualification. Schoenberg provided an elegant answer to this question [Sch38].

Theorem 19.3 (Schoenberg 1938). A continuous function 𝜑 : ℝ+ → ℝ+ is positive-definite radial if and only if it is given by the Laplace transform of a finite, positive Borel measure 𝜇 on ℝ+ . More precisely,

𝜑 (𝑡 ) = ∫ℝ+ e−𝑟²𝑡²/2 d𝜇(𝑟 ) for 𝑡 ∈ ℝ+ .

(Every pd radial function must take positive values! Why?)

Proof sketch. The reverse direction is simple. (We use random variables and expectations to simplify some of the formulas here.) We write the Gaussian as a characteristic function (i.e., a Fourier transform). For 𝒙 ∈ ℝ𝑑 ,

e−𝑟²k𝒙 k²/2 = 𝔼[ e−i𝑟 h𝒛 , 𝒙 i ] where 𝒛 ∼ normal ( 0, I𝑑 ).

Using dominated convergence,

𝜑 (k𝒙 k) = ∫ℝ+ e−𝑟²k𝒙 k²/2 d𝜇(𝑟 ) = 𝔼 [ ∫ℝ+ e−i𝑟 h𝒛 , 𝒙 i d𝜇(𝑟 ) ] .

The inner integral is a pd function of 𝒙 , and the average of pd functions is pd.


For the forward direction, we assume that the function 𝜑 (k·k) is pd on ℝ𝑑 for each
𝑑 ∈ ℕ. Use rotational invariance to average over the unit sphere, and apply Bochner’s
theorem to deduce that

𝜑 (𝑡 ) = ∫ℝ+ Ω𝑑 (𝑟𝑡 ) d𝜇𝑑 (𝑟 ).

The measure 𝜇𝑑 on ℝ+ is positive, with total mass equal to 𝜑 ( 0) . The functions Ω𝑑 are given by

Ω𝑑 (𝑠 ) B 𝔼[e−i𝑠 𝜃1 ] where 𝜽 ∼ unif (𝕊𝑑−1 ).


In this expression, 𝜃 1 is the first coordinate of a random vector 𝜽 drawn uniformly
from the Euclidean unit sphere 𝕊𝑑−1 in ℝ𝑑 .
Since the first coordinate of a spherical vector is almost a centered Gaussian with
variance 𝑑 −1 , it is not too hard to argue that
lim𝑑→∞ Ω𝑑 (𝑟√𝑑) = e−𝑟²/2 .

It takes some additional work to show that d𝜇𝑑 (𝑟 𝑑) has a weak limit d𝜇(𝑟 ) in the
sense of tempered distributions. Schoenberg’s paper [Sch38, Thm. 2] approaches the
problem using hard analysis. See Chafaï [Cha13] for a probabilistic argument. 
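As a numerical aside (a sketch of the easy direction only), the Gaussian profile 𝜑 (𝑡 ) = e−𝑡²/2 , which corresponds to a point mass 𝜇 = 𝛿 1 in Theorem 19.3, produces psd kernel matrices from points in every dimension we try.

import numpy as np

rng = np.random.default_rng(2)
phi = lambda t: np.exp(-t**2 / 2)                  # a pd radial profile

for d in [1, 3, 10, 50]:
    X = rng.standard_normal((40, d))               # 40 points in R^d
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K = phi(dist)                                  # radial kernel matrix [phi(||x_j - x_k||)]
    print(d, np.linalg.eigvalsh(K).min())          # nonnegative, up to rounding error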

19.1.3 Inner-product kernels


Are there other families of kernels that operate in any dimension? Here is another
class, based on inner products instead of Euclidean distances.

Definition 19.4 (Inner-product kernel). For a field 𝔽 = ℝ or 𝔽 = ℂ, an inner-product


kernel 𝐾 on 𝔽 𝑑 is a function of the form

𝐾 𝜑 (𝒙 , 𝒚 ) B 𝜑 (h𝒙 , 𝒚 i) for all 𝒙 , 𝒚 ∈ 𝔽 𝑑 ,

where 𝜑 : 𝔽 → 𝔽 is a function. The inner-product kernel is positive definite on 𝔽 𝑑


when the associated kernel matrices are psd:
 
𝑲 𝜑 (𝒙 1 , . . . , 𝒙 𝑛 ) B 𝜑 (h𝒙 𝑗 , 𝒙 𝑘 i) 𝑗 ,𝑘 =1,...,𝑛 < 0

for every choice of 𝒙 1 , . . . , 𝒙 𝑛 ∈ 𝔽 𝑑 and all 𝑛 ∈ ℕ.

We can characterize pd inner-product kernels more simply.


Exercise 19.5 (Psd preservers). Show that an inner-product kernel 𝐾 𝜑 on 𝔽 𝑑 is positive
definite if and only if
   
[ 𝑎 𝑗 𝑘 ] < 0 implies [ 𝜑 (𝑎 𝑗 𝑘 ) ] < 0 for 𝑨 = [𝑎 𝑗 𝑘 ] ∈ 𝕄𝑑 (𝔽) .

That is, the function 𝜑 is an entrywise psd preserver. Hint: A matrix in 𝕄𝑑 (𝔽) is psd if
and only if it is the Gram matrix of 𝑑 points in 𝔽 𝑑 .
This seems a bit perverse: we are always told that entrywise operations on matrices
are unorthodox. Nevertheless, this exercise suggests that entrywise functions that
preserve the psd property may merit study. This is the primary object of this lecture.
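The following sketch (an illustration of Exercise 19.5, with arbitrary data) makes the point concrete: the inner-product kernel matrix for 𝜑 is literally 𝜑 applied entrywise to the Gram matrix, so positive definiteness of the kernel is exactly the statement that 𝜑 preserves psd matrices entrywise.

import numpy as np

rng = np.random.default_rng(3)
phi = np.exp                                       # a candidate entrywise map

X = rng.standard_normal((30, 5))                   # 30 points in R^5
G = X @ X.T                                        # Gram matrix [<x_j, x_k>], psd by construction

K = phi(np.array([[x @ y for y in X] for x in X])) # kernel matrix [phi(<x_j, x_k>)]
print(np.allclose(K, phi(G)))                      # True: the same object, computed two ways
print(np.linalg.eigvalsh(phi(G)).min())            # nonnegative here; exp is an epp (Proposition 19.21)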

19.1.4 Applications
Before turning to our work, let us mention some motivating applications of psd
inner-product kernels and entrywise psd preservers.
Example 19.6 (Kernel methods). As we briefly discussed in Lecture 18, we can use the
kernel trick to design new methods for data analysis. In a method that only uses the
Gram matrix of Euclidean data, we can substitute a psd kernel matrix to try to find
alternative (non-Euclidean) structure in the data. Some commonly used inner-product
kernels include polynomial kernels of the form

𝐾 (𝒙 , 𝒚 ) = h𝒙 , 𝒚 i𝑝 or 𝐾 (𝒙 , 𝒚 ) = ( 1 + h𝒙 , 𝒚 i)𝑝 for 𝑝 ∈ ℕ.

As we will see, these inner-product kernels are indeed pd. 

Example 19.7 (Covariance regularization). Suppose that we have acquired a psd covariance
matrix 𝑨 < 0 whose entries tabulate (estimated) covariances among a family of random
variables. In practice, it is common that covariance estimates are inaccurate, so we
may wish to process the matrix to mitigate the effects of noise. In some settings, it is
important to ensure that the procedure respects the psd property so that the processed
matrix is still a covariance matrix.
One class of inexpensive methods applies a scalar function 𝑓 to each entry of the
covariance matrix to obtain [𝑓 (𝑎 𝑗 𝑘 )] . In this context, it is natural to insist that the
function 𝑓 is an entrywise psd preserver. 

19.2 Entrywise functions that preserve the psd property


In this section, we initiate our study of scalar functions that act entrywise on a psd
matrix to produce another psd matrix. We begin with basic definitions, and then we
outline some of the main properties.

19.2.1 Entrywise functions


It is valuable to restrict our attention to matrices whose entries lie within specified sets.

Definition 19.8 (Entrywise matrix function). Let E ⊆ 𝔽 . (In this lecture, we always use brackets to denote entrywise behavior.) Define the set of 𝑛 × 𝑛 matrices with entries in E:

𝕄𝑛 [ E] B {𝑨 ∈ 𝕄𝑛 (𝔽) : 𝑎 𝑗 𝑘 ∈ E for all 𝑗 , 𝑘 }.

We can extend a scalar function 𝑓 : E → 𝔽 entrywise to matrices:

𝑓 [𝑨] B [𝑓 (𝑎 𝑗 𝑘 )] ∈ 𝕄𝑛 (𝔽) for each 𝑨 = [𝑎 𝑗 𝑘 ] ∈ 𝕄𝑛 [E] .

The definition of an entrywise matrix function should be contrasted with our


previous definition of a standard matrix function. Indeed, standard matrix functions
were defined only for Hermitian matrices, while entrywise functions can be applied
to any square (or even rectangular) matrix. It is, perhaps, surprising that there is
anything interesting to say about these objects.
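To emphasize the contrast (a small illustration, not part of the development), compare the entrywise exponential 𝑓 [𝑨] with the standard matrix exponential computed from the spectral decomposition; they are different matrices in general.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                         # a symmetric psd matrix

entrywise = np.exp(A)                              # f[A] = [exp(a_jk)]: exp applied to each entry

w, Q = np.linalg.eigh(A)                           # standard matrix function: exp applied to eigenvalues
standard = Q @ np.diag(np.exp(w)) @ Q.T

print(entrywise)                                   # approx [[7.39, 2.72], [2.72, 7.39]]
print(standard)                                    # approx [[11.40, 8.68], [8.68, 11.40]]: different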

19.2.2 Entrywise psd preservers


Our main definition concerns entrywise functions that map certain psd matrices to psd
matrices.

Definition 19.9 (Entrywise psd preserver). Let D be an interval (𝔽 = ℝ) or a disc


centered on the real line (𝔽 = ℂ). We say that 𝑓 : D → 𝔽 is an entrywise psd
preserver (epp) on 𝕄𝑛 [ D] if

𝑨 < 0 implies 𝑓 [𝑨] < 0 for all 𝑨 ∈ 𝕄𝑛 [ D] .

If this claim holds true for all 𝑛 ∈ ℕ, we simply say that 𝑓 is an entrywise psd
preserver (epp) on D.

Exercise 19.10 (Epps and positivity). If D ∩ ℝ+ = ∅, show that the definition of an epp on
𝕄𝑛 [ D] is vacuous. Otherwise, if D ∩ ℝ+ ≠ ∅, exhibit an example of an epp on 𝕄𝑛 [D] .
As with positive linear maps and standard matrix functions, a differentiable epp is
also monotone with respect to the psd order.
Problem 19.11 (Differentiable epps are monotone). For simplicity, assume that D ⊆ 𝔽 is an
open set. Suppose that 𝑓 : D → 𝔽 is differentiable. Show that 𝑓 is an epp on 𝕄𝑛 [ D]
if and only if

𝑨4𝑩 implies 𝑓 [𝑨] 4 𝑓 [𝑩] for all 𝑨, 𝑩 ∈ 𝕄𝑛 [D] .

Hint: The proof is essentially the same as the proof that a differentiable function is
matrix monotone if and only if the Loewner matrix is psd. In this case, the argument is
much easier because we have no need of the Daleckii–Krein formula.

19.2.3 Properties
Entrywise psd preservers enjoy some of the same properties as positive-definite
functions.
Proposition 19.12 (Entrywise psd preserver). For a disc D ⊆ 𝔽 , let 𝑓 : D → 𝔽 be an epp.

For 𝑎 ∈ D ∩ ℝ+ , the value 𝑓 (𝑎) ≥ 0.


1. Positivity. We are using ∗ to denote the complex
2. Symmetry. We have 𝑓 (𝑧 ∗ ) = 𝑓 (𝑧) ∗ for all 𝑧 ∈ D. conjugate.
3. Bounds. For 𝑎, 𝑐 , 𝑧 ∈ D with 𝑎, 𝑐 ≥ 0,

|𝑓 (𝑧)| 2 ≤ 𝑓 (𝑎) 𝑓 (𝑐 ) when |𝑧 | 2 ≤ 𝑎𝑐 .


In particular, |𝑓 (𝑧)| ≤ 𝑓 (𝑎) if |𝑧 | ≤ 𝑎 .

Proof. Consider a psd matrix 𝑨 ∈ 𝕄2 [ D] and its entrywise image 𝑓 [𝑨] :

𝑨 = [ 𝑎  𝑧 ; 𝑧 ∗  𝑐 ] < 0 and 𝑓 [𝑨] = [ 𝑓 (𝑎)  𝑓 (𝑧) ; 𝑓 (𝑧 ∗ )  𝑓 (𝑐 ) ] < 0.
Property (1) holds because the diagonal entries of a psd matrix are positive. Property
(2) is a consequence of the symmetry or Hermiticity of a psd matrix. To prove (3), we
invoke the Hadamard psd criterion. 
Next, we show that epps behave nicely on the strictly positive part of the real line.
Proposition 19.13 (Epp: Restriction to ℝ++ ). For a disc D ⊆ 𝔽 , let 𝑓 : D → 𝔽 be an epp.
The restriction of 𝑓 : D ∩ ℝ++ → ℝ+ is increasing and continuous.

Proof. Choose numbers 𝑎, 𝑐 ∈ D ∩ℝ++ subject to the relation 0 < 𝑐 < 𝑎 . The positivity
and boundedness properties ensure that 𝑓 (𝑐 ) = |𝑓 (𝑐 )| ≤ 𝑓 (𝑎) . Thus, 𝑓 is increasing
on D ∩ ℝ++ .
Define the function 𝑔 (𝑥) B log 𝑓 ( e𝑥 ) whenever e𝑥 ∈ D ∩ ℝ++ . The boundedness property guarantees that 𝑓 (√(𝑎𝑐 )) ² ≤ 𝑓 (𝑎) 𝑓 (𝑐 ) . Write 𝑎 = e𝑥 and 𝑐 = e𝑦 . Take the logarithm, and recognize the function 𝑔 :

𝑔 ( ½ 𝑥 + ½ 𝑦 ) ≤ ½ 𝑔 (𝑥) + ½ 𝑔 (𝑦 ).

In other words, 𝑔 is midpoint convex; since 𝑔 is also increasing, it is continuous on the interior of its domain. Thus, 𝑓 is continuous on D ∩ ℝ++ . 
Proposition 19.14 (Epp: Stability properties). Let D ⊂ 𝔽 be a disc.

1. Convex cone. If 𝑓 , 𝑔 are epps on D, then 𝛼 𝑓 + 𝛽 𝑔 is an epp on D for all 𝛼, 𝛽 ≥ 0.


2. Closedness. If ( 𝑓𝑛 : 𝑛 ∈ ℕ) is a sequence of epps on D that converges pointwise
to 𝑓 , then 𝑓 is an epp on D.
3. Multiplication. If 𝑓 , 𝑔 are epps on D, then the pointwise product 𝑓 𝑔 is also an
epp on D.

Proof. The claims (1) and (2) are valid because the psd matrices form a convex cone
that is closed. Meanwhile, claim (3) is a consequence of the Schur product theorem. 

There is only one setting where epps have been completely characterized: the case of 2 × 2 matrices with strictly positive entries [Vas79, Thm. 2]. We mention it here because we will use it to prove a weaker characterization theorem.
Exercise 19.15 (Epps: A special characterization). Show that 𝑓 is an epp on 𝕄2 [ℝ++ ] if and
only if 𝑥 ↦→ log 𝑓 ( e𝑥 ) is increasing and midpoint convex.

19.3 Examples of entrywise psd preservers


In this section, we will introduce several examples of epps on various domains. For
simplicity, we restrict our attention to the real setting from now on. Thus, D is either
the whole real line or an interval of the real line.

19.3.1 Monomials
The simplest example of an epp is the function that always returns one.
Proposition 19.16 (Constants are epps). The function 𝑓 (𝑡 ) = 1 is an epp on ℝ.

Proof. For a psd matrix 𝑨 ∈ 𝕄𝑛 [ℝ] , we see that 𝑓 [𝑨] = 11ᵀ , which is psd. 
Next, we turn our attention to the monomials.
Proposition 19.17 (Monomials are epps). For each 𝑝 ∈ ℕ, the function 𝑓 (𝑡 ) = 𝑡 𝑝 is an epp
on ℝ.

Proof. For a psd matrix 𝑨 ∈ 𝕄𝑛 [ℝ] , we can express

𝑓 [𝑨] = 𝑨 ⊙ · · · ⊙ 𝑨 (𝑝 times) < 0.

As usual, ⊙ denotes the Schur product. Since 𝑨 < 0, an iterative application of the Schur product theorem guarantees that the product is psd. 
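A quick numerical check of Proposition 19.17 (our own sketch, with a random psd input): entrywise powers are iterated Schur products, and they stay psd.

import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((20, 20))
A = B @ B.T                                        # a random psd matrix

for p in [1, 2, 3, 5]:
    Ap = A ** p                                    # entrywise power = p-fold Schur product of A
    print(p, np.linalg.eigvalsh(Ap).min())         # nonnegative, up to rounding error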
Problem 19.18 (Other powers). Choose a positive number 𝑟 that is not an integer. Prove
that the function 𝑡 ↦→ 𝑡 𝑟 is not an epp on ℝ+ . Hint: This is hard. One approach is to
argue that every epp on 𝕄𝑛 [ℝ+ ] must have 𝑛 − 1 continuous derivatives, and choose
𝑛 > 𝑟 . See the proofs of Vasudeva’s theorem and Bernstein’s theorems (below) for
some relevant techniques.

19.3.2 Power series


Let us take a step up in abstraction, which will allow us to produce a wealth of
additional examples.
Proposition 19.19 (Some power series that are epps). Consider a power series
𝑓 (𝑡 ) B ∑𝑝 ≥0 𝑐𝑝 𝑡 𝑝 where 𝑐𝑝 ≥ 0 for 𝑝 ∈ ℕ. (19.1)

If D ⊆ ℝ is the domain of convergence, then 𝑓 : D → ℝ is an epp on D.

Proof. The monomials are epps on ℝ, and the class of epps on ℝ is a convex cone.
Therefore, each polynomial with positive coefficients is an epp on ℝ. Within the domain
D of convergence, the partial sums of 𝑓 are polynomials with positive coefficients that
converge pointwise to 𝑓 . Since the class of epps on D is closed under pointwise limits,
we see that 𝑓 is an epp on D. 
Later, we will discuss the class of power series with positive coefficients in greater
detail. For the moment, we note a few basic properties.
Exercise 19.20 (Power series with positive coefficients). Suppose that 𝑓 : D → ℝ has
the form (19.1). Then 𝑓 ∈ C∞ ( D) , the set of infinitely differentiable functions on D.
Furthermore, for each 𝑝 ∈ ℕ, the derivative 𝑓 (𝑝) is also an epp on D.

19.3.3 The exponential and friends


To exploit the ideas from the last section, we first consider some power series related
to the exponential function.
Proposition 19.21 (Exponentials are epps). For 𝑎 > 0, the functions 𝑡 ↦→ exp (𝑎𝑡 ) and
𝑡 ↦→ sinh (𝑎𝑡 ) and 𝑡 ↦→ cosh (𝑎𝑡 ) are all epps on ℝ.
Proof. Scaling a matrix by 𝑎 > 0 preserves positivity, so we may assume that 𝑎 = 1.
Recall that the exponential, hyperbolic sine, and hyperbolic cosine are entire functions,
so their power series converge on the real line:
exp (𝑡 ) = ∑𝑝 ≥0 𝑡 𝑝 /𝑝 ! , cosh (𝑡 ) = ∑𝑝 ≥0 𝑡 ²𝑝 /( 2𝑝 ) ! , and sinh (𝑡 ) = ∑𝑝 ≥0 𝑡 ²𝑝+1 /( 2𝑝 + 1) ! .
Clearly, each of these series has positive coefficients. Invoke Proposition 19.19. 
This proposition has an elegant and unexpected outcome. If the matrix [𝑎 𝑗 𝑘 ] is
psd, then the matrix [ e𝑎 𝑗 𝑘 ] is also psd. We can actually strengthen the result further.
Problem 19.22 (Exponentials and cpsd matrices). Suppose that 𝒖 ∗ 𝑨𝒖 ≥ 0 for all vectors
𝒖 whose entries sum to zero: 𝟙∗ 𝒖 = 0. Matrices of this type are called conditionally psd (cpsd). Prove that
exp [𝑨] is psd. This result is valid with 𝔽 = ℝ or 𝔽 = ℂ.

19.3.4 The logarithm and the inverse


Next, we consider some important power series that converge only on a subinterval of
the line.
Proposition 19.23 (Logarithms and inverses that are epps). The inverse 𝑡 ↦→ ( 1 − 𝑡 ) −1 and
the logarithm 𝑡 ↦→ − log ( 1 − 𝑡 ) are epps on (−1, +1) .

Proof. We have the power series expansions


( 1 − 𝑡 ) −1 = ∑𝑝 ≥0 𝑡 𝑝 and − log ( 1 − 𝑡 ) = ∑𝑝 ≥1 𝑡 𝑝 /𝑝 .

These series have positive coefficients, but they converge only in the open interval
D = (−1, +1) . Proposition 19.19 implies that these two functions are epps on the
interval (−1, +1) . 

19.3.5 More examples


Several other elementary functions are also epps. We leave these facts as an exercise.
Exercise 19.24 (Elementary functions that are epps). Show that the following functions are
epps by inspection of their power series. Make plots to compare these functions.

1. Secant. The function 𝑡 ↦→ sec ( 2𝑡 /𝜋) is an epp on (−1, +1) .


2. Tangent. The function 𝑡 ↦→ tan ( 2𝑡 /𝜋) is an epp on (−1, +1) .
3. Arctanh. The function 𝑡 ↦→ arctanh (𝑡 ) is an epp on (−1, +1) .
4. Arcsine. The function 𝑡 ↦→ arcsin (𝑡 ) is an epp on [−1, +1] .
5. Log gamma. The function 𝑡 ↦→ log Γ( 1 − 𝑡 ) is an epp on (−1, +1) .

Hint: Use the magisterial Handbook of Mathematical Functions, prepared by Milton


Abramowitz & Irene Stegun [AS64] and updated by Frank Olver et al. [Olv+10]. Or
just turn to Wikipedia if you consider it trustworthy.

19.4 Absolutely monotone and completely monotone functions


All our examples of entrywise psd preservers are based on power series with positive
coefficients. Although this may appear to reflect a lack of inspiration, there is a deeper
reason. In the real setting, it can be shown that every epp is given by such a power
series, at least under some regularity conditions. We will discuss results of this type in
the next section.
As a preliminary, we need to take some time to develop the theory of power
series with positive coefficients. These objects also arise from the study of absolutely
monotone functions, and they have a very long history in analysis. In this subject,
one must pay close attention to differentiability assumptions and the behavior of a
function at the endpoints of its domain. Our exposition is modeled after Widder’s
classic book [Wid41, Chap. IV]. We elaborate on this discussion below in Section 19.6.

19.4.1 Absolute monotonicity


Bernstein introduced the concept of absolutely monotonicity for an infinitely differen-
tiable function.

Definition 19.25 (Absolutely monotone function). An infinitely differentiable function 𝑓 : (𝑎, 𝑏) → ℝ on an open interval of the real line is called absolutely monotone if its derivatives are positive:

𝑓 (𝑝) (𝑡 ) ≥ 0 for all 𝑡 ∈ (𝑎, 𝑏) and all 𝑝 ∈ ℤ+ .

(We allow 𝑎, 𝑏 ∈ {±∞} in this definition.)

For a left-closed, right-open interval, we say that 𝑓 : [𝑎, 𝑏) → ℝ is absolutely


monotone if 𝑓 is continuous on [𝑎, 𝑏) and absolutely monotone on (𝑎, 𝑏) .

Exercise 19.26 (Absolute monotonicity: Right-hand derivatives). Suppose that 𝑓 : [𝑎, 𝑏) →


ℝ is absolutely monotone. This includes the assumption that 𝑓 (𝑎) B limℎ ↓0 𝑓 (𝑎 + ℎ)
exists and is positive. Show that the right-derivative 𝑓 0 (𝑎) B limℎ ↓0 𝑓 0 (𝑎 + ℎ) exists
and is positive. With this choice of 𝑓 0 (𝑎) , confirm that 𝑓 0 is continuous on [𝑎, 𝑏) . By
induction, deduce that 𝑓 (𝑝) is a positive, continuous function on [𝑎, 𝑏) .
Exercise 19.27 (Absolute monotonicity: Properties). An absolutely monotone function is
positive, increasing, and convex.
Exercise 19.28 (Absolute monotonicity: Stability). On an open or half-open interval, the
absolutely monotone functions compose a convex cone. The product of two absolutely
monotone functions is absolutely monotone. The derivatives of an absolutely monotone
are also absolutely monotone.

19.4.2 Absolutely monotone functions are analytic


In general, the existence of derivatives does not imply that a function has a power
series representation. Absolutely monotone functions, however, do enjoy this property.
This is called the “little” Bernstein theorem.

Theorem 19.29 (Bernstein). Suppose that 𝑓 : [𝑎, 𝑏) → ℝ is absolutely monotone on


a half-open interval. Write 𝑟 B 𝑏 − 𝑎 . Then
𝑓 (𝑡 ) = ∑𝑝 ≥0 ( 𝑓 (𝑝) (𝑎)/𝑝 ! ) (𝑡 − 𝑎)𝑝 for all 𝑡 ∈ (𝑎 − 𝑟 , 𝑎 + 𝑟 ) . (19.2)

Moreover, 𝑓 extends to an analytic function on the open disc D𝑎 (𝑟 ) centered at 𝑎


with radius 𝑟 .
In particular, if 𝑓 : ℝ+ → ℝ is absolutely monotone, then it extends to an entire
function whose power series expansion at zero has positive coefficients.

Proof. Exercise 19.26 shows that 𝑓 has right-derivatives of all orders at 𝑡 = 𝑎 . For each
𝑛 ∈ ℕ, we may expand 𝑓 as a Taylor series with an integral remainder:
𝑓 (𝑡 ) = ∑𝑝=0,...,𝑛−1 ( 𝑓 (𝑝) (𝑎)/𝑝 ! ) (𝑡 − 𝑎)𝑝 + 𝑅 𝑛 (𝑡 ) when 𝑎 ≤ 𝑡 ≤ 𝑐 < 𝑏 .
We may express the remainder in the form

𝑅 𝑛 (𝑡 ) = ( 1/(𝑛 − 1) ! ) ∫[𝑎,𝑡 ] ( (𝑡 − 𝑠 )/(𝑐 − 𝑠 ) ) 𝑛−1 𝑓 (𝑛) (𝑠 ) (𝑐 − 𝑠 ) 𝑛−1 d𝑠 .
Since the fraction in the integrand decreases as a function of 𝑠 and the derivatives 𝑓 (𝑛) are positive on the interval [𝑎, 𝑏) ,

𝑅 𝑛 (𝑡 ) ≤ ( (𝑡 − 𝑎)/(𝑐 − 𝑎) ) 𝑛−1 · ( 1/(𝑛 − 1) ! ) ∫[𝑎,𝑡 ] 𝑓 (𝑛) (𝑠 )(𝑐 − 𝑠 ) 𝑛−1 d𝑠
      ≤ ( (𝑡 − 𝑎)/(𝑐 − 𝑎) ) 𝑛−1 · ( 1/(𝑛 − 1) ! ) ∫[𝑎,𝑐 ] 𝑓 (𝑛) (𝑠 )(𝑐 − 𝑠 ) 𝑛−1 d𝑠
      = ( (𝑡 − 𝑎)/(𝑐 − 𝑎) ) 𝑛−1 · 𝑅 𝑛 (𝑐 ) ≤ ( (𝑡 − 𝑎)/(𝑐 − 𝑎) ) 𝑛−1 · 𝑓 (𝑐 ).
Indeed, 𝑅 𝑛 (𝑐 ) ≤ 𝑓 (𝑐 ) because absolute monotonicity ensures that all terms in the
partial Taylor series are positive. We deduce that 𝑅 𝑛 (𝑡 ) → 0 for all 𝑡 ∈ [𝑎, 𝑐 ] .
Therefore, the series expansion (19.2) converges for 𝑡 ∈ [𝑎, 𝑐 ] . Since 𝑐 < 𝑏 is arbitrary,
we may extend the interval of definition to [𝑎, 𝑏) .
Finally, note that each summand in the series (19.2) is positive when 𝑡 ∈ [𝑎, 𝑎 + 𝑟 ) .
In other words, the series converges absolutely when 𝑡 ∈ [𝑎, 𝑎 + 𝑟 ) . As a consequence,
the series also converges for every 𝑡 ∈ (𝑎 − 𝑟 , 𝑎 + 𝑟 ) . In fact, the same conclusion
extends to 𝑡 ∈ D𝑎 (𝑟 ) . 

19.5 Vasudeva’s theorem


We are now prepared to state and prove a partial converse to Proposition 19.19. Every
epp on the strictly positive real line can be written as a power series with positive
coefficients [Vas79].

Theorem 19.30 (Entrywise psd preservers on ℝ++ ; Vasudeva 1979). A function 𝑓 : ℝ++ →
ℝ is an epp if and only if it is absolutely monotone. That is,
𝑓 (𝑡 ) = ∑𝑝 ≥0 𝑐𝑝 𝑡 𝑝 where 𝑐𝑝 ≥ 0 for all 𝑝 ∈ ℕ.

This statement includes the claim that the series converges for all 𝑡 > 0.

In Proposition 19.13, we have already seen that an epp on ℝ++ is necessarily


continuous. Theorem 19.30 asserts that an epp on ℝ++ is actually an analytic function.
In this section, we give a complete proof of Vasudeva’s theorem.
If we add a much stronger differentiability assumption, we can reach a similar
conclusion for the whole real line.

Corollary 19.31 (Entrywise psd preservers on ℝ). An analytic function 𝑓 : ℝ → ℝ is an


epp if and only if its power series expansion at zero has positive coefficients.

Proof. The reverse implication is Proposition 19.19.


For the forward direction, we assume that 𝑓 is analytic, so it has a power series
expansion about zero that converges on ℝ. Since 𝑓 is also an epp on the strictly
positive line ℝ++ , Vasudeva’s theorem guarantees that the coefficients in this power
series are positive. 
Corollary 19.31 is actually true without any regularity assumption, but this claim is
somewhat harder to prove. See Section 19.5.4 for related results of this type.

19.5.1 Smooth epps have positive derivatives


The key to the proof of Vasudeva’s theorem is a calculation for a smooth epp 𝑓 on ℝ++ .
The idea is to exploit the epp property to deduce that derivatives of 𝑓 are positive.
This idea dates back to work of Loewner and Horn [Hor69].
Lemma 19.32 (Smooth epp: Positive derivatives). Consider an infinitely differentiable
function 𝑓 : ℝ++ → ℝ that is an epp on ℝ++ . Then 𝑓 (𝑝) ≥ 0 for all 𝑝 ∈ ℤ+ .
Proof. Fix a point 𝑎 > 0. Choose a positive integer 𝑝 ∈ ℤ+ , and write 𝑑 = 𝑝 + 1. From
the assumption that 𝑓 is an epp, we may extract a family of inequalities:

𝒖 ∗ 𝑓 [𝑎 11∗ + 𝑡 𝒄 𝒄 ∗ ]𝒖 ≥ 0 for all 𝒖 ∈ ℝ𝑑 and 𝒄 ∈ ℝ𝑑++ and small 𝑡 > 0.

Indeed, when 𝑡 is sufficiently small, 𝑎 11ᵀ + 𝑡 𝒄 𝒄 ∗ is a psd matrix with strictly positive
entries. By the 𝑝 th order Taylor expansion of 𝑓 about 𝑎 with a derivative remainder,
we find that
∑𝑗 ,𝑘 =1,...,𝑑 𝑢 𝑗 𝑢𝑘 [ ∑𝑟 =0,...,𝑝−1 ( (𝑡 𝑐 𝑗 𝑐 𝑘 )𝑟 /𝑟 ! ) 𝑓 (𝑟 ) (𝑎) + ( (𝑡 𝑐 𝑗 𝑐 𝑘 )𝑝 /𝑝 ! ) 𝑓 (𝑝) (𝑎 + 𝑡 𝜃 𝑗 𝑘 𝑐 𝑗 𝑐 𝑘 ) ] ≥ 0.

The scaling coefficients 𝜃 𝑗 𝑘 ∈ ( 0, 1) , but they may depend on everything else.


To continue, we require the entries of 𝒄 ∈ ℝ𝑑++ to be distinct. Consider the 𝑑 × 𝑑 Vandermonde matrix

𝑽 = [ 𝑐 𝑗^𝑘 ] 𝑗 =1,...,𝑑 ; 𝑘 =0,...,𝑝 , whose 𝑗 th row is ( 1, 𝑐 𝑗 , 𝑐 𝑗^2 , . . . , 𝑐 𝑗^𝑝 ). (19.3)
Since the entries of 𝒄 are distinct, the Vandermonde matrix 𝑽 is nonsingular (Prob-
lem 19.33). Therefore, we can find a vector 𝒖 ∈ ℝ𝑑 that is orthogonal to the first 𝑝 columns but not orthogonal to the last column. Equivalently, ∑𝑗 𝑢 𝑗 𝑐 𝑗^𝑟 = 0 for 0 ≤ 𝑟 < 𝑝 , while ∑𝑗 𝑢 𝑗 𝑐 𝑗^𝑝 ≠ 0.
With these choices, the sum in the penultimate display collapses to the form

(𝑡 𝑝 /𝑝 ! ) ∑𝑗 ,𝑘 =1,...,𝑑 𝑢 𝑗 𝑢𝑘 · (𝑐 𝑗 𝑐𝑘 )𝑝 · 𝑓 (𝑝) (𝑎 + 𝑡 𝜃 𝑗 𝑘 𝑐 𝑗 𝑐 𝑘 ) ≥ 0.

Clear the leading fraction. Take 𝑡 ↓ 0 to conclude that 𝑓 (𝑝) (𝑎) ≥ 0. 


Problem 19.33 (Vandermonde determinants). Prove that the Vandermonde matrix (19.3) satisfies det (𝑽 ) = ∏𝑖 <𝑗 (𝑐 𝑗 − 𝑐 𝑖 ) . Hint: Compute the LU factorization of 𝑽 .
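A quick numerical check of Problem 19.33 (an aside): compare the determinant of a random Vandermonde matrix with the product formula.

import numpy as np
from itertools import combinations
from math import prod

rng = np.random.default_rng(5)
c = rng.uniform(0.1, 3.0, size=5)                  # distinct nodes (almost surely)
V = np.vander(c, increasing=True)                  # rows (1, c_i, c_i^2, ..., c_i^{d-1})

det_formula = prod(c[j] - c[i] for i, j in combinations(range(len(c)), 2))
print(np.linalg.det(V), det_formula)               # the two values agree up to rounding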

19.5.2 Smoothing
Next, we show that it is possible to take an arbitrary epp on ℝ++ and smooth it to
obtain an infinitely differentiable epp. This result requires the use of a mollifier, a nice
function that can confer its nice properties onto another function.
Fact 19.34 (Mollifier). Fix 𝛿 ∈ ( 0, 1) . There is a probability density function ℎ 𝛿 : ℝ++ →
ℝ+ that has compact support, is infinitely differentiable, has expectation 1, and has
variance 𝛿 . 

Exercise 19.35 (Epp: Smoothing). Let 𝑓 : ℝ+ → ℝ+ be a continuous function. For 𝛿 > 0,


introduce the mollifier ℎ 𝛿 , and construct the smoothed function

𝑓𝛿 (𝑡 ) B ∫[0,∞) 𝑓 (𝑡 /𝑥) · ℎ 𝛿 (𝑥) · (d𝑥/𝑥) for 𝑡 ≥ 0.

This integral can be interpreted as a convolution of two functions on the abelian group (ℝ++ , ×) .
Establish the following properties.

1. Continuity. The function 𝑓 𝛿 is continuous on ℝ+ .


2. Smoothness. The function 𝑓 𝛿 is infinitely differentiable on ℝ++ .
3. Limits. As 𝛿 ↓ 0, the sequence 𝑓 𝛿 → 𝑓 pointwise.
4. Epp. If 𝑓 is an epp on ℝ++ , then 𝑓 𝛿 is also an epp on ℝ++ .

Hint: The first three parts follow from dominated convergence, perhaps after the change
of variables 𝑢 = 𝑡 /𝑥 . For the last part, approximate the integral by a Riemann sum.

19.5.3 Proof of Theorem 19.30


We may now establish Vasudeva’s result (Theorem 19.30). The reverse implication is
Proposition 19.19. For the forward implication, the basic idea is to smooth the epp. The
smoothed epp has positive derivatives, so it must be absolutely monotone. Under some
regularity conditions, the limit of absolutely monotone functions remains absolutely
monotone, so we can transfer the power series representation of the smoothed epp
back to the original epp.
Assume that 𝑓 : ℝ++ → ℝ+ is an epp on ℝ++ . By Proposition 19.13, the epp 𝑓 is
continuous and increasing on ℝ++ . Thus, we can extend 𝑓 to a continuous function on
ℝ+ with the limiting value 𝑓 ( 0) = inf𝑡 >0 𝑓 (𝑡 ) .
For each 𝛿 > 0, introduce the smoothed function 𝑓 𝛿 : ℝ+ → ℝ+ , as in Exercise 19.35.
The function 𝑓 𝛿 is an infinitely differentiable epp on ℝ++ . Therefore, we may invoke
Lemma 19.32 to determine that 𝑓 𝛿(𝑝) ≥ 0 on ℝ++ for all 𝑝 ∈ ℤ+ . Since 𝑓 𝛿 is continuous
on ℝ+ , the “little” Bernstein theorem (Theorem 19.29) implies that 𝑓 𝛿 is an entire
function whose power series expansion at zero has positive coefficients:
𝑓𝛿 (𝑡 ) = ∑𝑝 ≥0 𝑐𝑝 (𝛿 )𝑡 𝑝 with 𝑐𝑝 (𝛿 ) ≥ 0 for all 𝑡 ∈ ℝ.

Uniformly in 𝛿 , the coefficients are absolutely summable because ∑𝑝 ≥0 𝑐 𝑝 (𝛿 ) = 𝑓𝛿 ( 1) → 𝑓 ( 1) . We will apply some standard results from complex analysis to complete
the argument.
Exercise 19.35 implies that 𝑓 𝛿 → 𝑓 pointwise as 𝛿 ↓ 0. We need to upgrade the
convergence. The family ( 𝑓 𝛿 : 0 < 𝛿 < 1) is uniformly bounded on each compact
set in ℂ because the coefficients in the power series for 𝑓 𝛿 are uniformly absolutely
summable. Passing to a subsequence if necessary, 𝑓 𝛿 → 𝑓 uniformly on each compact set in ℂ [Ahl66, Thm. 12, p. 217]. (This result follows from the Arzelà–Ascoli theorem.)

Finally, if a sequence of analytic functions converges uniformly on each compact


set in ℂ, then the limit is an analytic function [Ahl66, Thm. 1, p. 174]. (This result is an easy consequence of Morera's theorem.) We deduce that 𝑓 is analytic. Therefore,

𝑓 (𝑡 ) = ∑𝑝 ≥0 𝑐𝑝 𝑡 𝑝 with 𝑐𝑝 ≥ 0 for all 𝑡 ∈ ℝ.

Indeed, the coefficients must be positive because we have taken the limit of series with
positive coefficients. This is the required result.

19.5.4 Variations
Vasudeva’s theorem is, perhaps, the simplest of several results that characterize
entrywise psd preservers. A key feature of these results is that they do not make
strong prior assumptions on the smoothness properties of the epp, but rather extract
smoothness as a consequence of the structural assumption.
The first theorem of this type was derived by Schoenberg [Sch38].

Theorem 19.36 (Schoenberg 1938). A continuous function 𝑓 : [−1, +1] → ℝ is an


epp if and only if it has a representation as a power series about zero with positive
coefficients.
Somewhat later, Rudin [Rud59] removed the continuity condition from Schoenberg’s
theorem.

Theorem 19.37 (Rudin 1959). A function 𝑓 : (−1, +1) → ℝ is an epp if and only if it
has a representation as a power series about zero with positive coefficients.

Rudin conjectured that there was an extension of his result to the complex setting,
and this result was obtained soon after by Herz [Her63].

Theorem 19.38 (Herz 1963). A function 𝑓 : D0 ( 1) → ℂ is an epp if and only if it has


a representation of the form
𝑓 (𝑧) = ∑𝑝,𝑞 ≥0 𝑐𝑝𝑞 𝑧 𝑝 (𝑧 ∗ ) 𝑞 where 𝑐𝑝𝑞 ≥ 0.

See the survey [Bel+18] for more discussion of these results, as well as recent
advances.

19.6 *Completely monotone functions


The rest of this lecture is for general education. Absolutely monotone functions have a
counterpart where the derivatives alternate sign.

Definition 19.39 (Complete monotonicity). An infinitely differentiable function 𝑓 :


(𝑎, 𝑏) → ℝ on an open interval of the real line is called completely monotone if its
derivatives alternate sign:
(−1)𝑝 𝑓 (𝑝) (𝑡 ) ≥ 0 for all 𝑡 ∈ (𝑎, 𝑏) and all 𝑝 ∈ ℤ+ .

Exercise 19.40 (Absolute versus complete). Show that 𝑓 is completely monotone on (𝑎, 𝑏)
if and only if its reversal 𝑡 ↦→ 𝑓 (−𝑡 ) is absolutely monotone on (−𝑏, −𝑎) .
In many ways, it is more natural to study completely monotone functions, and we
will see that they enjoy a remarkable integral representation.

19.6.1 Differences and derivatives


Completely monotone functions also can be defined without continuity assumptions by
using right-difference operators. In this case, it is natural to insist that the domain of
the function is the positive (or strictly positive) real line.

Definition 19.41 (Right difference). For ℎ > 0, the right-difference operator of width ℎ
is defined on functions 𝑓 : ℝ+ → ℝ via the rule

Δℎ 𝑓 : 𝑎 ↦→ 𝑓 (𝑎 + ℎ) − 𝑓 (𝑎) for 𝑎 ∈ ℝ+ .

Higher-order differences are defined iteratively:


Δℎ𝑝+1 𝑓 : 𝑎 ↦→ (Δℎ𝑝 𝑓 )(𝑎 + ℎ) − (Δℎ𝑝 𝑓 )(𝑎) for 𝑎 ∈ ℝ+ .

Exercise 19.42 (Higher-order differences). Prove that the iterated differences satisfy
 
Δℎ𝑝 𝑓 (𝑎) = ∑𝑘 =0,...,𝑝 (−1)𝑝−𝑘 (𝑝 choose 𝑘 ) 𝑓 (𝑎 + 𝑘ℎ).
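Here is a small numerical sketch (an aside, with the function chosen for convenience) of the difference formula and of Definition 19.43 below: for 𝑓 (𝑡 ) = e−𝑡 , which we will recognize as completely monotone, the signed iterated differences (−1)𝑝 Δℎ𝑝 𝑓 (𝑎) are nonnegative.

import numpy as np
from math import comb

def iterated_difference(f, a, h, p):
    # Delta_h^p f(a) via the binomial formula from Exercise 19.42.
    return sum((-1) ** (p - k) * comb(p, k) * f(a + k * h) for k in range(p + 1))

f = lambda t: np.exp(-t)                           # completely monotone on R_+

for p in range(6):
    val = iterated_difference(f, a=0.3, h=0.25, p=p)
    print(p, (-1) ** p * val)                      # nonnegative for every p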
With this concept, we can give another definition of complete monotonicity that
does not rely on differentiability properties.

Definition 19.43 (Completely monotone function II). A function 𝑓 : ℝ+ → ℝ is


completely monotone if

(−1)𝑝 Δℎ𝑝 𝑓 ≥ 0 for all ℎ > 0 and all 𝑝 ∈ ℤ+ .

The definition in terms of differences essentially implies the condition on derivatives


in the original definition.
Exercise 19.44 (Differences and derivatives). Assume that 𝑓 is infinitely differentiable
on ℝ++ and completely monotone in the sense of Definition 19.43. Prove that the
derivatives of 𝑓 alternate sign: (−1)𝑝 𝑓 (𝑝) ≥ 0 on ℝ++ for each 𝑝 ∈ ℤ+ .
Exercise 19.45 (Completely monotone function: Properties). Suppose that 𝑓 : ℝ+ → ℝ
is completely monotone in the sense of Definition 19.43. Prove that 𝑓 is positive,
decreasing, and convex. Therefore, 𝑓 is continuous on ℝ++ .
Exercise 19.46 (Completely monotone function: Transforms). Suppose that 𝑓 : ℝ+ → ℝ
is completely monotone in the sense of Definition 19.43. Establish the following
properties.
1. Scaling. For 𝑎 > 0, the scaling 𝑎 𝑓 is completely monotone.
2. Differences. For ℎ > 0, the negative difference −Δℎ 𝑓 is completely monotone.
3. Translation. For 𝑐 > 0, the shift 𝑡 ↦→ 𝑓 (𝑡 + 𝑐 ) is completely monotone.
4. Dilation. For 𝑎 > 0, the dilation 𝑡 ↦→ 𝑓 (𝑎𝑡 ) is completely monotone.

Exercise 19.47 (Completely monotone functions: Convex cone). Prove that the completely
monotone functions on ℝ+ compose a convex cone that is closed under pointwise
limits.

19.6.2 Bernstein’s theorem on completely monotone functions


As we have mentioned, completely monotone functions have a beautiful integral
representation. This result is called the “big” Bernstein theorem. Our proof is drawn
from Lax [Lax02, Sec. 14.3].

Theorem 19.48 (Bernstein). A function 𝑓 : ℝ+ → ℝ is completely monotone as in


Definition 19.43 if and only if it is the Laplace transform of a finite, positive measure
𝜇 on ℝ+ . That is, for some 𝑐 ≥ 0,

𝑓 (𝑡 ) = 𝑐 𝛿 0 (𝑡 ) + ∫ℝ+ e−𝑡 𝑥 d𝜇(𝑥) for all 𝑡 ∈ ℝ+ .

(This expression can be written more compactly as an integral over ℝ+ with the understanding that 𝑡 ↦→ e−∞·𝑡 = 𝛿 0 (𝑡 ) .)

A remarkable consequence of Theorem 19.48 is that a completely monotone function


is not just continuous on ℝ++ but also analytic on ℝ++ . Therefore, Definition 19.39 is
actually equivalent with Definition 19.43 up to the behavior at 𝑡 = 0.
It is also rather interesting to compare Bernstein’s theorem with Bochner’s theorem.
The former tells us what happens if we take the Laplace transform of a finite, positive
measure, while the latter tells us what happens if we take the Fourier transform.

Proof sketch. The reverse direction simply asks us to verify that the integral represents
a completely monotone function. This is an easy calculation.
Consider the convex set of normalized completely monotone functions:

B ≔ {𝑓 : ℝ+ → ℝ with 𝑓 ( 0) = 1 and 𝑓 completely monotone}.

Since completely monotone functions are positive and decreasing, every function 𝑓 ∈ B
satisfies 0 ≤ 𝑓 (𝑡 ) ≤ 1 for 𝑡 ∈ ℝ+ . After some argument, it follows that the set B is
compact in the topology of pointwise convergence.
Let 𝑒 be an extreme point of B. If 𝑒 (𝑠 ) = 1 for some 𝑠 > 0, then 𝑒 (𝑡 ) = 1 for all 𝑡 ∈ ℝ+ because
𝑒 is convex and decreasing. If 𝑒 (𝑡 ) = 0 for all 𝑡 > 0, then 𝑒 (𝑡 ) = 𝛿0 (𝑡 ).
Let us exclude these two cases. Then 𝑒 must be continuous on ℝ+ . Indeed, a
completely monotone function is continuous on ℝ++ . But 𝑒 cannot have a discontinuity
at zero, or else it is a convex combination of 𝛿 0 and another function in B.
To continue, observe that there is a point 𝑎 0 > 0 where 0 < 𝑒 (𝑎 0 ) < 1. Since 𝑒 is
continuous, we have 0 < 𝑒 (𝑎) < 1 on the interval 0 < 𝑎 ≤ 𝑎 0 . Fix a point 0 < 𝑎 ≤ 𝑎 0 ,
and define two functions
$f_a(t) := \frac{e(t+a)}{e(a)}$ and $g_a(t) := \frac{e(t) - e(t+a)}{1 - e(a)}$ for $t \in \mathbb{R}_+$.

Both functions 𝑓𝑎 , 𝑔 𝑎 ∈ B, and clearly

𝑒 (𝑡 ) = 𝑒 (𝑎) · 𝑓𝑎 (𝑡 ) + ( 1 − 𝑒 (𝑎)) · 𝑔 𝑎 (𝑡 ) for all 𝑡 ∈ ℝ+ .

Since 𝑒 is an extreme point, 𝑒 = 𝑓𝑎 = 𝑔 𝑎 . From the definition of 𝑓𝑎 , we see that

𝑒 (𝑡 + 𝑎) = 𝑒 (𝑡 )𝑒 (𝑎) for all 𝑡 ∈ ℝ+ when 0 < 𝑎 ≤ 𝑎 0 .

The only continuous solution to these equations with 𝑒 ( 0) = 1 and 0 < 𝑒 (𝑡 ) ≤ 1 takes
the form 𝑒 (𝑡 ) = e−𝑡 𝑥 for some 𝑥 > 0.
The integral representation follows from a routine application of the Krein–Milman
theorem. 

Notes
The presentation in this lecture is new. We have used the survey article [Bel+18] as
a guide to the literature, and we have adapted some of the organization from them.
Widder’s book [Wid41] remains an excellent source for material on Laplace transforms,

absolutely monotone functions, and completely monotone functions. We have revised


Vasudeva’s proof to make parts of it more transparent. The simple proof of the “big”
Bernstein theorem via Krein–Milman is adapted from Lax [Lax02] with some mistakes
corrected.

Lecture bibliography
[AS64] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas,
graphs, and mathematical tables. For sale by the Superintendent of Documents. U.
S. Government Printing Office, Washington, D.C., 1964.
[Ahl66] L. V. Ahlfors. Complex analysis: An introduction of the theory of analytic functions of
one complex variable. Second. McGraw-Hill Book Co., New York-Toronto-London,
1966.
[Bel+18] A. Belton et al. “A panorama of positivity”. Available at https://arXiv.org/
abs/1812.05482. 2018.
[Cha13] D. Chafaï. A probabilistic proof of the Schoenberg theorem. 2013. url: https://djalil.chafai.net/blog/2013/02/09/a-probabilistic-proof-of-the-schoenberg-theorem/.
[Her63] C. S. Herz. “Fonctions opérant sur les fonctions définies-positives”. In: Ann. Inst.
Fourier (Grenoble) 13 (1963), pages 161–180. url: http://aif.cedram.org/
item?id=AIF_1963__13__161_0.
[Hor69] R. A. Horn. “The theory of infinitely divisible matrices and kernels”. In: Trans.
Amer. Math. Soc. 136 (1969), pages 269–286. doi: 10.2307/1994714.
[Lax02] P. D. Lax. Functional analysis. Wiley-Interscience, 2002.
[Olv+10] F. W. J. Olver et al., editors. NIST handbook of mathematical functions. With 1
CD-ROM (Windows, Macintosh and UNIX). National Institute of Standards and
Technology, 2010.
[Rud59] W. Rudin. “Positive definite sequences and absolutely monotonic functions”. In:
Duke Math. J. 26 (1959), pages 617–622. url: http://projecteuclid.org/
euclid.dmj/1077468771.
[Sch38] I. J. Schoenberg. “Metric spaces and positive definite functions”. In: Trans. Amer.
Math. Soc. 44.3 (1938), pages 522–536. doi: 10.2307/1989894.
[Vas79] H. Vasudeva. “Positive definite matrices and absolutely monotonic functions”. In:
Indian J. Pure Appl. Math. 10.7 (1979), pages 854–858.
[Wid41] D. V. Widder. The Laplace Transform. Princeton University Press, Princeton, N. J.,
1941.
II. problem sets

1 Multilinear Algebra & Majorization . . . . . . . . 186

2 UI Norms & Variational Principles . . . . . . . . 190

3 Perturbation Theory & Positive Maps . . . . . 194


1. Multilinear Algebra & Majorization

Date: 4 January 2022

This assignment covers multilinear algebra and majorization. The problems are
optional, but keep in mind that you have to do mathematics to learn mathematics!

1. Kron Job. Let H be an 𝑛 -dimensional inner-product space over the field 𝔽 with
orthonormal basis ( e1 , . . . , e𝑛 ) . We write the matrix of a linear operator on 𝐻
with respect to this distinguished basis. We arrange the family ( e𝑖 ⊗ e 𝑗 : 1 ≤
𝑖 , 𝑗 ≤ 𝑛) of elementary tensors in H ⊗ H in lexicographic order. That is, e𝑖 ⊗ e𝑗
precedes e𝑘 ⊗ eℓ when 𝑖 < 𝑘 or else 𝑖 = 𝑘 and 𝑗 < ℓ.

1. Show that ( e𝑖 ⊗ e 𝑗 : 𝑖 , 𝑗 ) is a basis for H ⊗ H. The easiest way to do this


is via the identification of elementary tensors as linear functionals on the
space of bilinear forms. When are two of these linear functionals linearly
independent?
2. Suppose that 𝒙 , 𝒚 , 𝒛 are linearly independent vectors. Show that

𝒙 ⊗𝒚 +𝒛 ⊗𝒘
is an elementary tensor if and only if 𝒘 = 𝛼 𝒚 for some scalar 𝛼 ∈ 𝔽 .
3. Let 𝑨 and 𝑩 be linear operators on H, so 𝑨 ⊗ 𝑩 is a linear operator on
H ⊗ H. Show that the matrix representation of this operator with respect to
the ordered basis for H ⊗ H takes the form

$\mathcal{M}(\bm{A} \otimes \bm{B}) = \begin{bmatrix} a_{11}\bm{B} & \cdots & a_{1n}\bm{B} \\ \vdots & \ddots & \vdots \\ a_{n1}\bm{B} & \cdots & a_{nn}\bm{B} \end{bmatrix} \in \mathbb{F}^{n^2 \times n^2}.$

The block matrix is called the Kronecker product of 𝑨 and 𝑩 . It gives a
concrete representation of the tensor product 𝑨 ⊗ 𝑩 . We typically omit the
ℳ in this notation.
4. The operator vec : $\mathbb{F}^{n \times n} \to \mathbb{F}^{n^2}$ maps a matrix to a vector by concatenating
the columns in order from left to right. Show that

(𝑨 ⊗ I) · vec (𝑴 ) = vec (𝑴 𝑨 ᵀ ) and ( I ⊗ 𝑩) · vec (𝑴 ) = vec (𝑩𝑴 )


Interpret the tensors 𝑨 ⊗ I and I ⊗ 𝑩 as right- and left-multiplication
operators. Without making any further calculations, use this interpretation
to explain why 𝑨 ⊗ I and I ⊗ 𝑩 must commute.
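(A numerical sketch illustrating parts 3–6 appears at the end of this problem.)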
5. Assume for concreteness that 𝑨 and 𝑩 are Hermitian. Describe the eigen-
values and eigenvectors of 𝑨 ⊗ I, of I ⊗ 𝑩 , of 𝑨 ⊗ 𝑩 , and of 𝑨 ⊗ I + I ⊗ 𝑩 .
6. Assume that 𝑨 is Hermitian. What are the eigenvalues of the operator
𝑨 ⊗ I ⊗ · · · ⊗ I + · · · + I ⊗ · · · ⊗ I ⊗ 𝑨 acting on ⊗𝑘 H? Each of the 𝑘 terms
is a tensor product of 𝑘 − 1 identity operators and one copy of 𝑨 .

7. (*) The Schur product of 𝑨 and 𝑩 is the matrix 𝑨 · 𝑩 = [𝑎𝑖 𝑗 𝑏 𝑖 𝑗 ] . Relate


the Schur product to the Kronecker product, and use this fact to establish
Schur’s Product Theorem: If 𝑨 and 𝑩 are positive semidefinite, then 𝑨 · 𝑩
is positive semidefinite.
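Not part of the assignment: a small numpy sketch (helper names ours) that sanity-checks the Kronecker identities from parts 3–6 on random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
I = np.eye(n)
vec = lambda X: X.flatten(order="F")   # stack the columns from left to right

# Part 4: the vec identities for the Kronecker product.
assert np.allclose(np.kron(A, I) @ vec(M), vec(M @ A.T))
assert np.allclose(np.kron(I, B) @ vec(M), vec(B @ M))
# Right- and left-multiplication operators commute.
assert np.allclose(np.kron(A, I) @ np.kron(I, B), np.kron(I, B) @ np.kron(A, I))

# Parts 5-6 (Hermitian case): the eigenvalues of A (x) I + I (x) B are the sums
# of an eigenvalue of A and an eigenvalue of B.
A = (A + A.conj().T) / 2
B = (B + B.conj().T) / 2
sums = np.add.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel()
eigs = np.linalg.eigvalsh(np.kron(A, I) + np.kron(I, B))
assert np.allclose(np.sort(eigs), np.sort(sums))
print("Kronecker identities verified on a random example.")
```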
2. Wedge and vee. Let 𝒙 1 , . . . , 𝒙 𝑘 ∈ H, and let 𝒚 1 , . . . , 𝒚 𝑘 ∈ H.

1. Verify that the antisymmetric tensor product is related to the determinant:

$\langle \bm{x}_1 \wedge \cdots \wedge \bm{x}_k,\ \bm{y}_1 \wedge \cdots \wedge \bm{y}_k \rangle = \det\, [\langle \bm{x}_i, \bm{y}_j \rangle].$

2. (*) Verify that the symmetric tensor product is related to the permanent:

$\langle \bm{x}_1 \vee \cdots \vee \bm{x}_k,\ \bm{y}_1 \vee \cdots \vee \bm{y}_k \rangle = \operatorname{per}\, [\langle \bm{x}_i, \bm{y}_j \rangle].$

3. Show that 𝒙 1 ∧ · · · ∧ 𝒙 𝑘 = 0 if and only if the family {𝒙 1 , . . . , 𝒙 𝑘 } is linearly


dependent.
4. The span of each elementary antisymmetric tensor is a one-dimensional subspace of $\bigwedge^k$ H. Prove that these one-dimensional subspaces of $\bigwedge^k$ H are in one-to-one correspondence with the 𝑘 -dimensional subspaces of H.
3. Majorization mix. We extend a function 𝜑 : ℝ → ℝ to a function 𝜑 : ℝ𝑛 → ℝ𝑛 by the definition [𝜑 (𝒙 )]𝑖 = 𝜑 (𝑥𝑖 ) for each index 𝑖 = 1, . . . , 𝑛 . The function (𝑎)+ := max{𝑎, 0}.

1. Xiaoqi’s problem. Is it true that 𝑛 −1 1 ≺ 𝒙 ≺ ( 1, 0, . . . , 0) for every vector


𝒙 ∈ ℝ𝑛 with tr 𝒙 = 1? Prove it or provide a counterexample.
2. DS. Suppose that the matrix 𝑨 ∈ ℝ𝑛×𝑛 is positive, trace-preserving, and
unital. Prove that 𝑨 is doubly stochastic.
3. k-max. Using duality, show that the function $\bm{x} \mapsto \sum_{i=1}^{k} x_i^{\downarrow}$ is convex.
4. Submajorization without rearrangement. Show that 𝒙 ≺𝑤 𝒚 if and only if
tr (𝒙 − 𝑡 e)+ ≤ tr (𝒚 − 𝑡 e)+ for all 𝑡 ∈ ℝ.
5. Majorization without rearrangement. Using the previous part, show that 𝒙 ≺ 𝒚
if and only if tr |𝒙 − 𝑡 e | ≤ tr |𝒚 − 𝑡 e | for all 𝑡 ∈ ℝ.
6. Majorization and convexity. Using the previous part and results from class,
show that 𝒙 ≺ 𝒚 if and only if tr 𝜑 (𝒙 ) ≤ tr 𝜑 (𝒚 ) for every convex function
𝜑 : ℝ → ℝ.
7. Schur “Convexity” theorem. Suppose that Φ : ℝ𝑛 → ℝ is a differentiable,
permutation invariant function. Consider the property
 
$(x_i - x_j) \left( \frac{\partial \Phi}{\partial x_i}(\bm{x}) - \frac{\partial \Phi}{\partial x_j}(\bm{x}) \right) \ge 0$ for all 𝒙 ∈ ℝ𝑛 and all 1 ≤ 𝑖 < 𝑗 ≤ 𝑛 .
Prove that Φ is isotone if and only if the condition in the display holds.
Hints: Reduce to showing that the map preserves majorization for vectors
that differ in two coordinates only. Use interpolation, as in class, and the
fundamental theorem of calculus.
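Not part of the assignment: a small helper (name ours) for experimenting numerically with the majorization relations that appear throughout Problem 3.

```python
import numpy as np

def majorizes(y, x, tol=1e-12):
    """Return True if x is majorized by y: equal sums and dominated partial sums."""
    xs = np.sort(np.asarray(x, dtype=float))[::-1]
    ys = np.sort(np.asarray(y, dtype=float))[::-1]
    return (abs(xs.sum() - ys.sum()) <= tol
            and bool(np.all(np.cumsum(xs) <= np.cumsum(ys) + tol)))

n = 5
print(majorizes(np.eye(n)[0], np.full(n, 1.0 / n)))   # uniform vector is majorized by (1, 0, ..., 0)
P = np.ones((n, n)) / n                               # a doubly stochastic (averaging) matrix
y = np.array([3.0, 1.0, 0.5, -1.0, 0.25])
print(majorizes(y, P @ y))                            # P y is majorized by y: averaging flattens
```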
4. Ky Fan norms. In this problem, we explore an important class of matrix norms
that will arise later.

1. Ky Fan maximum principle. Use Schur’s majorization theorem on the diagonal


of an Hermitian matrix to establish the following identity for an Hermitian
matrix 𝑨 :

$\sum_{i=1}^{k} \lambda_i^{\downarrow}(\bm{A}) = \max \sum_{i=1}^{k} \langle \bm{u}_i, \bm{A}\bm{u}_i \rangle,$
where the maximum ranges over all sets {𝒖 𝑖 : 𝑖 = 1, . . . , 𝑘 } of 𝑘 orthonor-
mal vectors.
2. Use the previous part to prove the following identity for Hermitian matrices
𝑨, 𝑩 ∈ ℂ𝑛×𝑛 , where 1 ≤ 𝑘 ≤ 𝑛 :
$\sum_{i=1}^{k} \lambda_i^{\downarrow}(\bm{A} + \bm{B}) \le \sum_{i=1}^{k} \lambda_i^{\downarrow}(\bm{A}) + \sum_{i=1}^{k} \lambda_i^{\downarrow}(\bm{B})$

3. Let 𝑪 ∈ ℂ𝑚×𝑛 be a general matrix, and consider the Hermitian dilation


 
$\mathcal{H}(\bm{C}) = \begin{bmatrix} \bm{0} & \bm{C} \\ \bm{C}^* & \bm{0} \end{bmatrix}.$

Let 𝜎1 , . . . , 𝜎𝑟 be the first 𝑟 singular values of 𝑪 , where 𝑟 = min{𝑚, 𝑛}.


Demonstrate that the eigenvalues of ℋ(𝑪 ) are ±𝜎1 , . . . , ±𝜎𝑟 , along with
an appropriate number of zeros.
4. Use the last two parts to prove the following identity for general matrices
𝑪 , 𝑭 ∈ ℂ𝑚×𝑛 , where 𝑘 ≤ min{𝑚, 𝑛}:
$\sum_{i=1}^{k} \sigma_i^{\downarrow}(\bm{C} + \bm{F}) \le \sum_{i=1}^{k} \sigma_i^{\downarrow}(\bm{C}) + \sum_{i=1}^{k} \sigma_i^{\downarrow}(\bm{F})$
5. Conclude that $\|\bm{C}\|_{(k)} = \sum_{i=1}^{k} \sigma_i^{\downarrow}(\bm{C})$ defines a norm, called the Ky Fan 𝑘 -norm.
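Not part of the assignment: a quick numerical check of the singular-value inequality from parts 4–5 (helper name ours).

```python
import numpy as np

def ky_fan(C, k):
    """Ky Fan k-norm: the sum of the k largest singular values."""
    return np.linalg.svd(C, compute_uv=False)[:k].sum()

rng = np.random.default_rng(1)
m, n = 6, 4
C = rng.standard_normal((m, n))
F = rng.standard_normal((m, n))
for k in range(1, min(m, n) + 1):
    assert ky_fan(C + F, k) <= ky_fan(C, k) + ky_fan(F, k) + 1e-10
print("Triangle inequality holds for every Ky Fan k-norm on this example.")
```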
5. *The cone of convexity. It is often the case that an interesting class of functions forms a convex cone, and we can understand that class well by analyzing the geometry of the cone. Here is an important classical example of this phenomenon.
Recall that a convex cone is a subset of a (real) linear space that contains all
positive multiples and (finite) sums of its elements. The conic hull of a set S is
given by
$\operatorname{cone}(\mathsf{S}) := \bigl\{ \textstyle\sum_{i=1}^{r} \alpha_i \bm{x}_i \text{ where } \alpha_i \ge 0 \text{ and } \bm{x}_i \in \mathsf{S} \text{ and } r \in \mathbb{N} \bigr\}.$

1. Let I = [ 0, 1] . Consider the class

𝒞 := {𝜑 : I → ℝ is convex, continuous, and 𝜑 ( 0) = 0.}

Show that 𝒞 is a convex cone in C ( I) , the Banach space of real-valued,


continuous functions on I, equipped with the supremum norm.
2. Hardy, Littlewood, Pólya. Consider the subset S ⊂ C ( I) consisting of linear functions and convex angle functions:

S = {𝜑 : 𝑡 ↦→ 𝑡 } ∪ {𝜑 : 𝑡 ↦→ −𝑡 } ∪ {𝜑 : 𝑡 ↦→ (𝑡 − 𝛼)+ for 𝛼 ∈ I}.

Show that cone ( S) is dense in the cone 𝒞 of convex functions.



3. Karamata. The dual of C ( I) is the Banach space of signed (Radon) measures


on I, equipped with the total variation norm. The dual cone 𝒞 ∗ of 𝒞 is
defined as the set of all measures 𝜇 that satisfy
$\int_0^1 \varphi \, \mathrm{d}\mu \ge 0$ for all 𝜑 ∈ 𝒞 .

Prove that 𝒞 ∗ contains precisely those measures 𝜇 that satisfy


$\int_0^1 x \, \mathrm{d}\mu = 0$ and $\int_0^t \mu([0, x]) \, \mathrm{d}x \ge 0$ whenever 0 ≤ 𝑡 ≤ 1.

4. What are the extreme rays of the cone 𝒞 ?


2. UI Norms & Variational Principles

Date: 27 January 2022

This assignment covers majorization, rearrangements, symmetric gauge functions,


unitarily invariant norms, complex interpolation, variational principles, and spectral
functions.

1. Majorization cone. Consider the convex cone of ordered positive vectors:

𝐾 = {𝒙 ∈ ℝ𝑛 : 𝑥 1 ≥ 𝑥 2 ≥ · · · ≥ 𝑥𝑛 ≥ 0}.

1. Determine the dual cone 𝐾 ∗ .


2. Express the condition 𝒙 ∈ 𝐾 ∗ in terms of a majorization relation.
3. We say that 𝒚 dominates 𝒙 with respect to the 𝐾 ∗ order if and only if
𝒚 − 𝒙 ∈ 𝐾 ∗ . Show that a function 𝑓 : 𝐾 → ℝ is isotone if and only if
𝑓 (𝒙 ) ≤ 𝑓 (𝒚 ) whenever 𝒚 dominates 𝒙 with respect to the 𝐾 ∗ order.
2. Rearrangements. Rearrangements of vectors and functions are a classical part of
the theory of majorization, and they also play an important role in real analysis.
We’ll establish some of the basic rearrangement inequalities.

1. Let 𝒙 , 𝒚 ∈ ℝ𝑛 . Prove Chebyshev’s rearrangement inequality:


$\sum_{i=1}^{n} x_i^{\downarrow} y_i^{\uparrow} \le \sum_{i=1}^{n} x_i y_i \le \sum_{i=1}^{n} x_i^{\downarrow} y_i^{\downarrow}.$

2. (*) Let 𝑓 : [ 0, 1] → ℝ be absolutely integrable. Define the function


𝑚 (𝑦 ) B 𝜇{𝑡 ∈ ℝ : 𝑓 (𝑡 ) > 𝑦 }, where 𝜇 is the Lebesgue measure. The
decreasing rearrangement of 𝑓 is the function

𝑓 ↓ (𝑥) = sup{𝑦 : 𝑚 (𝑦 ) > 𝑥 } for 0 ≤ 𝑥 ≤ 1.

Show that 𝑓 and 𝑓 ↓ are equimeasurable:


$\int_0^1 \varphi(f(x)) \, \mathrm{d}x = \int_0^1 \varphi(f^{\downarrow}(x)) \, \mathrm{d}x$ for all 𝜑 : ℝ → ℝ

provided that the integrals exist.


3. (*) For absolutely integrable functions 𝑓 , 𝑔 : [ 0, 1] → ℝ, prove that
$\int_0^1 f(x)\, g(x) \, \mathrm{d}x \le \int_0^1 f^{\downarrow}(x)\, g^{\downarrow}(x) \, \mathrm{d}x.$


3. Fan minimum principles. Let $\Phi_{(k)}(\bm{x}) = \sum_{i=1}^{k} |x|_i$ be the 𝑘 -max norm on ℝ𝑛 , where 1 ≤ 𝑘 ≤ 𝑛 . Define the Ky Fan matrix norm $\|\cdot\|_{(k)} = \Phi_{(k)} \circ \bm{\sigma}$.

1. For 𝒙 ∈ ℝ𝑛 , show that Φ ( 1) (𝒙 ) ≤ Φ(𝒙 ) ≤ Φ (𝑛) (𝒙 ) for each symmetric


gauge function Φ.

2. For each vector 𝒙 ∈ ℝ𝑛 , prove that

Φ (𝑘 ) (𝒙 ) = min{Φ (𝑛) (𝒚 ) + 𝑘 Φ ( 1) (𝒛 ) : 𝒙 = 𝒚 + 𝒛 }.

3. Determine the dual norm of Φ (𝑘 ) .


4. For each matrix 𝑿 ∈ ℂ𝑛×𝑛 , prove that

$\|\bm{X}\|_{(k)} = \min\bigl\{ \|\bm{Y}\|_{(n)} + k\,\|\bm{Z}\|_{(1)} : \bm{X} = \bm{Y} + \bm{Z} \bigr\}.$

5. Determine the dual norm of the Ky Fan matrix norm $\|\cdot\|_{(k)}$.


Variational principles are powerful tools in analysis.
4. More variational principles.
One way they are useful is to linearize complicated functions. This problem
contains a few additional variational principles.

1. Let 𝑨, 𝑩 ∈ ℂ𝑛×𝑛 . Use the von Neumann trace theorem and the Hermitian
dilation to demonstrate that
$\max\{\operatorname{Re} \operatorname{tr}(\bm{U}^* \bm{A} \bm{V} \bm{B}) : \bm{U}, \bm{V} \in \mathbb{C}^{n \times n} \text{ are unitary}\} = \sum_{i=1}^{n} \sigma_i(\bm{A})\, \sigma_i(\bm{B}).$

2. For each Hermitian matrix 𝑨 ,


$\sum_{i=1}^{k} \lambda_i^{\downarrow}(\bm{A}) = \max_{\bm{U}} \operatorname{tr}(\bm{U}^* \bm{A} \bm{U}) \quad\text{and}\quad \sum_{i=1}^{k} \lambda_i^{\uparrow}(\bm{A}) = \min_{\bm{U}} \operatorname{tr}(\bm{U}^* \bm{A} \bm{U}),$

where 𝑼 ranges over all 𝑛 × 𝑘 matrices with orthonormal columns. What


is the analog of this result for singular values?
3. For each positive-semidefinite matrix 𝑨 ,
$\prod_{i=1}^{k} \lambda_i^{\downarrow}(\bm{A}) = \max_{\bm{U}} \det(\bm{U}^* \bm{A} \bm{U}) \quad\text{and}\quad \prod_{i=1}^{k} \lambda_i^{\uparrow}(\bm{A}) = \min_{\bm{U}} \det(\bm{U}^* \bm{A} \bm{U}),$

where 𝑼 ranges over all 𝑛 × 𝑘 matrices with orthonormal columns. What


is the analog of this result for singular values?
4. (*) For each positive-definite matrix 𝑨 ∈ ℍ𝑛 ,

$(\det \bm{A})^{1/n} = \min\{ n^{-1} \operatorname{tr}(\bm{A}\bm{B}) : \det \bm{B} = 1,\ \bm{B} \succ \bm{0} \}.$

This result implies Minkowski’s determinant theorem, problem (6)(c)(iv).


How?
In this problem, we will use complex interpolation to
5. Interpolation inequalities.
prove some interesting matrix inequalities that are not necessarily easy to obtain
otherwise. You will also need to use duality to represent a matrix norm as the
supremum of a trace.

1. Let 𝑨, 𝑩 be positive semidefinite. Show that

$\|\bm{A}^s \bm{B}^s\| \le \|\bm{A}\bm{B}\|^s$ for 0 ≤ 𝑠 ≤ 1.

Deduce that $\|\bm{A}\bm{B}\|^s \le \|\bm{A}^s \bm{B}^s\|$ for 𝑠 ≥ 1. Compare with [Bha97, Theorem


IX.2.1].

2. Let 𝑨, 𝑩 be positive semidefinite. Use part (a) to conclude that

$\|\bm{B}^s \bm{A}^s \bm{B}^s\| \le \|(\bm{B}\bm{A}\bm{B})^s\|$ for 0 ≤ 𝑠 ≤ 1.


A similar result holds for 𝑠 ≥ 1.
3. (*) Use anti-symmetric tensor products and majorization techniques to
extend (b) to all unitarily invariant norms. The Schatten 1-norm case is
the Araki–Lieb–Thirring inequality.
4. Let 𝑨, 𝑩 be positive semidefinite, and let 𝑿 be arbitrary. For 𝑡 ∈ [ 0, 1] ,
prove that
$\|\bm{A}^t \bm{X} \bm{B}^{1-t}\| \le \|\bm{A}\bm{X}\|^t\, \|\bm{X}\bm{B}\|^{1-t}.$
This bound holds for every unitarily invariant norm. Explain how to extend
your argument to address this case; this is easier than (c). Compare with
[Bha97, Corollary IX.5.3].
5. (*) Let 𝑨 1 , . . . , 𝑨 𝑚 ∈ 𝕄𝑛 , and let 𝑩 ∈ 𝕄𝑛 be positive semidefinite. For
each 𝑝 ≥ 1, prove

$\Bigl( \operatorname{tr} \Bigl| \sum_{i=1}^{m} \bm{A}_i^* \bm{B} \bm{A}_i \Bigr|^{2p} \Bigr)^{1/(2p)} \le \Bigl( \operatorname{tr} \Bigl( \sum_{i=1}^{m} \bm{A}_i^* \bm{A}_i \Bigr)^{p} \Bigr)^{1/p} \bigl( \operatorname{tr} \bm{B}^{2p} \bigr)^{1/(2p)}.$

Recall that |𝑴 | B (𝑴 ∗ 𝑴 ) 1/2 , and connect it with the Schatten norm.


6. Lewis’s approach to spectral functions. Let ℍ𝑛 be the real linear space of Hermitian matrices of size 𝑛 , equipped with the trace inner product ⟨𝒀 , 𝑿 ⟩ = tr (𝒀 ∗ 𝑿 ) . Write 𝝀(𝑨) for the decreasingly ordered eigenvalues of an Hermitian matrix 𝑨 . Let 𝑓 : ℝ𝑛 → ℝ ∪ {+∞} be permutation-invariant. A spectral function on Hermitian matrices is a map of the form

( 𝑓 ◦ 𝝀) (𝑨) = 𝑓 (𝜆 1 (𝑨), 𝜆2 (𝑨), . . . , 𝜆𝑛 (𝑨))


Let 𝑓 : ℝ𝑛 → ℝ ∪ {+∞}. The Fenchel conjugate of 𝑓 is the (closed, convex)
function
$f^*(\bm{y}) = \sup_{\bm{x} \in \mathbb{R}^n} \bigl[ \langle \bm{y}, \bm{x} \rangle - f(\bm{x}) \bigr]$ for 𝒚 ∈ ℝ𝑛 .

If 𝑓 is closed and convex, then (𝑓 ∗ ) ∗ = 𝑓 . Recall Fenchel’s inequality:

$\langle \bm{y}, \bm{x} \rangle \le f^*(\bm{y}) + f(\bm{x})$ for all 𝒙 , 𝒚 ∈ ℝ𝑛 .

The subdifferential of 𝑓 at the point 𝒙 is the set

$\partial f(\bm{x}) = \{ \bm{y} \in \mathbb{R}^n : \langle \bm{y}, \bm{x} \rangle = f^*(\bm{y}) + f(\bm{x}) \}.$

If 𝑓 is differentiable at 𝒙 , then 𝜕𝑓 (𝒙 ) = {∇𝑓 (𝒙 )}. Similarly, for a function 𝐹 : ℍ𝑛 → ℝ ∪ {+∞}, the Fenchel conjugate is

$F^*(\bm{Y}) = \sup_{\bm{X} \in \mathbb{H}_n} \bigl[ \langle \bm{Y}, \bm{X} \rangle - F(\bm{X}) \bigr].$

The definition of the subdifferential of 𝐹 is analogous with the vector definition.

1.Let 𝑓 : ℝ𝑛 → ℝ be permutation invariant. Explain why the conjugate


𝑓 ∗ : ℝ𝑛 → ℝ ∪ {+∞} is convex and permutation invariant.
2. Let 𝑓 : ℝ𝑛 → ℝ be permutation invariant. Use the von Neumann trace
theorem to show
( 𝑓 ◦ 𝝀) ∗ (𝒀 ) = (𝑓 ∗ ◦ 𝝀)(𝒀 ).
Conclude that if 𝑓 is closed and convex, then 𝑓 ◦ 𝝀 also is closed and
convex.
3. Consider the map $g : \bm{x} \mapsto -\sum_{i=1}^{n} \log x_i$ on $\mathbb{R}_{++}^n$.
1.Argue that 𝑔 is closed and convex. What is its conjugate?
2. What is the spectral function 𝑔 ◦ 𝝀 ? What is its domain? What is its
conjugate? What does Fenchel’s inequality say?
3. Establish the multiplicative form of the Brunn–Minkowski inequality
for positive semidefinite 𝑨, 𝑩 ∈ ℍ𝑛 :

det (𝜏𝑨 + ( 1 − 𝜏)𝑩) ≥ ( det 𝑨)𝜏 ( det 𝑩) 1−𝜏 for 𝜏 ∈ [ 0, 1] .

4. (*) Derive the Minkowski determinant theorem as a consequence of


(iii):
( det (𝑨 + 𝑩)) 1/𝑛 ≥ ( det 𝑨) 1/𝑛 + ( det 𝑩) 1/𝑛 .
4. Let 𝑓 be permutation invariant and convex. Show that 𝒀 ∈ 𝜕( 𝑓 ◦ 𝝀)(𝑿 ) if
and only if 𝝀(𝒀 ) ∈ (𝜕𝑓 ) (𝝀(𝑿 )) and there exists a unitary matrix 𝑼 with
𝑼 ∗ 𝑿 𝑼 = diag (𝝀(𝑿 )) and 𝑼 ∗𝒀 𝑼 = diag (𝝀(𝒀 )) .
5. Let 𝑿 be an Hermitian matrix with an eigenvalue decomposition 𝑿 =
𝑼 diag (𝝀(𝑿 ))𝑼 ∗ . Suppose that 𝑓 is differentiable at 𝝀(𝑿 ) , an interior
point of the domain. Show that

∇(𝑓 ◦ 𝝀) (𝑿 ) = 𝑼 diag ((∇𝑓 )(𝝀(𝑿 )))𝑼 ∗ .

6. Compute the derivative of the map $\frac{1}{2}\|\cdot\|_{\mathrm{F}}^2 = \frac{1}{2} \sum_{i=1}^{n} \lambda_i^2$ on ℍ𝑛 .
7. Compute the subdifferential of the maximum eigenvalue map 𝜆 1 on ℍ𝑛 .
8. Compute the subdifferential of the Schatten 1-norm $\|\bm{A}\|_1 = \sum_{i=1}^{n} |\lambda_i(\bm{A})|$ on ℍ𝑛 .
9. (*) We can use the same method to study smoothness properties of unitarily
invariant norms. Consider the real linear space ℂ𝑛×𝑛 of complex matrices,
equipped with the real inner product h𝑨, 𝑩iRe = Re tr (𝑨 ∗ 𝑩) . Let 𝝈 (𝑩)
denote the vector of singular values of 𝑩 . Let Φ be any symmetric gauge
function. Find an expression for the subdifferential of Φ ◦ 𝝈 . If Φ is
differentiable at 𝝈 (𝑩) , what is its derivative ∇(Φ ◦ 𝝈 )(𝑩) ?
3. Perturbation Theory & Positive Maps

Date: 15 February 2022

This assignment covers principal angles, Sylvester equations, perturbation of eigenspaces,


pinching, positive linear maps, and completely positive maps.

1. Principal angles. Let E and F be subspaces of ℂ𝑛 . Let 𝑷 and 𝑸 be orthogonal


projectors on ℂ𝑛 with ranges E and F. Prove the following statements.

1.Suppose E and F have the same dimension. The singular values of 𝑷 𝑸


are the cosines of the principal angles between E and F. Then the nonzero
singular values of ( I − 𝑷 )𝑸 are the sines of the nonzero principal angles
between E and F.
2. Suppose E and F have the same dimension. Then the nonzero singular
values of 𝑷 − 𝑸 are the nonzero singular values of ( I − 𝑷 )𝑸 , each counted
twice.
3. Show that k𝑷 − 𝑸 k ≤ 1.
4. If k𝑷 − 𝑸 k < 1, then E and F have the same dimension.

2. Davis–Kahan. In this problem, we will develop the proof of the most famous
perturbation theorem for invariant subspaces. Let 𝑨 ∈ 𝕄𝑚 and 𝑩 ∈ 𝕄𝑛 be
normal matrices. Assume that |𝜆𝑖 (𝑨)| ≤ 𝜚 and |𝜆𝑗 (𝑩)| ≥ 𝜚 + 𝛿 for all 𝑖 , 𝑗 , where 𝛿 > 0. Consider
the Sylvester equation 𝑨𝑿 − 𝑿 𝑩 = 𝒀 .

1. Prove that the Sylvester equation has a unique solution.


2. Argue that the solution admits the representation
$\bm{X} = -\sum_{p=0}^{\infty} \bm{A}^p \bm{Y} \bm{B}^{-p-1}.$

Hint: The spectral radius formula states that |𝜆𝑖 (𝑨)| = lim𝑝→∞ k𝑨 𝑝 k 1/𝑝 .
3. For each unitarily invariant norm |||·||| , deduce that the solution satisfies

$|||\bm{X}||| \le \frac{1}{\delta}\, |||\bm{Y}|||.$
4. Now, let 𝑨, 𝑩 be normal matrices of the same dimension. Let S𝑨 and S𝑩 be
subsets of the complex plane, separated by an annulus of width 𝛿 . Consider
the spectral projector 𝑷 𝑨 of 𝑨 onto the subspace spanned by the eigenvalues
listed in S𝑨 , as well as the analog 𝑷 𝑩 for 𝑩 . Prove that

$|||\bm{P}_{\bm{A}} \bm{P}_{\bm{B}}||| \le \frac{1}{\delta}\, |||\bm{A} - \bm{B}|||.$
5. Specialize the last result to the case where 𝑨, 𝑩 are Hermitian. Consider
S𝑨 = [𝑎, 𝑏] and S𝑩 = (−∞, 𝑎 − 𝛿 ] ∪ [𝑏 + 𝛿 , +∞) . Interpret the left-hand
side as a measure of the size of the sines of the principal angles between
range (𝑷 𝑨 ) and range (𝑷 𝑩 ) ⊥ . This is called the sine-theta theorem.

6. What happens when 𝑩 = 𝑨 + 𝑬 for a small perturbation 𝑬 ? Give bounds


on the change to an “isolated” eigenspace of 𝑨 after perturbation. “Isolated”
means that we consider an interval in the spectrum, and assume that the
rest of the spectrum is some distance away.
3. Pinch me. Noncommutative averaging operations play an important role in matrix
analysis. This problem gives an overview of the basic results in this direction.
Consider (conformally partitioned) 2 × 2 block matrices of order 𝑚 + 𝑛 :
  
$\bm{U} = \begin{bmatrix} \mathrm{I}_m & \bm{0} \\ \bm{0} & -\mathrm{I}_n \end{bmatrix} \quad\text{and}\quad \bm{A} = \begin{bmatrix} \bm{A}_{11} & \bm{A}_{12} \\ \bm{A}_{21} & \bm{A}_{22} \end{bmatrix}.$

Define the simple pinching operation


 
$\bm{\Psi}(\bm{A}) = \frac{1}{2}(\bm{A} + \bm{U}^* \bm{A} \bm{U}) = \begin{bmatrix} \bm{A}_{11} & \bm{0} \\ \bm{0} & \bm{A}_{22} \end{bmatrix}.$

1.For each Hermitian 𝑨 , show that 𝝀(𝚿(𝑨)) ≺ 𝝀(𝑨) . Pinching flattens out
eigenvalues.
2. For each square matrix 𝑨 , show that 𝝈 (𝚿(𝑨)) ≺𝑤 𝝈 (𝑨) .
3. Suppose that {𝑷𝑗 } is a family of orthogonal projectors with the property that $\bm{P}_i \bm{P}_j = \delta_{ij} \bm{P}_i$ and $\sum_j \bm{P}_j = \mathrm{I}$. Consider the (general) pinching map

$\bm{\Psi}(\bm{A}) = \sum\nolimits_j \bm{P}_j \bm{A} \bm{P}_j.$

Show that 𝚿 amounts to a block diagonal restriction in an appropriate


basis. Express 𝚿 as a product of simple pinchings. Prove that 𝚿 satisfies
the relations in (a) and (b).
4. Derive Schur’s Theorem for Hermitian matrices, diag 𝑨 ≺ 𝝀(𝑨) , as a
consequence.
5. For positive definite 𝑨 , derive Fischer’s inequality det 𝑨 ≤ det 𝚿(𝑨) as a consequence. Conclude that $\det \bm{A} \le \prod_{i} a_{ii}$, a result called Hadamard’s determinant inequality.
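Not part of the assignment: a numerical illustration of the simple pinching map, the eigenvalue majorization, and Hadamard's determinant inequality on random matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2                                   # block sizes for the simple pinching
A = rng.standard_normal((n + m, n + m)) + 1j * rng.standard_normal((n + m, n + m))
A = (A + A.conj().T) / 2                      # Hermitian test matrix

U = np.diag(np.concatenate([np.ones(n), -np.ones(m)]))
Psi_A = (A + U.conj().T @ A @ U) / 2          # simple pinching: zeroes the off-diagonal blocks

# Eigenvalues of the pinching are majorized by the eigenvalues of A.
lam = np.sort(np.linalg.eigvalsh(A))[::-1]
mu = np.sort(np.linalg.eigvalsh(Psi_A))[::-1]
assert np.isclose(lam.sum(), mu.sum())
assert np.all(np.cumsum(mu) <= np.cumsum(lam) + 1e-10)

# Hadamard's determinant inequality for a positive definite matrix.
B = A @ A.conj().T + np.eye(n + m)
assert np.linalg.det(B).real <= np.prod(np.diag(B).real) + 1e-10
print("Pinching majorization and Hadamard's inequality hold on this example.")
```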
4. Positive linear maps. In this problem, we use ideas from class to explore some more examples and properties of positive maps.
more examples and properties of positive maps.

1. Adapt the proof of Kadison’s Theorem to prove Choi’s Theorem. Let


𝚽 : 𝕄𝑛 → 𝕄𝑘 be a unital, positive linear map. For a normal matrix 𝑨 ,

𝚽(𝑨)∗ 𝚽(𝑨) ≼ 𝚽(𝑨∗𝑨) and 𝚽(𝑨) 𝚽(𝑨)∗ ≼ 𝚽(𝑨∗𝑨).

2. (*) It is not always the case that 𝚽(𝑨) is normal, even if 𝑨 is normal. Use
this observation to construct an example where the two inequalities above
in (a) are different.
3. Let 𝑩 be positive semidefinite. Consider the Schur multiplication operator
𝚽(𝑨) = 𝑨 · 𝑩 . Use the Russo–Dye Theorem to compute k𝚽k .
4. Let 𝑨 be a normal matrix whose eigenvalues are in the (strict) right
half-plane. Consider the Lyapunov equation

𝑨 ∗ 𝑿 + 𝑿 𝑨 = 𝑩.

Extend our discussion about Sylvester equations to find an integral expres-


sion for the solution operator L𝑨 : 𝑩 ↦→ 𝑿 . Use this expression to prove
that L𝑨 is a positive linear map. (*) Apply the Russo–Dye Theorem to find
an expression for the norm of the operator.
5. (*) Let 𝚽 be a strictly positive linear map. Assume that 𝑨 is positive definite
and 𝑯 is Hermitian. Use Kadison’s inequality to show that

𝚽(𝑯 𝑨−1 𝑯 ) ≽ 𝚽(𝑯 ) 𝚽(𝑨)−1 𝚽(𝑯 ).

This is a type of Cauchy–Schwarz inequality. Hint: Use 𝚽 and 𝑨 to construct


a unital, positive linear map.
The adjoint 𝚽∗ of a linear map 𝚽 : 𝕄𝑛 → 𝕄𝑘 is defined as the unique
5. Adjoints.
linear map 𝚽∗ : 𝕄𝑘 → 𝕄𝑛 for which

h𝚽∗ (𝑩), 𝑨i = h𝑩, 𝚽(𝑨)i for all 𝑨 ∈ 𝕄𝑛 and 𝑩 ∈ 𝕄𝑘 .

1.Show that 𝚽 is trace-preserving if and only if 𝚽∗ is unital.



2. Compute the adjoint of 𝚽(𝑨) = 𝑿 𝑨𝑿 .
3. Compute the adjoint of the Schur product map 𝚽(𝑨) = 𝑨 · 𝑩 .
4. Compute the adjoint of the Lyapunov solution operator L𝑨 .

6. Doubly stochastic maps. A linear map on matrices is called doubly stochastic if it is


positive, unital, and trace-preserving.

1. Let 𝑝𝑖 be nonnegative numbers that sum to one, and let 𝑼 𝑖 be unitary.


Show that the following map is doubly stochastic:
$\bm{\Phi}(\bm{A}) = \sum_{i=1}^{N} p_i \bm{U}_i \bm{A} \bm{U}_i^*.$

2. Let 𝑨 and 𝑩 be Hermitian matrices of the same size, and suppose that
𝑨 = 𝚽(𝑩) for a doubly stochastic map 𝚽. Show that
$\bm{A} = \sum_{i=1}^{N} p_i \bm{U}_i \bm{B} \bm{U}_i^*$

where 𝑝𝑖 are nonnegative numbers that sum to one and 𝑼 𝑖 are unitary.
These objects depend on both 𝚽 and 𝑩 . Hints: Use Birkhoff ’s Theorem to
prove the result in case 𝑨 and 𝑩 are diagonal matrices. Then introduce
eigenvalue decompositions of 𝑨 and 𝑩 .
7. *Complete Positivity. Let 𝚽 : 𝕄𝑛 → 𝕄𝑘 be a linear map. Consider the extension 𝚽𝑚 : 𝕄𝑚 (𝕄𝑛 ) → 𝕄𝑚 (𝕄𝑘 ) defined by

$\bm{\Phi}_m : \begin{bmatrix} \bm{A}_{11} & \cdots & \bm{A}_{1m} \\ \vdots & \ddots & \vdots \\ \bm{A}_{m1} & \cdots & \bm{A}_{mm} \end{bmatrix} \longmapsto \begin{bmatrix} \bm{\Phi}(\bm{A}_{11}) & \cdots & \bm{\Phi}(\bm{A}_{1m}) \\ \vdots & \ddots & \vdots \\ \bm{\Phi}(\bm{A}_{m1}) & \cdots & \bm{\Phi}(\bm{A}_{mm}) \end{bmatrix}.$
That is, we obtain 𝚽𝑚 by applying 𝚽 to each block of an 𝑚 × 𝑚 block matrix
whose blocks have size 𝑛 × 𝑛 . Since 𝕄𝑚 (𝕄𝑛 ) is isomorphic to 𝕄𝑚𝑛 , we can
think about 𝚽𝑚 as a linear map on matrices. We say that 𝚽 is completely positive
when 𝚽𝑚 is a positive map for each 𝑚 = 1, 2, 3, . . . . This statement is equivalent
to the condition that 𝚽 ⊗ I𝑚 is a positive linear map for each 𝑚 = 1, 2, 3, . . . .
Completely positive maps play an important role in operator theory.

1. Show that every positive linear functional 𝜑 : 𝕄𝑛 → ℂ takes the form $\varphi(\bm{A}) = \sum_i \bm{u}_i^* \bm{A} \bm{u}_i$ for some vectors 𝒖𝑖 .
2. Show that every positive linear functional is a completely positive map.
3. Show that every linear map 𝚽 : 𝕄𝑛 → 𝕄𝑘 of the form 𝚽(𝑨) = 𝑿 ∗ 𝑨𝑿 is
completely positive.
4. Show that the completely positive maps form a convex cone.

5. Show that every completely positive map has the form $\bm{\Phi}(\bm{A}) = \sum_{i=1}^{nk} \bm{X}_i^* \bm{A} \bm{X}_i$.
Hint: The block matrix [𝑨 𝑖 𝑗 ] ∈ 𝕄𝑛 (𝕄𝑛 ) with blocks 𝑨 𝑖 𝑗 = E𝑖 𝑗 is positive
semidefinite.
6. Consider the conjugate transpose map 𝚽(𝑨) = 𝑨 ∗ defined on 𝕄𝑛 . Is 𝚽
positive? Compute the norm of 𝚽𝑛 . Hint: Consider the same block matrix
from the last part.
7. Let 𝚽 be a positive linear map. For matrices 𝑨, 𝑩 ∈ 𝕄𝑛 , define the
covariance function Cov (𝑨, 𝑩) = 𝚽(𝑨∗𝑩) − 𝚽(𝑨)∗ 𝚽(𝑩) . Define the variance Var (𝑨) = Cov (𝑨, 𝑨) ≽ 0. If 𝚽 is completely positive, show that

$\begin{bmatrix} \operatorname{Var}(\bm{A}) & \operatorname{Cov}(\bm{A}, \bm{B}) \\ \operatorname{Cov}(\bm{A}, \bm{B})^* & \operatorname{Var}(\bm{B}) \end{bmatrix} \succcurlyeq \bm{0}.$
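Not part of the assignment: a small numerical illustration related to the hint in part 5. Applying 𝚽 blockwise to the positive semidefinite block matrix [E𝑖𝑗 ] produces what is commonly called the Choi matrix of 𝚽, and its eigenvalues distinguish the two maps below (a real matrix 𝑿 is used so that transpose and adjoint coincide).

```python
import numpy as np

def choi(phi, n):
    """Block matrix [phi(E_ij)] obtained by applying phi to the blocks E_ij."""
    E = np.eye(n)
    return np.block([[phi(np.outer(E[i], E[j])) for j in range(n)] for i in range(n)])

n = 3
X = np.random.default_rng(3).standard_normal((n, n))
conjugation = lambda A: X.T @ A @ X           # a map of the form A -> X* A X (real X)
transpose = lambda A: A.T                     # the transpose map

print(np.linalg.eigvalsh(choi(conjugation, n)).min())  # ~ 0 up to rounding: consistent with complete positivity
print(np.linalg.eigvalsh(choi(transpose, n)).min())    # strictly negative: the transpose map behaves differently
```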
III. projects

Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

1 Hiai–Kosaki Means . . . . . . . . . . . . . . . . . . . 208

2 The Eigenvector–Eigenvalue Identity . . . . . 217

3 Bipartite Ramanujan Graphs . . . . . . . . . . . . . 227

4 The NC Grothendieck Problem . . . . . . . . . . . 238

5 Algebraic Riccati Equations . . . . . . . . . . . . . 248

6 Hyperbolic Polynomials . . . . . . . . . . . . . . . . 263

7 Matrix Laplace Transform Method . . . . . . . . 274

8 Operator-Valued Kernels . . . . . . . . . . . . . . . 286

9 Spectral Radius and Stability . . . . . . . . . . . . 298


Projects

Each student will complete a project on foundations or applications of matrix analysis.


This project will result in a writeup of about 8–10 pages , with content equivalent to
a 90-minute lecture on the material, prepared in the lecture note template. For the
project, you may select a classic or modern topic in matrix analysis and prepare a new
treatment of the material. The list below may help you identify a topic or papers for
this study. Alternatively, you can describe an application of matrix analysis in your
own field of research or in your own research. There will be some milestones during
the term (project selection, initial summary, etc.) to help keep this work on track. It
would be preferable for each student to choose a slightly different topic.

Numerical linear algebra


Schur complements arise from Gaussian elimination, from partial solution of least-
squares problems, and from conditioning jointly Gaussian random variables. They
enjoy a beautiful and detailed theory. To get started, see the book [Zha05], especially
the chapter by Ando.
The Lanczos iteration is a remarkable algorithm for reducing a positive-definite
matrix to tridiagonal (Jacobi) form. There are deep connections with Gaussian
quadrature, moment matching, and continued fractions. See Golub & Meurant [GM10]
or Liesen & Strakoš [LS13].
The QR iteration computes the eigenvalues of a real symmetric matrix by repeated
QR factorization. This algorithm has a continuous analog, called the Toda flow, which
can be studied using methods for dynamical systems. A first reference is Lax’s algebra
book [Lax07, Chap. 18].
Generalized eigenvalue problems involve equations of the form 𝑨𝒙 = 𝜆𝑩𝒙 . These
problems lead to the theory of matrix pencils. When the matrices are not normal,
generalized Schur factorizations arise. For generalized eigenvalue problems of positive-
definite type, there is a very satisfying theory that connects with the generalized
singular value decomposition. The generalized SVD provides a mechanism for solving
regularized least-squares problems. A first reference is Golub & Van Loan [GVL13].

Eigenvalues and perturbations


Given a matrix, one may ask whether it is possible to narrow down the possible
locations of its eigenvalues by examining its entries. The most famous result of this
form is the Geršgorin disc theorem, but there are many variants and extensions. See
Bhatia [Bha97, Chaps. VI–VIII] or Horn & Johnson [HJ13, Chap. 6] to get started.
Relatedly, one may attempt to use the eigenvalues of the matrix to produce
information about the entries of the eigenvectors. The key result in this area may
be called the eigenvector–eigenvalue identity. It has been repeatedly rediscovered in
various fields of mathematics. For theoretical and historical details, see the recent
paper of Denton et al. [Den+22].

We may also ask how much the eigenvalues or eigenvectors of a matrix change
under perturbations. This course covers some of the most important results, but this is
just the tip of an iceberg. For more information, see Bhatia’s books [Bha97; Bha07a] or
Saad’s book [Saa11b]. The classic reference is Kato [Kat95], which is both wide and
deep.

Matrix nearness problems


A recurring theme in matrix analysis is to characterize the “structured” matrix that
is closest to a specified matrix. Higham’s paper [Hig89] offers a primer on these
problems. See also Dhillon & Tropp [DT07].
Among the classic results, the most famous is the Eckart–Young–Mirsky theo-
rem [EY39; Mir60], which states that the truncated singular value decomposition
provides the best low-rank approximation with respect to each unitarily invariant norm.
Gu proposed a reversed form of the Eckart–Young theorem [Gu15]; his result can be
generalized to all unitarily invariant norms using majorization.
The problem of finding a matrix 𝑿 that minimizes the form k𝑨 − 𝑩 𝑿 𝑪 k F arises
quite often in linear algebra. It is quite easy to solve this problem using differential
calculus. The associated problem for the spectral norm, however, is much harder. The
basic solution is given by Parrott’s theorem [Par78]. This result was generalized and
placed in an appropriate context by Davis et al. [DKW82].

Nonnegative and totally nonnegative matrices


Matrices whose entries are nonnegative arise in a huge number of applications, and
they are closely associated with the theory of Markov chains. There are several books
devoted to the topic of nonnegative matrices [BP94; Sen81; Min88; BR97].
The Perron–Frobenius theory describes the spectral properties of a nonnegative
matrix. There are many sources for this material, including [HJ13, Chap. 8].
Combinatorial graphs provide another rich source of nonnegative matrices. The
subject of spectral graph theory connects the eigenvalues and eigenvectors of the
graph Laplacian with its metric and combinatorial properties. Good references include
Spielman’s notes [Spi18a] and the books [GR01; Chu97].
A totally positive matrix has the property that every square minor has a positive
determinant. These matrices are deeply connected with moment problems in anal-
ysis, and they have a long history in matrix analysis. For a recent reference, see
Pinkus [Pin10].

Matrix scaling and balancing


A very interesting class of problems arises when we consider scaling of a nonnegative
matrix 𝑨 by diagonal matrices 𝑫 1 𝑨𝑫 2 to ensure that the rows and columns have fixed
sums. An iterative algorithm for this problem was invented by Kruithof [Kru37] and
rediscovered by Sinkhorn [Sin64]. Recently, this method has achieved new prominence
because of a connection with optimal transport problems. For example, see Altschuler
et al. [AWR17].
Matrix balancing is another matrix scaling problem that requires us to minimize the
ℓ1 norm of rows by symmetric diagonal scaling. The classic algorithm for this problem
is due to Osborne. Schulman & Sinclair [SS17] provided the first proof that Osborne’s
algorithm succeeds, under appropriate assumptions. Altschuler & Parrilo [AP22] have
developed a streamlined and more general approach. Their balancing algorithms can
also be used as an ingredient in solving min-mean-cycle problems [AP20].
Recall that positive linear maps offer a noncommutative analog of nonnegative
matrices. In particular, positive linear maps that are unital and trace-preserving can be

viewed as generalizations of doubly stochastic matrices. Gurvits [Gur04] has developed


an astonishing generalization of the Sinkhorn algorithm to this noncommutative setting.
See also his joint paper with Garg et al. [Gar+20] for a complete analysis.

Control and dynamical systems


In the study of dynamical systems, we can use matrix theory to characterize when a
discrete or continuous dynamical system is stable. These results are associated with
Lyapunov and Schur. For example, see Horn & Johnson [HJ94, Chap. 2].
Several types of matrix equations (Riccati, Sylvester) also arise naturally from
problems in control theory. For some discussion of these problems, see Horn &
Johnson [HJ94, Chap. 4] or Bhatia [Bha97, Chap. VII].

Matrix functions
What does it mean to apply a (scalar) function to a matrix? For normal matrices, we
can simply apply the function to the eigenvalues, leaving the eigenvectors alone. For
general matrices, we can invoke the Cauchy integral formula. These two approaches
coincide whenever they are both valid. This type of spectral function has a beautiful
and rich theory. Higham’s book [Hig08] develops the foundations, focusing especially
on the numerical aspects. Bhatia’s book [Bha97, Chap. X] discusses perturbation of
matrix functions. There is also a nice series of papers by Lewis on spectral functions,
such as [Lew96], which forms the basis for a homework problem.
Another possible approach is to apply a scalar function to each entry of the matrix.
Although this approach seems naïve, it also has a rich theory that dates back to
Schoenberg’s work [Sch38] on Euclidean distance matrices in the 1930s. See Horn
& Johnson [HJ94, Chap. 6] and Bhatia [Bha07b] for a smattering of results, plus
additional references. The survey of Belton et al. [Bel+18] also covers this topic.

Integral representations
In analysis, it is often the case that an interesting class of functions can be seen as a
convex cone. Within the cone, the extremal functions play a key role because they
frame the boundary of the cone. As a consequence, every function in the cone can
be expressed as an average of the extremal functions. These averages are typically
written as integrals against a probability measure, and they provide a powerful tool
for understanding the behavior of the class of functions. The key results in this area
include the (strong) Krein–Milman theorem and the Choquet theory.
Examples of this principle include the theorem of Bochner on positive-definite
functions, the theorem of Bernstein on completely monotone functions, and the
theorem of Loewner on matrix monotone functions. See Simon’s book [Sim11] for an
introduction to this geometric perspective.
Integral representation theorems are also connected with ideas from complex
analysis (e.g., the Herglotz theorem), with interpolation (e.g., Nevanlinna–Pick), and
with the theory of moments (e.g., Stieltjes). For an introduction to these connections,
see the survey of Berg [Ber08]. For a more comprehensive classical treatment, see the
book [BCR84]. Simon [Sim19] has also written an entire book on Loewner’s theorem.
A surprising and beautiful application of integral representations arises in the
theory of matrix means. Kubo & Ando [KA79] develop a family of matrix means
that are induced by matrix monotone functions. Hiai & Kosaki [HK99] develop a
different type of matrix mean that is induced by a positive-definite function. See also
Bhatia [Bha07b, Chap. 4].

Semidefinite programming
Optimization over the cone of positive-semidefinite matrices plays a central role
in various disciplines, including data science, combinatorial optimization, control
theory, and quantum information. For an introduction to this field, see Boyd &
Vandenberghe [VB96]. For a more in-depth treatment, consider the book of Ben Tal &
Nemirovski [BTN01].
The Hausdorff–Toeplitz theorem asserts that the field of values of a matrix com-
poses a convex set in the complex plane. This is the simplest example of a class
of “hidden convexity” theorems, which describe convexity that appears in surprising
circumstances. Brickman’s theorem is a significant generalization, which states that
pairs of (positive) quadratic forms always induce a convex set of points in the plane.
Barvinok [Bar02, Sec. II.14] describes this result, along with some generalizations. See
also the paper [BT10] of Beck & Teboulle.
The results in the last paragraph assert that certain quadratic optimization problems
and their Lagrangian dual exhibit strong duality. The S-Lemma is a manifestation
of this same idea in control theory. In linear algebra, this result means that we can
compute the maximum eigenvalue of an Hermitian matrix, even though the problem is
not convex. See [BV04, App. B] for this perspective.
Semidefinite programming also plays a key role in algorithms for solving combi-
natorial optimization problems by relaxation and rounding. This idea goes back, at
least, to the work of Lovász [Lov79], and it became widely known after Goemans &
Williamson [GW95] developed their method for solving the maxcut problem. For a
survey with applications in signal processing, see [Luo+10]. Related results appear in
the books [Bar02; BTN01].
In fact, the idea of relaxation and rounding implicitly appears in Grothendieck’s
work on tensor products, which leads to the famous Grothendieck inequality for
matrices. This result was cast in the language of matrices by Lindenstrauss. This is a
major topic in functional analysis, but most of the references are technical. There is
an elementary treatment of Krivine’s proof of Grothendieck’s inequality in Vershynin’s
book [Ver18, Secs. 3.5–3.7]. For some algorithmic work, see [AN06; Tro09].
There is a “noncommutative” extension of the Grothendieck inequality due to
Pisier and Haagerup. For an algorithmic proof of this result, see the paper of Naor et
al. [NRV14]. Closely related optimization problems arise in the theory of quantum
optimal transport [Col+21].

Geometry of positive-definite matrices


The positive-definite matrices, equipped with a metric induced by the geometric
mean, compose a Riemannian manifold with remarkable geometric properties. Bha-
tia [Bha07b, Chap. 6] offers an introduction to this subject. Optimization over this
manifold has become a topic of contemporary interest in machine learning and related
areas. For example, see Sra’s paper [SH15].
Optimal transport distances between Gaussian distributions lead to a different
geometric structure on the class of positive-semidefinite matrices, namely the Bures–
Wasserstein metric. See Bhatia et al. [BJL19] for a study of this metric space. Opti-
mization with respect to this metric plays an important role in applications of optimal
transport; for example, see [Alt+21].

Quantum information theory


In quantum information theory, a positive-semidefinite matrix with trace one is called
a density matrix. It models the state of a (finite-dimensional) quantum system. As

a consequence, there are rich interactions between matrix analysis and quantum
information. See Watrous’s book for an introduction [Wat18].
Because of this type of connection, mathematical physics has indeed been a
major driver of research on matrix analysis. The joint convexity of quantum relative
entropy was derived as part of an effort to understand subadditivity properties of
quantum entropy under tensorization. There is also a remarkable duality between
subadditivity of entropy and (noncommutative) Brascamp–Lieb inequalities. See
Carlen’s notes [Car10] for a survey of these ideas.
The joint convexity of quantum relative entropy implies (in fact, is equivalent to)
a concavity property of the trace exponential [Tro12]. De Huang developed some
striking generalizations of this result using techniques from majorization and complex
interpolation; see [Hua19] and the works cited. These results play a key role in the
theory of matrix concentration.
Inequalities of Golden–Thompson type compare the matrix exponential of a sum
with the product of matrix exponentials. Some results of this type appear in [Bha97,
Chap. IX]. For far reaching generalizations with beautiful proofs, see the paper of
Sutter et al. [SBT17]. These results also have been used to study matrix concentration.

Hyperbolic polynomials
A hyperbolic polynomial 𝑝 : ℝ𝑛 → ℝ has the property that its restrictions 𝑡 ↦→
𝑝 (𝒙 −𝑡 e) to a fixed direction e have only real roots. This turns out to be a fundamental
concept in the geometry of polynomials, which is becoming a topic of increasing interest
in computational mathematics. For a digestible introduction, see the paper [Bau+01].
For a classic analysis reference, see the book of Hörmander [Hör94].
The Newton inequalities are, perhaps, the most elementary result that stems from
the notion of hyperbolicity. They state that the sequence of elementary symmetric
polynomials compose an ultra-log-concave sequence. This result has many striking
generalizations. In matrix analysis, the deepest theorem of this type is Alexandrov’s
inequality for mixed discriminants. See the paper of Shenfeld and Van Handel [SH19]
for an easy proof of the latter result. For some applications of mixed discriminants in
combinatorics, see Barvinok’s book [Bar16].
The geometry of polynomials also stands at the center of some other recent advances
in operator theory, combinatorics, and other areas. Marcus, Spielman, & Srivastava
used these ideas in their construction of bipartite Ramanujan graphs, their sharp
version of the restricted invertibility theorem, and their solution of the Kadison–Singer
problem. See [MSS14] for an introduction to this circle of ideas and references to the
original work. These results all have implications in matrix analysis.
In a related direction, Anari and collaborators have been using log-concave polyno-
mials for remarkable effect in their work on counting bases of matroids and related
problems. This research has implications for Monte Carlo Markov chain methods
for sampling from determinantal point processes. The first paper in the sequence
is [AGV21].

Projects bibliography
[AN06] N. Alon and A. Naor. “Approximating the cut-norm via Grothendieck’s inequality”.
In: SIAM J. Comput. 35.4 (2006), pages 787–803. doi: 10.1137/S0097539704441629.
[AP20] J. Altschuler and P. Parrilo. “Approximating Min-Mean-Cycle for low-diameter
graphs in near-optimal time and memory”. Available at https://arxiv.org/
abs/2004.03114. 2020.

[AP22] J. Altschuler and P. Parrilo. “Near-linear convergence of the Random Osborne


algorithm for Matrix Balancing”. In: Math. Programming (2022). To appear.
[AWR17] J. Altschuler, J. Weed, and P. Rigollet. “Near-linear time approximation algorithms
for optimal transport via Sinkhorn iteration”. In: Advances in Neural Information
Processing Systems 30 (NIPS 2017). 2017.
[Alt+21] J. Altschuler et al. “Averaging on the Bures–Wasserstein manifold: dimension-free
convergence of gradient descent”. In: Advances in Neural Information Processing
Systems 34 (NeurIPS 2021). 2021.
[AGV21] N. Anari, S. O. Gharan, and C. Vinzant. “Log-concave polynomials, I: entropy and
a deterministic approximation algorithm for counting bases of matroids”. In: Duke
Math. J. 170.16 (2021), pages 3459–3504. doi: 10.1215/00127094-2020-0091.
[BR97] R. B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications. Cambridge
University Press, Cambridge, 1997. doi: 10.1017/CBO9780511529979.
[Bar02] A. Barvinok. A course in convexity. American Mathematical Society, Providence, RI,
2002. doi: 10.1090/gsm/054.
[Bar16] A. Barvinok. Combinatorics and complexity of partition functions. Springer, Cham,
2016. doi: 10.1007/978-3-319-51829-9.
[Bau+01] H. H. Bauschke et al. “Hyperbolic polynomials and convex analysis”. In: Canad. J.
Math. 53.3 (2001), pages 470–488. doi: 10.4153/CJM-2001-020-6.
[BT10] A. Beck and M. Teboulle. “On minimizing quadratically constrained ratio of two
quadratic functions”. In: J. Convex Anal. 17.3-4 (2010), pages 789–804.
[Bel+18] A. Belton et al. “A panorama of positivity”. Available at https://arXiv.org/
abs/1812.05482. 2018.
[BTN01] A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. Analysis,
algorithms, and engineering applications. Society for Industrial and Applied
Mathematics (SIAM), 2001. doi: 10.1137/1.9780898718829.
[Ber08] C. Berg. “Stieltjes-Pick-Bernstein-Schoenberg and their connection to complete
monotonicity”. In: Positive Definite Functions: From Schoenberg to Space-Time
Challenges. Castellón de la Plana: University Jaume I, 2008, pages 15–45.
[BCR84] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups. Theory
of positive definite and related functions. Springer-Verlag, New York, 1984. doi:
10.1007/978-1-4612-1128-0.
[BP94] A. Berman and R. J. Plemmons. Nonnegative matrices in the mathematical sciences.
Revised reprint of the 1979 original. Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1994. doi: 10.1137/1.9781611971262.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[BJL19] R. Bhatia, T. Jain, and Y. Lim. “On the Bures-Wasserstein distance between positive
definite matrices”. In: Expo. Math. 37.2 (2019), pages 165–191. doi: 10.1016/j.
exmath.2018.01.002.
[BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press,
Cambridge, 2004. doi: 10.1017/CBO9780511804441.
[Car10] E. Carlen. “Trace inequalities and quantum entropy: an introductory course”.
In: Entropy and the quantum. Volume 529. Contemp. Math. Amer. Math. Soc.,
Providence, RI, 2010, pages 73–140. doi: 10.1090/conm/529/10428.

[Chu97] F. R. K. Chung. Spectral graph theory. American Mathematical Society, 1997.


[Col+21] S. Cole et al. “Quantum optimal transport”. Available at https://arXiv.org/
abs/2105.06922. 2021.
[DKW82] C. Davis, W. M. Kahan, and H. F. Weinberger. “Norm-preserving dilations and
their applications to optimal error bounds”. In: SIAM J. Numer. Anal. 19.3 (1982),
pages 445–469. doi: 10.1137/0719029.
[Den+22] P. B. Denton et al. “Eigenvectors from eigenvalues: a survey of a basic identity in
linear algebra”. In: Bull. Amer. Math. Soc. (N.S.) 59.1 (2022), pages 31–58. doi:
10.1090/bull/1722.
[DT07] I. S. Dhillon and J. A. Tropp. “Matrix nearness problems with Bregman divergences”.
In: SIAM J. Matrix Anal. Appl. 29.4 (2007), pages 1120–1146. doi: 10 . 1137 /
060649021.
[EY39] C. Eckart and G. Young. “A principal axis transformation for non-hermitian
matrices”. In: Bull. Amer. Math. Soc. 45.2 (1939), pages 118–121. doi: 10.1090/
S0002-9904-1939-06910-3.
[Gar+20] A. Garg et al. “Operator scaling: theory and applications”. In: Found. Comput. Math.
20.2 (2020), pages 223–290. doi: 10.1007/s10208-019-09417-z.
[GR01] C. Godsil and G. Royle. Algebraic graph theory. Springer-Verlag, New York, 2001.
doi: 10.1007/978-1-4613-0163-9.
[GW95] M. X. Goemans and D. P. Williamson. “Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming”. In:
J. Assoc. Comput. Mach. 42.6 (1995), pages 1115–1145. doi: 10.1145/227683.
227684.
[GM10] G. H. Golub and G. Meurant. Matrices, moments and quadrature with applications.
Princeton University Press, Princeton, NJ, 2010.
[GVL13] G. H. Golub and C. F. Van Loan. Matrix computations. Fourth. Johns Hopkins
University Press, Baltimore, MD, 2013.
[Gu15] M. Gu. “Subspace iteration randomization and singular value problems”. In: SIAM
J. Sci. Comput. 37.3 (2015), A1139–A1173. doi: 10.1137/130938700.
[Gur04] L. Gurvits. “Classical complexity and quantum entanglement”. In: J. Comput.
System Sci. 69.3 (2004), pages 448–484. doi: 10.1016/j.jcss.2004.06.003.
[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[Hig89] N. J. Higham. “Matrix nearness problems and applications”. In: Applications of
matrix theory (Bradford, 1988). Volume 22. Inst. Math. Appl. Conf. Ser. New Ser.
Oxford Univ. Press, New York, 1989, pages 1–27. doi: 10.1093/imamat/22.1.1.
[Hig08] N. J. Higham. Functions of matrices. Theory and computation. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2008. doi: 10 . 1137 / 1 .
9780898717778.
[Hör94] L. Hörmander. Notions of convexity. Birkhäuser Boston, Inc., 1994.
[HJ94] R. A. Horn and C. R. Johnson. Topics in matrix analysis. Corrected reprint of the
1991 original. Cambridge University Press, Cambridge, 1994.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[Hua19] D. Huang. “Improvement on a Generalized Lieb’s Concavity Theorem”. Available
at https://arXiv.org/abs/1902.02194. 2019.
[Kat95] T. Kato. Perturbation theory for linear operators. Reprint of the 1980 edition.
Springer-Verlag, Berlin, 1995.

[Kru37] R. Kruithof. “Telefoonverkeersrekening”. In: De Ingenieur 52 (1937), pp. E15–E25.


[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
[Lax07] P. D. Lax. Linear algebra and its applications. second. Wiley-Interscience [John
Wiley & Sons], Hoboken, NJ, 2007.
[Lew96] A. S. Lewis. “Derivatives of spectral functions”. In: Math. Oper. Res. 21.3 (1996),
pages 576–588. doi: 10.1287/moor.21.3.576.
[LS13] J. Liesen and Z. Strakoš. Krylov subspace methods. Principles and analysis. Oxford
University Press, Oxford, 2013.
[Lov79] L. Lovász. “On the Shannon capacity of a graph”. In: IEEE Trans. Inform. Theory
25.1 (1979), pages 1–7.
[Luo+10] Z. Q. Luo et al. “Semidefinite relaxation of quadratic optimization problems”. In:
IEEE Signal Process. Mag. 27.3 (2010), pages 20–34.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
[Min88] H. Minc. Nonnegative matrices. A Wiley-Interscience Publication. John Wiley &
Sons, Inc., New York, 1988.
[Mir60] L. Mirsky. “Symmetric gauge functions and unitarily invariant norms”. In: Quart.
J. Math. Oxford Ser. (2) 11 (1960), pages 50–59. doi: 10.1093/qmath/11.1.50.
[NRV14] A. Naor, O. Regev, and T. Vidick. “Efficient rounding for the noncommutative
Grothendieck inequality”. In: Theory Comput. 10 (2014), pages 257–295. doi:
10.4086/toc.2014.v010a011.
[Par78] S. Parrott. “On a quotient norm and the Sz.-Nagy - Foiaş lifting theorem”. In: J.
Functional Analysis 30.3 (1978), pages 311–328. doi: 10.1016/0022-1236(78)
90060-5.
[Pin10] A. Pinkus. Totally positive matrices. Cambridge University Press, Cambridge, 2010.
[Saa11b] Y. Saad. Numerical methods for large eigenvalue problems. Revised edition of the
1992 original [ 1177405]. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2011. doi: 10.1137/1.9781611970739.ch1.
[Sch38] I. J. Schoenberg. “Metric spaces and positive definite functions”. In: Trans. Amer.
Math. Soc. 44.3 (1938), pages 522–536. doi: 10.2307/1989894.
[SS17] L. J. Schulman and A. Sinclair. “Analysis of a Classical Matrix Preconditioning
Algorithm”. In: J. Assoc. Comput. Mach. (JACM) 64.2 (2017), 9:1–9:23.
[Sen81] E. Seneta. Nonnegative matrices and Markov chains. Second. Springer-Verlag, New
York, 1981. doi: 10.1007/0-387-32792-4.
[SH19] Y. Shenfeld and R. van Handel. “Mixed volumes and the Bochner method”. In: Proc.
Amer. Math. Soc. 147.12 (2019), pages 5385–5402. doi: 10.1090/proc/14651.
[Sim11] B. Simon. Convexity. An analytic viewpoint. Cambridge University Press, Cambridge,
2011. doi: 10.1017/CBO9780511910135.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
[Sin64] R. Sinkhorn. “A relationship between arbitrary positive matrices and doubly
stochastic matrices”. In: Ann. Math. Statist. 35 (1964), pages 876–879. doi: 10.
1214/aoms/1177703591.
[Spi18a] D. Spielman. “Yale CPSC 662/AMTH 561 : Spectral Graph Theory”. Available at
http://www.cs.yale.edu/homes/spielman/561/561schedule.html. 2018.

[SH15] S. Sra and R. Hosseini. “Conic geometric optimization on the manifold of positive
definite matrices”. In: SIAM J. Optim. 25.1 (2015), pages 713–739. doi: 10.1137/
140978168.
[SBT17] D. Sutter, M. Berta, and M. Tomamichel. “Multivariate trace inequalities”. In: Comm.
Math. Phys. 352.1 (2017), pages 37–58. doi: 10.1007/s00220-016-2778-5.
[Tro09] J. A. Tropp. “Column subset selection, matrix factorization, and eigenvalue op-
timization”. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on
Discrete Algorithms. SIAM, Philadelphia, PA, 2009, pages 978–986.
[Tro12] J. A. Tropp. “From joint convexity of quantum relative entropy to a concavity
theorem of Lieb”. In: Proc. Amer. Math. Soc. 140.5 (2012), pages 1757–1760. doi:
10.1090/S0002-9939-2011-11141-9.
[VB96] L. Vandenberghe and S. Boyd. “Semidefinite programming”. In: SIAM Rev. 38.1
(1996), pages 49–95. doi: 10.1137/1038003.
[Ver18] R. Vershynin. High-dimensional probability. An introduction with applications in
data science, With a foreword by Sara van de Geer. Cambridge University Press,
Cambridge, 2018. doi: 10.1017/9781108231596.
[Wat18] J. Watrous. The theory of quantum information. Cambridge University Press, 2018.
[Zha05] F. Zhang, editor. The Schur complement and its applications. Springer-Verlag, New
York, 2005. doi: 10.1007/b105056.
1. Means for Matrices and Norm Inequalities

Date: 14 March 2022 Author: Edoardo Calvello

In this project we broadly investigate means of positive matrices and norm inequalities involving them. We begin with the goal of extending the arithmetic–geometric mean inequality $\sqrt{\lambda\nu} \le 2^{-1}(\lambda + \nu)$ for positive reals 𝜆, 𝜈 , to the case of positive semidefinite matrices and unitarily invariant norms. In search of a proof for the matrix arithmetic–geometric mean inequality, we present a simple method introduced by Bhatia [Bha07b]. Yet, a more general analysis of means for matrices due to Hiai and Kosaki [HK99] leads to a unified framework for comparing norms of these matrix means. This perspective thus leads to an alternative proof of the matrix arithmetic–geometric mean inequality and a variety of other notable norm inequalities. In this presentation, primarily based on the material from [Bha07b, Chap. 5] and [HK99], we outline this general analysis and the important consequences that arise.

Agenda:
1. A simple proof for the matrix arithmetic–geometric mean inequality
2. From scalar means to matrix means
3. A unified analysis of means for matrices
4. Norm inequalities

1.1 A simple proof for the matrix arithmetic–geometric mean inequality


Let 𝕄𝑛 (ℂ) denote the space of all 𝑛 × 𝑛 complex matrices and ℍ𝑛+ (ℂ) the space of
𝑛 -dimensional positive semidefinite Hermitian matrices over ℂ.
Our investigation begins with the aim of proving the following theorem.

Theorem 1.1 (Matrix arithmetic–geometric mean inequality). For any 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ)
and any 𝑿 ∈ 𝕄𝑛 (ℂ) it holds that

1
|||𝑯 1/2 𝑿 𝑲 1/2 ||| ≤ |||𝑯 𝑿 + 𝑿 𝑲 ||| (1.1)
2
for any unitarily invariant norm ||| · ||| .

Before we present a unified analysis of means for matrices and derive norm inequalities,
we discuss a simple proof for the matrix arithmetic–geometric mean inequality (1.1).
We can prove Theorem 1.1 by establishing the inequality in the special case 𝑯 = 𝑲 ,
a key insight that appears in Bhatia [Bha07b]. Indeed, it is enough to prove the
apparently weaker inequality

1
|||𝑫 1/2𝒀 𝑫 1/2 ||| ≤ |||𝑫𝒀 + 𝒀 𝑫 |||, (1.2)
2
for any 𝑫 ∈ ℍ𝑛+ (ℂ),𝒀 ∈ 𝕄𝑛 (ℂ) . In fact, by suitably choosing 𝑫 and 𝒀 in inequality
(1.2) as    
𝑯 0 0 𝑿
𝑫B and 𝒀 B ,
0 𝑲 0 0
one can recover the more general inequality (1.1). Before we present a straightforward
proof of Theorem 1.1, we first note an auxiliary lemma.
Project 1: Hiai–Kosaki Means 209

Lemma 1.2 (Unitarily invariant norm of a Schur product). If 𝑨 ∈ ℍ𝑛+ (ℂ) , for any unitarily
invariant norm ||| · ||| on 𝕄𝑛 (ℂ)

|||𝑨 𝑿 ||| ≤ max 𝑎𝑖𝑖 · |||𝑿 |||, for any 𝑿 ∈ 𝕄𝑛 (ℂ) ,


1 ≤𝑖 ≤𝑛

where is the Schur product.

Proof. A proof for this lemma can be found in [AHJ87, p.363]. 


The following proof of Theorem 1.1 is drawn from [Bha07b].

Proof of Theorem 1.1. We consider |||𝑫 1/2𝒀 𝑫 1/2 ||| and note that since the norm in
question is unitarily invariant, the expression reduces to the case where 𝑫 is a diagonal
matrix of the form diag (𝜆 1 , . . . , 𝜆𝑛 ) . It is easily checked that in this case
 
1/2 1/2 𝑫𝒀 + 𝒀 𝑫
𝑫 𝒀𝑫 =𝚲 . (1.3)
2

For each 𝑖 , 𝑗 = 1, . . . , 𝑛 , the (𝑖 , 𝑗 ) -th entry of 𝚲 is given by


√︁
2 𝜆𝑖 𝜆 𝑗
𝚲𝑖 𝑗 = .
𝜆𝑖 + 𝜆 𝑗

The matrix 𝚲 is congruent to the matrix 𝑪 with (𝑖 , 𝑗 ) th entry 𝑐 𝑖 𝑗 B (𝜆𝑖 + 𝜆 𝑗 ) −1 .


√ √ √ √
Indeed note that 𝚲 = 2 · diag ( 𝜆 1 , . . . , 𝜆𝑛 ) · 𝑪 · diag ( 𝜆 1 , . . . , 𝜆𝑛 ). Since (𝑪 )𝑖 𝑗
can be represented as ∫ ∞
𝑐𝑖 𝑗 = e−(𝜆𝑖 +𝜆 𝑗 )𝑡 d𝑡 ,
0

we can conclude that 𝑪 is positive semidefinite. Thus, by Sylvester’s inertia theorem 𝚲


is also positive semidefinite. Furthermore, all the diagonal entries of 𝚲 are 1. Hence,
from Lemma 1.2 we can observe that (1.3) implies (1.2), concluding the proof.


1.2 From scalar means to matrix means


In this section we will first recall our discussion on scalar means. We will then introduce
a notion of matrix mean due to [HK99] that is more general than the one due to
[KA79], which was the object of Lecture 16.

1.2.1 Scalar means


We first recall the definition of a scalar mean.

Definition 1.3 (Scalar mean). A map 𝑀 : ℝ++ × ℝ++ → ℝ++ is a scalar mean if for ℝ++ B ( 0, ∞) .
any 𝜆, 𝜈 > 0 we have the following properties.

1. Strict postivity. It holds that 𝑀 (𝜆, 𝜈) > 0.


2. Ordering. The inequality min{𝜆, 𝜈 } ≤ 𝑀 (𝜆, 𝜈) ≤ max{𝜆, 𝜈 } holds.
3. Symmetry. It holds that 𝑀 (𝜆, 𝜈) = 𝑀 (𝜈, 𝜆) .
4. Monotonicity. The functions 𝜆 ↦→ 𝑀 (𝜆, 𝜈) and 𝜈 ↦→ 𝑀 (𝜆, 𝜈) are increasing.
5. Homogeneity. For any 𝑐 > 0, 𝑀 (𝑐𝜆, 𝑐𝜈) = 𝑐𝑀 (𝜆, 𝜈) for all 𝜆, 𝜈 > 0.
2
6. Continuity. It holds that (𝜆, 𝜈) ↦→ 𝑀 (𝜆, 𝜈) is continuous on ℝ++ .
Project 1: Hiai–Kosaki Means 210

We denote by 𝔐 the set of all such scalar means.

We now recall two propositions that establish a bijection between the class of scalar
means 𝔐 and what we define as representing functions.
Proposition 1.4 (Representing functions for scalar means). Let 𝑀 ∈ 𝔐 . Let the function 𝑓
be defined as 𝑓 (𝑡 ) B 𝑀 (𝑡 , 1) for 𝑡 > 0. Note that 𝑓 ( 0) can be defined by continuity
and hence by taking limits. The function 𝑓 then satisfies the following properties.

For any 𝑡 > 0 it holds that 𝑓 (𝑡 ) > 0.


1. Strict positivity.
2. Normalization. It holds that 𝑓 ( 1) = 1.
3. Symmetry and homogeneity. For any 𝑡 > 0 it holds that 𝑓 (𝑡 ) = 𝑡 · 𝑓 ( 1/𝑡 ) .
4. Monotonicity. The function 𝑓 is increasing in 𝑡 , while 𝑡 ↦→ 𝑓 (𝑡 )/𝑡 is decreasing
in 𝑡 for 𝑡 > 0.
5. Continuity. The function 𝑓 is continuous.

We call such a function 𝑓 a representing function.

Proof. The proof follows from the properties of scalar means. 


Proposition 1.5 (Scalar means from representing functions). Let 𝑓 : ℝ++ → ℝ be a function
satisfying properties (1) through (5) of Proposition 1.4. Let 𝑀 𝑓 be defined by

𝑀 𝑓 (𝜆, 𝜈) = 𝜆 · 𝑓 (𝜆/𝜈) for 𝜆, 𝜈 > 0.

Then 𝑀 𝑓 is a scalar mean and we say that scalar mean 𝑀 𝑓 has representation 𝑓 .

Proof. The proof of this proposition follows from direct application of the properties of
the function 𝑓 . 
Indeed, together Propositions 1.4 and 1.5 establish a bijection between scalar
means and functions with the properties in Proposition 1.4, which we call representing
functions.

1.2.2 A more general notion of means for matrices


We recall that in Lecture 16 we lifted the definition of scalar mean to the case of 𝑛 -
dimensional positive semidefinite Hermitian matrices over ℂ, ℍ𝑛+ (ℂ) , thus introducing
the definition of matrix mean due to [KA79] . Just as for scalar means, we showed
that there exists a bijection between this class of matrix means and a matrix notion of
representing functions.
In contrast to our previous discussion, we outline a different construction of a more
general notion of matrix mean, due to Hiai and Kosaki [HK99], which will allow us to
perform an analysis useful for establishing a wide range of matrix norm inequalities.
We endow 𝕄𝑛 (ℂ) with the trace inner product denoted by h·, ·i . For 𝑀 ∈ 𝔐 and Recall that h𝑿 ,𝒀 i = tr (𝑿𝒀 ∗ ).
𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ) , we consider the mean associated to left multiplication by 𝑯 and right
multiplication by 𝑲 , which we denote by 𝑀 (𝑯 , 𝑲 ) .

Definition 1.6 (Matrix mean; Hiai & Kosaki 1999). Let 𝑀 ∈ 𝔐 and 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ) .
Let 𝑯 and 𝑲 have spectral decompositions 𝑯 = 𝑖𝑛=1 𝜆𝑖 𝑷 𝑖 and 𝑲 = 𝑛𝑗=1 𝜈 𝑗 𝑸 𝑗 ,
Í Í
respectively. We define by the operator 𝑀 (𝑯 , 𝑲 ) : 𝕄𝑛 (ℂ) → 𝕄𝑛 (ℂ) by
∑︁
𝑀 (𝑯 , 𝑲 )𝑿 = 𝑀 (𝜆𝑖 , 𝜈 𝑗 )𝑷 𝑖 𝑿 𝑸 𝑗 , for 𝑿 ∈ 𝕄𝑛 (ℂ) .
𝑖 ,𝑗

This definition leads to the following useful result.


Project 1: Hiai–Kosaki Means 211

Proposition 1.7 (Alternative formulation of the matrix mean). Let 𝑀 ∈ 𝔐 and 𝑯 , 𝑲 ∈


ℍ𝑛+ (ℂ) . Let 𝑯 = 𝑼 diag (𝜆 1 , . . . , 𝜆𝑛 )𝑼 ∗ for unitary 𝑼 be an eigenvalue decomposition.
Similarly, let 𝑲 = 𝑽 diag (𝜈 1 , . . . , 𝜈𝑛 )𝑽 ∗ for unitary 𝑽 . Then

𝑀 (𝑯 , 𝑲 )𝑿 = 𝑼 ( [𝑀 (𝜆𝑖 , 𝜈 𝑗 )] (𝑼 ∗ 𝑿𝑽 ))𝑽 ∗ ,

for [𝑀 (𝜆𝑖 , 𝜈 𝑗 )] the matrix defined (𝑖 , 𝑗 ) entrywise and where denotes the usual
Schur product.
Exercise 1.8 (Alternative formulation of the matrix mean). Starting from Definition 1.6,
provide a proof for Proposition 1.7.

1.3 A unified analysis of means for matrices


We now present a unified framework for comparing norms of matrix means which is
summarized by the following theorem, the formulation and proof of which are due to
[HK99].

Theorem 1.9 (Comparison of norms of matrix means). For any 𝑀 , 𝐿 ∈ 𝔐 the following
are equivalent.

1. There exists a symmetric probability measure 𝜇 on ℝ such that


∫ ∞
𝑀 (𝑯 , 𝑲 )𝑿 = 𝑯 i𝑠 (𝐿 (𝑯 , 𝑲 )𝑿 )𝑲 −i𝑠 d𝜇(𝑠 ),
−∞

for any 𝑛 × 𝑛 matrices 𝑯 , 𝑲 , 𝑿 with 𝑯 , 𝑲  0.


2. For any 𝑛 × 𝑛 matrices 𝑯 , 𝑲 , 𝑿 with 𝑯 , 𝑲 < 0 it holds that

|||𝑀 (𝑯 , 𝑲 )𝑿 ||| ≤ |||𝐿 (𝑯 , 𝑲 )𝑿 |||,

for any unitarily invariant norm ||| · ||| .


3. For any 𝑛 × 𝑛 matrices 𝑯 , 𝑿 with 𝑯 < 0 it holds that

|||𝑀 (𝑯 , 𝑯 )𝑿 ||| ≤ |||𝐿 (𝑯 , 𝑯 )𝑿 |||,

4. The matrix defined entrywise by

𝑀 (𝜆𝑖 , 𝜆 𝑗 )
 

𝐿 (𝜆𝑖 , 𝜆 𝑗 ) 1 ≤𝑖 ,𝑗 ≤𝑛

is positive semidefinite for any 𝜆 1 , . . . , 𝜆𝑛 > 0, for all 𝑛 ∈ ℕ.


5. The function
𝑀 ( e𝑡 , 1)
ℎ (𝑡 ) B
𝐿 ( e𝑡 , 1)
is positive definite on ℝ.

Proof. We prove the equivalence using a step-by-step approach as presented in [HK99].


(1)⇒(2) We can observe that if (1) holds, then we have
∫ ∞
|||𝑀 (𝑯 , 𝑯 )𝑿 ||| ≤ |||𝐿 (𝑯 , 𝑯 )𝑿 ||| |||𝑯 i𝑠 ||| · |||𝑲 −i𝑠 ||| d𝜇(𝑠 )
−∞
≤ |||𝐿 (𝑯 , 𝑯 )𝑿 |||,
Project 1: Hiai–Kosaki Means 212

for any 𝑛 × 𝑛 matrices 𝑯 , 𝑲 , 𝑿 with 𝑯 , 𝑲  0, where the last inequality follows from
𝜇 being a probability measure. To extend to 𝑯 , 𝑲 < 0 it suffices to take the limit of
the inequality for 𝑯 + 𝜖 I and 𝑲 + 𝜖 I for 𝜖 → 0+ .
(2)⇒(3) By taking 𝑲 = 𝑯 , we have that (2) clearly implies (3).
(3)⇒(4) Consider arbitrary 𝜆 1 , . . . , 𝜆𝑛 > 0 and the matrix 𝑨 defined entrywise by
𝑎𝑖 𝑗 = 𝑀 (𝜆𝑖 , 𝜆 𝑗 )/𝐿 (𝜆𝑖 , 𝜆 𝑗 ) for any 1 ≤ 𝑖 , 𝑗 ≤ 𝑛 . By the symmetry property of matrix
means 𝑀 and 𝐿 , the matrix 𝑨 is Hermitian with all diagonal entries equal to 1. Hence,
it also holds that tr (𝑨) = 𝑛 .
Letting 𝑯 = diag (𝜆 1 , . . . , 𝜆𝑛 ) and applying (3), given what we know from Proposi-
tion 1.7, we obtain
k [𝑀 (𝜆𝑖 , 𝜆 𝑗 )] 𝑿 k ≤ k [𝐿 (𝜆𝑖 , 𝜆 𝑗 )] 𝑿k
for each 𝑿 ∈ 𝕄𝑛 (ℂ) . This implies
k𝑨 𝑿 k ≤ k𝑿 k,
for each 𝑿 ∈ 𝕄𝑛 (ℂ) .
Now, considering that h𝑨 𝑿 ,𝒀 i = h𝑿 , 𝑨 𝒀 i , it is possible to show that from
the above it holds that k𝑨 𝑿 k 1 ≤ k𝑿 k 1 . Since this holds for any 𝑿 ∈ 𝕄𝑛 (ℂ) , if we We recall that k · k 1 denotes the trace
take 𝑿 to be the matrix with all entries 1, we obtain the inequality k𝑨 k 1 ≤ 𝑛 . norm so k𝑨 k 1 = tr (𝑨 ∗ 𝑨) 1/2 .
If 𝛼 1 , . . . , 𝛼𝑛 are the real eigenvalues of 𝑨 , to prove (4) it suffices to show that
𝛼 1 , . . . , 𝛼𝑛 are all positive. We observe
∑︁𝑛
|𝛼𝑖 | = k𝑨 k 1 ≤ 𝑛,
𝑖 =1
but 𝑛 = tr (𝑨) = 𝑖𝑛=1 𝛼𝑖 . Hence
Í𝑛
|𝛼𝑖 | ≤ 𝑖𝑛=1 𝛼𝑖 which proves positivity of the
Í Í
𝑖 =1
eigenvalues of 𝑨 and thus (4).
(4)⇒(5) For any 𝑡𝑖 , 𝑡 𝑗 ∈ ℝ, 1 ≤ 𝑖 , 𝑗 ≤ 𝑛 ,

𝑀 ( e𝑡𝑖 −𝑡 𝑗 , 1)
   
𝑀 ( e𝑡𝑖 , e𝑡 𝑗 )
ℎ (𝑡𝑖 − 𝑡 𝑗 ) = = ,
𝐿 ( e𝑡𝑖 −𝑡 𝑗 , 1) 𝐿 ( e𝑡𝑖 , e𝑡 𝑗 )
which by (4) defines a positive semidefinite matrix on ℝ, hence by definition ℎ (𝑡 ) is a
positive definite function.
(5)⇒(1) Since the function ℎ (𝑡 ) B 𝑀 ( e𝑡 , 1)/𝐿 ( e𝑡 , 1) is positive definite on ℝ, by
Bochner’s∫ ∞theorem we have that there exists a probability measure 𝜇 on ℝ such that
ℎ (𝑡 ) = −∞ ei𝑡 𝑠 d𝜇(𝑠 ) for any 𝑡 ∈ ℝ (see Lecture 18). Since by the properties of scalar
means ℎ (𝑡 ) = ℎ (−𝑡 ) , we have that 𝜇 is a symmetric measure. Now for any 𝑯 , 𝑲  0
we compute 𝑀 (𝑯 , 𝑲 )𝑿 . Indeed, using Definition 1.6, we obtain
∑︁
𝑀 (𝑯 , 𝑲 )𝑿 = 𝑀 (𝜆𝑘 , 𝜈𝑙 )𝑷 𝑘 𝑿 𝑸 𝑙
𝑘 ,𝑙
∑︁  
= 𝜈𝑙 𝑀 elog (𝜆𝑘 /𝜈𝑙 ) , 1 𝑷 𝑘 𝑿 𝑸 𝑙
𝑘 ,𝑙
∑︁   ∫ ∞ 
log (𝜆𝑘 /𝜈𝑙 ) i𝑠
= 𝜈𝑙 𝐿 e ,1 (𝜆𝑘 /𝜈𝑙 ) d𝜇(𝑠 ) 𝑷 𝑘 𝑿 𝑸 𝑙
𝑘 ,𝑙 −∞
∫ ∞ ∑︁
= (𝜆𝑘 /𝜈𝑙 ) i𝑠 𝐿 (𝜆𝑘 , 𝜈𝑙 )𝑷 𝑘 𝑿 𝑸 𝑙 d𝜇(𝑠 )
−∞ 𝑘 ,𝑙
∫ ∞
= 𝑯 i𝑠 (𝐿 (𝑯 , 𝑲 )𝑿 )𝑲 −i𝑠 d𝜇(𝑠 ),
−∞
where the second equality follows from the properties of scalar means, the third equality
from using the definition of ℎ and the fourth equality again from the properties of
scalar means. 
Project 1: Hiai–Kosaki Means 213

As pointed out in [HK99], Theorem 1.9 suggests that to obtain norm inequalities
between matrix means it is important to establish positive definiteness of the related
function on ℝ. To this end we note the following proposition before we continue our
discussion.
Proposition 1.10 (Some positive definite functions). The functions of 𝑡 ,

sinh 𝑎𝑡 cosh 𝑎𝑡
and
sinh 𝑏𝑡 cosh 𝑏𝑡
are positive definite whenever 0 ≤ 𝑎 < 𝑏 . The function of 𝑡 ,
𝑡
sinh 𝑡 /2

is also positive definite.

Proof. We refer to [Kos98] and [HK99] for a proof of these facts and in particular to
[Kos98] for a more comprehensive analysis of useful positive definite functions. 

1.4 Norm inequalities


In this section we show how we can apply Theorem 1.9 to obtain various norm
inequalities involving means for matrices, and how these lead, for example, to an
alternative proof for the matrix arithmetic–geometric mean inequality in Theorem 1.1.

Definition 1.11 (𝛼 -scalar mean). For 𝛼 ∈ ℝ and 𝜆, 𝜈 > 0 we define 𝑀 𝛼 (𝜆, 𝜈) by


(
𝛼−1 𝜆𝛼 −𝜈 𝛼
· 𝜆𝛼−1 −𝜈 𝛼−1
for 𝜆 ≠ 𝜈,
𝑀𝛼 (𝜆, 𝜈) = 𝛼
𝜆 for 𝜆 = 𝜈.

Exercise 1.12 (𝛼 -scalar mean is a scalar mean.). Using Definition 1.3, check that the 𝛼 -scalar
mean as defined in Definition 1.11 is indeed a scalar mean.
In particular we have the following means.
Example 1.13 (Some 𝛼 -scalar means). The following are examples of 𝛼 -scalar means.

1. Arithmetic mean For 𝛼 = 2, it holds that

𝜆 +𝜈
𝑀2 (𝜆, 𝜈) = for any 𝜆, 𝜈 > 0.
2
2. Geometric mean For 𝛼 = 1/2, it holds that

𝑀1/2 (𝜆, 𝜈) = 𝜆𝜈 for any 𝜆, 𝜈 > 0.

3. Harmonic mean For 𝛼 = −1, it holds that


2
𝑀−1 (𝜆, 𝜈) = for any 𝜆, 𝜈 > 0.
𝜆−1 + 𝜈 −1
4. Logarithmic mean Taking the limit as 𝛼 → 1, it holds that

𝜆 −𝜈
𝑀1 (𝜆, 𝜈) = . for any 𝜆, 𝜈 > 0
log 𝜆 − log 𝜈
Project 1: Hiai–Kosaki Means 214

5. Zero-scalar mean Taking the limit as 𝛼 → 0, it holds that It is interesting to note that the
zero-scalar mean is the reciprocal of
log 𝜆 − log 𝜈 the logarithmic mean of the
𝑀0 (𝜆, 𝜈) = for any 𝜆, 𝜈 > 0. reciprocals.
𝜈 −1 − 𝜆−1


Exercise 1.14 (Infinity-scalar mean). Characterize the 𝛼 -scalar mean as defined in Definition
1.11 when 𝛼 → ±∞.
The next theorem due to Hiai and Kosaki [HK99] establishes an attractive norm
inequality between matrix means.

Theorem 1.15 (𝛼 -matrix mean norm inequality). If −∞ ≤ 𝛼 < 𝛽 ≤ ∞, then

|||𝑀𝛼 (𝑯 , 𝑲 )𝑿 ||| ≤ |||𝑀 𝛽 (𝑯 , 𝑲 )𝑿 |||,

for any unitarily invariant norm ||| · ||| .

Proof. For 1/2 ≤ 𝛼 < 𝛽 < ∞, by using Definition 1.11 can write

𝑀𝛼 ( e2𝑡 , 1) (𝛼 − 1)𝛽 ( e2𝛼𝑡 − 1)( e2 (𝛽−1)𝑡 − 1)


= ·
𝑀 𝛽 ( e2𝑡 , 1) 𝛼 (𝛽 − 1) ( e2 (𝛼−1)𝑡 − 1)( e2𝛽𝑡 − 1)
(𝛼 − 1)𝛽 ( e𝛼𝑡 − e−𝛼𝑡 )( e (𝛽−1)𝑡 − e−(𝛽−1)𝑡 )
= ·
𝛼 (𝛽 − 1) ( e (𝛼−1)𝑡 − e−(𝛼−1)𝑡 )( e𝛽𝑡 − e−𝛽𝑡 )
(𝛼 − 1)𝛽 sinh (𝛼𝑡 ) sinh ((𝛽 − 1)𝑡 )
= · ,
𝛼 (𝛽 − 1) sinh ((𝛼 − 1)𝑡 ) sinh (𝛽𝑡 )

where we set
𝛼−1 1 sinh ((𝛽 − 1)𝑡 )
= at 𝛼 = 1, and = 𝑡 at 𝛽 = 1.
sinh ((𝛼 − 1)𝑡 ) 𝑡 𝛽 −1
Now, if 1/2 ≤ 𝛼 < 𝛽 ≤ 1, then

𝑀𝛼 ( e2𝑡 , 1) (𝛼 − 1)𝛽 sinh (𝛼𝑡 ) sinh (( 1 − 𝛽)𝑡 )


= · ·
𝑀 𝛽 ( e2𝑡 , 1) 𝛼 (𝛽 − 1) sinh (𝛽𝑡 ) sinh (( 1 − 𝛼)𝑡 )

is positive definite as it is a product of positive definite functions, by Proposition 1.10.


Therefore, by Theorem 1.9, the claim of this theorem holds when 1/2 ≤ 𝛼 < 𝛽 ≤ 1.
We now consider the case 1 < 𝛼 < 𝛽 < ∞. Through some tedious computations,
we observe that
sinh (𝛼𝑡 ) sinh ((𝛽 − 1)𝑡 )
−1
sinh ((𝛼 − 1)𝑡 ) sinh (𝛽𝑡 )
sinh ((𝛼 − 1)𝑡 + 𝑡 ) sinh ((𝛽 − 1)𝑡 ) sinh ((𝛼 − 1)𝑡 ) sinh ((𝛽 − 1)𝑡 + 𝑡 )
= −
sinh ((𝛼 − 1)𝑡 ) sinh (𝛽𝑡 ) sinh ((𝛼 − 1)𝑡 ) sinh (𝛽𝑡 )
 
sinh 𝑡 cosh ((𝛼 − 1)𝑡 ) sinh ((𝛽 − 1)𝑡 ) sinh ((𝛼 − 1)𝑡 ) cosh ((𝛽 − 1)𝑡 )
= · −
sinh (𝛽𝑡 ) sinh ((𝛼 − 1)𝑡 ) sinh ((𝛼 − 1)𝑡 )
sinh 𝑡 sinh ((𝛽 − 𝛼)𝑡 )
= · .
sinh (𝛽𝑡 ) sinh ((𝛼 − 1)𝑡 )

If 1 < 𝛼 < 𝛽 < 2𝛼 − 1 and so 0 < 𝛽 − 𝛼 < 𝛼 − 1, then the above computations
show that by Proposition 1.10, 𝑀 𝛼 ( e2𝑡 , 1)/𝑀 𝛽 ( e2𝑡 , 1) is positive definite. Therefore by
Project 1: Hiai–Kosaki Means 215

Theorem 1.9, the claim of this theorem holds. In the general case of 1 < 𝛼 < 𝛽 < ∞,
it is possible to choose 𝛼 = 𝛼 0 < 𝛼 1 < · · · < 𝛼𝑚 = 𝛽 that satisfies 𝛼𝑘 < 2𝛼𝑘 −1 − 1
for 1 ≤ 𝑘 ≤ 𝑚 , which leads to the conclusion. The special cases 1 = 𝛼 < 𝛽 < ∞
and 1 < 𝛼 < 𝛽 = ∞ can be obtained by taking the limit as 𝛼 → 1 and 𝛽 → ∞
respectively. We refer to [HK99] for the details on why the inequalities are preserved
under taking limits. Furthermore the results can be extended straightforwardly to
the case −∞ ≤ 𝛼 < 𝛽 ≤ 1/2, an argument for which for conciseness we refer to
[HK99]. 
Theorem 1.15 allows us to directly obtain a proof for the matrix arithmetic–geometric
mean inequality from Theorem 1.1.

Proof of Theorem 1.1. For any 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ) and any 𝑿 ∈ 𝕄𝑛 (ℂ) , we have

1
𝑀1/2 (𝑯 , 𝑲 )𝑿 = 𝑯 1/2 𝑿 𝑲 1/2 and 𝑀2 (𝑯 , 𝑲 )𝑿 = (𝑯 𝑿 + 𝑿 𝑲 ). (1.4)
2
Therefore by directly applying Theorem 1.15 we obtain that for any 𝑯 , 𝑲 ∈ ℍ𝑛+ (ℂ)
and any 𝑿 ∈ 𝕄𝑛 (ℂ) ,

1
|||𝑯 1/2 𝑿 𝑲 1/2 ||| ≤ |||𝑯 𝑿 + 𝑿 𝑲 |||,
2
for any unitarily invariant norm ||| · ||| . 
Exercise 1.16 (Proof of equalities in (1.4)). By using Proposition 1.7 show that the equalities
in (1.4) indeed hold.
Together Theorems 1.9 and 1.15 are very powerful and a plethora of important results
follow from these. We refer the reader to [HK99] for a more in depth investigation of
these topics and derivations of further norm inequalities. We also refer the reader to
[ABY20] for an example of an interesting application of these results.

1.5 Conclusions
In this investigation we began with the objective of proving a matrix arithmetic–
geometric mean inequality. This aim led us to systematically constructing a notion
of matrix mean that led to the fundamental Theorem 1.9. This allowed a more
general analysis that makes it possible to derive inequalities between norms of means
for matrices and thus many different norm inequalities. These insights and the
corresponding analysis is due to Hiai and Kosaki [HK99], where the interested reader
can find a further exploration of these topics.

Lecture bibliography
[AHJ87] T. Ando, R. A. Horn, and C. R. Johnson. “The singular values of a Hadamard product:
a basic inequality”. In: Linear and Multilinear Algebra 21.4 (1987), pages 345–365.
eprint: https : / / doi . org / 10 . 1080 / 03081088708817810. doi: 10 . 1080 /
03081088708817810.
[ABY20] R. Aoun, M. Banna, and P. Youssef. “Matrix Poincaré inequalities and concentration”.
In: Advances in Mathematics 371 (2020), page 107251. doi: https://doi.org/10.
1016/j.aim.2020.107251.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
Project 1: Hiai–Kosaki Means 216

[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[Kos98] H. Kosaki. “Arithmetic–geometric mean and related inequalities for operators”. In:
Journal of Functional Analysis 156.2 (1998), pages 429–451.
[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
2. The Eigenvector–Eigenvalue Identity

Date: 7 April 2022 Author: Ruizhi Cao

Eigenvectors and eigenvalues are widely used in simplifying equations, data science,
Agenda:
and are powerful analysis tools. For a 𝑛 × 𝑛 Hermitian matrix 𝑨 ∈ ℍ𝑛 (ℂ) , one of its
1. Cauchy interlacing theorem
eigenpairs (𝜆,𝒗 ) is the solution for 2. First order perturbation
theorem
𝑨𝒗 = 𝜆𝒗 . 3. Eigenvector–eigenvalue
identity
Here, 𝜆 is an eigenvalue of 𝑨 , and 𝒗 is an eigenvector associated with 𝜆. Moreover, 4. Proof of the identity
the eigenvalues of a Hermitian matrix are real, and its unit-norm eigenvectors form 5. Application
an orthonormal basis for ℂ𝑛 . One might ask if there is an intrinsic connection for the
eigenvalues and eigenvectors. And over the past few decades, an identity between
components of eigenvectors for a Hermitian matrix and its eigenvalues is repeatedly
rediscovered in different contexts over many different fields [Den+22]. Although it
is a simple yet extremely important result, it begins to attract boarder interest with
the publication of Wolchover’s article [Wol19]. In this section, we will establish an
identity between eigenvectors and eigenvalues of a Hermitian matrix. We will write the

eigenvalues in the weak increasing order throughout the following sections: 𝜆𝑖 = 𝜆𝑖 .

2.1 Cauchy interlacing theorem


Before we establish the eigenvector-eigenvalue identity for a Hermitian matrix 𝑨 ∈ ℍ𝑛 ,
we will need a theorem on the eigenvalues of a submatrix of 𝑨 . We will first give
the definition of a principal submatrix of a matrix 𝑨 . We will then present the
Cauchy interlacing theorem, which tells one the order of the eigenvalues of 𝑨 and the
eigenvalues of one of its principal submatrices.

Definition 2.1 (Principal submatrix). For a 𝑛 × 𝑛 matrix 𝑨 ∈ ℂ𝑛×𝑛 , let

L = {𝑗 1 , 𝑗 2 , . . . , 𝑗 𝑘 }

be a collection of 𝑘 distinct indices such that

1 ≤ 𝑗 1 < 𝑗 2 < . . . < 𝑗 𝑘 ≤ 𝑛.

A principal submatrix of 𝑨 is a submatrix which is obtained by keeping the entry on


the 𝑖 th row and the 𝑖 0th column for all 𝑖 , 𝑖 0 ∈ L and deleting all other entries.

We know that, based on the conjugation rule on matrix, a principal submatrix of a


positive-semidefinite (psd) matrix is also psd. In other words, the eigenvalues of 𝑨 and Recall the conjugation rule tells us
the eigenvalues of a submatrix of 𝑨 are “correlated”. Here, we will present a stronger that if 𝑨 ∈ 𝕄𝑛 is psd and 𝑿 ∈ ℂ𝑛×𝑘 ,
relation on their eigenvalues. then 𝑿 ∗ 𝑨𝑿 is also psd.
Project 2: The Eigenvector–Eigenvalue Identity 218

Theorem 2.2 (Cauchy interlacing theorem). Suppose 𝑨 ∈ ℍ𝑛 is a Hermitian matrix. Let


𝑩 be a principal 𝑚 × 𝑚 submatrix of 𝑨 . Suppose 𝑨 has eigenvalues 𝜆 1 ≤ . . . ≤ 𝜆𝑛 Note that 𝑩 is also a Hermitian
and 𝑩 has eigenvalues 𝛽 1 ≤ . . . ≤ 𝛽𝑚 , then matrix, that is 𝑩 ∈ ℍ𝑚 .

𝜆𝑘 ≤ 𝛽𝑘 ≤ 𝜆𝑘 +𝑛−𝑚 for 𝑘 = 1, . . . , 𝑚.

If 𝑚 = 𝑛 − 1, then

𝜆 1 ≤ 𝛽 1 ≤ 𝜆 2 ≤ 𝛽 2 ≤ . . . ≤ 𝛽𝑛−1 ≤ 𝜆𝑛 .

Proof. We can use the Courant–Fischer–Weyl minimax principle to prove the theorem
[Bha97]. Without loss, assume

𝑩 𝑪∗

𝑨= ,
𝑪 𝒁

where 𝒁 ∈ ℍ𝑛−𝑚 , 𝑪 ∈ ℂ𝑛−𝑚×𝑚 . Let 𝒙 𝑖 ∈ ℂ𝑛 (1 ≤ 𝑖 ≤ 𝑛 ) be the unit-norm


eigenvector of 𝑨 associated with the ordered eigenvalue 𝜆𝑖 , and 𝒚 𝑗 ∈ ℂ𝑚 (1 ≤ 𝑗 ≤ 𝑚 )
be the unit-norm eigenvector of 𝑩 associated with the ordered eigenvalue 𝛽 𝑗 . Let’s
define the following vector spaces for some 𝑘 ∈ ℕ and 1 ≤ 𝑘 ≤ 𝑚 :

W = span (𝒚 1 , . . . , 𝒚 𝑘 ) ⊆ ℂ𝑚 ,
  
𝒘
M= ∈ ℂ , 𝒘 ∈ W ⊂ ℂ𝑛 .
𝑛
0

Fix 𝒘 ∈ W, we can find an associated vector 𝒘


e ∈ M via
𝒘 
𝒘
e= ∈ M.
0

Thus, we have
e ∗ 𝑨𝒘
𝒘 e = 𝒘 ∗ 𝑩𝒘 and 𝒘

e = 𝒘 ∗𝒘 .
e 𝒘
Using the Courant–Fischer–Weyl minimax principle, the following inequality holds

𝒘 ∗ 𝑩𝒘 𝒘e ∗ 𝑨𝒘e e ∗ 𝑨𝒘
𝒘 e
𝛽𝑘 = max ∗
= max ∗ ≥ min max∗ = 𝜆𝑘 .
𝒘 ∈W 𝒘 𝒘 e ∈M 𝒘
𝒘 e 𝒘 e M0 ⊂ℂ 𝑛 0 e ∈M 𝒘
,dim M =𝑘 𝒘 0 e 𝒘 e
This gives one side of the inequalities in the theorem.
For the other side, we can use the similar technique and apply the minimax principle
to −𝑨 and −𝑩 . Define the following vector spaces

W 0 = span (𝒚 𝑘 , . . . , 𝒚 𝑚 ) ⊆ ℂ𝑚 ,
  0 
𝒘 𝑛 0 0
N= ∈ ℂ ,𝒘 ∈ W ⊂ ℂ𝑛 .
0

We have
e 0 = 𝑚 − 𝑘 + 1.
dim N = dim W
0
Then, for each 𝒘 0 ∈ W 0, we can find 𝒘
e ∈ N by choosing
𝒘 0 
e0 =
𝒘 ∈ N.
0
Project 2: The Eigenvector–Eigenvalue Identity 219

Same as above, we have

e 0) ∗ (−𝑨)𝒘
(𝒘 e 0 = (𝒘 0) ∗ (−𝑩)𝒘 0 and (𝒘
e 0 ) ∗𝒘
e 0 = (𝒘 0) ∗𝒘 0.

Note that we have


↑ ↑
𝜆𝑚−𝑘 +1
(−𝑨) = −𝜆𝑘 +𝑛−𝑚 and 𝜆𝑚−𝑘 +1 (−𝑩) = −𝛽𝑘 .

Then, apply the minimax principle to −𝑨 and −𝑩 , we have

(𝒘 0) ∗ (−𝑩)𝒘 0
−𝛽𝑘 = max
0 0
𝒘 ∈W (𝒘 0) ∗𝒘
0 ∗
(𝒘
e ) (−𝑨)𝒘 e0
= max
e 0 ∈N
𝒘 (𝒘e 0 ) ∗𝒘
e0
e 0) ∗ (−𝑨)𝒘
(𝒘 e0
≥ min0 max
0 0 ∗ 0
N0 ⊂ℂ𝑛 ,dim N =𝑚−𝑘 +1 𝒘
e ∈N0 (𝒘e)𝒘 e
= −𝜆𝑘 +𝑛−𝑚 .

That is
𝜆𝑘 +𝑛−𝑚 ≥ 𝛽𝑘 ,
which gives the desired inequality. 
The Cauchy interlacing theorem allows us to bound the eigenvalues of a principal
submatrix of 𝑨 by the eigenvalues of the original matrix. Here, we derived the Cauchy
interlacing theorem from the Courant–Fischer–Weyl minimax principle. It is also
worthwhile to note that the Cauchy interlacing theorem, the Poincaré inequality and
the Courant–Fischer–Weyl minimax principle have independent proofs and they can
be derived from each other [Bha97].

2.2 First-order perturbation theorem


In this section, we will derive the first-order perturbation theory for a simple eigenvalue
of the Hermitian matrix 𝑨 ∈ ℍ𝑛 (ℂ) . For a simple eigenpair (𝜆, 𝒗 ) of 𝑨 , both the Recall that 𝜆𝑖 (𝑨) is a simple
eigenvalue 𝜆 and the unit-norm eigenvector 𝒗 are continuous with respect to entries eigenvalue of 𝑨 if the multiplicity of
𝜆𝑖 (𝑨) is one.
of 𝑨 . Thus, we might expect that the perturbed eigenpair should be “close” to the
original one. In this section, we will develop a rigorous proof [Saa11a; GLO20; Kat95]
for such intuition. The result will also be used to prove the eigenvector–eigenvalue
identity later.

Theorem 2.3 (First-order perturbation theorem). Assume 𝑨 is Hermitian, and (𝜆, 𝒗 )


is a simple eigenpair of 𝑨 , where 𝒗 is a unit-norm eigenvector. Let 𝑬 ∈ ℍ𝑛 be a
small perturbation and
𝑨 (𝑡 ) B 𝑨 + 𝑡 𝑬
be a family of perturbed matrices. If we write 𝜆(𝑡 ) the eigenvalue of 𝑨 (𝑡 )
associated with 𝒗 (𝑡 ) , where 𝜆( 0) = 𝜆, and 𝒗 ( 0) = 𝒗 . Then, for small 𝑡 , the
perturbed eigenvalue is given by

𝜆(𝑡 ) = 𝜆 + h𝒗 , 𝑬𝒗 i𝑡 + 𝑂 (𝑡 2 ),

where h·, ·i is the conventional Euclidean inner product.


Project 2: The Eigenvector–Eigenvalue Identity 220

Proof. For a Hermitian matrix 𝑨 ∈ ℍ𝑛 (ℂ) , the left and right eigenvectors associated
with the (real) eigenvalue 𝜆 coincide. That is Recall that 𝒗 is a right eigenvector of
𝑨 associated with 𝜆 if 𝑨𝒗 = 𝜆𝒗 .

𝑨 𝒗 = 𝑨𝒗 = 𝜆𝒗 . Similarily, 𝒘 is a left eigenvector
associated with 𝜆 if 𝒘 ∗ 𝑨 = 𝜆𝒘 ∗ .
When 𝑡 is small, the eigenvalue 𝜆(𝑡 ) is analytic with respect to 𝑡 . As 𝜆(𝑡 ) is the
eigenvalue of 𝑨 (𝑡 ) , by definition we have

𝑨 (𝑡 )𝒗 (𝑡 ) = 𝜆(𝑡 )𝒗 (𝑡 ).
Let’s calculate the Euclidean inner product of both sides with respect to 𝒗 , then

h𝒗 , 𝑨 (𝑡 )𝒗 (𝑡 )i = h𝒗 , 𝜆(𝑡 )𝒗 (𝑡 )i.
Write it out more explicitly, we get

h𝒗 , 𝑨 + 𝑡 𝑬 𝒗 (𝑡 )i = 𝜆(𝑡 )h𝒗 , 𝒗 (𝑡 )i.
We calculate the left-hand side (LHS) of the above equation

h𝒗 , 𝑨 + 𝑡 𝑬 𝒗 (𝑡 )i = h𝒗 , 𝑨𝒗 (𝑡 )i + h𝒗 , 𝑡 𝑬𝒗 (𝑡 )i
= h𝑨 ∗𝒗 , 𝒗 (𝑡 )i + 𝑡 h𝒗 , 𝑬𝒗 (𝑡 )i
= 𝜆h𝒗 , 𝒗 (𝑡 )i + 𝑡 h𝒗 , 𝑬𝒗 (𝑡 )i.
So, we have 
𝜆(𝑡 ) − 𝜆 h𝒗 , 𝒗 (𝑡 )i = 𝑡 h𝒗 , 𝑬𝒗 (𝑡 )i.
This is equivalent to
𝜆(𝑡 ) − 𝜆( 0) h𝒗 , 𝑬𝒗 (𝑡 )i
=
𝑡 −0 h𝒗 , 𝒗 (𝑡 )i
for a non-zero 𝑡 . Because (𝜆, 𝒗 ) is a simple eigenpair of 𝑨 , the eigenvector 𝒗 and the
eigenvalue 𝜆 are both continuous with respect to 𝑨 . Thus, 𝒗 (𝑡 ) is continuous with
respect to 𝑡 . Then, we have
𝜆(𝑡 ) − 𝜆( 0) h𝒗 , 𝑬𝒗 (𝑡 )i h𝒗 , 𝑬𝒗 i
𝜆 0 ( 0) = lim = lim = = h𝒗 , 𝑬𝒗 i.
𝑡 →0 𝑡 −0 𝑡 →0 h𝒗 , 𝒗 (𝑡 )i h𝒗 , 𝒗 i
Thus, the Taylor expansion of 𝜆(𝑡 ) about 0 is

𝜆(𝑡 ) = 𝜆( 0) + 𝜆 0 ( 0)(𝑡 − 0) + 𝑂 (𝑡 2 ) = 𝜆 + h𝒗 , 𝑬𝒗 i𝑡 + 𝑂 (𝑡 2 ),
which gives the first-order perturbation theorem. 

2.3 Eigenvector–eigenvalue identity


The Cauchy interlacing theorem establishes inequalities for eigenvalues of a Hermitian
matrix 𝑨 and the eigenvalues of any principal submatrix of 𝑨 . In this section, we will
establish a theorem that enables us to “find” the component of any eigenvector of 𝑨
using the eigenvalues of 𝑨 and eigenvalues of one principle minor of 𝑨 [Den+22].

Theorem 2.4 (Eigenvector–eigenvalue identity). Let 𝑨 ∈ ℍ𝑛 be a 𝑛 × 𝑛 Hermitian


matrix and choose 𝑗 such that 1 ≤ 𝑗 ≤ 𝑛 . Let 𝑴 𝑗 be the 𝑛 − 1 × 𝑛 − 1 principal
submatrix of 𝑨 formed by deleting the 𝑗 th row and column of 𝑨 . We arrange the
eigenvalues of 𝑨 and 𝑴 in the weak increasing order, that is

𝜆𝑖 (𝑨) = 𝜆𝑖↑ (𝑨) for 1 ≤ 𝑖 ≤ 𝑛,


Project 2: The Eigenvector–Eigenvalue Identity 221

and
𝜆𝑖 (𝑴 𝑗 ) = 𝜆𝑖↑ (𝑴 𝑗 ) for 1 ≤ 𝑖 ≤ 𝑛 − 1.
Let 𝒗 1 , . . . , 𝒗 𝑛 be the unit-norm eigenvectors of 𝑨 associated with the ordered
eigenvalues 𝜆 1 (𝑨), . . . , 𝜆𝑛 (𝑨) , respectively. The eigenvector–eigenvalue identity
for 𝑨 is
Ö𝑛  Ö𝑛−1
|𝑣𝑖 𝑗 | 2

𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑨) = 𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑴 𝑗 ) ,
𝑘 =1,𝑘 ≠𝑖 𝑘 =1

where 𝑣𝑖 𝑗 is the 𝑖 th component of 𝒗 𝑗 .

We can also rewrite this identity using characteristic polynomial, as suggested in


[Den+22]. Later, we will provide a proposition which generalizes the identity based
on the same paper.

Definition 2.5 (Characteristic polynomial). The characteristic polynomial of a square


matrix 𝑩 ∈ ℂ𝑛×𝑛 is a function 𝑝𝑴 : ℂ → ℂ defined by
Ö𝑛 
𝑝𝑩 (𝜆) B det (𝜆I𝑛 − 𝑩) = 𝜆 − 𝜆𝑘 (𝑩) .
𝑘 =1

Let 𝑝 𝑨 : ℂ → ℂ be the characteristic polynomial of 𝑨 defined in the above theorem.


By definition, we have
Ö𝑛 
𝑝 𝑨 (𝜆) = det (𝜆I𝑛 − 𝑨) = 𝜆 − 𝜆𝑘 (𝑨) .
𝑘 =1

Similarly, the characteristic polynomial 𝑝𝑴 𝑗 : ℂ → ℂ of 𝑴 𝑗 is


Ö𝑛−1 
𝑝𝑴 𝑗 (𝜆) = det (𝜆I𝑛−1 − 𝑴 𝑗 ) = 𝜆 − 𝜆𝑘 (𝑴 𝑗 ) .
𝑘 =1

Using the chain rule, the derivative of 𝑝 𝑨 (𝜆) is


∑︁𝑛 Ö𝑛
𝑝 𝑨0 (𝜆) =

𝜆 − 𝜆𝑘 (𝑨) .
𝑙 =1 𝑘 =1,𝑘 ≠𝑙

When we evaluate this derivative at 𝜆 = 𝜆𝑖 (𝑨) , we get


 ∑︁𝑛 Ö𝑛
𝑝 𝑨0 𝜆𝑖 (𝑨) =

𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑨)
𝑙 =1 𝑘 =1,𝑘 ≠𝑙
Ö𝑛 
= 𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑨) .
𝑘 =1,𝑘 ≠𝑖

Then, the identity (2.4) is equivalent to

|𝑣𝑖 𝑗 | 2𝑝 𝑨0 𝜆𝑖 (𝑨) = 𝑝𝑴 𝑗 𝜆𝑖 (𝑨) .


 

There is an off-diagonal variant of the eigenvector-eigenvalue identity [Den+22].


Apart from the off-diagonal variant, the eigenvector-eigenvalue identity can be further
generalized to obtain relationship between various minors of 𝑨 and the unitary matrix
𝑸 which diagonalizes 𝑨 (that is 𝑨 = 𝑸 𝚲𝑸 ∗ ). These results can also be found in
[Den+22]. It is also possible to extend this identity to normal matrices and even
diagonalizable matrices.
Proposition 2.6 (Off-diagonal characterization of the eigenvetor–eigenvalue identity). Using
the above notation, and let ( I𝑛 ) 𝑗 0 𝑗 and 𝑴 𝑗 0 𝑗 be the (𝑛 − 1) × (𝑛 − 1) minors formed by
Project 2: The Eigenvector–Eigenvalue Identity 222

deleting the 𝑗 0th row and 𝑗 th column of the identity matrix and matrix 𝑨 , respectively.
Then, we have the following identity
0
Ö𝑛
(−1) 𝑗 +𝑗 det 𝜆𝑖 (𝑨)( I𝑛 ) 𝑗 0 𝑗 − 𝑴 𝑗 0 𝑗 = 𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑨) 𝑣𝑖 𝑗 𝑣𝑖∗𝑗 0
 
𝑘 =1,𝑘 ≠𝑖

for 1 ≤ 𝑗 , 𝑗 0 ≤ 𝑛 .

2.4 Proof of the identity


The logic of this proof follows from [Den+22] while this is a more elaborate and
self-contained version. As the eigenvectors of 𝑨 are generally not continuous with
respect to the entries of the matrix 𝑨 , one needs to be careful when applying the
limiting argument. In the following proof, we are going to establish the identity by
proving it in two cases. In the first part, the case for which 𝜆𝑖 is a repeated eigenvalue
of 𝑨 , we will see it is easy to verify the identity, as both sides become zero. In the
second part, we will focus on the case when 𝜆𝑖 is a simple eigenvalue of 𝑨 .
Let us first consider the case when 𝜆𝑖 (𝑨) is a repeated eigenvalue for some 𝑖 with
1 ≤ 𝑖 ≤ 𝑛 . If the eigenvalue 𝜆𝑖 (𝑨) of 𝑨 occurs with multiplicity greater than one,
without loss, we can then assume that

𝜆𝑖 +1 (𝑨) = 𝜆𝑖 (𝑨).
By the Cauchy interlacing theorem, the inequality

𝜆𝑖 (𝑨) ≤ 𝜆𝑖 (𝑴 𝑗 ) ≤ 𝜆𝑖 +1 (𝑨)
holds as 𝑴 𝑗 ∈ ℍ𝑛−1 is a principal minor of 𝑨 ∈ ℍ𝑛 . Thus, for any 𝑗 , the 𝑖 th smallest
eigenvalue of 𝑴 𝑗 equals 𝜆𝑖 (𝑨) :

𝜆𝑖 (𝑨) − 𝜆𝑖 (𝑴 𝑗 ) = 0.
In this case, the identity is trivial as the left-hand side of the identity (2.4) is
Ö𝑛
LHS = |𝑣𝑖 𝑗 | 2

𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑨)
𝑘 =1,𝑘 ≠𝑖
 Ö𝑛
= |𝑣𝑖 𝑗 | 2 𝜆𝑖 (𝑨) − 𝜆𝑖 +1 (𝑨)

𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑨) = 0,
𝑘 =1,𝑘 ≠𝑖 ,𝑖 +1

and the right-hand side (RHS) of the identity (2.4) is


Ö𝑛−1 
RHS = 𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑴 𝑗 )
𝑘 =1
 Ö𝑛−1 
= 𝜆𝑖 (𝑨) − 𝜆𝑖 (𝑴 𝑗 ) 𝜆𝑖 (𝑨) − 𝜆𝑘 (𝑴 𝑗 ) = 0.
𝑘 =1,𝑘 ≠𝑖

Thus the identity holds for 𝑖 with multiplicity of 𝜆𝑖 (𝑨) greater than one.
In the second half of the proof, we will consider the case when 𝜆𝑖 (𝑨) is a simple
eigenvalue of 𝑨 . As the eigenvalues are continuous with respect to 𝑨 , the eigenvalues
of a principal minor of 𝑨 should also be continuous with respect to 𝑨 . That is, 𝜆𝑖 (𝑨)
and 𝜆𝑖 (𝑴 𝑗 ) are continuous with respect to 𝑨 . Because 𝜆𝑖 (𝑨) is a simple eigenvalue
of 𝑨 , the eigenprojector 𝑷 𝜆𝑖 (𝑨) associated with 𝜆𝑖 (𝑨) is given by

𝑷 𝜆𝑖 (𝑨) = 𝒗 𝑖 𝒗 𝑖∗ .
Thus, the eigenvector 𝒗 𝑖 associated with a simple eigenvalue 𝜆𝑖 (𝑨) is continuous with
respect to 𝑨 . As the left-hand side and the right-hand side consist of products of the
Project 2: The Eigenvector–Eigenvalue Identity 223

eigenvalues and 𝑣𝑖 𝑗 , they are both continuous with respect to 𝑨 . Then, it suffices
to show that the identity holds for 𝑨 with simple eigenvalue 𝜆𝑖 . Let 𝜀 be a small
parameter, and consider the rank one perturbation 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 , where 𝜹 1 , . . . , 𝜹 𝑛 is
the standard basis. The characteristic polynomial of 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 is given by
 
𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) = det 𝜆I𝑛 − 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 .

For the sake of simplicity, we abbreviate the matrices as


𝜀
𝑨 = 𝜆I𝑛 − 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗

e and 𝑨 = 𝜆I𝑛 − 𝑨.
e

𝜀 𝜀
Furthermore, we write 𝑴
f𝑖 ,𝑘 and 𝑴 𝑨 and e
f𝑖 ,𝑘 the submatrix of e 𝑨 obtained by deleting
𝜀
𝑨 and e
the 𝑖 th row and 𝑘 th column of e 𝑨 , respectively. Note that 𝜹 𝑗 𝜹 ∗𝑗 is a matrix with
only the 𝑗 th row and 𝑗 th column is non-zero, so we have

f𝑖𝜀𝑗 = 𝑴
𝑴 f𝑖 𝑗 for any 1 ≤ 𝑖 ≤ 𝑛.

Furthermore, the submatrix 𝑴


f 𝑗 𝑗 can be expressed explicitly using 𝑴 𝑗 . That is,

f 𝑗 𝑗 = 𝜆I𝑛−1 − 𝑴 𝑗 .
𝑴

Use cofactor expansion on the 𝑗 th column, we obtain that


𝜀
𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) − 𝑝 𝑨 (𝜆) = det ( e
𝑨 ) − det ( e𝑨)
∑︁𝑛 ∑︁𝑛
= f𝑘𝜀 𝑗 ) 𝑎˜𝜀 −
(−1) 𝑘 +𝑗 det (𝑴 (−1) 𝑘 +𝑗 det (𝑴
f𝑘 𝑗 ) 𝑎˜𝑘 𝑗 𝑎˜𝑖𝜀,𝑘 and 𝑎˜𝑖 ,𝑘 are entries in the 𝑖 th
𝑘 =1 𝑘𝑗 𝑘 =1 𝜀
𝑗 +𝑗 𝜀 row and 𝑘 th column of e 𝑨 and e 𝑨,
= (−1) f 𝑗 𝑗 ) 𝑎˜𝜀 − (−1) 𝑗 +𝑗 det (𝑴
det ( 𝑴 f 𝑗 𝑗 ) 𝑎˜𝑗 𝑗
𝑗𝑗 respectively.
2𝑗 f 𝑗 𝑗 ) 𝑎˜𝜀 𝑗 +𝑗
= (−1) det ( 𝑴 𝑗𝑗 − (−1) det ( 𝑴
f 𝑗 𝑗 ) 𝑎˜𝑗 𝑗
= det (𝜆I𝑛−1 − 𝑴 𝑗 ) ( 𝑎˜𝑗𝜀𝑗 − 𝑎˜𝑗 𝑗 )
= det (𝜆I𝑛−1 − 𝑴 𝑗 ) (−𝜀)
= −𝜀𝑝𝑴 𝑗 (𝜆).

In the third line of the above calculation, we use (2.4) and the fact that

𝑎˜𝑖𝜀,𝑗 = 𝑎˜𝑖 ,𝑗 for all 𝑖 ≠ 𝑗 .

Thus, we have
𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) = 𝑝 𝑨 (𝜆) − 𝜀𝑝𝑴 𝑗 (𝜆).
On the other hand, based on the first order perturbation theory, the eigenvalue
𝜆𝑖 (𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 ) can be also expanded as

𝜆𝑖 (𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 ) = 𝜆𝑖 (𝑨) + 𝜀 h𝒗 𝑖 , 𝜹 𝑗 𝜹 ∗𝑗 𝒗 𝑖 i + 𝑂 (𝜀 2 )
= 𝜆𝑖 (𝑨) + 𝜀 |𝑣𝑖 𝑗 | 2 + 𝑂 (𝜀 2 ),

where 𝑣𝑖 𝑗 is the 𝑗 th component of the 𝑖 th unit eigenvector 𝒗 𝑖 of 𝑨 . If we evaluate the


charateristic function of 𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 at its eigenvalue, we have the following identity

𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 𝜆𝑖 (𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 ) = 0.

Project 2: The Eigenvector–Eigenvalue Identity 224

We expand 𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) in a Taylor series about 𝜆𝑖 (𝑨) , which gives

𝜹 𝑗 𝜹 ∗ 𝜆𝑖 (𝑨) Δ𝜆 + 𝑂 (Δ𝜆) ,
 0  2

𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) = 𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 𝜆𝑖 (𝑨) + 𝑝 𝑨+𝜀
𝑗

where Δ𝜆 B 𝜆 − 𝜆𝑖 (𝑨) . As we established (2.4), we have the first order approximation

𝜹 𝑗 𝜹 ∗𝑗 𝜆𝑖 (𝑨) · Δ𝜆
 0 
𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) ≈ 𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 𝜆𝑖 (𝑨) + 𝑝 𝑨+𝜀

𝜹 𝑗 𝜹 ∗𝑗 𝜆𝑖 (𝑨) · Δ𝜆
  0 
= 𝑝 𝑨 𝜆𝑖 (𝑨) − 𝜀𝑝𝑴 𝑗 𝜆𝑖 (𝑨) + 𝑝 𝑨+𝜀

𝜹 𝑗 𝜹 ∗𝑗 𝜆𝑖 (𝑨) · Δ𝜆
 0 
= −𝜀𝑝𝑴 𝑗 𝜆𝑖 (𝑨) + 𝑝 𝑨+𝜀
= −𝜀𝑝𝑴 𝑗 𝜆𝑖 (𝑨) + 𝑝 𝑨0 𝜆𝑖 (𝑨) · Δ𝜆 − 𝜀𝑝𝑴 0
𝜆𝑖 (𝑨) · Δ𝜆.
  
𝑗

Note that we use (2.4) again in the last step of the above calculation. Based on (2.4),
the quantity 𝑝 𝑨+𝜀 𝜹 𝑗 𝜹 ∗𝑗 (𝜆) is zero when evaluated at 𝜆 = 𝜆𝑖 (𝑨 + 𝜀 𝜹 𝑗 𝜹 ∗𝑗 ) . Using the
expansion (2.4), we get Δ𝜆 = 𝜀 |𝑣𝑖 𝑗 | 2 + 𝑂 (𝜀 2 ) . Thus, the term 𝜀𝑝𝑴
0 𝜆𝑖 (𝑨) · Δ𝜆 is a

𝑗
second order term with respect to 𝜀 . By matching the linear term in 𝜀 , we get

0 = −𝜀𝑝𝑴 𝑗 𝜆𝑖 (𝑨) + 𝜀 |𝑣𝑖 𝑗 | 2𝑝 𝑨0 𝜆𝑖 (𝑨)


 

holds for any sufficiently small 𝜀 . That is,

|𝑣𝑖 𝑗 | 2𝑝 𝑨0 𝜆𝑖 (𝑨) = 𝑝𝑴 𝑗 𝜆𝑖 (𝑨) ,


 

which gives the desired result.


Thus, we can conclude that identity holds for every eigenvalue 𝜆𝑖 of 𝑨 . 

It is interesting that both sides of the identity (2.4) equal zero when 𝜆𝑖 (𝑨) is
a repeated eigenvalue of 𝑨 . This also matches our intuition that if the eigenspace
associated with 𝑨 has a dimension that is greater than one, there are infinite number of
orthogonal bases for this eigenspace. Thus, it is impossible to pinpoint each component
of the eigenvector.

2.5 Application
The eigenvector-eigenvalue identity allows one to reconstruct the magnitude of each
component of eigenvectors of 𝑨 ∈ ℍ𝑛 from eigenvalues of 𝑨 and its principal minor
𝑴 𝑗 (for some 𝑗 ∈ ℕ, 1 ≤ 𝑗 ≤ 𝑛 ). However, the phase information of each component
is not given in the identity. Nonetheless, proposition 2.6 shows that the relative phase
𝑣𝑖 𝑗 𝑣𝑖∗𝑗 0 can be determined.
As discussed in [Den+22], for a symmetric real matrix, the only ambiguity is the
sign of each component, which may be recovered by direct inspection of the eigenvector
equation 𝑨𝒗 𝑖 = 𝜆𝑖 (𝑨)𝒗 𝑖 . Given the eigenvalues and eigenvectors associated with 𝑨 ,
one can obtain the original matrix 𝑨 via the spectral decomposition.
Here, we are going to briefly show the application of the eigenvector-eigenvalue
identity in neutrino oscillation [DPZ20]. Here, we will drop a lot of physical background
knowledge needed to understand the physics of neutrino, and simply focus on the
mathematical expression that models the transition among different types of neutrinos.
In neutrino oscillation, the Pontecorvo–Maki–Nakagawa–Sakata matrix (PMNS matrix)
describes the “probability” of transformation between mass eigenstates (denoted by
1, 2, 3, which tells the mass of the neutrino, and is the basis for neutrinos propagate
in vacuum) and flavor eigenstates (denoted by 𝑒 , 𝜇, 𝜏 , which tells the species of the
Project 2: The Eigenvector–Eigenvalue Identity 225

neutrino, and is basis for neutrinos propagate in matter). Although it is not clear
whether or not the three-neutrino model is correct, we proceed by assuming the PMNS
matrix 𝑼 PMNS is a 3 × 3 unitary matrix. In this case, the PMNS matrix has the following
decomposition
𝑈𝑒 1 𝑈𝑒 2 𝑈𝑒 3 
 
𝑼 PMNS = 𝑈𝜇1 𝑈𝜇2 𝑈𝜇3 
𝑈𝜏 1 𝑈𝜏 2 𝑈𝜏 3 
 
1   𝑐 13 𝑠 13   𝑐 12 𝑠 12 
 
i𝛿 
 
=  𝑐 23 𝑠 23 e   1  −𝑠12 𝑐 12 ,
   
 −𝑠23 e−i𝛿 𝑐 23  −𝑠 13 𝑐 13   1 
  
where 𝑠𝑖 𝑗 = sin 𝜃𝑖 𝑗 , 𝑐 𝑖 𝑗 = cos 𝜃𝑖 𝑗 , and 𝑈𝛼𝑖 is the probability of 𝑖 becoming 𝛼 for
𝛼 ∈ {𝑒 , 𝜇, 𝜏 } and 𝑖 ∈ {1, 2, 3}. The full solution for entries of the PMNS matrix can be
obtained by solving a hard cubic equation. However, it is possible to obtain the entries
via 𝑈ˆ𝛼𝑖 , for 𝛼 ∈ {𝑒 , 𝜇, 𝜏 } and 𝑖 ∈ {1, 2, 3}, which is component of eigenvectors of the
Hamiltonian in flavor basis.
In matter, the oscillation among these three flavors of neutrino is given by the
following matrix 𝑯 (the Hamiltonian in flavor basis):
𝐻𝑒𝑒 𝐻𝜇𝑒 𝐻𝜏𝑒 
 
𝑯 = 𝐻𝑒𝜇 𝐻𝜇𝜇 𝐻𝜏𝜇 
𝐻𝑒𝜏 𝐻𝜇𝜏 𝐻𝜏𝜏 
 
0
" ! !#
𝑎
1
= 𝑼 PMNS Δ𝑚 21
2
𝑼 ∗PMNS + 0 ,
2𝐸 Δ𝑚 31
2
0

where 𝐸 , Δ𝑚 21 , Δ𝑚 31 , and 𝑎 are constant, 𝐻 𝛽𝛼 denotes the probability of 𝛼 becoming


𝛽 for 𝛼, 𝛽 ∈ {𝑒 , 𝜇, 𝜏 }. It is known that 𝑼ˆ 𝛼 (𝛼 ∈ {𝑒 , 𝜇, 𝜏 }) is the eigenvector of 𝑯 . So,
if the eigenvalues of 𝑯 and its pricipal minors are given, then entries of 𝑼 PMNS can be
calculated via eigenvector-eigenvalue identity. Let the eigenvalues for 𝑯 be

𝜆𝑗
for 𝑗 = 1, 2, 3.
2𝐸
For each principal minor of 𝑯 , let the eigenvalues for the minor obtained by deleting
all columns and rows contain subscript 𝛼 for 𝛼 ∈ {𝑒 , 𝜇, 𝜏 } be

𝜉𝑘𝛼
for 𝑘 = 1, 2.
2𝐸

Then, the eigenvector-eigenvalue identity tells us that the square of 𝑈ˆ𝛼 𝑗 is given by
Î2
𝑘 =1 (𝜆 𝑗 − 𝜉𝑘𝛼 )
|𝑈ˆ𝛼 𝑗 | 2 = Î
𝑖 ≠𝑗 (𝜆 𝑗 − 𝜆𝑖 )

for 𝛼 ∈ {𝑒 , 𝜇, 𝜏 }, and 𝑗 ∈ {1, 2, 3}. As one can find the entries of the PMNS matrix
via 𝑈ˆ𝛼 𝑗 , the PMNS matrix can be obtained by measuring the eigenvalues of 𝑯 and its
principal minors. This is often easier than measuring each entries of the PMNS matrix
itself. Furthermore, if experiments show that the three-neutrino theory is incorrect,
the above formula is ready to be applied for 𝑼 PMNS ∈ ℍ4 with trivial modifications.
Project 2: The Eigenvector–Eigenvalue Identity 226

Lecture bibliography
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[DPZ20] P. B. Denton, S. J. Parke, and X. Zhang. “Neutrino oscillations in matter via
eigenvalues”. In: Phys. Rev. D 101 (2020), page 093001. doi: 10.1103/PhysRevD.
101.093001.
[Den+22] P. B. Denton et al. “Eigenvectors from eigenvalues: a survey of a basic identity in
linear algebra”. In: Bull. Amer. Math. Soc. (N.S.) 59.1 (2022), pages 31–58. doi:
10.1090/bull/1722.
[GLO20] A. Greenbaum, R.-C. Li, and M. L. Overton. “First-Order Perturbation Theory for
Eigenvalues and Eigenvectors”. In: SIAM Review 62.2 (2020), pages 463–482. doi:
10.1137/19M124784X.
[Kat95] T. Kato. Perturbation theory for linear operators. Reprint of the 1980 edition.
Springer-Verlag, Berlin, 1995.
[Saa11a] Y. Saad. Numerical Methods for Large Eigenvalue Problems. 2nd edition. Society for
Industrial and Applied Mathematics, 2011. doi: 10.1137/1.9781611970739.
[Wol19] N. Wolchover. “Neutrinos lead to unexpected discovery in basic math”. In: Quanta
Magazine (2019). url: https://www.quantamagazine.org/neutrinos-lead-
to-unexpected-discovery-in-basic-math-20191113.
3. Bipartite Ramanujan Graphs of Any Degree

Date: 14 March 2022 Author: Nico Christianson

We review the work of Marcus, Spielman, and Srivastava on the existence of infinite
Agenda:
families of bipartite Ramanujan graphs of any degree, closely following the arguments
1. Bipartite Ramanujan graphs
laid forth in the series of works [MSS14; MSS15a; MSS15b]. In particular, we 2. Reductions
present a general perspective that relies on the notion of real stability of multivariate 3. Interlacing polynomials and
polynomials and the resulting implications on interlacing of univariate polynomials. real stability
These techniques were successfully applied by the authors of the aforementioned works 4. Proof of Theorem 3.13
to several other problems, including the Kadison–Singer problem. The results highlight
the power of approaching problems in matrix theory via direct consideration of the
characteristic polynomial.

3.1 Bipartite Ramanujan graphs


In this section, we provide a very brief overview of the graph-theoretic notions that
are needed to define Ramanujan graphs and to motivate the problem of constructing
them. A reader unfamiliar with graph theory would be well-served by reading a more
thorough introduction to basic graph theory, e.g., the first few chapters of Spielman’s
book [Spi19].

3.1.1 Graph-theoretic preliminaries


We first recall some basic definitions from graph theory. A graph 𝐺 = ( V, E) is a set
of vertices V along with a collection of edges E ⊆ V × V /∼ between vertices, where
∼ is the equivalence relation of symmetry. That is, for vertices 𝑢, 𝑣 ∈ V, the ordered
pairs (𝑢, 𝑣 ) and (𝑣 , 𝑢) denote the same undirected edge. We will assume that graphs
contain no self-edges, so that (𝑣 , 𝑣 ) ∉ E for any vertex 𝑣 ∈ V. Throughout, we refer
to the cardinality of the vertex set as 𝑛 B | V | and the cardinality of the edge set as
𝑚 B | E | . As the vertex naming is not important, we will simply consider the vertex
set as V = {1, . . . , 𝑛}. We will only consider connected graphs, wherein each vertex is
connected to every other vertex via some sequence of edges.
Given a graph 𝐺 , its associated adjacency matrix 𝑨 ∈ 𝕄𝑛 is the matrix whose
entries indicate the connection of two vertices via an edge. That is, 𝑨 has entries
(
1 if (𝑖 , 𝑗 ) ∈ E;
𝑎𝑖 𝑗 =
0 otherwise.

Since edges are undirected, the adjacency matrix 𝑨 is symmetric: if (𝑖 , 𝑗 ) ∈ E, then


𝑎𝑖 𝑗 = 𝑎 𝑗 𝑖 = 1. By the spectral theorem, 𝑨 has 𝑛 real eigenvalues. For simplicity, we
will also call these the eigenvalues of the graph 𝐺 .
In this lecture, we pay particularly close attention to the set of 𝑑 -regular bipartite
graphs. A 𝑑 -regular graph is one whose adjacency matrix 𝑨 has the all-ones vector 1 as
an eigenvector, with associated eigenvalue 𝑑 . More tangibly, each vertex in a 𝑑 -regular
graph participates in exactly 𝑑 edges. We call 𝑑 the trivial eigenvalue of a 𝑑 -regular
graph, since any graph in this class has 𝑑 as an eigenvalue.
Project 3: Bipartite Ramanujan Graphs 228

A bipartite graph is one whose vertices V can be partitioned into disjoint subsets
S, T ⊆ V so that any edge bridges S and T. That is, for any edge (𝑢, 𝑣 ) ∈ E, either
𝑢 ∈ S and 𝑣 ∈ T, or else 𝑢 ∈ T and 𝑣 ∈ S. The eigenvalues of a bipartite graph are
symmetric about the origin; we leave this fact as an exercise.
Exercise 3.1 (Bipartite graph spectrum). Show that if 𝐺 = ( V, E) is bipartite, its spectrum
is symmetric about the origin. Hint: Argue that there is a vertex ordering under which
the adjacency matrix of a bipartite graph takes the form
 
0 𝑩
𝑩∗ 0

for some matrix 𝑩 .


It follows from Exercise 3.1 that a 𝑑 -regular bipartite graph has an eigenvalue of
−𝑑 in addition to the eigenvalue 𝑑 . We call both of these the trivial eigenvalues of a
𝑑 -regular bipartite graph.
We conclude by defining an infinite family of graphs having a particular property.

Definition 3.2 (Infinite family). An infinite family of graphs with property 𝑃 is an


infinite sequence of graphs (𝐺𝑖 )𝑖∞=1 such that:

• Each graph 𝐺𝑖 has property 𝑃 , and


• Each graph 𝐺𝑖 +1 has strictly more vertices than its predecessor 𝐺𝑖 .

3.1.2 Ramanujan graphs


In theoretical computer science and mathematics, there is great interest in the class
of expander graphs, which are those whose nontrivial eigenvalues are small. The
interested reader can refer to the classic survey of Hoory, Linial, and Widgerson
[HLW06] or the more recent book of Kowalski [Kow19] for a more thorough overview
of expanders and their many applications. Of particular note within the broader family
of expanders are Ramanujan graphs, which are defined as follows.

√𝐺 is called Ramanujan
Definition 3.3 (Ramanujan graph). A connected 𝑑 -regular graph
if all of its nontrivial eigenvalues have magnitude at most 2 𝑑 − 1. In particular, a
bipartite 𝑑 -regular graph 𝐺 is Ramanujan
√ if all of its eigenvalues, aside from ±𝑑 ,
are bounded in magnitude by 2 𝑑 − 1.

Ramanujan graphs are notable in that they are the ideal expander graphs. That is,
no infinite family of graphs can have (asymptotically) smaller nontrivial eigenvalues. A
lower bound due to Alon and Boppana [Alo86; Nil91] establishes that for any 𝜀 > 0
and any infinite family (𝐺𝑖 )𝑖∞=1 of 𝑑 -regular graphs, there exists some 𝑁 ∈ ℕ for which

𝐺𝑖 has a nontrivial eigenvalue of magnitude at least 2 𝑑 − 1 − 𝜀 when 𝑖 ≥ 𝑁 .
For fixed 𝑑 , it is easy to construct 𝑑 -regular Ramanujan graphs with few vertices,
as demonstrated in the following example.
Example 3.4 (Complete bipartite graph). Consider the bipartite graph 𝐾 𝑑,𝑑 with vertex set
V = S ∪ T, where S = {1, . . . , 𝑑 } and T = {𝑑 + 1, . . . , 2𝑑 }, each vertex 𝑠 ∈ S has an
edge to every vertex in T, and these are the only edges. This is known as a complete
bipartite graph. Its adjacency matrix 𝑨 is of the form
 
0 1
𝑨= ,
1 0
Project 3: Bipartite Ramanujan Graphs 229

where 1 denotes the 𝑑 × 𝑑 matrix of ones. Since 𝑨 has rank 2, it follows that all of the
nontrivial eigenvalues of 𝐾𝑑,𝑑 are zero. Thus 𝐾𝑑,𝑑 is Ramanujan. 

In general, it is not known whether there exist 𝑑 -regular Ramanujan graphs with
arbitrarily many vertices. Our task in this lecture will be to show that this is the case
when we restrict our focus to bipartite Ramanujan graphs. Specifically, we prove the
following result of Marcus, Spielman, and Srivastava, following the works [MSS15a;
MSS15b].

Theorem 3.5 (Marcus, Spielman, and Srivastava 2015). For every 𝑑 ≥ 3, there is an
infinite family of 𝑑 -regular bipartite Ramanujan graphs.

Our treatment of this result draws from the papers [MSS15a; MSS15b], as well as
the survey [MSS14] and Spielman’s course notes [Spi18b].

3.2 Reductions
In this section, we present several reductions of Theorem 3.5. We begin by reducing the
existence of infinite families of 𝑑 -regular bipartite Ramanujan graphs to the existence
of a signed adjacency matrix with a certain eigenvalue bound. Subsequently, we shift
our focus to the characteristic polynomial of this matrix, and we show that we can
equivalently phrase the problem as a bound relating the root of a random characteristic
polynomial with the root of its expectation. In the next section, these reductions enable
us to deploy techniques from [MSS15b] on interlacing families of polynomials.

3.2.1 2-lifts
To prove the existence of infinite families of bipartite Ramanujan graphs, it would
be very useful if, given any bipartite Ramanujan graph on 𝑛 vertices, we could use it
to construct another Ramanujan graph with more than 𝑛 vertices. To this end, we
introduce the following definitions.

Definition 3.6 (Signing). Let 𝐺 be a graph with adjacency matrix 𝑨 ∈ 𝕄𝑛 . A signing


𝑺 ∈ 𝕄𝑛 of 𝐺 is a matrix of the form Recall that the indicator vector
∑︁ 𝜹 𝑖 ∈ ℝ𝑛 is 1 in its 𝑖 th entry, and 0
everywhere else.
𝜎 (𝑖 ,𝑗 ) 𝜹 𝑖 𝜹 𝑗 ∗ + 𝜹 𝑗 𝜹 𝑖 ∗ ,

𝑺= (3.1)
(𝑖 ,𝑗 ) ∈E

where the sign vector 𝝈 ∈ {±1}𝑚 is indexed by edges. That is, 𝑺 has the same
nonzero pattern as the adjacency matrix 𝑨 , but both entries corresponding to an
edge (𝑖 , 𝑗 ) are given a sign 𝜎 (𝑖 ,𝑗 ) ∈ {±1}.

A signing of a graph is symmetric, and thus its eigenvalues are real.

Definition 3.7 (2-lift). Let 𝑺 ∈ 𝕄𝑛 be a signing of a graph 𝐺 = ( V, E) . We define the


2-lift 𝐺 𝑺 of 𝐺 associated with the signing 𝑺 as follows.

• For every vertex 𝑣 ∈ V, the 2-lift 𝐺 𝑺 has two vertices, which we denote 𝑣 0 , 𝑣 1 .
• If 𝑠𝑢𝑣 = 1, then 𝐺 𝑺 has the two edges (𝑢 0 , 𝑣 0 ) and (𝑢 1 , 𝑣 1 ) .
• If 𝑠𝑢𝑣 = −1, then 𝐺 𝑺 has the two edges (𝑢 0 , 𝑣 1 ) and (𝑢 1 , 𝑣 0 ) .

2-lifts give a means of constructing larger graphs out of smaller graphs in a manner
that preserves certain desirable properties. In particular, any 2-lift of a bipartite,
Project 3: Bipartite Ramanujan Graphs 230

𝑑 -regular graph is bipartite and 𝑑 -regular, and the spectrum of a 2-lift 𝐺 𝑺 depends
only on the spectra of 𝐺 and 𝑺 . We leave these facts as exercises.
Exercise 3.8 (2-lift: Preservation of properties). Show that the following properties are
preserved by 2-lifts.

• Bipartiteness. Any 2-lift of a bipartite graph is bipartite.


• 𝑑 -regularity. Any 2-lift of a 𝑑 -regular graph is 𝑑 -regular.

Exercise 3.9 (2-lift: Spectrum). Let 𝑺 be a signing of a graph 𝐺 . Show that the eigenvalues
of 𝐺 𝑺 are exactly the union of the eigenvalues of 𝐺 and the eigenvalues of 𝑺 .
These exercises imply
√ that, if 𝐺 is a 𝑑 -regular Ramanujan graph and it has a signing
𝑺 satisfying k𝑺 k ≤ 2 𝑑 − 1, then 𝐺 𝑺 is also Ramanujan.
√ Thus, if it were the case that
any 𝑑 -regular graph had a signing 𝑺 with k𝑺 k ≤ 2 𝑑 − 1, this would immediately
give a means for constructing infinite families of 𝑑 -regular Ramanujan graphs. It
was conjectured by Bilu and Linial [BL06] that such a signing can always be found;
however, we will prove the following weaker result from [MSS15a].

Theorem 3.10 (Existence of signing with spectrum lower bound). Any 𝑑√


-regular graph 𝐺
has a signing 𝑺 whose minimum eigenvalue 𝜆𝑛 (𝑺 ) is at least −2 𝑑 − 1.

Since the eigenvalues of a bipartite graph are symmetric about the origin, Theo-
rem 3.10 immediately implies (via the preceding argument) the existence of infinite
families of 𝑑 -regular bipartite Ramanujan graphs.

3.2.2 The expected characteristic polynomial


To prove Theorem 3.10, instead of reasoning about the eigenvalues of a signing 𝑺
directly, we will instead consider the roots of the characteristic polynomial of a random
signing. In the following, we use 𝜆𝑘 (·) to refer both to the 𝑘 th largest eigenvalue of a
matrix as well as the 𝑘 th largest root of a (real-rooted) polynomial. Recall that the
characteristic polynomial of a square matrix 𝑴 ∈ 𝕄𝑛 is the univariate polynomial
defined as
𝜒(𝑴 ; 𝑡 ) B det (𝑡 I − 𝑴 )
whose roots are exactly the eigenvalues of 𝑴 . We refer to the polynomial as 𝜒(𝑴 ) ,
and we use the notation 𝜒(𝑴 ; 𝑡 ) either when we wish to distinguish the name of
its variable 𝑡 , or else to refer to its value at some particular choice of 𝑡 . Consider a
random signing 𝑺 of a graph 𝐺 as in (3.1), where the signs are uniformly distributed:
𝝈 ∼ uniform{±1}𝑚 . Then Theorem 3.10 is equivalent to the statement that there is
strictly positive probability of the smallest root of 𝜒(𝑺 ) having lower bound

𝜆𝑛 (𝜒(𝑺 )) ≥ −2 𝑑 − 1.

Notably, the expectation of the characteristic polynomial 𝜒(𝑺 ) under a uniformly


random signing 𝑺 is equal to the matching polynomial of the graph 𝐺 , which is a
generator function of matchings in a graph; the interested reader can refer to [GG78]
for an introduction and proof of this fact. Moreover, it was shown by Heilmann and
Lieb [HL72]√that the largest root of the matching polynomial of a graph is bounded
above by 2 𝑑 − 1. Thus to prove Theorem 3.10, it suffices to prove the following
result.
Project 3: Bipartite Ramanujan Graphs 231

Theorem 3.11 (Roots: Random signing vs. expected characteristic polynomial). Suppose
𝐺 is a 𝑑 -regular graph and 𝑺 is a uniformly random signing. Then, with strictly
positive probability,
𝜆𝑛 (𝜒(𝑺 )) ≥ −𝜆 1 (𝔼 [𝜒(𝑺 )]) .

3.2.3 Sums of rank-one matrices


The definition of graph signings in (3.1) expresses a signing as a sum of rank-two
matrices. It will be convenient for us to instead frame the problem in terms of rank-one
matrices. To achieve this, we take inspiration from the graph Laplacian and consider The graph Laplacian of a 𝑑 -regular
Laplacianized signings graph 𝐺 with adjacency matrix 𝑨 is
the matrix 𝑳 𝑨 B 𝑑 I − 𝑨 . The
𝑳𝑺 B 𝑑 I − 𝑺 Laplacian is a central object of study
of a 𝑑 -regular graph 𝐺 with signing 𝑺 . Expanding the definition (3.1), we can write in the field of spectral graph theory.

this Laplacianized signing as


∑︁ ∑︁
𝜹𝑖 𝜹𝑖∗ + 𝜎 (𝑖 ,𝑗 ) 𝜹 𝑖 𝜹 𝑗 ∗ + 𝜹 𝑗 𝜹 𝑖 ∗

𝑳𝑺 = 𝑑
𝑖 ∈V (𝑖 ,𝑗 ) ∈E
∑︁
= (𝜹 𝑖 − 𝜎 (𝑖 ,𝑗 ) 𝜹 𝑗 )(𝜹 𝑖 − 𝜎 (𝑖 ,𝑗 ) 𝜹 𝑗 ) ∗ ,
(𝑖 ,𝑗 ) ∈E

which is a sum of rank-one matrices. We introduce the vectors 𝒗 𝑒 = 𝜹 𝑖 − 𝜎 (𝑖 ,𝑗 ) 𝜹 𝑗


indexed by edges 𝑒 = (𝑖 , 𝑗 ) ∈ E; Theorem 3.11 can then be reformulated equivalently
as follows.
Theorem 3.12 (Equivalent reformulation of Theorem 3.11). Suppose 𝐺 is a 𝑑 -regular
graph and 𝑺 is a uniformly random signing with decomposition as in (3.1). Let
𝒗 𝑒 = 𝜹 𝑖 − 𝜎 (𝑖 ,𝑗 ) 𝜹 𝑗 for each 𝑒 = (𝑖 , 𝑗 ) ∈ E. Then, with strictly positive probability,
 ∑︁   h ∑︁ i 
𝜆1 𝜒 𝒗 𝑒 𝒗 𝑒 ∗ ≤ 𝜆1 𝔼 𝜒 𝒗𝑒𝒗𝑒 ∗ . (3.2)
𝑒 ∈E 𝑒 ∈E

Proof that Theorem 3.12 implies Theorem 3.11. Suppose that the theorem’s conclusion (3.2) We only prove that Theorem 3.12
holds. Since 𝑑 I − 𝑺 = 𝑒 ∈E 𝒗 𝑒 𝒗 𝑒 ∗ , it follows that implies Theorem 3.11, as this is
Í
sufficient to establish the reduction.
However, the reverse implication can
𝑑 − 𝜆𝑛 (𝑺 ) = 𝜆 1 (𝑑 I − 𝑺 ) be seen by a nearly identical
≤ 𝜆 1 (𝔼 [𝜒 (𝑑 I − 𝑺 )]) argument.

= 𝑑 + 𝜆 1 (𝔼 [𝜒 (−𝑺 )]) (3.3)


= 𝑑 + 𝜆 1 (𝔼 [𝜒 (𝑺 )]) . (3.4)

The equality (3.3) holds because the characteristic polynomial 𝜒 (𝑑 I − 𝑺 ; 𝑡 ) is a hori-


zontal translation (by 𝑑 ) of the characteristic polynomial 𝜒 (−𝑺 ; 𝑡 ) . The equality (3.4)
follows by symmetry of the uniform distribution over signings. Subtracting 𝑑 and
negating gives the result of Theorem 3.11. 
Rather than prove Theorem 3.12, we will instead prove the following more general
result in the case of an arbitrary sum of independent rank-one matrices. Theorem 3.12
then clearly follows as an immediate corollary.

Theorem 3.13 (Roots: Independent rank-one sum vs. expected char. poly.). Suppose that
𝒗 1 , . . . , 𝒗 𝑚 ∈ ℂ𝑛 are independent random vectors, each with finite support. Then,
Project 3: Bipartite Ramanujan Graphs 232

with strictly positive probability,


 ∑︁𝑚   h ∑︁𝑚 i 
𝜆1 𝜒 𝒗 𝑖 𝒗 𝑖 ∗ ≤ 𝜆1 𝔼 𝜒 𝒗𝑖𝒗𝑖 ∗ .
𝑖 =1 𝑖 =1

Theorem 3.13 is significantly more general than Theorem 3.12: in particular, it


relates the largest root of any sum of (finitely-supported) independent rank-one
matrices to the largest root of the expected characteristic polynomial of the sum.
This more general result has numerous implications beyond the existence of bipartite
Ramanujan graphs. Theorem 3.13 was established by Marcus, Spielman, and Srivastava
in [MSS15b] to solve the Kadison–Singer problem, and it was used by the same authors
in [MSS21] to obtain sharpened versions of the restricted invertibility principle. We
focus on this more general setting to highlight the theoretical techniques unifying these
results.

3.3 Interlacing polynomials and real stability


Our strategy for proving Theorem 3.13, following [MSS15b], will be to show that the
family comprised of the characteristic polynomials of all realizations of the random
matrix 𝑖𝑚=1 𝒗 𝑖 𝒗 𝑖 ∗ has a common interlacing. This property, which we will soon define,
Í
allows us to compare the roots of the average of polynomials with the roots of the
constituents of the average. The argument crucially relies upon the notion of real
stability of polynomials, which generalizes real-rootedness beyond the univariate
setting. In this section, we will introduce the necessary background on polynomial
interlacing and real stability that will be used to prove Theorem 3.13 in the subsequent
section.

3.3.1 Interlacing polynomials


We begin by defining the notion of interlacing of two polynomials.

Definition 3.14 (Interlacing polynomials). Let 𝑓 be a polynomial of degree 𝑛 with


real roots (𝜆𝑖 )𝑖𝑛=1 , and let 𝑔 be a polynomial of degree 𝑛 or 𝑛 − 1 with real roots
(𝜇𝑖 )𝑖𝑛=or 𝑛−1 . The polynomial 𝑔 is said to interlace 𝑓 if their roots are alternating
1
and those of 𝑓 are larger; i.e.,

𝜇𝑛 ≤ 𝜆𝑛 ≤ 𝜇𝑛−1 ≤ · · · ≤ 𝜇1 ≤ 𝜆 1 ,

with the first inequality disregarded when 𝑔 is of degree 𝑛 − 1.


Figure 3.1 An example of inter-
lacing polynomials plotted in
We illustrate an example of a polynomial 𝑔 interlacing another polynomial 𝑓 in
Desmos. Here, the red polynomial
Figure 3.1. Next, we define what it means for a family of polynomials to have a common 𝑔 (𝑥) = (𝑥 − 2) (𝑥 − 4) (𝑥 − 6)
interlacing. interlaces the blue polynomial
𝑓 (𝑥) = (𝑥 − 3) (𝑥 − 5) (𝑥 − 8) , since
Definition 3.15 (Common interlacing). Let 𝑓 1 , . . . , 𝑓𝑚 be a family of real-rooted polyno- their roots are alternating and the
roots of 𝑓 are larger.
mials, each with degree 𝑛 . If there is a polynomial 𝑔 that interlaces each 𝑓𝑖 , then
the family is said to have a common interlacing.

The existence of a common interlacing is in fact a pairwise property of a family of


polynomials; we leave the proof of this fact as an exercise.
Exercise 3.16 (Pairwise common interlacing implies common interlacing of family). Let 𝑓 1 , . . . , 𝑓𝑚
be a family of real-rooted, degree-𝑛 polynomials; and suppose that, for any indices
Project 3: Bipartite Ramanujan Graphs 233

𝑖 , 𝑗 ∈ {1, . . . , 𝑚}, the polynomials 𝑓𝑖 and 𝑓 𝑗 have a common interlacing. Show that
𝑓1 , . . . , 𝑓𝑚 have a common interlacing. Hint: This is an easy consequence of the
combinatorial Helly theorem on the real line, which states that the intersection of a
finite collection of non-pairwise-disjoint intervals in ℝ is nonempty; for example, see
[Pak10, Chapter 1].
If a family of polynomials has a common interlacing, then we can relate each root
of the average with the corresponding root of a member of the family. We provide the
following lemma without proof, though we recommend that the interested reader give
it some thought; its source and proof can be found in [MSS15a, Lemma 4.2].
Lemma 3.17 (Roots of family with common interlacing). Let 𝑓 1 , . . . , 𝑓𝑚 be a family of This existential result actually
real-rooted, degree-𝑛 polynomials with positive leading coefficients, and let 𝜇 be a extends beyond the leading root to all
other roots 𝜆𝑖 , 𝑖 ∈ {2, . . . , 𝑛 }; see
probability distribution on {1, . . . , 𝑚}. If 𝑓 1 , . . . , 𝑓𝑚 have a common interlacing, then [MSS14, Theorem 2.2]. For our
there exists some 𝑗 ∈ {1, . . . , 𝑚} for which purposes, we only need the stated
 result involving the leading root 𝜆 1 .
𝜆 1 ( 𝑓 𝑗 ) ≤ 𝜆 1 𝔼𝑘 ∼𝜇 𝑓𝑘 .
It is evident that Lemma 3.17 reduces the proof of Theorem 3.13 to the problem of
∗ have a common interlacing
Í𝑚 
showing that the characteristic polynomials 𝜒 𝑖 =1 𝒗 𝑖 𝒗 𝑖
(when considered as a family of polynomials under realizations of the random vectors
𝒗 𝑖 ). To this end, it will be useful to establish a sufficient condition for the existence of
a common interlacing of a family of polynomials. As it turns out, existence of such
an interlacing is equivalent to the real-rootedness of arbitrary convex combinations of
polynomials within the family. This basic idea has appeared in various forms a number
of times, including [Ded92, Theorem 2.1], [Fel80, Theorem 2’], and [CS07, Theorem
3.6]. Our proof follows that of [MSS14].

Theorem 3.18 (Sufficient condition for common interlacing). Let 𝑓 1 , . . . , 𝑓𝑚 be a family of


degree-𝑛 polynomials. Suppose that, for all probability distributions 𝜇 supported
on {1, . . . , 𝑚}, the polynomial 𝔼𝑘 ∼𝜇 𝑓𝑘 has real roots. Then the family has a
common interlacing.

Proof. By Exercise 3.16, it suffices to show that any pair of polynomials 𝑓𝑖 , 𝑓 𝑗 with 𝑖 ≠ 𝑗
has a common interlacing under the given assumptions. Define the polynomial 𝑔𝑡 as
the convex combination
𝑔𝑡 (𝑥) B ( 1 − 𝑡 ) 𝑓𝑖 (𝑥) + 𝑡 𝑓 𝑗 (𝑥) where 𝑡 ∈ [0, 1] .
We will proceed under the assumption that 𝑓𝑖 and 𝑓 𝑗 have no roots in common. If they
do share some root 𝜆 ∈ ℝ, then 𝜆 will also be a root of any convex combination 𝑔𝑡 .
Then it suffices to find a common interlacing ℎ (𝑥) for the polynomials 𝑓𝑖 (𝑥)/(𝑥 − 𝜆)
and 𝑓 𝑗 (𝑥)/(𝑥 −𝜆) ; once we have done this, it is straightforward to see that (𝑥 −𝜆) ·ℎ (𝑥)
will be a common interlacing for 𝑓𝑖 and 𝑓 𝑗 .
Let 𝜆𝑘 (𝑡 ) denote the 𝑘 th largest root of 𝑔𝑡 . From complex analysis, we know that
the roots of a polynomial are continuous in its coefficients; thus the image of the unit
interval under 𝜆𝑘 is a continuous curve in the complex plane that begins at the 𝑘 th
largest root of 𝑓𝑖 at 𝑡 = 0, and ends at the 𝑘 th largest root of 𝑓 𝑗 at 𝑡 = 1. In particular,
this curve must reside on the real line, owing to the assumption that expectations (i.e.,
convex combinations) of polynomials in the family have real roots. As such, the image
𝜆𝑘 ( [ 0, 1]) is a closed interval in ℝ. Moreover, for any 𝑡 ∈ ( 0, 1) , it cannot be the case
that 𝜆𝑘 (𝑡 ) is a root of 𝑓𝑖 , for if this were the case, then the assumption of no common
roots would be violated:
0 = 𝑔𝑡 (𝜆𝑘 (𝑡 )) = ( 1 − 𝑡 ) 𝑓𝑖 (𝜆𝑘 (𝑡 )) + 𝑡 𝑓 𝑗 (𝜆𝑘 (𝑡 )) = 𝑡 𝑓 𝑗 (𝜆𝑘 (𝑡 )).
Project 3: Bipartite Ramanujan Graphs 234

Similarly, 𝜆𝑘 (𝑡 ) cannot be a root of 𝑓 𝑗 for any 𝑡 ∈ ( 0, 1) . It follows that the interval


𝜆𝑘 ( [ 0, 1]) contains, for each 𝑘 = 1, . . . , 𝑛 , a single root of each of 𝑓𝑖 and 𝑓 𝑗 . It
immediately follows that the polynomial ℎ whose 𝑘 th root is the left boundary point
of the interval 𝜆𝑘 ( [ 0, 1]) is a common interlacing for 𝑓𝑖 and 𝑓 𝑗 . 

3.3.2 Real-stable polynomials


In our proof of Theorem 3.13, we will exercise Theorem 3.18 by employing the notion
of real stability, which is defined as follows.

Definition 3.19 (Real stability). An 𝑚 -variate polynomial 𝑝 ∈ ℝ[𝑧 1 , . . . , 𝑧𝑚 ] is real


stable either if it is identically zero or if none of its roots has all coordinates strictly
in the upper half plane. That is, a nonzero 𝑝 is real stable if

Im (𝑧𝑖 ) > 0 for all 𝑖 = 1, . . . , 𝑚 implies 𝑝 (𝑧 1 , . . . , 𝑧𝑚 ) ≠ 0.

Since roots of a univariate polynomial come in conjugate pairs, a real-stable


univariate polynomial has only real roots.
We will need the following few results on particular real-stable polynomials, as well
as closure of the class under certain operations. We begin with a proposition due to
Borcea and Brändén [BB08, Proposition 2.4].
Proposition 3.20 (Determinant of linear combination of psd Hermitian matrices). Suppose that
𝑨 1 , . . . , 𝑨 𝑚 ∈ ℍ𝑛+ are positive-semidefinite Hermitian matrices. Then the 𝑚 -variate
polynomial 𝑝 ∈ ℝ[𝑧 1 , . . . , 𝑧𝑚 ] defined as
∑︁𝑚 
𝑝 (𝑧1 , . . . , 𝑧𝑚 ) B det 𝑧𝑖 𝑨 𝑖 is real stable.
𝑖 =1

Proof. We sketch a proof in the case that each matrix 𝑨 𝑖 is strictly positive definite;
the general result then follows from complex-analytic considerations. (See the proof in
[BB08] for further details.) We consider the restriction of the polynomial 𝑝 to a line.
Define 𝒛 (𝑡 ) = 𝜶 + 𝑡 𝜷 , where 𝜶 ∈ ℝ𝑚 and 𝜷 ∈ ℝ++ 𝑚 are arbitrary, and where 𝑡 ∈ ℂ.

Positive-definite matrices form a convex cone, so the matrix


∑︁𝑚
𝑷 B 𝛽𝑖 𝑨 𝑖 is positive definite.
𝑖 =1

A simple calculation shows that


 h∑︁𝑚 i 
𝑝 (𝒛 (𝑡 )) = det (𝑷 ) det 𝑡 I + 𝑷 −1/2 𝛼𝑖 𝑨 𝑖 𝑷 −1/2 .
𝑖 =1

−1/2 Í𝑚 −1/2
  
In particular, det (𝑷 ) is constant and det 𝑡 I +𝑷 𝑖 =1 𝛼𝑖 𝑨 𝑖 𝑷  Í is the character-
istic polynomial (in the variable 𝑡 ) of the Hermitian matrix −𝑷 −1/2 𝑚
 −1/2
𝑖 =1 𝑎 𝑖 𝑨 𝑖 𝑷 .
By the spectral theorem, it follows that 𝑝 (𝒛 (𝑡 )) has real roots, implying real stabil-
ity. 
We next present two results on the closure of the class of real-stable polynomials
under two operations: the first is the difference of the identity and a partial derivative
operator, and the second is partial evaluation of the polynomial at a real number. We
state these without proof, instead referring the reader to [MSS15b, Corollary 3.8 and
Proposition 3.9] for a more detailed overview.
Lemma 3.21 (Closure: Identity minus derivative). Let 𝑝 ∈ ℝ[𝑧 1 , . . . , 𝑧𝑚 ] be real stable. For
any 𝑖 ∈ {1, . . . , 𝑚}, it is the case that ( 1 − 𝜕𝑧𝑖 )𝑝 is real stable.
Project 3: Bipartite Ramanujan Graphs 235

Lemma 3.22 (Closure: Partial evaluation). Let 𝑝 ∈ ℝ[𝑧 1 , . . . , 𝑧𝑚 ] be real stable. For any
fixed 𝑎 ∈ ℝ, the (𝑚 − 1) -variate polynomial 𝑝 (𝑎, 𝑧 2 , . . . , 𝑧𝑚 ) is real stable.
We now define the mixed characteristic polynomial of a collection of matrices.

Definition 3.23 (Mixed characteristic polynomial). Let 𝑨 1 , . . . , 𝑨 𝑚 ∈ 𝕄𝑛 be matrices.


The mixed characteristic polynomial of 𝑨 1 , . . . , 𝑨 𝑚 is the univariate polynomial
𝜇[𝑨 1 , . . . , 𝑨 𝑚 ] ∈ ℝ[𝑡 ] defined as

𝜇[𝑨 1 , . . . , 𝑨 𝑚 ; 𝑡 ]
Ö𝑚   ∑︁𝑚 
B 1 − 𝜕𝑧𝑖 det 𝑡 I + 𝑧𝑖 𝑨 𝑖 . (3.5)
𝑖 =1 𝑖 =1
𝑧1 ,...,𝑧𝑚 =0

Like with the characteristic polynomial, we refer to the mixed characteristic polynomial
as 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ] , using 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ; 𝑡 ] instead when it is useful to emphasize the
name of its variable 𝑡 .
Finally, we present the following result from [MSS15b, Theorem 4.1] relating the
expected characteristic polynomial of the sum of independent rank-one matrices to a
mixed characteristic polynomial of the corresponding covariance matrices.

Theorem 3.24 (Expected char. poly. is a mixed char. poly.). Let 𝒗 1 , . . . , 𝒗 𝑚 ∈ ℂ𝑛 be


independent random vectors with covariance matrices 𝑨 𝑖 = 𝔼 𝒗 𝑖 𝒗 𝑖 ∗ . Then
h ∑︁𝑚 i
𝔼 𝜒 𝒗 𝑖 𝒗 𝑖 ∗ ; 𝑡 = 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ; 𝑡 ]. (3.6)
𝑖 =1

Moreover, 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ] has real roots.

Proof sketch. For the sake of brevity, we will not prove the equality (3.6); we refer
the reader to [MSS15b, Section 4] for a proof of this result. On the other hand,
real-rootedness of 𝜇[𝑨 1 , . . . , 𝑨 𝑚 ] follows immediately from the definition of the
mixed characteristic polynomial in (3.5), Proposition 3.20, positive semidefiniteness of
covariance matrices, the closure properties in Lemmas 3.21 and 3.22, and the fact that
real stability is equivalent to real-rootedness in the univariate case. 

3.4 Proof of Theorem 3.13


Equipped with the tools of the previous section, we are ready to tackle the proof of
Theorem 3.13. We proceed inductively by showing that the collection of “conditional
expectation polynomials” at a certain level have a common interlacing, which establishes
the desired bound at that level.
Define the conditional expectation polynomial at level  with first  vectors assigned
as 𝒗 1 = 𝒖 1 , . . . , 𝒗  = 𝒖  by
h ∑︁ ∑︁𝑚 i
𝑞𝒖 1 ,...,𝒖  (𝑥) B 𝔼𝒗 +1 ,...,𝒗 𝑚 𝜒 𝒖𝑖𝒖𝑖 ∗ + 𝒗𝑖𝒗𝑖 ∗ .
𝑖 =1 𝑗 =+1

Write {𝒘 1 , . . . , 𝒘 𝑟 } for the support of 𝒗 +1 (which, by assumption, is finite). Let 𝜈


Project 3: Bipartite Ramanujan Graphs 236

be an arbitrary distribution on {1, . . . , 𝑟 }. Then observe that


 
𝔼𝑘 ∼𝜈 𝑞𝒖 1 ,...,𝒖  ,𝒘 𝑘 (𝑥)
h ∑︁ ∑︁𝑚 i
= 𝔼𝒗 +2 ,...,𝒗 𝑚 ;𝑘 ∼𝜈 𝜒 𝒖𝑖𝒖𝑖 ∗ + 𝒘 𝑘𝒘 𝑘 ∗ + 𝒗𝑖𝒗𝑖 ∗
𝑖 =1 𝑗 =+2
= 𝜇 [𝒖 1𝒖 1 , . . . , 𝒖  𝒖  , 𝔼𝑘 ∼𝜈 𝒘 𝑘 𝒘 𝑘 , 𝔼 𝒗 +2𝒗 +2 , . . . , 𝔼 𝒗 𝑚 𝒗 𝑚 ∗ ] .
∗ ∗ ∗ ∗

The mixed characteristic polynomial in the last line has real roots, by Theorem 3.24
(which we also employed in the second equality). The distribution 𝜈 was arbitrary, so
we may invoke Theorem 3.18 to discover that the family of polynomials (𝑞𝒖 1 ,...,𝒖  ,𝒘 𝑘 )𝑘𝑟 =1
has a common interlacing. Then by Lemma 3.17, there is some 𝑘 ∈ {1, . . . , 𝑟 } for
which
𝜆 1 (𝑞𝒖 1 ,...,𝒖  ,𝒘 𝑘 ) ≤ 𝜆 1 (𝑞𝒖 1 ,...,𝒖  ).
Indeed, 𝑞𝒖 1 ,...,𝒖  = 𝔼𝒗 +1 [𝑞𝒖 1 ,...,𝒖  ,𝒗 +1 ] . Applying this argument repeatedly at each level
 ∈ {1, . . . , 𝑚} and chaining the inequalities, we eventually obtain an assignment
𝒖 1 , . . . , 𝒖 𝑚 satisfying
 ∑︁𝑚   h ∑︁𝑚 i 
𝜆1 𝜒 𝒖 𝑖 𝒖 𝑖 ∗ ≤ 𝜆1 𝔼 𝜒 𝒗𝑖𝒗𝑖 ∗ ,
𝑖 =1 𝑖 =1

which is the desired result. 

Lecture bibliography
[Alo86] N. Alon. “Eigenvalues and Expanders”. In: Combinatorica 6.2 (June 1986), pages 83–
96. doi: 10.1007/BF02579166.
[BL06] Y. Bilu and N. Linial. “Lifts, Discrepancy and Nearly Optimal Spectral Gap”. In:
Combinatorica 26.5 (2006), pages 495–519. doi: 10.1007/s00493-006-0029-7.
[BB08] J. Borcea and P. Brändén. “Applications of stable polynomials to mixed determi-
nants: Johnson’s conjectures, unimodality, and symmetrized Fischer products”. In:
Duke Mathematical Journal 143.2 (2008), pages 205 –223. doi: 10.1215/00127094-
2008-018.
[CS07] M. Chudnovsky and P. Seymour. “The roots of the independence polynomial
of a clawfree graph”. In: Journal of Combinatorial Theory, Series B 97.3 (2007),
pages 350–357. doi: https://doi.org/10.1016/j.jctb.2006.06.001.
[Ded92] J. P. Dedieu. “Obreschkoff ’s theorem revisited: what convex sets are contained in the
set of hyperbolic polynomials?” In: Journal of Pure and Applied Algebra 81.3 (1992),
pages 269–278. doi: https://doi.org/10.1016/0022-4049(92)90060-S.
[Fel80] H. J. Fell. “On the zeros of convex combinations of polynomials.” In: Pacific Journal
of Mathematics 89.1 (1980), pages 43 –50. doi: pjm/1102779366.
[GG78] C. Godsil and I. Gutman. “On the matching polynomial of a graph”. In: Algebraic
Methods in Graph Theory 25 (Jan. 1978).
[HL72] O. J. Heilmann and E. H. Lieb. “Theory of monomer-dimer systems”. In: Communi-
cations in Mathematical Physics 25.3 (1972), pages 190 –232. doi: cmp/1103857921.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. “Expander graphs and their applications”.
In: Bulletin of the American Mathematical Society 43.4 (2006), pages 439–561.
[Kow19] E. Kowalski. An introduction to expander graphs. Société mathématique de France,
2019.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
Project 3: Bipartite Ramanujan Graphs 237

[MSS15a] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families I: Bipartite


Ramanujan graphs of all degrees”. In: Annals of Mathematics 182.1 (2015), pages 307–
325. url: http://www.jstor.org/stable/24523004.
[MSS15b] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families II: Mixed char-
acteristic polynomials and the Kadison—Singer problem”. In: Annals of Mathematics
182.1 (2015), pages 327–350. url: http://www.jstor.org/stable/24523005.
[MSS21] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families III: Sharper
restricted invertibility estimates”. In: Israel Journal of Mathematics (2021), pages 1–
28.
[Nil91] A. Nilli. “On the Second Eigenvalue of a Graph”. In: Discrete Math. 91.2 (1991),
pages 207–210. doi: 10.1016/0012-365X(91)90112-F.
[Pak10] I. Pak. Lectures on Discrete and Polyhedral Geometry. 2010. url: https://www.
math.ucla.edu/~pak/book.htm.
[Spi18b] D. Spielman. Bipartite Ramanujan Graphs. 2018. url: http://www.cs.yale.
edu/homes/spielman/561/lect25-18.pdf.
[Spi19] D. Spielman. Spectral and Algebraic Graph Theory. 2019. url: http : / / cs -
www.cs.yale.edu/homes/spielman/sagt/sagt.pdf.
4. The Noncommutative Grothendieck Problem

Date: 14 March 2022 Author: Ethan N. Epperly

Grothendieck’s inequality [Gro53] is a foundational result that has important uses in


Banach space theory, 𝐶 ∗ algebras, quantum information theory, and combinatorial
optimization. A noncommutative analog of the inequality, conjectured by Grothendieck
in 1953, was proved a quarter century later by Pisier [Pis78] and has seen significant
applications in various areas of mathematics. This note will survey these inequalities,
adopting a contemporary perspective centered around semidefinite programming.

4.1 Grothendieck’s inequality


Let 𝑩 ∈ ℝ𝑚×𝑛 be a matrix. Consider the problem of computing the ∞ → 1 operator The equality (4.1) is a consequence of
norm of 𝑩 , which can be phrased as the binary integer optimization problem duality between the ∞ and 1 norms
and the Bauer maximum principle:
k𝑩 k ∞ →1 = max 𝑛 h𝒚 , 𝑩𝒙 i. (4.1) k𝒛 k 1 = max h𝒘 , 𝒛 i
𝒙 ∈ {±1 } k𝒘 k ∞ ≤1
𝒚 ∈ {±1 }𝑚
= max h𝒘 , 𝒛 i.
𝒘 ∈{±1}𝑛
This is a combinatorial optimization problem over the 2𝑚+𝑛
assignments for 𝒙 and 𝒚 ,
and it is known to be NP-hard [AN06, Prop. 2.2].
A classical technique to find approximate solutions to combinatorial optimization
problems is to relax the problem to a semidefinite program, which can be tractably
solved. Observe that the objective function of (4.1) can be written as

h𝑩𝒙 , 𝒚 i = h𝒚 𝒙 ∗ , 𝑩i.

This formulation suggests we reframe as an optimization problem over the matrix


𝒁 12 = 𝒚 𝒙 ∗ . In fact, to capture the constraint that the entries of 𝒙 and 𝒚 are ±1, we
shall consider the larger matrix
   ∗  ∗
𝒚𝒚 𝒚𝒙∗
  
𝒚 𝒚 𝒁 11 𝒁 12
𝒁 = = = . (4.2)
𝒙 𝒙 𝒙 𝒚 ∗ 𝒙𝒙 ∗ 𝒁 21 𝒁 22

One can verify that a matrix 𝒁 ∈ 𝕄𝑚+𝑛 (ℝ) has the form in (4.2) if and only if 𝒁 has
rank one and has all ones on its diagonal. Thus, (4.1) can be exactly reformulated as
the rank-constrained matrix optimization problem

k𝑩 k ∞ →1 = max h𝒁 12 , 𝑩i subject to rank 𝒁 = 1, diag (𝒁 ) = 𝟙. (4.3)


𝒁 ∈𝕄𝑚+𝑛 (ℝ)

We have written 𝒁 12 for the ( 1, 2) -block of 𝒁 , partitioned as in (4.2). As this is a


reformulation, we have not made the problem any more tractable by phrasing it this
way. To make the problem easier on ourself, we “relax” the rank-one constraint to the
looser restriction that 𝒁 is positive semidefinite:

SDP (𝑩) := max h𝒁 12 , 𝑩i subject to 𝒁  0, diag (𝒁 ) = 𝟙. (4.4)


𝒁 ∈𝕄𝑚+𝑛 (ℝ)
Project 4: The NC Grothendieck Problem 239

Since any feasible solution to (4.3) is also feasible for (4.4), we must have k𝑩 k ∞ →1 ≤
SDP (𝑩) . The content of Grothendieck’s inequality is that SDP (𝑩) is no larger than a
modest multiple of k𝑩 k ∞ →1 . That is, the semidefinite programming relaxation (4.4)
provides a constant-factor approximation to the Grothendieck problem (4.1).
ℝ such that
Theorem 4.1 (Grothendieck’s inequality I). There exists a constant KG

k𝑩 k ∞ →1 ≤ SDP (𝑩) ≤ KGℝ · k𝑩 k ∞ →1

for all real matrices 𝑩 of any dimension.

We indulge in one final refomulation. Note that a matrix 𝒁 is feasible for (4.4) if
and only if it is the Gram matrix of unit vectors 𝒖 1 , . . . , 𝒖 𝑚 , 𝒚 1 , . . . , 𝒚 𝑛 in a Hilbert
space ( H, h·, ·i) . That is,

 h𝒖 1 , 𝒖 1 i · · · h𝒖 1 , 𝒖 𝑚 i h𝒖 1 , 𝒗 1 i · · · h𝒖 1 , 𝒗 𝑛 i 
 .. .. .. .. 
 .. .. 
 . . . . . . 
 h𝒖 𝑚 , 𝒖 1 i · · · h𝒖 𝑚 , 𝒖 𝑚 i h𝒖 𝑚 , 𝒗 1 i · · · h𝒖 𝑚 , 𝒗 𝑛 i 
 
𝒁 = .
 h𝒗 1 , 𝒖 1 i · · · h𝒗 1 , 𝒖 𝑚 i h𝒗 1 , 𝒗 1 i · · · h𝒗 1 , 𝒗 𝑛 i 
 .. .. .. .. .. .. 

 . . . . . . 

 h𝒗 , 𝒖 i · · · h𝒗 𝑛 , 𝒖 𝑚 i h𝒗 𝑛 , 𝒗 1 i · · · h𝒗 𝑛 , 𝒗 𝑛 i 
 𝑛 1
With this observation, another equivalent statement of Grothendieck’s inequality
becomes evident. In premonition of what will come, we shall also generalize to the
field 𝔽 of either real or complex numbers.
𝔽 such that
Theorem 4.2 (Grothendieck’s inequality II). There exists a constant KG For 𝔽 = ℂ, we have the following
analog of (4.1):
∑︁𝑚 ∑︁𝑛
𝑏 𝑖 𝑗 h𝒖 𝑖 , 𝒗 𝑗 i ≤ KG𝔽 · k𝑩 k ∞ →1 , k𝑩 k ∞ →1 = max𝑛 | h𝑩𝒙 , 𝒚 i |,
𝑖 =1 𝑗 =1 𝒙 ∈𝕋
𝒚 ∈𝕋 𝑛

where 𝕋 denotes the elements of 𝔽 with modulus one , 𝑩 is an matrix over 𝔽 of where 𝕋 are the complex numbers of
any size 𝑚 × 𝑛 , and 𝒖 1 , . . . , 𝒖 𝑚 , 𝒗 1 , . . . , 𝒗 𝑛 are unit vectors in a Hilbert space unit modulus.
( H, h·, ·i) .

For algorithmic applications, we are not only just interested in the optimal value
of the Grothendieck problem (4.1) but also in the optimal solutions 𝒙 ∈ {±1}𝑚 and
𝒚 ∈ {±1}𝑛 . Fortunately, there are efficient ways for “rounding” the output of the
semidefinite programming relaxation (4.4) to approximate optimal solutions. The
following is a result of Alon and Naor [AN06].

Theorem 4.3 (Grothendieck efficient rounding). There exists a polynomial-time ran-


domized algorithm to convert an optimal solution to the semidefinite programming
problem (4.4) to random vectors 𝒙 ∈ {±1}𝑛 and 𝒚 ∈ {±1}𝑚 such that
h∑︁𝑚 ∑︁𝑛 i
𝔼 𝑏 𝑖 𝑗 𝑥𝑖 𝑦 𝑗 ≤ KGℝ,rnd · k𝑩 k ∞ →1 ,
𝑖 =1 𝑗 =1

where KGℝ,rnd is an absolute constant.


Project 4: The NC Grothendieck Problem 240

4.2 The noncommutative Grothendieck problem


We now turn our attention to a noncommutative analog of the Grothendieck problem
(4.1), which we will analyze using a noncommutative analog of Grothendieck’s inequal-
ity (Theorem 4.2). We shall restrict ourselves to the complex field 𝔽 = ℂ and to the
“square case” 𝑚 = 𝑛 to streamline the presentation.
To motivate the use of the term “noncommutative”, observe that we can further
reformulate the Grothendieck problem (4.1) in the field ℂ as an optimization problem
over diagonal matrices 𝑿 and 𝒀 :

k𝑩 k ∞ →1 = max | T(𝑿 ,𝒀 )|


𝑿 ,𝒀 ∈ {diag (𝒛 ) :𝒛 ∈𝕋 𝑛 }

where ∑︁𝑚 ∑︁𝑚


T(𝑿 ,𝒀 ) = 𝑏 𝑖 𝑗 𝑥𝑖𝑖 𝑦 𝑗 𝑗 .
𝑖 =1 𝑗 =1
Thus the Grothendieck problem can be equivalently formulated as maximizing a
bilinear form over two diagonal (and thus commuting) unitary matrices.
The noncommutative Grothendieck problem relaxes this diagonality assumption to
allow for arbitrary unitary matrices. Let T : 𝕄𝑛 × 𝕄𝑛 → ℂ be a sesquilinear form,
conjugate linear in its first argument. The noncommutative Grothendieck problem for
T is to compute It leads to an equivalent problem to
OPT ( T) = max Re T(𝑿 ,𝒀 ), (4.5) replace the real part in (4.5) by the
𝑿 ,𝒀 ∈𝕌𝑛 (ℂ) modulus. We shall prefer the real part
as it is ℝ-linear.
where 𝕌𝑛 (ℂ) denotes the 𝑛 × 𝑛 unitary matrices over the field ℂ.
In order to state a noncommutative analog of the Grothendieck inequality (Theo-
rem 4.2), we must introduce some seemingly peculiar notation. Let 𝕄𝑛 ( H) denote
𝑛 × 𝑛 matrices taking values in a Hilbert space H, whose inner product we denote
h·, ·i . For a matrix 𝑪 ∈ 𝕄𝑛 ( H) , we define the Gram matrices 𝑪𝑪 ∗ , 𝑪 ∗𝑪 ∈ ℍ𝑛 (ℂ)
with entries
∑︁𝑛 ∑︁𝑛
(𝑪 𝑪 ∗ )𝑖 𝑗 = h𝒄 𝑖𝑘 , 𝒄 𝑗 𝑘 i, (𝑪 ∗𝑪 )𝑖 𝑗 = h𝒄 𝑘𝑖 , 𝒄 𝑘 𝑗 i.
𝑘 =1 𝑘 =1

The set of matrices 𝑼 ∈ 𝕄𝑛 ( H) such that 𝑼 ∗𝑼 = 𝑼𝑼 ∗ = I are called unitary and are
denoted 𝕌𝑛 ( H) . For a sesquilinear form T on 𝕄𝑛 (ℂ) expressed in coordinates as
∑︁𝑛
T(𝑿 ,𝒀 ) = 𝑡 𝑖 𝑗 𝑘  𝑥 𝑖 𝑗 𝑦𝑘  ,
𝑖 ,𝑗 ,𝑘 ,=1

we define T(𝑼 ,𝑽 ) for 𝑼 ,𝑽 ∈ 𝕄𝑛 ( H) as


∑︁𝑛
T(𝑼 ,𝑽 ) = 𝑡𝑖 𝑗 𝑘  h𝒖 𝑖 𝑗 , 𝒗 𝑘  i.
𝑖 ,𝑗 ,𝑘 ,=1

We now state a noncommutative Grothendieck inequality, analogous to our second


formulation of the commutative Grothendieck inequality (Theorem 4.2).

Theorem 4.4 (Noncommutative Grothendieck Inequality I). Let T be a sesquilinear form



on 𝕄𝑛 (ℂ) . There exists an constant KNCG such that for any Hilbert space H and
𝑼 ,𝑽 ∈ 𝕌𝑛 ( H) ,

Re T(𝑼 ,𝑽 ) ≤ KNCG · OPT ( T).

The optimal value of the constant is KNCG = 2.

A result of this form was conjectured by Grothendieck and proved by Pisier [Pis78].
Haagerup obtained the sharp constant using a more streamlined proof [Haa85], and
Haagerup and Itoh established the optimality of the constant [HI95].
Project 4: The NC Grothendieck Problem 241

4.2.1 Semidefinite programming formulation


By running the our argument from the commutative Grothendieck inequality in reverse,
we can reformulate the noncommutative Grothendieck inequality (Theorem 4.4) as an
approximation guarantee for the noncommutative Grothendieck problem (4.5).
Consider a 2𝑛 2 × 2𝑛 2 Gram matrix 𝑮 whose entries are populated with the pairwise
inner products of all 2𝑛 2 vector entries of 𝑼 ,𝑽 ∈ 𝕄𝑛 ( H) for some Hilbert space H.
The unitarity constraints on 𝑼 and 𝑽 are all linear equalities in the pairwise inner
products of the entries of 𝑼 and 𝑽 , which are the entries of the Gram matrix 𝑮 . Further,
the objective function is ℝ-linear in the pairwise inner products of the entries of 𝑼 and
𝑽 . Therefore, if one picks a Hilbert space H of large enough dimension (dim H ≥ 2𝑛 2
suffices), one can optimize over 𝑼 and 𝑽 on the left-hand side of the noncommutative
Grothendieck inequality (Theorem 4.4) to obtain a quantity The dimension 2𝑛 2 suffices since the
dimension of the Hilbert space H
SDP ( T) := max Re T(𝑼 ,𝑽 ) (4.6) must only be chosen to be large
2
𝑼 ,𝑽 ∈𝕌𝑛 (ℂ2𝑛 ) enough that all possible 2𝑛 2 × 2𝑛 2
positive semidefinite matrices can be
which can be solved (to a specified accuracy) in polynomial time by solving a linear represented as the Gram matrix of
2𝑛 2 vectors in H. Cholesky
semidefinite program with 𝑂 (𝑛 4 ) entries in the matrix variable and 𝑂 (𝑛 2 ) linear factorization, for example, shows that
equality constraints. This optimization problem is indeed a relaxation of the non- 2
H = ℂ2𝑛 satisfies this property.
commutative Grothendieck problem (4.5) because any 𝑿 ,𝒀 feasible for (4.5) can be
converted to a pair 𝑼 ,𝑽 feasible for (4.6) by setting 𝒖 𝑖 𝑗 = 𝑥𝑖 𝑗 𝜹 1 and 𝒗 𝑖 𝑗 = 𝑦𝑖 𝑗 𝜹 1 for
all 𝑖 , 𝑗 = 1, . . . , 𝑛 .
The noncommutative Grothendieck inequality can be interpreted as an approxima-
tion guarantee for this semidefinite programming relaxation:

Theorem 4.5 (Noncommutative Grothendieck Inequality II). Let T be a sesquilinear



form on 𝕄𝑛 (ℂ) . With the optimal constant KNCG = 2,

OPT ( T) ≤ SDP ( T) ≤ KNCG · OPT ( T).

4.2.2 Efficient rounding algorithm


As with the commutative Grothendieck problem, we are often interested in obtaining
an nearly optimal solution to the noncommutative Grothendieck problem (4.5) from
the semidefinite programming relaxation (4.6). The following (randomized) algorithm If we solve the relaxation (4.6) using
achieves precisely this goal. the approach outlined, we obtain a
Gram matrix 𝑮 for the entries of a
solution 𝑼 and 𝑽 . From this Gram
Algorithm 4.1 Efficient rounding for noncommutative Grothendieck problem matrix, we can find
2
Input: (Approximate) solutions 𝑼 ,𝑽 ∈ 𝕌𝑛 (ℂ𝑑 ) to the SDP relaxation (4.6) of the 𝑼 ,𝑽 ∈ 𝕌𝑛 (ℂ2𝑛 ) that solve (4.6) in
6
noncommutative Grothendieck problem (4.5) 𝑂 (𝑛 ) time using Cholesky
factorization.
Output: Approximate solutions 𝑿 ,𝑽 ∈ 𝕌𝑛 (ℂ) to the noncommutative Grothendieck
problem (4.5)
1 Choose 𝒛 ∼ uniform { 1, −1, i, −i } and 𝑡 according to a hyperbolic secant distri-
𝑑

bution: ∫ 𝑏
1 𝑥 
ℙ {𝑡 ∈ [𝑎, 𝑏]}} = sech d𝑥 ;
𝑎 2 2
2 Set 𝑼 𝒛 ← h𝒛 , 𝑼 i ∈ 𝕄𝑛 (ℂ) and 𝑽 𝒛 := h𝒛 , 𝑽 i ∈ 𝕄𝑛 (ℂ) , where the inner product
is taken entrywise.
3 Compute polar decompositions
√ 𝑼 𝒛 = 𝚽𝒛 |𝑼√ 𝒛 | , 𝑽 𝒛 = 𝚿𝒛 |𝑽 𝒛 |
4 return 𝑿 ← 𝚽𝒛 |𝑼 𝒛 / 2 | i𝑡 , 𝒀 ← 𝚿𝒛 |𝑽 𝒛 / 2 | i𝑡
Project 4: The NC Grothendieck Problem 242

Amazingly, this somewhat peculiar rounding algorithm produces 𝑿 and 𝒀 which


are, in expectation, nearly (worst-case) optimal. The preceding algorithm and the
following result are due to Naor, Regev, and Vidick [NRV14, Thm. 1.4].

Theorem 4.6 (Noncommutative Grothendieck efficient rounding). Suppose 𝑼 ,𝑽 ∈


𝕌𝑛 (ℂ𝑑 ) are 𝜀 -optimal solutions to (4.6):

Re T(𝑼 ,𝑽 ) ≥ ( 1 − 𝜀) · SDP ( T).

Let 𝑿 and 𝒀 be the outputs of Algorithm 4.1 applied to these inputs. Then
 
1
𝔼 Re T(𝑿 ,𝒀 ) ≥ − 𝜀 · OPT ( T).
2

Both (equivalent) versions of the noncommutative Grothendieck inequality follow


directly from this rounding procedure (and the probabilistic method).

4.3 Noncommutative Grothendieck efficient rounding: Proof


We shall now work our way up to a proof. Throughout this section, we shall use the
notation from the rounding procedure (Algorithm 4.1). We begin with two technical
lemmas which we shall need for the proof proper.

4.3.1 Supporting lemmas


Our first lemma is an expectation bound for the “projected” matrices 𝑼 𝒛 and 𝑽 𝒛 from
the rounding procedure.
Lemma 4.7 (Projection). With the prevailing notation, we have the bounds

𝔼 𝑼 𝒛 𝑼 ∗𝒛 , 𝔼 𝑼 ∗𝒛 𝑼 𝒛 , 𝔼 𝑽 𝒛 𝑽 ∗𝒛 , 𝔼 𝑽 ∗𝒛 𝑽 𝒛  2I.
       

The proof is a short and rather pedestrian calculation, which we omit. We refer the
interested reader to [NRV14, Lem. 2.2].
For our second lemma, it will be helpful to “upgrade” a pair of vector-valued
matrices 𝑩, 𝑪 ∈ 𝕄𝑛 (ℂ𝑑 ) which are subunitary in the sense that
k𝑩 ∗ 𝑩 k, k𝑩𝑩 ∗ k, k𝑪 ∗𝑪 k, k𝑪𝑪 ∗ k ≤ 1 (4.7)
˜
to full unitary vector-valued matrices 𝑹 , 𝑺 ∈ 𝕌𝑛 (ℂ𝑑 ) , possibly of a larger dimension
𝑑˜ ≥ 𝑑 , while preserving the sesquilinear form T. The next lemma shows this is
possible.
Lemma 4.8 (Subunitary to unitary). Suppose 𝑩, 𝑪 ∈ 𝕄𝑛 (ℂ𝑑 ) satisfy the subunitary
2
bound (4.7). Then there exist 𝑹 , 𝑺 ∈ 𝕌𝑛 (ℂ𝑑+2𝑛 ) such that
T(𝑹 , 𝑺 ) = T(𝑩, 𝑪 ).
Proof. Define residuals 𝑫 = I − 𝑩𝑩 ∗ and 𝑬 = I − 𝑩 ∗ 𝑩 . Define 𝜎 := tr 𝑫 = tr 𝑬 and
consider spectral decompositions
∑︁𝑛 ∑︁𝑛
𝑫= 𝜆𝑖 𝒖 (𝑖 ) 𝒖 (𝑖 )∗ and 𝑬 = 𝜇𝑖 𝒗 (𝑖 ) 𝒗 (𝑖 )∗ .
𝑖 =1 𝑖 =1
2
Define 𝑹 ∈ 𝕌𝑛 (ℂ𝑑+2𝑛 ) with (𝑖 , 𝑗 ) th entry
√︂ !
𝜆𝑘 𝜇  (𝑘 ) ()∗
𝒓 𝑖 𝑗 = 𝒃𝑖 𝑗 ⊕ 𝑢 𝑣 : 𝑘 ,  = 1, . . . , 𝑛 ⊕ 0, (4.8)
𝜎 𝑖 𝑗
Project 4: The NC Grothendieck Problem 243

where ⊕ denotes vertical concatenation of column vectors. A short calculation confirms


that 𝑹 is unitary. Define 𝑺 analogously, swapping the last two vectors in (4.8). It is
straightforward to verify that T(𝑹 , 𝑺 ) = T(𝑩, 𝑪 ) . 

4.3.2 Proof of Theorem 4.6


As a final observation before the proof, note that we can identify the sesquilinear form
T on ℂ𝑛 with a linear operator on ℂ𝑛 ⊗ ℂ𝑛 as both spaces are equidimensional and
thus isomorphic. Under this identification, evaluation of T can be computed as an
inner product
T(𝑿 ,𝒀 ) = h𝑿 ⊗ 𝒀 , Ti for 𝑿 ,𝒀 ∈ 𝕄𝑛 (ℂ), (4.9)
where 𝒀 denotes the entrywise complex conjugate of 𝒀 . With the preliminaries taken
care of, we proceed with the proof in earnest.

Proof of Theorem 4.6. Let 𝔼𝒛 , 𝔼𝑡 , and 𝔼𝒛 ,𝑡 denote expectations over the randomness
in 𝒛 , 𝑡 , or both 𝒛 and 𝑡 , respectively. Since T is sesquilinear and 𝒛 is isotropic in the
sense that 𝔼 𝒛 𝒛 ∗ = I, we compute

𝔼𝒛 T(𝑼 𝒛 ,𝑽 𝒛 ) = T(𝑼 ,𝑽 ). (4.10)

In anticipation of applying (4.9) to compute 𝔼𝒛 ,𝑡 T(𝑿 ,𝒀 ) , we begin by computing


𝔼𝑡 [𝑿 ⊗ 𝒀 ] . Recall that 𝑿 and 𝒀 were defined as
 √  i𝑡  √  i𝑡
𝑿 = 𝚽𝒛 |𝑼 𝒛 |/ 2 , 𝒀 = 𝚿𝒛 |𝑽 𝒛 |/ 2

where 𝑼 𝒛 = 𝚽𝒛 |𝑼 𝒛 | and 𝑽 𝒛 = 𝚿𝒛 |𝑽 𝒛 | are polar decompositions. The log-secant


distribution has the property that

𝔼𝑡 𝑎 i𝑡 = 2𝑎 − 𝔼𝑡 𝑎 2+i𝑡 . (4.11)

Apply this fact to calculate


h i  i𝑡
𝔼𝑡 𝑿 ⊗ 𝒀 = (𝚽𝒛 ⊗ 𝚿𝒛 ) 𝔼𝑡 21 |𝑼 𝒛 | ⊗ |𝑽 𝒛 |
h  2+i𝑡 i
= (𝚽𝒛 ⊗ 𝚿𝒛 ) |𝑼 𝒛 | ⊗ |𝑽 𝒛 | − 𝔼𝑡 12 |𝑼 𝒛 | ⊗ |𝑽 𝒛 | (4.12)
 
1 2+i𝑡 2 − i
= 𝑼 𝒛 ⊗ 𝑽 𝒛 − 𝔼𝑡 2+i𝑡 · 𝚽𝒛 |𝑼 𝒛 | ⊗ 𝚿𝒛 |𝑽 𝒛 | 𝑡 .
2

Combining this result with (4.9) and (4.10), we conclude

𝔼𝒛 ,𝑡 T(𝑿 ,𝒀 ) = T(𝑼 ,𝑽 ) − 𝔼𝒛 ,𝑡 T(𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 , 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 ).

By hypothesis, Re T(𝑼 ,𝑽 ) ≥ ( 1 − 𝜀) · SDP ( T) , so it will suffice to show

Re 𝔼𝒛 T(𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 , 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 ) ≤ 2 OPT ( T) for any 𝑡 ∈ ℝ. (4.13)

We shall show precisely this. For the remainder of this proof, fix 𝑡 ∈ ℝ.
To show (4.13), we “derandomize” the expectation over 𝒛 by passing from random 𝑑
ℂ{1,−1,i,−i} denotes the Hilbert
scalar-valued matrices 𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 , 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 ∈ 𝕄𝑛 (ℂ) to deterministic vector-valued space of functions
𝑑
matrices in 𝑩, 𝑪 ∈ 𝕄𝑛 (ℂ {1,−1,i,−i } ) , which we define as 𝑓 : {1, −1, i, −i }𝑑 → ℂ.

1 1
𝑩 (𝒛 ) := 𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 and 𝑪 (𝒛 ) := 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 .
2𝑑 2𝑑
Project 4: The NC Grothendieck Problem 244

As promised, this allows us to remove the explicit randomness in (4.13) since


1 ∑︁
T(𝑩, 𝑪 ) = T(𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 , 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 )
4𝑑 𝒛 ∈ {1,−1,i,−i }𝑑

= 𝔼𝒛 T(𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 , 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 ).

Further, 𝑩𝑩 ∗ = 𝔼𝒛 (𝑼 𝒛 𝑼 ∗𝒛 ) 2 and 𝑩 ∗ 𝑩 = 𝔼𝒛 (𝑼 ∗𝒛 𝑼 𝒛 ) 2 , with analogous statements


holding for 𝑪 . By Lemma 4.7, we have

𝑩𝑩 ∗ , 𝑩 ∗ 𝑩, 𝑪𝑪 ∗ , 𝑪 ∗𝑪  2I.
√ √
Thus, 𝑩/ 2 and 𝑪 / 2 are subunitary in the sense of Lemma 4.8, so there exists
˜
𝑹 , 𝑺 ∈ 𝕌𝑛 (ℂ𝑑 ) for some dimension 𝑑˜ such that
2 T(𝑹 , 𝑺 ) = T(𝑩, 𝑪 ) = 𝔼𝒛 T(𝚽𝒛 |𝑼 𝒛 | 2+i𝑡 , 𝚿𝒛 |𝑽 𝒛 | 2−i𝑡 ).

But since 𝑹 and 𝑺 are unitary, Re T(𝑹 , 𝑺 ) ≤ SDP ( T) . Plugging in and rearranging
yields (4.13), completing the proof. 

4.3.3 Intuition for rounding procedure


The proof sheds some light on why the rounding procedure (Algorithm 4.1) works.
As shown in (4.10), 𝑼 𝒛 ,𝑽 𝒛 ∈ 𝕌𝑛 (ℂ) have the same T-value in expectation as
𝑼 ,𝑽 ∈ 𝕌𝑛 (ℂ𝑑 ) . Were 𝑼 𝒛 ,𝑽 𝒛 unitary, they would be excellent approximate solutions
to the noncommutative Grothendieck problem (4.5).
We are thus interested in adding additional randomness to obtain random unitary
matrices which√agree with 𝑼 𝒛 and 𝑽 𝒛 √on average, up to a small error. The forms
𝑿 = 𝚽𝒛 (|𝑼 𝒛 |/ 2) i𝑡 and 𝒀 = 𝚿𝒛 (|𝑽 𝒛 |/ 2) i𝑡 with 𝑡 having the log-secant distribution
are specially engineered to achieve this goal. Specifically, the property (4.11) of the
log-secant distribution shows that, on average, the unitary matrices 𝑿 and 𝒀 produced
by the rounding algorithm are close to 𝑼 𝒛 and 𝑽 𝒛 in the sense (4.12).

4.4 Application: Robust PCA


We conclude by presenting an application of the noncommutative Grothendieck
inequality to robust principal component analysis (PCA). Our presentation will be
informal and high-level, seeking only to provide a colorful illustration of this machinery
to a “practical” problem.

4.4.1 1 PCA
Consider a data set encoded in the columns of a matrix 𝑩 ∈ ℂ𝑚×𝑛 . The PCA problem
is to find a 𝑘 -dimensional subspace capturing most of the “energy” or “variance” in the
data matrix 𝑩 . In classical PCA, this can be formulated as an optimization problem

maximize k𝑾 ∗ 𝑩 k F subject to 𝑾 ∈ 𝕌𝑚×𝑘 (ℂ), (4.14)

where 𝕌𝑚×𝑘 (ℂ) consists of the 𝑚 × 𝑘 matrices with orthonormal columns. An optimal
solution to this problem is readily determined by choosing 𝑾 to consist of the 𝑘
dominant left singular vectors of 𝑩 .
To make this procedure more robust to outliers, we can replace the Frobenius
norm with a different norm in (4.14) For instance, Kwak [Kwa08] proposes using the
following “1 PCA” problem

maximize k𝑾 ∗ 𝑩 k 1 subject to 𝑾 ∈ 𝕌𝑚×𝑘 (ℂ), (4.15)


Project 4: The NC Grothendieck Problem 245

where k·k 1 is the entrywise 1 norm


∑︁𝑘 ∑︁𝑛
k𝑪 k 1 := |𝑧𝑖 𝑗 | for 𝑪 ∈ ℂ𝑘 ×𝑛 .
𝑖 =1 𝑗 =1

See also [MT11, §2.7].

4.4.2 Reformulation
We now reformulate the 1 PCA problem as a bilinear optimization problem over
suitably defined unitary matrices. First, use the duality of 1 and ∞ to write the 1
norm as a maximization:

maximize Re h𝑾 ∗ 𝑩, 𝒁 i subject to 𝑾 ∈ 𝕌𝑚×𝑘 (ℂ), 𝒁 ∈ 𝕋 𝑘 ×𝑛 .

One can further encode 𝑾 and 𝒁 as unitary matrices 𝑼 and 𝑽 with a specific entries
set to zero:
 
𝑼 = 𝑾 0 ∈ 𝕌𝑚 (ℂ) ;
 (4.16)
𝑽 = diag 𝑧𝑖 𝑗 : 1 ≤ 𝑖 ≤ 𝑘 , 1 ≤ 𝑗 ≤ 𝑛 ∈ 𝕌𝑘 𝑛 (ℂ).

Thus, one can reformulate the 1 PCA problem (4.15) as a bilinear optimization
problem over unitary matrices 𝑼 and 𝑽 by designing the sesquilinear form such
that optimal solutions 𝑼 and 𝑽 take the forms (4.16); see [NRV14, §5.1] for details.
The noncommutative Grothendieck inequality ensures the semidefinite programming
relaxation of this problem is a constant factor approximation. We conclude the that 1
PCA problem has a polynomial-time constant-factor approximation algorithm.

Notes
A survey on various Grothendieck inequalities and their applications is given by Pisier
[Pis12].
The optimal (commutative) Grothendieck constants are satisfy bounds
𝜋
1.67696 ≤ KGℝ < 1.7822 . . . = √ , 1.338 < KGℂ ≤ 1.4049.
2 log ( 1 + 2)

Davie established the lower bounds in unpublished work; Krivine showed the upper
bound on KGℝ [Kri78]; Braverman, Makarychev, Makarychev, and Naor showed the
slight suboptimality of Krivine’s bound [Bra+11]; and Haagerup showed the upper
bound on KGℂ in [Haa85]. Vershynin provides an accessible exposition on Krivine’s
bound [Ver18, §3.7].
The semidefinite programming interpretation of Grothendieck’s inequality was
popularized by Alon and Naor [AN06]. Their rounding procedure (Theorem 4.3) is
based on Krivine’s bound for the Grothendieck constant. Thus, in the notation of
Theorem 4.3,
𝜋
KGℝ,rnd ≤ √ .
2 log ( 1 + 2)
There are many ways to round answers from semidefinite programs to approximate so-
lutions of combinatorial optimization problems; among these, the Goemans–Williamson
algorithm for MaxCut is a distinguished example [GW95]. Assuming the Unique
Games Conjecture, it is NP-hard to approximate k𝑩 k ∞ →1 to within a factor of KGℝ − 𝜀
for any 𝜀 > 0 [RS09]; the semidefinite programming relaxation (4.4) provides an
Project 4: The NC Grothendieck Problem 246

efficient approximation algorithm with the optimal approximation factor (assuming


the Unique Games Conjecture).
There are real and Hermitian analogs of the noncommutative Grothendieck problem
and inequality. The rounding procedure analyzed in Theorem 4.6 generalizes to
these problems, giving a constructive proof √ of real and Hermitian noncommutative
Grothendieck inequalities with constant 2 2 [NRV14, §3]. Naor, Regev, and Vidick
also provide a different rounding procedure for the real noncommutative inequality
[NRV14, §4] that was inspired by the proof of Kaijser [Kai83]. Briët, Regev, and
Saket showed [BRS17] that it is NP-hard to approximate the solution to the (complex)
noncommutative Grothendieck problem (4.5) within a factor of KNCG ℂ − 𝜀 for any
𝜀 > 0; similar to the commutative case, the semidefinite programming relaxation
(4.6) provides an efficient algorithm with the optimal approximation factor (assuming
P ≠ NP).
Several robust versions of principal component analysis have been proposed,
including the 1 version we discussed as well as a variant where the Frobenius norm
in (4.14) is replaced by an entrywise mixed 1 –2 norm [Din+06]. McCoy and
Tropp [MT11] provide a polynomial-time constant-factor approximation algorithm to
the 1 robust PCA problem (4.15) for 𝑘 = 1 and a polynomial-time 𝑂 ( log 𝑚) -factor
approximation algorithm for general 𝑘 that builds on work of So [So09].

Lecture bibliography
[AN06] N. Alon and A. Naor. “Approximating the cut-norm via Grothendieck’s inequality”.
In: SIAM J. Comput. 35.4 (2006), pages 787–803. doi: 10.1137/S0097539704441629.
[Bra+11] M. Braverman et al. “The Grothendieck Constant Is Strictly Smaller than Krivine’s
Bound”. In: 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.
Oct. 2011, pages 453–462. doi: 10.1109/FOCS.2011.77.
[BRS17] J. Briët, O. Regev, and R. Saket. “Tight Hardness of the Non-Commutative
Grothendieck Problem”. In: Theory of Computing 13.15 (Dec. 2017), pages 1–
24. doi: 10.4086/toc.2017.v013a015.
[Din+06] C. Ding et al. “R1 -PCA: Rotational Invariant L1 -Norm Principal Component Analysis
for Robust Subspace Factorization”. In: Proceedings of the 23rd International
Conference on Machine Learning. ICML ’06. Association for Computing Machinery,
June 2006, pages 281–288. doi: 10.1145/1143844.1143880.
[GW95] M. X. Goemans and D. P. Williamson. “Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming”. In:
J. Assoc. Comput. Mach. 42.6 (1995), pages 1115–1145. doi: 10.1145/227683.
227684.
[Gro53] A. Grothendieck. Résumé de La Théorie Métrique Des Produits Tensoriels Topologiques.
Soc. de Matemática de São Paulo, 1953.
[Haa85] U. Haagerup. “The Grothendieck Inequality for Bilinear Forms on C*-Algebras”.
In: Advances in Mathematics 56.2 (May 1985), pages 93–116. doi: 10.1016/0001-
8708(85)90026-X.
[HI95] U. Haagerup and T. Itoh. “Grothendieck Type Norms For Bilinear Forms On
C*-Algebras”. In: Journal of Operator Theory 34.2 (1995), pages 263–283.
[Kai83] S. Kaijser. “A Simple-Minded Proof of the Pisier-Grothendieck Inequality”. In:
Banach Spaces, Harmonic Analysis, and Probability Theory. Springer, 1983, pages 33–
55.
[Kri78] J.-L. Krivine. “Constantes de Grothendieck et Fonctions de Type Positif Sur Les
Spheres”. In: Séminaire d’Analyse fonctionnelle (dit" Maurey-Schwartz") (1978),
pages 1–17.
Project 4: The NC Grothendieck Problem 247

[Kwa08] N. Kwak. “Principal Component Analysis Based on L1-Norm Maximization”. In:


IEEE Transactions on Pattern Analysis and Machine Intelligence 30.9 (Sept. 2008),
pages 1672–1680. doi: 10.1109/TPAMI.2008.114.
[MT11] M. McCoy and J. A. Tropp. “Two Proposals for Robust PCA Using Semidefinite
Programming”. In: Electronic Journal of Statistics 5.none (Jan. 2011), pages 1123–
1160. doi: 10.1214/11-EJS636.
[NRV14] A. Naor, O. Regev, and T. Vidick. “Efficient rounding for the noncommutative
Grothendieck inequality”. In: Theory Comput. 10 (2014), pages 257–295. doi:
10.4086/toc.2014.v010a011.
[Pis78] G. Pisier. “Grothendieck’s Theorem for Noncommutative C*-Algebras, with an
Appendix on Grothendieck’s Constants”. In: Journal of Functional Analysis 29.3
(Sept. 1978), pages 397–415. doi: 10.1016/0022-1236(78)90038-1.
[Pis12] G. Pisier. “Grothendieck’s Theorem, Past and Present”. In: Bulletin of the American
Mathematical Society 49.2 (Apr. 2012), pages 237–323. doi: 10.1090/S0273-0979-
2011-01348-9.
[RS09] P. Raghavendra and D. Steurer. “Towards Computing the Grothendieck Constant”.
In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Society for Industrial and Applied Mathematics, Jan. 2009, pages 525–534. doi:
10.1137/1.9781611973068.58.
[So09] A. M. So. “Improved Approximation Bound for Quadratic Optimization Problems
with Orthogonality Constraints”. In: Proceedings of the Twentieth Annual ACM-SIAM
Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics,
Jan. 2009, pages 1201–1209. doi: 10.1137/1.9781611973068.130.
[Ver18] R. Vershynin. High-dimensional probability. An introduction with applications in
data science, With a foreword by Sara van de Geer. Cambridge University Press,
Cambridge, 2018. doi: 10.1017/9781108231596.
5. Algebraic Riccati Equations

Date: 14 March 2022 Scribe: Taylan Kargin

This manuscript is a brief introduction to algebraic Riccati equations (AREs) named


Agenda:
after Venetian mathematician Jacopo Riccati (1676-1754) for his study of first-order
1. Motivation
quadratic differential equations [Ric24]. AREs are a certain group of nonlinear matrix 2. Metric Geometry of PSD Cone
equations that naturally arise in the context of control theory [Ber12], filtering and 3. Stability and Lyapunov Theory
estimation theory [Kal60; KSH00]. Given a complex-valued square matrix 𝑨 ∈ 𝕄𝑛 , 4. Discrete Algebraic Riccati
and positive-semidefinite (psd) matrices 𝑸 , 𝑺 ∈ ℍ𝑛+ , the quadratic matrix equation Equations

Find 𝑿 ∈ ℍ𝑛 : 𝑨 ∗ 𝑿 + 𝑿 𝑨 − 𝑿 𝑺 𝑿 + 𝑸 = 0,
is known as continuous algebraic Riccati equation (CARE) whereas the equation

Find 𝑿 ∈ ℍ𝑛 : 𝑿 = 𝑸 + 𝑨 ∗ (𝑺 + 𝑿 −1 ) −1 𝑨, (DARE)

is known as discrete algebraic Riccati equation (DARE). These equations are classified
as continuous and discrete since they naturally arise from studying continuous-time
and discrete-time dynamical systems in control and filtering problems. This manuscript
omits the continues version and deals with the discrete version, (DARE), only.
The rest of this manuscript is organized as follows. In Section 5.1, a central problem
in control theory is introduced to motivate the study of AREs. Section 5.2 follows
with a digression on the metric geometry of positive-definite (pd) matrices in order
to develop the tools to study AREs later in this manuscript. In Section 5.3, stability
of matrices and a closely related linear matrix equation are studied in the context of
linear dynamical systems. Section 5.4 concludes the manuscript with an analysis of the
solution of (DARE) by applying the tools and techniques developed in the previous
sections.

5.1 Motivation
In this section, the Linear-Quadratic Regulator (LQR) problem from optimal control
theory will be introduced. LQR problem is central to optimal control theory due to its
wide applicability in various real-world settings as well as its rich mathematical aspects
including AREs.
Example 5.1 (Discrete-time infinite-horizon LQRs). Consider the following linear dynamical
system,
𝒙 𝑡 +1 = 𝑨𝒙 𝑡 + 𝑩𝒖 𝑡 , for 𝑡 ≥ 0,
(LIN)
𝒙0 = 𝒛
where {𝒙 𝑡 }𝑡 ∈ℕ ⊂ ℂ𝑛 is the sequence of states, 𝒛 ∈ ℂ𝑛 is the initial state {𝒖 𝑡 }𝑡 ∈ℕ ⊂ ℂ𝑚
is the sequence of control inputs, 𝑨 ∈ 𝕄𝑛 is the state evolution matrix, and 𝑩 ∈ ℂ𝑛×𝑚
is the input matrix.
At each time step 𝑡 ∈ ℕ, a controller observes the current state 𝒙 𝑡 , exerts a control
input 𝒖 𝑡 to the dynamics, and then suffers from an instantaneous cost 𝑐 (𝒙 𝑡 , 𝒖 𝑡 ) as
a function of the state and the control input. In many real-world applications, it
Project 5: Algebraic Riccati Equations 249

is assumed that the instantaneous cost is a quadratic function of the state and the
control input, that is, 𝑐 (𝒙 𝑡 , 𝒖 𝑡 ) B 𝒙 𝑡∗𝑸 𝒙 𝑡 + 𝒖 𝑡∗𝑹𝒖 𝑡 where 𝑸 ∈ ℍ𝑛++ and 𝑹 ∈ ℍ𝑛++
are given positive-definite matrices. The objective of the controller is to reduce the
cumulative cost by deploying a control policy to design control inputs based on past
state observations. One straightforward approach is to deploy a linear state-feedback
controller 𝑲 ∈ ℂ𝑚×𝑛 such that the inputs are designed as 𝒖 𝑡 = 𝑲 𝒙 𝑡 for all 𝑡 ∈ ℕ.
In fact, the optimal control of the linear dynamical system (LIN) can be achieved
solely by a state-feedback policy as long as the pair (𝑨, 𝑩) is stabilizable (see the
definition 5.11) [Ber12]. The optimal state-feedback controller can be found by
minimizing the infinite-horizon cumulative cost among all state-feedback controllers
subject to linear dynamical system (LIN) as formulated below.
n ∑︁∞ o
min 𝑉 (𝒛 ; 𝑲 ) B 𝒙 𝑡∗𝑸 𝒙 𝑡 + 𝒖 𝑡∗𝑹𝒖 𝑡 ,
𝑲 ∈ℂ𝑚×𝑛 𝑡 =0
𝒙 𝑡 +1 = 𝑨𝒙 𝑡 + 𝑩𝒖 𝑡 , for 𝑡 ∈ ℕ, (OPT 1)
subject to 𝒖 𝑡 = 𝑲 𝒙 𝑡 , for 𝑡 ∈ ℕ,
𝒙0 = 𝒛.

Here, 𝒛 ∈ ℂ𝑛 is a given initial state and 𝑉 (𝒛 ; 𝑲 ) is the value function. The solution to
this problem is known as the infinite-horizon linear-quadratic regulator (LQR). 

The value function gives the total cost of controlling the linear dynamical sys-
tem (LIN) with state-feedback controller 𝑲 ∈ ℂ𝑚×𝑛 starting from an initial state
𝒛 ∈ ℂ𝑛 . Note that the value function is not guaranteed to take a finite value for
all 𝑲 ∈ ℂ𝑚×𝑛 . In fact, a state-feedback controller 𝑲 ∈ ℂ𝑚×𝑛 is called stabilizing if
𝑉 (𝒛 ; 𝑲 ) takes a finite value for any initial state 𝒛 ∈ ℂ𝑛 . Assuming that a state-feedback
controller 𝑲 ∈ ℂ𝑚×𝑛 is stabilizing, the value function can be expressed recursively as
∑︁∞ 
𝑉 (𝒛 ; 𝑲 ) = 𝒙 𝑡∗𝑸 𝒙 𝑡 + (𝑲 𝒙 𝑡 ) ∗𝑹 (𝑲 𝒙 𝑡 ) ,
∑︁𝑡∞=0
= 𝒙 𝑡∗ (𝑸 + 𝑲 ∗𝑹𝑲 ) 𝒙 𝑡 ,
𝑡 =0
∑︁∞
= 𝒙 ∗0 (𝑸 + 𝑲 ∗𝑹𝑲 ) 𝒙 0 + 𝒙 𝑡∗ (𝑸 + 𝑲 ∗𝑹𝑲 ) 𝒙 𝑡 ,
𝑡 =1
= 𝒛 ∗ (𝑸 + 𝑲 ∗𝑹𝑲 ) 𝒛 + 𝑉 ((𝑨 + 𝑩𝑲 )𝒛 ; 𝑲 ), (5.1)

for any 𝒛 ∈ ℂ𝑛 where (5.1) follows from taking the next step at time 𝑡 = 1 as the new
initial state.
Furthermore, it can be argued that the value function is quadratic in the initial
state 𝒛 by noting that the current state at time 𝑡 ≥ 0 is a linear function of the initial
state as 𝒙 𝑡 = (𝑨 + 𝑩𝑲 )𝑡 𝒛 and the instantaneous cost is quadratic in the current state.
Thus, one can write 𝑉 (𝒛 ; 𝑲 ) = 𝒛 ∗𝑷 𝒛 where 𝑷 ∈ ℍ𝑛+ is a psd matrix which depends on
the feedback gain matrix, 𝑲 . Inserting the quadratic form into the recursive relation
in (5.1), one obtains the following condition

𝒛 ∗ {𝑷 − (𝑸 + 𝑲 ∗𝑹𝑲 ) − (𝑨 + 𝑩𝑲 ) ∗𝑷 (𝑨 + 𝑩𝑲 )} 𝒛 = 0, for all 𝒛 ∈ ℂ𝑛 ,

which can be simplified by removing the 𝒛 -dependence as

(𝑨 + 𝑩𝑲 ) ∗𝑷 (𝑨 + 𝑩𝑲 ) − 𝑷 + 𝑸 + 𝑲 ∗𝑹𝑲 = 0. (5.2)

The equation (5.2) is known as the Lypaunov equation with state-feedback controller
𝑲 . This relationship makes it possible to rewrite the optimal control problem (OPT 1)
Project 5: Algebraic Riccati Equations 250

in the following equivalent form as

min 𝒛 ∗𝑷 𝒛 ,
𝑲 ∈ℂ𝑚×𝑛 , 𝑷 ∈ℍ𝑛
(𝑨 + 𝑩𝑲 ) ∗𝑷 (𝑨 + 𝑩𝑲 ) − 𝑷 + 𝑸 + 𝑲 ∗𝑹𝑲 = 0, (OPT 2)
subject to
𝑷 < 0.

Noticing that the Lyapunov equation (5.2) is quadratic with respect to 𝑲 and
𝑹 + 𝑩 ∗𝑷 𝑩  0, one can rewrite the equation (5.2) by completion of squares as

𝑷 = 𝑸 + 𝑨 ∗𝑷 𝑨 − 𝑨 ∗𝑷 𝑩 (𝑹 + 𝑩 ∗𝑷 𝑩) −1 𝑩 ∗𝑷 𝑨
(5.3)
+ (𝑲 − 𝑳) ∗ (𝑹 + 𝑩 ∗𝑷 𝑩)(𝑲 − 𝑳).

where 𝑳 B −(𝑹 + 𝑩 ∗𝑷 𝑩) −1 𝑩 ∗𝑷 𝑨 . The equivalence of (5.2) and (5.3) can be


verified directly by expanding the right-hand side of (5.3). This relationship is further
elaborated in Lemma 5.19 of Section 5.4. Since (𝑲 −𝑳) ∗ (𝑹 +𝑩 ∗𝑷 𝑩)(𝑲 −𝑳) < 0, it can
be concluded that a pair (𝑲 , 𝑷 ) ∈ ℂ𝑚×𝑛 × ℍ𝑛+ satisfying the Lyapunov equation (5.2)
satisfies the matrix inequality

𝑷 < 𝑸 + 𝑨 ∗𝑷 𝑨 − 𝑨 ∗𝑷 𝑩 (𝑹 + 𝑩 ∗𝑷 𝑩) −1 𝑩 ∗𝑷 𝑨. (5.4)

Conversely, for any 𝑷 ∈ ℍ𝑛+ satisfying the inequality (5.4), one can find a matrix 𝑲 ∈
ℂ𝑚×𝑛 such that the equation (5.3) holds. Therefore, the optimization problem (OPT 2)
is equivalent to

min 𝒛 ∗𝑷 𝒛 ,
𝑷 ∈ℍ𝑛
𝑷 < 𝑸 + 𝑨 ∗𝑷 𝑨 − 𝑨 ∗𝑷 𝑩 (𝑹 + 𝑩 ∗𝑷 𝑩) −1 𝑩 ∗𝑷 𝑨,
subject to
𝑷 < 0.

It can be argued that an optimal solution 𝑷 ★ < 0 to the problem above is attained
when the equality is satisfied, that is,

𝑷 ★ = 𝑸 + 𝑨 ∗𝑷 ★ 𝑨 − 𝑨 ∗𝑷 ★ 𝑩 (𝑹 + 𝑩 ∗𝑷 ★ 𝑩) −1 𝑩 ∗𝑷 ★ 𝑨

which is a discrete algebraic Riccati equation (DARE).


Attention will be devoted to the analysis of DARE and the related Lyapunov
equations in the rest of this manuscript. For the sake of generality, the complex-valued
case will be considered.

5.2 Metric Geometry of Positive-Definite Cone


In this section, the open convex cone of positive-definite matrices, denoted as ℙ𝑛 ,
will be studied from a metric geometry perspective. The concept of geometric mean
of two positive-definite matrices gives rise to a natural way of assigning a metric
distance to ℙ𝑛 . The results developed in this section will be used in Section 5.4 to
characterize the convergence of fixed-point iterations of Riccati operator. The proofs of
theorems and lemmas introduced in this section are omitted for the sake of brevity.
The interested reader is referred to the references [Bha03],[BH06], and [LW94] for
detailed discussion of the subject and the proofs omitted in this manuscript.
Suppose that a symmetric gauge function (sgf) Φ : ℝ𝑛 → ℝ+ is given. Denote by
k·k Φ : 𝕄𝑛 → ℝ+ unitarily invariant matrix norm associated to sgf Φ. The space of
Project 5: Algebraic Riccati Equations 251

Hermitian matrices equipped with this norm, (ℍ𝑛 , k·k Φ ) , is a real (complete) normed
vector space. The exponential map defined as

exp : ℍ𝑛 → ℙ𝑛 ,
∑︁∞ 1 𝑘
𝑯 ↦→ 𝑯 ,
𝑘 =0 𝑘!
is a diffeomorphism, that is, a smooth bijection with smooth inverse denoted as
log : ℙ𝑛 → ℍ𝑛 . Thus, the open convex cone of pd matrices constitute a smooth
manifold. It can be shown that the tangent space of ℙ𝑛 at a point 𝑿 ∈ ℙ𝑛 is isomorphic
to the space of Hermitian matrices, that is, 𝑇𝑿 ℙ𝑛  ℍ𝑛 .
A Riemmannian manifold is defined as a smooth manifold whose tangent space
at every point is equipped with a smooth inner product. Finsler manifolds generalize
Riemmannian manifolds by equipping tangent spaces at every point with a smooth
norm, instead. The following theorem shows that ℙ𝑛 attains a metric function, called
a Finsler metric, induced by a symmetric gauge function (sgf) defined on its tangent
space.

Theorem 5.2 (Metric induced by an sgf, [Bha03, p. 216]). Let Φ : ℝ𝑛 → ℝ+ be a Aside: GL (𝑛) is the general lin-
ear group of invertible square ma-
symmetric gauge function (sgf). The nonnegative function defined as trices of size 𝑛 ∈ ℕ.

𝛿Φ : ℙ𝑛 × ℙ𝑛 → ℝ+ ,
(𝑿 , 𝒀 ) ↦→ k log (𝑿 −1/2 𝒀 𝑿 −1/2 )k Φ ,

is a metric on ℙ𝑛 . Equipped with this metric, (ℙ𝑛 , 𝛿 Φ ) constitutes a Finsler


manifold where tangent space at any point is equipped with the norm k·k Φ . Fur-
thermore, 𝛿 Φ is invariant under matrix inversion and congruence transformations,
i.e.,
𝛿 Φ (𝑿 −1 ,𝒀 −1 ) = 𝛿Φ (𝑨 ∗ 𝑿 𝑨, 𝑨 ∗𝒀 𝑨) = 𝛿Φ (𝑿 ,𝒀 ),
for 𝑿 , 𝒀 ∈ ℙ𝑛 and 𝑨 ∈ GL (𝑛) .

Remark 5.3 The special case of Euclidean sgf results in a Riemmannian manifold
(ℙ𝑛 , 𝛿 2 ) since Euclidean norm admits an inner product. The metric 𝛿2 induced by
Euclidean norm is is called the Thompson metric.
The last part of Theorem 5.2 simply tells that inversion operation and group action
of GL (𝑛) onto ℙ𝑛 are isometries with respect to 𝛿 Φ . This invariance property will play
a role in the convergence analysis of fixed-point iterations of the Riccati operator.
The metric 𝛿 Φ is related to geometric mean of matrices in a natural way. The
following proposition shows that 𝛿 Φ is in fact the geodesic distance on ℙ𝑛 and that
Aside: A geodesic is a shortest-
the geodesics can be parameterized by geometric mean of matrices.
distance curve connecting two
Proposition 5.4 (Geodesics, [Bha03, p. 216]). Let 𝑿 ,𝒀 ∈ ℙ𝑛 be pd matrices. Define the points on a smooth manifold.
𝑡 −weighted geometric mean of two pd matrices as
𝑿 ♯𝑡 𝒀 B 𝑿 1/2 (𝑿 −1/2𝒀 𝑿 −1/2 )𝑡 𝑿 1/2 ,
for 𝑡 ∈ [ 0, 1] . The curve 𝛾 : 𝑡 ↦→ 𝑿 ♯𝑡 𝒀 starting at 𝑿 and ending at 𝒀 is a geodesic
wrt the metric 𝛿 Φ for any sgf Φ : ℝ𝑛 → ℝ+ , . In other words,

𝛿Φ (𝛾 (𝑠 ), 𝛾 (𝑡 )) = 𝛿 Φ (𝑿 ♯𝑠𝒀 , 𝑿 ♯𝑡 𝒀 ),
= |𝑠 − 𝑡 |𝛿 Φ (𝑿 ,𝒀 ) = |𝑠 − 𝑡 |𝛿Φ (𝛾 ( 0), 𝛾 ( 1)),
for 𝑠 , 𝑡 ∈ [ 0, 1] .
Project 5: Algebraic Riccati Equations 252

This section ends with some key definitions and theorems relevant to the conver-
gence analysis in Section 5.4.

Definition 5.5 (Contractions). Let 𝑓 : (ℙ𝑛 , 𝛿 Φ ) → (ℙ𝑛 , 𝛿 Φ ) be a self-map on ℙ𝑛 and


denote by

𝛿Φ ( 𝑓 (𝑿 ), 𝑓 (𝒀 ))
𝐿Φ ( 𝑓 ) B sup
𝑿 ,𝒀 ∈ℙ𝑛 , 𝑿 ≠𝒀 𝛿Φ (𝑿 ,𝒀 )

the Lipschitz constant of the map 𝑓 . The map 𝑓 is called

1. nonexpansive if 𝐿 Φ ( 𝑓 ) ≤ 1,

2. a strict contraction if 𝐿 Φ ( 𝑓 ) < 1.

The following lemma states a sufficient condition for a self-map to have a unique fixed
point as well as a simple iterative algorithm to approximate it.
Lemma 5.6 (Banach’s fixed-point theorem). Let 𝑓 : (ℙ𝑛 , 𝛿 Φ ) → (ℙ𝑛 , 𝛿 Φ ) be a strict
contraction on ℙ𝑛 . Then, there exists a unique fixed point 𝑿 ★ = 𝑓 (𝑿 ★ ) . Furthermore,
the iterates {𝑿 𝑘 }𝑘 ∈ℕ defined recursively as 𝑿 𝑘 +1 = 𝑓 (𝑿 𝑘 ) converge to the fixed point
𝑿 ★ for any 𝑿 0 ∈ ℙ𝑛 .
Finally, the next lemma shows that translations do not increase the distance in the
Finsler metric.
Lemma 5.7 (Translations are nonexpansive, [LL08, Prop. 4.1]). Let 𝑷 ∈ ℍ𝑛+ be a psd matrix.
For any 𝑿 ,𝒀 ∈ ℙ𝑛 , one has that

max (𝜆 max (𝑿 ), 𝜆 max (𝒀 ))


𝛿Φ (𝑷 + 𝑿 , 𝑷 + 𝒀 ) ≤ 𝛿 Φ (𝑿 ,𝒀 ).
𝜆min (𝑷 ) + max (𝜆 max (𝑿 ), 𝜆 max (𝒀 ))

5.3 Stability of Matrices


In this section, a notion of stability of linear dynamical systems will be investigated in
relation to spectral properties of matrices. Stability plays a central role in the theory
of dynamical systems as well as control theory. Since Riccati equations often arise
from studying nonautonomous dynamical systems, stability and related aspects are
recurrent themes in the study of Riccati equations.
This section starts with an introduction to concepts and results fundamental to the
theory of stability and ends with a subsection devoted to Lyapunov equations which lay
the foundations of algebraic Riccati equations as discussed in Section 5.1.
Example 5.8 (Linear autonomous dynamics). Let 𝑨 ∈ 𝕄𝑛 be a complex-valued matrix.
Consider the linear dynamical system,

𝒙 𝑡 +1 = 𝑨𝒙 𝑡 for 𝑡 ∈ ℕ, and 𝒙 0 ∈ ℂ𝑛 . (5.5)

The state at time 𝑡 ≥ 0 can be expressed in closed form as 𝒙 𝑡 = 𝑨 𝑡 𝒙 0 . It is common to


study dynamical systems by investigating whether the trajectory {𝒙 𝑡 }𝑡 ∈ℕ converges
to an equilibrium state 𝒙 ★ ∈ ℂ𝑛 such that 𝒙 ★ = 𝑨𝒙 ★ or otherwise the trajectory
{𝒙 𝑡 }𝑡 ∈ℕ diverges and and the state blows up as time goes to infinity. Due to linearity
of the state transitions, one can take 𝒙 ★ = 0 without loss of generality by transforming
𝒙 𝑡 ↦→ 𝒙 𝑡 − 𝒙 ∗ for all 𝑡 ∈ ℕ . The dynamical system (5.5) is said to be globally
asymptotically stable (GAS) if the trajectory {𝒙 𝑡 }𝑡 ∈ℕ converges to the equilibrium
Project 5: Algebraic Riccati Equations 253

𝑥 ★ = 0 for any starting point 𝒙 0 ∈ ℂ𝑛 . A necessary and sufficient condition for this to
happen is lim𝑡 →∞ 𝑨 𝑡 = 0. 

The limit condition for stability stated in the example above could be impractical
to verify in many settings. Therefore, alternative ways of feasibly certifying stability of
linear dynamical systems is needed.

Definition 5.9 (Spectral radius). For a complex-valued matrix 𝑨 ∈ 𝕄𝑛 , denote by


sp (𝑨) B {𝜆 ∈ ℂ | det (𝑨 − 𝜆I𝑛 ) = 0} the spectrum of it. The spectral radius of 𝑨
is defined as 𝜌 (𝑨) B sup𝜆∈sp (𝑨) |𝜆| .

Spectral radius serves as an equivalent criterion for certifying stability as stated by the
following lemma.
Lemma 5.10 The limit lim𝑡 →∞ 𝑨 𝑡 = 0 holds if and only if all the eigenvalues of 𝑨 are
inside the open unit disk of the complex plane, that is, 𝜌 (𝑨) < 1.

Proof. Taking the Jordan canonical form 𝑨 = 𝑷 𝑱 𝑷 −1 , one gets


 
lim 𝑨 𝑡 = lim 𝑷 𝑱 𝑡 𝑷 −1 = 𝑷 lim 𝑱 𝑡 𝑷 −1 .

𝑡 →∞ 𝑡 →∞ 𝑡 →∞

It can be shown that the limit lim𝑡 →∞ 𝑱 𝑡 = 0 holds if and only if for all eigenvalues
𝜆 ∈ sp (𝑨) , the limit lim𝑡 →∞ 𝑱 𝑡𝜆 = 0 holds for each Jordan block corresponding to
eigenvalue 𝜆. Since the elements of 𝑱 𝑡𝜆 consist only of powers of 𝜆𝑖 upto 𝑡 , the limit
condition is satisfied if and only if all the eigenvalues have modulus less than 1. 
The notion of stability of dynamical systems can be extended to matrices by applying
the spectral radius criterion.

Definition 5.11 (Stable matrices). A matrix 𝑨 ∈ 𝕄𝑛 is called (Schur) stable if all the
eigenvalues of 𝑨 are strictly inside the unit circle, that is, 𝜌 (𝑨) < 1. A pair of
matrices (𝑨, 𝑩) is said to be stabilizable if there exists a matrix 𝑲 ∈ ℂ𝑚×𝑛 such
that 𝑨 + 𝑩𝑲 is stable.

As discussed earlier in Section 5.1 and at the beginning of this section, Lyapunov
equations are foundational in the analysis of Riccati equations. Lyapunov equations
are named after Russian mathematician Aleksandr Mikhailovich Lyapunov (1857-1918)
who pioneered the mathematical theory of stability of dynamical systems and worked
in the fields of mathematical physics, probability theory and potential theory [SMI92].
He originally proposed Lyapunov function method to certify stability of equilibrium
points of ordinary differential equations.
When applied to discrete-time linear dynamical systems, Lyapunov function method
leads to Lyapunov stability theorem for certifying stability by showing existence of
solution to a Lyapunov equation as stated in Theorem 5.14. The details of Lyapunov
function method is omitted in this manuscript for the sake of brevity. However, the
interested reader is refereed to [Kha96] for detailed derivations.
The proof of Theorem 5.14 relies on the following lemmas.
Lemma 5.12 (Gelfand’s formula, [Lax02, Thm. 17.4]). Suppose a consistent matrix norm
k·k is given. For any matrix 𝑨 ∈ 𝕄𝑛 , the spectral radius is bounded from above as
𝜌 (𝑨) ≤ k𝑨 𝑘 k 1/𝑘 for all 𝑘 ∈ ℕ. Furthermore, k𝑨 𝑘 k 1/𝑘 ↓ 𝜌 (𝑨) as 𝑘 → ∞.
Gelfand’s formula provides a way to approximate spectral radius from matrix
norms.
Project 5: Algebraic Riccati Equations 254

Lemma 5.13 (Bounded matrix norm, [HJ13, Lem. 5.6.10]). Let 𝑨 ∈ 𝕄𝑛 and 𝜖 > 0 be given.
There exists a submultiplicative matrix norm k·k 𝑨,𝜖 such that 𝜌 (𝑨) ≤ k𝑨 𝑠 k 𝑨,𝜖 ≤
𝜌 (𝑨) + 𝜖 .

Theorem 5.14 (Lyapunov stability theorem). Let 𝑨 ∈ 𝕄𝑛 be given. The following are
equivalent.

1. The linear dynamical system 𝒙 𝑡 +1 = 𝑨𝒙 𝑡 is GAS.


2. For every pd matrix 𝑸 ∈ ℙ𝑛 , there exists a unique pd matrix 𝑷 ∈ ℙ𝑛 such
that 𝑷 − 𝑨 ∗𝑷 𝑨 = 𝑸 .

Proof. Suppose that the equation 𝑷 − 𝑨 ∗𝑷 𝑨 = 𝑸 holds for 𝑷  0 and 𝑸  0. Aside: Spectral norm k𝐴 k 2 is
Multiplying both hand sides with 𝑷 −1/2 from left and right as equal to the maximum singular
value.
I𝑛 − (𝑷 −1/2 𝑨 ∗𝑷 1/2 )(𝑷 1/2 𝑨𝑷 −1/2 ) = 𝑷 −1/2𝑸 𝑷 −1/2  0,

one obtains the bound k𝑷 1/2 𝑨𝑷 −1/2 k 2 < 1 on the spectral norm. Since the mapping
𝑨 ↦→ 𝑷 1/2 𝑨𝑷 −1/2 is a similarity transformation, the characteristic polynomial of
𝑷 1/2 𝑨𝑷 −1/2 and its spectrum are unchanged. Thus, one can certify stability of 𝑨 by
noticing
𝜌 (𝑨) = 𝜌 (𝑷 1/2 𝑨𝑷 −1/2 ) ≤ k𝑷 1/2 𝑨𝑷 −1/2 k 2 < 1,
where the first inequality is due to Lemma 5.12.
For the converse, suppose 𝜌 (𝑨) < 1. Consider the sequence {𝑷 𝑡 }𝑡 ∈ℕ generated
by the iterations 𝑷 𝑡 +1 = 𝑸 + 𝑨 ∗𝑷 𝑡 𝑨 with 𝑷 0 = 0 for a given pd matrix 𝑸  0. The
iterate at time 𝑡 ≥ 0 can be expressed in closed form as
∑︁𝑡 −1
𝑷𝑡 = (𝑨 ∗ ) 𝑠 𝑸 𝑨 𝑠  0.
𝑠 =0

Notice that the iterates are nondecreasing, that is, 𝑷 𝒕 +1 < 𝑷 𝒕  0 for all 𝑡 ≥ 0. Let
𝜖 > 0 be small enough such that 𝜌 (𝑨) + 𝜖 < 1. By Lemma 5.12, there exists a constant
𝑇𝜖 ∈ ℕ such that 𝜌 (𝑨) ≤ k𝑨 𝑡 k 1/𝑡 < 𝜌 (𝑨) + 𝜖 < 1 for all 𝑡 ≥ 𝑇𝜖 . Therefore, for any
𝑡 ≥ 𝑠 ≥ 𝑇𝜖 + 1, the distance between two iterates can be bounded as
∑︁𝑡 −1
k𝑷 𝑡 − 𝑷 𝑠 k 2 = (𝑨 ∗ ) 𝑘 𝑸 𝑨 𝑘 ,
𝑘 =𝑠 2
∑︁𝑡 −1
≤ k (𝑨 ∗ ) 𝑘 𝑸 𝑨 𝑘 k 2 ,
𝑘 =𝑠
∑︁𝑡 −1
≤ k𝑸 k 2 k𝑨 𝑘 k 22 ,
𝑘 =𝑠
∑︁𝑡 −1
≤ k𝑸 k 2 (𝜌 (𝑨) + 𝜖) 2𝑘 ,
𝑘 =𝑠
(𝜌 (𝑨) + 𝜖) 2𝑠 − (𝜌 (𝑨) + 𝜖) 2𝑡 (𝜌 (𝑨) + 𝜖) 2𝑠
= k𝑸 k 2 2
≤ k𝑸 k 2 ,
1 − (𝜌 (𝑨) + 𝜖) 1 − (𝜌 (𝑨) + 𝜖) 2

where the first inequality is due to triangle inequality and the second one is due to
submultiplicative property of spectral norm. Notice that the bound above is independent
of 𝑡 which leads to the bound

(𝜌 (𝑨) + 𝜖) 2𝑠
sup k𝑷 𝑡 − 𝑷 𝑠 k 2 ≤ k𝑸 k 2 .
𝑡 ≥𝑠 1 − (𝜌 (𝑨) + 𝜖) 2
Project 5: Algebraic Riccati Equations 255

The right-hand side converges to zero as 𝑠 → ∞ since 𝜌 (𝑨) + 𝜖 < 1. Therefore, the
sequence of iterates, {𝑷 𝑡 }𝑡 ∈ℕ , is Cauchy in ℍ𝑛+ and there exists a limit lim𝑡 →∞ 𝑷 𝑡 =
𝑷 ∈Í ℍ𝑛+ such that 𝑷 < 𝑷 𝑡  0. The limit can be expressed as an infinite sum as
𝑷 = 𝑡∞=0 (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 . Substituting this limit in the Lyapunov equation yields
∑︁∞ ∑︁∞
𝑷 − 𝑨 ∗𝑷 𝑨 = (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 − (𝑨 ∗ )𝑡 +1𝑸 𝑨 𝑡 +1 ,
∑︁𝑡∞=0 ∑︁𝑡∞=0
= (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 − (𝑨 ∗ )𝑡 +1𝑸 𝑨 𝑡 +1 ,
∑︁𝑡∞=0 ∑︁𝑡∞=0
= (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 − (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 = 𝑸 .
𝑡 =0 𝑡 =1
In order to show uniqueness of the solution of the Lyapunov equation, suppose
𝑷 1 , 𝑷 2  0 are two different solutions. Subtracting the equation 𝑷 1 = 𝑨 ∗𝑷 1 𝑨 + 𝑸
from 𝑷 2 = 𝑨 ∗𝑷 2 𝑨 + 𝑸 , one obtains
𝑷 1 − 𝑷 2 = 𝑨 ∗ (𝑷 1 − 𝑷 2 )𝑨.
Observe that multiplying the equality above repeatedly with 𝑨 ∗ from left and with 𝑨
from right yields
∑︁𝑡 −1
𝑷1 − 𝑷2 = (𝑨 ∗ ) 𝑠 (𝑷 1 − 𝑷 2 )𝑨 𝑠 ,
𝑠 =0
for any 𝑡 ≥ 0. Let 𝜖 > 0 be given such that 𝜌 (𝑨) + 𝜖 < 1. By Lemma 5.13, there
exists a submultiplicative norm k·k 𝑨,𝜖 such that k𝑨 𝑠 k 𝑨,𝜖 ≤ (𝜌 (𝑨) + 𝜖) 𝑠 . By triangle
inequality and submultiplicative property, the distance between the two solutions can
be bounded as
∑︁𝑡 −1
k𝑷 1 − 𝑷 2 k 𝑨,𝜖 ≤ k𝑷 1 − 𝑷 2 k 𝑨,𝜖 k𝑨 𝑠 k 2𝑨,𝜖 ,
𝑠 =0
∑︁𝑡 −1
≤ k𝑷 1 − 𝑷 2 k 𝑨,𝜖 (𝜌 (𝑨) + 𝜖) 2𝑠 ,
𝑠 =0
(𝜌 (𝑨) + 𝜖) 2𝑡
= k𝑷 1 − 𝑷 2 k 𝑨,𝜖 ,
1 − (𝜌 (𝑨) + 𝜖) 2
which holds for all 𝑡 ≥ 0. Taking the limit as 𝑡 → ∞, one gets that k𝑷 1 − 𝑷 2 k 𝑨,𝜖 = 0,
which is a contradiction. Hence, the solution must be unique. 

5.3.1 Lyapunov Equation


Given a matrix 𝑨 ∈ 𝕄𝑛 , the discrete Lyapunov operator L𝑨 : ℍ𝑛 → ℍ𝑛 is a linear
operator defined as L𝑨 (𝑷 ) = 𝑷 − 𝑨 ∗𝑷 𝑨 for any 𝑷 ∈ ℍ𝑛 . The following proposition
lists some of the equivalent properties of L𝑨 .
Proposition 5.15 (Properties of Lyapunov operator). Suppose 𝑨 ∈ 𝕄𝑛 is given. The
following statements are equivalent.
1. 𝑨 is Schur stable.
2. L𝑨 (𝑷 )  0 for all 𝑷  0.
3. There exists pd matrices 𝑷  0, and 𝑸  0 such that L𝑨 (𝑷 ) = 𝑸 .
4. There exists pd matrices 𝑷  0, and 𝑸  0 such that L𝑨 (𝑷 )  𝑸 .
5. The inverse operator L𝑨−1 : ℍ𝑛 → ℍ𝑛 exists.
Proof. Equivalence of these statements can be shown following a direction similar to
the proof of Theorem 5.14. 
The unique solution to the Lyapunov equation is constructed in the proof of
Theorem 5.14 for a pd matrix 𝑸  0. The same form of the solution is valid for any
Hermitian matrix 𝑸 ∈ ℍ𝑛 as long as 𝑨 is stable.
Project 5: Algebraic Riccati Equations 256

Theorem 5.16 (Solution operator, [Lan70, Thm. 3]). Let 𝑨 ∈ 𝕄𝑛 be a stable matrix. The
solution operator L𝑨−1 : ℍ𝑛 → ℍ𝑛 can be expressed as
∑︁∞
L𝑨−1 (𝑸 ) = (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 , (5.6)
𝑡 =0

for any 𝑸 ∈ ℍ𝑛 .

Proof. Since 𝑨 is stable, all of its eigenvalues are inside the open unit disk. Hence, the
series 𝑡∞=0 (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 converges for any 𝑸 ∈ ℍ𝑛 (see the proof of Theorem 5.14).
Í
By evaluating the Lyapunov operator at the candidate solution (5.6), one gets
∑︁∞  ∑︁∞ ∑︁∞ 
L𝑨 (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 = (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 − 𝑨 ∗ (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 𝑨,
𝑡 =0 𝑡 =0 𝑡 =0
∑︁∞ ∑︁∞
∗ 𝑡
= (𝑨 ) 𝑸 𝑨 − 𝑡
(𝑨 ∗ )𝑡 +1𝑸 𝑨 𝑡 +1 ,
∑︁𝑡∞=0 ∑︁𝑡∞=0
= (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 − (𝑨 ∗ )𝑡 𝑸 𝑨 𝑡 ,
𝑡 =0 𝑡 =1
= 𝑸.
Uniqueness of the solution operator for stable 𝑨 can be argued in a similar fashion as
in the proof of Theorem 5.14.

Remark 5.17 (Solution in Kronecker product form). The Lyapunov equation can also be
solved for general 𝑨, 𝑸 ∈ 𝕄𝑛 by vectorizing and using Kronecker products as

( I ⊗ I − 𝑨 | ⊗ 𝑨 ∗ ) vec (𝑷 ) = vec (𝑸 ).
If the eigenvalues of matrix 𝑨 are such that 𝜆𝑖 𝜆∗𝑗 ≠ 1, then the unique solution exists
for any 𝑸 and is given by

vec (𝑷 ) = ( I ⊗ I − 𝑨 | ⊗ 𝑨 ∗ ) −1 vec (𝑸 ).

Notice that the special case 𝜌 (𝑨) < 1 satisfies the spectrum condition since all
eigenvalue have modulus less than 1.

5.4 Discrete Algebraic Riccati Equations


In the previous sections, several sophisticated tools and techniques were developed.
In this section, existence and uniqueness of the solution of discrete algebraic Riccati
equations (DARE) will be investigated utilizing these tools and techniques.
Let 𝑨 ∈ 𝕄𝑛 , 𝑩 ∈ ℂ𝑛×𝑚 be complex-valued matrices and 𝑸 ∈ ℙ𝑛 , 𝑹 ∈ ℙ𝑚 be pd
matrices. The Riccati operator R : ℍ𝑛+ → ℍ𝑛+ is defined as

R(𝑿 ) = 𝑸 + 𝑨 ∗ 𝑿 𝑨 − 𝑨 ∗ 𝑿 𝑩 (𝑹 + 𝑩 ∗ 𝑿 𝑩) −1 𝑩 ∗ 𝑿 𝑨,

for any 𝑿 ∈ ℍ𝑛+ . Lemma 5.19 establishes the formal link between Riccati and Lyapunov
equations as hinted in Section 5.1. The following lemma is needed in order to prove
Lemma 5.19.
Lemma 5.18 (Block upper-diagonal-lower (UDL) decomposition). Let 𝑨 ∈ 𝕄𝑛 , 𝑩 ∈ ℂ𝑛×𝑚 ,
𝑪 ∈ ℂ𝑚×𝑛 , and 𝑫 ∈ 𝕄𝑚 be given complex matrices. The following decomposition is
admitted as long as 𝑫 is invertible.

𝑩𝑫 −1 𝑨 − 𝑩𝑫 −1𝑪 0𝑛𝑚
     
𝑨 𝑩 I𝑛 I𝑛 0𝑛𝑚
= .
𝑪 𝑫 0𝑚𝑛 I𝑚 0𝑚𝑛 𝑫 𝑫 −1𝑪 I𝑚
Project 5: Algebraic Riccati Equations 257

Proof. The proof follows directly by multiplying the matrices in the right-hand side. 
Lemma 5.19 (A useful upper bound). Consider the function Γ : ℂ𝑚×𝑛 × ℍ𝑛+ → ℍ𝑛+ defined
as
Γ(𝑲 , 𝑿 ) B (𝑨 + 𝑩𝑲 ) ∗ 𝑿 (𝑨 + 𝑩𝑲 ) + 𝑸 + 𝑲 ∗𝑹𝑲 .
For any 𝑲 ∈ ℂ𝑚×𝑛 and 𝑿 ∈ ℍ𝑛+ , it holds that R(𝑿 ) 4 Γ(𝑲 , 𝑿 ) . Furthermore, the
equality R(𝑿 ) = Γ(𝑲 ★ (𝑿 ), 𝑿 ) holds for

𝑲 ★ (𝑿 ) B −(𝑹 + 𝑩 ∗ 𝑿 𝑩) −1 𝑩 ∗ 𝑿 𝑨.

Proof. Observe that Γ(𝑲 , 𝑿 ) is quadratic in 𝑲 and linear in 𝑿 as it can be expressed


as
 𝑨∗𝑿 𝑨 + 𝑸 𝑨∗𝑿 𝑩
  
I𝑛
Γ(𝑲 , 𝑿 ) = I𝑛 𝑲 ∗

∗ ∗ ,
𝑩 𝑿𝑨 𝑹 +𝑩 𝑿𝑩 𝑲
Since 𝑹 + 𝑩 ∗ 𝑿 𝑩  0, block UDL decomposition can be applied to the middle block
matrix in (5.4) by Lemma 5.18 as

𝑨∗𝑿 𝑨 + 𝑸 𝑨∗𝑿 𝑩
 
= 𝑼 𝑫𝑼 ∗ ,
𝑩 ∗𝑿 𝑨 𝑹 + 𝑩 ∗𝑿 𝑩

where
𝑨 ∗ 𝑿 𝑩 (𝑹 + 𝑩 ∗ 𝑿 𝑩) −1
 
I
𝑼 = 𝑛 ,
0𝑛 I𝑛
and
𝑨 ∗ 𝑿 𝑨 + 𝑸 − 𝑨 ∗ 𝑿 𝑩 (𝑹 + 𝑩 ∗ 𝑿 𝑩) −1 𝑩 ∗ 𝑿 𝑨
 
0𝑛
𝑫= .
0𝑛 𝑹 + 𝑩 ∗𝑿 𝑩
Notice that the first block of 𝑫 is same as R(𝑿 ) and the upper second block of 𝑼 is
simply −𝑲 ★ (𝑿 ) ∗ . Using this decomposition, Γ(𝑲 , 𝑿 ) can be rewritten as
 
I
Γ(𝑲 , 𝑿 ) = I𝑛 𝑲 ∗ 𝑼 𝑫𝑼 ∗ 𝑛 ,
 
𝑲
  
 ∗ ∗
 R(𝑿 ) 0𝑛 I𝑛
= I𝑛 𝑲 − 𝑲 ★ (𝑿 ) ,
0𝑛 𝑹 + 𝑩 ∗𝑿 𝑩 𝑲 − 𝑲 ★ (𝑿 )
= R(𝑿 ) + (𝑲 − 𝑲 ★ (𝑿 )) ∗ (𝑹 + 𝑩 ∗ 𝑿 𝑩)(𝑲 − 𝑲 ★ (𝑿 )).

Thus, Γ(𝑲 , 𝑿 ) < R(𝑿 ) for any 𝑲 ∈ ℂ𝑚×𝑛 and 𝑿 ∈ ℍ𝑛+ with equality only if
𝑲 = 𝑲 ★ (𝑿 ) . 
Proposition 5.20 (Properties of Riccati operator). Suppose 𝑿 , 𝑿 0 ∈ ℍ𝑛+ . Then, we have
that

• Monotone. R(𝑿 ) 4 R(𝑿 0) whenever 𝑿 4 𝑿 0,


• Concave. R(𝑡 𝑿 + ( 1 − 𝑡 )𝑿 0) < 𝑡 R(𝑿 ) + ( 1 − 𝑡 ) R(𝑿 0) for 𝑡 ∈ [ 0, 1] ,
• Continuous. If {𝑿 𝑘 }𝑘 ∈ℕ ⊂ ℍ𝑛+ is a sequence such that lim 𝑿 𝑘 = 𝑿 as 𝑘 → ∞,
then lim R(𝑿 𝑘 ) = R(𝑿 ) as 𝑘 → ∞.

Proof. To show monotonicity, suppose 𝑿 4 𝑿 0. By Lemma 5.19, one has that


R(𝑿 ) 4 Γ(𝑲 , 𝑿 ) 4 Γ(𝑲 , 𝑿 0) for any 𝑲 ∈ ℂ𝑚×𝑛 . This holds in particular for
𝑲 ★ (𝑿 0) . Hence, R(𝑿 ) 4 Γ(𝑲 ★ (𝑿 0), 𝑿 ) 4 Γ(𝑲 ★ (𝑿 0), 𝑿 0) = R(𝑿 0) .
Project 5: Algebraic Riccati Equations 258

In order to show concavity, Lemma 5.19 and linearity of Γ(𝑲 , 𝑿 ) wrt 𝑿 can be
used to write

Γ(𝑲 , 𝑡 𝑿 + ( 1 − 𝑡 )𝑿 0) = 𝑡 Γ(𝑲 , 𝑿 ) + ( 1 − 𝑡 )Γ(𝑲 , 𝑿 0)


< 𝑡 R(𝑿 ) + ( 1 − 𝑡 ) R(𝑿 0)

for any 𝑲 ∈ ℂ𝑚×𝑛 and 𝑡 ∈ [ 0, 1] . Choosing 𝑲 = 𝑲 ★ (𝑡 𝑿 + ( 1 − 𝑡 )𝑿 0) yields the


desired result.
Continuity of R(·) follows trivially from continuity of matrix inversion, congruence
transformations, translations and matrix multiplication.

Characterizing the conditions for existence and uniqueness of the solutions to DARE
lies at the heart of the analysis of Riccati equations. The following theorem gives a
necessary and sufficient condition for the existence and uniqueness of a solution, 𝑿 ★ ,
for which the the closed-loop matrix 𝑨 + 𝑩𝑲 ★ (𝑿 ★ ) is stable.

Theorem 5.21 (Existence and uniqueness of stabilizing fixed point). If the pair (𝑨, 𝑩) is
stabilizable, then there exists a unique positive-definite fixed point 𝑿 ★  0 of the
Riccati operator such that 𝑨 + 𝑩𝑲 ★ (𝑿 ★ ) is a stable matrix. Conversely, if there
exists a positive definite fixed-point 𝑿 ★  0, then it is uniquely determined and
𝑨 + 𝑩𝑲 ★ (𝑿 ★ ) is a stable matrix.

Proof. For the converse direction, suppose 𝑿 ★ = R(𝑿 ★ )  0, that is,

𝑿 ★ = Γ(𝑲 ★ , 𝑿 ★ ) = (𝑨 + 𝑩𝑲 ★ ) ∗ 𝑿 ★ (𝑨 + 𝑩𝑲 ★ ) + 𝑸 + 𝑲 ∗★𝑹𝑲 ★

where 𝑲 ★ B 𝑲 ★ (𝑿 ★ ) . Notice that the positive-definite matrix 𝑿 ★  0 satisfies the


Lyapunov equation with parameters 𝑨 + 𝑩𝑲 ★ and 𝑸 + 𝑲 ∗★𝑹𝑲 ★  0. Hence, 𝑨 + 𝑩𝑲 ★
is a stable matrix by Proposition 5.15 and 𝑿 ★ is uniquely determined by the unique
solution to the Lyapunov equation Γ(𝑲 ★ , 𝑿 ★ ) = 𝑿 ★ . This proves the uniqueness and
the stabilizing property of a positive-definite solution.
Now, suppose that (𝑨, 𝑩) is a stabilizable pair, that is, there exists a matrix
𝑲 0 ∈ ℂ𝑚×𝑛 such that 𝑨 + 𝑩𝑲 0 is stable. By Proposition 5.15, there exists a unique
positive-definite matrix 𝑿 0  0 satisfying the Lyapunov equation Γ(𝑲 0 , 𝑿 0 ) = 𝑿 0
with parameters 𝑨 + 𝑩𝑲 0 and 𝑸 + 𝑲 ∗0𝑹𝑲 0  0. Then, the inequality R(𝑿 0 ) =
Γ(𝑲 ★ (𝑿 0 ), 𝑿 0 ) 4 Γ(𝑲 0 , 𝑿 0 ) = 𝑿 0 holds by Lemma 5.19. Denoting 𝑲 1 B 𝑲 ★ (𝑿 0 ) ,
the inequality Γ(𝑲 1 , 𝑿 0 ) 4 𝑿 0 implies that 𝑨 + 𝑩𝑲 1 is a stable matrix by the 4th item
of Proposition 5.15. Hence, there exists a positive-definite matrix 𝑿 1  0 satisfying
Lyapunov equation Γ(𝑲 1 , 𝑿 1 ) = 𝑿 1 with parameters 𝑨 + 𝑩𝑲 1 and 𝑸 + 𝑲 ∗1𝑹𝑲 1  0.
In addition, the inequality R(𝑿 1 ) = Γ(𝑲 ★ (𝑿 1 ), 𝑿 1 ) 4 Γ(𝑲 1 , 𝑿 1 ) = 𝑿 1 holds by
Lemma 5.19. Taking the difference Γ(𝑲 1 , 𝑿 0 ) − Γ(𝑲 1 , 𝑿 1 ) = R(𝑿 0 ) − 𝑿 1 4 𝑿 0 − 𝑿 1 ,
one obtains

(𝑿 0 − 𝑿 1 ) − (𝑨 + 𝑩𝑲 1 ) ∗ (𝑿 0 − 𝑿 1 )(𝑨 + 𝑩𝑲 1 ) < 0.

By the 2th item of Proposition 5.15, it can be seen that 𝑿 0 < 𝑿 1 which implies
R(𝑿 0 ) < R(𝑿 1 ) by monotonicity.
Recursively repeating this process, a nonincreasing sequence of pd matrices
{𝑿 𝑘 }𝑘 ∈ℕ can be constructed such that 𝑿 𝑘 < 𝑿 𝑘 +1 . The stabilizing matrix of the next
step is 𝑲 𝑘 +1 B 𝑲 ★ (𝑿 𝑘 ) such that 𝑨 + 𝑩𝑲 𝑘 +1 is stable. Each iterate also satisfies the
Lyapunov equation Γ(𝑲 𝑘 , 𝑿 𝑘 ) = 𝑿 𝑘  0.
Project 5: Algebraic Riccati Equations 259

Since {𝑿 𝑘 }𝑘 ∈ℕ is nonincreasing and bounded below, there exists a limit 𝑿 ★ =


lim 𝑿 𝑘 such that 𝑿 𝑘 < 𝑿 ★ < 0 for any 𝑘 ∈ ℕ. Since the mapping 𝑲 ★ : 𝑿 ↦→ 𝑲 ★ (𝑿 )
is continuous, the limit lim 𝑲 𝑘 = 𝑲 ★ (𝑿 ★ ) exists. Taking the limit of both hand sides
of the equation Γ(𝑲 𝑘 , 𝑿 𝑘 ) = 𝑿 𝑘 leads to the equation Γ(𝑲 ★ (𝑿 ★ ), 𝑿 ★ ) = 𝑿 ★ . In
other words, 𝑿 ★ < 0 is a fixed point R(𝑿 ★ ) = 𝑿 ★ .
In order to show positivity and uniqueness of 𝑿 ★ , observe that 0 ≺ 𝑸 = R( 0) 4
R(𝑿 ★ ) = 𝑿 ★ due to monotonicity of the operator R. Therefore, the fact that
𝑿 ★  0 together with the converse theorem implies uniqueness of 𝑿 ★ and stability of
𝑨 + 𝑩𝑲 ★ (𝑿 ★ ) . 
Remark 5.22 In fact, any positive-semidefinite fixed point must necessarily be nonsingu-
lar and unique whenever (𝑨, 𝑩) is stabilizable and 𝑸  0. If the condition 𝑸 < 0 is
relaxed, then the unique stabilizing fixed-point might be singular.

5.4.1 Convergence Analysis of Fixed-Point Iterates


In this section, strict contraction property of Riccati operator with respect to Finsler
metric 𝛿 Φ will be shown and convergence rate of fixed-point iterations 𝑿 𝑘 +1 = R(𝑿 𝑘 )
will be analyzed. In order to simplify the analysis, the domain of Riccati operator will
be restricted to the cone of pd matrices, ℙ𝑛 . In this case, matrix inversion lemma can
be used to rewrite the Riccati operator as

R(𝑿 ) = 𝑸 + 𝑨 ∗ 𝑿 − 𝑿 𝑩 (𝑹 + 𝑩 ∗ 𝑿 𝑩) −1 𝑩 ∗ 𝑿 𝑨,

 −1
= 𝑸 + 𝑨 ∗ 𝑺 + 𝑿 −1 𝑨

where 𝑺 = 𝑩𝑹 −1 𝑩 ∗ . For the sake of brevity in analysis, it will be assumed that 𝑺  0


in the rest of this manuscript. However, similar results can be obtained for the case of
invertible 𝑨 and 𝑺 < 0. Interested reader is referred to [Bou93], [LL07], and [LL08].

Theorem 5.23 (Strict contraction of Riccati map, [LL08, Thm. 4.4]). Suppose that 𝑸 , 𝑺 
0 are positive definite. Let 𝑨 ∈ 𝕄𝑛 be a complex-valued matrix. Then, the Riccati
map is a strict contraction with respect to 𝛿 Φ with the Lipschitz constant

𝜆max (𝑨 ∗𝑺 −1 𝑨)
𝐿 Φ ( R) ≤ < 1,
𝜆 min (𝑸 ) + 𝜆 max (𝑨 ∗𝑺 −1 𝑨)
for any sgf Φ.

Proof. Assume that 𝑨 is nonsingular. One has that 0 ≺ (𝑺 + 𝑿 −1 ) −1 4 𝑺 −1 for 𝑿 ∈ ℙ𝑛


by order-reversing property of matrix inversion. Congruence transformation of both
sides by 𝑨 leads to 0 ≺ 𝑨 ∗ (𝑺 + 𝑿 −1 ) −1 𝑨 4 𝑨 ∗𝑺 −1 𝑨 . Weyl’s monotonicity theorem
implies that
0 < 𝜆 max (𝑨 ∗ (𝑺 + 𝑿 −1 ) −1 𝑨) ≤ 𝜆 max (𝑨 ∗𝑺 −1 𝑨)
for any 𝑿 ∈ ℙ𝑛 .
Suppose 𝑿 ,𝒀 ∈ ℙ𝑛 . Define

𝛼 B max (𝜆 max (𝑨 ∗ (𝑺 + 𝑿 −1 ) −1 𝑨), 𝜆 max (𝑨 ∗ (𝑺 + 𝒀 −1 ) −1 𝑨)).

Notice that 𝛼 ≤ 𝜆 max (𝑨 ∗𝑺 −1 𝑨) . Then, for any sgf Φ, the Finsler distance between
Project 5: Algebraic Riccati Equations 260

two points under the Riccati operation can be bounded as


𝛼
𝛿Φ ( R(𝑿 ), R(𝒀 )) ≤ 𝛿Φ (𝑨 ∗ (𝑺 + 𝑿 −1 ) −1 𝑨, 𝑨 ∗ (𝑺 + 𝒀 −1 ) −1 𝑨),
𝜆 min (𝑸 ) + 𝛼
𝛼
= 𝛿Φ (𝑺 + 𝑿 −1 , 𝑺 + 𝒀 −1 ),
𝜆 min (𝑸 ) + 𝛼
𝛼
≤ 𝛿Φ (𝑿 −1 , 𝒀 −1 ),
𝜆 min (𝑸 ) + 𝛼
𝛼
= 𝛿Φ (𝑿 , 𝒀 ),
𝜆 min (𝑸 ) + 𝛼
𝜆 max (𝑨 ∗𝑺 −1 𝑨)
≤ 𝛿Φ (𝑿 , 𝒀 ),
𝜆 min (𝑸 ) + 𝜆max (𝑨 ∗𝑺 −1 𝑨)
where the first and third lines are due to Lemma 5.7, the second and fourth lines are
due to invariance of 𝛿 Φ under congruence transformation and matrix inversion, and
𝛼
the last line is due to the monotonically increasing map 𝛼 ↦→ 𝛼+𝛽 , 𝛽 > 0. This bound
implies that
𝜆 max (𝑨 ∗𝑺 −1 𝑨)
𝐿 Φ ( R) ≤ <1
𝜆 min (𝑸 ) + 𝜆 max (𝑨 ∗𝑺 −1 𝑨)
for nonsingular 𝑨 ∈ 𝕄𝑛 where strict inequality is due to positivity 𝜆 min (𝑸 ) > 0.
Now, suppose that 𝑨 is arbitrary. Let {𝑨 𝑘 }𝑘 ∈ℕ ⊂ GL (𝑛) be a sequence of non-
singular matrices converging to 𝑨 . Define the sequence of Riccati maps R𝑘 (𝑿 ) B
𝑸 + 𝑨 𝑘∗ (𝑺 + 𝑿 −1 ) −1 𝑨 𝑘 . Then, for 𝑿 ,𝒀 ∈ ℙ𝑛 , one has that

𝜆 max (𝑨 𝑘∗ 𝑺 −1 𝑨 𝑘 )
𝛿Φ ( R𝑘 (𝑿 ), R𝑘 (𝒀 )) ≤ 𝛿Φ (𝑿 , 𝒀 ).
𝜆min (𝑸 ) + 𝜆 max (𝑨 𝑘∗ 𝑺 −1 𝑨 𝑘 )

Taking the limit of both sides and using the continuity of the metric (𝑿 ,𝒀 ) ↦→ 𝛿 Φ (𝑿 ,𝒀 )
and the eigenvalue function 𝑨 ↦→ 𝜆 max (𝑨 ∗𝑺 −1 𝑨) , one obtains the final bound

𝜆max (𝑨 ∗𝑺 −1 𝑨)
𝛿Φ ( R(𝑿 ), R(𝒀 )) ≤ 𝛿Φ (𝑿 , 𝒀 ),
𝜆 min (𝑸 ) + 𝜆 max (𝑨 ∗𝑺 −1 𝑨)
where the coefficient is strictly less then 1. 
Strict contraction of Riccati operator enables one to approximate the unique fixed point
by fixed-point iterations starting from any initial point. The rate of convergence is
exponential and controlled by the Lipschitz constant of the Riccati map as shown in
the following corollary.
Corollary 5.24 (Convergence rate of Riccati iterates, [LL08, Thm. 4.6]). Suppose 𝑸 , 𝑺  0
and let 𝑨 ∈ 𝕄𝑛 be arbitrary. Given an initial seed 𝑿 0 ∈ ℙ𝑛 , define recursions
𝑿 𝑘 +1 B R(𝑿 𝑘 ) for 𝑘 ∈ ℕ. Then, for any sgf Φ, the distance of the iterate 𝑿 𝑘 to the
unique fixed point 𝑿 ★ = R(𝑿 ★ ) is controlled as

𝐿𝑘
𝛿Φ (𝑿 ★ , 𝑿 𝑘 ) ≤ 𝛿Φ (𝑿 1 , 𝑿 0 ),
1−𝐿
where
𝜆 max (𝑨 ∗𝑺 −1 𝑨)
𝐿= < 1.
𝜆 min (𝑸 ) + 𝜆 max (𝑨 ∗𝑺 −1 𝑨)
Project 5: Algebraic Riccati Equations 261

Proof. From Theorem 5.23, one has that

𝛿Φ (𝑿 ★ , 𝑿 𝑘 ) = 𝛿Φ ( R(𝑿 ★ ), R(𝑿 𝑘 −1 )) ≤ 𝐿𝛿 Φ (𝑿 ★ , 𝑿 𝑘 −1 ).

Recursively applying the first inequality, the following bound

𝛿Φ (𝑿 ★ , 𝑿 𝑘 ) ≤ 𝐿 𝑘 −1 𝛿Φ (𝑿 ★ , 𝑿 1 ),

is obtained. In particular, one has that

𝛿 Φ (𝑿 ★ , 𝑿 1 ) ≤ 𝐿𝛿Φ (𝑿 ★ , 𝑿 0 ) ≤ 𝐿 (𝛿 Φ (𝑿 ★ , 𝑿 1 ) + 𝛿 Φ (𝑿 1 , 𝑿 0 )),

where the second inequality is due to triangle inequality. Hence,

𝐿
𝛿 Φ (𝑿 ★ , 𝑿 1 ) ≤ 𝛿Φ (𝑿 1 , 𝑿 0 ),
1−𝐿
which implies

𝐿𝑘
𝛿Φ (𝑿 ★ , 𝑿 𝑘 ) ≤ 𝐿 𝑘 −1 𝛿Φ (𝑿 ★ , 𝑿 1 ) ≤ 𝛿Φ (𝑿 1 , 𝑿 0 ).
1−𝐿


Lecture bibliography
[Ber12] D. Bertsekas. Dynamic Programming and Optimal Control: Volume I. v. 1. Athena
Scientific, 2012. url: https://books.google.com/books?id=qVBEEAAAQBAJ.
[Bha03] R. Bhatia. “On the exponential metric increasing property”. In: Linear Algebra and its
Applications 375 (2003), pages 211–220. doi: https://doi.org/10.1016/S0024-
3795(03)00647-5.
[BH06] R. Bhatia and J. Holbrook. “Riemannian geometry and matrix geometric means”.
In: Linear Algebra and its Applications 413.2 (2006). Special Issue on the 11th
Conference of the International Linear Algebra Society, Coimbra, 2004, pages 594–
618. doi: https://doi.org/10.1016/j.laa.2005.08.025.
[Bou93] P. Bougerol. “Kalman Filtering with Random Coefficients and Contractions”. In:
SIAM Journal on Control and Optimization 31.4 (1993), pages 942–959. eprint:
https://doi.org/10.1137/0331041. doi: 10.1137/0331041.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[KSH00] T. Kailath, A. H. Sayed, and B. Hassibi. Linear estimation. Prentice Hall, 2000.
[Kal60] R. E. Kalman. “A New Approach to Linear Filtering and Prediction Problems”.
In: Journal of Basic Engineering 82.1 (Mar. 1960), pages 35–45. eprint: https:
//asmedigitalcollection.asme.org/fluidsengineering/article-pdf/
82/1/35/5518977/35\_1.pdf. doi: 10.1115/1.3662552.
[Kha96] H. Khalil. “Nonlinear Systems, Printice-Hall”. In: Upper Saddle River, NJ 3 (1996).
[Lan70] P. Lancaster. “Explicit Solutions of Linear Matrix Equations”. In: SIAM Review 12.4
(1970), pages 544–566. url: http://www.jstor.org/stable/2028490.
[LL07] J. Lawson and Y. Lim. “A Birkhoff Contraction Formula with Applications to
Riccati Equations”. In: SIAM Journal on Control and Optimization 46.3 (2007),
pages 930–951. eprint: https://doi.org/10.1137/050637637. doi: 10.1137/
050637637.
[Lax02] P. D. Lax. Functional analysis. Wiley-Interscience, 2002.
Project 5: Algebraic Riccati Equations 262

[LL08] H. Lee and Y. Lim. “Invariant metrics, contractions and nonlinear matrix equations”.
In: Nonlinearity 21.4 (2008), pages 857–878. doi: 10.1088/0951-7715/21/4/
011.
[LW94] C. Liverani and M. P. Wojtkowski. “Generalization of the Hilbert metric to the
space of positive definite matrices.” In: Pacific Journal of Mathematics 166.2 (1994),
pages 339 –355. doi: pjm/1102621142.
[Ric24] J. Riccati. “Animadversiones in aequationes differentiales secundi gradus”. In:
Actorum Eruditorum quae Lipsiae publicantur Supplementa. Actorum Eruditorum
quae Lipsiae publicantur Supplementa v. 8. prostant apud Joh. Grossii haeredes
& J.F. Gleditschium, 1724. url: https : / / books . google . com / books ? id =
UjTw1w7tZsEC.
[SMI92] V. I. SMIRNOV. “Biography of A. M. Lyapunov”. In: International Journal of
Control 55.3 (1992), pages 775–784. eprint: https : / / doi . org / 10 . 1080 /
00207179208934258. doi: 10.1080/00207179208934258.
6. Hyperbolic Polynomials

Date: 14 March 2022 Author: Eitan Levin

In this note, we survey the basic properties of hyperbolic polynomials and their
Agenda:
consequences. These polynomials generalize the determinant, and the roots of their
1. Hyperbolic polynomials
analogously-defined characteristic polynomials share remarkably many of the properties 2. Differentiation
of eigenvalues. Hyperbolic polynomials were originally introduced in the study of 3. Quadratics and Alexandrov’s
(hyperbolic) PDEs by Gårding [Går51], and have since found applications in several inequality
other areas, most famously in the resolution of the Kadison–Singer problem by Marcus, 4. SDPs and perturbation theory
5. Compositions
Spielman, and Srivastava [MSS14]. 6. Euclidean structure

6.1 Basic definitions and properties


Let E be a Euclidean vector space. Denote by ℝ[ E] 𝑑 the space of polynomial functions
on E which are homogeneous of degree 𝑑 . In more detail, E is a
finite-dimensional real inner-product
space equipped with the associated
Definition 6.1 (Hyperbolic polynomials). A homogeneous polynomial 𝑝 ∈ ℝ[ E] 𝑑 is
norm topology. The space ℝ [ E] 𝑑
hyperbolic on E with respect to a direction 𝒆 ∈ E if 𝑝 (𝒆 ) ≠ 0 and if the univariate consists of functions that are
polynomial 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒆 ) has only real roots for any 𝒙 ∈ E. The set of hyperbolic homogeneous polynomials of degree
polynomials of degree 𝑑 with respect to 𝒆 is denoted Hyp𝑑 (𝒆 ) . 𝑑 in the coordinates with respect to
some basis for E. This definition is
The polynomial 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒆 ) is called the characteristic polynomial of 𝒙 , and independent of the choice of basis.
its roots (listed with multiplicity) are called the eigenvalues of 𝒙 and are denoted
by 𝜆 max (𝒙 ) B 𝜆 1 (𝒙 ) ≥ . . . ≥ 𝜆𝑑 (𝒙 ) =: 𝜆 min (𝒙 ) .

Note that the characteristic polynomial always has degree 𝑑 , and hence has 𝑑 roots
counting multiplicities, since the leading term of 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒆 ) is (−1) 𝑑 𝑝 (𝒆 )𝑡 𝑑 and
we assume 𝑝 (𝒆 ) ≠ 0. See also Remark 6.8 below.
It is common to normalize hyperbolic polynomials so that 𝑝 (𝒆 ) = 1, but we do not
require this.
Example 6.2 The following are three of the most basic examples of hyperbolic polyno-
mials.

The product 𝑝 (𝒙 ) = 𝑑𝑖=1 𝑥𝑖 is hyperbolic on ℝ𝑑 with respect to 1 = ( 1, . . . , 1) ᵀ


Î
(a)
since the roots of 𝑝 (𝒙 − 𝑡 1) = 𝑑𝑖=1 (𝑥𝑖 − 𝑡 ) are the coordinates 𝑥𝑖 . In this case,
Î
𝝀(𝒙 ) = 𝒙 ↓ .
(b) The determinant 𝑝 (𝑿 ) = det (𝑿 ) is hyperbolic on ℍ𝑑 with respect to I since
the roots of 𝑝 (𝑿 − 𝑡 I) = det (𝑿 − 𝑡 I) are the eigenvalues of 𝑿 . In this case
𝝀(𝑿 ) = 𝝀(𝑿 ) ↓ is the vector of eigenvalues of 𝑿 sorted in decreasing order.
(c) We have 𝑝 (𝒙 ) = h𝒆 , 𝒙 i 𝑑 ∈ Hyp𝑑 (𝒆 ) , showing that Hyp𝑑 (𝒆 ) ≠ ∅ for all 𝑑 ∈ ℕ
and 𝒆 ∈ E.

More sophisticated examples are given in Sections 6.3 and 6.5. 

Hyperbolic polynomials unify the analysis of certain families of inequalities and


cones arising in optimization.
Project 6: Hyperbolic Polynomials 264

Note that the eigenvalues 𝜆𝑖 (𝒙 ) are continuous in 𝒙 , since the roots of a polynomial
are continuous functions of its coefficients (which follows from the argument principle
in complex analysis), and the coefficients of 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒆 ) are themselves continuous
(in fact, polynomial) functions of 𝒙 . Note also that

𝝀(𝑠𝒙 + 𝑡 𝒆 ) = 𝑠 𝝀(𝒙 ) + 𝑡 1 for 𝑠 ≥ 0 and 𝑡 ∈ ℝ,


𝜆𝑖 (−𝒙 ) = 𝜆𝑑−𝑖 (𝒙 ), (6.1)
Ö𝑑
𝑝 (𝒙 ) = 𝑝 (𝒆 ) 𝜆𝑖 (𝒙 ).
𝑖 =1

In particular, 𝝀(𝒆 ) = 1.
As usual in this class, we begin by characterizing the geometry of the set of
hyperbolic polynomials.
Proposition 6.3 (Basic geometry). If 𝑝 ∈ Hyp𝑑 (𝒆 ) then 𝛼𝑝 ∈ Hyp𝑑 (𝒆 ) for any nonzero
𝛼 ∈ ℝ. In particular, Hyp𝑑 (𝒆 ) is a cone, but it is not convex. If 𝑝 ∈ Hyp𝑑 (𝒆 ) and In this note, a cone is a set closed
𝑞 ∈ Hyp𝑚 (𝒆 ) , then 𝑝𝑞 ∈ Hyp𝑑+𝑚 (𝒆 ) . under multiplication by strictly
positive numbers.
Proof. The first and last claims are easy to verify directly from Definition 6.1.
The cone Hyp𝑑 (𝒆 ) is not convex since if 𝑝 ∈ Hyp𝑑 (𝒆 ) then −𝑝 ∈ Hyp𝑑 (𝒆 ) but
(𝑝 − 𝑝)/2 = 0 ∉ Hyp𝑑 (𝒆 ) because the zero polynomial vanishes at 𝒆 . We do not get convexity even if we
 restrict to polynomials satisfying
𝑝 (𝒆 ) = 1, as can be seen from
Next, we introduce the generalization of the psd cone. Proposition 6.13(d) below.

Definition 6.4 The hyperbolicity cone for 𝑝 with respect to 𝒆 is the cone Λ++ = {𝒙 ∈
E : 𝜆 min (𝒙 ) > 0}. Its closure is denoted by Λ+ .

Note that Λ++ is indeed a cone, since 𝜆 min is positively homogeneous by (6.1). Note
also that
Λ+ = {𝒙 ∈ E : 𝜆 min (𝒙 ) ≥ 0}.
Indeed, the inclusion ⊆ follows by continuity of 𝜆 min , while the reverse inclusion follows
since if 𝜆 min (𝒙 ) = 0 then for 𝑡 > 0 we have 𝒙 + 𝑡 𝒆 ∈ Λ++ by (6.1) and 𝒙 + 𝑡 𝒆 → 𝒙 as
𝑡 ↓ 0. Equation (6.1) also shows that 𝑝/𝑝 (𝒆 ) > 0 on Λ++ and 𝑝/𝑝 (𝒆 ) ≥ 0 on Λ+ .
We have Λ++ = ℝ++ 𝑑 , Λ = ℝ𝑑 in Example 6.2(a) and Λ
+ + ++ = ℍ𝑑 , Λ+ = ℍ𝑑 in
++ +

Example 6.2(b).
We proceed to show that hyperbolicity cones are convex.
Lemma 6.5 The set Λ++ is the connected component containing 𝒆 in {𝒙 ∈ E : 𝑝 (𝒙 ) ≠ 0}.
Moreover, it is star-shaped with center 𝒆 . A set S is star-shaped with center
𝑠0 ∈ S if the line segment between 𝑠0
Proof. We have 𝒆 ∈ Λ++ since 𝜆 min (𝒆 ) = 1, and Λ++ ⊆ {𝒙 ∈ E : 𝑝 (𝒙 ) ≠ 0} since and any 𝑠 ∈ S is contained in S.
𝑝/𝑝 (𝒆 ) > 0 on Λ++ . If 𝒙 ∈ Λ++ then 𝜆 min (𝜏𝒙 + 𝜏𝒆
¯ ) = 𝜏𝜆 min (𝒙 ) + 𝜏¯ > 0 for all
𝜏 ∈ [0, 1] and 𝜏¯ = 1 − 𝜏 . This shows that the line segment between 𝒙 and 𝒆 is
contained in Λ++ and hence Λ++ is connected, and moreover start-shaped with center 𝒆 .
Therefore, Λ++ is contained in the connected component of 𝒆 in {𝒙 ∈ E : 𝑝 (𝒙 ) ≠ 0}.
Conversely, if 𝒙 is in that connected component, then 𝒙 ∈ Λ++ because 𝜆 min varies
continuously on any path between 𝒙 and 𝒆 contained in {𝒙 ∈ E : 𝑝 (𝒙 ) ≠ 0} and is
never zero on it. 
The following theorem gives an important property of hyperbolicity cones, from
which their convexity readily follows. The proof below is a slight simplification of
Renegar’s proof in [Ren06, Thm. 3], which in turn is a simplification of Gårding’s
original proof in [Går51, Lemma 2.7].
Project 6: Hyperbolic Polynomials 265

Theorem 6.6 If 𝒙 ∈ Λ++ , then 𝑝 is hyperbolic with respect to 𝒙 . The hyperbolicity


cone of 𝑝 with respect to 𝒙 is also Λ++ .

Proof. Fix arbitrary 𝒚 ∈ E. For the first claim, we need to show that 𝑡 ↦→ 𝑝 (𝒚 − 𝑡 𝒙 )
has only real roots. Since this is a polynomial with real coefficients, and hence its
complex roots come in conjugate pairs, it is equivalent to show that all of its roots have
nonnegative imaginary parts. Suppose for contradiction that this is not the case.
Î
Define 𝑞 (𝛼, 𝑠 , 𝑡 ) = 𝑝 ( i𝛼𝒆 + 𝑠 𝒚 −𝑡 𝒙 ) . All the roots of 𝑡 ↦→ 𝑞 ( 1, 0, 𝑡 ) = 𝑝 (𝒆 ) 𝑗 ( i −
𝑡 𝜆 𝑗 (𝒙 )) have positive imaginary parts (namely, 𝜆 𝑗 (𝒙 ) −1 ), while some root of 𝑡 ↦→
𝑞 ( 0, 1, 𝑡 ) has a negative imaginary part. Using the continuous dependence of the roots
on the coefficients, we conclude that there must exist 𝛽 ∈ ( 0, 1) such that some root
of 𝑡 ↦→ 𝑞 ( 1 − 𝛽, 𝛽, 𝑡 ) is real, say that root is 𝑡 0. Then 𝒛 = 𝛽𝒚 − 𝑡 0𝒙 ∈ E is such that
𝑡 ↦→ 𝑝 (𝒛 − 𝑡 𝒆 ) has a purely imaginary root (namely, 𝑡 = −i ( 1 − 𝛽) ), contradicting the
hyperbolicity of 𝑝 with respect to 𝒆 .
The second claim follows from Lemma 6.5. 
Corollary 6.7 The cones Λ++ and Λ+ are convex.

Proof. The cone Λ++ is star-shaped and every 𝒙 ∈ Λ++ is a center by Lemma 6.5, hence
it is convex. The cone Λ+ is the closure of a convex set, hence convex as well. 
Remark 6.8 Corollary 6.7 can fail without the requirement that 𝑝 (𝒆 ) ≠ 0 in Defini-
tion 6.1, see [Ren06].
We can now generalize the convexity of the largest eigenvalue.

Theorem 6.9 The function 𝜆 max is convex and 𝜆 min is concave on all of E.

Proof. Since 𝜆 1 (−𝒙 ) = −𝜆 min (𝒙 ) , it suffices to show the second statement.


The superlevel sets of 𝜆 min are convex, because {𝒙 ∈ E : 𝜆 min (𝒙 ) ≥ 𝛼} = Λ+ + 𝛼𝒆 .
For 𝒙 , 𝒚 ∈ E such that 𝜆 min (𝒚 ) ≥ 𝜆 min (𝒙 ) , define

𝒛 = 𝒚 + [𝜆 min (𝒙 ) − 𝜆 min (𝒚 )]𝒆 ,

and note that 𝜆 min (𝒙 ) = 𝜆 min (𝒛 ) . Therefore, for any 𝜏 ∈ [ 0, 1] and 𝜏¯ = 1 −𝜏 , we have
𝜏𝒙 + 𝜏𝒛
¯ ∈ {𝒘 ∈ E : 𝜆 min (𝒘 ) ≥ 𝜆 min (𝒙 )}, so

𝜆 min (𝒙 ) ≤ 𝜆min (𝜏𝒙 + 𝜏𝒛


¯ ) = 𝜆 min (𝜏𝒙 + 𝜏𝒚
¯ ) + 𝜏¯ [𝜆 min (𝒙 ) − 𝜆 min (𝒚 )].

Rearranging gives 𝜆 min (𝜏𝒙 + 𝜏𝒚


¯ ) ≥ 𝜏𝜆 min (𝒙 ) + 𝜏𝜆
¯ min (𝒚 ) , as desired. 

6.2 Derivatives and multilinearization


Differentiating a hyperbolic polynomial gives another hyperbolic polynomial. This fact,
which we prove below, ultimately yields Alexandrov’s mixed discriminant inequality
(Corollary 6.16) as an easy consequence of the theory we develop.

Definition 6.10 For a polynomial 𝑝 on E and vectors 𝒚 1 , . . . , 𝒚 𝑘 ∈ E, define induc-


Project 6: Hyperbolic Polynomials 266

tively the following polynomials:

𝑑
𝑝𝒚( 11 ) (𝒙 ) = 𝑝 (𝒙 + 𝑡 𝒚 1 ) = h∇𝑝 (𝒙 ), 𝒚 1 i,
𝑑𝑡 𝑡 =0
𝑑
𝑝𝒚(𝑘1 ,...,𝒚
)
𝑘
= 𝑝𝒚(𝑘1 ,...,𝒚
−1)
𝑘 −1
(𝒙 + 𝑡 𝒚 𝑘 ).
𝑑𝑡 𝑡 =0

Proposition 6.11 (Derivatives are hyperbolic). Suppose 𝑝 is hyperbolic on E of degree 𝑑


with hyperbolicity cone Λ++ . Then for any 𝒚 1 , . . . , 𝒚 𝑘 ∈ Λ++ , the derivative polynomial
𝑝𝒚(𝑘1 ,...,𝒚
)
𝑘
is hyperbolic of degree 𝑑 − 𝑘 with respect to any 𝒚 ∈ Λ++ .

Proof. It suffices to prove the case 𝑘 = 1, since then the general case follows by
induction. If 𝒚 1 ∈ Λ++ , then 𝑝 is hyperbolic with respect to 𝒚 1 by Theorem 6.6,
so 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒚 1 ) has 𝑑 real roots. By Rolle’s theorem, the roots 𝝀 ( 1) (𝒙 ) of
𝑡 ↦→ 𝑝 𝒚( 11 ) (𝒙 − 𝑡 𝒚 1 ) are nested between the roots 𝝀(𝒙 ) of 𝑡 ↦→ 𝑝 (𝒙 − 𝑡 𝒚 1 ) , meaning
that
𝜆 1 (𝒙 ) ≥ 𝜆1( 1) (𝒙 ) ≥ 𝜆 2 (𝒙 ) ≥ . . . ≥ 𝜆𝑑−1 (𝒙 ) ≥ 𝜆𝑑−
( 1)
1
(𝒙 ) ≥ 𝜆𝑑 (𝒙 ).
In particular, this shows that all 𝑑 − 1 roots of 𝑡 ↦→ 𝑝 ( 1) (𝒙 − 𝑡 𝒚 1 ) are real. By Euler’s
( 1) ( 1)
homogeneous function theorem, we have 𝑝 𝒚 1 (𝒚 1 ) = 𝑝 (𝒚 1 )𝑑 ≠ 0. Thus, 𝑝 𝒚 1 is
hyperbolic with respect to 𝒚 1 . Moreover, for any 𝒚 ∈ Λ++ we have 𝜆𝑑 (𝒚 ) > 0, hence
( 1)
𝜆𝑑− 1
(𝒚 ) > 0 so 𝒚 is in the hyperbolicity cone of 𝑝 𝒚( 11 ) . Theorem 6.6 then shows that
𝑝𝒚( 11 ) is hyperbolic with respect to 𝒚 . 
An important special case of derivative polynomials is the multilinearization of a
homogeneous polynomial. If 𝑝 is a homogeneous polynomial of degree 𝑑 on E, one
defines polynomials 𝑝𝑖 1 ,...,𝑖𝑑 by
∑︁𝑑  ∑︁
𝑝 𝛼𝑖 𝒙 𝑖 = 𝑝𝑖1 ,...,𝑖𝑑 (𝒙 1 , . . . , 𝒙 𝑑 )𝛼 1𝑖1 · · · 𝛼𝑑𝑖𝑑 ,
𝑖 =1 𝑖 1 ,...,𝑖𝑑

where the sum is over indices 𝑖 1 , . . . , 𝑖𝑑 ∈ {0, . . . , 𝑑 } summing to 𝑑 . By viewing this


as a polynomial in 𝛼 ∈ ℝ𝑑 and equating coefficients of corresponding monomials, it is
easy to show that 𝑝𝑖 1 ,...,𝑖𝑑 (𝒙 1 , . . . , 𝒙 𝑑 ) is homogeneous of degree 𝑖 𝑗 in 𝒙 𝑗 , and that for
any permutation 𝜎 on 𝑑 letters, we have

𝑝𝑖1 ,...,𝑖𝑑 (𝒙 𝜎 ( 1) , . . . , 𝒙 𝜎 (𝑑) ) = 𝑝𝑖𝜎 −1 (1) ,...,𝑖𝜎 −1 (𝑑 ) (𝒙 1 , . . . , 𝒙 𝑑 ).

e = 𝑝1,...,1 is multilinear and symmetric in its 𝑑 arguments, and it is


In particular, 𝑝
called the multilinearization or full polarization of 𝑝 . The map 𝑝 ↦→ 𝑝
e defines a linear Multilinearization realizes the
isomorphism between ℝ[ E] 𝑑 and symmetric multilinear polynomials on E𝑑 , since we isomorphism Sym𝑑 ( E)  ( E⊗𝑑 ) 𝑆𝑑
briefly discussed in class, where the
have:
1 symmetric group 𝑆𝑑 acts by
𝑝 (𝒙 ) = 𝑝 e(𝒙 , . . . , 𝒙 ). (6.2) permuting the factors.
𝑑!
Í 
𝑑
𝛼𝑖 ) 𝑑 𝑝 (𝒙 ) and equating the
Í
This can be seen by noting that 𝑝 𝑖 =1 𝛼𝑖 𝒙 = ( 𝑖
coefficients of 𝛼 1 · · · 𝛼𝑑 . Proposition 6.11 implies the following Corollary.
Corollary 6.12 Suppose 𝑝 is a hyperbolic polynomial on E of degree 𝑑 with hyperbolicity
cone Λ++ , and let 𝑝
e be its multilinearization. For any 1 ≤ 𝑘 ≤ 𝑑 and any 𝒙 𝑘 , . . . , 𝒙 𝑑 ∈
Λ++ , the polynomial 𝒙 ↦→ 𝑝 e(𝒙 , . . . , 𝒙 , 𝒙 𝑘 , . . . , 𝒙 𝑑 ) is hyperbolic with respect to any
𝒚 ∈ Λ++ .
Project 6: Hyperbolic Polynomials 267

Proof. Using (6.2) and the symmetry and multilinearity of 𝑝


e, we have
1
𝑝 (𝒙 + 𝑡 𝒙 𝑑 ) = e(𝒙 + 𝑡 𝒙 𝑑 , . . . , 𝒙 + 𝑡 𝒙 𝑑 )
𝑝
𝑑!
1 
𝑝 (𝒙 , . . . , 𝒙 , 𝒙 𝑑 ) + 𝑂 (𝑡 2 ) .

= 𝑝 (𝒙 ) + 𝑡 𝑑e
𝑑!
( 1) 1
Therefore, 𝑝𝒙 𝑑 (𝒙 ) = e(𝒙 , . . . , 𝒙 , 𝒙 𝑑 ) .
(𝑑−1) ! 𝑝 Continuing inductively, we get

e(𝒙 , . . . , 𝒙 , 𝒙 𝑘 , . . . , 𝒙 𝑑 ) = (𝑘 − 1) !𝑝𝒙(𝑑−𝑘
𝑝 +1)
𝑘 ,...,𝒙 𝑑
(𝒙 ),

so the claim follows from Proposition 6.11. 

6.3 Hyperbolic quadratics and Alexandrov’s mixed discriminant inequality


In this section, we consider the special case of a quadratic polynomial 𝑝 (𝒙 ) = 𝒙 ᵀ 𝑨𝒙 .
Its hyperbolicity is characterized by a “reverse Cauchy-Schwarz” inequality satisfied
by its associated symmetric bilinear form. We then instantiate this inequality for
the multilinearization of the determinant, yielding Alexandrov’s mixed discriminant
inequality. The following proposition is based on [SH19, Lemma 2.9].
Proposition 6.13 (Hyperbolic quadratics). Set E = ℝ𝑛 and fix 𝑨 ∈ ℍ𝑛 . Define 𝑄 (𝒙 , 𝒚 ) =
𝒙 ᵀ 𝑨𝒚 and 𝑝 (𝒙 ) = 𝑄 (𝒙 , 𝒙 ) , and suppose 𝒆 ∈ ℝ𝑛 satisfies 𝑝 (𝒆 ) > 0. The following are
equivalent.

(a) 𝑝 is hyperbolic with respect to 𝒆 .


(b) 𝑄 (𝒙 , 𝒚 ) 2 ≥ 𝑝 (𝒙 )𝑝 (𝒚 ) for all 𝒙 , 𝒚 ∈ E such that 𝑝 (𝒚 ) ≥ 0.
(c) 𝑨 has at most one positive eigenvalue.
(d) There exists 𝒘 ∈ ℝ𝑛 such that 𝑄 (𝒙 , 𝒘 ) = 0 implies 𝑝 (𝒙 ) ≤ 0.

The hyperbolicity cone of 𝑝 with respect to any such 𝒆 is

Λ++ = {𝒙 ∈ ℝ𝑛 : 𝑝 (𝒙 ) > 0}.

Proof. We show (a) ⇔ (b) ⇒ (c) ⇒ (d) ⇒ (b). Note that 𝑝 (𝒙 − 𝑡 𝒚 ) = 𝑝 (𝒚 )𝑡 2 −


2𝑡𝑄 (𝒙 , 𝒚 ) + 𝑝 (𝒙 ) , whose roots are (assuming 𝑝 (𝒚 ) ≠ 0):
√︁
𝑄 (𝒙 , 𝒚 ) ± 𝑄 (𝒙 , 𝒚 ) 2 − 𝑝 (𝒙 )𝑝 (𝒚 )
. (6.3)
𝑝 (𝒚 )
Suppose (a) holds. If 𝑝 (𝒚 ) = 0 then the inequality in (b) is trivially satisfied. If
𝑝 (𝒚 ) > 0, then the smallest root of 𝑡 ↦→ 𝑝 (𝒚 −𝑡 𝒆 ) (given by (6.3) with 𝒙 = 𝒆 ) is strictly
positive (and conversely, showing the claimed expression for Λ++ ), hence 𝒚 ∈ Λ++ and
𝑝 is hyperbolic with respect to 𝒚 . This implies that both roots in (6.3) are real, and
hence the term inside the square root is positive, giving (b). Conversely, suppose (b)
holds. Setting 𝒚 = 𝒆 in the inequality of (b), we obtain 𝑄 (𝒙 , 𝒆 ) 2 − 𝑝 (𝒙 )𝑝 (𝒆 ) ≥ 0,
hence (a) holds. Thus, (a) and (b) are equivalent.
Suppose (b) holds. If (c) does not hold, then there exist orthogonal eigenvectors 𝒙 , 𝒚
corresponding to positive eigenvalues, in which case 𝑄 (𝒙 , 𝒚 ) = 0 but 𝑝 (𝒙 )𝑝 (𝒚 ) > 0.
Hence (b) does not hold, a contradiction, so (b) implies (c). Suppose (c) holds. If
𝑨 has no positive eigenvalues, then 𝑝 ≤ 0 and (d) holds with any 𝒘 . If 𝑨 has
exactly one positive eigenvalue, then (d) holds with 𝒘 equal to the corresponding
eigenvector. Hence (c) implies (d). Suppose (d) holds. We claim (b) holds as well.
Project 6: Hyperbolic Polynomials 268

Indeed, if 𝑝 (𝒚 ) = 0 there is nothing to prove, so assume 𝑝 (𝒚 ) > 0. Then (d) implies


𝑄 (𝒚 , 𝒘 ) ≠ 0. For each 𝒙 ∈ ℝ𝑛 , define 𝒛 = 𝒙 − 𝑡 𝒚 with 𝑡 = 𝑄 (𝒙 , 𝒘 )/𝑄 (𝒚 , 𝒘 ) , which
satisfies 𝑄 (𝒛 , 𝒘 ) = 0. Therefore, (d) gives

𝑄 (𝒙 , 𝒚 ) 2
0 ≥ 𝑝 (𝒛 ) = 𝑝 (𝒙 ) − 2𝑡𝑄 (𝒙 , 𝒚 ) + 𝑡 2𝑝 (𝒚 ) ≥ 𝑝 (𝒙 ) − ,
𝑝 (𝒚 )
where last inequality is obtained by minimizing over 𝑡 . This shows (b) holds. 
Example 6.14 (The second-order cone). Let E = ℝ𝑛+1 and 𝑝 (𝒙 ) = 𝑥 02 − = 𝒙 ᵀ 𝑨𝒙 2
Í𝑛
𝑖 =1 𝑥𝑖
for 𝑨 = diag ( 1, −1, . . . , −1) . Then 𝑝 is hyperbolic with respect to 𝒆 = ( 1, 0, . . . , 0) ᵀ
by Proposition 6.13(d), and its corresponding hyperbolicity cone is
n ∑︁𝑛 o
Λ+ = 𝒙 ∈ ℝ𝑛+1 : 𝑥𝑖2 ≤ 𝑥 02 .
𝑖 =1

This is the so-called second-order cone, the epigraph of the 2 norm on ℝ𝑛 . 

Applying Proposition 6.13 to the multilinearization, we obtain the following re-


sult.
Theorem 6.15 ([Sch14, Thm. 5.5.3]). Let 𝑝 be hyperbolic on E of degree 𝑑 with
hyperbolicity cone Λ++ , and let 𝑝
e : E𝑑 → ℝ be its multilinearization. Fix
𝒙 2 , 𝒙 3 , . . . , 𝒙 𝑑 ∈ Λ++ .
(a) e(𝒛 , 𝒙 2 , 𝒙 3 , . . . , 𝒙 𝑑 ) ≥ 0 for all 𝒛 ∈ Λ+ .
𝑝
(b) For any 𝒚 ∈ Λ+ and 𝒙 ∈ E, we have

e(𝒙 , 𝒚 , 𝒙 3 , . . . , 𝒙 𝑑 ) 2 ≥ 𝑝
𝑝 e(𝒙 , 𝒙 , 𝒙 3 , . . . , 𝒙 𝑑 )e
𝑝 (𝒚 , 𝒚 , 𝒙 3 , . . . , 𝒙 𝑑 ).

Proof. Corollary 6.12 shows that 𝒙 ↦→ 𝑝 e(𝒙 , 𝒙 2 , 𝒙 3 , . . . , 𝒙 𝑑 ) is hyperbolic with respect


to any 𝒛 ∈ Λ++ , hence in particular, nonnegative on Λ+ . This shows (a).
The same corollary also shows that 𝒙 ↦→ 𝑝 e(𝒙 , 𝒙 , 𝒙 3 , . . . , 𝒙 𝑑 ) is a hyperbolic
quadratic on E with respect to any element in Λ++ , so Proposition 6.13 give part
(b). 
Corollary 6.16 (Alexandrov’s mixed discriminant inequality). Let D : (ℍ𝑛 ) 𝑛 → ℝ be the
multilinearization of 𝑝 = det on ℍ𝑛 .
(a) D (𝑪 1 , . . . , 𝑪 𝑛 ) ≥ 0 whenever 𝑪 1 ∈ ℍ𝑛+ and 𝑪 𝑖 ∈ ℍ𝑛++ for all 𝑖 ≥ 2.
(b) For any 𝑨 ∈ ℍ𝑛 , any 𝑩 ∈ ℍ𝑛 + , and any 𝑪 , . . . , 𝑪 ∈ ℍ++ , we have:
3 𝑛 𝑛

D (𝑨, 𝑩, 𝑪 3 , . . . , 𝑪 𝑛 ) 2 ≥ D (𝑨, 𝑨, 𝑪 3 , . . . , 𝑪 𝑛 ) D (𝑩, 𝑩, 𝑪 3 , . . . , 𝑪 𝑛 ). (6.4)

The function D is called the mixed discriminant, and inequality (6.4) is called
Alexandrov’s mixed discriminant inequality. One can also derive the equality cases
for (6.4) from hyperbolicity, see [Sch14, Thm. 5.5.4].
The inequality (6.4) can be used to derive the Alexandrov–Fenchel inequality for
mixed volumes, a fundamental result in convex geometry. See [SH19], who also give a
more elementary proof of (6.4).

6.4 Semidefinite representability, and additive perturbation theory


An important subset ofHyp𝑑 (𝒆 ) for 𝒆 ∈ ℝ𝑛 consists of determinants of matrix pencils
𝑝 (𝒙 ) = det 𝑖𝑛=1 𝑥𝑖 𝑨 𝑖 with 𝑨 𝑖 ∈ ℍ𝑑 such that 𝑖 𝑒 𝑖 𝑨 𝑖  0. The corresponding
Í Í
hyper-
bolicity cones are linear slices of the psd cone ℍ𝑑 , namely, Λ+ =
+ Í
{𝒙 ∈ E : 𝑖 𝑥𝑖 𝑨 𝑖  0 } .
Project 6: Hyperbolic Polynomials 269

Such cones Λ+ are said to be semidefinite-representable, and they are significant be-
cause optimizing a linear function over Λ+ subject to linear equality constraints is
a semidefinite program (SDP), which can be solved efficiently using interior-point
methods. Those same methods can be used to efficiently solve linear optimization
over any hyperbolicity cone [Gül97] (because log 1/𝑝 is a self-concordant barrier
function for Λ+ ), leading to so-called hyperbolic programs. A natural question is then
whether the class of hyperbolic programs is more expressive than SDPs, that is, can
every hyperbolic program be written as an SDP? This is the content of the general Lax
conjecture [Ren06], which conjectures that for any hyperbolicity cone Λ+ there exists

𝑛 ∈ ℕ, a subspace 𝑆 ⊂ ℍ𝑛 , and a linear isomorphism E − → 𝑆 sending Λ+ to 𝑆 ∩ ℍ𝑛+ .
The special case of hyperbolicity cones in ℝ3 (which was the original Lax conjecture)
has been proved in [LPR05] using, according to [Ser09], a “deep result about Riemann
surfaces by Helton–Vinnikov [HV07]”. They show that for any hyperbolic polynomial
𝑝 on ℝ3 with respect to ( 1, 0, 0) ᵀ , there exist 𝑨, 𝑩 ∈ ℍ𝑑 (where 𝑑 = deg 𝑝 ) satisfying

𝑝 (𝜉 0 , 𝜉 1 , 𝜉 2 ) = 𝑝 ( 1, 0, 0) det (𝜉 0 I + 𝜉 1 𝑨 + 𝜉 2 𝑩). (6.5)

The corresponding statement for hyperbolic polynomials on ℝ𝑛 for 𝑛 ≥ 4 is false [LPR05]


(but note that the existence of a representation of the form (6.5) is stronger than the
general Lax conjecture).
The representation (6.5) can be used to show that all the additive perturbation
theory for eigenvalues of matrices extend to eigenvalues of any hyperbolic polynomial.
Indeed, for any 𝒙 , 𝒚 ∈ E, consider the polynomial on ℝ3 given by 𝑞 (𝝃) = 𝑝 (𝜉 0𝒆 +
𝜉 1 𝒙 + 𝜉 2 𝒚 ) . This is hyperbolic with respect to ( 1, 0, 0) ᵀ , hence (6.5) yields 𝑨, 𝑩 ∈ ℍ𝑑
satisfying
𝑝 (𝜉 0𝒆 + 𝜉 1 𝒙 + 𝜉 2 𝒚 ) = det (𝜉 0 I + 𝜉 1 𝑨 + 𝜉 2 𝑩).
This implies that 𝝀(𝒙 ) = 𝝀(𝑨) , 𝝀(𝒚 ) = 𝝀(𝑩) , and 𝝀(𝒙 + 𝒚 ) = 𝝀(𝑨 + 𝑩) , giving the
general result:

Theorem 6.17 (Perturbation theory). Let 𝑝 be a hyperbolic polynomial on E of degree 𝑑 .


Then any relation satisfied by the eigenvalues of arbitrary matrices 𝑨, 𝑩, 𝑨 +𝑩 ∈ ℍ𝑑
is also satisfied by 𝝀(𝒙 ), 𝝀(𝒚 ), 𝝀(𝒙 + 𝒚 ) , for any 𝒙 , 𝒚 ∈ E.
In particular, Lidskii’s theorem holds: 𝝀(𝒙 + 𝒚 ) − 𝝀(𝒙 ) ≺ 𝝀(𝒚 ) .

A more elementary proof of Lidskii’s theorem for hyperbolic polynomials is given


in [Ser09]. We give a self-contained proof of the weaker statement 𝝀(𝒙 + 𝒚 ) ≺
𝝀(𝒙 ) + 𝝀(𝒚 ) in Corollary 6.21 in the next section. There, we also use hyperbolicity to
derive inequalities for eigenvalues of matrices, rather than the other way around.

6.5 Hyperbolicity and convexity of compositions


In this section, we consider hyperbolicity and convexity of the composition of functions
and the eigenvalue map 𝝀 . These properties enable us to prove inequalities for
eigenvalues of matrices, in particular a weak form of Lidskii’s theorem and Minkowski’s
determinant inequality.
Recall that the 𝑘 th elementary symmetric polynomial on ℝ𝑛 is defined as
∑︁
𝐸𝑘 (𝒙 ) = 𝑥𝑖 1 · · · 𝑥𝑖𝑘 ,
1 ≤𝑖 1 <...<𝑖𝑘 ≤𝑛

for 𝑘 = 1, . . . , 𝑛 , and 𝐸 0 = 1. The ring of symmetric polynomials on ℝ𝑛 is generated


by 𝐸 0 , . . . , 𝐸𝑛 .
Project 6: Hyperbolic Polynomials 270

Lemma 6.18 If 𝑝 is hyperbolic with respect to 𝒆 with eigenvalue map 𝝀 , then 𝐸𝑘 ◦ 𝝀 is


a hyperbolic polynomial of degree 𝑘 with respect to any 𝒚 ∈ Λ++ .

Proof. Note that


Ö𝑑 ∑︁𝑑
𝑝 (𝒙 + 𝑡 𝒆 ) = 𝑝 (𝒆 ) (𝜆𝑖 (𝒙 ) + 𝑡 ) = 𝑝 (𝒆 ) 𝐸𝑖 (𝝀(𝒙 ))𝑡 𝑑−𝑖 ,
𝑖 =1 𝑖 =0
1 (𝑑−𝑘 )
hence 𝐸𝑘 ◦ 𝝀 = 𝑝 (𝒆 ) (𝑑−𝑘 ) ! 𝑝𝒆 ,...,𝒆 . The claim follows from Proposition 6.11. 
Corollary 6.19 The following are consequences of Lemma 6.18.
(a) 𝐸𝑘 is hyperbolic on ℝ𝑛 with respect to any 𝒚 ∈ ℝ++
𝑛 .

(b) ᵀ
The sum of all the eigenvalues 1 𝝀 is linear.
Proof. For (a), instantiate Lemma 6.18 on Example 6.2(a). For (b), note that 1ᵀ 𝝀 =
𝐸 1 ◦ 𝝀 which is a polynomial of degree 1 by Lemma 6.18. 
The following theorem, taken from [Bau+01, Thm. 3.1], shows that compo-
sitions of symmetric hyperbolic polynomials with the eigenvalue map is hyper-
bolic.
Theorem 6.20 (Hyperbolicity of composition). Let 𝑝 be a hyperbolic polynomial on
E with respect to 𝒆 of degree 𝑑 , with eigenvalue map 𝝀 . Let 𝑞 be a symmetric
hyperbolic polynomial on ℝ𝑑 with respect to 1 of degree 𝑘 , with eigenvalue map
𝝁 . Then 𝑞 ◦ 𝝀 is a hyperbolic polynomial on E with respect to 𝒆 of degree 𝑘 , with
eigenvalue map 𝝁 ◦ 𝝀 .

Proof. Since 𝑞 is symmetric of degree 𝑘 , it can be written as a polynomial in 𝐸 1 , . . . , 𝐸𝑘 ,


hence 𝑞 ◦ 𝝀 is a homogeneous polynomial of degree 𝑘 by Lemma 6.18. Next, recall
from (6.1) that 𝝀(𝒙 − 𝑡 𝒆 ) = 𝝀(𝒙 ) − 𝑡 1. Since 𝑞 is hyperbolic with respect to 1, we
have (𝑞 ◦ 𝝀) (𝒆 ) = 𝑞 ( 1) ≠ 0 and
Ö
(𝑞 ◦ 𝝀) (𝒙 − 𝑡 𝒆 ) = 𝑞 (𝝀(𝒙 ) − 𝑡 1) = 𝑞 ( 1) (𝜇 𝑗 (𝝀(𝒙 )) − 𝑡 ),
𝑗

whose roots are (𝝁 ◦ 𝝀) (𝒙 ) , which are all real. 


We can use Theorem 6.20 to prove a weaker form of Lidskii’s theorem:
Corollary 6.21 Let 𝑝 be a hyperbolic polynomial on E with eigenvalue map 𝝀 . For any
𝒙 , 𝒚 ∈ E, we have 𝝀(𝒙 + 𝒚 ) ≺ 𝝀(𝒙 ) + 𝝀(𝒚 ) .
Proof. For each 1 ≤ 𝑘 ≤ 𝑑 let
Ö ∑︁𝑘
𝑞𝑘 (𝑢) = 𝑢𝑖 𝑗 .
1 ≤𝑖 1 <...<𝑖𝑘 ≤𝑑 𝑗 =1

Note that 𝑞𝑘 is a symmetric


 polynomial, hyperbolic with respect to 1 with eigenvalues
Í𝑘
𝝁 (𝒖) = 𝑗 =1 𝑢 𝑖 𝑗 . Theorem 6.20 then shows that 𝑞𝑘 ◦ 𝝀 is hyperbolic
1 ≤𝑖 1 <...<𝑖𝑘 ≤𝑑
with respect to 𝒆 , and its largest eigenvalue is 𝑘𝑖=1 𝜆𝑖 . By (6.1) and Theorem 6.9, this
Í

largest eigenvalue is positively homogeneous and convex. Thus, 𝑘𝑖=1 𝜆𝑖 is subadditive


Í
for all 𝑘 , implying 𝝀(𝒙 + 𝒚 ) ≺𝑤 𝝀(𝒙 ) + 𝝀(𝒚 ) . Finally, 1ᵀ 𝝀(𝒙 + 𝒚 ) = 1ᵀ (𝝀(𝒙 ) + 𝝀(𝒚 ))
by Corollary 6.19(b). 
Using the above weak form of Lidskii’s theorem, we can prove that the composition
of convex and symmetric functions with the eigenvalue map is also convex, a fact
that yields in particular Minkowski’s determinant inequality. The next result appears
in [Bau+01, Thm. 3.9].
Project 6: Hyperbolic Polynomials 271

Theorem 6.22 (Convexity of compositions). Let 𝑝 be a hyperbolic polynomial on E


with eigenvalue map 𝝀 . If 𝑓 : ℝ𝑑 → ℝ is convex and symmetric, then 𝑓 ◦ 𝝀 is
convex on E. Recall that ℝ = ℝ ∪ {±∞} is the
extended reals.
Proof. By Corollary 6.21 and the positive homogeneity of 𝜆, for any 𝒙 , 𝒚 ∈ E, any
𝜏 ∈ [0, 1] and 𝜏¯ = 1 − 𝜏 , we have

𝝀(𝜏𝒙 + 𝜏𝒚
¯ ) ≺ 𝝀(𝜏𝒙 ) + 𝝀(𝜏𝒚
¯ ) = 𝜏𝝀(𝒙 ) + 𝜏𝝀(𝒚
¯ ).

Since 𝑓 is convex and symmetric, it is isotone, hence

( 𝑓 ◦ 𝝀) (𝜏𝒙 + 𝜏𝒚
¯ ) ≤ 𝑓 (𝜏𝝀(𝒙 ) + 𝜏𝝀(𝒚
¯ )) ≤ 𝜏 ( 𝑓 ◦ 𝝀)(𝒙 ) + 𝜏¯ ( 𝑓 ◦ 𝝀)(𝒚 ),

where for the last inequality we used the convexity of 𝑓 . 


We can now use hyperbolicity to derive inequalities for eigenvalues of matrices.
Corollary 6.23 (Gårding’s inequality). Let 𝑝 be a hyperbolic polynomial on E with respect
to 𝒆 of degree 𝑑 . If 𝑝 (𝒆 ) > 0, then 𝒙 ↦→ 𝑝 (𝒙 ) 1/𝑑 is superlinear on Λ+ . In particular, A function is superlinear if it is
we have superadditive and positively
homogeneous.
tr (∧𝑑 (𝑨 + 𝑩)) 1/𝑑 ≥ tr (∧𝑑 𝑨) 1/𝑑 + tr (∧𝑑 𝑩) 1/𝑑 , (6.6)
for any 𝑨, 𝑩 ∈ ℍ𝑛+ and any 𝑛 ∈ ℕ.
Setting 𝑛 = 𝑑 in (6.6), we get Minkowski’s determinant inequality det (𝑨 + 𝑩) 1/𝑑 ≥
det (𝑨) 1/𝑑 + det (𝑩) 1/𝑑 , valid for any 𝑨, 𝑩 ∈ ℍ𝑑+ .

Proof. Note that 𝑝 (𝒙 ) = 𝑝 (𝒆 ) 𝑖 𝜆𝑖 (𝒙 ) = 𝑝 (𝒆 )(𝐸𝑑 ◦ 𝝀)(𝒙 ) . The function 𝒙 ↦→


Î
𝐸𝑑 (𝒙 ) 1/𝑑 is the geometric mean, which is concave on ℝ+𝑑 (exercise) and symmetric.
1/𝑑
Hence, 𝑝 1/𝑑 is concave on Λ+ by Theorem 6.22 (with 𝑓 = 𝐸𝑑 on ℝ+𝑑 and +∞
otherwise). Since 𝑝 is homogeneous of degree 𝑑 , we conclude that 𝑝 1/𝑑 is positively
homogeneous. Thus, it is superlinear on Λ+ .
For the second claim, Lemma 6.18 applied to 𝑝 = det and E = ℍ𝑛 shows that
𝐸𝑑 ◦ 𝝀 is hyperbolic of degree 𝑑 with respect to I𝑛 . Therefore, the map 𝑿 ↦→
 1/𝑑
(𝐸𝑑 (𝝀(𝑿 ))) 1/𝑑 = tr ∧𝑑 𝑿 is superlinear on Λ+ = ℍ𝑛+ , as desired. 
Î  1/𝑑
𝑑
Exercise 6.24 Show that 𝒙 ↦→ 𝑖 =1 𝑥𝑖 is concave on ℝ+𝑑 .

6.6 Euclidean structure


The eigenvalue map 𝜆 induces a psd symmetric bilinear form, which is an inner product
for all our examples. It generalizes the Frobenius inner product on symmetric matrices,
and satisfies a sharpened Cauchy–Schwarz inequality generalizing von Neumann’s
trace inequality.

Definition 6.25 Define k · k : E → ℝ+ by k𝒙 k = k𝝀(𝒙 )k 2 , and define h·, ·i : E2 → ℝ


by the polarization identity

1 1
h𝒙 , 𝒚 i B k𝒙 + 𝒚 k 2 − k𝒙 − 𝒚 k 2 .
4 4

Proposition 6.26 ([Bau+01, Thm. 4.2]). The function k · k is a seminorm on E, and h·, ·i is
a positive-semidefinite bilinear form on E. A seminorm is sublinear but may be
zero on non-zero vectors.
Project 6: Hyperbolic Polynomials 272

Proof. The function k · k is clearly nonnegative, and it is absolutely homogeneous


by (6.1). It is convex by Theorem 6.22 as it is the composition of the convex and
symmetric 2 norm on ℝ𝑑 with the eigenvalue map 𝜆. Thus, k · k is subadditive.
Moreover, k𝒙 k 2 = 𝑖 𝜆𝑖 (𝒙 ) 2 = (𝐸 1 (𝝀(𝒙 ))) 2 − 2𝐸 2 (𝝀(𝒙 )) , which is a homogeneous
Í
quadratic polynomial on E by Lemma 6.18. Any quadratic seminorm is induced by a
psd symmetric bilinear form given by the polarization identity. 
In order to make k · k a norm and h·, ·i an inner-product, we must require an
additional property from 𝑝 .

Definition 6.27 A hyperbolic polynomial 𝑝 with respect to 𝒆 is complete if {𝒙 ∈ E :


𝝀(𝒙 ) = 0} = {0}.

Proposition 6.28 The following are equivalent.

(a) 𝑝 is complete.
(b) Λ+ is pointed. A closed cone K is pointed if
(c) k𝒙 k = k𝝀(𝒙 ) k 2 is a norm. K ∩ (−K) = {0 }, or equivalently, if K
contains no lines.
Proof. If 𝝀(𝒙 ) = 0 then 𝝀(−𝒙 ) = 0 as well by (6.1), so 𝒙 ∈ Λ+ ∩ (−Λ+ ) . Conversely,
if 𝒙 ∈ Λ+ ∩ (−Λ+ ) then 𝜆 min (𝒙 ) ≥ 0 and 𝜆 min (−𝒙 ) = −𝜆 max (𝒙 ) ≥ 0, hence 0 ≤
𝜆 min (𝒙 ) ≤ 𝜆max (𝒙 ) ≤ 0 and 𝝀(𝒙 ) = 0. This shows (a) and (b) are equivalent. Finally,
(a) and (c) are equivalent because k𝒙 k is always a seminorm, and k𝒙 k = 0 if and only
if 𝝀(𝒙 ) = 0. 
The polynomials in Example 6.2 and Example 6.14 are complete, since ℝ+𝑛 , ℍ𝑛+ , and
the second-order cone are pointed. The above norm is also the Hessian norm used in
interior-point methods with the barrier log 1/𝑝 [Bau+01, Rmk. 4.3].
The inner-product in Definition 6.25 satisfies a sharpened Cauchy–Schwarz inequal-
ity [Bau+01, Prop. 4.4].
Proposition 6.29 (Refined Cauchy–Schwarz). For any 𝒙 , 𝒚 ∈ E,

h𝒙 , 𝒚 i ≤ h𝝀(𝒙 ), 𝝀(𝒚 )i2 ≤ k𝒙 kk𝒚 k.

Proof. Corollary 6.21 shows that 𝝀(𝒙 + 𝒚 ) ≺ 𝝀(𝒙 ) + 𝝀(𝒚 ) . Since the 2 norm is convex
and symmetric, it is isotone, hence k𝝀(𝒙 + 𝒚 )k 2 ≤ k𝝀(𝒙 ) + 𝝀(𝒚 )k 2 . Squaring both
sides, this is equivalent to

2 h𝝀(𝒙 ), 𝝀(𝒚 )i2 ≥ k𝝀(𝒙 + 𝒚 )k 22 − k𝝀(𝒙 )k 22 − k𝝀(𝒚 )k 22


= k𝒙 + 𝒚 k 2 − k𝒙 k 2 − k𝒚 k 2 = 2 h𝒙 , 𝒚 i,

giving the first claimed inequality. The second is Cauchy–Schwarz for the 2 norm. 

In Example 6.2(a), the sharpened Cauchy–Schwarz inequality reduces to Cheby-


shev’s rearrangement inequality, and in Example 6.2(b) it reduces to von Neumann’s
Í
trace inequality. Note also that h𝒙 , 𝒆 i = 𝑖 𝜆𝑖 (𝒙 ) , generalizing the trace h𝑿 , I𝑑 iF on
ℍ𝑑 and making its linearity (proved separately in Corollary 6.19(b)) apparent.

Lecture bibliography
[Bau+01] H. H. Bauschke et al. “Hyperbolic polynomials and convex analysis”. In: Canad. J.
Math. 53.3 (2001), pages 470–488. doi: 10.4153/CJM-2001-020-6.
Project 6: Hyperbolic Polynomials 273

[Går51] L. Gårding. “Linear hyperbolic partial differential equations with constant co-
efficients”. In: Acta Mathematica 85.none (1951), pages 1 –62. doi: 10 . 1007 /
BF02395740.
[Gül97] O. Güler. “Hyperbolic Polynomials and Interior Point Methods for Convex Program-
ming”. In: Mathematics of Operations Research 22.2 (1997), pages 350–377.
[HV07] J. W. Helton and V. Vinnikov. “Linear matrix inequality representation of sets”.
In: Communications on Pure and Applied Mathematics 60.5 (2007), pages 654–674.
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.20155.
doi: https://doi.org/10.1002/cpa.20155.
[LPR05] A. Lewis, P. Parrilo, and M. Ramana. “The Lax conjecture is true”. In: Proceedings
of the American Mathematical Society 133.9 (2005), pages 2495–2499.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
[Ren06] J. Renegar. “Hyperbolic Programs, and Their Derivative Relaxations”. In: Found.
Comput. Math. 6.1 (2006), pages 59–79. doi: 10.1007/s10208-004-0136-z.
[Sch14] R. Schneider. Convex bodies: the Brunn–Minkowski theory. 151. Cambridge university
press, 2014. doi: 10.1017/CBO9781139003858.
[Ser09] D. Serre. “Weyl and Lidskiı̆ inequalities for general hyperbolic polynomials”. In:
Chinese Annals of Mathematics, Series B 30.6 (2009), pages 785–802.
[SH19] Y. Shenfeld and R. van Handel. “Mixed volumes and the Bochner method”. In: Proc.
Amer. Math. Soc. 147.12 (2019), pages 5385–5402. doi: 10.1090/proc/14651.
7. The Laplace Transform Method for Matrices

Date: 9 March 2022 Author: Elvira Moreno

In previous lectures, we have observed how mathematical theory concerning scalars


Agenda:
can be extended to the realm of matrices. In this lecture, we explore how concepts in
1. The Laplace transform method
probability, originally defined to study large-deviation behavior of sequences of random 2. The Laplace transform bound
variables and their sums, can be paralleled to study the behavior of the extremal for sums of random matrices
eigenvalues of independent sums of random matrices. 3. The matrix Chernoff bound
We begin our discussion with a matrix version of the Laplace transform method, 4. Sparsification via random
sampling
which has proven to be an invaluable tool for producing tail bounds for the sums of
random variables. We then present a Laplace-tranform-like bound for the sum of
independent random matrices due to Tropp [Tro11], who builds on Lieb’s [Lie73b]
work on convex trace functions to extend the classical result to the matrix setting. As a
testament to the power of Tropp’s result, we show how it can be employed to prove a
matrix version of the Chernoff bound. Finally, we discuss an application of the matrix
Chernoff bound in spectral graph theory, where it is used to show that any undirected,
connected graph can be well approximated by a sparse graph with high probability.

7.1 The Laplace transform method


The Laplace transform method is an invaluable tool for producing bounds on the
tail probabilities of random variables and their sums. In this section, we recall the
statement of the Laplace transform bound and we show an extension of this result to
the matrix setting due to Ahlswede and Winter [AW01].

7.1.1 The Laplace transform method: Scalar case


We begin by recalling the classical definitions for the moment generating function and
the cumulant generating function of a random variable.

Definition 7.1 (Mgf and cgf). Let 𝑋 be a real random variable. The moment generating
function (mgf) of 𝑋 is defined by

𝑚 𝑋 (𝜃 ) B 𝔼 exp (𝜃 𝑋 ) for each 𝜃 ∈ ℝ.

The cumulant generating function (cgf) of 𝑋 is defined by

𝜉 𝑋 (𝜃 ) B log 𝔼 exp (𝜃 𝑋 ) for each 𝜃 ∈ ℝ.

Next, we state the Laplace transform method, which illustrates how the cgf of a
real random variable can be used to establish bounds on its tail probabilities.

Theorem 7.2 (Laplace transform method). Let 𝑋 be a real random variable. Then, for
Project 7: Matrix Laplace Transform Method 275

each 𝑡 ∈ ℝ,

ℙ {𝑋 ≥ 𝑡 } ≤ inf exp (−𝜃𝑡 + 𝜉 𝑋 (𝜃 )) ;


𝜃 >0
ℙ {𝑋 ≤ 𝑡 } ≤ inf exp (−𝜃𝑡 + 𝜉 𝑋 (𝜃 )) .
𝜃 <0

Exercise 7.3 Provide a proof for Theorem 7.2. Hint: Markov’s inequality.
The previous result can be readily employed to produce tail bounds for the sum of
independent random variables.
Corollary 7.4 (Laplace transform method for the independent sum). Let {𝑋 𝑖 }𝑘𝑖=1 be an
independent family of real random variables in L∞ . Then
n∑︁𝑘 o  ∑︁𝑘 
ℙ 𝑋𝑖 ≥ 𝑡 ≤ inf exp −𝜃𝑡 + 𝜉 𝑋𝑖 (𝜃 ) ;
𝑖 =1 𝜃 >0 𝑖 =1
n∑︁𝑘 o  ∑︁𝑘 
ℙ 𝑋𝑖 ≤ 𝑡 ≤ inf exp −𝜃𝑡 + 𝜉 𝑋𝑖 (𝜃 ) .
𝑖 =1 𝜃 <0 𝑖 =1

Proof. Let 𝑌 =
Í𝑘
𝑖 =1 𝑋 𝑖 . By independence of the random variables 𝑋𝑖 , we have that
Í𝑘 Ö𝑘  Ö𝑘 ∑︁𝑘
log 𝔼 e𝜃 𝑖 =1 𝑋 𝑖 = log 𝔼 e𝜃 𝑋𝑖 = log 𝔼 e𝜃 𝑋𝑖 = log 𝔼 e𝜃 𝑋𝑖
𝑖 =1 𝑖 =1 𝑖 =1
Í𝑘
for all 𝜃 ∈ ℝ. So 𝜉𝑌 = 𝑖 =1 𝜉 𝑋𝑖 , and the result follows by replacing 𝑌 and 𝜉𝑌 in
Theorem 7.2. 
In the next section, we extend these concepts and results to the matrix case.

7.1.2 The Laplace transform method: Matrix case


We begin by defining the concepts of moment generating function and cumulant
generating function of a random matrix in analogy to their classical definitions for real
random variables.
Recall from Lecture 13, that the
Definition 7.5 (Matrix mgf and cgf). Let 𝑿 ∈ 𝕄𝑛 be a random self-adjoint matrix. matrix exponential is neither
Define the moment generating function (mgf) of 𝑿 by monotone nor convex. This means
that, unlike the scalar mgf, the matrix
𝑀 𝑿 (𝜃 ) B 𝔼 exp (𝜃 𝑿 ) for 𝜃 ∈ ℝ. mgf does not enjoy these properties.

Define the cumulant generating function (cgf) of 𝑿 by

Ξ𝑿 (𝜃 ) B log 𝔼 exp (𝜃 𝑿 ) for 𝜃 ∈ ℝ.

Note that when 𝑛 = 1, the definitions for the mgf and the cgf of a random matrix
𝑿 ∈ 𝕄𝑛 coincide with their classical definition in the setting of real random variables.
As in the scalar case, the cgf of a random matrix can be employed to provide bounds
on the tail probabilities of its largest eigenvalue. An extension of Theorem 7.2 to the
matrix setting was first provided by Ahlswede and Winter in [AW01]. We will instead
consider the following variant by Oliveira [Oli10]. The proof we present is that by
Tropp [Tro11].

Theorem 7.6 (Laplace transform method: Matrix case). Let 𝑿 be a random self-adjoint
Project 7: Matrix Laplace Transform Method 276

matrix and let 𝑡 ∈ ℝ. Then,

ℙ {𝜆 max (𝑿 ) ≥ 𝑡 } ≤ inf e−𝜃𝑡 𝔼 tr e𝜃 𝑿 ;


𝜃 >0
ℙ {𝜆 min (𝑿 ) ≤ 𝑡 } ≤ inf e−𝜃𝑡 𝔼 tr e𝜃 𝑿 .
𝜃 <0

Proof. (Upper Bound) For 𝜃 > 0, Markov’s inequality yields


n o 𝔼 e𝜆max (𝜃 𝑿 )
ℙ {𝜆 max (𝑿 ) ≥ 𝑡 } = ℙ {𝜆max (𝜃 𝑿 ) ≥ 𝜃𝑡 } = ℙ e𝜆max (𝜃 𝑿 ) ≤ e𝜃𝑡 ≤ .
𝜃𝑡 e
The first identity follows because the eigenvalue map is positively homogeneous. The
second is due to the fact that the scalar exponential function is monotone increasing.
Next, recall from Definition 13.2 in Lecture 13, that matrix functions are defined via
the spectral resolution. This means that the eigenvalues of the matrix e𝜃 𝑿 correspond
to the exponentials of the eigenvalues of 𝜃 𝑿 , and we have that

e𝜆max (𝜃 𝑿 ) = 𝜆 max ( e𝜃 𝑿 ) ≤ tr e𝜃 𝑿 .

The inequality is due to the fact that all eigenvalues of the matrix e𝜃 𝑿 are strictly
positive, so its trace dominates its largest eigenvalue. Hence, for 𝜃 > 0,

𝔼 e𝜆max (𝜃 𝑿 ) 𝔼 tr e𝜃 𝑿
ℙ {𝜆 max (𝑿 ) ≥ 𝑡 } ≤ ≤ .
e𝜃𝑡 e𝜃 𝑿
The result is obtained by taking the infimum over all strictly positive 𝜃 . 
Exercise 7.7 (Laplace transform method: Lower bound). Provide a proof for the lower bound
of Theorem 7.2.
Paralleling the scalar case, we would like to employ the Laplace transform method
for random matrices, Theorem 7.6, to establish tail bounds for the largest eigenvalue
of the sum of independent random matrices. It is here where we encounter the biggest
challenge. Indeed, as is evidenced in the proof of Corollary 7.4, the natural step from
the Laplace transform bound to a Laplace-transform-like statement regarding the tail
decay of the sum of random variables, strongly depends on the additivity of the cgf.
This property of the scalar cgf is not transferred into the matrix case. The reason for
this is that, unlike the scalar exponential, the matrix exponential function does not
convert sums of matrices into products. That is, the identity

e𝑿 1 +𝑿 2 = e𝑿 1 𝑿 2

does not always hold.


In the next section, we show how a result on the concavity of a particular trace
function can be leveraged to circumvent this issue and establish the desired extension
of Corollary 7.4 to the matrix setting.

7.2 Laplace transform tail bound for sums of random matrices


To move forward in our aim of extending the Laplace transform method to the matrix
setting, we require a result by Lieb on the concavity of a particular trace function.
This result is one of his many insights concerning the convexity of trace functions. We
refer the interested reader to [Lie73b], where Theorem 7.8 appears as Theorem 6. We
present a proof of the result by Tropp [Tro12], who derives it from joint convexity of
the quantum relative entropy function.
Project 7: Matrix Laplace Transform Method 277

Theorem 7.8 (Lieb). Fix a self-adjoint matrix 𝑯 . The function

𝑨 ↦−→ tr exp (𝑯 + log (𝑨))

is concave on the PSD cone.

Before proceeding with the proof of Lieb’s Theorem, it is important to recall the
definition of the quantum relative entropy function.

Definition 7.9 (Quantum relative entropy). Let 𝑿 and 𝒀 be positive-definite matrices.


The quantum relative entropy of 𝑿 with respect to 𝒀 is defined by

D (𝑿 ; 𝒀 ) B tr (𝑿 log 𝑿 − 𝑿 log 𝒀 − (𝑿 − 𝒀 )).

In Lecture 17, we defined the (Umegaki) quantum relative entropy of two density
matrices 𝝔, 𝝂 ∈ 𝚫𝑛 as

S (𝝔; 𝝂) B tr (𝝔 log 𝝔 − 𝝔 log 𝝂),

and we proved that it is both nonnegative and jointly convex. These two properties
also hold for the quantum relative entropy function as defined in Definition 7.9 and play a central
role in the proof of Theorem 7.8.

Proof. Let 𝑿 and 𝒀 be positive definite. By nonnegativity of the quantum relative
entropy function, we have that

    tr (𝒀 ) ≥ tr (𝑿 log 𝒀 − 𝑿 log 𝑿 + 𝑿 ).

Since equality is attained when 𝑿 = 𝒀 , we deduce that

    tr (𝒀 ) = max_{𝑿 ≻ 0} tr (𝑿 log 𝒀 − 𝑿 log 𝑿 + 𝑿 ).

Substituting 𝒀 = exp (𝑯 + log 𝑨) in the identity above yields

    tr ( exp (𝑯 + log 𝑨)) = max_{𝑿 ≻ 0} tr (𝑿 log exp (𝑯 + log 𝑨) − 𝑿 log 𝑿 + 𝑿 )
                          = max_{𝑿 ≻ 0} tr (𝑿 𝑯 ) − tr (𝑿 log 𝑿 − 𝑿 log 𝑨 − 𝑿 + 𝑨) + tr (𝑨)
                          = max_{𝑿 ≻ 0} tr (𝑿 𝑯 ) − D (𝑿 ; 𝑨) + tr (𝑨).

Observe that 𝒀 ≻ 0, as matrix functions are defined via the spectral resolution and
e^𝜆 > 0 for all 𝜆 ∈ ℝ. So the substitution is valid. Furthermore, joint convexity
of D and linearity of the trace imply joint concavity of the function (𝑿 , 𝑨) ↦→
tr (𝑿 𝑯 ) − D (𝑿 ; 𝑨) + tr (𝑨). The result follows from the fact that the partial maximization
of a jointly concave function is concave (see Exercise 7.10). ■
Exercise 7.10 (Partial maximization of a jointly concave function). Let 𝑓 (·, ·) be a jointly
concave function. Show that the map 𝑦 ↦→ sup_𝑥 𝑓 (𝑥, 𝑦 ) is concave.
Corollary 7.11 (Expectation of the trace exponential). Let 𝑯 be a fixed self-adjoint matrix.
Then, for any random self-adjoint matrix 𝑿 ,

𝔼 tr exp (𝑯 + 𝑿 ) ≤ tr exp (𝑯 + log (𝔼 e𝑿 )).



Exercise 7.12 Provide a proof for Corollary 7.11. Hint: Employ Theorem 7.8 with 𝑨 = e𝑿
and invoke Jensen’s inequality.
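Corollary 7.11 is also easy to probe numerically. The following sketch (a toy Monte Carlo check of ours, assuming NumPy and SciPy are available) compares an empirical estimate of 𝔼 tr exp(𝑯 + 𝑿 ) with tr exp(𝑯 + log 𝔼 e^𝑿 ) for small random symmetric matrices; up to sampling error, the first quantity should not exceed the second.

```python
import numpy as np
from scipy.linalg import expm, logm

rng = np.random.default_rng(0)
n, trials = 5, 2000

def rand_sym():
    """Draw a random real symmetric (self-adjoint) matrix."""
    G = rng.standard_normal((n, n))
    return (G + G.T) / 2

H = rand_sym()                                            # fixed self-adjoint matrix
samples = [rand_sym() for _ in range(trials)]

lhs = np.mean([np.trace(expm(H + X)) for X in samples])   # estimate of E tr exp(H + X)
EeX = np.mean([expm(X) for X in samples], axis=0)         # estimate of E e^X (positive definite)
rhs = np.trace(expm(H + np.real(logm(EeX))))              # tr exp(H + log E e^X)

print(lhs, rhs)                                           # expect lhs <= rhs up to sampling error
```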
Recall from the end of Section 7.1 that nonadditivity of the cgf was the major obstacle
to establishing a matrix parallel of Corollary 7.4. Tropp puts this issue to rest by
proving the following subadditivity result for matrix cgfs [Tro11].
Lemma 7.13 (Subadditivity of matrix cgfs). Consider an independent sequence {𝑿_𝑖 }_{𝑖=1}^{𝑘} of
random, self-adjoint matrices. Then

    𝔼 tr exp ( ∑_{𝑖=1}^{𝑘} 𝜃 𝑿_𝑖 ) ≤ tr exp ( ∑_{𝑖=1}^{𝑘} log 𝔼 e^{𝜃 𝑿_𝑖} )    for 𝜃 ∈ ℝ.

Proof. Assume without loss of generality that 𝜃 = 1. For 𝑖 = 1, . . . , 𝑘 , we adopt the
conventions

    𝔼_𝑖 ≔ 𝔼 [ · | 𝑿_1 , . . . , 𝑿_𝑖 ]    and    Ξ_𝑖 ≔ log 𝔼 e^{𝑿_𝑖} .

Using the tower property of conditional expectation, we can write

    𝔼 tr exp ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) = 𝔼_0 𝔼_1 · · · 𝔼_{𝑘−1} tr exp ( ∑_{𝑖=1}^{𝑘−1} 𝑿_𝑖 + 𝑿_𝑘 ).

Invoking Corollary 7.11 conditionally with 𝑯 = ∑_{𝑖=1}^{𝑘−1} 𝑿_𝑖 yields

    𝔼_0 · · · 𝔼_{𝑘−1} tr exp ( ∑_{𝑖=1}^{𝑘−1} 𝑿_𝑖 + 𝑿_𝑘 )
        ≤ 𝔼_0 · · · 𝔼_{𝑘−2} tr exp ( ∑_{𝑖=1}^{𝑘−1} 𝑿_𝑖 + log 𝔼_{𝑘−1} e^{𝑿_𝑘} )
        = 𝔼_0 · · · 𝔼_{𝑘−2} tr exp ( ∑_{𝑖=1}^{𝑘−1} 𝑿_𝑖 + Ξ_𝑘 )
        = 𝔼_0 · · · 𝔼_{𝑘−2} tr exp ( ∑_{𝑖=1}^{𝑘−2} 𝑿_𝑖 + 𝑿_{𝑘−1} + Ξ_𝑘 ).

The first equality holds because 𝑿_𝑘 is independent of the earlier matrices, so
𝔼_{𝑘−1} e^{𝑿_𝑘} = 𝔼 e^{𝑿_𝑘}. Taking 𝑯 = ∑_{𝑖=1}^{𝑘−2} 𝑿_𝑖 + Ξ_𝑘 and invoking Corollary 7.11 once more, we obtain

    𝔼_0 · · · 𝔼_{𝑘−2} tr exp ( ∑_{𝑖=1}^{𝑘−2} 𝑿_𝑖 + 𝑿_{𝑘−1} + Ξ_𝑘 )
        ≤ 𝔼_0 · · · 𝔼_{𝑘−3} tr exp ( ∑_{𝑖=1}^{𝑘−2} 𝑿_𝑖 + Ξ_{𝑘−1} + Ξ_𝑘 )
        = 𝔼_0 · · · 𝔼_{𝑘−3} tr exp ( ∑_{𝑖=1}^{𝑘−3} 𝑿_𝑖 + 𝑿_{𝑘−2} + Ξ_{𝑘−1} + Ξ_𝑘 ).

Repeating this procedure recursively with 𝑯 = ∑_{𝑖=1}^{𝑚−1} 𝑿_𝑖 + ∑_{𝑗=𝑚+1}^{𝑘} Ξ_𝑗 for 𝑚 = 𝑘 , 𝑘 − 1, . . . , 1 yields

    𝔼 tr exp ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≤ tr exp ( ∑_{𝑖=1}^{𝑘} Ξ_𝑖 ),

from which the result follows. ■


Having established subadditivity of the matrix cgf, the following tail bound can be
derived from the first bound in Theorem 7.6 [Tro11].

Theorem 7.14 (Tail bounds for independent sums of random matrices). Consider an
independent sequence {𝑿_𝑖 }_{𝑖=1}^{𝑘} of random, self-adjoint matrices. Then, for all 𝑡 ∈ ℝ,

    ℙ { 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≥ 𝑡 } ≤ inf_{𝜃 >0} e^{−𝜃𝑡} tr exp ( ∑_{𝑖=1}^{𝑘} log 𝔼 e^{𝜃 𝑿_𝑖} ) ;
    ℙ { 𝜆min ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≤ 𝑡 } ≤ inf_{𝜃 <0} e^{−𝜃𝑡} tr exp ( ∑_{𝑖=1}^{𝑘} log 𝔼 e^{𝜃 𝑿_𝑖} ) .

Proof. Follows directly from subadditivity of matrix cgfs, Lemma 7.13, and Theorem
7.6. 
We also consider the following Corollary, as it will be useful in the next section
when we prove an extension of the Chernoff bound to the matrix setting. We refer the
reader to [Tro11, Corollary 3.9] for a proof.
Corollary 7.15 Consider an independent sequence {𝑿_𝑖 }_{𝑖=1}^{𝑘} of random, self-adjoint
matrices in 𝕄_𝑛 . For all 𝑡 ∈ ℝ,

    ℙ { 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≥ 𝑡 } ≤ 𝑛 inf_{𝜃 >0} exp ( −𝜃𝑡 + 𝑘 log 𝜆max ( (1/𝑘) ∑_{𝑖=1}^{𝑘} 𝔼 e^{𝜃 𝑿_𝑖} ) ) .

It is worth mentioning that these results can be generalized to inequalities involving


the singular values of random rectangular matrices by means of their self-adjoint
dilation. Refer to [Tro11] for more details.

7.3 The matrix Chernoff bound


Recall Chernoff ’s inequality for bounded, positive random variables.

Theorem 7.16 (Chernoff’s inequality). Consider an independent sequence {𝑋_𝑖 }_{𝑖=1}^{𝑘} of
random variables satisfying 𝑋_𝑖 ∈ [0, 1] for 𝑖 = 1, . . . , 𝑘 , and let 𝜇 ≔ ∑_{𝑖=1}^{𝑘} 𝔼 𝑋_𝑖 . Then,

    ℙ { ∑_{𝑖=1}^{𝑘} 𝑋_𝑖 ≥ (1 + 𝛿)𝜇 } ≤ [ e^{𝛿} / (1 + 𝛿)^{1+𝛿} ]^{𝜇}    for 𝛿 ≥ 0;
    ℙ { ∑_{𝑖=1}^{𝑘} 𝑋_𝑖 ≤ (1 − 𝛿)𝜇 } ≤ [ e^{−𝛿} / (1 − 𝛿)^{1−𝛿} ]^{𝜇}    for 𝛿 ∈ [0, 1].

Next, we establish a similar result for controlling the extremal eigenvalues of the
sum of independent random matrices that satisfy some additional properties. For this
purpose, we first need a simple lemma.
Lemma 7.17 (Chernoff mgf). Let 𝑿 be a random positive-semidefinite matrix such that
𝜆max (𝑿 ) ≤ 1. Then, for 𝜃 ∈ ℝ,

    𝔼 e^{𝜃 𝑿} ⪯ I + ( e^{𝜃} − 1)(𝔼 𝑿 ).

Exercise 7.18 Provide a proof for Lemma 7.17. Hint: First establish the result for the
scalar case by bounding the exponential on [ 0, 1] by a straight line.
The following theorem is presented as Corollary 5.2 in [Tro11], where the reader
can find a stronger version of the Chernoff bound [Tro11, Theorem 5.1].

Theorem 7.19 (Matrix Chernoff bound). Consider an independent sequence {𝑿_𝑖 }_{𝑖=1}^{𝑘} of
random, self-adjoint matrices in 𝕄_𝑛 satisfying 𝑿_𝑖 ⪰ 0 and 𝜆max (𝑿_𝑖 ) ≤ 𝑅 almost
surely, for each 𝑖 ∈ {1, . . . , 𝑘 }. Define

    𝜇min ≔ 𝜆min ( ∑_{𝑖=1}^{𝑘} 𝔼 𝑿_𝑖 )    and    𝜇max ≔ 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝔼 𝑿_𝑖 ).

Then,

    ℙ { 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≥ (1 + 𝛿)𝜇max } ≤ 𝑛 [ e^{𝛿} / (1 + 𝛿)^{1+𝛿} ]^{𝜇max /𝑅}    for 𝛿 ≥ 0;
    ℙ { 𝜆min ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≤ (1 − 𝛿)𝜇min } ≤ 𝑛 [ e^{−𝛿} / (1 − 𝛿)^{1−𝛿} ]^{𝜇min /𝑅}    for 𝛿 ∈ [0, 1].

Proof. For each 𝑖 = 1, . . . , 𝑘 , let 𝒀_𝑖 = 𝑿_𝑖 /𝑅 . Then 𝜆max (𝒀_𝑖 ) ≤ 1 for every 𝑖 ∈
{1, . . . , 𝑘 }, and for 𝑡 > 0 we have that

    ℙ { 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝑿_𝑖 ) ≥ 𝑡 } = ℙ { 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝒀_𝑖 ) ≥ 𝑡′ },    where 𝑡′ = 𝑡 /𝑅 .

By Lemma 7.17, 𝔼 e^{𝜃 𝒀_𝑖} ⪯ I + ( e^{𝜃} − 1)(𝔼 𝒀_𝑖 ) for all 𝑖 . So Weyl’s monotonicity
principle implies that 𝜆max ( (1/𝑘) ∑_𝑖 𝔼 e^{𝜃 𝒀_𝑖} ) ≤ 𝜆max ( (1/𝑘) ∑_𝑖 ( I + ( e^{𝜃} − 1) 𝔼 𝒀_𝑖 ) ) ,
and Corollary 7.15 yields, for 𝜃 > 0,

    ℙ { 𝜆max ( ∑_{𝑖=1}^{𝑘} 𝒀_𝑖 ) ≥ 𝑡′ } ≤ 𝑛 exp ( −𝜃 𝑡′ + 𝑘 log 𝜆max ( (1/𝑘) ∑_{𝑖=1}^{𝑘} 𝔼 e^{𝜃 𝒀_𝑖} ) )
        ≤ 𝑛 exp ( −𝜃 𝑡′ + 𝑘 log 𝜆max ( I + ((e^{𝜃} − 1)/𝑘) ∑_{𝑖=1}^{𝑘} 𝔼 𝒀_𝑖 ) )
        ≤ 𝑛 exp ( −𝜃 𝑡′ + 𝑘 log ( 1 + (e^{𝜃} − 1) 𝜇max /(𝑘𝑅) ) )
        ≤ 𝑛 exp ( −𝜃 𝑡′ + (e^{𝜃} − 1) 𝜇max /𝑅 ).

The last inequality is due to the fact that log (1 + 𝑥) ≤ 𝑥 for 𝑥 > −1. The result is
obtained by substituting 𝑡′ = (1 + 𝛿)𝜇max /𝑅 and selecting 𝜃 = log (1 + 𝛿) .
The lower bound follows from a similar argument. ■
It is worth mentioning that the Chernoff bound is only one of many matrix
inequalities that can be deduced from the Laplace transform tail bounds for independent
sums of random matrices, Theorem 7.14. In fact, Theorem 7.14 is used in [Tro11]
to produce matrix versions of the bounds known by the names of Azuma, Bennett,
Bernstein, Chernoff, Hoeffding, and McDiarmid.
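As a sanity check on these bounds, one can compare Theorem 7.19 with a Monte Carlo estimate. Here is a minimal NumPy sketch of ours for rank-one psd summands 𝑿_𝑖 = 𝒛_𝑖 𝒛_𝑖ᵀ with ‖𝒛_𝑖 ‖ ≤ 1 (so 𝑅 = 1); the empirical tail probability should fall below the Chernoff bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, trials = 10, 600, 500

def sample_sum():
    """Draw S = sum_i X_i with X_i = z_i z_i^T, z_i uniform and ||z_i||^2 <= 1 (so R = 1)."""
    Z = rng.uniform(-1.0, 1.0, size=(k, n)) / np.sqrt(n)
    return Z.T @ Z

# E[z z^T] = (1/(3n)) I for coordinates uniform on [-1, 1]/sqrt(n)
mu_max = k / (3.0 * n)                          # lambda_max of sum_i E[X_i]
delta = 1.0
t = (1 + delta) * mu_max

emp = np.mean([np.linalg.eigvalsh(sample_sum()).max() >= t for _ in range(trials)])
bound = n * (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu_max   # Theorem 7.19 with R = 1
print(emp, bound)                               # empirical tail probability vs. Chernoff bound
```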

7.4 Application: Sparsification via random sampling


In this section we show how the matrix Chernoff bound can be employed to prove that
any graph can be well approximated by a sparse graph with high probability. With this
purpose in mind, we recall a couple of key definitions from graph theory.

Definition 7.20 (Adjacency matrix). Let G = ( V, E) be a graph. The adjacency matrix
𝑨 of G is the | V | × | V | matrix defined by

    𝐴_{(𝑢,𝑣)} ≔ 1 if (𝑢, 𝑣 ) ∈ E, and 𝐴_{(𝑢,𝑣)} ≔ 0 otherwise.

In words, the adjacency matrix of G is the (0, 1)-matrix whose (𝑢, 𝑣 )-th entry
indicates whether there is an edge from node 𝑢 to node 𝑣 . For a set V, we denote by
| V | its cardinality. In this section we only deal with undirected graphs G; in this case,
the adjacency matrix 𝑨 is always symmetric.

Definition 7.21 (Laplacian matrix). Let G = ( V, E) be an undirected graph. The
Laplacian matrix of G, denoted by 𝑳_G , is defined by

    𝑳_G ≔ 𝑫 − 𝑨,

where 𝑫 is the diagonal matrix whose entries are the degrees of the nodes in V.
Recall that the degree of node 𝑢 ∈ V is the number of edges attached to 𝑢 . The
Laplacian of a graph is often understood as its matrix representation.

An important property of the Laplacian of any graph is that it is positive semidefinite.
Indeed, for 𝒙 ∈ ℝ^𝑛 , we have that

    𝒙 ᵀ 𝑳_G 𝒙 = ∑_{𝑢 ∈ V} deg (𝑢) 𝑥_𝑢² − 2 ∑_{(𝑢,𝑣 ) ∈ E} 𝑥_𝑢 𝑥_𝑣 = ∑_{(𝑢,𝑣 ) ∈ E} (𝑥_𝑢 − 𝑥_𝑣 )² ≥ 0 .

So far we have mentioned that any graph can be approximated by a sparse graph
with high probability. Nonetheless, we have not yet specified what we mean by an
approximation. The notion of “closeness” between two graphs that we will consider is
based on the spectrum of their Laplacian matrices.

Definition 7.22 (Spectral 𝜀 -approximation). Let G = ( V, E) be a connected, undirected


graph, and let H be a graph that is defined over the vertex set V. Given 𝜀 > 0, we
say that H is a spectral 𝜀 -approximation of G if

    (1 − 𝜀) 𝑳_G ⪯ 𝑳_H ⪯ (1 + 𝜀) 𝑳_G .

Our main objective is to show, by means of the probabilistic method, that every
undirected, connected graph has a sparse spectral 𝜀 -approximation. This result is due
to Spielman and Srivastava [SS11] and the analysis presented in this section was based
on that by Spielman in his lecture notes [Spi19]. The argument consists of two steps.
First, we describe an algorithm that, given 𝜀 > 0 and a graph G, produces a sparse
graph H whose Laplacian coincides in expectation to that of G. Then, we invoke the
matrix Chernoff bound, Theorem 7.19, to show that H, the graph resulting from the
algorithm, is a spectral 𝜀 -approximation of G with high probability.
For the remainder of this section, we denote by G = ( V, E) an undirected, connected
graph with | V | = 𝑛 nodes. We also denote by 𝑤 (𝑢,𝑣 ) the weight of the edge (𝑢, 𝑣 ) ∈ E.
Finally, for (𝑢, 𝑣 ) ∈ E, we denote by 𝑳 (𝑢,𝑣 ) the Laplacian matrix of the graph with 𝑛
nodes and a single edge connecting 𝑢 and 𝑣 . Note that 𝑳 (𝑢,𝑣 ) can alternatively be
written as (𝜹 𝑢 − 𝜹 𝑣 ) (𝜹 𝑢 − 𝜹 𝑣 ) ᵀ , where 𝜹 𝑢 is the standard basis vector associated to
the position of 𝑢 .

7.4.1 The algorithm


We begin by presenting the randomized algorithm that Spielman proposed for the
construction of sparse approximations of a graph [Spi19]. Given G and a parameter
𝑅 > 0, the algorithm produces a graph H whose Laplacian matrix coincides in
expectation with 𝑳 G , and whose edge density is inversely proportional to 𝑅 .
Algorithm 7.1
Input: G, 𝑅 > 0
Output: H
For each (𝑢, 𝑣 ) ∈ E do:
  1. Compute the probability

        𝑝_{(𝑢,𝑣)} ≔ (1/𝑅) 𝑤_{(𝑢,𝑣)} (𝜹_𝑢 − 𝜹_𝑣 )ᵀ 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣 ),

     where 𝑳_G^† denotes the generalized inverse of 𝑳_G .
  2. Add the edge (𝑢, 𝑣 ) with weight 𝑤_{(𝑢,𝑣)} /𝑝_{(𝑢,𝑣)} to H with probability 𝑝_{(𝑢,𝑣)} .

For (𝑢, 𝑣 ) ∈ E, the quantities 𝑟_{(𝑢,𝑣)} = (𝜹_𝑢 − 𝜹_𝑣 )ᵀ 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣 ) and
𝑙_{(𝑢,𝑣)} = 𝑤_{(𝑢,𝑣)} 𝑟_{(𝑢,𝑣)} are called the effective resistance and the leverage score of
the edge (𝑢, 𝑣 ) , respectively, and are of interest in their own right.
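A minimal NumPy sketch of Algorithm 7.1 follows (the function names are ours, and the sampling probability is capped at 1, a detail the pseudocode leaves implicit). It sparsifies a complete graph and then reports the extreme nonzero eigenvalues of 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2}, which the analysis in Section 7.4.2 predicts lie in [1 − 𝜀, 1 + 𝜀] with high probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplacian(n, edges, weights):
    """Weighted graph Laplacian L = D - A assembled from an edge list."""
    L = np.zeros((n, n))
    for (u, v), w in zip(edges, weights):
        L[u, u] += w; L[v, v] += w
        L[u, v] -= w; L[v, u] -= w
    return L

def sparsify(n, edges, weights, R):
    """Algorithm 7.1: keep edge (u, v) with probability p = w * r / R (capped at 1), reweighted by 1/p."""
    Lp = np.linalg.pinv(laplacian(n, edges, weights))     # generalized inverse of L_G
    kept, kept_w = [], []
    for (u, v), w in zip(edges, weights):
        d = np.zeros(n); d[u], d[v] = 1.0, -1.0
        r = d @ Lp @ d                                    # effective resistance r_(u,v)
        p = min(1.0, w * r / R)
        if rng.random() < p:
            kept.append((u, v)); kept_w.append(w / p)     # reweighting makes E[L_H] = L_G
    return kept, kept_w

# Example: sparsify the complete graph on n nodes
n, eps = 60, 0.5
edges = [(u, v) for u in range(n) for v in range(u + 1, n)]
weights = [1.0] * len(edges)
R = eps**2 / (3.5 * np.log(n))
H_edges, H_weights = sparsify(n, edges, weights, R)

# Check the approximation factor on the range of L_G
LG = laplacian(n, edges, weights)
LH = laplacian(n, H_edges, H_weights)
w, V = np.linalg.eigh(LG)
S = V[:, w > 1e-8] / np.sqrt(w[w > 1e-8])                 # columns of L_G^{+/2} spanning range(L_G)
evals = np.linalg.eigvalsh(S.T @ LH @ S)                  # spectrum of L_G^{+/2} L_H L_G^{+/2}
print(len(H_edges), evals.min(), evals.max())             # hopefully within [1 - eps, 1 + eps]
```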

Therefore, to create the approximating graph H, we first assign a probability 𝑝 (𝑢,𝑣 )


to each edge (𝑢, 𝑣 ) ∈ E of G and then add the edge (𝑢, 𝑣 ) with weight 𝑤 (𝑢,𝑣 ) /𝑝 (𝑢,𝑣 )
to H with probability 𝑝 (𝑢,𝑣 ) . As expressed in the following proposition, this choice of
weights allows for the expectation of 𝑳 H to be exactly what we want it to be.
Proposition 7.23 (Expectation of the 𝑳 H ). Given a graph G and 𝑅 > 0, the graph H returned
by Algorithm 7.1 is such that 𝔼 𝑳 H = 𝑳 G .

Proof. Computing the expectation of 𝑳_H yields

    𝔼 𝑳_H = ∑_{(𝑢,𝑣 ) ∈ E} 𝑝_{(𝑢,𝑣)} ( 𝑤_{(𝑢,𝑣)} / 𝑝_{(𝑢,𝑣)} ) 𝑳_{(𝑢,𝑣)} = ∑_{(𝑢,𝑣 ) ∈ E} 𝑤_{(𝑢,𝑣)} 𝑳_{(𝑢,𝑣)} = 𝑳_G ,

as desired. ■
The substance of Algorithm 7.1 lies in the choice of the probabilities 𝑝 (𝑢,𝑣 ) , which
have been cleverly selected so as to produce a sparse approximator H.
Proposition 7.24 (Sparsity of H). Given a graph G = ( V, E) , the expected number of edges
of the graph H returned by Algorithm 7.1 is (𝑛 − 1)/𝑅 .

Proof. The expected number of edges in H is

    ∑_{(𝑢,𝑣 ) ∈ E} 𝑝_{(𝑢,𝑣)} = (1/𝑅) ∑_{(𝑢,𝑣 ) ∈ E} 𝑤_{(𝑢,𝑣)} (𝜹_𝑢 − 𝜹_𝑣 )ᵀ 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣 )
        = (1/𝑅) ∑_{(𝑢,𝑣 ) ∈ E} 𝑤_{(𝑢,𝑣)} tr ( 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣 )(𝜹_𝑢 − 𝜹_𝑣 )ᵀ )
        = (1/𝑅) tr ( 𝑳_G^† ∑_{(𝑢,𝑣 ) ∈ E} 𝑤_{(𝑢,𝑣)} (𝜹_𝑢 − 𝜹_𝑣 )(𝜹_𝑢 − 𝜹_𝑣 )ᵀ )
        = (1/𝑅) tr ( 𝑳_G^† ∑_{(𝑢,𝑣 ) ∈ E} 𝑤_{(𝑢,𝑣)} 𝑳_{(𝑢,𝑣)} )
        = (1/𝑅) tr ( 𝑳_G^† 𝑳_G )
        = (𝑛 − 1)/𝑅 .

The last identity is due to the fact that G is connected, so 𝑳_G has 0 as an eigenvalue
with multiplicity exactly one; hence 𝑳_G^† 𝑳_G is the orthogonal projector onto the range
of 𝑳_G and tr (𝑳_G^† 𝑳_G ) = 𝑛 − 1. For more information on the relationship between
connectivity of a graph and the eigenvalues of its Laplacian matrix we refer the reader
to Chapter 2 of [MP93]. 
Exercise 7.25 (Sparsity with high probability). Show that choosing 𝑅^{−1} = Ω(𝜀^{−2} ln 𝑛) in
Algorithm 7.1 produces a graph H with 𝑂 (𝜀^{−2} 𝑛 ln 𝑛) edges with high probability.

7.4.2 Analysis of the algorithm


We have described an algorithm by Spielman and Srivastava that, given an undirected,
connected graph G, produces a sparse graph H whose Laplacian coincides with the
Laplacian of G in expectation. It remains to show that H is a spectral 𝜀 -approximation
of G with high probability. We begin with a useful transformation, which will later
allow us to employ the matrix Chernoff bound.
Lemma 7.26 Let 𝜀 > 0 and let G and H be as in Definition 7.22. Then

    (1 − 𝜀) 𝑳_G ⪯ 𝑳_H ⪯ (1 + 𝜀) 𝑳_G

if and only if

    (1 − 𝜀) 𝑳_G^{†/2} 𝑳_G 𝑳_G^{†/2} ⪯ 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2} ⪯ (1 + 𝜀) 𝑳_G^{†/2} 𝑳_G 𝑳_G^{†/2} ,

where 𝑳_G^{†/2} denotes the square root of the generalized inverse of 𝑳_G .
Exercise 7.27 Provide a proof for Lemma 7.26.
We are now ready to prove the main theorem of this section.

Theorem 7.28 (Spielman and Srivastava (2011)). Let G = ( V, E) be an undirected,


connected graph with | V | = 𝑛 , and let 𝜀 > 0. Then there exists a graph H on V,
with at most 4𝜀 −2 𝑛 log 𝑛 edges, that is a spectral 𝜀 -approximation of G with high
probability.

Proof. Let 𝚷 ≔ 𝑳_G^{†/2} 𝑳_G 𝑳_G^{†/2} and define

    𝑿_{(𝑢,𝑣)} ≔ (𝑤_{(𝑢,𝑣)} /𝑝_{(𝑢,𝑣)}) 𝑳_G^{†/2} 𝑳_{(𝑢,𝑣)} 𝑳_G^{†/2}  if (𝑢, 𝑣 ) is added to H, and 𝑿_{(𝑢,𝑣)} ≔ 0 otherwise.

By Lemma 7.26, H is a spectral 𝜀 -approximation of G if and only if the graph whose
Laplacian is 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2} is a spectral 𝜀 -approximation of the graph with Laplacian 𝚷 .
Applying the transformation to the identity in the proof of Proposition 7.23, we obtain

    𝔼 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2} = ∑_{(𝑢,𝑣 ) ∈ E} 𝑝_{(𝑢,𝑣)} (𝑤_{(𝑢,𝑣)} /𝑝_{(𝑢,𝑣)}) 𝑳_G^{†/2} 𝑳_{(𝑢,𝑣)} 𝑳_G^{†/2} = 𝔼 [ ∑_{(𝑢,𝑣 ) ∈ E} 𝑿_{(𝑢,𝑣)} ] = 𝚷 .

Therefore, it suffices to show that the extremal eigenvalues of the sum ∑_{(𝑢,𝑣 ) ∈ E} 𝑿_{(𝑢,𝑣)}
stay within a factor of (1 ± 𝜀) of those of 𝚷 with high probability. It is easy to check that

    𝜆max ( 𝑳_G^{†/2} 𝑳_{(𝑢,𝑣)} 𝑳_G^{†/2} ) = (𝜹_𝑢 − 𝜹_𝑣 )ᵀ 𝑳_G^† (𝜹_𝑢 − 𝜹_𝑣 ).

So 𝜆max (𝑿_{(𝑢,𝑣)} ) ≤ 𝑅 for all (𝑢, 𝑣 ) ∈ E. Furthermore, since ∑_{(𝑢,𝑣 ) ∈ E} 𝔼 𝑿_{(𝑢,𝑣)} = 𝚷 is the
orthogonal projector onto the range of 𝑳_G , we have 𝜇max = 𝜇min = 1 when we restrict to
that range. Therefore, choosing 𝑅 = 𝜀²/(3.5 ln 𝑛) and applying the matrix Chernoff bound yields

    ℙ { 𝜆max ( ∑_{(𝑢,𝑣 ) ∈ E} 𝑿_{(𝑢,𝑣)} ) ≥ 1 + 𝜀 } ≤ 𝑛 [ e^{𝜀} / (1 + 𝜀)^{1+𝜀} ]^{1/𝑅}
        ≤ 𝑛 ( e^{−𝜀²/3} )^{3.5 ln 𝑛 / 𝜀²}
        = 𝑛^{−1/6} .

The second inequality uses the relation e^{𝜀} (1 + 𝜀)^{−1−𝜀} ≤ e^{−𝜀²/3} for 𝜀 ∈ [0, 1] . Similarly,
applying the matrix Chernoff lower bound and using the fact that e^{−𝜀} (1 − 𝜀)^{𝜀−1} ≤ e^{−𝜀²/2}
for 𝜀 ∈ [0, 1] , we obtain

    ℙ { 𝜆min ( ∑_{(𝑢,𝑣 ) ∈ E} 𝑿_{(𝑢,𝑣)} ) ≤ 1 − 𝜀 } ≤ 𝑛 ( e^{−𝜀²/2} )^{3.5 ln 𝑛 / 𝜀²} = 𝑛^{−3/4} .

We conclude that 𝑳_G^{†/2} 𝑳_H 𝑳_G^{†/2} approximates 𝚷 , hence 𝑳_H approximates 𝑳_G , suffi-
ciently well with high probability. Finally, observe that in light of Proposition 7.24, and
given the choice of 𝑅 , the expected number of edges of the resulting approximating
graph H is 3.5 (𝑛 − 1) 𝜀^{−2} ln 𝑛 . ■
To understand the significance of this result, first recall that the spectrum of a graph
contains relevant information about the graph’s connectivity, underlying modularity,
and other invariants. As indicated by their name, spectral approximations of a graph G
have eigenvalues that are similar to those of G, and thus preserve many of G’s structural
properties. Furthermore, the solutions to linear systems of equations associated to the
Laplacian matrices of graphs that approximate each other are similar [Spi19]. This
property allows for more efficient computation of the solutions of Laplacian linear
systems. A well-known testament to the utility of graph sparsification is provided by expander
graphs, which are sparse spectral approximations of complete graphs.
The greatest potential application of graph sparsification lies in its utility for
reducing the computational burden of otherwise complex tasks. The perspicacious
reader may thus wonder about the computational feasibility of implementing the
algorithm, and particularly of computing the probabilities 𝑝 (𝑢,𝑣 ) for each (𝑢, 𝑣 ) ∈ E.
In [SS11], Spielman and Srivastava provide an efficient way of estimating the effective
resistances of every edge on a graph, hence the probabilities 𝑝 (𝑢,𝑣 ) .

7.5 Conclusion
In this lecture we extended the Laplace transform method for random variables to
the matrix setting, obtaining powerful tail bounds for the extremal eigenvalues of
random matrices and their independent sums. As an example of the results that can be
deduced from the matrix extension of the Laplace transform method, we established a
Chernoff inequality for random matrices and exhibited an application of the latter in
spectral graph theory.

Lecture bibliography
[AW01] R. Ahlswede and A. Winter. Strong Converse for Identification via Quantum Channels.
2001. arXiv: quant-ph/0012127 [quant-ph].

[Lie73b] E. H. Lieb. “Convex trace functions and the Wigner-Yanase-Dyson conjecture”. In:
Advances in Mathematics 11.3 (1973), pages 267–288. doi: https://doi.org/10.
1016/0001-8708(73)90011-X.
[MP93] B. Mohar and S. Poljak. “Eigenvalues in Combinatorial Optimization”. In: Com-
binatorial and Graph-Theoretical Problems in Linear Algebra. Springer New York,
1993, pages 107–151.
[Oli10] R. I. Oliveira. “Sums of random Hermitian matrices and an inequality by Rudelson”.
In: Electronic Communications in Probability 15 (2010), pages 203–212.
[Spi19] D. Spielman. Spectral and Algebraic Graph Theory. 2019. url: http://cs-www.cs.yale.edu/homes/spielman/sagt/sagt.pdf.
[SS11] D. A. Spielman and N. Srivastava. “Graph Sparsification by Effective Resistances”.
In: SIAM Journal on Computing 40.6 (2011), pages 1913–1926. doi: 10.1137/080734029.
[Tro11] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12.4 (2011), pages 389–434. doi: 10.1007/
s10208-011-9099-z.
[Tro12] J. A. Tropp. “From joint convexity of quantum relative entropy to a concavity
theorem of Lieb”. In: Proc. Amer. Math. Soc. 140.5 (2012), pages 1757–1760. doi:
10.1090/S0002-9939-2011-11141-9.
8. Operator-Valued Kernels

Date: 14 March 2022 Author: Nicholas H. Nelsen

Agenda:
1. Scalar kernels and RKHS
2. Operator-valued kernels
3. Examples
4. Vector-valued GPs

In this lecture, we give an overview of the theory of operator-valued kernels. These
objects generalize scalar-valued kernels to vector-valued output data and underlie
statistically analyzable algorithms for vector-valued learning. Motivated by machine
learning considerations, we take a function space perspective and introduce reproducing
kernel Hilbert spaces for scalar- and operator-valued kernels. Both finite-dimensional
(matrix-valued) and infinite-dimensional (operator-valued) examples of positive-definite
kernels are presented. After developing some of the properties of these kernels, we state
vector-valued extensions of Bochner’s and Schoenberg’s characterization theorems. We
conclude with a connection to vector-valued Gaussian processes and regression.
Throughout this lecture, we write L( U; V) for the Banach space of bounded linear
operators taking normed space U to Banach space V, L( V) when U = V, and
L+ ( H) ⊂ L( H) for the set of bounded positive-semidefinite operators on a Hilbert
space H.

8.1 Scalar kernels and reproducing kernel Hilbert space


In Lectures 18 and 19, we studied some basic properties of positive-definite kernels
and positive-definite functions. Two characterization theorems were developed for
translation invariant and radial kernels. Bochner’s theorem characterized translation
invariant kernels in terms of the Fourier transform of a measure, and Schoenberg’s
theorem characterized radial kernels in terms of the Laplace transform of a measure.
Some of these ideas will be generalized in today’s lecture.
Recall that a measurable bivariate function 𝑘 : 𝔽 𝑑 × 𝔽 𝑑 → 𝔽 over the field 𝔽 = ℝ
or ℂ is a positive-definite kernel if it is continuous and satisfies

1. (Symmetry) 𝑘 (𝒙 , 𝒙 0) = 𝑘 (𝒙 0, 𝒙 ) for all 𝒙 , 𝒙 0 ∈ 𝔽 𝑑 ,


2. (psd) For all 𝑛 ∈ ℕ and any set {𝒙_𝑖 }_{𝑖 =1,...,𝑛} ⊂ 𝔽^𝑑 , the matrix

    𝑲 ≔ [𝑘 (𝒙_𝑖 , 𝒙_𝑗 )]_{𝑖 𝑗} ⪰ 0 .

These kernels naturally model scalar-valued data. Hence, they may be referred to
as scalar-valued kernels. Kernels implicitly define functions from 𝔽 𝑑 into 𝔽 , and it is
often the goal in machine learning, statistics, and approximation theory to exploit this
space of functions for modeling, inference, prediction, and other tasks. We now briefly
introduce the notion of reproducing kernel Hilbert space in the scalar setting.

8.1.1 Reproducing kernel Hilbert space


The vector space of all functions mapping an input space X to the field 𝔽 is a messy
infinite-dimensional linear space. Instead of working here, functional analysts often
study smaller infinite-dimensional spaces that have more structure and better properties.
For example, the function space L2 (𝔽 𝑑 ; 𝔽) of all square integrable scalar-valued
functions on 𝔽 𝑑 is a natural generalization of finite-dimensional Euclidean space

to infinite dimensions. It is a Hilbert space whenever its elements are viewed as


equivalence classes of functions equal Lebesgue almost everywhere. However, this
construction means that evaluation of an element of L2 (𝔽 𝑑 ; 𝔽) at a single point in 𝔽 𝑑 is
not even defined!
Reproducing kernel Hilbert spaces (RKHS) come to the rescue.

Definition 8.1 (Reproducing kernel Hilbert space). A Hilbert space ( H, h·, ·i) of functions
mapping X to 𝔽 is a reproducing kernel Hilbert space if for every 𝒙 ∈ X, the linear
functional Φ𝒙 : H → 𝔽 defined by

𝑓 ↦→ Φ𝒙 𝑓 B 𝑓 (𝒙 ) is continuous.

This definition ensures that functions in an RKHS have a pointwise meaning. Intuitively,
it follows that functions in RKHS are smoother than those in L2 . However, this abstract
property does not explain the terminology “reproducing” or “kernel” in the name.
The following alternative definition of a RKHS makes clear the connection to
positive-definite kernels.

Definition 8.2 (RKHS). A Hilbert space ( H, h·, ·i) of functions mapping the input
space X to 𝔽 is said to be the reproducing kernel Hilbert space of the positive-definite
kernel 𝑘 : X × X → 𝔽 if

1. (Inclusion) 𝑘 (·, 𝒙 ) ∈ H for every 𝒙 ∈ X.


2. (Reproducing property) 𝑓 (𝒙 ) = ⟨𝑘 (·, 𝒙 ), 𝑓 ⟩ for every 𝒙 ∈ X and 𝑓 ∈ H.
   (The RKHS inner-product is not easily expressed in closed form for most kernels.)
Item (2) in Definition 8.2 shows that the pointwise values of the function are “reproduced”
by the RKHS inner-product of the kernel with the function itself. Since most real-world
data can be modeled as pointwise values of some underlying scalar function, it seems
that RKHSs are natural candidates for hypothesis spaces in learning theory.
Today, we will go beyond the scalar function setting by answering the ques-
tion:
Can we use kernels to model vector-valued data?

The key idea is to generalize scalar kernels to operator-valued kernels.

8.2 Operator-valued kernels


Consider the space of functions mapping 𝔽 𝑑 to 𝔽𝑝 , where 𝑑, 𝑝 ∈ ℕ. For example,
vector-valued functions arise as the vector fields of dynamical systems or as the
parameter-to-image map in computer graphics. One could treat a vector-valued
function in this space as a family of 𝑝 scalar-valued functions and apply the previous
RKHS formulation componentwise. This is commonplace in many fields. However, this
approach may ignore correlations between the 𝑝 components that could be essential to
fully describe the function. The field of multitask learning [Cap+08; Car97; Evg+05]
has developed to explicitly model the correlation in vector-valued output data. At the
core of this effort is the operator-valued kernel.

8.2.1 Kernels valued as linear maps


We consider a setting in which the input space X is a separable Banach space but
now the output space Y is a separable Hilbert space instead of the scalar field 𝔽 . This

framework is powerful because it allows for both X and Y to be infinite-dimensional,


in which case maps between them are often called operators.
The following definition generalizes what it means to be a kernel.

Definition 8.3 (Positive-definite operator-valued kernel). An operator-valued kernel on X


is a Borel- X measurable function 𝑲 : X× X → L( Y) . The kernel is positive-definite
if

1. (Symmetry) 𝑲 (𝒙 , 𝒙 ′) = 𝑲 (𝒙 ′, 𝒙 )^∗ for all 𝒙 , 𝒙 ′ ∈ X, where ∗ denotes the adjoint
   with respect to the Hilbert space Y,
2. (psd) For all 𝑛 ∈ ℕ and any {𝒙_𝑖 }_{𝑖 =1,...,𝑛} ⊂ X and {𝒚_𝑗 }_{𝑗 =1,...,𝑛} ⊂ Y, the
   quadratic form

       ∑_{𝑖 ,𝑗 =1}^{𝑛} ⟨𝒚_𝑖 , 𝑲 (𝒙_𝑖 , 𝒙_𝑗 )𝒚_𝑗 ⟩_Y ≥ 0 .    (8.1)

That is, an operator-valued kernel takes a pair of vectors in the input space into the
space of bounded linear maps between the output space and itself. Not only does an
operator-valued kernel measure similarity between inputs, but it also accounts for
relationships between the outputs. When Y is finite-dimensional, we use the term
matrix-valued kernel.
Notice that the second condition (8.1) in Definition 8.3 generalizes the corresponding
scalar definition condition from 𝑛 -by-𝑛 psd kernel matrices to 𝑛 -by-𝑛 psd block
kernel operators 𝑲 ( X, X) B [𝑲 (𝒙 𝑖 , 𝒙 𝑗 )] 𝑖 ,𝑗 ∈ L( Y𝑛 ) . Here X B {𝒙 1 , . . . , 𝒙 𝑛 } and
Y𝑛 B Y × · · · × Y is the 𝑛 -fold product of Y (which is itself a Hilbert space).

8.2.2 Vector-valued RKHS


Since scalar-valued kernels on X lead to functions taking X to 𝔽 , it is reasonable to
expect that operator-valued kernels induce vector-valued functions taking X to Y.
This is indeed the case. The basic theory of vector-valued RKHS can be developed in a
parallel manner to the scalar case, albeit with a few nontrivial generalizations such as
the vector-valued reproducing property.
We begin this endeavor by considering an arbitrary Hilbert space ( H, h·, ·i H) of
maps taking X to Y.

Definition 8.4 (Vector-valued RKHS [MP05]). A Hilbert space ( H, h·, ·i H) of functions


mapping X to Y is a reproducing kernel Hilbert space if for every 𝒙 ∈ X and 𝒚 ∈ Y,
the linear functional 𝜑 𝒙 ,𝒚 : H → 𝔽 defined by

𝒇 ↦→ Φ𝒙 ,𝒚 𝒇 B h𝒚 , 𝒇 (𝒙 )i Y is continuous.

While seemingly benign, the implications of Definition 8.4 are quite profound. Indeed,
let H be a vector-valued RKHS. For 𝒙 ∈ X and 𝒚 ∈ Y, the linear functional Φ𝒙 ,𝒚 is
bounded because it is continuous. Hence, the Riesz representation theorem ensures
the existence of a unique element 𝑲 𝒙 ,𝒚 ∈ H, parametrized by 𝒙 and 𝒚 , such that

Φ𝒙 ,𝒚 𝒇 = h𝒚 , 𝒇 (𝒙 )i Y = h𝑲 𝒙 ,𝒚 , 𝒇 i H for every 𝒇 ∈ H . (8.2)

The left-hand side of (8.2) is linear in 𝒚 . It follows that 𝑲 𝒙 ,𝒚 on the right-hand side is
also linear. We can define a linear operator 𝑲 𝒙 : Y → H via the rule

𝑲 𝒙 ,𝒚 C 𝑲 𝒙 𝒚 .
Furthermore, we can define an operator-valued kernel 𝑲 (𝒙 , 𝒙 0) : Y → Y by

𝒚 ↦→ 𝑲 (𝒙 , 𝒙 0)𝒚 B (𝑲 𝒙 0 𝒚 )(𝒙 ) . (8.3)



We sometimes write 𝑲 𝒙 = 𝑲 (·, 𝒙 ) to emphasize that the first argument of the kernel
is open. Provided that 𝑲 𝒙 : Y → H is bounded (which we prove in Proposition 8.5),
it follows from (8.2) and (8.3) that
    𝒇 (𝒙 ) = 𝑲_𝒙^∗ 𝒇    and    𝑲 (𝒙 , 𝒙 ′) = 𝑲_𝒙^∗ 𝑲_{𝒙 ′}    for each 𝒙 , 𝒙 ′ ∈ X

because Y is a Hilbert space. This is a form of the vector-valued reproducing property
(cf. Definition 8.2). The second equality parallels the well-known scalar identity
𝑘 (𝒙 , 𝒙 ′) = ⟨𝑘 (·, 𝒙 ), 𝑘 (·, 𝒙 ′)⟩ .
Using some basic functional analysis, it is possible to deduce the following facts
about the kernel 𝑲 in (8.3).
Proposition 8.5 (Operator-valued kernel: Properties). The operator-valued kernel 𝑲 on X
from (8.3) satisfies the following properties for all 𝒙 and 𝒙 0 ∈ X.
1. (Kernel) 𝑲 (𝒙 , 𝒙 ′) ∈ L( Y) , 𝑲 (𝒙 , 𝒙 ′)^∗ ∈ L( Y) , and 𝑲 (𝒙 , 𝒙 ) ∈ L+ ( Y) .
2. (psd) 𝑲 is positive-definite.
3. (Kernel sections) 𝑲_𝒙 ∈ L( Y; H) with ‖𝑲_𝒙 ‖_{L( Y; H)} = ‖𝑲 (𝒙 , 𝒙 )‖^{1/2}_{L( Y)} .
4. (Cauchy–Schwarz) ‖𝑲 (𝒙 , 𝒙 ′)‖_{L( Y)} ≤ ‖𝑲 (𝒙 , 𝒙 )‖^{1/2}_{L( Y)} ‖𝑲 (𝒙 ′, 𝒙 ′)‖^{1/2}_{L( Y)} .

Proof. The only hard item is (1), which is left as an exercise; see Problem 8.8. We will
prove (3), noting that the remaining items are similar or trivial given (1). Applying the
vector-valued reproducing property and the Cauchy–Schwarz inequality, we obtain
    ‖𝑲_𝒙 ‖² = sup_{‖𝒚 ‖_Y=1} ‖𝑲_𝒙 𝒚 ‖²_H = sup_{‖𝒚 ‖_Y=1} ⟨𝑲_𝒙 𝒚 , 𝑲_𝒙 𝒚 ⟩_H
            = sup_{‖𝒚 ‖_Y=1} ⟨𝒚 , 𝑲 (𝒙 , 𝒙 )𝒚 ⟩_Y ≤ ‖𝑲 (𝒙 , 𝒙 )‖ ,

where all unlabeled norms are the induced operator norms. For the other direction,

    ⟨𝑲 (𝒙 , 𝒙 )𝒚 , 𝑲 (𝒙 , 𝒙 )𝒚 ⟩_Y = ⟨𝑲_𝒙 𝑲 (𝒙 , 𝒙 )𝒚 , 𝑲_𝒙 𝒚 ⟩_H
        ≤ ‖𝑲_𝒙 ‖ ‖𝑲 (𝒙 , 𝒙 )𝒚 ‖_Y ‖𝑲_𝒙 𝒚 ‖_H
        ≤ ‖𝑲_𝒙 ‖² ‖𝑲 (𝒙 , 𝒙 )𝒚 ‖_Y ‖𝒚 ‖_Y .

These two bounds imply that ‖𝑲 (𝒙 , 𝒙 )𝒚 ‖_Y ≤ ‖𝑲_𝒙 ‖² ‖𝒚 ‖_Y as desired. ■

Aside: Further assumptions on 𝑲 (such as tr (𝑲 (𝒙 , 𝒙 )) < ∞ for all 𝒙 ) beyond


just the default properties in Proposition 8.5 are made in the paper [CDV07] of
Caponnetto and de Vito. They derive optimal convergence rates for the statistical
learning of vector-valued maps taking X to Y using RKHS-penalized least-squares
(i.e., kernel ridge) regression.

Exercise 8.6 (Matrix coordinates). For 𝑝 ∈ ℕ, let 𝑲 : X× X → 𝔽𝑝×𝑝 be the operator-valued


kernel of RKHS H𝑲 . For any 𝒙 , 𝒙 0 ∈ X and 𝑖 , 𝑗 = 1, . . . , 𝑝 , show that
    [𝑲 (𝒙 , 𝒙 ′)]_{𝑖 𝑗} = ⟨𝑲_𝒙 𝜹_𝑖 , 𝑲_{𝒙 ′} 𝜹_𝑗 ⟩_{H_𝑲} ,
where (𝜹 𝑖 )𝑖 =1,...,𝑝 are the standard basis vectors of 𝔽𝑝 .
Exercise 8.7 (Point evaluation). From Proposition 8.5 deduce that the point evaluation
map is a bounded linear operator. For every 𝒙 ∈ X, there exists 𝐶 𝒙 > 0 such that
k𝒇 (𝒙 ) k Y ≤ 𝐶 𝒙 k𝒇 k H for all 𝒇 ∈ H .
Explicitly determine a choice of the constant 𝐶 𝒙 .

Problem 8.8 (Kernel bounds). Complete the proof of Proposition 8.5. Hint: For item (1),
use the adjoint and apply the uniform boundedness principle.

Warning 8.9 (Continuity). Although point evaluation is well-defined, functions 𝒇 : X →


Y in the RKHS H𝑲 of a positive-definite operator-valued reproducing kernel 𝑲 need
not be continuous. To guarantee continuity, that is, 𝒇 ∈ C ( X; Y) , it is necessary
and sufficient for the kernel to be Mercer [Car+10, Proposition 2]. More precisely,

1. 𝒙 ↦→ 𝑲 (𝒙 , 𝒙 ) is locally bounded, Item (2) is natural from the


2. 𝑲 (·, 𝒙 )𝒚 ∈ C ( X; Y) for all 𝒙 ∈ X and all 𝒚 ∈ Y. 
characterization of H𝑲 as the closure
of the span of 𝑲 ( ·, 𝒙 )𝒚 over all 𝒙
The following is an alternative definition of vector-valued RKHS that parallels and 𝒚 [ARL+12].
Definition 8.2 in the scalar case.

Definition 8.10 (Vector-valued RKHS [Kad+16]). A Hilbert space ( H, h·, ·i) of functions
mapping the input space X to the output Hilbert space ( Y, h·, ·i Y) is said to be the
vector-valued reproducing kernel Hilbert space of the positive-definite operator-valued
kernel 𝑲 : X × X → L( Y) if

1. (Inclusion) 𝑲 (·, 𝒙 )𝒚 ∈ H for every 𝒙 ∈ X and 𝒚 ∈ Y,


2. (Reproducing property) h𝒚 , 𝒇 (𝒙 )i Y = h𝑲 (·, 𝒙 )𝒚 , 𝒇 i for every 𝒙 ∈ X, 𝒚 ∈ Y
and 𝒇 ∈ H.

We conclude this section with a converse result [Sch64]. In the scalar setting, the
analogous result is called the Moore–Aronszajn theorem [Aro50].

Theorem 8.11 (Schwartz 1964). Every positive-definite operator-valued kernel 𝑲 is


the reproducing kernel of a uniquely defined RKHS H𝑲 .

This one-to-one correspondence between operator-valued kernels and RKHS justifies


our notation of indexing the RKHS by its kernel, 𝑲 ↔ H𝑲 .

8.2.3 Vector-valued Bochner and Schoenberg theorems


Last, we present vector-valued extensions of Bochner’s theorem for translation-invariant
kernels and Schoenberg’s theorem for radial kernels as seen in Lectures 18 and
19. We refer the reader to [MP05, Section 5] and [Car+10] for more details and
references.

Theorem 8.12 (Vector-valued Bochner; Berberian 1966, Fillmore 1970). A map 𝑲 : ℝ^𝑑 ×
ℝ^𝑑 → L( Y) is a positive-definite operator-valued translation-invariant kernel if
and only if it takes the form

    (𝒙 , 𝒙 ′) ↦→ 𝑲 (𝒙 , 𝒙 ′) = ∫_{ℝ^𝑑} e^{i ⟨𝒙 −𝒙 ′, 𝝃⟩_{ℝ^𝑑}} 𝝁 ( d𝝃) ,

where Y is a separable Hilbert space and 𝝁 is an L+ ( Y) operator-valued Borel
measure on ℝ^𝑑 .


Theorem 8.13 (Vector-valued Schoenberg; Berberian 1966, Fillmore 1970). A function
𝝋 : ℝ_+ → L( Y) generates a positive-definite radial operator-valued kernel
(𝒙 , 𝒙 ′) ↦→ 𝝋 (‖𝒙 − 𝒙 ′‖²_X) on the Hilbert space X if and only if

    𝑠 ↦→ 𝝋 (𝑠 ) = ∫_0^∞ e^{−𝑠𝑢} 𝝂 ( d𝑢) ,

where Y is a separable Hilbert space and 𝝂 is an L+ ( Y) operator-valued Borel
measure on ℝ_+ .


8.3 Examples
Operator-valued kernels are much more complex than their scalar counterparts. This
abstraction may make it difficult to intuit canonical ways to map vectors into operators
in an expressive way. To build this intuition, we develop examples of operator-valued
kernels that are related to scalar formulations. Catalogs of explicit matrix-valued and
operator-valued kernels may be found in [ARL+12; BHB16; Car+10; MP05].
The most common examples are the separable scalar-type kernels in the next two
examples. This is not surprising given the far-reaching implications of the vector-valued
Bochner and Schoenberg characterization theorems of Section 8.2.3.
Example 8.14 (Separable scalar operator-valued kernel). Consider the function 𝑲 : X× X →
L( Y) given by

    (𝒙 , 𝒙 ′) ↦→ 𝑲 (𝒙 , 𝒙 ′) = 𝑘 (𝒙 , 𝒙 ′) 𝑻    (8.4)

for some scalar-valued kernel 𝑘 and some self-adjoint 𝑻 ∈ L+ ( Y) . Then 𝑲 is called
a separable scalar operator-valued kernel. In supervised learning applications, the
map 𝑻 can be learned jointly as a hyperparameter along with the regression function
[ARL+12]. Alternatively, 𝑻 can be chosen as the empirical covariance operator of the
labeled data (output space PCA), for example. Notice that the input space is decoupled
from the output space, which may be a strong limitation in practical settings. However,
the block kernel matrix for (8.4) has a simple tensor product structure that allows for
its efficient inversion.
Here is a concrete instantiation of (8.4) in infinite dimensions. For any invertible
and self-adjoint 𝑪 ∈ L+ ( X) , define 𝑲 by

    (𝒙 , 𝒙 ′) ↦→ exp ( −½ ‖𝑪^{−1} (𝒙 − 𝒙 ′)‖²_X ) 𝑻 ,   where   (𝑻 𝒚 ) (𝒕 ) = ∫_D e^{−‖𝒕 −𝒔 ‖_{ℝ^𝑑}} 𝒚 (𝒔 ) d𝒔

for 𝒚 ∈ Y and 𝒕 ∈ D, where Y = L²( D; ℝ^𝑝 ) and D ⊂ ℝ^𝑑 [Kad+16]. ∎
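In finite dimensions, the separable construction (8.4) is easy to realize explicitly. The sketch below (assuming NumPy, with an RBF kernel and a random psd 𝑻 as illustrative choices) assembles the block kernel matrix, which has the Kronecker structure 𝑲( X, X) = 𝑘( X, X) ⊗ 𝑻 mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
d, p, n = 3, 2, 20

def k_rbf(x, xp, ell=1.0):
    """Scalar Gaussian (RBF) kernel."""
    return np.exp(-0.5 * np.sum((x - xp) ** 2) / ell**2)

X = rng.standard_normal((n, d))                # inputs
B = rng.standard_normal((p, p))
T = B @ B.T                                    # a psd output operator

# Separable matrix-valued kernel K(x, x') = k(x, x') T and its n-by-n block kernel matrix
k_mat = np.array([[k_rbf(X[i], X[j]) for j in range(n)] for i in range(n)])
K_big = np.kron(k_mat, T)                      # same as assembling the blocks k(x_i, x_j) T

# Numerical check that the block kernel matrix is psd (cf. Exercise 8.15)
print(np.linalg.eigvalsh(K_big).min() >= -1e-10)
```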

Exercise 8.15 (Separable). Show that separable kernels 𝑲 of the form (8.4) are positive-
definite whenever 𝑘 is positive-definite.
In Lecture 19 we encountered inner-product kernels on 𝔽 𝑑 . These kernels take the
form
𝑘 (𝒙 , 𝒙 0) = 𝜑 (h𝒙 , 𝒙 0i𝔽𝑑 )
for a scalar function 𝜑 : 𝔽 → 𝔽 . We saw that these objects played a crucial role in
the theory of entrywise psd preservers. Moreover, 𝜑 sometimes admits a convergent
power series expansion via Vasudeva’s theorem and its consequences.
We now demonstrate a dramatic extension of inner-product kernels to the operator-
valued setting.
Example 8.16 (Kernel inner-product). Let X = ℂ𝑑 for some 𝑑 ∈ ℕ. Let 𝜶 ∈ ℤ𝑑+ denote a
multi-index (which is a succinct notation to deal with polynomials in arbitrary input
dimension; for example, see [Eva10]). Let 𝝋 : ℂ𝑑 → L( Y) be an operator-valued

entire function, which means that


    𝒛 ↦→ 𝝋 (𝒛 ) = ∑_{𝜶 ∈ ℤ_+^𝑑} 𝑨_𝜶 𝒛^𝜶    for some {𝑨_𝜶 } ⊂ L+ ( Y) .    (8.5)

This series converges everywhere on ℂ𝑑 in an appropriate sense ([MP05]). Here


𝒛^𝜶 ≔ 𝑧_1^{𝛼_1} 𝑧_2^{𝛼_2} · · · 𝑧_𝑑^{𝛼_𝑑} and Y is allowed to be infinite-dimensional. Let 𝑑 scalar kernels
𝑘_𝑖 : ℂ^𝑑 × ℂ^𝑑 → ℂ be given. These represent “inner-products.” Denote their vectorized
concatenation by 𝒌 ≔ (𝑘_1 , . . . , 𝑘_𝑑 ) : ℂ^𝑑 × ℂ^𝑑 → ℂ^𝑑 . Then the composition

    𝝋 ◦ 𝒌 : ℂ^𝑑 × ℂ^𝑑 → L( Y) ,
    (𝒙 , 𝒙 ′) ↦→ ∑_{𝜶 ∈ ℤ_+^𝑑} 𝑘_1 (𝒙 , 𝒙 ′)^{𝛼_1} · · · 𝑘_𝑑 (𝒙 , 𝒙 ′)^{𝛼_𝑑} 𝑨_𝜶    (8.6)
𝜶 ∈ℤ+

is an operator-valued kernel. 

We now show that such an inner-product kernel is positive-definite.


Proposition 8.17 (Positive-definite inner-product kernel). Let 𝝋 : ℂ𝑑 → L( Y) be an entire
function of the form (8.5), where each 𝑨 𝜶 is self-adjoint. Let {𝑘𝑖 }𝑖 =1,...,𝑑 be a family of
positive-definite scalar kernels. Then 𝑲 B 𝝋 ◦𝒌 as defined in (8.6) is a positive-definite
operator-valued kernel.

Proof. We must verify Definition 8.3. Clearly 𝑲 (𝒙 , 𝒙 ′) = 𝑲 (𝒙 ′, 𝒙 )^∗ since each 𝑘_𝑖 is
positive-definite and 𝑨_𝜶^∗ = 𝑨_𝜶 . It remains to show the psd property. Note that

    ∑_{𝑖 ,𝑗 =1}^{𝑛} ⟨𝒚_𝑖 , 𝑲 (𝒙_𝑖 , 𝒙_𝑗 )𝒚_𝑗 ⟩_Y = ∑_{𝑖 ,𝑗 =1}^{𝑛} ∑_{𝜶 ∈ ℤ_+^𝑑} ( ∏_{ℓ=1}^{𝑑} 𝑘_ℓ (𝒙_𝑖 , 𝒙_𝑗 )^{𝛼_ℓ} ) ⟨𝒚_𝑖 , 𝑨_𝜶 𝒚_𝑗 ⟩_Y
        = ∑_{𝜶 ∈ ℤ_+^𝑑} [ ∑_{𝑖 ,𝑗 =1}^{𝑛} 1_𝑖 ( ∏_{ℓ=1}^{𝑑} 𝑘_ℓ (𝒙_𝑖 , 𝒙_𝑗 )^{𝛼_ℓ} ) ⟨𝒚_𝑖 , 𝑨_𝜶 𝒚_𝑗 ⟩_Y 1_𝑗 ]

for any {𝒙_𝑖 }_{𝑖 =1,...,𝑛} , {𝒚_𝑗 }_{𝑗 =1,...,𝑛} , and 𝑛 ∈ ℕ. Since each 𝑘_ℓ is positive-definite, the
power 𝑘_ℓ^{𝛼_ℓ} is positive-definite by entrywise psd preservation of the power function (as seen
in Lecture 19). By the Schur product theorem,

    [ ∏_{ℓ=1}^{𝑑} 𝑘_ℓ (𝒙_𝑖 , 𝒙_𝑗 )^{𝛼_ℓ} ]_{𝑖 𝑗} ⪰ 0 .

Since 𝑨_𝜶 ∈ L+ ( Y) is psd, it defines a scalar positive-definite inner-product kernel
(𝒚 , 𝒚 ′) ↦→ ⟨𝒚 , 𝑨_𝜶 𝒚 ′⟩_Y on Y. Thus, for all multi-indices 𝜶 ,

    [⟨𝒚_𝑖 , 𝑨_𝜶 𝒚_𝑗 ⟩_Y]_{𝑖 𝑗} ⪰ 0 .

One final application of the Schur product theorem shows that each term in the outer
sum over 𝜶 above is nonnegative. This yields the desired result. ■
We conclude this example by remarking that there is a converse result to Propo-
sition 8.17 [MP05, Proposition 3] that mimics Vasudeva’s and Schoenberg’s scalar
characterization of entrywise psd preservers (Lecture 19) as everywhere convergent
power series.
The previous two examples concerned separable kernels or sums thereof. It is
possible to go beyond the separable case using special feature map factorizations, as
the next example demonstrates.

Example 8.18 (Random features). Nonseparable kernels can be constructed with vector-
valued random features. In their most general form [NS21], they are defined as a pair
(𝝋, 𝜇) , where 𝝋 : X × Θ → Y is jointly square Bochner integrable and 𝜇 is a Borel
probability measure on Θ. It follows that

(𝒙 , 𝒙 0) ↦→ 𝑲 (𝒙 , 𝒙 0) B 𝔼𝜽 ∼𝜇 [𝝋 (𝒙 , 𝜽 ) ⊗ Y 𝝋 (𝒙 0, 𝜽 )] (8.7)

is a positive-definite operator-valued kernel on X. A key insight is that the expectation


can be approximated empirically with Monte Carlo. For some 𝑚 ∈ ℕ, this leads to a
low-rank approximation 𝑲_𝑚 to 𝑲 given by

    (𝒙 , 𝒙 ′) ↦→ 𝑲_𝑚 (𝒙 , 𝒙 ′) ≔ (1/𝑚) ∑_{ℓ=1}^{𝑚} 𝝋 (𝒙 , 𝜽_ℓ ) ⊗_Y 𝝋 (𝒙 ′, 𝜽_ℓ ) ,   where 𝜽_ℓ ∼ 𝜇 iid.

(Operator-valued kernel regression in, say, Y = 𝔽^𝑝 with 𝑝 large, using 𝑲_𝑚 is much
more efficient than that using 𝑲 , even when 𝑚 ≪ 𝑛𝑝 is relatively large. This is because
all the linear algebra is performed in 𝔽^𝑚 for the former and in 𝔽^{𝑛𝑝} for the latter,
where 𝑛 is the sample size of the dataset.)
One can further exploit the vector-valued Bochner characterization (Theorem 8.12)
to design random Fourier feature [RR08] approximations to translation-invariant
operator-valued kernels, as demonstrated in [BHB16; Min16]. ∎
operator-valued kernels, as demonstrated in [BHB16; Min16].  dataset.

Exercise 8.19 (Random feature). Prove that the operator-valued kernel (8.7) defined in
terms of random features is positive-definite.

8.4 Vector-valued Gaussian processes


In the context of learning models from data, kernel methods are closely linked to
Gaussian process (GP) regression in the scalar setting. It should come as no surprise
that this remains true in the vector-valued setting. Indeed, there is a correspondence
between a vector-valued RKHS H𝑲 and a vector-valued GP, which is a Gaussian
distribution over vector-valued functions, given by

    𝒇_𝑲 ∼ GP(𝒎, 𝑲 ) ,   where   𝒇_𝑲 (𝒙 ) ∼ normal (𝒎 (𝒙 ), 𝑲 (𝒙 , 𝒙 ))   for 𝒙 ∈ X ,

barring some measurability issues in infinite dimensions. GPs are fully specified by
their finite-dimensional distributions. Here, 𝒇_𝑲 ∼ GP(𝒎, 𝑲 ) signifies that 𝒇_𝑲 has
mean function 𝒎 = 𝔼 𝒇_𝑲 ∈ H_𝑲 and covariance function

    (𝒙 , 𝒙 ′) ↦→ 𝑲 (𝒙 , 𝒙 ′) = 𝔼[𝒇_𝑲 (𝒙 ) ⊗_Y 𝒇_𝑲 (𝒙 ′)] ,

which is just the operator-valued kernel! The covariance kernel is in the random
feature form (8.7) with the feature map given by the GP itself. Therefore, the kernel is
positive-definite.

Warning 8.20 (Sample inclusion). The GP satisfies ℙ { 𝒇_𝑲 ∈ H_𝑲 } = 0. ∎

Aside: Although this GP connection is conceptually useful, it is not advisable to use such
covariance kernels in practice unless they are known in closed form or the GP is fast to
sample from and evaluate.

8.4.1 GP examples
We now present some examples of operator-valued kernels defined from a single scalar
kernel [NS21, Example 2.9] and their associated RKHSs. We explicitly make use of the
GP perspective and will see how output space correlations (or lack thereof) influence
sample paths of the corresponding GP. For simplicity in all that follows, the input space
is set to X B ( 0, 1) .
Example 8.21 (Brownian bridge: Scalar). Take Y B ℝ and define the function 𝑘 : X× X →
ℝ by
𝑘 (𝑥, 𝑥 0) B min{𝑥, 𝑥 0 } − 𝑥𝑥 0 . (8.8)

Clearly 𝑘 is symmetric. It is positive-definite because 𝑘 is the covariance function of


the Brownian bridge GP {𝑏 (𝑥)}𝑥 ∈ X ⊂ ℝ, which is a Brownian motion constrained to
zero at locations 𝑥 = 0 and 𝑥 = 1. That is,

𝑏 ∼ GP( 0, 𝑘 ) and 𝑏 (𝑥) ∼ normal ( 0, 𝑘 (𝑥, 𝑥)) for every 𝑥 ∈ X .


The RKHS H_𝑘 is given by the function space

    H^1_0( X; ℝ) = { 𝑓 ∈ L²( X; ℝ) : 𝑓 ′ ∈ L²( X; ℝ), 𝑓 (0) = 𝑓 (1) = 0 } .

This is the Hilbert space of real-valued functions on (0, 1) that vanish at the endpoints,
with one weak derivative, and equipped with the inner-product

    ⟨𝑓 , 𝑔 ⟩_{H^1_0( X;ℝ)} = ⟨𝑓 ′, 𝑔 ′⟩_{L²( X;ℝ)} = ∫_0^1 𝑓 ′(𝑡 ) 𝑔 ′(𝑡 ) d𝑡 .

Aside: Without appealing to the RKHS formalism, one can deduce that functions in
H^1_0( X; ℝ) are pointwise well defined by the Sobolev embedding theorem, since here
1 > 𝑑/2 = 1/2.

To see this connection, we perform a direct calculation in the spirit of Definition 8.2.
For any 𝑥 ∈ X, the function 𝑘 (·, 𝑥) ∈ L²( X; ℝ) because it is bounded, vanishes at the
endpoints, and has a bounded weak derivative 𝑡 ↦→ 𝑘 ′(𝑡 , 𝑥) = 𝟙{𝑡 < 𝑥 } − 𝑥 . Hence
𝑘 (·, 𝑥) ∈ H_𝑘 . Last, we verify the reproducing property:

    ⟨𝑘 (·, 𝑥), 𝑓 ⟩_{H^1_0( X;ℝ)} = ∫_0^1 𝑘 ′(·, 𝑥)(𝑡 ) 𝑓 ′(𝑡 ) d𝑡 = ∫_0^𝑥 𝑓 ′(𝑡 ) d𝑡 − 𝑥 ∫_0^1 𝑓 ′(𝑡 ) d𝑡 = 𝑓 (𝑥) .

Therefore, H_𝑘 = H^1_0( X; ℝ) . ∎
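The reproducing property in this example can also be checked by quadrature. The following sketch (with an arbitrary test function and grid, assuming NumPy) compares the H^1_0 inner product ⟨𝑘(·, 𝑥), 𝑓⟩ computed from the derivative formula with the pointwise value 𝑓(𝑥).

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

f  = np.sin(np.pi * t) * t * (1 - t)          # a test function with f(0) = f(1) = 0
fp = np.gradient(f, t)                        # numerical derivative f'

x = 0.37
kp = (t < x).astype(float) - x                # k'(t, x) = 1{t < x} - x for the bridge kernel (8.8)

lhs = np.sum(kp * fp) * dt                    # <k(., x), f>_{H^1_0} = int_0^1 k'(t, x) f'(t) dt
print(lhs, np.interp(x, t, f))                # the two values should agree closely
```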

We can lift to a matrix-valued kernel by considering a finite vector of Brownian


bridges.
Example 8.22 (Brownian bridge: Matrix). For some 𝑝 ∈ ℕ, let Y = ℝ𝑝 . Define the
function 𝑲 : X × X → ℝ𝑝×𝑝 by

𝑲 (𝑥, 𝑥 0) B 𝑘 (𝑥, 𝑥 0) I ,
where 𝑘 is as in (8.8). Then 𝑲 is the covariance function of 𝒃 B (𝑏 1 , . . . , 𝑏 𝑝 ) , where
each 𝑏 𝑖 (·) is an independent Brownian bridge:

𝒃 ∼ GP( 0, 𝑲 ) and 𝒃 (𝑥) ∼ normal ( 0, 𝑲 (𝑥, 𝑥)) for every 𝑥 ∈ X .


This is because the covariance function of the process satisfies

𝔼[𝒃 (𝑥)𝒃 (𝑥 0) ᵀ ] 𝑖 𝑗 = 𝔼[𝑏 𝑖 (𝑥)𝑏 𝑗 (𝑥 0)] = 𝛿𝑖 𝑗 𝑘 (𝑥, 𝑥 0) = [𝑲 (𝑥, 𝑥 0)] 𝑖 𝑗 .


By an argument similar to the previous example, 𝑲 is a positive-definite matrix-valued
kernel. Its RKHS H𝑲 is

H10 ( X; ℝ𝑝 ) = 𝒇 ∈ L2 ( X; ℝ𝑝 ) : 𝒇 0 ∈ L2 ( X; ℝ𝑝 ), 𝒇 ( 0) = 𝒇 ( 1) = 0 .


This is the Hilbert space of ℝ𝑝 -valued functions on ( 0, 1) that vanish component-wise


at the endpoints, with one weak derivative in each component, and equipped with the
inner-product
    ⟨𝒇 , 𝒈 ⟩_{H^1_0( X;ℝ^𝑝 )} = ∫_0^1 ⟨𝒇 ′(𝑡 ), 𝒈 ′(𝑡 )⟩_{ℝ^𝑝} d𝑡 = ∑_{𝑖=1}^{𝑝} ∫_0^1 𝑓_𝑖 ′(𝑡 ) 𝑔_𝑖 ′(𝑡 ) d𝑡 .

Indeed, verifying the two defining properties (Definition 8.10) of a vector-valued


RKHS shows that [𝑲 (·, 𝑥)𝒚 ] 𝑖 = 𝑘 (·, 𝑥)𝑦𝑖 ∈ H10 ( X; ℝ) for all 𝑥 ∈ X, 𝒚 ∈ ℝ𝑝 , and
𝑖 = 1, . . . , 𝑝 . By the inner-product definition above, we obtain 𝑲 (·, 𝑥)𝒚 ∈ H10 ( X; ℝ𝑝 ) .
As a consequence, 𝑓_𝑖 (𝑥) = ⟨𝑘 (·, 𝑥), 𝑓_𝑖 ⟩_{H^1_0( X;ℝ)} , so that

    ⟨𝒚 , 𝒇 (𝑥)⟩_{ℝ^𝑝} = ∑_{𝑖=1}^{𝑝} 𝑦_𝑖 𝑓_𝑖 (𝑥) = ∑_{𝑖=1}^{𝑝} ⟨𝑘 (·, 𝑥) 𝑦_𝑖 , 𝑓_𝑖 ⟩_{H^1_0( X;ℝ)}
                     = ∑_{𝑖=1}^{𝑝} ⟨[𝑲 (·, 𝑥)𝒚 ]_𝑖 , 𝑓_𝑖 ⟩_{H^1_0( X;ℝ)}
                     = ⟨𝑲 (·, 𝑥)𝒚 , 𝒇 ⟩_{H^1_0( X;ℝ^𝑝 )} .

This is the required statement. 
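On a grid, the matrix-valued Brownian bridge GP is straightforward to sample; the block covariance is 𝑘( X, X) ⊗ I_p. The sketch below (grid size, jitter, and output dimension are arbitrary choices, assuming NumPy) draws one sample path, whose 𝑝 columns are independent Brownian bridges.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m = 3, 200
x = np.linspace(0.01, 0.99, m)

k = np.minimum.outer(x, x) - np.outer(x, x)        # Brownian bridge kernel (8.8) on the grid
K_big = np.kron(k, np.eye(p))                      # block covariance of the stacked values

z = rng.multivariate_normal(np.zeros(m * p), K_big + 1e-12 * np.eye(m * p))
B = z.reshape(m, p)                                # column j is an independent Brownian bridge

print(B.shape, np.abs(B[0]).max(), np.abs(B[-1]).max())   # values near the endpoints are ~ 0
```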

Lifting once again, this time to an infinite-dimensional Y, is straightforward given


the last example and corresponds to a function-valued GP.
Example 8.23 (Brownian bridge: Operator). Take Y = L2 ( D; ℝ) for some finite-dimensional,
bounded domain D in Euclidean space. Define the function 𝑲 : X × X → L( Y) by

𝑲 (𝑥, 𝑥 0) B 𝑘 (𝑥, 𝑥 0) I , (8.9)

where again 𝑘 is as in (8.8) and I ∈ L( Y) is the identity operator on L2 ( D; ℝ) . Then


𝑲 is the covariance function of GP 𝒃 B {𝑏 𝒔 }𝒔 ∈D ,

𝒃 ∼ GP( 0, 𝑲 ) and 𝒃 (𝑥) ∼ normal ( 0, 𝑲 (𝑥, 𝑥)) for every 𝑥 ∈ X ,

where each 𝑏_𝒔 (·) is an independent Brownian bridge. This is because, for any 𝒚 ∈ Y and
𝒔 ∈ D, the covariance operator satisfies

    ( 𝔼[𝒃 (𝑥) ⊗_Y 𝒃 (𝑥 ′)] 𝒚 ) (𝒔 ) = 𝔼[ ⟨𝒃 (𝑥 ′), 𝒚 ⟩_Y 𝑏_𝒔 (𝑥) ]
        = ∫_D 𝔼[ 𝑏_𝒔 (𝑥) 𝑏_{𝒔 ′} (𝑥 ′) ] 𝑦 (𝒔 ′) d𝒔 ′
        = ∫_D 𝑘 (𝑥, 𝑥 ′) 𝛿 (𝒔 − 𝒔 ′) 𝑦 (𝒔 ′) d𝒔 ′
        = 𝑘 (𝑥, 𝑥 ′) 𝑦 (𝒔 ) .

(These calculations with white noise, though formal, can be made rigorous with the Itô
isometry from stochastic analysis.)

Calculations similar to those in the previous two examples show that 𝑲 is a positive-
definite operator-valued kernel with RKHS H𝑲 given by

    H^1_0( X; Y) = { 𝒇 ∈ L²( X; Y) : 𝒇 ′ ∈ L²( X; Y), 𝒇 (0) = 𝒇 (1) = 0 } .    (8.10)

This is the Hilbert space of L²( D; ℝ)-valued functions on (0, 1) that vanish in the
L²-sense at the endpoints with one weak Fréchet derivative, and equipped with the
inner-product

    ⟨𝒇 , 𝒈 ⟩_{H^1_0( X; Y)} = ∫_0^1 ⟨𝒇 ′(𝑡 ), 𝒈 ′(𝑡 )⟩_Y d𝑡 = ∫_D ∫_0^1 [𝑓 ′(𝑡 )](𝒔 ) [𝑔 ′(𝑡 )](𝒔 ) d𝑡 d𝒔 .

Aside: Lebesgue–Bochner spaces such as L²((0, 1); L²) here arise frequently in the
analysis of time-dependent partial differential equations.


Exercise 8.24 (Lebesgue–Bochner). Complete the calculations in Example 8.23 that prove
𝑲 in (8.9) is positive-definite with RKHS (8.10).

Warning 8.25 (Sample path regularity). Using a separable operator-valued kernel with
isotropic output operator such as I ∈ L( Y) in (8.9) can be detrimental in infinite
dimensions. Not only does this completely uncorrelate the outputs, but also I
does not have finite trace on Y whenever Y is infinite-dimensional. Thus, the
function-valued GP associated to the operator-valued kernel does not take its values
in Y with probability one! In the language of GPs, this is a bad prior distribution.
Indeed, following Example 8.23, for 𝑥 ∈ ( 0, 1) ,

𝒃 (𝑥) ∼ normal ( 0, 𝑲 (𝑥, 𝑥)) = 𝑘 (𝑥, 𝑥) normal ( 0, I) .

This is not a Gaussian random function in Y but instead Gaussian white noise
on Y; it belongs to a much “rougher” space than Y itself. Instead, taking
𝑲 : (𝑥, 𝑥 ′) ↦→ 𝑘 (𝑥, 𝑥 ′)𝑻 with self-adjoint and psd trace-class 𝑻 ∈ S_1( Y) addresses
this issue. (Here S_1 denotes the Schatten-1 class of operators.) ∎

8.4.2 Implications for learning theory


We end this lecture with a brief remark on kernel methods in machine learning. Our
emphasis on the RKHS model stems partly from its role in kernel ridge regression (KRR)
[CDV07], which is a regularization procedure to recover a vector-valued function from
a finite number of noisy input–output data pairs. The direct link to vector-valued GPs
is the fact that the KRR estimator is the posterior mean of a GP conditioned on these
data pairs.
We now state a core result in this field.
Theorem 8.26 (Kernel ridge regression). Let H_𝑲 be a vector-valued RKHS with positive-
definite operator-valued kernel 𝑲 : X × X → L( Y) . For 𝑛 ∈ ℕ, let ( X, Y) ≔
{(𝒙_𝑖 , 𝒚_𝑖 )}_{𝑖 =1,...,𝑛} ⊂ X× Y be a dataset of 𝑛 input–output pairs. Write 𝒚 ≔ (𝒚_𝑖 )_𝑖 ∈
Y^𝑛 for the vector of concatenated outputs. Then for 𝜆 > 0, the solution to

    minimize { (1/𝑛) ∑_{𝑖=1}^{𝑛} ‖𝒚_𝑖 − 𝒇 (𝒙_𝑖 )‖²_Y + (𝜆/𝑛) ‖𝒇 ‖²_{H_𝑲} : 𝒇 ∈ H_𝑲 }

is attained by

    𝒇̂ = ∑_{𝑖=1}^{𝑛} 𝑲 (·, 𝒙_𝑖 ) 𝜷_𝑖 ,

where 𝜷 ≔ (𝜷_𝑖 )_𝑖 ∈ Y^𝑛 solves the block linear operator equation

    (𝑲 ( X, X) + 𝜆 I) 𝜷 = 𝒚 .    (8.11)

Aside: A major shortcoming of vector-valued kernel ridge regression is its lack of
scalability. Taking Y = 𝔽^𝑝 , the block kernel matrix 𝑲 ( X, X) requires 𝑂 (𝑛²𝑝²) space
and the linear solve in (8.11) has 𝑂 (𝑛³𝑝³) time complexity. Yet, there is much research
activity attempting to improve scalability, e.g., with random features.

Proof. The proof is a standard exercise in orthogonality, see Problem 8.27. 


Problem 8.27 (Representer). Prove Theorem 8.26.
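For a separable matrix-valued kernel 𝑲 = 𝑘 𝑻 with Y = ℝ^𝑝, the block system (8.11) is a Kronecker-structured linear solve. Here is a minimal NumPy sketch (the RBF kernel, the synthetic data, and the regularization value are illustrative choices of ours).

```python
import numpy as np

rng = np.random.default_rng(5)

def k_rbf(X1, X2, ell=0.5):
    """Scalar RBF kernel matrix between two input sets."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

n, p = 40, 2
X = rng.uniform(-1, 1, (n, 1))                          # inputs in R
Y = np.column_stack([np.sin(3 * X[:, 0]), np.cos(2 * X[:, 0])]) + 0.05 * rng.standard_normal((n, p))

T = np.eye(p)                                           # output operator of K = k T
lam = 1e-3

# Solve the block system (K(X, X) + lam I) beta = y, with K(X, X) = kron(k(X, X), T)
K_big = np.kron(k_rbf(X, X), T)
beta = np.linalg.solve(K_big + lam * np.eye(n * p), Y.reshape(-1)).reshape(n, p)

def f_hat(x_new):
    """KRR estimator f(x) = sum_i K(x, x_i) beta_i = sum_i k(x, x_i) T beta_i."""
    kv = k_rbf(np.atleast_2d(x_new), X)[0]
    return (kv[:, None] * (beta @ T.T)).sum(axis=0)

print(f_hat(np.array([0.3])), [np.sin(0.9), np.cos(0.6)])   # prediction vs. noiseless target
```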

Notes
The presentation of operator-valued kernels in this lecture is new. The theory of RKHS
is largely drawn from [MP05] while many of the exercises scattered throughout are
original. The examples of separable kernels are from [Kad+16] and [MP05]. The
random feature and GP Brownian bridge examples in the operator-valued setting are
new, as is the characterization of their RKHSs.
A complete theory of operator-valued kernels in a much more general setting than
that presented here was developed in the 1960s [Sch64]. Since then, many refinements

have been made, especially with learning theoretic considerations in mind [CDV07;
Cap+08; Car+10; MP05]. There is a rich but technical literature on universal kernels
in both the scalar- and operator-valued setting [Cap+08; Car+10; MXZ06]. These are
kernels whose associated RKHS is dense in the space of continuous functions equipped
with the topology of uniform convergence over compact subsets. This universal
approximation property is of direct relevance to well-posed learning algorithms. Under
fairly weak assumptions the translation-invariant and radial kernels characterized by
Bochner’s and Schoenberg’s theorems are universal. Random feature developments
along similar lines in the operator-valued setting may be found in [BHB16; Min16].

Lecture bibliography
[ARL+12] M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al. “Kernels for vector-valued
functions: A review”. In: Foundations and Trends® in Machine Learning 4.3 (2012),
pages 195–266.
[Aro50] N. Aronszajn. “Theory of reproducing kernels”. In: Transactions of the American
mathematical society 68.3 (1950), pages 337–404.
[BHB16] R. Brault, M. Heinonen, and F. Buc. “Random fourier features for operator-valued
kernels”. In: Asian Conference on Machine Learning. PMLR. 2016, pages 110–125.
[CDV07] A. Caponnetto and E. De Vito. “Optimal rates for the regularized least-squares
algorithm”. In: Foundations of Computational Mathematics 7.3 (2007), pages 331–
368.
[Cap+08] A. Caponnetto et al. “Universal multi-task kernels”. In: The Journal of Machine
Learning Research 9 (2008), pages 1615–1646.
[Car+10] C. Carmeli et al. “Vector valued reproducing kernel Hilbert spaces and universality”.
In: Analysis and Applications 8.01 (2010), pages 19–61.
[Car97] R. Caruana. “Multitask learning”. In: Machine learning 28.1 (1997), pages 41–75.
[Eva10] L. C. Evans. Partial differential equations. American Mathematical Soc., 2010.
[Evg+05] T. Evgeniou et al. “Learning multiple tasks with kernel methods.” In: Journal of
machine learning research 6.4 (2005).
[Kad+16] H. Kadri et al. “Operator-valued kernels for learning from functional response
data”. In: Journal of Machine Learning Research 17.20 (2016), pages 1–54.
[MP05] C. A. Micchelli and M. Pontil. “On learning vector-valued functions”. In: Neural
computation 17.1 (2005), pages 177–204.
[MXZ06] C. A. Micchelli, Y. Xu, and H. Zhang. “Universal Kernels.” In: Journal of Machine
Learning Research 7.12 (2006).
[Min16] H. Q. Minh. “Operator-valued Bochner theorem, Fourier feature maps for operator-
valued kernels, and vector-valued learning”. In: arXiv preprint arXiv:1608.05639
(2016).
[NS21] N. H. Nelsen and A. M. Stuart. “The random feature model for input-output maps
between banach spaces”. In: SIAM Journal on Scientific Computing 43.5 (2021),
A3212–A3243.
[RR08] A. Rahimi and B. Recht. “Random Features for Large-Scale Kernel Machines”.
In: Advances in Neural Information Processing Systems 20. Curran Associates, Inc.,
2008, pages 1177–1184. url: http://papers.nips.cc/paper/3182-random-
features-for-large-scale-kernel-machines.pdf.
[Sch64] L. Schwartz. “Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux
associés (noyaux reproduisants)”. In: Journal d’analyse mathématique 13.1 (1964),
pages 115–256.
9. Spectral Radius Perturbation and Stability

Date: 15 March 2022 Author: Jing Yu

Agenda:
1. System Stability and Spectral Radius
2. Geršgorin Disks
3. Bounding Spectral Radius

Eigenvalues of a matrix are continuous functions of the entries of the matrix. We
have studied how eigenvalues of a Hermitian matrix change with respect to additive
Hermitian perturbations in Lecture 9. Beyond the Hermitian setting, there are many
perturbation results. The eigenvalue perturbation problem arises in many applications.
In particular, the stability of a linear dynamical system is characterized by the location
of the eigenvalues of the system matrices.
In this project, we introduce an eigenvalue perturbation problem that is intimately
related to the robustness of a given linear controller for a linear dynamical system. We
will survey some perturbation results relevant to our problem setting and demonstrate
their usage for the control problem at hand.

9.1 System stability and spectral radius


A central question in linear control theory is whether a given linear time-invariant (LTI)
control system is stable under a chosen feedback gain. An LTI system is represented as

𝒙 𝑡 +1 = 𝑨𝒙 𝑡 + 𝑩𝒖 𝑡 , (9.1)

where 𝒙 𝑡 ∈ ℝ𝑛 is the state indexed by discrete time 𝑡 , vector 𝒖 𝑡 ∈ ℝ𝑚 is the control


input, while 𝑨 ∈ 𝕄𝑛 and 𝑩 ∈ ℝ𝑛×𝑚 are called system matrices. Generally, one wants
to design a feedback gain matrix 𝑲 ∈ ℝ𝑚×𝑛 such that control inputs are chosen as
𝒖 𝑡 = −𝑲 𝒙 𝑡 for the system (9.1). In general, feedback gain 𝑲 is designed with respect
to the system matrices 𝑨 and 𝑩 . Under a fixed feedback gain 𝑲 , the closed-loop
dynamics of (9.1) is
𝒙 𝑡 +1 = (𝑨 − 𝑩𝑲 )𝒙 𝑡 . (9.2)
A key property for studying LTI systems is the spectral radius.

Definition 9.1 (Spectral radius). The spectral radius of a matrix 𝑨 ∈ 𝕄𝑛 is denoted as


𝜌 (𝑨) and defined as
    𝜌 (𝑨) := max_𝑖 |𝜆_𝑖 (𝑨)| ,

where 𝜆𝑖 (𝑨) ’s are the eigenvalues of 𝑨 .

The next definition connects stability of (9.2) with the spectral radius of the system
matrices.
Definition 9.2 (Stability). We say the closed-loop dynamics (9.2) is stable under feedback
gain 𝑲 if 𝜌 (𝑨 − 𝑩𝑲 ) < 1.

Aside: Another common stability definition states that (9.2) is stable if for all 𝒙_0 ∈ ℝ^𝑛 ,
𝒙_𝑡 → 0 as 𝑡 → ∞. Using the Jordan decomposition of the closed-loop matrix, it can
readily be checked that the two definitions of stability are equivalent.

For a feedback gain 𝑲 designed for 𝑨 and 𝑩 , how much can we perturb 𝑨 with
𝚫 ∈ 𝕄𝑛 such that 𝜌 (𝑨 + 𝚫 − 𝑩𝑲 ) < 1?

Depending on how 𝑲 is synthesized according to 𝑨 , the answer to this question can


be quite different. The property that a feedback gain 𝑲 maintains closed-loop stability
when 𝑨 varies is called robustness in control theory literature. In this project, we
will investigate the robustness of a specific feedback gain that is synthesized from the
discrete-time algebraic Riccati equation (DARE) for a class of LTI systems that has
block diagonal 𝑨 and 𝑩 .
Consider dynamics (9.1) with diagonal 𝑨 and 𝑩 . A linear quadratic (LQ) optimal
control gain 𝑲 ★ is the solution to the following optimization,
    minimize_{𝑲 ∈ ℝ^{𝑚×𝑛}}  ∑_{𝑡 =1}^{∞} ‖𝒙_𝑡 ‖²_𝑸 + ‖𝒖_𝑡 ‖²_𝑹
    subject to  𝒙_{𝑡 +1} = 𝑨𝒙_𝑡 + 𝑩𝒖_𝑡 ,   𝒖_𝑡 = −𝑲 𝒙_𝑡 ,

where ‖𝒙_𝑡 ‖²_𝑸 = 𝒙_𝑡ᵀ 𝑸 𝒙_𝑡 and ‖𝒖_𝑡 ‖²_𝑹 = 𝒖_𝑡ᵀ 𝑹 𝒖_𝑡 with block diagonal matrices 𝑸 , 𝑹 ≻ 0 .
Note that the objective is finite if and only if the closed loop is stable.
It is well known that the optimal solution is given by the solution 𝑷 = 𝑷 ᵀ ≻ 0 to the
DARE shown below,

    𝑷 = 𝑨ᵀ𝑷 𝑨 − 𝑨ᵀ𝑷 𝑩 ( 𝑹 + 𝑩ᵀ𝑷 𝑩 )^{−1} 𝑩ᵀ𝑷 𝑨 + 𝑸 .    (9.4)

The optimal control gain 𝑲^★ can be computed as

    𝑲^★ = ( 𝑹 + 𝑩ᵀ𝑷 𝑩 )^{−1} 𝑩ᵀ𝑷 𝑨 .    (9.5)

When 𝑨, 𝑩, 𝑸, 𝑹 are all diagonal, it can be shown that 𝑲★ is diagonal as well.
These diagonal structures arise when one considers dynamical systems that are made
up of multiple independent scalar subsystems. The generalized setting considers
block-diagonal matrices instead of diagonal matrices. Examples of such systems are
networks of swarm robots.
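As a sketch of how (9.4) and (9.5) can be evaluated numerically, the following Python snippet uses SciPy's DARE solver solve_discrete_are on the diagonal data that appears in Example 9.9 below (𝑨 = 𝑩 = I, 𝑸 = I, 𝑹 = 20 · I); the computed 𝑷 and 𝑲★ come out diagonal, as claimed.

    import numpy as np
    from scipy.linalg import solve_discrete_are

    A = np.eye(2)            # diagonal system matrices
    B = np.eye(2)
    Q = np.eye(2)
    R = 20.0 * np.eye(2)

    P = solve_discrete_are(A, B, Q, R)                 # solves the DARE (9.4)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain (9.5)
    print(np.diag(P))   # [5. 5.]   -- P is diagonal
    print(np.diag(K))   # [0.2 0.2] -- so is K*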
Now suppose that 𝑨 is not perfectly diagonal but instead has nonzero off-diagonal
entries. How large can these off-diagonal entries be before closed-loop stability under
𝑲★ is lost? This question corresponds to the case where we design an optimal controller
for each subsystem in the network independently, while in reality the subsystems are
loosely coupled dynamically. This type of analysis quantifies how strongly coupled the
subsystems can be before 𝑲★ (designed assuming no dynamical coupling) no longer
stabilizes the closed loop.

9.2 Geršgorin disks


Given a matrix 𝑨 ∈ 𝕄𝑛 , we can always write 𝑨 = 𝑫 + 𝑩 where 𝑫 is a diagonal
matrix containing the main diagonal of 𝑨 and 𝑩 is a zero-diagonal matrix containing
the off-diagonal entries of 𝑨. Now consider the matrix 𝑨_𝑡 = 𝑫 + 𝑡𝑩; then
𝑨_0 = 𝑫 and 𝑨_1 = 𝑨. The eigenvalues of 𝑨_0 are immediately the diagonal elements
of the matrix. On the other hand, if 𝑡 is small enough, the eigenvalues of 𝑨_𝑡 will lie
in a neighborhood of the diagonal entries of 𝑨 due to continuity. The Geršgorin disk
theorem formalizes this intuition about the locations of the eigenvalues. In particular,
it provides an explicit bound on the eigenvalues in terms of the entries of the matrix.
Let us first define the Geršgorin discs of a matrix.

Definition 9.3 (Geršgorin disc). Let 𝑨 = [𝑎_{𝑖𝑗}] ∈ 𝕄𝑛. The Geršgorin discs of 𝑨 are

    D_𝑖(𝑨) = { 𝑧 ∈ ℂ : |𝑧 − 𝑎_{𝑖𝑖}| ≤ ∑_{𝑗≠𝑖} |𝑎_{𝑖𝑗}| },   for 𝑖 = 1, . . . , 𝑛.

In other words, the Geršgorin discs are balls in the complex plane with 𝑎_{𝑖𝑖} as the
center and ∑_{𝑗≠𝑖} |𝑎_{𝑖𝑗}| as the radius. It turns out that the Geršgorin discs can be used to
locate the eigenvalues of a matrix.

Aside: Both Definition 9.3 and Theorem 9.4 are valid for complex matrices as well.

Theorem 9.4 (Geršgorin [HJ13]). The eigenvalues of 𝑨 ∈ 𝕄𝑛 lie in the union of its
Geršgorin discs, ⋃_{𝑖=1}^{𝑛} D_𝑖(𝑨). Furthermore, if the union of 𝑘 discs is disjoint from
the remaining 𝑛 − 𝑘 discs, then this union contains exactly 𝑘 eigenvalues of 𝑨,
counted according to their algebraic multiplicities.

Proof. Let (𝜆, 𝒗), with 𝒗 = [𝑣_𝑗] ∈ ℂ^𝑛 and 𝒗 ≠ 0, be an eigenpair of 𝑨. Let 𝑖 ∈ {1, 2, . . . , 𝑛}
be an index such that ‖𝒗‖_∞ = |𝑣_𝑖| ≠ 0. Writing out the 𝑖th entry of 𝑨𝒗 = 𝜆𝒗, we have

    ∑_𝑗 𝑎_{𝑖𝑗} 𝑣_𝑗 = 𝜆𝑣_𝑖.

Subtracting 𝑎_{𝑖𝑖}𝑣_𝑖 from both sides,

    ∑_{𝑗≠𝑖} 𝑎_{𝑖𝑗} 𝑣_𝑗 = (𝜆 − 𝑎_{𝑖𝑖}) 𝑣_𝑖.

Therefore, using the triangle inequality and |𝑣_𝑗| ≤ ‖𝒗‖_∞ = |𝑣_𝑖| for every 𝑗,

    |𝜆 − 𝑎_{𝑖𝑖}| |𝑣_𝑖| = |(𝜆 − 𝑎_{𝑖𝑖}) 𝑣_𝑖| = |∑_{𝑗≠𝑖} 𝑎_{𝑖𝑗} 𝑣_𝑗| ≤ ∑_{𝑗≠𝑖} |𝑎_{𝑖𝑗}| |𝑣_𝑖|.

Dividing both sides by |𝑣_𝑖|, we conclude 𝜆 ∈ D_𝑖(𝑨).


Now suppose there are 𝑘 Geršgorin discs whose union is disjoint from the
remaining 𝑛 − 𝑘 discs. Without loss of generality, we assume the first 𝑘 Geršgorin
discs are such discs, and we denote their union by 𝐺_𝑘(𝑨) = ⋃_{𝑖=1}^{𝑘} D_𝑖(𝑨). We write
𝑨 = 𝑫 + 𝑩, where 𝑫 is a diagonal matrix containing the diagonal entries of 𝑨 and
𝑩 = 𝑨 − 𝑫. Consider 𝑨_𝑡 = 𝑫 + 𝑡𝑩 with 𝑡 ∈ [0, 1]. Then 𝑨_0 = 𝑫 and 𝑨_1 = 𝑨. In
particular, observe that the 𝑖th Geršgorin disc of 𝑨_𝑡 has the same center 𝑎_{𝑖𝑖} as the 𝑖th
disc of 𝑨 and radius 𝑡 ∑_{𝑗≠𝑖} |𝑎_{𝑖𝑗}|, which is at most the radius of the 𝑖th disc of 𝑨.
Therefore, each Geršgorin disc of 𝑨_𝑡 is contained in the corresponding Geršgorin disc
of 𝑨. Consequently, the corresponding union of 𝑘 discs of 𝑨_𝑡, denoted 𝐺_𝑘(𝑨_𝑡), is
contained in 𝐺_𝑘(𝑨). This in turn means that 𝐺_𝑘(𝑨_𝑡) is disjoint from the remaining 𝑛 − 𝑘
Geršgorin discs of 𝑨_𝑡. The above argument holds for all 𝑡 ∈ [0, 1].
Our goal now is to show that the number of eigenvalues contained inside a fixed
curve surrounding 𝐺_𝑘(𝑨) remains constant for all 𝑡 ∈ [0, 1]. This, combined with
the fact that 𝐺_𝑘(𝑨_0) contains exactly 𝑘 eigenvalues of 𝑨_0 (namely 𝑎_{11}, . . . , 𝑎_{𝑘𝑘}), asserts that
𝐺_𝑘(𝑨) contains 𝑘 eigenvalues of 𝑨. Specifically, let Γ be a simple closed curve of finite
length in the complex plane that surrounds 𝐺_𝑘(𝑨) and is disjoint from the remaining
𝑛 − 𝑘 Geršgorin discs of 𝑨. Since every eigenvalue of 𝑨_𝑡 lies in the union of the Geršgorin
discs of 𝑨_𝑡, which is contained in the union of the discs of 𝑨, the curve Γ does not pass
through any eigenvalue of any 𝑨_𝑡.
Let 𝑝_𝑡(𝑧) denote the characteristic polynomial of 𝑨_𝑡, that is, 𝑝_𝑡(𝑧) = det(𝑧I − 𝑨_𝑡).
The coefficients of 𝑝_𝑡(𝑧) are polynomials in 𝑡. Further, 𝑝_𝑡(𝑧) has no poles, and its zeros
are the eigenvalues of 𝑨_𝑡. Therefore, we may apply Cauchy’s argument principle to
𝑝_𝑡(𝑧), which states that

    𝑁(𝑡) = (1/2π𝑖) ∮_Γ 𝑝_𝑡′(𝑧)/𝑝_𝑡(𝑧) d𝑧,

where 𝑁(𝑡) is an integer-valued function of 𝑡 that counts the zeros of 𝑝_𝑡(𝑧) inside Γ.
Since the integrand depends continuously on 𝑡 (the coefficients of 𝑝_𝑡 are polynomials
in 𝑡 and 𝑝_𝑡 does not vanish on Γ), 𝑁(𝑡) is a continuous function of 𝑡 on [0, 1]. A
continuous integer-valued function on a connected set is constant, so 𝑁(𝑡) is constant
on 𝑡 ∈ [0, 1]. This means that the number of zeros of 𝑝_𝑡(𝑧) (the eigenvalues of 𝑨_𝑡)
inside Γ remains the same regardless of our choice of 𝑡. Since 𝑁(0) = 𝑘, we know
𝑁(1) = 𝑘. Therefore, there are 𝑘 eigenvalues of 𝑨 inside Γ. By the first part of the
theorem, every eigenvalue of 𝑨 has to lie in some Geršgorin disc. Therefore,
we conclude that there are exactly 𝑘 eigenvalues of 𝑨 in 𝐺_𝑘(𝑨). ∎
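The first part of Theorem 9.4 is easy to check numerically. The sketch below (assuming NumPy) draws a random matrix and verifies that every eigenvalue lies in at least one Geršgorin disc.

    import numpy as np

    def gershgorin_discs(A):
        # centers a_ii and radii sum_{j != i} |a_ij| of the Gersgorin discs
        centers = np.diag(A).astype(complex)
        radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
        return centers, radii

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    centers, radii = gershgorin_discs(A)
    for lam in np.linalg.eigvals(A):
        # each eigenvalue must fall inside the union of the discs
        assert np.any(np.abs(lam - centers) <= radii + 1e-12)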
Theorem 9.4 provides a nice quantitative bound on the eigenvalues. One may
wonder if there are refinements or alternatives that can help improve our estimate of
the locations of the eigenvalues. The answer is affirmative. Below we list a few such
results.
Corollary 9.5 (Transpose discs). The eigenvalues of a matrix 𝑨 lie in the union of the
Geršgorin discs of 𝑨ᵀ; hence they lie in the intersection of this union with the union of
discs from Theorem 9.4.

Proof. Observe that 𝑨 and 𝑨ᵀ have the same eigenvalues. Apply Theorem 9.4 to 𝑨ᵀ to
obtain the result. ∎
Corollary 9.6 (Scaling discs). Let 𝑨 = [𝑎_{𝑖𝑗}] ∈ 𝕄𝑛 and let 𝑫 ∈ 𝕄𝑛 be a diagonal matrix
with positive real numbers 𝑝_1, 𝑝_2, . . . , 𝑝_𝑛 on the main diagonal. The eigenvalues 𝜆_𝑖
of 𝑨 are in the union of the Geršgorin discs of 𝑫⁻¹𝑨𝑫, that is,

    𝜆_𝑖(𝑨) ∈ ⋃_{𝑖=1}^{𝑛} { 𝑧 ∈ ℂ : |𝑧 − 𝑎_{𝑖𝑖}| ≤ (1/𝑝_𝑖) ∑_{𝑗≠𝑖} 𝑝_𝑗 |𝑎_{𝑖𝑗}| },

for all 𝑖 = 1, . . . , 𝑛.
Together, Theorem 9.4 and Corollaries 9.5 and 9.6 provide three different approximations
of the eigenvalues of 𝑨, with Corollary 9.6 being especially useful in some cases.
Example 9.7 Consider the matrix

    𝑨 = [ 10    5    8
           0  −11    1
           0   −1   13 ].
The eigenvalues of 𝑨 are −10, 10, 12.96. The Geršgorin discs of 𝑨 are plotted in Figure
9.1. In particular, the union of the blue and the yellow disc is disjoint from the red disc.
Corroborating Theorem 9.4, we see that there are exactly 2 eigenvalues of 𝑨 inside the
union of the blue and the yellow disc.
Now consider the scaling matrix 𝑫 = diag(13, 1, 1). The Geršgorin discs of the
matrix 𝑫⁻¹𝑨𝑫 are shown in Figure 9.2. This particular choice of scaling significantly
improves the estimate for the eigenvalue 10. For this particular example, we can actually
compute the optimal scaling matrix. To see this, note that the radii of the scaled
discs are (5𝑝_2 + 8𝑝_3)/𝑝_1, 𝑝_3/𝑝_2, and 𝑝_2/𝑝_3. Therefore, the best we can do without
making a worse estimate than the original discs is to set 𝑝_2 = 𝑝_3; then we can shrink
the radius of the first disc, 13𝑝_2/𝑝_1, arbitrarily by making 𝑝_1 large, improving our
estimate of the eigenvalue 10. We can achieve this improvement while maintaining the
same estimates for the eigenvalues −10 and 12.96. ∎
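A short numerical check of Example 9.7 (assuming NumPy): the similarity transform leaves the eigenvalues unchanged, while the scaling 𝑫 = diag(13, 1, 1) shrinks the radius of the first disc from 13 to 1.

    import numpy as np

    A = np.array([[10.0,   5.0,  8.0],
                  [ 0.0, -11.0,  1.0],
                  [ 0.0,  -1.0, 13.0]])
    D = np.diag([13.0, 1.0, 1.0])
    A_scaled = np.linalg.inv(D) @ A @ D

    def radii(M):
        # Gersgorin radii: off-diagonal absolute row sums
        return np.sum(np.abs(M), axis=1) - np.abs(np.diag(M))

    print(np.sort(np.linalg.eigvals(A)))   # unchanged by the scaling
    print(radii(A))          # [13. 1. 1.]
    print(radii(A_scaled))   # [ 1. 1. 1.]  -- the first disc shrinks, cf. Figure 9.2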


Figure 9.1 The eigenvalues and Geršgorin disks of a matrix plotted on the complex plane.
The crosses are the eigenvalues of the matrix, while the circles denote the Geršgorin
disks of the matrix.


Figure 9.2 The scaled Geršgorin disks of the same matrix plotted on the complex plane.
The crosses are the eigenvalues of the matrix, while the circles are the Geršgorin disks
of the matrix. Compared with the original Geršgorin disks in Figure 9.1, the scaling
significantly improves the estimate from the blue disc while maintaining the same radii
for the other discs.

9.3 Bounding spectral radius


Recall our original problem from Section 9.1. We would like to understand how sensitive
the stability of the closed-loop dynamics (9.2) is to changes in the dynamical
coupling among the subsystems. With tools from the previous section, we can, for example,
directly apply Theorem 9.4 and conclude the following.
Corollary 9.8 (Sufficient condition for robust stability). Let 𝚫 = [𝛿_{𝑖𝑗}] be a zero-diagonal
matrix denoting the unaccounted dynamical coupling among subsystems. Given an
LQ optimal feedback gain 𝑲★ designed for diagonal system matrices 𝑨 and 𝑩, the
closed-loop dynamics under 𝑲★ applied to 𝑨 + 𝚫 and 𝑩 is stable if the coupling matrix
𝚫 satisfies

    ∑_{𝑗=1}^{𝑛} |𝛿_{𝑖𝑗}| < 1 − |(𝑨 − 𝑩𝑲★)_{𝑖𝑖}|   for 𝑖 = 1, . . . , 𝑛,    (9.6)

where (𝑨 − 𝑩𝑲★)_{𝑖𝑖} denotes the 𝑖th diagonal entry of 𝑨 − 𝑩𝑲★.


From this, we can also see that the more stable the LQ controller makes the ideal
diagonal closed loop (that is, the smaller the closed-loop eigenvalues are in magnitude),
the more margin we have for the true, dynamically coupled dynamics 𝑨 + 𝚫 and 𝑩 to
vary. Controllers with more margin to handle perturbations in the system matrices are
said to be more robust.
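Condition (9.6) is straightforward to evaluate in practice. A minimal helper is sketched below (assuming NumPy; the function name is ours, not from the notes).

    import numpy as np

    def satisfies_condition_96(A_cl, Delta):
        # A_cl: the diagonal closed-loop matrix A - B K*;  Delta: zero-diagonal coupling.
        # Returns True when every row of |Delta| sums to less than the stability margin.
        margins = 1.0 - np.abs(np.diag(A_cl))
        row_sums = np.sum(np.abs(Delta), axis=1)
        return bool(np.all(row_sums < margins))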
Example 9.9 (of Corollary 9.8). Consider the following nominal dynamics

    𝒙_{𝑡+1} = 𝑨𝒙_𝑡 + 𝑩𝒖_𝑡,   with 𝑨 = 𝑩 = [ 1  0 ; 0  1 ],

for two independently evolving scalar states 𝑥_𝑡¹ := 𝒙_𝑡(1) and 𝑥_𝑡² := 𝒙_𝑡(2) with their
own control inputs 𝑢_𝑡¹ := 𝒖_𝑡(1) and 𝑢_𝑡² := 𝒖_𝑡(2), respectively. Now suppose that we
have used LQ optimal control theory to design the control gain 𝑲 = [ 0.2  0 ; 0  0.2 ]
via (9.4) and (9.5) for the quadratic cost 𝑸 = I and 𝑹 = 20 · I. Let 𝚫 = [ 0  𝛼 ; 𝛽  0 ] be the
unaccounted dynamical coupling between 𝑥¹ and 𝑥², so that the true system matrices
are 𝑨 + 𝚫, 𝑩 instead of 𝑨, 𝑩.
A simple calculation reveals that 𝑨 − 𝑩𝑲 = [ 0.8  0 ; 0  0.8 ]. According to the
sufficient condition (9.6), as long as 𝛼 and 𝛽 are each strictly less than 1 − 0.8 = 0.2,
the gain 𝑲 designed for the diagonal matrices 𝑨 and 𝑩 continues to stabilize 𝑨 + 𝚫 and 𝑩. We
can numerically check how tight this sufficient condition is in this example for varying
values of 𝛼 and 𝛽. In Figure 9.3, each horizontal curve on the surface corresponds to
a fixed 𝛽 value as we sweep over different 𝛼 values. Any point on the surface with a
𝑦-axis value greater than 1 means that the corresponding closed-loop system is
unstable. The sufficient condition (9.6) is loose, because for the top curve (the top boundary
of the surface), which corresponds to 𝛽 = 0.3, there are clearly 𝛼 values small enough
such that the spectral radius of the resulting closed loop remains below 1. On the
other hand, the sufficient condition (9.6) is tight in the following sense. Although not
explicitly shown in the plot, when 𝛼 = 0.2 and 𝛽 = 0.2, i.e., when (9.6) holds with equality,
the spectral radius of the closed loop is 1, which is the cross-over point between
stability and instability. Combined with the plot, we see that if we fix one of the two
variables to be 0.2, then the closed-loop system becomes unstable as soon as the other
variable exceeds 0.2. ∎

Figure 9.3 The spectral radius of the true closed loop 𝜌 (𝑨 − 𝑩𝑲 + 𝚫) for fixed 𝛽 values
varying from 0 to 0.3.
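The experiment behind Figure 9.3 can be reproduced with a few lines (assuming NumPy); the values below match the discussion in Example 9.9.

    import numpy as np

    A, B, K = np.eye(2), np.eye(2), 0.2 * np.eye(2)

    def closed_loop_radius(alpha, beta):
        # spectral radius of the perturbed closed loop A + Delta - B K
        Delta = np.array([[0.0, alpha], [beta, 0.0]])
        return np.max(np.abs(np.linalg.eigvals(A + Delta - B @ K)))

    print(closed_loop_radius(0.2, 0.2))    # 1.0: the boundary case where (9.6) holds with equality
    print(closed_loop_radius(0.05, 0.3))   # about 0.92: stable even though (9.6) fails, so (9.6) is only sufficient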

Since the stability of a linear dynamical system is entirely characterized by the spectral
radius of the closed-loop system, it is relevant to study various ways to bound the
spectral radius of a matrix. We first present and prove the following variational
characterization of the spectral radius of a generic real matrix.
Lemma 9.10 (Variational characterization of spectral radius). For 𝑨 ∈ 𝕄𝑛, the following
holds:

    𝜌(𝑨) = inf_{𝑫 ∈ 𝕄𝑛 invertible} ‖𝑫𝑨𝑫⁻¹‖_2.

Aside: Here we use ‖·‖_2 to denote the spectral norm.

Proof. Let 𝑫 ∈ 𝕄𝑛 be an invertible matrix. Then 𝜌(𝑨) = 𝜌(𝑫𝑨𝑫⁻¹) ≤ ‖𝑫𝑨𝑫⁻¹‖_2,
where the first equality comes from the fact that similarity transformations do
not affect eigenvalues. This bound holds for every invertible 𝑫 ∈ 𝕄𝑛, therefore
𝜌(𝑨) ≤ inf_{𝑫 ∈ 𝕄𝑛 invertible} ‖𝑫𝑨𝑫⁻¹‖_2.
Now suppose 𝑨 has Jordan decomposition 𝑨 = 𝑻⁻¹𝑱𝑻, where 𝑱 is in Jordan
normal form with the eigenvalues of 𝑨 on the diagonal and 0’s and 1’s on the superdiagonal. Let
𝑷 = diag(1, 𝑘, 𝑘², . . . , 𝑘^{𝑛−1}) where 𝑘 > 0. Observe that 𝑷𝑱𝑷⁻¹ has the eigenvalues
of 𝑨 on the diagonal, while each superdiagonal entry is scaled by 1/𝑘. Therefore, as
𝑘 → ∞, 𝑷𝑱𝑷⁻¹ → diag(𝝀(𝑨)), where 𝝀(𝑨) is the vector of eigenvalues of 𝑨 repeated
according to algebraic multiplicity, and consequently ‖𝑷𝑱𝑷⁻¹‖_2 → 𝜌(𝑨). Letting 𝑫 = 𝑷𝑻,
we have 𝑫𝑨𝑫⁻¹ = 𝑷𝑱𝑷⁻¹, so inf_{𝑫 ∈ 𝕄𝑛 invertible} ‖𝑫𝑨𝑫⁻¹‖_2 ≤ 𝜌(𝑨). Combining the
two bounds proves the claim. ∎
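The scaling trick in the proof is easy to see numerically on a single 2×2 Jordan block, for which 𝑻 = I and 𝑫 = 𝑷 = diag(1, 𝑘); the sketch below assumes NumPy.

    import numpy as np

    A = np.array([[0.9, 1.0],
                  [0.0, 0.9]])             # one Jordan block, rho(A) = 0.9
    for k in [1.0, 10.0, 100.0, 1000.0]:
        D = np.diag([1.0, k])
        M = D @ A @ np.linalg.inv(D)       # superdiagonal entry becomes 1/k
        print(k, np.linalg.norm(M, 2))     # spectral norm decreases toward 0.9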
The variational characterization of the spectral radius reminds us of Corollary 9.6,
where we bounded the eigenvalues of a matrix 𝑨 with the entries of 𝑫 𝑨𝑫 −1 for 𝑫 a
diagonal matrix with positive real diagonal entries. Indeed, we can bound the spectral
radius of 𝑨 also using the entries of 𝑫 𝑨𝑫 −1 .
Corollary 9.11 (Variational bound on spectral radius via Geršgorin). Let 𝑨 = [𝑎_{𝑖𝑗}] ∈ 𝕄𝑛.
Then

    𝜌(𝑨) ≤ min_{𝑝_1,...,𝑝_𝑛 > 0} max_{1≤𝑖≤𝑛} (1/𝑝_𝑖) ∑_{𝑗=1}^{𝑛} 𝑝_𝑗 |𝑎_{𝑖𝑗}|

and

    𝜌(𝑨) ≤ min_{𝑝_1,...,𝑝_𝑛 > 0} max_{1≤𝑗≤𝑛} (1/𝑝_𝑗) ∑_{𝑖=1}^{𝑛} 𝑝_𝑖 |𝑎_{𝑖𝑗}|.    (9.7)

Proof. By Corollary 9.6, all eigenvalues of 𝑨 belong to the union of the scaled Geršgorin
discs, ⋃_{𝑖=1}^{𝑛} { 𝑧 ∈ ℂ : |𝑧 − 𝑎_{𝑖𝑖}| ≤ (1/𝑝_𝑖) ∑_{𝑗≠𝑖} 𝑝_𝑗 |𝑎_{𝑖𝑗}| }. For each Geršgorin disc of
the scaled matrix 𝑫⁻¹𝑨𝑫 with 𝑫 = diag(𝑝_1, . . . , 𝑝_𝑛) and 𝑝_1, . . . , 𝑝_𝑛 > 0, the furthest
point from the origin in the complex plane has modulus |𝑎_{𝑖𝑖}| + (1/𝑝_𝑖) ∑_{𝑗≠𝑖} 𝑝_𝑗 |𝑎_{𝑖𝑗}|
= (1/𝑝_𝑖) ∑_{𝑗=1}^{𝑛} 𝑝_𝑗 |𝑎_{𝑖𝑗}|. Therefore, every eigenvalue of 𝑨, in particular one of
largest modulus, has modulus at most the maximum of these values over all discs of
𝑫⁻¹𝑨𝑫. Applying the same reasoning to 𝑨ᵀ via Corollary 9.5, we obtain (9.7). ∎
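As a numerical illustration of Corollary 9.11 (assuming NumPy), evaluating the first bound on the matrix from Example 9.7 gives 23 for the unscaled choice 𝑝 = (1, 1, 1) and 14 for the scaling 𝑝 = (13, 1, 1); both are valid upper bounds on 𝜌(𝑨) ≈ 12.96.

    import numpy as np

    A = np.array([[10.0,   5.0,  8.0],
                  [ 0.0, -11.0,  1.0],
                  [ 0.0,  -1.0, 13.0]])

    def gershgorin_bound(A, p):
        # (1/p_i) * sum_j p_j |a_ij|, maximized over rows i
        p = np.asarray(p, dtype=float)
        return np.max((np.abs(A) @ p) / p)

    print(np.max(np.abs(np.linalg.eigvals(A))))   # rho(A), about 12.96
    print(gershgorin_bound(A, [1.0, 1.0, 1.0]))   # 23.0
    print(gershgorin_bound(A, [13.0, 1.0, 1.0]))  # 14.0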

9.4 Notes
The majority of the results presented here are from Chapter 6 of [HJ13]. Some corollaries
are exercises from [HJ13]. The examples are inspired by and adapted from [Mar+16].
The variational characterization of the spectral radius is stated as a fact in [Hua+21]
and proved here by JY.

Lecture bibliography
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[Hua+21] D. Huang et al. “Matrix Concentration for Products”. In: Foundations of Computa-
tional Mathematics (2021), pages 1–33.
[Mar+16] D. Marquis et al. “Gershgorin’s Circle Theorem for Estimating the Eigenvalues of a
Matrix with Known Error Bounds”. 2016.
IV. back matter

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . 307
Bibliography

[AS64] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas,


graphs, and mathematical tables. For sale by the Superintendent of Documents. U.
S. Government Printing Office, Washington, D.C., 1964.
[Ahl66] L. V. Ahlfors. Complex analysis: An introduction of the theory of analytic functions of
one complex variable. Second. McGraw-Hill Book Co., New York-Toronto-London,
1966.
[AW01] R. Ahlswede and A. Winter. Strong Converse for Identification via Quantum Channels.
2001. arXiv: quant-ph/0012127 [quant-ph].
[Alo86] N. Alon. “Eigenvalues and Expanders”. In: Combinatorica 6.2 (June 1986), pages 83–
96. doi: 10.1007/BF02579166.
[AN06] N. Alon and A. Naor. “Approximating the cut-norm via Grothendieck’s inequality”.
In: SIAM J. Comput. 35.4 (2006), pages 787–803. doi: 10.1137/S0097539704441629.
[AP20] J. Altschuler and P. Parrilo. “Approximating Min-Mean-Cycle for low-diameter
graphs in near-optimal time and memory”. Available at https://arxiv.org/abs/2004.03114. 2020.
[AP22] J. Altschuler and P. Parrilo. “Near-linear convergence of the Random Osborne
algorithm for Matrix Balancing”. In: Math. Programming (2022). To appear.
[AWR17] J. Altschuler, J. Weed, and P. Rigollet. “Near-linear time approximation algorithms
for optimal transport via Sinkhorn iteration”. In: Advances in Neural Information
Processing Systems 30 (NIPS 2017). 2017.
[Alt+21] J. Altschuler et al. “Averaging on the Bures–Wasserstein manifold: dimension-free
convergence of gradient descent”. In: Advances in Neural Information Processing
Systems 34 (NeurIPS 2021). 2021.
[ARL+12] M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al. “Kernels for vector-valued
functions: A review”. In: Foundations and Trends® in Machine Learning 4.3 (2012),
pages 195–266.
[AGV21] N. Anari, S. O. Gharan, and C. Vinzant. “Log-concave polynomials, I: entropy and
a deterministic approximation algorithm for counting bases of matroids”. In: Duke
Math. J. 170.16 (2021), pages 3459–3504. doi: 10.1215/00127094-2020-0091.
[AJD69] W. N. Anderson Jr and R. J. Duffin. “Series and parallel addition of matrices”. In:
Journal of Mathematical Analysis and Applications 26.3 (1969), pages 576–594.
[AHJ87] T. Ando, R. A. Horn, and C. R. Johnson. “The singular values of a Hadamard product:
a basic inequality”. In: Linear and Multilinear Algebra 21.4 (1987), pages 345–365.
eprint: https://doi.org/10.1080/03081088708817810. doi: 10.1080/03081088708817810.
[And79] T. Ando. “Concavity of certain maps on positive definite matrices and applications
to Hadamard products”. In: Linear Algebra and its Applications 26 (1979), pages 203–
241.
[And89] T. Ando. “Majorization, doubly stochastic matrices, and comparison of eigenvalues”.
In: Linear Algebra and its Applications 118 (1989), pages 163–248.

[ABY20] R. Aoun, M. Banna, and P. Youssef. “Matrix Poincaré inequalities and concentration”.
In: Advances in Mathematics 371 (2020), page 107251. doi: https://doi.org/10.1016/j.aim.2020.107251.
[AB08] T. Arbogast and J. L. Bona. Methods of applied mathematics. ICES Report. UT-Austin,
2008.
[Aro50] N. Aronszajn. “Theory of reproducing kernels”. In: Transactions of the American
mathematical society 68.3 (1950), pages 337–404.
[Art11] M. Artin. Algebra. Pearson Prentice Hall, 2011.
[AV95] J. S. Aujla and H. Vasudeva. “Convex and monotone operator functions”. In: Annales
Polonici Mathematici. Volume 62. 1. 1995, pages 1–11.
[BCL94] K. Ball, E. A. Carlen, and E. H. Lieb. “Sharp Uniform Convexity and Smoothness
Inequalities for Trace Norms”. In: Invent Math 115.1 (Dec. 1994), pages 463–482.
doi: 10.1007/BF01231769.
[BBv21] A. S. Bandeira, M. T. Boedihardjo, and R. van Handel. “Matrix Concentration
Inequalities and Free Probability”. In: arXiv:2108.06312 [math] (Aug. 2021). arXiv:
2108.06312 [math].
[BR97] R. B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications. Cambridge
University Press, Cambridge, 1997. doi: 10.1017/CBO9780511529979.
[Bar02] A. Barvinok. A course in convexity. American Mathematical Society, Providence, RI,
2002. doi: 10.1090/gsm/054.
[Bar16] A. Barvinok. Combinatorics and complexity of partition functions. Springer, Cham,
2016. doi: 10.1007/978-3-319-51829-9.
[Bau+01] H. H. Bauschke et al. “Hyperbolic polynomials and convex analysis”. In: Canad. J.
Math. 53.3 (2001), pages 470–488. doi: 10.4153/CJM-2001-020-6.
[BT10] A. Beck and M. Teboulle. “On minimizing quadratically constrained ratio of two
quadratic functions”. In: J. Convex Anal. 17.3-4 (2010), pages 789–804.
[Bel+18] A. Belton et al. “A panorama of positivity”. Available at https://arXiv.org/abs/1812.05482. 2018.
[BTN01] A. Ben-Tal and A. Nemirovski. Lectures on modern convex optimization. Analysis,
algorithms, and engineering applications. Society for Industrial and Applied
Mathematics (SIAM), 2001. doi: 10.1137/1.9780898718829.
[Ben+17] P. Benner et al. Model reduction and approximation: theory and algorithms. SIAM,
2017.
[Ber08] C. Berg. “Stieltjes-Pick-Bernstein-Schoenberg and their connection to complete
monotonicity”. In: Positive Definite Functions: From Schoenberg to Space-Time
Challenges. Castellón de la Plana: University Jaume I, 2008, pages 15–45.
[BCR84] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups. Theory
of positive definite and related functions. Springer-Verlag, New York, 1984. doi:
10.1007/978-1-4612-1128-0.
[BP94] A. Berman and R. J. Plemmons. Nonnegative matrices in the mathematical sciences.
Revised reprint of the 1979 original. Society for Industrial and Applied Mathematics
(SIAM), Philadelphia, PA, 1994. doi: 10.1137/1.9781611971262.
[Ber12] D. Bertsekas. Dynamic Programming and Optimal Control: Volume I. v. 1. Athena
Scientific, 2012. url: https://books.google.com/books?id=qVBEEAAAQBAJ.
[Bha97] R. Bhatia. Matrix analysis. Springer-Verlag, New York, 1997. doi: 10.1007/978-
1-4612-0653-8.
[Bha03] R. Bhatia. “On the exponential metric increasing property”. In: Linear Algebra and its
Applications 375 (2003), pages 211–220. doi: https://doi.org/10.1016/S0024-3795(03)00647-5.

[Bha07a] R. Bhatia. Perturbation bounds for matrix eigenvalues. Reprint of the 1987 original.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 2007.
doi: 10.1137/1.9780898719079.
[Bha07b] R. Bhatia. Positive definite matrices. Princeton University Press, Princeton, NJ, 2007.
[BH06] R. Bhatia and J. Holbrook. “Riemannian geometry and matrix geometric means”.
In: Linear Algebra and its Applications 413.2 (2006). Special Issue on the 11th
Conference of the International Linear Algebra Society, Coimbra, 2004, pages 594–
618. doi: https://doi.org/10.1016/j.laa.2005.08.025.
[BJL19] R. Bhatia, T. Jain, and Y. Lim. “On the Bures-Wasserstein distance between positive
definite matrices”. In: Expo. Math. 37.2 (2019), pages 165–191. doi: 10.1016/j.
exmath.2018.01.002.
[BL06] Y. Bilu and N. Linial. “Lifts, Discrepancy and Nearly Optimal Spectral Gap”. In:
Combinatorica 26.5 (2006), pages 495–519. doi: 10.1007/s00493-006-0029-7.
[Bir46] G. Birkhoff. “Three observations on linear algebra”. In: Univ. Nac. Tacuman, Rev.
Ser. A 5 (1946), pages 147–151.
[Boc33] S. Bochner. “Monotone Funktionen, Stieltjessche Integrale und harmonische
Analyse”. In: Math. Ann. 108.1 (1933), pages 378–410. doi: 10.1007/BF01452844.
[BB08] J. Borcea and P. Brändén. “Applications of stable polynomials to mixed determi-
nants: Johnson’s conjectures, unimodality, and symmetrized Fischer products”. In:
Duke Mathematical Journal 143.2 (2008), pages 205 –223. doi: 10.1215/00127094-
2008-018.
[Bou93] P. Bougerol. “Kalman Filtering with Random Coefficients and Contractions”. In:
SIAM Journal on Control and Optimization 31.4 (1993), pages 942–959. eprint:
https://doi.org/10.1137/0331041. doi: 10.1137/0331041.
[BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press,
Cambridge, 2004. doi: 10.1017/CBO9780511804441.
[BHB16] R. Brault, M. Heinonen, and F. Buc. “Random fourier features for operator-valued
kernels”. In: Asian Conference on Machine Learning. PMLR. 2016, pages 110–125.
[Bra+11] M. Braverman et al. “The Grothendieck Constant Is Strictly Smaller than Krivine’s
Bound”. In: 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science.
Oct. 2011, pages 453–462. doi: 10.1109/FOCS.2011.77.
[BRS17] J. Briët, O. Regev, and R. Saket. “Tight Hardness of the Non-Commutative
Grothendieck Problem”. In: Theory of Computing 13.15 (Dec. 2017), pages 1–
24. doi: 10.4086/toc.2017.v013a015.
[CDV07] A. Caponnetto and E. De Vito. “Optimal rates for the regularized least-squares
algorithm”. In: Foundations of Computational Mathematics 7.3 (2007), pages 331–
368.
[Cap+08] A. Caponnetto et al. “Universal multi-task kernels”. In: The Journal of Machine
Learning Research 9 (2008), pages 1615–1646.
[Car10] E. Carlen. “Trace inequalities and quantum entropy: an introductory course”.
In: Entropy and the quantum. Volume 529. Contemp. Math. Amer. Math. Soc.,
Providence, RI, 2010, pages 73–140. doi: 10.1090/conm/529/10428.
[Car+10] C. Carmeli et al. “Vector valued reproducing kernel Hilbert spaces and universality”.
In: Analysis and Applications 8.01 (2010), pages 19–61.
[Car97] R. Caruana. “Multitask learning”. In: Machine learning 28.1 (1997), pages 41–75.
[Cha13] D. Chafaï. A probabilistic proof of the Schoenberg theorem. 2013. url: https://djalil.chafai.net/blog/2013/02/09/a-probabilistic-proof-of-the-schoenberg-theorem/.
310

[CB21] C.-F. Chen and F. G. S. L. Brandão. “Concentration for Trotter Error”. Available
at https://arXiv.org/abs/2111.05324. Nov. 2021. arXiv: 2111.05324 [math-ph, physics:quant-ph].
[Cho74] M.-D. Choi. “A Schwarz inequality for positive linear maps on 𝐶 ∗ -algebras”. In:
Illinois Journal of Mathematics 18.4 (1974), pages 565 –574. doi: 10.1215/ijm/
1256051007.
[CS07] M. Chudnovsky and P. Seymour. “The roots of the independence polynomial
of a clawfree graph”. In: Journal of Combinatorial Theory, Series B 97.3 (2007),
pages 350–357. doi: https://doi.org/10.1016/j.jctb.2006.06.001.
[Chu97] F. R. K. Chung. Spectral graph theory. American Mathematical Society, 1997.
[Col+21] S. Cole et al. “Quantum optimal transport”. Available at https://arXiv.org/abs/2105.06922. 2021.
[CP77] “CHAPTER 6. Calculus in Banach Spaces”. In: Functional Analysis in Modern Applied
Mathematics. Volume 132. Mathematics in Science and Engineering. Elsevier, 1977,
pages 87–105. doi: https://doi.org/10.1016/S0076-5392(08)61248-5.
[Dav57] C. Davis. “A Schwarz inequality for convex operator functions”. In: Proc. Amer.
Math. Soc. 8 (1957), pages 42–44. doi: 10.2307/2032808.
[DK70] C. Davis and W. M. Kahan. “The Rotation of Eigenvectors by a Perturbation. III”.
In: SIAM Journal on Numerical Analysis 7.1 (1970), pages 1–46. eprint: https://doi.org/10.1137/0707001. doi: 10.1137/0707001.
[DKW82] C. Davis, W. M. Kahan, and H. F. Weinberger. “Norm-preserving dilations and
their applications to optimal error bounds”. In: SIAM J. Numer. Anal. 19.3 (1982),
pages 445–469. doi: 10.1137/0719029.
[Ded92] J. P. Dedieu. “Obreschkoff ’s theorem revisited: what convex sets are contained in the
set of hyperbolic polynomials?” In: Journal of Pure and Applied Algebra 81.3 (1992),
pages 269–278. doi: https://doi.org/10.1016/0022-4049(92)90060-S.
[DPZ20] P. B. Denton, S. J. Parke, and X. Zhang. “Neutrino oscillations in matter via
eigenvalues”. In: Phys. Rev. D 101 (2020), page 093001. doi: 10.1103/PhysRevD.
101.093001.
[Den+22] P. B. Denton et al. “Eigenvectors from eigenvalues: a survey of a basic identity in
linear algebra”. In: Bull. Amer. Math. Soc. (N.S.) 59.1 (2022), pages 31–58. doi:
10.1090/bull/1722.
[DT07] I. S. Dhillon and J. A. Tropp. “Matrix nearness problems with Bregman divergences”.
In: SIAM J. Matrix Anal. Appl. 29.4 (2007), pages 1120–1146. doi: 10 . 1137 /
060649021.
[Din+06] C. Ding et al. “R1 -PCA: Rotational Invariant L1 -Norm Principal Component Analysis
for Robust Subspace Factorization”. In: Proceedings of the 23rd International
Conference on Machine Learning. ICML ’06. Association for Computing Machinery,
June 2006, pages 281–288. doi: 10.1145/1143844.1143880.
[ENG11] A. Ebadian, I. Nikoufar, and M. E. Gordji. “Perspectives of matrix convex functions”.
In: Proceedings of the National Academy of Sciences 108.18 (2011), pages 7313–7314.
doi: 10.1073/pnas.1102518108.
[EY39] C. Eckart and G. Young. “A principal axis transformation for non-hermitian
matrices”. In: Bull. Amer. Math. Soc. 45.2 (1939), pages 118–121. doi: 10.1090/
S0002-9904-1939-06910-3.
[Eff09] E. G. Effros. “A matrix convexity approach to some celebrated quantum inequali-
ties”. In: Proceedings of the National Academy of Sciences 106.4 (2009), pages 1006–
1008. doi: 10.1073/pnas.0807965106.
[Eva10] L. C. Evans. Partial differential equations. American Mathematical Soc., 2010.

[Evg+05] T. Evgeniou et al. “Learning multiple tasks with kernel methods.” In: Journal of
machine learning research 6.4 (2005).
[Fel80] H. J. Fell. “On the zeros of convex combinations of polynomials.” In: Pacific Journal
of Mathematics 89.1 (1980), pages 43 –50. doi: pjm/1102779366.
[Går51] L. Gårding. “Linear hyperbolic partial differential equations with constant co-
efficients”. In: Acta Mathematica 85.none (1951), pages 1 –62. doi: 10 . 1007 /
BF02395740.
[Gar+20] A. Garg et al. “Operator scaling: theory and applications”. In: Found. Comput. Math.
20.2 (2020), pages 223–290. doi: 10.1007/s10208-019-09417-z.
[Gar07] D. J. H. Garling. Inequalities: a journey into linear analysis. Cambridge University
Press, Cambridge, 2007. doi: 10.1017/CBO9780511755217.
[GG78] C. Godsil and I. Gutman. “On the matching polynomial of a graph”. In: Algebraic
Methods in Graph Theory 25 (Jan. 1978).
[GR01] C. Godsil and G. Royle. Algebraic graph theory. Springer-Verlag, New York, 2001.
doi: 10.1007/978-1-4613-0163-9.
[GW95] M. X. Goemans and D. P. Williamson. “Improved approximation algorithms for
maximum cut and satisfiability problems using semidefinite programming”. In:
J. Assoc. Comput. Mach. 42.6 (1995), pages 1115–1145. doi: 10.1145/227683.
227684.
[GM10] G. H. Golub and G. Meurant. Matrices, moments and quadrature with applications.
Princeton University Press, Princeton, NJ, 2010.
[GVL13] G. H. Golub and C. F. Van Loan. Matrix computations. Fourth. Johns Hopkins
University Press, Baltimore, MD, 2013.
[GR07] I. S. Gradshteyn and I. M. Ryzhik. Table of integrals, series, and products. Seventh.
Translated from the Russian, Translation edited and with a preface by Alan
Jeffrey and Daniel Zwillinger, With one CD-ROM (Windows, Macintosh and UNIX).
Elsevier/Academic Press, Amsterdam, 2007.
[GLO20] A. Greenbaum, R.-C. Li, and M. L. Overton. “First-Order Perturbation Theory for
Eigenvalues and Eigenvectors”. In: SIAM Review 62.2 (2020), pages 463–482. doi:
10.1137/19M124784X.
[Gro53] A. Grothendieck. Résumé de La Théorie Métrique Des Produits Tensoriels Topologiques.
Soc. de Matemática de São Paulo, 1953.
[Gu15] M. Gu. “Subspace iteration randomization and singular value problems”. In: SIAM
J. Sci. Comput. 37.3 (2015), A1139–A1173. doi: 10.1137/130938700.
[Gül97] O. Güler. “Hyperbolic Polynomials and Interior Point Methods for Convex Program-
ming”. In: Mathematics of Operations Research 22.2 (1997), pages 350–377.
[Gur04] L. Gurvits. “Classical complexity and quantum entanglement”. In: J. Comput.
System Sci. 69.3 (2004), pages 448–484. doi: 10.1016/j.jcss.2004.06.003.
[Haa85] U. Haagerup. “The Grothendieck Inequality for Bilinear Forms on C*-Algebras”.
In: Advances in Mathematics 56.2 (May 1985), pages 93–116. doi: 10.1016/0001-
8708(85)90026-X.
[HI95] U. Haagerup and T. Itoh. “Grothendieck Type Norms For Bilinear Forms On
C*-Algebras”. In: Journal of Operator Theory 34.2 (1995), pages 263–283.
[Hal74] P. R. Halmos. Finite-dimensional vector spaces. 2nd ed. Springer-Verlag, New York-
Heidelberg, 1974.
[HP82] F. Hansen and G. K. Pedersen. “Jensen’s Inequality for Operators and Löwner’s
Theorem.” In: Mathematische Annalen 258 (1982), pages 229–241.
[HP03] F. Hansen and G. K. Pedersen. “Jensen’s Operator Inequality”. In: Bulletin of the
London Mathematical Society 35.4 (2003), pages 553–564.

[HLP88] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Reprint of the 1952 edition.
Cambridge University Press, Cambridge, 1988.
[HL72] O. J. Heilmann and E. H. Lieb. “Theory of monomer-dimer systems”. In: Communi-
cations in Mathematical Physics 25.3 (1972), pages 190 –232. doi: cmp/1103857921.
[HV07] J. W. Helton and V. Vinnikov. “Linear matrix inequality representation of sets”.
In: Communications on Pure and Applied Mathematics 60.5 (2007), pages 654–674.
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpa.20155. doi: https://doi.org/10.1002/cpa.20155.
[Her63] C. S. Herz. “Fonctions opérant sur les fonctions définies-positives”. In: Ann. Inst.
Fourier (Grenoble) 13 (1963), pages 161–180. url: http://aif.cedram.org/item?id=AIF_1963__13__161_0.
[Hia10] F. Hiai. “Matrix analysis: matrix monotone functions, matrix means, and majoriza-
tion”. In: Interdiscip. Inform. Sci. 16.2 (2010), pages 139–248. doi: 10.4036/iis.
2010.139.
[HK99] F. Hiai and H. Kosaki. “Means for matrices and comparison of their norms”. In:
Indiana Univ. Math. J. 48.3 (1999), pages 899–936. doi: 10.1512/iumj.1999.
48.1665.
[HP91] F. Hiai and D. Petz. “The proper formula for relative entropy and its asymptotics
in quantum probability”. In: Communications in Mathematical Physics 143 (1991),
pages 99–114.
[HP14] F. Hiai and D. Petz. Introduction to matrix analysis and applications. Springer, Cham;
Hindustan Book Agency, New Delhi, 2014. doi: 10.1007/978-3-319-04150-6.
[Hig89] N. J. Higham. “Matrix nearness problems and applications”. In: Applications of
matrix theory (Bradford, 1988). Volume 22. Inst. Math. Appl. Conf. Ser. New Ser.
Oxford Univ. Press, New York, 1989, pages 1–27. doi: 10.1093/imamat/22.1.1.
[Hig08] N. J. Higham. Functions of matrices. Theory and computation. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, PA, 2008. doi: 10.1137/1.9780898717778.
[HW53] A. J. Hoffman and H. W. Wielandt. “The variation of the spectrum of a normal
matrix”. In: Duke J. Math. 20 (1953), pages 37–39.
[HLW06] S. Hoory, N. Linial, and A. Wigderson. “Expander graphs and their applications”.
In: Bulletin of the American Mathematical Society 43.4 (2006), pages 439–561.
[Hör94] L. Hörmander. Notions of convexity. Birkhäuser Boston, Inc., 1994.
[Hor69] R. A. Horn. “The theory of infinitely divisible matrices and kernels”. In: Trans.
Amer. Math. Soc. 136 (1969), pages 269–286. doi: 10.2307/1994714.
[HJ94] R. A. Horn and C. R. Johnson. Topics in matrix analysis. Corrected reprint of the
1991 original. Cambridge University Press, Cambridge, 1994.
[HJ13] R. A. Horn and C. R. Johnson. Matrix analysis. Second. Cambridge University
Press, Cambridge, 2013.
[Hua19] D. Huang. “Improvement on a Generalized Lieb’s Concavity Theorem”. Available
at https://arXiv.org/abs/1902.02194. 2019.
[Hua+21] D. Huang et al. “Matrix Concentration for Products”. In: Foundations of Computa-
tional Mathematics (2021), pages 1–33.
[Hur10] G. H. Hurlbert. Linear optimization. The simplex workbook. Springer, New York,
2010. doi: 10.1007/978-0-387-79148-7.
[Jor75] C. Jordan. “Essai sur la géométrie à 𝑛 dimensions”. In: Bulletin de la Société
mathématique de France 3 (1875), pages 103–174.
[Kad+16] H. Kadri et al. “Operator-valued kernels for learning from functional response
data”. In: Journal of Machine Learning Research 17.20 (2016), pages 1–54.

[Kai83] S. Kaijser. “A Simple-Minded Proof of the Pisier-Grothendieck Inequality”. In:


Banach Spaces, Harmonic Analysis, and Probability Theory. Springer, 1983, pages 33–
55.
[KSH00] T. Kailath, A. H. Sayed, and B. Hassibi. Linear estimation. Prentice Hall, 2000.
[Kal60] R. E. Kalman. “A New Approach to Linear Filtering and Prediction Problems”.
In: Journal of Basic Engineering 82.1 (Mar. 1960), pages 35–45. eprint: https://asmedigitalcollection.asme.org/fluidsengineering/article-pdf/82/1/35/5518977/35_1.pdf. doi: 10.1115/1.3662552.
[Kat95] T. Kato. Perturbation theory for linear operators. Reprint of the 1980 edition.
Springer-Verlag, Berlin, 1995.
[Kha96] H. Khalil. “Nonlinear Systems, Printice-Hall”. In: Upper Saddle River, NJ 3 (1996).
[Kos98] H. Kosaki. “Arithmetic–geometric mean and related inequalities for operators”. In:
Journal of Functional Analysis 156.2 (1998), pages 429–451.
[Kow19] E. Kowalski. An introduction to expander graphs. Société mathématique de France,
2019.
[Kra36] F. Kraus. “Über konvexe Matrixfunktionen.” ger. In: Mathematische Zeitschrift 41
(1936), pages 18–42. url: http://eudml.org/doc/168648.
[Kri78] J.-L. Krivine. “Constantes de Grothendieck et Fonctions de Type Positif Sur Les
Spheres”. In: Séminaire d’Analyse fonctionnelle (dit" Maurey-Schwartz") (1978),
pages 1–17.
[Kru37] R. Kruithof. “Telefoonverkeersrekening”. In: De Ingenieur 52 (1937), pp. E15–E25.
[KA79] F. Kubo and T. Ando. “Means of positive linear operators”. In: Math. Ann. 246.3
(1979/80), pages 205–224. doi: 10.1007/BF01371042.
[Kwa08] N. Kwak. “Principal Component Analysis Based on L1-Norm Maximization”. In:
IEEE Transactions on Pattern Analysis and Machine Intelligence 30.9 (Sept. 2008),
pages 1672–1680. doi: 10.1109/TPAMI.2008.114.
[Lan70] P. Lancaster. “Explicit Solutions of Linear Matrix Equations”. In: SIAM Review 12.4
(1970), pages 544–566. url: http://www.jstor.org/stable/2028490.
[LL07] J. Lawson and Y. Lim. “A Birkhoff Contraction Formula with Applications to
Riccati Equations”. In: SIAM Journal on Control and Optimization 46.3 (2007),
pages 930–951. eprint: https://doi.org/10.1137/050637637. doi: 10.1137/050637637.
[Lax02] P. D. Lax. Functional analysis. Wiley-Interscience, 2002.
[Lax07] P. D. Lax. Linear algebra and its applications. second. Wiley-Interscience [John
Wiley & Sons], Hoboken, NJ, 2007.
[LL08] H. Lee and Y. Lim. “Invariant metrics, contractions and nonlinear matrix equations”.
In: Nonlinearity 21.4 (2008), pages 857–878. doi: 10.1088/0951-7715/21/4/
011.
[Lew96] A. S. Lewis. “Derivatives of spectral functions”. In: Math. Oper. Res. 21.3 (1996),
pages 576–588. doi: 10.1287/moor.21.3.576.
[LPR05] A. Lewis, P. Parrilo, and M. Ramana. “The Lax conjecture is true”. In: Proceedings
of the American Mathematical Society 133.9 (2005), pages 2495–2499.
[LM99] C.-K. Li and R. Mathias. “The Lidskii-Mirsky-Wielandt theorem–additive and
multiplicative versions”. In: Numerische Mathematik 81.3 (1999), pages 377–413.
[Lie73a] E. H. Lieb. “Convex trace functions and the Wigner-Yanase-Dyson conjecture”. In:
Advances in Math. 11 (1973), pages 267–288. doi: 10.1016/0001-8708(73)90011-
X.

[Lie73b] E. H. Lieb. “Convex trace functions and the Wigner-Yanase-Dyson conjecture”. In:
Advances in Mathematics 11.3 (1973), pages 267–288. doi: https://doi.org/10.1016/0001-8708(73)90011-X.
[LR73] E. H. Lieb and M. B. Ruskai. “Proof of the strong subadditivity of quantum-
mechanical entropy”. In: J. Mathematical Phys. 14 (1973). With an appendix by B.
Simon, pages 1938–1941. doi: 10.1063/1.1666274.
[LS13] J. Liesen and Z. Strakoš. Krylov subspace methods. Principles and analysis. Oxford
University Press, Oxford, 2013.
[Lin73] G. Lindblad. “Entropy, information and quantum measurements”. In: Communica-
tions in Mathematical Physics 33 (1973), pages 305–322.
[Lin63] J. Lindenstrauss. “On the Modulus of Smoothness and Divergent Series in Banach
Spaces.” In: Michigan Mathematical Journal 10.3 (1963), pages 241–252.
[LW94] C. Liverani and M. P. Wojtkowski. “Generalization of the Hilbert metric to the
space of positive definite matrices.” In: Pacific Journal of Mathematics 166.2 (1994),
pages 339 –355. doi: pjm/1102621142.
[Lov79] L. Lovász. “On the Shannon capacity of a graph”. In: IEEE Trans. Inform. Theory
25.1 (1979), pages 1–7.
[Löw34] K. Löwner. “Über monotone Matrixfunktionen”. In: Mathematische Zeitschrift 38
(1934), pages 177–216. url: http://eudml.org/doc/168495.
[Luo+10] Z. Q. Luo et al. “Semidefinite relaxation of quadratic optimization problems”. In:
IEEE Signal Process. Mag. 27.3 (2010), pages 20–34.
[MSS14] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Ramanujan graphs and the
solution of the Kadison-Singer problem”. In: Proceedings of the International
Congress of Mathematicians—Seoul 2014. Vol. III. Kyung Moon Sa, Seoul, 2014,
pages 363–386.
[MSS15a] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families I: Bipartite
Ramanujan graphs of all degrees”. In: Annals of Mathematics 182.1 (2015), pages 307–
325. url: http://www.jstor.org/stable/24523004.
[MSS15b] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families II: Mixed char-
acteristic polynomials and the Kadison—Singer problem”. In: Annals of Mathematics
182.1 (2015), pages 327–350. url: http://www.jstor.org/stable/24523005.
[MSS21] A. W. Marcus, D. A. Spielman, and N. Srivastava. “Interlacing families III: Sharper
restricted invertibility estimates”. In: Israel Journal of Mathematics (2021), pages 1–
28.
[Mar+16] D. Marquis et al. “Gershgorin’s Circle Theorem for Estimating the Eigenvalues of a
Matrix with Known Error Bounds”. 2016.
[MOA11] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and
its applications. Second. Springer, New York, 2011. doi: 10.1007/978- 0- 387-
68276-1.
[MT11] M. McCoy and J. A. Tropp. “Two Proposals for Robust PCA Using Semidefinite
Programming”. In: Electronic Journal of Statistics 5.none (Jan. 2011), pages 1123–
1160. doi: 10.1214/11-EJS636.
[MP05] C. A. Micchelli and M. Pontil. “On learning vector-valued functions”. In: Neural
computation 17.1 (2005), pages 177–204.
[MXZ06] C. A. Micchelli, Y. Xu, and H. Zhang. “Universal Kernels.” In: Journal of Machine
Learning Research 7.12 (2006).
[Min88] H. Minc. Nonnegative matrices. A Wiley-Interscience Publication. John Wiley &
Sons, Inc., New York, 1988.

[Min16] H. Q. Minh. “Operator-valued Bochner theorem, Fourier feature maps for operator-
valued kernels, and vector-valued learning”. In: arXiv preprint arXiv:1608.05639
(2016).
[Mir60] L. Mirsky. “Symmetric gauge functions and unitarily invariant norms”. In: Quart.
J. Math. Oxford Ser. (2) 11 (1960), pages 50–59. doi: 10.1093/qmath/11.1.50.
[Mir59] L. Mirsky. “On the trace of matrix products”. In: Mathematische Nachrichten 20.3-6
(1959), pages 171–174. doi: 10.1002/mana.19590200306.
[MP93] B. Mohar and S. Poljak. “Eigenvalues in Combinatorial Optimization”. In: Com-
binatorial and Graph-Theoretical Problems in Linear Algebra. Springer New York,
1993, pages 107–151.
[Nao12] A. Naor. “On the Banach-Space-Valued Azuma Inequality and Small-Set Isoperime-
try of Alon–Roichman Graphs”. In: Combinatorics, Probability and Computing 21.4
(July 2012), pages 623–634. doi: 10.1017/S0963548311000757.
[NRV14] A. Naor, O. Regev, and T. Vidick. “Efficient rounding for the noncommutative
Grothendieck inequality”. In: Theory Comput. 10 (2014), pages 257–295. doi:
10.4086/toc.2014.v010a011.
[NS21] N. H. Nelsen and A. M. Stuart. “The random feature model for input-output maps
between banach spaces”. In: SIAM Journal on Scientific Computing 43.5 (2021),
A3212–A3243.
[NC00] M. A. Nielsen and I. L. Chuang. Quantum computation and quantum information.
Cambridge University Press, Cambridge, 2000.
[Nil91] A. Nilli. “On the Second Eigenvalue of a Graph”. In: Discrete Math. 91.2 (1991),
pages 207–210. doi: 10.1016/0012-365X(91)90112-F.
[ON00] T. Ogawa and H. Nagaoka. “Strong converse and Stein’s lemma in quantum
hypothesis testing”. In: IEEE Transactions on Information Theory 46.7 (2000),
pages 2428–2433. doi: 10.1109/18.887855.
[Oli10] R. I. Oliveira. “Sums of random Hermitian matrices and an inequality by Rudelson”.
In: Electronic Communications in Probability 15 (2010), pages 203–212.
[Olv+10] F. W. J. Olver et al., editors. NIST handbook of mathematical functions. With 1
CD-ROM (Windows, Macintosh and UNIX). National Institute of Standards and
Technology, 2010.
[Ost59] A. M. Ostrowski. “A quantitative formulation of Sylvester’s law of inertia”. In:
Proceedings of the National Academy of Sciences of the United States of America 45.5
(1959), page 740.
[Pak10] I. Pak. Lectures on Discrete and Polyhedral Geometry. 2010. url: https://www.math.ucla.edu/~pak/book.htm.
[Par98] B. N. Parlett. The symmetric eigenvalue problem. Corrected reprint of the 1980
original. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA,
1998. doi: 10.1137/1.9781611971163.
[Par78] S. Parrott. “On a quotient norm and the Sz.-Nagy - Foiaş lifting theorem”. In: J.
Functional Analysis 30.3 (1978), pages 311–328. doi: 10.1016/0022-1236(78)
90060-5.
[Pin10] A. Pinkus. Totally positive matrices. Cambridge University Press, Cambridge, 2010.
[Pis78] G. Pisier. “Grothendieck’s Theorem for Noncommutative C*-Algebras, with an
Appendix on Grothendieck’s Constants”. In: Journal of Functional Analysis 29.3
(Sept. 1978), pages 397–415. doi: 10.1016/0022-1236(78)90038-1.
[Pis12] G. Pisier. “Grothendieck’s Theorem, Past and Present”. In: Bulletin of the American
Mathematical Society 49.2 (Apr. 2012), pages 237–323. doi: 10.1090/S0273-0979-
2011-01348-9.

[PX97] G. Pisier and Q. Xu. “Non-commutative martingale inequalities”. In: Comm. Math.
Phys. 189.3 (1997), pages 667–698. doi: 10.1007/s002200050224.
[RS09] P. Raghavendra and D. Steurer. “Towards Computing the Grothendieck Constant”.
In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Society for Industrial and Applied Mathematics, Jan. 2009, pages 525–534. doi:
10.1137/1.9781611973068.58.
[RR08] A. Rahimi and B. Recht. “Random Features for Large-Scale Kernel Machines”.
In: Advances in Neural Information Processing Systems 20. Curran Associates, Inc.,
2008, pages 1177–1184. url: http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.
[Ren06] J. Renegar. “Hyperbolic Programs, and Their Derivative Relaxations”. In: Found.
Comput. Math. 6.1 (2006), pages 59–79. doi: 10.1007/s10208-004-0136-z.
[RX16] É. Ricard and Q. Xu. “A Noncommutative Martingale Convexity Inequality”. In: The
Annals of Probability 44.2 (Mar. 2016), pages 867–882. doi: 10.1214/14-AOP990.
[Ric24] J. Riccati. “Animadversiones in aequationes differentiales secundi gradus”. In:
Actorum Eruditorum quae Lipsiae publicantur Supplementa. Actorum Eruditorum
quae Lipsiae publicantur Supplementa v. 8. prostant apud Joh. Grossii haeredes
& J.F. Gleditschium, 1724. url: https://books.google.com/books?id=UjTw1w7tZsEC.
[Rud59] W. Rudin. “Positive definite sequences and absolutely monotonic functions”. In:
Duke Math. J. 26 (1959), pages 617–622. url: http://projecteuclid.org/euclid.dmj/1077468771.
[Rud90] W. Rudin. Fourier analysis on groups. Reprint of the 1962 original, A Wiley-
Interscience Publication. John Wiley & Sons, Inc., New York, 1990. doi: 10.1002/
9781118165621.
[Rud91] W. Rudin. Functional analysis. Second. McGraw-Hill, Inc., New York, 1991.
[Saa11a] Y. Saad. Numerical Methods for Large Eigenvalue Problems. 2nd edition. Society for
Industrial and Applied Mathematics, 2011. doi: 10.1137/1.9781611970739.
[Saa11b] Y. Saad. Numerical methods for large eigenvalue problems. Revised edition of the
1992 original [ 1177405]. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2011. doi: 10.1137/1.9781611970739.ch1.
[SSB18] B. Schlkopf, A. J. Smola, and F. Bach. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.
[Sch14] R. Schneider. Convex bodies: the Brunn–Minkowski theory. 151. Cambridge university
press, 2014. doi: 10.1017/CBO9781139003858.
[Sch38] I. J. Schoenberg. “Metric spaces and positive definite functions”. In: Trans. Amer.
Math. Soc. 44.3 (1938), pages 522–536. doi: 10.2307/1989894.
[SS17] L. J. Schulman and A. Sinclair. “Analysis of a Classical Matrix Preconditioning
Algorithm”. In: J. Assoc. Comput. Mach. (JACM) 64.2 (2017), 9:1–9:23.
[Sch64] L. Schwartz. “Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux
associés (noyaux reproduisants)”. In: Journal d’analyse mathématique 13.1 (1964),
pages 115–256.
[Sen81] E. Seneta. Nonnegative matrices and Markov chains. Second. Springer-Verlag, New
York, 1981. doi: 10.1007/0-387-32792-4.
[Ser09] D. Serre. “Weyl and Lidskiı̆ inequalities for general hyperbolic polynomials”. In:
Chinese Annals of Mathematics, Series B 30.6 (2009), pages 785–802.
[Sha48] C. E. Shannon. “A mathematical theory of communication”. In: The Bell System
Technical Journal 27.3 (1948), pages 379–423. doi: 10.1002/j.1538-7305.1948.
tb01338.x.

[SH19] Y. Shenfeld and R. van Handel. “Mixed volumes and the Bochner method”. In: Proc.
Amer. Math. Soc. 147.12 (2019), pages 5385–5402. doi: 10.1090/proc/14651.
[Sim11] B. Simon. Convexity. An analytic viewpoint. Cambridge University Press, Cambridge,
2011. doi: 10.1017/CBO9780511910135.
[Sim15] B. Simon. Real analysis. With a 68 page companion booklet. American Mathematical
Society, Providence, RI, 2015. doi: 10.1090/simon/001.
[Sim19] B. Simon. Loewner’s theorem on monotone matrix functions. Springer, Cham, 2019.
doi: 10.1007/978-3-030-22422-6.
[Sin64] R. Sinkhorn. “A relationship between arbitrary positive matrices and doubly
stochastic matrices”. In: Ann. Math. Statist. 35 (1964), pages 876–879. doi: 10.
1214/aoms/1177703591.
[SMI92] V. I. SMIRNOV. “Biography of A. M. Lyapunov”. In: International Journal of
Control 55.3 (1992), pages 775–784. eprint: https://doi.org/10.1080/00207179208934258. doi: 10.1080/00207179208934258.
[So09] A. M. So. “Improved Approximation Bound for Quadratic Optimization Problems
with Orthogonality Constraints”. In: Proceedings of the Twentieth Annual ACM-SIAM
Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics,
Jan. 2009, pages 1201–1209. doi: 10.1137/1.9781611973068.130.
[Spi18a] D. Spielman. “Yale CPSC 662/AMTH 561 : Spectral Graph Theory”. Available at
http://www.cs.yale.edu/homes/spielman/561/561schedule.html. 2018.
[Spi18b] D. Spielman. Bipartite Ramanujan Graphs. 2018. url: http://www.cs.yale.edu/homes/spielman/561/lect25-18.pdf.
[Spi19] D. Spielman. Spectral and Algebraic Graph Theory. 2019. url: http://cs-www.cs.yale.edu/homes/spielman/sagt/sagt.pdf.
[SS11] D. A. Spielman and N. Srivastava. “Graph Sparsification by Effective Resistances”.
In: SIAM Journal on Computing 40.6 (2011), pages 1913–1926. eprint: https://doi.org/10.1137/080734029. doi: 10.1137/080734029.
[SH15] S. Sra and R. Hosseini. “Conic geometric optimization on the manifold of positive
definite matrices”. In: SIAM J. Optim. 25.1 (2015), pages 713–739. doi: 10.1137/
140978168.
[Ste56] E. M. Stein. “Interpolation of linear operators”. In: Transactions of the American
Mathematical Society 83.2 (1956), pages 482–492.
[SS90] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. 1st Edition. Academic
Press, 1990.
[SBT17] D. Sutter, M. Berta, and M. Tomamichel. “Multivariate trace inequalities”. In: Comm.
Math. Phys. 352.1 (2017), pages 37–58. doi: 10.1007/s00220-016-2778-5.
[Syl84] J. J. Sylvester. “Sur l’équation en matrices px= xq”. In: CR Acad. Sci. Paris 99.2
(1884), pages 67–71.
[Syl52] J. Sylvester. “A demonstration of the theorem that every homogeneous quadratic
polynomial is reducible by real orthogonal substitutions to the form of a sum of
positive and negative squares”. In: The London, Edinburgh, and Dublin Philosophical
Magazine and Journal of Science 4.23 (1852), pages 138–142.
[Tom74] N. Tomczak-Jaegermann. “The Moduli of Smoothness and Convexity and the
Rademacher Averages of the Trace Classes 𝑆𝑝 (1 ≤ 𝑝 < ∞)”. In: Studia Mathe-
matica 50.2 (1974), pages 163–182.
[Tro09] J. A. Tropp. “Column subset selection, matrix factorization, and eigenvalue op-
timization”. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on
Discrete Algorithms. SIAM, Philadelphia, PA, 2009, pages 978–986.

[Tro11] J. A. Tropp. “User-Friendly Tail Bounds for Sums of Random Matrices”. In: Founda-
tions of Computational Mathematics 12.4 (2011), pages 389–434. doi: 10.1007/
s10208-011-9099-z.
[Tro12] J. A. Tropp. “From joint convexity of quantum relative entropy to a concavity
theorem of Lieb”. In: Proc. Amer. Math. Soc. 140.5 (2012), pages 1757–1760. doi:
10.1090/S0002-9939-2011-11141-9.
[Tro15] J. A. Tropp. “An Introduction to Matrix Concentration Inequalities”. In: Foundations
and Trends in Machine Learning 8.1-2 (2015), pages 1–230.
[Tro18] J. A. Tropp. “Second-order matrix concentration inequalities”. In: Appl. Comput.
Harmon. Anal. 44.3 (2018), pages 700–736. doi: 10.1016/j.acha.2016.07.005.
[van14] R. van Handel. Probability in High Dimension. Technical report. Princeton University,
June 2014. doi: 10.21236/ADA623999.
[VB96] L. Vandenberghe and S. Boyd. “Semidefinite programming”. In: SIAM Rev. 38.1
(1996), pages 49–95. doi: 10.1137/1038003.
[Vas79] H. Vasudeva. “Positive definite matrices and absolutely monotonic functions”. In:
Indian J. Pure Appl. Math. 10.7 (1979), pages 854–858.
[Ver18] R. Vershynin. High-dimensional probability. An introduction with applications in
data science, With a foreword by Sara van de Geer. Cambridge University Press,
Cambridge, 2018. doi: 10.1017/9781108231596.
[Wat18] J. Watrous. The theory of quantum information. Cambridge University Press, 2018.
[Wid41] D. V. Widder. The Laplace Transform. Princeton University Press, Princeton, N. J.,
1941.
[Wol19] N. Wolchover. “Neutrinos lead to unexpected discovery in basic math”. In: Quanta
Magazine (2019). url: https://www.quantamagazine.org/neutrinos-lead-to-unexpected-discovery-in-basic-math-20191113.
[Zha05] F. Zhang, editor. The Schur complement and its applications. Springer-Verlag, New
York, 2005. doi: 10.1007/b105056.
