
Advanced Topics in Quantum Information Theory


Lecture notes for CS 798/QIC 890 Spring 2020

John Watrous
School of Computer Science and Institute for Quantum Computing
University of Waterloo

November 8, 2020
Lectures

1 Conic Programming 1
1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Feasible solutions, optimal values, and weak duality . . . . . . . . . . 4
1.4 Minimization and maximization . . . . . . . . . . . . . . . . . . . . . 4
1.5 Slater’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Example: conic program for optimal measurements . . . . . . . . . . 8

2 Max-relative entropy and conditional min-entropy 13


2.1 Quantum max-relative entropy . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Conditional min-entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Smoothing and optimizing max-relative entropy 25


3.1 Smoothed max-relative entropy . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Minimizing over a convex set of models . . . . . . . . . . . . . . . . . 29
3.3 Minimizing over both states and models . . . . . . . . . . . . . . . . . 31

4 Regularization of the smoothed max-relative entropy 33


4.1 Strong typicality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Main theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5 Min-relative entropy, conditional max-entropy, and hypothesis-testing relative entropy 45
5.1 Min-relative entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Conditional max-entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Hypothesis-testing relative entropy . . . . . . . . . . . . . . . . . . . . 52

6 Nonlocal games and Tsirelson’s theorem 59


6.1 Nonlocal games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 XOR games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.3 Tsirelson’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

7 A semidefinite program for the entangled bias of XOR games 71


7.1 The semidefinite program . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.2 Strong parallel repetition for XOR games . . . . . . . . . . . . . . . . 76

8 The hierarchy of Navascues, Pironio, and Acin 83


8.1 Representing and comparing strategies . . . . . . . . . . . . . . . . . 83
8.2 Commuting measurement strategies . . . . . . . . . . . . . . . . . . . 86
8.3 The NPA hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.4 Convergence of the NPA hierarchy . . . . . . . . . . . . . . . . . . . . 93
Lecture 1

Conic Programming

The first topic we will discuss in the course is conic programming, which is a valu-
able tool for the study of quantum information. In particular, semidefinite programs,
which are a specific type of conic program, have proven to be particularly useful
in the theory of quantum information and computation—and you may already be
familiar with some of their applications. There is, however, much to be gained in
considering conic programs in greater generality. A few points in support of this
claim follow.

1. Conic programming offers a formulation through which fundamental concepts in convex analysis may be conveniently expressed and analyzed. These fundamental concepts can offer valuable insights into semidefinite programs that might otherwise be easily obscured by technical details.

2. Certain key properties of semidefinite programs are, in fact, possessed by conic programs defined over a wide variety of convex cones. A noteworthy example is that Slater's theorem, which provides a simple-to-check condition for the critically important property of strong duality for many semidefinite programs, generalizes to conic programs.

3. While not all properties of semidefinite programs generalize to conic programs, an understanding of conic programming serves to illuminate the specific attributes of the cone of positive semidefinite operators that have allowed for these properties to hold in the semidefinite programming case.

4. The generality offered by conic programming will be of use in this course.

One must appreciate that semidefinite programs do have a very special prop-
erty that contributes enormously to their utility, which is that one can generally
solve a given semidefinite program with reasonable efficiency and precision us-
ing a (classical!) computer. Conic programs, in contrast, are in general hard to solve with a computer. For example, maximizing a linear function over the cone SepD(Cn : Cn) of bipartite separable density operators with local dimension n is an NP-hard optimization problem, even to approximate with a modest degree of precision.
Having sufficient motivation (I presume) for a study of conic programming, we
will now proceed to such a study.

1.1 Preliminaries
This preliminary subsection defines various notions and discusses a few known facts
(without proofs) that will be needed for a proper treatment of conic programming,
mostly relating to convex analysis.
Let V be a finite-dimensional real inner product space, with the inner product
of any two vectors u, v ∈ V being denoted ⟨u, v⟩. Note that this inner product is
necessarily symmetric in its arguments, given that V is a vector space over the real
numbers R, which will be our ground field throughout this entire discussion of
conic programming.
A subset C ⊆ V is convex if, for all u, v ∈ C and λ ∈ [0, 1], one has

λu + (1 − λ)v ∈ C. (1.1)

A subset K ⊆ V is a cone if, for all u ∈ K and λ ≥ 0, one has λu ∈ K. We will be


principally concerned with subsets having both properties simultaneously, which
are aptly named convex cones. The letters K and L are typical names for convex
cones, and we will often make the additional assumption that these cones are closed
when we are discussing conic programs.
The Cartesian product of any two convex sets is convex. Explicitly, if C and D
are convex, and (v0 , w0 ), (v1 , w1 ) ∈ C × D and λ ∈ [0, 1], then

λ(v0 , w0 ) + (1 − λ)(v1 , w1 ) = (λv0 + (1 − λ)v1 , λw0 + (1 − λ)w1 ) ∈ C × D. (1.2)

Similarly, the Cartesian product of any two cones is a cone: if K and L are cones,
and (u, v) ∈ K × L and λ ≥ 0, then

λ(u, v) = (λu, λv) ∈ K × L. (1.3)

The notion of separating hyperplane is fundamental within convex analysis. Here


is one form of the separating hyperplane theorem, which establishes that for any two
disjoint, nonempty, convex sets, there is a hyperplane that separates the two con-
vex sets, with one lying within one of the two closed half-spaces defined by the
hyperplane and the second set lying within the opposite closed half-space.

Theorem 1.1 (Separating hyperplane theorem). Let V be a finite-dimensional real
inner-product space and let C and D be nonempty, disjoint, convex subsets of V. There
exists a nonzero vector w ∈ V and a real number γ ∈ R such that

⟨w, u⟩ ≤ γ ≤ ⟨w, v⟩ (1.4)

for every u ∈ C and v ∈ D. If either of C or D is a cone, then there must exist a nonzero
vector w ∈ V as above for which the inequality (1.4) is true, for all u ∈ C and v ∈ D,
when γ = 0.

Given any set A ⊆ V, one defines the dual cone to A as

A∗ = {v ∈ V : ⟨u, v⟩ ≥ 0 for all u ∈ A}.   (1.5)

The fact that A∗ is indeed a cone, and is also closed and convex, irrespective of the
choice of the set A, can be verified. For any two cones K, L ⊆ V, it is the case that

(K × L ) ∗ = K∗ × L∗ . (1.6)

Finally, if K is a closed, convex cone, then K∗∗ = K.

1.2 Definitions
Now suppose that a finite-dimensional real inner-product space V and a closed,
convex cone K ⊆ V have been fixed. In addition, let W be a finite-dimensional real
inner product space, let φ : V → W be a linear map, and let a ∈ V and b ∈ W be
vectors. These choices of objects define a conic program, with which the following
optimization problem is associated.

Optimization Problem 1.2 (Standard Conic Program)

Primal problem
  maximize:   ⟨a, x⟩
  subject to: φ(x) = b
              x ∈ K

Dual problem
  minimize:   ⟨b, y⟩
  subject to: φ∗(y) − a ∈ K∗
              y ∈ W

In the dual problem statement, φ∗ : W → V denotes the uniquely determined linear map, known as the adjoint map to φ, that satisfies ⟨w, φ(v)⟩ = ⟨φ∗(w), v⟩ for all v ∈ V and w ∈ W.
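As a concrete illustration (not part of the original notes), here is a minimal Python sketch of Optimization Problem 1.2 in the special case V = Herm(C^d) with the trace inner product, K the positive semidefinite cone, and φ(X) the diagonal of X, so that the conic program is an ordinary semidefinite program. The particular choices of a, b, and φ below are illustrative assumptions only.

# A minimal sketch of Optimization Problem 1.2 with K = Pos(C^d), i.e. a
# standard-form SDP.  The data a, b and the map phi are illustrative choices.
import numpy as np
import cvxpy as cp

d = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d)); A = (A + A.T) / 2     # a in V = Herm(C^d) (real symmetric here)
b = np.ones(d)                                         # b in W = R^d

# phi(X) = (X_11, ..., X_dd); its adjoint is phi*(y) = Diag(y).
X = cp.Variable((d, d), symmetric=True)
primal = cp.Problem(cp.Maximize(cp.trace(A @ X)), [cp.diag(X) == b, X >> 0])
primal.solve()

y = cp.Variable(d)
dual = cp.Problem(cp.Minimize(b @ y), [cp.diag(y) - A >> 0])
dual.solve()

print(primal.value, dual.value)   # weak duality guarantees alpha <= beta; here they coincide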

1.3 Feasible solutions, optimal values, and weak
duality
It is convenient to associate two sets of vectors with Optimization Problem 1.2:

A = {x ∈ K : φ(x) = b}    and    B = {y ∈ W : φ∗(y) − a ∈ K∗}   (1.7)

are the sets of primal feasible and dual feasible vectors for that conic program. We also
define the primal optimal and dual optimal values of this conic program as

α = sup_{x∈A} ⟨a, x⟩    and    β = inf_{y∈B} ⟨b, y⟩,   (1.8)

respectively. These values may be finite or infinite, and by convention we define α = −∞ in case A = ∅ and β = ∞ in case B = ∅.

Proposition 1.3 (Weak duality for conic programs). Let V and W be finite-dimensional
real inner product spaces, let K ⊆ V be a closed, convex cone, let φ : V → W be a linear
map, and let a ∈ V and b ∈ W be vectors. For α, β ∈ R ∪ {−∞, ∞} as defined in (1.8)
above, it is the case that α ≤ β.

Proof. If either of the sets A and B defined in (1.7) are empty, then the proposition
is vacuously true: either −∞ ≤ β or α ≤ ∞. It therefore suffices to consider the
case in which A and B are nonempty.
Suppose x ∈ A and y ∈ B are chosen arbitrarily. The set A is a subset of K, so
x ∈ K, and because y ∈ B it is the case that φ∗ (y) − a ∈ K∗ , and therefore

⟨φ∗(y) − a, x⟩ ≥ 0. (1.9)

We may therefore observe the following inequality and chain of equalities:

⟨a, x⟩ ≤ ⟨φ∗(y), x⟩ = ⟨y, φ(x)⟩ = ⟨y, b⟩ = ⟨b, y⟩. (1.10)

This inequality is maintained as one takes the supremum over all x ∈ A and infi-
mum over y ∈ B, and therefore α ≤ β, as required.

1.4 Minimization and maximization


In some situations, it may be convenient or natural to take the primal problem to
be a minimization rather than a maximization problem. The dual problem then
becomes a maximization problem, as follows.

Figure 1.1: An illustration of the point (1 + ε)x − εy, as it relates to points x and y in a convex set C.

Optimization Problem 1.4

Primal problem
  minimize:   ⟨a, x⟩
  subject to: φ(x) = b
              x ∈ K

Dual problem
  maximize:   ⟨b, y⟩
  subject to: a − φ∗(y) ∈ K∗
              y ∈ W

Notice, in particular, that the dual constraint a − φ∗ (y) ∈ K∗ has replaced the
constraint φ∗ (y) − a ∈ K∗ in Optimization Problem 1.2. The reason for this substi-
tution is that Optimization Problem 1.4 is equivalent, up to a negation of sign, to
an instance of Optimization Problem 1.2 in which a, b, and φ have been replaced
with − a, −b, and −φ, respectively.
With this equivalence in mind, we shall feel free to minimize or maximize in
the primal problem of any conic program as we see fit—naturally selecting the
corresponding dual problem formulation—but at the same time we shall lose no
generality by adopting Optimization Problem 1.2 as the standard form for conic
programs.

1.5 Slater’s theorem


Now let us state and prove Slater’s theorem. To do this, we must refer to the con-
cept of the relative interior of a set S ⊆ V. This is the set denoted relint(S) that is
obtained by taking the interior of the set S, assuming that we have restricted our
attention to the smallest affine subspace of V that contains S.
The relative interior of a convex set C may be described in the following simple
way:

relint(C) = {x ∈ C : (∀y ∈ C)(∃ε > 0) (1 + ε)x − εy ∈ C}.   (1.11)

Figure 1.1 illustrates how the point (1 + ε) x − εy relates to x and y.

Theorem 1.5 (Slater’s theorem for conic programs). Let V and W be finite-dimensional
real inner-product spaces, let K ⊆ V be a closed, convex cone, let φ : V → W be a linear
map, and let a ∈ V and b ∈ W be vectors. With respect to the notations A, B, α, and β
defined in Subsection 1.3, the following two statements are true.
1. If B is nonempty and there exists x ∈ relint(K) such that φ(x) = b, then there must exist y ∈ B such that ⟨b, y⟩ = α.
2. If A is nonempty and there exists y ∈ W for which φ∗(y) − a ∈ relint(K∗), then there must exist x ∈ A such that ⟨a, x⟩ = β.
In both statements, the stated conclusion implies the equality α = β.
Proof. We will prove just the first statement—the second statement can be proved
through a similar technique, or one may conclude that the second statement is true
given the first by formulating the dual problem of Optimization Problem 1.2 as the
primal problem of a (different but equivalent) conic program. We will also make
the simplifying assumptions that V = span(K) and W = im(φ), both of which
cause no loss of generality. Note that α and β are both necessarily finite, as the
assumptions of the first statement imply that A and B are nonempty.
Define two subsets of W ⊕ V ⊕ R as follows:
C = {(b − φ(x), z, ⟨a, x⟩) : x, z ∈ V, x − z ∈ K},
D = {(0, 0, η) : η > α}.   (1.12)

Both of these sets are evidently convex, and they are disjoint by the definition of α,
so they are separated by a hyperplane. That is, there must exist a nonzero vector
(y, u, λ) ∈ W ⊕ V ⊕ R such that
⟨(y, u, λ), (b − φ(x), z, ⟨a, x⟩)⟩ ≤ ⟨(y, u, λ), (0, 0, η)⟩,   (1.13)
or equivalently
⟨y, b − φ(x)⟩ + ⟨u, z⟩ + λ⟨a, x⟩ ≤ λη,   (1.14)
for all x, z ∈ V for which x − z ∈ K and all η > α. Let us observe that there is no
loss of generality in assuming λ ∈ {−1, 0, 1}, as the inequality (1.14) remains true
when the vector (y, u, λ) is rescaled (i.e., multiplied by any positive real number).
We will now draw several conclusions from the fact that (1.14) holds for all
x, z ∈ V for which x − z ∈ K and all η > α.
1. The inequality (1.14) must be true when x = 0 and z = 0, and therefore
⟨y, b⟩ ≤ λη   (1.15)
for all η > α. This implies that λ ≥ 0, for otherwise the right-hand side of the inequality tends to −∞ as η becomes large while the left-hand side remains fixed. Thus, λ = −1 is impossible, so λ ∈ {0, 1}.

2. For any choice of x ∈ A and z ∈ −K, it is the case that x − z ∈ K. Substituting these vectors into the inequality (1.14) and rearranging yields
⟨u, z⟩ ≤ λ(η − ⟨a, x⟩)   (1.16)
for every η > α. We conclude that u ∈ K∗, for otherwise the left-hand side of the above inequality can be made to approach ∞ while the right-hand side remains bounded, for any fixed η > α, through an appropriate selection of z ∈ −K.
3. Assume toward contradiction that λ = 0. Fix any choice of x ∈ A ∩ relint(K), which is possible by the assumption of the statement being proved. The inequality (1.14) simplifies to
⟨u, z⟩ ≤ 0   (1.17)
for every choice of z ∈ V for which x − z ∈ K.
Setting z = x, we have that x − z ∈ K, and therefore
⟨u, x⟩ ≤ 0.   (1.18)
As x ∈ K and u ∈ K∗, we conclude that ⟨u, x⟩ = 0.


On the other hand, for an arbitrarily chosen vector v ∈ K, there must exist ε > 0 such that
x − ε(v − x) = (1 + ε)x − εv ∈ K,   (1.19)
by virtue of the fact that x is in the relative interior of K. Setting z = εv − εx, one therefore has that x − z ∈ K, and so
ε⟨u, v⟩ = ε⟨u, v⟩ − ε⟨u, x⟩ = ⟨u, z⟩ ≤ 0.   (1.20)
As it was for x, we find that ⟨u, v⟩ = 0. Seeing that this is true for all v ∈ K, and recognizing that u ∈ V = span(K), we conclude that u = 0.
But if λ = 0 and u = 0, then we may free the vector x to range over all of V and set z = x to conclude from (1.14) that
⟨y, b − φ(x)⟩ ≤ 0   (1.21)
for every x ∈ V. Bearing in mind the assumption that W = im(φ), we conclude that y = 0.
This, however, is a contradiction to the assumption that (y, u, λ) is nonzero.
One concludes that λ = 1.

The steps just described have allowed us to conclude that there exist vectors y ∈ W and u ∈ K∗ such that
⟨y, b − φ(x)⟩ + ⟨u, z⟩ ≤ η − ⟨a, x⟩   (1.22)
for all x, z ∈ V for which x − z ∈ K and all η > α. Setting z = 0 and flipping sign, we find that
⟨φ∗(y) − a, x⟩ ≥ ⟨y, b⟩ − η   (1.23)
for all x ∈ K and η > α. This implies
for all x ∈ K and η > α. This implies

φ ∗ ( y ) − a ∈ K∗ , (1.24)

for otherwise the left-hand side of the previous inequality can be made to approach −∞ while the right-hand side remains fixed. The vector y is therefore a dual-feasible point: y ∈ B. Finally, again considering the possibility that x = 0 and z = 0, we conclude that ⟨b, y⟩ ≤ η for all η > α. It is therefore the case that ⟨b, y⟩ ≤ α, and hence ⟨b, y⟩ = α by weak duality. This concludes the proof (of the first statement).

1.6 Example: conic program for optimal measurements


In this section we will discuss an example of a conic program that is relevant to
quantum information. We’ll begin with a somewhat general example, or perhaps
a category of examples, and then discuss a specific, concrete example.
Suppose X is a complex Euclidean space and K ⊆ Herm(X) is a closed, convex
cone. Suppose further that H1 , . . . , Hn ∈ Herm(X), and consider this optimization
problem.

maximize:   ⟨H1, X1⟩ + · · · + ⟨Hn, Xn⟩
subject to: X1 + · · · + Xn = 1X
            X1, . . . , Xn ∈ K

This is essentially an optimal measurement problem, where X1 , . . . , Xn represent


measurement operators; these operators must sum to the identity as usual, but
in place of the usual constraint on measurement operators being positive semidef-
inite, we are instead constraining them to the cone K. By choosing K to be the cone
of positive semidefinite operators Pos(X), which is closed and convex, we natu-
rally obtain the ordinary optimal measurement problem, but we can consider any
closed, convex cone K we choose. For example, we may take K to be the cone of
separable operators when X = Y ⊗ Z is a bipartite tensor product space.

We can set this problem up as a conic program as follows. First, observe that
Kn is a closed, convex cone. The problem may therefore be expressed as the primal
problem of a conic program:

maximize:   ⟨(H1, . . . , Hn), (X1, . . . , Xn)⟩
subject to: φ(X1, . . . , Xn) := X1 + · · · + Xn = 1X
            (X1, . . . , Xn) ∈ Kn

The symbol := indicates that this is the definition of the function φ, whereas the ordinary equal sign represents a constraint.
Here is the dual problem, as is dictated by Optimization Problem 1.2:

minimize:   ⟨1X, Y⟩
subject to: φ∗(Y) − (H1, . . . , Hn) ∈ (Kn)∗
            Y ∈ Herm(X)

We can simplify this by observing that φ∗(Y) = (Y, . . . , Y) and (Kn)∗ = (K∗)n, and by rewriting the inner product with the identity operator, ⟨1X, Y⟩, as the trace Tr(Y) in the objective function. The following problem is obtained.

minimize:   Tr(Y)
subject to: Y − H1 ∈ K∗
            ...
            Y − Hn ∈ K∗
            Y ∈ Herm(X)

In summary, the following conic program, expressed in a simplified form, has been
obtained.

Optimization Problem 1.6

Primal problem
  maximize:   ⟨H1, X1⟩ + · · · + ⟨Hn, Xn⟩
  subject to: X1 + · · · + Xn = 1X
              X1, . . . , Xn ∈ K

Dual problem
  minimize:   Tr(Y)
  subject to: Y − H1 ∈ K∗
              ...
              Y − Hn ∈ K∗
              Y ∈ Herm(X)
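Before turning to a specific instance, here is a short Python sketch (not from the notes) of the primal problem above in the familiar case K = Pos(X): optimal discrimination of an ensemble of states. The two qubit states and uniform prior used below are assumptions made only for this example.

# Optimal measurement with K = Pos(X): H_k = p_k * rho_k for a uniform ensemble
# of the two pure states |0> and |+>.  The SDP value should match Helstrom's formula.
import numpy as np
import cvxpy as cp

ket0 = np.array([1.0, 0.0])
ketp = np.array([1.0, 1.0]) / np.sqrt(2)
rho = [np.outer(ket0, ket0), np.outer(ketp, ketp)]
H = [r / 2 for r in rho]                                  # p_k = 1/2
d, n = 2, len(H)

X = [cp.Variable((d, d), symmetric=True) for _ in range(n)]
constraints = [sum(X) == np.eye(d)] + [Xk >> 0 for Xk in X]
objective = cp.Maximize(sum(cp.trace(H[k] @ X[k]) for k in range(n)))
prob = cp.Problem(objective, constraints)
prob.solve()

# Helstrom bound for two equiprobable states: 1/2 + (1/4)||rho_0 - rho_1||_1.
helstrom = 0.5 + 0.25 * np.sum(np.abs(np.linalg.eigvalsh(rho[0] - rho[1])))
print(prob.value, helstrom)                               # the two values should agree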

Now let us consider a specific instance of this conic program. Define four pure
states | ψ1 i, . . . , | ψ4 i ∈ C4 ⊗ C4 as follows:

|ψ1⟩ = (1/2)(|0⟩|0⟩ + |1⟩|1⟩ + |2⟩|2⟩ + |3⟩|3⟩) = (1/2) vec(1 ⊗ 1),
|ψ2⟩ = (1/2)(|0⟩|3⟩ + |1⟩|2⟩ + |2⟩|1⟩ + |3⟩|0⟩) = (1/2) vec(σx ⊗ σx),
|ψ3⟩ = (1/2)(|0⟩|3⟩ + |1⟩|2⟩ − |2⟩|1⟩ − |3⟩|0⟩) = (i/2) vec(σy ⊗ σx),
|ψ4⟩ = (1/2)(|0⟩|1⟩ + |1⟩|0⟩ − |2⟩|3⟩ − |3⟩|2⟩) = (1/2) vec(σz ⊗ σx).
These states were identified by Yu, Duan, and Ying (2012), who proved that they
cannot be perfectly discriminated by a PPT measurement. Cosentino (2013) subse-
quently showed that the optimal PPT discrimination probability is 7/8, by solving
the associated semidefinite program.
We will prove that no separable measurement can discriminate these states with
probability greater than 3/4, assuming that one of the four states is selected uni-
formly at random. This is easily achievable by coarse-graining a measurement with
respect to the standard basis, so this is in fact the optimal probability to correctly
discriminate the states by a separable measurement. Here is a precise description
of the corresponding conic program.

Optimization Problem 1.7


Primal problem
  maximize:   (1/4)⟨ψ1| X1 |ψ1⟩ + · · · + (1/4)⟨ψ4| X4 |ψ4⟩
  subject to: X1 + · · · + X4 = 14 ⊗ 14
              X1, . . . , X4 ∈ Sep(C4 : C4)

Dual problem
  minimize:   Tr(Y)
  subject to: Y − (1/4)|ψk⟩⟨ψk| ∈ Sep(C4 : C4)∗   (1 ≤ k ≤ 4)
              Y ∈ Herm(C4 ⊗ C4)

Our goal will be to describe a dual-feasible solution having objective value 3/4,
for this will then be an upper-bound on the probability of a correct discrimination
by weak duality.

Note that the dual-feasibility of a given Y ∈ Herm(C4 ⊗ C4 ) is equivalent to

Y − (1/16) vec(Uk) vec(Uk)∗ ∈ Sep(C4 : C4)∗   (1.25)
for
U1 = 1 ⊗ 1, U2 = σx ⊗ σx , U3 = iσy ⊗ σx , U4 = σz ⊗ σx . (1.26)
In order to prove the dual-feasibility of a specific choice for Y that will be specified
shortly, we will make use of the following lemma.

Lemma 1.8 (Breuer–Hall). Let U, V ∈ U(Cn) be unitary operators such that V^T U is anti-symmetric: (V^T U)^T = −V^T U. The operator
Z = 1n ⊗ 1n − vec(U) vec(U)∗ − (T ⊗ 1)(vec(V) vec(V)∗)   (1.27)
is contained in Sep(Cn : Cn)∗.

Proof. For any unit vector z ∈ Cn, we find that
(1 ⊗ z)∗ Z (1 ⊗ z) = 1 − U z z^T U∗ − V z z∗ V^T ≥ 0.   (1.28)
This follows from the observation that the vectors Uz and Vz must be orthogonal unit vectors:
⟨Vz, Uz⟩ = z∗ V^T U z = ⟨z z^T, V^T U⟩
         = ⟨(z z^T)^T, (V^T U)^T⟩ = −⟨z z^T, V^T U⟩ = 0.   (1.29)
It follows that
⟨yy∗ ⊗ zz∗, Z⟩ = (y ⊗ z)∗ Z (y ⊗ z) ≥ 0   (1.30)
for every y ∈ Cn. The required containment follows by convexity.

Now define V = σy ⊗ σz, and observe that V^T U1, V^T U2, V^T U3, and V^T U4 are all anti-symmetric. By the Breuer–Hall lemma, the operator
Y = (1/16)(14 ⊗ 14 − (T ⊗ 1)(vec(V) vec(V)∗))   (1.31)
is dual-feasible. Given that Tr(Y ) = (16 − 4)/16 = 3/4, we have obtained the
claimed upper-bound on correctly discriminating these states by a separable mea-
surement.
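The construction above is easy to check numerically. The following Python sketch (an illustration added here, not part of the notes) verifies that the operators V^T Uk are anti-symmetric, that Tr(Y) = 3/4, and that randomly chosen product vectors give a nonnegative value against Y − (1/4)|ψk⟩⟨ψk|, consistent with membership in Sep(C4 : C4)∗.

import numpy as np

I2 = np.eye(2); sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]]); sz = np.diag([1, -1]).astype(complex)

U = [np.kron(I2, I2), np.kron(sx, sx), 1j * np.kron(sy, sx), np.kron(sz, sx)]
V = np.kron(sy, sz)
assert all(np.allclose((V.T @ Uk).T, -(V.T @ Uk)) for Uk in U)   # anti-symmetry

vec = lambda M: M.reshape(-1)                  # row-major vec, matching the notes
psi = [vec(Uk) / 2 for Uk in U]

W = np.outer(vec(V), vec(V).conj())
W_pt = W.reshape(4, 4, 4, 4).transpose(2, 1, 0, 3).reshape(16, 16)   # (T (x) 1)(.)
Y = (np.eye(16) - W_pt) / 16
print(np.trace(Y).real)                        # 0.75

rng = np.random.default_rng(1)
for k in range(4):
    Z = Y - np.outer(psi[k], psi[k].conj()) / 4
    for _ in range(200):                       # spot-check membership in Sep*
        y = rng.standard_normal(4) + 1j * rng.standard_normal(4)
        z = rng.standard_normal(4) + 1j * rng.standard_normal(4)
        v = np.kron(y, z)
        assert (v.conj() @ Z @ v).real > -1e-9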

Lecture 2

Max-relative entropy and conditional


min-entropy

In this lecture we will first define the max-relative entropy and observe some of its
properties. We will then define the conditional min-entropy in terms of the quantum
max-relative entropy, derive an alternative characterization of this quantity, and
consider the conditional min-entropy of a few example classes of states.
Before proceeding to the definition of the max-relative entropy, it will be help-
ful to consider the ordinary quantum relative entropy and its relationship to the
conditional quantum entropy as a source of inspiration. Recall that the quantum
relative entropy is defined as follows for all density operators ρ and all positive
semidefinite operators Q acting on the same complex Euclidean space:
D(ρ ∥ Q) = Tr(ρ log ρ) − Tr(ρ log Q)   if im(ρ) ⊆ im(Q),
D(ρ ∥ Q) = ∞                           if im(ρ) ⊈ im(Q).   (2.1)

We can define this function more generally for any positive semidefinite operator P
in place of the density operator ρ, but our focus will be on the case where the first
argument is a density operator.
One way to think about the quantum relative entropy is that it represents the
loss of efficiency, measured in bits, that is incurred when one plans ahead for Q but
receives ρ instead. This is highly informal, and should not be taken too seriously,
but we will allow this intuitive description to suggest some useful terminology: we
will refer to the second argument Q in the quantum relative entropy as the model,
and to the first argument ρ as the actual state, for the sake of convenience.
Irrespective of how we choose to interpret the quantum relative entropy func-
tion, there is no denying its enormous utility as a “helper function,” through which
fundamental entropic quantities may be defined and analyzed. In particular, the
conditional quantum entropy and the quantum mutual information are defined in terms of the quantum relative entropy as follows:

H(X | Y)ρ = −D(ρ ∥ 1X ⊗ ρ[Y]),
I(X : Y)ρ = D(ρ ∥ ρ[X] ⊗ ρ[Y]),   (2.2)

for all ρ ∈ D(X ⊗ Y). Then, through properties of the quantum relative entropy,
one may establish important properties of the conditional quantum entropy and
quantum mutual information. For example, through the joint convexity of quantum
relative entropy,

D(λρ0 + (1 − λ)ρ1 ∥ λQ0 + (1 − λ)Q1) ≤ λ D(ρ0 ∥ Q0) + (1 − λ) D(ρ1 ∥ Q1),   (2.3)

one may prove the critically important strong subadditivity property of von Neu-
mann entropy, which may be expressed as

H(X | Y, Z)ρ ≤ H(X | Y )ρ (2.4)

for every ρ ∈ D(X ⊗ Y ⊗ Z).


The quantum relative entropy, and the entropic quantities it defines, tell us a
great deal about the so-called i.i.d. limit, where an increasing number of indepen-
dent copies of a given state are made available. In contrast, when our interest is
in the so-called one-shot setting, where our concern is primarily with a single copy
of a given state, the quantum relative entropy and the quantities it defines have
limited value.

2.1 Quantum max-relative entropy


The quantum max-relative entropy (or just max-relative entropy for short) offers an
alternative to the ordinary quantum relative entropy that is relevant in the one-
shot setting. While it is a different function from the quantum relative entropy,
it does possess some of the same general characteristics that make the quantum
relative entropy function useful. As we will see in a couple of lectures, the ordinary
quantum relative entropy can in fact be recovered from the max-relative entropy
(or, to be more precise, a smoothed version of max-relative entropy) by applying it
in the i.i.d. limit.

Definition 2.1 (Quantum max-relative entropy). For a density operator ρ and a


positive semidefinite operator Q acting on the same complex Euclidean space, the
quantum max-relative entropy of ρ with respect to Q is defined as follows:

Dmax(ρ ∥ Q) = inf{λ ∈ R : ρ ≤ 2^λ Q}.   (2.5)

Remark 2.2. The same definition may be used when ρ is any positive semidefinite
operator, and not necessarily a density operator. It is common, in particular, that ρ
is taken to be a sub-normalized state, meaning ρ ≥ 0 and Tr(ρ) ≤ 1. In this course,
however, we will focus on the case that ρ is normalized.

Let us first observe the equivalence of the following statements:

1. Dmax (ρ k Q) < ∞
2. im(ρ) ⊆ im( Q) (or, equivalently, ker( Q) ⊆ ker(ρ))
3. D(ρ k Q) < ∞

In particular, the max-relative entropy is finite if and only if the ordinary quantum
relative entropy is finite.
We may also observe that the max-relative entropy can be expressed through a
semidefinite program. More specifically, the max-relative entropy is the logarithm
of the optimal value of the following semidefinite program.

Optimization Problem 2.3 (SDP for max-relative entropy)

Primal problem
  minimize:   η
  subject to: ρ ≤ ηQ
              η ≥ 0

Dual problem
  maximize:   ⟨ρ, X⟩
  subject to: ⟨Q, X⟩ ≤ 1
              X ∈ Pos(X)

Alternatively, the max-relative entropy is the negative logarithm of the optimal


value of the following semidefinite program.

Optimization Problem 2.4 (Reciprocal SDP for max-relative entropy)

Primal problem
  maximize:   µ
  subject to: µρ ≤ Q
              µ ≥ 0

Dual problem
  minimize:   ⟨Q, Y⟩
  subject to: ⟨ρ, Y⟩ ≥ 1
              Y ∈ Pos(X)

Notice that all four of the problems just suggested are strictly feasible when
im(ρ) ⊆ im( Q). Slater’s theorem therefore implies that strong duality holds under
this assumption for both semidefinite programs, with optimal values always being
achieved in all four problems. Strong duality also holds when im(ρ) 6⊆ im( Q); in
this case the optimal value of both the primal and dual forms in Optimization
Problem 2.3 is positive infinity, while the optimal value of both the primal and
dual forms in Optimization Problem 2.4 is zero.
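As a small illustration (not from the notes), the primal form of Optimization Problem 2.3 can be solved directly with cvxpy; Dmax(ρ ∥ Q) is the base-2 logarithm of the resulting optimal value. The random real density matrices below are assumptions made only for this example.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
def random_state(d):
    A = rng.standard_normal((d, d))
    P = A @ A.T
    return P / np.trace(P)

rho, Q = random_state(3), random_state(3)

eta = cp.Variable(nonneg=True)
prob = cp.Problem(cp.Minimize(eta), [eta * Q - rho >> 0])   # rho <= eta * Q
prob.solve()
print("D_max =", np.log2(prob.value))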

Two alternative characterizations of max-relative entropy
We will now take a moment to observe two alternative characterizations of the max-relative entropy. For the first, observe that if im(ρ) ⊆ im(Q), then the condition ρ ≤ 2^λ Q is equivalent to
‖√(Q^+) ρ √(Q^+)‖ ≤ 2^λ.   (2.6)
Therefore, we have
Dmax(ρ ∥ Q) = log‖√(Q^+) ρ √(Q^+)‖   if im(ρ) ⊆ im(Q),
Dmax(ρ ∥ Q) = ∞                      if im(ρ) ⊈ im(Q).   (2.7)
We may alternatively write
Dmax(ρ ∥ Q) = log‖Q^{−1/2} ρ Q^{−1/2}‖,   (2.8)
with the somewhat informal understanding that the expression evaluates to ∞ in case im(ρ) ⊈ im(Q).
The second alternative characterization of the max-relative entropy begins with the observation that the condition ρ ≤ 2^λ Q is equivalent to ⟨ρ, Z⟩ ≤ 2^λ ⟨Q, Z⟩ for all positive definite operators Z. Therefore, assuming Q ≠ 0, we find that
Dmax(ρ ∥ Q) = sup_{Z>0} log( ⟨ρ, Z⟩ / ⟨Q, Z⟩ ).   (2.9)
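The following small numpy sketch (an illustration, not from the notes) evaluates the closed form (2.8) for a full-rank model Q and spot-checks the variational bound implied by (2.9): for any positive definite Z, the ratio ⟨ρ, Z⟩/⟨Q, Z⟩ never exceeds 2^{Dmax(ρ ∥ Q)}.

import numpy as np

rng = np.random.default_rng(4)
def random_state(d):
    A = rng.standard_normal((d, d))
    P = A @ A.T
    return P / np.trace(P)

rho, Q = random_state(3), random_state(3)

w, V = np.linalg.eigh(Q)
Qmh = V @ np.diag(w ** -0.5) @ V.T                 # Q^{-1/2} (Q has full rank here)
dmax = np.log2(np.linalg.norm(Qmh @ rho @ Qmh, 2))  # closed form (2.8), base 2

for _ in range(1000):                               # spot-check the supremum in (2.9)
    B = rng.standard_normal((3, 3))
    Z = B @ B.T + 1e-6 * np.eye(3)                  # a random positive definite Z
    assert np.log2(np.trace(rho @ Z) / np.trace(Q @ Z)) <= dmax + 1e-9
print(dmax)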

Interpretation of max-relative entropy


One simple and intuitive way to think about the max-relative entropy Dmax (ρ k Q)
is as follows. Suppose that one attempts to express Q as a nonnegative linear com-
bination of ρ along with any other collection of positive semidefinite operators. We
can amalgamate the other positive semidefinite operators and associated nonneg-
ative scalars into a single positive semidefinite operator R, for the sake of focusing
on the relationship between ρ and Q, and we obtain an expression like this:

Q = ηρ + R (where R ≥ 0). (2.10)

The largest that the value η can be, assuming we are free to choose R however we
wish, is precisely 2− Dmax (ρkQ) .
If Q = σ is itself a density operator, then necessarily η ∈ [0, 1], and we may
think of this value as being a probability. The simple fact that η ≤ 1 immediately
yields a variant of Klein’s inequality for the max-relative entropy:

Dmax (ρ k σ) ≥ 0, (2.11)

with equality if and only if ρ = σ. If, on the other hand, σ is “highly dissimilar” to ρ, then any convex combination involving ρ and yielding σ must take the probability η associated with ρ to be small, so Dmax(ρ ∥ σ) must be large. In the extreme case that im(ρ) ⊈ im(σ), any expression of σ taking the form (2.10) must have η = 0, which is consistent with Dmax(ρ ∥ σ) = ∞.

Monotonicity of max-relative entropy


Next let us observe that the max-relative entropy is monotonic with respect to the
action of channels, meaning that

Dmax(Φ(ρ) ∥ Φ(Q)) ≤ Dmax(ρ ∥ Q)   (2.12)

for all ρ ∈ D(X), Q ∈ Pos(X), and Φ ∈ C(X, Y). In fact, complete positivity is not
required; the inequality (2.12) holds for all Φ positive and trace preserving.
Before we prove that the max-relative entropy is monotonic in the sense just described, let us note that we cannot follow the same route to this fact as the one we followed when proving the analogous fact for the ordinary quantum relative entropy in CS 766/QIC 820—which was through the joint convexity of quantum relative entropy. This is because the max-relative entropy is not jointly convex—and this is a sense in which it differs from the ordinary quantum relative entropy. The max-relative entropy is, however, jointly quasi-convex:
Dmax( ∑_{k=1}^{n} pk ρk ∥ ∑_{k=1}^{n} pk Qk ) ≤ max_{k∈{1,...,n}} Dmax(ρk ∥ Qk).   (2.13)

The fact that the max-relative entropy is monotonic with respect to the action
of channels, however, is not only true but is almost immediate from the definition
of the max-relative entropy. Specifically, if we have ρ ≤ 2λ Q for some choice of λ,
then Φ(ρ) ≤ 2λ Φ( Q) by the positivity of Φ, from which (2.12) follows. The as-
sumption that Φ preserves trace implies that Tr(Φ(ρ)) = 1, so that it is a suitable
first argument to the max-relative entropy—but this assumption can be dropped
altogether, provided that we’re willing to allow Φ(ρ) as a first argument to the
max-relative entropy function.

Max-relative entropy upper-bounds relative entropy


One can prove that the max-relative entropy is at least as large as the ordinary
quantum relative entropy, meaning

D(ρ ∥ Q) ≤ Dmax(ρ ∥ Q)   (2.14)

for all density operators ρ and all positive semidefinite operators Q.
One way to prove this is to use the fact that the logarithm is an operator monotone
function: for all positive definite operators P and Q with P ≤ Q, it is the case that
log( P) ≤ log( Q). This is not a trivial fact to prove, but it is well-known, and you
should have no trouble finding a proof if you search for one.
Now, suppose that λ satisfies ρ ≤ 2^λ Q, or equivalently 2^{−λ} ρ ≤ Q. We then have
D(ρ ∥ Q) = Tr(ρ log ρ) − Tr(ρ log Q) ≤ Tr(ρ log ρ) − Tr(ρ log(2^{−λ} ρ)) = λ,   (2.15)


and the relation (2.14) follows by minimizing over λ.

Max-relative entropy of tensor products and block operators


The max-relative entropy is additive with respect to tensor products:
Dmax(ρ0 ⊗ ρ1 ∥ Q0 ⊗ Q1) = Dmax(ρ0 ∥ Q0) + Dmax(ρ1 ∥ Q1).   (2.16)
(The ordinary relative entropy is also additive with respect to tensor products in the same way.) The characterization
Dmax(ρ ∥ Q) = log‖√(Q^+) ρ √(Q^+)‖   if im(ρ) ⊆ im(Q),
Dmax(ρ ∥ Q) = ∞                      if im(ρ) ⊈ im(Q)   (2.17)
offers an easy route to a proof of this fact. Observe in particular that this implies that, for every choice of ρ, Q, and a positive integer n, we have
Dmax(ρ^{⊗n} ∥ Q^{⊗n}) = n Dmax(ρ ∥ Q).   (2.18)
The max-relative entropy also obeys the following identity, for any choice of density operators ρ1, . . . , ρn, positive semidefinite operators Q1, . . . , Qn, and a probability vector (p1, . . . , pn):
Dmax( ∑_{k=1}^{n} pk |k⟩⟨k| ⊗ ρk ∥ ∑_{k=1}^{n} pk |k⟩⟨k| ⊗ Qk ) = max_{k∈{1,...,n}} Dmax(ρk ∥ Qk).   (2.19)
Using the formula Dmax(ρ ∥ ηQ) = Dmax(ρ ∥ Q) − log(η), we obtain this formula for the situation in which the probabilities p1, . . . , pn are not included in the blocks of the second operator:
Dmax( ∑_{k=1}^{n} pk |k⟩⟨k| ⊗ ρk ∥ ∑_{k=1}^{n} |k⟩⟨k| ⊗ Qk ) = max_{k∈{1,...,n}} ( Dmax(ρk ∥ Qk) + log(pk) ).   (2.20)

In contrast, the ordinary quantum relative entropy obeys this equation:
D( ∑_{k=1}^{n} pk |k⟩⟨k| ⊗ ρk ∥ ∑_{k=1}^{n} pk |k⟩⟨k| ⊗ Qk ) = ∑_{k=1}^{n} pk D(ρk ∥ Qk).   (2.21)
Using the equation D(ρ ∥ ηQ) = D(ρ ∥ Q) − log(η), we then conclude that
D( ∑_{k=1}^{n} pk |k⟩⟨k| ⊗ ρk ∥ ∑_{k=1}^{n} |k⟩⟨k| ⊗ Qk ) = ∑_{k=1}^{n} pk D(ρk ∥ Qk) − H(p).   (2.22)

2.2 Conditional min-entropy


As was already mentioned at the beginning of the lecture, the ordinary conditional
quantum entropy is given by the formula
H(X | Y)ρ = −D(ρ ∥ 1X ⊗ ρ[Y]).   (2.23)
We may also observe that
D(ρ ∥ 1X ⊗ ρ[Y]) = inf_{σ∈D(Y)} D(ρ ∥ 1X ⊗ σ);   (2.24)

the infimum value is always obtained when σ = ρ[Y ]. With this fact in mind, we
define the conditional min-entropy as follows.
Definition 2.5. Let X and Y be registers and let ρ ∈ D(X ⊗ Y) be a state of these
registers. The conditional min-entropy of X given Y for the state ρ is defined as
Hmin(X | Y)ρ = − inf_{σ∈D(Y)} Dmax(ρ ∥ 1X ⊗ σ).   (2.25)

Remark 2.6. It is not, in general, the case that the infimum in (2.25) is achieved
when σ = ρ[Y ].
By expanding the definition of the max-relative entropy, one may alternatively
express the conditional min-entropy in the following way:

2^{−Hmin(X|Y)ρ} = inf{η ≥ 0 : ρ ≤ η 1X ⊗ σ, σ ∈ D(Y)}
                = inf{Tr(Y) : ρ ≤ 1X ⊗ Y, Y ∈ Pos(Y)}.   (2.26)


The conditional min-entropy is always at most the ordinary conditional quan-


tum entropy: Hmin (X|Y )ρ ≤ H(X|Y )ρ . This fact follows from the fact that the max-
relative entropy is at least the ordinary quantum relative entropy, for then we have
Dmax(ρ ∥ 1X ⊗ σ) ≥ D(ρ ∥ 1X ⊗ σ)   (2.27)
for all density operators σ, implying the claimed inequality.

Semidefinite program for conditional min-entropy
It is evident from (2.26) that the conditional min-entropy can be expressed as a
semidefinite program. In particular, the quantity Hmin (X|Y )ρ is the negative loga-
rithm of the optimal value of the following semidefinite program.

Optimization Problem 2.7 (SDP for conditional min-entropy)

Primal problem
  maximize:   ⟨ρ, X⟩
  subject to: TrX(X) = 1Y
              X ∈ Pos(X ⊗ Y)

Dual problem
  minimize:   Tr(Y)
  subject to: 1X ⊗ Y ≥ ρ
              Y ∈ Herm(Y)
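As an illustration (not part of the notes), the dual problem above is easy to set up in cvxpy. The two-qubit maximally entangled test state below is an assumption made only for this example; for it one expects Hmin(X|Y) = −1.

import numpy as np
import cvxpy as cp

phi = np.zeros(4); phi[0] = phi[3] = 1 / np.sqrt(2)       # (|00> + |11>)/sqrt(2)
rho = np.outer(phi, phi)                                  # state of X (x) Y with dX = dY = 2
dX, dY = 2, 2

Y = cp.Variable((dY, dY), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(Y)),
                  [cp.kron(np.eye(dX), Y) - rho >> 0])    # 1_X (x) Y >= rho
prob.solve()
print(-np.log2(prob.value))                               # approximately -1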

The dual problem is clearly consistent with the expression (2.26), whereas the
primal problem corresponds (essentially) to an optimization of a linear function
(represented by the state ρ) over all channels Φ ∈ C(Y, X). There is a useful and
intuitive way to think about this optimization, but first we will take a moment to
introduce a useful concept, the transpose of a channel.

Definition 2.8. Let X and Y be complex Euclidean spaces and let Φ ∈ T(Y, X). The
transpose of Φ is the unique map ΦT ∈ T(X, Y) satisfying the equation

(Φ^T ⊗ 1L(X))(vec(1X) vec(1X)∗) = (1L(Y) ⊗ Φ)(vec(1Y) vec(1Y)∗).   (2.28)

Equivalently, the map ΦT ∈ T(X, Y) is the (uniquely determined) map whose


Choi representation is given by

J(Φ^T) = (1L(Y) ⊗ Φ)(vec(1Y) vec(1Y)∗).   (2.29)

Here is a short list of facts concerning this notion, all of which are straightforward
to prove.

1. (ΦT )T = Φ.
2. The map Φ 7→ ΦT from T(Y, X) to T(X, Y), is linear, one-to-one, and onto.
3. ΦT ∈ CP(X, Y) if and only if Φ ∈ CP(Y, X).
4. ΦT is unital if and only if Φ preserves trace.

Finally, one may observe that ΦT is (as you might have guessed) the map that
is obtained by taking any Kraus representation of Φ and transposing the Kraus
operators.

Returning to Optimization Problem 2.7, let us consider the set A of primal fea-
sible operators, which can be expressed in multiple ways based on the facts about
the transpose of a map just listed:
A = {X ∈ Pos(X ⊗ Y) : TrX(X) = 1Y}
  = {(Φ ⊗ 1L(Y))(vec(1Y) vec(1Y)∗) : Φ ∈ C(Y, X)}
  = {(1L(X) ⊗ Φ^T)(vec(1X) vec(1X)∗) : Φ ∈ C(Y, X)}                       (2.30)
  = {(1L(X) ⊗ Ψ)(vec(1X) vec(1X)∗) : Ψ ∈ CP(X, Y), Ψ(1X) = 1Y}.

The optimal value of the semidefinite program is 2^{−Hmin(X|Y)ρ}, so
2^{−Hmin(X|Y)ρ}
  = sup{ ⟨ρ, (1L(X) ⊗ Ψ)(vec(1X) vec(1X)∗)⟩ : Ψ ∈ CP(X, Y), Ψ(1X) = 1Y }
  = sup{ ⟨(1L(X) ⊗ Ψ∗)(ρ), vec(1X) vec(1X)∗⟩ : Ψ ∈ CP(X, Y), Ψ(1X) = 1Y }   (2.31)
  = sup{ ⟨(1L(X) ⊗ Ξ)(ρ), vec(1X) vec(1X)∗⟩ : Ξ ∈ C(Y, X) }.
That is,
2^{−Hmin(X|Y)ρ} = n · sup_{Ξ∈C(Y,X)} ⟨(1L(X) ⊗ Ξ)(ρ), τ⟩,   (2.32)
where
τ = (1/n) ∑_{a,b=1}^{n} |a⟩⟨b| ⊗ |a⟩⟨b|    and    n = dim(X).   (2.33)

In words, 2^{−Hmin(X|Y)ρ} is equal to dim(X) times the maximum squared fidelity, over all channels Ξ ∈ C(Y, X), between the state (1L(X) ⊗ Ξ)(ρ) and the canonical maximally entangled state τ ∈ D(X ⊗ X).

2.3 Examples
We will conclude the lecture by considering the conditional min-entropy of a few
classes of states.
Example 2.9. Suppose Y is trivial (i.e., one-dimensional), so that ρ ∈ D(X). We
then find that
Hmin(X | Y)ρ = − inf_{σ∈D(Y)} Dmax(ρ ∥ 1X ⊗ σ)
             = −Dmax(ρ ∥ 1X)   (2.34)
             = −log λ1(ρ).
Naturally, we omit the register Y from this notation when it is trivial:
Hmin (X)ρ = Hmin (ρ) = − log λ1 (ρ). (2.35)

Example 2.10. Suppose ρ = σ ⊗ ξ for σ ∈ D(X) and ξ ∈ D(Y). Through a similar
calculation to the previous example, we find that

Hmin(X | Y)ρ = − inf_{ξ′∈D(Y)} Dmax(σ ⊗ ξ ∥ 1X ⊗ ξ′)
             = −Dmax(σ ∥ 1X)   (2.36)
             = Hmin(X)σ.

This is natural: if the registers X and Y are completely uncorrelated, the conditional
min-entropy of X given Y is simply the min-entropy of X.

Example 2.11. Next, suppose that we have a separable state: ρ ∈ SepD(X : Y).
Then, for any channel Ξ ∈ C(Y, X) we have

(1L(X) ⊗ Ξ)(ρ) ∈ SepD(X : X); (2.37)

applying a channel locally to one part of a separable state always results in a sepa-
rable state. The inner-product between any separable state and the canonical max-
imally entangled state τ is at most 1/n (as we proved in CS 766/QIC 820), and
therefore
2^{−Hmin(X|Y)ρ} = n · sup_{Ξ∈C(Y,X)} ⟨(1L(X) ⊗ Ξ)(ρ), τ⟩ ≤ n · (1/n) = 1.   (2.38)

The conditional min-entropy of every separable state is therefore nonnegative.


By similar reasoning, for every PPT state ρ ∈ PPT(X : Y) ∩ D(X ⊗ Y) it is the
case that Hmin (X|Y )ρ ≥ 0.

Example 2.12. Suppose that the state τ can be recovered perfectly by applying a channel locally to Y, for the state ρ ∈ D(X ⊗ Y). This is equivalent to ρ taking the form

ρ = (1X ⊗ V )(τ ⊗ ξ )(1X ⊗ V )∗ (2.39)

for some choice of a density operator ξ ∈ D(Z) and an isometry V ∈ U(X ⊗ Z, Y).
Then we have
Hmin (X|Y )ρ = − log(n), (2.40)
which is the minimum possible value for the conditional min-entropy.

Example 2.13. Finally, suppose that ρ is a classical-quantum state:


ρ = ∑_{a=1}^{n} p(a) |a⟩⟨a| ⊗ ξa.   (2.41)

We find that
2^{−Hmin(X|Y)ρ} = n · sup_{Ξ∈C(Y,X)} ⟨(1L(X) ⊗ Ξ)(ρ), τ⟩
               = sup_{Ξ∈C(Y,X)} ∑_{a=1}^{n} p(a) ⟨a| Ξ(ξa) |a⟩,   (2.42)
with the simplification to the second line being possible because ρ is a classical-quantum state. This has the following intuitive meaning: Hmin(X|Y)ρ is the negative logarithm of the optimal probability of correctly identifying a state chosen randomly according to the ensemble corresponding to ρ.
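To illustrate this interpretation (an addition, not from the notes), the following Python sketch computes 2^{−Hmin(X|Y)ρ} for a classical-quantum state built from two equiprobable pure qubit states (an illustrative assumption) and compares it with the Helstrom guessing probability 1/2 + (1/2)‖p(0)ξ0 − p(1)ξ1‖1.

import numpy as np
import cvxpy as cp

ket0 = np.array([1.0, 0.0]); ketp = np.array([1.0, 1.0]) / np.sqrt(2)
xi = [np.outer(ket0, ket0), np.outer(ketp, ketp)]
p = [0.5, 0.5]

# rho = sum_a p(a) |a><a| (x) xi_a  on X (x) Y with dim(X) = dim(Y) = 2.
rho = sum(p[a] * np.kron(np.diag(np.eye(2)[a]), xi[a]) for a in range(2))

Y = cp.Variable((2, 2), symmetric=True)
prob = cp.Problem(cp.Minimize(cp.trace(Y)),
                  [cp.kron(np.eye(2), Y) - rho >> 0])
prob.solve()

helstrom = 0.5 + 0.5 * np.sum(np.abs(np.linalg.eigvalsh(p[0]*xi[0] - p[1]*xi[1])))
print(prob.value, helstrom)        # both equal the optimal guessing probability 2^{-H_min}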

Lecture 3

Smoothing and optimizing


max-relative entropy

Recall the definition of the max-relative entropy, which was introduced in the pre-
vious lecture:
Dmax(ρ ∥ Q) = inf{λ ∈ R : ρ ≤ 2^λ Q}.   (3.1)

In this lecture we will consider what happens when we minimize this function
over various choices of ρ and Q. Two common situations in which this is done are
as follows:

1. Smoothing. For a given ρ, Q, and ε > 0, the smoothed max-relative entropy of ρ


with respect to Q is defined as

D^ε_max(ρ ∥ Q) = inf_{ξ∈Bε(ρ)} Dmax(ξ ∥ Q),   (3.2)

where Bε (ρ) denotes the set of states that are ε-close to ρ with respect to some
notion of distance.
2. Optimizing over models. For a given ρ and a convex set C of possible choices of
models Q, we may consider the quantity

Dmax(ρ ∥ C) = inf_{Q∈C} Dmax(ρ ∥ Q),   (3.3)

which measures in a certain sense which Q ∈ C incurs the least loss of effi-
ciency when serving as a model for the state ρ.

In both cases, the optimizations can be represented as conic programs. We will


consider the two types of optimizations separately in the sections that follow.

3.1 Smoothed max-relative entropy
Let us begin with the smoothed max-relative entropy, where one takes the min-
imum value of the max-relative entropy over all choices of actual states that are
close to a given state, as suggested above. The idea is that the smoothed max-
relative entropy reflects a tolerance for small errors, which we often have or would
like to express when analyzing operationally defined notions. Without smoothing,
the max-relative entropy can sometimes, in certain settings at least, have unwanted
hyper-sensitivities that smoothing eliminates.

Definition
As it turns out, there is not a single agreed upon definition for the smoothed max-
relative entropy; different authors sometimes choose different notions of distance
with respect to which the smoothing is done, which translates to different choices
for the set Bε (ρ) in (3.2). In addition, the operator ξ is sometimes allowed to range
not only over density operators, but also over sub-normalized density operators,
and in this case the definition of the max-relative entropy is extended in the most
straightforward way to accommodate such operators.
It is typically the case, however, that the notions of distance with respect to
which the smoothed max-relative entropy is defined are based on either the trace
distance or the fidelity function. Through the Fuchs–van de Graaf inequalities, one
finds that the resulting definitions of smoothed max-relative entropy are roughly
equivalent, and are certainly quite similar in a qualitative sense.
For the sake of concreteness, we will define the smoothed max-relative entropy
in terms of the trace distance, as the following definition makes precise.

Definition 3.1 (Smoothed max-relative entropy). For a density operator ρ ∈ D(X),


a positive semidefinite operator Q ∈ Pos(X), and a real number ε ∈ (0, 1), the
ε-smoothed max-relative entropy of ρ with respect to Q is defined as
D^ε_max(ρ ∥ Q) = min_{ξ∈Bε(ρ)} Dmax(ξ ∥ Q),   (3.4)
where
Bε(ρ) = {ξ ∈ D(X) : (1/2)‖ρ − ξ‖1 ≤ ε}.   (3.5)

A couple of other common choices for Bε(ρ) are these:
Bε(ρ) = {ξ ∈ D(X) : F(ξ, ρ)² ≥ 1 − ε},
Bε(ρ) = {ξ ∈ Pos(X) : F(ξ, ρ)² ≥ 1 − ε², Tr(ξ) ≤ 1}.   (3.6)


Optimizing over arbitrary closed and convex sets of states
To better understand the smoothed max-relative entropy, it is helpful to consider
a more general set-up. Suppose that C ⊆ D(X) is any compact and convex set of
density operators, let Q ∈ Pos(X) be given, and consider the problem of minimiz-
ing η over all choices of ξ ∈ C and η ∈ R that satisfy ξ ≤ ηQ. This problem can be
expressed as a conic problem as will now be described.
First, define a set K ⊂ R ⊕ Herm(X) ⊕ R ⊕ Herm(X) as follows:

K = {(λ, λξ, η, P) : η, λ ≥ 0, ξ ∈ C, P ∈ Pos(X)}.   (3.7)

This set is a closed and convex cone. One may think of K as being the Cartesian
product of three sets: the first is

L = {(λ, λξ) : λ ≥ 0, ξ ∈ C},   (3.8)

the second is the set of all nonnegative real numbers η ≥ 0, and the third is Pos(X).
All three of these sets are closed and convex cones; in the case of L this follows
from the assumption that C is a compact and convex set. It should be noted that
the construction of a convex cone L from a convex set C like this is both common
and useful.
Now, the optimization problem suggested above is evidently equivalent to the
problem of minimizing the inner-product

⟨(0, 0, 1, 0), (λ, λξ, η, P)⟩   (3.9)

over all (λ, λξ, η, P) ∈ K, subject to the affine linear constraints that

λ=1 and ηQ = λξ + P. (3.10)

By defining a linear map φ : R ⊕ Herm(X) ⊕ R ⊕ Herm(X) → R ⊕ Herm(X) as

φ(λ, X, η, Y ) = (λ, ηQ − X − Y ), (3.11)

these affine linear constraints may be expressed as

φ(λ, λξ, η, P) = (1, 0). (3.12)

The optimization problem being considered is therefore the primal form of a conic
program, which is stated below (together with a simplified expression of its dual
form) as Optimization Problem 3.2.
The dual form of this optimization problem is given by the maximization of the
objective function
⟨(1, 0), (µ, Z)⟩   (3.13)

subject to the constraint

(0, 0, 1, 0) − φ∗ (µ, Z ) ∈ K∗ , (3.14)

where (µ, Z ) ranges over the space R ⊕ Herm(X) (as is dictated by Optimization
Problem 1.4 in Lecture 1). To simplify this problem, we must compute the adjoint
map φ∗ and try to understand what K∗ looks like. The adjoint of φ may simply be
computed, and one obtains

φ∗(µ, Z) = (µ, −Z, ⟨Q, Z⟩, −Z).   (3.15)

As for the dual cone to K, an element (δ, H, γ, R) is contained in K∗ if and only if

⟨(δ, H, γ, R), (λ, λξ, η, P)⟩ = δλ + λ⟨H, ξ⟩ + γη + ⟨R, P⟩ ≥ 0   (3.16)

for all λ, η ≥ 0, ξ ∈ C, and P ∈ Pos(X). This is equivalent to the requirement that γ ≥ 0, R ∈ Pos(X), and
δ ≥ −⟨H, ξ⟩   (3.17)
for all ξ ∈ C. By defining a function

ψC(H) = inf_{ξ∈C} ⟨ξ, H⟩,   (3.18)

we may alternatively express K∗ as follows:

K∗ = {(δ, H, γ, R) : δ ≥ −ψC(H), γ ≥ 0, R ∈ Pos(X)}.   (3.19)

The dual problem is therefore a maximization of µ subject to the constraints µ ≤ ψC(Z), ⟨Q, Z⟩ ≤ 1, and Z ∈ Pos(X), or, equivalently, a maximization of ψC(Z) over all Z ∈ Pos(X) satisfying ⟨Q, Z⟩ ≤ 1. We obtain the following simplified expression of this conic program.

Optimization Problem 3.2

Primal problem
  minimize:   η
  subject to: ξ ≤ ηQ,
              ξ ∈ C,
              η ≥ 0.

Dual problem
  maximize:   inf{⟨ξ, Z⟩ : ξ ∈ C}
  subject to: ⟨Q, Z⟩ ≤ 1,
              Z ∈ Pos(X).

It can be shown, through the use of Slater’s theorem, that strong duality always
holds for Optimization Problem 3.2.

Conic program for smoothed max-relative entropy
At this point, one may substitute the set Bε (ρ) for C in Optimization Problem 3.2 to
obtain a conic program for the smoothed max-relative entropy. To be precise, the smoothed max-relative entropy D^ε_max(ρ ∥ Q) is the logarithm of the optimal value of this conic program. One may also note that if C = {ρ}, then the semidefinite
program for the ordinary (non-smoothed) max-relative entropy is recovered.
Naturally, other notions of smoothing can be considered by making alternative
choices for the set C.
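For trace-distance smoothing the resulting conic program is in fact a semidefinite program, since the constraint (1/2)‖ρ − ξ‖1 ≤ ε is SDP-representable. The following Python sketch (an illustration, not from the notes) encodes it via the standard decomposition ρ − ξ = P − N with P, N ⪰ 0 and Tr(P + N) ≤ 2ε; the random test inputs are assumptions for this example only.

import numpy as np
import cvxpy as cp

def smoothed_dmax(rho, Q, eps):
    d = rho.shape[0]
    xi = cp.Variable((d, d), symmetric=True)
    P = cp.Variable((d, d), symmetric=True)
    N = cp.Variable((d, d), symmetric=True)
    eta = cp.Variable(nonneg=True)
    constraints = [
        xi >> 0, cp.trace(xi) == 1,              # xi is a density operator
        eta * Q - xi >> 0,                       # xi <= eta * Q
        rho - xi == P - N, P >> 0, N >> 0,       # (1/2)||rho - xi||_1 <= eps
        cp.trace(P + N) <= 2 * eps,
    ]
    cp.Problem(cp.Minimize(eta), constraints).solve()
    return np.log2(eta.value)

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 3)); rho = A @ A.T / np.trace(A @ A.T)
B = rng.standard_normal((3, 3)); Q = B @ B.T / np.trace(B @ B.T)
print(smoothed_dmax(rho, Q, 0.01), smoothed_dmax(rho, Q, 0.1))  # larger eps can only decrease the value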

3.2 Minimizing over a convex set of models


Next we may consider what happens when we minimize the max-relative entropy
over a compact and convex set in the second coordinate. That is, for C ⊆ Pos(X)
being a compact and convex set, we may consider the quantity
Dmax(ρ ∥ C) = inf_{Q∈C} Dmax(ρ ∥ Q)   (3.20)

for a given choice of ρ ∈ D(X).

Conic program formulation


Going through a similar process to the one above, we obtain the following conic
program.
Optimization Problem 3.3

Primal problem
  minimize:   η
  subject to: ρ ≤ ηQ
              Q ∈ C,
              η ≥ 0

Dual problem
  maximize:   ⟨ρ, Z⟩
  subject to: sup{⟨Q, Z⟩ : Q ∈ C} ≤ 1
              Z ∈ Pos(X)
Similar to before, the value Dmax (ρ kC) defined above is the logarithm of the
optimal value of this conic program. The constraint in the dual problem can alter-
natively be written
θC(Z) ≤ 1,   (3.21)
where
θC(H) = sup_{Q∈C} ⟨Q, H⟩   (3.22)
is the so-called support function of the convex set C.

Example 3.4. We have, in fact, already seen an example in which we minimize
over models drawn from a convex set: the conditional min-entropy. The conditional
min-entropy of X given Y, for the state ρ ∈ D(X ⊗ Y), is given by

Hmin(X | Y)ρ = − inf_{σ∈D(Y)} Dmax(ρ ∥ 1X ⊗ σ) = − inf_{Q∈C} Dmax(ρ ∥ Q)   (3.23)

for C = {1X ⊗ σ : σ ∈ D(Y)}. Observe that


θC ( Z ) = suph Q, Z i = sup h1X ⊗ σ, Z i
Q ∈C σ ∈D(Y)
 (3.24)
= sup hσ, TrX ( Z )i = λ1 TrX ( Z ) .
σ ∈D(Y)

By the dual formulation of the conic program above, we find that
Hmin(X | Y)ρ = − sup{ log⟨ρ, Z⟩ : λ1(TrX(Z)) ≤ 1, Z ∈ Pos(X ⊗ Y) }
             = − sup{ log⟨ρ, Z⟩ : TrX(Z) ≤ 1Y, Z ∈ Pos(X ⊗ Y) }   (3.25)
             = − sup{ log⟨ρ, Z⟩ : TrX(Z) = 1Y, Z ∈ Pos(X ⊗ Y) },
which is consistent with the primal problem in our semidefinite program for the conditional min-entropy from Lecture 2.

Divergence from a convex set of states


For a convex set of states C ⊆ D(X), it is typical that one views the quantity

D(ρ ∥ C) = inf_{σ∈C} D(ρ ∥ σ)   (3.26)

(where it should be stressed that it is the ordinary quantum relative entropy, not
the max-relative entropy, that appears in this equation) as a measure of distance
(or divergence) of ρ from C. For example, the relative entropy of entanglement of a
state ρ ∈ D(Y ⊗ Z) is given by

REE(Y : Z)ρ = D(ρ ∥ SepD(Y : Z)) = inf_{σ∈SepD(Y:Z)} D(ρ ∥ σ).   (3.27)

We may consider a similar notion for the max-relative entropy in place of the
ordinary relative entropy:

Dmax(ρ ∥ C) = inf_{σ∈C} Dmax(ρ ∥ σ).   (3.28)

To better understand this quantity, let us expand the definition of the max-relative
entropy, so that we obtain

Dmax(ρ ∥ C) = inf{λ ∈ R : ρ ≤ 2^λ σ, σ ∈ C}.   (3.29)

Figure 3.1: Dmax(ρ ∥ C) is the minimum value of λ for which 2^{−λ}ρ + (1 − 2^{−λ})ξ is in C, for some choice of a density operator ξ.

The inequality on the right-hand side of this equation can be expressed as an


equality through the introduction of a positive semidefinite slack variable P, which
yields

Dmax(ρ ∥ C) = inf{λ ∈ R : ρ + P = 2^λ σ, σ ∈ C, P ∈ Pos(X)}.   (3.30)

As ρ and σ are density operators, the slack variable P must have trace equal to 2^λ − 1 in order for the equality ρ + P = 2^λ σ to hold. We may therefore replace P by (2^λ − 1)ξ for a density operator ξ, and we obtain
Dmax(ρ ∥ C) = inf{λ ∈ R : ρ + (2^λ − 1)ξ = 2^λ σ, σ ∈ C, ξ ∈ D(X)},   (3.31)

which is equivalent to

Dmax(ρ ∥ C) = inf{λ ∈ R : 2^{−λ}ρ + (1 − 2^{−λ})ξ ∈ C, ξ ∈ D(X)}.   (3.32)

This expression reveals that the quantity Dmax(ρ ∥ C) has the simple and intuitive interpretation suggested by Figure 3.1. For λ = Dmax(ρ ∥ C), the quantity 2^λ − 1 is sometimes called the (global or generalized) robustness of ρ with respect to C.
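As a hedged illustration of this robustness interpretation (not from the notes), the following Python sketch uses the SDP-representable set C = PPT(X : Y) ∩ D(X ⊗ Y) as a stand-in for a general convex set of states: one minimizes Tr(X) over X ⪰ 0 such that ρ + X has a positive partial transpose, so that Dmax(ρ ∥ C) = log2(1 + Tr(X)). The two-qubit test state and the blockwise partial-transpose helper are assumptions made for this example.

import numpy as np
import cvxpy as cp

def partial_transpose(M, dA, dB):
    # Blockwise transpose on the second tensor factor; works for numpy arrays
    # and for cvxpy expressions alike.
    blocks = [[M[i*dB:(i+1)*dB, j*dB:(j+1)*dB].T for j in range(dA)]
              for i in range(dA)]
    return np.block(blocks) if isinstance(M, np.ndarray) else cp.bmat(blocks)

dA = dB = 2
phi = np.zeros(4); phi[0] = phi[3] = 1 / np.sqrt(2)
rho = np.outer(phi, phi)                      # two-qubit maximally entangled state

X = cp.Variable((4, 4), symmetric=True)
constraints = [X >> 0,
               partial_transpose(rho, dA, dB) + partial_transpose(X, dA, dB) >> 0]
prob = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
prob.solve()
print("robustness:", prob.value, " D_max(rho || PPT) =", np.log2(1 + prob.value))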

3.3 Minimizing over both states and models


Finally, and very briefly, it should be noted that one can simultaneously minimize
over both arguments in the max-relative entropy. That is, if C ⊆ D(X) and D ⊆
Pos(X) are convex and compact sets, one may consider the quantity

Dmax(C ∥ D) = inf_{ρ∈C, Q∈D} Dmax(ρ ∥ Q).   (3.33)

This value is the logarithm of the optimal value of the following conic program.

Optimization Problem 3.5

Primal problem
  minimize:   η
  subject to: ρ ≤ ηQ
              ρ ∈ C
              Q ∈ D
              η ≥ 0

Dual problem
  maximize:   ψC(Z)
  subject to: θD(Z) ≤ 1
              Z ∈ Pos(X)

where
ψC(Z) = inf_{ρ∈C} ⟨ρ, Z⟩    and    θD(Z) = sup_{Q∈D} ⟨Q, Z⟩.   (3.34)

Lecture 4

Regularization of the smoothed


max-relative entropy

In this lecture we will prove an important theorem concerning the smoothed max-
relative entropy, which is that by regularizing the smoothed max-relative entropy
we obtain the ordinary quantum relative entropy:

lim_{n→∞} (1/n) D^ε_max(ρ^{⊗n} ∥ σ^{⊗n}) = D(ρ ∥ σ)   (4.1)
for all density operators ρ, σ ∈ D(X) and all ε ∈ (0, 1). For the sake of clarity, recall
that we define the smoothed max-relative entropy with respect to trace-distance
smoothing:
D^ε_max(ρ ∥ σ) = inf_{ξ∈Bε(ρ)} Dmax(ξ ∥ σ)   (4.2)

where
Bε(ρ) = {ξ ∈ D(X) : (1/2)‖ρ − ξ‖1 ≤ ε}.   (4.3)

Bibliographic remarks
Lemma 4.4, which in some sense is the engine that drives the proof we will discuss,
is due to Bjelaković and Siegmund-Schultze (arXiv:quant-ph/0307170), who used
it to prove the so-called quantum Stein lemma, and through it obtained an alternative
proof of the monotonicity of quantum relative entropy.
The more direct route from Bjelaković and Siegmund-Schultze’s lemma to the
regularization (4.1) to be followed in this lecture appears in the following as-of-yet
unpublished manuscript:

Shitikanth Kashyap, Ashwin Nayak, and Michael Saks. Asymptotic equiparti-


tion for quantum relative entropy revisited. Manuscript, 2014.

4.1 Strong typicality
The general notion of typicality is fundamentally important in information theory,
and there is a sense in which it goes hand-in-hand with the concept of entropy. We
will begin the lecture with a brief and directed summary of strong typicality, which
is a particular formulation of typicality that is convenient for the proof.
First let us introduce some notation. Supposing that Σ is an alphabet, for every
string a1 · · · an ∈ Σn and symbol a ∈ Σ, we write

N(a | a1 · · · an) = |{k ∈ {1, . . . , n} : ak = a}|,   (4.4)

which is simply the number of times the symbol a occurs in the string a1 · · · an .
With respect to that notation, strong typicality is defined as follows.

Definition 4.1. Let Σ be an alphabet, let p ∈ P(Σ) be a probability vector, let n be


a positive integer, and let δ > 0 be a positive real number. A string a1 · · · an ∈ Σn is
δ-strongly typical with respect to p if

| N(a | a1 · · · an)/n − p(a) | ≤ p(a) δ   (4.5)

for every a ∈ Σ. The set of all δ-strongly typical strings of length n with respect
to p is denoted Sn,δ ( p).

What the definition expresses is that the proportion of each symbol in a strongly
typical string is approximately what one would expect if the individual symbols
were chosen independently at random according to the probability vector p. No-
tice that because it is the quantity p( a)δ, as opposed to δ, that appears on the right-
hand side of the inequality in the definition, we have that the error tolerance for the
frequency with which each symbol appears shrinks proportionately as the proba-
bility for that symbol to appear shrinks—and if p( a) = 0 for some a ∈ Σ, then a
strongly typical string cannot include the symbol a at all.
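The following small numerical illustration (an addition, not from the notes) samples strings i.i.d. from a probability vector p and compares the empirical frequency of δ-strong typicality with the lower bound 1 − 2 ∑_a exp(−2nδ²p(a)²) established in Lemma 4.2 below. The particular p, n, and δ are assumptions chosen so that the bound is nontrivial.

import numpy as np

p = np.array([0.5, 0.3, 0.2])
n, delta, trials = 2000, 0.1, 2000
rng = np.random.default_rng(6)

def strongly_typical(s, p, delta):
    counts = np.bincount(s, minlength=len(p))
    return np.all(np.abs(counts / len(s) - p) <= p * delta)

samples = rng.choice(len(p), size=(trials, n), p=p)
empirical = np.mean([strongly_typical(s, p, delta) for s in samples])
bound = 1 - 2 * np.sum(np.exp(-2 * n * delta**2 * p**2))
print(empirical, bound)            # the empirical frequency should exceed the bound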
Next we will prove two basic facts concerning the notion of strong typicality.
The two facts are stated as the lemmas that follow.

Lemma 4.2. Let Σ be an alphabet, let p ∈ P(Σ) be a probability vector, let n be a positive
integer, and let δ > 0 be a positive real number. It is the case that

∑_{a1···an ∈ Sn,δ(p)} p(a1) · · · p(an) ≥ 1 − 2 ∑_{a∈Σ : p(a)>0} exp(−2nδ² p(a)²).   (4.6)

Proof. Suppose first that a ∈ Σ is fixed, and consider the probability that a string
a1 · · · an ∈ Σn , where each symbol is selected independently at random according
to the probability vector p, satisfies
| N(a | a_1 · · · a_n)/n − p(a) | > p(a)δ.    (4.7)
To upper-bound this probability, one may define X1 , . . . , Xn to be independent and
identically distributed random variables, taking value 1 with probability p( a) and
value 0 otherwise, so that the probability of the event (4.7) is equal to
Pr( | (X_1 + · · · + X_n)/n − p(a) | > p(a)δ ).    (4.8)
If it is the case that p( a) > 0, then Hoeffding’s inequality implies that
Pr( | (X_1 + · · · + X_n)/n − p(a) | > p(a)δ ) ≤ 2 exp(−2nδ²p(a)²),    (4.9)
while it is the case that
Pr( | (X_1 + · · · + X_n)/n − p(a) | > p(a)δ ) = 0    (4.10)
in case p( a) = 0. The lemma follows from the union bound.
Lemma 4.3. Let Σ be an alphabet, let p ∈ P(Σ) be a probability vector, let n be a positive
integer, let δ > 0 be a positive real number, let a1 · · · an ∈ Sn,δ ( p) be a δ-strongly typical
string with respect to p, and let φ : Σ → [0, ∞) be a nonnegative real-valued function.
The following inequality is satisfied:

| (φ(a_1) + · · · + φ(a_n))/n − ∑_{a∈Σ} p(a)φ(a) | ≤ δ ∑_{a∈Σ} p(a)φ(a).    (4.11)

Proof. The inequality (4.11) follows from the definition of strong typicality together
with the triangle inequality:

| (φ(a_1) + · · · + φ(a_n))/n − ∑_{a∈Σ} p(a)φ(a) |
    = | ∑_{a∈Σ} ( N(a | a_1 · · · a_n)/n − p(a) ) φ(a) |    (4.12)
    ≤ ∑_{a∈Σ} | N(a | a_1 · · · a_n)/n − p(a) | φ(a) ≤ δ ∑_{a∈Σ} p(a)φ(a),

as required.

4.2 Lemmas
Next we will prove two lemmas that are needed for the proof of the main theo-
rem to which this lecture is devoted. The first of these lemmas is the one due to
Bjelaković and Siegmund-Schultze mentioned at the start of the lecture.
Lemma 4.4. Let ρ, σ ∈ D(X) be density operators for which im(ρ) ⊆ im(σ ) and let
δ > 0 be a positive real number. There exist positive real numbers K and µ such that, for
every positive integer n, there exists a projection operator Πn acting on X⊗n satisfying
[Π_n, σ^{⊗n}] = 0,   ⟨Π_n, ρ^{⊗n}⟩ ≥ 1 − K exp(−µn),    (4.13)

and

2^{(1+δ)n Tr(ρ log(σ))} Π_n ≤ Π_n σ^{⊗n} Π_n ≤ 2^{(1−δ)n Tr(ρ log(σ))} Π_n.    (4.14)
Proof. By considering a spectral decomposition of σ, one may select an alphabet Σ,
an orthonormal set { x a : a ∈ Σ} ⊂ X, and a probability vector q ∈ P(Σ) such that
σ= ∑ q(a)xa x∗a (4.15)
a∈Σ

and q( a) > 0 for all a ∈ Σ. Define a new probability vector p ∈ P(Σ) as


p( a) = x ∗a ρx a (4.16)
for every a ∈ Σ. The fact that p is indeed a probability vector follows from the
assumption that ρ is a density operator with im(ρ) ⊆ im(σ).
Real numbers K and µ satisfying the requirements of the lemma may now be
selected as follows:
K = 2|supp(p)|,   µ = 2δ² min{ p(a)² : a ∈ Σ, p(a) > 0 }.    (4.17)


Toward a verification that K and µ so defined satisfy the requirements of the lemma, let
Π_n = ∑_{a_1···a_n ∈ S_{n,δ}(p)} x_{a_1} x_{a_1}^∗ ⊗ · · · ⊗ x_{a_n} x_{a_n}^∗,    (4.18)

where S_{n,δ}(p) denotes the set of δ-strongly typical sequences with respect to the
probability vector p, for every positive integer n. The condition [Π_n, σ^{⊗n}] = 0 is
immediate, while the bound
⟨Π_n, ρ^{⊗n}⟩ = ∑_{a_1···a_n ∈ S_{n,δ}(p)} p(a_1) · · · p(a_n) ≥ 1 − 2 ∑_{a∈Σ, p(a)>0} exp(−2nδ²p(a)²) ≥ 1 − K exp(−µn)    (4.19)

follows directly from Lemma 4.2.
It remains to prove the inequalities in (4.14). As

Π_n σ^{⊗n} Π_n = ∑_{a_1···a_n ∈ S_{n,δ}(p)} q(a_1) · · · q(a_n) x_{a_1} x_{a_1}^∗ ⊗ · · · ⊗ x_{a_n} x_{a_n}^∗,    (4.20)

these inequalities are equivalent to


−(1 − δ)n Tr(ρ log(σ)) ≤ − ∑_{k=1}^{n} log(q(a_k)) ≤ −(1 + δ)n Tr(ρ log(σ))    (4.21)

for every a1 · · · an ∈ Sn,δ ( p). By taking φ( a) = − log(q( a)) for every a ∈ Σ in


Lemma 4.3, so that
∑ p(a) φ(a) = − Tr(ρ log(σ)), (4.22)
a∈Σ
the inequalities (4.21) are obtained, which completes the proof.

The next lemma is just a simple technical fact concerning the inner-product of
a product of projection operators with a density operator.

Lemma 4.5. Let ρ ∈ D(X) be a density operator, let ε > 0 be a positive real number, and
let Π and ∆ be projection operators on X that satisfy hΠ, ρi ≥ 1 − ε and h∆, ρi ≥ 1 − ε.
It is the case that
h∆Π∆, ρi ≥ 1 − 6ε. (4.23)

Proof. Observe first that


⟨∆Π∆, ρ⟩ = Tr(Π∆ρ∆Π) = ‖Π∆√ρ‖_2² ≥ |⟨√ρ, Π∆√ρ⟩|² = |⟨∆Π, ρ⟩|²,    (4.24)
where the inequality is by the Cauchy–Schwarz inequality. Next, by the identity

1 = (1 − ∆)(1 − Π) + ∆ + Π − ∆Π, (4.25)

one sees that


h∆Π, ρi = h∆, ρi + hΠ, ρi − 1 + h(1 − ∆)(1 − Π), ρi
(4.26)
≥ 1 − 2ε + h(1 − ∆)(1 − Π), ρi.
By the Cauchy–Schwarz inequality, we have
|⟨(1 − ∆)(1 − Π), ρ⟩| = |⟨(1 − Π)√ρ, (1 − ∆)√ρ⟩| ≤ ‖(1 − Π)√ρ‖_2 ‖(1 − ∆)√ρ‖_2 = √⟨1 − Π, ρ⟩ √⟨1 − ∆, ρ⟩ ≤ ε,    (4.27)

and therefore
h(1 − ∆)(1 − Π), ρi ≥ −ε. (4.28)

Consequently,
h∆Π, ρi ≥ 1 − 3ε, (4.29)
and therefore
∆Π∆, ρ ≥ (1 − 3ε)2 ≥ 1 − 6ε, (4.30)
as required.

Remark 4.6. If ∆ commutes with ρ, then the bound established by the previous
lemma can be improved to
h∆Π∆, ρi ≥ 1 − 2ε. (4.31)
To see that this is so, note that

ρ − ∆ρ∆ = (1 − ∆)ρ(1 − ∆), (4.32)

and therefore

h∆Π∆, ρi = hΠ, ρi + hΠ, ∆ρ∆ − ρi = hΠ, ρi − hΠ, (1 − ∆)ρ(1 − ∆)i. (4.33)

Because (1 − ∆)ρ(1 − ∆) is positive semidefinite and Π ≤ 1, it follows that

h∆Π∆, ρi ≥ hΠ, ρi − Tr (1 − ∆)ρ(1 − ∆) = hΠ, ρi − h1 − ∆, ρi ≥ 1 − 2ε. (4.34)




Remark 4.7. The proof of the lemma above can be extended to the assumptions
hΠ, ρi ≥ 1 − ε and h∆, ρi ≥ 1 − δ to obtain
⟨∆Π∆, ρ⟩ ≥ ( 1 − ε − δ − √(εδ) )².    (4.35)

We’re also going to make use of Winter’s gentle measurement lemma, which is
a very useful and well-known fact.

Lemma 4.8 (Winter’s gentle measurement lemma). Let X be a complex Euclidean


space, let ρ ∈ D(X) be a density operator, and let P ∈ Pos(X) be a positive semidefinite
operator satisfying P ≤ 1 and h P, ρi > 0. This inequality is satisfied:
F( ρ, √P ρ √P / ⟨P, ρ⟩ ) ≥ √⟨P, ρ⟩.    (4.36)

Proof. Observe that for any two positive semidefinite operators Q and R, it is nec-
essarily the case that F( R, QRQ) = h R, Qi. Indeed,
√( √R QRQ √R ) = √( (√R Q √R)² ) = √R Q √R,    (4.37)

and therefore
F(R, QRQ) = Tr √( √R QRQ √R ) = Tr( √R Q √R ) = ⟨R, Q⟩.    (4.38)

By this formula, along with the square root scaling of the fidelity function, one
finds that
F( ρ, √P ρ √P / ⟨P, ρ⟩ ) = F( ρ, √P ρ √P ) / √⟨P, ρ⟩ = ⟨√P, ρ⟩ / √⟨P, ρ⟩.    (4.39)

Finally, under the assumption 0 ≤ P ≤ 1, it is the case that √P ≥ P, and
therefore ⟨√P, ρ⟩ ≥ ⟨P, ρ⟩, from which the lemma follows.

Remark 4.9. We will actually only need the lemma for P being a projection operator, in which case √P = P, and so the lemma holds with equality in (4.36).
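As a quick sanity check of the lemma, the following sketch (Python with numpy; the helper functions are ad hoc) compares the two sides of (4.36) for a randomly generated state and a projection, where Remark 4.9 predicts equality.

import numpy as np

def psd_sqrt(P):
    w, U = np.linalg.eigh(P)
    return (U * np.sqrt(np.clip(w, 0, None))) @ U.conj().T

def fidelity(P, Q):
    # F(P, Q) = || sqrt(P) sqrt(Q) ||_1
    return np.sum(np.linalg.svd(psd_sqrt(P) @ psd_sqrt(Q), compute_uv=False))

rng = np.random.default_rng(1)
d = 4
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = G @ G.conj().T
rho /= np.trace(rho).real                    # random density operator

P = np.zeros((d, d)); P[0, 0] = P[1, 1] = 1  # projection onto a 2-dimensional subspace
p = np.trace(P @ rho).real                   # <P, rho>
post = P @ rho @ P / p                       # renormalized post-measurement operator

print(fidelity(rho, post), ">=", np.sqrt(p)) # equality here, by Remark 4.9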

4.3 Main theorem


Now we’re ready for the main theorem and its proof. Here it is.

Theorem 4.10. Let ρ, σ ∈ D(X) be density operators. For every ε ∈ (0, 1), the following
equality holds.
lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n = D(ρ ‖ σ).    (4.40)
Proof. Let us first consider the case that im(ρ) ⊄ im(σ). In this case the right-hand
side of (4.40) is infinite, and so we must prove the same for the left-hand side.
This follows from the fact that D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) is infinite for all but finitely many
positive integers n. Indeed, let Λ be the projection onto im(σ), so that σ = ΛσΛ
and hΛ, ρi < 1. Now, for a given choice of n, one has that if ξ n ∈ D(X⊗n ) satisfies
im(ξ n ) ⊆ im(σ⊗n ), then

F(ξ_n, ρ^{⊗n}) = F(ξ_n, Λ^{⊗n} ρ^{⊗n} Λ^{⊗n}) ≤ √( Tr(Λ^{⊗n} ρ^{⊗n} Λ^{⊗n}) ) = ⟨Λ, ρ⟩^{n/2},    (4.41)

and therefore
(1/2)‖ξ_n − ρ^{⊗n}‖_1 ≥ 1 − ⟨Λ, ρ⟩^{n/2}    (4.42)
by one of the Fuchs–van de Graaf inequalities. The right-hand side of this inequal-
ity must exceed ε for all but finitely many positive integers n, by the fact that
hΛ, ρi < 1. It follows that for all but finitely many positive integers n, there are
no elements of B_ε(ρ^{⊗n}) whose images are contained in im(σ^{⊗n}), which implies for
any such n that D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) = ∞.

For the remainder of the proof it will be assumed that im(ρ) ⊆ im(σ ). Let
δ > 0 be chosen arbitrarily, and for each positive integer n, let Πn be the projection
whose existence is guaranteed by Lemma 4.4 for ρ, σ, δ, and n, and also let ∆n be
the projection whose existence is guaranteed by Lemma 4.4, again for ρ, δ, and n,
but where σ is replaced by ρ. Thus, we have [Πn , σ⊗n ] = 0 and [∆n , ρ⊗n ] = 0, and
the following inequalities are satisfied:

2^{(1+δ)n Tr(ρ log(σ))} Π_n ≤ Π_n σ^{⊗n} Π_n ≤ 2^{(1−δ)n Tr(ρ log(σ))} Π_n,    (4.43)
2^{−(1+δ)n H(ρ)} ∆_n ≤ ∆_n ρ^{⊗n} ∆_n ≤ 2^{−(1−δ)n H(ρ)} ∆_n,    (4.44)
⟨∆_n, ρ^{⊗n}⟩ ≥ 1 − K exp(−µn),    (4.45)
⟨Π_n, ρ^{⊗n}⟩ ≥ 1 − K exp(−µn),    (4.46)

where K and µ are positive constants independent of n.


It will first be proved that

lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n ≤ D(ρ ‖ σ).    (4.47)

The projection operator ∆n commutes with ρ⊗n , and therefore

h∆n Πn ∆n , ρ⊗n i ≥ 1 − 2K exp(−µn) (4.48)

by Remark 4.6. It is therefore the case that h∆n , ρ⊗n i, hΠn , ρ⊗n i, and h∆n Πn ∆n , ρ⊗n i
are all positive for all but finitely many n, and we will restrict our attention to those
n for which these values are all positive.
Define a density operator

ξ_n = Π_n ∆_n ρ^{⊗n} ∆_n Π_n / ⟨∆_n Π_n ∆_n, ρ^{⊗n}⟩    (4.49)

for all values of n under consideration. We will begin by proving a bound on the
trace distance between ξ n and ρ⊗n . To this end, observe that

(1/2)‖ξ_n − ρ^{⊗n}‖_1 ≤ (1/2)‖ξ_n − τ_n‖_1 + (1/2)‖τ_n − ρ^{⊗n}‖_1    (4.50)
for

τ_n = ∆_n ρ^{⊗n} ∆_n / ⟨∆_n, ρ^{⊗n}⟩,    (4.51)

and notice that

ξ_n = Π_n τ_n Π_n / ⟨Π_n, τ_n⟩.    (4.52)

By Winter’s gentle measurement lemma and one of the Fuchs–van de Graaf in-
equalities, we find that

(1/2)‖ξ_n − τ_n‖_1 = (1/2)‖ Π_n τ_n Π_n / ⟨Π_n, τ_n⟩ − τ_n ‖_1 ≤ √( 1 − ⟨Π_n, τ_n⟩ ),    (4.53)

and because
⟨Π_n, τ_n⟩ = ⟨∆_n Π_n ∆_n, ρ^{⊗n}⟩ / ⟨∆_n, ρ^{⊗n}⟩ ≥ ⟨∆_n Π_n ∆_n, ρ^{⊗n}⟩,    (4.54)
we obtain
(1/2)‖ξ_n − τ_n‖_1 ≤ √( 1 − ⟨∆_n Π_n ∆_n, ρ^{⊗n}⟩ ) ≤ √( 2K exp(−µn) ).    (4.55)
2
Along similar (although simpler) lines, we find that

(1/2)‖τ_n − ρ^{⊗n}‖_1 ≤ √( 1 − ⟨∆_n, ρ^{⊗n}⟩ ) ≤ √( K exp(−µn) ).    (4.56)
These upper bounds are decreasing to 0 (exponentially quickly, as it happens), and
therefore
(1/2)‖ξ_n − ρ^{⊗n}‖_1 ≤ ε,    (4.57)
or equivalently ξ_n ∈ B_ε(ρ^{⊗n}), implying

D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) ≤ D_max(ξ_n ‖ σ^{⊗n}),    (4.58)
for all but finitely many n. Let us further restrict our attention to these values of n.
Next, we will use the inequalities (4.43) and (4.44) to obtain an upper bound on
D_max(ξ_n ‖ σ^{⊗n}). First, by (4.44), together with ∆_n ≤ 1 and Π_n² = Π_n, we find that

Π_n ∆_n ρ^{⊗n} ∆_n Π_n ≤ 2^{−(1−δ)n H(ρ)} Π_n ∆_n Π_n ≤ 2^{−(1−δ)n H(ρ)} Π_n.    (4.59)

Second, by (4.43), together with the fact that [Π_n, σ^{⊗n}] = 0, we have

Π_n ≤ 2^{−(1+δ)n Tr(ρ log(σ))} Π_n σ^{⊗n} Π_n ≤ 2^{−(1+δ)n Tr(ρ log(σ))} σ^{⊗n}.    (4.60)

Combining (4.59) and (4.60), we obtain

Π_n ∆_n ρ^{⊗n} ∆_n Π_n ≤ 2^{n D(ρ‖σ) + δn(H(ρ) − Tr(ρ log(σ)))} σ^{⊗n}.    (4.61)

Accounting for the normalization of ξ n and making use of (4.48), we find that

D_max(ξ_n ‖ σ^{⊗n}) ≤ n D(ρ ‖ σ) + δn(H(ρ) − Tr(ρ log(σ))) − log( 1 − 2K exp(−µn) ).    (4.62)

At this point we may conclude that

lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n ≤ D(ρ ‖ σ) + δ(H(ρ) − Tr(ρ log(σ))),    (4.63)
and as δ was an arbitrarily chosen positive real number, we obtain the required
inequality (4.47).
Now we will prove the reverse inequality

lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n ≥ D(ρ ‖ σ).    (4.64)
Let δ > 0 again be chosen arbitrarily. For every positive integer n it is the case that

⟨σ^{⊗n}, Π_n ∆_n Π_n⟩ = ⟨∆_n, Π_n σ^{⊗n} Π_n⟩ ≤ 2^{(1−δ)n Tr(ρ log(σ))} ⟨∆_n, Π_n⟩    (4.65)

by (4.43), as well as

⟨∆_n, Π_n⟩ ≤ 2^{(1+δ)n H(ρ)} ⟨∆_n ρ^{⊗n} ∆_n, Π_n⟩ ≤ 2^{(1+δ)n H(ρ)}    (4.66)

by (4.44). It therefore follows that the operator

Z_n = 2^{n D(ρ‖σ) − δn(H(ρ) − Tr(ρ log(σ)))} Π_n ∆_n Π_n    (4.67)

is positive semidefinite and satisfies ⟨σ^{⊗n}, Z_n⟩ ≤ 1. By an inspection of the dual
form of Optimization Problem 3.2, the conic program for the exponential of the
smoothed max-relative entropy, we conclude that

D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) ≥ n D(ρ ‖ σ) − δn(H(ρ) − Tr(ρ log(σ))) + inf_{ξ_n ∈ B_ε(ρ^{⊗n})} log⟨Π_n ∆_n Π_n, ξ_n⟩.    (4.68)

For every ξ n ∈ Bε (ρ⊗n ) we have, by virtue of the fact that ρ⊗n − ξ n is traceless and
0 ≤ Πn ∆n Πn ≤ 1, that
⟨Π_n ∆_n Π_n, ρ^{⊗n} − ξ_n⟩ ≤ (1/2)‖ρ^{⊗n} − ξ_n‖_1 ≤ ε    (4.69)
2
and therefore
hΠn ∆n Πn , ξ n i = hΠn ∆n Πn , ρ⊗n i + hΠn ∆n Πn , ξ n − ρ⊗n i
(4.70)
≥ 1 − 6K exp(−µn) − ε
by Lemma 4.5. Consequently,

D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) ≥ n D(ρ ‖ σ) − δn(H(ρ) − Tr(ρ log(σ))) + log( 1 − 6K exp(−µn) − ε ).    (4.71)


Given the assumption ε ∈ (0, 1), one concludes that log 1 − 6K exp(−µn) − ε
converges to a constant value as n goes to infinity. It follows that
lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n ≥ D(ρ ‖ σ) − δ(H(ρ) − Tr(ρ log(σ))).    (4.72)
Once again, as δ was an arbitrarily chosen positive real number, the required in-
equality (4.64) follows.
Remark 4.11. An alternative way to argue the closeness of ξ n to ρ⊗n is to use a
different known equality concerning the fidelity function, which is that
F(AA^∗, BB^∗) = ‖A^∗B‖_1    (4.73)
for any choice of operators A, B ∈ L(X, Y). (This fact is closely connected with
Uhlmann’s theorem, and can be found as Lemma 3.21 in my book Theory of Quan-
tum Information.) In the present case, we obtain
F( Π_n ∆_n ρ^{⊗n} ∆_n Π_n, ρ^{⊗n} ) = ‖ √(ρ^{⊗n}) ∆_n Π_n √(ρ^{⊗n}) ‖_1 ≥ Tr( √(ρ^{⊗n}) ∆_n Π_n √(ρ^{⊗n}) ) = ⟨∆_n Π_n ∆_n, ρ^{⊗n}⟩,    (4.74)

where the last equality makes use of [∆n , ρ⊗n ] = 0. A suitable bound on the trace
distance between ξ n and ρ⊗n is obtained through the Fuchs–van de Graaf inequal-
ities.
And we will conclude with two corollaries.
It is not important that σ is a density operator in Theorem 4.10—it is true for
arbitrary positive semidefinite models. The only part of the proof that depends on
the scaling of σ occurs in the proof of Lemma 4.4, where q ∈ P(Σ) implies that
φ( a) = − log(q( a)) is nonnegative. Although it would not be difficult to modify
this portion of the proof slightly to handle arbitrary positive semidefinite models,
it is perhaps simpler to observe it as a fairly straightforward corollary of Theo-
rem 4.10.
Corollary 4.12. If ρ is a density operator and Q is any positive semidefinite operator, we
have
lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ Q^{⊗n}) / n = D(ρ ‖ Q).    (4.75)
Proof. Let σ = Q/ Tr( Q). Then
D^ε_max(ρ^{⊗n} ‖ Q^{⊗n}) = D^ε_max(ρ^{⊗n} ‖ Tr(Q)^n σ^{⊗n}) = D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) − n log(Tr(Q)),

so

lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ Q^{⊗n}) / n = lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n − log(Tr(Q))
                                      = D(ρ ‖ σ) − log(Tr(Q))
                                      = D(ρ ‖ Q),

as required.

The second and final corollary is quite spectacular. Of course it is well-known,


and we proved it in CS 766/QIC 820, but now we have a completely different
alternative proof.

Corollary 4.13 (Monotonicity of quantum relative entropy). Let ρ, σ ∈ D(X) be


density operators and let Φ ∈ C(X, Y) be a channel, for complex Euclidean spaces X
and Y. It is the case that D(Φ(ρ)k Φ(σ)) ≤ D(ρ k σ).

Proof. Observe first that the smoothed max-relative entropy is monotonic:


D^ε_max(Φ(ρ) ‖ Φ(σ)) ≤ inf_{ξ ∈ B_ε(ρ)} D_max(Φ(ξ) ‖ Φ(σ)) ≤ inf_{ξ ∈ B_ε(ρ)} D_max(ξ ‖ σ) = D^ε_max(ρ ‖ σ)    (4.76)

by the monotonicity of the max-relative entropy. Therefore,

D(Φ(ρ) ‖ Φ(σ)) = lim_{n→∞} D^ε_max(Φ(ρ)^{⊗n} ‖ Φ(σ)^{⊗n}) / n
               = lim_{n→∞} D^ε_max(Φ^{⊗n}(ρ^{⊗n}) ‖ Φ^{⊗n}(σ^{⊗n})) / n    (4.77)
               ≤ lim_{n→∞} D^ε_max(ρ^{⊗n} ‖ σ^{⊗n}) / n
               = D(ρ ‖ σ),

as required.

Lecture 5

Min-relative entropy, conditional max-entropy, and hypothesis-testing relative entropy

In this lecture we will discuss a few additional generalized entropy measures,


namely the min-relative entropy, the conditional max-entropy, and the hypothesis-
testing relative entropy. We will discuss various properties of these quantities, and
relate them to the other entropic quantities we have previously discussed.

5.1 Min-relative entropy


We will begin with the min-relative entropy.

Definition 5.1 (Quantum min-relative entropy). Let ρ ∈ D(X) be a density oper-


ator and let Q ∈ Pos(X) be a positive semidefinite operator, for X a complex Eu-
clidean space. The quantum min-relative entropy (or min-relative entropy, for short)
of ρ with respect to Q is defined as

D_min(ρ ‖ Q) = − log( F(ρ, Q)² ),    (5.1)

where
F(ρ, Q) = ‖ √ρ √Q ‖_1    (5.2)

is the fidelity between ρ and Q.

Remark 5.2. In the case that F(ρ, Q) = 0, which is equivalent to im(ρ) ⊥ im( Q),
one is to interpret that Dmin (ρk Q) = ∞.

Elementary observations
Here are a couple of relevant properties of the min-relative entropy that follow
directly from known properties of the fidelity function.

1. A variant of Klein’s inequality holds for the min-relative entropy. That is, if
ρ, σ ∈ D(X) are density operators, then Dmin (ρkσ) ≥ 0, with equality if and
only if ρ = σ.
2. The min-relative entropy is monotonic with respect to the action of channels.
That is, for every choice of ρ ∈ D(X), Q ∈ Pos(X), and Φ ∈ C(X, Y), it is the
case that
Dmin (Φ(ρ)kΦ( Q)) ≤ Dmin (ρk Q).
This is true, in fact, for all positive and trace-preserving maps Φ ∈ T(X, Y).

One can identify additional properties of the min-relative entropy through its very
direct connection to the fidelity function, which we know to have many interesting
and remarkable properties.
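For concreteness, here is a small numerical sketch (Python; the helper functions are ad hoc and assume full-rank inputs) that evaluates the min-relative entropy directly from the fidelity and compares it with the ordinary quantum relative entropy, illustrating the inequality proved in the next subsection.

import numpy as np
from scipy.linalg import logm

def psd_sqrt(P):
    w, U = np.linalg.eigh(P)
    return (U * np.sqrt(np.clip(w, 0, None))) @ U.conj().T

def rand_state(d, rng):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def D_min(rho, sigma):
    F = np.sum(np.linalg.svd(psd_sqrt(rho) @ psd_sqrt(sigma), compute_uv=False))
    return -2 * np.log2(F)

def D(rho, sigma):
    # ordinary quantum relative entropy, in bits (assumes full-rank inputs)
    return np.trace(rho @ (logm(rho) - logm(sigma))).real / np.log(2)

rng = np.random.default_rng(2)
rho, sigma = rand_state(3, rng), rand_state(3, rng)
print(D_min(rho, sigma), "<=", D(rho, sigma))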

Relationship to the quantum relative entropy


The following theorem reveals that the min-relative entropy is upper-bounded by
the ordinary quantum relative entropy.

Theorem 5.3. Let ρ ∈ D(X) be a density operator and let Q ∈ Pos(X) be a positive
semidefinite operator, for X a complex Euclidean space. It is the case that

Dmin (ρk Q) ≤ D(ρk Q). (5.3)

Proof. The theorem is trivial in the case im(ρ) 6⊆ im( Q), as the right-hand side of
(5.3) is infinite in this case, so the remainder of the proof is focused on the case
im(ρ) ⊆ im( Q). There is no loss of generality in assuming that Q is positive defi-
nite in this case, as the values of Dmin (ρk Q) and D(ρk Q) then do not change if X
is replaced by im( Q).
Define a function φ : (−1, 1) → R as

φ(α) = − ln Tr(ρ1−α Qα ). (5.4)

We are using the natural logarithm because it will simplify the calculus that will
soon be considered. It is not really important to the proof that this function is de-
fined on the entire interval (−1, 1), we only require that the function is defined on

the interval [0, 1/2] and is differentiable at α = 0. But, in any case, φ is differen-
tiable at every point α ∈ (−1, 1), with the derivative being given by
φ′(α) = Tr( ρ^{1−α} Q^α (ln(ρ) − ln(Q)) ) / Tr( ρ^{1−α} Q^α ).    (5.5)


Notice in particular that

φ′(0) = Tr(ρ ln(ρ)) − Tr(ρ ln(Q)) = D(ρ ‖ Q) / log(e).    (5.6)

We also observe that


φ(1/2) = − ln Tr( √ρ √Q ) ≥ D_min(ρ ‖ Q) / (2 log(e))    (5.7)

with the inequality following from


√ p √ p 
F(ρ, Q) = ρ Q ≥ Tr ρ Q . (5.8)
1

Finally, noting that φ(0) = 0, we see that the theorem will follow from a demon-
stration that
φ(1/2) − φ(0)
φ 0 (0) ≥ . (5.9)
1/2
This in turn will follow from a demonstration that φ is a concave function.
To prove that φ is concave, it suffices to compute its second derivative and
observe that its value is non-positive. To make this as simple as possible, and to
avoid a messy calculation, let us use the spectral theorem to write

ρ= ∑ p(a)xa x∗a and Q= ∑ q(b)yb y∗b (5.10)


a∈Σ b∈Γ

for alphabets Σ and Γ, orthonormal sets { x a : a ∈ Σ} and {yb : b ∈ Γ}, a proba-


bility vector p ∈ P(Σ), and q ∈ (0, ∞)Γ being a vector of positive real numbers. Let
us also define a function

r_{a,b}(α) = |⟨x_a, y_b⟩|² p(a)^{1−α} q(b)^α / Tr(ρ^{1−α} Q^α)    (5.11)

for every ( a, b) ∈ Σ × Γ and α ∈ (−1, 1), so that

φ′(α) = ∑_{(a,b)∈Σ×Γ} r_{a,b}(α) ( ln(p(a)) − ln(q(b)) ).    (5.12)

We may then express the second derivative of φ as
φ″(α) = ( ∑_{(a,b)∈Σ×Γ} r_{a,b}(α) ( ln(p(a)) − ln(q(b)) ) )²
        − ∑_{(a,b)∈Σ×Γ} r_{a,b}(α) ( ln(p(a)) − ln(q(b)) )².    (5.13)

Observing that r a,b (α) ≥ 0 and

∑ r a,b (α) = 1 (5.14)


( a,b)∈Σ×Γ

for every α ∈ (−1, 1), we find that φ00 (α) is non-positive by Jensen’s inequality,
which completes the proof.

5.2 Conditional max-entropy


Next we will discuss the conditional max-entropy, which is defined through the
min-relative entropy in precisely the same way that the conditional min-entropy is
defined through the max-relative entropy.

Definition 5.4 (Conditional max-entropy). Let ρ ∈ D(X ⊗ Y) be a state of a pair of


registers (X, Y ). The conditional max-entropy of X given Y for the state ρ is defined
as
H_max(X|Y)_ρ = − inf_{σ∈D(Y)} D_min(ρ ‖ 1_X ⊗ σ).    (5.15)

Equivalently,

H_max(X|Y)_ρ = sup_{σ∈D(Y)} log( F(ρ, 1_X ⊗ σ)² ).    (5.16)

We obtain from Theorem 5.3 that H(X|Y )ρ ≤ Hmax (X|Y )ρ , and so

Hmin (X|Y )ρ ≤ H(X|Y )ρ ≤ Hmax (X|Y )ρ . (5.17)

Writing n = dim(X) and ω = 1X /n, we see from the definition of the conditional
max-entropy that

H_max(X|Y)_ρ = log(n) + sup_{σ∈D(Y)} log( F(ρ, ω ⊗ σ)² ).    (5.18)

Semidefinite program for conditional max-entropy
One way to compute the value Hmax (X|Y )ρ is to use the semidefinite program for
the fidelity function that was discussed in CS 766/QIC 820, obtaining the following
semidefinite program.

Optimization Problem 5.5 (SDP for conditional max-entropy)

Primal problem
  maximize:   (1/2) Tr(X) + (1/2) Tr(X^∗)
  subject to: [ ρ   X ; X^∗   1_X ⊗ σ ] ≥ 0,
              X ∈ L(X ⊗ Y), σ ∈ D(Y).

Dual problem
  minimize:   (1/2)⟨ρ, Y⟩ + (1/2) λ_1(Tr_X(Z))
  subject to: [ Y   −1_{X⊗Y} ; −1_{X⊗Y}   Z ] ≥ 0,
              Y, Z ∈ Pos(X ⊗ Y).

It is the case that


Hmax (X|Y )ρ = 2 log(α), (5.19)
for
α = sup F(ρ, 1X ⊗ σ) (5.20)
σ ∈D(Y)

being the optimal value of this semidefinite program.
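The following sketch evaluates this semidefinite program numerically. It assumes the cvxpy package together with an SDP-capable solver; the function name and variable layout are ad hoc, and the primal problem is encoded by taking the full block matrix as a single Hermitian variable.

import numpy as np
import cvxpy as cp

def h_max(rho, dX, dY):
    d = dX * dY
    sigma = cp.Variable((dY, dY), hermitian=True)
    W = cp.Variable((2 * d, 2 * d), hermitian=True)   # block matrix [[rho, X], [X*, 1 (x) sigma]]
    constraints = [
        W >> 0,
        W[:d, :d] == rho,
        W[d:, d:] == cp.kron(np.eye(dX), sigma),
        cp.trace(sigma) == 1,
    ]
    # objective: (1/2) Tr(X) + (1/2) Tr(X*) = Re Tr(X)
    alpha = cp.Problem(cp.Maximize(cp.real(cp.trace(W[:d, d:]))), constraints).solve()
    return 2 * np.log2(alpha)

# sanity check: Example 5.9 gives H_max(X|Y) = -log(n) for a maximally entangled state
d = 2
tau = np.zeros((d * d, d * d), dtype=complex)
for a in range(d):
    for b in range(d):
        tau[a * d + a, b * d + b] = 1 / d
print(h_max(tau, d, d))   # approximately -1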

Remark 5.6. The dual problem may be simplified to obtain the expression

α = inf_{Z>0} ( ⟨ρ, Z⟩/2 + ‖Tr_X(Z^{−1})‖/2 )    (5.21)

for the optimal value of Optimization Problem 5.5. Using the arithmetic-geometric
mean inequality, one may conclude that

H_max(X|Y)_ρ = inf_{Z>0} ( log⟨ρ, Z⟩ + log‖Tr_X(Z^{−1})‖ ).    (5.22)

Examples
We may again consider a few examples of classes of states, to gain some intuition
on the conditional max-entropy.

Example 5.7. For any choice of σ ∈ D(X) and ξ ∈ D(Y), it is the case that

H_max(X|Y)_{σ⊗ξ} = 2 log Tr(√σ).    (5.23)

Example 5.8. Calculating the conditional max-entropy Hmax (X|Y )ρ for a classical-
quantum state ρ of the form
n
ρ= ∑ pk | k ih k | ⊗ ξ k (5.24)
k =1

yields
H_max(X|Y)_ρ = log( sup_{σ∈D(Y)} ( ∑_{k=1}^{n} √(p_k) F(ξ_k, σ) )² ).    (5.25)

Example 5.9. Let n = dim(X), let

1 n
n a,b∑
τ= | a ih b | ⊗ | a ih b | (5.26)
=1

and suppose that τ can be recovered perfectly by applying a channel locally to Y


for the state ρ ∈ D(X ⊗ Y). This is equivalent to ρ taking the form

ρ = (1X ⊗ V )(τ ⊗ ξ )(1X ⊗ V )∗ (5.27)

for some choice of a density operator ξ ∈ D(Z) and an isometry V ∈ U(X ⊗ Z, Y).
Then we have
Hmax (X|Y )ρ = − log(n), (5.28)
just like the conditional min-entropy and conditional quantum entropy.

A relationship between conditional min- and max-entropy


We will now prove a fundamental relationship between conditional min- and max-
entropy, which is stated in the following theorem.

Theorem 5.10. Let X, Y, and Z be registers and assume the triple (X, Y, Z) is in a pure
state uu∗ , for u ∈ X ⊗ Y ⊗ Z a unit vector. It is the case that

Hmin (X|Y ) + Hmax (X|Z) = 0. (5.29)

Proof. First let us prove


2− Hmin (X|Y) ≤ 2Hmax (X|Z) . (5.30)
Let
ρ = TrZ (uu∗ ) (5.31)
and choose Φ ∈ C(Y, X) to be a channel for which
2
2− Hmin (X|Y) = F (1L(X) ⊗ Φ)(ρ), vec(1X ) vec(1X )∗ . (5.32)

Among the many nice properties that the fidelity function possesses is the fact that
if Q0 , Q1 ∈ Pos(U) and P0 ∈ Pos(U ⊗ V) satisfies TrV ( P0 ) = Q0 , then

F(Q_0, Q_1) = max{ F(P_0, P_1) : P_1 ∈ Pos(U ⊗ V), Tr_V(P_1) = Q_1 }.    (5.33)

From this fact we find that


2^{−H_min(X|Y)} = F( (1_{L(X)} ⊗ Φ ⊗ 1_{L(Z)})(uu^∗), vec(1_X) vec(1_X)^∗ ⊗ σ )²    (5.34)

for some state σ ∈ D(Z), as operators of the form vec(1X ) vec(1X )∗ ⊗ σ are the
only operators that leave vec(1X ) vec(1X )∗ when Z is traced out. The fidelity is
nondecreasing under the partial trace on the second tensor factor of X, and there-
fore

2^{−H_min(X|Y)} ≤ F( Tr_Y(uu^∗), 1_X ⊗ σ )² ≤ 2^{H_max(X|Z)}.    (5.35)
Now let us prove the reverse inequality

2Hmax (X|Z) ≤ 2− Hmin (X|Y) , (5.36)

which is based on a similar idea. Choose a density operator σ ∈ D(Z) so that


2
2Hmax (X|Z) = F TrY (uu∗ ), 1X ⊗ σ . (5.37)

The operator 1X ⊗ σ can be purified as


vec(1_X ⊗ √σ) vec(1_X ⊗ √σ)^∗ ∈ Pos(X ⊗ Z ⊗ X ⊗ Z).    (5.38)

Every extension of TrY (uu∗ ) to an element of Pos(X ⊗ Z ⊗ X ⊗ Z) can be expressed


as
1L(X) ⊗ Ξ ⊗ 1L(Z) (uu∗ )

(5.39)
for a channel Ξ ∈ C(Y, Z ⊗ X), meaning that these are exactly the operators that
leave TrY (uu∗ ) when the first tensor factor of Z and the second tensor factor of X
are traced out. By the same fact regarding the fidelity function from before, we find
that
2^{H_max(X|Z)} = F( (1_{L(X)} ⊗ Ξ ⊗ 1_{L(Z)})(uu^∗), vec(1_X ⊗ √σ) vec(1_X ⊗ √σ)^∗ )²    (5.40)

for some choice of a channel Ξ ∈ C(Y, Z ⊗ X). Because the fidelity is nondecreasing
under the partial trace on both copies of Z, we obtain
2^{H_max(X|Z)} ≤ F( (1_{L(X)} ⊗ Φ)(Tr_Z(uu^∗)), vec(1_X) vec(1_X)^∗ )²    (5.41)

for Φ = TrZ ◦ Ξ ∈ C(Y, X). This implies

2Hmax (X|Z) ≤ 2− Hmin (X|Y) , (5.42)

as required.

5.3 Hypothesis-testing relative entropy
We will define the hypothesis-testing relative entropy as follows.

Definition 5.11. Let ρ ∈ D(X), Q ∈ Pos(X), and ε ∈ [0, 1]. The ε-hypothesis-testing
relative entropy of ρ with respect to Q is defined as

D^ε_H(ρ ‖ Q) = − inf{ log⟨Q, X⟩ : X ∈ Pos(X), ⟨ρ, X⟩ ≥ 1, εX ≤ 1 }.    (5.43)

Elementary observations
Before we try to understand the intuitive meaning of this quantity, let us note a
few simple things about it.
First, we see that D^ε_H(ρ ‖ Q) = ∞ is possible:
1. D^0_H(ρ ‖ Q) = ∞ if and only if im(ρ) ⊄ im(Q).
2. For ε ∈ (0, 1) we have D^ε_H(ρ ‖ Q) = ∞ if and only if ⟨Π_ker(Q), ρ⟩ ≥ ε.    (5.44)
3. D^1_H(ρ ‖ Q) = ∞ if and only if im(ρ) ⊥ im(Q).

Next, notice that as ε decreases, the infimum decreases, because decreasing ε


means relaxing the constraint εX ≤ 1, and therefore the ε-hypothesis-testing rel-
ative entropy increases: δ ≤ ε implies DH δ
(ρk Q) ≥ DHε
(ρk Q). Stated another way,
taking ε to be smaller means taking the ε-hypothesis-testing relative entropy to
be a stronger notion of divergence. (The same may be said about the ε-smoothed
max-relative entropy.)
Continuing on, although our primary focus will be on the range of values
ε ∈ (0, 1), we will take a moment to consider the extreme cases ε = 0 and ε = 1. If
it is the case that ε = 0, the constraint εX ≤ 1 is trivially satisfied. We obtain pre-
cisely the max-relative entropy, as an examination of the dual form of Optimization
Problem 2.4 reveals:
D0H (ρk Q) = Dmax (ρk Q). (5.45)
At the other extreme, we may consider ε = 1, so that the constraint εX ≤ 1 becomes
X ≤ 1. The infimum is evidently achieved for X = Πim(ρ) , and so we obtain

D1H (ρk Q) = − log Q, Πim(ρ) . (5.46)

This quantity has also been called the min-relative entropy by some, but obviously
we will not use this name given that we have already used it for something else—
the name 1-hypothesis-testing relative entropy will do just fine.

Semidefinite programming characterization
The definition of the hypothesis-testing relative entropy immediately suggests a
ε
semidefinite programming characterization. Specifically, the value DH (ρk Q) is the
negative logarithm of the optimal value of the following semidefinite program.

Optimization Problem 5.12 (SDP for hypothesis-testing relative entropy)

Primal problem
  minimize:   ⟨Q, X⟩
  subject to: ⟨ρ, X⟩ ≥ 1,
              εX ≤ 1,
              X ∈ Pos(X).

Dual problem
  maximize:   λ − Tr(Y)
  subject to: λρ ≤ Q + εY,
              Y ∈ Pos(X),
              λ ≥ 0.

The primal problem is strictly feasible provided that ε ∈ [0, 1). In particular,
X =  ((1 + ε)/(2ε)) 1   if ε ∈ (0, 1),    2 · 1   if ε = 0    (5.47)

is strictly primal feasible. Strict primal feasibility is, on the other hand, impossible
when ε = 1. The dual problem is strictly feasible when ε ∈ (0, 1], by λ = 1 and
Y = (2/ε)1 for instance. In the case ε = 0, the dual problem is strictly feasible if and
only if im(ρ) ⊆ im( Q).
Therefore, strong duality holds and the optimal values are achieved for all
choices of ρ and Q when ε ∈ (0, 1), by Slater’s theorem. We also have strong duality
and an optimal value achieved in the primal problem when ε = 1, again by Slater’s
theorem, as X = 1 is primal feasible (although not strictly so); and strong duality
also holds in the case ε = 0, as our examination of Optimization Problem 2.4 has
already revealed.
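A numerical sketch of the primal problem follows (again assuming cvxpy with an SDP-capable solver; the function name is ad hoc). The sanity check uses the variant of Klein's inequality discussed below, which gives D^ε_H(ρ ‖ ρ) = 0.

import numpy as np
import cvxpy as cp

def d_h(rho, Q, eps):
    # primal form of Optimization Problem 5.12
    d = rho.shape[0]
    X = cp.Variable((d, d), hermitian=True)
    constraints = [
        X >> 0,
        cp.real(cp.trace(rho @ X)) >= 1,
        np.eye(d) - eps * X >> 0,
    ]
    value = cp.Problem(cp.Minimize(cp.real(cp.trace(Q @ X))), constraints).solve()
    return -np.log2(value)

rho = np.diag([0.6, 0.4])
print(d_h(rho, rho, 0.25))   # approximately 0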

Interpretation
One way to interpret the ε-hypothesis-testing relative entropy, at least in the case
ε > 0, begins with the observation that

D^ε_H(ρ ‖ Q) = − inf{ log( ⟨Q, P⟩ / ε ) : 0 ≤ P ≤ 1, ⟨ρ, P⟩ ≥ ε }.    (5.48)

For the sake of the discussion that follows, let us consider the case that Q = σ is a
density operator.

We can now consider a test being performed that aims to distinguish between
the states ρ and σ. Think of σ as representing an idealized model for a state, corre-
sponding to the null-hypothesis of the test being performed, whereas ρ is the actual
state of the system being tested. The operator P may be associated with a mea-
surement operator, the outcome corresponding to which is to be seen as a signal
supporting an alternative hypothesis. If we were to measure ρ with respect to such
a measurement, the alternative hypothesis may not be signaled with high prob-
ability, but the probability is at least ε. Think of the probability ε as representing
how small of a signal one is willing to tolerate in support of an alternative hypoth-
esis. The value 2^{−D^ε_H(ρ‖Q)} is then equal to the smallest possible value for ⟨Q, P⟩/ε
that we could achieve by selecting P optimally. Informally speaking, D^ε_H(ρ ‖ σ) is a
measure of how surprising it would be to obtain the outcome corresponding to P
in the idealized case represented by σ.
The precise scaling in the definition above has been selected so that a variant of
Klein’s inequality holds, provided ε ∈ [0, 1). That is, for all ε ∈ [0, 1) it is the case
that D^ε_H(ρ ‖ σ) ≥ 0, with equality if and only if ρ = σ. Indeed, if σ ≠ ρ, we may
choose a unit vector u for which huu∗ , ρi > huu∗ , σi, and then consider

X = (1 − δ)1 + ( δ / ⟨uu^∗, ρ⟩ ) uu^∗    (5.49)

in the primal problem, for δ ∈ (0, 1). The operator X is clearly positive semidefi-
nite, and it is the case that hσ, X i < hρ, X i = 1. The constraint εX ≤ 1 is satisfied
so long as
1 − δ + δ / ⟨uu^∗, ρ⟩ ≤ 1/ε,    (5.50)
which is so for all sufficiently small δ, as 1/ε is strictly larger than 1 by the as-
sumption ε < 1. By selecting any such δ, one obtains a primal feasible X having
objective value strictly smaller than 1, which implies that D^ε_H(ρ ‖ σ) is positive. In
the case ε = 1, one has D^1_H(ρ ‖ σ) = 0 if and only if im(σ) ⊆ im(ρ).

Monotonicity under the action of channels


Suppose that Φ is a trace-preserving and positive map. Consider the dual form of
Optimization Problem 5.12. If it is the case that λ ≥ 0 and Y ∈ Pos(X) satisfy the
constraint
λρ ≤ Q + εY, (5.51)
then by the positivity of Φ it must also hold that

λΦ(ρ) ≤ Φ( Q) + εΦ(Y ), (5.52)

and therefore (λ, Φ(Y )) is dual-feasible for the instance of this problem corre-
sponding to DHε
(Φ(ρ)kΦ( Q)), with the same objective value being achieved by
the assumption that Φ preserves trace. It follows that

2− DH (Φ(ρ)kΦ(Q)) ≥ 2− DH (ρkQ) ,
ε ε
(5.53)

or, equivalently,
ε
DH (Φ(ρ)kΦ( Q)) ≤ DH
ε
( ρ k Q ). (5.54)

Relationship to smoothed max-relative entropy


We will conclude our discussion of the hypothesis-testing relative entropy by ob-
serving its close relationship to the smoothed max-relative entropy. The two theo-
rems that follow reveal this close relationship.
Recall that D^ε_max(ρ ‖ Q) is the logarithm of the optimal value of this conic program:

Optimization Problem 5.13 (SDP for smoothed max-relative entropy)

Primal problem
  minimize:   η
  subject to: ξ ≤ ηQ,
              ξ ∈ B_ε(ρ),
              η ≥ 0.

Dual problem
  maximize:   ψ^ε_ρ(Z)
  subject to: ⟨Q, Z⟩ ≤ 1,
              Z ∈ Pos(X).

where

ψ^ε_ρ(Z) = inf_{ξ ∈ B_ε(ρ)} ⟨ξ, Z⟩.    (5.55)

In the theorems that follow, we will again use trace-distance smoothing, as we


did in Lecture 4:
B_ε(ρ) = { ξ ∈ D(X) : (1/2)‖ρ − ξ‖_1 ≤ ε }.    (5.56)

Theorem 5.14. Let ρ ∈ D(X) and Q ∈ Pos(X) be operators satisfying im(ρ) ⊆ im( Q)
and let ε ∈ (0, 1). It is the case that

D^ε_max(ρ ‖ Q) ≤ D^ε_H(ρ ‖ Q) + log( 1/(ε(1 − ε)) ).    (5.57)

Proof. Consider any operator Z ∈ Pos(X) that satisfies h Q, Z i ≤ 1, and let


n
Z= ∑ λk (Z) zk z∗k (5.58)
k =1

be a spectral decomposition of Z. We will use the basis {z1 , . . . , zn } to construct a
family of feasible solutions X to the primal form of Optimization Problem 5.12.
In particular, for each real number λ ∈ R, define a subset Sλ ⊆ {1, . . . , n} as

S_λ = { k ∈ {1, . . . , n} : ⟨z_k z_k^∗, ρ − 2^λ Q⟩ > 0 },    (5.59)

and define
Πλ = ∑ zk z∗k . (5.60)
k ∈ Sλ

Observe that
⟨Π_λ, ρ⟩ ≥ 2^λ ⟨Π_λ, Q⟩,    (5.61)
with the inequality being strict so long as Π_λ is nonzero. The operator

X = Π_λ / ε    (5.62)
is therefore feasible for the primal form of Optimization Problem 5.12 provided that
hΠλ , ρi ≥ ε, and with this observation in mind it is informative to consider the
supremum over all such values of λ:

γ = sup{ λ ∈ R : ⟨Π_λ, ρ⟩ ≥ ε }.    (5.63)

In particular, for every δ > 0 we see that the operator

X = Π_{γ−δ} / ε    (5.64)
is primal feasible, and so we obtain the inequalities

2^{−D^ε_H(ρ‖Q)} ≤ ⟨Q, Π_{γ−δ}⟩ / ε ≤ 2^{−γ+δ} ⟨ρ, Π_{γ−δ}⟩ / ε ≤ 2^{−γ+δ} / ε.    (5.65)
As these inequalities hold for every δ > 0, it follows that
D^ε_H(ρ ‖ Q) ≥ γ + log(ε).    (5.66)

On the other hand, for any choice of δ > 0, it must be that hΠγ+δ , ρi < ε. The
density operator
ξ = (1 − Π_{γ+δ}) ρ (1 − Π_{γ+δ}) / ⟨1 − Π_{γ+δ}, ρ⟩    (5.67)
therefore satisfies

F(ρ, ξ) ≥ √(1 − ε)    (5.68)

by Winter’s gentle measurement lemma, which implies that
(1/2)‖ρ − ξ‖_1 ≤ √ε    (5.69)
by one of the Fuchs–van de Graaf inequalities. Consequently,

ψ^ε_ρ(Z) ≤ ⟨ξ, Z⟩ = (1 / ⟨1 − Π_{γ+δ}, ρ⟩) ∑_{k ∉ S_{γ+δ}} λ_k(Z) ⟨z_k z_k^∗, ρ⟩
    ≤ (2^{γ+δ} / (1 − ε)) ∑_{k ∉ S_{γ+δ}} λ_k(Z) ⟨z_k z_k^∗, Q⟩ ≤ (2^{γ+δ} / (1 − ε)) ⟨Z, Q⟩ ≤ 2^{γ+δ} / (1 − ε).    (5.70)

As this is so for all δ > 0, it follows that



log ψ^ε_ρ(Z) ≤ γ − log(1 − ε),    (5.71)

and therefore

log ψ^ε_ρ(Z) ≤ D^ε_H(ρ ‖ Q) + log( 1/(ε(1 − ε)) ).    (5.72)
By optimizing over all operators Z ∈ Pos(X) that satisfy h Q, Z i ≤ 1 we obtain the
required inequality.
Theorem 5.15. Let ρ ∈ D(X) and Q ∈ Pos(X) be operators satisfying im(ρ) ⊆ im( Q),
let ε ∈ (0, 1), and let δ ∈ (0, 1 − ε). It is the case that
D^{ε+δ}_H(ρ ‖ Q) ≤ D^ε_max(ρ ‖ Q) − log( δ/(ε + δ) ).    (5.73)
Proof. Suppose that ξ ∈ Bε (ρ) satisfies

ξ ≤ 2^{D^ε_max(ρ‖Q)} Q.    (5.74)

Given that ξ ∈ Bε (ρ) we have that

ρ≤ξ+R (5.75)

for R ∈ Pos(X) satisfying Tr( R) ≤ ε.


Now consider the quantity D^{ε+δ}_H(ρ ‖ Q), and in particular consider the choices

λ = 2^{−D^ε_max(ρ‖Q)}   and   Y = 2^{−D^ε_max(ρ‖Q)} R / (ε + δ)    (5.76)
in the dual form of Optimization Problem 5.12 (with ε being replaced by ε + δ).
These choices represent a feasible solution to this conic program, and the objective
value is at least  
2^{−D^ε_max(ρ‖Q)} ( 1 − ε/(ε + δ) ).    (5.77)

It follows that

D^{ε+δ}_H(ρ ‖ Q) ≤ D^ε_max(ρ ‖ Q) − log( δ/(ε + δ) ),    (5.78)
which completes the proof.

Corollary 5.16. Let ρ ∈ D(X) and Q ∈ Pos(X) be operators. For every ε ∈ (0, 1), we
have
lim_{n→∞} D^ε_H(ρ^{⊗n} ‖ Q^{⊗n}) / n = D(ρ ‖ Q).    (5.79)

Lecture 6

Nonlocal games and Tsirelson’s


theorem

In this lecture we will discuss nonlocal games, which offer a model through which
the phenomenon of nonlocality is commonly studied. We will then narrow our
focus to XOR games, which are a highly restricted form of nonlocal games that
can, perhaps surprisingly, be analyzed through semidefinite programming. This is
made possible by Tsirelson’s theorem, which we will prove in this lecture.

6.1 Nonlocal games


We will begin by introducing the nonlocal games model. A nonlocal game is a hy-
pothetical game in which two cooperating players, Alice and Bob, each receive a
question from a referee, and then respond with an answer. The referee randomly
selects the questions according to a known distribution, and, upon receiving an-
swers from Alice and Bob, decides whether they win or lose. The following defini-
tion makes this notion precise in mathematical terms.

Definition 6.1. A nonlocal game is a 6-tuple G = ( X, Y, A, B, π, V ), where

1. X, Y, A, and B are finite and nonempty sets,


2. π ∈ P(X × Y ) is a probability vector, and
3. V : A × B × X × Y → {0, 1} is a predicate.

In this definition, the sets X and Y are the sets of questions, and A and B are
the sets of answers, for Alice and Bob, respectively. The probability vector π de-
termines the probability with which each pair of questions ( x, y) ∈ X × Y is se-
lected by the referee, and V determines whether or not a pair of answers ( a, b)
wins or loses for a given pair of questions ( x, y). For a given pair of questions

( x, y) ∈ X × Y and a pair of answers ( a, b) ∈ A × B, we write the value of the pred-
icate as V ( a, b | x, y), because that’s the way Ben Toner prefers it to be written—as it
helps to stress the idea that ( a, b) either wins or loses given that the question pair
( x, y) was selected.

Example 6.2 (The CHSH game). The CHSH game (named after Clauser, Horn,
Shimony, and Holt) is a nonlocal game in which the questions and answers cor-
respond to binary values, X = Y = A = B = {0, 1}, the probability vector π is
uniform,
1
π (0, 0) = π (0, 1) = π (1, 0) = π (1, 1) = , (6.1)
4
and the predicate V is defined as
(
1 if a ⊕ b = x ∧ y
V ( a, b| x, y) = (6.2)
0 if a ⊕ b 6= x ∧ y,

where a ⊕ b denotes the XOR of a and b, and x ∧ y denotes the AND of x and y.
Intuitively speaking, if the referee selects any of the question pairs (0, 0), (0, 1),
or (1, 0), then Alice and Bob must provide a pair of answers ( a, b) for which a = b
in order to win, while if the referee selects the question pair (1, 1), the answer ( a, b)
wins when a 6= b.
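Because the question and answer sets are so small, the classical value of the CHSH game can be computed by brute force over deterministic strategies, as in the following sketch (Python; the function name is ad hoc).

from itertools import product

def chsh_classical_value():
    best = 0
    for f in product([0, 1], repeat=2):        # Alice's answers f[x]
        for g in product([0, 1], repeat=2):    # Bob's answers g[y]
            wins = sum(1 for x, y in product([0, 1], repeat=2)
                       if (f[x] ^ g[y]) == (x & y))
            best = max(best, wins / 4)
    return best

print(chsh_classical_value())   # 0.75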

Example 6.3 (The FFL game). The FFL game (named after Fortnow, Feige, and
Lovász) is a nonlocal game in which the questions and answers correspond to
binary values, X = Y = A = B = {0, 1}, the probability vector π is given by

1
π (0, 0) = π (0, 1) = π (1, 0) = , π (1, 1) = 0, (6.3)
3

and the predicate V is defined as


(
1 if a ∨ x 6= b ∨ y
V ( a, b| x, y) = (6.4)
0 if a ∨ x = b ∨ y,

where a ∨ x denotes the OR of a and x, and similar for b ∨ y.


Intuitively speaking, if the referee asks the question pair (0, 0), then exactly one
of Alice and Bob, but not both, must respond with the answer 1 in order to win.
However, if the question pair is either (0, 1) or (1, 0), then the player who received
0 must answer 0 to win (and it does not matter what the player who received the
question 1 answers).

Example 6.4 (Graph coloring games). Suppose that H = (V, E) is an undirected
graph and k is a positive integer. Let us also define n = |V | and m = | E|, and
assume m ≥ 1. We may form a nonlocal game in the following way. The question
sets are both equal to the set of vertices, X = Y = {1, . . . , n}, and the answer sets
are given by A = B = {1, . . . , k}, which we may intuitively think about as colors.
The probability vector π is defined as follows:

π(x, y) =  1/(2n)  if x = y,    1/(4m)  if {x, y} ∈ E,    0  otherwise.    (6.5)

In words, the referee flips a fair coin, and if the outcome is heads, it randomly
selects a vertex and sends it to both players, and if the outcome is tails, it randomly
selects an edge and then sends the two incident vertices to the two players (again
at random). The predicate is defined as

V(a, b | x, y) =  1  if x = y and a = b,    1  if x ≠ y and a ≠ b,    0  otherwise.    (6.6)

The idea is that if Alice and Bob receive the same vertex, they should answer with
the same color, while if they receive different (adjacent) vertices, they should an-
swer with different colors.
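The classical value of small instances of the graph coloring game can likewise be computed by brute force over deterministic strategies. The following sketch (Python; the helper name is ad hoc) does this for the triangle graph, whose chromatic number is 3, so the value is 1 exactly when k ≥ 3.

from itertools import product

def coloring_game_value(edges, n, k):
    # classical value of the graph coloring game for a graph on n vertices, k colors
    m = len(edges)
    questions = [(x, x, 1 / (2 * n)) for x in range(n)]
    for (x, y) in edges:
        questions += [(x, y, 1 / (4 * m)), (y, x, 1 / (4 * m))]
    best = 0.0
    for f in product(range(k), repeat=n):
        for g in product(range(k), repeat=n):
            p = sum(w for (x, y, w) in questions
                    if (x == y and f[x] == g[y]) or (x != y and f[x] != g[y]))
            best = max(best, p)
    return best

triangle = [(0, 1), (1, 2), (0, 2)]
print(coloring_game_value(triangle, 3, 2))   # strictly less than 1
print(coloring_game_value(triangle, 3, 3))   # 1.0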

Strategies
The definition of a nonlocal game does not, in itself, specify or restrict the sorts
of strategies that Alice and Bob might employ when playing. There are, in fact,
different types of strategies that are of interest. Let us start with a short summary
of the strategy types that are of interest for this lecture.

1. Deterministic strategies. In a deterministic strategy, Alice must deterministically


choose her answer a based on her question x alone, and likewise Bob must
choose b based on y alone. A deterministic strategy may therefore be described
as a pair of functions ( f , g), where f : X → A and g : Y → B.
Notice that when we consider such a strategy, there is an implicit assumption
that Alice cannot see Bob’s question (or answer), and likewise Bob cannot see
Alice’s question (or answer). This sort of implicit assumption is also in place
for the other strategy types listed below, and is what makes nonlocal games
interesting and motivates their name.

2. Randomized strategies. Rather than choosing their answers deterministically, Al-
ice and Bob could choose to make use of randomness when selecting their
answers. The randomness could be in the form of local randomness, where
Alice and Bob individually generate random numbers to assist in the selection
of their answers, or it could be in the form of shared randomness, which one
might view as having been generated by Alice and Bob at some point in the
past.
As it turns out, randomized strategies are not helpful to Alice and Bob, as-
suming their goal is to maximize the probability that they win. This is because
randomized strategies can simply be viewed as the random selection of a de-
terministic strategy, and Alice and Bob might as well just select the optimal
deterministic strategy—the average winning probability obviously cannot be
larger than the maximum winning probability over all deterministic strategies.
3. Entangled strategies. An entangled strategy is one in which Alice and Bob make
use of a shared quantum state when playing a nonlocal game. That is, Alice
holds a register A and Bob holds a register B, where (A, B) is in a joint state
ρ ∈ D(A ⊗ B), prior to the referee sending the questions. Upon receiving a
question x ∈ X, Alice measures the register A with respect to a measurement
described by a collection of measurement operators
 x
Pa : a ∈ A ⊂ Pos(A), (6.7)

and likewise Bob measures B with respect to a measurement described by


measurement operators
 y
Qb : b ∈ B ⊂ Pos(B). (6.8)

To be clear, Alice’s measurement depends on her question x ∈ X and Bob’s


measurement depends on his question y ∈ Y; they each have a measurement
for each possible question they might receive.
Given such a strategy, we see that the probability that Alice and Bob respond
to a question pair ( x, y) with an answer pair ( a, b) is equal to
y
Pax ⊗ Qb , ρ . (6.9)

Note that ρ is not actually required to be entangled by the definition of an


entangled strategy, but also note that if ρ is separable, then the strategy will be
equivalent to a classical randomized strategy. So, entanglement is what makes
this sort of strategy different from a classical strategy, which perhaps explains
the name entangled strategy.

There are other types of strategies that are often considered in the study of non-
local games, including commuting operator strategies and no-signaling strategies—we
will discuss commuting operator strategies in the lecture following the next one.
One could also consider global strategies, in which there is no implicit assumption
that Alice and Bob are separated, so that ( a, b) can depend arbitrarily on ( x, y), but
this class of strategies is not very interesting in a setting in which the nonlocality
of Alice and Bob is relevant.

Values of games
When we speak of the value of a nonlocal game, we’re referring to the supremum
probability with which Alice and Bob can win the game, with respect to whatever
class of strategies we might wish to consider. For this lecture we will focus on two
values: the classical value and the entangled value.

Definition 6.5 (Classical value of a nonlocal game). The classical value of a nonlocal
game G = ( X, Y, A, B, π, V ), which is denoted ω ( G ), is given by a maximization
of the winning probability over all deterministic strategies:



ω(G) = max_{f,g} ∑_{(x,y)∈X×Y} π(x, y) V( f(x), g(y) | x, y ),    (6.10)

where the maximum is over all f : X → A and g : Y → B.

Remark 6.6. As was already discussed, there is no need to differentiate between deterministic and randomized values of nonlocal games, because they are the same, and so the name classical value is justified.

Definition 6.7 (Entangled value of a nonlocal game). The entangled value of a non-
local game G = ( X, Y, A, B, π, V ), which is denoted ω ∗ ( G ), is the supremum of the
winning probabilities

∑_{(x,y)∈X×Y} π(x, y) ∑_{(a,b)∈A×B} V(a, b | x, y) ⟨P^x_a ⊗ Q^y_b, ρ⟩,    (6.11)

over all choices of complex Euclidean spaces A and B, states ρ ∈ D(A ⊗ B), and
sets of measurements

{ P^x_a : a ∈ A }_{x∈X} ⊂ Pos(A)   and   { Q^y_b : b ∈ B }_{y∈Y} ⊂ Pos(B).    (6.12)

That is, the entangled value is the supremum winning probability over all entan-
gled strategies.

Remark 6.8. There are nonlocal games for which the entangled value is not actually
achieved by any strategy, so it is necessary to use the supremum in this definition. The principal
issue is that the dimensions of the spaces A and B are not bounded as one ranges
over all entangled strategies.

Example 6.9 (CHSH game values). Letting G denote the CHSH game, we have that
the classical value of this game is ω ( G ) = 3/4. This may be verified by checking
that the winning probability of each of the 16 possible deterministic strategies is at
most 3/4, and of course that some of those strategies win with probability 3/4.
The entangled value of the CHSH game is ω ∗ ( G ) = cos2 (π/8) ≈ 0.85. The fact
that this is so will emerge as a simple corollary to Tsirelson’s theorem—which is
fitting given that the inequality ω ∗ ( G ) ≤ cos2 (π/8) is a rephrasing of an inequality
known as Tsirelson’s bound.

Example 6.10 (FFL game values). If we let G denote the FFL game, then we have
that its classical value and quantum value agree: ω ( G ) = ω ∗ ( G ) = 2/3. The fact
that ω ( G ) = 2/3 is easily established by testing all deterministic classical strate-
gies. I will ask you to prove that ω ∗ ( G ) = 2/3 as a homework problem. One way to
do this is to prove that even the so-called no-signaling value, which upper-bounds
the quantum value, of the FFL game is 2/3. The no-signaling value can be com-
puted through linear programming.
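To illustrate the last remark, here is a sketch of the linear program for the no-signaling value of the FFL game (Python with scipy.optimize.linprog; the variable indexing and names are ad hoc). The optimal value it reports is 2/3.

import numpy as np
from scipy.optimize import linprog

# variables: p(a, b | x, y), indexed by (a, b, x, y) in {0,1}^4
def idx(a, b, x, y):
    return ((a * 2 + b) * 2 + x) * 2 + y

pi = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 0.0}
V = lambda a, b, x, y: 1.0 if (a | x) != (b | y) else 0.0

c = np.zeros(16)
for a in range(2):
    for b in range(2):
        for x in range(2):
            for y in range(2):
                c[idx(a, b, x, y)] = -pi[(x, y)] * V(a, b, x, y)  # minimize the negative

A_eq, b_eq = [], []
for x in range(2):          # normalization for each question pair
    for y in range(2):
        row = np.zeros(16)
        for a in range(2):
            for b in range(2):
                row[idx(a, b, x, y)] = 1
        A_eq.append(row); b_eq.append(1)
for a in range(2):          # Alice's marginal cannot depend on y
    for x in range(2):
        row = np.zeros(16)
        for b in range(2):
            row[idx(a, b, x, 0)] += 1
            row[idx(a, b, x, 1)] -= 1
        A_eq.append(row); b_eq.append(0)
for b in range(2):          # Bob's marginal cannot depend on x
    for y in range(2):
        row = np.zeros(16)
        for a in range(2):
            row[idx(a, b, 0, y)] += 1
            row[idx(a, b, 1, y)] -= 1
        A_eq.append(row); b_eq.append(0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=[(0, 1)] * 16)
print(-res.fun)   # approximately 2/3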

Example 6.11 (Graph coloring game values). If G is the graph coloring game de-
termined by a graph H and an integer k, then we see that ω ( G ) = 1 if and only
if the chromatic number of H is at most k. That is, given any perfect deterministic
strategy, meaning one that wins with certainty, it is possible to recover a k-coloring
of H, meaning an assignment of colors {1, . . . , k } to the vertices of H such that no
two adjacent vertices share the same color.
There are known examples of graphs H and choices of k for which the associ-
ated nonlocal game G satisfies ω ( G ) < 1 but ω ∗ ( G ) = 1.

6.2 XOR games


XOR games are a restricted type of nonlocal game G = ( X, Y, A, B, π, V ) in which
both players answer binary values, so that A = B = {0, 1}, and for which the
predicate V takes the form
(
1 if a ⊕ b = f ( x, y)
V ( a, b| x, y) = (6.13)
0 if a ⊕ b 6= f ( x, y)

for some choice of a function f : X × Y → {0, 1}. Intuitively speaking, the func-
tion f specifies whether a and b should agree or disagree in order to be a winning
answer, for each question pair ( x, y). Notice that exactly one of the two possibili-
ties, meaning the possibilities that a and b agree or disagree, always wins for each
question pair, while the other possibility loses.
As every XOR game is uniquely determined by the sets X and Y, the probability
vector π ∈ P( X × Y ), and the function f : X × Y → {0, 1}, we will identify the
corresponding game G with the quadruple ( X, Y, π, f ) when it is convenient to do
that. For example, the CHSH game is an example of an XOR game, corresponding
to the quadruple ({0, 1}, {0, 1}, π, f ), for π the uniform probability vector and
f ( x, y) = x ∧ y being the AND function.

Bias of an XOR game


When analyzing XOR games, it is often convenient to consider the bias of games
rather than their value. For a given XOR game G = ( X, Y, π, f ), and any strategy
for G, we define the bias of that strategy, for that game, to be the probability it
wins minus the probability it loses—which happens to be the same thing as twice
the probability it wins minus 1. The bias of a game is defined to be the supremum
bias over all strategies under consideration for that game. We will write ε( G ) and
ε∗ ( G ) to denote the classical and quantum biases for G, and so we have

ε( G ) = 2ω ( G ) − 1 and ε∗ ( G ) = 2ω ∗ ( G ) − 1, (6.14)

or, alternatively,

1 ε( G ) 1 ε∗ ( G )
ω (G) = + and ω∗ (G) = + . (6.15)
2 2 2 2

XOR game strategies described by observables


Let G = ( X, Y, π, f ) be an XOR game, and consider any entangled strategy for that
game, represented by a state ρ ∈ D(A ⊗ B) and measurement operators

{ P^x_0, P^x_1 }_{x∈X} ⊂ Pos(A)   and   { Q^y_0, Q^y_1 }_{y∈Y} ⊂ Pos(B).    (6.16)

If we consider the expression


∑_{(x,y)∈X×Y} π(x, y)(−1)^{f(x,y)} ⟨ (P^x_0 − P^x_1) ⊗ (Q^y_0 − Q^y_1), ρ ⟩    (6.17)

for a few moments, we find that it agrees with the bias of the strategy just de-
scribed. By defining A_x = P^x_0 − P^x_1 for each x ∈ X and B_y = Q^y_0 − Q^y_1 for each

y ∈ Y, we may express this quantity as

∑_{(x,y)∈X×Y} π(x, y)(−1)^{f(x,y)} ⟨A_x ⊗ B_y, ρ⟩.    (6.18)

The operators A x and By may be viewed as representing observables in the parlance


of quantum mechanics.
Notice that as one ranges over all binary-valued measurements { R0 , R1 }, the
operator R0 − R1 ranges over all Hermitian operators H with k H k ≤ 1. Therefore,
the bias of a game G is given by the supremum value of the expression (6.18),
over all choices of { A x : x ∈ X } ⊂ Herm(A), { By : y ∈ Y } ⊂ Herm(B), and
ρ ∈ D(A ⊗ B), subject to the constraints k A x k ≤ 1 for every x ∈ X and k By k ≤ 1
for every y ∈ Y.

6.3 Tsirelson’s theorem


Now we will prove the theorem of Tsirelson mentioned previously. Let us begin
with a statement of the theorem.

Theorem 6.12 (Tsirelson’s theorem). For every choice of finite and nonempty sets X and
Y and an operator M ∈ L(RY , RX ), the following statements are equivalent.

1. There exist complex Euclidean spaces A and B, a density operator ρ ∈ D(A ⊗ B), and
two collections { A x : x ∈ X } ⊂ Herm(A) and { By : y ∈ Y } ⊂ Herm(B) of
operators such that k A x k ≤ 1, k By k ≤ 1, and

M(x, y) = ⟨A_x ⊗ B_y, ρ⟩    (6.19)

for all x ∈ X and y ∈ Y.


2. There exist positive semidefinite operators R ∈ Pos(CX ) and S ∈ Pos(CY ), with
R( x, x ) = 1 and S(y, y) = 1 for all x ∈ X and y ∈ Y, such that
[ R   M ; M^∗   S ] ≥ 0.    (6.20)

Remark 6.13. The second statement in the theorem is equivalent to one in which
the requirement that R and S have real number entries is added. In particular, if R0
and S0 satisfy the conditions listed in the second statement of the theorem, then so
too will
R0 + RT0 S0 + S0T
R= and S = , (6.21)
2 2

by virtue of the fact that M has real-number entries and
[ R   M ; M^∗   S ] = (1/2)[ R_0   M ; M^∗   S_0 ] + (1/2)[ R_0   M ; M^∗   S_0 ]^T    (6.22)
is a positive semidefinite operator whose diagonal entries are all equal to 1.
The first statement of the theorem says that the operator M, which is best
viewed as a matrix indexed by pairs ( x, y) ∈ X × Y in this case, describes exactly
the values in the expression (6.18) that depend upon the strategy under consider-
ation. The second statement of the theorem is a surprisingly simple condition on
M—and it may come as no surprise to learn that it will be used to define semidefi-
nite programs to calculate XOR game biases. The fact that these two statements are
exactly the same thing is a remarkable thing of beauty.

Weyl–Brauer operators
The proof of Tsirelson’s theorem will make use of a collection of unitary and Her-
mitian operators known as Weyl–Brauer operators.
Definition 6.14. Let N be a positive integer and let Z = C2 . The Weyl–Brauer oper-
ators of order N are the operators V1 , . . . , V2N +1 ∈ L(Z⊗ N ) defined as
V_{2k−1} = σ_z^{⊗(k−1)} ⊗ σ_x ⊗ 1^{⊗(N−k)},
V_{2k}   = σ_z^{⊗(k−1)} ⊗ σ_y ⊗ 1^{⊗(N−k)},    (6.23)
for all k ∈ {1, . . . , N }, as well as
V2N +1 = σz⊗ N , (6.24)
where 1, σx , σy , and σz denote the Pauli operators:
       
1 = [ 1  0 ; 0  1 ],   σ_x = [ 0  1 ; 1  0 ],   σ_y = [ 0  −i ; i  0 ],   σ_z = [ 1  0 ; 0  −1 ].    (6.25)
Example 6.15. In the case N = 3, the Weyl–Brauer operators V1 , . . . , V7 are
V1 = σx ⊗ 1 ⊗ 1
V2 = σy ⊗ 1 ⊗ 1
V3 = σz ⊗ σx ⊗ 1
V4 = σz ⊗ σy ⊗ 1 (6.26)
V5 = σz ⊗ σz ⊗ σx
V6 = σz ⊗ σz ⊗ σy
V7 = σz ⊗ σz ⊗ σz .

A proposition summarizing the properties of the Weyl–Brauer operators that
are relevant to the proof of Tsirelson’s theorem follows.
Proposition 6.16. Let N be a positive integer, let V1 , . . . , V2N +1 denote the Weyl–Brauer
operators of order N. For every unit vector u ∈ R2N +1 , the operator
2N +1
∑ u(k )Vk (6.27)
k =1

is both unitary and Hermitian, and for any two vectors u, v ∈ R2N +1 , it holds that
(1/2^N) ⟨ ∑_{j=1}^{2N+1} u(j)V_j , ∑_{k=1}^{2N+1} v(k)V_k ⟩ = ⟨u, v⟩.    (6.28)

Proof. Each operator Vk is Hermitian, and therefore the operator (6.27) is Hermitian
as well.
The Pauli operators anti-commute in pairs:
σx σy = −σy σx , σx σz = −σz σx , and σy σz = −σz σy . (6.29)
By an inspection of the definition of the Weyl–Brauer operators, it follows that
V1 , . . . , V2N +1 also anti-commute in pairs:
Vj Vk = −Vk Vj (6.30)
for distinct choices of j, k ∈ {1, . . . , 2N + 1}. Moreover, each Vk is unitary (as well
as being Hermitian), and therefore Vk2 = 1⊗ N . It follows that
( ∑_{k=1}^{2N+1} u(k)V_k )² = ∑_{k=1}^{2N+1} u(k)² V_k² + ∑_{1≤j<k≤2N+1} u(j)u(k) ( V_j V_k + V_k V_j )    (6.31)
                            = ∑_{k=1}^{2N+1} u(k)² 1^{⊗N} = 1^{⊗N},
and therefore (6.27) is unitary.
Next, observe that (
2N if j = k
hVj , Vk i = (6.32)
0 if j 6= k.
Therefore, one has
* +
1 2N +1 2N +1

2 N j∑
u( j)Vj , ∑ v(k )Vk
=1 k =1
2N +1 2N +1 2N +1
(6.33)
1
= N
2 ∑ ∑ u( j)v(k )hVj , Vk i = ∑ u(k)v(k ) = hu, vi,
j =1 k =1 k =1

as required.
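The anti-commutation relations and the normalized inner-product identity of Proposition 6.16 are easy to verify numerically; the following sketch (Python with numpy; helper names are ad hoc) constructs the Weyl–Brauer operators of order N = 3 and checks both.

import numpy as np
from functools import reduce

I2 = np.eye(2)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def kron_all(ops):
    return reduce(np.kron, ops)

def weyl_brauer(N):
    V = []
    for k in range(1, N + 1):
        prefix, suffix = [sz] * (k - 1), [I2] * (N - k)
        V.append(kron_all(prefix + [sx] + suffix))
        V.append(kron_all(prefix + [sy] + suffix))
    V.append(kron_all([sz] * N))
    return V

N = 3
V = weyl_brauer(N)
# pairwise anti-commutation
print(all(np.allclose(V[j] @ V[k], -V[k] @ V[j])
          for j in range(2 * N + 1) for k in range(j + 1, 2 * N + 1)))
# <V_j, V_k> / 2^N equals the identity matrix (Proposition 6.16)
G = [[np.trace(V[j].conj().T @ V[k]).real / 2**N for k in range(2 * N + 1)]
     for j in range(2 * N + 1)]
print(np.allclose(G, np.eye(2 * N + 1)))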

Proof of Tsirelson’s theorem
Proof of Theorem 6.12. For the sake of simplifying notation, we will make the as-
sumption that X = {1, . . . , n} and Y = {1, . . . , m}.
Assume that the first statement is true, and define an operator
 √ ∗ 
vec ( A1 ⊗ 1) ρ
 .. 
 . 
vec ( A ⊗ 1)√ρ∗ 
 
n
K= √ ∗  ∈ L(A ⊗ B ⊗ A ⊗ B, Cn ⊕ Cm ). (6.34)
 
 vec (1 ⊗ B1 ) ρ 
..
 
.
 
√ ∗
 
vec (1 ⊗ Bm ) ρ

The operator KK^∗ ∈ Pos(C^n ⊕ C^m) may be written in a block form as

KK^∗ = [ P   M ; M^∗   Q ]    (6.35)

for P ∈ Pos(Cn ) and Q ∈ Pos(Cm ); the fact that the off-diagonal blocks are as
claimed follows from the calculation
⟨(A_j ⊗ 1)√ρ, (1 ⊗ B_k)√ρ⟩ = ⟨A_j ⊗ B_k, ρ⟩ = M(j, k).    (6.36)

For each j ∈ {1, . . . , n} one has


P(j, j) = ⟨(A_j ⊗ 1)√ρ, (A_j ⊗ 1)√ρ⟩ = ⟨A_j² ⊗ 1, ρ⟩,    (6.37)

which is necessarily a nonnegative real number in the interval [0, 1]; and through
a similar calculation, one finds that Q(k, k) is also a nonnegative real number in the
interval [0, 1] for each k ∈ {1, . . . , m}. A nonnegative real number may be added to
each diagonal entry of this operator to yield another positive semidefinite operator,
so one has that statement 2 holds.
Next, let us assume statement 2 holds. As was explained in Remark 6.13, we
are free to assume that all of the entries of R and S are real numbers.
Now, a matrix with real number entries is positive semidefinite if and only if
it is the Gram matrix of a collection of real vectors, and therefore there must exist
real vectors {u1 , . . . , un , v1 , . . . , vm } such that

hu j , vk i = M( j, k) (6.38)

for all j ∈ {1, . . . , n} and k ∈ {1, . . . , m}, as well as

hu j0 , u j1 i = R( j0 , j1 ) and hvk0 , vk1 i = S(k0 , k1 ) (6.39)

for all j0 , j1 ∈ {1, . . . , n} and k0 , k1 ∈ {1, . . . , m}. There are n + m of these vectors,
and therefore they span a real vector space of dimension at most n + m, so there is
no loss of generality in assuming u1 , . . . , un , v1 , . . . , vm ∈ Rn+m . Observe that these
vectors are all unit vectors, as the diagonal entries of R and S represent their norm
squared.
Choose N so that 2N + 1 ≥ n + m and let Z = C2 . Define operators A1 , . . . , An
and B1 , . . . , Bm , all acting on L(Z⊗ N ), as
A_j = ∑_{i=1}^{n+m} u_j(i) V_i   and   B_k = ∑_{i=1}^{n+m} v_k(i) V_i^T    (6.40)

for each j ∈ {1, . . . , n} and k ∈ {1, . . . , m}, where V1 , . . . , Vn+m are the first n + m
Weyl–Brauer operators of order N. By Proposition 6.16, each of these operators
is both unitary and Hermitian, and therefore each of these operators has spectral
norm equal to 1.
Finally, define

ρ = (1/2^N) vec(1^{⊗N}) vec(1^{⊗N})^∗ ∈ D(Z^{⊗N} ⊗ Z^{⊗N}).    (6.41)
Applying Proposition 6.16 again gives

⟨A_j ⊗ B_k, ρ⟩ = (1/2^N) ⟨A_j, B_k^T⟩ = ⟨u_j, v_k⟩ = M(j, k),    (6.42)
for each j ∈ {1, . . . , n} and k ∈ {1, . . . , m}.
We have proved that statement 2 implies statement 1, for the spaces A = Z⊗ N
and B = Z⊗ N , and so the proof is complete.

Lecture 7

A semidefinite program for the entangled bias of XOR games

In this lecture we will discuss a couple of applications of Tsirelson’s theorem, cen-


tering primarily on the semidefinite programming formulation for the entangled
bias of XOR games that it yields.

7.1 The semidefinite program


Let us begin with the semidefinite program that is suggested by Tsirelson’s theo-
rem. Assume that G = ( X, Y, π, f ) is an XOR game, and recall (as was discussed
in the previous lecture) that the entangled bias of G is the supremum of the values

∑ π ( x, y)(−1) f ( x,y) h A x ⊗ By , ρi, (7.1)


( x,y)∈ X ×Y

taken over all choices for complex Euclidean spaces A and B, a state ρ ∈ D(A ⊗ B),
and Hermitian contractions

{ A x : x ∈ X } ⊂ Herm(A) and { By : y ∈ Y } ⊂ Herm(B). (7.2)

By Tsirelson’s theorem, this is equivalent to the supremum of the values

∑ π ( x, y)(−1) f ( x,y) M( x, y), (7.3)


( x,y)∈ X ×Y

taken over all M ∈ L(RY , RX ) for which there exist R ∈ Pos(CX ) and S ∈ Pos(CY )
for which R( x, x ) = 1 for all x ∈ X, S(y, y) = 1 for all y ∈ Y, and
 
\begin{pmatrix} R & M \\ M^* & S \end{pmatrix} \in \operatorname{Pos}\bigl(\mathbb{C}^X \oplus \mathbb{C}^Y\bigr).   (7.4)

With this fact in mind, let us consider the following semidefinite program.
First, let us write X = CX and Y = CY for brevity, and let ∆ ∈ C(X ⊕ Y) de-
note the completely dephasing channel acting on X ⊕ Y, which zeros out all of the
off-diagonal entries of its input and leaves the diagonal entries alone. Define an
operator D ∈ L(Y, X) as

D(x, y) = \pi(x, y)\,(-1)^{f(x,y)}   (7.5)

for every x ∈ X and y ∈ Y, and let H ∈ Herm(X ⊕ Y) be defined as


 
H = \frac{1}{2} \begin{pmatrix} 0 & D \\ D^* & 0 \end{pmatrix}.   (7.6)
The semidefinite program to be considered is described by the triple (∆, H, 1X⊕Y ).
The primal and dual problems associated with this semidefinite program take the
following form.
Optimization Problem 7.1 (SDP for XOR game bias, unsimplified)

Primal problem
  maximize:    ⟨H, Z⟩
  subject to:  ∆(Z) = 1_{X⊕Y},
               Z ∈ Pos(X ⊕ Y).

Dual problem
  minimize:    Tr(W)
  subject to:  ∆(W) ≥ H,
               W ∈ Herm(X ⊕ Y).
Note that in the dual problem formulation we have used the fact that the com-
pletely dephasing channel is self-dual: ∆ = ∆∗ . Strong duality and the achiev-
ability of the optimal values in both the primal and dual problems follow from
Slater’s theorem; strictly feasible solutions are given by Z = 1X⊕Y in the primal
and W = λ1X⊕Y for a sufficiently large λ in the dual.
Let us now examine both the primal and dual problems, beginning with the
primal problem. Our principal order of business with the primal problem, which
is fairly straightforward, is to verify that its optimal value indeed agrees with the
entangled bias of the XOR game G.
Suppose first that M ∈ L(RY , RX ) is such that there exist R ∈ Pos(X) and
S ∈ Pos(Y) for which R( x, x ) = 1 for all x ∈ X, S(y, y) = 1 for all y ∈ Y, and
 
Z = \begin{pmatrix} R & M \\ M^* & S \end{pmatrix} \in \operatorname{Pos}(\mathcal{X} \oplus \mathcal{Y}).   (7.7)
The operator Z is then primal feasible, as the constraint ∆( Z ) = 1X⊕Y is equivalent
to R and S having diagonal entries equal to one. The objective value for Z is equal
to
\langle H, Z \rangle = \frac{1}{2} \langle D, M \rangle + \frac{1}{2} \langle D^*, M^* \rangle = \langle D, M \rangle,   (7.8)

owing to the fact that D and M have real number entries. Noting that
\langle D, M \rangle = \sum_{(x,y)\in X\times Y} \pi(x, y)\,(-1)^{f(x,y)}\, M(x, y),   (7.9)

we find that the optimal value of the semidefinite program is at least the entangled
bias of G.
To see that the optimal value of the semidefinite program is no greater than the
entangled bias of G, consider any Z ∈ Pos(X ⊕ Y), which can be expressed as
 
Z = \begin{pmatrix} R & K \\ K^* & S \end{pmatrix}   (7.10)
for some choice of R ∈ Pos(X), S ∈ Pos(Y), and K ∈ L(Y, X). Again the constraint
∆( Z ) = 1X⊕Y is equivalent to R and S having diagonal entries equal to one, and
the only issue remaining is that K might not have real number entries. However,
by expressing the objective function in terms of the block structure of H and Z, we
find that
\langle H, Z \rangle = \frac{1}{2} \langle D, K \rangle + \frac{1}{2} \langle D^*, K^* \rangle = \langle D, M \rangle   (7.11)
for
M = \frac{K + \overline{K}}{2},   (7.12)
following from the fact that D has real entries:
\langle D^*, K^* \rangle = \overline{\langle D, K \rangle} = \langle \overline{D}, \overline{K} \rangle = \langle D, \overline{K} \rangle.   (7.13)
Proceeding in much the same way as in the previous lecture, we see that
\frac{1}{2} \begin{pmatrix} R & K \\ K^* & S \end{pmatrix} + \frac{1}{2} \begin{pmatrix} R & K \\ K^* & S \end{pmatrix}^{\mathsf{T}} = \begin{pmatrix} \frac{1}{2}R + \frac{1}{2}R^{\mathsf{T}} & M \\ M^* & \frac{1}{2}S + \frac{1}{2}S^{\mathsf{T}} \end{pmatrix}   (7.14)

is positive semidefinite, and the diagonal entries of ½R + ½R^T and ½S + ½S^T are


all equal to one, and therefore the objective value h D, Mi is no greater than the
entangled bias of G.
Now let us consider the dual problem, in which one minimizes Tr(W) over all
W ∈ Herm(X ⊕ Y), subject to the constraint that
∆(W ) ≥ H. (7.15)
Notice that the off-diagonal entries of W have absolutely no influence on this prob-
lem: they are zeroed out by ∆ in the constraint, and they do not influence the objec-
tive function. For this reason there is no generality lost in restricting one’s attention
to W taking the form

W = \frac{1}{2} \begin{pmatrix} \operatorname{Diag}(u) & 0 \\ 0 & \operatorname{Diag}(v) \end{pmatrix}   (7.16)

for vectors u ∈ RX and v ∈ RY . (Here we’re including the factor of 1/2 for the
sake of convenience—we’re free to scale the vectors u and v as we choose.) The
objective function then becomes

\operatorname{Tr}(W) = \frac{1}{2} \sum_{x \in X} u(x) + \frac{1}{2} \sum_{y \in Y} v(y),   (7.17)

while the constraint becomes equivalent to


 
\begin{pmatrix} \operatorname{Diag}(u) & -D \\ -D^* & \operatorname{Diag}(v) \end{pmatrix} \geq 0.   (7.18)

Summarizing what we have just concluded about Optimization Problem 7.1,


we arrive at the following expression of the same problem.

Optimization Problem 7.2 (SDP for XOR game bias, simplified)

Primal problem
  maximize:    ⟨D, M⟩
  subject to:  \begin{pmatrix} R & M \\ M^* & S \end{pmatrix} ≥ 0,
               R(x, x) = 1 for all x ∈ X,
               S(y, y) = 1 for all y ∈ Y,
               R ∈ Pos(X),
               S ∈ Pos(Y),
               M ∈ L(R^Y, R^X).

Dual problem
  minimize:    \frac{1}{2} \sum_{x \in X} u(x) + \frac{1}{2} \sum_{y \in Y} v(y)
  subject to:  \begin{pmatrix} \operatorname{Diag}(u) & -D \\ -D^* & \operatorname{Diag}(v) \end{pmatrix} ≥ 0,
               u ∈ R^X,
               v ∈ R^Y.
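The primal problem above translates directly into code. The following is a minimal sketch, not part of the original notes, assuming that numpy and cvxpy (with its default SDP solver) are available; the function name entangled_bias is an illustrative choice, and restricting to a real symmetric block operator is the simplification already built into Optimization Problem 7.2.

    # A sketch of the primal problem of Optimization Problem 7.2: given the
    # matrix D of an XOR game, approximate the entangled bias eps*(G).
    import numpy as np
    import cvxpy as cp

    def entangled_bias(D):
        n, m = D.shape
        Z = cp.Variable((n + m, n + m), symmetric=True)
        M = Z[:n, n:]                      # off-diagonal block, playing the role of M
        constraints = [Z >> 0,             # Z positive semidefinite
                       cp.diag(Z) == 1]    # R(x,x) = 1 and S(y,y) = 1
        problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(D, M))), constraints)
        problem.solve()
        return problem.value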
It will be helpful later in the lecture for us to observe at this point that for any
dual-optimal choice of u and v, it must be the case that

\sum_{x \in X} u(x) = \sum_{y \in Y} v(y).   (7.19)

The reason is that for any choice of u and v, and for any λ > 0, the operator
 
\begin{pmatrix} \operatorname{Diag}(u) & -D \\ -D^* & \operatorname{Diag}(v) \end{pmatrix}   (7.20)

is positive semidefinite if and only if


 
\begin{pmatrix} \lambda \operatorname{Diag}(u) & -D \\ -D^* & \tfrac{1}{\lambda} \operatorname{Diag}(v) \end{pmatrix}   (7.21)

is positive semidefinite. The dual objective value obtained by the operator (7.21),
assuming it is positive semidefinite, is equal to

\frac{\lambda}{2} \sum_{x \in X} u(x) + \frac{1}{2\lambda} \sum_{y \in Y} v(y).   (7.22)

Assuming that D is nonzero, which is always the case when it arises from an XOR
game G, it must be the case that ∑ x∈X u( x ) and ∑y∈Y v(y) are strictly positive, and
in this case the minimum value for (7.22) occurs when
\lambda = \sqrt{\frac{\sum_{y \in Y} v(y)}{\sum_{x \in X} u(x)}},   (7.23)

and this is the unique choice of λ for which the minimum is obtained. Thus, under
the assumption that u and v are optimal, it must be the case that λ = 1, which is
equivalent to (7.19).
We can now verify that the entangled value of the CHSH game is cos2 (π/8), as
claimed in the previous lecture.

Example 7.3 (CHSH game entangled bias/value). Recall that the CHSH game is
the XOR game G = ( X, Y, π, f ) with

X = Y = \{0, 1\},
\pi(0, 0) = \pi(0, 1) = \pi(1, 0) = \pi(1, 1) = \frac{1}{4},   (7.24)
f(x, y) = x \wedge y.

The matrix D(x, y) = \pi(x, y)\,(-1)^{f(x,y)} is then equal to

D = \frac{1}{4} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.   (7.25)

We will verify that the optimal value of Optimization Problem 7.2 for the game G
is ε^*(G) = 1/\sqrt{2}. First choose

M = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},   (7.26)
which has spectral norm equal to 1—it is a unitary operator, representing the
Hadamard transform—and observe that for R = S = 1 we have that
 
Z = \begin{pmatrix} R & M \\ M^* & S \end{pmatrix} \geq 0.   (7.27)

The diagonal entries of R and S are equal to one, so Z is primal feasible, and it
achieves the objective value ⟨D, M⟩ = 1/√2.
In the dual problem, choosing
   
u = \left( \frac{1}{2\sqrt{2}},\, \frac{1}{2\sqrt{2}} \right) \quad\text{and}\quad v = \left( \frac{1}{2\sqrt{2}},\, \frac{1}{2\sqrt{2}} \right)   (7.28)

yields a feasible solution. The objective value is the same value just achieved in the
primal:

\frac{1}{2} \sum_{x \in \{0,1\}} u(x) + \frac{1}{2} \sum_{y \in \{0,1\}} v(y) = \frac{1}{\sqrt{2}}.   (7.29)

Having obtained the same value in the primal and dual, we have verified that
the optimal value, which is the entangled bias, is

\varepsilon^*(G) = \frac{1}{\sqrt{2}}.   (7.30)
This implies that the entangled value of the CHSH game is

\omega^*(G) = \frac{1}{2} + \frac{1}{2\sqrt{2}} = \cos^2(\pi/8).   (7.31)

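As a numerical check on this example, the entangled_bias sketch introduced earlier in the lecture can be applied to the matrix D in (7.25). This usage is again illustrative rather than part of the notes.

    # Hypothetical usage of the entangled_bias sketch for the CHSH game.
    D = np.array([[1, 1], [1, -1]]) / 4
    bias = entangled_bias(D)
    print(bias)              # ~ 0.7071 = 1/sqrt(2)
    print(0.5 + bias / 2)    # entangled value, ~ 0.8536 = cos^2(pi/8)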
7.2 Strong parallel repetition for XOR games


Next we will prove that XOR games obey a strong parallel repetition property. To
explain what this means, let us first discuss the notion of parallel repetition in
greater generality.
Given arbitrary nonlocal games G1 , . . . , Gn , described by probability distribu-
tions
π_1 : X_1 × Y_1 → [0, 1],
     ⋮                                   (7.32)
π_n : X_n × Y_n → [0, 1],

and predicates

V_1 : A_1 × B_1 × X_1 × Y_1 → {0, 1},
     ⋮                                   (7.33)
V_n : A_n × B_n × X_n × Y_n → {0, 1},
respectively, one defines the nonlocal game G = G1 ∧ · · · ∧ Gn by the distribution

π (( x1 , . . . , xn ), (y1 , . . . , yn )) = π1 ( x1 , y1 ) · · · πn ( xn , yn ) (7.34)

and the predicate

V((a_1, \ldots, a_n), (b_1, \ldots, b_n) \,|\, (x_1, \ldots, x_n), (y_1, \ldots, y_n))
   = V_1(a_1, b_1 | x_1, y_1) \wedge \cdots \wedge V_n(a_n, b_n | x_n, y_n).   (7.35)

In words, the game G is run as if it were independent instances of the games


G1 , . . . , Gn , where Alice and Bob receive n-tuples of questions ( x1 , . . . , xn ) and
(y1 , . . . , yn ) all at the same time, with each question pair ( xk , yk ) being chosen ac-
cording to πk , independent of all other question pairs. They are expected to pro-
vide answers ( a1 , . . . , an ) and (b1 , . . . , bn ), respectively, and they win the game G if
and only if every one of the pairs ( ak , bk ) is correct for the corresponding question
pair ( xk , yk ) in the game Gk .
It is important to realize, though, that Alice and Bob are not required to treat
the individual games G1 , . . . , Gn independently. They may, in particular, attempt to
correlate their answers in otherwise independent game instances to their advan-
tage. The following example illustrates that it is indeed possible for them to gain
an advantage along these lines.

Example 7.4 (Parallel repetition of the FFL game). The classical value of the FFL
game, introduced in the previous lecture, is 2/3. Remarkably, the classical value
of the two-fold repetition FFL ∧ FFL of this game with itself is also 2/3. A deter-
ministic strategy that achieves this winning probability is that Alice and Bob both
respond to their question pairs by simply swapping the two binary values. That
is, Alice’s answers are determined by the function f : X × X → A × A and Bob’s
answers are determined by the function g : Y × Y → B × B, where

f ( x1 , x2 ) = ( x2 , x1 ) and g ( y1 , y2 ) = ( y2 , y1 ). (7.36)

For this strategy, the winning condition in both games is the same:

x_1 ∨ x_2 ≠ y_1 ∨ y_2.   (7.37)

That is, they either win both games or lose both games, never winning just one of
them. If ( x1 , y1 ) and ( x2 , y2 ) are independently and uniformly generated from the
set {(0, 0), (0, 1), (1, 0)} then the above condition fails only when

((x_1, y_1), (x_2, y_2)) \in \bigl\{ ((0, 0), (0, 0)),\, ((1, 0), (0, 1)),\, ((0, 1), (1, 0)) \bigr\}.   (7.38)

Such question pairs are selected with probability 3/9 = 1/3, so the winning proba-
bility is 2/3, as claimed.
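The claimed winning probability of the swap strategy can also be checked by a direct enumeration. The short script below is a sketch that is not part of the notes; it assumes the FFL winning predicate recalled from the previous lecture, namely that answers (a, b) to questions (x, y) win exactly when a ∨ x ≠ b ∨ y.

    # A check of the swap strategy for FFL AND FFL; the predicate used here is
    # an assumption recalled from the previous lecture.
    from itertools import product

    questions = [(0, 0), (0, 1), (1, 0)]   # question pairs, each chosen uniformly

    def wins(a, b, x, y):
        return (a | x) != (b | y)

    outcomes = []
    for (x1, y1), (x2, y2) in product(questions, repeat=2):
        a1, a2 = x2, x1                    # Alice answers with her questions swapped
        b1, b2 = y2, y1                    # Bob answers with his questions swapped
        outcomes.append(wins(a1, b1, x1, y1) and wins(a2, b2, x2, y2))

    print(sum(outcomes) / len(outcomes))   # 6/9 = 2/3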

The example of the FFL game illustrates that the classical value of a nonlo-
cal game G = G1 ∧ · · · ∧ Gn is not always equal to the product of the values of

G1 , . . . , Gn , which is what one obtains when Alice and Bob play the games inde-
pendently. That is, one always has

ω ( G1 ∧ · · · ∧ Gn ) ≥ ω ( G1 ) · · · ω ( Gn ), (7.39)

but in some cases the inequality is strict. A similar phenomenon occurs for the
entangled value—although we did not prove it, the entangled value of the FFL
game agrees with the classical value, and therefore

\omega^*(\mathrm{FFL} \wedge \mathrm{FFL}) \geq \omega(\mathrm{FFL} \wedge \mathrm{FFL}) = \frac{2}{3} = \omega^*(\mathrm{FFL}) > \omega^*(\mathrm{FFL})^2.   (7.40)
We will prove that the entangled value of XOR games forbids this type of ad-
vantage. That is, if G1 , . . . , Gn are XOR games, then

ω ∗ ( G1 ∧ · · · ∧ Gn ) = ω ∗ ( G1 ) · · · ω ∗ ( Gn ). (7.41)

This is the property referred to as strong parallel repetition. In particular, if G is an


XOR game and we write

G ∧n = G ∧ · · · ∧ G (n times), (7.42)

then it necessarily holds that

ω ∗ ( G ∧n ) = ω ∗ ( G )n . (7.43)

The first step in proving that (7.41) holds for XOR games G1 , . . . , Gn is to define
the XOR of two (or more) XOR games. Suppose that G1 and G2 are XOR games,
specified by probability distributions

π1 : X1 × Y1 → [0, 1] and π2 : X2 × Y2 → [0, 1] (7.44)

along with functions

f 1 : X1 × Y1 → {0, 1} and f 2 : X2 × Y2 → {0, 1}. (7.45)

The XOR G1 ⊕ G2 of these two games is the XOR game defined by the distribution
π : ( X1 × X2 ) × (Y1 × Y2 ) → [0, 1] given by

π (( x1 , x2 ), (y1 , y2 )) = π1 ( x1 , y1 )π2 ( x2 , y2 ) (7.46)

and the function f : ( X1 × X2 ) × (Y1 × Y2 ) → {0, 1} given by

f (( x1 , x2 ), (y1 , y2 )) = f 1 ( x1 , y1 ) ⊕ f 2 ( x2 , y2 ). (7.47)

In words, the question sets for the game G1 ⊕ G2 are X1 × X2 and Y1 × Y2 , and
if Alice receives ( x1 , x2 ) and Bob receives (y1 , y2 ), they are expected to provide
answers a, b ∈ {0, 1} that are consistent with the equations

a = a1 ⊕ a2 and b = b1 ⊕ b2 (7.48)

for some choice of a1 , a2 , b1 , b2 ∈ {0, 1} that would cause ( a1 , b1 ) to be correct for


( x1 , y1 ) and ( a2 , b2 ) to be correct for ( x2 , y2 ). The XOR of more than two XOR games
is defined similarly (or, equivalently, by applying this definition iteratively).
It is evident, for any two XOR games G1 and G2 , that

ε∗ ( G1 ⊕ G2 ) ≥ ε∗ ( G1 )ε∗ ( G2 ). (7.49)

The reason is that if Alice and Bob play the game G1 ⊕ G2 by playing G1 and G2
independently and optimally, and then answer according to the XOR of their an-
swers in G1 and G2 , then they will achieve the bias ε∗ ( G1 )ε∗ ( G2 ). We will prove
that this is, in fact, the best they can do. That is, we will prove

ε∗ ( G1 ⊕ G2 ) = ε∗ ( G1 )ε∗ ( G2 ). (7.50)

This will be done using semidefinite programming duality.


In particular, consider the dual form of Optimization Problem 7.2, the semidef-
inite program for the entangled bias of an XOR game, for the two separate XOR
games G1 and G2 . Suppose that (u1 , v1 ) and (u2 , v2 ) represent dual-optimal solu-
tions to these semidefinite programs. As argued previously, this implies

\sum_{x_1} u_1(x_1) = \sum_{y_1} v_1(y_1) \quad\text{and}\quad \sum_{x_2} u_2(x_2) = \sum_{y_2} v_2(y_2).   (7.51)

The dual form of Optimization Problem 7.2 for the entangled bias of the XOR game
G1 ⊕ G2 has the following form:
  minimize:    \frac{1}{2} \sum_{x \in X_1 \times X_2} u(x) + \frac{1}{2} \sum_{y \in Y_1 \times Y_2} v(y)
  subject to:  \begin{pmatrix} \operatorname{Diag}(u) & -D_1 \otimes D_2 \\ -D_1^* \otimes D_2^* & \operatorname{Diag}(v) \end{pmatrix} ≥ 0,
               u ∈ R^{X_1 × X_2},  v ∈ R^{Y_1 × Y_2}.

Here, D1 and D2 are the operators given by

D_1(x_1, y_1) = \pi_1(x_1, y_1)\,(-1)^{f_1(x_1, y_1)},
D_2(x_2, y_2) = \pi_2(x_2, y_2)\,(-1)^{f_2(x_2, y_2)}.   (7.52)

Next we observe that u = u1 ⊗ u2 and v = v1 ⊗ v2 provide a dual feasible
solution to Optimization Problem 7.2 for ε∗ ( G1 ⊕ G2 ). To prove that this is so, it is
helpful to observe that if
   
\begin{pmatrix} P_1 & X_1 \\ X_1^* & Q_1 \end{pmatrix} \geq 0 \quad\text{and}\quad \begin{pmatrix} P_2 & X_2 \\ X_2^* & Q_2 \end{pmatrix} \geq 0   (7.53)

then

\begin{pmatrix} P_1 \otimes P_2 & X_1 \otimes X_2 \\ X_1^* \otimes X_2^* & Q_1 \otimes Q_2 \end{pmatrix} \geq 0.   (7.54)
This is so because, after some simultaneous re-ordering of rows and columns, the
matrix above is a principal submatrix of the positive semidefinite matrix
   
\begin{pmatrix} P_1 & X_1 \\ X_1^* & Q_1 \end{pmatrix} \otimes \begin{pmatrix} P_2 & X_2 \\ X_2^* & Q_2 \end{pmatrix}.   (7.55)

Alternatively, by (7.53) we may conclude that


X_1 = \sqrt{P_1}\, K_1 \sqrt{Q_1} \quad\text{and}\quad X_2 = \sqrt{P_2}\, K_2 \sqrt{Q_2}   (7.56)

for K_1 and K_2 satisfying ‖K_1‖ ≤ 1 and ‖K_2‖ ≤ 1. As we therefore have

X_1 \otimes X_2 = \sqrt{P_1 \otimes P_2}\,(K_1 \otimes K_2)\,\sqrt{Q_1 \otimes Q_2},   (7.57)

and ‖K_1 ⊗ K_2‖ = ‖K_1‖ ‖K_2‖ ≤ 1, it follows that (7.54) holds. One may therefore


conclude that
\begin{pmatrix} \operatorname{Diag}(u_1) \otimes \operatorname{Diag}(u_2) & D_1 \otimes D_2 \\ D_1^* \otimes D_2^* & \operatorname{Diag}(v_1) \otimes \operatorname{Diag}(v_2) \end{pmatrix} \geq 0.   (7.58)

As Diag(u1 ) ⊗ Diag(u2 ) = Diag(u) and Diag(v1 ) ⊗ Diag(v2 ) = Diag(v), we have


\begin{pmatrix} \operatorname{Diag}(u) & -D_1 \otimes D_2 \\ -D_1^* \otimes D_2^* & \operatorname{Diag}(v) \end{pmatrix}
  = \begin{pmatrix} \mathbb{1} & 0 \\ 0 & -\mathbb{1} \end{pmatrix} \begin{pmatrix} \operatorname{Diag}(u) & D_1 \otimes D_2 \\ D_1^* \otimes D_2^* & \operatorname{Diag}(v) \end{pmatrix} \begin{pmatrix} \mathbb{1} & 0 \\ 0 & -\mathbb{1} \end{pmatrix} \geq 0.   (7.59)

The objective value of the dual solution (u, v) is


\frac{1}{2} \sum_{x} u(x) + \frac{1}{2} \sum_{y} v(y)
  = \frac{1}{2} \sum_{x_1} u_1(x_1) \sum_{x_2} u_2(x_2) + \frac{1}{2} \sum_{y_1} v_1(y_1) \sum_{y_2} v_2(y_2)   (7.60)
  = \frac{1}{2}\, \varepsilon^*(G_1)\, \varepsilon^*(G_2) + \frac{1}{2}\, \varepsilon^*(G_1)\, \varepsilon^*(G_2) = \varepsilon^*(G_1)\, \varepsilon^*(G_2),
2 2

where we have used the fact that

\sum_{x_1} u_1(x_1) = \sum_{y_1} v_1(y_1) = \varepsilon^*(G_1) \quad\text{and}\quad \sum_{x_2} u_2(x_2) = \sum_{y_2} v_2(y_2) = \varepsilon^*(G_2).   (7.61)

Therefore, ε∗ ( G1 ⊕ G2 ) ≤ ε∗ ( G1 )ε∗ ( G2 ), which implies (7.50).
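For a concrete illustration of (7.50), the entangled_bias sketch from Section 7.1 can be applied to the XOR of two CHSH games; as is visible from the dual problem above, the matrix associated with G_1 ⊕ G_2 is D_1 ⊗ D_2. This is a hypothetical numerical check, not part of the notes.

    # Numerical check of multiplicativity for CHSH XOR CHSH.
    D_chsh = np.array([[1, 1], [1, -1]]) / 4
    print(entangled_bias(np.kron(D_chsh, D_chsh)))   # ~ 0.5 = (1/sqrt(2))^2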


We are now prepared to prove that, for XOR games G1 , . . . , Gn , it holds that

ω ∗ ( G1 ∧ · · · ∧ Gn ) = ω ∗ ( G1 ) · · · ω ∗ ( Gn ). (7.62)

As we have already observed, it holds that

ω ∗ ( G1 ∧ · · · ∧ Gn ) ≥ ω ∗ ( G1 ) · · · ω ∗ ( Gn ), (7.63)

and therefore it remains to prove

ω ∗ ( G1 ∧ · · · ∧ Gn ) ≤ ω ∗ ( G1 ) · · · ω ∗ ( Gn ). (7.64)

Assume hereafter that XOR games G1 , . . . , Gn have been fixed, and consider an
arbitrary strategy for Alice and Bob in the game G1 ∧ · · · ∧ Gn , through which Alice
and Bob answer question tuples ( x1 , . . . , xn ) and (y1 , . . . , yn ) with answer tuples
( a1 , . . . , an ) and (b1 , . . . , bn ), respectively. We will consider how well this strategy
performs for the XOR game
Gk1 ⊕ · · · ⊕ Gkm , (7.65)
for various choices of a subset S = {k1 , . . . , k m } ⊆ {1, . . . , n}, provided that we
define Alice and Bob’s answers as

ak1 ⊕ · · · ⊕ ak m and bk 1 ⊕ · · · ⊕ bk m (7.66)

and where we assume that they have chosen to share randomly generated question
pairs (x_k, y_k) for those choices of k ∉ S.
To do this we will define binary-valued random variables Z1 , . . . , Zn as

Zk = ak ⊕ bk ⊕ f k ( xk , yk ), (7.67)

where we view x1 , . . . , xn , y1 , . . . , yn , a1 , . . . , an , and b1 , . . . , bn as random variables,


with ( x1 , y1 ), . . . , ( xn , yn ) distributed independently according to the distributions
π1 , . . . , πn given by the games G1 , . . . , Gn and a1 , . . . , an , b1 , . . . , bn distributed and
correlated with x1 , . . . , xn , y1 , . . . , yn in whatever manner Alice and Bob’s strategy
determines. It holds that
Z_k = \begin{cases} 0 & \text{if Alice and Bob win } G_k \\ 1 & \text{if Alice and Bob lose } G_k \end{cases}   (7.68)

and therefore the probability of winning minus the probability of losing game G_k is
equal to the expectation

\mathbb{E}\bigl[(-1)^{Z_k}\bigr].   (7.69)

More generally, if Alice and Bob’s strategy is transformed into a strategy for the
XOR game Gk1 ⊕ · · · ⊕ Gkm as suggested above, we find that
Z_{k_1} \oplus \cdots \oplus Z_{k_m} = \begin{cases} 0 & \text{if Alice and Bob win } G_{k_1} \oplus \cdots \oplus G_{k_m} \\ 1 & \text{if Alice and Bob lose } G_{k_1} \oplus \cdots \oplus G_{k_m} \end{cases}   (7.70)

and therefore the probability they win minus the probability they lose in this XOR
game is

\mathbb{E}\bigl[(-1)^{Z_{k_1} + \cdots + Z_{k_m}}\bigr].   (7.71)

Now, we know that the probability of winning minus the probability of losing in the
game G_{k_1} ⊕ · · · ⊕ G_{k_m} is upper-bounded by its bias:

\mathbb{E}\bigl[(-1)^{Z_{k_1} + \cdots + Z_{k_m}}\bigr] \leq \varepsilon^*(G_{k_1} \oplus \cdots \oplus G_{k_m}) = \varepsilon^*(G_{k_1}) \cdots \varepsilon^*(G_{k_m}).   (7.72)

Here we have used the fact concerning XOR game biases proved above, which is
the key to making the entire argument work. The probability that Alice and Bob’s
strategy wins G1 ∧ · · · ∧ Gn is therefore bounded as follows:

\Pr(Z_1 = 0, \ldots, Z_n = 0)
  = \mathbb{E}\left[ \frac{1 + (-1)^{Z_1}}{2} \cdots \frac{1 + (-1)^{Z_n}}{2} \right]
  = \frac{1}{2^n} \sum_{S \subseteq \{1,\ldots,n\}} \mathbb{E}\Bigl[(-1)^{\sum_{k \in S} Z_k}\Bigr]
  \leq \frac{1}{2^n} \sum_{S \subseteq \{1,\ldots,n\}} \prod_{k \in S} \varepsilon^*(G_k)   (7.73)
  = \frac{1 + \varepsilon^*(G_1)}{2} \cdots \frac{1 + \varepsilon^*(G_n)}{2}
  = \omega^*(G_1) \cdots \omega^*(G_n).

Maximizing over all possible entangled strategies for Alice and Bob yields

ω ∗ ( G1 ∧ · · · ∧ Gn ) ≤ ω ∗ ( G1 ) · · · ω ∗ ( Gn ), (7.74)

as required.

Lecture 8

The hierarchy of
Navascués, Pironio, and Acín

In this lecture, we will define and study a class of strategies for nonlocal games
known as commuting measurement strategies, or alternatively as commuting operator
strategies. These strategies include all entangled strategies, in a sense that will be
made more precise momentarily—and it was not long ago that this inclusion was
proved to be proper by Slofstra [arXiv:1606.03140]. A proof that the values defined
by the classes of entangled and commuting measurement strategies are different,
where we take the supremum winning probability over the two classes of strate-
gies, has only recently been announced by Ji, Natarajan, Vidick, Wright, and Yuen
[arXiv:2001.04383]. But be warned—the paper is over 200 pages long. This refutes
the famous Connes’ embedding conjecture from the subject of von Neumann alge-
bras, so it is worth every page it needs.
We will then analyze the semidefinite programming hierarchy of Navascués,
Pironio, and Acín, better known as the NPA hierarchy, which provides us with a
uniform family of semidefinite programs that converges to the commuting measure-
ment value of any nonlocal game. This result is, in fact, a necessary ingredient in Ji,
Natarajan, Vidick, Wright, and Yuen’s proof.

8.1 Representing and comparing strategies


Let us fix question sets X and Y, and answer sets A and B, for a nonlocal game.
It is natural to think about strategies for nonlocal games having these question
and answer sets as being represented by operators of the form

M ∈ L(RX ⊗ RY , R A ⊗ RB ), (8.1)

or equivalently as matrices whose columns are indexed by question pairs and
whose rows are indexed by answer pairs. To be precise, the value M( a, b | x, y) rep-
resents the probability that Alice and Bob answer the questions ( x, y) with the
answer ( a, b).1
This representation is nice because not only does M store the probabilities with
which Alice and Bob respond to each question pair ( x, y) with answers ( a, b), but
also the action of M as a linear operator has meaning. For example, assuming the
referee selects question pairs according to a probability vector π, we have that
Alice and Bob’s answers are distributed according to the vector Mπ. Perhaps more
useful is to consider the vector

v = \sum_{(x,y) \in X \times Y} \pi(x, y)\, |x\rangle \otimes |y\rangle \otimes |x\rangle \otimes |y\rangle   (8.2)

and to observe that

(M \otimes \mathbb{1}_{X \times Y})\, v   (8.3)
represents the joint probability distribution of quadruples ( a, b, x, y) that arise from
the selection of the questions ( x, y) according to π together with Alice and Bob’s
answers to those questions.
Notice also that the probability for a strategy M ∈ L(RX ⊗ RY , R A ⊗ RB ) to
win a particular nonlocal game G = ( X, Y, A, B, π, V ) is equal to

\sum_{(x,y) \in X \times Y} \pi(x, y) \sum_{(a,b) \in A \times B} V(a, b | x, y)\, M(a, b | x, y),   (8.4)

which one may alternatively express as the inner product hK, M i, where the oper-
ator K ∈ L(RX ⊗ RY , R A ⊗ RB ) is defined as

K ( a, b | x, y) = π ( x, y)V ( a, b | x, y) (8.5)

for all x ∈ X, y ∈ Y, a ∈ A, and b ∈ B.


Now, when we consider a particular class of strategies, such as the classical
strategies or the entangled strategies, we are effectively defining a subset

S ⊂ L(RX ⊗ RY , R A ⊗ RB ) (8.6)

of operators that represent strategies in the class under consideration. We then


have the associated value
\omega_{\mathsf{S}}(G) = \sup_{M \in \mathsf{S}} \langle K, M \rangle   (8.7)
of a game G for the class S.
1 Throughout the lecture, entries of operators having the form M ∈ L(RX ⊗ RY , R A ⊗ RB ) are
expressed as M ( a, b | x, y) rather than M(( a, b), ( x, y)), to mirror the notation we have adopted for
the referee’s predicate V ( a, b | x, y).

       
[1 1 1 1]  [1 0 1 0]  [0 1 0 1]  [0 0 0 0]
[0 0 0 0]  [0 1 0 1]  [1 0 1 0]  [1 1 1 1]
[0 0 0 0]  [0 0 0 0]  [0 0 0 0]  [0 0 0 0]
[0 0 0 0]  [0 0 0 0]  [0 0 0 0]  [0 0 0 0]

[1 1 0 0]  [1 0 0 0]  [0 1 0 0]  [0 0 0 0]
[0 0 0 0]  [0 1 0 0]  [1 0 0 0]  [1 1 0 0]
[0 0 1 1]  [0 0 1 0]  [0 0 0 1]  [0 0 0 0]
[0 0 0 0]  [0 0 0 1]  [0 0 1 0]  [0 0 1 1]

[0 0 1 1]  [0 0 1 0]  [0 0 0 1]  [0 0 0 0]
[0 0 0 0]  [0 0 0 1]  [0 0 1 0]  [0 0 1 1]
[1 1 0 0]  [1 0 0 0]  [0 1 0 0]  [0 0 0 0]
[0 0 0 0]  [0 1 0 0]  [1 0 0 0]  [1 1 0 0]

[0 0 0 0]  [0 0 0 0]  [0 0 0 0]  [0 0 0 0]
[0 0 0 0]  [0 0 0 0]  [0 0 0 0]  [0 0 0 0]
[1 1 1 1]  [1 0 1 0]  [0 1 0 1]  [0 0 0 0]
[0 0 0 0]  [0 1 0 1]  [1 0 1 0]  [1 1 1 1]

Figure 8.1: The 16 deterministic strategies for binary question and answer pairs.

Example 8.1. Suppose X = Y = A = B = {0, 1}. There are 16 deterministic strate-


gies for games having these question and answer sets, and they are represented by
the matrices shown in Figure 8.1.
It is natural to associate the classical strategies for these question and answer
sets with the convex hull of these 16 deterministic strategies to account for the pos-
sibility that Alice and Bob make use of randomness.
Representing the CHSH game as an operator K, as described above, yields
 
K = \frac{1}{4} \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}.   (8.8)

It is now perhaps more clear by an inspection of this matrix together with those ap-
pearing in Figure 8.1 that the classical value of the CHSH game is 3/4; as M ranges
over the matrices representing deterministic strategies, it is possible to make three
of the 1s in such a matrix overlap the nonzero entries of K, but never all four.

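The inspection just described can also be carried out mechanically. The following sketch is not part of the notes; it enumerates the 16 deterministic strategies of Figure 8.1 by their underlying answer functions and confirms that the classical value of the CHSH game is 3/4.

    # Enumerate the 16 deterministic strategies for binary questions and
    # answers and evaluate each against the CHSH game.
    from itertools import product

    best = 0.0
    for fa in product([0, 1], repeat=2):       # Alice's answers to x = 0 and x = 1
        for fb in product([0, 1], repeat=2):   # Bob's answers to y = 0 and y = 1
            p = sum((fa[x] ^ fb[y]) == (x & y) for x in [0, 1] for y in [0, 1]) / 4
            best = max(best, p)

    print(best)   # 0.75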
Now, if we were to represent the set of all (probabilistic) classical strategies by

P ⊂ L(RX ⊗ RY , R A ⊗ RB ) (8.9)

and the set of all entangled strategies as

E ⊂ L(RX ⊗ RY , R A ⊗ RB ), (8.10)

then there are various observations we could make. For example:

1. P and E are both convex sets.


2. P is a polytope.
3. P ⊆ E, and the inclusion is proper so long as X, Y, A, and B all have at least
two elements.

As special cases of (8.7) we have

\omega(G) = \sup_{M \in \mathsf{P}} \langle K, M \rangle \quad\text{and}\quad \omega^*(G) = \sup_{M \in \mathsf{E}} \langle K, M \rangle.   (8.11)

8.2 Commuting measurement strategies


Now we will introduce a new set of strategies, called commuting measurement strate-
gies. The basic idea is to drop the tensor product structure that is present in an
entangled strategy, replacing it with the assumption that Alice and Bob’s measure-
ment operators commute. We also drop the requirement that the space H is finite
dimensional.2

Definition 8.2. For a given choice of question sets X and Y and answer sets A
and B, we say that an operator M ∈ L(RX ⊗ RY , R A ⊗ RB ) represents a commuting
measurement strategy if there exists a complex Hilbert space H, a unit vector u ∈ H,
and projection operators
\{P_a^x : x \in X,\, a \in A\} \quad\text{and}\quad \{Q_b^y : y \in Y,\, b \in B\}   (8.12)

acting on H, such that the following properties are satisfied. The projections repre-
sent measurements, in the sense that

\sum_{a \in A} P_a^x = \mathbb{1}_{\mathcal{H}} \quad\text{and}\quad \sum_{b \in B} Q_b^y = \mathbb{1}_{\mathcal{H}}   (8.13)
2 If we restrict the definition to finite-dimensional H, then we obtain precisely the entangled
strategies. It is not the case, in contrast, that allowing the spaces A and B appearing in the defi-
nition of entangled strategies to be infinite dimensional makes entangled strategies equivalent to
commuting measurement strategies.

for all x ∈ X and y ∈ Y, the two collections of projections commute in pairs, meaning
that [P_a^x, Q_b^y] = 0 for all x ∈ X, y ∈ Y, a ∈ A, and b ∈ B, and we have

M(a, b | x, y) = \langle u, P_a^x Q_b^y u \rangle   (8.14)

for all x ∈ X, y ∈ Y, a ∈ A, and b ∈ B.

Hereafter we will write

C ⊂ L(RX ⊗ RY , R A ⊗ RB ) (8.15)

to denote the set of all commuting measurement strategies for the question and
answer sets X, Y, A, and B. The commuting measurement value of G, which is de-
noted ω c ( G ), is the supremum value of the winning probability for G taken over
all commuting measurement strategies for Alice and Bob:

\omega^{c}(G) = \sup_{M \in \mathsf{C}} \langle K, M \rangle,   (8.16)

assuming again that K is defined from G as in (8.5).


Here are a couple of known facts about the set C of commuting measurement
strategies and its relationship to the entangled strategies E, stated without proof:

1. C is compact and convex.


2. E ⊆ C.

Proving that C is convex and includes E is a good exercise that I will leave to you.
The fact that C is compact happens to follow from what we will prove later in the
lecture (as does convexity). The containment E ⊆ C is proper, as was mentioned at
the start of the lecture.

8.3 The NPA hierarchy


This section is devoted to a description of the semidefinite programming hierarchy
of Navascués, Pironio, and Acín—the NPA hierarchy—which is principally con-
cerned with the commuting measurement strategies. Its convergence, and what
exactly that means, will be discussed in the section following this one.

Basics of strings
When defining the NPA hierarchy in precise terms, it is helpful to make use of
some elementary concepts concerning strings of symbols. In particular, we will be

discussing matrices and vectors whose entries are indexed by strings of varying
lengths. These concepts are elementary and quite familiar within theoretical com-
puter science, and it will take just a moment to become familiar with them in case
you are not.
Suppose that an alphabet Σ, which is a finite and nonempty set whose ele-
ments are viewed as symbols, has been fixed. A string over Σ is any finite, ordered
sequence of symbols from Σ. (Infinite sequences of symbols are not to be consid-
ered as strings.) The length of a string is the total number of symbols, counting all
repetitions, appearing in that string.
For example, if Σ = {0, 1}, then 0, 0110, and 1000100100011 are examples of
strings over Σ; the string 0 has length 1, the string 0110 has length 4, and the string
1000100100011 has length 13. There is a special string that has length 0, and this
string is called the empty string. We use the Greek letter ε to denote this string.
For every nonnegative integer n, we write Σ≤n to denote the set of all strings
having length at most n and we write Σ∗ to denote the set of all strings, of any
(finite) length, over Σ. For every alphabet Σ, the set Σ∗ is countably infinite.
Lastly, given any string s ∈ Σ∗ , we let sR denote the string obtained by reversing
the ordering of the symbols in s. For example, if s = 00010, then sR = 01000.

Intuition behind the NPA hierarchy


Next, let us introduce the basic idea behind the NPA hierarchy.
When we think about a particular commuting measurement strategy, represented
by a complex Hilbert space H, a unit vector u ∈ H, and collections of projection
operators {P_a^x : x ∈ X, a ∈ A} and {Q_b^y : y ∈ Y, b ∈ B} acting on H, as
described above, our interest is naturally with the numbers

M(a, b | x, y) = \langle u, P_a^x Q_b^y u \rangle,   (8.17)

ranging over all x ∈ X, y ∈ Y, a ∈ A, and b ∈ B.


These numbers arise when we consider the Gram matrix of the vectors
\{u\} \cup \{P_a^x u : x \in X,\, a \in A\} \cup \{Q_b^y u : y \in Y,\, b \in B\}.   (8.18)

Among the entries of this matrix, one finds all of the values
\langle P_a^x u, Q_b^y u \rangle = \langle u, P_a^x Q_b^y u \rangle = M(a, b | x, y)   (8.19)

that we ostensibly care about, as well as others, including (for instance)


\langle P_a^x u, P_c^z u \rangle, \quad \langle u, Q_b^y u \rangle, \quad\text{and}\quad \langle u, u \rangle,   (8.20)

for appropriate choices of x, y, z, a, b, and c.
We may now think about the various properties that must hold for such a Gram
matrix, but before we do this let us discuss how we will index the entries of such
matrices. Assuming that question and answer sets X, Y, A, and B have been fixed,
we will introduce three alphabets:
\Sigma_A = X \times A, \qquad \Sigma_B = Y \times B, \qquad\text{and}\qquad \Sigma = \Sigma_A \sqcup \Sigma_B.   (8.21)

Here, ⊔ denotes the disjoint union, meaning that Σ_A and Σ_B are to be treated as
disjoint sets when forming Σ. In words, there are | X × A| + |Y × B| symbols in the
alphabet Σ, one symbol for each pair ( x, a) ∈ X × A and a separate symbol for
each pair (y, b) ∈ Y × B. The collection of vectors (8.18) may naturally be labeled
by the set
{ ε } ∪ Σ = Σ ≤1 , (8.22)
and so we may consider that the Gram matrix of these vectors has rows and
columns indexed by this set.
Now, supposing that

R \in \operatorname{Pos}\bigl(\mathbb{C}^{\Sigma^{\leq 1}}\bigr)   (8.23)
is such a Gram matrix—and naturally this is a positive semidefinite matrix, as all
Gram matrices are—there are various things one may say, based on the conditions
that commuting measurement strategies must meet. In particular, we observe these
conditions:
1. R(ε, ε) = 1, given that u is a unit vector.
2. For every x ∈ X we must have

     \sum_{a \in A} R((x, a), s) = R(\varepsilon, s) \quad\text{and}\quad \sum_{a \in A} R(s, (x, a)) = R(s, \varepsilon),   (8.24)
and likewise for every y ∈ Y we must have

     \sum_{b \in B} R((y, b), s) = R(\varepsilon, s) \quad\text{and}\quad \sum_{b \in B} R(s, (y, b)) = R(s, \varepsilon),   (8.25)

for every s ∈ Σ≤1 (in all four equalities). These conditions reflect the fact that
summing over all operators in any given measurement yields the identity op-
erator, as in (8.13).
3. For every x ∈ X and a, c ∈ A satisfying a 6= c, we have
R(( x, a), ( x, c)) = 0, (8.26)
because Pax and Pcx must be orthogonal projection operators. Similarly, for ev-
ery y ∈ Y and b, d ∈ B satisfying b 6= d, we have
R((y, b), (y, d)) = 0. (8.27)

4. For every (z, c) ∈ Σ, we have the equality

R((z, c), (z, c)) = R(ε, (z, c)) = R((z, c), ε); (8.28)
   each P_a^x and Q_b^y is a projection operator, thus squaring to itself.
5. For every ( x, a) ∈ X × A and (y, b) ∈ Y × B we have

R(( x, a), (y, b)) = R((y, b), ( x, a)), (8.29)


   as P_a^x and Q_b^y commute.

Remark 8.3. We can actually do a bit better in the case of the fifth item, and say
that every entry R(( x, a), (y, b)) must be a nonnegative real number, given that
R((x, a), (y, b)) = \langle P_a^x u, Q_b^y u \rangle = \bigl\| P_a^x Q_b^y u \bigr\|^2,   (8.30)

and likewise for R((y, b), ( x, a)). This is a stronger condition than (8.29), as the en-
tries R(( x, a), (y, b)) and R((y, b), ( x, a)) must be equal if they are real given that R
is Hermitian. This stronger claim is not really needed though, and does not appear
in the formal description of the NPA hierarchy coming up.

Let us now take C1 to be the subset of L(RX ⊗ RY , R A ⊗ RB ) containing all M


for which there exists a positive semidefinite operator R satisfying items 1 through
5 above, as well as
M( a, b | x, y) = R(( x, a), (y, b)) (8.31)
for every x ∈ X, y ∈ Y, a ∈ A, and b ∈ B.
Observe that C ⊆ C1 , as every commuting measurement strategy defines a
Gram matrix that satisfies the conditions of items 1 through 5. Thus, assuming
once again that for a given nonlocal game G = ( X, Y, A, B, π, V ) we have defined
K as in (8.5), we see that

\omega^{c}(G) = \sup_{M \in \mathsf{C}} \langle K, M \rangle \leq \sup_{M \in \mathsf{C}_1} \langle K, M \rangle.   (8.32)

The inclusion C ⊆ C1 is proper in general, but we can still use this relationship
to obtain upper bounds on the commuting operator value (and therefore on the
entangled value) of nonlocal games.
Now, notice that items 1 through 5 in the list above are all affine linear constraints
on R. (With the exception of R(ε, ε) = 1 they are all linear constraints.) If we define
a Hermitian operator H ∈ Herm(C^{Σ^{≤1}}) as

H((x, a), (y, b)) = H((y, b), (x, a)) = \frac{1}{2}\, \pi(x, y)\, V(a, b | x, y)   (8.33)

for all x ∈ X, y ∈ Y, a ∈ A, and b ∈ B, and with all other entries equal to zero,
we see that hK, M i = h H, Ri. Thus, the optimization of hK, M i over all M ∈ C1 is
represented by a semidefinite program, where we optimize h H, Ri over all positive
semidefinite R satisfying the affine linear constraints imposed by items 1 through
5 in the list above.
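To make this concrete, here is a minimal sketch, not from the notes, of the semidefinite program just described, specialized to the CHSH game and written with cvxpy (assumed available). Restricting the matrix R to be real and symmetric is a simplifying assumption; for this particular game the resulting bound turns out to be tight, matching the entangled value cos^2(π/8) computed in the previous lecture.

    # Level-1 NPA relaxation for the CHSH game: a moment matrix indexed by
    # {eps} plus (X x A) plus (Y x B), with the constraints of items 1-5 above.
    import cvxpy as cp

    X = Y = A = B = [0, 1]
    labels = ['eps'] + [('A', x, a) for x in X for a in A] \
                     + [('B', y, b) for y in Y for b in B]
    idx = {lab: i for i, lab in enumerate(labels)}
    n = len(labels)
    e = idx['eps']

    R = cp.Variable((n, n), symmetric=True)        # item 5 holds automatically
    cons = [R >> 0, R[e, e] == 1]                  # item 1

    for s in range(n):                             # item 2
        for x in X:
            cons.append(sum(R[idx[('A', x, a)], s] for a in A) == R[e, s])
        for y in Y:
            cons.append(sum(R[idx[('B', y, b)], s] for b in B) == R[e, s])

    for x in X:                                    # item 3
        cons.append(R[idx[('A', x, 0)], idx[('A', x, 1)]] == 0)
    for y in Y:
        cons.append(R[idx[('B', y, 0)], idx[('B', y, 1)]] == 0)

    for lab in labels[1:]:                         # item 4
        cons.append(R[idx[lab], idx[lab]] == R[e, idx[lab]])

    # Objective: CHSH winning probability, pi = 1/4, win when a xor b = x and y.
    win = sum(0.25 * R[idx[('A', x, a)], idx[('B', y, b)]]
              for x in X for y in Y for a in A for b in B if (a ^ b) == (x & y))

    prob = cp.Problem(cp.Maximize(win), cons)
    prob.solve()
    print(prob.value)   # should be close to cos^2(pi/8) ~ 0.8536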
The semidefinite program just suggested is the first level of the NPA hierarchy.
The idea behind subsequent levels is to consider not just the Gram matrix of the
vectors (8.18), but of larger sets of vectors that include ones such as
P_a^x Q_b^y u, \quad P_a^x Q_b^y P_c^z u, \quad \text{etc.}   (8.34)

This will lead us to consider operators R whose rows and columns are indexed not
by Σ≤1 , but by Σ≤k for larger choices of k. (Larger choices for k yield higher levels
in the hierarchy.) Although the inner products between most pairs of these vectors
are not informative to the task of determining how well such a strategy performs
in a given nonlocal game, the benefit comes from the introduction of additional
constraints that reflect the same properties through which the five affine linear con-
straints above were derived. Indeed, the sequence of optimal values obtained from
this hierarchy always converges to ω c ( G ), as we will prove in the next section.

Formal description of the hierarchy


We are now ready to formally define the NPA hierarchy. We will begin by defining
an equivalence relation ∼ on strings over the alphabet Σ = Σ A t Σ B defined above.
This equivalence relation will be used to equate various entries in matrices that
generalize items 4 and 5 in the list described above.
Specifically, we take this equivalence relation to be the one generated by these
rules holding for every s, t ∈ Σ∗ , ( x, a) ∈ Σ A , and (y, b) ∈ Σ B :

1. s( x, a)t ∼ s( x, a)( x, a)t and s(y, b)t ∼ s(y, b)(y, b)t.


2. s( x, a)(y, b)t ∼ s(y, b)( x, a)t.

That is, two strings are equivalent with respect to the relation ∼ if and only if one
can be obtained from the other by any number of applications of the above rules.
These equivalences reflect the fact that projections square to themselves (the first
rule) and Alice’s measurements commute with Bob’s measurements (the second
rule).
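One convenient way to work with the relation ∼ in code is through a normal form: since the two rules let Alice-symbols and Bob-symbols pass through one another and collapse adjacent repetitions, every string reduces to its subsequence of Alice-symbols followed by its subsequence of Bob-symbols, each with adjacent duplicates removed. The sketch below is an illustration of this idea and is not taken from the lecture.

    # A sketch of a normal form deciding the relation ~ described above.
    # Strings are tuples of symbols; is_alice(symbol) reports membership in Sigma_A.
    def _collapse(seq):
        out = []
        for c in seq:
            if not out or out[-1] != c:    # rule 1: adjacent repetitions collapse
                out.append(c)
        return tuple(out)

    def canonical(s, is_alice):
        alice = _collapse([c for c in s if is_alice(c)])    # rule 2: the two
        bob = _collapse([c for c in s if not is_alice(c)])  # alphabets commute
        return alice + bob

    def equivalent(s, t, is_alice):
        return canonical(s, is_alice) == canonical(t, is_alice)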
Next, notice that the values that appear in any Gram matrix R of the form sug-
gested above always take the form

\bigl\langle u,\, \Pi_{c_1}^{z_1} \cdots \Pi_{c_n}^{z_n} u \bigr\rangle   (8.35)

for some finite sequence of projection operators Π_{c_1}^{z_1}, . . . , Π_{c_n}^{z_n} selected from Alice
and Bob’s projections.
With this observation in mind, we say that a function of the form φ : Σ∗ → C is
admissible if it satisfies these properties:

1. φ(ε) = 1.
2. For all strings s, t ∈ Σ∗ we have

     \sum_{a \in A} \phi(s(x, a)t) = \phi(st) \quad\text{and}\quad \sum_{b \in B} \phi(s(y, b)t) = \phi(st)   (8.36)

for every x ∈ X and y ∈ Y.


3. For all strings s, t ∈ Σ∗ we have

φ(s( x, a)( x, c)t) = 0 and φ(s(y, b)(y, d)t) = 0 (8.37)

for every x ∈ X and a, c ∈ A satisfying a 6= c, and every y ∈ Y and b, d ∈ B


satisfying b 6= d, respectively.
4. For all strings s, t ∈ Σ∗ satisfying s ∼ t we have φ(s) = φ(t).

We will also consider a restriction of this notion to functions of the form

φ : Σ≤k → C, (8.38)

which we call admissible if and only if the same conditions listed above hold for
those strings s and t that are sufficiently short so that φ is defined on the arguments
indicated within each condition.
Thus, if φ is a function that is defined from an actual commuting measurement
strategy as
\phi\bigl((z_1, c_1) \cdots (z_n, c_n)\bigr) = \bigl\langle u,\, \Pi_{c_1}^{z_1} \cdots \Pi_{c_n}^{z_n} u \bigr\rangle   (8.39)

for every string (z_1, c_1) · · · (z_n, c_n), where each Π_c^z denotes P_c^z or Q_c^z depending on
whether (z, c) ∈ Σ A or (z, c) ∈ Σ B , respectively, then φ is necessarily admissible.
This is true for functions of both forms φ : Σ∗ → C and φ : Σ≤k → C.
Finally, a positive semidefinite operator
R \in \operatorname{Pos}\bigl(\mathbb{C}^{\Sigma^{\leq k}}\bigr)   (8.40)

is said to be a k-th order admissible operator if there exists an admissible function


φ : Σ≤2k → C such that
R(s, t) = φ(sR t) (8.41)
for every choice of strings s, t ∈ Σ≤k .

Observe that for each positive integer k, the condition that a positive semidef-
inite operator is k-th order admissible is an affine linear constraint, as there are a
finite number of affine linear constraints imposed by the equation (8.41) and the
condition that φ is admissible. Thus, the optimization over all k-th order admissi-
ble operators can be represented by a semidefinite program. This is the NPA hier-
archy, where different choices of k correspond to different levels of the hierarchy.
(Any linear objective function may naturally be considered, but often the objective
function reflects the probability to win a given nonlocal game. Alternatively, one
can set up a semidefinite program that tests whether a given M agrees with some
k-th order admissible operator.)

8.4 Convergence of the NPA hierarchy


Define C_k to be the subset of L(R^X ⊗ R^Y, R^A ⊗ R^B) containing all M for which
there exists a k-th order admissible operator R ∈ Pos(C^{Σ^{≤k}}) satisfying (8.31) for
every x ∈ X, y ∈ Y, a ∈ A, and b ∈ B. One might call any such M a k-th order
pseudo-commuting measurement strategy.
A moment’s thought reveals that

C1 ⊇ C2 ⊇ C3 ⊇ · · · (8.42)

as any (k + 1)-st order admissible operator must yield a k-th order admissible oper-
ator when its rows and columns corresponding to strings longer than k are deleted.
It is also the case that C ⊆ Ck for every positive integer k; just like in the k = 1 case,
an actual commuting measurement strategy defines a Gram matrix R that is k-th
order admissible for every choice of k.
The remainder of the lecture is devoted to proving that the sequence (8.42) con-
verges to C in the sense made precise by the following theorem.

Theorem 8.4. Let X, Y, A, and B be finite and nonempty sets and let C and Ck , for every
positive integer k, be as defined above. It is the case that

C = \bigcap_{k=1}^{\infty} C_k.   (8.43)

Equivalently, for every M ∈ L(RX ⊗ RY , R A ⊗ RB ) the following two statements are


equivalent:

1. M is a commuting measurement strategy.


2. M is a k-th order pseudo-commuting measurement strategy for every positive integer k.

The easier implication
We have already observed the implication that statement 1 implies statement 2.
In more detail, under the assumption that statement 1 holds, it must be that M is
defined by a commuting measurement strategy in which Alice and Bob’s projective
measurements are described by {P_a^x : a ∈ A} for Alice and {Q_b^y : b ∈ B} for Bob,
with all of these projections acting on a Hilbert space H, along with a unit vector
u ∈ H. As before, let Π_c^z denote P_c^z if z ∈ X and c ∈ A, or Q_c^z if z ∈ Y and c ∈ B.
With respect to this notation, one may consider the k-th order admissible operator
R defined by
R(s, t) = φ(sR t), (8.44)
where the function φ is defined as
\phi\bigl((z_1, c_1) \cdots (z_j, c_j)\bigr) = \bigl\langle u,\, \Pi_{c_1}^{z_1} \cdots \Pi_{c_j}^{z_j} u \bigr\rangle   (8.45)

for every string (z1 , c1 ) · · · (z j , c j ) ∈ Σ≤2k . A straightforward verification reveals


that this operator is consistent with M, in the sense of (8.31), and therefore M is a
k-th order pseudo-commuting measurement strategy.

A bound on entries of any k-th order admissible operator


Before approaching the more challenging implication of Theorem 8.4, which is that
statement 2 implies statement 1, we will prove that every entry of a k-th order
admissible operator is bounded by 1 in absolute value.
More explicitly, if R ∈ Pos(C^{Σ^{≤k}}) is k-th order admissible, then

|R(s, t)| \leq 1   (8.46)

for every s, t ∈ Σ≤k . To see that this is so, observe first that
|R(s, t)| \leq \sqrt{R(s, s)}\, \sqrt{R(t, t)}   (8.47)

for each s, t ∈ Σ∗ , which is a consequence of the fact that each 2 × 2 principal


submatrix

\begin{pmatrix} R(s, s) & R(s, t) \\ R(t, s) & R(t, t) \end{pmatrix}   (8.48)
must be positive semidefinite. Noting that R(s, s) is real and nonnegative, it there-
fore suffices to prove that
R(s, s) ≤ 1 (8.49)
for every s ∈ Σ≤k .

The bound (8.49) may be proved by induction on the length of s. For the base
case, one has that R(ε, ε) = 1. For the general case, one has that for any string
s ∈ Σ∗ and any choice of (z, c) ∈ Σ, it holds that

R((z, c)s, (z, c)s) \leq \sum_{d} R((z, d)s, (z, d)s) = \sum_{d} \phi(s^{\mathrm{R}}(z, d)(z, d)s)
  = \sum_{d} \phi(s^{\mathrm{R}}(z, d)s) = \phi(s^{\mathrm{R}}s) = R(s, s),   (8.50)

where φ : Σ≤2k → C is the admissible function from which R is defined, and


the sums are over all d ∈ A or d ∈ B, depending on whether z ∈ X or z ∈ Y,
respectively. By the hypothesis of induction the required bound (8.49) follows.

Entry-wise convergence to an admissible matrix


Now assume statement 2 of Theorem 8.4 holds. For every k ≥ 1, let
R_k \in \operatorname{Pos}\bigl(\mathbb{C}^{\Sigma^{\leq k}}\bigr)   (8.51)

be a k-th order admissible operator satisfying

M( a, b | x, y) = Rk (( x, a), (y, b)) (8.52)

for every x ∈ X, y ∈ Y, a ∈ A, and b ∈ B, and let φk : Σ≤2k → C be the admissible


function that defines Rk .
We will begin with the observation that there exists an infinite matrix

R : Σ∗ × Σ∗ → C (8.53)

having the following properties:

1. Every finite principal submatrix of R is positive semidefinite.


2. M ( a, b | x, y) = R(( x, a), (y, b)) for every x ∈ X, y ∈ Y, a ∈ A, and b ∈ B.
3. There exists an admissible function φ : Σ∗ → C such that R(s, t) = φ(sR t) for
all s, t ∈ Σ∗ .

Note here that we must draw a distinction between an infinite matrix of the form
(8.53) and a linear operator, as these concepts are no longer equivalent in infinite
dimensions.
Such an infinite matrix R can, in fact, be obtained in a fairly straightforward
fashion. Observe first that for every string s ∈ Σ∗ and every infinite, strictly in-
creasing sequence of positive integers (k1 , k2 , k3 , . . .), the sequence

(φk1 (s), φk2 (s), φk3 (s), . . .) (8.54)

must have at least one limit point.3 This is because each value φk (s) agrees with an
entry of Rk , and the entries of each operator Rk are bounded by 1 in absolute value.
Beginning with the string s = ε and the sequence (k1 , k2 , . . .) = (1, 2, . . .), we
consider the function φ : Σ∗ → C defined by the following process:

1. Define φ(s) to be any limit point of the sequence (8.54).


2. Restrict the sequence (k1 , k2 , . . .) to any infinite subsequence for which (8.54)
converges to the chosen limit point, and rename the indices forming this sub-
sequence as (k1 , k2 , k3 , . . .).
3. Increment s and return to step 1.

Here, when we say “increment s,” we are referring to any fixed total ordering of
Σ∗ for which ε is the first string. For example, the strings in Σ∗ may be ordered
according to their length, with strings of equal length being ordered according to
the natural “dictionary ordering” induced by a fixed ordering of Σ (which is the so-
called lexicographic ordering of Σ∗ ). It must be that any function φ obtained through
this process is admissible, by virtue of the fact that every φk is admissible.
Now we may define
R(s, t) = φ(sR t). (8.55)
The three properties required of R follow, either directly or from the recognition
that every finite submatrix of R is equal to the limit of the corresponding subma-
trices of some convergent subsequence of the sequence

( R1 , R2 , R3 , . . . ) (8.56)

together with the fact that the positive semidefinite cone is closed.

Construction of a commuting measurement strategy


We will now make use of a fact concerning countably infinite matrices for which
all finite principal submatrices are positive semidefinite—and that is that any such
matrix must be the Gram matrix of a countably infinite set of vectors chosen from
some Hilbert space. This is not a trivial fact to prove, but we will take it as given.
In the case at hand, this implies that there must exist a Hilbert space K and a
collection of vectors

\{u_s : s \in \Sigma^*\} \subset \mathcal{K},   (8.57)

such that

R(s, t) = \langle u_s, u_t \rangle   (8.58)
3 When considering the sequence (8.54), we ignore those indices k for which φ_k(s) is not defined,
of which there are at most finitely many.

for every s, t ∈ Σ∗ . (The inner product is, naturally, the inner product on K.)
Now let us take H to be the closure of the span of the set {us : s ∈ Σ∗ }, which
is a Hilbert space having countable dimension (and is therefore a separable Hilbert
space). We will define a commuting measurement strategy for Alice and Bob, with
H being the Hilbert space for this strategy. The unit vector associated with this
strategy will be uε ∈ H, which is indeed a unit vector given that

huε , uε i = R(ε, ε) = 1. (8.59)

Next we define a collection of projections on H. For each choice of x ∈ X and


a ∈ A, define P_a^x to be the projection onto the orthogonal complement of the set

\bigl\{ u_{(x,c)s} : c \in A \setminus \{a\},\, s \in \Sigma^* \bigr\},   (8.60)

and along similar lines, for each choice of y ∈ Y and b ∈ B, define Q_b^y to be the
projection onto the orthogonal complement of the set

\bigl\{ u_{(y,d)s} : d \in B \setminus \{b\},\, s \in \Sigma^* \bigr\}.   (8.61)

(The orthogonal complement of any collection of vectors is closed, so these are


well-defined projection operators.)
In order to verify that the objects just defined induce a valid commuting mea-
surement strategy that agrees with M, we will first prove the following useful fact.
For all ( x, a) ∈ X × A, (y, b) ∈ Y × B, and s ∈ Σ∗ , it is the case that
P_a^x u_s = u_{(x,a)s} \quad\text{and}\quad Q_b^y u_s = u_{(y,b)s}.   (8.62)

Note first that for any choice of x ∈ X, a, c ∈ A with a ≠ c, and s, t ∈ Σ∗, we have

\langle u_{(x,a)s}, u_{(x,c)t} \rangle = R((x, a)s, (x, c)t) = \phi\bigl(s^{\mathrm{R}}(x, a)(x, c)t\bigr) = 0,   (8.63)

and similarly for any choice of y ∈ Y, b, d ∈ B with b ≠ d, and s, t ∈ Σ∗, we have

\langle u_{(y,b)s}, u_{(y,d)t} \rangle = 0.   (8.64)

This implies that


P_a^x u_{(x,a)s} = u_{(x,a)s} \quad\text{and}\quad Q_b^y u_{(y,b)s} = u_{(y,b)s}.   (8.65)

We also see that for any choice of x ∈ X and s, t ∈ Σ∗ ,

\sum_{a \in A} \langle u_t, u_{(x,a)s} \rangle = \sum_{a \in A} R(t, (x, a)s)
  = \sum_{a \in A} \phi(t^{\mathrm{R}}(x, a)s) = \phi(t^{\mathrm{R}}s) = R(t, s) = \langle u_t, u_s \rangle,   (8.66)

from which it follows that
u_s = \sum_{a \in A} u_{(x,a)s}.   (8.67)

Similarly, for any choice of y ∈ Y and s ∈ Σ∗,

u_s = \sum_{b \in B} u_{(y,b)s}.   (8.68)

Consequently,
P_a^x u_s = \sum_{c \in A} P_a^x u_{(x,c)s} = P_a^x u_{(x,a)s} = u_{(x,a)s},   (8.69)

and by similar reasoning

Q_b^y u_s = u_{(y,b)s},   (8.70)
as claimed.
With the formulas (8.62) in hand, we can verify the required properties of H, u_ε,
{P_a^x}, and {Q_b^y}. First, see that

\langle u_s, P_a^x Q_b^y u_t \rangle = \langle u_s, u_{(x,a)(y,b)t} \rangle = \phi\bigl(s^{\mathrm{R}}(x, a)(y, b)t\bigr)
  = \phi\bigl(s^{\mathrm{R}}(y, b)(x, a)t\bigr) = \langle u_s, u_{(y,b)(x,a)t} \rangle = \langle u_s, Q_b^y P_a^x u_t \rangle   (8.71)


for every s, t ∈ Σ∗. This implies that P_a^x and Q_b^y commute on the span of the vectors
{us : s ∈ Σ∗ }, and it follows that they commute on all of H by continuity. Second,
for every x ∈ X and s ∈ Σ∗ we find that

\sum_{a \in A} P_a^x u_s = \sum_{a \in A} u_{(x,a)s} = u_s,   (8.72)

from which it follows (again by continuity) that

\sum_{a \in A} P_a^x = \mathbb{1}_{\mathcal{H}}   (8.73)

and similarly

\sum_{b \in B} Q_b^y = \mathbb{1}_{\mathcal{H}}.   (8.74)
To complete the proof, it remains to observe that the strategy represented by the
unit vector u_ε and the projections {P_a^x} and {Q_b^y} yields the strategy M. This is
also evident from the formulas (8.62), as one has

R((x, a), (y, b)) = \langle u_{(x,a)}, u_{(y,b)} \rangle = \langle u_\varepsilon, P_a^x Q_b^y u_\varepsilon \rangle   (8.75)

and therefore

M(a, b | x, y) = \langle u_\varepsilon, P_a^x Q_b^y u_\varepsilon \rangle   (8.76)
for every choice of x ∈ X, y ∈ Y, a ∈ A, and b ∈ B.

98
Two implications
We will conclude the lecture by briefly observing two facts that follow from Theo-
rem 8.4. The first is that the set C of commuting measurement strategies is closed,
as it is the intersection of the closed sets C1 , C2 , . . . (and, by similar reasoning, it
is convex). The second fact is that there is no loss of generality in restricting one’s
attention to separable Hilbert spaces in the definition of commuting measurement
strategies, for this is so for the strategy constructed in the proof.
