
Power of data in quantum machine learning

Hsin-Yuan Huang,1,2,3 Michael Broughton,1 Masoud Mohseni,1 Ryan Babbush,1 Sergio Boixo,1 Hartmut Neven,1 and Jarrod R. McClean1,*

1 Google Research, 340 Main Street, Venice, CA 90291, USA
2 Institute for Quantum Information and Matter, Caltech, Pasadena, CA, USA
3 Department of Computing and Mathematical Sciences, Caltech, Pasadena, CA, USA
* Corresponding author: [email protected]

(Dated: February 12, 2021)
arXiv:2011.01938v2 [quant-ph], 10 Feb 2021
The use of quantum computing for machine learning is among the most exciting prospective applications of quantum technologies. However, machine learning tasks where data is provided can be considerably different than commonly studied computational tasks. In this work, we show that some problems that are classically hard to compute can be easily predicted by classical machines learning from data. Using rigorous prediction error bounds as a foundation, we develop a methodology for assessing potential quantum advantage in learning tasks. The bounds are tight asymptotically and empirically predictive for a wide range of learning models. These constructions explain numerical results showing that with the help of data, classical machine learning models can be competitive with quantum models even if they are tailored to quantum problems. We then propose a projected quantum model that provides a simple and rigorous quantum speed-up for a learning problem in the fault-tolerant regime. For near-term implementations, we demonstrate a significant prediction advantage over some classical models on engineered data sets designed to demonstrate a maximal quantum advantage in one of the largest numerical tests for gate-based quantum machine learning to date, up to 30 qubits.

INTRODUCTION

As quantum technologies continue to rapidly advance, it becomes increasingly important to understand which applications can benefit from the power of these devices. At the same time, machine learning on classical computers has made great strides, revolutionizing applications in image recognition, text translation, and even physics applications, with more computational power leading to ever increasing performance [1]. As such, if quantum computers could accelerate machine learning, the potential for impact is enormous.

At least two paths towards quantum enhancement of machine learning have been considered. First, motivated by quantum applications in optimization [2–4], the power of quantum computing could, in principle, be used to help improve the training process of existing classical models [5, 6], or enhance inference in graphical models [7]. This could include finding better optima in a training landscape or finding optima with fewer queries. However, without more structure known in the problem, the advantage along these lines may be limited to quadratic or small polynomial speedups [8, 9].

The second vein of interest is the possibility of using quantum models to generate correlations between variables that are inefficient to represent through classical computation. The recent success both theoretically and experimentally for demonstrating quantum computations beyond classical tractability can be taken as evidence that quantum computers can sample from probability distributions that are exponentially difficult to sample from classically [10, 11]. If these distributions were to coincide with real-world distributions, this would suggest the potential for significant advantage. This is typically the type of advantage that has been sought in recent work on both quantum neural networks [12–14], which seek to parameterize a distribution through some set of adjustable parameters, and quantum kernel methods [15] that use quantum computers to define a feature map that maps classical data into the quantum Hilbert space. The justification for the capability of these methods to exceed classical models often follows similar lines as Refs [10, 11] or quantum simulation results. That is, if the model leverages a quantum circuit that is hard to sample results from classically, then there is potential for a quantum advantage.

In this work, we show quantitatively how this picture is incomplete in machine learning (ML) problems where some training data is provided. The provided data can elevate classical models to rival quantum models, even when the quantum circuits generating the data are hard to compute classically. We begin with a motivating example and complexity-theoretic argument showing how classical algorithms with data can match quantum output. Following this, we provide rigorous prediction error bounds for training classical and quantum ML methods based on kernel functions [15–24] to learn quantum mechanical models. We focus on kernel methods, as they not only provide provable guarantees, but are also very flexible in the functions they can learn. For example, recent advancements in theoretical machine learning show that training neural networks with large hidden layers is equivalent to training an ML model with a particular kernel, known as the neural tangent kernel [19–21].

[Figure 1: (a) cartoon of the separation between problem complexities — quantum computation (BQP), classical algorithms (BPP), and classical ML algorithms with data; (b) flowchart "Dissecting quantum prediction advantage" — a geometry test (g_CQ ≪ √N versus g_CQ ∝ √N), a dimension test for the quantum space (min(d, Tr(O²)) ≪ N), and a complexity test for the specific function/label (s_C ≪ N, or s_C ∝ N with s_Q ≪ N), with outcomes ranging from "classical ML predicts similar or better than the quantum ML" and "classical ML can learn any U_QNN" to "potential quantum advantage" and "likely hard to learn".]
FIG. 1. Illustration of the relation between complexity classes and a flowchart for understanding and pre-screening potential
quantum advantage. (a) We cartoon the separation between problem complexities that are created by the addition of data to
a problem. Classical algorithms that can learn from data define a complexity class that can solve problems beyond classical
computation (BPP), but it is still expected that quantum computation can efficiently solve problems that classical ML algorithm
with data cannot. Rigorous definition and proof for the separation between classical algorithms that can learn from data and
BPP / BQP is given in Appendix B. (b) The flowchart we develop for understanding the potential for quantum prediction
advantage. N samples of data from a potentially infinite depth QNN made with encoding and function circuits Uenc and UQNN
are provided as input along with quantum and classical methods with associated kernels. Tests are given as functions of N
to emphasize the role of data in the possibility of a prediction advantage. One can first evaluate a geometric quantity gCQ
that measures the possibility of an advantageous quantum/classical prediction separation without yet considering the actual
function to learn. We show how one can efficiently construct an adversarial function that saturates this limit if the test is
passed, otherwise the classical approach is guaranteed to match performance for any function of the data. To subsequently
consider the actual function provided, a label/function specific test may be run using the model complexities sC and sQ . If
one specifically uses the quantum kernel (QK) method, the red dashed arrows can evaluate if all possible choices of UQNN lead
to an easy classical function for the chosen encoding of the data.
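Read as an algorithm, the flowchart amounts to a short pre-screening routine. The sketch below is a minimal, illustrative rendering of these tests in terms of the quantities named in the caption and defined in Eqs. (4)–(8) below; the plain inequalities standing in for "≪" and "∝", and the nesting of the tests, are simplifications chosen here rather than thresholds prescribed by the text.

import numpy as np

def prescreen_quantum_advantage(g_CQ, d, tr_O2, s_C, s_Q, N):
    """Illustrative pre-screening in the spirit of Fig. 1(b).

    g_CQ  : geometric difference between classical and quantum kernels, Eq. (5)
    d     : effective dimension of the quantum state space, Eq. (7)
    tr_O2 : Tr(O^2) of the measured observable
    s_C, s_Q : model complexities of the classical / quantum models, Eq. (4)
    N     : number of training examples
    """
    findings = []
    if g_CQ < np.sqrt(N):
        # Geometry test (independent of labels): classical ML is guaranteed
        # similar or better prediction performance for any function values.
        findings.append("classical ML predicts similar or better than quantum ML")
        if min(d, tr_O2) < N:
            # Dimension test (quantum kernel only): the data encoding itself
            # is classically easy for any choice of U_QNN.
            findings.append("classical ML can learn any U_QNN for this encoding")
    else:
        findings.append("a data set with potential quantum advantage can be constructed")
    # Complexity test for the specific function/labels provided.
    if s_C < N:
        findings.append("classical ML can learn and predict well on these labels")
    elif s_Q < N:
        findings.append("potential quantum prediction advantage on these labels")
    else:
        findings.append("likely hard to learn for both models")
    return findings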

Throughout, when we refer to classical ML models related to our theoretical developments, we will be referring to ML models that can be easily associated with a kernel, either explicitly as in kernel methods, or implicitly as in the neural tangent kernels. However, in the numerical section, we will also include performance comparisons to methods where direct association of a kernel is challenging, such as random forest methods. In the quantum case, we will also show how quantum ML based on kernels can be made equivalent to training an infinite depth quantum neural network.

We use our prediction error bounds to devise a flowchart for testing potential quantum prediction advantage, the separation between prediction errors of quantum and classical ML models for a fixed amount of training data. The most important test is a geometric difference between kernel functions defined by classical and quantum ML. Formally, the geometric difference is defined by the closest efficient classical ML model. In practice, one should consider the geometric difference with respect to a suite of optimized classical ML models. If the geometric difference is small, then a classical ML method is guaranteed to provide similar or better performance in prediction on the data set, independent of the function values or labels. Hence this represents a powerful, function independent pre-screening that allows one to evaluate if there is any possibility of better performance. On the other hand, if the geometry differs greatly, we show both the existence of a data set that exhibits large prediction advantage using the quantum ML model and how one can construct it efficiently. While the tools we develop could be used to compare and construct hard classical models like hash functions, we enforce restrictions that allow us to say something about a quantum separation. In particular, the feature map will be white box, in that a quantum circuit specification is available for the ideal feature map, and that feature map can be made computationally hard to evaluate classically. A constructive example of this is a discrete log feature map, where a provable separation for our kernel is given in Appendix K. Additionally, the minimum over classical models means that classical hash functions are reproduced formally by definition.

Moreover, application of these tools to existing models in the literature rules many of them out immediately, providing a powerful sieve for focusing development of new data encodings.

Following these constructions, in numerical experiments, we find that a variety of common quantum models in the literature perform similarly or worse than classical ML on both classical and quantum data sets due to a small geometric difference. The small geometric difference is a consequence of the exponentially large Hilbert space employed by existing quantum models, where all inputs are too far apart. To circumvent the setback, we propose an improvement, which enlarges the geometric difference by projecting quantum states embedded from classical data back to an approximate classical representation [25–27]. With the large geometric difference endowed by the projected quantum model, we are able to construct engineered data sets to demonstrate large prediction advantage over common classical ML models in numerical experiments up to 30 qubits. Despite our constructions being based on methods with associated kernels, we find empirically that the prediction advantage remains robust across tested classical methods, including those without an easily determined kernel. This opens the possibility to use a small quantum computer to generate efficiently verifiable machine learning problems that could be challenging for classical ML models.

RESULTS

A. Setup and motivating example

We begin by setting up the problems and methods of interest for classical and quantum models, and then provide a simple motivating example for studying how data can increase the power of classical models on quantum data. The focus will be a supervised learning task with a collection of N training examples {(x_i, y_i)}, where x_i is the input data and y_i is an associated label or value. We assume that the x_i are sampled independently from a data distribution D.

In our theoretical analysis, we will consider y_i ∈ R to be generated by some quantum model. In particular, we consider a continuous encoding unitary that maps a classical vector x_i into the quantum state |x_i⟩ = U_enc(x_i)|0^⊗n⟩ and refer to the corresponding density matrix as ρ(x_i). The expressive power of these embeddings has been investigated from a functional analysis point of view [28, 29]; however, the setting where data is provided requires special attention. The encoding unitary is followed by a unitary U_QNN(θ). We then measure an observable O after the quantum neural network. This produces the label/value for input x_i given as y_i = f(x_i) = ⟨x_i| U_QNN^† O U_QNN |x_i⟩. The quantum model considered here is also referred to as a quantum neural network (QNN) in the literature [14, 30]. The goal is to understand when it is easy to predict the function f(x) by training classical/quantum machine learning models.

With notation in place, we turn to a simple motivating example to understand how the availability of data in machine learning tasks can change computational hardness. Consider data points {x_i}_{i=1}^N that are p-dimensional classical vectors with ∥x_i∥_2 = 1, and use amplitude encoding [31–33] to encode the data into an n-qubit state |x_i⟩ = Σ_{k=1}^p x_i^k |k⟩. If U_QNN is a time-evolution under a many-body Hamiltonian, then the function f(x) = ⟨x| U_QNN^† O U_QNN |x⟩ is in general hard to compute classically [34], even for a single input state. In particular, we have the following proposition showing that if a classical algorithm can compute f(x) efficiently, then quantum computers will be no more powerful than classical computers; see Appendix A for a proof.

Proposition 1. If a classical algorithm without training data can compute f(x) efficiently for any U_QNN and O, then BPP = BQP.

Nevertheless, it is incorrect to conclude that training a classical model from data to learn this evolution is hard. To see this, we write out the expectation value as

f(x_i) = (Σ_{k=1}^p x_i^{k*} ⟨k|) U_QNN^† O U_QNN (Σ_{l=1}^p x_i^l |l⟩) = Σ_{k=1}^p Σ_{l=1}^p B_{kl} x_i^{k*} x_i^l,    (1)

which is a quadratic function with p² coefficients B_{kl} = ⟨k| U_QNN^† O U_QNN |l⟩. Using the theory developed later in this work, we can show that, for any U_QNN and O, training a specific classical ML model on a collection of N training examples {(x_i, y_i = f(x_i))} would give rise to a prediction model h(x) with

E_{x∼D} |h(x) − f(x)| ≤ c √(p²/N),    (2)

for a constant c > 0. We refer to Appendix A for the proof of this result. Hence, with N ∝ p²/ε² training data, one can train a classical ML model to predict the function f(x) up to an additive prediction error ε. This elevation of classical models through some training samples is illustrative of the power of data. In Appendix B, we give a rigorous complexity-theoretic argument on the computational power provided by data. A cartoon depiction of the complexity separation induced by data is provided in Fig. 1(a).

While this simple example makes the basic point that sufficient data can change complexity considerations, it perhaps opens more questions than it answers. For example, it uses a rather weak encoding into amplitudes and assumes one has access to an amount of data that is on par with the dimension of the model. The more interesting cases occur if we strengthen the data encoding, include modern classical ML models, and consider a number of data N much less than the dimension of the model. These more interesting cases are the ones we quantitatively answer.

[Figure 2: data points labelled A–H shown in the classical space, embedded into the quantum Hilbert space (where g measures the geometric difference and d ≤ N, the training set size, is the effective dimension of the training set space), and projected back to a classical space by the projected quantum kernel.]

FIG. 2. Cartoon of the geometry (kernel function) defined by classical and quantum ML models. The letters A, B, ... represent data points {x_i} in different spaces with arrows representing the similarity measure (kernel function) between data. The geometric difference g is a difference between similarity measures (arrows) in different ML models and d is an effective dimension of the data set in the quantum Hilbert space.

Our primary interest will be ML algorithms that are much stronger than fitting a quadratic function and where the input data is provided in more interesting ways than an amplitude encoding. In this work, we focus on both classical and quantum ML models based on kernel functions k(x_i, x_j). At a high level, a kernel function can be seen as a measure of similarity, if k(x_i, x_j) is large when x_i and x_j are close. When considered for finite input data, a kernel function may be represented as a matrix K_{ij} = k(x_i, x_j) and the conditions required for kernel methods are satisfied when the matrix representation is Hermitian and positive semi-definite.

A given kernel function corresponds to a nonlinear feature mapping φ(x) that maps x to a possibly infinite-dimensional feature space, such that k(x_i, x_j) = φ(x_i)^† φ(x_j). This is the basis of the so-called "kernel trick" where intricate and powerful maps φ(x_i) can be implemented through the evaluation of relatively simple kernel functions k. As a simple case, in the example above, using a kernel of k(x_i, x_j) = |⟨x_i|x_j⟩|² corresponds to a feature map φ(x_i) = Σ_{kl} x_i^{k*} x_i^l |k⟩ ⊗ |l⟩ which is capable of learning quadratic functions in the amplitudes. In kernel based ML algorithms, the trained model can always be written as h(x) = w^† φ(x) where w is a vector in the feature space defined by the kernel. For example, training a convolutional neural network with large hidden layers [19, 35] is equivalent to using a corresponding neural tangent kernel k^CNN. The feature map φ^CNN for the kernel k^CNN is a nonlinear mapping that extracts all local properties of x [35]. In quantum mechanics, similarly a kernel function can be defined using the native geometry of the quantum state space |x⟩. For example, we can define the kernel function as ⟨x_i|x_j⟩ or |⟨x_i|x_j⟩|². Using the output from this kernel in a method like a classical support vector machine [16] defines the quantum kernel method.

A wide class of functions can be learned with a sufficiently large amount of data by using the right kernel function k. For example, in contrast to the perhaps more natural kernel ⟨x_i|x_j⟩, the quantum kernel k^Q(x_i, x_j) = |⟨x_i|x_j⟩|² = Tr(ρ(x_i)ρ(x_j)) can learn an arbitrarily deep quantum neural network U_QNN that measures any observable O (shown in Appendix C), and the Gaussian kernel, k^γ(x_i, x_j) = exp(−γ∥x_i − x_j∥²) with hyper-parameter γ, can learn any continuous function in a compact space [36], which includes learning any QNN. Nevertheless, the required amount of data N to achieve a small prediction error could be very large in the worst case. Although we will work with other kernels defined through a quantum space, due both to this expressive property and terminology of past work, we will refer to k^Q(x_i, x_j) = Tr[ρ(x_i)ρ(x_j)] as the quantum kernel method throughout this work, which is also the definition given in [15].

B. Testing quantum advantage

We now construct our more general framework for assessing the potential for quantum prediction advantage in a machine learning task. Beginning from a general result, we build both intuition and practical tests based on the geometry of the learning spaces. This framework is summarized in Fig. 1.

Our foundation is a general prediction error bound for training classical/quantum ML models to predict some quantum model defined by f(x) = Tr(O_U ρ(x)) derived from concentration inequalities, where O_U = U_QNN^† O U_QNN. Suppose we have obtained N training examples {(x_i, y_i = f(x_i))}. After training on this data, there exists an ML algorithm that outputs h(x) = w^† φ(x) using kernel k(x_i, x_j) = K_{ij} = φ(x_i)^† φ(x_j) which has a simplified prediction error bounded by

E_{x∼D} |h(x) − f(x)| ≤ c √(s_K(N)/N)    (3)

for a constant c > 0 and N independent samples from the data distribution D. We note here that this and all subsequent bounds have a key dependence on the quantity of data N, reflecting the role of data to improve prediction performance. Due to a scaling freedom between αφ(x) and w/α, we have assumed Σ_{i=1}^N φ(x_i)^† φ(x_i) = Tr(K) = N. A derivation of this result is given in Appendix D.

Given this core prediction error bound, we now seek to understand its implications. The main quantity that determines the prediction error is

s_K(N) = Σ_{i=1}^N Σ_{j=1}^N (K^{-1})_{ij} Tr(O_U ρ(x_i)) Tr(O_U ρ(x_j)).    (4)

The quantity s_K(N) is equal to the model complexity of the trained function h(x) = w^† φ(x), where s_K(N) = ∥w∥² = w^† w after training.
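Both ingredients of this quantity are straightforward to compute once the kernel matrix and the N expectation values Tr(O_U ρ(x_i)) are in hand. The sketch below is illustrative only, with random statevectors standing in for data returned by a quantum device; it builds the quantum kernel K^Q_{ij} = |⟨x_i|x_j⟩|², rescales it so that Tr(K) = N as assumed for Eq. (3), and evaluates s_K(N) from Eq. (4) (the small regularizer is an implementation detail).

import numpy as np

def quantum_kernel(states):
    """K^Q_ij = |<x_i|x_j>|^2 = Tr(rho(x_i) rho(x_j)) from statevectors."""
    overlaps = states.conj() @ states.T          # Gram matrix of <x_i|x_j>
    return np.abs(overlaps) ** 2

def model_complexity(K, y, reg=1e-9):
    """s_K(N) = sum_ij (K^{-1})_ij y_i y_j with y_i = Tr(O_U rho(x_i)), Eq. (4)."""
    N = K.shape[0]
    K = N * K / np.trace(K)                      # enforce Tr(K) = N
    return float(y @ np.linalg.solve(K + reg * np.eye(N), y))

# Toy example: random 4-qubit states and a random diagonal observable O_U.
rng = np.random.default_rng(1)
n_qubits, N = 4, 50
dim = 2 ** n_qubits
states = rng.standard_normal((N, dim)) + 1j * rng.standard_normal((N, dim))
states /= np.linalg.norm(states, axis=1, keepdims=True)
O_diag = rng.choice([-1.0, 1.0], size=dim)
y = np.real(np.einsum('id,d,id->i', states.conj(), O_diag, states))  # <x_i|O|x_i>

print("s_K(N) =", model_complexity(quantum_kernel(states), y))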

A smaller value of s_K(N) implies better generalization to new data x sampled from the distribution D. Intuitively, s_K(N) measures whether the closeness between x_i, x_j defined by the kernel function k(x_i, x_j) matches well with the closeness of the observable expectation for the quantum states ρ(x_i), ρ(x_j), recalling that a larger kernel value indicates two points are closer. The computation of s_K(N) can be performed efficiently on a classical computer by inverting an N × N matrix K after obtaining the N values Tr(O_U ρ(x_i)) by performing order N experiments on a physical quantum device. The time complexity scales at most as order N³. Due to the connection between w^† w and the model complexity, a regularization term w^† w is often added to the optimization problem during the training of h(x) = w^† φ(x), see e.g., [16, 37, 38]. Regularization prevents s_K(N) from becoming too large at the expense of not completely fitting the training data. A detailed discussion and proof under regularization is given in Appendices D and F.

The prediction error upper bound can often be shown to be asymptotically tight by proving a matching lower bound. As an example, when k(x_i, x_j) is the quantum kernel Tr(ρ(x_i)ρ(x_j)), we can deduce that s_K(N) ≤ Tr(O²), hence one would need a number of data N scaling as Tr(O²). In Appendix H, we give a matching lower bound showing that a scaling of Tr(O²) is unavoidable if we assume a large Hilbert space dimension. This lower bound holds for any learning algorithm and not only for quantum kernel methods. The lower bound proof uses mutual information analysis and could easily extend to other kernels. This proof strategy is also employed extensively in a follow-up work [39] to devise upper and lower bounds for classical and quantum ML in learning quantum models. Furthermore, not only are the bounds asymptotically tight, in numerical experiments given in Appendix M we find that the prediction error bound also captures the performance of other classical ML models not based on kernels, where the constant factors are observed to be quite modest.

Given some set of data, if s_K(N) is found to be small relative to N after training for a classical ML model, this quantum model f(x) can be predicted accurately even if f(x) is hard to compute classically for any given x. In order to formally evaluate the potential for quantum prediction advantage generally, one must take s_K(N) to be the minimum over efficient classical models. However, we will be more focused on minimally attainable values over a reasonable set of classical methods with tuned hyperparameters. This prescribes an effective method for evaluating potential quantum advantage in practice, and already rules out a considerable number of examples from the literature.

From the bound, we can see that the potential advantage for one ML algorithm defined by K¹ to predict better than another ML algorithm defined by K² depends on the largest possible separation between s_{K¹} and s_{K²} for a data set. The separation can be characterized by defining an asymmetric geometric difference that depends on the dataset, but is independent of the function values or labels. Hence evaluating this quantity is a good first step in understanding if there is a potential for quantum advantage, as shown in Fig. 1. This quantity is defined by

g_{12} = g(K¹ ∥ K²) = √( ∥ √(K²) (K¹)^{-1} √(K²) ∥_∞ ),    (5)

where ∥·∥_∞ is the spectral norm of the resulting matrix and we assume Tr(K¹) = Tr(K²) = N. One can show that s_{K¹} ≤ g_{12}² s_{K²}, which implies the prediction error bound c √(s_{K¹}/N) ≤ c g_{12} √(s_{K²}/N). A detailed derivation is given in Appendix F 3 and an illustration of g_{12} can be found in Fig. 2. The geometric difference g(K¹ ∥ K²) can be computed on a classical computer by performing a singular value decomposition of the N × N matrices K¹ and K². Standard numerical analysis packages [40] provide highly efficient computation of a singular value decomposition in time at most order N³. Intuitively, if K¹(x_i, x_j) is small/large when K²(x_i, x_j) is small/large, then the geometric difference g_{12} is a small value ∼ 1, where g_{12} grows as the kernels deviate.

To see more explicitly how the geometric difference allows one to make statements about the possibility for one ML model to make different predictions from another, consider the geometric difference g_{CQ} = g(K^C ∥ K^Q) between a classical ML model with kernel k^C(x_i, x_j) and a quantum ML model, e.g., with k^Q(x_i, x_j) = Tr(ρ(x_i)ρ(x_j)). If g_{CQ} is small, because

s_C ≤ g_{CQ}² s_Q,    (6)

the classical ML model will always have a similar or better model complexity s_K(N) compared to the quantum ML model. This implies that the prediction performance for the classical ML will likely be competitive or better than the quantum ML model, and one is likely to prefer using the classical model. This is captured in the first step of our flowchart in Fig. 1.

In contrast, if g_{CQ} is large we show that there exists a data set with s_C = g_{CQ}² s_Q, with the quantum model exhibiting superior prediction performance. An efficient method to explicitly construct such a maximally divergent data set is given in Appendix G and a numerical demonstration of the stability of this separation is provided in the next section. While a formal statement about classical methods generally requires defining it over all efficient classical methods, in practice, we consider g_{CQ} to be the minimum geometric difference among a suite of optimized classical ML models. Our engineered approach minimizes this value as a hyperparameter search to find the best classical adversary, and shows remarkable robustness across classical methods including those without an associated kernel, such as random forests [41].

[Figure 3: panels (a) and (b) showing results for Dataset (Q, E1), Dataset (Q, E2), Dataset (Q, E3), and Dataset (C); see caption below.]

FIG. 3. Relation between dimension d, geometric difference g, and prediction performance. The shaded regions are the standard
deviation over 10 independent runs and n is the number of qubits in the quantum encoding and dimension of the input for
the classical encoding. (a) The approximate dimension d and the geometric difference g with classical ML models for quantum
kernel (Q) and projected quantum kernel (PQ) under different embeddings and system sizes n. (b) Prediction error (lower is
better) of the quantum kernel method (Q), projected quantum kernel method (PQ), and classical ML models on classical (C)
and quantum (Q) data sets with number of data N = 600. As d grows too large, the geometric difference g for quantum kernel
becomes small. We see that small geometric difference g always results in classical ML being competitive or outperforming the
quantum ML model. When g is large, there is a potential for improvement over classical ML. For example, projected quantum
kernel improves upon the best classical ML in Dataset (Q, E3).

In the specific case of the quantum kernel method with K^Q_{ij} = k^Q(x_i, x_j) = Tr(ρ(x_i)ρ(x_j)), we can gain additional insights into the model complexity s_K, and sometimes make conclusions about classical learnability for all possible U_QNN for the given encoding of the data. Let us define vec(X) for a Hermitian matrix X to be a vector containing the real and imaginary part of each entry in X. In this case, we find s_Q = vec(O_U)^T P_Q vec(O_U), where P_Q is the projector onto the subspace formed by {vec(ρ(x_1)), ..., vec(ρ(x_N))}. We highlight

d = dim(P_Q) = rank(K^Q) ≤ N,    (7)

which defines the effective dimension of the quantum state space spanned by the training data. An illustration of the dimension d can be found in Fig. 1. Because P_Q is a projector and has eigenvalues 0 or 1, s_Q ≤ min(d, vec(O_U)^T vec(O_U)) = min(d, Tr(O²)), assuming ∥O∥_∞ ≤ 1. Hence in the case of the quantum kernel method, the prediction error bound may be written as

E_{x∼D} |h(x) − f(x)| ≤ c √( min(d, Tr(O²)) / N ).    (8)

A detailed derivation is given in Appendix E 1. We can also consider the approximate dimension d, where small eigenvalues in K^Q are truncated, by incurring a small training error. After obtaining K^Q from a quantum device, the dimension d can be computed efficiently on a classical machine by performing a singular value decomposition on the N × N matrix K^Q. Estimation of Tr(O²) can be performed by sampling random states |ψ⟩ from a quantum 2-design, measuring O on |ψ⟩, and performing statistical analysis on the measurement data [25]. This prediction error bound shows that a quantum kernel method can learn any U_QNN when the dimension of the training set space d or the squared Frobenius norm of the observable Tr(O²) is much smaller than the amount of data N. In Appendix H, we show that quantum kernel methods are optimal for learning quantum models with bounded Tr(O²), as they saturate the fundamental lower bound. However, in practice, most observables, such as Pauli operators, will have exponentially large Tr(O²), so the central quantity is the dimension d. Using the prediction error bound for the quantum kernel method, if both g_{CQ} and min(d, Tr(O²)) are small, then a classical ML model would also be able to learn any U_QNN. In such a case, one must conclude that the given encoding of the data is classically easy, and this cannot be affected by an arbitrarily deep U_QNN. This constitutes the bottom left part of our flowchart in Fig. 1.

Ultimately, to see a prediction advantage in a particular data set with specific function values/labels, we need a large separation between s_C and s_Q. This happens when the inputs x_i, x_j considered close in a quantum ML model are actually close in the target function f(x), but are far in classical ML. This is represented as the final test in Fig. 1 and the methodology here outlines how this result can be achieved in terms of its more essential components.
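The dimension test is equally direct: d and its approximate variant follow from the spectrum of K^Q, and together with Tr(O²) they give the right-hand side of Eq. (8). A short sketch; the truncation rule for the approximate dimension is a convention chosen here, not one fixed by the text.

import numpy as np

def effective_dimension(K_Q, trunc=1e-3):
    """d = rank(K^Q) of Eq. (7), plus an approximate dimension in which
    eigenvalues below a fraction `trunc` of the largest are discarded."""
    vals = np.linalg.eigvalsh(K_Q)
    d_exact = int(np.sum(vals > 1e-12 * vals.max()))
    d_approx = int(np.sum(vals > trunc * vals.max()))
    return d_exact, d_approx

def quantum_kernel_error_bound(d, tr_O2, N, c=1.0):
    """Right-hand side of Eq. (8): c * sqrt(min(d, Tr(O^2)) / N)."""
    return c * np.sqrt(min(d, tr_O2) / N)

# Example with an amplitude-encoded data set of N = 200 points in 32 dimensions.
rng = np.random.default_rng(3)
states = rng.standard_normal((200, 32))
states /= np.linalg.norm(states, axis=1, keepdims=True)
K_Q = (states @ states.T) ** 2
d, d_approx = effective_dimension(K_Q)
print("d =", d, " approximate d =", d_approx,
      " bound =", quantum_kernel_error_bound(d_approx, tr_O2=32.0, N=200))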

C. Projected quantum kernels

In addition to analyzing existing quantum models, the analysis approach introduced here also provides suggestions for new quantum models with improved properties, which we now address. For example, if we start with the original quantum kernel when the effective dimension d is large, the kernel Tr(ρ(x_i)ρ(x_j)), which is based on a fidelity-type metric, will regard all data as far from each other and the kernel matrix K^Q will be close to the identity. This results in a small geometric difference g_{CQ}, leading to classical ML models being competitive with or outperforming the quantum kernel method. In Appendix I, we present a simple quantum model that requires an exponential amount of samples to learn using the quantum kernel Tr(ρ(x_i)ρ(x_j)), but only needs a linear number of samples to learn using a classical ML model.

To circumvent this setback, we propose a family of projected quantum kernels as a solution. These kernels work by projecting the quantum states to an approximate classical representation, e.g., using reduced physical observables or classical shadows [25, 27, 42–44]. Even if the training set space has a large dimension d ∼ N, the projection allows us to reduce to a low-dimensional classical space that can generalize better. Furthermore, by going through the exponentially large quantum Hilbert space, the projected quantum kernel can be challenging to evaluate without a quantum computer. In numerical experiments, we find that the classical projection increases rather than decreases the geometric difference with classical ML models. These constructions will be the foundation of our best performing quantum method later.

One of the simplest forms of projected quantum kernel is to measure the one-particle reduced density matrix (1-RDM) on all qubits for the encoded state, ρ_k(x_i) = Tr_{j≠k}[ρ(x_i)], and then define the kernel as

k^{PQ}(x_i, x_j) = exp( −γ Σ_k ∥ ρ_k(x_i) − ρ_k(x_j) ∥_F² ).    (9)

This kernel defines a feature map function in the 1-RDM space that is capable of expressing arbitrary functions of powers of the 1-RDMs of the quantum state. From non-intuitive results in density functional theory, we know even one-body densities can be sufficient for determining exact ground state [45] and time-dependent [46] properties of many-body systems under modest assumptions. In Appendix J, we provide examples of other projected quantum kernels. This includes an efficient method for computing a kernel function that contains all orders of RDMs using local randomized measurements and the formalism of classical shadows [25]. The classical shadow formalism allows efficient construction of RDMs from very few measurements. In Appendix K, we show that projected versions of quantum kernels lead to a simple and rigorous quantum speed-up in a recently proposed learning problem based on discrete logarithms [24].
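A minimal sketch of the 1-RDM projected kernel of Eq. (9). Here the single-qubit reduced density matrices are obtained by exact partial trace of simulated statevectors; on a device they would instead be estimated from measurements, for example with classical shadows [25].

import numpy as np

def one_qubit_rdms(state, n_qubits):
    """Single-qubit reduced density matrices rho_k = Tr_{j != k}[|psi><psi|]."""
    psi = state.reshape([2] * n_qubits)
    rdms = []
    for k in range(n_qubits):
        psi_k = np.moveaxis(psi, k, 0).reshape(2, -1)  # qubit k first, rest flattened
        rdms.append(psi_k @ psi_k.conj().T)
    return rdms

def projected_quantum_kernel(states, n_qubits, gamma=1.0):
    """k^PQ(x_i, x_j) = exp(-gamma * sum_k ||rho_k(x_i) - rho_k(x_j)||_F^2), Eq. (9)."""
    all_rdms = [one_qubit_rdms(s, n_qubits) for s in states]
    N = len(states)
    K = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            dist = sum(np.linalg.norm(a - b, 'fro') ** 2
                       for a, b in zip(all_rdms[i], all_rdms[j]))
            K[i, j] = np.exp(-gamma * dist)
    return K

# Example on random 5-qubit states standing in for the encoded data rho(x_i).
rng = np.random.default_rng(4)
n, N = 5, 20
states = rng.standard_normal((N, 2 ** n)) + 1j * rng.standard_normal((N, 2 ** n))
states /= np.linalg.norm(states, axis=1, keepdims=True)
K_PQ = projected_quantum_kernel(states, n)
print(K_PQ.shape, K_PQ[0, :3])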

D. Numerical studies

We now provide numerical evidence up to 30 qubits that supports our theory on the relation between the dimension d, the geometric difference g, and the prediction performance. Using the projected quantum kernel, the geometric difference g is much larger and we see the strongest empirical advantage of a scalable quantum model on quantum data sets to date. These are the largest combined simulation and analysis in digital quantum machine learning that we are aware of, and make use of the TensorFlow and TensorFlow-Quantum packages [47], reaching a peak throughput of up to 1.1 quadrillion floating point operations per second (petaflop/s). Trends of approximately 300 teraflop/s for quantum simulation and 800 teraflop/s for classical analysis were observed up to the maximum experiment size, with the overall floating point operations across all experiments totalling approximately 2 quintillion (exaflop).

In order to mimic a data distribution that pertains to real-world data, we conduct our experiments around the fashion-MNIST data set [48], which is an image classification task for distinguishing clothing items, and is more challenging than the original digit-based MNIST source [49]. We pre-process the data using principal component analysis [50] to transform each image into an n-dimensional vector. The same data is provided to the quantum and classical models, where in the classical case the data is the n-dimensional input vector, and the quantum case uses a given circuit to embed the n-dimensional vector into the space of n qubits. For the quantum embeddings, we explore three options: E1 is a separable rotation circuit [32, 51, 52], E2 is an IQP-type embedding circuit [15], and E3 is a Hamiltonian evolution circuit, with explicit constructions in Appendix L.
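A sketch of this preprocessing step using scikit-learn's PCA. The per-example normalization shown here is one plausible convention; the exact scaling fed into the embedding circuits E1–E3 follows Appendix L, which is not reproduced in this excerpt.

import numpy as np
from sklearn.decomposition import PCA

def preprocess_images(images, n):
    """Map flattened images of shape (num_examples, 784) to n-dimensional vectors.

    The same n-dimensional output is given to the classical models directly
    and, after normalization, used to parameterize the quantum embedding
    circuits acting on n qubits.
    """
    X = PCA(n_components=n).fit_transform(images.astype(np.float64))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit norm per example
    return X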
For the classical ML task (C), the goal is to correctly identify the images as shirts or dresses from the original data set. For the quantum ML tasks, we use the same fashion-MNIST source data and embeddings as above, but take as function values the expectation value of a local observable that has been evolved under a quantum neural network resembling the Trotter evolution of a 1D Heisenberg model with random couplings. In these cases, the embedding is taken as part of the ground truth, so the resulting function will be different depending on the quantum embedding. For these ML tasks, we compare against the best performing model from a list of standard classical ML algorithms with properly tuned hyperparameters (see Appendix L for details).

In Fig. 3, we give a comparison between the prediction performance of classical and quantum ML models. One can see that not only do classical ML models perform best on the original classical dataset, the prediction performance for the classical methods on the quantum datasets is also very competitive and can even outperform existing quantum ML models, despite the quantum ML models having access to the training embedding while the classical methods do not. The performance of the classical ML model is especially strong on Dataset (Q, E1) and Dataset (Q, E2). This elevation of the classical performance is evidence of the power of data. Moreover, this intriguing behavior and the lack of quantum advantage may be explained by considering the effective dimension d and the geometric difference g following our theoretical constructions. From Fig. 3a, we can see that the dimension d of the original quantum state space grows rather quickly, and the geometric difference g becomes small as the dimension becomes too large (d ∝ N) for the standard quantum kernel. The saturation of the dimension coincides with the decreasing and statistical fluctuations in performance seen in Fig. 4. Moreover, given poor ML performance a natural instinct is to throw more resources at the problem, e.g. more qubits, but as demonstrated here, doing this for naïve quantum kernel methods is likely to lead to tiny inner products and even worse performance. In contrast, the projected quantum space has a low dimension even when d grows, and yields a higher geometric difference g for all embeddings and system sizes. Our methodology predicts that, when g is small, the classical ML model will be competitive with or outperform the quantum ML model. This is verified in Fig. 3b for both the original and projected quantum kernel, where a small geometric difference g leads to a very good performance of classical ML models and no large quantum advantage can be seen. Only when the geometric difference g is large (projected kernel method with embedding E3) can we see some mild advantage over the best classical method. This result holds disregarding any detail of the quantum evolution we are trying to learn, even for ones that are hard to simulate classically.

[Figure 4: prediction accuracy versus system size n for the engineered data sets, with panels PQ (E1): g small, PQ (E2): g moderate, and PQ (E3): g large.]

FIG. 4. Prediction accuracy (higher is better) on engineered data sets. A label function is engineered to match the geometric difference g(C||PQ) between projected quantum kernel and classical approaches, demonstrating a significant gap between quantum and the best classical models up to 30 qubits when g is large. We consider the best performing classical ML models among Gaussian SVM, linear SVM, Adaboost, random forest, neural networks, and gradient boosting. We only report the accuracy of the quantum kernel method up to system size n = 28 due to the high simulation cost and the inferior performance.

In order to push the limits of separation between quantum and classical approaches in a learning setting, we now consider a set of engineered data sets with function values designed to saturate the geometric inequality s_C ≤ g(K^C ∥ K^{PQ})² s_{PQ} between classical ML models with associated kernels and the projected quantum kernel method. In particular, we design the data set such that s_{PQ} = 1 and s_C = g(K^C ∥ K^{PQ})². Recall from Eq. (3) that this data set will hence show the largest separation in the prediction error bound c √(s(N)/N). The engineered data set is constructed via a simple eigenvalue problem, with the exact procedure described in Appendix G, and the results are shown in Fig. 4. As the quantum nature of the encoding increases from E1 to E3, corresponding to increasing g, the performance of both the best classical methods and the original quantum kernel declines precipitously. The advantage of the projected quantum kernel closely follows the geometric difference g and reaches more than 20% for large sizes. Despite the optimization of g only being possible for classical methods with an associated kernel, the performance advantage remains stable across other common classical methods. Note that we also constructed engineered data sets saturating the geometric inequality between classical ML and the original quantum kernel, but the small geometric difference g presented no empirical advantage at large system size (see Appendix M).
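One way to realize such a construction, consistent with the definitions of s_K and g in Eqs. (4)–(5) (the exact recipe used for Fig. 4 is the one described in Appendix G, which is not reproduced in this excerpt): take the top eigenvector v of √(K^{PQ}) (K^C)^{-1} √(K^{PQ}) and set the label vector to y = √(K^{PQ}) v, which by construction gives s_{PQ} = vᵀv = 1 and s_C = g(K^C ∥ K^{PQ})². A sketch:

import numpy as np

def sqrtm_psd(K):
    vals, vecs = np.linalg.eigh(K)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.conj().T

def engineered_labels(K_C, K_PQ, reg=1e-9):
    """Labels saturating s_C = g(K_C || K_PQ)^2 while keeping s_PQ = 1."""
    N = K_C.shape[0]
    K_C = N * K_C / np.trace(K_C)                # enforce Tr(K) = N for both kernels
    K_PQ = N * K_PQ / np.trace(K_PQ)
    S = sqrtm_psd(K_PQ)
    M = S @ np.linalg.inv(K_C + reg * np.eye(N)) @ S
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]                              # eigenvector of the largest eigenvalue
    y = S @ v                                    # s_PQ = y^T K_PQ^{-1} y = v^T v = 1
    return np.real(y), np.sqrt(vals[-1])         # label vector and g(K_C || K_PQ)

# In practice K_C is the best classical kernel found by hyperparameter search
# and K_PQ the projected quantum kernel; for a classification variant one
# natural choice is to threshold y at its median.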

In keeping with our arguments about the role of data, when we increase the number of training data N, all methods improve, and the advantage will gradually diminish. While this data set is engineered, it shows the strongest empirical separation on the largest system size to date. We conjecture that this procedure could be used with a quantum computer to create challenging data sets that are easy to learn with a quantum device, hard to learn classically, while still being easy to verify classically given the correct labels. Moreover, the size of the margin implies that this separation may even persist under moderate amounts of noise in a quantum device.

DISCUSSION

The use of quantum computing in machine learning remains an exciting prospect, but quantifying quantum advantage for such applications has some subtle issues that one must approach carefully. Here, we constructed a foundation for understanding opportunities for quantum advantage in a learning setting. We showed quantitatively how classical ML algorithms with data can become computationally more powerful, and a prediction advantage for quantum models is not guaranteed even if the data comes from a quantum process that is challenging to independently simulate. Motivated by these tests, we introduced projected quantum kernels. On engineered data sets, projected quantum kernels outperform all tested classical models in prediction error. To the authors' knowledge, this is the first empirical demonstration of such a large separation between quantum and classical ML models.

This work suggests a simple guidebook for generating ML problems which give a large separation between quantum and classical models, even at a modest number of qubits. The size of this separation and trend up to 30 qubits suggests the existence of learning tasks that may be easy to verify, but hard to model classically, requiring just a modest number of qubits and allowing for device noise. Claims of true advantage in a quantum machine learning setting require not only benchmarking classical machine learning models, but also classical approximations of quantum models. Additional work will be needed to identify embeddings that satisfy the sometimes conflicting requirements of being hard to approximate classically and exhibiting meaningful signal on local observables for very large numbers of qubits. Further research will be required to find use cases on data sets closer to practical interest and evaluate potential claims of advantage, but we believe the tools developed in this work will help to pave the way for this exciting frontier.

ACKNOWLEDGEMENTS

The authors want to thank Richard Kueng, John Platt, John Preskill, Thomas Vidick, Nathan Wiebe, and Chun-Ju Wu for valuable inputs and inspiring discussions. We thank Bálint Pató for crucial contributions in setting up simulations.

[1] A. Halevy, P. Norvig, and F. Pereira, “The unreasonable effectiveness of data,” IEEE Intelligent Systems 24, 8 (2009).
[2] L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the twenty-eighth annual ACM
symposium on Theory of computing (1996) pp. 212–219.
[3] C. Durr and P. Hoyer, “A quantum algorithm for finding the minimum,” arXiv preprint arXiv:quant-ph/9607014 (1996).
[4] E. Farhi, J. Goldstone, S. Gutmann, J. Lapan, A. Lundgren, and D. Preda, “A quantum adiabatic evolution algorithm
applied to random instances of an np-complete problem,” Science 292, 472 (2001).
[5] H. Neven, V. S. Denchev, G. Rose, and W. G. Macready, “Training a large scale classifier with the quantum adiabatic
algorithm,” arXiv preprint arXiv:0912.0779 (2009).
[6] P. Rebentrost, M. Mohseni, and S. Lloyd, “Quantum support vector machine for big data classification,” Phys. Rev. Lett.
113, 130503 (2014).
[7] M. S. Leifer and D. Poulin, “Quantum graphical models and belief propagation,” Annals of Physics 323, 1899 (2008).
[8] S. Aaronson and A. Ambainis, “The need for structure in quantum speedups,” arXiv preprint arXiv:0911.0996 (2009).
[9] J. R. McClean, M. P. Harrigan, M. Mohseni, N. C. Rubin, Z. Jiang, S. Boixo, V. N. Smelyanskiy, R. Babbush, and
H. Neven, “Low depth mechanisms for quantum optimization,” arXiv preprint arXiv:2008.08615 (2020).
[10] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, M. J. Bremner, J. M. Martinis, and H. Neven,
“Characterizing quantum supremacy in near-term devices,” Nature Physics 14, 595 (2018).
[11] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell,
et al., “Quantum supremacy using a programmable superconducting processor,” Nature 574, 505 (2019).
[12] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O’brien, “A
variational eigenvalue solver on a photonic quantum processor,” Nature communications 5, 4213 (2014).
[13] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, “The theory of variational hybrid quantum-classical
algorithms,” New Journal of Physics 18, 023023 (2016).
[14] E. Farhi and H. Neven, “Classification with quantum neural networks on near term processors,” arXiv preprint
arXiv:1802.06002 (2018).
[15] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, “Supervised
learning with quantum-enhanced feature spaces,” Nature 567, 209 (2019).
[16] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning 20, 273 (1995).
[17] B. Schölkopf, A. J. Smola, F. Bach, et al., Learning with kernels: support vector machines, regularization, optimization,
and beyond (2002).
[18] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning (2018).
[19] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” arXiv
preprint arXiv:1806.07572 (2018).

[20] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz, “Neural tangents: Fast and easy
infinite neural networks in python,” arXiv preprint arXiv:1912.02803 (2019).
[21] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide
neural net,” in Advances in Neural Information Processing Systems (2019) pp. 8141–8150.
[22] C. Blank, D. K. Park, J.-K. K. Rhee, and F. Petruccione, “Quantum classifier with tailored quantum kernel,” npj Quantum
Information 6, 1 (2020).
[23] K. Bartkiewicz, C. Gneiting, A. Černoch, K. Jiráková, K. Lemr, and F. Nori, “Experimental kernel-based quantum
machine learning in finite feature space,” Scientific Reports 10, 1 (2020).
[24] Y. Liu, S. Arunachalam, and K. Temme, “A rigorous and robust quantum speed-up in supervised machine learning,”
arXiv preprint arXiv:2010.02174 (2020).
[25] H.-Y. Huang, R. Kueng, and J. Preskill, “Predicting many properties of a quantum system from very few measurements,”
Nat. Phys. (2020).
[26] J. Cotler and F. Wilczek, “Quantum overlapping tomography,” Physical Review Letters 124, 100401 (2020).
[27] M. Paini and A. Kalev, “An approximate description of quantum states,” arXiv preprint arXiv:1910.10543 (2019).
[28] S. Lloyd, M. Schuld, A. Ijaz, J. Izaac, and N. Killoran, “Quantum embeddings for machine learning,” arXiv preprint
arXiv:2001.03622 (2020).
[29] M. Schuld, R. Sweke, and J. J. Meyer, “The effect of data encoding on the expressive power of variational quantum
machine learning models,” arXiv preprint arXiv:2008.08605 (2020).
[30] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, “Barren plateaus in quantum neural network
training landscapes,” Nature communications 9, 1 (2018).
[31] E. Grant, L. Wossnig, M. Ostaszewski, and M. Benedetti, “An initialization strategy for addressing barren plateaus in
parametrized quantum circuits,” Quantum 3, 214 (2019).
[32] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe, “Circuit-centric quantum classifiers,” Physical Review A 101,
032308 (2020).
[33] R. LaRose and B. Coyle, “Robust data encodings for quantum classifiers,” Physical Review A 102, 032420 (2020).
[34] A. W. Harrow and A. Montanaro, “Quantum computational supremacy,” Nature 549, 203 (2017).
[35] Z. Li, R. Wang, D. Yu, S. S. Du, W. Hu, R. Salakhutdinov, and S. Arora, “Enhanced convolutional neural tangent
kernels,” arXiv preprint arXiv:1911.00809 (2019).
[36] C. A. Micchelli, Y. Xu, and H. Zhang, “Universal kernels,” Journal of Machine Learning Research 7, 2651 (2006).
[37] A. Krogh and J. A. Hertz, “A simple weight decay can improve generalization,” in Advances in neural information processing
systems (1992) pp. 950–957.
[38] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters 9, 293
(1999).
[39] H.-Y. Huang, R. Kueng, and J. Preskill, “Information-theoretic bounds on quantum advantage in machine learning,”
arXiv preprint arXiv:2101.02464 (2021).
[40] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling,
A. McKenney, and D. Sorensen, LAPACK Users’ Guide, 3rd ed. (Society for Industrial and Applied Mathematics,
Philadelphia, PA, 1999).
[41] L. Breiman, “Random forests,” Machine learning 45, 5 (2001).
[42] D. Gosset and J. Smolin, “A compressed classical description of quantum states,” arXiv preprint arXiv:1801.05721 (2018).
[43] S. Aaronson, “Shadow tomography of quantum states,” SIAM Journal on Computing , STOC18 (2020).
[44] S. Aaronson and G. N. Rothblum, “Gentle measurement of quantum states and differential privacy,” in Proceedings of the
51st Annual ACM SIGACT Symposium on Theory of Computing (2019) pp. 322–333.
[45] P. Hohenberg and W. Kohn, “Inhomogeneous electron gas,” Physical review 136, B864 (1964).
[46] E. Runge and E. K. Gross, “Density-functional theory for time-dependent systems,” Physical Review Letters 52, 997
(1984).
[47] M. Broughton, G. Verdon, T. McCourt, A. J. Martinez, J. H. Yoo, S. V. Isakov, P. Massey, M. Y. Niu, R. Halavati, E. Peters,
et al., “Tensorflow quantum: A software framework for quantum machine learning,” arXiv preprint arXiv:2003.02989
(2020).
[48] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,”
arXiv preprint arXiv:1708.07747 (2017).
[49] Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” ATT Labs [Online] 2 (2010).
[50] I. T. Jolliffe, “Principal components in regression analysis,” in Principal component analysis (Springer, 1986) pp. 129–155.
[51] M. Schuld and N. Killoran, “Quantum machine learning in feature hilbert spaces,” Physical review letters 122, 040504
(2019).
[52] A. Skolik, J. R. McClean, M. Mohseni, P. van der Smagt, and M. Leib, “Layerwise learning for quantum neural networks,”
arXiv preprint arXiv:2006.14904 (2020).
[53] E. A. Nadaraya, “On estimating regression,” Theory of Probability & Its Applications 9, 141 (1964).
[54] N. S. Altman, “An introduction to kernel and nearest-neighbor nonparametric regression,” The American Statistician 46,
175 (1992).
[55] J. Haah, A. W. Harrow, Z. Ji, X. Wu, and N. Yu, “Sample-optimal tomography of quantum states,” IEEE Transactions
on Information Theory 63, 5628 (2017).
[56] R. A. Servedio and S. J. Gortler, “Equivalences and separations between quantum and classical learnability,” SIAM Journal
on Computing 33, 1067 (2004).

[57] R. Sweke, J.-P. Seifert, D. Hangleiter, and J. Eisert, “On the quantum versus classical learnability of discrete distributions,”
arXiv preprint arXiv:2007.14451 (2020).
[58] M. A. Nielsen and I. L. Chuang, “Quantum computation and quantum information” (Cambridge University Press, 2010).
[59] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, “Learnability and the vapnik-chervonenkis dimension,”
Journal of the ACM (JACM) 36, 929 (1989).
[60] C.-C. Chang and C.-J. Lin, “Libsvm: A library for support vector machines,” ACM transactions on intelligent systems
and technology (TIST) 2, 1 (2011).
[61] L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort,
J. Grobler, R. Layton, J. VanderPlas, A. Joly, B. Holt, and G. Varoquaux, “API design for machine learning software:
experiences from the scikit-learn project,” in ECML PKDD Workshop: Languages for Data Mining and Machine Learning
(2013) pp. 108–122.
[62] D. Wecker, M. B. Hastings, and M. Troyer, “Progress towards practical quantum variational algorithms,” Physical Review
A 92, 042303 (2015).
[63] C. Cade, L. Mineh, A. Montanaro, and S. Stanisic, “Strategies for solving the fermi-hubbard model on near-term quantum
computers,” arXiv preprint arXiv:1912.06007 (2019).
[64] R. Wiersema, C. Zhou, Y. de Sereville, J. F. Carrasquilla, Y. B. Kim, and H. Yuen, “Exploring entanglement and
optimization within the hamiltonian variational ansatz,” arXiv preprint arXiv:2008.02941 (2020).
[65] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz, “Neural tangents: Fast and easy
infinite neural networks in python,” in International Conference on Learning Representations (2020).

Appendix A: Rigorous proofs for statements regarding the motivating example

We first give a simple proof that the motivating example f (x) considered in the main text is in general hard to
compute classically. Then, we show that training a classical ML model to predict the function f (x) is easy on a
classical computer.
Proposition 2 (Restatement of Proposition 1). Consider an input vector x ∈ R^p encoded into an n-qubit state |x⟩ = Σ_{k=1}^p x_k |k⟩. If a randomized classical algorithm can compute

f(x) = ⟨x| U_QNN^† O U_QNN |x⟩    (A1)

up to 0.15 error with high probability over the randomness in the classical algorithm for any n, U_QNN and O, in time polynomial in the description length of U_QNN and O, the input vector size p, and the qubit system size n, then

BPP = BQP.    (A2)
Proof. We consider p = 1 and |xi = |0n i the all zero computational basis state. A language L is in BQP if and only
if there exists a polynomial-time uniform family of quantum circuits {Qn : n ∈ N}, such that
1. For all n ∈ N, Qn takes an n-qubit computational basis state as input, apply Qn on the input state, and
measures the first qubit in the computational basis as output.
2. For all z ∈ L, the probability that output of Q|z| applying on the input z is one is greater than or equal to 2/3.
3. For all z ∈
/ L, the probability that output of Q|z| applying on the input z is zero is greater than or equal to 2/3.
If we have a randomized classical algorithm that can compute $f(x)$, then for any input bitstring $z$ we consider the unitary quantum neural network given by
$$U_{\mathrm{QNN}} = Q_{|z|} \bigotimes_{i=1}^n X_i^{z_i}, \qquad (A3)$$
where $X_i$ is the Pauli-X matrix acting on the $i$-th qubit, and the observable $O$ is given by $Z_1$. Hence, we have
1. For all $z \in L$, $f(x) = \langle x|U_{\mathrm{QNN}}^\dagger O U_{\mathrm{QNN}}|x\rangle = \langle z|Q_{|z|}^\dagger Z_1 Q_{|z|}|z\rangle = \Pr[\text{output of } Q_{|z|} \text{ on input } z \text{ is one}] - \Pr[\text{output of } Q_{|z|} \text{ on input } z \text{ is zero}] \ge 2/3 - 1/3 = 1/3$.
2. For all $z \notin L$, $f(x) = \langle x|U_{\mathrm{QNN}}^\dagger O U_{\mathrm{QNN}}|x\rangle = \langle z|Q_{|z|}^\dagger Z_1 Q_{|z|}|z\rangle = \Pr[\text{output of } Q_{|z|} \text{ on input } z \text{ is one}] - \Pr[\text{output of } Q_{|z|} \text{ on input } z \text{ is zero}] \le 1/3 - 2/3 = -1/3$.
By assumption, we can use the randomized classical algorithm to compute an estimate $\hat{f}(x)$ such that $|\hat{f}(x) - f(x)| < 0.15$ with high probability over the randomness of the classical algorithm. Therefore, with high probability, $\hat{f}(x) > 0$ if $z \in L$ and $\hat{f}(x) < 0$ if $z \notin L$. We can use the sign of $\hat{f}(x)$ to determine whether $z \in L$ or $z \notin L$ with high probability over the randomness of the classical algorithm. This implies that $L \in \mathrm{BPP}$.
Together, the existence of the randomized classical algorithm implies that $\mathrm{BQP} \subseteq \mathrm{BPP}$. By definition, we have $\mathrm{BPP} \subseteq \mathrm{BQP}$, hence $\mathrm{BPP} = \mathrm{BQP}$.
We will now give a classical machine learning algorithm that can learn $f(x)$ efficiently using few samples. Recall that the data points are given by $\{x_i\}_{i=1}^N$, where $x_i \in \mathbb{R}^p$. Now, we consider a classical ML model with the kernel function $k(x_i, x_j) = \left(\sum_{l=1}^p x_{il} x_{jl}\right)^2$, which can be evaluated in time linear in the dimension $p$. Note that this definition of the kernel is equivalent to the quantum kernel $\mathrm{Tr}(\rho(x_i)\rho(x_j)) = |\langle x_i|x_j\rangle|^2$ for the encoding $|x_i\rangle = \sum_{k=1}^p x_{ik}|k\rangle$. We will now use
the theoretical framework we developed in the main text (the section on testing quantum advantage). In particular,
we will use the prediction error of quantum kernel method given in Eq. 8. It shows that for any observable O and
quantum neural network UQNN , the prediction error after training from N data points {(xi , yi = f (xi ))} is given by
$$\mathbb{E}_{x\in\mathcal{D}} |h(x) - f(x)| \le c\sqrt{\frac{\min(d, \mathrm{Tr}(O^2))}{N}}, \qquad (A4)$$
where $d$ is the dimension of the Hilbert space spanned by $\{\rho(x_i)\}_{i=1}^N$. Because we have $\rho(x_i) = |x_i\rangle\langle x_i|$ and $|x_i\rangle = \sum_{k=1}^p x_{ik}|k\rangle$, the dimension of this space is upper bounded by $p^2$. Therefore,
$$\mathbb{E}_{x\in\mathcal{D}} |h(x) - f(x)| \le c\sqrt{\frac{\min(d, \mathrm{Tr}(O^2))}{N}} \le c\sqrt{\frac{p^2}{N}}. \qquad (A5)$$
This is the result stated in the main text. For more details about the machine learning models, the prediction error
bound, and the proof for the prediction error bound of quantum kernel methods, see Appendix D and E 1.
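As an illustration of the classical learner used in this argument, the following is a minimal NumPy sketch of kernel regression with the quadratic kernel $k(x_i, x_j) = (\sum_l x_{il}x_{jl})^2$. The labels $y_i = f(x_i)$ are assumed to be given (in practice they would come from evaluating the quantum model); the random data below are placeholders.

```python
# Minimal sketch of the classical kernel method described above (NumPy only).
import numpy as np

def quadratic_kernel(X1, X2):
    """k(x, x') = (x . x')^2, evaluated for all pairs of rows."""
    return (X1 @ X2.T) ** 2

def fit_kernel_regression(X_train, y_train, lam=1e-6):
    """Solve (K + lam*I) alpha = y for the kernel regression coefficients."""
    K = quadratic_kernel(X_train, X_train)
    return np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

def predict(X_train, alpha, X_test):
    """h(x) = sum_i k(x_i, x) alpha_i for each test point."""
    return quadratic_kernel(X_test, X_train) @ alpha

# Example usage with random placeholder data (p-dimensional inputs).
rng = np.random.default_rng(0)
N, p = 100, 8
X_train = rng.normal(size=(N, p))
y_train = rng.uniform(-1, 1, size=N)          # stand-in for y_i = f(x_i)
alpha = fit_kernel_regression(X_train, y_train)
y_pred = predict(X_train, alpha, rng.normal(size=(10, p)))
```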
FIG. 5. An illustration of the complexity class for classical machine learning algorithms with the availability of data (BPP/samp).
To the right, we show a diagram of the relations between the different complexity classes.

Appendix B: Complexity-theoretic argument for the power of data

In the main text, we give an argument based on an example to demonstrate the power of data. However, this is not
satisfactory when we want to put the power of data on a rigorous footing. To demonstrate this fact from a rigorous
standpoint, let us capture classical ML algorithms that can learn from data by means of a complexity class, which we
refer to as BPP/samp. A language $L$ of bit strings is in BPP/samp if and only if the following holds: there exist probabilistic Turing machines $D$ and $M$. $D$ generates samples $x$ with $|x| = n$ in polynomial time for any input size $n$; $D$ defines a sequence of input distributions $\{\mathcal{D}_n\}$. $M$ takes an input $x$ of size $n$ along with a training set $T = \{(x_i, y_i)\}_{i=1}^{\mathrm{poly}(n)}$ of polynomial size, where $x_i$ is sampled from $\mathcal{D}_n$ using the Turing machine $D$ and $y_i$ conveys language membership: $y_i = 1$ if $x_i \in L$ and $y_i = 0$ if $x_i \notin L$. Moreover, we require

• The probabilistic Turing machine $M$ processes all inputs $x$ in polynomial time (polynomial runtime).

• For all $x \in L$, $M$ outputs 1 with probability greater than or equal to $2/3$ (prob. completeness).

• For all $x \notin L$, $M$ outputs 1 with probability less than or equal to $1/3$ (prob. soundness).

If the Turing machine M neglects the sampled data T , this is equivalent to the definition of BPP. Hence BPP is
contained inside BPP/samp.
We can also see that $T$ is a restricted form of randomized advice string. It is not hard to show that BPP/samp is contained in P/poly, based on the same proof strategy as Adleman's theorem. We consider a new probabilistic Turing machine $M'$ that runs $M$ $18n$ times. Each time, we use an independently sampled training set $T$ from $\mathcal{D}_n$. Then we take a majority vote over the $18n$ runs. By the Chernoff bound, the probability of failure for any given $x$ with $|x| = n$ would be at most $1/e^n$. Hence by the union bound, the probability that all $x$ with $|x| = n$ succeed is at least $1 - (2/e)^n$. This implies the existence of a particular choice of the $18n$ training sets and $18n$ random bit-strings used in each run of the probabilistic Turing machine $M$, such that for all $x$ with $|x| = n$ the decision of whether $x \in L$ is correct. We simply define the advice string $a_n$ to be one such choice of the $18n$ training sets and $18n$ random bit-strings, which will be a string of size polynomial in $n$. Hence we know that BPP/samp is contained in P/poly. An illustration is given in Figure 5. We leave open the question of whether BPP/samp is strictly contained in P/poly.
The separation between P/poly and BPP is often illustrated by undecidable unary languages. The separation
between BPP/samp and BPP could also be proved using a similar example. Actually, an undecidable unary language
serves as an equally good example. Here, we choose to present a slightly more complicated example to demonstrate
what BPP/samp could do. Let us consider an undecidable unary language $L_{\mathrm{hard}} = \{1^n \mid n \in A\}$, where $A$ is a subset of the natural numbers $\mathbb{N}$, and a classically easy language $L_{\mathrm{easy}} \in \mathrm{BPP}$. We assume that for every input size $n$, there exists an input $a_n \in L_{\mathrm{easy}}$ and an input $b_n \notin L_{\mathrm{easy}}$. We define a new language as follows:
$$L = \bigcup_{n=1}^{\infty}\Big(\{x \mid x \in L_{\mathrm{easy}},\ 1^n \in L_{\mathrm{hard}},\ |x| = n\} \cup \{x \mid x \notin L_{\mathrm{easy}},\ 1^n \notin L_{\mathrm{hard}},\ |x| = n\}\Big). \qquad (B1)$$
For each size $n$, if $1^n \in L_{\mathrm{hard}}$, $L$ includes all $x \in L_{\mathrm{easy}}$ with $|x| = n$. If $1^n \notin L_{\mathrm{hard}}$, $L$ includes all $x \notin L_{\mathrm{easy}}$ with $|x| = n$. By definition, if we can decide whether $x \in L$ for an input $x$ using a classical algorithm (BPP), we can decide whether $1^n \in L_{\mathrm{hard}}$ by computing whether $x \in L_{\mathrm{easy}}$. This is however impossible due to the undecidability
of $L_{\mathrm{hard}}$. Hence the language $L$ is not in BPP. On the other hand, for every size $n$, a classical machine learning algorithm can use a single training data point $(x_0, y_0)$ to decide whether $x \in L$. An algorithm is as follows. Using $y_0$, we know whether $x_0 \in L_{\mathrm{easy}}$. Hence, we know whether $1^n \in L_{\mathrm{hard}}$. Then for any input $x$ of size $n$, we can output the correct answer by using the knowledge of whether $1^n \in L_{\mathrm{hard}}$ combined with a classical computation to decide whether $x \in L_{\mathrm{easy}}$. This example nicely illustrates the power of data and how machine learning algorithms can utilize it. In summary, the data provide information that is hard to compute with a classical computer (e.g., whether $1^n \in L_{\mathrm{hard}}$). Then the classical machine learning algorithm performs classical computation to infer the solution from the given knowledge (e.g., computing whether $x \in L_{\mathrm{easy}}$). The same language $L$ also yields a separation between BPP/samp and BQP because $L$ is constructed to be undecidable.
From a practical perspective, it is impossible to obtain training data that is undecidable. But it is still possible to obtain data that cannot be efficiently computed with a classical computer, since the universe operates quantum mechanically. If the universe computed classically, then any data we could obtain would be computable in BPP, and there would be no separation between classical ML algorithms learning with data from BPP and BPP itself. We now present a simple argument for a separation between classical algorithms learning with data coming from quantum computation and BPP. This follows from a similar argument as the previous example. Here, we assume that there is a sequence of quantum circuits such that the Z measurement on the first qubit (being $+1$ with probability $> 2/3$ or $< 1/3$) is hard to decide classically. This defines a unary language $L'_{\mathrm{hard}}$ that is outside BPP but inside BQP. We can then use $L'_{\mathrm{hard}}$ in place of $L_{\mathrm{hard}}$ in the example above. When the data comes from BQP, the class of classical ML algorithms that can learn from the data does not have a separation from BQP.

Appendix C: Relation between quantum kernel methods and quantum neural networks

In this section we demonstrate the formal equivalence of an arbitrary-depth quantum neural network with a quantum kernel method built from the original quadratic quantum kernel. This connection helps demonstrate the feature map induced by this kernel and motivates its use as opposed to the simpler inner product. While this equivalence shows the flexibility of this quantum kernel, it does not imply that it allows learning with a parsimonious amount of data. Indeed, in
many cases it requires both an exponential amount of data and exponential precision in evaluation due to the fidelity
type metric. In later sections we show simple cases where it fails for illustration purposes.
Proposition 3. Training an arbitrarily deep quantum neural network UQNN with a trainable observable O is equivalent
to training a quantum kernel method with kernel kQ (xi , xj ) = Tr(ρ(xi )ρ(xj )).
Proof. Let us define ρi = ρ(xi ) = Uenc (xi ) |0n ih0n | Uenc (xi )† to be the corresponding quantum states for the input
vector $x_i$. The training of a quantum neural network can be written as
$$\min_{U \in \mathcal{C} \subset U(2^n)} \sum_{i=1}^N l(\mathrm{Tr}(O U \rho_i U^\dagger), y_i), \qquad (C1)$$

where l(ỹ, y) is a loss function that measures how close the prediction ỹ is to the true label y, C is the space of all
possible unitaries considered by the parameterized quantum circuit, O is some predefined observable that we measure
after evolving with U . Let us denote the optimal U to be U ∗ , then the prediction for a new input x is given by
Tr(OU ∗ ρ(x)(U ∗ )† ).
On the other hand, the training of the quantum kernel method under the implied feature map is equivalent to training $W \in \mathbb{C}^{2^n \times 2^n}$ under the optimization
$$\min_{W \in \mathbb{C}^{2^n\times 2^n}} \sum_{i=1}^N l(\mathrm{Tr}(W\rho_i), y_i) + \lambda\,\mathrm{Tr}(W^\dagger W), \qquad (C2)$$

where λ ≥ 0 is the regularization parameter and l(ỹ, y) is the loss function. Let us denote the optimal W to be
W ∗ , then the prediction for a new input x is given by Tr(W ∗ ρ(x)). The well-known kernel trick allows efficient
implementation of this machine learning model, and connects the original quantum kernel to the derivation here.
Using the fact that $\rho_i$ is Hermitian and setting $\lambda = 0$, the quantum kernel method can be expressed as
$$\min_{U \in U(2^n),\ O \in \mathbb{C}^{2^n\times 2^n},\ O = O^\dagger} \sum_{i=1}^N l(\mathrm{Tr}(O U \rho_i U^\dagger), y_i). \qquad (C3)$$
This is equivalent to training an arbitrarily deep quantum neural network U with a trainable observable O.
Appendix D: Proof of a general form of prediction error bound

This section is dedicated to deriving the precise statement for the core prediction error bound on which we base our methodology, $\mathbb{E}_x|h(x) - f(x)| \le O(\sqrt{s/N})$, given by the first inequality in Equation (3). We will provide a
detailed proof for the following general theorem when we include the regularization parameter λ. The regularization
parameter λ will be used to improve prediction performance by limiting the complexity of the machine learning model.
Theorem 1. Consider an observable $O$ with $\|O\|_\infty \le 1$, a quantum unitary $U$ (e.g., a quantum neural network or a general Hamiltonian evolution), a mapping of a classical vector $x$ to a quantum state $\rho(x)$, and a training set of $N$ data $\{(x_i, y_i = \mathrm{Tr}(O^U\rho(x_i)))\}_{i=1}^N$, with $O^U = U^\dagger O U$ being the Heisenberg-evolved observable. The training set is sampled from some unknown distribution over the input $x$. Suppose that $k(x, x')$ can be evaluated efficiently and the kernel function is re-scaled to satisfy $\sum_{i=1}^N k(x_i, x_i) = N$. Define the Gram matrix $K_{ij} = k(x_i, x_j)$. For any $\lambda \ge 0$, with probability at least $1-\delta$ over the sampling of the training data, we can learn a model $h(x)$ from the training data such that the expected prediction error is bounded by
$$\mathbb{E}_x|h(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{\mathrm{Tr}(A_{\mathrm{tra}}\,O^U\otimes O^U)}{N}} + \sqrt{\frac{\mathrm{Tr}(A_{\mathrm{gen}}\,O^U\otimes O^U)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right), \qquad (D1)$$
where the two operators $A_{\mathrm{tra}}, A_{\mathrm{gen}}$ are given as
$$A_{\mathrm{tra}} = \lambda^2 \sum_{i=1}^N\sum_{j=1}^N ((K+\lambda I)^{-2})_{ij}\,\rho(x_i)\otimes\rho(x_j), \qquad (D2)$$
$$A_{\mathrm{gen}} = \sum_{i=1}^N\sum_{j=1}^N ((K+\lambda I)^{-1}K(K+\lambda I)^{-1})_{ij}\,\rho(x_i)\otimes\rho(x_j). \qquad (D3)$$

This is a data-dependent bound as Atra and Agen both depend on the N training data.
When we take the limit $\lambda \to 0$, we have $A_{\mathrm{tra}} = 0$ and $A_{\mathrm{gen}} = \sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\,\rho(x_i)\otimes\rho(x_j)$. Thus with probability at least $0.99 = 1-\delta$, we have
$$\mathbb{E}_x|h(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{s_K(N)}{N}}\right), \qquad (D4)$$
where $s_K(N) = \sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\,\mathrm{Tr}(O^U\rho(x_i))\,\mathrm{Tr}(O^U\rho(x_j))$. This is the formula stated in the main text. However, in practice, we recommend using regularization $\lambda > 0$ to prevent numerical instability and to obtain a prediction error bound for a regularized ML model.
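For concreteness, the following is a minimal sketch of how $s_K(N)$ can be computed from a kernel matrix and training labels; it assumes $K$ is already rescaled so that $\mathrm{Tr}(K) = N$ and uses the regularized expression (F1), which reduces to $y^T K^{-1} y$ when $\lambda = 0$.

```python
# Sketch: model complexity s_K(N) from a kernel matrix K and labels y.
import numpy as np

def model_complexity(K, y, lam=0.0):
    """s_K(N) = y^T (K+lam I)^{-1} K (K+lam I)^{-1} y; equals y^T K^{-1} y at lam=0."""
    N = len(y)
    inv = np.linalg.inv(K + lam * np.eye(N))
    return float(y @ (inv @ K @ inv) @ y)
```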
In Section D 1, we will present the definition of the machine learning models used to prove Theorem 1. In Sec-
tion D 2 and D 3, we will analyze the training error and generalization error of the machine learning models we consider
to prove the prediction error bound given in Theorem 1.

1. Definition and training of machine learning models

We consider a class of machine learning models, including Gaussian kernel regression, infinite-width neural networks,
and quantum kernel methods. These models are equivalent to training a linear function mapping from a (possibly
infinite-dimensional) Hilbert space H to R. The linear function can be written as hw, φ(x)i, where w parameterizes
the linear function, h·, ·i : H × H → R is an inner product, and φ(x) is a nonlinear mapping from the input vector x
to the Hilbert space $\mathcal{H}$. For example, in the quantum kernel method, we use the space of $2^n\times 2^n$ Hermitian matrices as the Hilbert space $\mathcal{H}$. This yields a natural definition of the inner product $\langle\rho, \sigma\rangle = \mathrm{Tr}(\rho\sigma) \in \mathbb{R}$.
Because the output $y = \mathrm{Tr}(U^\dagger OU\rho(x))$ of the quantum model satisfies $y \in [-1, 1]$, we confine the output of the machine learning model to the interval $[-1, 1]$. The resulting machine learning model would be
$$h_w(x) = \min(1, \max(-1, \langle w, \phi(x)\rangle)). \qquad (D5)$$
For efficient optimization of $w$, we consider minimization of the following loss function
$$\min_w\ \lambda \langle w, w\rangle + \sum_{i=1}^N \left(\langle w, \phi(x_i)\rangle - \mathrm{Tr}(U^\dagger OU\rho(x_i))\right)^2, \qquad (D6)$$
where λ ≥ 0 is a hyper-parameter. We define Φ = (φ(x1 ), . . . , φ(xN )). The kernel matrix K = Φ† Φ is an N ×N matrix
that defines the geometry between all pairs of the training data. We see that Kij = hφ(xi ), φ(xj )i = k(xi , xj ) ∈ R.
Without loss of generality, we consider Tr(K) = N , which can be done by rescaling k(xi , xj ). The optimal w can be
written down explicitly as
$$w = \sum_{i=1}^N\sum_{j=1}^N \phi(x_i)\,((K+\lambda I)^{-1})_{ij}\,\mathrm{Tr}(U^\dagger OU\rho(x_j)). \qquad (D7)$$

Hence the trained machine learning model would be
$$h_w(x) = \min\!\left(1, \max\!\left(-1, \sum_{i=1}^N\sum_{j=1}^N k(x_i, x)\,((K+\lambda I)^{-1})_{ij}\,\mathrm{Tr}(U^\dagger OU\rho(x_j))\right)\right). \qquad (D8)$$

This is an analytic representation for various trained machine learning models, including least-square support vector
machine [38], kernel regression [53, 54], and infinite-width neural networks [19]. We will now analyze the prediction
error of these machine learning models:
$$\epsilon_w(x) = |h_w(x) - \mathrm{Tr}(U^\dagger OU\rho(x))|, \qquad (D9)$$
which is uniquely determined by the kernel matrix K and the hyper-parameter λ. In particular, we will focus on
providing an upper bound on the expected prediction error
$$\mathbb{E}_x\,\epsilon_w(x) = \underbrace{\frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i)}_{\text{Training error}} + \underbrace{\mathbb{E}_x\,\epsilon_w(x) - \frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i)}_{\text{Generalization error}}, \qquad (D10)$$

which is the sum of training error and generalization error.
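The trained model in Equation (D8) can be evaluated directly from the kernel matrix. The following is a small sketch, assuming the kernel function and the labels $y_j = \mathrm{Tr}(U^\dagger OU\rho(x_j))$ are provided.

```python
# Sketch of the trained model in Eq. (D8).
import numpy as np

def train(K, y, lam):
    """Return beta = (K + lam I)^{-1} y, so that the model is kernel_row @ beta."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def h_w(kernel_row, beta):
    """kernel_row[i] = k(x_i, x); the output is clipped to [-1, 1] as in Eq. (D8)."""
    return float(np.clip(kernel_row @ beta, -1.0, 1.0))
```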

2. Training error

We will now relate the training error to the optimization problem, i.e., Equation (D6), for obtaining the ma-
chine learning model $h_w(x)$. Because $\|O\|_\infty \le 1$, we have $\mathrm{Tr}(U^\dagger OU\rho(x)) \in [-1, 1]$, and hence $\epsilon_w(x) = |h_w(x) - \mathrm{Tr}(U^\dagger OU\rho(x))| \le |\langle w, \phi(x)\rangle - \mathrm{Tr}(U^\dagger OU\rho(x))|$. Using the convexity of $x^2$ and Jensen's inequality, we obtain
$$\frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i) \le \sqrt{\frac{1}{N}\sum_{i=1}^N \left(\langle w, \phi(x_i)\rangle - \mathrm{Tr}(U^\dagger OU\rho(x_i))\right)^2}. \qquad (D11)$$
We can plug in the expression for the optimal $w$ given in Equation (D7) to yield
$$\frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i) \le \sqrt{\frac{\mathrm{Tr}(A_{\mathrm{tra}}\,(U^\dagger OU)\otimes(U^\dagger OU))}{N}}, \qquad (D12)$$
where $A_{\mathrm{tra}} = \lambda^2\sum_{i=1}^N\sum_{j=1}^N ((K+\lambda I)^{-2})_{ij}\,\rho(x_i)\otimes\rho(x_j)$. When $K$ is invertible and $\lambda = 0$, we can see that the training
error is zero. However, in practice, we often set λ > 0.

3. Generalization error

A basic theorem in statistics and learning theory is presented below. This theorem provides an upper bound on
the largest (one-sided) deviation from expectation over a family of functions.
Theorem 2 (See Theorem 3.3 in [18]). Let $\mathcal{G}$ be a family of functions mapping from a set $Z$ to $[0, 1]$. Then for any $\delta > 0$, with probability at least $1-\delta$ over an independent and identically distributed draw of $N$ samples $z_1, \ldots, z_N$ from $Z$, we have for all $g \in \mathcal{G}$,
$$\mathbb{E}_z[g(z)] \le \frac{1}{N}\sum_{i=1}^N g(z_i) + 2\mathbb{E}_\sigma\!\left[\sup_{g\in\mathcal{G}}\frac{1}{N}\sum_{i=1}^N \sigma_i g(z_i)\right] + 3\sqrt{\frac{\log(2/\delta)}{2N}}, \qquad (D13)$$
where $\sigma_1, \ldots, \sigma_N$ are independent and uniform random variables over $\pm 1$.
For our purpose, we will consider $Z$ to be the space of input vectors, with $z_i = x_i$ drawn from some input distribution. Each function $g$ would be equal to $\epsilon_w/2$ for some $w$, where $\epsilon_w$ is defined in Equation (D9). We divide by 2 because the range of $\epsilon_w$ is $[0, 2]$. For all $\gamma = 1, 2, 3, \ldots$, we define $\mathcal{G}_\gamma$ to be $\{\epsilon_w/2 \mid \|w\| \le \gamma\}$. The definition of an infinite sequence of families of functions $\mathcal{G}_\gamma$ is useful for proving a prediction error bound for an unbounded class of machine learning models $h_w(x)$, where $\|w\|$ could be arbitrarily large. Using Theorem 2 and multiplying the entire inequality by 2, we can show that the following inequality holds for any $w$ with $\|w\| \le \gamma$,
$$\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i) \le 2\mathbb{E}_\sigma\!\left[\sup_{\|v\|\le\gamma}\frac{1}{N}\sum_{i=1}^N \sigma_i\,\epsilon_v(x_i)\right] + 6\sqrt{\frac{\log(4\gamma^2/\delta)}{2N}}, \qquad (D14)$$
with probability at least $1 - \delta/2\gamma^2$. This probabilistic statement holds for any $\gamma = 1, 2, 3, \ldots$, but this does not yet guarantee that the inequality holds for all $\gamma$ with high probability. We need to apply a union bound over all $\gamma$ to achieve this, which shows that Inequality (D14) holds for all $\gamma$ with probability at least $1 - \sum_{\gamma=1}^\infty \delta/2\gamma^2 \ge 1-\delta$.
Together we have shown that, for any $w \in \mathcal{H}$, the generalization error $\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i)$ is upper bounded by
$$2\mathbb{E}_\sigma\!\left[\sup_{\|v\|\le\lceil\|w\|\rceil}\frac{1}{N}\sum_{i=1}^N \sigma_i\,\epsilon_v(x_i)\right] + 6\sqrt{\frac{\log(4\lceil\|w\|\rceil^2/\delta)}{2N}}, \qquad (D15)$$
with probability at least $1-\delta$, where we consider the particular inequality with $\gamma = \lceil\|w\|\rceil$. We will now analyze the
above inequality using Talagrand’s contraction lemma.

Lemma 1 (Talagrand's contraction lemma; see Lemma 5.7 in [18]). Let $\mathcal{G}$ be a family of functions from a set $Z$ to $\mathbb{R}$. Let $l_1, \ldots, l_N$ be Lipschitz-continuous functions from $\mathbb{R} \to \mathbb{R}$ with Lipschitz constant $L$. Then
$$\mathbb{E}_\sigma\!\left[\sup_{g\in\mathcal{G}}\frac{1}{N}\sum_{i=1}^N \sigma_i\, l_i(g(z_i))\right] \le L\,\mathbb{E}_\sigma\!\left[\sup_{g\in\mathcal{G}}\frac{1}{N}\sum_{i=1}^N \sigma_i\, g(z_i)\right]. \qquad (D16)$$

We consider $l_i(s) = |\min(1, \max(-1, s)) - \mathrm{Tr}(U^\dagger OU\rho(x_i))|$, $z_i = x_i$, and $\mathcal{G} = \{g_v(z) = \langle v, \phi(z)\rangle \mid \|v\| \le \lceil\|w\|\rceil\}$. This choice of functions gives $\epsilon_v(x_i) = l_i(g_v(z_i))$. Furthermore, $l_i$ is Lipschitz-continuous with Lipschitz constant 1. Talagrand's contraction lemma then allows us to bound the formula in Equation (D15) by
$$2\mathbb{E}_\sigma\!\left[\sup_{\|v\|\le\lceil\|w\|\rceil}\frac{1}{N}\sum_{i=1}^N \sigma_i\langle v, \phi(x_i)\rangle\right] + 6\sqrt{\frac{\log(4\lceil\|w\|\rceil^2/\delta)}{2N}} \qquad (D17)$$
$$\le 2\mathbb{E}_\sigma\!\left[\sup_{\|v\|\le\lceil\|w\|\rceil}\|v\|\,\left\|\frac{1}{N}\sum_{i=1}^N \sigma_i\,\phi(x_i)\right\|\right] + 6\sqrt{\frac{\log(4\lceil\|w\|\rceil^2/\delta)}{2N}} \qquad (D18)$$
$$\le 2\lceil\|w\|\rceil\,\mathbb{E}_\sigma\!\left[\left\|\frac{1}{N}\sum_{i=1}^N \sigma_i\,\phi(x_i)\right\|\right] + 6\sqrt{\frac{\log(4\lceil\|w\|\rceil^2/\delta)}{2N}} \qquad (D19)$$
$$\le 2\,\frac{\lceil\|w\|\rceil}{N}\sqrt{\mathbb{E}_\sigma\sum_{i=1}^N\sum_{j=1}^N \sigma_i\sigma_j\, k(x_i, x_j)} + 6\sqrt{\frac{\log(4\lceil\|w\|\rceil^2/\delta)}{2N}} \qquad (D20)$$
$$\le 2\,\frac{\sqrt{\lceil\|w\|\rceil^2\,\mathrm{Tr}(K)}}{N} + 6\sqrt{\frac{\log(4\lceil\|w\|\rceil^2/\delta)}{2N}} \qquad (D21)$$
$$\le 2\sqrt{\frac{\lceil\|w\|\rceil^2}{N}} + 6\sqrt{\frac{\log(\lceil\|w\|\rceil)}{N}} + 6\sqrt{\frac{\log(4/\delta)}{2N}} \qquad (D22)$$
$$\le 8\sqrt{\frac{\lceil\|w\|\rceil^2}{N}} + 6\sqrt{\frac{\log(4/\delta)}{2N}}. \qquad (D23)$$
The first inequality uses the Cauchy-Schwarz inequality. The second inequality uses the fact that $\|v\| \le \lceil\|w\|\rceil$. The third inequality uses Jensen's inequality to move $\mathbb{E}_\sigma$ into the square root. The fourth inequality uses the fact that the $\sigma_i$ are independent and uniform random variables taking values $\pm 1$. The fifth inequality uses $\sqrt{x+y} \le \sqrt{x} + \sqrt{y},\ \forall x, y \ge 0$, and our assumption that we rescale $K$ such that $\mathrm{Tr}(K) = N$. The sixth inequality uses the fact that $x^2 \ge \log(x),\ \forall x \in \mathbb{N}$.
Finally, we plug in the optimal $w$ given in Equation (D7). This allows us to obtain an upper bound on the generalization error:
$$\mathbb{E}_x[\epsilon_w(x)] - \frac{1}{N}\sum_{i=1}^N \epsilon_w(x_i) \le 8\,\frac{\left\lceil\sqrt{\mathrm{Tr}(A_{\mathrm{gen}}\,(U^\dagger OU)\otimes(U^\dagger OU))}\right\rceil}{\sqrt{N}} + 6\sqrt{\frac{\log(4/\delta)}{2N}}, \qquad (D24)$$
where $A_{\mathrm{gen}} = \sum_{i=1}^N\sum_{j=1}^N ((K+\lambda I)^{-1}K(K+\lambda I)^{-1})_{ij}\,\rho(x_i)\otimes\rho(x_j)$. When $K$ is invertible and $\lambda = 0$, we have $A_{\mathrm{gen}} = \sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\,\rho(x_i)\otimes\rho(x_j)$.

Appendix E: Simplified prediction error bound based on dimension and geometric difference

In this section, we will show that for quantum kernel methods, we have
$$\mathbb{E}_x|h_Q(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{\min(d, \mathrm{Tr}(O^2))}{N}}\right), \qquad (E1)$$
where $d$ is the dimension of the training set space, $d = \dim(\mathrm{span}(\rho(x_1), \ldots, \rho(x_N)))$. If we use the quantum kernel method as a reference point, then the prediction error of another machine learning algorithm that produces $h(x)$ using a kernel matrix $K$ can be bounded by
$$\mathbb{E}_x|h(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(g\sqrt{\frac{\min(d, \mathrm{Tr}(O^2))}{N}}\right), \qquad (E2)$$
where $g = \sqrt{\left\|\sqrt{K^Q}\,K^{-1}\sqrt{K^Q}\right\|_\infty}$, assuming the normalization condition $\mathrm{Tr}(K^Q) = \mathrm{Tr}(K) = N$.

1. Quantum kernel method

In the quantum kernel method, the kernel function used to train the model is defined using the quantum Hilbert space: $k_Q(x, x') = \mathrm{Tr}(\rho(x)\rho(x'))$. Correspondingly, we define the kernel matrix $K^Q_{ij} = k_Q(x_i, x_j)$. We will focus on $\rho(x)$ being a pure state, so the scaling condition $\mathrm{Tr}(K^Q) = \sum_{i=1}^N k_Q(x_i, x_i) = N$ is immediately satisfied. We also denote the trained model as $h_Q$ for the quantum kernel method. We now consider an orthonormal basis $\{\sigma_1, \ldots, \sigma_d\}$ for the $d$-dimensional quantum state space formed by the training data, $\mathrm{span}\{\rho(x_1), \ldots, \rho(x_N)\}$, under the inner product $\langle\rho, \sigma\rangle = \mathrm{Tr}(\rho\sigma)$. Each $\sigma_p$ is Hermitian with $\mathrm{Tr}(\sigma_p^2) = 1$, but $\sigma_p$ may not be positive semi-definite. We consider an expansion of $\rho(x_i)$ in terms of $\sigma_p$:
$$\rho(x_i) = \sum_{p=1}^d \alpha_{ip}\,\sigma_p, \qquad (E3)$$

where α ∈ RN ×d . The coefficient α is real as the vector space of Hermitian matrices is over real numbers. Note that
multiplying a Hermitian matrix with an imaginary number will not generally result in a Hermitian matrix, hence
Hermitian matrices are not a vector space over complex numbers. We can perform a singular value decomposition
on $\alpha = U\Sigma V^\dagger$, where $U \in \mathbb{C}^{N\times d}$ and $\Sigma, V \in \mathbb{C}^{d\times d}$ with $U^\dagger U = I$, $\Sigma$ diagonal with $\Sigma \succeq 0$, and $V^\dagger V = VV^\dagger = I$. Then $K^Q = \alpha\alpha^\dagger = U\Sigma^2 U^\dagger$. This allows us to explicitly evaluate $A_{\mathrm{tra}}$ and $A_{\mathrm{gen}}$ given in Equations (D2) and (D3):
$$A_{\mathrm{tra}} = \lambda^2\sum_{p=1}^d\sum_{q=1}^d\left(V\frac{\Sigma^2}{(\Sigma^2+\lambda I)^2}V^\dagger\right)_{pq}\sigma_p\otimes\sigma_q, \qquad (E4)$$
$$A_{\mathrm{gen}} = \sum_{p=1}^d\sum_{q=1}^d\left(V\frac{\Sigma^4}{(\Sigma^2+\lambda I)^2}V^\dagger\right)_{pq}\sigma_p\otimes\sigma_q, \qquad (E5)$$
which can be done by expanding $\rho(x_i)$ in terms of $\sigma_p$. Because $\Sigma \succ 0$, when we take the limit $\lambda\to 0$, we have $A_{\mathrm{tra}} = 0$ and $A_{\mathrm{gen}} = \sum_{p=1}^d\sum_{q=1}^d \delta_{pq}\,\sigma_p\otimes\sigma_q = \sum_{p=1}^d \sigma_p\otimes\sigma_p$. Hence $\mathrm{Tr}(A_{\mathrm{tra}}\,O^U\otimes O^U) = 0$ and $\mathrm{Tr}(A_{\mathrm{gen}}\,O^U\otimes O^U) = \sum_{p=1}^d \mathrm{Tr}(\sigma_p O^U)^2$. From Equation (D1) with $\lambda\to 0$, we have
$$\mathbb{E}_x|h_Q(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right). \qquad (E6)$$
Because $\{\sigma_1, \ldots, \sigma_d\}$ forms an orthonormal set in the space of $2^n\times 2^n$ Hermitian matrices, $\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2$ is the squared Frobenius norm of the observable $O^U$ restricted to the subspace $\mathrm{span}\{\sigma_1, \ldots, \sigma_d\}$.
We now focus on obtaining an informative upper bound on how large $\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2$ could be. First, because we can extend the subspace $\mathrm{span}\{\sigma_1, \ldots, \sigma_d\}$ to the full Hilbert space $\mathrm{span}\{\sigma_1, \ldots, \sigma_{4^n}\}$, we have $\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2 \le \sum_{p=1}^{4^n} \mathrm{Tr}(O^U\sigma_p)^2 = \mathrm{Tr}((O^U)^2) = \|O^U\|_F^2$. Next, we will show that $\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2 \le d\,\|O^U\|_\infty^2 \le d$, where $\|O^U\|_\infty$ is the spectral norm of the observable $O^U$. We pick a linearly independent set $\{\rho_1, \ldots, \rho_d\}$ from $\{\rho(x_1), \ldots, \rho(x_N)\}$.
We assume that all the quantum states are pure, hence we have $\rho_i = |\psi_i\rangle\langle\psi_i|,\ \forall i = 1, \ldots, d$. The pure states $\{|\psi_1\rangle, \ldots, |\psi_d\rangle\}$ may not be orthogonal, so we perform a Gram-Schmidt process to create an orthonormal set of quantum states $\{|\phi_1\rangle, \ldots, |\phi_d\rangle\}$. Because the $\rho_i$ are linear combinations of $|\phi_q\rangle\langle\phi_r|,\ \forall q, r = 1, \ldots, d$, we have
$$\sigma_p = \sum_{q=1}^d\sum_{r=1}^d s_{pqr}\,|\phi_q\rangle\langle\phi_r|, \quad \forall p = 1, \ldots, d. \qquad (E7)$$
The condition $\mathrm{Tr}(\sigma_p\sigma_{p'}) = \delta_{pp'}$ implies that $\sum_{q=1}^d\sum_{r=1}^d s_{pqr}s_{p'qr} = \delta_{pp'}$. If we view $s_p$ as a vector $\vec{s}_p$ of size $d^2$, then $\langle\vec{s}_p, \vec{s}_{p'}\rangle = \delta_{pp'}$. Thus $\{\vec{s}_1, \ldots, \vec{s}_d\}$ forms a set of orthonormal vectors in $\mathbb{R}^{d^2}$, which implies $\sum_{p=1}^d \vec{s}_p\vec{s}_p^\dagger \preceq I$. Let us define the projection operator $P = \sum_{q=1}^d |\phi_q\rangle\langle\phi_q|$. We will also consider a vector $\vec{o} \in \mathbb{R}^{d^2}$, where $\vec{o}_{qr} = \langle\phi_r|O^U|\phi_q\rangle$.
We have
$$\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2 = \sum_{p=1}^d\left(\sum_{q=1}^d\sum_{r=1}^d s_{pqr}\,\langle\phi_r|O^U|\phi_q\rangle\right)^2 = \sum_{p=1}^d \vec{o}^\dagger\vec{s}_p\vec{s}_p^\dagger\vec{o} \qquad (E8)$$
$$\le \vec{o}^\dagger\vec{o} = \sum_{q=1}^d\sum_{r=1}^d \langle\phi_r|O^U|\phi_q\rangle^2 = \left\|P O^U P\right\|_F^2. \qquad (E9)$$
The inequality comes from the fact that $\sum_{p=1}^d \vec{s}_p\vec{s}_p^\dagger \preceq I$. With a proper choice of basis, one could view $P O^U P$ as a $d\times d$ matrix. Hence $\|P O^U P\|_F \le \sqrt{d}\,\|P O^U P\|_\infty \le \sqrt{d}\,\|O^U\|_\infty$. This establishes the fact that $\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2 \le d\,\|O^U\|_\infty^2 \le d$. Combining with Equation (E6), we have
$$\mathbb{E}_x|h_Q(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{\min(d, \|O^U\|_F^2)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right). \qquad (E10)$$

This elucidates the fact that the prediction error of the quantum kernel method is bounded by the minimum of the dimension of the quantum subspace formed by the training set and the squared Frobenius norm of the observable $O^U$.
Choosing a small but non-zero $\lambda$ allows us to consider an approximate space of $\mathrm{span}\{\rho(x_1), \ldots, \rho(x_N)\}$ formed by the training set. The training error $\sqrt{\mathrm{Tr}(A_{\mathrm{tra}}\,O^U\otimes O^U)/N}$ would increase slightly, and the generalization error $\sqrt{\mathrm{Tr}(A_{\mathrm{gen}}\,O^U\otimes O^U)/N}$ would reflect the Frobenius norm of $O^U$ restricted to a smaller subspace, which only contains the principal components of the space formed by the training set. This would be a better choice when most states lie in a low-dimensional subspace with small random fluctuations. One may also consider training a machine learning model with a truncated kernel matrix $K_\lambda$, where all singular values below $\lambda$ are truncated. This makes the act of restricting to an approximate subspace more explicit.
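A minimal sketch of this truncation is given below; it assumes a symmetric kernel matrix $K$ and simply discards all eigenvalues (singular values) below the threshold $\lambda$.

```python
# Sketch: truncated kernel K_lambda keeping only eigenvalues >= lam.
import numpy as np

def truncate_kernel(K, lam):
    t, u = np.linalg.eigh(K)               # eigen-decomposition of the symmetric kernel
    keep = t >= lam
    return (u[:, keep] * t[keep]) @ u[:, keep].T
```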

2. Another machine learning method compared to quantum kernel

We now consider an upper bound on the prediction error using the quantum kernel method as a reference point for
some machine learning algorithm. For the following discussion, we consider classical neural networks with large hidden
sizes. The function generated by a classical neural network with large hidden size after training is equivalent to the
function h(x) given in Equation (D7) with λ = 0 and with a special kernel function kNTK (x, x0 ) known as the neural
tangent kernel (NTK) [19]. The precise definition of kNTK (x, x0 ) depends on the architecture of the neural network. For
example, a two-layer feedforward neural network (FNN), a three-layer FNN, or some particular form of convolutional
neural network (CNN) all correspond to different kNTK (x, x0 ). Given the kernel kNTK (x, x0 ), we can define the kernel
matrix $\tilde{K}_{ij} = k_{\mathrm{NTK}}(x_i, x_j)$. For the neural tangent kernel, the scaling condition $\mathrm{Tr}(\tilde{K}) = \sum_{i=1}^N k_{\mathrm{NTK}}(x_i, x_i) = N$ may not be satisfied. Hence, we define a normalized kernel matrix $K = N\tilde{K}/\mathrm{Tr}(\tilde{K})$. When $\lambda = 0$, the trained machine learning model (given in Equation (D7)) under the normalized matrix $K$ and the original matrix $\tilde{K}$ are the same. In order to apply Theorem 1, we will use the normalized kernel matrix $K$ for the following discussion. From Equation (D1) with $\lambda = 0$, we have
$$\mathbb{E}_x|h(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{\mathrm{Tr}(A\,O^U\otimes O^U)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right), \qquad (E11)$$
where $A = \sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\,\rho(x_i)\otimes\rho(x_j)$. Using Equation (E3) for the expansion of $\rho(x_i)$, we have
$$A = \sum_{p=1}^d\sum_{q=1}^d\sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\,\alpha_{ip}\alpha_{jq}\,\sigma_p\otimes\sigma_q \qquad (E12)$$
$$= \sum_{p=1}^d\sum_{q=1}^d (\alpha^\dagger K^{-1}\alpha)_{pq}\,\sigma_p\otimes\sigma_q. \qquad (E13)$$

Using the definition of the spectral norm, we have
$$\mathrm{Tr}(A\,O^U\otimes O^U) = \sum_{p=1}^d\sum_{q=1}^d (\alpha^\dagger K^{-1}\alpha)_{pq}\,\mathrm{Tr}(\sigma_p O^U)\,\mathrm{Tr}(\sigma_q O^U) \qquad (E14)$$
$$\le \left\|\alpha^\dagger K^{-1}\alpha\right\|_\infty \sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2. \qquad (E15)$$

Recall from the definition below Equation (E3) that we have
$$\alpha = U\Sigma V^\dagger, \qquad K^Q = \alpha\alpha^\dagger = U\Sigma^2 U^\dagger. \qquad (E16)$$
Using the fact that unitary transformations do not change the spectral norm, $\|\alpha^\dagger K^{-1}\alpha\|_\infty = \|\Sigma U^\dagger K^{-1}U\Sigma\|_\infty = \|U\Sigma U^\dagger K^{-1}U\Sigma U^\dagger\|_\infty = \|\sqrt{K^Q}K^{-1}\sqrt{K^Q}\|_\infty$. Hence
$$\mathrm{Tr}(A\,O^U\otimes O^U) \le \left\|\sqrt{K^Q}K^{-1}\sqrt{K^Q}\right\|_\infty \sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2. \qquad (E17)$$

Together with Equation (E11), we have the following prediction error bound
$$\mathbb{E}_x|h(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(g\sqrt{\frac{\sum_{p=1}^d \mathrm{Tr}(O^U\sigma_p)^2}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right), \qquad (E18)$$
where $g = \sqrt{\left\|\sqrt{K^Q}K^{-1}\sqrt{K^Q}\right\|_\infty}$. The scalar $g$ measures the closeness of the geometry between the training data points as defined by the classical neural network and by the quantum state space. Note that without the geometric scalar $g$, this prediction error bound is the same as Equation (E6) for the quantum kernel method. Hence, if $g$ is small, the classical neural network could predict as well as (or potentially better than) the quantum kernel method. The same analysis as in Section E 1 allows us to arrive at the following result
$$\mathbb{E}_x|h(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(g\sqrt{\frac{\min(d, \|O^U\|_F^2)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right). \qquad (E19)$$
The same analysis holds for other machine learning algorithms, such as Gaussian kernel regression.
Appendix F: Detailed discussion on the relevant quantities s, d, and g

There are some important aspects of the three relevant quantities $s$, $d$, $g$ that were not fully discussed in the main text, including the limit where we have an infinite amount of data and the effect of regularization. While in practice
one always has a finite amount of data, constructing these formal limits both clarifies the construction and provides
another perspective through which to understand the finite data constructions. This section will provide a detailed
discussion of these aspects.

1. Model complexity s

While we have used $s_K(N) = \sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\,\mathrm{Tr}(O^U\rho(x_i))\,\mathrm{Tr}(O^U\rho(x_j))$ in the main text, this is a simplified quantity obtained when we do not apply regularization. The model complexity $s_K(N)$ under regularization is given by
$$s_K(N) = \|w\|^2 = \mathrm{Tr}(A_{\mathrm{gen}}\,O^U\otimes O^U) = \sum_{i=1}^N\sum_{j=1}^N ((K+\lambda I)^{-1}K(K+\lambda I)^{-1})_{ij}\,\mathrm{Tr}(O^U\rho(x_i))\,\mathrm{Tr}(O^U\rho(x_j)) \qquad (F1)$$
$$= \sum_{i=1}^N\sum_{j=1}^N \left(\sqrt{K}(K+\lambda I)^{-2}\sqrt{K}\right)_{ij}\,\mathrm{Tr}(O^U\rho(x_i))\,\mathrm{Tr}(O^U\rho(x_j)). \qquad (F2)$$

Training a machine learning model with regularization is often desired when we have a finite number $N$ of training data. $\|w\|^2$ has been used extensively in regularizing machine learning models, see e.g., [16, 37, 38]. This is because we can often significantly reduce the generalization error $\sqrt{\mathrm{Tr}(A_{\mathrm{gen}}\,O^U\otimes O^U)/N}$ by slightly increasing the training error $\sqrt{\mathrm{Tr}(A_{\mathrm{tra}}\,O^U\otimes O^U)/N}$. In practice, we should choose the regularization parameter $\lambda$ to be a small number such that the training error plus the generalization error is minimized.
The model complexity sK (N ) we have been calculating can be seen as an approximation to the true model com-
plexity when we have a finite number N of training data. If we have exact knowledge about the input distribution
given as a probability measure µx , we can also write down the precise model complexity in the reproducing kernel
Hilbert space φ(x) where k(x, y) = φ(x)† φ(y). Starting from
$$\min_w\ \lambda w^\dagger w + \int \left(w^\dagger\phi(x) - \mathrm{Tr}(O^U\rho(x))\right)^2 d\mu_x, \qquad (F3)$$
we can obtain
$$w = \left(\lambda I + \int \phi(x)\phi(x)^\dagger\, d\mu_x\right)^{-1}\int \mathrm{Tr}(O^U\rho(x))\,\phi(x)\, d\mu_x. \qquad (F4)$$
Hence the true model complexity is
$$\|w\|^2 = \int\!\!\int d\mu_{x_1}\, d\mu_{x_2}\,\mathrm{Tr}(O^U\rho(x_1))\,\mathrm{Tr}(O^U\rho(x_2))\,\phi(x_1)^\dagger\left(\lambda I + \int \phi(\xi)\phi(\xi)^\dagger\, d\mu_\xi\right)^{-2}\phi(x_2) \qquad (F5)$$
$$= \mathrm{Tr}(A_{\mathrm{gen}}\,O^U\otimes O^U), \qquad (F6)$$
where the operator
$$A_{\mathrm{gen}} = \int\!\!\int d\mu_{x_1}\, d\mu_{x_2}\,\phi(x_1)^\dagger\left(\lambda I + \int \phi(\xi)\phi(\xi)^\dagger\, d\mu_\xi\right)^{-2}\phi(x_2)\ \rho(x_1)\otimes\rho(x_2). \qquad (F7)$$

If we replace the integration over the probability measure with N random samples and apply the fact that k(x, y) =
φ(x)† φ(y), then we can obtain the original expression given in Equation (D3).

2. Dimension d

The dimension we considered in the main text is the effective dimension of the training set quantum state space. This can be seen as the rank of the quantum kernel matrix $K^Q_{ij} = \mathrm{Tr}(\rho(x_i)\rho(x_j))$. However, it will often be the
case that most of the states lie in some low-dimensional subspace, but have negligible contributions in a much higher
dimensional subspace. In this case, the dimension of the low-dimensional subspace is the better characterization.
More generally, we can perform a singular value decomposition of $K^Q$,
$$K^Q = \sum_{i=1}^N t_i\, u_i u_i^\dagger, \qquad (F8)$$
with $t_1 \ge t_2 \ge \ldots \ge t_N$. We define $\sigma_i = \sum_{j=1}^N u_{ij}\rho(x_j)\,/\,\|\sum_{j=1}^N u_{ij}\rho(x_j)\|_F$, where $\|\cdot\|_F$ is the Frobenius norm. $\sigma_i$ is the $i$-th principal component of the quantum state space. Recall the normalization condition $\mathrm{Tr}(K^Q) = N$, so $\sum_{i=1}^N t_i = N$. If the training set quantum state space is one-dimensional ($d = 1$), then
$$t_1 = N, \quad t_i = 0,\ \forall i > 1. \qquad (F9)$$
If all the quantum states in the training set are orthogonal ($d = N$), then
$$t_i = 1,\ \forall i = 1, \ldots, N. \qquad (F10)$$

By the Eckart-Young-Mirsky theorem, for any $k \ge 1$, the first $k$ principal components $\sigma_1, \ldots, \sigma_k$ form the best $k$-dimensional subspace for approximating $\mathrm{span}\{\rho(x_1), \ldots, \rho(x_N)\}$. The approximation error is given by
$$\sum_{i=1}^N\left\|\rho(x_i) - \sum_{j=1}^k \sqrt{t_j}\, u_{ji}\,\sigma_j\right\|_F^2 = \sum_{l=k+1}^N t_l. \qquad (F11)$$
As we can see, when the spectrum is flatter, the dimension is larger. The error decreases at most as $\sum_{l=k+1}^N t_l \le N - k$, where equality holds when all states are orthogonal. In the numerical experiment, we choose the following measure as the approximate dimension,
$$1 \le \sum_{k=1}^N \frac{1}{N-k+1}\left(\sum_{l=k}^N t_l\right) \le N, \qquad (F12)$$
because it is independent of any hyperparameter. Alternatively, we can also define an approximate dimension by choosing the smallest $k$ such that $\sum_{l=k+1}^N t_l/N < \epsilon$ for some $\epsilon > 0$. Both give a similar trend, but the actual value of the dimension would be different.
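A small sketch of this approximate dimension is given below; it assumes a symmetric kernel matrix and reconstructs the spectrum $t_1 \ge \cdots \ge t_N$, rescaled so that $\sum_l t_l = N$, before applying the hyperparameter-free measure in Equation (F12).

```python
# Sketch: approximate dimension of Eq. (F12) from a (normalized) kernel matrix.
import numpy as np

def effective_dimension(K):
    N = K.shape[0]
    t = np.sort(np.linalg.eigvalsh(K))[::-1]     # t_1 >= ... >= t_N
    t = t * (N / np.sum(t))                       # enforce sum_l t_l = N
    tails = np.cumsum(t[::-1])[::-1]              # tails[k-1] = sum_{l >= k} t_l
    return float(np.sum(tails / (N - np.arange(N))))   # denominators N-k+1 for k=1..N
```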
From the discussion, we can see that in the above definitions, the dimension will always be upper bounded by the
number N of training data. Similar to the case of model complexity, we can also define the dimension d when we
have the exact knowledge about the input distribution given by probability measure µx . For a quantum state space
representing $n$ qubits, we simply consider the spectrum $t_1 \ge t_2 \ge \cdots$ of the following operator
$$\int \mathrm{vec}(\rho(x))\,\mathrm{vec}(\rho(x))^T\, d\mu_x. \qquad (F13)$$

When we replace the integration by a finite number of training samples, the spectrum would be equivalent to the
spectrum given in Equation (F8) except for the additional zeros.
Remark 1. The same definition of dimension can be used for any kernels, such as projected quantum kernels or
neural tangent kernels (under the normalization Tr(K) = N ).

3. Geometric difference g

The geometric difference is defined between two kernel functions $K^1, K^2$ and the corresponding reproducing kernel Hilbert spaces $\phi_1(x), \phi_2(x)$. If we have a function represented by the first kernel, $w^\dagger\phi_1(x)$, what would be the model complexity under the second kernel? We consider the ideal case where we know the input distribution $\mu_x$ exactly. The optimization for training the second kernel method with regularization $\lambda > 0$ is
$$\min_v\ \lambda v^\dagger v + \int \left(v^\dagger\phi_2(x) - w^\dagger\phi_1(x)\right)^2 d\mu_x. \qquad (F14)$$
The solution is given by
$$v = \left(\lambda I + \int \phi_2(x)\phi_2(x)^\dagger\, d\mu_x\right)^{-1}\int \left(w^\dagger\phi_1(x)\right)\phi_2(x)\, d\mu_x. \qquad (F15)$$
Hence the model complexity for the optimized $v$ is
$$\|v\|^2 = w^\dagger\left(\int\!\!\int d\mu_{x_1}\, d\mu_{x_2}\,\phi_1(x_1)\phi_2(x_1)^\dagger\left(\lambda I + \int \phi_2(\xi)\phi_2(\xi)^\dagger\, d\mu_\xi\right)^{-2}\phi_2(x_2)\phi_1(x_2)^\dagger\right)w \qquad (F16)$$
$$\le g_{\mathrm{gen}}^2\,\|w\|^2, \qquad (F17)$$
where the geometric difference is
$$g_{\mathrm{gen}} = \sqrt{\left\|\int\!\!\int d\mu_{x_1}\, d\mu_{x_2}\,\phi_1(x_1)\phi_2(x_1)^\dagger\left(\lambda I + \int \phi_2(\xi)\phi_2(\xi)^\dagger\, d\mu_\xi\right)^{-2}\phi_2(x_2)\phi_1(x_2)^\dagger\right\|_\infty}. \qquad (F18)$$


The subscript in $g_{\mathrm{gen}}$ is added because when $\lambda > 0$, there will also be a contribution from the training error. When we only have a finite number $N$ of training samples, we can use the fact that $k(x, y) = \phi(x)^\dagger\phi(y)$ and the definition $K_{ij} = k(x_i, x_j)$ to obtain
$$g_{\mathrm{gen}} = \sqrt{\left\|\sqrt{K^1}\sqrt{K^2}\left(K^2+\lambda I\right)^{-2}\sqrt{K^2}\sqrt{K^1}\right\|_\infty}. \qquad (F19)$$
This formula differs from the main text due to the regularization parameter $\lambda$. If $\lambda = 0$, then the above formula for $g_{\mathrm{gen}}$ reduces to the formula $g_{\mathrm{gen}} = \sqrt{\left\|\sqrt{K^1}(K^2)^{-1}\sqrt{K^1}\right\|_\infty}$.


When $\lambda$ is non-zero, the geometric difference can become much smaller. This is the same as the discussion on the model complexity $s$ in Section F 1. However, a nonzero $\lambda$ induces a small amount of training error. For a finite number $N$ of samples, the training error can always be upper bounded:
$$\frac{1}{N}\sum_{i=1}^N \left(v^\dagger\phi_2(x_i) - w^\dagger\phi_1(x_i)\right)^2 \le \lambda^2\left\|\sqrt{K^1}(K^2+\lambda I)^{-2}\sqrt{K^1}\right\|_\infty \|w\|^2 = g_{\mathrm{tra}}^2\,\|w\|^2, \qquad (F20)$$
where $g_{\mathrm{tra}} = \lambda\sqrt{\left\|\sqrt{K^1}(K^2+\lambda I)^{-2}\sqrt{K^1}\right\|_\infty}$. This upper bound can be obtained by plugging the solution for $v$ in Equation (F14) under finite samples into the training error $\frac{1}{N}\sum_{i=1}^N\left(v^\dagger\phi_2(x_i) - w^\dagger\phi_1(x_i)\right)^2$ and utilizing the fact that $w^\dagger A w \le \|A\|_\infty\|w\|^2$. In the numerical experiment, we report $g_{\mathrm{gen}}$ given in Equation (F19) with the largest $\lambda$ such that the training error $g_{\mathrm{tra}} \le 0.045$.
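A minimal sketch of the regularized geometric difference of Equations (F19) and (F20) is given below; it assumes the two kernel matrices are already normalized so that $\mathrm{Tr}(K^1) = \mathrm{Tr}(K^2) = N$, and uses SciPy's matrix square root.

```python
# Sketch: geometric difference g_gen and training-error bound g_tra, Eqs. (F19)-(F20).
import numpy as np
from scipy.linalg import sqrtm

def geometric_difference(K1, K2, lam=0.0):
    N = K1.shape[0]
    sqrt_K1 = np.real(sqrtm(K1))
    sqrt_K2 = np.real(sqrtm(K2))
    inv2 = np.linalg.inv(K2 + lam * np.eye(N))
    M_gen = sqrt_K1 @ sqrt_K2 @ inv2 @ inv2 @ sqrt_K2 @ sqrt_K1
    g_gen = np.sqrt(np.linalg.norm(M_gen, ord=2))     # spectral norm
    M_tra = sqrt_K1 @ inv2 @ inv2 @ sqrt_K1
    g_tra = lam * np.sqrt(np.linalg.norm(M_tra, ord=2))
    return g_gen, g_tra
```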

Appendix G: Constructing dataset to separate quantum and classical model

In the main text, our central quantity of interest is the geometric difference $g$, which quantifies, for a given data set, how large the prediction gap can be over possible functions or labels associated with that data. Here we detail how one can efficiently construct a function that saturates this bound for a given data set. This is the approach used in the main text to engineer the data set with maximal performance.
Given a (projected) quantum kernel $k^Q(x_i, x_j) = \phi_Q(x_i)^\dagger\phi_Q(x_j)$ and a classical kernel $k^C(x_i, x_j) = \phi_C(x_i)^\dagger\phi_C(x_j)$, our goal is to construct a dataset that best separates the two models. Consider a dataset $(x_i, y_i),\ \forall i = 1, \ldots, N$. We use the model complexity $s = \sum_{i=1}^N\sum_{j=1}^N (K^{-1})_{ij}\, y_i y_j$ to quantify the generalization error of a model. The model complexity has been introduced in the main text, and a detailed proof relating $s$ to the prediction error is given in Appendix D. To separate the quantum and classical models, we want $s_Q = 1$ while $s_C$ is as large as possible for a particular choice of targets $y_1, \ldots, y_N$. To achieve this, we solve the optimization
$$\max_{y\in\mathbb{R}^N}\ \frac{\sum_{i=1}^N\sum_{j=1}^N ((K^C)^{-1})_{ij}\, y_i y_j}{\sum_{i=1}^N\sum_{j=1}^N ((K^Q)^{-1})_{ij}\, y_i y_j}, \qquad (G1)$$
which has an exact solution given by a generalized eigenvalue problem. The solution is given by $y = \sqrt{K^Q}\, v$, where $v$ is the eigenvector of $\sqrt{K^Q}(K^C)^{-1}\sqrt{K^Q}$ corresponding to the eigenvalue $g^2 = \left\|\sqrt{K^Q}(K^C)^{-1}\sqrt{K^Q}\right\|_\infty$. This guarantees that $s_C = g^2 s_Q = g^2$, and note that by the definition of $g$, $s_C \le g^2 s_Q$. Hence this dataset fully utilizes the geometric difference between the quantum and classical space.
We should also include the regularization parameter $\lambda$ when constructing the dataset. A detailed discussion of the model complexity $s$ and geometric difference $g$ with regularization is given in Appendix F. Recall that for $\lambda > 0$,
$$s_C^\lambda = y^\dagger \sqrt{K^C}\left(K^C+\lambda I\right)^{-2}\sqrt{K^C}\, y, \qquad (G2)$$
which is the model complexity that we want to maximize. Similar to the unregularized case, we consider the (unregularized) model complexity $s_Q = y^\dagger (K^Q)^{-1} y$ to be one. Solving the generalized eigenvector problem yields the target $y = \sqrt{K^Q}\, v$, where $v$ is the eigenvector of
$$\sqrt{K^Q}\sqrt{K^C}\left(K^C+\lambda I\right)^{-2}\sqrt{K^C}\sqrt{K^Q} \qquad (G3)$$
with the corresponding eigenvalue
$$g_{\mathrm{gen}}^2 = \left\|\sqrt{K^Q}\sqrt{K^C}\left(K^C+\lambda I\right)^{-2}\sqrt{K^C}\sqrt{K^Q}\right\|_\infty. \qquad (G4)$$
The larger $\lambda$ is, the smaller $g_{\mathrm{gen}}^2$ would be. In practice, one should choose a $\lambda$ such that the training error bound $g_{\mathrm{tra}}^2 s_Q = \lambda^2\left\|\sqrt{K^Q}(K^C+\lambda I)^{-2}\sqrt{K^Q}\right\|_\infty$ for the classical ML model is small enough. In the numerical experiment, we choose a $\lambda$ such that the training error bound $g_{\mathrm{tra}}^2 s_Q \le 0.002$ and $g_{\mathrm{gen}}^2$ is as large as possible. Finally, we can turn this dataset, which maps input $x$ to a real value $y$, into a classification task by replacing $y$ with $+1$ if $y > \mathrm{median}(y_1, \ldots, y_N)$ and $-1$ if $y \le \mathrm{median}(y_1, \ldots, y_N)$.
The constructed dataset will yield the largest separation between quantum and classical models in a learning-theoretic sense, as the model complexity fully saturates the geometric difference. If there is no quantum advantage on this dataset, there will likely be none. We believe this construction procedure will eventually lead to the first quantum advantage in machine learning problems (classification problems, to be more specific).
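A small sketch of this construction in the unregularized case ($\lambda = 0$) is given below; the kernel matrices KQ and KC are assumed to be given and already normalized.

```python
# Sketch: engineered labels that saturate the geometric difference (lam = 0).
import numpy as np
from scipy.linalg import sqrtm

def engineered_labels(KQ, KC):
    sqrt_KQ = np.real(sqrtm(KQ))
    M = sqrt_KQ @ np.linalg.inv(KC) @ sqrt_KQ
    eigvals, eigvecs = np.linalg.eigh(M)
    v = eigvecs[:, -1]                 # eigenvector with the largest eigenvalue g^2
    y = sqrt_KQ @ v                    # gives s_Q = 1 and s_C = g^2
    return np.where(y > np.median(y), 1, -1)   # turn into a classification task
```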

Appendix H: Lower bound on learning quantum models

In this section, we will prove a fundamental lower bound for learning quantum models, stated in Theorem 3. This result says that in the worst case, the number $N$ of training data has to be at least $\Omega(\mathrm{Tr}(O^2)/\epsilon^2)$ when the input quantum state can be distributed across a sufficiently large Hilbert space. The quantum kernel method matches this lower bound. When the data spans the entire Hilbert space, the dimension $d$ will be large and the prediction error of the quantum kernel method given in Equation (E1) becomes
$$\mathbb{E}_x|h_Q(x) - \mathrm{Tr}(O^U\rho(x))| \le O\!\left(\sqrt{\frac{\mathrm{Tr}(O^2)}{N}}\right). \qquad (H1)$$
Hence we can achieve $\epsilon$ error using $N \le O(\mathrm{Tr}(O^2)/\epsilon^2)$, matching the fundamental lower bound.
Theorem 3. Consider any learning algorithm $\mathcal{A}$. Suppose that for any unknown unitary evolution $U$, any unknown observable $O$ with bounded Frobenius norm, $\mathrm{Tr}(O^2) \le B$, and any distribution $\mathcal{D}$ over the input quantum states, the learning algorithm $\mathcal{A}$ could learn a function $h$ such that
$$\mathbb{E}_{\rho\sim\mathcal{D}}|h(\rho) - \mathrm{Tr}(OU\rho U^\dagger)| \le \epsilon, \qquad (H2)$$
from $N$ training data $(\rho_i, \mathrm{Tr}(OU\rho_i U^\dagger)),\ \forall i = 1, \ldots, N$, with high probability. Then we must have
$$N \ge \Omega(B/\epsilon^2). \qquad (H3)$$
Proof. We select a Hilbert space with dimension $d = B/4\epsilon^2$ (this could be a subspace of an exponentially large Hilbert space). We define the distribution $\mathcal{D}$ to be the uniform distribution over the basis states $|x\rangle\langle x|$ of the $d$-dimensional Hilbert space. Then we consider the unknown unitary $U$ to always be the identity, while the possible observables are
$$O_v = 2\epsilon\sum_{x=1}^d v_x\,|x\rangle\langle x|, \qquad (H4)$$
with $v_x \in \{\pm 1\},\ \forall x = 1, \ldots, d$. There are hence $2^d$ different choices of the observable $O_v$.

We now set up a simple communication protocol to prove the lower bound on the number of data needed. This is a simplified version of the proofs found in Refs. [25, 55]. Alice samples an observable $O_v$ uniformly at random from the $2^d$ possible choices. We can treat $v$ as a bit-string of $d$ entries. Then she samples $N$ quantum states $|x_i\rangle\langle x_i|,\ \forall i = 1, \ldots, N$. Alice then gives Bob the following training data $T = \{(|x_i\rangle\langle x_i|,\ \langle x_i|O_v|x_i\rangle = 2\epsilon v_{x_i}),\ \forall i = 1, \ldots, N\}$. Notice that the mutual information $I(v, T)$ between $v$ and the training data $T$ satisfies
$$I(v, T) \le N, \qquad (H5)$$
because the training data contains at most $N$ values of $v$.
With high probability, the following is true by the requirement on the learning algorithm $\mathcal{A}$. Using the training data $T$, Bob can apply the learning algorithm $\mathcal{A}$ to obtain a function $h$ such that
$$\mathbb{E}_{\rho\sim\mathcal{D}}|h(\rho) - \mathrm{Tr}(OU\rho U^\dagger)| \le \epsilon. \qquad (H6)$$
Using Markov's inequality, we have
$$\Pr\left[|h(\rho) - \mathrm{Tr}(OU\rho U^\dagger)| < 2\epsilon\right] > \frac{1}{2}. \qquad (H7)$$
For all $x = 1, \ldots, d$, if $|h(|x\rangle\langle x|) - \mathrm{Tr}(OU|x\rangle\langle x|U^\dagger)| < 2\epsilon$, we have $|h(|x\rangle\langle x|)/2\epsilon - v_x| < 1$. This means that if $h(|x\rangle\langle x|) > 0$, then $v_x = 1$, and if $h(|x\rangle\langle x|) < 0$, then $v_x = -1$. Hence Bob can construct a bit-string $\tilde{v}$ given as $\tilde{v}_x = \mathrm{sign}(h(|x\rangle\langle x|)),\ \forall x = 1, \ldots, d$. Using Equation (H7), we know that at least $d/2$ bits of $\tilde{v}$ will be equal to those of $v$. Because, with high probability, $\tilde{v}$ and $v$ have at least $d/2$ bits in common, Fano's inequality tells us that $I(v, \tilde{v}) \ge \Omega(d)$. Because the bit-string $\tilde{v}$ is constructed solely from the training data $T$, the data processing inequality tells us that $I(v, \tilde{v}) \le I(v, T)$. Together with Equation (H5), we have
$$N \ge I(v, T) \ge I(v, \tilde{v}) \ge \Omega(d). \qquad (H8)$$
Recalling that $d = B/4\epsilon^2$, we have hence obtained the desired result $N \ge \Omega(B/\epsilon^2)$.

Appendix I: Limitations of quantum kernel methods

Even though the quantum kernel method saturates the fundamental lower bound $\Omega(\mathrm{Tr}(O^2)/\epsilon^2)$ and can be made formally equivalent to infinite-depth quantum neural networks, it has a number of limitations that hinder its practical applicability. In this section we construct a simple example where the overhead for using the quantum kernel method
is exponential in comparison to trivial classical methods.
Specifically, it has the limitation of closely following this lower bound for any unitary U and observable O. This is
not true for other machine learning methods, such as classical neural networks or projected quantum kernel methods.
It is possible for classical machine learning methods to learn quantum models with exponentially large Tr(O2 ), which
is not learnable by the quantum kernel method. This can already be seen in the numerical experiments given in the
main text. In this section, we provide a simple example that allows theoretical analysis to illustrate this limitation.
We consider a simple learning task where the input vector $x \in \{0, \pi\}^n$. The encoding of the input vector $x$ into the quantum state space is given as
$$|x\rangle = \prod_{k=1}^n \exp(i X_k x_k)\, |0^n\rangle. \qquad (I1)$$

The quantum state |xi is a computational basis state. We define ρ(x) = |xihx|. The quantum model applies a unitary
U = I, and measures the observable O = I ⊗ . . . ⊗ I ⊗ Z. Hence f (x) = Tr(Oρ(x)) = (2xn − π). Notice that for this
very simple quantum model, the function f (x) is an extremely simple linear model. Hence a linear regression or a
single-layer neural network can learn the function f (x) from training data of size n with high probability.
Despite being a very simple quantum model, the squared Frobenius norm of the observable is exponentially large, i.e., $\mathrm{Tr}(O^2) = 2^n$. We now show that the quantum kernel method will need a training set of size $N \ge \Omega(2^n)$ to learn this simple function $f(x)$. Suppose we have obtained a training set $\{(x_i, \mathrm{Tr}(O\rho(x_i)))\}_{i=1}^N$, where each $x_i$ is selected uniformly at random from $\{0, \pi\}^n$. Recalling the analysis in Section D 1, the function learned by the quantum kernel method will be
$$h_Q(x) = \min\!\left(1, \max\!\left(-1, \sum_{i=1}^N\sum_{j=1}^N \mathrm{Tr}(\rho(x_i)\rho(x))\,((K^Q+\lambda I)^{-1})_{ij}\,\mathrm{Tr}(O\rho(x_j))\right)\right), \qquad (I2)$$
where $K^Q_{ij} = k^Q(x_i, x_j) = \mathrm{Tr}(\rho(x_i)\rho(x_j))$. The main problem of the quantum kernel method comes from the precise definition of the kernel function $k(x_i, x) = \mathrm{Tr}(\rho(x_i)\rho(x))$. For at least $2^n - N$ choices of $x$, we have $\mathrm{Tr}(\rho(x_i)\rho(x)) = 0,\ \forall i = 1, \ldots, N$. This means that for at least $2^n - N$ choices of $x$, $h_Q(x) = 0$. However, by construction, $f(x) \in \{1, -1\}$. Hence the prediction error can be lower bounded by
$$\frac{1}{2^n}\sum_{x\in\{0,\pi\}^n} |h_Q(x) - f(x)| \ge 1 - \frac{N}{2^n}. \qquad (I3)$$
Therefore, if $N < (1-\epsilon)2^n$, then the prediction error will be greater than $\epsilon$. Hence we need a training set of size $N \ge (1-\epsilon)2^n$ to achieve a prediction error $\le \epsilon$.
In general, when we place the classical vectors $x$ into an exponentially large quantum state space, the quantum kernel function $\mathrm{Tr}(\rho(x_i)\rho(x_j))$ will be exponentially close to zero for $x_i \neq x_j$. In this case $K^Q$ will be close to the identity matrix, but $\mathrm{Tr}(\rho(x_i)\rho(x))$ will be exponentially small. For a training set of size $N \ll 2^n$, $h_Q(x)$ will be exponentially close to zero, similar to the above example. Despite $h_Q(x)$ being exponentially close to zero, if we can distinguish $> 0$ and $< 0$, then $h_Q$ could still be useful in classification tasks. However, due to the inherent quantum measurement error in evaluating the kernel function $\mathrm{Tr}(\rho(x_i)\rho(x_j))$ on a quantum computer, we will need an exponential number of measurements to resolve such an exponentially small difference.
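The following is a small numerical illustration of this failure mode, assuming the encoded states are random computational basis states (so that the quantum kernel matrix is exactly the identity); the labels are placeholders.

```python
# Illustration: orthogonal encodings give K^Q = I and uninformative predictions.
import numpy as np

n, N = 10, 50
rng = np.random.default_rng(1)
idx = rng.choice(2**n, size=N, replace=False)      # random basis-state encodings
states = np.eye(2**n)[idx]                         # rows are |x_i>
KQ = (states @ states.T) ** 2                      # Tr(rho_i rho_j) = |<x_i|x_j>|^2
print(np.allclose(KQ, np.eye(N)))                  # True: K^Q is the identity

y = rng.choice([-1.0, 1.0], size=N)                # placeholder labels
alpha = np.linalg.solve(KQ + 1e-6 * np.eye(N), y)
unseen = np.setdiff1d(np.arange(2**n), idx)[0]     # a basis state not in the training set
k_new = (states @ np.eye(2**n)[unseen]) ** 2
print(k_new @ alpha)                               # ~0: the prediction carries no information
```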

Appendix J: Projected quantum kernel methods

In the main text, we argue that projection back from the quantum space to a classical one in the projected
quantum kernel can greatly improve the performance of such methods. There we focused on the simple case of a squared exponential based on reduced 1-particle observables; however, this idea is far more general. In this section
we explore some of these generalizations including a novel scheme for calculating functions of all powers of RDMs
efficiently.
From the discussion of the quantum kernel method, we have seen that using the native quantum state space to define the kernel function, e.g., $k(x_i, x_j) = \mathrm{Tr}(\rho(x_i)\rho(x_j))$, can fail to learn even a simple function when the full exponential quantum state space is being used. We have to utilize the entire exponential quantum state space, since otherwise the quantum machine learning model could be simulated efficiently classically and a large advantage could
not be found. In this section, we will detail a set of solutions that project the quantum states back to approximate
classical representations and define the kernel function using the classical representation. We refer to these modified
quantum kernels as projected quantum kernels. The projected quantum kernels are defined in a classical vector
space to circumvent the hardness of learning due to the exponential dimension in quantum Hilbert space. However,
projected quantum kernels still use the exponentially large quantum Hilbert space for evaluation and can be hard to
simulate classically.
Some simple choices based on reduced density matrices (RDMs) of the quantum state are given below.
1. A linear kernel function using 1-RDMs,
$$Q^{1l}(x_i, x_j) = \sum_k \mathrm{Tr}\!\left[\mathrm{Tr}_{m\neq k}[\rho(x_i)]\,\mathrm{Tr}_{m\neq k}[\rho(x_j)]\right], \qquad (J1)$$
where $\mathrm{Tr}_{m\neq k}(\rho)$ is the partial trace of the quantum state $\rho$ over all qubits except for the $k$-th qubit. It could learn any observable that can be written as a sum of one-body terms.
2. A Gaussian kernel function using 1-RDMs,
$$Q^{1g}(x_i, x_j) = \exp\!\left(-\gamma\sum_k \left\|\mathrm{Tr}_{m\neq k}[\rho(x_i)] - \mathrm{Tr}_{m\neq k}[\rho(x_j)]\right\|_F^2\right), \qquad (J2)$$
where $\gamma > 0$ is a hyper-parameter. It could learn any nonlinear function of the 1-RDMs (a small numerical sketch of this kernel follows the list below).
3. A linear kernel using $k$-RDMs,
$$Q^{kl}(x_i, x_j) = \sum_{K\in S_k(n)} \mathrm{Tr}\!\left[\mathrm{Tr}_{m\notin K}[\rho(x_i)]\,\mathrm{Tr}_{m\notin K}[\rho(x_j)]\right], \qquad (J3)$$
where $S_k(n)$ is the set of subsets of $k$ qubits from $n$ and $\mathrm{Tr}_{m\notin K}$ is a partial trace over all qubits not in the subset $K$. It could learn any observable that can be written as a sum of $k$-body terms.
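As referenced in item 2 above, the following is a small sketch of the Gaussian 1-RDM kernel $Q^{1g}$ computed from exact statevectors; this is for illustration only, since on hardware the 1-RDMs would instead be estimated from measurements (e.g., classical shadows).

```python
# Sketch: Gaussian projected kernel from single-qubit reduced density matrices.
import numpy as np

def one_rdms(psi, n):
    """Return the n single-qubit reduced density matrices of an n-qubit statevector."""
    psi = psi.reshape([2] * n)
    rdms = []
    for k in range(n):
        phi = np.moveaxis(psi, k, 0).reshape(2, -1)   # qubit k vs. the rest
        rdms.append(phi @ phi.conj().T)
    return np.array(rdms)                              # shape (n, 2, 2)

def projected_kernel(psi_i, psi_j, n, gamma=1.0):
    """Q^{1g}(x_i, x_j) = exp(-gamma * sum_k ||rho_k(x_i) - rho_k(x_j)||_F^2)."""
    diff = one_rdms(psi_i, n) - one_rdms(psi_j, n)
    return float(np.exp(-gamma * np.sum(np.abs(diff) ** 2)))
```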
The above choices have a limited function class that they can learn, e.g., $Q^{1l}$ can only learn observables that are a sum of single-qubit observables. It is desirable to define a kernel that can learn any quantum model (e.g., arbitrarily deep quantum neural networks) with a sufficient amount of data, similar to the original quantum kernel $k^Q(x_i, x_j) = \mathrm{Tr}(\rho(x_i)\rho(x_j))$ discussed in Appendix C.
We now define a projected quantum kernel that contains all orders of RDMs. Since all quantum models f (x) =
Tr(OU ρ(x)U † ) are linear functions of the full quantum state, this kernel can learn any quantum models with sufficient
data. A k-RDM of a quantum state ρ(x) for qubit indices (p1 , p2 , . . . , pk ) can be reconstructed by local randomized
measurements using the formalism of classical shadows [25]:

$$\rho_{(p_1, p_2, \ldots, p_k)}(x) = \mathbb{E}\left[\bigotimes_{r=1}^k \left(3\,|s_{p_r}, b_{p_r}\rangle\langle s_{p_r}, b_{p_r}| - I\right)\right], \qquad (J4)$$
where $b_{p_r}$ is a random Pauli measurement basis $X, Y, Z$ on the $p_r$-th qubit, and $s_{p_r}$ is the measurement outcome $\pm 1$ on the $p_r$-th qubit of the quantum state $\rho(x)$ under Pauli basis $b_{p_r}$. The expectation is taken with respect to the randomized measurement on $\rho(x)$. The inner product of two $k$-RDMs is equal to
$$\mathrm{Tr}\!\left[\rho_{(p_1, p_2, \ldots, p_k)}(x_i)\,\rho_{(p_1, p_2, \ldots, p_k)}(x_j)\right] = \mathbb{E}\left[\prod_{r=1}^k \left(9\,\delta_{s^i_{p_r} s^j_{p_r}}\delta_{b^i_{p_r} b^j_{p_r}} - 4\right)\right], \qquad (J5)$$

where we used the fact that the randomized measurement outcomes for ρ(xi ) and ρ(xj ) are independent. We extend
this equation to the case where some indices pr , ps coincide. This would only introduce additional features in the
feature map $\phi(x)$ that defines the kernel $k(x_i, x_j) = \phi(x_i)^\dagger\phi(x_j)$. The sum over all possible $k$-RDMs can be written as
$$Q^k(\rho(x_i), \rho(x_j)) = \sum_{p_1=1}^n\cdots\sum_{p_k=1}^n \mathrm{Tr}\!\left[\rho_{(p_1,\ldots,p_k)}(x_i)\,\rho_{(p_1,\ldots,p_k)}(x_j)\right] = \mathbb{E}\left[\left(\sum_{p=1}^n \left(9\,\delta_{s^i_p s^j_p}\delta_{b^i_p b^j_p} - 4\right)\right)^{\!k}\right], \qquad (J6)$$
where we used Equation (J5) and linearity of expectation. A kernel function that contains all orders of RDMs can be
evaluated as
$$Q^\infty_\gamma(\rho(x_i), \rho(x_j)) = \sum_{k=0}^\infty \frac{\gamma^k}{k!\,n^k}\, Q^k(\rho(x_i), \rho(x_j)) = \mathbb{E}\left[\exp\!\left(\frac{\gamma}{n}\sum_{p=1}^n \left(9\,\delta_{s^i_p s^j_p}\delta_{b^i_p b^j_p} - 4\right)\right)\right], \qquad (J7)$$

where γ is a hyper-parameter. The kernel function Q∞γ (ρ(xi ), ρ(xj )) can be computed by performing local randomized
measurement on the quantum states ρ(xi ) and ρ(xj ) independently. First, we collect a set of randomized measurement
data for ρ(xi ), ρ(xj ) independently:

$$\rho(x_i) \to \left\{\left((s^{i,r}_1, b^{i,r}_1), \ldots, (s^{i,r}_n, b^{i,r}_n)\right),\ \forall r = 1, \ldots, N_s\right\}, \qquad (J8)$$
$$\rho(x_j) \to \left\{\left((s^{j,r}_1, b^{j,r}_1), \ldots, (s^{j,r}_n, b^{j,r}_n)\right),\ \forall r = 1, \ldots, N_s\right\}, \qquad (J9)$$
where $N_s$ is the number of repetitions for each quantum state. For each repetition, we randomly sample a Pauli basis for each qubit and measure that qubit to obtain an outcome $\pm 1$. For the $r$-th repetition, the Pauli basis on the $k$-th qubit is given as $b^{i,r}_k$ and the measurement outcome $\pm 1$ is given as $s^{i,r}_k$. Then we compute
$$\frac{1}{N_s(N_s-1)}\sum_{r_1=1}^{N_s}\sum_{\substack{r_2=1\\ r_2\neq r_1}}^{N_s} \exp\!\left(\frac{\gamma}{n}\sum_{p=1}^n \left(9\,\delta_{s^{i,r_1}_p s^{j,r_2}_p}\delta_{b^{i,r_1}_p b^{j,r_2}_p} - 4\right)\right) \approx Q^\infty_\gamma(\rho(x_i), \rho(x_j)). \qquad (J10)$$

We reuse all pairs of data $r_1, r_2$ to reduce the variance when estimating $Q^\infty_\gamma(\rho(x_i), \rho(x_j))$, since the resulting estimator is still equal to the desired quantity in expectation. This technique is known as U-statistics, which is often used to create minimum-variance unbiased estimators. U-statistics is also applied in [25] for estimating Rényi entanglement entropy with high accuracy.
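A minimal sketch of the estimator in Equation (J10) is given below; it assumes the randomized-measurement records are stored as integer arrays of shape $(N_s, n)$, with bases encoded as 0, 1, 2 for X, Y, Z and outcomes as $\pm 1$.

```python
# Sketch: estimating Q^infinity_gamma from two sets of randomized-measurement records.
import numpy as np

def q_inf_gamma(s_i, b_i, s_j, b_j, gamma=1.0):
    Ns, n = s_i.shape
    total = 0.0
    for r1 in range(Ns):
        for r2 in range(Ns):
            if r1 == r2:
                continue                      # U-statistic: exclude the diagonal pairs
            match = (s_i[r1] == s_j[r2]) & (b_i[r1] == b_j[r2])
            total += np.exp((gamma / n) * np.sum(9 * match - 4))
    return total / (Ns * (Ns - 1))
```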

Appendix K: Simple and rigorous quantum advantage over classical machine learning models

In Ref. [24], the authors proposed a machine learning problem based on the discrete logarithm, which is assumed to be hard for any classical machine learning algorithm, complementing existing work studying learnability in the context of discrete logarithms [56, 57]. Much of the challenge in their construction [57] was related to technicalities involved in the
original quantum kernel approach. Here we present a simple quantum machine learning algorithm using the projected
quantum kernel method. The problem is defined as follows, where p is an exponentially large prime number and g is
chosen such that computing logg (x) in Z∗p is classically hard and logg (x) is one-to-one.
Definition 1 (Discrete logarithm-based learning problem). For all inputs $x \in \mathbb{Z}_p^*$, where $n = \lceil\log_2(p)\rceil$, the output is
$$y(x) = \begin{cases} +1, & \log_g(x) \in [s, s + \tfrac{p-3}{2}], \\ -1, & \log_g(x) \notin [s, s + \tfrac{p-3}{2}], \end{cases} \qquad (K1)$$
for some $s \in \mathbb{Z}_p^*$. The goal is to predict $y(x)$ for an input $x$ sampled uniformly from $\mathbb{Z}_p^*$.
Let us consider the most straight-forward feature mapping that maps the classical input x into the quantum state
space |logg (x)i using Shor’s algorithm for computing discrete logarithms [58].
Training the original quantum kernel method using this feature mapping will require training data $\{(x_i, y_i)\}_{i=1}^N$ with $N$ exponentially large to yield a small prediction error. This is because for a new $x \in \mathbb{Z}_p^*$ such that $\log_g(x) \neq \log_g(x_i),\ \forall i = 1, \ldots, N$, the quantum kernel method will be equivalent to random guessing. Hence the quantum kernel method has to see most of the values in the range of $\log_g(x)$ ($\mathbb{Z}_p^*$) to make accurate predictions. This is the same as the example demonstrating the limitation of quantum kernel methods in Appendix I. Since $\mathbb{Z}_p^*$ is exponentially large, the quantum kernel method has to use an exponentially large amount of data $N$ for this straightforward feature map. The central problem is that all the inputs $x$ are maximally far apart from one another, and this impedes the ability of quantum kernel methods to generalize.
On the other hand, we can project the quantum feature map $|\log_g(x)\rangle$ back to a classical space, which is now just a number $\log_g(x) \in \mathbb{Z}_p^*$. Recall that $\mathbb{Z}_p^*$ contains the numbers $0, \ldots, p-1$; thus we consider mapping $x$ to a real number $z = \log_g(x)/p \in [0, 1)$. Let us define $t = s/p$. In this projected space, we are learning a simple classification problem where $y(z) = +1$ if $z \in [t, t + \tfrac{p-3}{2p}]$ and $y(z) = -1$ if $z \notin [t, t + \tfrac{p-3}{2p}]$. We are using a periodic boundary where 0 and 1 are the same point. If $t + \tfrac{p-3}{2p} < 1$, then there exist some $a, b \in [0, 1)$ with $a < b$ such that $y(z) = +1$ if $a \le z \le b$ and $y(z) = -1$ otherwise. In this case we have $y(z) = \mathrm{sign}((b-z)(z-a))$, where $\mathrm{sign}(t) = +1$ if $t \ge 0$ and $\mathrm{sign}(t) = -1$ otherwise. If $t + \tfrac{p-3}{2p} \ge 1$, then there exist some $a, b \in [0, 1)$ with $a < b$ such that $y(z) = -1$ if $a \le z \le b$ and $y(z) = +1$ otherwise. In this case we have $y(z) = \mathrm{sign}((z-a)(z-b))$. Through this analysis, we can see that we only need to learn a simple quadratic function to perform accurate classification. Hence one could simply define a projected quantum kernel as
$$k^{\mathrm{PQ}}(x_i, x_j) = \left((\log_g(x_i)/p)(\log_g(x_j)/p) + 1\right)^2, \qquad (K2)$$
where the division in $\log_g(x_i)/p$ is performed over the real numbers $\mathbb{R}$. This projected quantum kernel can efficiently learn any quadratic function $az^2 + bz + c$ of $z = \log_g(x)/p$, hence solving the above learning problem.
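A small sketch of this learner is given below. The projected features $z_i = \log_g(x_i)/p$ are assumed to be provided (obtaining them is the quantum step, via Shor's algorithm); the classical post-processing is a standard support vector machine with the kernel $k^{\mathrm{PQ}}$, as discussed after Theorem 4 below.

```python
# Sketch: SVM with the projected quantum kernel k^PQ(z, z') = (z*z' + 1)^2.
import numpy as np
from sklearn.svm import SVC

def kernel_pq(Z1, Z2):
    return (np.outer(Z1, Z2) + 1.0) ** 2

def train_discrete_log_classifier(z_train, y_train):
    clf = SVC(kernel="precomputed", C=1e6)       # large C: fit the training data exactly
    clf.fit(kernel_pq(z_train, z_train), y_train)
    return clf

def predict(clf, z_train, z_test):
    return clf.predict(kernel_pq(z_test, z_train))
```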
Theorem 4 (Corollary 3.19 in [18]). Let $H$ be a class of functions taking values in $\{+1, -1\}$ with VC dimension $d$. Then with probability at least $1-\delta$ over the sampling of $z_1, \ldots, z_N$ from some distribution $\mathcal{D}$, we have
$$\mathbb{E}_{z\sim\mathcal{D}}\,\mathbb{I}[h(z)\neq y(z)] \le \frac{1}{N}\sum_{i=1}^N \mathbb{I}[h(z_i)\neq y(z_i)] + \sqrt{\frac{2d\log(eN/d)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}, \qquad (K3)$$
for all $h \in H$, where $\mathbb{I}[\text{Statement}] = 1$ if Statement is true and $\mathbb{I}[\text{Statement}] = 0$ otherwise.


A simple and rigorous statement could be made by noticing that the VC-dimension [18, 59] for the function class
{sign(az 2 + bz + c)|a, b, c ∈ R} is 3. Let us apply Theorem 4 with

z = logg (x)/p and H = {sign(az 2 + bz + c)|a, b, c ∈ R}. (K4)

This theorem bounds the prediction error for new inputs z coming from the same distribution from which the training data is sampled. For a given set of training data {(z_i, y(z_i))}_{i=1}^N, we perform a minimization over a, b, c ∈ R such that the training error (1/N) Σ_{i=1}^N I[h(z_i) ≠ y(z_i)] is zero. This can be achieved by applying a standard support vector machine algorithm [60] using the above kernel k^PQ: because the target function y belongs to H, one can always fit the training data perfectly. Using Eq. (K3) with δ = 0.01, we can provide a prediction error bound for the trained projected quantum kernel method

f∗(x) = h∗(log_g(x)/p) = h∗(z) = sign(a∗ z² + b∗ z + c∗).   (K5)

Because we fit the training data perfectly, we have

(1/N) Σ_{i=1}^N I[h∗(z_i) ≠ y(z_i)] = 0.   (K6)

With probability at least 0.99, a projected quantum kernel method f∗(x) = h∗(log_g(x)/p) that perfectly fits a data set of size N = O(log(1/ε)/ε²) has a prediction error

P_{x∼Z∗_p}[f∗(x) ≠ y(x)] ≤ ε.   (K7)

This concludes the proof showing that the discrete logarithm-based learning problem can be solved with a projected
quantum kernel method using a sample complexity independent of the input size n.
Despite the limitations of the quantum kernel method, the authors in [24] have shown that a clever choice of feature
mapping x → ρ(x) would also allow quantum kernels Tr(ρ(xi )ρ(xj )) to predict well in this learning problem.

Appendix L: Details of numerical studies

Here we give the complete details for the numerical studies presented in the main text. For the input distribution, we
focused on the fashion MNIST dataset [48]. We use principal component analysis (PCA) provided by scikit-learn [61]
to map each image (28 × 28 grayscale) into classical vectors xi ∈ Rn , where n is the number of principal components.
After PCA, we normalize the vectors xi such that each dimension is centered at zero and the standard deviation is
one. Finally, we sub-sample 800 data points from the dataset without replacement.

1. Embedding classical data into quantum states

The three approaches for embedding classical vectors xi ∈ Rn into quantum states |xi i are given below.

• E1: Separable encoding or qubit rotation circuit. This is a common choice in the literature, e.g., see [51, 52]; a minimal circuit sketch of this encoding is given after this list.

|x_i⟩ = ⊗_{j=1}^n e^{−i x_{ij} X_j} |0^n⟩,   (L1)

where x_{ij} is the j-th entry of the n-dimensional vector x_i and X_j is the Pauli-X operator acting on the j-th qubit.

• E2: IQP-style encoding circuit. This is an embedding proposed in [15] that suggests a quantum advantage.

|x_i⟩ = U_Z(x_i) H^{⊗n} U_Z(x_i) H^{⊗n} |0^n⟩,   (L2)

where H^{⊗n} is the unitary that applies Hadamard gates on all qubits in parallel, and

U_Z(x_i) = exp( Σ_{j=1}^n x_{ij} Z_j + Σ_{j=1}^n Σ_{j'=1}^n x_{ij} x_{ij'} Z_j Z_{j'} ),   (L3)

with Z_j defined as the Pauli-Z operator acting on the j-th qubit. In the original proposal [15], x ∈ [0, 2π]^n, and they used U_Z(x_i) = exp( Σ_{j=1}^n x_{ij} Z_j + Σ_{j=1}^n Σ_{j'=1}^n (π − x_{ij})(π − x_{ij'}) Z_j Z_{j'} ) instead. Here, due to the data pre-processing steps, x is centered around 0 with a standard deviation of 1, hence we made the equivalent changes to the definition of U_Z(x_i).

• E3: A Hamiltonian evolution ansatz. This ansatz has been explored in the literature [62–64] for quantum many-body problems. We consider a Trotter formula with T Trotter steps (we choose T = 20) for evolving a 1D Heisenberg model with interactions given by the classical vector x_i for a time t proportional to the system size (we choose t = n/3):

|x_i⟩ = ( Π_{j=1}^n exp( −i (t/T) x_{ij} (X_j X_{j+1} + Y_j Y_{j+1} + Z_j Z_{j+1}) ) )^T ⊗_{j=1}^{n+1} |ψ_j⟩,   (L4)

where Xj , Yj , Zj are the Pauli operators for the j-th qubit and |ψj i is a Haar-random single-qubit quantum
state. We sample and fix the Haar-random quantum states |ψj i for every qubit.
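As a concrete illustration of the simplest embedding (E1), the following is a minimal Cirq sketch (Cirq underlies Tensorflow-Quantum, which we use for the full simulations); the 4-dimensional random input and the function name are illustrative.

```python
import numpy as np
import cirq

def e1_encoding_circuit(x):
    """E1 separable encoding: one rotation exp(-i x_j X_j) per qubit, Eq. (L1)."""
    qubits = cirq.LineQubit.range(len(x))
    circuit = cirq.Circuit()
    for q, xj in zip(qubits, x):
        # cirq.rx(theta) implements exp(-i theta X / 2), so theta = 2 x_j
        circuit.append(cirq.rx(2 * xj).on(q))
    return circuit

x = np.random.randn(4)   # stand-in for a PCA-reduced, standardized data point
state = cirq.Simulator().simulate(e1_encoding_circuit(x)).final_state_vector
```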

2. Definition of original and projected quantum kernels

We use Tensorflow-Quantum [47] for implementing the original/projected quantum kernel methods. This is done
by performing quantum circuit simulation for the above embeddings and computing the kernel function k(xi , xj ).
For the quantum kernel, we store the quantum states |x_i⟩ as explicit amplitude vectors and compute the squared inner product

k^Q(x_i, x_j) = |⟨x_i|x_j⟩|².   (L5)

On actual quantum computers, we obtain the quantum kernel by measuring the expectation value of the observable |0^n⟩⟨0^n| on the quantum state U_emb(x_j)† U_emb(x_i) |0^n⟩. For the projected quantum kernel, we use the kernel function

k^PQ(x_i, x_j) = exp( −γ Σ_k Σ_{P∈{X,Y,Z}} ( Tr(P ρ(x_i)_k) − Tr(P ρ(x_j)_k) )² ),   (L6)

where P is a Pauli matrix and γ > 0 is a hyper-parameter chosen to maximize prediction accuracy. We compute
the kernel matrix K ∈ RN ×N with Kij = k(xi , xj ) using the sub-sampled dataset with N = 800 for both the
original/projected quantum kernel.
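For illustration, the following is a minimal NumPy sketch of Eq. (L6) starting from simulated statevectors, where ρ(x_i)_k denotes the one-qubit reduced density matrix of qubit k. This is a simplified stand-in for the Tensorflow-Quantum pipeline described above; the random states and function names are illustrative.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def local_pauli_features(state, n):
    """Tr(P rho(x)_k) for P in {X, Y, Z} and every qubit k, from an n-qubit statevector."""
    psi = state.reshape([2] * n)
    feats = []
    for k in range(n):
        psi_k = np.moveaxis(psi, k, 0).reshape(2, -1)
        rho_k = psi_k @ psi_k.conj().T          # reduced density matrix of qubit k
        feats += [np.real(np.trace(P @ rho_k)) for P in (X, Y, Z)]
    return np.array(feats)

def projected_kernel(states, n, gamma):
    """Projected quantum kernel of Eq. (L6) from a list of statevectors."""
    F = np.array([local_pauli_features(s, n) for s in states])
    sq_dists = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Usage sketch with random states standing in for the embedded data |x_i>:
rng = np.random.default_rng(0)
n = 3
states = []
for _ in range(5):
    v = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
    states.append(v / np.linalg.norm(v))
K = projected_kernel(states, n, gamma=1.0)
```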

3. Dimension and geometric difference

Following the discussion in Appendix F 2, the approximate dimension of the original/projected quantum space is
computed by

Σ_{k=1}^N ( 1/(N − k) Σ_{l=k}^N t_l ),   (L7)

where N = 800 and t1 ≥ t2 ≥ . . . ≥ tN are the singular values of the kernel matrix K ∈ RN ×N . Based on the
discussion in Appendix F 3, we report the minimum geometric difference g of the original/projected quantum space
(we refer to both the original/projected quantum kernel matrix as K P/Q )

g_gen = √( ‖ √(K^{P/Q}) √(K^C) (K^C + λI)^{−2} √(K^C) √(K^{P/Q}) ‖_∞ ),   (L8)

under a condition for having a small training error


g_tra = λ √( ‖ √(K^{P/Q}) (K^C + λI)^{−2} √(K^{P/Q}) ‖_∞ ) < 0.045.   (L9)

The actual value of g will depend on the list of choices for λ and classical kernels K C . We consider the following list
of λ

λ ∈ {0.00001, 0.0001, 0.001, 0.01, 0.025, 0.05, 0.1}, (L10)

and the classical kernel matrix K^C being the linear kernel k^l(x_i, x_j) = x_i† x_j or the Gaussian kernel k^γ(x_i, x_j) = exp(−γ ‖x_i − x_j‖²) with hyper-parameter γ from the list

γ ∈ {0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0}/(n Var[xik ]) (L11)

for estimating the minimum geometric difference. Var[xik ] is the variance of all the coordinates k = 1, . . . , n from all
the data points x1 , . . . , xN . One could add more choices of regularization parameters λ or classical kernel functions,
such as using polynomial kernels or neural tangent kernels, which are equivalent to training neural networks with
large hidden layers (a package, called Neural Tangents [65], is available for use). This will provide a smaller geometric
difference with the quantum state space, but all theoretical predictions remain unchanged.
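For concreteness, a short NumPy sketch of the quantities in Eqs. (L8) and (L9) is given below, under the assumption that the norm in those expressions is the spectral norm (largest eigenvalue of the positive semi-definite matrix inside). In practice one loops over the λ values and classical kernels listed above and reports the smallest g_gen among those combinations satisfying g_tra < 0.045.

```python
import numpy as np

def psd_sqrt(K):
    """Matrix square root of a symmetric positive semi-definite kernel matrix."""
    vals, vecs = np.linalg.eigh(K)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def geometric_difference(KQ, KC, lam):
    """g_gen and g_tra of Eqs. (L8)-(L9) for one classical kernel K^C and one lambda."""
    sq, sc = psd_sqrt(KQ), psd_sqrt(KC)
    inv = np.linalg.inv(KC + lam * np.eye(len(KC)))
    inv2 = inv @ inv                                 # (K^C + lambda I)^{-2}
    g_gen = np.sqrt(np.linalg.norm(sq @ sc @ inv2 @ sc @ sq, 2))   # spectral norm
    g_tra = lam * np.sqrt(np.linalg.norm(sq @ inv2 @ sq, 2))
    return g_gen, g_tra
```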

4. Datasets

We include a variety of classical and quantum data sets.

1. Dataset (C): For the original classical image recognition data set, i.e., Dataset (C) in Figure 3(b), we choose
two classes, dresses (class 3) and shirts (class 6), to form a binary classification task. The prediction error (between 0.0 and 1.0) is equal to the fraction of data points that are incorrectly labeled.
2. Dataset (Q, E1/E2/E3): For the quantum data sets in Figure 3(b), we consider the following quantum neural network

U_QNN = ( Π_{j=1}^n exp( −i (t/T) J_j (X_j X_{j+1} + Y_j Y_{j+1} + Z_j Z_{j+1}) ) )^T,   (L12)

where we choose T = t = 10 and Jj ∈ R are randomly sampled from the Gaussian distribution with mean 0 and
standard deviation 1. We measure Z1 after the quantum neural network, hence the resulting function is

f(x) = Tr( Z_1 U_QNN |x⟩⟨x| U_QNN† ).   (L13)

The mapping from x to |x⟩ depends on the feature embedding (E1, E2, or E3) discussed in Section L 1. A different embedding |x⟩ corresponds to a different function f(x), and hence results in a different dataset. The prediction error for these datasets is the average absolute error with respect to f(x).
3. Engineered datasets: In Figure 4, we consider datasets that are engineered to saturate the potential of
a quantum ML model. Given the choice of classical kernel K C that has the smallest geometric difference g
with a quantum ML model K Q , we can create a data set that saturates sC = g 2 sQ following the procedure in
Appendix G. In particular, we construct the dataset such that sQ = 1 and sC = g 2 . We compute the eigenvector
v corresponding to the maximum eigenvalue of
√(K^Q) √(K^C) (K^C + λI)^{−2} √(K^C) √(K^Q),   (L14)

and construct y′ = √(K^Q) v ∈ R^N, where y′_i corresponds to a real number for data point x_i. Finally, we define the label of the input data point x_i as

y_i = sign(y′_i) with prob. 0.9, and a uniformly random ±1 with prob. 0.1.   (L15)

This data set shows the maximal separation between quantum and classical ML models; a minimal sketch of the label construction is given after this list. The plots in Figure 4 use engineered datasets generated by saturating the geometric difference between classical ML models and quantum ML models based on projected quantum kernels in Equation (L6) under different embeddings (E1, E2, and E3). In Figure 6, we show the results for quantum ML models based on the original quantum kernels.
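The following is a minimal sketch of this label construction (Eqs. (L14) and (L15)). It assumes the reading y′ = √(K^Q) v, which gives s^Q = ‖v‖² = 1, and the function name, random seed, and handling of the 10% label noise are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def engineered_labels(KQ, KC, lam, flip_prob=0.1, seed=0):
    """Engineered +/-1 labels saturating s^C = g^2 s^Q for kernel matrices K^Q, K^C."""
    sq = np.real(sqrtm(KQ))
    sc = np.real(sqrtm(KC))
    inv = np.linalg.inv(KC + lam * np.eye(len(KC)))
    M = sq @ sc @ inv @ inv @ sc @ sq                # matrix of Eq. (L14)
    _, vecs = np.linalg.eigh(M)
    v = vecs[:, -1]                                  # eigenvector of the largest eigenvalue
    y_cont = sq @ v                                  # y' = sqrt(K^Q) v, so s^Q = ||v||^2 = 1
    rng = np.random.default_rng(seed)
    noisy = rng.random(len(y_cont)) < flip_prob      # 10% of labels replaced by random +/-1
    return np.where(noisy, rng.choice([-1, 1], size=len(y_cont)), np.sign(y_cont)).astype(int)
```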

5. Classical machine learning models

We present the list of classical machine learning models that we compare against; an illustrative grid-search sketch is given after this list. We used scikit-learn [61] for training the classical ML models.
• Neural network: We perform a grid search over two-layer feedforward neural networks with hidden layer size

h ∈ {10, 25, 50, 75, 100, 125, 150, 200}. (L16)

For classification, we use MLPClassifier. For regression, we use MLPRegressor.


• Linear kernel method: We perform a grid search over the regularization parameter

C ∈ {0.006, 0.015, 0.03, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256, 512, 1024}. (L17)

For classification, we use SVC with linear kernel. For regression, we choose the best between SVR and Kernel-
Ridge (both using linear kernel).

[Figure 6 panels: Q (E1), Q (E2), Q (E3); in each case the geometric difference g is small.]

FIG. 6. Prediction accuracy (higher the better) on engineered data sets. A label function is engineered to match the geometric
difference g(C||QK) between the original quantum kernel and classical approaches. No substantial advantage is found using
quantum kernel methods at large system size due to the small geometric difference g(C||QK). We consider the best performing
classical ML models among Gaussian SVM, linear SVM, Adaboost, random forest, neural networks, and gradient boosting.

• Gaussian kernel method: We perform a grid search over the regularization parameter

C ∈ {0.006, 0.015, 0.03, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256, 512, 1024}. (L18)

and kernel hyper-parameter

γ ∈ {0.25, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 20.0}/(n Var[xik ]). (L19)

Var[xik ] is the variance of all the coordinates k = 1, . . . , n from all the data points x1 , . . . , xN . For classification,
we use SVC with RBF kernel (equivalent to Gaussian kernel). For regression, we choose the best between SVR
and KernelRidge (both using RBF kernel).

• Random forest: We perform a grid search over the individual tree depth

max depth ∈ {2, 3, 4, 5}, (L20)

and number of trees

n estimators ∈ {25, 50, 100, 200, 500}. (L21)

For classification, we use RandomForestClassifier. For regression, we use RandomForestRegressor.

• Gradient boosting: We perform a grid search over the individual tree depth

max depth ∈ {2, 3, 4, 5}, (L22)

and number of trees

n estimators ∈ {25, 50, 100, 200, 500}. (L23)

For classification, we use GradientBoostingClassifier. For regression, we use GradientBoostingRegressor.

• Adaboost: We perform a grid search over the number of estimators

n estimators ∈ {25, 50, 100, 200, 500}. (L24)

For classification, we use AdaBoostClassifier. For regression, we use AdaBoostRegressor.
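As an illustration of these grid searches, a minimal scikit-learn sketch for the random forest case is shown below; the synthetic inputs stand in for the PCA-processed data, and the 5-fold cross-validation is an illustrative model-selection choice rather than our exact protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the PCA-processed inputs and binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200)).astype(int)

# Grids from the text above (random forest case).
param_grid = {"max_depth": [2, 3, 4, 5],
              "n_estimators": [25, 50, 100, 200, 500]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```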



[Figure 7 panels: n = 10, n = 11, n = 12.]

FIG. 7. Prediction error (lower the better) on quantum data set (E2) over different training set size N . We can see that as the
number of data increases, every model improves and the separation between them decreases.

6. Quantum machine learning models

For training quantum kernel methods, we use the kernel function k Q (xi , xj ) = Tr(ρ(xi )ρ(xj )). For classification,
we use SVC with the quantum kernel. For regression, we choose the best between SVR and KernelRidge (both using
the quantum kernel). We perform a grid search over

C ∈ {0.006, 0.015, 0.03, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256, 512, 1024}. (L25)

For training projected quantum kernel methods, we use the kernel function
 
k^PQ(x_i, x_j) = exp( −γ Σ_k Σ_{P∈{X,Y,Z}} ( Tr(P ρ(x_i)_k) − Tr(P ρ(x_j)_k) )² ),   (L26)

where P is a Pauli matrix. For classification, we use SVC with the projected quantum kernel k PQ (xi , xj ). For
regression, we choose the best between SVR and KernelRidge (both using the projected quantum kernel k PQ (xi , xj )).
We perform a grid search over

C ∈ {0.006, 0.015, 0.03, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256, 512, 1024}. (L27)

and kernel hyper-parameter

γ ∈ {0.25, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 20.0}/(n Var[Tr(P ρ(xi )k )]). (L28)

Var[Tr(P ρ(xi )k )] is the variance of Tr(P ρ(xi )k ) for all P ∈ {X, Y, Z}, all coordinates k = 1, . . . , n, and all data points
x1 , . . . , xN . We report the prediction performance under the best hyper-parameter for all classical and quantum
machine learning models.
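As with the classical models, the kernel-based quantum models reduce to standard scikit-learn calls once the kernel matrix is available. Below is a minimal sketch of an SVM trained on a precomputed (projected) quantum kernel matrix; the 4-fold cross-validation and the variable names are illustrative. GridSearchCV supports precomputed kernels by slicing both rows and columns of the kernel matrix when splitting.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

C_grid = [0.006, 0.015, 0.03, 0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0,
          8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0, 1024.0]

def fit_precomputed_svm(K_train, y_train):
    """Grid-search an SVC directly on a precomputed kernel matrix K_train (N x N)."""
    search = GridSearchCV(SVC(kernel="precomputed"), {"C": C_grid}, cv=4)
    search.fit(K_train, y_train)
    return search.best_estimator_

# Prediction on new points uses the rectangular block K_test_train with entries
# k(x_test_i, x_train_j):
#   model = fit_precomputed_svm(K_train, y_train)
#   y_pred = model.predict(K_test_train)
```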

Appendix M: Additional numerical experiments

In the main text, we have presented engineered data sets that saturate the geometric inequality s^C ≤ g(C||PQ)² s^PQ between classical ML and the projected quantum kernel. As an additional experiment to see whether the same approach can work with the original quantum kernel method, we create similar engineered data sets that saturate the geometric inequality between classical ML and the original quantum kernel. The result is given in Fig. 6. We can see that, due to the large dimension d and the small geometric difference g(C||Q) between classical ML and the quantum kernel at large system size, there is no obvious advantage even in this best-case scenario. Interestingly, we see some advantage of the projected quantum kernel over classical ML even though this data set is not constructed for the projected quantum kernel.

FIG. 8. A comparison between the prediction error bound based on classical kernel methods (see Eq. (D1)) and the prediction
performance of the best classical ML model on the three quantum datasets. We consider the best performing classical ML
models among Gaussian SVM, linear SVM, Adaboost, random forest, neural networks, and gradient boosting. While the
prediction error bound is an upper bound to the actual prediction error, the trends are very similar (a large prediction error
bound gives a large prediction error).

In Fig. 7, we show the prediction performance for learning a quantum neural network over a wide range of training set sizes N. We can see that there is a non-trivial advantage for a small training size N = 100 when comparing the projected quantum kernel with the best classical ML model. However, as the training size N increases, every model improves and the prediction advantage shrinks.
In Fig. 8, we compare the prediction error bound sK (N ) for classical kernel methods and the prediction performance
of the best classical ML model (including a variety of classical ML models in Section L 5). To be more precise, we
consider different classical kernel functions and different regularization parameter λ. Then we compute
s_{K,λ}(N) = √( λ² Σ_{i=1}^N Σ_{j=1}^N ((K + λI)^{−2})_{ij} y_i y_j / N ) + √( Σ_{i=1}^N Σ_{j=1}^N ((K + λI)^{−1} K (K + λI)^{−1})_{ij} y_i y_j / N ).   (M1)

This is a generalization of sK (N ) described in the main text, where we consider regularized classical kernel methods
with a regularization parameter λ to improve generalization performance (setting λ = 0 reduces to sK (N ) given in
the main text). See Section D for a detailed proof of an upper bound to the prediction error (note that the output
label y_i = Tr(O_U ρ(x_i))). We can see that, while there is a non-negligible gap between the prediction error bound and the actual prediction error, the two figures follow a similar trend. When the prediction error bound is small, the prediction error of the best classical ML model is also fairly small (and vice versa). This shows that s_{K,λ}(N) is a good predictive metric for whether a classical ML model can learn to predict outputs from a quantum computation model.
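For reference, Eq. (M1) is a direct matrix computation; a minimal NumPy transcription is given below, where K is assumed to be the N × N kernel matrix on the training data with label vector y, and the function name is illustrative.

```python
import numpy as np

def s_K_lambda(K, y, lam):
    """Prediction error bound s_{K,lambda}(N) of Eq. (M1)."""
    N = len(y)
    A = np.linalg.inv(K + lam * np.eye(N))       # (K + lambda I)^{-1}
    term1 = lam**2 * (y @ A @ A @ y) / N         # first term under the square root
    term2 = (y @ A @ K @ A @ y) / N              # second term under the square root
    return np.sqrt(term1) + np.sqrt(term2)
```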
