Unsupervised Machine Learning On A Hybrid Quantum Computer
CLUSTERING

[FIG. 1: schematic of the hybrid algorithm loop. A driver unitary and a cost unitary prepare the state, measurement yields bit strings (cut assignments), and a classical Bayesian optimizer closes the loop.]

with wij = 0 if there is no edge between them. A cut δ(S) ⊂ E is a set of edges that separates the vertices V into two disjoint sets S and S̄ = V \ S. The cost w(δ(S)) of a cut is defined as the sum of all weights of edges connecting vertices in S with vertices in S̄,

w(δ(S)) = Σ_{i∈S, j∈S̄} wij.    (1)
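For concreteness, the cut cost of Eq. (1) can be evaluated classically in a few lines of Python. The sketch below is ours, for illustration only; it stores the weights as a symmetric NumPy array and encodes a cut as a bit string z with zi = 1 for vi ∈ S.

    import numpy as np

    def cut_cost(w, z):
        """Cost w(delta(S)) of the cut encoded by bit string z, per Eq. (1).

        w : symmetric (n, n) array of edge weights, w[i, j] = 0 if no edge.
        z : length-n array of 0/1 labels; z[i] = 1 means vertex i is in S.
        """
        z = np.asarray(z)
        # An edge (i, j) is cut exactly when z[i] != z[j]; summing the upper
        # triangle counts each unordered pair once.
        cut = z[:, None] != z[None, :]
        return np.triu(w * cut, k=1).sum()

    # Example: a triangle with unit weights and the cut S = {0} has cost 2.
    w = np.array([[0., 1., 1.],
                  [1., 0., 1.],
                  [1., 1., 0.]])
    assert cut_cost(w, [1, 0, 0]) == 2.0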
According to Bayes' rule,

p(f|y) ∝ p(y|f) p(f),    (6)

where p(f|y) is the posterior distribution over function space given the observations y, p(f) is the prior over function space, and p(y|f) is the likelihood of observing the values y given the model for f. With a growing number of optimization steps (observations y) the true black-box objective is increasingly well approximated. The trick lies in choosing the prior p(f) in a way that offers closed-form solutions for easy numerical updates, such as Gaussian processes, which assume a normal distribution as a prior over the function space [18] (cf. S10). In the present case of QAOA, it should be noted that sampling at each step will generally lead to a non-trivial distribution of values when the state |γ, β⟩ is entangled or mixed. To fit this into the Bayesian optimization framework we calculate the best observed sample and return this to the optimizer. Hence, the function f represents the value of the best sampled bit string at location γ, β. More generally, one could compute any statistic of the distribution (as detailed in the appendix).

To avoid a random walk over the space of potential evaluation points, the Bayesian optimizer maximizes a utility function that can be calculated from the posterior distribution after each update. In this way, it intelligently chooses points to minimize the number of costly evaluations of the black-box objective function (see the appendix for more details).
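In our implementation this loop is handled by the open-source BayesianOptimization package [19]. The sketch below shows how such an optimizer can be driven; it assumes the package's current Python interface and substitutes a toy objective for the actual QPU call.

    import numpy as np
    from bayes_opt import BayesianOptimization

    def estimate_cost(gamma, beta):
        # Stand-in for the QPU call: run the p = 1 QAOA circuit at angles
        # (gamma, beta), sample bit strings, and return the best cut value.
        return np.sin(gamma) * np.cos(beta)

    optimizer = BayesianOptimization(
        f=estimate_cost,
        pbounds={"gamma": (0.0, 2 * np.pi), "beta": (0.0, 2 * np.pi)},
        random_state=42,
    )
    optimizer.maximize(init_points=5, n_iter=50)
    print(optimizer.max)  # best (gamma, beta) found and the associated value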
THE QUANTUM PROCESSOR

We ran the QAOA optimizer on a quantum processor consisting of 20 superconducting transmon qubits [34] with fixed capacitive coupling in the lattice shown in Fig. 2. Qubits 0–4 and 10–14 are tunable while qubits 5–9 and 15–19 are fixed-frequency devices. The former have two Josephson junctions in an asymmetric SQUID geometry to provide roughly 1 GHz of frequency tunability, and flux-insensitive "sweet spots" [35] near ω01^max/2π ≈ 4.5 GHz and ω01^min/2π ≈ 3.0 GHz. These tunable qubits are coupled to bias lines for AC and DC flux delivery. Each qubit is capacitively coupled to a quasi-lumped-element resonator for dispersive readout of the qubit state [36, 37]. Single-qubit control is effected by applying microwave drives at the resonator ports, and two-qubit gates are activated via RF drives on the flux bias lines, as described below.
FIG. 2. Connectivity of Rigetti 19Q. a, Chip schematic showing tunable transmons (teal circles) capacitively coupled to fixed-frequency transmons (pink circles), labeled 0–19. b, Optical chip image (scale bar: 2 mm). Note that some couplers have been dropped to produce a lattice with three-fold, rather than four-fold, connectivity.

The device is fabricated on a high-resistivity silicon substrate with superconducting through-silicon-via technology [38] to improve RF isolation. The superconducting circuitry is realized with aluminum (Tc ≈ 1.2 K) and patterned using a combination of optical and electron-beam lithography. Due to a fabrication defect, qubit 3 is not tunable, which prohibits operation of the two-qubit parametric gate described below between qubit 3 and its neighbors (8 and 9). Consequently, we treat this as a 19-qubit processor.
In Rigetti 19Q, as we call our device, each tunable qubit is capacitively coupled to one to three fixed-frequency qubits. The DC flux biases are set close to zero flux such that each tunable qubit is at its maximum frequency ωT^max. Two-qubit parametric CZ gates are activated in the |11⟩ ↔ |20⟩ and/or |11⟩ ↔ |02⟩ sub-manifolds by applying an RF flux pulse with amplitude A0, frequency ωm, and duration tCZ to the tunable qubit [39–41]. For RF flux modulation about the qubit extremal frequency, the oscillation frequency is doubled to 2ωm and the mean effective qubit frequency shifts to ω̄T. Note that the frequency shift increases with larger flux pulse amplitude. The effective detuning between neighboring qubits becomes ∆ = ω̄T − ωF.

The resonant condition for a CZ gate is achieved when ∆ = 2ωm − ηT or ∆ = 2ωm + ηF, where ηT and ηF are the anharmonicities of the tunable and fixed qubit, respectively. An effective rotation angle of 2π on these transitions imparts a minus sign to the |11⟩ state, realizing the CZ gate.
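As an illustration of these resonance conditions, the required modulation frequencies can be solved for directly. The following sketch uses placeholder numbers, not measured device values.

    def cz_modulation_frequencies(omega_T_bar, omega_F, eta_T, eta_F):
        """Modulation frequencies solving the two CZ resonance conditions.

        Delta = 2*omega_m - eta_T targets the |11> <-> |20> sub-manifold and
        Delta = 2*omega_m + eta_F targets the |11> <-> |02> sub-manifold,
        with Delta = omega_T_bar - omega_F the effective detuning.
        """
        delta = omega_T_bar - omega_F
        return (delta + eta_T) / 2.0, (delta - eta_F) / 2.0

    # Illustrative values in GHz (placeholders only):
    w_20, w_02 = cz_modulation_frequencies(4.40, 3.80, -0.21, -0.21)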
[Figure: the randomly weighted problem graph mapped onto qubits 0–2 and 4–19 of the lattice, with edge weights between 0.18 and 0.81.]

FIG. 5. The performance of our implementation of the clustering algorithm.

IMPLEMENTATION
implemented by the sequence CNOTij · RZj(γwij) · CNOTij [45], requiring two interactions. The second factor is that the CNOT interactions cannot be applied simultaneously on overlapping sets of qubits. Given the connectivity of the graph for our problem (where some vertices have 3 neighbors), that means the cost terms must be broken down into 3 rounds of non-overlapping unitaries, each of which consists of two interactions, so that the overall circuit has a depth corresponding to 6 two-qubit gates interspersed with single-qubit operations. Additional circuit compilation steps are taken to minimize the number of intermediate single-qubit operations, so the depth is ultimately dominated by the interactions.
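As an illustration, a single cost term can be expressed directly in pyQuil; the sketch below assumes a recent pyQuil release and is not a reproduction of the compiled program used in the experiment.

    from pyquil import Program
    from pyquil.gates import CNOT, RZ

    def cost_term(i, j, gamma, w_ij):
        """exp(-i gamma w_ij Z_i Z_j / 2) via CNOT_ij . RZ_j(gamma w_ij) . CNOT_ij."""
        p = Program()
        p.inst(CNOT(i, j))
        p.inst(RZ(gamma * w_ij, j))  # RZ(theta) = exp(-i theta Z / 2)
        p.inst(CNOT(i, j))
        return p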
The entire algorithm is implemented in Python, leveraging the pyQuil library [17] for describing parameterized quantum circuits in the quantum instruction language Quil [16], and run through the Forest platform [42] for controlling the quantum computer and accessing the data it generates. The Bayesian optimizer is provided by the open-source package BayesianOptimization, also written in Python [19].

with the total number of samples. In Fig. 5 we compare the empirical distributions for a noiseless simulation of the algorithm (orange), the experimental implementation of the algorithm (blue), and the random sampling of bit strings with finite statistics (green), along with the analytical distribution for random sampling of bit strings (red); the shaded regions correspond to 95% confidence intervals. Taking the null hypothesis to be the random sampling of bit strings, standard hypothesis-testing methods based on the Kolmogorov-Smirnov statistic exclude the null hypothesis as an explanation for the behavior of the experimental implementation of the algorithm at a level higher than 99%. Similarly, the null hypothesis is excluded as an explanation for the behavior of the noiseless simulation of the algorithm at a level higher than 99.99% (see the appendix for more details).

CONCLUSION AND SUMMARY
CONTRIBUTIONS

J.S.O., N.R., and M.P.S. developed the theoretical proposal. J.S.O. and E.S.F. implemented the algorithm, and J.S.O. performed the data analysis. M.B., E.A.S., M.S., and A.B. designed the 20-qubit device. R.M., S.C., C.A.R., N.A., A.S., S.H., N.D., D.S., and P.S. brought up the experiment. C.B.O., A.P., B.B., P.K., G.P., N.T., and M.R. developed the infrastructure for automatic recalibration. E.C.P., P.K., W.Z., and R.S.S. developed the compiler and QVM tools. J.S.O., M.P.S., B.R.J., M.R., and R.M. wrote the manuscript. B.R.J., M.R., A.H., M.P.S., and C.R. were principal investigators of the effort.

[1] P. W. Shor, in Proceedings of the 35th Annual Symposium on Foundations of Computer Science (1994) pp. 124–134.
[2] A. W. Harrow, A. Hassidim, and S. Lloyd, Phys. Rev. Lett. 103, 150502 (2009).
[3] P. W. Shor, in Proceedings of the 37th Conference on Foundations of Computer Science (1996) pp. 56–65.
[4] E. Knill, R. Laflamme, and W. H. Zurek, Science 279, 342 (1998).
[5] D. Aharonov and M. Ben-Or, SIAM Journal on Computing 38, 1207 (2008).
[6] P. Aliferis, D. Gottesman, and J. Preskill, Quantum Info. Comput. 6, 97 (2006).
[7] A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, and J. L. O'Brien, Nature Communications 5, 4213 (2014).
[8] J. R. McClean, J. Romero, R. Babbush, and A. Aspuru-Guzik, New Journal of Physics 18, 023023 (2016).
[9] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, Nature 549, 242 (2017).
[10] J. I. Colless, V. V. Ramasesh, D. Dahlen, M. S. Blok, J. R. McClean, J. Carter, W. A. de Jong, and I. Siddiqi, "Robust determination of molecular spectra on a quantum processor," (2017), arXiv:1707.06408.
[11] E. Farhi, J. Goldstone, and S. Gutmann, arXiv:1411.4028 (2014).
[12] S. Hadfield, Z. Wang, B. O'Gorman, E. G. Rieffel, D. Venturelli, and R. Biswas, "From the quantum approximate optimization algorithm to a quantum alternating operator ansatz," (2017), arXiv:1709.03489.
[13] I. H. Kim and B. Swingle, "Robust entanglement renormalization on a noisy quantum computer," (2017), arXiv:1711.07500.
[14] A. Lucas, Frontiers in Physics 2, 5 (2014).
[15] S. Poljak and Z. Tuza, in Combinatorial Optimization, Vol. 20, edited by W. Cook, L. Lovász, and P. Seymour (American Mathematical Society, 1995).
[16] R. S. Smith, M. J. Curtis, and W. J. Zeng, "A practical quantum instruction set architecture," (2016).
[17] Rigetti Computing, "pyquil," https://github.com/rigetticomputing/pyquil (2016).
[18] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (MIT Press, 2006).
[19] F. Nogueira, "bayesian-optimization," https://github.com/fmfn/BayesianOptimization (2014).
[20] A. K. Jain, M. N. Murty, and P. J. Flynn, ACM Comput. Surv. 31, 264 (1999).
[21] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988).
[22] A. S. Shirkhorshidi, S. Aghabozorgi, and T. Y. Wah, PLOS ONE 10, 1 (2015).
[23] S. Boriah, V. Chandola, and V. Kumar, "Similarity measures for categorical data: A comparative evaluation," in Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 243–254.
[24] R. M. Karp, in Complexity of Computer Computations, edited by R. E. Miller and J. W. Thatcher (New York: Plenum, 1972) pp. 85–103.
[25] B. Alidaee, G. A. Kochenberger, and A. Ahmadian, International Journal of Systems Science 25, 401 (1994).
[26] J. Krarup and P. M. Pruzan, "Computer-aided layout design," in Mathematical Programming in Use, edited by M. L. Balinski and C. Lemarechal (Springer Berlin Heidelberg, Berlin, Heidelberg, 1978) pp. 75–94.
[27] G. Gallo, P. L. Hammer, and B. Simeone, "Quadratic knapsack problems," in Combinatorial Optimization, edited by M. W. Padberg (Springer Berlin Heidelberg, Berlin, Heidelberg, 1980) pp. 132–149.
[28] H. Neven, G. Rose, and W. G. Macready, arXiv:0804.4457 (2008).
[29] Note that we can turn any maximization procedure into a minimization procedure by simply changing the sign of the objective function.
[30] GNOME icon artists, "Gnome computer icon," (2008), Creative Commons license.
[31] Z. Wang, S. Hadfield, Z. Jiang, and E. G. Rieffel, "The quantum approximation optimization algorithm for maxcut: A fermionic view," (2017).
[32] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, Proceedings of the IEEE 104, 148 (2015).
[33] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. M. A. Patwary, P. Prabhat, and R. P. Adams, in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15 (JMLR.org, 2015) pp. 2171–2180.
[34] J. Koch, T. Yu, J. M. Gambetta, A. A. Houck, D. I. Schuster, J. Majer, A. Blais, M. H. Devoret, S. M. Girvin, and R. J. Schoelkopf, Physical Review A 76, 042319 (2007).
[35] D. Vion, A. Aassime, A. Cottet, P. Joyez, H. Pothier, C. Urbina, D. Esteve, and M. H. Devoret, Science 296, 886 (2002).
[36] A. Blais, R.-S. Huang, A. Wallraff, S. M. Girvin, and R. J. Schoelkopf, Physical Review A 69, 062320 (2004).
[37] A. Blais, J. M. Gambetta, A. Wallraff, D. I. Schuster, S. M. Girvin, M. H. Devoret, and R. J. Schoelkopf, Physical Review A 75, 032329 (2007).
[38] M. Vahidpour, W. O'Brien, J. T. Whyland, J. Angeles, J. Marshall, D. Scarabelli, G. Crossman, K. Yadav, Y. Mohan, C. Bui, V. Rawat, R. Renzas, N. Vodrahalli, A. Bestwick, and C. Rigetti, arXiv:1708.02226 (2017).
[39] N. Didier, E. A. Sete, M. P. da Silva, and C. Rigetti, "Analytical modeling of parametrically modulated transmon qubits," (2017), arXiv:1706.06566.
[40] S. Caldwell, N. Didier, C. A. Ryan, E. A. Sete, A. Hudson, P. Karalekas, R. Manenti, M. Reagor, M. P. da Silva, R. Sinclair, E. Acala, N. Alidoust, J. Angeles, A. Bestwick, M. Block, B. Bloom, A. Bradley, C. Bui, L. Capelluto, R. Chilcott, J. Cordova, G. Crossman, M. Curtis, S. Deshpande, T. E. Bouayadi, D. Girshovich, S. Hong, K. Kuang, M. Lenihan, T. Manning, A. Marchenkov, J. Marshall, R. Maydra, Y. Mohan, W. O'Brien, C. Osborn, J. Otterbach, A. Papageorge, J. P. Paquette, M. Pelstring, A. Polloreno, G. Prawiroatmodjo, V. Rawat, R. Renzas, N. Rubin, D. Russell, M. Rust, D. Scarabelli, M. Scheer, M. Selvanayagam, R. Smith, A. Staley, M. Suska, N. Tezak, D. C. Thompson, T. W. To, M. Vahidpour, N. Vodrahalli, T. Whyland, K. Yadav, W. Zeng, and C. Rigetti, "Parametrically activated entangling gates using transmon qubits," (2017), arXiv:1706.06562.
[41] M. Reagor, C. B. Osborn, N. Tezak, A. Staley, G. Prawiroatmodjo, M. Scheer, N. Alidoust, E. A. Sete, N. Didier, M. P. da Silva, E. Acala, J. Angeles, A. Bestwick, M. Block, B. Bloom, A. Bradley, C. Bui, S. Caldwell, L. Capelluto, R. Chilcott, J. Cordova, G. Crossman, M. Curtis, S. Deshpande, T. E. Bouayadi, D. Girshovich, S. Hong, A. Hudson, P. Karalekas, K. Kuang, M. Lenihan, R. Manenti, T. Manning, J. Marshall, Y. Mohan, W. O'Brien, J. Otterbach, A. Papageorge, J. P. Paquette, M. Pelstring, A. Polloreno, V. Rawat, C. A. Ryan, R. Renzas, N. Rubin, D. Russell, M. Rust, D. Scarabelli, M. Selvanayagam, R. Sinclair, R. Smith, M. Suska, T. W. To, M. Vahidpour, N. Vodrahalli, T. Whyland, K. Yadav, W. Zeng, and C. T. Rigetti, "Demonstration of universal parametric entangling gates on a multi-qubit lattice," (2017), arXiv:1706.06570.
[42] Rigetti Computing, "Forest," https://www.rigetti.com/forest (2017).
[43] A. Bhattacharyya, Bulletin of the Calcutta Mathematical Society, 99 (1943).
[44] As the Bhattacharyya coefficient is not a traditional distance metric (it violates the triangle inequality), we should interpret clustering as the characteristic that distributions within the same cluster, i.e. with the same label, have minimal (in our case zero) overlap. Phrased in this way the connection to VLSI design becomes obvious, where one sub-goal is to identify groups of objects with minimal overlap.
[45] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2011).
[46] J. Bergstra and Y. Bengio, Journal of Machine Learning Research 13, 281 (2012).
[47] S. van der Walt, S. C. Colbert, and G. Varoquaux, Computing in Science & Engineering 13, 22 (2011).
[48] J. Johansson, P. Nation, and F. Nori, Computer Physics Communications 184, 1234 (2013).
[49] J. D. Hunter, Computing in Science & Engineering 9, 90 (2007).
[50] Inkscape Project, "Inkscape."
[51] J. K. Blitzstein and J. Hwang, Introduction to Probability (CRC Press, Taylor & Francis Group, 2015).
[52] We use the short-hand notation z_{i:j} to denote the collection of values {z_i, . . . , z_j}.
[53] J. Snoek, H. Larochelle, and R. P. Adams, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, NIPS'12 (Curran Associates Inc., USA, 2012) pp. 2951–2959.
[54] J. Chung, P. Kannappan, C. Ng, and P. Sahoo, Journal of Mathematical Analysis and Applications 138, 280 (1989).
[55] G. B. Coleman and H. C. Andrews, Proc. IEEE 67, 773 (1979).
[56] E. Magesan, M. Gambetta, and J. Emerson, Phys. Rev. Lett. 106, 180504 (2011).
SUPPLEMENTARY INFORMATION
Starting with the Maxcut formulation (1) we can construct the Ising Hamiltonian connected to a given Maxcut instance. To this end we note that we can lift a general graph G on n nodes to a fully connected graph Kn by introducing the missing edges and initializing their corresponding weights to zero. We assume that the weights wij = wji are symmetric, corresponding to an undirected graph, and introduce Ising spin variables sj ∈ {−1, +1} taking on the value sj = +1 if vj ∈ S and sj = −1 if vj ∈ S̄. With this we can express the cost of a cut as

w(δ(S)) = Σ_{i∈S, j∈S̄} wij
        = (1/2) Σ_{(i,j)∈δ(S)} wij
        = (1/4) Σ_{i,j∈V} wij − (1/4) Σ_{i,j∈V} wij si sj
        = (1/4) Σ_{i,j∈V} wij (1 − si sj).    (S1)
Identifying the spin variables with the spin operators of qubits yields the quantum analog of the weighted Maxcut problem as

ĤC = −(1/2) Σ_{i,j∈V} wij (1 − σ̂iz σ̂jz),    (S2)

where we introduce the additional "−" sign to encode the optimal solution as the minimal energy state (as opposed to the maximal energy state from the Maxcut prescription). In this sense we want to minimize ĤC.
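As a sanity check, the diagonal matrix elements ⟨s|ĤC|s⟩ can be evaluated classically; the following sketch (ours, for illustration) makes the sign convention explicit.

    import numpy as np

    def ising_cost(w, s):
        """<s| H_C |s> for spins s_i in {-1, +1}, following Eq. (S2).

        The sum over all ordered pairs double-counts each edge, so this equals
        -2x the cut cost of Eq. (1); minimizing it therefore maximizes the cut.
        """
        s = np.asarray(s, dtype=float)
        return -0.5 * np.sum(w * (1.0 - np.outer(s, s)))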
The full optimization trace of running the Bayesian optimized Maxcut clustering algorithm is shown in Fig. S1,
where the abscissa shows the step count of the optimizer. Each violin in the top panel shows the kernel-density
estimates (KDE) of the cost distribution associated with the sampled bit-strings at the corresponding step. The width
reflects the frequency with which a given cost has been sampled, while the thick and thin lines within each violin indicate the 1σ and 2σ intervals of the distribution, respectively. The KDE is cut off at the observed extreme values of the sample. Finally, the white dot at the center of each violin shows the mean value of the
sampled distribution. In the optimization procedure we return the largest value of the cost distribution. The middle
panel shows the best sampled value at step i (red curve) corresponding to the extreme value of the distributions in the
top panel, whereas the green curve is the mean value. The blue curve is the historic best value of the optimizer and
shows the construction of the individual trace curves of Fig. 4. Finally the lowest panel shows the behavior of the
Bayesian optimizer in choosing the next hyperparameter pair (γ, β). The jumpiness of the angle choices is likely due
to the noise in the 19Q chip and seems significantly reduced in a simulated experiment as seen in Fig. S2d.
As we can see, there is some variability in the mean value as well as the width of the distributions. At certain (β, γ)
points we do indeed sample large cost values from the distribution corresponding to (approximate) solutions of the
randomly chosen Maxcut problem instance.
To demonstrate the clustering properties of the algorithm we simulate a larger Maxcut problem instance on the
Quantum Virtual Machine (QVM). We construct the distance matrix shown in Fig. S2a resulting from the Euclidean
distance between 20 random points in R² as shown in Fig. S2b. The corresponding graph is fully connected and the
label assignment corresponding to its Maxcut solution is shown in Fig. S2b.

FIG. S1. Trace of the Bayesian optimization of a p = 1 step QAOA procedure for the 19-qubit Maxcut problem instance with random weights as discussed in the main text. Each violin contains 2500 samples drawn from the QPU and is cut off at its observed extreme values. We normalized the plot to indicate the best possible value. Note that this normalization is intractable in general (i.e., it requires knowledge of the true optimum, which is hard to obtain). Detailed descriptions are in the text.

It is worth pointing out that this is a bipartite graph and has only two equally optimal solutions, the one shown and the exact opposite coloring. Hence
randomly sampling bit-strings only has a chance of 2/2²⁰ ≈ 2 · 10⁻⁶ of finding an optimal solution, meaning we would have to sample on the order of 2¹⁹ bit-strings to find the correct answer with significant success probability. The corresponding optimization trace is shown in Fig. S2c. Each violin contains N = 250 samples, and hence we sample only 250/2²⁰ ≈ 0.02% of the full state space at each point, corresponding to a chance of 250 · 2/2²⁰ ≈ 4.7 · 10⁻⁴ to sample the right bit-string. This corresponds to a 100× improvement of the sampling procedure given a correctly prepared distribution as compared to just sampling from a uniform distribution.
FIG. S2. (a) Euclidean distance matrix for a sample of 20 points as shown in Fig. S2b before labels have been assigned. This matrix is used as the adjacency matrix for the Maxcut clustering algorithm. (b) Random sample of 20 two-dimensional points forming two visually distinguishable clusters. The color assignment is the result of the clustering algorithm (Algorithm 1). Calculating the mutual Euclidean distances for all points gives rise to the distance matrix shown in Fig. S2a. (c) QAOA optimization trace for a fully connected 20-node graph corresponding to the distance matrix in Fig. S2a. The parameters are p = 1 and each violin contains N = 250 samples, i.e. we sample 250/2²⁰ ≈ 0.02% of the whole state space. This demonstrates that the algorithm is capable of finding good solutions even for a non-trivial instance. (d) Alternative view of the optimization trace in (c), plotting the average and best observed cost at each (β, γ) pair together with the historic best value of the optimizer.
We can see that due to the fully connected nature of the graph the variation in the mean is not significant for p = 1 steps in the QAOA iteration (see Fig. S2d for
more details). However, there are significant variations in the standard deviation of the sampled distributions, with
only a few of them allowing access to the optimal value with so few samples. A better view of the optimizer trace is
shown in Fig. S2d where we plot the average and best observed cost at each (β, γ)-pair in addition to the overall best
value observed at the time a new point is evaluated. We can see that the optimizer slowly improves its best value and
that it increasingly samples from distributions with large standard deviations. The clustering steps are described in
pseudo-code by Algorithm 1.
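A minimal Python transcription of these steps, assuming a qaoa_maxcut subroutine that returns the best bit string sampled by the QAOA procedure (a sketch, not the pseudo-code itself):

    import numpy as np

    def cluster_by_maxcut(points, qaoa_maxcut):
        """Two-way clustering via weighted Maxcut, mirroring Algorithm 1.

        points      : (n, 2) array of samples in R^2.
        qaoa_maxcut : callable mapping a symmetric weight matrix to the best
                      sampled bit string of cluster labels.
        """
        # Pairwise Euclidean distances serve as the Maxcut edge weights.
        diff = points[:, None, :] - points[None, :, :]
        w = np.sqrt((diff ** 2).sum(axis=-1))
        # The Maxcut solution directly induces the cluster assignment.
        return np.asarray(qaoa_maxcut(w))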
FIG. S3. Update process of a Bayesian optimization procedure. In both plots the blue area indicates a 2σ interval of the GP distribution. (a) GP prior distribution; without any updates we can draw any of the three curves, which lie more or less within the two-standard-deviation band of the prior. (b) Evaluating the true value of the random variable clamps the GP distribution at the observation points, forcing us to update the prior according to Bayes' rule. Values drawn from the posterior GP distribution hence have reduced overall uncertainty and an adjusted mean value.
We can interpret the cost associated with the Ising Hamiltonian ĤC corresponding to a classical combinatorial optimization problem (COP) as a function

f(x) = ⟨x| ĤC |x⟩,    (S3)

where |x⟩ ∈ Z₂ⁿ is a classical bit-string sampled from the distribution Dx with PDF p(x; θ) = |ψ(x; θ)|² prepared by running the QAOA procedure as described in the main text, and θ = (γ, β). We can now identify the bit-string variable |x⟩ as a random variable X ∼ Dx(θ) drawn from the parameterized distribution Dx(θ) ⊆ Z₂ⁿ, and consequently can interpret the function f as a random variable F ∼ Df(θ) ⊆ R by functional composition, with a PDF pf(v; θ) and CDF Cf(v; θ).
In order to optimize the distribution Df(θ) for sampling (nearly) optimal bit-strings we need to define an optimization objective. To this end, we consider the j-th order statistics F(j) [51] of an i.i.d. sample of N experiments {Fi}i=1...N:

F(1) = min{F1, . . . , FN}
F(2) = min({F1, . . . , FN} \ {F(1)})
⋮
F(N−1) = max({F1, . . . , FN} \ {F(N)})
F(N) = max{F1, . . . , FN}.    (S4)

By definition we have F(1) ≤ F(2) ≤ · · · ≤ F(N). Notice that the F(j) are random variables as well, but due to the ordering relation they are no longer i.i.d. We can use the order statistics to define the k-th percentile (k = 1, . . . , 100) as the CDF of F(j) with ⌊100j/N⌋ = k. The optimization routine can now be used to optimize the expectation value of any order statistic at hand. To this end we need to compute the PDF of these order statistics [51] according to
pF(j)(v; θ) = N (N−1 choose j−1) pf(v; θ) Cf(v; θ)^(j−1) (1 − Cf(v; θ))^(N−j).    (S5)
In practice we are mostly interested in optimizing the extreme value statistic, i.e. minimizing the first-order or
maximizing the N -th order statistic. The expectation values of these can be computed as
s1(θ) := ⟨F(1)(θ)⟩ = N ∫ dv v pf(v; θ) (1 − Cf(v; θ))^(N−1)    (S6)

and

sN(θ) := ⟨F(N)(θ)⟩ = N ∫ dv v pf(v; θ) Cf(v; θ)^(N−1).    (S7)
Note that this approach also enables us to estimate the uncertainty of these random variables, giving quality estimates of the sample. Despite their complicated appearance, these values can readily be computed numerically from a set of samples of the distribution Df(θ). A pseudo-code representation of the statistics calculation is given in Algorithm 2.
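One possible numerical route (a sketch, not Algorithm 2 itself) is to estimate the extreme-value expectations (S6) and (S7) by resampling batches of size N from the empirical distribution:

    import numpy as np

    def extreme_value_statistics(samples, N, batches=10000, seed=None):
        """Monte-Carlo estimates of <F_(1)> and <F_(N)> from i.i.d. samples,
        standing in for the integrals in Eqs. (S6) and (S7)."""
        rng = np.random.default_rng(seed)
        draws = rng.choice(np.asarray(samples), size=(batches, N), replace=True)
        return draws.min(axis=1).mean(), draws.max(axis=1).mean()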
The extreme value functions sj(θ) : [0, 2π)^(2p) → R, j = 1, N, are generally not analytically known and are typically expensive to evaluate. Hence we have access to s(θ) only through evaluating it on a set of m points θ1:m [52] with corresponding variables vi = s(θi) and noisy observations y1:m. Note that we drop the subscript j distinguishing the minimal and maximal value functions, as the approach is identical for either of them. To make use of Bayesian optimization techniques [32, 53] we assume that the variables v = v1:m are jointly Gaussian and that the observations y1:m are normally distributed given v, completing the construction of a Gaussian process (GP) [18]. The distinction between the variable vi and its observation yi is important given that the expectation values (S6) and (S7) are subject to sampling noise, due to finite samples but also due to finite gate and readout fidelities and other experimental realities, and hence cannot be known exactly. We describe the GP using the moments of a multivariate normal distribution
normal distribution
m(θ) =E[v(θ)] (S8)
k(θ, θ0 ) =E[(v(θ) − µ(θ))(v(θ0 ) − µ(θ0 ))] (S9)
and introduce the notation
To account for imperfect/noisy evaluations of the true underlying function we simply need to adjust the update rule for the GP mean and kernels. This can also be done analytically and hence be directly applied to our case of numerically sampled extreme-value statistics. For exhaustive details of the update rules, kernels, and general GP properties see Ref. [18]. It should be noted that the update rules of Gaussian kernels require a matrix inversion, which scales as O(m³) and hence can become prohibitively expensive when the number m of samples becomes large. Improving this scaling is an active area of research, and early promising results such as Deep Networks for Global Optimization (DNGO) [33] provide surrogate methods for Gaussian processes with linear scaling.
So far we have only been concerned with updating the GP to reflect the incorporation of new knowledge. To close the optimization loop we need a procedure to select the next sampling point θ∗. Choosing a point at random essentially induces a random walk over the optimization parameter domain and might not be very efficient [46] (though this is still better than a grid search). A nice improvement over this random walk is offered by the Bayesian framework itself: since the posterior distribution is constantly updated, we can estimate the areas of highest uncertainty and consequently choose the next point accordingly. However, this might bring the optimizer far away from the optimal point by trying to minimize the global uncertainty. To prevent the optimizer from drifting off we need a way to balance its tendency to explore areas of high uncertainty with exploiting the search around the currently best known value. This trade-off is encapsulated in the acquisition function α(θ; Dm), where Dm = {(θi, yi)}i=1,...,m is the set of observations up to iteration m [32]. Since the posterior is Gaussian for every point θ there are many analytic ways to construct an acquisition function. Here, we use the Upper Confidence Bound (UCB) metric, which can be calculated as

α(θ; Dm) = µm(θ) + βm σm(θ),    (S11)

where βm is a hyperparameter controlling the explore-exploit behavior of the optimizer and µm(θ), σm(θ) are the mean and variance of the Gaussian of the posterior GP restricted to the point θ. Maximizing the acquisition function over all values θ at each iteration step yields the next point for sampling from the unknown function s. For more details see Ref. [32]. A pseudo-code representation of the Bayesian optimization routine is given in Algorithm 3.
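For reference, a from-scratch sketch of such a UCB-driven loop (not the BayesianOptimization package used in our implementation) can be written with scikit-learn's Gaussian-process regressor:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def bayes_opt_ucb(objective, bounds, n_init=5, n_iter=40, beta=2.0, seed=None):
        """Minimal Bayesian optimization with a UCB acquisition function."""
        rng = np.random.default_rng(seed)
        lo = np.array([b[0] for b in bounds])
        hi = np.array([b[1] for b in bounds])
        X = rng.uniform(lo, hi, size=(n_init, len(bounds)))  # initial design
        y = np.array([objective(x) for x in X])
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        for _ in range(n_iter):
            gp.fit(X, y)
            # Maximize the UCB over a random candidate set (cheap global search).
            cand = rng.uniform(lo, hi, size=(2048, len(bounds)))
            mu, sigma = gp.predict(cand, return_std=True)
            x_next = cand[np.argmax(mu + beta * sigma)]
            X = np.vstack([X, x_next])
            y = np.append(y, objective(x_next))
        best = int(np.argmax(y))
        return X[best], y[best]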
To demonstrate the applicability of the results beyond a single problem instance we ran the algorithm on 5 randomly chosen problem instances over a fourteen-hour window on the 19Q architecture. We recorded the optimization traces (cf. Fig. S4a) and calculated the empirical CDF (eCDF) for the time-to-optimum, i.e. the number of steps before the optimizer reached the optimal value, as seen in Figs. 5 and S4b. Note that we can estimate the optimal value easily for the problem at hand. We compared the eCDF to the CDF of a random sampling procedure that follows a Bernoulli distribution B(N, p) with a success probability p = 2/2¹⁹ and N = Nsteps Nshots samples. The
additional factor 2 in the success probability is due to the inversion symmetry of the solution, i.e. there are two
equivalent solutions which minimize the cost and are related to each other by simply inverting each bit-assignment.
The CDF for the Bernoulli random variable can then be easily written as

P(success after k steps) = 1 − (1 − p)^(k·Nshots).    (S12)
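Eq. (S12) is straightforward to evaluate numerically; a one-line Python helper used for the comparisons below:

    def success_cdf(k, n_shots, p):
        """Eq. (S12): probability of having sampled an optimal bit string at
        least once within k optimization steps of n_shots samples each."""
        return 1.0 - (1.0 - p) ** (k * n_shots)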
FIG. S4. (a) Traces of the normalized Maxcut cost for 83 independent runs of the algorithm on the 19Q chip for the fixed random problem instance of Fig. 3. Notice that most traces reach the optimal cost well before the cutoff at 55 steps. (b) The performance of our implementation of the clustering algorithm (red) compared to the performance of an algorithm that simply draws cluster assignments at random (green). It is clear that our algorithm generates the optimal assignment much more quickly than would be expected by chance: the 95% confidence region for our empirical observations has very small overlap with the distribution given by random assignments, and the Kolmogorov-Smirnov statistic indicates we can reject the null hypothesis of random assignments at a level higher than 99.9%.
TABLE S1. Kolmogorov-Smirnov statistics and significance values for the CDFs shown in Figs. 5 and S4b. All values are calculated with respect to the exact random-sampling CDF.

eCDF                                          KS     α
empirical random bitstring sampling (Fig. 5)  0.077  1.559
Rigetti-QVM (Fig. 5)                          0.838  1.273 · 10⁻⁷
19Q single instance (Fig. 5)                  0.339  1.339 · 10⁻²
19Q randomized instances (Fig. S4b)           0.392  8.4451 · 10⁻⁴
To compare the eCDF (red curve in Fig. S4b) to the random sampling CDF (green curve) we calculate the Kolmogorov-Smirnov statistic between two eCDFs,

KSn,m = sup_x |F1,n(x) − F2,m(x)|,    (S13)

where F1,n is the first eCDF with n points and F2,m is the second one with m points. Given the eCDFs of Fig. S4b we find KS23,55 ≈ 0.392. We can calculate the significance level α by inverting the prescription for rejection of the null hypothesis H0, i.e. that the two eCDFs result from the same underlying distribution function:

KSn,m ≥ c(α) √((n + m)/(nm)),    (S14)

where c(α) = √(−0.5 log(α/2)). Plugging in the empirical KS statistic we find that H0 can be rejected with probability p = 1 − α with α = 8.451 · 10⁻⁴. We also calculated the KS statistics for the curves in the main body of the text, summarized in Table S1.
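Inverting (S14) is equally direct; the helper below assumes the two-sided form c(α) = √(−0.5 log(α/2)) quoted above, with natural logarithms.

    import numpy as np

    def ks_significance(ks, n, m):
        """Significance level alpha at which H0 can be rejected, obtained by
        inverting Eq. (S14) with c(alpha) = sqrt(-0.5 * log(alpha / 2))."""
        c = ks * np.sqrt(n * m / (n + m))
        return 2.0 * np.exp(-2.0 * c ** 2)

    alpha = ks_significance(0.392, 23, 55)  # significance for the eCDFs of Fig. S4b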
[FIG. S5 graphic: the 19-qubit lattice (qubits 0–2 and 4–19) with edges weighted by the pairwise overlaps of the distributions, with weights between 0.18 and 0.81.]
FIG. S5. Pictorial representation of a set of compact distributions in R². Calculating the mutual Bhattacharyya coefficients between the distributions leads to a graph that maps to the Rigetti 19Q architecture with low overhead. Applying the proposed clustering algorithm will find assignments for sets of the least overlapping distributions. For more details see the body of the appendix.
To motivate the clustering application we look at Fig. S5. It shows 19 distributions with compact support in R² with finite overlaps. We can define a similarity metric between these distributions using the Bhattacharyya coefficient [43, 54, 55]

b : D(R²) × D(R²) → R,    b(p, q) = ∫ √(p(x) q(x)) d²x,    (S15)

where we use D(R²) to denote the space of compact-support distributions over R². Since the Bhattacharyya coefficient is not a typical distance metric (it does not fulfill the triangle inequality), we use the following procedure: using the example of Fig. S5 we identify each individual distribution with a given qubit and calculate the overlap metric with all other distributions. With this we can construct a graph G = (V, E) where the vertices correspond to identifiers of the individual distributions and the edges to overlaps between these distributions, with the weights given by the Bhattacharyya coefficient. In the case of Fig. S5 this leads to a graph that has a low overhead when mapped to the Rigetti 19Q connectivity and enables us to run the clustering algorithm on the quantum hardware. To make this translation we need to remove the self-similarity of the distributions, corresponding to self-loops in the graph. It should be noted that clustering in this context means identifying sets of distributions that are as dissimilar as possible, i.e. have as little overlap as possible.
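A discretized version of Eq. (S15), together with the induced weight matrix, is compact to write down; the sketch below assumes all distributions are tabulated on a common grid of known cell area.

    import numpy as np

    def bhattacharyya(p, q, cell_area):
        """Discretized Bhattacharyya coefficient of Eq. (S15) for two
        distributions tabulated on a common grid in R^2."""
        return float(np.sum(np.sqrt(p * q)) * cell_area)

    def overlap_graph(dists, cell_area):
        """Pairwise Bhattacharyya weight matrix; the zeroed diagonal removes
        the self-similarity (the self-loops mentioned above)."""
        n = len(dists)
        w = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                w[i, j] = w[j, i] = bhattacharyya(dists[i], dists[j], cell_area)
        return w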
Tables S2 and S3 summarize the main performance parameters of Rigetti 19Q. Single-qubit gate fidelities are estimated with standard randomized benchmarking protocols [56] with 25 random Clifford gate sequences of lengths l ∈ {2, 4, 8, 16, 32, 64, 128}. Readout fidelity is given by the assignment fidelity FRO = [p(0|0) + p(1|1)]/2, where p(b|a) is the probability of measuring the qubit in state b when prepared in state a. Two-qubit gate fidelities are estimated with quantum process tomography [45] with preparation and measurement rotations {I, Rx(π/2), Ry(π/2), Rx(π)}. The reported process fidelity F2q indicates the average fidelity between the ideal process and the measured process, imposing complete positivity and trace preservation constraints. We further averaged over the extracted F2q from four separate tomography experiments. Qubit-qubit coupling strengths are extracted from Ramsey experiments with and without π-pulses on neighboring qubits.
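The assignment fidelity FRO above can be computed directly from single-shot measurement records; a minimal helper (ours, for illustration):

    import numpy as np

    def assignment_fidelity(shots_prep0, shots_prep1):
        """F_RO = [p(0|0) + p(1|1)] / 2 from arrays of measured bits taken
        with the qubit prepared in |0> and |1>, respectively."""
        p00 = 1.0 - np.mean(shots_prep0)
        p11 = np.mean(shots_prep1)
        return 0.5 * (p00 + p11)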
TABLE S2. Rigetti 19Q performance parameters — All of the parameters listed in this table have been measured at base temperature T ≈ 10 mK. The reported T1's and T2*'s are averaged values over 10 measurements acquired at ω01^max. The errors indicate the standard deviation of the averaged value. Note that these estimates fluctuate in time due to multiple factors.

qubit  ωr^max/2π (MHz)  ω01^max/2π (MHz)  η/2π (MHz)  T1 (µs)      T2* (µs)     F1q     FRO
0      5592             4386              −208        15.2 ± 2.5   7.2 ± 0.7    0.9815  0.938
1      5703             4292              −210        17.6 ± 1.7   7.7 ± 1.4    0.9907  0.958
2      5599             4221              −142        18.2 ± 1.1   10.8 ± 0.6   0.9813  0.970
3      5708             3829              −224        31.0 ± 2.6   16.8 ± 0.8   0.9908  0.886
4      5633             4372              −220        23.0 ± 0.5   5.2 ± 0.2    0.9887  0.953
5      5178             3690              −224        22.2 ± 2.1   11.1 ± 1.0   0.9645  0.965
6      5356             3809              −208        26.8 ± 2.5   26.8 ± 2.5   0.9905  0.840
7      5164             3531              −216        29.4 ± 3.8   13.0 ± 1.2   0.9916  0.925
8      5367             3707              −208        24.5 ± 2.8   13.8 ± 0.4   0.9869  0.947
9      5201             3690              −214        20.8 ± 6.2   11.1 ± 0.7   0.9934  0.927
10     5801             4595              −194        17.1 ± 1.2   10.6 ± 0.5   0.9916  0.942
11     5511             4275              −204        16.9 ± 2.0   4.9 ± 1.0    0.9901  0.900
12     5825             4600              −194        8.2 ± 0.9    10.9 ± 1.4   0.9902  0.942
13     5523             4434              −196        18.7 ± 2.0   12.7 ± 0.4   0.9933  0.921
14     5848             4552              −204        13.9 ± 2.2   9.4 ± 0.7    0.9916  0.947
15     5093             3733              −230        20.8 ± 3.1   7.3 ± 0.4    0.9852  0.970
16     5298             3854              −218        16.7 ± 1.2   7.5 ± 0.5    0.9906  0.948
17     5097             3574              −226        24.0 ± 4.2   8.4 ± 0.4    0.9895  0.921
18     5301             3877              −216        16.9 ± 2.9   12.9 ± 1.3   0.9496  0.930
19     5108             3574              −228        24.7 ± 2.8   9.8 ± 0.8    0.9942  0.930
TABLE S3. Rigetti 19Q two-qubit gate parameters and performance — These parameters refer to the two-qubit interactions of Rigetti 19Q. Qubit 3 is not tunable; for this reason parameters related to the pairs 3−8 and 3−9 are not included.

pair   A0 (Φ/Φ0)  fm (MHz)  tCZ (ns)  F2q
0−5    0.27       94.5      168       0.936
0−6    0.36       123.9     197       0.889
1−6    0.37       137.1     173       0.888
1−7    0.59       137.9     179       0.919
2−7    0.62       87.4      160       0.817
2−8    0.23       55.6      189       0.906
4−9    0.43       183.6     122       0.854
5−10   0.60       152.9     145       0.870
6−11   0.38       142.4     180       0.838
7−12   0.60       241.9     214       0.870
8−13   0.40       152.0     185       0.881
9−14   0.62       130.8     139       0.872
10−15  0.53       142.1     154       0.854
10−16  0.43       170.3     180       0.838
11−16  0.38       160.6     155       0.891
11−17  0.29       85.7      207       0.844
12−17  0.36       177.1     184       0.876
12−18  0.28       113.9     203       0.886
13−18  0.24       66.2      152       0.936
13−19  0.62       109.6     181       0.921
14−19  0.59       188.1     142       0.797