Before we can discuss the challenges of distributed simulation, we must first understand the simpler problem of local (i.e. non-distributed), shared-memory simulation. This paradigm already invites a myriad of high-performance computing considerations such as memory striding [16], hierarchical caching [17], branch prediction [18], vectorisation [19, 20], type-aware arithmetic [21] and inlining [22], and, on multiprocessor architectures, non-uniform memory access [23] and local parallelisation paradigms like multithreading [24], with incurred nuances such as cache-misses and false sharing [25]. We here forego an introduction to these topics, referring the interested reader to a discussion in quantum simulation settings in Sec 6.3 of Ref. [26]. Still, this manuscript's algorithms will be optimised under these considerations in order to establish a salient performance threshold. In this section we review and derive the basic principles and algorithms of a local full-state simulator.

FIG. 1. The memory costs to represent pure states (via statevectors) and mixed states (via density matrices), assuming that a single complex amplitude requires 16 B (like a C++ complex at double precision) and a process overhead of 100 KiB. [Plot: memory (KiB through PiB, logarithmic) against number of qubits (10 to 50), for the statevector, density matrix and overhead.]

Such simulators maintain a dense statevector |ψ⟩, typically instantiated as an array ψ of scalars αi ∈ C related by

    |ψ⟩N = ∑_i^{2^N} αi |i⟩N   ↔   ψ = {α0, …, α_{2^N − 1}}.   (7)

The scalars can adopt any complex numerical implementation such as Cartesian or polar, or a reportedly well-performing adaptive polar encoding [27], to which the algorithms and algebra in this manuscript are agnostic.

Naturally, a classical simulator can store an unnormalised array of complex scalars and ergo represent non-physical states satisfying

    ∑_i^{2^N} |αi|² ≠ 1.   (8)

This proves a useful facility in the simulation of variational quantum algorithms [28, 29] and the representation of density matrices presented later in this manuscript.

By storing all of a statevector's amplitudes, a full-state simulator maintains a complete description of a quantum state and permits precise a posteriori calculation of any state properties such as probabilities of measurement outcomes and observable expectation values. It also means that the memory and runtime costs of simulation are roughly homogeneous across circuits of equal size, independent of state properties like entanglement and unitarity, making resource prediction trivial. For these reasons, full-state simulators are a natural first choice in much of quantum computing research. Their drawback is that representing an N-qubit pure state requires simultaneous storage of 2^N complex scalars (an exponentially growing memory cost), and simulating an n-qubit general operator acting upon the state requires O(2^{N+n}) floating-point operations (an exponentially growing time cost). We illustrate these memory costs (and the equivalent for a dense N-qubit mixed state) in Figure 1. Thankfully, operators modify the statevector in simple, regular ways, admitting algorithms which can incorporate many high-performance computing techniques, as this section presents.

A. Local costs

We will measure the performance of local algorithms in this manuscript via the following metrics:

• [bops] The number of prescribed basic operations or "bops", such as bitwise, arithmetic, logical and indexing operations. Performing a bop in the arithmetic logic unit (ALU) of a modern CPU can be orders of magnitude faster than the other primitive operations listed below, but their accounting remains relevant to data-dominated HPC [30] like quantum simulation. While bitwise operations (like those of Alg. 1) and integer arithmetic are both classified as bops, the former is significantly faster [31] and is preferred in tight loops. Our example code will be optimised thusly.
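This manuscript's pseudocode repeatedly invokes the bit-twiddling functions of Alg. 1, which is not reproduced in this excerpt. The following minimal Python sketch captures the semantics we assume from their call-sites; the paper's own definitions may differ in detail.

    def getBit(n, t):
        # value (0 or 1) of the t-th bit of integer n
        return (n >> t) & 1

    def flipBit(n, t):
        # n with its t-th bit negated
        return n ^ (1 << t)

    def insertBit(n, t, b):
        # insert bit b at position t, shifting the higher bits left
        return ((n >> t) << (t + 1)) | (b << t) | (n & ((1 << t) - 1))

    def insertBits(n, ts, b):
        # insert bit b at each position in ts (ascending order is essential)
        for t in sorted(ts):
            n = insertBit(n, t, b)
        return n

    def getBitMask(ts):
        # integer with 1-bits at every position in ts
        return sum(1 << t for t in ts)

    def allBitsAreOne(n, ts):
        # whether every bit of n indexed by ts is 1
        return all(getBit(n, t) == 1 for t in ts)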
we can express the action of the operator as

    M̂t |ψ⟩ = ∑_i^{2^N} αi |i[N−1]⟩₁ … M̂|i[t]⟩₁ … |i[0]⟩₁.   (13)

This form determines the access pattern, which we visualise in Fig. 2. There however remain many ways to implement the amplitude iteration, but few which respect the previous section's HPC considerations. We will now directly derive a specific implementation which does.

[FIG. 2: the memory access patterns of the one-target gate, shown for targets M̂0, M̂1, M̂2 and M̂3 over the amplitude array α0, α1, … .]

Decomposing the state as

    |ψ⟩N ≡ ∑_j ∑_k ( βkj |k⟩_{N−t−1} |0⟩₁ |j⟩t + γkj |k⟩_{N−t−1} |1⟩₁ |j⟩t ),   (20)

where βkj, γkj ∈ C together form the 2^N amplitudes
of |ψ⟩. Precisely, βkj ≡ αi ⟹ γkj = αi¬t. The middle ket, equivalent to |i[t]⟩₁, is that targeted by M̂t. Eq. 19 prescribes

    M̂t |ψ⟩N ≡ ∑_j^{2^t} ∑_k^{2^{N−t−1}} (βkj m00 + γkj m01) |k⟩|0⟩|j⟩ + (βkj m10 + γkj m11) |k⟩|1⟩|j⟩.   (21)

This form suggests a bitwise and trivially parallelisable local strategy to branchlessly locate the paired amplitudes among the 2^N of an N-qubit statevector's array, and modify them in-place, which we present in Alg. 2. It makes use of several bit-twiddling functions defined in Alg. 1.

Algorithm 2: [local][statevector]
One-target gate M̂t applied to an N-qubit pure statevector |ψ⟩.
[O(2^N) bops][O(2^N) flops]
[O(1) memory][2^N writes]

 1  local_oneTargGate(ψ, M, t):
 2      N = log2( len(ψ) )
        // loop every |n⟩N = |0⟩₁ |k⟩_{N−t−1} |j⟩t
        # multithread
 3      for n in range(0, 2^N / 2):
            // |iβ⟩N = |k⟩_{N−t−1} |0⟩₁ |j⟩t
 4          iβ = insertBit(n, t, 0)    // Alg. 1
            // |iγ⟩N = |k⟩_{N−t−1} |1⟩₁ |j⟩t
 5          iγ = flipBit(iβ, t)        // Alg. 1
 6          β = ψ[iβ]
 7          γ = ψ[iγ]
            // modify the paired amplitudes
 8          ψ[iβ] = m00 β + m01 γ
 9          ψ[iγ] = m10 β + m11 γ

We finally remark on local parallelisation. Alg. 2 features an exponentially large loop wherein each iteration modifies a unique subset of amplitudes; they can ergo be performed concurrently via multithreading. All variables defined within the loop become thread-private, while the statevector is shared between threads and is simultaneously modifiable only when threads write to separate cache-lines. Otherwise, false-sharing may degrade performance, in the worst case to that of serial array access [25]. We hence endeavour to allocate threads to perform specific iterations which modify disjoint cache-lines. We could uniformly divide the iterations between threads, where conveniently all sizes are expected to be powers of 2 and assigned iterations are contiguous, so that threads write to disjoint cache-lines whenever each thread's share is a multiple of the cache-line size in amplitudes, S. At double precision with a typical 64 byte cache-line, S = 4 [33]. However, we can further avoid some cache-misses incurred by a single thread (due to fetching both β and γ amplitudes) across its assigned iterations, by setting

    (iterations per thread) = 2^{t+1},   (23)

when this exceeds S. Beware that when t is near its maximum of N, such an allocation may non-uniformly divide the work between threads and wastefully leave threads idle. The ideal schedule is ergo the maximum multiple of the cache-line size (in amplitudes) which is less than or equal to the uniform division of total iterations between threads. That is,

    (iterations per thread) = S ⌊ (num iterations) / (S (num threads)) ⌋,   (24)

where (num threads) is the maximum concurrently supported by the executing hardware, and is not necessarily a power of 2. Notice however that in Alg. 2, (num iterations) = 2^N/2 is a runtime parameter; alas, thread allocation must often be specified at compile-time, such as it is when using OpenMP [34]. In such settings, the largest division of contiguous iterations between threads can be allocated, incurring O((iterations per thread)) false shares. This is the default static schedule in OpenMP, specified with the C precompiler directive

    #pragma omp for

This is the multithreaded configuration assumed for all algorithms in this manuscript when indicated by the directive

    # multithread
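The pseudocode of Alg. 2 renders directly into runnable, if illustrative, Python; the bit helpers are inlined so the snippet is self-contained. A performant implementation would instead be compiled C/C++ with OpenMP, as the text describes.

    import numpy as np

    def local_oneTargGate(psi, M, t):
        # psi: numpy array of 2^N complex amplitudes, modified in-place
        # M: 2x2 complex matrix; t: target qubit
        N = int(np.log2(len(psi)))
        for n in range(2**N // 2):
            ib = ((n >> t) << (t + 1)) | (n & ((1 << t) - 1))  # insertBit(n, t, 0)
            ig = ib | (1 << t)                                 # flipBit(ib, t)
            b, g = psi[ib], psi[ig]
            psi[ib] = M[0, 0]*b + M[0, 1]*g
            psi[ig] = M[1, 0]*b + M[1, 1]*g

    # e.g. a Hadamard on qubit 0 of the 2-qubit zero state:
    psi = np.array([1, 0, 0, 0], dtype=complex)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    local_oneTargGate(psi, H, 0)   # psi is now approx [0.707, 0.707, 0, 0]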
C. Many-control one-target gate

The many-control one-target gate introduces one or more control qubits to the previous section's operator. This creates a simple entangling gate; yet even the one-control one-target gate is easily made universal [35, 36], and it appears as an elementary gate in almost every non-trivial quantum circuit [37]. Many-control gates can be challenging to perform experimentally and are traditionally decomposed into a series of one-control gates. But in classical simulation, many-control gates are just as easy to effect directly, and actually prescribe fewer amplitude modifications and floating-point operations than their non-controlled counterparts. We here derive a local algorithm to in-place simulate the one-target gate with s control qubits upon an N-qubit pure state in O(s 2^{N−s}) bops, O(2^{N−s}) flops, 2^{N−s} memory writes, and an O(1) memory overhead.

We seek to apply Cc(M̂t) upon an arbitrary N-qubit pure state |ψ⟩, where c = {c0, …, c_{s−1}} is an arbitrarily ordered list of s unique control qubits, and operator M̂ targets qubit t ∉ c and is described by a 2 × 2 general complex matrix M (as in the previous section). Let us temporarily assume that the target qubit t is the least significant and rightmost (t = 0), and the next s contiguous qubits are controlled upon. Such a gate is described by the matrix

    C_{1,…,s}(M̂0) ≃ diag(1, …, 1, M)  ∈ C^{2^{s+1} × 2^{s+1}}   (25)
                  = 1^{⊗(s+1)} + (|1⟩⟨1|)^{⊗s} ⊗ (M − 1),   (26)

i.e. the identity matrix with its final 2 × 2 diagonal block replaced by M = (m00 m01; m10 m11). This form prescribes identity (no modification) upon every computational basis state except those for which all control qubits are in state |1⟩₁; for those states, apply M̂0 as if it were non-controlled, like in the previous section. There are only a fraction 1/2^s such states, due to the 2^s unique binary assignments of the s control qubits. This prescription is unchanged when the control and target qubits are reordered or the number of controls is varied, as intuited by swapping rows of Eq. 25 or inserting additional identities into Eq. 26.

Therefore, a general Cc(M̂t) gate modifies only 2^{N−s} amplitudes αi of the 2^N in the statevector array ψ, which satisfy

    i[cn] = 1,  ∀ cn ∈ c,   (27)

and does so under the action of the non-controlled M̂t gate. We illustrate the resulting memory access pattern in Fig. 3.

FIG. 3. The memory access pattern of Alg. 3's local simulation of the many-control one-target gate Cc(M̂t). [Panels: M̂1, C0(M̂1), C2(M̂1) and C0,2(M̂1) over amplitudes α0, …, α15.] Amplitudes in grey fail the control condition of Eq. 27 and are not modified nor accessed.

Ascertaining which amplitudes satisfy (what we dub) the "control condition" of Eq. 27 must be done efficiently, since it may otherwise induce an overhead in every iteration of the exponentially large loop of the non-controlled Alg. 2. We again leverage HPC techniques to produce a cache-efficient, branchless, vectorisable, bitwise procedure, in a derivation which hides all such nuances from the reader. We seek to iterate only the 2^{N−s} amplitudes satisfying the control condition, which have indices i with fixed bits (value 1) at c. To do so, we enumerate all (N − s)-bit integers j ∈ {0, …, 2^{N−s} − 1}, corresponding to (N − s)-qubit basis states

    |j⟩_{N−s} ≡ |j[N−s−1]⟩₁ … |j[1]⟩₁ |j[0]⟩₁.   (28)
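The excerpt omits the resulting Alg. 3 itself, but its shape follows from this derivation: enumerate j, then insert fixed bits at the control and target indices to recover the satisfying amplitude pairs. A hedged Python sketch, with illustrative names:

    import numpy as np

    def insertBit(n, t, b):
        return ((n >> t) << (t + 1)) | (b << t) | (n & ((1 << t) - 1))

    def local_manyCtrlOneTargGate(psi, ctrls, M, t):
        # enumerate only the 2^(N-s-1) pairs satisfying the control condition,
        # inserting 1-bits at the controls and the pair bit 0 at target t
        N = int(np.log2(len(psi)))
        s = len(ctrls)
        for n in range(2**(N - s - 1)):
            ib = n
            for q in sorted(list(ctrls) + [t]):   # ascending insertion order
                ib = insertBit(ib, q, 1 if q in ctrls else 0)
            ig = ib | (1 << t)                    # flip target bit to 1
            b, g = psi[ib], psi[ig]
            psi[ib] = M[0, 0]*b + M[0, 1]*g
            psi[ig] = M[1, 0]*b + M[1, 1]*g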
[FIG. 4: the memory access patterns of the many-target gate, shown for targets M̂0,1, M̂0,2, M̂0,1,2 and M̂0,1,3 over amplitudes α0, α1, … .]

The simulation is memory bound. That is, the fetching and modification of amplitudes from heap memory dominates the runtime, occluding the time of the bitwise and indexing algebra. We should expect that targeting lower-index rightmost qubits will see better caching performance than targeting high-index leftmost qubits, because the linearly combining amplitudes in the former scenario lie closer together and potentially within the same cache-lines.

Algorithm 4: [local][statevector]
Many-target gate M̂t with n unique target qubits t = {t0, …, t_{n−1}}, described by matrix M = {mij} ∈ C^{2^n × 2^n}, applied to an N-qubit pure statevector |ψ⟩.

If we wished to make the memory addresses accessed by the inner j loops of Alg. 4 be strictly increasing, we would simply initially permute the columns of M as per the ordering of t to produce matrix M′, then sort t. That is, we would leverage that

    M̂t ≡ M̂′_{sorted(t)}.   (36)

Doing so also permits array sorted(t) to be replaced with a bitmask, though we caution optimisation of the indexing will not appreciably affect the memory-dominated performance.

Finally, we caution that multithreaded deployment of this algorithm requires that each simultaneous thread maintains a private 2^n-length array (v in Alg. 4), to collect and copy the amplitudes modified by a single iteration. This scales up the temporary memory costs by factor O(num threads).

Algorithms 2, 3 and 4 presented serial methods for simulating the single-target, many-control and many-target gates respectively. To be performant, they made economical accesses to the state array ψ. Distributed simulation of these gates however will require explicit synchronisation and network communication between nodes to even access amplitudes within another node's statevector partition, which we refer to as the "sub-statevector". We adopt the message passing interface (MPI) [43], ubiquitous in parallel computing. Exchanging amplitudes through messages requires the use of a communication buffer to receive and process network-received amplitudes before local modification of a state. For the remainder of this manuscript, we employ fixed labels:

    ψ := An individual node's sub-statevector.
    φ := An individual node's communication buffer.

We identify nodes by their rank r ∈ {0, …, W − 1}. Each node's partition represents an unnormalised substate |ψr⟩_{N−w} of the full quantum state represented by the ensemble

    |Ψ⟩N ≡ ∑_r^{2^w} |r⟩w |ψr⟩_{N−w}.   (38)

In other words, node r contains the global amplitudes αi of |Ψ⟩N with indices satisfying i = r 2^{N−w} + j, for local indices j ∈ {0, …, 2^{N−w} − 1}.
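This global-to-local correspondence is just a bit split: the upper w bits of i select the node, the lower N − w bits index within its array. A small illustrative sketch:

    def globalToLocal(i, N, w):
        rank = i >> (N - w)           # upper w bits select the node
        j = i & ((1 << (N - w)) - 1)  # lower N-w bits index within psi
        return rank, j

    def localToGlobal(rank, j, N, w):
        return (rank << (N - w)) | j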
Typical simulators fix len(φ) = O(Λ). A contrary choice of len(φ) = O(1) necessitates exponentially many communication overheads [46]. Distributed simulators like Intel's IQS [45] opt for len(φ) = Λ/2, which requires that typical gate simulations perform two rounds of amplitude exchange, and may enable the second communication to overlap local state processing for a minor speedup. It also enables the simulation of a single additional qubit on typical node configurations. In contrast, simulators like Oxford's QuEST [9] fix len(φ) = Λ. This restricts simulation to one fewer qubit, though induces no appreciable slowdown due to the dominating network speeds. See References [9, 26] for a more detailed comparison of these strategies.

In this work, we fix the buffer to be of the same size as the sub-statevector, i.e.

    len(φ) := Λ.   (43)

This enables the simulation of many advanced operators precluded by a smaller buffer, though means the total memory cost (in bytes) of distributed simulation of an N-qubit pure state is 2 b 2^N, where b is the number of bytes to represent a single complex amplitude (b = 16 at double precision in the C language). This is double the serial costs shown in Fig. 1.

B. Communication costs

Quantities a, b, f are given per-node because nodes perform these operations concurrently. Quantities c, d, e are given as totals aggregated between all nodes. It will be easy to distinguish these by whether the measures include variable Λ (per-node) or N (global).

C. Communication patterns

This manuscript will present algorithms with drastically differing communication patterns, by which we refer to the graph of which nodes exchange messages with one another. The metrics of the previous section are useful for comparing two given distributed algorithms, but they cannot alone predict their absolute runtime performance. A distributed application's ultimate runtime depends on its emerging communication pattern and the underlying hardware network topology [47].

Fortunately, our algorithms share a useful property which simplifies their costs. We will see that nearly all simulated quantum operators in this manuscript yield pairwise communication, whereby a node exchanges amplitudes with a single other unique node. To be precise, if rank r sends amplitudes to r′, then it is the only rank to do so, and so too does it receive amplitudes only from r′. We visualise this below, where a circle indicates one of four nodes and an arrow indicates one of four total passed messages. [Diagram: four nodes joined pairwise by arrows.]
FIG. 5. The five paradigms of pairwise amplitude exchange used in this manuscript. Arrays ψ and φ are a node’s
sub-statevector and communication buffer respectively, and ψ ′ and φ′ are those of a paired node. During a round of
communication, all or some of the nodes will perform one of the below paradigms, with the remaining nodes idle.
a) Nodes send their full sub-statevector to directly overwrite their pair node’s communication buffer. This permits
local modification of received amplitudes in the buffer before integration into the sub-statevector.
b) Nodes pack a subset of their sub-statevector into their buffer before exchanging (a subset of) their buffers. This
reduces communication costs from a) when not all pair node amplitudes inform the new local amplitudes. Notice
that since the buffers cannot be directly swapped, amplitudes are sent to the pair node’s empty offset buffer.
c) One node of the pair packs and sends its buffer, while the receiving pair node sends nothing.
d) Nodes intend to directly swap the entirety of their sub-statevector, though must do so via a).
e) Nodes intend to directly swap distinct subsets of their sub-statevectors, though must do so via b).
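The exchange primitive underlying these paradigms (the paper's Alg. 5) is a simultaneous pairwise send and receive. A hedged sketch using mpi4py (an assumed binding; the paper targets C/C++ MPI), where MPI's Sendrecv avoids the deadlock of two blocking sends:

    from mpi4py import MPI

    def exchangeArrays(psi, phi, pairRank):
        # send our array psi to the pair node while receiving theirs into phi
        comm = MPI.COMM_WORLD
        comm.Sendrecv(sendbuf=psi, dest=pairRank, recvbuf=phi, source=pairRank)

    def getRank():
        return MPI.COMM_WORLD.Get_rank()

    def getWorldSize():
        return MPI.COMM_WORLD.Get_size()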
IV. DISTRIBUTED STATEVECTOR ALGORITHMS

This section will derive six novel distributed algorithms to simulate many-control gates, SWAP gates, many-target gates, tensors of Pauli operators, phase gadgets and Pauli gadgets. As a means of review, we begin however by deriving an existing distributed simulation technique of the one-target gate, generalising the local algorithm of Sec. II B.

A. One-target gate

Let us first revisit the one-target gate, the staple of quantum circuits, for which we derived a local simulation algorithm in Sec. II B. Our distributed simulation of this gate will inform the target performance of all other algorithms in this manuscript. We here derive a distributed in-place simulation of the one-target gate upon an N-qubit pure statevector distributed between W = 2^w nodes, with Λ = 2^{N−w} amplitudes per-node. Our algorithm prescribes O(Λ) bops and flops per-node, exactly Λ writes per-node, at most a single round of message passing in which case 2^N total amplitudes are exchanged over the network, and an O(1) memory overhead.

Let M̂t denote a general one-target gate upon qubit t ≥ 0, described by a complex matrix

    M = ( m00  m01 ; m10  m11 ) ∈ C^{2×2}.   (44)

We will later see bespoke distributed methods for faster simulation of certain families of one-target and separable gates, but we will here assume M is completely general and unconstrained. We seek to apply M̂t upon an arbitrary N-qubit pure state |Ψ⟩N which is distributed between W = 2^w nodes, each storing sub-statevector ψ of size Λ = 2^{N−w}, and an equal-sized communication buffer φ.

We showed in Eq. 19 that M̂t modifies an amplitude αi of |Ψ⟩N to become a linear combination of αi and αi¬t. Ergo to modify αi, we must determine within which node the paired αi¬t is stored. Recall that the j-th local amplitude stored within node r ≥ 0 corresponds to global amplitude αi satisfying

    |i⟩N ≡ |i[N−1]⟩₁ … |i[t]⟩₁ … |i[0]⟩₁   (45)
         ≡ |r⟩w |j⟩_{N−w}.   (46)

The paired amplitude αi¬t corresponds to basis state

    |i¬t⟩N ≡ |i[N−1]⟩₁ … |!i[t]⟩₁ … |i[0]⟩₁   (47)
           = |r′⟩w |j′⟩_{N−w}.   (48)

Flipping bit t of integer i must modify either the w-bit prefix or the (N − w)-bit suffix of i's bit sequence, and ergo modify either j or r. Two scenarios emerge:

1. When t < N − w, then

    r′ = r,  j′ = j¬t.   (49)

All paired amplitudes αi¬t are stored within the same rank r as αi, and so every node already contains all amplitudes which will determine its new values. No communication is necessary, and we say the method is "embarrassingly parallel". Furthermore, since

    M̂_{t<N−w} |r⟩w |ψ⟩_{N−w} = |r⟩w M̂t |ψ⟩_{N−w},   (50)

we can modify a node's local sub-statevector ψ in an identical manner to the local simulation of M̂t |ψ⟩. We simply invoke Alg. 2 upon ψ on every node.

2. When t ≥ N − w, then

    r′ = r¬(t−(N−w)),  j′ = j.   (51)

Every paired amplitude to those in node r is stored in a single other node r′. We call this the "pair node". Ergo each node must send its full sub-statevector ψ to the buffer φ′ of its pair node. This is the paradigm seen in Fig. 5 a). Thereafter, because j′ = j above, each local amplitude ψ[j] will linearly combine with the received amplitude φ[j], weighted as per Eq. 19.

The memory access and communication patterns of these scenarios are illustrated in Fig. 6. In general, we cannot know in advance which of scenarios 1. or 2. will be invoked during distributed simulation, because N, w and t are all user-controlled parameters. So our algorithm to simulate M̂t must incorporate both. We formalise this scheme in Alg. 6.

Let us compare the costs of local vs distributed simulation of the one-target gate (i.e. Alg 2 against Alg 6). The former prescribed a total of O(2^N) bops and flops to be serially performed by a single machine. The latter exploits the parallelisation of W distributed nodes and involves only O(Λ) = O(2^N/W) bops and flops per machine (a uniform load), suggesting an O(W) speedup. However, when the upper w qubits are targeted, inter-node communication was required and all O(2^N) amplitudes
were simultaneously exchanged over the network in a pairwise fashion. While the relative magnitude of this overhead depends on many physical factors (like network throughput, CPU speeds, cache throughput) and virtual parameters (the number of qubits, the prescribed memory access pattern), distributed simulation of this kind in realistic regimes shows excellent weak scaling and the network overhead is tractable [9]. This means that if we introduce an additional qubit (doubling the serial costs) while doubling the number of distributed nodes, our total runtime should be approximately unchanged. As such, our one-target gate distributed simulation sets a salient performance threshold. We endeavour to maintain this weak scaling in the other algorithms of this manuscript.

[FIG. 6: the memory access and communication patterns of Alg. 6's distributed one-target gate, shown for M̂0, M̂1 and M̂2 over amplitudes α0, …, α13 and rings of 8 communicating nodes.]

Algorithm 6: [distributed][statevector]
One-target gate M̂t, where M = (m00 m01; m10 m11), upon an N-qubit pure statevector distributed between 2^w nodes as local ψ (buffer φ).
[O(Λ) bops][O(Λ) flops][0 or 1 exchanges]
[O(2^N) exchanged][O(1) memory][Λ writes]

 1  distrib_oneTargGate(ψ, φ, M, t):
 2      r = getRank()                    // Alg. 5
 3      w = log2(getWorldSize())         // Alg. 5
 4      Λ = len(ψ)
 5      N = log2(Λ) + w
        // embarrassingly parallel
 6      if t < N − w:
 7          local_oneTargGate(ψ, M, t)   // Alg. 2
        // full sub-state exchange is necessary
 8      else:
            // exchange with r′
 9          q = t − (N − w)
10          r′ = flipBit(r, q)           // Alg. 1
11          exchangeArrays(ψ, φ, r′)     // Alg. 5
            // determine row of M
12          b = getBit(r, q)             // Alg. 1
            // modify local amplitudes
            # multithread
13          for j in range(0, Λ):
14              ψ[j] = m_{b,b} ψ[j] + m_{b,!b} φ[j]
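Assembling the earlier sketches, a hedged end-to-end rendering of Alg. 6 in Python with mpi4py and numpy might look as follows; local_oneTargGate is the example from Sec. II B, and all other names are illustrative. The final vectorised line replaces the multithreaded loop of the pseudocode.

    import numpy as np
    from mpi4py import MPI

    def distrib_oneTargGate(psi, phi, M, t):
        # psi: local sub-statevector (length lam = 2^(N-w)); phi: equal buffer
        comm = MPI.COMM_WORLD
        r = comm.Get_rank()
        w = int(np.log2(comm.Get_size()))
        N = int(np.log2(len(psi))) + w
        if t < N - w:
            local_oneTargGate(psi, M, t)   # embarrassingly parallel (Alg. 2)
        else:
            q = t - (N - w)
            pair = r ^ (1 << q)            # flipBit(r, q): the pair node
            comm.Sendrecv(psi, dest=pair, recvbuf=phi, source=pair)
            b = (r >> q) & 1               # which row of M this node applies
            psi[:] = M[b, b]*psi + M[b, 1 - b]*phi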
B. Many-control one-target gate

Introducing control qubits to the previous one-target gate empowers it to become entangling and universal [35, 36], which for its relative simplicity establishes it as the entangling primitive of many quantum algorithms [37]. Section II C derived a local, serial algorithm to effect the many-control one-target gate, which we now adapt for distributed simulation. We derive an in-place distributed simulation of the one-target gate with s control qubits upon an N-qubit pure state. Our method prescribes as few as O(sΛ/2^s) (at most O(Λ)) bops per node, O(Λ/2^s) (at most O(Λ)) flops and writes per node, at most a single round of communication whereby a total of O(2^N/2^s) amplitudes are exchanged, and a fixed memory overhead.

We consider the operator Cc(M̂t) where c = {c0, …, c_{s−1}} is an arbitrarily ordered list of s unique control qubits, t ∉ c is the target qubit, and where M̂ is described by matrix M ∈ C^{2×2} (as in the previous section). We seek to apply Cc(M̂t) upon an arbitrary N-qubit pure state |Ψ⟩N which is distributed between W = 2^w nodes, each with sub-statevector ψ of size Λ = 2^{N−w} and an equal-sized communication buffer φ.

Section II C established that Cc(M̂t) modifies only the 2^{N−s} global amplitudes αi of full-state |Ψ⟩N which satisfy the control condition of Eq. 27, doing so under the non-controlled action of M̂t. Since all amplitudes failing the control condition are neither modified nor involved in the modification of other amplitudes, they need not be communicated over the network in any circumstance. This already determines the communication pattern, which we illustrate (as a memory access diagram) in Figure 7.

FIG. 7. The memory access pattern of Alg. 7's distributed simulation of the many-control one-target gate Cc(M̂t). Each column shows 16 amplitudes distributed between 4 nodes. If a node contains only grey amplitudes which fail the control condition, it need not perform any communication.

Explicitly deriving this pattern is non-trivial. We again invoke that each local amplitude index j in node r is equivalent to a global index i satisfying

    |i⟩N ≡ |i[N−1]⟩₁ … |i[t]⟩₁ … |i[0]⟩₁   (52)
         ≡ |r⟩w |j⟩_{N−w}.   (53)

We call |r⟩w the prefix substate and |j⟩_{N−w} the suffix. As per Eq. 49, when t < N − w, simulation is embarrassingly parallel regardless of c. But when t ≥ N − w, three distinct scenarios emerge.

1. When all controls lie within the prefix, i.e.

    cn ≥ N − w,  ∀ cn ∈ c.   (54)

The control condition is determined entirely by a node's rank r, so that every or none of the amplitudes therein satisfy it. Ergo, some nodes simulate M̂t as normal while others do nothing; this is seen in column 4 of Fig. 7. Because only a fraction 1/2^s of nodes satisfy r[cn−(N−w)] = 1 ∀ cn ∈ c, the number of communicating nodes, and hence the total data size communicated, exponentially shrinks with additional controls. We visualise this communication pattern below, where circles indicate nodes, grey symbols indicate their rank, arrows indicate a message, and black symbols indicate the number of amplitudes in each message. [Diagram: nodes r through r + 2^s − 1 exchanging messages of Λ amplitudes.]
2. When all controls lie within the suffix, i.e.

    cn < N − w,  ∀ cn ∈ c.   (55)

The control condition is independent of rank, and since every node then contains every assignment of bits c, all nodes contain amplitudes which satisfy the condition and need communicating. There are precisely Λ/2^s such amplitudes per node. This scenario is seen in column 2 of Fig. 7, and is illustrated below. [Diagram: all nodes 0 through W − 1 exchanging messages of Λ/2^s amplitudes.]

Because not all local amplitudes are communicated, it is prudent to pack only those which are into the local communication buffer before exchanging the packed buffer subset. This is paradigm b) of Fig. 5, and drastically reduces the total number of amplitudes transferred over the network by a factor 1/2^s.

3. When the controls are divided between the prefix and suffix, i.e.

    ∃ cn ≥ N − w and ck < N − w.   (56)

Only a fraction of ranks satisfy the control condition, as do only a fraction of the amplitudes therein; so some nodes exchange only some of their amplitudes. Let us distinguish between the controls acting upon the prefix and suffix substates:

    c(p) = {cn ∈ c : cn ≥ N − w},   (57)
    c(s) = {cn ∈ c : cn < N − w},   (58)
    s(p) = len(c(p)),  s(s) = len(c(s)).   (59)

A fraction 1/2^{s(p)} of nodes communicate, exchanging a fraction 1/2^{s(s)} of their local amplitudes. This is seen in column 3 of Fig. 7, and below. [Diagram: nodes r through r + 2^{s(p)} − 1 exchanging messages of Λ/2^{s(s)} amplitudes.]

Once again, the communicating nodes will pack the relevant subset of their local amplitudes into buffers before exchanging, as per Fig. 5 b).

In all these scenarios, the rank r′ of the pair node with which node r communicates (controls permitting) is determined by the target qubit t in the same manner as for the non-controlled gate M̂t. There are several additional observations to make when decomposing distributed simulation of Cc(M̂t) into subtasks, which we succinctly summarise in the comments of Alg. 7.

Algorithm 7: [distributed][statevector]
Many-control one-target gate Cc(M̂t) with s unique control qubits c = {c0, …, c_{s−1}} and target t ∉ c, described by matrix M = (m00 m01; m10 m11) ∈ C^{2×2}, applied to an N-qubit pure statevector distributed between 2^w nodes as local ψ (buffer φ).
[best O(sΛ/2^s), worst O(Λ) bops]
[best O(Λ/2^s), worst O(Λ) flops]
[0 or 1 exchanges][O(2^N/2^s) exchanged]
[O(1) memory][best Λ/2^s, worst Λ writes]

 1  distrib_manyCtrlOneTargGate(ψ, φ, c, M, t):
 2      r = getRank()                        // Alg. 5
 3      λ = log2(len(ψ))                     // = N − w
        // separate prefix and suffix controls
 4      c(p) = {q − λ : q ≥ λ, ∀ q ∈ c}
 5      c(s) = {q : q < λ, ∀ q ∈ c}
        // halt if r fails control condition
 6      if not allBitsAreOne(r, c(p)):       // Alg. 1
 7          return
        // update as Cc(M̂t)|Ψ⟩ = |r⟩ Cc(s)(M̂t)|ψ⟩
 8      if t < λ:
 9          local_manyCtrlOneTargGate(ψ, c(s), M, t)   // Alg. 3
        // exchange with r′ is necessary
10      if t ≥ λ:
            // all local αj satisfy condition,
            // so controls can be disregarded
11          if len(c(s)) == 0:
12              distrib_oneTargGate(ψ, φ, M, t)        // Alg. 6
            // only a subset of local αj satisfy,
            // determined only by suffix controls
13          else:
14              r′ = flipBit(r, t − λ)                 // Alg. 1
15              distrib_ctrlSub(ψ, φ, r′, c(s), M, t)  // Alg. 8

The performance of this algorithm varies drastically with the configuration of qubits c and t, but all costs are upper-bounded by those to simulate M̂t via Alg. 6. When t < N − w, the method is embarrassingly parallel. Otherwise, when none of the left-most w qubits are controlled, every node has an identical task of sending and locally modifying Λ/2^s amplitudes.
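The subroutine delegated to above (Alg. 8, not shown in this excerpt) packs, exchanges and combines only the satisfying amplitudes. A hedged mpi4py sketch of its shape, with illustrative names; both pair nodes enumerate the identical satisfying indices, so the packed orders agree.

    import numpy as np
    from mpi4py import MPI

    def distrib_ctrlSub(psi, phi, pairRank, suffixCtrls, M, t):
        comm = MPI.COMM_WORLD
        lam = int(np.log2(len(psi)))
        b = (comm.Get_rank() >> (t - lam)) & 1   # row of M this node applies
        # local indices whose suffix-control bits are all 1
        j = np.arange(len(psi))
        mask = np.all([(j >> c) & 1 == 1 for c in suffixCtrls], axis=0)
        n = mask.sum()
        phi[:n] = psi[mask]                      # pack (paradigm b of Fig. 5)
        recv = np.empty(n, dtype=psi.dtype)
        comm.Sendrecv(phi[:n], dest=pairRank, recvbuf=recv, source=pairRank)
        psi[mask] = M[b, b]*psi[mask] + M[b, 1 - b]*recv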
    SWAP_{t1,t2} |i⟩N = |z⟩_{N−t2−1} |i[t1]⟩₁ |y⟩_{t2−t1−1} |i[t2]⟩₁ |x⟩_{t1}
                      = |i⟩N,           if i[t1] = i[t2]
                        |i¬{t1,t2}⟩N,   if i[t1] ≠ i[t2].   (63)

This induces a change only when the principal bits differ; swapping them is ergo equivalent to flipping both. Therefore SWAP_{t1,t2} upon the full state |Ψ⟩N = ∑_i αi |i⟩N swaps a subset of amplitudes;

    αi → αi,           if i[t1] = i[t2]
         αi¬{t1,t2},   if i[t1] ≠ i[t2].   (64)

We call αi¬{t1,t2} = α(i¬t1)¬t2 the "pair amplitude". Recall that the j-th local amplitude stored within node r corresponds to global amplitude αi satisfying

    |i⟩N ≡ |r⟩w |j⟩λ,  where λ = N − w.   (65)

Three distinct scenarios emerge (assuming t2 > t1).

1. When t2 < λ, and consequently

    |i¬{t1,t2}⟩N = |r⟩w |j¬{t1,t2}⟩λ.   (66)

The pair amplitude is contained within the same node, and simulation is embarrassingly parallel.

2. When t1 ≥ λ, such that

    |i¬{t1,t2}⟩N = |r¬{t1−λ, t2−λ}⟩w |j⟩λ.   (67)

If r[t1−λ] ≠ r[t2−λ] (as satisfied by half of all nodes), then all local amplitudes in node r must be exchanged with their pair amplitudes within pair node r′ = r¬{t1−λ, t2−λ}. No modification is necessary, so the nodes directly swap their sub-statevectors ψ as per Fig. 5 d). The remaining nodes contain only amplitudes with global indices i satisfying i[t1] = i[t2], and so do nothing. In total 2^N/2 amplitudes are exchanged, in parallel batches of Λ. [Diagram: half of all nodes pairwise exchanging messages of Λ amplitudes.]

3. When t1 < λ and t2 ≥ λ, in which case

    |i¬{t1,t2}⟩N = |r¬(t2−λ)⟩w |j¬t1⟩λ.   (68)

Every node r must exchange amplitudes with pair node r′ = r¬(t2−λ), but only those amplitudes of local index j satisfying j[t1] ≠ r[t2−λ]; this is half of all local amplitudes. In this scenario, a total of 2^N/2 amplitudes are exchanged in parallel batches of Λ/2, and the node load is homogeneous. Pairs exchange via Fig. 5 e). [Diagram: all nodes 0 through W − 1 exchanging messages of Λ/2 amplitudes.] Notice that the destination local address j′ = j¬t1 differs from the source local address; we visualise this in Fig. 8. We expand upon the nuances of this scenario below.

FIG. 8. The amplitudes which require swapping in scenario 3. of Alg. 9's distributed simulation of SWAP_{t1,t2}. [Rows pair local indices |z 0 y 0 x⟩ ↔ |z 1 y 0 x⟩, |z 0 y 1 x⟩ ↔ |z 1 y 1 x⟩, etc., between the sub-statevectors of the two nodes.] Arrows (and colour) connect elements of the local sub-statevector ψ which must be swapped between nodes r and r′. This is the memory exchange pattern e) of Fig. 5.

Because scenario 3. requires that paired nodes exchange only half their local amplitudes, these amplitudes should first be packed into the communication buffers as per Fig. 5 b), like was performed for the multi-controlled gate in Sec. IV B. This means packing every second contiguous batch of 2^{t1} amplitudes before swapping the buffers, incurring local memory penalties but halving the communicated data. Note too that when t1 = λ − 1, packing is unnecessary; the first (or last) contiguous half of a node's sub-statevector can be directly sent to the pair node's buffer, although we exclude this optimisation from our pseudocode.
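The packing just described, selecting every second contiguous batch of 2^{t1} amplitudes, is a simple reshape in a hedged numpy sketch (illustrative names only):

    import numpy as np

    def packForSwap(psi, t1, rankBit):
        # pack the half of psi whose local bit j[t1] differs from the pair's
        # distinguishing rank bit, i.e. every second batch of 2^t1 amplitudes
        batches = psi.reshape(-1, 2, 2**t1)   # axis 1 is bit t1 of index j
        return batches[:, 1 - rankBit, :].ravel()

    # e.g. 16 local amplitudes, t1 = 2, rank bit 0: packs indices 4-7, 12-15
    print(packForSwap(np.arange(16.0), 2, 0))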
The local indices of amplitudes modified by the SWAP gate are determined through the same bit algebra used in the previous algorithms. We present the resulting scheme in Alg. 9, and its communication pattern in Fig. 9. This algorithm prescribes no floating-point operations, an exchange of (at most) half of all amplitudes, and at most a single round of pairwise communication. Despite being an experimentally fearsome two-qubit gate, we have shown the SWAP gate is substantially cheaper to classically simulate than the one-qubit gate of Alg. 6.

Algorithm 9: [distributed][statevector]
SWAP_{t1,t2} gate upon qubits t1 and t2 > t1 of an N-qubit pure state distributed between W = 2^w nodes as local arrays ψ with buffers φ.
[O(Λ) bops][0 flops][0 or 1 exchanges]
[2^N/2 exchanged][O(1) memory][Λ/2 writes]

 1  distrib_swapGate(ψ, φ, t1, t2):
 2      Λ = len(ψ)
 3      λ = log2(Λ)
        // embarrassingly parallel
 4      if t2 < λ:
            // loop |k⟩_{λ−2} ≡ |z⟩_{N−t2−1} |y⟩_{t2−t1−1} |x⟩_{t1}
            # multithread
 5          for k in range(0, Λ/4):
                // |jab⟩λ = |z⟩ |a⟩₁ |y⟩ |b⟩₁ |x⟩
 6              j11 = insertBits(k, {t1, t2}, 1)   // Alg. 1
 7              j10 = flipBit(j11, t1)             // Alg. 1
 8              j01 = flipBit(j11, t2)             // Alg. 1
 9              ψ[j01], ψ[j10] = ψ[j10], ψ[j01]    // swap

[FIG. 9: communication patterns of Alg. 9's distributed SWAP gate, shown for SWAP 0,4; SWAP 0,5; SWAP 0,6; SWAP 0,7; SWAP 4,5 and SWAP 4,6 upon 16 nodes.]

FIG. 11. The total effective communication pattern (left) and those of each decomposed step (right) of Alg. 10's distributed simulation of the 4-target gate M̂t upon the upper qubits t = {4, 5, 6, 7} of an 8-qubit statevector.
Algorithm 10: [distributed][statevector]
Many-target gate M̂t with n unique target qubits t, applied to an N-qubit pure statevector distributed between 2^w nodes as local ψ (buffer φ).
[O(2^n Λ) bops][O(2^n Λ) flops]
[best: 0, worst: 2 min(n, w) exchanges]
[worst: 2^N min(n, w) exchanged]
[O(2^n) memory][O(2^n Λ) writes]

 1  distrib_manyTargGate(ψ, φ, M, t):
 2      λ = log2(len(ψ))   // = N − w
        // need fewer targets than suffix qubits
 3      n = len(t)
 4      if n > λ:
 5          fail
        // locate lowest non-targeted qubit
 6      b = getBitMask(t)   // Alg. 1
 7      q = 0
 8      while getBit(b, q) == 1:   // Alg. 1
 9          q += 1
        // record which targets require swapping
10      t′ = {}
11      for i in range(0, n):
12          if t[i] < λ:
13              t′[i] = t[i]
14          else:
15              t′[i] = q
16              q += 1
17              while getBit(b, q) == 1:   // Alg. 1
18                  q += 1
        // perform necessary swaps
19      for i in range(0, n):
20          if t′[i] ≠ t[i]:
21              distrib_swapGate(ψ, φ, t′[i], t[i])   // Alg. 9
        // embarrassingly parallel
22      local_manyTargGate(ψ, M, t′)   // Alg. 4

E. Pauli tensor

The one-target Pauli operators X̂, Ŷ and Ẑ are core primitives of quantum information theory, and their controlled and many-target extensions appear extensively in experimental literature. For instance, Toffoli gates [62], fan-out gates [63], many-control many-target Ẑ gates [64] and others appear as natural primitive operators for Rydberg atom computers [65]. Many-control many-target X̂ gates appear in quantum arithmetic circuits [66]. But most significantly, Pauli tensors form a natural basis for Hermitian operators like Hamiltonians, and their efficient simulation would enable fast calculation of quantities like expectation values. Rapid, direct simulation of a Pauli tensor will also enable efficient simulation of more exotic operators like Pauli gadgets, as we will explore in Sec. IV G.

As always, let Λ be the size of each node's sub-statevector. In this section, we derive a distributed in-place simulation of the n-qubit Pauli tensor which prescribes O(n Λ) bops, Λ flops, Λ memory writes, no memory overhead and at most a single round of communication. This makes it more efficient than the one-target gate of Alg. 6, despite prescribing a factor O(n) more bops. Further, all amplitude modifications happen to be multiplication with unit scalars ±1 or ±i, which may enable optimisation on some systems and with some amplitude types.

We consider the separable n-qubit unitary

    Ût = ⊗_q^n σ̂^{(q)}_{tq}.   (76)

Alternatively, we could simulate each one-target operator in turn, utilising

    Ût |Ψ⟩N = σ̂^{(n−1)}_{t_{n−1}} ( … σ̂^{(1)}_{t_1} ( σ̂^{(0)}_{t_0} |Ψ⟩N ) … ),   (78)

and perform a total of n invocations of Alg. 6. This would cost n Λ flops and writes, and at most n rounds of communication. Still, a superior method can accelerate things by a factor n.

Neither of these naive methods leverage two useful properties of the Pauli matrices; that they have unit complex amplitudes (as do their Kronecker products), and that they are diagonal or anti-diagonal. These properties respectively enable simulation of Ût with no arbitrary floating-point operations (all instead become sign flips and complex component swaps) and in (at most) a single round of communication. We now derive such a scheme.

A single Pauli operator σ̂q maps an N-qubit basis state |i⟩N to

    X̂q |i⟩N = |i¬q⟩N,   (79)
    Ŷq |i⟩N = i (−1)^{i[q]} |i¬q⟩N.   (80)

The explicit action of the Pauli tensor is to modify the amplitudes under

    αi → β_{i¬t^{x,y}} α_{i¬t^{x,y}}   (under Ût),   (88)

where the complex unit βi is trivially bitwise evaluable, independent of any amplitude. Notice too that unlike the previously presented gates in this manuscript, Eq. 88 does not prescribe any superposition or linear combination of the amplitudes with one another; only the swapping of amplitudes and their multiplication with a complex unit.

We now derive the communication strategy to effect Eq. 88. Recall that the j-th local amplitude in node r < 2^w corresponds to global amplitude αi of basis state

    |i⟩N ≡ |r⟩w |j⟩λ,  λ = N − w.   (89)

Under Ût, this amplitude swaps with that of index

    i¬t^{x,y} ≡ |r¬t″⟩w |j¬t′⟩λ,   (90)

where t′ and t″ denote (respectively) the suffix and prefix qubits targeted by X̂ or Ŷ.
FIG. 13. Some communication patterns of Alg. 11's simulation of the Pauli tensor upon an 8-qubit statevector distributed between 16 nodes. [Panels: X4⊗X5⊗X6, X4⊗X5⊗X7, X4⊗X6⊗X7, X5⊗X6⊗X7, X4⊗X5⊗X6⊗X7, C7(X4⊗X5), C5(X4⊗X7), C4(X5⊗X6⊗X7), C4,7(X5⊗X6) and C4,5,7(X6).] Only qubits t ≥ 4 trigger communication, and X̂ and Ŷ targets do so identically, so only X̂ is demonstrated. This is incidentally the same pattern admitted by the Pauli gadget of Alg. 14 (Sec. IV G).

FIG. 14. The communication necessary to simulate a pair of Hadamard gates, H_{N−1} ⊗ H_{N−2}, on the uppermost qubits. While each Hadamard is pairwise simulable by Alg. 6, their direct tensor is not. This is unlike Pauli tensors, which are always pairwise simulable.

To clarify this property, consider the ordered three-bit sequences below, which share a colour if they are bitwise complements of one another.

    |0⟩₃ = 000
    |1⟩₃ = 001
    |2⟩₃ = 010
    |3⟩₃ = 011
    |∼3⟩₃ = 100
    |∼2⟩₃ = 101
    |∼1⟩₃ = 110
    |∼0⟩₃ = 111

When a subset t of all qubits are targeted, it holds that

    |i⟩N ≡ |k, j⟩N  ⟹  |i¬t⟩N = |k, 2^{len(t)} − j − 1⟩N,   (94)

where binary integer j spans the classical states of t. This means we can iterate the first 2^{len(t)}/2 of the len(t)-length bit-sequences (each corresponding to amplitude αj) and flip all bits in t′ to access the paired amplitude αj¬t′. The result is that all pairs of amplitudes (to be swapped) are iterated precisely once. We formalise this scheme in Alg. 11.

Our simulation of the n-target Pauli tensor is as efficient as the one-qubit gate of Alg. 6, and prescribes a factor n fewer communications than if each Pauli was simulated in turn, or as a dense matrix via Alg. 10. Yet, we have neglected several optimisations possible for specific input parameters. For instance, when all of the Paulis are Ẑ, the entire operator is diagonal and can be simulated in an embarrassingly parallel manner as a phase gadget, Ẑ^⊗ ≃ e^{i (π/2) Ẑ^⊗}, as described in the next section. Further, when none of the lowest (rightmost) N − w qubits are targeted by X̂ or Ŷ gates, then after amplitudes are exchanged, the local destination address j′ is equal to the original local index j. The sub-statevectors ψ are effectively directly swapped, as per the paradigm in Fig. 5 d), before local scaling (multiplying by factor βi). Bespoke iteration of this situation, avoiding the otherwise necessary bitwise arithmetic, can improve caching performance and auto-vectorisation.
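To make Eq. 88 concrete, here is a hedged numpy sketch of the Pauli tensor's local action (not the paper's Alg. 11): it evaluates the units βi bitwise and permutes amplitudes in a single vectorised pass. Names are illustrative.

    import numpy as np

    def applyPauliTensor(psi, paulis):
        # paulis: dict mapping target qubit -> 'X', 'Y' or 'Z'
        i = np.arange(len(psi))
        txy = sum(1 << q for q, s in paulis.items() if s in 'XY')  # bits to flip
        eta = 1j ** sum(s == 'Y' for s in paulis.values())
        parity = np.zeros(len(psi), dtype=np.int64)
        for q, s in paulis.items():
            if s in 'YZ':
                parity ^= (i >> q) & 1   # f(i): count of set Y/Z-targeted bits, mod 2
        beta = eta * (-1.0)**parity      # the units beta_i of Eq. 88
        psi[:] = beta[i ^ txy] * psi[i ^ txy]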
Note a potential optimisation is possible. The main loop, which evaluates the sign si = ±1 per-amplitude, can be divided into two loops which each iterate only the amplitudes a priori known to yield even (and odd) parity (respectively), pre-determining si. This may incur more caching costs from the twice iteration of the state array, but enables a compiler to trivially vectorise the loop's array modification.

G. Pauli gadget

We define the Pauli gadget as an n-qubit unitary operator with parameter θ ∈ R,

    Ût(θ) = exp( i θ ⊗_j^n σ̂^{(j)}_{tj} ),   (102)

comprised of Pauli operators σ̂^{(j)} ∈ {X̂, Ŷ, Ẑ} and acting upon target qubits t = {t0, …, t_{n−1}}. We seek to apply Ût(θ) upon an N-qubit statevector |Ψ⟩N distributed between 2^w nodes.

Once again, it is prudent to avoid constructing an exponentially large matrix description of Ût(θ) and simulating it through the multi-target gate of Alg. 10. A clever but ultimately unsatisfactory solution, inspired by our scheme to effect the phase gadget of the previous section, is to rotate the target qubits of the gadget into the eigenstates of the corresponding Pauli operator. For example,

    exp( i θ X̂0 Ŷ1 Ẑ2 ) ≡ R̂Y0(π/2) R̂X1(−π/2) × exp( i θ Ẑ0 Ẑ1 Ẑ2 ) × R̂Y0(−π/2) R̂X1(π/2).   (103)

The Pauli gadget modifies the i-th global amplitude of general state |Ψ⟩ = ∑_i αi |i⟩N under

    αi → cos(θ) αi + i sin(θ) β_{i¬t^{x,y}} α_{i¬t^{x,y}}   (under Ût(θ)),   (107)

where βi = (−1)^{f(i)} η ∈ {±1, ±i}, where t^{x,y} ⊆ t are the target qubits with corresponding X̂ or Ŷ operators, η = i^{len(t^y)} ∈ {±1, ±i}, and function f(i) = len({q ∈ t^{y,z} : i[q] = 1}) ∈ N counts the Ŷ- or Ẑ-targeted qubits in the |1⟩₁ state within |i⟩N. This is similar to the modification prescribed by the Pauli tensor as per Eq. 88, and features the same pair amplitude α_{i¬t^{x,y}} with factor β_{i¬t^{x,y}}, although now the new amplitude also depends on its old value.
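Eq. 107 likewise renders into a short hedged numpy sketch (a local, illustrative analogue of the paper's Alg. 14), computing the same units βi as the Pauli-tensor sketch above and linearly combining each amplitude with its bit-flipped pair:

    import numpy as np

    def applyPauliGadget(psi, paulis, theta):
        # paulis: dict mapping target qubit -> 'X', 'Y' or 'Z'
        i = np.arange(len(psi))
        txy = sum(1 << q for q, s in paulis.items() if s in 'XY')
        eta = 1j ** sum(s == 'Y' for s in paulis.values())
        parity = np.zeros(len(psi), dtype=np.int64)
        for q, s in paulis.items():
            if s in 'YZ':
                parity ^= (i >> q) & 1
        beta = eta * (-1.0)**parity
        pair = i ^ txy
        psi[:] = np.cos(theta)*psi + 1j*np.sin(theta)*beta[pair]*psi[pair]

As a sanity check, an all-Ẑ input reduces to the phase gadget: pair = i, and each amplitude is scaled by exp(i θ (−1)^{f(i)}).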
A. State representation

Sec. III then distributed Ψ between W = 2^w nodes, each with a local sub-statevector ψ of length 2^{N−w}. We must now perform a similar procedure for representing a density matrix.

Consider a general N-qubit density matrix ρN with elements αkl ∈ C corresponding to structure

    ρN = ∑_k^{2^N} ∑_l^{2^N} αkl |k⟩⟨l|N.   (108)

The basis projector |k⟩⟨l|N can be numerically instantiated as an R^{2^N × 2^N} matrix. Instead, we vectorise it under the Choi–Jamiolkowski isomorphism [87, 88], admitting the same form as a 2N-qubit basis ket of index i = k + l 2^N,

    ||i⟩⟩_{2N} ≃ |l⟩N |k⟩N,   (109)

which we numerically instantiate as an array. We have used notation ||i⟩⟩_{2N} to indicate a state which is described with a 2N-qubit statevector but which is not necessarily a pure state of 2N qubits; it will instead generally be unnormalised and describe a mixed state of N qubits. We henceforth refer to this as a "Choi-vector".

Very small density matrices occupy an impractical regime ill-suited for distribution. Even employing more nodes than there are columns of the encoded density matrix is a considerable waste of parallel resources. We are ergo safe to impose an important precondition for this section's algorithms:

    N ≥ w.   (113)

That is, we assume the number of density matrix elements stored in each node, Λ, is at least 2^N, or one column's worth. The smallest density matrix that W nodes can cooperatively simulate then has N ≥ log2(W) qubits. This is in no way restrictive; employing 32 nodes, for example, would require we simulate density matrices of at least 5 qubits, and 1024 nodes demand at least 10 qubits; both are trivial tasks for a single node. Furthermore, using 4096 nodes of ARCHER2 [12], our precondition requires we employ at least ≈10^{−6} % of local memory (to simulate a measly 12 noisy qubits); it is prudent to use over 50%! It is fortunate that precondition 113 is safely assumed, since it proves critical to eliminate many tedious edge-cases from our subsequent algorithms.
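The Choi-vector encoding of Eq. 109 is simply a column-major flattening of the density matrix, as a small illustrative numpy check confirms:

    import numpy as np

    def toChoiVector(rho):
        # column-major flattening realises ||i>> with index i = k + l * 2^N
        return rho.flatten(order='F')

    rho = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)  # |+><+|
    v = toChoiVector(rho)
    assert v[0 + 1*2] == rho[0, 1]   # i = k + l*2^N with k=0, l=1, N=1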
We can improve on this for specific unitaries. The previous section's bespoke statevector algorithms for simulating SWAP gates (Alg. 9), Pauli tensors (Alg. 11), phase gadgets (Alg. 13) and Pauli gadgets (Alg. 14) were markedly more efficient than their simulation as general many-target gates via Alg. 10. We can repurpose every one of these algorithms for simulating the same operations upon density matrices represented as Choi-vectors. Observe:

    Û = SWAP                          ⟹  Û* = Û,
    Û = X̂^⊗ Ŷ^⊗m Ẑ^⊗                  ⟹  Û* = (−1)^m Û,
    Û(θ) = exp(i θ Ẑ^⊗)               ⟹  Û(θ)* = Û(−θ),
    Û(θ) = exp(i θ X̂^⊗ Ŷ^⊗m Ẑ^⊗)      ⟹  Û(θ)* = Û((−1)^{m+1} θ).

The above Û* can be directly effected upon a Choi-vector through Û's statevector algorithm, without needing a matrix construction. We make this strategy explicit in Alg. 17.

Algorithm 17: [distributed][density matrix]
The SWAP gate, Pauli tensor, phase gadget and Pauli gadget upon an N-qubit density matrix with Choi-vector distributed between arrays ρ and same-sized per-node buffer φ. These are special cases of Alg. 16 which avoid construction of a unitary matrix.

        // for clarity, assume N is global
 1  N = getNumQubits(ρ)                      // Alg. 15

 2  distrib_density_swapGate(ρ, φ, t1, t2):
 3      distrib_swapGate(ρ, φ, t1, t2)       // Alg. 9
 4      distrib_swapGate(ρ, φ, t1 + N, t2 + N)

 5  distrib_density_pauliTensor(ρ, φ, σ, t):
 6      distrib_pauliTensor(ρ, φ, σ, t)      // Alg. 11
 7      m = 0
 8      for q in range(0, len(t)):
 9          t[q] += N
10          if σ[q] is Ŷ:
11              m = !m
12      distrib_pauliTensor(ρ, φ, σ, t)      // Alg. 11
13      if m == 1:
            # multithread
14          for j in range(0, len(ρ)):
15              ρ[j] *= −1

16  distrib_density_phaseGadget(ρ, t, θ):
17      distrib_phaseGadget(ρ, t, θ)         // Alg. 13
18      for q in range(0, len(t)):
19          t[q] += N
20      distrib_phaseGadget(ρ, t, −θ)        // Alg. 13

21  distrib_density_pauliGadget(ρ, φ, σ, t, θ):
22      distrib_pauliGadget(ρ, φ, σ, t, θ)   // Alg. 14
23      θ *= −1
24      for q in range(0, len(t)):
25          t[q] += N
26          if σ[q] is Ŷ:
27              θ *= −1
28      distrib_pauliGadget(ρ, φ, σ, t, θ)   // Alg. 14

C. Kraus map

The power of a density matrix state description is its ability to capture classical uncertainty as a result of decohering processes. A Kraus map, also known as an operator-sum representation of a channel [86], allows an operational description of an open system's dynamics which abstracts properties of the environment. A Kraus map can describe any completely-positive trace-preserving channel, and ergo capture almost all quantum noise processes of practical interest. In this section, we derive a distributed simulation of an n-qubit Kraus map of M operators, each specified as general matrices, acting upon an N-qubit density matrix represented as a Choi-vector. We assume M ≪ 2^{2N} such that M 2^{4n} ≪ 2^{2N−w}; i.e. that the descriptions of the Kraus channels are much smaller than the distributed density matrix, so are safely duplicated on each node. We will strictly require that n ≤ N − ⌈w/2⌉ to satisfy memory preconditions imposed by an invoked subroutine (incidentally, Alg. 10's simulation of the many-target gate upon a statevector).

An n-qubit Kraus map consisting of M Kraus operators {K̂^{(m)}_t : m < M}, each described by a matrix K^{(m)} ∈ C^{2^n × 2^n}, operating upon qubits t (where n = len(t)) modifies an N-qubit density matrix ρ via

    ρ → E(ρ) = ∑_m^M K̂^{(m)}_t ρ K̂^{(m)†}_t.   (125)
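The correctness of Alg. 17 rests on the standard vectorisation identity vec(U ρ U†) = (U* ⊗ U) vec(ρ) under our column-major encoding of Eq. 109, whereby Û acts on the lower N qubits and Û* on the upper. A tiny numpy check (any unitary works; the names are illustrative):

    import numpy as np

    U = np.array([[1, -1j], [-1j, 1]]) / np.sqrt(2)   # e.g. Rx(pi/2)
    rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
    lhs = (U @ rho @ U.conj().T).flatten(order='F')   # evolve, then vectorise
    rhs = np.kron(U.conj(), U) @ rho.flatten(order='F')
    assert np.allclose(lhs, rhs)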
Algorithm 19: [distributed][density matrix]
One-qubit dephasing of qubit t with probability p of an N-qubit density matrix with Choi-vector distributed between arrays ρ of length Λ among 2^w nodes.
[O(Λ) bops][Λ/2 flops][0 exchanges]
[0 exchanged][O(1) memory][Λ/2 writes]

 1  distrib_oneQubitDephasing(ρ, t, p):
 2      Λ = len(ρ)
 3      N = getNumQubits(ρ)             // Alg. 15
 4      w = log2(getWorldSize())        // Alg. 5
 5      c = 1 − 2p
 6      if t ≥ N − w:
 7          r = getRank()               // Alg. 5
 8          b = getBit(r, t − (N − w))  // Alg. 1
            # multithread
 9          for k in range(0, Λ/2):
10              j = insertBit(k, t, !b) // Alg. 1
11              ρ[j] *= c
12      else:
            # multithread
13          for k in range(0, Λ/4):
14              j = insertBit(k, t, 1)  // Alg. 1
15              j = insertBit(j, t + N, 0)
16              ρ[j] *= c
17              j = insertBit(k, t, 0)  // Alg. 1
18              j = insertBit(j, t + N, 1)
19              ρ[j] *= c

2. Two-qubit

We can derive a similar method for the two-qubit dephasing channel, inducing Ẑ on either or both of qubits t1 and t2 with probability p. Assume t2 > t1. The channel upon a general state ρ = ∑_{kl} αkl |k⟩⟨l| produces

    ε(ρ) = (1 − p)ρ + p/3 ( Ẑt1 ρ Ẑt1 + Ẑt2 ρ Ẑt2 + Ẑt1 Ẑt2 ρ Ẑt1 Ẑt2 )   (140)
         = ∑_{kl} αkl |k⟩⟨l| ( 1 − p + p/3 ( s(t1)_{kl} + s(t2)_{kl} + s(t1)_{kl} s(t2)_{kl} ) ),   (141)

where s(t)_{kl} = (−1)^{k[t]+l[t]} = ±1. This suggests

    αkl → αkl,              if k[t1] = l[t1] ∧ k[t2] = l[t2],
          (1 − 4p/3) αkl,   otherwise,   (142)

and that an amplitude βi of the Choi-vector ||ρ⟩⟩_{2N} = ∑_i βi ||i⟩⟩_{2N} is multiplied by (1 − 4p/3) if either i[t1] ≠ i[t1+N] or i[t2] ≠ i[t2+N].

Like the one-qubit dephasing channel, we've shown two-qubit dephasing is also diagonal and embarrassingly parallel. We distribute the 2^{2N} amplitudes uniformly between 2^w nodes, such that the j-th local amplitude ρ[j] ≡ βi of node r again corresponds to global basis state ||i⟩⟩ = ||r⟩⟩ ||j⟩⟩, where

    ||i⟩⟩_{2N} ≡ ||r⟩⟩w ||…⟩⟩_{N−w} ||…⟩⟩w ||…⟩⟩_{N−w},

with qubits t2 + N and t1 + N lying within the first two sub-registers, and t2 and t1 within the last two. There are three distinct ways that t1 and t2 can be found among these sub-registers.

1. When t2 < N − w, the principal bits i[t1], i[t1+N], i[t2], i[t2+N] are all determined by local index j.

2. When t1 ≥ N − w, bits i[t1+N] and i[t2+N] are fixed per-node and determined by the rank r, and the remaining principal bits by j.

3. When t1 < N − w and t2 ≥ N − w, then bit i[t2+N] is determined by the rank, and all other principal bits by j.

With bit interleaving, we could devise non-branching loops which directly enumerate only local indices j whose global indices i satisfy i[t1] ≠ i[t1+N] or i[t2] ≠ i[t2+N]. However, Eq. 142 reveals 75% of all amplitudes are to be modified. It is ergo worthwhile to enumerate all indices and modify every element, with a quarter multiplied by unity, accepting a 25% increase in flops and memory writes in exchange for significantly simplified code. We present Alg. 20.

Algorithm 20: [distributed][density matrix]
Two-qubit dephasing of qubits t1 and t2 > t1 with probability p of an N-qubit density matrix with Choi-vector distributed between arrays ρ of length Λ among 2^w nodes.
[O(Λ) bops][Λ flops][0 exchanges]
[0 exchanged][O(1) memory][Λ writes]

 1  distrib_twoQubitDephasing(ρ, t1, t2, p):
 2      Λ = len(ρ)
 3      r′ = getRank() << log2(Λ)        // Alg. 5
 4      c = 1 − 4p/3
        # multithread
 5      for j in range(0, Λ):
 6          i = r′ | j                   // ||i⟩⟩2N = ||r⟩⟩w ||j⟩⟩λ
 7          b1 = getBit(i, t1)           // Alg. 1
 8          b1′ = getBit(i, t1 + N)
 9          b2 = getBit(i, t2)
10          b2′ = getBit(i, t2 + N)
11          b = (b1 ^ b1′) | (b2 ^ b2′)  // ∈ {0, 1}
12          f = b (c − 1) + 1            // ∈ {1, c}
13          ρ[j] *= f
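Alg. 20's branchless kernel renders directly into vectorised numpy; a hedged rendering of one node's view, with the rank prefix assumed already OR'd into the indices:

    import numpy as np

    def twoQubitDephasing(rho, t1, t2, p, N, rankPrefix=0):
        i = rankPrefix | np.arange(len(rho))
        # b = 1 exactly when i[t1] != i[t1+N] or i[t2] != i[t2+N]
        b = (((i >> t1) ^ (i >> (t1 + N))) | ((i >> t2) ^ (i >> (t2 + N)))) & 1
        c = 1 - 4*p/3
        rho *= b*(c - 1) + 1   # elementwise factor in {1, c}, no branching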
The depolarising channel, also known as the uniform Pauli channel, transforms qubits towards the maximally mixed state, describing incoherent noise resulting from the erroneous application of any Pauli operator. It is a ubiquitous noise model deployed in quantum error correction [92], is often a suitable description of the average noise in deep, generic circuits of many qubits [93], and is the effective noise produced by randomised compiling (also known as twirling) of circuits suffering coherent noise [94, 95]. Like the dephasing channel, we could describe depolarising as a Kraus map and simulate it via Alg. 18, though this would not leverage the sparsity of the resulting superoperator. In this section, we instead derive superior distributed algorithms to simulate both the one and two-qubit depolarising channels upon a density matrix distributed between Λ-length arrays, in O(Λ) operations and at most two rounds of communication. For simplicity, we study uniform depolarising, though an extension to a general Pauli channel is straightforward.

1. One-qubit

The one-qubit uniformly depolarising channel upon qubit t of an N-qubit density matrix ρ produces state

    ε(ρ) = (1 − p)ρ + p/3 ( X̂t ρ X̂t + Ŷt ρ Ŷt + Ẑt ρ Ẑt ),   (144)

where p is the probability of any error occurring. Each Pauli operator upon a basis state |k⟩⟨l| produces

    X̂t |k⟩⟨l| X̂t = |k¬t⟩⟨l¬t|,   (145)
    Ŷt |k⟩⟨l| Ŷt = (−1)^{k[t]+l[t]} |k¬t⟩⟨l¬t|,   (146)
    Ẑt |k⟩⟨l| Ẑt = (−1)^{k[t]+l[t]} |k⟩⟨l|,   (147)

and ergo the depolarising channel maps a general state ρ = ∑_{kl} αkl |k⟩⟨l| to

    E(ρ) = ∑_{kl} ( 1 − p + p/3 (−1)^{k[t]+l[t]} ) αkl |k⟩⟨l|
           + p/3 ( 1 + (−1)^{k[t]+l[t]} ) αkl |k¬t⟩⟨l¬t|,   (148)

or the equivalent change to the equivalent Choi-vector ||ρ⟩⟩_{2N} = ∑_i βi ||i⟩⟩_{2N} of

    βi → (1 − 2p/3) βi + (2p/3) βi¬{t,t+N},   if i[t] = i[t+N],
         (1 − 4p/3) βi,                       if i[t] ≠ i[t+N].   (150)

Unlike the dephasing channel, we see already that the depolarising channel upon the Choi-vector is not diagonal; it will linearly combine amplitudes and require communication. Recall that the 2^{2N} amplitudes of ||ρ⟩⟩ are uniformly distributed between arrays ρ among 2^w nodes (where N ≥ w), such that the j-th local amplitude ρ[j] ≡ βi of node r corresponds to global basis state

    ||i⟩⟩_{2N} ≡ ||r⟩⟩w ||j⟩⟩_{2N−w},   (151)

wherein qubit t + N lies within either the rank or local sub-register, and qubit t within the local sub-register. Two communication scenarios emerge, informed by qubit t.

1. When t < N − w, the principal bits i[t] and i[t+N] are determined entirely by a local index j, and the pair amplitude βi¬{t,t+N} resides within the same node as βi. Simulation is embarrassingly parallel.

2. When t ≥ N − w, bit i[t] is determined by local index j, but i[t+N] is fixed by the node rank r. Precisely, i[t+N] = r[t−(N−w)]. The pair amplitude βi¬{t,t+N} resides within pair node r′ = r¬(t−(N−w)), requiring communication. Since only local amplitudes with indices j satisfying j[t] = i[t+N] need to be exchanged, we first pack only this half into the node's buffer φ before exchange. This is communication paradigm b) of Fig. 5.

Translating these schemes (and in effect, implementing Eq. 150 upon distributed {βi}) into efficient, non-branching, cache-friendly code is non-trivial. We present such an implementation in Alg. 21. Interestingly, its performance is similar to that of a one-target gate (Alg. 6) upon a statevector of 2N qubits, but it exchanges only half of all amplitudes when communication is necessary.
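A hedged numpy sketch of Eq. 150's embarrassingly-parallel case (t < N − w), where both principal bits are local and no communication is needed; illustrative only, and distinct from the branchless Alg. 21:

    import numpy as np

    def oneQubitDepolarising(rho, t, p, N):
        i = np.arange(len(rho))
        same = ((i >> t) & 1) == ((i >> (t + N)) & 1)   # i[t] == i[t+N]
        pair = i ^ ((1 << t) | (1 << (t + N)))          # both bits flipped
        out = (1 - 4*p/3) * rho                          # differing-bit case
        out[same] = (1 - 2*p/3)*rho[same] + (2*p/3)*rho[pair[same]]
        rho[:] = out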
than that required by the dephasing channel. Con- pre-combine the outbound amplitudes and ex-
sider a basis state of 4 fewer qubits: change only one eighth of the buffer capacity. We
visualise this below.
||h⟩⟩2N −4 ≡ ||e⟩⟩N −t2 −1 ||d⟩⟩t2 −t1 −1 ||c⟩⟩N −(t2 −t1 )−1 ⊗
||b⟩⟩t2 −t1 −1 ||a⟩⟩t1 , (158) Λ /8 …
FIG. 15. Some communication patterns of Alg. 22’s distributed simulation of two-qubit depolarising upon a 5-qubit
density matrix distributed between 16 nodes. While E∆0,3 is pairwise, the other channels are simulated through two
consecutive rounds of pairwise communication.
and an amplitude of the equivalent Choi-vector ||ρ⟩⟩_{2N} = Σ_i β_i ||i⟩⟩_{2N} as

    β_i →  β_i + p β_{i¬{t,t+N}},    i[t] = i[t+N] = 0,
           √(1−p) β_i,               i[t] ≠ i[t+N],    (170)
           (1−p) β_i,                i[t] = i[t+N] = 1.

We see every amplitude is scaled, and those of index i with principal bits (i[t] and i[t+N]) equal to zero are linearly combined with a pair amplitude of opposite bits.

[FIG. 16 diagram: two rings of nodes 0–15, each labelled E_γ, with arrows between node pairs.]

FIG. 16. Communication patterns of Alg. 25 simulating the amplitude damping channel upon a 5-qubit density matrix distributed between 16 nodes. Each arrow indicates the one-way sending of 32 amplitudes, via Fig. 5 c).

Next, we consider when these amplitudes are distributed. Recall that the 2^{2N} amplitudes of ||ρ⟩⟩ are distributed between the 2^w nodes as Λ-length arrays.
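Before that, the following is a minimal NumPy sketch (ours, not Alg. 25) of Eq. 170's update in the embarrassingly parallel case where bits t and t+N are both local:

    import numpy as np

    def local_damping(beta, t, N, p):
        # amplitude damping (Eq. 170) upon a fully local Choi-vector
        i = np.arange(len(beta))
        bt, btN = (i >> t) & 1, (i >> (t + N)) & 1
        pair = i ^ ((1 << t) | (1 << (t + N)))       # flips bits t and t+N
        out = np.sqrt(1 - p) * beta                  # case i[t] != i[t+N]
        both0 = (bt == 0) & (btN == 0)
        both1 = (bt == 1) & (btN == 1)
        out[both0] = beta[both0] + p * beta[pair[both0]]
        out[both1] = (1 - p) * beta[both1]
        return out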
we instead seek scalar ⟨E⟩ = Tr(Ĥ ρ) ∈ ℝ. It is tempting to employ the Pauli tensor (upon density-matrix) simulation of Alg. 17, using clones of the density matrix which are summed (weighted by h_n) to produce Ĥρ, before a trace evaluation via

    ||ρ⟩⟩ = Σ_i^{2^{2N}} β_i ||i⟩⟩  ⟹  Tr(ρ) = Σ_j^{2^N} β_{j(2^N+1)}.    (172)

Such a scheme would require O(T) exchanges, a total of O(T) 2^{2N} exchanged amplitudes and memory writes, and a O(2^{2N}) memory overhead. An astonishingly more efficient scheme is possible.

Assume ρ = Σ_{kl} α_{kl} |k⟩⟨l| and that H : ℂ^{2^N × 2^N} is a Ẑ-basis matrix instantiating Ĥ. Of course, we will not instantiate such an expensive object, but it permits us to express

    ⟨E⟩ = Σ_k^{2^N} Σ_l^{2^N} H_{lk} α_{kl}.    (173)

Through a natural 2D extension of our ket indexing, we may express

    H_{lk} = Σ_n^T h_n (⊗_t^N σ̂_t^{(n)})_{lk}    (174)
           = Σ_n^T h_n Π_t^N (σ̂_{N−t−1}^{(n)})_{l[t], k[t]}.    (175)

This reveals that a O(N) bop calculation of a matrix element (∈ {±1, ±i}) of the N-qubit Pauli tensor is possible, informed by the elements of the Pauli matrices, and hence that a single element H_{lk} ∈ ℂ is calculable in O(T) flops and O(T N) bops. We can in-principle leverage Hermiticity of Ĥ and ρ to evaluate ⟨E⟩ in ≈ 2× fewer operations, but this will not meaningfully accelerate distributed simulation.

Expressed in terms of the equivalent Choi-vector ||ρ⟩⟩_{2N} = Σ_i β_i ||i⟩⟩_{2N}, we can write

    ⟨E⟩ = Σ_i^{2^{2N}} H_i β_i,    (176)
    H_i = Σ_n^T h_n Π_t^N (σ̂_{N−t−1}^{(n)})_{i[t+N], i[t]}.    (177)

The amplitudes β_i are trivially divided between the 2^w nodes, each concurrently weighted-summing its amplitudes, before a final global reduction wherein each node contributes a single complex scalar.

We formalise our strategy in Alg. 26. It requires no exchanging of Choi-vectors between nodes, and no writing to heap memory. Each amplitude of ρ is read precisely once.

The subroutine pauliTensorElem (Line 18) performs exactly N multiplications of real or imaginary integers (0, ±1, ±i). One may notice that half of all elements among the Pauli matrices are zero, and that encountering any one will yield a zero tensor element. This might lead one to erroneously expect that only a factor 1/2^N of invocations of pauliTensorElem will yield a non-zero result, and prompt an optimisation which avoids their calculation. We caution against this; the input Pauli strings to Alg. 26 will not be dense (i.e. T ≪ 4^N), and so will not uniformly invoke the subroutine for all permutations of arguments i and σ. We should expect instead to perform exponentially fewer invocations, precluding us from reasoning about the expected number of zero elements, as this depends on the user's input Pauli string structure. Finally, a naive optimisation like an attempt to return early from the loop of Line 21 whenever v = 0 will cause needless branching and disrupt multithreaded performance.
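The following is a minimal Python model (ours; Alg. 26's actual listing is not reproduced here) of such a subroutine, computing one element of the N-qubit Pauli tensor per Eq. 177; the names PAULI and pauli_tensor_elem are our own:

    import numpy as np

    # 2x2 Pauli matrices; every element is one of {0, +-1, +-i}
    PAULI = {'I': np.eye(2, dtype=complex),
             'X': np.array([[0, 1], [1, 0]], dtype=complex),
             'Y': np.array([[0, -1j], [1j, 0]], dtype=complex),
             'Z': np.array([[1, 0], [0, -1]], dtype=complex)}

    def pauli_tensor_elem(paulis, i, N):
        # element selected by 2N-bit Choi index i, per Eq. 177: qubit t
        # contributes its Pauli's element (i[t+N], i[t]); paulis[t] names
        # the Pauli acting on qubit t. Exactly N multiplications, with
        # deliberately no early exit upon encountering a zero
        v = 1 + 0j
        for t in range(N):
            row = (i >> (t + N)) & 1
            col = (i >> t) & 1
            v *= PAULI[paulis[t]][row, col]
        return v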
We now lament the lack of an analogous statevector algorithm. The speedup of Alg. 26 over a naive method invoking Pauli tensor simulation results from our not propagating any amplitude uninvolved in the trace. It hence will not accelerate the equivalent statevector calculation, which necessarily involves all amplitudes. That is, for the general |Ψ⟩_N = Σ_k α_k |k⟩_N, the expected value

    ⟨E⟩ = ⟨Ψ| Ĥ |Ψ⟩ = Σ_k^{2^N} Σ_l^{2^N} H_{lk} α_k* α_l,    (178)

includes products of all pairs of amplitudes {α_k, α_l}. This necessitates that the products include amplitudes residing in distinct nodes, and ergo that its distributed calculation involves multiple rounds of inter-node exchange.
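For illustration, a sketch (ours, reusing pauli_tensor_elem above) of one node's contribution to Eq. 176; a single-scalar reduction across nodes then completes ⟨E⟩:

    def local_expec_contrib(beta_local, rank, N, w, coeffs, strings):
        # one node's partial sum of <E> = sum_i H_i beta_i (Eq. 176) over
        # its Lambda = 2^(2N-w) local amplitudes; coeffs holds the h_n and
        # strings the T Pauli strings, giving O(N T Lambda) work per node
        suffix_bits = 2 * N - w
        total = 0j
        for j, beta in enumerate(beta_local):
            i = (rank << suffix_bits) | j    # global Choi index, per Eq. 151
            for h, paulis in zip(coeffs, strings):
                total += h * pauli_tensor_elem(paulis, i, N) * beta
        return total   # each node contributes this one scalar to a reduction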
    |k⟩_N ≡ |k^A, k^B⟩_N,    (179)

where f_t interweaves the bits of k^B into positions t of k^A, such that

    k′ = f_t(k^A, k^B).    (180)

We can trivially compute f with bitwise algebra.
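For instance, a bit-by-bit sketch (our illustration; the manuscript does not prescribe this exact loop) of such an interleaving:

    def f_t(kA, kB, t, N):
        # interweave the bits of kB into the ascending positions t of the
        # N-bit output index, filling remaining positions with bits of kA
        k, a, b = 0, 0, 0          # output index; bit-cursors into kA, kB
        for q in range(N):
            if b < len(t) and q == t[b]:
                k |= ((kB >> b) & 1) << q
                b += 1
            else:
                k |= ((kA >> a) & 1) << q
                a += 1
        return k

    # e.g. N=4, t=(1,3): the bits of kB land at positions 1 and 3
    assert f_t(0b00, 0b11, (1, 3), 4) == 0b1010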
Our general composite density matrix, with amplitudes α_{kl}, can then be written

    ρ^{AB} = Σ_k^{2^N} Σ_l^{2^N} α_{kl} |k⟩⟨l|_N    (181)
           ≡ Σ_{k^A}^{2^m} Σ_{k^B}^{2^n} Σ_{l^A}^{2^m} Σ_{l^B}^{2^n} α_{f_t(k^A,k^B), f_t(l^A,l^B)} |k^A, k^B⟩⟨l^A, l^B|_N.    (182)

Let ⟨1^{⊗m}, v| notate an interwoven tensor product of the m-qubit identity operator with the n qubits of the v-th basis bra of ρ^B. The reduced density matrix can then be expressed as

    ρ^A = Tr_B(ρ^{AB}) = Σ_v^{2^n} ⟨1^{⊗m}, v| ρ^{AB} |1^{⊗m}, v⟩
        = Σ_{k^A}^{2^m} Σ_{l^A}^{2^m} Σ_v^{2^n} α_{f_t(k^A,v), f_t(l^A,v)} |k^A⟩⟨l^A|.    (183)

This makes clear that the amplitudes α′ of ρ^A are

    α′_{kl} = Σ_v^{2^n} α_{f_t(k,v), f_t(l,v)},    k, l ∈ [0..2^m).    (184)

Observe that a fraction 1/2^n of all amplitudes of ρ^{AB} are involved in the determination of ρ^A, and that ρ^A is determined by a total of 2^{2m+n} sum terms. Before proceeding, we make several more immediate observations to inform subsequent optimisation.

• In principle, evaluating a single α′_{kl} amplitude requires polynomially fewer than the 2^{2m+n} − 1 floating-point additions suggested directly by the sum over v, because the sum may be performed by a sequence of hierarchical reductions on neighbouring pairs. This is compatible with numerical stability techniques like Kahan summation [105], though reduces the floating-point costs by a modest and shrinking factor 1 − 2^{−n}, easily outweighed by its introduced caching overheads.

• The pair of subscripted indices (f_t(k,v), f_t(l,v)) are unique for every unique assignment of (k, l, v). Assuming no properties of ρ^{AB} (e.g. relaxing Hermiticity), each amplitude α′_{kl} of ρ^A is therefore a sum of unique amplitudes α of ρ^{AB}. There are no repeated partial sums between different α′_{kl} which we might otherwise seek to re-use in hierarchical reductions.

• As we should expect by the arbitrariness of the ordering of t, the sum in α′_{kl} is uniformly weighted. We can therefore iterate v in any order, and set its constituent bits v[q] with simplified unordered bitwise operations.

• We must optimise our memory strides by choosing whether to contiguously iterate the "output" amplitudes α′ (for each, computing a full sum of 2^n scalars), or the "input" amplitudes α (adding each to one of 2^m partial sums). We choose the former, since the cache penalties of a suboptimal write stride outweigh the read penalties, and also since its multithreaded implementation avoids race conditions and minimises false sharing [25].

The equivalent reduced Choi-vector ||ρ^A⟩⟩_{2m} = Σ_i β′_i ||i⟩⟩, resulting from tracing out qubits t of ||ρ^{AB}⟩⟩_{2N} = Σ_i β_i ||i⟩⟩, has amplitudes

    β′_i = Σ_v^{2^n} β_{g(i,v)},  where    (185)
    g(i, v) = f_{t ∪ (t+N)}(i, (v << n) | v).    (186)

The function g merely takes index i and interweaves the bits of v into positions t and t+N, the latter being the same positions shifted left by N. Local, serial evaluation of this sum is trivial, requiring O(2^{2N}/2^n) flops, and local parallelisation is straightforward.
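In terms of the f_t sketch above, g may be modelled as (again, our illustration):

    def g(i, v, t, N):
        # Eq. 186: interweave the duplicated trace bits (v << n) | v into
        # positions t and t + N of the output 2N-bit index
        n = len(t)
        s = tuple(t) + tuple(q + N for q in t)    # the 2n targeted positions
        return f_t(i, (v << n) | v, s, 2 * N)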
We now distribute these amplitudes between W = 2^w nodes. Because ρ^{AB} and ρ^A differ in dimension, their partitioned arrays on each node have different sizes. Recall we assume N ≥ w such that each node contains at least one column's worth of ρ^{AB}. For the output reduced density matrix ρ^A to satisfy this precondition, and ergo be compatible input to this manuscript's other algorithms, it must similarly satisfy m ≥ w, equivalently that n ≤ N − w. This is inessential to our algorithm however, which imposes a looser condition elaborated upon later.

The j-th local amplitude ρ^{AB}[j] ≡ β_g of node r corresponds to global basis state

    ||g⟩⟩_{2N} ≡ ||r⟩⟩_w ||j⟩⟩_{2N−w},    (187)

with bits t and t+N of g positioned as in Eq. 151. Similarly, the j-th local reduced amplitude ρ^A[j] = β′_i corresponds to

    ||i⟩⟩_{2m} = ||r⟩⟩_w ||j⟩⟩_{2m−w}.    (188)

Therefore all to-be-traced qubits t_q ∈ t (satisfying 0 ≤ t_q < N) target the suffix substate ||j⟩⟩_{2N−w}, and
a subset of the shifted qubits in t + N will target the prefix substate ||r⟩⟩_w. Two communication patterns emerge:

1. When all t_q < N − w, then no qubits in t + N target the prefix substate. This means every index g(i, v) (Eq. 186) within a given node of rank r is determined entirely by (r, v, and) the suffix bits of i, equivalent to j. Precisely:

    i = (r << (2N − w)) | j    (189)

for all local j. Ergo all summed amplitudes {β_{g(i,v)} : v} reside within the same node. Furthermore, the index i of the destination amplitude β′_i shares the same w-bit prefix (r) as the source/summed amplitudes, and ergo also resides in the same node. As a result, this scenario is embarrassingly parallel.

2. When ∃ t_q ≥ N − w, the amplitudes featured in Eq. 185 reside within distinct nodes. Computing a single β′_i will require prior communication. Like we did in Sec. IV D to simulate the many-target general unitary on a statevector, we can in-principle use SWAP gates to first obtain locality of these amplitudes. However, the procedure here is complicated by the reduced dimension of the output structure ||ρ^A⟩⟩_{2m}, meaning that we cannot simply "swap back" our swapped qubits.

Let us focus on scenario 2., which admits a several-step procedure. In essence, we apply SWAP gates to move all targeted qubits into the (2N − w)-qubit suffix substate and then perform the subsequently embarrassingly parallel reduction; thereafter we perform additional SWAP gates on the reduced density matrix to restore the relative ordering of the non-targeted qubits. This a priori requires that all target qubits t, and their paired Choi qubits t + N, can fit into the suffix substate (a similar precondition of the many-target gate upon a statevector of Sec. IV D). This requires

    n ≤ N − ⌈w/2⌉.    (190)

The subsequent SWAP gates on the reduced m-qubit density matrix, treated as an unnormalised 2m-qubit statevector, assume the equivalent precondition 2m ≥ w.

Even under these assumptions, the procedure is verbose in the general case. In lieu of its direct derivation, we opt to demonstrate it with an example. Imagine that we distribute an N = 5 qubit density matrix ρ between W = 8 nodes (ergo w = 3). Assume we wish to trace out qubits t = {2, 4}. Let ||g⟩⟩_{2N} = ||i, v⟩⟩ be a basis state of the Choi-vector ||ρ⟩⟩_{2N}, where v is formed by the four bits of g at indices t ∪ (t + N), and i is formed by the remaining six untargeted bits of g. The bits of g, constituted by the bits of i and v, are arranged as:

    ||g⟩⟩_{2N} = |g[9] g[8] g[7]⟩_w |g[6] g[5] g[4] g[3] g[2] g[1] g[0]⟩_{2N−w}
               = |v[3] i[5] v[2]⟩_w |i[4] i[3] v[1] i[2] v[0] i[1] i[0]⟩_{2N−w}.    (191)

Evaluating a reduced, output amplitude β′_i requires summing all β_g with indices satisfying v[0] = v[2] and v[1] = v[3] (with fixed i). Because bits v[2] and v[3] lie in the prefix state, the sum terms are distributed between multiple nodes. Precisely, between ranks {r = v[3] 2^2 + i[5] 2^1 + v[2] 2^0 : 0 ≤ v < 2^4}.

So we first perform a series of SWAP gates upon ||ρ⟩⟩ (treated as a statevector) via Sec. IV C, in order to move all targeted prefix bits into the suffix state. We heuristically swap the left-most prefix targets (starting with v[3]) with the left-most non-targeted suffix qubits (initially i[4]), minimising the relative displacement of the i bits, which will accelerate a subsequent re-ordering step. After effecting

    ||ρ⟩⟩_{2N} → SWAP_{9,6} SWAP_{7,5} ||ρ⟩⟩_{2N},    (192)

the basis state ||g⟩⟩ has been mapped to

    ||g′⟩⟩_{2N} = |i[4] i[5] i[3]⟩_w |v[3] v[2] v[1] i[2] v[0] i[1] i[0]⟩_{2N−w}.

All amplitudes β_g across varying v (fixing i) now reside within the same node, permitting embarrassingly parallel evaluation of Eq. 185. This resembles a four-qubit partial trace, of qubits t′ = {2, 4, 5, 6}, upon a seven-qubit statevector. All W nodes perform this reduction, producing a distributed m = 3-qubit density matrix ρ′. Alas, this is not yet the output state; the i-th global basis state of ||ρ′⟩⟩ does not correspond to the desired amplitude β′_i, but is instead the state

    ||i′⟩⟩_{2m} = |i[4] i[5] i[3]⟩_w |i[2] i[1] i[0]⟩_{2m−w},    (193)

where the leftmost two prefix qubits are disordered. Our final step is to restore the relative ordering of all bits of i (mapping ||i′⟩⟩ → ||i⟩⟩) by performing additional SWAP gates upon the corresponding qubits of the reduced 3-qubit density matrix. Each qubit can be swapped directly to its known ordered location. In this example, we simply perform

    ||ρ′⟩⟩_{2m} → SWAP_{5,4} ||ρ′⟩⟩_{2m}.    (194)

Choi-vector ||ρ′⟩⟩_{2m} is now the correct, distributed, reduced density matrix.
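The example's bookkeeping can be verified with a few lines of Python (our own check, not part of the manuscript's algorithms):

    # N=5, t={2,4}: the targeted positions of g are {2,4,7,9} (Eq. 191)
    N = 5
    t = (2, 4)
    targeted = sorted(t + tuple(q + N for q in t))
    untargeted = [q for q in range(2 * N) if q not in targeted]
    assert targeted == [2, 4, 7, 9]
    assert untargeted == [0, 1, 3, 5, 6, 8]   # hold i[0..5], so i[5] is bit 8
    # the w=3 prefix positions {7,8,9} hold v[2], i[5], v[3], so for fixed i
    # the sum terms span ranks r = v[3]*4 + i[5]*2 + v[2], as stated above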
[FIG. 17 panels: Tr_{2,4}(ρ_5) decomposed as SWAP_{9,6} ||ρ⟩⟩_{10}, then SWAP_{7,5} ||ρ⟩⟩_{10}, then Tr_{2,4,5,6} ||ρ⟩⟩_{10}, then SWAP_{5,4} ||ρ′⟩⟩_6; each panel is a ring of nodes 0–7.]

FIG. 17. The communication pattern of Alg. 28's distributed simulation of the partial trace of qubits t = {2, 4} upon an N = 5-qubit density matrix distributed between W = 8 nodes. The left plot shows the necessary traffic of amplitudes to effect Tr_{2,4}(ρ) "directly", in a way incompatible with our distributed partitioning. The right plots decompose this into the pairwise-communicating steps of our algorithm, which operates upon ||ρ⟩⟩_{10} (and reduced state ||ρ′⟩⟩_6) treated as statevectors. It is interesting to monitor the movement of amplitudes from node r = 1 to 2, achieved via intermediate movements to nodes 5, 4, then 2.
It is again beneficial to heuristically perform these "post-processing" SWAPs upon the leftmost prefix qubits first, to avoid unnecessary displacement of subsequently swapped qubits between the suffix and prefix substates, which causes wasteful communication. We visualise the incurred communication pattern of this process in Fig. 17.

We summarise the complexity of this method.

• At most w initial SWAP gates are required to remove all prefix targets. Each invokes Alg. 9 in communication scenario 3., whereby half a node's amplitudes are exchanged. A total of O(w) 2^{2N}/2 amplitudes are communicated in O(w) rounds, with as many memory writes, and zero flops.

• The embarrassingly parallel evaluation of Eq. 185, i.e. the "local trace", involves O(2^{2N}/2^n) flops and bops, and a factor 1/2^n fewer memory writes.

• Fewer than m final SWAPs are needed to reorder the reduced state, each invoking Alg. 9 in potentially any of its three communication scenarios. However, things simplify when we enforce m ≥ w, i.e. the precondition assumed by this manuscript's other algorithms. In that case, all initially targeted prefix qubits get swapped into the leftmost suffix positions, and ergo after reduction, only the prefix qubits are disordered (as per our example). The final SWAP costs ergo simplify to O(w) SWAPs in scenario 3. of Alg. 9, exchanging a total of O(w) 2^{2m}/2 amplitudes. Without this precondition, the final SWAP costs scale inversely with 2^{2n}, so are anyway quickly occluded with increasing number of traced qubits n. Furthermore, our heuristic of initially swapping the leftmost qubits first reduces the necessary number of final SWAPs.

We formalise this algorithm in Alg. 28. Below we discuss some potential optimisations, and some tempting but ultimately not worthwhile changes.

• Our algorithm did not assume any properties (normalisation or otherwise) of ρ. If we assume ρ is Hermitian, then we can reduce the network and floating-point costs by at most a factor 2. This is because ρ_{ij} = ρ*_{ji} enables us to process only a fraction (1 + 1/2^N)/2 of all 2^{2N} amplitudes, further reducing the fraction 1/2^n involved in the partial trace. Determining the exact reduction and consequential utility of the optimisation requires a careful treatment we do not here perform.

• The qubits of the reduced density matrix, before the post-processing SWAPs, are only ever out-of-order when an initial prefix qubit is swapped past a non-targeted qubit. So it is tempting to swap only adjacent qubits, percolating prefix qubits toward the suffix substate one qubit at a time. This preserves the relative order of the non-targeted qubits, so no post-trace SWAPs are required. Alas, this is not worthwhile; it necessitates more total SWAPs, and all of them will operate on the larger N-qubit density matrix, as opposed to the smaller reduced (N − n)-qubit density matrix, increasing communication and write costs.

• Our initial SWAPs moved prefix targets into the leftmost suffix qubits to reduce disordering of the untargeted qubits, and ergo reduce the number of subsequent SWAPs (and thus, the communication costs) on the reduced state. This means however that the local partial trace calculation (invocation of local partialTraceSub at Line 14 of Alg. 28) targets high-index qubits (in arrays t and t′). This causes a large memory stride; the accessed amplitudes ρ[g] across g at Line 35 are far apart (beyond cache lines), and their bounding addresses overlap across different i. This may lead to sub-optimal caching behaviour, especially in multithreaded settings (although we thankfully note it does not induce false sharing, since we merely read ρ). It is worth considering instead swapping prefix targets into the rightmost suffix qubits. This makes amplitudes ρ[g] at Line 35 contiguous in memory, improving caching and multithreaded performance, but requires more post-processing SWAPs and ergo modestly increased communication costs. Such a strategy may prove worthwhile when making use of so-called "fused swaps" [57, 59, 61].
Algorithm 27: O(N) subroutines of Alg. 28.

1  getNextLeftmostZeroBit(b, i):
2      i -= 1
3      while getBit(b, i) == 1:
4          i -= 1
5      return i

6  getReorderedTargets(s, λ):
       // locate leftmost non-targeted suffix
7      b = getBitMask(s)
8      τ = getNextLeftmostZeroBit(b, λ)
       // obtain new suffix-only targets
9      s′ = { }
10     for q in range(len(s) − 1, −1, −1):
11         if s[q] < λ:
12             append s[q] to s′
13         else:
14             append τ to s′
15             τ = getNextLeftmostZeroBit(b, τ)
16     return reversed(s′)
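As a sanity check, the following is a direct Python transliteration (ours) of getReorderedTargets, confirming the SWAP choices of the earlier example:

    def get_next_leftmost_zero_bit(b, i):
        # first index below i whose bit in mask b is zero (Alg. 27)
        i -= 1
        while (b >> i) & 1:
            i -= 1
        return i

    def get_reordered_targets(s, lam):
        # replace each prefix target (>= lam) with the next leftmost
        # untargeted suffix qubit (Alg. 27)
        b = 0
        for q in s:
            b |= 1 << q
        tau = get_next_leftmost_zero_bit(b, lam)
        out = []
        for q in reversed(s):
            if q < lam:
                out.append(q)
            else:
                out.append(tau)
                tau = get_next_leftmost_zero_bit(b, tau)
        return list(reversed(out))

    # the example: N=5, w=3, t={2,4} gives s={2,4,7,9} and lam=7, whence
    # reordered targets {2,4,5,6}; Alg. 28 then effects SWAP(9,6), SWAP(7,5)
    assert get_reordered_targets([2, 4, 7, 9], 7) == [2, 4, 5, 6]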
17 getRemainingQubitOrder(N, s, s′):
       // determine post-swap qubit ordering
18     q = range(0, 2N)
19     for a, b in zip(s, s′):
20         if a != b:
21             q[a], q[b] = q[b], q[a]
       // remove traced-out qubits
22     s′′ = { }
23     b′ = getBitMask(s′)
24     for i in range(0, 2N):
25         if getBit(b′, i) == 0:
26             append q[i] to s′′
       // make elements contiguous
27     b′′ = getBitMask(s′′)
28     for i in range(0, len(s′′)):
29         for j in range(0, s′′[i]):
30             s′′[i] -= ! getBit(b′′, j)
31     return s′′

Algorithm 28: [distributed] [density matrix] Partial tracing of n qubits t from an N-qubit density matrix with Choi-vector distributed as Λ-length arrays between 2^w nodes.

Tr_t(ρ)
[O(wΛ) bops] [O(Λ/2^n) flops]
[O(w + N − n) exchanges] [O(w) 2^{2N} exchanged]
[O(1) memory] [O(w)Λ writes]

1  distrib partialTrace(ρ, φ, t):
2      N = getNumQubits(ρ)    // Alg. 15
3      λ = log2(len(ρ))
4      sort(t)
       // local if all targets are in suffix
5      if t[−1] + N < λ:
6          ρ′ = local partialTraceSub(ρ, t, t + N)
7          return ρ′
       // find where to swap prefix targets
8      s = t ∪ (t + N)
9      s′ = getReorderedTargets(s, λ)    // Alg. 27
       // swap prefix targets into suffix
10     for q in range(len(s) − 1, −1, −1):
11         if s′[q] ≠ s[q]:
12             distrib swapGate(ρ, φ, s[q], s′[q])    // Alg. 9
       // perform embarrassingly parallel trace
13     t′ = s′[len(t) : ]
14     ρ′ = local partialTraceSub(ρ, t, t′)
       // determine un-targeted qubit ordering
15     s′′ = getRemainingQubitOrder(N, s, s′)    // Alg. 27
       // reorder untargeted via swaps
16     for q in range(len(s′′) − 1, −1, −1):
17         if s′′[q] != q:
18             p = index of q in s′′
19             distrib swapGate(ρ′, φ, q, p)
20             s′′[q], s′′[p] = s′′[p], s′′[q]
21     return ρ′

22 local partialTraceSub(ρ, t, t′):
23     Λ = len(ρ)
24     N = getNumQubits(ρ)    // Alg. 15
25     w = log2(getWorldSize())    // Alg. 5
26     n = len(t)
27     γ = 2^{2(N−n)−w}
28     ρ′ = 0_{γ×1}    // new Choi-vector
29     s = sorted(t ∪ t′)
30     for i in range(0, γ):    # multithread
31         g0 = insertBits(i, s, 0)    // Alg. 1
32         for v in range(0, 2^n):
33             g = setBits(g0, t, v)    // Alg. 1
34             g = setBits(g, t′, v)
35             ρ′[i] += ρ[g]
36     return ρ′
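For concreteness, a self-contained Python model (ours) of the local subroutine; insert_bits and set_bits sketch the behaviour of Alg. 1's utilities, whose exact listings are not reproduced here:

    def insert_bits(i, positions, bit):
        # insert `bit` at each ascending position, shifting higher bits up
        for p in positions:
            low = i & ((1 << p) - 1)
            i = ((i >> p) << (p + 1)) | (bit << p) | low
        return i

    def set_bits(g, positions, v):
        # overwrite the bits of g at `positions` with the bits of v
        for b, p in enumerate(positions):
            g = (g & ~(1 << p)) | (((v >> b) & 1) << p)
        return g

    def local_partial_trace_sub(rho, t, t2, N, w):
        # mirrors Lines 22-36: sum the pair-targeted amplitudes of the local
        # Choi-vector rho into a new gamma-length reduced Choi-vector
        n = len(t)
        gamma = 2 ** (2 * (N - n) - w)
        s = sorted(list(t) + list(t2))
        out = [0j] * gamma
        for i in range(gamma):
            g0 = insert_bits(i, s, 0)
            for v in range(2 ** n):
                g = set_bits(set_bits(g0, t, v), t2, v)
                out[i] += rho[g]
        return out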
[FIG. 18 diagram: the layered dependency tree of algorithms. Top layer (decoherence): Tr(Ĥρ) (Alg. 26), Σ_i K̂_t^{(i)} ρ K̂_t^{(i)†} (Alg. 18), Tr_t(ρ) (Alg. 28), E^Δ_{t1,t2} (Alg. 22), E^Δ_t (Alg. 21), E^φ_t (Alg. 19), E^φ_{t1,t2} (Alg. 20), E^γ_t (Alg. 25). These depend upon unitaries on density matrices (Algs. 15–17), which depend upon unitaries on statevectors (Algs. 6, 7, 9–11, 13, 14), which in turn invoke the communication routines (Send/Recv, exchangeArrays, Reduce; Algs. 4, 5) and local simulation (Algs. 2, 3).]
FIG. 18. The dependency tree of this manuscript’s algorithms. The surprising connectivity becomes intuitive under
the following observations: Distributed statevector simulation often includes edge-cases algorithmically identical to
local statevector simulation; Under the Choi–Jamiolkowski isomorphism [87, 88], a unitary upon a density matrix
resembles two unnormalised unitaries upon an unnormalised statevector (which we called a “Choi-vector”); A Kraus
channel upon a density matrix is equivalent to a superoperator upon a Choi-vector, itself equivalent to an unnor-
malised unitary upon an unnormalised statevector; SWAP gates permit transpiling communicating operators into
embarrassingly parallel ones; The natural distribution of statevector amplitudes often admits simulation via pairwise
communication and simple array exchanges.
TABLE I. The distributed algorithms presented in this manuscript, and their costs, expressed either in aggregate or per-node (pn). Assume each algorithm is invoked upon an N-qubit statevector or density matrix, constituted by a total of Σ amplitudes, distributed between 2^w nodes, such that each node contains Λ = Σ/2^w amplitudes and an equivalently sized communication buffer. All methods assume N ≥ w. Where per-node costs differ between nodes (such as for the s-control one-target gate), the average cost is given.

Statevector algorithms (Σ = 2^N, Λ = 2^{N−w}):

Sec. | Alg. | Operation | Symbol | Bops (pn) | Flops (pn) | Writes (pn) | Exchanges | Exchanged | Memory
IV A | 6 | One-target gate | M̂_t | O(Λ) | O(Λ) | Λ | 0 or 1 | Σ | O(1)
IV B | 7 | s-control one-target gate | C_c(M̂_t) | O(sΛ/2^s) | O(Λ/2^s) | Λ/2^s | 0 or 1 | Σ/2^s | O(1)
IV C | 9 | SWAP gate | SWAP_{t1,t2} | O(Λ) | 0 | Λ/2 | 0 or 1 | Σ/2 | O(1)
IV D | 10 | n-target gate^a | M̂_t | O(2^n Λ) | O(2^n Λ) | O(2^n Λ) | O(min(n, w)) | O(min(n, w)) Σ | O(2^n)
IV E | 11 | n-qubit Pauli tensor | ⊗^n σ̂ | O(nΛ) | Λ | Λ | 0 or 1 | Σ | O(1)
IV F | 13 | n-qubit phase gadget | exp(iθ Ẑ^{⊗n}) | O(Λ) | O(Λ) | Λ | 0 | 0 | O(1)
IV G | 14 | n-qubit Pauli gadget | exp(iθ ⊗^n σ̂) | O(nΛ) | O(Λ) | Λ | 0 or 1 | Σ | O(1)

Density-matrix algorithms (Σ = 2^{2N}, Λ = 2^{2N−w}):

Sec. | Alg. | Operation | Symbol | Bops (pn) | Flops (pn) | Writes (pn) | Exchanges | Exchanged | Memory
V B | 16 | n-qubit unitary^b | Û_t | O(2^n Λ) | O(2^n Λ) | O(2^n Λ) | O(min(n, w)) | O(min(n, w)) Σ/2 | O(2^n)
V B | 17 | SWAP gate | SWAP_{t1,t2} | O(Λ) | 0 | Λ | 0 or 1 | Σ/2 | O(1)
V B | 17 | n-qubit Pauli tensor | ⊗^n σ̂ | O(nΛ) | 2Λ or 3Λ | 2Λ or 3Λ | 0 or 1 | Σ | O(1)
V B | 17 | n-qubit phase gadget | exp(iθ Ẑ^{⊗n}) | O(Λ) | O(Λ) | 2Λ | 0 | 0 | O(1)
V B | 17 | n-qubit Pauli gadget | exp(iθ ⊗^n σ̂) | O(nΛ) | O(Λ) | 2Λ | 0 or 1 | Σ | O(1)
V C | 18 | n-qubit Kraus map^b | Σ_i K̂^{(i)} ρ K̂^{(i)†} | O(4^n Λ) | O(4^n Λ) | O(4^n Λ) | O(min(2n, w)) | O(min(2n, w)) Σ/2 | O(16^n)
V D | 19 | one-qubit dephasing | E^φ_t | O(Λ) | Λ/2 | Λ/2 | 0 | 0 | O(1)
V D | 20 | two-qubit dephasing | E^φ_{t1,t2} | O(Λ) | Λ | Λ | 0 | 0 | O(1)
V E | 21 | one-qubit depolarising | E^Δ_t | O(Λ) | O(Λ) | O(Λ) | 0 or 1 | Σ/2 | O(1)
V E | 22 | two-qubit depolarising | E^Δ_{t1,t2} | O(Λ) | O(Λ) | O(Λ) | 0, 1 or 2 | Σ/8 or Σ/2 | O(1)
V F | 25 | amplitude damping^c | E^γ_t | O(Λ) | O(Λ) | O(Λ) | 0 or 1 | Σ/4 | O(1)
V G | 26 | T-term Pauli string expectation | Tr(Ĥ ρ) | O(N T Λ) | O(T Λ) | 0 | 0 | 0^d | O(1)
V H | 28 | n-qubit partial trace^b | Tr_t(ρ) | O(wΛ) | O(Λ/2^n) | O(w) Λ | O(w + N − n) | O(w) Σ | O(1)^e

^a where n ≤ N − w
^b where n ≤ N − ⌈w/2⌉
^c only one node of a pair sends amplitudes
^d one scalar global reduction
^e excluding the cost of the output density matrix
[1] Xiao Yuan, Suguru Endo, Qi Zhao, Ying Li, and Simon C Benjamin. Theory of variational quantum simulation. Quantum, 3:191, 2019.
[2] Iskren Vankov, Daniel Mills, Petros Wallden, and Elham Kashefi. Methods for classically simulating noisy networked quantum architectures. Quantum Science and Technology, 5(1):014001, 2019.
[3] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):1–16, 2019.
[4] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
[5] Emanuel Knill, Raymond Laflamme, and Wojciech H Zurek. Resilient quantum computation. Science, 279(5349):342–345, 1998.
[6] Bálint Koczor, Suguru Endo, Tyson Jones, Yuichiro Matsuzaki, and Simon C Benjamin. Variational-state quantum metrology. New Journal of Physics, 22(8):083038, 2020.
[7] Tyson Jones and Simon C Benjamin. Robust quantum compilation and circuit optimisation via energy minimisation. Quantum, 6:628, 2022.
[8] Dennis Willsch, Hannes Lagemann, Madita Willsch, Fengping Jin, Hans De Raedt, and Kristel Michielsen. Benchmarking supercomputers with the Jülich universal quantum computer simulator. arXiv preprint arXiv:1912.03243, 2019.
[9] Tyson Jones, Anna Brown, Ian Bush, and Simon C Benjamin. QuEST and high performance simulation of quantum computers. Scientific Reports, 9(1):1–11, 2019.
[10] Cupjin Huang, Michael Newman, and Mario Szegedy. Explicit lower bounds on strong quantum simulation. IEEE Transactions on Information Theory, 66(9):5585–5600, 2020.
[11] Maarten Van den Nest. Classical simulation of quantum computation, the Gottesman-Knill theorem, and slightly beyond. Quantum Info. Comput., 10(3):258–271, mar 2010.
[12] EPCC CSE team. ARCHER2 hardware & software (documentation). University of Edinburgh, Mar 2020.
[13] Bahman S Motlagh and Ronald F DeMara. Memory latency in distributed shared-memory multiprocessors. In Proceedings IEEE Southeastcon '98, 'Engineering for a New Era', pages 134–137. IEEE, 1998.
[14] Thomas Häner, Damian S Steiger, Mikhail Smelyanskiy, and Matthias Troyer. High performance emulation of quantum circuits. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 866–874. IEEE, 2016.
[15] Sean Eron Anderson. Bit twiddling hacks. URL: http://graphics.stanford.edu/~seander/bithacks.html, 2005.
[16] Mohammad Alaul Haque Monil, Seyong Lee, Jeffrey S. Vetter, and Allen D. Malony. Understanding the impact of memory access patterns in Intel processors. In 2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), pages 52–61, 2020.
[17] James Bottomley. Understanding caching. Linux Journal, 2004(117):2, 2004.
[18] James E Smith. A study of branch prediction strategies. In 25 Years of the International Symposia on Computer Architecture (Selected Papers), pages 202–215, 1998.
[19] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. ACM SIGPLAN Notices, 35(5):145–156, 2000.
[20] Jaewook Shin, Mary Hall, and Jacqueline Chame. Superword-level parallelism in the presence of control flow. In International Symposium on Code Generation and Optimization, pages 165–175. IEEE, 2005.
[21] Stanley F Anderson, John G Earle, Robert Elliott Goldschmidt, and Don M Powers. The IBM System/360 Model 91: Floating-point execution unit. IBM Journal of Research and Development, 11(1):34–53, 1967.
[22] William Y Chen, Pohua P. Chang, Thomas M Conte, and Wen-mei W. Hwu. The effect of code expanding optimizations on instruction cache design. IEEE Transactions on Computers, 42(9):1045–1057, 1993.
[23] Nakul Manchanda and Karan Anand. Non-uniform memory access (NUMA). New York University, 4, 2010.
[24] Mario Nemirovsky and Dean M Tullsen. Multithreading architecture. Synthesis Lectures on Computer Architecture, 8(1):1–109, 2013.
[25] William J Bolosky and Michael L Scott. False sharing and its effect on shared memory performance. In 4th Symposium on Experimental Distributed and Multiprocessor Systems, pages 57–71. Citeseer, 1993.
[26] Tyson Jones. Simulation of, and with, first generation quantum computers. PhD thesis, University of Oxford, 2022.
[27] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, 2019.
[28] Tyson Jones and Julien Gacon. Efficient calculation of gradients in classical simulations of variational quantum algorithms. arXiv preprint