Before we can discuss the challenges of distributed simulation, we must first understand the simpler problem of local (i.e. non-distributed), shared-memory simulation. This paradigm already invites a myriad of high-performance computing considerations such as memory striding [16], hierarchical caching [17], branch prediction [18], vectorisation [19, 20], type-aware arithmetic [21] and inlining [22], and, on multiprocessor architectures, non-uniform memory access [23] and local parallelisation paradigms like multithreading [24], with incurred nuances such as cache-misses and false sharing [25]. We here forego an introduction to these topics, referring the interested reader to a discussion in quantum simulation settings in Sec 6.3 of Ref. [26]. Still, this manuscript's algorithms will be optimised under these considerations in order to establish a salient performance threshold. In this section we review and derive the basic principles and algorithms of a local full-state simulator.

FIG. 1. The memory costs to represent pure states (via statevectors) and mixed states (via density matrices), assuming that a single complex amplitude requires 16 B (like a C++ complex at double precision) and a process overhead of 100 KiB. [Plot: memory (KiB through PiB, logarithmic) against number of qubits (10 to 50), for the statevector, density matrix and overhead.]

Such simulators maintain a dense statevector |ψ⟩, typically instantiated as an array ψ of scalars αi ∈ C related by

    |ψ⟩N = ∑_i^{2^N} αi |i⟩N   ↔   ψ = {α0, …, α_{2^N − 1}}.   (7)

The scalars can adopt any complex numerical implementation such as Cartesian or polar, or a reportedly well-performing adaptive polar encoding [27], to which the algorithms and algebra in this manuscript are agnostic.

Naturally, a classical simulator can store an unnormalised array of complex scalars and ergo represent non-physical states satisfying

    ∑_i^{2^N} |αi|² ≠ 1.   (8)

This proves a useful facility in the simulation of variational quantum algorithms [28, 29] and the representation of density matrices presented later in this manuscript.

By storing all of a statevector's amplitudes, a full-state simulator maintains a complete description of a quantum state and permits precise a posteriori calculation of any state properties such as probabilities of measurement outcomes and observable expectation values. It also means that the memory and runtime costs of simulation are roughly homogeneous across circuits of equal size, independent of state properties like entanglement and unitarity, making resource prediction trivial. For these reasons, full-state simulators are a natural first choice in much of quantum computing research. Their drawback is that representing an N-qubit pure state requires simultaneous storage of 2^N complex scalars (an exponentially growing memory cost), and simulating an n-qubit general operator acting upon the state requires O(2^{N+n}) floating-point operations (an exponentially growing time cost). We illustrate these memory costs (and the equivalent for a dense N-qubit mixed state) in Figure 1. Thankfully, operators modify the statevector in simple, regular ways, admitting algorithms which can incorporate many high-performance computing techniques, as this section presents.

A. Local costs

We will measure the performance of local algorithms in this manuscript via the following metrics:

• [bops] The number of prescribed basic operations or "bops", such as bitwise, arithmetic, logical and indexing operations. Performing a bop in the arithmetic logic unit (ALU) of a modern CPU can be orders of magnitude faster than the other primitive operations listed below, but their accounting remains relevant to data-dominated HPC [30] like quantum simulation. While bitwise operations (like those of Alg. 1) and integer arithmetic are both classified as bops, the former is significantly faster [31] and is preferred in tight loops. Our example code will be optimised thusly.
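This manuscript's pseudocode repeatedly invokes the bit-twiddling functions of Alg. 1, which is not reproduced in this excerpt. The following minimal Python sketch captures the semantics we assume from their call-sites; the paper's own definitions may differ in detail.

    def getBit(n, t):
        # value (0 or 1) of the t-th bit of integer n
        return (n >> t) & 1

    def flipBit(n, t):
        # n with its t-th bit negated
        return n ^ (1 << t)

    def insertBit(n, t, b):
        # insert bit b at position t, shifting the higher bits left
        return ((n >> t) << (t + 1)) | (b << t) | (n & ((1 << t) - 1))

    def insertBits(n, ts, b):
        # insert bit b at each position in ts (ascending order is essential)
        for t in sorted(ts):
            n = insertBit(n, t, b)
        return n

    def getBitMask(ts):
        # integer with 1-bits at every position in ts
        return sum(1 << t for t in ts)

    def allBitsAreOne(n, ts):
        # whether every bit of n indexed by ts is 1
        return all(getBit(n, t) == 1 for t in ts)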
we can express the action of the operator as

    M̂t |ψ⟩ = ∑_i^{2^N} αi |i[N−1]⟩₁ … M̂|i[t]⟩₁ … |i[0]⟩₁.   (13)

This form determines the access pattern, which we visualise in Fig. 2. There however remain many ways to implement the amplitude iteration, but few which respect the previous section's HPC considerations. We will now directly derive a specific implementation which does.

[FIG. 2: the memory access patterns of the one-target gate, shown for targets M̂0, M̂1, M̂2 and M̂3 over the amplitude array α0, α1, … .]

Decomposing the state as

    |ψ⟩N ≡ ∑_j ∑_k ( βkj |k⟩_{N−t−1} |0⟩₁ |j⟩t + γkj |k⟩_{N−t−1} |1⟩₁ |j⟩t ),   (20)

where βkj, γkj ∈ C together form the 2^N amplitudes
of |ψ⟩. Precisely, βkj ≡ αi ⟹ γkj = αi¬t. The middle ket, equivalent to |i[t]⟩₁, is that targeted by M̂t. Eq. 19 prescribes

    M̂t |ψ⟩N ≡ ∑_j^{2^t} ∑_k^{2^{N−t−1}} (βkj m00 + γkj m01) |k⟩|0⟩|j⟩ + (βkj m10 + γkj m11) |k⟩|1⟩|j⟩.   (21)

This form suggests a bitwise and trivially parallelisable local strategy to branchlessly locate the paired amplitudes among the 2^N of an N-qubit statevector's array, and modify them in-place, which we present in Alg. 2. It makes use of several bit-twiddling functions defined in Alg. 1.

Algorithm 2: [local][statevector]
One-target gate M̂t applied to an N-qubit pure statevector |ψ⟩.
[O(2^N) bops][O(2^N) flops]
[O(1) memory][2^N writes]

 1  local_oneTargGate(ψ, M, t):
 2      N = log2( len(ψ) )
        // loop every |n⟩N = |0⟩₁ |k⟩_{N−t−1} |j⟩t
        # multithread
 3      for n in range(0, 2^N / 2):
            // |iβ⟩N = |k⟩_{N−t−1} |0⟩₁ |j⟩t
 4          iβ = insertBit(n, t, 0)    // Alg. 1
            // |iγ⟩N = |k⟩_{N−t−1} |1⟩₁ |j⟩t
 5          iγ = flipBit(iβ, t)        // Alg. 1
 6          β = ψ[iβ]
 7          γ = ψ[iγ]
            // modify the paired amplitudes
 8          ψ[iβ] = m00 β + m01 γ
 9          ψ[iγ] = m10 β + m11 γ

We finally remark on local parallelisation. Alg. 2 features an exponentially large loop wherein each iteration modifies a unique subset of amplitudes; they can ergo be performed concurrently via multithreading. All variables defined within the loop become thread-private, while the statevector is shared between threads and is simultaneously modifiable only when threads write to separate cache-lines. Otherwise, false-sharing may degrade performance, in the worst case to that of serial array access [25]. We hence endeavour to allocate threads to perform specific iterations which modify disjoint cache-lines. We could uniformly divide the iterations between threads, where conveniently all sizes are expected to be powers of 2 and assigned iterations are contiguous, so that threads write to disjoint cache-lines whenever each thread's share is a multiple of the cache-line size in amplitudes, S. At double precision with a typical 64 byte cache-line, S = 4 [33]. However, we can further avoid some cache-misses incurred by a single thread (due to fetching both β and γ amplitudes) across its assigned iterations, by setting

    (iterations per thread) = 2^{t+1},   (23)

when this exceeds S. Beware that when t is near its maximum of N, such an allocation may non-uniformly divide the work between threads and wastefully leave threads idle. The ideal schedule is ergo the maximum multiple of the cache-line size (in amplitudes) which is less than or equal to the uniform division of total iterations between threads. That is,

    (iterations per thread) = S ⌊ (num iterations) / (S (num threads)) ⌋,   (24)

where (num threads) is the maximum concurrently supported by the executing hardware, and is not necessarily a power of 2. Notice however that in Alg. 2, (num iterations) = 2^N/2 is a runtime parameter; alas, thread allocation must often be specified at compile-time, such as it is when using OpenMP [34]. In such settings, the largest division of contiguous iterations between threads can be allocated, incurring O((iterations per thread)) false shares. This is the default static schedule in OpenMP, specified with the C precompiler directive

    #pragma omp for

This is the multithreaded configuration assumed for all algorithms in this manuscript when indicated by the directive

    # multithread
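The pseudocode of Alg. 2 renders directly into runnable, if illustrative, Python; the bit helpers are inlined so the snippet is self-contained. A performant implementation would instead be compiled C/C++ with OpenMP, as the text describes.

    import numpy as np

    def local_oneTargGate(psi, M, t):
        # psi: numpy array of 2^N complex amplitudes, modified in-place
        # M: 2x2 complex matrix; t: target qubit
        N = int(np.log2(len(psi)))
        for n in range(2**N // 2):
            ib = ((n >> t) << (t + 1)) | (n & ((1 << t) - 1))  # insertBit(n, t, 0)
            ig = ib | (1 << t)                                 # flipBit(ib, t)
            b, g = psi[ib], psi[ig]
            psi[ib] = M[0, 0]*b + M[0, 1]*g
            psi[ig] = M[1, 0]*b + M[1, 1]*g

    # e.g. a Hadamard on qubit 0 of the 2-qubit zero state:
    psi = np.array([1, 0, 0, 0], dtype=complex)
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    local_oneTargGate(psi, H, 0)   # psi is now approx [0.707, 0.707, 0, 0]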
C. Many-control one-target gate

The many-control one-target gate introduces one or more control qubits to the previous section's operator. This creates a simple entangling gate; yet even the one-control one-target gate is easily made universal [35, 36], and it appears as an elementary gate in almost every non-trivial quantum circuit [37]. Many-control gates can be challenging to perform experimentally and are traditionally decomposed into a series of one-control gates. But in classical simulation, many-control gates are just as easy to effect directly, and actually prescribe fewer amplitude modifications and floating-point operations than their non-controlled counterparts. We here derive a local algorithm to in-place simulate the one-target gate with s control qubits upon an N-qubit pure state in O(s 2^{N−s}) bops, O(2^{N−s}) flops, 2^{N−s} memory writes, and an O(1) memory overhead.

We seek to apply Cc(M̂t) upon an arbitrary N-qubit pure state |ψ⟩, where c = {c0, …, c_{s−1}} is an arbitrarily ordered list of s unique control qubits, and operator M̂ targets qubit t ∉ c and is described by a 2 × 2 general complex matrix M (as in the previous section). Let us temporarily assume that the target qubit t is the least significant and rightmost (t = 0), and the next s contiguous qubits are controlled upon. Such a gate is described by the matrix

    C_{1,…,s}(M̂0) ≃ diag(1, …, 1, M)  ∈ C^{2^{s+1} × 2^{s+1}}   (25)
                  = 1^{⊗(s+1)} + (|1⟩⟨1|)^{⊗s} ⊗ (M − 1),   (26)

i.e. the identity matrix with its final 2 × 2 diagonal block replaced by M = (m00 m01; m10 m11). This form prescribes identity (no modification) upon every computational basis state except those for which all control qubits are in state |1⟩₁; for those states, apply M̂0 as if it were non-controlled, like in the previous section. There are only a fraction 1/2^s such states, due to the 2^s unique binary assignments of the s control qubits. This prescription is unchanged when the control and target qubits are reordered or the number of controls is varied, as intuited by swapping rows of Eq. 25 or inserting additional identities into Eq. 26.

Therefore, a general Cc(M̂t) gate modifies only 2^{N−s} amplitudes αi of the 2^N in the statevector array ψ, which satisfy

    i[cn] = 1,  ∀ cn ∈ c,   (27)

and does so under the action of the non-controlled M̂t gate. We illustrate the resulting memory access pattern in Fig. 3.

FIG. 3. The memory access pattern of Alg. 3's local simulation of the many-control one-target gate Cc(M̂t). [Panels: M̂1, C0(M̂1), C2(M̂1) and C0,2(M̂1) over amplitudes α0, …, α15.] Amplitudes in grey fail the control condition of Eq. 27 and are not modified nor accessed.

Ascertaining which amplitudes satisfy (what we dub) the "control condition" of Eq. 27 must be done efficiently, since it may otherwise induce an overhead in every iteration of the exponentially large loop of the non-controlled Alg. 2. We again leverage HPC techniques to produce a cache-efficient, branchless, vectorisable, bitwise procedure, in a derivation which hides all such nuances from the reader. We seek to iterate only the 2^{N−s} amplitudes satisfying the control condition, which have indices i with fixed bits (value 1) at c. To do so, we enumerate all (N − s)-bit integers j ∈ {0, …, 2^{N−s} − 1}, corresponding to (N − s)-qubit basis states

    |j⟩_{N−s} ≡ |j[N−s−1]⟩₁ … |j[1]⟩₁ |j[0]⟩₁.   (28)
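The excerpt omits the resulting Alg. 3 itself, but its shape follows from this derivation: enumerate j, then insert fixed bits at the control and target indices to recover the satisfying amplitude pairs. A hedged Python sketch, with illustrative names:

    import numpy as np

    def insertBit(n, t, b):
        return ((n >> t) << (t + 1)) | (b << t) | (n & ((1 << t) - 1))

    def local_manyCtrlOneTargGate(psi, ctrls, M, t):
        # enumerate only the 2^(N-s-1) pairs satisfying the control condition,
        # inserting 1-bits at the controls and the pair bit 0 at target t
        N = int(np.log2(len(psi)))
        s = len(ctrls)
        for n in range(2**(N - s - 1)):
            ib = n
            for q in sorted(list(ctrls) + [t]):   # ascending insertion order
                ib = insertBit(ib, q, 1 if q in ctrls else 0)
            ig = ib | (1 << t)                    # flip target bit to 1
            b, g = psi[ib], psi[ig]
            psi[ib] = M[0, 0]*b + M[0, 1]*g
            psi[ig] = M[1, 0]*b + M[1, 1]*g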
[FIG. 4: the memory access patterns of the many-target gate, shown for targets M̂0,1, M̂0,2, M̂0,1,2 and M̂0,1,3 over amplitudes α0, α1, … .]

The simulation is memory bound. That is, the fetching and modification of amplitudes from heap memory dominates the runtime, occluding the time of the bitwise and indexing algebra. We should expect that targeting lower-index rightmost qubits will see better caching performance than targeting high-index leftmost qubits, because the linearly combining amplitudes in the former scenario lie closer together and potentially within the same cache-lines.

Algorithm 4: [local][statevector]
Many-target gate M̂t with n unique target qubits t = {t0, …, t_{n−1}}, described by matrix M = {mij} ∈ C^{2^n × 2^n}, applied to an N-qubit pure statevector |ψ⟩.

If we wished to make the memory addresses accessed by the inner j loops of Alg. 4 be strictly increasing, we would simply initially permute the columns of M as per the ordering of t to produce matrix M′, then sort t. That is, we would leverage that

    M̂t ≡ M̂′_{sorted(t)}.   (36)

Doing so also permits array sorted(t) to be replaced with a bitmask, though we caution optimisation of the indexing will not appreciably affect the memory-dominated performance.

Finally, we caution that multithreaded deployment of this algorithm requires that each simultaneous thread maintains a private 2^n-length array (v in Alg. 4), to collect and copy the amplitudes modified by a single iteration. This scales up the temporary memory costs by factor O(num threads).

Algorithms 2, 3 and 4 presented serial methods for simulating the single-target, many-control and many-target gates respectively. To be performant, they made economical accesses to the state array ψ. Distributed simulation of these gates however will require explicit synchronisation and network communication between nodes to even access amplitudes within another node's statevector partition, which we refer to as the "sub-statevector". We adopt the message passing interface (MPI) [43], ubiquitous in parallel computing. Exchanging amplitudes through messages requires the use of a communication buffer to receive and process network-received amplitudes before local modification of a state. For the remainder of this manuscript, we employ fixed labels:

    ψ := An individual node's sub-statevector.
    φ := An individual node's communication buffer.

We identify nodes by their rank r ∈ {0, …, W − 1}. Each node's partition represents an unnormalised substate |ψr⟩_{N−w} of the full quantum state represented by the ensemble

    |Ψ⟩N ≡ ∑_r^{2^w} |r⟩w |ψr⟩_{N−w}.   (38)

In other words, node r contains the global amplitudes αi of |Ψ⟩N with indices satisfying i = r 2^{N−w} + j, for local indices j ∈ {0, …, 2^{N−w} − 1}.
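This global-to-local correspondence is just a bit split: the upper w bits of i select the node, the lower N − w bits index within its array. A small illustrative sketch:

    def globalToLocal(i, N, w):
        rank = i >> (N - w)           # upper w bits select the node
        j = i & ((1 << (N - w)) - 1)  # lower N-w bits index within psi
        return rank, j

    def localToGlobal(rank, j, N, w):
        return (rank << (N - w)) | j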
Typical simulators fix len(φ) = O(Λ). A contrary choice of len(φ) = O(1) necessitates exponentially many communication overheads [46]. Distributed simulators like Intel's IQS [45] opt for len(φ) = Λ/2, which requires that typical gate simulations perform two rounds of amplitude exchange, and may enable the second communication to overlap local state processing for a minor speedup. It also enables the simulation of a single additional qubit on typical node configurations. In contrast, simulators like Oxford's QuEST [9] fix len(φ) = Λ. This restricts simulation to one fewer qubit, though induces no appreciable slowdown due to the dominating network speeds. See References [9, 26] for a more detailed comparison of these strategies.

In this work, we fix the buffer to be of the same size as the sub-statevector, i.e.

    len(φ) := Λ.   (43)

This enables the simulation of many advanced operators precluded by a smaller buffer, though means the total memory cost (in bytes) of distributed simulation of an N-qubit pure state is 2 b 2^N, where b is the number of bytes to represent a single complex amplitude (b = 16 at double precision in the C language). This is double the serial costs shown in Fig. 1.

B. Communication costs

Quantities a, b, f are given per-node because nodes perform these operations concurrently. Quantities c, d, e are given as totals aggregated between all nodes. It will be easy to distinguish these by whether the measures include variable Λ (per-node) or N (global).

C. Communication patterns

This manuscript will present algorithms with drastically differing communication patterns, by which we refer to the graph of which nodes exchange messages with one another. The metrics of the previous section are useful for comparing two given distributed algorithms, but they cannot alone predict their absolute runtime performance. A distributed application's ultimate runtime depends on its emerging communication pattern and the underlying hardware network topology [47].

Fortunately, our algorithms share a useful property which simplifies their costs. We will see that nearly all simulated quantum operators in this manuscript yield pairwise communication, whereby a node exchanges amplitudes with a single other unique node. To be precise, if rank r sends amplitudes to r′, then it is the only rank to do so, and so too does it receive amplitudes only from r′. We visualise this below, where a circle indicates one of four nodes and an arrow indicates one of four total passed messages. [Diagram: four nodes joined pairwise by arrows.]
FIG. 5. The five paradigms of pairwise amplitude exchange used in this manuscript. Arrays ψ and φ are a node’s
sub-statevector and communication buffer respectively, and ψ ′ and φ′ are those of a paired node. During a round of
communication, all or some of the nodes will perform one of the below paradigms, with the remaining nodes idle.
a) Nodes send their full sub-statevector to directly overwrite their pair node’s communication buffer. This permits
local modification of received amplitudes in the buffer before integration into the sub-statevector.
b) Nodes pack a subset of their sub-statevector into their buffer before exchanging (a subset of) their buffers. This
reduces communication costs from a) when not all pair node amplitudes inform the new local amplitudes. Notice
that since the buffers cannot be directly swapped, amplitudes are sent to the pair node’s empty offset buffer.
c) One node of the pair packs and sends its buffer, while the receiving pair node sends nothing.
d) Nodes intend to directly swap the entirety of their sub-statevector, though must do so via a).
e) Nodes intend to directly swap distinct subsets of their sub-statevectors, though must do so via b).
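The exchange primitive underlying these paradigms (the paper's Alg. 5) is a simultaneous pairwise send and receive. A hedged sketch using mpi4py (an assumed binding; the paper targets C/C++ MPI), where MPI's Sendrecv avoids the deadlock of two blocking sends:

    from mpi4py import MPI

    def exchangeArrays(psi, phi, pairRank):
        # send our array psi to the pair node while receiving theirs into phi
        comm = MPI.COMM_WORLD
        comm.Sendrecv(sendbuf=psi, dest=pairRank, recvbuf=phi, source=pairRank)

    def getRank():
        return MPI.COMM_WORLD.Get_rank()

    def getWorldSize():
        return MPI.COMM_WORLD.Get_size()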
IV. DISTRIBUTED STATEVECTOR ALGORITHMS

This section will derive six novel distributed algorithms to simulate many-control gates, SWAP gates, many-target gates, tensors of Pauli operators, phase gadgets and Pauli gadgets. As a means of review, we begin however by deriving an existing distributed simulation technique of the one-target gate, generalising the local algorithm of Sec. II B.

A. One-target gate

Let us first revisit the one-target gate, the staple of quantum circuits, for which we derived a local simulation algorithm in Sec. II B. Our distributed simulation of this gate will inform the target performance of all other algorithms in this manuscript. We here derive a distributed in-place simulation of the one-target gate upon an N-qubit pure statevector distributed between W = 2^w nodes, with Λ = 2^{N−w} amplitudes per-node. Our algorithm prescribes O(Λ) bops and flops per-node, exactly Λ writes per-node, at most a single round of message passing in which case 2^N total amplitudes are exchanged over the network, and an O(1) memory overhead.

Let M̂t denote a general one-target gate upon qubit t ≥ 0, described by a complex matrix

    M = ( m00  m01 ; m10  m11 ) ∈ C^{2×2}.   (44)

We will later see bespoke distributed methods for faster simulation of certain families of one-target and separable gates, but we will here assume M is completely general and unconstrained. We seek to apply M̂t upon an arbitrary N-qubit pure state |Ψ⟩N which is distributed between W = 2^w nodes, each storing sub-statevector ψ of size Λ = 2^{N−w}, and an equal-sized communication buffer φ.

We showed in Eq. 19 that M̂t modifies an amplitude αi of |Ψ⟩N to become a linear combination of αi and αi¬t. Ergo to modify αi, we must determine within which node the paired αi¬t is stored. Recall that the j-th local amplitude stored within node r ≥ 0 corresponds to global amplitude αi satisfying

    |i⟩N ≡ |i[N−1]⟩₁ … |i[t]⟩₁ … |i[0]⟩₁   (45)
         ≡ |r⟩w |j⟩_{N−w}.   (46)

The paired amplitude αi¬t corresponds to basis state

    |i¬t⟩N ≡ |i[N−1]⟩₁ … |!i[t]⟩₁ … |i[0]⟩₁   (47)
           = |r′⟩w |j′⟩_{N−w}.   (48)

Flipping bit t of integer i must modify either the w-bit prefix or the (N − w)-bit suffix of i's bit sequence, and ergo modify either j or r. Two scenarios emerge:

1. When t < N − w, then

    r′ = r,  j′ = j¬t.   (49)

All paired amplitudes αi¬t are stored within the same rank r as αi, and so every node already contains all amplitudes which will determine its new values. No communication is necessary, and we say the method is "embarrassingly parallel". Furthermore, since

    M̂_{t<N−w} |r⟩w |ψ⟩_{N−w} = |r⟩w M̂t |ψ⟩_{N−w},   (50)

we can modify a node's local sub-statevector ψ in an identical manner to the local simulation of M̂t |ψ⟩. We simply invoke Alg. 2 upon ψ on every node.

2. When t ≥ N − w, then

    r′ = r¬(t−(N−w)),  j′ = j.   (51)

Every paired amplitude to those in node r is stored in a single other node r′. We call this the "pair node". Ergo each node must send its full sub-statevector ψ to the buffer φ′ of its pair node. This is the paradigm seen in Fig. 5 a). Thereafter, because j′ = j above, each local amplitude ψ[j] will linearly combine with the received amplitude φ[j], weighted as per Eq. 19.

The memory access and communication patterns of these scenarios are illustrated in Fig. 6. In general, we cannot know in advance which of scenarios 1. or 2. will be invoked during distributed simulation, because N, w and t are all user-controlled parameters. So our algorithm to simulate M̂t must incorporate both. We formalise this scheme in Alg. 6.

Let us compare the costs of local vs distributed simulation of the one-target gate (i.e. Alg 2 against Alg 6). The former prescribed a total of O(2^N) bops and flops to be serially performed by a single machine. The latter exploits the parallelisation of W distributed nodes and involves only O(Λ) = O(2^N/W) bops and flops per machine (a uniform load), suggesting an O(W) speedup. However, when the upper w qubits are targeted, inter-node communication was required and all O(2^N) amplitudes
were simultaneously exchanged over the network in a pairwise fashion. While the relative magnitude of this overhead depends on many physical factors (like network throughput, CPU speeds, cache throughput) and virtual parameters (the number of qubits, the prescribed memory access pattern), distributed simulation of this kind in realistic regimes shows excellent weak scaling and the network overhead is tractable [9]. This means that if we introduce an additional qubit (doubling the serial costs) while doubling the number of distributed nodes, our total runtime should be approximately unchanged. As such, our one-target gate distributed simulation sets a salient performance threshold. We endeavour to maintain this weak scaling in the other algorithms of this manuscript.

[FIG. 6: the memory access and communication patterns of Alg. 6's distributed one-target gate, shown for M̂0, M̂1 and M̂2 over amplitudes α0, …, α13 and rings of 8 communicating nodes.]

Algorithm 6: [distributed][statevector]
One-target gate M̂t, where M = (m00 m01; m10 m11), upon an N-qubit pure statevector distributed between 2^w nodes as local ψ (buffer φ).
[O(Λ) bops][O(Λ) flops][0 or 1 exchanges]
[O(2^N) exchanged][O(1) memory][Λ writes]

 1  distrib_oneTargGate(ψ, φ, M, t):
 2      r = getRank()                    // Alg. 5
 3      w = log2(getWorldSize())         // Alg. 5
 4      Λ = len(ψ)
 5      N = log2(Λ) + w
        // embarrassingly parallel
 6      if t < N − w:
 7          local_oneTargGate(ψ, M, t)   // Alg. 2
        // full sub-state exchange is necessary
 8      else:
            // exchange with r′
 9          q = t − (N − w)
10          r′ = flipBit(r, q)           // Alg. 1
11          exchangeArrays(ψ, φ, r′)     // Alg. 5
            // determine row of M
12          b = getBit(r, q)             // Alg. 1
            // modify local amplitudes
            # multithread
13          for j in range(0, Λ):
14              ψ[j] = m_{b,b} ψ[j] + m_{b,!b} φ[j]
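Assembling the earlier sketches, a hedged end-to-end rendering of Alg. 6 in Python with mpi4py and numpy might look as follows; local_oneTargGate is the example from Sec. II B, and all other names are illustrative. The final vectorised line replaces the multithreaded loop of the pseudocode.

    import numpy as np
    from mpi4py import MPI

    def distrib_oneTargGate(psi, phi, M, t):
        # psi: local sub-statevector (length lam = 2^(N-w)); phi: equal buffer
        comm = MPI.COMM_WORLD
        r = comm.Get_rank()
        w = int(np.log2(comm.Get_size()))
        N = int(np.log2(len(psi))) + w
        if t < N - w:
            local_oneTargGate(psi, M, t)   # embarrassingly parallel (Alg. 2)
        else:
            q = t - (N - w)
            pair = r ^ (1 << q)            # flipBit(r, q): the pair node
            comm.Sendrecv(psi, dest=pair, recvbuf=phi, source=pair)
            b = (r >> q) & 1               # which row of M this node applies
            psi[:] = M[b, b]*psi + M[b, 1 - b]*phi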
B. Many-control one-target gate

Introducing control qubits to the previous one-target gate empowers it to become entangling and universal [35, 36], which for its relative simplicity establishes it as the entangling primitive of many quantum algorithms [37]. Section II C derived a local, serial algorithm to effect the many-control one-target gate, which we now adapt for distributed simulation. We derive an in-place distributed simulation of the one-target gate with s control qubits upon an N-qubit pure state. Our method prescribes as few as O(sΛ/2^s) (at most O(Λ)) bops per node, O(Λ/2^s) (at most O(Λ)) flops and writes per node, at most a single round of communication whereby a total of O(2^N/2^s) amplitudes are exchanged, and a fixed memory overhead.

We consider the operator Cc(M̂t) where c = {c0, …, c_{s−1}} is an arbitrarily ordered list of s unique control qubits, t ∉ c is the target qubit, and where M̂ is described by matrix M ∈ C^{2×2} (as in the previous section). We seek to apply Cc(M̂t) upon an arbitrary N-qubit pure state |Ψ⟩N which is distributed between W = 2^w nodes, each with sub-statevector ψ of size Λ = 2^{N−w} and an equal-sized communication buffer φ.

Section II C established that Cc(M̂t) modifies only the 2^{N−s} global amplitudes αi of full-state |Ψ⟩N which satisfy the control condition of Eq. 27, doing so under the non-controlled action of M̂t. Since all amplitudes failing the control condition are neither modified nor involved in the modification of other amplitudes, they need not be communicated over the network in any circumstance. This already determines the communication pattern, which we illustrate (as a memory access diagram) in Figure 7.

FIG. 7. The memory access pattern of Alg. 7's distributed simulation of the many-control one-target gate Cc(M̂t). Each column shows 16 amplitudes distributed between 4 nodes. If a node contains only grey amplitudes which fail the control condition, it need not perform any communication.

Explicitly deriving this pattern is non-trivial. We again invoke that each local amplitude index j in node r is equivalent to a global index i satisfying

    |i⟩N ≡ |i[N−1]⟩₁ … |i[t]⟩₁ … |i[0]⟩₁   (52)
         ≡ |r⟩w |j⟩_{N−w}.   (53)

We call |r⟩w the prefix substate and |j⟩_{N−w} the suffix. As per Eq. 49, when t < N − w, simulation is embarrassingly parallel regardless of c. But when t ≥ N − w, three distinct scenarios emerge.

1. When all controls lie within the prefix, i.e.

    cn ≥ N − w,  ∀ cn ∈ c.   (54)

The control condition is determined entirely by a node's rank r, so that every or none of the amplitudes therein satisfy it. Ergo, some nodes simulate M̂t as normal while others do nothing; this is seen in column 4 of Fig. 7. Because only a fraction 1/2^s of nodes satisfy r[cn−(N−w)] = 1 ∀ cn ∈ c, the number of communicating nodes, and hence the total data size communicated, exponentially shrinks with additional controls. We visualise this communication pattern below, where circles indicate nodes, grey symbols indicate their rank, arrows indicate a message, and black symbols indicate the number of amplitudes in each message. [Diagram: nodes r through r + 2^s − 1 exchanging messages of Λ amplitudes.]
2. When all controls lie within the suffix, i.e.

    cn < N − w,  ∀ cn ∈ c.   (55)

The control condition is independent of rank, and since every node then contains every assignment of bits c, all nodes contain amplitudes which satisfy the condition and need communicating. There are precisely Λ/2^s such amplitudes per node. This scenario is seen in column 2 of Fig. 7, and is illustrated below. [Diagram: all nodes 0 through W − 1 exchanging messages of Λ/2^s amplitudes.]

Because not all local amplitudes are communicated, it is prudent to pack only those which are into the local communication buffer before exchanging the packed buffer subset. This is paradigm b) of Fig. 5, and drastically reduces the total number of amplitudes transferred over the network by a factor 1/2^s.

3. When the controls are divided between the prefix and suffix, i.e.

    ∃ cn ≥ N − w and ck < N − w.   (56)

Only a fraction of ranks satisfy the control condition, as do only a fraction of the amplitudes therein; so some nodes exchange only some of their amplitudes. Let us distinguish between the controls acting upon the prefix and suffix substates:

    c(p) = {cn ∈ c : cn ≥ N − w},   (57)
    c(s) = {cn ∈ c : cn < N − w},   (58)
    s(p) = len(c(p)),  s(s) = len(c(s)).   (59)

A fraction 1/2^{s(p)} of nodes communicate, exchanging a fraction 1/2^{s(s)} of their local amplitudes. This is seen in column 3 of Fig. 7, and below. [Diagram: nodes r through r + 2^{s(p)} − 1 exchanging messages of Λ/2^{s(s)} amplitudes.]

Once again, the communicating nodes will pack the relevant subset of their local amplitudes into buffers before exchanging, as per Fig. 5 b).

In all these scenarios, the rank r′ of the pair node with which node r communicates (controls permitting) is determined by the target qubit t in the same manner as for the non-controlled gate M̂t. There are several additional observations to make when decomposing distributed simulation of Cc(M̂t) into subtasks, which we succinctly summarise in the comments of Alg. 7.

Algorithm 7: [distributed][statevector]
Many-control one-target gate Cc(M̂t) with s unique control qubits c = {c0, …, c_{s−1}} and target t ∉ c, described by matrix M = (m00 m01; m10 m11) ∈ C^{2×2}, applied to an N-qubit pure statevector distributed between 2^w nodes as local ψ (buffer φ).
[best O(sΛ/2^s), worst O(Λ) bops]
[best O(Λ/2^s), worst O(Λ) flops]
[0 or 1 exchanges][O(2^N/2^s) exchanged]
[O(1) memory][best Λ/2^s, worst Λ writes]

 1  distrib_manyCtrlOneTargGate(ψ, φ, c, M, t):
 2      r = getRank()                        // Alg. 5
 3      λ = log2(len(ψ))                     // = N − w
        // separate prefix and suffix controls
 4      c(p) = {q − λ : q ≥ λ, ∀ q ∈ c}
 5      c(s) = {q : q < λ, ∀ q ∈ c}
        // halt if r fails control condition
 6      if not allBitsAreOne(r, c(p)):       // Alg. 1
 7          return
        // update as Cc(M̂t)|Ψ⟩ = |r⟩ Cc(s)(M̂t)|ψ⟩
 8      if t < λ:
 9          local_manyCtrlOneTargGate(ψ, c(s), M, t)   // Alg. 3
        // exchange with r′ is necessary
10      if t ≥ λ:
            // all local αj satisfy condition,
            // so controls can be disregarded
11          if len(c(s)) == 0:
12              distrib_oneTargGate(ψ, φ, M, t)        // Alg. 6
            // only a subset of local αj satisfy,
            // determined only by suffix controls
13          else:
14              r′ = flipBit(r, t − λ)                 // Alg. 1
15              distrib_ctrlSub(ψ, φ, r′, c(s), M, t)  // Alg. 8

The performance of this algorithm varies drastically with the configuration of qubits c and t, but all costs are upper-bounded by those to simulate M̂t via Alg. 6. When t < N − w, the method is embarrassingly parallel. Otherwise, when none of the left-most w qubits are controlled, every node has an identical task of sending and locally modifying Λ/2^s amplitudes.
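The subroutine delegated to above (Alg. 8, not shown in this excerpt) packs, exchanges and combines only the satisfying amplitudes. A hedged mpi4py sketch of its shape, with illustrative names; both pair nodes enumerate the identical satisfying indices, so the packed orders agree.

    import numpy as np
    from mpi4py import MPI

    def distrib_ctrlSub(psi, phi, pairRank, suffixCtrls, M, t):
        comm = MPI.COMM_WORLD
        lam = int(np.log2(len(psi)))
        b = (comm.Get_rank() >> (t - lam)) & 1   # row of M this node applies
        # local indices whose suffix-control bits are all 1
        j = np.arange(len(psi))
        mask = np.all([(j >> c) & 1 == 1 for c in suffixCtrls], axis=0)
        n = mask.sum()
        phi[:n] = psi[mask]                      # pack (paradigm b of Fig. 5)
        recv = np.empty(n, dtype=psi.dtype)
        comm.Sendrecv(phi[:n], dest=pairRank, recvbuf=recv, source=pairRank)
        psi[mask] = M[b, b]*psi[mask] + M[b, 1 - b]*recv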
    SWAP_{t1,t2} |i⟩N = |z⟩_{N−t2−1} |i[t1]⟩₁ |y⟩_{t2−t1−1} |i[t2]⟩₁ |x⟩_{t1}
                      = |i⟩N,           if i[t1] = i[t2]
                        |i¬{t1,t2}⟩N,   if i[t1] ≠ i[t2].   (63)

This induces a change only when the principal bits differ; swapping them is ergo equivalent to flipping both. Therefore SWAP_{t1,t2} upon the full state |Ψ⟩N = ∑_i αi |i⟩N swaps a subset of amplitudes;

    αi → αi,           if i[t1] = i[t2]
         αi¬{t1,t2},   if i[t1] ≠ i[t2].   (64)

We call αi¬{t1,t2} = α(i¬t1)¬t2 the "pair amplitude". Recall that the j-th local amplitude stored within node r corresponds to global amplitude αi satisfying

    |i⟩N ≡ |r⟩w |j⟩λ,  where λ = N − w.   (65)

Three distinct scenarios emerge (assuming t2 > t1).

1. When t2 < λ, and consequently

    |i¬{t1,t2}⟩N = |r⟩w |j¬{t1,t2}⟩λ.   (66)

The pair amplitude is contained within the same node, and simulation is embarrassingly parallel.

2. When t1 ≥ λ, such that

    |i¬{t1,t2}⟩N = |r¬{t1−λ, t2−λ}⟩w |j⟩λ.   (67)

If r[t1−λ] ≠ r[t2−λ] (as satisfied by half of all nodes), then all local amplitudes in node r must be exchanged with their pair amplitudes within pair node r′ = r¬{t1−λ, t2−λ}. No modification is necessary, so the nodes directly swap their sub-statevectors ψ as per Fig. 5 d). The remaining nodes contain only amplitudes with global indices i satisfying i[t1] = i[t2], and so do nothing. In total 2^N/2 amplitudes are exchanged, in parallel batches of Λ. [Diagram: half of all nodes pairwise exchanging messages of Λ amplitudes.]

3. When t1 < λ and t2 ≥ λ, in which case

    |i¬{t1,t2}⟩N = |r¬(t2−λ)⟩w |j¬t1⟩λ.   (68)

Every node r must exchange amplitudes with pair node r′ = r¬(t2−λ), but only those amplitudes of local index j satisfying j[t1] ≠ r[t2−λ]; this is half of all local amplitudes. In this scenario, a total of 2^N/2 amplitudes are exchanged in parallel batches of Λ/2, and the node load is homogeneous. Pairs exchange via Fig. 5 e). [Diagram: all nodes 0 through W − 1 exchanging messages of Λ/2 amplitudes.] Notice that the destination local address j′ = j¬t1 differs from the source local address; we visualise this in Fig. 8. We expand upon the nuances of this scenario below.

FIG. 8. The amplitudes which require swapping in scenario 3. of Alg. 9's distributed simulation of SWAP_{t1,t2}. [Rows pair local indices |z 0 y 0 x⟩ ↔ |z 1 y 0 x⟩, |z 0 y 1 x⟩ ↔ |z 1 y 1 x⟩, etc., between the sub-statevectors of the two nodes.] Arrows (and colour) connect elements of the local sub-statevector ψ which must be swapped between nodes r and r′. This is the memory exchange pattern e) of Fig. 5.

Because scenario 3. requires that paired nodes exchange only half their local amplitudes, these amplitudes should first be packed into the communication buffers as per Fig. 5 b), like was performed for the multi-controlled gate in Sec. IV B. This means packing every second contiguous batch of 2^{t1} amplitudes before swapping the buffers, incurring local memory penalties but halving the communicated data. Note too that when t1 = λ − 1, packing is unnecessary; the first (or last) contiguous half of a node's sub-statevector can be directly sent to the pair node's buffer, although we exclude this optimisation from our pseudocode.
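The packing just described, selecting every second contiguous batch of 2^{t1} amplitudes, is a simple reshape in a hedged numpy sketch (illustrative names only):

    import numpy as np

    def packForSwap(psi, t1, rankBit):
        # pack the half of psi whose local bit j[t1] differs from the pair's
        # distinguishing rank bit, i.e. every second batch of 2^t1 amplitudes
        batches = psi.reshape(-1, 2, 2**t1)   # axis 1 is bit t1 of index j
        return batches[:, 1 - rankBit, :].ravel()

    # e.g. 16 local amplitudes, t1 = 2, rank bit 0: packs indices 4-7, 12-15
    print(packForSwap(np.arange(16.0), 2, 0))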
The local indices of amplitudes modified by the SWAP gate are determined through the same bit algebra used in the previous algorithms. We present the resulting scheme in Alg. 9, and its communication pattern in Fig. 9. This algorithm prescribes no floating-point operations, an exchange of (at most) half of all amplitudes, and at most a single round of pairwise communication. Despite being an experimentally fearsome two-qubit gate, we have shown the SWAP gate is substantially cheaper to classically simulate than the one-qubit gate of Alg. 6.

Algorithm 9: [distributed][statevector]
SWAP_{t1,t2} gate upon qubits t1 and t2 > t1 of an N-qubit pure state distributed between W = 2^w nodes as local arrays ψ with buffers φ.
[O(Λ) bops][0 flops][0 or 1 exchanges]
[2^N/2 exchanged][O(1) memory][Λ/2 writes]

 1  distrib_swapGate(ψ, φ, t1, t2):
 2      Λ = len(ψ)
 3      λ = log2(Λ)
        // embarrassingly parallel
 4      if t2 < λ:
            // loop |k⟩_{λ−2} ≡ |z⟩_{N−t2−1} |y⟩_{t2−t1−1} |x⟩_{t1}
            # multithread
 5          for k in range(0, Λ/4):
                // |jab⟩λ = |z⟩ |a⟩₁ |y⟩ |b⟩₁ |x⟩
 6              j11 = insertBits(k, {t1, t2}, 1)   // Alg. 1
 7              j10 = flipBit(j11, t1)             // Alg. 1
 8              j01 = flipBit(j11, t2)             // Alg. 1
 9              ψ[j01], ψ[j10] = ψ[j10], ψ[j01]    // swap

[FIG. 9: communication patterns of Alg. 9's distributed SWAP gate, shown for SWAP 0,4; SWAP 0,5; SWAP 0,6; SWAP 0,7; SWAP 4,5 and SWAP 4,6 upon 16 nodes.]

FIG. 11. The total effective communication pattern (left) and those of each decomposed step (right) of Alg. 10's distributed simulation of the 4-target gate M̂t upon the upper qubits t = {4, 5, 6, 7} of an 8-qubit statevector.
Algorithm 10: [distributed][statevector]
Many-target gate M̂t with n unique target qubits t, applied to an N-qubit pure statevector distributed between 2^w nodes as local ψ (buffer φ).
[O(2^n Λ) bops][O(2^n Λ) flops]
[best: 0, worst: 2 min(n, w) exchanges]
[worst: 2^N min(n, w) exchanged]
[O(2^n) memory][O(2^n Λ) writes]

 1  distrib_manyTargGate(ψ, φ, M, t):
 2      λ = log2(len(ψ))   // = N − w
        // need fewer targets than suffix qubits
 3      n = len(t)
 4      if n > λ:
 5          fail
        // locate lowest non-targeted qubit
 6      b = getBitMask(t)   // Alg. 1
 7      q = 0
 8      while getBit(b, q) == 1:   // Alg. 1
 9          q += 1
        // record which targets require swapping
10      t′ = {}
11      for i in range(0, n):
12          if t[i] < λ:
13              t′[i] = t[i]
14          else:
15              t′[i] = q
16              q += 1
17              while getBit(b, q) == 1:   // Alg. 1
18                  q += 1
        // perform necessary swaps
19      for i in range(0, n):
20          if t′[i] ≠ t[i]:
21              distrib_swapGate(ψ, φ, t′[i], t[i])   // Alg. 9
        // embarrassingly parallel
22      local_manyTargGate(ψ, M, t′)   // Alg. 4

E. Pauli tensor

The one-target Pauli operators X̂, Ŷ and Ẑ are core primitives of quantum information theory, and their controlled and many-target extensions appear extensively in experimental literature. For instance, Toffoli gates [62], fan-out gates [63], many-control many-target Ẑ gates [64] and others appear as natural primitive operators for Rydberg atom computers [65]. Many-control many-target X̂ gates appear in quantum arithmetic circuits [66]. But most significantly, Pauli tensors form a natural basis for Hermitian operators like Hamiltonians, and their efficient simulation would enable fast calculation of quantities like expectation values. Rapid, direct simulation of a Pauli tensor will also enable efficient simulation of more exotic operators like Pauli gadgets, as we will explore in Sec. IV G.

As always, let Λ be the size of each node's sub-statevector. In this section, we derive a distributed in-place simulation of the n-qubit Pauli tensor which prescribes O(n Λ) bops, Λ flops, Λ memory writes, no memory overhead and at most a single round of communication. This makes it more efficient than the one-target gate of Alg. 6, despite prescribing a factor O(n) more bops. Further, all amplitude modifications happen to be multiplication with unit scalars ±1 or ±i, which may enable optimisation on some systems and with some amplitude types.

We consider the separable n-qubit unitary

    Ût = ⊗_q^n σ̂^{(q)}_{tq}.   (76)

Alternatively, we could simulate each one-target operator in turn, utilising

    Ût |Ψ⟩N = σ̂^{(n−1)}_{t_{n−1}} ( … σ̂^{(1)}_{t_1} ( σ̂^{(0)}_{t_0} |Ψ⟩N ) … ),   (78)

and perform a total of n invocations of Alg. 6. This would cost n Λ flops and writes, and at most n rounds of communication. Still, a superior method can accelerate things by a factor n.

Neither of these naive methods leverage two useful properties of the Pauli matrices; that they have unit complex amplitudes (as do their Kronecker products), and that they are diagonal or anti-diagonal. These properties respectively enable simulation of Ût with no arbitrary floating-point operations (all instead become sign flips and complex component swaps) and in (at most) a single round of communication. We now derive such a scheme.

A single Pauli operator σ̂q maps an N-qubit basis state |i⟩N to

    X̂q |i⟩N = |i¬q⟩N,   (79)
    Ŷq |i⟩N = i (−1)^{i[q]} |i¬q⟩N.   (80)

The explicit action of the Pauli tensor is to modify the amplitudes under

    αi → β_{i¬t^{x,y}} α_{i¬t^{x,y}}   (under Ût),   (88)

where the complex unit βi is trivially bitwise evaluable, independent of any amplitude. Notice too that unlike the previously presented gates in this manuscript, Eq. 88 does not prescribe any superposition or linear combination of the amplitudes with one another; only the swapping of amplitudes and their multiplication with a complex unit.

We now derive the communication strategy to effect Eq. 88. Recall that the j-th local amplitude in node r < 2^w corresponds to global amplitude αi of basis state

    |i⟩N ≡ |r⟩w |j⟩λ,  λ = N − w.   (89)

Under Ût, this amplitude swaps with that of index

    i¬t^{x,y} ≡ |r¬t″⟩w |j¬t′⟩λ,   (90)

where t′ and t″ denote (respectively) the suffix and prefix qubits targeted by X̂ or Ŷ.
FIG. 13. Some communication patterns of Alg. 11's simulation of the Pauli tensor upon an 8-qubit statevector distributed between 16 nodes. [Panels: X4⊗X5⊗X6, X4⊗X5⊗X7, X4⊗X6⊗X7, X5⊗X6⊗X7, X4⊗X5⊗X6⊗X7, C7(X4⊗X5), C5(X4⊗X7), C4(X5⊗X6⊗X7), C4,7(X5⊗X6) and C4,5,7(X6).] Only qubits t ≥ 4 trigger communication, and X̂ and Ŷ targets do so identically, so only X̂ is demonstrated. This is incidentally the same pattern admitted by the Pauli gadget of Alg. 14 (Sec. IV G).

FIG. 14. The communication necessary to simulate a pair of Hadamard gates, H_{N−1} ⊗ H_{N−2}, on the uppermost qubits. While each Hadamard is pairwise simulable by Alg. 6, their direct tensor is not. This is unlike Pauli tensors, which are always pairwise simulable.

To clarify this property, consider the ordered three-bit sequences below, which share a colour if they are bitwise complements of one another.

    |0⟩₃ = 000
    |1⟩₃ = 001
    |2⟩₃ = 010
    |3⟩₃ = 011
    |∼3⟩₃ = 100
    |∼2⟩₃ = 101
    |∼1⟩₃ = 110
    |∼0⟩₃ = 111

When a subset t of all qubits are targeted, it holds that

    |i⟩N ≡ |k, j⟩N  ⟹  |i¬t⟩N = |k, 2^{len(t)} − j − 1⟩N,   (94)

where binary integer j spans the classical states of t. This means we can iterate the first 2^{len(t)}/2 of the len(t)-length bit-sequences (each corresponding to amplitude αj) and flip all bits in t′ to access the paired amplitude αj¬t′. The result is that all pairs of amplitudes (to be swapped) are iterated precisely once. We formalise this scheme in Alg. 11.

Our simulation of the n-target Pauli tensor is as efficient as the one-qubit gate of Alg. 6, and prescribes a factor n fewer communications than if each Pauli was simulated in turn, or as a dense matrix via Alg. 10. Yet, we have neglected several optimisations possible for specific input parameters. For instance, when all of the Paulis are Ẑ, the entire operator is diagonal and can be simulated in an embarrassingly parallel manner as a phase gadget, Ẑ^⊗ ≃ e^{i (π/2) Ẑ^⊗}, as described in the next section. Further, when none of the lowest (rightmost) N − w qubits are targeted by X̂ or Ŷ gates, then after amplitudes are exchanged, the local destination address j′ is equal to the original local index j. The sub-statevectors ψ are effectively directly swapped, as per the paradigm in Fig. 5 d), before local scaling (multiplying by factor βi). Bespoke iteration of this situation, avoiding the otherwise necessary bitwise arithmetic, can improve caching performance and auto-vectorisation.
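To make Eq. 88 concrete, here is a hedged numpy sketch of the Pauli tensor's local action (not the paper's Alg. 11): it evaluates the units βi bitwise and permutes amplitudes in a single vectorised pass. Names are illustrative.

    import numpy as np

    def applyPauliTensor(psi, paulis):
        # paulis: dict mapping target qubit -> 'X', 'Y' or 'Z'
        i = np.arange(len(psi))
        txy = sum(1 << q for q, s in paulis.items() if s in 'XY')  # bits to flip
        eta = 1j ** sum(s == 'Y' for s in paulis.values())
        parity = np.zeros(len(psi), dtype=np.int64)
        for q, s in paulis.items():
            if s in 'YZ':
                parity ^= (i >> q) & 1   # f(i): count of set Y/Z-targeted bits, mod 2
        beta = eta * (-1.0)**parity      # the units beta_i of Eq. 88
        psi[:] = beta[i ^ txy] * psi[i ^ txy]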
Note a potential optimisation is possible. The main loop, which evaluates the sign si = ±1 per-amplitude, can be divided into two loops which each iterate only the amplitudes a priori known to yield even (and odd) parity (respectively), pre-determining si. This may incur more caching costs from the twice iteration of the state array, but enables a compiler to trivially vectorise the loop's array modification.

G. Pauli gadget

We define the Pauli gadget as an n-qubit unitary operator with parameter θ ∈ R,

    Ût(θ) = exp( i θ ⊗_j^n σ̂^{(j)}_{tj} ),   (102)

comprised of Pauli operators σ̂^{(j)} ∈ {X̂, Ŷ, Ẑ} and acting upon target qubits t = {t0, …, t_{n−1}}. We seek to apply Ût(θ) upon an N-qubit statevector |Ψ⟩N distributed between 2^w nodes.

Once again, it is prudent to avoid constructing an exponentially large matrix description of Ût(θ) and simulating it through the multi-target gate of Alg. 10. A clever but ultimately unsatisfactory solution, inspired by our scheme to effect the phase gadget of the previous section, is to rotate the target qubits of the gadget into the eigenstates of the corresponding Pauli operator. For example,

    exp( i θ X̂0 Ŷ1 Ẑ2 ) ≡ R̂Y0(π/2) R̂X1(−π/2) × exp( i θ Ẑ0 Ẑ1 Ẑ2 ) × R̂Y0(−π/2) R̂X1(π/2).   (103)

The Pauli gadget modifies the i-th global amplitude of general state |Ψ⟩ = ∑_i αi |i⟩N under

    αi → cos(θ) αi + i sin(θ) β_{i¬t^{x,y}} α_{i¬t^{x,y}}   (under Ût(θ)),   (107)

where βi = (−1)^{f(i)} η ∈ {±1, ±i}, where t^{x,y} ⊆ t are the target qubits with corresponding X̂ or Ŷ operators, η = i^{len(t^y)} ∈ {±1, ±i}, and function f(i) = len({q ∈ t^{y,z} : i[q] = 1}) ∈ N counts the Ŷ- or Ẑ-targeted qubits in the |1⟩₁ state within |i⟩N. This is similar to the modification prescribed by the Pauli tensor as per Eq. 88, and features the same pair amplitude α_{i¬t^{x,y}} with factor β_{i¬t^{x,y}}, although now the new amplitude also depends on its old value.
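Eq. 107 likewise renders into a short hedged numpy sketch (a local, illustrative analogue of the paper's Alg. 14), computing the same units βi as the Pauli-tensor sketch above and linearly combining each amplitude with its bit-flipped pair:

    import numpy as np

    def applyPauliGadget(psi, paulis, theta):
        # paulis: dict mapping target qubit -> 'X', 'Y' or 'Z'
        i = np.arange(len(psi))
        txy = sum(1 << q for q, s in paulis.items() if s in 'XY')
        eta = 1j ** sum(s == 'Y' for s in paulis.values())
        parity = np.zeros(len(psi), dtype=np.int64)
        for q, s in paulis.items():
            if s in 'YZ':
                parity ^= (i >> q) & 1
        beta = eta * (-1.0)**parity
        pair = i ^ txy
        psi[:] = np.cos(theta)*psi + 1j*np.sin(theta)*beta[pair]*psi[pair]

As a sanity check, an all-Ẑ input reduces to the phase gadget: pair = i, and each amplitude is scaled by exp(i θ (−1)^{f(i)}).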
A. State representation

Sec. III then distributed Ψ between W = 2^w nodes, each with a local sub-statevector ψ of length 2^{N−w}. We must now perform a similar procedure for representing a density matrix.

Consider a general N-qubit density matrix ρN with elements αkl ∈ C corresponding to structure

    ρN = ∑_k^{2^N} ∑_l^{2^N} αkl |k⟩⟨l|N.   (108)

The basis projector |k⟩⟨l|N can be numerically instantiated as an R^{2^N × 2^N} matrix. Instead, we vectorise it under the Choi–Jamiolkowski isomorphism [87, 88], admitting the same form as a 2N-qubit basis ket of index i = k + l 2^N,

    ||i⟩⟩_{2N} ≃ |l⟩N |k⟩N,   (109)

which we numerically instantiate as an array. We have used notation ||i⟩⟩_{2N} to indicate a state which is described with a 2N-qubit statevector but which is not necessarily a pure state of 2N qubits; it will instead generally be unnormalised and describe a mixed state of N qubits. We henceforth refer to this as a "Choi-vector".

Very small density matrices occupy an impractical regime ill-suited for distribution. Even employing more nodes than there are columns of the encoded density matrix is a considerable waste of parallel resources. We are ergo safe to impose an important precondition for this section's algorithms:

    N ≥ w.   (113)

That is, we assume the number of density matrix elements stored in each node, Λ, is at least 2^N, or one column's worth. The smallest density matrix that W nodes can cooperatively simulate then has N ≥ log2(W) qubits. This is in no way restrictive; employing 32 nodes, for example, would require we simulate density matrices of at least 5 qubits, and 1024 nodes demand at least 10 qubits; both are trivial tasks for a single node. Furthermore, using 4096 nodes of ARCHER2 [12], our precondition requires we employ at least ≈10^{−6} % of local memory (to simulate a measly 12 noisy qubits); it is prudent to use over 50%! It is fortunate that precondition 113 is safely assumed, since it proves critical to eliminate many tedious edge-cases from our subsequent algorithms.
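The Choi-vector encoding of Eq. 109 is simply a column-major flattening of the density matrix, as a small illustrative numpy check confirms:

    import numpy as np

    def toChoiVector(rho):
        # column-major flattening realises ||i>> with index i = k + l * 2^N
        return rho.flatten(order='F')

    rho = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)  # |+><+|
    v = toChoiVector(rho)
    assert v[0 + 1*2] == rho[0, 1]   # i = k + l*2^N with k=0, l=1, N=1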
We can improve on this for specific unitaries. The previous section's bespoke statevector algorithms for simulating SWAP gates (Alg. 9), Pauli tensors (Alg. 11), phase gadgets (Alg. 13) and Pauli gadgets (Alg. 14) were markedly more efficient than their simulation as general many-target gates via Alg. 10. We can repurpose every one of these algorithms for simulating the same operations upon density matrices represented as Choi-vectors. Observe:

    Û = SWAP                          ⟹  Û* = Û,
    Û = X̂^⊗ Ŷ^⊗m Ẑ^⊗                  ⟹  Û* = (−1)^m Û,
    Û(θ) = exp(i θ Ẑ^⊗)               ⟹  Û(θ)* = Û(−θ),
    Û(θ) = exp(i θ X̂^⊗ Ŷ^⊗m Ẑ^⊗)      ⟹  Û(θ)* = Û((−1)^{m+1} θ).

The above Û* can be directly effected upon a Choi-vector through Û's statevector algorithm, without needing a matrix construction. We make this strategy explicit in Alg. 17.

Algorithm 17: [distributed][density matrix]
The SWAP gate, Pauli tensor, phase gadget and Pauli gadget upon an N-qubit density matrix with Choi-vector distributed between arrays ρ and same-sized per-node buffer φ. These are special cases of Alg. 16 which avoid construction of a unitary matrix.

        // for clarity, assume N is global
 1  N = getNumQubits(ρ)                      // Alg. 15

 2  distrib_density_swapGate(ρ, φ, t1, t2):
 3      distrib_swapGate(ρ, φ, t1, t2)       // Alg. 9
 4      distrib_swapGate(ρ, φ, t1 + N, t2 + N)

 5  distrib_density_pauliTensor(ρ, φ, σ, t):
 6      distrib_pauliTensor(ρ, φ, σ, t)      // Alg. 11
 7      m = 0
 8      for q in range(0, len(t)):
 9          t[q] += N
10          if σ[q] is Ŷ:
11              m = !m
12      distrib_pauliTensor(ρ, φ, σ, t)      // Alg. 11
13      if m == 1:
            # multithread
14          for j in range(0, len(ρ)):
15              ρ[j] *= −1

16  distrib_density_phaseGadget(ρ, t, θ):
17      distrib_phaseGadget(ρ, t, θ)         // Alg. 13
18      for q in range(0, len(t)):
19          t[q] += N
20      distrib_phaseGadget(ρ, t, −θ)        // Alg. 13

21  distrib_density_pauliGadget(ρ, φ, σ, t, θ):
22      distrib_pauliGadget(ρ, φ, σ, t, θ)   // Alg. 14
23      θ *= −1
24      for q in range(0, len(t)):
25          t[q] += N
26          if σ[q] is Ŷ:
27              θ *= −1
28      distrib_pauliGadget(ρ, φ, σ, t, θ)   // Alg. 14

C. Kraus map

The power of a density matrix state description is its ability to capture classical uncertainty as a result of decohering processes. A Kraus map, also known as an operator-sum representation of a channel [86], allows an operational description of an open system's dynamics which abstracts properties of the environment. A Kraus map can describe any completely-positive trace-preserving channel, and ergo capture almost all quantum noise processes of practical interest. In this section, we derive a distributed simulation of an n-qubit Kraus map of M operators, each specified as general matrices, acting upon an N-qubit density matrix represented as a Choi-vector. We assume M ≪ 2^{2N} such that M 2^{4n} ≪ 2^{2N−w}; i.e. that the descriptions of the Kraus channels are much smaller than the distributed density matrix, so are safely duplicated on each node. We will strictly require that n ≤ N − ⌈w/2⌉ to satisfy memory preconditions imposed by an invoked subroutine (incidentally, Alg. 10's simulation of the many-target gate upon a statevector).

An n-qubit Kraus map consisting of M Kraus operators {K̂^{(m)}_t : m < M}, each described by a matrix K^{(m)} ∈ C^{2^n × 2^n}, operating upon qubits t (where n = len(t)) modifies an N-qubit density matrix ρ via

    ρ → E(ρ) = ∑_m^M K̂^{(m)}_t ρ K̂^{(m)†}_t.   (125)
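The correctness of Alg. 17 rests on the standard vectorisation identity vec(U ρ U†) = (U* ⊗ U) vec(ρ) under our column-major encoding of Eq. 109, whereby Û acts on the lower N qubits and Û* on the upper. A tiny numpy check (any unitary works; the names are illustrative):

    import numpy as np

    U = np.array([[1, -1j], [-1j, 1]]) / np.sqrt(2)   # e.g. Rx(pi/2)
    rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
    lhs = (U @ rho @ U.conj().T).flatten(order='F')   # evolve, then vectorise
    rhs = np.kron(U.conj(), U) @ rho.flatten(order='F')
    assert np.allclose(lhs, rhs)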
Algorithm 19: [distributed][density matrix]
One-qubit dephasing of qubit t with probability p of an N-qubit density matrix with Choi-vector distributed between arrays ρ of length Λ among 2^w nodes.
[O(Λ) bops][Λ/2 flops][0 exchanges]
[0 exchanged][O(1) memory][Λ/2 writes]

 1  distrib_oneQubitDephasing(ρ, t, p):
 2      Λ = len(ρ)
 3      N = getNumQubits(ρ)             // Alg. 15
 4      w = log2(getWorldSize())        // Alg. 5
 5      c = 1 − 2p
 6      if t ≥ N − w:
 7          r = getRank()               // Alg. 5
 8          b = getBit(r, t − (N − w))  // Alg. 1
            # multithread
 9          for k in range(0, Λ/2):
10              j = insertBit(k, t, !b) // Alg. 1
11              ρ[j] *= c
12      else:
            # multithread
13          for k in range(0, Λ/4):
14              j = insertBit(k, t, 1)  // Alg. 1
15              j = insertBit(j, t + N, 0)
16              ρ[j] *= c
17              j = insertBit(k, t, 0)  // Alg. 1
18              j = insertBit(j, t + N, 1)
19              ρ[j] *= c

2. Two-qubit

We can derive a similar method for the two-qubit dephasing channel, inducing Ẑ on either or both of qubits t1 and t2 with probability p. Assume t2 > t1. The channel upon a general state ρ = ∑_{kl} αkl |k⟩⟨l| produces

    ε(ρ) = (1 − p)ρ + p/3 ( Ẑt1 ρ Ẑt1 + Ẑt2 ρ Ẑt2 + Ẑt1 Ẑt2 ρ Ẑt1 Ẑt2 )   (140)
         = ∑_{kl} αkl |k⟩⟨l| ( 1 − p + p/3 ( s(t1)_{kl} + s(t2)_{kl} + s(t1)_{kl} s(t2)_{kl} ) ),   (141)

where s(t)_{kl} = (−1)^{k[t]+l[t]} = ±1. This suggests

    αkl → αkl,              if k[t1] = l[t1] ∧ k[t2] = l[t2],
          (1 − 4p/3) αkl,   otherwise,   (142)

and that an amplitude βi of the Choi-vector ||ρ⟩⟩_{2N} = ∑_i βi ||i⟩⟩_{2N} is multiplied by (1 − 4p/3) if either i[t1] ≠ i[t1+N] or i[t2] ≠ i[t2+N].

Like the one-qubit dephasing channel, we've shown two-qubit dephasing is also diagonal and embarrassingly parallel. We distribute the 2^{2N} amplitudes uniformly between 2^w nodes, such that the j-th local amplitude ρ[j] ≡ βi of node r again corresponds to global basis state ||i⟩⟩ = ||r⟩⟩ ||j⟩⟩, where

    ||i⟩⟩_{2N} ≡ ||r⟩⟩w ||…⟩⟩_{N−w} ||…⟩⟩w ||…⟩⟩_{N−w},

with qubits t2 + N and t1 + N lying within the first two sub-registers, and t2 and t1 within the last two. There are three distinct ways that t1 and t2 can be found among these sub-registers.

1. When t2 < N − w, the principal bits i[t1], i[t1+N], i[t2], i[t2+N] are all determined by local index j.

2. When t1 ≥ N − w, bits i[t1+N] and i[t2+N] are fixed per-node and determined by the rank r, and the remaining principal bits by j.

3. When t1 < N − w and t2 ≥ N − w, then bit i[t2+N] is determined by the rank, and all other principal bits by j.

With bit interleaving, we could devise non-branching loops which directly enumerate only local indices j whose global indices i satisfy i[t1] ≠ i[t1+N] or i[t2] ≠ i[t2+N]. However, Eq. 142 reveals 75% of all amplitudes are to be modified. It is ergo worthwhile to enumerate all indices and modify every element, with a quarter multiplied by unity, accepting a 25% increase in flops and memory writes in exchange for significantly simplified code. We present Alg. 20.

Algorithm 20: [distributed][density matrix]
Two-qubit dephasing of qubits t1 and t2 > t1 with probability p of an N-qubit density matrix with Choi-vector distributed between arrays ρ of length Λ among 2^w nodes.
[O(Λ) bops][Λ flops][0 exchanges]
[0 exchanged][O(1) memory][Λ writes]

 1  distrib_twoQubitDephasing(ρ, t1, t2, p):
 2      Λ = len(ρ)
 3      r′ = getRank() << log2(Λ)        // Alg. 5
 4      c = 1 − 4p/3
        # multithread
 5      for j in range(0, Λ):
 6          i = r′ | j                   // ||i⟩⟩2N = ||r⟩⟩w ||j⟩⟩λ
 7          b1 = getBit(i, t1)           // Alg. 1
 8          b1′ = getBit(i, t1 + N)
 9          b2 = getBit(i, t2)
10          b2′ = getBit(i, t2 + N)
11          b = (b1 ^ b1′) | (b2 ^ b2′)  // ∈ {0, 1}
12          f = b (c − 1) + 1            // ∈ {1, c}
13          ρ[j] *= f
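Alg. 20's branchless kernel renders directly into vectorised numpy; a hedged rendering of one node's view, with the rank prefix assumed already OR'd into the indices:

    import numpy as np

    def twoQubitDephasing(rho, t1, t2, p, N, rankPrefix=0):
        i = rankPrefix | np.arange(len(rho))
        # b = 1 exactly when i[t1] != i[t1+N] or i[t2] != i[t2+N]
        b = (((i >> t1) ^ (i >> (t1 + N))) | ((i >> t2) ^ (i >> (t2 + N)))) & 1
        c = 1 - 4*p/3
        rho *= b*(c - 1) + 1   # elementwise factor in {1, c}, no branching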
The depolarising channel, also known as the uniform Pauli channel, transforms qubits towards the maximally mixed state, describing incoherent noise resulting from the erroneous application of any Pauli operator. It is a ubiquitous noise model deployed in quantum error correction [92], is often a suitable description of the average noise in deep, generic circuits of many qubits [93], and is the effective noise produced by randomised compiling (also known as twirling) of circuits suffering coherent noise [94, 95]. Like the dephasing channel, we could describe depolarising as a Kraus map and simulate it via Alg. 18, though this would not leverage the sparsity of the resulting superoperator. In this section, we instead derive superior distributed algorithms to simulate both the one and two-qubit depolarising channels upon a density matrix distributed between Λ-length arrays, in O(Λ) operations and at most two rounds of communication. For simplicity, we study uniform depolarising, though an extension to a general Pauli channel is straightforward.

1. One-qubit

The one-qubit uniformly depolarising channel upon qubit t of an N-qubit density matrix ρ produces state

    ε(ρ) = (1 − p)ρ + p/3 ( X̂t ρ X̂t + Ŷt ρ Ŷt + Ẑt ρ Ẑt ),   (144)

where p is the probability of any error occurring. Each Pauli operator upon a basis state |k⟩⟨l| produces

    X̂t |k⟩⟨l| X̂t = |k¬t⟩⟨l¬t|,   (145)
    Ŷt |k⟩⟨l| Ŷt = (−1)^{k[t]+l[t]} |k¬t⟩⟨l¬t|,   (146)
    Ẑt |k⟩⟨l| Ẑt = (−1)^{k[t]+l[t]} |k⟩⟨l|,   (147)

and ergo the depolarising channel maps a general state ρ = ∑_{kl} αkl |k⟩⟨l| to

    E(ρ) = ∑_{kl} ( 1 − p + p/3 (−1)^{k[t]+l[t]} ) αkl |k⟩⟨l|
           + p/3 ( 1 + (−1)^{k[t]+l[t]} ) αkl |k¬t⟩⟨l¬t|,   (148)

or the equivalent change to the equivalent Choi-vector ||ρ⟩⟩_{2N} = ∑_i βi ||i⟩⟩_{2N} of

    βi → (1 − 2p/3) βi + (2p/3) βi¬{t,t+N},   if i[t] = i[t+N],
         (1 − 4p/3) βi,                       if i[t] ≠ i[t+N].   (150)

Unlike the dephasing channel, we see already that the depolarising channel upon the Choi-vector is not diagonal; it will linearly combine amplitudes and require communication. Recall that the 2^{2N} amplitudes of ||ρ⟩⟩ are uniformly distributed between arrays ρ among 2^w nodes (where N ≥ w), such that the j-th local amplitude ρ[j] ≡ βi of node r corresponds to global basis state

    ||i⟩⟩_{2N} ≡ ||r⟩⟩w ||j⟩⟩_{2N−w},   (151)

wherein qubit t + N lies within either the rank or local sub-register, and qubit t within the local sub-register. Two communication scenarios emerge, informed by qubit t.

1. When t < N − w, the principal bits i[t] and i[t+N] are determined entirely by a local index j, and the pair amplitude βi¬{t,t+N} resides within the same node as βi. Simulation is embarrassingly parallel.

2. When t ≥ N − w, bit i[t] is determined by local index j, but i[t+N] is fixed by the node rank r. Precisely, i[t+N] = r[t−(N−w)]. The pair amplitude βi¬{t,t+N} resides within pair node r′ = r¬(t−(N−w)), requiring communication. Since only local amplitudes with indices j satisfying j[t] = i[t+N] need to be exchanged, we first pack only this half into the node's buffer φ before exchange. This is communication paradigm b) of Fig. 5.

Translating these schemes (and in effect, implementing Eq. 150 upon distributed {βi}) into efficient, non-branching, cache-friendly code is non-trivial. We present such an implementation in Alg. 21. Interestingly, its performance is similar to that of a one-target gate (Alg. 6) upon a statevector of 2N qubits, but it exchanges only half of all amplitudes when communication is necessary.
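A hedged numpy sketch of Eq. 150's embarrassingly-parallel case (t < N − w), where both principal bits are local and no communication is needed; illustrative only, and distinct from the branchless Alg. 21:

    import numpy as np

    def oneQubitDepolarising(rho, t, p, N):
        i = np.arange(len(rho))
        same = ((i >> t) & 1) == ((i >> (t + N)) & 1)   # i[t] == i[t+N]
        pair = i ^ ((1 << t) | (1 << (t + N)))          # both bits flipped
        out = (1 - 4*p/3) * rho                          # differing-bit case
        out[same] = (1 - 2*p/3)*rho[same] + (2*p/3)*rho[pair[same]]
        rho[:] = out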
than that required by the dephasing channel. Con- pre-combine the outbound amplitudes and ex-
sider a basis state of 4 fewer qubits: change only one eighth of the buffer capacity. We
visualise this below.
||h⟩⟩2N −4 ≡ ||e⟩⟩N −t2 −1 ||d⟩⟩t2 −t1 −1 ||c⟩⟩N −(t2 −t1 )−1 ⊗
||b⟩⟩t2 −t1 −1 ||a⟩⟩t1 , (158) Λ /8 …
FIG. 15. Some communication patterns of Alg. 22’s distributed simulation of two-qubit depolarising upon a 5-qubit
density matrix distributed between 16 nodes. While E∆0,3 is pairwise, the other channels are simulated through two
consecutive rounds of pairwise communication.
and an amplitude of the equivalent Choi-vector ||ρ⟩⟩_{2N} = Σ_i β_i ||i⟩⟩_{2N} as

    β_i →  β_i + p β_{i¬{t,t+N}},    i[t] = i[t+N] = 0,
           √(1−p) β_i,               i[t] ≠ i[t+N],    (170)
           (1−p) β_i,                i[t] = i[t+N] = 1.

We see every amplitude is scaled, and those of index i with principal bits (i[t] and i[t+N]) equal to zero are linearly combined with a pair amplitude of opposite bits.

[FIG. 16 diagram: two rings of nodes 0–15, each labelled E_γ, with arrows between node pairs.]

FIG. 16. Communication patterns of Alg. 25 simulating the amplitude damping channel upon a 5-qubit density matrix distributed between 16 nodes. Each arrow indicates the one-way sending of 32 amplitudes, via Fig. 5 c).

Next, we consider when these amplitudes are distributed. Recall that the 2^{2N} amplitudes of ||ρ⟩⟩ are distributed between the 2^w nodes as Λ-length arrays.
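Before that, the following is a minimal NumPy sketch (ours, not Alg. 25) of Eq. 170's update in the embarrassingly parallel case where bits t and t+N are both local:

    import numpy as np

    def local_damping(beta, t, N, p):
        # amplitude damping (Eq. 170) upon a fully local Choi-vector
        i = np.arange(len(beta))
        bt, btN = (i >> t) & 1, (i >> (t + N)) & 1
        pair = i ^ ((1 << t) | (1 << (t + N)))       # flips bits t and t+N
        out = np.sqrt(1 - p) * beta                  # case i[t] != i[t+N]
        both0 = (bt == 0) & (btN == 0)
        both1 = (bt == 1) & (btN == 1)
        out[both0] = beta[both0] + p * beta[pair[both0]]
        out[both1] = (1 - p) * beta[both1]
        return out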
we instead seek scalar ⟨E⟩ = Tr(Ĥ ρ) ∈ ℝ. It is tempting to employ the Pauli tensor (upon density-matrix) simulation of Alg. 17, using clones of the density matrix which are summed (weighted by h_n) to produce Ĥρ, before a trace evaluation via

    ||ρ⟩⟩ = Σ_i^{2^{2N}} β_i ||i⟩⟩  ⟹  Tr(ρ) = Σ_j^{2^N} β_{j(2^N+1)}.    (172)

Such a scheme would require O(T) exchanges, a total of O(T) 2^{2N} exchanged amplitudes and memory writes, and a O(2^{2N}) memory overhead. An astonishingly more efficient scheme is possible.

Assume ρ = Σ_{kl} α_{kl} |k⟩⟨l| and that H : ℂ^{2^N × 2^N} is a Ẑ-basis matrix instantiating Ĥ. Of course, we will not instantiate such an expensive object, but it permits us to express

    ⟨E⟩ = Σ_k^{2^N} Σ_l^{2^N} H_{lk} α_{kl}.    (173)

Through a natural 2D extension of our ket indexing, we may express

    H_{lk} = Σ_n^T h_n (⊗_t^N σ̂_t^{(n)})_{lk}    (174)
           = Σ_n^T h_n Π_t^N (σ̂_{N−t−1}^{(n)})_{l[t], k[t]}.    (175)

This reveals that a O(N) bop calculation of a matrix element (∈ {±1, ±i}) of the N-qubit Pauli tensor is possible, informed by the elements of the Pauli matrices, and hence that a single element H_{lk} ∈ ℂ is calculable in O(T) flops and O(T N) bops. We can in-principle leverage Hermiticity of Ĥ and ρ to evaluate ⟨E⟩ in ≈ 2× fewer operations, but this will not meaningfully accelerate distributed simulation.

Expressed in terms of the equivalent Choi-vector ||ρ⟩⟩_{2N} = Σ_i β_i ||i⟩⟩_{2N}, we can write

    ⟨E⟩ = Σ_i^{2^{2N}} H_i β_i,    (176)
    H_i = Σ_n^T h_n Π_t^N (σ̂_{N−t−1}^{(n)})_{i[t+N], i[t]}.    (177)

The amplitudes β_i are trivially divided between the 2^w nodes, each concurrently weighted-summing its amplitudes, before a final global reduction wherein each node contributes a single complex scalar.

We formalise our strategy in Alg. 26. It requires no exchanging of Choi-vectors between nodes, and no writing to heap memory. Each amplitude of ρ is read precisely once.

The subroutine pauliTensorElem (Line 18) performs exactly N multiplications of real or imaginary integers (0, ±1, ±i). One may notice that half of all elements among the Pauli matrices are zero, and that encountering any one will yield a zero tensor element. This might lead one to erroneously expect that only a factor 1/2^N of invocations of pauliTensorElem will yield a non-zero result, and prompt an optimisation which avoids their calculation. We caution against this; the input Pauli strings to Alg. 26 will not be dense (i.e. T ≪ 4^N), and so will not uniformly invoke the subroutine for all permutations of arguments i and σ. We should expect instead to perform exponentially fewer invocations, precluding us from reasoning about the expected number of zero elements, as this depends on the user's input Pauli string structure. Finally, a naive optimisation like an attempt to return early from the loop of Line 21 whenever v = 0 will cause needless branching and disrupt multithreaded performance.
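The following is a minimal Python model (ours; Alg. 26's actual listing is not reproduced here) of such a subroutine, computing one element of the N-qubit Pauli tensor per Eq. 177; the names PAULI and pauli_tensor_elem are our own:

    import numpy as np

    # 2x2 Pauli matrices; every element is one of {0, +-1, +-i}
    PAULI = {'I': np.eye(2, dtype=complex),
             'X': np.array([[0, 1], [1, 0]], dtype=complex),
             'Y': np.array([[0, -1j], [1j, 0]], dtype=complex),
             'Z': np.array([[1, 0], [0, -1]], dtype=complex)}

    def pauli_tensor_elem(paulis, i, N):
        # element selected by 2N-bit Choi index i, per Eq. 177: qubit t
        # contributes its Pauli's element (i[t+N], i[t]); paulis[t] names
        # the Pauli acting on qubit t. Exactly N multiplications, with
        # deliberately no early exit upon encountering a zero
        v = 1 + 0j
        for t in range(N):
            row = (i >> (t + N)) & 1
            col = (i >> t) & 1
            v *= PAULI[paulis[t]][row, col]
        return v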
We now lament the lack of an analogous statevector algorithm. The speedup of Alg. 26 over a naive method invoking Pauli tensor simulation results from our not propagating any amplitude uninvolved in the trace. It hence will not accelerate the equivalent statevector calculation, which necessarily involves all amplitudes. That is, for the general |Ψ⟩_N = Σ_k α_k |k⟩_N, the expected value

    ⟨E⟩ = ⟨Ψ| Ĥ |Ψ⟩ = Σ_k^{2^N} Σ_l^{2^N} H_{lk} α_k* α_l,    (178)

includes products of all pairs of amplitudes {α_k, α_l}. This necessitates that the products include amplitudes residing in distinct nodes, and ergo that its distributed calculation involves multiple rounds of inter-node exchange.
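For illustration, a sketch (ours, reusing pauli_tensor_elem above) of one node's contribution to Eq. 176; a single-scalar reduction across nodes then completes ⟨E⟩:

    def local_expec_contrib(beta_local, rank, N, w, coeffs, strings):
        # one node's partial sum of <E> = sum_i H_i beta_i (Eq. 176) over
        # its Lambda = 2^(2N-w) local amplitudes; coeffs holds the h_n and
        # strings the T Pauli strings, giving O(N T Lambda) work per node
        suffix_bits = 2 * N - w
        total = 0j
        for j, beta in enumerate(beta_local):
            i = (rank << suffix_bits) | j    # global Choi index, per Eq. 151
            for h, paulis in zip(coeffs, strings):
                total += h * pauli_tensor_elem(paulis, i, N) * beta
        return total   # each node contributes this one scalar to a reduction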
    |k⟩_N ≡ |k^A, k^B⟩_N,    (179)

where f_t interweaves the bits of k^B into positions t of k^A, such that

    k′ = f_t(k^A, k^B).    (180)

We can trivially compute f with bitwise algebra.
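For instance, a bit-by-bit sketch (our illustration; the manuscript does not prescribe this exact loop) of such an interleaving:

    def f_t(kA, kB, t, N):
        # interweave the bits of kB into the ascending positions t of the
        # N-bit output index, filling remaining positions with bits of kA
        k, a, b = 0, 0, 0          # output index; bit-cursors into kA, kB
        for q in range(N):
            if b < len(t) and q == t[b]:
                k |= ((kB >> b) & 1) << q
                b += 1
            else:
                k |= ((kA >> a) & 1) << q
                a += 1
        return k

    # e.g. N=4, t=(1,3): the bits of kB land at positions 1 and 3
    assert f_t(0b00, 0b11, (1, 3), 4) == 0b1010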
Our general composite density matrix, with amplitudes α_{kl}, can then be written

    ρ^{AB} = Σ_k^{2^N} Σ_l^{2^N} α_{kl} |k⟩⟨l|_N    (181)
           ≡ Σ_{k^A}^{2^m} Σ_{k^B}^{2^n} Σ_{l^A}^{2^m} Σ_{l^B}^{2^n} α_{f_t(k^A,k^B), f_t(l^A,l^B)} |k^A, k^B⟩⟨l^A, l^B|_N.    (182)

Let ⟨1^{⊗m}, v| notate an interwoven tensor product of the m-qubit identity operator with the n qubits of the v-th basis bra of ρ^B. The reduced density matrix can then be expressed as

    ρ^A = Tr_B(ρ^{AB}) = Σ_v^{2^n} ⟨1^{⊗m}, v| ρ^{AB} |1^{⊗m}, v⟩
        = Σ_{k^A}^{2^m} Σ_{l^A}^{2^m} Σ_v^{2^n} α_{f_t(k^A,v), f_t(l^A,v)} |k^A⟩⟨l^A|.    (183)

This makes clear that the amplitudes α′ of ρ^A are

    α′_{kl} = Σ_v^{2^n} α_{f_t(k,v), f_t(l,v)},    k, l ∈ [0..2^m).    (184)

Observe that a fraction 1/2^n of all amplitudes of ρ^{AB} are involved in the determination of ρ^A, and that ρ^A is determined by a total of 2^{2m+n} sum terms. Before proceeding, we make several more immediate observations to inform subsequent optimisation.

• In principle, evaluating a single α′_{kl} amplitude requires polynomially fewer than the 2^{2m+n} − 1 floating-point additions suggested directly by the sum over v, because the sum may be performed by a sequence of hierarchical reductions on neighbouring pairs. This is compatible with numerical stability techniques like Kahan summation [105], though reduces the floating-point costs by a modest and shrinking factor 1 − 2^{−n}, easily outweighed by its introduced caching overheads.

• The pair of subscripted indices (f_t(k,v), f_t(l,v)) are unique for every unique assignment of (k, l, v). Assuming no properties of ρ^{AB} (e.g. relaxing Hermiticity), each amplitude α′_{kl} of ρ^A is therefore a sum of unique amplitudes α of ρ^{AB}. There are no repeated partial sums between different α′_{kl} which we might otherwise seek to re-use in hierarchical reductions.

• As we should expect by the arbitrariness of the ordering of t, the sum in α′_{kl} is uniformly weighted. We can therefore iterate v in any order, and set its constituent bits v[q] with simplified unordered bitwise operations.

• We must optimise our memory strides by choosing whether to contiguously iterate the "output" amplitudes α′ (for each, computing a full sum of 2^n scalars), or the "input" amplitudes α (adding each to one of 2^m partial sums). We choose the former, since the cache penalties of a suboptimal write stride outweigh the read penalties, and also since its multithreaded implementation avoids race conditions and minimises false sharing [25].

The equivalent reduced Choi-vector ||ρ^A⟩⟩_{2m} = Σ_i β′_i ||i⟩⟩, resulting from tracing out qubits t of ||ρ^{AB}⟩⟩_{2N} = Σ_i β_i ||i⟩⟩, has amplitudes

    β′_i = Σ_v^{2^n} β_{g(i,v)},  where    (185)
    g(i, v) = f_{t ∪ (t+N)}(i, (v << n) | v).    (186)

The function g merely takes index i and interweaves the bits of v into positions t and t+N, the latter being the same positions shifted left by N. Local, serial evaluation of this sum is trivial, requiring O(2^{2N}/2^n) flops, and local parallelisation is straightforward.
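In terms of the f_t sketch above, g may be modelled as (again, our illustration):

    def g(i, v, t, N):
        # Eq. 186: interweave the duplicated trace bits (v << n) | v into
        # positions t and t + N of the output 2N-bit index
        n = len(t)
        s = tuple(t) + tuple(q + N for q in t)    # the 2n targeted positions
        return f_t(i, (v << n) | v, s, 2 * N)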
We now distribute these amplitudes between W = 2^w nodes. Because ρ^{AB} and ρ^A differ in dimension, their partitioned arrays on each node have different sizes. Recall we assume N ≥ w such that each node contains at least one column's worth of ρ^{AB}. For the output reduced density matrix ρ^A to satisfy this precondition, and ergo be compatible input to this manuscript's other algorithms, it must similarly satisfy m ≥ w, equivalently that n ≤ N − w. This is inessential to our algorithm however, which imposes a looser condition elaborated upon later.

The j-th local amplitude ρ^{AB}[j] ≡ β_g of node r corresponds to global basis state

    ||g⟩⟩_{2N} ≡ ||r⟩⟩_w ||j⟩⟩_{2N−w},    (187)

with bits t and t+N of g positioned as in Eq. 151. Similarly, the j-th local reduced amplitude ρ^A[j] = β′_i corresponds to

    ||i⟩⟩_{2m} = ||r⟩⟩_w ||j⟩⟩_{2m−w}.    (188)

Therefore all to-be-traced qubits t_q ∈ t (satisfying 0 ≤ t_q < N) target the suffix substate ||j⟩⟩_{2N−w}, and
a subset of the shifted qubits in t + N will target the prefix substate ||r⟩⟩_w. Two communication patterns emerge:

1. When all t_q < N − w, then no qubits in t + N target the prefix substate. This means every index g(i, v) (Eq. 186) within a given node of rank r is determined entirely by (r, v, and) the suffix bits of i, equivalent to j. Precisely:

    i = (r << (2N − w)) | j    (189)

for all local j. Ergo all summed amplitudes {β_{g(i,v)} : v} reside within the same node. Furthermore, the index i of the destination amplitude β′_i shares the same w-bit prefix (r) as the source/summed amplitudes, and ergo also resides in the same node. As a result, this scenario is embarrassingly parallel.

2. When ∃ t_q ≥ N − w, the amplitudes featured in Eq. 185 reside within distinct nodes. Computing a single β′_i will require prior communication. Like we did in Sec. IV D to simulate the many-target general unitary on a statevector, we can in-principle use SWAP gates to first obtain locality of these amplitudes. However, the procedure here is complicated by the reduced dimension of the output structure ||ρ^A⟩⟩_{2m}, meaning that we cannot simply "swap back" our swapped qubits.

Let us focus on scenario 2., which admits a several-step procedure. In essence, we apply SWAP gates to move all targeted qubits into the (2N − w)-qubit suffix substate and then perform the subsequently embarrassingly parallel reduction; thereafter we perform additional SWAP gates on the reduced density matrix to restore the relative ordering of the non-targeted qubits. This a priori requires that all target qubits t, and their paired Choi qubits t + N, can fit into the suffix substate (a similar precondition of the many-target gate upon a statevector of Sec. IV D). This requires

    n ≤ N − ⌈w/2⌉.    (190)

The subsequent SWAP gates on the reduced m-qubit density matrix, treated as an unnormalised 2m-qubit statevector, assume the equivalent precondition 2m ≥ w.

Even under these assumptions, the procedure is verbose in the general case. In lieu of its direct derivation, we opt to demonstrate it with an example. Imagine that we distribute an N = 5 qubit density matrix ρ between W = 8 nodes (ergo w = 3). Assume we wish to trace out qubits t = {2, 4}. Let ||g⟩⟩_{2N} = ||i, v⟩⟩ be a basis state of the Choi-vector ||ρ⟩⟩_{2N}, where v is formed by the four bits of g at indices t ∪ (t + N), and i is formed by the remaining six untargeted bits of g. The bits of g, constituted by the bits of i and v, are arranged as:

    ||g⟩⟩_{2N} = |g[9] g[8] g[7]⟩_w |g[6] g[5] g[4] g[3] g[2] g[1] g[0]⟩_{2N−w}
               = |v[3] i[5] v[2]⟩_w |i[4] i[3] v[1] i[2] v[0] i[1] i[0]⟩_{2N−w}.    (191)

Evaluating a reduced, output amplitude β′_i requires summing all β_g with indices satisfying v[0] = v[2] and v[1] = v[3] (with fixed i). Because bits v[2] and v[3] lie in the prefix state, the sum terms are distributed between multiple nodes. Precisely, between ranks {r = v[3] 2^2 + i[5] 2^1 + v[2] 2^0 : 0 ≤ v < 2^4}.

So we first perform a series of SWAP gates upon ||ρ⟩⟩ (treated as a statevector) via Sec. IV C, in order to move all targeted prefix bits into the suffix state. We heuristically swap the left-most prefix targets (starting with v[3]) with the left-most non-targeted suffix qubits (initially i[4]), minimising the relative displacement of the i bits, which will accelerate a subsequent re-ordering step. After effecting

    ||ρ⟩⟩_{2N} → SWAP_{9,6} SWAP_{7,5} ||ρ⟩⟩_{2N},    (192)

the basis state ||g⟩⟩ has been mapped to

    ||g′⟩⟩_{2N} = |i[4] i[5] i[3]⟩_w |v[3] v[2] v[1] i[2] v[0] i[1] i[0]⟩_{2N−w}.

All amplitudes β_g across varying v (fixing i) now reside within the same node, permitting embarrassingly parallel evaluation of Eq. 185. This resembles a four-qubit partial trace, of qubits t′ = {2, 4, 5, 6}, upon a seven-qubit statevector. All W nodes perform this reduction, producing a distributed m = 3-qubit density matrix ρ′. Alas, this is not yet the output state; the i-th global basis state of ||ρ′⟩⟩ does not correspond to the desired amplitude β′_i, but is instead the state

    ||i′⟩⟩_{2m} = |i[4] i[5] i[3]⟩_w |i[2] i[1] i[0]⟩_{2m−w},    (193)

where the leftmost two prefix qubits are disordered. Our final step is to restore the relative ordering of all bits of i (mapping ||i′⟩⟩ → ||i⟩⟩) by performing additional SWAP gates upon the corresponding qubits of the reduced 3-qubit density matrix. Each qubit can be swapped directly to its known ordered location. In this example, we simply perform

    ||ρ′⟩⟩_{2m} → SWAP_{5,4} ||ρ′⟩⟩_{2m}.    (194)

Choi-vector ||ρ′⟩⟩_{2m} is now the correct, distributed, reduced density matrix.
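The example's bookkeeping can be verified with a few lines of Python (our own check, not part of the manuscript's algorithms):

    # N=5, t={2,4}: the targeted positions of g are {2,4,7,9} (Eq. 191)
    N = 5
    t = (2, 4)
    targeted = sorted(t + tuple(q + N for q in t))
    untargeted = [q for q in range(2 * N) if q not in targeted]
    assert targeted == [2, 4, 7, 9]
    assert untargeted == [0, 1, 3, 5, 6, 8]   # hold i[0..5], so i[5] is bit 8
    # the w=3 prefix positions {7,8,9} hold v[2], i[5], v[3], so for fixed i
    # the sum terms span ranks r = v[3]*4 + i[5]*2 + v[2], as stated above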
[FIG. 17 panels: Tr_{2,4}(ρ_5) decomposed as SWAP_{9,6} ||ρ⟩⟩_{10}, then SWAP_{7,5} ||ρ⟩⟩_{10}, then Tr_{2,4,5,6} ||ρ⟩⟩_{10}, then SWAP_{5,4} ||ρ′⟩⟩_6; each panel is a ring of nodes 0–7.]

FIG. 17. The communication pattern of Alg. 28's distributed simulation of the partial trace of qubits t = {2, 4} upon an N = 5-qubit density matrix distributed between W = 8 nodes. The left plot shows the necessary traffic of amplitudes to effect Tr_{2,4}(ρ) "directly", in a way incompatible with our distributed partitioning. The right plots decompose this into the pairwise-communicating steps of our algorithm, which operates upon ||ρ⟩⟩_{10} (and reduced state ||ρ′⟩⟩_6) treated as statevectors. It is interesting to monitor the movement of amplitudes from node r = 1 to 2, achieved via intermediate movements to nodes 5, 4, then 2.
It is again beneficial to heuristically perform these "post-processing" SWAPs upon the leftmost prefix qubits first, to avoid unnecessary displacement of subsequently swapped qubits between the suffix and prefix substates, which causes wasteful communication. We visualise the incurred communication pattern of this process in Fig. 17.

We summarise the complexity of this method.

• At most w initial SWAP gates are required to remove all prefix targets. Each invokes Alg. 9 in communication scenario 3., whereby half a node's amplitudes are exchanged. A total of O(w) 2^{2N}/2 amplitudes are communicated in O(w) rounds, with as many memory writes, and zero flops.

• The embarrassingly parallel evaluation of Eq. 185, i.e. the "local trace", involves O(2^{2N}/2^n) flops and bops, and a factor 1/2^n fewer memory writes.

• Fewer than m final SWAPs are needed to reorder the reduced state, each invoking Alg. 9 in potentially any of its three communication scenarios. However, things simplify when we enforce m ≥ w, i.e. the precondition assumed by this manuscript's other algorithms. In that case, all initially targeted prefix qubits get swapped into the leftmost suffix positions, and ergo after reduction, only the prefix qubits are disordered (as per our example). The final SWAP costs ergo simplify to O(w) SWAPs in scenario 3. of Alg. 9, exchanging a total of O(w) 2^{2m}/2 amplitudes. Without this precondition, the final SWAP costs scale inversely with 2^{2n}, so are anyway quickly occluded with increasing number of traced qubits n. Furthermore, our heuristic of initially swapping the leftmost qubits first reduces the necessary number of final SWAPs.

We formalise this algorithm in Alg. 28. Below we discuss some potential optimisations, and some tempting but ultimately not worthwhile changes.

• Our algorithm did not assume any properties (normalisation or otherwise) of ρ. If we assume ρ is Hermitian, then we can reduce the network and floating-point costs by at most a factor 2. This is because ρ_{ij} = ρ*_{ji} enables us to process only a fraction (1 + 1/2^N)/2 of all 2^{2N} amplitudes, further reducing the fraction 1/2^n involved in the partial trace. Determining the exact reduction and consequential utility of the optimisation requires a careful treatment we do not here perform.

• The qubits of the reduced density matrix, before the post-processing SWAPs, are only ever out-of-order when an initial prefix qubit is swapped past a non-targeted qubit. So it is tempting to swap only adjacent qubits, percolating prefix qubits toward the suffix substate one qubit at a time. This preserves the relative order of the non-targeted qubits, so no post-trace SWAPs are required. Alas, this is not worthwhile; it necessitates more total SWAPs, and all of them will operate on the larger N-qubit density matrix, as opposed to the smaller reduced (N − n)-qubit density matrix, increasing communication and write costs.

• Our initial SWAPs moved prefix targets into the leftmost suffix qubits to reduce disordering of the untargeted qubits, and ergo reduce the number of subsequent SWAPs (and thus, the communication costs) on the reduced state. This means however that the local partial trace calculation (invocation of local partialTraceSub at Line 14 of Alg. 28) targets high-index qubits (in arrays t and t′). This causes a large memory stride; the accessed amplitudes ρ[g] across g at Line 35 are far apart (beyond cache lines), and their bounding addresses overlap across different i. This may lead to sub-optimal caching behaviour, especially in multithreaded settings (although we thankfully note it does not induce false sharing, since we merely read ρ). It is worth considering instead swapping prefix targets into the rightmost suffix qubits. This makes amplitudes ρ[g] at Line 35 contiguous in memory, improving caching and multithreaded performance, but requires more post-processing SWAPs and ergo modestly increased communication costs. Such a strategy may prove worthwhile when making use of so-called "fused swaps" [57, 59, 61].
Algorithm 27: O(N) subroutines of Alg. 28.

1  getNextLeftmostZeroBit(b, i):
2      i -= 1
3      while getBit(b, i) == 1:
4          i -= 1
5      return i

6  getReorderedTargets(s, λ):
       // locate leftmost non-targeted suffix
7      b = getBitMask(s)
8      τ = getNextLeftmostZeroBit(b, λ)
       // obtain new suffix-only targets
9      s′ = { }
10     for q in range(len(s) − 1, −1, −1):
11         if s[q] < λ:
12             append s[q] to s′
13         else:
14             append τ to s′
15             τ = getNextLeftmostZeroBit(b, τ)
16     return reversed(s′)
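As a sanity check, the following is a direct Python transliteration (ours) of getReorderedTargets, confirming the SWAP choices of the earlier example:

    def get_next_leftmost_zero_bit(b, i):
        # first index below i whose bit in mask b is zero (Alg. 27)
        i -= 1
        while (b >> i) & 1:
            i -= 1
        return i

    def get_reordered_targets(s, lam):
        # replace each prefix target (>= lam) with the next leftmost
        # untargeted suffix qubit (Alg. 27)
        b = 0
        for q in s:
            b |= 1 << q
        tau = get_next_leftmost_zero_bit(b, lam)
        out = []
        for q in reversed(s):
            if q < lam:
                out.append(q)
            else:
                out.append(tau)
                tau = get_next_leftmost_zero_bit(b, tau)
        return list(reversed(out))

    # the example: N=5, w=3, t={2,4} gives s={2,4,7,9} and lam=7, whence
    # reordered targets {2,4,5,6}; Alg. 28 then effects SWAP(9,6), SWAP(7,5)
    assert get_reordered_targets([2, 4, 7, 9], 7) == [2, 4, 5, 6]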
17 getRemainingQubitOrder(N, s, s′):
       // determine post-swap qubit ordering
18     q = range(0, 2N)
19     for a, b in zip(s, s′):
20         if a != b:
21             q[a], q[b] = q[b], q[a]
       // remove traced-out qubits
22     s′′ = { }
23     b′ = getBitMask(s′)
24     for i in range(0, 2N):
25         if getBit(b′, i) == 0:
26             append q[i] to s′′
       // make elements contiguous
27     b′′ = getBitMask(s′′)
28     for i in range(0, len(s′′)):
29         for j in range(0, s′′[i]):
30             s′′[i] -= ! getBit(b′′, j)
31     return s′′

Algorithm 28: [distributed] [density matrix] Partial tracing of n qubits t from an N-qubit density matrix with Choi-vector distributed as Λ-length arrays between 2^w nodes.

Tr_t(ρ)
[O(wΛ) bops] [O(Λ/2^n) flops]
[O(w + N − n) exchanges] [O(w) 2^{2N} exchanged]
[O(1) memory] [O(w)Λ writes]

1  distrib partialTrace(ρ, φ, t):
2      N = getNumQubits(ρ)    // Alg. 15
3      λ = log2(len(ρ))
4      sort(t)
       // local if all targets are in suffix
5      if t[−1] + N < λ:
6          ρ′ = local partialTraceSub(ρ, t, t + N)
7          return ρ′
       // find where to swap prefix targets
8      s = t ∪ (t + N)
9      s′ = getReorderedTargets(s, λ)    // Alg. 27
       // swap prefix targets into suffix
10     for q in range(len(s) − 1, −1, −1):
11         if s′[q] ≠ s[q]:
12             distrib swapGate(ρ, φ, s[q], s′[q])    // Alg. 9
       // perform embarrassingly parallel trace
13     t′ = s′[len(t) : ]
14     ρ′ = local partialTraceSub(ρ, t, t′)
       // determine un-targeted qubit ordering
15     s′′ = getRemainingQubitOrder(N, s, s′)    // Alg. 27
       // reorder untargeted via swaps
16     for q in range(len(s′′) − 1, −1, −1):
17         if s′′[q] != q:
18             p = index of q in s′′
19             distrib swapGate(ρ′, φ, q, p)
20             s′′[q], s′′[p] = s′′[p], s′′[q]
21     return ρ′

22 local partialTraceSub(ρ, t, t′):
23     Λ = len(ρ)
24     N = getNumQubits(ρ)    // Alg. 15
25     w = log2(getWorldSize())    // Alg. 5
26     n = len(t)
27     γ = 2^{2(N−n)−w}
28     ρ′ = 0_{γ×1}    // new Choi-vector
29     s = sorted(t ∪ t′)
30     for i in range(0, γ):    # multithread
31         g0 = insertBits(i, s, 0)    // Alg. 1
32         for v in range(0, 2^n):
33             g = setBits(g0, t, v)    // Alg. 1
34             g = setBits(g, t′, v)
35             ρ′[i] += ρ[g]
36     return ρ′
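For concreteness, a self-contained Python model (ours) of the local subroutine; insert_bits and set_bits sketch the behaviour of Alg. 1's utilities, whose exact listings are not reproduced here:

    def insert_bits(i, positions, bit):
        # insert `bit` at each ascending position, shifting higher bits up
        for p in positions:
            low = i & ((1 << p) - 1)
            i = ((i >> p) << (p + 1)) | (bit << p) | low
        return i

    def set_bits(g, positions, v):
        # overwrite the bits of g at `positions` with the bits of v
        for b, p in enumerate(positions):
            g = (g & ~(1 << p)) | (((v >> b) & 1) << p)
        return g

    def local_partial_trace_sub(rho, t, t2, N, w):
        # mirrors Lines 22-36: sum the pair-targeted amplitudes of the local
        # Choi-vector rho into a new gamma-length reduced Choi-vector
        n = len(t)
        gamma = 2 ** (2 * (N - n) - w)
        s = sorted(list(t) + list(t2))
        out = [0j] * gamma
        for i in range(gamma):
            g0 = insert_bits(i, s, 0)
            for v in range(2 ** n):
                g = set_bits(set_bits(g0, t, v), t2, v)
                out[i] += rho[g]
        return out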
[FIG. 18 diagram: the layered dependency tree of algorithms. Top layer (decoherence): Tr(Ĥρ) (Alg. 26), Σ_i K̂_t^{(i)} ρ K̂_t^{(i)†} (Alg. 18), Tr_t(ρ) (Alg. 28), E^Δ_{t1,t2} (Alg. 22), E^Δ_t (Alg. 21), E^φ_t (Alg. 19), E^φ_{t1,t2} (Alg. 20), E^γ_t (Alg. 25). These depend upon unitaries on density matrices (Algs. 15–17), which depend upon unitaries on statevectors (Algs. 6, 7, 9–11, 13, 14), which in turn invoke the communication routines (Send/Recv, exchangeArrays, Reduce; Algs. 4, 5) and local simulation (Algs. 2, 3).]
FIG. 18. The dependency tree of this manuscript’s algorithms. The surprising connectivity becomes intuitive under
the following observations: Distributed statevector simulation often includes edge-cases algorithmically identical to
local statevector simulation; Under the Choi–Jamiolkowski isomorphism [87, 88], a unitary upon a density matrix
resembles two unnormalised unitaries upon an unnormalised statevector (which we called a “Choi-vector”); A Kraus
channel upon a density matrix is equivalent to a superoperator upon a Choi-vector, itself equivalent to an unnor-
malised unitary upon an unnormalised statevector; SWAP gates permit transpiling communicating operators into
embarrassingly parallel ones; The natural distribution of statevector amplitudes often admits simulation via pairwise
communication and simple array exchanges.
TABLE I. The distributed algorithms presented in this manuscript, and their costs, expressed either in aggregate or per-node (pn). Assume each algorithm is invoked upon an N-qubit statevector or density matrix, constituted by a total of Σ amplitudes, distributed between 2^w nodes, such that each node contains Λ = Σ/2^w amplitudes and an equivalently sized communication buffer. All methods assume N ≥ w. Where per-node costs differ between nodes (such as for the s-control one-target gate), the average cost is given.

Statevector algorithms (Σ = 2^N, Λ = 2^{N−w}):

Sec. | Alg. | Operation | Symbol | Bops (pn) | Flops (pn) | Writes (pn) | Exchanges | Exchanged | Memory
IV A | 6 | One-target gate | M̂_t | O(Λ) | O(Λ) | Λ | 0 or 1 | Σ | O(1)
IV B | 7 | s-control one-target gate | C_c(M̂_t) | O(sΛ/2^s) | O(Λ/2^s) | Λ/2^s | 0 or 1 | Σ/2^s | O(1)
IV C | 9 | SWAP gate | SWAP_{t1,t2} | O(Λ) | 0 | Λ/2 | 0 or 1 | Σ/2 | O(1)
IV D | 10 | n-target gate^a | M̂_t | O(2^n Λ) | O(2^n Λ) | O(2^n Λ) | O(min(n, w)) | O(min(n, w)) Σ | O(2^n)
IV E | 11 | n-qubit Pauli tensor | ⊗^n σ̂ | O(nΛ) | Λ | Λ | 0 or 1 | Σ | O(1)
IV F | 13 | n-qubit phase gadget | exp(iθ Ẑ^{⊗n}) | O(Λ) | O(Λ) | Λ | 0 | 0 | O(1)
IV G | 14 | n-qubit Pauli gadget | exp(iθ ⊗^n σ̂) | O(nΛ) | O(Λ) | Λ | 0 or 1 | Σ | O(1)

Density-matrix algorithms (Σ = 2^{2N}, Λ = 2^{2N−w}):

Sec. | Alg. | Operation | Symbol | Bops (pn) | Flops (pn) | Writes (pn) | Exchanges | Exchanged | Memory
V B | 16 | n-qubit unitary^b | Û_t | O(2^n Λ) | O(2^n Λ) | O(2^n Λ) | O(min(n, w)) | O(min(n, w)) Σ/2 | O(2^n)
V B | 17 | SWAP gate | SWAP_{t1,t2} | O(Λ) | 0 | Λ | 0 or 1 | Σ/2 | O(1)
V B | 17 | n-qubit Pauli tensor | ⊗^n σ̂ | O(nΛ) | 2Λ or 3Λ | 2Λ or 3Λ | 0 or 1 | Σ | O(1)
V B | 17 | n-qubit phase gadget | exp(iθ Ẑ^{⊗n}) | O(Λ) | O(Λ) | 2Λ | 0 | 0 | O(1)
V B | 17 | n-qubit Pauli gadget | exp(iθ ⊗^n σ̂) | O(nΛ) | O(Λ) | 2Λ | 0 or 1 | Σ | O(1)
V C | 18 | n-qubit Kraus map^b | Σ_i K̂^{(i)} ρ K̂^{(i)†} | O(4^n Λ) | O(4^n Λ) | O(4^n Λ) | O(min(2n, w)) | O(min(2n, w)) Σ/2 | O(16^n)
V D | 19 | one-qubit dephasing | E^φ_t | O(Λ) | Λ/2 | Λ/2 | 0 | 0 | O(1)
V D | 20 | two-qubit dephasing | E^φ_{t1,t2} | O(Λ) | Λ | Λ | 0 | 0 | O(1)
V E | 21 | one-qubit depolarising | E^Δ_t | O(Λ) | O(Λ) | O(Λ) | 0 or 1 | Σ/2 | O(1)
V E | 22 | two-qubit depolarising | E^Δ_{t1,t2} | O(Λ) | O(Λ) | O(Λ) | 0, 1 or 2 | Σ/8 or Σ/2 | O(1)
V F | 25 | amplitude damping^c | E^γ_t | O(Λ) | O(Λ) | O(Λ) | 0 or 1 | Σ/4 | O(1)
V G | 26 | T-term Pauli string expectation | Tr(Ĥ ρ) | O(N T Λ) | O(T Λ) | 0 | 0 | 0^d | O(1)
V H | 28 | n-qubit partial trace^b | Tr_t(ρ) | O(wΛ) | O(Λ/2^n) | O(w) Λ | O(w + N − n) | O(w) Σ | O(1)^e

^a where n ≤ N − w
^b where n ≤ N − ⌈w/2⌉
^c only one node of a pair sends amplitudes
^d one scalar global reduction
^e excluding the cost of the output density matrix
[1] Xiao Yuan, Suguru Endo, Qi Zhao, Ying Li, and Simon C Benjamin. Theory of variational quantum simulation. Quantum, 3:191, 2019.
[2] Iskren Vankov, Daniel Mills, Petros Wallden, and Elham Kashefi. Methods for classically simulating noisy networked quantum architectures. Quantum Science and Technology, 5(1):014001, 2019.
[3] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):1–16, 2019.
[4] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
[5] Emanuel Knill, Raymond Laflamme, and Wojciech H Zurek. Resilient quantum computation. Science, 279(5349):342–345, 1998.
[6] Bálint Koczor, Suguru Endo, Tyson Jones, Yuichiro Matsuzaki, and Simon C Benjamin. Variational-state quantum metrology. New Journal of Physics, 22(8):083038, 2020.
[7] Tyson Jones and Simon C Benjamin. Robust quantum compilation and circuit optimisation via energy minimisation. Quantum, 6:628, 2022.
[8] Dennis Willsch, Hannes Lagemann, Madita Willsch, Fengping Jin, Hans De Raedt, and Kristel Michielsen. Benchmarking supercomputers with the Jülich universal quantum computer simulator. arXiv preprint arXiv:1912.03243, 2019.
[9] Tyson Jones, Anna Brown, Ian Bush, and Simon C Benjamin. QuEST and high performance simulation of quantum computers. Scientific Reports, 9(1):1–11, 2019.
[10] Cupjin Huang, Michael Newman, and Mario Szegedy. Explicit lower bounds on strong quantum simulation. IEEE Transactions on Information Theory, 66(9):5585–5600, 2020.
[11] Maarten Van den Nest. Classical simulation of quantum computation, the Gottesman-Knill theorem, and slightly beyond. Quantum Info. Comput., 10(3):258–271, mar 2010.
[12] EPCC CSE team. ARCHER2 hardware & software (documentation). University of Edinburgh, Mar 2020.
[13] Bahman S Motlagh and Ronald F DeMara. Memory latency in distributed shared-memory multiprocessors. In Proceedings IEEE Southeastcon '98, 'Engineering for a New Era', pages 134–137. IEEE, 1998.
[14] Thomas Häner, Damian S Steiger, Mikhail Smelyanskiy, and Matthias Troyer. High performance emulation of quantum circuits. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 866–874. IEEE, 2016.
[15] Sean Eron Anderson. Bit twiddling hacks. URL: http://graphics.stanford.edu/~seander/bithacks.html, 2005.
[16] Mohammad Alaul Haque Monil, Seyong Lee, Jeffrey S. Vetter, and Allen D. Malony. Understanding the impact of memory access patterns in Intel processors. In 2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC), pages 52–61, 2020.
[17] James Bottomley. Understanding caching. Linux Journal, 2004(117):2, 2004.
[18] James E Smith. A study of branch prediction strategies. In 25 Years of the International Symposia on Computer Architecture (Selected Papers), pages 202–215, 1998.
[19] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. ACM SIGPLAN Notices, 35(5):145–156, 2000.
[20] Jaewook Shin, Mary Hall, and Jacqueline Chame. Superword-level parallelism in the presence of control flow. In International Symposium on Code Generation and Optimization, pages 165–175. IEEE, 2005.
[21] Stanley F Anderson, John G Earle, Robert Elliott Goldschmidt, and Don M Powers. The IBM System/360 Model 91: Floating-point execution unit. IBM Journal of Research and Development, 11(1):34–53, 1967.
[22] William Y Chen, Pohua P. Chang, Thomas M Conte, and Wen-mei W. Hwu. The effect of code expanding optimizations on instruction cache design. IEEE Transactions on Computers, 42(9):1045–1057, 1993.
[23] Nakul Manchanda and Karan Anand. Non-uniform memory access (NUMA). New York University, 4, 2010.
[24] Mario Nemirovsky and Dean M Tullsen. Multithreading architecture. Synthesis Lectures on Computer Architecture, 8(1):1–109, 2013.
[25] William J Bolosky and Michael L Scott. False sharing and its effect on shared memory performance. In 4th Symposium on Experimental Distributed and Multiprocessor Systems, pages 57–71. Citeseer, 1993.
[26] Tyson Jones. Simulation of, and with, first generation quantum computers. PhD thesis, University of Oxford, 2022.
[27] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, 2019.
[28] Tyson Jones and Julien Gacon. Efficient calculation of gradients in classical simulations of variational quantum algorithms. arXiv preprint