Article
Abstract: Software quantum simulators are the most accessible tools for designing and testing quantum computing concepts, including state vector representations, quantum gates, and memory management techniques. The simulator prototype implements various memory optimization strategies, such as full-state representation, dynamic state pruning, and shared memory parallelization with OpenMP and distributed memory models using MPI. Additionally, data compression techniques, like ZFP, are explored to enhance simulation performance by reducing memory footprint. The results are validated through performance comparisons with leading open-source quantum simulators, such as Intel-QS, QuEST, and qsim. Our findings highlight the trade-offs between computational overhead and memory efficiency. This demonstrates that a hybrid approach using distributed memory and compression offers the best scalability for simulating large quantum systems. This work provides a foundation for developing efficient quantum simulators supporting more complex
1. Introduction
has been demonstrated that it allows efficient solutions to some complex problems whose best-known solution has an exponential cost for the input size. Quantum superposition, quantum uncertainty, and quantum entanglement are powerful resources that we can use to encode, decode, transmit, and process information in a highly efficient way that is
model, the D-Wave 2000Q has 2000 qubits. These quantum computers represent prototypes that are neither scalable nor sufficient to test complex quantum algorithms. Constructing a
The growing interest in Quantum Computing and the limitations of real quantum devices have caused many organizations to focus on developing software quantum simulators that run on classical computers. These simulators are popular tools suitable for testing quantum computing concepts under ideal conditions, avoiding hardware challenges like the limited number and quality of physical qubits and quantum error correction. A list of the very recent initiatives is maintained on several websites [1–4]. This large number of projects reflects the area’s growth and makes it difficult for researchers to decide which tool to use
Copyright: © 2024 by the authors. Submitted to Quantum Rep. for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://
as valuable and widely used tools in the field of quantum computing. These simulators play a crucial role in the development, testing, and validation of quantum algorithms before they are implemented on actual quantum hardware. One of the primary advantages of quantum simulators is their accessibility. Unlike quantum computers, which are still relatively scarce and often require significant resources and expertise to operate, simulators can be run
developers to explore quantum algorithms and concepts without the need for physical
Quantum simulators offer a controlled environment for designing and refining quantum algorithms. They can simulate ideal quantum systems without the noise and error rates present in current quantum hardware, providing clearer insights into the theoretical
quires exponential time and involves highly complex memory management. The problem
cantly more prominent than any of the existing quantum prototypes would soon require
state, the process would take 18,000 petabytes (18 exabytes) of classical computer memory.
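The arithmetic behind memory estimates of this kind can be sketched as follows. This is a minimal illustration, assuming a full state vector in which every amplitude is stored as two double-precision values (16 bytes); the function names are hypothetical and are not taken from the prototype. Under this assumption, 60 qubits would need 2^60 × 16 bytes, roughly 1.8 × 10^19 bytes, which is on the order of 18 exabytes.

```cpp
#include <cstdint>

// Assumption: one complex amplitude = two doubles = 16 bytes.
constexpr std::uint64_t kBytesPerAmplitude = 16;

// Bytes needed to hold all 2^n amplitudes of an n-qubit state vector.
// Only valid for numQubits <= 59, so that the result fits in 64 bits.
std::uint64_t stateVectorBytes(unsigned numQubits) {
    const std::uint64_t amplitudes = std::uint64_t{1} << numQubits; // 2^n
    return amplitudes * kBytesPerAmplitude;
}

// Largest qubit count (capped at 59) whose full state vector fits in the
// given number of bytes. Returns 0 when at most a single amplitude fits.
unsigned maxQubitsForMemory(std::uint64_t availableBytes) {
    unsigned n = 0;
    while (n < 59 && stateVectorBytes(n + 1) <= availableBytes) {
        ++n;
    }
    return n;
}
```

For instance, `stateVectorBytes(30)` evaluates to 16 GiB, and each additional qubit doubles that figure.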
To better understand the quantum computing model, it is necessary to know the key aspects it inherits from quantum mechanics. This section describes the fundamental concepts on which quantum computing is based. Readers with knowledge of the field may
states 0 and 1 are represented by a prescribed pair of normalized and mutually orthogonal quantum states denoted using Dirac’s notation |0⟩ and |1⟩ [5]. The two states form a “computational basis,” and any other (pure) state of the qubit can be written as a superposition α|0⟩ + β|1⟩ [6]. Formally, a Quantum State is a vector |ψ⟩ representing a
case of a two-dimensional vector; that is to say, there are only two vectors in the whole two-dimensional vector space with real meaning: the two orthogonal vectors |0⟩ and |1⟩, as depicted in Figure 1a. Conversely, qubits do not suffer from this limitation. The general state of a qubit is |ψ⟩ = α₀|0⟩ + α₁|1⟩, where α₀ and α₁ are two complex numbers constrained only by the requirement that |ψ⟩, like |0⟩ and |1⟩, should be a unit vector in the complex vector space; in other words, only by the normalization condition |α₀|² + |α₁|² = 1.
The Bloch sphere is commonly used to depict a qubit (Figure 1b). Two angles represent the state, 0 ≤ θ ≤ π and 0 ≤ ϕ < 2π. Thus, the state |ψ⟩ can be rewritten as
$$|\psi\rangle = e^{i\gamma}\left(\cos\frac{\theta}{2}\,|0\rangle + e^{i\phi}\sin\frac{\theta}{2}\,|1\rangle\right)$$
The vector from the origin to the point representing the state makes an angle of θ with the z-axis, and its component in the x-y plane makes an angle of ϕ with the x-axis. γ is the global phase, which does not affect the measurable probabilities of the quantum state (it only introduces a uniform phase shift to the whole state). The state |0⟩ is the North Pole of the sphere, and the state |1⟩ is the South Pole.
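As an illustration of the Bloch-sphere parametrization, the following sketch converts the two angles into the pair of amplitudes. It assumes the global phase γ is set to zero, since it has no measurable effect; the function name is hypothetical.

```cpp
#include <cmath>
#include <complex>
#include <utility>

using Amplitude = std::complex<double>;

// Amplitudes (alpha0, alpha1) of cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>.
// Assumes 0 <= theta <= pi, so that sin(theta/2) is non-negative, and drops
// the global phase gamma.
std::pair<Amplitude, Amplitude> blochToAmplitudes(double theta, double phi) {
    const Amplitude alpha0 = std::cos(theta / 2.0);
    const Amplitude alpha1 = std::polar(std::sin(theta / 2.0), phi);
    return {alpha0, alpha1};
}
```

With θ = 0 the result is the North Pole |0⟩, with θ = π it is the South Pole |1⟩, and for any angles the squared magnitudes of the two amplitudes sum to one.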
The general equation of an n-qubit state is
$$|\psi\rangle = \sum_{x=0}^{2^{n}-1} \alpha_x\,|x\rangle$$
Or, in its expanded form:
Where |0...00⟩ = |0⟩ ⊗ ... ⊗ |0⟩ ⊗ |0⟩, ..., |1...11⟩ = |1⟩ ⊗ ... ⊗ |1⟩ ⊗ |1⟩. As we can see, a single complex number can specify a single-qubit state, so n complex numbers can specify any
The special characteristic of quantum states is that they allow the system to be in a
quantum physics, if a quantum system can be found to be in one of a discrete set of states, which we will write as |0⟩ or |1⟩, then, whenever it is not being observed it may also exist
Because a qubit can take on any one of infinitely many states, one might think that a single qubit could store lots of classical information. However, the properties of quantum measurement severely restrict the amount of information that can be extracted from a qubit. Information about a quantum bit can be obtained only by measurement, and any measurement results in one of only two states, the two basis states associated with the measuring device; thus, a single measurement yields, at most, a single classical bit of
Quantum entanglement describes a correlation between different parts of a quantum system that surpasses anything classically possible. It happens when the subsystems interact so that the resulting state of the whole system cannot be expressed as the direct product of the states of its parts [5]. States that cannot be written as the tensor product of n single-qubit states are called entangled states. Thus, most quantum states are entangled [9]. States that can be written as such a tensor product are said to be separable states.
In the Quantum Circuits model, the fundamental transformation of a quantum state is carried out using Quantum Gates, which are the basic components of quantum circuits. Quantum gates are analogous to classical logic gates but operate on qubits instead of
To transform the state of equation 1, we need $2^n \times 2^n$ unitary matrices. Applying a single-qubit gate G to the i-th qubit of an n-qubit quantum state amounts to multiplying
by-step procedure for solving a problem, where each step or instruction can be performed
performing each step on a quantum computer. Although all classical algorithms can also be performed on a quantum computer, the term quantum algorithm is generally used for those algorithms that incorporate some essential feature of quantum computing, such as superposition or entanglement. There are three classes of quantum algorithms with clear
Version October 24, 2024 submitted to Quantum Rep.
Figure 2. Quantum Gates: The Pauli X gate acts linearly and takes the state α|0⟩ + β|1⟩ to the corresponding state in which the roles of |0⟩ and |1⟩ have been interchanged; it is the quantum equivalent of the NOT gate for classical computers. The Hadamard gate is the first authentic quantum gate because it can generate superposition states. The Phase Shift gate is a single-qubit gate that leaves the basis state |0⟩ unchanged and maps the state |1⟩ to $e^{i\phi}|1\rangle$.
advantages over known classical algorithms: algorithms based upon quantum versions of the Fourier transform, quantum search algorithms, and quantum simulations.
From a recent review, we take some quantum simulators that currently lead the field to characterize their main properties, performance, execution mode, and simulation results to provide comparison and analysis. To facilitate this task, we work with open-source simulators. These simulators are considered state-of-the-art due to several factors [10]:
• Innovative Features: Each simulator offers unique capabilities that set them apart,
works, or novel approaches to handling quantum state representations. For example, qsim’s integration with Cirq and its ability to simulate up to 40 qubits on a high-performance workstation make it a significant tool for developers and researchers.
• Adoption and Partnerships: Some of these simulators are backed by major tech companies and have extensive partnerships within the industry, increasing their
• Academic and Commercial Use: These tools are not only used in academic research but are also increasingly adopted by industries for practical applications, which
• Recent Updates and Community Support: The continual updates, community support, and documentation available for these tools contribute to their status as leaders in the field. This ongoing development ensures they remain relevant and useful as
• Open Collaboration: Open-source projects encourage open collaboration among developers, researchers, and users. Ensuring the source code is available for modification and redistribution fosters a community-driven development approach. This can lead to rapid improvements and innovations, as a diverse group of contributors can work
The combination of these factors makes these simulators outstanding in the current world of quantum computing, pointing towards their innovativeness and leadership in
The quantum Fourier transform (QFT) was selected to carry out the simulations because it offers several advantages: it is a well-studied quantum algorithm with known properties, making it a reliable benchmark for validating the accuracy and efficiency of quantum simulators on classical hardware. QFT’s performance scales predictably with the number of qubits, allowing researchers to analyze how the simulator handles increasing complexity. Performing QFT simulations helps estimate the computational resources (memory, processing power) required for larger, more complex quantum algorithms. Finally, QFT is a crucial component in many quantum algorithms, such as Shor’s algorithm for factoring large integers. Simulating QFT provides a foundation for testing and understanding these
To evaluate the performance of the selected simulators, the following platforms were used:
• Platform 1: One node of the Guane cluster at the Supercomputing Center of Universidad Industrial de Santander, with the following configuration: two AMD EPYC 9554 64-core processors (two threads per core) @ 3.1 GHz and 375 GB of RAM.
• Platform 2: A workstation with one Intel(R) Xeon(R) E-2136 6-core CPU (two threads per core) @ 3.30 GHz, 32 GiB of RAM, and an NVIDIA Corporation GP106GL Quadro P2000 GPU with 5 GB. This node is used only for GPU-capable simulators.
processing and has an intuitive Python interface. It is a full-state vector simulator using arbitrary single-qubit gates and controlled two-qubit gates [11]. The Intel Quantum Simulator leverages the full capabilities of an HPC system through its shared and
enhancements such as vectorization, threading, and cache optimization through the process of gate fusion. The primary object in the Intel Quantum Simulator (IQS) is the QubitRegister, representing the quantum state of the qubits in the system of interest. When declaring a QubitRegister, the number of qubits must be specified to allocate enough memory to describe their state. The state can then be initialized to any computational basis state,
library is not restricted to qubit systems or specific quantum information processing tasks, being capable of simulating arbitrary quantum processes [12]. Quantum++ is developed using standard C++17 and has minimal external dependencies. It primarily utilizes the Eigen 3 linear algebra template library, which is header-only. Additionally, when available, it employs the OpenMP library to facilitate multi-processing. The primary data types are complex vectors and complex matrices, such as complex dynamic matrices, double dynamic matrices, complex dynamic column vectors, complex dynamic row vectors, etc.
Developed by Google, qsim is an optimized quantum circuit simulator that uses gate
[13]. Integrated with Cirq, it provides a robust environment for developing and testing quantum algorithms. To achieve cutting-edge simulations of quantum circuits, it uses gate fusion, AVX/FMA vectorized instructions, and OpenMP multi-threading. This relies on
NVIDIA’s cuQuantum SDK is another leading tool, designed to accelerate quantum circuit simulations on GPUs. This toolkit is essential for developers looking to leverage the power of GPUs to enhance simulation performance and scalability. It provides an integrated programming model tailored for a hybrid environment, enabling the combined
quantum computing simulator designed for simulating quantum circuits, state-vectors, and density matrices. Developed by the Quantum Technology Theory Group at the University of Oxford, QuEST is distinguished by its ability to utilize multithreading, GPU acceleration, and distribution, making it highly effective across various computing environments, from laptops to networked supercomputers. The toolkit is capable of simulating both pure quantum states and mixed states with precision, and supports a wide array of quantum operations. It allows for simulations that are extensible and adaptable, thanks to its open-source nature and support for various back-end hardware via its simple and flexible interface [14]. QuEST represents a pure state for a system of n qubits using $2^n$ complex floating-point numbers, with each real and imaginary component having double precision by default. However, QuEST can be configured to use single or quad precision if desired. The simulator stores the state using C/C++ primitives, which means that by default, the
Qrack is a high-performance quantum computer simulator that is written in C++ and supports OpenCL and CUDA [15,16]. It is particularly notable for its ability to simulate arbitrary numbers of entangled qubits, limited only by system resources. Qrack is designed to be embedded in other projects and includes a comprehensive suite of standard quantum gates, along with variations suitable for register operations and arbitrary rotations. The simulator is integrated with other quantum computing frameworks like ProjectQ and Qiskit, enhancing its versatility and application. Qrack also features optimizations for noiseless pure state simulations and includes tools that aid in the control, extension, and visualization of data from quantum circuits. Qrack maintains the state representation in a factorized form to enhance simulation efficiency. A general ket state |ψ⟩ of n qubits is
Regarding academic, community, and industry support for these simulators, the continual updates, active support, and documentation for these tools contribute to their status as leaders in the field. This ongoing development ensures they remain relevant and valuable as quantum computing technology evolves. Each of these simulators offers unique features and optimizations, making them suitable for various aspects of quantum computing research and development. Their continual evolution is critical as the quantum computing field strives to solve more complex problems and improve algorithm efficiency. Table 1 shows a comparison of the evaluated simulators of their design properties and
| | Intel-QS | Quantum++ | QuEST | Qrack | qsim |
| Programming Model | Provides C++ and Python interfaces, supports state vector simulation | C++ library with emphasis on flexibility and ease of integration | Offers a C library that’s easy to integrate and extend, with optional Python bindings | C++ based, with a focus on integrating with other quantum computing frameworks like Qiskit | Integrated with Cirq, emphasizes ease of use in Python for simulating quantum circuits |
| Design Properties | Focuses on scalability and performance across different computational environments | Prioritizes modular, generic programming for ease of adaptation and maintenance | Designed for precision and versatility in quantum state manipulation | Prioritizes rapid prototyping and flexibility for embedding in various applications | Designed to simulate large quantum circuits with high precision |
| Unique Features | Supports dynamic circuit simulation and state manipulation during runtime | Highly adaptable to various quantum computing models due to generic programming approach | Extensible and supports detailed state analysis tools like fidelity and entanglement measures | Integrates classical computing elements within quantum simulations for enhanced functionality | Deeply integrated with Google’s quantum computing framework, providing extensive simulation capabilities |
Other projects, like XACC and Qiskit, provide a full-stack approach to quantum computing, including a simulator and compilers and the possibility to run the program on
For convenience and agility, those simulators that provided QFT in their examples were compared under equal conditions. First, OpenMP-capable simulators are shown in Figure 3a. Second, GPU-capable simulators are depicted in Figure 3b.
(a) Shared Memory Performance using platform 1 (b) GPU Performance using platform 2
Figure 3. Comparison of the quantum Fourier transform using different simulators and optimization
techniques.
and to test the various memory management approaches, a software quantum simulator prototype was developed in C++ (The Memory eFficient Quantum Simulator, TMFQS) [17]. This prototype was designed so that strategies can be changed through minimal modifications: the data structures that represent the fundamental concepts of quantum computing, and the use of compression libraries, can be adjusted easily. On the other hand, to demonstrate the construction of a software quantum simulator in a simple way, the scope of this work was limited to optimization techniques using shared memory and distributed memory. The implementation using GPUs is left for further work.
It has to be pointed out that this prototype does not implement all the concepts of quantum computing, such as quantum error correction, entanglement, measurement and an extended set of quantum gates. The measurement operation, a fundamental aspect of quantum computing, was not implemented in this prototype because the primary focus of this research was to evaluate and optimize memory management strategies for quantum state simulation. The objective was to explore various methods, such as state pruning, data compression, and parallelization, to enhance the efficiency of memory use in large-scale quantum simulations. Since these techniques do not inherently depend
achieving the research goals. However, the measurement operation could be incorporated in future iterations to extend the simulator’s capabilities for practical quantum algorithm execution. Several scenarios were implemented to carry out the tests.
• Dynamic memory management. The primary purpose is to test the strategy of remov-
• Full State: The objective is to accelerate the simulations, avoiding the overhead intro-
• Full State with OpenMP: The intention is to accelerate the simulations of the previous version.
• Full State with data compression: The purpose is to test a lossy compression library
• Full State with MPI: The main objective of this scenario is to distribute the amplitude vector among different computing nodes, allowing for a greater number of qubits.
• Full State with MPI and data compression: Here, data compression was incorporated
As we saw in the previous section, the basic simulation concepts include the following elements.
• Quantum Gates: matrix representation of quantum gates. Only one-qubit and two-
We have used an array of double-precision floating point numbers to store the amplitudes. No data structure has been used to represent the states, since the vector indices are used to refer to them. To implement the method to apply a quantum gate to a quantum
proposed by [18]. Figure 4 shows the main classes of the prototype.
Figure 4. Class Diagram of the Prototype: QuantumRegister class represents a quantum state and
implements the main method to transform a quantum state (applyGate). The QuantumGate class
implements a small set of quantum gates using the matrix representation.
To evaluate the quantum simulator’s performance and memory management strategies, we first consider the case where the simulation is executed on a single processor. In this section, we will explore the key elements involved in simulating quantum systems on a single processor, focusing on the representation of the state vector and the application of
In this subsection, we discuss how the simulator represents the quantum state. Specifically, we describe the structure of the state vector, which stores the amplitudes of all possible quantum states, and explain how these amplitudes are organized to optimize
The state vector is a linear combination of states represented by the following expression.
numbers, so we need two float or two double numbers to represent them in the code. Of course, the state vector must fit in the local memory.
The amplitudes of the states are implemented using a one-dimensional double-precision array stored in a contiguous memory space. To increase performance, a single array was used to store both the real and the imaginary parts of each amplitude; that is, the state vector was linearized. The real parts are placed in the odd positions of this arrangement, and the imaginary parts are placed in the even positions. This strategy avoids jumping between two arrays, one for the real part and one for the imaginary part. Figure 5
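The interleaved layout can be sketched as follows. This is only an illustration with hypothetical names, not the prototype's actual class. It assumes zero-based array indices, so the real part of amplitude i sits at index 2i (the first, third, fifth, ... position when counting from one) and the imaginary part at index 2i + 1.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// State vector stored as one contiguous array of doubles:
// [re0, im0, re1, im1, ...].
struct InterleavedState {
    std::vector<double> data;

    // Allocates 2 * 2^numQubits doubles and starts in the state |0...0>.
    explicit InterleavedState(unsigned numQubits)
        : data(2 * (std::size_t{1} << numQubits), 0.0) {
        data[0] = 1.0;
    }

    std::size_t numStates() const { return data.size() / 2; }

    // Read the amplitude of a basis state, identified by its index.
    std::complex<double> get(std::size_t state) const {
        return {data[2 * state], data[2 * state + 1]};
    }

    // Write the amplitude of a basis state.
    void set(std::size_t state, std::complex<double> amplitude) {
        data[2 * state] = amplitude.real();
        data[2 * state + 1] = amplitude.imag();
    }
};
```

Because both parts of an amplitude are adjacent in memory, reading or updating one amplitude touches a single cache line rather than two separate arrays.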
Like other simulators such as Intel-QS, this prototype only implements single-qubit gates and controlled two-qubit gates. The minimum set of quantum gates developed to implement the Quantum Fourier Transform algorithm is: Identity, Hadamard, ControlledPhaseShift, ControlledNot, and Swap. All these quantum gates are implemented as two-dimensional double-precision arrays. This reduced set of quantum gates limits the simulation of algorithms that require additional gates. However, adding new single-qubit and controlled two-qubit gates is very easy. Just insert the corresponding matrix into the
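To apply a quantum gate $G_k$ to the k-th qubit of a state vector |ψ⟩ we have the
$$G_k|\psi\rangle = |\psi'\rangle = \begin{pmatrix} \alpha'_{0\ldots00} \\ \alpha'_{0\ldots01} \\ \vdots \\ \alpha'_{1\ldots10} \\ \alpha'_{1\ldots11} \end{pmatrix} \tag{4}$$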
The first traditional approach to face this problem is using sparse matrix management methods. However, [18] and [19] state that applying a single-qubit gate $G_k$
$$G_k = \begin{pmatrix} g_{00} & g_{01} \\ g_{10} & g_{11} \end{pmatrix} \tag{5}$$
to the k-th qubit of a quantum register of N qubits is equivalent to applying the gate to pairs of amplitudes whose indices differ only in the k-th bit of their binary representation.
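to apply the Hadamard gate to the first qubit of the state |00⟩. For the values: k = 0, $*\ldots*0_k*\ldots* = 00$, $*\ldots*1_k*\ldots* = 10$, $\alpha_{00} = 1 + 0i = 1$, $\alpha_{10} = 0$ and the Hadamard gate
$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \tag{7}$$
Replacing these values in equation 6, we obtain the following results.
$$\begin{aligned} \alpha'_{00} &= \frac{1}{\sqrt{2}}\cdot 1 + \frac{1}{\sqrt{2}}\cdot 0 = \frac{1}{\sqrt{2}} \\ \alpha'_{10} &= \frac{1}{\sqrt{2}}\cdot 1 - \frac{1}{\sqrt{2}}\cdot 0 = \frac{1}{\sqrt{2}} \end{aligned} \tag{8}$$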
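The pairwise update used in this worked example can be sketched in code. The sketch below is an assumption-laden illustration rather than the prototype's `applyGate`: it uses `std::complex` instead of the interleaved array for readability, all names are hypothetical, and it assumes that each pair is updated as α′(0ₖ) = g00·α(0ₖ) + g01·α(1ₖ) and α′(1ₖ) = g10·α(0ₖ) + g11·α(1ₖ), with qubit 0 being the leftmost bit of the state index.

```cpp
#include <array>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Amplitude = std::complex<double>;
using Gate = std::array<std::array<Amplitude, 2>, 2>; // g[row][col]

// Apply a 2x2 gate to qubit k of an n-qubit state vector, in place.
void applySingleQubitGate(std::vector<Amplitude>& state, unsigned numQubits,
                          unsigned k, const Gate& g) {
    // Index distance between the two members of a pair.
    const std::size_t stride = std::size_t{1} << (numQubits - 1 - k);
    for (std::size_t i = 0; i < state.size(); ++i) {
        if (i & stride) continue; // handle each pair once, from its 0_k member
        const Amplitude a0 = state[i];
        const Amplitude a1 = state[i + stride];
        state[i]          = g[0][0] * a0 + g[0][1] * a1;
        state[i + stride] = g[1][0] * a0 + g[1][1] * a1;
    }
}

// Hadamard matrix of equation 7.
Gate hadamard() {
    const double h = 1.0 / std::sqrt(2.0);
    Gate g;
    g[0][0] = h; g[0][1] = h;
    g[1][0] = h; g[1][1] = -h;
    return g;
}
```

Calling `applySingleQubitGate(state, 2, 0, hadamard())` on the state |00⟩ leaves amplitudes of 1/√2 at indices 00 and 10, matching equation 8.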
Similarly, to apply a controlled two-qubit quantum gate to a quantum register, using a control qubit c on a target qubit t, the authors of [19] state that the new amplitudes can be
$$\begin{aligned} \alpha'_{*..*1_c*..*0_t*..*} &= g_{00}\cdot\alpha_{*..*1_c*..*0_t*..*} + g_{01}\cdot\alpha_{*..*1_c*..*1_t*..*} \\ \alpha'_{*..*1_c*..*1_t*..*} &= g_{10}\cdot\alpha_{*..*1_c*..*0_t*..*} + g_{11}\cdot\alpha_{*..*1_c*..*1_t*..*} \end{aligned} \tag{9}$$
Let’s see how to apply the CPS gate to the second qubit of the state |11⟩ controlled by the first qubit. All amplitudes are equal to 0 except $\alpha_{11}$, which is equal to 1. Replacing these
$$\begin{aligned} \alpha'_{10} &= 1\cdot 0 + 0\cdot 1 = 0 \\ \alpha'_{11} &= 0\cdot 0 + e^{i\phi}\cdot 1 = e^{i\phi} \end{aligned} \tag{10}$$
Thus, we obtain the amplitude values for the states |10⟩ and |11⟩
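A controlled gate can be sketched in the same style. The code below is a hedged illustration with hypothetical names: it assumes that qubit 0 is the leftmost bit, that only amplitudes whose control bit is 1 are touched, and that the 2×2 matrix of the phase-shift gate is diag(1, e^{iϕ}).

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <vector>

using Amplitude = std::complex<double>;
using Gate = std::array<std::array<Amplitude, 2>, 2>; // g[row][col]

// Apply the 2x2 gate g to the target qubit on every pair whose control bit is 1.
void applyControlledGate(std::vector<Amplitude>& state, unsigned numQubits,
                         unsigned control, unsigned target, const Gate& g) {
    const std::size_t cMask = std::size_t{1} << (numQubits - 1 - control);
    const std::size_t tMask = std::size_t{1} << (numQubits - 1 - target);
    for (std::size_t i = 0; i < state.size(); ++i) {
        // Start each pair from the index with control = 1 and target = 0.
        if (!(i & cMask) || (i & tMask)) continue;
        const Amplitude a0 = state[i];
        const Amplitude a1 = state[i | tMask];
        state[i]         = g[0][0] * a0 + g[0][1] * a1;
        state[i | tMask] = g[1][0] * a0 + g[1][1] * a1;
    }
}

// Phase-shift matrix: leaves |0> unchanged and maps |1> to e^{i phi}|1>.
Gate phaseShift(double phi) {
    Gate g;
    g[0][0] = 1.0; g[0][1] = 0.0;
    g[1][0] = 0.0; g[1][1] = std::polar(1.0, phi);
    return g;
}
```

Applied to |11⟩ with control qubit 0 and target qubit 1, this leaves the amplitude of |10⟩ at 0 and turns the amplitude of |11⟩ into e^{iϕ}, as in equation 10.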
Some simulators, like Qiskit, reverse the order of the qubits such that qubit 0 corresponds to the least significant bit of the binary representation of the state. In this case, the
In this work, we maintain the natural order of the qubits. For example, in state |011⟩, qubit 0 is the leftmost, qubit 1 is in the middle, and qubit 2 is the rightmost. Therefore, the distance between $\alpha'_{*\ldots*0_k*\ldots*}$ and $\alpha'_{*\ldots*1_k*\ldots*}$ is equal to $2^{(\mathrm{numQubits}-1)-k}$, where k is the index of the qubit. To illustrate this, Figure 6 shows the distance between the states of a 4-qubit state vector.
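Generally, a single-qubit gate can be applied to a quantum register performing the

for each amplitude in the state vector
do
    compute the new amplitude of the current state (equation 6)
    find the pair corresponding to the current state
    compute the new amplitude of the pair
done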
In summary, calculating the amplitudes for the current state and the new affected state is done as follows: Determine the value of the current state’s amplitude using equation 6. Then, find the pair corresponding to the current state, and finally, calculate the value of the
To find the pair corresponding to the current state, we can use two methods: the first calculates the distance using the relation $2^{(\mathrm{numQubits}-1)-k}$, as we explained before. The second method applies an XOR operation between the binary representation of the current state and a sequence of 0s with a 1 placed in the k-th position corresponding to the qubit we are working on. For example, applying a quantum gate on the 0th qubit of the 4-qubit state |0101⟩, we can find the corresponding pair using the following operation.
$$0101 \oplus 1000 = 1101 \tag{11}$$
This result can be corroborated in Figure 6. C++ offers binary operations to execute

unsigned int pos = numQubits - qubit - 1;
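Both ways of locating the pair can be sketched as below. This is an illustrative sketch, not the prototype's code: the helper names are hypothetical, and only the expression `numQubits - qubit - 1` comes from the text. It assumes the natural qubit order, in which qubit 0 is the most significant bit of the state index.

```cpp
#include <cstddef>

// Mask with a single 1 at the bit position of the given qubit
// (qubit 0 is the leftmost, most significant bit).
std::size_t qubitMask(unsigned numQubits, unsigned qubit) {
    const unsigned pos = numQubits - qubit - 1;
    return std::size_t{1} << pos;
}

// Method 1: move by the distance 2^((numQubits - 1) - qubit), forwards if
// the qubit's bit is 0 in the current state and backwards if it is 1.
std::size_t pairByDistance(std::size_t state, unsigned numQubits, unsigned qubit) {
    const std::size_t distance = qubitMask(numQubits, qubit);
    return (state & distance) ? state - distance : state + distance;
}

// Method 2: flip the qubit's bit with a bitwise XOR.
std::size_t pairByXor(std::size_t state, unsigned numQubits, unsigned qubit) {
    return state ^ qubitMask(numQubits, qubit);
}
```

For the state 0101 and qubit 0 of a 4-qubit register, both functions return 1101, in agreement with equation 11. The XOR form avoids the branch needed by the distance form.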
In the version of the simulator where the least probable states are pruned, we use dynamic memory management because the states are stored non-sequentially in memory. This arrangement results from their computation via equation 6. Consequently, a state search method was developed to facilitate access to a specific state for calculations in subsequent iterations. However, performance is negatively impacted because a lot of time has to be spent searching for a state’s values before they are used in a calculation. Figure 7 depicts the order of a 3-qubit state vector after applying a single-qubit gate (Hadamard)
Because of this, the approach of eliminating the less probable states was discarded early. Therefore, we focus on pure states, which implies that the state vector contains the complete information about the quantum state, and this approach was adopted for the rest of this
The large volumes of data produced by extreme-scale scientific research and applications have driven the development of various data compression techniques for years. The compression methods are optimized for floating point data. However, they require additional calculations to compress and decompress data before working with it. Leading
– They use variable length encoding algorithms such as: Huffman encoding, Arith-
• Error-Bounded Lossy Compressors allow some controlled distortion. Can be broadly
A key objective of this work is to identify a method for reducing memory consumption in a software quantum simulator. To achieve this, we have chosen error-bounded lossy compressors, the compression technique that offers the best compression rate.
To compress the amplitude vector, we use the ZFP library [20] as it provides significant performance in accuracy and data size reduction. Although ZFP supports both lossy and lossless compression, as we stated before, we used the lossy approach to gain a better
To go from a vector of amplitudes using traditional data types to a compressed vector, change the corresponding line in the types.h source file from typedef std::vector<double>
the corresponding header file from the ZFP library must be included.
To improve performance, parallelizing the code is necessary. The first method is to apply a shared memory programming model. This was done using OpenMP.
We use valgrind to profile the program and determine the sections of code that consume the most resources. The profiling showed that the QuantumRegister::applyGate method is the component of the simulator where we had to focus on increasing performance. Figure 8 shows the profiling results.
The QuantumRegister::applyGate method iterates through the state vector, implementing equation 6. To enhance performance, we partition the data and execute instructions on segments of the state vector, thereby speeding up the simulation. It is crucial to carefully
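One possible way to partition this loop with OpenMP is sketched below. It is not the prototype's `QuantumRegister::applyGate`; the names are hypothetical and the sketch assumes a `std::complex` state vector with qubit 0 as the leftmost bit. The loop runs over pair numbers rather than amplitudes, so every iteration reads and writes a distinct pair and the iterations can be distributed among threads without synchronization.

```cpp
#include <array>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using Amplitude = std::complex<double>;
using Gate = std::array<std::array<Amplitude, 2>, 2>; // g[row][col]

// Apply a 2x2 gate to qubit k, with one loop iteration per amplitude pair.
void applyGateParallel(std::vector<Amplitude>& state, unsigned numQubits,
                       unsigned k, const Gate& g) {
    const unsigned pos = numQubits - 1 - k;           // bit position of qubit k
    const std::size_t stride = std::size_t{1} << pos; // distance inside a pair
    const std::int64_t numPairs = static_cast<std::int64_t>(state.size() / 2);

    // Without OpenMP support enabled, the pragma is ignored and the loop
    // runs serially with the same result.
    #pragma omp parallel for schedule(static)
    for (std::int64_t p = 0; p < numPairs; ++p) {
        const std::size_t u = static_cast<std::size_t>(p);
        // Insert a 0 bit at position pos to get the index of the 0_k member.
        const std::size_t i = ((u >> pos) << (pos + 1)) | (u & (stride - 1));
        const Amplitude a0 = state[i];
        const Amplitude a1 = state[i + stride];
        state[i]          = g[0][0] * a0 + g[0][1] * a1;
        state[i + stride] = g[1][0] * a0 + g[1][1] * a1;
    }
}
```

A static schedule is chosen here because every iteration performs the same amount of work; whether it is the best choice for a given machine is an empirical question.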
In the distributed memory model, the state vector needs to be divided among numProcs processes. On the other hand, equation 6, proposed in [18], indicates that the calculation of the amplitudes of the states must be done in pairs; therefore, we must guarantee that the number of amplitudes per process is even. To achieve this, we use the relationship
$$\mathrm{numProcs} = \frac{2^{\mathrm{numQubits}}}{2^{m}} \tag{12}$$
where $2^m$ is the number of states per process. Here we can face two cases:
• The pair corresponding to the current state is located in another process. In this case it is
Figure 9 shows the pairwise calculation scheme for a 5-qubit state vector, applying each qubit. Partitioning with 2, 4, and 8 processes is also shown to visualize the communication
For instance, consider performing a calculation on qubit 0 of the state |00010⟩; the corresponding pair would be |10010⟩. If two processes are used, communication should be established with process 1. If four processes are utilized, the remote process is process 2. Lastly, if eight processes are employed, the remote process will be process 4.
We use the following expression to calculate the process’s identifier where the corre-
$$\mathrm{remoteProcID} = \frac{\mathrm{pairState}}{2^{m}} \tag{13}$$
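Equations 12 and 13 can be checked with a few lines of code. The sketch below uses hypothetical function names and assumes that the number of processes is a power of two, that each process holds a contiguous block of 2^m states, and that the quotient in equation 13 is an integer division.

```cpp
#include <cstddef>

// Number of states held by each process, 2^m in equation 12.
// Assumes numProcs is a power of two not larger than 2^numQubits.
std::size_t statesPerProcess(unsigned numQubits, std::size_t numProcs) {
    return (std::size_t{1} << numQubits) / numProcs;
}

// Process that owns a given state index (equation 13, integer division).
std::size_t ownerOf(std::size_t state, unsigned numQubits, std::size_t numProcs) {
    return state / statesPerProcess(numQubits, numProcs);
}

// Process that owns the pair of `state` when a gate acts on `qubit`
// (qubit 0 is the leftmost bit of the state index).
std::size_t remoteProcId(std::size_t state, unsigned numQubits, unsigned qubit,
                         std::size_t numProcs) {
    const std::size_t pairState = state ^ (std::size_t{1} << (numQubits - 1 - qubit));
    return ownerOf(pairState, numQubits, numProcs);
}
```

For the state |00010⟩ and qubit 0 of a 5-qubit register, `remoteProcId` returns 1, 2, and 4 for 2, 4, and 8 processes respectively, which matches the example above. When the returned identifier equals `ownerOf(state, ...)`, the pair is local and no communication is needed.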
In Figure 9, it is evident that for 2 processes, specifically regarding qubit 0, the number
mitigate the overhead caused by the extensive number of communications, the entire
segment of the state vector is exchanged between the peer processes involved, as outlined
in Equation 6. The calculations are then made locally, and the results are communicated
For this reason, we are unable to use the total sum of the local memory of each node
to increase the number of qubits, and can only utilize half of the combined memory from
Combining amplitude vector compression with amplitude vector distribution across
multiple processes can be effective in terms of both memory
usage and overall simulation performance. To achieve this, we parallelized the version
that stores the amplitudes in a compressed vector. To obtain effective performance, the
state vector portions are transmitted in compressed form. This makes communications
To achieve compressed messaging, the compressed portions of the state vector must
be serialized, and a custom MPI data type must be used in the send and receive functions.
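The serialization step can be sketched as follows. This is only an illustration under stated assumptions, not the TMFQS implementation: zlib from the Python standard library stands in for ZFP (zlib is lossless, whereas ZFP is typically used in a lossy mode), no MPI call is made, and the names serialize_segment and deserialize_segment are hypothetical. The sketch shows why a self-describing message is useful: the receiver cannot know the size of a compressed segment in advance, so the buffer carries a small header with the amplitude count and the payload length.

```python
import struct
import zlib

HEADER = struct.Struct("<QQ")  # amplitude count, compressed payload size


def serialize_segment(amplitudes):
    """Pack a segment of complex amplitudes into one contiguous byte buffer."""
    raw = b"".join(struct.pack("<dd", a.real, a.imag) for a in amplitudes)
    payload = zlib.compress(raw)  # stand-in for a ZFP-compressed block
    return HEADER.pack(len(amplitudes), len(payload)) + payload


def deserialize_segment(buffer):
    """Rebuild the list of complex amplitudes from a received buffer."""
    count, size = HEADER.unpack_from(buffer, 0)
    raw = zlib.decompress(buffer[HEADER.size:HEADER.size + size])
    values = struct.unpack("<%dd" % (2 * count), raw)
    return [complex(values[2 * i], values[2 * i + 1]) for i in range(count)]


# A uniform segment compresses well, so the message is far smaller than the
# 16 bytes per amplitude of the uncompressed representation.
segment = [complex(0.25, 0.0)] * 1024
message = serialize_segment(segment)
print(len(message), "bytes sent instead of", 16 * len(segment))
```

In an MPI program the resulting buffer would be the object handed to the send and receive calls; how the custom data type is declared depends on the compression library and is not shown here.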
To validate the accuracy of our quantum simulator, we executed different
tests and compared the outputs with Intel-QS and Quantum++.
To test the superposition principle, we apply the Hadamard gate to a quantum register of
The test was executed by initializing the first state with a probability equal to one,
that is, 1 × |0000⟩. Then, we repeat the experiment with 1 × |0001⟩ and so on until
executing the test with the last state, 1 × |1111⟩. The results of executing this quantum
circuit with Intel-QS, Quantum++, and TMFQS were the same.
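The expected outcome of this test can be reproduced with a small reference sketch. It assumes a 4-qubit register (suggested by the states |0000⟩ to |1111⟩) and a Hadamard gate applied to every qubit; both are assumptions, as are the function names, and the code is independent of Intel-QS, Quantum++, and TMFQS. For any initial basis state, every one of the 16 outcomes should have probability 1/16.

```python
import math


def apply_hadamard(state, qubit, num_qubits):
    """Apply a Hadamard gate to `qubit` of a full state vector.

    Assumption: qubit 0 is the most significant bit of the state index.
    """
    stride = 1 << (num_qubits - 1 - qubit)
    inv_sqrt2 = 1.0 / math.sqrt(2.0)
    out = list(state)
    for i in range(len(state)):
        if i & stride == 0:  # visit each amplitude pair exactly once
            a, b = state[i], state[i | stride]
            out[i] = (a + b) * inv_sqrt2
            out[i | stride] = (a - b) * inv_sqrt2
    return out


def superposition_probabilities(basis_index, num_qubits=4):
    """Start in one basis state, apply H to every qubit, return probabilities."""
    state = [0j] * (1 << num_qubits)
    state[basis_index] = 1 + 0j
    for qubit in range(num_qubits):
        state = apply_hadamard(state, qubit, num_qubits)
    return [abs(amplitude) ** 2 for amplitude in state]


# Repeat the experiment for |0000>, |0001>, ..., |1111>.
for basis in range(16):
    probs = superposition_probabilities(basis)
    assert all(abs(p - 1 / 16) < 1e-12 for p in probs)
```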
5. Results
This section presents the results of several quantum simulation tests performed using
quantum operations, we can assess how well these strategies reduce memory consumption
and improve the efficiency of quantum computing simulations on classical hardware.
Throughout the section, we compare the simulator’s performance with and without the
section aims to contribute to ongoing efforts to make quantum computing simulations
more efficient and scalable, ultimately advancing the field of quantum computing.
TMFQS was evaluated using the quantum Fourier transform, as in assessing the
To run the simulations, we use two high-performance nodes from the scientific computing
center of the Universidad Industrial de Santander (SC3-UIS) with the following
characteristics: two AMD EPYC 9554 64-Core (two threads per core) @ 3.1 GHz Processors
We can free memory that is not needed by eliminating the quantum states that have
the smallest chance of occurring, that is, those states whose amplitude is close
to zero. However, in addition to the drawbacks of this approach (loss of fidelity,
impact on algorithm accuracy, error accumulation, threshold sensitivity, and impact on
quantum entanglement), it requires dynamic memory management, which introduces
considerable extra work: before applying a quantum gate to a state, we need to search for it
in the array because the states are not ordered. That is, to apply every quantum gate, we
need to execute $2^n$ search operations. This approach was tested using the Quantum Fourier
Transform algorithm. The quantum register contains all the states at the end of executing
the Quantum Fourier Transform algorithm. Due to the initial superposition process, the
quantum register also has all the states in the first stage of Grover’s algorithm. Therefore,
this approach does not work well for these algorithms.
For these reasons, along with the risks outlined previously, we have decided to discard
this approach because its numerous disadvantages outweigh its benefits.
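The observation that pruning cannot help for the Quantum Fourier Transform can be checked with a short sketch. It uses the closed form of the QFT of a basis state, in which every amplitude has magnitude 1/√(2^n), rather than a gate-by-gate simulation; the threshold value and the function names are illustrative assumptions.

```python
import cmath
import math


def qft_of_basis_state(x, num_qubits):
    """Amplitudes of QFT|x>: exp(2*pi*i*x*k / N) / sqrt(N) for k = 0..N-1."""
    n = 1 << num_qubits
    return [cmath.exp(2j * math.pi * x * k / n) / math.sqrt(n) for k in range(n)]


def prune(amplitudes, threshold):
    """Keep only the states whose amplitude magnitude exceeds `threshold`."""
    return {i: a for i, a in enumerate(amplitudes) if abs(a) > threshold}


num_qubits = 10
kept = prune(qft_of_basis_state(5, num_qubits), threshold=1e-6)
# Every amplitude has magnitude 1/32 here, so no state is removed.
print(len(kept), "of", 1 << num_qubits, "states survive pruning")
```

Because all 2^n states survive, the pruned representation stores as many amplitudes as the full-state vector while still paying for the state lookup.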
A quantum register with all the states arranged in a sequence can reduce the overhead
of searching for quantum states. This also eliminates the need for an extra data structure to
store the states and uses the indices of the amplitude vector to handle the quantum states.
The total amount of memory needed using this strategy is $2^{\mathrm{numQubits}} \times 16$ bytes for
double-precision floating-point numbers. However, we save $2^{\mathrm{numQubits}} \times 4$ bytes by
avoiding the state vector array.
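As a quick arithmetic illustration of these two expressions, the following sketch evaluates them for a few register sizes. The constant and function names are hypothetical; the 16 bytes correspond to one complex amplitude stored as two double-precision numbers, and the 4 bytes to one state index, as stated above.

```python
BYTES_PER_AMPLITUDE = 16   # complex number: two 8-byte doubles
BYTES_PER_STATE_INDEX = 4  # one entry of the state array that is avoided


def full_state_bytes(num_qubits: int) -> int:
    """Memory needed by the amplitude vector of a full-state register."""
    return (1 << num_qubits) * BYTES_PER_AMPLITUDE


def saved_index_bytes(num_qubits: int) -> int:
    """Memory saved by not storing a separate array of state indices."""
    return (1 << num_qubits) * BYTES_PER_STATE_INDEX


for n in (20, 25, 30):
    gib = full_state_bytes(n) / 2**30
    saved_gib = saved_index_bytes(n) / 2**30
    print(f"{n} qubits: {gib:.4f} GiB of amplitudes, {saved_gib:.4f} GiB saved")
```

For 30 qubits the amplitude vector alone occupies 16 GiB, and each additional qubit doubles that figure.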
Figure 12a shows the performance of the QFT with state pruning (dynamic
memory) versus the full-state strategy. Simulations using dynamic memory with
more than 20 qubits were discarded because their execution time exceeded one day.
Exponential growth is observed from an 18-qubit state vector onward in the dynamic memory
approach. This is a consequence of a substantial increase in the memory needed to represent
the state in question and, therefore, the processing time required. As can be seen by
comparing these strategies, the workload overhead of using dynamic memory is significant.
We have parallelized the full-state version to increase performance using the shared memory
model with OpenMP. Figure 12b shows the results.
(a) Dynamic Memory vs Full-State Approach (b) QFT Full-State Approach with OpenMP (log10)
Figure 12. QFT performance with different approaches
We observe a significant decrease in the processing time between the serial and parallel
execution of the full-state version as the number of qubits increases. That is, a considerable
speedup is obtained by parallelizing the simulation. For clearer interpretation
of the results, we plot the base-10 logarithm of the simulation time.
In the graph shown in Figure 12b, it is evident that for smaller numbers of qubits, there is
We selected ZFP, one of the most widely used C++ libraries for data compression,
to test this approach. We modified the full-state version of the simulator to compress the
amplitude vector. Figure 13a compares the performance of the
full-state version with and without ZFP. The base-10 logarithm is used to highlight
the differences between the two simulations more clearly. It is evident that the overhead introduced by
Figure 13b shows the amount of memory used by both simulator versions.
We can observe that the compression approach is highly efficient. This enables the
(a) QFT Full-State with ZFP (log10) (b) QFT Full-State with ZFP Data Size
Figure 13. QFT performance with ZFP
We have developed a simulator version employing MPI to increase memory capacity
by leveraging the RAM of additional compute nodes. To maintain computational efficiency,
it is essential to keep an appropriate ratio between the
number of processes and the allocation of qubits per process. Figure
14a shows the performance of the Quantum Fourier Transform for
qubit counts from 25 to 30, using 2, 4, 8, 16, 32, and 64 processes. The relationship in
Equation 12 is valid only from seven qubits onwards (with 64 processes, seven qubits is the
smallest register that gives each process a pair of amplitudes); however, to visualize the performance
more clearly, we use the range of 25 to 30 qubits. In this case, a logarithmic scale was
In addition to achieving better performance, we can see that by increasing the number
of processes, we can increase the number of qubits and reduce the size of the messages required
to exchange partial results between processes. We can also see that parallelism is helpful
We have developed a simulator version that combines MPI with OpenMP to achieve
better performance. In this approach, the state vector is evenly distributed across the
processes using MPI. OpenMP is then employed to parallelize the applyGate method,
further enhancing performance. Figure 14b illustrates the performance of the
Quantum Fourier Transform using this hybrid approach. Once again, the graph is clear
(a) QFT Full-State with MPI (b) QFT Full-State with MPI and OpenMP
Figure 14. QFT performance using parallel techniques. Each line corresponds to a specific number of
qubits.
Comparing the results in Figures 14a and 14b, we see that the combination
of MPI and OpenMP increases the performance, especially for cases where the size of
Taking advantage of distributed resources to have more memory available, combined
with data compression, makes it possible to perform simulations with a larger number of
qubits.
We have already seen that the compression process introduces a processing
overhead; however, transmitting compressed data contributes positively to
overall performance. We modified TMFQS to test this approach. Figure 15 shows the
performance of the distributed memory approach with data compression for
25 to 30 qubits, using 2, 4, 8, 16, 32, and 64 processes.
In this graph, we can see that performance has decreased; however, the reduction of
the required memory is significant because the same strategy used in the section on data
compression is adopted here. It should be pointed out that this strategy is valid only if
To validate the results obtained with TMFQS, we compare it with other simulators.
First, specific conditions must be established to allow a fair comparison of the
simulators studied. The common conditions were a single computing node with a
shared memory model. Simulators that use GPUs are excluded because their performance
is much higher than the others, but their scaling is limited. The case of using distributed
memory is also excluded because only some simulators include this capability. Figure 16
shows the performance of the quantum Fourier transform for Intel-QS, Quantum++, QuEST,
and TMFQS using the shared memory model. All simulators were compiled with the Intel oneAPI
suite and the traditional optimization flag (-O2). The selected simulators use the C++ complex
number data type for memory management. They use a full-state vector scheme, which
allows better performance but does not reduce memory consumption.
As can be seen in Figure 16, the Intel-QS simulator performs worse than
the other simulators. QuEST exhibited the best performance. TMFQS performs acceptably
compared to these mature tools that have been optimized, for example, by using libraries
6. Conclusions
between theoretical understanding and practical implementation strategies. The limitations
of current quantum hardware, including qubit count and quality, drive the need for
quantum simulators that allow researchers to explore quantum algorithms on classical
computers. This work has shown that memory management techniques, such as dynamic
pruning, full-state representation, and data compression, are essential for optimizing the
simulation of quantum systems. While pruning techniques introduce certain challenges,
such as fidelity loss and increased computational complexity, full-state representation with
parallelization (via OpenMP or MPI) provides a robust framework for simulating larger
quantum states. The use of data compression, such as ZFP, further extends the capacity to
simulate a greater number of qubits without exceeding memory limits, though it introduces
The comparative performance of the prototype simulator against established simulators
like Intel-QS, QuEST, and qsim demonstrates the viability of these memory management
techniques. By combining distributed and shared memory models, along with
data compression, the simulator can handle increasingly complex simulations. Ultimately,
this work contributes valuable insights into making quantum computing simulations more
scalable and efficient, supporting the broader field of quantum computing as it continues
to evolve.
While it is critical to investigate the potential of data compression and distribution to
enable the simulation of a larger number of qubits, performing such experiments
would require a substantial amount of additional time and computational resources,
especially given the exponential growth of memory demands with each added qubit. This
work was framed within a larger project, and the main goal of this study was to
explore memory management techniques in practical and scalable scenarios. A
thorough evaluation of the maximum number of qubits achievable through compression
and distribution, while essential, extends beyond the scope of the current research. This
evaluation will be addressed in future work, where a more detailed exploration of the
trade-offs between memory compression efficiency and computational overhead will be
performed.
Author Contributions: G.D.: Project development, conceptualization, investigation, formal analysis,
software development, methodology, simulation execution, manuscript writing. L.S.: supervision
and review. C.B.: supervision and review. J.C.: supervision and review. All authors have read and
Data Availability Statement: This work did not utilize any input data or generate new output data
for analysis. Instead, simulations were conducted, and their results were compared with those from
other leading simulators in the field. The outcomes are fully reproducible by running the example
Bibliography
2019.
5. Bergou, J.A.; Hillery, M. Introduction to the Theory of Quantum Information Processing; Springer
6. Artur Ekert, P.H.; Inamori, H. Basic concepts in quantum computation. Coherent atomic matter
8. Williams, C.P. Explorations in Quantum Computing, Second Edition; Texts in Computer Science,
9. Eleanor, R.; Wolfgang, P. Quantum Computing, A Gentle Introduction; The MIT Press, 2011.
10. Insider, Q. Top 63 Quantum Computer Simulators For 2024. https://thequantuminsider.com,
11. Guerreschi, G.G.; Hogaboam, J.; Baruffa, F.; Sawaya, N. Intel Quantum Simulator: A cloud-ready
12. Gheorghiu, V. Quantum++: A modern C++ quantum computing library 2014. [arXiv:1412.4704].
https://doi.org/10.1371/journal.pone.0208073.
14. Jones, T.; Brown, A.; Bush, I.; Benjamin, S.C. QuEST and High Performance Simulation of
74-9.
16. Strano, D.; Bollay, B.; Blaauw, A.; Shammah, N.; Zeng, W.J.; Mari, A. Exact and approximate
18. Trieu, D.B. Large-Scale Simulations of Error-Prone Quantum Computation Devices. Dr. (univ.),
Universität Wuppertal, Jülich, 2009. Record converted from VDB: 12.11.2012; Universität
19. Smelyanskiy, M.; Sawaya, N.P.D.; Aspuru-Guzik, A. qHiPSTER: The Quantum High Perfor-
20. Lindstrom, P. Fixed-Rate Compressed Floating-Point Arrays. IEEE Transactions on Visualization
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are
solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s).
MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from