
Article

Learning high-accuracy error decoding for quantum processors

https://doi.org/10.1038/s41586-024-08148-8

Received: 13 February 2024
Accepted: 2 October 2024
Published online: 20 November 2024

Johannes Bausch1,3 ✉, Andrew W. Senior1,3 ✉, Francisco J. H. Heras1,3, Thomas Edlich1,3, Alex Davies1,3, Michael Newman2,3, Cody Jones2, Kevin Satzinger2, Murphy Yuezhen Niu2, Sam Blackwell1, George Holland1, Dvir Kafri2, Juan Atalaya2, Craig Gidney2, Demis Hassabis1, Sergio Boixo2, Hartmut Neven2 & Pushmeet Kohli1

Open access

Building a large-scale quantum computer requires effective strategies to correct errors that inevitably arise in physical quantum systems1. Quantum error-correction
codes2 present a way to reach this goal by encoding logical information redundantly
into many physical qubits. A key challenge in implementing such codes is accurately
decoding noisy syndrome information extracted from redundancy checks to obtain
the correct encoded logical information. Here we develop a recurrent, transformer-
based neural network that learns to decode the surface code, the leading quantum
error-correction code3. Our decoder outperforms other state-of-the-art decoders
on real-world data from Google’s Sycamore quantum processor for distance-3 and
distance-5 surface codes4. On distances up to 11, the decoder maintains its advantage
on simulated data with realistic noise including cross-talk and leakage, utilizing soft
readouts and leakage information. After training on approximate synthetic data, the
decoder adapts to the more complex, but unknown, underlying error distribution by
training on a limited budget of experimental samples. Our work illustrates the ability
of machine learning to go beyond human-designed algorithms by learning from
data directly, highlighting machine learning as a strong contender for decoding in
quantum computers.

The idea that quantum computation has the potential for computational advantages over classical computation, both in terms of speed and resource consumption, dates all the way back to Feynman5. Beyond Shor's well-known prime factoring algorithm6 and Grover's quadratic speed-up for unstructured search7, many potential applications in fields such as material science8, machine learning9 and optimization10 have been proposed.

Yet, for practical quantum computation to become a reality, errors on the physical level of the device need to be corrected so that deep circuits can be run with high confidence in their result. Such fault-tolerant quantum computation can be achieved through redundancy introduced by combining multiple physical qubits into one logical qubit1. Ultimately, to perform fault-tolerant quantum computation such as the factorization of a 2,000-bit number, the logical error rate needs to be reduced to about 10⁻¹² per logical operation3,11, far below the error rates in today's hardware that are around 10⁻³ to 10⁻² per physical operation.

One of the most promising strategies for fault-tolerant computation is based on the surface code (Fig. 1), which has the highest-known tolerance for errors of any code with a planar connectivity3,12,13. In the surface code, a logical qubit is formed by a d × d grid of physical qubits, called data qubits, such that errors can be detected by periodically measuring X and Z stabilizer checks on groups of adjacent data qubits, using d² − 1 stabilizer qubits located between the data qubits (Fig. 1a). A detection event (or event) occurs when two consecutive measurements of the same stabilizer give different parity outcomes. A pair of observables XL and ZL, which commute with the stabilizers but anti-commute with each other, define the logical state of the surface code qubit. The minimum length of these observables is called the code distance, which represents the number of errors required to change the logical qubit without flipping a stabilizer check. In a square surface code, this is the side length d of the data-qubit grid.

The task of an error-correction decoder is to use the history of stabilizer measurements, the error syndrome, to apply a correction to the noisy logical measurement outcome to obtain the correct one. In the near term, highly accurate decoders can enable proof-of-principle demonstrations of fault tolerance. Longer term, they can boost the effective performance of the quantum device, requiring fewer physical qubits per logical qubit or reducing requirements on device accuracy3,4,14.

Quantum error correction frequently requires different decoding methods to classical error correction15,16 and, despite recent significant progress4,17–21, challenges remain. A quantum error-correction decoder must contend with complex noise effects that include leakage, that is, qubit excitations beyond the computational states ∣0⟩ and ∣1⟩ that are long-lived and mobile22; and cross-talk, that is, unwanted interactions between qubits inducing long-range and complicated patterns of events23. These effects fall outside the theoretical assumptions underlying most frequently used quantum error-correction decoders, such as minimum-weight perfect matching (MWPM)16,24. Extending decoders

1Google DeepMind, London, UK. 2Google Quantum AI, Santa Barbara, CA, USA. 3These authors contributed equally: Johannes Bausch, Andrew W. Senior, Francisco J. H. Heras, Thomas Edlich, Alex Davies, Michael Newman. ✉e-mail: [email protected]; [email protected]

834 | Nature | Vol 635 | 28 November 2024
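The detection-event construction described in the introduction (an event fires whenever two consecutive measurements of the same stabilizer disagree) can be sketched in a few lines. This is an illustrative sketch only, not the paper's code; the example simply uses the distance-3 stabilizer count d² − 1 = 8.

```python
# Illustrative sketch (not the paper's code) of the detection-event
# construction: an event fires when two consecutive measurements of the
# same stabilizer report different parities.
import numpy as np

def detection_events(stabilizer_history):
    """stabilizer_history: (n_rounds, n_stabilizers) array of 0/1 parities.
    Returns an (n_rounds - 1, n_stabilizers) array of detection events."""
    h = np.asarray(stabilizer_history, dtype=int)
    return np.bitwise_xor(h[1:], h[:-1])

# A distance-3 surface code has d*d - 1 = 8 stabilizer qubits.
history = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],  # round 1
    [0, 1, 0, 0, 0, 0, 0, 0],  # round 2: stabilizer 1 flips -> one event
    [0, 1, 0, 0, 0, 0, 0, 0],  # round 3: parity unchanged -> no new event
])
events = detection_events(history)
```

A decoder such as AlphaQubit or MWPM consumes exactly this kind of event history (plus, for AlphaQubit, richer analogue inputs) to infer the logical correction.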


to account for the complex noise effects described above4,25–27 or suppressing them on the hardware side23,28,29 are areas of active research. A further challenge for error-correcting real-world quantum devices is the difficulty in modelling errors accurately4,14,30,31. In principle, a decoder that adapts to more realistic noise sources and that learns directly from data (without the need to fit precise noise models) can help to realize a fault-tolerant quantum computer using realistic noisy hardware.

Fig. 1 | The rotated surface code and a memory experiment. a, Data qubits (grey circles) on a d × d square lattice (here shown for code distance d = 5) are interspersed with X and Z stabilizer qubits (X and Z in circles). The logical observables XL (ZL) are defined as products of X (Z) operators along a row (column) of data qubits. b, In a memory experiment, a logical qubit is initialized, repeated stabilizer measurements are performed and then the logical qubit state is measured. During the experiment all qubits and operations are subject to errors (here symbolically shown as bit (X), phase (Z), and combined bit and phase flips (Y) acting on individual data qubits from time step to time step).

Machine-learning quantum error-correction decoders

Recently, there has been an explosion of machine-learning techniques applied to quantum computation, including decoding (Extended Data Table 1). Some previous studies have used supervised learning32 or reinforcement learning33 to train neural-network-based decoders. Most consider qubit-level (as opposed to circuit-level) errors, which greatly simplifies the decoding problem as all errors are local in time. A handful of studies consider more realistic circuit-level noise models. Chamberland et al.34 trained recurrent and convolutional networks and matched look-up table performance for surface codes and colour codes with code distances up to 5. Baireuther et al.32 trained a long short-term memory (LSTM) decoder for a colour code corrupted by both Pauli and beyond-Pauli circuit-level noise. Zhang et al.35 developed fast decoders based on three-dimensional convolutions, and applied them to surface codes up to distance 7 with experimentally inspired circuit-level noise, but did not achieve MWPM performance.

More recently, two studies have tested the performance of machine-learning decoders using the Sycamore surface code experiment. Varbanov et al.36 built a recurrent-neural-network decoder based on the architecture of Baireuther et al.32,37. They trained it on a circuit-level depolarizing noise model (with noise parameters fitted to the experimental data) and evaluated on experimental data, approaching parity with the best previously published result at code distance 3. Furthermore, they quantified the benefits of modelling correlations and of using analogue inputs (modelled by a symmetric Gaussian I/Q—in-phase (I) and quadrature (Q)—readout noise model38), which yielded a slight increase in accuracy. Lange et al.39 trained a graph-neural-network decoder using both circuit-level and experimental data, showing parity with a standard MWPM decoder.

In this work, we present AlphaQubit, a recurrent-transformer-based neural-network architecture that learns to predict errors in the logical observable based on the syndrome inputs (Methods and Fig. 2a). This network, after two-stage training—pretraining with simulated samples and finetuning with a limited quantity of experimental samples (Fig. 2b)—decodes the Sycamore surface code experiments more accurately than any previous decoder (machine learning or otherwise). On simulated data, which model a near-term device with an approximately 6% detection event density (Fig. 4a, inset) using a richer noise model than those considered in previous machine-learning decoder work—including leakage, cross-talk and soft readouts with an amplitude damping component—AlphaQubit scales to code distance 11 and generalizes to 100,000 error-correction rounds while maintaining accuracy beyond correlated MWPM (MWPM-Corr). AlphaQubit benefits from analogue inputs (as previously observed36) and we show it maintains an accuracy lead against MWPM-Corr augmented to process analogue inputs40.

Quantum error correction on current quantum devices

As physical error rates in quantum hardware have been brought down, researchers have started to conduct error-suppression experiments on real quantum devices4,17,18,20,21,41. We first apply AlphaQubit to Google's Sycamore memory experiment42, which comprises both X-basis and Z-basis memory experiments on surface codes with distance 3 and distance 5. The 3 × 3 code block was executed at 4 separate locations on the Sycamore chip, and the 5 × 5 code block was executed at a single location. Fifty thousand experiments were performed for each total rounds count n ∈ {1, 3, …, 25}, and the resulting data were split into even and odd subsets for twofold cross-validation. Below we describe training on even with final testing on odd.

Decoder performance is quantified by the logical error per round (LER), the fraction of experiments in which the decoder fails for each additional error-correction round4 (‘Logical error per round’ in Methods and Fig. 3a).

Decoders are trained for a specific distance, basis and location, but can decode experiments of any number of rounds. Training is in two stages: pretraining and finetuning (‘Sycamore data’ in Methods and Fig. 2b). In the pretraining stage, we train on one of three kinds of simulated data with different degrees of similarity to the experimental data. In the first two scenarios, we pretrain on up to 2 billion samples drawn from detector error noise models (DEMs)43. The DEMs are either fitted to the (even) detection error event correlations pij (ref. 4) or use weights derived from a Pauli noise model that approximates the noise that occurs on hardware, based on device calibration data (from cross-entropy benchmarks (XEB); ‘Detector error model’ in Methods). In the third scenario, we pretrain on up to 500 million samples of superconducting-inspired circuit depolarizing noise (SI1000 noise44; ‘Circuit depolarizing noise’ in Methods), which does not depend on the experimental data or quantum device except in choosing the overall noise scale to approximately match experimental event densities.

For the finetuning stage, we partition the 325,000 even experimental samples into training and validation sets (‘Sycamore data’ in Methods). This procedure allows us to train a decoder to high accuracy with limited access to experimental data, while holding back the other fold (odd) as a test set.

AlphaQubit achieves an LER of (2.901 ± 0.023) × 10⁻² at distance 3 and (2.748 ± 0.015) × 10⁻² at distance 5 (Fig. 3a,b), giving an error-suppression ratio Λ = 1.056 ± 0.010, where ensembling 20 independently trained models contributes 0.03 × 10⁻² (0.08 × 10⁻²) improvement at code distance 3 (5) (‘Ensembling’ in Methods). This LER is even lower than that of the tensor-network decoder—(3.028 ± 0.023) × 10⁻² at distance 3,


(2.915 ± 0.016) × 10⁻² at distance 5 and Λ = 1.039 ± 0.010—to our knowledge, the most accurate decoder hitherto reported for this experiment4,45 but impractical for larger code distances owing to its computational cost. State-of-the-art MWPM-based decoders, such as correlated matching (MWPM-Corr), matching with belief propagation (MWPM-BP) and PyMatching, an open-source implementation of MWPM4,24,26, lead to higher LERs than the tensor network and AlphaQubit (Fig. 3a,b). For comparison, we also show the results of the LSTM-based neural network from Varbanov et al.36 and our own implementation of an LSTM (both pretrained on XEB DEMs). These achieve good results for 3 × 3. Varbanov's LSTM-based neural network fails to match the tensor-network decoder at 5 × 5 (Fig. 3b). Although our LSTM achieves this, it does not scale to larger code distances (see next section).

Pretraining with samples from a noise model matched to the experimental data (pij or XEB DEMs) leads to better performance than using the device-agnostic SI1000 (Fig. 3c). The pij DEMs are the same noise model that set the prior for the matching-based and tensor-network decoders. On this prior, our decoder achieves parity with the tensor-network decoder (within error). We note that even when pretraining with SI1000 samples, and without any finetuning, AlphaQubit achieves parity with MWPM-BP at code distance 5.

Finetuning with a limited amount of experimental data decreases the LER gap between models pretrained with well-matched (pij and XEB) and general (SI1000) priors; and improves the LER of all models well beyond the tensor-network decoder (Fig. 3c).

Fig. 2 | Error correction and training of AlphaQubit. a, One error-correction round in the surface code. The X and Z stabilizer information updates the decoder's internal state, encoded by a vector for each stabilizer. The internal state is then modified by multiple layers of a syndrome transformer neural network containing attention and convolutions. The state at the end of an experiment is used to predict whether an error has occurred. b, Decoder training stages. Pretraining samples come either from a data-agnostic SI1000 noise model, or from an error model derived from experimental data using pij or XEB methods4,31.

Quantum error correction for future quantum devices

Simulating future quantum devices
To achieve reliable quantum computation, the decoder must scale to higher code distances. To assess the decoder's accuracy on envisioned hardware with error rates significantly lower than the Sycamore experimental data (‘Training details’ in Methods and Extended Data Fig. 2b) and distances beyond 5, we explore the performance of our decoder on simulated data (in place of experimental samples) at code distances 3, 5, 7, 9 and 11 (17–241 physical qubits).

To go beyond conventional circuit noise models37,46 we use a Pauli+ simulator (‘Pauli+ model for simulations with leakage’ in Methods) that can model crucial real-world effects such as cross-talk and leakage. The simulator's readouts are further augmented with soft I/Q information that models a dispersive measurement of superconducting qubits, to capture underlying analogue information about uncertainty and leakage38,47,48 (‘Measurement noise’ and ‘Simulating future quantum devices’ in Methods, Fig. 4a, inset, and Extended Data Fig. 2a). These analogue I/Q measurements and derived events are provided to AlphaQubit in the form of probabilities40 (‘Soft measurement inputs versus soft event inputs’ and ‘Input representation’ in Methods).

Decoding at higher distances
For each code distance, we pretrain our decoder on up to 2.5 billion samples from a device-agnostic circuit depolarizing noise model (SI1000 with a simple variant of I/Q readout and leakage) before using a limited amount of data generated by the Pauli+ simulator (with realistic simulation of leakage and full I/Q noise; ‘Pauli+’ in Methods) to stand in as experimental data for finetuning. In Fig. 4a, we show the LER at each code distance after finetuning.

To establish strong baselines, we compare MWPM-Corr with a DEM tuned specifically for the Pauli+ noise model and augmented to benefit from analogue readouts. We also include our LSTM decoder, trained for code distances 3, 7 and 11 with unlimited Pauli+ training samples. AlphaQubit achieves the highest accuracy for all code distances up to 11, surpassing even the correlated matching decoder augmented with


soft information (Fig. 4a). At distance 3, the augmented MWPM-Corr LER is larger than the AlphaQubit LER by a factor of 1.25; by a factor of 1.4 at distance 9, and by a factor of 1.25 at distance 11.

We note that although the LSTM scales up to code distance 7, consistent with regimes tested in the literature32,36, it does not scale to distance 11 despite the significantly larger number of model parameters (200 million) compared with our model (5.4 million over all code distances; ‘Parameters’ in Methods).

For comparison, we also test our model on hard inputs (that is, where the analogue readouts were binarized before decoding). Although, as expected, both decoders perform worse, AlphaQubit maintains roughly the same improvement in error suppression compared with MWPM-Corr at distance 11 (LER ≈ 1.2 × 10⁻⁵ for MWPM-Corr versus LER ≈ 9 × 10⁻⁶ for AlphaQubit; Fig. 4b). When previous studies mention MWPM, they generally refer to its uncorrelated version36,39,49, which is weaker than MWPM-Corr, with an LER ≈ 4.1 × 10⁻⁵ at distance 11 (Fig. 4b).

Fig. 3 | Logical error per round on the 3 × 3 and 5 × 5 Sycamore experiment. All AlphaQubit results (both pretrained and finetuned) are for ensembles of 20 models. All results are averaged across bases, even and odd cross-validation splits, and, for the 3 × 3 experiments, the location (north, east, south, west (NESW)), and are fitted across experiments of different durations. a, The 1 − 2 × logical error versus error-correction round for code distance-3 and distance-5 memory experiments in the Sycamore experimental dataset for the baseline tensor-network decoder (black), our decoder (red) and three variants of MWPM (shades of grey). The LER is calculated from the slope of the fitted lines. The error bars are the 95% confidence interval. b, LERs of our decoders and other published results for the Sycamore experiment data. We also show the performance of an LSTM model pretrained on XEB DEM data. Error bars are standard bootstrap errors. c, LERs of our decoder pretrained on different noise models, and after finetuning on experimental data. Error bars are standard bootstrap errors.
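The slope-based LER extraction used for Fig. 3 (the LER is calculated from the slope of lines fitted to 1 − 2 × logical error versus round) can be sketched as follows, assuming the standard exponential-decay model 1 − 2E(n) ≈ A(1 − 2ε)ⁿ. This is an illustrative reconstruction, not the authors' fitting code.

```python
# Illustrative reconstruction (not the authors' fitting code) of the
# slope-based LER extraction: fit log(1 - 2E(n)) linearly in n under the
# decay model 1 - 2E(n) ~ A * (1 - 2*eps)^n, then recover eps from the slope.
import numpy as np

def logical_error_per_round(rounds, logical_error_fraction):
    """LER from logical error fractions E(n) measured at several round counts n."""
    rounds = np.asarray(rounds, dtype=float)
    fidelity = 1.0 - 2.0 * np.asarray(logical_error_fraction, dtype=float)
    slope, _ = np.polyfit(rounds, np.log(fidelity), 1)  # slope = log(1 - 2*eps)
    return 0.5 * (1.0 - np.exp(slope))

# Self-check on synthetic data generated with eps = 0.03 (close to the
# experimental LER scale reported in the text).
n = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25])
E = 0.5 * (1.0 - (1.0 - 2.0 * 0.03) ** n)
ler = logical_error_per_round(n, E)
```

Fitting the slope in the log domain, rather than reading off E(n) at a single n, averages over experiments of different durations, which is how the caption describes the fits.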


Fig. 4 | Larger code distances and finetuning accuracy trade-off. a,b, LER of different decoders for Pauli+ noise at different code distances. For each code distance, our decoder (red) is finetuned on 100 million samples from this noise model after pretraining on a device-agnostic circuit depolarizing noise model (SI1000). MWPM-Corr (black) and PyMatching (grey) are calibrated with a DEM tuned specifically to the Pauli+ noise model with soft information. The error bars are bootstrap standard errors. a, Soft decoder inputs. Inset: detection event density of the Pauli+ simulation compared with the Sycamore experimental samples (error bars are standard error of the mean). b, Hard decoder inputs. c, LER of AlphaQubit (soft inputs) pretrained on SI1000 noise and finetuned with different numbers of unique Pauli+ samples at code distances 3–11.

To assess the effect of further limiting experimental data, at each code distance, we finetuned the same SI1000-pretrained base model using only 10⁵ to 10⁸ samples (Fig. 4c). As baselines, we show the corresponding MWPM-Corr performance from Fig. 4a, as well as the performance of the pretrained model before any finetuning. Despite the data-agnostic SI1000 prior, for code distances up to 11, the pretrained model is already on par with MWPM-Corr and further improves with more finetuning examples.

Generalization to a streaming decoder

When training a decoder, the available pretraining and finetuning data will cover only a limited range of number of error-correction rounds. However, a practical decoder will need to perform equally well for longer experiments (‘Discussion and conclusion’, and ‘Generalization to logical computations’ in Methods). We demonstrate that AlphaQubit from the previous section, with its recurrent structure, can sustain its accuracy far beyond the 25 error-correction rounds that the decoder was trained on. We find its performance generalizes to experiments of at least 100,000 rounds (Fig. 5 and ‘Time scalability’ in Methods).

Utility beyond error correction

As we trained the neural network by minimizing cross-entropy, its output can be interpreted as the probability of a logical error, a probability we found to be well calibrated (Fig. 6a and Extended Data Fig. 6a). For example, of samples with prediction probability 0.8, approximately 80% contain a logical error. Samples with a probability close to 0.5 are more likely to have been misclassified than samples with a probability closer to 0 or 1 (Extended Data Fig. 6b).

Fig. 5 | Generalization to larger number of error-correction rounds at code distance 11. a,b, The 1 − 2 × logical error after up to 100,000 error-correction rounds (a) and the corresponding LER (b) for PyMatching (grey), MWPM-Corr (black) and AlphaQubit (red) pretrained on SI1000 samples up to 25 rounds and finetuned on 10⁸ distance-11 Pauli+ simulated experiments of 25 rounds. Both finetuning and test samples are Pauli+. We plot LER values only where the corresponding 1 − 2 × logical error value is above 0.1. The error bars are bootstrap standard errors.
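The calibration property described above (of samples with predicted probability 0.8, roughly 80% contain a logical error) and the use of the prediction as a confidence measure can be sketched on synthetic data. This is an illustrative sketch, not the paper's evaluation code: the labels below are generated to be perfectly calibrated by construction.

```python
# Illustrative sketch (not the paper's evaluation code): checking that
# predicted logical-error probabilities are calibrated, and using them as a
# confidence measure for post-selection, as in Fig. 6.
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=100_000)                   # predicted P(logical error)
y = (rng.uniform(size=p.size) < p).astype(int)  # labels, calibrated by construction

# Calibration (Fig. 6a): within each prediction bin, the observed error
# frequency should match the bin's predicted probability.
bins = np.linspace(0.0, 1.0, 11)
idx = np.digitize(p, bins) - 1
observed = np.array([y[idx == b].mean() for b in range(10)])

# Post-selection (Fig. 6b): discard the least confident samples, i.e. those
# whose predicted probability is closest to 0.5.
prediction = (p > 0.5).astype(int)
confidence = np.abs(p - 0.5)
keep = confidence >= np.quantile(confidence, 0.10)  # reject 10% least confident
error_rate_all = (y != prediction).mean()
error_rate_kept = (y[keep] != prediction[keep]).mean()
```

Rejecting the low-confidence tail lowers the error rate on the retained samples, mirroring the trade-off plotted in Fig. 6b.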


Fig. 6 | Using the network's output as a confidence measure for post-selection. Calibration and post-selection data are evaluated on 10⁹ Pauli+ simulated experiments. a, Example calibration plot at distance 5 (green continuous line) and distance 11 (purple continuous line), with small but present error bars for s.e.m. The black dashed line represents a perfectly calibrated classifier. b, LER versus the fraction of low-confidence experiments discarded. Error bars are s.e.m. from values in each bin (visible for a LER ≲ 10⁻⁸).

The probabilistic output can be used as soft information in hierarchical decoding schemes50,51, or as a confidence measure to discard the least confident samples (Fig. 6b). On Pauli+ simulated data, and by rejecting only 0.2% of the 25-round experiments at distance 11, we can reduce the error rate by a factor of about 20 (1% rejection gives a factor of about 107, 10% a factor of about 790), which can prove useful in protocols such as magic-state distillation, a major anticipated resource cost in fault-tolerant quantum computation52,53.

Discussion and conclusion

We present AlphaQubit, a neural-network decoder designed to decode the surface code that can establish a state of the art in error suppression. On experimental data, it outperforms the previous best-in-class tensor-network decoder. Its accuracy persists at scale, continuing to outperform soft-input-augmented correlated matching at distances up to 11. AlphaQubit thus sets a benchmark for the field of machine-learning decoding, and opens up the prospect of using highly accurate machine-learning decoders in real quantum hardware.

Several challenges remain. Ultimately, to enable logical error rates below 10⁻¹², we will need to operate at larger code distances. At distance 11, training appears more challenging (Fig. 4) and requires increasing amounts of data (Extended Data Fig. 7c). Although, in our experience, data efficiency can be markedly increased with training and architecture improvements, demonstrating high accuracy at distances beyond 11 remains an important step to be addressed in future work (‘Further considerations of scaling experiments’ in Methods).

Furthermore, decoders need to achieve a throughput of 1 μs per round for superconducting qubits4,31 and 1 ms for trapped-ion devices20. Improving throughput remains an important goal for both machine-learning and matching-based decoders38,54–57. Although the AlphaQubit throughput is slower than the 1-μs target (‘Decoding speed’ in Methods), a host of established techniques (‘Decoding speed’ in Methods) can be applied to speed it up, including knowledge distillation, lower-precision inference and weight pruning, as well as implementation in custom hardware.

To realize a fault-tolerant quantum computation, a decoder needs to handle logical computation. Graph-based decoders can achieve this through a windowing approach58. We envisage co-training network with the same accuracy as the individual code-distance-trained decoders (Extended Data Fig. 7b). As our architecture is not specific to the surface code, we anticipate that it can be adapted to colour codes or other quantum low-density parity check codes.

As a machine-learning model, our decoder's greatest strengths come from its ability to learn from real experimental data. This enables it to utilize rich inputs representing I/Q noise and leakage, without manual design of particular algorithms for each feature. This ability to use available experimental information showcases a strength of machine learning for solving scientific problems more generally.

Although we anticipate that other decoding techniques will continue to improve, this work supports our belief that machine-learning decoders may achieve the necessary error suppression and speed to enable practical quantum computing.

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-08148-8.

1. Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A 52, R2493–R2496 (1995).
2. Gottesman, D. E. Stabilizer Codes and Quantum Error Correction. PhD thesis, California Institute of Technology (1997).
3. Fowler, A. G., Mariantoni, M., Martinis, J. M. & Cleland, A. N. Surface codes: towards practical large-scale quantum computation. Phys. Rev. A 86, 032324 (2012).
4. Google Quantum AI. Suppressing quantum errors by scaling a surface code logical qubit. Nature 614, 676–681 (2023).
5. Feynman, R. P. Simulating physics with computers. Int. J. Theor. Phys. 21, 467–488 (1982).
6. Shor, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. 41, 303–332 (1999).
7. Grover, L. K. A fast quantum mechanical algorithm for database search. In Proc. Annual ACM Symposium on Theory of Computing 212–219 (ACM, 1996).
8. Lloyd, S. Universal quantum simulators. Science 273, 1073–1078 (1996).
9. Huang, H.-Y. et al. Quantum advantage in learning from experiments. Science 376, 1182–1186 (2022).
10. Kadowaki, T. & Nishimori, H. Quantum annealing in the transverse Ising model. Phys. Rev. E 58, 5355 (1998).
11. Gidney, C. & Ekerå, M. How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits. Quantum 5, 433 (2021).
12. Bravyi, S. B. & Kitaev, A. Y. Quantum codes on a lattice with boundary. Preprint at https://arxiv.org/abs/quant-ph/9811052 (1998).
13. Kitaev, A. Y. Fault-tolerant quantum computation by anyons. Ann. Phys. 303, 2–30 (2003).
14. Google Quantum AI. Exponential suppression of bit or phase errors with cyclic error correction. Nature 595, 383–387 (2021).
15. Fowler, A. G., Whiteside, A. C. & Hollenberg, L. C. Towards practical classical processing for the surface code. Phys. Rev. Lett. 108, 180501 (2012).
16. Dennis, E., Kitaev, A., Landahl, A. & Preskill, J. Topological quantum memory. J. Math. Phys. 43, 4452–4505 (2002).
17. Sivak, V. V. et al. Real-time quantum error correction beyond break-even. Nature 616, 50–55 (2023).
18. Krinner, S. et al. Realizing repeated quantum error correction in a distance-three surface code. Nature 605, 669–674 (2022).
19. Egan, L. et al. Fault-tolerant control of an error-corrected qubit. Nature 598, 281–286 (2021).
20. Ryan-Anderson, C. et al. Realization of real-time fault-tolerant quantum error correction. Phys. Rev. X 11, 041058 (2021).
21. Zhao, Y. et al. Realization of an error-correcting surface code with superconducting qubits. Phys. Rev. Lett. 129, 030501 (2022).
22. Ghosh, J., Fowler, A. G., Martinis, J. M. & Geller, M. R. Understanding the effects of leakage in superconducting quantum-error-detection circuits. Phys. Rev. A 88, 062329 (2013).
23. Tripathi, V. et al. Suppression of crosstalk in superconducting qubits using dynamical decoupling. Phys. Rev. Appl. 18, 024068 (2022).
24. Higgott, O. PyMatching: a Python package for decoding quantum codes with minimum-weight perfect matching. Preprint at https://arxiv.org/abs/2105.13082 (2021).
components, one for each gate needed for a logical circuit. To reduce 25. Fowler, A. G. Optimal complexity correction of correlated errors in the surface code.
complexity and training cost, these components might share param- Preprint at https://arxiv.org/abs/1310.0863 (2013).
26. Higgott, O., Bohdanowicz, T. C., Kubica, A., Flammia, S. T. & Campbell, E. T. Improved
eters and be modulated by side inputs (such as gate type; ‘Generaliza- decoding of circuit noise and fragile boundaries of tailored surface codes. Phys. Rev. X 13,
tion to logical computations’ in Methods). Such generalization abilities 031007 (2023).
are intimated by our decoder’s generalization across rounds that far 27. Shutty, N., Newman, M. & Villalonga, B. Efficient near-optimal decoding of the surface
code through ensembling. Preprint at https://arxiv.org/abs/2401.12434 (2024).
exceed its training regime and by the ability to train a single decoder to 28. McEwen, M. et al. Removing leakage-induced correlated errors in superconducting
decode all of the code distances 3–11 of the scaling experiment (Fig. 4) quantum error correction. Nat. Commun. 12, 1761 (2021).

Nature | Vol 635 | 28 November 2024 | 839


Article
29. Aharonov, D., Kitaev, A. & Preskill, J. Fault-tolerant quantum computation with long-range correlated noise. Phys. Rev. Lett. 96, 050504 (2006).
30. Chen, E. H. et al. Calibrated decoders for experimental quantum error correction. Phys. Rev. Lett. 128, 110504 (2022).
31. Google Quantum AI. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).
32. Baireuther, P., Caio, M. D., Criger, B., Beenakker, C. W. J. & O'Brien, T. E. Neural network decoder for topological color codes with circuit level noise. New J. Phys. 21, 013003 (2019).
33. Sweke, R., Kesselring, M. S., van Nieuwenburg, E. P. & Eisert, J. Reinforcement learning decoders for fault-tolerant quantum computation. Mach. Learn. Sci. Technol. 2, 025005 (2020).
34. Chamberland, C. & Ronagh, P. Deep neural decoders for near term fault-tolerant experiments. Quantum Sci. Technol. 3, 044002 (2018).
35. Zhang, M. et al. A scalable, fast and programmable neural decoder for fault-tolerant quantum computation using surface codes. Preprint at https://arxiv.org/abs/2305.15767 (2023).
36. Varbanov, B. M., Serra-Peralta, M., Byfield, D. & Terhal, B. M. Neural network decoder for near-term surface-code experiments. Preprint at https://arxiv.org/abs/2307.03280 (2023).
37. Baireuther, P., O'Brien, T. E., Tarasinski, B. & Beenakker, C. W. J. Machine-learning-assisted correction of correlated qubit errors in a topological code. Quantum 2, 48 (2018).
38. Jeffrey, E. et al. Fast accurate state measurement with superconducting qubits. Phys. Rev. Lett. 112, 190504 (2014).
39. Lange, M. et al. Data-driven decoding of quantum error correcting codes using graph neural networks. Preprint at https://arxiv.org/abs/2307.01241 (2023).
40. Pattison, C. A., Beverland, M. E., da Silva, M. P. & Delfosse, N. Improved quantum error correction using soft information. Preprint at https://arxiv.org/abs/2107.13589 (2021).
41. Bluvstein, D. et al. Logical quantum processor based on reconfigurable atom arrays. Nature 626, 58–65 (2024).
42. Google Quantum AI Team. Data for "Suppressing quantum errors by scaling a surface code logical qubit". Zenodo https://doi.org/10.5281/zenodo.6804040 (2022).
43. Gidney, C. Stim: a fast stabilizer circuit simulator. Quantum 5, 497 (2021).
44. Gidney, C., Newman, M., Fowler, A. & Broughton, M. A fault-tolerant honeycomb memory. Quantum 5, 605 (2021).
45. Bravyi, S., Suchara, M. & Vargo, A. Efficient algorithms for maximum likelihood decoding in the surface code. Phys. Rev. A 90, 032326 (2014).
46. O'Brien, T. E., Tarasinski, B. & DiCarlo, L. Density-matrix simulation of small surface codes under current and projected experimental noise. npj Quantum Inf. 3, 39 (2017).
47. Blais, A., Huang, R.-S., Wallraff, A., Girvin, S. M. & Schoelkopf, R. J. Cavity quantum electrodynamics for superconducting electrical circuits: an architecture for quantum computation. Phys. Rev. A 69, 062320 (2004).
48. Wallraff, A. et al. Approaching unit visibility for control of a superconducting qubit with dispersive readout. Phys. Rev. Lett. 95, 060501 (2005).
49. Cao, H., Pan, F., Wang, Y. & Zhang, P. qecGPT: decoding quantum error-correcting codes with generative pre-trained transformers. Preprint at https://arxiv.org/abs/2307.09025 (2023).
50. Pattison, C. A., Krishna, A. & Preskill, J. Hierarchical memories: simulating quantum LDPC codes with local gates. Preprint at https://arxiv.org/abs/2303.04798 (2023).
51. Gidney, C., Newman, M., Brooks, P. & Jones, C. Yoked surface codes. Preprint at https://arxiv.org/abs/2312.04522 (2023).
52. Bombín, H., Pant, M., Roberts, S. & Seetharam, K. I. Fault-tolerant postselection for low-overhead magic state preparation. PRX Quantum 5, 010302 (2024).
53. Bravyi, S. & Haah, J. Magic-state distillation with low overhead. Phys. Rev. A 86, 052329 (2012).
54. Liyanage, N., Wu, Y., Deters, A. & Zhong, L. Scalable quantum error correction for surface codes using FPGA. In IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines 217 (IEEE, 2023).
55. Skoric, L., Browne, D. E., Barnes, K. M., Gillespie, N. I. & Campbell, E. T. Parallel window decoding enables scalable fault tolerant quantum computation. Nat. Commun. 14, 7040 (2023).
56. Tan, X., Zhang, F., Chao, R., Shi, Y. & Chen, J. Scalable surface-code decoders with parallelization in time. PRX Quantum 4, 040344 (2023).
57. Barber, B. et al. A real-time, scalable, fast and highly resource efficient decoder for a quantum computer. Preprint at https://arxiv.org/abs/2309.05558 (2023).
58. Bombin, H. et al. Logical blocks for fault-tolerant topological quantum computation. PRX Quantum 4, 020303 (2023).

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024


Methods

Learning to decode the surface code
In this work, we present a neural-network architecture that learns to decode the surface code under realistic hardware-level noise. The network combines a number of problem-specific features. Its per-stabilizer decoder state representation—a vector for each of the d² − 1 stabilizers—stores information about the syndrome history up to the current round. Convolutions allow the spatial dissemination of information between adjacent stabilizer representations, and at longer range when dilated. Self-attention allows the stabilizer state vectors to be updated based on the current state of each of the other stabilizers, giving full interconnection with a limited number of parameters. The pooling and readout network aggregates information from the representations of the relevant stabilizers to make a logical error prediction.

Using experimental examples consisting of syndromes and the corresponding logical errors, we train the network to improve its predictions of the logical errors using backpropagation with a cross-entropy objective. The neural network processes the stabilizer readouts round by round with the syndrome transformer to iteratively update the decoder state representation (Fig. 2a). At the end of an experiment, the readout network uses the decoder state to predict whether a logical error occurred in the experiment (see 'Model details').

Because real experimental data are in limited supply, we train our decoder in a two-stage process (Fig. 2b). In a pretraining stage, we first prime the network on samples from a generic noise model (such as circuit depolarizing noise, which describes errors in the device with a Pauli error noise model), for which we can quickly generate as many samples as needed (for example, using a Clifford circuit simulator such as Stim43). In a finetuning stage, we optimize this model for a physical device by training on a limited quantity of experimental samples from the device.

With this two-stage training method, we achieve state-of-the-art decoding on current quantum hardware, and demonstrate the applicability of this decoder to larger-scale quantum devices. First, on experimental data from a Sycamore quantum processor running distance-3 and distance-5 surface codes4, when pretraining on synthetic data modelling device noise and then finetuning on experimental samples, we observe significantly better error suppression than with state-of-the-art correlated matching and tensor-network decoders4,26,27,31,45. Second, anticipating larger future quantum devices, we demonstrate that our decoder achieves better accuracy than a correlated-matching-based decoder at code distances 3 to 11 using samples from a Pauli+ quantum simulator modelling cross-talk, leakage and analogue readouts (in-phase (I) and quadrature (Q) signals38, I/Q for short). In this scenario, we again pretrain with a circuit-level depolarizing noise model before finetuning on samples from the Pauli+ simulator. In both scenarios, we can pretrain to high accuracy without experimental samples, and our two-stage training procedure further improves accuracy by finetuning with realistic amounts of experimental samples.

Datasets and noise models
Memory experiments. A memory experiment is the most basic error-correction experiment to run, but is representative of the difficulty of fault-tolerant computation with the surface code4. We encode a known single (logical) qubit state ∣ψi⟩ as ρi = ∣ψi⟩⟨ψi∣, perform multiple error-correction rounds (that is, stabilizer qubit measurements) and finally perform a (logical) measurement on the resulting state ρf. We declare success if the decoded logical outcome matches the initial logical encoding.

In a real-world experiment, all operations on the physical qubits are noisy. From a theory perspective, various levels of abstraction in modelling real-world noise can be studied: for example, the noiseless case, the code-capacity case (noiseless readouts) or the phenomenological case (noise on physical qubits and readouts). In this sense, phenomenological noise is the most realistic case. Yet beyond this qualitative classification of noise types to study, the actual noise model that is used to describe the various operations on the physical qubits can vary tremendously in accuracy, from simple bit- or phase-flip noise to a full simulation of the master equation of the quantum system. (And even then, how accurately the master equations describe a real-world system can vary significantly.)

The rotated surface code. Here we study the memory experiment for a rotated surface code59, a variant of the surface code, which itself is a variant of Kitaev's toric code60. In the rotated surface code, stabilizers are interspersed in a two-dimensional grid of data qubits (Fig. 1a). Stabilizer readouts are performed via stabilizer ancilla qubits, in a circuit as given in Extended Data Fig. 1. For all experiments, we use the XZZX circuit variant of the rotated surface code, which is Clifford-equivalent to the conventional CSS surface code61.

Although the XZZX code has the same stabilizers at every face (Extended Data Fig. 1), for visualization purposes we draw the usual CSS surface code, where X-type and Z-type stabilizers are interleaved in a checkerboard-like pattern throughout the two-dimensional grid. Similarly, we denote those stabilizers that are collected in the first and final rounds of the memory experiments (that is, those that can be inferred from the initially prepared eigenstates and final measurements; Fig. 1b and Extended Data Fig. 4) as 'on-basis', whereas 'off-basis' refers to the stabilizers that cannot; so, for instance, in an X-basis memory experiment, the on-basis stabilizers would be the transformed X-type stabilizers from the usual CSS surface code.

Sycamore memory experiment dataset. This is the publicly released dataset42 accompanying Google's Sycamore surface code experiment4, comprising:
1. Four areas at code size 3 × 3, dubbed north, south, east and west, as well as one area at code size 5 × 5.
2. For each of the five areas: both X and Z memory experiment bases, which set the basis in which the logical qubit is initialized (randomly in ∣+⟩ or ∣−⟩ for an X experiment, and ∣0⟩ or ∣1⟩ for a Z experiment) and in which it is measured at the end.
3. For each of the five areas and two bases: individual experiments at 1, 3, 5, …, 25 error-correction rounds, at 50,000 shots each.

This means that 50,000 shots × 13 round settings × 2 bases × 5 areas = 6.5 × 10⁶ individual shots were recorded in total. Each dataset was split into an even and an odd subset for twofold cross-validation, and accompanied by a DEM fitted to the respective subset, to be used for decoding the other fold.

Detector error model. A DEM43 can be thought of as an error hypergraph, where stochastic error mechanisms are hyperedges connecting the clusters of detectors they trigger. These mechanisms are independent and have an associated error probability. The DEMs we use were previously fitted4 to each experimental set using a generalization of the pij method14.

The XEB DEMs are computed by running a standard calibration suite that measures the fidelity of different gates (frequently using XEB benchmarking to compute the error rate, hence the name).

We use the open-source program Stim43 to generate samples using the DEMs. This is necessary for pretraining AlphaQubit, as the limited quantity of experimental data available makes training with only experimental data unfeasible (see 'Training details').

Circuit depolarizing noise. As error syndromes cannot be read out directly with a single measurement, a stabilizer readout circuit has to be applied to deduce the stabilizers, as shown in Extended Data Fig. 1. The entire sequence of circuit depolarizing noise for a memory experiment of the surface code is shown in Extended Data Fig. 1 for an XZZX circuit variant of the rotated surface code61.
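The DEM picture above, independent error mechanisms that each fire with a fixed probability and flip every detector and logical observable in their hyperedge, can be made concrete with a toy sampler. The mechanisms, probabilities and detector indices below are made up for illustration; the actual DEMs in this work are fitted to experimental data and sampled with Stim:

```python
import random

# Toy detector error model: each mechanism is a hyperedge of the form
# (probability, detectors it triggers, whether it flips the logical observable).
# These numbers are illustrative, not a fitted DEM.
MECHANISMS = [
    (0.02, (0, 1), False),  # bulk error triggering detectors 0 and 1
    (0.01, (1, 2), False),  # bulk error triggering detectors 1 and 2
    (0.005, (2,), True),    # boundary error: one detector, flips the logical
]

def sample_shot(num_detectors, rng):
    """Sample one shot: a detection-event bit vector and the logical flip bit."""
    detectors = [0] * num_detectors
    logical = 0
    for prob, dets, flips_logical in MECHANISMS:
        if rng.random() < prob:  # each mechanism fires independently
            for d in dets:
                detectors[d] ^= 1  # XOR onto every detector in the hyperedge
            logical ^= int(flips_logical)
    return detectors, logical

rng = random.Random(0)
shots = [sample_shot(3, rng) for _ in range(10_000)]
```

Sampling this generative process many times yields the kind of (detection events, logical error) training pairs described above; a decoder's job is to invert it.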
SI1000 (superconducting-inspired 1,000-ns round duration) noise44 is a circuit depolarizing noise model comprising Pauli errors of non-uniform strengths, which approximate the relative noisiness of the various circuit processes in superconducting circuits: for example, as measurements remain a major source of errors in superconductors, measurement noise has weight 5p for noise parameter p. In contrast, single-qubit gates and idling introduce only a small amount of noise, hence their relative strength is p/10.

Intermediate data. For some simulated data modalities, such as our SI1000 Stim-simulated data, we have enough privileged information about the quantum state to determine what would happen if we had finished the experiment earlier. For an experiment with n rounds, we can provide the result of data-qubit measurements, and consequently alternative 'last rounds' of detectors and logical observables, as if they were to happen after rounds 1, 2, …, n − 1 instead of at the end of the experiment.

The no-cloning theorem makes these intermediate measurements inaccessible in an experimental setting, so we do not use them as inputs to our decoders. However, in simulated data, they provide an auxiliary label per experimental round, improving training by providing more information per sample.

Measurement noise. In each error-correction round, we projectively measure many of the qubits, allowing us to extract information about errors that have occurred. Consider measuring a self-adjoint operator A with discrete eigenvalues {λi}i∈I for some index set I. Let Pi be the projector into the subspace with eigenvalue λi. Then the probability of observing λi in a measurement is given by Born's rule, pi = Tr(ρPi), and in that case the resulting state is projected into PiρPi/pi. For the case of a single qubit measured in the computational basis {∣0⟩, ∣1⟩}, p∣0⟩ = ⟨0∣ρ∣0⟩ and p∣1⟩ = ⟨1∣ρ∣1⟩. In an error-correction round, we only measure a subset of the qubits, but the projective nature of the measurement causes the entangled state of the data qubits to remain an eigenstate of all the stabilizer operators.

In practice, measurement is a challenging engineering problem: ordinarily, we want qubits isolated from their environment to allow coherent operations, but measurement necessitates interaction with the environment. In addition, we need to immediately re-use measured qubits for the next error-correction round, which requires either a 'non-demolition' measurement (where the qubit is faithfully projected into ∣0⟩ or ∣1⟩ corresponding to the measurement outcome) or other state preparation, such as unconditional reset to ∣0⟩ following measurement. Unconditional reset also provides an opportunity to remove leakage from the system28. Measurement can cause other problems such as state transitions and unwanted dephasing, which must be carefully avoided62,63.

Implementations vary between physical platforms. For example, in standard dispersive measurement of superconducting qubits, a linear resonator (or series of resonators) serves as an intermediary between the qubit and the outside world47. The measurement is implemented as a microwave scattering experiment to probe the resonator's frequency, which depends on the qubit state owing to coupling with nonlinear Josephson elements48. The scattered microwave pulse is amplified and digitized to determine its amplitude and phase, which encode information about the qubit state38.

The resulting amplitude and phase are traditionally represented in a two-dimensional space of in-phase (I) and quadrature (Q) amplitudes, (I, Q). Ideally, there is a distinct point in (I, Q) space associated with each qubit state (∣0⟩, ∣1⟩ and potentially leakage states such as ∣2⟩). However, the measured signals are obfuscated by noise from sources such as transmission loss, amplifiers and electronics, manifesting as a spread or 'cloud' of points in (I, Q) space associated with each state. In addition, qubits can show unwanted transitions between states during the measurement, such as decaying from ∣1⟩ to ∣0⟩, which would result in an average point between the ∣0⟩ and ∣1⟩ centres in (I, Q) space64.

Ordinarily, a measured (I, Q) value is classified as ∣0⟩ or ∣1⟩ (or in some cases ∣2⟩), and this discrete measurement outcome is given to the decoder. However, a neural-network decoder can use the raw (I, Q) value instead, giving it access to richer information without further preprocessing.

To simulate this process, we run a simulation with noiseless measurements and then add noise after the fact. This can be as simple as discrete assignment error (for example, flipping each measurement outcome with some probability), or we can emulate the richer (I, Q) signals. For our simulations, we consider a simplified one-dimensional space for our analogue readout signal, with probability density functions Pi for ∣0⟩, ∣1⟩ and ∣2⟩ centred around z = 0, 1 and 2, respectively, shown in Extended Data Fig. 2a. Although we could also consider higher-order leaked states (for example, by centring ∣3⟩ around 3, as for the other states) and present those to the network as separate inputs in a suitable fashion (for example, analogous to what we will describe below), and although they are produced by the Pauli+ simulation, we omit them in this analysis as they will be produced with even lower frequency than state ∣2⟩. For this reason, we map higher-order leaked states from the Pauli+ simulation to ∣2⟩; that is, we simply bucket states into 'leaked' (∣2⟩) and 'not leaked' (∣0⟩ and ∣1⟩).

These probability distributions are parameterized by a signal-to-noise ratio (SNR) and a dimensionless measurement duration, t = tmeas/T1, the ratio of the measurement duration to the qubit lifetime, T1. The distribution for ∣0⟩, P0(z, SNR), is simply a Gaussian distribution centred at z = 0. For ∣1⟩, we centre at z = 1 and add the effect of decay from ∣1⟩ to ∣0⟩. For ∣2⟩, we centre at z = 2 and assume that the decay from ∣2⟩ to ∣1⟩ occurs roughly twice as frequently as in the ∣1⟩ to ∣0⟩ case (in reality, qubits can deviate from this based on details in the qubit T1 spectra). For simplicity, we do not include the second-order process of decaying from ∣2⟩ to ∣1⟩ to ∣0⟩, although that does happen experimentally.

In this simplified single-parameter picture40 (see 'Datasets and noise models'), we can thus write

P0(z, SNR) = √(SNR/π) exp(−SNR z²)

P1(z, SNR, t) = (t/2) exp(−t (z − t/(4 SNR))) × [Erf(√SNR (z − t/(2 SNR))) + Erf(√SNR (1 − z + t/(2 SNR)))] + e^(−t) √(SNR/π) exp(−SNR (z − 1)²)

P2(z, SNR, t) = P1(z − 1, SNR, 2t).

For each measurement outcome from the simulation of state ∣i⟩, we sample an 'observed' value of z according to the associated probability density function. This 'observed' z can then be processed using a prior probability distribution and the known probability density functions to determine a posterior probability for each state. For example, we may have a prior distribution in which leakage occurs with probability 0.01 and we split the remaining 0.99 evenly between ∣0⟩ and ∣1⟩. More generally, we express these posterior probabilities as

post1 ≔ Prob(∣1⟩ ∣ ¬∣2⟩) = Prob(∣1⟩ ∧ ¬∣2⟩)/Prob(¬∣2⟩) = Prob(∣1⟩)/Prob(¬∣2⟩) = P1(z, SNR, t)/[(w̄0/w̄1) P0(z, SNR) + P1(z, SNR, t)]

and

post2 ≔ Prob(∣2⟩) = P2(z, SNR, t)/[(w0/w2) P0(z, SNR) + (w1/w2) P1(z, SNR, t) + P2(z, SNR, t)],
where w0 + w1 + w2 = 1 are the prior probabilities of the three measurement outcomes, and w̄0 ≔ w0/(w0 + w1), w̄1 ≔ w1/(w0 + w1) are the marginal prior probabilities, conditioned on not having seen leakage.

This means we provide the network with two inputs:
1. post1: the posterior probability of observing a ∣1⟩ state, conditioned on the state not being leaked (that is, not in ∣2⟩). Once thresholded, this is the traditional measurement output from which syndrome or data-qubit measurements, and subsequent detection events, are derived.
It is noted that owing to our ordering of the states ∣0⟩, ∣1⟩ and ∣2⟩ along the z axis (Extended Data Fig. 2a), if a state was leaked, it is most likely attributed to ∣1⟩, which is a valid choice of mapping an observed leaked state to a ∣0⟩ or ∣1⟩ measurement outcome. For matching-based decoders, this assignment is a valid choice of assignment of a leaked state to the {∣0⟩, ∣1⟩} subspace, and as good as, for example, a random mapping. Indeed, for a decoder unable to process leakage information, a leaked state is 'lost information', and thus an assignment to ∣1⟩ will create a detection event in about 50% of cases. (In the same fashion, if we were to include higher-order leaked states, attributing ∣3⟩, ∣4⟩ and so on to ∣1⟩ remains a valid choice.)
2. post2: the probability of having seen leakage. Owing to the low prior probability of seeing leakage in the first place (usually <1%), the posterior distributions are skewed against ∣2⟩, as can be seen in Extended Data Fig. 2a: even though the distribution for the state ∣2⟩ is centred around z = 2, has the same width as the other two distributions and additionally has a higher decay tail towards z = 1 owing to its twice-as-high normalized measurement time t, the prior weight shifts its posterior to only give a significant chance of interpreting a measurement outcome as ∣2⟩ at a z value already very close to z = 2.

Soft measurement inputs versus soft event inputs. For our model, we have found that directly providing stabilizer measurements as inputs instead of stabilizer detection events is beneficial (Extended Data Fig. 9). Traditionally, we have binary stabilizer readouts si,n ∈ {0, 1}, where i indexes the stabilizer qubit in the surface code and n indexes the error-correction round. A detection event is then derived simply as the change of a stabilizer measurement across error-correction rounds, di,n := si,n ⊕ si,n−1, which itself is a binary variable ∈ {0, 1}. This is the quantity traditionally used by most decoders, for example, MWPM.

The XOR operation used to compute the change in stabilizer measurements results in a 1:1 correspondence of the information encapsulated in the events input versus the measurements input; indeed, given detection events di,n, we can—up to a possibly unknown initial measurement frame—obtain back the stabilizer measurements si,n ≡ ∑m=0,…,n di,m (mod 2), where di,0 is the first event frame. This first event frame was derived either by XOR'ing with an assumed zero frame before the first measurement (for example, for those stabilizers corresponding to the memory experiment basis; blue zeros in the first stabilizer plot of Extended Data Fig. 4b), or was set to zero to remove an initially random stabilizer frame that does not allow extraction of more information about a first detection event (for example, for the off-basis stabilizers orthogonal to the memory experiment basis; see 'The rotated surface code').

This bijection allows us to present a comparison of measurement and event inputs, as they both contain the same amount of information for the decoder (possibly up to the first frame, as aforementioned; however, for Pauli noise, we take the initial off-basis stabilizers and XOR them onto the stabilizers anyhow, so that this assumed initial random frame is precisely zero as well).

If the measurements are transformed into posterior probabilities for each stabilizer measurement, we can assume that each such posterior

pi,n ≔ post1(i, n) = Prob(∣1⟩i,n ∣ ¬∣2⟩i,n)   (1)

parameterizes a Bernoulli random variable Mi,n ~ Ber(pi,n). As those are also Boolean-valued random quantities, we can then transform pairs of measurement variables into corresponding detection events, Ei,n ≔ Mi,n ⊕ Mi,n−1, completely analogously to the binary measurement case. This means that

Ei,n ~ Ber(qi,n)  where  qi,n := pi,n (1 − pi,n−1) + (1 − pi,n) pi,n−1

is parameterized by the probability that exactly one of Mi,n and Mi,n−1 is 1. Analogously to before, this 'soft XOR' defines a linear recurrence on the detection events that can be inverted to obtain back the posterior measurement probabilities from the soft detection events:

pi,−1 = 0  and  pi,n = (qi,n − pi,n−1)/(1 − 2 pi,n−1).

It is clear from the above that the special case of complete uncertainty (pi,n = 1/2 for some n) is non-invertible, as all information is lost in that case.

By induction, one can also show that for a series of measurements (for example, along an edge of data qubits in the surface code grid), thresholding the measurements against 1/2 and then XOR'ing the set is equivalent to performing an iterative 'soft XOR' and then thresholding. To show this, let us simplify notation and drop the multi-index; our soft measurement probabilities are p1, …, pn such that all pi ≠ 1/2, and the corresponding thresholded Boolean values are zi := pi > 1/2. We denote by SoftXOR(p1, …, pn) the soft XOR defined above, and want to show that SoftXOR(p1, …, pn) > 1/2 if and only if (iff) z1 ⊕ … ⊕ zn. The induction start is immediate from the definition, as SoftXOR(p1) = p1. Let us thus assume the hypothesis holds up to some value m − 1. Then

SoftXOR(p1, …, pm) = pm [1 − SoftXOR(p1, …, pm−1)] + (1 − pm) SoftXOR(p1, …, pm−1)
                  =: pm (1 − b) + (1 − pm) b
                  > 1/2  if and only if  b (1 − 2 pm) > 1/2 − pm.

Now if pm < 1/2, we have 1 − 2pm > 0 and thus b > 1/2; otherwise, if pm > 1/2, we have b < 1/2. Thus

SoftXOR(p1, …, pm) > 1/2  iff  (b < 1/2 ∧ pm > 1/2) ∨ (b > 1/2 ∧ pm < 1/2).

The two cases then translate to

b < 1/2 ∧ pm > 1/2  iff  ¬(z1 ⊕ … ⊕ zm−1) ∧ zm ≕ A
b > 1/2 ∧ pm < 1/2  iff  (z1 ⊕ … ⊕ zm−1) ∧ ¬zm ≔ B,

and A ∨ B = z1 ⊕ … ⊕ zm.

Pitfalls for training on soft information. A crucial safeguard in all machine-learning models is to never leak the label (that is, the value to be predicted) into the input of the model. For a distance-d rotated-surface-code experiment with binary stabilizer labels, there exist exactly d² − 1 bits of information that are extracted at each error-correction round; and it is impossible to deduce, from these measurements alone, the logical state of the qubit in the experiment basis. Naturally, this also holds true in the final round of a memory experiment, when we measure all data qubits in the experiment's basis—d² bits—and compute the on-basis stabilizer measurements from them—(d² − 1)/2 many for an odd-distance surface code patch. As those stabilizers are a strict subset of the full d² − 1 stabilizers derived in previous rounds, the same argument applies: no information about the logical state of the surface code qubit can be leaked, as all stabilizers commute with the logical operators of the code.
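The soft-XOR machinery from 'Soft measurement inputs versus soft event inputs' is compact enough to sketch directly. The implementation below is our own illustration of those formulas (posterior probabilities as plain floats, no leakage handling):

```python
def soft_xor(ps):
    """Probability that an odd number of the Bernoulli variables are 1.
    Matches SoftXOR from the text: fold q <- p*(1 - q) + (1 - p)*q."""
    q = 0.0
    for p in ps:
        q = p * (1.0 - q) + (1.0 - p) * q
    return q

def soft_events(ps, p_init=0.0):
    """Soft detection events q_n from soft measurements p_n,
    q_n = p_n*(1 - p_{n-1}) + (1 - p_n)*p_{n-1}, with p_{-1} = 0."""
    events, prev = [], p_init
    for p in ps:
        events.append(p * (1.0 - prev) + (1.0 - p) * prev)
        prev = p
    return events

def invert_events(qs, p_init=0.0):
    """Invert the linear recurrence: p_n = (q_n - p_{n-1}) / (1 - 2*p_{n-1}).
    Undefined at p = 1/2, the non-invertible case noted in the text."""
    ps, prev = [], p_init
    for q in qs:
        prev = (q - prev) / (1.0 - 2.0 * prev)
        ps.append(prev)
    return ps
```

Thresholding commutes with the soft XOR: soft_xor(ps) > 1/2 exactly when the hard bits (p > 1/2) XOR to 1, which is the induction proved above.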
However, when reading d² data qubits with soft information, and then re-computing the stabilizers from them via SoftXOR, there is a map of d² floating point values to d² − 1 floating point values. We found that this gives the model the ability to deduce the current logical state from the inputs, which poses an issue if the initial qubit state is not randomized. An intuition for why this might happen is that the model learns to (partially) invert the quadratic SoftXOR equations that map soft data-qubit measurements to the stabilizers, from which it can then learn what the logical observable should be; even if this inversion is not perfect, it is conceivable that the network might learn to extract a non-zero amount of additional information about the label, which should not have leaked into the inputs. For this reason, we always threshold the data-qubit measurements used for computing stabilizer measurements in the last memory experiment round (and the leakage data as well), irrespective of whether we were providing the model with soft or hard inputs in previous rounds. In this way, we ensure that there is precisely the same amount of information derived from the data qubits as in a standard 'non-soft readouts' memory experiment, that is, d² bits, which in turn are mapped to (d² − 1)/2 stabilizer measurements. This makes it impossible for any decoder to discern the logical measurement label from its inputs.

Pauli+ model for simulations with leakage. Realistic device noise was implemented in a manner similar to the Pauli+ model described in the supplementary material of the Sycamore experiment article4. This model was updated by scaling noise strengths down from the near-threshold regime in that work to realize approximately Λ = 4 in surface code performance for MWPM-Corr, where Λ is the ratio of logical error rates between two surface codes of distance d and d + 2, as in the supplementary material of ref. 65. Moreover, the simulator was modified from the 'stabilizer tableau' representation to use the Pauli-frame representation, which yields indistinguishable results as transitions between stabilizer states are Pauli channels.

We briefly review the Pauli+ model here, although details are described in the supplementary material of the Sycamore experiment article4. The Pauli+ model extends a Pauli-frame simulator with leakage states. These include transitions to and from leaked states, as well as error channels where a two-qubit gate applied to a qubit pair where one input is leaked is replaced by a noise channel on the non-leaked qubit. For a noise channel in the simulation (in general, a Kraus channel), the qubit subspace is Pauli twirled, and transitions to and from leaked states are converted to stochastic transitions. A Pauli-frame simulator is extended such that in addition to a Pauli operator at each qubit, leaked states can be tracked. For example, the possible states of one qubit with leaked excited states could be {I, X, Y, Z, L2, L3}, where L2 and L3 are states of the Pauli-frame simulator that represent quantum states ∣2⟩ and ∣3⟩.

The noise strength is adjusted to what might be achievable in superconducting quantum processors in the medium term, several years from the time of this study. Each gate in the simulation is associated with a baseline amount of depolarizing noise. The strength of depolarizing noise for each operation was informed by recent device characterization4 and an estimate of how noise might improve in future devices65. In addition to this conventional Pauli-channel noise, there is a model for coherent cross-talk that accounts for interactions between pairs of CZ gates; this model is Pauli twirled to produce Pauli channels that are correlated on groups of qubits up to size four, and the unitary calculated includes leakage levels4. Leakage is introduced in three ways. There is a probability of leakage introduced by dephasing during the CZ gate, a 'heating rate' of leakage that is a function of gate duration, and leakage terms that arise from the cross-talk unitary described above. The leakage rates were adjusted from the values in the supplementary material of the Sycamore experiment article4 such that CZ dephasing and cross-talk were reduced to 25% of the previous values (for example, CZ dephasing was 2 × 10⁻⁴ instead of 8 × 10⁻⁴), but the heating rate was unchanged at 1/(700 μs). When leakage is scaled in this work, it is these three rates that are scaled together. Leakage is removed from the system by multi-level reset gates applied after measurement, by data-qubit leakage removal65 applied to code qubits every syndrome round, and by a passive decay rate that is proportional to 1/T1.

Simulating future quantum devices. We use the Pauli+ simulator described in 'Pauli+ model for simulations with leakage' that can model and modulate the effects of cross-talk and leakage, and augment it with soft I/Q readouts, as described in 'Measurement noise', to generate data in place of experimental samples at code distances 3, 5, 7, 9 and 11. The density of detection events for our Pauli+ simulated data is roughly 60% lower than for the Sycamore experiment4 (Fig. 4a, inset, and Extended Data Fig. 2b) and about 0.1% of stabilizer measurements will appear leaked. For comparison, in the 'Quantum error correction on current quantum devices' section, each decoder is finetuned on 3.25 × 10⁵ samples; the entire experiment comprised 6.5 × 10⁶ unique shots in total4.

Metrics
Logical error per round. If E(n) denotes the decoder error rate (computed as the erroneous fraction of logical error predictions) at stabilizer measurement round n, we can make an ansatz for its functional dependence on n via

E(n) = ½(1 − (1 − 2ϵ)ⁿ),    (2)

following previous work (supplementary material, equation (3) in ref. 14). For equation (2), we can see that E(0) = 0 (that is, we assume no error at round n = 0), E(1) = ϵ and E(n) approaches 1/2 for larger n. In this context, the quantity ϵ is called the LER; indeed, it describes the exponential decay of the decoder's fidelity F(n)

F(n) ≔ 1 − 2E(n) = (1 − 2ϵ)ⁿ.    (3)

How we obtain the LER ϵ from the error rates depends on whether we consider results after multiple different numbers of rounds, or whether we derive it from an experiment with a unique number of rounds (say, for instance, 25). The two ways of deriving the metric are compatible, in the sense that performing a fit on an experiment with a single number of rounds (with the added constraint of setting F(0) = 1 explicitly) yields exactly the same LER as inverting the error E(n) directly via equation (4).

Experiment at a fixed number of rounds. For an experiment at a fixed number of rounds (for example, n = 25) we simply invert equation (2), and obtain

ϵ = ½(1 − (1 − 2E(n))^(1/n)).    (4)

Experiment across multiple rounds. We determine ϵ via a linear fit of the log fidelity

log F(n) = log F0 + n log(1 − 2ϵ).    (5)

To assess the fit's quality, we use the goodness of fit, R². In addition, we expect F0 to be close to F(0) = 1, so we consider significant departures of log F0 from 0 to indicate a bad fit (see 'Termination'). As shown in Extended Data Fig. 3, all our fits for the 3 × 3 and 5 × 5 memory experiments show R² ≥ 0.98 and F0 ≥ 1.

As done in the original Sycamore experiment4, we exclude the point (1, E(1)) from our fits due to a time boundary effect, which yields a much stronger error suppression per round at the first error-correction round.
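Both LER extraction routes above, direct inversion of equation (2) at a fixed number of rounds and a linear fit of the log fidelity across rounds, can be sketched in a few lines of plain Python (the function names are ours):

```python
import math

def ler_fixed_rounds(error_rate, n):
    # Invert E(n) = (1/2)(1 - (1 - 2*eps)^n) for eps, as in equation (4).
    return 0.5 * (1.0 - (1.0 - 2.0 * error_rate) ** (1.0 / n))

def ler_linear_fit(rounds, error_rates):
    # Least-squares fit of log F(n) = log F0 + n*log(1 - 2*eps), equation (5).
    xs = list(rounds)
    ys = [math.log(1.0 - 2.0 * e) for e in error_rates]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    log_f0 = y_mean - slope * x_mean   # should be close to 0 for a good fit
    eps = 0.5 * (1.0 - math.exp(slope))
    return eps, log_f0
```

On synthetic data generated exactly from equation (2), both estimators recover the same ϵ and the fitted intercept log F0 is 0, matching the compatibility statement above.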
Note on statistics
Combining different datasets. In the experimental datasets (for example, Fig. 3b and Extended Data Fig. 9a), we have 16 (for code distance 3) or 4 (for distance 5) distinct datasets per aggregated model performance—the combination of X and Z bases, even and odd subsets, and the different device regions. As we do not expect performance on the different datasets to be the same, we purposefully exclude the spread across datasets from our error estimation. Our error estimation is derived exclusively from the bootstrap estimation (499 resamples) of the individual fidelity points. Consistent with the literature4, we propagate individual fitting errors by Gaussian error propagation; that is, we sum two quantities e1 ± de1 and e2 ± de2 via e = (e1 + e2) ± √(de1² + de2²), discarding the spread between the quantities.

Combining different seeds. When we have multiple seeds but only one dataset (for example, the ablation for Pauli+ data, Extended Data Fig. 9b), we exclusively consider the spread across seeds, discarding the bootstrapped variance of the individual samples.

Many-rounds experiments. We use 9 bootstrap resamples.

Model details
AlphaQubit is a neural network designed to decode the surface code for a range of code distances and for experiments of arbitrary duration. Here we describe the features of the architecture, particularly those that are adapted to the quantum error-correction problem. The 'Ablations' section shows that several of these become more important for the more complex decoding problem at larger code distances. Pseudocode for the model can be found in Supplementary Information.

The recurrent architecture design (Extended Data Fig. 4b) reflects the time-equivariant nature of the problem with the syndrome transformer maintaining the decoder state, which represents the information from previous stabilizers relevant to deciding an experiment outcome. The decoder state has the potential to store some information for a window of arbitrary duration and is not limited to a fixed window.

Input representation. The network is provided with between one and four inputs per stabilizer. In the simplest case, analogous to MWPM, we provide binary detection events, which are the temporal differences of binary measurements of the stabilizer state. Although these contain the same information, in practice we find that measurement inputs lead to better results than event inputs alone, so we provide both (Extended Data Fig. 9). There are several possible factors at play here that might explain the usefulness of providing measurements: they have a more uniform distribution than events, making the model input less biased. Furthermore, for the Pauli+ data, providing measurements may help resolve the asymmetry of the states ∣1⟩ versus ∣0⟩ (due to the amplitude damping component in the noise model; see 'Measurement noise'); yet after translation to events, the information whether the event was due to a flip ∣0⟩ to ∣1⟩ versus the other way round is lost. When simulating I/Q noise, we thus provide both measurements and events as probabilities as described in 'Soft measurement inputs versus soft event inputs'. For experiments with leakage, we also provide the leakage probability and the temporal-difference analogue. Although the soft model input is presented as float32 values to the model, we anticipate that a much lower bit precision (such as eight or four bits) would suffice to capture most of the benefit of soft readouts.

A representation is built up for each input stabilizer as shown in Extended Data Fig. 4c, by summing linear projections of each of the input features. This means that there are d² − 1 different embeddings generated in the 'StabilizerEmbedder', which are then added to the d² − 1 representations that comprise the decoder state (Extended Data Fig. 4d). To allow the transformer to distinguish between the stabilizers, we also add a learned input embedding of the stabilizer index i. As the final-round stabilizers are not measured but computed from the data qubits, we encode this by using a separate embedding for the final round, with a separate final-round linear projection for the on-basis computed stabilizers and a single learned embedding for all the undefined off-basis stabilizers (where on- and off-basis stabilizers are defined in 'The rotated surface code'). Each stabilizer representation is independently passed through a two-layer residual network to derive the representation, Sni, provided to the recurrent neural network (RNN) for stabilizer i (S′Ni for the final stabilizers).

At each error-correction round, the stabilizer representations are added to the corresponding decoder state vectors and then scaled by a factor 0.707 to control the magnitude (Extended Data Fig. 4d). This code-distance-independent constant is intended to prevent the scale of the state vectors from growing, designed so that if both inputs are zero-mean unit variance, the scaled-summed output will also be.

We emphasize that although the noise model used in the scaling experiment (Pauli+, which simulates effects such as cross-talk and leakage, and augmented soft readouts with amplitude damping) is more realistic than a circuit depolarizing noise model, we only provide AlphaQubit with these stabilizer measurements and events (plus leakage information) described above. There is no privileged access to information for cross-talk or other noise events from the simulators, beyond what one could measure in an actual memory experiment. AlphaQubit learns these noise effects beyond the SI1000 circuit depolarizing noise prior solely from finetuning.

Similarly, as explained in 'Detector error model', the DEMs used for the matching-based decoders are all derived from a circuit depolarizing noise prior, and then either fitted to the Sycamore experimental data, or the noise parameters manually adjusted to match the noise effects within the Pauli+ and I/Q noise simulations. As such, the DEMs do not directly capture any of the cross-talk effects present in the experimental data or simulation (beyond what can be captured in the fit and parameter adjustment).

Syndrome transformer. We designed the computation block (Extended Data Fig. 4d) to match the quantum error-correction task. At the heart of our RNN block architecture is the syndrome transformer, a self-attention architecture based on the transformer66, which has seen recent success in a variety of problems67,68. Transformers consist of multi-head self-attention followed by a fully connected dense block that we augment with gating69, which modulates the dense block's computation with multiplicative factors between zero and one. To this, we add two elements: two-dimensional convolutions, to promote better scaling with code distance and inspired by the space-translation symmetry of the problem, and (optionally) attention bias, which precomputes a component of the attention and adds a degree of interpretability (Extended Data Fig. 4e).

The syndrome transformer updates the per-stabilizer decoder state representation by incorporating information from other stabilizers based on their location. Although previous decoders have exploited symmetries of the toric code70 and used convolutional neural networks for processing the surface code71, boundary conditions of the surface code together with non-uniformity of real physical devices mean that there could be advantages to a model that goes beyond rigid spatial invariance. Much of the information passing can be local, to handle local spatial correlations, and can be modelled with two-dimensional convolutions. Longer-range interactions depending on relative stabilizer location are partially supported by dilated convolutions (in which a 3 × 3 convolutional kernel is 'spaced out' to skip intervening pixels and model longer-range dependencies). Dense all-to-all attention enables the model to dynamically reason about all possible stabilizer pairs depending on their current state. Such a pairwise attention mechanism is useful in capturing (possibly long-range) correlations during decoding, much like MWPM finds edges in the decoder graph between pairs of detection events.
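The 0.707 ≈ 1/√2 scaling used when combining the decoder state with the stabilizer embeddings is the usual variance-preserving combination of two independent zero-mean, unit-variance inputs; a quick numerical check (our own illustration):

```python
import random

random.seed(0)
n = 50_000
state = [random.gauss(0.0, 1.0) for _ in range(n)]  # decoder state entries
embed = [random.gauss(0.0, 1.0) for _ in range(n)]  # stabilizer representation

# Adding two independent zero-mean unit-variance inputs doubles the variance;
# scaling the sum by 0.707 ~= 1/sqrt(2) brings it back to roughly 1.
combined = [0.707 * (s + e) for s, e in zip(state, embed)]
sum_var = sum(c * c for c in combined) / n
```

Without the scaling, the state magnitude would grow by a factor of about √2 per round, which is what the constant is designed to prevent.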
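A dilated convolution can be pictured by the grid offsets its kernel reads; a small sketch (the helper is our own, not taken from the paper's pseudocode):

```python
def dilated_offsets(kernel_size=3, dilation=2):
    # Grid offsets read by a dilated 2-D convolution kernel: a 3x3 kernel
    # with dilation 2 reads only 9 locations but spans a 5x5 neighbourhood.
    r = (kernel_size - 1) // 2
    return [(dy * dilation, dx * dilation)
            for dy in range(-r, r + 1)
            for dx in range(-r, r + 1)]
```

With dilation 1 this is an ordinary 3 × 3 neighbourhood; larger dilations widen the receptive field without adding parameters.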
For each transformer layer, we apply three dilated convolutions after first scattering the stabilizer representations to their two-dimensional spatial layout in a (d + 1) × (d + 1) grid, with an optional learned padding vector for locations where there is no stabilizer.

Attention bias. The attention allows information exchange between all pairs of stabilizers. We expect that this attention between any two stabilizers is dependent on two factors: the history of events for each of those stabilizers and the relationship of those stabilizers in the circuit (in terms of basis, connectivity and spatial offset). As the relationship between the stabilizers is constant, we learn an attention bias that modulates the attention between stabilizer i and j. The attention bias is a precomputed offset to the attention logits, learned separately for each head in each transformer layer.

The attention bias embeds fixed information about the layout and connectivity of the stabilizers by constructing a learned embedding for each stabilizer pair i, j as a function of i and j. This embedding is independent of the decoder state and at each transformer layer is projected down to a scalar bias per head to be added to the conventional content-based attention logits.

The attention bias embedding is a (d² − 1) × (d² − 1) × 48 tensor constructed by adding learned embeddings of discrete features for each stabilizer pair i, j based on their spatial layout. The features are chosen to encapsulate salient attributes of the spatial relationship between stabilizers.
1. The spatial coordinates of stabilizer i and stabilizer j.
2. The signed spatial offset of stabilizer i from stabilizer j.
3. The Manhattan distance between i and j.
4. A bit to indicate if the stabilizer types (by which we mean the basis labels inherited by transforming the X-type and Z-type stabilizers from the usual CSS code, as explained in 'The rotated surface code') for i and j are the same or not.
These learned embeddings are then independently passed through a residual network to form the final embedding. Although the embedding is learned, after training the attention bias is constant and can be precomputed.

To provide a simple further modulation of the bias, at each round, the current and previous stabilizers are used to compute indicator features for spatial and time–space event correlations4 for each i, j pair. At round n these are the products:
1. eventni × eventnj (spatial)
2. eventni × event(n−1)j (time–space)
3. event(n−1)i × eventnj (time–space)
4. event(n−1)i × event(n−1)j (spatial)
as well as the diagonals of these (seven features, as two diagonals are identical).

These features are concatenated to the attention bias embedding and directly projected to the attention bias with a learned projection. Although, for speed reasons, only simple binary features are provided and projected directly to the bias, with the additional features only the attention bias embedding can be precomputed and the projection becomes costly for large code distances. The ablations show that the attention bias adds little to the performance, and it is more costly during training, so although the ablations of Extended Data Fig. 9a,b are relative to a baseline with attention bias, the Pauli+ experiments were executed without it.

Attention bias visualizations. To investigate whether the attention bias learns an interpretable representation, we visualize its logits in Extended Data Fig. 5. For each of the 4 attention heads for the first transformer layer of one (5 × 5) DEM-trained model, we plot the attention logits for each stabilizer in a physical layout. It clearly shows that the different attention bias heads perform distinct functions. The first head modulates the attention towards the same stabilizer and stabilizers far away in the surface code. The second head discourages attention to immediate neighbouring stabilizers (even more so to the diagonally adjacent stabilizers between on- and on-, respectively off- and off-basis stabilizers; see 'The rotated surface code') while encouraging attention to non-neighbouring stabilizers. The third head instead does the opposite, strongly encouraging local attention while discouraging attention to stabilizers farther away. In addition, the third head seems to show patterns of higher attention biases for on-basis (see 'The rotated surface code') stabilizers than for off-basis stabilizers. This is visible in the attention maps marked with an asterisk. Lastly, the final head predominantly discourages attention to the same stabilizer while being slightly encouraging towards attention for non-same stabilizers. We observed similar patterns of local and non-local attention bias for other models; however, it did not always show as clearly, and in some models the attention bias offered no obvious interpretation.

Readout network. After the RNN has processed the final stabilizers from round N to create the decoder stateN representation, a readout network (Extended Data Fig. 4f) processes the state to make a final prediction, again using the spatial distribution of the stabilizers. In the readout network, we first transform the per-stabilizer representation to a per-data-qubit representation by a scatter operation, which arranges the decoder state representation according to the stabilizers' spatial layout and then applies a 2 × 2 convolution that combines information from the 4 stabilizer neighbours of each data qubit. We then apply a dimensionality reduction and mean pooling along rows or columns of the data qubits perpendicular to the logical observable rows or columns of qubits (depending on the measurement basis), to arrive at a vector representation per equivalent logical observable. This representation is then processed by a residual network to make the final label prediction. We can compute one label for each of the d rows or columns corresponding to equivalent choices of logical observables in the experiment basis (Extended Data Fig. 4a) and average the loss for all of these if all the labels are available (as they are for the scaling experiment simulations). Only a single logical observable is used at inference time (the leftmost or the lowest according to the basis). The network was designed to pool along logical observables to give a prediction per line, but in practice we found better results pooling perpendicular to them.

Auxiliary tasks. Often, training neural networks to make predictions other than those required for the main machine-learning task, known as auxiliary tasks, can lead to improved training or better performance on that main task. Here we ask the network to make a prediction of the next stabilizers, by a linear projection and logistic output unit from each stabilizer's representation. Extended Data Fig. 9 shows that this auxiliary task seems to detract slightly from the network performance, but leads to slightly faster training.

Efficient training for variable-duration experiments. As the computation could be terminated at any round, the network could be asked to make a prediction at any round. We use our privileged access to the quantum state during simulation to provide a label for any round (see 'Intermediate data'). It is noted that owing to the special nature of the final stabilizers being computed from the measured data qubits, there is a set of final stabilizers for round N that are different from the bulk stabilizers for round N of experiments that last longer. With such simulated data, when training, we can share computation for these experiments of different lengths as shown in Extended Data Fig. 5b. For N rounds, N labels can be trained with 2N applications of the embedding and RNN core (N for the bulk and N for the final stabilizers for each duration), and N readout computations (versus N(N + 1)/2 applications of the embedding and RNN with N readout computations for training on N separate examples of durations 1, …, N).
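The bookkeeping behind the 2N-versus-N(N + 1)/2 comparison above can be made explicit (a sketch; the function names are ours):

```python
def shared_cost(n_rounds):
    # Shared scheme: N bulk embedding/RNN steps, plus one final-stabilizer
    # embedding/RNN step per duration, plus one readout per duration.
    return {"embed_rnn": 2 * n_rounds, "readout": n_rounds}

def separate_cost(n_rounds):
    # Naive scheme: N independent examples of durations 1..N re-run the
    # embedding/RNN from scratch for every duration: 1 + 2 + ... + N steps.
    return {"embed_rnn": n_rounds * (n_rounds + 1) // 2, "readout": n_rounds}
```

For a 25-round experiment the shared scheme uses 50 embedding/RNN applications instead of 325, while producing the same 25 per-duration labels.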
Implementation details
Our machine-learning decoder architecture is implemented and trained using the JAX, Haiku and JAXline72 machine-learning frameworks.

Training details
Sycamore data. Cross-validation. For the data from Google's Sycamore memory experiment paper4, we consider the two disjoint sets of odd- and even-indexed experiments (see 'Sycamore memory experiment dataset') to perform twofold cross-validation.

Pretraining. Owing to the limited amount of experimental data captured in the Sycamore experiment, compared with the number of training examples required for a machine-learning decoder, we pretrain on simulated data with three different degrees of similarity to the experimental data. The first one is data generated from a previously published DEM file4, obtained by fitting one half of the experimental data (either odd- or even-indexed, 25,000 samples per experiment length; see 'Detector error model'). Examples are uniformly sampled from the lengths {1, 3, …, 25}. We use simulated samples from the DEM fitted to the same half as validation dataset, which we use to perform early stopping, that is, to select the model parameters producing the minimum LER on the validation dataset across training steps. We stop the experiment before using 2 billion training samples.

When XEB pretraining, we follow a similar procedure, but using data generated from a DEM file derived from a previously published Stim file4, obtained using device calibration data, for both training and evaluating.

In the last pretraining variety, SI1000 pretraining, we train and evaluate with SI1000 data not fitted to the device (see 'Circuit depolarizing noise'). Instead of simulating experiments of lengths {1, 3, …, 25}, in the SI1000 pretraining modality, we always simulate experiments of length 25 and use intermediate measurements as auxiliary labels at each training round (see 'Intermediate data').

For all three varieties of pretraining, we use a 'noise curriculum' in which we show data with lower noise strength at the beginning of the training, continuously transitioning to higher noise strength during the training process. More precisely, for DEM and XEB pretraining, we consider several replications of the DEM with error event probabilities scaled with factors f = 0.5, 0.6, …, 1.0; whereas for SI1000 pretraining, we scale the parameter p accordingly. Each dataset is sampled with a probability proportional to

pf(t) ∝ 1 + wc G(fc(t), σc; f)    (6)

where G(μ, σ; x) is the standard un-normalized Gaussian function and fc(t) is the peak scale factor:

fc(t) = fc,min + (1 − fc,min)/(1 + exp(−sc(t/tc − 1)))    (7)

that transitions from the minimum peak scale factor, fc,min, to 1 at a number of training steps t = tc. The values of the noise curriculum parameters can be found in Extended Data Fig. 8a. We found that this noise curriculum stabilizes training, and yielded slightly better accuracy at the end.

Finetuning. We then finetune the model using one half of the experimental data (for the DEM pretrained case, always the half used to derive the DEM file). We further divide this half of the data, and use the first 19,880 samples as a training dataset and the remaining 5,120 samples as a development dataset for early stopping, keeping the parameters giving the best fitted LER (up to 30,000 training steps). The final model is evaluated on the other half of the experimental data—the 25,000 held-out samples not used for training or early stopping. The best validation LER is found after about 120 passes through the data.

Pauli+. Pretraining. We train and evaluate the model on SI1000 data generated using Stim with added I/Q noise (see 'Measurement noise'), with intermediate measurements as auxiliary labels (see 'Intermediate data'). The I/Q noise is simplified in that we set t = 0 (that is, no amplitude damping component). Furthermore, before each measurement in the simulated experiment, we first randomly set the qubit to a leaked ∣2⟩ state with a 0.275% chance, and then add I/Q noise. This is the only source of leakage that we add to the system; Stim itself (unlike Pauli+) does not simulate leakage (or cross-talk).

Although the amount of leakage is approximately matched to the Pauli+ data, this is a very simplified simulation of leakage in the system compared with 'Pauli+ model for simulations with leakage'. Here we treat leakage as an effect that just occurs during measurements, whereas in the Pauli+ simulation leakage occurs through realistically modelled effects during the application of quantum gates, and also spreads accordingly. One consequence is that in the Pauli+ data, the amount of leakage between, for example, stabilizer and data qubits varies strongly, an effect we disregard during pretraining. Nonetheless, providing the model with this very simplified version of leakage (through the same inputs post1 and post2 described in 'Measurement noise') helps prime our decoder to expect leakage information as input. Examples all have 25 error-correction rounds.

Finetuning. For the finetuning, we trained models using samples from the Pauli+ simulator (see 'Pauli+ model for simulations with leakage') using 15 different seeds. We trained with either soft or hard inputs, with approximately 0.1% chance of leakage in the stabilizer measurements and with I/Q hyperparameters SNR = 10 and t = 0.01. We used the auxiliary task of predicting the next stabilizers but not intermediate labels, as these are not available in a realistic scenario. After termination, the model parameters obtaining the lowest development set cross-entropy loss were chosen.

For each code distance 3, 5, 7, 9 and 11, we generated 100 million training samples and 5 million test samples from the Pauli+ simulator, which were then augmented with I/Q noise in post-processing. For each data limit from Fig. 4, we sampled a subset from the training samples and split off about 10% as a development set. The models were trained for up to 10 epochs.

Loss. We trained the model using cross-entropy objectives with binary targets. For the scaling experiments, where we have a label for each logical observable for each experiment duration, all these losses were averaged.

As an auxiliary loss, we used next-stabilizer-prediction cross-entropy loss (see 'Auxiliary tasks') averaged across all rounds and all stabilizers and then down-weighted relative to the error-prediction loss (Extended Data Fig. 8b).

Loss minimization. We minimize loss using stochastic gradient descent. We use the Lamb73 and Lion74 optimizers for experimental and scaling datasets, respectively. We use weight decay (L2 norm on non-bias parameters) everywhere, either relative to zero (for pretraining) or relative to pretrained parameters (for finetuning, using a stronger weight decay). The batch size was increased once (from 256 to 1,024) during pretraining (Extended Data Fig. 8b) and kept at 1,024 for finetuning. The learning rate is piecewise constant after an initial linear warm-up of 10,000 steps, with reductions by a factor 0.7 at specified numbers of steps. Scaling experiments used a cosine learning-rate schedule.

For the Pauli+ experiments, we pretrain with a rounds curriculum; for example, for a 25-round experiment, we train for 30 million seen examples on 3 rounds, from 30 million to 60 million examples on 6 rounds, from 60 million to 90 million examples on 12 rounds, and above that on 25 rounds.

Termination. Pretraining is terminated after 2 billion examples for the Sycamore experiments with pij and XEB and 500 million examples for
SI1000. For the scaling runs, pretraining was terminated after up to 2.5 billion examples. Model parameters are accumulated with an exponential moving average and regularly evaluated on development data to compute the LER (by fitting across rounds 3, 5, …, 25 for the Sycamore data, or by computing for 25 rounds for the scaling data). The set of parameters with the lowest development-set LER is retained. With noisy fidelity estimates, particularly early on in training, we found that LER could be overestimated (see ‘Logical error per round’), so we exclude parameter sets for which the fit has R² ≤ 0.9 or an intercept ≤ max(−0.02, −σ), where σ is the estimated standard deviation for the intercept of the fit line. For the scaling finetuning experiments, as the constrained development data would in some cases be too small to reliably estimate the low LERs, the lowest development-set cross-entropy was used to select the model parameters.

Hyperparameters. For the Sycamore experiments, we tuned the hyperparameters of the network by training models on the 5 × 5 DEM, using samples from the same DEM as validation data for hyperparameter selection. We used the same hyperparameters for 3 × 3 except for the learning rate (×√2) and using dilation-1 convolutions (see Extended Data Fig. 8c for details).
For the scaling investigation, the same base model was used, with some hyperparameter values further tuned to minimize validation-set LER for the 11 × 11 code. Again, the same model is used for all other code distances except for adjusting the learning rate and choosing the convolution dilations (see Extended Data Fig. 8c for details).

Parameters. As the architecture used for all code distances is the same, the number of parameters (the weights of the neural network) is constant except for additional stabilizer-index embedding parameters needed for larger code distances. All the convolutions are 3 × 3, albeit that the dilations are varied with the code distance.

Ensembling
It is possible to combine multiple classifiers to obtain a more accurate prediction by ensembling27,75. As the models can be run independently in parallel, ensembling does not change the computation speed, but does require more computational resources. We apply ensembles of multiple models for both the Sycamore and scaling experiments. Extended Data Fig. 7d shows the improvement in LER owing to ensembling compared with the average performance of the models used in the ensemble.
We train multiple models with identical hyperparameters, but different random seeds leading to different parameter initializations and training on different sequences of examples. We use 20 seeds for the Sycamore data and 15 for the scaling experiments. We average the logits from the different networks (computing a geometric mean of the predicted error probabilities). As there is a range of accuracy between seeds, it might be possible to achieve greater performance by ensembling only the models with the best validation-set performance, but here we ensemble all the models trained.

Decoding speed
We designed and tuned the current model to deliver the best error suppression possible up to 11 × 11 codes, with the only speed consideration being to keep training time manageable for experiments. So far, we have put no effort into optimizing the design for inference speed. When running AlphaQubit on an accelerator for code distances up to 25 (but untrained for distances >11), its throughput is within 1–2 orders of magnitude above the target throughput rate of 1 μs (ref. 20), and within an order of magnitude of the throughput of, for example, PyMatching76, an implementation of standard MWPM without correlations, analogue inputs or calibrated error probabilities (Extended Data Fig. 7a).
Throughput measures only the time required for computation per round, ignoring the latency—the time required to deliver a final answer after receiving the final round’s stabilizers.
By design, the model runtime is independent of the physical noise level (and hence the error syndrome densities), whereas matching is slower the greater the noise. The fixed runtime of neural-network decoders is considered to be a practical advantage77.
We are confident that our decoder can be sped up significantly using a number of well-known techniques. Although well studied, the optimization of neural networks for speed in deployment is a multistep process to be undertaken after demonstrating the accuracy of our approach at the target scale. First, having established the accuracy that is achievable, optimizing the architecture and hyperparameters for inference speed is likely to deliver significant gains while maintaining that accuracy. For instance, improving parallelism and pipelining is expected to improve throughput, perhaps at the expense of incurring greater latency. Using faster components such as local, sparse or axial attention78, which restrict attention to a subset of stabilizer pairs, has the potential to deliver speed improvements. Custom implementation to optimize for a particular hardware accelerator design and to remove memory-access bottlenecks can also improve speed for a specific architecture.
Second, techniques such as knowledge distillation79 and attention transfer80, lower-precision computation, and weight and activation pruning can be applied to achieve similar performance with less computation. In knowledge distillation, we first train a large, accurate ‘teacher’ network on the task. Then we train a smaller ‘student’ network whose architecture is tuned for inference speed81. In addition to the logical observable labels that were available to the teacher, the student network is trained to match the probabilistic outputs of the teacher. Optionally, the student can be trained to match the internal activations and attention maps of the teacher. These richer, denser targets have been shown to enable the student to achieve higher accuracy than could be achieved when training the student network without the teacher network. For increased accuracy, an ensemble of teachers can be distilled into a single student.
We note that machine-learning acceleration hardware is continually improving (for example, one study82 found a factor-of-2 increase in floating-point operations per second about every 2.6 years). Should this trend continue and we can choose architectures to exploit the improvements, this would lead to considerable speed-up over the timescale anticipated for the development of full-scale quantum computers.
Finally, custom hardware-specific implementation on application-specific integrated circuits or field-programmable gate arrays can deliver further speed improvements. Previous studies have demonstrated the feasibility of implementing decoders (for example, Union Find54 and neural network83) in such custom hardware. To best exploit application-specific integrated circuits and field-programmable gate arrays, low-precision, fixed-point arithmetic and sparse networks may be necessary.
We also note that, in principle, it is possible to achieve unbounded throughput by decomposing the matching decoding problem in an extremely parallel fashion55,56. We expect similar ideas might be applied to a machine-learning decoder (‘Generalization to logical computations’ below).

Generalization to logical computations
Beyond scaling a decoder to achieve sufficiently low logical error rates (see ‘Further considerations of scaling experiments’) and ensuring throughput rates commensurate with hardware demands (see ‘Decoding speed’), realizing a fault-tolerant quantum computation requires a decoder to handle more than memory experiments, such as lattice-surgery operations. Despite significant recent progress on the experimental side4,17–21,84–87 and decoding side25–27,88–90, decoding a logical computation is only now beginning to be explored, even for established decoding schemes such as MWPM58.
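As an aside on the ‘Ensembling’ section above: averaging logits across models is the same operation as combining the per-model error probabilities by a renormalized geometric mean, which is why both descriptions appear in that section. A minimal sketch of this equivalence (our own illustration with made-up logit values, not the authors’ implementation):

```python
import numpy as np

def ensemble_probability(logits):
    """Average per-model logits (shape: num_models x num_samples), then
    map back to a probability with a sigmoid. This equals the geometric
    mean of the per-model probabilities p_i, renormalized against the
    geometric mean of the complements (1 - p_i)."""
    mean_logit = logits.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-mean_logit))

# Two hypothetical models' logits for two samples.
logits = np.array([[2.0, -1.0],
                   [0.5, 0.25]])
combined = ensemble_probability(logits)

# The same prediction via the explicit renormalized geometric mean.
p, q = 1 / (1 + np.exp(-logits[0])), 1 / (1 + np.exp(-logits[1]))
geometric = np.sqrt(p * q) / (np.sqrt(p * q) + np.sqrt((1 - p) * (1 - q)))
```

Here `np.allclose(combined, geometric)` holds exactly, because the log-odds of the renormalized geometric mean is the arithmetic mean of the individual log-odds.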
One possible approach (akin, on a high level, to the spatial and temporal windowing approach presented for graph-based decoders55,56,58) is to train separate network components to implement different building blocks of a fault-tolerant quantum circuit. These could be trained individually (as demonstrated for idling in the memory experiment) and across constructed combinations, to ensure that the state information passed between them is consistent and sufficient for decoding arbitrary circuits. We expect that many, perhaps all, of the network parameters could be shared, such that a single model generalizes across different decoding scenarios, perhaps utilizing side inputs to indicate the gate required, and producing auxiliary outputs that allow decoders on different building blocks to exchange information necessary for the joint decoding task.
The ability of our neural architecture to transfer between various decoding scenarios will be crucial in this context. Beyond generalization across rounds, that is, the streamability of the decoder (Fig. 5 and Extended Data Fig. 6c), we can also successfully train a single machine-learning decoder across multiple code distances. More specifically, in the ‘Quantum error correction on current quantum devices’ and ‘Decoding at higher distances’ sections, we present decoders that have been pretrained on the code distance on which the decoder will then be finetuned. This approach allowed us to assess each code distance independently, during both pretraining and finetuning. But the decoder can in fact be trained on a mixture of examples from a range of code distances, such that a single decoder can process any of the code distances it was trained on (Extended Data Fig. 7b). The resulting logical error rates of the cross-code-distance decoder are consistent with the model trained per code distance, when training up to the same number of samples seen in the single-code-distance d = 11 case. This form of generalization is enabled through the composition of architecture components (convolutions and attention) that can be applied independent of code distance.

Further considerations of scaling experiments
In Extended Data Fig. 7e, we collect error-suppression factors Λ from distances 3 to 5 and distances 3 to 11, for various decoders and input modalities. The strongest error suppression up to distance 11 is achievable with AlphaQubit, at an LER of (6.11 ± 0.22) × 10⁻⁶ (Extended Data Fig. 7d).
With our two-stage training procedure, the bulk of the computational training cost comes from pretraining the model to high accuracy. One model, pretrained on a generic noise model such as SI1000, can be copied and finetuned independently to various device characteristics, which amortizes the cost of pretraining. As detailed in the main text, these finetuning samples are a scarce experimental resource. In this work, we have shown that, up to code distance 11, we can finetune a model on a realistic number of experimental samples (10⁷; Fig. 4c). It is an important question for future research to demonstrate that this methodology remains successful at larger scales.
As we go from distance 3 to 11, the number of pretraining samples required for convergence increases, which, for example, manifests in Fig. 4, where the resulting relative advantage over MWPM-Corr is not as large at d = 11 as it is at d = 9. In our experience, the number of samples required to train a decoder at distance d seems to depend in a nonlinear fashion on the distance (Extended Data Fig. 7c). Extrapolating this trend is difficult, however, as we find that the sample growth rate strongly depends on the choice of hyperparameters, such as learning rate, batch size and architecture features. It could further be the case that finetuning a model at larger code distances to a fixed relative advantage over pretrained performance is more challenging than at smaller code distances: in Fig. 4c, we do not observe a significant improvement of the LER at distance 11 for 10⁵–10⁶ finetuning samples compared with the pretrained model.
A possible partial explanation could be that the decreasing number of failure cases at larger distances (for example, only 0.025% for a 25-round experiment at an LER of 10⁻⁵) could result in fewer ‘challenging’ examples for learning, as seen in our distance-11 scaling demonstration where most samples are correctly classified late in training. The effectiveness of finetuning might be limited, similar to pretraining, because only a small portion of the data actually helps improve accuracy.
There are several possible approaches to improve sample efficiency. We have shown it is possible to train AlphaQubit across code distances instead of for a single code distance (see ‘Generalization to logical computations’). As we reach equal performance with the same total number of training steps as required to train a decoder for the highest code distance only, and as training steps for lower-code-distance examples are faster, training across code distances has the potential to save computational resources. Another approach that might improve sample efficiency is the concept of ‘hard sample mining’91,92, where difficult examples are collected (for example, by finding examples whose predictions are incorrect, or not confidently correct, for some decoder) or constructed. Training samples can then be biased towards these ‘hard samples’ rather than just drawing random samples. One indication that this is a fruitful avenue to pursue is that we have found benefit by training with noise levels higher than the intended target noise level, consistent with previous findings36. This could also work for finetuning, for example, in the form of injecting errors into the device to artificially collect high-utility data. Similarly, different finetuning mechanisms such as LoRA93 or few-shot learning strategies such as meta-learning94 might prove fruitful.
Despite these arguments, training a machine-learning decoder at larger code distances will remain challenging. Correlated matching (and other, more recent graph-based decoders) will remain strong contenders going forward. Demonstrating that AlphaQubit can scale to distance 25 while maintaining competitive accuracy against MWPM-Corr will be one of the necessary steps (together with other decoding scenarios; see ‘Generalization to logical computations’) towards decoding a fault-tolerant computation at scale. We expect that improvements during training, as outlined above, and further hyperparameter adjustments and architectural improvements will be crucial to go to distances beyond 11 in an efficient manner.

Time scalability
Ultimately, to perform fault-tolerant quantum computation at arbitrary circuit depths, a decoder needs to be able to maintain its decoding performance for an arbitrary number of error-correction rounds. Extended Data Fig. 6c shows the performance of networks trained up to 25 rounds and demonstrates that they maintain their performance when applied to much longer experiments, up to 100,000 rounds or until the 1 − 2 × logical error drops below 0.1.
We note that there appears to be a systematic decrease in LER with experiment length for the matching decoder, as well as AlphaQubit on smaller distances. We suggest that this is caused by the relatively high measurement noise in our simulations (inspired by a superconducting noise profile). This measurement noise is particularly damaging in the terminal round, where data-qubit measurements can cause space-like-separated detection-event pairs, which tend to be more damaging than time-like-separated detection-event pairs. This is further exacerbated by withholding I/Q information about these measurements from the machine-learning and MWPM-Corr decoders (see ‘Pitfalls for training on soft information’). Although we have chosen to train on 25 rounds to mirror previous work4, it would be interesting to quantify the effect of extending the training to more rounds, or finetuning for longer experiments.
We also note that although the recurrent architecture of our decoder means it uses a fixed amount of memory regardless of the experiment duration, PyMatching takes the entire matching graph as input, leading to memory consumption growing linearly with the experiment duration. We leave comparisons to streaming decoder implementations to future work.
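The fixed-memory streaming behaviour described above can be illustrated with a toy recurrent loop. This is a schematic of the general idea only — the update rule, class name and weight shapes below are invented for illustration and are not the AlphaQubit network:

```python
import numpy as np

class RecurrentDecoderSketch:
    """Toy recurrent decoder: a single fixed-size state vector is updated
    once per error-correction round, so memory use is independent of the
    experiment duration (unlike storing one matching graph spanning the
    whole experiment)."""

    def __init__(self, num_stabilizers, state_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Made-up fixed weights standing in for a trained network.
        self.w_in = 0.1 * rng.standard_normal((num_stabilizers, state_dim))
        self.w_rec = 0.1 * rng.standard_normal((state_dim, state_dim))
        self.state = np.zeros(state_dim)

    def step(self, syndrome):
        # Fold one round of stabilizer measurements into the state.
        self.state = np.tanh(syndrome @ self.w_in + self.state @ self.w_rec)

    def logical_flip_probability(self):
        # Toy readout: a probability for the logical observable flipping.
        return float(1.0 / (1.0 + np.exp(-self.state.sum())))

# Stream an arbitrary number of rounds through the same fixed-size state.
decoder = RecurrentDecoderSketch(num_stabilizers=24)  # d = 5 has d**2 - 1 = 24
rng = np.random.default_rng(1)
for _ in range(1000):
    decoder.step(rng.integers(0, 2, size=24).astype(float))
prob = decoder.logical_flip_probability()
```

However many rounds are streamed through `step`, the decoder holds only `state_dim` numbers of state, which is the property contrasted with PyMatching’s linearly growing matching graph.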
Soft matching
The MWPM decoder can also be augmented to use soft information40. For each I/Q point, we can compute the posterior probability that the sample was drawn from either the ∣0⟩-outcome distribution or the ∣1⟩-outcome distribution—see ‘Measurement noise’. It is noted that this can sometimes classify ∣2⟩ outcomes as highly confident ∣1⟩ outcomes. We can threshold these posterior probabilities to obtain binary measurement outcomes that are used to compute detection events. Then, the probability of the opposite outcome can be interpreted as a measurement error, which contributes a probability to one of the error edges in the error graph that instantiates the MWPM decoder. The probabilities of these edges are reassigned, replacing the average measurement probability contribution with these instance-specific probabilities. This change can be further propagated to the correlation reweighting step25. The posterior probabilities for data-qubit measurements are withheld to compare fairly with AlphaQubit, from which these values are also withheld. We leave comparison with a leakage-aware matching decoder, which reweights edges based on leakage detections95, to future work.

Ablations
To understand the effect of different components of the architecture, we conduct ablations, by training networks where we remove or simplify one aspect of the main design and seeing the effect. For each scenario (described in the following sections), we trained 5 models with different random seeds and compare the mean test-set LER in Extended Data Fig. 9a,b for 5 × 5 Sycamore DEM pretraining and 11 × 11 Pauli+ training, respectively. For the former, Extended Data Fig. 9c also shows the effect on training speed. In each case, other hyperparameters were not changed, and it is possible that lost performance could be recovered by compensating with other changes (see ‘Decoding speed’). For ablations, we assess only pretraining performance.
Although many of the ablations have only a small effect on the performance at 5 × 5, at 11 × 11 the effects are more marked.

Model ablations. LSTM. We substitute the whole recurrent core with a stack of 6 LSTMs, as implemented in Haiku96. To keep the number of parameters roughly constant upon scaling (as AlphaQubit does), we make the width of the LSTM hidden layers dependent on the code distance d, and equal to 64 × (25 − 1)/(d² − 1). As the LSTM uses dense layers, which lack any spatial equivariance, we also remove the scatter and mean-pooling operations in the readout.
NoConv. We remove all the convolutional elements in the syndrome transformer.
SimpleReadoutStack. We reduce the number of layers in the readout ResNet from 16 to 1 (Extended Data Fig. 8b).
SimpleInputStack. We reduce the number of ResNet layers in the feature embedding from 2 to 1 (Extended Data Fig. 8b).
PoolingStabs. We do not scatter to two dimensions before pooling in the readout. The result is that we mean-pool across all stabilizers instead of along data-qubit rows or columns (corresponding to logical observables).
NoAttBias. We remove the attention bias, both embedding and event indicator features. The Pauli+ experiments were done without attention bias.
NoNextStabPred. We remove the next-stabilizer prediction loss from the loss.
FewerDims. We reduce the number of dimensions per stabilizer in the syndrome transformer to 30.
FewerLayers. We reduce the number of layers in the syndrome transformer from 3 to 1 for each round.

Input ablations. OnlyEvents. We only give syndrome information as detection events (removing measurements).
OnlyMeasurements. We only give syndrome information as raw qubit measurements (removing events).
For both of these, we point out that we give the cumulative sum (mod 2) of detection events as inputs where we are pretraining on a DEM (which does not provide the initial state of stabilizer measurements, and thus cannot be used to re-create absolute measurements).

Data availability
The data for the Pauli+ simulations and documentation for loading the datasets are available at https://storage.mtls.cloud.google.com/gdm-qec. The data for Google’s Sycamore memory experiment are available at https://doi.org/10.5281/zenodo.6804040 (ref. 42).

Code availability
Detailed pseudocode for all components of the neural-network decoding architecture is provided as part of Supplementary Information.

59. Bombin, H. & Martin-Delgado, M. A. Optimal resources for topological two-dimensional stabilizer codes: comparative study. Phys. Rev. A 76, 012305 (2007).
60. Kitaev, A. Y. in Quantum Communication, Computing, and Measurement (eds Hirota, O. et al.) 181–188 (Springer, 1997).
61. Bonilla Ataides, J. P., Tuckett, D. K., Bartlett, S. D., Flammia, S. T. & Brown, B. J. The XZZX surface code. Nat. Commun. 12, 2172 (2021).
62. Sank, D. et al. Measurement-induced state transitions in a superconducting qubit: beyond the rotating wave approximation. Phys. Rev. Lett. 117, 190503 (2016).
63. Khezri, M. et al. Measurement-induced state transitions in a superconducting qubit: within the rotating-wave approximation. Phys. Rev. Appl. 20, 054008 (2023).
64. Sank, T. Fast, Accurate State Measurement in Superconducting Qubits. PhD thesis, Univ. California, Santa Barbara (2014).
65. Miao, K. C. et al. Overcoming leakage in quantum error correction. Nat. Phys. 19, 1780–1786 (2023).
66. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (NIPS, 2017).
67. Brown, T. B. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
68. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
69. Shazeer, N. GLU variants improve transformer. Preprint at https://arxiv.org/abs/2002.05202 (2020).
70. Egorov, E., Bondesan, R. & Welling, M. The END: an equivariant neural decoder for quantum error correction. Preprint at https://arxiv.org/abs/2304.07362 (2023).
71. Gicev, S., Hollenberg, L. C. & Usman, M. A scalable and fast artificial neural network syndrome decoder for surface codes. Quantum 7, 1058 (2023).
72. Babuschkin, I. et al. The DeepMind JAX ecosystem. GitHub http://github.com/deepmind (2020).
73. You, Y. et al. Large batch optimization for deep learning: training BERT in 76 minutes. In International Conference on Learning Representations (ICLR, 2020).
74. Chen, X. et al. Symbolic discovery of optimization algorithms. Adv. Neural Inf. Process. Syst. 36, 49205–49233 (2024).
75. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
76. Higgott, O. & Gidney, C. Sparse Blossom: correcting a million errors per core second with minimum-weight matching. Preprint at https://arxiv.org/abs/2303.15933 (2023).
77. Varsamopoulos, S., Bertels, K. & Almudever, C. G. Comparing neural network based decoders for the surface code. IEEE Trans. Comput. 69, 300–311 (2019).
78. Ho, J., Kalchbrenner, N., Weissenborn, D. & Salimans, T. Axial attention in multidimensional transformers. Preprint at https://arxiv.org/abs/1912.12180 (2019).
79. Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://arxiv.org/abs/1503.02531 (2015).
80. Zagoruyko, S. & Komodakis, N. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR, 2017).
81. Howard, A. G. et al. MobileNets: efficient convolutional neural networks for mobile vision applications. Preprint at https://arxiv.org/abs/1704.04861 (2017).
82. JAX: composable transformations of Python+NumPy programs. GitHub https://github.com/jax-ml/jax (2020).
83. Overwater, R. W., Babaie, M. & Sebastiano, F. Neural-network decoders for quantum error correction using surface codes: a space exploration of the hardware cost-performance tradeoffs. IEEE Trans. Quantum Eng. 3, 1–19 (2022).
84. Waldherr, G. et al. Quantum error correction in a solid-state hybrid spin register. Nature 506, 204–207 (2014).
85. Luo, Y.-H. et al. Quantum teleportation of physical qubits into logical code spaces. Proc. Natl Acad. Sci. USA 118, e2026250118 (2021).
86. Sundaresan, N. et al. Demonstrating multi-round subsystem quantum error correction using matching and maximum likelihood decoders. Nat. Commun. 14, 2852 (2023).
87. Gupta, R. S. et al. Encoding a magic state with beyond break-even fidelity. Nature 625, 259–263 (2024).
88. Paler, A. & Fowler, A. G. Pipelined correlated minimum weight perfect matching of the surface code. Quantum 7, 1205 (2023).
89. DeMarti iOlius, A., Martinez, J. E., Fuentes, P. & Crespo, P. M. Performance enhancement of surface codes via recursive minimum-weight perfect-match decoding. Phys. Rev. A 108, 022401 (2023).
90. Delfosse, N., Paetznick, A., Haah, J. & Hastings, M. B. Splitting decoders for correcting hypergraph faults. Preprint at https://arxiv.org/abs/2309.15354 (2023).
91. Lin, T.-Y., Goyal, P., Girshick, R. B., He, K. & Dollár, P. Focal loss for dense object detection. In IEEE International Conference on Computer Vision 2999–3007 (IEEE, 2017).
92. Shrivastava, A., Gupta, A. K. & Girshick, R. B. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition 761–769 (IEEE, 2016).
93. Hu, J. E. et al. LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR, 2022).
94. Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning Vol. 70, 1126–1135 (ACM, 2017).
95. Suchara, M., Cross, A. W. & Gambetta, J. M. Leakage suppression in the toric code. In IEEE International Symposium on Information Theory 1119–1123 (IEEE, 2015).
96. Hennigan, T., Cai, T., Norman, T. & Babuschkin, I. Haiku: sonnet for JAX. Version 0.0.9. GitHub http://github.com/deepmind/dm-haiku (2020).
97. Krastanov, S. & Jiang, L. Deep neural network probabilistic decoder for stabilizer codes. Sci. Rep. 7, 11003 (2017).
98. Torlai, G. & Melko, R. G. Neural decoder for topological codes. Phys. Rev. Lett. 119, 030501 (2017).
99. Andreasson, P., Johansson, J., Liljestrand, S. & Granath, M. Quantum error correction for the toric code using deep reinforcement learning. Quantum 3, 183 (2019).
100. Maskara, N., Kubica, A. & Jochym-O’Connor, T. Advantages of versatile neural-network decoding for topological codes. Phys. Rev. A 99, 052351 (2019).
101. Wagner, T., Kampermann, H. & Bruß, D. Symmetries for a high-level neural decoder on the toric code. Phys. Rev. A 102, 042411 (2020).
102. Fitzek, D., Eliasson, M., Kockum, A. F. & Granath, M. Deep Q-learning decoder for depolarizing noise on the toric code. Phys. Rev. Res. 2, 023230 (2020).
103. Ni, X. Neural network decoders for large-distance 2D toric codes. Quantum 4, 310 (2020).
104. Meinerz, K., Park, C.-Y. & Trebst, S. Scalable neural decoder for topological surface codes. Phys. Rev. Lett. 128, 080505 (2022).
105. Matekole, E. S., Ye, E., Iyer, R. & Chen, S. Y.-C. Decoding surface codes with deep reinforcement learning and probabilistic policy reuse. Preprint at https://arxiv.org/abs/2212.11890 (2022).
106. Choukroun, Y. & Wolf, L. Deep quantum error correction. In Proc. 38th AAAI Conference on Artificial Intelligence 64–72 (AAAI, 2024).
107. Chamberland, C., Goncalves, L., Sivarajah, P., Peterson, E. & Grimberg, S. Techniques for combining fast local decoders with global decoders under circuit-level noise. Quantum Sci. Technol. 8, 045011 (2023).
108. Wang, H. et al. Transformer-QEC: quantum error correction code decoding with transferable transformers. In 7th International Conference on Computer-Aided Design (ICCAD, 2023).
109. Hall, B., Gicev, S. & Usman, M. Artificial neural network syndrome decoding on IBM quantum processors. Phys. Rev. Res. 6, L032004 (2024).
110. Bordoni, S. & Giagu, S. Convolutional neural network based decoders for surface codes. Quantum Inf. Process. 22, 151 (2023).

Acknowledgements The Google DeepMind team thank J. Adler, C. Beattie, S. Bodenstein, C. Donner, P. Drotár, F. Fuchs, A. Gaunt, I. von Glehn, J. Kirkpatrick, C. Meyer, S. Mourad, S. Nowozin, I. Penchev, N. Sukhanov and R. Tanburn for discussions and other contributions to the project, and acknowledge the support provided by many others at Google DeepMind. The Google Quantum AI team thank A. Fowler, T. O’Brien and N. Shutty for their feedback on the paper.

Author contributions J.A. and D.K. developed the models and wrote the software for modelling realistic noise in superconducting processors. J.B. conceptualized and supervised the research, and contributed to project administration, data curation, investigation, formal analysis, validation and visualization of results, and writing of the paper. S. Blackwell supported the investigation, resource provision and software development aspects of the experiments. S. Boixo provided project supervision, software tools and coordination, and direction of priorities for scalable decoding. A.D. helped conceptualize and supervise the research, and contributed to data curation, the investigation, methodology and writing of the paper. T.E. contributed to data curation, resource provision and methodology, the investigation, methodology, formal analysis, validation and visualization of results, and writing of the paper. C.G. contributed knowledge about the theory of decoders and configuring noise models and software support. D.H. contributed to the research conceptualization and supervision, and sponsored the research. F.J.H.H. helped conceptualize the research, and contributed to data curation, investigation, methodology, formal analysis, validation and visualization of the results, and the writing of the paper. G.H. provided project administration, and supported the conceptualization of research direction. C.J. provided project supervision, software tools and coordination, and direction of priorities for scalable decoding. P.K. contributed to the research conceptualization and supervision, and sponsored the research. H.N. contributed to the research conceptualization and supervision, and sponsored the research. M.N. helped conceptualize the research, and contributed to the investigation and methodology, project supervision, knowledge about the theory of decoders, analysis and validation of results, and the writing of the paper. M.Y.N. contributed to the research conceptualization, the investigation, methodology, cross-talk modelling in superconducting processors and the writing of the paper. K.S. contributed experimental knowledge on leakage, measurement, soft information and analysing soft information for decoding, and helped write the paper. A.W.S. led the investigation and methodology, helped conceptualize and supervise the research, and contributed to data curation, formal analysis, validation and visualization of the results, and the writing of the paper.

Competing interests Author-affiliated entities have filed US and international patent applications related to quantum error-correction using neural networks and to use of in-phase and quadrature information in decoding, including US18/237,204, PCT/US2024/036110, US18/237,323, PCT/US2024/036120, US18/237,331, PCT/US2024/036167, US18/758,727 and PCT/US2024/036173.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41586-024-08148-8.
Correspondence and requests for materials should be addressed to Johannes Bausch or Andrew W. Senior.
Peer review information Nature thanks Neil Gillespie, Nadia Haider and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Reprints and permissions information is available at http://www.nature.com/reprints.
Extended Data Fig. 1 | Stabilizer readout circuit for a 3 × 3 XZZX rotated surface code in the Z basis. (a) Common X and Z stabilizer readout for the XZZX code (ref. 61). Here, the first four lines (a–d) are the data qubits surrounding a stabilizer qubit (last line), which has been reset to ∣0⟩. (b) Relative strength of noise operations in an SI1000 circuit depolarizing noise model, parameterized by p. (c) Corresponding circuit depolarizing noise gate and error schema. Black dots indicate data qubits, gray dots indicate X/Z stabilizer qubits, as detailed in Fig. 1a. D (yellow blocks) labels single- or two-qubit depolarizing noise, and X labels a bit-flip channel. M, R, and MR are measurement, reset, and combined measurement and reset in the Z basis. H is a Hadamard gate, and CZ gates are indicated by their circuit symbol.
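The stabilizer outcomes produced by a readout circuit like this are conventionally converted into detection events by XOR-ing each stabilizer's outcome with its value in the previous round. A minimal sketch of that standard convention (the all-zero first-round reference and the function name are illustrative, not taken from the paper's pipeline):

```python
def detection_events(syndrome_rounds):
    """Convert raw stabilizer readouts into detection events: each event
    is the XOR of a stabilizer's outcome with the same stabilizer's
    outcome in the previous round; the first round is compared against
    an all-zero reference (appropriate for Z-basis preparation)."""
    events = []
    prev = [0] * len(syndrome_rounds[0])
    for rnd in syndrome_rounds:
        events.append([a ^ b for a, b in zip(rnd, prev)])
        prev = rnd
    return events

# Three rounds of four stabilizer outcomes; events flag changes only.
ev = detection_events([[0, 1, 0, 1], [0, 1, 1, 1], [1, 1, 1, 0]])
```

A detection event thus marks a *change* in a stabilizer's outcome, which is what decoders such as AlphaQubit or PyMatching consume.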
Extended Data Fig. 2 | Noise and event densities for datasets used. (a) Simplified I/Q noise with signal-to-noise ratio SNR = 10 and normalised measurement time t = 0.01. Top plot: point spread functions projected from their in-phase, quadrature, and time components onto a one-dimensional z axis. Shown are the sampled point spread functions for ∣0⟩ (blue), ∣1⟩ (green), and a leaked higher-excited state ∣2⟩ (violet). Bottom plot: posterior sampling probability for the three measurement states, for prior weights w2 = 0.5%, w0 = w1 = 49.75%. (b) Event densities for different datasets and the corresponding SI1000 p-value. We indicate the event density of the different datasets on the top x-axis, with a non-linear scale. The detector error models are fitted to the Sycamore surface code experiment (ref. 4). All datasets use the XZZX circuit variant of the surface code (with CZ gates for the stabilizer readout, 'The rotated surface code' in Methods, Extended Data Fig. 1). As we never compile between gatesets, there is no implied noise overhead; these are the final event densities observed when sampling from the respective datasets. For datasets with soft I/Q noise, the plots above show the average soft event density as explained in 'Measurement noise' in Methods.
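The posterior sampling probabilities in panel (a) follow from Bayes' rule applied to the three state likelihoods under the given priors. A minimal sketch assuming one-dimensional Gaussian point spread functions on the projected z axis (the means and width below are illustrative choices, not the paper's calibrated model):

```python
import math

def posterior(z, means, sigma, priors):
    """Bayes-rule posterior over measurement states |0>, |1>, |2> for a
    projected I/Q readout value z, assuming Gaussian likelihoods."""
    likelihoods = [
        math.exp(-0.5 * ((z - m) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for m in means
    ]
    weighted = [w * l for w, l in zip(priors, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Priors from the caption: w0 = w1 = 49.75%, w2 = 0.5%.
priors = [0.4975, 0.4975, 0.005]
# Illustrative state means on the z axis and an illustrative width.
p = posterior(0.0, means=[0.0, 1.0, 2.0], sigma=0.1, priors=priors)
```

With a readout value near the ∣0⟩ mean, essentially all posterior mass lands on ∣0⟩; the small leakage prior keeps ∣2⟩ below ∣1⟩ everywhere between their means.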
Extended Data Fig. 3 | Individual fits of logical error per round for 3 × 3 and 5 × 5 memory experiments, for the p_ij-DEM pretrained model.
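A logical-error-per-round fit of this kind is commonly extracted from the decay of logical fidelity with round number via 2F(n) − 1 = A(1 − 2ε)^n. A sketch using this standard decay model (the fitting details here are generic, not necessarily the paper's exact procedure):

```python
import math

def fit_error_per_round(rounds, fidelities):
    """Least-squares line fit of log(2F - 1) vs. round number n,
    assuming the standard decay model 2F(n) - 1 = A * (1 - 2*eps)**n.
    Returns eps, the logical error per round."""
    xs = rounds
    ys = [math.log(2 * f - 1) for f in fidelities]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return (1 - math.exp(slope)) / 2

# Synthetic check: data generated with eps = 0.03 should be recovered.
rounds = [1, 3, 5, 7, 9]
fids = [0.5 * (1 + (1 - 2 * 0.03) ** n) for n in rounds]
eps = fit_error_per_round(rounds, fids)
```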
Extended Data Fig. 4 | The neural network architecture designed for surface code decoding. (a) 5 × 5 rotated surface code layout, with data qubits (dark grey dots) and X and Z stabilizer qubits (labelled light grey dots, or highlighted in blue/red when they detect a parity violation) interspersed in a checkerboard pattern. Logical observables Z_L and X_L are shown as bold lines on the left and bottom grid edges respectively. (b) The recurrent network iterates over time, updating a representation of the decoder state and incorporating the new stabilizers at each round. Three parallel lines indicate a representation per stabilizer. (c) Creation of an embedding vector S^n_i for each new stabilizer i = 1, …, d² − 1 for a distance-d surface code. (d) Each block of the recurrent network combines the decoder state and the stabilizers S^n = (S^n_1, …, S^n_M) for one round (scaled down by a factor of 0.7). The decoder state is updated through three Syndrome Transformer layers. (e) Each Syndrome Transformer layer updates the stabilizer representations through multi-headed attention, optionally modulated by a learned attention bias, followed by a dense block and dilated 2D convolutions. (f) Logical errors are predicted from the final decoder state. The triple lines marked with * indicate a representation per data qubit.
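The recurrent combination in panel (d) can be sketched as follows; only the 0.7 scale factor comes from the caption, while the additive mixing and the callable-layer interface are illustrative assumptions:

```python
def decoder_step(state, stabilizer_embeddings, transformer_layers):
    """One recurrent block: mix the new round's stabilizer embeddings
    into the decoder state, scaled down by 0.7 (per the caption), then
    apply the Syndrome Transformer layers. Additive mixing is an
    assumption for illustration."""
    state = [0.7 * (s + e) for s, e in zip(state, stabilizer_embeddings)]
    for layer in transformer_layers:
        state = layer(state)
    return state

# Per-stabilizer scalars stand in for embedding vectors here.
out = decoder_step([1.0, 2.0], [0.5, 0.5], transformer_layers=[])
out2 = decoder_step([1.0], [0.0], transformer_layers=[lambda s: [2 * x for x in s]])
```

The scaling keeps the recurrent state from growing unboundedly as rounds accumulate, in the same spirit as the scaled residual connections used in deep transformer stacks.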
Extended Data Fig. 5 | Further architecture details. (a) Attention bias visualisation. Attention bias logits of the four heads of the first Syndrome Transformer layer of our decoder model pretrained on 5 × 5 DEM in the Z basis. We obtain the logits by combining the learned attention bias embedding with all-zero stabilizer values. The 24 × 24 attention logit matrices are each visualized as one grid per stabilizer, laid out according to the physical layout of the attending stabilizer qubits. Each grid shows the logits for the attention to each stabilizer, with self-attention highlighted with a black square. (b) Architecture of the network when predicting labels at every round. S^n are the stabilizer representations as explained in Extended Data Fig. 4b, where the primed quantities S′^n indicate the embeddings are computed using different embedding parameters and based only on the stabilizers in the experiment basis computed from the final data qubit measurements ('Input representation' in Methods).
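An attention bias of this kind enters as an additive term in the attention logits before the softmax. A generic sketch of attention weights with an additive bias (the single-query, list-based form is a simplification of the model's multi-headed attention):

```python
import math

def attention_weights(scores, bias):
    """Softmax over attention logits with an additive learned bias, as
    in attention 'modulated by a learned attention bias'. The additive
    form is the common convention and an assumption here."""
    logits = [s + b for s, b in zip(scores, bias)]
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Equal content scores; the bias alone doubles the weight of key 2.
w = attention_weights([1.0, 1.0, 1.0], [0.0, 0.0, math.log(2)])
```

Because the bias is added before the softmax, it reweights attention multiplicatively and independently of the content scores, which is why it can encode a fixed spatial prior over stabilizer positions.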
Extended Data Fig. 6 | Calibration of AlphaQubit's outputs and generalization to longer experiments for code distances 3–11. (a) Calibration for Pauli+ generated data with SNR = 10, t = 0.01, and 0.1% stabilizer qubit leakage chance. (b) Calibration histogram of predicted probabilities. The predictions are grouped into correct (blue) and incorrect (red) before building the histogram, and then binned into "certainty" bins depending on their distance from a 50:50 prediction, i.e. by ∣1/2 − p∣ for a predicted probability p. For all code distances, wrong predictions have a lower certainty. Correct predictions concentrate around the interval edges, i.e. at probabilities 0 and 1, resulting in a high certainty. (c) Generalization to longer experiments. 1 − 2 × logical error and logical error rates for networks pretrained and finetuned on datasets of up to 25 error detection rounds but applied to decoding experiments of longer durations. We only plot logical error rates where the corresponding 1 − 2 × logical error is greater than 0.1. The data are generated from the same simulated experiments, stopped at different numbers of rounds.
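The certainty binning described in panel (b) can be sketched directly; the bin count and edges below are illustrative:

```python
def certainty_histogram(probs, labels, n_bins=5):
    """Bin predicted probabilities by certainty |1/2 - p| (in [0, 0.5]),
    split into correct and incorrect predictions, as in the caption."""
    correct = [0] * n_bins
    incorrect = [0] * n_bins
    for p, y in zip(probs, labels):
        certainty = abs(0.5 - p)
        b = min(int(certainty / 0.5 * n_bins), n_bins - 1)
        if (p >= 0.5) == bool(y):
            correct[b] += 1
        else:
            incorrect[b] += 1
    return correct, incorrect

# A confident correct, a confident correct, and an uncertain wrong guess.
c, i = certainty_histogram([0.95, 0.1, 0.45], [1, 0, 1], n_bins=5)
```

A well-calibrated decoder shows the pattern of the figure: errors pile up in the low-certainty bins, correct predictions in the high-certainty ones.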
Extended Data Fig. 7 | Further result details on Sycamore and scaling experiments, and decoder speed. (a) Decoding time per error correction round vs. code distance. The hatched region for d > 11 indicates that while AlphaQubit is the same for all code distances, it has not been trained or shown to work beyond d = 11. The line is a least-squares fit to a × d^exponent, and the shaded region marks a 95% CI. Also shown is the timing of uncorrelated matching (PyMatching). Times for PyMatching use a current CPU (Intel Xeon Platinum 8173M) and for AlphaQubit use current TPU hardware with batch size 1. (b) LER of a decoder trained jointly on distances 3–11, as compared to decoders trained solely on the individual code distances (both pretrained only, see 'Decoding at higher distances' and 'Pauli+' in Methods). Uncertainty is bootstrap standard error (9 resamples). The model and training hyperparameters are identical in both cases, but for the jointly trained decoder we swap the index embedding for a relative positional embedding, where each stabilizer position (normalized to [−1, 1] × [−1, 1]) is embedded ('Input representation' in Methods). (c) Number of pretraining samples until the individual models (pre-ensembling) achieve the given LER (relative to the lowest-achieved LER, the latter shown as a brown line). The dashed line indicates when the training was stopped (see 'Termination' in Methods). Error bars are 95% CI (N = 15). (d) Performance improvement by ensembling multiple models, shown for the p_ij-pretrained model on the Sycamore experiments (XEB and SI1000 variants show about the same improvement). (e) Average error suppression factors Λ for the Pauli+ experiment. The error suppression factor is computed from the data in Fig. 4, via the geometric average Λ3/11 = (ϵ₃/ϵ₁₁)^(1/4), for a logical error per round ϵ₃ at code distance 3, and ϵ₁₁ at distance 11, respectively.
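The geometric-average error suppression factor from panel (e) is a one-line computation: four steps of d → d + 2 lie between distances 3 and 11, hence the exponent 1/4.

```python
def error_suppression_factor(eps3, eps11):
    """Geometric-average error suppression factor between code
    distances 3 and 11, Lambda_{3/11} = (eps3 / eps11) ** (1/4),
    as given in the caption."""
    return (eps3 / eps11) ** 0.25

# Illustrative values: eps halves four times faster than this in practice
# only if Lambda > 2; here eps3/eps11 = 1e4 gives Lambda = 10.
lam = error_suppression_factor(1e-2, 1e-6)
```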
Extended Data Fig. 8 | Network and training hyperparameters. (a) Noise curriculum parameters used in pretraining for the Sycamore experiment. (b) Hyperparameters of the network architecture. (c) The dilations of the three 3 × 3 convolutions in each Syndrome Transformer layer and the experiment learning rates are determined by the code distance of the experiment. (d) Hyperparameters for finetuning of the scaling experiment. The learning rate for finetuning was the initial learning rate for the code distance from Extended Data Fig. 8c, scaled by a factor dependent on the training set size. The finetuning cosine learning rate schedule length was code distance × 2/3 × 10⁸ samples.
Extended Data Fig. 9 | Ablations: the effect of removing or simplifying decoder architecture elements. Decoder performance under ablations (white bars) compared to the baseline, ordered by mean LER. The blue bars indicate the model used for the experiments. (a) For 5 × 5 p_ij DEM training, averaged across bases (X and Z) and the two cross-validation folds. The red horizontal line represents the performance of the finetuned ensemble. Error bars represent bootstrap standard errors from individual fidelities (499 resamples). (b) For 11 × 11 Pauli+ training. Error bars represent estimated mean error (N = 5). (c) Effect of ablations on performance and data efficiency. Decoding performance of the best-performing subset of ablations for 5 × 5 DEM and 11 × 11 Pauli+. Colors indicate the number of training samples required for reaching performance parity with PyMatching for 11 × 11 Pauli+.
Extended Data Table 1 | ML decoders, further references and details

Further references are refs. 97–110.
