Learning high-accuracy error decoding for quantum processors
Building a large-scale quantum computer requires effective strategies to correct errors that inevitably arise in physical quantum systems1. Quantum error-correction codes2 present a way to reach this goal by encoding logical information redundantly into many physical qubits. A key challenge in implementing such codes is accurately decoding noisy syndrome information extracted from redundancy checks to obtain the correct encoded logical information. Here we develop a recurrent, transformer-based neural network that learns to decode the surface code, the leading quantum error-correction code3. Our decoder outperforms other state-of-the-art decoders on real-world data from Google's Sycamore quantum processor for distance-3 and distance-5 surface codes4. On distances up to 11, the decoder maintains its advantage on simulated data with realistic noise including cross-talk and leakage, utilizing soft readouts and leakage information. After training on approximate synthetic data, the decoder adapts to the more complex, but unknown, underlying error distribution by training on a limited budget of experimental samples. Our work illustrates the ability of machine learning to go beyond human-designed algorithms by learning from data directly, highlighting machine learning as a strong contender for decoding in quantum computers.
1Google DeepMind, London, UK. 2Google Quantum AI, Santa Barbara, CA, USA. 3These authors contributed equally: Johannes Bausch, Andrew W. Senior, Francisco J. H. Heras, Thomas Edlich, Alex Davies, Michael Newman. ✉e-mail: [email protected]; [email protected]

The idea that quantum computation has the potential for computational advantages over classical computation, both in terms of speed and resource consumption, dates all the way back to Feynman5. Beyond Shor's well-known prime factoring algorithm6 and Grover's quadratic speed-up for unstructured search7, many potential applications in fields such as material science8, machine learning9 and optimization10 have been proposed.

Yet, for practical quantum computation to become a reality, errors on the physical level of the device need to be corrected so that deep circuits can be run with high confidence in their result. Such fault-tolerant quantum computation can be achieved through redundancy introduced by combining multiple physical qubits into one logical qubit1. Ultimately, to perform fault-tolerant quantum computation such as the factorization of a 2,000-bit number, the logical error rate needs to be reduced to about 10^−12 per logical operation3,11, far below the error rates in today's hardware, which are around 10^−3 to 10^−2 per physical operation.

One of the most promising strategies for fault-tolerant computation is based on the surface code (Fig. 1), which has the highest-known tolerance for errors of any code with a planar connectivity3,12,13. In the surface code, a logical qubit is formed by a d × d grid of physical qubits, called data qubits, such that errors can be detected by periodically measuring X and Z stabilizer checks on groups of adjacent data qubits, using d^2 − 1 stabilizer qubits located between the data qubits (Fig. 1a). A detection event (or event) occurs when two consecutive measurements of the same stabilizer give different parity outcomes. A pair of observables XL and ZL, which commute with the stabilizers but anti-commute with each other, define the logical state of the surface code qubit. The minimum length of these observables is called the code distance, which represents the number of errors required to change the logical qubit without flipping a stabilizer check. In a square surface code, this is the side length d of the data-qubit grid.

The task of an error-correction decoder is to use the history of stabilizer measurements, the error syndrome, to apply a correction to the noisy logical measurement outcome to obtain the correct one. In the near term, highly accurate decoders can enable proof-of-principle demonstrations of fault tolerance. Longer term, they can boost the effective performance of the quantum device, requiring fewer physical qubits per logical qubit or reducing requirements on device accuracy3,4,14.

Quantum error correction frequently requires different decoding methods from classical error correction15,16 and, despite recent significant progress4,17–21, challenges remain. A quantum error-correction decoder must contend with complex noise effects that include leakage, that is, qubit excitations beyond the computational states ∣0⟩ and ∣1⟩ that are long-lived and mobile22; and cross-talk, that is, unwanted interactions between qubits inducing long-range and complicated patterns of events23. These effects fall outside the theoretical assumptions underlying most frequently used quantum error-correction decoders, such as minimum-weight perfect matching (MWPM)16,24. Extending decoders …
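To make the detection-event definition above concrete, here is a minimal sketch (variable names and shapes are ours, not from the paper) that derives events by comparing consecutive rounds of stabilizer measurements:

```python
import numpy as np

def detection_events(stabilizer_rounds: np.ndarray) -> np.ndarray:
    """Compute detection events from a syndrome history.

    stabilizer_rounds: bool array of shape (n_rounds, d*d - 1) holding the
    measured parity of each of the d^2 - 1 stabilizers in every round.
    An event fires wherever two consecutive measurements of the same
    stabilizer disagree; the first round is compared against an assumed
    all-zero reference parity for simplicity.
    """
    first = stabilizer_rounds[:1]                            # round 0 vs. reference 0
    diffs = stabilizer_rounds[1:] ^ stabilizer_rounds[:-1]   # consecutive XOR
    return np.concatenate([first, diffs], axis=0)

# Example: d = 3, so 8 stabilizers per round, 5 rounds of noisy parities.
rng = np.random.default_rng(0)
syndromes = rng.random((5, 8)) < 0.1
print(detection_events(syndromes).astype(int))
```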
[Fig. 1 and Fig. 2 appear here. Fig. 1 depicts the surface code layout, with X and Z stabilizers interspersed between the data qubits and the logical observables ZL and XL along the grid edges. Fig. 2 depicts one error-correction round of the recurrent decoder (the round-n syndrome is embedded and repeated syndrome transformer layers update the decoder state) and the training pipeline (pretraining on a data-agnostic SI1000 noise model or on pij/XEB-fitted noise models, followed by finetuning on device data).]
Fig. 2 | Error correction and training of AlphaQubit. a, One error-correction round in the surface code. The X and Z stabilizer information updates the decoder's internal state, encoded by a vector for each stabilizer. The internal state is then modified by multiple layers of a syndrome transformer neural network containing attention and convolutions. The state at the end of an experiment is used to predict whether an error has occurred. b, Decoder training stages. Pretraining samples come either from a data-agnostic SI1000 noise model, or from an error model derived from experimental data using pij or XEB methods4,31.
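A minimal sketch of the recurrence described in Fig. 2a, with the syndrome transformer stack replaced by a single dense mixing layer for brevity; all dimensions, initializations and the final readout are illustrative assumptions, not the paper's model:

```python
import numpy as np

class ToyRecurrentDecoder:
    """Minimal sketch of the recurrence in Fig. 2a (not the paper's model).

    The decoder keeps one state vector per stabilizer.  Each round, the new
    stabilizer measurements are embedded, mixed into the state, and the state
    is refined (here by a single dense layer standing in for the syndrome
    transformer layers).  A final readout predicts whether the logical
    observable was flipped.
    """

    def __init__(self, n_stabilizers: int, dim: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(1, dim)) * 0.1     # per-measurement embedding
        self.mix = rng.normal(size=(dim, dim)) * 0.1     # stand-in for transformer layers
        self.readout = rng.normal(size=(dim,)) * 0.1
        self.state = np.zeros((n_stabilizers, dim))

    def step(self, stabilizers: np.ndarray) -> None:
        """Update the per-stabilizer state with one round of measurements."""
        update = stabilizers[:, None] * self.embed       # embed this round's syndromes
        self.state = np.tanh((self.state + update) @ self.mix)

    def predict_logical_flip(self) -> float:
        """Probability that the logical observable was flipped."""
        logits = self.state @ self.readout
        return float(1.0 / (1.0 + np.exp(-logits.sum())))

decoder = ToyRecurrentDecoder(n_stabilizers=8)
for round_syndromes in np.random.default_rng(1).integers(0, 2, size=(25, 8)):
    decoder.step(round_syndromes.astype(float))
print(decoder.predict_logical_flip())
```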
…(2.915 ± 0.016) × 10^−2 at distance 5 and Λ = 1.039 ± 0.010—to our knowledge, the most accurate decoder hitherto reported for this experiment4,45, but impractical for larger code distances owing to its computational cost. State-of-the-art MWPM-based decoders, such as correlated matching (MWPM-Corr), matching with belief propagation (MWPM-BP) and PyMatching, an open-source implementation of MWPM4,24,26, lead to higher LERs than the tensor network and AlphaQubit (Fig. 3a,b). For comparison, we also show the results of the LSTM-based neural network from Varbanov et al.36 and our own implementation of an LSTM (both pretrained on XEB DEMs). These achieve good results for 3 × 3. Varbanov's LSTM-based neural network fails to match the tensor-network decoder at 5 × 5 (Fig. 3b). Although our LSTM achieves this, it does not scale to larger code distances (see next section).

Pretraining with samples from a noise model matched to the experimental data (pij or XEB DEMs) leads to better performance than using the device-agnostic SI1000 (Fig. 3c). The pij DEMs are the same noise model that set the prior for the matching-based and tensor-network decoders. On this prior, our decoder achieves parity with the tensor-network decoder (within error). We note that even when pretraining with SI1000 samples, and without any finetuning, AlphaQubit achieves parity with MWPM-BP at code distance 5.

Finetuning with a limited amount of experimental data decreases the LER gap between models pretrained with well-matched (pij and XEB) and general (SI1000) priors, and improves the LER of all models well beyond the tensor-network decoder (Fig. 3c).

Quantum error correction for future quantum devices

Simulating future quantum devices
To achieve reliable quantum computation, the decoder must scale to higher code distances. To assess the decoder's accuracy on envisioned hardware with error rates significantly lower than the Sycamore experimental data ('Training details' in Methods and Extended Data Fig. 2b) and distances beyond 5, we explore the performance of our decoder on simulated data (in place of experimental samples) at code distances 3, 5, 7, 9 and 11 (17–241 physical qubits).

To go beyond conventional circuit noise models37,46, we use a Pauli+ simulator ('Pauli+ model for simulations with leakage' in Methods) that can model crucial real-world effects such as cross-talk and leakage. The simulator's readouts are further augmented with soft I/Q information that models a dispersive measurement of superconducting qubits, to capture underlying analogue information about uncertainty and leakage38,47,48 ('Measurement noise' and 'Simulating future quantum devices' in Methods, Fig. 4a, inset, and Extended Data Fig. 2a). These analogue I/Q measurements and derived events are provided to AlphaQubit in the form of probabilities40 ('Soft measurement inputs versus soft event inputs' and 'Input representation' in Methods).

Decoding at higher distances
For each code distance, we pretrain our decoder on up to 2.5 billion samples from a device-agnostic circuit depolarizing noise model (SI1000 with a simple variant of I/Q readout and leakage) before using a limited amount of data generated by the Pauli+ simulator (with realistic simulation of leakage and full I/Q noise; 'Pauli+' in Methods) to stand in as experimental data for finetuning. In Fig. 4a, we show the LER at each code distance after finetuning.

To establish strong baselines, we compare MWPM-Corr with a DEM tuned specifically for the Pauli+ noise model and augmented to benefit from analogue readouts. We also include our LSTM decoder, trained for code distances 3, 7 and 11 with unlimited Pauli+ training samples.

AlphaQubit achieves the highest accuracy for all code distances up to 11, surpassing even the correlated matching decoder augmented with
[Fig. 3 panels. a, 1 − 2 × logical error versus error-correction round (1–25) for the distance-3 (left) and distance-5 (right) experiments; legend LERs (distance 3 / distance 5): AlphaQubit (finetuned) 2.901% ± 0.023% / 2.748% ± 0.015%; tensor network 3.028% ± 0.023% / 2.915% ± 0.016%; MWPM-BP 3.117% ± 0.024% / 3.059% ± 0.014%; MWPM-Corr 3.498% ± 0.025% / 3.597% ± 0.015%; MWPM (PyMatching) 4.015% ± 0.031% / 4.356% ± 0.019%. b, Mean LER versus code distance (3 and 5) for the pretrained and finetuned decoders against published baselines (Varbanov et al.36, MWPM (PyMatching), MWPM-Corr, MWPM-BP, tensor network, LSTM 200M). c, Mean LER at distance 5 by pretraining noise model (SI1000, pij, XEB), pretrained versus finetuned, with MWPM-BP and tensor-network reference levels.]
Fig. 3 | Logical error per round on the 3 × 3 and 5 × 5 Sycamore experiment. All AlphaQubit results (both pretrained and finetuned) are for ensembles of 20 models. All results are averaged across bases, even and odd cross-validation splits, and, for the 3 × 3 experiments, the location (north, east, south, west (NESW)), and are fitted across experiments of different durations. a, The 1 − 2 × logical error versus error-correction round for code distance-3 and distance-5 memory experiments in the Sycamore experimental dataset for the baseline tensor-network decoder (black), our decoder (red) and three variants of MWPM (shades of grey). The LER is calculated from the slope of the fitted lines. The error bars are the 95% confidence interval. b, LERs of our decoders and other published results for the Sycamore experiment data. We also show the performance of an LSTM model pretrained on XEB DEM data. Error bars are standard bootstrap errors. c, LERs of our decoder pretrained on different noise models, and after finetuning on experimental data. Error bars are standard bootstrap errors.
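As an illustration of the slope-based fit mentioned in the caption, the following sketch (synthetic, noiseless numbers; not the paper's fitting code) extracts a per-round LER from the decay of 1 − 2 × logical error:

```python
import numpy as np

def fit_ler(rounds: np.ndarray, fidelity: np.ndarray) -> float:
    """Fit a logical error per round from the decay of 1 - 2 * P_error.

    Model: fidelity(n) ~= A * (1 - 2 * eps)**n, so ln(fidelity) is linear in
    the round number n and the slope gives eps, the logical error per round.
    """
    slope, _ = np.polyfit(rounds, np.log(fidelity), deg=1)
    return 0.5 * (1.0 - np.exp(slope))

rounds = np.array([1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25])
true_eps = 0.029                                  # e.g. ~2.9% per round
fidelity = 0.98 * (1 - 2 * true_eps) ** rounds    # synthetic, noiseless data
print(f"fitted LER = {fit_ler(rounds, fidelity):.4f}")   # ~0.029
```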
soft information (Fig. 4a). At distance 3, the augmented MWPM-Corr LER is larger than the AlphaQubit LER by a factor of 1.25; by a factor of 1.4 at distance 9; and by a factor of 1.25 at distance 11.

We note that although the LSTM scales up to code distance 7, consistent with regimes tested in the literature32,36, it does not scale to distance 11 despite the significantly larger number of model parameters (200 million) compared with our model (5.4 million over all code distances; 'Parameters' in Methods).

For comparison, we also test our model on hard inputs (that is, where the analogue readouts were binarized before decoding). Although, as expected, both decoders perform worse, AlphaQubit maintains roughly the same improvement in error suppression compared with MWPM-Corr at distance 11 (LER ≈ 1.2 × 10^−5 for MWPM-Corr versus LER ≈ 9 × 10^−6 for AlphaQubit; Fig. 4b). When previous studies mention MWPM, they generally refer to its uncorrelated version36,39,49, which is weaker than MWPM-Corr, with an LER ≈ 4.1 × 10^−5 at distance 11 (Fig. 4b).
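For readers who want a runnable uncorrelated-MWPM point of comparison, a sketch using the open-source Stim and PyMatching packages is shown below (it assumes recent versions providing Circuit.generated, from_detector_error_model and decode_batch; the noise parameters are illustrative and this is not the MWPM-Corr or MWPM-BP pipeline used in the paper):

```python
import numpy as np
import stim
import pymatching

# Simulate a distance-5, 25-round surface-code memory experiment.
circuit = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=5,
    rounds=25,
    after_clifford_depolarization=0.003,
    before_measure_flip_probability=0.003,
    after_reset_flip_probability=0.003,
)
sampler = circuit.compile_detector_sampler()
detectors, observables = sampler.sample(100_000, separate_observables=True)

# Uncorrelated MWPM via PyMatching, configured from the circuit's detector error model.
dem = circuit.detector_error_model(decompose_errors=True)
matcher = pymatching.Matching.from_detector_error_model(dem)
predictions = matcher.decode_batch(detectors)

errors = np.any(predictions != observables, axis=1)
print(f"logical error probability over 25 rounds: {errors.mean():.4f}")
```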
[Fig. 4 panels: LER (10^−3 to 10^−5) versus code distance for AlphaQubit (soft + leakage) (a) and AlphaQubit (hard + leakage) (b), with curves labelled 5 × 5 to 11 × 11, and LER versus finetuning dataset size (c).]
Fig. 4 | Larger code distances and finetuning accuracy trade-off. a,b, LER of different decoders for Pauli+ noise at different code distances. For each code distance, our decoder (red) is finetuned on 100 million samples from this noise model after pretraining on a device-agnostic circuit depolarizing noise model (SI1000). MWPM-Corr (black) and PyMatching (grey) are calibrated with a DEM tuned specifically to the Pauli+ noise model with soft information. The error bars are bootstrap standard errors. a, Soft decoder inputs. Inset: detection event density of the Pauli+ simulation compared with the Sycamore experimental samples (error bars are standard error of the mean). b, Hard decoder inputs. c, LER of AlphaQubit (soft inputs) pretrained on SI1000 noise and finetuned with different numbers of unique Pauli+ samples at code distances 3–11.
To assess the effect of further limiting experimental data, at each code distance, we finetuned the same SI1000-pretrained base model using only 10^5 to 10^8 samples (Fig. 4c). As baselines, we show the corresponding MWPM-Corr performance from Fig. 4a, as well as the performance of the pretrained model before any finetuning. Despite the data-agnostic SI1000 prior, for code distances up to 11, the pretrained model is already on par with MWPM-Corr and further improves with more finetuning examples.

Generalization to a streaming decoder
When training a decoder, the available pretraining and finetuning data will cover only a limited range of numbers of error-correction rounds. However, a practical decoder will need to perform equally well for longer experiments ('Discussion and conclusion', and 'Generalization to logical computations' in Methods). We demonstrate that AlphaQubit from the previous section, with its recurrent structure, can sustain its accuracy far beyond the 25 error-correction rounds that the decoder was trained on. We find its performance generalizes to experiments of at least 100,000 rounds (Fig. 5 and 'Time scalability' in Methods).

Utility beyond error correction
As we trained the neural network by minimizing cross-entropy, its output can be interpreted as the probability of a logical error, a probability we found to be well calibrated (Fig. 6a and Extended Data Fig. 6a). For example, of samples with prediction probability 0.8, approximately 80% contain a logical error. Samples with a probability close to 0.5 are more likely to have been misclassified than samples with a probability closer to 0 or 1 (Extended Data Fig. 6b).
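A minimal sketch of how such calibrated probabilities can be checked and then used for the confidence-based post-selection discussed below (synthetic predictions; bin edges and the rejection fraction are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic decoder outputs: predicted probability of a logical error, and
# labels drawn so that the predictions are (by construction) well calibrated.
p_pred = rng.uniform(0.0, 1.0, size=1_000_000)
label = rng.uniform(size=p_pred.size) < p_pred

# Calibration check: within each probability bin, the empirical error
# frequency should match the bin's mean predicted probability.
bins = np.linspace(0.0, 1.0, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (p_pred >= lo) & (p_pred < hi)
    print(f"[{lo:.1f},{hi:.1f}) predicted {p_pred[mask].mean():.3f} "
          f"observed {label[mask].mean():.3f}")

# Post-selection: discard the least confident samples (probability nearest 0.5)
# and measure the error rate of the remainder.
confidence = np.abs(p_pred - 0.5)
keep = confidence >= np.quantile(confidence, 0.10)   # reject the 10% least confident
errors = (p_pred > 0.5) != label                     # misclassified samples
print("error rate, all samples:", errors.mean())
print("error rate, kept samples:", errors[keep].mean())
```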
[Fig. 5 panels: a, 1 − 2 × logical error versus error-correction round (10^2 to 10^5); b, the corresponding LER versus error-correction round, for AlphaQubit, MWPM-Corr and PyMatching.]
Fig. 5 | Generalization to larger number of error-correction rounds at code distance 11. a,b, The 1 − 2 × logical error after up to 100,000 error-correction rounds (a) and the corresponding LER (b) for PyMatching (grey), MWPM-Corr (black) and AlphaQubit (red) pretrained on SI1000 samples up to 25 rounds and finetuned on 10^8 distance-11 Pauli+ simulated experiments of 25 rounds. Both finetuning and test samples are Pauli+. We plot LER values only where the corresponding 1 − 2 × logical error value is above 0.1. The error bars are bootstrap standard errors.
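For a single experiment duration, the per-round LER in b can be recovered from the cumulative logical error by inverting the decay model, as in this sketch (it assumes independent, identically distributed rounds):

```python
def ler_from_cumulative(p_logical: float, rounds: int) -> float:
    """Per-round logical error rate from the cumulative error probability.

    Assumes 1 - 2*p_logical(n) = (1 - 2*eps)**n, i.e. each round flips the
    logical observable independently with probability eps.
    """
    return 0.5 * (1.0 - (1.0 - 2.0 * p_logical) ** (1.0 / rounds))

# Example: 40% cumulative logical error after 100,000 rounds.
print(ler_from_cumulative(0.40, 100_000))   # ~8.0e-06
```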
The probabilistic output can be used as soft information in hierarchical decoding schemes50,51, or as a confidence measure to discard the least confident samples (Fig. 6b). On Pauli+ simulated data, and by rejecting only 0.2% of the 25-round experiments at distance 11, we can reduce the error rate by a factor of about 20 (1% rejection gives a factor of about 107, 10% a factor of about 790), which can prove useful in protocols such as magic-state distillation, a major anticipated resource cost in fault-tolerant quantum computation52,53.

[Fig. 6 panels: a, calibration plot (predicted probability versus observed error fraction); b, LER (10^−4 to 10^−9) versus removed sample percentage, with curves labelled for distances 9 and 11.]

Fig. 6 | Using the network's output as a confidence measure for post-selection. Calibration and post-selection data are evaluated on 10^9 Pauli+ simulated experiments. a, Example calibration plot at distance 5 (green continuous line) and distance 11 (purple continuous line), with small but present error bars for s.e.m. The black dashed line represents a perfectly calibrated classifier. b, LER versus the fraction of low-confidence experiments discarded. Error bars are s.e.m. from values in each bin (visible for a LER ≲ 10^−8).

Discussion and conclusion
We present AlphaQubit, a neural-network decoder for the surface code that establishes a state of the art in error suppression. On experimental data, it outperforms the previous best-in-class tensor-network decoder. Its accuracy persists at scale, continuing to outperform soft-input-augmented correlated matching at distances up to 11. AlphaQubit thus sets a benchmark for the field of machine-learning decoding, and opens up the prospect of using highly accurate machine-learning decoders in real quantum hardware.

Several challenges remain. Ultimately, to enable logical error rates below 10^−12, we will need to operate at larger code distances. At distance 11, training appears more challenging (Fig. 4) and requires increasing amounts of data (Extended Data Fig. 7c). Although, in our experience, data efficiency can be markedly increased with training and architecture improvements, demonstrating high accuracy at distances beyond 11 remains an important step to be addressed in future work ('Further considerations of scaling experiments' in Methods).

Furthermore, decoders need to achieve a throughput of 1 μs per round for superconducting qubits4,31 and 1 ms for trapped-ion devices20. Improving throughput remains an important goal for both machine-learning and matching-based decoders38,54–57. Although the AlphaQubit throughput is slower than the 1-μs target ('Decoding speed' in Methods), a host of established techniques ('Decoding speed' in Methods) can be applied to speed it up, including knowledge distillation, lower-precision inference and weight pruning, as well as implementation in custom hardware.

To realize a fault-tolerant quantum computation, a decoder needs to handle logical computation. Graph-based decoders can achieve this through a windowing approach58. We envisage co-training network components, one for each gate needed for a logical circuit. To reduce complexity and training cost, these components might share parameters and be modulated by side inputs (such as gate type; 'Generalization to logical computations' in Methods). Such generalization abilities are intimated by our decoder's generalization across rounds that far exceed its training regime and by the ability to train a single decoder to decode all of the code distances 3–11 of the scaling experiment (Fig. 4). Although developed here for the surface code, we anticipate that it can be adapted to colour codes or other quantum low-density parity check codes.

As a machine-learning model, our decoder's greatest strengths come from its ability to learn from real experimental data. This enables it to utilize rich inputs representing I/Q noise and leakage, without manual design of particular algorithms for each feature. This ability to use available experimental information showcases a strength of machine learning for solving scientific problems more generally.

Although we anticipate that other decoding techniques will continue to improve, this work supports our belief that machine-learning decoders may achieve the necessary error suppression and speed to enable practical quantum computing.

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-024-08148-8.

1. Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A 52, R2493–R2496 (1995).
2. Gottesman, D. E. Stabilizer Codes and Quantum Error Correction. PhD thesis, California Institute of Technology (1997).
3. Fowler, A. G., Mariantoni, M., Martinis, J. M. & Cleland, A. N. Surface codes: towards practical large-scale quantum computation. Phys. Rev. A 86, 032324 (2012).
4. Google Quantum AI. Suppressing quantum errors by scaling a surface code logical qubit. Nature 614, 676–681 (2023).
5. Feynman, R. P. Simulating physics with computers. Int. J. Theor. Phys. 21, 467–488 (1982).
6. Shor, P. W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Rev. 41, 303–332 (1999).
7. Grover, L. K. A fast quantum mechanical algorithm for database search. In Proc. Annual ACM Symposium on Theory of Computing 212–219 (ACM, 1996).
8. Lloyd, S. Universal quantum simulators. Science 273, 1073–1078 (1996).
9. Huang, H.-Y. et al. Quantum advantage in learning from experiments. Science 376, 1182–1186 (2022).
10. Kadowaki, T. & Nishimori, H. Quantum annealing in the transverse Ising model. Phys. Rev. E 58, 5355 (1998).
11. Gidney, C. & Ekerå, M. How to factor 2048 bit RSA integers in 8 hours using 20 million noisy qubits. Quantum 5, 433 (2021).
12. Bravyi, S. B. & Kitaev, A. Y. Quantum codes on a lattice with boundary. Preprint at https://arxiv.org/abs/quant-ph/9811052 (1998).
13. Kitaev, A. Y. Fault-tolerant quantum computation by anyons. Ann. Phys. 303, 2–30 (2003).
14. Google Quantum AI. Exponential suppression of bit or phase errors with cyclic error correction. Nature 595, 383–387 (2021).
15. Fowler, A. G., Whiteside, A. C. & Hollenberg, L. C. Towards practical classical processing for the surface code. Phys. Rev. Lett. 108, 180501 (2012).
16. Dennis, E., Kitaev, A., Landahl, A. & Preskill, J. Topological quantum memory. J. Math. Phys. 43, 4452–4505 (2002).
17. Sivak, V. V. et al. Real-time quantum error correction beyond break-even. Nature 616, 50–55 (2023).
18. Krinner, S. et al. Realizing repeated quantum error correction in a distance-three surface code. Nature 605, 669–674 (2022).
19. Egan, L. et al. Fault-tolerant control of an error-corrected qubit. Nature 598, 281–286 (2021).
20. Ryan-Anderson, C. et al. Realization of real-time fault-tolerant quantum error correction. Phys. Rev. X 11, 041058 (2021).
21. Zhao, Y. et al. Realization of an error-correcting surface code with superconducting qubits. Phys. Rev. Lett. 129, 030501 (2022).
22. Ghosh, J., Fowler, A. G., Martinis, J. M. & Geller, M. R. Understanding the effects of leakage in superconducting quantum-error-detection circuits. Phys. Rev. A 88, 062329 (2013).
23. Tripathi, V. et al. Suppression of crosstalk in superconducting qubits using dynamical decoupling. Phys. Rev. Appl. 18, 024068 (2022).
24. Higgott, O. PyMatching: a Python package for decoding quantum codes with minimum-weight perfect matching. Preprint at https://arxiv.org/abs/2105.13082 (2021).
25. Fowler, A. G. Optimal complexity correction of correlated errors in the surface code. Preprint at https://arxiv.org/abs/1310.0863 (2013).
26. Higgott, O., Bohdanowicz, T. C., Kubica, A., Flammia, S. T. & Campbell, E. T. Improved decoding of circuit noise and fragile boundaries of tailored surface codes. Phys. Rev. X 13, 031007 (2023).
27. Shutty, N., Newman, M. & Villalonga, B. Efficient near-optimal decoding of the surface code through ensembling. Preprint at https://arxiv.org/abs/2401.12434 (2024).
28. McEwen, M. et al. Removing leakage-induced correlated errors in superconducting quantum error correction. Nat. Commun. 12, 1761 (2021).
Extended Data Fig. 1 | Stabilizer readout circuit for a 3 × 3 XZZX rotated surface code in the Z basis. (a) Common X and Z stabilizer readout for the XZZX code61. Here, the first four lines (a–d) are the data qubits surrounding a stabilizer qubit (last line), which has been reset to ∣0⟩. (b) Relative strength of noise operations in an SI1000 circuit depolarizing noise model, parameterized by p. (c) Corresponding circuit depolarizing noise gate and error schema. Black dots indicate data qubits, gray dots indicate X/Z stabilizer qubits, as detailed in Fig. 1a. D (yellow blocks) labels single- or two-qubit depolarizing noise, and X labels a bit flip channel. M, R, and MR are measurement, reset, and combined measurement and reset in the Z basis. H is a Hadamard gate, and CZ gates are indicated by their circuit symbol.
Extended Data Fig. 2 | Noise and event densities for datasets used. (a) Simplified I/Q noise with signal-to-noise ratio SNR = 10 and normalised measurement time t = 0.01. Top plot: point spread functions when projected from their in-phase, quadrature, and time components onto a one-dimensional z axis. Shown are the sampled point spread functions for ∣0⟩ (blue), ∣1⟩ (green), and a leaked higher-excited state ∣2⟩ (violet). Bottom plot: posterior sampling probability for the three measurement states, for prior weights w2 = 0.5%, w0 = w1 = 49.75%. (b) Event densities for different datasets and the corresponding SI1000 p-value. We indicate the event density of the different datasets on the top x-axis, with a non-linear scale. The detector error models are fitted to the Sycamore surface code experiment4. All datasets use the XZZX circuit variant of the surface code (with CZ gates for the stabilizer readout, 'The rotated surface code' in Methods, Extended Data Fig. 1). As we are never compiling between gatesets, there is no implied noise overhead; these are the final event densities observed when sampling from the respective datasets. For datasets with soft I/Q noise, the plots above show the average soft event density as explained in 'Measurement noise' in Methods.
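A minimal sketch of how a posterior over the measurement states ∣0⟩, ∣1⟩ and ∣2⟩ can be formed from a one-dimensional projected readout; the Gaussian point-spread functions, their centres and widths are assumptions made here for illustration and are not taken from the paper:

```python
import numpy as np

# Illustrative 1D point-spread functions for |0>, |1> and a leaked |2> state.
means = np.array([0.0, 1.0, 2.0])            # projected z-axis centres (arbitrary units)
sigma = 0.1                                   # width; a higher SNR means narrower peaks
priors = np.array([0.4975, 0.4975, 0.005])    # w0 = w1 = 49.75%, w2 = 0.5%

def posterior(z: float) -> np.ndarray:
    """Posterior probability of each measurement state given a soft readout z."""
    likelihood = np.exp(-0.5 * ((z - means) / sigma) ** 2)   # unnormalized Gaussians
    weighted = priors * likelihood
    return weighted / weighted.sum()

for z in (0.1, 0.5, 1.2, 1.9):
    p0, p1, p2 = posterior(z)
    print(f"z = {z:.1f}: P(|0>) = {p0:.3f}, P(|1>) = {p1:.3f}, P(|2>) = {p2:.3f}")
```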
Extended Data Fig. 3 | Individual fits of logical error per round for 3 × 3 and 5 × 5 memory experiments. For the pij-DEM pretrained model.
Extended Data Fig. 4 | The neural network architecture designed for surface code decoding. (a) 5 × 5 rotated surface code layout, with data qubits (dark grey dots) and X and Z stabilizer qubits (labelled light grey dots, or highlighted in blue/red when they detect a parity violation) interspersed in a checkerboard pattern. Logical observables ZL and XL are shown as bold lines on the left and bottom grid edges, respectively. (b) The recurrent network iterates over time, updating a representation of the decoder state and incorporating the new stabilizers at each round. Three parallel lines indicate a representation per stabilizer. (c) Creation of an embedding vector S^n_i for each new stabilizer i = 1, …, d^2 − 1 for a distance-d surface code. (d) Each block of the recurrent network combines the decoder state and the stabilizers S^n = (S^n_1, …, S^n_M) for one round (scaled down by a factor of 0.7). The decoder state is updated through three Syndrome Transformer layers. (e) Each Syndrome Transformer layer updates the stabilizer representations through multi-headed attention, optionally modulated by a learned attention bias, followed by a dense block and dilated 2D convolutions. (f) Logical errors are predicted from the final decoder state. The triple lines marked with * indicate a representation per data qubit.
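A toy rendition of one such layer, with a single attention head, a learned additive attention bias and a dense block (the dilated 2D convolutions over the stabilizer grid are omitted; every size and initialization is an assumption for illustration, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def syndrome_transformer_layer(reps, w_q, w_k, w_v, attn_bias, w_dense):
    """One toy layer: biased self-attention over stabilizers, then a dense block.

    reps: (n_stabilizers, dim) per-stabilizer representations.
    attn_bias: (n_stabilizers, n_stabilizers) learned additive attention bias.
    """
    q, k, v = reps @ w_q, reps @ w_k, reps @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1]) + attn_bias      # biased attention logits
    attended = softmax(logits) @ v
    hidden = np.maximum(0.0, (reps + attended) @ w_dense)    # residual + ReLU dense block
    return reps + hidden                                     # residual output

rng = np.random.default_rng(0)
n, dim = 24, 16                                  # 24 stabilizers for distance 5
reps = rng.normal(size=(n, dim))
w_q, w_k, w_v, w_dense = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(4))
attn_bias = rng.normal(size=(n, n)) * 0.1
print(syndrome_transformer_layer(reps, w_q, w_k, w_v, attn_bias, w_dense).shape)
```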
Extended Data Fig. 5 | Further architecture details. (a) Attention bias visualisation. Attention bias logits of the four heads of the first Syndrome Transformer layer of our decoder model pretrained on 5 × 5 DEM in the Z basis. We obtain the logits by combining the learned attention bias embedding with all-zero stabilizer values. The 24 × 24 attention logit matrices are each visualized as one grid per stabilizer, laid out according to the physical layout of the attending stabilizer qubits. Each grid shows the logits for the attention to each stabilizer, with self-attention highlighted with a black square. (b) Architecture of the network when predicting labels at every round. S^n are the stabilizer representations as explained in Extended Data Fig. 4b, where the primed quantities S′^n indicate the embeddings are computed using different embedding parameters and based only on the stabilizers in the experiment basis computed from the final data qubit measurements ('Input representation' in Methods).
Extended Data Fig. 6 | Calibration of AlphaQubit's outputs and generalization to longer experiments for code distances 3–11. (a) Calibration for Pauli+ generated data with SNR = 10, t = 0.01, and 0.1% stabilizer qubit leakage chance. (b) Calibration histogram of predicted probabilities. The predictions are grouped into correct (blue) and incorrect (red) before building the histogram, and then binned into "certainty" bins depending on their distance from a 50:50 prediction, i.e. by ∣1/2 − p∣ for a predicted probability p. For all code distances, wrong predictions have a lower certainty. Correct predictions concentrate around the interval edges, i.e. at probabilities 0 and 1, resulting in a high certainty. (c) Generalization to longer experiments. 1 − 2 × logical error and logical error rates for networks pretrained and finetuned on datasets of up to 25 error detection rounds but applied to decoding experiments of longer durations. We only plot logical error rates where the corresponding 1 − 2 × logical error is greater than 0.1. The data are generated from the same simulated experiments, stopped at different numbers of rounds.
Extended Data Fig. 7 | Further result details on Sycamore and scaling experiments, and decoder speed. (a) Decoding time per error correction round vs. code distance. The hatched region for d > 11 indicates that while AlphaQubit is the same for all code-distances, it has not been trained or shown to work beyond d = 11. The line is a least squares fit to a × d^exponent, and the shaded region marks a 95% CI. Timing of uncorrelated matching (PyMatching). Times for PyMatching use a current CPU (Intel Xeon Platinum 8173M) and for AlphaQubit use current TPU hardware with batch size 1. (b) LER of a decoder trained jointly on distances 3–11, as compared to decoders trained solely on the individual code distances (both pretrained only, see 'Decoding at higher distances' and 'Pauli+' in Methods). Uncertainty is bootstrap standard error (9 resamples). The model and training hyperparameters are identical in both cases, but for the joint code distance-trained decoder we swap the index embedding for a relative positional embedding, where each stabilizer position (normalized to [−1, 1] × [−1, 1]) is embedded ('Input representation' in Methods). (c) Number of pretraining samples until the individual models (pre ensembling) achieve the given LER (relative to the lowest-achieved LER, the latter shown as a brown line). The dashed line indicates when the training was stopped (see 'Termination' in Methods). Error bars are 95% CI (N = 15). (d) Performance improvement by ensembling multiple models, where we show the pij-pretrained model for the Sycamore experiments (XEB and SI1000 variants show about the same improvement). (e) Average error suppression factors Λ for the Pauli+ experiment. The error suppression factor is computed from the data in Fig. 4, via the geometric average Λ_{3/11} = (ϵ_3/ϵ_11)^{1/4}, for a logical error per round ϵ_3 at code distance 3, and ϵ_11 at distance 11, respectively.
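The geometric average in (e) amounts to the following (placeholder numbers, not results from the paper):

```python
def error_suppression_factor(eps_3: float, eps_11: float) -> float:
    """Average factor by which the LER drops per code-distance step d -> d + 2.

    There are four such steps between distance 3 and distance 11, hence the
    1/4 power in the geometric average Lambda_{3/11} = (eps_3 / eps_11)**(1/4).
    """
    return (eps_3 / eps_11) ** 0.25

# Placeholder numbers, not results from the paper.
print(error_suppression_factor(eps_3=3e-3, eps_11=1e-5))   # ~4.16
```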
Extended Data Fig. 8 | Network and training hyperparameters. (a) Noise curriculum parameters used in pretraining for the Sycamore experiment. (b) Hyperparameters of the network architecture. (c) The dilations of the three 3 × 3 convolutions in each syndrome transformer layer and the experiment learning rates are determined by the code-distance of the experiment. (d) Hyperparameters for finetuning of the scaling experiment. The learning rate for finetuning was the initial learning rate for the code-distance from Extended Data Fig. 8c, scaled by a factor dependent on the training set size. The finetuning cosine learning rate schedule length was code distance × 2/3 × 10^8 samples.
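A sketch of a cosine learning-rate schedule with the stated length; the schedule shape and the helper below are our illustration, and only the length rule (code distance × 2/3 × 10^8 samples) comes from the caption:

```python
import math

def finetune_lr(sample_idx: int, code_distance: int, initial_lr: float) -> float:
    """Cosine learning-rate decay over code_distance * 2/3 * 1e8 samples."""
    schedule_len = code_distance * 2 / 3 * 1e8
    progress = min(sample_idx / schedule_len, 1.0)
    return 0.5 * initial_lr * (1.0 + math.cos(math.pi * progress))

print(finetune_lr(sample_idx=0, code_distance=11, initial_lr=1e-4))          # initial LR
print(finetune_lr(sample_idx=int(3.5e8), code_distance=11, initial_lr=1e-4))  # partway through
```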
Extended Data Fig. 9 | Ablations: the effect of removing or simplifying decoder architecture elements. Decoder performance under ablations (white bars) compared to the baseline, ordered by mean LER. The blue bars indicate the model used for the experiments. (a) For 5 × 5 pij DEM training, averaged across bases (X and Z) and the two cross-validation folds. The red horizontal line represents the performance of the finetuned ensemble. Error bars represent bootstrap standard errors from individual fidelities (499 resamples). (b) For 11 × 11 Pauli+ training. Error bars represent estimated mean error (N = 5). (c) Effect of ablations on performance and data efficiency. Decoding performance of the best performing subset of ablations for 5 × 5 DEM and 11 × 11 Pauli+. Colors indicate the number of training samples required for reaching performance parity with PyMatching for 11 × 11 Pauli+.
Extended Data Table 1 | ML decoders, further references and details