
XX^t Can Be Faster

Dmitry Rybin*1,2, Yushun Zhang†1,2, and Zhi-Quan Luo‡1,2

1 The Chinese University of Hong Kong, Shenzhen, China
2 Shenzhen Research Institute of Big Data

*Corresponding author. Email: [email protected].
†Email: [email protected].
‡Email: [email protected].

May 16, 2025

arXiv:2505.09814v1 [cs.DS] 14 May 2025

Abstract
We present RXTX, a new algorithm for computing the product of a matrix by its transpose, XX^t. RXTX uses 5% fewer multiplications and additions than the previous state of the art and provides accelerations even for small sizes of the matrix X. The algorithm was discovered by combining Machine Learning-based search methods with Combinatorial Optimization.

1 Introduction

Algorithm     | Previous State-of-the-Art for XX^t                              | RXTX (AI-discovered)
Matrix form   | 2 × 2 blocks: [A B; C D] · [A^t C^t; B^t D^t] =                 | 4 × 4 block matrix (see Algorithm 1)
              | [AA^t + BB^t, AC^t + BD^t; *, CC^t + DD^t]                      |
Recursion     | S(n) = 4 S(n/2) + 2 M(n/2)                                      | R(n) = 8 R(n/4) + 26 M(n/4)
Asymptotic    | S(n) ~ (2/3) M(n)                                               | R(n) ~ (26/41) M(n)
4 × 4 rank    | 38                                                              | 34
Table 1: The new algorithm (RXTX) is based on recursive 4 × 4 block matrix multiplication. It uses 8 recursive calls and 26 general products; in comparison, the previous SotA uses 16 recursive calls and 24 general products. R(n), S(n), and M(n) are the numbers of multiplications performed by RXTX, the previous SotA, and the Strassen algorithm, respectively, for an n × n matrix X. The RXTX asymptotic constant 26/41 ≈ 0.6341 is 5% smaller than 2/3 ≈ 0.6666, the asymptotic constant of the previous SotA.
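
The entries of Table 1 can be reproduced directly from the recurrences. The following short Python sketch (ours, not part of the paper) evaluates R(n), S(n), and M(n) for powers of four; it reproduces the 4 × 4 ranks 34 and 38 and shows the ratio R(n)/S(n) approaching (26/41)/(2/3) ≈ 0.951:

# Multiplication counts from the recurrences in Table 1 (n a power of 4).
def M(n):                      # Strassen-Winograd, general n x n product
    return 1 if n == 1 else 7 * M(n // 2)

def S(n):                      # previous SotA for XX^t (recursive Strassen)
    return 1 if n == 1 else 4 * S(n // 2) + 2 * M(n // 2)

def R(n):                      # RXTX
    return 1 if n == 1 else 8 * R(n // 4) + 26 * M(n // 4)

for k in range(1, 8):
    n = 4 ** k
    print(n, R(n), S(n), R(n) / S(n))   # R(4) = 34, S(4) = 38; the ratio tends to ~0.951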

Finding faster matrix multiplication algorithms is a central challenge in computer science and numerical linear algebra. Since the groundbreaking results of Strassen [1969] and Winograd [1968], which demonstrated that the number of multiplications required for a general matrix product AB can be significantly reduced, extensive research has explored this problem. Techniques in the area range from gradient descent approaches Smirnov [2013] and heuristics Éric Drevet et al. [2011], to group-theoretic methods Ye and Lim [2018], graph-based random walks Kauers and Moosbauer [2022], and deep reinforcement learning Fawzi et al. [2022].
Algorithm 1 RXTX - AI-discovered asymptotic SotA for XX^t
1: Input: 4 × 4 block-matrix X
2: Output: C = XX^t using 8 recursive calls and 26 general products.
3: m1 = (−X2 + X3 − X4 + X8) · (X8 + X11)^t
4: m2 = (X1 − X5 − X6 + X7) · (X15 + X5)^t
5: m3 = (−X2 + X12) · (−X10 + X16 + X12)^t
6: m4 = (X9 − X6) · (X13 + X9 − X14)^t
7: m5 = (X2 + X11) · (−X6 + X15 − X7)^t
8: m6 = (X6 + X11) · (X6 + X7 − X11)^t
9: m7 = X11 · (X6 + X7)^t
10: m8 = X2 · (−X14 − X10 + X6 − X15 + X7 + X16 + X12)^t
11: m9 = X6 · (X13 + X9 − X14 − X10 + X6 + X7 − X11)^t
12: m10 = (X2 − X3 + X7 + X11 + X4 − X8) · X11^t
13: m11 = (X5 + X6 − X7) · X5^t
14: m12 = (X2 − X3 + X4) · X8^t
15: m13 = (−X1 + X5 + X6 + X3 − X7 + X11) · X15^t
16: m14 = (−X1 + X5 + X6) · (X13 + X9 + X15)^t
17: m15 = (X2 + X4 − X8) · (X11 + X16 + X12)^t
18: m16 = (X1 − X8) · (X9 − X16)^t
19: m17 = X12 · (X10 − X12)^t
20: m18 = X9 · (X13 − X14)^t
21: m19 = (−X2 + X3) · (−X15 + X7 + X8)^t
22: m20 = (X5 + X9 − X8) · X9^t
23: m21 = X8 · (X9 − X8 + X12)^t
24: m22 = (−X6 + X7) · (X5 + X7 − X11)^t
25: m23 = X1 · (X13 − X5 + X16)^t
26: m24 = (−X1 + X4 + X12) · X16^t
27: m25 = (X9 + X2 + X10) · X14^t
28: m26 = (X6 + X10 + X12) · X10^t
29: s1 = X1 · X1^t
30: s2 = X2 · X2^t
31: s3 = X3 · X3^t
32: s4 = X4 · X4^t
33: s5 = X13 · X13^t
34: s6 = X14 · X14^t
35: s7 = X15 · X15^t
36: s8 = X16 · X16^t
37: C11 = s1 + s2 + s3 + s4
38: C12 = m2 − m5 − m7 + m11 + m12 + m13 + m19
39: C13 = m1 + m3 + m12 + m15 + m16 + m17 + m21 − m24
40: C14 = m2 − m3 − m5 − m7 − m8 + m11 + m13 − m17 + m23 + m24
41: C22 = m1 + m6 − m7 + m10 + m11 + m12 + m22
42: C23 = m1 − m4 + m6 − m7 − m9 + m10 + m12 + m18 + m20 + m21
43: C24 = m2 + m4 + m11 + m14 + m16 − m18 − m20 + m23
44: C33 = m4 − m6 + m7 + m9 − m17 − m18 + m26
45: C34 = m3 + m5 + m7 + m8 + m17 + m18 + m25
46: C44 = s5 + s6 + s7 + s8
47: return C
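
To make the block-indexing convention concrete, the following numpy check (our illustration, not code from the paper) assumes the row-major labelling X1, ..., X16 of the 4 × 4 block grid, which is the labelling consistent with steps 29-37 and 46 of Algorithm 1: the (1,1) and (4,4) blocks of XX^t are exactly the sums of the recursive symmetric products s1, ..., s8.

import numpy as np

def blocks_4x4(X):
    # Split X into a 4 x 4 grid of equal blocks, returned row-major as X1..X16.
    n = X.shape[0] // 4
    return [X[i*n:(i+1)*n, j*n:(j+1)*n] for i in range(4) for j in range(4)]

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 8))
B = blocks_4x4(X)                       # B[0] is X1, ..., B[15] is X16

# s1..s8 are the recursive symmetric products of Algorithm 1 (steps 29-36).
s = [Bi @ Bi.T for Bi in B[0:4]] + [Bi @ Bi.T for Bi in B[12:16]]
C11 = s[0] + s[1] + s[2] + s[3]         # step 37
C44 = s[4] + s[5] + s[6] + s[7]         # step 46

C = X @ X.T
print(np.allclose(C11, C[:2, :2]), np.allclose(C44, C[6:, 6:]))   # True True

The remaining blocks C12, ..., C34 are assembled analogously from the 26 general products m1, ..., m26.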
Despite this progress, much less attention has been paid to matrix products with additional structure, such as B = A or B = A^t, or products involving sparsity or symmetry Dumas et al. [2020, 2023], Arrigoni et al. [2021]. This is surprising given that expressions like AA^t are widely used in fields such as statistics, data analysis, deep learning, and wireless communications. For example, AA^t often represents a covariance or Gram matrix, while in linear regression, the solution for the data pair (X, y) involves the data covariance matrix X^t X:

β = (X^t X)^(-1) X^t y.

From a theoretical standpoint, computing XX^t has the same asymptotic complexity as general matrix multiplication. As a result, only constant-factor speedups are possible. The RXTX algorithm, presented in Algorithm 1, achieves such a speedup by exploiting structure specific to XX^t.
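
As a minimal illustration of the normal-equation formula above (ours, using standard numpy routines and arbitrary synthetic data), the sketch below forms the Gram matrix X^t X explicitly and checks that solving the normal equations reproduces numpy's least-squares solution; forming this symmetric product is exactly the kind of structured multiplication that XX^t-type algorithms accelerate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))            # tall data matrix
y = rng.standard_normal(1000)

gram = X.T @ X                                 # the symmetric product X^t X
beta_normal = np.linalg.solve(gram, X.T @ y)   # beta = (X^t X)^(-1) X^t y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))    # True: both give the least-squares solution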

1.1 Previous works


Prior work by Ye and Lim [2016, 2018] used representation theory and the Cohn–Umans framework to derive new multiplication schemes for structured matrix products. Reinforcement learning methods have also been applied to this domain. For instance, Fawzi et al. [2022] used deep RL to compute tensor ranks and discover novel multiplication algorithms. Neural networks with a proper training setup can rediscover the Strassen and Laderman algorithms for small matrices Elser [2016].
More recently, Dumas et al. [2020, 2023] proposed optimized schemes for computing XX^t over finite fields and complex numbers. To the best of our knowledge, the current state-of-the-art approach for real-valued XX^t is due to Arrigoni et al. [2021], which recursively applies Strassen's algorithm to 2 × 2 block matrices, reducing the problem to general matrix multiplication. In contrast, our approach uses the structure of XX^t in a novel way.

2 Analysis of RXTX
We define:
• R(n): the number of multiplications performed by RXTX for an n × n matrix;
• S(n): the number of multiplications performed by recursive Strassen Arrigoni et al. [2021] for an n × n matrix;
• M(n): the number of multiplications performed by the Strassen-Winograd algorithm for a general product of n × n matrices;
• R+(n): the number of additions and multiplications performed by RXTX for an n × n matrix;
• S+(n): the number of additions and multiplications performed by recursive Strassen Arrigoni et al. [2021] for an n × n matrix;
• M+(n): the number of additions and multiplications performed by the Strassen-Winograd algorithm for a general product of n × n matrices.
The superscript "opt" indicates an optimal cutoff: for sufficiently small matrices, standard matrix multiplication is used instead of further recursive calls.

2.1 Number of multiplications
Theorem 1. The number of multiplications for RXTX is
R(n) = (26/41) M(n) + (15/41) n^(3/2) = (26/41) n^(log2 7) + (15/41) n^(3/2).
The number of multiplications for recursive Strassen is
S(n) = (2/3) M(n) + (1/3) n^2 = (2/3) n^(log2 7) + (1/3) n^2.
Proof. The definition of RXTX involves 8 recursive calls and 26 general matrix multiplications. It follows that
R(n) = 8 R(n/4) + 26 M(n/4).
The general solution to this recursive equation has the form Cormen et al. [2009]
R(n) = α M(n) + β n^(3/2).
Plugging in n = 1 and n = 4 we get
1 = α + β,
34 = 49α + 8β.
Solving this system we obtain
α = 26/41 ≈ 0.6341,  β = 15/41 ≈ 0.3658.
Similarly, recursive Strassen for XX^t uses 4 recursive calls and 2 general matrix multiplications:
S(n) = 4 S(n/2) + 2 M(n/2).
The general solution has the form
S(n) = γ M(n) + δ n^2.
Plugging in n = 1 and n = 2 we get
1 = γ + δ,  6 = 7γ + 4δ.
Solving this system we obtain γ = 2/3 ≈ 0.6666 and δ = 1/3 ≈ 0.3333.
In Figure 1 we can see the ratio R(n)/S(n) for n given by powers of 4. The ratio always stays below 100% and approaches the asymptotic 95%, which indicates a 5% reduction in the number of multiplications. The same happens in Figure 2, where we use an optimal cutoff, i.e., for small enough matrix sizes we use standard matrix multiplication instead of further recursive calls.

Figure 1: Comparison of the number of multiplications of RXTX to the previous SotA and the naive algorithm (panels: ratio of R(n) to S(n), and ratio of R(n) to the naive count n^2(n + 1)/2).

Figure 2: Comparison of the number of multiplications of RXTX with optimal cutoff to the previous SotA and the naive algorithm (panels: ratio of R^opt(n) to S^opt(n), and ratio of R^opt(n) to n^2(n + 1)/2).
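
The cutoff curves of Figure 2 can be reproduced from the recurrences alone. The sketch below reflects our reading of the cutoff rule (at each size, take the cheaper of recursing or multiplying directly), counting n^3 multiplications for a standard n × n product and n^2(n + 1)/2 for the naive symmetric product:

from functools import lru_cache

def naive(n):            # multiplications of the naive symmetric product XX^t
    return n * n * (n + 1) // 2

@lru_cache(maxsize=None)
def M_opt(n):            # Strassen-Winograd with cutoff to standard multiplication
    return n ** 3 if n == 1 else min(n ** 3, 7 * M_opt(n // 2))

@lru_cache(maxsize=None)
def S_opt(n):            # previous SotA (recursive Strassen for XX^t) with cutoff
    return naive(n) if n == 1 else min(naive(n), 4 * S_opt(n // 2) + 2 * M_opt(n // 2))

@lru_cache(maxsize=None)
def R_opt(n):            # RXTX with cutoff; the recursion needs n divisible by 4
    if n % 4:
        return naive(n)
    return min(naive(n), 8 * R_opt(n // 4) + 26 * M_opt(n // 4))

for k in range(1, 8):
    n = 4 ** k
    print(n, R_opt(n) / S_opt(n), R_opt(n) / naive(n))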

2.2 Total number of operations


Theorem 2. The total number of additions and multiplications for RXTX is
R+(n) = (156/41) n^(log2 7) − (615/164) n^2 + (155/164) n^(3/2).
The total number of additions and multiplications for recursive Strassen is
S+(n) = 4 n^(log2 7) − (7/4) n^2 log2 n − 3 n^2.
Proof. The definition of RXTX involves 139 additions of (n/4) × (n/4) matrices. There exist methods in the literature Mårtensson and Wagner [2024] to reduce this number by exploiting common subexpressions that appear in Algorithm 1, e.g., X6 + X7. For example, while the Strassen algorithm uses 18 additions, its Winograd variant uses only 15. We designed a custom search that allowed us to reduce the number of additions in RXTX from 139 to 100. We provide the resulting addition scheme in Algorithm 2 and Algorithm 3. Assuming 100 additions, we get the recursion
R+(n) = 8 R+(n/4) + 26 M+(n/4) + 100 (n/4)^2.
The general solution has the form
R+(n) = (26/41) M+(n) + α n^2 + β n^(3/2).
Plugging in n = 1 and n = 4 gives
26/41 + α + β = 1,
(26/41) · 214 + 16α + 8β = 134.
We conclude that
α = −95/164 ≈ −0.5793,  β = 155/164 ≈ 0.9451.
Similarly, the definition of recursive Strassen gives
S+(n) = 4 S+(n/2) + 2 M+(n/2) + 3 (n/2)^2,
which has a solution of the form
S+(n) = (2/3) M+(n) + γ n^2 log2 n + δ n^2.
Plugging in n = 1 and n = 2 gives γ = −7/4 and δ = 1/3. It is known Cenk and Hasan [2017] that
M+(n) = 6 n^(log2 7) − 5 n^2.
It follows that
R+(n) = (26/41)(6 n^(log2 7) − 5 n^2) − (95/164) n^2 + (155/164) n^(3/2) = (156/41) n^(log2 7) − (615/164) n^2 + (155/164) n^(3/2)
and
S+(n) = (2/3)(6 n^(log2 7) − 5 n^2) − (7/4) n^2 log2 n + (1/3) n^2 = 4 n^(log2 7) − (7/4) n^2 log2 n − 3 n^2.
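
As a sanity check (ours, not from the paper), the closed forms of Theorem 2 can be compared against their defining recurrences and against the naive operation count; the same few lines also locate the crossover points reported in Figure 3:

from math import log2

def M_plus(n):           # operations of Strassen-Winograd, Cenk and Hasan [2017]
    return 6 * n ** log2(7) - 5 * n ** 2

def R_plus(n):           # Theorem 2, RXTX
    return 156 / 41 * n ** log2(7) - 615 / 164 * n ** 2 + 155 / 164 * n ** 1.5

def S_plus(n):           # Theorem 2, recursive Strassen
    return 4 * n ** log2(7) - 7 / 4 * n ** 2 * log2(n) - 3 * n ** 2

def naive_plus(n):       # naive symmetric product
    return (2 * n - 1) * n * (n + 1) / 2

def R_rec(n):            # recurrence from the proof
    return 1 if n == 1 else 8 * R_rec(n // 4) + 26 * M_plus(n // 4) + 100 * (n // 4) ** 2

def S_rec(n):            # recurrence from the proof
    return 1 if n == 1 else 4 * S_rec(n // 2) + 2 * M_plus(n // 2) + 3 * (n // 2) ** 2

for n in (4, 16, 64, 256):
    assert abs(R_plus(n) - R_rec(n)) < 1e-6 * R_rec(n)
    assert abs(S_plus(n) - S_rec(n)) < 1e-6 * S_rec(n)

for n in (64, 128, 256, 512, 1024, 2048):
    print(n, R_plus(n) < S_plus(n), R_plus(n) < naive_plus(n))
# RXTX needs fewer operations than recursive Strassen from n = 256 onward
# and fewer than the naive count from n = 1024 onward, matching Figure 3.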

Figure 3: Comparison of the number of operations of RXTX to recursive Strassen and the naive algorithm (panels: ratio of R+(n) to S+(n), and ratio of R+(n) to the naive count (2n − 1)n(n + 1)/2). RXTX outperforms recursive Strassen for n ≥ 256 and the naive algorithm for n ≥ 1024.

Figure 4: Comparison of the algorithms with optimal cutoffs, i.e., for small enough matrices the recursion switches to the algorithm with the fewest operations (panels: ratio of R+^opt(n) to S+^opt(n), and ratio of R+^opt(n) to (2n − 1)n(n + 1)/2). RXTX outperforms the naive algorithm for n ≥ 32 and the SotA for n ≥ 256.

Algorithm 2 First stage of the optimized addition scheme. The number of additions is reduced from 77 to 53.
1: Input: X1, X2, ..., X16
2: Output: Left elements L1, ..., L26 and right elements R1, ..., R26 of the multiplications m1, ..., m26
3: y1 ← X13 − X14
4: y2 ← X12 − X10
5: w1 ← X2 + X4 − X8
6: w2 ← X1 − X5 − X6
7: w3 ← X6 + X7
8: w4 ← X14 + X15
9: w5 ← y2 + X16
10: w6 ← X10 + X11
11: w7 ← X9 + y1
12: w8 ← X9 − X8
13: w9 ← X7 − X11
14: w10 ← X6 − X7
15: w11 ← X2 − X3
16: L1 ← −w1 + X3            R1 ← X8 + X11
17: L2 ← w2 + X7             R2 ← X15 + X5
18: L3 ← −X2 + X12           R3 ← w5
19: L4 ← X9 − X6             R4 ← w7
20: L5 ← X2 + X11            R5 ← X15 − w3
21: L6 ← X6 + X11            R6 ← w3 − X11
22: L7 ← X11                 R7 ← w3
23: L8 ← X2                  R8 ← w3 − w4 + w5
24: L9 ← X6                  R9 ← w7 − w6 + w3
25: L10 ← w1 − X3 + X7 + X11 R10 ← X11
26: L11 ← X5 + w10           R11 ← X5
27: L12 ← w11 + X4           R12 ← X8
28: L13 ← −w2 + X3 − w9      R13 ← X15
29: L14 ← −w2                R14 ← w7 + w4
30: L15 ← w1                 R15 ← w6 + w5
31: L16 ← X1 − X8            R16 ← X9 − X16
32: L17 ← X12                R17 ← −y2
33: L18 ← X9                 R18 ← y1
34: L19 ← −w11               R19 ← −X15 + X7 + X8
35: L20 ← X5 + w8            R20 ← X9
36: L21 ← X8                 R21 ← X12 + w8
37: L22 ← −w10               R22 ← X5 + w9
38: L23 ← X1                 R23 ← X13 − X5 + X16
39: L24 ← −X1 + X4 + X12     R24 ← X16
40: L25 ← X9 + X2 + X10      R25 ← X14
41: L26 ← X6 + X10 + X12     R26 ← X10
Algorithm 3 Second stage of the optimized addition scheme. The number of additions is reduced from 62 to 47.
1: Input: m1, m2, ..., m26 and s1, ..., s8
2: Output: Entries Cij using 47 additions.
3: z1 ← m7 − m11 − m12
4: z2 ← m1 + m12 + m21
5: z3 ← m3 + m17 − m24
6: z4 ← m2 + m11 + m23
7: z5 ← m5 + m7 + m8
8: z6 ← m4 − m18 − m20
9: z7 ← m6 − m7 − m9
10: z8 ← m17 + m18
11: C11 ← s1 + s2 + s3 + s4
12: C12 ← m2 − m5 − z1 + m13 + m19
13: C13 ← z2 + z3 + m15 + m16
14: C14 ← z4 − z3 − z5 + m13
15: C22 ← m1 + m6 − z1 + m10 + m22
16: C23 ← z2 − z6 + z7 + m10
17: C24 ← z4 + z6 + m14 + m16
18: C33 ← m4 − z7 − z8 + m26
19: C34 ← m3 + z5 + z8 + m25
20: C44 ← s5 + s6 + s7 + s8

2.3 Runtime of RXTX


We verify that RXTX gives a speedup in practice for large sizes of the matrix X. Figure 5 shows a histogram of runtimes in the following setup:
• A 6144 × 6144 dense matrix is sampled 1000 times with random normal N(0, 1) entries. Here 6144 = 3 · 2^11.
• RXTX is implemented as a depth-1 recursion, i.e., we directly use BLAS routines to compute the 26 general matrix multiplications and 8 symmetric products of matrices of size 1536 × 1536.
• The default baseline is a direct call of the BLAS-3 routine for XX^t.
• Hardware: a single-threaded run on a 10th Gen Intel Core i7-10510U CPU (1.8 GHz, 4 cores).
We did not search for the smallest matrix size at which RXTX outperforms other methods, since runtime is highly sensitive to hardware, computation graph organization, and memory management. Figure 4 suggests that RXTX can be faster than recursive Strassen for n ≥ 256.
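
For reference, a rough sketch of the baseline measurement (our illustration; the paper does not list its benchmarking code) is given below. It assumes scipy's dsyrk wrapper as the "direct BLAS-3 routine" for XX^t; syrk fills only one triangle of the output and does roughly half the arithmetic of a general gemm call:

import time
import numpy as np
from scipy.linalg.blas import dsyrk

n = 6144                                    # matrix size used in the benchmark above
rng = np.random.default_rng(0)
X = rng.standard_normal((n, n))

def bench(fn, reps=3):
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

t_gemm = bench(lambda: X @ X.T)             # general matrix product
t_syrk = bench(lambda: dsyrk(1.0, X))       # symmetric rank-k update (one triangle of XX^t)
print(f"gemm: {t_gemm:.3f}s  syrk: {t_syrk:.3f}s")

A depth-1 RXTX timing would additionally dispatch the 26 general products and 8 symmetric products of 1536 × 1536 blocks to the same BLAS routines and assemble the output blocks as in Algorithm 1.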

Figure 5: The average runtime of RXTX is 2.524 s, which is 9% faster than the 2.778 s average runtime of the dedicated BLAS routine. RXTX was faster in 99% of the runs.

3 Discovery Methodology
3.1 Description of RL-guided Large Neighborhood Search
In this section we briefly present our methodology. The full methodology, together with other discovered accelerations, will be described in Rybin et al. [2025]. We combine RL-guided Large Neighborhood Search Wu et al. [2021], Addanki et al. [2020] with a two-level MILP pipeline:
1. The RL agent proposes a (potentially redundant) set of rank-1 bilinear products;
2. MILP-A exhaustively enumerates tens of thousands of linear relations between these candidate rank-1 bilinear products and target expressions;
3. MILP-B then selects the smallest subset of products whose induced relations cover every target expression of XX^t.
The loop iterates under a Large Neighborhood Search regime. One way to view this pipeline is as a simplification of the AlphaTensor RL approach Fawzi et al. [2022]: instead of sampling tensors from R^(n^2) ⊗ R^(n^2) ⊗ R^(n^2), we sample candidate tensors from R^(n^2) ⊗ R^(n^2) and let the MILP solver find optimal linear combinations of the sampled candidates.

3.2 Example: matrix-times-transpose algorithm search for a 2 × 2 matrix
Consider the example of a 2 × 2 matrix X. We want to compute XX^t:

[ x1  x2 ]   [ x1  x3 ]   [ x1^2 + x2^2     x1 x3 + x2 x4 ]
[ x3  x4 ] · [ x2  x4 ] = [ x1 x3 + x2 x4   x3^2 + x4^2   ]

We identify 3 target expressions:
T = { x1^2 + x2^2, x3^2 + x4^2, x1 x3 + x2 x4 }.
We randomly sample thousands of products p1, ..., pm, each given by
( Σ_{i=1}^{4} αi xi ) · ( Σ_{j=1}^{4} βj xj )
with αi, βj ∈ {−1, 0, +1} chosen by the RL policy πθ. MILP-A enumerates ways to write target expressions from T as linear combinations Σ γi pi of the sampled products. MILP-B selects the minimal number of sampled products such that every target expression can be obtained as their linear combination. A key observation is that MILP-A and MILP-B are rapidly solvable with solvers like Gurobi Gurobi Optimization, LLC [2024].
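
The MILP formulations are not spelled out here, but the underlying bookkeeping can be sketched in a few lines (an illustration under our own simplifications, not the authors' pipeline): every quadratic expression is a vector over the monomial basis x_i x_j, each sampled rank-1 product expands to such a vector, and the search asks which targets are reachable as linear combinations of sampled products. The sketch below only checks span membership with least squares; MILP-A additionally enumerates integer-coefficient relations and MILP-B picks a minimal covering subset.

import numpy as np

# Monomial basis x_i * x_j, i <= j, for the 2 x 2 example (variables x1..x4).
monos = [(i, j) for i in range(4) for j in range(i, 4)]
idx = {m: k for k, m in enumerate(monos)}

def product_vector(alpha, beta):
    # Expand (sum_i alpha_i x_i) * (sum_j beta_j x_j) over the symmetric monomial basis.
    v = np.zeros(len(monos))
    for i in range(4):
        for j in range(4):
            v[idx[tuple(sorted((i, j)))]] += alpha[i] * beta[j]
    return v

# Target expressions: x1^2 + x2^2, x3^2 + x4^2, x1*x3 + x2*x4.
targets = np.zeros((3, len(monos)))
targets[0, idx[(0, 0)]] = targets[0, idx[(1, 1)]] = 1
targets[1, idx[(2, 2)]] = targets[1, idx[(3, 3)]] = 1
targets[2, idx[(0, 2)]] = targets[2, idx[(1, 3)]] = 1

# Sample random candidate products with coefficients in {-1, 0, +1}.
rng = np.random.default_rng(0)
P = np.array([product_vector(rng.integers(-1, 2, 4), rng.integers(-1, 2, 4))
              for _ in range(200)])

# Are all targets in the span of the candidates?  (Almost surely True here.)
gamma, *_ = np.linalg.lstsq(P.T, targets.T, rcond=None)
print(np.allclose(P.T @ gamma, targets.T))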

4 Acknowledgements
The work of Z.-Q. Luo was supported by the Guangdong Major Project of Basic and Applied Basic Research
(No.2023B0303000001), the Guangdong Provincial Key Laboratory of Big Data Computing, and the National
Key Research and Development Project under grant 2022YFA1003900.

References
R. Addanki, V. Nair, and M. Alizadeh. Neural large neighborhood search. In Learning Meets Combinatorial
Algorithms @ NeurIPS 2020, 2020.

V. Arrigoni, F. Maggioli, A. Massini, and E. Rodolà. Efficiently parallelizable strassen-based multiplication of
a matrix by its transpose. In Proceedings of the 50th International Conference on Parallel Processing, ICPP ’21,
New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450390682.
M. Cenk and M. A. Hasan. On the arithmetic complexity of strassen-like matrix multiplications. Journal of
Symbolic Computation, 80:484–501, 2017. ISSN 0747-7171. doi: https://doi.org/10.1016/j.jsc.2016.07.004.
URL https://www.sciencedirect.com/science/article/pii/S0747717116300359.
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, 3rd Edition. MIT Press,
2009. ISBN 978-0-262-03384-8. URL http://mitpress.mit.edu/books/introduction-algorithms.
J.-G. Dumas, C. Pernet, and A. Sedoglavic. On fast multiplication of a matrix by its transpose. In Proceedings
of the 45th International Symposium on Symbolic and Algebraic Computation, ISSAC ’20, page 162–169, New
York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371001.
J.-G. Dumas, C. Pernet, and A. Sedoglavic. Some fast algorithms multiplying a matrix by its adjoint. Journal
of Symbolic Computation, 115:285–315, 2023.
V. Elser. A network that learns strassen multiplication. Journal of Machine Learning Research, 17:1–13, 2016.

A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz,
J. Schrittwieser, G. Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement
learning. Nature, 610(7930):47–53, 2022.

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https://www.gurobi.com.

M. Kauers and J. Moosbauer. The fbhhrbnrssshk-algorithm for multiplication in Z_2^{5×5} is still not the end of
the story. ArXiv, abs/2210.04045, 2022. URL https://api.semanticscholar.org/CorpusID:252780122.

E. Mårtensson and P. S. Wagner. The number of the beast: Reducing additions in fast matrix multiplication
algorithms for dimensions up to 666. Cryptology ePrint Archive, Paper 2024/2063, 2024. URL https:
//eprint.iacr.org/2024/2063.
D. Rybin, Y. Zhang, and Z.-Q. Luo. Accelerating structured matrix computations with machine learning
based search. in progress, 2025.

A. V. Smirnov. The bilinear complexity and practical algorithms for matrix multiplication. Compu-
tational Mathematics and Mathematical Physics, 53(12):1781–1795, Dec. 2013. ISSN 1555-6662. doi:
10.1134/s0965542513120129. URL http://dx.doi.org/10.1134/S0965542513120129.
V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, Aug. 1969. ISSN
0945-3245. doi: 10.1007/bf02165411. URL http://dx.doi.org/10.1007/BF02165411.

S. Winograd. A new algorithm for inner product. IEEE Transactions on Computers, 100(7):693–694, 1968.
Y. Wu, W. Song, Z. Cao, and J. Zhang. Learning large neighborhood search policy for integer programming.
arXiv preprint arXiv:2111.03466, 2021.
K. Ye and L.-H. Lim. Algorithms for structured matrix-vector product of optimal bilinear complexity. In
2016 IEEE Information Theory Workshop (ITW), pages 310–314. IEEE, 2016.
K. Ye and L.-H. Lim. Fast structured matrix computations: tensor rank and cohn–umans method. Foundations
of Computational Mathematics, 18:45–95, 2018.
C. Éric Drevet, M. Nazrul Islam, and Éric Schost. Optimization techniques for small matrix multiplication.
Theoretical Computer Science, 412(22):2219–2236, 2011. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.
2010.12.012. URL https://www.sciencedirect.com/science/article/pii/S0304397510007036.
