XX^t Can Be Faster
Abstract
We present RXTX, a new algorithm for computing the product of a matrix with its transpose, XX^t. RXTX uses 5% fewer multiplications and additions than the previous state of the art and provides speedups even for small matrix sizes. The algorithm was discovered by combining machine learning-based search methods with combinatorial optimization.
1 Introduction
Table 1: The new algorithm (RXTX) is based on recursive 4 × 4 block matrix multiplication: it uses 8 recursive calls and 26 general products, whereas the previous SotA uses 16 recursive calls and 24 general products. R(n), S(n), M(n) denote the number of multiplications performed by RXTX, the previous SotA, and the Strassen algorithm, respectively, for an n × n matrix X. The RXTX asymptotic constant 26/41 ≈ 0.6341 is 5% smaller than 2/3 ≈ 0.6666, the asymptotic constant of the previous SotA.
Finding faster matrix multiplication algorithms is a central challenge in computer science and numerical linear algebra. Since the groundbreaking results of Strassen [1969] and Winograd [1968], which demonstrated that the number of multiplications required for a general matrix product AB can be significantly reduced, extensive research has explored this problem. Techniques in the area range from gradient descent approaches Smirnov [2013] and heuristics Éric Drevet et al. [2011], to group-theoretic methods Ye and Lim [2018], graph-based random walks Kauers and Moosbauer [2022], and deep reinforcement learning Fawzi et al. [2022].
Algorithm 1 RXTX - AI-discovered asymptotic SotA for XX^t
1: Input: 4 × 4 block-matrix X
2: Output: C = XX^t using 8 recursive calls and 26 general products.
3: m1 = (−X2 + X3 − X4 + X8) · (X8 + X11)^t
4: m2 = (X1 − X5 − X6 + X7) · (X15 + X5)^t
5: m3 = (−X2 + X12) · (−X10 + X16 + X12)^t
6: m4 = (X9 − X6) · (X13 + X9 − X14)^t
7: m5 = (X2 + X11) · (−X6 + X15 − X7)^t
8: m6 = (X6 + X11) · (X6 + X7 − X11)^t
9: m7 = X11 · (X6 + X7)^t
10: m8 = X2 · (−X14 − X10 + X6 − X15 + X7 + X16 + X12)^t
11: m9 = X6 · (X13 + X9 − X14 − X10 + X6 + X7 − X11)^t
12: m10 = (X2 − X3 + X7 + X11 + X4 − X8) · X11^t
13: m11 = (X5 + X6 − X7) · X5^t
14: m12 = (X2 − X3 + X4) · X8^t
15: m13 = (−X1 + X3 + X5 + X6 − X7 + X11) · X15^t
16: m14 = (−X1 + X5 + X6) · (X9 + X13 + X15)^t
17: m15 = (X2 + X4 − X8) · (X11 + X12 + X16)^t
18: m16 = (X1 − X8) · (X9 − X16)^t
19: m17 = X12 · (X10 − X12)^t
20: m18 = X9 · (X13 − X14)^t
21: m19 = (−X2 + X3) · (−X15 + X7 + X8)^t
22: m20 = (X5 + X9 − X8) · X9^t
23: m21 = X8 · (X12 + X9 − X8)^t
24: m22 = (−X6 + X7) · (X5 + X7 − X11)^t
25: m23 = X1 · (X13 − X5 + X16)^t
26: m24 = (−X1 + X4 + X12) · X16^t
27: m25 = (X9 + X2 + X10) · X14^t
28: m26 = (X6 + X10 + X12) · X10^t
29: s1 = X1 · X1^t
30: s2 = X2 · X2^t
31: s3 = X3 · X3^t
32: s4 = X4 · X4^t
33: s5 = X13 · X13^t
34: s6 = X14 · X14^t
35: s7 = X15 · X15^t
36: s8 = X16 · X16^t
37: C11 = s1 + s2 + s3 + s4
38: C12 = m2 − m5 − m7 + m11 + m12 + m13 + m19
39: C13 = m1 + m3 + m12 + m15 + m16 + m17 + m21 − m24
40: C14 = m2 − m3 − m5 − m7 − m8 + m11 + m13 − m17 + m23 + m24
41: C22 = m1 + m6 − m7 + m10 + m11 + m12 + m22
42: C23 = m1 − m4 + m6 − m7 − m9 + m10 + m12 + m18 + m20 + m21
43: C24 = m2 + m4 + m11 + m14 + m16 − m18 − m20 + m23
44: C33 = m4 − m6 + m7 + m9 − m17 − m18 + m26
45: C34 = m3 + m5 + m7 + m8 + m17 + m18 + m25
46: C44 = s5 + s6 + s7 + s8
47: return C
Despite this progress, much less attention has been paid to matrix products with additional structure, such as B = A or B = A^t, or products involving sparsity or symmetry Dumas et al. [2020, 2023], Arrigoni et al. [2021]. This is surprising given that expressions like AA^t are widely used in fields such as statistics, data analysis, deep learning, and wireless communications. For example, AA^t often represents a covariance or Gram matrix, while in linear regression, the solution for the data pair (X, y) involves the data covariance matrix X^tX:

β = (X^tX)^{-1} X^t y.

From a theoretical standpoint, computing XX^t has the same asymptotic complexity as general matrix multiplication. As a result, only constant-factor speedups are possible. The RXTX algorithm, presented in Algorithm 1, achieves such a speedup by exploiting structure specific to XX^t.
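For reference, the previous SotA that RXTX is measured against (recursive Strassen for XX^t, analyzed in Section 2) splits X into 2 × 2 blocks and uses four recursive calls plus two general products per level. The following is a minimal NumPy sketch of that baseline (our illustration, not code from the paper); the cutoff value is arbitrary and a plain matmul stands in for a fast general multiply:

```python
import numpy as np

def general_product(A, B):
    """Stand-in for a fast general multiply (Strassen-Winograd in the paper)."""
    return A @ B

def xxt_recursive(X):
    """Previous SotA scheme for X @ X.T: split X into 2x2 blocks, use
    4 recursive calls on the diagonal and 2 general products off-diagonal."""
    n = X.shape[0]
    if n <= 64:                              # arbitrary cutoff for this sketch
        return X @ X.T
    h = n // 2
    X1, X2, X3, X4 = X[:h, :h], X[:h, h:], X[h:, :h], X[h:, h:]
    C11 = xxt_recursive(X1) + xxt_recursive(X2)           # 2 recursive calls
    C22 = xxt_recursive(X3) + xxt_recursive(X4)           # 2 recursive calls
    C12 = general_product(X1, X3.T) + general_product(X2, X4.T)  # 2 general products
    return np.block([[C11, C12], [C12.T, C22]])

X = np.random.default_rng(0).standard_normal((256, 256))
assert np.allclose(xxt_recursive(X), X @ X.T)
```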
2 Analysis of RXTX
We define:
• R(n): the number of multiplications performed by RXTX for an n × n matrix X;
• S(n): the number of multiplications performed by recursive Strassen Arrigoni et al. [2021] for an n × n matrix X;
• M(n): the number of multiplications performed by the Strassen-Winograd algorithm for a general product of n × n matrices;
• R+(n): the number of additions and multiplications performed by RXTX for an n × n matrix X;
• S+(n): the number of additions and multiplications performed by recursive Strassen Arrigoni et al. [2021] for an n × n matrix X.
2.1 Number of multiplications

Theorem 1. The number of multiplications for RXTX is

R(n) = (26/41) M(n) + (15/41) n^{3/2} = (26/41) n^{log_2 7} + (15/41) n^{3/2}.

The number of multiplications for recursive Strassen is

S(n) = (2/3) M(n) + (1/3) n^2 = (2/3) n^{log_2 7} + (1/3) n^2.
Proof. The definition of RXTX involves 8 recursive calls and 26 general matrix multiplications. It follows that

R(n) = 8R(n/4) + 26M(n/4).

The general solution of this recurrence has the form Cormen et al. [2009]

R(n) = αM(n) + βn^{3/2}.

Plugging in n = 1 and n = 4 we get

1 = α + β,
34 = 49α + 8β.

Solving this system we obtain

α = 26/41 ≈ 0.6341,  β = 15/41 ≈ 0.3658.

Similarly, recursive Strassen for XX^t uses 4 recursive calls and 2 general matrix multiplications:

S(n) = 4S(n/2) + 2M(n/2).

The general solution has the form

S(n) = γM(n) + δn^2.

Plugging in n = 1 and n = 2 we get

1 = γ + δ,  6 = 7γ + 4δ.

Solving this system we obtain γ = 2/3 ≈ 0.6666 and δ = 1/3 ≈ 0.3333.
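Both recurrences are easy to check numerically. The following short script (our addition, not from the paper) unrolls them and verifies the closed forms; n ranges over powers of 4 so all quantities stay exact integers:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def M(n):   # Strassen-Winograd multiplications for a general n x n product
    return 1 if n == 1 else 7 * M(n // 2)

@lru_cache(maxsize=None)
def R(n):   # RXTX: 8 recursive calls + 26 general products on n/4 blocks
    return 1 if n == 1 else 8 * R(n // 4) + 26 * M(n // 4)

@lru_cache(maxsize=None)
def S(n):   # recursive Strassen: 4 recursive calls + 2 general products
    return 1 if n == 1 else 4 * S(n // 2) + 2 * M(n // 2)

for k in range(1, 8):
    n = 4 ** k
    assert 41 * R(n) == 26 * M(n) + 15 * 8 ** k   # n^(3/2) = 8^k for n = 4^k
    assert 3 * S(n) == 2 * M(n) + n * n           # S(n) = (2/3)M(n) + (1/3)n^2
    print(n, R(n) / S(n))   # ratio tends to (26/41)/(2/3) = 39/41 ≈ 0.951
```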
In Figure 1 we can see the ratio R(n)/S(n) for n given by powers of 4. The ratio always stays below 100% and approaches the asymptotic 95%, which indicates a 5% reduction in the number of multiplications. The same holds in Figure 2, where we use an optimal cutoff, i.e., for small enough matrix sizes we use standard matrix multiplication instead of further recursive calls.
Figure 1: Comparison of the number of multiplications of RXTX to the previous SotA and the naive algorithm.

Figure 2: Comparison of the number of multiplications of RXTX with optimal cutoff to the previous SotA and the naive algorithm. Panels show the ratio of R_opt(n) to S_opt(n) and the ratio of R_opt(n) to n^2(n + 1)/2.
2.2 Number of additions and multiplications

The same technique yields the total operation counts. One step of RXTX performs 8 recursive calls, 26 general products, and 100 block additions (53 in Algorithm 2 plus 47 in Algorithm 3) on (n/4) × (n/4) blocks, so

R+(n) = 8R+(n/4) + 26M+(n/4) + (25/4)n^2,

while one step of recursive Strassen performs 4 recursive calls, 2 general products, and 3 block additions on (n/2) × (n/2) blocks, so

S+(n) = 4S+(n/2) + 2M+(n/2) + (3/4)n^2.

The general solution for S+ has the form S+(n) = (2/3)M+(n) + γn^2 log_2 n + δn^2. Plugging in the values n = 1 and n = 2 gives γ = −7/4 and δ = 1/3. It is known Cenk and Hasan [2017] that

M+(n) = 6n^{log_2 7} − 5n^2.

It follows that

R+(n) = (26/41)(6n^{log_2 7} − 5n^2) − (95/164)n^2 + (155/164)n^{3/2} = (156/41)n^{log_2 7} − (615/164)n^2 + (155/164)n^{3/2}

and

S+(n) = (2/3)(6n^{log_2 7} − 5n^2) − (7/4)n^2 log_2 n + (1/3)n^2 = 4n^{log_2 7} − (7/4)n^2 log_2 n − 3n^2.
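Evaluating these closed forms numerically (our addition) reproduces the crossover reported in Figure 3 below:

```python
import math

# Total operation counts from the closed forms above.
def R_plus(n):
    return 156/41 * n**math.log2(7) - 615/164 * n**2 + 155/164 * n**1.5

def S_plus(n):
    return 4 * n**math.log2(7) - 7/4 * n**2 * math.log2(n) - 3 * n**2

for k in range(1, 7):
    n = 4 ** k
    print(f"n = {n:5d}  R+/S+ = {R_plus(n) / S_plus(n):.4f}")
# The ratio drops below 1 at n = 256, matching Figure 3.
```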
Figure 3: Comparison of the number of operations of RXTX to recursive Strassen and the naive algorithm. RXTX outperforms recursive Strassen for n ≥ 256 and the naive algorithm for n ≥ 1024.

Figure 4: Comparison of algorithms with optimal cutoffs, i.e., for small enough matrices the recursion switches to the algorithm with the fewest operations. RXTX outperforms the naive algorithm for n ≥ 32 and the previous SotA for n ≥ 256.
Algorithm 2 First stage of optimized addition scheme. The number of additions is reduced from 77 to 53.
1: Input: X1, X2, ..., X16
2: Output: Left elements L1, ..., L26 and right elements R1, ..., R26 of multiplications m1, ..., m26
3: y1 ← X13 − X14
4: y2 ← X12 − X10
5: w1 ← X2 + X4 − X8
6: w2 ← X1 − X5 − X6
7: w3 ← X6 + X7
8: w4 ← X14 + X15
9: w5 ← y2 + X16
10: w6 ← X10 + X11
11: w7 ← X9 + y1
12: w8 ← X9 − X8
13: w9 ← X7 − X11
14: w10 ← X6 − X7
15: w11 ← X2 − X3
16: L1 ← −w1 + X3            R1 ← X8 + X11
17: L2 ← w2 + X7             R2 ← X15 + X5
18: L3 ← −X2 + X12           R3 ← w5
19: L4 ← X9 − X6             R4 ← w7
20: L5 ← X2 + X11            R5 ← X15 − w3
21: L6 ← X6 + X11            R6 ← w3 − X11
22: L7 ← X11                 R7 ← w3
23: L8 ← X2                  R8 ← w3 − w4 + w5
24: L9 ← X6                  R9 ← w7 − w6 + w3
25: L10 ← w1 − X3 + X7 + X11 R10 ← X11
26: L11 ← X5 + w10           R11 ← X5
27: L12 ← w11 + X4           R12 ← X8
28: L13 ← −w2 + X3 − w9      R13 ← X15
29: L14 ← −w2                R14 ← w7 + w4
30: L15 ← w1                 R15 ← w6 + w5
31: L16 ← X1 − X8            R16 ← X9 − X16
32: L17 ← X12                R17 ← −y2
33: L18 ← X9                 R18 ← y1
34: L19 ← −w11               R19 ← −X15 + X7 + X8
35: L20 ← X5 + w8            R20 ← X9
36: L21 ← X8                 R21 ← X12 + w8
37: L22 ← −w10               R22 ← X5 + w9
38: L23 ← X1                 R23 ← X13 − X5 + X16
39: L24 ← −X1 + X4 + X12     R24 ← X16
40: L25 ← X9 + X2 + X10      R25 ← X14
41: L26 ← X6 + X10 + X12     R26 ← X10
Algorithm 3 Second stage of optimized addition scheme. The number of additions is reduced from 62 to 47.
1: Input: m1, m2, ..., m26 and s1, ..., s8.
2: Output: Entries Cij using 47 additions.
3: z1 ← m7 − m11 − m12
4: z2 ← m1 + m12 + m21
5: z3 ← m3 + m17 − m24
6: z4 ← m2 + m11 + m23
7: z5 ← m5 + m7 + m8
8: z6 ← m4 − m18 − m20
9: z7 ← m6 − m7 − m9
10: z8 ← m17 + m18
11: C11 ← s1 + s2 + s3 + s4
12: C12 ← m2 − m5 − z1 + m13 + m19
13: C13 ← z2 + z3 + m15 + m16
14: C14 ← z4 − z3 − z5 + m13
15: C22 ← m1 + m6 − z1 + m10 + m22
16: C23 ← z2 − z6 + z7 + m10
17: C24 ← z4 + z6 + m14 + m16
18: C33 ← m4 − z7 − z8 + m26
19: C34 ← m3 + z5 + z8 + m25
20: C44 ← s5 + s6 + s7 + s8
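To make Algorithms 2 and 3 concrete, here is a self-contained sanity check (our addition, not part of the paper): with 1 × 1 blocks every product m_i = L_i · R_i^t degenerates to a scalar product and every s_i to a square, so the assembled C must equal XX^t for a plain 4 × 4 matrix X whose entries X1, ..., X16 are numbered row by row.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(-9, 10, size=(4, 4)).astype(float)
(X1, X2, X3, X4, X5, X6, X7, X8,
 X9, X10, X11, X12, X13, X14, X15, X16) = X.flatten()

# Stage 1 (Algorithm 2): shared subexpressions, then the 26 factor pairs.
y1 = X13 - X14; y2 = X12 - X10
w1 = X2 + X4 - X8; w2 = X1 - X5 - X6; w3 = X6 + X7
w4 = X14 + X15; w5 = y2 + X16; w6 = X10 + X11
w7 = X9 + y1; w8 = X9 - X8; w9 = X7 - X11
w10 = X6 - X7; w11 = X2 - X3
L = [-w1 + X3, w2 + X7, -X2 + X12, X9 - X6, X2 + X11, X6 + X11, X11,
     X2, X6, w1 - X3 + X7 + X11, X5 + w10, w11 + X4, -w2 + X3 - w9,
     -w2, w1, X1 - X8, X12, X9, -w11, X5 + w8, X8, -w10, X1,
     -X1 + X4 + X12, X9 + X2 + X10, X6 + X10 + X12]
R = [X8 + X11, X15 + X5, w5, w7, X15 - w3, w3 - X11, w3, w3 - w4 + w5,
     w7 - w6 + w3, X11, X5, X8, X15, w7 + w4, w6 + w5, X9 - X16, -y2,
     y1, -X15 + X7 + X8, X9, X12 + w8, X5 + w9, X13 - X5 + X16, X16,
     X14, X10]
m = [None] + [l * r for l, r in zip(L, R)]      # 1-based, as in the paper
s = [None, X1 * X1, X2 * X2, X3 * X3, X4 * X4,  # recursive calls s1..s8
     X13 * X13, X14 * X14, X15 * X15, X16 * X16]

# Stage 2 (Algorithm 3): 8 shared sums, then the upper-triangular entries.
z1 = m[7] - m[11] - m[12]; z2 = m[1] + m[12] + m[21]
z3 = m[3] + m[17] - m[24]; z4 = m[2] + m[11] + m[23]
z5 = m[5] + m[7] + m[8];   z6 = m[4] - m[18] - m[20]
z7 = m[6] - m[7] - m[9];   z8 = m[17] + m[18]
C = np.zeros((4, 4))
C[0, 0] = s[1] + s[2] + s[3] + s[4]
C[0, 1] = m[2] - m[5] - z1 + m[13] + m[19]
C[0, 2] = z2 + z3 + m[15] + m[16]
C[0, 3] = z4 - z3 - z5 + m[13]
C[1, 1] = m[1] + m[6] - z1 + m[10] + m[22]
C[1, 2] = z2 - z6 + z7 + m[10]
C[1, 3] = z4 + z6 + m[14] + m[16]
C[2, 2] = m[4] - z7 - z8 + m[26]
C[2, 3] = m[3] + z5 + z8 + m[25]
C[3, 3] = s[5] + s[6] + s[7] + s[8]
C = np.triu(C) + np.triu(C, 1).T                # C is symmetric
assert np.allclose(C, X @ X.T)
```

The same skeleton becomes the full recursive algorithm when the Xi are matrix blocks: the 26 products call a fast general multiply and s1, ..., s8 recurse into the routine itself.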
Figure 5: The average runtime of RXTX is 2.524 s, which is 9% faster than the 2.778 s average runtime of the specific BLAS routine. RXTX was faster in 99% of the runs.
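For context, the natural BLAS baseline for XX^t is a syrk-style rank-k update, and the figure's "specific BLAS routine" is presumably of this kind. Below is a minimal timing sketch of the baseline side only (our addition; the paper's RXTX implementation is not reproduced here, and the size and repetition counts are arbitrary):

```python
import time
import numpy as np
from scipy.linalg.blas import dsyrk

n = 4096
# Fortran order avoids an internal copy when handing the array to BLAS.
X = np.asfortranarray(np.random.default_rng(0).standard_normal((n, n)))

times = []
for _ in range(10):
    t0 = time.perf_counter()
    C = dsyrk(1.0, X)          # computes one triangle of X @ X.T
    times.append(time.perf_counter() - t0)
print(f"syrk: best of 10 runs = {min(times):.3f}s")
```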
3 Discovery Methodology
3.1 Description of RL-guided Large Neighborhood Search
In this section we briefly present our methodology. The full methodology, together with other discovered accelerations, will be described in Rybin et al. [2025]. We combine RL-guided Large Neighborhood Search Wu et al. [2021], Addanki et al. [2020] with a two-level MILP pipeline:
1. The RL agent proposes a (potentially redundant) set of rank-1 bilinear products;
2. MILP-A exhaustively enumerates tens of thousands of linear relations between these candidate rank-1
bilinear products and target expressions;
3. MILP-B then selects the smallest subset of products whose induced relations cover every target
expression of XX^t.
The loop iterates under a Large Neighborhood Search regime. One way to view this pipeline is as a simplification of the AlphaTensor RL approach Fawzi et al. [2022]: instead of sampling tensors from R^{n^2} ⊗ R^{n^2} ⊗ R^{n^2}, we sample candidate tensors from R^{n^2} ⊗ R^{n^2} and let the MILP solver find optimal linear combinations of the sampled candidates.
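The selection step lends itself to a small covering MILP. The gurobipy sketch below is one plausible reading of MILP-B (ours; the paper does not spell out the exact model), and its inputs are hypothetical: relations stands for MILP-A's output, a list of (target_id, product_ids) pairs recording that a target is a linear combination of those candidate products.

```python
import gurobipy as gp
from gurobipy import GRB

def select_min_products(num_products, num_targets, relations):
    m = gp.Model("milp_b")
    x = m.addVars(num_products, vtype=GRB.BINARY)    # product p is kept
    y = m.addVars(len(relations), vtype=GRB.BINARY)  # relation r is used
    for r, (_, prods) in enumerate(relations):
        for p in prods:                   # a used relation requires all of
            m.addConstr(y[r] <= x[p])     # its products to be selected
    for t in range(num_targets):          # every target needs some relation
        m.addConstr(gp.quicksum(y[r] for r, (tt, _) in enumerate(relations)
                                if tt == t) >= 1)
    m.setObjective(x.sum(), GRB.MINIMIZE) # fewest products overall
    m.optimize()
    return [p for p in range(num_products) if x[p].X > 0.5]
```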
3.2 Example: matrix-times-transpose algorithm search for a 2 × 2 matrix

Consider the example of a 2 × 2 matrix X. We want to compute XX^t:

XX^t = [x1 x2; x3 x4] · [x1 x3; x2 x4] = [x1^2 + x2^2, x1x3 + x2x4; x1x3 + x2x4, x3^2 + x4^2].

We identify 3 target expressions

T = { x1^2 + x2^2, x3^2 + x4^2, x1x3 + x2x4 }.
We randomly sample thousands of products p1, ..., pm, each given by

(α1x1 + α2x2 + α3x3 + α4x4) · (β1x1 + β2x2 + β3x3 + β4x4)

with αi, βj ∈ {−1, 0, +1} chosen by the RL policy πθ. MILP-A enumerates ways to write the target expressions from T as linear combinations ∑ γi pi of the sampled products. MILP-B selects the minimal number of sampled products such that every target expression can be obtained as a linear combination of the selected ones. The key observation is that MILP-A and MILP-B are rapidly solvable with solvers like Gurobi Gurobi Optimization, LLC [2024].
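At this toy scale the whole loop can be imitated without a MILP solver. In the sketch below (ours, not from the paper), brute-force subset search stands in for MILP-A/MILP-B, and a few hand-seeded symmetric products stand in for the RL policy's steering, without which purely random ±1 samples rarely admit small exact covers:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
monos = [(i, j) for i in range(4) for j in range(i, 4)]  # 10 monomials x_i x_j

def product_vector(alpha, beta):
    # Coefficients of (sum_i alpha_i x_i)(sum_j beta_j x_j) over the monomials.
    v = np.zeros(len(monos))
    for i in range(4):
        for j in range(4):
            v[monos.index((min(i, j), max(i, j)))] += alpha[i] * beta[j]
    return v

def e(i):                         # coefficient vector picking out x_i alone
    v = [0, 0, 0, 0]; v[i] = 1; return v

# Targets: x1^2 + x2^2, x3^2 + x4^2, x1 x3 + x2 x4 (0-based indices here).
targets = [product_vector(e(0), e(0)) + product_vector(e(1), e(1)),
           product_vector(e(2), e(2)) + product_vector(e(3), e(3)),
           product_vector(e(0), e(2)) + product_vector(e(1), e(3))]

# Seeded proposals (playing the RL policy's role) plus random samples.
pool = [product_vector(e(i), e(i)) for i in range(4)]        # x_i^2
pool += [product_vector([1, 0, 1, 0], [1, 0, 1, 0]),         # (x1 + x3)^2
         product_vector([0, 1, 0, 1], [0, 1, 0, 1])]         # (x2 + x4)^2
pool += [product_vector(rng.integers(-1, 2, 4), rng.integers(-1, 2, 4))
         for _ in range(10)]

def in_span(P, t):                # is target t a linear combination of P's columns?
    return np.linalg.matrix_rank(np.column_stack([P, t])) == np.linalg.matrix_rank(P)

for size in range(1, 7):          # smallest covering subset: MILP-B's job
    found = next((idx for idx in itertools.combinations(range(len(pool)), size)
                  if all(in_span(np.array([pool[i] for i in idx]).T, t)
                         for t in targets)), None)
    if found is not None:
        print(f"{size} products suffice for this sample: {found}")
        break
```

The seeded pool always admits a cover of size at most 6, since x1x3 + x2x4 = ((x1 + x3)^2 + (x2 + x4)^2 − x1^2 − x2^2 − x3^2 − x4^2)/2; the brute-force loop is exponential, which is exactly why the real pipeline relies on MILP solvers instead.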
4 Acknowledgements
The work of Z.-Q. Luo was supported by the Guangdong Major Project of Basic and Applied Basic Research
(No. 2023B0303000001), the Guangdong Provincial Key Laboratory of Big Data Computing, and the National
Key Research and Development Project under grant 2022YFA1003900.
References
R. Addanki, V. Nair, and M. Alizadeh. Neural large neighborhood search. In Learning Meets Combinatorial
Algorithms @ NeurIPS 2020, 2020.
Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2024. URL https://www.gurobi.com.
M. Kauers and J. Moosbauer. The FBHHRBNRSSSHK-algorithm for multiplication in Z_2^{5×5} is still not the end of the story. ArXiv, abs/2210.04045, 2022. URL https://api.semanticscholar.org/CorpusID:252780122.
E. Mårtensson and P. S. Wagner. The number of the beast: Reducing additions in fast matrix multiplication
algorithms for dimensions up to 666. Cryptology ePrint Archive, Paper 2024/2063, 2024. URL https:
//eprint.iacr.org/2024/2063.
D. Rybin, Y. Zhang, and Z.-Q. Luo. Accelerating structured matrix computations with machine learning
based search. in progress, 2025.
A. V. Smirnov. The bilinear complexity and practical algorithms for matrix multiplication. Compu-
tational Mathematics and Mathematical Physics, 53(12):1781–1795, Dec. 2013. ISSN 1555-6662. doi:
10.1134/s0965542513120129. URL http://dx.doi.org/10.1134/S0965542513120129.
V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, Aug. 1969. ISSN
0945-3245. doi: 10.1007/bf02165411. URL http://dx.doi.org/10.1007/BF02165411.
S. Winograd. A new algorithm for inner product. IEEE Transactions on Computers, 100(7):693–694, 1968.
Y. Wu, W. Song, Z. Cao, and J. Zhang. Learning large neighborhood search policy for integer programming.
arXiv preprint arXiv:2111.03466, 2021.
K. Ye and L.-H. Lim. Algorithms for structured matrix-vector product of optimal bilinear complexity. In
2016 IEEE Information Theory Workshop (ITW), pages 310–314. IEEE, 2016.
K. Ye and L.-H. Lim. Fast structured matrix computations: tensor rank and Cohn-Umans method. Foundations of Computational Mathematics, 18:45-95, 2018.
C. Éric Drevet, M. Nazrul Islam, and Éric Schost. Optimization techniques for small matrix multiplication.
Theoretical Computer Science, 412(22):2219–2236, 2011. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.
2010.12.012. URL https://www.sciencedirect.com/science/article/pii/S0304397510007036.