
NTT Mini: Exploring Winograd’s Heuristic for

Faster NTT
Carol Danvers
[email protected]

Abstract
We report on a Winograd-based implementation of the Number
Theoretic Transform (NTT). It uses fewer multiplications than the better-known
Cooley-Tukey alternative. This optimization is important for very high-order
finite fields. Unfortunately, the Winograd scheme is difficult to generalize
to arbitrary sizes and is only known for small transforms. We
open-source our hardware implementation for size 32, based on [1].

1 Motivation
Zero Knowledge Proofs (ZKP) rely on a small number of computationally inten-
sive primitives such as Multi Scalar Multiplication (MSM) and Number The-
oretic Transform (NTT). The acceleration of these primitives is a necessary
enabler for global adoption of these technologies. In [2], we discussed MSM.
The focus of this note is NTT.
NTT is a generalization of the Discrete Fourier Transform (DFT) to finite
fields. Being a linear transform, it can be written in matrix form:

\vec{y} = F \vec{x}    (1)

The transform matrix F has a special form:


 
F = \begin{pmatrix}
1 & 1 & 1 & \cdots \\
1 & \omega_N & \omega_N^2 & \cdots \\
1 & \omega_N^2 & \omega_N^4 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}    (2)

where \omega_N is a root of unity of order N and N is the transform length.
Multiplication is computationally expensive, so our goal is to minimize the
number of multiplications (this comes at the expense of more additions). For
an arbitrary full-rank matrix of size N, computing (1) costs N^2 multiplications.
The special form of the NTT matrix F allows factorization into a product of
sparse matrices in which many of the non-zero elements are ±1.

The Cooley-Tukey (CT) factorization can be applied recursively for any N,
depending on its prime factorization. For N that is a power of a small prime,
CT achieves a transform cost of N log N multiplications. Of particular interest
is N that is a power of 2.
The Winograd factorization, discussed here, is a more efficient factorization,
reducing F to a product of sparse matrices containing only ±1's, and a single
diagonal matrix with non-trivial values. The size r of the diagonal matrix
always satisfies N ≤ r < N log N; for Winograd, r is exactly the number of
required multiplications. By definition, the Winograd factorization is not
limited to particular N, though it is not recursive and is only known for a
few small values of N. Table 1 compares CT to Winograd for N = 16, 32.

Size CT Winograd
16 17 13
32 49 40

Table 1: Number of multiplications for different transform sizes

2 Theoretical Background
Winograd, much like CT, can be presented both as a series of recursive steps
and as a factorization of the DFT matrix. The symmetries in the DFT matrix
allow an elegant factorization.

2.1 Notation
Define the following notations used hereafter:

 
H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}    (3)

A \otimes B = \begin{pmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{pmatrix}    (4)

A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}    (5)

Additionally, let us denote by M_n a matrix of dimension 2^n.

2.2 Cooley-Tukey Factorization


We follow the presentation of the factorization from [3].

Theorem 2.1 (Cooley-Tukey Factorization) The 2^n × 2^n DFT matrix F_n
can be factored as:

F_n = P_n A_n^{(0)} \cdots A_n^{(n-1)}    (6)

where, for k = 0, \dots, n-1:

A_n^{(k)} = I_{n-k-1} \otimes B_{k+1}    (7)

B_{k+1} = (I_k \oplus \Omega_k)(H \otimes I_k)    (8)

\Omega_k = \mathrm{diag}\big(\omega_{2^{k+1}}^{0}, \dots, \omega_{2^{k+1}}^{2^k - 1}\big)    (9)

and P_n is a bit-reversal permutation that satisfies:

P_n (v_1 \otimes \dots \otimes v_n) = v_n \otimes \dots \otimes v_1    (10)


Notice that A_n^{(k)} is a sparse matrix, containing only 2 non-zero entries in each
row. Thus, multiplying a vector by this matrix can be done in O(2^n) time, which
is linear in the dimension 2^n. The number of A_n^{(k)} matrices in the factorization
is n, which is logarithmic in the dimension 2^n. This leads to a total of
O(n · 2^n) time (or N log N for N = 2^n).

2.3 Winograd Factorization


Winograd utilizes different symmetries in the DFT matrix.

Theorem 2.2 (Winograd’s Heuristic Factorization) The 2^n × 2^n DFT
matrix F_n can be factored as:

F_n = (H_2 \otimes I_{n-1})(F_{n-1} \oplus Q_{n-1}) P_{\pi_n}    (11)

where:

P_{\pi_n} = \begin{pmatrix} I_{n-2} \otimes \Psi_{2\times 4} \\ I_{n-2} \otimes \Psi_{2\times 4} I^{1\to} \end{pmatrix}    (12)

\Psi_{2\times 4} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}    (13)

Q_{n-1} is a prefix matrix, and I^{1\to} is the matrix obtained from the identity
matrix by shifting its columns by one position to the right.

Intuitively, the main goal of this factorization is to decompose the DFT
matrix as follows:

F_n = M_1 \cdots M_k \cdot D \cdot M_{k+1} \cdots M_m    (14)

where each M_i is a sparse matrix consisting of ±1 entries, and D is a diagonal
matrix (typically of dimension greater than 2^n).
The strategy of [1] is to gradually decompose the matrices F_n, Q_n by applying
the above theorem, together with several permutations and the following decomposition
rules, which capitalize on inherent symmetries in each of these matrices.

Claim 2.3 For n × n matrices A, B, C, the following identities hold, where
H_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} denotes the unnormalized butterfly:

\begin{pmatrix} A & B \\ A & -B \end{pmatrix} = (H_2 \otimes I_n)(A \oplus B)    (15)

\begin{pmatrix} A & A \\ B & -B \end{pmatrix} = (A \oplus B)(H_2 \otimes I_n)    (16)

\begin{pmatrix} A & B \\ B & A \end{pmatrix} = \frac{1}{2} (H_2 \otimes I_n)\big((A+B) \oplus (A-B)\big)(H_2 \otimes I_n)    (17)

\begin{pmatrix} A & B \\ B & -A \end{pmatrix} = (T^\top \otimes I_n) \begin{pmatrix} A-B & 0 & 0 \\ 0 & -(A+B) & 0 \\ 0 & 0 & B \end{pmatrix} (T \otimes I_n)    (18)

\begin{pmatrix} A & B \\ C & A \end{pmatrix} = (Q \otimes H \otimes I_{n-1}) \begin{pmatrix} C-A & 0 & 0 \\ 0 & B-A & 0 \\ 0 & 0 & A \end{pmatrix} (T \otimes H \otimes I_{n-1})    (19)

where

T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}    (20)

Q = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}    (21)
Using these rules, the authors of [1] managed to decrease the dimension
of the factorization by 2. The caveat of this scheme is the non-uniformity of
the factorization of Q_n, which requires a well-designed permutation to exploit
its symmetries.

3 Results
This note summarizes our initial experience with Winograd factorizations of
size N = 2^n, based on the derivations of [1]. We used Symbolic Algebra System
(SAS) tools to automate the ad-hoc factorizations for arbitrary finite fields.
Our workflow is as follows:
1. SAS code generates a C++ template. The template captures only the structure
   of the size-2^n transform and does not depend on the specific finite-field
   selection. It implements
   (a) the sparse matrix multiplications of (14), and
   (b) the computation of the diagonal matrix D.
2. The C++ template is instantiated for a specific finite field as C++ code.
3. The C++ code is compatible with High-Level Synthesis EDA tools, which eventually
   produce RTL (Verilog, VHDL) and, finally, FPGA bitstreams or GL1 for ASICs.

We open-source a C++ template for NTT of size 2^5 = 32, together with a
C++ instance for the scalar finite field of the elliptic curve bn254. The template
ntt32_winograd uses arbitrary-precision unsigned integers to represent the
elements of the finite field. Specifically, we use the type template ap_uint<W> from
Xilinx' Vitis HLS toolchain [4]. The template takes the vector ⃗x = (x0, . . . , x31)
as input and returns the output ⃗y = (y0, . . . , y31) by reference:

template<int W> void ntt32_winograd(
    const ap_uint<W> x0, ..., const ap_uint<W> x31,
    ap_uint<W>* y0, ..., ap_uint<W>* y31
) {...}
All finite-field matrix operations are unrolled and call the scalar functions
basic_add_mod(), basic_sub_mod() and mult_red(), which are specific to the
selected finite field. The template is instantiated in the function
ntt32_winograd_bn254_scalar:
void ntt32_winograd_bn254_scalar(
    const ap_uint<254> x0, ..., const ap_uint<254> x31,
    ap_uint<254>* y0, ..., ap_uint<254>* y31
)
{
    ntt32_winograd(x0, ..., x31, y0, ..., y31);
    return;
}
The header file ntt32_winograd_bn254_scalar.hpp declares the above instance
and the optimized finite-field operations for the bn254 scalar field.
When using our example, this header is the only file to include.

#include "ntt32_winograd_bn254_scalar.hpp"
...
// define input vector x
...
ntt32_winograd_bn254_scalar(x0, ..., x31, &y0, ..., &y31);
// use output vector y
...

4 Usage
1. Make sure you have the g++ toolchain installed.
2. Make sure you have Xilinx Vitis installed. The environment variable
   XILINX_HLS should be defined and point to the distribution. This allows
   the toolchain to find Xilinx's arbitrary-precision headers.
3. Make sure you have the C++ Boost library [5]. The environment variable BOOST
   should point to the installation. We need this library for tests only.

4. Download our code from [6].
5. Run make test.

5 Future Directions
In this note, we demonstrated the Winograd factorization only for small de-
gree polynomials. Real-world instances of Zero Knowledge Proofs and Fully
Homomorphic Encryption require higher degree polynomial arithmetic, with
common sizes of N reaching 215 and often higher. There are various interesting
directions to proceed. One is to extend the work of [1], finding the Winograd
factorization for specific power-of-two N ’s larger than 32, potentially discov-
ering a closed-form extendable expression. A second, more immediate, is to
utilize the recursive structure of NTT, together with the small-size Winograd
building-blocks, to extend the construction to higher N ’s similarly to what was
done by CT.
The correctness of our HLS code was verified in C++ and Verilog simulations. We
will soon integrate it into our Cloud-ZK dev-kit [7], enabling developers to
integrate with a fast NTT implementation running on AWS F1 FPGA instances.
We hope that our implementation will lead to better intuition about the
complexity of Winograd. The number of multiplications, which dominates the
computation time, behaves as O(N). We think it will be interesting to compare
the theoretical complexity to concrete measurements. In general, Winograd
requires more additions than CT, which might become a non-negligible factor in
total running time.

References
[1] Mateusz Raciborski and Aleksandr Cariow. On the derivation of Winograd-type
DFT algorithms for input sequences whose length is a power of two.
Electronics, 11:1342, 2022.
[2] Charles F. Xavier. PipeMSM: Hardware acceleration for multi-scalar multiplication.
Cryptology ePrint Archive, 2022.
[3] Daan Camps, Roel Van Beeumen, and Chao Yang. Quantum Fourier transform
revisited. Numerical Linear Algebra with Applications, 28(1), 2020.
[4] Vitis High-Level Synthesis User Guide (UG1399).
https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/
Overview-of-Arbitrary-Precision-Integer-Data-Types.
[5] C++ Boost library. https://www.boost.org/.
[6] Winograd NTT32 code. https://github.com/ingonyama-zk/ntt_
winograd.git.
[7] Cloud-ZK. https://github.com/ingonyama-zk/cloud-ZK.
