
NTT Mini: Exploring Winograd’s Heuristic for

Faster NTT
Carol Danvers
[email protected]

Abstract
We report on a Winograd-based implementation of the Number
Theoretic Transform (NTT). It uses fewer multiplications than the better-known
Cooley-Tukey alternative. This optimization is important for very high-order
finite fields. Unfortunately, the Winograd scheme is difficult to generalize
to arbitrary sizes and is only known for small transforms. We
open-source our hardware implementation for size 32, based on [1].

1 Motivation
Zero Knowledge Proofs (ZKP) rely on a small number of computationally inten-
sive primitives such as Multi Scalar Multiplication (MSM) and Number The-
oretic Transform (NTT). The acceleration of these primitives is a necessary
enabler for global adoption of these technologies. In [2], we discussed MSM.
The focus of this note is NTT.
NTT is a generalization of the Discrete Fourier Transform (DFT) to finite
fields. Being a linear transform, it can be written in matrix form:

\vec{y} = F \vec{x}    (1)

The transform matrix F has a special form:


 
F = \begin{pmatrix}
1 & 1 & 1 & \cdots \\
1 & \omega_N & \omega_N^2 & \cdots \\
1 & \omega_N^2 & \omega_N^4 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}    (2)

where \omega_N is a root of unity of order N and N is the transform length.
Multiplication is computationally expensive, so our goal is to minimize the
number of multiplications (this comes at the expense of more additions). For
an arbitrary full-rank matrix of size N, computing (1) costs N^2 multiplications.
The special form of the NTT matrix F allows factorization into a product of
sparse matrices in which many of the non-zero elements are ±1.

The Cooley-Tukey (CT) factorization can be applied recursively for any N,
depending on its prime factorization. For N that is a power of a small prime,
CT achieves a transform cost of N log N multiplications. Of particular interest
is N that is a power of 2.
The Winograd factorization, discussed here, is a more efficient factorization,
reducing F to a product of sparse matrices containing only ±1's, and a single
diagonal matrix with non-trivial values. The size r of the diagonal matrix
always satisfies N ≤ r < N log N; for Winograd, r is exactly the number of
required multiplications. By definition, the Winograd factorization is not
limited to particular N, though it is not recursive and is only known for a
few small values of N. Table 1 compares CT to Winograd for N = 16, 32.

Size CT Winograd
16 17 13
32 49 40

Table 1: Number of multiplications for different transform sizes

2 Theoretical Background
Winograd, much like CT, can be presented both as a series of recursive steps
and as a factorization of the DFT matrix. The symmetries in the DFT matrix
allow an elegant factorization.

2.1 Notation
Define the following notations used hereafter:

 
H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}    (3)

A \otimes B = \begin{pmatrix} a_{11} B & a_{12} B \\ a_{21} B & a_{22} B \end{pmatrix}    (4)

A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}    (5)

Additionally, let us denote by M_n a matrix of dimension 2^n.

2.2 Cooley-Tukey Factorization


We follow the presentation of the factorization from [3].

Theorem 2.1 (Cooley-Tukey Factorization) The 2^n × 2^n DFT matrix F_n
can be factored as:

F_n = P_n A_n^{(0)} \cdots A_n^{(n-1)}    (6)

where, for k = 0, \dots, n-1:

A_n^{(k)} = I_{n-k-1} \otimes B_{k+1}    (7)

B_{k+1} = (I_k \oplus \Omega_k)(H \otimes I_k)    (8)

\Omega_k = \mathrm{diag}\big(\omega_{2^{k+1}}^{0}, \dots, \omega_{2^{k+1}}^{2^k - 1}\big)    (9)

and P_n is a bit-reversal permutation that satisfies:

P_n (v_1 \otimes \dots \otimes v_n) = v_n \otimes \dots \otimes v_1    (10)


Notice that A_n^{(k)} is a sparse matrix, containing only 2 non-zero entries in each
row. Thus, multiplying a vector by this matrix can be done in O(2^n) time, which
is linear in the dimension 2^n. The number of A_n^{(k)} matrices in the factorization
is n, which is logarithmic in the dimension 2^n. This leads to a total of
O(n · 2^n) time (or N log N for N = 2^n).

2.3 Winograd Factorization


Winograd utilizes different symmetries in the DFT matrix.

Theorem 2.2 (Winograd’s Heuristic Factorization) The 2^n × 2^n DFT
matrix F_n can be factored as:

F_n = (H_2 \otimes I_{n-1})(F_{n-1} \oplus Q_{n-1}) P_{\pi_n}    (11)

where:

P_{\pi_n} = \begin{pmatrix} I_{n-2} \otimes \Psi_{2\times 4} \\ I_{n-2} \otimes \Psi_{2\times 4} I^{1\to} \end{pmatrix}    (12)

\Psi_{2\times 4} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}    (13)

Q_{n-1} is a prefix matrix, and I^{1\to} is the matrix obtained from the identity
matrix by shifting its columns by one position to the right.

Intuitively, the main goal of this factorization is to decompose the DFT
matrix as follows:

F_n = M_1 \cdots M_k \cdot D \cdot M_{k+1} \cdots M_m    (14)

where each M_i is a sparse matrix consisting of ±1 entries, and D is a diagonal
matrix (typically of dimension greater than 2^n).
The strategy of [1] is to gradually decompose the matrices F_n, Q_n by applying
the above theorem, together with several permutations and the following decomposition
rules, which capitalize on inherent symmetries in each of these matrices.

Claim 2.3 For n × n matrices A, B, C, the following identities hold, where
H_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} denotes the unnormalized butterfly:

\begin{pmatrix} A & B \\ A & -B \end{pmatrix} = (H_2 \otimes I_n)(A \oplus B)    (15)

\begin{pmatrix} A & A \\ B & -B \end{pmatrix} = (A \oplus B)(H_2 \otimes I_n)    (16)

\begin{pmatrix} A & B \\ B & A \end{pmatrix} = \frac{1}{2} (H_2 \otimes I_n)\big((A+B) \oplus (A-B)\big)(H_2 \otimes I_n)    (17)

\begin{pmatrix} A & B \\ B & -A \end{pmatrix} = (T^\top \otimes I_n) \begin{pmatrix} A-B & 0 & 0 \\ 0 & -(A+B) & 0 \\ 0 & 0 & B \end{pmatrix} (T \otimes I_n)    (18)

\begin{pmatrix} A & B \\ C & A \end{pmatrix} = (Q \otimes H \otimes I_{n-1}) \begin{pmatrix} C-A & 0 & 0 \\ 0 & B-A & 0 \\ 0 & 0 & A \end{pmatrix} (T \otimes H \otimes I_{n-1})    (19)

where

T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}    (20)

Q = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}    (21)
Using these rules, the authors of [1] managed to decrease the dimension
of the factorization by 2. The caveat of this scheme is the non-uniformity of
the factorization of Q_n, which requires a well-designed permutation to exploit
its symmetries.

3 Results
This note summarizes our initial experience with Winograd factorizations of
size N = 2^n, based on the derivations of [1]. We used Symbolic Algebra System
(SAS) tools to automate the ad-hoc factorizations for arbitrary finite fields.
Our workflow is as follows:
1. SAS code generates a C++ template. The template captures only the structure
   of the size-2^n transform and does not depend on the specific finite-field
   selection. It implements
   (a) the sparse matrix multiplications of (14), and
   (b) the computation of the diagonal matrix D.
2. The C++ template is instantiated for a specific finite field as C++ code.
3. The C++ code is compatible with High-Level Synthesis EDA tools, which eventually
   produce RTL (Verilog, VHDL) and, finally, FPGA bitstreams or GL1 for ASICs.

We open-source a C++ template for NTT of size 2^5 = 32, together with a
C++ instance for the scalar finite field of the elliptic curve bn254. The template
ntt32_winograd uses arbitrary-precision unsigned integers to represent the
elements of the finite field. Specifically, we use the type template ap_uint<W> from
Xilinx' Vitis HLS toolchain [4]. The template takes the vector ⃗x = (x0, . . . , x31)
as input and returns the output ⃗y = (y0, . . . , y31) by reference:

template<int W> void ntt32_winograd(
    const ap_uint<W> x0, ..., const ap_uint<W> x31,
    ap_uint<W>* y0, ..., ap_uint<W>* y31
) {...}
All finite-field matrix operations are unrolled and call the scalar functions
basic_add_mod(), basic_sub_mod() and mult_red(), which are specific to the
selected finite field. The template is instantiated in the function
ntt32_winograd_bn254_scalar:
void ntt32_winograd_bn254_scalar(
    const ap_uint<254> x0, ..., const ap_uint<254> x31,
    ap_uint<254>* y0, ..., ap_uint<254>* y31
)
{
    ntt32_winograd(x0, ..., x31, y0, ..., y31);
    return;
}
The header file ntt32_winograd_bn254_scalar.hpp declares the above instance
and the optimized finite-field operations for the bn254 scalar field.
When using our example, this header is the only file to include.

#include "ntt32_winograd_bn254_scalar.hpp"
...
// define input vector x
...
ntt32_winograd_bn254_scalar(x0, ..., x31, &y0, ..., &y31);
// use output vector y
...

4 Usage
1. Make sure you have the g++ toolchain installed.
2. Make sure you have Xilinx Vitis installed. The environment variable
   XILINX_HLS should be defined and point to the distribution. This allows
   the toolchain to find Xilinx's arbitrary-precision headers.
3. Make sure you have the C++ Boost library [5]. The environment variable BOOST
   should point to the installation. We need this library for tests only.

4. Download our code from [6].
5. Run make test.

5 Future Directions
In this note, we demonstrated the Winograd factorization only for small de-
gree polynomials. Real-world instances of Zero Knowledge Proofs and Fully
Homomorphic Encryption require higher degree polynomial arithmetic, with
common sizes of N reaching 215 and often higher. There are various interesting
directions to proceed. One is to extend the work of [1], finding the Winograd
factorization for specific power-of-two N ’s larger than 32, potentially discov-
ering a closed-form extendable expression. A second, more immediate, is to
utilize the recursive structure of NTT, together with the small-size Winograd
building-blocks, to extend the construction to higher N ’s similarly to what was
done by CT.
The correctness of our HLS code was verified in C++ and Verilog simulations. We
will soon integrate it into our Cloud-ZK dev-kit [7], enabling developers to
integrate with a fast NTT implementation running on AWS F1 FPGA instances.
We hope that our implementation will lead to better intuition about the
complexity of Winograd. The number of multiplications, which dominates the
computation time, behaves as O(N). We think it will be interesting to compare
the theoretical complexity to concrete measurements. In general, Winograd
requires more additions than CT, which might become a non-negligible factor in
total running time.

References
[1] Mateusz Raciborski and Aleksandr Cariow. On the derivation of Winograd-type
DFT algorithms for input sequences whose length is a power of two.
Electronics, 11:1342, 2022.
[2] Charles F. Xavier. PipeMSM: Hardware acceleration for multi-scalar multiplication.
Cryptology ePrint Archive, 2022.
[3] Daan Camps, Roel Van Beeumen, and Chao Yang. Quantum Fourier transform
revisited. Numerical Linear Algebra with Applications, 28(1), 2020.
[4] Vitis High-Level Synthesis User Guide (UG1399).
https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/
Overview-of-Arbitrary-Precision-Integer-Data-Types.
[5] C++ Boost library. https://www.boost.org/.
[6] Winograd NTT32 code. https://github.com/ingonyama-zk/ntt_
winograd.git.
[7] Cloud-ZK. https://github.com/ingonyama-zk/cloud-ZK.
