Accelerating FHE Integer Multiplier Using Negative
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCSII.2018.2840108, IEEE Transactions on Circuits and Systems II: Express Briefs
Abstract—This brief proposes a novel hardware structure for large integer multiplication in fully homomorphic encryption. We propose a method based on negative wrapped convolution to avoid zero-padding in Strassen's algorithm, which cuts the Fourier transform length in half. In addition, we optimize the ping-pong Fast Fourier transform algorithm by doubling the transform throughput and generating the round constant on the fly. Based on our proposed method and optimized algorithm, we design and implement a 768k-bit integer multiplier on an Altera Stratix V FPGA. Implementation results on FPGA show that our structure outperforms the current competitors in area-efficiency.

Index Terms—fully homomorphic encryption (FHE), integer multiplication, negative wrapped convolution, ping-pong Fast Fourier transform (FFT)

This work was supported by the National Natural Science Foundation of China under Grant 61674086. (Corresponding author: Shuguo Li.)
The authors are with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China (e-mail: [email protected]; [email protected]).

I. INTRODUCTION

Fully homomorphic encryption (FHE) proposed by Gentry [1] offers an algorithm-level solution to protect one's privacy during cloud computing. However, it is still not ready for practical application due to limitations in speed and resources. One of the most time-consuming operations in general FHE schemes like [2], [3] is the multiplication of large integers of up to a million bits. So far, a wealth of research ([4]–[9]) has been conducted on hardware acceleration of integer multiplication in FHE.

Generally, as in previous works ([4]–[9]), Strassen's convolution-based algorithm [10] is employed to obtain the product of two large integers from their Fourier transforms. The key approach to accelerating Strassen's algorithm is to speed up the Fourier transform. To that end, Wang et al. proposed a radix-16 and radix-64 Fast Fourier transform (FFT) engine in [4]. In [5], Doroz et al. implemented FFT on an ASIC platform using a recursive algorithm. Wang et al. then achieved another VLSI design of a 768k-bit (1k = 1024 in this paper) FFT-based multiplier with a high-radix FFT method in [6]. However, in all three of these works, the two multiplicands need to be zero-padded to double length to obtain the correct product, which brings extra complexity in both timing and memory. Feng et al. tried to cut down the transform length by 1/4 in [7] by proposing a double modulus number theoretical transform (NTT) method. But the method in [7] is still not efficient enough, and it relies on two special moduli, so it cannot be generalized.

In order to further reduce the transform length, this brief proposes a method based on negative wrapped convolution ([11]) to avoid zero-padding in Strassen's algorithm, which cuts the Fourier transform length in half. In addition, we also optimize the ping-pong FFT algorithm in [12] by doubling the transform throughput and generating the round constant on the fly. A 768k-bit integer multiplier is designed and implemented on the basis of our proposed method and optimized ping-pong FFT algorithm.

The rest of this brief is organized as follows. Section II introduces the background of negative wrapped convolution and the ping-pong FFT algorithm. Section III presents our proposed method and optimized algorithm. Section IV shows the VLSI architecture of our design. The implementation results and comparisons with previous works are given in Section V. Conclusions then follow in Section VI.

II. MATHEMATICAL BACKGROUND

A. Strassen's integer multiplication algorithm

General FHE applications like [2], [3] require integer multiplications of up to a million bits. Strassen's algorithm proposed in [10] is one of the most efficient integer multiplication algorithms at such large word lengths. It is a convolution-based algorithm that obtains the product of two large integers through the Fast Fourier transform. The detailed schedule of Strassen's algorithm can be summarized in three major steps as follows.

• (1) Splitting and zero-padding: First, split the two multiplicands $a$, $b$ into $n$-dimension vectors. Each element $a_i$ and $b_i$ is a $w$-bit integer, satisfying $a = \sum_{i=0}^{n-1} a_i 2^{iw}$ and $b = \sum_{i=0}^{n-1} b_i 2^{iw}$. Then zero-pad each $n$-dimension vector into a $2n$-dimension vector $\mathbf{a}$, $\mathbf{b}$:

\[ \mathbf{a} = (0, 0, \cdots, 0, a_{n-1}, \cdots, a_1, a_0), \]
\[ \mathbf{b} = (0, 0, \cdots, 0, b_{n-1}, \cdots, b_1, b_0). \]

• (2) Cyclic convolution: Perform cyclic convolution and obtain the convolution result $\mathbf{c}$ of vectors $\mathbf{a}$ and $\mathbf{b}$ using the $2n$-point Fast Fourier transform (FFT) and inverse Fast Fourier transform (IFFT). (Note that $\omega$ is a primitive $2n$-th root of unity, satisfying $\omega^{2n} = 1$ in the finite field.)

\[ \mathbf{c} = \mathrm{IFFT}_{\omega}^{2n}\left(\mathrm{FFT}_{\omega}^{2n}(\mathbf{a}) * \mathrm{FFT}_{\omega}^{2n}(\mathbf{b})\right) \]

• (3) Recombining: Recombine the $2n$-dimension vector $\mathbf{c}$ into the integer $c$ through accumulation:

\[ c = \sum_{i=0}^{2n-1} c_i \cdot 2^{iw} \]
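The three steps above can be sketched in a few lines of Python. This is only an illustrative reference model, not the hardware data path: the function names are hypothetical, vectors are stored least significant digit first (so list index i holds digit i), and the cyclic convolution is computed directly in O(n^2) where a real design would use a 2n-point FFT/NTT.

```python
def split(value, n, w):
    """Step 1a: split an integer into n digits of w bits (index i holds digit i)."""
    mask = (1 << w) - 1
    return [(value >> (i * w)) & mask for i in range(n)]


def cyclic_convolution(u, v):
    """Step 2, reference form: cyclic convolution of two equal-length vectors.

    Computed directly for clarity; an FFT/NTT would produce the same result.
    """
    size = len(u)
    out = [0] * size
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[(i + j) % size] += ui * vj
    return out


def strassen_multiply(a, b, n, w):
    """Multiply two integers below 2**(n*w) following steps (1) to (3)."""
    # Step 1: split into n digits, then zero-pad to 2n entries so that the
    # cyclic wrap-around never mixes high and low product digits.
    va = split(a, n, w) + [0] * n
    vb = split(b, n, w) + [0] * n
    # Step 2: 2n-point cyclic convolution.
    c = cyclic_convolution(va, vb)
    # Step 3: recombine, c = sum(c_i * 2**(i*w)); carries resolve in the sum.
    return sum(ci << (i * w) for i, ci in enumerate(c))


if __name__ == "__main__":
    x, y = 0x0123456789ABCDEF, 0xFEDCBA9876543210
    print(strassen_multiply(x, y, n=4, w=16) == x * y)
```

The zero-padding in step 1 is what doubles the transform length, which is the cost the rest of this brief sets out to remove.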
1549-7747 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Algorithm 1 k-point negative wrapped convolution algorithm [11]
Input: Discrete signals $a = (a_0, a_1, \cdots, a_{k-1})$, $b = (b_0, b_1, \cdots, b_{k-1})$. Primitive $k$-th root $\omega$ satisfying $\omega^k = 1$, $\omega^i \neq 1$ for all $0 < i < k$. Negative factor $\phi$ satisfying $\phi^2 = \omega$.
Output: $c = (c_0, c_1, \cdots, c_{k-1})$, where $c_i = \sum_{j=0}^{i} a_j b_{i-j} - \sum_{j=i+1}^{k-1} a_j b_{k+i-j}$
1: for i = 0 to k − 1 do
2:   $\bar{a}_i \leftarrow a_i \phi^i$
3:   $\bar{b}_i \leftarrow b_i \phi^i$
4: end for
5: $\bar{A} \leftarrow \mathrm{FFT}_{\omega}^{k}(\bar{a})$
6: $\bar{B} \leftarrow \mathrm{FFT}_{\omega}^{k}(\bar{b})$
7: for i = 0 to k − 1 do
8:   $\bar{C}_i \leftarrow \bar{A}_i \bar{B}_i$
9: end for
10: $\bar{c} \leftarrow \mathrm{IFFT}_{\omega}^{k}(\bar{C})$
11: for i = 0 to k − 1 do
12:   $c_i \leftarrow \bar{c}_i \phi^{-i}$
13: end for
14: return c

Algorithm 2 Ping-pong algorithm [12] for k-point FFT
Input: Signal $x = (x_0, x_1, \cdots, x_{k-1})$, where $k = 2^d$. Primitive $k$-th root $\omega$ satisfying $\omega^k = 1$, $\omega^i \neq 1$ for all $0 < i < k$. An external contiguous memory $y$ of the same size as $x$. Memory pointers $Px$, $Py$.
Output: FFT result of vector x
1: Initialize $J \leftarrow 1$, $Px \leftarrow x$, $Py \leftarrow y$
2: for i = 1 to d do
3:   $m \leftarrow 0$
4:   while m < k/2 do
5:     $rc \leftarrow \omega^m$
6:     for j = 1 to J do
7:       $Py[0] \leftarrow Px[0] + Px[k/2]$
8:       $Py[J] \leftarrow rc\,(Px[0] - Px[k/2])$
9:       $Px \leftarrow Px + 1$, $Py \leftarrow Py + 1$
10:     end for
11:     $Py \leftarrow Py + J$, $m \leftarrow m + J$
12:   end while
13:   $J \leftarrow 2J$, $Px \leftarrow Px - k/2$, $Py \leftarrow Py - k$
14:   $(Px, Py) \leftarrow (Py, Px)$
15: end for
16: return the data in buffer x if d is even, the data in buffer y if d is odd.
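As an illustration, Algorithms 1 and 2 can be modelled in software as below. This is a minimal sketch under stated assumptions: it works over the prime field with p = 2^64 − 2^32 + 1 that this brief uses for its NTT, the roots ω and φ are derived here from the element 7 (a quadratic non-residue modulo p) and are not the constants quoted in the paper, and the pointers Px, Py of Algorithm 2 are replaced by array indices. All function names are hypothetical.

```python
P = 2**64 - 2**32 + 1  # prime modulus used for the NTT in this brief


def roots(k):
    """Return (omega, phi) with phi**2 == omega and omega a primitive k-th root.

    Assumption: derived from 7, so these differ from the paper's constants.
    k must be a power of two dividing 2**31.
    """
    phi = pow(7, (P - 1) // (2 * k), P)
    return phi * phi % P, phi


def pingpong_ntt(x, omega):
    """Algorithm 2: natural-order transform using two alternating buffers."""
    k = len(x)
    half = k // 2
    src, dst = list(x), [0] * k
    J = 1
    while J < k:                       # d = log2(k) stages
        for m in range(0, half, J):
            rc = pow(omega, m, P)      # round constant for this block
            for r in range(J):
                u, v = src[m + r], src[m + r + half]
                dst[2 * m + r] = (u + v) % P
                dst[2 * m + r + J] = rc * (u - v) % P
        J *= 2
        src, dst = dst, src            # ping-pong swap of the two buffers
    return src


def negacyclic_convolution(a, b):
    """Algorithm 1: negative wrapped convolution, results reduced mod P."""
    k = len(a)
    omega, phi = roots(k)
    a_bar = [ai * pow(phi, i, P) % P for i, ai in enumerate(a)]
    b_bar = [bi * pow(phi, i, P) % P for i, bi in enumerate(b)]
    A = pingpong_ntt(a_bar, omega)
    B = pingpong_ntt(b_bar, omega)
    C = [s * t % P for s, t in zip(A, B)]
    inv_omega = pow(omega, P - 2, P)
    inv_k = pow(k, P - 2, P)
    inv_phi = pow(phi, P - 2, P)
    c_bar = [v * inv_k % P for v in pingpong_ntt(C, inv_omega)]
    return [v * pow(inv_phi, i, P) % P for i, v in enumerate(c_bar)]


def naive_negacyclic(a, b):
    """Direct evaluation of the output formula of Algorithm 1, mod P."""
    k = len(a)
    c = [0] * k
    for i in range(k):
        for j in range(k):
            sign = 1 if i + j < k else -1
            c[(i + j) % k] = (c[(i + j) % k] + sign * a[i] * b[j]) % P
    return c


if __name__ == "__main__":
    a, b = [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]
    print(negacyclic_convolution(a, b) == naive_negacyclic(a, b))
```

Negative coefficients appear as residues close to p in this model. Each stage reads the source buffer at consecutive indices and writes the other buffer, which is the continuous access pattern that motivates the ping-pong variant.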
The Fast Fourier transform used in steps 5, 6 and 10 of Algorithm 1 is the most computationally intensive operation. The ping-pong FFT algorithm [12] is a special variant of the Fast Fourier transform which avoids scrambling between stages. Unlike the constant-geometry FFT used in [13] or the in-place FFT used in [7], ping-pong FFT provides continuous memory access at the cost of an extra copy of the data. The ping-pong FFT algorithm can be faster than constant-geometry FFT or in-place FFT on a hardware platform because continuous access allows a higher clock frequency and less data correlation. One typical version of the ping-pong FFT algorithm is described in Algorithm 2.

Fig. 1. Strassen's algorithm without zero-padding

In this work, we consider both the cyclic convolution in (2) and the negative wrapped convolution in (3). We have observed that if both convolution results $c^+$ and $c^-$ are obtained, one can easily recover the multiplication result $c_i$ through (4).

\[ c_i = \begin{cases} (c_i^+ + c_i^-)/2 & (0 \le i < n) \\ (c_{i-n}^+ - c_{i-n}^-)/2 & (n \le i < 2n) \end{cases} \tag{4} \]

Based on (4), we propose an improved Strassen's algorithm as Fig. 1 shows. In our proposed algorithm, the two input integers $a$ and $b$ are regarded as two $n$-length vectors $\mathbf{a}$, $\mathbf{b}$ without zero-padding.
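The recovery rule in (4) can be checked with a small integer model. The sketch below is illustrative only: it computes both length-n convolutions directly over the integers (a hardware design would use n-point transforms instead), and the function names are hypothetical.

```python
def wrapped_convolution(a, b, sign):
    """Length-n convolution: sign=+1 gives cyclic, sign=-1 negative wrapped."""
    n = len(a)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] += a[i] * b[j]
            else:
                out[i + j - n] += sign * a[i] * b[j]   # wrapped-around term
    return out


def recover_product_digits(c_plus, c_minus):
    """Apply (4): rebuild the 2n product digits from c+ and c-."""
    n = len(c_plus)
    low = [(c_plus[i] + c_minus[i]) // 2 for i in range(n)]    # 0 <= i < n
    high = [(c_plus[i] - c_minus[i]) // 2 for i in range(n)]   # n <= i < 2n
    return low + high


def multiply_without_padding(a, b, n, w):
    """Multiply two integers below 2**(n*w) using only length-n convolutions."""
    mask = (1 << w) - 1
    va = [(a >> (i * w)) & mask for i in range(n)]
    vb = [(b >> (i * w)) & mask for i in range(n)]
    c_plus = wrapped_convolution(va, vb, +1)
    c_minus = wrapped_convolution(va, vb, -1)
    c = recover_product_digits(c_plus, c_minus)
    return sum(ci << (i * w) for i, ci in enumerate(c))


if __name__ == "__main__":
    x, y = 0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF
    print(multiply_without_padding(x, y, n=4, w=16) == x * y)
```

The divisions by 2 are exact: the sum of c+ and c− is twice the low half of the linear convolution, and their difference is twice the high half.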
A. Overall VLSI architecture

Based on the multiplication algorithm proposed in Section III, we design a 768k-bit multiplier. A parameter setup similar to that of [6] is employed. The 768k-bit integers are divided into 32k pieces of 24-bit digits without zero-padding. NTT is performed over a finite prime field with modulus $p = 2^{64} - 2^{32} + 1$, 32k-th primitive root $\omega$ = 0x0a22c55c8f3b59cc ($\omega^{32k} = 1 \bmod p$), and negative factor $\phi$ = 0x6b5de761379102cd ($\phi^2 = \omega \bmod p$).

Fig. 2 shows the overall VLSI architecture of our proposed multiplier. Our multiplier consists of three major parts: the memory part, the arithmetic part and the control part. The memory part includes three blocks of dual-port memory named MEM a, MEM b and MEM c. The three memory blocks store the input data and intermediate results of ping-pong FFT. The arithmetic part consists of one IO unit, one round constant generating unit and 4 butterfly units (named IO unit, rc gen unit and BF unit 1-4, respectively). The IO unit handles both the input and output process. In the input process, it converts the 48-bit input digit into two 24-bit digits $a$, $b$ and their negative wrapping digits $\bar{a}$, $\bar{b}$. In the output process, it merges the cyclic convolution result $c^+$ and the negative convolution result $c^-$ into one final result and resolves the carry chain. The round constant generating unit generates the round constant and the negative factor. The butterfly units are the core arithmetic units, which can be configured to calculate the Fourier transform, the inverse Fourier transform or the digit-wise multiplication of transform results. The control part of our proposed multiplier is a state machine used to control the memory part and the arithmetic part.

Fig. 2. Overall VLSI architecture
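The parameter setup can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it confirms that 768k bits split into 24-bit digits gives 32k digits, that the largest possible convolution coefficient stays below p so modular results equal the integer results, and that p admits roots of the required orders. The roots used here are derived from the element 7 as an assumption for illustration; they are not the ω and φ constants quoted above, and the helper name is hypothetical.

```python
P = 2**64 - 2**32 + 1      # modulus p
BITS = 768 * 1024          # operand size, with 1k = 1024
W = 24                     # digit width in bits
N = BITS // W              # number of digits per operand

# Largest coefficient a length-N convolution of W-bit digits can produce.
MAX_COEFF = N * (2**W - 1) ** 2


def has_exact_order(root, order, p=P):
    """True if `root` has multiplicative order exactly `order` (a power of two)."""
    return pow(root, order, p) == 1 and pow(root, order // 2, p) != 1


# Illustrative roots (assumption): 7 is a quadratic non-residue modulo P,
# so 7**((P-1)/(2N)) has order exactly 2N and its square has order N.
phi = pow(7, (P - 1) // (2 * N), P)
omega = phi * phi % P

if __name__ == "__main__":
    print("digits per operand:", N)
    print("coefficients fit below p:", MAX_COEFF < P)
    print("omega has order N:", has_exact_order(omega, N))
    print("phi has order 2N:", has_exact_order(phi, 2 * N))
```

The same `has_exact_order` check could be applied to the constants listed in the text to confirm a transcription of them.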
B. Memory organization

Each of the three memory blocks in Fig. 2 uses a 4.2 Mb (256b × 16k) dual-port memory. Fig. 3 shows the memory organization in detail (taking MEM a as an example). Each address of MEM a stores two 64-bit digits of vector $a$ and two 64-bit digits of the negative wrapped vector $\bar{a}$. When performing the Fourier transform, data at address 0 and address J are fetched from different ports simultaneously and processed in parallel by the 4 butterfly units.

Fig. 3. Memory organization of MEM a (dual-port block memory with ports data1/addr1/wr_en1/q1 and data2/addr2/wr_en2/q2; width = 64×4 bits, height = 16k, 14-bit address; each row holds two digits of $a$ and the two corresponding digits of $\bar{a}$)

[Figure: butterfly data path with pipeline stages 0 to 3, inputs Px[0], Px[n/2], rc, a[0], b[0], a reduction block, and outputs Py[0], Py[J], c[0]]

The detailed state transfer diagram is shown in Fig. 5. When working in continuous multiplication mode, input states are performed in parallel with transform states. In single multiplication mode, all the states are performed sequentially.

Fig. 5. State machine transfer diagram (states: idle, input a, input b, transform a into A, transform b into B, mult C = AB, inverse transform C, output c; single mult and continuous mult modes)

TABLE I
IMPLEMENTATION RESULTS ON ALTERA'S STRATIX-V FPGA

Design Utilization Summary

| Logic Utilization | Used | Available | Utilization |
|---|---|---|---|
| Combinational ALUT | 7568 | 256600 | 2.9% |
| Total registers | 3437 | 513200 | 0.7% |
| Total block memory bits | 12,582,912 | 19,599,360 | 64% |
| Total DSP blocks | 72 | 256 | 28% |

| Item | Value |
|---|---|
| Logic clock frequency | 181 MHz |
| Clock cycles, Input/Output | 16385 |
| Clock cycles, NTT/INTT | 122955 |
| Clock cycles, Mult | 16390 |
| Total clock cycles | 418025 |
| Multiplication time | 2.30 ms |
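The memory and timing figures reported here can be cross-checked with simple arithmetic, as in the sketch below. The values are copied from Section IV-B and Table I. The breakdown of the total cycle count into two input/output phases, three transforms and one multiplication phase is only an observation that reproduces the reported total; the source does not state this breakdown explicitly.

```python
# Memory: three dual-port blocks, each 256 bits wide and 16k words deep.
WORD_BITS = 4 * 64
DEPTH = 16 * 1024
block_bits = WORD_BITS * DEPTH          # about 4.2 Mb per block
total_memory_bits = 3 * block_bits      # MEM a, MEM b, MEM c

# Each address holds two digits of the vector, so 32k digits need 16k addresses.
addresses_needed = (32 * 1024) // 2

# Timing values from Table I.
CYCLES_IO, CYCLES_NTT, CYCLES_MULT = 16385, 122955, 16390
TOTAL_CYCLES = 418025
CLOCK_HZ = 181e6

# Assumed breakdown (not stated in the source): 2 I/O + 3 NTT/INTT + 1 mult.
assumed_total = 2 * CYCLES_IO + 3 * CYCLES_NTT + CYCLES_MULT

multiplication_time_ms = TOTAL_CYCLES / CLOCK_HZ * 1e3

if __name__ == "__main__":
    print("bits per block:", block_bits)
    print("total block memory bits:", total_memory_bits)
    print("assumed breakdown matches total:", assumed_total == TOTAL_CYCLES)
    print(f"multiplication time: {multiplication_time_ms:.3f} ms")
```

The total of 12,582,912 bits agrees with the block memory row of Table I, and the computed time agrees with the reported 2.30 ms to within rounding.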
TABLE II
IMPLEMENTATION RESULT COMPARISON

| Design | Platform | Size (k bits) | Frequency (MHz) | Time (ms) | Resources | ATP^a | Normalized ATP^b |
|---|---|---|---|---|---|---|---|
| Wang [4] | Altera Stratix-V | 768 | 100 | 0.375 | 463k ALUTs + 336k registers | 173.6 | 226.0 |
| Wang [6] | Altera Stratix-V | 768 | 229.4 | 0.206 | 243k ALUTs + 245k registers | 50.1 | 65.2 |
| Feng [7] | Altera Stratix-V | 1024 | 170 | 4.9^c | 7.9k ALUTs + 3.6k registers | 38.7 | 37.8 |
| [7]'s method under 768k mult^d | Altera Stratix-V | 768 | 180 | 3.5^c | 7.2k ALUTs + 2.8k registers | 25.2 | 32.8 |
| This design | Altera Stratix-V | 768 | 181 | 2.30 | 7.6k ALUTs + 3.4k registers | 17.5 | 22.8 (-30%) |
| Doroz [5] | TSMC 90nm | 1152 | 666 | 7.74 | 26.7 Mgates | 206.6 | 179.3 |
| [5]'s method under 768k mult^d | TSMC 90nm | 768 | 675 | 5.1 | 17.2 Mgates | 87.7 | 114.2 |
| This design | TSMC 90nm | 768 | 370 | 1.13 | 23.0 Mgates | 26.0 | 30.0 (-74%) |

a ATP is the product of time and resources, in units of ALUT·s for FPGA designs and Kgates·s for ASIC designs.
b Normalized ATP is the quotient of a design's ATP by its multiplication bit size (in Mb). It has units of ALUT·s/Mb for FPGA designs or Kgates·s/Mb for ASIC designs.
c [7] is a pipelined structure. 4.9 ms is the time interval between two pipelined multiplications.
d These results were obtained by us by transplanting the methods of [7] and [5] into new 768k-bit designs.
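The ATP and normalized ATP definitions in footnotes a and b can be applied to the FPGA rows of Table II as follows. This is a small worked sketch with values copied from the table. It assumes that only ALUTs count as resources and that a size of 768k bits is taken as 0.768 Mb, which is the convention that reproduces the tabulated figures; the function names are hypothetical.

```python
# (size in k bits, time in ms, resources in k ALUTs), from the FPGA rows of Table II.
FPGA_ROWS = {
    "Wang [4]": (768, 0.375, 463),
    "Wang [6]": (768, 0.206, 243),
    "Feng [7]": (1024, 4.9, 7.9),
    "[7] under 768k": (768, 3.5, 7.2),
    "This design": (768, 2.30, 7.6),
}


def atp(time_ms, k_aluts):
    """Area-time product in ALUT*s (ms times k ALUTs cancels to ALUT*s)."""
    return time_ms * k_aluts


def normalized_atp(size_kbit, time_ms, k_aluts):
    """ATP divided by the multiplication size in Mb (assumed size_kbit / 1000)."""
    return atp(time_ms, k_aluts) / (size_kbit / 1000.0)


def reduction_percent(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (1.0 - ours / baseline)


if __name__ == "__main__":
    ours = normalized_atp(*FPGA_ROWS["This design"])
    for name, (size, time_ms, k_aluts) in FPGA_ROWS.items():
        n_atp = normalized_atp(size, time_ms, k_aluts)
        print(f"{name:15s} ATP={atp(time_ms, k_aluts):6.1f} "
              f"normalized={n_atp:6.1f} "
              f"reduction by this design={reduction_percent(n_atp, ours):5.1f}%")
```

Under these assumptions the computed reductions relative to [4], [6] and the 768k-bit version of [7] come out close to 90, 65 and 30 percent.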
Compared with [4] and [6], which use high-radix FFT units that require a complicated memory structure and high area cost, our design improves the multiplier efficiency at the algorithm level using negative wrapped convolution and ping-pong FFT. In total, we reduce the ATP by 89 percent and 65 percent compared with [4] and [6], respectively, at the same multiplication size.

Compared with [7], which improves NTT efficiency through a double modulus method, our design requires fewer resources and achieves faster speed due to the continuous memory access of ping-pong FFT. We achieve a 30 percent reduction in ATP under 768k multiplication.

Compared with the ASIC design [5] using recursive FFT, our design on the same platform is 6.8 times faster at 1.3 times the resource cost. The ATP under 768k multiplication is reduced by 74 percent.

VI. CONCLUSION

In this brief, we proposed a method to accelerate the FHE integer multiplier using negative wrapped convolution and an optimized ping-pong FFT algorithm. Implementation results show that our method can significantly improve the area-efficiency of large FHE integer multipliers, which makes it practical for resource-constrained fully homomorphic encryption applications.

REFERENCES

[1] C. Gentry, “Fully Homomorphic Encryption Using Ideal Lattices,” in STOC, vol. 9, 2009, pp. 169–178.
[2] C. Gentry and S. Halevi, “Implementing Gentry's Fully-homomorphic Encryption Scheme,” in Advances in Cryptology–EUROCRYPT 2011. Springer, 2011, pp. 129–148.
[3] M. Van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan, “Fully Homomorphic Encryption over the Integers,” in Advances in Cryptology–EUROCRYPT 2010. Springer, 2010, pp. 24–43.
[4] W. Wang and X. Huang, “FPGA Implementation of a Large-number Multiplier for Fully Homomorphic Encryption,” in Circuits and Systems (ISCAS), 2013 IEEE International Symposium on. IEEE, 2013, pp. 2589–2592.
[5] Y. Doroz, E. Ozturk, and B. Sunar, “Evaluating the Hardware Performance of a Million-bit Multiplier,” in Digital System Design (DSD), 2013 Euromicro Conference on. IEEE, 2013, pp. 955–962.
[6] W. Wang, X. Huang, N. Emmart, and C. Weems, “VLSI Design of a Large-number Multiplier for Fully Homomorphic Encryption,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 22, no. 9, pp. 1879–1887, 2014.
[7] X. Feng and S. Li, “Design of an Area-Effcient Million-Bit Integer Multiplier Using Double Modulus NTT,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 9, pp. 2658–2662, 2017.
[8] J. Ding and S. Li, “A Modular Multiplier Implemented with Truncated Multiplication,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2017.
[9] X. Huang and W. Wang, “A Novel and Efficient Design for an RSA Cryptosystem With a Very Large Key Size,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 62, no. 10, pp. 972–976, 2015.
[10] A. Schönhage and V. Strassen, “Schnelle Multiplikation Grosser Zahlen,” Computing, vol. 7, no. 3-4, pp. 281–292, 1971.
[11] V. Lyubashevsky, D. Micciancio, C. Peikert, and A. Rosen, “SWIFFT: A Modest Proposal for FFT Hashing,” Lecture Notes in Computer Science, vol. 5086, pp. 54–72, 2008.
[12] R. Crandall and C. Pomerance, Prime Numbers: A Computational Perspective. Springer Science & Business Media, 2006, vol. 182.
[13] D. D. Chen, N. Mentens, F. Vercauteren, S. S. Roy, R. C. Cheung, D. Pao, and I. Verbauwhede, “High-speed Polynomial Multiplication Architecture for Ring-LWE and SHE Cryptosystems,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 62, no. 1, pp. 157–166, 2015.