A Modular Multiplier Implemented With Truncated Multiplication
Article in IEEE Transactions on Circuits and Systems II: Express Briefs · November 2017
DOI: 10.1109/TCSII.2017.2771239
2 authors, including:
Jinnan Ding
Tsinghua University
All content following this page was uploaded by Jinnan Ding on 24 March 2019.
Abstract—In this paper, we propose a modular multiplication algorithm with four truncated multiplications to reduce the critical path. According to our algorithm, a high-speed 3-stage modular multiplier is constructed. Synthesized with TSMC 90nm, our design can perform a 256-bit modular multiplication in every clock period of 3.58ns with a circuit scale of approximately 629K equivalent gates. Further, by utilizing the modular multiplier, we construct an Elliptic Curve Cryptography (ECC) processor, which can perform a scalar multiplication in 19.4µs, one of the fastest to date in the published literature.

Algorithm 1 Improved Barrett Modular Multiplication
Input: $A < 2M$, $B < 2M$, $2^{n-1} < M < 2^n$, $\mu = \lfloor 2^{2n+3} / M \rfloor$.
Output: $R \equiv AB \pmod{M}$, $R < 2M$.
1: $T = A \times B$, $T1 = \lfloor T / 2^{n-2} \rfloor$.
2: $q = \lfloor (\mu \times T1) / 2^{n+5} \rfloor$.
3: $R = T - q \times M$.
4: return $R$.
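The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' hardware design: it uses exact integer floors for both divisions (implemented as right shifts), and the function name `barrett_modmul` is an assumption introduced here.

```python
def barrett_modmul(a: int, b: int, m: int, n: int) -> int:
    """Sketch of Algorithm 1 (improved Barrett) with exact integer floors.

    The name and the Python-integer model are illustrative assumptions.
    Returns R with R congruent to a*b (mod m) and 0 <= R < 2*m.
    """
    assert (1 << (n - 1)) < m < (1 << n)          # 2^(n-1) < M < 2^n
    assert 0 <= a < 2 * m and 0 <= b < 2 * m      # input bounds of Algorithm 1
    mu = (1 << (2 * n + 3)) // m                  # constant, depends only on M
    t = a * b                                     # step 1: full product T
    t1 = t >> (n - 2)                             #         T1 = floor(T / 2^(n-2))
    q = (mu * t1) >> (n + 5)                      # step 2: quotient estimate
    return t - q * m                              # step 3: R = T - q*M
```

Because the estimate `q` never exceeds the true quotient and falls short of it by at most one, the result stays in the range $[0, 2M)$, so the output can be fed back as an input of the next multiplication without a final correction.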
multiplier. Furthermore, it may also introduce a negative number if applied to BMM.

In [6], the authors constructed a truncated multiplier that was capable of performing speculation and correction independently. Fig. 1 illustrates its structure diagram, where the partial product array (PPA) of the multiplier is divided into the most significant part (MSP), least significant part (LSP) and the carry-estimation part. The truncated product can be corrected according to the carry output from the LSP and the carry-estimation part.

As for an MSM used for BMM, a small error is tolerable; therefore, we can discard LSP in [6] and take the MSP and carry-estimation part as the MSM applied in BMM.

Algorithm 2 Proposed Modular Multiplication Algorithm
Input: $A < 4M$, $B < 4M$, $2^{n-1} < M < 2^n$, $\mu = \lfloor 2^{2n+4} / M \rfloor$.
Output: $R \equiv AB \pmod{M}$, $R < 4M$.
1: $T2 = A \ltimes_{(n-2,k_1)} B$, $T1 = A \rtimes\ltimes_{(n-2,k_1)} B$, $T0 = A \rtimes_{n-2-k_1} B$.
2: $q = \mu \ltimes_{(n+6,k_2)} T2$, $AB = T2 \times 2^{n-2} + T1 \times 2^{n-2-k_1} + T0$.
3: $R = (AB - q \rtimes_{n+2} M) \bmod 2^{n+2}$.
4: return $R$.

[Fig. 2 (recoverable labels only): Booth encoder; MSM and LSM regions; column widths n1+n2-n+2 bits, k bits, n-k-2 bits; row counts k_row and n_row; outputs T2, T1.]
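Algorithm 2 can be exercised numerically with a simplified software model. The sketch below is an assumption-laden illustration rather than the authors' design: it uses a plain unsigned AND-array PPA (one row per multiplier bit, no Booth encoding, so no carry-correction case arises), it picks the carry-estimation width as $k = \lceil \log_2(\text{rows}) \rceil$, which is a slightly more conservative choice than the paper's, and the names `split_product` and `proposed_modmul` are hypothetical. It needs $n \ge 6$ so that the truncation column is non-negative.

```python
def split_product(a: int, b: int, t: int, k: int):
    """Split a*b into (T2, T1, T0) over a plain AND-array PPA.

    Columns >= t-k are summed by the MSM, columns < t-k by the LSM.
    The two sums never exchange carries, like two separate reduction trees.
    """
    c = t - k                                  # lowest column kept by the MSM
    hi = lo = 0
    for i in range(b.bit_length()):
        if (b >> i) & 1:
            row = a << i                       # partial-product row i
            hi += row >> c                     # MSM part, in units of 2^c
            lo += row & ((1 << c) - 1)         # LSM part
    t2 = hi >> k                               # drop the k carry-estimation bits
    t1 = hi & ((1 << k) - 1)                   # middle k bits
    return t2, t1, lo


def proposed_modmul(a: int, b: int, m: int, n: int) -> int:
    """Sketch of Algorithm 2; returns R with R = a*b (mod m) and R < 4*m."""
    assert (1 << (n - 1)) < m < (1 << n)
    assert 0 <= a < 4 * m and 0 <= b < 4 * m
    mu = (1 << (2 * n + 4)) // m
    k1 = (n + 1).bit_length()                  # ceil(log2(n+2)): rows of A*B
    k2 = (n + 5).bit_length()                  # ceil(log2(n+6)): rows of mu*T2
    t2, t1, t0 = split_product(a, b, n - 2, k1)          # step 1
    ab = (t2 << (n - 2)) + (t1 << (n - 2 - k1)) + t0     # step 2: exact AB
    q, _, _ = split_product(mu, t2, n + 6, k2)           # step 2: truncated q
    mask = (1 << (n + 2)) - 1
    return (ab - ((q * m) & mask)) & mask                # step 3: low n+2 bits only
```

In this model each truncated product underestimates the exact quotient by less than 2, so $q$ is at most 4 below $AB/M$ and never above it. That is why keeping only the low $n+2$ bits in step 3 still yields the true value of $AB - qM$.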
A. Proposed Algorithm

Our algorithm evolves from the improved BMM and is proposed as Algorithm 2. The input and output bounds of our algorithm are compatible, which makes it suitable for consecutive modular multiplication. In Algorithm 2, the operations "$\ltimes_n$" and "$\rtimes\ltimes_{(n,k)}$" are implemented by an MSM, and "$\rtimes_n$" by an LSM. Proofs of our algorithm will be provided later.

Normally, the PPA of a full-word multiplier is a parallelogram-shaped array. To simplify our deduction (not in practical hardware design), we modify the parallelogram array into a rectangle with a height of $n_{\mathrm{row}}$ and a width of $n_1 + n_2$, where $n_1$ and $n_2$ represent the bit lengths of the multiplicand and multiplier, respectively.

Fig. 2 depicts the structure of the MSM and LSM of step 1 in Algorithm 2. The hollow points denote added zeros, the black points represent the MSM, and the grey points belong to the LSM. Parameter $k_{\mathrm{row}}$ denotes the number of rows of grey points, and $n_{\mathrm{row}}$ depends on the length of the multiplier's operands and the generation method of the PPA, such as Booth encoding [7].

Let $pp_{i,j}$ denote the $(i+1)$th-row, $(j+1)$th-column element of the PPA in Fig. 2. Normally, we have

$$AB = \sum_{i=0}^{n_{\mathrm{row}}-1} \sum_{j=0}^{n_1+n_2-1} pp_{i,j} \times 2^{j} \bmod 2^{n_1+n_2}. \quad (1)$$

In this paper, we will employ these three partial products to represent $AB$.

"$A \rtimes_{n-k-2} B$" corresponds to the grey points in Fig. 2, while $n-k-2$ denotes the reserved bit length of the LSM:

$$A \rtimes_{n-k-2} B = \sum_{i=0}^{n_{\mathrm{row}}-1} \sum_{j=0}^{n-k-3} pp_{i,j} \times 2^{j}. \quad (3)$$

The operation "$\rtimes\ltimes_{(n-2,k)}$" sums up the middle $k$ columns in Fig. 2 (without carry bits):

$$A \rtimes\ltimes_{(n-2,k)} B = \sum_{i=0}^{n_{\mathrm{row}}-1} \sum_{j=n-k-2}^{n-3} pp_{i,j} \times 2^{j-n+2+k} \bmod 2^{k}. \quad (4)$$

The most significant truncated multiplication "$\ltimes_{(n-2,k)}$" adds up all the black points in Fig. 2 and discards the least significant $k$ bits, where the subscript $(n-2)$ represents the truncated bit length. It is defined as follows:

$$A \ltimes_{(n-2,k)} B = \frac{1}{2^{n-2}} \sum_{i=0}^{n_{\mathrm{row}}-1} \sum_{j=n-k-2}^{n_1+n_2-1} pp_{i,j}\, 2^{j} \bmod 2^{n_1+n_2-n-2}. \quad (5)$$

Our multiplications are all unsigned; however, sign extension in Booth encoding may introduce certain negative correction bits to the representations of $T2$, $T1$ and $T0$. These negative correction bits can all be expelled during the compression procedure, except for one occasion, which is
when $T2$ happens to be all 1's and a carry bit is generated by "$T1 \times 2^{n-k} + T0$", namely $T2 \times 2^{n} + T1 \times 2^{n-k} + T0 \ge 2^{n_1+n_2}$. In this circumstance, those 1's in $T2$ should be replaced with 0's, which we call carry-correction. Hence, there should be two cases for $T2$:

$$T2 = \begin{cases} 0, & \text{carry-correction case;} \\ A \ltimes_{(n-2,k)} B, & \text{default.} \end{cases} \quad (6)$$

B. Proof

According to (1) and (6), we define $\Delta e$ to be the difference between $A \times B / 2^{n}$ and $A \ltimes_{(n,k)} B$:

$$\begin{aligned}
\Delta e &= \frac{A \times B}{2^{n}} - A \ltimes_{(n,k)} B \\
&< \sum_{i=0}^{k_{\mathrm{row}}-1} \sum_{j=0}^{n-k-1} pp_{i,j} \times 2^{j-n} - \frac{1}{2^{k}} + \frac{1}{2^{n}} + 1 \\
&\le k_{\mathrm{row}} \times (2^{-k} - 2^{-n}) - 2^{-k} + 2^{-n} + 1 \\
&< (k_{\mathrm{row}} - 1) \times 2^{-k} + 1. \quad (7)
\end{aligned}$$

Parameter $k$ is the bit length of the carry-estimation part used to design our truncated multiplier. If we choose $k$ to satisfy

$$k = \lceil \log_2 (k_{\mathrm{row}} - 1) \rceil, \quad (8)$$

it results in $\Delta e < 2$. Moreover, it is obvious that $\Delta e \ge 0$; thus, the error introduced by the MSM is

$$0 \le \Delta e < 2. \quad (9)$$

To validate the correctness and compatibility (input and output bounds) of our proposed algorithm, the proof is demonstrated below.

In step 1 and step 2 of Algorithm 2, two "$\ltimes$" operations are utilized, which create two errors. Thus, we used $\Delta e_1$ and $\Delta e_2$ to

$$AB - q \times M < 4M < 2^{n+2}. \quad (10)$$

Additionally, $A \rtimes_{n+2} B$ only adds up the lowest $n+2$ bits; therefore, under the modular operation, we have:

$$(q \times M) \bmod 2^{n+2} = (q \rtimes_{n+2} M) \bmod 2^{n+2}. \quad (11)$$

Furthermore,

$$R = (AB - q \rtimes_{n+2} M) \bmod 2^{n+2} = AB - q \times M < 4M$$

and also $R \equiv AB \pmod{M}$.

IV. IMPLEMENTATIONS AND COMPARISONS

A. Implementation of modular multiplier

In our proposed algorithm, there are four truncated multiplications, three of which are consecutive multiplications ($\ltimes$ and $\rtimes\ltimes$ in step 1 represent one MSM). To exert parallelism and balance the delay, we utilize four truncated multipliers to construct a 3-stage pipeline modular multiplier, which can perform a modular multiplication every clock cycle as depicted in Fig. 3.

[Fig. 3 (recoverable labels only): inputs A and B; MUX; Register A; Register B; stage 1; LSM; MSM; Register T2; Register T1; Register T0.]
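The throughput claim of the 3-stage pipeline can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not the circuit of Fig. 3: the stage split (stage 1 forms $T2$ and the low bits of $AB$, stage 2 forms $q$, stage 3 forms $R$) is read off the steps of Algorithm 2, each truncated multiplier is replaced by an exact floor (whose error lies in $[0, 1)$ and therefore within the bound $0 \le \Delta e < 2$), and `make_pipeline` and `clock` are hypothetical names.

```python
def make_pipeline(m: int, n: int):
    """Cycle-level model of a 3-stage modular multiplier for a fixed modulus m."""
    assert (1 << (n - 1)) < m < (1 << n)
    mu = (1 << (2 * n + 4)) // m
    mask = (1 << (n + 2)) - 1
    regs = [None, None, None]        # registers at the outputs of stages 1, 2, 3

    def clock(inp=None):
        """Advance one cycle. inp is an (a, b) pair, or None for a bubble.

        Returns the content of the stage-3 register after this clock edge.
        """
        s1, s2, _ = regs             # sample the old register values first
        # Stage 3: R = (AB - q*M) mod 2^(n+2); only the low n+2 bits are needed.
        regs[2] = None if s2 is None else (s2[0] - s2[1] * m) & mask
        # Stage 2: q from the constant mu and T2 (exact floor stands in for the MSM).
        regs[1] = None if s1 is None else (s1[0], (mu * s1[1]) >> (n + 6))
        # Stage 1: low bits of AB and T2 = floor(AB / 2^(n-2)).
        if inp is None:
            regs[0] = None
        else:
            ab = inp[0] * inp[1]
            regs[0] = (ab & mask, ab >> (n - 2))
        return regs[2]

    return clock
```

Feeding one operand pair per call to `clock` yields one result per call after the first two calls, which mirrors a fully loaded pipeline: the latency is three cycles, while the throughput is one modular multiplication per cycle.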
$T2 \| T1$ and $T0$ are computed by two different reduction trees in parallel and their carry chains are not connected, resulting in their delays being independent and shorter than that of a full-word multiplier.

The following stage implements $q = \mu \ltimes_{n+6} T2$ from step 2 in Algorithm 2 by utilizing an MSM and a short adder. Since $\mu$ is a constant, the MSM in stage 2 becomes a constant multiplier, where some techniques involving a constant multiplier, such as high radix Booth encoding, can be used. Additionally, the computation of $T3$ appears as if it requires a long adder. In fact, only half the length of $T3$ is needed; hence, a short adder ($k + 4$ bits) was used instead, as illustrated in the dashed box in Fig. 2.

Precisely like the MSM in the second stage, the LSM computing the result $R$ is also a constant multiplier. The difference between them is that there is a subtraction in the computation of $R$ to be dealt with, which can be absorbed into the reduction tree, so as to effectively reduce the circuit delay.

In our Booth encoded multiplier, sign extension may cause $T3 \ge 2^{n_1+n_2}$, which is the carry-correction case in (6); therefore, $T2$ and $q$ should be multiplexed outputs.

B. Complexity analysis

In our proposed algorithm, there are four truncated multiplications, which can be implemented by four truncated multipliers to establish a 3-stage pipeline modular multiplier. When the pipeline is fully loaded, it can generate a modular multiplication result every clock cycle. Table I offers a complexity comparison with some common implementations of modular multiplication.

Word-based MMM (WB-MMM) in [9] is generally utilized for area-efficient designs, while its performance is not at the same level as full-word multiplier designs. When compared with BMM in [10] and the full-word Montgomery modular multiplication (FW-MMM) in [11], our algorithm exhibits a shorter delay and fewer clock cycles at the cost of about twice the area.

TABLE I: Complexity of modular multiplication implementations

Algorithms     Critical path                            Cycles  Resources
BMM [10]       n-bit multiplication                     3       1 n-bit multiplier
FW-MMM [11]    n-bit multiplication                     3       1 n-bit multiplier
WB-MMM [9]     full addition and several logic gates    n       n-bit full adder
Ours           n-bit truncated multiplication           1       4 truncated multipliers equivalent to 2 n-bit multipliers

C. Comparison of Modular Multipliers

Based on our proposed algorithm, a 256-bit modular multiplier has been realized over an arbitrary 256-bit prime field GF(p). Synthesized in TSMC 90nm with Design Compiler J-2014.09-SP3, the results of the experiment demonstrate that the critical path delay is 3.58ns in the worst corner, with a circuit scale of about 629K gates. With the pipeline technique, a modular multiplication result can be generated every clock cycle. Moreover, in order to establish a fair comparison with previous works, we implemented our design with 130nm.

As [9] and [12] in Table II are WB-MMM designs, the area cost was moderate, while their speed was limited. An interleaved modular multiplication algorithm (IMM), based on BMM and MMM for special modulus, was introduced in [13], and only consumed a very small area, which renders it appropriate for low-cost applications. For high-speed purposes, [11] applied FW-MMM, and while their area is comparable with ours, it takes 3 clock cycles to perform a modular multiplication.

We also include a bit-scan MMM design [14], and implement it in 90nm to make a fair comparison. Though the orientations of our design and some low-cost designs are different, the AT factor and its reciprocal, denoted as "performance", give one basis for comparison.

In order to demonstrate the enhancement of our proposed algorithm over the improved BMM in [4], we further implemented the full-word Barrett modular multiplication (FW-BMM) with a full-word 256-bit multiplier. The application of truncated multipliers reduced the requirement of the BMM from 3 full-word multiplications to 2 (LSM ≈ MSM ≈ 1/2 full-word multiplication). The removal of double-full-word final additions in our MSM makes the critical path 17% shorter than the full-word multiplier in [4], while the pipeline structure further enhances the throughput of our design, which led to a total improvement of 80% in terms of performance of our proposed algorithm compared with the improved BMM.

TABLE II: Comparison of 256-bit Modular Multipliers

Design  Technique  Process  Freq (MHz)  Gates (KGates)  Speed^α (ns)  AT^β   Performance (AT)^-1
[14]    Bit-MMM    90nm     752         30.0            121           5.24   0.191
[11]    FW-MMM     90nm     185         540             16.2          12.6   0.079
[9]     WB-MMM     90nm     286         125             88            15.9   0.063
Ours    FW-BMM     90nm     279         629             3.58          3.3    0.303
[12]    WB-MMM     130nm    556         122             500           61     0.016
[13]    IMM        130nm    321         45              200           9.0    0.111
[4]     FW-BMM     130nm    196         477             15            7.2    0.139
Ours    FW-BMM     130nm    230         910             4.35          4.0    0.250

α Average time cost of one modular multiplication.
β AT = Gate Counts × Time × 130nm/Technology, with the unit of Gate · ms.

D. Comparison of ECC Implementations

Using our modular multiplier, we built a 256-bit ECC processor supporting arbitrary elliptic curves over GF(p). The Montgomery powering ladder algorithm [15] was applied in the scalar multiplication. Furthermore, projective coordinates were utilized, and the 3-stage modular multiplier can work at full load with 11 cycles and 9 cycles to perform point addition and point doubling, respectively, according to the schedule method in [16]. The modular inversion is based on the binary extended Euclidean algorithm [17]. At a clock rate of 279MHz, a scalar multiplication requires about 19.4µs, which is the fastest we could locate in the published literature.

In Table III, [11] and [18] both used full-word MMM, whose area costs are comparable with ours; however, since they had to perform the MMM with three full-word multiplications,
the delay of the scalar multiplication was more than 3 times longer than ours. The WB-MMM was employed in [12] and [19], which allowed them to have a high clock rate with moderate area; however, it also increased the cycle counts of a modular multiplication. The IMM was employed in [20] and [21], both of which offered a good balance between area and delay, while [21] is a dual-field design and we only implemented its prime-field part in 90nm. In [22] and [23], the authors utilized the fast reduction method of the Mersenne-like modulus to accelerate modular multiplication, while they only supported one special prime of China's ECC algorithm, SM2. Although this fast reduction method can significantly speed up ECC operations, its flexibility is quite limited. Moreover, compared with [23], which only supported the SM2 prime with restricted elliptic curves and has no SPA countermeasures, our design is much more flexible, although it requires a little more area.

We also used the modular multiplier of FW-BMM in [4] to build an ECC processor. Since our 3-stage modular multiplier is fully occupied during the ECC scalar multiplication, the advantages of our design over [4] are substantial (more than 80%) in both Table II and Table III in terms of AT.

Owing to the use of truncated multipliers and the parallel schedules of our proposed pipeline modular multiplier, our ECC processor achieves an ultra-high performance while still exhibiting an adequate AT, which makes it suitable for computation-intensive applications.

TABLE III: Comparison of 256-bit ECC processors

Design  Technique  Process  Freq (MHz)  Gates (KGates)  Speed^α (µs)  AT^β   Performance (AT)^-1
[21]    IMM        90nm     725         52.4            632           47.8   0.021
[11]    FW-MMM     90nm     185         540             120           93.6   0.011
Ours    FW-BMM     90nm     279         659             19.4          18.5   0.054
[20]    IMM        130nm    110         168             510           85.7   0.012
[12]    WB-MMM     130nm    556         122             1010          123    0.008
[22]    SM2        130nm    228         156             208           32.4   0.031
[23]    SM2        130nm    164         659             20.4          13.4   0.075
[4]     FW-BMM     130nm    189         507             82.9          42.0   0.024
Ours    FW-BMM     130nm    230         952             23.5          22.4   0.045
[18]    FW-MMM     180nm    200         750             95            51.5   0.019
[19]    WB-MMM     180nm    333         94              1480          100    0.010
Ours    FW-BMM     180nm    136         977             39.8          28.1   0.036

α Average time cost of one scalar multiplication.
β AT = Gate Counts × Time × 130nm/Technology, with the unit of Gate · s.

V. CONCLUSION

In this paper, a modular multiplication algorithm with truncated multiplications was proposed, according to which a 3-stage pipeline modular multiplier with four truncated multipliers was constructed. When the modular multiplier was applied in an ECC processor, it demonstrated the highest single-core performance in the published literature as well as a good area-time efficiency, which is appropriate for server-side applications.

REFERENCES

[2] P. Barrett, “Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor,” Advances in Cryptology CRYPTO’86, pp. 311–323, 1986.
[3] H. Laszlo, “Fast Truncated Multiplication for Cryptographic Applications,” Lecture Notes in Computer Science, vol. 3659, pp. 211–225, 2005.
[4] Y. Kong, “Optimizing the Improved Barrett Modular Multipliers for Public-Key Cryptography,” International Conference on Computational Intelligence and Software Engineering, pp. 1–4, 2010.
[5] H.-J. Ko and Hsiao, “Design and Application of Faithfully Rounded and Truncated Multipliers With Combined Deletion, Reduction, Truncation, and Rounding,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, no. 5, pp. 304–308, 2011.
[6] S. K. Chen, C. W. Liu, T. Y. Wu, and A. C. Tsai, “Design and Implementation of High-Speed and Energy-Efficient Variable-Latency Speculating Booth Multiplier (VLSBM),” IEEE Transactions on Circuits and Systems I Regular Papers, vol. 60, no. 10, pp. 2631–2643, 2013.
[7] A. D. Booth, “A signed binary multiplication technique,” Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–240, 1951.
[8] C. S. Wallace, “A Suggestion for a Fast Multiplier,” IEEE Transactions on Electronic Computers, vol. EC-13, no. 1, pp. 14–17, 1964.
[9] S. R. Kuang, K. Y. Wu, and R. Y. Lu, “Low-Cost High-Performance VLSI Architecture for Montgomery Modular Multiplication,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 2, pp. 434–443, Feb 2016.
[10] J. Groschdl, “High-Speed RSA Hardware Based on Barrets Modular Reduction Method,” in International Workshop on Cryptographic Hardware and Embedded Systems, 2000, pp. 191–203.
[11] S.-C. Chung, J.-W. Lee, H.-C. Chang, and C.-Y. Lee, “A high-performance elliptic curve cryptographic processor over GF(p) with SPA resistance,” 2012 IEEE International Symposium on Circuits and Systems, pp. 1456–1459, 2012.
[12] G. Chen, G. Bai, and H. Chen, “A High-Performance Elliptic Curve Cryptographic Processor for General Curves Over GF(p) Based on a Systolic Arithmetic Unit,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 5, pp. 412–416, 2007.
[13] M. Knežević, F. Vercauteren, and I. Verbauwhede, “Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods,” IEEE Transactions on Computers, vol. 59, no. 12, pp. 1715–1721, 2010.
[14] A. Rezai and P. Keshavarzi, “High-Throughput Modular Multiplication and Exponentiation Algorithms Using Multibit-Scan-Multibit-Shift Technique,” IEEE Transactions on Very Large Scale Integration Systems, vol. 23, no. 9, pp. 1710–1719, 2015.
[15] M. Joye and S. M. Yen, “The Montgomery powering ladder,” Lecture Notes in Computer Science, vol. 2523, pp. 291–302, 2002.
[16] N. Guillermin, “A high speed coprocessor for elliptic curve scalar multiplications over F(p),” in International Conference on Cryptographic Hardware and Embedded Systems, 2010, pp. 48–64.
[17] R. L’Orencz, “New Algorithm for the Classical Modular Inverse,” in Cryptographic Hardware and Embedded Systems, 2002.
[18] X. Zhang and S. Li, “A High Performance ASIC Based Elliptic Curve Cryptographic Processor over GF(p),” International Design and Test Workshop - IDT, pp. 182–186, 2007.
[19] D. Karakoyunlu, F. K. Gurkaynak, B. Sunar, and Y. Leblebici, “Efficient and side-channel-aware implementations of elliptic curve cryptosystems over prime fields,” IET Information Security, vol. 4, no. 1, pp. 30–43, March 2010.
[20] S. Ghosh, M. Alam, D. Chowdhury, and I. Gupta, “Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side channel attacks,” Computers and Electrical Engineering, vol. 35, no. 2, pp. 329–338, 2009.
[21] Z. Liu, D. Liu, and X. Zou, “An Efficient and Flexible Hardware Implementation of the Dual-Field Elliptic Curve Cryptographic Processor,” IEEE Transactions on Industrial Electronics, vol. 64, no. 3, pp. 2353–2362, 2017.
[22] D. Zhang and G. Bai, “Ultra high-performance ASIC implementation of SM2 with power-analysis resistance,” in 2015 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), June 2015, pp. 523–526.
[23] Z. Zhao and G. Bai, “Ultra High-Speed SM2 ASIC Implementation,” in 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications, Sept 2014, pp. 182–188.