A Modular Multiplier Implemented With Truncated Multiplication

This paper presents a modular multiplication algorithm utilizing truncated multiplications to enhance speed and efficiency, achieving a 256-bit modular multiplication in 3.58ns with a circuit scale of approximately 629K gates. The proposed algorithm constructs a high-speed 3-stage modular multiplier and integrates it into an Elliptic Curves Cryptography (ECC) processor, which performs scalar multiplication in 19.4µs, one of the fastest reported. The work demonstrates significant improvements in both performance and area-time efficiency compared to previous designs.



A Modular Multiplier Implemented With Truncated Multiplication

Article in IEEE Transactions on Circuits and Systems II: Express Briefs · November 2017
DOI: 10.1109/TCSII.2017.2771239


All content following this page was uploaded by Jinnan Ding on 24 March 2019.


A modular multiplier implemented with truncated multiplication

Jinnan Ding, Student Member, IEEE, and Shuguo Li, Member, IEEE

Abstract—In this paper, we propose a modular multiplication algorithm with four truncated multiplications to reduce the critical path. Based on this algorithm, a high-speed 3-stage modular multiplier is constructed. Synthesized with TSMC 90nm, our design can perform a 256-bit modular multiplication every clock period of 3.58ns with a circuit scale of approximately 629K equivalent gates. Further, by utilizing the modular multiplier, we construct an Elliptic Curve Cryptography (ECC) processor that can perform a scalar multiplication in 19.4µs, which is one of the fastest reported in the published literature.

Index Terms—modular multiplier, truncated multiplication, Barrett modular multiplication (BMM), elliptic curve cryptography (ECC).

I. INTRODUCTION

The fundamental operation of most public-key cryptography systems is modular multiplication, which also constitutes one of the bottlenecks in public-key cryptography processor design. Previous works have essentially proposed two methods to implement modular multiplication in high-performance designs, namely, the Montgomery modular multiplication (MMM) [1] and the Barrett modular multiplication (BMM) [2]. Both algorithms perform modular multiplication with 3 consecutive multiplications. Speeding up the process therefore requires a faster multiplier.

Among the multiplier designs, the truncated multiplier is a highly specialized type of multiplier that only computes part of the product. In [3], Laszlo Hars provided a detailed and systematic discussion of the truncated multiplier in terms of both time complexity and hardware cost. Introducing truncated multipliers in BMM, instead of normal ones, not only shortens the critical path but also saves hardware resources. As long as the error introduced by the truncated multiplier is small enough, custom-designed truncated multipliers can be applied.

In this paper, we propose an algorithm that evolves from the improved BMM and is appropriate for truncated multipliers. According to the algorithm, a 3-stage pipeline modular multiplier is implemented, and based on the modular multiplier, an ECC processor is constructed, which significantly outperforms previous works and still performs well in terms of area-time efficiency.

This work was supported by the National Natural Science Foundation of China under Grant No.61674086.
The authors are with the Institute of Microelectronics, Tsinghua University, Beijing, China (Email: [email protected]; [email protected])

II. BACKGROUND

A. Barrett modular multiplication

BMM was first presented in [2]. After the original BMM was introduced, Y. Kong evaluated different versions of the improved BMM and offered a general expression of input and output, with parameters introduced so that the input and output bounds fall in the same range, called "compatible" [4]. His algorithm is presented as Algorithm 1 with a selected set of parameters.

Algorithm 1 Improved Barrett Modular Multiplication
Input: $A < 2M$, $B < 2M$, $2^{n-1} < M < 2^{n}$, $\mu = \lfloor 2^{2n+3}/M \rfloor$.
Output: $R \equiv AB \pmod{M}$, $R < 2M$.
1: $T = A \times B$, $T_1 = \lfloor T / 2^{n-2} \rfloor$.
2: $q = \lfloor \mu \times T_1 / 2^{n+5} \rfloor$.
3: $R = T - q \times M$.
4: return $R$.

Step 1 in Algorithm 1 constitutes a full multiplication. In fact, the lower part of that multiplication is not needed until step 3, and only the higher part ($T_1$) is required in the immediate step 2. Similarly, step 2 only requires the higher part of the multiplication. Thus, it is possible to introduce the most significant multipliers (MSM) in both steps 1 and 2, instead of full-word multipliers. However, if we simply replace the full-word multiplication with the MSM, a significant error will be introduced, which causes the input and output bounds to become incompatible. Therefore, bound adjustments must be made to Algorithm 1, and the MSM should be constructed precisely in order to reduce the size of the introduced error.

Additionally, the result R in step 3 of Algorithm 1 is only (n + 1)-bit; hence, the multiplication in step 3 can be easily implemented with the least significant multiplier (LSM), which introduces no error at all; thus, no adjustment is required for the LSM.

B. Truncated multiplier

The truncated multiplier is often referred to as an MSM, since it is easy to implement an LSM. Based on the error compensation method, truncated multipliers can be categorized into two types: constant correction and variable correction. Reference [5] introduced a truncated multiplier with a faithfully rounded result; however, its area cost is approximately 2/3 of a full
multiplier. Furthermore, it may also introduce a negative number if applied to BMM.

In [6], the authors constructed a truncated multiplier that was capable of performing speculation and correction independently. Fig. 1 illustrates its structure diagram, where the partial product array (PPA) of the multiplier is divided into the most significant part (MSP), the least significant part (LSP) and the carry-estimation part. The truncated product can be corrected according to the carry output from the LSP and the carry-estimation part.

As for an MSM used for BMM, a small error is tolerable; therefore, we can discard the LSP in [6] and take the MSP and carry-estimation part as the MSM applied in BMM.

Fig. 1: Structure of truncated multiplier in [6]

III. PROPOSED ALGORITHM AND PROOFS

A. Proposed Algorithm

Our algorithm evolves from the improved BMM and is proposed as Algorithm 2. The input and output bounds of our algorithm are compatible, which makes it suitable for consecutive modular multiplication. In Algorithm 2, the operations "$\ltimes_{n}$" and "$\bowtie_{(n,k)}$" are implemented by an MSM, and "$\rtimes_{n}$" by an LSM. Proofs of our algorithm will be provided later.

Algorithm 2 Proposed Modular Multiplication Algorithm
Input: $A < 4M$, $B < 4M$, $2^{n-1} < M < 2^{n}$, $\mu = \lfloor 2^{2n+4}/M \rfloor$.
Output: $R \equiv AB \pmod{M}$, $R < 4M$.
1: $T2 = A \ltimes_{(n-2,k_1)} B$, $T1 = A \bowtie_{(n-2,k_1)} B$, $T0 = A \rtimes_{n-2-k_1} B$.
2: $q = \mu \ltimes_{(n+6,k_2)} T2$, $AB = T2 \times 2^{n-2} + T1 \times 2^{n-2-k_1} + T0$.
3: $R = (AB - q \rtimes_{n+2} M) \bmod 2^{n+2}$.
4: return $R$.

Normally, the PPA of a full-word multiplier is a parallelogram-shaped array. To simplify our deduction (not in practical hardware design), we modify the parallelogram array into a rectangle with a height of n_row and a width of $n_1+n_2$, where $n_1$ and $n_2$ represent the bit lengths of the multiplicand and multiplier, respectively.

Fig. 2 depicts the structure of the MSM and LSM of step 1 in Algorithm 2. The hollow points denote added zeros, while the black points represent the MSM, and the grey points belong to the LSM. Parameter k_row denotes the number of rows of grey points, and n_row depends on the length of the multiplier's operands and the generation method of the PPA, such as Booth encoding [7].

[Fig. 2: Structure of our truncated multiplier. Labels recoverable from the figure: Booth encoder; MSM and LSM regions of the PPA with column widths $n_1+n_2-n+2$ bits, $k$ bits and $n-k-2$ bits; row counts k_row and n_row; a Wallace-tree reduction and final addition for each region, producing T2, T1 and T0; a (k+4)-bit addition to compute AB.]

Let $pp_{i,j}$ denote the (i+1)th-row, (j+1)th-column element of the PPA in Fig. 2. Normally, we have

$$AB = \Big( \sum_{i=0}^{n\_row-1} \sum_{j=0}^{n_1+n_2-1} pp_{i,j} \times 2^{j} \Big) \bmod 2^{n_1+n_2}. \qquad (1)$$

The "$\bmod\ 2^{n_1+n_2}$" operation is utilized to eliminate the sign extension effect of Booth encoding. Here, AB is divided into three distinct parts according to (2).

$$AB = A \ltimes_{(n-2,k)} B \times 2^{n-2} + A \bowtie_{(n-2,k)} B \times 2^{n-k-2} + A \rtimes_{n-k-2} B \pmod{2^{n_1+n_2}}. \qquad (2)$$

In this paper, we will employ these three partial products to represent AB.

"$A \rtimes_{n-k-2} B$" corresponds to the grey points in Fig. 2, while $n-k-2$ denotes the reserved bit length of the LSM.

$$A \rtimes_{n-k-2} B = \sum_{i=0}^{n\_row-1} \sum_{j=0}^{n-k-3} pp_{i,j} \times 2^{j} \qquad (3)$$

The operation "$\bowtie_{(n-2,k)}$" sums up the middle k columns in Fig. 2 (without carry bits):

$$A \bowtie_{(n-2,k)} B = \Big( \sum_{i=0}^{n\_row-1} \sum_{j=n-k-2}^{n-3} pp_{i,j} \times 2^{j-n+2+k} \Big) \bmod 2^{k}. \qquad (4)$$

The most significant truncated multiplication "$\ltimes_{(n-2,k)}$" adds up all the black points in Fig. 2 and discards the least significant k bits, where the subscript $(n-2)$ represents the truncated bit length. It is defined as follows:

$$A \ltimes_{(n-2,k)} B = \Big\lfloor \frac{1}{2^{n-2}} \sum_{i=0}^{n\_row-1} \sum_{j=n-k-2}^{n_1+n_2-1} pp_{i,j}\, 2^{j} \Big\rfloor \bmod 2^{n_1+n_2-n-2} \qquad (5)$$

Our multiplications are all unsigned; however, sign extension in Booth encoding may introduce certain negative correction bits to the representations of T2, T1 and T0. These negative correction bits can all be expelled during the compression procedure, except for one occasion, which is
when T2 happens to be all 1's and a carry bit is generated by "$T1 \times 2^{n-k} + T0$", namely $T2 \times 2^{n} + T1 \times 2^{n-k} + T0 \ge 2^{n_1+n_2}$. In this circumstance, those 1's in T2 should be replaced with 0's, which we call carry-correction. Hence, there are two cases for T2:

$$T2 = \begin{cases} 0, & \text{carry-correction case;} \\ A \ltimes_{(n-2,k)} B, & \text{default.} \end{cases} \qquad (6)$$

B. Proof

According to (1) and (6), we define $\Delta e$ to be the difference between $A \times B / 2^{n}$ and $A \ltimes_{(n,k)} B$:

$$\begin{aligned}
\Delta e &= \frac{A \times B}{2^{n}} - A \ltimes_{(n,k)} B \\
&< \sum_{i=0}^{k\_row-1} \sum_{j=0}^{n-k-1} pp_{i,j} \times 2^{j-n} - \frac{1}{2^{k}} + \frac{1}{2^{n}} + 1 \\
&\le k\_row \times (2^{-k} - 2^{-n}) - 2^{-k} + 2^{-n} + 1 \\
&< (k\_row - 1) \times 2^{-k} + 1. \qquad (7)
\end{aligned}$$

Parameter k is the bit length of the carry-estimation part used to design our truncated multiplier. If we choose k to satisfy

$$k = \lceil \log_2 (k\_row - 1) \rceil, \qquad (8)$$

it results in $\Delta e < 2$. Moreover, it is obvious that $\Delta e \ge 0$; thus, the error introduced by the MSM is

$$0 \le \Delta e < 2. \qquad (9)$$

To validate the correctness and compatibility (input and output bounds) of our proposed algorithm, the proof is demonstrated below.

In step 1 and step 2 of Algorithm 2, two "$\ltimes$" operations are utilized, which create two errors. Thus, we use $\Delta e_1$ and $\Delta e_2$ to denote the errors of "$\ltimes_{(n-2,k_1)}$" and "$\ltimes_{(n+6,k_2)}$", respectively. By selecting two proper k values according to (8), $\Delta e_1$ and $\Delta e_2$ will fall in [0, 2). This results in the following:

$$\Delta e_1 = \frac{A \times B}{2^{n-2}} - A \ltimes_{(n-2,k_1)} B \in [0, 2),$$

$$\Delta e_2 = \frac{\mu \times T2}{2^{n+6}} - \mu \ltimes_{(n+6,k_2)} T2 \in [0, 2).$$

The constant µ in Algorithm 2 has a rounding error, that is

$$\mu = \Big\lfloor \frac{2^{2n+4}}{M} \Big\rfloor = \frac{2^{2n+4}}{M} - \delta, \quad 0 \le \delta < 1.$$

With the above preconditions, q can be computed:

$$\begin{aligned}
q &= \mu \ltimes_{n+6} T2 \\
&= \frac{\big(\frac{2^{2n+4}}{M} - \delta\big) \times \big(\frac{A \times B}{2^{n-2}} - \Delta e_1\big)}{2^{n+6}} - \Delta e_2 \\
&= \frac{AB}{M} - \frac{2^{n-2}}{M} \Delta e_1 - \frac{AB}{2^{2n+4}} \delta - \Delta e_2 + \frac{1}{2^{n+6}} \Delta e_1 \delta
\end{aligned}$$

With the ranges of the errors and the input requirements of Algorithm 2, $2^{n-1} < M < 2^{n}$, $A < 4M < 2^{n+2}$, and $B < 4M < 2^{n+2}$,

$$q > \frac{AB}{M} - \frac{2^{n-1}}{M} - \frac{AB}{2^{2n+4}} - 2 > \frac{AB}{M} - 4,$$

which gives

$$AB - q \times M < 4M < 2^{n+2}. \qquad (10)$$

Additionally, $A \rtimes_{n+2} B$ only adds up the lowest $n+2$ bits; therefore, under the modular operation, we have:

$$(q \times M) \bmod 2^{n+2} = (q \rtimes_{n+2} M) \bmod 2^{n+2} \qquad (11)$$

Furthermore,

$$R = (AB - q \rtimes_{n+2} M) \bmod 2^{n+2} = AB - q \times M < 4M,$$

and also $R \equiv AB \pmod{M}$.

IV. IMPLEMENTATIONS AND COMPARISONS

A. Implementation of modular multiplier

In our proposed algorithm, there are four truncated multiplications, three of which are consecutive multiplications ($\ltimes$ and $\bowtie$ in step 1 represent one MSM). To exert parallelism and balance the delay, we utilize four truncated multipliers to construct a 3-stage pipeline modular multiplier, which can perform a modular multiplication every clock cycle as depicted in Fig. 3.

[Fig. 3: Structure of modular multiplier. Labels recoverable from the figure: inputs A and B through MUXes into Register A and Register B; stage 1 with MSM and LSM, a MUX with a 0 input, and Registers T2, T1 and T0; stage 2 with constant µ, an MSM and an adder, a MUX with a 0 input, and Registers q and T3; stage 3 with an LSM and constant M, producing output R.]

The structure of our truncated multipliers (MSM and LSM) has already been shown in Fig. 2 and is similar to the design in [6]. Techniques such as Booth encoding [7] and the Wallace-tree reduction method [8] are applied to our truncated multipliers. The critical path is the MSM+Mux in the first and second stages and the LSM+Mux in the third stage. They are shorter than the critical path of a full-word multiplier, due to the half-sized Wallace-tree reduction and final addition.

Stage 1 of our modular multiplier completes step 1 in Algorithm 2, where T2‖T1 (the concatenation of T2 and T1) and T0 are computed by the MSM and LSM, respectively. This LSM and MSM can be merged into one module, so as to share the same Booth encoder as presented in Fig. 2. Nevertheless,
T2‖T1 and T0 are computed by two different reduction trees in parallel and their carry chains are not connected, resulting in their delays being independent and shorter than that of a full-word multiplier.

The following stage implements $q = \mu \ltimes_{n+6} T2$ from step 2 in Algorithm 2 by utilizing an MSM and a short adder. Since µ is a constant, the MSM in stage 2 becomes a constant multiplier, where techniques for constant multipliers, such as high-radix Booth encoding, can be used. Additionally, the computation of T3 appears as if it requires a long adder. In fact, only half the length of T3 is needed; hence, a short adder (k + 4 bits) is used instead, as illustrated in the dashed box in Fig. 2.

Like the MSM in the second stage, the LSM computing the result R is also a constant multiplier. The difference between them is that the computation of R involves a subtraction, which can be absorbed into the reduction tree so as to effectively reduce the circuit delay.

In our Booth-encoded multiplier, sign extension may cause $T3 \ge 2^{n_1+n_2}$, which is the carry-correction case in (6); therefore, T2 and q should be multiplexed outputs.

B. Complexity analysis

In our proposed algorithm, there are four truncated multiplications, which can be implemented by four truncated multipliers to establish a 3-stage pipeline modular multiplier. When the pipeline is fully loaded, it can generate a modular multiplication result every clock cycle. Table I offers a complexity comparison with some common implementations of modular multiplication.

Word-based MMM (WB-MMM) in [9] is generally utilized for area-efficient designs, while its performance is not at the same level as full-word multiplier designs. When compared with BMM in [10] and the full-word Montgomery modular multiplication (FW-MMM) in [11], our algorithm exhibits a shorter delay and fewer clock cycles at the cost of about twice the area.

TABLE I: Complexity of modular multiplication implementations

| Algorithms | Critical path | Cycles | Resources |
|---|---|---|---|
| BMM [10] | n-bit multiplication | 3 | 1 n-bit multiplier |
| FW-MMM [11] | n-bit multiplication | 3 | 1 n-bit multiplier |
| WB-MMM [9] | full addition and several logic gates | n | n-bit full adder |
| Ours | n-bit truncated multiplication | 1 | 4 truncated multipliers equivalent to 2 n-bit multipliers |

C. Comparison of Modular Multipliers

Based on our proposed algorithm, a 256-bit modular multiplier has been realized over an arbitrary 256-bit prime field GF(p). Synthesized in TSMC 90nm with Design Compiler J-2014.09-SP3, the experimental results demonstrate that the critical path delay is 3.58ns in the worst corner, with a circuit scale of about 629K gates. With the pipeline technique, a modular multiplication result can be generated every clock cycle. Moreover, in order to establish a fair comparison with previous works, we also implemented our design in 130nm.

As [9] and [12] in Table II are WB-MMM designs, their area cost was moderate, while their speed was limited. An interleaved modular multiplication algorithm (IMM), based on BMM and MMM for special moduli, was introduced in [13], and only consumed a very small area, which renders it appropriate for low-cost applications. For high-speed purposes, [11] applied FW-MMM, and while its area is comparable with ours, it takes 3 clock cycles to perform a modular multiplication.

We also include a bit-scan MMM design [14], and implement it in 90nm to make a fair comparison. Though the orientations of our design and some low-cost designs are different, the AT factor and its reciprocal, denoted as "performance", offer one basis for comparison.

In order to demonstrate the enhancement of our proposed algorithm over the improved BMM in [4], we further implemented the full-word Barrett modular multiplication (FW-BMM) with a full-word 256-bit multiplier. The application of truncated multipliers reduced the requirement of the BMM from 3 full-word multiplications to 2 (LSM ≈ MSM ≈ 1/2 full-word multiplication). The removal of double-full-word final additions in our MSM makes the critical path 17% shorter than the full-word multiplier in [4], while the pipeline structure further enhances the throughput of our design, which led to a total improvement of 80% in terms of performance of our proposed algorithm compared with the improved BMM.

TABLE II: Comparison of 256-bit Modular Multipliers

| Design | Technique | Process | Freq (MHz) | Gates (KGates) | Speed α (ns) | AT β | Performance (AT)^−1 |
|---|---|---|---|---|---|---|---|
| [14] | Bit-MMM | 90nm | 752 | 30.0 | 121 | 5.24 | 0.191 |
| [11] | FW-MMM | 90nm | 185 | 540 | 16.2 | 12.6 | 0.079 |
| [9] | WB-MMM | 90nm | 286 | 125 | 88 | 15.9 | 0.063 |
| Ours | FW-BMM | 90nm | 279 | 629 | 3.58 | 3.3 | 0.303 |
| [12] | WB-MMM | 130nm | 556 | 122 | 500 | 61 | 0.016 |
| [13] | IMM | 130nm | 321 | 45 | 200 | 9.0 | 0.111 |
| [4] | FW-BMM | 130nm | 196 | 477 | 15 | 7.2 | 0.139 |
| Ours | FW-BMM | 130nm | 230 | 910 | 4.35 | 4.0 | 0.250 |

α Average time cost of one modular multiplication.
β AT = Gate Counts × Time × 130nm / Technology, with the unit of Gate · ms.

D. Comparison of ECC Implementations

Using our modular multiplier, we built a 256-bit ECC processor supporting arbitrary elliptic curves over GF(p). The Montgomery powering ladder algorithm [15] was applied in the scalar multiplication. Furthermore, projective coordinates were utilized, and the 3-stage modular multiplier can work at full load with 11 cycles and 9 cycles to perform point addition and point doubling, respectively, according to the scheduling method in [16]. The modular inversion is based on the binary extended Euclidean algorithm [17]. At a clock rate of 279MHz, a scalar multiplication requires about 19.4µs, which is the fastest we could locate in the published literature.

In Table III, [11] and [18] both used full-word MMM, whose area costs are comparable with ours; however, since they had to perform the MMM with three full-word multiplications,
the delay of the scalar multiplication was more than 3 times longer than ours. The WB-MMM was employed in [12] and [19], which allowed them to have a high clock rate with moderate area; however, it also increased the cycle count of a modular multiplication. The IMM was employed in [20] and [21], both of which offered a good balance between area and delay, while [21] is a dual-field design and we only implemented its prime-field part in 90nm. In [22] and [23], the authors utilized the fast reduction method of the Mersenne-like modulus to accelerate modular multiplication, while they only supported one special prime of China's ECC algorithm, SM2. Although this fast reduction method can significantly speed up ECC operations, its flexibility is quite limited. Moreover, compared with [23], which only supports the SM2 prime with restricted elliptic curves and has no SPA countermeasures, our design is much more flexible, although it requires a little more area.

We also used the FW-BMM modular multiplier in [4] to build an ECC processor. Since our 3-stage modular multiplier is fully occupied during the ECC scalar multiplication, the advantages of our design over [4] are substantial (more than 80%) in both Table II and Table III in terms of AT.

Owing to the use of truncated multipliers and the parallel schedules of our proposed pipeline modular multiplier, our ECC processor achieves an ultra-high performance while still exhibiting an adequate AT, which makes it suitable for computation-intensive applications.

TABLE III: Comparison of 256-bit ECC processors

| Design | Technique | Process | Freq (MHz) | Gates (KGates) | Speed α (µs) | AT β | Performance (AT)^−1 |
|---|---|---|---|---|---|---|---|
| [21] | IMM | 90nm | 725 | 52.4 | 632 | 47.8 | 0.021 |
| [11] | FW-MMM | 90nm | 185 | 540 | 120 | 93.6 | 0.011 |
| Ours | FW-BMM | 90nm | 279 | 659 | 19.4 | 18.5 | 0.054 |
| [20] | IMM | 130nm | 110 | 168 | 510 | 85.7 | 0.012 |
| [12] | WB-MMM | 130nm | 556 | 122 | 1010 | 123 | 0.008 |
| [22] | SM2 | 130nm | 228 | 156 | 208 | 32.4 | 0.031 |
| [23] | SM2 | 130nm | 164 | 659 | 20.4 | 13.4 | 0.075 |
| [4] | FW-BMM | 130nm | 189 | 507 | 82.9 | 42.0 | 0.024 |
| Ours | FW-BMM | 130nm | 230 | 952 | 23.5 | 22.4 | 0.045 |
| [18] | FW-MMM | 180nm | 200 | 750 | 95 | 51.5 | 0.019 |
| [19] | WB-MMM | 180nm | 333 | 94 | 1480 | 100 | 0.010 |
| Ours | FW-BMM | 180nm | 136 | 977 | 39.8 | 28.1 | 0.036 |

α Average time cost of one scalar multiplication.
β AT = Gate Counts × Time × 130nm / Technology, with the unit of Gate · s.

V. CONCLUSION

In this paper, a modular multiplication algorithm with truncated multiplications was proposed, according to which a 3-stage pipeline modular multiplier with four truncated multipliers was constructed. When the modular multiplier was applied in an ECC processor, it demonstrated the highest single-core performance in the published literature as well as a good area-time efficiency, which makes it appropriate for server-side applications.

REFERENCES

[1] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–519, 1985.
[2] P. Barrett, "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor," Advances in cryptology CRYPTO'86, pp. 311–323, 1986.
[3] H. Laszlo, "Fast Truncated Multiplication for Cryptographic Applications," Lecture Notes in Computer Science, vol. 3659, pp. 211–225, 2005.
[4] Y. Kong, "Optimizing the Improved Barrett Modular Multipliers for Public-Key Cryptography," International Conference on Computational Intelligence and Software Engineering, pp. 1–4, 2010.
[5] H.-J. Ko and Hsiao, "Design and Application of Faithfully Rounded and Truncated Multipliers With Combined Deletion, Reduction, Truncation, and Rounding," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, no. 5, pp. 304–308, 2011.
[6] S. K. Chen, C. W. Liu, T. Y. Wu, and A. C. Tsai, "Design and Implementation of High-Speed and Energy-Efficient Variable-Latency Speculating Booth Multiplier (VLSBM)," IEEE Transactions on Circuits and Systems I Regular Papers, vol. 60, no. 10, pp. 2631–2643, 2013.
[7] A. D. Booth, "A signed binary multiplication technique," Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–240, 1951.
[8] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, vol. EC-13, no. 1, pp. 14–17, 1964.
[9] S. R. Kuang, K. Y. Wu, and R. Y. Lu, "Low-Cost High-Performance VLSI Architecture for Montgomery Modular Multiplication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 2, pp. 434–443, Feb 2016.
[10] J. Groschdl, "High-Speed RSA Hardware Based on Barrets Modular Reduction Method," in International Workshop on Cryptographic Hardware and Embedded Systems, 2000, pp. 191–203.
[11] S.-C. Chung, J.-W. Lee, H.-C. Chang, and C.-Y. Lee, "A high-performance elliptic curve cryptographic processor over GF(p) with SPA resistance," 2012 IEEE International Symposium on Circuits and Systems, pp. 1456–1459, 2012.
[12] G. Chen, G. Bai, and H. Chen, "A High-Performance Elliptic Curve Cryptographic Processor for General Curves Over GF(p) Based on a Systolic Arithmetic Unit," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 5, pp. 412–416, 2007.
[13] M. Knežević, F. Vercauteren, and I. Verbauwhede, "Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods," IEEE Transactions on Computers, vol. 59, no. 12, pp. 1715–1721, 2010.
[14] A. Rezai and P. Keshavarzi, "High-Throughput Modular Multiplication and Exponentiation Algorithms Using Multibit-Scan–Multibit-Shift Technique," IEEE Transactions on Very Large Scale Integration Systems, vol. 23, no. 9, pp. 1710–1719, 2015.
[15] M. Joye and S. M. Yen, "The Montgomery powering ladder," Lecture Notes in Computer Science, vol. 2523, pp. 291–302, 2002.
[16] N. Guillermin, "A high speed coprocessor for elliptic curve scalar multiplications over F(p)," in International Conference on Cryptographic Hardware and Embedded Systems, 2010, pp. 48–64.
[17] R. L'Orencz, "New Algorithm for the Classical Modular Inverse," in Cryptographic Hardware and Embedded Systems, 2002.
[18] X. Zhang and S. Li, "A High Performance ASIC Based Elliptic Curve Cryptographic Processor over GF(p)," International Design and Test Workshop - IDT, pp. 182–186, 2007.
[19] D. Karakoyunlu, F. K. Gurkaynak, B. Sunar, and Y. Leblebici, "Efficient and side-channel-aware implementations of elliptic curve cryptosystems over prime fields," IET Information Security, vol. 4, no. 1, pp. 30–43, March 2010.
[20] S. Ghosh, M. Alam, D. Chowdhury, and I. Gupta, "Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side channel attacks," Computers and Electrical Engineering, vol. 35, no. 2, pp. 329–338, 2009.
[21] Z. Liu, D. Liu, and X. Zou, "An Efficient and Flexible Hardware Implementation of the Dual-Field Elliptic Curve Cryptographic Processor," IEEE Transactions on Industrial Electronics, vol. 64, no. 3, pp. 2353–2362, 2017.
[22] D. Zhang and G. Bai, "Ultra high-performance ASIC implementation of SM2 with power-analysis resistance," in 2015 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), June 2015, pp. 523–526.
[23] Z. Zhao and G. Bai, "Ultra High-Speed SM2 ASIC Implementation," in 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications, Sept 2014, pp. 182–188.
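
The arithmetic of Algorithm 1 and Algorithm 2 can be checked in software with a minimal Python sketch. It is not the authors' hardware design. The function names, the `e1`/`e2` parameters, and the modelling of each MSM as the exact high part minus an integer error of 0 or 1 are assumptions made here for illustration, chosen so that the errors stay inside the [0, 2) range of (9). The partial-product-level truncation of Fig. 2, the Booth encoding and the carry-correction case of (6) are not reproduced.

```python
def improved_barrett(a, b, m, n):
    """Algorithm 1 with exact full-word multiplications (requires A, B < 2M)."""
    mu = (1 << (2 * n + 3)) // m      # mu = floor(2^(2n+3) / M), fixed per modulus
    t = a * b                         # step 1: full product T
    t1 = t >> (n - 2)                 #         high part T1 = floor(T / 2^(n-2))
    q = (mu * t1) >> (n + 5)          # step 2: quotient estimate
    return t - q * m                  # step 3: R < 2M, R congruent to AB mod M


def truncated_modmul(a, b, m, n, e1=0, e2=0):
    """Algorithm 2 with a simplified error model (requires A, B < 4M).

    Assumption of this sketch: each MSM returns the exact high part minus
    e1 or e2 (each 0 or 1), clamped at zero, which keeps the truncation
    errors within [0, 2) as in (9).
    """
    mu = (1 << (2 * n + 4)) // m              # mu = floor(2^(2n+4) / M)
    ab = a * b                                # stands for T2*2^(n-2) + T1*2^(n-2-k1) + T0
    t2 = max(0, (ab >> (n - 2)) - e1)         # step 1: MSM output T2
    q = max(0, ((mu * t2) >> (n + 6)) - e2)   # step 2: MSM output q
    mask = (1 << (n + 2)) - 1
    low_qm = (q * m) & mask                   # step 3: LSM keeps the low n+2 bits of q*M
    return (ab - low_qm) & mask               # R = (AB - low bits of qM) mod 2^(n+2)
```

Because the output bound 4M of `truncated_modmul` matches its input bound, a result can be passed straight back in as an operand of the next call, which is the compatibility property that the proof in Section III-B establishes.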
