A Modular Multiplier Implemented With Truncated Multiplication

This paper presents a modular multiplication algorithm utilizing truncated multiplications to enhance speed and efficiency, achieving a 256-bit modular multiplication in 3.58ns with a circuit scale of approximately 629K gates. The proposed algorithm constructs a high-speed 3-stage modular multiplier and integrates it into an Elliptic Curves Cryptography (ECC) processor, which performs scalar multiplication in 19.4µs, one of the fastest reported. The work demonstrates significant improvements in both performance and area-time efficiency compared to previous designs.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/321315058

Article in IEEE Transactions on Circuits and Systems II: Express Briefs · November 2017
DOI: 10.1109/TCSII.2017.2771239


All content following this page was uploaded by Jinnan Ding on 24 March 2019.


A modular multiplier implemented with truncated multiplication

Jinnan Ding, Student Member, IEEE, and Shuguo Li, Member, IEEE

Abstract—In this paper, we propose a modular multiplication algorithm that uses four truncated multiplications to shorten the critical path. Based on this algorithm, a high-speed 3-stage modular multiplier is constructed. Synthesized with TSMC 90nm, our design can perform a 256-bit modular multiplication every clock period of 3.58ns with a circuit scale of approximately 629K equivalent gates. Further, by utilizing the modular multiplier, we construct an Elliptic Curve Cryptography (ECC) processor, which can perform a scalar multiplication in 19.4µs, one of the fastest reported in the published literature.

Index Terms—modular multiplier, truncated multiplication, Barrett modular multiplication (BMM), elliptic curve cryptography (ECC).

I. INTRODUCTION

The fundamental operation of most public-key cryptography systems is modular multiplication, which also constitutes one of the bottlenecks in public-key cryptography processor design. In previous works, two methods have mainly been used to implement modular multiplication in high-performance designs, namely, the Montgomery modular multiplication (MMM) [1] and the Barrett modular multiplication (BMM) [2]. Both algorithms perform modular multiplication with 3 consecutive multiplications. Therefore, in order to speed up the process, a faster multiplier is required.

Among the multiplier designs, the truncated multiplier is a highly specialized type of multiplier that only computes part of the product. In [3], Laszlo Hars provided a detailed and systematic discussion of the truncated multiplier in terms of both time complexity and hardware cost. Introducing truncated multipliers in BMM, instead of normal ones, not only shortens the critical path but also saves hardware resources. As long as the error introduced by the truncated multiplier is small enough, custom designed truncated multipliers can be applied.

In this paper, we propose an algorithm that evolves from the improved BMM and is appropriate for truncated multipliers. According to the algorithm, a 3-stage pipeline modular multiplier is implemented, and based on the modular multiplier, an ECC processor is constructed, which significantly outperforms the record of previous works and still performs well in terms of area-time efficiency.

This work was supported by the National Natural Science Foundation of China under Grant No.61674086.
The authors are with the Institute of Microelectronics, Tsinghua University, Beijing, China (Email: [email protected]; [email protected])

II. BACKGROUND

A. Barrett modular multiplication

BMM was first presented in [2]. After the original BMM was introduced, Y. Kong evaluated different versions of improved BMM and offered a general expression of input and output, with parameters introduced so that the input and output bounds fall in the same range, called "compatible" [4]. His algorithm is presented as Algorithm 1 with a selected set of parameters.

Algorithm 1 Improved Barrett Modular Multiplication
Input: $A < 2M$, $B < 2M$, $2^{n-1} < M < 2^{n}$, $\mu = \lfloor 2^{2n+3} / M \rfloor$.
Output: $R \equiv AB \pmod{M}$, $R < 2M$.
1: $T = A \times B$, $T_1 = \lfloor T / 2^{n-2} \rfloor$.
2: $q = \lfloor (\mu \times T_1) / 2^{n+5} \rfloor$.
3: $R = T - q \times M$.
4: return $R$.

Step 1 in Algorithm 1 constitutes a full multiplication. In fact, the lower part of that multiplication is not needed until step 3, and only the higher part ($T_1$) is required in the immediate step 2. Similarly, step 2 only requires the higher part of the multiplication. Thus, it is possible to introduce the most significant multipliers (MSM) in both steps 1 and 2, instead of full-word multipliers. However, if we simply replace the full-word multiplication with the MSM, a significant error will be introduced, which causes the input and output bounds to become incompatible. Therefore, bound adjustments must be made to Algorithm 1, and the MSM should be constructed precisely in order to reduce the size of the introduced error.

Additionally, the result R in step 3 of Algorithm 1 is only (n + 1)-bit; hence, the multiplication in step 3 can be easily implemented with the least significant multiplier (LSM), which introduces no error at all; thus, no adjustment is required for the LSM.

B. Truncated multiplier

The truncated multiplier is often referred to as the MSM, since it is easy to implement an LSM. Based on the error compensation method, truncated multipliers can be categorized into two types: constant correction and variable correction. Reference [5] introduced a truncated multiplier with a faithfully rounded result; however, its area cost is approximately 2/3 of a full
multiplier. Furthermore, it may also introduce a negative number if applied to BMM.

In [6], the authors constructed a truncated multiplier that was capable of performing speculation and correction independently. Fig. 1 illustrates its structure, where the partial product array (PPA) of the multiplier is divided into the most significant part (MSP), the least significant part (LSP) and the carry-estimation part. The truncated product can be corrected according to the carry output from the LSP and the carry-estimation part.

As for an MSM used for BMM, a small error is tolerable; therefore, we can discard the LSP in [6] and take the MSP and carry-estimation part as the MSM applied in BMM.

Fig. 1: Structure of truncated multiplier in [6]

III. PROPOSED ALGORITHM AND PROOFS

A. Proposed Algorithm

Our algorithm evolves from the improved BMM and is proposed as Algorithm 2. The input and output bounds of our algorithm are compatible, which makes it suitable for consecutive modular multiplications. In Algorithm 2, the operations "$\ltimes_{n}$" and "$\rtimes\ltimes_{(n,k)}$" are implemented by an MSM, and "$\rtimes_{n}$" by an LSM. Proofs of our algorithm will be provided later.

Algorithm 2 Proposed Modular Multiplication Algorithm
Input: $A < 4M$, $B < 4M$, $2^{n-1} < M < 2^{n}$, $\mu = \lfloor 2^{2n+4} / M \rfloor$.
Output: $R \equiv AB \pmod{M}$, $R < 4M$.
1: $T2 = A \ltimes_{(n-2,k_1)} B$, $T1 = A \rtimes\ltimes_{(n-2,k_1)} B$, $T0 = A \rtimes_{n-2-k_1} B$.
2: $q = \mu \ltimes_{(n+6,k_2)} T2$, $AB = T2 \times 2^{n-2} + T1 \times 2^{n-2-k_1} + T0$.
3: $R = (AB - q \rtimes_{n+2} M) \bmod 2^{n+2}$.
4: return $R$.

Normally, the PPA of a full-word multiplier is a parallelogram-shaped array. To simplify our deduction (not in practical hardware design), we modify the parallelogram array into a rectangle with a height of n_row and a width of n1+n2, where n1 and n2 represent the bit lengths of the multiplicand and multiplier, respectively.

Fig. 2 depicts the structure of the MSM and LSM of step 1 in Algorithm 2. The hollow points denote added zeros, while the black points represent the MSM, and the grey points belong to the LSM. Parameter k_row denotes the number of rows of grey points, and n_row depends on the length of the multiplier's operands and the generation method of the PPA, such as Booth encoding [7].

Fig. 2: Structure of our truncated multiplier (labels: Booth encoder; MSM, n1+n2-n+2 bits and k bits; LSM, n-k-2 bits; k_row; n_row; Wallace-tree reduction & final addition for each part, giving T2, T1 and T0; k+4 bits addition to compute AB)

Let $pp_{i,j}$ denote the (i+1)th-row (j+1)th-column element of the PPA in Fig. 2. Normally, we have

$$AB = \left( \sum_{i=0}^{n\_row-1} \sum_{j=0}^{n1+n2-1} pp_{i,j} \times 2^{j} \right) \bmod 2^{n1+n2}. \tag{1}$$

The "$\bmod\ 2^{n1+n2}$" operation is utilized to eliminate the sign extension effect of Booth encoding. Here, AB is divided into three distinct parts according to (2).

$$AB = A \ltimes_{(n-2,k)} B \times 2^{n-2} + A \rtimes\ltimes_{(n-2,k)} B \times 2^{n-k-2} + A \rtimes_{n-k-2} B \pmod{2^{n1+n2}}. \tag{2}$$

In this paper, we will employ these three partial products to represent AB.

"$A \rtimes_{n-k-2} B$" corresponds to the grey points in Fig. 2, while n − k − 2 denotes the reserved bit length of the LSM.

$$A \rtimes_{n-k-2} B = \sum_{i=0}^{n\_row-1} \sum_{j=0}^{n-k-3} pp_{i,j} \times 2^{j} \tag{3}$$

The operation "$\rtimes\ltimes_{(n-2,k)}$" sums up the middle k columns in Fig. 2 (without carry bits):

$$A \rtimes\ltimes_{(n-2,k)} B = \left( \sum_{i=0}^{n\_row-1} \sum_{j=n-k-2}^{n-3} pp_{i,j} \times 2^{j-n+2+k} \right) \bmod 2^{k}. \tag{4}$$

The most significant truncated multiplication "$\ltimes_{(n-2,k)}$" adds up all the black points in Fig. 2 and discards the least significant k bits, where the subscript (n − 2) represents the truncated bit length. It is defined as follows:

$$A \ltimes_{(n-2,k)} B = \left\lfloor \frac{1}{2^{n-2}} \sum_{i=0}^{n\_row-1} \sum_{j=n-k-2}^{n1+n2-1} pp_{i,j}\, 2^{j} \right\rfloor \bmod 2^{n1+n2-n-2} \tag{5}$$

Our multiplications are all unsigned; however, sign extension in Booth encoding may introduce certain negative correction bits to the representations of T2, T1 and T0. These negative correction bits can all be expelled during the compression procedure, except for one occasion, which is
when T2 happens to be all 1's and a carry bit is generated by "$T1 \times 2^{n-k} + T0$", namely $T2 \times 2^{n} + T1 \times 2^{n-k} + T0 \ge 2^{n1+n2}$. In this circumstance, those 1's in T2 should be replaced with 0's, which we call carry-correction. Hence, there should be two cases for T2:

$$T2 = \begin{cases} 0, & \text{carry-correction case;} \\ A \ltimes_{(n-2,k)} B, & \text{default.} \end{cases} \tag{6}$$

B. Proof

According to (1) and (6), we define $\Delta e$ to be the difference between $A \times B / 2^{n}$ and $A \ltimes_{(n,k)} B$:

$$\begin{aligned}
\Delta e &= \frac{A \times B}{2^{n}} - A \ltimes_{(n,k)} B \\
&< \sum_{i=0}^{k\_row-1} \sum_{j=0}^{n-k-1} pp_{i,j} \times 2^{j-n} - \frac{1}{2^{k}} + \frac{1}{2^{n}} + 1 \\
&\le k\_row \times (2^{-k} - 2^{-n}) - 2^{-k} + 2^{-n} + 1 \\
&< (k\_row - 1) \times 2^{-k} + 1.
\end{aligned} \tag{7}$$

Parameter k is the bit length of the carry-estimation part used to design our truncated multiplier. If we choose k to satisfy

$$k = \lceil \log_2 (k\_row - 1) \rceil, \tag{8}$$

then $(k\_row - 1) \times 2^{-k} \le 1$, which results in $\Delta e < 2$. Moreover, it is obvious that $\Delta e \ge 0$; thus, the error introduced by the MSM is

$$0 \le \Delta e < 2. \tag{9}$$

To validate the correctness and compatibility (input and output bounds) of our proposed algorithm, the proof is demonstrated below.

In step 1 and step 2 of Algorithm 2, two "$\ltimes$" operations are utilized, which create two errors. Thus, we use $\Delta e_1$ and $\Delta e_2$ to denote the errors of "$\ltimes_{(n-2,k_1)}$" and "$\ltimes_{(n+6,k_2)}$", respectively. By selecting two proper k values according to (8), $\Delta e_1$ and $\Delta e_2$ will fall in [0, 2). This results in the following:

$$\Delta e_1 = \frac{A \times B}{2^{n-2}} - A \ltimes_{(n-2,k_1)} B \in [0, 2),$$

$$\Delta e_2 = \frac{\mu \times T2}{2^{n+6}} - \mu \ltimes_{(n+6,k_2)} T2 \in [0, 2).$$

The constant $\mu$ in Algorithm 2 has a rounding error, that is

$$\mu = \left\lfloor \frac{2^{2n+4}}{M} \right\rfloor = \frac{2^{2n+4}}{M} - \delta, \quad 0 \le \delta < 1.$$

With the above preconditions, q can be computed:

$$\begin{aligned}
q &= \mu \ltimes_{n+6} T2 \\
&= \frac{\left( \frac{2^{2n+4}}{M} - \delta \right) \times \left( \frac{A \times B}{2^{n-2}} - \Delta e_1 \right)}{2^{n+6}} - \Delta e_2 \\
&= \frac{AB}{M} - \frac{2^{n-2}}{M} \Delta e_1 - \frac{AB}{2^{2n+4}} \delta - \Delta e_2 + \frac{1}{2^{n+6}} \Delta e_1 \delta
\end{aligned}$$

With the ranges of the errors and the input requirement of Algorithm 2: $2^{n-1} < M < 2^{n}$, $A < 4M < 2^{n+2}$, and $B < 4M < 2^{n+2}$,

$$q > \frac{AB}{M} - \frac{2^{n-1}}{M} - \frac{AB}{2^{2n+4}} - 2 > \frac{AB}{M} - 4$$

Hence,

$$AB - q \times M < 4M < 2^{n+2}. \tag{10}$$

Additionally, $A \rtimes_{n+2} B$ only adds up the lowest n + 2 bits; therefore, under the modular operation, we have:

$$(q \times M) \bmod 2^{n+2} = (q \rtimes_{n+2} M) \bmod 2^{n+2} \tag{11}$$

Furthermore,

$$R = (AB - q \rtimes_{n+2} M) \bmod 2^{n+2} = AB - q \times M < 4M$$

and also $R \equiv AB \pmod{M}$.

IV. IMPLEMENTATIONS AND COMPARISONS

A. Implementation of modular multiplier

In our proposed algorithm, there are four truncated multiplications, three of which are consecutive multiplications ($\ltimes$ and $\rtimes\ltimes$ in step 1 represent one MSM). To exploit parallelism and balance the delay, we utilize four truncated multipliers to construct a 3-stage pipeline modular multiplier, which can perform a modular multiplication every clock cycle, as depicted in Fig. 3.

Fig. 3: Structure of modular multiplier (labels: inputs A and B; MUX; Register A, Register B; stage 1: MSM, LSM; MUX with 0 input; Register T2, Register T1, Register T0; stage 2: µ, MSM, adder; MUX with 0 input; Register q, Register T3; stage 3: LSM, M; output R)

The structure of our truncated multipliers (MSM and LSM) has already been shown in Fig. 2, and is similar to the design in [6]. Techniques such as Booth encoding [7] and the Wallace-tree reduction method [8] are applied to our truncated multipliers. The critical path is the MSM+Mux in the first and second stages and the LSM+Mux in the third stage. They are shorter than the critical path of a full-word multiplier, due to the half-sized Wallace-tree reduction and final addition.

Stage 1 of our modular multiplier completes step 1 in Algorithm 2, where T2‖T1 (concatenation of T2 and T1) and T0 are computed by the MSM and LSM, respectively. This LSM and MSM can be merged into one module, so as to share the same Booth encoder as presented in Fig. 2. Nevertheless,
T2‖T1 and T0 are computed by two different reduction trees in parallel and their carry chains are not connected, so their delays are independent and shorter than that of a full-word multiplier.

The following stage implements $q = \mu \ltimes_{n+6} T2$ from step 2 in Algorithm 2, by utilizing an MSM and a short adder. Since $\mu$ is a constant, the MSM in stage 2 becomes a constant multiplier, where some techniques involving a constant multiplier, such as high radix Booth encoding, can be used. Additionally, the computation of T3 appears as if it requires a long adder. In fact, only half the length of T3 is needed; hence, a short adder (k + 4 bits) was used instead, as illustrated in the dashed box in Fig. 2.

Precisely like the MSM in the second stage, the LSM computing the result R is also a constant multiplier. The difference between them is that there is a subtraction in the computation of R to be dealt with, which can be absorbed into the reduction tree, so as to effectively reduce the circuit delay.

In our Booth encoded multiplier, sign extension may cause $T3 \ge 2^{n1+n2}$, which is the carry-correction case in (6); therefore, T2 and q should be multiplexed outputs.

B. Complexity analysis

In our proposed algorithm, there are four truncated multiplications, which can be implemented by four truncated multipliers to establish a 3-stage pipeline modular multiplier. When the pipeline is fully loaded, it can generate a modular multiplication result every clock cycle. Table I offers a complexity comparison with some common implementations of modular multiplication.

Word-based MMM (WB-MMM) in [9] is generally utilized for area-efficient designs, while its performance is not at the same level as full-word multiplier designs. When compared with BMM in [10] and the full-word Montgomery modular multiplication (FW-MMM) in [11], our algorithm exhibits a shorter delay and fewer clock cycles at the cost of about twice the area.

TABLE I: Complexity of modular multiplication implementations
Algorithms     Critical path                            Cycles  Resources
BMM [10]       n-bit multiplication                     3       1 n-bit multiplier
FW-MMM [11]    n-bit multiplication                     3       1 n-bit multiplier
WB-MMM [9]     full addition and several logic gates    n       n-bit full adder
Ours           n-bit truncated multiplication           1       4 truncated multipliers equivalent to 2 n-bit multipliers

C. Comparison of Modular Multipliers

Based on our proposed algorithm, a 256-bit modular multiplier has been realized over an arbitrary 256-bit prime field GF(p). Synthesized in TSMC 90nm with Design Compiler J-2014.09-SP3, the results of the experiment demonstrate that the critical path delay is 3.58ns in the worst corner, with a circuit scale of about 629K gates. With the pipeline technique, a modular multiplication result can be generated every clock cycle. Moreover, in order to establish a fair comparison with previous works, we also implemented our design in 130nm.

As [9] and [12] in Table II are WB-MMM designs, their area cost was moderate, while their speed was limited. An interleaved modular multiplication algorithm (IMM), based on BMM and MMM for special moduli, was introduced in [13], and only consumed a very small area, which renders it appropriate for low-cost applications. For high-speed purposes, [11] applied FW-MMM, and while its area is comparable with ours, it takes 3 clock cycles to perform a modular multiplication.

We also include a bit-scan MMM design [14], and implement it in 90nm to make a fair comparison. Though the orientations of our design and some low-cost designs are different, the AT factor and its reciprocal, denoted as "performance", offer a common basis for comparison.

In order to demonstrate the enhancement of our proposed algorithm over the improved BMM in [4], we further implemented the full-word Barrett modular multiplication (FW-BMM) with a full-word 256-bit multiplier. The application of truncated multipliers reduced the requirement of the BMM from 3 full-word multiplications to 2 (LSM ≈ MSM ≈ 1/2 full-word multiplication). The removal of double-full-word final additions in our MSM makes the critical path 17% shorter than the full-word multiplier in [4], while the pipeline structure further enhances the throughput of our design, which led to a total improvement of 80% in terms of performance of our proposed algorithm compared with the improved BMM.

TABLE II: Comparison of 256-bit Modular Multipliers
Design  Technique  Process  Freq (MHz)  Gates (KGates)  Speed^α (ns)  AT^β   Performance (AT)^-1
[14]    Bit-MMM    90nm     752         30.0            121           5.24   0.191
[11]    FW-MMM     90nm     185         540             16.2          12.6   0.079
[9]     WB-MMM     90nm     286         125             88            15.9   0.063
Ours    FW-BMM     90nm     279         629             3.58          3.3    0.303
[12]    WB-MMM     130nm    556         122             500           61     0.016
[13]    IMM        130nm    321         45              200           9.0    0.111
[4]     FW-BMM     130nm    196         477             15            7.2    0.139
Ours    FW-BMM     130nm    230         910             4.35          4.0    0.250
α Average time cost of one modular multiplication
β AT = Gate Counts × Time × 130nm/Technology, with the unit of Gate · ms.

D. Comparison of ECC Implementations

Using our modular multiplier, we built a 256-bit ECC processor supporting arbitrary elliptic curves over GF(p). The Montgomery powering ladder algorithm [15] was applied in the scalar multiplication. Furthermore, projective coordinates were utilized, and the 3-stage modular multiplier can work at full load with 11 cycles and 9 cycles to perform point addition and point doubling, respectively, according to the scheduling method in [16]. The modular inversion is based on the binary extended Euclidean algorithm [17]. At a clock rate of 279MHz, a scalar multiplication requires about 19.4µs, which is the fastest we could locate in the published literature.

In Table III, [11] and [18] both used full-word MMM, whose area costs are comparable with ours; however, since it had to perform the MMM with three full-word multiplications,
the delay of the scalar multiplication was more than 3 times longer than ours. The WB-MMM was employed in [12] and [19], which allowed them to have a high clock rate with moderate area; however, it also increased the cycle count of a modular multiplication. The IMM was employed in [20] and [21], both of which offered a good balance between area and delay; [21] is a dual-field design, and we only implemented its prime-field part in 90nm. In [22] and [23], the authors utilized the fast reduction method of the Mersenne-like modulus to accelerate modular multiplication, but they only supported one special prime of China's ECC algorithm, SM2. Although this fast reduction method can significantly speed up ECC operations, its flexibility is quite limited. Moreover, compared with [23], which only supported the SM2 prime with restricted elliptic curves and has no SPA countermeasures, our design is much more flexible, although it requires a little more area.

We also used the modular multiplier of FW-BMM in [4] to build an ECC processor. Since our 3-stage modular multiplier is fully occupied during the ECC scalar multiplication, the advantages of our design over [4] are substantial (more than 80%) in both Table II and Table III in terms of AT.

Owing to the use of truncated multipliers and the parallel schedules of our proposed pipeline modular multiplier, our ECC processor achieves an ultra-high performance while still exhibiting an adequate AT, which makes it suitable for computation-intensive applications.

TABLE III: Comparison of 256-bit ECC processors
Design  Technique  Process  Freq (MHz)  Gates (KGates)  Speed^α (µs)  AT^β   Performance (AT)^-1
[21]    IMM        90nm     725         52.4            632           47.8   0.021
[11]    FW-MMM     90nm     185         540             120           93.6   0.011
Ours    FW-BMM     90nm     279         659             19.4          18.5   0.054
[20]    IMM        130nm    110         168             510           85.7   0.012
[12]    WB-MMM     130nm    556         122             1010          123    0.008
[22]    SM2        130nm    228         156             208           32.4   0.031
[23]    SM2        130nm    164         659             20.4          13.4   0.075
[4]     FW-BMM     130nm    189         507             82.9          42.0   0.024
Ours    FW-BMM     130nm    230         952             23.5          22.4   0.045
[18]    FW-MMM     180nm    200         750             95            51.5   0.019
[19]    WB-MMM     180nm    333         94              1480          100    0.010
Ours    FW-BMM     180nm    136         977             39.8          28.1   0.036
α Average time cost of one scalar multiplication
β AT = Gate Counts × Time × 130nm/Technology, with the unit of Gate · s.

V. CONCLUSION

In this paper, a modular multiplication algorithm with truncated multiplications was proposed, according to which a 3-stage pipeline modular multiplier with four truncated multipliers was constructed. When the modular multiplier was applied in an ECC processor, it demonstrated the highest single-core performance in the published literature as well as a good area-time efficiency, which is appropriate for server-side applications.

REFERENCES

[1] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519–521, 1985.
[2] P. Barrett, "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor," Advances in Cryptology - CRYPTO '86, pp. 311–323, 1986.
[3] L. Hars, "Fast Truncated Multiplication for Cryptographic Applications," Lecture Notes in Computer Science, vol. 3659, pp. 211–225, 2005.
[4] Y. Kong, "Optimizing the Improved Barrett Modular Multipliers for Public-Key Cryptography," International Conference on Computational Intelligence and Software Engineering, pp. 1–4, 2010.
[5] H.-J. Ko and S.-F. Hsiao, "Design and Application of Faithfully Rounded and Truncated Multipliers With Combined Deletion, Reduction, Truncation, and Rounding," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, no. 5, pp. 304–308, 2011.
[6] S. K. Chen, C. W. Liu, T. Y. Wu, and A. C. Tsai, "Design and Implementation of High-Speed and Energy-Efficient Variable-Latency Speculating Booth Multiplier (VLSBM)," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 10, pp. 2631–2643, 2013.
[7] A. D. Booth, "A signed binary multiplication technique," Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–240, 1951.
[8] C. S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions on Electronic Computers, vol. EC-13, no. 1, pp. 14–17, 1964.
[9] S. R. Kuang, K. Y. Wu, and R. Y. Lu, "Low-Cost High-Performance VLSI Architecture for Montgomery Modular Multiplication," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 2, pp. 434–443, Feb 2016.
[10] J. Großschädl, "High-Speed RSA Hardware Based on Barret's Modular Reduction Method," in International Workshop on Cryptographic Hardware and Embedded Systems, 2000, pp. 191–203.
[11] S.-C. Chung, J.-W. Lee, H.-C. Chang, and C.-Y. Lee, "A high-performance elliptic curve cryptographic processor over GF(p) with SPA resistance," 2012 IEEE International Symposium on Circuits and Systems, pp. 1456–1459, 2012.
[12] G. Chen, G. Bai, and H. Chen, "A High-Performance Elliptic Curve Cryptographic Processor for General Curves Over GF(p) Based on a Systolic Arithmetic Unit," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 5, pp. 412–416, 2007.
[13] M. Knežević, F. Vercauteren, and I. Verbauwhede, "Faster interleaved modular multiplication based on Barrett and Montgomery reduction methods," IEEE Transactions on Computers, vol. 59, no. 12, pp. 1715–1721, 2010.
[14] A. Rezai and P. Keshavarzi, "High-Throughput Modular Multiplication and Exponentiation Algorithms Using Multibit-Scan–Multibit-Shift Technique," IEEE Transactions on Very Large Scale Integration Systems, vol. 23, no. 9, pp. 1710–1719, 2015.
[15] M. Joye and S. M. Yen, "The Montgomery powering ladder," Lecture Notes in Computer Science, vol. 2523, pp. 291–302, 2002.
[16] N. Guillermin, "A high speed coprocessor for elliptic curve scalar multiplications over F(p)," in International Conference on Cryptographic Hardware and Embedded Systems, 2010, pp. 48–64.
[17] R. Lórencz, "New Algorithm for the Classical Modular Inverse," in Cryptographic Hardware and Embedded Systems, 2002.
[18] X. Zhang and S. Li, "A High Performance ASIC Based Elliptic Curve Cryptographic Processor over GF(p)," International Design and Test Workshop - IDT, pp. 182–186, 2007.
[19] D. Karakoyunlu, F. K. Gurkaynak, B. Sunar, and Y. Leblebici, "Efficient and side-channel-aware implementations of elliptic curve cryptosystems over prime fields," IET Information Security, vol. 4, no. 1, pp. 30–43, March 2010.
[20] S. Ghosh, M. Alam, D. Chowdhury, and I. Gupta, "Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side channel attacks," Computers and Electrical Engineering, vol. 35, no. 2, pp. 329–338, 2009.
[21] Z. Liu, D. Liu, and X. Zou, "An Efficient and Flexible Hardware Implementation of the Dual-Field Elliptic Curve Cryptographic Processor," IEEE Transactions on Industrial Electronics, vol. 64, no. 3, pp. 2353–2362, 2017.
[22] D. Zhang and G. Bai, "Ultra high-performance ASIC implementation of SM2 with power-analysis resistance," in 2015 IEEE International Conference on Electron Devices and Solid-State Circuits (EDSSC), June 2015, pp. 523–526.
[23] Z. Zhao and G. Bai, "Ultra High-Speed SM2 ASIC Implementation," in 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications, Sept 2014, pp. 182–188.
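
As an illustration of Algorithm 2 and the bounds derived in the proof of Section III, the following is a minimal functional sketch in Python. It is not the hardware design: the function names (msm, lsm, modmul_alg2) and the error arguments e1 and e2 are hypothetical, the truncated multipliers are modeled as an exact floor division minus an explicit error of 0 or 1 (so that the modeled error stays in [0, 2)), and AB is computed directly instead of being assembled from T2, T1 and T0.

```python
import random


def msm(x, y, shift, err=0):
    """Model of a most significant (truncated) multiplication.

    Returns floor(x*y / 2**shift) minus err (0 or 1), clamped at 0, so the
    difference from the exact quotient x*y / 2**shift lies in [0, 2).
    """
    assert err in (0, 1)
    return max(((x * y) >> shift) - err, 0)


def lsm(x, y, bits):
    """Model of a least significant multiplication: exact low `bits` bits of x*y."""
    return (x * y) & ((1 << bits) - 1)


def modmul_alg2(a, b, m, n, e1=0, e2=0):
    """Functional model of Algorithm 2; e1, e2 inject the modeled MSM errors."""
    assert (1 << (n - 1)) < m < (1 << n)
    assert 0 <= a < 4 * m and 0 <= b < 4 * m
    mu = (1 << (2 * n + 4)) // m      # precomputed constant floor(2^(2n+4) / M)
    ab = a * b                        # stands in for T2*2^(n-2) + T1*2^(n-2-k1) + T0
    t2 = msm(a, b, n - 2, e1)         # step 1: high part of A*B (with error)
    q = msm(mu, t2, n + 6, e2)        # step 2: quotient estimate (with error)
    # step 3: only the low n+2 bits of q*M are needed
    return (ab - lsm(q, m, n + 2)) % (1 << (n + 2))


if __name__ == "__main__":
    random.seed(0)
    n = 256
    m = random.getrandbits(n) | (1 << (n - 1)) | 1   # 2^(n-1) < m < 2^n
    x = random.randrange(4 * m)
    # The output bound (< 4M) matches the input bound, so results can be chained.
    for _ in range(1000):
        y = random.randrange(4 * m)
        r = modmul_alg2(x, y, m, n, random.randint(0, 1), random.randint(0, 1))
        assert r % m == (x * y) % m and 0 <= r < 4 * m
        x = r
    print("all checks passed")
```

Under these modeling assumptions the result is congruent to AB modulo M and stays below 4M even when both injected errors take their largest value, which is the compatibility property that allows consecutive modular multiplications without a final correction step.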

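
The AT column of Table II can be recomputed from the table footnote (AT = Gate Counts × Time × 130nm/Technology, in Gate · ms). The short Python sketch below does this for a few rows; the function name area_time and its argument names are hypothetical, and the printed values are expected to agree with the table only up to the rounding used there.

```python
def area_time(gates_k, time_ns, tech_nm, ref_nm=130):
    """Area-time product in gate*ms, normalized to the 130nm process.

    gates_k : circuit scale in thousands of equivalent gates
    time_ns : average time of one modular multiplication in ns
    tech_nm : process node of the design in nm
    """
    gates = gates_k * 1e3
    time_ms = time_ns * 1e-6          # 1 ns = 1e-6 ms
    return gates * time_ms * ref_nm / tech_nm


if __name__ == "__main__":
    # (design, KGates, ns, process in nm) taken from Table II
    rows = [
        ("[14] Bit-MMM", 30.0, 121, 90),
        ("[11] FW-MMM", 540, 16.2, 90),
        ("Ours, 90nm", 629, 3.58, 90),
        ("[13] IMM", 45, 200, 130),
        ("Ours, 130nm", 910, 4.35, 130),
    ]
    for name, gates_k, time_ns, tech_nm in rows:
        print(f"{name:14s} AT = {area_time(gates_k, time_ns, tech_nm):.2f} gate*ms")
```

The same formula applies to Table III, with the time taken in µs per scalar multiplication and the result expressed in Gate · s.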
100% (1)
Compose Clear: Sentences Using Appropriate Grammatical Structures
16 pages
Synopsys Booth Multiplier
No ratings yet
Synopsys Booth Multiplier
6 pages
Ang Kiukok
No ratings yet
Ang Kiukok
14 pages
An Algorithm For Multiplication Modulo (2 N-1)
No ratings yet
An Algorithm For Multiplication Modulo (2 N-1)
5 pages
TeeaCH Program.
No ratings yet
TeeaCH Program.
9 pages
CS372 - AI For Reasoning, Planning, and Decision Making (Spring 2025)
No ratings yet
CS372 - AI For Reasoning, Planning, and Decision Making (Spring 2025)
6 pages
Grade 5 Compare Contrast A
100% (1)
Grade 5 Compare Contrast A
3 pages
School 6
No ratings yet
School 6
42 pages
Nelson Huerta-Leidenz - : For The U.S.A. and Other Countries Targeted by The U.S. Meat Export Federation
No ratings yet
Nelson Huerta-Leidenz - : For The U.S.A. and Other Countries Targeted by The U.S. Meat Export Federation
11 pages
Time RPH Year 4
No ratings yet
Time RPH Year 4
6 pages
ControlLogix Controller Portfolio Customer Presentation
No ratings yet
ControlLogix Controller Portfolio Customer Presentation
22 pages
Afternoon Swim Louis Vuitton Perfume - A Fragrance For Women and Men 2019
No ratings yet
Afternoon Swim Louis Vuitton Perfume - A Fragrance For Women and Men 2019
1 page
CalTPA C1 AssessmentGuide MS
0% (1)
CalTPA C1 AssessmentGuide MS
50 pages
CLASS X PT1 Maths
No ratings yet
CLASS X PT1 Maths
4 pages
Xri77cx CH GB
No ratings yet
Xri77cx CH GB
9 pages
God Is Pure Bliss
No ratings yet
God Is Pure Bliss
26 pages
CD - Love Changes - Kashif Audrey Wheeler Bashiri Johnson B
No ratings yet
CD - Love Changes - Kashif Audrey Wheeler Bashiri Johnson B
6 pages
Lop11 Unit3 Getting
No ratings yet
Lop11 Unit3 Getting
5 pages
13 Custom Auth Server
No ratings yet
13 Custom Auth Server
9 pages
When The Code Becomes A Crime Scene Towards Dark Web Threat Intelligence With Software Quality Metrics
No ratings yet
When The Code Becomes A Crime Scene Towards Dark Web Threat Intelligence With Software Quality Metrics
5 pages
w3w4 - Worksheet
No ratings yet
w3w4 - Worksheet
2 pages
Tips For Freshers /: Some of The Personality Traits The GD Is Trying To Gauge May Include
No ratings yet
Tips For Freshers /: Some of The Personality Traits The GD Is Trying To Gauge May Include
5 pages
B.E (2019 Pattern)
No ratings yet
B.E (2019 Pattern)
2 pages
Active Integration Compatibility Matrix v6.7 2020-04-11 tcm54-76356
No ratings yet
Active Integration Compatibility Matrix v6.7 2020-04-11 tcm54-76356
8 pages
Mohammad Alfar CV-Accounting - Supplychain Coordinator
No ratings yet
Mohammad Alfar CV-Accounting - Supplychain Coordinator
2 pages
Essay
No ratings yet
Essay
2 pages
MTH 252 Section 4.5 Exercise 61: Justin Drawbert June 30, 2010
No ratings yet
MTH 252 Section 4.5 Exercise 61: Justin Drawbert June 30, 2010
2 pages
Simulation of Digital Communication Systems Using Matlab
From Everand
Simulation of Digital Communication Systems Using Matlab
Mathuranathan Viswanathan
3.5/5 (22)
Digital Modulations using Matlab
From Everand
Digital Modulations using Matlab
Mathuranathan Viswanathan
4/5 (6)
