Design and Evaluation of Finite Field Multipliers Using Fast XNOR Cells

This document describes a study that designed and evaluated state-of-the-art finite field multipliers ranging from 93 to 409 bits using faster XNOR cells. Four multiplier approaches were implemented - conventional algorithm, Karatsuba algorithm, overlap-free Karatsuba algorithm, and overlap-free based multiplication strategy. The multipliers were synthesized using a 45nm process. The fast XNOR cells improved computation delay for the designs by 1-38% depending on the approach. Design files have been made publicly available for further research.

Design and Evaluation of Finite Field Multipliers using fast XNOR cells
Nitin D. Patwari, Anjul Srivastav, Mayank Kabra, Prashant Jonna, Madhav Rao
International Institute of Information Technology, Bangalore, Karnataka, India

ABSTRACT

Polynomial multiplication is a fundamental operation in cryptography applications, and this module is the major factor in determining the performance of the overall design. Current polynomial multipliers are built on conventional CMOS cells, and no major changes to the standard cell library have been explored to improve performance. Hence, state-of-the-art (SOTA) finite field multipliers of operand sizes ranging from 93 to 409 bits were designed and evaluated by adopting faster XNOR cells. The hardware metrics, in the form of gate usage and propagation delay, were compared. The SOTA multipliers of different approaches, including Conventional Algorithm (CA), Karatsuba Algorithm (KA), Overlap-free Karatsuba Algorithm (OKA), and Overlap-free based multiplication strategy (OBS), were designed and synthesized through the ASIC flow using 45 nm GPDK library files. The SOTA multipliers adopting the fast XNOR cell improved the compute delay in the range of 8.24% to 33.45%, 8% to 37.05%, 4.63% to 18.36%, and 1.01% to 38.73% for OKA, OBS, CA, and KA respectively. All the design files are made freely available for further usage to the research and designers' community.

KEYWORDS

Finite Field Multiplier, Karatsuba Algorithm, Overlap free Karatsuba, OBS, Polynomial Multiplication

© 2023 Association for Computing Machinery.

1 INTRODUCTION

Cybercrimes in the form of intrusion and leakage of personally identifiable information, data theft, and digital system trust deficit are growing daily [1, 2]. To counter these attacks, researchers across the globe are continuously attempting to provide safer and stronger security solutions. These security solutions are deployed on various platforms covering autonomous vehicles [3], healthcare [4, 5], IoT-and-Edge compute devices [6, 7], secure communication protocols [8], digital banking [9, 10], and many others. Cryptography solutions aim to provide privacy of data, authentication, and data security. However, owing to the large scale of data exchange between multiple parties, the possibility of security lapses needs attention.

Symmetric-key cryptography and public-key cryptography [11] are the two main categories of encryption techniques. In public-key cryptography, all communication parties safely communicate with each other without sharing secret information. Key setup and digital signature are necessary for secure communications, and the same is set up for any transaction of private information. Examples of public-key cryptography include the RSA encryption algorithm [12], the Diffie-Hellman key exchange protocol [13], Elliptic-curve cryptography (ECC) [14], and other variants of the three stated. Among these algorithms, ECC has emerged as the most popular public-key cryptosystem, primarily due to its relatively small key size in terms of the effectiveness of its implementation and security robustness [15, 16].

Hardware cryptography is performance efficient, reliable, and less costly when compared to its software counterpart [17]. In the recent past, hardware cryptographic accelerators have been realized through a traditional VLSI ASIC flow [18–20] and in programmable FPGA fabric [21, 22]. In a hardware implementation, the speed of the finite-field multiplier forms the major bottleneck for computing elliptic curve points, thereby delaying the encryption mechanism on critical data. There have been several focused attempts to evolve fast and efficient finite field multiplier designs in the past [23, 24]. These approaches include configurations such as systolic arrays and other pipelining techniques derived from regular compute-intensive designs. However, none of the literature focuses on the primitive cell design to improve the performance of finite-field multipliers. The Karatsuba multiplier (KA) [25, 26] is one of the popular ones used for fast finite-field operations. The method replaces some of the multiplication operations with additions, thereby reducing the number of partial products from $n^2$ to $n^{1.58}$. On the other side, KA is a recursive algorithm that incurs a time-complexity penalty, and hence there is always a trade-off between area and delay metrics while choosing KA over the conventional algorithm (CA). Several hardware implementation techniques were explored to enhance the operational efficiency [27]. The overlap-free Karatsuba algorithm (OKA) is one such method, proposed recently to reduce the propagation delay by removing one XOR gate from the critical path of the Karatsuba algorithm [28–30], but it does not outperform CA. Another efficient method referred to as

overlap-free-based multiplication strategy (OBS), proposed in [31], is a hybrid implementation of OKA and CA, where the higher operand sized multiplier is recursively staged to lower sized multipliers. Multiplication at lower operand sizes, up to 15 bits, is handled by CA, whereas the subsequent multiplication is driven by OKA. The hybrid method leverages the best of both methods to achieve benefits in terms of space and time complexity with respect to the OKA and KA approaches.

This work focuses on designing finite field multipliers of varying operand size, ranging from 93 bits to 409 bits, using fast XNOR cells and evaluating the same for performance and cell usage, which quantifies the incurred design space. The primitive fast XNOR gate was designed and characterized for incorporation into the standard cell library, which was further employed to synthesize the four SOTA finite field multipliers through the ASIC flow. To the authors' knowledge, this is the first time a primitive XNOR cell has been evaluated for the state-of-the-art (SOTA) finite field multipliers. All the results and designs are made freely available for further adoption by the research and designers' community. The accelerated finite-field multiplier is a way forward for achieving secure computing on the chip.

2 FINITE FIELD MULTIPLIERS

Finite field multipliers are typically employed for Galois field GF($2^n$) functions, which are widely used in cryptographic applications [29]. Faster and better finite field multiplier designs are expected to improve and accelerate the encryption process. Consider $A(x) = x^3 + x + 1$ and $B(x) = x^3 + x^2 + x + 1$, two polynomials of degree three that are represented by their coefficients in binary notation, either 0 or 1. $A(x)$ in binary form is denoted as 1011 and $B(x)$ as 1111, so $A(x)B(x) = 1101001$, i.e. $x^6 + x^5 + x^3 + 1$. The conventional algorithm (CA) based multiplication of 4-bit numbers costs $(4-1)^2 = 9$ additions and $4^2 = 16$ multiplications. In general, a total of $(n-1)^2$ additions and $n^2$ dot products are required for $n$-bit polynomial multiplication in the CA based approach. Logical addition without carry-out (XOR) is employed for generating polynomial multiplication results. The dot product (AND) occupies the partial product stage. The gate-level design for a 2-bit polynomial multiplier in GF($2^n$) is presented in Figure 1, which includes one XOR and four AND gates to represent the logical computation. Similarly, the gate-level design for a 4-bit multiplier requires 9 XOR and 16 AND logical gates to extract the output product bits.

Figure 1: Gate level schematic of 2-bit CA multiplier design.

A CA multiplier with $n$-bit inputs generates $2n - 1$ output bits, and the critical path, as depicted in Figure 1, runs from the input to the middle output bit: the $C_1$ output in the case of the 2-bit multiplier, the $C_3$ output for the 4-bit multiplier, and similarly the $C_{n-1}$ output bit for a general $n$-bit multiplier. The number of XOR and AND gates required to realize an $n$-bit conventional polynomial multiplier is expressed in Equation 1, along with the critical path latency, where $T_{XOR}$ and $T_{AND}$ represent the individual XOR and AND gate propagation delays respectively.

$$
\begin{cases}
CA_{XOR}(n) = (n-1)^2 \\
CA_{AND}(n) = n^2 \\
T_{CA}(n) = T_{AND} + \log_2(n)\,T_{XOR}
\end{cases}
\tag{1}
$$

The Karatsuba Algorithm (KA) was devised in the past to improve the space complexity [25]; however, it neglects the time complexity, leading to a delayed output. A brief explanation of the KA approach with two operands is presented. Consider $A$ and $B$ as input operands that are expressed as $A = \sum_{i=0}^{n-1} a_i x^i$ and $B = \sum_{i=0}^{n-1} b_i x^i$, where $n$ is a power of 2 and is expressed as $n = 2m = 2^t$ $(t > 1)$. On splitting the operands $A$ and $B$ into a most-significant half ($A_H$, $B_H$) and a least-significant half ($A_L$, $B_L$), the operands are re-formulated to

$$
A = \sum_{i=0}^{n-1} a_i x^i = x^m \sum_{i=0}^{m-1} a_{m+i} x^i + \sum_{i=0}^{m-1} a_i x^i = x^m A_H + A_L
$$

$$
B = \sum_{i=0}^{n-1} b_i x^i = x^m \sum_{i=0}^{m-1} b_{m+i} x^i + \sum_{i=0}^{m-1} b_i x^i = x^m B_H + B_L
$$

where $A_H = \sum_{i=0}^{m-1} a_{m+i} x^i$ and $A_L = \sum_{i=0}^{m-1} a_i x^i$. Similarly, $B_H$ and $B_L$ are expressed as the most-significant and least-significant components of operand $B$. The KA based multiplication product $A \times B$ is computed recursively as expressed in Equation 2, where $P_2 = A_H B_H$, $P_1 = (A_H + A_L)(B_H + B_L)$, and $P_0 = A_L B_L$.

$$
\begin{aligned}
A \times B &= (x^m A_H + A_L)(x^m B_H + B_L) \\
&= P_2 x^{2m} + \{P_1 - P_2 - P_0\} x^m + P_0
\end{aligned}
\tag{2}
$$

This shows that KA multiplication requires three sub-multipliers, $P_0$, $P_1$, and $P_2$. In general, the complexity study shows that an $n$-bit multiplier employs a finite number of XOR and AND gates, expressed as $KA_{XOR}(n)$ and $KA_{AND}(n)$, along with the compute delay as a function of the XOR and AND gate delays, in Equation 3.

$$
\begin{cases}
KA_{XOR}(n) = 6 n^{\log_2 3} - 8n + 2 \\
KA_{AND}(n) = n^{\log_2 3} \\
T_{KA}(n) = T_{AND} + T_{XOR} \times (3\log_2(n) - 1)
\end{cases}
\tag{3}
$$

The gate level schematic of the 4-bit KA multiplier in GF($2^n$) is shown in Figure 2. Comparing the CA and KA circuit topologies using Equations 1 and 3 indicates a reduction from the quadratic space complexity ($n^2$) of CA to a sub-quadratic complexity ($n^{\log_2 3} \approx n^{1.58}$) in KA, but the time complexity suffers. In conclusion, KA exhibits a smaller footprint than CA but pays a performance cost.
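The CA gate counts of Equation 1 and the worked 4-bit product in Section 2 can be checked with a short Python sketch. This is an illustrative software model, not the authors' Verilog generator; the function names `ca_multiply`, `to_bits`, and `to_word` are hypothetical. Every partial product is counted as one AND gate and every column reduction step as one XOR gate.

```python
def ca_multiply(a_bits, b_bits):
    """Schoolbook (CA) product of two GF(2) polynomials.

    Coefficients are lists of 0/1, least-significant first.
    Returns (product coefficients, AND count, XOR count).
    """
    n = len(a_bits)
    assert len(b_bits) == n
    # Partial-product stage: one AND gate per (i, j) pair, grouped by output bit.
    columns = [[] for _ in range(2 * n - 1)]
    and_count = 0
    for i in range(n):
        for j in range(n):
            columns[i + j].append(a_bits[i] & b_bits[j])
            and_count += 1
    # Reduction stage: carry-free addition, i.e. XOR of each column.
    product = []
    xor_count = 0
    for col in columns:
        acc = col[0]
        for bit in col[1:]:
            acc ^= bit
            xor_count += 1
        product.append(acc)
    return product, and_count, xor_count


def to_bits(word, n):
    """Binary string (most-significant first) to a coefficient list (LSB first)."""
    return [int(c) for c in reversed(word.zfill(n))]


def to_word(bits):
    """Coefficient list (LSB first) to a binary string (most-significant first)."""
    return "".join(str(b) for b in reversed(bits))


product, ands, xors = ca_multiply(to_bits("1011", 4), to_bits("1111", 4))
print(to_word(product), ands, xors)  # 1101001 16 9
```

For the 4-bit example the sketch reports 16 AND and 9 XOR operations, matching $n^2$ and $(n-1)^2$. It counts gates only; it does not model the balanced XOR tree that gives the $\log_2(n)\,T_{XOR}$ critical path.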
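The recursion of Equation 2 can likewise be sketched in Python, assuming operands are held as integers whose bits are the polynomial coefficients and that $n$ is a power of two. The names `clmul`, `ka_multiply`, and `ka_and_count` are hypothetical helpers introduced for illustration. In GF(2), the subtractions in Equation 2 become XORs.

```python
def clmul(a, b):
    """Reference schoolbook carry-less product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def ka_multiply(a, b, n):
    """Karatsuba product of two n-bit GF(2) polynomials (n a power of two)."""
    if n == 1:
        return a & b  # a single AND gate
    m = n // 2
    mask = (1 << m) - 1
    a_l, a_h = a & mask, a >> m
    b_l, b_h = b & mask, b >> m
    p0 = ka_multiply(a_l, b_l, m)
    p2 = ka_multiply(a_h, b_h, m)
    p1 = ka_multiply(a_l ^ a_h, b_l ^ b_h, m)
    # P2*x^(2m) + (P1 - P2 - P0)*x^m + P0, with subtraction realized as XOR.
    return (p2 << (2 * m)) ^ ((p1 ^ p2 ^ p0) << m) ^ p0


def ka_and_count(n):
    """AND gates used by the recursion: three sub-multipliers per level."""
    return 1 if n == 1 else 3 * ka_and_count(n // 2)


print(format(ka_multiply(0b1011, 0b1111, 4), "b"))  # 1101001
print(ka_and_count(4))  # 9
```

The 4-bit case needs 9 AND operations, which equals $4^{\log_2 3}$ from Equation 3, against 16 for CA. The price is the extra XOR levels before and after each recursive call.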

Figure 2: Hierarchical schematic of 4-bit KA implementations.

The overlap-free Karatsuba algorithm is a variant derived by modifying the Karatsuba multiplier design to enhance the operational speed. This approach divides the inputs into odd and even orders, rather than into higher and lower halves of significant bits, to reduce the critical path delay. Consider $n = 2m$, and $A(x)$ and $B(x)$ as two polynomials in GF($2^n$) that are expressed as follows:

$$
A = \sum_{i=0}^{m-1} a_{2i} x^{2i} + x \sum_{i=0}^{m-1} a_{2i+1} x^{2i}
$$

$$
B = \sum_{i=0}^{m-1} b_{2i} x^{2i} + x \sum_{i=0}^{m-1} b_{2i+1} x^{2i}
$$

Considering $y = x^2$, $A_e(y) = \sum_{i=0}^{m-1} a_{2i} y^i$, and $A_o(y) = \sum_{i=0}^{m-1} a_{2i+1} y^i$, with $B_e$ and $B_o$ the corresponding even and odd components of the operand $B$, the operands $A$ and $B$ are simplified as stated in Equation 4. The product $A \times B$ is computed recursively, as in the KA method. Note that three partial products appear in the product expression, where $G_0 = A_e B_e$, $G_1 = (A_e + A_o)(B_e + B_o)$, and $G_2 = A_o B_o$.

$$
\begin{cases}
A = A_e(y) + x A_o(y) \\
B = B_e(y) + x B_o(y) \\
AB = (A_e(y) + x A_o(y)) \times (B_e(y) + x B_o(y)) \\
\phantom{AB} = G_0 + y G_2 + x (G_1 - G_0 - G_2)
\end{cases}
\tag{4}
$$

In terms of VLSI implementation, multiplying a polynomial by $x^2$ is equivalent to moving its coefficients to the left, hence no computational gate-level operation is needed. The expression $A_e(y)B_e(y) + y A_o(y)B_o(y)$ comprises terms with only even powers of $x$. Similarly, the expression $(A_e(y) + A_o(y))(B_e(y) + B_o(y)) + A_e(y)B_e(y) + A_o(y)B_o(y)$, once multiplied by $x$ as in Equation 4, consists of terms with only odd powers of $x$. Because the odd and even components never share an output bit, there is no overlap while computing the sum, and hence the set of three operations is performed concurrently. The addition operations incur a single XOR gate delay of $T_{XOR}$, while the subtraction operation concedes a further delay of $T_{XOR}$. The gate level schematic of the 4-bit OKA multiplier is shown in Figure 3. In summary, a total of $2 \times T_{XOR}$, in addition to the cost of the recursive computation of the three partial products, is involved in the OKA multiplier. OKA saves one $T_{XOR}$ over the KA operation, as also depicted in Figure 3. The space complexity of OKA is comparable to that of KA; however, the time complexity is improved from $(3\log_2(n) - 1)T_{XOR}$ to $(2\log_2(n) - 1)T_{XOR}$, as stated in Equation 5.

$$
\begin{cases}
OKA_{XOR}(n) = 6 n^{\log_2 3} - 8n + 2 \\
OKA_{AND}(n) = n^{\log_2 3} \\
T_{OKA}(n) = T_{AND} + (2\log_2(n) - 1)\,T_{XOR}
\end{cases}
\tag{5}
$$

Figure 3: Hierarchical schematic of 4-bit Overlap-free Karatsuba multiplier.

The overlap-free based multiplication strategy (OBS) is derived by examining the limitations in the time and space complexity of CA, KA, and OKA [31]. The compute latency for KA increases rapidly with the operand size when compared with that of the other two algorithms. The number of recursive multiplier stages in KA and OKA is of logarithmic order with respect to the operand size. In the case of 193-bit multiplication, the first four stages are designed to recursively conduct multiplication down to 13-bit multipliers. Each of the 13-bit multipliers demands an additional four steps, and the associated stage count signifies the overall delay. CA outperforms the other approaches for lower operand sizes; hence a hybrid strategy consisting of both CA and OKA was conceived for the finite field multiplier. Figure 4 (a) shows the multiplier modules used at different levels of OBS. The overlap-free-based multiplication strategy is primarily based on the OKA method, but the conventional polynomial strategy is employed for the initial (smallest) multipliers.

3 PROPOSED FAST XNOR CELLS

Most of the finite field multiplier performance parameters are a function of the XOR gate count, as discussed in the previous section. The CMOS based XOR and XNOR cells picked from the standard cell library during synthesis are not efficient enough for the modern-day finite field multiplier, which is primarily devised for cryptography applications. Hence, to benefit the performance of all four multipliers discussed so far, pass-transistor based XOR and XNOR cells are proposed. A variety of XOR and XNOR gates have been examined in the past [32]. It was learnt that utilizing NOT gates on the circuit's critical path degrades the performance. Positive feedback on the XOR-XNOR gate outputs instils stability, but at the cost of energy loss due to contention, and it further extends the delay, which is attributed to the additionally loaded parasitic capacitance. The circuit shown in Figure 5 provides full output swing for all possible input combinations, besides not having any inverter gate in the critical path, leading to a faster output. The XOR and XNOR cell depicted in Figure 5 is asymmetrical: one of the inputs, $A$, is fed as an input to the pass transistors apart from driving a NOT gate, hence the inputs $A$ and $B$ see dissimilar capacitance. Table 1 shows the transistor level working of the proposed XNOR and XOR cell. Considering XNOR operations, when either of the inputs
is 1 and the other is 0, N3 or N2 will be ON and passes a complete 0. When $A = B = 0$, both P2 and P3 will be ON and push the XNOR output to 1 through the $V_{dd}$ rail. Conversely, when both inputs are 1, N3 and N2 will be ON and pass $V_{dd} - V_t$ to the XNOR output. The P4 transistor is optimally positioned to push the XNOR output to $V_{dd}$. Similar logic levels are passed through the transistors in the circuit configured for the XOR cell.

Figure 4: (a) Structure of the OBS multiplier, where overlap-free transits to CA multiplier at different levels.

Figure 5: Schematic of pass-transistor based XOR and XNOR cell.

Table 1: Transistor level working of XOR and XNOR cells as referred from Figure 5, to arrive at the desired logical output.

| Operation | Inputs | P2 | P3 | N2 | N3 | N4/P4 | Output |
|---|---|---|---|---|---|---|---|
| XOR | A=0, B=0 | ON | ON | OFF | OFF | ON | 0 |
| XOR | A=0, B=1 | ON | OFF | OFF | ON | ON | 1 |
| XOR | A=1, B=0 | OFF | ON | ON | OFF | OFF | 1 |
| XOR | A=1, B=1 | OFF | OFF | ON | ON | OFF | 0 |
| XNOR | A=0, B=0 | ON | ON | OFF | OFF | OFF | 1 |
| XNOR | A=0, B=1 | OFF | ON | ON | OFF | OFF | 0 |
| XNOR | A=1, B=0 | ON | OFF | OFF | ON | ON | 0 |
| XNOR | A=1, B=1 | OFF | OFF | ON | ON | ON | 1 |

A quick synthesis of all finite field multipliers with the 45 nm Generic PDK (GPDK) library in the Cadence Genus tool showed a preference for XNOR cells over XOR cells. Hence, further design and synthesis of finite field multipliers with varying operand size was performed using the standard cell library incorporating the new pass-transistor based XNOR cell. Additionally, among all the cells in the library, XNOR was predominantly picked for realizing all four finite field multiplier designs. The aim of this work was to establish faster performance of the finite-field multipliers; hence the optimization runs were set up to reach minimum delay. A constant output load capacitance of Fanout-of-4 (FO4) was applied to deduce the optimal standard cell design. A theoretical approach for estimating the critical path delay of each cell design was suggested in [32]; however, the approach may not be repeatable for PDKs and libraries based on different technology nodes. The pass-transistor based XNOR cell was designed in Cadence Virtuoso, and a particle-swarm-optimization (PSO) scheme was applied to extract the optimized transistor widths for realizing minimum delay. Librecell, an open-source experimental tool, was used to characterize the layout, whereas the Spice netlist was utilized to characterize delay and power for the cells defined. Power in the form of switching and leakage was extracted and added to the cell properties. The pass-transistor based XNOR cell and its related characteristics were added to the standard cell library, and the result is referred to as the custom library. The custom library also included other cells such as AND, OR, NOT, and Multiplexer units of different drive strengths.

4 EXPERIMENTAL RESULTS AND DISCUSSIONS

The finite field multipliers were designed individually in Verilog and were synthesized through the ASIC flow using 45 nm technology in the Cadence Genus tool. All multiplier designs of varying operand sizes, ranging from 93 to 409 bits, were synthesized using the standard cell library and the customized library individually, and were then compared for any performance improvement. The number of fast XNOR cells picked by the ASIC flow was also reported to understand the impact of the fast cells created. Critical path delay was also characterized for all the finite field multiplier designs.

4.1 Synthesized Results

The ASIC flow synthesized results for all four finite field multiplier designs of varying operand sizes are reported in Figure 6 (a, b). It was observed that as the operand size continues to increase, the number of gate instances and the number of cells picked also increase, in the order of 1.5X to 2.5X. The compute delay does not show much variation across the designs for lower operand sizes. This was attributed to the parallel and
multiple usage of lower operand size multipliers for realizing wider operand size multiplier designs. To reiterate with an example, a 163-bit multiplier is realized using a set of 82, 41, 21, 11, 6, 3, and 2-bit multipliers. Similarly, 97, 49, 25, 13, 7, 4, and 2-bit multipliers are employed to realize the 193-bit design. Many of the smaller multiplier units are accommodated in parallel, so not much difference in delay is noticed. The designs with wider operands show a prominent surge in the delay metric. Four finite field multipliers of different operand sizes, including 2, 4, 8, 16, 32, 64, 93, 131, 163, 193, 233, 283, and 409 bits, are implemented and characterized for hardware parameters. A Python script was set up to automate the generation of finite field multipliers in Verilog for varying operand sizes. Structural symmetry and the fixed pattern in each of the finite field multipliers were maintained, and the Python source code is made freely available for further usage by the research and development community in [33].

Figure 6: (a) Area of finite field multipliers. (b) Delay of finite field multipliers.

4.2 Synthesized results using customized library

The customized library with the fast XNOR cell was adopted to synthesize four finite field multipliers of varying operand sizes. The number of cells picked and the instances of each cell, along with the delay, were compared for each operand size against the standard cell library synthesized results, as shown in Figure 7 (a, b). As targeted, the compute latency of all the finite field multipliers was improved, as shown in Figure 7 (b). The compute delay was improved for OKA in the range of 8.24% to 33.45% for varying operand size. Similarly, OBS with fast XNOR cells offered compute delay benefits ranging from 8% to 37.05%. CA and KA exhibit a delay improvement in the range of 4.63% to 18.36% and 1.01% to 38.73% respectively. For higher operand sizes, the OBS based finite-field multiplier exhibits the lowest number of gates and the second-best compute delay. With the fast XNOR cell, the difference in compute latency between CA, OKA, and OBS tends to reduce when compared with the synthesized results derived from the standard cell library. The library with fast XNOR cells added, however, tends to relax the design of the multipliers, with an increase of 2X to 3X in cell usage. The cell counts for the OKA, KA, and OBS multipliers show hardly any difference between the three with the updated fast XNOR cells. Although high in cell count, the finite field multipliers built from the custom library are performance-efficient designs. Table 2 shows the XNOR cells picked for synthesizing three finite field multiplier designs using the standard cell library and the custom cell library individually. The fast XNOR cell added in the custom cell library was picked at least 300 times for all three designs, which validates the use case of adopting fast XNOR cells in realizing finite field multipliers. Although the XNOR cells utilized with the original 45 nm GPDK standard cell library were 10X more than in the fast XNOR custom library, and the total cell count for the new library was high, the fast XNOR cells showed significant compute latency improvement. Additionally, the characterized power for the finite field multipliers was 2X more for the library including the fast XNOR cell than for the designs realized with standard gates. Hence, further improvements in the gate designs that not only benefit performance but also improve other hardware metrics, such as power and footprint, will be valuable.

Table 2: XNOR cell count for the finite field multipliers.

| Operand Size | KA (Std) | KA (New) | OKA (Std) | OKA (New) | OBS (Std) | OBS (New) |
|---|---|---|---|---|---|---|
| 93-bit | 5384 | 574 | 3118 | 576 | 3554 | 364 |
| 131-bit | 8311 | 836 | 10671 | 906 | 7505 | 514 |
| 163-bit | 14281 | 1108 | 15232 | 1136 | 10780 | 644 |
| 193-bit | 17869 | 1370 | 19564 | 1410 | 14156 | 768 |
| 233-bit | 22222 | 1770 | 24334 | 1776 | 20354 | 916 |
| 283-bit | 35987 | 2147 | 10696 | 2220 | 29469 | 1119 |
| 409-bit | 59247 | 3420 | 21581 | 3408 | 60925 | 1051 |

5 CONCLUSION

Fast XNOR cell based finite field multipliers were designed and evaluated for different operand sizes ranging from 93 to 409 bits. These designs were synthesized through the ASIC flow using the 45 nm technology node by adopting the standard cell library and the library with the fast XNOR cell added independently. The fast XNOR derived finite field multipliers generated faster output, however at the cost of more cell usage, leading to a higher silicon space requirement. The finite field multiplier designs realized with the fast XNOR cell exhibited compute delay benefits in the range of 8.24% to 33.45%, 8% to 37.05%, 4.63% to 18.36%, and 1.01% to 38.73% for OKA, OBS, CA, and KA respectively. Among the finite field multipliers, the OBS multiplier design exhibited the second-best performance characteristics and low cell usage across all the operand sizes studied with the XNOR adopted library. Performance-efficient finite field multipliers are a step towards realizing cryptographic accelerators for security applications. All the design files are made freely available for further usage to the research and designers' community.
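The sub-multiplier sizes quoted in Section 4.1 are consistent with repeatedly halving the operand width and rounding up. The sketch below reproduces the two quoted sequences under that assumption; `sub_multiplier_sizes` is a hypothetical helper and is not taken from the authors' script in [33].

```python
def sub_multiplier_sizes(n):
    """Operand widths met when an n-bit multiplier is halved (rounding up)
    at each recursion level, stopping at the 2-bit multiplier."""
    sizes = []
    while n > 2:
        n = (n + 1) // 2  # ceiling of n / 2
        sizes.append(n)
    return sizes


print(sub_multiplier_sizes(163))  # [82, 41, 21, 11, 6, 3, 2]
print(sub_multiplier_sizes(193))  # [97, 49, 25, 13, 7, 4, 2]
```

Both operand sizes pass through seven levels, which agrees with the observation that the stage count grows logarithmically with the operand size.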

Figure 7: (a) Number of cells instantiated by the finite field multipliers design when realized with custom cell library. (b) Delay of finite field multipliers implemented with custom cell library.

REFERENCES

[1] Hang Zhang, Bo Liu, and Hongyu Wu. Smart grid cyber-physical attack and defense: A review. IEEE Access, 9:29641–29659, 2021.
[2] James P. Farwell and Rafal Rohozinski. Stuxnet and the future of cyber war. Survival, 53(1):23–40, 2011.
[3] Joonsang Yoo and Jeong Hyun Yi. Code-based authentication scheme for lightweight integrity checking of smart vehicles. IEEE Access, 6:46731–46741, 2018.
[4] Puvvadi Aparna and Polurie Venkata Vijay Kishore. Biometric-based efficient medical image watermarking in e-healthcare application. IET Image Processing, 13(3):421–428, 2019.
[5] Zuowen Tan. Secure delegation-based authentication for telecare medicine information systems. IEEE Access, 6:26091–26110, 2018.
[6] Karim Shahbazi and Seok-Bum Ko. Area-efficient nano-AES implementation for internet-of-things devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 29(1):136–148, 2021.
[7] Aristidis G. Anagnostakis, Charilaos Naxakis, Nikolaos Giannakeas, Markos G. Tsipouras, Alexandros T. Tzallas, and Euripidis Glavas. Scalable consensus over finite capacities in multiagent IoT ecosystems. IEEE Internet of Things Journal, pages 1–1, 2022.
[8] Ruba Abu-Salma, M. Angela Sasse, Joseph Bonneau, Anastasia Danilova, Alena Naiakshina, and Matthew Smith. Obstacles to the adoption of secure communication tools. In 2017 IEEE Symposium on Security and Privacy (SP), pages 137–153, 2017.
[9] A. Hiltgen, T. Kramp, and T. Weigold. Secure internet banking authentication. IEEE Security Privacy, 4(2):21–29, 2006.
[10] Hal Berghel. The future of digital money laundering. Computer, 47(8):70–75, 2014.
[11] Muneer Bani Yassein, Shadi Aljawarneh, Ethar Qawasmeh, Wail Mardini, and Yaser Khamayseh. Comprehensive study of symmetric key and asymmetric key encryption algorithms. In 2017 International Conference on Engineering and Technology (ICET), pages 1–7, 2017.
[12] Xin Zhou and Xiaofei Tang. Research and implementation of RSA algorithm for encryption and decryption. In Proceedings of 2011 6th International Forum on Strategic Technology, volume 2, pages 1118–1121, 2011.
[13] Nils Mäurer, Thomas Gräupl, Christoph Gentsch, and Corinna Schmitt. Comparing different Diffie-Hellman key exchange flavors for LDACS. In 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC), pages 1–10, 2020.
[14] Qizhi Qiu and Qianxing Xiong. Research on elliptic curve cryptography. In 8th International Conference on Computer Supported Cooperative Work in Design, volume 2, pages 698–701, 2004.
[15] Bappaditya Jana and Jayanta Poray. A performance analysis on elliptic curve cryptography in network security. In 2016 International Conference on Computer, Electrical Communication Engineering (ICCECE), pages 1–7, 2016.
[16] Ali Raya and K. Mariyappn. Security and performance of elliptic curve cryptography in resource-limited environments: A comparative study. In 2020 15th International Conference for Internet Technology and Secured Transactions (ICITST), pages 1–8, 2020.
[17] MD. Mainul Islam, MD. Selim Hossain, MD. Shahjalal, MOH. Khalid Hasan, and Yeong Min Jang. Area-time efficient hardware implementation of modular multiplication for elliptic curve cryptography. IEEE Access, 8:73898–73906, 2020.
[18] Mohita Jaiswal and Kusum Lata. Hardware implementation of text encryption using elliptic curve cryptography over 192 bit prime field. In 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 343–349, 2018.
[19] Zia U. A. Khan and Mohammed Benaissa. High-speed and low-latency ECC processor implementation over GF($2^m$) on FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(1):165–176, 2017.
[20] Gang Chen, Guoqiang Bai, and Hongyi Chen. A high-performance elliptic curve cryptographic processor for general curves over GF($p$) based on a systolic arithmetic unit. IEEE Transactions on Circuits and Systems II: Express Briefs, 54(5):412–416, 2007.
[21] Leelavathi G, Shaila K, and Venugopal K R. Elliptic curve cryptography implementation on FPGA using Montgomery multiplication for equal key and data size over GF($2^m$) for wireless sensor networks. In 2016 IEEE Region 10 Conference (TENCON), pages 468–471, 2016.
[22] Hamad Marzouqi, Mahmoud Al-Qutayri, Khaled Salah, Dimitrios Schinianakis, and Thanos Stouraitis. A high-speed FPGA implementation of an RSD-based ECC processor. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(1):151–164, 2016.
[23] Parham Hosseinzadeh Namin, Crystal Roma, Roberto Muscedere, and Majid Ahmadi. Efficient VLSI implementation of a sequential finite field multiplier using reordered normal basis in domino logic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(11):2542–2552, 2018.
[24] Chiou-Yng Lee, Chun-Sheng Yang, Bimal Kumar Meher, Pramod Kumar Meher, and Jeng-Shyang Pan. Low-complexity digit-serial and scalable SPB/GPB multipliers over large binary extension fields using (b,2)-way Karatsuba decomposition. IEEE Transactions on Circuits and Systems I: Regular Papers, 61(11):3115–3124, 2014.
[25] A Karatsuba and Yu Ofman. Multiplication of many-digital numbers by automatic computers. Dokl. Akad. Nauk SSSR, 145(2):293–294, 1962.
[26] Christina Thomas and K. Gnana Sheela. Analysis of elliptic curve scalar multiplication in secure communications. In 2015 Global Conference on Communication Technologies (GCCT), pages 623–627, 2015.
[27] A.A.-A. Gutub, M.K. Ibrahim, and A. Kayali. Pipelining GF($p$) elliptic curve cryptography computation. In IEEE International Conference on Computer Systems and Applications, pages 93–99, 2006.
[28] H. Fan. Overlap-free Karatsuba–Ofman polynomial multiplication algorithms. IET Information Security, 4:8–14(6), March 2010.
[29] A. Reyhani-Masoleh and M.A. Hasan. Low complexity bit parallel architectures for polynomial basis multiplication over GF($2^m$). IEEE Transactions on Computers, 53(8):945–959, 2004.
[30] Jiafeng Xie, Pramod Kumar Meher, Mingui Sun, Yuecheng Li, Bo Zeng, and Zhi-Hong Mao. Efficient FPGA implementation of low-complexity systolic Karatsuba multiplier over GF($2^m$) based on NIST polynomials. IEEE Transactions on Circuits and Systems I: Regular Papers, 64(7):1815–1825, 2017.
[31] Moslem Heidarpur and Mitra Mirhassani. An efficient and high-speed overlap-free Karatsuba-based finite-field multiplier for FGPA implementation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 29(4):667–676, 2021.
[32] Jyh-Ming Wang, Sung-Chuan Fang, and Wu-Shiung Feng. New efficient designs for XOR and XNOR functions on the transistor level. IEEE Journal of Solid-State Circuits, 29(7):780–786, 1994.
[33] https://github.com/patwarind.
