Design and Evaluation of Finite Field Multipliers Using Fast XNOR Cells
Nitin D. Patwari, Anjul Srivastav, Mayank Kabra
International Institute of Information Technology, Bangalore, Karnataka, India
overlap-free-based multiplication strategy (OBS) proposed in [31] is a hybrid implementation of OKA and CA, in which a multiplier of larger operand size is recursively decomposed into multipliers of smaller operand size. Multiplication of operands up to 15 bits is performed by CA, whereas multiplication at larger operand sizes is driven by OKA. The hybrid method combines the strengths of both methods to improve space and time complexity relative to the OKA and KA approaches.

This work focuses on designing finite field multipliers with operand sizes ranging from 93 bits to 409 bits using fast XNOR cells, and on evaluating them for performance and for cell usage, which quantifies the design space incurred. The primitive fast XNOR gate was designed, characterized, and incorporated into the standard cell library, which was then used to synthesize the four state-of-the-art (SOTA) finite field multipliers through the ASIC flow. To the best of the authors' knowledge, this is the first time a primitive XNOR cell has been evaluated for SOTA finite field multipliers. All the results and designs are made freely available for further adoption by the research and designers' community. Accelerated finite field multipliers are a way forward for achieving secure on-chip computing.

2 FINITE FIELD MULTIPLIERS

Finite field multipliers are typically employed for Galois field GF($2^n$) functions, which are widely used in cryptographic applications [29]. Faster and better finite field multiplier designs are expected to improve and accelerate the encryption process. Consider $A(x) = x^3 + x + 1$ and $B(x) = x^3 + x^2 + x + 1$, two polynomials of degree three whose coefficients are binary, either 0 or 1. In binary form, $A(x)$ is denoted as 1011 and $B(x)$ as 1111, so $A(x)B(x) = 1101001$, i.e. $x^6 + x^5 + x^3 + 1$. The conventional algorithm (CA) based multiplication of 4-bit numbers costs $(4-1)^2 = 9$ additions and $4^2 = 16$ multiplications. In general, the CA based approach requires a total of $(n-1)^2$ additions and $n^2$ dot products for $n$-bit polynomial multiplication. Logical addition without carry-out (XOR) is used to generate the polynomial multiplication result, and the dot products form the partial product stage. The gate-level design of the 2-bit polynomial multiplier in GF($2^n$) is presented in Figure 1; it uses one XOR gate and four AND gates. Similarly, the gate-level design of the 4-bit multiplier requires 9 XOR and 16 AND gates to produce the output product bits.

A CA multiplier with $n$-bit inputs generates $2n - 1$ output bits, and the critical path, as depicted in Figure 1, runs from the inputs to the middle output bit: the $C_1$ output for the 2-bit multiplier, the $C_3$ output for the 4-bit multiplier, and in general the $C_{n-1}$ output for an $n$-bit multiplier. The number of XOR and AND gates required to realize an $n$-bit conventional polynomial multiplier is given in Equation 1, along with the critical path latency, where $T_{XOR}$ and $T_{AND}$ represent the propagation delays of individual XOR and AND gates respectively.

\[
\begin{aligned}
CA_{XOR}(n) &= (n-1)^2 \\
CA_{AND}(n) &= n^2 \\
T_{CA}(n) &= T_{AND} + \log_2(n)\, T_{XOR}
\end{aligned}
\tag{1}
\]

The Karatsuba Algorithm (KA) was devised to improve space complexity [25]; however, it does not address time complexity, which leads to a delayed output. A brief explanation of the KA approach with two operands follows. Consider $A$ and $B$ as input operands expressed as $A = \sum_{i=0}^{n-1} a_i x^i$ and $B = \sum_{i=0}^{n-1} b_i x^i$, where $n$ is a power of 2 expressed as $n = 2m = 2^t$ $(t > 1)$. On splitting the operands $A$ and $B$ into most-significant halves ($A_H$, $B_H$) and least-significant halves ($A_L$, $B_L$), the operands are re-formulated as

\[
\begin{aligned}
A &= \sum_{i=0}^{n-1} a_i x^i = x^m \sum_{i=0}^{m-1} a_{m+i} x^i + \sum_{i=0}^{m-1} a_i x^i = x^m A_H + A_L \\
B &= \sum_{i=0}^{n-1} b_i x^i = x^m \sum_{i=0}^{m-1} b_{m+i} x^i + \sum_{i=0}^{m-1} b_i x^i = x^m B_H + B_L
\end{aligned}
\]

where $A_H = \sum_{i=0}^{m-1} a_{m+i} x^i$ and $A_L = \sum_{i=0}^{m-1} a_i x^i$. Similarly, $B_H$ and $B_L$ are the most-significant and least-significant components of operand $B$. The KA based product $A \times B$ is computed recursively as expressed in Equation 2, where $P_2 = A_H B_H$, $P_1 = (A_H + A_L)(B_H + B_L)$, and $P_0 = A_L B_L$.

\[
\begin{aligned}
A \times B &= (x^m A_H + A_L)(x^m B_H + B_L) \\
&= P_2 x^{2m} + \{P_1 - P_2 - P_0\} x^m + P_0
\end{aligned}
\tag{2}
\]

This shows that KA multiplication requires three sub-multipliers, $P_0$, $P_1$, and $P_2$. In general, the number of XOR and AND gates needed for an $n$-bit KA multiplier is expressed as $KA_{XOR}(n)$ and $KA_{AND}(n)$, along with the compute delay as a function of the XOR and AND gate delays, in Equation 3.
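The CA and KA formulations above can be checked against each other with a small software model. The following is a minimal Python sketch, not the authors' Verilog generator: polynomials over GF(2) are packed into integers (bit $i$ holds the coefficient of $x^i$), and the function names `clmul`, `karatsuba_gf2`, and `ca_gate_counts` are hypothetical names chosen for this illustration. Addition and subtraction both reduce to XOR in GF(2), so $P_1 - P_2 - P_0$ becomes `p1 ^ p2 ^ p0`.

```python
def clmul(a: int, b: int) -> int:
    """Conventional (schoolbook) product in GF(2)[x].

    Each set bit of b selects a shifted copy of a (the AND stage);
    the copies are summed without carry (the XOR stage).
    """
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def karatsuba_gf2(a: int, b: int, n: int) -> int:
    """Karatsuba product of two n-bit GF(2) polynomials, n a power of two."""
    if n == 1:
        return a & b  # a 1-bit product is a single AND
    m = n // 2
    mask = (1 << m) - 1
    a_l, a_h = a & mask, a >> m  # A = x^m * A_H + A_L
    b_l, b_h = b & mask, b >> m  # B = x^m * B_H + B_L
    p0 = karatsuba_gf2(a_l, b_l, m)
    p2 = karatsuba_gf2(a_h, b_h, m)
    p1 = karatsuba_gf2(a_h ^ a_l, b_h ^ b_l, m)
    # P2 * x^(2m) + (P1 - P2 - P0) * x^m + P0, with +/- both being XOR
    return (p2 << (2 * m)) ^ ((p1 ^ p2 ^ p0) << m) ^ p0


def ca_gate_counts(n: int) -> tuple[int, int]:
    """(XOR count, AND count) of an n-bit CA multiplier per Equation 1."""
    return (n - 1) ** 2, n ** 2
```

With the worked example from the text, `clmul(0b1011, 0b1111)` and `karatsuba_gf2(0b1011, 0b1111, 4)` both evaluate to `0b1101001`, and `ca_gate_counts(4)` gives the 9 XOR and 16 AND gates quoted for the 4-bit multiplier.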
\[
\begin{aligned}
A &= A_e(y) + x A_o(y) \\
B &= B_e(y) + x B_o(y) \\
AB &= (A_e(y) + x A_o(y)) \times (B_e(y) + x B_o(y)) \\
&= G_0 + y G_2 + x (G_1 - G_0 - G_2)
\end{aligned}
\tag{4}
\]

In terms of VLSI implementation, multiplying a polynomial by $x^2$ is equivalent to shifting its coefficients to the left, hence no gate-level computation is needed. The expression $A_e(y)B_e(y) + y A_o(y)B_o(y)$ comprises terms with only even powers of $x$. Similarly, $x[(A_e(y) + A_o(y))(B_e(y) + B_o(y)) + A_e(y)B_e(y) + A_o(y)B_o(y)]$ consists of terms with only odd powers of $x$. Because the odd and even components do not overlap when the sum is computed, the set of three operations is performed concurrently. The addition operations incur a single XOR gate delay of $T_{XOR}$, while the subtraction operation incurs a further delay of $T_{XOR}$. The gate-level schematic of the 4-bit OKA multiplier is shown in Figure 3. In summary, the OKA multiplier involves a total of $2 \times T_{XOR}$ in addition to the cost of the recursive computation of the three partial products. OKA saves one $T_{XOR}$ over the KA operation, which is also depicted in Figure 3. The space complexity of OKA is comparable to that of KA; however, the time complexity is improved from $(3\log_2(n) - 1)T_{XOR}$

3 PROPOSED FAST XNOR CELLS

Most performance parameters of a finite field multiplier are a function of the XOR gate count, as discussed in the previous section. The CMOS based XOR and XNOR cells picked from the standard cell library during synthesis are not efficient enough for modern finite field multipliers, which are primarily devised for cryptography applications. Hence, to benefit the performance of all four multipliers discussed so far, pass-transistor based XOR and XNOR cells are proposed. A variety of XOR and XNOR gates have been examined in the past [32]. It was learnt that NOT gates on the circuit's critical path degrade performance. Positive feedback on the XOR-XNOR gate outputs instils stability, but at an energy cost due to contention, and it further extends the delay because of the additional parasitic capacitance loaded on the outputs. The circuit shown in Figure 5 provides full output swing for all possible input combinations and has no inverter in the critical path, leading to a faster output. The XOR and XNOR cell depicted in Figure 5 is asymmetrical: one of the inputs, $A$, is fed to the pass transistors in addition to driving a NOT gate, hence the inputs $A$ and $B$ see dissimilar capacitance. Table 1 shows the transistor-level operation of the proposed XNOR and XOR cell. Considering XNOR operations, when either of the inputs
,, Nitin D. Patwari, Anjul Srivastav, Mayank Kabra, Prashant Jonna, and Madhav Rao
Figure 4: (a) Structure of the OBS multiplier where overlap-free transits to CA multiplier at different levels
A SPICE netlist was used to characterize the delay and power of the defined cells. Switching and leakage power were extracted and added to the cell properties. The pass-transistor based XNOR cell and its characteristics were added to the standard cell library, and the resulting library is referred to as the custom library. The custom library also included other cells such as AND, OR, NOT, and multiplexer units of different drive strengths.
multiple usage of lower operand size multipliers for realizing wider operand size multiplier designs. As an example, a 163-bit multiplier is realized using a set of 82, 41, 21, 11, 6, 3, and 2-bit multipliers. Similarly, 97, 49, 25, 13, 7, 4, and 2-bit multipliers are employed to realize the 193-bit design. Many of the smaller multiplier units operate in parallel, so little difference in delay is noticed at that level. The wider operand sizes show a prominent surge in the delay metric. Four finite field multipliers with operand sizes of 2, 4, 8, 16, 32, 64, 93, 131, 163, 193, 233, 283, and 409 bits are implemented and characterized for hardware parameters. A Python script was set up to automate the generation of finite field multipliers in Verilog for varying operand sizes. Structural symmetry and the fixed pattern in each of the finite field multipliers were maintained, and the Python source code is made freely available to the research and development community in [33].

Figure 6: (a) Area of finite field multipliers. (b) Delay of finite field multipliers.

4.2 Synthesized results using customized library

The customized library with the fast XNOR cell was adopted to synthesize four finite field multipliers of varying operand sizes. The number of cells picked, the instances of each cell, and the delay were compared, for each operand size, with the results synthesized using the standard cell library, as shown in Figure 7 (a, b). As targeted, the compute latency of all the finite field multipliers improved, as shown in Figure 7 (b). The compute delay of OKA improved by 8.24% to 33.45% across the operand sizes. Similarly, OBS with fast XNOR cells offered compute delay benefits ranging from 8% to 37.05%. CA and KA exhibit delay improvements in the range of 4.63% to 18.36% and 1.01% to 38.73% respectively. For higher operand sizes, the OBS based finite field multiplier exhibits the lowest number of gates and the second-best compute delay. With the fast XNOR cell, the difference in compute latency between CA, OKA, and OBS tends to shrink compared with the results synthesized from the standard cell library. The library with fast XNOR cells, however, relaxes the design of the multipliers, with a 2X to 3X increase in cell usage. The cell counts of the OKA, KA, and OBS multipliers show hardly any difference among the three with the fast XNOR cells. Although high in cell count, the finite field multipliers synthesized with the custom library are performance-efficient designs. Table 2 shows the XNOR cells picked for synthesizing three finite field multiplier designs using the standard cell library and the custom cell library individually. The fast XNOR cell added to the custom cell library was picked at least 300 times in all three designs, which validates the use case of adopting fast XNOR cells in realizing finite field multipliers. Although the original 45 nm GPDK standard cell library used 10X more XNOR cells than the custom library with fast XNOR cells, and although the total cell count with the new library was high, the fast XNOR cells showed a significant improvement in compute latency. Additionally, the characterized power of the finite field multipliers was 2X higher with the library that includes the fast XNOR cell than with designs realized from standard gates. Hence, further improvements in the gate designs that not only benefit performance but also improve other hardware metrics such as power and footprint will be valuable.

5 CONCLUSION

Fast XNOR cell based finite field multipliers were designed and evaluated for operand sizes ranging from 93 to 409 bits. These designs were synthesized through the ASIC flow at the 45 nm technology node using the standard cell library and the library with the added fast XNOR cell independently. The finite field multipliers built with the fast XNOR cell generated faster outputs, but at the cost of higher cell usage and therefore a larger silicon area requirement. The fast XNOR cell based finite field multiplier designs exhibited compute delay benefits in the range of 8.24% to 33.45%, 8% to 37.05%, 4.63% to 18.36%, and 1.01% to 38.73% for OKA, OBS, CA, and KA respectively. Among the finite field multipliers, the OBS multiplier design exhibited the second-best performance characteristics and low cell usage across all the operand sizes studied with the XNOR adopted library. Performance-efficient finite field multipliers are a step towards realizing cryptographic accelerators for security applications. All the design files are made freely available for further use by the research and designers' community.
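The sub-multiplier sizes quoted in the synthesis discussion (163-bit built from 82, 41, 21, 11, 6, 3, and 2-bit units; 193-bit from 97, 49, 25, 13, 7, 4, and 2-bit units) are consistent with repeatedly halving the operand size and rounding up. The Python sketch below reproduces those chains under that assumption; the function name `halving_chain` and the stopping size of 2 bits are illustrative choices inferred from the quoted lists, not taken from the authors' generator script.

```python
def halving_chain(n: int, smallest: int = 2) -> list[int]:
    """Sub-multiplier sizes obtained by ceiling-halving n down to `smallest`.

    Assumption: each recursion level splits an operand into two halves,
    with the larger half (ceil(n / 2)) setting the sub-multiplier width.
    """
    sizes = []
    while n > smallest:
        n = (n + 1) // 2  # ceiling division by two
        sizes.append(n)
    return sizes


if __name__ == "__main__":
    for width in (163, 193):
        print(width, halving_chain(width))
```

Under these assumptions, `halving_chain(163)` returns `[82, 41, 21, 11, 6, 3, 2]` and `halving_chain(193)` returns `[97, 49, 25, 13, 7, 4, 2]`, matching the sizes listed in the text.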
Figure 7: (a) Number of cells instantiated by the finite field multipliers design when realized with custom cell library. (b) Delay of finite field multipliers implemented with custom cell library.

REFERENCES

[1] Hang Zhang, Bo Liu, and Hongyu Wu. Smart grid cyber-physical attack and defense: A review. IEEE Access, 9:29641–29659, 2021.
[2] James P. Farwell and Rafal Rohozinski. Stuxnet and the future of cyber war. Survival, 53(1):23–40, 2011.
[3] Joonsang Yoo and Jeong Hyun Yi. Code-based authentication scheme for lightweight integrity checking of smart vehicles. IEEE Access, 6:46731–46741, 2018.
[4] Puvvadi Aparna and Polurie Venkata Vijay Kishore. Biometric-based efficient medical image watermarking in e-healthcare application. IET Image Processing, 13(3):421–428, 2019.
[5] Zuowen Tan. Secure delegation-based authentication for telecare medicine information systems. IEEE Access, 6:26091–26110, 2018.
[6] Karim Shahbazi and Seok-Bum Ko. Area-efficient nano-aes implementation for internet-of-things devices. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 29(1):136–148, 2021.
[7] Aristidis G. Anagnostakis, Charilaos Naxakis, Nikolaos Giannakeas, Markos G. Tsipouras, Alexandros T. Tzallas, and Euripidis Glavas. Scalable consensus over finite capacities in multiagent iot ecosystems. IEEE Internet of Things Journal, pages 1–1, 2022.
[8] Ruba Abu-Salma, M. Angela Sasse, Joseph Bonneau, Anastasia Danilova, Alena Naiakshina, and Matthew Smith. Obstacles to the adoption of secure communication tools. In 2017 IEEE Symposium on Security and Privacy (SP), pages 137–153, 2017.
[9] A. Hiltgen, T. Kramp, and T. Weigold. Secure internet banking authentication. IEEE Security Privacy, 4(2):21–29, 2006.
[10] Hal Berghel. The future of digital money laundering. Computer, 47(8):70–75, 2014.
[11] Muneer Bani Yassein, Shadi Aljawarneh, Ethar Qawasmeh, Wail Mardini, and Yaser Khamayseh. Comprehensive study of symmetric key and asymmetric key encryption algorithms. In 2017 International Conference on Engineering and Technology (ICET), pages 1–7, 2017.
[12] Xin Zhou and Xiaofei Tang. Research and implementation of rsa algorithm for encryption and decryption. In Proceedings of 2011 6th International Forum on Strategic Technology, volume 2, pages 1118–1121, 2011.
[13] Nils Mäurer, Thomas Gräupl, Christoph Gentsch, and Corinna Schmitt. Comparing different diffie-hellman key exchange flavors for ldacs. In 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC), pages 1–10, 2020.
[14] Qizhi Qiu and Qianxing Xiong. Research on elliptic curve cryptography. In 8th International Conference on Computer Supported Cooperative Work in Design, volume 2, pages 698–701 Vol.2, 2004.
[15] Bappaditya Jana and Jayanta Poray. A performance analysis on elliptic curve cryptography in network security. In 2016 International Conference on Computer, Electrical Communication Engineering (ICCECE), pages 1–7, 2016.
[16] Ali Raya and K. Mariyappn. Security and performance of elliptic curve cryptography in resource-limited environments: A comparative study. In 2020 15th International Conference for Internet Technology and Secured Transactions (ICITST), pages 1–8, 2020.
[17] MD. Mainul Islam, MD. Selim Hossain, MD. Shahjalal, MOH. Khalid Hasan, and Yeong Min Jang. Area-time efficient hardware implementation of modular multiplication for elliptic curve cryptography. IEEE Access, 8:73898–73906, 2020.
[18] Mohita Jaiswal and Kusum Lata. Hardware implementation of text encryption using elliptic curve cryptography over 192 bit prime field. In 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 343–349, 2018.
[19] Zia U. A. Khan and Mohammed Benaissa. High-speed and low-latency ecc processor implementation over gf(2^m) on fpga. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(1):165–176, 2017.
[20] Gang Chen, Guoqiang Bai, and Hongyi Chen. A high-performance elliptic curve cryptographic processor for general curves over gf(p) based on a systolic arithmetic unit. IEEE Transactions on Circuits and Systems II: Express Briefs, 54(5):412–416, 2007.
[21] Leelavathi G, Shaila K, and Venugopal K R. Elliptic curve cryptography implementation on fpga using montgomery multiplication for equal key and data size over gf(2^m) for wireless sensor networks. In 2016 IEEE Region 10 Conference (TENCON), pages 468–471, 2016.
[22] Hamad Marzouqi, Mahmoud Al-Qutayri, Khaled Salah, Dimitrios Schinianakis, and Thanos Stouraitis. A high-speed fpga implementation of an rsd-based ecc processor. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(1):151–164, 2016.
[23] Parham Hosseinzadeh Namin, Crystal Roma, Roberto Muscedere, and Majid Ahmadi. Efficient vlsi implementation of a sequential finite field multiplier using reordered normal basis in domino logic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(11):2542–2552, 2018.
[24] Chiou-Yng Lee, Chun-Sheng Yang, Bimal Kumar Meher, Pramod Kumar Meher, and Jeng-Shyang Pan. Low-complexity digit-serial and scalable spb/gpb multipliers over large binary extension fields using (b,2)-way karatsuba decomposition. IEEE Transactions on Circuits and Systems I: Regular Papers, 61(11):3115–3124, 2014.
[25] A Karatsuba and Yu Ofman. Multiplication of many-digital numbers by automatic computers. Dokl. Akad. Nauk SSSR, 145(2):293–294, 1962.
[26] Christina Thomas and K. Gnana Sheela. Analysis of elliptic curve scalar multiplication in secure communications. In 2015 Global Conference on Communication Technologies (GCCT), pages 623–627, 2015.
[27] A.A.-A. Gutub, M.K. Ibrahim, and A. Kayali. Pipelining gf(p) elliptic curve cryptography computation. In IEEE International Conference on Computer Systems and Applications, 2006., pages 93–99, 2006.
[28] H. Fan. Overlap-free karatsuba–ofman polynomial multiplication algorithms. IET Information Security, 4:8–14(6), March 2010.
[29] A. Reyhani-Masoleh and M.A. Hasan. Low complexity bit parallel architectures for polynomial basis multiplication over gf(2^m). IEEE Transactions on Computers, 53(8):945–959, 2004.
[30] Jiafeng Xie, Pramod Kumar Meher, Mingui Sun, Yuecheng Li, Bo Zeng, and Zhi-Hong Mao. Efficient fpga implementation of low-complexity systolic karatsuba multiplier over gf(2^m) based on nist polynomials. IEEE Transactions on Circuits and Systems I: Regular Papers, 64(7):1815–1825, 2017.
[31] Moslem Heidarpur and Mitra Mirhassani. An efficient and high-speed overlap-free karatsuba-based finite-field multiplier for fgpa implementation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 29(4):667–676, 2021.
[32] Jyh-Ming Wang, Sung-Chuan Fang, and Wu-Shiung Feng. New efficient designs for xor and xnor functions on the transistor level. IEEE Journal of Solid-State Circuits, 29(7):780–786, 1994.
[33] https://github.com/patwarind.