Design of a Scalable RSA and ECC Crypto-Processor
Ming-Cheng Sun, Chih-Pin Su, Chih-Tsun Huang and Cheng-Wen Wu Laboratory for Reliable Computing (LaRC) Department of Electrical Engineering National Tsing Hua University Hsinchu, Taiwan 30013, ROC
Abstract-In this paper, we propose a scalable word-based crypto-processor that performs modular multiplication based on a modified Montgomery algorithm for finite fields $GF(P)$ and $GF(2^m)$. The unified crypto-processor supports scalable keys of length up to 2048 bits for RSA and 512 bits for elliptic curve cryptography (ECC). Further extension of the key length can be done easily by enlarging the memory module or using external memory resources. With the proposed parity prediction technique, our pipelined crypto-processor achieves a 512-bit RSA encryption rate of 276 Kbps and a 160-bit ECC encryption rate of 73.3 Kbps at a 220 MHz clock rate.
I. INTRODUCTION
With the rapid advance in communication technology, more and more applications such as e-commerce and wireless networking are appearing. Protecting sensitive information transmitted over an insecure communication channel has become essential. Various cryptographic systems have been investigated to prevent information from being snooped. Public-key cryptography, such as the RSA algorithm [1], Elliptic Curve Cryptography (ECC) [2,3], DSA and the Diffie-Hellman (DH) key exchange algorithm [4], plays a vital role in modern security systems. Most public-key cryptography relies heavily on finite field or modular multiplication, which is the crucial part of high-performance hardware for system applications such as VPN (Virtual Private Network), SSL (Secure Socket Layer), etc. In 1985, Montgomery proposed a modular multiplication algorithm that avoids iterative divisions and is suitable for VLSI implementation [5]. Further improvements and modifications of the Montgomery algorithm can be found in [6,7].

Most conventional works focused on ASIC design, and efficient systolic array architectures for specific operand sizes have been investigated. As the key size grows with the demands of security systems, ASIC implementations suffer from hardware complexity. Recently, scalable architectures for modular multiplication have been considered [8,9] to trade off between performance and area overhead. A scalable multiplier architecture for finite fields $GF(P)$ and $GF(2^m)$ was proposed in [10], which supports the basic multiplication for both ECC and RSA within the same hardware module. In this paper, we propose a scalable crypto-processor for both RSA and ECC. Our pipelined dual-field crypto-module supports finite field multiplication over both $GF(P)$ and $GF(2^m)$. A parity prediction technique is presented to compensate for the pipeline stalls caused by the data dependency of the original algorithm, which simplifies the controller design and further speeds up finite field and modular multiplication. In addition, an efficient dual-field adder and subtractor are used to accomplish the modified Montgomery multiplication in support of both RSA and ECC operations.

II. WORD-BASED MULTIPLICATION ARCHITECTURE WITH PARITY PREDICTION
Let $P = (P_{w-1}, P_{w-2}, \ldots, P_0)_r$ be a large prime number with $w$ digits in radix $r$, where $r = 2^k$ and $k$ is the word width. Let $B = (B_{w-1}, B_{w-2}, \ldots, B_0)_r$ and $A = (a_{m-1}, a_{m-2}, \ldots, a_0)_2$ be two large integers, which satisfy $m = wk$ and $0 \le A, B < P$. Let $B^{j}_{i}$ be the $i$-th bit in the $j$-th word of $B$. For simplicity, a range subscript denotes the series of bits from the $i$-th to the $k$-th in the $j$-th word of $B$. The word-based radix-2 Montgomery algorithm over $GF(P)$ is shown in Fig. 1(a), using the carry-save form to represent the intermediate results. Similarly, the word-based Montgomery algorithm over $GF(2^m)$ is shown in Fig. 1(b), in which $A(x), B(x) \in GF(2^m)$ and $P(x)$ is the irreducible polynomial. In Fig. 1(a), $\{C_{out}, X\} = \{X, 0\}$ represents a bitwise left shift of the operand $X$, and $C_{out}$ is the one-bit carry digit. The algorithm performs shifting instead of conventional division, resulting in a fast implementation of modular multiplication.

However, due to the data dependency of the algorithm shown in Fig. 2(a), the processing unit (PU) which computes $C^{j} + S^{j} + a_i \cdot B^{j}$ has to wait until the result of $C^{j+1} + S^{j+1} + a_{i-1} \cdot B^{j+1} + parity \cdot P^{j+1}$ is generated. As a result, an extra pipeline stall exists between each column of PUs, which degrades the performance [10]. An extra constant $2^{-m}$ is introduced by the Montgomery multiplication stage, i.e., $M = A \cdot B \cdot 2^{-m} \bmod P$. In addition, $M$ is in the range $[0, 2P)$. A final reduction is thus required to ensure that $0 \le M < P$. Therefore, the improved Montgomery algorithm with two's complement numbers [7] is used in our design to prevent pipeline stalls caused by the sign digits, which is discussed in the following section. A parity prediction module (PPM) is implemented to predict the parity (i.e., $S^{0}_{0}$ in Fig. 1) and to compensate for the pipeline stall, where the time instance is denoted as the superscript with
0-7803-7659-5/03/$17.00 ©2003 IEEE.
Fig. 2. Data flow graph of word-based modular multiplication (a) without parity prediction; (b) with parity prediction. (c) The signal description of each PU and (d) the block diagram of the projected architecture.
Fig. 1. The word-based radix-2 Montgomery multiplication algorithms over (a) $GF(P)$ and (b) $GF(2^m)$.
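The pseudocode of Fig. 1 is not reproduced in this copy, so the following is a minimal bit-serial sketch of radix-2 Montgomery multiplication for both fields. It is an illustration only, not the word-based carry-save datapath of the proposed hardware. It assumes that $P$ is odd (for $GF(2^m)$, that the irreducible polynomial has constant term 1); the function names `mont_mul_gfp` and `mont_mul_gf2m` are hypothetical, and Python integers stand in for bit vectors.

```python
def mont_mul_gfp(a: int, b: int, p: int, m: int) -> int:
    """Return a * b * 2^(-m) mod p for odd p, with 0 <= a, b < p < 2^m."""
    s = 0
    for i in range(m):
        s += ((a >> i) & 1) * b   # add a_i * B
        if s & 1:                 # the parity (LSB) decides whether P is added
            s += p
        s >>= 1                   # exact division by 2, done as a right shift
    # s lies in [0, 2p); one final reduction brings it into [0, p)
    return s - p if s >= p else s


def mont_mul_gf2m(a: int, b: int, p: int, m: int) -> int:
    """Return a(x) * b(x) * x^(-m) mod p(x); GF(2) polynomials as bit masks."""
    s = 0
    for i in range(m):
        if (a >> i) & 1:
            s ^= b                # addition in GF(2^m) is a bitwise XOR
        if s & 1:
            s ^= p                # p(x) has constant term 1, so bit 0 is cleared
        s >>= 1                   # division by x
    return s                      # degree stays below m, no final reduction
```

In both loops the only data-dependent decision is the parity test on the least significant bit, which is the dependency that the parity prediction of this section targets.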
parentheses and the bit position is denoted as the subscript. For example, $S^{(t)}_{0}$ is the least significant bit (LSB) of $S^{(t)}$ generated at time instance $t$. From the data dependency shown in Fig. 3, $S^{(t)}_{0} = S^{(t-1)}_{1} \oplus C^{(t-1)}_{0} \oplus (b_0 a_i)$. Since $C^{(t-1)}_{0} = P_0 \cdot S^{(t-1)}_{0}$ and $P_0 = 1$ (because $P$ is a prime), $C^{(t-1)}_{0} = S^{(t-1)}_{0}$. In addition, $S^{(t-1)}_{1}$ depends on $P_1$ through an exclusive-OR. Therefore, $S^{(t)}_{0}$ can be expressed as the exclusive-OR of $b_0 a_i$ with terms that are available from time instant $(t-2)$ and the parity $S^{(t-1)}_{0}$. Combining these, we can generate the parity $S^{(t)}_{0}$ at time $(t-1)$, which can then be applied immediately at time $(t)$. The resultant data flow graph is shown in Fig. 2(b), where the add-on functional block represents the parity prediction module. Thus the pipeline stall can be eliminated.

To realize the function of Y in Fig. 2(c), our PU consists of a dual-field adder (DFA) array for both $GF(P)$ and $GF(2^m)$, a sign-bit generator (SG) and a PPM. Figure 4 shows the circuit
$$R = \begin{cases} \cdots \end{cases}$$
for addition over $GF(P)$. Similarly, for subtraction over $GF(P)$, the result $R$ will be
$$R = \begin{cases} A - B, \\ A - B - P, \\ A - B + P, \\ A - B. \end{cases}$$
Fig. 6 shows the block diagram of the dual-field adder/subtractor array. To support different arithmetic operations, multiplexers are implemented to select the proper operands for the DFA array. Note that since $P$ is a prime, $P_0 = 1$ and the LSB of its two's complement is also 1. Similarly, $C_{out} = 1$ when subtracting $B$. Addition and subtraction over $GF(2^m)$ are simply the bitwise exclusive-OR operation. In addition, a pipelined carry lookahead adder (CLA) is used to convert the result of modular addition, subtraction or multiplication from the redundant carry-save form to the irredundant form.
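As an illustration of the behaviour described above, the following is a minimal sketch of a dual-field adder/subtractor. The conditions that select among the cases of $R$ are not legible in this copy, so the sketch assumes a simple rule: it compares the raw result against $\pm P$, which is one way to keep the result within $(-P, P)$. The function name `dual_field_addsub` and its parameters are hypothetical, and the model works on Python integers rather than on the carry-save form.

```python
def dual_field_addsub(a: int, b: int, p: int,
                      subtract: bool = False,
                      binary_field: bool = False) -> int:
    """Behavioural model of a dual-field adder/subtractor (illustrative only)."""
    if binary_field:
        # Over GF(2^m), addition and subtraction are the same bitwise XOR.
        return a ^ b
    # Over GF(P), operands are assumed to lie in (-P, P),
    # so the raw result lies in (-2P, 2P).
    r = a - b if subtract else a + b
    # Assumed selection rule: compare against +/-P and pick r, r - P or r + P.
    if r >= p:
        r -= p
    elif r <= -p:
        r += p
    return r  # within (-P, P) and congruent to the raw result modulo P
```

For example, `dual_field_addsub(9, 8, 13)` selects the `r - P` branch, while `dual_field_addsub(0b1011, 0b0110, 0, binary_field=True)` reduces to a single XOR.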
$w$, the output of PU$_n$ can be fed back immediately to PU$_1$; otherwise an extra buffer is required to store the temporary data until PU$_1$ is available. The total computation time, CC (clock cycles), is summarized as follows,
$$CC = \begin{cases} \cdots \end{cases}$$
Using the proposed parity prediction, the total computation time can be reduced by $(\lceil m/n \rceil - 1) \cdot n + (n - 1)$ clocks when $w \le n$ and by $(n - 1)$ clocks otherwise, as compared with that in [10,11]. The area overhead of the PPM and SG is approximately 4% as compared with the DFA array. In addition to the PUs, an extra stage is needed to ensure that the result is within the range of $(-P, P)$. The relation between the output of PU$_n$, $O$, and the final result, $R$, is given as
$$R = \begin{cases} O - P, \\ O - B, \\ O, \\ O + B - P. \end{cases}$$
[Fig. 7 block labels: Memory Module (16 Kb); Control; Response; Input Data; Output Data; Scalable Dual-Field Crypto-Module; Dual-Field Controller.]
Such a function can be easily implemented by a dual-field adder/subtractor. The function of final adjustment can be extended to support finite field addition and subtraction for both $GF(P)$ and $GF(2^m)$, which are also the basic operations to compute ECC. To ensure that the result of two's complement addition and subtraction is within the range of $(-P, P)$, the result $R$ with respect

III. THE CRYPTO-PROCESSOR ARCHITECTURE
The overall architecture of the crypto-processor core is shown in Fig. 7. There is an I/O interface for transferring data from and to the on-chip bus with a standard protocol. Therefore, the crypto-processor can be easily plugged into a system chip. The crypto-controller manages the information exchange from the I/O interface and the different cryptographic operations of both RSA and ECC. When a cryptographic process begins, the controller loads the necessary multiplicand, multiplier and the prime or irreducible polynomial into the memory module. Proper microinstructions are generated by the crypto-controller and fed into the RSA/ECC controller. The RSA/ECC controller then accesses the dual-field crypto-module for proper data flow. The RSA/ECC controller also assigns each memory block as a read or write buffer during encryption and decryption. The dual-field controller selects either $GF(P)$ or $GF(2^m)$ arithmetic operations in the crypto-module. There are 16 PUs with 32-bit words in the crypto-module. The memory module consists of 2048-bit x 6 two-port memory blocks as the register files to store the intermediate codeword. An additional 2048-bit x 2 FIFOs are used as the buffer to store the temporary data when the key length is greater than 512 bits. The key length is scalable in 32-bit words. As a result, the overall crypto-processor is capable of processing 2048-bit RSA and 512-bit ECC cryptography. Moreover, it is extensible simply with a larger memory module, or by using external memory resources as the buffer.

IV. COMPARISONS
Table I compares different designs of 512-bit RSA cryptography with normalized clock rate and baud rate, where NB represents the normalized baud rate with respect to 0.35 µm technology. The first three ASIC designs are systolic array designs, while the design in [12] requires no broadcasting signal and achieves the best NB per gate for RSA. The last three designs are processor-based implementations. Our crypto-processor achieves the highest performance as measured by NB per gate, even without counting its scalability and maximum key length, which are further advantages of the proposed design. The NB is 276 Kbps by a standard cell-library design flow using 0.35 µm technology with 40K gates and a 220 MHz clock rate from the synthesis result.

TABLE I

Design              Year   Gate Count   # Clock Cycles
Yang [6]            1998   74K          390K
Su [7]              1999   76K          510K
Hong [12]           2002   77K          530K
Hsieh (8-bit) [13]  1999   4.5K         6.5M
Lin (16-bit) [8]    2001   13.1K        810K
Ours (32-bit)       2002   40K          405K
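As a rough cross-check of the figures above, the following sketch relates the clock-cycle count of Table I to the reported rate. It assumes that the rate is simply the key length multiplied by the clock rate and divided by the cycle count per block, which is an assumption of this example and not a formula given in the paper. The function name `throughput_kbps` is hypothetical, and no normalization between technologies is modelled.

```python
def throughput_kbps(key_bits: int, clock_cycles: float, clock_hz: float) -> float:
    """Bits processed per second, in Kbps, for one key-length block per run."""
    return key_bits * clock_hz / clock_cycles / 1e3


# Values taken from the text and Table I: 512-bit RSA, about 405K cycles, 220 MHz.
rsa512 = throughput_kbps(512, 405e3, 220e6)
print(f"512-bit RSA: about {rsa512:.0f} Kbps")
```

Under this assumption the estimate lands within a few Kbps of the reported 276 Kbps; the small gap may come from the rounded cycle count in the table or from overhead not captured by this simple model.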
In addition, the pipelined crypto-processor is unified for both RSA and ECC with scalable key length. For ECC computation, projective coordinates are used to reduce the need for the modular inversion required in affine coordinates. The resultant baud rate of 160-bit ECC is 73.3 Kbps for $GF(P)$ and 65.9 Kbps for $GF(2^m)$.

V. CONCLUSIONS
We have presented a new scalable and unified crypto-processor based on a modified Montgomery algorithm for both RSA and ECC. An effective pipeline architecture is implemented to perform the modular multiplication, addition and subtraction over $GF(P)$ and $GF(2^m)$, which are the basic operations in RSA and ECC. The word-based crypto-processor supports scalable keys of length up to 2048 bits for RSA and 512 bits for ECC. The key length can be increased easily with a larger memory module or by using external memory resources, without affecting the overall architecture. Using a 0.35 µm CMOS technology, our crypto-processor achieves a 512-bit RSA encryption rate of 276 Kbps and a 160-bit ECC encryption rate of 73.3 Kbps at a 220 MHz clock rate.
REFERENCES
[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, Feb. 1978.
[2] N. Koblitz, "Elliptic curve cryptosystems," in Mathematics of Computation, 1987, pp. 203-209.
[3] V. S. Miller, "Use of elliptic curve in cryptography," in Advances in Cryptology-Crypto 85 Proceedings, 1986, pp. 417-426.
[4] W. Diffie and M. E. Hellman, "New directions in cryptography," IEEE Trans. Information Theory, vol. 22, no. 6, pp. 644-654, Nov. 1976.
[5] P. L. Montgomery, "Modular multiplication without trial division," Math. Computation, vol. 44, no. 7, pp. 519-521, 1985.
[6] C.-C. Yang, T.-S. Chang, and C.-W. Jen, "A new RSA cryptosystem hardware design based on Montgomery's algorithm," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 7, pp. 908-913, July 1998.
[7] C.-Y. Su, S.-A. Hwang, P.-S. Chen, and C.-W. Wu, "An improved Montgomery algorithm for high-speed RSA public-key cryptosystem," IEEE Trans. VLSI Systems, vol. 7, no. 2, pp. 280-284, June 1999.
[8] Y.-C. Lin, "A word-based RSA public-key crypto-processor core for IC smart card," Master thesis, Dept. Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, June 2001.
[9] A. F. Tenca, G. Todorov, and C. K. Koç, "High-radix design of a scalable modular multiplier," in Cryptographic Hardware and Embedded Systems (CHES) 2001, C. K. Koç, D. Naccache, and C. Paar, Eds. 2001, number 2162 in LNCS, pp. 189-205, Springer-Verlag.
[10] E. Savaş, A. F. Tenca, and C. K. Koç, "A scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m)," in Cryptographic Hardware and Embedded Systems (CHES) 2000. 2000, LNCS, pp. 281-296, Springer-Verlag.
[11] A. F. Tenca and C. K. Koç, "A scalable architecture for Montgomery multiplication," in Cryptographic Hardware and Embedded Systems (CHES) 1999. 1999, LNCS, pp. 94-108, Springer-Verlag.
[12] J.-H. Hong and C.-W. Wu, "Cellular array modular multiplier for the RSA public-key cryptosystem based on modified Booth's algorithm," IEEE Trans. VLSI Systems, 2002 (accepted).
[13] Y.-H. Hsieh, "Design and implementation of an RSA encryption/decryption processor on IC smart card," Master Thesis, Dept. Electrical Engineering, National Taiwan University, Taipei, Taiwan, June 1999.