FPGA Implementations of The Hummingbird Cryptographic Algorithm
Ken Lauffenburger
Aava Technology LLC 1206 Donegal Ln Garland, TX 75044, USA Email: [email protected]
Troy Hicks
Revere Security Corporation 4500 Westgrove Drive, Suite 335 Addison, TX 75001, USA Email: [email protected]
Abstract: Hummingbird is a new ultra-lightweight cryptographic algorithm targeted at resource-constrained devices such as RFID tags, smart cards, and wireless sensor nodes. In this paper, we describe efficient hardware implementations of a stand-alone Hummingbird component in field-programmable gate array (FPGA) devices. We implement an encryption-only core and an encryption/decryption core on the low-cost Xilinx Spartan-3 FPGA series and compare our results with other reported lightweight block cipher implementations on the same series. Our experimental results show that, in the context of low-cost FPGA implementation, Hummingbird has favorable efficiency and low area requirements.
Index Terms: Lightweight cryptographic primitive, resource-constrained devices, FPGA implementations.
I. INTRODUCTION
Hummingbird is a recently proposed ultra-lightweight cryptographic algorithm targeted at low-cost smart devices such as RFID tags, smart cards, and wireless sensor nodes [3]. It has a hybrid structure of block cipher and stream cipher and was developed with both lightweight software and lightweight hardware implementations for constrained devices in mind. Moreover, Hummingbird has been shown to be resistant to the most common attacks on block ciphers and stream ciphers, including birthday attacks, differential and linear cryptanalysis, structure attacks, algebraic attacks, and cube attacks [3]. In practice, Hummingbird has been implemented on a wide range of target platforms [3], [5]. Those implementations demonstrate that Hummingbird provides efficient and flexible software solutions for various embedded applications. However, the hardware performance of Hummingbird has not yet been investigated in detail. Our main contribution in this paper is therefore to close this gap and provide the first efficient hardware implementations of Hummingbird encryption/decryption cores on low-cost FPGAs. Our implementation results show that, on the Spartan-3 XC3S200 FPGA device, the speed optimized Hummingbird encryption core achieves a throughput of 160.4 Mbps at the cost of 273 slices, whereas the encryption/decryption core can be implemented in 558 slices and operates at 128.8 Mbps.
II. THE HUMMINGBIRD CRYPTOGRAPHIC ALGORITHM
Hummingbird is neither a block cipher nor a stream cipher, but a rotor machine equipped with novel rotor-stepping rules.
The design of Hummingbird is based on an elegant combination of a block cipher and a stream cipher with a 16-bit block size, a 256-bit key size, and an 80-bit internal state. Figure 1(a) and Figure 1(b) illustrate the initialization and encryption processes of the Hummingbird cryptographic algorithm, respectively. Both initialization and encryption consist of four 16-bit block ciphers $E_{k_i}$ ($i = 1, 2, 3, 4$), four 16-bit internal state registers $RS_i$ ($i = 1, 2, 3, 4$), and a 16-stage Linear Feedback Shift Register (LFSR). Moreover, the 256-bit secret key K is divided into four 64-bit subkeys k1, k2, k3 and k4, which are used in the four block ciphers, respectively. After a system initialization process as shown in Figure 1(a), a 16-bit plaintext block $PT_i$ is encrypted by passing it consecutively through four identical block ciphers $E_{k_i}(\cdot)$ ($i = 1, 2, 3, 4$), each of which is a typical substitution-permutation (SP) network with a 16-bit block size and a 64-bit key, as shown in Figure 1(c). The block cipher consists of four regular rounds and a final round. The substitution layer is composed of four S-boxes with 4-bit inputs and 4-bit outputs, as shown in Table I.
TABLE I
FOUR S-BOXES IN HEXADECIMAL NOTATION

x    S1(x)  S2(x)  S3(x)  S4(x)
0    8      0      2      0
1    6      7      E      7
2    5      E      F      3
3    F      1      5      4
4    1      5      C      C
5    C      B      1      1
6    A      8      9      A
7    9      2      A      F
8    E      3      B      D
9    B      A      4      E
A    2      D      6      6
B    4      6      8      B
C    7      F      0      2
D    0      C      7      8
E    D      4      3      9
F    3      9      D      5
The permutation layer in the 16-bit block cipher is given by the linear transform $L : \{0, 1\}^{16} \to \{0, 1\}^{16}$ defined as follows:
$$L(m) = m \oplus (m \lll 6) \oplus (m \lll 10),$$
where $m = (m_0, m_1, \ldots, m_{15})$ is a 16-bit data block and $\lll$ denotes a left circular shift. To further reduce the area and power consumption of Hummingbird in hardware implementations, the four S-boxes used in Hummingbird can be replaced by a single S-box, which is repeated four times in the 16-bit block cipher. This compact version of Hummingbird achieves the same security level as the original Hummingbird and is the version implemented on FPGAs in this paper. For more details about Hummingbird, the interested reader is referred to [3].
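As an illustration of the two layers described above, the following Python sketch models the substitution layer from Table I and the linear transform L in software. It is a minimal sketch, not the hardware design: the mapping of S1 to the most significant nibble and the helper names (`rotl16`, `substitute`, `linear_transform`) are assumptions made for this example.

```python
MASK16 = 0xFFFF

# S-boxes transcribed from Table I, indexed by the 4-bit input x.
S1 = [0x8, 0x6, 0x5, 0xF, 0x1, 0xC, 0xA, 0x9, 0xE, 0xB, 0x2, 0x4, 0x7, 0x0, 0xD, 0x3]
S2 = [0x0, 0x7, 0xE, 0x1, 0x5, 0xB, 0x8, 0x2, 0x3, 0xA, 0xD, 0x6, 0xF, 0xC, 0x4, 0x9]
S3 = [0x2, 0xE, 0xF, 0x5, 0xC, 0x1, 0x9, 0xA, 0xB, 0x4, 0x6, 0x8, 0x0, 0x7, 0x3, 0xD]
S4 = [0x0, 0x7, 0x3, 0x4, 0xC, 0x1, 0xA, 0xF, 0xD, 0xE, 0x6, 0xB, 0x2, 0x8, 0x9, 0x5]


def rotl16(x, r):
    """Left circular shift of a 16-bit word by r positions."""
    return ((x << r) | (x >> (16 - r))) & MASK16


def linear_transform(m):
    """L(m) = m XOR (m <<< 6) XOR (m <<< 10) on a 16-bit block."""
    return m ^ rotl16(m, 6) ^ rotl16(m, 10)


def substitute(m, sboxes):
    """Apply four 4-bit S-boxes to the four nibbles of m.

    Assumption: sboxes[0] acts on the most significant nibble.
    """
    out = 0
    for sbox, shift in zip(sboxes, (12, 8, 4, 0)):
        out |= sbox[(m >> shift) & 0xF] << shift
    return out


# Original version: four different S-boxes; compact version: one S-box four times.
original_layer = [S1, S2, S3, S4]
compact_layer = [S3, S3, S3, S3]
```

Both layers are bijections on 16-bit blocks, which is what allows the decryption core to invert them; the compact version simply passes the same table four times to `substitute`.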
[Fig. 1. (a) Initialization process, (b) encryption process, and (c) 16-bit block cipher of Hummingbird. Figure not reproduced; recoverable labels: nonces NONCE1 to NONCE4, rotors RS1 to RS4, block ciphers Ek1 to Ek4, the LFSR, the output TV, round keys K1 to K4, and the derived keys K5 = K1 ⊕ K3 and K6 = K2 ⊕ K4.]
III. FPGA IMPLEMENTATIONS OF HUMMINGBIRD
In this section, efficient FPGA implementations of a stand-alone Hummingbird component are described. Note that the choice of I/O interface has a significant influence on the performance of a hardware implementation and is highly application specific. Therefore, we do not implement any specific I/O logic, in order to obtain an accurate performance profile of a plain Hummingbird encryption/decryption core and to provide enough flexibility for various applications.
A. Selection of a Hardware-Friendly S-Box
A hardware-friendly S-box is one that can be efficiently implemented on the target hardware platform with a small area requirement. Four 4 × 4 S-boxes $S_i(x) : \mathbb{F}_2^4 \to \mathbb{F}_2^4$ ($i = 1, 2, 3, 4$) have been carefully selected in Hummingbird according to certain security criteria (see Section II). To implement the compact version of Hummingbird, we need to choose a hardware-friendly S-box from the four S-boxes listed in Table I. Using the Boolean minimization tool Espresso [4], we can obtain the minimal Boolean function representations (BFR) for the four S-boxes in Hummingbird. Note that each S-box can be implemented in hardware by using either a look-up table (LUT) or the Boolean function representations (i.e., combinatorial logic). The relative efficiency of these two approaches depends significantly on the specific hardware platform and synthesis tools. Therefore, for the proposed architecture of the 16-bit block cipher in Section III-B, we investigate both implementation strategies (i.e., LUT and BFR) for the four S-boxes and select the one that results in the most area-efficient implementation of the 16-bit block cipher.
B. Loop-Unrolled Architecture of 16-bit Block Cipher
The loop-unrolled architecture for the 16-bit block cipher is illustrated in Figure 2. In this architecture, only one 16-bit block of data is processed at a time. However, five rounds are cascaded and the whole encryption can be performed in a single clock cycle.
The loop-unrolled architecture consists of 8 XORs, 20 S-boxes, and 4 permutation layers for the datapath. To select a hardware-friendly S-box for the compact version of Hummingbird, we implement the loop-unrolled architecture of the 16-bit block cipher on the target FPGA platform and test one S-box candidate from Table I at a time. Table II summarizes the area requirement when using different S-boxes and implementation strategies. All experimental results are from post-place and route analysis.
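The component counts quoted above can be reproduced with a small software model of the five cascaded rounds. This is a hedged sketch rather than the authors' design: it assumes that the final round consists of key mixing with K5, substitution, and key mixing with K6, with K5 = K1 XOR K3 and K6 = K2 XOR K4 as the labels of Figure 1(c) suggest. The function names and the choice of S1 as the repeated S-box are illustrative only.

```python
MASK16 = 0xFFFF

# One candidate from Table I (S1 here); any of the four can be swapped in.
SBOX = [0x8, 0x6, 0x5, 0xF, 0x1, 0xC, 0xA, 0x9, 0xE, 0xB, 0x2, 0x4, 0x7, 0x0, 0xD, 0x3]


def rotl16(x, r):
    """Left circular shift of a 16-bit word by r positions."""
    return ((x << r) | (x >> (16 - r))) & MASK16


def linear_transform(m):
    """Permutation layer L(m) = m XOR (m <<< 6) XOR (m <<< 10)."""
    return m ^ rotl16(m, 6) ^ rotl16(m, 10)


def substitute(m, sbox):
    """Compact substitution layer: the same 4-bit S-box on all four nibbles."""
    out = 0
    for shift in (12, 8, 4, 0):
        out |= sbox[(m >> shift) & 0xF] << shift
    return out


def encrypt_block(m, round_keys, sbox=SBOX):
    """Model of the unrolled 16-bit block cipher; returns (output, counts).

    counts tallies 16-bit key-mixing XORs, 4-bit S-boxes and permutation
    layers, i.e. the datapath components an unrolled circuit instantiates.
    """
    k1, k2, k3, k4 = round_keys
    counts = {"xor": 0, "sbox": 0, "perm": 0}

    # Assumed derivation of the two final-round keys (2 XORs).
    k5 = k1 ^ k3
    k6 = k2 ^ k4
    counts["xor"] += 2

    # Four regular rounds: key mixing, substitution, permutation.
    for k in (k1, k2, k3, k4):
        m = linear_transform(substitute(m ^ k, sbox))
        counts["xor"] += 1
        counts["sbox"] += 4
        counts["perm"] += 1

    # Assumed final round: key mixing, substitution, key mixing.
    m = substitute(m ^ k5, sbox) ^ k6
    counts["xor"] += 2
    counts["sbox"] += 4
    return m, counts
```

Under these assumptions the tally comes to 8 XORs, 20 S-boxes and 4 permutation layers, matching the figures stated for the datapath. In hardware all five rounds are combinational and evaluate within one clock cycle, whereas the Python loop only mirrors the data flow.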
TABLE II
AREA REQUIREMENT COMPARISON FOR THE LOOP-UNROLLED ARCHITECTURE OF 16-BIT BLOCK CIPHER ON THE SPARTAN-3 XC3S200 FPGA

S-box   Implementation Strategy   # LUTs   # FFs   Total Occupied Slices
S1(x)   LUT                       186      16      107
S1(x)   BFR                       186      16      109
S2(x)   LUT                       193      16      112
S2(x)   BFR                       186      16      107
S3(x)   LUT                       186      16      101
S3(x)   BFR                       186      16      106
S4(x)   LUT                       190      16      104
S4(x)   BFR                       187      16      109
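For the BFR strategy compared in Table II, each S-box first has to be described as a truth table that a logic minimizer can read. The sketch below emits such a table in the Berkeley PLA text format accepted by Espresso. It is only an illustration: the bit ordering (most significant bit first) and the function name `sbox_to_pla` are assumptions, and the minimizer run itself is not shown.

```python
# S3 from Table I, indexed by the 4-bit input x.
S3 = [0x2, 0xE, 0xF, 0x5, 0xC, 0x1, 0x9, 0xA, 0xB, 0x4, 0x6, 0x8, 0x0, 0x7, 0x3, 0xD]


def sbox_to_pla(sbox):
    """Return a PLA description of a 4-bit S-box, one product term per input.

    Assumption: input and output bits are written most significant bit first.
    """
    lines = [".i 4", ".o 4", ".p 16"]  # 4 inputs, 4 outputs, 16 terms
    for x, y in enumerate(sbox):
        lines.append(f"{x:04b} {y:04b}")
    lines.append(".e")  # end marker
    return "\n".join(lines)


if __name__ == "__main__":
    print(sbox_to_pla(S3))
```

The printed text lists all 16 input/output pairs; a minimizer then reduces them to sum-of-products equations for the four output bits, which is the form synthesized as combinatorial logic.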
When comparing different S-boxes and implementation strategies, Table II shows that the loop-unrolled architecture occupies the fewest slices (101) when the S-box S3(x) is employed and implemented as a LUT. Therefore, the S-box S3(x) is chosen for the efficient implementation of the speed optimized Hummingbird encryption/decryption cores that are described in detail in the following subsections.
C. Speed Optimized Hummingbird Encryption Core
The top-level description of a speed optimized Hummingbird encryption core is illustrated in Figure 3. After the chip enable signal changes from 0 to 1, the initialization process (see Figure 1(a)) begins and the four rotors $RS_i$ ($i = 1, 2, 3, 4$) are first initialized with four 16-bit random nonces through the interface RSi within four clock cycles. From the fifth clock cycle, the core encrypts the sum of RS1 and RS3 (modulo $2^{16}$) four times, and each iteration requires four clock cycles to finish the encryptions by the four 16-bit block ciphers as well as the internal state update. During this procedure, the 64-bit subkeys $k_i$ ($i = 1, 2, 3, 4$) are read from an external register under the control of a key selection signal. Moreover, depending on the value of a round counter, the multiplexer M5 chooses the correct computation results to update the four rotors, and the other multiplexers select appropriate inputs to feed the 16-bit block cipher. Once the initialization process is done after 20 clock cycles, the first 16-bit plaintext block is read from an external register for encryption. After another four clock cycles, the corresponding ciphertext is output from the encryption core. Therefore, the proposed speed optimized Hummingbird encryption core can encrypt one 16-bit plaintext block per 4 clock cycles, after an initialization process of 20 clock cycles.
[Fig. 2. Loop-unrolled architecture of the 16-bit block cipher. Figure not reproduced.]
[Fig. 3. Top-level architecture of the speed optimized Hummingbird encryption core. Figure not reproduced.]
[Fig. 4. Top-level architecture of the speed optimized Hummingbird encryption/decryption core. Figure not reproduced.]
D. Speed Optimized Hummingbird Encryption/Decryption Core
We depict the top-level architecture of a speed optimized Hummingbird encryption/decryption core in Figure 4. The Hummingbird encryption/decryption core supports the following four operation modes: i) encryption only; ii) decryption only; iii) encryption followed by decryption; and iv) decryption followed by encryption. Both the encryption and decryption routines share the same initialization procedure, which first takes 4 clock cycles to load four random nonces into the rotors through multiplexers M5 and M11, followed by 16 clock cycles for four iterations. The architecture of the encryption/decryption core is quite similar to that of the encryption-only core except in the following aspects. Firstly, the rotor
RS2 completes its update over the encryption of two successive plaintext blocks in the encryption-only core, whereas all rotors are fully updated each time a plaintext block is encrypted or decrypted in the encryption/decryption core, in order to support the four operation modes. For this purpose, two multiplexers M10 and M11 are introduced to fully update the rotor RS2 after each encryption/decryption. Secondly, an adder that can perform both addition and subtraction modulo $2^{16}$ is included, which executes the corresponding arithmetic according to the operation mode of the core. Thirdly, two multiplexers M7 and M8 are used to feed the correct values to the encryption and decryption routines of the 16-bit block cipher, respectively. Finally, all the other multiplexers select appropriate inputs based on the value of a round counter as well as the operation mode. The workflow of the encryption/decryption core is also similar to that of the encryption-only core. Hence, the speed optimized Hummingbird encryption/decryption core can encrypt or decrypt one 16-bit plaintext or ciphertext block per 4 clock cycles, after an initialization process of 20 clock cycles.
E. Implementation Results and Comparisons
A summary of our implementation results is presented in Table III, where the area requirements (in slices), the maximum operating frequency, and the throughput are provided. All experimental results were extracted after place and route with the
ISE Design Suite v11.1 from Xilinx on an xc3s200-5ft256 Spartan-3 platform with speed grade -5. From Table III, we note that the speed optimized Hummingbird encryption core achieves a throughput of 160.4 Mbps at the cost of 273 slices, whereas the Hummingbird encryption/decryption core occupies 558 slices and operates at 128.8 Mbps on the target FPGA platform.
TABLE III
IMPLEMENTATION RESULTS FOR COMPACT VERSION OF HUMMINGBIRD ON THE SPARTAN-3 XC3S200 FPGA

Mode (Enc/Dec)   # LUTs   # FFs   Total Occupied Slices   Max. Freq. (MHz)   # CLK Cycles (Init.)   # CLK Cycles (Enc/Dec)   Throughput (Mbps)   Efficiency (Mbps/# Slices)
Enc              473      120     273                     40.1               20                     4                        160.4               0.59
Enc/Dec          1,024    145     558                     32.2               20                     4                        128.8               0.23
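The throughput and efficiency columns of Table III follow directly from the block size, the maximum frequency, and the cycle counts. The short sketch below reproduces that arithmetic; the function names are illustrative, and the last helper, which also charges the 20 initialization cycles to a short message, is an addition for illustration rather than a figure reported in the table.

```python
BLOCK_BITS = 16   # Hummingbird block size in bits
INIT_CYCLES = 20  # 4 cycles of nonce loading + 4 iterations of 4 cycles
BLOCK_CYCLES = 4  # cycles per encrypted or decrypted block


def throughput_mbps(f_max_mhz):
    """Steady-state throughput: one 16-bit block every 4 clock cycles."""
    return BLOCK_BITS * f_max_mhz / BLOCK_CYCLES


def efficiency(f_max_mhz, slices):
    """Throughput per occupied slice, in Mbps per slice."""
    return throughput_mbps(f_max_mhz) / slices


def effective_throughput_mbps(f_max_mhz, n_blocks):
    """Throughput for a message of n_blocks, including initialization."""
    total_cycles = INIT_CYCLES + BLOCK_CYCLES * n_blocks
    return BLOCK_BITS * n_blocks * f_max_mhz / total_cycles


if __name__ == "__main__":
    for mode, f_max, slices in (("Enc", 40.1, 273), ("Enc/Dec", 32.2, 558)):
        print(mode, round(throughput_mbps(f_max), 1), round(efficiency(f_max, slices), 2))
```

For the encryption core this gives 16 × 40.1 / 4 = 160.4 Mbps and 160.4 / 273 ≈ 0.59 Mbps per slice; for the encryption/decryption core, 16 × 32.2 / 4 = 128.8 Mbps and 128.8 / 558 ≈ 0.23. For short messages the initialization overhead dominates, so the effective rate approaches the steady-state value only as the number of blocks grows.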
IV. CONCLUSION
This paper presented the first efficient FPGA implementations of the ultra-lightweight cryptographic algorithm Hummingbird. The proposed speed optimized Hummingbird encryption/decryption cores can encrypt or decrypt a 16-bit message block in 4 clock cycles, after an initialization process of 20 clock cycles. Compared to other lightweight FPGA implementations of the block ciphers XTEA, ICEBERG, SEA and AES, Hummingbird achieves larger throughput with a smaller area requirement. Consequently, Hummingbird can be considered an ideal cryptographic primitive for resource-constrained environments.
ACKNOWLEDGMENT
This work is supported by an NSERC Discovery Grant and an NSERC Strategic Project Grant.
REFERENCES
[1] P. Bulens, F.-X. Standaert, J.-J. Quisquater, and P. Pellegrin, "Implementation of the AES-128 on Virtex-5 FPGAs", Progress in Cryptology - AFRICACRYPT 2008, LNCS 5023, pp. 16-26, 2008.
[2] P. Chodowiec and K. Gaj, "Very Compact FPGA Implementation of the AES Algorithm", The 5th International Workshop on Cryptographic Hardware and Embedded Systems - CHES 2003, LNCS 2779, pp. 319-333, 2003.
[3] D. Engels, X. Fan, G. Gong, H. Hu, and E. M. Smith, "Hummingbird: Ultra-Lightweight Cryptography for Resource-Constrained Devices", to appear in the proceedings of The 14th International Conference on Financial Cryptography and Data Security - FC 2010, 2010.
[4] N. N. Espresso. Available at http://embedded.eecs.berkeley.edu/pubs/downloads/espresso/index.htm, November 1994.
[5] X. Fan, H. Hu, G. Gong, E. M. Smith and D. Engels, "Lightweight Implementation of Hummingbird Cryptographic Algorithm on 4-Bit Microcontrollers", The 1st International Workshop on RFID Security and Cryptography 2009 (RISC09), pp. 838-844, 2009.
[6] T. Good and M. Benaissa, "AES on FPGA from the Fastest to the Smallest", The 7th International Workshop on Cryptographic Hardware and Embedded Systems - CHES 2005, LNCS 3659, pp. 427-440, 2005.
[7] X. Guo, Z. Chen, and P. Schaumont, "Energy and Performance Evaluation of an FPGA-Based SoC Platform with AES and PRESENT Coprocessors", Embedded Computer Systems: Architectures, Modeling, and Simulation - SAMOS 2008, LNCS 5114, pp. 106-115, 2008.
[8] J.-P. Kaps, "Chai-Tea, Cryptographic Hardware Implementations of xTEA", The 9th International Conference on Cryptology in India - INDOCRYPT 2008, LNCS 5356, pp. 363-375, 2008.
[9] F. Mace, F.-X. Standaert, and J.-J. Quisquater, "FPGA Implementation(s) of a Scalable Encryption Algorithm", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 2, pp. 212-216, 2008.
[10] A. Poschmann, "Lightweight Cryptography - Cryptographic Engineering for a Pervasive World", Ph.D. Thesis, Department of Electrical Engineering and Information Sciences, Ruhr-Universität Bochum, Bochum, Germany, 2009.
[11] G. Rouvroy, F.-X. Standaert, J.-J. Quisquater, and J.-D. Legat, "Compact and Efficient Encryption/Decryption Module for FPGA Implementation of the AES Rijndael Very Well Suited for Small Embedded Applications", International Conference on Information Technology: Coding and Computing - ITCC 2004, pp. 583-587, 2004.
[12] F.-X. Standaert, G. Piret, G. Rouvroy, and J.-J. Quisquater, "FPGA Implementations of the ICEBERG Block Cipher", Integration, the VLSI Journal, vol. 40, iss. 1, pp. 20-27, 2007.
[13] Xilinx Inc., "Spartan-3 FPGA Family Data Sheet", DS099, December 4, 2009, available at http://www.xilinx.com/support/documentation/data_sheets/ds099.pdf.
Table IV compares the performance of our Hummingbird implementation with existing FPGA implementations of the block ciphers PRESENT [7], [10], XTEA [8], ICEBERG [12], SEA [9] and AES [1], [2], [6], [11]. Note that numerous AES hardware architectures have been proposed in the literature; for the purpose of comparison, we focus only on those implementations using low-cost Spartan series FPGA devices with speed grade -5 and above. Moreover, the implementation figures of ICEBERG and SEA are only available for Virtex-2 series FPGAs. We would also like to point out that it is quite difficult to provide a fair comparison among different FPGA implementations, given the diversity of FPGA devices and packages, speed grade levels, and synthesis and implementation tools. Therefore, we also include additional information such as the implementation platform and speed grade level in Table IV. Our experimental results show that, in the context of low-cost FPGA implementation, Hummingbird achieves larger throughput with a smaller area requirement when compared to the block ciphers XTEA, ICEBERG, SEA and AES. However, the implementation of the ultra-lightweight block cipher PRESENT is more efficient than that of Hummingbird, although a slightly larger (and hence more expensive) FPGA device, the Spartan-3 XC3S400, is required. The main reason is the complex internal state updating procedure in the Hummingbird cipher (see Figure 1(a) and Figure 1(b)). As a result, the control unit is more complicated and the delay of the critical path is much longer in the Hummingbird hardware architecture than in the PRESENT core.