Hardware Framework for the Rabbit Stream Cipher

Deian Stefan
S ProCom2 // Dept. of Electrical Engineering, The Cooper Union, New York NY 10003, USA

Abstract. Rabbit is a software-oriented synchronous stream cipher with very strong security properties and support for 128-bit keys. Rabbit is part of the European Union's eSTREAM portfolio of stream ciphers addressing the need for strong and computationally efficient (i.e., fast) ciphers. Extensive cryptanalysis confirms Rabbit's strength against modern attacks; attacks with complexity lower than an exhaustive key search have not been found. Previous software implementations have demonstrated Rabbit's high throughput; however, the performance in hardware has only been estimated. Three reconfigurable hardware designs of the Rabbit stream cipher are presented: direct, interleaved, and generalized folded structure (GFS). On the Xilinx Virtex-5 LXT FPGA, a direct, resource-efficient (568 slices) implementation delivers throughputs of up to 9.16 Gbits/s, a 4-slow interleaved design reaches 25.62 Gbits/s using 1163 slices, and a 3-slow 8-GFS implementation delivers throughputs of up to 3.46 Gbits/s using only 233 slices.

Key words: FPGA, Rabbit, eSTREAM, DSP, Stream Cipher

1 Introduction

The widespread use of embedded mobile devices poses the need for fast, hardware-oriented encryption capabilities to provide higher security and protection of private data for end users. Stream ciphers are cryptographic algorithms that transform a stream of plaintext messages of varying bit length into ciphertext of the same length, usually by generating a keystream that is then XORed with the plaintext. In general, stream ciphers have very high throughput, strong security properties, and use few resources, making them ideal for mobile applications. Well-known examples include the RC4 cipher used in the 802.11 Wired Equivalent Privacy (WEP) protocol [13], the E0 cipher used in the Bluetooth protocol [13], and the SNOW 3G cipher used by the 3GPP group in the new mobile cellular standard [26].
Part of this work was done while the author was visiting EPFL, Switzerland.

The European Union sponsored the four-year eSTREAM project to identify new stream ciphers which address not only strong security properties, but also the need for 1) high-performance software-oriented ciphers and 2) low-power and low-resource hardware-oriented ciphers. The Rabbit stream cipher is among four software-oriented stream ciphers which were selected for the eSTREAM software portfolio in 2008 [3]. Rabbit performs very well in software (e.g., 5.1 cycles/byte on a 1.7 GHz Pentium 4 and 3.8 cycles/byte on a 533 MHz PowerPC 440GX [6]), and detailed cryptanalysis by the designers and recent studies [2, 20] found no serious weaknesses or attacks more feasible than an exhaustive key search. In [20], Lu et al. estimate the complexity of a time-memory-data-tradeoff (TMDT) key-recovery attack to be $2^{97.5}$ with $2^{32}$ memory usage and $2^{32}$ precomputations, in addition to an exceptionally strong adversary assumption. Moreover, they also present the best distinguishing attack, with complexity $2^{158}$, which is considerably higher than the $2^{128}$ of an exhaustive key search.

The strong security properties of Rabbit make the cipher a desirable candidate for both software and hardware applications. Until now there were no hardware implementations of Rabbit to evaluate its performance, only estimates of application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) designs; as part of our framework, we present three different architectures suitable for reconfigurable hardware implementations that can be used as standalone hardware or hardware/software co-designs for both cryptographic and cryptanalytic applications.

First we introduce the structure of the Rabbit stream cipher and the mathematical foundations. We then discuss the three hardware architectures of the algorithm: direct, interleaved, and generalized folded structure. The tradeoffs of each are considered along with hardware- and software-based initialization designs. Finally, FPGA implementations and performance benchmarks are presented.

2 Structure of Rabbit

Rabbit is a symmetric synchronous stream cipher with a 513-bit internal state derived from the 128-bit key and an optional 64-bit initial vector (IV). From the classical definition of a synchronous stream cipher [22], the internal state during each system iteration is updated according to a next-state function dependent on the previous (internal) state, and similarly, the keystream is produced as a function of the internal states, independent of the plaintext or ciphertext. An output function, XOR in this case, is then used to combine the plaintext (ciphertext) message and keystream to produce the output ciphertext (plaintext). The 128-bit key allows for the safe encryption of $2^{64}$ plaintext messages [21, 6], while the optional (public) 64-bit IV provides for the safe encryption of up to $2^{64}$ plaintexts using the same key [8].

Many stream cipher keystream generators are based on the irregular clocking, non-linear combination, or non-linear filtering of the output(s) of linear feedback shift registers (LFSRs) and pseudo-random number generators (PRNGs) [24, 22]. The Rabbit design, although counter-assisted and dependent on the highly non-linear mixing of the internal state, is a novel approach to stream cipher design, adopting random-like properties from chaos theory [7]. The Rabbit 513-bit internal state (at iteration $i$) is divided into eight 32-bit state variables $x_{j,i}$, $0 \le j \le 7$, eight 32-bit counters $c_{j,i}$, $0 \le j \le 7$, and a carry bit $\phi_{7,i}$. The design choice of a very large internal state makes TMDT attacks (e.g., key recovery), which rely on off-line precomputations to minimize on-line computing time, infeasible [5, 6].

2.1 Internal State Update

The internal state update, i.e., a system iteration, is divided into the nonlinear next-state update of the state variables $x_{j,i}$, and the linear update of the counter variables $c_{j,i}$.

Next-state update: At the core of the Rabbit algorithm is the iteration of the state variables, from which the keystream is generated. After the initialization of the internal state (explained in Section 2.2), the next-state function, depending only on the previous state, is used to iterate the system; so, the internal state at iteration $i + 1$ depends solely on the non-linear mixing of the internal state at $i$. Formally, following the notation of [6], the eight 32-bit state variables are updated as follows:

$$x_{j,i+1} = \begin{cases} g_{j,i} + (g_{j-1,i} \lll 16) + (g_{j-2,i} \lll 16) & \text{for } j \text{ even} \\ g_{j,i} + (g_{j-1,i} \lll 8) + g_{j-2,i} & \text{for } j \text{ odd,} \end{cases} \tag{1}$$

where $\lll r$ is a left bitwise rotation by $r$ bits, the additions are mod $2^{32}$, and all the indices $j - k$, $0 \le k \le 2$, are mod 8 (the number of state and counter variables). The chaos-inspired function $g$ is defined as:

$$g_{j,i} = \left( (x_{j,i} + c_{j,i+1})^2 \oplus \left( (x_{j,i} + c_{j,i+1})^2 \gg 32 \right) \right) \bmod 2^{32}, \tag{2}$$

where $\gg r$ is a bitwise right-shift by $r$ bits and the inner additions, $(x_{j,i} + c_{j,i+1})$, are mod $2^{32}$. The $g$ function is the source of the high non-linearity in the state updates: 256 bits (all the bits of the $x_{j,i}$'s) of the 513-bit internal state are non-linearly transformed; as (1) shows, each state variable is a combination of three outputs from the $g$ function. The $g$ function is the source of the cipher's resistance to algebraic, differential, and linear correlation attacks, which commonly take advantage of ciphers with few non-linear state updates, or the correlation between the difference of inputs and outputs. These attacks seek to determine an output's dependence on the input, find a correlation between the output and internal state, or distinguish the keystream from a truly random sequence [12, 1, 7, 6].

Counter update: Similar to the state variable updates, during each iteration the eight 32-bit counter variables are also updated, although linearly, according to:

$$c_{j,i+1} = \begin{cases} c_{0,i} + a_0 + \phi_{7,i} & \text{for } j = 0 \\ c_{j,i} + a_j + \phi_{j-1,i+1} & \text{otherwise,} \end{cases} \tag{3}$$

where the sums are taken mod $2^{32}$,

$$a_j = \begin{cases} \texttt{0x4d34d34d} & \text{for } j = 0, 3, 6 \\ \texttt{0xd34d34d3} & \text{for } j = 1, 4, 7 \\ \texttt{0x34d34d34} & \text{otherwise,} \end{cases} \tag{4}$$

and the carry $\phi_{j,i+1}$ is:

$$\phi_{j,i+1} = \begin{cases} 1 & \text{if } j = 0 \text{ and } c_{0,i} + a_0 + \phi_{7,i} \ge 2^{32} \\ 1 & \text{if } j \ne 0 \text{ and } c_{j,i} + a_j + \phi_{j-1,i+1} \ge 2^{32} \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$
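The update rules (1) through (5) can be sketched in software as follows. This is a minimal Python model written for illustration, assuming the state is held as two lists of eight 32-bit integers plus a carry bit; the function names (`rotl32`, `g`, `counter_update`, `next_state`, `iterate`) are hypothetical and do not come from the paper or any library.

```python
MASK32 = 0xFFFFFFFF

# Counter constants a_0..a_7 from (4).
A = [0x4D34D34D, 0xD34D34D3, 0x34D34D34, 0x4D34D34D,
     0xD34D34D3, 0x34D34D34, 0x4D34D34D, 0xD34D34D3]


def rotl32(v, r):
    """Rotate a 32-bit word left by r bits (0 < r < 32)."""
    return ((v << r) | (v >> (32 - r))) & MASK32


def g(x, c):
    """Equation (2): square (x + c) mod 2^32, then XOR the two 32-bit halves."""
    sq = ((x + c) & MASK32) ** 2          # exact 64-bit square
    return (sq ^ (sq >> 32)) & MASK32


def counter_update(c, carry):
    """Equations (3) and (5): chained 32-bit additions with a rippling carry."""
    out = []
    for j in range(8):
        total = c[j] + A[j] + carry       # carry-in: old carry for j = 0, else the previous adder's carry-out
        out.append(total & MASK32)
        carry = total >> 32               # carry-out of adder j
    return out, carry


def next_state(x, c_next):
    """Equation (1): each new state variable combines three g outputs."""
    gv = [g(x[j], c_next[j]) for j in range(8)]
    new_x = []
    for j in range(8):
        if j % 2 == 0:
            v = gv[j] + rotl32(gv[(j - 1) % 8], 16) + rotl32(gv[(j - 2) % 8], 16)
        else:
            v = gv[j] + rotl32(gv[(j - 1) % 8], 8) + gv[(j - 2) % 8]
        new_x.append(v & MASK32)
    return new_x


def iterate(x, c, carry):
    """One system iteration: counters first, then the state variables."""
    c_next, carry_next = counter_update(c, carry)
    return next_state(x, c_next), c_next, carry_next
```

Note that the counters are updated before the state variables, since (2) uses the already-updated counter of the same iteration.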

It can be shown that the 256-bit counter state (eight 32-bit counters) has a maximal period length of $2^{256} - 1$ [7], and since the counter variables are used in (2), and thus in the next-state function (1), a lower bound on the period length of the state variables can also be guaranteed [7, 6].

2.2 Initialization

Key setup: The 128-bit key $K$ is divided into eight 16-bit sub-keys $K = k_7 \| \cdots \| k_0$, where $\|$ is the concatenation operation, with the least significant bit (LSB) of $k_0$ and most significant bit (MSB) of $k_7$ corresponding to the LSB and MSB of $K$, respectively. The key is expanded to initialize the counter and state variables according to:

$$x_{j,0} = \begin{cases} k_{j+1} \| k_j & \text{for } j \text{ even} \\ k_{j+5} \| k_{j+4} & \text{for } j \text{ odd,} \end{cases} \tag{6}$$

and:

$$c_{j,0} = \begin{cases} k_{j+4} \| k_{j+5} & \text{for } j \text{ even} \\ k_j \| k_{j+1} & \text{for } j \text{ odd,} \end{cases} \tag{7}$$

where the indices $j + k$ are modulo 8. Additionally, the carry $\phi_{7,0}$ is initialized to zero. Following the key expansion, the system is iterated four times according to the next-state and counter-update functions described in Section 2.1, and finally the counter variables are modified according to:

$$c_{j,4} = c_{j,4} \oplus x_{j+4,4}, \tag{8}$$

where the indices are again mod 8. The expansion of the key is such that there is a one-to-one correspondence between the key and the 512-bit internal state (state and counter variables), while the four system iterations and counter modifications ensure both 1) the mixing of all the key bits with every state variable and 2) the combination of the counter with the non-linear state variables [6]. It is important to avoid a many-to-one mapping between the key and internal state as this drastically degrades the strength of the algorithm, for if two keys lead to the same internal state an adversary could potentially generate the same keystream with a different key. Equally essential are the counter modifications, as they prevent key-recovery attacks in which an adversary, with knowledge of the counter's state, can clock the system in reverse and deduce the key. Since the next-state function is resistant to guess-and-verify and correlation attacks [6], and thus resistant to the reverse clocking of the state variables, the modification of the counter variables as in (8) secures against key-recovery attacks.

IV setup: If a 64-bit IV is provided, it is divided into four 16-bit sub-IVs $IV = iv_3 \| \cdots \| iv_0$, where the LSB of $iv_0$ and MSB of $iv_3$ correspond to the LSB and MSB of $IV$, respectively. Using the sub-IVs, the counters are modified to:

$$c_{j,4} = \begin{cases} c_{j,4} \oplus (iv_1 \| iv_0) & \text{for } j = 0, 4 \\ c_{j,4} \oplus (iv_3 \| iv_1) & \text{for } j = 1, 5 \\ c_{j,4} \oplus (iv_3 \| iv_2) & \text{for } j = 2, 6 \\ c_{j,4} \oplus (iv_2 \| iv_0) & \text{for } j = 3, 7, \end{cases} \tag{9}$$

after which the system is again iterated four times, guaranteeing the non-linear combination of all the IV bits into the state variables [6].

2.3 Keystream Generation

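During each iteration $i$, the state variables $x_{j,i}$ are split into low (L) and high (H) 16-bit sub-states $x_{j,i} = x_{j,i,H} \| x_{j,i,L}$, from which the 128-bit keystream output, a concatenation of eight 16-bit blocks $s_i = s_{i,7} \| \cdots \| s_{i,0}$, is extracted according to:

$$\begin{aligned} s_{i,0} &= x_{0,i,L} \oplus x_{5,i,H} & s_{i,1} &= x_{0,i,H} \oplus x_{3,i,L} \\ s_{i,2} &= x_{2,i,L} \oplus x_{7,i,H} & s_{i,3} &= x_{2,i,H} \oplus x_{5,i,L} \\ s_{i,4} &= x_{4,i,L} \oplus x_{1,i,H} & s_{i,5} &= x_{4,i,H} \oplus x_{7,i,L} \\ s_{i,6} &= x_{6,i,L} \oplus x_{3,i,H} & s_{i,7} &= x_{6,i,H} \oplus x_{1,i,L}. \end{aligned} \tag{10}$$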

It is important that adversaries gain no information from the output, that is, they should not be able to distinguish the output of the keystream generator from a truly random sequence [15]. The combination of the outputs of the non-linear $g$ function in the keystream extraction highlights the strength of Rabbit in passing various statistical tests [6], including the NIST Test Suite that seeks to find non-randomness in a sequence [4].
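The key expansion (6) to (8), the IV setup (9) and the extraction (10) can be put together in a short software model. The sketch below is illustrative only: it assumes the key, IV and output block are handled as plain Python integers with sub-key $k_0$ (and block $s_{i,0}$) in the 16 least significant bits, and it does not address the byte ordering used by any published test vectors. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
A = [0x4D34D34D, 0xD34D34D3, 0x34D34D34, 0x4D34D34D,
     0xD34D34D3, 0x34D34D34, 0x4D34D34D, 0xD34D34D3]


def rotl32(v, r):
    return ((v << r) | (v >> (32 - r))) & MASK32


def g(x, c):
    sq = ((x + c) & MASK32) ** 2
    return (sq ^ (sq >> 32)) & MASK32


def iterate(x, c, carry):
    """One system iteration, following (1) to (5)."""
    c_next = []
    for j in range(8):
        total = c[j] + A[j] + carry
        c_next.append(total & MASK32)
        carry = total >> 32
    gv = [g(x[j], c_next[j]) for j in range(8)]
    x_next = []
    for j in range(8):            # gv[j - 1] and gv[j - 2] wrap around for j = 0, 1
        if j % 2 == 0:
            v = gv[j] + rotl32(gv[j - 1], 16) + rotl32(gv[j - 2], 16)
        else:
            v = gv[j] + rotl32(gv[j - 1], 8) + gv[j - 2]
        x_next.append(v & MASK32)
    return x_next, c_next, carry


def expand_key(key):
    """Equations (6) and (7): load x_{j,0} and c_{j,0} from the sub-keys."""
    k = [(key >> (16 * j)) & 0xFFFF for j in range(8)]
    x, c = [], []
    for j in range(8):
        if j % 2 == 0:
            x.append((k[(j + 1) % 8] << 16) | k[j])
            c.append((k[(j + 4) % 8] << 16) | k[(j + 5) % 8])
        else:
            x.append((k[(j + 5) % 8] << 16) | k[(j + 4) % 8])
            c.append((k[j] << 16) | k[(j + 1) % 8])
    return x, c


def key_setup(key):
    """Key expansion, four iterations, then the counter modification (8)."""
    x, c = expand_key(key)
    carry = 0
    for _ in range(4):
        x, c, carry = iterate(x, c, carry)
    c = [c[j] ^ x[(j + 4) % 8] for j in range(8)]
    return x, c, carry


def iv_setup(x, c, carry, iv):
    """Equation (9), followed by four more iterations."""
    v = [(iv >> (16 * n)) & 0xFFFF for n in range(4)]
    words = [(v[1] << 16) | v[0], (v[3] << 16) | v[1],
             (v[3] << 16) | v[2], (v[2] << 16) | v[0]]
    c = [c[j] ^ words[j % 4] for j in range(8)]
    for _ in range(4):
        x, c, carry = iterate(x, c, carry)
    return x, c, carry


def extract(x):
    """Equation (10): build the 128-bit block from 16-bit halves of the state."""
    lo = [w & 0xFFFF for w in x]
    hi = [w >> 16 for w in x]
    s = [lo[0] ^ hi[5], hi[0] ^ lo[3], lo[2] ^ hi[7], hi[2] ^ lo[5],
         lo[4] ^ hi[1], hi[4] ^ lo[7], lo[6] ^ hi[3], hi[6] ^ lo[1]]
    return sum(s[n] << (16 * n) for n in range(8))


def keystream(key, blocks, iv=None):
    """Return `blocks` 128-bit keystream blocks as integers."""
    x, c, carry = key_setup(key)
    if iv is not None:
        x, c, carry = iv_setup(x, c, carry, iv)
    out = []
    for _ in range(blocks):
        x, c, carry = iterate(x, c, carry)
        out.append(extract(x))
    return out
```

For example, `keystream(0, 2)` returns two blocks for the all-zero key, and XORing each block with 128 bits of plaintext gives the ciphertext.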

3 Rabbit in Hardware

As previously mentioned, Rabbit is a software-oriented stream cipher and thus was designed to perform well on general-purpose architectures, varying from 32-bit Intel processors to 8-bit microcontrollers. Estimates of ASIC and FPGA throughput and area performance are presented in [6]; however, the implementation details are limited. In the following sections, we consider three architecture designs of the Rabbit algorithm optimized for reconfigurable devices.

3.1 Direct Architecture and General Optimizations

The first architecture we consider is a direct implementation of the algorithm. Observing the relationship between (3) and (5), we note that the counter variables can be updated using a series of chained adders. Each adder takes inputs $c_{j,i}$, $a_j$ and carry-in $\phi_{j-1,i+1}$, $j > 0$ (the carry-in for the first adder is $\phi_{7,i}$), producing output $c_{j,i+1}$ and carry-out $\phi_{j,i+1}$ each cycle. Figure 1 illustrates the chaining method within the full architecture design. The updated counters $c_{j,i+1}$ and state variables $x_{j,i}$ are then used as inputs to the $g$ function blocks, the outputs of which, $g_{j,i}$, are combined according to (1) to produce the next state variables $x_{j,i+1}$. Moreover, the next state variables are concurrently combined according to (10) to produce the 128-bit keystream output.

Fig. 1. Direct architecture of the Rabbit algorithm, highlighting the critical path. Each adder block is a 32-bit adder with carries, while the dotted and dashed lines indicate a variable rotate dependent on whether $j$ is even or odd, see (1). Control logic, $a_j$ inputs, initialization blocks and the keystream extractor are omitted for clarity.

Below, we consider generic hardware optimizations, which are applied to all the designs in the framework, including the direct implementation.

Efficient squaring: In implementing the next-state function, eight parallel realizations of the $g$ function are required. Accordingly, the implementation of $g$ can greatly affect the overall speed performance and area usage. As Boesgaard et al. note [6], the most costly part of the $g$ function, the squaring, can be efficiently implemented using three 16-bit multiplies followed by a 32-bit addition. If we let $u = x_{j,i} + c_{j,i+1}$ and split $u$ into two 16-bit values $u = u_H \| u_L$, then the optimization follows directly from the fact that $u^2 = u_L^2 + 2^{32} u_H^2 + 2^{17} u_L u_H$ (an exact identity for the 64-bit square, both halves of which are needed by (2)). Thus the full $g$ function, as in (2), can be efficiently implemented using four (2-input) 32-bit adders, three 16-bit multipliers, 3 shifts (which have no cost in hardware, other than routing), and a 32-bit XOR.
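The squaring identity follows from expanding $(2^{16} u_H + u_L)^2 = 2^{32} u_H^2 + 2^{17} u_H u_L + u_L^2$. The sketch below checks it numerically and compares a $g$ built from three 16-bit products against a reference $g$ that squares directly. It is a software check only, with hypothetical function names; it makes no claim about how the adders are arranged in the actual hardware.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def square_3mul(u):
    """Square a 32-bit value using three 16x16-bit products."""
    u_lo, u_hi = u & MASK16, u >> 16
    ll = u_lo * u_lo                     # u_L^2
    hh = u_hi * u_hi                     # u_H^2
    lh = u_lo * u_hi                     # u_L * u_H
    return ll + (hh << 32) + (lh << 17)  # shifts are free wiring in hardware


def g_3mul(x, c):
    """g from (2), with the square assembled from three partial products."""
    u = (x + c) & MASK32
    sq = square_3mul(u)
    return (sq ^ (sq >> 32)) & MASK32


def g_reference(x, c):
    """g from (2), computed with a direct 32x32-bit square."""
    u = (x + c) & MASK32
    sq = u * u
    return (sq ^ (sq >> 32)) & MASK32


SAMPLES = (0, 1, 0xFFFF, 0x10000, 0x12345678, 0xDEADBEEF, 0xFFFFFFFF)

if __name__ == "__main__":
    for u in SAMPLES:
        assert square_3mul(u) == u * u
    print("three-multiplier square matches the direct square on all samples")
```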
Fig. 2. Three-stage pipeline for the direct architecture of the Rabbit algorithm.

Pipelining: In addition to optimizing $g$, the speed of the direct design can be further increased by splitting the design into three pipeline stages. Without pipeline registers, the critical path (the path with the highest computational cost between two delay elements) consists of the eight counter adders, a $g$ function (computing $g_{7,i}$), two 32-bit adders (computing $x_{7,i+1}$) and a 16-bit XOR (extracting keystream output); excluding the final XOR, the critical path is highlighted in Figure 1. The critical path can, however, be reduced to either eight 32-bit adders or $g$ and two 32-bit adders, specifically max(eight 32-bit adders, $g$ + two 32-bit adders), by introducing pipeline registers following the counter adders and preceding the keystream output XORs; see Figure 2. To retain correctness, the inputs $c_{j,i+1}$ and $x_{j,i}$ to the $g$ functions must be kept synchronized, which can be accomplished by introducing a latency of one cycle (using clock-enables) for the $x_{j,i}$'s to match the latency introduced by the pipeline register for $c_{j,i+1}$.

C-slow retiming: To further optimize the pipelined design, the critical path, which we experimentally determined to be in the second pipeline stage (the calculation of the next state variables: $g$ + two 32-bit adders), must be reduced with fine-grained pipelining of the $g$ block, the costliest element in the path. We note that since $g_{j,i+1}$ depends on $x_{j,i+1}$, which is a function of the output of $g_{j,i}$, the direct design cannot take advantage of multiplier pipelining. Instead, we optimize the design with C-slow retiming, a DSP system-design technique that allows for the pipelining of structures with feedback loops [23, 27]. C-slow retiming is a modification of a system design in which each register is replaced with $C$ registers (C-slowed), after which the full structure is retimed, whilst retaining algorithmic correctness; we refer the reader to [23] for further details. For $C = 4$, Figure 3(a) illustrates the partial C-slow design before retiming, and Figure 3(b) shows it after retiming, where 3 of the 4 registers were moved into the $g$ block. Retiming stage 2 can thus be seen as fine-grained pipelining of the $g$ function into 3 simpler stages (addition, multiplication, and addition + XOR). Moreover, by pipelining $g$, the critical path is reduced to the eight 32-bit chained counter adders.

We note that although C-slow retiming can substantially increase the clock rate, the area usage will, in general, increase, as will the number of cycles it takes to complete a single iteration; specifically, the number of cycles per iteration will increase to $C$. Thus, to avoid zero-filling the $C - 1$ pipeline registers, it is essential that multiple streams be interleaved, running in parallel, so that during the $C$-cycle system iteration, $C$ independent streams are updated and $C$ different keystream outputs are generated. Multi-stream cipher applications have been studied before (see, e.g., [28, 9]), and find use in many applications, including file system encryption, securing virtual private networks, and cryptanalysis.
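The effect of C-slowing a feedback loop can be illustrated with a toy software model. The sketch below is not the Rabbit datapath: it assumes a made-up update function split into four stages (`add_const`, `square`, `fold`, `rot16`, all hypothetical) and shows that a 4-deep pipelined loop carrying four independent streams produces, per stream, exactly the values obtained by iterating each stream on its own, with one iteration finishing every cycle once the pipeline is full.

```python
MASK32 = 0xFFFFFFFF


def add_const(v):
    return (v + 0x4D34D34D) & MASK32


def square(v):
    return v * v                      # 64-bit intermediate


def fold(v):
    return (v ^ (v >> 32)) & MASK32


def rot16(v):
    return ((v << 16) | (v >> 16)) & MASK32


STAGES = [add_const, square, fold, rot16]   # toy stand-in for a pipelined update


def reference(seed, iterations):
    """Iterate one stream on its own, without any pipelining."""
    out, v = [], seed
    for _ in range(iterations):
        for stage in STAGES:
            v = stage(v)
        out.append(v)
    return out


def run_c_slow(stages, seeds, cycles):
    """Clock a C-stage feedback pipeline that interleaves up to C streams."""
    depth = len(stages)
    assert len(seeds) <= depth
    regs = [None] * depth             # regs[k]: item about to enter stage k
    outputs = {m: [] for m in range(len(seeds))}
    for t in range(cycles):
        if t < len(seeds):
            regs[0] = (t, seeds[t])   # load stream t while the pipeline fills
        nxt = [None] * depth
        for k, item in enumerate(regs):
            if item is None:
                continue              # an empty slot is a wasted cycle
            m, v = item
            v = stages[k](v)
            if k == depth - 1:
                outputs[m].append(v)  # stream m finished one full iteration
                nxt[0] = (m, v)       # feedback into the first stage
            else:
                nxt[k + 1] = (m, v)
        regs = nxt
    return outputs
```

With fewer than four streams loaded, some slots stay empty on every cycle, which is the software analogue of zero-filling the unused pipeline registers.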
Fig. 3. C-slow retiming for $C = 4$ is accomplished by first replacing each register with $C$ of them, as shown in (a) 4-slow before retiming, followed by the retiming, which relocates registers to optimize the design, as shown in (b) 4-slow after retiming.

Initialization: Initialization of the direct architecture requires a key expansion block for (6) and (7), which consist of simple combinations of bit slices used to initialize the state and counter variables; additional control logic (multiplexers) and XORs are needed for the IV setup and modification of the counter as in (8). For multi-stream (C-slow retimed) designs, control logic is necessary to correctly initialize the independent streams. Alternatively, for hardware/software co-designs, the initialization can be performed in software from which the Rabbit hardware counter, state and carry registers can be loaded; the Rabbit crypto-co-processor and main CPU (e.g., MicroBlaze or PowerPC) can be interfaced using numerous bus protocols that can directly access hardware registers, including the Xilinx Fast Simplex Link (FSL), On-Chip Peripheral Bus (OPB) and the IBM-based Processor Local Bus (PLB). For many security system- and network-on-chip applications, which commonly consist of a CPU and peripherals in addition to the FPGA, initialization in software eliminates the need for additional hardware resources and further simplifies the overall design. Moreover, the saved resources can be dedicated to additional cryptographic cores in multi-stream applications, or to other hardware-assisted applications running concurrently, e.g., an MPEG-4 encoder.

3.2 Interleaved Architecture

Although a C-slow retimed implementation is suitable for hardware, the high data-dependency between the counters (due to the percolating carries $\phi_{j,i+1}$) still poses a limitation on the clock rate. This is because a 256-bit addition must be completed in a single cycle: the eight 32-bit additions with carries are equivalent to a 256-bit addition of $c_{7,i} \| \cdots \| c_{0,i}$ and $a_7 \| \cdots \| a_0$ with $\phi_{7,i}$ as a carry-in. For a 3-stage pipeline and C-slow retimed design (assuming $C \ge 2$), the cost can be reduced to that of a 128-bit addition using cut-set retiming; in this section, however, we focus on the interleaved architecture (IA) design, which is a considerably more balanced structure. See Appendix A for further details on the cut-set retiming approach.

The interleaved design is a generalization of the C-slow retiming approach to fine-grained pipelining of not only the state variable updates (stage 2 of the pipelined design in Figure 2), but the counter updates as well (stage 1). Given a C-slow design ($C = 2l$, $l \ge 1$), a $C/k$-interleaved architecture (in short, $C/k$-IA) interleaves $k$ independent streams in a single clock cycle for $k$ cycles (ignoring the initial first cycle used to fill the pipeline), where $k \le C$ and $k \mid 8$. For example, a 2/2-IA consists of 2 streams which are interleaved such that during the first cycle half of the state variables of each stream are updated and during the second cycle the second half of the variables are updated. As another example, consider the 4/2-IA case; this design is equivalent to interleaving two 2/2-IA streams. We further note that the C-slow retimed design discussed in Section 3.1 is a special case for $k = 1$, i.e., $C/1$-IA.
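The equivalence between the eight chained 32-bit adders and a single 256-bit addition, and the way a buffered middle carry lets the addition be split over two cycles, can be checked with a few lines of Python. This is a numerical sketch with hypothetical helper names (`pack`, `chained_update`); it models the arithmetic only, not the register placement of the interleaved architecture.

```python
MASK32 = 0xFFFFFFFF
A = [0x4D34D34D, 0xD34D34D3, 0x34D34D34, 0x4D34D34D,
     0xD34D34D3, 0x34D34D34, 0x4D34D34D, 0xD34D34D3]


def pack(words):
    """Concatenate w_7 || ... || w_0 into one integer (w_0 in the LSBs)."""
    return sum(w << (32 * j) for j, w in enumerate(words))


def chained_update(c, carry_in, first=0, last=8):
    """Update counters c[first:last] with chained 32-bit adders."""
    out, carry = [], carry_in
    for j in range(first, last):
        total = c[j] + A[j] + carry
        out.append(total & MASK32)
        carry = total >> 32
    return out, carry


if __name__ == "__main__":
    c = [0xFFFFFFFF, 0x00000001, 0x89ABCDEF, 0xFFFFFFFF,
         0x2CB2CB2C, 0x00000000, 0xCB2CB2CB, 0x7FFFFFFF]
    carry_in = 1

    # All eight adders in one cycle versus one 256-bit addition.
    words, carry_out = chained_update(c, carry_in)
    wide = pack(c) + pack(A) + carry_in
    assert pack(words) == wide % 2 ** 256
    assert carry_out == wide >> 256

    # Two 128-bit halves: the carry out of counter 3 is buffered for the next cycle.
    low, carry_mid = chained_update(c, carry_in, 0, 4)
    high, carry_hi = chained_update(c, carry_mid, 4, 8)
    assert low + high == words and carry_hi == carry_out
    print("chained, wide and split additions agree")
```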

Fig. 4. 4/2-Interleaved Architecture design and corresponding $h$ block: (a) 4/2-IA design; (b) $h$ function.

We denote variables of different streams with a superscript, e.g., $c^m_{j,i}$ is the $j$-th counter variable at iteration $i$ of stream $m$. For clarity we limit our discussion to the 4/2-IA design shown in Figure 4. From Figure 4 we observe that during the first cycle, half of stream 1's counters, $c^1_{j,i+1}$, $0 \le j \le 3$, and the final half of stream 4's counters, $c^4_{j,i+1}$, $4 \le j \le 7$, are updated in the top and bottom of the structure, respectively. Because $\phi_{3,i+1}$ is buffered, in the following cycle we can update $c^1_{j,i+1}$, $4 \le j \le 7$, in the bottom half, and $c^2_{j,i+2}$, $0 \le j \le 3$, in the top. Table 1 illustrates the update of the counter variables over time corresponding to Figure 4. With the exception of the first cycle, during every cycle a full-state update is completed.

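t    Top                                                          Bottom
0    $c^1_{0,i}$, $c^1_{1,i}$, $c^1_{2,i}$, $c^1_{3,i}$           (empty)
1    $c^2_{0,i}$, $c^2_{1,i}$, $c^2_{2,i}$, $c^2_{3,i}$           $c^1_{4,i}$, $c^1_{5,i}$, $c^1_{6,i}$, $c^1_{7,i}$
2    $c^3_{0,i}$, $c^3_{1,i}$, $c^3_{2,i}$, $c^3_{3,i}$           $c^2_{4,i}$, $c^2_{5,i}$, $c^2_{6,i}$, $c^2_{7,i}$
3    $c^4_{0,i}$, $c^4_{1,i}$, $c^4_{2,i}$, $c^4_{3,i}$           $c^3_{4,i}$, $c^3_{5,i}$, $c^3_{6,i}$, $c^3_{7,i}$
4    $c^1_{0,i+1}$, $c^1_{1,i+1}$, $c^1_{2,i+1}$, $c^1_{3,i+1}$   $c^4_{4,i}$, $c^4_{5,i}$, $c^4_{6,i}$, $c^4_{7,i}$
5    $c^2_{0,i+1}$, $c^2_{1,i+1}$, $c^2_{2,i+1}$, $c^2_{3,i+1}$   $c^1_{4,i+1}$, $c^1_{5,i+1}$, $c^1_{6,i+1}$, $c^1_{7,i+1}$
6    $c^3_{0,i+1}$, $c^3_{1,i+1}$, $c^3_{2,i+1}$, $c^3_{3,i+1}$   $c^2_{4,i+1}$, $c^2_{5,i+1}$, $c^2_{6,i+1}$, $c^2_{7,i+1}$

Table 1. Example counter update of 4/2-IA over increasing time t.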
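The pattern in Table 1 can be summarized by a small scheduling function. The sketch below is inferred from the table rows alone (it is an assumption, not a formula given in the text): at time $t$ the top half serves stream $(t \bmod 4) + 1$, and the bottom half serves the stream that occupied the top half one cycle earlier. The function name `schedule_4_2` is hypothetical.

```python
def schedule_4_2(t, streams=4):
    """Return the (stream, iteration offset) served by each half at time t.

    The iteration offset n stands for iteration i + n in Table 1.
    The bottom half is None at t = 0, while the pipeline is filling.
    """
    top = (t % streams + 1, t // streams)
    bottom = None if t == 0 else ((t - 1) % streams + 1, (t - 1) // streams)
    return top, bottom


if __name__ == "__main__":
    for t in range(7):
        top, bottom = schedule_4_2(t)
        print(t, "top:", top, "bottom:", bottom)
```

From $t = 1$ onward each cycle updates four counters in the top half and four in the bottom half, so one full set of eight counter updates is completed per cycle.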

Due to the interleaving and the need to retain correctness of the algorithm, the retiming of $g$ is slightly more complex than that of a C-slow design. First, because we start from a 4-slow design, 2 registers can be dedicated to the fine-grained pipelining of $g$, while the others are used to buffer either 1) the output of $g$, so that the next state variables can be computed according to (1), or 2) the next state variable. As the update is completed over 2 cycles, half of the $g$ blocks need an additional register and a multiplexer (see Figure 4(b)) to select the correct $g$ output; we denote this function by $h$. For example, in computing $x_{4,i+1}$, the outputs of the first two $h$ blocks ($h_2$ and $h_3$) are the previously buffered $x_{2,i+1}$ and $x_{3,i+1}$ (and not the output of the $g$ function). We further note that for the 4/2-IA, in addition to registers which buffer the next state variables, two keystream extractors are needed in order to produce four 128-bit outputs in four cycles.

3.3 Generalized Folded Structure

Although FPGAs contain digital signal processing (DSP) slices that can be used in implementing an optimized direct or IA design, with the exception of the DSP-enhanced FPGAs (such as the Xilinx Spartan-3A, Virtex-5 SXT and Virtex-4 SX [29]), most FPGAs have a small number of DSP slices, which may be necessary for applications other than the encryption module (e.g., a Fast Fourier Transform block used for image processing). (The design of a DSP slice is FPGA-family-specific; however, the most common design is an $18 \times 18$ multiplier followed by an adder/accumulator and a small number of registers and multiplexers.) As such, we seek a more compact implementation of the Rabbit stream cipher.

From (1), (2), (3) and Figure 1, we observe the repeated use of identical circuit blocks in the design (e.g., block $g$ followed by addition), which can be reduced to fewer shared copies at the cost of additional control logic and intermediate state registers. Specifically, the $g$ block, adders and rotation blocks used to update a state variable can be shared to compute all eight state variables, with each computing block spending 1/8-th of its time on any one state variable. Similarly, the eight counter variables can be computed at 1/8-th the time per resource by sharing a single adder and carry register. In DSP terminology, the general design optimization is referred to as an $n$-folded or $n$-rolled design [23], reducing the number of used computational resources (e.g., $g$ blocks) to $1/n$ at the cost of taking $n$ cycles to complete a full iteration. It is constructive to think of folded designs as $n$ threads running on a pipelined system sharing the same computational units: during every cycle a different thread, cycled in a round-robin fashion, gets a chance to use the computational units (and advance in the pipeline) [16], such that after $n$ cycles all the threads have finished their necessary computations and the iteration is complete.

Although a directly folded design of Rabbit is realizable, it is inefficient because each iteration requires $g_{6,i}$ and $g_{7,i}$ to compute the first two next-state variables, $x_{0,i+1}$ and $x_{1,i+1}$, and as such, an elegant solution buffering only the last two $g$ values is not feasible without the use of an additional $g$ block. Instead, we propose a generalized folded structure (GFS) that allows access to intermediate values; following the threading analogy, the threads are no longer independent and can share data. Moreover, an $n$-GFS implementation only requires $1/n$ of the number of computational elements (e.g., adders and $g$ functions) used by a direct implementation.
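The idea of time-multiplexing a single $g$ unit over eight cycles can be modelled behaviourally. The sketch below is a simplification under stated assumptions: it keeps all eight $g$ outputs in a Python list instead of modelling the delay lines and multiplexers of the actual datapath, and it defers $x_{0,i+1}$ and $x_{1,i+1}$ until $g_{6,i}$ and $g_{7,i}$ exist. The names `combine`, `direct_next_state` and `folded_next_state` are hypothetical. The point is only that the shared unit reproduces the direct update.

```python
MASK32 = 0xFFFFFFFF


def rotl32(v, r):
    return ((v << r) | (v >> (32 - r))) & MASK32


def g(x, c):
    sq = ((x + c) & MASK32) ** 2
    return (sq ^ (sq >> 32)) & MASK32


def combine(j, g0, g1, g2):
    """Equation (1) for variable j, given g_j, g_{j-1} and g_{j-2}."""
    if j % 2 == 0:
        return (g0 + rotl32(g1, 16) + rotl32(g2, 16)) & MASK32
    return (g0 + rotl32(g1, 8) + g2) & MASK32


def direct_next_state(x, c_next):
    """Eight g blocks evaluated in parallel, as in the direct architecture."""
    gv = [g(x[j], c_next[j]) for j in range(8)]
    return [combine(j, gv[j], gv[j - 1], gv[j - 2]) for j in range(8)]


def folded_next_state(x, c_next):
    """One shared g unit, used once per cycle for eight cycles."""
    gv = [None] * 8
    new_x = [None] * 8
    for j in range(8):                 # cycle j: the shared unit computes g_j
        gv[j] = g(x[j], c_next[j])
        if j >= 2:                     # g_{j-1} and g_{j-2} come from the two previous cycles
            new_x[j] = combine(j, gv[j], gv[j - 1], gv[j - 2])
    for j in (0, 1):                   # wrap-around terms had to wait for g_6 and g_7
        new_x[j] = combine(j, gv[j], gv[j - 1], gv[j - 2])
    return new_x
```

The two deferred variables make the dependency described above concrete: with only the last two $g$ values buffered, $g_0$ and $g_1$ would already be gone by the time $g_6$ and $g_7$ arrive.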


As the counter implementation in an n-GFS architecture is the same as that of a folded design (i.e., in an n-GFS design, the counter system is simply the chaining of 8/n adders whose (partial) inputs are n delayed counter variables that need to be updated sequentially), we limit our discussion to the more interesting case of the state updates.

Fig. 5. 8-GFS design. Every 8-th cycle, the multiplexers select the $g_{7,i}$ and $g_{6,i}$ results for the $g_{j-1,i}$ and $g_{j-2,i}$ inputs of the 32-bit adder. The dashed and dotted lines highlight rotations dependent on $j$.

As shown in Figure 5, the 8-GFS design uses a minimal number of resources, both in terms of the register usage and computational elements ($g$ function and adders). Only two additional registers, which buffer $g_{j-1,i}$ and $g_{j-2,i}$, are needed when computing $x_{j,i+1}$ according to (1). We note that every 8 cycles all intermediate terms, $g_{0,i}$ through $g_{7,i}$, are available and thus any of the next-state variables can be updated, including $x_{0,i+1}$. Similarly, Figure 6 shows the compact 4-GFS design split into a top and bottom pipeline, computing the even and odd next-state variables, respectively. As with the 8-GFS, every $n = 4$ cycles, all the intermediate terms are available and thus $x_{0,i+1}$ and $x_{2,i+1}$ can be computed. A 2-GFS design follows directly from these. From the figures, we observe that a straightforward GFS implementation will be limited by the rate at which it can be clocked (due to the fact that the critical path consists of a $g$ block and two 32-bit adders). However, the pipelining and C-slow retiming techniques presented in Section 3.1 are adopted to further speed up the compact $n$-GFS designs.

Fig. 6. 4-GFS design with the top pipeline computing every even state variable, and the bottom every odd. Every 4-th cycle, the top and bottom multiplexers select the $g_{6,i}$ and $g_{7,i}$ results, respectively. The dashed and dotted lines highlight rotations dependent on $j$.

Keystream extraction: To extract the keystream output according to (10), a time division demultiplexer (TDD) is needed so that $x_{j,i}$, $0 \le j \le 7$, are simultaneously available for the calculation of the $s_i$'s. Since a TDD uses a considerable number of registers, in applications of the 8-GFS where variable output lengths and out-of-order keystreams are acceptable (such as random number generators), the TDD (and following XORs) can be replaced by two 16-bit XORs producing the following output sequence: $s_{i,0} \| s_{i,1}$, $s_{i,7} \| s_{i,4}$, $s_{i,2} \| s_{i,3}$, $s_{i,6}$, $s_{i,5}$. As the 4-GFS does not directly benefit from this optimization, the keystream extractor of the 4-GFS consists of a 2-to-8 TDD followed by a series of XORs to generate the output.

Initialization: The generalized folded structure has a very flexible initialization process. For an 8-GFS, the hardware initialization additionally requires 1) four registers so that $x_{0,4}$ is available for the modification of $c_{4,4}$ according to (8), 2) two XORs for the mixing of the counters with the state variables and IV, and 3) a set of control logic. Similar requirements follow for the 4-GFS. We note that although minimal additions are needed for the hardware initialization, software initialization (as discussed in Section 3.1) can be used without the need for any additional resources.

Implementation and Discussion

Three direct designs, a 4/2-IA design, and various 4- and 8-GFS designs of the Rabbit cipher were implemented using System Generator and synthesized using Xilinx XST (ISE 11.1). We targeted the Xilinx Virtex-5 LXT (XC5VLX50TFFG1136) FPGA hosted on the Xilinx ML 501 development board, consisting of 7,200 slices, 60 Block RAMs and 48 DSP48 slices.

Table 2 summarizes the post-place-and-route results, where the suffix V identifies the implementations with variable output rate (see Section 3.3). We stress the advantage of C-slow retiming by observing that a direct design can be clocked at a maximum of 71.58 MHz, while the fine-grained pipelining of the g function increases the clock rate to 141.38 MHz. This nearly doubles the throughput from 9.16 Gbps to 18.10 Gbps, in addition to increasing the throughput/area ratio. Although using SLICEM and SLICEL slices (memory- and logic-enhanced slices) for more efficient carry propagation sustains a clock rate of 71.58 MHz, the very high throughput (25.62 Gbps) of the 4/2-IA design shows the advantage of pipelining the adders; we expect that C/k-IA designs with k > 2 will allow a further increase in the clock rate, and thus throughput. Furthermore, our results confirm that the estimates made in [6] are reasonably accurate.

Table 2 also shows the performance of the more compact n-GFS designs. Moving from an 8-GFS to a 4-GFS shows a linear increase in throughput (1.34 to 2.74 Gbps), with only a modest increase in slice count (260 to 360). The single-stream 4-GFS and 3-slow 8-GFS are ideal for resource-constrained environments, while delivering reasonably high throughputs (2.74 and 3.43 Gbps, respectively). For cases where variable-rate and out-of-order keystream output is acceptable, we recommend the 3-slow 8-GFS, as it outperforms the 4-GFS by more than 26% while using approximately 35% fewer slices and half the number of DSP slices. We measured the performance penalty and additional resources of hardware-initialized designs, as compared to hardware/software co-designs, to be less than 5% and 10%, respectively.
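The throughput and throughput/area figures quoted above can be cross-checked with a short calculation. The sketch below assumes, as an inference from the reported numbers rather than a statement in the text, that a direct or IA design emits one 128-bit keystream block per clock cycle and that an n-GFS design emits one block every n cycles; the helper names are hypothetical. The frequencies and slice counts are those of Table 2.

```python
def throughput_gbps(freq_mhz, fold=1, block_bits=128):
    """Throughput in Gbps, assuming one block_bits-wide output every `fold` cycles."""
    return freq_mhz * block_bits / fold / 1000.0


def mbps_per_slice(freq_mhz, slices, fold=1):
    """Throughput/area ratio in Mbps per slice."""
    return throughput_gbps(freq_mhz, fold) * 1000.0 / slices


# (frequency in MHz, slices, folding factor n) as reported in Table 2
designs = {
    "Direct, 4-slow": (141.383, 961, 1),
    "4/2-IA": (200.120, 1163, 1),
    "8-GFS, 3-slow": (214.638, 351, 8),
    "8-GFS, 3-slow, V": (216.450, 233, 8),
    "4-GFS": (85.697, 360, 4),
}

for name, (freq, slices, fold) in designs.items():
    gbps = throughput_gbps(freq, fold)
    ratio = mbps_per_slice(freq, slices, fold)
    print(f"{name:18s} {gbps:6.2f} Gbps {ratio:6.2f} Mbps/slice")

# Variable-rate 3-slow 8-GFS compared with the single-stream 4-GFS
gain = throughput_gbps(216.450, 8) / throughput_gbps(85.697, 4) - 1.0
slice_saving = 1.0 - 233 / 360
print(f"throughput gain {gain:.1%}, slice saving {slice_saving:.1%}")
```

Under these assumptions the computed values agree with the table entries for the listed designs, and the last two figures correspond to the "more than 26%" throughput advantage and the "approximately 35%" slice reduction of the variable-rate 3-slow 8-GFS over the 4-GFS.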
Moreover, since the initialization circuit is not needed after initialization, we recommend the hardware/software co-design as a very resource-efficient design approach.

For completeness, we also compare our results to other stream cipher implementations in Table 2. The table shows previous results for the three eSTREAM hardware-oriented ciphers; a direct comparison is difficult, since [14, 10, 25] are based on the Spartan-3, Virtex-II, and Virtex-II Pro FPGAs, while we present results on the Virtex-5 (which is based on the new-generation 6-input LUT architecture). However, we observe that, in general, the throughput/slice ratio of our results is greater than that of MICKEY-128 2.0 and comparable with that of Grain. Trivium's throughput/slice is higher than that of the compared stream ciphers, including our 4/2-IA, whose throughput is much higher than that of all three eSTREAM candidates. We stress that although Rabbit is a software-oriented stream cipher, its performance in hardware is commendable in terms of both throughput and area usage.

| Design | Freq (MHz) | Slices (%) | DSP slices (%) | Block RAMs (%) | Throughput (Gbps) | Mbps/slice |
|---|---|---|---|---|---|---|
| (Rabbit) | | | | | | |
| Direct | 71.582 | 568 (7.88%) | 24 (50%) | 0 (0.00%) | 9.16 | 16.10 |
| Direct, 3-slow | 137.155 | 884 (12.28%) | 24 (50%) | 0 (0.00%) | 17.56 | 19.86 |
| Direct, 4-slow | 141.383 | 961 (13.35%) | 24 (50%) | 0 (0.00%) | 18.10 | 18.83 |
| 4/2-IA | 200.120 | 1163 (16.15%) | 24 (50%) | 0 (0.00%) | 25.62 | 22.03 |
| 8-GFS | 83.724 | 260 (3.61%) | 3 (6%) | 0 (0.00%) | 1.34 | 5.15 |
| 8-GFS, 2-slow | 138.198 | 368 (5.11%) | 3 (6%) | 0 (0.00%) | 2.21 | 6.01 |
| 8-GFS, 2-slow, V | 142.227 | 239 (3.32%) | 3 (6%) | 0 (0.00%) | 2.28 | 9.52 |
| 8-GFS, 3-slow | 214.638 | 351 (4.88%) | 3 (6%) | 0 (0.00%) | 3.43 | 9.78 |
| 8-GFS, 3-slow, V | 216.450 | 233 (3.24%) | 3 (6%) | 0 (0.00%) | 3.46 | 14.86 |
| 4-GFS | 85.697 | 360 (5.00%) | 6 (12%) | 0 (0.00%) | 2.74 | 7.62 |
| 4-GFS, 2-slow | 155.982 | 602 (8.36%) | 6 (12%) | 0 (0.00%) | 4.99 | 8.29 |
| 4-GFS, 3-slow | 195.198 | 588 (8.17%) | 6 (12%) | 0 (0.00%) | 6.25 | 10.62 |
| Estimate [6] | - | - | 24 | - | 17.8 | - |
| (eSTREAM) | | | | | | |
| MICKEY-128 [25] | 280.5 | 392 (2.86%) | 0 (0.00%) | 0 (0.00%) | 0.56 | 1.43 |
| Grain [14] | 155 | 356 (46.35%) | - | - | 2.48 | 6.97 |
| Grain-128 [10] | 181 | 48 (0.14%) | - | - | 0.18 | 3.77 |
| Trivium [14] | 190 | 388 (10.83%) | - | - | 12.16 | 31.34 |
| (other) | | | | | | |
| AES [11] | 350 | 400 | 0 (0.00%) | 0 (0.00%) | 4.1 | 10.2 |
| AES [17] | 168.3 | 5177 (37.8%) | - | 84 (61.7%) | 21.5 | 4.2 |
| RC4 [18] | 64 | 138 (8.98%) | - | 3 (12.5%) | 0.22 | 0.16 |
| LILI-II [19] | - | 866 (2.56%) | - | 1 (0.69%) | 0.24 | 0.28 |
| SNOW 2.0 [19] | - | 1015 (3.00%) | - | 3 (2.08%) | 5.659 | 5.57 |

Table 2. Rabbit Resource Usage and Performance Evaluation.

Finally, we compare our results to the Advanced Encryption Standard (AES, Rijndael) and various well-known stream ciphers. In terms of speed, the compact 3-slow 4-GFS Rabbit design outperforms the compared stream ciphers (RC4, LILI-II and SNOW 2.0) as well as the Virtex-5 implementation of AES [11], in addition to maintaining a higher throughput/area ratio (10.62 Mbps/slice). Similarly, the 4/2-IA outperforms one of the fastest AES implementations [17]; again, a direct comparison is difficult since the AES block cipher of [17] was implemented on older-generation Virtex-II Pro FPGAs. In addition to the very high speed of Rabbit in hardware, with the exception of RC4, the compact n-GFS implementations outperform the compared stream ciphers in terms of slices used as well; however, we also expect the slice count of the compared ciphers to be lower on a Virtex-5.

Conclusion

The first hardware standalone and hardware/software co-designs of the Rabbit stream cipher were presented and optimized using DSP system design techniques. As part of the generalized hardware framework, three different architectures were presented: a direct, an interleaved and a generalized folded structure. These implementations on the Virtex-5 LXT FPGA outperform previous FPGA implementations of stream ciphers such as MICKEY-128, RC4 and LILI-II, while also maintaining area efficiencies above 5 Mbps/slice. Future work includes further optimization of Rabbit for ASICs and low-power Spartan-6 FPGAs, and the implementation of additional IA and GFS variants.

Acknowledgment

The author would like to thank Om Agrawal, David Nummey, and anonymous reviewers for their insightful comments and suggestions. The support of Fred L. Fontaine and S ProCom2 , and Arjen K. Lenstra and LACAL is also appreciated.

References
1. Cryptico A/S. Differential properties of the g-function. White paper, http://www.cryptico.com/Files/filer/wp_differential_properties_gfunction.pdf, 2003.
2. J.P. Aumasson. On a bias of Rabbit. In State of the Art of Stream Ciphers Workshop (SASC 2007), eSTREAM, ECRYPT Stream Cipher Project, Report, 2007.
3. S. Babbage, C. Canniere, A. Canteaut, C. Cid, H. Gilbert, T. Johansson, M. Parker, B. Preneel, V. Rijmen, and M. Robshaw. The eSTREAM Portfolio. eSTREAM, ECRYPT Stream Cipher Project, 2008.
4. E.B. Barker, M.S. Nechvatal, E. Barker, S. Leigh, M. Levenson, M. Vangel, G. Discussion, and E. Studies. A Statistical Test Suite For Random And Pseudorandom Number Generators For Cryptographic Applications.
5. A. Biryukov and A. Shamir. Cryptanalytic Time/Memory/Data Tradeoffs for Stream Ciphers. Lecture Notes in Computer Science, pages 1-13, 2000.
6. M. Boesgaard, M. Vesterager, T. Christensen, and E. Zenner. The Stream Cipher Rabbit. ECRYPT Stream Cipher Project Report, 6, 2005.
7. M. Boesgaard, M. Vesterager, T. Pedersen, J. Christiansen, and O. Scavenius. Rabbit: A new high-performance stream cipher. Proc. Fast Software Encryption 2003. Lecture Notes in Computer Science, pages 307-329, 2003.
8. M. Boesgaard, M. Vesterager, and E. Zenner. The Stream Cipher Rabbit. New Stream Cipher Designs. Lecture Notes in Computer Science, 4986:69-83, 2008.
9. J. W. Bos, N. Casati, and D. A. Osvik. Multi-stream hashing on the playstation 3. In International Workshop on State-of-the-Art in Scientific and Parallel Computing 2008, Minisymposium on Cell/B.E. Technologies, 2008.
10. P. Bulens, K. Kalach, F.X. Standaert, and J.J. Quisquater. FPGA implementations of eSTREAM phase-2 focus candidates with hardware profile. In State of the Art of Stream Ciphers Workshop (SASC 2007), eSTREAM, ECRYPT Stream Cipher Project, Report, 2007.
11. P. Bulens, F.X. Standaert, J.J. Quisquater, P. Pellegrin, and G. Rouvroy. Implementation of the AES-128 on Virtex-5 FPGAs. AFRICACRYPT 2008. Lecture Notes in Computer Science, 5023:16-26, 2008.
12. N. Courtois. Fast algebraic attacks on stream ciphers with linear feedback. In Advances in Cryptology-CRYPTO, volume 2729, pages 176-194. Springer, 2003.
13. E. Ferro and F. Potorti. Bluetooth and Wi-Fi wireless protocols: a survey and a comparison. IEEE Wireless Communications, 12(1):12-26, 2005.
14. K. Gaj, G. Southern, and R. Bachimanchi. Comparison of hardware performance of selected Phase II eSTREAM candidates. State of the Art of Stream Ciphers Workshop (SASC 2007), eSTREAM, ECRYPT Stream Cipher Project, Report, 2007.
15. O. Goldreich. Foundations of Cryptography: Basic Tools. Cambridge University Press New York, NY, USA, 2000.
16. S. Hauck and A. DeHon. Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. Morgan Kaufmann, 2007.
17. A. Hodjat and I. Verbauwhede. A 21.54 Gbits/s fully pipelined AES processor on FPGA. In Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, pages 308-309, 2004.
18. P. Kitsos, G. Kostopoulos, N. Sklavos, and O. Koufopavlou. Hardware implementation of the RC4 stream cipher. In Circuits and Systems, 2003. MWSCAS '03. Proceedings of the 46th IEEE International Midwest Symposium on, volume 3, 2003.
19. P. Leglise, F.X. Standaert, G. Rouvroy, and J.J. Quisquater. Efficient implementation of recent stream ciphers on reconfigurable hardware devices. In 26th Symposium on Information Theory in the Benelux, pages 261-268, 2005.
20. Y. Lu, H. Wang, and S. Ling. Cryptanalysis of Rabbit. In Proceedings of the 11th international conference on Information Security, pages 204-214. Springer, 2008.
21. W. Mao. Modern Cryptography: Theory and Practice. Prentice Hall Professional Technical Reference, 2003.
22. A.J. Menezes, P.C. Van Oorschot, and S.A. Vanstone. Handbook of applied cryptography. CRC press, 1997.
23. K.K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
24. B. Schneier. Applied Cryptography Second Edition: protocols, algorithms, and source code in C. John Wiley and Sons, 1996.
25. D. Stefan and C. Mitchell. Parallelized Hardware Implementation of the MICKEY-128 2.0 Stream Cipher. State of the Art of Stream Ciphers Workshop (SASC 2007), eSTREAM, ECRYPT Stream Cipher Project, Report, 2007.
26. I.A. UEA2&UIA. Specification of the 3GPP Confidentiality and Integrity Algorithms UEA2 & UIA2. Document 2: SNOW 3G Specifications. Version: 1.1. ETSI/SAGE Specification, 2006.
27. N. Weaver, Y. Markovskiy, Y. Patel, and J. Wawrzynek. Post-placement C-slow retiming for the Xilinx Virtex FPGA. In Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays, pages 185-194. ACM New York, NY, USA, 2003.
28. C.M. Wee, P.R. Sutton, N.W. Bergmann, and J.A. Williams. Multi stream cipher architecture for reconfigurable system-on-chip. In Field Programmable Logic and Applications, 2006. FPL '06. International Conference on, pages 1-4, Aug. 2006.
29. Xilinx. DSP Solutions Using FPGAs. http://www.xilinx.com/products/design_resources/dsp_central/grouping/fpgas4dsp.htm, 2009.

Cut-set Retiming

Given a data flow graph G, cut-set retiming is a technique in which the graph is split into two disconnected subgraphs $G_0$ and $G_1$. Further, for every edge from $G_0$ to $G_1$, k delays are added and, similarly, for every edge from $G_1$ to $G_0$, k delays are removed (note that this assumes the existence of the k delays). We refer to [23] for additional details. Figure 7 shows part of the chained counter adders of a 4-slow Rabbit cipher with an example of a cut-set (Figure 7(a)) and the respective retiming (Figure 7(b)). This particular example shows a reduction from a 256-bit addition to two 128-bit additions. Similarly, for C = 4 and C = 8-slow designs, the 256-bit addition can be further reduced to four 64-bit or eight 32-bit additions, respectively. We note that the IA design of Section 3.2 can be similarly pipelined; however, unlike the latter, the cut-set retimed design leads to an unbalanced design with a buildup of many registers between $c_{0,i+1}$ and the g function. As such, we prefer the IA design approach.

Fig. 7. Example of cut-set retiming to pipeline the chained counter adders. (a) Determining the cut-set. (b) Cut-set retiming.
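The reduction of the 256-bit counter addition into narrower additions can be checked functionally with a small model. The sketch below is an illustration under stated assumptions, not the retimed circuit itself: the counter increments `A` are the constants of the published Rabbit specification, the function names are hypothetical, and timing is not modeled. The carry passed between parts in `counters_split` stands for the value that a pipeline register on the cut-set would hold after retiming.

```python
MASK32 = 0xFFFFFFFF

# Counter increments a_0..a_7 from the published Rabbit specification.
A = [0x4D34D34D, 0xD34D34D3, 0x34D34D34, 0x4D34D34D,
     0xD34D34D3, 0x34D34D34, 0x4D34D34D, 0xD34D34D3]


def pack(words):
    """Concatenate 32-bit words into one integer, word 0 least significant."""
    return sum((w & MASK32) << (32 * k) for k, w in enumerate(words))


def unpack(value, count):
    """Split the low 32*count bits of value into 32-bit words."""
    return [(value >> (32 * k)) & MASK32 for k in range(count)]


def counters_chained(c, carry_in):
    """Eight chained 32-bit adders; the carry ripples from c_0 up to c_7."""
    out, carry = [], carry_in
    for cj, aj in zip(c, A):
        total = cj + aj + carry
        out.append(total & MASK32)
        carry = total >> 32
    return out, carry


def counters_split(c, carry_in, parts):
    """Same update as `parts` narrower additions (parts must divide 8).

    The carry handed from one part to the next is what a register on the
    cut-set would store between pipeline stages.
    """
    words = 8 // parts
    out, carry = [], carry_in
    for p in range(parts):
        lo = p * words
        total = pack(c[lo:lo + words]) + pack(A[lo:lo + words]) + carry
        out.extend(unpack(total, words))
        carry = total >> (32 * words)
    return out, carry
```

With `parts` set to 2, 4 or 8 the model performs two 128-bit, four 64-bit or eight 32-bit additions, and in each case returns the same counters and carry-out as the chained adders. What the retiming changes is the length of the carry path per clock cycle, which this functional model does not capture.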
