A High-Speed Unified Hardware Architecture For AES and The SHA-3 Candidate Grøstl

1) The document describes a new hardware architecture for performing AES encryption and Grøstl hashing in an interleaved-pipelined fashion to provide high-speed computations of both authentication and encryption with relatively small overhead. 2) Previous work on implementing Grøstl and AES individually in hardware is discussed. The document then analyzes the structures of Grøstl and AES to enable their integration for authenticated encryption. 3) The proposed coprocessor architecture is intended to accelerate IPSec by efficiently computing Grøstl-based HMAC and AES counter mode in parallel. Performance and resource usage of this combined architecture is evaluated.


A High-Speed Unified Hardware Architecture

for AES and the SHA-3 Candidate Grøstl


Marcin Rogawski and Kris Gaj
ECE Department
George Mason University
Fairfax, Virginia 22030
email: {mrogawsk, kgaj}@gmu.edu

Abstract—The NIST competition for developing the new cryptographic hash standard SHA-3 is currently in the third round. One of the five remaining candidates, Grøstl, is inspired by the Advanced Encryption Standard. This unique feature can be exploited in a large variety of practical applications. In order to have a better picture of the Grøstl-AES computational efficiency (high-level scheduling, internal pipelining, resource sharing, etc.), we designed a high-speed coprocessor for Grøstl-based HMAC and AES in the counter mode. This coprocessor offers high-speed computations of both authentication and encryption with relatively small penalty in terms of area and speed when compared to the authentication (original Grøstl circuitry) functionality only. From our perspective, the main advantage of Grøstl over other finalists is the fact that its hardware architecture naturally accommodates AES at the cost of a small area overhead.

Keywords-SHA-3 competition; hardware architectures; Grøstl; AES; resource sharing; pipelining; scheduling; IPSec.

I. INTRODUCTION

The National Institute of Standards and Technology (NIST) is currently holding a hash competition [1] to select a new cryptographic hash function standard, called SHA-3, for the purpose of superseding the functions in the SHA-2 family. Performance in hardware has been one of the major factors taken into account by NIST in the evaluation of Round 2 and Round 3 candidates during the SHA-3 competition [1], [2], [3]. This factor is particularly important in the final round of the contest, because the algorithms qualified to this round are not very likely to have any significant security weaknesses.

Several studies regarding stand-alone implementations of Round 2 and Round 3 SHA-3 candidates in FPGAs have been already reported in the literature [2]. The main objective of these studies was to evaluate all candidates in a unified approach, and therefore the unique features of each and every function were not deeply investigated.

There are relatively few works which discuss distinctive hardware architectures for the SHA-3 candidates. A coprocessor supporting Skein in tree hashing mode was presented in [4]. Common architectures of the block cipher AES and the Round 2 versions of Grøstl-0 and Fugue algorithms were reported in [5]. A compact implementation of block cipher Threefish and the Round 3 hash algorithm Skein was demonstrated in [6].

In this effort we are going to present a new hardware architecture for Grøstl and AES working in an interleaved-pipelined fashion. A practical application to IPSec hardware acceleration will be discussed.

The rest of this paper is organized as follows: In Section II we discuss previous work. Section III is devoted to the analysis of the Grøstl-AES structure for the authenticated encryption based on HMAC and counter mode, respectively. Section IV describes the proposed coprocessor. Finally, Section V discusses and analyzes the results and we draw conclusions in Section VI.

1 This work has been supported in part by NIST through the Recovery Act Measurement Science and Engineering Research Grant Program, under contract no. 60NANB10D004

II. PREVIOUS WORK

A. Grøstl hardware implementation

In January 2011, the Grøstl team published tweaks to their specification of Grøstl [7], [8]. An algorithm described by the original Grøstl specification [9] has been renamed to Grøstl-0, and the tweaked version of Grøstl, described by the revised specification [8], is from this point on called Grøstl. The proposed tweaks are aimed primarily at the increase in the algorithm resistance to cryptanalysis [7]. This increased resistance typically comes together with some limited penalty in terms of performance in hardware [10].

Grøstl-0 has been implemented by several groups in FPGAs and ASICs [2]. In this paper, we focus on implementations targeting FPGAs and optimized for high speed rather than low area. High-speed implementations of Grøstl-0 typically use two major architectures. In the first architecture, reported first in [9], permutations P and Q are implemented using two independent units, working in parallel. We call this architecture parallel architecture. In the second architecture, introduced in [11], the same unit is used to implement both P and Q. This unit is composed of two pipeline stages that allow interleaving computations belonging to permutations P and Q. We call this architecture quasi-pipeline architecture, as it is based on the similar principles as the quasi-pipelined architectures of SHA-1 and SHA-2 reported in [12], [13]. The details of the quasi-pipelined architecture of Grøstl-0 are described in [11] (Section 9), [14] (Section 3.8) and [15] (Section V).

An analysis of the influence of the Round 3 tweaks in Grøstl on the performance of this algorithm in FPGAs was conducted in [10]. Comprehensive hardware evaluation across multiple architectures for all SHA-3 finalists, including Grøstl, was investigated in [23]. The implementation results of hardware architectures, for a single stream of data, in both variants of Grøstl are summarized in Table I.

TABLE I
RESULTS OF IMPLEMENTATIONS FOR HIGH-SPEED ARCHITECTURES OF GRØSTL-256, USING XILINX VIRTEX 5 FPGAS.

Source                    Architecture     Implementation details             Memory  Frequency  Throughput  Area     Throughput/Area
                                                                              [BRAM]  [MHz]      [Mbps]      [Slice]  [Mbps/Slice]
Grøstl-0 - Round 2
Gauravaram et al. [9]     parallel         N/A*                               N/A*    200.7      10276       1722     5.97
Jungk et al. [15]         quasi-pipelined  S-boxes in BRAM                    17      295.0      7552        1381     5.46
Shahid et al. [16]        quasi-pipelined  T-boxes in BRAM                    48      250.0      6098        1188     5.13
Homsirikamol et al. [14]  quasi-pipelined  64-bit interface                   0       323.4      7885        1597     4.94
Gaj et al. [17]           quasi-pipelined  64-bit interface                   0       355.9      8676        1884     4.61
Matsuo et al. [18]        parallel         S-boxes in distributed memory      0       154.0      7885        2616     3.01
Baldwin et al. [19]       parallel         ideal interface, no padding unit   0       101.3      5187        2391     2.17
Kobayashi et al. [20]     parallel         S-boxes decomposed into logic      0       101.0      5171        4057     1.27
Guo et al. [21]           parallel         S-box decomposed into logic        0       80.2       4106        3308     1.24
Baldwin et al. [19]       parallel         32-bit interface, no padding unit  0       101.3      3242        2391     1.36
Baldwin et al. [19]       parallel         32-bit interface, padding unit     0       78.1       2498        2579     0.97
Grøstl - Round 3
Sharif et al. [22]        quasi-pipelined  S-box in BRAM                      18      226        5524        1141     4.84
Homsirikamol et al. [23]  quasi-pipelined  64-bit interface                   0       249        6072        1912     3.18
Homsirikamol et al. [23]  parallel         64-bit interface                   0       158        8081        2591     3.12
* not reported

TABLE II
NUMBER OF ROUNDS AND THE SECURITY LEVEL RELATIONS FOR GRØSTL AND AES

Security  Grøstl           AES
128-bit   (Grøstl-256) 10  (AES-128) 10
192-bit   (Grøstl-384) 12  (AES-192) 12
256-bit   (Grøstl-512) 14  (AES-256) 14

B. Sharing resources

The idea of hardware resources sharing is very practical and especially attractive in industrial applications. Several companies offer so-called all-in-one cryptographic solutions. For example [24] and [25] offer customized cores including a sophisticated AES core, which supports 128, 192 and 256-bit main key and several different operational modes in a single chip. The resource sharing concept was also investigated by academia: shared MD5 and SHA-1 implementation was described in [26]–[28], MD5 implemented together with RIPEMD-160 was reported in [29], and finally, SHA-1, MD5 and RIPEMD-160, implemented together, were discussed in [30]. An even more practical idea is to build a coprocessor which could share resources and support different cryptographic services: confidentiality and authentication. Cryptographic accelerators, in which the datapaths are combined: Fugue with AES core, and Grøstl-0 with AES core, were reported in [5]. A typical application for such a coprocessor will be the IPSec protocol suite [31] for securing the Internet Protocol, the basis of the Internet. This suite consists of the Authentication Header protocol (providing authentication only) and Encapsulating Security Payload (providing authentication and confidentiality at the same time).

III. AUTHENTICATED ENCRYPTION BASED ON GRØSTL AND AES IN A SINGLE COPROCESSOR

The specifications of the block cipher AES and the hash function Grøstl were described in [32] and [8], respectively. The round functions for both algorithms are summarized in Fig. 1.

The design described in [23] and the corresponding source codes from [33] will serve in this work as a starting point for our investigations.

[Figure: the AES round (128-bit input, AddRoundKey, SubBytes, ShiftRows, MixColumns, 128-bit output, plus a separate 128-bit last-round output) shown next to the Grøstl P/Q transformation (512-bit input, AddRoundConstant, SubBytes, ShiftBytes, MixBytes, 512-bit output).]
Fig. 1. Block diagram of Grøstl and AES round

A. Grøstl and the AES comparison

In order to extend the original Grøstl hardware architecture several facts have to be taken into consideration:
• The basic round structures of both algorithms are demonstrated in Fig. 1. All four corresponding transformations have the same order in both AES and Grøstl. Due to this fact a resource sharing between both algorithms is
especially attractive. It is expected that the delay in the critical path in both cases should be very similar.
• The SubBytes layers in both cases are built upon the same substitution box (S-box), therefore it can be fully shared (in Fig. 5, pt. 1). In terms of circuit area this transformation is the most costly out of all round-building operations.
• The ShiftRow and ShiftBytes transformations in AES and Grøstl, respectively, can be implemented as a permutation of bytes (simple rewiring). However, since they are not identical, both operations have to be implemented separately and properly multiplexed (in Fig. 5, pt. 2).
• The AddRoundKey and the AddRoundConstant transformations in AES and Grøstl, respectively, can be implemented as a simple network of XOR gates. However, since they are not identical, both operations have to be implemented separately and properly multiplexed (in Fig. 5, pt. 3).
• The MixColumn and the MixBytes in AES and Grøstl, respectively, share the GF(2^8) multiplication by constants: 0x02 and 0x03. Therefore they can be completely merged together. The networks of output XORs require two separate paths for both algorithms (in Fig. 5, pt. 4).
• The last round of the AES block cipher is different than the regular round. It is required to build a bypass bus and multiplex it together with the round's regular output (in Fig. 5, pt. 5).
• For a given security level both Grøstl and AES require the same number of rounds. This dependency is summarized in Table II. This fact helps to achieve a full synchronization of input data for both HMAC and Encryption module.
• The Grøstl double data flow pipe (P and Q transformations) vs. the AES one data flow pipe determines the optimal number of pipeline stages. The high-speed single stream of data quasi-pipelined hardware architecture of Grøstl, demonstrated in [14], [15], [11], requires two pipeline stages for the P and Q permutations intermediate values. The third pipeline stage is required for the AES intermediate data (in Fig. 5, pt. 6).
• Both algorithms' input block sizes differ. They are 128-bit and 512-bit for AES and Grøstl, respectively. The encryption of a 512-bit single stream of data, by four instances of an algorithm which can accommodate 128-bit input only, prohibits the feedback mode utilization. In order to increase the security level of non-feedback mode based encryption, the counter mode (in Fig. 5, pt. 7) was applied (in Fig. 3).
• The encryption process requires an extra storage space for the plain/ciphertext (in Fig. 5, pt. 8).
• For a given security level the output block of both algorithms is different. This fact implies the size extension (doubling) of the Parallel Input Serial Output (PISO) module for Grøstl-512 vs. Grøstl-256 (in Fig. 5, pt. 9).
• The Key scheduling algorithm for the AES algorithm requires an additional circuitry (in Fig. 5, pt. 10).
• Second hashing in the HMAC requires message padding (in Fig. 5, pt. 11).

Motivated by the above observations, we will show how to efficiently share the resources of Grøstl and AES in our coprocessor for an authenticated encryption.

B. HMAC/Grøstl

A mechanism for message authentication using cryptographic hash functions, the HMAC (Hash-based Message Authentication Code), was originally defined in [34] and adapted for the IPSec in [35]. Recently this last document was updated by [36]. HMAC has a generic form and it can be used with any iterative cryptographic hash function, e.g. Grøstl, in combination with a secret shared key. The HMAC cryptographic strength relies on the properties of the underlying hash algorithm. Fig. 2 demonstrates the HMAC generation process. Since the combination of HMAC with a current standard SHA-2 is denoted as HMAC/SHA-2, we are using corresponding notation for Grøstl algorithm (HMAC/Grøstl).

[Figure: HMAC generation steps, top to bottom:
  select hkey
  hkey ⊕ ipad
  (hkey ⊕ ipad) || data
  H((hkey ⊕ ipad) || data)
  hkey ⊕ opad
  (hkey ⊕ opad) || H((hkey ⊕ ipad) || data)
  H((hkey ⊕ opad) || H((hkey ⊕ ipad) || data))
  MAC(data)_t = leftmost 't' bytes of H((hkey ⊕ opad) || H((hkey ⊕ ipad) || data))]
Fig. 2. HMAC generation

In order to compute the HMAC value for a given message (data) and a key (hkey) the selected hash function has to be used twice. The output from the first computation is a function of the ipad constant, padded key, and a given message. The output from the second computation (the hmac-value) is a function of the opad constant, padded key, and the result of the first computation. For the sake of simplification of our circuit (padding of the second hash computation) we restricted the range of key size up to the Grøstl block size.

This assumption leads us to the relation between the throughput of HMAC/Grøstl and the throughput of Grøstl:
$$\frac{\mathrm{throughput}_{HMAC/Grøstl}}{\mathrm{throughput}_{Grøstl}} = \frac{\#blocks}{5 + \#blocks} \qquad (1)$$

where: #blocks is the number of data blocks for a given message and throughput_Grøstl is the maximum Grøstl hardware architecture throughput calculated for long messages.

The constant in the denominator is an overhead from HMAC/Grøstl and it is a sum of
• two HMAC key injections,
• two Grøstl message finalizations,
• and an injection of a message digest from the first to the second hash computation.

In case of long messages the effect of HMAC/Grøstl overhead is marginal, and it can be omitted in the throughput calculations.

C. AES in Counter mode

NIST has defined five confidentiality modes of operation for use with an underlying symmetric key block cipher algorithm: Electronic Codebook (ECB), Cipher Block Chaining (CBC), Cipher Feedback (CFB), Output Feedback (OFB), and Counter (CTR) in [37]. Two of the aforementioned modes of operation, namely ECB and CTR, allow parallel computations. In ECB mode, for a given key any given plaintext block encryption process always leads to the same ciphertext block. This property is undesirable in the predominant number of applications, and due to this fact the ECB mode should not be used. The CTR mode for a block cipher is presented in Fig. 3.

[Figure: the initialization vector loads a counter; the counter value, and the counter value offset by up to +(m−1), are each encrypted by a block cipher under the same key; each result is XORed with the corresponding data block M#(i), ..., M#(i+m−1) to give C#(i), ..., C#(i+m−1).]
Fig. 3. Block diagram of counter mode in block ciphers

To encrypt using AES/CTR-mode encryption, one starts with an arbitrary bit string (an n-block plaintext), a session key, and an init value for a 128-bit (block size) counter. The output ciphertext C = {C#1, C#2, ... C#(n−1)} is the XOR of corresponding plaintext chunks (in Fig. 3 the data blocks are represented as M = {M#1, M#2 ... M#(n−1)}, the extra block M#n) and the results of encryption of E_key(ctr), E_key(ctr + 1), ..., E_key(ctr + n − 1). The ciphertext is a pair (IV, C), where IV is the starting value for the counter. The decryption process is the same as encryption with M and C interchanged.

The biggest advantage of the CTR-mode for any block cipher, including AES, is the possibility of a full parallelization of the computations. In order to compute all data chunks: C#(i), C#(i + 1), ..., C#(i + m − 1) we can instantiate m AES coprocessors working simultaneously.

Since Grøstl specifies 512-bit (128-bit security level) and 1024-bit (256-bit security level) input block sizes, the number of corresponding CTR/AES cores is four and eight, respectively. The maximum throughput in such a configuration is four (eight in case of Grøstl with 1024-bit input block) times higher than the throughput of the single AES core.

IV. COPROCESSOR DESCRIPTION

A. Block diagram description

A block diagram presented in Fig. 5 shows the datapath used in the proposed Grøstl/AES coprocessor. The non-shaded components represent the original Grøstl design, available in [33]. The original Grøstl quasi-pipelined structure has one pipeline register inserted between SubBytes and ShiftBytes operations.

In order to perform encryption and message digest computation in parallel, the quasi-pipeline architecture was enriched by several extra elements. The shaded components show which elements have to be added in order to accommodate the HMAC/Grøstl and the AES-CTR functionality.

First of all, we have added an additional pipeline register after the Shared MixColumn/MixBytes operation. Two of the pipeline stages contain intermediate values for the P and Q functions from Grøstl, one extra stage is responsible for the encryption of intermediate values of the same block of data.

B. Grøstl and AES pipelining

In the very first clock cycle, an input message is loaded directly to the state register as an input to the operation Q. A message block is xored with an initialized chain register to create an input for the operation P in the second cycle of processing.

         |<------------- block M#(i) ------------->|
cycle:   0     1     2     3     4     5     ...   29    30    0     1
R0:      Q(0)  P(0)  A(0)  Q(1)  P(1)  A(1)  ...   A(9)  I     Q(0)  P(0)
R1:      I     Q(0)  P(0)  A(0)  Q(1)  P(1)  ...   P(9)  A(9)  I     Q(0)
R2:      A(9)  I     Q(0)  P(0)  A(0)  Q(1)  ...   Q(9)  P(9)  A(9)  I

Fig. 4. Pipelining in the Computational Unit of the Grøstl/AES core
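The register contents listed in Fig. 4 follow a simple rotating pattern, which can be reproduced in a few lines of Python. This is only an illustrative model, not the authors' VHDL: the function names are hypothetical, the slot labelled I is copied from the figure without interpreting it, and the model assumes the steady state in which one block immediately follows another (so the registers wrap around from the previous block).

```python
ROUNDS = 10  # rounds per block for Grøstl-256 and AES-128 (Table II)


def slot_sequence(rounds=ROUNDS):
    """Order in which work items enter register R0 during one block."""
    seq = []
    for r in range(rounds):
        # Q runs one cycle ahead of P, and two cycles ahead of AES (A).
        seq += [f"Q({r})", f"P({r})", f"A({r})"]
    seq.append("I")  # the slot labelled I in Fig. 4
    return seq


def registers_at(cycle, rounds=ROUNDS):
    """Labels held in (R0, R1, R2) at a given cycle, assuming steady state."""
    seq = slot_sequence(rounds)
    period = len(seq)  # 3 * rounds + 1 = 31 cycles per block
    # Stage k holds what stage 0 held k cycles earlier (wrapping around).
    return tuple(seq[(cycle - stage) % period] for stage in range(3))


if __name__ == "__main__":
    for cycle in (0, 1, 2, 3, 29, 30):
        print(cycle, registers_at(cycle))
```

The period of the sequence, 3 * 10 + 1 = 31, is the per-block cycle count used in the throughput discussion.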
[Figure: datapath of the Grøstl/AES core. Recoverable labels: 64-bit din feeding the SIPO, 64-bit dout leaving the PISO, the counter (ctr), IV, KeyExpansion with the last key, state registers R0, R1 and R2, the chaining value h, AddRoundConstant, AddRoundKey, SubBytes, ShiftBytes, ShiftRow, SharedMixBytes, the constant 0x80 ... 02, and the numbered points 1 to 11 referenced in the text. Bus widths: b = 512 for Grøstl-256, b = 1024 for Grøstl-512; for AES, ks = b/4.]

Fig. 5. Block diagram of Grøstl/AES core
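Point 4 of Fig. 5 is the shared MixColumns/MixBytes unit, whose common primitive is multiplication by 0x02 and 0x03 in GF(2^8). The Python sketch below illustrates that primitive in software. It uses the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) and assumes Grøstl uses the same field representation; the function names are hypothetical, and only the AES column mix is shown, since it needs nothing beyond these two multiplications.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def mul2(x: int) -> int:
    """Multiply a byte by 0x02 in GF(2^8)."""
    x <<= 1
    if x & 0x100:      # reduce when the degree-8 term appears
        x ^= AES_POLY
    return x & 0xFF


def mul3(x: int) -> int:
    """Multiply a byte by 0x03, i.e. (0x02 * x) xor x."""
    return mul2(x) ^ x


def aes_mix_column(col):
    """AES MixColumns on one 4-byte column, built only from mul2/mul3 and XOR."""
    return [
        mul2(col[i]) ^ mul3(col[(i + 1) % 4]) ^ col[(i + 2) % 4] ^ col[(i + 3) % 4]
        for i in range(4)
    ]


if __name__ == "__main__":
    print([hex(b) for b in aes_mix_column([0xDB, 0x13, 0x53, 0x45])])
```

In hardware the same doubling and tripling networks would feed two different output XOR trees, one per algorithm, which is the split described in the MixColumn/MixBytes item of Section III-A.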

Finally, in the third clock cycle the counter values are loaded to the state register, R0. At the same time when the first stage of the pipeline starts executing the first phase of the AES round, the second stage of the pipeline continues the execution of the P operation and the third stage is in the last phase of the Q operation.

The first stage of the pipeline consists of the Grøstl's P/Q AddRoundConstant, the AES AddRoundKey units and the fully shared SubBytes layer (in Fig. 6).

The second stage of the pipeline consists of the ShiftBytes/ShiftRows and modified MixBytes units.

The third stage of the pipeline consists of just two multiplexers.

A part of the function Q is always performed one cycle ahead of the corresponding part of the function P and two clock cycles before CTR-mode AES related data.

Finalization of the hash process in this design takes two clock cycles. First, the chaining value, h, is xored with the final value of Q, while P is still being processed. In the subsequent cycle the final result of P is mixed with the chaining value as well (in Fig. 4).

In the following clock cycle, the tenth round of the AES transformation is completed. Finally, the last AES key is xored with the output from the stage register and with the plaintext. Every time the encryption process is finished, the ciphertext is ready to be stored in the Parallel Input Serial Output (PISO) unit. The entire process is repeated until all blocks of a message are thoroughly hashed and encrypted.

The HMAC process also requires additional data in front (key xored with the constant ipad value) and at the end (key xored with the constant opad value) of the message. During the time when these pre- (M#0) and post- (M#n) data are processed, the AES module is not producing valid data (AES(idle) in Fig. 6).
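The role of the extra pre- and post-blocks can be checked in software. The sketch below builds the two HMAC passes by hand and compares the result with Python's standard `hmac` module. It is an illustration under stated assumptions: Grøstl is not available in Python's standard library, so SHA-256 (which also has a 512-bit block) is used as a stand-in hash; the function name is hypothetical; the ipad and opad byte values 0x36 and 0x5C come from the HMAC definition; and, as in the paper, the key is restricted to at most one block.

```python
import hashlib
import hmac

BLOCK_BYTES = 64  # 512-bit block of the stand-in hash (SHA-256)


def hmac_two_pass(key: bytes, data: bytes, t: int = 32) -> bytes:
    """HMAC as two hash passes; returns the leftmost t bytes of the tag."""
    if len(key) > BLOCK_BYTES:
        raise ValueError("this sketch only handles keys up to one block")
    padded_key = key.ljust(BLOCK_BYTES, b"\x00")
    m_first = bytes(b ^ 0x36 for b in padded_key)  # pre-block: hkey xor ipad
    m_last = bytes(b ^ 0x5C for b in padded_key)   # post-block: hkey xor opad
    inner = hashlib.sha256(m_first + data).digest()   # first hash computation
    outer = hashlib.sha256(m_last + inner).digest()   # second hash computation
    return outer[:t]


if __name__ == "__main__":
    tag = hmac_two_pass(b"session key", b"payload to authenticate")
    reference = hmac.new(b"session key", b"payload to authenticate", hashlib.sha256).digest()
    print(tag == reference)
```

Restricting the key to one block means both key-dependent inputs are exactly one hash block long, so each costs one extra block of hashing during which the AES path has nothing to encrypt.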
[Figure: high-level schedule of the three units over time.
  Input interface:     key, M#0, idle, M#1, idle, M#2, idle, M#3, ..., idle, M#n, idle
  Computational unit:  Groestl(M#0), Groestl(M#1), Groestl(M#2), ..., Groestl(M#(n−1)), Groestl(M#n), Groestl(M#(n+1))
                       AES(idle), AES(M#1), AES(M#2), ..., AES(M#(n−1)), AES(idle), AES(idle)
  Output interface:    idle, C#1, idle, ..., C#(n−2), idle, C#(n−1), idle, HMAC(M)
  Legend: M#0 = ipad ⊕ hkey; M#n = opad ⊕ hkey; plaintext M = {M#1, M#2, ... M#(n−1)}; ciphertext C = {C#1, C#2, ... C#(n−1)}; HMAC(M): HMAC value for a given plaintext M.]

Fig. 6. High level scheduling in the Grøstl/AES core

Finally, an HMAC value is calculated and it is taken from the bottom half of the chaining value.

For a given chunk of 512-bit data both Grøstl and AES cores need 31 clock cycles to complete their operations (3 pipeline stages per 10 rounds + 1 clock for the Grøstl finalization and 1 clock cycle for the final xor in the counter mode).

C. High-level scheduling

In order to make our implementations as practical as possible, we have followed a 64-bit interface and a simple handshaking protocol specification from [33]. Thanks to the assumptions taken from the aforementioned paper, it is possible to keep all three pipeline stages of the proposed coprocessor busy almost all the time.

The input-output operations overlap in many cases, therefore the separation of input/output bus and control signals is necessary.

A higher level scheduling is summarized in Fig. 6. The path of the very first chunk of message M#1 for the authenticated encryption is denoted by the shaded blocks.

During the computations of longer messages (more than three blocks), the coprocessor will be storing the result of the C#(i − 2) block, conducting HMAC/Grøstl and CTR/AES operations for the block C#(i − 1) and fetching the i-th block of data (M#(i)) at the same time.

D. Throughput discussion

In the most typical scenario the speed of the hardware implementation of cryptographic transformations is understood as a throughput for long messages. The exact throughput formula is defined as follows [17]:

$$\mathrm{throughput} = \frac{blocksize}{T \cdot (Time_{HE}(N + 1) - Time_{HE}(N))} \qquad (2)$$

where blocksize is an input block size, characteristic for each cryptographic transformation, Time_HE(N) is a total number of clock cycles necessary to hash/encrypt an N-block input data and T is the clock period, characteristic for each hardware coprocessor.

In case of the Grøstl/AES-based hardware accelerator described in this paper, the throughput formula for long messages is:

$$\mathrm{throughput} = \frac{512}{31 \cdot T} \qquad (3)$$

The typical application for an authenticated encryption-oriented, high-speed hardware coprocessor is the Encapsulating Security Payload (ESP) from the IPSec protocol suite. In this scenario the throughput has to be calculated for relatively short messages (40-1536 bytes).

Due to the fact that the HMAC/Grøstl computations take more time than the CTR/AES encryption, this HMAC/Grøstl throughput is considered as an effective throughput for a given message in our coprocessor. The final throughput formula is a result of both formulas: (1) and (3).

$$\mathrm{throughput} = \frac{512 \cdot \#blocks}{(5 + \#blocks) \cdot (31 \cdot T)} \qquad (4)$$

For long messages the formula (4) converges to the formula (3).

V. RESULTS

The HMAC/Grøstl and CTR/AES based hardware coprocessor was implemented on four high speed FPGA devices: 65nm Altera Stratix III and Xilinx Virtex 5, and 40nm Altera Stratix IV and Xilinx Virtex 6. As our tools, we have used Xilinx ISE 13.1 and Altera Quartus II 11.1. All architectures have been first modeled in VHDL-93, then synthesized, placed and routed using tools of the respective vendor. Maximum clock frequencies have been determined using static timing analysis tools provided as part of the respective software packages (quartus_sta for Altera and trace for Xilinx). The tool options were selected in such a way that no embedded resources, such as block memories or DSP units, were used during implementation. This choice was made in order to enable the comparison of all implementations in terms of area and throughput to area ratio. Table III summarizes the results collected after the Place-and-Route and Fitter in Xilinx and Altera, respectively.
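Formulas (3) and (4) are easy to evaluate numerically, which gives a quick cross-check of the throughput columns reported in Table III. The Python sketch below is a minimal illustration with hypothetical function names. One assumption is labelled in the code: for messages shorter than a block, the numerator counts the actual message bits rather than 512 · #blocks, because that appears to be what reproduces the 40-byte column of Table III; for message lengths that are a multiple of 512 bits the two readings coincide with formula (4).

```python
import math

BLOCK_BITS = 512           # Grøstl-256 input block size
CYCLES_PER_BLOCK = 31      # clock cycles per block in this coprocessor
HMAC_OVERHEAD_BLOCKS = 5   # constant from formula (1)


def long_message_throughput_mbps(freq_mhz: float) -> float:
    """Formula (3): 512 / (31 * T), with T = 1 / freq."""
    return BLOCK_BITS * freq_mhz / CYCLES_PER_BLOCK


def effective_throughput_mbps(freq_mhz: float, message_bytes: int) -> float:
    """Formula (4) for a message of the given size."""
    message_bits = 8 * message_bytes
    blocks = math.ceil(message_bits / BLOCK_BITS)
    cycles = (HMAC_OVERHEAD_BLOCKS + blocks) * CYCLES_PER_BLOCK
    # Assumption: only real message bits are counted as payload.
    return message_bits * freq_mhz / cycles


if __name__ == "__main__":
    freq = 271.0  # MHz, Stratix III entry of Table III
    print(round(effective_throughput_mbps(freq, 40)))
    print(round(effective_throughput_mbps(freq, 1536)))
    print(round(long_message_throughput_mbps(freq)))
```

With a frequency in MHz the result comes out directly in Mbps, since bits per cycle times millions of cycles per second gives megabits per second.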
TABLE III
RESULTS OF SHARED-RESOURCES IMPLEMENTATION FOR HMAC-GRØSTL AND AES IN COUNTER MODE ON MODERN FPGA

Family       Frequency    Area                   Throughput  Throughput   Throughput
                                                 @40Bytes    @1536Bytes   @infinity
Altera       [MHz]        [ALUTs, Memory bits]   [Mbps]      [Mbps]       [Mbps]
Stratix III  271          (9337, 0)              466         3704         4476
Stratix IV   264          (9322, 0)              454         3608         4360
Xilinx       [MHz]        [CLB Slices, BRAMs]    [Mbps]      [Mbps]       [Mbps]
Virtex 5     261 (+4.8%)  (2505, 0) (+31%) *     449         3567         4310 (-29%) *
Virtex 6     276          (2221, 0)              474         3773         4558
* The relative difference between the reference Grøstl design from [23] and this work

TABLE IV
RESULTS OF SHARED-RESOURCES IMPLEMENTATION FOR GRØSTL-0 (GRØSTL) AND AES IN ALTERA CYCLONE III

Design                             Functionality     Frequency       Area              Latency   Throughput     Throughput/Area
                                                     [MHz]           [Logic Elements]  [Cycles]  [Mbps]         [Mbps/Slice]
Järvinen [5]
reference Grøstl-0                 Grøstl-0          57.2            12086             20        1473           0.122
Grøstl-0 and 4*AES                 Grøstl-0          56.0 (-2.6%)    13723 (+13.5%)    20        1434 (-2.6%)   0.104
                                   AES               56.0            13723             10        2868           0.209
                                   Grøstl-0 and AES  56.0            13723             30        956*           0.070
Grøstl-0, 3*AES and Key Expansion  Grøstl-0          53.4 (-7.2%)    13453 (+11.3%)    20        1366 (-2.6%)   0.102
                                   AES               53.4            13453             10        2049           0.152
                                   Grøstl-0 and AES  53.4            13453             30        911*           0.068
* Throughput calculated for the authenticated encryption based on HMAC-Grøstl and AES-CTR
This work
reference Grøstl-0                 Grøstl-0          141.1           19005             21        3440           0.181
Grøstl-0, 4*AES and Key Expansion  Grøstl-0 and AES  159.9 (+13.3%)  23039 (+23.4%)    31        2640 (-23.3%)  0.115
reference Grøstl                   Grøstl            130.1           19260             21        3171           0.165
reference AES and Key Expansion    AES               129.4           4901              11        1505           0.307
Grøstl, 4*AES and Key Expansion    Grøstl and AES    144.0 (+10.7%)  23758 (+23.4%)    31        2378 (-25.0%)  0.100

Generally in terms of area, the coprocessor proposed in this effort can be implemented on the smallest device from every selected family. In case of small messages, the throughput is a function of the message size. For the smallest 40-byte packets, it is just 11% of the long messages throughput, but in case of 1536-byte messages it reaches almost 83% of long messages throughput.

A. Comparison to the stand-alone Grøstl implementation

In Table III we have summarized the implementation results of the proposed Grøstl/AES hardware accelerator.

First of all, in the case of the Xilinx Virtex 5 implementation, the coprocessor investigated in this effort requires 31% more area than the basic version of the quasi-pipelined architecture presented in [23]. Since this extra pipeline stage refinement breaks the critical path from the aforementioned design, the maximum frequency increases by 4.8%. The 3rd stage pipeline register location was investigated by moving it before the multiplexer (in Fig. 5, pt. 5). This change helps to improve the maximum frequency, but at the same time the throughput/area ratio decreases. However, due to the fact that the quasi-pipelined hardware architecture of Grøstl from [14] and the triple-staged Grøstl/AES in this work require 21 and 31 clock cycles, respectively, the overall throughput for long messages decreases by 29%.

In Table III we have presented the impact of IPSec minimum and maximum size messages on the effective throughput. In case of selected FPGA devices it varies between 450-3700Mb/s. The final throughput result depends on the traffic statistics in a given network.

The coprocessor proposed in this work can be easily implemented on the smallest devices available in every selected high-speed family.

B. Comparison to the Järvinen design [5]

In order to fairly compare our hardware accelerator with the circuit described in [5], an additional implementation in Altera low-cost Cyclone III is provided (in Table IV). In both our work and [5], one can observe the penalty in area for introducing extra AES functionality. In case of [5], a negligible frequency penalty was also introduced. This penalty is due to the fact that the basic iterative task (P and Q Grøstl-0 functions and AES round) of the coprocessor proposed in [5] is fully combinational and extra multiplexers were added to the original Grøstl-0 design. In case of our architecture, an additional pipeline stage enables frequency improvement. In the scenario when both encryption and hashing for a given block of data have to be computed, the design from [5] and our core will produce output in 30 and 31 clock cycles respectively. Due to the fact that our core has three pipeline stages, ideally our circuit should have 3 times higher frequency than [5]. The obtained result, 2.85x frequency improvement, proves the validity of this concept. A typical application for high-speed implementation of the combined confidentiality
and authentication services is a coprocessor for IPSec [31]. This protocol works in two different modes: Encapsulating Security Payload (ESP) and Authentication Header (AH). The first requires both a block cipher and a hash function at the same time for a given chunk of data; the second requires only a hash function. Table IV summarizes the results for both modes for our coprocessor and the one from [5]. For ESP requests we observe a 57% improvement, and for AH a 10% improvement, in terms of efficiency (throughput/area).
VI. CONCLUSIONS
The hash function Grøstl is one of the five finalists of the SHA-3 competition. The hardware performance of this function has been investigated thoroughly over the last few years.
In this paper we have investigated a feature that is unique among all SHA-3 candidates: Grøstl and the current Advanced Encryption Standard have similarities, and these similarities can be exploited very efficiently in hardware. Their common structure can be utilized in a combined data-path implementation. The coprocessor was optimized for high-speed implementation of both functions and can find practical application in IPSec-based secure networks. It outperforms the hardware accelerator proposed in [5] in terms of throughput/area for both IPSec modes, IP Encapsulating Security Payload (ESP) and Authentication Header (AH), by 57% and 10%, respectively.
The fully functional HMAC/Grøstl with CTR/AES hardware accelerator, compared to the stand-alone quasi-pipelined architecture of Grøstl described in [17] and later improved in [23], pays a price in terms of throughput and area on all reported devices, and in particular on Virtex 5: 29% in throughput and 31% in area. Not surprisingly, the maximum frequency of the proposed design increases (+4.8% for Virtex 5), as the number of pipeline stages was increased by one.
From our point of view, the main advantage of Grøstl over the other SHA-3 finalists is that a relatively small overhead in its hardware architecture enables a natural adoption of the most important block cipher to date, the Advanced Encryption Standard.
REFERENCES
[1] “Cryptographic Hash Algorithm Competition,” http://csrc.nist.gov/groups/ST/hash/sha3/index.html.
[2] “SHA-3 Hardware Implementations,” http://ehash.iaik.tugraz.at/wiki/SHA-3_Hardware_Implementations.
[3] “ATHENa Project website,” http://cryptography.gmu.edu/athenadb/.
[4] A. Schorr and M. Lukowiak, “Skein Tree Hashing on FPGA,” in Proc. ReConFig’10, (2010), pp. 292–297.
[5] K. Järvinen, “Sharing resources between AES and the SHA-3 second round candidates Fugue and Grøstl,” (2010), 2nd SHA-3 Candidate Conf.
[6] N. At, J.-L. Beuchat, and I. San, “Compact Implementation of Threefish and Skein on FPGA,” in Proc. NTMS, (2012).
[7] P. Gauravaram, L. Knudsen, K. Matusiewicz, F. Mendel, C. Rechberger, M. Schäffer, and T. Søren, “Tweaks on Grøstl,” (2011).
[8] “Grøstl - a SHA-3 candidate,” Submission to NIST (Round 3), (2011).
[9] P. Gauravaram, L. Knudsen, K. Matusiewicz, F. Mendel, C. Rechberger, M. Schäffer, and T. Søren, “Grøstl - a SHA-3 candidate,” Submission to NIST, (2008).
[10] M. Rogawski and K. Gaj, “Grøstl Tweaks and their Effect on FPGA Results,” 2011, http://eprint.iacr.org/2011/635.pdf.
[11] S. Tillich, M. Feldhofer, M. Kirschbaum, T. Plos, J.-M. Schmidt, and A. Szekely, “High-speed hardware implementations of BLAKE, Blue Midnight Wish, CubeHash, ECHO, Fugue, Grøstl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, and Skein,” Cryptology ePrint Archive, Report 2009/510, 2009, http://eprint.iacr.org/2009/510.pdf.
[12] L. Dadda, M. Macchetti, and J. Owen, “The design of a high speed ASIC unit for the hash function SHA-256 (384, 512),” in Proc. DATE’04, vol. 3, (2004).
[13] M. Macchetti and L. Dadda, “Quasi-pipelined hash circuits,” in Proc. ARITH’17, (2005), pp. 222–229.
[14] E. Homsirikamol, M. Rogawski, and K. Gaj, “Comparing hardware performance of fourteen round two SHA-3 candidates using FPGAs,” Cryptology ePrint Archive, Report 2010/445, (2010).
[15] B. Jungk and S. Reith, “On FPGA-based implementations of the SHA-3 candidate Grøstl,” in ReConFig’10, (2010), pp. 316–321.
[16] R. Shahid, M. U. Sharif, M. Rogawski, and K. Gaj, “Use of embedded FPGA resources in implementations of 14 round 2 SHA-3 candidates,” in Proc. FPT’11, (2011), pp. 1–9.
[17] K. Gaj, E. Homsirikamol, and M. Rogawski, “Fair and comprehensive methodology for comparing hardware performance of fourteen round two SHA-3 candidates using FPGA,” in Proc. CHES’10, (2010), pp. 491–506.
[18] S. Matsuo, M. Knežević, P. Schaumont, I. Verbauwhede, A. Satoh, K. Sakiyama, and K. Ota, “How can we conduct “fair and consistent” hardware evaluation for SHA-3 candidate?” 2nd SHA-3 Candidate Conference, Tech. Rep., (2010).
[19] B. Baldwin, N. Hanley, M. Hamilton, L. Lu, A. Byrne, M. O’Neill, and W. P. Marnane, “FPGA implementations of the round two SHA-3 candidates,” in 2nd SHA-3 Candidate Conference, (2010).
[20] K. Kobayashi, J. Ikegami, S. Matsuo, K. Sakiyama, and K. Ohta, “Evaluation of hardware performance for the SHA-3 candidates using SASEBO-GII,” http://eprint.iacr.org/2010/010, (2010).
[21] X. Guo, S. Huang, L. Nazhandali, and P. Schaumont, “On the impact of target technology in SHA-3 hardware benchmark rankings,” (2010), http://eprint.iacr.org/2010/536.pdf.
[22] M. U. Sharif, R. Shahid, M. Rogawski, and K. Gaj, “Use of embedded FPGA resources in implementations of five round three SHA-3 candidates,” ECRYPT II Hash Workshop, (2011).
[23] E. Homsirikamol, M. Rogawski, and K. Gaj, “Throughput vs. area trade-offs architectures of five round 3 SHA-3 candidates implemented using Xilinx and Altera FPGAs,” in Proc. CHES’11, (2011), pp. 491–506.
[24] Algotronix, “http://www.algotronix-store.com/.”
[25] Helion, “http://www.heliontech.com/.”
[26] M.-Y. Wang, H. C.-T. Su, Chih-Pin, and C.-W. Wu, “An HMAC processor with integrated SHA-1 and MD5 algorithms,” in Proc. ASP-DAC’04, (2004), pp. 456–458.
[27] K. Järvinen, M. Tommiska, and J. Skytta, “A compact MD5 and SHA-1 co-implementation utilizing algorithms similarities,” in Proc. ERSA’05, (2005), pp. 48–54.
[28] D. Cao, J. Han, and X.-Y. Zeng, “A reconfigurable and ultra low-cost VLSI implementation of SHA-1 and MD5 functions,” in Proc. ASICON’07, (2007), pp. 862–865.
[29] T.-S. N. Chiu-Wah Ng and K.-W. Yip, “A unified architecture of MD5 and RIPEMD-160 hash algorithms,” in Proc. ISCAS’04, vol. 2, (2004).
[30] T. Ganesh, M. Frederick, T. Sudarshan, and A. Somani, “Hashchip: A shared-resource multi-hash function processor architecture on FPGA,” Integration, the VLSI journal, vol. 40, pp. 11–19, (2007).
[31] RFC-4301, “http://www.ietf.org/rfc/rfc4301.txt,” (2005).
[32] Advanced Encryption Standard (AES), National Institute of Standards and Technology (NIST), FIPS Publication 197, (2001), http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
[33] “GMU SHA-3 source codes,” (2011), http://cryptography.gmu.edu/athena/index.php?id=source_codes.
[34] The Keyed-Hash Message Authentication Code HMAC, National Institute of Standards and Technology (NIST), FIPS Publication 198-1, Jul. 2008.
[35] RFC-2104, “http://www.ietf.org/rfc/rfc2104.txt,” (1997).
[36] RFC-6151, “http://www.ietf.org/rfc/rfc6151.txt,” (2011).
[37] M. Dworkin, NIST Special Publication 800-38A: Recommendation for Block Cipher Modes of Operation, (2001), http://csrc.nist.gov/publications/nistpubs/800-38a/sp800-38a.pdf.
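
The throughput and frequency figures quoted in the results discussion and the conclusions can be cross-checked with a short calculation. The sketch below is not taken from the paper: it assumes that long-message throughput is proportional to clock frequency divided by clock cycles per block, with the same block size in both designs. The function name `relative_throughput` and the variable names are hypothetical and serve only as illustration.

```python
def relative_throughput(freq_ratio: float, base_cycles: int, new_cycles: int) -> float:
    """Throughput of a new design relative to a baseline design.

    Assumption (not from the paper): throughput = block_size * f_clk / cycles,
    with an identical block size, so only frequency and cycle count matter.
    """
    return freq_ratio * base_cycles / new_cycles


# Figures quoted in the text for Virtex 5: +4.8% frequency, 21 vs. 31 clock cycles.
ratio = relative_throughput(1.048, 21, 31)
throughput_loss_pct = (1.0 - ratio) * 100.0

# Three pipeline stages suggest an ideal 3x frequency gain over [5]; 2.85x was obtained.
pipeline_gain_fraction = 2.85 / 3.0

print(f"relative throughput: {ratio:.3f}")
print(f"throughput loss: {throughput_loss_pct:.1f}%")
print(f"fraction of ideal frequency gain: {pipeline_gain_fraction:.2f}")
```

Under these assumptions the computed loss is about 29%, in line with the reported figure, and the 2.85x result corresponds to 95% of the ideal threefold frequency gain.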
