PhD Gueneysu
on Reconfigurable Devices
Security Implementations for Hardware and Reprogrammable Devices

Dissertation
Abstract
With the rise of the Internet, the number of information processing systems has significantly
increased in many fields of daily life. To enable commodity products to communicate, so-
called embedded computing systems are integrated into these products. However, many of these
small systems need to satisfy strict application requirements with respect to cost-efficiency
and performance. In some cases, such a system also needs to drive cryptographic algorithms for
maintaining data security – but without significantly impacting the overall system performance.
With these constraints, most small microprocessors, which are typically employed in embedded
systems, cannot provide the necessary number of cryptographic computations. Dedicated hard-
ware is required to handle such computationally challenging cryptography. This thesis presents
novel hardware implementations for use in cryptography and cryptanalysis.
The first contribution of this work is the development of novel high-performance implementa-
tions for symmetric and asymmetric cryptosystems on reconfigurable hardware. More precisely,
most presented architectures target hardware devices known as Field Programmable Gate Ar-
rays (FPGAs) which consist of a large number of generic logic elements that can be dynamically
configured and interconnected to build arbitrary circuits. The novelty of this work is the us-
age of dedicated arithmetic function cores – available in some modern FPGA devices – for
cryptographic hardware implementations. These arithmetic function cores (also denoted as
DSP blocks) were originally designed to improve filtering functions in Digital Signal Processing
(DSP) applications. The thesis at hand investigates how these embedded function cores can be
used to significantly accelerate the operation of symmetric block ciphers such as AES (FIPS 197
standard) as well as asymmetric cryptography, e.g., Elliptic Curve Cryptography (ECC) over
NIST primes (FIPS 186-2/3 standard).
Graphics Processing Units (GPU) on modern graphics cards provide computational power ex-
ceeding that of most recent CPU generations. In addition to FPGAs, this work also demon-
strates how graphics cards can be used for high performance asymmetric cryptography. For
the first time in open literature, the standardized asymmetric cryptosystem RSA (PKCS #1)
and ECC over the NIST prime P-224 are implemented on an NVIDIA 8800 GTS graphics card,
making use of the Compute Uniform Device Architecture (CUDA) programming model.
A third and last contribution of this thesis considers the protection of reconfigurable systems
themselves and of the security-related components they contain. Typically, logical functions in FPGAs
are dynamically configured from SRAM cells and lookup tables used as function generators.
Since the configuration is loaded at startup and also can be modified during runtime, an attacker
can easily compromise the functionality of the hardware circuit. This is particularly critical
for security related functions in the logical elements of an FPGA, e.g., the attacker could be
able to extract secret information stored in the FPGA just by manipulating its configuration.
As a countermeasure, FPGA vendors already allow the use of encrypted configuration files
with some devices to prevent unauthorized tampering with circuit components. However, in
practical scenarios the secure installation of secret keys required for configuration decryption
by the FPGA is an issue left to the user to solve. This work presents an efficient solution for
this problem which requires hardly any changes to the architecture of recent FPGA devices.
Finally, this thesis presents a solution for installing a trustworthy security kernel – also
known as Trusted Platform Module (TPM) – within the dynamic configuration of an FPGA.
A major advantage of this approach with respect to the PC domain is the prevention of bus
eavesdropping between TPM and application since all functionality is encapsulated in a System-
on-a-Chip (SoC) architecture. Additionally, the functionality of the TPM can easily be extended
or updated, in case a security component has been compromised, without the need to replace the
entire chip or product.
Kurzfassung
Since the breakthrough of the Internet, the number of information-processing systems in many
areas of daily life has grown strongly. Embedded systems are employed for the communication
and processing of data in a wide variety of everyday objects, and they must often meet hard
requirements such as high performance at optimal cost efficiency. Depending on the application,
they must additionally fulfill further criteria, e.g., security aspects realized by cryptographic
methods, without noticeable losses in data processing speed. In this context, small microcontrollers,
as typically used in such systems, are quickly overwhelmed, so that dedicated hardware chips are
almost always employed for cryptographic functions in high-performance embedded systems.
A second aspect of this work is cryptanalysis with the aid of FPGA-based special-purpose
hardware architectures. All practical cryptographic methods are fundamentally subject to a
trade-off between efficiency and the desired level of security; the higher the security requirements,
the slower the cryptosystem is in general. For efficiency reasons, the security parameters of
a cryptosystem are therefore adapted to the best available attacks, whereby an attacker is
assumed to possess a high but bounded amount of computing power corresponding to the desired
security level. For this reason, the complexity of an attack must be analyzed carefully, so that a
precise statement about the security actually achieved by the cryptosystem can be made in a
practical manner. Within the scope of this work, the FPGA-based parallel cluster COPACOBANA
was substantially co-developed and advanced. This cluster, designed specifically for optimal
cost-performance efficiency, enables accurate effort estimates for attacks on various cryptosystems,
among others on the basis of a financial metric. With the aid of this cluster platform, weak or
older cryptosystems can be broken, and attacks on cryptographic methods currently considered
secure can be assessed. Besides the successful cryptanalysis of the symmetric DES block cipher,
a further part of this work consists of novel hardware implementations of (supporting) attacks
on asymmetric cryptosystems that are based on the Elliptic Curve Discrete Logarithm Problem
(ECDLP) or the Factorization Problem (FP).
A third and last area of this dissertation concerns the protection of the reconfigurable hardware
and its logical components themselves. Typical FPGAs are mostly dynamic SRAM-based logic
circuits that can be (re)configured at runtime. Therefore, particularly for security-critical
functions, care must be taken that the configuration of the FPGA cannot be manipulated by an
attacker, in order to prevent, for example, the extraction of a secret key or the compromise of a
deployed security protocol. Manufacturers have already equipped some FPGAs with the capability
of using symmetrically encrypted configuration files. In practice, however, especially with more
complicated business models, the classical problem of key distribution arises: how can the producer
of FPGA configuration files install the key required by the FPGA for decrypting the configuration
in the chip without having physical access to the FPGA? In this dissertation, a secure protocol is
presented for this purpose, which is based on the Diffie-Hellman key exchange and solves this key
distribution problem.
Furthermore, FPGAs are examined with regard to their ability to host a dynamically configurable
security kernel, a so-called Trusted Platform Module (TPM), in a dedicated dynamic region
that can provide trustworthy security functions to an application. The great advantage of this
system compared with classical TPM architectures in the PC domain is that eavesdropping on
security-relevant bus lines is made considerably harder, since a complete System-on-a-Chip
(SoC) architecture is employed. Furthermore, owing to the dynamic extensibility and updatability
of the security functions in the reconfigurable system, weak or broken security components can be
replaced at any time without having to replace the entire system.
Acknowledgements
This thesis is the result of nearly three years of cryptographic research, during which I have been
accompanied and supported by many people. Now I’d like to say thank you.
First, I would like to express my deep and sincere gratitude to my supervisor Prof. Christof Paar
for his continuous inspiration. I am grateful and glad that he gave me advice in professional
and personal matters and also shared many of his research experiences with me. And,
without doubt, he has the most outstanding talent to motivate people!
Furthermore, I would like to thank my thesis committee, especially Prof. Daniel J. Bernstein for his
very valuable counsel as external referee.
Next, I want to thank my wife Sindy and my family, in particular Ludgera, Suzan, Maria
and Denis, for all their great support and encouragement during the course of preparing for my
PhD. Thank you!
Very important for my research career at the university was the joint work accomplished with
Jan Pelzl. It was he who introduced me to the scientific community and also showed me how
to efficiently write research contributions. I also want to thank Saar Drimer for all the research
projects on which we collaborated. I thoroughly enjoyed the time we shared during his stay in
Bochum. Many thanks go to my colleagues and friends Thomas Eisenbarth, Markus Kasper,
Timo Kasper, Kerstin Lemke-Rust, Martin Novotný, Axel Poschmann, Andy Rupp and Marko
Wolf for discussions, publications and projects in all aspects of cryptography and, of course, also
the great time and activities beyond work! Moreover, I should not forget the COPACOBANA
team led by Gerd Pfeiffer and Stefan Baumgart who always did outstanding work to support
me in all low-level hardware questions with respect to our joint work on FPGA-based cluster
architectures. Also, I would like to thank Christa Holden for all her efforts on the final corrections
of my thesis. Last but not least, a special "thank you" is due to our team assistant Irmgard
Kühn for contributing to the outstanding atmosphere in our group and all her support with any
administrative task. I would like to thank all the hard-working students I supervised, in particular
Hans-Christian Röpke, Sven Schäge, Christian Schleiffer, Stefan Spitz and Robert Szerwinski.
And if you have now finished reading these lines with any as yet unsatisfied expectations, I’d like
to let you know that I certainly also intend to thank you. Thanks a lot!
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 High-Performance Cryptography on Programmable Devices . . . . . . . . 4
1.2.2 Cryptanalysis with Reconfigurable Hardware Clusters . . . . . . . . . . . 6
1.2.3 Trust and Protection Models for Reconfigurable Devices . . . . . . . . . . 7
IV Appendix 161
Bibliography 163
Publications 191
Chapter 1
Introduction
This chapter introduces the aspects of cryptography and cryptanalysis for repro-
grammable devices and summarizes the research contributions of this thesis.
1.1 Motivation
Since many recent commodity products integrate electronic components to provide more func-
tionality, the market for embedded systems has grown expansively. Likewise, the availability of
new communication channels and data sources, like mobile telephony, wireless networking and
global navigation systems, has created a demand for various mobile devices and handheld
computers. Along with the new features for data processing and communication, the need for
various security features on all of these devices has arisen. Examples of such security require-
ments are the installation and protection of vendor secrets inside a device to enable gradual
feature activation, secure firmware updates, and also aspects of user privacy. Some applications
even demand a complex set of interlaced security functions involving all fields of cryptogra-
phy. Additionally, these applications often put a demand on the necessary data throughput or
define a minimum number of operations per second. Since most embedded systems are based
on small microprocessors with limited computing power, execution of computationally costly
cryptographic operations on these platforms is extremely difficult without severely impacting
performance. This is where special-purpose hardware implementations for the cryptographic
components come into play.
Special-purpose hardware for cryptography is typically realized on one of two platform types: the
Application-Specific Integrated Circuit (ASIC), whose circuit is fixed at manufacturing time, and
the Field Programmable Gate Array (FPGA), which allows mapping the application circuitry
dynamically into a two-dimensional array of generic and reconfigurable logic elements.
Though an ASIC provides best possible performance and lowest cost per unit, its development
process is expensive due to the required setup of complex production steps and the manpower
involved. Furthermore, the circuit of an ASIC is inherently static and cannot be modified after-
wards so that design changes require complete redevelopment. This does not only affect system
prototypes during development: it is especially crucial for later upgrades of cryptosystems which
have been reported compromised or insecure, but have already been delivered to the customer.
With classic ASIC technology, such a modification requires an expensive rollback and in most
cases the exchange of the entire device.
Since the mid eighties, the FPGA technology has provided reconfigurable logic on a chip [Xil08a].
Instead of using fixed combinatorial paths and fine-grain logic made up from standard-cell li-
braries as with ASICs, these reconfigurable devices provide Configurable Logic Blocks (CLB)
capable of providing logical functions that can be reconfigured during runtime. As a result of
their dynamic configuration feature, FPGAs allow for rapid prototyping of systems with minimal
development time and costs. However, FPGAs come as a complete package with a specific
amount of reconfigurable logic making the use of FPGAs for a specific hardware application
more coarse-grain and thus more costly than ASICs (post-development). Besides FPGAs, so-called
Complex Programmable Logic Devices (CPLDs) are an alternative and cheaper variant of
reconfigurable devices. CPLDs consist of large configurable macro cells with fixed and
static interconnects and are thus used for simple hardware applications like bus arbitration or
low-latency signal processing. In contrast, FPGAs have a finer-grain architecture and freely
allow the connection of a large number of logic elements via a programmable switch matrix.
This makes FPGAs the best choice for complex systems such as cryptographic and cryptanalytic
algorithms. In this context such algorithms can be integrated in FPGAs either as a holistic
approach together with the main application and deployed as a System-on-a-Chip (SoC), or as a
coprocessor unit extending the feature set of a separate microprocessor. In this thesis, we focus
mainly on crypto implementations for FPGAs, since they provide sufficient logic resources for
complex implementations and the feature of reconfigurability to update implemented security
functions when necessary.
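The role of a LUT as a reconfigurable function generator can be illustrated with a small model (a simplified abstraction for illustration, not any vendor's actual primitive):

```python
# Toy model of a k-input FPGA lookup table (LUT): the SRAM configuration
# is simply the truth table of the Boolean function the LUT realizes.

class LUT:
    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k
        self.k = k
        self.table = list(truth_table)     # the "SRAM cells"

    def eval(self, *inputs):
        # The input bits select one SRAM cell, like address lines of a memory.
        addr = sum(bit << i for i, bit in enumerate(inputs))
        return self.table[addr]

    def reconfigure(self, truth_table):
        # Runtime reconfiguration amounts to rewriting the truth table.
        assert len(truth_table) == 2 ** self.k
        self.table = list(truth_table)

# Configure a 2-input LUT as XOR, then reconfigure it as AND.
lut = LUT(2, [0, 1, 1, 0])            # XOR truth table
assert lut.eval(1, 0) == 1
lut.reconfigure([0, 0, 0, 1])         # AND truth table
assert lut.eval(1, 1) == 1 and lut.eval(1, 0) == 0
```

This also illustrates the security concern raised below: whoever can rewrite the table contents changes the circuit's function.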
This thesis focuses on hardware implementations both in the fields of cryptography and crypt-
analysis. In general, cryptography is considered the constructive science of securing information
by means of mathematical techniques and known hard problems. Cryptanalysis, on the other
hand denotes the destructive art of revealing the secured information from an attacker’s per-
spective without the knowledge of any secret. Cryptanalysis is an essential concept maintaining
the effectiveness of cryptography – cryptographers should carefully review their cryptosystems
with the (known) tools given by cryptanalysis to assess the threat and possibilities of potential
attackers.
1.2. Summary of Research Contributions
The field of cryptography is divided into public-key (asymmetric) and private-key (symmetric)
cryptography. In symmetric cryptography, all trusted parties share a common secret key, e.g.,
to establish confidential communication. This symmetric approach to secure communication
channels has been used throughout history. As an example, first monoalphabetic shift ciphers
were already employed by Julius Caesar around 70 BC [TraAD]. In contrast, asymmetric cryp-
tography is rather new and was first introduced in open literature by Diffie and Hellman [DH76]
in the mid 1970s. In this approach, each party is provided with a key pair consisting of a secret
and public key. Encryption of data can be performed by everyone who has knowledge of the
public key, but only the owner of the secret key can decrypt information. Besides encryption,
public-key cryptography can also be used to efficiently achieve other security goals, such as
mutual key agreement and digital signatures.
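The historical shift cipher mentioned above is easily sketched in code (a toy illustration with no real security, of course):

```python
def caesar(text, shift):
    # Encrypt by shifting each letter of the alphabet; decryption is a
    # shift by -shift, so sender and receiver share one secret: `shift`.
    return "".join(
        chr((ord(c) - ord('A') + shift) % 26 + ord('A')) if c.isalpha() else c
        for c in text.upper()
    )

ct = caesar("ATTACK AT DAWN", 3)
assert ct == "DWWDFN DW GDZQ"
assert caesar(ct, -3) == "ATTACK AT DAWN"
```

With only 25 usable keys, exhaustive search breaks it instantly, which already hints at why key-space size is the central security parameter discussed later.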
Both symmetric and asymmetric cryptosystems are essential in practical systems.
By nature, the computational complexity of asymmetric cryptography is much higher than with
symmetric cryptography. This is due to the necessity of hard mathematical problems which are
converted into one-way functions with trapdoors to support the principle of a key pair
with a public and a private component. Common choices of hard problems for these one-way
functions are the Factorization Problem (FP), which is the foundation of the security of the
popular RSA [RSA78] system, and the Discrete Logarithm Problem for finite fields (DLP) or
elliptic curve groups (ECDLP). Public-key cryptography is thus only employed for applications
with demand for the advanced security properties of the asymmetric key approach. For all
other needs, like bulk data encryption, symmetric cryptography is the more efficient choice,
e.g., using the legacy Data Encryption Standard (DES) or the Advanced Encryption Standard
(AES) block ciphers. In many cases, hybrid cryptography comprising symmetric and asymmet-
ric cryptography is required (e.g., to provide symmetric data encryption with fresh keys which
are obtained from an asymmetric key agreement scheme).
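Such a hybrid scheme can be sketched with toy parameters (the group values and the hash-based key derivation below are illustrative assumptions, far too small and simple for real use):

```python
import hashlib

# Toy hybrid scheme: a Diffie-Hellman exchange (asymmetric) establishes a
# fresh shared key, which then drives a hash-derived keystream (symmetric)
# for bulk encryption. All parameters are deliberately tiny and insecure.
p, g = 2003, 5                      # toy group parameters (assumption)
a, b = 123, 456                     # Alice's and Bob's secret exponents
A, B = pow(g, a, p), pow(g, b, p)   # exchanged public values
k_alice = pow(B, a, p)
k_bob = pow(A, b, p)
assert k_alice == k_bob             # both sides derive the same fresh key

def stream_encrypt(key, data):
    # Derive a keystream from the shared key; XOR is its own inverse,
    # so the same function also decrypts (data must fit the keystream).
    stream = hashlib.sha256(str(key).encode()).digest()
    return bytes(d ^ s for d, s in zip(data, stream))

ct = stream_encrypt(k_alice, b"hello")
assert stream_encrypt(k_bob, ct) == b"hello"
```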
This thesis provides new insights into the field of asymmetric and symmetric cryptography
as well as the cryptanalysis of established cryptosystems (and related problems) by use of
reconfigurable devices. In addition to that, this work also presents novel measures and protocols
to protect reconfigurable devices against manipulation, theft of Intellectual Property (IP) and
secret extraction.
Most of the presented design strategies and implementations of cryptographic and cryptanalytic
applications in this thesis target Xilinx FPGAs. Xilinx Inc. is the current market leader in FPGA
technology; hence, the presented results can be widely applied wherever FPGA technology is
used. All presented cryptographic architectures in this contribution aim at applications with
high demands for data throughput and performance. For these designs, we primarily employ
powerful Xilinx Virtex-4 and Virtex-5 FPGAs which include embedded functional elements that
can accelerate the arithmetic operations of many cryptosystems.
Implementations for cryptanalytic applications are usually designed to achieve an optimal
cost-performance ratio. More precisely, the challenge is to select an (FPGA) device which is
available at minimal cost but can provide a maximum number of cryptanalytic operations.
Hence, we mainly tailor our architectures for cryptanalytic applications specifically for clusters
consisting of cost-efficient Xilinx Spartan-3 FPGAs.
Finally, we present strategies to protect the configuration and security-related components on
FPGAs. Our protection and trust models are designed for use with arbitrary FPGAs satisfying
a specific set of minimum requirements (e.g., on-chip configuration decryption).
FPGA architectures for advanced cryptanalysis of the DES block cipher and DES-related
systems
The Advanced Encryption Standard is the most popular block cipher due to its standardization
by NIST in 2002. We developed an AES cipher implementation that is almost exclusively based
on embedded memory and arithmetic units of Xilinx Virtex-5 FPGAs. It is designed
to match specifically the features of this modern FPGA class – yielding one of the smallest and
fastest FPGA-based AES implementations reported up to now – with minimal requirements on
the (generic) configurable logic of the device. A small AES module based on this approach
returns a 32 bit column of an AES round each clock cycle, with a throughput of 1.76 Gbit/s
when processing two 128 bit input streams in parallel or using a counter mode of operation.
Moreover, this basic module can be replicated to provide a 128 bit data path for an AES round
and a fully unrolled design yielding throughputs of over 6 and 55 Gbit/s, respectively.
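These figures follow from simple cycle accounting. The sketch below assumes a clock frequency of 550 MHz; this value is an illustrative assumption chosen to reproduce the quoted 1.76 Gbit/s, not a result restated from the implementation chapters:

```python
f = 550e6                      # assumed clock frequency in Hz (illustrative)
block = 128                    # AES block size in bits

# Basic module: one 32-bit column per cycle -> 4 columns x 10 rounds
cycles_per_block = 4 * 10
t_basic = block * f / cycles_per_block
assert round(t_basic / 1e9, 2) == 1.76          # ~1.76 Gbit/s

# Four replicated modules: one full 128-bit round per cycle -> 10 cycles
t_round = block * f / 10
assert t_round / 1e9 > 6                        # "over 6 Gbit/s"

# Fully unrolled pipeline: one block per cycle once the pipeline is filled
t_unrolled = block * f
assert t_unrolled / 1e9 > 55                    # "over 55 Gbit/s"
```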
Elliptic curve cryptosystems provide lower computational complexity compared to other tradi-
tional cryptosystems like RSA [RSA78]. Therefore, ECCs are preferable when high performance
is required. Despite a wealth of research regarding high-speed implementation of ECC since
the mid 1990s [AMV93, WBV+96], providing truly high-performance ECC on reconfigurable
hardware platforms is still an open challenge. This applies especially to ECCs over prime fields,
which are often selected instead of binary fields due to standards in Europe and the US. In this
thesis, we present a new design strategy for an FPGA-based, high performance ECC implemen-
tation over prime fields. Our architecture makes intensive use of embedded arithmetic units
in FPGAs originally designed to accelerate digital signal processing algorithms. Based on this
technique, we propose a novel architecture to create ECC arithmetic and describe the actual
implementation of standard compliant ECC based on the NIST primes.
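The core operation such an architecture accelerates is scalar point multiplication. A minimal software sketch on a small textbook curve (toy parameters for illustration, not the NIST primes used in the thesis) looks as follows:

```python
# Double-and-add point multiplication on a toy short-Weierstrass curve
# y^2 = x^3 + 2x + 2 over GF(17) (textbook parameters, not a NIST curve).
p, a = 17, 2
INF = None  # point at infinity (the group's neutral element)

def add(P, Q):
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if P == Q:                                    # point doubling
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:                                         # point addition
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def mul(k, P):
    # Left-to-right double-and-add: the core loop an ECC engine implements.
    R = INF
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == '1':
            R = add(R, P)
    return R

G = (5, 1)                 # generator of a subgroup of order 19
assert mul(2, G) == (6, 3)
assert mul(19, G) is INF   # the order of G
```

In the FPGA designs, the modular multiplications and inversions inside `add` are exactly the field operations mapped onto the embedded DSP blocks.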
Modern Graphics Processing Units (GPU) have reached a dimension that far exceeds conven-
tional CPUs with respect to performance and gate count. Since many computers already include
such powerful GPUs as stand-alone graphics cards or chipset extensions, it seems reasonable to
employ these devices as coprocessing units for general-purpose applications and computations
to reduce the computational burden of the main CPU. This contribution presents novel im-
plementations using GPUs as accelerators for asymmetric cryptosystems like RSA and ECC.
With our design, an NVIDIA Geforce 8800 GTS can compute 813 modular exponentiations per
second for RSA with 1024 bit parameters (or, alternatively, for the Digital Signature Standard
(DSA)). In addition, we describe an ECC implementation on the same platform which
is capable of computing 1412 point multiplications per second over the NIST prime field P-224.
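The dominant operation behind the RSA/DSA figures is modular exponentiation. A minimal square-and-multiply sketch (functionally what each such computation performs, though the real implementation uses multi-precision arithmetic tuned to the GPU) is:

```python
def mod_exp(base, exp, mod):
    # Left-to-right square-and-multiply: one squaring per exponent bit and
    # one extra multiplication per set bit -- the workhorse of RSA and DSA.
    result = 1
    for bit in bin(exp)[2:]:
        result = (result * result) % mod
        if bit == '1':
            result = (result * base) % mod
    return result

# A 1024-bit exponentiation like those counted in the 813 ops/s figure
# iterates this loop about 1024 times; here a small sanity check:
assert mod_exp(7, 560, 561) == pow(7, 560, 561)
assert mod_exp(2, 10, 1000) == 24
```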
Extracts of the contributions presented in this part were also published in [DGP08, GP08, SG08].
Cryptanalysis of symmetric (and asymmetric) ciphers is a challenging task due to the enormous
amount of computations involved. The security parameters of cryptographic algorithms are
commonly chosen so that attacks are infeasible with available computing resources. Thus, in
the absence of mathematical breakthroughs to a cryptanalytical problem, a promising way
for tackling the computations involved is to build special-purpose hardware which provides a
better performance-cost ratio than off-the-shelf computers in many cases. We have developed a
massively parallel cluster system (COPACOBANA) based on low-cost FPGAs as a cost-efficient
platform primarily targeting cryptanalytic operations with high computational but low
communication and memory requirements [KPP+06b]. Based on this machine, we investigate
here various attacks on the weak DES cryptosystem, which had been the long-standing standard
block cipher according to FIPS 46-3 since 1977 – and is still used in many legacy (and even
recent) systems. Besides a simple brute-force attack on DES, we also evaluate time-memory
trade-off attacks on DES keys using COPACOBANA as well as the breaking of more advanced
modes of operation of the DES block cipher, e.g., some one-time password generators.
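The scale of such a brute-force attack can be estimated with elementary arithmetic; the key-search rate below is an illustrative assumption for a hypothetical FPGA cluster, not the measured COPACOBANA figure:

```python
keyspace = 2 ** 56                 # DES has a 56-bit key
rate = 6.5e10                      # assumed keys per second (illustrative)

worst_days = keyspace / rate / 86400
avg_days = worst_days / 2          # on average, half the keyspace is searched
assert 12 < worst_days < 13        # full search: under two weeks
assert 6 < avg_days < 7            # expected time: about a week
```

The key observation is that the search parallelizes perfectly, so doubling the number of FPGAs halves both figures, which is exactly why a cost-per-key metric is meaningful.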
As already mentioned, the utilization of Elliptic Curves (EC) in cryptography is very promising
for embedded systems due to small parameter sizes. This directly results from their resistance
against powerful index-calculus attacks, meaning that only generic, exponential-time attacks like
the Pollard rho method are available. We present here a first concrete hardware implementation
of this attack against ECC over prime fields and describe an FPGA-based multi-processing
hardware architecture for the Pollard rho method. With the implementation at hand and
given a machine like COPACOBANA, a fairly accurate estimate of the cost of an FPGA-
based attack can be derived. We extrapolate the results to actual ECC key lengths
(128 bits and above) and estimate the expected runtimes for a successful attack. Since FPGA-
based attacks are out of reach for key lengths exceeding 128 bits, we also provide additional
estimates based on ASICs.
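The expected cost of Pollard's rho is about sqrt(pi*n/4) group operations for a group of order n, which makes such extrapolations straightforward:

```python
import math

def rho_steps(bits):
    # Expected group operations for Pollard's rho: sqrt(pi * n / 4),
    # with n the order of the elliptic-curve group (about 2^bits).
    return math.sqrt(math.pi * 2.0 ** bits / 4.0)

# Every extra key bit multiplies the cost by sqrt(2), so going from a
# 128-bit to a 160-bit group raises the workload by a factor of 2^16.
assert 1.6e19 < rho_steps(128) < 1.7e19
assert abs(rho_steps(160) / rho_steps(128) - 2 ** 16) < 1
```

Dividing such step counts by the point-addition rate of a given FPGA or ASIC design directly yields the runtime estimates referred to above.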
The factorization problem is a well-known mathematical problem that mathematicians have
studied for centuries. Due to the lack of factorization algorithms
with better than subexponential complexity, cryptosystems like the well-established asymmet-
ric RSA system remain state-of-the-art. Since the best known attacks like the Number Field
Sieve (NFS) are too complex to be (efficiently) handled solely by (simple) FPGA systems, we
focus on improvements of hardware architectures of the Elliptic Curve Method (ECM) which
is preferably also used in substeps of the NFS. Previous implementations of ECM on FPGAs
were reported by Pelzl et al. [ŠPK+05] and Gaj et al. [GKB+06a]. In this work, we optimize
the low-level arithmetic of their proposals by employing the DSP blocks of modern FPGAs and
also discuss high-level design decisions such as the choice of alternative elliptic curve
representations like Edwards curves.
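Edwards curves are attractive here because a single, unified formula covers both point addition and doubling. A toy sketch over a small prime field (illustrative parameters, not an ECM-grade curve) is:

```python
# Unified addition on an Edwards curve x^2 + y^2 = 1 + d*x^2*y^2 over
# GF(p): the same formula handles addition and doubling, which simplifies
# the control logic of a hardware unit.
p, d = 13, 2            # toy parameters for illustration (insecure)

def ed_add(P, Q):
    (x1, y1), (x2, y2) = P, Q
    t = d * x1 * x2 * y1 * y2 % p
    x3 = (x1 * y2 + x2 * y1) * pow(1 + t, -1, p) % p
    y3 = (y1 * y2 - x1 * x2) * pow(1 - t, -1, p) % p
    return (x3, y3)

O = (0, 1)              # neutral element
P = (1, 0)              # a point of order 4 on every Edwards curve
assert ed_add(P, O) == P
Q = ed_add(P, P)
assert Q == (0, p - 1)          # 2P = (0, -1)
assert ed_add(Q, Q) == O        # 4P = neutral element
```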
Parts of the presented research contributions were also published by the author in [GKN+08,
GRS07, GPP+07b, GPP07a, GPP08, GPPS08].
The distinct advantage of SRAM-based FPGAs is their flexibility for configuration changes.
However, this opens up the threat of IP theft since the system configuration is usually stored
in easy-to-access external Flash memory. To prevent this, high-end FPGAs have already been
fitted with symmetric-key decryption engines used to load an encrypted version of the configu-
ration that cannot easily be copied and used without knowledge of the secret key. However, such
protection systems based on the straightforward use of symmetric cryptography are not well
suited to business and licensing processes, since they lack a convenient scheme for
key transport and installation. We propose a new protection scheme for the IP of circuits in
configuration files that provides a significant improvement to the current unsatisfactory situa-
tion. It uses both public-key and symmetric cryptography, but does not burden FPGAs with
the usual overhead of public-key cryptography: While it needs hardwired symmetric cryptog-
raphy, the public-key functionality is moved into a temporary configuration file for a one-time
setup procedure. Therefore, our proposal requires only very few modifications to current FPGA
technology.
Trusted Computing (TC) is an emerging technology used to build trustworthy computing plat-
forms which can provide reliable and untampered security functions to upper layers of an ap-
plication. The Trusted Computing Group (TCG) has proposed several specifications to imple-
ment TC functionalities by a hardware extension available for common computing platforms,
Extracts of the research contributions in this part are published in [GMP07a, GMP07b,
EGP+07a, EGP+07b].
Part I
High-Performance Cryptosystems on
Reprogrammable Devices
Chapter 2
Optimal AES Architectures for
High-Performance FPGAs
This chapter presents an AES cipher implementation that is based on memory blocks
and DSP units embedded within Xilinx Virtex-5 FPGAs. It is designed to match
specifically the features of these modern FPGA devices – yielding the fastest FPGA-
based AES implementation reported in open literature with minimal requirements on
the configurable logic of the device.
2.1 Motivation
Since its standardization in 2001, the Advanced Encryption Standard (AES) [Nat01] has become
the most popular block cipher for many applications with requirements for symmetric security.
By now, there exists a multitude of implementations and literature discussing how to
optimize AES in software and hardware. In this chapter, we focus on AES implementations
in reconfigurable hardware, in particular on Xilinx Virtex-5 FPGAs.
Analyzing existing solutions, most AES implementations are based on traditional
configurable logic to maintain platform independence and thus do not exploit the full potential
of modern FPGA devices. Hence, we present a novel way to implement AES based on the 32-bit
T-Table method [DR02, Section 4.2] by taking advantage of new embedded functions located
inside of the Xilinx Virtex-5 FPGA [Xil06], such as large dual-ported RAMs and Digital Signal
Processing (DSP) blocks [Xil07] with the goal of minimizing the use of registers and look-up
tables that could otherwise be used for other functions. Unlike conventional AES design ap-
proaches for these FPGAs [BSQ+08], our design is especially suitable for applications where
user logic is the limiting resource, yet not all embedded memory and DSP blocks are used.
Several authors already proposed to employ embedded memory (Block RAM or BRAM) for
AES [CG03, MM03] and there already exists work using the T-Table construction for FP-
GAs [FD01, CKVS06]. In contrast to these designs, our approach maps the complete AES
data path onto embedded elements contained in Virtex-5 FPGAs. This strategy provides most
savings in logic and routing resources and results in the highest data throughput on FPGAs
reported in open literature.
More precisely, we demonstrate that an optimal AES module can be created from a combi-
nation of two 36 Kbit BlockRAMs (BRAMs) and four DSP slices in Virtex-5 FPGAs. This basic
module comprises eight pipeline stages and returns a single 32 bit column of an AES round
each cycle. Since the output can be combined with the input in a feedback loop, this module is
sufficient to compute the full AES output in iterative operation. Alternatively, the basic module
can be replicated four times extending the data path to 128 bit to compute a full AES round
resulting in a reduced number of iterations. This 128-bit design can be unrolled ten times for
a fully pipelined operation of the AES block cipher. For reasons of comparability with other
designs we do not directly include the key expansion function in these designs but instead, we
provide a separate circuit for precomputing the required subkeys which can be combined with
all three implementations. This project was done as joint work with Saar Drimer [DGP08] who
did most of the implementations (except for the key schedule) as well as simulation of the entire
design. Moreover, Saar also elaborated on suitable modes of operation and authentication
methods (e.g., CMAC) for our design. See [Dri09] for further details.
2.2 Previous Work
Designs targeting High Performance: Architectures with the goal of achieving maximum
performance usually make thorough use of pipelining techniques, i.e., all AES rounds are
unrolled in hardware and can be processed in parallel. McLoone et al. [MM03] discuss an
AES-128 implementation based on the Xilinx Virtex-E 812(-8) device using 2,457 CLBs
and 226 block memories, providing an overall encryption rate of 12 Gbit/s. Hodjat and
Verbauwhede [HV04] report an AES-128 implementation with 21.54 Gbit/s throughput using
5,177 slices and 84 BRAMs on a Xilinx Virtex-II Pro 20(-7) FPGA. Järvinen et al. [JTS03]
show how to achieve a high throughput even without the use of any BRAMs on a Xilinx
Virtex-II 2000(-5) at the cost of additional CLBs: their design takes 10,750 slices and
provides an encryption rate of 17.8 Gbit/s. Finally, Chaves et al. [CKVS06] also use the
memory-based T-Table implementation on a Virtex-II Pro 20(-7) and provide a design of
a single iteration as well as a loop-unrolled AES based on a similar strategy as ours.
To our knowledge, only a few implementations [FD01, RSQL04, CKVS06] have transferred the
software architecture based on the T-table to FPGAs. Due to the large tables and the restricted
memory capacities of those devices, certain functionality has so far had to be encoded in user
logic (e.g., the multiplication elimination required in the last AES round, see Section 2.3). The
new features of Virtex-5 devices provide wider memories and more advanced logic resources. Our
contribution is the first T-table-based AES implementation that mostly uses device-specific
features, minimizing the need for generic logic elements. We will provide three individual
solutions that address each of the design categories mentioned above – minimal resource usage,
area-time efficiency and high throughput.
2.3 Mathematical Background

Each round of AES transforms a 128 bit state A, organized as a 4 × 4 matrix of bytes ai,j with
rows Ri and columns Cj, by the following four operations:

(1) SubBytes: all input bytes of A are substituted with values from a non-linear 8 × 8 bit
S-Box.

(2) ShiftRows: the bytes of rows Ri are cyclically shifted to the left by 0, 1, 2 or 3 positions.

(3) MixColumns: columns Cj = (a0,j, a1,j, a2,j, a3,j) are matrix-vector-multiplied by a matrix
of constants in GF(2^8).

(4) AddRoundKey: a round key Ki is added to the input using GF(2^8) arithmetic.
The sequence of these four operations defines an AES round, and they are applied iteratively
for a full encryption or decryption of a single 128 bit input block. Since some of the operations
above rely on GF(2^8) arithmetic, we are able to combine them into a single complex operation.
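To make operation (3) concrete, the MixColumns matrix-vector product can be sketched in a few lines of Python (an illustrative software model, not part of the hardware design); `xtime` is the usual multiply-by-02 helper in GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1 fixed by the standard.

```python
def xtime(a):
    """Multiply a by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

def mix_column(col):
    """MixColumns on one column (a0, a1, a2, a3); the constant matrix rows are
    (02 03 01 01), (01 02 03 01), (01 01 02 03), (03 01 01 02)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3,   # 02*a0 + 03*a1 + a2 + a3
        a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3,   # a0 + 02*a1 + 03*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3,   # a0 + a1 + 02*a2 + 03*a3
        xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3),   # 03*a0 + a1 + a2 + 02*a3
    ]
```

The column (db, 13, 53, 45), a widely used MixColumns test vector, maps to (8e, 4d, a1, bc).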
In addition to the representation in the Advanced Encryption Standard, an alternative
representation of the AES operation for software implementations on 32 bit processors was
proposed in [DR02, Section 4.2], based on the use of large lookup tables. This approach requires
four lookup tables with 8 bit input and 32 bit output for the four round transformations, each
8 Kbit in size. According to [DR02], these transformation tables Ti with i = 0 . . . 3 can be
computed as follows:
T0[x] = (S[x] × 02, S[x], S[x], S[x] × 03)
T1[x] = (S[x] × 03, S[x] × 02, S[x], S[x])
T2[x] = (S[x], S[x] × 03, S[x] × 02, S[x])
T3[x] = (S[x], S[x], S[x] × 03, S[x] × 02)
In this notation, each table entry is a column vector of four bytes, and S[x] denotes a table
lookup in the original 8 × 8 bit AES S-Box (as specified in NIST's FIPS-197 [Nat01]). The last
round, however, is unique since it omits the MixColumns operation, so we need to give it
special consideration. There are two ways of computing the last round: either by "reversing"
the MixColumns operation from the output of a regular round by another multiplication in
GF(2^8), or by creating dedicated T-tables for the last round. The latter approach allows us to
maintain the same data path for all rounds, so – since Virtex-5 devices provide larger memory
blocks than former devices – we chose this method and denote these T-tables as T[j]′. With all
T-tables at hand, we can redefine the transformation steps of a single AES round as

Ej = Kr[j] ⊕ T0(a0,j) ⊕ T1(a1,(j+1) mod 4) ⊕ T2(a2,(j+2) mod 4) ⊕ T3(a3,(j+3) mod 4)    (2.1)
where Kr[j] is a corresponding 32 bit subkey and Ej denotes one of four encrypted output
columns of a full round. We now see that based on only four T-table lookups and four XOR
operations, a 32 bit column Ej can be computed. To obtain the result of a full round, Equa-
tion (2.1) must be performed four times with all 16 bytes.
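Both the table construction and Equation (2.1) are easy to cross-check in software. The sketch below (illustrative Python; columns are packed MSB-first, i.e., a0,j in the most significant byte, which is a packing convention chosen for illustration) derives the S-Box from its definition, builds T0 to T3, and verifies that four table lookups plus XORs reproduce one round of SubBytes, ShiftRows, MixColumns and AddRoundKey.

```python
def gmul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def rotl8(b, n):
    return ((b << n) | (b >> (8 - n))) & 0xFF

def build_sbox():
    """S[x] = affine transform of the multiplicative inverse of x; S[0] = 0x63."""
    s = [0x63] * 256
    for x in range(1, 256):
        inv = next(y for y in range(1, 256) if gmul(x, y) == 1)
        s[x] = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
    return s

SBOX = build_sbox()

def build_t_tables():
    """Build T0..T3; T1 and T3 are single-byte right rotations of T0 and T2."""
    T = [[], [], [], []]
    for x in range(256):
        s = SBOX[x]
        col = [gmul(s, 2), s, s, gmul(s, 3)]        # T0 column: (02, 01, 01, 03) * S[x]
        for k in range(4):
            T[k].append(int.from_bytes(bytes(col), 'big'))
            col = [col[-1]] + col[:-1]              # rotate right by one byte
    return T

def round_ttable(cols, key, T):
    """One AES round per Equation (2.1): four lookups and four XORs per column."""
    byte = lambda w, i: (w >> (24 - 8 * i)) & 0xFF
    return [key[j] ^ T[0][byte(cols[j], 0)]
                   ^ T[1][byte(cols[(j + 1) % 4], 1)]
                   ^ T[2][byte(cols[(j + 2) % 4], 2)]
                   ^ T[3][byte(cols[(j + 3) % 4], 3)] for j in range(4)]

def round_basic(cols, key):
    """Reference round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    a = [[SBOX[(cols[j] >> (24 - 8 * i)) & 0xFF] for j in range(4)] for i in range(4)]
    a = [[a[i][(j + i) % 4] for j in range(4)] for i in range(4)]   # ShiftRows
    out = []
    for j in range(4):
        c = [a[i][j] for i in range(4)]
        m = [gmul(c[0], 2) ^ gmul(c[1], 3) ^ c[2] ^ c[3],
             c[0] ^ gmul(c[1], 2) ^ gmul(c[2], 3) ^ c[3],
             c[0] ^ c[1] ^ gmul(c[2], 2) ^ gmul(c[3], 3),
             gmul(c[0], 3) ^ c[1] ^ c[2] ^ gmul(c[3], 2)]           # MixColumns
        out.append(int.from_bytes(bytes(m), 'big') ^ key[j])        # AddRoundKey
    return out
```

Note that a single-byte circular right rotation maps each T0 entry to the corresponding T1 entry, and T2 to T3 – the property exploited in Section 2.5 to store only two tables (plus last-round counterparts) per BRAM.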
Input data to an AES encryption can be defined as four 32 bit column vectors Cj =
(a0,j , a1,j , a2,j , a3,j ) with the output similarly formatted in column vectors. According to
Equation (2.1), these input column vectors need to be split into individual bytes since all
bytes are required for the computation steps for different Ej . For example, for column
C0 = (a0,0 , a1,0 , a2,0 , a3,0 ) the first byte a0,0 is part of the computation of E0 , the second byte
a1,0 is used in E3 , etc. Since fixed (and thus simple) data paths are preferable in hardware
implementations, we have rearranged the operands of the equation to align the bytes according
to the input columns Cj when feeding them to the T-table lookup. In this way, we can imple-
ment a unified data path for computing all four Ej for a full AES round. Thus, Equation (2.1)
transforms into
E0 = Kr[0] ⊕ T0(a0,0) ⊕ T1(a1,1) ⊕ T2(a2,2) ⊕ T3(a3,3) = (a′0,0, a′1,0, a′2,0, a′3,0)
E1 = Kr[1] ⊕ T3(a3,0) ⊕ T0(a0,1) ⊕ T1(a1,2) ⊕ T2(a2,3) = (a′0,1, a′1,1, a′2,1, a′3,1)
E2 = Kr[2] ⊕ T2(a2,0) ⊕ T3(a3,1) ⊕ T0(a0,2) ⊕ T1(a1,3) = (a′0,2, a′1,2, a′2,2, a′3,2)
E3 = Kr[3] ⊕ T1(a1,0) ⊕ T2(a2,1) ⊕ T3(a3,2) ⊕ T0(a0,3) = (a′0,3, a′1,3, a′2,3, a′3,3)
where ai,j denotes an input byte, and a′i,j the corresponding output byte after the round
transformation. However, the unified input data path still requires a lookup into all four
T-tables for the second operand of each XOR operation. For example, the XOR component
at the first position of the sequential operations E0 to E3 requires the lookups
T0(a0,0), T3(a3,0), T2(a2,0) and T1(a1,0) (in this order) as well as the corresponding round key Kr[j].
Though operations are aligned for the same input column now, it becomes apparent that the
bytes of the input column are not processed in canonical order, i.e., bytes need to be swapped
for each column Cj = (a0,j , a1,j , a2,j , a3,j ) first before being fed as input to the next AES round.
The required byte transposition is reflected in the following equations:
C0 = (a′0,0, a′3,0, a′2,0, a′1,0)
C1 = (a′1,1, a′0,1, a′3,1, a′2,1)
C2 = (a′2,2, a′1,2, a′0,2, a′3,2)    (2.2)
C3 = (a′3,3, a′2,3, a′1,3, a′0,3)
Note that the given transpositions are static so that they can be efficiently hardwired in our
implementation.
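In software, this fixed wiring corresponds to one constant byte permutation per output word; the following sketch (illustrative Python, with the byte-to-word packing an assumption made for illustration) applies the transposition of Equation (2.2) to the four round outputs before they are fed back.

```python
# Byte order per column from Equation (2.2): C0 takes bytes (0,3,2,1) of E0,
# C1 takes (1,0,3,2) of E1, C2 takes (2,1,0,3) of E2, C3 takes (3,2,1,0) of E3.
PERM = [(0, 3, 2, 1), (1, 0, 3, 2), (2, 1, 0, 3), (3, 2, 1, 0)]

def feedback_transpose(E):
    """Rearrange round outputs E0..E3 into the next round's input columns."""
    out = []
    for j, word in enumerate(E):
        b = word.to_bytes(4, 'big')                       # bytes a'0..a'3, MSB-first
        out.append(int.from_bytes(bytes(b[p] for p in PERM[j]), 'big'))
    return out
```

Because the permutations are constants, a hardware implementation realizes them as plain wires, exactly as stated above.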
Finally, we need to consider the XOR operation of the input key and the input 128 bit block
which is done prior to the round processing. Initially, we will omit this operation when reporting
our results for the round function. However, adding the XOR to the data path is simple, either
by modifying the AES module to perform a sole XOR operation in a preceding cycle, or – more
efficiently – by just adding an appropriate 32 bit XOR which processes the input columns prior
to being fed to the round function.
2.3.1 Decryption
Although data encryption and decryption semantically only reverse the basic AES operations,
the basic operations themselves require different treatment, so typically separate hardware
components and significant logic overhead are necessary to support both. With our approach, all primitive
operations are encoded into T-tables for encryption, so that we can apply a similar strategy
for decryption by creating tables representing the inverse cipher transformation. Hence, we can
basically support an encryptor and decryptor engine with the same circuit by only swapping the
values of the transformation tables and slightly modifying the input. As with Equation (2.1),
decryption of columns Dj can be expressed by the following set of equations:
D0 = Kr[0] ⊕ I0(a0,0) ⊕ I1(a1,3) ⊕ I2(a2,2) ⊕ I3(a3,1) = (a′0,0, a′1,0, a′2,0, a′3,0)
D3 = Kr[3] ⊕ I3(a3,0) ⊕ I0(a0,3) ⊕ I1(a1,2) ⊕ I2(a2,1) = (a′0,3, a′1,3, a′2,3, a′3,3)
D2 = Kr[2] ⊕ I2(a2,0) ⊕ I3(a3,3) ⊕ I0(a0,2) ⊕ I1(a1,1) = (a′0,2, a′1,2, a′2,2, a′3,2)
D1 = Kr[1] ⊕ I1(a1,0) ⊕ I2(a2,3) ⊕ I3(a3,2) ⊕ I0(a0,1) = (a′0,1, a′1,1, a′2,1, a′3,1)
This requires the following inversion tables (I-tables), where S⁻¹[x] denotes a lookup in the
inverse 8 × 8 bit S-Box for AES decryption:
I0[x] = (S⁻¹[x] × 0E, S⁻¹[x] × 09, S⁻¹[x] × 0D, S⁻¹[x] × 0B)
I1[x] = (S⁻¹[x] × 0B, S⁻¹[x] × 0E, S⁻¹[x] × 09, S⁻¹[x] × 0D)
I2[x] = (S⁻¹[x] × 0D, S⁻¹[x] × 0B, S⁻¹[x] × 0E, S⁻¹[x] × 09)
I3[x] = (S⁻¹[x] × 09, S⁻¹[x] × 0D, S⁻¹[x] × 0B, S⁻¹[x] × 0E)
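Analogously to the T-tables, the I-tables can be generated directly from the inverse S-Box. The sketch below (illustrative Python; entries packed MSB-first, a packing choice made for illustration) builds them and confirms that the same single-byte-rotation relationship holds for decryption, so the two-tables-per-BRAM trick applies there as well.

```python
def gmul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def rotl8(b, n):
    return ((b << n) | (b >> (8 - n))) & 0xFF

def sbox():
    """Forward AES S-Box from its definition (GF inverse plus affine map)."""
    s = [0x63] * 256
    for x in range(1, 256):
        inv = next(y for y in range(1, 256) if gmul(x, y) == 1)
        s[x] = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
    return s

def build_i_tables():
    S = sbox()
    inv_s = [0] * 256
    for x, v in enumerate(S):
        inv_s[v] = x                                   # invert the forward S-Box
    I = [[], [], [], []]
    for x in range(256):
        si = inv_s[x]
        col = [gmul(si, c) for c in (0x0E, 0x09, 0x0D, 0x0B)]   # I0 column
        for k in range(4):
            I[k].append(int.from_bytes(bytes(col), 'big'))
            col = [col[-1]] + col[:-1]                 # rotate: yields I1, I2, I3
    return I
```

As a spot check, I0[0x00] = 0x51F4A750, since S⁻¹[0x00] = 0x52 and (0E, 09, 0D, 0B) × 52 = (51, F4, A7, 50).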
Figure 2.1: The key schedule derives subkeys for the round computations from a main key (the
subkey words wi are derived from the initial key words w0–w3 using the S-Box-based function f
and the round constants RC[r]).
The first operation of AES is a 128 bit XOR of the main key K0 with the 128 bit initial
plaintext block. During expansion, each subkey is split into four individual 32 bit words Kr[j]
for j = 0 . . . 3. The first word Kr[0] of each round subkey is extensively transformed using byte-
wise rotations and mappings through the same non-linear AES S-Box already used for encryption.
All subsequent words for j = 1 . . . 3 are determined as Kr[j] = Kr[j−1] ⊕ K(r−1)[j]. Figure 2.1
depicts the full key schedule.
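The word recurrence can be made concrete with a short software sketch of the AES-128 expansion (illustrative Python; the S-Box is regenerated from its definition so the snippet is self-contained):

```python
def gmul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return p

def rotl8(b, n):
    return ((b << n) | (b >> (8 - n))) & 0xFF

def sbox():
    s = [0x63] * 256
    for x in range(1, 256):
        inv = next(y for y in range(1, 256) if gmul(x, y) == 1)
        s[x] = inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
    return s

def key_expansion_128(key):
    """Expand a 16 byte AES-128 key into the 44 subkey words K0[0..3] .. K10[0..3]."""
    S = sbox()
    w = [int.from_bytes(key[4 * i:4 * i + 4], 'big') for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:                                   # first word of each subkey
            t = ((t << 8) | (t >> 24)) & 0xFFFFFFFF      # byte-wise rotation
            t = int.from_bytes(bytes(S[b] for b in t.to_bytes(4, 'big')), 'big')  # S-Box map
            t ^= rcon << 24                              # round constant
            rcon = gmul(rcon, 2)
        w.append(w[i - 4] ^ t)                           # Kr[j] = K(r-1)[j] xor Kr[j-1]
    return w
```

These are exactly the words that the byte-serial key schedule of Section 2.5 precomputes and stores in its BRAM; the test key below is the AES-128 example from FIPS-197 Appendix A.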
2.4 Embedded Elements of Modern FPGAs
The integrated DSP blocks – as indicated by their name – were designed to accelerate Digital Signal
Processing (DSP) applications, e.g., Finite Impulse Response (FIR) filters, etc. However, these
arithmetic units can be programmed to perform universal arithmetic functions not limited
to the scope of DSP filters; they support generic multiplication, addition and subtraction of
(un)signed integers. Depending on the FPGA class, a common DSP component comprises an
lM-bit signed integer multiplier coupled with an lA-bit adder, where the adder supports a
larger data path to allow accumulation of multiple subsequent products. Specifically, Xilinx
Virtex-4 FPGAs support 18 bit unsigned integer multiplication (yielding 36 bit products) and
three-input addition, subtraction or accumulation of unsigned 48 bit integers. Virtex-5 devices offer
support for even wider 25×18 bit multiplications. Since DSP blocks are designed as an embedded
element in FPGAs, there are several design constraints which need to be obeyed for maximum
performance with the remaining logic, e.g., the multiplier and adder block should be surrounded
by pipeline registers to reduce signal propagation delays between components. Furthermore,
since they support different input paths, DSP blocks can operate either on the external inputs
A, B, C or on internal feedback values from accumulation or the result Pj−1 from a neighboring
DSP block. Figure 2.3 shows the generic DSP-block and a small selection of possible modes of
operations available in recent Xilinx Virtex-4/5 FPGA devices [Xil08b] and used in this thesis.
Figure 2.3: Generic and simplified structure of the DSP blocks of advanced FPGA devices, with
inputs A, B, C and output Pi; the depicted modes of operation include exclusive-OR (XOR)
and multiply-and-accumulate.
2.5 Implementation
In Section 2.3, we introduced the T-table method for implementing the AES round, most
suitable for 32 bit microprocessors. Now, we will demonstrate how to adapt this technique
to modern reconfigurable hardware devices in order to achieve high throughput for modest
amounts of resources. For our implementations, we use Xilinx Virtex-5 FPGAs and make
intensive use of the embedded elements to achieve a design beyond traditional LUTs and registers.
Our architecture relies on dual-ported 36 Kbit BlockRAMs (BRAMs) (with independent address
and data buses for the same stored content) and DSP blocks. The fundamental idea of this work
is that the AES operation's 8 to 32 bit lookup followed by a 32 bit XOR perfectly matches this
architectural alignment of Virtex-5 FPGAs. Based on these primitives, we developed a basic
Equation (2.1). Figure 2.4 depicts such a mapping of Equation (2.1) into embedded functions
blocks of a Virtex-5 FPGA. The chosen design is optimized for Virtex-5 and allows efficient
placing and routing of components such that it can operate at the maximum device frequency
of 550 MHz. Furthermore, our basic module is designed such that it can be replicated for higher
throughput.
Figure 2.4: The mapping of AES column operations onto functional components of modern
Virtex-5 devices: the 32 bit column input is aligned into four 8 bit table indices, looked up in
the BRAMs, and combined with the 32 bit subkey by the DSP blocks into the 32 bit column
output. Each dual-ported BRAM contains four T-tables, including separate tables for the last
round. Each DSP block performs a 32 bit bit-wise XOR operation.
Particular care must be taken with the alignment of the inputs: here, four bytes ai,j are selected
from the current state A at a time and passed to the BRAMs for the T-table lookup. Since the
order of the bytes ai,j varies for each column computation Ej, this requires a careful design of
the input logic, since it needs to support selection from all four possible byte positions of each
32 bit column input. Hence, instead of implementing a
complex input logic, we modified the order of operations according to Equation (2.2), exploiting
that addition in GF(2^m) (i.e., XOR) is a commutative operation. When changing the order of
operations dynamically for each computation of Ej, all four T-tables together with their
last-round counterparts need to be stored in each BRAM. However, that would require fitting a
total of eight 8 Kbit T-tables into a single 36 Kbit dual-port RAM. As discussed in Section 2.3,
for performance and resource efficiency reasons we opted against reversing the MixColumns
operation of the last round with extra logic and preferred a solution in which each BRAM can
provide all eight required tables. Utilizing the fact that all T-tables are byte-wise transpositions
of each other, we can produce the outputs of T1, T2 and T3 by cyclically byte-shifting the
BRAM's output for T-table T0. Using this observation, we only store T0 and T2 and their last-
round counterparts T0′ and T2′ in a single BRAM. Using a single byte circular right rotation
(a, b, c, d) → (d, a, b, c), T0 becomes T1 , and T2 becomes T3 and the same for the last round’s
T-tables. In hardware, this only requires a 32 bit 2:1 multiplexer at the output of each BRAM
with a select signal from the control logic. For the last round, a control bit is connected to
a high order address bit of the BRAM to switch from the regular T-table to the last round’s
T-table. The adapted design can be seen in Figure 2.5. A dual-port BRAM holding 32 Kbit of
tables, three control bits, and a 32 bit 2:1 mux allow us to output all T-table combinations.
Using two such BRAMs with identical content, we obtain the necessary lookups for all four
columns, with each BRAM port performing one of the four T-table lookups in parallel.
Figure 2.5: The complete basic AES module consisting of four DSP slices and two dual-ported
block memories: the 32 bit plaintext column is split into four 8 bit addresses for ports A and B
of the two BRAMs, whose outputs are XORed with the 32 bit subkey in the DSP chain. Tables
T1 and T3 are constructed on-the-fly using byte shifting from tables T0 and T2 in the block
memory, respectively.
Note that both the BRAMs and DSP blocks provide internal input and output registers for
pipelining along the data path, so we can include these registers without occupying any
flip-flops in the fabric. At this point, we already had six pipeline stages that could not have
been easily removed if our goal was high throughput. Instead of trying to reduce pipeline stages
for lower latency, we opted to add two more so that we are able to process two input blocks
at the same time, doubling the throughput for separate input streams. One of these added
stages is the 32 bit register after the 2:1 multiplexer that shifts the T-tables at the output of
the BRAM.
A full AES operation is implemented by operating the basic construct with an added feedback
scheduling in the data path.
Figure 2.6 shows the eight pipeline stages where Kr[i] denotes the ith subkey of round r and
Dj the 32 bit table output produced by the four BRAM ports. The first column output E0
becomes available after the eighth clock cycle and is fed back as input for the second round.
For the second round, the control logic switches the 2:1 input multiplexer for the feedback path
rather than the external input. The exact data flow is given in detail in Table A.1 which can
be found in the appendix. In the eight pipeline stages we can process two separate AES blocks,
since we only need 4 stages to process the 128 bit of one block. This allows us to feed two
consecutive 128 bit blocks one after another, in effect doubling our throughout without any
additional complexity.
Figure 2.6: Pipeline stages to compute the column output of an AES round: the T-table lookup
of the bytes ai,j in the BRAM, the table output register and the byte permutation are followed
by the chain of DSP output registers #1–#4, which successively combine the table outputs
D0–D3 with the subkey.
We also investigated an alternative design approach for the basic AES module. Instead of
cascading several DSP units to create a data path with eight pipeline stages, we chose
to process each column Ej with the j-th DSP slice only, by selecting an operation mode for the
DSP slice which accumulates all input values using an internal feedback path (i.e., accumulation
in GF(2^m)). We found, however, that this requires the input of a key to each DSP block, extra
control logic, different operating modes for the DSP (e.g., for restarting accumulation), and a
32 bit 4:1 mux to choose between the output of each DSP for feeding the input to the next
round. Due to higher resource cost and worse routing results, we prefer to stick to the original
design.
Up to now we have focused on the encryption process, though decryption is achieved quite simply
with minor modifications to the circuit. As the T-tables are different for encryption and
decryption, storing them all would require double the amount of storage, which is not desirable.
Recall, however, that any Ti can be converted into Tj simply by shifting by the appropriate
number of bytes. The most straightforward modification to the design is to replace the 32 bit 2:1 mux
at the output of the BRAM with a 4:1 mux such that all byte transpositions can be created.
Then, we load the BRAMs with T_i^E, T_i^E′, T_i^D and T_i^D′, where T^E and T^D denote encryption
and decryption T-tables, respectively, with their corresponding last-round counterparts. Note
that this does not necessarily lengthen the data path, due to the 6-input LUTs in the CLBs of a
Virtex-5 device. Based on 6-input LUTs, a 4:1 multiplexer can be implemented as efficiently as a
2:1 multiplexer with only a single stage of logic. An alternative is to dynamically reconfigure the
content of the BRAMs with the decryption T-tables; this can be done from an external source,
or even from within the FPGA using the internal configuration access port (ICAP) [Xil06] with
a storage BRAM for reloading content through the T-table BRAMs' data input port.

Figure 2.7: Four instances of the basic structure in hardware allow all AES columns to be
processed in parallel (128 bit data path): each instance receives four 8 bit table indices selected
from the 128 bit plaintext bus and combines the BRAM outputs with its 32 bit subkey in the
DSP chain.
Finally, the AES specification requires an initial key addition of the input with the main key,
which has not been covered by the AES module so far. Most straightforwardly, this can be done
by adding one to four DSP blocks (alternatively, the XOR elements can be implemented in CLB
logic) as a prestage to the round operation.
Since the single AES round requires the computation of four 32 bit columns, we can replicate
the basic construct four times and add 8, 16, and 24 bit registers at the inputs of the columns.
This is shown in Figure 2.7 where all instances are connected to a 128 bit bus (32 bits per
instance) of which selected bytes are routed to corresponding instances by fixed wires. Note
that only one byte per 32 bit column output remains within the same instance, the other three
bytes will be processed by the other instances in the next round. The latency of this construct
is still 80 clock cycles as before, but allows us to interleave eight 128 bit inputs instead of two.
In contrast to the basic module, however, the input byte arrangements allow the T-table
selection to be static, so the 32 bit 2:1 multiplexers are no longer required. This simplifies the
data paths between the BRAMs and DSP blocks since the shifting can be fixed in the routing.
The control logic is simple as well, comprising a 3 bit counter and a 1 bit control signal for
choosing the last round's T-tables.
Finally, we implemented a fully unrolled AES design for achieving maximum throughput by
connecting ten instances of the round design presented above. This yields an architecture with an
80-stage pipeline, producing a 128 bit output every clock cycle at a resource consumption of 80
BRAMs and 160 DSP blocks. One advantage of this approach is the savings for control signals
since the full process is unrolled and thus completely hardwired in logic.
Considering the key schedule, many designers (e.g., [BSQ+ 08]) prefer a shared S-Box and/or
datapath for deriving subkeys and the AES round function. This approach needs additional
multiplexing and control signals to switch the central data path between subkey computations
and data encryption, which may lead to decreased performance in practice. Furthermore, key
precomputation is mostly preferred over on-the-fly key expansion because the former relaxes the
constraints on data dependencies, i.e., the computation only depends on the availability of
the previous state (plaintext) and not additionally on the completion of key computations.
If high throughput is not required but the key schedule needs to be precomputed
on chip without adversely increasing logic resource utilization, our basic AES module can be
modified to support the key generation. Remember that we already store the T-tables T[0..3]′ for
the last round in the BRAMs without the MixColumns operation, so that the entries of these
tables are basically byte-rotated 8 bit S-Box values. These values are perfectly suited for generating a
32 bit round key from S-Box lookups and our data path has been specifically designed for 32 bit
XOR operations based on the DSP unit. Hence, with additional input multiplexers, control logic
and a separate BRAM as key-store, we can integrate a key scheduler in our existing design.
However, although this is possible, the additional overhead (i.e., additional multiplexers) will
potentially degrade the performance of the AES rounds.
The second approach for the key schedule is a dedicated circuit to preserve the regularity of
the basic module and the option to operate the design at maximum device frequency. For a
minimal footprint, we propose to add another dual-ported BRAM to the design, used for storing
the expanded 32 bit subkeys (44 words for AES-128), the round constants (ten values)
and the 8 bit S-Box entries. The design of our key schedule implementation is shown in
Figure 2.8: port A of the BRAM is 32 bit wide which feeds the subkeys to the AES module,
while port B is configured for 8 bit I/O enabling a minimal data path for the key expansion
function. With an 8 bit multiplexer, register and XOR connected to port B data output, we can
construct a minimal and byte-oriented key schedule that can compute the full key expansion.
The sequential and byte-wise nature of this approach for loading and storing the appropriate
bytes from and to the BRAM requires a complex state machine. Recall that the BRAM provides
36 Kbits of memory of which 1408 to 1920 bits are required for subkeys (for AES-128 and AES-
256, respectively), 2048 bits for S-Box entries and 80 bits for round constants, so the BRAM can
still be used to store further data. Thus, we have decided that the most area economic approach
is to encode all the required memory addresses as well as control signals for multiplexers and
registers as 32 bit instructions, and store these instruction words in the yet unused portion
of the BRAM. This method also ensures a constant and uniform signal propagation in all
control signals, since they do not need to be generated by combinatorial logic but are loaded (and
hardwired) from the BRAM. In particular, complex state machines and the latency within their
combinatorial circuitry are usually the bottleneck of high-performance implementations, since
nested levels of logic to generate dozens of control signals are likely to emerge as the critical
path. By encoding this complexity into the BRAM, we avoid this performance degradation.
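The idea of replacing a wide combinatorial FSM with pre-encoded control words can be illustrated with a toy packing scheme (the field names, widths and positions below are purely hypothetical — the thesis does not specify the actual bit assignment of its 32 bit instructions):

```python
# Hypothetical layout of a 32 bit control word: two BRAM addresses, a write
# enable, a mux select and a register enable; the remaining bits stay unused.
FIELDS = {'addr_a': (0, 0x7FF), 'addr_b': (11, 0x7FF),
          'we': (22, 1), 'mux_sel': (23, 1), 'reg_en': (24, 1)}

def encode(**vals):
    """Pack named control fields into one instruction word."""
    word = 0
    for name, (shift, mask) in FIELDS.items():
        word |= (vals.get(name, 0) & mask) << shift
    return word

def decode(word):
    """Recover all control fields from an instruction word."""
    return {name: (word >> shift) & mask for name, (shift, mask) in FIELDS.items()}
```

Once such words are stored in the BRAM, "executing" the state machine reduces to sequentially addressing them, which is why no deep combinatorial decode logic ends up on the critical path.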
Figure 2.8: Block diagram of the key schedule implementation. Complex instructions of the
finite state machine, S-Boxes, round constants and 32 bit subkeys are stored in the dual-port
BRAM; an 8 bit port drives the byte-wise expansion while a 32 bit port outputs the subkeys Ki.
Like the basic AES module, it can be operated at the full device frequency of 550 MHz, with the
complete key expansion function requiring 524 clock cycles for AES-128.
2.6 Results
Our designs target Virtex-5 LX30 and SX95T devices at their fastest speed grade (-3) using
Xilinx Synthesis Technology (XST) and the ISE 9.2 implementation flow. For simulation we used
Mentor’s ModelSim 6.2g for both behavioral and post place-and-route stages. In addition, the
routes to input and output ports were ignored for timing (“TIG” constraint) during synthesis,
as we consider the cores as a stand-alone function.
The basic AES module as shown in Figure 2.5 passed timing (post place-and-route) for a
frequency just over 550 MHz, the maximum frequency rating of the device. The design requires
the following resources: 247 flip-flops, 96 (8 · 3 · 4) for the input shift registers plus 128 (4 · 32) for
the pipeline stages in between the BRAMs and DSPs, with the rest used for control logic; 275
look-up tables, mostly functioning as multiplexers; and finally, two 36 Kbit dual-port BRAM
(32 Kbit used in each) and four DSP blocks. We calculate throughput as follows: given that
there are 80 processing cycles operating at 550 MHz and we maintain a state of 256 bits in the
pipeline stages, we achieve 550 · 10^6 · 256/80 = 1.76 Gbit/s of throughput. This assumes that
the pipeline stages are always full, meaning that the module is processing two 128 bit inputs at
any given time; if only one input is processed, the throughput is halved. As we have mentioned,
the eight pipeline stages were implemented for the purpose of interleaving two inputs or using a
parallel mode of operation like Counter (CTR) mode, though the designer can remove pipeline
stages to reduce resources. Removing pipeline stages reduces latency, though it may also reduce
the maximum frequency, so there is a trade-off that needs to be assessed according to the
application.
In the round module, the basic construct is used four times for a 128 bit wide interface. The
maximum frequency reported by the tools post place-and-route was over 485 MHz, and it uses
621 flip-flops, 204 look-up tables, 8 36 Kbit BRAMs (32 Kbit used in each), and 16 DSP blocks.
Notice that we expect at least 4·48+4·128 = 704 registers but the tools report only 621. This is
because the synthesizer tries to achieve a balanced FF-LUT ratio for better packing into slices,
so the 2- and 3-stage input shift registers for each basic cell are implemented in eight LUTs each.
The latency of 80 clock cycles is the same as in the previous design, though now we can maintain
a state of 128 · 8 = 1024 bits, thus giving us a throughput of 485 · 10^6 · 8 · 128/80 = 6.21 Gbit/s
when processing eight input blocks. We can see that the complexity of this design reduces
the maximum frequency and throughput, though hand placement of DSPs and BRAMs, along
with matching the bit ordering to the routing can improve on this performance. As with the
basic module, pipeline stages can be removed to minimize the use of logic resources if they are
required for other functions and the highest throughput is not required.
Finally, the unrolled implementation produces 128 bits of output every clock cycle once the
initial latency is complete. We have experimented with eliminating the pipeline stage between
the BRAM and DSP to see if it adversely affects performance; this will save us 5,120 registers.
We found that the performance degradation is minimal, with the added benefit of having an
initial latency of only 70 clock cycles instead of 80. The resulting throughput is 430 · 10^6 · 128 ≈
55 Gbit/s. This design operates at a maximum frequency of over 430 MHz and uses 992 flip-
flops, 672 look-up tables, 80 36 Kbit BRAMs (only 16 Kbit in each for dec/enc or 32 Kbit for
both), and 160 DSP blocks; the same balancing act of FF-LUT ratio by the synthesizer occurs
here as well. There are very few flip-flops and LUTs compared to what is available in the large
SX95T device: 1.68% and 1.14%, respectively, though we use 32% of BRAMs and 25% of DSP
blocks.
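The quoted throughput figures all follow from frequency × bits in flight ÷ cycles per pipeline pass; a quick software sanity check of the three designs' numbers (illustrative Python):

```python
def throughput_gbps(f_mhz, bits_in_flight, cycles_per_pass):
    """Throughput = clock rate times payload bits completed per pipeline pass."""
    return f_mhz * 1e6 * bits_in_flight / cycles_per_pass / 1e9

basic    = throughput_gbps(550, 2 * 128, 80)   # two interleaved blocks, 80 cycles
round128 = throughput_gbps(485, 8 * 128, 80)   # eight interleaved blocks, 80 cycles
unrolled = throughput_gbps(430, 128, 1)        # one 128 bit block completed per cycle
```

The unrolled design computes to 55.04 Gbit/s, which the text rounds to 55 Gbit/s.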
Our results are summarized in Table 2.1. We extended the list of our findings with previous
results available in the literature. However, please be aware of the limited fairness of direct
comparisons. Due to the different architectures of Spartan-2/3 (S2/S3), Virtex-E/-2/-2 Pro
(VE/V2/V2P) and Virtex-5 (V5) FPGAs, we cannot directly compare soft metrics like "slices".
This is due to the different concepts of the contained LUTs (6-input LUTs in Virtex-5 and
4-input LUTs in all others) as well as the different number of LUTs and flip-flops per slice (e.g.,
a slice in Spartan and Virtex-2 FPGAs consists of a combination of 2 LUTs/FFs but 4 LUTs/FFs
in Virtex-5 devices).
Even the amount of memory contained in the BRAMs differs: Virtex-5 FPGAs provide block
memories capable of storing 36 Kbit of data, twice as much as Virtex-2 devices. Besides
device-specific differences, the implementations also target different applications and
requirements: some can operate in more complex modes of operation, others include a key
schedule in the data path or natively support encryption and decryption with the same circuit. This all
leads to the conclusion that comparisons with other publications based on different FPGAs and
application goals are mostly misleading, e.g., meaningful comparisons are only possible when
27
Chapter 2. Optimal AES Architectures for High-Performance FPGAs
Table 2.1: Our results along with recent academic and commercial implementations. Decryption
(Dec.) and Key expansion (Key) are included when denoted by •, by ◦ otherwise.
Note the structural differences between the FPGA types: Virtex-5 (V5) has 4 FFs
and 4 6-input LUTs per slice and 36-Kbit BRAMs, while Spartan-3 (S3), Virtex-E (VE)
and Virtex-II (Pro) (V2/V2P) have 2 FFs and 2 4-input LUTs per slice and 18-Kbit
BRAMs. Spartan-II (S2) devices only provide 4-Kbit BRAMs.
the same device/technology is used and the compared cryptographic implementations comply
with a predefined application setup or framework. Note that all these constraints on comparisons
between FPGA implementations also apply to other results reported in the remainder of this
thesis.
Note that our results for the AES modules are all based on the assumption that the set of
subkeys is provided externally. If all subkeys are to be generated on the same
device, these modules can be augmented with the key schedule, precomputing all subkeys and
storing them in a dedicated BRAM. As shown in Section 2.5.3, our key schedule is optimized
for a minimal footprint and allows operation at the maximum device frequency of 550 MHz. The
complexity of the state machine, which is the most expensive part in terms of logic, is mostly
hidden within the encoded 32-bit instructions stored in the BRAM. Hence, since only a small
stub of the state machine in the user logic is required to address the individual instruction
words, the overall resource consumption of the full key schedule is only 1 BRAM, 55 LUTs
and 41 flip-flops. All key schedule related data is presented in Table 2.2, supporting key sizes of
128, 192 and 256 bits.

2.7. Conclusions and Future Work

Table 2.2: Implementation results for the AES key schedule. Most state machine encoding and
control logic has been incorporated into the BRAM to save on logic resources.
² A wide 128-bit integer increment counter can introduce long signal propagation paths due to carry propagation,
degrading the overall system performance. Here, the use of three cascaded DSP slices for multi-precision
addition or a Linear Feedback Shift Register (LFSR) avoids any performance penalty.
Chapter 3
Optimal ECC Architectures for High-Performance FPGAs
3.1 Motivation
Asymmetric cryptographic algorithms are known to be extremely arithmetic-intensive since their
security rests on computational problems which are only considered hard when instantiated
with parameters of significant bit sizes.
Neal Koblitz and Victor Miller proposed independently in 1985 [Mil86, Kob87] the use of
Elliptic Curve Cryptography providing similar security compared to classical cryptosystems
but using smaller keys. This benefit allows for greater efficiency when using ECC (160–256 bit)
compared to RSA or discrete logarithm schemes over finite fields (1024–4096 bit) while providing
an equivalent level of security [LV01]. Due to this, ECC has become the most promising
candidate for many new applications, especially in the embedded domain, which is also reflected
by several standards by IEEE, ANSI and SECG [P1300, ANS05, Cer00a, Cer00b].
In addition to many new “lightweight” applications (e.g., digital signatures on RFID-like
devices), there are also many new applications which call for high-performance asymmetric
primitives. Even though very fast public-key algorithms can be provided for PC and server
applications by accelerator cards equipped with ASICs, providing very high speed solutions in
embedded devices is still a major challenge. Somewhat surprisingly, there appear to be ex-
tremely few, if any, commercially available ASICs or chip sets that provide high-speed ECC and
are readily available for integration in general embedded systems. A potential alterna-
tive is provided by FPGAs. However, despite a wealth of research regarding high-speed FPGA
(and high-speed software) implementation of ECC since the mid 1990s, providing truly high-
performance ECC (i.e., to reach less than 100µs per point multiplication) on readily available
platforms remains an open challenge. This holds especially for ECC over prime fields, which
are often preferred over binary fields due to standards in Europe and the US, and a somewhat
clearer patent situation.
In this work, we propose a novel hardware architecture based on reconfigurable FPGAs sup-
porting ECC over prime fields Fp, offering the highest single-chip performance
reported in the literature up to now. Usually, known ECC implementations for reconfigurable logic
implement the computationally expensive low-level arithmetic in configurable logic elements,
allowing for greatest flexibility but offering only moderate performance. Some implementations
have attempted to address this problem by using dedicated arithmetic hardware in the reconfig-
urable device for specific parts of the computations, like built-in 18 × 18 multipliers [MMM04].
But other components of the circuitry for field addition, subtraction and inversion have still
been implemented in the FPGA's fabric, which usually leads to a decrease in performance.
The central idea of this contribution is to relocate the arithmetic-intensive operations of ECC
over prime fields entirely into dedicated arithmetic function blocks on the FPGA originally
targeting DSP filter applications. As introduced in Section 2.4, these DSP accelerating functions
are built-in components in the static logic of modern FPGA devices, capable of performing
integer multiplication, addition and subtraction as well as a multiply-accumulate operation.
3.3. Mathematical Background
providing a significant speedup for modular multiplications. However, other field operations have
been implemented in the FPGA fabric, resulting in a very large design (15,755 slices and 256
multiplier blocks) on a large Xilinx XC2VP125 device. The architecture presented in [DMKP04]
was designed to achieve a better trade-off between performance and resource consumption. Ac-
cording to the contribution, an area consumption of only 1,854 slices and a maximum clock
speed of 40 MHz can be achieved on a Xilinx Virtex-2 XC2V2000 FPGA for a parameter bit
length of 160 bit.
Our approach to implementing an FPGA-based ECC engine was to shift all field operations
into the integrated DSP building blocks available on modern FPGAs. This strategy frees most
configurable logic elements on the FPGA for other applications and requires less power compared
to a conventional design. In addition, this architecture offers the fastest performance
for ECC computations over prime fields with up to 256-bit security in reconfigurable logic.
In this section, we briefly introduce the mathematical background relevant for this work.
We start with a short review of Elliptic Curve Cryptosystems (ECC). Please note that
only ECC over prime fields Fp is the subject of this work, since binary extension fields GF(2^m)
require binary arithmetic which is not (yet) natively supported by DSP blocks.
Let p be a prime with p > 3 and Fp = GF(p) the Galois field of order p. Given the Weierstrass
equation of an elliptic curve

E : y² = x³ + ax + b,

with a, b ∈ Fp and 4a³ + 27b² ≠ 0, all tuples (x, y) satisfying this equation are considered
points Pi ∈ E on the elliptic curve. Based on the group of points defined over this curve, ECC
arithmetic defines the addition R = P + Q of two points P, Q using the tangent-and-chord
rule as the primary group operation. This group operation distinguishes the cases P = Q
(point doubling) and P ≠ Q (point addition). Furthermore, formulas for these operations
vary for affine and projective coordinate representations. Since affine coordinates require the
availability of fast modular inversion, we will focus on projective point representation to avoid
the implementation of a costly inversion circuit. Given two points P1 , P2 with Pi = (Xi , Yi , Zi )
and P1 ≠ P2, the sum P3 = P1 + P2 is defined by
A = Y₂Z₁ − Y₁Z₂,  C = X₂Z₁ − X₁Z₂,  B = A²Z₁Z₂ − C³ − 2C²X₁Z₂,
X₃ = BC,  Y₃ = A(C²X₁Z₂ − B) − C³Y₁Z₂,  Z₃ = C³Z₁Z₂,  (3.1)
where A, B, C are auxiliary variables and P3 = (X3 , Y3 , Z3 ) is the resulting point in projective
coordinates. Similarly, for P1 = P2 the point doubling P3 = 2P1 is defined by
A = aZ² + 3X²,  B = YZ,  C = XYB,  D = A² − 8C,
X₃ = 2BD,  Y₃ = A(4C − D) − 8B²Y²,  Z₃ = 8B³.  (3.2)
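For cross-checking a datapath, Equations (3.1) and (3.2) can be modeled with a few lines of Python over a toy prime. The function names and the toy curve y² = x³ + x + 1 over F₂₃ used below are illustrative only; point_add assumes Z₁, Z₂ ≠ 0 and P1 ≠ ±P2, and a denotes the curve parameter:

```python
# Projective point operations following Eqs. (3.1) and (3.2);
# points are (X, Y, Z) triples with coordinates reduced mod p.
def point_add(P1, P2, p):
    X1, Y1, Z1 = P1
    X2, Y2, Z2 = P2
    A = (Y2 * Z1 - Y1 * Z2) % p
    C = (X2 * Z1 - X1 * Z2) % p
    B = (A * A * Z1 * Z2 - C**3 - 2 * C * C * X1 * Z2) % p
    X3 = (B * C) % p
    Y3 = (A * (C * C * X1 * Z2 - B) - C**3 * Y1 * Z2) % p
    Z3 = (C**3 * Z1 * Z2) % p
    return (X3, Y3, Z3)

def point_double(P1, a, p):
    X, Y, Z = P1
    A = (a * Z * Z + 3 * X * X) % p
    B = (Y * Z) % p
    C = (X * Y * B) % p
    D = (A * A - 8 * C) % p
    X3 = (2 * B * D) % p
    Y3 = (A * (4 * C - D) - 8 * B * B * Y * Y) % p
    Z3 = (8 * B**3) % p
    return (X3, Y3, Z3)
```

On the toy curve, doubling the point (0, 1, 1) yields a projective representative of the affine point (6, 19).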
Most ECC-based cryptosystems rely on the Elliptic Curve Discrete Logarithm Problem
(ECDLP) and thus employ point multiplication k · P as the cryptographic primitive, i.e., the
k-fold repeated point addition of a base point P. More precisely, the ECDLP (cf.
Chapter 6) is the fundamental cryptographic problem used in protocols and crypto schemes like
the Elliptic Curve Diffie-Hellman key exchange [DH76], the ElGamal encryption scheme [Elg85]
and the Elliptic Curve Digital Signature Algorithm (ECDSA) [ANS05].
According to Algorithm 3.1 the modular reduction for P-224 can be performed with two
224-bit subtractions and additions. However, these four consecutive operations can lead to a
potential over- and underflow in step 2. With Z = z1 + z2 + z3 − z4 − z5 , we can determine
the bounds −2p < Z < 3p reducing the number of final correction steps to two additions or
subtractions to compute the correctly bounded c mod P-224.
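The reduction and the bounded correction can be modeled in software as follows. This is a plain-Python sketch of the word-level arithmetic (not of the DSP pipeline); the word concatenations z1, …, z5 follow the standard NIST description, and nist_reduce_p224 is an illustrative name:

```python
P224 = 2**224 - 2**96 + 1

def words_to_int(ws):
    """Concatenate 32-bit words, given from most to least significant."""
    x = 0
    for v in ws:
        x = (x << 32) | v
    return x

def nist_reduce_p224(c):
    """Fast reduction modulo P-224: split c < P224**2 into 14 32-bit words,
    recombine them as Z = z1 + z2 + z3 - z4 - z5, then correct the result,
    which is bounded by -2p < Z < 3p, with at most two steps."""
    w = [(c >> (32 * i)) & 0xFFFFFFFF for i in range(14)]
    z1 = words_to_int([w[6], w[5], w[4], w[3], w[2], w[1], w[0]])
    z2 = words_to_int([w[10], w[9], w[8], w[7], 0, 0, 0])
    z3 = words_to_int([0, w[13], w[12], w[11], 0, 0, 0])
    z4 = words_to_int([w[13], w[12], w[11], w[10], w[9], w[8], w[7]])
    z5 = words_to_int([0, 0, 0, 0, w[13], w[12], w[11]])
    z = z1 + z2 + z3 - z4 - z5
    while z < 0:            # correction: add P at most twice
        z += P224
    while z >= P224:        # or subtract P at most twice
        z -= P224
    return z
```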
3.4. An Efficient ECC Architecture Using DSP Cores
Algorithm 3.2 NIST Reduction with P-256 = 2^256 − 2^224 + 2^192 + 2^96 − 1
Input: Double-sized integer c = (c15, . . . , c2, c1, c0) in base 2^32 and 0 ≤ c < (P-256)²
Output: Single-sized integer c mod P-256.
1: Concatenate the ci to the following 256-bit integers zj:
Algorithm 3.2 presents the modular reduction for P-256 requiring two doublings, four 256-bit
subtractions and four 256-bit additions. Based on the computation Z = z1 + 2z2 + 2z3 + z4 +
z5 − z6 − z7 − z8 − z9 , the range of the result to be corrected is −4p < Z < 5p.
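Since the body of Algorithm 3.2 is only summarized above, the z-word layout can be sketched in software as follows. The concatenations z1, …, z9 follow the standard description of the NIST P-256 reduction from the literature; this is a software model of the arithmetic, not of the DSP pipeline:

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def cat(ws):
    """Concatenate 32-bit words, given from most to least significant."""
    x = 0
    for v in ws:
        x = (x << 32) | v
    return x

def nist_reduce_p256(c):
    """Fast reduction modulo P-256: Z = z1 + 2*z2 + 2*z3 + z4 + z5
    - z6 - z7 - z8 - z9 with -4p < Z < 5p, then a bounded correction."""
    w = [(c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]
    z1 = cat([w[7], w[6], w[5], w[4], w[3], w[2], w[1], w[0]])
    z2 = cat([w[15], w[14], w[13], w[12], w[11], 0, 0, 0])
    z3 = cat([0, w[15], w[14], w[13], w[12], 0, 0, 0])
    z4 = cat([w[15], w[14], 0, 0, 0, w[10], w[9], w[8]])
    z5 = cat([w[8], w[13], w[15], w[14], w[13], w[11], w[10], w[9]])
    z6 = cat([w[10], w[8], 0, 0, 0, w[13], w[12], w[11]])
    z7 = cat([w[11], w[9], 0, 0, w[15], w[14], w[13], w[12]])
    z8 = cat([w[12], 0, w[10], w[9], w[8], w[15], w[14], w[13]])
    z9 = cat([w[13], 0, w[11], w[10], w[9], 0, w[15], w[14]])
    z = z1 + 2 * z2 + 2 * z3 + z4 + z5 - z6 - z7 - z8 - z9
    while z < 0:            # correction: at most four additions of P
        z += P256
    while z >= P256:        # or four subtractions of P
        z -= P256
    return z
```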
(1) Build DSP cascades: Neighboring DSP blocks can be cascaded to widen or extend their
atomic operand width (e.g., from 18 bit to 256 bit).
(2) Use DSP routing paths: DSPs have been provided with inner routing paths connecting
two adjacent blocks. It is advantageous in terms of performance to use these paths as
frequently as possible instead of using FPGA’s general switching matrix for connecting
logic blocks.
(3) Consider DSP columns: Within a Xilinx FPGA, DSPs are aligned in columns, i.e., routing
paths between DSPs within the same column are efficient while a switch in columns can
lead to degraded performance. Hence, DSP cascades should not exceed the column width
(typically 32/48/64 DSPs per column).
(4) Use DSP pipeline registers: DSP blocks feature pipeline stages which should be used to
achieve the maximum clock frequency supported by the device (up to 500 MHz).
(5) Use different clock domains: Optimally, DSP blocks can be operated at maximum device
frequency. This is not necessarily true for the remainder of the design so that separate
clock domains should be introduced (e.g. by halving the clock frequency for control
signals) to address the critical paths in each domain individually.
Modular Addition/Subtraction
Let A, B ∈ GF (P ) be two multi-precision operands with lengths |A|, |B| ≤ l and l = ⌊log2 P ⌋+1.
Modular addition C = A + B mod P and subtraction C = A − B mod P can be efficiently
computed according to Algorithm 3.3:
For using DSP blocks, we need to divide the l-bit operands into multiple words, each having
a maximum size of lA bits due to the limited width of the DSP input port. Thus, all inputs
A, B and P to the DSP blocks can be represented in the form X = Σ_{i=0}^{nA−1} xi · 2^(i·lA), where
nA = ⌈l/lA⌉ denotes the number of words of an operand. According to Algorithm 3.3, we
employ two cascaded DSP blocks, one for computing s(0,i) = ai ± (bi + CIN0 ) and a second for
s(1,i) = s(0,i) ∓ (pi + CIN1 ). The resulting values s(0,i) and s(1,i) each of size |s(j,i) | ≤ lA + 1
are temporarily stored and recombined to S0 and S1 using shift registers (SR). Finally, a 2-to-1
l-bit output multiplexer selects the appropriate value C = Si . Figure 3.1 presents a schematic
overview of a combined modular addition and subtraction based on two DSP blocks. Note that
DSP blocks on Virtex-4 FPGAs provide a dedicated carry input cIN but no carry output cOUT .
Particularly, this fact requires extra logic to compensate for duplicate carry propagation to the
Figure 3.1: Modular addition/subtraction based on DSP blocks.
second DSP, which is due to the fixed cascaded routing path between the DSP blocks. In this
architecture, each carry is considered twice, namely in s(0,i+1) and s(1,i), which needs to be corrected.
This special carry treatment requires a wait cycle to be introduced, so that one lA-bit word can
be processed every two clock cycles. However, this is no restriction for our architecture since we
design for parallel addition and multiplication, so that the (shorter) runtime of an addition is
completely hidden in the duration of a concurrent multiplication operation.
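At the integer level, the circuit of Figure 3.1 computes two candidate results and selects the correctly reduced one. A minimal Python model of this selection (ignoring the word-serial carry pipeline, and assuming 0 ≤ A, B < P):

```python
def mod_add(a, b, p):
    """C = A + B mod P: the first DSP chain adds, the second speculatively
    subtracts P; the output multiplexer selects the reduced candidate."""
    s0 = a + b
    s1 = s0 - p
    return s1 if s1 >= 0 else s0

def mod_sub(a, b, p):
    """C = A - B mod P: the first DSP chain subtracts, the second
    speculatively adds P back; the multiplexer picks the non-negative one."""
    s0 = a - b
    s1 = s0 + p
    return s0 if s0 >= 0 else s1
```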
Modular Multiplication
The most straightforward algorithm to implement multiplication with subsequent NIST prime
reduction (cf. Section 3.3.2) is the schoolbook multiplication method with a time complexity
of O(n²) for n-bit inputs. Other methods, like the Karatsuba algorithm [KO63], trade
multiplications for additions using a divide-and-conquer approach. However, Karatsuba
multiplication, computing the product C = A × B for A = a1·2^n + a0 and B = b1·2^n + b0,
requires storing intermediate results derived from a1, a0 and b1, b0 for later reuse in the
algorithm. Although this is certainly possible, it requires a much more complex data and
memory handling and cannot be done solely within DSP blocks. Since many parts of a
Karatsuba multiplier would require the generic logic of CLBs, we would likely lose the
performance gain of the fast arithmetic in DSP blocks.
We thus use a variant of the schoolbook multiplication, known as Comba multiplication [Com90]
which combines carry handling and reduces write accesses to the memory. These optimiza-
tions result in improved performance with respect to the original pen-and-paper method. Let
A, B ∈ GF (P ) be two multi-precision integers with bit length l ≤ ⌊log2 P ⌋ + 1. According to
the limited input size lM of DSP blocks, we split now the values A, B in nM = ⌈l/lM ⌉ words
Figure 3.2: Parallelizing Comba's multiplication method for efficient DSP-based computation.
represented as X = Σ_{i=0}^{nM−1} xi · 2^(i·lM). Straightforward multiplication computes C = A · B based
on accumulation of the (nM)² partial products, C = Σ_{i=0}^{2nM} 2^(i·lM) Σ_{j=0}^{i} aj · b_{i−j}, providing a result C of size
|C| ≤ 2nM words. For parallel execution on nM DSP units, we compacted the order of inner product
computations used for Comba's algorithm as shown in Figure 3.2. All nM DSP blocks oper-
ate in a loadable Multiply-and-Accumulate mode (MACC) so that intermediate results remain
in the corresponding DSP block until an inner product si = Σ_{j=0}^{i} aj · b_{i−j} is fully computed.
Note that si returned from the nM DSP blocks are not aligned and can vary in size up to
|si| ≤ 2lM + log2(nM) = lACC = 36 bits. Thus, all si need to be converted to a non-redundant
representation to form the final product of words ci with a maximum size of 2lM bits each. Hence,
we feed all values into a subsequent accumulator which combines each si with the corresponding
bits of si−1 and si+1 . Considering the special input constraints, timing conventions and carry
transitions of DSP blocks, we developed Algorithm 3.4 to address the accumulation of inner
products based on two DSP blocks performing lACC -bit additions.
Figure 3.3 gives a schematic overview of the multiplication circuit returning the full-size
product C. This result has to be reduced using the fast NIST prime reduction scheme discussed
in the next section.
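The product-scanning order described above can be modeled in software as follows. This is a Python sketch, with a hypothetical word width w standing in for lM and n for nM; the integer acc plays the role of the MACC accumulators and the subsequent accumulator unit:

```python
def comba_mult(a, b, w=17, n=4):
    """Comba (product-scanning) multiplication: for each result column i the
    inner product s_i = sum_j a_j * b_(i-j) is fully accumulated before a
    single non-redundant result word c_i is emitted."""
    mask = (1 << w) - 1
    aw = [(a >> (w * i)) & mask for i in range(n)]
    bw = [(b >> (w * i)) & mask for i in range(n)]
    c, acc = [], 0
    for i in range(2 * n - 1):                     # one column per iteration
        for j in range(max(0, i - n + 1), min(i, n - 1) + 1):
            acc += aw[j] * bw[i - j]               # MACC: accumulate products
        c.append(acc & mask)                       # emit one result word
        acc >>= w                                  # carry into the next column
    c.append(acc)                                  # top word of the product
    return sum(word << (w * i) for i, word in enumerate(c))
```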
Modular Reduction
At this point we discuss the subsequent modular reduction of the 2nM-word multiplication
result C using the NIST reduction scheme. All fast NIST reduction algorithms rely on a
reduction step (1), defined as a series of multi-precision additions and subtractions, followed by a
correction step (2) to achieve a final value in the interval [0, . . . , P − 1] (cf. Algorithms 3.1
and 3.2). To implement (1), we decided to use one DSP-block for each individual addition or
subtraction, e.g., for the P-256 reduction we reserved a cascade of 8 DSP blocks. Each DSP
performs one addition or subtraction and stores the result in a register whose output is taken
as input to the neighboring block (data pipeline).
For the correction step (2), we need to determine in advance the possible overflow or underflow
of the result returned by (1) to avoid wait or idle cycles in the pipeline. Hence, we introduced
a Look-Ahead Logic (LAL) consisting of a separate DSP block which exclusively computes the
expected overflow or underflow. The output of the LAL is then used to select a corresponding
reduction value, stored as one of the multiples {0, . . . , 5P } in a ROM table. The selected ROM
value is added to or subtracted from the result of (1) by a sequence of two DSP blocks, ensuring
that the final result is always in {0, . . . , P − 1}. Figure 3.4 depicts the general structure of the
reduction circuit, which is applicable to both primes P-224 and P-256.
Figure 3.3: An l-bit multiplication circuit employing a cascade of DSP blocks operating in parallel.
Figure 3.4: Modular reduction for NIST P-224 and P-256 using DSP blocks.
by a linear factor n dependent on the number of cores. Note that most other high-performance
implementations occupy the full FPGA due to their immense resource consumption so that
these cannot easily be instantiated several times.
Based on our synthesis results, the limiting factor of our architecture is the number of available
DSP blocks of a specific FPGA device (cf. Section 3.5).
3.5 Implementation
The proposed architecture has been synthesized and implemented for the smallest available
Xilinx Virtex-4 device (XC4VFX12-12SF363) and the corresponding results are presented in
Subsection 3.5.1. This FPGA offers 5,472 slices (12,288 4-input LUTs and flip-flops) of recon-
figurable logic, 32 DSP blocks and can be operated at a maximum clock frequency of 500 MHz.
Furthermore, to demonstrate how many ECC computations can be performed using ECC core
parallelism, we take a second device, the large Xilinx Virtex-4 XC4VSX55-12FF1148 providing
the maximum number of 512 DSP blocks and 24,576 slices (49,152 4-input LUTs and flip-flops)
as a reference for a multi-core architecture.
[Figure: Schematic of a single ECC core, comprising a dual-port RAM for the operand words a0 … an−1, a modular multiplier, a combined modular addition/subtraction unit, an FSM-based control logic and an output multiplexer.]
Given an ECC core with a separate adder/subtracter and multiplier unit, we can perform a
field multiplication and field addition simultaneously. By optimizing the execution order of
the basic field operations, it is possible to perform all additions/subtraction required for the
ECC group operation in parallel to a multiplication. Based on the runtimes of a single field
multiplication, we can determine the number of required clock cycles for the operations k · P
and k · P + r · Q using the implemented Double-and-Add algorithm. Moreover, we also give
Table 3.1: Resource requirements of a single ECC core on a Virtex-4 FX 12 after PAR. Note
the different clock domains for arithmetic (DSP) and control logic.
Table 3.2: Performance of ECC operations based on a single ECC core using projective Chud-
novsky coordinates on a Virtex-4 XC4VFX12 (figures with an asterisk are estimates).
estimates concerning their performance when using a window-based method [ACD+ 05] based
on a window size w = 4.
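The double-and-add schedule used for k · P can be sketched generically as follows. This is a Python model; point_double and point_add stand for implementations of Equations (3.2) and (3.1), and the point at infinity is represented by None:

```python
def double_and_add(k, P, point_double, point_add):
    """Left-to-right double-and-add: scan k from the most significant bit,
    doubling at every step and adding P whenever the bit is set."""
    R = None                          # None represents the point at infinity
    for bit in bin(k)[2:]:
        if R is not None:
            R = point_double(R)       # always double ...
        if bit == '1':
            R = P if R is None else point_add(R, P)   # ... add on a 1-bit
    return R
```

In the architecture described above, the field additions/subtractions inside the group operations run concurrently with the field multiplications on separate DSP-based units.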
Note that the specified timing considers signal propagation after complete PAR excluding the
timing constraints from I/O pins (“TIG” constraint) since no underlying data communication
layer was implemented. Hence, when being combined with an I/O protocol of a real-world
application, the clock frequency can be slightly lower than specified in Table 3.1 and Table 3.3.
3.5.4 Comparison
Based on our architecture, we can estimate a throughput of more than 37,000 point multiplica-
tions per second on the standardized elliptic curve P-224, which exceeds the throughput of all
single-chip hardware implementations known to the authors by far. A detailed comparison with
other implementations is presented in Table 3.5.4.
At this point we would like to point out that the field of highly efficient prime field arithmetic is be-
lieved to be dominated by implementations on general-purpose microprocessors rather than
Table 3.3: Results of a multi-core architecture on a Virtex-4 XC4VSX55 device for ECC over
prime fields P-224 and P-256 (Figures with an asterisk are estimates).
on FPGAs. Hence, we will also compare our hardware implementation against the performance
of software solutions on recent microprocessors. Since most performance figures for software
implementations are given in cycles rather than absolute times, we assumed for comparing
throughputs that uninterrupted, repeated computations can be performed simultaneously on
all available cores of a modern microprocessor with no further cycles spent, e.g., on scheduling
or other administrative tasks. Note that this is indeed a very optimistic assumption possibly
overrating the performance of software implementations with respect to actual applications.
For example, a point multiplication using the highly efficient software implementation by Bern-
stein based on floating-point arithmetic for ECC over P-224 requires 839,000 cycles on an
(outdated) Intel Pentium 4 [Ber01] at 1.4 GHz. According to our assumption for cycle count
interpretation, this corresponds to 1,670 point multiplications per second.
We also compare our design to more recent results, e.g., obtained from ECRYPT’s eBATS
project. According to the report from March 2007 [ECR07], an Intel Core2 Duo running at
2.13 GHz is able to generate 1868 and 1494 ECDSA signatures based on the OpenSSL imple-
mentation for P-224 and P-256, respectively. Taking latest Intel Core2 Quad microprocessors
into account, these performance figures might even double. We also compare our work to the
very fast software implementation by [GT07] using an Intel Core2 Duo system at 2.66 GHz.
However, this contribution uses the special Montgomery-form, non-standard curve over F_(2^255 − 19)
instead of a standardized NIST prime. Despite this, for the design based on this curve
the authors report an impressive throughput of 6,700 point multiplications per second.
For a fair comparison with software solutions it should be considered that a single Virtex-4
SX 55 costs about US$ 1,170 (market price for a single device in January 2009). Recent
microprocessors like the Intel Core2 Duo, however, are available at only about a quarter of
that price. With this in mind, we might not be able to beat
all software implementations in terms of the cost-performance ratio, but we would still like to point
out that our FPGA-based design - as the fastest reported hardware implementation so far -
definitely closes the performance gap between software and hardware implementations for ECC
over prime fields. Furthermore, we like to emphasize again that all software related performance
figures are based on very optimistic assumptions.
3.6 Conclusions
In this chapter, we presented novel ECC implementations in reconfigurable hardware for fields
over the NIST primes P-224 and P-256. Due to the exhaustive utilization of DSP blocks, which
are contained as hardcores in modern FPGA devices, we are able to run the critical components
computing low-level integer arithmetic operations nearly at maximum device frequency. Fur-
thermore, considering a multi-core architecture on a Virtex-4 XC4VSX55 FPGA, we can achieve
a throughput of more than 24,000 and 37,000 point multiplications per second for P-256 and
P-224, respectively, which significantly exceeds the performance of all other hardware implemen-
tations known to the authors and comes close to the cost-performance ratio provided by the
fastest available software implementations in the open literature.
Chapter 4
High-Performance Asymmetric Cryptography
with Graphics Cards
Modern Graphics Processing Units (GPU) have reached a dimension with respect to
performance and gate count that exceeds conventional Central Processing Units (CPU)
by far. Besides the CPU, many of today's computer systems already include such a
powerful GPU, which runs idle most of the time. Thus, it might be used as a cheap
and instantly available co-processor for general-purpose applications.
In this chapter, we focus on the efficient processing of computationally expensive op-
erations in asymmetric cryptosystems on the off-the-shelf NVIDIA 8800 GTS graph-
ics card by use of NVIDIA’s CUDA programming model. We present improved and
novel implementations employing GPUs as accelerator for RSA and DSA cryptosys-
tems as well as for ECC.
4.1 Motivation
For the last twenty years graphics hardware manufacturers have focused on producing fast
Graphics Processing Units (GPUs), specifically for the gaming community. This has more
recently led to devices which outperform general purpose Central Processing Units (CPUs) for
specific applications, particularly when comparing the MIPS (million instructions per second)
benchmarks. Hence, a research community has been established to use the immense power of
GPUs for general-purpose computations (GPGPU). In the last two years, prior limitations of the
graphics application programming interfaces (API) have been removed by GPU manufacturers
by introducing unified processing units in graphics cards. These support a general-purpose
instruction set through a native driver interface and framework.
In the field of asymmetric cryptography, the security of all practical cryptosystems relies on
hard computational problems and depends strongly on the choice of parameters. With rising
parameter sizes (often in the range of 1024–4096 bits), however, computations become more
and more challenging for the underlying processor. For modern hardware, the computation
of a single cryptographic operation is not critical; however, in a many-to-one communication
scenario, like a central server in a company's data processing center, a system may be confronted
with hundreds or thousands of simultaneous connections and corresponding cryptographic
operations. As a result, the most common current solution is the use of cryptographic accelerator
cards. Due to the limited market, their price tags are often in the range of several thousand
euros or US dollars.
The question at hand is whether commodity GPUs can be used as high-performance public-key
accelerators.
In this chapter, we will present novel implementations of cryptosystems based on modular
exponentiations and elliptic curve operations on recent graphics hardware. To the best of
our knowledge, this is the first publication making use of the CUDA framework for GPGPU
processing of asymmetric cryptosystems. We will start with implementing the extremely wide-
spread RSA cryptosystem [RSA78]. The same implementation based on modular exponentiation
for large integers can be used to implement the Digital Signature Algorithm (DSA) [Nat00]
which also has been adopted to elliptic curve groups in the ANSI X9.62 standard [ANS05]. The
implementation of this DSA variant for elliptic curves (ECDSA) is our second goal.
All presented results in this chapter originated from joint work with Robert Szerwinski [SG08].
4.3. General-Purpose Applications on GPUs
Figure 4.1: The memory and programming model for CUDA-based applications.
a group of blocks that have the same dimensionality and execute the same CUDA program logically
in parallel is called a grid.
To allow optimal performance for different access patterns, CUDA implements a hierarchical
memory model, contrasting to the flat model normally found in PCs. Host (PC) and device
(GPU) have their own memory areas, called host memory and device memory, respectively.
CUDA supplies optimized functions to transfer data between these separate spaces.
Each thread possesses its own register file, which can be read and written. Additionally, it can
access its own copy of so-called local memory. All threads in the same block can access the same
on-chip readable and writable shared memory region. To prevent hazards resulting from concurrent
execution of threads, synchronization mechanisms must be used. Shared memory is organized in
groups called banks that can be accessed in parallel. All threads can access a readable and
writable memory space called global memory as well as read-only regions called constant memory
and texture memory. Constant memory is optimized for one-dimensional locality of accesses,
while texture memory is most effective when used with two-dimensional arrays (matrices). Note
that the texture and constant memories are the only regions that are cached. Thus, all accesses
to the off-chip global and local memory regions have a high access latency, resulting in penalties
when they are used too frequently. The programming and memory model of CUDA is depicted
in Figure 4.1.
The hardware consists of a number of so-called multiprocessors that are built from SIMD
processors, on-chip memory and caches. Each processor executes a particular thread, with all
threads of the same warp being run on the multiprocessor at the same time. One or more blocks are
mapped to each multiprocessor, sharing its resources (registers and shared memory), and get
executed on a time-sliced basis. When a particular block has finished its execution, the scheduler
starts the next block of the grid until all blocks have been run.
To achieve optimal performance using CUDA, algorithms must be designed to run in a multi-
tude of parallel threads and take advantage of the presented hierarchical memory model. Next,
we enumerate the key criteria necessary for getting the most out of the GPU, following the
CUDA programming guide [CUD07] and a talk given by Mark Harris of NVIDIA [Har07].
CUDA Limitations
Although CUDA programs are written in the C language, together with extensions that support
the memory model, allow synchronization, and provide special intrinsics to access faster assembler
instructions, the framework also imposes a number of limitations that negatively affect the efficient
implementation of public-key cryptography primitives. Examples are the lack of
additions/subtractions with carry as well as the missing support for inline assembler instructions¹.
4.4. Modular Arithmetic on GPUs

Modular Multiplication
corresponding power of the radix used by the machine. A possible configuration could be
R = 2^{wn} for w-bit registers and n words per multi-precision integer.
To use the reduction algorithm, we first need to convert the parameters (say a ∈ Z_M) to
Montgomery form by computing A = aR mod M. Furthermore, we need to determine an
auxiliary value M′ = −M^{−1} mod R, 0 ≤ M′ < R, which is a precomputed constant for a given
tuple (M, R). Based on these preliminaries, given an X < RM in Montgomery form, we can use
Algorithm 4.1 to compute XR^{−1} mod M very efficiently by constructing an integer v = X + fM
that is a multiple of R. This, in turn, allows us to simply divide by R to compute the desired
result x = XR^{−1} mod M.
More precisely, Algorithm 4.1 first computes a factor f such that the multiple fM of M added
to the input value X yields a value v that is also a multiple of R, i.e.,
X + fM ≡ 0 mod R, while not changing the congruency modulo M (X + fM ≡ X mod M).
Such a value can be found by computing f = XM′ mod R and leads to

    v/R = (X + fM)/R < (RM + RM)/R = 2M,    (4.2)

so that we obtain x = XR^{−1} mod M with 0 ≤ x < M, as desired, after a final reduction. Algorithm
4.1 can also be used to transform parameters between standard and Montgomery representation.
For a given integer x in standard representation, X = MonRed(xR²) converts it to
the Montgomery domain and x = MonRed(X) back to the standard domain.
Chapter 4. High-Performance Asymmetric Cryptography with Graphics Cards
Koç et al. [cKKAK96] provide a survey of different implementation options for this algorithm
combined with multi-precision multiplication X = A × B. As all these multi-precision algorithms
feature no inherent parallelism except the possibility to pipeline, we do not consider
them optimal for our platform and implement the method with the lowest temporary space
requirement of n + 2 words, Coarsely Integrated Operand Scanning (CIOS), as a reference solution
only (cf. Algorithm 4.2).
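Algorithm 4.2 is not reproduced here; the following plain-Python sketch of the CIOS method (after Koç et al., with w = 16 bit words and illustrative parameters of our choosing) shows the word-level flow and the n + 2 temporary words:

```python
w = 16
mask = (1 << w) - 1

def cios_mont_mul(a, b, m, n):
    """CIOS Montgomery multiplication: returns a*b*R^-1 mod m, R = 2^(w*n).
    a, b, m are little-endian lists of n w-bit words; m must be odd."""
    m0_inv = (-pow(m[0], -1, 1 << w)) % (1 << w)   # -m^-1 mod 2^w
    t = [0] * (n + 2)                              # n + 2 temporary words
    for i in range(n):
        c = 0
        for j in range(n):                         # t += a * b[i]
            s = t[j] + a[j] * b[i] + c
            t[j], c = s & mask, s >> w
        s = t[n] + c
        t[n], t[n + 1] = s & mask, s >> w
        q = (t[0] * m0_inv) & mask                 # quotient digit
        s = t[0] + q * m[0]                        # low word becomes zero
        c = s >> w
        for j in range(1, n):                      # t = (t + q*m) / 2^w
            s = t[j] + q * m[j] + c
            t[j - 1], c = s & mask, s >> w
        s = t[n] + c
        t[n - 1], c = s & mask, s >> w
        t[n] = t[n + 1] + c
        t[n + 1] = 0
    res = sum(t[j] << (w * j) for j in range(n + 1))
    M = sum(m[j] << (w * j) for j in range(n))
    return res - M if res >= M else res            # conditional subtraction

# self-check against plain big-integer arithmetic
n = 4
M = (1 << 63) - 25                                 # odd, illustrative
a, b = 0x123456789ABCDEF0 % M, 0x0FEDCBA987654321 % M
aw = [(a >> (w * i)) & mask for i in range(n)]
bw = [(b >> (w * i)) & mask for i in range(n)]
mw = [(M >> (w * i)) & mask for i in range(n)]
R_inv = pow(1 << (w * n), -1, M)
assert cios_mont_mul(aw, bw, mw, n) == a * b * R_inv % M
```

Note the strictly sequential dependency between the inner loops, which is exactly the lack of parallelism criticized above.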
Here, the ordered set of relatively prime moduli (m_0, m_1, . . . , m_{n−1}), gcd(m_i, m_j) = 1 for all
i ≠ j, is called the base and denoted by A. The product of all moduli, A = ∏_{i=0}^{n−1} m_i, is called the
dynamic range of A, i.e., the number of values that can be uniquely represented in A. In other
words, all numbers in A get implicitly reduced modulo A. Such a representation in RNS has
the advantage that addition, subtraction and multiplication can be computed independently for
all residues:

    ⟨x⟩_A ◦ ⟨y⟩_A = ⟨|x_0 ◦ y_0|_{m_0}, |x_1 ◦ y_1|_{m_1}, . . . , |x_{n−1} ◦ y_{n−1}|_{m_{n−1}}⟩_A,  ◦ ∈ {+, −, ·}    (4.4)

which allows carry-free computations² and multiplication without partial products. However,
some information involving the whole number x cannot be easily computed. For instance, sign
and overflow detection and comparison of magnitude are hard, resulting from the fact that
residue number systems are not a weighted representation. Furthermore, division, and as a result
reduction modulo an arbitrary modulus M ≠ A, is not as easy as in other representations.
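Equation (4.4) amounts to the following (a Python sketch with small illustrative moduli of our choosing):

```python
import operator

m = [13, 17, 19, 23]          # base A: pairwise coprime moduli (illustrative)
A = 13 * 17 * 19 * 23         # dynamic range

def to_rns(x):
    return [x % mi for mi in m]

def rns_op(xs, ys, op):
    """Component-wise RNS operation: no carry ever crosses a residue channel."""
    return [op(xi, yi) % mi for xi, yi, mi in zip(xs, ys, m)]

x, y = 1234, 5678
for op in (operator.add, operator.sub, operator.mul):
    # results agree with ordinary arithmetic reduced modulo A
    assert rns_op(to_rns(x), to_rns(y), op) == to_rns(op(x, y) % A)
```

Each residue channel could run in its own thread; this independence is what the GPU mapping below exploits.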
However, similar to the basic idea of Montgomery multiplication, one can create a modular
multiplication method for input values in RNS representation, as shown in Algorithm 4.3, which
involves a second base B = (m̃_0, m̃_1, . . . , m̃_{n−1}) with corresponding dynamic range B. It
computes a value v = XY + fM that is congruent to 0 mod A and to XY mod M. Thus, we can safely
divide by A, i.e., multiply by its inverse modulo B, to compute the output XY·A^{−1} mod M.
Note that the reduction modulo A needed to compute f is free in A.
Algorithm 4.3 Modular Multiplication Algorithm for Residue Number Systems [KKSS00]
Input: Modulus M, two RNS bases A and B composed of n distinct moduli m_i each,
gcd(A, B) = gcd(A, M) = 1 and B > A > 4M.
Two factors X and Y, 0 ≤ X, Y < 2M, encoded in both bases and in Montgomery form,
i.e., ⟨X⟩_A, ⟨X⟩_B and ⟨Y⟩_A, ⟨Y⟩_B, X = xA mod M and Y = yA mod M.
Output: The product C = XY·A^{−1} mod M, 0 ≤ C < 2M, in both bases and in Montgomery
form.
1: ⟨u⟩_A ← ⟨X⟩_A · ⟨Y⟩_A and ⟨u⟩_B ← ⟨X⟩_B · ⟨Y⟩_B
2: ⟨f⟩_A ← ⟨u⟩_A · ⟨−M^{−1}⟩_A
3: ⟨f⟩_B ← BaseExtend(⟨f⟩_A)
4: ⟨v⟩_B ← ⟨u⟩_B + ⟨f⟩_B · ⟨M⟩_B   {⟨v⟩_A = 0 by construction}
5: ⟨w⟩_B ← ⟨v⟩_B · ⟨A^{−1}⟩_B
6: ⟨w⟩_A ← BaseExtend(⟨w⟩_B)
7: return ⟨w⟩_A and ⟨w⟩_B
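A plain-Python model of Algorithm 4.3 illustrates the data flow; the small moduli are illustrative, and an exact CRT reconstruction stands in for the BaseExtend step discussed below:

```python
from math import prod

mA = [11, 13, 17, 19]          # base A (illustrative)
mB = [23, 29, 31, 37]          # base B, gcd(A, B) = 1
A, B = prod(mA), prod(mB)      # dynamic ranges, B > A
M = 10007                      # modulus with A > 4*M

def to_rns(x, base):
    return [x % m for m in base]

def base_extend(res, src, dst):
    """Exact base extension via full CRT reconstruction (a stand-in for the
    MRS/CRT base extension mechanisms discussed in the text)."""
    x, S = 0, prod(src)
    for r, m in zip(res, src):
        Mi = S // m
        x = (x + r * Mi * pow(Mi, -1, m)) % S
    return to_rns(x, dst)

def rns_mont_mul(XA, XB, YA, YB):
    """Algorithm 4.3: returns <w>_A, <w>_B with w = X*Y*A^-1 mod M, w < 2M."""
    uA = [x * y % m for x, y, m in zip(XA, YA, mA)]
    uB = [x * y % m for x, y, m in zip(XB, YB, mB)]
    fA = [u * pow(-M, -1, m) % m for u, m in zip(uA, mA)]   # f = -u*M^-1 in A
    fB = base_extend(fA, mA, mB)                            # first extension
    vB = [(u + f * M) % m for u, f, m in zip(uB, fB, mB)]   # v = u + f*M
    wB = [v * pow(A, -1, m) % m for v, m in zip(vB, mB)]    # w = v / A in B
    wA = base_extend(wB, mB, mA)                            # second extension
    return wA, wB

X, Y = 12345, 16789            # both < 2M
wA, wB = rns_mont_mul(to_rns(X, mA), to_rns(X, mB), to_rns(Y, mA), to_rns(Y, mB))
w = X * Y * pow(A, -1, M) % M
# the output is congruent to X*Y*A^-1 mod M and lies below 2M
assert wA == to_rns(w, mA) or wA == to_rns(w + M, mA)
```

Every list comprehension above maps naturally to n independent threads, one per residue channel.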
All steps of the algorithm can be efficiently computed in parallel. However, a method to
convert between both bases, a base extension mechanism, is required. We take four different
options into account: the method based on a Mixed Radix System (MRS) according to Szabó
and Tanaka [ST67], as well as the CRT-based methods due to Shenoy and Kumaresan [SK89],
Kawamura et al. [KKSS00] and Bajard et al. [BDK01]. We give a brief introduction to
these methods; for more detailed information about base extensions, please see the recent
survey in [BP04].
² Note that inner-RNS operations still contain carries.
The MRS digits x′_i can be derived from the residues x_i by a recursive strategy:

    x′_0 = x_0,
    x′_i = |(. . . ((x_i − x′_0) m^{−1}_{(i,0)} − x′_1) m^{−1}_{(i,1)} − · · · − x′_{i−1}) m^{−1}_{(i,i−1)}|_{m_i},    (4.5)

where m^{−1}_{(i,j)} are the precomputed inverses of m_j modulo m_i. The digits represent x in
the weighted mixed radix form

    x = x′_{n−1} ∏_{l=0}^{n−2} m_l + · · · + x′_2 m_1 m_0 + x′_1 m_0 + x′_0.    (4.6)

To convert x from this representation to a target RNS base, we could reduce Equation (4.6)
by each target modulus m̃_k, involving precomputed constants c̃_{(k,i)} = |∏_{l=0}^{i−1} m_l|_{m̃_k}. But
instead of creating a table for all c̃_{(k,i)}, a recursive (Horner-like) approach is more efficient in our
situation, eliminating the need for table look-ups [BMP05], and allowing all residues in the
target base to be computed in parallel:

    |x|_{m̃_k} = |(. . . ((x′_{n−1} m_{n−2} + x′_{n−2}) m_{n−3} + x′_{n−3}) m_{n−4} + · · · + x′_1) m_0 + x′_0|_{m̃_k}.    (4.7)
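In Python, the digit recursion and the Horner-style evaluation modulo a target modulus look as follows (small illustrative moduli; names are ours):

```python
m = [11, 13, 17, 19]                  # source base (illustrative)

def mrs_digits(res):
    """Mixed radix digits x'_i from residues x_i (Szabo-Tanaka recursion).
    Inherently sequential: digit i depends on all digits j < i."""
    d = []
    for i, (xi, mi) in enumerate(zip(res, m)):
        t = xi
        for j in range(i):
            t = (t - d[j]) * pow(m[j], -1, mi) % mi
        d.append(t)
    return d

def extend_to(d, mt):
    """Residue of x modulo a target modulus via Horner evaluation (Eq. 4.7)."""
    acc = d[-1]
    for k in range(len(m) - 2, -1, -1):
        acc = (acc * m[k] + d[k]) % mt
    return acc

x = 30000                             # any x below the dynamic range
d = mrs_digits([x % mi for mi in m])
# the digits reproduce x in the weighted mixed radix form
assert d[0] + d[1]*11 + d[2]*11*13 + d[3]*11*13*17 == x
assert extend_to(d, 23) == x % 23     # residue for one target-base modulus
```

The double loop in `mrs_digits` is the divergent, serializing part criticized in the implementation section; `extend_to` can run in one thread per target modulus.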
The CRT-based conversion techniques build on the Chinese Remainder Theorem: a number x
with residues

    x ≡ x_0 mod m_0
    x ≡ x_1 mod m_1
    ...
    x ≡ x_{n−1} mod m_{n−1}

is uniquely determined modulo M = ∏_{i=0}^{n−1} m_i and can be recovered as

    x = |∑_{i=0}^{n−1} |x_i / M̂_i|_{m_i} · M̂_i|_M    (4.8)

with M̂_i = M/m_i. An equivalent version not including the reduction modulo M is

    x = ∑_{i=0}^{n−1} |x_i / M̂_i|_{m_i} · M̂_i − αM    (4.9)
In our setting this is applied to the base A, i.e., x = ∑_{k=0}^{n−1} |x_k / Â_k|_{m_k} · Â_k − αA,
where Â_k = A/m_k and α is an integer so that 0 ≤ x < A. Note that α is strictly upper-bounded
by n. When reducing this equation by an arbitrary target modulus, say m̃_i, we obtain

    |x|_{m̃_i} = |∑_{k=0}^{n−1} |Â_k|_{m̃_i} · δ_k − |αA|_{m̃_i}|_{m̃_i},   δ_k = |x_k · |Â_k^{−1}|_{m_k}|_{m_k},    (4.11)

where |Â_k|_{m̃_i}, |Â_k^{−1}|_{m_k} and |A|_{m̃_i} are precomputed constants. Note that the δ_k do not depend
on the target modulus and can thus be reused in the computation of a different target residue.
This is an efficient way to compute all residues modulo the target base, provided we know
the value of α. While involving a couple of look-ups for the constants as well, the instruction
flow is highly uniform (cf. Criterion A4) and fits our SIMD architecture, i.e., we can use n
threads to compute the n residues of x in the target base in parallel (cf. Criterion A1).
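The following sketch evaluates Equation (4.11) for one target modulus, with the exact α obtained for reference via rational arithmetic (small illustrative moduli, names are ours):

```python
from fractions import Fraction
from math import prod

m = [11, 13, 17, 19]                       # source base A (illustrative)
A = prod(m)

def deltas(res):
    """delta_k = |x_k * |A_k^-1|_{m_k}|_{m_k} with A_k = A/m_k."""
    return [x * pow(A // mk, -1, mk) % mk for x, mk in zip(res, m)]

x = 30000                                  # 0 <= x < A
d = deltas([x % mk for mk in m])
# exact alpha: the integer part of sum(delta_k / m_k)
alpha = int(sum(Fraction(dk, mk) for dk, mk in zip(d, m)))
assert alpha < len(m)                      # alpha is bounded by n

m_t = 23                                   # one target modulus
res_t = (sum((A // mk) * dk for dk, mk in zip(d, m)) - alpha * A) % m_t
assert res_t == x % m_t                    # Eq. 4.11 reproduces the residue
```

On the GPU, the |Â_k|_{m̃_i} factors would be precomputed constants; here they are recomputed for brevity.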
The first technique to compute such an α is due to Shenoy and Kumaresan [SK89] and
requires a redundant modulus m_r ≥ n that is relatively prime to all other moduli m_j and m̃_i,
i.e., gcd(A, m_r) = gcd(B, m_r) = 1. Consider Equation 4.11, set m̃_i = m_r and rearrange it to
the following:

    |α|_{m_r} = ||A^{−1}|_{m_r} · (∑_{k=0}^{n−1} |Â_k|_{m_r} · δ_k − |x|_{m_r})|_{m_r}.    (4.12)

Since α < n ≤ m_r it holds that α = |α|_{m_r}, and thus Equation 4.12 computes the exact value of
α, involving the additional constant |A^{−1}|_{m_r}.
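Shenoy and Kumaresan's exact determination of α through the redundant modulus can be modeled directly after Equation (4.12); the small parameters are illustrative:

```python
from math import prod

m = [11, 13, 17, 19]                 # source base A (illustrative)
A = prod(m)
m_r = 7                              # redundant modulus: m_r >= n, gcd(A, m_r) = 1

x = 30000                            # 0 <= x < A
d = [(x % mk) * pow(A // mk, -1, mk) % mk for mk in m]   # delta_k

# Eq. 4.12: |alpha|_{m_r} = |A^-1|_{m_r} * (sum |A_k|_{m_r} delta_k - |x|_{m_r}) mod m_r
s = sum((A // mk % m_r) * dk for dk, mk in zip(d, m))
alpha = pow(A, -1, m_r) * (s - x % m_r) % m_r

# since alpha < n <= m_r, the value is exact: x = sum delta_k*A_k - alpha*A
assert sum((A // mk) * dk for dk, mk in zip(d, m)) - alpha * A == x
```

Note that the redundant residue |x|_{m_r} must be carried through the computation, which is why this method only fits the second base extension.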
Kawamura et al. propose a different technique that approximates α using fixed-point
computations [KKSS00]. Consider Equation 4.11, rearrange it and divide by A:

    α = ∑_{k=0}^{n−1} δ_k/m_k − x/A = ⌊∑_{k=0}^{n−1} δ_k/m_k⌋.    (4.13)

Next, they approximate α by using trunc_r(δ_k) as numerator and 2^w as denominator and adding
a properly chosen offset σ, where trunc_r(δ_k) sets the last w − r bits of δ_k to zero:

    α′ = ⌊∑_{k=0}^{n−1} trunc_r(δ_k)/2^w + σ⌋ = ⌊(1/2^r) ∑_{k=0}^{n−1} ⌊δ_k/2^{w−r}⌋ + σ⌋.    (4.14)

Thus, the approximate value α′ can be computed in fixed-point arithmetic as the integer part of
the sum of the r most significant bits of all δ_k. Provided σ is chosen correctly, Equation 4.14
will compute α′ = α, and the resulting base extension will be exact.
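Kawamura et al.'s approximation can be modeled as follows; the moduli are chosen close to 2^w so that the approximation error stays below the offset σ (all parameters are illustrative choices of ours):

```python
from fractions import Fraction

w, r = 16, 8
m = [65521, 65519, 65497, 65479]     # pairwise coprime moduli close to 2^w
A = 65521 * 65519 * 65497 * 65479

x = 123456789                        # 0 <= x < A
d = [(x % mk) * pow(A // mk, -1, mk) % mk for mk in m]   # delta_k

# exact alpha per Eq. 4.13, using rational arithmetic
alpha = int(sum(Fraction(dk, mk) for dk, mk in zip(d, m)))
assert sum((A // mk) * dk for dk, mk in zip(d, m)) - alpha * A == x

# fixed-point approximation per Eq. 4.14 with sigma = 1/2:
# keep the r most significant bits of each delta_k, add sigma, take the
# integer part of the sum.
alpha_apx = (sum(dk >> (w - r) for dk in d) + (1 << (r - 1))) >> r
assert alpha_apx in (alpha, alpha + 1)   # off by at most one for these parameters
```

With the error bound n(ε + 2^−r) below σ, as here, the approximation is exact for all x small enough relative to A; the test only asserts the weaker off-by-one bound.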
Finally, Bajard et al. follow the most radical approach possible [BDK01]: they allow an offset
of αA ≤ (n − 1)A to occur in Equation 4.11 and thus do not need to compute α at all. After the
first base extension we have f′ = f + αA and thus w′ = w + αM, i.e., the result w′ will contain
a maximum offset of (n − 1)M and thus still be congruent to w mod M. However, this technique
needs additional precautions in the multiplication algorithm, which predominantly manifest in
the larger dynamic ranges needed.
4.5 Implementation
In this section we will describe the implementation of two primitive operations for a variety
of cryptosystems: first, we realize modular exponentiation on the GPU for use with RSA,
DSA and similar systems. Second, for ECC-based cryptosystems we present an efficient point
multiplication method which is the fundamental operation, e.g., for ECDSA or ECDH [HMV04].
one global memory access. Hence, for a given set of values (A, B, C, . . .), each consisting of n + 2
words X = (x_0, x_1, . . . , x_{n+1}), we store all first words (a_0, b_0, c_0, . . .) for all threads in the same
block, then all second words (a_1, b_1, c_1, . . .), and so on (cf. Criterion B4).
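The word-interleaved layout can be pictured with the following Python model of the address computation (the function name is ours):

```python
def interleave(operands):
    """Store word j of every thread's operand before word j+1 of any operand:
    flat[j * T + t] = operands[t][j], so the T threads of a block read
    consecutive addresses when they all access their word j simultaneously."""
    T, W = len(operands), len(operands[0])
    flat = [0] * (T * W)
    for t, words in enumerate(operands):
        for j, word in enumerate(words):
            flat[j * T + t] = word
    return flat

ops = [[10, 11], [20, 21], [30, 31]]   # 3 threads, 2 words each
assert interleave(ops) == [10, 20, 30, 11, 21, 31]
```

A thread reading its word j thus touches address j·T + t, and neighboring threads touch neighboring addresses, which is the pattern that global memory accesses can coalesce.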
Moreover, we have to use nailing techniques, as CUDA does not yet include add-with-carry
instructions. Roughly speaking, nailing reserves one or more of the high-order bits of each
word for the carry that can occur when adding two numbers. To save register and memory
space, however, we store the full word of w bits per register and use bit shifts and AND-masking
to extract two nibbles, each providing sufficient bits for the carry (cf. Criterion A3). Here, we
decompose a 32 bit addition into two 16 bit additions, including some overhead for carry handling.
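Emulating an add-with-carry by splitting a 32 bit addition into two 16 bit halves can be sketched like this (a Python model of the word-level arithmetic, not CUDA code):

```python
MASK16 = 0xFFFF

def add32_with_carry(a, b, cin=0):
    """Add two 32 bit words without an add-with-carry instruction:
    the 16 bit halves leave headroom so each partial sum, including the
    incoming carry, still fits into a 32 bit register."""
    lo = (a & MASK16) + (b & MASK16) + cin
    hi = (a >> 16) + (b >> 16) + (lo >> 16)        # propagate low-half carry
    return ((hi & MASK16) << 16) | (lo & MASK16), hi >> 16

s, c = add32_with_carry(0xFFFFFFFF, 0x00000001)
assert (s, c) == (0, 1)                            # wrap-around with carry out
s, c = add32_with_carry(0x80000000, 0x80000000, 1)
assert (s, c) == (1, 1)
```

The extra shifts and masks are exactly the "overhead for carry handling" mentioned above.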
Computations in residue number systems have the advantage of being inherently parallel.
According to Algorithm 4.3, all steps are computed in one base only, except for the first
multiplication. Thus, the optimal mapping of computations to threads is as follows: each thread
determines values for one modulus in each of the two bases. As a result, we have coarse-grained
(different exponentiations) and fine-grained parallelism (base size), satisfying Criterion A1. We call
n′ the number of residues that can be computed in parallel, i.e., the number of threads per
encryption. The base extension by Shenoy et al. needs a redundant residue starting from the
first base extension to be able to compute the second base extension. To reflect this fact, we
use two RNS bases A and B, having n moduli each, and an additional residue m_r, resulting in
n′ = n + 1. For all other cases, it holds that n′ = n.
Considering the optimal number of bits per modulus, we are faced with w = 32 bit integer
registers on the target hardware. Thus, to avoid multi-precision techniques, we can use moduli
that are smaller than 2^w. The hardware can compute 24 bit multiplications faster than full 32 bit
multiplications; however, CUDA does not expose an intrinsic to compute the most-significant
16 bits of such a product. Using 16 bit moduli, in turn, would waste registers and memory and
increase the number of memory accesses as well. Thus, we prefer full 32 bit moduli to save
storage resources at the expense of higher computational cost (cf. Criteria A2 and A3).
For Algorithm 4.3 to work, the dynamic ranges A and B and the modulus M have to be
related according to B > A > 2²M, or B > A > (2 + n)²M when using Bajard's method. For
performance reasons, we consider full warps of 32 threads only, resulting in a slightly reduced
size of M. The figures for all possible combinations can be found in Table A.2 in the appendix.
For input and output values, we assume that all initial values have already been converted to
both bases (and possibly the redundant modulus m_r) and that output values are returned in
the same encoding. Note that it would be sufficient to transfer values in one base only and do a
base extension for all input values (cf. Criterion B1); transferring values in both bases results in a
more compact kernel together with a slightly higher latency. Different from the CIOS method,
temporary values can be kept local to each thread, i.e., every thread stores its assigned residues
in registers. In principle, all operations can be performed in parallel on different residues and –
as a result – the plain multiplication algorithm does not need any synchronization. However,
neither property holds for the base extension algorithms.
Recall that the mixed radix conversion first computes the mixed radix representation from all
residues in the source base and then uses this value to compute the target residues. The second
step involves the computation of n′ residues and can be executed in parallel, i.e., each thread
computes the residue for its corresponding modulus. As a result, we have to store the n MRS
digits in shared memory to make them accessible to all threads (cf. Criteria A1 and B2). The
first step, however, is the main caveat of this algorithm due to its highly divergent nature: each
MRS digit is derived from the residue of a temporary variable in a different modulus (and thus
thread) and depends on all previously computed digits. This clearly contradicts Criterion A4
and results in serialization of execution. Additionally, note that threads that have already
computed an MRS digit do not generate any useful output anymore.
CRT-based Conversion
The first step of all CRT-based techniques is to compute δ_k for each source modulus, which can
be carried out by one thread per value. Second, all n′ threads compute a weighted sum
involving the δ_k and a modulus-dependent constant. Note that all threads need to access all δ_k,
and thus the δ_k have to be stored in shared memory (cf. Criterion B2). Third, α has to be derived,
whose computation is the main difference between the distinguished techniques. α is needed by all
threads later and thus needs to be stored in shared memory as well. After computing α, all
threads can proceed with their independent computations.
Bajard’s method does not compute α and consequently needs no further operations. For
Shenoy’s method, the second step above is needed for the redundant modulus m_r as well, which
can be done in parallel with all other moduli. Then, a single thread computes α and writes it to
shared memory. The redundant residue m_r comes at the price of an additional thread; however,
the divergent part required to compute α contains only one addition and one multiplication
modulo m_r. Kawamura’s method needs to compute the sum of the r most significant bits of
all δ_k. While the right-shift of each δ_k can be done using all threads, the sum over all shifted
values and the offset has to be computed by a single thread. A final right-shift yields the
integer part of the sum, namely α.
Clearly, Bajard’s method is the fastest since it involves no computation of α at all. Shenoy’s
method involves only a small divergent part; however, we pay the price of an additional thread
for the redundant modulus, or equivalently a decreased size of M. Kawamura’s technique has a
slightly larger divergent part, but it neither includes look-ups nor further reduces the size of M.
Table 4.1: Available combinations of base extension methods for the first (A → B) and second
(B → A) base extension (•: available and efficient, ◦: not applicable).

                        A → B
  B → A          MRC (M)  Shenoy (S)  Kawamura (K)  Bajard (B)
  MRC (M)           •         ◦            ◦            •
  Shenoy (S)        •         ◦            ◦            •
  Kawamura (K)      •         ◦            ◦            •
  Bajard (B)        ◦         ◦            ◦            ◦
Not all base extension mechanisms can be used for both directions required in Algorithm 4.3.
For Bajard’s method, consider the consequence of an offset in the second base extension: we
would compute some w′′ in base A that is not equal to the w′ in B. As a result, neither
⟨w′⟩_A nor ⟨w′′⟩_B could be computed, leading to an invalid input for a subsequent execution of
Algorithm 4.3. Thus, their method is only available for A → B conversions. Shenoy’s method
can only be used for the second base extension, as there is no efficient way to carry the redundant
residue through the computation of f modulo A. The technique by Kawamura et al. would in
principle be available for both conversions; however, the sizes of both bases would have to differ
to allow proper reduction in the A → B case, so we exclude this option from our consideration.
Table 4.1 shows the available and efficient combinations.
Table 4.2: Results for different Base Extension Techniques (RNS Method).
a solution that uses global memory for temporary values instead. Note that the left-to-right
binary method for point multiplication demands only one temporary point. However, for the
sake of a homogeneous flow of instructions we compute both possible solutions per scalar bit and
use a small divergent section to decide which of them is the desired result (cf. Criterion A4).
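The compute-both-then-select pattern can be sketched with the analogous left-to-right square-and-multiply exponentiation (a Python model of the control flow, not the point arithmetic itself):

```python
def ltr_exp_uniform(g, e, M):
    """Left-to-right exponentiation with a homogeneous instruction flow:
    both candidate results are computed for every exponent bit, and only a
    small selection decides which one is kept (mirroring the treatment of
    the two possible solutions per scalar bit in point multiplication)."""
    acc = 1
    for bit in bin(e)[2:]:
        acc = acc * acc % M               # square (resp. point doubling)
        cand = acc * g % M                # multiply (resp. addition), always done
        acc = cand if bit == '1' else acc # small divergent selection only
    return acc

assert ltr_exp_uniform(7, 123, 1009) == pow(7, 123, 1009)
```

Every thread thus executes the same sequence of arithmetic instructions regardless of the key bits, satisfying Criterion A4 at the cost of one redundant operation per bit.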
4.6 Conclusions
With the previously discussed implementations on GPUs at hand, we finally need to identify
the candidate providing the best performance for modular exponentiation.
Before presenting the benchmark results of the best algorithm combinations, we show our
results regarding the different base extension options for the RNS method. The benchmarking
scheme was the following: first, we did an exhaustive search over the numbers of registers per
thread that can principally be generated by the tool chain. Then, we benchmarked all available
execution configurations for these numbers of registers. To make the base extension algorithms
comparable, we would have to repeat this for all possible combinations, as shown in Table 4.1.
The results for the respective best configurations can be found in Table 4.2.
Clearly, the mixed radix based approach, also used in [MPS07], cannot compete with the CRT-based
solutions. The method of Kawamura et al. is slower than that of Shenoy et al., but performs
only slightly worse in the 2048 bit range. Figure 4.2 shows the time over the number of encryptions
for the four cases and the 1024 bit and 2048 bit ranges, respectively.
Both graphs show the characteristic behavior: depending on the number of blocks that are
started on the GPU and the respective execution configuration, we get stair-like graphs. Only
multiples of the number of warps per multiprocessor and the number of multiprocessors result
in optimal configurations that fully utilize the GPU. However, depending on the number of
registers per thread and the amount of shared memory used, other configurations are possible
and lead to smaller steps in between.
[Figure: two stair-like plots of time [ms] versus number of encryptions, comparing the base
extension combinations (MRC, MRC), (Bajard et al., MRC), (Bajard et al., Shenoy et al.) and
(Bajard et al., Kawamura et al.).]

Figure 4.2: Results for modular exponentiation with about 1024 (left) and 2048 bit (right)
moduli for different base extension methods, based on an NVIDIA 8800 GTS graphics card.
Optimized Implementations
Besides the reference implementation based on the CIOS algorithm, we selected as the best
choice the CRT-RNS method based on a combination of Bajard’s and Shenoy’s methods to
compute the first and second base extension of Algorithm 4.3, respectively.
The selection of the implementation was primarily motivated by achieving high throughput
rather than small latency. Hence, due to the latency, not all implementations might be
suitable for all practical applications. To reflect this, we present figures for data throughput
as well as the initial latency t_min required at the beginning of a computation. Note that our
results consider optimal configurations of warps per block and blocks per grid only. Table 4.3
shows the figures for modular exponentiation with 1024 and 2048 bit moduli and elliptic curve
point multiplication using NIST’s P-224 curve.
The throughput is determined as the number of encryptions divided by the elapsed time.
Note that this includes the initial latency t_min at the beginning of the computations. The
Table 4.3: Results for throughput and minimum latency t_min on an NVIDIA 8800 GTS graphics
card.
[Figure: two plots of time [ms] versus number of operations for modular exponentiation with
the CIOS method, modular exponentiation with the RNS method, and point multiplication on
NIST P-224.]

Figure 4.3: Results for modular exponentiation with about 1024 (left) and 2048 bit (right)
moduli and elliptic curve point multiplication on NIST’s P-224 curve, based on an NVIDIA
8800 GTS graphics card.
corresponding graphs are depicted in Figure 4.3. Note the relatively long plateau when using the
CIOS technique. It is a direct result of having coarse-grained parallelism only: the smallest
number of encryptions that can be processed is 128 times higher than for the RNS method.
Its high offset is due to storing temporary values in global memory: memory access latency is
hidden by scheduling independent computations, but the time needed to fetch/store the
first value in each group cannot be hidden.
Clearly, the CIOS method delivers the highest throughput at the price of a high initial
latency. For interactive applications such as online banking using TLS, this will be a major
obstacle. However, non-interactive applications like a certificate authority (CA) might benefit
from the raw throughput. Note that both applications will use the same secret key for all
digital signatures when using RSA; in the case of ECC (ECDSA), however, different exponents
were taken into account.
The residue number system based approach features only roughly half of the throughput
but provides a more immediate data response. Thus, this method seems to be suitable even
for interactive applications. Last but not least, elliptic curve cryptography clearly outperforms
the modular exponentiation based techniques, and not only due to the much smaller parameters.
With respect to the other hardware and software implementations compared against our results
in the next section, we present an ECC solution which outperforms most hardware devices and
comes close to the performance of recent dual-core microprocessors.
The authors present only the maximum throughput, which has been achieved at the cost of an
unspecified but high latency. Fleissner’s recent analysis of modular exponentiation on GPUs is
based on 192 bit moduli but relates the GPU performance solely to the CPU of his host system.
Costigan and Scott implemented modular exponentiation on IBM’s Cell platform, i.e., a Sony
Playstation 3 and an IBM MPM blade server, both running at 3.2 GHz [CS07]. We only quote
the best figures for the Playstation 3, as they call the results for the MPM blade preliminary. The
Playstation features one PowerPC core (PPU) and 6 Synergistic Processing Elements (SPUs).
Software results have been taken from ECRYPT’s eBATS project [ECR07]. Here, we picked
a recent Intel Core2 Duo with 2.13 GHz clock frequency. Since almost all software figures
relate to cycles, we assumed that repeated computations can be performed without interruption
on all available cores so that no further cycles are spent, e.g., on scheduling or other administrative
tasks of the operating system. Note that this might be an optimistic assumption, possibly
overrating the performance of microprocessors with respect to actual applications due to overhead
for scheduling and interrupt handling of the operating system (which is not required on
an FPGA). We also compare our work to the very fast software implementation of [GT07] on
an Intel Core2 system at 2.66 GHz, which however uses a special Montgomery-form, non-standard
curve over F_{2^255−19}. The comparison of our results with figures from the literature is shown in
Table 4.5.
Table 4.5: Comparison of our results with hardware and software implementations from the
literature. The rotated column heads of the original table are: ModExp-1024; ModExp-2048;
ModExp-1024, CRT; ModExp-2048, CRT; ECC PointMul-224; ECC PointMul-256.

Our Design   NVIDIA 8800GTS, CIOS       813.0    104.3
             NVIDIA 8800GTS, RNS        439.8     57.9
             NVIDIA 8800GTS, NIST-224  1412.6
[MPS07]      NVIDIA 7800GTX, RNS        175.4
[CS07]       Sony Playstation 3         909.2    401.4
[Men07]      Xilinx XC2VP30 FPGA        471.7   1724.1    235.8    440.5
[Suz07]      Xilinx XC4FX12 FPGA        584.8     79.4
[NMSK01]     0.25µm CMOS, 221k GE       238.1     34.2
[ECR07]      Intel Core2 2.13 GHz      1447.5    300.4   1868.5^a  1494.8^a
[GT07]       Intel Core2 2.66 GHz      6900^b

^a Performance for ECDSA including an additional modular inversion and multiplication operation.
^b Special elliptic curve in Montgomery form, non-compliant with the ECC standardized by NIST.
cations at a time. Note, however, that the curves standardized by ANSI and NIST cannot be
transformed to Edwards or Hessian form; hence, this technique can unfortunately not be exploited
when targeting industrial or governmental applications with requirements for standardized
cryptography.
A last aspect concerns the maturity of tools: the CUDA tool chain was an early version
that does not necessarily create optimal compilation results. Hence, the results could probably
be improved with later versions of the tool chain.
Part II
5.1 Motivation
The security of symmetric and asymmetric ciphers is usually determined by the size of their
security parameters, in particular the key length. Hence, when designing a cryptosystem, these
parameters need to be chosen according to the assumed computational capabilities of an attacker.
Depending on the chosen security margin, many cryptosystems are potentially vulnerable
to attacks when the attacker’s computational power increases unexpectedly. In real life,
the limiting factor of an attacker is often financial resources. Thus, it is quite crucial from
a cryptographic point of view not only to investigate the complexity of an attack, but also
to study possibilities to lower the cost-performance ratio of attack hardware. For instance, a
cost-performance improvement of an attack machine by a factor of 1,000 effectively reduces the
key length of a symmetric cipher by roughly 10 bit (since 1,000 ≈ 2^10).
Cryptanalysis of modern cryptographic algorithms involves massive and parallel computations,
usually requiring more than 2^40 operations. Many cryptanalytical schemes spend their
computations on independent operations, which allows for a high degree of parallelism. Such
parallel functionality can be realized by individual hardware blocks that can be operated simultaneously,
improving the time complexity of the overall computation by a perfect linear factor.
At this point, it should be remarked that the high non-recurring engineering costs for ASICs
– which can often consume more than US$ 100,000 for large projects – have put most projects
for building special-purpose hardware for cryptanalysis out of reach for commercial or research
institutions. However, with the recent advent of low-cost FPGAs which host vast amounts of
logic resources, special-purpose cryptanalytical machines have now become a possibility outside
government agencies.
There are several approaches to building powerful computing clusters for cryptanalysis. For
instance, distributed computing with loosely coupled processors connected via the Internet is
a popular approach, e.g., demonstrated by the SETI@home project [Uni05]. However, this has
the disadvantage that success strongly depends on the number of participating users. Hence,
distributed computing usually results in an unpredictable runtime for an attack, since the available
computational power varies with the dynamically changing number of contributors. A
second natural approach could rely on utilizing supercomputers like IBM’s BlueGene [Int08]
or other commercial machines, e.g., from Cray or SGI. Unfortunately, supercomputers tend to
provide sophisticated options for high-speed communication and large portions of distributed
memory, which are mostly not required for simple cryptanalytical number crunching. The
availability of these features increases the costs of these systems significantly, resulting in a
non-optimal cost-performance ratio for cryptanalytical applications. With the improvements in
FPGA technology, reconfigurable computing has emerged as a cost-effective alternative for
certain supercomputer applications.
In this chapter, we will employ a hardware architecture called Cost-Optimized Parallel Code
Breaker (COPACOBANA) for advanced cryptanalytic applications, which was originally
introduced in [KPP+ 06b]. The platform is optimal for computational problems which can be split
among multiple, independent nodes with low communication and memory requirements.
COPACOBANA consists of up to 120 FPGA nodes which are connected by a shared bus providing
an aggregate bandwidth of 1.6 Gbps on the backplane of the machine. COPACOBANA is not
equipped with additional memory modules, but offers a limited number of RAM blocks inside
each FPGA. Even though breaking modern ciphers like AES (with keys of 128/192/256 bits),
full-size RSA (1024 bit or more) or elliptic curves (ECC with 160 bit or more) is out of reach
for COPACOBANA, we can use the machine to gather data for extrapolating attacks with
realistic security parameters in terms of financial costs and attack time. Equally importantly,
there are numerous legacy systems (and not-so-legacy systems such as the electronic passport)
which still operate with key lengths that can be tackled with COPACOBANA.
In this chapter, we will show how COPACOBANA can be used to break the Data Encryption
Standard (DES) block cipher [Nat77] and other DES-related cryptosystems with (slightly)
more advanced methods, like Time-Memory Tradeoffs (TMTO). Though DES was revoked as a
standard in 2004, it is still a popular choice for low-end security systems and is still present in
many legacy systems.
Besides DES breaking, cryptanalysis of asymmetric ciphers can also be supported by
COPACOBANA, e.g., for solving the Elliptic Curve Discrete Logarithm Problem [GPP07a], which
is known as the fundamental primitive for cryptosystems based on elliptic curves. The
corresponding implementation is discussed in Chapter 6. Moreover, we also adapted the Elliptic
Curve Method for integer factorization for use with Xilinx Virtex-4 FPGAs on a variant of the
original COPACOBANA cluster. This work is presented in Chapter 7. There is also further
work with COPACOBANA which is not in the scope of this thesis: an attack on a legacy hard
disk encryption (Norton Diskreet) [KPP+ 06a], attacks on the GSM A5/1 stream cipher [GNR08]
and on the recent Machine Readable Travel Documents (ePassport) [LKLRP07] are further
examples of cryptanalytic applications which make use of the COPACOBANA cluster.
¹ Since we do not want to support hacking of bank accounts, we will not give further details here.
Chapter 5. Cryptanalysis of DES-based Systems with Special Purpose Hardware
Although the DES was reaffirmed for use in (US government) security systems several times
until 1999, the worries about the inherent threat of its short key space was already raised in
1977 when it was first proposed. The first estimates were proposed by Diffie and Hellman
[DH77] for a brute force machine that could find the key within a day at a cost of US$ 20 mil-
lion. Some year after that, a first detailed hardware design description for a brute force attack
was presented by Michael Wiener at the rump session of CRYPTO’93, a printed version is
available in [Wie96]. It was estimated that the machine could be built for less than a million
US dollars. The proposed machine consists of 57,000 DES chips that could recover a key
every three and a half hours. In 1997, a detailed cost estimate for three different approaches for
DES key search, distributed computing, FPGAs and custom ASIC designs, was presented by
Blaze et al. [BDR+ 96]. In 1998, the Electronic Frontier Foundation (EFF) finally built a DES
hardware cracker called Deep Crack which could perform an exhaustive key search within 56
hours [Ele98]. Their DES cracker consisted of 1,536 custom-designed ASIC chips at a material
cost of around US$ 250,000 and could search 88 billion keys per second. To our knowledge,
the latest step in the history of DES brute-force attacks took place in 2006, when the Cost Op-
timal Parallel Code Breaker (COPACOBANA) was built for less than US$ 10,000 [KPP+ 06b].
COPACOBANA is capable of breaking DES in less than one week on average. We would like
to note that software-only attacks against DES still take more than 1,000 PC-years (based on
an Intel Pentium 4 at 3 GHz) in the worst case.
Most of these attacks assume that at least one complete plaintext-ciphertext pair is given.
We will see that crypto tokens for banking applications (compliant with ANSI X9.9) typically do not
provide such inputs, so that a smarter attack must be chosen to tackle this kind of system.
There are some theoretical contributions by Coppersmith et al. [CKM00] as well as by Preneel
and van Oorschot [PO96] concerning the theoretical security of DES-based authentication
methods (DES-MAC). But to the best of our knowledge, an attack on an ANSI X9.9-based crypto
system has not been proposed (or demonstrated) yet. Minor parts of the work presented in
this chapter were done in collaboration with Andy Rupp and Martin Novotny and were also
published in [GKN+ 08].
5.3. Mathematical Background
ciphertext domain C of same size |P| = |C|. The set K denotes the key space of size |K| = n.
In case of a block cipher, we define e to be a bijective encryption function which is used to
encrypt a plaintext p ∈ P under a key k ∈ K into a ciphertext c ∈ C:
e : P × K → C
(p, k) ↦ e_k(p) = c
Given an image c ∈ C, the challenge for an attacker is to find a preimage of c, i.e., some
plaintext p ∈ P and key k ∈ K so that e_k(p) = c. An attacker often knows at least one
corresponding combination of plaintext p and ciphertext c (known-plaintext attack) and can use
this to uniquely identify the key k – however, usually at a large computational cost.
By using a cryptanalytic Time-Memory Tradeoff method, one tries to find a compromise
between the two well-known extreme approaches to recover the key k, i.e., either performing live,
exhaustive key searches or precomputing exhaustive tables with all possible key combinations
for a predefined tuple (p, c). A TMTO offers a way to reasonably reduce the actual search
complexity (by doing some kind of precomputation) while keeping the amount of precomputed
data reasonably low, where "reasonably" has to be defined more precisely: it depends on the
actual attack scenario (e.g., a real-time attack), the function e, and the available resources for the
precomputation and online (search) phases.
Existing TMTO methods [Hel80, Den82, Oec03] share the natural property that in order
to achieve a significant success rate much precomputation effort is required. Since performing
this task on PCs is usually too costly or time-consuming, cheap special-purpose hardware with
massive computational power – like COPACOBANA – is required. In [SRQL03a] an FPGA
design for an attack on a 40-bit DES variant using Rivest’s TMTO method was proposed.
In [MBPV06] a hardware architecture for UNIX password cracking based on Oechslin’s method
was presented. However, to the best of our knowledge, nobody has done a complete TMTO
precomputation for full 56-bit DES so far.
The next sections give a brief overview of cryptanalytic time-memory tradeoff methods.
The original Time-Memory Tradeoff Method, published in 1980 by Hellman [Hel80], tries to
precompute all possible key-ciphertext pairs in advance by encrypting p with all n possible
keys. However, to reduce memory requirements, these pairs are organized in several chains of
a fixed length. The chains are generated deterministically and are uniquely identified by their
respective start and end points. In this way, it suffices to save a chain's start and end point to restore
a chain later on. In the online phase of the attack, one then simply needs to identify and
reconstruct the right chain containing the given ciphertext to get the desired key. The details
of the two phases are described in the following.
Figure 5.1: Generation of a chain: starting from SP, the step function f = r ∘ e is applied t times under the fixed plaintext p, producing x2, x3, . . . , xt = EP.
Precomputation phase.
Let m be the number of chains to be computed for a table. The first m
different keys are chosen to serve as start points SP of the chains. To generate a chain, one first
computes eSP (p) resulting in some ciphertext c (see Figure 5.1). In order to continue the chain,
c is used to generate a new key. To do so, a so-called reduction and re-randomization function r
is applied reducing the bit length of c to the bit length of a key for the encryption function e (if
necessary) and performing a re-randomization of the output. By means of r, we can continue
the chain by computing r(eSP (p)) = x2 , using the resulting key x2 to compute r(ex2 (p)) = x3
and so on. The composition of e and r is called step-function f . After t applications of f the
chain computation stops and we take the last output as the end point EP of the chain. The
pair (SP, EP) is stored in a table sorted by end points. The number of distinct keys contained
in a table divided by n is called the coverage of a table. Unfortunately, the occurrence of a key
in a table is not necessarily unique because there is a chance that two chains collide and merge
or that a chain runs into a loop. This is due to the non-injective function r mapping the space
of ciphertexts to the key space (which is often smaller, e.g., in the case of DES). Each merge
or loop reduces the fraction of distinct keys contained in a table and thus the coverage (if m
is fixed). Since the probability of merges increases with the size of a table, at a certain point
we cannot significantly improve the coverage by simply adding more and more chains. Hellman
calculated that this point is somewhere near n^(2/3) for a single table. To cope with this problem,
he suggested to generate multiple tables each associated with a different reduction function. In
this way even if two chains from different tables collide, they will not merge because different
functions are applied to the shared value in the next step.
Online phase.
In the online phase a ciphertext c′ is given that is assumed to be the result of the encryption of
p using some key k. We try to retrieve k from the precomputed tables in the following way: to
find out if k is covered by a specific table, we compute a chain up to a length of t starting with
r(c′ ) and compare the intermediate points with the end points in the table. More precisely, we
first check if r(c′ ) is contained in the table. If not, we compute f (r(c′ )) and look for a match,
then we repeat this for f(f(r(c′))) and so on. If a match occurs after the i-th application of
f for a pair (SP, EP), then f^(t−i−1)(SP) = x_(t−i) is a key candidate. This candidate needs to be
checked by verifying e_{x_{t−i}}(p) = c′, and, if it is valid, the online phase ends. If it is not valid, a
false alarm has occurred and the procedure continues while the chain has a length smaller than
t + 1. If no valid key is found in this table we repeat the same procedure for another table (and
thus another r and f ).
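Both phases can be illustrated with a small, self-contained software sketch. This is not the hardware design of this chapter: the toy cipher e (a truncated hash over a 20-bit key space instead of 56-bit DES), the parameters, and all function names are illustrative stand-ins chosen only to make the chain logic concrete:

```python
import hashlib

N_BITS = 20                              # toy key space (n = 2^20), NOT 56-bit DES
MASK = (1 << N_BITS) - 1

def e(key, p):
    """Toy stand-in for the block cipher e_k(p)."""
    h = hashlib.sha256(key.to_bytes(4, "big") + p.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:4], "big")

def f(key, p):
    """Step function f = r o e; the reduction r truncates to the key space."""
    return e(key, p) & MASK

def precompute(p, m, t):
    """Precomputation phase: m chains of length t; store only EP -> SP."""
    table = {}
    for sp in range(m):
        x = sp
        for _ in range(t):
            x = f(x, p)
        table.setdefault(x, sp)          # table keyed by end point
    return table

def online(p, c, table, t):
    """Online phase: given c = e_k(p), try to recover a valid key."""
    y = c & MASK                         # r(c)
    for i in range(t):                   # y = f^i(r(c)) at iteration i
        if y in table:                   # possible hit: some chain ends in y
            x = table[y]
            for _ in range(t - i - 1):   # candidate is f^(t-i-1)(SP)
                x = f(x, p)
            if e(x, p) == c:             # rule out false alarms
                return x
        y = f(y, p)
    return None                          # key not covered by this table

p = 0x42
table = precompute(p, m=512, t=64)
k = f(f(f(0, p), p), p)                  # a key that lies on the first chain
recovered = online(p, e(k, p), table, t=64)
```

A real attack would use several such tables, each with a different reduction function, to push the overall coverage toward the desired success rate.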
Distinguished Points.
In most attack setups for Hellman’s TMTO, the time required to complete the online phase is
dominated by the high number of table accesses. Random accesses to disk can be many orders of
magnitude slower than the evaluation of f . The Distinguished Point (DP) method, introduced
by Rivest [Den82] in 1982, addresses this problem. A DP is a key that fulfills a certain simple
criterion (e.g., the first 20 bits are zero) which is usually given as a mask of length d. Rivest’s
idea was to admit only DPs as end points of a chain. For the precomputation phase this means
that a chain is computed until a DP or a maximal chain length t_max + 1 is reached. Only chains
of length at most t_max + 1 ending in a DP are stored. Using DPs, merging and looping chains
can also be detected and are then discarded. In the online phase, the table does not need to be
accessed after every application of f but only for the first occurring DP. If we have no match
for this DP we can proceed with the next table.
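The DP termination rule can be sketched in a few lines; the mask length d, the toy step function, and all names below are illustrative assumptions, not the chapter's implementation:

```python
def is_distinguished(key, d=12):
    """DP criterion: the d least significant bits are zero (an assumed mask)."""
    return key & ((1 << d) - 1) == 0

def dp_chain(sp, step, d=12, t_max=4096):
    """Iterate the step function until a DP or the maximal length is reached.
    Chains not ending in a DP within t_max steps are discarded (return None);
    otherwise (end_point, length) would be stored together with sp."""
    x = sp
    for length in range(1, t_max + 1):
        x = step(x)
        if is_distinguished(x, d):
            return x, length
    return None

def step(x):
    """Toy stand-in for f = r o e on a 24-bit space (an LCG, not DES)."""
    return (x * 1103515245 + 12345) & 0xFFFFFF

result = dp_chain(1, step)
```

In the online phase, one then walks the chain from r(c′) only until the first DP is reached and performs a single table lookup for it.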
Rainbow Tables.
Rainbow tables were introduced by Oechslin [Oec03] in 2003. He suggested not to use the
same r when generating a chain for a single table but a (fixed) sequence r1 , . . . , rt of different
reduction functions. More precisely, due to the different reduction functions we get t different
step-functions f1 , . . . , ft that are applied one after another in order to create a chain of length
t + 1. The advantage of this approach is that the effect of chain collisions is reduced: while the
collision of two chains inevitably leads to a merge of these chains in a Hellman table, a merge
only happens in the rainbow method if the shared value appears at the same position in both
chains. Otherwise they share only this single value. Thus, a merge of two chains in a rainbow
table is not likely to occur. Furthermore, loops are completely prevented. Hence, regarding
a space efficient coverage, these characteristics allow to put many more chains into a rainbow
table than into a Hellman table. This in turn significantly reduces the total number of tables
needed in order to achieve a certain coverage. Since fewer rainbow tables must be searched in
the online phase (which is, however, slightly more complex), a lower number of calculations and
table accesses is required compared to Hellman’s method. To check for a key in a rainbow table,
we first compute r_t(c′) and compare it to the end points, then we repeat this for f_t(r_{t−1}(c′)),
f_t(f_{t−1}(r_{t−2}(c′))), etc. Moreover, compared to the distinguished point method, the number of
false alarms and the corresponding extra work is reduced.
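The lookup order of the rainbow online phase can be made concrete as follows; the function names and the toy reduction/step functions are assumptions for illustration only:

```python
def rainbow_lookup_sequence(c, fs, rs):
    """Yield end-point candidates in Oechslin's order:
    r_t(c), f_t(r_{t-1}(c)), f_t(f_{t-1}(r_{t-2}(c))), ...
    fs[i] and rs[i] represent f_{i+1} and r_{i+1} (0-based lists)."""
    t = len(rs)
    for pos in range(t - 1, -1, -1):     # assumed column of the key in a chain
        y = rs[pos](c)
        for i in range(pos + 1, t):      # complete the chain to its end point
            y = fs[i](y)
        yield y

# Toy functions so the order is visible: r_i appends i, f_i adds an offset.
rs = [lambda c, i=i: c * 10 + i for i in range(4)]
fs = [lambda y, i=i: y + 100 * i for i in range(4)]
seq = list(rainbow_lookup_sequence(7, fs, rs))
```

Note how each successive candidate requires one column more of recomputation: trails are short in the beginning and grow toward full chain length, which matters for the bandwidth discussion in Section 5.6.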
5.5. Exhaustive Key Search on DES
Figure 5.2: Architecture of COPACOBANA: 20 FPGA modules (in DIMM form factor), each carrying six Xilinx Spartan-3 XC3S1000 FPGAs, attached to a backplane with a 64-bit data bus and a 16-bit address bus; a controller card with a Xilinx Virtex-4 FX FPGA provides the Gigabit Ethernet connection to the host.
The top level entity of COPACOBANA is a host-PC which is used to initialize and control the
FPGAs, as well as for the accumulation of results. Programming can be done simultaneously for
all or a specific subset of FPGAs. Data transfer between FPGAs and a host-PC is accomplished
by a dedicated control interface. The controller has also been designed as a slot-in module so
that COPACOBANA can be connected to a computer either via a USB or Ethernet controller
card. A software library on the host-PC provides low-level functions that allow for addressing
individual FPGAs, storing and reading FPGA-specific application data. With this approach,
we can easily attach more than one COPACOBANA device to a single host-PC.
The Data Encryption Standard (DES) with a 56-bit key size was chosen as the first commercial
cryptographic standard by NIST in 1977 [Nat77]. A key size of 56 bits was considered to be a
good choice considering the huge costs for computing power in the late 1970s, which
made a search over all 2^56 possible keys appear impractical.
Since DES was designed to be extremely efficient in terms of area and speed for hardware, an
FPGA implementation of DES can be orders of magnitude faster than an implementation on
a conventional PC at much lower cost [KPP+ 06b]. This allows a hardware-based engine for
DES key search to be much faster and more efficient than a software-based approach.
Our core component is an improved version of the DES engine of the Université Catholique
de Louvain’s Crypto Group [RSQL03] based on 21 pipeline steps. Our design can test one key
per clock cycle and engine. On the COPACOBANA, we can fit four such DES engines inside
Figure 5.3: Architecture for exhaustive key search with four DES key search units.
a single FPGA, which allows for sharing plaintext-ciphertext input pairs and the key space as
shown in Figure 5.3.
Since our first implementation, as presented in [KPP+ 06b], we were able to tweak our design
for increased performance by the use of additional pipelined comparators and improved control
logic. We can now operate each of the FPGAs at an increased clock rate of 136 MHz, so that
a performance gain of 36% is achieved compared to [KPP+ 06b]. Consequently, a partial
key space of 2^42 keys can be checked completely in 2^40 × 7.35 ns by a single FPGA, which is
approximately 135 minutes. Since COPACOBANA hosts 120 of these low-cost FPGAs, the key
search machine can check 4 × 120 = 480 keys every 7.35 ns, i.e., 65.28 billion keys per second.
To find the correct key, COPACOBANA has to search through an average of 2^55 different keys.
Thus, COPACOBANA can find the right key in approximately T = 6.4 days on average. Since
more than one COPACOBANA can be attached to a single host and the key space can be
shared among all machines, the search time then reduces to T/ℓ, where ℓ denotes the number of
machines.
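These figures can be checked with a few lines of arithmetic, using only the rates stated above:

```python
keys_per_interval = 4 * 120            # 480 keys tested per clock interval
clock_interval = 7.35e-9               # seconds per cycle at 136 MHz
rate = keys_per_interval / clock_interval   # keys per second
avg_keys = 2 ** 55                     # half of the 2^56 DES key space

print(f"{rate / 1e9:.2f} billion keys/s")    # -> 65.31 billion keys/s
print(f"{avg_keys / rate / 86400:.1f} days")  # -> 6.4 days
```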
5.6. Time-Memory Tradeoff Attacks on DES
Table 5.1: Empirical TMTO parameters for optimal performance with COPACOBANA.
Method   | SR   | DU      | PT (COPA) | OT
Hellman  | 0.80 | 1897 GB | 24 days   | 2^40.2 TA + 2^40.2 C
DP       | 0.80 | 1690 GB | 95 days   | 2^21 TA + 2^39.7 C
Rainbow  | 0.80 | 1820 GB | 23 days   | 2^21.8 TA + 2^40.3 C
Here, SR denotes the success rate, DU the disk usage, PT the precomputation time, and OT the
online time, measured in table accesses (TA) and calculations (C) during the online phase. Note that these figures
for use with COPACOBANA are based on estimations given in [Hel80, Oec03, SRQL03a] (false
alarms are neglected) and the given constraints mentioned above. Note further that for this
initial extrapolation we have used the implementation of our exhaustive key search unit pre-
sented in Section 5.5. According to our findings, precomputations for the DP method on a single
COPACOBANA take roughly four times longer compared to Hellman’s and Oechslin’s method
based on the given constraints. In contrast, the subsequent online attack has the lowest complexity
for the distinguished point method. Considering a TMTO scenario to use the COPACOBANA
for precomputation only (implying that the online attack is performed by a PC), the rainbow
table method can be assumed to provide best performance. When using COPACOBANA as
well for precomputation and online phase, there is a strong indicator to select distinguished
points as the method of choice: for the DP method, we can assume the frequency of table ac-
cesses to follow a uniform distribution, hence, we expect balanced bandwidth requirements over
time. With respect to the online phase using rainbow tables, the computation trails are short
in the beginning but increase in length over time. This results in significant congestion on
COPACOBANA’s communication interface since a large number of table lookups are required
in the beginning of the online phase. Therefore, a scenario running both the precomputation
and the online phase on COPACOBANA should be based on the DP method since this method
is most promising with respect to the restrictions of the machine.
We have implemented the precomputation phase for generating DES rainbow tables on CO-
PACOBANA. For this implementation, we have developed another DES core which operates
with only 16 pipeline stages⁵. With 4 parallel DES units and 16 pipeline stages each, we can run
64 chain computations in parallel per FPGA. Figure 5.4 graphically presents our architectures
for generating rainbow tables in further detail. On the given Spartan-3 devices, our entire im-
plementation including I/O and control logic consumes 7571 out of 7680 (98%) available slices
of each FPGA and runs at a maximum clock frequency of 96 MHz. Based on these figures, a
single COPACOBANA is able to compute more than 46 billion iterations of the step function f
per second. We are currently optimizing the I/O logic to support concurrent trail computations
and data transfers to eliminate idle times of the DES cores during data transmission. With this
improvement of our design, we can estimate the actual duration of the precomputation phase
⁵ Recall that the DES implementation from Section 5.5 uses 21 instead of 16 pipeline stages. A 16-staged implementation allows for simpler addressing schemes when selecting a result from a specific pipeline position.
Figure 5.4: FPGA architecture for rainbow table generation: a chain buffer and a control FSM feed four TMTO cores, each built around a 16-stage pipelined DES unit combined with a re-randomization function, connected to the I/O logic.
for generating the rainbow tables to last slightly less than 32 days. With a high-performance
controller for COPACOBANA becoming available, we expect to verify our results practically
and complete the generation of full precomputation tables for the DES.
5.7 Extracting Secrets from DES-based Crypto Tokens
Figure 5.5: Principle of response generation with ANSI X9.9-based crypto tokens.
type of tokens require additional assumptions concerning the unknown plaintext and thus are
harder to attack. More precisely, our contribution assumes fixed-key OTP tokens which can be
used in combination with a challenge-response protocol. In such protocols, a decimal-digit chal-
lenge is manually entered into the token via an integrated keypad. The token in turn computes
the corresponding response according to the ANSI X9.9 standard. Tokens implementing this
standardized authentication scheme (incorporating ANSI X3.92 DES encryption) often have a
fixed-size LCD capable of displaying 8 decimal digits for input and output.
After the user has typed in eight decimal digits as input (challenge), the value is converted to
binary representation using standard ASCII code notation according to the ANSI X9.9 standard.
For instance, the typed number "12345678" is converted into the 64-bit challenge value
c = 0x3132333435363738 in hexadecimal representation (the ASCII codes of the digits).
After recoding, c is used as plaintext to the DES encryption function r = e_k(c) with the static
key k stored securely in the token. The output of the encryption function is the 64-bit ciphertext
r = (r_1, r_0), where each r_i denotes a 32-bit word to be transformed using a mapping µ to fit the
8-digit display of the token. The mapping µ takes the 8 hexadecimal digits of r_1 (32 bits) of
the DES encryption as input, and converts each digit individually from hexadecimal (binary)
notation to decimal representation. Let H = {0, . . . , 9, A, . . . , F} and D = {0, . . . , 9} be the
alphabets of hexadecimal and decimal digits, respectively. Then µ is defined as:
µ : H → D : {0_H ↦ 0_D ; . . . ; 9_H ↦ 9_D ; A_H ↦ 0_D ; . . . ; F_H ↦ 5_D}
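The recoding and the mapping µ are straightforward to express in software; this sketch reproduces the example values of Figure 5.5 (the function names are ours, and the DES encryption itself is omitted since the token key is secret):

```python
def encode_challenge(digits):
    """ANSI X9.9 input recoding: 8 typed decimal digits -> 64-bit ASCII value."""
    return digits.encode("ascii")

def mu_digit(h):
    """The mapping mu: 0-9 map to themselves, A-F wrap around to 0-5."""
    v = int(h, 16)
    return str(v) if v < 10 else str(v - 10)

def token_response(ciphertext_hex):
    """Apply mu digit-wise to the upper 32 bits (8 hex digits) of the DES output."""
    return "".join(mu_digit(d) for d in ciphertext_hex[:8])

c = encode_challenge("12345678")
print(c.hex())                             # -> 3132333435363738
print(token_response("A473BD1CF974226A"))  # -> 04731312 (as in Figure 5.5)
```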
Hence, the output after the mapping µ is an 8 decimal digit value which is displayed on the
LCD of the token. Figure 5.5 shows how the response is generated on the token according to a
given challenge. In several countries, this authentication method is used in banking applications
whenever a customer needs to authenticate financial transactions. For this, each user of such
an online-banking system owns a personal token used to respond to challenges presented by the
banking system to authorize every security-critical operation. In this context, a security-critical
operation can be, for example, the login to the banking system as well as the authorization
of a money transfer. Figure 5.6 depicts a token-based challenge-response protocol interaction
with an online-banking system from a user’s perspective.
The central role in such a security-related application makes the secret token an interesting
target for an attack.
Observation 5.1 Let D = {0, . . . , 9} be the alphabet of decimal digits. With a single challenge-
response pair (c, r) of an ANSI X9.9-based authentication scheme where c, r ∈ D8 , on average
26 bits of a DES key can be determined (24 bits in the worst case, 32 bits in the best case).
Since only 32 bits of the output r for a given challenge c are exposed, this is a trivial upper bound
for the information leakage from a single pair. Assuming the DES encryption function to be a
pseudo-random function with appropriate statistical properties, the 32 most significant bits of r
form 8 hexadecimal digits uniformly distributed over H^8 = {0, . . . , 9, A, . . . , F}^8. The surjective
mapping µ has the image F = {0, . . . , 9}, of which T = {0, . . . , 5} are doubly assigned. Hence,
we know that ∆ = F \ T = {6, . . . , 9} are four fixed points which directly correspond to output
digits of r, yielding four bits of key information (I). The six remaining decimal digits Ω = F ∩ T
can have two potential origins allowing for a potential deviation of one bit (II). According to
a uniform distribution of the 8 hexadecimal output digits, the probability that (I) is given for
an arbitrary digit i of r is Pr(i ∈ ∆) = 1/4. Thus, on average we can expect 2 out of 8
hexadecimal digits of r to be in ∆, revealing four bits of the key each, whereas the remaining 6 digits
introduce a possible variance of one unknown bit per digit. On average, this leads to knowledge
of R = 2 · 4 + 6 · 3 = 26 bits of DES key material. Obviously, the best case with all 8 digits in
∆ and the worst case with no digits in ∆ provide 32 and 24 key bits, respectively.
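The 26-bit average can be reproduced by summing over the binomial distribution of the number of ∆-digits (a quick numerical check of the argument above, not part of the original derivation):

```python
from math import comb

p_delta = 4 / 16   # probability that a hex output digit lies in Delta = {6,...,9}
# k digits in Delta contribute 4 bits each, the remaining 8-k digits 3 bits each.
expected_bits = sum(
    (4 * k + 3 * (8 - k)) * comb(8, k) * p_delta**k * (1 - p_delta)**(8 - k)
    for k in range(9)
)
print(round(expected_bits, 6))  # -> 26.0 (k=0 gives 24 bits, k=8 gives 32 bits)
```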
According to Observation 5.1, we can develop two distinguished attacks based on the knowl-
edge of two and three known challenge-response pairs:
Observation 5.2 Given two known challenge-response pairs (ci , ri ) for i = {0, 1} of the ANSI
X9.9 authentication scheme, an exhaustive key search using both pairs will reveal 2^4 = 16
potential key candidates on average (256 candidates in the worst case, in the best case the
actual key is returned).
Assuming independence of different blocks encrypted under the same key in a block
cipher, we can accumulate the results from Observation 5.1 for key determination using
multiple pairs (c_i, r_i). Hence, on average we can expect to determine 52 bits of the key when
each c_i has 2 digits from the set ∆. Given the full DES key size of 56 bits, this results in 2^4 possible
variations for key candidates. Having at least 4 digits from ∆ for each ci , this will lead to the
best case resulting in a single key candidate. In the worst case and with no ∆ digits in any ci ,
we will end up with 48 bits of determined key material and 2^8 = 256 possible remaining key
candidates. As a consequence, the number of potential key candidates is directly dependent on
how many digits of a ci are fixed points and out of the set ∆.
Observation 5.3 Given three known challenge-response pairs of the ANSI X9.9 authentication
scheme, an exhaustive key search based on this information will uniquely reveal the DES key.
This directly follows from Observation 5.2. For this attack, 3 · 24 = 72 > 56 bits of key material
can be directly determined (even in the worst case), resulting in the correct key being definitely
identified.
Figure 5.7: Attack scenario for token-based banking applications using phishing techniques.
Spying techniques, for example based on malicious software like key-loggers or on hidden cameras,
can be used to observe the user while responding to a challenge. Note that the freshness of these
values does not play a role, since we use the information only for computing the secret key and not
for an unauthorized login attempt. Figure 5.7 shows a possible attack scenario on ANSI X9.9
tokens and associated banking applications based on phishing of challenge-response pairs (c, r).
With at least two pairs of challenge-response data, we can perform an exhaustive key search on
the DES key space implementing the specific features of ANSI X9.9 authentication. To cope
with the DES key space of 2^56 potential key candidates, we will propose an implementation based
on dedicated special-purpose hardware. In case three challenge-response pairs are
given, we are definitely able to uniquely determine the key of the secret token using a single
exhaustive key search. When only two pairs (c_i, r_i) are available to the attacker, it is likely
that several potential key candidates are returned from the key search (cf. Observation 5.2).
With 16 potential solutions on average, the attacker can attempt to guess the right solution
by trial and error. Since most banking systems allow the user to enter up to three erroneous
responses to a challenge in a row, two key candidates can be tried by the attacker at a time.
Then, after a period of inactivity, the authorized user has probably logged into the banking
application again, which resets the error counter and allows the attacker to start another trial session
with further key candidates. On average, the attacker can expect to be successful after about
four trial and error sessions, testing 8 out of the 16 keys from the candidate list. Hence, an
attack on an ANSI X9.9-based token is very likely to be successful even with knowledge of only
two given challenge-response pairs.
As before, the main goal of our hardware design is to perform the key search for the token in a
highly parallelized fashion by partitioning the key space among the available FPGAs on the
COPACOBANA. This requires hardly any interprocess communication, as each of the DES
engines can search for the right key within its allocated key subspace.
Figure 5.8: Four ANSI X9.9 key search units based on fully pipelined DES cores in a Xilinx
Spartan-3 FPGA.
Within the FPGAs, we again use a slightly modified version of the highly pipelined DES implementation
of the Université Catholique de Louvain's Crypto Group [RSQL03], which computes
one encryption per clock cycle per engine. As with the brute-force attack, we can fit four such DES
engines inside a single FPGA, and therefore allow for sharing of control circuitry and the key
space as shown in Figure 5.8. The FPGA architecture comprises two 64-bit plaintext registers for
the challenges and two 32-bit ciphertext registers for storing the corresponding responses which
can be acquired from the OTP-token. The key space to be searched is allocated to each chip as
the most significant 14 bits of the key, which are stored in the Key register. The counter (CNT
1) is used to run through the least significant 40 bits of the key. The remaining two bits of the
56-bit key are hardwired and dedicated to each of the four DES engines. Thus, each DES engine
is assigned the task of searching through all keys with its 16 most significant
bits fixed, in total 2^40 different keys. The key space is partitioned by a connected host-PC so
that each chip takes around 150 minutes (at 120 MHz) to test all ANSI X9.9 authenticators
in its allocated key subspace. During a single check of an authenticator, the DES engines use
the first challenge (plaintext 1) as a primary input to the encryption function. Then, the upper
32 bits of the generated ciphertext are mapped digit-by-digit by the function µ and compared to
the value of the response stored in the register ciphertext 1. If any of the DES engines provides
a positive match, the corresponding engine switches its input to the second challenge encrypting
it with the same key. To match the pipelined design of the DES engine, we are using a shadow
counter (CNT 2) tracking the key position at the beginning of the pipeline. In case that the
derived authenticator from the second encryption compares successfully to the second response,
the controller CTL reports the counter value to the host-PC as a potential key candidate. The
host-PC keeps track of the key range that is assigned to each of the FPGAs and, hence, can
match the right key from a given counter value. If no match is found until the counter overflows,
the FPGA reports completion of the task and remains idle until a new key space is assigned.
If a third challenge-response pair was specified, the host-PC performs a verification
of the reported key candidate in software. If the verification is successful, the
search is aborted and the key is returned as the result of the search.
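One plausible way to read the key partitioning described above is the following field layout; the 14 + 2 + 40 bit split is taken from the text, but the composition function itself is our interpretation, not code from the design:

```python
def candidate_key(chip_prefix, engine_id, counter):
    """Compose a 56-bit DES key candidate from the described partitioning:
    14-bit per-chip prefix | 2 hardwired engine bits | 40-bit counter."""
    assert chip_prefix < 2 ** 14 and engine_id < 4 and counter < 2 ** 40
    return (chip_prefix << 42) | (engine_id << 40) | counter

# Each engine sweeps 2^40 counter values; a chip covers 4 * 2^40 = 2^42 keys.
k = candidate_key(chip_prefix=0x1ABC, engine_id=3, counter=(1 << 40) - 1)
```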
We have implemented the FPGA architecture shown in Figure 5.8 using the Xilinx ISE 9.1
development platform. After synthesis of the design incorporating four DES engines and the
additional logic for the derivation of the ANSI X9.9 authenticator, the usage of 8,729 Flip-
Flops (FF) and 12,813 Look-Up Tables (LUT) was reported by the tools (56% FF and 83%
LUT utilization of the Spartan-3 1000 device, respectively). As discussed in Section 5.5, we
included specific optimizations like pipelined comparators since n-bit comparators are likely to
introduce a long signal propagation path reducing the maximum clock frequency significantly.
By removing these potential bottlenecks, the design can be clocked at 120 MHz after place-and-route,
resulting in a throughput of 480 million keys per FPGA per second. In total, a fully
equipped COPACOBANA with 120 FPGAs can compute 57.6 billion ANSI X9.9 authenticators
per second. Based on this, we can present time estimates for an attack provided that two
challenge-response pairs are given. Recall that in this scenario we will be faced with several
potential key candidates per run so that we have to search the entire key space of 2^56 to build
a list with all of them. This ensures that we are able to identify the actual key in a separate
step.
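From the stated throughput, the expected search times follow directly (a back-of-the-envelope check; the detailed figures of Table 5.3 are not reproduced here):

```python
rate = 57.6e9                      # ANSI X9.9 authenticators per second (120 FPGAs)
secs_two_pairs = 2 ** 56 / rate    # full key space scan to list all candidates
secs_three_pairs = 2 ** 55 / rate  # unique key found after half the space on average
print(f"{secs_two_pairs / 86400:.1f} days")    # -> 14.5 days
print(f"{secs_three_pairs / 86400:.1f} days")  # -> 7.2 days
```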
Similarly, we can present figures for an attack scenario where three challenge-response pairs
are available. In this attack, we must test 2^55 ANSI X9.9 authenticators on average to find
the corresponding key, which is half the time complexity of an attack having two known pairs of
data. Note that all presented figures of Table 5.3 include material costs only (not taking energy
and development costs into account).
For comparison with our hardware-based cluster, we have included estimations for an Intel
Pentium 4 processor operating at 3 GHz. This microprocessor allows for a throughput of about
2 million DES computations per second, which we also assume to be an appropriate throughput estimate
for generating ANSI X9.9 authenticators.
5.8 Conclusions
In this chapter we have shown advanced attack implementations to break the DES block cipher
and related products with COPACOBANA. On up to 120 low-cost FPGAs, COPACOBANA is
able to perform the required cryptographic operations simultaneously and in parallel for appli-
cations with high computational but low memory and communication requirements.
Chapter 5. Cryptanalysis of DES-based Systems with Special Purpose Hardware
Table 5.3: Cost-performance figures for attacking the ANSI X9.9 scheme with two and three
known challenge-response pairs (ci, ri).
We presented an improved key-search implementation that can break DES in less than a
week at an average computation rate of 65.3 billion tested keys per second. Besides the simple
brute-force scenario on DES, we have extended the attack scheme to tackle the complexity
of ANSI X9.9 OTP tokens whose security assumptions rely on DES. Smarter brute-force
attacks, particularly when identical plaintexts are encrypted with different keys, can be achieved
by Time-Memory Tradeoffs. Due to the immature communication facilities of COPACOBANA, we
could not yet fully verify our estimated results. However, as soon as higher bandwidth becomes
available on COPACOBANA, we will be able to complete the generation of the precomputation
tables for the TMTO attack on DES in less than a month.
Chapter 6
Parallelized Pollard-Rho Hardware
Implementations for Solving the ECDLP
As discussed in Chapter 3, the use of Elliptic Curves (EC) in cryptography is very
promising for embedded systems since they are resistant against powerful index-
calculus attacks and thus allow small parameter sizes. In this chapter, we analyze the
actual security margin provided by ECC more precisely and present a first concrete
hardware implementation of an attack against ECC over prime fields. In detail, we
describe an FPGA-based hardware architecture of the Pollard-Rho method which is,
to our knowledge, currently the most efficient attack against ECC. With the imple-
mentation at hand, a fairly accurate estimate about the cost of such an attack based
on FPGAs as well as on ASICs can be given.
6.1 Motivation
Elliptic Curve Cryptosystems are based on the difficulty of the Diffie-Hellman Problem (DHP)
in the group of points on an Elliptic Curve (EC) over a finite field. The DHP is closely related
to the well studied Discrete Logarithm Problem (DLP). In contrast to the more efficient index-
calculus attacks on the DLP over finite fields, ECC only allows for generic attacks such as
Pollard’s Rho method [Pol78, vOW99]. This benefit yields much shorter underlying bit lengths
for ECC (160–256 bit) compared to RSA or DLP in finite fields (1024–4096 bit) at an equivalent
level of security [LV01].
For RSA and DLP in finite fields, many publications address the issue of hardware-based
attacks and provide security estimates based on proposals of architectures attacking such cryp-
tosystems. For ECC, however, no precise architecture for a hardware attack has been described
yet. Cryptanalysis of ECC requires the same algorithmic primitives as the cryptosystem itself,
namely point addition and point doubling which can be implemented very efficiently in hard-
ware. An algorithm designed for concurrency, the parallel Pollard’s Rho, is described in [vOW99]
which requires a unique point representation but achieves a perfect linear speedup dependent on
the number of available processors. To the best of our knowledge, except for estimations given
in this contribution, no actual results of a hardware implementation of Pollard’s Rho algorithm
for solving the ECDLP have been published.
In this chapter, we will present first results of a hardware implementation of an attack against
ECC over prime fields, allowing a security evaluation of ECC taking cryptanalytical hardware
into account. We propose an efficient architecture for the parallel variant of Pollard’s Rho
method and its implementation in hardware. Besides the hardware architecture which has been
completely programmed in VHDL and realized on a low-cost FPGA (Xilinx Spartan-3), this
project involves external software components required for managing the hardware components.
Based upon first results of the running hardware implementation, we estimate the expected
runtime to break ECC for actual security parameters. We also give estimates for an ASIC
design, solely dedicated to tackle actual ECC Challenges [Cer97] based on our architecture.
Extracts of this work were published as joint work with Jan Pelzl [GPP07a, GPP08].
Proposals of hardware-based attacks on ECC are very rare. The most important work in this field is
provided by [vOW99]: Besides an algorithmic improvement which allows for an efficient par-
allelization of Pollard's Rho method, the authors estimate the cost of dedicated hardware
solving the DLP over a curve over GF(2^155) at 32 days for US$ 10 million. However, specific
details of the parametrization are omitted (i.e., it is unclear how the distinguished point prop-
erty can be used in the projective domain).
A recently published contribution discusses attacks on ECC over binary fields with reconfig-
urable hardware [dDBQ07]. The authors also report results taking COPACOBANA as
reference platform. However, in contrast to curves over GF(2^m), curves over GF(p) have not
been examined yet.
6.3 Mathematical Background
Let p be a prime with p > 3 and Fp = GF(p) the Galois field with p elements. The ECDLP
input is an elliptic curve

E : y^2 = x^3 + ax + b

over Fp, a point P ∈ E(Fp), and a point Q; the task is to find the integer ℓ with

ℓ · P = Q,     (6.1)

where Q ∈ ⟨P⟩ holds. The parameter ℓ is often denoted as the elliptic curve discrete logarithm
ℓ = log_P(Q).
Naïve exhaustive search: this method sequentially adds the point P ∈ E to itself. The
addition chain P, 2P, 3P, 4P, . . . will eventually reach Q and reveal ℓ with ℓ · P = Q. In
the worst case, this computation can take up to n steps, where n = ord(P), making this
attack infeasible in practice when n is large.
Baby Step Giant Step (BSGS): the BSGS algorithm is an improvement over the naïve ex-
haustive search [Sha71]. For n = ord(P), memory for about √n points and approximately
√n computational steps are required. Due to its high memory complexity, BSGS is suit-
able only for very restricted sizes of n.
Pollard's Rho and Lambda method: in 1978 J. Pollard proposed [Pol78] two collision-
based algorithms. Although their time complexity is similar to that of BSGS, both
methods are superior due to their negligible memory requirements.
The ECDLP can be solved most efficiently using Pollard's Rho method or, in case of
an a priori specified search interval [a, b], by Pollard's Lambda method (also known as
the Kangaroo method). In combination with adequate parallelization, Pollard's Rho method
is the fastest known attack against general ECC, solving the ECDLP in time of roughly
√(πn/2). Thus, it will be the algorithm of choice for the remainder of this chapter.
To explain the Pollard-Rho method in more detail, we should first outline why point collisions
help to reveal the ECDLP. Let R1 = c1 P + d1 Q and R2 = c2 P + d2 Q be two points with
R1, R2 ∈ E(Fp) and R1 = R2, but c1 ≠ c2 and d1 ≠ d2. Then the following statements hold
[HMV04]:
c1 P + d1 Q = c2 P + d2 Q
(c1 − c2)P = (d2 − d1)Q
(c1 − c2)P = (d2 − d1)ℓP
ℓ = (c1 − c2)(d2 − d1)^(−1) mod n     (6.2)
Hence, in case of a point collision in the subgroup generated by P, the ECDLP can be solved
efficiently if gcd(d2 − d1, n) = 1. The next issue is how to find such a collision.
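Equation (6.2) can be exercised on a toy curve. The following sketch is illustrative only (not the thesis implementation): it uses the small curve y^2 = x^3 + 2x + 2 over F_17 with base point P = (5, 1) of order n = 19, and the colliding coefficient pairs are picked by hand:

```python
def inv_mod(a, m):
    return pow(a, -1, m)               # modular inverse (Python 3.8+)

def ec_add(P1, P2, a, p):
    """Affine point addition on y^2 = x^3 + a*x + b over F_p (None = point at infinity)."""
    if P1 is None: return P2
    if P2 is None: return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                    # P + (-P) = infinity
    if P1 == P2:
        lam = (3 * x1 * x1 + a) * inv_mod(2 * y1, p) % p   # tangent slope (doubling)
    else:
        lam = (y2 - y1) * inv_mod(x2 - x1, p) % p          # chord slope (addition)
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def ec_mul(k, P, a, p):
    """Double-and-add scalar multiplication k*P."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P, a, p)
        P, k = ec_add(P, P, a, p), k >> 1
    return R

# Toy parameters: E: y^2 = x^3 + 2x + 2 over F_17, P = (5, 1) with ord(P) = n = 19.
a, p, n, P = 2, 17, 19, (5, 1)
ell = 7                                # the (secret) discrete logarithm
Q = ec_mul(ell, P, a, p)               # Q = ell * P

# Two hand-picked representations that collide: R1 = 4P + 5Q and R2 = 6P + 2Q.
c1, d1, c2, d2 = 4, 5, 6, 2
R1 = ec_add(ec_mul(c1, P, a, p), ec_mul(d1, Q, a, p), a, p)
R2 = ec_add(ec_mul(c2, P, a, p), ec_mul(d2, Q, a, p), a, p)
assert R1 == R2                        # the collision

# Equation (6.2): ell = (c1 - c2) * (d2 - d1)^(-1) mod n
recovered = (c1 - c2) * inv_mod(d2 - d1, n) % n
assert recovered == ell
```

Since gcd(d2 − d1, n) = 1 here (n is prime), the inverse in Equation (6.2) exists and the logarithm is recovered.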
A naïve approach would be to take a starting point S = cs P + ds Q with randomly chosen
cs, ds ∈R {2, . . . , n}. A second point T1 = ct1 P + dt1 Q with other randomly chosen coefficients is
used to compute R1 = S + T1. Then a third random point T2, determined the same way, leads
to a further point R2 = S + T2, which is compared against previous results. This procedure
can be continued until a correspondence of points is located, with the drawback that all results
(about √n due to the birthday paradox) need to be stored. The enormous space requirements
would make this attack similarly costly as the Baby Step Giant Step algorithm.
The better solution is to have a random walk [Pol78] within the group. This is a pseudo-
random function determining a collision candidate using an addition chain with a finite set of
preinitialized random points. In other words, we have a function f taking a current point Xj of
[Figure 6.1: A pseudo-random trail of points X0, X1, X2, . . . forming the ρ shape; the walk runs into a cycle and collides at X9 = X3.]
the EC as input and computing its successor Xj+1 by simply adding another point. A repetition
of this procedure produces a pseudo-random trail of points in the domain of ⟨P⟩. The other
point, which is added each time, is determined from a set R of previously randomly chosen
points and is selected in each iteration by a partitioning function. Let Ri ∈ R be the i-th out
of s total random points with i ∈ {0, . . . , s − 1}. Then, we can define a partitioning function g
which determines the next random point Ri to add:

g : E(Fp) → {0, . . . , s − 1} : X ↦ i.

When we integrate g into the function f, we obtain the next point Xj+1 by:

Xj+1 = f(Xj) = Xj + R_g(Xj).
Due to the finiteness of R, the trail generated by repetitive application of f will always run into
a cycle and therefore eventually collide in some point. The shape traced by the random walk
resembles the Greek letter ρ, which gives the algorithm its name. The collision itself can
easily be detected using Floyd's cycle-finding algorithm, which requires only a second computation
advancing twice as fast as the first one [MvOV96]. Hence, except for the two computations, no
additional storage is required. Figure 6.1 depicts the construction of such a trail, i.e., a
random walk in the Pollard-Rho algorithm. Assume X0 to be the starting point of the trail.
Repeated application of f with Xj+1 = f(Xj) leads to each next point in the walk. Finally,
we encounter a collision caused by a duplicate visit at points X3 and X9, respectively.
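The walk and Floyd's cycle finding can be sketched compactly in software. In the illustration below (all names and parameters are our own choices), the additive group Z_n stands in for ⟨P⟩: the element corresponding to cP + dQ is (c + d·ℓ) mod n, so collisions behave exactly as derived above:

```python
import random

def pollard_rho_toy(ell, n, s=16, seed=0):
    """Single-processor Pollard rho in the additive group Z_n (stand-in for <P>).
    P corresponds to 1 and Q to ell; the group element c*P + d*Q is the integer
    (c + d*ell) mod n. Returns the recovered discrete logarithm (n prime)."""
    for attempt in range(seed, seed + 64):      # retry if the collision is degenerate
        rng = random.Random(attempt)
        # s precomputed random points R_i = a_i*P + b_i*Q with known coefficients
        coeffs = [(rng.randrange(n), rng.randrange(n)) for _ in range(s)]
        points = [(a + b * ell) % n for a, b in coeffs]

        def f(state):
            X, c, d = state
            i = X % s                           # partitioning function g
            a_i, b_i = coeffs[i]
            return ((X + points[i]) % n, (c + a_i) % n, (d + b_i) % n)

        start = (1 % n, 1, 0)                   # X0 = 1*P + 0*Q
        tortoise = f(start)
        hare = f(f(start))
        while tortoise[0] != hare[0]:           # Floyd: hare advances twice as fast
            tortoise = f(tortoise)
            hare = f(f(hare))
        (_, c1, d1), (_, c2, d2) = tortoise, hare
        if (d2 - d1) % n != 0:                  # need gcd(d2 - d1, n) = 1
            return (c1 - c2) * pow(d2 - d1, -1, n) % n
    raise RuntimeError("no usable collision found")

# Example: group order n = 1019 (prime), secret logarithm ell = 357.
assert pollard_rho_toy(357, 1019) == 357
```

Only the two walk states are kept in memory, mirroring the storage argument made above.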
The complexity for this Single-Processor Pollard-Rho (SPPR) algorithm is derived directly
from the collision probability given by the birthday paradox [Pol78]. The birthday paradox
deals with the random and repetitive selection of elements from a distinct set until a duplicate is
chosen. Assuming Z to be a random variable for the number of chosen elements, the probability
that j out of n elements can be selected without duplication is

Pr(Z > j) = (1 − 1/n)(1 − 2/n) · · · (1 − (j−1)/n) ≈ e^(−j²/(2n)).

This finally leads to an expected number of roughly √(πn/2) distinct elements before a
collision occurs and the algorithm terminates. A proof of this statement can be found in
[vOW99, FO90].
The Multi-Processor Pollard-Rho (MPPR) method, proposed by van Oorschot and Wiener [vOW99],
is basically a variation of the previously presented SPPR with some modifications for better
support of parallelization, providing a linear speedup with the number of available processors.
Each processor wi ∈ W, where W is the set of processors, starts an individual trail but does not
aim at terminating in a cycle as in the original method. The primary goal is to find collisions
between different, computationally independent trails. For this purpose, a selection criterion for
points on the trail is defined, marking a small partition of all computable points as distinguished.
In our case, we simply assign this property to points whose x-coordinate shows a specific number
of consecutive zero bits. This is similar to the methodology used in Section 5.3.1 for
distinguished points in TMTOs.
When a trail arrives at a Distinguished Point (DP), the corresponding processor transmits
the DP to a central server which keeps track of all submitted DPs and checks for duplicates.
In case of a duplicate resulting from a different origin, a collision is found and the algorithm
terminates successfully (when c1 ≠ c2 and d1 ≠ d2).
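The distinguished-point criterion reduces to a mask test on the x-coordinate. A sketch, assuming for illustration that the criterion counts trailing zero bits (the threshold t is a free parameter):

```python
def is_distinguished(x, t=24):
    """Distinguished-point test: the t least significant bits of the x-coordinate
    are all zero. A random point passes with probability 2^-t, so t = 24
    corresponds to a proportion of 2^-24 distinguished points."""
    return x & ((1 << t) - 1) == 0

# On average one in 2^t points is distinguished, so a processor walks about
# 2^t steps between two transmissions to the central server.
hits = sum(is_distinguished(x, t=4) for x in range(1 << 12))
assert hits == (1 << 12) >> 4      # exactly every 16th value of x passes for t = 4
```

Enlarging t shrinks the distinguished set and thereby the server bandwidth, at the price of longer trails between transmissions.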
Figure 6.2 depicts how several point processors contribute to the collision search. Every
darker spot in the trail of a processor corresponds to a DP. An eventual collision of two trails in
a common distinguished point is highlighted in black for the processors w3 and w4. In contrast
to the application of completely independent instances according to the original Pollard-Rho,
the centralized collection of DPs in the MPPR achieves a linear speedup with multiple instances.
On average, a collision occurs after the computation of

T = √(πn/2) + c     (6.3)
points, where n is the order of the actual subgroup. The central unit will detect the merging
of two trails when the next DP in the (joint) trail is found. The term c accounts for the
additional overhead of collecting points in a joint trail. If we assume that all available
processors directly contribute to the threshold T, the workload of a single processor wi ∈ W is
given by Equation (6.4) [vOW99]. Algorithm 6.1 depicts the MPPR method.

TW = √(πn/2) / |W| + c     (6.4)
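Equations (6.3) and (6.4) are easy to evaluate numerically. Ignoring the overhead term c, the following sketch (parameter choices are illustrative) computes the expected number of point operations for several group orders:

```python
import math

def expected_steps(n):
    """Expected total number of point computations, sqrt(pi*n/2) (Equation 6.3, c = 0)."""
    return math.sqrt(math.pi * n / 2)

def per_processor(n, W):
    """Workload of one of |W| processors (Equation 6.4, c = 0)."""
    return expected_steps(n) / W

for k in (64, 96, 128, 160):
    n = 2 ** k                                   # subgroup order for bit size k
    print(f"k = {k:3d}: ~{expected_steps(n):.2e} steps total, "
          f"~{per_processor(n, 10_000):.2e} per processor for |W| = 10,000")
```

The linear 1/|W| factor is what makes the centralized DP collection attractive for a machine with many hardware cores.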
6.4 An Efficient Hardware Architecture for MPPR
[Figure 6.2: Parallel trails of the processors w1, . . . , w4; darker spots mark distinguished points, and the collision of the trails of w3 and w4 in a common distinguished point is highlighted.]
6.4.1 Requirements
For the realization of MPPR, the server requires the following features:

- A communication controller for data exchange with the |W| point processors.
- A database storing the tuples (c, d, Y) for a point Y = cP + dQ in a table sorted according to Y for efficient point recovery.
- A unit for validating distinguished points received from a processor wi. This step is mandatory to protect against defective processors whose corrupted results might spoil the entire computational result; this happens rarely, when voltage drops and power fluctuations in a large hardware system spoil individual arithmetic computations.
- An arithmetic unit for computing the discrete logarithm from a detected collision.
In contrast to the connected point processors which are designed to run on special-purpose
hardware, the central server has only modest computational and throughput requirements.
Depending on the size of the subset of distinguished points and on the number of point
processors, a distinguished point only has to be processed once in a while. Thus, it is sufficient
to implement our central station in software, which simplifies the development process.
[Figure 6.3: The central server with internal RAM, a database of triples (c, d, Y), and a communication controller connected to the hardware point processors.]
of MPPR in hardware. Note that a possible ASIC implementation would dramatically decrease
the monetary effort of MPPR if produced in high volumes. However, for solving the ECDLP over
smaller groups (≤ 128 bit), an FPGA-based design can be implemented with COPACOBANA
(cf. Chapter 5).
For the implementation of a point processor, the computation of a trail of consecutive points
consists of four steps:
reduces the overall bandwidth requirement. Hence, we can adapt δ accordingly to meet
any bandwidth limitation between server and hardware processors.
Note that an efficient verification of the distinguished property only works in affine
coordinates. Affine coordinates suffer from expensive field inversions which are manda-
tory to compute the group addition. Projective coordinate systems avoid inversions by
encoding the inversion in a separate variable at the cost of additional field multiplica-
tions. Unfortunately, the projective point representation is not unique and does not allow
for an efficient check of the distinguished property. The reason for this is that an affine
point P = (x, y) has p − 1 different projective representations P = (X, Y, Z) satisfying
x = X/Z^α and y = Y/Z^β, which do not yield a unique property for DP detection.
Consequently, we will use affine coordinates in our implementation.
(2) Partitioning. In order to select the next random point Ri and its corresponding coeffi-
cients ai and bi, we need to identify a partition i ∈ {0, . . . , s − 1} according to the current
point X. In hardware, it is straightforward to choose a power of 2 for s since we can simply
use the log2(s) least significant bits of x to map to the partitioning value i. We choose
the (heuristically determined) optimal value s = 16, which comes close to Teske's proposal
in [Tes98].
(3) Point Operation. We need to update the current point by adding a random point Ri,
which requires a point addition or, in case of X = Ri, a point doubling. In the latter
case, a collision has occurred between the current point X and a random point Ri rather
than between two distinguished points. Since this case is very unlikely, with a probability
of Pr(X = Ri) = s/n, we omit the additional logic required for this situation to save
hardware area.
(4) Coefficient Update. Finally, we need to update the corresponding coefficients c and d
of point X with X = cP + dQ according to the selected partition i by computing
c = c + ai mod n and d = d + bi mod n.
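Steps (2) and (4) are simple enough to sketch directly; the following lines (illustrative, with hypothetical names) derive the partition index from the four least significant bits and update the coefficients, while the point operation of step (3) is elided:

```python
def partition_and_update(x_coord, c, d, coeffs, n, s=16):
    """Steps (2) and (4) of one MPPR iteration: derive the partition index i from
    the log2(s) least significant bits of the x-coordinate (step 2), then update
    the coefficients of X = c*P + d*Q modulo n (step 4). The point addition
    X = X + R_i of step (3) is left out here."""
    i = x_coord & (s - 1)              # log2(16) = 4 LSBs select the partition
    a_i, b_i = coeffs[i]
    return i, (c + a_i) % n, (d + b_i) % n

# Illustrative values: n = 19 and s = 16 made-up coefficient pairs (a_i, b_i).
coeffs = [(j + 1, 2 * j + 1) for j in range(16)]
i, c, d = partition_and_update(x_coord=0x2A5, c=3, d=7, coeffs=coeffs, n=19)
assert i == 0x2A5 & 0xF                # partition chosen by the 4 LSBs
```

In hardware, the same selection is free: the partition index is simply a wire tap on the low bits of the x-coordinate register.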
In the following, we introduce the design of the top layer, which handles administrative
tasks such as providing a centralized Distinguished Point Buffer (DPB) as well as a Processor
Controller (PC) responsible for the control of the several computational cores. We then describe
a single computational core and its arithmetic unit.
Top Layer
The top layer is composed of the previously mentioned DPB, PC, and a physical Communication
Controller (CC), as depicted by the schematic outline in Figure 6.4. The CC controls the
data input and output from and to the central server, respectively. Recall that no
high-bandwidth communication between FPGA and central server is required since we can
adjust the distinguished property to reduce the load on the communication path.
[Figure 6.4: Top-layer architecture on the FPGA: an I/O controller with arbiter FSM connects the global distinguished point buffer (RAM) to the computational cores via the DIN, RST, and CTL signals.]
A gain in performance can be achieved by using more computational cores on a single FPGA,
computing many trails in parallel. Obviously, this procedure requires careful management
of the data input and output. With many PR cores and the centralized buffer, access to the
DPB demands multiplexing and buffer-locking capabilities to prevent two or more cores from
writing to the buffer at the same time. On the other hand, the DPB is very useful since it
allows for independent external access to the stored distinguished points without the need to
interrupt a core's computation. Figure 6.4 shows the computational cores as an entity which
will be discussed in detail in the following.
Core Layer
For an optimal design, all operations of the core should make use of extensive resource sharing,
because embedded arithmetic function blocks (e.g., the DSP blocks discussed in Chapter 3) are
not available on Spartan-3 FPGAs. Hence, exploiting the design flexibility of reconfigurable
logic, we can combine all costly field operations in a central Arithmetic Unit (AU). The AU
entity will be discussed below.
An important component of the core layer is the core controller. It is primarily responsible
for managing the operations of the AU and delegating its output to memory locations. Another
element of this layer is the main memory, providing the operands to the AU. A schematic of
the core layer architecture is given by Figure 6.5.
The AU demands three concurrent inputs: two inputs provide the operands IN1 and IN2,
one input provides the modulus MOD. The modulus can take two possible values: n = ord(P)
for updating the coefficients ci, di, and p for all other computations. For the operands, we use a
dual-port block memory module which is available on the Spartan-3 FPGA. The dual-
port memory contains all k-bit variables for the actual point addition including the starting
[Figure 6.5: Core-layer architecture: a dual-port RAM holding coefficients and points and a single-port RAM holding the moduli n and p feed the k-bit operands IN1, IN2, and MOD of the modular arithmetic unit (ADD/SUB, MUL/INV); a comparator (CMP) performs the DP test.]
point X, associated coefficients c, d, and temporary values. Besides, it can hold the random
point data (Ri = (xi, yi), ai, bi) for i = 0, . . . , s − 1 in a separate region. Hence, variables and
random data are available via a common access path, requiring no further multiplexing. For the
modulus we use a separate single-port memory module. Compared to two individual registers
with multiplexers and separate control lines, this approach is more area efficient.
Arithmetic Unit
[Figure 6.6: Arithmetic unit: three (k+1)-bit adder/subtracter stages with clock enables (CE), four working registers (REG #1–#4), and multiplexer layers for operand routing and shifting.]
mented by right shifts by one) [MvOV96]. Thus, we use a bit-wise Montgomery multiplication
(cf. Section 4.4.1) and a modified Kaliski inversion algorithm [Kal95]. For more information
regarding the implementation see [Gün06]. All field operations require k-bit additions and sub-
tractions which we realize with Xilinx IP cores. These IP cores are highly optimized for a specific
FPGA and usually provide a better performance-area ratio than conventional implementations.
The final design of our AU is shown in Figure 6.6. For the sake of clarity, the logic for the
shifting units and the multiplexing units is drawn as combined blocks.
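The bit-wise Montgomery multiplication mentioned above can be sketched in software; this is the textbook bit-serial variant (cf. Section 4.4.1), not the VHDL code, computing a·b·2^(−k) mod p with only additions and right shifts:

```python
def mont_mul(a, b, p, k):
    """Bit-serial Montgomery multiplication: returns a*b*2^(-k) mod p,
    assuming p odd and 0 <= a, b < p < 2^k. Each loop iteration adds b if the
    current bit of a is set, makes the accumulator even by conditionally adding
    p, and halves it -- exactly the add/shift primitives available in the AU."""
    t = 0
    for i in range(k):
        t += ((a >> i) & 1) * b      # add b if bit i of a is set
        if t & 1:                    # make t even so the halving is exact mod p
            t += p
        t >>= 1                      # division by two = right shift by one
    if t >= p:                       # final conditional subtraction
        t -= p
    return t

# The result carries a factor 2^(-k); for a check, undo it explicitly.
p, k = 101, 7
assert mont_mul(45, 77, p, k) == 45 * 77 * pow(2, -k, p) % p
```

The extra 2^(−k) factor is handled as usual by keeping operands in the Montgomery domain throughout a chain of multiplications.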
6.5 Results
A Xilinx Spartan-3 device (XC3S1000) was used as target platform for the implementation of
the proposed architecture. Corresponding results obtained from Xilinx ISE 8.2 are presented
in Subsection 6.5.1. For comparison and debugging purposes, a software implementation of
the complete MPPR has been realized on a conventional processor (Pentium M). Based on the
running FPGA implementation, we estimate the cost and performance of an ASIC implemen-
tation of the MPPR in a further step in Subsection 6.5.3 and finally compare our findings in
Subsection 6.5.4.
6.5.1 Synthesis
We synthesized the design for various bit sizes k and the maximum number of cores fitting
on one FPGA. Table 6.1 gives the results for a Xilinx XC3S1000, which provides 17,280 logic
cells in 7,680 slices, with a variable number of cores. Note that the number of occupied slices
includes all top-layer components for management and communication as well. Additionally,
the required number of memory bits and BRAMs for the PRCore and buffer components of the
synthesized design are given. Please note that for the RAM size in bits, the prefix k denotes
2^10 = 1024. With the given time complexity of all core functions, we can provide the exact
number of cycles required for an iteration of MPPR. Along with the maximum clock frequency,
detailed numbers for the throughput of the whole design can be given.
An elliptic curve point addition in affine coordinates can be performed using 6 subtractions,
3 multiplications, and a single inversion. The update of coefficients demands another two
additions. A few administrative cycles are required by the core for resetting (RST) and post-
processing (FIN).

Table 6.2: Required cycles for one MPPR iteration with bit size k.

Finally, we can provide the overall cycle count of a single MPPR iteration,
as shown in Table 6.2, resulting in the following complexity function:
With Equation (6.5), we can easily compute the performance for the selected architectures
including the number of required cycles per iteration and the number of points per second.
The overall system performance for our FPGA-based implementation of MPPR is shown in
Table 6.3.
Note that the numbers above exclude additional cycles for writing to the distinguished point
buffer. Only when computing the ECDLP in small groups (< 80 bit) might the buffer become a
bottleneck, so that the additional cycles for data output have to be taken into account. Additionally,
a larger number of MPPR cores per FPGA increases the demand for a large output buffer.
However, this can easily be avoided by reducing the size of the distinguished point set D. A
direct consequence is longer search trails between distinguished points; thus, two or more cores
requesting a buffer lock at the same time becomes extremely unlikely.
expect an improvement of area-time complexity and, thus, the numbers provided should be
seen as an upper bound on the hardware complexity of MPPR.
Along with the given throughput from Subsection 6.5.2, projections are performed for bit lengths
k = {64, 80, 96, 128, 160} using a fixed distinguished point proportion of Θ = 2^-24. We will
regard the following architectures: First, projections are done for the software reference, running
on a conventional off-the-shelf processor (Pentium M 735 at 1.7 GHz). Second, we will estimate
the performance on the XC3S1000 FPGA. Third, an estimate for a dedicated ASIC will be given
for bit sizes of 128 bit and above¹. Clearly, the performance of the hardware architecture heavily
depends on the bit size, indicated by different runtimes and varying area consumption.
For small bit lengths, several cores can be realized on the FPGA, which slightly decreases
the maximum clock speed of the entire system. However, the additional computational power
compensates for the lower clock speed.
We compute the expected number of points for a specified bit size to determine the ECDLP
of an associated curve. At this point, we discard any negative effect of overwriting points on the
central station, which in fact needs to be considered when the number of available distinguished
points becomes too large to be stored without compression. If we assume a geometric distri-
bution of distinguished points, we can show that only negligible effects from overwriting points
are expected for bit sizes ≤ 96 bit. This statement holds for an available server storage of
2^24 ≈ 1.68 · 10^7 points.
Table 6.4 presents the results for the expected computation time with a single chip (Pen-
tium M, XC3S1000 FPGA, and ASIC). We have determined the expected number of steps to
finish a successful attack based on an interval with TH and TL denoting the upper and lower
limits of the point order n for a given ECC bit size k. Using the mean µ(TH, TL) based on
Equation (6.3), we determined the step complexity for an average n at bit size k. Relating the
number of steps to the performance of the analyzed platforms, a successful computation of the
ECDLP for k > 96 seems unrealistic with any of the presented architectures when considering
only a single chip.
¹ For lower bit sizes, the relatively high NRE costs outweigh the production costs.
6.6 Security Evaluation of ECC
We will estimate the cost of such cryptanalytical systems for different scenarios: We consider
a successful attack against ECC in one year and compare the results to the previously proposed
hardware attacks against RSA in [ST03b, FKP+05]. Finally, we estimate the approximate cost
to solve the ECC challenges given by [Cer97]. Please note that these estimates are based on
2007 figures for hardware costs. For up-to-date comparisons, Moore's Law should be taken into
account.
We assume the cost of a Pentium M 735 processor and a Xilinx XC3S1000 FPGA to be
approximately US$ 220 and US$ 50 per chip, respectively². Additionally, we need housing
and peripheral interconnections for each chip. In the case of FPGAs, we rely on the cost-
efficient parallel architecture COPACOBANA described in Chapter 5. Taking the
amount of US$ 10,000 as a reference, we determine a corresponding number of workstations
² These are market prices for low quantities as of February 2007; prices at high volumes might differ.
Table 6.5: MPPR performance estimates for an ASIC design (10^7 gates, 500 MHz).
with Pentium M processors. Including housing and additional peripherals, we estimate the cost
of a single workstation at US$ 400 in total. Thus, a cluster of 25 Pentium M workstations
costs approximately the same as the COPACOBANA machine with 120 low-cost FPGAs. For
the ASIC design from Subsection 6.5.3, we assume a cost of US$ 50 per chip including overhead,
which seems a fair assumption for production in large quantities. Hence, for US$ 10,000
we can build an ASIC cluster consisting of 200 ASICs. Due to the relatively high NRE costs
of ASICs, however, such a design will only be considered when targeting ECC with bit sizes
of 128 bit and above, where many chips have to be produced. Table 6.5 provides a performance
estimate for an ASIC design. Table 6.6 compares the computational power of a US$ 10,000
MPPR attack for software and hardware cluster implementations. Figure 6.7 provides the
associated runtimes in days for these platform clusters.
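The cluster sizes behind Table 6.6 follow directly from the quoted unit prices; a brief sanity check of the arithmetic (all figures as given in the text):

```python
# Cluster sizes for a fixed US$ 10,000 budget, using the unit costs quoted above.
budget = 10_000

workstation_cost = 400     # Pentium M 735 workstation incl. housing and peripherals
fpga_per_copacobana = 120  # one COPACOBANA machine holds 120 low-cost FPGAs
asic_cost = 50             # assumed ASIC cost per chip incl. overhead (NRE excluded)

workstations = budget // workstation_cost
asics = budget // asic_cost

assert workstations == 25      # cluster of 25 Pentium M workstations
assert asics == 200            # ASIC cluster of 200 chips
# One COPACOBANA (~US$ 10,000) thus competes against 25 CPUs or 200 ASICs;
# by Equation (6.4), attack time scales with 1/(aggregate point throughput).
```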
Table 6.6: MPPR performance comparison in pts/sec for a US$ 10,000 investment, and speed-up
factor compared to the software-based implementation (in parentheses).
[Figure 6.7: Required computation time in days (logarithmic scale) versus ECC bit size k for the compared platform clusters.]
Table 6.7: Expected runtime on different platforms for the Certicom ECC challenges.
of 7.6 billion COPACOBANA machines and amount to approximately US$ 7.6 · 1013 . Hence,
attacks of such dimension are far beyond the feasibility of today’s (monetary) capabilities.
Due to the vast amount of computations, ECC with bit size of 128 bit and above can only be
efficiently attacked with dedicated ASICs. With the estimate from Subsection 6.5.3, a successful
attack against ECC-163 within one year based on 1.16 · 1010 ASICs will cost approximately
US$ 5.8·1011 . This amount still is a factor of 58, 000 and 2, 900 higher compared to the estimates
for breaking RSA-1024 as described in [ST03b] and [FKP+ 05], respectively. Although all figures
are reasonable estimates for recent hardware architectures with no claim for representing a
definite lower bound, we can still assume that an attack on ECC-163 requires significantly more
efforts than an attack on RSA-1024.
For ECC over prime fields GF (p), challenges have been defined for various bit sizes with
relevance to cryptography. Table 6.9 shows the expected runtimes for solving the ECDLP for
different bit lengths and monetary efforts. Certicom provided estimates for the required number
of machine days for solving each challenge based on an outdated Intel Pentium 100 processor.
Consequently, we will additionally provide runtimes for our software reference for comparison
with the hardware implementations. Table 6.7 depicts how fast our architectures can solve these
challenges. A comparison to the estimates given by Certicom in [Cer97] is shown in Table 6.8.
Considering the latest unsolved Certicom challenge over GF(p) (k = 131 bit), we estimate the
required computational power to be at least 62,000 COPACOBANA machines (7.40 · 10^6 FPGAs)
for solving the ECDLP within a year. Unlike the case k = 160, this is an enormous but not an
absolutely unrealistic amount of computational power. With 93,400 ASICs, this challenge would
take one year to finish at a cost of approximately US$ 5,000,000, excluding NRE costs. Analyzing
the last solved challenge with k = 109 bits, we can state that about 300 COPACOBANA machines
are already sufficient to solve this ECDLP in only 30 days at an expense of about US$ 3 million.
For bit lengths exceeding 131 bits, however, the only feasible way to solve the ECDLP is the
application of dedicated ASIC devices. Assuming the high-performance ASICs from Subsection 6.5.3,
we can estimate the runtime of MPPR attacks with respect to different financial backgrounds
(cf. Table 6.9).

Table 6.8: Relative speed-up of different Certicom ECC challenges compared to the Certicom
reference.

Table 6.9: Cost-performance consideration of MPPR attacks with ASICs (10^7 gates, 500 MHz,
NRE costs excluded).
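The scale of such estimates can be reproduced with a small cost model. The following sketch is an illustration only, not the exact model of this chapter: it assumes the textbook expectation of √(π·n/2) point additions for Pollard's Rho on a group of order n ≈ 2^k, a linear parallelization speedup, and a per-machine throughput that is a free (hypothetical) parameter.

```python
import math

def expected_rho_additions(k):
    """Expected total point additions for solving a k-bit ECDLP with the
    parallelized Pollard's Rho method: sqrt(pi * n / 2) for group order
    n ~ 2^k (a linear speedup over the single-processor variant is assumed)."""
    return math.sqrt(math.pi * 2.0 ** k / 2.0)

def machines_needed(k, adds_per_second_per_machine, seconds):
    """Machines required to finish the expected work within a deadline;
    the throughput figure is a free parameter of this illustration."""
    work = expected_rho_additions(k)
    return math.ceil(work / (adds_per_second_per_machine * seconds))
```

For instance, machines_needed(131, r, 365 · 24 · 3600) reproduces the order of magnitude of the figures above once r is set to a measured per-machine point-addition rate.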
6.7 Conclusion
This contribution fills a gap in hardware-based cryptanalysis of elliptic curve cryptosystems:
up to now, no actual hardware implementation of an algorithm solving the elliptic curve discrete
logarithm problem over prime fields had been realized. In this chapter, we described an efficient
architecture for the parallelized Pollard's Rho method (MPPR) to solve the elliptic curve
discrete logarithm problem over prime fields, and we provided results and a detailed analysis of
the first realization of such an architecture in hardware. Besides the hardware implementation,
a software implementation was realized on a conventional processor for comparison and debugging
purposes.
Chapter 6. Parallelized Pollard-Rho Hardware Implementations for Solving the ECDLP
The excellent suitability of the MPPR algorithm for hardware was demonstrated by the use
of low-cost FPGAs (Xilinx Spartan-3) for a proof-of-concept implementation. Compared to
existing architectures for GF(p) arithmetic units, our AU implementation requires 19.995 µs
per point addition at k = 160 bits while occupying about 2,660 slices of our FPGA. Thus, our
implementation provides better performance at a smaller area cost than [OP01, OBPV03], and is
competitive with the architecture of [DMKP04] when balancing the differences between the
varying FPGA types.
For solving the ECDLP on curves of practical interest (160 bits and above), however, FPGA
implementations are still too costly. In order to derive an estimate of the security of ECC in
practice, we provide an estimate for an ASIC implementation based on the outcomes of the FPGA
implementation at hand. As a result, ECC turns out to be more secure than commonly believed:
a successful attack against ECC-163 within one year based on 1.16 · 10^10 ASICs will cost
approximately US$ 5.8 · 10^11. Compared to recent (conservative) estimates for special-purpose
hardware attacking RSA-1024 in [FKP+05], this amount is still a factor of 2,900 larger. However,
we could show that low-security parameters such as those of length 80 and 112 bits proposed by
the SECG in [Cer00a, Cer00b] are vulnerable to such efficient FPGA-based attacks.
Chapter 7

Improving the Elliptic Curve Method in Hardware

7.1 Motivation
The Elliptic Curve Method (ECM) was introduced by H. W. Lenstra [Len87] for integer
factorization, generalizing the concept of Pollard's p − 1 and Williams' p + 1 methods
[Pol74, Wil82]. Although the ECM is known not to be the fastest factorization method with
respect to asymptotic time complexity, it is widely used to factor composite numbers up to
200 bits due to its very limited memory requirements. As of February 2009, the largest factor
revealed by use of ECM was a 222-bit factor of the special integer 10^381 − 1 [Zim09].
Chapter 7. Improving the Elliptic Curve Method in Hardware
The most prominent application that relies on the hardness of the factorization problem is
the RSA cryptosystem. An attacker on RSA is faced with finding the factorization of a composite
number n which consists of two large primes p, q. The RSA parameters n used in practice to
match today's security requirements are 1024 bits and up, and hence out of reach for ECM. To
date, such large bit sizes are preferably attacked with the most powerful methods known so far,
such as the Number Field Sieve (NFS). However, the complex NFS¹ involves the search for
relations in which many mid-sized numbers need to be tested for being "smooth", i.e., composed
only of small prime factors not larger than a fixed bound B. In this context, ECM is an
important tool to determine the smoothness of such integers (i.e., whether they can be factored
into small primes), in particular due to its moderate resource requirements.
The fastest ECM implementations for retrieving factors of composite integers are software-based;
a state-of-the-art system is the GMP-ECM software published by Zimmermann et al. [FFK+09]. As a
promising alternative, efficient hardware implementations of the ECM were first proposed in
2006: Šimka et al. [ŠPK+05] demonstrated the feasibility of implementing the ECM in
reconfigurable hardware by presenting a first proof-of-concept implementation. Their results
were improved by Gaj et al. [GKB+06a] in 2006, who also showed a complete hardware
implementation of ECM stage 2. However, the low-level arithmetic in both implementations was
realized using only straightforward techniques within the configurable logic, which leaves room
for further improvements. To fill this gap, de Meulenaer et al. [dMGdDQ07] proposed an unrolled
Montgomery multiplier based on a two-dimensional pipeline on Xilinx Virtex-4 FPGAs to accelerate
the field arithmetic. However, due to limitations in area and the long pipeline design, their
design only efficiently supports the first stage of ECM.
In this chapter we describe an alternative ECM architecture for reconfigurable devices. Our
focus is to accelerate the underlying field arithmetic of ECM on FPGAs without sacrificing the
option to combine both stage 1 and stage 2 in a single core. Thus, we adopt some high-level
decisions, like memory management and the use of SIMD instructions, from [GKB+06a], which also
supports both stages on the same hardware. To improve the field arithmetic, we use a technique
similar to that introduced in Chapter 3, placing fundamental arithmetic functions like adders
and multipliers in embedded DSP blocks of modern FPGAs. As a second goal, we develop a cluster
architecture for high-performance Virtex-4 FPGAs capable of running our presented ECM
implementation with massive parallelism. The presented cluster adopts similar design criteria
as the COPACOBANA architecture discussed in Chapter 5. The original version of COPACOBANA
provides 120 independent low-cost FPGAs (Xilinx Spartan-3 XC3S1000). The lack of additional
memory or high-speed communication facilities supported the simple design approach and provided
bare computational resources at low cost. However, the usability of COPACOBANA was limited to
applications which do not place high demands on aspects such as the availability of local
memory and high-speed communications. Moreover, although providing a high density of logic
resources, the low-cost Spartan-3 FPGA series offers only rather generic arithmetic support but
no dedicated DSP blocks. More precisely, wide multipliers of more than 160 bits, as typically
required for public-key cryptography (and cryptanalysis), consume large portions of logic when
implemented with conventional structures² (e.g., cf. the MPPR implementation in generic logic
presented in Chapter 6).

¹ The NFS comprises four steps: polynomial selection, relation finding, a linear algebra step,
and finally the square root step. The relation finding step is the most time-consuming, taking
roughly 90% of the runtime. For more information on the NFS refer to [LL93].
Our new COPACOBANA architecture is designed for Virtex-4 FPGA devices³. We employ Virtex-4 SX
devices, which provide a large number of DSP blocks. Unfortunately, the cost per unit of the
largest XC4VSX55 device of this family is disproportionate to the gain in logic; hence we opted
for the second-largest XC4VSX35 device and a (slightly) better cost ratio. However, the cost of
any Virtex-4 device is still much higher compared to the previously used Spartan-3 FPGAs.
Previous versions of COPACOBANA did not require a fast communication interface since
applications performing an exhaustive key search rarely exchange data. However, parameters and
results of ECM operations need to be transferred between the cluster and a host PC. In
particular, some operations, like gcd computations, are costly in hardware. Hence, the
generation of elliptic curves and corresponding parameters is preferably precomputed in software
and transferred to the FPGAs. This demands a higher bandwidth, so that further customization of
COPACOBANA should target the communication interface between the host and the FPGA cluster. The
enhanced design of COPACOBANA based on Virtex-4 FPGAs was developed in joint work with Gerd
Pfeiffer [GPPS08].
7.2 Mathematical Background

The ECM generalizes Pollard's p − 1 method, whose principle is the following: by Fermat's
little theorem, for a prime p, an integer a coprime to p, and any multiple e = k(p − 1) of the
order of Z*_p, we have

a^{k(p−1)} = a^e ≡ 1 (mod p)
⇒ a^e − 1 ≡ 0 (mod p)
⇒ p | (a^e − 1).
² In fact, there are a few 18 × 18-bit multiplier hardcores on XC3S1000 devices, but not
sufficiently many to support complex cryptographic operations.
³ At the time of development, the newer Virtex-5 and low-cost Spartan-3A DSP devices – which
also provide embedded DSP blocks – were not yet readily available.
In this case, gcd(a^e − 1 mod n, n) yields a non-trivial divisor of n (provided a^e ≢ 1 mod n).
However, we are not able to compute e = k(p − 1) without knowledge of p. Thus, we assume that
p − 1 decomposes solely into small prime factors p_i less than a defined bound B1 (in this
case, p − 1 is called B1-smooth) and choose e as the product of all prime powers p_i^{r_i}
lower than B1, hoping that e is a multiple of p − 1:
e = ∏_{p_i ∈ P, p_i < B1} p_i^⌊log_{p_i}(B1)⌋    (7.1)
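As an illustration of how the exponent from Equation (7.1) is used, Pollard's p − 1 stage 1 can be sketched in software as follows (the function names are ours, not from the thesis):

```python
from math import gcd

def stage1_exponent(B1):
    """e = product over primes p_i < B1 of p_i^floor(log_{p_i}(B1)), cf. Eq. (7.1)."""
    sieve = [True] * B1
    e = 1
    for p in range(2, B1):
        if sieve[p]:
            for multiple in range(p * p, B1, p):
                sieve[multiple] = False
            pk = p                       # largest power of p not exceeding B1
            while pk * p <= B1:
                pk *= p
            e *= pk
    return e

def pollard_p_minus_1(n, B1, a=2):
    """If some prime p | n has B1-smooth p - 1, then e is a multiple of p - 1
    and gcd(a^e - 1 mod n, n) reveals a non-trivial divisor (or the method fails)."""
    e = stage1_exponent(B1)
    return gcd(pow(a, e, n) - 1, n)
```

For example, n = 61 · 227 is split by B1 = 20 because 61 − 1 = 2^2 · 3 · 5 is 20-smooth, while 227 − 1 = 2 · 113 is not.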
The group operation in E(Z_n) can be defined by using the given addition formulas [Bos85] and
replaces the multiplication in Z_n as used in the p − 1 method.
Since its proposal, many improvements have been made to ECM. The most fundamental
modification was an extension by Brent introducing a second stage to the ECM to improve the
success probability of the algorithm [Bre86]. In [Mon87], Montgomery presents further vari-
ants of this so-called stage two and introduces a homogeneous parameterization with projective
coordinates that avoid inversions modulo n. Furthermore, by using special elliptic curves in
so-called Montgomery form and by omitting computations for y-coordinates, group operations
of such elliptic curves can be performed with only 6 and 5 modular multiplications per point
addition and doubling on E, respectively. In addition to that, the construction of elliptic curves
with group order divisible by 12 or 16 [AM93, Mon87] increases the probability of obtaining a
group order that is B1-smooth.
Now, we briefly review the ECM using elliptic curves E that are defined by the (homogeneous)
Weierstrass Equation:
E : y^2·z = x^3 + a·x·z^2 + b·z^3    (7.2)
All computations based on Equation (7.2) are performed using projective coordinates for points
X = (x, y, z). Let us assume q to be a factor of n and |E(Z_q)| to be B1-smooth, so that e –
according to the construction in Equation (7.1) – is a multiple of |E(Z_q)|. Note that point
multiplication by |E(Z_q)| (or multiples of |E(Z_q)|) returns the point at infinity, e.g.,
Q = eP = O for an arbitrary point P and resulting point Q. Recall that the resulting point
Q = O implies a prior impossible division by z_Q, so that z_Q ≡ 0 mod q. Note that we actually
perform all point operations in E(Z_n) since we do not know q. Hence, we compute Q = eP in
E(Z_n) and hope to obtain a point Q with coordinate z_Q ≢ 0 mod n but z_Q ≡ 0 mod q. Then, the
factor q of n is revealed by computing gcd(z_Q, n).
From an algorithmic point of view, we can discover a factor q of n as follows: in the first stage
of ECM, we compute Q = eP where e is a product of prime powers pi ≤ B1 with appropriately
chosen smoothness bounds. The second phase of ECM checks for each prime B1 < p ≤ B2
whether pQ reduces to the neutral element in E(Zq ). Algorithm 7.1 summarizes all necessary
steps for both stages of ECM. Stage 2 can be done efficiently, e.g., using the Weierstrass form
and projective coordinates pQ = (x_pQ : y_pQ : z_pQ), again by testing whether gcd(z_pQ, n) > 1.
Note that we can avoid all gcd computations except for a final one at the expense of one modular
multiplication per gcd and accumulating all intermediate results in a product modulo n. On
this product, we perform a single gcd at the end.
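The trade of one gcd per candidate against one modular multiplication each plus a single final gcd can be sketched as follows (an illustration; names are ours):

```python
from math import gcd

def batched_gcd_check(z_coords, n):
    """Accumulate s = prod(z_i) mod n with one modular multiplication per
    candidate, then perform a single gcd at the end. Caveat: if s ends up
    congruent to 0 mod n, one must fall back to individual gcds."""
    s = 1
    for z in z_coords:
        s = (s * z) % n
    return gcd(s, n)
```

For n = 7 · 13 = 91, the batch [3, 14, 5] yields gcd(210 mod 91, 91) = 7, i.e., the factor contributed by z = 14 survives the batching.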
If we regard a single curve only, the properties of ECM are closely related to those of Pollard’s
(p − 1)-method that can fail by returning a trivial divisor. The advantage of ECM is apparent
with the possibility of choosing another curve (and thus group order) after each unsuccessful
trial, increasing the probability of retrieving factors of n. If the final gcd of the product s and
n satisfies 1 < gcd(s, n) < n, a factor is found. Note that the parameters B1 and B2 control
the probability of finding a divisor q. More precisely, if the order of P factors into a product
of coprime prime powers (each ≤ B1) and at most one additional prime between B1 and B2,
the prime factor q is discovered. It is possible that more than one or even all prime divisors of
n are discovered simultaneously. This happens rarely for reasonable parameter choices and can
be ignored by proceeding to the next elliptic curve. The runtime of ECM is given by
T(q) = e^((√2 + o(1))·√(log q · log log q))
operations, thus, it mainly depends on the size of the factors to be found and not on the size of
n [Bre86]. Note, however, that the underlying operations are computed modulo n, so that the
runtime of each single operation still depends on the size of n. To generate elliptic curves, we
start with an initial point P and construct an elliptic curve such that P lies on it. As already
mentioned, curves with particular properties (e.g., a group order divisible by 12 or 16) can
increase the probability of success. The construction of such curves and corresponding ECM
parameters, for example the parametrization4 , is more thoroughly discussed in [ZD06, BBLP08].
Note that Montgomery curves always have an order divisible by 4, which implies that not every
curve in Weierstrass form can be transformed to Montgomery form.
Formulas for point addition on Montgomery curves can be given that do not involve any
computations of y; hence it is not possible to distinguish between (x, y) and (x, −y).
Consequently, we need additional information when constructing an addition chain for point
multiplications, namely the difference P − Q of the two points P, Q. The sum P + Q can then be
determined from P, Q and P − Q using 6 multiplications⁵, and a point doubling 2P takes 5
multiplications, assuming that the fixed quantity (a + 2)/4 is precomputed. Note that it is
also possible to reduce the combined addition formulas for both operations to 10 multiplications
by fixing z_P = 1, and even to 9 multiplications when additionally choosing a small x_P, e.g.,
x_P = 2.
⁴ A popular parametrization method for Montgomery curves with a group order divisible by 12 is
due to Suyama [ZD06].
⁵ Note that we do not distinguish between multiplications and squarings since we will not
implement a dedicated hardware squaring unit. This a priori limits our requirements on hardware
resources. Besides, we do not regard modular additions in any runtime consideration since they
can be done in parallel to modular multiplications.
Addition (given P, Q and the difference P − Q, in x/z-coordinates [Mon87]):

x_{P+Q} = z_{P−Q}·[(x_P − z_P)·(x_Q + z_Q) + (x_P + z_P)·(x_Q − z_Q)]^2
z_{P+Q} = x_{P−Q}·[(x_P − z_P)·(x_Q + z_Q) − (x_P + z_P)·(x_Q − z_Q)]^2

Doubling:

4·x_P·z_P = (x_P + z_P)^2 − (x_P − z_P)^2
x_{2P} = (x_P + z_P)^2·(x_P − z_P)^2
z_{2P} = 4·x_P·z_P·[(x_P − z_P)^2 + ((a + 2)/4)·(4·x_P·z_P)]
Since the formulas require the availability of P, Q and P − Q, Montgomery proposed a suitable
point multiplication method denoted as the Montgomery ladder [Mon87], which was later also
extended to other abelian groups by Joye and Yen [JY03]. Let (mP, (m + 1)P) be the current
state of the point multiplication, where m is a scalar and P an arbitrary point on E_M; then
one ladder step determines either (2mP, (2m + 1)P) or ((2m + 1)P, (2m + 2)P), depending on
whether the next processed scalar bit is 0 or 1, respectively. For more information regarding
the Montgomery ladder, see [Mon87, JY03].
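For concreteness, the ladder and the x-only group operations can be sketched in software (an illustration with the standard X/Z formulas for Montgomery curves B·y^2 = x^3 + A·x^2 + x from [Mon87], where a24 = (A + 2)/4; helper names are ours):

```python
def xdbl(X, Z, a24, p):
    """Point doubling with 5 multiplications (squarings counted as multiplications)."""
    s = (X + Z) * (X + Z) % p          # (X+Z)^2
    d = (X - Z) * (X - Z) % p          # (X-Z)^2
    t = (s - d) % p                    # 4*X*Z
    return s * d % p, t * ((d + a24 * t) % p) % p

def xadd(XP, ZP, XQ, ZQ, Xd, Zd, p):
    """Differential addition: x(P+Q) from P, Q and the difference P-Q
    (6 multiplications)."""
    u = (XP - ZP) * (XQ + ZQ) % p
    v = (XP + ZP) * (XQ - ZQ) % p
    return Zd * (u + v) ** 2 % p, Xd * (u - v) ** 2 % p

def ladder(m, x, A, p):
    """Montgomery ladder: keeps the pair (R0, R1) = (tP, (t+1)P) for the
    processed scalar prefix t and returns x(mP) projectively as (X, Z)."""
    a24 = (A + 2) * pow(4, -1, p) % p
    X0, Z0, X1, Z1 = 1, 0, x, 1        # (O, P); difference is always P
    for bit in bin(m)[2:]:
        if bit == '0':
            X0, Z0, X1, Z1 = (*xdbl(X0, Z0, a24, p),
                              *xadd(X0, Z0, X1, Z1, x, 1, p))
        else:
            X0, Z0, X1, Z1 = (*xadd(X0, Z0, X1, Z1, x, 1, p),
                              *xdbl(X1, Z1, a24, p))
    return X0, Z0
```

The consistency of xadd and xdbl can be checked, e.g., by comparing the ladder output for m = 6 with a doubling of the ladder output for m = 3.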
Recently, an alternative form of elliptic curves suggested by Edwards [Edw07] was proposed by
Bernstein et al. for use with ECM [BBLP08]. Twisted Edwards curves in homogeneous inverted
coordinates are given by the following equation:

E_E : (x^2 + a·y^2)·z^2 = x^2·y^2 + d·z^4    (7.6)
Compared to ECM with Montgomery curves, a variant of such Edwards curves with twisted inverted
coordinates [BBJ+08] is reported to lead to even shorter computation times in software
[BBLP08]. Although more operations are required in total, a gain in performance is achieved by
choosing input and curve twist parameters with small heights and a more efficient method for
point multiplication compared to the Montgomery ladder. The point addition on (specifically
constructed) Edwards curves takes only 6 full-size multiplications and 6 small-height
multiplications. A point doubling on E_E can be done with 7 full-size and 2 small-height
multiplications. Besides an adaptation of the original GMP-ECM software for such Edwards curves
[BBLP08], the authors also implemented ECM very efficiently on modern graphics cards using the
CUDA framework [BCC+09]. The formulas for the group operations P + Q and 2P on Edwards curves
(in inverted twisted coordinates) are given as follows [BBJ+08]:
Addition:

x_{P+Q} ≡ (x_P·x_Q·y_P·y_Q + d·z_P^2·z_Q^2)·(x_P·x_Q − a·y_P·y_Q) mod n    (7.7)
y_{P+Q} ≡ (x_P·x_Q·y_P·y_Q − d·z_P^2·z_Q^2)·[(x_P + y_P)·(x_Q + y_Q) − x_P·x_Q − y_P·y_Q] mod n
z_{P+Q} ≡ z_P·z_Q·(x_P·x_Q − a·y_P·y_Q)·[(x_P + y_P)·(x_Q + y_Q) − x_P·x_Q − y_P·y_Q] mod n
Doubling:

x_{2P} ≡ (x_P^2 + a·y_P^2)·(x_P^2 − a·y_P^2) mod n
y_{2P} ≡ [(x_P + y_P)^2 − x_P^2 − y_P^2]·(x_P^2 + a·y_P^2 − 2d·z_P^2) mod n
z_{2P} ≡ (x_P^2 − a·y_P^2)·[(x_P + y_P)^2 − x_P^2 − y_P^2] mod n
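These group operations can be transcribed directly for a software cross-check (an illustration; helper names are ours). Since the addition law of Equation (7.7) also holds for P = Q, evaluating it at Q = P must agree projectively with the dedicated doubling formulas of [BBJ+08] on any curve point:

```python
def ed_add(P, Q, a, d, p):
    """Point addition in inverted twisted Edwards coordinates, Eq. (7.7)."""
    (xP, yP, zP), (xQ, yQ, zQ) = P, Q
    A = xP * xQ % p                     # x_P x_Q
    B = yP * yQ % p                     # y_P y_Q
    C = d * pow(zP * zQ, 2, p) % p      # d z_P^2 z_Q^2
    E = ((xP + yP) * (xQ + yQ) - A - B) % p
    F = (A - a * B) % p                 # x_P x_Q - a y_P y_Q
    return ((A * B + C) * F % p, (A * B - C) * E % p, zP * zQ * F * E % p)

def ed_dbl(P, a, d, p):
    """Point doubling in inverted twisted Edwards coordinates [BBJ+08]."""
    x, y, z = P
    A = x * x % p
    B = y * y % p
    C = (A + a * B) % p
    D = (A - a * B) % p
    E = ((x + y) * (x + y) - A - B) % p
    return (C * D % p, E * (C - 2 * d * z * z) % p, D * E % p)

def inverted_point(a, d, p):
    """Brute-force an affine point (u, v) with u*v != 0 on the twisted Edwards
    curve a*u^2 + v^2 = 1 + d*u^2*v^2 and return it in inverted coordinates
    (x : y : z) = (v : u : u*v)."""
    for u in range(1, p):
        for v in range(1, p):
            if (a * u * u + v * v - 1 - d * u * u * v * v) % p == 0:
                return (v, u, u * v % p)
    raise ValueError("no suitable point found")
```

On a curve point, ed_add(P, P, ...) and ed_dbl(P, ...) differ exactly by the projective factor z_P^2, which can be verified by comparing coordinate cross-products.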
Next, we will elaborate on an optimal instruction schedule for point addition and point
doubling on Montgomery and Edwards curves. Let e be an n-bit scalar used in the computation
Q = e · P of ECM stage 1, with binary representation e = Σ_{i=0}^{n−1} e_i·2^i, where
e_i ∈ {0, 1}. With Montgomery curves, we need to compute 9 full-width multiplications for each
bit e_i (9n in total), assuming that z_P = 1 and x_P is small. Stage 2 does not allow any
preliminary simplifications, so that we need to compute 11m multiplications, where m denotes
the number of accumulated point operations depending on the continuation method for ECM stage 2
(cf. Algorithm 7.1).
With Edwards curves, it is possible to use a state-of-the-art point multiplication algorithm
with addition chains instead of the Montgomery ladder6 . Such advanced methods for point
multiplication do not require a point addition for each bit e_i of the scalar, which reduces
the overall operation count compared to the Montgomery ladder. The point multiplication selected
in [BBLP08] is the signed sliding window method. Note that – depending on the algorithm –
the determination of an optimal addition chain is not always straightforward so that the im-
plementation of a corresponding hardware circuit can be costly. Thus, many concepts for point
multiplication with optimal addition chains (which can significantly accelerate computations
in software) cannot easily be applied in hardware7 . For Edwards curves we will thus assume
the (straightforward) sliding window technique for which efficient hardware implementations
are known [SBG+ 03]. Summarizing, computations on Edwards curves take a constant number
of (7 + 2r) modular multiplications per point doubling and (6 + 6r) modular multiplications
for a point addition where r is a factor representing the clock cycle ratio of a small height
multiplication with respect to a full multiplication8 .
At this point, we need to know the complexity functions of the individual operations in order
to determine the ratio r of small-height multiplications with respect to a standard
multiplication. We elaborate on this further in Section 7.3.3.
7.3 Implementing an ECM System for Xilinx Virtex-4 FPGAs

We cannot directly reuse the ECDSA core presented in Chapter 3, since its modular arithmetic
supports only a special modulus. Hence, to support arbitrary moduli, we decided to implement a
high-radix Montgomery multiplication algorithm [Mon85] and took care that all basic arithmetic
functions are handled by DSP blocks.
Montgomery's original multiplication method (cf. Section 4.4.1) has been improved in many
publications in the open literature. Orup proposed rewritten variants of Montgomery's
multiplication algorithm, one of which simplifies the quotient handling [Oru95]. Straightforward
quotient handling is particularly important since it allows consecutive arithmetic operations in
the DSP blocks with realignment of operands. Orup's improvements demand only word-wise (i.e.,
k-bit) multiplications, additions and shifts, all of which are natively supported by the DSP
blocks of Virtex-4 FPGAs. Hence, using Orup's variant, all arithmetic operations can be
performed by sequential instructions in DSP blocks (by issuing a different operation mode, or
opmode, to the DSP block) without the need for additional resources in the configurable logic.
For our hardware design of the ECM, we will thus use Orup's Modular Multiplication with
Quotient Pipelining as given in Algorithm 7.2. Note that Suzuki [Suz07] already promoted the
use of Orup's algorithms with DSP blocks. However, Suzuki used static opmodes and thus
performed all additions in the configurable logic instead of using the embedded adder of the
DSP block. Hence, Suzuki's approach turns out to be more costly in terms of resources and
clock cycles.
We will now briefly revise Orup's modification of the Montgomery multiplication as introduced
in Section 4.4.1. Let M > 3 be a modulus such that M and 2 are coprime. We fix the delay
parameter d = 0 in Orup's original notation, simplifying our implementation. Since the
multiplier in DSP blocks has a static width, we choose the word width k = 17. Let β be the
maximum number of words required to represent all operands, determined by M̃ = M′ · M with
M′ ≡ −M^(−1) mod 2^k and the constraint 4M̃ < 2^{k·β}. We define R to be an integer with
R ≡ 2^{kβ} mod M. Let A and B be factors in radix-2^k representation such that
0 ≤ A, B ≤ 2M̃. The result of Algorithm 7.2 is S_{β+1} ≡ A · B · R^(−1) mod M with
0 ≤ S_{β+1} < 2M̃.
We now present our modified high-radix multiplication algorithm that can be implemented almost
solely with DSP blocks. In addition to Orup's description, we define the number of required DSP
blocks δ as an additional parameter. For small multiplier configurations, δ can be chosen equal
to the number of blocks β; note that δ should not exceed the maximum number of DSP blocks
available in a single column⁹. In this case, each 17 × 17-bit multiplication is performed by a
dedicated DSP block. In order to realize larger multipliers, a cascade of DSP blocks can be
used to process the individual limbs of a multi-precision multiplication in several iterations.
In such a case, let β = δ·r with r ∈ N, where r denotes the number of iterations for each DSP
block. This iterative approach is also followed in [Suz07]: Suzuki's multiplier comprises
δ = 17 DSP blocks that can handle 512-bit up to 2048-bit moduli, depending on the number of
iterations r ∈ {2, 4, 6, 8}. Since we assume the ECM to handle relatively small moduli (e.g.,
we target composite integers n that are smaller than 200 bits), we focus solely on the special
case δ = β.
To use DSP blocks, all operands and intermediate results need to be handled in radix-2^k
representation with a data path width of k = 17. The input parameters A and B and the result
S_{β+1} are of the same order of magnitude (i.e., S_{β+1} < 2^{k·β}), so that the result can
be used as an input parameter again. In addition, all intermediate and final results S_i and
S_{β+1} are less than 2M̃, so that all parameters can be expressed with a maximum number of
β blocks.
Lemma 7.1 Given an operand with h = ⌊log_2(M)⌋ + 1 bits, the number of blocks β for the
Montgomery multiplication method given by Algorithm 7.2 with word size k is determined by

β = 1 + ⌈(h + 2)/k⌉.

Proof: Note that M̃ = (M′ mod 2^k)·M and 4M̃ < 2^{kβ}. Thus, the maximum possible value of M̃
is M̃_max = (2^k − 1)(2^h − 1). Hence,

4·M̃_max < 2^{k+h+2} ≤ 2^{k·β},

and thus

k + h + 2 ≤ k · β ⇔ β = ⌈(k + h + 2)/k⌉ = 1 + ⌈(h + 2)/k⌉.
Lemma 7.2 relates the number of DSP blocks to a configuration supporting integer
multiplications for operands of up to h bits. Note that we allocate one DSP block for each of
the β blocks of the input.
⁹ Depending on the device size, a single column of a Virtex-4 FPGA comprises 32 or 48 DSP
blocks.
Algorithm 7.3 Modular multiplication with quotient pipelining optimized for DSP blocks

Input: multiplicand A = Σ_{i=0}^{β−1} a_i·(2^k)^i; multiplier B = Σ_{i=0}^{β} b_i·(2^k)^i;
0 ≤ A, B ≤ 2M̃; M′′ = Σ_{i=0}^{β−1} m_i·(2^k)^i; 0 < M′′ < 2^h; word width k = 17; number of
blocks β.
Output: S_{β+1} ≡ A · B · R^(−1) (mod M); S_i = Σ_{j=0}^{β−1} S_{i,j}·(2^k)^j;
0 ≤ S_{β+1} < 2M̃.

S_{0,j} ← 0
q_0 ← 0
for i = 0 to β do
    q_i ← S_{i,0} (mod 2^k)
    S_{i+1,0} ← S_{i,1} (mod 2^k) + q_i · m_0 + b_i · a_0 (mod 2^k)
    for j = 1 to β − 1 do
        S_{i+1,j} ← S_{i,j+1} (mod 2^k) + q_i · m_j + b_i · a_j + ⌊S_{i+1,j−1}/2^k⌋ (mod 2^k)
    end for
    S_{i+1,β} ← q_i · m_β + b_i · a_β + ⌊S_{i+1,β−1}/2^k⌋ (mod 2^k)
end for
return S_{β+1}
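The recurrence in Algorithm 7.3 is a pipelined variant of word-serial Montgomery multiplication. The following sketch shows the plain radix-2^k word-serial form (without Orup's quotient pipelining) from which such algorithms are derived; it is a software illustration, not a model of the DSP pipeline:

```python
def mont_mul(a, b, M, k, beta):
    """Word-serial Montgomery multiplication, radix 2^k (M must be odd).
    Returns S with S = a * b * 2^(-k*beta) (mod M) and 0 <= S < 2*M,
    provided 0 <= a, b < M < 2^(k*beta)."""
    mask = (1 << k) - 1
    m_inv = pow(-M, -1, 1 << k)           # M' = -M^(-1) mod 2^k
    S = 0
    for i in range(beta):
        S += ((a >> (k * i)) & mask) * b  # add the partial product a_i * b
        q = (S & mask) * m_inv & mask     # quotient digit: q = S * M' mod 2^k
        S = (S + q * M) >> k              # S + q*M is divisible by 2^k
    return S
```

Multiplying the result by 2^{k·β} mod M recovers a · b mod M, i.e., the routine keeps operands in Montgomery representation just as the hardware multiplier does.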
Lemma 7.2 Given a number of δ > 2 DSP blocks for a single-pass Montgomery multiplication
method as in Algorithm 7.2 with word size k, the maximum width h of the operands is given by
h = k(δ − 1) − 2.
Proof: In a single-pass computation (δ = β), the trivial upper bound for δ > 2 DSP blocks is
kδ bits. However, from the assumptions of Algorithm 7.2 we need to enforce that we can also
represent 4M̃ with β blocks. In other words,

4M̃ < 2^{kδ} ⇔ M̃ < 2^{kδ−2}.

Recalling that M̃ = (M′ mod 2^k)·M < 2^{k+h} holds, we can simply extract the exponents,
yielding

k + h = kδ − 2 ⇒ h = k(δ − 1) − 2.
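The two lemmas translate into a simple sizing rule, sketched here with hypothetical helper names:

```python
from math import ceil

def num_blocks(h, k=17):
    """Lemma 7.1: number of blocks (and of DSP units, for delta = beta)."""
    return 1 + ceil((h + 2) / k)

def max_operand_bits(delta, k=17):
    """Lemma 7.2: maximum operand width supported by delta DSP blocks."""
    return k * (delta - 1) - 2
```

num_blocks(151) yields 10 and max_operand_bits(16) yields 253, matching the configurations discussed in the text.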
Consequently, an architecture based on 10 DSP blocks can support moduli of up to 151 bits, and
one with 16 DSP units of up to 253 bits, respectively. A version of the Modular Multiplication
with Quotient Pipelining algorithm rewritten for use with DSP blocks is shown in Algorithm 7.3.
The implementation of Algorithm 7.3 is depicted in Figure 7.1. Each DSP block successively
computes and accumulates a_i · b_j, m_i · q_j and S_{i,j+1} with a total latency of 5 clock
cycles (one per accumulation operation and two for additional input/output registers). This has
to be repeated β times for all blocks a_i with i = 0, . . . , β − 1, plus an additional final
round. Hence, the total runtime of a full h-bit multiplication with a pipeline consisting of
δ DSP blocks is

C_MUL(h) = 5(β + 1) + δ + 1 = 5(2 + (h + 2)/k) + 2 + (h + 2)/k = 12 + 6(h + 2)/k.

As an example, a full 151-bit multiplication using δ = 10 DSP blocks takes C_MUL(151) = 66
clock cycles. Note that this algorithm can also be reused repeatedly by using results as inputs and
Figure 7.1: Generic Montgomery multiplier designed for use with Virtex-4 DSP blocks.
thus is not limited to the use with ECM. As already mentioned, our modular multiplier is
capable of performing accumulation within the DSP block by dynamic opmode manipulation and thus
computes the final product with a reduced number of cycles compared to the design of [Suz07].
Analogously, we obtain

C_SDMUL(h) = 10 + δ + 1 = 12 + (h + 2)/k

cycles for a single-digit multiplication. The ratio between full and single-digit
multiplications can thus be obtained as ⌊C_MUL/C_SDMUL⌋. Assuming the most practical bit range
49 < h < 304 for
hardware-based ECM (cf. parameter sizes in previous work), three single-digit multiplications
can be performed in the same time as one multiplication with two full-width operands.
The modular adder/subtracter is based on the same unit proposed in Section 3.4.2. The time
complexity of this component is even less than that of a single-digit multiplication since it
processes words at double precision (i.e., 2k = 34 bits). More precisely, the number of clock
cycles for a modular addition or subtraction based on two DSP units with a pipeline length of
4 cycles is given by

C_ADD/SUB(h) = 4 + h/(2k).

For simplicity, we will assume C_ADD/SUB < C_SDMUL, so that three modular additions or
subtractions can be executed in parallel to one full-width multiplication.
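Collecting the cycle-count formulas of this subsection into one small model (a sketch; the ceilings are ours, for word counts that do not divide evenly):

```python
from math import ceil

def c_mul(h, k=17):
    """Full h-bit Montgomery multiplication: 12 + 6*(h+2)/k cycles."""
    return 12 + 6 * ceil((h + 2) / k)

def c_sdmul(h, k=17):
    """Single-digit (k-bit by h-bit) multiplication: 12 + (h+2)/k cycles."""
    return 12 + ceil((h + 2) / k)

def c_addsub(h, k=17):
    """Modular addition/subtraction at double word width: 4 + h/(2k) cycles."""
    return 4 + ceil(h / (2 * k))
```

For h = 151 and k = 17 this gives 66, 21 and 9 cycles, respectively, and ⌊c_mul/c_sdmul⌋ = 3, i.e., three single-digit operations fit into the runtime of one full multiplication.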
We now identify the optimal number of arithmetic units for fast instruction scheduling of both
point doubling and point addition. Based on our experiments, we found the use of two parallel
multipliers and one combined adder/subtracter to be optimal for both Edwards and Montgomery
curves. More parallelism in terms of arithmetic units turned out not to be beneficial due to
data dependencies; in other words, adding further arithmetic units to the system resulted in
computational idle times, since the components need to wait until previous computations have
finished. An example of such a dependency is given by the computation of y_{2P} (Edwards
coordinates), where two multiplications (T_1 = z_P^2, T_2 = 2d·T_1), two subsequent
subtractions (T_3 = P_1 − P_2 − T_2) and a final multiplication (T_4 = P_3·T_3) need to be
performed in sequence, since each input T_i depends on the previous result T_{i−1} (note that
we do not care about the P_j values, since we assume they are computed concurrently with the
T_i operations).
Next, we focus on the instruction scheduling of the explicit formulas for point addition and
point doubling. For the given setup with two multipliers and one adder/subtracter, we present
a combined sequence of operations for an interleaved point doubling and point addition based
on the formulas in Montgomery and Edwards form. The combination of both point operations
into a single sequence of instructions allows further reduction of computational idle times. This
is particularly beneficial when using the Montgomery ladder where point addition and point
doubling follow in sequence in each step. In contrast, the more advanced point multiplication
techniques used with Edwards curves compute a less predictable sequence of point
doublings and point additions. However, this is not an issue: to perform a sole point doubling
independent of the point addition, the instruction sequence can be terminated right after the
last arithmetic operation of the point doubling has finished (additional computations for point
addition performed up to this point can operate on dummy inputs in such a case). We opti-
mized the instruction flow to avoid a performance penalty by the interleaved processing of both
operations. Table 7.1 shows the operation schedule for a single step on the Montgomery ladder
(i.e., combined point addition and point doubling) in the case zP −Q = 1. For the instruction
schedule, we assumed that at least two additions/subtractions can be issued in parallel to one
multiplication. Note that optimal memory and register allocation is not included in the listing
Table 7.1: Combined point addition and doubling (2P and P + Q) on Montgomery curves for
the case zP −Q = 1.
for better readability. Instead, we denote by A_i a register used for point addition, by D_i
one used for point doubling, and by AD_i one used by both operations.
Similarly, Table 7.2 presents combined formulas for point doubling and subsequent point
addition for Edwards curves with inverted, twisted coordinates. Assuming practical bit
lengths h with 49 < h < 304, we allocated in our model three time slots for modular ad-
ditions/subtractions or single-digit multiplications that can be individually issued during the
runtime of one full multiplication.
Based on our model, the combined point addition and doubling based on two multipliers and one adder unit takes 9 CFULL cycles. The sequence for Montgomery coordinates requires 5 CFULL + 2 CADD/SUB cycles. Although more expensive in terms of the total number of operations, the point multiplication methods applicable to Edwards curves perform mainly point doublings, which take 5 CFULL cycles. Evaluating the minimum requirements for a point multiplication method to be more efficient than the Montgomery ladder, we obtain a ratio between point doublings (DBL) and point additions (ADD) of #DBL/#ADD > 13.7 for practical operand lengths with h > 100 bits. Thus, a ratio in favor of Edwards curves is only likely for very long scalars e of several thousand bits with large window sizes. Since ECM parameters for hardware-based implementations need to be smaller due to the limited amount of available on-chip memory10, it is unlikely that we can reach this threshold with FPGAs.
Hence, although Edwards curves were reported to be more efficient in software, we finally select Montgomery curves for our hardware implementation, mostly due to memory restrictions.
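The combined ladder step of Table 7.1 can be illustrated in software. The following sketch uses the standard XZ formulas rather than the register names of the listing; the field size, curve coefficient, and base point are toy assumptions, far smaller than the moduli targeted by the hardware:

```python
# Illustrative sketch of the x-only Montgomery ladder with combined differential
# addition and doubling under z_{P-Q} = 1. Toy parameters only (assumptions); the
# hardware schedule of Table 7.1 interleaves these operations on two multipliers.
P_MOD = 1009                                  # toy prime field (assumption)
A = 7                                         # curve y^2 = x^3 + A*x^2 + x, B = 1
A24 = (A + 2) * pow(4, -1, P_MOD) % P_MOD     # precomputed (A + 2)/4

def x_dbl(X, Z):
    """Point doubling in XZ coordinates."""
    t0 = (X + Z) * (X + Z) % P_MOD
    t1 = (X - Z) * (X - Z) % P_MOD
    t2 = (t0 - t1) % P_MOD                    # equals 4*X*Z
    return t0 * t1 % P_MOD, t2 * (t1 + A24 * t2) % P_MOD

def x_add(XP, ZP, XQ, ZQ, x_diff):
    """Differential addition: x(P + Q) from P, Q and x(P - Q) with z_{P-Q} = 1."""
    u = (XP - ZP) * (XQ + ZQ) % P_MOD
    v = (XP + ZP) * (XQ - ZQ) % P_MOD
    return (u + v) ** 2 % P_MOD, x_diff * (u - v) ** 2 % P_MOD

def ladder(k, x_base):
    """Montgomery ladder: invariant R1 - R0 = P keeps z_{P-Q} = 1 throughout."""
    X0, Z0, X1, Z1 = 1, 0, x_base, 1          # R0 = O, R1 = P
    for bit in bin(k)[2:]:
        if bit == '1':
            (X0, Z0), (X1, Z1) = x_add(X0, Z0, X1, Z1, x_base), x_dbl(X1, Z1)
        else:
            (X1, Z1), (X0, Z0) = x_add(X0, Z0, X1, Z1, x_base), x_dbl(X0, Z0)
    return X0, Z0                             # x(kP) = X0/Z0 (projective)
```

Note that every ladder step consumes both intermediate points in one addition and one doubling, which is exactly the data dependence the interleaved schedule of Table 7.1 exploits.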
10 Recall that only the strictly limited bounds B1 = 960 and B2 = 57000 were used in previous hardware implementations [ŠPK+05, GKB+06a]. Larger bounds will exceed the available memory for storing prime tables on the FPGA required for the second stage of ECM.
Table 7.2: Combined point addition and doubling (2P and P + Q) in inverted, twisted Edwards coordinates. Bold-faced operations denote full-size modular multiplications with h × h bits; all other operations take at most a third of the runtime of the full multiplication. The double line marks a possible early termination point of the instruction sequence to perform a sole point doubling.

6.1  A7 = xQ + yQ      A4 = yP · yQ    A2 = A1²
6.2  A8 = xP + yP      A6 = a · A4
6.3  A8 = A7 · A8
7.1  A6 = A3 − A6      A2 = d · A2     A5 = A3 · A4
7.2  x2P+Q = A5 + A2
8.1  A8 = A8 − A3      x2P+Q = xP+Q · A6    z2P+Q = A1 · A6
8.2  A8 = A8 − A4
8.3  y2P+Q = A5 − A2
9    y2P+Q = yP+Q · A8    z2P+Q = z2P+Q · A8
In this section, we develop an architecture based on the modular multiplier proposed in Section 7.3.1. In our implementation, we adopt the high-level design, including the global instruction set and memory management, presented in [GKB+06a]. Hence, we developed a similar multi-core design which can support both stages 1 and 2 of ECM. Each core is designed to compute the group operations on the elliptic curves; supplementary operations such as gcd computations and parameter generation for the elliptic curves are performed off-chip. Thus, an ECM core consists
of an arithmetic unit for modular multiplication and addition and control logic for computing
a point multiplication step. For both stages, all instructions are kept in a global instruction memory and issued to all cores in parallel, i.e., the individual cores operate in Single Instruction Multiple Data (SIMD) fashion. The global control also includes a few additional ROM tables for stage 2 to drive the standard continuation method proposed by Montgomery [Mon87]. We also considered the improvements for the standard continuation method in hardware (e.g., bit tables for prime representation) proposed in [GKB+06a]. Figure 7.2 depicts the architecture of the ECM system implemented in this work.

Figure 7.2: Each FPGA contains an individual ECM system with multiple ECM cores operating in SIMD fashion. Factorization of different integer bit lengths can be supported by different FPGAs.
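This division of labor can be made concrete with a software sketch of stage 1. All names, the toy bounds, and the naive curve sweep below are illustrative assumptions (a real implementation would use a proper parametrization, e.g., Suyama curves); the point of the sketch is that the cores perform only x-only ladder arithmetic modulo n, while curve generation and the final gcd are the off-chip supplementary operations:

```python
# Hedged sketch of ECM stage 1: x-only Montgomery ladder arithmetic modulo the
# composite n (the work of an ECM core), with curve selection and the final gcd
# as off-chip "supplementary operations". Toy parameters; assumes n is odd.
import math

def stage1_scalar(B1):
    """k = product of all maximal prime powers q^e <= B1."""
    k, sieve = 1, [True] * (B1 + 1)
    for q in range(2, B1 + 1):
        if sieve[q]:
            for m in range(q * q, B1 + 1, q):
                sieve[m] = False
            qe = q
            while qe * q <= B1:
                qe *= q
            k *= qe
    return k

def ladder_z(k, x, a24, n):
    """Run the Montgomery ladder mod n; return the final Z coordinate of kP."""
    def dbl(X, Z):
        t0, t1 = (X + Z) ** 2 % n, (X - Z) ** 2 % n
        t2 = (t0 - t1) % n
        return t0 * t1 % n, t2 * (t1 + a24 * t2) % n
    def add(XP, ZP, XQ, ZQ):
        u = (XP - ZP) * (XQ + ZQ) % n
        v = (XP + ZP) * (XQ - ZQ) % n
        return (u + v) ** 2 % n, x * (u - v) ** 2 % n
    X0, Z0, X1, Z1 = 1, 0, x, 1
    for bit in bin(k)[2:]:
        if bit == '1':
            (X0, Z0), (X1, Z1) = add(X0, Z0, X1, Z1), dbl(X1, Z1)
        else:
            (X1, Z1), (X0, Z0) = add(X0, Z0, X1, Z1), dbl(X0, Z0)
    return Z0

def ecm_stage1(n, B1=20, curves=500):
    """Try several curves; return a nontrivial factor of n or None."""
    k = stage1_scalar(B1)
    for a in range(6, 6 + curves):            # naive curve sweep (illustrative)
        a24 = (a + 2) * pow(4, -1, n) % n
        g = math.gcd(ladder_z(k, 2, a24, n), n)
        if 1 < g < n:
            return g
    return None
```

A factor p of n is found when the group order modulo p is B1-smooth, so that kP collapses to the point at infinity modulo p (Z ≡ 0 mod p) while remaining finite modulo the cofactor; note that the ladder itself needs no modular inversions.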
The hardware platform for the ECM system presented above requires FPGA devices with embedded DSP blocks and a communication interface providing sufficient performance to exchange the required parameters. Note that arithmetic computations prevail, so the hardware platform does not need to support very high-performance data transfers. Only the elliptic curve parameter a, the base point P = (xP, zP), and the result zQ need to be exchanged. However, significantly more bandwidth is necessary than in the brute-force scenarios described in Chapter 5.
The original COPACOBANA cluster combined 120 Spartan-3 XC3S1000 FPGAs distributed over 20 plug-in modules on a single backplane with 6 FPGAs per module. Note that this approach is not optimal with a binary address encoding, since the required 5-bit address space for module selection is not fully utilized. As a remedy, the new design carries 16 plug-in
modules, each hosting 8 FPGAs. Instead of Spartan-3 FPGAs, we selected Virtex-4 devices which provide embedded DSP blocks11. We kept the shared single master bus (32 bit) and provided an additional 16 bit address bus on the backplane.

Figure 7.3: Architecture of the modified COPACOBANA cluster based on Virtex-4 SX35 FPGAs.

As a design option, the bus architecture can
be switched to 32 serial point-to-point connections so that four FPGAs share a serial line
using differential I/O. In this setup, each serial link can support throughputs up to 2.5 GBit
per second (RocketIO transceivers) so that each FPGA of the system can transfer data with up
to 625 MBit/s.
We selected a Xilinx Virtex-4 SX35-10 in the FF668 package (27 × 27 mm), which provided the best tradeoff between the number of embedded DSP blocks and financial cost. The SX class of Virtex-4 FPGAs is optimized for arithmetic operations and thus provides a large number of DSP blocks. Note that we could alternatively deploy other Virtex-4 devices with the same 27 × 27 mm footprint, like the Virtex-4 LX12 or Virtex-4 FX13. The architecture of the new FPGA cluster (COPACOBANA with Virtex-4 FPGAs) is depicted in Figure 7.3.
On the same plug-in module, all eight FPGAs are connected to two CPLDs (CoolRunner-II) which act as bus drivers and communication bridges between the shared bus on the backplane and the local bus on the module. The shared 32 bit bus on the backplane is driven and controlled at 20 MHz by a further Virtex-4 FX FPGA placed on a separate controller module. This FPGA also integrates a full-blown TCP/IP stack running on the integrated PowerPC hardcore so that the FPGA cluster can establish a connection to a host computer via Gigabit Ethernet. We designed the whole communication system as a three-tier architecture allowing a target application to
11 The design decision in favor of Virtex-4 FPGAs was taken in March 2007. At that time, the more cost-efficient Spartan-3A DSP devices were not yet available, and Virtex-4 devices were the only option when DSP blocks are required.
12 LX devices are designed to provide large amounts of generic logic.
13 Virtex-4 FX FPGAs come with an integrated PowerPC core and support a larger variety of I/O modes.
The improved performance of the Virtex-4 FPGAs comes along with an increased energy consumption per chip. The required power per chip was estimated based on the assumption that cryptanalytical applications are likely to utilize all available hardware resources. According to these requirements, the power distribution system can supply a maximum of 10 W to each FPGA. Consequently, we chose a global DC power supply unit providing 125 A output at 12 V. The corresponding 1500 W of output power are distributed by the backplane to all plug-in cards and are locally transformed into the 1.2 V core and 2.5 V I/O voltages by individual DC/DC converters. The dissipation of 1500 W of electrical power requires sophisticated thermal management in terms of the selection of fans, routing of air flow, and choice of effective heat sinks.
For monitoring purposes, we have also added a bit-serial bus throughout the system which complies with the SMBus specification. On each plug-in card, the CoolRunner-II CPLD operates as monitoring client and runs all system management operations exclusively. The temperature measurement diode of each Virtex-4 device is used to initiate an automatic power-down when the core temperature is about to exceed the maximum value of 85 °C. Hence, the CPLD is connected to the monitoring diode of each Virtex-4 FPGA and also to the power enable of the DC/DC converters to control the shut-down of the plug-in module in case of overheating.
7.5 Results
We implemented an ECM architecture in VHDL for Xilinx Virtex-4 SX35 and Spartan-3A DSP 3400 FPGAs and synthesized the design using Xilinx ISE 9.2. Although our design is prepared to be used for both stage 1 and stage 2, we have so far only implemented the instructions for stage 1 due to time constraints. As an example, the resource consumption of a single ECM core for parameters with h = 151 bits is shown in Table 7.3. Note that the flip-flop requirements are lower for Spartan-3A DSP devices since they come with integrated register stages at the outputs of the BRAMs. Virtex-4 devices, however, need to implement these memory output registers in generic logic and thus consume more flip-flops. For comparison with our implementation, we refer to Gaj et al., who reported the use of 3224 slices (5047 LUTs and 3077 flip-flops) on a larger and faster Virtex-4 XC4VLX200-11, however targeting 198 bit parameters.
The enhanced COPACOBANA machine was built and is already fully functional. Unfortunately, we cannot yet perform (reasonable) practical measurements due to a bottleneck in the communication interface14.

14 More precisely, the bottleneck is inside the FPGA-based Gigabit Ethernet controller, which does not provide Direct Memory Access (DMA) to sustain the Gigabit data link.

Table 7.4: Clock cycles and frequency for a single ECM core (stage 1) to factor a 151 bit integer with bounds B1 = 960 and B2 = 57000. For comparison, values marked with an asterisk were scaled down from figures for 198 bit integers.

Therefore, Table 7.4 presents runtime estimates obtained by simulation and without limitation by the communication interface. We compared our results to the implementation reported by Gaj et al. [GKB+06b, GKB+06a] on the same FPGA platform, assuming the same bounds B1 = 960 and B2 = 57000. Due to our optimizations with respect to low-level arithmetic, we obtained an architecture which is more time- and area-efficient (however, we require the availability of DSP blocks). Note further that we are not able to achieve similarly high
clock frequencies and throughputs as with our ECC design presented in Chapter 3. This is mainly due to the second multiplier and the associated, more complex routing to the memory banks15. In addition, the multiplier also requires access to an additional memory for storing the modulus n (including different representations of n); a simple ROM holding the special primes P-224 or P-256 is no longer sufficient. This is another factor contributing to a more complex (and thus less optimal) core arithmetic, leading to degraded performance.
Based on the ECM core for 151-bit integers, we can place 5 and 8 such cores on the Spartan-3A DSP 3400 and the Virtex-4 SX35, respectively. On both devices, the number of DSP blocks limits the possible number of cores. However, the less efficient routing of multiple parallel cores on the same chip again results in a lower clock frequency. First experiments with the Virtex-4 SX35 showed that a clock frequency of 100 MHz can be achieved. Based on this figure, we can
15 Recall that our design from Chapter 3 only integrates two arithmetic units, which can be efficiently operated in parallel with a dual-ported memory.
estimate that 2295 ECM stage 1 operations per second can be performed on a single Virtex-4 SX35 to factor 151 bit integers (with B1 = 960, B2 = 57000). At this point, we would like to stress that the cost of a Virtex-4 SX35 FPGA device is significant, currently about US$ 500 per device16. A more cost-efficient device is the recently released Spartan-3A DSP 3400 FPGA. Such a device costs only about US$ 75 and can support 5 such ECM cores for 151 bit integers at an (estimated) 60 MHz clock frequency. This results in around 860 operations per second for stage 1 (assuming the same bounds as above).
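These two throughput figures are mutually consistent: both imply nearly the same device-independent cycle count per stage-1 computation, as the following check on the numbers above shows:

```python
# Cross-check of the throughput estimates: ops/s = f_clk * cores / cycles, so the
# implied cycle count per stage-1 run should be (nearly) device-independent.
virtex_cycles = 100e6 * 8 / 2295    # Virtex-4 SX35: 8 cores at 100 MHz, 2295 ops/s
spartan_cycles = 60e6 * 5 / 860     # Spartan-3A DSP 3400: 5 cores at 60 MHz, 860 ops/s

# Both estimates imply roughly 349,000 cycles per stage-1 computation.
assert abs(virtex_cycles - spartan_cycles) / virtex_cycles < 0.01
```

The agreement within one percent confirms that the per-device estimates differ only in clock frequency and core count, not in the underlying instruction schedule.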
16 Price in December 2008, not including discounts for purchases of larger quantities.
Part III

Chapter 8
Intellectual Property Protection for FPGA Bitstreams

8.1 Motivation
When Field Programmable Gate Arrays (FPGAs) were first introduced in the 1980s, this was a revolutionary step from static ASIC and VLSI solutions to flexible and maintainable hardware applications. It has become possible to avoid the static designs of standard VLSI technology and instead to compile electrical circuits for arbitrary hardware functions into configuration files1 used to program a fabric of reconfigurable logic.

1 Note that we also use the terms configuration bitstream or configuration bit file as synonyms for the configuration file of FPGAs.

A new market has evolved where companies
have specialized in the development of abstract hardware functions that can be distributed and licensed to system integrators using only a logical description file. However, the flexibility of SRAM-based FPGAs also raises the issue of protecting the Intellectual Property (IP) of such circuit layouts from unauthorized duplication or reverse engineering. Unfortunately, the configuration file of an FPGA can easily be retrieved from a product and used to clone a device with little effort. Furthermore, IP vendors that deliver configuration files to licensees do not have any control over how many times the IP is actually used. To cope with these problems, several different approaches have been proposed. For example, a simple “security by obscurity” approach is to split the IP among multiple FPGAs and create a unique timing relationship between the components [Bar05]. This type of mechanism, however, will not protect proprietary IP from more intensive attacks. Moreover, such techniques force IP vendors (who only intend to sell configuration files) to deal with the customer’s board layout as well.
In a smarter approach, IP vendors can insist on installing their configuration files only on encryption-enabled FPGA devices using a previously inserted secret key. Common high-end FPGA types like the Virtex-II, Virtex-4, and Virtex-5 from Xilinx [Xil08b] as well as Altera’s Stratix II GX and Stratix III [Alt06] devices provide decryption cores based on the symmetric 3DES and AES block ciphers. With an encrypted configuration file, the IP can only be used on a device that has knowledge of the appropriate secret key. But here the issue of key transfer arises. One approach is to ship FPGAs to the IP owner for key installation: the IP owner installs secret keys in the devices such that these keys are available to decrypt configuration files but remain otherwise inaccessible. After key installation, the devices are returned to the customer. The high logistical effort makes this a very unsatisfactory solution to the problem.
Further solutions are based on separate security chips that dongle the IP to a specific FPGA by exchanging cryptographic handshaking tokens between the components [Alt]. Similarly, this approach requires modifications to the customer’s board layout, additional hardware, and a secure domain for installing the secret parameters. Thus, it provides only a partial solution at a high cost.
In the open literature, there are only very few suggestions to improve this situation. In [Kea01a, Kea01b], a strategy was proposed based on a static secret key inserted during the manufacturing process. The issue of key transfer is solved by including cores both for encryption and for decryption in the FPGA. Each FPGA specimen contains a private or group key which is used to encrypt a bitstream during installation of the configuration. Thus, after encryption, the configuration will work only on this FPGA and on other FPGAs sharing the same fixed key. In [Kea02, SS06], more complex protocols have been proposed for a more complete solution. Both approaches require the implementation of additional security features in the FPGA. Furthermore, they also require the participation of the FPGA manufacturer (as a trusted party) whenever a bitstream is to be encrypted for a particular FPGA. In other words, such transactions cannot be kept just between the IP vendor and the customer.
In this chapter, we propose a new protection scheme for configuration files that provides
IP vendors with means for exact tracking and control of their licensed designs. Our solution
does not impose a need for additional hardware components or major modifications to recent FPGA technology. Instead of demanding a crypto component for key establishment in the static logic, as needed by [Kea02, SS06], we use the reconfigurable logic for a setup process based on public-key cryptography. Besides exact tracking of the number of licensed IP instances, our approach provides the invaluable advantage of off-site IP installation: the installation of the designs can be performed by the licensees without any shipping of hardware. Our approach does not require the continuing participation of a third party (such as the FPGA manufacturer) either. To enable FPGAs for these new features, only marginal modifications are required to recently available FPGA models. The protection scheme presented in this chapter was developed in joint work with Bodo Möller and published in [GMP07a, GMP07b].
of such keys a single (longer) key k. In this chapter, we will write AuthEnck(x) for authenticated encryption of a plaintext x, yielding a ciphertext including a MAC value, and DecVerk(y) for the reverse step of decrypting a ciphertext y while also checking for an authentication error based on the MAC value provided as part of the ciphertext.
The use of CBC for confidentiality with CMAC for authentication is just an example of a
convenient scheme. Alternatively, we could use any suitable symmetric cryptographic scheme
that provides authenticated encryption in an appropriate sense. Regarding the combined scheme
using CBC with CMAC, note that a single implementation of either block cipher encryption or
block cipher decryption is sufficient in the FPGA. We can use the block cipher “in reverse” for
one of the two cryptographic steps, e.g., use a CMAC based on block cipher decryption rather
than on block cipher encryption.
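The AuthEnc/DecVer interface can be sketched with standard-library stand-ins. Note the substitutions, which are assumptions of this sketch rather than the scheme described above: HMAC-SHA-256 stands in for CMAC and a hash-based keystream stands in for AES-CBC, since only the interface and the encrypt-then-MAC structure are being illustrated here, not the block-cipher instantiation:

```python
# Sketch of AuthEnc_k / DecVer_k with stdlib stand-ins (assumptions: HMAC-SHA-256
# replaces CMAC, a SHA-256 counter-mode keystream replaces AES-CBC). Only the
# interface and the encrypt-then-MAC structure mirror the scheme in the text.
import hashlib, hmac, os

def _keystream(key, nonce, length):
    """Hash-based keystream (stand-in for a block cipher mode)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def _subkeys(k):
    """Derive independent encryption and MAC keys from the single key k."""
    return hashlib.sha256(k + b"enc").digest(), hashlib.sha256(k + b"mac").digest()

def auth_enc(k, x):
    """AuthEnc_k(x): encrypt, then append a MAC over nonce || ciphertext."""
    k_enc, k_mac = _subkeys(k)
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(x, _keystream(k_enc, nonce, len(x))))
    return nonce + ct + hmac.new(k_mac, nonce + ct, hashlib.sha256).digest()

def dec_ver(k, y):
    """DecVer_k(y): check the MAC, then decrypt; raise on authentication error."""
    k_enc, k_mac = _subkeys(k)
    nonce, ct, tag = y[:16], y[16:-32], y[-32:]
    if not hmac.compare_digest(tag, hmac.new(k_mac, nonce + ct, hashlib.sha256).digest()):
        raise ValueError("authentication error")
    return bytes(a ^ b for a, b in zip(ct, _keystream(k_enc, nonce, len(ct))))
```

A payload encrypted with auth_enc can only be recovered by dec_ver under the same key, and any bit flip in the ciphertext is rejected before decryption takes place.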
Asymmetric Cryptography: If we want to use symmetric data encryption, we face the issue of establishing a shared key k between the parties (usually over an untrusted communication channel). As mentioned in the beginning of this thesis, asymmetric cryptography
provides a pair of keys consisting of a public (PK ) and a private (SK ) component that can
be used to overcome the key transport deficiencies of symmetric methods. The first publicly
known example of public-key cryptography was the Diffie-Hellman (DH) scheme [DH76], which
can be used to establish keys for symmetric cryptography. Appropriately used in conjunction
with a key derivation function (KDF) based on a cryptographic hash function, the DH scheme
remains state-of-the-art to derive a symmetric key. In this context, an important variant of this
is the DH scheme using elliptic curve cryptography [ACD+ 05], namely ECDH. See [Nat06] for
elaborate recommendations on the proper use of DH and ECDH. Public-key cryptography can
also be used for authentication through digital signatures.
the given inputs. By varying the OtherInfo value (which directly becomes part of the KDF input), the static-key Diffie-Hellman scheme can be used to generate many seemingly independent symmetric keys. Note that the recommendations in [Nat06] for static-key Diffie-Hellman settings additionally require the use of a nonce in the KDF input for each invocation of the key establishment scheme. This use of non-repeated random values ensures that different results can be obtained from otherwise identical inputs. However, we do not need this nonce here: for our application, the reproducibility of key establishment results is a feature, not a deficiency.
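The reproducibility argument can be demonstrated with a toy static Diffie-Hellman instance. The group parameters, key values, and OtherInfo encoding below are illustrative assumptions (the setting described here uses ECDH with the [Nat06] KDF); the sketch shows that both sides derive the same per-device key and that the key changes with the device ID:

```python
# Toy static Diffie-Hellman with a hash-based KDF, illustrating that
# key(PK_IPO, SK_HM, ID) == key(PK_HM, SK_IPO, ID). Group parameters are
# illustrative assumptions and far too small for real security.
import hashlib

P = 0xFFFFFFFFFFFFFFC5      # toy prime modulus (assumption; NOT secure)
G = 5                       # toy generator (assumption)

def keypair(sk):
    """Return (SK, PK) with PK = g^SK mod p."""
    return sk, pow(G, sk, P)

def key(pk_other, sk_own, device_id):
    """KDF over the DH shared secret; the device ID enters via OtherInfo."""
    z = pow(pk_other, sk_own, P)          # g^(SK_HM * SK_IPO) for both parties
    other_info = b"FPGA-ID:" + device_id  # illustrative OtherInfo encoding
    return hashlib.sha256(z.to_bytes(8, "big") + other_info).digest()

sk_hm, pk_hm = keypair(0x1234567)         # hardware manufacturer's static pair
sk_ipo, pk_ipo = keypair(0x89ABCDE)       # IP owner's pair
k_fpga = key(pk_ipo, sk_hm, b"device-0001")          # derived inside the FPGA
assert k_fpga == key(pk_hm, sk_ipo, b"device-0001")  # re-derived by the IPO
assert k_fpga != key(pk_hm, sk_ipo, b"device-0002")  # differs per device ID
```

Because no nonce enters the KDF, both derivations are reproducible: the IPO can recompute at any time exactly the key that was installed into a given device.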
P1. Trusted Party. We assume the FPGA hardware manufacturer (HM) as a commonly
trusted party. All other participating and non-involved parties in the protocol can assume
the HM to be trustworthy and unbiased, i.e., the HM will neither share any secrets with
other parties nor unfairly act in favor of someone else. All other parties, however, are
regarded per se as untrusted, and may try to cheat.
P3. FPGA Security Environment. For the scheme, we assume an FPGA with an inte-
grated symmetric decryption engine that is able to handle encrypted configuration files.
This reference device is extended with the following features:
a) A unique device identifier ID (an l-bit value) is assigned by the hardware manufac-
turer (HM), accessible from the FPGA’s fabric.
b) A symmetric key KHM (an m-bit value) that is statically integrated by the HM during
the manufacturing process, and which can only be read by the internal decryption
engine but not from the FPGA fabric.
c) A symmetric key store KFPGA (also an m-bit value) that is implemented as non-
volatile memory and allows for storing a variable key. The key stored in KFPGA
can either be updated using an external interface (e.g., JTAG, SelectMAP) or via
an internal port from the reconfigurable logic of the FPGA. However, it can only be
read from the internal decryption engine (not from the fabric).
d) A data exchange register that can be accessed via a standardized configuration in-
terface like JTAG as well as from the reconfigurable fabric using a dual-ported logic.
This feature is already available on many common FPGAs based on user-defined
instructions in the JTAG protocol.
e) Tamper-resistant control and security components that can withstand invasive and
non-invasive attacks on the device. Particular protection mechanisms should cover
the integrated keys KHM and KFPGA , the decryption engine and the FPGA controller
distributing the unencrypted configuration bits in the fabric of the FPGA after de-
cryption. Hence, a readback of the plain configuration bits or partial reconfiguration
must not be possible on a device with an encrypted configuration loaded. Possible
physical threats to security-enabled devices as well as side-channel attacks have been
discussed more thoroughly in [KK99, BGB06, WP03].
We assume that the decryption engine of the FPGA can be invoked such that either KHM
or KFPGA is used to decrypt a configuration file.
A. SETUP. On the launch of a new class of FPGA devices, the HM creates a specific
bitstream for them, a Personalization Module (PM) that later will be used in the per-
sonalization stage. An encrypted version of this PM is made available to all participants,
together with a public key associated to it.
B. LICENSING. When an IPO offers licenses to an SI, it provides a public key of its own.
The device identifier ID of each FPGA on which the SI intends to use the licensed IP is
transferred to the IPO.
C. PERSONALIZATION. Using the encrypted personalization module, the SI installs into each licensed FPGA a device-specific symmetric key derived from the IPO’s public key.
D. CONFIGURATION. Using the device information, the IPO sends copies of the configuration file containing its IP to the SI, specifically encrypted for each FPGA.
E. INSTALLATION. The SI installs the IP in each FPGA (using the appropriate en-
crypted copy).
The information exchange between the parties is simple. Figure 8.1 shows the data flow
between the participants. Steps 1 through 3 can be considered one-time operations (these are
part of the setup and licensing stage). On the contrary, steps 4 and 5, which are part of the
licensing and configuration stage, are required to be performed repetitively for each FPGA that
should make use of the protected IP.
The setup stage, described in the following, is performed once by the HM for each FPGA class
to be enabled for the proposed security features. Note that it is reasonable to keep each FPGA
class relatively small by splitting a large series of otherwise identical FPGAs into multiple
separate classes. This limits the potential damage done if an FPGA’s permanent secret KHM
or the secret SK HM is compromised (cf. Section 8.2.4). Any such class should be treated as
independent from any others, i.e., the results of the setup stage as described below cannot be
shared between classes. For practical reasons, a specific part of the device ID should be used to
denote the class that a given FPGA belongs to.
A1. The HM generates a symmetric key KHM and an asymmetric key pair (SK HM , PK HM )
for key establishment.
A2. The HM creates a specific bitstream P for the FPGA class such that P implements
a key establishment scheme, as described in Section 8.2.3. P includes the private key
SK HM . All components employed should be fault-tolerant and include countermeasures
against external tampering [KK99]. The bitstream P acts as a personalization module
and implements the FPGA behavior that we present in Subsection 8.2.5.
A3. After the configuration bit file P has been built, it is encrypted using the secret key KHM ,
yielding an encrypted configuration file Penc = AuthEncKHM (P ).
A4. The secret key KHM is statically integrated in each FPGA (cf. Subsection 8.2.4) during
the manufacturing process.
After these setup steps have been completed, the HM distributes the encrypted personalization
bitstream Penc and the public key component PK HM to all participating parties.
The other parties (notably the IPO) must convince themselves that PK HM actually originates from the HM. How this authentication is performed in detail is outside the scope of this protocol. However, we remark that if the HM supports many different FPGA classes, a Public-Key Infrastructure (PKI) can be very useful.
The licensing stage can be regarded as a first interaction between IPO and SI. To use external
IP, the SI signs a contract with the IPO (usually for a volume license). Then, the following
steps are required.
B1. The IPO creates a key pair (SK IPO , PK IPO ) and sends the public component PK IPO to
the SI.
B2. SI provides IPO with a list of the ID values of those FPGAs for which the SI intends to
obtain a license.
Again, authentication of the communication between IPO and SI is not explicitly covered in
this work since well-known solutions do exist (e.g., digital signatures). It should be remarked
that authentication is required to avoid any abuse of the license model (again, a PKI can be
useful).
In contrast to current solutions, our scheme allows for a key installation that is done in the
untrusted domain of the SI. This can take place due to the secret key KHM available in the
specific FPGAs from the setup stage. The following steps need to be performed for every
FPGA intended to be operated with the IP from the IPO. Note that they can be automated
very efficiently. The personalization and the key establishment within an FPGA make use of
the encrypted configuration Penc performing the (one-time) key setup in logic residing in the
fabric. Compared with the option of demanding static security components for this step in the
static part of an FPGA, this provides huge efficiency advantages since it limits the resources
that have to be permanently allocated in the FPGA device.
C1. Using a programming interface, the FPGA is configured with the encrypted personal-
ization module Penc made available by the HM, which is decrypted using the statically
integrated key KHM .
C2. Then, the data exchange register of the FPGA is loaded with the public key PK IPO via
a common interface (e.g., JTAG). After PK IPO is loaded, the computation process is
initiated.
C3. The personalization module determines a symmetric key key(PK IPO , SK HM , ID) using
the integrated key agreement scheme, and stores the resulting symmetric key in KFPGA .
From now on, the FPGA can decrypt designs that are encrypted using this key.
The security properties of the key establishment scheme imply that knowledge of either SK HM
or SK IPO is required to compute the key KFPGA . Thus, SI cannot determine KFPGA . Including
ID in the KDF input ensures that KFPGA will differ for different FPGA instances. For further
implementation aspects, see Section 8.4.
For each FPGA ID for which SI has bought a license, the IPO returns a corresponding con-
figuration file to the SI usable only on the specific FPGA. This mechanism allows the IPO to
easily track the number of FPGAs that have been configured to use the licensed IP. In detail,
the IPO performs the following steps to generate the FPGA-specific configuration file.
D1. The IPO recovers the specific key KFPGA using its own secret key and the HM’s public
key:
KFPGA = key(PK HM , SK IPO , ID)
D2. The IPO encrypts the plain IP configuration bit file using the secret key, thus binding the
IP to a specific FPGA device:
IPenc,ID = AuthEncKFPGA (IP).
The symmetry of the key establishment scheme ensures that the key KFPGA as determined by the IPO is identical to the key KFPGA installed into the FPGA during the personalization stage.
After having received IPenc,ID , the SI configures the FPGA with the personalized IP.
E1. SI configures the flash memory of the specific FPGA denoted by identifier ID with IPenc,ID
to operate the device.
Since KFPGA is available in the FPGA, this step enables the FPGA to use the IPO’s configu-
ration bit file IP by computing DecVerKFPGA (IPenc,ID ).
8.4. Implementation Aspects
Table 8.1: Data throughput and logic requirement of personalization components on a Xilinx
Virtex-4 FPGA.
where S(x) is a selection function choosing the first 256 out of 320 input bits and where || denotes
concatenation. The data exchange between the personalization module and an external party
(SI) was realized using a shift register which is writable from the JTAG interface. Besides a ROM for storing the secret key of the HM and the constants c0, c1, a SelectMAP controller (provided by Berkeley’s Bee2 project [Ber06]) is required to program the key storage of the FPGA.
The schematic of the implemented personalization module is sketched in Figure 8.2.
For this proof-of-concept work, the SelectMAP core needs to be externally connected to the FPGA’s SelectMAP interface since a direct configuration from the fabric is not yet available. It should be remarked that all implementations have been developed for an optimal area-time product, so that reductions in logic elements can still be achieved if data throughput is not a primary issue. In conclusion, the implementations at hand are small enough to fit even the smallest Virtex-4 XC4VFX12 device (providing a total of 5472 slices, of which fewer than 4000 are required), with some remaining logic resources left to add functionality providing tamper resistance and fault tolerance.
Figure 8.2: Simplified schematic of a personalization module for Xilinx Virtex-4 FPGAs with a 256-bit decryption key.
To use our proposed key scheme on common FPGA devices, an additional key storage with a fixed key needs to be added to the static design. This is neither a technical nor a big financial issue, since it can be integrated by the HM either directly in the circuit layout, using antifuses, or using similar techniques in a post-production step.
Such a strategy based on laser inscriptions or antifuses can also be used to provide each FPGA
with a unique identification number. Alternatively, so-called Physically Unclonable Functions (PUFs), implemented using a coating or delay technique, might be an option to create a unique identification of each chip [Lim04].
A further requirement for our scheme is access from the fabric to the key store via an internal (write-only) interface. Since this only requires some internal component repositioning and a few dedicated internal I/O pins, it can be implemented cost-efficiently.
Moreover, for advanced bitstream protection in the FPGA, we require the availability of
authenticated encryption within the FPGA as explained in Section 8.2.2. A single block cipher
engine can be used both for authentication and for decryption, and such reuse means that only minor modifications are needed for current FPGA architectures already containing a symmetric block cipher engine.
8.5. Conclusions and Outlook
To achieve tamper resistance against invasive attackers, most HMs have already taken efforts to hide critical parts from observation and manipulation, e.g., by distributing those components over multiple layers and locations. Moreover, strategies known from smart cards [KK99] could be applied to strengthen FPGAs against physical and invasive attacks.
Chapter 9
Trusted Computing in Reconfigurable Hardware
Trusted Computing (TC) is an emerging technology used to build trustworthy computing platforms. The Trusted Computing Group (TCG) has proposed several specifications to implement TC functionalities and security functions. Their proposal includes a hardware extension to common computing platforms known as the Trusted Platform Module (TPM). In this chapter we propose a reconfigurable (hardware) architecture which is capable of supporting TC functionalities as well as trustworthy security components. With respect to the implemented set of security functions, we focus on TPM-like functionality as proposed by the TCG, but specifically designed for embedded platforms.
9.1 Motivation
Trusted Computing (TC) is a promising technology used to build trustworthy computing plat-
forms. A recent initiative to implement TC by extending common computing platforms with
hardware and software components is due to the Trusted Computing Group (TCG), a consor-
tium of IT enterprises [Tru08]. The TCG specified the Trusted Platform Module (TPM) which
provides a small set of cryptographic and security functions, and is assumed to be the trust
anchor in a computing platform. Currently, TPMs are implemented as dedicated crypto chips mounted on the main board of computing devices, and many vendors already ship their platforms equipped with TPM chips. The functionalities provided by the TPM allow (sensitive) data to be securely bound to a specific platform, meaning that the data is only accessible when the underlying platform has the valid and desired configuration.
However, there are several issues to deal with: first, existing TPM chips are currently available mainly for workstations, servers, and specific domain applications, but in particular not for embedded systems1. Second, TPM specifications are continuously growing in size and complexity, and there is still no published analysis of the minimal security functionalities that are practically needed. In addition, TPM users have to completely trust the implementations of TPM manufacturers, e.g., regarding compliance with the TCG specification. The user must also trust that no malicious functionalities (like trapdoors or Trojans) have been integrated into the TPM implementation. Finally, the TCG adversary model considers software attacks only, but manipulations of the underlying hardware can circumvent any sophisticated software security measures. Currently, TPM chips are connected to the I/O system with an unprotected interface that can be eavesdropped and manipulated easily [KSP05].
We address most of these issues by proposing a reconfigurable architecture for FPGAs that allows a scalable and flexible usage of trusted computing functionalities. To our knowledge, there has been no proposal for building TC capabilities (e.g., TPM functionalities and corresponding trustworthy security components) into reconfigurable hardware architectures. Our approach makes it possible to bind a reconfigurable application to the underlying TPM and even to bind any higher software layer to the whole reconfigurable architecture. Based on the asymmetric means of a TCG-conformant TPM, this can be used as an effective and flexible protection of IP to provide device-specific application software.
We believe that FPGA devices can provide a promising basis for a variety of TC applications in embedded system environments. However, to enable TC functionality on these devices, today's FPGA architectures must be extended with additional features; the technologies for the required modifications are already available. Note that we do not primarily focus on the integration of large microprocessors (µP) like the commercial Intel Core 2 Duo or AMD Opteron into an FPGA. In fact, our approach assumes embedded applications running on small µPs such as the ARM cores known from mobile phones and PDAs.
In this chapter we propose solutions to extend reconfigurable hardware architectures with Trusted Computing functionalities, e.g., for use in embedded systems. In particular, our architecture makes it possible to include the TPM implementation itself in the so-called chain of trust. Although we aim at solutions compliant with the TPM specification proposed by the TCG, our architecture can be deployed independently of the TCG approach for future developments of TC technology. Besides a vendor-independent and flexible integration of a TPM in embedded systems, our approach offers the advantage of reducing the trusted computing base to the minimum required by the application. This includes specific functionalities that allow for effective protection scenarios with hardware and software IP (on FPGAs). Parts of this chapter are based on collaborations with Thomas Eisenbarth and Marko Wolf and were published in [EGP+ 07a, EGP+ 07b].
1 At least there exist proposals from Brizek et al. [BKSW05] and the TCG [Tru08] for a specific TPM to also support mobile devices.
9.2. Previous Work
TPM version 1.2 provides the following features: a hardware-based random number generator
(RNG), a cryptographic engine for encryption and signing (RSA) as well as a cryptographic
hash function (SHA-1, HMAC), read-only memory (ROM) for firmware and certificates, volatile
memory (RAM), non-volatile memory (EEPROM) for internal keys, monotonic counter values
and authorization secrets, and optionally, sensors for tampering detection. Common TPM
chips use a synchronous Low Pin Count-I/O-Interface (LPC-I/O) to communicate with their
host system. Based on protected information and hardware functionality, security-critical operations like key generation and decryption are performed on-chip. Note that secret keys never leave the TPM device unencrypted. The TPM provides a minimum of 24 registers called Platform Configuration Registers (PCR) that can be used to store hash values. A PCR can normally only be extended, i.e., the value of a PCR can only be modified as follows: PCR i+1 ← Hash(PCR i | x), with the old register value PCR i, the new register value PCR i+1, and the input x (e.g., a SHA-1 hash value over the instructions of an operating system level or an application).
PCR 0 to PCR 15 are reset only at system startup and thus can only be extended afterwards.
The PCR values are used to establish the chain of trust. Roughly speaking, at power-on of the platform, the CRTM3 computes a hash value of the code and parameters of the boot loader. Computing the hash value is called measurement in TCG terminology. Then the CRTM extends the corresponding PCR and hands over control to the boot loader, which measures the next component, e.g., the operating system kernel, and so forth. The security of the chain relies strongly on explicit security assumptions about the CRTM.
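The extend-only update rule and the resulting chain of trust can be illustrated with a short sketch. SHA-1 is used here since TPM 1.2 prescribes it for the PCRs; the measured component names are illustrative assumptions, not taken from this thesis:

```python
# Sketch of the extend-only PCR update PCR_{i+1} <- Hash(PCR_i | x)
# and a chain-of-trust measurement sequence.
import hashlib

def pcr_extend(pcr: bytes, x: bytes) -> bytes:
    # A PCR is never written directly; it can only be extended.
    return hashlib.sha1(pcr + x).digest()

pcr = b"\x00" * 20  # PCR 0 to PCR 15 are reset only at system startup

# Each stage measures (hashes) the next component, extends the PCR,
# and then hands over control to that component.
for component in (b"boot loader", b"OS kernel", b"application"):
    measurement = hashlib.sha1(component).digest()
    pcr = pcr_extend(pcr, measurement)

# The final register value depends on every measured component and on
# their order; any modified stage yields a different PCR value.
```

Since every extension folds the previous register value into the hash input, a verifier that checks the final PCR value implicitly checks the entire measured sequence.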
9.4. Trusted Reconfigurable Hardware Architecture
resistance (as TPM technology stems from smart-card technology). However, recent TPMs still show the following security weaknesses:
employs the device. All parties need to trust the FPGA hardware manufacturer. However, IP developers have only limited trust in system developers, and users have only limited trust in IP and system developers. It is obvious that the entity issuing the update (usually the TPM designer) needs to be trustworthy, or the TPM implementation is subject to certification by some trusted organization.
We assume an adversary who can eavesdrop on and modify all FPGA-external communication lines, eavesdrop on and modify all FPGA-external memories, and arbitrarily reconfigure the FPGA, but who cannot eavesdrop on or modify FPGA-internal states. In particular, we exclude invasive attacks such as glitch attacks, microprobing attacks, or attacks using lasers or a Focused Ion Beam (FIB) to gain access to or modify FPGA internals. Precautions against other physical attacks such as side-channel attacks or non-invasive tampering must be taken when implementing the TPM. Furthermore, we do not consider destructive adversaries who focus on denial-of-service attacks, destroying components or the entire system.
The basic idea is to include the hardware configuration bitstream(s) of the FPGA in the chain
of trust. The main security issue, besides protection of the application logic, is to protect the
TPM against manipulations, replays and cloning. Hence, appropriate measures are required to
securely store and access the sensitive (TPM) state T .
In the following we denote a hardware configuration bitstream as BX with X ∈ {TPM , App}
such that BTPM denotes a TPM bitstream and BApp an application bitstream. We further define
EX as the encryption of BX using a symmetric encryption algorithm and a symmetric encryp-
tion key kEnc,X . We define AX as an authenticator of a bitstream BX with AX ← AuthkX (BX )
where AuthkX could be for instance a Message Authentication Code (MAC) based on the au-
thentication key kX . We denote the corresponding verification algorithm of an authenticator
AX with VerifykX (BX , AX ). In case that a bitstream has been encrypted to preserve design
confidentiality, BX is replaced by EX . Thus, the corresponding authenticator AX becomes
AX ← AuthkX (EX ). Unlike in Chapter 8, we here use separate operators for encryption and authentication, since this provides direct access to their individual properties (e.g., to the authenticator AX ). According to [BN00], an Encrypt-then-MAC authenticated encryption scheme provides the strongest security (with respect to the two other possible schemes, MAC-then-Encrypt and Encrypt-and-MAC ). We finally define CX as a unique representative of BX 's configuration. This value CX can, for example, be obtained directly from the computed authenticator A′X when the FPGA verifies the authenticity of the configuration BX (i.e., the FPGA checks that A′X = AX for BX ).
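A minimal software sketch of this notation may help. HMAC-SHA256 is used as an illustrative stand-in for AuthkX, and the encryption of BX to EX is left abstract (the FPGA would use its symmetric bitstream cipher); all keys and bitstream contents are toy values:

```python
# Sketch of A_X <- Auth_kX(E_X) and Verify_kX(E_X, A_X) in the
# Encrypt-then-MAC composition; the MAC is computed over the ciphertext.
import hmac, hashlib

def auth(k_x: bytes, data: bytes) -> bytes:
    return hmac.new(k_x, data, hashlib.sha256).digest()

def verify(k_x: bytes, data: bytes, a_x: bytes) -> bool:
    # Recompute A'_X and compare; on success, A'_X can also serve as
    # the configuration value C_X.
    return hmac.compare_digest(auth(k_x, data), a_x)

k_app = b"k" * 32                               # authentication key (toy)
E_app = b"<encrypted application bitstream>"    # encryption step omitted
A_app = auth(k_app, E_app)                      # authenticator over E_X

assert verify(k_app, E_app, A_app)              # authentic bitstream
assert not verify(k_app, E_app + b"!", A_app)   # tampering is detected
```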
Figure 9.2 shows our high-level reconfigurable architecture. The bitstreams BApp , BTPM of
the application and the TPM core without any state T are stored authenticated (and encrypted)
in the external (untrusted) memory EM .
Figure 9.2: High-level reconfigurable architecture: the reconfigurable logic (µP, application, and TPM with TCG I/O) and the static logic containing the Bitstream Trust Engine (BTE) with its key storage (KS), the Hardware Configuration Registers (HCR), and the non-volatile storage; via the LOAD, CONFIGURE, DECRYPT, and AUTH interfaces, the TPM bitstream BTPM and its authenticator ATPM are fetched from the external memory EM.
The FPGA Control logic allows partial hardware configuration4 of the FPGA fabric to load
the TPM and the application independently using the LOAD and CONFIGURE interfaces.
The Bitstream Trust Engine (BTE) provides means to decrypt and verify the authenticity
and integrity of bitstreams using the DECRYPT and AUTH interfaces5 . Furthermore, the BTE
includes a protected and non-volatile key storage (KS) to store the keys for bitstream decryption
and authentication. Finally, the BTE provides a volatile memory location called Hardware
Configuration Registers (HCR) to store the configuration information of loaded bitstreams.
These registers are used later on by the TPM to set up its internal Platform Configuration
Registers (PCR).
In the following we define two stages in our protocol, the setup and the operational phase.
TPM implementation, it is possible that both components are developed by a single party, e.g.,
by the system developer itself.
S1. The system developer verifies the authenticity of BTPM and BApp , encrypts BApp to EApp
and then creates bitstream authenticators ATPM and AApp using the authentication keys
kTPM and kApp , respectively.6
S2. The TPM bitstream BTPM , its authenticator ATPM , the encrypted application bitstream
EApp , and its authenticator AApp are stored in the external memory EM .
S3. The system developer writes the appropriate authentication keys kTPM and kApp (and the encryption key kEnc,App ) to the key store KS of the BTE.
O1. On device startup, the FPGA controller reads the TPM bitstream BTPM and the corresponding authentication information ATPM from the external memory EM . The BTE verifies the authenticity and integrity of BTPM based on the authenticator ATPM by using VerifykTPM (BTPM , ATPM ).
Upon successful verification of the bitstream, the BTE computes the configuration value CTPM of the TPM bitstream and writes CTPM into the first Hardware Configuration Register (HCR) before the FPGA's fabric is finally configured with BTPM .
O2. The TPM requires exclusive access to a non-volatile memory location to store its sensitive
state T = (EK , SRK , TD) where EK denotes an asymmetric key that uniquely identifies
each TPM (Endorsement Key), SRK an asymmetric key used to encrypt all other keys
created by the TPM (Storage Root Key) and TPM data TD includes further security-
critical non-volatile data of the TPM. This requires an extension of recent SRAM-FPGA
devices with on-chip non-volatile storage which is discussed in more detail in Section 9.5.
Furthermore, the access to this storage location is protected by an Access Control Func-
tion (ACF) integrated in the FPGA’s static logic which provides access to sensitive data
only when a specific bitstream (i.e., the TPM) is loaded. For full flexibility, the ACF
implements an interface with which a currently configured bitstream can request a reset
(and implicitly, a clear) of the non-volatile memory to reassign the access to the storage
for its own exclusive use. The access authorization to the memory for a loaded bitstream
X can easily be performed by the BTE by checking its CX stored in the first HCR.
O3. After the TPM has been loaded into the fabric, the application bitstream EApp and its
authenticator AApp are read from EM , verified and decrypted in the same way. The
6 If the TPM bitstream BTPM is also provided by the system integrator itself, he can choose kTPM = kApp .
BTE stores the configuration value CApp of the verified application in the second HCR
register. After the application bitstream has been configured in the FPGA, the first
call of the application initializes the TPM as follows: Based on the content of the HCR
(CTPM , CApp ), the TPM initializes its own PCRs: PCR 1 ← Hash(PCR 0 |CTPM ) and
PCR 2 ← Hash(PCR 1 |CApp ) where Hash(x) denotes the internal hash function of the
TPM and PCR 0 is some constant (root) value. In this way the (unique) configurations of all bitstreams can be included in the chain of trust7 .
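The startup sequence O1–O3 can be sketched in software as follows. HMAC-SHA256 models the BTE's bitstream authentication, SHA-256 models the TPM's internal hash, all keys and bitstream contents are illustrative toy values, and the decryption of EApp is omitted:

```python
# Sketch of the startup sequence O1-O3 (primitives and values are
# illustrative stand-ins, not the BTE's or TPM's actual algorithms).
import hmac, hashlib

def auth(k: bytes, data: bytes) -> bytes:
    return hmac.new(k, data, hashlib.sha256).digest()

k_tpm, k_app = b"T" * 32, b"A" * 32
B_tpm = b"<TPM bitstream>"
E_app = b"<encrypted application bitstream>"
EM = {"B_TPM": B_tpm, "A_TPM": auth(k_tpm, B_tpm),   # external memory
      "E_App": E_app, "A_App": auth(k_app, E_app)}

HCR = [None, None]              # Hardware Configuration Registers

# O1: verify B_TPM; on success the computed authenticator serves as C_TPM
a1 = auth(k_tpm, EM["B_TPM"])
assert hmac.compare_digest(a1, EM["A_TPM"])
HCR[0] = a1                     # C_TPM, written before configuration

# O3: verify the application bitstream (decryption of E_App not modeled)
a2 = auth(k_app, EM["E_App"])
assert hmac.compare_digest(a2, EM["A_App"])
HCR[1] = a2                     # C_App

# First call of the application: the TPM folds the HCR contents into
# its PCRs, so both configurations enter the chain of trust.
PCR0 = b"\x00" * 32             # constant (root) value
PCR1 = hashlib.sha256(PCR0 + HCR[0]).digest()
PCR2 = hashlib.sha256(PCR1 + HCR[1]).digest()
```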
After loading the hardware configuration of TPM and application into the FPGA, the chain
of trust can be extended by the measurements of other specific system components like the
operating system and high-level application software. This makes it possible to bind any higher-level application (of the IP provider) to the underlying FPGA by binding the application (or its data) to the subset of the PCR registers that contains the corresponding measurements of the underlying FPGA.
corresponding system developer (over an untrusted channel). Then, the migration function
Migrate(UA, CTPM2 ) performs the following steps:
U1. The migration function of TPM1 verifies UA using the update verification key P KUPD
and checks whether PTPM2 provides the same (minimum) set of security properties as
PTPM1 .
U2. After successful verification, the migration function of TPM1 reassigns the ACF (contain-
ing T ) for use with TPM2 . The ACF needs to grant access to TPM2 without erasing the
non-volatile memory. More precisely, the BTE provides a further interface so that only
TPM1 with access to the ACF memory can associate the memory with CTPM2 . After
reassignment of the ACF memory, only the new TPM2 is able to access T .
After the migration function has terminated, the application (or, manually, the user) overwrites the bitstream of TPM1 stored in the external memory EM with BTPM2 and the corresponding authenticator ATPM2 . Now, the user restarts the FPGA to reload the updated TPM and application (cf. Section 9.4.4).
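The two migration checks can be sketched as follows. The asymmetric verification of UA under the update verification key is modeled here by a keyed MAC, and the property sets PTPM1 and PTPM2 are plain sets; all values are illustrative assumptions:

```python
# Sketch of migration steps U1-U2 (signature verification and property
# encoding are illustrative stand-ins, not the actual scheme).
import hmac, hashlib

k_upd = b"U" * 32                          # stand-in for the PK_UPD check
P_tpm1 = {"rng", "rsa", "sha1"}            # security properties of TPM1
P_tpm2 = {"rng", "rsa", "sha1", "aes"}     # properties offered by TPM2
C_tpm2 = b"<configuration value of TPM2>"

payload = C_tpm2 + b"|" + ",".join(sorted(P_tpm2)).encode()
UA = hmac.new(k_upd, payload, hashlib.sha256).digest()  # update authorization

acf_owner = b"<configuration value of TPM1>"  # ACF memory bound to TPM1

# U1: verify UA and check that TPM2 provides at least TPM1's properties
assert hmac.compare_digest(UA, hmac.new(k_upd, payload, hashlib.sha256).digest())
assert P_tpm1 <= P_tpm2

# U2: reassign the ACF memory (containing T) to TPM2 without erasing it
acf_owner = C_tpm2
```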
Enhancing the Chain of Trust. As mentioned in Section 9.3.1, recent TPM-enabled systems establish the chain of trust starting from the CRTM, which is currently part of the BIOS. For TPMs hosted in FPGAs, the BTE can begin with the hardware configuration of the application and even with the TPM itself. Therefore, the chain of trust can include the underlying hardware as well as the TPM hardware configuration, i.e., the chain-of-trust paradigm can be moved to the hardware level.
Flexible Usage of TPM Functionality. The developer may also utilize the basic
functionality of the TPM in his application which can make the development of additional
cryptographic units obsolete. This includes the generation of true random numbers, the
asymmetric cryptographic engine as well as protected non-volatile memory. Furthermore,
a flexible FPGA design allows using only the TPM functionality that is required for the application.
9.5. Implementation Aspects
and hence any changes to the chain of trust can render sealed data inaccessible, even when
keeping the same level of security. This is a general limitation of the TCG solution and
holds for our chain-of-trust model as well. However, in [KKL+ 05] the authors propose the concept of property-based sealing, which provides a mapping between the security properties of a platform configuration and its binary measurements. This makes updates very efficient: as long as the properties are preserved, changes to the binary measurements during an update have no impact on sealed data. In this context, the authors also propose a new TPM command called TPM UpdateSeal that allows a TPM to verify a certificate issued (by a trusted third party) on a new configuration, and hence to reseal the data under the new configuration.
Improved Communication Security. The integration of CPU, ROM, RAM, and TPM into a single chip protects the communication links between these security-critical components from being intercepted or manipulated. With the boot ROM and RAM integrated in the FPGA, the injection of malicious boot code or RAM manipulations becomes much more difficult.
preserving the TPM state T , it is possible to integrate a non-volatile memory directly on the
FPGA chip. Newer FPGA devices like the Xilinx Spartan 3AN [Xil08b] already offer SRAM-
based circuits combined with a non-volatile Flash memory layer in the same package. The Flash
memory in those devices allows for storing up to 11 MBit of user-defined data, perfectly suited
for storing T (with an additional ACF implementation).
To realize the BTE and bitstream authentication of BApp and BTPM , we require minor modifications to the existing (AES) decryption cores. For integrity verification and authentication of bitstreams in the BTE, one option is to use a Message Authentication Code (MAC) which ideally uses the same cryptographic engine as for bitstream decryption (cf. the authenticated encryption feature discussed in Chapter 8).
Table 9.1: Estimated number of Logical Elements (LE) and RAM bits for the TPM functionality.
For all other (reconfigurable) cryptographic components of the TPM itself, a multitude of proposals for efficient implementations is available in the open literature. Table 9.1 shows estimates for the required resources of a TPM realized in an FPGA. Please note that the implementations have been selected to be area-optimal. We have converted the resource requirements to a universal metric based on Logical Elements (LE), chosen to maintain platform independence among different classes of FPGAs11 . Translated to a low-cost Xilinx Spartan-3AN XC3S1400AN with a total system complexity of 25,344 LEs, the TC enhancements will take about 3,000 LEs and thus require about 12% of the device capacity. Hence, we can conclude that a TPM can be implemented efficiently on recent devices.
9.6 Conclusions
In this chapter we proposed solutions for reconfigurable hardware architectures based on FPGAs
providing Trusted Computing (TC) mechanisms as proposed by the TCG specifications.
Integrating TC mechanisms into reconfigurable hardware makes it possible to include application hardware and TC functionalities in the chain of trust, which helps to build trustworthy embedded platforms for various security-critical embedded applications.
11 In our notion, an LE consists of a single 4-input LUT connected to an optional, single-bit flip-flop.
The flexible, efficient, and scalable realization of TC mechanisms allows manufacturer-independent TC designs and flexible updates of TC functionalities. Due to the complete integration, the given solutions also improve the protection against certain hardware attacks.
Future work can include a realization of our architecture as a proof-of-concept implementation and detailed analyses of the tamper resistance of FPGA chip packages.
Part IV
Appendix
Additional Tables
We here provide supplementary tables and information for the work presented in this thesis.
Table A.1: Supplementary information to Chapter 2: initial 13 clock cycles of the eight pipeline stages computing a plaintext input. Steps are RAM lookup L, RAM output register R, transform T, DSP input register D, and DSP XOR ⊕. After eight cycles the output column E′0 is used as input to the next round, etc.
1st Base Ext.    2nd Base Ext.    1024-bit range    2048-bit range
Bajard et al.    Shenoy et al.    981               2003
Bajard et al.    Others           1013              2035
Others           Shenoy et al.    990               2014
Others           Others           1022              2046

Table A.2: Supplementary information to Chapter 4: modulus sizes for modular multiplication using RNS.
      Multiplier                                    Adder/Subtracter
Step  Target  Operation  Description                Target  Operation  Description
  1   T0      Y1·Z2^3    Y1·Z2^3 = S1
  2   T1      Y2·Z1^3    Y2·Z1^3 = S2
  3   T2      X1·Z2^2    X1·Z2^2 = U1               T1      T1 − T0    S2 − S1 = R
  4   T3      X2·Z1^2    X2·Z1^2 = U2
  5   T4      T1^2       R^2                        T3      T3 − T2    U2 − U1 = H
  6   T5      T3^2       H^2
  7   T6      T5·T2      U1·H^2
  8   T2      T5·T3      H^3                        T5      T6 + T6    2·U1·H^2
  9   T7      T2·T0      S1·H^3                     T4      T4 − T5    R^2 − 2·U1·H^2
 10   T3      T3·Z1      Z1·H                       X3      T4 − T2    R^2 − 2·U1·H^2 − H^3
 11   Z3      T3·Z2      Z1·Z2·H = Z3               T4      T6 − X3    U1·H^2 − X3
 12   T6      T1·T4      R·(U1·H^2 − X3)
 13   Z3^2    Z3·Z3      Z3^2                       Y3      T6 − T7    R·(U1·H^2 − X3) − S1·H^3
 14   Z3^3    Z3^2·Z3    Z3^3

Table A.3: Supplementary information to Chapter 3: instruction sequence for point addition using projective Chudnovsky coordinates based on a parallel adder and multiplier.
      Multiplier                                    Adder/Subtracter
Step  Target  Operation  Description                Target  Operation  Description
  1   T0      Z1^2·Z1^2  Z1^4                       T2      Y1 + Y1    2·Y1 = B
  2   T1      X1·X1      X1^2
  3   T3      T2·T2      B·B = C                    T0      T1 − T0    X1^2 − Z1^4
  4   Z3      T2·Z1      B·Z1 = Z3                  T1      T0 + T0    2·(X1^2 − Z1^4)
  5   T4      T3·X1      C·X1 = D                   T1      T1 + T0    3·(X1^2 − Z1^4) = A
  6   T5      T1·T1      A^2                        T0      T4 + T4    2·D
  7   Z3^2    Z3·Z3      Z3^2                       X3      T5 − T0    A^2 − 2·D = X3
  8   T0      T3·T3      C^2                        T5      T4 − X3    D − X3
  9   T2      T1·T5      A·(D − X3)                 T6      T0 DIV 2   C^2/2
 10   Z3^3    Z3^2·Z3    Z3^3                       Y3      T2 − T6    A·(D − X3) − C^2/2

Table A.4: Supplementary information to Chapter 3: instruction sequence for point doubling using projective Chudnovsky coordinates based on a parallel adder and multiplier.
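As a software sanity check, the formulas evaluated by the instruction sequences of Tables A.3 and A.4 can be executed directly. The following sketch uses a toy curve y^2 = x^3 − 3x + b over a small prime, chosen only for illustration (the doubling sequence assumes the curve parameter a = −3, as for the curves considered in Chapter 3), and verifies that the computed points lie on the curve:

```python
# Sanity-check sketch of Chudnovsky-coordinate point addition/doubling
# over a toy curve (parameters are illustrative, not from the thesis).
p = 10007          # small toy prime
b = 6              # curve y^2 = x^3 - 3x + b over GF(p); P = (1, 2) is on it

def inv(a): return pow(a, p - 2, p)      # modular inverse (p prime)

def to_chud(x, y):                        # affine -> (X, Y, Z, Z^2, Z^3)
    return (x, y, 1, 1, 1)

def to_affine(X, Y, Z, Z2, Z3):
    return (X * inv(Z2) % p, Y * inv(Z3) % p)

def ec_double(P1):                        # mirrors Table A.4 (a = -3)
    X1, Y1, Z1, Z1_2, Z1_3 = P1
    A = 3 * (X1 * X1 - Z1_2 * Z1_2) % p   # A = 3(X1^2 - Z1^4)
    B = 2 * Y1 % p
    C = B * B % p
    D = C * X1 % p
    X3 = (A * A - 2 * D) % p
    Y3 = (A * (D - X3) - C * C * inv(2)) % p
    Z3 = B * Z1 % p
    return (X3, Y3, Z3, Z3 * Z3 % p, Z3 * Z3 * Z3 % p)

def ec_add(P1, P2):                       # mirrors Table A.3 (P1 != +-P2)
    X1, Y1, Z1, Z1_2, Z1_3 = P1
    X2, Y2, Z2, Z2_2, Z2_3 = P2
    U1, U2 = X1 * Z2_2 % p, X2 * Z1_2 % p
    S1, S2 = Y1 * Z2_3 % p, Y2 * Z1_3 % p
    H, R = (U2 - U1) % p, (S2 - S1) % p
    X3 = (R * R - H * H * H - 2 * U1 * H * H) % p
    Y3 = (R * (U1 * H * H - X3) - S1 * H * H * H) % p
    Z3 = Z1 * Z2 * H % p
    return (X3, Y3, Z3, Z3 * Z3 % p, Z3 * Z3 * Z3 % p)

def on_curve(x, y):
    return (y * y - (x ** 3 - 3 * x + b)) % p == 0

P = to_chud(1, 2)
D2 = ec_double(P)          # 2P via the doubling sequence
A3 = ec_add(D2, P)         # 3P via the addition sequence
assert on_curve(*to_affine(*D2)) and on_curve(*to_affine(*A3))
```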
Bibliography
[ABCS06] R. Anderson, M. Bond, J. Clulow, and S. Skorobogatov. Cryptographic Processors – A Survey. Proceedings of the IEEE, 94(2):357–369, Feb 2006.
[ACD+ 05] R. M. Avanzi, H. Cohen, C. Doche, G. Frey, T. Lange, K. Nguyen, and F. Vercauteren. Handbook of Elliptic and Hyperelliptic Curve Cryptography. Chapman & Hall/CRC, 2005.
[Act08] Actel Corporation. IGLOO and ProASIC Flash-based FPGAs, 2008. Available
at http://www.actel.com/products/.
[Alg07] Algotronix Ltd. AES G3 data sheet: Xilinx edition, October 2007. Available at
http://www.algotronix-store.com/kb_results.asp?ID=7.
[Alt] Altera Corporation. FPGA design security using MAX II reference design. Avail-
able at http://www.altera.com/support/refdesigns/sys-sol/indust_mil/
ref-des-secur.html.
[Alt06] Altera Corporation. Stratix II GX and Stratix III FPGAs, 2006. Available
at http://www.altera.com/products/devices/.
[AM93] A. O. L. Atkin and F. Morain. Finding suitable curves for the elliptic curve
method of factorization. Mathematics of Computation, 60:399–405, 1993.
[ANS05] ANSI X9.62-2005. American National Standard X9.62: The Elliptic Curve Digital Signature Algorithm (ECDSA). Technical report, Accredited Standards Committee X9, http://www.x9.org, 2005.
[ATI06] Advanced Micro Devices, Inc. (AMD), Sunnyvale, CA, USA. ATI CTM
Guide, Release 1.01, 2006. Available at http://ati.amd.com/companyinfo/
researcher/documents/ATI_CTM_Guide.pdf.
[Bar05] T. Barraza. How to Protect Intellectual Property in FPGA Devices Part II. Design and Reuse Online: Industry Articles, 2005. Available at http://www.us.design-reuse.com/articles/article11240.html.
[BBJ+ 08] D. J. Bernstein, P. Birkner, M. Joye, T. Lange, and C. Peters. Twisted Edwards Curves. In S. Vaudenay, editor, Proceedings of First International Conference on Cryptology in Africa – AFRICACRYPT 2008, volume 5023 of LNCS, pages 389–405. Springer-Verlag, 2008. Document ID: c798703ae3ecfdc375112f19dd0787e4.
[BCC+ 09] D. J. Bernstein, T.-R. Chen, C.-M. Cheng, T. Lange, and B.-Y. Yang. ECM
on Graphics Cards. Cryptology ePrint Archive, Report 2008/480, January 2009.
Document ID: 6904068c52463d70486c9c68ba045839.
[BDK01] J.-C. Bajard, L.-S. Didier, and P. Kornerup. Modular Multiplication and Base
Extension in Residue Number Systems. In N. Burgess, editor, Proceedings of the
15th IEEE Symposium on Computer Arithmetic ARITH15, pages 59–65, Vail,
Colorado, USA, June 2001.
[BMP05] J.-C. Bajard, N. Meloni, and T. Plantard. Efficient RNS Bases for Cryptography.
In Proceedings of IMACS 2005 World Congress, Paris, France, July 2005.
[Bos85] W. Bosma. Primality testing using elliptic curves. Technical Report 85-12,
Universiteit van Amsterdam, 1985.
[BP01] T. Blum and C. Paar. High Radix Montgomery Modular Exponentiation on Reconfigurable Hardware. IEEE Transactions on Computers, 50(7):759–764, 2001.
[BP04] J.-C. Bajard and T. Plantard. RNS bases and conversions. Advanced Signal
Processing Algorithms, Architectures, and Implementations XIV, 5559:1:60–69,
2004.
[Bre86] R. P. Brent. Some Integer Factorization Algorithms Using Elliptic Curves. Australian Computer Science Communications, 8:149–163, 1986.
[BSQ+ 08] P. Bulens, F.-X. Standaert, J.-J. Quisquater, P. Pellegrin, and G. Rouvroy. Implementation of the AES-128 on Virtex-5 FPGAs. In S. Vaudenay, editor, Proceedings of First International Conference on Cryptology in Africa – AFRICACRYPT 2008, volume 5023 of LNCS, pages 16–26. Springer-Verlag, 2008.
[Cer00a] Certicom Research. Standards for Efficient Cryptography – SEC 1: Elliptic Curve
Cryptography. Available at http://www.secg.org/secg_docs.htm, September
2000. Version 1.0.
[CG03] P. Chodowiec and K. Gaj. Very compact FPGA implementation of the AES
algorithm. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Proceedings of
the Workshop on Cryptographic Hardware and Embedded Systems (CHES 2003),
volume 2779 of LNCS, pages 319–333. Springer-Verlag, 2003.
[cKKAK96] Ç. K. Koç, T. Acar, and B. S. Kaliski, Jr. Analyzing and comparing Montgomery
multiplication algorithms. IEEE Micro, 16(3):26–33, 1996.
[CS07] N. Costigan and M. Scott. Accelerating SSL using the vector processors in IBM's Cell broadband engine for Sony's Playstation 3. Workshop on Software Performance Enhancement for Encryption and Decryption (SPEED 2007), 2007.
[CUD07] NVIDIA Corporation, Santa Clara, CA, USA. Compute Unified Device Archi-
tecture (CUDA) Programming Guide, Version 1.0, 2007.
[dDBQ07] G. M. de Dormale, P. Bulens, and J.-J. Quisquater. Collision Search for Elliptic Curve Discrete Logarithm over GF (2m ) with FPGA. In P. Paillier and I. Verbauwhede, editors, Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems (CHES 2007), volume 4727 of LNCS, page 378. Springer-Verlag, 2007.
[DGP08] S. Drimer, T. Güneysu, and C. Paar. DSPs, BRAMs and a pinch of logic: new
recipes for AES on FPGAs. In Proceedings of the IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM 2008), pages 99–108. IEEE
Computer Society, April 2008.
[DH77] W. Diffie and M. E. Hellman. Exhaustive cryptanalysis of the NBS Data En-
cryption Standard. COMPUTER, 10(6):74–84, June 1977.
[DR02] J. Daemen and V. Rijmen. The Design of Rijndael: AES – The Advanced Encryption Standard. Springer, 2002.
[Dri09] S. Drimer. Security for volatile FPGAs. PhD thesis, Cambridge University, 2009.
to appear.
[EGP+ 07a] T. Eisenbarth, T. Güneysu, C. Paar, A.-R. Sadeghi, D. Schellekens, and M. Wolf. Reconfigurable trusted computing in hardware. In Proceedings of the ACM Workshop on Scalable Trusted Computing (STC '07), pages 15–20, New York, NY, USA, 2007. ACM Press.
[EGP+ 07b] T. Eisenbarth, T. Güneysu, C. Paar, A.-R. Sadeghi, M. Wolf, and R. Tessier.
Establishing Chain of Trust in Reconfigurable Hardware. In Proceedings of the
IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM
2007), pages 289–290, Washington, DC, USA, 2007. IEEE Computer Society.
[Elg85] T. Elgamal. A public key cryptosystem and a signature scheme based on discrete
logarithms. IEEE Transactions on Information Theory, 31(4):469–472, 1985.
[FKP+ 05] J. Franke, T. Kleinjung, C. Paar, J. Pelzl, C. Priplata, and C. Stahlke. SHARK – A Realizable Special Hardware Sieving Device for Factoring 1024-bit Integers. In J. R. Rao and B. Sunar, editors, Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems (CHES 2005), volume 3659 of LNCS, pages 119–130. Springer-Verlag, August 2005.
[FL05] J. Fry and M. Langhammer. RSA & Public Key Cryptography in FPGAs. Tech-
nical report, Altera Corporation, 2005.
[GB05] T. Good and M. Benaissa. AES on FPGA from the fastest to the smallest. In
J. R. Rao and B. Sunar, editors, Proceedings of the Workshop on Cryptographic
Hardware and Embedded Systems (CHES 2005), volume 3659 of LNCS, pages
427–440. Springer-Verlag, 2005.
[GKB+ 06a] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, and R. Bachimanchi. Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware. In L. Goubin and M. Matsui, editors, Proceedings of the Workshop
170
Bibliography
[GKB+ 06b] K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin, and R. Bachi-
manchi. Implementing the Elliptic Curve Method of Factoring in Reconfigurable
Hardware. Presented at the Workshop on Special Purpose Hardware for At-
tacking Cryptographic Systems (SHARCS’06), 2006.
[GKN+ 08] T. Güneysu, T. Kasper, M. Novotný, C. Paar, and A. Rupp. Cryptanalysis with
COPACOBANA. IEEE Transactions on Computers, 57(11):1498–1513, Novem-
ber 2008.
[GKST07] J. Guajardo, S. Kumar, G.-J. Schrijen, and P. Tuyls. FPGA intrinsic PUFs and
their use for IP protection. In P. Paillier and I. Verbauwhede, editors, Proceedings
of the Workshop on Cryptographic Hardware and Embedded Systems (CHES 2007),
volume 4727 of LNCS, pages 63–80. Springer-Verlag, 2007.
[GMP07b] T. Güneysu, B. Möller, and C. Paar. New Protection Mechanisms for Intellectual
Property in Reconfigurable Logic. In Proceedings of the IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM 2007), pages 287–
288, Washington, DC, USA, 2007. IEEE Computer Society.
[GP08] T. Güneysu and C. Paar. Ultra High Performance ECC over NIST Primes on
Commercial FPGAs. In E. Oswald and P. Rohatgi, editors, Proceedings of the
Workshop on Cryptographic Hardware and Embedded Systems (CHES 2008), vol-
ume 5154 of LNCS, pages 62–78. Springer-Verlag, 2008.
[GPP07a] T. Güneysu, C. Paar, and J. Pelzl. Attacking elliptic curve cryptosystems with
special-purpose hardware. In Proceedings of the International Symposium on
Field Programmable Gate Arrays (FPGA 2007), pages 207–215. ACM Press,
2007.
[GPP+ 07b] T. Güneysu, C. Paar, J. Pelzl, G. Pfeiffer, M. Schimmler, and C. Schleiffer. Par-
allel Computing with Low-Cost FPGAs: A Framework for COPACOBANA. In
Proceedings of the Symposium on Parallel Computing with FPGAs (ParaFPGA
2007), LNI, Jülich, Germany, September 2007. Springer-Verlag.
[GPP08] T. Güneysu, C. Paar, and J. Pelzl. Special-Purpose Hardware for Solving the El-
liptic Curve Discrete Logarithm Problem. ACM Transactions on Reconfigurable
Technology and Systems (TRETS), 1(2):1–21, 2008.
[GT07] P. Gaudry and E. Thomé. The mpFq library and implementing curve-based key
exchanges. Workshop on Software Performance Enhancement for Encryption
and Decryption (SPEED 2007), 2007.
[Gün06] T. Güneysu. Efficient Hardware Architectures for Solving the Discrete Loga-
rithm Problem on Elliptic Curves. Master’s thesis, Horst Görtz Institute, Ruhr-
University Bochum, February 2006.
[HCD07] H. Hisil, G. Carter, and E. Dawson. Faster Group Operations on Special Elliptic
Curves. Cryptology ePrint Archive, Report 2007/441, 2007. http://eprint.
iacr.org/.
[Hel07] Helion Technology. High performance AES (Rijndael) cores for Xilinx FPGAs,
2007. http://www.heliontech.com/downloads/aes_xilinx_helioncore.pdf.
[HV04] A. Hodjat and I. Verbauwhede. A 21.54 Gbits/s fully pipelined AES processor on
FPGA. In Proceedings of the IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM 2004), pages 308–309. IEEE Computer Society,
2004.
[IKM00] T. Ichikawa, T. Kasuya, and M. Matsui. Hardware evaluation of the AES finalists.
In Proceedings of the Third AES Candidate Conference, April 13–14, 2000.
[Int08] International Business Machines Inc. IBM Research: BlueGene, 2008. Available
at http://www.research.ibm.com/bluegene/.
[JY03] M. Joye and S.-M. Yen. The Montgomery Powering Ladder. In B. S. Kaliski,
Ç. K. Koç, and C. Paar, editors, Proceedings of the Workshop on Cryptographic
Hardware and Embedded Systems (CHES 2002), volume 2523 of LNCS, pages
291–302. Springer-Verlag, 2003.
[Kal95] B. S. Kaliski Jr. The Montgomery Inverse and its Applications. IEEE Trans-
actions on Computers, 44:1064–1065, 1995.
[KG04] P. Kohlbrenner and K. Gaj. An embedded true random number generator for
FPGAs. In R. Tessier and H. Schmit, editors, Proceedings of the International
Symposium on Field Programmable Gate Arrays (FPGA 2004). ACM Press, 2004.
[KKL+ 05] U. Kühn, K. Kursawe, S. Lucks, A.-R. Sadeghi, and C. Stüble. Secure Data Man-
agement in Trusted Computing. In J. R. Rao and B. Sunar, editors, Proceedings
of the Workshop on Cryptographic Hardware and Embedded Systems (CHES 2005),
volume 3659 of LNCS, pages 324–338. Springer-Verlag, 2005.
[KKSS00] S. Kawamura, M. Koike, F. Sano, and A. Shimbo. Cox-rower architecture for fast
parallel Montgomery multiplication. In B. Preneel, editor, Advances in Cryptol-
ogy – Proceedings of EUROCRYPT 2000, volume 1807 of LNCS, pages 523–538.
Springer-Verlag, 2000.
[Koç95] Ç. K. Koç. RSA Hardware Implementation. Technical report TR801, RSA Data
Security, Inc., 1995. Available at http://islab.oregonstate.edu/koc/docs/
r02.pdf.
[KPP+ 06a] S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, A. Rupp, and M. Schimmler. How
to Break DES for € 8,980. Presented at the Workshop on Special Purpose
Hardware for Attacking Cryptographic Systems (SHARCS’06), 2006.
[KPP+ 06b] S. Kumar, C. Paar, J. Pelzl, G. Pfeiffer, and M. Schimmler. Breaking Ciphers
with COPACOBANA - A Cost-Optimized Parallel Code Breaker. In L. Goubin
and M. Matsui, editors, Proceedings of the Workshop on Cryptographic Hardware
and Embedded Systems (CHES 2006), volume 4249 of LNCS, pages 101–118.
Springer-Verlag, 2006.
[Lim04] D. Lim. Extracting secret keys from integrated circuits. Master’s thesis, Mas-
sachusetts Institute of Technology, Dept. of Electrical Engineering and Computer
Science, 2004.
[LL93] A. K. Lenstra and H. W. Lenstra Jr., editors. The Development of the Number
Field Sieve, volume 1554 of LNM. Springer-Verlag, 1993.
[Men07] N. Mentens. Secure and Efficient Coprocessor Design for Cryptographic Applica-
tions on FPGAs. PhD thesis, Katholieke Universiteit Leuven, Leuven-Heverlee,
Belgium, June 2007.
[MM01] M. McLoone and J.V. McCanny. High performance single-chip FPGA Rijndael
algorithm implementations. In Ç. K. Koç, D. Naccache, and C. Paar, editors,
Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems
(CHES 2001), volume 2162 of LNCS, pages 65–76. Springer-Verlag, 2001.
[MM03] M. McLoone and J.V. McCanny. Rijndael FPGA implementations utilising look-
up tables. The Journal of VLSI Signal Processing, 34(3):261–275, 2003.
[Mon87] P. L. Montgomery. Speeding the Pollard and elliptic curve methods of factoriza-
tion. Mathematics of Computation, 48(177):243–264, 1987.
[MPS07] A. Moss, D. Page, and N. Smart. Toward acceleration of RSA using 3d graphics
hardware. In Cryptography and Coding, volume 4887 of LNCS, pages 369–388.
Springer, 2007.
[Nat77] National Institute of Standards and Technology (NIST). FIPS PUB 46-3: Data
Encryption Standard, January 1977.
[Nat00] National Institute of Standards and Technology (NIST). Digital signature stan-
dard (DSS) (FIPS 186-2), January 2000.
[Nat01] National Institute of Standards and Technology (NIST). FIPS PUB 197: Ad-
vanced Encryption Standard, 2001.
[OP01] G. Orlando and C. Paar. A Scalable GF (p) Elliptic Curve Processor Architecture
for Programmable Hardware. In Ç. K. Koç, D. Naccache, and C. Paar, editors,
Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems
(CHES 2001), volume 2162 of LNCS, pages 356–371. Springer-Verlag, 2001.
[P1300] Institute of Electrical and Electronics Engineers. IEEE P1363 Standard Specifi-
cations for Public Key Cryptography, 2000.
[PO96] B. Preneel and P. C. van Oorschot. Key recovery attack on ANSI X9.19 retail
MAC. Electronics Letters, 32(17):1568–1569, 1996.
[Poe06] B. Poettering. SECCURE Elliptic Curve Crypto Utility for Reliable Encryp-
tion, version 0.3, August 2006. Available at http://point-at-infinity.org/
seccure/.
[Pol75] J. M. Pollard. Monte Carlo method for factorization. Nordisk Tidskrift for
Informationsbehandlung (BIT), 15:331–334, 1975.
[Pol78] J. M. Pollard. Monte Carlo methods for index computation mod p. Mathematics
of Computation, 32(143):918–924, July 1978.
[PTX07] NVIDIA Corporation, Santa Clara, CA, USA. Parallel Thread Execution (PTX)
ISA Release 1.0, 2007.
[RSA78] R. L. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signa-
tures and Public-Key Cryptosystems. Communications of the ACM, 21(2):120–
126, February 1978.
[RSA07] RSA, The Security Division of EMC. RSA SecurID, 2007. http://www.rsa.com/.
[RSQL03] G. Rouvroy, F.-X. Standaert, J.-J. Quisquater, and J.-D. Legat. Design Strategies
and Modified Descriptions to Optimize Cipher FPGA Implementations: Fast and
Compact Results for DES and Triple-DES. In Field-Programmable Logic and
Applications - FPL, pages 181–193, 2003.
[RSQL04] G. Rouvroy, F.-X. Standaert, J.-J. Quisquater, and J.-D. Legat. Compact and
Efficient Encryption/Decryption Module for FPGA Implementation of the AES
Rijndael Very Well Suited for Small Embedded Applications. International Con-
ference on Information Technology: Coding and Computing, 2:583, 2004.
[SG08] R. Szerwinski and T. Güneysu. Exploiting the Power of GPUs for Asymmetric
Cryptography. In E. Oswald and P. Rohatgi, editors, Proceedings of the Work-
shop on Cryptographic Hardware and Embedded Systems (CHES 2008), volume
5154 of LNCS, pages 79–99. Springer-Verlag, 2008.
[SK89] A. P. Shenoy and R. Kumaresan. Fast Base Extension Using a Redundant Mod-
ulus in RNS. IEEE Transactions on Computers, 38(2):292–297, February 1989.
[Sma01] N. P. Smart. The Hessian form of an elliptic curve. In Ç. K. Koç, D. Naccache,
and C. Paar, editors, Proceedings of the Workshop on Cryptographic Hardware
and Embedded Systems (CHES 2001), volume 2162 of LNCS, pages 118–125.
Springer-Verlag, 2001.
[SÖP04] F.-X. Standaert, S. B. Örs, and B. Preneel. Power analysis of an FPGA imple-
mentation of Rijndael: Is pipelining a DPA countermeasure? In M. Joye and
J.-J. Quisquater, editors, Proceedings of the Workshop on Cryptographic Hard-
ware and Embedded Systems (CHES 2004), volume 3156 of LNCS, pages 30–44.
Springer-Verlag, 2004.
[SRQL03b] F.-X. Standaert, G. Rouvroy, J.-J. Quisquater, and J.-D. Legat. Efficient imple-
mentation of Rijndael encryption in reconfigurable hardware: improvements and
design tradeoffs. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, Proceedings of
the Workshop on Cryptographic Hardware and Embedded Systems (CHES 2003),
volume 2779 of LNCS, pages 334–350. Springer-Verlag, 2003.
[ST67] N. S. Szabó and R. I. Tanaka. Residue Arithmetic and its Applications to Com-
puter Technology. McGraw - Hill Inc., USA, 1967.
[ST03a] A. Satoh and K. Takano. A scalable dual-field elliptic curve cryptographic pro-
cessor. IEEE Transactions on Computers, 52(4):449–460, 2003.
[ST03b] A. Shamir and E. Tromer. Factoring Large Numbers with the TWIRL Device.
In D. Boneh, editor, Advances in Cryptology – Proceedings of CRYPTO 2003,
volume 2729 of LNCS, pages 1–26. Springer-Verlag, 2003.
[Sti05] D. R. Stinson. Cryptography: Theory and Practice. Taylor & Francis, 3rd edition,
2005.
[Suz07] D. Suzuki. How to Maximize the Potential of FPGA Resources for Modular
Exponentiation. In P. Paillier and I. Verbauwhede, editors, Proceedings of the
Workshop on Cryptographic Hardware and Embedded Systems (CHES 2007), vol-
ume 4727 of LNCS, pages 272–288. Springer-Verlag, 2007.
[SZJvD04] R. Sailer, X. Zhang, T. Jaeger, and L. van Doorn. Design and Implementation of
a TCG-based Integrity Measurement Architecture. In Proceedings of the 13th
USENIX Security Symposium, pages 223–238, 2004.
[Tes98] E. Teske. Speeding Up Pollard’s Rho Method for Computing Discrete Loga-
rithms. In Algorithmic Number Theory Symposium (ANTS-III), volume 1423 of
LNCS, pages 541–554. Springer-Verlag, 1998.
[TraAD] G. Suetonius Tranquillus. De vita Caesarum (transl.: On the Life of the Caesars).
Volume 1: Life of Julius Caesar, 121 AD.
[Tru06] Trusted Computing Group (TCG). TPM specification, version 1.2 revision 94,
March 2006. Available at http://www.trustedcomputinggroup.org/specs/
TPM/.
[Tru08] Trusted Computing Group (TCG). About the TCG, 2008. Available at http:
//www.trustedcomputinggroup.org/about/.
[vOW99] P. C. van Oorschot and M. J. Wiener. Parallel Collision Search with Cryptana-
lytic Applications. Journal of Cryptology, 12(1):1–28, 1999.
[Wie96] M. J. Wiener. Efficient DES Key Search. In William R. Stallings, editor, Practi-
cal Cryptography for Data Internetworks, pages 31–79. IEEE Computer Society
Press, 1996.
[WP03] T. Wollinger and C. Paar. How secure are FPGAs in cryptographic applications?
In Field-Programmable Logic and Applications - FPL, 2003.
[Xil06] Xilinx Inc. UG190: Virtex-5 user guide, 2006. Available at http://www.xilinx.
com/support/documentation/user_guides/ug190.pdf.
[Xil07] Xilinx Inc. UG193: Virtex-5 XtremeDSP design considerations user guide, 2007.
Available at http://www.xilinx.com/support/documentation/user_guides/
ug193.pdf.
[Xil08a] Xilinx Inc. Xilinx’ History of FPGA Development, 2008. Available at http:
//www.xilinx.com/company/history.htm.
[Xil08b] Xilinx Inc. Xilinx Spartan-3 and Virtex FPGA devices, 2008. Available at www.
xilinx.com/products/silicon_solutions/.
List of Figures
2.1 The key schedule derives subkeys for the round computations from a main key. . 18
2.2 Simplified structure of Xilinx Virtex-5 FPGAs. . . . . . . . . . . . . . . . . . . . 19
2.3 Generic and simplified structure of DSP-blocks of advanced FPGA devices. . . . 20
2.4 The mapping of AES column operations onto functional components of modern
Virtex-5 devices. Each dual ported BRAM contains four T-tables, including
separate tables for the last round. Each DSP block performs a 32-bit bit-wise
XOR operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 The complete basic AES module consisting of 4 DSP slices and 2 dual-ported
Block Memories. Tables T1 and T3 are constructed on-the-fly using byte shifting
from tables T0 and T2 in the block memory, respectively. . . . . . . . . . . . . . . 22
2.6 Pipeline stages to compute the column output of an AES round. . . . . . . . . . 23
2.7 Four instances of the basic structure in hardware allow all AES columns to be
processed in parallel (128-bit data path). . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Block diagram of the key schedule implementation. Complex instructions of the
finite state machine, S-boxes, round constants, and 32-bit subkeys are stored in
the dual-port BRAM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 The memory and programming model for CUDA based applications. . . . . . . . 50
4.2 Results for modular exponentiation with about 1024 (left) and 2048 bit (right)
moduli for different base extension methods, based on a NVIDIA 8800 GTS
graphics card. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Results for modular exponentiation with about 1024 (left) and 2048 bit (right)
moduli and elliptic curve point multiplication on NIST’s P-224 curve, based on
a NVIDIA 8800 GTS graphics card. . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1 Generic Montgomery multiplier designed for use with Virtex-4 DSP blocks. . . . 122
7.2 Each FPGA contains an individual ECM system with multiple ECM cores op-
erating in SIMD fashion. Factorization of different integer bit lengths can be
supported by different FPGAs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.3 Architecture of the modified COPACOBANA cluster based on Virtex-4 SX35 FPGAs. 127
List of Tables
2.1 Our results along with recent academic and commercial implementations. De-
cryption (Dec.) and Key expansion (Key) are included when denoted by •, by
◦ otherwise. Note the structural differences between the FPGA types: Virtex-5
(V5) has 4 FFs and 4 6-LUTs per slice and a 36 Kbit BRAM, while Spartan-3 (S3),
Virtex-E (VE), and Virtex-II (Pro) (V2/V2P) have 2 FFs and 2 4-LUTs per slice
and an 18 Kbit BRAM. Spartan-II (S2) devices only provide 4 Kbit BRAMs. . . . 28
2.2 Implementation results for the AES key schedule. Most state machine encoding
and control logic has been incorporated into the BRAM to save on logic resources. 29
3.1 Resource requirements of a single ECC core on a Virtex-4 FX 12 after PAR. Note
the different clock domains for arithmetic (DSP) and control logic. . . . . . . . . 42
3.2 Performance of ECC operations based on a single ECC core using projective
Chudnovsky coordinates on a Virtex-4 XC4VFX12 (Figures with an asterisk are
estimates). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Results of a multi-core architecture on a Virtex-4 XC4VSX55 device for ECC
over prime fields P-224 and P-256 (Figures with an asterisk are estimates). . . . 44
3.4 Selected high-performance implementations of public-key cryptosystems. . . . . . 45
6.6 MPPR performance comparison in pts/sec for a US$ 10,000 investment, with the
speed-up factor over the software-based implementation given in parentheses. . . 106
6.7 Expected runtime on different platforms for the Certicom ECC challenges. . . . . 108
6.8 Relative speed-up of different Certicom ECC challenges compared to the Certicom
reference implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.9 Cost-performance consideration of MPPR attacks with ASICs (10^7 gates, 500 MHz,
NRE costs excluded). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7.1 Combined point addition and doubling (2P and P + Q) on Montgomery curves
for the case z_{P−Q} = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2 Combined point addition and doubling (2P and P + Q) in inverted, twisted
Edwards coordinates. Bold-faced operations denote full-size modular multiplica-
tions with h × h bits, all other operations take at most a third of the runtime of
the full multiplication. The double line marks a possible early termination point
of the instruction sequence to perform a sole point doubling. . . . . . . . . . . . . 125
7.3 Resource consumption of a single ECM core after place-and-route. . . . . . . . . 129
7.4 Clock cycles and frequency for a single ECM core (stage 1) to factor a 151 bit
integer with bound B1 = 960 and B2 = 57000. For comparison, values with
asterisk were scaled down from figures for 198 bit integers. . . . . . . . . . . . . . 129
9.1 Estimated number of Logical Elements (LE) and RAM bits for the TPM func-
tionality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
A.1 Supplementary information to Chapter 2: initial 13 clock cycles of the eight
pipeline stages computing a plaintext input. Steps are RAM lookup L, RAM
output register R, transform T, DSP input register D, and DSP XOR ⊕. After
eight cycles the output column E0′ is used as input to the next round, etc. . . . . 163
A.2 Supplementary information to Chapter 4: modulus sizes for modular multiplica-
tion using RNS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
A.3 Supplementary information to Chapter 3: instruction sequence for point addition
using projective Chudnovsky coordinates based on a parallel adder and multiplier. 164
A.4 Supplementary information to Chapter 3: instruction sequence for point doubling
using projective Chudnovsky coordinates based on a parallel adder and multiplier. 164
List of Abbreviations
About the Author
Personal Data
Short Resume⋆
Feb 2007 – present Fellow of Research School at Ruhr-University Bochum
Mar 2006 – present Research associate of Embedded Security Group (EM-
SEC) at Ruhr-University Bochum
Oct 2003 – Jan 2006 Study of “IT Security” at Ruhr-University Bochum
Sep 2002 – Jan 2003 Semester abroad at Staffordshire University, UK
Jun 2002 – Sep 2002 Scholarship by Carl-Duisberg Association for internship
at IBM Almaden Research Lab, San José, USA
Oct 2000 – Sep 2003 Study of “Information Technology International” at
University of Cooperative Education Mannheim
⋆ As of February 2009.
Publications
The author of this thesis has worked in several research areas. The following contributions to
the cryptographic and reconfigurable computing community were published (as of January 2009).
Book Chapters
Tim Güneysu, Christof Paar. Modular Integer Arithmetic for Public-key Cryptography.
Chapter in Secure Integrated Circuits and Systems, I. Verbauwhede, editor, to appear,
Kluwer, 2009.
Journals
Tim Güneysu, Timo Kasper, Martin Novotný, Christof Paar, Andy Rupp. Cryptanalysis
with COPACOBANA. In IEEE Transactions on Computers, IEEE Computer Society,
volume 57, number 11, pp. 1498–1513 November 2008.
Tim Güneysu, Christof Paar, Jan Pelzl. Special-Purpose Hardware for Solving the
Elliptic Curve Discrete Logarithm Problem. In ACM Transactions on Reconfigurable
Technology and Systems (TRETS), volume 1, number 2, pp. 1–21, June 2008.
Jorge Guajardo, Tim Güneysu, Sandeep Kumar, Christof Paar, Jan Pelzl. Efficient
Hardware Implementation of Finite Fields with Applications to Cryptography. In Acta
Applicandae Mathematicae: An International Survey Journal on Applying Mathematics
and Mathematical Applications, volume 93, numbers 1-3, pp. 75–118, September 2006.
Conference Proceedings
Tim Güneysu, Christof Paar. Ultra High Performance ECC over NIST Primes on
Commercial FPGAs. In Proceedings of the Cryptographic Hardware and Embedded
Systems, LNCS Series, Washington D.C., USA, Springer-Verlag, August 2008.
Robert Szerwinski, Tim Güneysu. Exploiting the Power of GPUs for Asymmetric
Cryptography. In Proceedings of the Cryptographic Hardware and Embedded Systems,
LNCS Series, Washington D.C., USA, Springer-Verlag, August 2008.
Saar Drimer, Tim Güneysu and Christof Paar. DSPs, BRAMs and a Pinch of Logic:
New recipes for AES on FPGAs. In Proceedings of the IEEE Symposium on Field-
Programmable Custom Computing Machines, Stanford University, CA, USA, IEEE
Computer Society, April 2008.
Tim Güneysu, Christof Paar. Breaking Legacy Banking Standards with Special-Purpose
Hardware. In Proceedings of the Conference on Financial Cryptography and Data Se-
curity (FC08), Cozumel, Mexico, LNCS Series, Springer-Verlag, January 2008.
Tim Güneysu, Bodo Möller, Christof Paar. Dynamic Intellectual Property Protec-
tion for Reconfigurable Devices. In Proceedings of the IEEE Conference on Field-
Programmable Technology, (ICFPT), Kitakyushu, Japan, IEEE Computer Society, pp.
169-176, December 2007.
Tim Güneysu, Christof Paar, Sven Schäge. Efficient Hash Collision Search Strategies
on Special-Purpose Hardware. In Proceedings of the Western European Workshop on
Research in Cryptology (WeWORC), Bochum, Germany, LNCS Series, Springer-Verlag,
July 2007.
Tim Güneysu, Bodo Möller, Christof Paar. New Protection Mechanisms for Intellectual
Property in Reconfigurable Logic. In Proceedings of the IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM), IEEE Computer Society, Napa,
CA, USA, April 2007.
Thomas Eisenbarth, Tim Güneysu, Christof Paar, Ahmad-Reza Sadeghi, Russell Tessier,
Marko Wolf. Establishing Chain of Trust in Reconfigurable Hardware. In Proceedings of
the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM),
IEEE Computer Society, Napa, CA, USA, April 2007.
Tim Güneysu, Christof Paar, Jan Pelzl. Attacking Elliptic Curve Cryptosystems with
Special-Purpose Hardware. In Proceedings of the International Symposium on Field
Programmable Gate Arrays, ACM Press, Monterey, CA, USA, February 2007.
Tim Güneysu, Christof Paar, Jan Pelzl, Gerd Pfeiffer, Manfred Schimmler and Chris-
tian Schleiffer. Parallel Computing with Low-Cost FPGAs: A Framework for COPA-
COBANA. In Proceedings of the Symposium on Parallel FPGA Architecture (Para-
FPGA), LNI, Springer-Verlag, Jülich, Germany, September 2007.
Invited Talks
Tim Güneysu. High Performance ECC over NIST Primes on Commercial FPGAs.
12th Workshop on Elliptic Curve Cryptography (ECC 2008), Trianon Zalen, Utrecht,
The Netherlands, 22–24 September 2008.
Tim Güneysu. Efficient Hardware Architectures for Solving the Discrete Logarithm
Problem on Elliptic Curves. ECC-Brainpool, NXP Semiconductors, Hamburg, Ger-
many, 7 June 2006.
Technical Reports
Saar Drimer, Tim Güneysu, Markus Kuhn, Christof Paar. Protecting multiple cores
in a single FPGA design. Research Report, Horst Görtz Institute for IT security in
cooperation with University of Cambridge, September 2008.