Parallel Cryptanalysis

Ruben Niederhagen

Promotors: prof.dr. T. Lange and prof.dr. D.J. Bernstein
Copromotor: dr. C.-M. Cheng
1 Introduction
Most of today’s cryptographic primitives are based on computations that are hard
to perform for a potential attacker but easy to perform for somebody who is in
possession of some secret information, the key, that opens a back door in these
hard computations and allows them to be solved in a small amount of time. Each
cryptographic primitive should be designed such that the cost of an attack grows
exponentially with the problem size, while the cost of the computations using the
secret key grows only polynomially. To estimate the strength of a cryptographic primitive it is
important to know how hard it is to perform the computation without knowledge
of the secret back door and to get an understanding of how much money or
time the attacker has to spend. Usually a cryptographic primitive allows the
cryptographer to choose parameters that make an attack harder at the cost of
making the computations using the secret key harder as well. Therefore designing
a cryptographic primitive imposes the dilemma of choosing the parameters strong
enough to resist an attack up to a certain cost while choosing them small enough
to allow usage of the primitive in the real world, e.g. on small computing devices
like smart phones.
Typically a cryptographic attack requires a tremendous amount of compu-
tation—otherwise the cryptographic primitive under attack can be considered
broken. Given this tremendous amount of computation, it is likely that there are
computations that can be performed in parallel. Therefore, parallel computing
systems are a powerful tool for the attacker of a cryptographic system. In contrast
to a legitimate user who typically exploits only a small or moderate amount of
parallelism, an attacker is often able to launch an attack on a massively parallel
system. In practice the amount of parallel computation power available to an
attacker is limited only by the amount of money he is able to spend; however the
Overview
Chapter 2 gives an introduction to parallel computing. The work on Pollard’s
rho algorithm described in Chapter 3 is part of a large research collaboration
with several research groups; the computations are embarrassingly parallel and
are executed in a distributed fashion in several facilities on a variety of parallel
system architectures with almost negligible communication cost. This disserta-
tion presents implementations of the iteration function of Pollard’s rho algorithm
on Graphics Processing Units and on the Cell Broadband Engine. Chapter 4 de-
scribes how XL has been parallelized using the block Wiedemann algorithm on a
NUMA system using OpenMP and on an InfiniBand cluster using MPI. Wagner’s
generalized birthday attack is described in Chapter 5; the attack has been per-
formed on a distributed system of 8 multi-core nodes connected by an Ethernet
network.
2 Overview of parallel computing
Parallelism is an essential feature of the universe itself; as soon as there exists
more than one entity and these entities interact, some kind of parallel process is
taking place. This is the case in quantum mechanics as well as in astrophysics, in
chemistry and biology, in botany and zoology, and in human interaction and society.
Therefore it is natural to exploit parallelism in information technology as well,
and this has been done since the beginning of modern computing.
The main condition for parallel computing is that there are parts of the computation
which can be executed in parallel. This is the case if there are at least two parts
of the computation which have no data dependencies, i.e., the input data of each
part does not depend on the result of the computation of the other part. In [LS08],
Lin and Snyder classify parallelism into two types as follows:
• In case the same sequence of instructions (e.g., the same mathematical func-
tion) is applied to a set of independent data items, the computation on these
data items can be performed in any order and therefore, in particular, in
parallel. This kind of parallelism is called data parallelism. The amount of
parallelism scales with the number of independent data items.
• Some computations can be split in several tasks, i.e., independent instruc-
tion sequences which compute on independent data items. Since the tasks
are independent from each other, they can be computed in parallel. This
is called task parallelism. The number of independent tasks for a given
computation is fixed and does not scale with a larger input. Both kinds of
parallelism are illustrated in the sketch below.
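To make the distinction concrete, the following C sketch (using OpenMP, which also
appears later in this thesis) shows both kinds of parallelism: a data-parallel loop that
applies the same operation to every element of an array, and two independent tasks
that run concurrently. The functions are illustrative examples only, not code from the
implementations described later.

    #include <omp.h>
    #include <stddef.h>

    /* Data parallelism: the same operation is applied to independent data
       items, so the iterations may run in any order and thus in parallel. */
    void scale_all(double *x, size_t n, double c)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            x[i] *= c;
    }

    /* Task parallelism: two independent computations (here a checksum and a
       maximum) operate on their own results and can run concurrently; the
       number of tasks is fixed and does not grow with the input size. */
    void checksum_and_max(const unsigned *a, size_t n,
                          unsigned *sum, unsigned *max)
    {
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                unsigned s = 0;
                for (size_t i = 0; i < n; i++) s += a[i];
                *sum = s;
            }
            #pragma omp section
            {
                unsigned m = 0;
                for (size_t i = 0; i < n; i++) if (a[i] > m) m = a[i];
                *max = m;
            }
        }
    }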
Given that a program exhibits some parallelism, this parallelism can be ex-
ploited on a parallel computer architecture. In [Fly66] Flynn classifies computer
architectures into the following four categories:
• Single Instruction Stream – Single Data Stream (SISD): The computer fol-
lows only one instruction stream that operates on data from one single data
stream.
• Single Instruction Stream – Multiple Data Streams (SIMD): Each instruc-
tion of the instruction stream is executed on several data streams.
• Multiple Instruction Streams – Single Data Stream (MISD): Several instruc-
tions from different instruction streams are executed on the same single data
stream.
• Multiple Instruction Streams – Multiple Data Streams (MIMD): Instructions
from different instruction streams are applied to independent data streams.
Let p be the part of the computation that can be executed in parallel and s the
part that must be executed sequentially, i.e., p + s = 1; the best speedup possible
when using n execution units is

    S = 1/(s + p/n).
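As a small worked example of this bound, known as Amdahl's law, the following C
function evaluates it directly; with s = 0.1 and n = 16 the achievable speedup is at
most 6.4, and no number of execution units can push it beyond 1/s = 10.

    #include <stdio.h>

    /* Maximum speedup according to Amdahl's law: S = 1/(s + p/n), where s is
       the sequential fraction, p = 1 - s the parallel fraction, and n the
       number of execution units. */
    static double amdahl_speedup(double s, unsigned n)
    {
        double p = 1.0 - s;
        return 1.0 / (s + p / n);
    }

    int main(void)
    {
        printf("%.2f\n", amdahl_speedup(0.1, 16));   /* prints 6.40          */
        printf("%.2f\n", amdahl_speedup(0.1, 1024)); /* approaches 1/s = 10  */
        return 0;
    }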
2.1.1 Microarchitecture
The microarchitecture of a processor is the lowest level on which computations
can be executed in parallel. The microarchitecture focuses on the computation
of a single instruction stream. If a sequential instruction stream contains two or
more instructions that operate on independent data items, i.e., if the input of each
instruction does not depend on the output of another one, these instructions can be executed in any
order and therefore in parallel. This kind of parallelism is called instruction level
parallelism.
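The following generic C sketch illustrates instruction level parallelism; it is an
illustration only, not code from this thesis. In the first loop every addition depends
on the result of the previous one, so the additions must be executed one after the
other; in the second loop the four partial sums have no data dependencies on each
other and can be processed in parallel by a pipelined or superscalar core.

    #include <stddef.h>

    /* Dependent chain: each addition needs the result of the previous one. */
    double sum_serial(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* Four independent accumulators expose instruction level parallelism:
       the four additions of one iteration are independent of each other. */
    double sum_ilp(const double *x, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++)
            s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }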
Instruction pipelining
The execution of any instruction can be split into several stages [HP07, Ap-
pendix A]: for example it is split into instruction fetch, instruction decode, ex-
ecution, memory access, and write back of results to the output register. For
different instructions the execution stage might take a different amount of time
depending on the complexity of the instruction: for example an XOR-instruction
typically takes less time than a double-precision floating-point multiplication.
If all these stages were executed in one single clock cycle of a fixed length,
much time would be wasted on the faster instructions since the processor would
need to wait as long as the slowest instruction needs to finish. Furthermore, while
an instruction is processed, e.g. by the execution unit, all other stages would be
idle.
Instead, the processor frequency and instruction throughput can be increased
by pipelining the different stages of the instruction execution and by splitting
the execution stage in several clock cycles depending on instruction complexity.
After the first instruction has been loaded, it is forwarded to the decoder which can
decode the instruction in the next cycle. At the same time, the second instruction
is loaded. In the third cycle the first instruction starts to be executed while the
second one is decoded and a third one is loaded in parallel.
This requires all instructions which are in the pipeline at the same time to
be independent from each other; in case the instructions lack instruction level
parallelism, parts of the pipeline have to stall until all dependencies have been
resolved.
There are two ways to achieve a high throughput of instructions: Either the
instructions are scheduled by the compiler or by the programmer such that
dependencies between consecutive instructions are minimized within a window
whose size depends on the pipeline length. Or the fetch unit looks ahead in the instruction stream and
chooses an instruction for execution which is independent from the instructions
that are currently in the pipeline; this is called out-of-order execution. Details on
out-of-order execution can be found in [HP07, Section 2.4].
The physical distance between the pipeline stages is very small and the com-
munication time between the stages can almost be neglected. From the outside
a pipelined processor appears to be a SISD architecture; internally the pipeline
stages actually resemble a MIMD architecture.
Superscalar processors
Another way to exploit instruction level parallelism is to use several arithmetic
logic units (ALUs) to compute an instruction stream. An instruction scheduler
forwards several independent instructions to several ALUs in the same clock cy-
cle. The ALUs may share one common instruction set or they may have different
instruction sets, e.g., a processor may have two integer units and one floating
point unit. As in the case of pipelining, the instruction scheduler of a super-
scalar processor may examine several upcoming instructions out of order to find
independent instructions.
Vector processors
Vector processors implement the SIMD architecture: the same instruction is exe-
cuted on independent data elements (vector elements) from several data streams
in parallel. The elements are loaded into vector registers which consist of several
slots, one for each data element. Vector instructions, often called SIMD instruc-
tions, operate on the vector registers.
The number of execution units does not necessarily need to be equal to the
number of vector elements. The processor may operate only on parts of the
vector registers in each clock cycle. Usually vector processors only have 4 (e.g.,
NEC SX/8) to 8 (e.g., Cray SV1) ALUs (see [HP07, Appendix F]). Nevertheless,
operating on a large vector in several steps on groups of the vector register slots
allows the processor to hide instruction and memory latencies: since it takes
several cycles before the execution on the last group of register slots has started,
the operation on the first group is likely to be finished so that all dependencies
that might exist for the next instruction are resolved.
Initially, a very high number of vector elements was envisioned, for example as
many as 256 elements for the early prototype architecture Illiac IV (which eventually
was built with only 64 vector elements) [BDM+72]. The Cray-1 had 8 vector
registers of 64 vector elements each as well. Today, many specialized scientific
computers also offer wide vector registers of 64 to 256 vector elements [HP07,
Figure F.2].
Apart from these cases also short-vector architectures are very common. Most
of today’s x86-processors have 128-bit vector registers, which can be used for 8-,
16-, 32-, or 64-bit vector operations on sixteen, eight, four, or two vector elements.
The Synergistic Processing Units of the Cell processor have a similar design and
the PowerPC processors also provide 128-bit vector registers. Note that 128-bit
vector registers can also be used for 1-bit operations, i.e. logical operations, on 128
vector elements. This is exploited by a technique called bitslicing; see Chapter 3
for an application of this technique.
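The following C sketch illustrates the idea of bitslicing on ordinary 64-bit words;
the same principle applies to 128-bit vector registers. Bit i of every word belongs
to the i-th of 64 independent instances, so a single logical instruction acts as 64
parallel 1-bit operations. The majority function is merely an illustrative choice.

    #include <stdint.h>

    /* Bitsliced representation: each uint64_t holds one bit from each of 64
       independent instances. */
    typedef uint64_t slice;

    /* 64 parallel 1-bit XORs with a single instruction. */
    static inline slice slice_xor(slice a, slice b) { return a ^ b; }

    /* 64 parallel majority functions maj(a,b,c) = ab ^ bc ^ ca. */
    static inline slice slice_maj(slice a, slice b, slice c)
    {
        return (a & b) ^ (b & c) ^ (c & a);
    }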
Symmetric Multiprocessing
Several execution units can be connected by a system bus. If all execution units
have the same architecture, this system architecture is called Symmetric Multipro-
cessing (SMP). All execution units of an SMP architecture share a joint memory
space.
Multi-core processors: A multi-core processor consists of several execution
units called cores which compute in an MIMD fashion. Several resources might
be shared, such as (parts of) the cache hierarchy as well as IO and memory ports.
The operating system has access to all cores and schedules processes and threads
onto the cores. Communication between the cores can be accomplished via shared
cache or via main memory.
Almost all of today’s high-end CPUs are multi-core CPUs. Multi-core proces-
sors are used in a large range of devices: servers, desktop PCs, even embedded
systems and mobile phones are powered by multi-core processors. Common con-
figurations range from dual-core CPUs to ten- and even twelve-core CPUs (in May
2011). The core count of mainstream processors is envisioned to rise even higher,
even though up to now only a small fraction of desktop applications benefit from
multi-core architectures.
Non-Uniform Memory Access: If several conventional processors are placed
on one mainboard and are connected by a system bus, each processor has a mem-
ory controller of its own even though all processors have full access to all memory.
Since latency as well as throughput may vary depending on which physical mem-
ory is accessed, this architecture is called Non-Uniform Memory Access (NUMA)
architecture as opposed to Uniform Memory Access (UMA) architectures (for ex-
ample multi-core processors) where accessing any physical memory address has
the same performance. Several multi-core processors may be used to set up a
NUMA system.
Since the mainboards for NUMA systems are more expensive than off-the-shelf
boards, NUMA systems are usually used for commercial or scientific workloads.
The communication distance between the execution units is higher than in the
case of multi-core processors since caches are not shared and all communication
must be accomplished via main memory.
Simultaneous multithreading: In general a lack of instruction level paral-
lelism on pipelined or superscalar processors as well as high-latency instructions
(like memory access or IO) may lead to stalls of one or several ALUs. This can
be compensated for by a technique called simultaneous multithreading (SMT): A
processor with an SMT architecture concurrently computes two or more instruc-
tion streams, in this case called processes or threads. Therefore the instruction
scheduler has more options to choose an instruction from one of the independent
instruction streams. For concurrent execution, each instruction stream requires
its own instance of resources like the register file. In contrast to multi-core proces-
sors, simultaneous multithreading is not really a parallel architecture even though
it appears to be a MIMD architecture from the outside. Actually, it just allows a
higher exploitation of the computation resources of a single processor.
Since each thread has its own register file, communication between the instruc-
tion streams is accomplished via memory access. In the best case, shared data
can be found in on-chip caches but communication might need to be channeled
via main memory.
SMT is state of the art in many of today’s processors. Usually, two concurrent
threads are offered. Intel calls its SMT implementation Hyper-Threading Tech-
nology (HTT) and uses it in most of its products. AMD, IBM and others use
SMT as well.
Accelerators
Using accelerators in contrast to SMP architectures leads to a heterogeneous
system design. Accelerators can be any kind of computing device in addition to a
classical processor. Either the accelerators are connected to the processor via the
system bus as for example in the case of Field-Programmable Gate Array (FPGA)
cards and graphics cards or they can be integrated onto the die of the processor
like the Synergistic Processing Elements of the Cell processor, Intel’s E600C Atom
chips with integrated FPGAs, or the Fusion processors of AMD, which include
Graphics Processing Units (GPUs); there is a strong trend for moving accelerators
from the system bus into the processor chip.
Parallel computing on FPGAs is not part of this thesis and therefore will
not be further examined; GPUs will be described in more detail in Sections 2.3
and 3.4.1, the Cell processor in Section 3.3.1.
Supercomputers
The term supercomputer traditionally describes a parallel architecture of tightly
coupled processors where the term tight depends on the current state of the art.
Many features of past supercomputers appear in present desktop computers, for
example multi-core and vector processors. Today’s state of the art supercomput-
ers contain a large number of proprietary technologies that have been developed
exclusively for this application. Even though supercomputers are seen as a single
system, they fill several cabinets or even floors due to their tremendous number
of execution units.
Since supercomputers are highly specialized and optimized, they have a higher
performance compared to mainstream systems. On the TOP500 list of the most
powerful computer systems from November 2011 only 17.8% of the machines are
supercomputers but they contribute 32.16% of the computing power. The ma-
chines of rank one and two are supercomputers. Even though the number of
supercomputers in the TOP500 list has been declining over the past decade, they
still are the most powerful architectures for suitable applications. The TOP500
list is published at [MDS+ 11].
Supercomputers are MIMD architectures even though they commonly are
treated as Single Program Multiple Data (SPMD) machines (imitating Flynn’s
taxonomy): the same program is executed on all processors, each processor handles
2.1. PARALLEL ARCHITECTURES 11
Cluster computing
Distributed computing
If the communication distance is very high, usually the term distributed com-
puting is used. This term applies if no guarantee for communication latency or
throughput is given. This architecture consists of a number of nodes which are
connected by a network. The nodes can be desktop computers or conventional
server machines. The Internet or classical Ethernet networks may be used as
interconnecting network.
In general, the physical location of the nodes is entirely arbitrary; the nodes
may be distributed all over the world. SETI@home [KWA+ 01] was one of the
first scientific projects for distributed computing. Its grid computing middleware
“BOINC” [And04] is used for many different scientific workloads today.
If the location of the nodes is more restricted, e.g., if the desktop computers of
a company are used for parallel computing in the background of daily workload
or exclusively at night, the term office grid is used. Some companies like Amazon
offer distributed computing as cloud services.
2.2.3 Summary
Neither of the two approaches is more expressive than the other: An implemen-
tation of the MPI standard might use shared-memory programming for message
delivery on an SMP architecture; a distributed shared-memory programming lan-
guage like Unified Parallel C (UPC, see [LBL]) might use message-passing primi-
tives to keep shared-memory segments coherent on cluster architectures.
The choice between these paradigms depends on the application and its tar-
geted system architecture. The paradigms are not mutually exclusive; hybrid
solutions are possible and common on, for example, clusters of SMP architec-
tures. In this case one process (or several processes on NUMA architectures) with
several threads is executed on each cluster node. Local communication between
the threads is accomplished via shared memory, remote communication between
the cluster nodes (and NUMA nodes) via message passing.
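A minimal sketch of such a hybrid program in C is shown below: one MPI process per
node with several OpenMP threads each; the threads share their work via shared
memory, and the per-node results are combined via message passing. The computation
itself is a placeholder and unrelated to the implementations described in the later
chapters.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        /* Request thread support; only the main thread performs MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        long local = 0;
        /* Shared-memory parallelism inside one node. */
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < 1000000; i++)
            local += (i + rank) % 7;

        long global = 0;
        /* Message passing between the nodes. */
        MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global result: %ld\n", global);
        MPI_Finalize();
        return 0;
    }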
Even though GPUs have been used successfully to accelerate scientific algo-
rithms and workloads (e.g. [BGB10; RRB+ 08]), their performance can only be
fully exploited if the program is carefully optimized for the target hardware. There
are also reports of cases where GPUs do not deliver the expected performance (e.g.
[BBR10]). Apart from the fact that not all applications are suitable for an efficient
implementation on GPUs, there may be several reasons for an underutilization
of these computing devices, e.g. poor exploitation of data locality, bad register
allocation, or insufficient hit rates in the instruction cache. In particular the latter
two cases cannot easily, maybe not at all, be prevented when using a high-level
programming language: If the compiler does not deliver a suitable solution there
is no chance to improve the performance without circumventing the compiler.
To solve this issue, Bernstein, Schwabe, and I constructed a toolbox for im-
plementing GPU programs in a directly translated assembly language (see Sec-
tion 2.3.2). The current solution works only with NVIDIA’s first generation
CUDA hardware. Due to NVIDIA’s and AMD’s lack of public documentation
the support for state-of-the-art GPUs is still work in progress.
Since 2006, NVIDIA has released several versions of CUDA. Furthermore, the
processor architecture has been modified several times. The architectures are
classified by their respective compute capability. The first generation of CUDA
GPUs has the compute capabilities 1.0 to 1.3 and was distributed until 2010.
In 2010, NVIDIA introduced the second generation of CUDA called Fermi. Up
to now Fermi cards have the compute capabilities 2.0 and 2.1. Release steps
within each generation introduce minor additions to the instruction set and minor
extensions to the hardware architecture. The step from 1.3 to 2.0 introduced a
completely redesigned hardware architecture and instruction set.
Because CUDA allows major changes of the hardware implementation, CUDA
software is usually not distributed in binary form. Instead, the device driver
receives the CUDA-C or OpenCL source code or an instruction-set-independent
intermediate assembly code called Parallel Thread Execution (PTX). In either
case the source code is compiled for the actual hardware and then transferred to
the GPU for execution.
NVIDIA does not offer an assembler for CUDA; PTX-code is compiled like a
high level language and therefore does not give direct access to instruction schedul-
ing, register allocation, register spills, or even the instruction set. Nevertheless,
Wladimir J. van der Laan reverse-engineered the byte code of the first genera-
tion CUDA 1.x instruction set and implemented an assembler and disassembler in
Python. These tools are available online at [Laa07] as part of the Cubin Utilities,
also known as decuda.
In April 2011, Yun-Qing Hou started a similar project called asfermi in order
to provide an assembler and disassembler for the instruction set of CUDA 2.x
Fermi. Hou’s code is available at [Hou11]. Since this tool set does not yet fully
support all opcodes of the Fermi architecture, the remainder of this section will
focus on decuda and therefore compute capability 1.x and CUDA SDK version
2.3. Support for OpenCL has been introduced with CUDA SDK version 4.0 and
will not be described in this section.
The output file of cudasm can be put into the code repository of the CUDA
application. The CUDA runtime transfers it to the graphics card for execution
the same way as a kernel that was compiled from CUDA-C or PTX. The following
tweaks and compiler flags are necessary to make the CUDA runtime digest the
cudasm-kernel:
First a host application is written in CUDA-C. This application allocates de-
vice memory, prepares and transfers data to the graphics card and finally invokes
the kernel as defined by the CUDA API. The device kernel can be implemented
in CUDA-C for testing purposes; if this is omitted a function stub for the ker-
nel must be provided. Eventually, the program is compiled using the following
command:
nvcc -arch=architecture -ext=all -dir=repository sourcefiles -o outfile
The desired target architecture can be chosen by the flag arch; e.g. compute
capability 1.0 is requested by arch=sm_10, compute capability 1.3 by arch=sm_13.
The flag ext=all instructs the compiler to create a code repository for both PTX
and assembly code in the directory repository. For each run nvcc will create
a separate subdirectory in the directory of the code repository; the name of the
subdirectory seems to be a random string or a time-based hash value. The runtime
will choose the most recent kernel that fits the actual hardware. Deleting all
previous subdirectories when recompiling the host code makes it easier to keep
track of the most recent kernel version.
The PTX kernel will be placed inside the newly created subdirectory in the
file compute_architecture. The binary code resides in the file sm_architecture. The
latter one can simply be replaced by the binary that was created by cudasm. The
runtime library will load this binary at the next kernel launch.
Assembly programming
Commercial software development requires a fast, programmer-friendly, efficient,
and cheap workflow. Source code is expected to be modular, reusable, and
platform independent. These demands are usually achieved by using high-level
programming languages like C, C++, Java, C#, or Objective-C. There are only a
few reasons for programming in assembly language:
• high demands on performance: In many cases hand-optimizing critical code
sections gives better performance than compiler-optimized code. This has
several reasons; details can be found in [Sch11a, Section 2.7].
• the need for direct access to the instruction set: Some instructions like
the AES instruction set [Gue10] might not be accessible from high level
programming languages. Operating systems need access to special control
instructions.
• the requirement for full control over instruction order: For example the
program flow of cryptographic applications can leak secret information to
an attacker [Koc96]. This can be avoided by carefully scheduling instructions
by hand.
Only the latter one might allow the programmer to actually enjoy programming
in assembly language. Assembly programming is a complex and error prone task.
To reduce the complexity of manually writing GPU assembly code, this section
introduces several tools that tackle key issues of assembly code generation.
The major pitfall of writing assembly code is to keep track of what regis-
ters contain which data items during program execution. On the x86 architecture
which provides only 8 general purpose registers, this is already troublesome. Keep-
ing track of up to 128 registers on NVIDIA GPUs pushes this effort to another
level. However, the ability to choose which register contains which data at a
certain point of the program is not necessary for writing highly efficient
assembly code and is therefore usually not the reason for writing assembly code
in the first place. As long as the programmer can be sure that a register allocator
finds a perfect register allocation whenever one exists and reports failure otherwise,
register allocation can be delegated to a programming tool.
However, this tool should not be NVIDIA's compiler nvcc, since it neither finds
a perfect register allocation nor reports its failure.
Furthermore, there is no common syntax for different assembly languages; even
for the same instruction set the syntax may be different for assemblers of differ-
ent vendors: e.g. “addl %eax, -4(%rbp)” and “add DWORD PTR [rbp-4], eax”
are the same x86 instruction in gcc assembler and Intel assembler, respectively.
The instructions vary in mnemonic, register and addressing syntax, and even in
operand order. Therefore switching between architectures or compilers is error
prone and complicated. Furthermore, most assembly languages have not been
designed for human interaction and are not easily readable. This again is not an
inherent problem of assembly programming itself but a shortcoming of commonly
used assembly programming tools.
Both of these pitfalls, register allocation and syntax issues, are addressed by
Daniel J. Bernstein’s tool qhasm.
qhasm
The tool qhasm is available at [Ber07b]. It is neither an assembler itself nor has
it been designed for writing a complete program; qhasm is intended to be used
for replacing critical functions of a program with handwritten assembly language.
The job of qhasm is to translate an assembly-like source code, so called qhasm
code, into assembly code. Each line in qhasm code is translated to exactly one
assembly instruction. The syntax of the qhasm code is user-defined. Furthermore,
qhasm allows using an arbitrary number of named register variables instead of the
fixed number of architecture registers in the qhasm code. These register variables
are mapped to register names by the register allocator of qhasm. If qhasm does not
find a mapping, it returns with an error message; the programmer is in charge of
spilling registers to memory. The order of instructions is not modified by qhasm.
qhasm-cudasm
Van der Laan’s assembler cudasm combined with Bernstein’s register allocator
qhasm (together with a machine description file that was written for cudasm)
gives a powerful tool for programming NVIDIA graphics cards on assembly level
and can be easily used for small kernels. Nevertheless, for large kernels of several
hundred or even thousand lines of code additional tools are necessary to gain a
reasonable level of usability for the following reasons:
NVIDIA's compiler nvcc does not support linking of binary code; usually all
functions that are qualified with the keyword __device__ in CUDA-C are inlined
before compiling to binary code. Therefore it is not possible to replace only
parts of a CUDA-C kernel with qhasm code; the whole GPU kernel must be
implemented in qhasm code. But qhasm was designed to replace small, computationally
intensive functions. Register variables have no scope: their names are visible in
the whole qhasm code, which makes it complicated to keep track of the data flow.
Furthermore, qhasm does not support the programmer in maintaining a large code
base, e.g. by splitting the code into several files.
Therefore, a modified version of the m5 macro processor is used on top of
qhasm to simplify the handling of large kernels. The original version of m5 by
William A. Ward, Jr. can be found at [War01]. The following list gives an overview
of some native m5 features:
• includes: An m5 source file can include other m5 files; the content of the
files is inlined in the line where the include occurs.
• functions: Functions can be defined and called in m5; “call” in this case
means that the content of the called function is inlined at the position
where the call occurs.
3 Pollard's rho method
Since the Cell processor is a highly parallel architecture, it is well suited for the parallel work-
load of Pollard’s rho method. The Cell processor powers the PlayStation 3 gaming
console and therefore can be obtained at relatively low cost.
This section is structured as follows: First the Cell processor is introduced.
Then general design questions for the implementation of the iteration function are
stated. This is followed by a detailed description of the implementation of the it-
eration function and a close investigation of the memory management. Eventually
the results of the implementation for the Cell processor are discussed.
Data has to be transferred explicitly from main memory to local storage and vice versa by instructing the DMA
controller of the MFC. Due to the relatively small size of the local storage and the
lack of transparent access to main memory, the programmer has to ensure that
instructions and the active data set fit into the local storage and are transferred
between main memory and local storage accordingly.
The execution unit has a pure RISC-like SIMD instruction set encoded into
32-bit instruction words; instructions are issued strictly in order to two pipelines
called odd and even pipeline, which execute disjoint subsets of the instruction
set. The even pipeline handles floating-point operations, integer arithmetic, log-
ical instructions, and word shifts and rotates. The odd pipeline executes byte-
granularity shift, rotate-mask, and shuffle operations on quadwords, and branches
as well as loads and stores.
Up to two instructions can be issued each cycle, one in each pipeline, given that
alignment rules are respected (i.e., the instruction for the even pipeline is aligned
to a multiple of 8 bytes and the instruction for the odd pipeline is aligned to a
multiple of 8 bytes plus an offset of 4 bytes), that there are no interdependencies
to pending previous instructions for either of the two instructions, and that there
are in fact at least two instructions available for execution. Therefore, a careful
scheduling and alignment of instructions is necessary to achieve best performance.
Accessing main memory. As mentioned before, the MFC is the gate for the
SPU to reach main memory as well as other processor elements. Memory transfer
is initiated by the SPU and afterwards executed by the DMA controller of the
MFC in parallel to ongoing instruction execution by the SPU.
Since data transfers are executed in the background by the DMA controller,
the SPU needs feedback about when a previously initiated transfer has finished.
Therefore, each transfer is tagged with one of 32 tags. Later on, the SPU can
probe either in a blocking or non-blocking way if a subset of tags has any out-
standing transactions. The programmer should avoid reading data buffers for
incoming data or writing to buffers for outgoing data before checking the state of
the corresponding tag to ensure deterministic program behaviour.
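A minimal sketch of such a tagged transfer, using the MFC intrinsics from
spu_mfcio.h of the Cell SDK, looks as follows; the buffer size and the tag number
are arbitrary choices for illustration.

    #include <spu_mfcio.h>

    #define TAG 3  /* one of the 32 available DMA tags */

    /* DMA buffers should be aligned; 16 KB is the maximum size per transfer. */
    static char buf[16384] __attribute__((aligned(128)));

    /* Fetch 'size' bytes from effective address 'ea' in main memory into the
       local-storage buffer and block until the transfer has completed. */
    void fetch_blocking(unsigned long long ea, unsigned size)
    {
        mfc_get(buf, ea, size, TAG, 0, 0);  /* initiate the DMA transfer       */
        mfc_write_tag_mask(1 << TAG);       /* select the tag(s) to wait for   */
        mfc_read_tag_status_all();          /* block until the tag group done  */
    }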
Accessing local storage. The local storage is single ported and has a line in-
terface of 128 bytes width for DMA transfers and instruction fetch as well as a
quadword interface of 16 bytes width for SPU load and store. Since there is only
one port, the access to the local storage is arbitrated using the highest priority
for DMA transfers (at most every 8 cycles), followed by SPU load/store, and the
lowest priority for instruction fetch.
Instructions are fetched in lines of 128 bytes, i.e., 32 instructions. In the case
that all instructions can be dual issued, new instructions need to be fetched every
16 cycles. Since SPU loads/stores have precedence over instruction fetch, in case
of high memory access there should be a NOP instruction for the odd pipeline every
16 cycles to avoid instruction starvation. If there are ongoing DMA transfers an
HBRP instruction should be used giving instruction fetch explicit precedence over
DMA transfers.
Determining performance. The instruction set of the Cell processor gives ac-
cess to a decrementer (see [IBM08, Sec. 13.3.3]) for timing measurements. The
disadvantage of this decrementer is that it is updated with the frequency of the
so-called timebase of the processor. The timebase frequency is usually much lower
than the processor frequency. The Cell processor (rev. 5.1) in the PlayStation 3, for
example, changes the decrementer only every 40 cycles, the Cell processor in the
QS21 blades even only every 120 cycles. Small sections of code can thus only
be measured on average by running the code repeatedly. All cycle
counts reported in this chapter have been measured by running the code and
reading the decrementer.
Polynomial or normal basis? Another choice to make for both bitsliced and
non-bitsliced implementations is the representation of elements of F_{2^131}:
Polynomial bases are of the form (1, z, z^2, z^3, ..., z^130), so the basis elements
are increasing powers of some element z in F_{2^131}. Normal bases are of the form
(α, α^2, α^4, ..., α^{2^130}), so each basis element is the square of the previous one.
Performing arithmetic in normal-basis representation has the advantage that
squaring elements is just a rotation of coefficients. Furthermore there is no need
for basis transformation before computing the Hamming weight in normal basis.
On the other hand, implementations of multiplications in normal basis are widely
believed to be less efficient than those of multiplications in polynomial basis.
In [GSS07], von zur Gathen, Shokrollahi and Shokrollahi proposed an effi-
cient method to multiply elements in type-2 optimal-normal-basis representation
(see also [Sho07]). The following gives a review of this multiplier as shown in
[BBB+ 09]:
An element of F_{2^131} in type-2 optimal-normal-basis representation is of the form

    f_0(ζ + ζ^{-1}) + f_1(ζ^2 + ζ^{-2}) + f_2(ζ^4 + ζ^{-4}) + ... + f_130(ζ^{2^130} + ζ^{-2^130}),

where ζ is a 263rd root of unity in F_{2^131}. This representation is first permuted to
obtain coefficients of

    ζ + ζ^{-1}, ζ^2 + ζ^{-2}, ζ^3 + ζ^{-3}, ..., ζ^131 + ζ^{-131},

and then transformed to coefficients in polynomial basis

    ζ + ζ^{-1}, (ζ + ζ^{-1})^2, (ζ + ζ^{-1})^3, ..., (ζ + ζ^{-1})^131.

Applying this transformation to both inputs makes it possible to use a fast
polynomial-basis multiplier to retrieve coefficients of

    (ζ + ζ^{-1})^2, (ζ + ζ^{-1})^3, ..., (ζ + ζ^{-1})^262.

Applying the inverse of the input transformation yields coefficients of

    ζ^2 + ζ^{-2}, ζ^3 + ζ^{-3}, ..., ζ^262 + ζ^{-262}.

Conversion to permuted normal basis just requires adding appropriate coefficients;
for example ζ^200 is the same as ζ^{-63} and thus ζ^200 + ζ^{-200} is the same as ζ^63 + ζ^{-63}.
The normal-basis representation can be computed by applying the inverse of the
input permutation.
This multiplication still incurs overhead compared to modular multiplication
in polynomial basis, but it needs careful analysis to understand whether this
overhead is compensated for by the above-described benefits of normal-basis rep-
resentation. Observe that all permutations involved in this method are free for
hardware and bitsliced implementations while they are quite expensive in non-
bitsliced software implementations. Nevertheless, it is not obvious which basis
representation has better performance for a bitsliced implementation. Therefore
all finite-field operations are implemented in both polynomial- and normal-basis
representation. By this the overall runtime can be compared to determine which
approach gives the better performance.
The additional three bit operations per output bit can be interleaved with the
loads and stores which are needed for squaring. In particular when using normal-
basis representation (which does not involve any bit operations for squarings), this
speeds up the computation: In case of normal-basis representation, computation
of σ^j takes 1380 cycles.
For polynomial-basis representation a conditional m-squaring consists of an m-
squaring followed by a conditional move. The conditional move function requires
262 loads, 131 stores and 393 bit operations and thus balances instructions on
the two pipelines. One call to the conditional move function takes 518 cycles. In
total, computing σ^j takes 10 squarings and 3 conditional moves summing up to
5554 cycles.
Addition. Addition is the same for normal-basis and polynomial-basis represen-
tation. It requires loading 262 inputs, 131 XOR instructions and storing of 131
outputs. Just as for squaring, the function is bottlenecked by loads and stores rather
than bit operations. One call to the addition function takes 492 cycles.
Inversion. For both polynomial-basis and normal-basis representation the inver-
sion is implemented using Fermat’s little theorem. It involves 8 multiplications,
3 squarings and 6 m-squarings (with m = 2, 4, 8, 16, 32, 65). It takes 173325 cy-
cles using polynomial-basis representation and 135460 cycles using normal-basis
representation. Observe that with a sufficiently large batch size for Montgomery
inversion this does not have big impact on the cycle count of one iteration.
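For reference, the idea behind Montgomery inversion is to batch B inversions into a
single inversion plus roughly 3(B - 1) multiplications. The following C sketch shows
the trick for a toy prime field so that it stays self-contained; the implementation
described here of course works on bitsliced elements of F_{2^131} instead.

    #include <stdint.h>
    #include <stddef.h>

    #define P 2147483647ULL  /* toy field GF(2^31 - 1), for illustration only */

    static uint64_t fe_mul(uint64_t a, uint64_t b) { return (a * b) % P; }

    static uint64_t fe_inv(uint64_t a)  /* Fermat: a^(P-2) mod P */
    {
        uint64_t r = 1, e = P - 2;
        while (e) {
            if (e & 1) r = fe_mul(r, a);
            a = fe_mul(a, a);
            e >>= 1;
        }
        return r;
    }

    /* Montgomery's simultaneous-inversion trick: one inversion plus 3(n-1)
       multiplications replace n separate inversions; prod is scratch space. */
    void batch_invert(uint64_t *x, size_t n, uint64_t *prod)
    {
        prod[0] = x[0];
        for (size_t i = 1; i < n; i++)
            prod[i] = fe_mul(prod[i - 1], x[i]);    /* prefix products  */

        uint64_t t = fe_inv(prod[n - 1]);           /* single inversion */

        for (size_t i = n - 1; i > 0; i--) {
            uint64_t inv_xi = fe_mul(t, prod[i - 1]);
            t = fe_mul(t, x[i]);       /* inverse of the remaining prefix */
            x[i] = inv_xi;
        }
        x[0] = t;
    }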
Conversion to normal basis. For polynomial-basis representation the x-coor-
dinate must be converted to normal basis before it can be detected whether it
belongs to a distinguished point. This basis conversion is generated using the
techniques described in [Ber09c] and uses 3380 bit operations. The carefully
scheduled code takes 3748 cycles.
Hamming-weight computation. The bitsliced Hamming-weight computation
of a 131-bit number represented in normal-basis representation can be done in
a divide-and-conquer approach (producing bitsliced results) using 625 bit op-
erations. This algorithm was unrolled to obtain a function that computes the
Hamming weight using 844 cycles.
Overhead. For both polynomial-basis representation and normal-basis represen-
tation there is additional overhead from loop control and reading new input points
after a distinguished point has been found. This overhead accounts only for about
8 percent of the total computation time. After a distinguished point has been
found, reading a new input point takes about 2 009 000 cycles. As an input point
takes on average 2^{25.7} ≈ 40 460 197 iterations to reach a distinguished point, these
costs are negligible and are ignored in the overall cycle counts for the iteration
function.
the benefits in faster m-squarings, conditional m-squarings, and the saved ba-
sis conversion before Hamming-weight computation. Therefore the normal-basis
implementation was chosen to be further improved by storing data in main mem-
ory as well: A larger number for the batch size of Montgomery inversion can be
achieved by taking advantage of DMA transfers from and to main memory. The
batches are stored in main memory and are fetched into local storage temporarily
for computation.
Since the access pattern to the batches is totally deterministic, it is possible
to use multi-buffering to prefetch data while processing previously loaded data
and to write back data to main memory during ongoing computations. At least
three slots—one for outgoing data, one for computation, and one for incoming
data—are required in local storage for the buffering logic. The slots are organized
as a ring buffer. One DMA tag is assigned to each of the slots to monitor ongoing
transactions.
Before the computation starts, the first batch is loaded into the first slot in
local storage. During one step of the iteration function, the SPU iterates multiple
times over the batches. Each time, first the SPU checks whether the last write
back from the next slot in the ring-buffer has finished. This is done using a
blocking call to the MFC on the tag assigned to the slot. When the slot is free for
use, the SPU initiates a prefetch for the next required batch into the next slot.
Now—again in a blocking manner—it is checked whether the data for the current
batch already has arrived. If so, data is processed and finally the SPU initiates a
DMA transfer to write changed data back to main memory.
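Structurally, this multi-buffering scheme corresponds to the following hedged C
sketch, again based on the spu_mfcio.h intrinsics; the number of slots, the batch
size, and the helpers process_batch and batch_ea are illustrative placeholders.

    #include <spu_mfcio.h>

    #define SLOTS      3      /* outgoing, in use, incoming    */
    #define BATCH_SIZE 16384  /* bytes per batch, illustrative */

    static char slot[SLOTS][BATCH_SIZE] __attribute__((aligned(128)));

    extern void process_batch(char *batch);            /* placeholder */
    extern unsigned long long batch_ea(unsigned idx);  /* placeholder */

    void iterate_over_batches(unsigned nbatches)
    {
        /* Preload the first batch into slot 0; each slot uses its own tag. */
        mfc_get(slot[0], batch_ea(0), BATCH_SIZE, 0, 0, 0);

        for (unsigned i = 0; i < nbatches; i++) {
            unsigned cur  = i % SLOTS;
            unsigned next = (i + 1) % SLOTS;

            /* Wait until the previous write-back from the next slot is done,
               then prefetch the next batch into it. */
            if (i + 1 < nbatches) {
                mfc_write_tag_mask(1 << next);
                mfc_read_tag_status_all();
                mfc_get(slot[next], batch_ea(i + 1), BATCH_SIZE, next, 0, 0);
            }

            /* Wait for the current batch, process it, write it back. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();
            process_batch(slot[cur]);
            mfc_put(slot[cur], batch_ea(i), BATCH_SIZE, cur, 0, 0);
        }

        /* Drain all outstanding write-backs. */
        mfc_write_tag_mask((1 << SLOTS) - 1);
        mfc_read_tag_status_all();
    }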
Due to this access pattern, all data transfers can be performed with mini-
mal overhead and delay. Therefore it is possible to increase the batch size to
512 improving the runtime per iteration for the normal basis implementation by
about 5% to 94949 cycles. Measurements on IBM blade servers QS21 and QS22
showed that neither processor bus nor main memory is a bottleneck even if 8
SPEs are doing independent computations and DMA transfers in parallel.
Table 3.1: Cycle counts per input for all building blocks on one SPE of a
3192 MHz Cell Broadband Engine (rev. 5.1). Cycle counts for 128 bitsliced
inputs are divided by 128. The value B in the last row denotes the batch size
for Montgomery inversions.
At the time of this writing, the computations have not yet finished. The
Cell implementation has been running on the MariCel cluster at the Barcelona
Supercomputing Center (BSC), on the JUICE cluster at the Jülich Supercomput-
ing Centre, and on the PlayStation 3 cluster of the École Polytechnique Fédérale
de Lausanne (EPFL).
The dispatcher. The 8 ALUs in a GPU core are fed by a single dispatcher.
The dispatcher cannot issue more than one new instruction to the ALUs every
4 cycles. The dispatcher can send this one instruction to a warp containing 32
separate threads of computation, applying the instruction to 32 pieces of data in
parallel and keeping all 8 ALUs busy for all 4 cycles; but the dispatcher cannot
direct some of the 32 threads to follow one instruction while the remaining threads
follow another.
Branching is allowed, but if threads within one warp take different branches,
the threads taking one branch will no longer operate in parallel with the threads
in the other branch; execution of the two branches is serialized and the time it
takes to execute diverging branches is the sum of the time taken in all branches.
SRAM: registers and shared memory. Each core has 16384 32-bit registers;
these registers are divided among the threads. For example, if the core is running
256 threads, then each thread is assigned 64 registers. If the core is running 128
threads, then each thread is assigned 128 registers, although access to the high
64 registers is somewhat limited: the architecture does not allow a high register
as the second operand of an instruction. Even with fewer than 128 threads, only
128 registers are available per thread.
The core also has 16384 bytes of shared memory. This memory enables com-
munication between threads. It is split into 16 banks, each of which can dispatch
one 32-bit read or write operation every two cycles. To avoid bank conflicts, ei-
ther each of the 16 threads of a half-warp must access different memory banks or
the same memory address must be touched by all 16 threads. Otherwise accesses
to the same memory bank are serialized; in the worst case, when all threads are
requesting data from different addresses on the same bank, memory access takes
16 times longer than memory access without bank conflicts.
Threads also have fast access to an 8192-byte constant cache. This cache can
broadcast a 32-bit value from one location to every thread simultaneously, but it
cannot read more than one location per cycle.
DRAM: global memory and local memory. The CPU makes data available
to the GPU by copying it into DRAM on the graphics card outside the GPU. The
cores on the GPU can then load data from this global memory and store results in
global memory to be retrieved by the CPU. Global memory is also a convenient
temporary storage area for data that does not fit into shared memory. However,
global memory is limited to a throughput of just one 32-bit load from each GPU
core per cycle, with a latency of 400–600 cycles.
Each thread also has access to local memory. The name “local memory” might
suggest that this storage is fast, but in fact it is another area of DRAM, as slow
as global memory. Instructions accessing local memory automatically incorporate
the thread ID into the address being accessed, effectively partitioning the local
memory among threads without any extra address-calculation instructions.
There are no hardware caches for global memory and local memory. Program-
mers can, and must, set up their own schedules for moving data between shared
memory and global memory.
Instruction latency. The ALUs execute the instruction stream strictly in order.
NVIDIA does not document the exact pipeline structure but recommends running
at least 192 threads (6 warps) on each core to hide arithmetic latency. If all 8
ALUs of a core are fully occupied with 192 threads then each thread runs every
24 cycles; evidently the latency of an arithmetic instruction is at most 24 cycles.
One might think that a single warp of 32 threads can keep the 8 ALUs fully
occupied, if the instructions in each thread are scheduled for 24-cycle arithmetic
latency (i.e., if an arithmetic result is not used until 6 instructions later). However,
if only one warp is executed on one core, the dispatcher will issue instructions
only every second dispatching cycle. Therefore at least 2 warps (64 threads)
are necessary to exploit all ALU cycles. Furthermore, experiments showed that
additional penalty is encountered when shared memory is accessed. This penalty
can be hidden if enough warps are executed concurrently or if the density of
memory accesses is sufficiently low. For instance the ALU can be kept busy in
all cycles with 128 threads as long as fewer than 25% of the instructions include
shared-memory access and as long as these instructions are not adjacent.
NVIDIA also recommends running many more than 192 threads to hide DRAM
latency. This does not mean that one can achieve the best performance by simply
running the maximum number of threads that fit into the core. Threads share
the register bank and shared memory, so increasing the number of threads means
reducing the amount of these resources available to each thread. The ECC2K-
130 computation puts extreme pressure on shared memory, as discussed later in
this section; to minimize this pressure, this implementation is using 128 threads,
skirting the edge of severe latency problems.
The code from [Ber09b] for a 16-bit polynomial multiplication can be scheduled
to fit into 67 registers. It is applied to the 27 multiplications in parallel, leaving 5
threads idle out of 32. In total 27 · 4 = 108 16-bit polynomial multiplications on
32-bit words are carried out by 108 threads in this subroutine leaving 20 threads
idle. Each thread executes 413 instructions (350 bit operations and 63 load/store
instructions).
The initial expansion can be parallelized trivially. Operations on all three
levels can be joined and performed together on blocks of 16 bits per operand
using 8 loads, 19 XOR instructions, and 27 stores per thread.
Karatsuba collection is more work: On the highest level (level 3), each block
of 3 times 32-bit results (with leading coefficient zero) is combined into a 64-bit
intermediate result for level 2. This takes 5 loads (2 of these conditional), 3 XOR
operations and 3 stores per thread on each of the 9 blocks. Level 2 operates on
blocks of 3 64-bit intermediate results leading to 3 128-bit blocks of intermediate
results for level 1. This needs 6 loads and 5 XOR operations for each of the 3
blocks. The 3 blocks of intermediate results of this step do not need to be written
to shared memory and remain in registers for the following final step on level 1.
Level 1 combines the remaining three blocks of 128 bits to the final 256-bit result
by 12 XOR operations per thread.
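As a generic illustration of the Karatsuba split-and-recombine pattern (not the
exact three-level, bitsliced decomposition used in this implementation), the
following C code multiplies two 64-bit binary polynomials with three 32-bit
multiplications instead of four:

    #include <stdint.h>

    /* Schoolbook carry-less multiplication of two 32-bit binary polynomials. */
    static uint64_t clmul32(uint32_t a, uint32_t b)
    {
        uint64_t r = 0;
        for (int i = 0; i < 32; i++)
            if ((b >> i) & 1)
                r ^= (uint64_t)a << i;
        return r;
    }

    /* One Karatsuba level: the 127-bit product of two 64-bit binary
       polynomials is computed from three 32x32-bit products and returned
       in (*hi, *lo). */
    static void clmul64_karatsuba(uint64_t a, uint64_t b,
                                  uint64_t *hi, uint64_t *lo)
    {
        uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
        uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

        uint64_t z0 = clmul32(a0, b0);                      /* low  part   */
        uint64_t z2 = clmul32(a1, b1);                      /* high part   */
        uint64_t z1 = clmul32(a0 ^ a1, b0 ^ b1) ^ z0 ^ z2;  /* middle term */

        *lo = z0 ^ (z1 << 32);
        *hi = z2 ^ (z1 >> 32);
    }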
On the first two levels of the basis conversion algorithm the following sequence of
operations is executed on bits a_0, a_62, a_64, a_126:
Meanwhile the same operations are performed on bits a_1, a_61, a_65, a_125; on bits
a_2, a_60, a_66, a_124; and so on through a_30, a_32, a_94, a_96. These 31 groups of bits
are assigned to 32 threads, keeping almost all of the threads busy.
Merging levels 2 and 3 and levels 4 and 5 works in the same way. This
assignment keeps 24 out of 32 threads busy on levels 2 and 3, and 16 out of
32 threads busy on levels 4 and 5. This assignment of operations to threads also
avoids almost all memory-bank conflicts.
Multiplication with reduction. Recall that the PPP multiplication produces
a product in polynomial basis, suitable for input to a subsequent multiplication.
The PPN multiplication produces a product in normal basis, suitable for input
to a squaring.
The main work in PPN, beyond polynomial multiplication as described in
Section 3.4.3, is a conversion of the product from polynomial basis to normal
basis. This conversion is almost identical to basis conversion described above,
except that it is double-size and in reverse order. The reduction PPP is a more
complicated double-size conversion, with similar parallelization.
Squaring and m-squaring. Squaring and m-squaring are simply permutations
in normal basis, costing 0 bit operations, but this does not mean that they cost 0
cycles.
The obvious method for 32 threads to permute 131 bits is for them to pick up
the first 32 bits, store them in the correct locations, pick up the next 32 bits, store
them in the correct locations, etc.; each thread performs 5 loads and 5 stores, with
most of the threads idle for the final load and store. The addresses determined
by the permutation for different m-squarings can be kept in constant memory.
However, this approach triggers two GPU bottlenecks.
The first bottleneck is shared-memory bank throughput. Recall from Sec-
tion 3.4 that threads in the same half-warp cannot simultaneously store values to
the same memory bank. To almost completely eliminate this bottleneck a greedy
search is performed that finds a suitable order to pick up 131 bits, trying to avoid
all memory bank conflicts for both the loads and the stores. For almost all values
of m, including the most frequently used ones, this approach finds a conflict-free
assignment. For two values of m the assignment involves a few bank conflicts, but
these values are used only in inversion, not in the main loop.
The second bottleneck is the throughput of constant cache. Constant cache
delivers only one 32-bit word per cycle; this value can be broadcasted to all threads
in a warp. If the threads load from different positions in constant memory, then
these accesses are serialized. To eliminate this bottleneck, the loads are moved
out of the main loop. Each thread reserves 10 registers to hold 20 load and 20
store positions for the 4 most-often used m-squarings, packing 4 1-byte positions
in one 32-bit register. Unpacking the positions costs just one shift and one mask
instruction for the two middle bytes, a mask instruction for the low byte, and a
shift instruction for the high byte.
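In C, unpacking four packed 1-byte positions from one 32-bit register corresponds to
the following sketch, matching the instruction counts stated above:

    #include <stdint.h>

    /* Unpack four 1-byte positions packed into one 32-bit word: the low byte
       needs only a mask, the high byte only a shift, and each of the two
       middle bytes one shift plus one mask. */
    static inline void unpack_positions(uint32_t w, uint8_t p[4])
    {
        p[0] = w & 0xff;          /* mask         */
        p[1] = (w >> 8)  & 0xff;  /* shift + mask */
        p[2] = (w >> 16) & 0xff;  /* shift + mask */
        p[3] = w >> 24;           /* shift        */
    }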
Addition. The addition of 128 bitsliced elements of F_{2^131} is decomposed into
computing the XOR of two sets of 4 · 131 = 524 32-bit words. This can be
accomplished by 128 threads using 2 · 5 loads, 5 XOR operations and 5 stores per
thread, where 2 loads, 1 XOR and 1 store are conditional and carried out by only
12 threads.
Hamming-weight computation. The subroutine for Hamming-weight compu-
tation receives elements of F_{2^131} in bitsliced normal-basis representation as input
and returns the Hamming weight, i.e. the sum of all bits of the input value,
as bitsliced output. More specifically, it returns 8 bits h_0, ..., h_7 such that the
Hamming weight of the input value is Σ_{i=0}^{7} h_i · 2^i.
The basic building block for the parallel computation is a full adder, which
has three input bits b_1, b_2, b_3 and uses 5 bit operations to compute 2 output bits
c_0, c_1 such that b_1 + b_2 + b_3 = 2·c_1 + c_0. At the beginning of the computation all
131 bits have a weight of 2^0. When the full adder operates on bits of weight 2^0,
output bit c_1 gets a weight of 2^1 and the bit c_0 a weight of 2^0. If three bits with
a weight of 2^1 are input to a full adder, output bit c_1 will have the weight 2^2, bit
c_0 weight 2^1. More generally: if three bits with a weight of 2^i enter a full adder,
output bit c_1 will have a weight of 2^{i+1}, output bit c_0 a weight of 2^i. The full
adder sums up bits of each weight until only one bit of each weight is left, giving
the final result of the computation.
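In bitsliced form the full adder can be written as the following C function; each
32-bit word carries one bit of 32 independent inputs, so a single call performs 32
full-adder operations with the stated 5 bit operations.

    #include <stdint.h>

    /* Bitsliced full adder: adds three bits of weight 2^i and produces a sum
       bit c0 of weight 2^i and a carry bit c1 of weight 2^(i+1), using 5 bit
       operations (2 XOR, 2 AND, 1 OR) on all 32 slices in parallel. */
    static inline void full_adder(uint32_t b1, uint32_t b2, uint32_t b3,
                                  uint32_t *c0, uint32_t *c1)
    {
        uint32_t t = b1 ^ b2;
        *c0 = t ^ b3;
        *c1 = (b1 & b2) | (t & b3);
    }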
Because there are many input bits, it is easy to keep many threads active in
parallel. In the first addition round 32 threads perform 32 independent full-adder
operations: 96 bits with weight 2^0 are transformed into 32 bits with weight 2^0 and
building block                      cycles (normal basis)
multiplication: PPN                      159.54
multiplication: PPP                      158.08
squaring / m-squaring                      9.60
computation of σ^j                        44.99
Hamming-weight computation                41.60
addition                                   4.01
inversion                               1758.72
conversion to polynomial basis            12.63
full iteration: B = 128                 1164.43
Table 3.2: Cycle counts per input for all building blocks on one core of a
GTX 295. Cycle counts for 128 bitsliced inputs are divided by 128. The value
B in the last row denotes the batch size for Montgomery inversions.
instruction throughput, the graphics card should perform much better than that:
The 60 cores of a GTX 295 deliver up to 596.16 billion logical operations on 32-bit
words per second. This is more than 7.5 times the performance of the 6 SPEs
of a Cell processor in the PlayStation 3 delivering 76.8 billion logical operations
on 32-bit words per second. There are several reasons why the Cell processor
performs more efficiently than the GPU:
• The SPEs of the Cell processor can execute one memory operation in parallel
to each logical instruction while the GPU has to spend extra cycles on
loading and storing data.
• The memory controller of the SPE can transfer data between local storage
and main memory per DMA in parallel to the computation of the ALU; set-
ting up data transfers costs only a fraction of an instruction per transferred
byte. On the GPU, memory transfers between device memory and shared
memory take a path through the ALU, consuming additional cycles.
• The local storage of the SPEs is large enough to keep the full working set
and all code during computation while the instruction cache of the GPU is
too small for the code size.
• For the GPU implementation, several threads are cooperating on the bit op-
erations of one iteration. This requires synchronization and communication
which adds more cycles to the overall cycle count.
• Due to the cooperation of threads during the computation, not all threads
of the GPU can be kept busy all the time.
• Sequential computations like address calculations are more frequent on the
GPU due to the thread-based programming model.
• More than 128 threads would need to run concurrently to hide all latency
effects which arise on the GPU.
Some of these issues could have been avoided by increasing the number of concurrent threads on each GPU core. However, the multiplication, which is the most expensive operation, requires an amount of resources that is only available when using at most 128 threads. Any reduction of these resources would result in less efficient multiplication algorithms and thus eat up all improvements achieved by a higher ALU occupancy.
To put the results into perspective, similar implementations of the iteration function are presented in [BBB+09]: A Core 2 Extreme Q6850 CPU with 4 cores running at 3 GHz clock frequency achieves 22.45 million iterations/second. A previous, more straightforward GPU implementation that does not share work between the threads carries out 25.12 million iterations/second on the same GTX 295 graphics card. A Spartan-3 XC3S5000-4FG676 FPGA delivers 33.67 million iterations/second. A 16 mm² ASIC is estimated to achieve 800 million iterations/second.
4 Parallel implementation of the XL algorithm
This string has n + (D − 2) words, n times "x_i" and D − 2 times "foo". The number of all such strings is \binom{n+(D−2)}{n} = \binom{n+(D−2)}{D−2}. Thus the number |T^{(D−2)}| of all monomials with total degree less than or equal to D − 2 is \binom{n+(D−2)}{n}. Therefore the size of R^{(D)} grows exponentially with the operational degree D. Consequently,
the choice of D should not be larger than the minimum degree that is necessary to
find a solution. On the other hand, starting with a small operational degree may
result in several repetitions of the XL algorithm and therefore would take more
computation than necessary. The solution for this dilemma is given by Yang and
Chen in [YC05] (see also Moh in [Moh01] and Diem in [Die04]): they show that
for random systems the minimum degree D0 required for the reliable termination
of XL is given by D_0 := \min\{D : ((1 − λ)^{m−n−1} (1 + λ)^m)[D] ≤ 0\}.
are suitable for other applications since they are independent of the shape or data
structure of the input matrix.
The block Wiedemann algorithm is a probabilistic algorithm. It solves a linear system given by a matrix M by computing kernel vectors of a corresponding matrix B in three steps, which are called BW1, BW2, and BW3 for the remainder of this chapter. The
following paragraphs give a review of these three steps on an operational level;
for more details please see [Cop94].
The parameters m and n are chosen such that operations on vectors in K^m and K^n can be computed efficiently on the target computing architecture. In this chapter the quotient ⌈m/n⌉ is treated as a constant for convenience. In practice each a^{(i)} can be efficiently computed using two matrix multiplications with the help of a sequence {t^{(i)}}_{i=0}^{∞} of matrices t^{(i)} ∈ K^{N×n} defined as

    t^{(i)} = Bz             for i = 0,
    t^{(i)} = B t^{(i−1)}    for i > 0,

so that a^{(i)} = (x t^{(i)})^T.
BW2: Coppersmith uses an algorithm for this step that is a generalization of the
Berlekamp–Massey algorithm given in [Ber66; Mas69]. In the literature, Coppersmith's modified version of the Berlekamp–Massey algorithm is called the "matrix Berlekamp–Massey" or "block Berlekamp–Massey" algorithm, in analogy to the name "block Wiedemann".
The block Berlekamp–Massey algorithm is an iterative algorithm. It takes the sequence {a^{(i)}}_{i=0}^{∞} from BW1 as input and defines the polynomial a(λ) of degree N/m + N/n + O(1) with coefficients in K^{n×m} as

    a(λ) = \sum_{i=0}^{κ} a^{(i)} λ^i.
(Algorithm 1, the Gaussian elimination producing P^{(j)} and E^{(j)}, is not reproduced here; its final assignments are P^{(j)} ← P and E^{(j)} ← E.)
The j-th iteration step receives two inputs from the previous iteration: One input is an (m + n)-tuple of polynomials (f_1^{(j)}(λ), . . . , f_{m+n}^{(j)}(λ)) with coefficients in K^{1×n}; these polynomials are jointly written as f^{(j)}(λ) with coefficients in K^{(m+n)×n} such that (f^{(j)}[k])_i = f_i^{(j)}[k]. The other input is an (m + n)-tuple d^{(j)} of nominal degrees (d_1^{(j)}, . . . , d_{m+n}^{(j)}); each nominal degree d_k^{(j)} is an upper bound of deg(f_k^{(j)}).
An initialization step generates f^{(j_0)} for j_0 = ⌈m/n⌉ as follows: Set the polynomials f_{m+i}^{(j_0)}, 1 ≤ i ≤ n, to the polynomial of degree j_0 whose coefficient f_{m+i}^{(j_0)}[j_0] = e_i is the i-th unit vector and all of whose other coefficients f_{m+i}^{(j_0)}[k] = 0, k ≠ j_0. Repeat choosing the polynomials f_1^{(j_0)}, . . . , f_m^{(j_0)} randomly with degree j_0 − 1 until H^{(j_0)} = (f^{(j_0)} a)[j_0] has rank m. Finally set d_i^{(j_0)} = j_0, for 1 ≤ i ≤ (m + n).
After f^{(j_0)} and d^{(j_0)} have been initialized, iterations are carried out until f^{(deg(a))} is computed as follows: In the j-th iteration, a Gaussian elimination according to Algorithm 1 is performed on the matrix

    H^{(j)} = (f^{(j)} a)[j] ∈ K^{(m+n)×m}.
Note that the algorithm first sorts the rows of the input matrix by their corresponding nominal degree in decreasing order. This ensures that during the Gaussian elimination no rows of higher nominal degree are subtracted from a row with lower nominal degree. The Gaussian elimination finds a nonsingular matrix P^{(j)} ∈ K^{(m+n)×(m+n)} such that the first n rows of P^{(j)} H^{(j)} are all zeros and a permutation matrix E^{(j)} ∈ K^{(m+n)×(m+n)} corresponding to a permutation φ^{(j)}. Using P^{(j)}, the polynomial f^{(j+1)} of the next iteration step is computed as

    f^{(j+1)} = Q P^{(j)} f^{(j)},   for   Q = ( I_n     0
                                                 0     λ·I_m ).
The nominal degrees d_i^{(j+1)} are computed corresponding to the multiplication by Q and the permutation φ^{(j)} as

    d_i^{(j+1)} = d_{φ^{(j)}(i)}^{(j)}        for 1 ≤ i ≤ n,
    d_i^{(j+1)} = d_{φ^{(j)}(i)}^{(j)} + 1    for n < i ≤ (n + m).
For the output of BW2, the last m rows of f^{(deg(a))} are discarded; the output is an n-tuple of polynomials (f_1(λ), . . . , f_n(λ)) with coefficients in K^{1×n} and an n-tuple d = (d_1, . . . , d_n) of nominal degrees such that

    f_k = f_k^{(deg(a))}   and   d_k = d_k^{(deg(a))},

for 1 ≤ k ≤ n, where max(d) ≈ N/n.
BW3: This step receives an n-tuple of polynomials (f_1(λ), . . . , f_n(λ)) with coefficients in K^{1×n} and an n-tuple d = (d_1, . . . , d_n) as input from BW2. For each f_i(λ), 1 ≤ i ≤ n, compute w_i ∈ K^N as

    w_i = z(f_i[deg(f_i)])^T + B^1 z(f_i[deg(f_i) − 1])^T + · · · + B^{deg(f_i)} z(f_i[0])^T
        = \sum_{j=0}^{deg(f_i)} B^j z(f_i[deg(f_i) − j])^T.
The kernel vectors of B are found during the iterative computation of W^{(max(d))} by checking whether an individual column i ∈ {1, . . . , n} is nonzero in iteration k but becomes zero in iteration k + 1. In that case, column i of matrix W^{(k)} is a kernel vector of B.
Each iteration step has the asymptotic time complexity

    O(N n^2 + N w_B n) = O(N · (n + w_B) · n).

Therefore, W^{(max(d))} for max(d) ≈ N/n can be computed with the asymptotic time complexity

    O(N^2 · (w_B + n)).
The output of BW3 and of the whole block Wiedemann algorithm consists of up to n kernel vectors of B.
for n < i ≤ m + n, 0 < k ≤ deg(f^{(j)}). Since the bottom right corner of P^{(j)} is the identity matrix of size m, this also holds for

    ((f^{(j+1)} a)[j + 1])_i = ((Q P^{(j)} f^{(j)} a)[j + 1])_i = ((f^{(j)} a)[j])_{φ(i)}.

Thus, H_i^{(j+1)} for n < i ≤ m + n can be computed as

    H_i^{(j+1)} = ((f^{(j+1)} a)[j + 1])_i = ((Q P^{(j)} f^{(j)} a)[j + 1])_i = ((f^{(j)} a)[j])_{φ(i)} = H_{φ(i)}^{(j)}.

This means the last m rows of H^{(j+1)} can actually be copied from H^{(j)}; only the first n rows of H^{(j+1)} need to be computed. Therefore the cost of computing any H^{(j>j_0)} is reduced to deg(f^{(j)}) · Mul(n, n, m).
The matrix P^{(j)} can be assembled as follows: P^{(j)} is computed using Algorithm 1. In this algorithm a sequence of row operations is applied to M := H^{(j)}. The matrix H^{(j)} has rank m for all j ≥ j_0. Therefore in the end the first n rows of M are all zeros. The composition of all the operations is P^{(j)}; some of these operations are permutations of rows. The composition of these permutations is E^{(j)}:

    P^{(j)} (E^{(j)})^{−1} = ( I_n     *
                               0     F^{(j)} )
    ⟺   P^{(j)} = ( I_n     *
                    0     F^{(j)} ) E^{(j)}.
The algorithm by Coppersmith requires that the first n rows of P^{(j)} H^{(j)} are all zero (see [Cop94, p. 7]); there is no condition on the bottom m rows. However, the first n rows of P^{(j)} H^{(j)} are all zero independently of the value of F^{(j)}. Thus, F^{(j)} can be replaced by I_m without harming this requirement.
requires communication: Each node i first locally computes a part of the sum using only its own coefficients S_i^{(j)} of f^{(j)}. The matrix H^{(j)} is the sum of all these intermediate results. Therefore, all nodes broadcast their intermediate results to the other nodes. Each node computes H^{(j)} locally; Gaussian elimination is performed on every node locally and is not parallelized over the nodes. Since only small matrices are handled, this sequential overhead is negligibly small.
Also the computation of f^{(j+1)} requires communication. Recall that

    f^{(j+1)} = Q P^{(j)} f^{(j)},   for   Q = ( I_n     0
                                                 0     λ·I_m ).
Table 4.1: Example for the workload distribution over 4 nodes. Iteration
0 receives the distribution in the first line as input and computes the new
distribution in line two as input for iteration 1.
Observe that only node (c − 1) can check whether the degree has increased, i.e. whether deg(f^{(j+1)}) = deg(f^{(j)}) + 1, and whether coefficients need to be redistributed; this information needs to be communicated to the other nodes. To avoid this communication, the maximum nominal degree max(d^{(j)}) is used to approximate deg(f^{(j)}). Note that in each iteration all nodes can update a local list of the nominal degrees. Therefore, all nodes decide locally without communication whether coefficients need to be reassigned: If max(d^{(j+1)}) = max(d^{(j)}) + 1, a node index i^{(j)} is determined. Node i^{(j)} is chosen to store one additional coefficient; the coefficients of nodes i, for i ≥ i^{(j)}, are redistributed accordingly.
Table 4.1 illustrates the distribution strategy for 4 nodes. For example in
iteration 3, node 1 has been chosen to store one more coefficient. Therefore it
receives one coefficient from node 2. Another coefficient is moved from node 3 to
node 2. The new coefficient is assigned to node 3.
This distribution scheme does not avoid all communication for the computation of f^{(j+1)}: First all nodes compute P^{(j)} f^{(j)} locally. After that, the coefficients are multiplied by Q. For almost all coefficients of f^{(j)}, both coefficients k and (k − 1) of P^{(j)} f^{(j)} are stored on the same node, i.e. k ∈ S_i^{(j)} and (k − 1) ∈ S_i^{(j)}. Thus, f^{(j+1)}[k] can be computed locally without communication. In the example in Figure 4.1, this is the case for k ∈ {0, 1, 2, 4, 5, 7, 9, 10}. Note that the bottom m rows of f^{(j+1)}[0] and the top n rows of f^{(j+1)}[max(d^{(j+1)})] are 0.
Communication is necessary if coefficients k and (k − 1) of P^{(j)} f^{(j)} are not on the same node. There are two cases:
• In case k − 1 = max(S_{i−1}^{(j+1)}) = max(S_{i−1}^{(j)}), i ≠ 1, the bottom m rows of (P^{(j)} f^{(j)})[k − 1] are sent from node i − 1 to node i. This is the case for k ∈ {6, 3} in Figure 4.1. This case occurs if in iteration j + 1 no coefficient is reassigned to node i − 1 due to load balancing.
Figure 4.1: Example for the communication between 4 nodes. The top n rows of the coefficients are colored in blue, the bottom m rows are colored in red.
• In case k = min(S_i^{(j)}) = max(S_{i−1}^{(j+1)}), i ≠ 1, the top n rows of (P^{(j)} f^{(j)})[k] are sent from node i to node i − 1. The example in Figure 4.1 has only one such case, namely for coefficient k = 8. This happens if coefficient k got reassigned from node i to node i − 1 in iteration j + 1.
If max(d^{(j+1)}) = max(d^{(j)}), i.e. the maximum nominal degree is not increased during iteration step j, only the first case occurs since no coefficient is added and therefore reassignment of coefficients is not necessary.
The implementation of this parallelization scheme uses the Message Passing
Interface (MPI) for computer clusters and OpenMP for multi-core architectures.
For OpenMP, each core is treated as one node in the parallelization scheme. Note
that the communication for the parallelization with OpenMP is not programmed
explicitly since all cores have access to all coefficients; however, the workload
distribution is performed as described above. For the cluster implementation,
each cluster node is used as one node in the parallelization scheme. Broadcast
communication for the computation of H (j) is implemented using a call to the
MPI_Allreduce function. One-to-one communication during the multiplication
by Q is performed with the non-blocking primitives MPI_Isend and MPI_Irecv
to avoid deadlocks during communication. Both OpenMP and MPI can be used
together for clusters of multi-core architectures. For NUMA systems the best
performance is achieved when one MPI process is used for each NUMA node
since this prevents expensive remote-memory accesses during computation.
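The following minimal sketch (not the code of this implementation) illustrates the MPI_Allreduce step under the assumption that the entries of K = F16 are stored one per byte, so that addition in K is a bytewise XOR; the matrix dimensions are example values.

#include <mpi.h>
#include <vector>
#include <cstdint>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int m = 256, n = 256;                          // example block sizes
    std::vector<std::uint8_t> H_local((m + n) * m, 0);   // this node's partial sum
    std::vector<std::uint8_t> H((m + n) * m);            // full H^(j) on every node

    // ... each node accumulates its part of (f^(j) a)[j] into H_local here ...

    // Addition in F16 is XOR, so a bytewise XOR reduction assembles H^(j);
    // afterwards every node holds the complete matrix and can run the
    // Gaussian elimination (Algorithm 1) locally.
    MPI_Allreduce(H_local.data(), H.data(), (int)H.size(),
                  MPI_UINT8_T, MPI_BXOR, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}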
The communication overhead of this parallelization scheme is very small. In
each iteration, each node only needs to receive and/or send data of total size
O(n2 ). Expensive broadcast communication is only required rarely compared to
the time spent for computation. Therefore this parallelization of Coppersmith’s
Berlekamp–Massey algorithm scales well on a large number of nodes. Further-
more, since f (j) is distributed over the nodes, the memory requirement is dis-
tributed over the nodes as well.
The parallelization of Coppersmith's version has already been explained. Here the parallelization of the matrix-polynomial multiplications is described, using FFT-based multiplication as an example.
The FFT-based multiplication is mainly composed of 3 stages: forward FFTs, point-wise multiplications, and the reverse FFT. Let f, g be the inputs of the forward FFTs and f′, g′ be the corresponding outputs; the point-wise multiplications take f′, g′ as operands and give h′ as output; finally, the reverse FFT takes h′ as input and generates h.
For this implementation, the parallelization scheme for Thomé's Berlekamp–Massey algorithm is quite different from that for Coppersmith's: Each node deals with a certain range of rows. In the forward and reverse FFTs the rows of f, g, and h′ are independent. Therefore, each FFT can be carried out in a distributed manner without communication. The problem is that the point-wise multiplications require partial f′ but full g′. To solve this, each node collects the missing rows of g′ from the other nodes. This is done by using the function MPI_Allgather. Karatsuba and Toom-Cook multiplication are parallelized in a similar way.
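A minimal sketch of this collection step is given below (not the thesis code); it assumes that g′ is distributed in equally sized row blocks with entries packed one per byte, so that a single MPI_Allgather suffices. The function and parameter names are illustrative.

#include <mpi.h>
#include <vector>
#include <cstdint>

// Gather the locally held rows of g' from all nodes so that every node
// afterwards holds the full g' needed for the point-wise multiplications.
std::vector<std::uint8_t> gather_g_prime(const std::vector<std::uint8_t>& g_local,
                                         int rows_per_node, int row_len,
                                         MPI_Comm comm) {
    int nodes = 0;
    MPI_Comm_size(comm, &nodes);

    std::vector<std::uint8_t> g_full((std::size_t)nodes * rows_per_node * row_len);
    MPI_Allgather(g_local.data(), rows_per_node * row_len, MPI_UINT8_T,
                  g_full.data(),  rows_per_node * row_len, MPI_UINT8_T, comm);
    return g_full;
}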
One drawback of this scheme is that the number of nodes is limited by the
number of rows of the operands. However, when the Macaulay matrix B is very
large, the runtime of BW2 is very small compared to BW1 and BW3 since it is
subquadratic in N . In this case using a different, smaller cluster or a powerful
multi-core machine for BW2 might give a sufficient performance as suggested in
[KAF+10]. Another drawback is that the divide-and-conquer approach and the
recursive algorithms for polynomial multiplication require much more memory
than Coppersmith’s version of the Berlekamp–Massey algorithm. Thus Copper-
smith’s version might be a better choice on memory-restricted architectures or for
very large systems.
4.5 Implementation of XL
This section gives an overview of the implementation of XL. Section 4.5.1 de-
scribes some tweaks that are used to reduce the computational cost of the steps
BW1 and BW3. This is followed by a description of the building blocks for these
two steps. The building blocks are explained bottom up: Section 4.5.2 describes
the field arithmetic on vectors of Fq ; although the implementation offers several
fields (F2 , F16 , and F31 ), F16 is chosen as a representative for the discussion in this
section. The modularity of the source code makes it possible to easily extend the
implementation to arbitrary fields. Section 4.5.3 describes an efficient approach
for storing the Macaulay matrix that takes its special structure into account.
This approach reduces the memory demand significantly compared to standard
data formats for sparse matrices. Section 4.5.4 details how the Macaulay matrix multiplication in the stages BW1 and BW3 is performed efficiently; Section 4.5.5 explains how the multiplication is performed in parallel on a cluster using MPI and on a multi-core system using OpenMP. Both techniques for parallelization
can be combined on clusters of multi-core systems.
Notes. The tweaks for BW1 and BW3, though useful in practice, actually reduce
the entropy of x and z. Therefore, theoretical analyses of [Kal95; Vil97b; Vil97a]
no longer apply.
(a_3 x^3 + a_2 x^2 + a_1 x + a_0) · x = a_3(x + 1) + a_2 x^3 + a_1 x^2 + a_0 x
                                      = a_2 x^3 + a_1 x^2 + (a_0 + a_3) x + a_3.
a3     = A AND mask_a3       // extract the a3 bit of every packed element
tmp    = A AND mask_a2a1a0   // keep the bits a2, a1, a0
tmp    = tmp << 1            // a2, a1, a0 move up by one position
new_a0 = a3 >> 3             // move a3 down to the constant-term position
tmp    = tmp XOR new_a0      // constant term becomes a3
add_a1 = a3 >> 2             // move a3 down to the x position
ret    = tmp XOR add_a1      // coefficient of x becomes a0 + a3
RETURN ret
The number of bit operations varies with the actual value of b since it is not necessary to explicitly compute a · x^i in case bit i of b is 0.
Scalar multiplication using PSHUFB: The PSHUFB (Packed Shuffle Bytes) instruction was introduced by Intel with the SSSE3 instruction set extension in 2006. The instruction takes two byte vectors A = (a_0, a_1, . . . , a_15) and B = (b_0, b_1, . . . , b_15) as input and returns C = (a_{b_0}, a_{b_1}, . . . , a_{b_15}). In case the top bit of b_i is set, c_i is set to zero. Using this instruction, scalar multiplication is implemented using a lookup table as follows: For F16 the lookup table L contains 16 entries of 128-bit vectors L_i = (0 · i, 1 · i, x · i, (x + 1) · i, . . . ), i ∈ F16. Given a vector register A that contains 16 elements of F16, one in each byte slot, the scalar multiplication A · b, b ∈ F16, is computed as A · b = PSHUFB(L_b, A).
Since in the implementation each vector register of the input contains 32
packed elements, two PSHUFB instructions are used. The input is unpacked using
shift and mask operations accordingly as shown in Listing 4.2. Using the PSHUFB
instruction, the scalar multiplication needs 7 operations for any input value b with
the extra cost of accessing the lookup table. The lookup table consumes 256 bytes
of memory.
mask_low = 00001111|00001111|00001111|...
mask_high = 11110000|11110000|11110000|...
RETURN ret
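Listing 4.2 survives here only as a fragment (the two mask constants above). As a hedged sketch of the described technique, the following plain C++ with SSE intrinsics splits the packed nibbles with the masks, performs two PSHUFB lookups, and recombines the halves; the table construction and the reduction polynomial x^4 + x + 1 follow the multiplication rule shown earlier, but the helper names are illustrative and not taken from the implementation.

#include <immintrin.h>
#include <cstdint>

// Schoolbook multiplication in F_16 = F_2[x]/(x^4 + x + 1); only used to fill
// the lookup table, so speed does not matter here.
static std::uint8_t gf16_mul(std::uint8_t a, std::uint8_t b) {
    std::uint8_t r = 0;
    for (int i = 0; i < 4; i++) {
        if (b & 1) r ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10) a ^= 0x13;   // reduce: x^4 = x + 1
    }
    return r;
}

// Multiply 32 packed F_16 elements (two per byte in a) by the scalar b.
static __m128i scalar_mul_f16(__m128i a, unsigned b) {
    alignas(16) std::uint8_t row[16];
    for (int i = 0; i < 16; i++)                 // L_b = (0*b, 1*b, ..., 15*b)
        row[i] = gf16_mul((std::uint8_t)i, (std::uint8_t)b);
    const __m128i table    = _mm_load_si128((const __m128i*)row);
    const __m128i mask_low = _mm_set1_epi8(0x0F);

    __m128i lo = _mm_and_si128(a, mask_low);                     // low nibbles
    __m128i hi = _mm_and_si128(_mm_srli_epi16(a, 4), mask_low);  // high nibbles

    __m128i prod_lo = _mm_shuffle_epi8(table, lo);               // PSHUFB lookup
    __m128i prod_hi = _mm_shuffle_epi8(table, hi);               // PSHUFB lookup

    return _mm_or_si128(prod_lo, _mm_slli_epi16(prod_hi, 4));    // repack nibbles
}

In an actual implementation the 16 table rows would of course be precomputed once (256 bytes in total, as stated above) rather than rebuilt on every call.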
Figure 4.2: Memory demand of XL for several system sizes using F16 in standard and compact representation.
x.rand();                           // choose a random sparse matrix x
t_old = z;                          // t^(0) is initialized with z
for (unsigned i = 0; i <= N/m + N/n + O(1); i++)
{
  t_new = B * t_old;                // t^(i+1) = B * t^(i)
  a[i] = x * t_new;                 // a^(i) is obtained from x and t^(i+1)
  swap(t_old, t_new);
}
RETURN a
    a^{(i)} = (x^T · (B · B^i z))^T,   0 ≤ i ≤ N/m + N/n + O(1),
and stage BW3 iteratively computes

    w_i = \sum_{j=0}^{deg(f_i)} B^j z (f_i[deg(f_i) − j])^T,   1 ≤ i ≤ n,

where B is a Macaulay matrix, x and z are sparse matrices, and a^{(i)}, f[k] are dense matrices (see Section 4.2).
Listings 4.3 and 4.4 show pseudo-code for the iteration loops. The most ex-
pensive part in the computation of stages BW1 and BW3 of XL is a repetitive
multiplication of the shape
    t_new = B · t_old,

where t_new, t_old ∈ K^{N×n} are dense matrices and B ∈ K^{N×N} is a sparse Macaulay matrix of row weight w.
Due to the row-block structure of the Macaulay matrix, there is a guaranteed number of entries per row (i.e. the row weight w) but a varying number of entries per column, ranging from just a few to more than 2w. Therefore the multiplication is computed in row order in a big loop over all row indices.
For F16 the field size is significantly smaller than the row weight. Therefore,
the number of actual multiplications for a row r can be reduced by summing
t_old = z * f[0].transpose();
for (unsigned k = 1; k <= f.deg; k++)
{
  t_new = B * t_old;
  t_new += z * f[k].transpose();
  [...]                            // check columns of t_new for a solution
                                   // and copy found solutions to sol
  swap(t_new, t_old);
}
RETURN sol
up all row-vectors of t_old which are to be multiplied by the same field element and performing the multiplication on all of them together. A temporary buffer b_i, i ∈ F16, of vectors of length n is used to collect the sum of row vectors that ought to be multiplied by i. For all entries B_{r,c}, row c of t_old is added to b_{B_{r,c}}. Finally b is reduced by computing \sum_i i · b_i, i ≠ 0, i ∈ F16, which gives the result for row r of the matrix t_new.
With the strategy explained so far, computing the result for one row of B takes
w + 14 additions and 14 scalar multiplications (there is no need for the multiplica-
tion of 0 and 1, see [Sch11b, Statement 8], and for the addition of 0, see [Pet11b,
Statement 10]). This can be further reduced by decomposing each scalar factor
into the components of the polynomial that represents the field element. Summing
up the entries in bi according to the non-zero coefficients of i’s polynomial results
in 4 buckets which need to be multiplied by 1, x, x2 , and x3 (multiplying by 1
can be omitted once more). This reduces 14 scalar multiplications from before to
only 3 multiplications at the cost of 22 more additions. All in all the computation on one row of B (row weight w) on F_{p^n} costs w + 2(p^n − n − 1) + (n − 1) additions and n − 1 scalar multiplications (by x, x^2, . . . , x^{n−1}). For F16 this results in w + 25 additions and 3 multiplications per row.
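As a rough sketch of this bucket strategy (assuming F16 elements stored one per byte and leaving out the SIMD and operation-count optimizations of the actual implementation), the computation of one row of t_new could look as follows; the helper names are illustrative.

#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Row = std::vector<std::uint8_t>;   // n entries of F_16, one per byte

static std::uint8_t mul_x(std::uint8_t v) {              // multiply by x in F_16
    return std::uint8_t(((v << 1) & 0x0F) ^ ((v >> 3) * 0x03));  // x^4 = x + 1
}

static void row_xor(Row& acc, const Row& src) {           // field addition is XOR
    for (std::size_t j = 0; j < acc.size(); j++) acc[j] ^= src[j];
}

// Compute row r of t_new = B * t_old for one row of B, given as a list of
// (column index, coefficient) pairs with nonzero coefficients in F_16.
Row multiply_row(const std::vector<std::pair<std::size_t, std::uint8_t>>& row_of_B,
                 const std::vector<Row>& t_old, std::size_t n) {
    std::array<Row, 16> bucket;
    bucket.fill(Row(n, 0));
    for (const auto& [c, coeff] : row_of_B)   // w additions into the buckets
        row_xor(bucket[coeff], t_old[c]);

    // Regroup the buckets by the polynomial components of their field element:
    // comp[k] collects every bucket whose element has the x^k coefficient set.
    std::array<Row, 4> comp;
    comp.fill(Row(n, 0));
    for (int i = 1; i < 16; i++)
        for (int k = 0; k < 4; k++)
            if (i & (1 << k)) row_xor(comp[k], bucket[i]);

    // Only 3 scalar multiplications remain: by x, x^2, and x^3.
    Row result = comp[0];
    for (int k = 1; k < 4; k++) {
        for (auto& v : comp[k])
            for (int t = 0; t < k; t++) v = mul_x(v);     // multiply by x^k
        row_xor(result, comp[k]);
    }
    return result;
}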
In general, multiplications are more expensive on architectures which do not support the PSHUFB instruction than on those which do. Observe that in this case the non-PSHUFB multiplications are about as cheap (or even slightly cheaper) since the coefficients have already been decomposed into the polynomial components, which gives low-cost SIMD multiplication code in either case.
If the matrix is split into blocks of equal size, every unit has to compute
the same number of multiplications. Nevertheless, due to the structure of the
Macaulay matrix the runtime of the computing units may vary slightly: in the
bottom of the Macaulay matrix it is more likely that neighbouring row blocks
have non-zero entries in the same column. Therefore it is more likely to find
the corresponding row of told in the caches and the computations can be finished
faster than in the top of the Macaulay matrix. This imbalance may be addressed
by dynamically assigning row ranges depending on the actual computing time of
each block.
column stripe of t_old and t_new. All computation can be accomplished locally; the results are collected at the end of the computation of these stages.
Although this is the most efficient parallelization approach when looking at communication cost, the per-node efficiency drops drastically with higher node count: For a high node count, the impact of the width of the column stripes of t_old and t_new becomes even stronger than for the previous approach. Therefore this approach only scales well for small clusters. For a large number of nodes, the efficiency of the parallelization declines significantly.
Another disadvantage of this approach is that all nodes must store the whole Macaulay matrix in their memory. For large systems this may not be feasible.
node 0 since the Macaulay matrix does not have entries in the first quarter of the columns for these rows. The obvious solution is that a node i sends only those rows to a node j that are actually required by node j in the next iteration step. Depending on the system size and the cluster size, this may require sending many separate data blocks to some nodes. This increases the communication overhead and requires calling MPI_Test even more often.
Therefore, to reduce the communication overhead and communication time,
the second implementation of approach b) circumvents the MPI API and pro-
grams the network hardware directly. This implementation uses an InfiniBand
network; the same approach can be used for other high-performance networks.
The InfiniBand hardware is accessed using the InfiniBand verbs API. Program-
ming the InfiniBand cards directly has several benefits: All data structures that
are required for communication can be prepared offline; initiating communication
requires only one call to the InfiniBand API. The hardware is able to perform
all operations for sending and receiving data autonomously after this API call;
there is no need for calling further functions to ensure communication progress as
it is necessary when using MPI. Finally, complex communication patterns using
scatter-gather lists for incoming and outgoing data do not have a large overhead.
This implementation allows sending only those rows to the other nodes that are actually required for computation, with a small communication overhead. This reduces communication to the smallest amount possible at the cost of only a negligibly small initialization overhead. One disadvantage of this approach is an unbalanced communication demand of the nodes. Another disadvantage is that the InfiniBand verbs API is much more difficult to handle than MPI.
Figure 4.4 shows the communication demand of each node for both implemen-
tations. The figure shows the values for a quadratic system of 18 equations and
16 variables; however, the values are qualitatively similar for different parameter
choices. While for the MPI implementation the number of rows that are sent is the
same for all nodes, it varies heavily for the InfiniBand verbs API implementation.
The demand on communication is increased for large clusters for the MPI case; for
the InfiniBand case, communication demand has a peak for 4 nodes and declines
afterwards. However, the scalability of the InfiniBand approach depends on the
ratio between computation time and communication time. Perfect scalability is
only achieved as long as computation time is longer than communication time.
While computation time is roughly halved when doubling the number of nodes, communication time decreases with a smaller slope. Therefore, at some point for a certain number of nodes, communication time catches up with computation time. For moderate system sizes on a cluster with a fast InfiniBand network, the MPI implementation scales almost perfectly for up to four nodes while the InfiniBand verbs API implementation scales almost perfectly for up to eight nodes (details are discussed in the following Section 4.6). A better scalability for large problem sizes is not likely: On the one hand, larger systems have a higher row weight and therefore require more computation time per row. But on the other hand, a higher row weight also increases the number of rows that need to be communicated.
Figure 4.4: Ratio between the number of rows that each node sends in one iteration step and the total number of rows N, for 2 to 256 nodes. For MPI, the number of rows that are sent is equal for all cluster nodes. For InfiniBand (verbs API), the number of rows varies; the maximum, minimum and average are shown.
Approach c) obviously does not profit from programming the InfiniBand cards directly, since it does not require any communication. Currently, there is only an MPI version of approach a). This approach would profit from a more efficient communication strategy; providing such an implementation remains an open task.
All parallelization approaches stated above are based on the memory-efficient
Macaulay matrix representation described in Section 4.5.3. Alternatively the com-
pact data format can be dropped in favor of a standard sparse-matrix data format.
This gives the opportunity to optimize the structure of the Macaulay matrix for
cluster computation. For example, the Macaulay matrix could be partitioned
using the Mondriaan partitioner of Bisseling, Fagginger Auer, Yzelman, and Vas-
tenhouw available at [BFAY+ 10]. Due to the low row-density, a repartitioning of
the Macaulay matrix may reduce the communication demand and provide a better
communication scheme. On the other hand, the Mondriaan partitioner performs
well for random sparse matrices, but the Macaulay matrix is highly structured.
Therefore, it is not obvious if the communication demand can be reduced by such
an approach. First experiments with the Mondriaan partitioner did not yet yield
lower communication cost. Another approach for reducing communication cost
is to modify the Macaulay matrix systematically. Currently the terms of the
linearized system are listed in graded reverse lexicographical (grevlex) monomial
order. For XL, the choice of the monomial order is totally free. Other monomial
orders may give a lower communication cost and improve caching effects. Analyz-
ing the impact of the Mondriaan partitioner, choosing a different monomial order,
and evaluating the trade-off between communication cost, memory demand, and
computational efficiency is a major topic of future research.
                                   NUMA                     Cluster
CPU
  Name                             AMD Opteron 6172         Intel Xeon E5620
  Microarchitecture                Magny–Cours              Nehalem
  Support for PSHUFB               no                       yes
  Frequency                        2100 MHz                 2400 MHz
  Memory bandwidth per socket      2 × 25.6 GB/s            25.6 GB/s
  Number of CPUs per socket        2                        1
  Number of cores per socket       12 (2 × 6)               4
  Level 1 data cache size          12 × 64 KB               4 × 32 KB
  Level 2 data cache size          12 × 512 KB              4 × 256 KB
  Level 3 data cache size          2 × 6 MB                 8 MB
  Cache-line size                  64 bytes                 64 bytes
System architecture
  Number of NUMA nodes             4 sockets × 2 CPUs       2 sockets × 1 CPU
  Number of cluster nodes          —                        8
  Total number of cores            48                       64
  Network interconnect             —                        InfiniBand, 40 Gbit/s
Memory
  Memory per CPU                   32 GB                    18 GB
  Memory per cluster node          —                        36 GB
  Total memory                     256 GB                   288 GB
Figure 4.5: Runtime of XL 16-14 on one cluster node with two CPUs (8 cores in total) with different block sizes (runtime in seconds of BW1, BW2 Thomé, BW2 Coppersmith, and BW3).
the large matrix multiplication in the steps BW1 and BW3. The third experiment
used a second quadratic system with 18 equations and 16 variables to measure the
performance of the parallelization on the cluster system with a varying number of
cluster nodes and on the NUMA system with a varying number of NUMA nodes.
The following paragraphs give the details of these experiments.
(Figure 4.6: Memory demand in GB of BW2 Thomé and BW2 Coppersmith for block widths m = n between 32 and 1024; a horizontal line marks the 36 GB of available RAM.)
less than one cache line of 64 bytes. This explains why the best performance in
BW1 and BW3 is achieved for larger values of n. The runtime of BW1 and BW3
is minimal for block sizes m = n = 256. In this case one row of t^{(j−1)} occupies two cache lines. The reason why this case gives a better performance than
m = n = 128 might be that the memory controller is able to prefetch the second
cache line. For larger values of m and n the performance declines probably due
to cache saturation.
According to the asymptotic time complexity of Coppersmith’s and Thomé’s
versions of the Berlekamp–Massey algorithm, the runtime of BW2 should be pro-
portional to m and n. However, this turns out to be the case only for moderate
sizes of m and n; note the different scale of the graph in Figure 4.5 for a runtime
of more than 2000 seconds. For m = n = 256 the runtime of Coppersmith’s
version of BW2 is already larger than that of BW1 and BW3; for m = n = 512 and m = n = 1024 both versions of BW2 dominate the total runtime of the
computation. Thomé’s version is faster than Coppersmith’s version for small and
moderate block sizes. However, by doubling the block size, the memory demand
of BW2 roughly doubles as well; Figure 4.6 shows the memory demand of both
variants for this experiment. Due to the memory–time trade-off of Thomé’s BW2,
the memory demand exceeds the available RAM for a block size of m = n = 512
and more. Therefore memory pages are swapped out of RAM onto hard disk
which makes the runtime of Thomé’s BW2 longer than that of Coppersmith’s
version of BW2.
(Figure: runtime in seconds of BW1 on 1 to 8 cluster nodes for the approaches "1 Block", "2 Blocks MPI", "2 Blocks IB", and "Ind. Blocks".)
(Figure: speedup and efficiency of BW1 on 1 to 8 cluster nodes for the approaches "1 Block", "2 Blocks MPI", "2 Blocks IB", and "Ind. Blocks".)
Since the approaches “2 Blocks MPI” and “2 Blocks IB” split the data into two
column blocks independently of the number of computing nodes, they have some
overhead when computing on a single cluster node. Therefore, for 1 cluster node
the runtime is longer than for the other two approaches.
The approach “1 Block” does not scale very well for more than two nodes
and thus is not appropriate for this cluster. The approach “2 Blocks MPI” scales
almost perfectly for up to 4 nodes but takes about as long on 8 nodes as on 4
nodes. This is due to the fact that for 8 nodes communication takes twice as
much time as computation. The approach “2 Blocks IB” scales almost perfectly
for up to 8 nodes. For 8 nodes, communication time is about two thirds of the
computation time and thus can entirely be hidden. Doubling the number of nodes
to 16 decreases communication time by a factor of about 0.7 (see Figure 4.4) while
computation time is halved. Therefore, this approach may scale very well for up to
16 nodes or even further. However, a larger cluster with a high-speed InfiniBand network is required to prove this claim. The approach "Ind. Blocks" scales well for up to 8 nodes. It loses performance due to the reduction of the block size per node. In case the per-node block size is kept constant by increasing the total block size n, this approach scales perfectly as well even for larger clusters, since no communication is required during the computation. However, increasing the
total block size also increases runtime and memory demand of BW2 as described
earlier.
The approach “2 blocks IB” was used for the scalability tests that are described
in the next paragraphs since it has a good performance for up to 8 cluster nodes
and uses a fixed number of column blocks independent of the cluster size. Fur-
thermore it uses a larger column-block size for a fixed total block size n than the
third approach with several independent column blocks.
(Figure: runtime in minutes of BW1, BW2 Thomé, BW2 Coppersmith, and BW3 for 8 to 64 cores of the cluster and 6 to 48 cores of the NUMA system.)
Figure 4.10: Speedup and efficiency of BW1 and BW2 on the cluster system.
Figure 4.11: Speedup and efficiency of BW1 and BW2 on the NUMA system.
systems it is even smaller due to its smaller asymptotic time complexity compared to the steps BW1 and BW3. Thus, a lower scalability than that of BW1 and BW3 can be tolerated.
On the NUMA system, BW1 achieves an efficiency of over 75% on up to 8
NUMA nodes. The workload was distributed such that each CPU socket was
filled up with OpenMP threads as much as possible. Therefore in the case of
two NUMA nodes (12 threads) the implementation achieves a high efficiency of
over 90% since a memory controller on the same socket is used for remote memory
access and the remote memory access has only moderate cost. For three and more
NUMA nodes, the efficiency declines to around 80% due to the higher cost of
remote memory access between different sockets. Also on the NUMA system the
parallelization of Thomé’s BW2 achieves only a moderate efficiency of around 55%
for 8 NUMA nodes. The parallelization scheme used for OpenMP does not scale
well for a large number of threads. The parallelization of Coppersmith’s version
of BW2 scales almost perfectly on the NUMA system. The experiment with this
version of BW2 is performed using hybrid parallelization by running one MPI
process per NUMA node and one OpenMP thread per core. The blocking MPI communication happens so rarely that it does not have much impact on the efficiency for up to 8 NUMA nodes.
5 Parallel implementation of Wagner's generalized birthday attack
the attack methodology. Section 5.3 explains what strategy is used for the attack
to make it fit on the restricted hard-disk space of the available computer cluster.
Section 5.4 details the measures that have been applied to make the attack run as
efficiently as possible. The overall cost of the attack is evaluated in Section 5.5 and
cost estimates for a similar attack against full-size FSB are given in Section 5.6.
algorithm. Note that any solution to the generalized birthday problem can be
found by some choice of clamping values.
Expected number of runs. Wagner’s generalized birthday attack is a prob-
abilistic algorithm. Without clamping through precomputation it produces an
expected number of exactly one collision. However this does not mean that run-
ning the algorithm always gives a collision.
In general, the expected number of runs of Wagner’s attack is a function of
the number of remaining bits in the entries of the two input lists of the last merge
step and the number of elements in these lists.
Assume that b bits are clamped on each level and that lists have length 2^b. Then the probability to have at least one collision after running the attack once is

    P_success = 1 − ( (2^{B−(i−2)b} − 1) / 2^{B−(i−2)b} )^{2^{2b}},

and the expected number of runs E(R) is

    E(R) = 1 / P_success.                                    (5.1)
For larger values of B − ib the expected number of runs is about 2^{B−ib}. The total runtime t_W of the attack can be modeled as being linear in the amount of data on level 0, i.e.,

    t_W ∈ Θ( 2^{i−1} · 2^{B−ib} · 2^b ).                     (5.2)

Here 2^{i−1} is the number of lists, 2^{B−ib} is approximately the number of runs, and 2^b is the number of entries per list. Observe that this formula will usually underestimate the actual runtime of the attack by assuming that all computations on subsequent levels are together still linear in the time required for computations on level 0.
Using Pollard iteration. If the number of uncontrolled bits is high because of
memory restrictions, it may be more efficient to use a variant of Wagner’s attack
that uses Pollard iteration [Knu97, Chapter 3, exercises 6 and 7].
Assume that L0 = L1, L2 = L3, etc., and that combinations x0 + x1 with x0 = x1 are excluded. The output of the generalized birthday attack will then be a collision between two distinct elements of L0 + L2 + · · · .
Alternatively the usual Wagner tree algorithm can be performed by starting with only 2^{i−2} lists L0, L2, . . . and using a nonzero clamping constant to enforce the condition that x0 ≠ x1. The number of clamped bits before the last merge step is now (i − 3)b. The last merge step produces 2^{2b} possible values, the smallest of which has an expected number of 2b leading zeros, leaving B − (i − 1)b uncontrolled.
Think of this computation as a function mapping clamping constants to the final B − (i − 1)b uncontrolled bits and apply Pollard iteration to find a collision between the output of two such computations; combination then yields a collision of 2^{i−1} vectors.
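A minimal sketch of the Pollard (Floyd) collision search on such a function is given below; the concrete map from clamping constants to uncontrolled bits is not reproduced here, so any deterministic function f on 64-bit words stands in for it, and the names are illustrative.

#include <cstdint>
#include <utility>

using F = std::uint64_t (*)(std::uint64_t);

// Returns a pair (a, b) with f(a) == f(b); a != b holds whenever the starting
// point x0 does not already lie on the cycle of the iteration.
std::pair<std::uint64_t, std::uint64_t> pollard_collision(F f, std::uint64_t x0) {
    std::uint64_t tortoise = f(x0), hare = f(f(x0));
    while (tortoise != hare) {             // Floyd: find a point on the cycle
        tortoise = f(tortoise);
        hare = f(f(hare));
    }
    tortoise = x0;                         // walk towards the cycle entry point;
    while (f(tortoise) != f(hare)) {       // the two predecessors of the entry
        tortoise = f(tortoise);            // point collide under f
        hare = f(hare);
    }
    return {tortoise, hare};
}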
The Pollard variant of the attack becomes more efficient than plain Wagner
with repeated runs if B > (i + 2)b.
The compression function works as follows. The matrix H is split into w blocks
of n/w columns. Each non-zero entry of the input bit string indicates exactly one
column in each block. The output of the compression function is an r-bit string
which is produced by computing the xor of all the w columns of the matrix H
indicated by the input string.
Preimages and collisions. A preimage of an output of length r of one round of
the compression function is a regular n-bit string of weight w. A collision occurs
if there are 2w columns of H—exactly two in each block—which add up to zero.
Finding preimages or collisions means solving two problems coming from cod-
ing theory: finding a preimage means solving the Regular Syndrome Decod-
ing problem and finding collisions means solving the so-called 2-regular Null-
Syndrome Decoding problem. Both problems were defined and proven to be
NP-complete in [AFS05].
Parameters. Following the notation in [AFG+08b] the term FSB_length denotes the version of FSB which produces a hash value of length "length". Note that the output of the compression function has r bits where r is considerably larger than "length".
For the SHA-3 competition NIST demanded hash lengths of 160, 224, 256,
384, and 512 bits, respectively. Therefore the SHA-3 proposal contains five ver-
sions of FSB: FSB160 , FSB224 , FSB256 , FSB384 , and FSB512 . Table 5.1 gives the
parameters for these versions.
The proposal also contains FSB48, which is a reduced-size version of FSB called the "toy" version. FSB48 is the main attack target in this chapter. The binary matrix H for FSB48 has dimension 192 × 3 · 2^17; i.e., r equals 192 and n is 3 · 2^17. In each round a message chunk is converted into a regular 3 · 2^17-bit string of Hamming weight w = 24. The matrix H contains 24 blocks of length 2^14. Each 1 in the regular bit string indicates exactly one column in a block of the matrix H. The output of the compression function is the xor of those 24 columns.
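As a hedged illustration (not the reference code) of one application of this compression function, the following sketch XORs the 24 selected columns; H is assumed to be stored column-wise as 192-bit values, and the per-block column indices are assumed to be already extracted from the regular input word.

#include <array>
#include <cstdint>
#include <vector>

using Column = std::array<std::uint64_t, 3>;          // 192 bits

// columns: all 24 * 2^14 columns of H, stored block after block.
// selection[i]: index (0 ... 2^14 - 1) of the chosen column within block i.
Column compress_fsb48(const std::vector<Column>& columns,
                      const std::array<unsigned, 24>& selection) {
    Column out{0, 0, 0};
    for (unsigned block = 0; block < 24; block++) {
        const Column& c = columns[(std::size_t)block * (1u << 14) + selection[block]];
        for (int w = 0; w < 3; w++) out[w] ^= c[w];   // XOR of the selected columns
    }
    return out;
}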
A pseudo-random matrix. The attack against FSB48 uses a pseudo-random
matrix H which is constructed as described in [AFG+ 08b, Section 1.2.2]: H con-
sists of 2048 submatrices, each of dimension 192 × 192. For the first submatrix
consider a slightly larger matrix of dimension 197 × 192. Its first column consists
of the first 197 digits of π where each digit is taken modulo 2. The remaining 191
columns of this submatrix are cyclic shifts of the first column. The matrix is then
truncated to its first 192 rows which form the first submatrix of H. For the second
submatrix consider digits 198 up to 394 of π. Again build a 197 × 192 bit matrix
where the first column corresponds to the selected digits (each taken modulo 2)
and the remaining columns are cyclic shifts of the first column. Truncating to the
first 192 rows yields the second block matrix of H. The remaining submatrices
are constructed in the same way.
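A hedged sketch of this construction is given below; the stream of decimal digits of π is assumed to be supplied by the caller, and one plausible convention for the direction of the cyclic shift is used, so the details are illustrative rather than normative.

#include <array>
#include <cstdint>
#include <vector>

using Submatrix = std::array<std::array<std::uint8_t, 192>, 192>; // one bit per entry

// Build the k-th 192x192 submatrix of H from 197 consecutive digits of pi.
Submatrix build_submatrix(const std::vector<int>& pi_digits, std::size_t k) {
    std::array<std::uint8_t, 197> col0{};
    for (int i = 0; i < 197; i++)
        col0[i] = std::uint8_t(pi_digits.at(k * 197 + i) % 2);   // digits mod 2

    Submatrix sub{};
    for (int j = 0; j < 192; j++)        // column j: col0 cyclically shifted by j
        for (int i = 0; i < 192; i++)    // truncate to the first 192 rows
            sub[i][j] = col0[(i + 197 - j) % 197];
    return sub;
}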
This is one possible choice for the matrix H; the matrix may be defined dif-
ferently. The attack described in this chapter does not make use of the structure
of this particular matrix. This construction is used in the implementation since
Finding appropriate clamping constants. This task does not require storing
the positions, since it only determines whether a certain set of clamping constants
leads to a collision; it does not tell which matrix positions give this collision.
Whenever storing the value needs less space than storing positions, the entries
can be compressed by switching representation from positions to values. As a
side effect this speeds up the computations because less data has to be loaded
and stored.
Starting from lists L0,0, . . . , L0,7, each containing 2^37 entries, first list L3,0 (see Figure 5.1) is computed on 8 nodes. This list has entries with 78 remaining bits each. Section 5.4 will describe how these entries are presorted into 512 buckets according to 9 bits that therefore do not need to be stored. Another 3 bits are determined by the node holding the data (also see Section 5.4) so only 66 bits or 9 bytes of each entry have to be stored, yielding a total storage requirement of 1152 GB versus 5120 GB necessary for storing entries in positions-only representation.
Then the attack continues with the computation of list L2,2 , which has entries
of 115 remaining bits. Again 9 of these bits do not have to be stored due to
presorting, 3 are determined by the node, so only 103 bits or 13 bytes have to
be stored, yielding a storage requirement of 1664 GB instead of 2560 GB for
uncompressed entries.
After these lists have been stored persistently on disk, proceed with the computation of list L2,3, then L3,1, and finally check whether L4,0 contains at least one element. These computations require another 2560 GB.
Therefore the total amount of storage sums up to 1152 GB + 1664 GB + 2560 GB = 5376 GB; obviously all data fits onto the hard disks of the 8 nodes.
If a computation with given clamping constants is not successful, clamping
constants are changed only for the computation of L2,3 . The lists L3,0 and L2,2
do not have to be computed again. All combinations of clamping values for lists
L0,12 to L0,15 summing up to 0 are allowed. Therefore there are a large number
of valid clamp-bit combinations.
With 37 bits clamped on every level and 3 clamped through precomputation
there are only 4 uncontrolled bits left and therefore, according to (5.1), the algo-
rithm is expected to yield a collision after 16.5 repetitions.
Computing the matrix positions of the collision. After the clamping con-
stants which lead to a collision have been found, it is also known which value
in the lists L3,0 and L3,1 yields a final collision. Now recompute lists L3,0 and
L3,1 without compression to obtain the positions. For this task only positions
are stored; values are obtained by dynamic recomputation. In total one half-tree
computation requires 5120 GB of storage, hence, they can be performed one after
the other on 8 nodes.
The (re-)computation of lists L3,0 and L3,1 is an additional time overhead over doing all computation on list positions in the first place. However, this cost is incurred only once, and is amply compensated for by the reduction of the number of repetitions compared to the straightforward attack.
Figure 5.1: Structure of the attack: in each box the upper line denotes the list, the lower line gives the nodes holding fractions of this list.
5.4.1 Parallelization
Most of the time in the attack is spent on determining the right clamping con-
stants. As described in Section 5.3 this involves computations of several partial
trees, e.g., the computation of L3,0 from lists L0,0 , . . . , L0,7 (half tree) or the
computation of L2,2 from lists L0,8 , . . . , L0,11 (quarter tree). There are also com-
putations which do not start with lists of level 0; list L3,1, for example, is computed from the (previously computed and stored) lists L2,2 and L2,3.
Lists of level 0 are generated with the current clamping constants. On every
level, each list is sorted and afterwards merged with its neighboring list giving the
entries for the next level. The sorting and merging is repeated until the final list
of the partial tree is computed.
Distributing data over nodes. This algorithm is parallelized by distributing
fractions of lists over the nodes in a way that each node can perform sort and
merge locally on two lists. On each level of the computation, each node contains
fractions of two lists. The lists on level j are split between n nodes according to
lg(n) bits of each value. For example when computing the left half-tree, on level
0, node 0 contains all entries of lists L0,0 and L0,1 ending with a zero bit (in the
bits not controlled by initial clamping), and node 1 contains all entries of lists
L0,0 and L0,1 ending with a one bit.
Therefore, from the view of one node, on each level the fractions of both lists
are loaded from hard disk, the entries are sorted and the two lists are merged.
The newly generated list is split into its fractions and these fractions are sent over
the network to their associated nodes. There the data is received and stored onto
the hard disk.
The continuous dataflow of this implementation is depicted in Figure 5.2.
Presorting into buckets. To be able to perform the sort in memory, incoming
data is presorted into one of 512 buckets according to the 9 least significant bits
of the current sort range. This leads to an expected bucket size for uncompressed
entries of 640 MB (0.625 GB) which can be loaded into main memory at once to
be sorted further. The benefit of presorting the entries before storing them is:
1. A whole fraction that exceeds the size of the memory can be sorted by
sorting its presorted buckets independently.
Figure 5.2: Data flow and buffer sizes during the computation.
2. Two adjacent buckets of the two lists on one node (with the same presort-
bits) can be merged directly after they are sorted.
3. The 9 bits that determine the bucket for presorting do not need to be stored
when entries are compressed to value-only representation.
• the computational power and memory latency of the CPUs for computation-
intensive applications
(Figure 5.3: Bandwidth in MByte/s versus packet size (2^10 to 2^30 bytes) for sequential hard-disk access, random hard-disk access, and MPI.)
data in portions of Ales. Each cluster node has one large unformatted data parti-
tion, which is directly opened by the AleSystem using native Linux file I/O. After
data has been written, it is not read for a long time and does not benefit from
caching. Therefore caching is deactivated by using the open flag O_DIRECT.
All administrative information is persistently stored as a file in the native Linux
filesystem and mapped into the virtual address space of the process. On sequen-
tial access, the throughput of the AleSystem reaches about 90 MB/s which is
roughly the maximum that the hard disk permits.
Tasks and threads. Since the cluster nodes are driven by quad-core CPUs, the
speed of the computation is primarily based on multi-threaded parallelization. On
the one side the tasks for receiving, presorting, and storing, on the other side the
tasks for loading, sorting, merging, and sending are pipelined. Several threads are
used for sending and receiving data and for running the AleSystem. The core of
the implementation is given by five threads which process the main computation.
There are two threads which have the task to presort incoming data (one thread
for each list). Furthermore, sorting is parallelized with two threads (one thread
for each list) and the merge task is assigned to another thread.
Memory layout. The benchmarks show that bigger buffers generally lead to
higher throughput. However, the sum of all buffer sizes is limited by the size of
the available RAM. Six buffers are needed for loading, sorting, and merging of
two list buckets. Furthermore two times 2 · 8 network buffers are required for
double-buffered send and receive, which results in 32 network buffers. Presorting
entries of the two lists double-buffered into 512 buckets requires 2048 ales.
When a bucket is loaded from disk, its ales are treated as a continuous field of
entries to avoid conditions and branches. Therefore, each ale must be completely
filled with entries; no data padding at the end of each ale is allowed. Thus, the
ales must have a size which allows them to be completely filled independent of the
varying size of entries over the whole run of the program. Possible sizes of entries
are 5, 10, 20, and 40 bytes when storing positions and 5, 10, 13, and 9 bytes when
storing compressed entries. Furthermore, since the hard disk is accessed using
DMA, the size of each ale must be a multiple of 512 bytes. Therefore the size of
one ale must be a multiple of 5 · 9 · 13 · 512 bytes.
The size of network packets does not necessarily need to be a multiple of all possible entry sizes; if network packets happen not to be completely filled, this is merely a small waste of network bandwidth.
In the worst case, on level 0 one list containing 2^37 entries is distributed over 2 nodes and presorted into 512 buckets; thus the size of each bucket should be larger than 2^37/2/512 · 5 bytes = 640 MB. The actual size of each bucket depends on the size of the ales since it must be an integer multiple of the ale size.
Following these conditions the network packets have a size of 2^20 · 5 bytes
= 5 MB summing up to 160 MB for 32 buffers. The size of the ales is 5 · 9 ·
13 · 512 · 5 = 1 497 600 bytes (about 1.4 MB). Therefore 2.9 GB are necessary
to store 2048 buffers for ales in memory. The buffers for the list buckets require
5 · 9 · 13 · 512 · 5 · 512 = 766 771 200 bytes (731.25 MB) each summing up to 4.3 GB
for 6 buckets. Overall the implementation requires about 7.4 GB of RAM leaving
enough space for the operating system and additional data, e.g., stack or data for
the AleSystem.
Efficiency and further optimizations. The assignment of tasks to threads as
described above results in an average CPU usage of about 60% and reaches a peak
of up to 80%. The average hard-disk throughput is about 40 MB/s. The hard-disk
benchmark (see Figure 5.3) shows that an average throughput between 45 MB/s
and 50 MB/s should be feasible for packet sizes of 1.4 MB. Therefore further
optimization of the sort task may make it possible to get closer to maximum
hard-disk throughput.
5.5 Results
The implementation described in Sections 5.3 and 5.4 successfully computed a
collision for the compression function of FSB48 . This section presents (1) the
estimates, before starting the attack, of the amount of time that the attack would
need; (2) measurements of the amount of time actually consumed by the attack;
and (3) comments on how different amounts of storage would have changed the
runtime of the attack.
In this step entries are stored by positions on levels 0 and 1; from level 2 on, list entries consist of values. Computation of list L3,0 takes about 32 hours and list L2,2 about 14 hours, summing up to 46 hours. These computations need to be done only once.
The time needed to compute list L2,3 is about the same as for L2,2 (14 hours),
list L3,1 takes about 4 hours and checking for a collision in lists L3,0 and L3,1 on
level 4 about another 3.5 hours, summing up to about 21.5 hours. The expected
number of repetitions of these steps is 16.5 and thus the expected runtime is about
16.5 · 21.5 = 355 hours.
Computing the matrix positions of the collision. Finally, computing the
matrix positions after finding a collision requires recomputation with uncom-
pressed lists. Entries of lists L3,0 and L3,1 need to be computed only until the
entry is found that yields the collision. In the worst case this computation with
uncompressed (positions-only) entries takes 33 hours for each half-tree, summing
up to 66 hours.
Total expected runtime. Overall a collision for the FSB48 compression function
is expected to be found in 46 + 355 + 66 = 467 hours or about 19.5 days.
Table 5.1: Parameters of the FSB variants and estimates for the cost of
generalized birthday attacks against the compression function. For Pollard’s
variant the number of lists is marked with a ∗ . Storage is measured in bytes.
The time required by this attack is approximately 2^{224} (see (5.3)). This is sub-
stantially faster than a brute-force collision attack on the compression function,
but is clearly much slower than a brute-force collision attack on the hash function,
and even slower than a brute-force preimage attack on the hash function.
Similar statements hold for the other full-size versions of FSB. Table 5.1
gives rough estimates for the time complexity of Wagner’s attack without storage
restriction and with storage restricted to a few hundred exabytes (2^{60} entries per
list). These estimates only consider the number and size of lists being a power of
2 and the number of bits clamped in each level being the same. The estimates
ignore the time complexity of precomputation. Time is computed according to
(5.2) and (5.3) with the size of level-0 entries (in bytes) as a constant factor.
Although fine-tuning the attacks might give small speedups compared to the
estimates, it is clear that the compression function of FSB is oversized, assuming
that Wagner’s algorithm in a somewhat memory-restricted environment is the
most efficient attack strategy.
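To illustrate how such estimates scale with the number of lists and the list size, the following Python sketch uses the textbook cost model for Wagner's algorithm with 2^i lists of 2^c idealized entries. It is not the FSB-specific model of (5.2) and (5.3), it does not reproduce the numbers of Table 5.1, and the example parameters are placeholders that do not correspond to any FSB variant.

# Idealized (textbook) cost model for Wagner's generalized birthday attack:
# 2^i lists of 2^c uniform r-bit values, clamping c bits per merge level.
# One pass handles about 2^(i+c) entries and yields about 2^((i+1)*c - r)
# collisions, so roughly max(1, 2^(r - (i+1)*c)) passes are expected.
# This is NOT the FSB-specific model of (5.2)/(5.3).
from math import log2

def wagner_log_cost(r, i, entry_bytes, log_list_size=None):
    """Return (log2 of time in bytes handled, log2 of storage in bytes)."""
    c = r / (i + 1) if log_list_size is None else log_list_size
    log_passes  = max(0.0, r - (i + 1) * c)        # expected number of repetitions
    log_time    = i + c + log_passes + log2(entry_bytes)
    log_storage = i + c + log2(entry_bytes)
    return log_time, log_storage

# Placeholder parameters, unrelated to any FSB variant: eliminate r = 800 bits
# with 2^12 lists of 8-byte entries, unrestricted and capped at 2^60 entries per list.
print(wagner_log_cost(800, 12, 8))                    # base-2 logarithms, about (76.5, 76.5)
print(wagner_log_cost(800, 12, 8, log_list_size=60))  # the restriction costs extra passes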
Parallel Cryptanalysis
Most of today’s cryptographic primitives are based on computations that are hard
to perform for a potential attacker but easy to perform for somebody who is in
possession of some secret information, the key, that opens a back door in these
hard computations and allows them to be solved in a small amount of time. To
estimate the strength of a cryptographic primitive it is important to know how
hard it is to perform the computation without knowledge of the secret back door
and to get an understanding of how much money or time the attacker has to spend.
Usually a cryptographic primitive allows the cryptographer to choose parameters
that make an attack harder at the cost of making the computations using the
secret key harder as well. Therefore designing a cryptographic primitive imposes
the dilemma of choosing the parameters strong enough to resist an attack up to
a certain cost while choosing them small enough to allow usage of the primitive
in the real world, e.g. on small computing devices like smart phones.
This thesis investigates three different attacks on particular cryptographic sys-
tems: Wagner’s generalized birthday attack is applied to the compression function
of the hash function FSB. Pollard’s rho algorithm is used for attacking Certicom’s
ECC Challenge ECC2K-130. The implementation of the XL algorithm has not
been specialized for an attack on a specific cryptographic primitive but can be
used for attacking some cryptographic primitives by solving multivariate quadratic
systems. All three attacks are general attacks, i.e., they apply to various cryptographic systems; the implementations of Wagner’s generalized birthday attack and Pollard’s rho algorithm can be adapted for attacking primitives other than those given in this thesis.
The three attacks have been implemented on different parallel architectures.
XL has been parallelized using the block Wiedemann algorithm on a NUMA
system using OpenMP and on an InfiniBand cluster using MPI. Wagner’s attack
was performed on a distributed system of 8 multi-core nodes connected by an
Ethernet network. The work on Pollard’s rho algorithm is part of a large research
collaboration with several research groups; the computations are embarrassingly
parallel and are executed in a distributed fashion in several facilities with almost
negligible communication cost. This dissertation presents implementations of the
iteration function of Pollard’s rho algorithm on Graphics Processing Units and
on the Cell Broadband Engine.
Curriculum Vitae
Ruben Niederhagen was born on August 10, 1980 in Aachen, Germany. After
obtaining the German university-entrance qualification (Abitur) at the Goldberg
Gymnasium Sindelfingen in 2000, he studied computer science at the RWTH
Aachen University in Germany. In 2007 he graduated with the German degree “Diplom-Informatiker” with a thesis on “Design and Implementation of a Secure Group Communication Layer for Peer-To-Peer Systems” and started his PhD studies
at the Lehrstuhl für Betriebssysteme (research group for operating systems) of
the Faculty of Electrical Engineering and Information Technology at the RWTH
Aachen University. In 2009 he started a cooperation with the Coding and Cryp-
tology group at Eindhoven University of Technology in the Netherlands and with
the Fast Crypto Lab at the National Taiwan University in Taipei. He joined
the Coding and Cryptology group in 2010 to continue his PhD studies under the
supervision of Prof. Dr. Daniel J. Bernstein and Prof. Dr. Tanja Lange. During
his PhD program, he commuted between the Netherlands and Taiwan on a reg-
ular basis to work as a research assistant at the National Taiwan University and the Institute of Information Science at the Academia Sinica in Taipei under the supervision of Prof. Dr. Chen-Mou Cheng and Prof. Dr. Bo-Yin Yang. His dissertation contains the results of his work in the Netherlands and in Taiwan from
2009 to 2012.
His research interests are parallel computing and its application to cryptanalysis.