
Exploiting the Power of GPUs for Asymmetric

Cryptography

Robert Szerwinski and Tim Güneysu

Horst Görtz Institute for IT Security, Ruhr University Bochum, Germany

{szerwinski,gueneysu}@crypto.rub.de

Abstract. Modern Graphics Processing Units (GPU) have reached a
dimension with respect to performance and gate count exceeding conventional
Central Processing Units (CPU) by far. Many modern computer systems include,
besides a CPU, such a powerful GPU which runs idle most of the time and might
be used as a cheap and instantly available co-processor for general purpose
applications.
In this contribution, we focus on the efficient realisation of the
computationally expensive operations in asymmetric cryptosystems on such
off-the-shelf GPUs. More precisely, we present improved and novel
implementations employing GPUs as accelerators for RSA and DSA cryptosystems
as well as for Elliptic Curve Cryptography (ECC). Using a recent Nvidia 8800GTS
graphics card, we are able to compute 813 modular exponentiations per second
for RSA or DSA-based systems with 1024 bit integers. Moreover, our design for
ECC over the prime field P-224 even achieves a throughput of 1412 point
multiplications per second.

Keywords: Asymmetric Cryptosystems, Graphics Processing Unit,

RSA, DSA, ECC.

1 Introduction
For the last twenty years graphics hardware manufacturers have focused on pro-
ducing fast Graphics Processing Units (GPUs), specifically for the gaming com-
munity. This has more recently led to devices which outperform general purpose
Central Processing Units (CPUs) for specific applications, particularly when
comparing the MIPS (million instructions per second) benchmarks. Hence, a re-
search community has been established to use the immense power of GPUs for
general purpose computations (GPGPU). In the last two years, prior limitations
of the graphics application programming interfaces (API) have been removed by
GPU manufacturers by introducing unified processing units in graphics cards.
They support a general purpose instruction set by a native driver interface and
framework.
In the field of asymmetric cryptography, the security of all practical
cryptosystems relies on hard computational problems and depends strongly on the
choice of parameters. With rising parameter sizes (often in the range of
1024–4096 bits), however, computations become more and more challenging for the
underlying processor. For modern hardware, the computation of a single
cryptographic operation is not critical. In a many-to-one communication
scenario, however, a central server in a company's data processing centre may
be confronted with hundreds or thousands of simultaneous connections and
corresponding cryptographic operations. As a result, the most common current
solution is to use cryptographic accelerator cards. Due to the limited market,
their price tags are often in the range of several thousand euros or US dollars.
The question at hand is whether commodity GPUs can be used as high-performance
public-key accelerators.

E. Oswald and P. Rohatgi (Eds.): CHES 2008, LNCS 5154, pp. 79–99, 2008.
© International Association for Cryptologic Research 2008

In this work, we present novel implementations of cryptosystems based on
modular exponentiations and elliptic curve operations on recent graphics
hardware. To the best of our knowledge, this is the first publication making use
of the CUDA framework for GPGPU processing of asymmetric cryptosystems. We
start by implementing the extremely widespread Rivest Shamir Adleman (RSA)
cryptosystem [30]. The same implementation based on modular exponentiation for
large integers can be used to implement the Digital Signature Algorithm (DSA),
which has been published by the US National Institute of Standards and
Technology (NIST) [25]. Recently, DSA has been adapted to elliptic curve groups
in the ANSI X9.62 standard [2]. The implementation of this variant, called
ECDSA, is the second major goal of this work.

2 Previous Work
Lately, the research community has started to explore techniques to accelerate
cryptographic algorithms using the GPU. For example, various authors looked
at the feasibility of the current industry standard for symmetric cryptography,
the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely
Moss et al. and Fleissner, have aimed for the efficient implementation of mod-
ular exponentiation on the GPU [24,14]. Their results were not promising, as
they were limited by the legacy GPU architecture and interface (cf. the next
section). To the best of our knowledge there are neither publications about the
implementation of these systems on modern, GPGPU-capable hardware nor on
the implementation of elliptic curve based systems.
We aim to fill this gap by implementing the core operations for both systems
efficiently on modern graphics hardware, creating the foundation for the use of
GPUs as accelerators for public key cryptography. We will use Nvidia’s current
flagship GPU series, the G80 generation, together with its new GPGPU interface
CUDA.

3 Using GPUs for General-Purpose Applications

The following section will give an overview over traditional GPU computing,
followed by a more in-depth introduction to Nvidia’s general purpose interface
CUDA.

3.1 Traditional GPU Computing

Roughly, the graphics pipeline consists of the stages transform & light,
assemble primitives, rasterise and shade. The first GPUs had all functions
needed to implement the graphics pipeline hardwired, but over time more and
more stages became programmable through specialised processors, e.g. vertex and
fragment processors that made the transform & light and shading stages,
respectively, more flexible.
When processing power increased massively while prices kept falling, the re-
search community thought of ways to use these resources for computationally
intense tasks. However, as the processors’ capabilities were very limited and
the API of the graphics driver was specifically built to implement the graphics
pipeline, a lot of overhead needed to be taken into account. For example, all
data had to be encoded in textures which are two dimensional arrays of pixels
storing colour values for red, green, blue and an additional alpha channel used
for transparency. Additionally, textures are read-only objects, which forced the
programmers to compute one step of an algorithm, store the result in the frame
buffer, and start the next step using a texture reference to the newly produced
pixels. This technique is known as ping-ponging. Most GPUs provided only
instructions to manipulate floating point numbers, forcing GPGPU programmers
to map integers onto the available mantissa and find ways to emulate bit-logical
functions, e.g., by using look-up tables.
These limitations have been the main motivation for the key GPU manufac-
turers ATI/AMD and Nvidia to create APIs specifically for the GPGPU com-
munity and modify their hardware for better support: ATI’s solution is called
Close To the Metal (CTM) [1], while Nvidia presented the Compute Unified
Device Architecture (CUDA), a radically new design that makes GPU program-
ming and GPGPU switch places: The underlying hardware of the G80 series is
an accumulation of scalar common purpose processing units (“unified” design)
and quite a bit of “glue” hardware to efficiently map the graphics pipeline to
this new design. GPGPU applications however directly map to the target hard-
ware and thus graphics hardware can be programmed without any graphics API
whatsoever.

3.2 Programming GPUs Using Nvidia’s CUDA Framework

In general, the GPU's immense computation power mainly relies on its inherent
parallel architecture. For this, the CUDA framework introduces the thread as
the smallest unit of parallelism, i.e., a small piece of concurrent code with
associated state. However, when compared to threads on microprocessors, GPU
threads have much lower resource usage and lower creation and switching cost.
Note that GPUs are only effective when running a high number of such threads. A
group of threads that is executed physically in parallel is called a warp. All
threads in one warp are executed in a single instruction multiple data (SIMD)
fashion. If one or more threads in the same warp need to execute different
instructions, e.g., in case of a data-dependent jump, their execution will be
serialised and the threads are called divergent. As the next level of
parallelism, a (thread) block is a group of threads that can communicate with
each other and synchronise their execution. The maximum number of threads per
block is limited by the hardware. Finally, a group of blocks that have the same
dimensionality and execute the same CUDA program logically in parallel is
called a grid.
To allow optimal performance for different access patterns, CUDA implements
a hierarchical memory model, contrasting the flat model normally assumed on
computers. Host (PC) and device (GPU) have their own memory areas, called
host memory and device memory, respectively. CUDA supplies optimised func-
tions to transfer data between these separate spaces.
Each thread possesses its own register file, which can be read and written.
Additionally, it can access its own copy of so-called local memory. All threads
in the same block can access the same on-chip read- and writable shared memory
region. To prevent hazards resulting from concurrent execution of threads,
synchronisation mechanisms must be used. Shared memory is organised in groups
called banks that can be accessed in parallel. All threads can access a read-
and writable memory space called global memory and read-only regions called
constant memory and texture memory. Constant memory is optimised for
one-dimensional locality of accesses, while texture memory is most effective
when used with two-dimensional arrays (matrices). Note that the texture and
constant memories are the only regions that are cached. Thus, all accesses to
the off-chip regions global and local memory have a high access latency,
resulting in penalties when being used too frequently.
The hardware consists of a number of so-called multiprocessors that are built
from SIMD processors, on-chip memory and caches. Clearly, one processor
executes a particular thread, the same warp being run on the multiprocessor at
the same time. One or more blocks are mapped to each multiprocessor, share
its resources (registers and shared memory) and get executed on a time-sliced
basis. When a particular block has finished its execution, the scheduler starts
the next block of the grid until all blocks have been run.
Design Criteria for GPU Implementations. To achieve optimal perfor-
mance using CUDA, algorithms must be designed to run in a multitude of par-
allel threads and take advantage of the presented hierarchical memory model. In
the following, we enumerate the key criteria necessary for gaining the most out
of the GPU by loosely following the CUDA programming guide [27] and a talk
given by Mark Harris of Nvidia [17].

A. Maximise use of available processing power

A1. Maximise independent parallelism in the algorithm to enable easy
partitioning in threads and blocks.
A2. Keep resource usage low to allow concurrent execution of as many
threads as possible, i.e., use only a small number of registers per thread
and shared memory per block.
A3. Maximise arithmetic intensity, i.e., match the arithmetic to band-
width ratio to the GPU design philosophy: GPUs spend their transistors
on ALUs, not caches. Bearing this in mind allows to hide memory ac-
cess latency by the use of independent computations (latency hiding).
Examples include using arithmetic instructions with high throughput as
well as re-computing values instead of saving them for later use.
A4. Avoid divergent threads in the same warp.

B. Maximise use of available memory bandwidth

B1. Avoid memory transfers between host and device by shifting
more computations from the host to the GPU.
B2. Use shared memory instead of global memory for variables.
B3. Use constant or texture memory instead of global memory for
constants.
B4. Coalesce global memory accesses, i.e., choose access patterns that
allow several accesses in the same warp to be combined into one, wider access.
B5. Avoid bank conflicts when utilising shared memory, i.e., choose patterns
that result in the access of different banks per warp.
B6. Match access patterns for constant and texture memory to the cache
design.

CUDA Limitations. Although CUDA programs are written in the C language
together with extensions to support the memory model, allow synchronisation
and provide special intrinsics to access faster assembler instructions, CUDA
also has a number of limitations that negatively affect the efficient
implementation of public key cryptography primitives. Examples are the lack of
additions/subtractions with carry as well as the missing support for inline
assembler instructions.¹

4 Modular Arithmetic on GPUs

In the following section we present different ways to realise modular
arithmetic on a GPU efficiently, keeping the aforementioned criteria in mind.
For the RSA cryptosystem we need to implement arithmetic modulo N, where N is
the product of two large primes p and q: N = p · q. The arithmetic of both DSA
systems, however, is based on the prime field GF(p) as the lowest-level building
block. Note that both DSA systems use a prime p that is fixed across sessions
and key generations, which allows special primes with advantageous properties
for the reduction modulo p to be chosen at build time. For example, the
US National Institute of Standards and Technology (NIST) proposes a set of
generalised Mersenne primes in the Digital Signature Standard (DSS) [25,
Appendix 6]. As the RSA modulus N is the product of two primes p and q that
are chosen secretly for each new key pair, we cannot optimise for the
modulus in this case.
¹ Nvidia published their own (abstract) assembler language PTX [28]; however, as
of CUDA version 1.0, one kernel cannot contain code generated from both the C
language and PTX.

Modular Addition and Subtraction. In general, addition s ≡ a + b mod m
of two operands a and b, where 0 ≤ a, b < m, is straightforward, as the result of
the plain addition operation a + b always satisfies 0 ≤ a + b < 2m and therefore
needs at most one subtraction of m to fulfil 0 ≤ s < m. Due to the SIMD
design, we require algorithms that have a uniform control flow in all cases. We
therefore compute both a + b and a + b − m and decide afterwards which is the
correctly reduced result, cf. Criterion A4. Subtraction d ≡ a − b mod m can be
treated similarly: we compute both a − b and a − b + m and use a sign test at
the end to derive the correctly reduced result.
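
The compute-both-then-select idea can be sketched as follows. This is a minimal illustration in Python rather than CUDA C; the function names `mod_add` and `mod_sub` and the mask-based selection are assumptions of this sketch, not the authors' code.

```python
def mod_add(a: int, b: int, m: int) -> int:
    """(a + b) mod m for 0 <= a, b < m, with the same instruction flow for all inputs."""
    s = a + b              # first candidate, 0 <= s < 2m
    t = s - m              # second candidate, negative exactly when s < m
    mask = -(t < 0)        # -1 (all bits set) if t is negative, else 0
    return (s & mask) | (t & ~mask)


def mod_sub(a: int, b: int, m: int) -> int:
    """(a - b) mod m for 0 <= a, b < m, selected by a sign test at the end."""
    d = a - b              # first candidate, -m < d < m
    e = d + m              # second candidate
    mask = -(d < 0)        # sign test
    return (e & mask) | (d & ~mask)
```

Both candidates are always computed, so threads in a warp never take different paths; only the final selection depends on the data.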

Modular Multiplication. Multi-precision modular multiplication r ≡ a · b mod m
is usually the most critical operation in common asymmetric cryptosystems. In
a straightforward approach to compute r, we derive a double-sized product
r′ = a · b first and reduce afterwards by multi-precision division. Besides the
quadratic complexity of standard multiplication, division is known to be very
costly and should be avoided whenever possible. Thus, we will discuss several
multiplication strategies to identify an optimal method for implementation on
GPUs.

4.1 Modular Multiplication Using Montgomery’s Technique

In 1985 Peter L. Montgomery proposed an algorithm [23] to remove the costly
division operation from the modular reduction. Koç et al. [6] give a survey of
different implementation options. As all multi-precision Montgomery
multiplication algorithms feature no inherent parallelism except the possibility
to pipeline, we do not consider them optimal for our platform. We implement the
method with the lowest temporary space requirement of n + 2 words, coarsely
integrated operand scanning (CIOS), as a reference solution only (cf.
Algorithm 1).
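
To make the word-level structure of the CIOS method (Algorithm 1) concrete, here is a hedged Python sketch. The helper names, the 32 bit word size and the final conditional subtraction are assumptions of this sketch; the subtraction is added so that the result is always fully reduced, and it is not part of the algorithm listing.

```python
W = 32                      # word size in bits (assumption: 32 bit registers)
MASK = (1 << W) - 1


def to_words(x, n):
    """Split x into n little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(n)]


def from_words(words):
    return sum(w << (W * i) for i, w in enumerate(words))


def mont_mul_cios(a, b, m, n):
    """Return a*b*R^-1 mod m with R = 2^(W*n); a, b < m, m odd (all Python ints)."""
    A, B, M = to_words(a, n), to_words(b, n), to_words(m, n)
    m0_inv = (-pow(m, -1, 1 << W)) & MASK     # -m^-1 mod 2^W
    T = [0] * (n + 2)                         # n + 2 words of temporary space
    for i in range(n):
        c = 0
        for j in range(n):                    # multiplication
            s = A[j] * B[i] + T[j] + c
            T[j], c = s & MASK, s >> W
        s = T[n] + c
        T[n], T[n + 1] = s & MASK, s >> W
        q = (T[0] * m0_inv) & MASK            # reduction
        c = (q * M[0] + T[0]) >> W            # low word becomes zero
        for j in range(1, n):
            s = q * M[j] + T[j] + c
            T[j - 1], c = s & MASK, s >> W
        s = T[n] + c
        T[n - 1], c = s & MASK, s >> W
        T[n] = T[n + 1] + c
    r = from_words(T[:n + 1])
    return r - m if r >= m else r             # final correction (added for this sketch)
```

The outer loop carries a dependency from one iteration to the next, which is why the method offers no parallelism beyond pipelining.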

4.2 Modular Multiplication in Residue Number Systems (RNS)

As an alternative approach to conventional base-$2^w$ arithmetic, we can represent
integers based on the idea of the Chinese Remainder Theorem, by encoding an
integer $x$ as a tuple formed from its residues $x_i$ modulo $n$ relatively prime
$w$-bit moduli $m_i$, where $|x|_{m_i}$ denotes $x \bmod m_i$:

$$\langle x \rangle_{\mathcal{A}} = \langle x_0, x_1, \ldots, x_{n-1} \rangle_{\mathcal{A}} = \langle |x|_{m_0}, |x|_{m_1}, \ldots, |x|_{m_{n-1}} \rangle_{\mathcal{A}} \qquad (1)$$

Here, the ordered set of relatively prime moduli $(m_0, m_1, \ldots, m_{n-1})$,
$\gcd(m_i, m_j) = 1$ for all $i \neq j$, is called base and denoted by $\mathcal{A}$.
The product of all moduli, $A = \prod_{i=0}^{n-1} m_i$, is called the dynamic range
of $\mathcal{A}$, i.e., the number of values that can be uniquely represented in
$\mathcal{A}$. In other words, all numbers in $\mathcal{A}$ get implicitly reduced
modulo $A$. Such a representation in RNS has the advantage that addition,
subtraction and multiplication can be computed independently for all residues:

$$\langle x \rangle_{\mathcal{A}} \circ \langle y \rangle_{\mathcal{A}} = \langle |x_0 \circ y_0|_{m_0}, |x_1 \circ y_1|_{m_1}, \ldots, |x_{n-1} \circ y_{n-1}|_{m_{n-1}} \rangle_{\mathcal{A}}, \quad \circ \in \{+, -, \cdot\} \qquad (2)$$

Algorithm 1. Montgomery Multiplication for Multi-Precision Integers (CIOS Method) [6]
Require: Modulus $M$ and radix $R = 2^{wn}$ s.t. $R > M$ and $\gcd(R, M) = 1$;
$M'_0 = (-M^{-1} \bmod R) \bmod 2^w$; two unsigned integers $0 \le A, B < M$ in Montgomery
form, with $X = (X_{n-1} X_{n-2} \ldots X_0)_{2^w}$ for $X \in \{A, B, M\}$.
Ensure: The product $C = ABR^{-1} \pmod{M}$, $0 \le C < M$, in Montgomery form.
1: $T \leftarrow 0$
2: for $i$ from $0$ to $n - 1$ do
3:   $c \leftarrow 0$
4:   for $j$ from $0$ to $n - 1$ do {Multiplication}
5:     $(c, T_j) \leftarrow A_j \cdot B_i + T_j + c$
6:   end for
7:   $(T_{n+1}, T_n) \leftarrow T_n + c$
8:   $m \leftarrow T_0 \cdot M'_0 \bmod 2^w$ {Reduction}
9:   $(c, T_0) \leftarrow m \cdot M_0 + T_0$
10:  for $j$ from $1$ to $n - 1$ do
11:    $(c, T_{j-1}) \leftarrow m \cdot M_j + T_j + c$
12:  end for
13:  $(c, T_{n-1}) \leftarrow T_n + c$
14:  $T_n \leftarrow T_{n+1} + c$
15: end for
16: return $(T_{n-1} T_{n-2} \ldots T_0)_{2^w}$

Computing all residues independently allows carry-free computations² and
multiplication without partial products. However, some information involving
the whole number $x$ cannot be easily computed. For instance, sign and overflow
detection and comparison of magnitude are hard, because residue number systems
are not a weighted representation. Furthermore, division, and as a result
reduction modulo an arbitrary modulus $M \neq A$, is not as easy as in other
representations.
But similar to the basic idea of Montgomery multiplication, one can create a
modular multiplication method for input values in RNS representation as shown
in Algorithm 2, which involves a second base
$\mathcal{B} = (\tilde{m}_0, \tilde{m}_1, \ldots, \tilde{m}_{n-1})$ with
corresponding dynamic range $B$. It computes a value $v = XY + fM$ that is
equivalent to $0 \bmod A$ and $XY \bmod M$. Thus, we can safely divide by $A$, i.e.,
multiply by its inverse modulo $B$, to compute the output $XYA^{-1} \pmod{M}$.
Note that the needed reduction modulo $A$ to compute $f$ is free in base $\mathcal{A}$.
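
The residue encoding of Equation 1 and the component-wise arithmetic of Equation 2 can be illustrated with a short Python sketch. The concrete moduli in `BASE` and all function names are assumptions chosen for illustration; decoding via the CRT is included only to check results and is not something the GPU kernel would do per operation.

```python
from math import prod

# Illustrative base of pairwise coprime moduli below 2^32 (an assumption of this sketch).
BASE = (2**32 - 1, 2**32 - 3, 2**32 - 5, 2**32 - 9)


def to_rns(x, base=BASE):
    """Encode x as the tuple of its residues (Equation 1)."""
    return tuple(x % m for m in base)


def rns_add(xs, ys, base=BASE):
    """Component-wise addition (Equation 2); no carries cross residues."""
    return tuple((x + y) % m for x, y, m in zip(xs, ys, base))


def rns_mul(xs, ys, base=BASE):
    """Component-wise multiplication (Equation 2); no partial products."""
    return tuple((x * y) % m for x, y, m in zip(xs, ys, base))


def from_rns(residues, base=BASE):
    """Decode via the CRT; the result is only defined modulo the dynamic range."""
    A = prod(base)
    return sum((A // m) * ((r * pow(A // m, -1, m)) % m)
               for r, m in zip(residues, base)) % A
```

Each residue position could be handled by its own thread. Comparing two encoded values by magnitude, in contrast, has no such component-wise shortcut.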
All steps of the algorithm can be efficiently computed in parallel. However, a
method to convert between both bases, a base extension mechanism, is needed.
We take four different options into account: the method based on a Mixed
Radix System (MRS) according to Szabó and Tanaka [37], as well as CRT-based
methods due to Shenoy and Kumaresan [33], Kawamura et al. [20] and Bajard et
al. [3]. We present a brief introduction to these methods; for more detailed
information about base extensions, please see the recent survey in [5].

² Inner-RNS operations still contain carries.

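Algorithm 2. Modular Multiplication Algorithm for Residue Number Systems [20]
Require: Modulus $M$, two RNS bases $\mathcal{A}$ and $\mathcal{B}$ composed of $n$ distinct
moduli $m_i$ each, $\gcd(A, B) = \gcd(A, M) = 1$ and $B > A > 4M$.
Two factors $X$ and $Y$, $0 \le X, Y < 2M$, encoded in both bases and in Montgomery
form, i.e. $\langle X \rangle_{\mathcal{A} \cup \mathcal{B}}$ and $\langle Y \rangle_{\mathcal{A} \cup \mathcal{B}}$, $X = xA \pmod{M}$ and $Y = yA \pmod{M}$.
Ensure: The product $C = XYA^{-1} \pmod{M}$, $0 \le C < 2M$, in both bases and
Montgomery form.
1: $\langle u \rangle_{\mathcal{A} \cup \mathcal{B}} \leftarrow \langle X \rangle_{\mathcal{A} \cup \mathcal{B}} \cdot \langle Y \rangle_{\mathcal{A} \cup \mathcal{B}}$
2: $\langle f \rangle_{\mathcal{A}} \leftarrow \langle u \rangle_{\mathcal{A}} \cdot \langle -M^{-1} \rangle_{\mathcal{A}}$
3: $\langle f \rangle_{\mathcal{A} \cup \mathcal{B}} \leftarrow \mathrm{BaseExtend}(\langle f \rangle_{\mathcal{A}})$
4: $\langle v \rangle_{\mathcal{B}} \leftarrow \langle u \rangle_{\mathcal{B}} + \langle f \rangle_{\mathcal{B}} \cdot \langle M \rangle_{\mathcal{B}}$ {$\langle v \rangle_{\mathcal{A}} = 0$ by construction}
5: $\langle w \rangle_{\mathcal{B}} \leftarrow \langle v \rangle_{\mathcal{B}} \cdot \langle A^{-1} \rangle_{\mathcal{B}}$
6: $\langle w \rangle_{\mathcal{A} \cup \mathcal{B}} \leftarrow \mathrm{BaseExtend}(\langle w \rangle_{\mathcal{B}})$
7: return $\langle w \rangle_{\mathcal{A} \cup \mathcal{B}}$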
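
The steps of Algorithm 2 can be traced with the following Python sketch. It is only an illustration under stated assumptions: the toy prime moduli in `BASE_A` and `BASE_B`, the function names, and in particular `base_extend`, which decodes through the CRT with big integers as an exact stand-in for the base extension methods discussed in the paper.

```python
from math import prod

BASE_A = (233, 239, 241, 251)      # toy moduli (assumption); real code uses w-bit moduli
BASE_B = (257, 263, 269, 271)


def to_rns(x, base):
    return tuple(x % m for m in base)


def from_rns(residues, base):
    rng = prod(base)
    return sum((rng // m) * ((r * pow(rng // m, -1, m)) % m)
               for r, m in zip(residues, base)) % rng


def base_extend(residues, src, dst):
    """Exact stand-in for BaseExtend: decode via CRT, then reduce by the target moduli."""
    return to_rns(from_rns(residues, src), dst)


def rns_mont_mul(X, Y, M):
    """Return w with w = X*Y*A^-1 (mod M) and 0 <= w < 2M, for 0 <= X, Y < 2M."""
    A = prod(BASE_A)
    XA, XB = to_rns(X, BASE_A), to_rns(X, BASE_B)
    YA, YB = to_rns(Y, BASE_A), to_rns(Y, BASE_B)
    uA = tuple(x * y % m for x, y, m in zip(XA, YA, BASE_A))        # step 1
    uB = tuple(x * y % m for x, y, m in zip(XB, YB, BASE_B))
    fA = tuple(u * -pow(M, -1, m) % m for u, m in zip(uA, BASE_A))  # step 2
    fB = base_extend(fA, BASE_A, BASE_B)                            # step 3
    vB = tuple((u + f * M) % m for u, f, m in zip(uB, fB, BASE_B))  # step 4
    wB = tuple(v * pow(A, -1, m) % m for v, m in zip(vB, BASE_B))   # step 5
    wA = base_extend(wB, BASE_B, BASE_A)                            # step 6
    assert from_rns(wA, BASE_A) == from_rns(wB, BASE_B)             # both bases agree
    return from_rns(wB, BASE_B)
```

Every step except the two base extensions touches each residue independently, which is what makes the method attractive for one thread per modulus.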

4.3 Base Extension Using a Mixed Radix System (MRS)

The classical way to compute base extensions is due to Szabó and Tanaka [37].
Let $(m_0, \ldots, m_{n-1})$ be the MRS base associated to $\mathcal{A}$. Then, each
integer $x$ can be represented in a mixed radix system as

$$x = x'_0 + x'_1 m_0 + x'_2 m_0 m_1 + \cdots + x'_{n-1} m_0 \cdots m_{n-2}. \qquad (3)$$

The MRS digits $x'_i$ can be derived from the residues $x_i$ by a recursive strategy:

$$\begin{aligned}
x'_0 &= x_0 \pmod{m_0} \\
x'_1 &= (x_1 - x'_0)\, m^{-1}_{(1,0)} \pmod{m_1} \\
&\;\;\vdots \\
x'_{n-1} &= (\cdots((x_{n-1} - x'_0)\, m^{-1}_{(n-1,0)} - x'_1)\, m^{-1}_{(n-1,1)} - \cdots - x'_{n-2})\, m^{-1}_{(n-1,n-2)} \pmod{m_{n-1}}
\end{aligned} \qquad (4)$$

where $m^{-1}_{(i,j)}$ are the pre-computed inverses of $m_j$ modulo $m_i$. To convert
$x$ from this representation to a target RNS base, we could reduce Equation (3)
by each target modulus $\tilde{m}_k$, involving pre-computed constants
$c_{(k,i)} = \left| \prod_{l=0}^{i-1} m_l \right|_{\tilde{m}_k}$. But instead of creating a table for
all $c_{(k,i)}$, a recursive approach is more efficient in our situation, eliminating
the need for table look-ups [4] and allowing all residues in the target base to
be computed in parallel:

$$|x|_{\tilde{m}_k} = \left| (\ldots((x'_{n-1} m_{n-2} + x'_{n-2})\, m_{n-3} + x'_{n-3})\, m_{n-4} + \cdots + x'_1)\, m_0 + x'_0 \right|_{\tilde{m}_k} \qquad (5)$$
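
Equations 4 and 5 translate into two small loops. The Python sketch below is an illustration only; the function names are assumptions, and the modular inverses are computed on the fly with `pow` where an implementation would use pre-computed constants.

```python
def mrs_digits(residues, base):
    """Mixed radix digits x'_i from residues x_i (Equation 4)."""
    digits = []
    for i, (x_i, m_i) in enumerate(zip(residues, base)):
        t = x_i
        for j in range(i):
            # m^-1_(i,j): inverse of m_j modulo m_i (pre-computed in practice)
            t = (t - digits[j]) * pow(base[j], -1, m_i) % m_i
        digits.append(t)
    return digits


def mrs_base_extend(residues, src, dst):
    """Residues of x in the target base via the Horner form of Equation 5."""
    d = mrs_digits(residues, src)
    out = []
    for m_k in dst:                      # one target residue per thread on the GPU
        acc = d[-1] % m_k
        for i in range(len(src) - 2, -1, -1):
            acc = (acc * src[i] + d[i]) % m_k
        out.append(acc)
    return tuple(out)
```

The digit computation is inherently sequential, since digit $x'_i$ needs all earlier digits, while the second loop runs independently for every target modulus.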

4.4 Base Extension Using the Chinese Remainder Theorem (CRT)

Recall the definition of the CRT and adapt it to the source base $\mathcal{A}$ with
dynamic range $A$:

$$x = \sum_{k=0}^{n-1} \hat{A}_k \left| \frac{x_k}{\hat{A}_k} \right|_{m_k} - \alpha A, \qquad \alpha < n \qquad (6)$$

where $\hat{A}_k = A/m_k$ and $\alpha$ is an integer s.t. $0 \le x < A$. Note that $\alpha$ is
strictly upper-bounded by $n$. When reducing this equation with an arbitrary
target modulus, say $\tilde{m}_i$, we obtain

$$|x|_{\tilde{m}_i} = \left| \sum_{k=0}^{n-1} \left| \hat{A}_k \right|_{\tilde{m}_i} \delta_k - |\alpha A|_{\tilde{m}_i} \right|_{\tilde{m}_i}, \qquad \delta_k = \left| x_k \cdot \hat{A}_k^{-1} \right|_{m_k} \qquad (7)$$

where $|\hat{A}_k|_{\tilde{m}_i}$, $|\hat{A}_k^{-1}|_{m_k}$ and $|A|_{\tilde{m}_i}$ are pre-computed constants.
Note that the $\delta_k$ do not depend on the target modulus and can thus be reused
in the computation of a different target residue.
This is an efficient way to compute all residues modulo the target base, pro-
vided we know the value of α. While involving a couple of look-ups for the con-
stants as well, the instruction flow is highly uniform (cf. Criterion A4) and fits
to our SIMD architecture, i.e., we can use n threads to compute the n residues
of x in the target base in parallel (cf. Criterion A1).
The first technique to compute such an $\alpha$ is due to Shenoy and Kumaresan
[33] and requires a redundant modulus $m_r \ge n$ that is relatively prime to all
other moduli $m_j$ and $\tilde{m}_i$, i.e., $\gcd(A, m_r) = \gcd(B, m_r) = 1$. Consider
Equation 7, set $\tilde{m}_i = m_r$ and rearrange it to the following:

$$|\alpha|_{m_r} = \left| \left| A^{-1} \right|_{m_r} \cdot \left( \left| \sum_{k=0}^{n-1} \left| \hat{A}_k \right|_{m_r} \delta_k \right|_{m_r} - |x|_{m_r} \right) \right|_{m_r}. \qquad (8)$$

Since $\alpha < n \le m_r$ it holds that $\alpha = |\alpha|_{m_r}$ and thus Equation 8 computes
the exact value of $\alpha$, involving the additional constant $|A^{-1}|_{m_r}$.
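
A compact Python sketch of this technique follows: Equation 8 recovers $\alpha$ from the redundant residue, and Equation 7 then yields the target residues. The function name `sk_base_extend`, its parameter names and the small moduli used in testing are assumptions for illustration; constants that would be pre-computed are derived inline here.

```python
from math import prod


def sk_base_extend(residues, x_r, src, dst, m_r):
    """Extend x from base src to base dst, given its redundant residue x_r = x mod m_r."""
    A = prod(src)
    A_hat = [A // m for m in src]
    # delta_k = |x_k * A_hat_k^-1|_{m_k}, independent of the target modulus
    delta = [x * pow(a, -1, m) % m for x, a, m in zip(residues, A_hat, src)]
    # Equation 8: exact alpha from the redundant channel
    s_r = sum(a % m_r * d for a, d in zip(A_hat, delta)) % m_r
    alpha = pow(A, -1, m_r) * (s_r - x_r) % m_r
    # Equation 7: one target residue per thread on the GPU
    out = tuple((sum(a % m_t * d for a, d in zip(A_hat, delta)) - alpha * (A % m_t)) % m_t
                for m_t in dst)
    return out, alpha
```

The price of exactness is that the redundant residue must be carried through every preceding operation so that it is available when the extension is needed.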
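Kawamura et al. propose a different technique that approximates $\alpha$ using
fixed-point computations [20]. Consider Equation 6, written in terms of the
$\delta_k$ from Equation 7, rearrange it and divide by $A$:

$$\alpha = \sum_{k=0}^{n-1} \frac{\delta_k}{m_k} - \frac{x}{A} = \left\lfloor \sum_{k=0}^{n-1} \frac{\delta_k}{m_k} \right\rfloor. \qquad (9)$$

The second equality holds because $\alpha$ is an integer and $0 \le x/A < 1$. Next,
they approximate $\alpha$ by using $\mathrm{trunc}_r(\delta_k)$ as numerator and $2^w$ as
denominator and adding a properly chosen offset $\sigma$, where $\mathrm{trunc}_r(\delta_k)$
sets the last $w - r$ bits of $\delta_k$ to zero:

$$\alpha' = \left\lfloor \sum_{k=0}^{n-1} \frac{\mathrm{trunc}_r(\delta_k)}{2^w} + \sigma \right\rfloor = \left\lfloor \frac{1}{2^r} \sum_{k=0}^{n-1} \left\lfloor \frac{\delta_k}{2^{w-r}} \right\rfloor + \sigma \right\rfloor \qquad (10)$$

Thus, the approximate value $\alpha'$ can be computed in fixed-point arithmetic as
the integer part of the sum of the $r$ most-significant bits of all $\delta_k$.
Provided $\sigma$ is chosen correctly, Equation 10 will compute $\alpha' = \alpha$, and the
resulting base extension will be exact.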
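
The fixed-point approximation of Equation 10 can be tried out with the Python sketch below. Everything concrete in it is an assumption for illustration: the moduli just below $2^{32}$, the choice $r = 8$, the offset $\sigma = 0.5$, and the restriction of inputs to $x < A/2$ under which this particular parameter choice gives the exact value.

```python
from math import prod

W, R = 32, 8                                   # word size and kept fraction bits (assumptions)
BASE = (2**32 - 1, 2**32 - 3, 2**32 - 5, 2**32 - 9)   # moduli just below 2^W
SIGMA = 1 << (R - 1)                           # offset sigma = 0.5 in R-bit fixed point


def deltas(x, base=BASE):
    A = prod(base)
    return [(x % m) * pow(A // m, -1, m) % m for m in base]


def exact_alpha(x, base=BASE):
    """alpha from Equation 6, computed with big integers for comparison only."""
    A = prod(base)
    return (sum((A // m) * d for m, d in zip(base, deltas(x, base))) - x) // A


def approx_alpha(x, base=BASE):
    """alpha' from Equation 10 using only the R most-significant bits of each delta_k."""
    acc = SIGMA + sum(d >> (W - R) for d in deltas(x, base))
    return acc >> R
```

The approximation needs only shifts and a short sum, so it avoids both the redundant modulus and any multi-precision arithmetic.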
Finally, Bajard et al. follow the most radical approach possible [3]: they allow
an offset of $\alpha A \le (n - 1)A$ to occur in Equation 7 and thus do not need to
compute $\alpha$ at all. After the first base extension we have $f' = f + \alpha A$ and
thus $w' = w + \alpha M$, i.e., the result $w'$ will contain a maximum offset of
$(n - 1)M$ and thus be equivalent to $w \bmod M$. However, this technique needs
additional precautions in the multiplication algorithm, which mainly show up as
the larger dynamic ranges needed.

4.5 Multiplication Modulo Generalised Mersenne Primes

For some cryptosystems like DSA, arithmetic in an underlying prime field is
required. Taking advantage of the special structure of generalised Mersenne
primes, the reduction modulo $p$ after a multiplication can be carried out very
efficiently. Using such a method, we can compute the double-sized product $r'$
using a standard multi-precision multiplication method first, followed by a
reduction algorithm that is specific for the given prime. In this work, we use
an algorithm to efficiently compute multiplications modulo P-224, where P-224
is the 224 bit prime proposed by NIST [25]. Algorithm 3 performs the complete
reduction for this prime with only two additions and two subtractions of
224 bit integers and a subsequent correction step to determine the correct value
of $r \equiv r' \bmod p$, since the uncorrected result satisfies $-2p \le r < 3p$.
Note that this final correction step additionally needs the same amount of
computations, as we have to avoid data-dependent branches (cf. Criterion A4).

Algorithm 3. NIST Reduction for P-224 = $2^{224} - 2^{96} + 1$
Require: Double-sized integer $r' = (r'_{13}, \ldots, r'_2, r'_1, r'_0)$ in base $2^{32}$ and $0 \le r' < \text{P-224}^2$
Ensure: Single-sized integer $r \equiv r' \bmod \text{P-224}$, $0 \le r < \text{P-224}$.
1: Concatenate the $r'_i$ to the following 224-bit integers $t_j$:
   $t_1 = (r'_6, r'_5, r'_4, r'_3, r'_2, r'_1, r'_0)$
   $t_2 = (r'_{10}, r'_9, r'_8, r'_7, 0, 0, 0)$
   $t_3 = (0, r'_{13}, r'_{12}, r'_{11}, 0, 0, 0)$
   $t_4 = (0, 0, 0, 0, r'_{13}, r'_{12}, r'_{11})$
   $t_5 = (r'_{13}, r'_{12}, r'_{11}, r'_{10}, r'_9, r'_8, r'_7)$
2: Compute $r = t_1 + t_2 + t_3 - t_4 - t_5$
3: return $r \bmod \text{P-224}$
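
Algorithm 3 can be checked against ordinary modular reduction with a few lines of Python. The names `pack`, `fold_p224` and `reduce_p224` are assumptions of this sketch, and the `%` operator in the last step merely stands in for the branch-free correction step described in the text.

```python
P224 = 2**224 - 2**96 + 1


def pack(words):
    """Combine seven 32 bit words, most significant first, into one integer."""
    value = 0
    for w in words:
        value = (value << 32) | w
    return value


def fold_p224(r):
    """Steps 1 and 2: return t1 + t2 + t3 - t4 - t5 for a product r < P224^2."""
    c = [(r >> (32 * i)) & 0xFFFFFFFF for i in range(14)]
    t1 = pack([c[6], c[5], c[4], c[3], c[2], c[1], c[0]])
    t2 = pack([c[10], c[9], c[8], c[7], 0, 0, 0])
    t3 = pack([0, c[13], c[12], c[11], 0, 0, 0])
    t4 = pack([0, 0, 0, 0, c[13], c[12], c[11]])
    t5 = pack([c[13], c[12], c[11], c[10], c[9], c[8], c[7]])
    return t1 + t2 + t3 - t4 - t5


def reduce_p224(r):
    """Step 3; the % operator stands in for the branch-free correction step."""
    return fold_p224(r) % P224
```

The folding works because $2^{224} \equiv 2^{96} - 1 \pmod{\text{P-224}}$, so every high word can be re-expressed as an addition and a subtraction at lower word positions.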

5 Implementation

In this section we describe the implementation of two primitive operations
for a variety of cryptosystems: first, we realise modular exponentiation on the
GPU for use with RSA, DSA and similar systems. Second, for ECC-based
cryptosystems we present an efficient point multiplication method, which is the
fundamental operation, e.g., for ECDSA or ECDH [16].

5.1 Modular Exponentiation Using the CIOS Method

We implemented the CIOS method as introduced in Algorithm 1 for sequential
execution since it does not include any inherent parallelism. Fan et al. describe
efficient ways to pipeline such an algorithm for use on multi-core systems [13].
This would, however, need fairly complex coordination and memory techniques
and thus will not be considered further for our implementation, cf. Criteria A4
and B4-B6.
As all modular exponentiations are independent, we let each thread compute
exactly one modular exponentiation in parallel with all others. As a result,
this solution only profits from coarse-grained parallelism. We assume the
computation of distinct exponentiations, each having the same exponent $t$
(for example RSA signatures using the same key), and thus need to transfer
only the messages $P_i$ for each exponentiation to the device and the results
$P_i^t \pmod{N}$ back to the host. As a result, every thread executes the same
control flow, fulfilling Criterion A4. To accelerate memory transfers between
host and device, we use page-locked host memory and pad each message to a fixed
length that forces the starting address of each message to values that are
eligible for global memory coalescing (cf. Criteria B1 and B4).
For modular exponentiation based on Algorithm 1, we applied the straightforward
binary right-to-left method [35]. During exponentiation, each thread needs
three temporary values of (n + 2) words each that get used as input and output of
Algorithm 1 in a round-robin fashion by pointer arithmetic. Thus, 3(n + 2) words
are required. With 32 bit words, this leads to 408 bytes and 792 bytes for
1024 bit and 2048 bit parameters, respectively. Each multiprocessor features
16384 bytes of shared memory, resulting in a maximum number of
⌊16384/408⌋ = 40 and ⌊16384/792⌋ = 20
threads per multiprocessor for 1024 and 2048 bits, respectively, if we use shared
memory for temporary values. Clearly, both solutions are inefficient when
considering that each multiprocessor is able to execute 768 threads concurrently
in principle (i.e., we favour Criterion A2 over B2).
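
The memory budget above is simple arithmetic and can be reproduced as follows. This is a small Python sketch; the function names and default arguments are assumptions, while the figures (32 bit words, three temporaries, 16384 bytes of shared memory) are taken from the text.

```python
def temp_bytes(bits, word_bits=32, temporaries=3):
    """Bytes needed per thread for the temporaries of (n + 2) words each."""
    n = bits // word_bits
    return temporaries * (n + 2) * (word_bits // 8)


def threads_per_multiprocessor(bits, shared_bytes=16384):
    """Upper bound on threads if all temporaries live in shared memory."""
    return shared_bytes // temp_bytes(bits)
```

Forty or twenty threads are far below the number of threads a multiprocessor can keep in flight, which motivates moving the temporaries elsewhere.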
Thus, we chose to store the temporary values in global memory. We have to
store the values interleaved so that memory accesses of one word by all threads
in a warp can be combined to one global memory access. Hence, for a given set
of values (A, B, C, . . .) consisting each of n + 2 words X = (x0 , x1 , . . . , xn+1 ),
we store all first words (a0 , b0 , c0 , . . .) for all threads in the same block, then all
second words (a1 , b1 , c1 , . . .), and so on (cf. Criterion B4).
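
The interleaved layout can be expressed as a single index computation. The Python sketch below is illustrative; `interleaved_index` and `interleave` are hypothetical helper names, and the values are plain lists standing in for device memory.

```python
def interleaved_index(thread, word, threads_per_block):
    """Position of word number `word` of the value owned by `thread`."""
    return word * threads_per_block + thread


def interleave(values):
    """values[t][k] is word k of thread t; returns the flat interleaved array."""
    threads, words = len(values), len(values[0])
    flat = [0] * (threads * words)
    for t, value in enumerate(values):
        for k, w in enumerate(value):
            flat[interleaved_index(t, k, threads)] = w
    return flat
```

When all threads of a warp read word k of their own value, the addresses are consecutive, which is the access pattern that can be coalesced.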
Moreover, we have to use nailing techniques, as CUDA does not yet include
add-with-carry instructions. Roughly speaking, nailing reserves one or more of
the high-order bits of each word for the carry that can occur when adding two
numbers. To save register and memory space, however, we store the full word
of w bits per register and use bit shifts and and-masking to extract two
half-words, each providing sufficient bits for the carry (cf. Criterion A3).
This can be thought of as decomposing a 32 bit addition into two 16 bit
additions plus the overhead for carry handling.
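
The decomposition into two 16 bit additions might look as follows. This is a Python sketch with hypothetical helper names `add32` and `add_multiword`; it only illustrates the carry handling and makes no claim about the authors' exact instruction sequence.

```python
MASK16 = 0xFFFF


def add32(a, b, carry_in=0):
    """Add two 32 bit words using only 16 bit halves; return (sum, carry_out)."""
    lo = (a & MASK16) + (b & MASK16) + carry_in     # at most 17 bits
    hi = (a >> 16) + (b >> 16) + (lo >> 16)         # at most 17 bits
    return ((hi & MASK16) << 16) | (lo & MASK16), hi >> 16


def add_multiword(xs, ys):
    """Add two little-endian word lists of equal length; return (words, carry)."""
    out, carry = [], 0
    for x, y in zip(xs, ys):
        s, carry = add32(x, y, carry)
        out.append(s)
    return out, carry
```

Each half-sum fits in 17 bits, so the carry is always visible in bit 16 of a 32 bit register without any dedicated carry flag.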

5.2 Modular Exponentiation Using Residue Number Systems

Computations in residue number systems yield the advantage of being inherently
parallel. According to Algorithm 2 all steps are computed in one base only,
except for the first multiplication. Thus, the optimal mapping of computations
to threads is as follows: each thread determines values for one modulus in the
two bases. As a result, we have coarse-grained (different exponentiations) and
fine-grained parallelism (base size), fulfilling Criterion A1. We call $n'$ the
number of residues that can be computed in parallel, i.e., the number of threads
per encryption. The base extension by Shenoy et al. needs a redundant residue
starting from the first base extension to be able to compute the second base
extension. To reflect this fact, we use two RNS bases $\mathcal{A}$ and $\mathcal{B}$,
having $n$ moduli each, and an additional residue $m_r$, resulting in $n' = n + 1$.
For all other cases, it holds that $n' = n$.
Considering the optimal number of bits per modulus, we are faced with
w = 32 bit integer registers on the target hardware. Thus, to avoid
multi-precision techniques, we can use moduli that are smaller than $2^w$. The
hardware can compute 24 bit multiplications faster than full 32 bit
multiplications. However, CUDA does not expose an intrinsic to compute the
most-significant 16 bits of the resulting 48 bit product. Using 16 bit moduli
would waste registers and memory and increase the number of memory accesses as
well. Thus, we prefer full 32 bit moduli to save storage resources at the
expense of higher computational cost (cf. Criteria A2 and A3).
For Algorithm 1 to work, the dynamic ranges A and B and the modulus M
have to be related according to B > A > 2^2 M, or B > A > (2+n)^2 M when using
Bajard's method. For performance reasons, we consider full warps of 32 threads
only, resulting in a slightly reduced size of M. The figures for all possible
combinations can be found in Table 6 in the Appendix. For input and output values,
we assume that all initial values have already been converted to both bases
(and possibly the redundant modulus m_r) and that output values are returned
in the same encoding. Note that it would be sufficient to transfer values
in one base only and do a base extension for all input values; transferring
values in both bases results in a more compact kernel together with a slightly
higher latency (cf. Criterion B1). Unlike the CIOS method, temporary values
can be kept local to each thread, i.e., every thread stores its assigned residues
in registers. In principle, all operations can be performed in parallel on
different residues and, as a result, the plain multiplication algorithm does not
need any synchronisation. However, neither property holds for the base extension
algorithms.
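
To illustrate why plain RNS multiplication needs neither shared state nor
synchronisation, the following sketch multiplies two numbers residue by residue.
It is written in Python for readability; the function names and the tiny 8 bit
moduli are assumptions for illustration only (the text above uses 32 bit
moduli), and each list position stands for the work of one GPU thread.

```python
from math import prod


def to_rns(x, moduli):
    """Represent x by its residues modulo pairwise coprime moduli."""
    return [x % m for m in moduli]


def rns_mul(xs, ys, moduli):
    """Multiply residue-wise; position k depends on position k only,
    so on the GPU every position could be handled by its own thread."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]


def from_rns(residues, moduli):
    """CRT reconstruction, used here only to check the result on the host."""
    A = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        A_k = A // m
        total += r * pow(A_k, -1, m) * A_k   # pow(.., -1, m) is the inverse mod m
    return total % A
```

The result is only correct as long as the true product stays below the dynamic
range, which is the reason for the size relations between A, B and M stated
above.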

Mixed Radix Conversion. Recall that the mixed radix conversion first computes
the mixed radix representation from all residues in the source base and then
uses this value to compute the target residues. The second step involves the
computation of n residues and can be executed in parallel, i.e., each thread
computes the residue for 'its' modulus. As a result, we have to store the n MRS
digits in shared memory to make them accessible to all threads (cf. Criteria A1
and B2). The first step, however, is the main caveat of this algorithm due to
its highly divergent nature: each MRS digit is derived from the residue of
a temporary variable in a different modulus (and thus thread) and depends
on all previously computed digits, clearly breaking Criterion A4 and resulting
in serialisation of executions. Additionally, note that threads that have already
computed an MRS digit do not generate any useful output anymore.
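
The two steps can be sketched as follows. This is a textbook-style mixed radix
conversion in Python, not the kernel described here; the function names and the
small moduli used for checking are hypothetical. The outer loop of the first
function is the serial part criticised above: digit i has to be known before
any later digit can be derived.

```python
def mixed_radix_digits(residues, source):
    """Step 1: derive the MRS digits d_1..d_n with
    x = d_1 + d_2*a_1 + d_3*a_1*a_2 + ...  (inherently sequential in i)."""
    t = list(residues)
    digits = []
    for i, a_i in enumerate(source):
        d = t[i]                       # digit i is the current residue mod a_i
        digits.append(d)
        for j in range(i + 1, len(source)):
            a_j = source[j]
            # remaining threads update their temporary: (t - d) / a_i mod a_j
            t[j] = ((t[j] - d) * pow(a_i, -1, a_j)) % a_j
    return digits


def mrs_to_residue(digits, source, b):
    """Step 2: evaluate the mixed radix number modulo one target modulus b
    (Horner scheme); independent for every target modulus."""
    r = 0
    for d, a in reversed(list(zip(digits, source))):
        r = (r * a + d) % b
    return r


def mrc_base_extend(residues, source, target):
    digits = mixed_radix_digits(residues, source)
    return [mrs_to_residue(digits, source, b) for b in target]
```

In the first step only the threads with index greater than i still do useful
work in iteration i, which mirrors the observation that threads with a finished
digit stay idle.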

CRT-Based Conversion. The first step for all CRT-based techniques is to
compute the δ_k for each source modulus, which can be carried out by one thread
for each value. Second, all n threads compute a weighted sum involving the δ_k
and a modulus-dependent constant. Note that all threads need to access all δ_k,
and thus the δ_k have to be stored in shared memory (cf. Criterion B2). Third,
α has to be derived; its computation is the main difference between the
techniques. α is needed by all threads later and thus needs to be stored in
shared memory as well. After computing α, all threads can proceed with their
independent computations.
Bajard's method does not compute α and consequently needs no further
operations. For Shenoy's method, the second step above is needed for the
redundant modulus m_r as well, which can be done in parallel with all other
moduli. Then, a single thread computes α and writes it to shared memory. The
redundant residue modulo m_r comes at the price of an additional thread; however,
the divergent part needed to compute α contains only one addition and one
multiplication modulo m_r.
Kawamura's method needs to compute the sum of the r most significant bits of
all δ_k. While the right-shift of each δ_k can be done using all threads, the
sum over all shifted values and the offset has to be computed using a single
thread. A final right-shift results in the integer part of the sum, namely α.
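
The common structure of the CRT-based techniques, together with the two
simplest ways of treating α, can be sketched as follows. The sketch is plain
Python, uses hypothetical names and recomputes all constants on the fly (a real
kernel would precompute them); Kawamura's approximation of α is not shown. With
no redundant residue the result carries the offset α·A as in Bajard's method,
and with a redundant residue (m_r, x mod m_r) α is recovered exactly as in
Shenoy's method, assuming m_r ≥ n and gcd(m_r, A) = 1.

```python
from math import prod


def crt_base_extend(residues, source, target, redundant=None):
    """Extend x from base `source` to base `target`.

    redundant: optional pair (m_r, x mod m_r); if given, alpha is derived
    from it, otherwise alpha is left at 0 and the result equals
    (x + alpha_true * A) mod b for every target modulus b.
    Returns (target residues, alpha used).
    """
    A = prod(source)
    # Step 1: one delta_k per source modulus (one thread each)
    deltas = [(x * pow(A // a, -1, a)) % a for x, a in zip(residues, source)]

    # Step 2: weighted sum of all delta_k for one modulus b (one thread each)
    def weighted_sum(b):
        return sum(d * ((A // a) % b) for d, a in zip(deltas, source)) % b

    # Step 3: derive alpha (single thread), or skip it entirely
    alpha = 0
    if redundant is not None:
        m_r, x_r = redundant
        alpha = ((weighted_sum(m_r) - x_r) * pow(A, -1, m_r)) % m_r

    return [(weighted_sum(b) - alpha * (A % b)) % b for b in target], alpha
```

Calling the function once with and once without the redundant residue makes the
offset visible: the two outputs differ by exactly α·A modulo each target
modulus.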

Comparison and Selection. Clearly, Bajard's method is the fastest since it
involves no computation of α. Shenoy's method only involves a small divergent
part. However, we pay the price of an additional thread for the redundant
modulus, or equivalently decrease the size of M. Kawamura's technique consists of
a slightly larger divergent part; however, it neither includes look-ups nor
further reduces the size of M.
Not all base extension mechanisms can be used for both directions required for
Algorithm 2. For Bajard's method, consider the consequence of an offset in the
second base extension: we would compute some w′ in base A that is not equal to
the w in base B. As a result, neither w_A nor w′_B could be computed, leading
to an invalid input for a subsequent execution of Algorithm 2. Thus, their
method is only available for A → B conversions. Shenoy's method can only be used
for the second base extension as there is no efficient way to carry the redundant
residue through the computation of f modulo A. The technique by Kawamura et al.
would in principle be available for both conversions. However, the sizes of both
bases would have to differ to allow proper reduction in the A → B case; thus we
exclude this option from our consideration. Table 1 shows the available and the
practical combinations.

Table 1. Base Extension Algorithm Combinations

                                    A → B
B → A            MRC (M)   Shenoy (S)   Kawamura (K)   Bajard (B)
MRC (M)             •          ◦             ◦              •
Shenoy (S)          •          ◦             ◦              •
Kawamura (K)        •          ◦             ◦              •
Bajard (B)          ◦          ◦             ◦              ◦

5.3 Point Multiplication Using Generalised Mersenne Primes

For realising the elliptic curve group operation, we chose mixed affine-Jacobian
coordinates [8] to avoid costly inversions in the underlying field and thus con-
centrated on efficient implementation of modular multiplication, the remaining
time critical operation. For this, we used a straightforward schoolbook-type mul-
tiplication combined with the efficient reduction technique for the generalised
Mersenne prime presented in Algorithm 3.
As for the CIOS method, there is no intrinsic parallelism except pipelining in
this approach (cf. Criterion A1). Thus, we use one thread per point
multiplication. We assume the use of the same base point P per point
multiplication kP and varying scalars k. Thus, the only input that has to be
transferred are the scalars. Secondly, we transfer the result in projective
Jacobian coordinates back to the host. For efficiency reasons, we again encode
all coordinates interleaved for all threads in a block.
We used shared memory to store all temporary values, nailed to 28 bits to
allow schoolbook multiplication without carry propagation. Thus, we need 8
words per coordinate. Point addition and doubling algorithms were inspired by
libseccure [29]. With this approach, shared memory turns out to be the
limiting factor. Precisely, we require 111 words per point multiplication to store
7 temporary coordinates for point addition and modulo arithmetic, two points
and each scalar. This results in 444 bytes of shared memory and a maximum
of ⌊16384/444⌋ = 36 threads per multiprocessor. This still leaves room for
improvements as Criterion A1 is not fulfilled. However, due to internal errors in
the toolchain, we were not (yet) able to compile a solution that uses global
memory for temporary values instead. Note that the left-to-right binary method for
point multiplication demands only one temporary point. However, for the sake
of a homogeneous flow of instructions, we compute both possible solutions per
scalar bit and use a small divergent section to decide which of them is the
desired result (cf. Criterion A4).
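
The homogeneous variant of the left-to-right binary method can be sketched as
follows. The sketch is generic Python with hypothetical names: the group
operations are passed in as functions, and for checking it is run on the
additive group of integers modulo a small prime as a stand-in for the elliptic
curve group, which is an assumption made purely to keep the example short.

```python
def scalar_mul_uniform(k, P, add, double, identity, bits):
    """Left-to-right binary method with a uniform instruction flow.

    For every scalar bit both candidates (doubling only, and doubling
    followed by an addition of P) are computed; only the final selection
    depends on the bit value.
    """
    Q = identity
    for i in reversed(range(bits)):
        doubled = double(Q)
        summed = add(doubled, P)            # always computed, even if unused
        bit = (k >> i) & 1
        Q = summed if bit else doubled      # the small divergent section
    return Q


# Stand-in group for illustration: integers modulo a prime under addition.
M_TOY = 1009


def toy_add(a, b):
    return (a + b) % M_TOY


def toy_double(a):
    return (2 * a) % M_TOY
```

The extra addition per zero bit is wasted work, but all threads of a warp
execute the same sequence of group operations regardless of their scalars.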

6 Conclusion
With the previously discussed implementations on GPUs at hand, we finally
need to identify the candidate providing the best performance for modular
exponentiation.

6.1 Results and Applications

Before presenting the benchmarking results of the best algorithm combinations,
we show our results regarding the different base extension options for the RNS
method. The benchmarking scheme was the following: first, we did an exhaustive
search for the number of registers per thread that can principally be generated
by the toolchain. Then, we benchmarked all available execution configurations
for these numbers of registers. To make the base extension algorithms
comparable, we would have to repeat this for all possible combinations, as shown
in Table 1. However, to reduce the complexity of benchmarking, it suffices to
measure all possible combinations in the first row and all possible combinations
in the second column to gain figures for all available combinations. The results
for the particular best configuration can be found in Table 2.

Table 2. Results for different Base Extension Techniques (RNS Method)

Base Ext.        Throughput (1024 bits)    Throughput (2048 bits)
A → B   B → A    [Enc/s]    (rel.)         [Enc/s]    (rel.)
M       M          194      (46%)            28       (50%)
B       M          267      (63%)            38       (67%)
B       K          408      (97%)            55       (98%)
B       S          419      (100%)           56       (100%)
Clearly, the mixed radix based approach also used in [24] cannot compete with
CRT-based solutions. The method of Kawamura et al. is slower than the method of
Shenoy et al., but performs only slightly worse for the 2048 bit range. Figure 1
shows the time over the number of encryptions for the four cases and the 1024 bit
and 2048 bit ranges, respectively.
Both graphs show the characteristic behaviour: depending on the number of
blocks that are started on the GPU and the respective execution configuration, we
get stair-like graphs. Only multiples of the number of warps per multiprocessor
and the number of multiprocessors result in optimal configurations that fully
utilise the GPU. However, depending on the number of registers per thread and
the amount of shared memory used, other configurations are possible and lead
to smaller steps in between.
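
The stair-like shape can be reproduced with a very simple model. The following
Python sketch is an idealisation and not a description of the measured
behaviour: it assumes that the GPU processes a fixed number of operations
concurrently in one "step" of constant duration, and both parameters are
hypothetical values chosen for illustration.

```python
def stair_time_ms(num_ops, ops_per_step, step_ms):
    """Elapsed time under the idealised model: every started step costs
    the full step time, whether or not it is completely filled."""
    steps = -(-num_ops // ops_per_step)   # integer ceiling division
    return steps * step_ms


def model_throughput(num_ops, ops_per_step, step_ms):
    """Operations per second under the same model."""
    return 1000.0 * num_ops / stair_time_ms(num_ops, ops_per_step, step_ms)
```

Under this model the throughput peaks exactly at multiples of the step capacity
and drops right after each of them, which matches the qualitative observation
that only certain configurations fully utilise the GPU.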

Optimised Implementations. Besides the reference implementation based on
the CIOS algorithm, we selected as best choice the CRT-RNS method based on a
combination of Bajard's and Shenoy's methods to compute the first and second
base extension of Algorithm 2, respectively.
The selection of the implementation was primarily motivated by achieving high
throughput rather than a small latency. Hence, due to the latency, not all
implementations might be suitable for all practical applications. To reflect
this, we present figures for data throughput as well as the initial latency
t_min required at the beginning of a computation. Note that our results consider
optimal configurations of warps per block and blocks per grid only. Table 3 shows
the figures for modular exponentiation with 1024 and 2048 bit moduli and elliptic
curve point multiplication using NIST's P-224 curve.
The throughput is determined from the number of encryptions divided by the
elapsed time. Note that this includes the initial latency t_min at the beginning
of the computations. The corresponding graphs are depicted in Figure 2. Note the
relatively long plateau when using the CIOS technique. It is a direct result of
having coarse-grained parallelism only: the smallest number of encryptions that
can be processed is 128 times higher than for the RNS method. Its high offset
is due to storing temporary values in global memory: memory access latency is
hidden by scheduling independent computations; however, the time needed to
fetch/store the first value in each group cannot be hidden.

Fig. 1. Results for Modular Exponentiation with about 1024 (left) and 2048 bit
(right) Moduli for Different Base Extension Methods, based on an Nvidia 8800 GTS
Graphics Card (time in ms over the number of encryptions; curves: MRC, MRC;
Bajard et al, MRC; Bajard et al, Shenoy et al; Bajard et al, Kawamura et al)

Table 3. Results for Throughput and Minimum Latency t_min on an Nvidia 8800 GTS
Graphics Card

Technique            Throughput             Latency t_min   OPs at t_min
                     [OPs/s]    [ms/OP]     [ms]
ModExp-1024 CIOS       813.0       1.2        6930            1024
ModExp-1024 RNS        439.8       2.3         144               4
ModExp-2048 CIOS       104.3       9.6       55184            1536
ModExp-2048 RNS         57.9      17.3         849               4
ECC PointMul-224      1412.6       0.7         305              36
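
The relation between the columns of Table 3 can be checked with a few lines of
Python. The helper names are hypothetical; the numbers are the ones printed in
Table 3. The first helper converts a throughput into the average time per
operation, the second applies the definition of throughput (operations divided
by elapsed time) to the smallest batch, i.e., to the "OPs at t_min" processed
within t_min.

```python
def ms_per_op(throughput_ops_per_s):
    """Average time per operation at a given sustained throughput."""
    return 1000.0 / throughput_ops_per_s


def throughput(num_ops, elapsed_ms):
    """Operations per second, elapsed time including the initial latency."""
    return 1000.0 * num_ops / elapsed_ms


# (sustained throughput [OPs/s], t_min [ms], OPs at t_min), copied from Table 3
TABLE3 = {
    "ModExp-1024 CIOS": (813.0, 6930, 1024),
    "ModExp-1024 RNS": (439.8, 144, 4),
    "ModExp-2048 CIOS": (104.3, 55184, 1536),
    "ModExp-2048 RNS": (57.9, 849, 4),
    "ECC PointMul-224": (1412.6, 305, 36),
}

for name, (ops_s, t_min, ops_min) in TABLE3.items():
    print(f"{name:18s} {ms_per_op(ops_s):5.1f} ms/OP, "
          f"{throughput(ops_min, t_min):7.1f} OPs/s for the smallest batch")
```

For every technique the rate obtained for the smallest batch is well below the
sustained throughput, which is why the figures in Table 3 only apply to
sufficiently large workloads.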
Clearly, the CIOS method delivers the highest throughput at the price of a
high initial latency. For interactive applications such as online banking using
TLS, this will be a major obstacle. However, non-interactive applications like a
certificate authority (CA) might benefit from the raw throughput.³ Note that
both applications will share the same secret key for all digital signatures when
using RSA. In the case of ECC (ECDSA), however, different exponents were taken
into account.
The residue number system based approach features only roughly half
of the throughput but provides a more immediate data response. Thus, this
method seems to be suitable even for interactive applications. Last but not
least, elliptic curve cryptography clearly outperforms modular exponentiation
based techniques, not only due to the much smaller parameters. With respect to
other hardware and software implementations compared against our results in the
next section, we present an ECC solution which outperforms most hardware devices
and comes close to the performance of recent dual-core microprocessors.

³ Also consider the top model of Nvidia's next series of GPUs, the GeForce
9800GX2, that can be used in a four-card setup.

Fig. 2. Results for Modular Exponentiation with about 1024 (left) and 2048 bit
(right) Moduli and Elliptic Curve Point Multiplication on NIST's P-224 Curve,
based on an Nvidia 8800 GTS Graphics Card (time in ms over the number of
operations; curves: Modular Exponentiation, CIOS Method; Modular Exponentiation,
RNS Method; Point Multiplication, NIST P-224)

6.2 Comparison with Previous Implementations

Due to the novelty of general purpose computations on GPUs and since directly
comparable results are rare, we also refer to recent hardware and software
implementations in the literature. To give a feeling for the different
GPU generations, we include Table 4.

Table 4. Comparison of Nvidia GPU platforms

GPU       Shader clock   Shaders   Fill Rate     Mem Bandwidth   CUDA
          [MHz]                    [GPixels/s]   [GB/s]
7800GTX   -              -         13.2          54.4            no
8800GTS   1200           92        24.0          64.0            yes
8800GTX   1350           128       36.8          86.4            yes
9800GX2   1500           2 · 128   76.8          128.0           future

Moss et al. implemented modular exponentiation for 1024 bit moduli on
Nvidia's 7800GTX GPU [24], using the same RNS approach but picking different
base extension mechanisms. The authors present only the maximum throughput,
which has been achieved at the cost of an unspecified but high latency.
Fleissner's recent analysis on modular exponentiation for GPUs is based on
192 bit moduli but relates the GPU performance solely to the CPU of his host
system.
Costigan and Scott implemented modular exponentiation on IBM's Cell platform,
i.e., a Sony Playstation 3 and an IBM MPM blade server, both running
at 3.2 GHz [10]. We only quote the best figures for the Playstation 3 as they
call the results for the MPM blade preliminary. The Playstation features one
PowerPC core (PPU) and 6 Synergistic Processing Elements (SPUs). Software
results have been obtained from ECRYPT's eBATS project [11]. Here, we picked
a recent Intel Core2 Duo with 2.13 GHz clock frequency. Since almost all figures
for software relate to cycles, we assumed that repeated computations can be
performed without interruption on all available cores so that no further cycles
are spent, e.g., on scheduling or other administrative tasks. Note that this is a
very optimistic assumption, possibly overrating the performance of
microprocessors with respect to actual applications. We also compare our work to
the very fast software implementation by [15] on an Intel Core2 system at
2.66 GHz, which, however, uses a special non-standard curve in Montgomery form
over F_{2^255 − 19}.

Table 5. Comparison of our designs to results from literature. Higher throughput
values are better. ModExp-i denotes modular exponentiation using an i-bit
modulus. PointMul-i denotes point multiplication on elliptic curves over F_p,
where p is an i-bit prime. Results that used the Chinese remainder theorem are
marked with "CRT".

Throughput in [ModExps/s] and [PointMuls/s]. Column headings: ModExp-1024, CRT;
ModExp-2048, CRT; ECC PointMul-160; ECC PointMul-224; ECC PointMul-256;
ModExp-1024; ModExp-2048

Reference           Platform & Technique                      Throughput
Our Design          Nvidia 8800GTS GPU, CIOS algorithm        813.0  104.3
                    Nvidia 8800GTS GPU, RNS arithmetic        439.8  57.9
                    Nvidia 8800GTS GPU, ECC NIST-224          1412.6
[24] Moss           Nvidia 7800GTX GPU, RNS arithmetic        175.4
[10] Costigan       Sony Playstation 3, 1 PPU, 6 SPUs         909.2  401.4
[22] Mentens        Xilinx xc2vp30 FPGA                       471.7  1724.1  235.8  1000.0  440.5
[32] Schinianakis   Xilinx xc2vp125 FPGA, RNS arithmetic      413.9
[36] Suzuki         Xilinx xc4fx12 FPGA, using DSPs           584.8  79.4
[26] Nozaki         0.25 μm CMOS, 80 MHz, 221k GE             238.1  34.2
[11] eBATS          Intel Core2 2.13 GHz                      1447.5  300.4  2623.4  1868.5 (a)  1494.8 (a)
[15] Gaudry         Intel Core2 2.66 GHz                      6900 (b)

(a) Performance for ECDSA operation including additional modular inversion and
    multiplication operation.
(b) Special elliptic curve in Montgomery form, non-compliant to ECC standardised
    by NIST.
To the best of our knowledge, Mentens published the best results for public key
cryptography on reconfigurable hardware so far [22]. She used a Field
Programmable Gate Array (FPGA) of Xilinx' Virtex-II Pro family, namely the
xc2vp30-7FF1152. Schinianakis et al. implemented elliptic curve cryptography on
the same family of FPGAs but using RNS arithmetic for the underlying field [32].
Suzuki implemented the modular exponentiation on FPGAs taking advantage of the
included digital signal processors (DSPs) on a board from Xilinx' Virtex 4 FX
family [36].
Nozaki et al. designed an RSA circuit in 0.25 μm CMOS technology that needs
221k gate equivalents (GE) [26] and uses RNS arithmetic with Kawamura's base
extension mechanism.

6.3 Further Work

Elliptic curves in Hessian form feature highly homogeneous formulae to
compute all three projective coordinates in point additions [19,34]. However, the
curves standardised by ANSI and NIST cannot be transformed to Hessian form.
Furthermore, point doublings can be converted to point additions by simple
coordinate rotations. Thus, it is possible to compute point doublings and
additions for all three coordinates in parallel. A future study will show the
applicability to graphics hardware.

References
1. Advanced Micro Devices, Inc. (AMD), Sunnyvale, CA, USA. ATI CTM Guide,
Release 1.01 (2006)
2. American National Standards Institute (ANSI). Public key cryptography for the
financial services industry: The elliptic curve digital signature algorithm (ECDSA)
(ANSI X9.62:2005) (2005)
3. Bajard, J.-C., Didier, L.-S., Kornerup, P.: Modular multiplication and base exten-
sion in residue number systems. In: Burgess, N. (ed.) Proceedings ARITH15, the
15th IEEE Symposium on Computer Arithmetic, Vail, Colorado, USA, pp. 59–65
(June 2001)
4. Bajard, J.-C., Meloni, N., Plantard, T.: Efficient RNS bases for cryptography. In:
Proceedings of IMACS 2005 World Congress, Paris, France (July 2005)
5. Bajard, J.-C., Plantard, T.: RNS bases and conversions. Advanced Signal Process-
ing Algorithms, Architectures, and Implementations XIV 5559(1), 60–69 (2004)
6. Koç, Ç.K., Acar, T., Kaliski Jr., B.S.: Analyzing and comparing Montgomery mul-
tiplication algorithms. IEEE Micro 16(3), 26–33 (1996)
7. Koç, Ç.K., Naccache, D., Paar, C. (eds.): CHES 2001. LNCS, vol. 2162. Springer,
Heidelberg (2001)
8. Cohen, H., Frey, G. (eds.): Handbook of elliptic and hyperelliptic curve cryptogra-
phy. Chapman & Hall/CRC Press, Boca Raton (2005)
9. Cook, D.L., Ioannidis, J., Keromytis, A.D., Luck, J.: CryptoGraphics: Secret key
cryptography using graphics cards. In: Menezes, A. (ed.) CT-RSA 2005. LNCS,
vol. 3376. Springer, Heidelberg (2005)
10. Costigan, N., Scott, M.: Accelerating SSL using the vector processors in IBM’s Cell
broadband engine for Sony’s Playstation 3. In: SPEED 2007 Workshop Record [12]
(2007), http://www.hyperelliptic.org/SPEED/
11. ECRYPT. eBATS: ECRYPT benchmarking of asymmetric systems. Technical
report (2007), http://www.ecrypt.eu.org/ebats/
12. ECRYPT European Network of Excellence in Cryptography. Software Performance
Enhancement for Encryption and Decryption (SPEED), 2007 Workshop Record,
Amsterdam, The Netherlands (June 2007),
http://www.hyperelliptic.org/SPEED/
13. Fan, J., Sakiyama, K., Verbauwhede, I.: Montgomery modular multiplication
algorithm for multi-core systems. In: SPEED 2007 Workshop Record [12] (2007),
http://www.hyperelliptic.org/SPEED/
14. Fleissner, S.: GPU-accelerated Montgomery exponentiation. In: Shi, Y., van Al-
bada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp.
213–220. Springer, Heidelberg (2007)
15. Gaudry, P., Thomé, E.: The mpFq library and implementing curve-based key
exchanges. In: SPEED 2007 Workshop Record [12], pp. 49–64 (2007),
http://www.hyperelliptic.org/SPEED/
16. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography.
Springer, New York (2003)
17. Harris, M.: Optimizing CUDA. In: Supercomputing 2007 Tutorial, Reno, NV, USA
(November 2007)
18. Harrison, O., Waldron, J.: AES encryption implementation and analysis on com-
modity graphics processing unit. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007.
LNCS, vol. 4727, pp. 209–226. Springer, Heidelberg (2007)
19. Hisil, H., Carter, G., Dawson, E.: Faster group operations on special elliptic curves.
Cryptology ePrint Archive, Report 2007/441 (2007), http://eprint.iacr.org/
20. Kawamura, S., Koike, M., Sano, F., Shimbo, A.: Cox-rower architecture for fast par-
allel Montgomery multiplication. In: Preneel, B. (ed.) EUROCRYPT 2000. LNCS,
vol. 1807, pp. 523–538. Springer, Heidelberg (2000)
21. Manavski, S.A.: CUDA compatible GPU as an efficient hardware accelerator for
AES cryptography. In: Proceedings of IEEE’s International Conference on Signal
Processing and Communication ICSPC 2007, pp. 65–68 (November 2007)
22. Mentens, N.: Secure and Efficient Coprocessor Design for Cryptographic Appli-
cations on FPGAs. PhD thesis, Katholieke Universiteit Leuven, Leuven-Heverlee,
Belgium (June 2007)
23. Montgomery, P.L.: Modular multiplication without trial division. Mathematics of
Computation 44(170), 519–521 (1985)
24. Moss, A., Page, D., Smart, N.: Toward acceleration of RSA using 3d graphics hard-
ware. In: Galbraith, S.D. (ed.) Cryptography and Coding 2007. LNCS, vol. 4887,
pp. 369–388. Springer, Heidelberg (2007)
25. National Institute of Standards and Technology (NIST). Digital signature standard
(DSS) (FIPS 186-2) (January 2000)
26. Nozaki, H., Motoyama, M., Shimbo, A., Kawamura, S.: Implementation of RSA
algorithm based on RNS Montgomery multiplication. In: Koç, Ç.K., Naccache, D.,
Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 364–376. Springer, Heidelberg
(2001)
27. Nvidia Corporation, Santa Clara, CA, USA. Compute Unified Device Architecture
(CUDA) Programming Guide, Version 1.0 (June 2007)
28. Nvidia Corporation, Santa Clara, CA, USA. Parallel Thread Execution (PTX) ISA
Version 1.0, Release 1.0 (June 2007)
29. Poettering, B.: seccure – SECCURE elliptic curve crypto utility for reliable
encryption, version 0.3 (August 2006), http://point-at-infinity.org/seccure/
30. Rivest, R., Shamir, A., Adleman, L.: A method for obtaining digital signatures and
public key cryptosystems. In: Communications of the ACM, vol. 21, pp. 120–126
(February 1978)
31. Rosenberg, U.: Using graphic processing unit in block cipher calculations. Master’s
thesis, University of Tartu, Tartu, Estonia (2007)
32. Schinianakis, D.M., Kakarountas, A.P., Stouraitis, T.: A new approach to elliptic
curve cryptography: an RNS architecture. In: Proceedings of IEEE’s 14th Mediter-
ranian Electrotechnical Conference (MELECON 2006), pp. 1241–1245 (May 2006)
33. Shenoy, A.P., Kumaresan, R.: Fast base extension using a redundant modulus in
RNS. In: IEEE Transactions on Computers, vol. 38, pp. 292–297 (February 1989)
34. Smart, N.P.: The Hessian form of an elliptic curve. In: Koç, Ç.K., et al. (eds.) [7],
pp. 118–125
35. Stinson, D.R.: Cryptography. Theory and Practice, 3rd edn. Taylor & Francis,
Abington (2005)
36. Suzuki, D.: How to maximize the potential of FPGA resources for modular expo-
nentiation. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727,
pp. 272–288. Springer, Heidelberg (2007)
37. Szabó, N.S., Tanaka, R.I.: Residue Arithmetic and its Applications to Computer
Technology. McGraw-Hill Inc., USA (1967)

A Appendix

Table 6. Modulus Sizes for Modular Multiplication Using RNS

1st Base Ext.   2nd Base Ext.   1024 bit range   2048 bit range
Bajard et al.   Shenoy et al.        981              2003
Bajard et al.   Others              1013              2035
Others          Shenoy et al.        990              2014
Others          Others              1022              2046
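
The entries of Table 6 can be reproduced from the size relations A > 2^2 M and
A > (2+n)^2 M stated in Section 5.2. The following Python sketch does so under
assumptions that are not spelled out in the text: every modulus is treated as
contributing a full 32 bits to the dynamic range, the 1024 bit range uses one
warp (32 threads) and the 2048 bit range two warps (64 threads), and Shenoy's
method gives up one thread for the redundant modulus m_r. The function name is
hypothetical.

```python
from math import floor, log2


def max_modulus_bits(threads, bajard_first=False, shenoy_second=False,
                     word_bits=32):
    """Largest bit size of M for a given number of threads per encryption."""
    # Shenoy's second base extension reserves one thread for m_r
    n = threads - 1 if shenoy_second else threads
    # bits lost to the factor 2^2, or (2+n)^2 with Bajard's method
    slack = 2 * log2(2 + n) if bajard_first else 2.0
    return floor(n * word_bits - slack)


for threads in (32, 64):
    for bajard in (True, False):
        for shenoy in (True, False):
            print(threads, bajard, shenoy,
                  max_modulus_bits(threads, bajard, shenoy))
```

That the printed values agree with Table 6 suggests the assumptions are
reasonable, but it does not confirm that the table was derived this way.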

No ratings yet
Quantum Cryptography Based On Orthogonal States: Lior Goldenberg and Lev Vaidman
12 pages
Beating The PNS Attack in Practical Quantum Cryptography
No ratings yet
Beating The PNS Attack in Practical Quantum Cryptography
12 pages
A Comparative Survey of Symmetric and Asymmetric Key Cryptography
No ratings yet
A Comparative Survey of Symmetric and Asymmetric Key Cryptography
12 pages
Sportage 2017 Headlights Adjustement
No ratings yet
Sportage 2017 Headlights Adjustement
5 pages
Continuous Variable Quantum Cryptography
No ratings yet
Continuous Variable Quantum Cryptography
11 pages
Quantum Cryptography With 3-State Systems: Helle Bechmann-Pasquinucci and Asher Peres
No ratings yet
Quantum Cryptography With 3-State Systems: Helle Bechmann-Pasquinucci and Asher Peres
11 pages
United States Patent (10) Patent No.: US 8.477,940 B2
No ratings yet
United States Patent (10) Patent No.: US 8.477,940 B2
10 pages
LMHF Data Sheet Most Recent One
No ratings yet
LMHF Data Sheet Most Recent One
2 pages
United States Patent (19) : Watanabe (54) - Sensing Amplifier Including
No ratings yet
United States Patent (19) : Watanabe (54) - Sensing Amplifier Including
8 pages
7 PDF
No ratings yet
7 PDF
8 pages
ICI Overview
No ratings yet
ICI Overview
40 pages
Quantum Cryptography With Imperfect Apparatus
No ratings yet
Quantum Cryptography With Imperfect Apparatus
7 pages
Put The Verbs Into The Correct Tense (Simple Present or Present Progressive)
No ratings yet
Put The Verbs Into The Correct Tense (Simple Present or Present Progressive)
3 pages
"Plug and Play" Systems For Quantum Cryptography
No ratings yet
"Plug and Play" Systems For Quantum Cryptography
6 pages
Optimal Eavesdropping in Quantum Cryptography With Six States
No ratings yet
Optimal Eavesdropping in Quantum Cryptography With Six States
5 pages
Device-Independent Security of Quantum Cryptography Against Collective Attacks
No ratings yet
Device-Independent Security of Quantum Cryptography Against Collective Attacks
5 pages
LLM4RE Final Submitted1
No ratings yet
LLM4RE Final Submitted1
25 pages
Entangled State Quantum Cryptography: Eavesdropping On The Ekert Protocol
No ratings yet
Entangled State Quantum Cryptography: Eavesdropping On The Ekert Protocol
4 pages
Quantum Cryptography Using Entangled Photons in Energy-Time Bell States
No ratings yet
Quantum Cryptography Using Entangled Photons in Energy-Time Bell States
4 pages
Optimal Eavesdropping in Cryptography With Three-Dimensional Quantum States
No ratings yet
Optimal Eavesdropping in Cryptography With Three-Dimensional Quantum States
4 pages
Quantum Cryptography With Entangled Photons
No ratings yet
Quantum Cryptography With Entangled Photons
4 pages
Single Photon Quantum Cryptography: PACS Numbers: 03.67.Dd, 42.50.Dv
No ratings yet
Single Photon Quantum Cryptography: PACS Numbers: 03.67.Dd, 42.50.Dv
4 pages
Beton Dizayn Programi
No ratings yet
Beton Dizayn Programi
4 pages
UGC - NET December 2024 Admit Card: Ugcnet - Nta.ac - in
No ratings yet
UGC - NET December 2024 Admit Card: Ugcnet - Nta.ac - in
2 pages
Image Encryption Using Affine Transform and XOR Operation
No ratings yet
Image Encryption Using Affine Transform and XOR Operation
4 pages
VOCABULARY-sport Car Lis
No ratings yet
VOCABULARY-sport Car Lis
16 pages
Portugal v. Portugal-Beltran, G.R. No. 155555, 16 August 2005
No ratings yet
Portugal v. Portugal-Beltran, G.R. No. 155555, 16 August 2005
10 pages
Experiences With Ides and Java Teaching: What Works and What Doesn'T
No ratings yet
Experiences With Ides and Java Teaching: What Works and What Doesn'T
2 pages
What Is Prompt Engineering v2
No ratings yet
What Is Prompt Engineering v2
6 pages
01 23 ADCB Fire Pipes Egy Gulf Rev.01
No ratings yet
01 23 ADCB Fire Pipes Egy Gulf Rev.01
3 pages
VAHAN 4.0 (Citizen Services) Onlineapp02 150 8013
No ratings yet
VAHAN 4.0 (Citizen Services) Onlineapp02 150 8013
2 pages
Mining for Knowledge: Exploring GPU Architectures In Cryptocurrency and AI: The Crypto Mining Mastery Series, #2
From Everand
Mining for Knowledge: Exploring GPU Architectures In Cryptocurrency and AI: The Crypto Mining Mastery Series, #2
Ladd Baby
No ratings yet
GPU Assembly and Shader Programming for Compute: Low-Level Optimization Techniques for High-Performance Parallel Processing
From Everand
GPU Assembly and Shader Programming for Compute: Low-Level Optimization Techniques for High-Performance Parallel Processing
Robert Johnson
No ratings yet
Engineering AI Excellence
From Everand
Engineering AI Excellence
Azhar ul Haque Sario
No ratings yet



Keywords: Asymmetric Cryptosystems, Graphics Processing Unit,

3 Using GPUs for General-Purpose Applications

3.1 Traditional GPU Computing

3.2 Programming GPUs Using Nvidia’s CUDA Framework

A. Maximise use of available processing power

B. Maximise use of available memory bandwidth

CUDA Limitations. Although CUDA programs are written in the C language

4 Modular Arithmetic on GPUs

Modular Addition and Subtraction. In general, addition s ≡ a + b mod m

Modular Multiplication. Multi-precision modular multiplication r ≡ a·b mod

4.1 Modular Multiplication Using Montgomery’s Technique

In 1985 Peter L. Montgomery proposed an algorithm [23] to remove the costly

4.2 Modular Multiplication in Residue Number Systems (RNS)

As an alternative approach to conventional base-$2^w$ arithmetic, we can represent

xA = x0 , x1 , . . . , xn−1 A = |x|m0 , |x|m1 , . . . , |x|mn−1 A (1)

Algorithm 1. Montgomery Multiplication for Multi-Precision Integers

which allows carry-free computations2 and multiplication without partial prod-

Algorithm 2. Modular Multiplication Algorithm for Residue Number Sys-

4.3 Base Extension Using a Mixed Radix System (MRS)

$x_0 = x_0 \pmod{m_0} \qquad (4)$

this representation to a target RNS base, we could reduce Equation

4.4 Base Extension Using the Chinese Remainder Theorem (CRT)

Next, they approximate α by using $\mathrm{trunc}_r(\delta_k)$ as numerator and $2^w$ as denomi-

Thus, the approximate value α can be computed in fixed-point arithmetic as

w = w + αM, i.e., the result w will contain a maximum offset of (n − 1)M,

4.5 Multiplication Modulo Generalised Mersenne Primes

For some cryptosystems like DSA, arithmetic in an underlying prime field is

Algorithm 3. NIST Reduction for P-224 = $2^{224} - 2^{96} + 1$

In this section we will describe the implementation of two primitive operations

5.1 Modular Exponentiation Using the CIOS Method

We implemented the CIOS Method as introduced in Algorithm 1 for sequential

5.2 Modular Exponentiation Using Residue Number Systems

bases. As a result, we have coarse-grained (different exponentiations) and fine-

Comparison and Selection. Clearly, Bajard’s method is the fastest since it

Table 1. Base Extension Algorithm Combinations

5.3 Point Multiplication Using Generalised Mersenne Primes

6.1 Results and Applications

Table 2. Results for different Base Extension Techniques (RNS Method)

Base Ext. Throughput (1024 bits) Throughput (2048 bits)

Table 1. However, to reduce the complexity of benchmarking, it suffices to mea-

Optimised Implementations. Besides the reference implementation based on

Technique Throughput Latency $t_{\min}$ OPs at $t_{\min}$

hidden by scheduling independent computations, however the time needed to

6.2 Comparison with Previous Implementations

Table 4. Comparison of Nvidia GPU platforms

GPU Shader clock Shaders Fill Rate Mem Bandwidth CUDA

Moss et al. implemented modular exponentiation for 1024-bit moduli on

performed without interruption on all available cores so that no further cycles

6.3 Further Work

Elliptic curves in Hessian form feature highly homogeneous formulae to com-

Table 6. Modulus Sizes for Modular Multiplication Using RNS
