Exploiting the Power of GPUs for Asymmetric Cryptography

Abstract. Modern Graphics Processing Units (GPUs) have reached a
1 Introduction
For the last twenty years graphics hardware manufacturers have focused on pro-
ducing fast Graphics Processing Units (GPUs), specifically for the gaming com-
munity. This has more recently led to devices which outperform general purpose
Central Processing Units (CPUs) for specific applications, particularly when
comparing the MIPS (million instructions per second) benchmarks. Hence, a re-
search community has been established to use the immense power of GPUs for
general purpose computations (GPGPU). In the last two years, prior limitations
of the graphics application programming interfaces (API) have been removed by
GPU manufacturers by introducing unified processing units in graphics cards.
They support a general purpose instruction set by a native driver interface and
framework.
In the field of asymmetric cryptography, the security of all practical cryptosystems relies on hard computational problems that depend strongly on the choice of parameters. But with rising parameter sizes (often in the range of 1024–4096
E. Oswald and P. Rohatgi (Eds.): CHES 2008, LNCS 5154, pp. 79–99, 2008.
c International Association for Cryptologic Research 2008
80 R. Szerwinski and T. Güneysu
bits), however, computations become more and more challenging for the underlying processor. For modern hardware, the computation of a single cryptographic operation is not critical; however, in a many-to-one communication scenario, such as a central server in a company's data processing centre, it may be confronted with hundreds or thousands of simultaneous connections and corresponding cryptographic operations. As a result, the most common current solution is cryptographic accelerator cards. Due to the limited market, their price tags are often in the range of several thousand euros or US dollars. The question at hand is whether commodity GPUs can be used as high-performance public-key accelerators.
In this work, we will present novel implementations of cryptosystems based on
modular exponentiations and elliptic curve operations on recent graphics hard-
ware. To the best of our knowledge, this is the first publication making use of
the CUDA framework for GPGPU processing of asymmetric cryptosystems. We
will start by implementing the widespread Rivest–Shamir–Adleman (RSA) cryptosystem [30]. The same implementation based on modular exponentiation for large integers can be used to implement the Digital Signature Algorithm (DSA), which has been published by the US National Institute of Standards and Technology (NIST) [25]. Recently, DSA has been adapted to elliptic curve groups in the ANSI X9.62 standard [2]. The implementation of this variant, called ECDSA, is the second major goal of this work.
2 Previous Work
Lately, the research community has started to explore techniques to accelerate
cryptographic algorithms using the GPU. For example, various authors looked
at the feasibility of the current industry standard for symmetric cryptography,
the Advanced Encryption Standard (AES) [21,31,18,9]. Only two groups, namely
Moss et al. and Fleissner, have aimed for the efficient implementation of mod-
ular exponentiation on the GPU [24,14]. Their results were not promising, as
they were limited by the legacy GPU architecture and interface (cf. the next
section). To the best of our knowledge there are no publications about the implementation of these systems on modern, GPGPU-capable hardware, nor about the implementation of elliptic curve based systems.
We aim to fill this gap by implementing the core operations for both systems
efficiently on modern graphics hardware, creating the foundation for the use of
GPUs as accelerators for public key cryptography. We will use Nvidia’s current
flagship GPU series, the G80 generation, together with its new GPGPU interface
CUDA.
Roughly, the graphics pipeline consists of the stages transform & light, assemble primitives, rasterise and shade. The first GPUs had all functions needed to implement the graphics pipeline hardwired, but over time more and more stages became programmable through the introduction of specialised processors, e.g., vertex and fragment processors that made the transform & light and shading stages, respectively, more flexible.
When processing power increased massively while prices kept falling, the re-
search community thought of ways to use these resources for computationally
intense tasks. However, as the processors’ capabilities were very limited and
the API of the graphics driver was specifically built to implement the graphics
pipeline, a lot of overhead needed to be taken into account. For example, all
data had to be encoded in textures which are two dimensional arrays of pixels
storing colour values for red, green, blue and an additional alpha channel used
for transparency. Additionally, textures are read-only objects, which forced the
programmers to compute one step of an algorithm, store the result in the frame
buffer, and start the next step using a texture reference to the newly produced
pixels. This technique is known as ping-ponging. Most GPUs provided only instructions to manipulate floating point numbers, forcing GPGPU programmers to map integers onto the available mantissa and find ways to emulate bit-logical functions, e.g., by using look-up tables.
These limitations have been the main motivation for the key GPU manufac-
turers ATI/AMD and Nvidia to create APIs specifically for the GPGPU com-
munity and modify their hardware for better support: ATI’s solution is called
Close To the Metal (CTM) [1], while Nvidia presented the Compute Unified
Device Architecture (CUDA), a radically new design that makes GPU program-
ming and GPGPU switch places: The underlying hardware of the G80 series is
an accumulation of scalar common purpose processing units (“unified” design)
and quite a bit of “glue” hardware to efficiently map the graphics pipeline to
this new design. GPGPU applications however directly map to the target hard-
ware and thus graphics hardware can be programmed without any graphics API
whatsoever.
In general, the GPU’s immense computation power mainly relies on its inherent
parallel architecture. For this, the CUDA framework introduces the thread as the smallest unit of parallelism, i.e., a small piece of concurrent code with associated state. However, when compared to threads on microprocessors, GPU threads have much lower resource usage and lower creation and switching costs. Note that GPUs are only effective when running a high number of such threads. A group of threads that is executed physically in parallel is called a warp. All threads in one warp are executed in a single instruction multiple data (SIMD) fashion.
If one or more thread(s) in the same warp need to execute different instructions,
e.g., in case of a data-dependent jump, their execution will be serialised and the
threads are called divergent. As the next level of parallelism, a (thread) block
is a group of threads that can communicate with each other and synchronise
their execution. The maximum number of threads per block is limited by the
hardware. Finally, a group of blocks that have the same dimensionality and execute the same CUDA program logically in parallel is called a grid.
To allow optimal performance for different access patterns, CUDA implements
a hierarchical memory model, contrasting the flat model normally assumed on
computers. Host (PC) and device (GPU) have their own memory areas, called
host memory and device memory, respectively. CUDA supplies optimised func-
tions to transfer data between these separate spaces.
Each thread possesses its own register file, which can be read and written.
Additionally, it can access its own copy of so-called local memory. All threads in the same block can access the same on-chip read- and writable shared memory region. To prevent hazards resulting from the concurrent execution of threads, synchronisation mechanisms must be used. Shared memory is organised in groups
called banks that can be accessed in parallel. All threads can access a read- and
writable memory space called global memory and read-only regions called
constant memory and texture memory. Constant memory is optimised for one-dimensional locality of accesses, while texture memory is most effective when being used with two-dimensional arrays (matrices). Note that the texture and constant
memories are the only regions that are cached. Thus, all accesses to the off-chip
regions global and local memory have a high access latency, resulting in penalties
when being used too frequently.
The hardware consists of a number of so-called multiprocessors that are built from SIMD processors, on-chip memory and caches. Each processor executes a particular thread, with the threads of the same warp being run on the multiprocessor at the same time. One or more blocks are mapped to each multiprocessor, sharing its resources (registers and shared memory), and get executed on a time-sliced basis. When a particular block has finished its execution, the scheduler starts the next block of the grid until all blocks have been run.
Design Criteria for GPU Implementations. To achieve optimal perfor-
mance using CUDA, algorithms must be designed to run in a multitude of par-
allel threads and take advantage of the presented hierarchical memory model. In
the following, we enumerate the key criteria necessary for gaining the most out
of the GPU by loosely following the CUDA programming guide [27] and a talk
given by Mark Harris of Nvidia [17].
on ALUs, not caches. Bearing this in mind makes it possible to hide memory access latency by the use of independent computations (latency hiding). Examples include using arithmetic instructions with high throughput as well as re-computing values instead of saving them for later use.
A4. Avoid divergent threads in the same warp.
In the following section we will give different ways to realise modular arithmetic on a GPU efficiently, keeping the aforementioned criteria in mind. For the
RSA cryptosystem we need to implement arithmetic modulo N , where N is the
product of two large primes p and q: N = p · q. The arithmetic of both DSA
systems, however, is based on the prime field GF (p) as the lowest-level building
block. Note that the DSA systems both use a prime p that is fixed (in terms of sessions or key generations), thus allowing us to choose special primes at build time that have advantageous properties when reducing modulo p. For example, the US National Institute of Standards and Technology (NIST) proposes a set of generalised Mersenne primes in the Digital Signature Standard (DSS) [25, Appendix 6]. As the RSA modulus N is the product of two primes p and q that are chosen secretly for each new key pair, we cannot optimise for the modulus in this case.
¹ Nvidia published their own (abstract) assembler language PTX [28]; however, as of CUDA version 1.0 one kernel cannot contain code generated both from the C language and from PTX.
Here, the ordered set of relatively prime moduli $(m_0, m_1, \ldots, m_{n-1})$, $\gcd(m_i, m_j) = 1$ for all $i \neq j$, is called the base and denoted by $\mathcal{A}$. The product of all moduli, $A = \prod_{i=0}^{n-1} m_i$, is called the dynamic range of $\mathcal{A}$, i.e., the number of values that can be uniquely represented in $\mathcal{A}$. In other words, all numbers in $\mathcal{A}$ get implicitly reduced modulo $A$. Such a representation in RNS has the advantage that addition, subtraction and multiplication can be computed independently for all residues:

$\langle x \rangle_{\mathcal{A}} \circ \langle y \rangle_{\mathcal{A}} = \left( |x_0 \circ y_0|_{m_0}, |x_1 \circ y_1|_{m_1}, \ldots, |x_{n-1} \circ y_{n-1}|_{m_{n-1}} \right)_{\mathcal{A}}, \qquad \circ \in \{+, -, \cdot\}$ (2)
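As a toy illustration of these componentwise operations (the base and all helper names below are ours, not the paper's), consider the following Python sketch:

```python
# Toy sketch of RNS arithmetic: a value x < A = prod(m_k) is represented
# by its residues modulo each element of a base of pairwise coprime moduli.
from math import prod

BASE = (13, 17, 19, 23)          # toy base; real bases use word-sized moduli
A = prod(BASE)                   # dynamic range

def to_rns(x):
    return tuple(x % m for m in BASE)

def rns_op(xs, ys, op):
    # Addition, subtraction and multiplication act independently per residue.
    return tuple(op(xk, yk) % m for xk, yk, m in zip(xs, ys, BASE))

x, y = 1234, 567
assert rns_op(to_rns(x), to_rns(y), lambda a, b: a * b) == to_rns((x * y) % A)
```

Each residue channel is independent, which is exactly what makes the representation attractive for a SIMD architecture: one thread per modulus, with no inter-thread carries.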
² Inner-RNS operations still contain carries.
where $\hat{A}_k = A/m_k$ and $\alpha$ is an integer s.t. $0 \le x < A$. Note that $\alpha$ is strictly upper-bounded by $n$. When reducing this equation with an arbitrary target modulus, say $\tilde{m}_i$, we yield

$|x|_{\tilde{m}_i} = \left| \sum_{k=0}^{n-1} \left| \hat{A}_k \right|_{\tilde{m}_i} \delta_k - |\alpha A|_{\tilde{m}_i} \right|_{\tilde{m}_i}, \qquad \delta_k = \left| x_k \cdot \left| \hat{A}_k^{-1} \right|_{m_k} \right|_{m_k}$ (7)

where $|\hat{A}_k|_{\tilde{m}_i}$, $|\hat{A}_k^{-1}|_{m_k}$ and $|A|_{\tilde{m}_i}$ are pre-computed constants. Note that the $\delta_k$ do not depend on the target modulus and can thus be reused in the computation of a different target residue.
This is an efficient way to compute all residues modulo the target base, pro-
vided we know the value of α. While involving a couple of look-ups for the con-
stants as well, the instruction flow is highly uniform (cf. Criterion A4) and fits
to our SIMD architecture, i.e., we can use n threads to compute the n residues
of x in the target base in parallel (cf. Criterion A1).
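A toy Python sketch of Equation (7), under the assumption that the correct $\alpha$ is already known (here it is derived from $x$ itself purely for the demonstration; in practice it is obtained with one of the techniques discussed next). All moduli and helper names are illustrative:

```python
# Toy sketch of the CRT-style base extension of Equation (7).
from math import prod

SRC = (13, 17, 19, 23)                    # source base, pairwise coprime
A = prod(SRC)

def deltas(x_residues):
    # delta_k = | x_k * |A_hat_k^{-1}|_{m_k} |_{m_k}; independent of the target
    return [(xk * pow(A // mk, -1, mk)) % mk
            for xk, mk in zip(x_residues, SRC)]

def extend(ds, alpha, m_target):
    # |x|_m = | sum_k |A_hat_k|_m * delta_k  -  |alpha * A|_m |_m
    s = sum(((A // mk) % m_target) * dk for mk, dk in zip(SRC, ds))
    return (s - alpha * A) % m_target

x = 4242
ds = deltas(tuple(x % m for m in SRC))
# oracle for alpha, only for this demo (normally computed as described below):
alpha = (sum((A // mk) * dk for mk, dk in zip(SRC, ds)) - x) // A
assert extend(ds, alpha, 29) == x % 29
```

Note that the `deltas` list is computed once and reused for every target modulus, mirroring the uniform, per-thread structure described above.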
The first technique to compute such an α is due to Shenoy and Kumaresan
[33] and requires a redundant modulus mr ≥ n that is relatively prime to all other
moduli $m_j$ and $\tilde{m}_i$, i.e., $\gcd(A, m_r) = \gcd(B, m_r) = 1$. Consider Equation 7, set $\tilde{m}_i = m_r$ and rearrange it to the following:

$|\alpha|_{m_r} = \left| |A^{-1}|_{m_r} \cdot \left( \sum_{k=0}^{n-1} \left| \hat{A}_k \right|_{m_r} \delta_k - |x|_{m_r} \right) \right|_{m_r}.$ (8)
Since α < n ≤ mr it holds that α = |α|mr and thus Equation 8 computes the
exact value of α, involving the additional constant |A−1 |mr .
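Shenoy and Kumaresan's exact computation of $\alpha$ can be sketched as follows, again with toy moduli and illustrative names:

```python
# Toy sketch of the exact alpha of Equation (8), using a redundant modulus.
from math import prod

SRC = (13, 17, 19, 23)          # source base, n = 4
A = prod(SRC)
MR = 7                          # redundant modulus m_r >= n, coprime to all m_k

def alpha_shenoy(x_residues, x_mod_mr):
    ds = [(xk * pow(A // mk, -1, mk)) % mk for xk, mk in zip(x_residues, SRC)]
    s = sum(((A // mk) % MR) * dk for mk, dk in zip(SRC, ds)) % MR
    # |alpha|_{m_r} = | |A^{-1}|_{m_r} * (s - |x|_{m_r}) |_{m_r};  alpha < n <= m_r
    return (pow(A, -1, MR) * (s - x_mod_mr)) % MR

x = 31337                       # any x with 0 <= x < A
res = tuple(x % m for m in SRC)
ds = [(xk * pow(A // mk, -1, mk)) % mk for xk, mk in zip(res, SRC)]
true_alpha = (sum((A // mk) * dk for mk, dk in zip(SRC, ds)) - x) // A
assert alpha_shenoy(res, x % MR) == true_alpha
```

The price of this method is that the redundant residue $|x|_{m_r}$ must be carried along through all preceding computations.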
Kawamura et al. propose a different technique that approximates $\alpha$ using fixed-point computations [20]. Consider Equation 7, rearrange it and divide by $A$:

$\alpha = \sum_{k=0}^{n-1} \frac{\delta_k}{m_k} - \frac{x}{A} = \left\lfloor \sum_{k=0}^{n-1} \frac{\delta_k}{m_k} \right\rfloor.$ (9)
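Kawamura's approximation can be sketched with toy 8-bit moduli chosen close to $2^8$ (a precondition of the method); for readability this sketch keeps all bits of each $\delta_k$ rather than only the $r$ most significant ones, uses an offset of $0.5$, and assumes $x$ stays well below $A$. All names are ours:

```python
# Toy sketch of Kawamura et al.'s fixed-point approximation of alpha (Eq. 9):
# delta_k / m_k is approximated by delta_k / 2^W, where every m_k is only
# slightly below 2^W; the offset absorbs the accumulated truncation error.
from math import prod

SRC = (239, 243, 247, 251)   # pairwise coprime, each slightly below 2^8
A = prod(SRC)
W = 8                        # "word size": m_k ~ 2^W

def deltas(x_res):
    return [(xk * pow(A // mk, -1, mk)) % mk for xk, mk in zip(x_res, SRC)]

def alpha_kawamura(ds, offset=128):   # offset = 0.5 in W-bit fixed point
    # sum the (shifted) deltas plus the offset; the integer part is alpha
    return (sum(ds) + offset) >> W

x = 123_456_789              # must stay well below A for the result to be exact
ds = deltas(tuple(x % m for m in SRC))
alpha = alpha_kawamura(ds)
# consistency with Equation (7): x == sum(delta_k * A/m_k) - alpha*A
assert sum(dk * (A // mk) for dk, mk in zip(ds, SRC)) - alpha * A == x
```

The appeal on a GPU is that the single divergent step reduces to one short integer sum and a shift.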
2: Compute $r = t_1 + t_2 + t_3 - t_4 - t_5$
3: return $r \bmod p_{224}$
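Step 1 of the algorithm (the decomposition of the product into the terms $t_1, \ldots, t_5$) is not shown above. The following Python sketch fills it in with the standard word decomposition derived from $2^{224} \equiv 2^{96} - 1 \pmod{p_{224}}$ (cf. FIPS 186-2); the `words` helper and all variable names are ours, and the final full reduction stands in for the few conditional additions/subtractions used in practice:

```python
# Hedged sketch of fast reduction modulo the generalised Mersenne prime
# p224 = 2^224 - 2^96 + 1, as used for NIST's P-224 curve.
P224 = 2**224 - 2**96 + 1

def reduce_p224(c):
    # split c (up to 448 bits) into fourteen 32-bit words c0..c13
    w = [(c >> (32 * i)) & 0xFFFFFFFF for i in range(14)]
    def words(*idx):  # assemble a 224-bit value from word indices, LSW first
        return sum(w[i] << (32 * k) if i is not None else 0
                   for k, i in enumerate(idx))
    t1 = words(0, 1, 2, 3, 4, 5, 6)
    t2 = words(None, None, None, 7, 8, 9, 10)
    t3 = words(None, None, None, 11, 12, 13, None)
    t4 = words(7, 8, 9, 10, 11, 12, 13)
    t5 = words(11, 12, 13, None, None, None, None)
    r = t1 + t2 + t3 - t4 - t5      # step 2 of the algorithm
    return r % P224                  # step 3; conditional add/subs in practice

a, b = 2**200 + 12345, 3**100 + 7
assert reduce_p224(a * b) == (a * b) % P224
```

No division or multi-precision trial reduction is needed, which is what makes a fixed generalised Mersenne prime so attractive compared to the RSA case.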
5 Implementation
This would however need fairly complex coordination and memory techniques
and thus will not be considered further for our implementation, cf. Criteria A4
and B4-B6.
As all modular exponentiations are independent, we let each thread compute
exactly one modular exponentiation in parallel with all others. As a result, this solution profits only from coarse-grained parallelism. We assume the
computation of distinct exponentiations, each having the same exponent t –
for example RSA signatures using the same key – and thus need to transfer
only the messages Pi for each exponentiation to the device and the result Pit
(mod N ) back to the host. As a result, every thread executes the same control
flow, fulfilling Criterion A4. To accelerate memory transfers between host and
device, we use page-locked host memory and pad each message to a fixed length
that forces the starting address of each message to values that are eligible for
global memory coalescing (cf. Criteria B1 and B4).
For modular exponentiation based on Algorithm 1, we applied the straightfor-
ward binary right-to-left method [35]. During exponentiation, each thread needs
three temporary values of (n+2) words each that get used as input and output of
Algorithm 1 in a round-robin fashion by pointer arithmetic. Thus, 3(n+2) words
are required. This leads to 408 bytes and 792 bytes for 1024 bit and 2048 bit parameters, respectively. Each multiprocessor features 16384 bytes of shared memory, resulting in a maximum number of $\lfloor 16384/408 \rfloor = 40$ and $\lfloor 16384/792 \rfloor = 20$
threads per multiprocessor for 1024 and 2048 bits, respectively, if we use shared
memory for temporary values. Clearly, both solutions are inefficient when considering that each multiprocessor is able to execute up to 768 threads in principle (i.e., we favour Criterion A2 over B2).
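The binary right-to-left method itself can be sketched in a few lines of Python; the Montgomery multiplication of Algorithm 1 and the round-robin use of three temporaries are abstracted into ordinary modular multiplication here, and the function name is ours:

```python
# Sketch of binary right-to-left exponentiation: the exponent is scanned
# from its least significant bit, squaring the base at every step.
def modexp_rtl(p, t, n):
    acc = 1          # running result
    sq = p % n       # repeatedly squared base
    while t:
        if t & 1:                # current exponent bit set:
            acc = (acc * sq) % n #   multiply the result by the current square
        sq = (sq * sq) % n
        t >>= 1
    return acc

assert modexp_rtl(7, 560, 561) == pow(7, 560, 561)
```

Because every thread runs the same loop over the same shared exponent, the control flow stays uniform across a warp, as required by Criterion A4.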
Thus, we chose to store the temporary values in global memory. We have to
store the values interleaved so that memory accesses of one word by all threads
in a warp can be combined to one global memory access. Hence, for a given set
of values (A, B, C, . . .) consisting each of n + 2 words X = (x0 , x1 , . . . , xn+1 ),
we store all first words (a0 , b0 , c0 , . . .) for all threads in the same block, then all
second words (a1 , b1 , c1 , . . .), and so on (cf. Criterion B4).
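The interleaved layout can be illustrated with plain index arithmetic (pure Python, names ours): word $j$ of thread $i$ lives at offset $j \cdot T + i$, so the $T$ threads of a warp touch consecutive addresses when they all load word $j$.

```python
# Sketch of the interleaved ("structure of arrays") layout used for
# temporary values in global memory.
def interleave(values):                 # values[i] = word list of thread i
    T, W = len(values), len(values[0])
    flat = [0] * (T * W)
    for i, wrds in enumerate(values):
        for j, wrd in enumerate(wrds):
            flat[j * T + i] = wrd       # word j of thread i
    return flat

def word(flat, T, i, j):                # read word j of thread i
    return flat[j * T + i]

vals = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
flat = interleave(vals)
assert flat[:4] == [10, 20, 30, 40]     # all first words are adjacent
assert word(flat, 4, 2, 1) == 31
```

On the GPU this adjacency is what allows the per-thread loads of one word to be coalesced into a single global memory transaction.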
Moreover, we have to use nailing techniques, as CUDA does not yet include
add-with-carry instructions. Roughly speaking, nailing reserves one or more of
the high-order bits of each word for the carry that can occur when adding two
numbers. To save register and memory space, however, we store the full word
of w bits per register and use bit shifts and and-masking to extract two nibbles,
each providing sufficient bits for the carry (cf. Criterion A3). This can be thought of as decomposing a 32 bit addition into two 16 bit additions plus the overhead for carry handling.
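This decomposition can be sketched as follows (the function name and carry convention are ours):

```python
# Sketch of carry handling without an add-with-carry instruction: a 32-bit
# addition is split into two 16-bit halves, extracting the carry of each
# half by shift and mask.
MASK16 = 0xFFFF

def add32_with_carry(a, b, carry_in=0):
    lo = (a & MASK16) + (b & MASK16) + carry_in
    hi = (a >> 16) + (b >> 16) + (lo >> 16)      # add carry of the low half
    word = ((hi & MASK16) << 16) | (lo & MASK16)
    return word, hi >> 16                        # 32-bit result, carry out

r, c = add32_with_carry(0xFFFFFFFF, 0x00000001)
assert (r, c) == (0, 1)
```

Chaining the carry-out into the next word's carry-in yields multi-precision addition word by word.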
Mixed Radix Conversion. Recall that the mixed radix conversion computes
the mixed radix representation from all residues in the source base first and
uses this value to compute the target residues. The second step involves the
computation of n residues and can be executed in parallel, i.e., each thread
computes the residue for ’its’ modulus. As a result, we have to store the n MRS
digits in shared memory to make them accessible to all threads (cf. Criteria A1
and B2). The first step however is the main caveat of this algorithm due to
its highly divergent nature as each MRS digit is derived from the residue of
a temporary variable in a different modulus (and thus thread) and depends
on all previously computed digits, clearly breaking Criterion A4 and resulting
in serialisation of executions. Additionally, note that threads having already
computed an MRS digit do not generate any useful output anymore.
CRT-Based Conversion. The first step for all CRT-based techniques is to com-
pute the δk for each source modulus and can be carried out by one thread for each
value. Second, all n threads compute a weighted sum involving δk and a modulus-
dependent constant. Note that all threads need to access all δk and thus δk have to
be stored in shared memory (cf. Criterion B2). Third, α has to be derived; its computation is the main difference between the techniques considered. α is needed by all threads later and thus needs to be stored in shared memory as well. After computing α, all threads can proceed with their independent computations.
Bajard’s method does not compute α and consequently needs no further oper-
ations. For Shenoy’s method, the second step above is needed for the redundant
modulus mr as well, which can be done in parallel with all other moduli. Then, a
single thread computes α and writes it to shared memory. The redundant residue $m_r$ comes at the price of an additional thread; however, the divergent part needed to compute α contains only one addition and one multiplication modulo $m_r$.
Kawamura’s method needs to compute the sum of the r most significant bits of
all δk . While the right-shift of each δk can be done using all threads, the sum over
all shifted values and the offset has to be computed using a single thread. A final
right-shift results in the integer part of the sum, namely α.
Possible combinations of base extension methods for the first (A→B) and second (B→A) extension:

                  A→B:  MRC (M)  Shenoy (S)  Kawamura (K)  Bajard (B)
B→A:  MRC (M)              •         ◦            ◦            •
      Shenoy (S)           •         ◦            ◦            •
      Kawamura (K)         •         ◦            ◦            •
      Bajard (B)           ◦         ◦            ◦            ◦
6 Conclusion
With the previously discussed implementations on GPUs at hand, we finally
need to identify the candidate providing the best performance for modular ex-
ponentiation.
[Figure: two plots of running time [ms] versus number of encryptions, comparing the base extension method pairs (MRC, MRC), (Bajard et al., MRC), (Bajard et al., Shenoy et al.) and (Bajard et al., Kawamura et al.)]

Fig. 1. Results for modular exponentiation with about 1024 (left) and 2048 bit (right) moduli for different base extension methods, based on an Nvidia 8800 GTS graphics card
Table 3. Results for Throughput and Minimum Latency tmin on a Nvidia 8800 GTS
Graphics Card
[Figure: two plots of running time [ms] versus number of operations, comparing modular exponentiation with the CIOS method, modular exponentiation with the RNS method, and point multiplication on NIST P-224]

Fig. 2. Results for modular exponentiation with about 1024 (left) and 2048 bit (right) moduli and elliptic curve point multiplication on NIST's P-224 curve, based on an Nvidia 8800 GTS graphics card
Due to the novelty of general purpose computations on GPUs, directly comparable results are rare; we will therefore also refer to recent hardware and software implementations from the literature. To give a feeling for the different GPU generations we include Table 4.
Table 5. Comparison of our designs to results from literature. The higher the throughput values, the better. ModExp-i denotes modular exponentiation using an i-bit modulus. PointMul-i denotes point multiplication on elliptic curves over Fp, where p is an i-bit prime. Results that used the Chinese remainder theorem are marked with "CRT".
Reference | Platform & Technique | Throughput [ModExps/s] and [PointMuls/s], in the columns: ModExp-1024 (CRT), ModExp-2048 (CRT), ECC PointMul-160, ECC PointMul-224, ECC PointMul-256, ModExp-1024, ModExp-2048
Our Design Nvidia 8800GTS GPU, CIOS algorithm 813.0 104.3
Nvidia 8800GTS GPU, RNS arithmetic 439.8 57.9
Nvidia 8800GTS GPU, ECC NIST-224 1412.6
[24] Moss Nvidia 7800GTX GPU, RNS arithmetic 175.4
[10] Costigan Sony Playstation 3, 1 PPU, 6 SPUs 909.2 401.4
[22] Mentens Xilinx xc2vp30 FPGA 471.7 1724.1 235.8 1000.0 440.5
[32] Schinianakis Xilinx xc2vp125 FPGA, RNS arithmetic 413.9
[36] Suzuki Xilinx xc4fx12 FPGA, using DSPs 584.8 79.4
[26] Nozaki 0.25μm CMOS, 80 MHz, 221k GE 238.1 34.2
[11] eBATS Intel Core2 2.13 GHz 1447.5 300.4 2623.4 1868.5ᵃ 1494.8ᵃ
[15] Gaudry Intel Core2 2.66 GHz 6900ᵇ

ᵃ Performance for ECDSA operation including additional modular inversion and multiplication operation.
ᵇ Special elliptic curve in Montgomery form, non-compliant to ECC standardised by NIST.
for all three coordinates in parallel. A future study will show the applicability
to graphics hardware.
References
1. Advanced Micro Devices, Inc. (AMD), Sunnyvale, CA, USA. ATI CTM Guide,
Release 1.01 (2006)
2. American National Standards Institute (ANSI). Public key cryptography for the
financial services industry: The elliptic curve digital signature algorithm (ECDSA)
(ANSI X9.62:2005) (2005)
3. Bajard, J.-C., Didier, L.-S., Kornerup, P.: Modular multiplication and base exten-
sion in residue number systems. In: Burgess, N. (ed.) Proceedings ARITH15, the
15th IEEE Symposium on Computer Arithmetic, Vail, Colorado, USA, pp. 59–65
(June 2001)
4. Bajard, J.-C., Meloni, N., Plantard, T.: Efficient RNS bases for cryptography. In:
Proceedings of IMACS 2005 World Congress, Paris, France (July 2005)
5. Bajard, J.-C., Plantard, T.: RNS bases and conversions. Advanced Signal Process-
ing Algorithms, Architectures, and Implementations XIV 5559(1), 60–69 (2004)
6. Koç, Ç.K., Acar, T., Kaliski Jr., B.S.: Analyzing and comparing Montgomery mul-
tiplication algorithms. IEEE Micro 16(3), 26–33 (1996)
7. Koç, Ç.K., Naccache, D., Paar, C. (eds.): CHES 2001. LNCS, vol. 2162. Springer,
Heidelberg (2001)
8. Cohen, H., Frey, G. (eds.): Handbook of elliptic and hyperelliptic curve cryptogra-
phy. Chapman & Hall/CRC Press, Boca Raton (2005)
9. Cook, D.L., Ioannidis, J., Keromytis, A.D., Luck, J.: CryptoGraphics: Secret key
cryptography using graphics cards. In: Menezes, A. (ed.) CT-RSA 2005. LNCS,
vol. 3376. Springer, Heidelberg (2005)
10. Costigan, N., Scott, M.: Accelerating SSL using the vector processors in IBM’s Cell
broadband engine for Sony’s Playstation 3. In: SPEED 2007 Workshop Record [12]
(2007), http://www.hyperelliptic.org/SPEED/
11. ECRYPT. eBATS: ECRYPT benchmarking of asymmetric systems. Technical re-
port (2007), http://www.ecrypt.eu.org/ebats/
12. ECRYPT European Network of Excellence in Cryptography. Software Performance
Enhancement for Encryption and Decryption (SPEED), 2007 Workshop Record,
Amsterdam, The Netherlands (June 2007),
http://www.hyperelliptic.org/SPEED/
13. Fan, J., Sakiyama, K., Verbauwhede, I.: Montgomery modular multiplication algorithm for multi-core systems. In: SPEED 2007 Workshop Record [12] (2007),
http://www.hyperelliptic.org/SPEED/
14. Fleissner, S.: GPU-accelerated Montgomery exponentiation. In: Shi, Y., van Al-
bada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp.
213–220. Springer, Heidelberg (2007)
15. Gaudry, P., Thomé, E.: The mpFq library and implementing curve-based key ex-
changes. In: SPEED 2007 Workshop Record [12], pp. 49–64 (2007),
http://www.hyperelliptic.org/SPEED/
16. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography.
Springer, New York (2003)
17. Harris, M.: Optimizing CUDA. In: Supercomputing 2007 Tutorial, Reno, NV, USA
(November 2007)
18. Harrison, O., Waldron, J.: AES encryption implementation and analysis on com-
modity graphics processing unit. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007.
LNCS, vol. 4727, pp. 209–226. Springer, Heidelberg (2007)
19. Hisil, H., Carter, G., Dawson, E.: Faster group operations on special elliptic curves.
Cryptology ePrint Archive, Report 2007/441 (2007), http://eprint.iacr.org/
20. Kawamura, S., Koike, M., Sano, F., Shimbo, A.: Cox-rower architecture for fast par-
allel Montgomery multiplication. In: Preneel, B. (ed.) EUROCRYPT 2000. LNCS,
vol. 1807, pp. 523–538. Springer, Heidelberg (2000)
21. Manavski, S.A.: CUDA compatible GPU as an efficient hardware accelerator for
AES cryptography. In: Proceedings of IEEE’s International Conference on Signal
Processing and Communication ICSPC 2007, pp. 65–68 (November 2007)
22. Mentens, N.: Secure and Efficient Coprocessor Design for Cryptographic Appli-
cations on FPGAs. PhD thesis, Katholieke Universiteit Leuven, Leuven-Heverlee,
Belgium (June 2007)
23. Montgomery, P.L.: Modular multiplication without trial division. Mathematics of
Computation 44(170), 519–521 (1985)
24. Moss, A., Page, D., Smart, N.: Toward acceleration of RSA using 3d graphics hard-
ware. In: Galbraith, S.D. (ed.) Cryptography and Coding 2007. LNCS, vol. 4887,
pp. 369–388. Springer, Heidelberg (2007)
25. National Institute of Standards and Technology (NIST). Digital signature standard
(DSS) (FIPS 186-2) (January 2000)
26. Nozaki, H., Motoyama, M., Shimbo, A., Kawamura, S.: Implementation of RSA
algorithm based on RNS Montgomery multiplication. In: Koç, Ç.K., Naccache, D.,
Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 364–376. Springer, Heidelberg
(2001)
27. Nvidia Corporation, Santa Clara, CA, USA. Compute Unified Device Architecture
(CUDA) Programming Guide, Version 1.0 (June 2007)
28. Nvidia Corporation, Santa Clara, CA, USA. Parallel Thread Execution (PTX) ISA
Version 1.0, Release 1.0 (June 2007)
29. Poettering, B.: seccure – SECCURE elliptic curve crypto utility for reliable en-
cryption, version 0.3 (August 2006), http://point-at-infinity.org/seccure/
30. Rivest, R., Shamir, A., Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM 21(2), 120–126 (February 1978)
31. Rosenberg, U.: Using graphic processing unit in block cipher calculations. Master’s
thesis, University of Tartu, Tartu, Estonia (2007)
32. Schinianakis, D.M., Kakarountas, A.P., Stouraitis, T.: A new approach to elliptic
curve cryptography: an RNS architecture. In: Proceedings of IEEE’s 14th Mediter-
ranian Electrotechnical Conference (MELECON 2006), pp. 1241–1245 (May 2006)
33. Shenoy, A.P., Kumaresan, R.: Fast base extension using a redundant modulus in RNS. IEEE Transactions on Computers 38(2), 292–297 (February 1989)
34. Smart, N.P.: The Hessian form of an elliptic curve. In: Koç, Ç.K., et al. (eds.) [7],
pp. 118–125
35. Stinson, D.R.: Cryptography. Theory and Practice, 3rd edn. Taylor & Francis,
Abington (2005)
36. Suzuki, D.: How to maximize the potential of FPGA resources for modular expo-
nentiation. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727,
pp. 272–288. Springer, Heidelberg (2007)
37. Szabó, N.S., Tanaka, R.I.: Residue Arithmetic and its Applications to Computer
Technology. McGraw-Hill Inc., USA (1967)
A Appendix
1st Base Ext.    2nd Base Ext.    1024 bit range    2048 bit range
Bajard et al.    Shenoy et al.    981               2003
Bajard et al.    Others           1013              2035
Others           Shenoy et al.    990               2014
Others           Others           1022              2046