
25 Years of Cryptographic Hardware Design: City University of Istanbul & University of California Santa Barbara

The document summarizes 25 years of advances in cryptographic hardware design for implementing public-key algorithms like RSA and Diffie-Hellman. It describes the initial naive algorithms from 1978-1985, followed by Peter Montgomery's breakthrough Montgomery multiplication algorithm in 1985 which significantly improved efficiency by replacing costly divisions with additions. It then discusses subsequent optimizations like advanced Karatsuba algorithms, various Montgomery multiplication methods, and arithmetic techniques for finite fields that further improved hardware performance.


25 Years of Cryptographic Hardware Design

Çetin Kaya Koç
City University of Istanbul &
University of California Santa Barbara
[email protected]
http://cryptocode.net

25 Years of Computing at CINVESTAV

25 Years of Cryptographic Hardware Design


1975-1977: Invention of Public-Key Cryptography
Diffie-Hellman & RSA Algorithms
Publication Dates: Nov 1976 & Feb 1978
First hardware implementation:
R. L. Rivest. A Description of a Single-Chip Implementation of the RSA
Cipher. Lambda, vol. 1, pages 14-18, 1980.
In 1984, I was a graduate student at UCSB's ECE Department
My interest started with Rivest's hardware paper


Essential Milestones
This talk gives a brief summary of advanced algorithms for creating better
hardware realizations of public-key cryptographic algorithms: Diffie-Hellman, RSA, elliptic curve cryptography
Essential milestones:
Naive algorithms, 1978-1985
Montgomery algorithm, 1985
Advanced Karatsuba algorithms, 1994
Advanced Montgomery algorithms, 1996
Montgomery algorithm in $GF(2^k)$, 1998
Unified arithmetic, 2002
Spectral arithmetic, 2006


RSA Computation
The RSA algorithm uses modular exponentiation for encryption
$C := M^e \pmod{n}$
and decryption
$M := C^d \pmod{n}$
The computation of $M^e \bmod n$ is performed using exponentiation heuristics
Modular exponentiation requires implementation of three basic modular
arithmetic operations: addition, subtraction, and multiplication
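
The encryption and decryption formulas can be tried with a minimal Python sketch. The primes, exponent, and message below are illustrative assumptions (textbook-sized toy values, far too small for real use), and Python's built-in `pow` stands in for a hardware exponentiator.

```python
# Toy RSA round trip; all numeric values are illustrative assumptions.
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

def encrypt(M):
    """C := M^e (mod n)"""
    return pow(M, e, n)

def decrypt(C):
    """M := C^d (mod n)"""
    return pow(C, d, n)

M = 65
C = encrypt(M)
assert decrypt(C) == M
```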


Diffie-Hellman Computation
Similarly, the Diffie-Hellman key exchange algorithm executes the steps
$R_A := g^a \pmod{p}$
$R_B := g^b \pmod{p}$
$R_B := R_A^b = g^{ab} \pmod{p}$
$R_A := R_B^a = g^{ba} \pmod{p}$
between two parties, Alice & Bob


These computations are also modular exponentiations, requiring modular
addition, subtraction, and multiplication operations
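
A minimal sketch of the exchange follows. The prime, generator, and secret exponents are toy values assumed for illustration only, and the variable names for the two derived keys are hypothetical.

```python
# Toy Diffie-Hellman exchange; p, g, a, b are illustrative assumptions.
p, g = 23, 5
a, b = 4, 3                  # secret exponents of Alice and Bob

R_A = pow(g, a, p)           # Alice sends g^a mod p
R_B = pow(g, b, p)           # Bob sends g^b mod p

key_bob = pow(R_A, b, p)     # Bob computes (g^a)^b mod p
key_alice = pow(R_B, a, p)   # Alice computes (g^b)^a mod p

assert key_alice == key_bob  # both sides hold g^(ab) mod p
```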


NIST Digital Signature Algorithm


The signature computation on M and k is the pair (r, s)
$r := (g^k \bmod p) \bmod q$
$s := (M + x r) \, k^{-1} \bmod q$
The signature verification
$w := s^{-1} \bmod q$
$u_1 := M w \bmod q$
$u_2 := r w \bmod q$
$v := (g^{u_1} y^{u_2} \bmod p) \bmod q$
Check if $r = v$
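
The signing and verification steps can be sketched in Python as below. The parameters are toy values assumed for illustration (q divides p - 1 and g has order q); x is the private key, y = g^x mod p the public key, and M stands in for the message hash, as on the slide.

```python
# Toy DSA; all parameters are illustrative assumptions, not secure sizes.
p, q = 23, 11
g = 4                        # element of order q modulo p
x = 7                        # private key
y = pow(g, x, p)             # public key

def sign(M, k):
    r = pow(g, k, p) % q
    s = (M + x * r) * pow(k, -1, q) % q
    return r, s

def verify(M, r, s):
    w = pow(s, -1, q)
    u1 = M * w % q
    u2 = r * w % q
    v = (pow(g, u1, p) * pow(y, u2, p) % p) % q
    return v == r

r, s = sign(5, k=3)
assert verify(5, r, s)
```

A real implementation would also reject r = 0 or s = 0 and draw a fresh random k for every signature.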

Elliptic Curve Cryptography

Elliptic curves defined over $GF(p)$ or $GF(2^k)$ are used in cryptography
The arithmetic of $GF(p)$ is the usual mod p arithmetic
The arithmetic of $GF(2^k)$ is similar to that of $GF(p)$; however, there
are some differences
Elliptic curves over $GF(2^k)$ are more popular due to the space- and
time-efficient algorithms for doing arithmetic in $GF(2^k)$
Elliptic curve cryptosystems based on discrete logarithms seem to provide
a similar amount of security to that of RSA, but with relatively shorter key
sizes


Computations of Cryptographic Functions


It is interesting to note that all public-key cryptographic algorithms are
based on number-theoretic and algebraic finite structures, such as groups,
rings, and fields
In fact, most of them need modular arithmetic, i.e., the arithmetic of
integers in finite rings or fields
The challenge is however that the sizes of operands are large, starting
from about 160 bits up to 16,000 bits
Therefore, the algorithmic development of cryptographic hardware design
is essentially based on (exact) computer arithmetic with very large
integers
Since exponentiations & multiplications are the most time-, energy-, and
space-consuming computations, we will study only those in this talk


Computing Exponentiations
Given the integer e, the computation of $M^e$ or $eP$ is an exponentiation
operation
The objective is to use as few multiplications (or elliptic curve additions)
as possible for a given integer e
This problem is related to addition chains
An addition chain yields an algorithm for computing $M^e$ or $eP$ given the
integer e
$M^1 \to M^2 \to M^3 \to M^5 \to M^{10} \to M^{11} \to M^{22} \to M^{44} \to M^{55}$
$P \to 2P \to 3P \to 5P \to 10P \to 11P \to 22P \to 44P \to 55P$
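
The chain for e = 55 shown above can be checked and executed with a short sketch. The helper names are hypothetical and the modulus 1337 is an arbitrary assumption; each chain element after the first costs exactly one modular multiplication.

```python
def is_addition_chain(chain):
    """Every element after the leading 1 must be a sum of two earlier elements."""
    if not chain or chain[0] != 1:
        return False
    for idx in range(1, len(chain)):
        earlier = chain[:idx]
        if not any(chain[idx] - t in earlier for t in earlier):
            return False
    return True

def pow_by_chain(M, chain, n):
    """Compute M^chain[-1] mod n using one multiplication per chain step."""
    powers = {1: M % n}
    mults = 0
    for idx in range(1, len(chain)):
        earlier = chain[:idx]
        t = next(t for t in earlier if chain[idx] - t in earlier)
        powers[chain[idx]] = powers[t] * powers[chain[idx] - t] % n
        mults += 1
    return powers[chain[-1]], mults

chain = [1, 2, 3, 5, 10, 11, 22, 44, 55]
value, mults = pow_by_chain(7, chain, 1337)
assert is_addition_chain(chain) and value == pow(7, 55, 1337)
```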


Computing Exponentiations
Finding a shortest addition chain is a hard problem (NP-completeness is
proven for the variant with a set of exponents)
Lower bound: $\log_2 e + \log_2 H(e) - 2.13$ (Schönhage)
Upper bound: $\lfloor \log_2 e \rfloor + H(e) - 1$, where H(e) is the Hamming weight
of e (the binary method, the SX method, Knuth)
It turns out that the oldest known algorithm for computing exponentiation is
not far in efficiency from the best algorithm
Heuristics such as the m-ary, adaptive m-ary, sliding window, and power tree
methods offer only slight improvements
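
The upper bound can be observed directly with a sketch of the left-to-right binary method that counts its multiplications. The function name and test values are assumptions for illustration.

```python
def binary_exp(M, e, n):
    """Left-to-right binary method; returns (M^e mod n, multiplication count)."""
    result = M % n               # accounts for the leading 1 bit of e
    mults = 0
    for bit in bin(e)[3:]:       # remaining bits after the leading 1
        result = result * result % n      # square for every bit
        mults += 1
        if bit == '1':
            result = result * M % n       # multiply for every 1 bit
            mults += 1
    return result, mults

e = 55
value, mults = binary_exp(7, e, 1337)
hamming_weight = bin(e).count('1')
assert value == pow(7, e, 1337)
assert mults == (e.bit_length() - 1) + hamming_weight - 1
```

For e = 55 the binary method uses 9 multiplications, one more than an 8-step addition chain such as 1, 2, 3, 5, 10, 11, 22, 44, 55.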


Computing Modular Multiplication - Naive Algorithms


Given a, b < n, compute $P = a \cdot b \bmod n$
Multiply and reduce:
Multiply: $P = a \cdot b$ (2k-bit number)
Reduce: $P = P \bmod n$ (k-bit number)
Reductions are essentially integer divisions
The multiply and reduce steps can be interleaved, but this offers only
slight improvements


Interleaved Multiply & Reduce - Naive Algorithms

$P = a \cdot b = a \cdot \sum_{i=0}^{k-1} b_i 2^i = \sum_{i=0}^{k-1} (a \cdot b_i) 2^i$
$= 2(\cdots 2(2(0 + a \cdot b_{k-1}) + a \cdot b_{k-2}) + \cdots) + a \cdot b_0$

1. P := 0
2. for i = k - 1 downto 0
2a.   P := 2P + a · b_i
2b.   P := P mod n
3. return P

Unfortunately, Step 2b is highly time consuming (a full division for every
bit of the operands)
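
The interleaved algorithm translates almost line for line into Python. This is a sketch with an assumed function name; the `%` operator plays the role of the costly division in Step 2b.

```python
def interleaved_modmul(a, b, n, k):
    """Compute a*b mod n by scanning the k bits of b from the most significant."""
    P = 0
    for i in range(k - 1, -1, -1):
        b_i = (b >> i) & 1
        P = 2 * P + a * b_i      # Step 2a: shift and conditional add
        P = P % n                # Step 2b: reduction (a division per bit)
    return P

assert interleaved_modmul(859, 859, 1337, 11) == (859 * 859) % 1337
```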


Montgomery Multiplication - 1985


Attempts to create good hardware to compute the RSA functions (sign,
verify, encrypt, decrypt) in acceptable time essentially failed because
of the excessive requirements of the naive algorithms
This includes Rivest's hardware proposal and all other implementations
until the Montgomery multiplication algorithm came about
Peter Montgomery discovered a method to replace Step 2b with a step
similar to Step 2a: an addition instead of a division
It is brilliant and efficient
Montgomery's algorithm changed cryptographic design in much the same way
that the FFT algorithm changed digital signal processing


Montgomery Multiplication
Montgomery's method maps the integers {0, 1, 2, . . . , n - 1} to the same
set with the map $\bar{x} = x \cdot r \pmod{n}$ using the integer $r = 2^k$
It then works in this set (numbers with the bar sign) and performs the
multiplication
$MonPro(\bar{a}, \bar{b}) = \bar{a} \cdot \bar{b} \cdot 2^{-k} \pmod{n}$
The above operation turns out to be significantly simpler than the
standard modular multiplication $a \cdot b \pmod{n}$ because the division by
n in Step 2b (reduction) is avoided
Transformation to and back from the bar domain is also quite easily
done, i.e., $\bar{x} = MonPro(x, r^2)$ and $x = MonPro(\bar{x}, 1)$


Montgomery Multiplication
In order to compute $u = MonPro(a, b) = a \cdot b \cdot 2^{-k} \pmod{n}$, we use
the steps below
1. u := 0
2. for i = 0 to k - 1
2a.   u := u + a_i · b
2b.   if u_0 is 1 then u := u + n
2c.   u := u/2
3. if u >= n then u := u - n
Now, Step 2b is only an addition!
And it is done only about half of the time!
We remain in the Montgomery (bar) domain of integers until the final
step of the exponentiation, and then use the conversion routine to go
back to the ordinary (no-bar) domain
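
A bit-serial sketch of MonPro and of an exponentiation that stays in the bar domain is given below. It assumes an odd modulus n < 2^k and operands below n; the function names and the final conditional subtraction are choices made for this illustration.

```python
def mon_pro(a, b, n, k):
    """Return a*b*2^(-k) mod n for odd n < 2^k and a, b < n."""
    u = 0
    for i in range(k):
        u += ((a >> i) & 1) * b  # add b when bit a_i is set
        if u & 1:                # make u even by adding the odd modulus
            u += n
        u >>= 1                  # exact division by 2
    return u - n if u >= n else u

def mont_exp(M, e, n, k):
    """Compute M^e mod n, converting to and from the bar domain once."""
    r2 = pow(2, 2 * k, n)                # r^2 mod n, precomputed
    M_bar = mon_pro(M, r2, n, k)         # M * r mod n
    x_bar = mon_pro(1, r2, n, k)         # 1 * r mod n
    for bit in bin(e)[2:]:
        x_bar = mon_pro(x_bar, x_bar, n, k)
        if bit == '1':
            x_bar = mon_pro(x_bar, M_bar, n, k)
    return mon_pro(x_bar, 1, n, k)       # leave the bar domain

assert mont_exp(859, 17, 1337, 11) == pow(859, 17, 1337)
```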


Karatsuba-Ofman Multiplication
Algorithms textbooks offer a few asymptotically faster multiplication
algorithms: Karatsuba-Ofman, Toom-Cook, Winograd, and DFT-based
algorithms
These algorithms are all good: they help you to multiply faster
But they are no help in modular multiplication, i.e., they do not
multiply-and-reduce (Montgomery's method is special)
They also have large overhead, and start being faster only after a few
thousand bits
However, there have been significant algorithmic developments to bring
down their break-even point to a few hundred bits
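
For reference, the basic Karatsuba-Ofman recursion (three half-size products instead of four) can be sketched as follows. This is the plain textbook form, not one of the advanced variants the slide refers to; the cutoff value is an arbitrary assumption.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers using three recursive half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y             # small operands: use the native multiply
    h = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask    # x = x1 * 2^h + x0
    y1, y0 = y >> h, y & mask    # y = y1 * 2^h + y0
    z2 = karatsuba(x1, y1, cutoff_bits)
    z0 = karatsuba(x0, y0, cutoff_bits)
    z1 = karatsuba(x1 + x0, y1 + y0, cutoff_bits) - z2 - z0
    return (z2 << (2 * h)) + (z1 << h) + z0

assert karatsuba(3**200, 7**150) == 3**200 * 7**150
```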


Advanced Montgomery Multiplication


On the other hand, Montgomery algorithms have also improved
They can be made to fit specific architectures by changing the way
they scan the bits of the multiplicand, the multiplier, and the product
Separated Operand Scanning (SOS): First computes $t = a \cdot b$ and then
interleaves the computations of $m = t \cdot n' \bmod r$ and $u = (t + m \cdot n)/r$,
where $n' = -n^{-1} \bmod r$. Squaring can be optimized.
SOS requires 2s + 2 words of space, where s is the number of words per operand
Finely Integrated Product Scanning (FIPS): Interleaves the computation of
$a \cdot b$ and $m \cdot n$ by scanning the words of m
It uses the same space to keep m and u, reducing the temporary space
to s + 3 words


Advanced Montgomery Multiplication


Finely Integrated Operand Scanning (FIOS): The computation of $a \cdot b$
and $m \cdot n$ is performed in a single loop
FIOS also requires s + 3 words of space
Coarsely Integrated Hybrid Scanning (CIHS): The computation of $a \cdot b$ is
split into 2 loops, and the second loop is interleaved with the computation
of $m \cdot n$
CIHS also requires s + 3 words of space
Coarsely Integrated Operand Scanning (CIOS): Improves the SOS method
by integrating the multiplication and reduction steps. It alternates
between iterations of the outer loops for multiplication and reduction
CIOS also requires s + 3 words of space
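
To make the word-level scanning concrete, here is a sketch of the CIOS variant using Python lists as word arrays. The 16-bit word size, the function name, and the test modulus are assumptions for illustration; n must be odd and fit in s words.

```python
def cios_mon_pro(a, b, n, s, w=16):
    """Return a*b*2^(-s*w) mod n, alternating multiply and reduce per word of b."""
    W = 1 << w
    mask = W - 1
    split = lambda v: [(v >> (w * j)) & mask for j in range(s)]
    A, B, N = split(a), split(b), split(n)
    n0_prime = (-pow(n, -1, W)) % W      # -n^(-1) mod 2^w (n must be odd)
    t = [0] * (s + 2)
    for i in range(s):
        carry = 0
        for j in range(s):               # t := t + a * b_i
            cs = t[j] + A[j] * B[i] + carry
            t[j], carry = cs & mask, cs >> w
        cs = t[s] + carry
        t[s], t[s + 1] = cs & mask, cs >> w
        m = (t[0] * n0_prime) & mask     # makes the lowest word of t + m*n zero
        carry = (t[0] + m * N[0]) >> w
        for j in range(1, s):            # t := (t + m * n) / 2^w
            cs = t[j] + m * N[j] + carry
            t[j - 1], carry = cs & mask, cs >> w
        cs = t[s] + carry
        t[s - 1] = cs & mask
        t[s] = t[s + 1] + (cs >> w)
    u = sum(t[j] << (w * j) for j in range(s + 1))
    return u - n if u >= n else u

n = (1 << 61) - 1
a, b = 123456789123456789, 987654321987654321
assert cios_mon_pro(a, b, n, s=4) == a * b * pow(2, -64, n) % n
```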


Advanced Montgomery Multiplication


All methods require $2s^2 + s$ multiplications
Add, Read/Write, and Space requirements are below

Method   Add              Read/Write            Space
SOS      4s^2 + 4s + 2    8s^2 + 13s + 5        2s + 2
FIPS     6s^2 + 2s + 2    14s^2 + 16s + 3       s + 3
FIOS     5s^2 + 3s + 2    10s^2 + 9s + 3        s + 3
CIHS     4s^2 + 4s + 2    9.5s^2 + 11.5s + 3    s + 3
CIOS     4s^2 + 4s + 2    8s^2 + 12s + 3        s + 3

Depending on the availability of functional units (multipliers, adders,
registers), one method can outperform another and thus should be
selected


Montgomery Multiplication in $GF(2^k)$

It turns out that the Montgomery multiplication can also be performed
in the finite field $GF(2^k)$ if the polynomial basis representations of the
field elements are employed
It imitates the Montgomery multiplication in $GF(p)$ by taking as
the modulus the irreducible polynomial p(x) generating the field of $2^k$
elements
It is not as fast as arithmetic in the normal basis, but it has some advantages


Montgomery Multiplication in $GF(2^k)$

In order to compute
$u(x) = MonPro(a(x), b(x)) = a(x) \cdot b(x) \cdot x^{-k} \bmod p(x)$,
we use the steps below
1. u(x) := 0
2. for i = 0 to k - 1
2a.   u(x) := u(x) + a_i · b(x) mod 2
2b.   if u_0 is 1 then u(x) := u(x) + p(x) mod 2
2c.   u(x) := u(x)/x
Now Steps 2a and 2b use mod 2 additions (XOR gates)
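
The same loop can be sketched in Python with polynomials over GF(2) packed into integers (bit i holds the coefficient of x^i). The field polynomial x^8 + x^4 + x^3 + x + 1 and the helper `gf2_mulmod`, used only as a reference for checking, are assumptions for this illustration.

```python
def mon_pro_gf2(a, b, p, k):
    """Return a(x)*b(x)*x^(-k) mod p(x); p has degree k and constant term 1."""
    u = 0
    for i in range(k):
        if (a >> i) & 1:
            u ^= b               # Step 2a: addition mod 2 is XOR
        if u & 1:
            u ^= p               # Step 2b: clear the constant term
        u >>= 1                  # Step 2c: divide by x
    return u

def gf2_mulmod(a, b, p, k):
    """Reference shift-and-add multiplication a(x)*b(x) mod p(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= p
    return r

p, k = 0x11B, 8
x_k = (1 << k) ^ p               # x^k mod p(x)
u = mon_pro_gf2(0x57, 0x83, p, k)
# Multiplying the Montgomery product by x^k recovers the ordinary product.
assert gf2_mulmod(u, x_k, p, k) == gf2_mulmod(0x57, 0x83, p, k)
```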


Unified Arithmetic
One advantage of the Montgomery multiplication in $GF(2^k)$ is that a
single arithmetic unit can be used to handle both kinds of fields, $GF(p)$
and $GF(2^k)$. This is called unified arithmetic (or dual-field arithmetic)
Advantages of the unified arithmetic are low manufacturing cost,
compatibility, parallelism, and scalability
Furthermore, unified arithmetic is impartial: it does not favor one prime
against another or one irreducible polynomial against another
The building block of the unified architecture is the unified full adder: a
1-bit adder that handles both $GF(p)$ and $GF(2^k)$
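
One way to picture such a 1-bit cell is a full adder whose carry output is gated by a field-select signal. The sketch below is an assumed behavioral model, not the circuit from the talk: the polarity (FSEL = 1 for GF(p), FSEL = 0 for GF(2^k)) and the function names are hypothetical.

```python
def unified_full_adder(a, b, c, fsel):
    """1-bit adder: ordinary full adder when fsel=1, carry-free XOR when fsel=0."""
    s = a ^ b ^ c
    cout = fsel & ((a & b) | (a & c) | (b & c))   # carry suppressed in GF(2^k) mode
    return s, cout

def unified_add(x, y, width, fsel):
    """Ripple a chain of unified full adders over two width-bit operands."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = unified_full_adder((x >> i) & 1, (y >> i) & 1, carry, fsel)
        result |= s << i
    return result

assert unified_add(5, 3, 4, fsel=1) == 5 + 3      # integer addition
assert unified_add(5, 3, 4, fsel=0) == 5 ^ 3      # polynomial addition over GF(2)
```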


Unified Full Adder

[Figure: unified full adder with inputs a, b, c, FSEL and outputs S, Cout. (a) Universal Adder; (b) Synthesized circuit by Mentor]


Scalability
Scalability is an important concept: it allows small changes to be made
in the hardware to handle larger operands without a complete redesign
(such as switching from 1024-bit RSA keys to 1536-bit RSA keys)
[Figure: chain of processing elements PE 1, PE 2, PE 3, ..., PE k, with a buffer]


Dependency Graph of Montgomery Multiplication


[Figure: dependency graph; edge labels a, b, c, p with indices (0) through (e+1)]

Pipelined Montgomery Multiplication

An example of pipeline computation for 7-bit operands, where w = 1


Pipelined Architecture with Fewer Units

Pipeline stalls when fewer processing units are available
m = 7, w = 1, k = 3


General Pipelined Architecture


[Figure: pipeline of processing units (PU) with registers Reg-a, Reg-b, Reg-c, and Reg-p]

Spectral Arithmetic
We use FFT-based arithmetic to implement modular multiplication
However, we are interested in performing the reduction inside the spectral
(frequency) domain
We utilize finite ring and field arithmetic (avoid real or complex arithmetic
because of the roundoff errors in using floating-point or fixed-point
arithmetic)
We also want to bring down the break-even point of efficiency for
FFT-based multiplication
Furthermore, we utilize the properties of the DFT and Montgomery
algorithm to perform modular multiplication


Spectral Arithmetic

[Figure: block diagram with the boxes DFT, Convolution, Modular Reduction, and DFT Inverse, together forming Modular Multiplication]

DFT over a Finite Ring: Definition


Let $\omega$ be a primitive d-th root of unity in $Z_q$, and let x(t) and X(t) be
polynomials of degree d - 1 having entries in $Z_q$. The DFT map over $Z_q$ is
an invertible set map sending x(t) to X(t) given by
$X_i = DFT_d(x(t)) := \sum_{j=0}^{d-1} x_j \omega^{ij} \bmod q,$
with the inverse
$x_i = IDFT_d(X(t)) := d^{-1} \cdot \sum_{j=0}^{d-1} X_j \omega^{-ij} \bmod q,$
for i, j = 0, 1, . . . , d - 1.
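
The two formulas can be evaluated directly, as in the quadratic-time sketch below. The parameters q = 17, d = 8, and omega = 2 (an element of order 8 modulo 17) and the function names are assumptions chosen for illustration.

```python
def dft(x, omega, q):
    """X_i = sum_j x_j * omega^(i*j) mod q, evaluated directly."""
    d = len(x)
    return [sum(x[j] * pow(omega, i * j, q) for j in range(d)) % q
            for i in range(d)]

def idft(X, omega, q):
    """x_i = d^(-1) * sum_j X_j * omega^(-i*j) mod q."""
    d = len(X)
    d_inv = pow(d, -1, q)
    omega_inv = pow(omega, -1, q)
    return [d_inv * sum(X[j] * pow(omega_inv, i * j, q) for j in range(d)) % q
            for i in range(d)]

q, omega = 17, 2
x = [3, 1, 4, 1, 5, 9, 2, 6]     # time polynomial, d = 8 coefficients in Z_17
X = dft(x, omega, q)
assert idft(X, omega, q) == x    # the map is invertible
```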


DFT over a Finite Ring: Existence


We write
$x(t) \overset{DFT}{\longleftrightarrow} X(t)$
and say x(t) and X(t) are transform pairs; x(t) is called a time polynomial
and X(t) is sometimes called the spectrum of x(t).
(Convention) In the literature, the DFT over a finite ring is also
called the Number Theoretic Transform (NTT)
(Existence) In order to have a DFT map over $Z_q$:
The multiplicative inverse of the DFT length d must exist in $Z_q$, which
requires that gcd(d, q) = 1.
d has to divide p - 1 for every prime divisor p of q
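
The two listed conditions are easy to test for small parameters. The sketch below uses trial division to find the prime divisors of q; the function names and sample values are assumptions for illustration.

```python
from math import gcd

def prime_divisors(q):
    """Set of prime divisors of q, found by trial division (small q only)."""
    primes, f = set(), 2
    while f * f <= q:
        while q % f == 0:
            primes.add(f)
            q //= f
        f += 1
    if q > 1:
        primes.add(q)
    return primes

def dft_conditions_hold(d, q):
    """Check gcd(d, q) = 1 and d | (p - 1) for every prime divisor p of q."""
    return gcd(d, q) == 1 and all((p - 1) % d == 0 for p in prime_divisors(q))

assert dft_conditions_hold(8, 17)        # 8 divides 17 - 1
assert not dft_conditions_hold(4, 15)    # 4 does not divide 3 - 1
```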


DFT over a Finite Ring: Efficiency


In order to have simple arithmetic
q should be chosen as
a Mersenne number $q = 2^v - 1$, or
a Fermat number $q = 2^v + 1$
The principal root of unity should be selected as a power of 2 to
simplify the multiplications with roots of unity
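
The benefit of these choices can be illustrated with a short sketch: 2 has order v modulo a Mersenne number and order 2v modulo a Fermat number, and multiplying by a power of 2 modulo $2^v - 1$ reduces to a bit rotation. The function names and the sizes v = 7 and v = 8 are assumptions for illustration.

```python
def multiplicative_order(a, q):
    """Smallest k > 0 with a^k = 1 mod q (requires gcd(a, q) = 1)."""
    k, x = 1, a % q
    while x != 1:
        x = x * a % q
        k += 1
    return k

def mul_pow2_mersenne(x, s, v):
    """x * 2^s mod (2^v - 1) for 0 <= x < 2^v - 1, as a v-bit left rotation."""
    q = (1 << v) - 1
    s %= v
    return ((x << s) | (x >> (v - s))) & q

assert multiplicative_order(2, 2**7 - 1) == 7     # Mersenne: order v
assert multiplicative_order(2, 2**8 + 1) == 16    # Fermat: order 2v
assert mul_pow2_mersenne(100, 3, 7) == 100 * 2**3 % 127
```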


Properties of DFT
Under certain conditions, the Fourier transform preserves some properties
of the time sequences, e.g., linearity and convolution.
The existence conditions of these properties differ when working in finite
ring spectra
Let $\circ$ and $\star$ be operations on the time and spectral domains, respectively.
We write
$\circ \overset{DFT}{\longleftrightarrow} \star$
and say $\circ$ and $\star$ are transform pairs on x(t); we sometimes say that
the map $DFT_d$ respects the operation $\circ$ at the point x(t) if the following
equation is satisfied
$\circ(x(t)) = IDFT_d(\star(DFT_d(x(t))))$


Time-Frequency Dictionary
Time and frequency shifts correspond to circular shifts
Let
$x(t) = x_0 + x_1 t + \ldots + x_{d-1} t^{d-1}$
and
$X(t) = X_0 + X_1 t + \ldots + X_{d-1} t^{d-1}$
be a transform pair.
The one-term right circular shift of x(t) is defined as
$x_1 + x_2 t + \ldots + x_{d-1} t^{d-2} + x_0 t^{d-1}$
and its DFT is
$X(t) \odot \Gamma(t)$
where $\odot$ stands for component-wise multiplication and
$\Gamma(t) = 1 + \omega^{-1} t + \ldots + \omega^{-(d-1)} t^{d-1}$
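
The shift property can be checked numerically. The sketch assumes the DFT convention $X_i = \sum_j x_j \omega^{ij} \bmod q$ with the toy parameters q = 17, d = 8, omega = 2, under which the multiplier for coefficient i is $\omega^{-i}$; all names are chosen for this illustration.

```python
def dft(x, omega, q):
    """X_i = sum_j x_j * omega^(i*j) mod q."""
    d = len(x)
    return [sum(x[j] * pow(omega, i * j, q) for j in range(d)) % q
            for i in range(d)]

q, omega = 17, 2
x = [3, 1, 4, 1, 5, 9, 2, 6]
X = dft(x, omega, q)

shifted = x[1:] + x[:1]          # x1 + x2 t + ... + x0 t^(d-1)
omega_inv = pow(omega, -1, q)
gamma = [pow(omega_inv, i, q) for i in range(len(x))]

# Component-wise product of the spectrum with gamma equals the DFT of the shift.
assert dft(shifted, omega, q) == [X[i] * gamma[i] % q for i in range(len(x))]
```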


Time-Frequency Dictionary
Sum of sequence and first value: The sum of the coefficients of a time
polynomial equals the zeroth coefficient of its spectral polynomial.
Conversely, the sum of the spectrum coefficients multiplied by $d^{-1}$ equals
the zeroth coefficient of the time polynomial
$x_0 = d^{-1} \cdot \sum_{i=0}^{d-1} X_i$ and $X_0 = \sum_{i=0}^{d-1} x_i$
[Figure: $(x_0, x_1, \ldots, x_{d-1}) \overset{DFT}{\longleftrightarrow} (X_0, X_1, \ldots, X_{d-1})$; the sum of the $x_i$ equals $X_0$, and the sum of the $X_i$ multiplied by $d^{-1}$ equals $x_0$]


Time-Frequency Dictionary
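Left and right logical shifts: By using the previous properties, it is
possible to perform logical left and right digit shifts as follows:
$(x(t) - x_0)/t = x_1 + \ldots + x_{d-1} t^{d-2}$
has the DFT
$(X(t) - x_0(t)) \odot \Gamma(t)$
where $\odot$ is component-wise multiplication,
$\Gamma(t) = 1 + \omega^{-1} t + \ldots + \omega^{-(d-1)} t^{d-1}$, and
$x_0(t) = x_0 + x_0 t + x_0 t^2 + \ldots + x_0 t^{d-1}$
Shifts in the opposite direction are similar, where one then uses the polynomial
$\Omega(t) = 1 + \omega t + \ldots + \omega^{d-1} t^{d-1}$
instead of $\Gamma(t)$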
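
Both the sum property and the logical shift can be verified with a few lines. The sketch again assumes the convention $X_i = \sum_j x_j \omega^{ij} \bmod q$ and the toy parameters q = 17, d = 8, omega = 2; the names are hypothetical.

```python
def dft(x, omega, q):
    """X_i = sum_j x_j * omega^(i*j) mod q."""
    d = len(x)
    return [sum(x[j] * pow(omega, i * j, q) for j in range(d)) % q
            for i in range(d)]

q, omega = 17, 2
x = [3, 1, 4, 1, 5, 9, 2, 6]
d = len(x)
X = dft(x, omega, q)

# Sum of sequence and first value.
assert X[0] == sum(x) % q
assert x[0] == pow(d, -1, q) * sum(X) % q

# Logical shift: drop x0, move every digit down one place, append a zero.
logical = x[1:] + [0]
omega_inv = pow(omega, -1, q)
spectral = [(X[i] - x[0]) * pow(omega_inv, i, q) % q for i in range(d)]
assert dft(logical, omega, q) == spectral
```

The point of the spectral form is that x0 is available from the spectrum alone, so the shift needs no inverse transform.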


A Time Simulation for Spectral Modular Multiplication


We would like to compute $859^2 \cdot 4^{-9} \pmod{1337}$.
Signal $x(t) = 3 + 2t + t^2 + t^3 + 3t^4$ representing 859 = x(4) in base 4.
[Figure: bar plot of the digits of x(t)]

A Time Simulation for SMP


Convolving x(t) with itself, we find $x^2(t)$, with $x^2(4) = 859^2 = 737881$.
[Figure: bar plot of the coefficients of $x^2(t) = 9 + 12t + 10t^2 + 10t^3 + 23t^4 + 14t^5 + 7t^6 + 6t^7 + 9t^8$]

A Time Simulation for SMP


The modulus m = 1337 is represented as $m(t) = 1 + 2t + 3t^2 + t^4 + t^5$. We
add $3m(t)$ to the sum to annihilate the least significant b bits of the least significant digit.
[Figure: bar plot of the coefficients after adding 3m(t): 12, 18, 19, 10, 26, 17, 7, 6, 9]

A Time Simulation for SMP


Carry goes to the next digit.
[Figure: bar plot of the coefficients; the carry 3 from the eliminated coefficient is added to the next digit, giving 21]

A Time Simulation for SMP


We then shift the digits.
[Figure: bar plot of the coefficients after the shift]

A Time Simulation for SMP


After 9 iterations, we find the result: $914 \equiv 859^2 \cdot 4^{-9} \pmod{1337}$.
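
The whole walk-through can be replayed with a time-domain sketch: convolve the digit vectors, then repeat "add a multiple of the modulus, pass the carry, shift" nine times. The function names are assumptions, and the code works on plain coefficient lists rather than in the spectral domain.

```python
def digits(v, base):
    """Base-`base` digits of v, least significant first."""
    out = []
    while v:
        out.append(v % base)
        v //= base
    return out or [0]

def time_simulation(x, n, base, steps):
    """Return x^2 * base^(-steps) mod n using digit-wise Montgomery steps."""
    xd, nd = digits(x, base), digits(n, base)
    acc = [0] * (max(2 * len(xd) - 1, len(nd)) + 1)
    for i, a in enumerate(xd):           # convolution of x(t) with itself
        for j, b in enumerate(xd):
            acc[i + j] += a * b
    n0_inv = pow(nd[0], -1, base)
    for _ in range(steps):
        m = (-acc[0] * n0_inv) % base    # multiple of the modulus to add
        for j, c in enumerate(nd):
            acc[j] += m * c              # lowest coefficient becomes 0 mod base
        carry = acc[0] // base           # carry goes to the next digit
        acc = acc[1:] + [0]              # shift the digits
        acc[0] += carry
    return sum(c * base**i for i, c in enumerate(acc)) % n

assert time_simulation(859, 1337, 4, 9) == 914
```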


Unending Quest for Efficiency


Conclusions?
Challenges remain: Make faster but low-area and low-energy hardware
for cryptography
Platforms are diverse: Huge SSL and IPSec boxes versus tiny Bluetooth
earphones, cellphones and PDAs
New challenges: We need to build countermeasures to thwart attacks by
adversaries who try to obtain hardware-hidden secrets
Questions?
Email: [email protected]
