25 Years of Cryptographic Hardware Design: City University of Istanbul & University of California Santa Barbara
Çetin Kaya Koç
City University of Istanbul &
University of California Santa Barbara
[email protected]
http://cryptocode.net
25 Años de la Computación en el CINVESTAV (25 Years of Computing at CINVESTAV)
Essential Milestones
This talk gives a brief summary of advanced algorithms for building better
hardware realizations of public-key cryptographic algorithms: Diffie-Hellman, RSA, and elliptic curve cryptography
Essential milestones:
RSA Computation
The RSA algorithm uses modular exponentiation for encryption
$C := M^e \pmod{n}$
and decryption
$M := C^d \pmod{n}$
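As a small illustration of the two exponentiations above, here is a toy sketch in Python. The primes, exponent, and message are textbook-sized values chosen only for illustration (they are not from the talk and are far too small to be secure), and the helper variable names are hypothetical.

```python
# Toy RSA round trip; all numeric values are illustrative assumptions.
p, q = 61, 53
n = p * q                    # modulus
phi = (p - 1) * (q - 1)      # Euler totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

M = 65                       # message, 0 <= M < n
C = pow(M, e, n)             # encryption: C := M^e (mod n)
M_back = pow(C, d, n)        # decryption: M := C^d (mod n)
print(M_back == M)
```

Both directions cost one modular exponentiation, which is why the rest of the talk concentrates on making that operation cheap.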
Diffie-Hellman Computation
Similarly, the Diffie-Hellman key exchange algorithm executes the steps
$R_A := g^a \pmod{p}$
$R_B := g^b \pmod{p}$
$R_B := R_A^b = g^{ab} \pmod{p}$
$R_A := R_B^a = g^{ba} \pmod{p}$
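The exchange can be sketched in a few lines of Python. The parameters p = 23 and g = 5 are toy assumptions, and the names K_A and K_B for the two derived keys are hypothetical labels used here to keep the public values and the shared secret apart.

```python
import secrets

# Toy Diffie-Hellman exchange; p and g are illustrative, not secure.
p, g = 23, 5
a = secrets.randbelow(p - 2) + 1   # party A's secret exponent, 1 <= a <= p-2
b = secrets.randbelow(p - 2) + 1   # party B's secret exponent

R_A = pow(g, a, p)                 # A sends g^a mod p
R_B = pow(g, b, p)                 # B sends g^b mod p

K_A = pow(R_B, a, p)               # A computes (g^b)^a mod p
K_B = pow(R_A, b, p)               # B computes (g^a)^b mod p
print(K_A == K_B)
```

Each party performs two modular exponentiations, so the cost profile is the same as for RSA.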
Computing Exponentiations
Given the integer e, the computation of $M^e$ or $eP$ is an exponentiation
operation
The objective is to use as few multiplications (or elliptic curve additions)
as possible for a given integer e
This problem is related to addition chains
An addition chain yields an algorithm for computing $M^e$ or $eP$ given the
integer e
$M^1 \; M^2 \; M^3 \; M^5 \; M^{10} \; M^{11} \; M^{22} \; M^{44} \; M^{55}$
$P \; 2P \; 3P \; 5P \; 10P \; 11P \; 22P \; 44P \; 55P$
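The chain above for e = 55 can be turned into code directly: every new power is the product of two powers already computed. The following is a minimal sketch; the function name `power_by_chain` and the test modulus are hypothetical choices.

```python
def power_by_chain(M, chain, n):
    """Compute M^chain[-1] mod n by following an addition chain.

    chain must start at 1, and every later element must be the sum
    of two earlier elements (possibly the same one twice).
    """
    powers = {1: M % n}
    mults = 0
    for idx in range(1, len(chain)):
        e = chain[idx]
        # Find an earlier element x such that e - x was also computed.
        x = next(x for x in chain[:idx] if (e - x) in powers)
        powers[e] = powers[x] * powers[e - x] % n
        mults += 1
    return powers[chain[-1]], mults

chain = [1, 2, 3, 5, 10, 11, 22, 44, 55]
value, mults = power_by_chain(7, chain, 1009)
print(value == pow(7, 55, 1009), mults)   # 8 multiplications
```

The chain needs 8 multiplications, one fewer than the 9 required by the binary method for e = 55.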
Computing Exponentiations
Finding the shortest addition chain is an NP-complete problem
Lower bound: $\log_2 e + \log_2 H(e) - 2.13$ (Schönhage)
Upper bound: $\lfloor \log_2 e \rfloor + H(e) - 1$, where $H(e)$ is the Hamming weight
of e (the binary method, the SX method, Knuth)
It turns out the oldest known algorithm for computing exponentiation is
not far in efficiency from the best algorithm
Heuristics, m-ary, adaptive m-ary, sliding windows, power tree methods
offer only slight improvements
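The upper bound can be checked with a left-to-right binary (square-and-multiply) sketch that counts its multiplications. The function name and test values are hypothetical.

```python
def binary_exp(M, e, n):
    """Left-to-right binary method; returns (M^e mod n, multiplication count)."""
    assert e >= 1
    result = M % n
    mults = 0
    for bit in bin(e)[3:]:               # bits of e after the leading 1
        result = result * result % n     # one squaring per remaining bit
        mults += 1
        if bit == "1":
            result = result * M % n      # one multiplication per 1 bit
            mults += 1
    return result, mults

e = 55
value, mults = binary_exp(7, e, 1009)
bound = (e.bit_length() - 1) + bin(e).count("1") - 1
print(value == pow(7, e, 1009), mults, bound)
```

Here `e.bit_length() - 1` is the floor of log2(e) and `bin(e).count("1")` is the Hamming weight H(e), so the count always equals the stated upper bound.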
$P = a \cdot b = a \cdot \sum_{i=0}^{k-1} b_i 2^i = \sum_{i=0}^{k-1} (a \cdot b_i) 2^i$
P := 0
for i = k - 1 downto 0
    P := 2P + a · b_i
    P := P mod n
return P
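The loop above translates almost line for line into Python. This is a sketch under the assumption that b fits in k bits; the function name is hypothetical.

```python
def interleaved_modmul(a, b, n, k):
    """Compute a*b mod n by scanning the k bits of b from the top,
    reducing the partial product in every iteration."""
    P = 0
    for i in range(k - 1, -1, -1):
        b_i = (b >> i) & 1       # i-th bit of b
        P = 2 * P + a * b_i      # shift and conditionally add a
        P = P % n                # keep the partial product below n
    return P

print(interleaved_modmul(123, 200, 251, 8) == (123 * 200) % 251)
```

In hardware the `P mod n` step is the expensive part, since it needs a magnitude comparison; this is the cost that Montgomery's method removes.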
Montgomery Multiplication
Montgomery's method maps the integers {0, 1, 2, ..., n - 1} to the same
set with the map $\bar{x} = x \cdot r \pmod{n}$ using the integer $r = 2^k$
It then works in this set (numbers with the bar sign) and performs the
multiplication
$\mathrm{MonPro}(\bar{a}, \bar{b}) = \bar{a} \cdot \bar{b} \cdot 2^{-k} \pmod{n}$
Montgomery Multiplication
In order to compute $u = \mathrm{MonPro}(a, b) = a \cdot b \cdot 2^{-k} \pmod{n}$, we perform
the steps below
1. u := 0
2. for i = 0 to k - 1
2a. u := u + a_i · b
2b. if u_0 is 1 then u := u + n
2c. u := u/2
Now, Step 2b is only an addition!
And it is done about half of the time!
We remain in the Montgomery (bar) domain of integers until the final
step of the exponentiation, and then use the conversion routine to go
back to the no-bar domain
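A bit-serial sketch of these steps in Python follows. It assumes n is odd, n < 2^k, and both inputs are below n; the final conditional subtraction and the helper names `to_mont` and `from_mont` are additions made here for a complete round trip, not part of the slide.

```python
def mon_pro(a, b, n, k):
    """Bit-serial Montgomery product: a * b * 2^(-k) mod n, for odd n < 2^k."""
    u = 0
    for i in range(k):
        if (a >> i) & 1:     # add b when bit a_i is 1
            u += b
        if u & 1:            # make u even by adding the odd modulus
            u += n
        u >>= 1              # exact division by 2
    if u >= n:               # assumed final correction; u < 2n here
        u -= n
    return u

def to_mont(x, n, k):
    return (x << k) % n      # x_bar = x * 2^k mod n

def from_mont(x_bar, n, k):
    return mon_pro(x_bar, 1, n, k)   # multiplying by 1 strips the 2^k factor

n, k = 251, 8
a, b = 123, 200
c_bar = mon_pro(to_mont(a, n, k), to_mont(b, n, k), n, k)
print(from_mont(c_bar, n, k) == (a * b) % n)
```

Inside the loop there is no comparison against n and no division by n, only additions and a one-bit shift, which is what makes the method attractive for hardware.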
Karatsuba-Ofman Multiplication
Algorithms textbooks offer a few asymptotically faster multiplication
algorithms: Karatsuba-Ofman, Toom-Cook, Winograd, and DFT-based
algorithms
These algorithms are all good: they help you to multiply faster
But they are no help in modular multiplication, i.e., they do not
multiply-and-reduce (Montgomery's method is special)
They also have large overhead, and start being faster only after a few
thousand bits
However, there have been significant algorithmic developments to bring
down their break-even point to a few hundred bits
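For reference, the Karatsuba-Ofman recursion replaces four half-size products by three. The sketch below works on nonnegative Python integers; the function name and the 64-bit cutoff are arbitrary assumptions, not a measured break-even point.

```python
def karatsuba(x, y, threshold_bits=64):
    """Karatsuba-Ofman product of two nonnegative integers."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y                       # small operands: plain multiply
    h = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask              # x = x1 * 2^h + x0
    y1, y0 = y >> h, y & mask              # y = y1 * 2^h + y0
    z2 = karatsuba(x1, y1, threshold_bits)
    z0 = karatsuba(x0, y0, threshold_bits)
    # One extra product recovers the middle term x1*y0 + x0*y1.
    z1 = karatsuba(x0 + x1, y0 + y1, threshold_bits) - z2 - z0
    return (z2 << (2 * h)) + (z1 << h) + z0

x, y = 3 ** 400, 7 ** 300
print(karatsuba(x, y) == x * y)
```

Note that the result is a full double-length product: a separate reduction step is still required, which is the limitation the slide points out.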
Add                Read/Write              Space
4s^2 + 4s + 2      8s^2 + 13s + 5          2s + 2
6s^2 + 2s + 2      14s^2 + 16s + 3         s + 3
5s^2 + 3s + 2      10s^2 + 9s + 3          s + 3
4s^2 + 4s + 2      9.5s^2 + 11.5s + 3      s + 3
4s^2 + 4s + 2      8s^2 + 12s + 3          s + 3
Unified Arithmetic
One advantage of the Montgomery multiplication in $GF(2^k)$ is that a
single arithmetic unit can be used to handle both kinds of fields, $GF(p)$
and $GF(2^k)$. This is called unified arithmetic (or dual-field arithmetic)
Advantages of the unified arithmetic are low manufacturing cost,
compatibility, parallelism, and scalability
Furthermore, unified arithmetic is impartial: it does not favor one prime
against another or one irreducible polynomial against another
The building block of the unified architecture is the unified full adder: a
1-bit adder that handles both $GF(p)$ and $GF(2^k)$
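One common way to build such a cell is to gate the carry with a field-select signal, so that the same XOR network serves both fields. The sketch below models that idea at the bit level; the polarity (FSEL = 1 for GF(p), FSEL = 0 for GF(2^k)) and the function name are assumptions made for illustration.

```python
def unified_full_adder(a, b, c, fsel):
    """1-bit dual-field adder model.

    fsel = 1: ordinary full adder (integer arithmetic for GF(p)).
    fsel = 0: carry is suppressed, so the sum is bitwise addition in GF(2).
    """
    s = a ^ b ^ c
    cout = fsel & ((a & b) | (a & c) | (b & c))
    return s, cout

for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            s, cout = unified_full_adder(a, b, c, 1)
            assert a + b + c == s + 2 * cout      # integer mode
            s, cout = unified_full_adder(a, b, c, 0)
            assert (s, cout) == (a ^ b ^ c, 0)    # carry-free GF(2) mode
print("both modes behave as expected")
```

The extra cost over a standard full adder is a single AND gate on the carry path, which is why one datapath can serve both field types.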
[Figure: unified full adder (UnivAdder) with inputs a, b, c, and FSEL, and outputs S and Cout]
Scalability
Scalability is an important concept: it allows small changes to be made
in the hardware to handle larger operands without a complete redesign
(such as switching from 1024-bit RSA keys to 1536-bit RSA keys)
[Figure: chain of processing elements PE 1, PE 2, PE 3, ..., PE k, with a buffer]
[Figure: processing units (PU) with registers Reg-c, Reg-p, and Reg-b]
Spectral Arithmetic
We use FFT-based arithmetic to implement modular multiplication
However, we are interested in performing the reduction inside the spectral
(frequency) domain
We utilize finite ring and field arithmetic (avoid real or complex arithmetic
because of the roundoff errors in using floating-point or fixed-point
arithmetic)
We also want to bring down the break-even point of efficiency for
FFT-based multiplication
Furthermore, we utilize the properties of the DFT and Montgomery
algorithm to perform modular multiplication
Spectral Arithmetic
[Figure: block diagram with blocks labeled Modular Multiplication, DFT, Convolution, Modular Reduction, and DFT Inverse]
$X_i = \sum_{j=0}^{d-1} x_j \omega^{ij} \bmod q,$
$x_i = d^{-1} \sum_{j=0}^{d-1} X_j \omega^{-ij} \bmod q,$
for i, j = 0, 1, ..., d - 1.
$x(t) \overset{\mathrm{DFT}}{\longleftrightarrow} X(t)$
and say x(t) and X(t) are transform pairs; x(t) is called a time polynomial,
and X(t) is sometimes called the spectrum of x(t).
(Convention) In the literature, the DFT over a finite ring is also
called the Number Theoretic Transform (NTT)
(Existence) In order to have a DFT map over $Z_q$:
The multiplicative inverse of the DFT length d must exist in $Z_q$, which
requires that gcd(d, q) = 1.
d has to divide p - 1 for every prime divisor p of q
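A small numerical instance may help. The sketch below uses the assumed toy parameters q = 17, d = 4, and omega = 4 (an element of order 4 modulo 17), checks the two existence conditions, and performs a forward and inverse transform. The function names `ntt` and `intt` are hypothetical.

```python
from math import gcd

def ntt(x, w, q):
    """Forward transform: X_i = sum_j x_j * w^(i*j) mod q."""
    d = len(x)
    return [sum(x[j] * pow(w, i * j, q) for j in range(d)) % q
            for i in range(d)]

def intt(X, w, q):
    """Inverse transform: x_i = d^(-1) * sum_j X_j * w^(-i*j) mod q."""
    d = len(X)
    d_inv = pow(d, -1, q)      # exists because gcd(d, q) = 1
    w_inv = pow(w, -1, q)
    return [d_inv * sum(X[j] * pow(w_inv, i * j, q) for j in range(d)) % q
            for i in range(d)]

q, d, w = 17, 4, 4             # toy parameters (assumed for illustration)
assert gcd(d, q) == 1          # d is invertible in Z_q
assert (q - 1) % d == 0        # d divides p - 1 for the only prime divisor 17
assert pow(w, d, q) == 1 and pow(w, d // 2, q) != 1   # w has order exactly 4

x = [1, 2, 3, 4]
X = ntt(x, w, q)
print(X, intt(X, w, q) == x)
```

All arithmetic stays in the integers modulo q, so there is no rounding error, in contrast to a floating-point FFT.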
Properties of DFT
Under certain conditions, the Fourier transform preserves some properties
of the time sequences, e.g., linearity and convolution.
The existence conditions of these properties differ when working over finite
rings
Let $\diamond$ and $\star$ be operations on the time and spectral domains, respectively.
We write
$\diamond \overset{\mathrm{DFT}}{\longleftrightarrow} \star$
and say $\diamond$ and $\star$ are transform pairs on x(t); we sometimes say that
the map $\mathrm{DFT}_d$ respects the operation $\diamond$ at the point x(t) if the following
equation is satisfied
$\diamond(x(t)) = \mathrm{IDFT}_d \circ \star \circ \mathrm{DFT}_d \, (x(t))$
Time-Frequency Dictionary
Time and frequency shifts correspond to circular shifts. Let
$x(t) = x_0 + x_1 t + \ldots + x_{d-1} t^{d-1}$
and
$X(t) = X_0 + X_1 t + \ldots + X_{d-1} t^{d-1}$
be a transform pair.
The one-term right circular shift of x(t) is defined as
$x_1 + x_2 t + \ldots + x_{d-1} t^{d-2} + x_0 t^{d-1}$
$\updownarrow$ DFT
$X(t) \odot \Gamma(t)$
where $\odot$ stands for component-wise multiplication and
$\Gamma(t) = 1 + \omega^{-1} t + \ldots + \omega^{-(d-1)} t^{d-1}$
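The shift property can be verified numerically. The sketch below reuses the assumed toy parameters q = 17, d = 4, omega = 4 and compares the transform of the shifted sequence with the component-wise product of the spectrum and the powers of the inverse of omega; all names are hypothetical.

```python
q, w = 17, 4                   # toy modulus and a 4th root of unity mod 17

def ntt(x):
    d = len(x)
    return [sum(x[j] * pow(w, i * j, q) for j in range(d)) % q
            for i in range(d)]

x = [1, 2, 3, 4]
d = len(x)

# Time domain: x_1 + x_2 t + ... + x_{d-1} t^{d-2} + x_0 t^{d-1}
shifted = x[1:] + x[:1]

# Spectral domain: multiply X_i by omega^(-i)
w_inv = pow(w, -1, q)
gamma = [pow(w_inv, i, q) for i in range(d)]
lhs = ntt(shifted)
rhs = [(X_i * g_i) % q for X_i, g_i in zip(ntt(x), gamma)]
print(lhs == rhs)
```

In the spectral domain a shift therefore costs d independent multiplications by constants, with no data movement between coefficients.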
Time-Frequency Dictionary
Sum of sequence and first value: The sum of the coefficients of a time
polynomial equals the zeroth coefficient of its spectral polynomial.
Conversely, the sum of the spectrum coefficients equals d times the
zeroth coefficient of the time polynomial
$x_0 = d^{-1} \sum_{i=0}^{d-1} X_i$ and $X_0 = \sum_{i=0}^{d-1} x_i$
[Figure: the sequence (x_0, x_1, ..., x_{d-1}) and its DFT; the sum of the sequence equals X_0]
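Both identities are easy to confirm on a small example. The sketch uses the assumed toy parameters q = 17 and omega = 4 with a length-4 sequence; the names are hypothetical.

```python
q, w = 17, 4                   # toy modulus and a 4th root of unity mod 17

def ntt(x):
    d = len(x)
    return [sum(x[j] * pow(w, i * j, q) for j in range(d)) % q
            for i in range(d)]

x = [1, 2, 3, 4]
d = len(x)
X = ntt(x)

sum_time = sum(x) % q                       # should equal X_0
x0_from_spectrum = pow(d, -1, q) * sum(X) % q   # should equal x_0
print(sum_time == X[0], x0_from_spectrum == x[0])
```

The second identity is what lets an algorithm read the least significant digit x_0 without leaving the spectral domain.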
Time-Frequency Dictionary
Left and right logical shifts: By using the previous properties, it is
possible to perform logical left and right digit shifts of x(t) by one position as follows:
$(x(t) - x_0)/t = x_1 + \ldots + x_{d-1} t^{d-2}$
$\updownarrow$ DFT
$(X(t) - x_0(t)) \odot \Gamma(t)$
where
$x_0(t) = x_0 + x_0 t + x_0 t^2 + \ldots + x_0 t^{d-1}$
The right shifts are similar, where one then uses the
$\Omega(t) = 1 + \omega^{1} t + \ldots + \omega^{d-1} t^{d-1}$
polynomial instead of $\Gamma(t)$
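The logical shift that drops x_0 can be checked the same way: subtract x_0 from every spectral coefficient, then multiply component-wise by the powers of the inverse of omega. The toy parameters q = 17 and omega = 4 and all names below are assumptions for illustration.

```python
q, w = 17, 4                   # toy modulus and a 4th root of unity mod 17

def ntt(x):
    d = len(x)
    return [sum(x[j] * pow(w, i * j, q) for j in range(d)) % q
            for i in range(d)]

x = [1, 2, 3, 4]
d = len(x)

# Time domain: (x(t) - x_0) / t = x_1 + x_2 t + ... + x_{d-1} t^{d-2}
shifted = x[1:] + [0]

# Spectral domain: (X_i - x_0) * omega^(-i)
w_inv = pow(w, -1, q)
lhs = ntt(shifted)
rhs = [((X_i - x[0]) * pow(w_inv, i, q)) % q for i, X_i in enumerate(ntt(x))]
print(lhs == rhs)
```

Subtracting the constant polynomial x_0(t) works because the transform of the constant term x_0 is the same value x_0 in every spectral position.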