Hardware Cryptography
Francisco Rodríguez-Henríquez
CINVESTAV-IPN, México
5 BRW polynomials
Each Virtex 5 slice has 4 Look-Up Tables (LUTs), eight registers and
several multiplexers
A LUT can be configured to perform any Boolean function of 6 inputs/1 output
or 5 inputs/2 outputs, or as a memory element of 64 one-bit entries
Virtex devices include built-in 32K-bit RAM blocks, called BRAMs, which
are intended for storing large amounts of data. Some of their features are:
- Polymorphic [programmable bus size]
- Dual port [they can perform two data reads and one write in the same
clock cycle]
- Can be configured for a size of up to 4K bytes
DSP slices are embedded devices equipped with the following components:
- 25 × 18 two's-complement multiplier
- 48-bit accumulator
- Pre-adders
- Single-instruction-multiple-data (SIMD) arithmetic unit
- Can generate any one of ten different logic functions of the two
operands
- Execute all the operations at an extremely high frequency
Advantages
- They have been utilized for fast prototyping of hardware designs
- They are reconfigurable devices
- They allow for a shorter design cycle
- They permit hardware-software co-design
Disadvantages
- They tend to consume much more power and energy than ASIC designs
- Their reconfigurability implies redundancy
- Their speed is lower than that achievable with ASICs
Advantages
- More often than not, they are faster than software applications
- Some operations are almost free of cost [such as shifts, rotations, etc.]
- They allow for a versatile data-path
- They inherently enjoy fine-grain parallelism
Disadvantages
- It is a bit more difficult to code and test designs
- Their maximum clock frequency is ten times lower
- Prime field arithmetic tends to be more difficult to handle
The smallest finite field is $\langle \mathbb{F}_2, \oplus, \cdot \rangle$, which contains only the two elements
$\{0, 1\}$ and whose binary operations act as the Boolean operators XOR and AND,
respectively.
$f = x^m + f_{m-1} x^{m-1} + \cdots + f_1 x + f_0$
$\mathbb{F}_{2^m} \cong \mathbb{F}_2[x]/(f)$
$a \in \mathbb{F}_{2^m}$: $a = a_{m-1} x^{m-1} + \cdots + a_1 x + a_0$
Each element of $\mathbb{F}_2$ is stored using one bit; therefore, a field element of
$\mathbb{F}_{2^m}$ can be represented as a vector of $m$ bits.
Usually the irreducible polynomial f is selected as a trinomial or a
pentanomial
$\vec{a} \to (\vec{a})^2$
$(a_{m-1}, a_{m-2}, \ldots, a_1, a_0) \to (a_{m-1}, 0, \ldots, a_2, 0, a_1, 0, a_0)$
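The zero-interleaving map above is the whole cost of squaring before reduction. A minimal Python sketch, assuming field elements are held as integers whose bit i is the coefficient a_i; the function names and the sample pentanomial x^8 + x^4 + x^3 + x + 1 (0x11B) are illustrative choices, not taken from the slides:

```python
def spread_bits(a: int, m: int) -> int:
    """Insert a zero between consecutive coefficients: bit i moves to bit 2*i."""
    out = 0
    for i in range(m):
        if (a >> i) & 1:
            out |= 1 << (2 * i)
    return out


def reduce_mod_f(c: int, f: int) -> int:
    """Reduce the polynomial c modulo f over F_2 (both encoded as integers)."""
    m = f.bit_length() - 1          # degree of f
    while c.bit_length() > m:       # while deg(c) >= m
        c ^= f << (c.bit_length() - 1 - m)
    return c


def square(a: int, f: int) -> int:
    """Square a in F_2[x]/(f): interleave zeros, then reduce."""
    m = f.bit_length() - 1
    return reduce_mod_f(spread_bits(a, m), f)
```

In hardware the interleaving is pure wiring, so only the reduction step costs gates, which is one reason trinomials and pentanomials are attractive choices for f.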
Parallel-serial multiplication
- Multiplicand loaded in a parallel register
- Multiplier loaded in a shift register
Most significant coefficients first (Horner scheme)
$D$ coefficients processed at each clock cycle: $\lceil m/D \rceil$ cycles per
multiplication
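The Horner scheme with D coefficients per cycle can be modelled in software as follows. This is a behavioural sketch only, assuming integer-encoded polynomials over F_2; `digit_serial_mul` and `reduce_mod_f` are hypothetical names, and each loop iteration stands for one clock cycle:

```python
def reduce_mod_f(c: int, f: int) -> int:
    """Reduce the polynomial c modulo f over F_2."""
    m = f.bit_length() - 1
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)
    return c


def digit_serial_mul(a: int, b: int, f: int, D: int):
    """Multiply a*b mod f, consuming D multiplier coefficients per cycle.

    Returns (product, number_of_cycles).
    """
    m = f.bit_length() - 1
    cycles = -(-m // D)                     # ceil(m / D)
    acc = 0
    for step in range(cycles):
        # Most significant digit of the multiplier first.
        shift = (cycles - 1 - step) * D
        digit = (b >> shift) & ((1 << D) - 1)
        acc <<= D                           # Horner step: acc * x^D
        for j in range(D):                  # add digit * a
            if (digit >> j) & 1:
                acc ^= a << j
        acc = reduce_mod_f(acc, f)
    return acc, cycles
```

For example, with m = 8 and D = 4 the call `digit_serial_mul(0x53, 0xCA, 0x11B, 4)` finishes in two modelled cycles.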
$C = A \cdot B$
$= (a_0 + a_1 x^{\frac{m-1}{2}})(b_0 + b_1 x^{\frac{m-1}{2}})$
$= a_0 b_0 + [(a_0 + a_1)(b_0 + b_1) + a_0 b_0 + a_1 b_1]\, x^{\frac{m-1}{2}} + a_1 b_1 x^{m-1}$
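One level of this Karatsuba split can be sketched in Python as below. The encoding (integers as polynomials over F_2), the helper `clmul` and the function name `karatsuba_f2` are assumptions made for illustration; the sub-products use a plain shift-and-XOR multiplier rather than further recursion, and no modular reduction is performed:

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product of a and b over F_2."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_f2(a: int, b: int, m: int) -> int:
    """One Karatsuba level for m-coefficient operands, split at (m-1)/2."""
    h = (m - 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = clmul(a0, b0)                          # a0*b0
    hi = clmul(a1, b1)                          # a1*b1
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi     # cross term; + and - are both XOR
    return lo ^ (mid << h) ^ (hi << (2 * h))
```

Three sub-products replace the four of the schoolbook method, at the price of a few extra XOR layers.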
Let $n$ be the block length. Then a block cipher can be seen as a
function
$E : \mathcal{K} \times \{0, 1\}^n \to \{0, 1\}^n$,
denoted by $E(K, M) = E_K(M)$.
For each $K$, $E_K$ must be a permutation. So, each $E_K(\cdot)$ has an inverse $D_K$
such that
$D_K(E_K(M)) = M$
A secure block cipher is considered to be a Strong Pseudo-Random
Permutation (SPRP).
Informally, a hash function maps a long string into a short one. Among
those functions, there exists a specific type of hash called the
polynomial hash,
defined as
$\mathrm{BRW}_h() = 0$
$\mathrm{BRW}_h(x_1) = x_1$
$\mathrm{BRW}_h(x_1, x_2) = x_1 + x_2 h$
$\mathrm{BRW}_h(x_1, x_2, x_3) = (h + x_1)(h^2 + x_2) + x_3$
$\mathrm{BRW}_h(x_1, x_2, \ldots, x_m) = \mathrm{BRW}_h(x_1, x_2, \ldots, x_{t-1})(h^t + x_t) + \mathrm{BRW}_h(x_{t+1}, \ldots, x_m)$
$\mathrm{BRW}_h(x_1, \ldots, x_{16}) = ((((h + x_1)(h^2 + x_2) + x_3)(h^4 + x_4) + (h + x_5)(h^2 + x_6) + x_7)(h^8 + x_8)$
$\quad + ((h + x_9)(h^2 + x_{10}) + x_{11})(h^4 + x_{12}) + (h + x_{13})(h^2 + x_{14}) + x_{15})(h^{16} + x_{16})$
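The recursive definition can be evaluated directly. The sketch below assumes that t is the largest power of two not exceeding m (for m ≥ 4), which is the usual convention for BRW polynomials and is consistent with the 16-block expansion above; it also takes the two-block case exactly as written in the definition. The toy field F_{2^8} with modulus 0x11B and the names `gf_mul` and `brw` are illustrative only:

```python
F = 0x11B  # toy modulus x^8 + x^4 + x^3 + x + 1, an assumption for the demo


def gf_mul(a: int, b: int, f: int = F) -> int:
    """Multiply two reduced elements of F_2[x]/(f)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def brw(h: int, xs: list) -> int:
    """Evaluate BRW_h(xs) by the recursive definition (addition is XOR)."""
    m = len(xs)
    if m == 0:
        return 0
    if m == 1:
        return xs[0]
    if m == 2:
        return xs[0] ^ gf_mul(xs[1], h)
    if m == 3:
        return gf_mul(h ^ xs[0], gf_mul(h, h) ^ xs[1]) ^ xs[2]
    t = 1 << (m.bit_length() - 1)          # largest power of two <= m
    h_t = h
    for _ in range(t.bit_length() - 1):    # h^t by repeated squaring
        h_t = gf_mul(h_t, h_t)
    return gf_mul(brw(h, xs[:t - 1]), h_t ^ xs[t - 1]) ^ brw(h, xs[t:])
```

Apart from the squarings that produce the powers of h, the three-block base case costs one multiplication and each recursive step adds one more.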
The number of nodes in $T_m$ is $p$.
The number of connected components is given by the
Hamming weight of $p$.
If bit $i$ of $p$ is 1, $T_m$ contains a tree of size $2^i$.
If $k \equiv 2 \bmod 4$, then $k$ is an independent node.
If $k \equiv 0 \bmod 8$, $k$ has at least $k - 2$ and $k - 4$ as
its children.
If $k \equiv 4 \bmod 8$, $k - 2$ is the only child of $k$.
Theorem
Let $H_h(X_1, X_2, \ldots, X_m)$ be a BRW polynomial and let $p = \lfloor m/2 \rfloor$ be the
number of nodes in the corresponding collapsed
tree. Let clks be the number of clock cycles taken by
Schedule to schedule all nodes. Then:
1. If $NS = 2$ and $p \ge 3$, clks $= p + 1$ if $p \equiv 0 \bmod 4$, and
clks $= p$ otherwise.
2. If $NS = 3$ and $p \ge 7$, then
$\mathrm{clks} = \begin{cases} p + 2 & \text{if } p \equiv 0 \bmod 4 \\ p + 1 & \text{if } p \equiv 1 \bmod 4 \\ p + 1 & \text{if } p \equiv 2 \bmod 4 \\ p & \text{if } p \equiv 3 \bmod 4 \end{cases}$
Addition: $a + b \bmod p$
Multiplication: $a \cdot b \bmod p$
Multiplicative inversion: $a^{-1} \bmod p$
We would like to compute the sum of two $k$-bit integers $A$ and $B$. Let $A_i$
and $B_i$ for $i = 0, 1, \ldots, k - 1$ represent the bits of the integers $A$ and $B$,
respectively. Then the sum bits $S_i$ for $i = 0, 1, \ldots, k - 1$ and the final
carry-out $C_k$ are defined as
$\begin{array}{rcccccc} & & A_{k-1} & A_{k-2} & \cdots & A_1 & A_0 \\ + & & B_{k-1} & B_{k-2} & \cdots & B_1 & B_0 \\ \hline & C_k & S_{k-1} & S_{k-2} & \cdots & S_1 & S_0 \end{array}$
$C_{i+1} = A_i B_i + A_i C_i + B_i C_i$
$S_i = A_i \oplus B_i \oplus C_i$
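The two recurrences describe a ripple-carry adder built from full-adder cells. A small Python model, assuming little-endian bit lists of equal length and an initial carry C_0 = 0; the function name is hypothetical:

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length little-endian bit lists.

    Returns (sum_bits, carry_out), i.e. (S_0..S_{k-1}, C_k).
    """
    carry = 0                                   # C_0
    sum_bits = []
    for ai, bi in zip(a_bits, b_bits):
        sum_bits.append(ai ^ bi ^ carry)        # S_i = A_i xor B_i xor C_i
        carry = (ai & bi) | (ai & carry) | (bi & carry)   # C_{i+1}
    return sum_bits, carry
```

Each carry depends on the previous one, so the critical path grows linearly with k.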
$C' + S = A + B + C$
The $i$th bit of the sum $S_i$ and the $(i + 1)$st bit of the carry $C'_{i+1}$ are
calculated using the equations
$S_i = A_i \oplus B_i \oplus C_i$
$C'_{i+1} = A_i B_i + A_i C_i + B_i C_i$
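In a carry-save adder every bit position is computed independently, so three operands collapse into two with no carry propagation. A word-level Python sketch, assuming non-negative integers; the function name is hypothetical:

```python
def carry_save_add(a: int, b: int, c: int):
    """Compress three operands into (carry_word, sum_word) with a+b+c preserved."""
    sum_word = a ^ b ^ c                                # S_i for every i at once
    carry_word = ((a & b) | (a & c) | (b & c)) << 1     # C'_{i+1} for every i
    return carry_word, sum_word
```

A single carry-propagating addition of the two output words is still needed at the very end to obtain a non-redundant result.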
$C := C' \bmod p$
$r \cdot r^{-1} - p \cdot p' = 1$,
Input: $t = (t_0, t_1, \ldots, t_{2n-1})$, $p = (p_0, p_1, \ldots, p_{n-1})$ and $p'_0$, where $|p'_0| = \omega$
Output: $u \leftarrow (t + (t \cdot p' \bmod r) \cdot p)/r$
1. for $i = 0 \to n - 1$ do
2.     $C \leftarrow 0$
3.     $m \leftarrow t_i \cdot p'_0 \bmod 2^{\omega}$
4.     for $j = 0 \to n - 1$ do
5.         $(C, S) \leftarrow t_{i+j} + m \cdot p_j + C$
6.         $t_{i+j} = S$
7.     ADD($t_{i+n}$, $C$)
8. for $i = 0 \to n - 1$ do
9.     $u_i = t_{i+n}$
10. return $u$
The number of products of this method is $2n^2 + n$.
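The word-level reduction loop can be modelled in Python as follows. This is a sketch under stated assumptions: $p'_0$ is taken to be $-p^{-1} \bmod 2^{\omega}$, $r = 2^{n\omega}$, the ADD step is modelled as a carry propagation into the higher words (one spare word is allocated for it), and the final conditional subtraction is left out, as in the listing. The function names are hypothetical, and the three-argument `pow` with a negative exponent needs Python 3.8 or later:

```python
def to_words(x: int, count: int, w: int):
    """Split x into `count` little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(count)]


def from_words(words, w: int) -> int:
    return sum(d << (w * i) for i, d in enumerate(words))


def montgomery_reduce(t: int, p: int, n: int, w: int) -> int:
    """Return u = t * r^{-1} mod p (up to one subtraction of p), r = 2^(n*w)."""
    mask = (1 << w) - 1
    p_words = to_words(p, n, w)
    p0_inv = pow(-p, -1, 1 << w)        # assumed meaning of p'_0
    tw = to_words(t, 2 * n + 1, w)      # one extra word for the last carry
    for i in range(n):
        carry = 0
        m = (tw[i] * p0_inv) & mask
        for j in range(n):
            acc = tw[i + j] + m * p_words[j] + carry
            tw[i + j] = acc & mask      # S
            carry = acc >> w            # C
        k = i + n                       # ADD(t_{i+n}, C) with carry propagation
        while carry:
            acc = tw[k] + carry
            tw[k] = acc & mask
            carry = acc >> w
            k += 1
    return from_words(tw[n:], w)
```

The inner loop performs n word products and each outer iteration adds one more for m, giving the $n^2 + n$ products of the reduction; the remaining $n^2$ in the count above would be those of the preceding word-level multiplication.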
$y^2 = x^3 + Ax + B$
$\hat{e} : G_1 \times G_2 \to G_T$
$a_{opt} : G_2 \times G_1 \to G_3$
More than 10,000 and 5,000 multiplications over $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$,
respectively, are required for computing a pairing defined over BN
curves
BN curves enjoy several useful features for computing the
Montgomery reduction, namely,
- $\gcd(t, p) = 1$
- $p \equiv 1 \bmod t$, which implies $p^{-1} \bmod t = 1$
- the coefficients of the polynomial $p(t)$, $(36, 36, 24, 6, 1)$, are relatively
small.
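The first two features follow from the shape of $p(t) = 36t^4 + 36t^3 + 24t^2 + 6t + 1$: every term except the constant is a multiple of t. A quick numerical check in Python; the function names are hypothetical, the sample values of t are arbitrary and are not claimed to give a prime p, and the modular inverse via `pow` needs Python 3.8 or later:

```python
from math import gcd


def bn_prime_poly(t: int) -> int:
    """Evaluate p(t) = 36t^4 + 36t^3 + 24t^2 + 6t + 1."""
    return 36 * t**4 + 36 * t**3 + 24 * t**2 + 6 * t + 1


def check_bn_features(t: int):
    """Return (gcd(t, p) == 1, p mod t == 1, p^{-1} mod t == 1) for t > 1."""
    p = bn_prime_poly(t)
    return gcd(t, p) == 1, p % t == 1, pow(p, -1, t) == 1
```

Because $p^{-1} \bmod t = 1$, the Montgomery constant is trivial when the radix is t, so the reduction can use the low coefficient directly.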
Input: $a(t) \bmod p = a(t) = \sum_{i=0}^{4} a_i t^i$,
       $b(t) \bmod p = b(t) = \sum_{i=0}^{4} b_i t^i$,
       $p(t) = 36t^4 + 36t^3 + 24t^2 + 6t + 1$, $t = 2^n + s$
Output: $c(t) = a(t) b(t) \cdot t^{-5} \bmod p$
1. $c(t)$ = 5-term KaratsubaProduct($a(t)$, $b(t)$)   (Polynomial Product)
2. for $i = 0$ to $4$ do
3.     $\mu \leftarrow c_0 \text{ div } 2^n$; $\gamma \leftarrow c_0 \bmod 2^n - \mu s$
4.     $g(t) \leftarrow p(t)(-\gamma)$   (Montgomery Reduction Phase)
5.     $c(t) \leftarrow (c(t) + g(t))/t + \mu$
6. for $k = 0$ to $1$ do
7.     for $i = 0$ to $3$ do
8.         $\mu \leftarrow c_i \text{ div } 2^n$; $\gamma \leftarrow c_i \bmod 2^n - \mu s$   (Coefficient Reduction Phase)
9.         $c_{i+1} \leftarrow c_{i+1} + \mu$; $c_i \leftarrow \gamma$
10. return $c(t)$
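The steps above can be traced with a small integer model. In this sketch polynomials are Python lists of integer coefficients (lowest degree first), the polynomial product is a plain schoolbook stand-in for the 5-term Karatsuba step, and `bn_mont_mul`, `poly_mul` and `evaluate` are hypothetical names. Each pass of the reduction loop divides the value by t once, so after the five passes the result carries a factor $t^{-5}$, which is what the usage check below verifies:

```python
P_COEFFS = [1, 6, 24, 36, 36]   # p(t) = 36t^4 + 36t^3 + 24t^2 + 6t + 1


def evaluate(coeffs, t: int) -> int:
    return sum(c * t**i for i, c in enumerate(coeffs))


def poly_mul(a, b):
    """Schoolbook polynomial product (stand-in for the Karatsuba step)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def bn_mont_mul(a, b, n: int, s: int):
    """Polynomial Montgomery product for t = 2^n + s; a and b have 5 coefficients."""
    two_n = 1 << n
    c = poly_mul(a, b)                       # 9 coefficients
    for _ in range(5):                       # Montgomery reduction phase
        mu, low = divmod(c[0], two_n)
        gamma = low - mu * s                 # c_0 = mu * t + gamma
        for j in range(1, 5):                # c(t) + g(t), g(t) = -gamma * p(t)
            c[j] -= gamma * P_COEFFS[j]
        c = c[1:] + [0]                      # divide by t (constant term was mu * t)
        c[0] += mu
    c = c[:5]
    for _ in range(2):                       # coefficient reduction phase
        for i in range(4):
            mu, low = divmod(c[i], two_n)
            c[i] = low - mu * s
            c[i + 1] += mu
    return c
```

A possible check, with arbitrary illustrative parameters: for `n, s = 8, 3`, `t = 2**n + s` and `p = evaluate(P_COEFFS, t)`, the value `evaluate(bn_mont_mul(a, b, n, s), t) % p` should equal `evaluate(a, t) * evaluate(b, t) * pow(t, -5, p) % p`.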
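[Figure: Polynomial multiplier. 64-bit inputs a(t) and b(t) feed a 64 × 64 multiplier (128-bit product, split into Ch and Cl plus carries for each A_i B_i), followed by the second phase of additions (128 bits), the final reduction of coefficients (80 bits) and the 64-bit output c(t), all under a control block.]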
Francisco Rodríguez-Henríquez, Hardware design of cryptographic algorithms
5-term Karatsuba-like multiplication algorithm
Input: $a(t) = \sum_{i=0}^{4} a_i t^i$, $b(t) = \sum_{i=0}^{4} b_i t^i$
Output: $c(t) = a(t) b(t)$
1. $c(t) (= \sum_{i=0}^{8} c_i t^i) \leftarrow 0$;
Initial addition and product phase (each of these is a 64-bit integer multiplication):
2. $p_0 = a_0 b_0$;
3. $p_1 = a_1 b_1$;
4. $p_2 = (a_0 + a_1)(b_0 + b_1)$;
5. $p_3 = a_2 b_2$;
6. $p_4 = (a_0 + a_2)(b_0 + b_2)$;
7. $p_5 = a_3 b_3$;
8. $p_6 = (a_2 + a_3)(b_2 + b_3)$;
9. $p_7 = (a_1 + a_3)(b_1 + b_3)$;
10. $p_8 = (a_0 + a_1 + a_2 + a_3)(b_0 + b_1 + b_2 + b_3)$;
11. $p_9 = a_4 b_4$;
12. $p_{10} = (a_0 + a_4)(b_0 + b_4)$;
13. $p_{11} = (a_0 + a_1 + a_4)(b_0 + b_1 + b_4)$;
14. $p_{12} = (a_2 + a_4)(b_2 + b_4)$;
15. $p_{13} = (a_2 + a_3 + a_4)(b_2 + b_3 + b_4)$
Final addition phase (each of these is a 64-bit integer addition):
16. $c_0 = p_0$
17. $c_1 = p_2 - p_1 - p_0$
18. $c_2 = p_4 + p_1 - p_0 - p_3$
19. $S_1 = p_6 - p_5 - p_3$
20. $c_3 = p_8 - p_7 - p_4 - c_1 - S_1$
21. $c_4 = p_{10} - p_9 - p_0 + p_3 - p_5 - p_1 + p_7$
22. $c_5 = p_{11} - p_1 - p_{10} - c_1 + S_1$
23. $c_6 = p_{12} - p_9 + p_5 - p_3$
24. $c_7 = p_{13} - p_5 - p_{12} - S_1$
25. $c_8 = p_9$
26. return $c(t)$
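The 14 products and the recombination can be checked with a short Python model over arbitrary-precision integers; `karatsuba5` and `schoolbook` are hypothetical names. One detail matters over the integers: in the $c_4$ line the product $p_5$ must enter with a minus sign, because $p_7 - p_1 - p_5$ is what isolates the cross term $a_1 b_3 + a_3 b_1$; the sketch is written that way.

```python
def karatsuba5(a, b):
    """Multiply two 5-coefficient polynomials with 14 coefficient products."""
    a0, a1, a2, a3, a4 = a
    b0, b1, b2, b3, b4 = b
    # Initial addition and product phase
    p0 = a0 * b0
    p1 = a1 * b1
    p2 = (a0 + a1) * (b0 + b1)
    p3 = a2 * b2
    p4 = (a0 + a2) * (b0 + b2)
    p5 = a3 * b3
    p6 = (a2 + a3) * (b2 + b3)
    p7 = (a1 + a3) * (b1 + b3)
    p8 = (a0 + a1 + a2 + a3) * (b0 + b1 + b2 + b3)
    p9 = a4 * b4
    p10 = (a0 + a4) * (b0 + b4)
    p11 = (a0 + a1 + a4) * (b0 + b1 + b4)
    p12 = (a2 + a4) * (b2 + b4)
    p13 = (a2 + a3 + a4) * (b2 + b3 + b4)
    # Final addition phase
    c0 = p0
    c1 = p2 - p1 - p0
    c2 = p4 + p1 - p0 - p3
    s1 = p6 - p5 - p3
    c3 = p8 - p7 - p4 - c1 - s1
    c4 = p10 - p9 - p0 + p3 - p5 - p1 + p7   # note the minus sign on p5
    c5 = p11 - p1 - p10 - c1 + s1
    c6 = p12 - p9 + p5 - p3
    c7 = p13 - p5 - p12 - s1
    c8 = p9
    return [c0, c1, c2, c3, c4, c5, c6, c7, c8]


def schoolbook(a, b):
    """Reference product with 25 coefficient multiplications."""
    c = [0] * 9
    for i in range(5):
        for j in range(5):
            c[i + j] += a[i] * b[j]
    return c
```

Fourteen multiplications replace the 25 of the schoolbook method, which is the saving that matters when a single 64 × 64 multiplier is time-shared.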
[Figure: Initial addition phase. Two multiplexer-fed adders produce the operands OPA and OPB (with intermediate sums SA2, SA3, SB2, SB3) for the 64 × 64 multiplier.]
[Figure: Final addition phase. The 64 × 64 multiplier output feeds a register bank (Rmem) and an accumulator component (Acc1, Acc2) driven by the control unit (read/write addresses, multiplexer selection and load signals); registers R1 to R4 and the accumulators supply operands A, B, C, D to an operator computing (A+B)-(C+D).]
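[Figure: Reduction datapath for the coefficients C0 to C3. Each input C_in is split into C_in div 2^n (CH, giving µ) and C_in mod 2^n (CL), and γ = CL - µs is formed; γ is scaled by the constants 6, 24, 36 (and 12) and combined with registers R0 to R3, with µ added to the next coefficient.]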
[Figure: Datapath with inputs X0, Y0, X1, Y1 producing sums S0 to S4, two Sub512 units computing D0-(D1+D2), and two Montgomery reduction blocks that output W0 mod p and W1 mod p.]
[Figure: Carry-save combination of X, Y and Z, with $S = X \oplus Y \oplus Z$ and $C = XY + XZ + YZ$, applied to the words W0 and W1 (widths 128, 130, 256 and 258 bits; labels C0, C1, S0, S1).]
[Figure: Multiplier built from DSP48 slices. 48 products A_i B_j (for example A5B0, A5B1, A4B0) are accumulated over 13 stages E_i, producing outputs S and C of 256 bits.]
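[Figure: Datapath around a 128 × 128 multiplier with X = X_H || X_L and Y = Y_H || Y_L, where |X_H| = |X_L| = |Y_H| = |Y_L| = 128 bits; multiplexers select the halves, registers R1 to R6 hold intermediate values, and a 130-bit partial output leads to the final output.]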
[Figure: Reduction datapath with inputs P0, P1, P2, P3 (|P_i| = 128 bits), a 128 × 128 multiplier fed from a ROM (P'), a FIFO memory, an adder, a second ROM, multiplexers and two register banks, producing W mod P.]
Questions?