Hardware design of cryptographic algorithms

Francisco Rodríguez-Henríquez
CINVESTAV-IPN, México
[email protected]

Tutorial Talk at Indocrypt 2012 - Sunday December 9th, 2012

Francisco Rodrı́guez-Henrı́quez Hardware design of cryptographic algorithms (1 / 120)


Outline of the talk

1 Context and motivation

2 Hardware design issues

3 Basic $\mathbb{F}_{2^m}$ field arithmetic building blocks

4 Basic cryptographic building blocks

5 BRW polynomials

6 A 256-bit multiplier over $\mathbb{F}_p$


But.... why should one bother implementing
crypto-schemes in Hardware?

But why should one bother implementing pairings in hardware?

The computation is not very well suited to general-purpose processors.
There exist specific targets, one of the most prominent being smart cards.
Hardware may be the fastest/most efficient way to implement cryptographic algorithms.
However, if a cryptographic hardware accelerator is to be attractive at all, it should show a significant performance improvement with respect to software implementations.


FPGA General architecture



Main Xilinx FPGA families

Kintex 7: economical models for high-performance applications
Virtex 5 and 6: some of the most sophisticated models; the look-up tables can be configured with up to six inputs
Artix 7: some of the models include a dual-core ARM Cortex-A9
Spartan 6: economical and simple devices
Virtex 7: the newest family, with many embedded components working at high speed; it is still costly


Main characteristics of Xilinx FPGA Families [as of 2006]

                      Virtex-5      Virtex-4     Virtex II Pro   Spartan 3 & 3E
Logic Cells           up to 330K    12K-200K     3K-99K          1.7K-74K
BRAM (18 Kbits each)  576           36-512       12-444          4-104
Multipliers           32-192 [1]    32-512       12-444          4-104
DCM                   up to 18      4-20         4-12            2-18
IOBs                  up to 1200    240-960      204-1164        63-633
DSP Slices            32-192        32-192       -               -
PowerPC Blocks        N/A           0-2          0-2             -
Max. freq.            550 MHz       500 MHz      547 MHz         up to 300 MHz
Price                 ≈ $400 USD    From $300    From $139       From $2 up to $85

[1] 25 × 18 embedded multipliers


FPGA Manufacturers and Their Devices

Manufacturer   FPGA Family                                    Feature
Xilinx         Virtex-4 to Virtex-7, Virtex II, Spartan III   FPGA market leader; 28 nm technology
Altera         Stratix, Stratix II, Cyclone                   28 nm technology
Lattice        LatticeXP                                      ultra-low power for mobile applications
Actel          Fusion, M7Fusion                               first mixed-signal FPGA
Quick Logic    Eclipse II                                     programmable-only-once FPGA
Atmel          AT40KAL                                        fine-grain reconfigurable
Achronix       Achronix-ULTRA                                 1.6 GHz - 2.2 GHz speed


General Xilinx Virtex 5 Slice architecture

Each Virtex 5 slice has four look-up tables (LUTs), eight registers and several multiplexers.
A LUT can be configured to perform any Boolean function of 6 inputs/1 output or 5 inputs/2 outputs, or as a memory element holding 64 one-bit entries.
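To make the look-up table idea concrete, the following is a minimal software model, not vendor code: a 6-input/1-output LUT is simply a 64-entry truth table addressed by its inputs. The function names `make_lut6` and `lut6_eval` and the bit ordering are assumptions made for this sketch.

```python
def make_lut6(func):
    """Build the 64-bit configuration word of a 6-input LUT.

    Bit i of the word stores the output for the input pattern whose
    binary encoding is i (input 0 is the least significant bit).
    """
    init = 0
    for pattern in range(64):
        bits = [(pattern >> k) & 1 for k in range(6)]
        if func(*bits) & 1:
            init |= 1 << pattern
    return init


def lut6_eval(init, *bits):
    """Look up the output bit: the six inputs simply address the table."""
    address = sum(bit << k for k, bit in enumerate(bits))
    return (init >> address) & 1


# Example: a 6-input XOR (parity), a typical GF(2) operation.
parity = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
```

Any other 6-input Boolean function costs exactly the same single LUT, which is why wide XOR trees map so well onto this architecture.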


Block RAMs

Virtex devices include built-in 32 Kbit RAM blocks, called BRAMs, which are intended for storing large amounts of data. Some of their features are:
Polymorphic [programmable bus size]
Dual port [they can perform two data reads and one write in the same clock cycle]
Can be configured for a size of up to 4 Kbytes


DSP Slices

DSP slices are embedded devices equipped with the following components:
25 × 18 two's-complement multiplier
48-bit accumulator
Pre-adders
Single-instruction-multiple-data (SIMD) arithmetic unit
They can generate any one of ten different logic functions of the two operands
They execute all operations at an extremely high frequency


FPGAs vs. ASICs

Advantages
- They have been utilized for fast prototyping of hardware designs
- They are reconfigurable devices
- They allow for a shorter design cycle
- They permit hardware-software co-design
Disadvantages
- They tend to consume much more power and energy than ASIC designs
- Their reconfigurability implies redundancy
- They are slower than ASICs


FPGAs vs. general-purpose processors

Advantages
- More often than not, they are faster than software applications
- Some operations are almost free of cost [such as shifts, rotations, etc.]
- They allow for a versatile data-path
- They inherently enjoy fine-grain parallelism
Disadvantages
- It is a bit more difficult to code and test designs
- Their maximum clock frequency is ten times lower
- Prime field arithmetic tends to be more difficult to handle


Design Methodology for FPGA designs



FPGA Design issues: circuit’s critical path

Circuit's critical path: the maximum allowed clock frequency is determined by the longest/slowest combinatorial path present in the circuit.


FPGA Design issues: Latency

Circuit's latency: the amount of time [often given in number of clock cycles] required for producing the first output.


Parallel techniques in hardware: pipelining

Pipelining is a natural technique for producing more parallelism; however, the designer must balance the pipe stages carefully, trying to avoid dependencies and pipeline stalls.


Measures of performance in reconfigurable hardware devices

Computational time, defined as:
$$\text{Computational time} = \frac{\#\ \text{of clock cycles}}{\text{clock frequency}}$$
Throughput, defined as:
$$\text{Throughput} = \frac{\#\ \text{of bits processed} \cdot \text{clock frequency}}{\#\ \text{of clock cycles}}$$
Latency: # of clock cycles required for producing the first computation
Amount of hardware resources utilized by the design, including slices, dedicated memories, DSP slices, etc.
Time-area product
Power consumption, energy consumption, ...
In the case of cryptographic designs: side-channel resistance
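As a small worked illustration of the time and throughput definitions above, the sketch below evaluates them for a made-up design. All numbers (128-bit block, 10 cycles, 100 MHz, 1000 slices) and the function names are hypothetical and chosen only to make the arithmetic easy to follow; the time-area product is taken here as computational time multiplied by the number of slices, which is one common convention rather than something the slides fix.

```python
def computational_time(cycles, freq_hz):
    """Seconds needed for one computation: cycles / clock frequency."""
    return cycles / freq_hz


def throughput(bits, cycles, freq_hz):
    """Bits per second: (bits processed * clock frequency) / cycles."""
    return bits * freq_hz / cycles


def time_area_product(cycles, freq_hz, slices):
    """Time-area product, assuming area is measured in slices."""
    return computational_time(cycles, freq_hz) * slices


# Hypothetical design: 128-bit block, 10 clock cycles per block,
# 100 MHz clock, 1000 slices.
t = computational_time(10, 100e6)        # 0.1 microseconds
tp = throughput(128, 10, 100e6)          # 1.28 Gbit/s
ta = time_area_product(10, 100e6, 1000)  # slice-seconds
```

Doubling the clock frequency or halving the cycle count has the same effect on both time and throughput, which is why pipelining (fewer cycles per block at steady state) and critical-path reduction (higher frequency) are the two main levers.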
Design tools



Finite fields

Every prime number p defines a finite field of order p, denoted $\mathbb{F}_p$.

The smallest finite field is $\langle \mathbb{F}_2, \oplus, \cdot \rangle$, which contains only the two elements {0, 1} and whose binary operations act as the Boolean operators XOR and AND, respectively.


Field Extensions

Given a positive integer $m > 1$, the field $\mathbb{F}_{p^m}$ is a field extension of $\mathbb{F}_p$.

It can be shown that $\mathbb{F}_{p^m}$ is isomorphic to $\mathbb{F}_p[x]/(f(x))$, where $f(x)$ is a monic polynomial of degree $m > 1$, irreducible over $\mathbb{F}_p$.

We denote by $\mathbb{F}_p[x]/(f(x))$ the set of equivalence classes of the polynomials in $\mathbb{F}_p[x]$ modulo $f(x)$.


Agenda
1 Context and motivation
2 Hardware design issues
3 Basic $\mathbb{F}_{2^m}$ field arithmetic building blocks
    $\mathbb{F}_{2^m}$ field arithmetic
4 Basic cryptographic building blocks
    Block ciphers
    Polynomial hash
5 BRW polynomials
6 A 256-bit multiplier over $\mathbb{F}_p$
    $\mathbb{F}_p$ field arithmetic
    Pairings and elliptic curves
    Karatsuba-like multiplier


Arithmetic over $\mathbb{F}_{2^m}$

$f \in \mathbb{F}_2[x]$: degree-$m$ irreducible polynomial over $\mathbb{F}_2$

$$f = x^m + f_{m-1} x^{m-1} + \cdots + f_1 x + f_0$$

$\mathbb{F}_{2^m} \cong \mathbb{F}_2[x]/(f)$
$a \in \mathbb{F}_{2^m}$:
$$a = a_{m-1} x^{m-1} + \cdots + a_1 x + a_0$$
Each element of $\mathbb{F}_2$ is stored using one bit; ergo, a field element of $\mathbb{F}_{2^m}$ can be represented as a vector of m bits.
Usually the irreducible polynomial f is selected as a trinomial or a pentanomial.


Addition over $\mathbb{F}_{2^m}$

$$r = a + b = (a_{m-1} + b_{m-1}) x^{m-1} + \cdots + (a_1 + b_1) x + (a_0 + b_0)$$

- Coefficient-wise additions over $\mathbb{F}_2$: $r_i = (a_i + b_i) \bmod 2$
- Addition over $\mathbb{F}_2$: XOR gates
- This operation directly benefits from the parallel processing of the XOR operation
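In software the same idea can be sketched by storing a field element as an integer whose bit i holds the coefficient of x^i, so that addition becomes a single bitwise XOR. The function name `gf2m_add` and the 4-bit example values are assumptions of this sketch.

```python
def gf2m_add(a, b):
    """Add two field elements stored as integers (bit i = coefficient of x^i).

    Coefficient-wise addition modulo 2 is a bitwise XOR; no carries
    propagate, so all m bit positions are independent.
    """
    return a ^ b


a = 0b1011          # x^3 + x + 1
b = 0b0110          # x^2 + x
r = gf2m_add(a, b)  # x^3 + x^2 + 1
```

Because every element is its own additive inverse, `gf2m_add(a, a)` is zero and subtraction is the same operation as addition.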


Field Squaring
Due to the action of the Frobenius map, polynomial squaring of an element $a \in \mathbb{F}_{2^m}$ is a linear operation over binary fields,
$$a(x)^2 = \left[\sum_{i=0}^{m-1} a_i x^i\right]^2 = \sum_{i=0}^{m-1} a_i x^{2i}$$

This can be implemented by interleaving zeroes among the polynomial coefficients,

$$\vec{a} \rightarrow (\vec{a})^2$$
$$(a_{m-1}, a_{m-2}, \ldots, a_1, a_0) \rightarrow (a_{m-1}, 0, \ldots, a_2, 0, a_1, 0, a_0)$$
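The zero-interleaving rule can be sketched in a few lines. This is an illustrative software model (the name `gf2_poly_square` is hypothetical); it produces the unreduced square, which still has to be reduced modulo the irreducible polynomial.

```python
def gf2_poly_square(a):
    """Square a polynomial over F_2 by interleaving zero bits.

    Coefficient a_i (bit i of `a`) moves to bit position 2*i; the result
    has degree up to 2m - 2 and still needs reduction modulo f(x).
    """
    result = 0
    i = 0
    while a:
        if a & 1:
            result |= 1 << (2 * i)
        a >>= 1
        i += 1
    return result
```

For example, squaring x^3 + x + 1 (`0b1011`) gives x^6 + x^2 + 1 (`0b1000101`). In hardware this step needs no logic at all, only wiring.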


Multiplication over $\mathbb{F}_{2^m}$

Parallel-serial multiplication
- Multiplicand loaded in a parallel register
- Multiplier loaded in a shift register
Most significant coefficients first (Horner scheme)
D coefficients processed at each clock cycle: $\lceil m/D \rceil$ cycles per multiplication


Field multiplication: interleaving products and reduction
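A minimal software sketch of the most-significant-first (Horner) scheme with reduction interleaved after every shift is given below. It is not the circuit from the slides: the function name, the argument layout and the use of GF(2^8) with the AES polynomial 0x11B in the usage note are assumptions made for illustration. The outer loop corresponds to one clock cycle; the inner loop models the D coefficients that the hardware would handle combinationally within that cycle.

```python
def gf2m_mul_msb_first(a, b, f, m, D=1):
    """Horner-style multiplication in F_2[x]/(f), D multiplier bits per step.

    a: multiplicand (parallel register), b: multiplier (shift register),
    f: irreducible polynomial including its x^m term, m: field degree.
    Returns (product, number_of_iterations).
    """
    steps = -(-m // D)                  # ceil(m / D) iterations ("clock cycles")
    acc = 0
    for s in range(steps - 1, -1, -1):
        for j in range(D - 1, -1, -1):  # the D coefficients of this digit, MSB first
            acc <<= 1                   # acc <- acc * x
            if (acc >> m) & 1:          # degree reached m: subtract (XOR) f
                acc ^= f
            if (b >> (s * D + j)) & 1:  # add a * b_i
                acc ^= a
    return acc, steps
```

With the AES field as a test case, `gf2m_mul_msb_first(0x57, 0x83, 0x11B, 8)` returns the product `0xC1` after 8 iterations, and the same call with `D=3` returns the same product after 3 iterations.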



Bit parallel multiplication

For this case, field multiplication is usually performed in two steps: polynomial multiplication followed by polynomial reduction.
The first phase consists of multiplying two polynomials of degree m − 1 to obtain a polynomial of degree 2m − 2, where the arithmetic operations are performed over $\mathbb{F}_2$.
The second phase performs modular reduction using f(x), the irreducible polynomial that generated the field.


Polynomial multiplication

Given the polynomials A and B of degree m − 1, the product C = A · B of degree 2m − 2 can be computed as

$$\begin{aligned}
C &= A \cdot B \\
  &= (a_0 + a_1 x^{\frac{m-1}{2}})(b_0 + b_1 x^{\frac{m-1}{2}}) \\
  &= a_0 b_0 + \left[(a_0 + a_1)(b_0 + b_1) + a_0 b_0 + a_1 b_1\right] x^{\frac{m-1}{2}} + a_1 b_1 x^{m-1}
\end{aligned}$$

This operation can be repeated recursively down to the bit level.
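The recursion can be sketched in software as follows. This is an illustrative model rather than the multiplier presented in the talk: the function names are hypothetical, operands are split at k = ceil(n/2) coefficients (an assumption of this sketch), and the recursion stops at 4 coefficients instead of going all the way down to single bits.

```python
def clmul_schoolbook(a, b):
    """Carry-less (F_2[x]) product by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul_karatsuba(a, b, n):
    """Karatsuba product of two polynomials over F_2 with at most n coefficients."""
    if n <= 4:                      # small operands: fall back to the schoolbook method
        return clmul_schoolbook(a, b)
    k = (n + 1) // 2                # split point: the low halves have k coefficients
    mask = (1 << k) - 1
    a0, a1 = a & mask, a >> k
    b0, b1 = b & mask, b >> k
    low = clmul_karatsuba(a0, b0, k)
    high = clmul_karatsuba(a1, b1, n - k)
    mid = clmul_karatsuba(a0 ^ a1, b0 ^ b1, k)   # (a0 + a1)(b0 + b1)
    # Over F_2 subtraction is XOR, so the middle term is mid + low + high.
    return low ^ ((mid ^ low ^ high) << k) ^ (high << (2 * k))
```

Each level replaces four half-size products with three, at the cost of a few extra XORs. For instance, `clmul_karatsuba(0x57, 0x83, 8)` yields the unreduced product `0x2B79`.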


Fully parallel Karatsuba multiplier

A fully parallel Karatsuba multiplier can compute one product every clock cycle, but at the price of a long critical path.


3-stage pipelined Karatsuba multiplier


Reduction

Usually the irreducible polynomial f is selected as a trinomial or a pentanomial.
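One way to see why sparse polynomials help is a generic shift-and-XOR reduction, sketched below under the assumption that polynomials are stored as integers with bit i holding the coefficient of x^i. The function name `gf2m_reduce` is hypothetical. Each cancelled high-order bit is folded into as many lower positions as f has non-leading terms, so a trinomial touches only two positions and a pentanomial only four.

```python
def gf2m_reduce(c, f, m):
    """Reduce polynomial c (degree up to 2m - 2) modulo f (degree m) over F_2."""
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= f << (i - m)   # cancel x^i: XOR in f(x) * x^(i - m)
    return c
```

For example, with the AES pentanomial x^8 + x^4 + x^3 + x + 1 (`0x11B`), `gf2m_reduce(0x2B79, 0x11B, 8)` gives `0xC1`; with the trinomial x^7 + x + 1 (`0x83`), x^7 reduces to x + 1. In hardware, with f fixed, the whole loop unrolls into a small network of XOR gates.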




Block-Ciphers

Let n be the block length; then a block cipher can be seen as a function
$$E : \{0,1\}^n \times \mathcal{K} \rightarrow \{0,1\}^n$$
denoted by $E(K, M) = E_K(M)$.
For each K, $E_K$ must be a permutation. So each $E_K()$ has an inverse $D_K$ such that
$$D_K(E_K(M)) = M$$
A secure block cipher is considered to be a strong pseudorandom permutation (SPRP).


Block cipher in counter mode



A block-Cipher instantiation: AES

Rijndael, designed by Daemen and Rijmen, became the industry standard in 2000, when NIST chose it as the winner of the AES contest.
AES is usually deployed to offer about 128 bits of security, which means that the plaintext, ciphertext and key lengths are all equal to 128 bits. In this mode AES performs 10 round transformations to encrypt a single block of data.


Parallel techniques in hardware: AES example



Polynomial Hash

Informally, a hash function maps a big string into a small one. Among those functions, there exists a specific type of hash called the polynomial hash

$$H : \{0,1\}^n \times \{0,1\}^{nm} \rightarrow \{0,1\}^n$$

defined as

$$H_h(P_1 \| \cdots \| P_m) = P_1 h^m \oplus P_2 h^{m-1} \oplus \cdots \oplus P_m h$$

All operations are in $GF(2^n)$; $h, P_i \in \{0,1\}^n$.

This type of function is AXU (almost XOR universal), because for any $G \in \{0,1\}^n$ and $P \neq P'$,
$$\Pr\left[h \xleftarrow{\$} \{0,1\}^n : H_h(P) \oplus H_h(P') = G\right] \leq \frac{\mathrm{maxdegree}(P, P')}{2^n}$$


Polynomial hash

Polynomial hashes are an important part of many cryptographic protocols, such as message authentication codes, authenticated encryption, tweakable enciphering schemes (TES), etc.
These schemes generally involve the computation of a univariate polynomial of degree m − 1 defined over a finite field $\mathbb{F}_q$ as

$$\mathrm{Poly}_h(X) = x_1 h^{m-1} + x_2 h^{m-2} + \cdots + x_{m-1} h + x_m,$$

where $X = (x_1, \ldots, x_m) \in \mathbb{F}_q^m$ and $h \in \mathbb{F}_q$.
Traditionally, the evaluation of $\mathrm{Poly}_h(X)$ has been done using Horner's rule, which requires m − 1 multiplications and m − 1 additions in $\mathbb{F}_q$.
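Horner's rule can be sketched as follows. To keep the example small, the field is taken to be GF(2^8) with the AES polynomial 0x11B, which is purely an assumption for illustration; the schemes discussed here typically work with much larger block sizes. The names `gf_mul` and `poly_horner` are hypothetical.

```python
def gf_mul(a, b, f=0x11B, m=8):
    """Multiply in F_2[x]/(f); the default is GF(2^8) with the AES polynomial."""
    acc = 0
    for i in range(m - 1, -1, -1):
        acc <<= 1                 # acc <- acc * x
        if (acc >> m) & 1:
            acc ^= f              # reduce when the degree reaches m
        if (b >> i) & 1:
            acc ^= a
    return acc


def poly_horner(h, xs):
    """Evaluate x_1 h^(m-1) + x_2 h^(m-2) + ... + x_m for a non-empty list xs.

    Uses m - 1 multiplications by h and m - 1 additions (XORs).
    """
    acc = xs[0]
    for x in xs[1:]:
        acc = gf_mul(acc, h) ^ x
    return acc
```

Every step depends on the previous accumulator value, so the m − 1 multiplications form one long sequential chain. This dependency is what makes Horner's rule awkward for a pipelined multiplier.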


BRW Polynomials

BRW polynomials were introduced by Bernstein in 2007, although the origin of these polynomials can be traced back to Rabin and Winograd in 1972.
Unlike the normal polynomial, they can be evaluated using only $\lfloor m/2 \rfloor$ multiplications in $\mathbb{F}_q$ and $\lceil \log_2 m \rceil$ squarings. Thus, these polynomials potentially offer a computational advantage over the normal ones.
The recursive definition of a BRW polynomial gives it a certain structure which is amenable to parallelization.
It turns out that, to take advantage of this parallel structure, one needs to carefully schedule the order of the multiplications involved in the polynomial evaluation.


BRW definition

The BRW polynomial is defined recursively as follows:

$$\begin{aligned}
\mathrm{BRW}_h() &= 0 \\
\mathrm{BRW}_h(x_1) &= x_1 \\
\mathrm{BRW}_h(x_1, x_2) &= x_1 + x_2 h \\
\mathrm{BRW}_h(x_1, x_2, x_3) &= (h + x_1)(h^2 + x_2) + x_3 \\
\mathrm{BRW}_h(x_1, x_2, \ldots, x_m) &= \mathrm{BRW}_h(x_1, x_2, \ldots, x_{t-1})(h^t + x_t) + \mathrm{BRW}_h(x_{t+1}, \ldots, x_m)
\end{aligned}$$

where $t \in \{4, 8, 16, \ldots\}$ and $t \leq m < 2t$.

The number of multiplications is given by $\lfloor m/2 \rfloor$.
Additions: $m + \lfloor \frac{m-3}{2} \rfloor$.
Squarings: $\lfloor \lg m \rfloor$.
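The recursive definition translates almost line by line into code. The sketch below follows the five cases exactly as written on the slide and counts field multiplications so that the $\lfloor m/2 \rfloor$ figure can be checked. The field GF(2^8) with polynomial 0x11B, the `stats` dictionary and all function names are assumptions of this sketch; powers of h are recomputed on demand for brevity, whereas a real design would compute each one once.

```python
def gf_mul(a, b, f=0x11B, m=8):
    """Multiply in F_2[x]/(f); the default is GF(2^8), chosen only for illustration."""
    acc = 0
    for i in range(m - 1, -1, -1):
        acc <<= 1
        if (acc >> m) & 1:
            acc ^= f
        if (b >> i) & 1:
            acc ^= a
    return acc


def brw(h, xs, stats):
    """Evaluate BRW_h(x_1, ..., x_m); stats['mul'] counts field multiplications."""
    def mul(a, b):
        stats["mul"] += 1
        return gf_mul(a, b)

    def power_of_h(t):            # h^t for t a power of two, by repeated squaring
        p = h
        while t > 1:
            p = gf_mul(p, p)      # squarings are not counted as multiplications
            t //= 2
        return p

    m = len(xs)
    if m == 0:
        return 0
    if m == 1:
        return xs[0]
    if m == 2:
        return xs[0] ^ mul(xs[1], h)
    if m == 3:
        return mul(h ^ xs[0], power_of_h(2) ^ xs[1]) ^ xs[2]
    t = 4
    while 2 * t <= m:             # the power of two with t <= m < 2t
        t *= 2
    left = brw(h, xs[:t - 1], stats)
    right = brw(h, xs[t:], stats)
    return mul(left, power_of_h(t) ^ xs[t - 1]) ^ right
```

Calling `brw(h, blocks, {"mul": 0})` with 16 blocks performs 8 counted multiplications, compared with 15 for Horner's rule on the same number of blocks.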


BRW-Polynomials

A BRW polynomial $H_h(X_1, \ldots, X_m)$ can be represented as a tree $T_m$ which contains three types of nodes, namely multiplication nodes, addition nodes and leaf nodes.
The tree $T_m$ will be called a BRW tree and can be constructed recursively, as we will discuss next.


BRW-Polynomials

We propose a framework to construct an efficient circuit to compute BRW polynomials using a pipelined multiplier.
To achieve good performance in implementations of BRW polynomials, there are two important aspects:
- Scheduling of the blocks of information, trying to keep the pipeline always full.
- The number of accumulators or registers required.


BRW-Polynomials Representation

Let us look at the BRW polynomial with 16 coefficients:

$$\begin{aligned}
\mathrm{BRW}_h(x_1, \ldots, x_{16}) = \Bigl(\bigl(&((h + x_1)(h^2 + x_2) + x_3)(h^4 + x_4) \\
&+ (h + x_5)(h^2 + x_6) + x_7\bigr)(h^8 + x_8) \\
&+ ((h + x_9)(h^2 + x_{10}) + x_{11})(h^4 + x_{12}) \\
&+ (h + x_{13})(h^2 + x_{14}) + x_{15}\Bigr)(h^{16} + x_{16})
\end{aligned}$$

The total number of operations is 8 multiplications, 4 squarings and 19 additions.


The polynomial $\mathrm{BRW}_h(x_1, \ldots, x_{16})$ can be represented as a tree $T_m$.


Some Properties of the Tree

Let $p = \lfloor m/2 \rfloor$ and let k be the label of a node.

The number of nodes in $T_m$ is p.
The number of connected components is given by the Hamming weight of p.
If bit i of p is 1, $T_m$ contains a tree of size $2^i$.
If $k \equiv 2 \pmod 4$, then k is an independent node.
If $k \equiv 0 \pmod 8$, k has at least k − 2 and k − 4 as its children.
If $k \equiv 4 \pmod 8$, k − 2 is the only child of k.


Scheduling Algorithm
Algorithm 1 Scheduling algorithm
Input: number of blocks m, number of pipe 18. Function Process(x, L, clock)
stages NS 19. Delete(L);
Output: sequence of multiplications to be done 20. y ← Parent(x);
1. Construct the collapsed forest Fm ; 21. Output x;
2. for each node x in Fm do 22. if y 6= NULL then
3. x.NC ← number of children of x; 23. y .NC ← y .NC − 1
4. x.ST ← undefined; 24. if y .NC = 0 then
5. if levelFm (x)=0 then 25. y .ST = clock
6. Insert(x, L1 ); 26. Insert(y , L2 );
7. L2 ← Empty; 27. return
8. clock ← 1;
9. while (L1 and L2 are both not empty) do
10. x ← Pop(L2 );
11. if (x 6= NULL and clock − x.ST > NS)
then
12. Process(x, L2 , clock);
13. else
14. x ← Pop(L1 );
15. if (x 6= NULL) then
16. Process(x, L1 , clock);
17. clock ← clock + 1;
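A compact Python sketch of the scheduler may help in following the pseudocode. It rests on several assumptions: nodes are the even labels 2, ..., 2⌊m/2⌋ with the parent of k obtained by adding its lowest set bit (inferred from the tree properties), L1 and L2 are FIFO queues, and the readiness test is written as `clock - ST >= NS`, which is the clock convention that reproduces the schedules tabulated in this section. The name `schedule` is hypothetical.

```python
from collections import deque


def schedule(m, ns):
    """Return the node issued at each clock (None marks an idle cycle)."""
    top = 2 * (m // 2)
    labels = range(2, top + 1, 2)
    parent = {}
    for k in labels:
        cand = k + (k & -k)
        parent[k] = cand if cand <= top else None
    nc = {k: 0 for k in labels}            # number of pending children
    for k in labels:
        if parent[k] is not None:
            nc[parent[k]] += 1
    st = {}                                # clock at which a node became ready
    l1 = deque(k for k in labels if nc[k] == 0)   # level-0 nodes
    l2 = deque()
    out = []
    clock = 1
    while l1 or l2:
        if l2 and clock - st[l2[0]] >= ns:
            x = l2.popleft()
        elif l1:
            x = l1.popleft()
        else:
            x = None                       # pipeline bubble
        out.append(x)
        if x is not None:
            y = parent[x]
            if y is not None:
                nc[y] -= 1
                if nc[y] == 0:
                    st[y] = clock
                    l2.append(y)
        clock += 1
    return out
```

For example, `schedule(16, 2)` yields the order 2, 6, 4, 10, 8, 12, 14, idle, 16, that is, nine clock cycles with one idle cycle.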

Scheduling algorithm: an example

As an example, let us consider the case m = 16 and NS = 2.

$M_1 : R_1 = (X_2 + h^2)(X_1 + h)$;
$M_2 : R_2 = (X_6 + h^2)(X_5 + h)$;
$M_3 : R_3 = (X_4 + h^4)(X_3 + R_1)$;
$M_4 : R_4 = (X_{10} + h^2)(X_9 + h)$;
$M_5 : R_5 = (X_8 + h^8)(R_3 + R_2 + X_7)$;
$M_6 : R_6 = (X_{12} + h^4)(X_{11} + R_4)$;
$M_7 : R_7 = (X_{14} + h^2)(X_{13} + h)$;
$M_8 : R_8 = (X_{16} + h^{16})(R_5 + R_6 + R_7 + X_{15})$.

This example requires 9 clock cycles to be executed (there is one delay).


Scheduling of the blocks



Table: The output of Schedule for NS = 2 for small numbers of blocks. Each entry is the node issued at that clock; − marks an idle cycle.

Blocks | Clock                                        | Total
   (m) |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 | clocks
     2 |  2                                           |  1
     4 |  2  −  4                                     |  3
     6 |  2  6  4                                     |  3
     8 |  2  6  4  −  8                               |  5
    10 |  2  6  4 10  8                               |  5
    12 |  2  6  4 10  8 12                            |  6
    14 |  2  6  4 10  8 12 14                         |  7
    16 |  2  6  4 10  8 12 14  − 16                   |  9
    18 |  2  6  4 10  8 12 14 18 16                   |  9
    20 |  2  6  4 10  8 12 14 18 16 20                | 10
    22 |  2  6  4 10  8 12 14 18 16 20 22             | 11
    24 |  2  6  4 10  8 12 14 18 16 20 22  − 24       | 13
    26 |  2  6  4 10  8 12 14 18 16 20 22 26 24       | 13
    28 |  2  6  4 10  8 12 14 18 16 20 22 26 24 28    | 14
    30 |  2  6  4 10  8 12 14 18 16 20 22 26 24 28 30 | 15


Table: The output of Schedule for NS = 3 for small numbers of blocks. Each entry is the node issued at that clock; − marks an idle cycle.

Blocks | Clock                                        | Total
   (m) |  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 | clocks
     2 |  2                                           |  1
     4 |  2  −  −  4                                  |  4
     6 |  2  6  −  4                                  |  4
     8 |  2  6  −  4  −  −  8                         |  7
    10 |  2  6 10  4  −  −  8                         |  7
    12 |  2  6 10  4  − 12  8                         |  7
    14 |  2  6 10  4 14 12  8                         |  7
    16 |  2  6 10  4 14 12  8  −  − 16                | 10
    18 |  2  6 10  4 14 12  8 18  − 16                | 10
    20 |  2  6 10  4 14 12  8 18  − 16 20             | 11
    22 |  2  6 10  4 14 12  8 18 22 16 20             | 11
    24 |  2  6 10  4 14 12  8 18 22 16 20  −  − 24    | 14
    26 |  2  6 10  4 14 12  8 18 22 16 20 26  − 24    | 14
    28 |  2  6 10  4 14 12  8 18 22 16 20 26  − 24 28 | 15
    30 |  2  6 10  4 14 12  8 18 22 16 20 26 30 24 28 | 15


Optimal Scheduling

Theorem
Let $H_h(X_1, X_2, \ldots, X_m)$ be a BRW polynomial and let $p = \lfloor m/2 \rfloor$ be the number of nodes in the corresponding collapsed tree. Let clks be the number of clock cycles taken by Schedule to schedule all nodes. Then:
1. If NS = 2 and $p \geq 3$, then clks $= p + 1$ if $p \equiv 0 \bmod 4$, and clks $= p$ otherwise.
2. If NS = 3 and $p \geq 7$, then

\[
\mathrm{clks} =
\begin{cases}
p + 2 & \text{if } p \equiv 0 \bmod 4 \\
p + 1 & \text{if } p \equiv 1 \bmod 4 \\
p + 1 & \text{if } p \equiv 2 \bmod 4 \\
p     & \text{if } p \equiv 3 \bmod 4
\end{cases}
\]
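The closed form can be compared with the "Total clocks" column of the two schedule tables. The sketch below encodes the theorem in a hypothetical helper `clks` and lists the tabulated totals, indexed by p = m/2, for the ranges the theorem covers.

```python
def clks(p, ns):
    """Clock cycles predicted by the theorem for p nodes and NS pipeline stages."""
    if ns == 2 and p >= 3:
        return p + 1 if p % 4 == 0 else p
    if ns == 3 and p >= 7:
        return p + (2, 1, 1, 0)[p % 4]
    raise ValueError("outside the range covered by the theorem")


# Totals read from the tables (key: p = m // 2, value: total clocks).
TABLE_NS2 = {3: 3, 4: 5, 5: 5, 6: 6, 7: 7, 8: 9, 9: 9, 10: 10,
             11: 11, 12: 13, 13: 13, 14: 14, 15: 15}
TABLE_NS3 = {7: 7, 8: 10, 9: 10, 10: 11, 11: 11, 12: 14,
             13: 14, 14: 15, 15: 15}

agrees = (all(clks(p, 2) == v for p, v in TABLE_NS2.items())
          and all(clks(p, 3) == v for p, v in TABLE_NS3.items()))
```

Here `agrees` evaluates to `True`: every tabulated total in range matches the formula.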

Scheduling of the blocks


Tweakable Enciphering Schemes

They are length-preserving encryption schemes.
These schemes take an extra public quantity called the tweak.
They can provide partial authentication.
A potential application area of such schemes is in-place disk encryption.
The security notion for such schemes is that of a strong pseudorandom permutation.

HMCH: A Hash-Counter Hash TES

Algorithm 2 HMCH Encryption algorithm
Input: keys h, K, tweak T and message blocks to encrypt P_1, ..., P_m
Output: ciphertext (C_1, ..., C_m)
1. β_1 ← E_K(T); β_2 ← xβ_1;
2. M_1 ← P_1 ⊕ ψ_h(P_2, ..., P_m);
3. U_1 ← E_K(M_1);
4. S ← M_1 ⊕ U_1 ⊕ β_1 ⊕ β_2;
5. for i = 2 to m do
6.     C_i ← P_i ⊕ E_K(x^{i−2} β_1 ⊕ S);
7. C_1 ← U_1 ⊕ ψ_h(C_2, ..., C_m);
8. return (C_1, ..., C_m);


HMCH circuit architecture and time diagram


Experimental Results

Table: Modes of operation on Virtex-5 device. AES-PEC: AES pipelined encryption core, AES-PDC: AES pipelined decryption core, AES-SDC: AES sequential decryption core, SOF: squares computed on the fly, SPC: squares pre-computed

Mode         | Implementation details     | Slices | Frequency (MHz) | Clock cycles | Time (ns) | Throughput (Gbit/s) | 1/(Slice × Time)
HMCH[BRW]-1  | 2 AES-PEC, 1 AES-SDC, SOF  |   8040 | 211.785         | 66           | 311.637   | 13.143              | 399.112
HMCH[BRW]-2  | 2 AES-PEC, 1 AES-SDC, SPC  |   8140 | 212.589         | 66           | 310.458   | 13.193              | 395.706
HMCH[BRW]-3  | 1 AES-PEC, 1 AES-SDC, SOF  |   6112 | 223.364         | 80           | 358.160   | 11.436              | 456.814
HEH[BRW]-1   | 2 AES-PEC, 2 AES-PDC, SOF  |  11850 | 202.856         | 55           | 271.128   | 15.170              | 311.248
HEH[BRW]-2   | 2 AES-PEC, 2 AES-PDC, SPC  |  12002 | 203.894         | 55           | 269.748   | 15.184              | 308.879
HEH[BRW]-3   | 1 AES-PEC, 1 AES-PDC, SOF  |   8012 | 218.384         | 69           | 315.957   | 12.964              | 395.020
HMCH[Poly]   | 1 AES-PEC, 1 AES-SDC       |   5345 | 225.485         | 94           | 416.879   |  9.825              | 448.789
HEH[Poly]    | 1 AES-PEC, 1 AES-PDC       |   6962 | 218.198         | 83           | 380.388   | 10.768              | 377.606
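The derived columns of the table follow from the slice count, frequency and cycle count. The sketch below recomputes them for two rows; the message length of 4096 bits is an assumption inferred from the reported throughput figures, not something stated in the table, and `derived` is a hypothetical helper.

```python
MESSAGE_BITS = 4096  # assumption: inferred from throughput x time in the table

# (slices, frequency in MHz, clock cycles) as listed in the table
ROWS = {
    "HMCH[BRW]-1": (8040, 211.785, 66),
    "HMCH[Poly]": (5345, 225.485, 94),
}


def derived(slices, mhz, cycles, bits=MESSAGE_BITS):
    """Return (time in ns, throughput in Gbit/s, 1/(slices * time in s))."""
    time_ns = cycles / mhz * 1e3           # cycles / MHz gives microseconds
    gbit_per_s = bits / time_ns            # bits per ns equals Gbit/s
    inv_area_time = 1.0 / (slices * time_ns * 1e-9)
    return time_ns, gbit_per_s, inv_area_time


results = {mode: derived(*row) for mode, row in ROWS.items()}
```

Under that assumption the recomputed values agree with the Time, Throughput and 1/(Slice × Time) entries of these two rows to the printed precision.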

Agenda
1 Context and motivation
2 Hardware design issues
3 Basic $\mathbb{F}_{2^m}$ field arithmetic building blocks
    $\mathbb{F}_{2^m}$ field arithmetic
4 Basic cryptographic building blocks
    Block ciphers
    Polynomial hash
5 BRW polynomials
6 A 256-bit multiplier over $\mathbb{F}_p$
    $\mathbb{F}_p$ field arithmetic
    Pairings and elliptic curves
    Karatsuba-like multiplier


Fp Arithmetic

$\mathbb{F}_p$ field arithmetic is of crucial importance for the performance of any cryptosystem defined over a prime field. The field elements $a, b \in \mathbb{F}_p$ are integers in the interval $[0, p - 1]$, and the basic operations are:

Addition: $a + b \bmod p$
Multiplication: $a \cdot b \bmod p$
Multiplicative inversion: $a^{-1} \bmod p$
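The three operations can be illustrated with Python integers. This is only a reference sketch with hypothetical function names; the inversion uses Fermat's little theorem, which is one of several possible methods and assumes p is prime.

```python
def fp_add(a, b, p):
    """a + b mod p for a, b in [0, p - 1], using one conditional subtraction."""
    s = a + b
    return s - p if s >= p else s


def fp_mul(a, b, p):
    """a * b mod p (multiply, then reduce)."""
    return (a * b) % p


def fp_inv(a, p):
    """a^(-1) mod p via Fermat's little theorem; requires p prime and a != 0."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return pow(a, p - 2, p)
```

For instance, with p = 103 the product `fp_mul(a, fp_inv(a, 103), 103)` equals 1 for every nonzero a.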

Field Addition

We would like to compute the sum of two k-bit integers A and B. Let $A_i$ and $B_i$, for $i = 0, 1, \ldots, k - 1$, represent the bits of the integers A and B, respectively. Then the sum bits $S_i$, for $i = 0, 1, \ldots, k - 1$, and the final carry-out $C_k$ are defined as

        A_{k-1}  A_{k-2}  ···  A_1  A_0
   +    B_{k-1}  B_{k-2}  ···  B_1  B_0
   C_k  S_{k-1}  S_{k-2}  ···  S_1  S_0


Full adders
The truth table of a full adder cell is as follows:
Ai Bi Ci Ci+1 Si
0 0 0 0 0
0 0 1 0 1
0 1 0 0 1
0 1 1 1 0
1 0 0 0 1
1 0 1 1 0
1 1 0 1 0
1 1 1 1 1

The Boolean functions of the output values are:

$C_{i+1} = A_i B_i + A_i C_i + B_i C_i$
$S_i = A_i \oplus B_i \oplus C_i$
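The two Boolean functions translate directly into code. The sketch below models a full adder cell and chains k of them into a carry propagate (ripple-carry) adder; the function names are illustrative.

```python
def full_adder(a, b, c):
    """One full adder cell: returns (carry_out, sum_bit) for input bits a, b, c."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, s


def ripple_carry_add(x, y, k):
    """Add two k-bit integers bit by bit; returns (carry_out C_k, k-bit sum)."""
    carry, total = 0, 0
    for i in range(k):
        carry, s = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return carry, total
```

In this model the carry ripples through all k cells, which is why the delay of a carry propagate adder grows linearly with the operand length.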

Carry propagate adder


Carry Save adder

Its main function is to add three k-bit integers A, B, and C to produce two integers C' and S such that

$C' + S = A + B + C$

As an example, let A = 40, B = 25, and C = 20. We compute S and C' as shown below (the carry word is shifted one position to the left):

A  = 40 =   1 0 1 0 0 0
B  = 25 =   0 1 1 0 0 1
C  = 20 =   0 1 0 1 0 0
S  = 37 =   1 0 0 1 0 1
C' = 48 = 0 1 1 0 0 0 0

The i-th bit of the sum $S_i$ and the (i + 1)-st bit of the carry $C'_{i+1}$ are calculated using the equations

$S_i = A_i \oplus B_i \oplus C_i$
$C'_{i+1} = A_i B_i + A_i C_i + B_i C_i$
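Because every bit position is handled independently, a carry save adder can be modelled with bitwise operations on whole words. The following sketch (hypothetical name `carry_save_add`) reproduces the example with A = 40, B = 25 and C = 20.

```python
def carry_save_add(a, b, c):
    """Return (carry_word, sum_word) with carry_word + sum_word == a + b + c."""
    s = a ^ b ^ c                                   # S_i = A_i xor B_i xor C_i
    carry = ((a & b) | (a & c) | (b & c)) << 1      # C'_{i+1} is the majority of bit i
    return carry, s


carry_word, sum_word = carry_save_add(40, 25, 20)   # (48, 37)
```

No carry propagates between positions, so the delay is that of a single full adder regardless of k; a carry propagate adder is only needed once, to add the final pair C' and S.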

Field multiplication

We would like to compute $C = A \cdot B \bmod p$, with $A, B, C \in \mathbb{F}_p$. There are basically four approaches for computing the product:
Multiply and then divide (possibly using fast reduction).
Interleave the steps of the multiplication and the reduction.
Brickell's method.
Montgomery's method.

The multiply-and-divide method first multiplies A and B to obtain the 2k-bit number

$C' := AB$

Then the result C' is divided (reduced) by p to obtain the k-bit number

$C := C' \bmod p$

The result C is a k-bit or s-word number.


Field multiplication: fast reduction


Montgomery Multiplier
The problem of performing a division by p is traded for divisions by r, where $r = 2^k$ with $2^{k-1} < p < 2^k$.

Figure: Montgomery p-Residues

The Montgomery product of two p-residues $\tilde{a} = a \cdot r \bmod p$ and $\tilde{b} = b \cdot r \bmod p$ is defined as

$\mathrm{MontPr}(\tilde{a}, \tilde{b}) = \tilde{a} \cdot \tilde{b} \cdot r^{-1} \bmod p$

Given its p-residue $\tilde{a}$, one can recover a by computing

$\mathrm{MontPr}(\tilde{a}, 1) = \tilde{a} \cdot 1 \cdot r^{-1} \bmod p = a \bmod p$

The values $r^{-1}$ and $p'$ can be obtained from Bézout's identity,

$r \cdot r^{-1} - p \cdot p' = 1,$

provided that $\gcd(r, p) = 1$.

Montgomery Multiplier

Input: prime p, p', $r = 2^k$ and $\tilde{a}, \tilde{b} \in \mathbb{F}_p$
Output: MontPr($\tilde{a}, \tilde{b}$)
1. $t \leftarrow \tilde{a} \cdot \tilde{b}$
2. $m \leftarrow t \cdot p' \bmod r$        (so that $m \equiv -t \cdot p^{-1} \bmod r$)
3. $u \leftarrow (t + m \cdot p)/r$        (exact, since $t + m \cdot p \equiv 0 \bmod r$; moreover $t + m \cdot p \equiv \tilde{a} \cdot \tilde{b} \bmod p$)
4. if $u \geq p$ then
5.     return $u - p$
6. else
7.     return $u$
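The algorithm can be followed step by step with Python integers. This sketch assumes p is odd, takes p' = −p⁻¹ mod r as in the annotation of step 2, and relies on Python 3.8+ for `pow(p, -1, r)`; the helper names are hypothetical.

```python
def mont_setup(p, k):
    """Return (r, p_prime) with r = 2^k and p_prime = -p^(-1) mod r."""
    r = 1 << k
    p_prime = (-pow(p, -1, r)) % r
    return r, p_prime


def mont_pr(a_t, b_t, p, k, p_prime):
    """Montgomery product a_t * b_t * r^(-1) mod p, with r = 2^k."""
    r_mask = (1 << k) - 1
    t = a_t * b_t
    m = (t * p_prime) & r_mask      # reduction modulo r is a bit mask
    u = (t + m * p) >> k            # exact division by r is a shift
    return u - p if u >= p else u


def to_mont(a, p, k):
    """p-residue of a: a * r mod p."""
    return (a << k) % p
```

A product of ordinary values is obtained by converting to p-residues, applying `mont_pr`, and converting back with `mont_pr(x, 1, ...)`. The conversions pay off when many multiplications are chained, as in an exponentiation or a pairing.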

Montgomery multiplier variants: the Separated Operand Scanning (SOS) method

The SOS method first computes the product $t = a \cdot b$ and then u.

Input: $a = (a_0, a_1, \ldots, a_{n-1})$ and $b = (b_0, b_1, \ldots, b_{n-1})$
Output: $t = a \cdot b$ with $t = (t_0, t_1, \ldots, t_{2n-1})$
1. t ← 0
2. for i = 0 → n − 1 do
3.     C ← 0
4.     for j = 0 → n − 1 do
5.         (C, S) ← t_{i+j} + a_j · b_i + C
6.         t_{i+j} = S
7.     t_{i+n} = C
8. return t

Layout of the partial products for n = 4:

                     a3   a2   a1   a0
                  ×  b3   b2   b1   b0
                     t03  t02  t01  t00
                t13  t12  t11  t10
           t23  t22  t21  t20
      t33  t32  t31  t30
 t7   t6   t5   t4   t3   t2   t1   t0

The complexity of this algorithm is $O(n^2)$.

The reduction step of the SOS method then computes u from t:

Input: $t = (t_0, t_1, \ldots, t_{2n-1})$, $p = (p_0, p_1, \ldots, p_{n-1})$ and $p'_0$, where $|p'_0| = \omega$
Output: $u \leftarrow (t + (t \cdot p' \bmod r) \cdot p)/r$
1. for i = 0 → n − 1 do
2.     C ← 0
3.     m ← t_i · p'_0 mod 2^ω
4.     for j = 0 → n − 1 do
5.         (C, S) ← t_{i+j} + m · p_j + C
6.         t_{i+j} = S
7.     ADD(t_{i+n}, C)
8. for i = 0 → n − 1 do
9.     u_i = t_{i+n}
10. return u
The number of products of this method is $2n^2 + n$.
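A word-level sketch of the SOS method is given below. It assumes ω-bit words stored least significant first, an odd modulus, and one extra word in t to hold the carry that ADD propagates; it also copies n + 1 words into u and finishes with a conditional subtraction, which is a choice of this sketch rather than something shown on the slide. The name `sos_montgomery` is hypothetical.

```python
def sos_montgomery(a, b, p, n, w=64):
    """Return (a*b*2^(-n*w) mod p, number of word products) using SOS."""
    mask = (1 << w) - 1
    words = lambda x: [(x >> (w * i)) & mask for i in range(n)]
    aw, bw, pw = words(a), words(b), words(p)
    p0_prime = (-pow(p, -1, 1 << w)) & mask     # p'_0 = -p^(-1) mod 2^w
    t = [0] * (2 * n + 1)
    products = 0
    for i in range(n):                          # multiplication: t = a * b
        c = 0
        for j in range(n):
            cs = t[i + j] + aw[j] * bw[i] + c
            t[i + j], c = cs & mask, cs >> w
            products += 1
        t[i + n] = c
    for i in range(n):                          # reduction, one word at a time
        c = 0
        m = (t[i] * p0_prime) & mask
        products += 1
        for j in range(n):
            cs = t[i + j] + m * pw[j] + c
            t[i + j], c = cs & mask, cs >> w
            products += 1
        idx = i + n                             # ADD(t_{i+n}, C) with carry ripple
        while c:
            cs = t[idx] + c
            t[idx], c = cs & mask, cs >> w
            idx += 1
    u = sum(t[i + n] << (w * i) for i in range(n + 1))
    return (u - p if u >= p else u), products
```

For n = 4 the routine performs 2·16 + 4 = 36 word products, in line with the count 2n² + n.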



Elliptic curves

E is defined by a Weierstraß equation of the form

$y^2 = x^3 + Ax + B$

E(K): set of rational points over a field K
Additive group law over E(K)
Many applications in cryptography since 1985
- EC-based Diffie-Hellman key exchange
- EC-based Digital Signature Algorithm
- ...
Interest: smaller keys than usual cryptosystems (RSA, ElGamal, ...)
Moreover, elliptic curves can be used to construct secure bilinear pairings

Group cryptography

Let $(\mathbb{G}_1, +)$ be an additively-written cyclic group of prime order $\#\mathbb{G}_1 = \ell$
P, a generator of the group: $\mathbb{G}_1 = \langle P \rangle$
Scalar multiplication: for any integer k, we have $kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$
Discrete logarithm: given $Q \in \mathbb{G}_1$, compute k such that $Q = kP$
We assume that the discrete logarithm problem (DLP) in $\mathbb{G}_1$ is hard
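To make scalar multiplication and the discrete logarithm concrete, here is a toy sketch on the small textbook curve y² = x³ + 2x + 2 over F₁₇, whose point P = (5, 1) generates a group of prime order 19. The curve, the point and the function names are illustrative choices, not taken from the slides, and the sizes are far too small for any security.

```python
P_MOD, A, B = 17, 2, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17
GEN, ORDER = (5, 1), 19         # generator and its (prime) order


def ec_add(p1, p2):
    """Affine group law; None stands for the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = O
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mul(k, pt):
    """Double-and-add computation of k * pt."""
    result, addend = None, pt
    while k:
        if k & 1:
            result = ec_add(result, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return result


def dlog_bruteforce(q, pt=GEN, order=ORDER):
    """Find k with q = k * pt by exhaustive search (feasible only for toy groups)."""
    acc = None
    for k in range(order):
        if acc == q:
            return k
        acc = ec_add(acc, pt)
    raise ValueError("point is not in the subgroup")
```

Scalar multiplication takes a number of group operations logarithmic in k, whereas the brute-force search is linear in the group order; cryptographic groups are chosen so that no substantially better generic attack than square-root time is known.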

Bilinear pairings
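Let $(\mathbb{G}_1, +)$, $(\mathbb{G}_2, +)$ be two additively-written cyclic groups of prime order $\#\mathbb{G}_1 = \#\mathbb{G}_2 = \ell$
$(\mathbb{G}_\tau, \times)$, a multiplicatively-written cyclic group of order $\#\mathbb{G}_\tau = \ell$
A non-degenerate bilinear pairing is a map

$\hat{e} : \mathbb{G}_1 \times \mathbb{G}_2 \to \mathbb{G}_\tau$

that satisfies the following conditions:
- non-degeneracy: $\hat{e}(P, P) \neq 1_{\mathbb{G}_\tau}$ (equivalently $\hat{e}(P, P)$ generates $\mathbb{G}_\tau$)
- bilinearity: $\hat{e}(Q_1 + Q_2, R) = \hat{e}(Q_1, R) \cdot \hat{e}(Q_2, R)$ and $\hat{e}(Q, R_1 + R_2) = \hat{e}(Q, R_1) \cdot \hat{e}(Q, R_2)$
- computability: $\hat{e}$ can be efficiently computed
Immediate property: for any two integers $k_1$ and $k_2$

$\hat{e}(k_1 Q, k_2 R) = \hat{e}(Q, R)^{k_1 k_2}$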

Pairings in cryptography
At first, used to attack supersingular elliptic curves
- Menezes-Okamoto-Vanstone and Frey-Rück attacks, 1993 and 1994:
  $\mathrm{DLP}_{\mathbb{G}_1} \leq_P \mathrm{DLP}_{\mathbb{G}_\tau}$, via $kP \longmapsto \hat{e}(kP, P) = \hat{e}(P, P)^k$
- for cryptographic applications, we will also require the DLP in $\mathbb{G}_\tau$ to be hard
One-round three-party key agreement (Joux, 2000)
Identity-based encryption
- Boneh–Franklin, 2001
- Sakai–Kasahara, 2001
Short digital signatures
- Boneh–Lynn–Shacham, 2001
- Zhang–Safavi-Naini–Susilo, 2004
Aggregate signatures, etc.

Barreto–Naehrig elliptic curves

Defined by the equation $E : y^2 = x^3 + b$, where $b \neq 0$. Their embedding degree k is equal to 12.
The characteristic p of the prime field, the group order r, and the trace of Frobenius $t_r$ of the curve are parametrized as

$p(t) = 36t^4 + 36t^3 + 24t^2 + 6t + 1,$
$r(t) = 36t^4 + 36t^3 + 18t^2 + 6t + 1,$
$t_r(t) = 6t^2 + 1,$

where $t \in \mathbb{Z}$ is an arbitrary integer such that $p = p(t)$ and $r = r(t)$ are both prime numbers.
For efficiency purposes, t should have a low Hamming weight. If $\log_2 t \approx 64$ then $\log_2 p \approx 256$ and the pairing achieves the 128-bit security level
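The parametrization is easy to explore numerically. The sketch below evaluates p(t), r(t) and t_r(t), and checks the relation r = p + 1 − t_r that follows from the three polynomials. The value t = 1 gives the toy primes p = 103 and r = 97; t = 2⁶² is used only to illustrate operand sizes and is not claimed to yield primes. The trial-division primality test is meant for toy sizes only, and all names are hypothetical.

```python
def bn_params(t):
    """Return (p, r, t_r) of the BN family for the integer parameter t."""
    p = 36 * t**4 + 36 * t**3 + 24 * t**2 + 6 * t + 1
    r = 36 * t**4 + 36 * t**3 + 18 * t**2 + 6 * t + 1
    tr = 6 * t**2 + 1
    return p, r, tr


def is_prime_small(n):
    """Trial division; adequate for toy parameters only."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


toy_p, toy_r, toy_tr = bn_params(1)          # 103, 97, 7
big_p, big_r, _ = bn_params(2**62)           # size illustration only
```

With t = 2⁶² the value p(t) has 254 bits, which is consistent with the rule of thumb that a 64-bit t gives a p of roughly 256 bits.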

Barreto–Naehrig Curves

Let E[r] denote the r-torsion subgroup of E. We define
$\mathbb{G}_1 = E(\mathbb{F}_p)[r]$,
$\mathbb{G}_2 \subseteq E(\mathbb{F}_{p^{12}})[r]$,
$\mathbb{G}_\tau = \mu_r \subset \mathbb{F}^*_{p^{12}}$ (i.e. the group of r-th roots of unity).
The optimal ate pairing on the BN curve E is given as

$a_{opt} : \mathbb{G}_2 \times \mathbb{G}_1 \longrightarrow \mathbb{G}_\tau$

In practice, pairing computations can be restricted to points P and Q' that belong to $E(\mathbb{F}_p)$ and $E'(\mathbb{F}_{p^2})$, respectively, where
$E'/\mathbb{F}_{p^2} : y^2 = x^3 + b/\xi$.

Field Multiplier, the crucial building block (1/3)

More than 10,000 multiplications over $\mathbb{F}_p$ and more than 5,000 over $\mathbb{F}_{p^2}$ are required for computing a pairing defined over BN curves
BN curves enjoy several useful features for computing the Montgomery reduction, namely,
- $\gcd(t, p) = 1$
- $p \equiv 1 \bmod t$, which implies $p^{-1} \bmod t = 1$
- the coefficients of the polynomial p(t) (36, 36, 24, 6, 1) are relatively small.

Field Multiplier, the crucial building block (2/3)

The above features can be better exploited if we represent the operands $a, b \in \mathbb{F}_p$ as polynomials in the variable t, i.e.,

$a(t) = \sum_{i=0}^{4} a_i t^i, \quad b(t) = \sum_{i=0}^{4} b_i t^i, \quad |a_i| = |b_i| = 64$ bits,

such that the product c is now seen as

$c(t) = a(t) b(t) = \sum_{i=0}^{8} c_i t^i \bmod p(t)$,

using the so-called polynomial version of the Montgomery product
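The change of representation amounts to writing the operand in base t. In the sketch below, `to_poly` and `from_poly` are hypothetical helpers, and t = 2⁶² + 1 is an arbitrary illustrative value that is not claimed to generate a valid BN curve.

```python
def to_poly(a, t, terms=5):
    """Coefficients (a_0, ..., a_4) with a = sum(a_i * t^i), lowest degree first."""
    coeffs = []
    for _ in range(terms - 1):
        a, c = divmod(a, t)
        coeffs.append(c)
    coeffs.append(a)            # the top coefficient takes whatever is left
    return coeffs


def from_poly(coeffs, t):
    """Evaluate the polynomial at t using Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * t + c
    return acc


t = 2**62 + 1                                   # illustrative value only
p = 36 * t**4 + 36 * t**3 + 24 * t**2 + 6 * t + 1
a = p - 12345
a_coeffs = to_poly(a, t)
```

Since a < p < 37·t⁴, the four low coefficients are below t and the top coefficient is at most 36, so every coefficient fits in a 64-bit word.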

Field Multiplier, the crucial building block (3/3)

Notice that we have traded a 256-bit integer multiplication for a degree-4 polynomial product with 64-bit integer coefficients. Hence, for this approach, it is now necessary to build a 64-bit integer multiplier as a basic building block.
For performing the degree-4 polynomial product we propose to use a Karatsuba-like multiplier [Montgomery, TC'05], using fourteen 64-bit products and forty 128-bit additions.
The new challenge is to come up with a schedule that keeps the 64-bit integer multiplier block always busy

Field Multiplier Product [using the polynomial version of the Montgomery mult]

Input: $a(t) \bmod p = a(t) = \sum_{i=0}^{4} a_i t^i$,
       $b(t) \bmod p = b(t) = \sum_{i=0}^{4} b_i t^i$,
       $p(t) = 36t^4 + 36t^3 + 24t^2 + 6t + 1$, $t = 2^n + s$
Output: $c(t) = a(t) b(t) \cdot t^{-1} \bmod p$
1. c(t) = 5-term KaratsubaProduct(a(t), b(t))                        (Polynomial Product)
2. for i = 0 to 4 do
3.     μ ← c_0 div 2^n; γ ← c_0 mod 2^n − μs
4.     g(t) ← p(t)(−γ)                                               (Montgomery Reduction Phase)
5.     c(t) ← (c(t) + g(t))/t + μ
6. for k = 0 to 1 do
7.     for i = 0 to 3 do
8.         μ ← c_i div 2^n; γ ← c_i mod 2^n − μs                     (Coefficient Reduction Phase)
9.         c_{i+1} ← c_{i+1} + μ; c_i ← γ
10. return c(t)

Proposed architecture for the field multiplication

[Figure: block diagram. Labelled blocks: Polynomial Multiplier (Initial Additions, 64 × 64 Multiplier, Second Phase of Additions), Partial Reduction of Coefficients, Montgomery Reduction, Final Reduction of Coefficients, plus Control and Sync blocks. Labelled bus widths: 64-bit inputs a(t) and b(t), 128-bit multiplier output, 80-bit reduction paths, 64-bit output c(t).]




Hierarchy of a 64-bit integer multiplier

We took the design decision of using the embedded 24 × 17 multiplier available in the Xilinx DSP48 slices.
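One way to picture how a 64 × 64 product decomposes onto 24 × 17 multipliers is to split one operand into three 24-bit limbs and the other into four 17-bit limbs, which gives twelve partial products. This split is an assumption of the sketch, suggested by the partial-product labels A0..A2 and B0..B3 used in the architecture figure of this section; the function names are hypothetical and the model ignores pipelining and carry handling.

```python
def split(x, width, count):
    """Split x into `count` limbs of `width` bits, least significant first."""
    mask = (1 << width) - 1
    return [(x >> (width * i)) & mask for i in range(count)]


def mul64_with_24x17(a, b):
    """64 x 64 -> 128-bit product built from 24 x 17 partial products."""
    a_limbs = split(a, 24, 3)       # A0, A1, A2 (3 * 24 = 72 >= 64 bits)
    b_limbs = split(b, 17, 4)       # B0, B1, B2, B3 (4 * 17 = 68 >= 64 bits)
    acc = 0
    partial_products = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            acc += (ai * bj) << (24 * i + 17 * j)   # one DSP-sized multiplication
            partial_products += 1
    return acc, partial_products
```

Each partial product A_i·B_j has weight 2^(24i + 17j); the hardware design question is then how to order and accumulate the twelve products.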

Operand representation using the Xilinx DSP48Slices


School book multiplication method


Diamond-shape multiplication method


64-bit integer multiplication architecture
Stage 1 (0 wait cycles): A2B0
Stage 2 (1 wait cycle): A2B1, A1B0
Stage 3 (2 wait cycles): A2B2, A1B1, A0B0
Stage 4 (3 wait cycles): A2B3, A1B2, A0B1
Stage 5 (4 wait cycles): A1B3, A0B2
Stage 6 (5 wait cycles): A0B3
Carries are passed between neighbouring partial products. Legend: each cell AiBi outputs (Ai × Bi) + Ch||Cl, where Ch and Cl are the incoming carries.


Programming a Xilinx DSP48Slice



5-term Karatsuba-like multiplication algorithm
Input: $a(t) = \sum_{i=0}^{4} a_i t^i$, $b(t) = \sum_{i=0}^{4} b_i t^i$
Output: $c(t) = a(t)b(t)$
1. $c(t) \,(= \sum_{i=0}^{8} c_i t^i) \leftarrow 0$;

Initial addition and product phase (each of these is a 64-bit integer multiplication):
2. p0 = a0 b0;
3. p1 = a1 b1;
4. p2 = (a0 + a1)(b0 + b1);
5. p3 = a2 b2;
6. p4 = (a0 + a2)(b0 + b2);
7. p5 = a3 b3;
8. p6 = (a2 + a3)(b2 + b3);
9. p7 = (a1 + a3)(b1 + b3);
10. p8 = (a0 + a1 + a2 + a3)(b0 + b1 + b2 + b3);
11. p9 = a4 b4;
12. p10 = (a0 + a4)(b0 + b4);
13. p11 = (a0 + a1 + a4)(b0 + b1 + b4);
14. p12 = (a2 + a4)(b2 + b4);
15. p13 = (a2 + a3 + a4)(b2 + b3 + b4);

Final addition phase (each of these is a 64-bit integer addition):
16. c0 = p0
17. c1 = p2 − p1 − p0
18. c2 = p4 + p1 − p0 − p3
19. S1 = p6 − p5 − p3
20. c3 = p8 − p7 − p4 − c1 − S1
21. c4 = p10 − p9 − p0 + p3 − p5 − p1 + p7
22. c5 = p11 − p1 − p10 − c1 + S1
23. c6 = p12 − p9 + p5 − p3
24. c7 = p13 − p5 − p12 − S1
25. c8 = p9
26. return c(t)
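The 14-product recombination can be checked against a plain schoolbook polynomial product. The sketch below is a software check only; the function names `karatsuba5` and `schoolbook` are hypothetical, and Python integers stand in for the 64-bit coefficients. Note that c4 needs the cross term a1b3 + a3b1, which equals p7 − p1 − p5, so p5 enters c4 with a minus sign.

```python
def karatsuba5(a, b):
    """Multiply two 5-coefficient polynomials using 14 coefficient products."""
    a0, a1, a2, a3, a4 = a
    b0, b1, b2, b3, b4 = b
    # Initial addition and product phase.
    p0 = a0 * b0
    p1 = a1 * b1
    p2 = (a0 + a1) * (b0 + b1)
    p3 = a2 * b2
    p4 = (a0 + a2) * (b0 + b2)
    p5 = a3 * b3
    p6 = (a2 + a3) * (b2 + b3)
    p7 = (a1 + a3) * (b1 + b3)
    p8 = (a0 + a1 + a2 + a3) * (b0 + b1 + b2 + b3)
    p9 = a4 * b4
    p10 = (a0 + a4) * (b0 + b4)
    p11 = (a0 + a1 + a4) * (b0 + b1 + b4)
    p12 = (a2 + a4) * (b2 + b4)
    p13 = (a2 + a3 + a4) * (b2 + b3 + b4)
    # Final addition phase.
    c0 = p0
    c1 = p2 - p1 - p0
    c2 = p4 + p1 - p0 - p3
    s1 = p6 - p5 - p3                        # a2*b3 + a3*b2
    c3 = p8 - p7 - p4 - c1 - s1
    c4 = p10 - p9 - p0 + p3 - p5 - p1 + p7   # p7 - p1 - p5 = a1*b3 + a3*b1
    c5 = p11 - p1 - p10 - c1 + s1
    c6 = p12 - p9 + p5 - p3
    c7 = p13 - p5 - p12 - s1
    c8 = p9
    return [c0, c1, c2, c3, c4, c5, c6, c7, c8]


def schoolbook(a, b):
    """Reference product with len(a) * len(b) coefficient multiplications."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c
```

For 5-term operands the schoolbook reference uses 25 coefficient products, against 14 here, at the cost of more additions and subtractions.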
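Initial addition phase

[Figure: the operands a(t) and b(t) are stored in two high-speed memories. On each side, multiplexers and adders (elements labelled SA0 to SA3 and SB0 to SB3) build the operands OPA and OPB that feed the 64x64 multiplier.]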

Final addition phase

[Figure: block diagram. The control unit supplies read and write addresses to the bank registers, and mux selection and input data load signals to the additions component. The 64x64 multiplier output and the bank registers feed the accumulators Acc1 and Acc2 and the register Rmem.]

[Figure: additions component. Four multiplexers select the operands A, B, C and D among the 64x64 multiplier output, Rmem, ACC1, ACC2, the bank registers R1 to R4, and zeros. Two adders form A + B and C + D (registers Ri1 and Ri2), and a subtractor sends (A + B) − (C + D) to the bank registers.]



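proposed architecture for the field multiplication

[Figure: the 64-bit inputs a(t) and b(t) enter the polynomial multiplier, which consists of the initial additions, the 64 x 64 multiplier and the second phase of additions, each with its own control. Its 128-bit outputs go to the partial reduction of coefficients and to the Montgomery reduction, which share a sync and control block. Their 80-bit results feed the final reduction of coefficients, which produces the 64-bit output c(t).]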



Montgomery reduction

Main objective: to reduce the nine 128-bit coefficients produced by the Karatsuba multiplier to five coefficients, of which the least significant one is always zero.

[Figure: Montgomery reduction datapath. Inputs C0 to C3, intermediate registers R1 and R2, a carry γ obtained from Cin div 2^n with Cin mod 2^n kept as the low part, and update equations involving the constants 6, 12, 24 and 36. Outputs C0 to C3.]



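Coefficient reduction
Main objective: to guarantee that each coefficient Ci, for i = 5, ..., 8, is smaller than the parameter t. The procedure receives four input coefficients and also produces a fifth output coefficient, which is always less than 36.

[Figure: the input Cin is split into a high part CH and a low part CL. The block computes CL − μs, accumulates through the registers R0 to R3 (with reset) and an adder with μ, and outputs S0 to S4.]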



Speed Motivations



Experimental results (1/2)

Table: Performance per stage

Stage                    Latency   Frequency (MHz)
Multiplier input         3         270
64 × 64 Multiplier       4         313
Additions                18        223.7
Montgomery Reduction     18*       230
Coefficient Reduction    15*       223.7
Total                    40        223.7

(*) Montgomery and coefficient reductions are performed in parallel



Experimental results (2/2)

Table: Comparison table

Design                    Platform   Area, Slices (DSPs)   Cycles per product   Latency   Frequency (MHz)
[Fan et al. CHES’09]      ASIC       183 KGates            23                   0         204
[Fan et al. TC’11]        Virtex 6   4014 (46)             5                    25        210
[Yao et al. eprint’11]    Virtex 6   - (36)                15                   8         250
This design               Virtex 6   1983 (12)             15                   40        224



Multiplication over Fp2

Input: X, Y ∈ Fp2, where X = x0 + x1 u and Y = y0 + y1 u, with u^2 = −5
Output: W = X · Y ∈ Fp2
1. s ← add256(x0, x1)
2. t ← add256(y0, y1)
3. d0 ← mult256(s, t)
4. d1 ← mult256(x0, y0)
5. d2 ← mult256(x1, y1)
6. d0 ← sub512(d0, d1)
7. d0 ← sub512(d0, d2)
8. w1 ← mod512(d0)
9. d2 ← 5d2
10. d1 ← sub512(d1, d2)
11. w0 ← mod512(d1)
12. return W = w0 + w1 u
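The twelve steps can be modelled in a few lines and compared with the direct formula (x0 y0 − 5 x1 y1) + (x0 y1 + x1 y0) u. This is a hedged sketch: the modulus `P` below is a stand-in chosen for convenience and is not the prime used in the talk, the function names are hypothetical, and the check only exercises the ring identity with u^2 = −5. The second sum is taken as y0 + y1, as the Karatsuba identity requires.

```python
P = 2**255 - 19  # stand-in modulus (assumption), not the prime from the talk


def fp2_mul(x, y, p=P):
    """Karatsuba-style product in Fp[u]/(u^2 + 5) with three multiplications."""
    x0, x1 = x
    y0, y1 = y
    s = x0 + x1            # add256
    t = y0 + y1            # add256
    d0 = s * t             # mult256
    d1 = x0 * y0           # mult256
    d2 = x1 * y1           # mult256
    d0 = d0 - d1 - d2      # two sub512 steps: x0*y1 + x1*y0
    w1 = d0 % p            # mod512
    d1 = d1 - 5 * d2       # u^2 = -5, so x1*y1*u^2 contributes -5*d2
    w0 = d1 % p            # mod512 (Python's % returns a value in [0, p))
    return (w0, w1)


def fp2_mul_direct(x, y, p=P):
    """Reference product with four multiplications."""
    x0, x1 = x
    y0, y1 = y
    return ((x0 * y0 - 5 * x1 * y1) % p, (x0 * y1 + x1 * y0) % p)
```

For example, (3 + 4u)(5 + 6u) = 15 + 38u + 24u^2 = −105 + 38u, so `fp2_mul((3, 4), (5, 6))` gives `(P - 105, 38)`.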



Multiplication over Fp2

Hence, the following components are required:


256-bit adders/subtractors
256-bit multipliers
reduction logic



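Multiplication over Fp2 : proposed architecture

[Figure: two Mult256 units compute X0Y0 and X1Y1, and two Add256 units compute X0 + X1 and Y0 + Y1, whose product is formed by a third Mult256 unit. Add512 and Sub512 units, one of which uses the constant p × 2^250, combine the products D0, D1 and D2 into the two values to be reduced, one of them being D0 − (D1 + D2). Two Montgomery reduction units then output W0 mod p and W1 mod p.]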



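Multiplication over Fp2 : carry-save adders
Taking advantage of the carry-save adders, we can add three operands X, Y and Z using that representation:

$S := X \oplus Y \oplus Z$
$C := (X \wedge Y) \vee (X \wedge Z) \vee (Y \wedge Z)$

[Figure: carry-save datapath. The sum word S (parts S0, S1) and the carry word C (parts C0, C1) are combined by adders into W = W1 || W0; the diagram is annotated with widths of 128, 129, 130 and 258 bits.]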
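A bitwise model makes the carry-save identity concrete: the sum word holds the per-bit XOR, the carry word holds the per-bit majority, and the three-operand sum equals S + 2C. The function names below are hypothetical, and unbounded Python integers stand in for the fixed-width hardware words.

```python
def carry_save(x, y, z):
    """Compress three operands into a sum word S and a carry word C."""
    s = x ^ y ^ z                      # bitwise sum without carry propagation
    c = (x & y) | (x & z) | (y & z)    # bitwise majority: carry out of each bit
    return s, c


def add3(x, y, z):
    """Add three operands with one carry-save step and one ordinary addition."""
    s, c = carry_save(x, y, z)
    # Each carry bit has the weight of the next bit position, hence the shift.
    return s + (c << 1)
```

Only the final `s + (c << 1)` needs a carry-propagating adder, which is why the representation is attractive when several operands have to be accumulated.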



Multiplication over Fp2 : 128-bit diamond multiplier
We once again use the diamond technique to define a 128-bit multiplier.

[Figure: diamond arrangement of the partial products AiBj, annotated with the widths 256 and 41. Each row Ei is listed below together with the number shown next to it in the diagram.]

Row   Partial products                 Label
E0    A5B0                             120
E1    A5B1 A4B0                        96
E2    A5B2 A4B1 A3B0                   72
E3    A5B3 A4B2 A3B1 A2B0              48
E4    A5B4 A4B3 A3B2 A2B1 A1B0         24
E5    A5B5 A4B4 A3B3 A2B2 A1B1 A0B0    -
E6    A5B6 A4B5 A3B4 A2B3 A1B2 A0B1    17
E7    A5B7 A4B6 A3B5 A2B4 A1B3 A0B2    34
E8    A4B7 A3B6 A2B5 A1B4 A0B3         51
E9    A3B7 A2B6 A1B5 A0B4              68
E10   A2B7 A1B6 A0B5                   85
E11   A1B7 A0B6                        102
E12   A0B7                             119
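One way to read the diamond is that every row groups the products AiBj with the same difference i − j. Assuming 24-bit words Ai and 17-bit words Bj (an assumption consistent with the 24 × 17 embedded multiplier and with the row labels being multiples of 24 and 17), the products in one row start 41 bits apart and are at most 41 bits wide, so a row is a plain concatenation and only the 13 rows need to be added. The sketch below models that reading in software; the names are hypothetical and it is not the pipelined circuit.

```python
A_BITS, B_BITS = 24, 17
A_LIMBS, B_LIMBS = 6, 8  # 6 * 24 = 144 and 8 * 17 = 136 bits, both >= 128


def split(x, width, count):
    """Split x into `count` little-endian limbs of `width` bits each."""
    mask = (1 << width) - 1
    return [(x >> (width * i)) & mask for i in range(count)]


def diamond_rows(a, b):
    """Return rows E0..E12; row k holds the products AiBj with i - j = 5 - k."""
    al = split(a, A_BITS, A_LIMBS)
    bl = split(b, B_BITS, B_LIMBS)
    rows = []
    for k in range(A_LIMBS + B_LIMBS - 1):
        d = (A_LIMBS - 1) - k
        row = 0
        for i in range(A_LIMBS):
            j = i - d
            if 0 <= j < B_LIMBS:
                # Offsets within a row differ by 24 + 17 = 41 bits and each
                # product is below 2^41, so OR is a carry-free concatenation.
                row |= (al[i] * bl[j]) << (A_BITS * i + B_BITS * j)
        rows.append(row)
    return rows


def mul128_diamond(a, b):
    """128 x 128 product as the sum of the 13 diamond rows."""
    return sum(diamond_rows(a, b))
```

In this model the 48 small multiplications are independent, and the only carry propagation happens when the 13 rows are summed.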



Multiplication over Fp2 : 128-bit diamond multiplier

[Figure: dataflow of the 128-bit multiplier. The operands A and B enter the DSP48Slices multipliers, which produce the 48 products AiBj. These are arranged into the 13 stages Ei, compressed by carry-save logic operations into a sum word S and a carry word C, and added by a 256-bit adder to give the 256-bit output.]



Multiplication over Fp2 : 256-bit multiplier
The 256-bit multiplier uses the 128-bit multiplier as its main building block. Notice that the output of this circuit consists of four 128-bit words.

$X = X_H \| X_L$, $Y = Y_H \| Y_L$, with $|X_H| = |X_L| = |Y_H| = |Y_L| = 128$ bits
Output: $P_0, P_1, P_2, P_3$, with $|P_i| = 128$ bits

[Figure: multiplexers select one half of X (XH or XL) and one half of Y (YH or YL) as inputs to the 128x128 multiplier. Its result passes through the registers R1 to R6, with zeros selectable at the multiplexers, into a 128-bit adder that yields a 130-bit partial output; a 2-bit adder handles the carry before the output words are delivered.]



Multiplication over Fp2 : Montgomery reduction
The Montgomery reduction is designed with care to avoid pipeline stalls.

[Figure: the 512-bit input T is processed in 128-bit words (T0, T1). A 128x128 multiplier fed from a ROM holding P' and a 256x256 multiplier fed from a ROM holding P are connected through a FIFO memory, delay registers and multiplexers. A ROM holding P × 2^250, an adder and three 128-bit subtractors complete the computation, and register banks hold the result W mod P.]
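For reference, the arithmetic that a Montgomery reduction unit performs can be written as the textbook REDC procedure. The sketch below is a generic software version, not the pipelined architecture of the figure: the modulus `P` is a stand-in chosen for convenience (not the prime used in the talk), R = 2^256 is an assumed radix, and the names `redc` and `P_PRIME` are hypothetical. It relies on `pow(P, -1, R)`, available in Python 3.8 and later.

```python
P = 2**255 - 19                     # stand-in odd modulus (assumption)
R_BITS = 256
R = 1 << R_BITS                     # Montgomery radix, R > P and gcd(R, P) = 1
P_PRIME = (-pow(P, -1, R)) % R      # P' = -P^(-1) mod R


def redc(t):
    """Return t * R^(-1) mod P for 0 <= t < P * R."""
    m = ((t & (R - 1)) * P_PRIME) & (R - 1)   # m = (t mod R) * P' mod R
    u = (t + m * P) >> R_BITS                 # exact division by R
    return u - P if u >= P else u             # single conditional subtraction


def to_mont(x):
    """Map x to Montgomery form x * R mod P."""
    return (x * R) % P
```

With operands in Montgomery form, `redc(to_mont(x) * to_mont(y))` is the Montgomery form of x·y, and one more `redc` brings it back to the ordinary representation.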



Performance revisited



Experimental results revisited

Table: Comparison table

Design                           Platform   Area, Slices (DSPs)   Cycles per product   Latency   Frequency (MHz)
[Fan et al. CHES’09]             ASIC       183 KGates            23                   0         204
[Fan et al. TC’11]               Virtex 6   4014 (46)             5                    25        210
[Yao et al. eprint’11]           Virtex 6   - (36)                15                   8         250
Fp design                        Virtex 6   1983 (12)             15                   40        224
Fp2 design                       Virtex 6   8754 (336)            4                    45        235
Fp mult using the Fp2 design     Virtex 6   – (144)               4                    35        235



Credits

The material presented here is an adaptation of joint work with:


Jean-Luc Beuchat LCIS, University of Tsukuba, Japan
Debrup Chakraborty CINVESTAV-IPN, Computer Science, México
Cuauhtémoc Chávez-Corona CINVESTAV-IPN, Computer Science, México
Cuauhtemoc Mancillas-López CINVESTAV-IPN, Computer Science, México
Palash Sarkar ISI, India



Thank you for your attention

Questions?
