Hardware Cryptography

This document provides an overview of hardware design issues for cryptographic algorithms. It discusses implementing crypto in hardware for performance reasons, and outlines FPGA architecture including logic cells, block RAMs, DSP slices and families from manufacturers like Xilinx. It compares FPGAs to ASICs and CPUs, and covers design methodology, critical path analysis, and latency considerations for FPGA designs.

Uploaded by

Abhimanyu Raveendran

Hardware design of cryptographic algorithms

Francisco Rodríguez-Henríquez
CINVESTAV-IPN, México
[email protected]

Tutorial Talk at Indocrypt 2012 - Sunday December 9th, 2012

Francisco Rodríguez-Henríquez Hardware design of cryptographic algorithms (1 / 120)

Outline of the talk

1 Context and motivation

2 Hardware design issues

3 Basic $\mathbb{F}_{2^m}$ field arithmetic building blocks

4 Basic cryptographic building blocks

5 BRW polynomials

6 A 256-bit multiplier over $\mathbb{F}_p$

But... why should one bother implementing crypto-schemes in hardware?

But... why should one bother implementing pairings in hardware?

- The computation is not well suited to general-purpose processors.
- There are specific targets, one of the most prominent being smart cards.
- Hardware may be the fastest and most efficient way to implement cryptographic algorithms.
- However, for a cryptographic hardware accelerator to be attractive at all, it should show a significant performance improvement over software implementations.

FPGA General architecture

Main Xilinx FPGA families

- Kintex 7: economical models for high-performance applications
- Virtex 5 and 6: some of the most sophisticated models; the look-up tables can be configured with up to six inputs
- Artix 7: some of the models include a dual-core ARM Cortex-A9
- Spartan 6: economical and simple devices
- Virtex 7: the newest family, with many embedded components working at high speed; it is still costly

Main characteristics of Xilinx FPGA families [as of 2006]

                        Virtex-5     Virtex-4    Virtex II Pro   Spartan 3 & 3E
Logic cells             up to 330K   12K-200K    3K-99K          1.7K-74K
BRAM (18 Kbits each)    576          36-512      12-444          4-104
Multipliers             32-192 [1]   32-512      12-444          4-104
DCM                     up to 18     4-20        4-12            2-18
IOBs                    up to 1200   240-960     204-1164        63-633
DSP slices              32-192       32-192      -               -
PowerPC blocks          N/A          0-2         0-2             -
Max. frequency          550 MHz      500 MHz     547 MHz         up to 300 MHz
Price (USD)             ≈ $400       from $300   from $139       from $2 up to $85

[1] 25 × 18 embedded multipliers

FPGA manufacturers and their devices

Manufacturer   FPGA family                                    Feature
Xilinx         Virtex-4 to Virtex-7, Virtex II, Spartan III   FPGA market leader; 28 nm technology
Altera         Stratix, Stratix II, Cyclone                   28 nm technology
Lattice        LatticeXP                                      ultra low power for mobile applications
Actel          Fusion, M7Fusion                               first mixed-signal FPGA
QuickLogic     Eclipse II                                     programmable-only-once FPGA
Atmel          AT40KAL                                        fine-grain reconfigurable
Achronix       Achronix-ULTRA                                 1.6 GHz - 2.2 GHz speed

General Xilinx Virtex 5 slice architecture

- Each Virtex 5 slice has four look-up tables (LUTs), eight registers and several multiplexers.
- A LUT can be configured to compute any Boolean function of 6 inputs and 1 output, or of 5 inputs and 2 outputs, or to act as a memory element holding 64 one-bit entries.
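
The reason a 6-input LUT can realize any Boolean function is that it is simply a 64-entry truth table addressed by its inputs. The sketch below models that idea in Python; the helper names `make_lut6` and `eval_lut6` are hypothetical and do not correspond to any vendor tool or primitive.

```python
def make_lut6(func):
    """Pack a 6-input Boolean function into a 64-bit truth table (an int)."""
    table = 0
    for index in range(64):
        # Input k of the LUT is bit k of the table address.
        bits = [(index >> k) & 1 for k in range(6)]
        if func(*bits):
            table |= 1 << index
    return table


def eval_lut6(table, inputs):
    """Look up the output bit for a list of six input bits."""
    index = sum(bit << k for k, bit in enumerate(inputs))
    return (table >> index) & 1


# Example: a 6-input XOR (parity) function fits in a single LUT.
xor6 = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
parity = eval_lut6(xor6, [1, 0, 1, 1, 0, 0])
```

Under this model, a wide XOR network such as those used in binary field arithmetic maps onto LUTs six inputs at a time.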

Block RAMs

Virtex devices include built-in 32-Kbit RAM blocks, called BRAMs, which are intended for storing large amounts of data. Some of their features:
- Polymorphic [programmable bus size]
- Dual port [they can perform two data reads and one write in the same clock cycle]
- Can be configured for a size of up to 4 KB

DSP Slices

DSP slices are embedded blocks equipped with the following components:
- 25 × 18 two's-complement multiplier
- 48-bit accumulator
- Pre-adders
- Single-instruction-multiple-data (SIMD) arithmetic unit
- Can generate any one of ten different logic functions of the two operands
- Execute all of these operations at an extremely high frequency

FPGAs vs. ASICs

Advantages
- They are used for fast prototyping of hardware designs
- They are reconfigurable devices
- They allow for a shorter design cycle
- They permit hardware-software co-design

Disadvantages
- They tend to consume much more power and energy than ASIC designs
- Their reconfigurability implies redundancy
- They are slower than equivalent ASIC designs

FPGAs vs. general-purpose processors

Advantages
- More often than not, FPGA designs are faster than software implementations
- Some operations, such as shifts and rotations, are almost free
- They allow for a versatile data path
- They inherently enjoy fine-grain parallelism

Disadvantages
- Designs are somewhat more difficult to code and test
- Their maximum clock frequency is about ten times lower
- Prime field arithmetic tends to be more difficult to handle

Design Methodology for FPGA designs

FPGA design issues: the circuit's critical path

Circuit's critical path: the maximum allowed clock frequency is determined by the longest (slowest) combinational path present in the circuit.
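
As a rough illustration of this relation, the maximum clock frequency is the reciprocal of the critical-path delay. The sketch below uses made-up path names and delays (they are assumptions, not figures from the talk) and ignores register setup time and clock skew.

```python
def critical_path(path_delays_ns):
    """Return (name, delay) of the slowest combinational path."""
    name = max(path_delays_ns, key=path_delays_ns.get)
    return name, path_delays_ns[name]


def max_clock_mhz(path_delays_ns):
    """Upper bound on clock frequency in MHz: 1 / (critical delay in ns) * 1000."""
    _, delay_ns = critical_path(path_delays_ns)
    return 1000.0 / delay_ns


# Hypothetical register-to-register delays, in nanoseconds.
paths = {"adder": 2.5, "mux tree": 1.2, "xor network": 4.0}
slowest = critical_path(paths)
f_max = max_clock_mhz(paths)
```

In this toy data set only the 4.0 ns path matters: shortening any other path leaves the achievable frequency unchanged.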

FPGA design issues: latency

Circuit's latency: the amount of time [often given in number of clock cycles] required for producing the first output.

Parallel techniques in hardware: pipelining

Pipelining is a natural technique for increasing parallelism. However, the designer must balance the pipeline stages carefully, trying to avoid dependencies and pipeline stalls.
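
To see why balancing matters, consider a simplified model in which a stall-free pipeline with k stages processes N independent blocks in k + N - 1 cycles, and the clock period is set by the slowest stage. The stage delays below are invented for illustration only.

```python
def pipeline_time_ns(stage_delays_ns, num_blocks):
    """Total time for a stall-free pipeline to process num_blocks blocks."""
    period = max(stage_delays_ns)                  # slowest stage sets the clock
    cycles = len(stage_delays_ns) + num_blocks - 1  # fill the pipe, then 1 block/cycle
    return cycles * period


# Two hypothetical 4-stage designs with the same total logic delay (8 ns).
balanced = pipeline_time_ns([2.0, 2.0, 2.0, 2.0], num_blocks=100)
unbalanced = pipeline_time_ns([1.0, 1.0, 1.0, 5.0], num_blocks=100)
```

Both designs contain the same amount of logic, yet in this model the unbalanced one is 2.5 times slower because every stage must wait for the 5 ns stage.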

Measures of performance in reconfigurable hardware devices

- Computational time, defined as
  $$\text{Computational time} = \frac{\text{\# of clock cycles}}{\text{clock frequency}}$$
- Throughput, defined as
  $$\text{Throughput} = \frac{\text{\# of bits processed} \cdot \text{clock frequency}}{\text{\# of clock cycles}}$$
- Latency: the number of clock cycles required for producing the first result
- Amount of hardware resources used by the design, including slices, dedicated memories, DSP slices, etc.
- Time-area product
- Power consumption, energy consumption, ...
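
The definitions above translate directly into a few lines of arithmetic. The following sketch evaluates them for a hypothetical design (128-bit blocks, 10 cycles per block, 200 MHz clock, 1500 slices); all of these numbers and the function names are assumptions chosen for illustration.

```python
def computational_time_s(cycles, freq_hz):
    """Computational time = number of clock cycles / clock frequency."""
    return cycles / freq_hz


def throughput_bps(bits, cycles, freq_hz):
    """Throughput = bits processed * clock frequency / number of clock cycles."""
    return bits * freq_hz / cycles


def time_area_product(cycles, freq_hz, slices):
    """Time-area product, here measured in slice-seconds."""
    return computational_time_s(cycles, freq_hz) * slices


# Hypothetical design parameters.
BITS, CYCLES, FREQ_HZ, SLICES = 128, 10, 200_000_000, 1500

time_s = computational_time_s(CYCLES, FREQ_HZ)
rate_bps = throughput_bps(BITS, CYCLES, FREQ_HZ)
ta = time_area_product(CYCLES, FREQ_HZ, SLICES)
```

For these assumed figures the design needs 50 ns per block and sustains 2.56 Gbit/s. The time-area product gives a single number for comparing designs that trade speed against resources.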

Design tools

Finite fields

Every prime number $p$ defines a finite field of order $p$, denoted $\mathbb{F}_p$.

The smallest finite field is $\langle \mathbb{F}_2, \oplus, \odot \rangle$, which contains only the two elements $\{0, 1\}$; its binary operations $\oplus$ and $\odot$ act as the Boolean operators XOR and AND, respectively.

Field extensions

Given a positive integer $m > 1$, the field $\mathbb{F}_{p^m}$ is a field extension of $\mathbb{F}_p$.

It can be shown that $\mathbb{F}_{p^m}$ is isomorphic to $\mathbb{F}_p[x]/(f(x))$, where $f(x)$ is a monic polynomial of degree $m > 1$, irreducible over $\mathbb{F}_p$.

We denote by $\mathbb{F}_p[x]/(f(x))$ the set of equivalence classes of the polynomials in $\mathbb{F}_p[x]$ modulo $f(x)$.

Agenda
1 Context and motivation
2 Hardware design issues
3 Basic $\mathbb{F}_{2^m}$ field arithmetic building blocks
    $\mathbb{F}_{2^m}$ field arithmetic
4 Basic cryptographic building blocks
    Block ciphers
    Polynomial hash
5 BRW polynomials
6 A 256-bit multiplier over $\mathbb{F}_p$
    $\mathbb{F}_p$ field arithmetic
    Pairings and elliptic curves
    Karatsuba-like multiplier

Arithmetic over $\mathbb{F}_{2^m}$

$f \in \mathbb{F}_2[x]$: degree-$m$ irreducible polynomial over $\mathbb{F}_2$

$$f = x^m + f_{m-1} x^{m-1} + \cdots + f_1 x + f_0$$

$\mathbb{F}_{2^m} \cong \mathbb{F}_2[x]/(f)$

$a \in \mathbb{F}_{2^m}$:

$$a = a_{m-1} x^{m-1} + \cdots + a_1 x + a_0$$

- Each element of $\mathbb{F}_2$ is stored using one bit; hence a field element of $\mathbb{F}_{2^m}$ can be represented as a vector of $m$ bits.
- Usually the irreducible polynomial $f$ is chosen to be a trinomial or a pentanomial.
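
One common software model of this bit-vector representation stores coefficient $a_i$ in bit $i$ of an integer. The sketch below assumes a toy field with $m = 4$ and the trinomial $f = x^4 + x + 1$, chosen only because it is small; the helper names are hypothetical.

```python
def poly_from_exponents(exponents):
    """Build the integer whose bit i is set for every exponent i listed."""
    value = 0
    for e in exponents:
        value ^= 1 << e
    return value


def coefficients(a, m):
    """Return [a_0, a_1, ..., a_{m-1}] for the element a."""
    return [(a >> i) & 1 for i in range(m)]


M = 4                                  # assumed extension degree
F = poly_from_exponents([4, 1, 0])     # f = x^4 + x + 1, a trinomial
a = poly_from_exponents([3, 1])        # a = x^3 + x
a_bits = coefficients(a, M)
```

Here `a` needs exactly `M` bits, while `f` needs `M + 1` because of its leading term $x^m$.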

Addition over $\mathbb{F}_{2^m}$

$$r = a + b = (a_{m-1} + b_{m-1}) x^{m-1} + \cdots + (a_1 + b_1) x + (a_0 + b_0)$$

- Coefficient-wise additions over $\mathbb{F}_2$: $r_i = (a_i + b_i) \bmod 2$
- Addition over $\mathbb{F}_2$: XOR gates
- This operation benefits directly from the parallel processing of the XOR operations
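
When field elements are stored as integers with coefficient $a_i$ in bit $i$ (an assumed software model, not the hardware itself), the coefficient-wise addition collapses into a single bitwise XOR, which mirrors the row of independent XOR gates in hardware.

```python
def gf2m_add(a, b):
    """Add two elements of F_{2^m}: all m coefficient sums are one XOR."""
    return a ^ b


a = 0b1010                # x^3 + x
b = 0b0110                # x^2 + x
r = gf2m_add(a, b)        # x^3 + x^2, since the two x terms cancel
```

No carries propagate between coefficients and no reduction modulo $f$ is needed, since the degree of the sum never exceeds that of the operands. Every element is also its own additive inverse, so subtraction is the same operation.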

Field squaring

Due to the action of the Frobenius map, squaring an element $a \in \mathbb{F}_{2^m}$ is a linear operation over binary fields:

$$a(x)^2 = \left[ \sum_{i=0}^{m-1} a_i x^i \right]^2 = \sum_{i=0}^{m-1} a_i x^{2i}$$

This can be implemented by interleaving zeroes among the polynomial coefficients:

$$\vec{a} \to (\vec{a})^2$$
$$(a_{m-1}, a_{m-2}, \ldots, a_1, a_0) \to (a_{m-1}, 0, \ldots, a_2, 0, a_1, 0, a_0)$$
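
The zero-interleaving rule can be checked in software against a generic carry-less multiplication. The sketch below assumes the integer bit-vector model (coefficient $a_i$ in bit $i$) and covers only the polynomial squaring step; the result has up to $2m - 1$ coefficients and would still have to be reduced modulo $f$. The function names are hypothetical.

```python
def square_by_interleaving(a):
    """Square a polynomial over F_2 by moving coefficient a_i to position 2i."""
    result = 0
    i = 0
    while a:
        if a & 1:
            result |= 1 << (2 * i)
        a >>= 1
        i += 1
    return result


def clmul(a, b):
    """Reference carry-less (polynomial) multiplication over F_2."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


a = 0b1011                           # x^3 + x + 1
squared = square_by_interleaving(a)  # x^6 + x^2 + 1
```

In hardware the interleaving costs no logic at all, only wiring; the subsequent reduction is where the XOR gates are spent.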

Multiplication over $\mathbb{F}_{2^m}$

Parallel-serial multiplication
- Multiplicand loaded in a parallel register
- Multiplier loaded in a shift register

Most significant coefficients first (Horner scheme)

$D$ coefficients processed at each clock cycle: $\lceil m/D \rceil$ cycles per multiplication
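
A software model of this scheme may help fix ideas. The sketch below is one possible reading of a most-significant-digit-first Horner multiplier, not necessarily the architecture presented in the talk: each loop iteration stands for one clock cycle, in which the accumulator is shifted by $D$ positions, the multiplicand times the current $D$-coefficient digit is added, and the result is reduced modulo $f$. The toy field ($m = 4$, $f = x^4 + x + 1$) and all function names are assumptions.

```python
def clmul(a, b):
    """Carry-less (polynomial) product of two F_2[x] polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod(value, f, m):
    """Reduce value modulo the degree-m polynomial f by cancelling high bits."""
    for bit in range(value.bit_length() - 1, m - 1, -1):
        if (value >> bit) & 1:
            value ^= f << (bit - m)
    return value


def horner_multiply(a, b, f, m, d):
    """Multiply a by b in F_2[x]/(f), consuming d coefficients of b per step."""
    cycles = -(-m // d)                      # ceil(m / d)
    mask = (1 << d) - 1
    r = 0
    for step in range(cycles - 1, -1, -1):   # most significant digit first
        digit = (b >> (step * d)) & mask
        # One "clock cycle": shift the accumulator by d, add a * digit, reduce.
        r = reduce_mod((r << d) ^ clmul(a, digit), f, m)
    return r, cycles


M, F = 4, 0b10011                # assumed toy field: f = x^4 + x + 1
product, cycles = horner_multiply(0b1010, 0b0101, F, M, d=2)
```

Larger values of `d` reduce the cycle count at the price of a wider partial-product and reduction network per cycle, which is the usual area/time trade-off for this family of multipliers.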