
Design and Analysis of An FPGA-based Multi-Processor HW-SW Syste

This document describes a thesis by Andrew Fitzgerald submitted to the Rochester Institute of Technology. The thesis proposes designing and analyzing an FPGA-based system containing two isolated Nios softcore processors that share data through two AES crypto-engines operating in Galois/Counter mode. The goal is to determine combinations of microprocessor features and crypto-engine architectures that optimize for performance, hardware usage, or a balance. Related work on high-performance and minimal-hardware AES architectures, Galois/Counter mode, and FPGA-based single-chip cryptography is also reviewed. The system aims to employ FPGA security primitives to ensure proper component isolation when implemented.


Rochester Institute of Technology

RIT Scholar Works


Theses Thesis/Dissertation Collections

2010

Design and analysis of an FPGA-based, multi-processor HW-SW system for SCC applications
Andrew Fitzgerald

Follow this and additional works at: http://scholarworks.rit.edu/theses

Recommended Citation
Fitzgerald, Andrew, "Design and analysis of an FPGA-based, multi-processor HW-SW system for SCC applications" (2010). Thesis.
Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by the Thesis/Dissertation Collections at RIT Scholar Works. It has been accepted for inclusion
in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].
Design and Analysis of an FPGA-based, Multi-processor
HW-SW System for SCC Applications

by

Andrew F. Fitzgerald

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of


Master of Science in Computer Engineering
Supervised by
Dr. Marcin Łukowiak
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, New York
November 2010

Approved By:

Dr. Marcin Łukowiak


Assistant Professor, RIT Department of Computer Engineering
Primary Adviser

Dr. Michael Kurdziel


Senior Engineering Manager, Harris Corporation
Committee Member

Dr. Pratapa Reddy


Professor, RIT Department of Computer Engineering
Committee Member
Thesis Release Permission Form

Rochester Institute of Technology


Kate Gleason College of Engineering

Title: Design and Analysis of an FPGA-based, Multi-processor HW-SW


System for SCC Applications

I, Andrew F. Fitzgerald, hereby grant permission to the Wallace Memorial Library to
reproduce my thesis in whole or part.

Andrew F. Fitzgerald

Date
Abstract

The last 30 years have seen an increase in the complexity of embedded systems from a
collection of simple circuits to systems consisting of multiple processors managing a wide
variety of devices. This ever increasing complexity frequently requires that high assurance,
fail-safe, and security design techniques be applied to protect against possible failures and
breaches. To facilitate the implementation of these embedded systems in an efficient way,
the FPGA industry recently created new families of devices. New features added to these
devices include anti-tamper monitoring, bit stream encryption, and optimized routing ar-
chitectures for physical and functional logic partition isolation. These devices have high
capacities and are capable of implementing processors using their reprogrammable logic
structures. This allows for an unprecedented level of hardware and software interaction
within a single FPGA chip. High assurance and fail-safe systems can now be implemented
within the reconfigurable hardware fabric of an FPGA, enabling them to maintain flexibility
and achieve high performance while providing a high level of data security.
The objective of this thesis is to design and analyze an FPGA-based system containing
two isolated, softcore Nios processors that share data through two crypto-engines. FPGA-
based Single Chip Cryptographic (SCC) techniques will be employed to ensure proper
component isolation when the design is placed on a device supporting the appropriate se-
curity primitives. Each crypto-engine will be an implementation of the Advanced Encryp-
tion Standard (AES), operating in Galois/Counter mode (GCM) for high performance and
authentication. The features of the microprocessors and architectures of the AES crypto-
engines will be varied with the goal of determining combinations which best target high
performance, minimal hardware usage, or a combination of the two.

Acknowledgements
This thesis work has been performed in association with the Blu Lockdown project, a joint
project between Harris Corporation and Rochester Institute of Technology. I would like to
thank the Harris team: Brian Boorman, Duncan Harris, Michael Kurdziel, Christopher
Mackey, William Skiba, and Ken Smith Jr for their direction on this project. I would also
like to thank Bob Spurr and Rodney Frazer from Altera Corporation for their technical
advice and support.

Contents

Abstract

Acknowledgements

1 Introduction
1.1 Project Description

2 Essential Background
2.1 Block Cipher Algorithms
2.2 Advanced Encryption Standard
2.2.1 SubBytes
2.2.2 ShiftRows
2.2.3 MixColumns
2.2.4 AddRoundKey
2.2.5 Key Schedule
2.2.6 Decryption
2.3 Block Cipher Modes of Operation
2.3.1 Confidentiality Modes
2.3.2 Authentication Mode
2.3.3 Authenticated Encryption Modes
2.4 Field Programmable Gate Arrays

3 Mathematical Background
3.1 Finite Fields
3.1.1 Preliminary Concepts
3.1.2 Finite Fields Extensions
3.1.3 Basis and Representation
3.2 Field Isomorphisms
3.3 Composite Fields applied to AES

4 Related work
4.1 High Performance Architectures for AES
4.1.1 Composite Field S-Box Techniques
4.1.2 Composite Field S-Box Techniques and Pipelining
4.1.3 Precomputation Components
4.2 Minimal Hardware Architectures for AES
4.2.1 Reduced Datapath Width
4.2.2 Substructure Sharing
4.2.3 Normal Basis Techniques
4.3 Galois/Counter Mode Architectures
4.4 FPGA-based Single-chip Cryptography

5 System Architecture
5.1 Component Hardware Design
5.1.1 AES Key Schedule
5.1.2 AES Encryption
5.1.3 AES S-box
5.1.4 GHASH
5.1.5 GCM
5.2 System Organization
5.2.1 High-Performance Design
5.2.2 Small Area Design
5.2.3 Balanced Performance Design
5.3 Software Design
5.3.1 Organization
5.3.2 Build process
5.3.3 Download and execution

6 Evaluation
6.1 Verification
6.2 Performance
6.3 Security Costs and Analysis

7 Conclusions
7.1 Suggestions for Future Work

Acronyms

References

1. Introduction
Cryptography, in the broadest sense, is the study and practice of techniques related to
aspects of information security. Premodern work in cryptography focused on providing
message confidentiality, that is, on creating methods that allow two parties to
communicate such that their messages can be deciphered only by the intended recipient.
Contemporary work in cryptography concerns itself with additional concepts, such as
message authentication, integrity, and non-repudiation. The material presented in this thesis
contributes to the advanced study of modern efforts in high security and computational
efficiency.

1.1 Project Description


The primary goal of this thesis is to design and analyze a field programmable gate array
(FPGA)-based, multi-processor HW-SW system. Implementations will be bound to the
restrictions that such a system requires. FPGA-based Single Chip Cryptographic (SCC)
design strategies will be employed to logically and securely separate the design into
crypto-subsystems. For example, a particular subsystem designated to process only encrypted
data will not have physical access to components or signals directly providing unencrypted
data. Validation of the correct operation of a particular hardware crypto-engine will be
done within the example secure communications system. Redundancy of components will
not be a focus of this thesis project.
The final test platform will include two Altera Nios II processors that are connected
through a crypto-engine, similar to Figure 1.1. The processor in the red subsystem partition
will process unencrypted data, while the processor in the black subsystem partition
will process only encrypted data. The logic partitions in the figure are surrounded by a
border of unused logic blocks, labeled 2, making them physically independent of other
partitions. The signals passing between partitions cross through security routing interfaces
(SRIs), labeled 1 and 3. Each signal must have appropriate security levels to be properly
routed through the interface. Red signal connections in the figure indicate possible
transmission of unencrypted data, while black signal connections indicate the data is encrypted.
The data signals between the black subsystem and the crypto-engines could pass through
unencrypted logic since the data is already encrypted. As an additional safeguard, these
signals are passed through SRIs.

[Figure 1.1: Dataflow diagram of FPGA-based, multi-processor HW-SW system with AES
GCM encryption and decryption engines. The diagram shows an Altera Cyclone III FPGA
divided into a red subsystem partition and a black subsystem partition, each containing a
Nios II processor core, a system interconnect fabric, off-chip communication I/O, and GCM
encrypt and decrypt interface registers. Between them sit a GCM encryption engine partition
(red and black interface registers, GCM controller, AES encrypt, GF(2^128) multiplier) and
a GCM decryption engine partition (red and black interface registers, GCM controller,
GF(2^128) multiplier, AES decrypt). Legend:
1) SRI connecting two secured regions
2) a fence of unused logic around a secured partition
3) two SRIs containing signals relating only to the encrypted data]

The Nios II processor itself has many features that can be added or removed based on
need [1]. This feature set will be one of the variables in the test platform. For example, the
performance benefits of adding caching could be weighed against the area requirement on
the entire test platform.
The target FPGA for this project will be an Altera Cyclone III EP3C120F780C7N.
Thus the tools for physical separation of FPGA logic will be provided by Altera’s Quartus
II software [2]. This particular FPGA does not offer the same physical security protections
available in the Cyclone III LS series FPGA, but will allow testing of the correctness of the
designed system’s output.
The crypto-engine will be an implementation of the Advanced Encryption Standard (AES),
described in section 2.2. AES is the most applicable block cipher candidate to use in such a
system because it has been approved by the U.S. government for use in securing classified
material [3].
A thorough analysis of several different AES implementations will be performed. At a
high-level, the AES implementations will vary architecturally. Targeted architectures will
include intuitive round iterative designs, as well as unrolled designs for higher performance.
The operating mode will target GCM for high performance and authentication.
The key length will be varied between 128 and 256 bits. Increasing concern has been
expressed over the strength of the larger bit variants of AES, particularly in the key schedule
[4]. This thesis will provide comparisons of performance and FPGA reconfigurable logic
usage (area) between the two key sizes. A designer could use this information to make
an informed decision in favor of a particular implementation. Such a decision could be
reached by comparing the potential increase in security with a larger key length versus the
increase in utilized FPGA area or impact to throughput.
Internal architectural improvements will be explored for the SubBytes transformation.
Techniques to improve the throughput and reduce the amount of utilized FPGA logic will be
implemented. In particular, a composite field approach, subsection 4.1.1, will be compared
with the standard lookup table implementation, section 4.1. Previous work shows that
this technique can be used to decrease the critical path in the longest stage in a pipelined
implementation to increase throughput [5]. These designs will be analyzed in conjunction
with high-level architectural strategies.
The performance of each design will be documented in terms of throughput and com-
pared with the amount of reconfigurable logic used in an FPGA. The ratio of throughput
to hardware area will not include any on-chip memory that is utilized in the FPGA. The
total amount of on-chip memory used will be documented. The reader should not use the
throughput to area ratio for design comparisons alone, since the ratio does not include the
utilized on-chip memory and can be misleading. An effort will be made to provide an es-
timated total of on-chip memory if it were reconfigurable logic, leading to a more realistic
estimate of total FPGA area usage.

2. Essential Background
This introductory chapter elucidates the major background topics involved in this thesis. A
description of general block ciphers is given. The focus is then narrowed to AES. Several
modes of operation of block ciphers are described for completeness. The pertinent features
of modern FPGA devices are described along with two different softcore processors.

2.1 Block Cipher Algorithms


A block cipher is an algorithm that accepts data input of a fixed n-bit length and, with
the addition of a secret k-bit key value, produces n bits of encrypted output based on the
two inputs. The block cipher’s encryption algorithm is paired with a decryption algorithm
that, when using the identical k-bit key value, can reproduce the original data input. The
n-bit length defines the block size of the cipher. A block cipher is known as a symmetric
key algorithm because it requires the same key for decryption that was used during the
encryption process.
The data input to this cipher, known as plaintext, can be any type of information that
can be represented in the form of a binary string of data. The manner in which the plaintext
is presented to a block cipher is defined by the operational mode in which the cipher is
used. These modes of operation [6, 7, 8, 9] are necessary for improved security or
performance when the amount of plaintext is larger than the size of a single block [10]. The
encrypted output of a block cipher is known as ciphertext. Equation 2.1 through Equation 2.4
illustrate the invertible nature of the encryption and decryption functions, labeled $E_k$ and
$D_k$ respectively, with a key k, plaintext P, and ciphertext C [11].

$$ E_k(P) = C \tag{2.1} $$

$$ E_k^{-1}(E_k(P)) = E_k^{-1}(C) \tag{2.2} $$

$$ P = E_k^{-1}(C) \tag{2.3} $$

$$ P = D_k(C) \tag{2.4} $$

Underlying component algorithms in block ciphers make use of well-known mathematical
principles. The focus of this work, the Advanced Encryption Standard [12], is
composed of both linear and non-linear functions that serve to make the ciphertext
unintelligible to anyone without possession of the key. The regularity in the structure comes from
component pieces that are iterated over multiple times in one encryption or decryption
operation. Due to the symmetric nature of AES, the component pieces can often be reused
between encryption and decryption.

2.2 Advanced Encryption Standard


In the early 1970s, the National Bureau of Standards (NBS), later renamed the National
Institute of Standards and Technology (NIST), found a growing need for a government
standard for securing sensitive documents. After two rounds of proposal submissions, a
candidate from IBM was selected as the winning algorithm in November of 1976. This
winning algorithm is a direct descendant of IBM’s Lucifer cipher [13].
The winning algorithm from IBM became known as the Data Encryption Standard
(DES) and there have been several reaffirmations of it to date [14]. In each of the doc-
uments, the block cipher is referred to as the Data Encryption Algorithm (DEA). For the
remainder of this thesis document, the initialisms DEA and DES will be used interchange-
ably to refer to the block cipher algorithm proposed by the Federal Information Processing
Standard (FIPS) publication [14].
Due to the improvements in cryptanalysis and technology allowing for a feasible brute
force search of the DES keyspace, NIST began a second competition for a standard block
cipher algorithm on January 2, 1997 [13]. Moreover, NIST specified the algorithm should
be “as secure as triple-DES, but much more efficient” [15]. In this competition the algo-
rithm’s block size was required to be 128 bits and was required to support keys of lengths
128, 192, and 256 bits. Of the five final round candidates, Rijndael was noted as having
the best balance of security, performance, efficiency, implementability and flexibility. For
these reasons, NIST adopted Rijndael as the Advanced Encryption Standard on November
26, 2001 [12]. Several changes were made to the original algorithm, such as fixed block
sizes instead of variable block sizes, before it was approved as a standard [13]. AES is the
first public algorithm approved for use in systems intended to protect classified data [3].
AES is an iterated product cipher that maintains an internal 128-bit state. The internal
state is initialized to the value of the plaintext block before being added to the initial
round key. During a round of AES, three key-independent operations, also called
transformations, are performed before the round key is added to the internal state. The four
round operations are therefore SubBytes, ShiftRows, and MixColumns, followed by
AddRoundKey. The number of iterations of the round operations depends on the length of
the key. As in [12], Nr will be used to represent the number of rounds and Nk will be used
to represent the number of 32-bit words in the input key. When Nk is 4, 6, or 8, Nr is 10,
12, or 14, respectively. The input key is expanded into Nr round keys, also called subkeys.
The general round structure of AES is shown in Figure 2.1.
Both the plaintext input to the algorithm and the internal state size are 128 bits, or four
32-bit words. [12] states that Nb is the number of column words in the state. Since this
value is fixed by the specification, Nb will always be 4 words.
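
The relationship between key length, Nk, and Nr can be written down in a few lines. The following is only an illustrative sketch: the function name `aes_parameters` is hypothetical, and the relation Nr = Nk + 6 is taken from the AES specification [12] rather than from the text above.

```python
# Sketch of the AES parameter relationships (names are illustrative only).
NB = 4  # state width in 32-bit words, fixed by the specification


def aes_parameters(key_bits):
    """Return (Nk, Nr) for a 128-, 192-, or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES keys are 128, 192, or 256 bits long")
    nk = key_bits // 32  # key length in 32-bit words
    nr = nk + 6          # gives 10, 12, or 14 rounds
    return nk, nr


for bits in (128, 192, 256):
    nk, nr = aes_parameters(bits)
    # The last column is the expanded key length in words: Nb * (Nr + 1).
    print(bits, nk, nr, NB * (nr + 1))
```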

2.2.1 SubBytes

SubBytes is the only nonlinear operation performed in the entire AES algorithm. A large
portion of the work in this thesis is devoted to this particular round function, due to
several possible inter-round optimizations. SubBytes operates by independently exchanging
each byte of the internal state with a value from a substitution box (S-box), as shown in
Figure 2.2. The S-box has no fixed points, meaning that no output byte is equal to its
input byte, which implies it is a derangement.

[Figure 2.1: The general round structure of AES. The plaintext block first passes through
AddRoundKey with K0. Rounds 1 through Nr-1 each apply SubBytes, ShiftRows, MixColumns,
and AddRoundKey with K1 through KNr-1. Round Nr applies SubBytes, ShiftRows, and
AddRoundKey with KNr to produce the ciphertext block.]

While the act of substituting values in the state a byte at a time is somewhat simple
in concept, the derivation of the substitution box values is more complicated. SubBytes
consists of two major operations, each with its own particular design criteria. The first
operation on the state byte is the calculation of its inverse in GF(2^8), labeled G, followed by
an affine transform, labeled A. Where a is an input byte, Equation 2.5 shows the operations
performed by the SubBytes round.

[Figure 2.2: SubBytes effect on the internal state [12]. Each state byte S(r,c) is passed
through the S-box to produce the corresponding output byte S'(r,c).]

$$ \mathrm{Sbox}[a] = A(G(a)) \tag{2.5} $$

The calculation of the inverse is the source of the algorithm’s nonlinearity. According
to [15], two important design criteria for the non-linear stage operation are that it has a
small maximum input to output correlation and a small difference propagation probability.
These are two factors that have an impact on the linear and differential cryptanalysis of the
algorithm. The affine transform gives the S-box some algebraic complexity within GF(2^8)
[15].
The affine transform is defined bitwise over GF(2). For i ∈ {0, 1, . . . , 7}, the resultant
bit $a'_i$ is computed from the input bits $a_i$ as:

$$ a'_i = a_i \oplus a_{(i+4) \bmod 8} \oplus a_{(i+5) \bmod 8} \oplus a_{(i+6) \bmod 8} \oplus a_{(i+7) \bmod 8} \oplus c_i \tag{2.6} $$

The constant value $c_i$ used in Equation 2.6 is defined as the ith bit of the constant byte $63_{16}$.
This bitwise equation is a linear transformation followed by a translation by the constant bit.
This relationship can also be expressed as a matrix multiplication and the binary addition
of the constant $63_{16}$ expressed as a binary matrix [12]. Equation 2.7 shows the affine
transformation for an entire byte in matrix form.

$$ a' = \sigma \times a \oplus c $$

$$
\begin{bmatrix} a'_7 \\ a'_6 \\ a'_5 \\ a'_4 \\ a'_3 \\ a'_2 \\ a'_1 \\ a'_0 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\times
\begin{bmatrix} a_7 \\ a_6 \\ a_5 \\ a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix}
\oplus
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}
\tag{2.7}
$$

Even though the derivation of the substitution box values is more complicated than a
simple look up table, certain implementation techniques can be exploited that result in a
design with higher throughput. The calculations for the hardware Galois field inversion can
be found in chapter 4.
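
The two-step derivation of the S-box values described above can be sketched in software. This is a minimal illustration, not the hardware method developed in this thesis: it computes the inverse by brute-force exponentiation, and it assumes the AES field polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) from the specification [12], which has not been introduced at this point in the text. The function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:      # reduce when the degree reaches 8
            a ^= 0x11B
    return result


def gf_inverse(a):
    """Return a^254, the multiplicative inverse in GF(2^8); 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(a):
    """Apply the bitwise affine transform of Equation 2.6 with constant 0x63."""
    out = 0
    for i in range(8):
        bit = ((a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8))
               ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out


# Equation 2.5: Sbox[a] = A(G(a))
SBOX = [affine(gf_inverse(a)) for a in range(256)]

print(hex(SBOX[0x00]), hex(SBOX[0x53]))
```

Because the table is a permutation of the 256 byte values with no fixed points, it can be checked against the derangement property noted earlier.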
The inverse SubBytes round applies the inverse of the two intra-round operations. The
finite field inversion is its own inverse, so no changes need be applied to its internal
algorithm. The affine transform, however, must be inverted. The two operations must be
applied in the opposite order from the forward SubBytes round operation, as illustrated in
Equation 2.8.

$$ \mathrm{Sbox}^{-1}[a] = G(A^{-1}(a)) \tag{2.8} $$

As with the forward SubBytes, inverse SubBytes can be described as a matrix multiplication
followed by the addition of an inverted constant [12]. This is shown in Equation 2.9.

$$ a' = \sigma^{-1} \times a \oplus c^{-1} $$

$$
\begin{bmatrix} a'_7 \\ a'_6 \\ a'_5 \\ a'_4 \\ a'_3 \\ a'_2 \\ a'_1 \\ a'_0 \end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}
\times
\begin{bmatrix} a_7 \\ a_6 \\ a_5 \\ a_4 \\ a_3 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix}
\oplus
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 1 \end{bmatrix}
\tag{2.9}
$$
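
As a quick check that the matrix of Equation 2.9 really undoes Equation 2.7, the rows can be read off as bit taps and tested over all 256 byte values. The sketch below is illustrative only. The tap positions (i+2, i+5, i+7) and the constant 0x05 are read from the rows and constant column of Equation 2.9, and the function names are hypothetical.

```python
def affine(a):
    """Forward affine transform (Equation 2.6), constant 0x63."""
    out = 0
    for i in range(8):
        bit = ((a >> i) ^ (a >> ((i + 4) % 8)) ^ (a >> ((i + 5) % 8))
               ^ (a >> ((i + 6) % 8)) ^ (a >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out


def inverse_affine(a):
    """Inverse affine transform read from Equation 2.9, constant 0x05."""
    out = 0
    for i in range(8):
        bit = ((a >> ((i + 2) % 8)) ^ (a >> ((i + 5) % 8))
               ^ (a >> ((i + 7) % 8)) ^ (0x05 >> i)) & 1
        out |= bit << i
    return out


# Every byte should survive a forward transform followed by its inverse.
round_trip_ok = all(inverse_affine(affine(a)) == a for a in range(256))
print(round_trip_ok)
```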

2.2.2 ShiftRows

The ShiftRows round operation cyclically shifts each row in the state matrix by a constant
offset. The offset by which each row is shifted is unique to each row. This was a design
requirement defined in [15] to provide resistance against linear and differential
cryptanalysis. The shift offsets are equal to the row index when the row indices begin at 0.
The first row is shifted by zero bytes, the second row by one byte, and so on. Figure 2.3
illustrates the ShiftRows operation.

[Figure 2.3: The state matrix on the left is transformed by the ShiftRows operation to be the
matrix shown on the right. Row 0 (S0,0 S0,1 S0,2 S0,3) is unchanged, row 1 becomes
(S1,1 S1,2 S1,3 S1,0), row 2 becomes (S2,2 S2,3 S2,0 S2,1), and row 3 becomes
(S3,3 S3,0 S3,1 S3,2).]

The inverse ShiftRows operation cyclically shifts the bytes in each row in the opposite
direction from the forward ShiftRows operation. This is the only difference between the two
operations. The shift offsets are the same.
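
A small sketch of both directions follows. It assumes the state is held as a list of four rows of four entries, which is a representation chosen here for readability and not one prescribed by the text. The function names are hypothetical.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r of a 4x4 state right by r positions."""
    return [row[len(row) - r:] + row[:len(row) - r] for r, row in enumerate(state)]


# Label each entry by its (row, column) position, as in Figure 2.3.
state = [[(r, c) for c in range(4)] for r in range(4)]
for row in shift_rows(state):
    print(row)
```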

2.2.3 MixColumns

The MixColumns round operation is the only round operation where high performance is
not trivial to obtain [15]. An important design criterion for this stage was having the internal
state columns be exactly four bytes. This was done in order to increase performance via
look up tables on 32 bit architectures. Two other design criteria, linearity over GF(2) and
relevant diffusion, were added in response to the criteria for the wide trail design strategy
[15].
During this round, each column in the state matrix is multiplied by a fixed polynomial
modulo $x^4 + 1$. This operation is accomplished by first creating a polynomial with
coefficients that are the bytes in the current column in the state matrix. More about this
representation can be found in section 3.1. This column polynomial is multiplied by the
fixed polynomial, labeled c(x), shown below:

$$ c(x) = 03_{16}\,x^3 + 01_{16}\,x^2 + 01_{16}\,x + 02_{16} \tag{2.10} $$

This operation can be represented as a matrix multiplication, shown in Figure 2.4.

[Figure 2.4: The MixColumns operation multiplies a column in the state by a constant
matrix [15]. The rows of the constant matrix are (02 03 01 01), (01 02 03 01),
(01 01 02 03), and (03 01 01 02), and the matrix is applied to each four-byte column
of the state in turn.]

The inverse MixColumns operation follows the same order of operations as MixColumns,
however the constant polynomial is inverted. The polynomial in Equation 2.10 is relatively
prime to the modulus $x^4 + 1$, thus it is invertible. The inverse polynomial is shown below in
Equation 2.11.

$$ c^{-1}(x) = \mathrm{0B}_{16}\,x^3 + \mathrm{0D}_{16}\,x^2 + 09_{16}\,x + \mathrm{0E}_{16} \tag{2.11} $$
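
The column multiplication and its inverse can be illustrated with a short sketch. It assumes the AES field polynomial 0x11B from the specification [12] for the byte arithmetic, and it builds the circulant matrix from the coefficients of c(x) and of its inverse. The helper names and the sample column are chosen for illustration only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def mul_column(first_row, column):
    """Multiply a 4-byte column by the circulant matrix with the given first row."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            # Row i of a circulant matrix is the first row rotated right by i.
            acc ^= gf_mul(first_row[(j - i) % 4], column[j])
        out.append(acc)
    return out


def mix_column(column):
    """Forward MixColumns on one column (first matrix row 02 03 01 01)."""
    return mul_column((0x02, 0x03, 0x01, 0x01), column)


def inv_mix_column(column):
    """Inverse MixColumns on one column (first matrix row 0E 0B 0D 09)."""
    return mul_column((0x0E, 0x0B, 0x0D, 0x09), column)


sample = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_column(sample)
print([hex(b) for b in mixed], inv_mix_column(mixed) == sample)
```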

2.2.4 AddRoundKey

The AddRoundKey (ARK) round performs a bitwise XOR between each byte of the current
round key and the corresponding byte of the state. This operation is illustrated in
Figure 2.5. The round keys have the same structure as the internal state. The round keys are
taken from the expanded key. The procedure to generate the expanded key is known as the key
schedule and will be detailed in subsection 2.2.5. AddRoundKey and its inverse are the same
operation.

[Figure 2.5: ARK operates by performing a bitwise XOR between each byte of the state and
the generated round keys [15]. Each state byte S(r,c) is combined with the round key byte
K(r,c) to give the output byte S'(r,c).]
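
Since ARK is a plain XOR, applying it twice with the same round key restores the original state. The sketch below illustrates this with a 4x4 list-of-rows state. The representation, the function name, and the sample values are illustrative assumptions.

```python
def add_round_key(state, round_key):
    """XOR each state byte with the round key byte at the same position."""
    return [[s ^ k for s, k in zip(state_row, key_row)]
            for state_row, key_row in zip(state, round_key)]


# Arbitrary sample values, chosen only to exercise the function.
state = [[0x00, 0x11, 0x22, 0x33],
         [0x44, 0x55, 0x66, 0x77],
         [0x88, 0x99, 0xAA, 0xBB],
         [0xCC, 0xDD, 0xEE, 0xFF]]
round_key = [[0x0F] * 4 for _ in range(4)]

once = add_round_key(state, round_key)
twice = add_round_key(once, round_key)
print(twice == state)
```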

2.2.5 Key Schedule

During every ARK round, the current state is combined with a round key through a bitwise
XOR operation. This round key is a four-word portion of an expanded key derived from the
original private key. The size of the key controls the number of rounds, and each round,
together with the initial key addition, needs its own round key. The number of round keys is
therefore Nr + 1, where the first round key is taken directly from the original private key.

After the initial private key is used, Nr round keys are generated. The round keys
are generated sequentially by words. If a count is kept for every word generated, each
generated word is an XOR between the previous word and the word Nk positions before the
current word. For a 128-bit key, where Nk = Nb = 4, this is the word at the same position in
the previous round key. Additional operations are required for every word that has a count
equal to a multiple of Nk. In that case, instead of using the previous word directly in the
bitwise exclusive OR (XOR) operation, the previous word is first processed by RotWord, then
SubWord, and then XORed with a round constant. The key schedule for a 128-bit input key,
Nk = 4, is shown in Figure 2.6. If Nk = 8, there is one additional step. Whenever the word
count is congruent to 4 modulo 8, the previous word is first processed by the SubWord
operation [12]. This is shown in Figure 2.7.
The SubWord step performs the SubBytes operation from the AES encryption algorithm on each byte in the current word. The RotWord step cyclically rotates the current
word by one byte, as shown in Equation 2.12.

$$[a_1, a_2, a_3, a_4] \Rightarrow [a_2, a_3, a_4, a_1] \tag{2.12}$$

The round constants are hexadecimal 00, 01, 02, 04, 08, 10, 20, 40, 80, 1B, and 36. Three
bytes of zeros are appended to each of these round constants to make each the length of
an entire word. The nonzero constants are successive powers of an indeterminate $x$ in
GF($2^8$), obtained by repeated multiplication by $x$ [15].
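
The nonzero round constants can be regenerated by repeatedly multiplying by $x$ in GF($2^8$). The sketch below assumes the AES reduction polynomial $x^8 + x^4 + x^3 + x + 1$; the names `xtime` and `round_constants` are illustrative.

```python
def xtime(a: int) -> int:
    """Multiply a byte by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def round_constants(count: int) -> list:
    """Return the first `count` nonzero round constant bytes: x^0, x^1, x^2, ..."""
    constants = []
    value = 0x01
    for _ in range(count):
        constants.append(value)
        value = xtime(value)
    return constants


rcon_bytes = round_constants(10)
# Append three zero bytes so each constant fills a 32-bit word.
rcon_words = [rc << 24 for rc in rcon_bytes]
```

The reduction step is what turns 80 into 1B rather than an overflowed 100, which explains the last two entries in the list of constants.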
The design criteria for the key expansion operations include a non-linearity requirement.
As with the encryption and decryption algorithms, the SubBytes operations in the
SubWord step satisfy this requirement. The original private key is diffused throughout
all of the round keys through the repeated XOR operations with the previous keys. The
remaining requirement, symmetry elimination, is satisfied by the addition of the round
constants [15].
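
The key schedule described in this subsection can be sketched as follows. This is a minimal illustration rather than the implementation used in this thesis: the S-box is rebuilt from its mathematical definition (inversion in GF($2^8$) followed by the affine map) to keep the example self-contained, the word indexing follows FIPS-197, and all function names are assumptions.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sbox_entry(a):
    """One S-box entry: multiplicative inverse followed by the affine map."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):  # a^254 is the inverse of a in GF(2^8)
            inv = gf_mul(inv, a)
    out = inv
    for shift in range(1, 5):  # XOR with the inverse rotated left by 1..4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [sbox_entry(a) for a in range(256)]


def sub_word(word):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def rot_word(word):
    """Cyclically rotate a 32-bit word left by one byte (Equation 2.12)."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return the expanded key as a list of 4 * (Nr + 1) 32-bit words."""
    nk = len(key) // 4
    if len(key) % 4 or nk not in (4, 6, 8):
        raise ValueError("key must be 16, 24, or 32 bytes")
    nr = nk + 6
    words = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = words[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)
        elif nk == 8 and i % nk == 4:
            temp = sub_word(temp)  # extra SubWord step for 256 bit keys
        words.append(words[i - nk] ^ temp)
    return words


words_128 = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
words_256 = expand_key(bytes.fromhex(
    "603deb1015ca71be2b73aef0857d7781"
    "1f352c073b6108d72d9810a30914dff4"))
```

Each group of four consecutive words in the returned list forms one round key, so `words_128[4:8]` corresponds to K1 in Figure 2.6.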

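[Figure 2.6 diagram: the private key words W0 through W3 form round key K0. For each of the round keys K1 through K10, the last word of the previous round key passes through RotWord, SubWord, and an XOR with the round constant (Rcon[1] through Rcon[10]) before entering the chain of word-wise XORs that produces W0 through W3 of the new round key.]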

Figure 2.6: The key expansion operations for a 128 bit input key to create the 10 additional
round keys [16].

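[Figure 2.7 diagram: the private key words W0 through W7 form round keys K0 and K1. Each later pair of round keys, from K2 and K3 up to K13 and K14, is produced by a chain of word-wise XORs. The first word of each pair uses RotWord, SubWord, and a round constant (Rcon[1] through Rcon[7]), and the first word of the second key in the pair uses SubWord alone.]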

Figure 2.7: The key expansion operations for a 256 bit input key to create the 14 additional
round keys [16].

2.2.6 Decryption

The straightforward decryption algorithm can be created by applying the inverse round
operations in the opposite order that they were originally applied. The round keys used
during the AddRoundKey round must also be applied in the opposite order from that used during
encryption. This implies that the key schedule must process all of the round keys before
the first round of decryption can begin. Since each round key depends on the generation of
the previous one, every round key must first be generated before the final round key can exist.
It is possible to create a key schedule that generates the decryption keys in the order that
they are needed. Such an algorithm would need the final $N_k$ words from the round keys
generated during encryption [15]. Use of inverse key generation therefore implies a dependence on
forward key generation being performed at some earlier point, so that these final round key words
are available.
Due to the commutative nature of the InvSubBytes and InvShiftRows round operations,
their order can be reversed. The InvMixColumns round operations are linear with respect
to the column input, which allows the operation shown in Equation 2.13 from [12].

$$\mathrm{InvMixColumns}(\mathit{state} \oplus \mathit{round\ key}) = \mathrm{InvMixColumns}(\mathit{state}) \oplus \mathrm{InvMixColumns}(\mathit{round\ key}) \tag{2.13}$$
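
The linearity in Equation 2.13 can be checked numerically on a single column. The sketch below treats a column as a polynomial with its first byte as the constant term and multiplies it by $c^{-1}(x)$ modulo $x^4 + 1$; the function names and the random test columns are assumptions made for illustration.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficients of c^{-1}(x) from Equation 2.11, listed from x^0 up to x^3.
INV_COEFFS = [0x0E, 0x09, 0x0D, 0x0B]


def inv_mix_column(column):
    """Multiply a 4-byte column by c^{-1}(x) modulo x^4 + 1."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gf_mul(INV_COEFFS[i], column[j])
    return out


def xor_columns(a, b):
    """Bytewise XOR of two columns, as performed by AddRoundKey."""
    return [x ^ y for x, y in zip(a, b)]


rng = random.Random(0)
state_col = [rng.randrange(256) for _ in range(4)]
key_col = [rng.randrange(256) for _ in range(4)]

# Left and right sides of Equation 2.13 for one column.
lhs = inv_mix_column(xor_columns(state_col, key_col))
rhs = xor_columns(inv_mix_column(state_col), inv_mix_column(key_col))
```

The two sides agree for any pair of columns because multiplication by a constant in GF($2^8$) distributes over XOR.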

Using the linear properties of the different rounds, an equivalent inverse cipher can be
created. These two inversion algorithms are shown in Figure 2.8 for a 128 bit private key.
The equivalent inverse cipher performs the rounds in the same order as
their counterparts in the encryption algorithm, which allows certain
hardware structures to be reused between the two algorithms.
There are several operations during decryption that cause it to have lower performance
than the encryption algorithm, such as the additional application of InvMixColumns to the
round key in the equivalent inverse cipher. The designers of Rijndael accepted this because
there are operating modes where the decryption algorithm is not required. Examples of
these operating modes include certain feedback modes and the generation of a message
authentication code [17].

[Figure 2.8 diagram, two columns, both running from Ciphertext to Plaintext:
Left (straightforward inverse cipher): AddRoundKey with K10; rounds 1 through 9 each apply InvShiftRows, InvSubBytes, AddRoundKey (K9 down to K1), InvMixColumns; round 10 applies InvShiftRows, InvSubBytes, AddRoundKey with K0.
Right (equivalent inverse cipher): AddRoundKey with K10; rounds 1 through 9 each apply InvSubBytes, InvShiftRows, InvMixColumns, AddRoundKey with InvMixColumns applied to the round key (K9 down to K1); round 10 applies InvSubBytes, InvShiftRows, AddRoundKey with K0.]

Figure 2.8: The diagram on the left shows the straightforward inverse cipher algorithm,
while the one on the right shows the equivalent inverse cipher for a 128 bit key [12].

2.3 Block Cipher Modes of Operation


When a block cipher is needed for a message that is longer than one message block, a
particular mode of operation needs to be defined in order to guarantee security. There are
several different block cipher modes of operation that provide confidentiality, authentication, or both. A brief description will be given of several common modes of operation
in this section. More specific information can be found in the NIST special publications
[6, 7, 8, 9].

2.3.1 Confidentiality Modes

The most straightforward operating mode is Electronic Codebook (ECB) Mode. In ECB,
a message is divided into blocks that have the same length as the fixed block cipher input.
Each of these message blocks is encrypted individually, as shown in Figure 2.9. A weakness in this operational mode presents itself when a duplicate block is encrypted. Such a
block would encrypt to the same ciphertext when used with an identical key, allowing an
attacker to derive some information about the plaintext. An example of this weakness is
illustrated in Figure 2.10, where the output clearly reveals information about the original
image. One benefit of this mode of encryption is that an error during encryption of
one block will result in an error in only one ciphertext block [11].

[Figure 2.9 diagram: the plaintext is split into PT Block 0 through PT Block i; each block is encrypted independently by the block cipher under key K to give CT Block 0 through CT Block i, which together form the ciphertext.]

Figure 2.9: An example of encryption using a block cipher in ECB mode.
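
The duplicate-block weakness of ECB can be demonstrated with a short sketch. To stay self-contained, the example uses a truncated SHA-256 of the key and block as a stand-in for the forward direction of a block cipher. This stand-in is an assumption for illustration only: it is not AES, it is not invertible, and it is not secure.

```python
import hashlib

BLOCK_SIZE = 16  # bytes, matching the 128 bit AES block


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher's forward direction (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def ecb_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt each block independently, as in Figure 2.9."""
    if len(plaintext) % BLOCK_SIZE:
        raise ValueError("ECB requires the message to be padded to whole blocks")
    blocks = [plaintext[i:i + BLOCK_SIZE]
              for i in range(0, len(plaintext), BLOCK_SIZE)]
    return b"".join(toy_block_encrypt(key, block) for block in blocks)


key = b"demo key"
# Blocks 0 and 2 are identical; block 1 differs.
plaintext = b"A" * 16 + b"B" * 16 + b"A" * 16
ciphertext = ecb_encrypt(key, plaintext)
```

The first and third ciphertext blocks come out identical, which is the same leakage that makes the image in Figure 2.10 recognizable after encryption.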

Figure 2.10: The figure on the left [18] is encrypted using a block cipher in ECB, resulting
in the figure on the right [19].

A second common block encryption mode is the Cipher Block Chaining (CBC) mode,
detailed in Figure 2.11. In CBC, the initial plaintext block is combined with an initialization vector (IV) by way of a bitwise exclusive OR. The IV is a nonce that has the same
length as the plaintext block. This value is then processed by the block cipher, resulting
in a ciphertext block that is used in the XOR operation with the next sequential plaintext
block. Proceeding in this fashion, each ciphertext block appears randomized [15]. Due
to the dependency on the previous ciphertext block, none of the ciphertext blocks can be
computed in parallel. Decryption, however, can be computed in parallel since the ciphertext blocks are readily available. CBC is self-synchronizing, meaning that any errors in the
current operation will affect the current result and the result of the next operation, but not
the third result [6].
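[Figure 2.11 diagram: PT Block 0 is XORed with the IV and encrypted by the block cipher under key K to give CT Block 0; each later plaintext block is XORed with the previous ciphertext block before encryption, producing CT Block 1 through CT Block i.]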

Figure 2.11: An example of encryption using a block cipher in CBC mode.
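
The chaining in CBC encryption can be sketched as below. As an assumption for illustration, a truncated SHA-256 of the key and block stands in for the forward direction of a block cipher; because this stand-in is not invertible, only the encryption direction is shown.

```python
import hashlib

BLOCK_SIZE = 16  # bytes


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher's forward direction (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def cbc_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """XOR each plaintext block with the previous ciphertext block, then encrypt."""
    if len(iv) != BLOCK_SIZE or len(plaintext) % BLOCK_SIZE:
        raise ValueError("IV must be one block and the message whole blocks")
    previous = iv
    out = []
    for i in range(0, len(plaintext), BLOCK_SIZE):
        block = plaintext[i:i + BLOCK_SIZE]
        mixed = bytes(p ^ c for p, c in zip(block, previous))
        previous = toy_block_encrypt(key, mixed)  # feeds the next iteration
        out.append(previous)
    return b"".join(out)


key = b"demo key"
iv = bytes(range(16))
plaintext = b"A" * 48  # three identical plaintext blocks
ciphertext = cbc_encrypt(key, iv, plaintext)
blocks = [ciphertext[i:i + BLOCK_SIZE]
          for i in range(0, len(ciphertext), BLOCK_SIZE)]
```

Unlike ECB, the three identical plaintext blocks produce three different ciphertext blocks, and the loop-carried `previous` value is the dependency that prevents parallel encryption.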

Unlike the previous two confidentiality modes, Output Feedback (OFB) mode,
Cipher Feedback (CFB) mode, and counter mode generate a key stream that is used in a
bitwise exclusive OR operation with the message stream. In ECB and CBC, it is necessary
to pad a message with extra data so that its length is a multiple of the
cipher’s block size. In these key stream generating modes, padding is not necessary. OFB
maintains an internal state by repeatedly encrypting an IV. The output from each of these
encryptions is then exclusive-ORed with the plaintext, producing a block of ciphertext.
For the final block, bits generated by the block cipher that are in excess of the
size of the remaining plaintext are discarded [6].
Counter mode operates in similar fashion to OFB. Counter mode generates its keystream
by encrypting a counter or predictable sequence [15]. Unlike OFB, counter mode can cal-
culate ciphertext blocks in parallel since the input to the block cipher is not dependent upon
a sequentially encrypted IV.
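
Counter mode can be sketched as follows. The counter block layout (a 12-byte nonce followed by a 32-bit block index) and the truncated SHA-256 stand-in for the block cipher are assumptions made for illustration; the stand-in is not a real cipher.

```python
import hashlib

BLOCK_SIZE = 16  # bytes


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher's forward direction (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def keystream_block(key: bytes, nonce: bytes, index: int) -> bytes:
    """Encrypt one counter block; each index can be computed independently."""
    counter_block = nonce + index.to_bytes(4, "big")
    return toy_block_encrypt(key, counter_block)


def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the generated key stream."""
    if len(nonce) != 12:
        raise ValueError("this sketch assumes a 12-byte nonce")
    out = bytearray()
    for offset in range(0, len(data), BLOCK_SIZE):
        stream = keystream_block(key, nonce, offset // BLOCK_SIZE)
        chunk = data[offset:offset + BLOCK_SIZE]
        # zip stops at the chunk length, discarding excess key stream bits.
        out.extend(d ^ s for d, s in zip(chunk, stream))
    return bytes(out)


key = b"demo key"
nonce = bytes(12)
message = b"counter mode needs no padding at all!"  # 37 bytes
ciphertext = ctr_xcrypt(key, nonce, message)
recovered = ctr_xcrypt(key, nonce, ciphertext)
```

Since `keystream_block` depends only on the key, nonce, and block index, any block of the message can be processed without first processing the blocks before it.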
CFB has a similar structure to OFB, except that the sequentially generated ciphertext is
input to the next block cipher instead of a continuously re-encrypted IV. The CFB structure is
illustrated in Figure 2.12, where $s$ bits of plaintext are exclusive-ORed with the $s$ most significant bits of the $b$-bit block
cipher output for $1 \le s \le b$. For each XOR operation, the remaining $b - s$ bits of the block cipher
output are discarded. The input into each block cipher must be exactly one block of $b$ bits,
so the $s$ ciphertext bits are appended to the least significant $b - s$ bits from the previous
block cipher input [6]. CFB has an advantage over counter mode and OFB in that it is
self-synchronizing, like CBC [15]. This self-synchronizing property of CFB assures that it
will recover from an error after $b/s$ blocks of ciphertext [6].

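[Figure 2.12 diagram: the IV is encrypted by the block cipher under key K and s bits of the output are XORed with PT0 (s bits) to give CT0 (s bits); each ciphertext segment feeds the input of the next block cipher, whose output is XORed with the next s-bit plaintext segment, producing CT1 through CTi.]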

Figure 2.12: An example of encryption using a block cipher in CFB mode.
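
The CFB shift register and its self-synchronizing behavior can be sketched for the case $s = 8$ and $b = 128$. The truncated SHA-256 stand-in for the block cipher is an assumption for illustration and is not a real cipher; CFB only needs the forward direction, so both encryption and decryption can be shown.

```python
import hashlib

BLOCK_SIZE = 16  # b = 128 bits


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher's forward direction (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def cfb8_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    """CFB with s = 8: one plaintext byte per block cipher call."""
    register = iv
    out = bytearray()
    for p in plaintext:
        stream = toy_block_encrypt(key, register)
        c = p ^ stream[0]  # use the most significant s bits of the output
        out.append(c)
        register = register[1:] + bytes([c])  # keep b - s bits, append ciphertext
    return bytes(out)


def cfb8_decrypt(key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    """Decryption feeds the received ciphertext into the same shift register."""
    register = iv
    out = bytearray()
    for c in ciphertext:
        stream = toy_block_encrypt(key, register)
        out.append(c ^ stream[0])
        register = register[1:] + bytes([c])
    return bytes(out)


key = b"demo key"
iv = bytes(range(16))
plaintext = bytes(range(40))
ciphertext = cfb8_encrypt(key, iv, plaintext)

# Corrupt one ciphertext segment and decrypt anyway.
damaged = bytearray(ciphertext)
damaged[5] ^= 0x01
recovered = cfb8_decrypt(key, iv, bytes(damaged))
```

The corrupted byte affects its own segment and the next $b/s = 16$ segments while it remains in the shift register, after which decryption is correct again.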

2.3.2 Authentication Mode

The purpose of a message authentication code (MAC) is to ensure the integrity of data by
verifying its authenticity. A block cipher can be used to generate a secure MAC when the
algorithm is generally accepted as being secure. CBC-MAC was originally designed for
this purpose, but was replaced by Cipher-based Message Authentication Code (CMAC) to
allow for message lengths that are not fixed as a multiple of the block size [7, 11].
CBC-MAC operates by using a block cipher in CBC mode, as described in subsection 2.3.1, and then using the last bits of ciphertext as the MAC [15]. CMAC is a variation of the
CBC-MAC algorithm. CMAC generates two subkeys K1 and K2 from a given private key
K, each with a length equal to the block size of the cipher. The private key K is used
to encrypt complete message blocks, following the CBC algorithm, until the last block is
reached. If this last block is a complete block, it is masked with subkey K1. If the
last piece of the message is shorter than the block size, the binary sequence 100 . . . 0 is
appended to it before it is masked with subkey K2. This masked value is then
encrypted once more with the key K. The most significant bits of the output of this cipher
are used as the CMAC [7].
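
The subkey generation and last-block masking can be sketched as below. The derivation rule (doubling the encrypted all-zero block in GF($2^{128}$) with the constant 87 in hexadecimal) is taken from NIST SP 800-38B [7] and is not detailed in this chapter. The block cipher call itself is omitted: the value `L_EXAMPLE` is the encrypted zero block quoted from the AES-128 example in RFC 4493, and the function names are illustrative.

```python
BLOCK_BITS = 128
R128 = 0x87  # reduction constant for 128 bit blocks (NIST SP 800-38B)


def double(value: int) -> int:
    """Shift left by one bit; XOR in R128 if the most significant bit was set."""
    shifted = (value << 1) & ((1 << BLOCK_BITS) - 1)
    if value >> (BLOCK_BITS - 1):
        shifted ^= R128
    return shifted


def derive_subkeys(encrypted_zero_block: bytes):
    """Derive K1 and K2 from L, the encryption of the all-zero block under K."""
    l_value = int.from_bytes(encrypted_zero_block, "big")
    k1 = double(l_value)
    k2 = double(k1)
    return k1.to_bytes(16, "big"), k2.to_bytes(16, "big")


def mask_last_block(last: bytes, k1: bytes, k2: bytes) -> bytes:
    """Mask a complete last block with K1, or pad with 100...0 and mask with K2."""
    if len(last) > 16:
        raise ValueError("last block cannot exceed the block size")
    if len(last) == 16:
        padded, subkey = last, k1
    else:
        padded = last + b"\x80" + b"\x00" * (15 - len(last))
        subkey = k2
    return bytes(p ^ k for p, k in zip(padded, subkey))


# Assumed input: encrypted zero block from the RFC 4493 AES-128 example.
L_EXAMPLE = bytes.fromhex("7df76b0c1ab899b33e42f047b91b546f")
K1, K2 = derive_subkeys(L_EXAMPLE)
```

Using two different subkeys ensures that a padded short block and a genuine full block ending in the same bits are masked differently.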

2.3.3 Authenticated Encryption Modes

The modes described in this section are designed to both provide confidentiality and verify the authenticity of the data. Using the Counter with Cipher Block Chaining-Message
Authentication Code (CCM) mode with a packet of information, the packet’s payload is encrypted and authenticated, the header is authenticated, and a unique value associated with
the packet is generated. The operation of CCM mode consists of CBC-MAC applied to the
entire packet to generate its MAC, followed by encryption using counter mode of both the
MAC and the packet’s payload. Decryption performs the preceding steps in reverse order.
Verification is performed by comparing the values of the decrypted MAC and the result of
CBC-MAC on the decrypted data. A more detailed description of CCM is presented in [8].

Galois/Counter mode (GCM) provides the same encryption and authentication options
as CCM, but is parallelizable and relatively efficient [9]. As specified in [8] and [9], GCM
and CCM must use a NIST-approved block cipher with a block size of 128 bits. The
GCM proposal [20] had originally made allowances for block ciphers with both 128 and
64 bit block sizes. Like CCM, GCM operates on a packet of data consisting of Additional
Authenticated Data (AAD) and plaintext. AAD is information that does not need to be
encrypted, but can be validated with the generated tag value.
The confidentiality within GCM originates with the block cipher counter mode of operation. This operation is illustrated in Figure 2.13. The initial counter block (ICB) is, quite
literally, the first value of the counter to be encrypted. Since counter mode generates a key
stream, only the most significant bits of the final encrypted counter block are needed in the
plaintext XOR if the final plaintext block is shorter than the block size [9].

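[Figure 2.13 diagram: the initial counter block ICB is incremented by inc32 to give CB2 through CBi; each counter block is encrypted by the block cipher under key K and XORed with the matching plaintext block (PT1, PT2, ...) to give CT1, CT2, and so on. For the final block, only the most significant len(PTi) bits of the encrypted counter block are XORed with PTi to give CTi.]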

Figure 2.13: The GCTR function where the inputs, the ICB, key and plaintext PTi , are
shown in light blue, and the ciphertext CTi , is shown in dark blue [9].
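
The GCTR function can be sketched as follows. The `inc32` behavior (incrementing only the rightmost 32 bits modulo $2^{32}$) follows [9]. The truncated SHA-256 stand-in for the block cipher is an assumption for illustration and is not a real cipher.

```python
import hashlib

BLOCK_SIZE = 16  # bytes


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher's forward direction (NOT a real cipher)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def inc32(block: bytes) -> bytes:
    """Increment the rightmost 32 bits modulo 2^32; keep the leftmost 96 bits."""
    prefix = block[:12]
    counter = int.from_bytes(block[12:], "big")
    return prefix + ((counter + 1) % (1 << 32)).to_bytes(4, "big")


def gctr(key: bytes, icb: bytes, data: bytes) -> bytes:
    """XOR the data with the encrypted counter blocks, starting from the ICB."""
    if len(icb) != BLOCK_SIZE:
        raise ValueError("the initial counter block must be 128 bits")
    out = bytearray()
    counter_block = icb
    for offset in range(0, len(data), BLOCK_SIZE):
        stream = toy_block_encrypt(key, counter_block)
        chunk = data[offset:offset + BLOCK_SIZE]
        # zip keeps only as many key stream bytes as the final chunk needs.
        out.extend(d ^ s for d, s in zip(chunk, stream))
        counter_block = inc32(counter_block)
    return bytes(out)


key = b"demo key"
icb = bytes(12) + (2).to_bytes(4, "big")
message = b"twenty-one byte input"
ciphertext = gctr(key, icb, message)
recovered = gctr(key, icb, ciphertext)
wrapped = inc32(bytes(12) + b"\xff\xff\xff\xff")
```

Because only the low 32 bits are incremented, the counter wraps within its field and never disturbs the leftmost 96 bits of the counter block.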

To provide authentication, GCM makes use of a GHASH function. This function, shown
in Figure 2.14, employs a hash subkey, depicted as H. H is generated by encrypting a
complete block of all zeroes with the private key K used throughout the algorithm. GHASH
multiplies the current input block $X_i$ by the hash subkey H in GF($2^{128}$) modulo $x^{128} + x^7 + x^2 + x + 1$, effectively evaluating Equation 2.14 from [9]. The degree of the reduction
polynomial forces the result of each field multiplication to be of the 128 bit block size. The
result of each field multiplication is XORed with the next GHASH input before the next multiplication, ending
with the final input block.

$$X_1 \cdot H^{i} \oplus X_2 \cdot H^{i-1} \oplus \dots \oplus X_{i-1} \cdot H^{2} \oplus X_i \cdot H^{1} \tag{2.14}$$

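[Figure 2.14 diagram: X1 is multiplied by H to give Y1; each later input is XORed with the previous output and the sum is multiplied by H, giving Y2 through Yi.]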

Figure 2.14: The GHASH function with inputs Xi and final output hash Yi [9].
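
A bit-serial sketch of GHASH is given below. The multiplication follows the right-shift algorithm and bit ordering of [9], in which the leftmost bit of a block is the coefficient of $x^0$. The function names are illustrative, and the sample values for H and the ciphertext block are quoted from a published GCM test case with an all-zero key, IV, and plaintext block.

```python
# Reduction constant R = 11100001 || 0^120 from NIST SP 800-38D.
R = 0xE1 << 120


def gf128_mul(x: int, y: int) -> int:
    """Multiply two 128-bit blocks in GF(2^128) using the GCM bit ordering."""
    z = 0
    v = y
    for i in range(128):
        if (x >> (127 - i)) & 1:  # scan the bits of x from the leftmost bit
            z ^= v
        if v & 1:
            v = (v >> 1) ^ R
        else:
            v >>= 1
    return z


def ghash(h: bytes, data: bytes) -> bytes:
    """Fold each 128-bit block into the running hash: Y = (Y xor X) * H."""
    if len(data) % 16:
        raise ValueError("GHASH input must be a multiple of the block size")
    h_int = int.from_bytes(h, "big")
    y = 0
    for offset in range(0, len(data), 16):
        x = int.from_bytes(data[offset:offset + 16], "big")
        y = gf128_mul(y ^ x, h_int)
    return y.to_bytes(16, "big")


# Assumed sample values: hash subkey and ciphertext block from a published
# GCM test case (all-zero 128 bit key, IV, and plaintext block).
H = bytes.fromhex("66e94bd4ef8a2c3b884cfa59ca342b2e")
CT = bytes.fromhex("0388dace60b6a392f328c2b971b2fe78")
# Empty AAD, then the two 64 bit lengths in bits: Len(AAD) = 0, Len(CT) = 128.
LENGTHS = (0).to_bytes(8, "big") + (128).to_bytes(8, "big")
digest = ghash(H, CT + LENGTHS)
```

The loop in `ghash` is the iterative form of Equation 2.14: after two blocks the running value equals $X_1 \cdot H^2 \oplus X_2 \cdot H$.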

Before the IV can be used in GCM, it must first be processed to form the pre-counter block
$J_0$. During this stage, processing occurs on the IV to ensure it is 128 bits in length before
it is used in the GCTR function. Figure 2.15 shows how the IV, plaintext, AAD, and
key inputs are used to produce both the tag and the ciphertext. As described earlier, the
GHASH function operates on block size input. The AAD and ciphertext are each padded with
enough zeroes to guarantee each is a multiple of the block size. Additionally, the AAD
and ciphertext lengths are specified as 64 bit representations. Their total bit field length
is 128 bits, the block size. The hashed value of the AAD, ciphertext, and their respective
lengths is encrypted with the GCTR function using the original value of $J_0$, and the most
significant $t$ bits are used as the tag. Specific recommendations for the length of the
tag, based on the purpose of its use, are given in [9].
Decryption and authentication use the same counter mode and GHASH algorithms as
encryption. The tag value is recalculated using GCTR with the key and original $J_0$. If the
calculated tag is identical to the tag value produced by the original GCM encryption,
the plaintext is authenticated. Tag authentication can precede data decryption. If the tag
fails to authenticate, this order of operations removes the need to perform data decryption
[9].

[Figure 2.15 diagram: the IV forms J0; inc32 of J0 and the key K drive GCTR over the plaintext to produce the ciphertext CT. The block cipher encrypts the all-zero block 0^128 to give H. GHASH under H processes the AAD padded with v zero bits (m∙128 bits), CT padded with u zero bits (n∙128 bits), and the 64 bit values Len(AAD) and Len(CT). The GHASH output passes through GCTR with J0, and the most significant len(Tag) bits form the tag.]

Figure 2.15: The operations in GCM from the IV, plaintext, AAD, and key inputs, shown
in light blue, to the tag and ciphertext outputs, shown in dark blue [9].

2.4 Field Programmable Gate Arrays


In order to fully understand the purpose of FPGAs, it is important to understand the dif-
ferences in the design development process for a general purpose processor and for an
Application Specific Integrated Circuit (ASIC). A general purpose processor is an inte-
grated circuit that is designed with hardware implementing a variety of common logical
operations. The processor executes a series of instructions that dictate how it will oper-
ate. The instructions are readily programmed into the device. More complex processors
offer varying degrees of instruction level parallelism or hardware support for more logical
functions, floating point arithmetic units for example.
An ASIC, as the name suggests, is an integrated circuit that is designed with the goal
of accomplishing a specific task. A full custom designed ASIC is engineered from individual transistors. This implies a considerable amount of time and resources be allocated for
development of more complex systems. Using standard cells reduces the amount of design
time by providing the engineer with a variety of logic having a high density and known
electrical properties. Often a standard-cell ASIC is designed using a hardware description language. A structured ASIC offers the same benefits to design time as standard-cell
ASICs. Unlike standard-cell ASICs, a structured ASIC makes use of predefined metal
layers. This has the effect of reducing manufacturing time as well.
An FPGA is intended to be a convenient middle ground between an ASIC and a general
purpose processor. The hardware realization time for an ASIC is much greater than that
of an FPGA and it can be more costly to develop. An FPGA combines the flexibility and
programming speed of a general purpose processor with the high performance found in custom
designed hardware units. Modern FPGAs have the ability to implement a general purpose
processor utilizing the available FPGA logic structures. These are known as soft processor cores or softcore processors, since the processor utilizes the FPGA’s programmable
resources for its construction. This allows for hardware/software co-design within a single
FPGA chip.
In contrast to softcore processors, Xilinx has incorporated one or more physical PowerPC cores within the fabric of certain FPGA models. These physical processor cores
support higher clock rates and provide better performance than the softcore processor alternatives. Like softcore processors, a physically embedded processor can take advantage
of any custom designed hardware units that are present and have been routed to it. Unlike
softcore processors, the physically embedded processor is limited to the feature set present
during the FPGA’s construction.
The cost of the FPGA’s flexibility is lower performance, higher power consumption,
and lower logic density than that of an ASIC [21]. Designing logic targeted for an FPGA
is typically more complex than the creation of software for a general purpose processor to
perform a similar task. Despite these drawbacks, FPGAs have been employed in a wide
variety of applications for over 20 years.
An FPGA is made up of a matrix of reconfigurable logic blocks built on a fabric
of interconnections. These blocks vary in their composition between device manufacturers
and models. Xilinx refers to the reconfigurable blocks as configurable logic blocks (CLBs),
while Altera names them logic array blocks (LABs). The structure and routing interconnections of an Altera Cyclone III LAB are shown in Figure 2.16. The terminology describing
the structures within an FPGA often differs between manufacturers, but the components
it refers to are typically very similar. Within the reconfigurable blocks are smaller units,
known as slices (Xilinx) or logic elements (Altera), which commonly contain a lookup
table (LUT), configurable multiplexers, and registers [22, 23]. Figure 2.17 illustrates the
structure of a logic element (LE) in an Altera Cyclone III. The smallest design unit addressable by the device programming tool defines the FPGA’s granularity [5]. Fine granularity
allows for easier manipulation of individual bits, while a coarse granularity allows for better
manipulation of high-level logical structures.

Figure 2.16: The structure of a logic array block in an Altera Cyclone III device [23].
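
The role of the LUT inside a logic element can be illustrated with a small software model. The sketch below assumes a 4-input LUT and models only the combinational lookup, not the multiplexers or registers around it; the class name `LookupTable` and its methods are hypothetical.

```python
class LookupTable:
    """Software model of a k-input LUT: a 2^k entry truth table."""

    def __init__(self, num_inputs, truth_table):
        if len(truth_table) != 1 << num_inputs:
            raise ValueError("truth table must have 2^k entries")
        self.num_inputs = num_inputs
        self.truth_table = list(truth_table)

    @classmethod
    def from_function(cls, num_inputs, func):
        """'Program' the LUT by evaluating func on every input combination."""
        table = []
        for address in range(1 << num_inputs):
            bits = [(address >> i) & 1 for i in range(num_inputs)]
            table.append(func(*bits) & 1)
        return cls(num_inputs, table)

    def evaluate(self, *bits):
        """Use the input bits as an address into the stored truth table."""
        if len(bits) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = sum(bit << i for i, bit in enumerate(bits))
        return self.truth_table[address]


# Configure one LUT as a 4-input XOR, then reconfigure another as a 4-input AND.
xor4 = LookupTable.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)
and4 = LookupTable.from_function(4, lambda a, b, c, d: a & b & c & d)
```

Any Boolean function of four inputs fits in the same 16-entry table, which is why reprogramming the table contents is enough to change the logic a block implements.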

Figure 2.17: A logic element in an Altera Cyclone III device [23].

FPGAs contain a variety of interconnection types between the reconfigurable blocks. It
is important to remember that the blocks are organized in the form of a matrix. Long lines
span the length of the chip and are typically used for global signal routing. Logic elements
in Altera FPGAs are connected via a direct local interconnect, which helps to minimize
the amount of row and column interconnects between LABs. Xilinx FPGAs use hex and
double interconnects to connect large groups of CLBs via a Global Switch Matrix (GSM)
[24]. More recent FPGAs offer even more types of routing that can provide increased
performance [5].
In the Xilinx Virtex series and Altera Cyclone series FPGAs, there are at least two
types of on-chip memory available. Both FPGAs offer distributed RAM created by the
LUTs within the respective FPGA’s primitive reconfigurable elements. The other common
type of memory is block random access memory (BRAM). BRAM is located in columns
throughout the FPGA and varies in size based on the model of the FPGA. The distributed
RAM is designed to store smaller amounts of data, while BRAM has a configurable width
and depth for creating larger memory blocks [5].
Both Altera and Xilinx have created libraries of intellectual property (IP) hardware
available for use within their FPGAs. These IP cores provide access to a wide variety
of existing hardware structures that can be readily employed in a new or existing design.
Access to IP cores enables designers to create more advanced designs without a great deal
of effort. A more advanced IP core can generate a softcore processor utilizing the available
FPGA logic structures. Both Altera’s Nios II processor and Xilinx’s MicroBlaze processor
are softcore microprocessors. The two softcore processors are illustrated in Figure 2.18.

(a) Nios II [1] (b) MicroBlaze [25]

Figure 2.18: High level layouts of Altera’s Nios softcore processor (2.18a) and Xilinx’s
MicroBlaze softcore processor (2.18b).

3. Mathematical Background
This chapter presents the basic properties of a group, developing into properties of a ring,
then into properties of a finite field. An overview will be presented of several finite field
representations, along with their associated field arithmetic. Implementation techniques
discussed in later chapters require manipulation of binary data in composite finite fields.
With that in mind, composite field mathematics and the isomorphic mappings between
composite field representation and standard binary representation will be discussed. It is
assumed the reader has an understanding of some general number theoretical concepts.

3.1 Finite Fields


The majority of the calculations performed in AES are over a finite field. To that end,
an effort is made to explain preliminary and related concepts. The topic of finite fields is
expanded upon in great detail in books [26] and [27]. Detailed descriptions of composite
fields and the isomorphic mappings to and from composite fields can be found in Paar’s
doctoral thesis [28] and in the master’s thesis of his student, Jorge Guajardo [29].

3.1.1 Preliminary Concepts

In order to define a finite field, it is necessary to define some terminology and basic con-
cepts.

Definition 1 [26] A binary operation on a set S is one that maps the Cartesian product,
S × S, of two elements from S back on to the set S.

Definition 2 [26, 28] A set G, defined with a binary operation, is called a group if it obeys
the following properties:

• Associativity of the binary operation: $(ab)c = a(bc),\ \forall\, a, b, c \in G$
• Identity element $e \in G$: $ae = ea = a,\ \forall\, a \in G$
• Inverse element $a^{-1} \in G$ for any element $a \in G$: $aa^{-1} = a^{-1}a = e$
• G is not the empty set

A group is described as being abelian, or commutative, if $ab = ba,\ \forall\, a, b \in G$.


Similar to the concept of a group is a ring. Unlike a group, a ring is defined with two
binary operations. It is important to note that there is no requirement for a well defined
multiplicative inverse for every element in a ring.

Definition 3 [26] A set R, defined with two binary operations, addition (+) and multiplication (·),
is called a ring¹ if it obeys the following properties:

• Associativity with respect to each binary operation: $(ab)c = a(bc)$ and $a + (b + c) = (a + b) + c,\ \forall\, a, b, c \in R$
• R is an abelian group under addition with identity element 0
• Additive inverse element $b \in R$ for any element $a \in R$: $a + b = b + a = 0$
• Identity element for multiplication $1 \in R$: $1 \cdot a = a \cdot 1 = a,\ \forall\, a \in R$
• Distributivity $\forall\, a, b, c \in R$: $a(b + c) = ab + ac$ and $(b + c)a = ba + ca$
• R is not the empty set

Using the definition of a ring, a field can be defined with some additional properties.

Definition 4 [26] A ring R forms a field if it obeys the following properties:

• The nonzero elements of R form an abelian group under multiplication
• Multiplicative identity element 1 is not equal to the additive identity element 0

Every field is an integral domain [26], which means that the field has no zero divisors.
This is especially useful along with the definition of the division operation. Together, these
imply that there exists a multiplicative inverse for every nonzero element in the field.

¹ [26] calls this a “ring with identity”, since it has a multiplicative identity element. Either notation is
acceptable in this thesis as long as it is understood that a field has a multiplicative identity element.

For the work performed in this thesis, only finite fields are of relevance. A finite field
is a field with finitely many elements, also known as a Galois field. A Galois field is
abbreviated as GF($p^n$) or $\mathbb{F}_{p^n}$, where $n$ is a positive integer and $p$ is a prime number. $p^n$
indicates the order of the field, while $p$ indicates the characteristic of the field [27].

Definition 5 [28] The order of a field is the number of elements in the field.

Definition 6 [28] The characteristic of a field is the number of additions of the multiplicative
identity element necessary to reach a sum equivalent to the additive identity element.

In addition to the restriction of focus in this thesis to Galois fields, these finite fields
will also be restricted to a characteristic of two. As was stated in [28], every element of such a field will
be its own additive inverse. The motivating factor here is to create a field that can easily be
represented by the Boolean equations that define the operation of digital logic.

3.1.2 Finite Fields Extensions

The ring of integers modulo $n$ is denoted $\mathbb{Z}_n$, where $n$ is the integer modulus. In similar fashion, a
ring of polynomials can be created modulo a polynomial $f(x)$. This ring of polynomials
is denoted $\mathbb{Z}_n[x]/f(x)$, where $f(x) \in \mathbb{Z}_n[x]$. The ring of integers $\mathbb{Z}_n$ becomes a finite field when
$n$ is prime. When applying the same logic to the ring of polynomials, one must conclude
that, not only would $n$ have to be prime, the polynomial modulus $f(x)$ would have to be
irreducible for $\mathbb{Z}_n[x]/f(x)$ to be a field [29]. These irreducible polynomials exist for every finite
field with order being of a prime power.

Theorem 1 [27, 26] If a field F exists and contains a number of elements equal to a power
of a prime, then ∀ d ≥ 1, there is at least one irreducible polynomial of degree d over the
field F .

The previous subsection defined GF($p^n$) as a Galois field whose order is a
prime power. When $n = 1$, GF($p$) is the field of integers modulo $p$. When $n > 1$, the field
is known as an extension field of GF($p$). Conversely, GF($p$) is a subfield, or ground field,
of GF($p^n$). An element in the extension field GF($p^n$) can be represented as a polynomial
with coefficients in the subfield. For example, a polynomial $A(x) = a_{n-1}x^{n-1} + \dots + a_0$
has coefficients $a_i \in$ GF($p$), where $n$ is the degree of the irreducible polynomial
$f(x)$ and $i = 0, 1, \dots, n - 1$ [29]. The maximum degree of the indeterminate $x$ will never
be greater than $n - 1$, since it is reduced by the irreducible polynomial $f(x)$ with degree $n$.
The $p^n$ polynomials are the residue classes modulo $f(x)$ in GF($p^n$), which implies that the
algorithms for arithmetic in the field will depend on the choice of $f(x)$ [28].
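
The need for an irreducible modulus can be seen in a small example over GF(2). The sketch below stores a polynomial as an integer whose bit $i$ is the coefficient of $x^i$ and compares the modulus $x^3 + x + 1$, which is irreducible, with $x^3 + 1 = (x + 1)(x^2 + x + 1)$, which is not. The function names are illustrative.

```python
def poly_mulmod(a, b, modulus, degree):
    """Multiply two GF(2) polynomials and reduce modulo `modulus`.

    Bit i of each integer is the coefficient of x^i; `degree` is the
    degree of the modulus, and both inputs must have lower degree.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> degree) & 1:
            a ^= modulus
    return result


def elements_with_inverse(modulus, degree):
    """List the nonzero residues that have a multiplicative inverse."""
    size = 1 << degree
    return [a for a in range(1, size)
            if any(poly_mulmod(a, b, modulus, degree) == 1
                   for b in range(1, size))]


IRREDUCIBLE = 0b1011  # x^3 + x + 1
REDUCIBLE = 0b1001    # x^3 + 1 = (x + 1)(x^2 + x + 1)

field_units = elements_with_inverse(IRREDUCIBLE, 3)
ring_units = elements_with_inverse(REDUCIBLE, 3)
# With the reducible modulus, two nonzero residues multiply to zero.
zero_divisor_product = poly_mulmod(0b011, 0b111, REDUCIBLE, 3)
```

With the irreducible modulus all seven nonzero residues are invertible, giving GF($2^3$); with the reducible modulus the residues form a ring with zero divisors rather than a field.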

Composite Fields

A special case of extension fields are composite fields, denoted as GF($(2^n)^m$) for a characteristic of two.

Definition 7 [28, 29] A pair of fields $\{\mathrm{GF}(2^n),\ Q(y) = y^n + \sum_{i=0}^{n-1} q_i y^i\}$ and
$\{\mathrm{GF}((2^n)^m),\ P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i\}$ is called a composite field if

• GF($2^n$) is constructed from GF(2) by $Q(y)$
• GF($(2^n)^m$) is constructed from GF($2^n$) by $P(x)$

where $Q(y)$ is an irreducible polynomial over GF(2) and $P(x)$ is an irreducible polynomial over GF($2^n$).

If $k = nm$, GF($2^k$) and GF($(2^n)^m$) have the same order. Any two fields with the
same order are isomorphic to one another [27]. Although these two fields are isomorphic,
the algorithms computed over the fields could have different complexities. The change in
complexity would be with respect to the defined field operations: addition and multiplication. Just as a change in the irreducible polynomial for an extension field could result
in a change in algorithmic complexity (subsection 3.1.2), a change in the irreducible polynomials $Q(y)$ or $P(x)$ could also produce such a change for a composite field. In a more
general sense, the values selected for $n$ and $m$ will prompt this change, due to their effect
on the irreducible field polynomials [28].
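
As an illustration of Definition 7, the sketch below builds GF($(2^4)^2$) from GF($2^4$). The choices $Q(y) = y^4 + y + 1$ and $P(x) = x^2 + x + \lambda$ are assumptions made for this example and are not necessarily the polynomials used later in this thesis; a suitable $\lambda$ is found by searching for a value that leaves $P(x)$ without roots in GF($2^4$).

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) constructed with Q(y) = y^4 + y + 1."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result


def find_lambda():
    """Find lambda so that P(x) = x^2 + x + lambda has no root in GF(2^4).

    A degree-two polynomial without roots in the ground field is irreducible.
    """
    for lam in range(1, 16):
        if all(gf16_mul(t, t) ^ t ^ lam for t in range(16)):
            return lam
    raise ValueError("no irreducible polynomial of this form found")


LAMBDA = find_lambda()


def composite_mul(a, b):
    """Multiply in GF((2^4)^2); each byte holds a1*x + a0 as two nibbles."""
    a1, a0 = a >> 4, a & 0xF
    b1, b0 = b >> 4, b & 0xF
    high_high = gf16_mul(a1, b1)
    # Reduce with x^2 = x + lambda (addition and subtraction coincide).
    high = high_high ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)
    low = gf16_mul(a0, b0) ^ gf16_mul(high_high, LAMBDA)
    return (high << 4) | low


invertible = sum(
    1 for a in range(1, 256)
    if any(composite_mul(a, b) == 1 for b in range(1, 256))
)
```

Every one of the 255 nonzero elements has an inverse, so the construction yields a field of order 256 that is isomorphic to GF($2^8$) while using only 4-bit multiplications internally.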

3.1.3 Basis and Representation

Although there are many different basis representations, three are commonly found in applied cryptography research: the standard basis, the normal basis, and the dual basis. The
selection of a particular basis representation determines the complexity of the finite field
arithmetic and any constructed isomorphisms. Field element representations will be shown
and an overview will be given for each of the three bases. Specific details regarding basis
representation can be found in [26, 30] with implementations described in chapter 4.
A basis for a finite field is a set of elements that, when linearly combined, form every
element within the finite field. A general basis for GF($q^m$) is shown in Equation 3.1,
where $a_i$ represents a coefficient, $x_i$ represents a basis vector, and $A(x)$ is an element in the
finite field. The number of basis elements $m$ is the power to which the prime integer $q$
defining the ground field is raised.

$$A(x) = a_0 x_0 + a_1 x_1 + a_2 x_2 + \dots + a_{m-1} x_{m-1}, \quad \text{where } a_i \in \mathrm{GF}(q) \tag{3.1}$$

Standard Basis

If a finite field extension GF(2)[$x$]/$R(x)$ is formed over GF(2), where $R(x)$ is an irreducible polynomial of degree $m$, the standard basis for the extension
over the ground field GF(2) is shown in Equation 3.2 using the root $x$ of $R(x)$. This
primitive root $x$ is known as the generator.

$$\{1, x, x^2, \dots, x^{m-1}\} \tag{3.2}$$

Each vector within this basis is linearly independent of every other basis vector. The $2^m$ elements
within the field GF($2^m$) are represented as polynomials of the form shown in Equation 3.3.
Each $a_i$ is an element of the ground field GF(2). As with any basis, a linear combination
of the $m$ basis elements produces each of the finite field elements [28].

$$A(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{m-1} x^{m-1}, \quad \text{where } a_i \in \mathrm{GF}(2) \tag{3.3}$$

The standard basis is also known as the polynomial basis or canonical basis, specifically
with regards to finite field extensions. The origin of the polynomial basis name is obvious
because of the direct connection between the basis and element representation as a polynomial [28]. Finite field arithmetic operations using the standard basis are performed modulo
a field reduction polynomial, as was mentioned in subsection 3.1.2. Each coefficient of the
indeterminate in field element representation is also reduced by the modulus of the ground
field.
In a field of characteristic two, addition is equivalent to subtraction. Both are a simple
exclusive-OR operation on the coefficients at equivalent powers of the indeterminate.
Multiplication and squaring are more complex operations with a variety of different
implementations, each with specific advantages. Multiplication in the standard basis, in its most
elementary form, is performed using the same methods as conventional polynomial
multiplication. The exception is in the reduction steps described in the previous paragraph. A
squaring operation involves some shifting operations followed by a multiplication and an
exclusive-OR operation.
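
As an illustration of the standard basis operations described above, the following minimal sketch performs addition and shift-and-add multiplication in $GF(2^8)$. The choice of the AES reduction polynomial $x^8 + x^4 + x^3 + x + 1$ and the function names are assumptions made for the example; they are not taken from a particular implementation in this thesis.

```python
# Sketch of polynomial (standard) basis arithmetic in GF(2^8).
# Assumption: the field polynomial is the AES polynomial x^8 + x^4 + x^3 + x + 1.
AES_POLY = 0x11B
M = 8  # extension degree


def gf_add(a, b):
    """Addition and subtraction are the same coefficient-wise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Conventional polynomial multiplication with reduction after each shift."""
    result = 0
    for _ in range(M):
        if b & 1:            # current coefficient of b is 1: accumulate a
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by the indeterminate x
        if a & (1 << M):     # degree reached m: reduce by the field polynomial
            a ^= AES_POLY
    return result


def gf_square(a):
    """Squaring expressed as a multiplication of an element by itself."""
    return gf_mul(a, a)


print(hex(gf_add(0x57, 0x83)))  # 0xd4
print(hex(gf_mul(0x57, 0x83)))  # 0xc1
```

The two printed values match the worked examples in the AES specification (FIPS 197).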

Normal Basis

If the same finite field extension $GF(2^m)[x]/R(x)$ is formed over GF(2), as in section 3.1.3, the
normal basis for the extension field over the ground field GF(2) is shown in Equation 3.4.

$$\{x, x^{q}, x^{q^2}, x^{q^3}, \ldots, x^{q^{m-1}}\} \tag{3.4}$$

In this set, a primitive element x is raised to increasing powers of the field’s characteris-
tic, labeled q. Again, x is also known as a generator for the finite field since it generates all
the elements of the multiplicative group. As was mentioned previously, only fields of char-
acteristic two are utilized in this thesis. That being the case, q = 2. Each finite field element
is defined by the coefficients of the field’s primitive elements, as shown in Equation 3.5.
Again, each coefficient is reduced by the ground field modulus.

$$A(x) = a_0 x + a_1 x^{2^1} + a_2 x^{2^2} + \ldots + a_{m-1} x^{2^{m-1}}, \quad \text{where } a_i \in GF(q) \tag{3.5}$$

Like the standard basis, addition in the normal basis is a simple exclusive-OR of the
respective primitive elements' coefficients. Multiplication, however, is more complex
than in the standard basis. An advantage of the normal basis is that squaring is a simple
cyclic shift of the primitive elements' coefficients.
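
The cyclic shift property can be checked on a small field. The sketch below assumes $GF(2^4)$ built with $x^4 + x + 1$ and uses $\theta = x^3$ as the normal element; both choices, and all function names, are illustrative assumptions. A coefficient vector is stored as an integer whose bit $i$ multiplies $\theta^{2^i}$.

```python
# Sketch: squaring in a normal basis is a cyclic shift of the coefficients.
# Assumptions: GF(2^4) with field polynomial x^4 + x + 1, normal element theta = x^3.
M = 4
POLY = 0b10011


def mul(a, b):
    """Polynomial basis multiplication in GF(2^4), used only as a reference."""
    r = 0
    for _ in range(M):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r


THETA = 0b1000  # x^3
# Normal basis {theta, theta^2, theta^4, theta^8} expressed in the polynomial basis.
NORMAL_BASIS = [THETA]
for _ in range(M - 1):
    NORMAL_BASIS.append(mul(NORMAL_BASIS[-1], NORMAL_BASIS[-1]))


def from_normal(coeffs):
    """Convert normal basis coefficients (bit i <-> theta^(2^i)) to a field element."""
    element = 0
    for i in range(M):
        if (coeffs >> i) & 1:
            element ^= NORMAL_BASIS[i]
    return element


def rotate_left(coeffs):
    """Cyclic shift: coefficient i moves to position i+1, the last wraps to 0."""
    return ((coeffs << 1) | (coeffs >> (M - 1))) & ((1 << M) - 1)


# Every squaring agrees with a cyclic shift of the normal basis coefficients.
ok = all(from_normal(rotate_left(c)) == mul(from_normal(c), from_normal(c))
         for c in range(1 << M))
print(ok)  # True
```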

Dual Basis

In the standard basis, elements were of successive generator powers. In the normal basis,
elements were of successive generator exponentiations. A dual basis can be created for any
basis [30, 28]. Using the basis elements $\{x_0, x_1, \ldots, x_{m-1}\}$ of $GF(q^m)$ from the generic
example in subsection 3.1.3, let $h$ be a nonzero linear function from $GF(q^m)$ to $GF(q)$.
Using this basis, its dual basis can be defined as $\{\gamma_0, \gamma_1, \ldots, \gamma_{m-1}\}$, where

$$h(x_i \gamma_j) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}, \quad \text{where } 0 \leq i, j \leq m - 1 \tag{3.6}$$

[28] specifies the linear function h as the trace of the basis element product, while [30]
does not identify a particular linear function.
According to [30], the dual basis conversion is an import and export problem where
the basis must be changed efficiently. Dual basis techniques allow a conversion that is
more efficient than a matrix multiplication where the matrix is too large for a constrained
environment [30]. In certain implementations, such as in [31], a dual basis approach can
result in less hardware area and potentially offer high performance for field operations as
the field size increases.

3.2 Field Isomorphisms
Of particular interest in this thesis is the mapping between the binary standard representation
and the composite field representation of finite field elements. Moreover, these isomorphisms
allow operations that were originally in $GF(2^8)$ to be performed in lower order
fields, such as $GF(2^4)$ or $GF(2^2)$. Several construction methods for the transformation
matrices have been documented. These will be given more attention in chapter 4.
An isomorphism is a bijective map where the mapping function and its inverse are ho-
momorphisms. In simplest terms, it is a mapping between two vector spaces with two
qualifications: 1) the mapping function is one-to-one and onto; and 2) the mapping pre-
serves structure. In the context of a vector space, the qualification of preserving structure
implies that the mapping preserves both field operations: addition and multiplication. Vec-
tor spaces of the same dimension are isomorphic. A finite field is a vector space over its
subfields and is isomorphic to another finite field of the same order. For example, a finite
field extension $GF(p^k)$ is a vector space of dimension $k$ over $GF(p)$. This extension would
be isomorphic to another finite field with order $p^k$.
In order to construct the isomorphism from standard representation to composite field
representation, a mapping must be constructed between basis elements of $GF(2^k)$ and
$GF((2^n)^m)$ for $k = nm$. Using the notation for the irreducible polynomials in section 3.1.2,
let $\alpha$ be a primitive root of $P(x)$ and $\beta$ be a primitive root of an irreducible polynomial $R(z)$
with degree $k$ and coefficients in GF(2). In other words, $R(z)$ is the irreducible polynomial
of $GF(2^k)$ in binary standard representation. The mapping $T$ is created using a value of $t$ that
satisfies the following from [28]:

$$T\beta^i = \alpha^{it}, \quad \text{where } i = 0, 1, \ldots, k - 1 \tag{3.7}$$

Equation 3.7 is not the only qualification to be satisfied by a value of $t$. As was stated
in the beginning of this subsection, structure must be preserved with an isomorphism. For
multiplication to be preserved, $\alpha^t$ must be a root of $R(z)$ modulo the irreducible polynomials
$Q(y)$ and $P(x)$, as illustrated in Equation 3.8 [28].

$$R(\alpha^t) = 0 \tag{3.8}$$

Finding the value of t can be done in a number of ways, the most obvious of which
is a brute force search [28]. The complexity of the isomorphism is directly related to the
composite field primitive polynomials. In chapter 4, related works are listed that utilize
different approaches to find these mappings and polynomials.
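
The brute force search can be sketched as follows. This example assumes the AES polynomial for $R(z)$, the composite field $GF((2^4)^2)$ built from $x^4 + x + 1$ and $y^2 + y + \lambda$ with $\lambda = \{1001\}_2$, and hypothetical function names. Rather than iterating over exponents $t$ of a fixed primitive element $\alpha$, it tests every composite field element $g$ as a candidate for $\alpha^t$, which covers the same search space.

```python
# Sketch: brute-force search for a root of R(z) in a composite field, then
# construction and verification of the resulting isomorphism.
# Assumptions: R(z) = z^8 + z^4 + z^3 + z + 1, GF(2^4) mod x^4 + x + 1,
# GF((2^4)^2) mod y^2 + y + LAMBDA with LAMBDA = 0b1001.
LAMBDA = 0b1001


def mul4(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def mul_comp(a, b):
    """Multiply in GF((2^4)^2); a byte holds (high nibble)*y + (low nibble)."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    hh = mul4(ah, bh)
    high = hh ^ mul4(ah, bl) ^ mul4(al, bh)   # y^2 is replaced by y + LAMBDA
    low = mul4(hh, LAMBDA) ^ mul4(al, bl)
    return (high << 4) | low


def mul8(a, b):
    """Multiply in GF(2^8) in binary standard representation."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def r_of(g):
    """Evaluate R(g) = g^8 + g^4 + g^3 + g + 1 using composite field arithmetic."""
    g2 = mul_comp(g, g)
    g3 = mul_comp(g2, g)
    g4 = mul_comp(g2, g2)
    g8 = mul_comp(g4, g4)
    return g8 ^ g4 ^ g3 ^ g ^ 1


def mapping_columns(g):
    """Column i of the transformation matrix is g^i, the image of beta^i."""
    cols = [1]
    for _ in range(7):
        cols.append(mul_comp(cols[-1], g))
    return cols


def apply_map(cols, a):
    """Apply the GF(2)-linear map: XOR the columns selected by the bits of a."""
    out = 0
    for i, col in enumerate(cols):
        if (a >> i) & 1:
            out ^= col
    return out


def preserves_multiplication(cols):
    """Check T(a*b) == T(a)*T(b) for every pair of field elements."""
    table = [apply_map(cols, a) for a in range(256)]
    return all(table[mul8(a, b)] == mul_comp(table[a], table[b])
               for a in range(256) for b in range(256))


roots = [g for g in range(256) if r_of(g) == 0]
columns = mapping_columns(roots[0])
print(len(roots), preserves_multiplication(columns))
```

Any of the roots found yields a valid mapping; which one leads to the least hardware depends on the resulting matrix, which this sketch does not evaluate.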

3.3 Composite Fields applied to AES


Using the definition of the AES S-Box in subsection 2.2.1 and the construction of a com-
posite field given in section 3.1.2, an S-Box can be constructed using lower order Galois
fields. The AES S-Box performs a Galois field inversion and an affine transformation. The
composite field approach is used to implement the Galois field inversion calculations.
To be able to use the composite field approach for an S-Box implementation, an isomorphic
mapping must be generated between the AES S-Box Galois field and the desired
composite field. Figure 3.1 illustrates the three main calculations of an inverse in the Galois
field $GF(2^8)$ using the composite field approach. Since the affine transformation is nothing
more than a linear transformation followed by a translation, it can be integrated with the
isomorphic mapping to reduce the number of required calculations [32, 33]. These mappings
are constructed using the techniques discussed in section 3.2. The remainder of this
section is devoted to the derivation of the $GF((2^4)^2)$ inversion.
As was stated in [34], elements of $GF(2^8)$ can be expressed as a polynomial of the first
degree with coefficients in $GF(2^4)$, $bx + c$, given an irreducible polynomial $x^2 + Ax + B$.
It is when performing calculations with these coefficients that the reduction in required
logic can be realized. [15] states that irreducible polynomials exist in $GF(2^4)$ where $A = 1$
and $B$ is a constant that does not compromise the irreducibility of the polynomial. This
constant is hereby referred to as $\lambda$.

[Figure 3.1 diagram: $A$ in $GF(2^8)$, (1) isomorphic mapping, (2) multiplicative inverse manipulation in $GF((2^2)^2)$ and $GF((2^4)^2)$, (3) isomorphic reverse mapping, $A^{-1}$ in $GF(2^8)$.]

Figure 3.1: A high-level diagram showing the three main components in a composite field
S-Box calculation, not including the affine transformation [5].


Using the irreducible polynomial $P_2(x) = x^2 + x + \lambda$, the inverse of $bx + c$ can be calculated
with coefficients in $GF(2^4)$. The steps to this procedure are adapted from [15] and
[32]. This inversion is equivalent to solving for $A(x)$ and $B(x)$ in the following equation
from [32]:

$$A(x)P_2(x) + B(x)S(x) = 1 \bmod P_2(x) \tag{3.9}$$

In this equation, $S(x) = bx + c$. Thus $B(x)$ is the multiplicative inverse of $S(x)$. When the
irreducible polynomial $P_2(x)$ is divided by $S(x)$, the result can be expressed as a product
between $S(x)$ and a quotient polynomial $Q(x)$ summed with a remainder polynomial $R(x)$.

$$P_2(x) = Q(x)S(x) + R(x) \tag{3.10}$$

Using long division to solve for $Q(x)$ and $R(x)$ yields the following:

$$Q(x) = b^{-1}x + (1 + b^{-1}c)b^{-1} \tag{3.11}$$

$$R(x) = \lambda + (1 + b^{-1}c)b^{-1}c \tag{3.12}$$

Substituting Equation 3.11 and Equation 3.12 back into Equation 3.10 yields:

$$P_2(x) = (b^{-1}x + b^{-1}(1 + b^{-1}c))S(x) + (\lambda + b^{-1}c(1 + b^{-1}c)) \tag{3.13}$$

To remove negative exponents from Equation 3.13, both sides of the equation are
multiplied by $b^2$.

$$b^2 P_2(x) = (bx + (b + c))S(x) + (b^2\lambda + bc + c^2) \tag{3.14}$$

$$b^2 P_2(x) = (bx + (b + c))S(x) + (b^2\lambda + c(b + c)) \tag{3.15}$$

In order to have an equation that is comparable to Equation 3.9, the constant term must be
equivalent to one. Multiplying both sides of the equation by $(b^2\lambda + c(b + c))^{-1}$ accomplishes
this goal.

$$(b^2\lambda + c(b + c))^{-1} \cdot b^2 P_2(x) = (b^2\lambda + c(b + c))^{-1} \cdot \left[(bx + (b + c))S(x) + (b^2\lambda + bc + c^2)\right] \tag{3.16}$$

$$(b^2\lambda + c(b + c))^{-1} \cdot b^2 P_2(x) = (b^2\lambda + c(b + c))^{-1} \cdot (bx + (b + c))S(x) + 1 \tag{3.17}$$

$$(b^2\lambda + c(b + c))^{-1} \cdot b^2 P_2(x) + (b^2\lambda + c(b + c))^{-1} \cdot (bx + (b + c))S(x) = 1 \tag{3.18}$$

Between Equation 3.17 and Equation 3.18, the $S(x)$ term switches sides of the equation.
There is no sign change here because this is a field of characteristic two.
Comparing Equation 3.18 to the original Equation 3.9 shows that the coefficient of the
$S(x)$ term in Equation 3.18 is the desired multiplicative inverse, $S^{-1}(x)$ or $(bx + c)^{-1}$.

$$S^{-1}(x) = (bx + c)^{-1} = (b^2\lambda + c(b + c))^{-1} \cdot (bx + (b + c)) \tag{3.19}$$
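
Equation 3.19 can be checked numerically. The sketch below assumes $GF(2^4)$ is built with $x^4 + x + 1$ and that $\lambda = \{1001\}_2$, a value for which $x^2 + x + \lambda$ has no root in $GF(2^4)$. These constants and the function names are illustrative assumptions. The remaining $GF(2^4)$ inversion is taken from a 16 entry table.

```python
# Sketch: inversion in GF((2^4)^2) following Equation 3.19.
# Assumptions: GF(2^4) mod x^4 + x + 1 and LAMBDA = 0b1001.
LAMBDA = 0b1001


def mul4(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


# 16-entry inversion table for GF(2^4); zero is mapped to zero by convention.
INV4 = [0] + [next(y for y in range(1, 16) if mul4(x, y) == 1) for x in range(1, 16)]


def mul_comp(a, b):
    """Multiply (ah*x + al)(bh*x + bl) modulo x^2 + x + LAMBDA."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    hh = mul4(ah, bh)
    return ((hh ^ mul4(ah, bl) ^ mul4(al, bh)) << 4) | (mul4(hh, LAMBDA) ^ mul4(al, bl))


def inv_comp(s):
    """(bx + c)^-1 = (b^2*LAMBDA + c(b + c))^-1 * (bx + (b + c))."""
    b, c = s >> 4, s & 0xF
    d = mul4(mul4(b, b), LAMBDA) ^ mul4(c, b ^ c)  # b^2*LAMBDA + c(b + c)
    d_inv = INV4[d]                                # the only inversion, in GF(2^4)
    return (mul4(d_inv, b) << 4) | mul4(d_inv, b ^ c)


# Every nonzero element multiplied by its computed inverse gives one.
print(all(mul_comp(s, inv_comp(s)) == 1 for s in range(1, 256)))  # True
```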

Directly mapping Equation 3.19 to hardware logic components results in Figure 3.2. This
figure shows a high-level view of the interaction between the composite field S-Box components
involved in the Galois field inversion. Again, this diagram does not constitute a
complete S-Box implementation since it does not include the affine transformation. The
blocks labeled $\delta(x)$ and $\delta^{-1}(x)$ are the isomorphism and inverse isomorphism respectively.

[Figure 3.2 block diagram: $A$ enters $\delta(x)$; the 4 bit datapath contains a squarer ($X^2$), a constant multiplier ($X\lambda$), three multipliers ($X \cdot Y$), and an inverter ($X^{-1}$); $\delta^{-1}(x)$ produces $A^{-1}$.]

Figure 3.2: A block diagram showing the individual hardware components in a composite
field S-Box implementation [5].

The components in Figure 3.2 between the two isomorphism components belong to
$GF(2^4)$. Within this component, it should be noted that a Galois field inversion is still
required. This component, however, could utilize a simple LUT since it only requires 16
elements instead of the original 256 elements required by a complete S-Box implementation
in a LUT. Other approaches, such as continued Galois field decomposition or a square
and multiply approach, could also be utilized for this component [32].
Continued decomposition is certainly possible for the components in $GF(2^4)$, but it may
not be advantageous due to the algorithmic complexity of one approach versus another
on a given platform. Constructing $GF(2^4)$ using the composite field $GF((2^2)^2)$ requires
that another irreducible field polynomial, $P_1(x)$, be defined. The constant required to keep this
polynomial irreducible is defined to be $\phi$.

$$P_1(x) = x^2 + x + \phi \tag{3.20}$$

Using $s$ as the product of $a$ and $b$ in $GF(2^4)$, the multiplication operation is as shown below:

$$s = a \cdot b \tag{3.21}$$

As before with the elements in $GF(2^8)$ represented as an equation of the first degree with
coefficients in $GF(2^4)$, elements in $GF(2^4)$ can be represented by an equation of the first
degree with coefficients in $GF(2^2)$. This equation is shown below with the most significant
bits indicated with a subscript $H$ and the least significant bits indicated with a subscript $L$.

$$s_H x + s_L = (a_H x + a_L)(b_H x + b_L) \tag{3.22}$$

This product can be evaluated and then reduced with the field polynomial, as shown below:

$$s_H x + s_L = (a_H b_H)x^2 + (a_H b_L + a_L b_H)x + a_L b_L \tag{3.23}$$

$$s_H x + s_L = (a_H b_H)(x + \phi) + (a_H b_L + a_L b_H)x + a_L b_L \tag{3.24}$$

$$s_H x + s_L = (a_H b_H + a_H b_L + a_L b_H)x + (a_H b_H \phi + a_L b_L) \tag{3.25}$$

Equation 3.25 defines a $GF((2^2)^2)$ multiplier whose logical hardware implementation is
shown in Figure 3.3. This $GF((2^2)^2)$ multiplier could be further decomposed into GF(2) if
desired.

[Figure 3.3 block diagram: two 4 bit inputs are split into 2 bit halves that feed the multipliers ($X$) and the constant multiplier ($X\Phi$).]

Figure 3.3: A block diagram showing the individual hardware components in a $GF((2^2)^2)$
multiplier [5].
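
Equation 3.25 translates almost directly into code. The sketch below assumes $GF(2^2)$ is built with $x^2 + x + 1$ and that $\phi = \{10\}_2$, a value for which $P_1(x)$ has no root in $GF(2^2)$. The constants and function names are assumptions chosen for the example.

```python
# Sketch: GF((2^2)^2) multiplier following Equation 3.25.
# Assumptions: GF(2^2) mod x^2 + x + 1 and PHI = 0b10.
PHI = 0b10


def mul2(a, b):
    """Multiply two 2-bit elements of GF(2^2) modulo x^2 + x + 1."""
    r = 0
    for _ in range(2):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b100:
            a ^= 0b111
    return r


def mul_gf16(a, b):
    """s_H = aH*bH + aH*bL + aL*bH and s_L = aH*bH*PHI + aL*bL."""
    ah, al = a >> 2, a & 0b11   # most and least significant bit pairs of a
    bh, bl = b >> 2, b & 0b11   # most and least significant bit pairs of b
    hh = mul2(ah, bh)
    s_h = hh ^ mul2(ah, bl) ^ mul2(al, bh)
    s_l = mul2(hh, PHI) ^ mul2(al, bl)
    return (s_h << 2) | s_l


# Field check: every nonzero element has exactly one multiplicative inverse.
print(all(sum(mul_gf16(a, b) == 1 for b in range(1, 16)) == 1 for a in range(1, 16)))  # True
```

The three subfield products `hh`, `mul2(ah, bl)` and `mul2(al, bh)` plus `mul2(al, bl)` and the constant multiplication by `PHI` correspond to the multiplier blocks of the equation; a hardware design could reduce the number of general multipliers further, which this sketch does not attempt.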
As stated in section 3.1.2, the algorithmic complexity of these composite fields is directly
related to the choice of exponents in the construction of the composite field, due
to their effect on the irreducible polynomials. [35] analyzed composite field constructions
using all values of $\lambda$ and $\phi$ that do not result in a reducible field
polynomial, yielding the optimal values for the constants.

4. Related work
This chapter summarizes the relevant related work in implementations of the Advanced
Encryption Standard (AES) and Single Chip Cryptography (SCC) on FPGAs. Several high-
level AES design techniques will be covered, such as T table approaches [17, 36, 37, 38,
39, 40] and pipelined approaches [10, 33, 41, 32, 42, 43, 44, 45, 46]. High-level design
approaches govern how resources are utilized during an entire encryption or decryption
operation. Improvements in the individual MixColumns and SubBytes round operations
are also reviewed. Particular attention is given to the SubBytes transformation, where
significant performance improvements and reduction of required resources have been made
[34, 47, 16, 48, 42, 35, 49, 46, 50].
Apart from high level organization, different AES designs vary particularly in the SubBytes
round implementation: 1) as a table in memory [37, 36, 38, 10, 51, 44, 52], 2) as
combinational logic implementing calculations in $GF(2^8)$ [34, 48, 43], or 3) as combinational
logic utilizing the composite field technique [34, 47, 16, 33, 42, 44, 53, 32, 35, 46].
Previous research varies in focus from ASIC and FPGA implementations of AES en-
cryption and decryption in their entirety, to more theoretical approaches focusing on only
a particular round operation. The related research presented here is not limited in scope to
FPGA implementations.

4.1 High Performance Architectures for AES


There exists a great deal of published literature regarding high performance designs or
techniques for AES implementations. From a top level perspective, the majority of the high
performance designs are pipelined or have unrolled the round structure of AES. One of
the earliest AES designs that claimed high performance used a lookup table structure to
replace an entire round of AES [17, 36, 37, 38]. This approach was first proposed in [17],
which referred to these hardwired LUTs as T tables. More recently, T tables have been
referred to as T-boxes. As stated in [36], a necessary cost of this operation is the four 256
entry T-box lookup tables. Each T-box entry is one word, or 4 bytes, in length, so each
256 entry T-box occupies 8 Kilobits. The T-box approach was an
improvement over a previous high performance implementation by the same authors that
used a LUT for the S-Box [54].
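
To make the T-box structure concrete, the sketch below derives one table from the S-Box and the MixColumns coefficients, then obtains the other three by byte rotation. The byte ordering inside each word follows a convention common in software implementations and is an assumption here, as are the function names; the cited hardware designs may organize the tables differently.

```python
# Sketch: building the four 256-entry, 32-bit encryption T-boxes.
# Assumption: each T0 word packs [02*S, S, S, 03*S] from most to least significant byte.
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf_inv(a):
    """Inverse as a^254, with zero mapped to zero."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r if a else 0


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x):
    """Galois field inversion followed by the affine transformation."""
    v = gf_inv(x)
    return v ^ rotl8(v, 1) ^ rotl8(v, 2) ^ rotl8(v, 3) ^ rotl8(v, 4) ^ 0x63


def rotr32(w, n):
    return ((w >> n) | (w << (32 - n))) & 0xFFFFFFFF


# T0 merges SubBytes with one column of the MixColumns matrix.
T0 = []
for x in range(256):
    s = sbox(x)
    T0.append((gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3))

# The remaining tables are byte rotations of T0.
T1 = [rotr32(w, 8) for w in T0]
T2 = [rotr32(w, 16) for w in T0]
T3 = [rotr32(w, 24) for w in T0]

BITS_PER_TABLE = len(T0) * 32
print(hex(T0[0]), BITS_PER_TABLE)  # 0xc66363a5 8192
```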
A second T-box implementation was performed by [38]. This particular implementation
used a 32 bit datapath and had a 128 bit key with an offline key schedule. As with all
AES T-box implementations, there is a large memory requirement to achieve maximum
performance. T-box implementations that operate on 128 bit blocks offer high throughputs.
Implementations using 32 bit blocks can be more economical in terms of resource utilization,
but have lower throughputs [37]. This T-box methodology was applied to ASICs by [39].
A common technique to improve the performance of an iterated algorithm is to unroll
a number of rounds. This has the effect of eliminating accumulated delay from multiplexers
and registers that would normally control the round loops. A result of applying this
technique is duplicate hardware for each unrolled round, creating a long critical path. The
longer the critical path, the lower the clock frequency must be [5].
Pipelined implementations have been explored in detail by a great number of researchers.
Pipelining can be seen as an improvement on the unrolling strategy. This technique in-
creases the amount of data that can be processed concurrently by inserting registers between
independent hardware modules, allowing unrelated data to be processed in each module.
The added data parallelism increases the throughput of the implementation
at the expense of latency for a single block. The registers placed between modules
increase the resource cost above that required for a loop unrolled design strategy.
Registers can be placed in between rounds for inter-round pipelining and/or between in-
dividual round operations for intra-round pipelining. Intra-round pipelining has also been
referred to as a sub-pipeline design strategy [5].

In pipelined designs, it is common to use an online key schedule. In this type of key
schedule, the key is expanded in parallel with the encryption or decryption operations as
it is needed. This removes the need for additional key expansion memory. An offline key
schedule precomputes all of the round keys before the encryption or decryption operations
have begun. Although this guarantees that all round keys will be available when needed in
the encryption or decryption process, it also requires that memory be present to store every
round key.
[41] compares several pipelined implementations on a Xilinx Virtex II device, differing
in the amount of unrolling, round transformation partitioning, and S-Box technology. The
results from these different implementations showed a direct implementation of the S-Box
in FPGA logic yielding suboptimal results in all measured metrics. [10] uses a five stage
pipeline to implement the encryption logic with an online key schedule. The S-Box was im-
plemented using FPGA BRAM. Both [10] and [41] list several trade-offs when optimizing
for area, latency, or throughput.

4.1.1 Composite Field S-Box Techniques

In [34], the author makes a suggestion that the calculation of the Galois field inversion in
the SubBytes transformation could be reduced to operations in lower order Galois fields.
Using this technique, it is feasible to implement the circuitry for Galois field inversion.
The first work to apply this methodology to the AES S-Box was [47]. In this work, the
composite field methodology was used to compute the Galois field inversion, as well as
MixColumns and AddRoundKey operations in $GF((2^4)^2)$. The lower order Galois field
operations reduced the gate count of the finite field arithmetic operations significantly. [48]
illustrates in greater detail how this composite field technique is used to implement the
S-Box.

4.1.2 Composite Field S-Box Techniques and Pipelining

Following the ASIC implementation in [47], [33] applied the composite field $GF((2^4)^2)$
approach to an AES encryption implementation on Xilinx Virtex II FPGAs. The SubBytes
and MixColumns transformations were mapped into the composite field. The round
constants used in the key schedule were also mapped into the composite field. [33] implemented
AES encryption with a 128 bit key and a 128 bit wide data path. Inter-round
pipelining was used, which required a unique memory block following the operations of
each unrolled round. The large amount of additional memory required to register the data after
each round led [33] to state that the composite field technique would allow such a design
to be implemented in smaller devices by reducing the need for additional S-Box memory.
Following the work in [47, 33], [32] showed that it is optimal to use composite field
techniques only in the SubBytes transformation. [32] also presented an improved key
schedule designed for the intra-round, pipelined approach using the composite field $GF((2^4)^2)$.
One of the major goals in pipelining is to balance the latency at each stage. The stages
in the pipeline with the greatest latency will limit the performance of the entire pipeline.
To that end, [16] created a tower of composite fields $GF(((2^2)^2)^2)$. Using this technique,
it is possible to break the calculations for the Galois field inversion in the SubBytes transformation
into smaller components. These smaller design units are ideal for a pipelined
implementation, since the SubBytes transformation has the greatest latency of any individual
stage when implemented using a lookup table. [16] implemented this technique in
an ASIC. A pipelined design that makes use of this strategy is known as a sub-pipelined
design with balanced stages [5]. [44] utilized this strategy to create a high throughput,
pipelined design on a Xilinx FPGA. A thorough analysis of 16 different constructions of
$GF(((2^2)^2)^2)$ is conducted in [35], yielding an optimum choice for the irreducible polynomial
constants.
As mentioned previously, [32] utilized the composite field $GF((2^4)^2)$ in the S-Box calculation.
This paper also discussed three approaches to calculate the Galois field inversion
within this field: 1) continued composite field decomposition into $GF(((2^2)^2)^2)$, 2) utilizing
the square and multiply algorithm, and 3) calculation of the combinational equations
for each bit. It was found that the direct calculation of the combinational equations for each
bit resulted in a decrease in the length of the critical path and the total number of gates.

4.1.3 Precomputation Components

A more recent work, [46], reduces the critical path delay found in a composite field S-Box
implementation by utilizing precomputation units. The first presented design replaces the
$GF(2^4)$ inverse component and final two $GF(2^4)$ multipliers with 2 sets of $GF(2^4)$ multipliers,
one set for each of the four bits in the final $GF(2^8)$ S-Box output value. The inputs to
these multipliers are 1) constant inverse values in $GF(2^4)$ and 2) the values calculated during
the second stage of the conventional composite field S-Box. At the expense of twice the
area usage, this design reduces the critical path by 20%. The second design in [46] divides the
inverse affine transform component into smaller $GF(2^4)$ sub-transforms performed at the
output of each of the 32 $GF(2^4)$ multipliers. The author states that if intra-round pipelining
were performed, only 3 pipeline stages would be needed for the first design and 2 pipeline
stages for the second design instead of the 5 pipeline stages used in the conventional
approach.

4.2 Minimal Hardware Architectures for AES


The designs featured in this section target compact hardware implementations. In such
an implementation, throughput is sacrificed in favor of reduced hardware requirements [5].
In an ASIC, this corresponds to a lower transistor count. In FPGAs, a compact hardware
architecture utilizes less reconfigurable logic than a standard or high performance implementation.
The iterated round structure of AES allows a compact design to be created
simply by iterating over the hardware required to implement one round of the algorithm.
To minimize the requirements on hardware, some compact implementations perform the
key schedule online [43].
Using newer hardware with larger BRAM specifications, [40] was able to implement a
T-box in only two BRAM cells. Consequently, this implementation utilized a small amount
of reconfigurable logic, leading to a higher throughput to area ratio. Opposing this approach
is [43], which stated that “hardware implementations of MixColumns are smaller than T-
boxes, have a shorter critical path, and consume less power.”

4.2.1 Reduced Datapath Width

Another common technique to reduce the area requirements of AES is to utilize an 8 bit dat-
apath width. This has the effect of greatly reducing the throughput. [44] utilized this lower
datapath width to create a low area implementation with a drastically reduced throughput.
The S-Boxes were allowed to reside within BRAM. [43] used this technique to construct an
ASIC. The MixColumns transformation uses a novel technique that calculates one column
of the state in seven clock cycles. The S-Box is implemented using combinational logic.
[53] also utilized this approach for an ASIC with higher throughput results. [45] used the
8 bit datapath to produce an Application Specific Instruction Processor (ASIP) on a Xilinx
FPGA. Additional FPGA area savings are gained with a shared finite field multiplier in
the SubBytes composite field. All remaining AES operations, including the MixColumns
steps, are performed using an iterative multiply-accumulate architecture. Using these tech-
niques, [45] uses less reconfigurable logic than any of the previous implementations.

4.2.2 Substructure Sharing

An implementation that supports both encryption and decryption is able to share substruc-
tures between the two. [51] creates a low area implementation that shares the MixColumns
transformation hardware. According to [15], the MixColumns substructure sharing concept
originated with P. Barreto. Using this idea, the InvMixColumns transformation is simply
the MixColumns transformation with some preprocessing. This reduces performance of
the algorithm, but reduces the hardware requirement as well. In such an approach, encrypt
and decrypt cannot be used in parallel. The key schedule used in [51] was designed such
that all the subkeys were pregenerated and stored in a single BRAM before being used.
Since the key schedule and encryption modules were not used at the same time, the BRAM
stored S-Box was shared between the two with switching logic.
Besides MixColumns substructure sharing, it is possible to share hardware in the Sub-
Bytes transformation. In particular, the Galois field inversion is identical between encryp-
tion and decryption. Some switching logic is required to create a datapath that would either
1) perform the inverse affine transform, then the Galois field inversion (decrypt) or 2) per-
form the Galois field inversion, then the affine transform (encrypt) [5, 15]. This technique
is employed in [45].
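
The shared datapath can be sketched in software as a single inversion routine surrounded by a selectable affine or inverse affine step. The function names and the `decrypt` flag standing in for the switching logic are assumptions made for illustration; the rotation-based affine expressions are the standard AES ones.

```python
# Sketch: SubBytes and InvSubBytes sharing one Galois field inversion.
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf_inv(a):
    """Shared component: inverse as a^254, with zero mapped to zero."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r if a else 0


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x):
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y):
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def sub_byte(x, decrypt=False):
    """The decrypt flag plays the role of the switching logic."""
    if decrypt:
        return gf_inv(inv_affine(x))   # inverse affine first, then inversion
    return affine(gf_inv(x))           # inversion first, then affine


print(hex(sub_byte(0x53)), hex(sub_byte(0xED, decrypt=True)))  # 0xed 0x53
```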

4.2.3 Normal Basis Techniques

Another improvement on the composite field technique has been to change its basis. [42]
represented elements in the normal basis at each level in the tower of fields $GF(((2^2)^2)^2)$.
432 possible decompositions were investigated in this work, yielding the smallest hardware
implementation for this particular composite field. [49] uses the same technique, but does
not descend into the lowest composite field, choosing instead to use $GF((2^4)^2)$. This composite
field was not explored by [42]. Gate costs of this implementation were comparable
with that of the earlier work. [50] implements the entire AES algorithm in a normal basis
on an ASIC. None of these previously listed normal basis techniques have been implemented
in an FPGA.

4.3 Galois/Counter Mode Architectures


The implementations of GCM explored in this section exclusively target AES as the block
cipher algorithm. These designs primarily focus on high performance, since authenticated
encryption and decryption with GCM are relatively efficient and parallelizable [9].

[55] completed a pipelined ASIC implementation of GCM with a throughput of 34
Gbps using a 128 bit key. An iterative implementation of AES was used to calculate H,
while a round pipelined approach was used in the main GCM encryption datapath. The
S-Boxes were implemented in LUTs. The critical path is found at the Mastrovito parallel
finite field multiplier used in the GHASH function.
The three ASIC GCM implementations in [56] were completed for all three AES key
sizes and each had a lower gate count than reported in [55]. [56] attributes this to the use
of composite field and binary decision diagram (BDD) techniques for S-Box implementation
and architectural changes in the GCM datapath. A sequential GCM architecture was
evaluated with AES architectures constructed as 1) a 4-stage pipelined loop and 2) a round
pipeline architecture. A parallel GCM architecture was also constructed with four iterative
AES implementations. A four clock cycle $GF(2^{128})$ multiplier is used with the sequential
4-stage AES pipelined loop implementation, which then gives a similar throughput to the
parallel GCM implementation at the same clock frequency. The sequential version, however,
had a lower gate count. The sequential GCM with pipelined AES implementation
achieved the highest throughput with the best throughput to area ratios.
[57] improves upon the throughput found in the earlier work [56]. This work uses
the parallelization of the GHASH function suggested in [20] where the input data is interleaved
in even and odd sequences. Four parallel GHASH components are used together
with four AES components, each pipelined. Throughput rates are over 100 Gbps with better
throughput to area ratios. This ratio statistic was improved by sharing the same key
scheduler for the entire circuit. Building on [57], [58] again targets improvements to the
$GF(2^{128})$ multiplier. In this work, GHASH is broken into four 128 bit by 32 bit pipelined
digit-serial multiplications. Combined with a 56 stage pipelined AES implementation, this
work achieves a throughput of 54.94 Gbps with a higher throughput to area ratio than seen
previously. These works are summarized in [59].
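
The even and odd interleaving idea can be illustrated with two GHASH streams. The sketch below assumes the bit-serial $GF(2^{128})$ multiplication of the GCM specification and an even number of input blocks; the function names, the example hash key and the block values are arbitrary assumptions, and the cited designs use four streams rather than two.

```python
# Sketch: serial GHASH versus a two-way interleaved GHASH using H^2.
# Blocks and H are 128-bit integers; bit 127 of the integer is the x^0 coefficient.
R = 0xE1 << 120  # reduction constant of the GCM field polynomial


def gf128_mul(x, y):
    """Bit-serial multiplication in GF(2^128) in the GCM bit ordering."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_serial(h, blocks):
    """Horner evaluation: one dependent multiplication per block."""
    y = 0
    for x in blocks:
        y = gf128_mul(y ^ x, h)
    return y


def ghash_two_way(h, blocks):
    """Two independent streams, each stepping by H^2 (assumes an even block count)."""
    h2 = gf128_mul(h, h)
    a = 0
    for x in blocks[0::2]:          # first, third, fifth, ... blocks
        a = gf128_mul(a ^ x, h2)
    b = 0
    for x in blocks[1::2]:          # second, fourth, sixth, ... blocks
        b = gf128_mul(b, h2) ^ x
    return a ^ gf128_mul(b, h)      # final correction by H for the second stream


H = 0x66E94BD4EF8A2C3B884CFA59CA342B2E
BLOCKS = [(0x0123456789ABCDEF0123456789ABCDEF * (i + 1)) % (1 << 128) for i in range(6)]
print(ghash_serial(H, BLOCKS) == ghash_two_way(H, BLOCKS))  # True
```

The two loops in `ghash_two_way` have no data dependency on each other, which is what allows separate multipliers to process them concurrently in hardware.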
Virtex-4 FPGA implementations of GCM were explored in [60]. The AES implementations
had data path widths of 128, 64, 32, and 16 bits and supported all three key lengths. Three
S-Box implementations were evaluated: lookup tables; a composite field approach, descending
into $GF(2^4)$; and on-chip BRAM. The $GF(2^{128})$ multiplier was implemented
as bit-parallel, digit-serial, and as a hybrid where a composite field $GF((2^{16})^8)$ allows a
16 bit multiplier to complete the multiplication. [60] found that the hybrid multiplier was
larger and slower than a digit serial approach on an FPGA, likely due to the selected irreducible
polynomial for the composite field. The 128 bit datapath GCM core had the highest
performance at over 10 Gbps. The 16 bit datapath versions operated around 2 Gbps and
utilized significantly less logic than the higher datapath versions.
In [61], two high performance GCM designs targeting a Xilinx Virtex-4 FPGA were
created. A detailed complexity analysis using FPGA primitives is also given. The $GF(2^{128})$
multiplier is implemented in bit parallel fashion for performance and using Karatsuba’s
algorithm for a more area efficient design. The AES implementations target 128 bit keys
and are pipelined using the composite field approach for the S-Box to balance the delay
time with the GHASH multiplier. A separate iterative AES component is used to calculate
H, but does share the key scheduling component with the main datapath pipelined AES.
The GCM containing a pipelined AES with Karatsuba based multiplier reached 15 Gbps,
while the inner-round pipelined AES with a bit parallel multiplier reached 20 Gbps.
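
The area saving of a Karatsuba based multiplier comes from replacing four half-width products with three. The sketch below shows this for carry-less (GF(2) polynomial) multiplication of 128 bit operands and compares the result against a schoolbook product. It produces the unreduced product only; the reduction by the GHASH field polynomial is omitted, and the function names and recursion cutoff are assumptions rather than details of [61].

```python
# Sketch: Karatsuba carry-less multiplication of GF(2) polynomials.
def clmul(a, b):
    """Schoolbook carry-less multiplication, used as the reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_clmul(a, b, bits=128):
    """Three half-width products per level; bits is assumed to be a power of two."""
    if bits <= 8:
        return clmul(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = karatsuba_clmul(a0, b0, half)
    hi = karatsuba_clmul(a1, b1, half)
    # (a0 + a1)(b0 + b1) + lo + hi = a0*b1 + a1*b0, since addition is XOR
    mid = karatsuba_clmul(a0 ^ a1, b0 ^ b1, half) ^ lo ^ hi
    return (hi << bits) ^ (mid << half) ^ lo


A = 0x66E94BD4EF8A2C3B884CFA59CA342B2E
B = 0x0388DACE60B6A392F328C2B971B2FE78
print(karatsuba_clmul(A, B) == clmul(A, B))  # True
```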
Improving upon [61], [62] uses a 4 stage pipelined Karatsuba-Ofman based multiplier
in the GHASH component. As stated in [62], this approach for the multiplier has reduced
complexity when compared to the approach in [58], reducing hardware area requirements.
BRAM, LUT and composite field implemented S-Boxes are used in the different imple-
mentations. [62] achieves a higher max operating frequency than their previous work in
[61] and a throughput to slice improvement over [60] by a factor of three, comparing the
Virtex-4 implementations with AES supporting all three key lengths. On a Virtex-4, the
highest throughput was 34 Gbps; while on the Virtex-5, the highest achieved throughput
was 39 Gbps.

4.4 FPGA-based Single-chip Cryptography
The concept of a system on a single chip is not a new one. System designers, having the
goal of reducing the resources required to implement a digital system, utilized ASICs to
implement their designs. As stated in section 2.4, the time and costs associated with cre-
ating an ASIC can greatly exceed that of creating an FPGA design. The logic resource
requirements of implementing an entire system on an FPGA kept FPGA implementations
infeasible for some time. Adding to the problem, fail-safe systems require physically sep-
arate, redundant components. A physically separated, cryptographic system is illustrated
in Figure 4.1. According to [63], verification analysis of these redundant FPGA designs
was considered to be an intractable problem by many security analysts. Not until relatively
recently have FPGA manufacturers been able to produce tools that verify such a design
[24, 63]. As such, the topic of FPGA-based SCC is still developing. The remainder of this
section describes the related research done to create secure FPGA primitives and design
techniques necessary in creating a secure hardware implementation. A detailed description
of the bodies that evaluate design security or secure design standards is not within the scope
of this document.

Figure 4.1: Physically separated black and red subsystems that exchange data only by
means of a cryptographic engine [63].

[64] discusses the required primitives for a securely separated design on a reconfig-
urable system. The concepts of a moat and a drawbridge are introduced to establish stat-
ically verifiable physical isolation and physical interfaces, respectively, on one reconfig-
urable chip. An additional technique known as reconfiguration scrubbing is presented to
ensure information does not remain that could compromise the security of the system after
a partial reconfiguration of a particular core. In this work, isolated cores can only make use
of routing with lengths that are no longer than the width of the moat. An experiment in [64]
that eliminated hex and long lines resulted in a 14.9% area increase and 18.9% increase in
critical path delay on average.
[65] creates a design to test the practicality of using moats, drawbridges, and reference
monitors, mentioned in earlier works [66, 64], in an actual system. Two MicroBlaze pro-
cessors are implemented in a Xilinx FPGA as a red-black system. In this system, resources,
including an AES encryption core, are shared on an On-chip Peripheral Bus (OPB). A ref-
erence monitor is used to govern access to each of the peripherals by a particular processor.
The monitor can reside on the OPB or between a processor and memory. The monitor
cannot be bypassed and must validate access to each peripheral. Using the monitor, it is
possible to share a memory device between processor cores. A specific range of memory
addresses can be permitted for access to a particular processor. Each processor and the
reference monitor are isolated by a moat of unused CLBs and have appropriate drawbridges on
which signals can pass. Since each peripheral in the example system is a memory mapped
device, the reference monitor governs access to the memory range belonging to each pe-
ripheral. Moat size directly impacts the number of used CLBs and the reference monitor
adds some minor overhead to each peripheral access.
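The reference-monitor idea can be sketched in software as a default-deny check of every access against a per-processor table of permitted address ranges. This is only an illustrative model: the `ReferenceMonitor` class, the processor names, and the address ranges are hypothetical and are not taken from [65].

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    base: int        # first address covered by the rule
    size: int        # number of bytes covered
    can_read: bool
    can_write: bool

class ReferenceMonitor:
    """Hypothetical model: every access is validated against the policy."""

    def __init__(self, policy):
        self.policy = policy  # maps processor name -> list of Rule

    def allowed(self, proc, addr, write):
        for rule in self.policy.get(proc, []):
            if rule.base <= addr < rule.base + rule.size:
                return rule.can_write if write else rule.can_read
        return False  # default deny: unmapped addresses are never accessible

# Assumed memory map: a shared AES core and a shared RAM whose halves
# are split between the two processors.
POLICY = {
    "red":   [Rule(0x1000, 0x100, True, True),    # AES core registers
              Rule(0x8000, 0x4000, True, True)],  # lower half of shared RAM
    "black": [Rule(0x1000, 0x100, True, False),   # AES core, read only
              Rule(0xC000, 0x4000, True, True)],  # upper half of shared RAM
}

monitor = ReferenceMonitor(POLICY)
```

Because each peripheral is memory mapped, a range check of this kind is sufficient to govern peripheral access as well as shared memory access.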
Both Altera and Xilinx have developed software tools and FPGAs that support the necessary
separation primitives. [24] describes the structure of a fence in Xilinx FPGAs as a set
of unused CLBs in which no logic or routing may be present. This fence is used to create
isolated regions through which access is permitted via bus macros. The use of longer length
interconnects is expressly forbidden where one of the possible end points of the interconnect
could be in an isolated region. An Isolation Verification Tool (IVT) was developed in
order to verify each isolated region [24].

Figure 4.2: A partitioned design (right) which has been locked into regions in an FPGA
floorplan with appropriate security routing interfaces (left) [63].

A fence in Altera FPGAs is a boundary of unused LABs, which creates a secured region
when completely surrounding some on-chip logic. The SRIs placed between secured regions
are the equivalent of the bus macros used in Xilinx FPGAs. The SRI has
several levels of security which limit the signals that may be passed between the regions
[2]. Figure 4.2 shows a partitioned design (right) which has been locked into regions in
the FPGA floorplanner with appropriate SRIs (left) using Altera’s Quartus II tools. Each
FPGA manufacturer was able to meet the classified Fail-Safe Design and Analysis (FSDA)
requirements set by the National Security Agency (NSA) [24, 63].

5. System Architecture
In this chapter, a detailed overview of the design and functionality of each hardware component
is given. After the component design has been established, the organization of the
systems targeted towards the metrics of high performance, small area, and a balance between
the two is described. The design of the systems is first discussed at a generic level common to
each targeted metric. The interface and verification software is also presented, followed
by a section describing work performed with the electronic design automation (EDA)
security tools provided by the Altera Corporation.

5.1 Component Hardware Design


This section describes the design of the lowest level hardware components in this work.

5.1.1 AES Key Schedule

The first key schedule to be implemented in this work was chosen to be offline with a 128-
bit datapath width. The AES key schedule is designed such that each 32-bit word depends
upon the previous 32-bit word, removing the possibility of a parallel word generation. To
create a 128-bit datapath, four 32-bit words must be generated during one clock cycle.
Thus, later words in a key schedule round will pass through more combinational logic than
earlier words before ultimately being registered.
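The serial dependency between key schedule words can be seen in a short software model of the standard AES key expansion. This is a minimal sketch of the algorithm itself, not of the hardware in this work; the function names are illustrative, and the S-Box is computed on the fly only to keep the example self-contained.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox_byte(x):
    """Field inversion (x^254, with 0 mapped to 0) followed by the affine transform."""
    inv = 0
    if x:
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
    out = 0
    for i in range(8):
        bit = 0x63 >> i
        for k in (0, 4, 5, 6, 7):
            bit ^= inv >> ((i + k) % 8)
        out |= (bit & 1) << i
    return out

SBOX = [sbox_byte(x) for x in range(256)]

def sub_word(w):
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def rot_word(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF

def expand_key(key):
    """Return the expanded key as a list of 32-bit words (16, 24 or 32 byte key)."""
    nk = len(key) // 4            # words in the private key
    nr = nk + 6                   # number of rounds
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]           # every word depends on the word before it
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)
        w.append(w[i - nk] ^ temp)
    return w
```

In a 128-bit wide datapath, four consecutive iterations of this loop are evaluated within one clock cycle, so the fourth word passes through the XOR chain of the three words before it.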
This key schedule was modeled after the 128-bit wide datapath design presented in [56].
This design is shown in Figure 5.1. Since the design presented in [56] targeted an ASIC
approach, it was modified to better fit the architecture of the target FPGA. This modified
design is shown in Figure 5.2.

Figure 5.1: The 128-bit width datapath key schedule from [56].

Figure 5.2: The 128-bit width datapath key schedule from [56] that was modified to better
fit the architecture of the target FPGA.

In this modified version of the key schedule, the private key is stored directly into the
registers labeled t0 through t7. This reduces the required number of registers by eight. The
state machine that controls the key schedule leaves the multiplexor inputs to the registers
switched such that they can be written to while the state machine is idle. The register
enables are one-hot encoded inputs provided at the top level of the key schedule that must be
enabled when the appropriate key word is written. These signals are OR-ed with the register
enable signals used during normal key expansion operations. Removing the input registers
also removed the four two-input 32-bit multiplexors that fed the
upper removed registers.
The key schedule in [56] would also not be capable of providing an entire 128-bit block
every key schedule round for all key lengths as it was pictured. This was partly due to
the inputs to the round function multiplexor. In this modified version, only three inputs
are necessary to the round function multiplexor and will allow four 32-bit words to be
generated every cycle.
The modified key schedule has only one 32-bit input to match up with the GCM archi-
tecture and outputs four 32-bit words to expanded key memory every cycle. The expanded
key memory component is left abstracted from the key schedule so the architecture can be
easily changed in the future.
The modified key schedule pictured in Figure 5.2 was initially designed to support a
combinational S-Box calculation. A second version was designed that allowed a memory
based S-Box to be used. This memory based version requires an additional cycle of ini-
tialization for all three key lengths. Three 32-bit inputs are added to the round function
multiplexor to allow data to be read before it would normally be registered. This changes
the order of components found on the main datapath through the key schedule. The round
function multiplexor and SubWord multiplexor, instead of always being the first compo-
nents during a round function cycle, are now the last components. This does not affect the
length of the critical path through the key schedule.
The second key schedule implementation was selected to have a 32-bit wide datapath.

Figure 5.3: The 32-bit width datapath key schedule that was modified from [51] to better
fit the architecture of the target FPGA.

This implementation width is better aligned with the AES key schedule algorithm since an
entire word can be generated in a key schedule round. The dependencies between 32-bit
words in the key schedule are illustrated in Figure 2.6 and Figure 2.7.
The 32-bit wide key schedule was modeled on an implementation in [51]. This imple-
mentation was modified to make better use of the target FPGA’s resources. The modified
version of this key schedule is shown in Figure 5.3. The original version targets the 128-
bit key length. This modified version supports all three AES key lengths. Here, an eight
element shift register is used with two additional taps that provide access to the fourth and
sixth elements. Please note that in the figure, these elements are indexed from zero, thus
they are offset by one. Depending on the selected key length, the multiplexor chooses the
appropriate output of the shift register.
The private key for this circuit is stored in its own memory. The original circuit utilized
the shift register for this completely. The memory approach has the advantage that the
private key words can be applied in any order. In a system where this engine is operating
for hardware acceleration, this allows the software to write the key words in any order or
rewrite specific key words without having to reload all the key words in the correct order.
The key schedule shown in Figure 5.3 was designed to have a combinational SubWord
component. A second version of the key schedule was designed that supported a memory

Figure 5.4: The 128-bit width AES encryption datapath modified from [59].

based SubWord component. In this version, the single 32-bit round register is moved after
the multiplexor that previously had followed it. The datapath for this implementation with
a combinational SubWord component is longer.

5.1.2 AES Encryption

Corresponding to the two key schedule implementations, there were two main AES encryp-
tion components developed. Both implementations utilized a combinational MixColumns
component. The first encryption implementation made use of a 128-bit datapath and was
based on the simple loop method given in [59]. Since that work targeted an ASIC approach,
several modifications were made to make better use of the FPGA being targeted in
this work. This modified version is shown in Figure 5.4.
The 128-bit round register was moved from after the final XOR in a round to the location
immediately following the input multiplexor. In this location, the round register can
also register the initial XOR with the first round key. In the round register's previous location,
the expanded key memory would be required to provide both the first round key
and the second round key within the same cycle. This is possible since the target

Figure 5.5: The 32-bit width AES encryption datapath.

FPGA supports true dual-port memory; however, the expanded key memory’s other input
for true dual-port mode is allocated to the key schedule. So, to eliminate the need to add
another 128-bit register in the circuit or multiplexor for the true dual-port memory control
signals, the round register was simply moved. This also reduces the length of the critical
path through the encryption implementation when a combinational S-Box implementation
is utilized.
In [59], an additional 128-bit input is given into the component that is multiplexed with
128 zero bits. The output of the multiplexor feeds into an XOR gate. During normal iter-
ative rounds, the all zero input is used. Once the final ARK is complete, the multiplexor
is switched to XOR the encrypted data with the input data. This facilitates the encryption
implementation's function in the counter mode of operation. This final value is then
registered in the round register.
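The role of the final XOR can be illustrated with a small counter mode model: the block function encrypts a counter value, and the result is XORed with the data input. This is a sketch under stated assumptions; `toy_block_encrypt` is a hypothetical stand-in for AES (it is not a real cipher) so that the example stays short, and the 32-bit counter increment follows the GCM convention.

```python
import hashlib

def toy_block_encrypt(key, block):
    """Stand-in for AES, NOT a real cipher: counter mode only needs a keyed
    function that maps a 16-byte counter block to 16 bytes of keystream."""
    return hashlib.sha256(key + block).digest()[:16]

def ctr_xcrypt(key, initial_counter, data):
    """Encrypt or decrypt block-aligned data; the two operations are identical."""
    assert len(data) % 16 == 0
    out = bytearray()
    counter = initial_counter
    for off in range(0, len(data), 16):
        keystream = toy_block_encrypt(key, counter.to_bytes(16, "big"))
        # Final XOR: the data input replaces the all-zero multiplexor input.
        out += bytes(k ^ d for k, d in zip(keystream, data[off:off + 16]))
        # Only the low 32 bits of the counter block are incremented.
        counter = (counter & ~0xFFFFFFFF) | ((counter + 1) & 0xFFFFFFFF)
    return bytes(out)
```

Since only the forward direction of the block cipher is used, the same datapath serves both encryption and decryption.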
In the version of encryption that uses a memory based implementation of SubBytes, the
round register is removed from the main round datapath. The round register is still retained
in the circuit itself, but is used to register the final output data value.
The 32-bit wide version of AES encryption is based on a word iterative approach and is
shown in Figure 5.5. The subcomponent requirements are reduced from the 128-bit wide
version, especially in the SubBytes round function where only four S-Boxes are needed
instead of sixteen. Eight 32-bit registers are used during round operations. The first group
of four registers is used between rounds, while the second group of registers is used
to realign the data before it is processed by the ShiftRows round operation. This is due to a
dependency that exists between 32-bit words in this operation. Once the data is aligned, the
ShiftRows round operation is simply a rewiring of signals at the byte level. Multiplexers
are used to select the appropriate 32-bit word for the remaining round operations.
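The byte-level rewiring and the dependency between 32-bit words can be made concrete with a short model of ShiftRows. The sketch assumes the standard AES state layout, in which byte r of column c is stored at index r + 4c; the helper `source_columns` is illustrative only.

```python
def shift_rows(state):
    """state: 16 bytes in column-major order (byte r of column c at r + 4*c).
    Row r is rotated left by r positions, which is a pure index permutation."""
    return bytes(state[r + 4 * ((c + r) % 4)]
                 for c in range(4) for r in range(4))

def source_columns(c):
    """Input columns (32-bit words) that feed output column c."""
    return sorted({(c + r) % 4 for r in range(4)})
```

Every output word draws one byte from each of the four input words, which is why all four words of a round must be available before the first output word can be formed.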
This 32-bit implementation also has a 128-bit additional input for counter mode opera-
tion. In this case, the multiplexor on this input toggles between all four 32-bit words in this
128-bit input in addition to the zero input.

5.1.3 AES S-box

For each of the encryption and key schedule designs, an appropriate S-Box implementation
must be selected for the SubBytes or SubWord round operation. Four S-Box implemen-
tations were explored in this work: 1) single-port M9K memory based, 2) dual-port M9K
memory based, 3) LUT based, and 4) composite field based. The composite field based im-
plementation makes use of combinational logic to calculate the S-Box output value, while
the other three implementations reference a precalculated value.
As was described in subsection 2.2.1, an S-Box is no more than a Galois field inversion
and an affine transform performed on one byte. In total, that is 2^8 possible values which
can all be precalculated. To use a LUT to implement an S-Box, 256 one-byte elements must
be allocated. These large LUTs are formed by combining four-input LUTs from the target
FPGA’s logic elements (LEs). The output of the LUTs is available after one cycle.
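The precalculation can be sketched as follows: each of the 2^8 input bytes is inverted in GF(2^8) and passed through the AES affine transform, and the results fill a 256-entry table. This is a minimal software model with illustrative function names; it says nothing about how the table contents were generated in this work.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(x):
    """Multiplicative inverse as x^254; zero is mapped to zero by convention."""
    if x == 0:
        return 0
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r

def affine(b):
    """AES affine transform: out_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i,
    with indices taken modulo 8 and c = 0x63."""
    out = 0
    for i in range(8):
        bit = 0x63 >> i
        for k in (0, 4, 5, 6, 7):
            bit ^= b >> ((i + k) % 8)
        out |= (bit & 1) << i
    return out

# The lookup table: 256 one-byte entries, 2048 bits in total.
SBOX = [affine(gf_inv(x)) for x in range(256)]
```

The composite field approach computes the same function with combinational logic instead of storing these 256 entries.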
The memory based implementations are created in the same manner, except that the
FPGA’s M9K memory elements are utilized instead of the LE’s LUTs. Each M9K provides
8192 bits of memory. Single-port memory instances have one address input, one data input
and one data output. True dual-port memory instances have two address inputs, two data
inputs and two data outputs. Utilizing true dual-port memory based S-Boxes costs half as much
as single-port memory because it requires half the number of M9Ks to
implement the same number of S-Boxes.
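The halving can be checked with simple arithmetic. The sketch below assumes that each S-Box lookup needs its own address port, so that the number of ports rather than the 8192-bit capacity limits how many S-Boxes share one M9K; that assumption and the function name are illustrative.

```python
import math

M9K_BITS = 8192          # capacity of one M9K block, from the text
SBOX_BITS = 256 * 8      # one S-Box table: 256 entries of one byte

def m9k_blocks(num_sboxes, ports_per_block):
    """Blocks needed when every S-Box lookup occupies one address port (assumed)."""
    return math.ceil(num_sboxes / ports_per_block)

single_port = m9k_blocks(16, 1)   # sixteen S-Boxes of a 128-bit SubBytes
dual_port = m9k_blocks(16, 2)
```

Under this assumption, the sixteen S-Boxes of a 128-bit SubBytes need sixteen single-port M9Ks but only eight true dual-port M9Ks.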
The composite field S-Box was chosen for use in this work to present an alternative to
the memory and LUT approaches. Depending upon usage, the composite field approach
can be used to achieve high performance or low area. In this work, the composite field
S-Box components are not pipelined. In other words, there are no registers present between
individual components within the composite field S-Box component. This was done in or-
der to explore the reduced logic cost of the composite field implementation. The composite
field S-Box structure is shown in Figure 3.2. Pipelining the unit would result in a much
shorter critical path through the component, but would increase its resource utilization.

5.1.4 GHASH

The implementations of GHASH explored in this work differ based on their implementation
of the GF(2^128) multiplier. A bit serial approach, a full parallel approach and a variety
of sequential multiply and add approaches are examined. The full parallel approach cal-
culates a result in one clock cycle, but has a very large critical path. In contrast to the
full parallel approach, the bit serial approach has a much smaller critical path, but requires
128 clock cycles before a result is calculated. The sequential multiply and add approaches
reduce the number of clock cycles found in the bit serial approach by multiplying a power
of two number of bits in one clock cycle. Due to the data dependency in these Galois
field multiplications, these calculations are not performed in parallel. Thus, the larger the
number of bits multiplied in one clock cycle, the deeper the critical path of the multiplier.
All of the implementations of the GF(2^128) multiplier were created from one generic
instance. At a minimum, the generic multiplier operated in bit-serial fashion. At a maximum, the
generic multiplier implemented a full parallel approach. This implementation was based on
the description of the sequential multiply and add approach presented in [57]. The desired
number of clock cycles until completion directly corresponds to the width of the sequentially
multiplied bits in the reduced Galois field multiplier. For example, an implementation that
takes 16 cycles to complete would utilize a GF(2^8) x GF(2^128) multiplier.
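The trade-off between cycle count and logic depth can be modelled with one generic routine whose only parameter is the number of bits multiplied per clock cycle. This is a behavioural sketch of the shift-and-add multiplication defined for GCM, not the VHDL used in this work; the parameter name `bits_per_cycle` is an assumption.

```python
R = 0xE1 << 120  # reduction constant for x^128 + x^7 + x^2 + x + 1

def gf128_mul(x, y, bits_per_cycle=1):
    """Return (product, cycles) for 128-bit operands held as Python ints.
    GCM bit order is used: the most significant bit is the x^0 coefficient."""
    assert 128 % bits_per_cycle == 0
    z, v = 0, y
    cycles = 0
    for start in range(0, 128, bits_per_cycle):
        # One modelled clock cycle. The steps inside are data dependent, so in
        # hardware they form a chain of logic rather than parallel units.
        for i in range(start, start + bits_per_cycle):
            if (x >> (127 - i)) & 1:
                z ^= v
            v = (v >> 1) ^ R if v & 1 else v >> 1
        cycles += 1
    return z, cycles
```

With `bits_per_cycle=1` the model is bit serial (128 cycles), with `bits_per_cycle=8` it corresponds to the GF(2^8) x GF(2^128) case (16 cycles), and with `bits_per_cycle=128` it is fully parallel (one cycle). The product is the same in every case.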
Each round of the GHASH function, a number of bytes from the H constant are mul-
tiplied by the 128-bit current GHASH value. Reductions are performed based on the
GHASH primitive polynomial: x^128 + x^7 + x^2 + x + 1. A 128-bit register stores the

Figure 5.6: A high-level diagram illustrating the individual GCM components and their
interconnections.

partial product until the final product has been evaluated. Once this is complete, this final
product is registered in a 128-bit register. The values stored in this register will always be
the final result of the sequential multiply and add operations.
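The way the hash register is updated from block to block can be sketched as a short loop: each input block is XORed into the register and the sum is multiplied by H. This is a software model of the GHASH function as defined for GCM, using a bit serial multiplier for brevity; the function names are illustrative.

```python
R = 0xE1 << 120  # reduction constant for x^128 + x^7 + x^2 + x + 1

def gf128_mul(x, y):
    """Bit serial GF(2^128) multiplication in GCM bit order."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z

def ghash(h, blocks):
    """blocks: 128-bit ints in the order AAD blocks, ciphertext blocks, length block."""
    y = 0                       # models the 128-bit hash register
    for block in blocks:
        y = gf128_mul(y ^ block, h)
    return y
```

For example, with H = 66e94bd4ef8a2c3b884cfa59ca342b2e, no AAD, one ciphertext block 0388dace60b6a392f328c2b971b2fe78 and the length block 0x80, the published GCM test vector for an all-zero 128-bit key gives the GHASH value f38cbb1ad69223dcc3457ae5b6b0f885.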

5.1.5 GCM

The highest level of the GCM algorithm is shown in Figure 5.6 with the control logic for
the state machine shown in Figure 5.7. It is the responsibility of this module to control
the functionality of all the previously designed subcomponents. Registers that receive their
input from the top level data input of the GCM engine are written in 32-bit words. Internally
to the engine, operations are handled at 32-bit or 128-bit widths, as defined by the particular
component. All implementations of GCM in this work utilize a 96-bit IV, as recommended
by the NIST standard [9] for efficiency purposes [55].
The AAD and CT length registers are 64-bits each. These registers were added so that
the processor could report to the engine the length of the data that was processed. The
GCM engine was designed to support data only in lengths of the AES block size. This
moves the data padding operations to the processors, which is a trivial operation. These
lengths are required when the final tag is generated.
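The processor-side padding and the use of the two length values can be sketched as below. The sketch assumes that the length registers hold bit lengths, as in the GCM specification; the text does not state the unit, and the function names are hypothetical.

```python
BLOCK_BYTES = 16  # AES block size

def pad_to_block(data):
    """Zero-pad data to a multiple of the AES block size (done in software)."""
    return data + bytes(-len(data) % BLOCK_BYTES)

def length_block(aad_len_bytes, ct_len_bytes):
    """Final GHASH input: 64-bit AAD bit length followed by 64-bit CT bit length.
    Bit lengths are an assumption taken from the GCM specification."""
    return ((aad_len_bytes * 8).to_bytes(8, "big")
            + (ct_len_bytes * 8).to_bytes(8, "big"))
```

The lengths passed to `length_block` are the lengths before padding, which is why they must be reported to the engine separately from the data.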

The IV and private key are stored within the top level of this engine. The private key is
stored in the lowest bits of the expanded key memory. The AES key schedule performs this
copy by design. The controlling state machine, Figure 5.7, was designed such that when
the private key is changed, recalculation is also performed for the Y0 value. This feature
was added so that some additional savings could be gained by not needing to invoke a
second operation to simply calculate this value when the IV is written after the private key
is changed. If the IV is not going to be changed with the private key, then the savings still
takes place because the Y0 will be recalculated with the new key. The state machine was
designed so that the IV could be changed on its own, which would then update Y0 .
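For a 96-bit IV, the GCM specification forms Y0 by appending a 32-bit counter with the value one, and later counter blocks increment only those 32 bits. The following sketch models that construction; the function names are illustrative and the engine's register layout is not implied.

```python
def y0_from_iv(iv):
    """Y0 = IV || 0^31 || 1 for a 96-bit IV."""
    assert len(iv) == 12
    return iv + b"\x00\x00\x00\x01"

def inc32(counter_block):
    """Increment the low 32 bits of a 128-bit counter block, wrapping modulo 2^32."""
    prefix = counter_block[:12]
    ctr = int.from_bytes(counter_block[12:], "big")
    return prefix + ((ctr + 1) & 0xFFFFFFFF).to_bytes(4, "big")
```

The tag depends on the encryption of Y0 under the current key, which is why Y0 must be recalculated whenever either the key or the IV changes.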
Although it is not necessary, the GCM engine provides read access to the AAD and CT
length registers. This can be helpful in a debugging scenario. This is not necessary because
GCM is an online algorithm.
This engine was designed to support all three AES key lengths and GCM in both en-
cryption and decryption modes. Figure 5.6 shows a signal in thick red that is only active
during decryption mode and a signal that is only active in encryption mode in thick black.
The AES key length is indicated to the engine by means of a one hot encoded signal. The
encryption mode is indicated by means of a single signal that is logic level high when in
encrypt and logic level low when in decrypt.
An internal control register is continually read by the state machine to determine next
state information. The inputs to this control register are external strobe signals. These
signals correspond to the operations available in the engine: 1) start key change, 2) start IV
change, 3) start AAD hash, 4) start CT/PT encryption/decryption and hash, and 5) start final
tag generation.
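How software might interpret the one-hot key length and the mode signal can be sketched as a small decoder. The bit positions below are an assumption (the field order of the config register in Table 5.1, right-aligned to bit 0) and the constant names are hypothetical.

```python
# Assumed bit positions; only the one-hot encoding itself comes from the text.
KEY_LEN_128 = 1 << 0
KEY_LEN_192 = 1 << 1
KEY_LEN_256 = 1 << 2
ENCRYPT = 1 << 3        # high for encrypt, low for decrypt

def decode_config(value):
    """Return (key_bits, rounds, encrypt) or raise if the key field is not one-hot."""
    sel = value & 0b111
    key_bits = {KEY_LEN_128: 128, KEY_LEN_192: 192, KEY_LEN_256: 256}.get(sel)
    if key_bits is None:
        raise ValueError("key length field must be one-hot")
    rounds = key_bits // 32 + 6   # 10, 12 or 14 AES rounds
    return key_bits, rounds, bool(value & ENCRYPT)
```

A one-hot field lets the hardware use each bit directly as a select signal, at the cost of making some register values invalid.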
An additional register was created to report to the controlling state machine when the
last component has finished in a decryption operation. A high level goal in this thesis work
was to put an emphasis on modularity. The addition of this register allows different AES
and GHASH implementations to be changed without having to modify the top level GCM
state machine to identify which consumes more cycles. This is important because the AES
encrypt component and the GHASH component can run in parallel during a decryption
operation. Each component in the engine notifies the controlling state machine that it
is finished by means of a done flag. If the state machine monitored for this condition
directly, it would require three additional states to cover the cases in which both components
finish simultaneously or either one finishes before the other.
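The effect of the extra register can be modelled as a pair of sticky bits that capture each component's done flag, so that the state machine only waits for both bits to be set. This is a behavioural sketch with hypothetical names; the cycle counts passed in are arbitrary and must be at least one.

```python
def cycles_until_both_done(aes_cycles, ghash_cycles):
    """Simulate AES and GHASH running in parallel during decryption."""
    done_reg = {"aes": False, "ghash": False}   # the added register
    cycle = 0
    while not all(done_reg.values()):
        cycle += 1
        # A done flag is only seen in the cycle it is raised; the register
        # keeps the event until the state machine has observed both.
        if cycle == aes_cycles:
            done_reg["aes"] = True
        if cycle == ghash_cycles:
            done_reg["ghash"] = True
    return cycle
```

The wait condition is the same regardless of which component is slower, so swapping in a different AES or GHASH implementation does not change the state machine.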

5.2 System Organization


At the highest level, the system consists of two Nios II microprocessors and two GCM
engines. This layout is shown in Figure 1.1. The Nios II processors were designed with a
custom component that merely outputs the Avalon bus signals as a conduit. These Avalon
bus signals were then interfaced with another component that registered each of the signals.
The waitrequest signal was handled as a special case since it signals to the processor that
the engine would like it to wait. These registered signals were then outputs of the processor
subsystem (black or red). In this way, the register interface wraps the conduits of the Nios
II processor. A register interface was also placed around each GCM engine's signals.
The two register interfaces were then connected.
The register interfaces are at the highest level of the design hierarchy within a subsys-
tem partition. Thus, any signal passing between a processor partition and a GCM engine
partition is registered at its output and again at its corresponding input on the destination
component. This register interface was added based on a recommendation in [67], which
stated, “Registers minimize the delays on inter-partition paths, and prevent the need for
cross-boundary logic optimizations.”
The GCM module was originally designed to support 128-bit inputs, but was modified
to support 32-bit inputs. This modification was necessary due to the maximum width of the
data bus of the Nios II processor. Since this system was targeted to be used by the Nios II,
this was a required modification. Had this change not been made, more logic would have been
necessary on the top GCM wrapper layer to properly handle data presented to the engine.
The only defined requirement in this area was that the engine would interface with a Nios
II processor, so a 128-bit input and output from the system was unnecessary.
Immediately below the register wrapper interface on a GCM engine is a GCM interface
component. It is the role of this module to translate all signals passing between the processors
and the engines into the formats that each expects. Among the most important features
of this component are the control, configuration and status registers. The status register
can be read by either processor and simply registers when each of the GCM engine's tasks
is complete. There are twice as many bits as there are GCM operations because both the
black processor and the red processor have their own bits that indicate which task is complete.
These bits must be cleared by the corresponding processor. For example, the black
processor cannot clear the red processor's status register bits. The purpose of clearing a bit
is to notify the other processor that the current processor has finished working with the data
corresponding to a particular operation. Again, all of these bits can be read by either processor.
Thus, each processor can read when an operation it is waiting on is complete and
when the other processor has finished performing operations with respect to a particular
GCM operation.
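The handshake carried by the status register can be sketched with a small model in which each operation has one bit per processor. The bit positions (red in bits 4:0, black in bits 9:5) are read from Table 5.1, and the assumption that the engine sets both processors' bits when an operation completes is illustrative rather than confirmed by the text.

```python
OPS = ["key_change", "iv_change", "aad", "ct", "tag"]
RED_SHIFT, BLACK_SHIFT = 0, 5   # assumed from the bit order in Table 5.1

class StatusRegister:
    def __init__(self):
        self.value = 0

    def engine_done(self, op):
        """Assumed behaviour: completion sets the bit of both processors."""
        bit = OPS.index(op)
        self.value |= (1 << (RED_SHIFT + bit)) | (1 << (BLACK_SHIFT + bit))

    def clear(self, proc, op):
        """A processor can only clear the bit in its own group."""
        shift = RED_SHIFT if proc == "red" else BLACK_SHIFT
        self.value &= ~(1 << (shift + OPS.index(op)))

    def read(self):
        """Either processor may read all bits."""
        return self.value
```

A processor that sees its own bit set knows the operation is complete, and a processor that sees the other group's bit cleared knows the other side has finished with the data.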
The control register allows a processor to start a GCM operation and the configuration
register sets the AES key length and encryption or decryption mode of operation. The
configuration register may only be set by the red processor. Restrictions are placed on
other control and status register bits such that the black processor may not write to them.
This depends on the current configuration of the engine. Table 5.1 indicates which address
can or cannot be written or read by a given processor under certain conditions. Writing
to a register from the incorrect processor has no effect. Reading from a register that is
inaccessible to a particular processor will result in all zero output.
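The access rules can be sketched as a lookup of the access rights for the current mode and processor, with writes silently dropped and reads returning zero when access is not permitted. The rights for the four registers shown are copied from Table 5.1; the `RegisterFile` class and the way the mode is switched are simplifications and are not the engine's actual interface.

```python
# Column order: (encrypt, red), (encrypt, black), (decrypt, red), (decrypt, black)
COLUMN = {("encrypt", "red"): 0, ("encrypt", "black"): 1,
          ("decrypt", "red"): 2, ("decrypt", "black"): 3}

ACCESS = {
    "config": ("RW", "R", "RW", "R"),
    "key 0":  ("W", "No", "W", "No"),
    "iv 0":   ("RW", "R", "R", "RW"),
    "pt 0":   ("RW", "No", "R", "No"),
}

class RegisterFile:
    def __init__(self):
        self.regs = {name: 0 for name in ACCESS}
        self.mode = "encrypt"   # simplified: set directly instead of via config

    def _rights(self, proc, name):
        return ACCESS[name][COLUMN[(self.mode, proc)]]

    def write(self, proc, name, value):
        if "W" in self._rights(proc, name):
            self.regs[name] = value & 0xFFFFFFFF
        # otherwise the write has no effect

    def read(self, proc, name):
        if "R" in self._rights(proc, name):
            return self.regs[name]
        return 0                # inaccessible registers read as all zeros
```

Note that the key registers are write-only even for the red processor, so the private key can never be read back over the bus.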

Table 5.1: GCM register map.

Register     Encrypt        Decrypt        Contents (encrypt register / decrypt register)
name         Red    Black   Red    Black
status       RW¹    RW¹     RW¹    RW¹     bits 9 to 0, listed below
config       RW     R       RW     R       encrypt or decrypt, key len 256, key len 192, key len 128
control      W      No      W      W¹      clr tag², start tag, start ct/start pt, start aad, start iv change, start key change²

key 0        W      No      W      No      key(31 downto 0)
key 1        W      No      W      No      key(63 downto 32)
key 2        W      No      W      No      key(95 downto 64)
key 3        W      No      W      No      key(127 downto 96)
key 4        W      No      W      No      key(159 downto 128)
key 5        W      No      W      No      key(191 downto 160)
key 6        W      No      W      No      key(223 downto 192)
key 7        W      No      W      No      key(255 downto 224)

iv 0         RW     R       R      RW      iv register(31 downto 0)
iv 1         RW     R       R      RW      iv register(63 downto 32)
iv 2         RW     R       R      RW      iv register(95 downto 64)

aad 0        RW     R       R      W       data out register(31 downto 0)
aad 1        RW     R       R      W       data out register(63 downto 32)
aad 2        RW     R       R      W       data out register(95 downto 64)
aad 3        RW     R       R      W       data out register(127 downto 96)

pt 0         RW     No      R      No      encrypt in register(31 downto 0) / data out register(31 downto 0)
pt 1         RW     No      R      No      encrypt in register(63 downto 32) / data out register(63 downto 32)
pt 2         RW     No      R      No      encrypt in register(95 downto 64) / data out register(95 downto 64)
pt 3         RW     No      R      No      encrypt in register(127 downto 96) / data out register(127 downto 96)

ct 0         R      R       R      RW      data out register(31 downto 0) / encrypt in register(31 downto 0)
ct 1         R      R       R      RW      data out register(63 downto 32) / encrypt in register(63 downto 32)
ct 2         R      R       R      RW      data out register(95 downto 64) / encrypt in register(95 downto 64)
ct 3         R      R       R      RW      data out register(127 downto 96) / encrypt in register(127 downto 96)

tag 0        R      R       R      R       tag register(31 downto 0)
tag 1        R      R       R      R       tag register(63 downto 32)
tag 2        R      R       R      R       tag register(95 downto 64)
tag 3        R      R       R      R       tag register(127 downto 96)

aad len 0    RW     R       R      RW      aad len register(31 downto 0)
aad len 1    RW     R       R      RW      aad len register(63 downto 32)
ct len 0     RW     R       R      RW      ct len register(31 downto 0)
ct len 1     RW     R       R      RW      ct len register(63 downto 32)

ver tag 0    No     No      R      W       data out register(31 downto 0)
ver tag 1    No     No      R      W       data out register(63 downto 32)
ver tag 2    No     No      R      W       data out register(95 downto 64)
ver tag 3    No     No      R      W       data out register(127 downto 96)

Status register bits:
Bit 9   tag done³
Bit 8   ct done³
Bit 7   aad done³
Bit 6   iv change done³
Bit 5   key change done³
Bit 4   tag done²
Bit 3   ct done²
Bit 2   aad done²
Bit 1   iv change done²
Bit 0   key change done²

The config and control fields are listed from most significant to least significant bit.

¹ Indicates that a particular bit in this register can only be set or cleared by a particular processor.
² This bit can only be set or cleared by the red processor.
³ This bit can only be set or cleared by the black processor.

5.2.1 High-Performance Design

The high performance implementation in this work targets a 128-bit AES datapath. The im-
plementation of AES encryption is based on a round iterative design, where the round keys
are read out of on-chip memory as was described in subsection 5.1.2. These round keys
are placed into on-chip memory by the 128-bit offline key schedule described in subsec-
tion 5.1.1. The on-chip memory is organized such that it is addressable by 128 bit blocks.
This allows one entire round key to be read or written in one cycle. The GHASH multiplier
was implemented fully parallel.

5.2.2 Small Area Design

The small area implementation in this work focused on using a 32-bit datapath for the AES
key schedule and encryption components. The key schedule was based upon the 32-bit
iterative version described in subsection 5.1.1. The AES 32-bit encryption implementation
is a simple word iterative design that was discussed in subsection 5.1.2. The round keys
are stored in on-chip memory, addressable by 32-bit words. The GF(2^128) multiplier in the
GHASH component was implemented in bit serial fashion: multiplying and accumulating
one bit of the input data by the 128-bit H value sequentially for 128 clock cycles. The high-
level architecture differs from Figure 5.6 in the additional logic required by the GHASH
component to accumulate the partial products.

5.2.3 Balanced Performance Design

The implementation targeting a balance between high performance and small area utilizes
the same 128-bit datapath AES key schedule and encryption components as the high perfor-
mance implementation. The difference lies in the GHASH component, which utilizes the
sequential multiply and add approach from [58]. A result is calculated in 16 clock cycles
by sequential multiply and accumulate operations using one 128 bit operand and one byte
operand. The high-level architecture for the GCM engine differs from Figure 5.6 in the

Figure 5.7: The GCM top level state machine.
GHASH component, where an additional register and several multiplexors are necessary to
compute the partial product.

5.3 Software Design


This section details the organization and build process of the software running on the em-
bedded Nios II processors. A description is given of the automation scripts that were de-
signed to create, build and download the software.

5.3.1 Organization

At a high level, the software is organized into three major projects: 1) blk proc, 2) red proc
and 3) lib gcm. The blk proc and red proc projects contain code that pertains only to the
black and red processors respectively. The lib gcm directory contains shared code that is
referenced by both processors to interact with their associated GCM engine.

5.3.2 Build process

The software bitstream is created by executing a series of scripts. These scripts are outlined
below:

create-this-bsp <red/black>.sh Creates the board support package (BSP) binary files for
a Nios II processor
create-this-app <red/black>.sh Creates the application binary files for a Nios II processor
create-this-lib.sh Attempts to create a statically linked library for the GCM engine functions

The BSP and application creation scripts were modified from versions provided in the
Altera software. All three scripts use relative locations to address files and directories.
Thus, it is necessary for the software to be organized in the directory structure discussed in
subsection 5.3.1. Additionally, the scripts need access to the files created during hardware
compilation. These scripts expect that the hardware files are one directory level above the
location of the scripts themselves in a directory labeled “hw_altera”.
The BSP generation scripts pass the appropriate files and directories to “nios2-bsp”.
This application then creates the ucosii_net_zipfs BSP, which is required to utilize the
softcore Nios processor. The application generation scripts pass the appropriate files and
directories to “nios2-app-generate-makefile”, which recursively generates the Makefile required
to create application code for a particular BSP. Once this Makefile has been created, the
script executes “make” on the newly created Makefile.
The GCM software static library creation script was ultimately not utilized in this
project. This is because of the way that the Nios build tools create the static library: the
tools require that the static library be built for a particular processor architecture and feature
set. In this project, this level of abstraction from the GCM code base was simply not necessary.
The best practice, in this case, was to guarantee that both processors were using the
same code to interact with the GCM engines. This simplifies interaction with the engine and
allows changes to be made to the underlying GCM engine hardware interaction code without
necessarily changing the software interface presented to the embedded microprocessor
application code.
From a security perspective, the black processor does have access to the same GCM
engine interaction code as the red processor. This does not present an issue because the
hardware configuration prevents the black processor from attaining the same level of con-
trol given to the red processor. Again, this is only hardware interaction code.
The shell scripts were utilized in this project rather than the available Nios integrated
development environment (IDE) mainly to simplify the build process. In the normal embedded
processor software development cycle using the Nios IDE, the IDE must be opened
and used to build the BSP and application code, download the bitstream, execute it and perform
any debugging desired by the user. In this project, many different hardware designs
were created, each requiring its own associated software environment for verification
testing. The high-level GCM hardware interface wrapper for the Nios II processor also did not
change between hardware projects. This being the case, the software was consistent
across all of the different hardware projects. It was much faster to simply execute
this series of scripts to verify functionality of the system than to create an entirely new
development project simply to test the same software on different hardware.

5.3.3 Download and execution

To facilitate easy interaction with the softcore processors, a series of scripts was created
to download hardware and software. These scripts are outlined below:

download-hw-sw-terminal.sh Downloads the hardware bitstream to the FPGA and the
software to each softcore processor. A console window is opened for each processor
download-sw-terminal.sh Downloads the software to the softcore processors and opens
a console window to each
download-sw.sh Downloads the software to the softcore processors
download-sw-<red/black>.sh Downloads the software to a specific softcore processor

The console windows that are opened for each softcore processor display the standard
error and standard output from each using the universal serial bus (USB) Joint Test Action
Group (JTAG) universal asynchronous receiver/transmitters (UARTs). All of these scripts
expect that the hardware bitstream has been created and all of the software, BSP and appli-
cation code, has been compiled.
The hardware bitstream is downloaded using relative paths passed to the “quartus_pgm”
command. The software binaries are downloaded using relative paths passed to the
“nios2-download” command. Once the software is downloaded, it begins execution immediately.
The scripts that download software to both processors download to the red processor first,
but the software could be downloaded in any order.

6. Evaluation
This chapter describes the three areas of analysis of this work: 1) verification, 2) perfor-
mance and 3) security analysis. The verification analysis section gives a brief discussion
on how the system output was found to be correct. The performance analysis section de-
tails the maximum frequency and throughput values of the GCM hardware engines. The
security analysis section provides a detailed analysis on the impact of the FPGA security
features on the system implementation.

6.1 Verification
The hardware modules were verified in two ways: in simulation, using VHDL test benches, and in
the FPGA itself, using the softcore microprocessors as the stimulus. Once the hardware
bitstream was compiled, the software was compiled to match this hardware. The compilation
and download processes were discussed in section 5.3. The actual verification was
performed using test vectors provided in [20].
The red processor was given the test vector data in the form of static constants. The
testing was broken down into one C function per test vector. The configuration and
private key data were first used to configure both GCM engines, then the plaintext data was
sent to the GCM encryption engine. The black microprocessor received the encrypted data
from the encryption engine and passed it back into the decryption engine, which the red
processor had initialized with the same configuration as the encryption engine. If the data
received by the red processor from the decryption engine was the same as the data originally
sent to the encryption engine, the test was labeled as successful. The success or failure of
each test was written to standard output.
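The loopback procedure can be summarized as a small harness. The sketch below is written in Python for brevity, whereas the tests in this work were C functions running on the red processor. The `GcmVector` type, the `Engine` signature and the `xor_stand_in` function are hypothetical: the stand-in is a toy XOR keystream that only lets the harness run without hardware, and it is not AES-GCM.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical engine interface: (key, iv, data) -> processed data.
Engine = Callable[[bytes, bytes, bytes], bytes]


@dataclass
class GcmVector:
    name: str
    key: bytes
    iv: bytes
    plaintext: bytes


def run_loopback(vec: GcmVector, encrypt_engine: Engine, decrypt_engine: Engine) -> bool:
    """Red side sends plaintext, black side relays ciphertext, red side compares."""
    ciphertext = encrypt_engine(vec.key, vec.iv, vec.plaintext)
    relayed = bytes(ciphertext)  # the black processor only forwards the data
    recovered = decrypt_engine(vec.key, vec.iv, relayed)
    return recovered == vec.plaintext


def xor_stand_in(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy stand-in for a hardware engine (NOT AES-GCM): XOR with a repeated key||iv."""
    seed = key + iv
    stream = seed * (len(data) // len(seed) + 1)
    return bytes(d ^ s for d, s in zip(data, stream))


if __name__ == "__main__":
    vec = GcmVector("demo", key=bytes(range(1, 17)), iv=bytes(range(1, 13)),
                    plaintext=b"sixteen byte msg")
    result = run_loopback(vec, xor_stand_in, xor_stand_in)
    print(f"{vec.name}: {'PASS' if result else 'FAIL'}")
```

A loopback check of this kind confirms only that the two engines are mutually consistent. Comparing the intermediate ciphertext and tag against the published values in [20] is what ties the result to the GCM specification.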

Table 6.1: Single GCM engine performance statistics

Target       AES Width  AES S-Box  GHASH          GCM Max Freq.  128-bit Pckt.   2K-bit Pckt.
App.         (bits)     Arch       Arch           (MHz)          Th/put (Mbps)   Th/put (Mbps)
Small area   32         C-Field    Bit serial     68.7           33.6            63.6
Small area   32         M9K        Bit serial     119            58.1            110
High perf.   128        C-Field    Full parallel  71.2           434             528
High perf.   128        M9K        Full parallel  79.7           486             591
Bal. perf.   128        C-Field    16-Sequen.     64.6           218             430
Bal. perf.   128        M9K        16-Sequen.     97.0           327             645

6.2 Performance
The performance results of each system configuration are shown in Table 6.1. The packet
size is the amount of encrypted data that is processed by the GCM engine before an au-
thentication tag is generated. Functionality was verified for proper handling of the AAD,
but AAD processing was not included in the performance tests. The architectures and
peripherals of the two Nios II/e processors were consistent through every test.
As expected, the deep critical path found in the composite field S-Box implementation
limited the maximum frequency of the GCM engine. A possible advantage to this approach
is the low requirements on M9Ks, as shown in Table 6.2. The GCM implementations taking
advantage of the true dual-port on-chip memory are much more resource efficient given the
prevalence of M9K memory blocks on the target FPGA. Additionally, the throughput and
clock frequency were higher for the implementations using memory based S-Boxes. The
balanced performance memory based implementation had the highest throughput for the
larger packet size. This is due to its higher clock frequency and to the fact that the high
performance implementation uses an equivalent AES encryption component: its parallel
GHASH component sits idle while that AES component is still performing its computations,
so the faster GHASH provides little additional throughput.
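The trade-off between clock frequency and cycle count can be made concrete by back-calculating the number of clock cycles per packet implied by Table 6.1. The sketch below is an illustration, not a measurement from this work: the cycle counts are derived here from the published frequency and throughput figures, they include whatever overhead was present in the measurements, and "2K-bit" is assumed to mean 2048 bits.

```python
def implied_cycles(freq_mhz: float, throughput_mbps: float, packet_bits: int) -> float:
    """Clock cycles per packet implied by a clock frequency and a throughput."""
    return freq_mhz * packet_bits / throughput_mbps


# (design, max frequency in MHz, 128-bit throughput, 2K-bit throughput in Mbps),
# copied from the M9K rows of Table 6.1.
M9K_ROWS = [
    ("small area", 119.0, 58.1, 110.0),
    ("high performance", 79.7, 486.0, 591.0),
    ("balanced", 97.0, 327.0, 645.0),
]

for design, freq, tput_128, tput_2k in M9K_ROWS:
    small = implied_cycles(freq, tput_128, 128)
    large = implied_cycles(freq, tput_2k, 2048)  # assumes "2K-bit" means 2048 bits
    print(f"{design}: {small:.1f} cycles per 128-bit packet, {large:.1f} per 2K-bit packet")
```

Under these assumptions the high performance M9K design needs roughly 276 cycles per 2K-bit packet at 79.7 MHz, while the balanced M9K design needs roughly 308 cycles at 97.0 MHz. The balanced design's clock advantage of about 22% outweighs its cycle penalty of about 12%, which is consistent with its higher throughput for the larger packet size.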
Implementation floorplans for the Cyclone III LS (EP3CLS70) can be seen in Figure 6.1.
In these figures, labels one and two identify the red and black subsystem partitions
respectively. Labels three and four identify the GCM encryption and decryption engine
partitions respectively. The darker shaded blocks are utilized resources, except on the borders
of secured regions. These are fences of unused logic where no routing connections are
permitted. The small lightly shaded regions containing no logic between secured regions
are SRIs that allow routing to pass between regions. The strips passing vertically through
the FPGA are M9K memory elements and embedded multipliers. It is important to
note that the unallocated space is where the global signals are input to the global routing
structures for distribution throughout the entire FPGA.

Table 6.2: Secured region resource utilization statistics

                                            GCM Engine Partition                     Red/Blk Subsystem Partition
Target      AES     AES      GHASH          Utilized     Allocated    Utilization    Utilized     Allocated    Utilization
App.        Width   S-Box    Arch           M9K   LEs    M9K   LEs    M9K   LEs      M9K   LEs    M9K   LEs    M9K   LEs
            (bits)  Arch                    (#)   (#)    (#)   (#)    (%)   (%)      (#)   (#)    (#)   (#)    (%)   (%)
Small area  32      C-Field  Bit serial     3     3696   26    4160   12    89       22    1591   51    7856   43    20
Small area  32      M9K      Bit serial     7     3170   26    4160   27    76       22    1596   51    7856   43    20
High perf.  128     C-Field  Full parallel  4     17023  94    21056  4     81       22    1663   99    15760  22    11
High perf.  128     M9K      Full parallel  14    14956  94    21056  15    71       22    1691   99    15760  23    11
Bal. perf.  128     C-Field  16-Sequen.     4     6122   30    6240   14    99       22    1651   60    9648   37    18
Bal. perf.  128     M9K      16-Sequen.     14    4682   30    6240   47    75       22    1636   60    9488   37    18

Figure 6.1: Low area (3a) and high performance (3b) implementation floorplans for the
EP3CLS70 FPGA.

6.3 Security Costs and Analysis


In this work, the designs were first functionally verified on a Cyclone III device and the
final designs were synthesized for a Cyclone III LS device. While the implemented hard-
ware is functionally equivalent, additional resources were necessary to satisfy the security
requirements on the LS device. Table 6.2 illustrates the total utilized LEs and M9Ks out of
the total allocated for a particular secured region.
Together the size and position of the secured regions 1) dictate the amount of resources
available to a region, 2) set the timing of on-chip and off-chip data transfers to and from
regions, and 3) limit the available routing paths between secured partitions. Each secured
region is also a logical partition in itself, which prevents optimizations that would have
taken place at the partition boundaries [67]. As such, these secured regions are highly
design and platform dependent. To allow conclusions to be drawn more easily, the sizes of
the secured regions were fixed for a given target application.

Figure 6.2: A planar graph showing the flow of data between subsystems.

For the high performance implementation in Figure 6.1b, overhead costs include 8912
LEs, 36 M9Ks, and 20 embedded multipliers that must remain unused to form the borders
between secured regions. SRIs may only contain routing, thus the 3120 LEs, 12 M9Ks,
and 8 embedded multipliers contained within SRIs are allocated, but not utilized. Taking
these figures into account, this leaves 20576 LEs, 35 M9Ks and 20 embedded multipliers
free for use as global logic.
An additional concern when choosing the size of a secured region is routability. Reference [2]
specifies that the minimum size of a secured region is 8x8 LABs, else the possibility exists
that routing would need to protrude beyond the region's borders to connect to each LAB.
The width of an SRI is also recommended to be 12 LABs when signals are passing up or
down and 17 LABs when signals are passing left or right within the FPGA. The encryption
and decryption partitions pass 78 connections vertically, necessitating an 8 LAB width.
The red and black partitions have 186 connections each to the Altera JTAG Hub, requiring
16 LABs for vertical connections. The final size was selected to be 10 LABs for vertical
and 5 LABs for horizontal, yielding a value of 205 possible connections. This minimum
recommended region size contributes significantly to the excessive height of the red and
black partitions.
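The figure of 205 possible connections can be reproduced if the values 12 and 17 are read as the number of signals one LAB of SRI width can carry vertically and horizontally. This reading is an assumption inferred from the numbers in the paragraph above, not a statement taken from [2], and the names in the sketch are hypothetical.

```python
import math

# Assumed per-LAB routing capacities (an interpretation, see the lead-in).
SIGNALS_PER_LAB_VERTICAL = 12
SIGNALS_PER_LAB_HORIZONTAL = 17


def labs_needed(connections: int, signals_per_lab: int) -> int:
    """Smallest whole number of LABs able to carry the given connections."""
    return math.ceil(connections / signals_per_lab)


def sri_capacity(vertical_labs: int, horizontal_labs: int) -> int:
    """Total connections available from a mix of vertical and horizontal LABs."""
    return (vertical_labs * SIGNALS_PER_LAB_VERTICAL
            + horizontal_labs * SIGNALS_PER_LAB_HORIZONTAL)


if __name__ == "__main__":
    print("LABs for 186 vertical connections:", labs_needed(186, SIGNALS_PER_LAB_VERTICAL))
    print("LABs for 78 vertical connections:", labs_needed(78, SIGNALS_PER_LAB_VERTICAL))
    print("Capacity of 10 vertical + 5 horizontal LABs:", sri_capacity(10, 5))
```

Under this reading, 186 connections need 16 LABs and the selected 10 plus 5 LAB arrangement carries 205 connections, both matching the text. The 78 connections of the engine partitions would need only 7 LABs, so the stated 8 LAB width may instead reflect the 8x8 minimum region size.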
The I/O banks available for global logic are limited in a secured design because I/O
banks may not be shared by secured regions [2]. In this work, one free bank was utilized
at the global level to provide the system clock and reset inputs. Excluding the Altera JTAG
interface signals, no other I/O banks were utilized. The secured routing constraints in
Figure 6.1b precluded complete access to I/O banks 7 and 6, with only partial access to the
pins on banks 8 and 5.
Designing for the previously mentioned constraints resulted in an inefficient use of
FPGA resources within the secured regions. The reference design for this project uti-
lized the Altera JTAG connection for input and output to the systems. This restricted the
floorplan such that the red and black subsystem partitions both needed to connect to the
JTAG signals, obscured by the black box in Figure 6.1. In a practical application, a de-
signer would assign pins such that data could flow across the FPGA, removing the need for
excessive logic to be allocated for circuitous routing. The balanced performance GCM im-
plementation utilizing the true dual-port memory makes the most efficient use of memory
and logic, while still providing a high throughput.

7. Conclusions
In this work, a red/black separated system was designed with independent encryption and
decryption implementations of GCM using AES to pass authenticated and encrypted infor-
mation between two Nios II processors. The implementations of this system were evalu-
ated on a Cyclone III LS (EP3CLS70) FPGA based on the metrics of high performance,
low resource usage and a balance between the two. An analysis of resource utilization was
performed, yielding the additional cost of red/black separation on this platform. FPGA
hardware designers with a need for high security should be able to use this work to assist
in selecting component architectures that best meet their performance or resource require-
ments.
Future high performance designs based on this work could explore pipelined approaches
or passing data directly to the hardware engines from the input and output. Low area approaches
could benefit from an 8-bit datapath or a T-box approach for further reduction in
required resources.

7.1 Suggestions for Future Work


Depending on desired implementation attributes and the target architecture this system is to
be placed on, there are a number of improvements that can be made. At the subcomponent
level, S-Box sharing could be implemented between the AES key schedule and encryption
implementations. During normal GCM operation, it is unlikely that the IV or private key
would be changed frequently. In this case, the key schedule would be idle the vast ma-
jority of the time. The key schedule for all datapath widths uses a four S-Box SubWord
component. A 32-bit wide version of the AES encryption function could make use of this.
Along the same lines as the S-Box substructure sharing, a 32-bit wide key schedule
could be used in conjunction with a 128-bit wide AES encryption component. Again,
the key schedule is not invoked frequently and the tradeoff could be made to save logic
resources at the expense of a slower key schedule on an IV or private key change. The
expanded key memory component would need some additional control logic in order to
have 32-bit words written into it and still be able to provide 128-bit round keys to the
encryption engine.
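The additional control logic mentioned above amounts to a width conversion between the write and read sides of the expanded key memory. The following Python sketch models that behaviour for AES-128, where the expanded key holds 44 32-bit words forming 11 round keys. The class and method names are hypothetical and the model says nothing about how the memory would be mapped onto M9K blocks.

```python
class ExpandedKeyMemory:
    """Behavioural model: 32-bit writes from the key schedule, 128-bit round key reads."""

    def __init__(self, rounds: int = 10):
        # AES-128 uses 10 rounds, hence 11 round keys of four 32-bit words each.
        self.words = [0] * (4 * (rounds + 1))

    def write_word(self, index: int, word: int) -> None:
        """Store one 32-bit word produced by a 32-bit wide key schedule."""
        self.words[index] = word & 0xFFFFFFFF

    def read_round_key(self, round_index: int) -> int:
        """Return the 128-bit round key built from four consecutive words."""
        w = self.words[4 * round_index: 4 * round_index + 4]
        return (w[0] << 96) | (w[1] << 64) | (w[2] << 32) | w[3]


if __name__ == "__main__":
    mem = ExpandedKeyMemory()
    for i in range(44):
        mem.write_word(i, i + 1)  # placeholder words, not a real key expansion
    print(f"round key 1 = {mem.read_round_key(1):032x}")
```

Four writes are needed per round key, so a key or IV change would take roughly four times as many memory writes as with a 128-bit wide key schedule, which is the slowdown the text accepts in exchange for the logic savings.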
It is worth noting that in all the implementations explored in this work, the majority of
the logical resources on the FPGA were utilized by the GHASH component. Thus, it seems
likely only a moderate savings would be achieved by utilizing substructure sharing between
S-Boxes. An 8-bit datapath width implementation for the key schedule and encryption
components might see a more significant area savings over a 128-bit width approach. Pair-
ing these 8-bit AES components with a bit-serial GHASH implementation would create a
circuit that uses very few logical resources and would have similar throughputs for both the
GHASH and encryption operations. A very large delay would be experienced during an
encryption operation since GHASH must follow encryption sequentially due to an existing
data dependency. Such a circuit would be useful in a situation where a very small, low cost,
secured FPGA is being employed and performance is not an issue. This could also be a
preferable implementation in an FPGA where redundancy must be implemented. It could
be possible to fit many of these GCM engines within one chip.
At a high level, first in, first out (FIFO) data structures could be used to queue up data
for the engine, which could then be pipelined during encryption operations. Additionally
or separately, direct data inputs into the GCM engine could be explored. In this case,
the processors would manage the flow of data throughout the circuit, but would not pass
the data directly through themselves. This would remove any software processing of data,
allowing it to pass completely through hardware. The performance benefits to this approach
are great, but this type of implementation would need to be scrutinized more closely to
make certain that it would pass a security evaluation.
A simpler improvement could be to remove redundant registers that exist at the top
level of the GCM component. This would result in a cost savings as far as logical resources
are concerned, but might reduce the modularity of the system. A final suggestion from Dr.
Łukowiak to reduce utilized resources would be to use shift registers for the expanded key
memory. This would remove the need for counters and their corresponding control lines to
memory address signals. This technique might require fewer memory elements from the FPGA,
depending upon their organization and usage in the design.
To create a more self-contained version of GCM, the data lengths could be registered
internally by the engine itself. Currently the length of the data must be a multiple of the
AES block size and the total length of the data must be provided to the engine via two
64-bit registers. In the implementation designed in this work, the padding and masking
can be performed by the processors. If the recommendation to pass data directly to
the engine instead of passing it through the processors is taken, this padding might no longer
be so trivial. At that time, it might become appropriate to register the data lengths internally
and add in a barrel shifter to mask the data if it is not a multiple of the AES block size.
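The padding and masking currently left to the processors can be illustrated in a few lines. This is a Python sketch of the general GCM conventions from [9] (zero padding to a whole block and 64-bit bit lengths), not the C code of this work. The function names are hypothetical, and the assumption that the two 64-bit registers hold the AAD length and the data length in bits is inferred from the GCM length block rather than stated in the text.

```python
BLOCK_BYTES = 16          # AES block size
MASK64 = (1 << 64) - 1


def zero_pad(data: bytes) -> bytes:
    """Pad with zero bytes up to the next multiple of the AES block size."""
    remainder = len(data) % BLOCK_BYTES
    return data if remainder == 0 else data + bytes(BLOCK_BYTES - remainder)


def length_registers(aad_len_bytes: int, data_len_bytes: int):
    """Values for two 64-bit length registers, expressed in bits (assumed layout)."""
    return (aad_len_bytes * 8) & MASK64, (data_len_bytes * 8) & MASK64


def mask_output(blocks: bytes, data_len_bytes: int) -> bytes:
    """Discard the bytes of the final block that correspond to padding."""
    return blocks[:data_len_bytes]


if __name__ == "__main__":
    message = b"abc"
    padded = zero_pad(message)
    print(len(padded), length_registers(0, len(message)), mask_output(padded, len(message)))
```

If data bypassed the processors, the engine would have to perform the equivalent of `zero_pad` and `mask_output` itself, which is where the suggested internal length registers and barrel shifter would come in.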

Acronyms

AAD . . . . . . . . . . Additional Authenticated Data

AES . . . . . . . . . . . Advanced Encryption Standard

ARK . . . . . . . . . . AddRoundKey

ASIC . . . . . . . . . . Application Specific Integrated Circuit

ASIP . . . . . . . . . . Application Specific Instruction Processor

BDD . . . . . . . . . . binary decision diagram

BRAM . . . . . . . . . block random access memory

BSP . . . . . . . . . . . board support package

CBC . . . . . . . . . . . Cipher Block Chaining

CCM . . . . . . . . . . Counter with Cipher Block Chaining-Message Authentication Code

CFB . . . . . . . . . . . Cipher Feedback

CLB . . . . . . . . . . . configurable logic block

CMAC . . . . . . . . . Cipher-based Message Authentication Code

DES . . . . . . . . . . . Data Encryption Standard

ECB . . . . . . . . . . . Electronic Codebook

EDA . . . . . . . . . . . Electronic Design Automation

FIFO . . . . . . . . . . First in, first out

FIPS . . . . . . . . . . . Federal Information Processing Standard

FPGA . . . . . . . . . . Field Programmable Gate Array

FSDA . . . . . . . . . . Fail-Safe Design and Analysis

GCM . . . . . . . . . . Galois/Counter mode

GSM . . . . . . . . . . Global Switch Matrix

ICB . . . . . . . . . . . initial counter block

IDE . . . . . . . . . . . integrated development environment

IP . . . . . . . . . . . . Intellectual Property

IV . . . . . . . . . . . . Initialization Vector

IVT . . . . . . . . . . . Isolation Verification Tool

JTAG . . . . . . . . . . Joint Test Action Group

LAB . . . . . . . . . . . logic array block

LE . . . . . . . . . . . . logic element

LUT . . . . . . . . . . . lookup table

MAC . . . . . . . . . . message authentication code

NBS . . . . . . . . . . . National Bureau of Standards

NIST . . . . . . . . . . National Institute of Standards and Technology

NSA . . . . . . . . . . . National Security Agency

OFB . . . . . . . . . . . Output Feedback

OPB . . . . . . . . . . . On-chip Peripheral Bus

SCC . . . . . . . . . . . Single Chip Cryptographic

SCC . . . . . . . . . . . Single Chip Cryptography

SRI . . . . . . . . . . . security routing interface

UART . . . . . . . . . . universal asynchronous receiver/transmitter

USB . . . . . . . . . . . universal serial bus

VHDL . . . . . . . . . VHSIC hardware description language

XOR . . . . . . . . . . bitwise exclusive OR

References

[1] Altera Corporation, 101 Innovation Drive, San Jose, CA, Nios II Processor Reference
Handbook, November 2009. NII5V1-9.1.

[2] Altera Corporation, 101 Innovation Drive, San Jose, CA, Quartus II Design Separa-
tion Flow, June 2009. AN 569.

[3] L. Hathaway, “National Policy on the Use of the Advanced Encryption Standard
(AES) to Protect National Security Systems and National Security Information.” On-
line, June 2003. CNSS Policy No. 15, Fact Sheet No. 1.

[4] A. Biryukov, O. Dunkelman, N. Keller, D. Khovratovich, and A. Shamir, “Key Recovery
Attacks of Practical Complexity on AES Variants With Up To 10 Rounds.”
Cryptology ePrint Archive, Report 2009/374, August 2009. http://eprint.iacr.org/.

[5] F. Rodríguez-Henríquez, N. Saqib, A. Díaz-Pérez, and C. Koc, Cryptographic Algorithms
on Reconfigurable Hardware (Signals and Communication Technology). Secaucus,
NJ, USA: Springer-Verlag New York, Inc., 2006.

[6] M. Dworkin, “Recommendation for Block Cipher Modes of Operation: Methods and
Techniques,” NIST Special Publication 800-38A, National Institute of Standards and
Technology (NIST), Gaithersburg, MD 20899-8930, USA, December 2001.

[7] M. Dworkin, “Recommendation for Block Cipher Modes of Operation: The CMAC
Mode for Authentication,” NIST Special Publication 800-38B, National Institute of
Standards and Technology (NIST), Gaithersburg, MD 20899-8930, USA, May 2005.

[8] M. Dworkin, “Recommendation for Block Cipher Modes of Operation: The CCM
Mode for Authentication and Confidentiality,” NIST Special Publication 800-38C,
National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899-
8930, USA, May 2004.

[9] M. Dworkin, “Recommendation for Block Cipher Modes of Operation: Ga-
lois/Counter Mode (GCM) and GMAC,” NIST Special Publication 800-38D, National
Institute of Standards and Technology (NIST), Gaithersburg, MD 20899-8930, USA,
November 2007.

[10] N. Weaver and J. Wawrzynek, “High Performance, Compact AES Implementations
in Xilinx FPGAs,” tech. rep., U.C. Berkeley BRASS group, September 2002.

[11] A. Menezes, S. Vanstone, and P. V. Oorschot, Handbook of Applied Cryptography.
Boca Raton, FL, USA: CRC Press, Inc., 1996.

[12] National Institute of Standards and Technology (NIST), “Specification for the Ad-
vanced Encryption Standard (AES).” Federal Information Processing Standards Pub-
lication 197, 2001.

[13] D. Stinson, Cryptography: Theory and Practice. Discrete Mathematics and its Ap-
plications, Boca Raton, FL: Chapman & Hall/CRC, third ed., November 2005.

[14] National Institute of Standards and Technology (NIST), “Data Encryption Standard
(DES).” Federal Information Processing Standards Publication 46-3, 1999.

[15] J. Daemen and V. Rijmen, The Design of Rijndael: AES – The Advanced Encryp-
tion Standard. No. XVII in Information Security and Cryptography, Springer-Verlag,
2002.

[16] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, “A Compact Rijndael Hardware
Architecture with S-Box Optimization,” in Advances in Cryptology ASIACRYPT
2001, vol. 2248 of Lecture Notes in Computer Science, pp. 239–254, Springer Berlin
/ Heidelberg, 2001.

[17] J. Daemen and V. Rijmen, “AES Proposal: Rijndael,” March 1999.

[18] Larry Ewing ([email protected]) using The Gimp http://www.isc.tamu.edu/~lewing/gimp/,
“tux.jpg.” Online, August 2009.

[19] Adapted from the Wikimedia Commons file Image:Tux ecb.jpg, “Tux ecb.jpg.” On-
line, August 2009.

[20] D. McGrew and J. Viega, “The Galois/Counter Mode of Operation (GCM),” May
2005.

[21] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,” Computer-
Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 26,
pp. 203–215, Feb. 2007.

[22] Xilinx, Virtex-5 FPGA User Guide, June 2009.

[23] Altera Corporation, 101 Innovation Drive, San Jose, CA, Cyclone III Device Hand-
book, July 2009. CIII5V1-3.1.

[24] M. McLean and J. Moore, “FPGA-based single chip cryptographic solution,” Military
Embedded Systems, March 2007.

[25] Xilinx, MicroBlaze Processor Reference Guide, June 2009. 10.1i.

[26] S. Roman, Field Theory. No. 158 in Graduate Texts in Mathematics, New York:
Springer-Verlag, 1995.

[27] R. J. McEliece, Finite Fields for Computer Scientists and Engineers. No. 23 in
The Kluwer international series in engineering and computer science, Massachusetts:
Kluwer Academic Publishers, 1987.

[28] C. Paar, Efficient VLSI Architectures for Bit-Parallel Computation in Galois Fields.
Dissertation, Institute for Experimental Mathematics, Universität Essen, Germany,
1994.

[29] J. Guajardo, “Efficient Algorithms for Elliptic Curve Cryptosystems,” master of science
thesis, Worcester Polytechnic Institute, May 1997.

[30] B. Kaliski Jr and M. Liskov, “Efficient Finite Field Basis Conversion Involving dual
bases,” in Cryptographic Hardware and Embedded Systems, vol. 1717 of Lecture
Notes in Computer Science, Springer Berlin / Heidelberg, 1999.

[31] I. S. Hsu, T. K. Truong, L. J. Deutsch, and I. S. Reed, “A Comparison of VLSI Architecture
of Finite Field Multipliers Using Dual, Normal, or Standard Bases,” IEEE
Trans. Comput., vol. 37, no. 6, pp. 735–739, 1988.

[32] X. Zhang and K. Parhi, “High-Speed VLSI Architectures for the AES Algorithm,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9,
pp. 957–967, 2004.

[33] K. Järvinen, M. Tommiska, and J. Skyttä, “A Fully Pipelined Memoryless 17.8 Gbps
AES-128 Encryptor,” in FPGA ’03: Proceedings of the 2003 ACM/SIGDA eleventh
international symposium on Field programmable gate arrays, (New York, NY, USA),
pp. 207–215, ACM, 2003.

[34] V. Rijmen, “Efficient Implementation of the Rijndael S-box,” 2000.

[35] X. Zhang and K. Parhi, “On the Optimum Constructions of Composite Field for the
AES Algorithm,” Circuits and Systems II: Express Briefs, IEEE Transactions on,
vol. 53, pp. 1153–1157, Oct. 2006.

[36] M. McLoone and J. McCanny, “Rijndael FPGA Implementation Utilizing Look-up
Tables,” in Signal Processing Systems, 2001 IEEE Workshop on, pp. 349–360,
September 2001.

[37] V. Fischer and M. Drutarovský, “Two Methods of Rijndael Implementation in Reconfigurable
Hardware,” in Cryptographic Hardware and Embedded Systems CHES
2001, vol. 2162 of Lecture Notes in Computer Science, pp. 77–92, Springer Berlin /
Heidelberg, 2001.

[38] M. Alam, W. Badawy, and G. Jullien, “A Novel Pipelined Threads Architecture for
AES Encryption Algorithm,” pp. 296–302, 2002.

[39] S. Morioka and A. Satoh, “A 10 Gbps Full-AES Crypto Design with a Twisted-BDD
S-Box Architecture,” in Computer Design: VLSI in Computers and Processors, 2002.
Proceedings. 2002 IEEE International Conference on, pp. 98–103, 2002.

[40] G. Rouvroy, F. Standaert, J. Quisquater, and J. Legat, “Compact and Efficient Encryp-
tion/Decryption Module for FPGA Implementation of the AES Rijndael Very Well
Suited for Small Embedded Applications,” vol. 2, pp. 583–587 Vol.2, April 2004.

[41] J. Zambreno, D. Nguyen, and A. Choudhary, “Exploring Area/Delay Tradeoffs in
an AES FPGA Implementation,” in Proceedings of the 14th Annual International
Conference on Field-Programmable Logic and Applications (FPL 04), pp. 575–585,
Springer, 2004.

[42] D. Canright, “A Very Compact S-Box for AES,” in Cryptographic Hardware and
Embedded Systems CHES 2005, vol. 3659 of Lecture Notes in Computer Science,
pp. 441–455, Springer Berlin / Heidelberg, 2005.

[43] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen, “AES implementation on a grain of
sand,” Information Security, IEE Proceedings, vol. 152, pp. 13–20, Oct. 2005.

[44] T. Good and M. Benaissa, “AES on FPGA from the Fastest to the Smallest,” in Cryp-
tographic Hardware and Embedded Systems CHES 2005, vol. 3659 of Lecture Notes
in Computer Science, pp. 427–440, Springer Berlin / Heidelberg, 2005.

[45] T. Good and M. Benaissa, “Very Small FPGA Application-Specific Instruction Pro-
cessor for AES,” Circuits and Systems I: Regular Papers, IEEE Transactions on,
vol. 53, pp. 1477–1486, July 2006.

[46] R. Liu and K. Parhi, “Fast Composite Field S-Box Architectures for Advanced En-
cryption Standard,” in GLSVLSI ’08: Proceedings of the 18th ACM Great Lakes sym-
posium on VLSI, (New York, NY, USA), pp. 65–70, ACM, 2008.

[47] A. Rudra, P. Dubey, C. Jutla, V. Kumar, J. Rao, and P. Rohatgi, “Efficient Rijndael En-
cryption Implementation with Composite Field Arithmetic,” in Cryptographic Hard-
ware and Embedded Systems CHES 2001, vol. 2162 of Lecture Notes in Computer
Science, pp. 171–184, Springer Berlin / Heidelberg, 2001.

[48] J. Wolkerstorfer, E. Oswald, and M. Lamberger, “An ASIC Implementation of the
AES SBoxes,” in Topics in Cryptology CT-RSA 2002, vol. 2271 of Lecture Notes in
Computer Science, pp. 67–78, Springer Berlin / Heidelberg, 2002.

[49] S. Nikova, V. Rijmen, and M. Schläffer, “Using Normal Bases for Compact Hardware
Implementations of the AES S-Box,” in Security and Cryptography for Networks,
vol. 5229 of Lecture Notes in Computer Science, pp. 236–245, Springer Berlin /
Heidelberg, 2008.

[50] F. Burns, J. Murphy, A. Koelmans, and A. Yakovlev, “Efficient Advanced Encryption
Standard Implementation using lookup and normal basis,” Computers & Digital
Techniques, IET, vol. 3, pp. 270–280, May 2009.

[51] P. Chodowiec and K. Gaj, “Very Compact FPGA Implementation of the AES Algo-
rithm,” in Cryptographic Hardware and Embedded Systems - CHES 2003, vol. 2779
of Lecture Notes in Computer Science, pp. 319–333, Springer Berlin / Heidelberg,
2003.

[52] N. Pramstaller, S. Mangard, S. Dominikus, and J. Wolkerstorfer, “Efficient AES
Implementations on ASICs and FPGAs,” in Advanced Encryption Standard AES,
vol. 3373 of Lecture Notes in Computer Science, pp. 98–112, Springer Berlin /
Heidelberg, 2005.

[53] P. Hamalainen, T. Alho, M. Hannikainen, and T. Hamalainen, “Design and Implementation
of Low-Area and Low-Power AES Encryption Hardware Core,” in Digital
System Design: Architectures, Methods and Tools, 2006. DSD 2006. 9th EUROMICRO
Conference on, pp. 577–583, 2006.

[54] M. McLoone and J. McCanny, “High Performance Single-Chip FPGA Rijndael Algo-
rithm Implementations,” in Cryptographic Hardware and Embedded Systems CHES
2001, vol. 2162 of Lecture Notes in Computer Science, pp. 65–76, Springer Berlin /
Heidelberg, 2001.

[55] B. Yang, S. Mishra, and R. Karri, “A High Speed Architecture for Galois/Counter
Mode of Operation (GCM).” Cryptology ePrint Archive, Report 2005/146, 2005.

[56] A. Satoh, “High-Speed Hardware Architectures for Authenticated Encryption Mode
GCM,” in Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE Inter-
national Symposium on, pp. 4831–4834, May 2006.

[57] A. Satoh, “High-Speed Parallel Hardware Architecture for Galois Counter Mode,”
in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on,
pp. 1863–1866, May 2007.

[58] A. Satoh, T. Sugawara, and T. Aoki, “High-Speed Pipelined Hardware Architecture
for Galois Counter Mode,” in Information Security, vol. 4779 of Lecture Notes in
Computer Science, pp. 118–129, Springer Berlin / Heidelberg, 2007.

[59] A. Satoh, T. Sugawara, and T. Aoki, “High-Performance Hardware Architectures for
Galois Counter Mode,” IEEE Transactions on Computers, vol. 58, no. 7, pp. 917–930,
2009.

[60] S. Lemsitzer, J. Wolkerstorfer, N. Felber, and M. Braendli, “Multi-gigabit GCM-AES
Architecture Optimized for FPGAs,” in Cryptographic Hardware and Embedded Sys-
tems - CHES 2007, vol. 4727 of Lecture Notes in Computer Science, pp. 227–238,
Springer Berlin / Heidelberg, 2007.

[61] G. Zhou, H. Michalik, and L. Hinsenkamp, “Efficient and High-Throughput Im-
plementations of AES-GCM on FPGAs,” in Field-Programmable Technology, 2007.
ICFPT 2007. International Conference on, pp. 185–192, Dec. 2007.

[62] G. Zhou, H. Michalik, and L. Hinsenkamp, “Improving throughput of AES-GCM
with pipelined Karatsuba multipliers on FPGAs,” in Reconfigurable Computing: Ar-
chitectures, Tools and Applications, vol. 5453 of Lecture Notes in Computer Science,
pp. 193–203, Springer Berlin / Heidelberg, 2009.

[63] P. Quintana, “Fail-Safe FPGA Design Features for High-Reliability Systems,” Tech.
Rep. CP-01053-1.0, Altera Corporation, 101 Innovation Drive, San Jose, CA, April
2009.

[64] T. Huffmire, B. Brotherton, G. Wang, T. Sherwood, R. Kastner, T. Levin, T. Nguyen,
and C. Irvine, “Moats and Drawbridges: An Isolation Primitive for Reconfigurable
Hardware Based Systems,” in Security and Privacy, 2007. SP ’07. IEEE Symposium
on, pp. 281–295, May 2007.

[65] T. Huffmire, B. Brotherton, N. Callegari, J. Valamehr, J. White, R. Kastner, and
T. Sherwood, “Designing secure systems on reconfigurable hardware,” ACM Trans.
Des. Autom. Electron. Syst., vol. 13, no. 3, pp. 1–24, 2008.

[66] T. Huffmire, S. Prasad, T. Sherwood, and R. Kastner, “Policy-Driven Memory Protec-
tion for Reconfigurable Hardware,” in Computer Security ESORICS 2006, vol. 4189
of Lecture Notes in Computer Science, pp. 461–478, Springer Berlin / Heidelberg,
2006.

[67] Altera Corporation, 101 Innovation Drive, San Jose, CA, Quartus II Handbook, De-
cember 2009. QII5V1-9.1.1.
