Hardware Implementation of Bit-Parallel Finite Field Multipliers
Scholarship at UWindsor
1-1-2019
Recommended Citation
Pan, Meitong, "Hardware Implementation of Bit-Parallel Finite Field Multipliers Based on Overlap-free
Algorithm on FPGA" (2019). Electronic Theses and Dissertations. 8175.
https://scholar.uwindsor.ca/etd/8175
This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor
students from 1954 forward. These documents are made available for personal study and research purposes only,
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution,
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or
thesis from this database. For additional inquiries, please contact the repository administrator via email
or by telephone at 519-253-3000 ext. 3208.
Hardware Implementation of
Bit-Parallel Finite Field
Multipliers Based on Overlap-free
Algorithm on FPGA
By
Meitong Pan
A Thesis
Submitted to the Faculty of Graduate Studies
through the Department of Electrical and Computer Engineering
in Partial Fulfilment of the Requirements for
the Degree of Master of Applied Science
at the University of Windsor
2019
Meitong Pan
APPROVED BY:
S. Cheng
Department of Civil & Environmental Engineering
B. Balasingam
Department of Electrical & Computer Engineering
H. Wu, Co-Advisor
Department of Electrical & Computer Engineering
M. Mirhassani, Advisor
Department of Electrical & Computer Engineering
Declaration of Originality
I certify that, to the best of my knowledge, my thesis does not infringe upon
anyone’s copyright nor violate any proprietary rights and that any ideas, techniques,
quotations, or any other material from the work of other people included
in my thesis, published or otherwise, are fully acknowledged in accordance with
the standard referencing practices. Furthermore, to the extent that I have included
copyrighted material that surpasses the bounds of fair dealing within the
meaning of the Canada Copyright Act, I certify that I have obtained a written
permission from the copyright owner(s) to include such material(s) in my thesis
and have included copies of such copyright clearances to my appendix.
I declare that this is a true copy of my thesis, including any final revisions, as
approved by my thesis committee and the Graduate Studies office, and that this
thesis has not been submitted for a higher degree to any other University or
Institution.
Abstract
The Karatsuba algorithm and its generalizations are among the methods most often
used to construct multiplication architectures, and they have been improved
significantly in recent decades. However, one of its optimized variants, the
Overlap-free Karatsuba algorithm, has received much less attention, and its
implementation on FPGA does not appear to have been reported. After a detailed
study of this specific algorithm, this thesis proposes an implementation of a
modified Overlap-free Karatsuba algorithm on Xilinx Spartan-605. By applying this
algorithm and its specific architecture, a reduced gate count or a shorter critical
path is achieved for the given value of n.
To my family
my grandparents
my parents
my fiancé
for their unconditional love
and
support
Acknowledgments
I wish to express my sincere gratitude to my supervisor, Dr. Mitra Mirhassani, and
my co-supervisor, Dr. Huapeng Wu, for their patience, motivation and immense
knowledge throughout my graduate study.
I would like to thank my family members, my mum, dad and my fiancé, for their
constant support and continuous encouragement while I completed my further study.
I would like to thank my committee members, Dr. Huapeng Wu, Dr. Bala Balasingam
and Dr. Shaohong Cheng.
Table of Contents
Declaration of Originality
Abstract
Dedication
Acknowledgments
List of Figures
List of Abbreviations
1 Introduction
1.1 Motivation
1.2 Objective
1.3 Organization of Thesis
2 Preliminary
2.1 Mathematics Fundamental
2.2 Finite Field
2.3 Arithmetic Operation in Finite Field $GF(2^n)$
2.3.1 Arithmetic operation in complex number field
2.3.2 Arithmetic operation in Finite Field $GF(2^n)$
2.4 Multiplication Architectures
2.4.1 Bit-parallel multiplication
2.4.2 Bit-serial multiplication
5 Conclusion
5.1 Summary of Contribution
5.2 Future Work
Bibliography
Appendix A
Appendix B
Vita Auctoris
List of Figures
3.1 Ranges of x’s exponents of equation (3.1)
3.2 Comparison of time and space complexities of four different multiplication algorithms
3.3 Horizontal direction comparison
3.4 Comparison in the number of XOR gates
3.5 Comparison in time complexity
List of Abbreviations
FF Flip-Flop
MUX Multiplexer
KA Karatsuba Algorithm
Chapter 1
Introduction
1.1 Motivation
the symmetric key, which is known only to the sender and the receiver. However, it
is difficult for the two parties to exchange keys without compromising the security
of the keys themselves, which in turn endangers data confidentiality and data
authentication. The second problem of symmetric cryptography is called the key
management problem. Suppose that a communication medium is shared between n users,
and each pair of users needs a different key to establish its own secure
communication. Then n(n − 1)/2 different keys must be provided (for example, 4950
keys for n = 100), which is hard to manage even in medium-sized networks.
So far, based on the concept of public key cryptography, three different types
of cryptosystems have been proposed: RSA [1], ElGamal [2], and Elliptic Curve
Cryptography (ECC) [3, 4]. The security of each of these cryptosystems depends
on a difficult mathematical problem, which is called a one-way function. ECC
is much more secure for the following reasons:
• The ECC keys are clearly smaller than those of RSA and ElGamal for
any given level of security.
• The ratio between the key sizes of the other two public key schemes and those
of ECC grows with the security level, which means that the higher the required
security, the more efficient ECC becomes.
• The key lengths of ECC are twice as long as those of symmetric algorithms
for the same level of security, which illustrates the higher computational
complexity of the public key schemes.
1.2 Objective
Two commonly used classes of finite fields in cryptography are prime fields of
degree one, $GF(p)$, and binary extension fields of degree greater than one,
$GF(2^n)$. The latter is a subclass of a more general group of finite fields known
as finite prime extension fields $GF(p^n)$, where the parameter p is equal to two
and the extension degree is greater than one. Binary fields are more attractive for
high-speed cryptosystem applications because the basic field operations, addition
and multiplication in the underlying field $F_2$, can be readily realized by a
bit-wise XOR and a bit-wise AND operation, respectively.
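The correspondence between $F_2$ arithmetic and logic gates can be checked exhaustively. The following is a minimal Python sketch, not part of the hardware design; the function names `gf2_add`, `gf2_mul` and `gf2n_add` are illustrative only.

```python
# Illustrative check (hypothetical helper names): arithmetic in F_2 is
# integer arithmetic modulo 2, which coincides with XOR and AND.
def gf2_add(a: int, b: int) -> int:
    """Addition in F_2: (a + b) mod 2."""
    return (a + b) % 2

def gf2_mul(a: int, b: int) -> int:
    """Multiplication in F_2: (a * b) mod 2."""
    return (a * b) % 2

for a in (0, 1):
    for b in (0, 1):
        assert gf2_add(a, b) == a ^ b   # addition behaves as XOR
        assert gf2_mul(a, b) == a & b   # multiplication behaves as AND

def gf2n_add(a: int, b: int) -> int:
    """Add two GF(2^n) elements stored as bit vectors (bit i <-> x^i)."""
    return a ^ b

print(bin(gf2n_add(0b1011, 0b0110)))  # 0b1101
```

Because coefficient-wise addition never produces a carry, adding two field elements costs one layer of n XOR gates regardless of n.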
Different architectures for finite field multipliers can generally be divided into
bit-serial, bit-parallel and digit-level architectures. Given a binary extension
field of degree n, bit-serial multipliers need n clock cycles to finish a full
multiplication operation. Although they need the maximum number of clock cycles for
computing the product coordinates, they provide the best area utilization and power
consumption. On the other hand, bit-parallel multipliers utilize the highest level
of parallelism: the multiplication operation is performed fast and needs only one
clock cycle. Digit-level architectures, finally, fill the gap between the bit-serial
and bit-parallel design styles to keep a balance between space and delay complexities.
Since the extension of the Karatsuba algorithm (a "divide and conquer" technique
for efficient integer multiplication) to finite field multiplication with
sub-quadratic space complexity, many improvements have been made to this method over
the past few years. Specifically, these improvements can be summarized into two
sub-areas: one attempts to improve the Karatsuba architecture through an optimized
re-factoring process, and the other focuses on generalizing the Karatsuba formula by
reducing the number of sub-multiplications, which will be introduced in detail in
Chapter 3.
• Chapter 3: This chapter contains two parts: four different kinds of
multiplication algorithms and their comparison based on NIST recommended
$GF(2^n)$ fields. We briefly introduce the original Karatsuba, Overlap-free
Karatsuba, Reconstruction Karatsuba and Improved Reconstruction by Bernstein
multiplication algorithms. We also give the recurrence relations describing each
method’s space and time complexity. Finally, we analyse the results of these four
algorithms applied to different fields and identify the main algorithm that can
be applied efficiently to $GF(2^{128})$.
Preliminary
In this section, three brief definitions of groups, rings and fields are given.
Definition 1: A group is a set G together with a binary operation $\star$ on G such
that the following three properties hold [5]:
$a \star (b \star c) = (a \star b) \star c$
$a \star e = e \star a = a$
$a \star a^{-1} = a^{-1} \star a = e$
$a \star e = e \star a = a$
$a \star b = b \star a$
$(a \cdot b) \cdot c = a \cdot (b \cdot c)$
$a \cdot (b + c) = a \cdot b + a \cdot c$
$(b + c) \cdot a = b \cdot a + c \cdot a$
$a \cdot b = b \cdot a$
A finite field, also called a Galois field, is a set with a finite number of
elements on which addition and multiplication are defined.
• All the non-zero elements in a finite field form a multiplicative group under
the multiplication operation.
• The order of a field element means the order of that element in the
multiplicative group.
• The prime field $GF(p)$ is the set $\{0, 1, 2, \dots, p - 1\}$, where p is a prime
number. In $GF(p)$, the binary operator (·) refers to mod-p multiplication and
An irreducible polynomial of degree n over a finite field cannot be factorized into
factors of degree between 1 and n − 1 over the same field, just as a prime number
has no non-trivial divisors. In this thesis, the irreducible polynomial is fixed for
the field $GF(2^{128})$ and will be discussed in detail in the following sections.
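The analogy with prime numbers can be made concrete by trial division. The sketch below is an illustrative Python model and not the procedure used in this thesis. It stores a polynomial over $GF(2)$ as an integer bit mask (bit i holds the coefficient of $x^i$), which is an assumed encoding, and tests small example polynomials rather than the degree-128 polynomial.

```python
def poly_deg(p: int) -> int:
    """Degree of a GF(2) polynomial stored as a bit mask (bit i <-> x^i)."""
    return p.bit_length() - 1

def poly_mod(a: int, m: int) -> int:
    """Remainder of a divided by m, with coefficients in GF(2)."""
    dm = poly_deg(m)
    while a and poly_deg(a) >= dm:
        a ^= m << (poly_deg(a) - dm)   # cancel the leading term of a
    return a

def is_irreducible(f: int) -> bool:
    """Trial division by every polynomial of degree 1 .. deg(f) // 2."""
    n = poly_deg(f)
    if n < 1:
        return False
    for d in range(2, 1 << (n // 2 + 1)):   # bit masks of degree 1 .. n // 2
        if poly_mod(f, d) == 0:
            return False
    return True

print(is_irreducible(0b10011))  # x^4 + x + 1
print(is_irreducible(0b10001))  # x^4 + 1 = (x + 1)^4
```

Trial division is exponential in the degree, so it is only practical for small examples like these.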
$\mathbb{C} = \{a + bi \mid a, b \in \mathbb{R},\ i = \sqrt{-1}\} = \{a + bi \mid a, b \in \mathbb{R},\ i^2 + 1 = 0\}$,
$A + B = (a_0 + b_0) + (a_1 + b_1)i$
$A \times B = (a_0 b_0 - a_1 b_1) + (a_1 b_0 + a_0 b_1)i$
Because the equation $i^2 + 1 = 0$ does not have a root in the real number field,
the polynomial $i^2 + 1$ is called irreducible over the real number field.
The construction of the complex number field $\mathbb{C}$ from the real number field
$\mathbb{R}$, together with its arithmetic, can be summarized as below:
2. Let i be the root of the equation $i^2 + 1 = 0$ and form the expression $a + bi$,
where $a, b \in \mathbb{R}$. This gives the representation of the elements of the
complex field $\mathbb{C}$.
Similar to the case of the complex numbers $\mathbb{C}$ and their arithmetic, we can
derive $GF(2^n)$ and its arithmetic as follows:
3. Use x as the root of $f(x) = 0$. Then
$GF(2^n) = \{a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \dots + a_0 \mid a_i \in GF(2),\ f(x) = 0\}$
4. Arithmetic operations in $GF(2^n)$. For $A, B \in GF(2^n)$ with
$A = \sum_{i=0}^{n-1} a_i x^i$ and $B = \sum_{i=0}^{n-1} b_i x^i$, we get
$A + B = \left(\sum_{i=0}^{n-1} (a_i + b_i) x^i\right) \bmod 2$
$A \times B = \left(\sum_{i=0}^{n-1} a_i x^i \times \sum_{i=0}^{n-1} b_i x^i\right) \bmod 2 \bmod f(x)$
Note that the product of the multiplication operation must be reduced modulo
$f(x)$ so that its degree is no higher than n − 1.
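As a small software illustration of step 4, the following Python sketch multiplies two elements of $GF(2^4)$. It is a reference model under stated assumptions: polynomials are stored as bit masks (bit i holds the coefficient of $x^i$), and $f(x) = x^4 + x + 1$ is chosen only as an example irreducible polynomial; the names `clmul` and `gf_mul` are illustrative.

```python
def clmul(a: int, b: int) -> int:
    """Polynomial product over GF(2) (the 'mod 2' part): shift-and-XOR."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf_mul(a: int, b: int, f: int) -> int:
    """Product in GF(2^n): polynomial product reduced modulo f(x)."""
    n = f.bit_length() - 1                 # degree of f(x)
    r = clmul(a, b)
    for i in range(r.bit_length() - 1, n - 1, -1):
        if (r >> i) & 1:
            r ^= f << (i - n)              # cancel the term x^i using f(x)
    return r

F = 0b10011  # assumed example: f(x) = x^4 + x + 1
# (x^2 + x)(x^2 + x + 1) = x^4 + x, and x^4 = x + 1 modulo f(x), so the result is 1
print(gf_mul(0b0110, 0b0111, F))  # 1
```

The unreduced product has degree up to 2n − 2, which is why the reduction loop starts from the top bit and works down to degree n.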
Time and space complexities are used to measure the efficiency of $GF(2^n)$
multipliers. In $GF(2)$, addition can be implemented by a 2-input XOR gate and
multiplication by a 2-input AND gate. Accordingly, the space complexity can be
represented by the total number of AND gates and XOR gates, and the time complexity
can be measured by the delays of the AND and XOR gates along the critical path. We
use $S_\oplus$ and $S_\otimes$ to denote the number of XOR and AND gates,
respectively, and $T_A$ and $T_X$ to represent the delay of one AND gate and one
XOR gate, respectively.
Compared with bit-parallel multiplication, bit-serial multiplication has a lower
space cost, which makes it competitive in resource-constrained applications. Based
on the input and output sequences, bit-serial multiplication can be divided into
four types, as follows [9]:
An Overview of Bit-Parallel
Multiplication for $GF(2^n)$ and
Comparison
This chapter contains two parts: four different kinds of multiplication algorithms
and their comparison based on NIST recommended $GF(2^n)$ fields. First, we briefly
introduce the original Karatsuba, Overlap-free Karatsuba, Reconstruction Karatsuba
and Improved Reconstruction by Bernstein multiplication algorithms. We also give the
recurrence relations describing each method’s space and time complexity. After that,
we analyse the results of these four algorithms applied to different fields and
identify the main algorithm that can be applied efficiently to $GF(2^{128})$.
In the early 1960s, the first sub-quadratic integer multiplication algorithm was
invented by A. A. Karatsuba for fast multiplication of multi-place numbers [10].
After that,
First, the previous KOA implementations split polynomials A and B into the "most
significant half" and the "least significant half" as follows:
$A = \sum_{i=0}^{n-1} a_i x^i = x^m \sum_{i=0}^{m-1} a_{m+i} x^i + \sum_{i=0}^{m-1} a_i x^i = x^m A_H + A_L$
$B = \sum_{i=0}^{n-1} b_i x^i = x^m \sum_{i=0}^{m-1} b_{m+i} x^i + \sum_{i=0}^{m-1} b_i x^i = x^m B_H + B_L$
where $A_H = \sum_{i=0}^{m-1} a_{m+i} x^i$, $A_L = \sum_{i=0}^{m-1} a_i x^i$, and
$B_H$ and $B_L$ are defined similarly.
The product AB can then be written as
$AB = A_H B_H x^{2m} + \{[(A_H + A_L)(B_H + B_L)] - [A_H B_H + A_L B_L]\}x^m + A_L B_L$ (3.1)
We note that in $GF(2)$ "-" is the same as "+", which means that a 2-input XOR
gate is needed. For a VLSI implementation of (3.1), the expressions in the two
square brackets are computed concurrently, and one XOR gate delay $1T_X$ is
required. As mentioned, the "-" operation is also performed at a cost of $1T_X$.
Therefore, two XOR gate delays $2T_X$ are needed in addition to the computation of
the three partial products $A_H B_H$, $A_L B_L$ and $(A_H + A_L)(B_H + B_L)$.
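One step of the splitting in (3.1) can be sketched in software as follows. This is an illustrative Python model rather than the hardware description. It assumes n is even with m = n/2 and that polynomials are stored as bit masks (bit i holds the coefficient of $x^i$); the sub-products are computed by a plain shift-and-XOR routine instead of recursively.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook polynomial product over GF(2), used as the reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def koa_step(a: int, b: int, n: int) -> int:
    """One KOA step for n-bit operands, split at m = n // 2 (n assumed even)."""
    m = n // 2
    mask = (1 << m) - 1
    a_l, a_h = a & mask, a >> m
    b_l, b_h = b & mask, b >> m
    hh = clmul(a_h, b_h)                  # A_H * B_H
    ll = clmul(a_l, b_l)                  # A_L * B_L
    mm = clmul(a_h ^ a_l, b_h ^ b_l)      # (A_H + A_L)(B_H + B_L)
    mid = mm ^ (hh ^ ll)                  # "-" is XOR in GF(2)
    # The three shifted terms share exponents, so they are merged with XOR.
    return (hh << (2 * m)) ^ (mid << m) ^ ll

assert all(koa_step(a, b, 4) == clmul(a, b) for a in range(16) for b in range(16))
```

Only three half-size multiplications are used instead of four, at the price of the extra XOR operations counted in the delay analysis.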
XOR gates, $T_\otimes(n)$ and $T_\oplus(n)$ to denote the delays produced by AND and
XOR gates, respectively.
After solving the above recurrence relations using the formula derived in the new
method [13], we obtain the following complexity results for the binary polynomial
KOA [17], [14]:
$S_\oplus(n) = 6n^{\log_2 3}$
$S_\otimes(n) = n^{\log_2 3}$
$T_\oplus(n) = 3\log_2 n - 1$
$T_\otimes(n) = 1$ (3.2)
In 2010, H. Fan proposed a new method to implement the polynomial KOA for hardware
multipliers [12]. It eliminates the overlaps present in the previous designs, so
the XOR gate delay of the proposed method is clearly better than that of the
original KOA. In addition to its theoretical significance, this new method is also
suitable for practical VLSI applications such as designs of hybrid $GF(2^n)$
multipliers.
From equation (3.1), we can see that the partial polynomials $A_H B_H x^{2m}$,
$\{[(A_H + A_L)(B_H + B_L)] - [A_H B_H + A_L B_L]\}x^m$ and $A_L B_L$ are XORed by
adding coefficients of common exponents of x together. The VLSI module used to
perform this XOR operation is called the overlap module [14]. In order to explain
the overlaps of common exponents of x clearly, we present figure 3.1, which shows
the ranges of x’s exponents in these three polynomials. From the figure, it is easy
to see that overlaps occur only when $n \ge 4$ or $m \ge 2$, and there is no
overlap when n = 2 or m = 1.
Because of the overlaps, one more XOR gate delay is needed in the overlap module
to compute the summation of the three polynomials $A_H B_H x^{2m}$,
$\{[(A_H + A_L)(B_H + B_L)] - [A_H B_H + A_L B_L]\}x^m$ and $A_L B_L$. Accordingly,
a total of 3 XOR gate delays are required in (3.1) besides the cost of the
recursive computation of the three partial products.
Therefore, a new method that focuses on the overlaps has been proposed. Instead of
splitting the two input operands into the "most significant half" and the "least
significant half", this new method splits the operands according to the parity of
x’s exponents. So we
$A = \sum_{i=0}^{n-1} a_i x^i = \sum_{i=0}^{m-1} a_{2i} x^{2i} + \sum_{i=0}^{m-1} a_{2i+1} x^{2i+1} = \sum_{i=0}^{m-1} a_{2i} x^{2i} + x \sum_{i=0}^{m-1} a_{2i+1} x^{2i}$
$B = \sum_{i=0}^{n-1} b_i x^i = \sum_{i=0}^{m-1} b_{2i} x^{2i} + \sum_{i=0}^{m-1} b_{2i+1} x^{2i+1} = \sum_{i=0}^{m-1} b_{2i} x^{2i} + x \sum_{i=0}^{m-1} b_{2i+1} x^{2i}$
where $y = x^2$, $A_e(y) = \sum_{i=0}^{m-1} a_{2i} y^i$,
$A_o(y) = \sum_{i=0}^{m-1} a_{2i+1} y^i$, and $B_e(y)$ and $B_o(y)$ are defined
similarly. Because $A_e(y)$, $A_o(y)$, $B_e(y)$ and $B_o(y)$ are polynomials in y of
degree less than m, the multiplications among them can also be computed
recursively. The product of A and B is then given by the following KOA-like formula:
$AB = \{A_e(y)B_e(y) + x^2 A_o(y)B_o(y)\} + x\{A_e(y)B_o(y) + A_o(y)B_e(y)\}$
$= \{A_e(y)B_e(y) + yA_o(y)B_o(y)\} + x\{(A_e(y) + A_o(y))(B_e(y) + B_o(y)) - (A_e(y)B_e(y) + A_o(y)B_o(y))\}$ (3.3)
Formula (3.3) also includes three partial products, and in a hardware
implementation multiplying a polynomial by x or $y = x^2$ is equivalent to shifting
its coefficients left, so no extra gate is required. It is easy to check that the
expansion of $A_e(y)B_e(y) + yA_o(y)B_o(y)$ contains only even exponents of x, and
the expansion of $x\{(A_e(y) + A_o(y))(B_e(y) + B_o(y)) - (A_e(y)B_e(y) + A_o(y)B_o(y))\}$
contains only odd exponents of x. Therefore, no overlap exists when computing their
summation, and no gate is needed either.
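The absence of overlap in (3.3) can be observed directly in a software model. The following Python sketch is illustrative only. It assumes n is even and that polynomials are stored as bit masks (bit i holds the coefficient of $x^i$); the helper names `split_even_odd` and `spread` are hypothetical, and the sub-products use a plain shift-and-XOR routine instead of recursion.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook polynomial product over GF(2), used as the reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def split_even_odd(a: int, n: int):
    """Return (A_e, A_o) as bit masks in y = x^2 for an n-bit operand."""
    a_e = a_o = 0
    for i in range(0, n, 2):
        a_e |= ((a >> i) & 1) << (i // 2)
        a_o |= ((a >> (i + 1)) & 1) << (i // 2)
    return a_e, a_o

def spread(p: int) -> int:
    """Substitute y = x^2: move bit i of p to bit 2*i."""
    r, i = 0, 0
    while p:
        r |= (p & 1) << (2 * i)
        p >>= 1
        i += 1
    return r

def overlap_free_step(a: int, b: int, n: int) -> int:
    a_e, a_o = split_even_odd(a, n)
    b_e, b_o = split_even_odd(b, n)
    ee = clmul(a_e, b_e)                   # A_e(y) B_e(y)
    oo = clmul(a_o, b_o)                   # A_o(y) B_o(y)
    mm = clmul(a_e ^ a_o, b_e ^ b_o)       # (A_e + A_o)(B_e + B_o)
    even_x = spread(ee ^ (oo << 1))        # A_e B_e + y A_o B_o: even powers of x
    odd_x = spread(mm ^ ee ^ oo) << 1      # x{...}: odd powers of x
    assert (even_x & odd_x) == 0           # the two parts never share an exponent
    return even_x | odd_x                  # merging is pure wiring, no XOR needed

assert all(overlap_free_step(a, b, 4) == clmul(a, b)
           for a in range(16) for b in range(16))
```

The final merge uses a bit-wise OR to stress that the even and odd parts are interleaved, which is the software counterpart of saying that no gate is needed in the last summation.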
Consequently, the recurrence relations describing the time and space complexities
can be stated as follows:
$S_\otimes(2) = 3$
$T_\otimes(2) = 1$
Compared with formula (3.2), the overlap-free method reduces the XOR gate delay
$T_\oplus(n)$ from $3\log_2 n - 1$ to $2\log_2 n$, a saving that approaches 33% for
$n = 2^t$ ($t > 1$).
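To make the size of the saving concrete, the two delay formulas can be evaluated for a few field sizes. The Python sketch below only evaluates the closed forms quoted above for n a power of two; the function names are illustrative.

```python
def koa_xor_delay(n: int) -> int:
    """XOR gate delays of the original KOA: 3*log2(n) - 1, n a power of two."""
    k = n.bit_length() - 1      # exact log2(n) for n = 2^k
    return 3 * k - 1

def overlap_free_xor_delay(n: int) -> int:
    """XOR gate delays of the overlap-free method: 2*log2(n)."""
    k = n.bit_length() - 1
    return 2 * k

for n in (4, 16, 128):
    old, new = koa_xor_delay(n), overlap_free_xor_delay(n)
    print(f"n={n}: {old} -> {new} XOR delays ({(old - new) / old:.0%} saved)")
```

For n = 128 the delay drops from 20 to 14 XOR gate delays, a saving of 30%; the relative saving $(\log_2 n - 1)/(3\log_2 n - 1)$ tends to one third as n grows.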
In 2009, Bernstein [15] and Zhou and Michalik [16] optimized the reconstruction
part of the Karatsuba formula by factorizing some common terms. Bernstein also
applied this optimization to the reconstruction of the Karatsuba formula and then
to two recursions of Karatsuba, resulting in an XOR gate count $S_\oplus$ with
leading term $5.46\,n^{\log_2 3}$ instead of $6\,n^{\log_2 3}$ for the original
Karatsuba formula, and a delay of $2.5\log_2(n)\,T_\oplus + T_\otimes$.
Let us consider two polynomials of degree n − 1, $A(x) = \sum_{i=0}^{n-1} a_i x^i$
and $B(x) = \sum_{i=0}^{n-1} b_i x^i$, with $n = 2^k$. The method of Karatsuba for
polynomial multiplication consists of
and then generate three polynomials of half size $A'_0 = A_L$, $A'_1 = A_L + A_H$
and $A'_2 = A_H$. In the same way, for $B = B_L + B_H x^{n/2}$, we generate
$B'_0 = B_L$, $B'_1 = B_L + B_H$ and $B'_2 = B_H$.
$C'_0 = A'_0 B'_0 = A_L B_L$
$C'_1 = A'_1 B'_1 = (A_L + A_H)(B_L + B_H)$ (3.5)
$C'_2 = A'_2 B'_2 = A_H B_H$
• Reconstruction. We reconstruct C = A × B as
$C = C'_0 (1 + x^{n/2}) + C'_1 x^{n/2} + C'_2 x^{n/2} (1 + x^{n/2})$
$= C'_0 + (C'_0 + C'_1 + C'_2) x^{n/2} + C'_2 x^n$ (3.6)
The three half-size products $C'_0$, $C'_1$ and $C'_2$ of (3.5) are computed by
applying the same method recursively. If the recursive computations are performed
in parallel, we get a parallel multiplier with sub-quadratic space complexity and
logarithmic delay. And a non-recursive form of the number of XOR gates, AND
Step 1. $R_0 = P_0 + x^{n/2} P_1$ (Cost = n/2 − 1 bit additions)
Step 2. $R_1 = R_0 (1 + x^{n/2})$ (Cost = n − 1 bit additions) (3.8)
Step 3. $C = R_1 + P_2 x^{n/2}$ (Cost = n − 1 bit additions)
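The three steps of (3.8) can be checked against a schoolbook product in software. The Python sketch below is illustrative only and relies on an assumption that the surviving text does not state explicitly: $P_0 = A_L B_L$, $P_1 = A_H B_H$ and $P_2 = (A_L + A_H)(B_L + B_H)$. With this assignment the three steps expand to formula (3.6). Polynomials are stored as bit masks (bit i holds the coefficient of $x^i$).

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook polynomial product over GF(2), used as the reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reconstruct(p0: int, p1: int, p2: int, n: int) -> int:
    """Steps 1-3 of (3.8); multiplying by x^(n/2) is a left shift by n/2."""
    h = n // 2
    r0 = p0 ^ (p1 << h)        # Step 1: R0 = P0 + x^(n/2) P1
    r1 = r0 ^ (r0 << h)        # Step 2: R1 = R0 (1 + x^(n/2))
    return r1 ^ (p2 << h)      # Step 3: C  = R1 + P2 x^(n/2)

def karatsuba_improved(a: int, b: int, n: int) -> int:
    h = n // 2
    mask = (1 << h) - 1
    a_l, a_h = a & mask, a >> h
    b_l, b_h = b & mask, b >> h
    p0 = clmul(a_l, b_l)               # assumed: P0 = A_L B_L
    p1 = clmul(a_h, b_h)               # assumed: P1 = A_H B_H
    p2 = clmul(a_l ^ a_h, b_l ^ b_h)   # assumed: P2 = (A_L + A_H)(B_L + B_H)
    return reconstruct(p0, p1, p2, n)

assert all(karatsuba_improved(a, b, 4) == clmul(a, b)
           for a in range(16) for b in range(16))
```

Expanding the steps gives $P_0 + (P_0 + P_1 + P_2)x^{n/2} + P_1 x^n$, so the sum $P_0 + x^{n/2}P_1$ is formed once and reused, which is where the saving in bit additions comes from.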
This method reduces the number of bit additions of one recursion of the Karatsuba
formula to $S_\oplus(n) = 7n/2 - 3 + 3S_\oplus(n/2)$, which gives for a full
recursion $S_\oplus(n) = 5.5\,n^{\log_2 3} - 7n + 3/2$. However, this method still
has a delay of $T = 3\log_2(n)\,T_\oplus + T_\otimes$. In what follows, we refer to
the reconstruction formula (3.8) as improved reconstruction by Bernstein.
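The closed form can be cross-checked against the recurrence numerically. The Python sketch below assumes the base case $S_\oplus(1) = 0$ (a 1-bit product is a single AND gate), which is our assumption and is not stated in the text above, and uses $n^{\log_2 3} = 3^k$ for $n = 2^k$.

```python
def xor_count(n: int) -> int:
    """S(n) = 7n/2 - 3 + 3*S(n/2), with the assumed base case S(1) = 0."""
    if n == 1:
        return 0
    return 7 * n // 2 - 3 + 3 * xor_count(n // 2)

def closed_form(n: int) -> int:
    """5.5 * n^(log2 3) - 7n + 3/2, written with integers for n = 2^k."""
    k = n.bit_length() - 1
    return (11 * 3**k - 14 * n + 3) // 2

for n in (2, 4, 8, 16, 32, 64, 128):
    assert xor_count(n) == closed_form(n)

print(xor_count(128))  # 11134
```

The agreement for every power of two up to 128 supports reading the exponent in the closed form as $\log_2 3$, consistent with (3.2).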
3.5 Comparison
From the previous sections, we have summarized four different kinds of bit-parallel
multiplication algorithms: original KOA, overlap-free KOA, reconstruction Karatsuba
and improved reconstruction by Bernstein. We now collect the results of all four
algorithms and briefly compare them in terms of space complexity (the number of AND
gates and XOR gates) and time complexity. The comparison is shown in the form of a
table as follows: For a more specific numerical comparison, we
Figure 3.2: Comparison of time and space complexities of four different
multiplication algorithms
First, according to figure 3.2, we analyse the data in the horizontal direction,
which means that we compare three measures among the four multiplication
algorithms: #AND (the number of AND gates), #XOR (the number of XOR gates) and time
complexity.
where we use blue, orange and yellow columns to represent #AND, #XOR and
time complexity, respectively.
• Using the Overlap-free Karatsuba algorithm achieves the lowest time complexity.
Setting aside the number of AND gates, we make the vertical comparison only for two
measures among these four algorithms: #XOR and time complexity.
where red and blue columns represent the Improved Reconstruction by Bernstein
where blue, red and yellow columns represent the original KOA, Overlap-free
Karatsuba and Reconstruction Karatsuba (or Improved Reconstruction by Bernstein),
respectively.
• The apparent gap between Overlap-free Karatsuba and the other three algorithms
always exists regardless of how the value of n changes.
Based on these figures and analyses, we focus on the Overlap-free Karatsuba
algorithm and its hardware implementation in the following chapters of this thesis.
Although the Improved Reconstruction by Bernstein algorithm does well in space
complexity, especially in the number of XOR gates, this result is the consequence
of a very large value of m. Because of the limit on the number of inputs and
outputs of the FPGA (Field-Programmable Gate Array) board, which will be discussed
in the next chapter, we design the hardware implementation for n = 128. In this
case, the Improved Reconstruction by Bernstein algorithm does not perform better in
the comparison of the number of XOR gates. Therefore, we only study the
Overlap-free Karatsuba multiplication algorithm in the following chapter. We will
also compare the proposed hardware implementation with other methods and other
published data in terms of space and time complexities in detail.
Chapter 4
Proposed Hardware
Implementation of Modified
Overlap-free Karatsuba
Multiplication Algorithm for
$GF(2^n)$
Field Programmable Gate Arrays (FPGAs) are semiconductor devices that are
based around a matrix of configurable logic blocks (CLBs) connected via
programmable interconnects [19]. The typical internal structure of an FPGA
(figure 4.1) comprises three major elements:
• Configurable logic blocks (CLBs), shown as blue boxes in figure 4.1, are
the resources of the FPGA meant to implement logic functions. Each CLB
comprises a set of slices, which are further decomposable into a definite
number of look-up tables (LUTs), flip-flops (FFs) and multiplexers (MUXes).
In general, FPGAs are more flexible than ASICs because they can be programmed
easily for the desired functions or applications, with the emphasis on the ease of
re-programmability. This is the feature that makes such devices suitable for
building processing units for polynomials, which are likely to have to adapt to
parameter changes from time to time. The fundamental building block of an FPGA is
its logic cell. Despite the different hardware used to realize the logic cell
functions and the different input widths provided by various FPGA vendors, logic
cells can be mapped to certain logic functions with the help of the synthesis and
mapping tools.
FPGAs are much more than just a bunch of gates. Although it is possible to build
logic circuits of any complexity simply by arranging and connecting logic gates, it
is neither practical nor efficient. We therefore need a way to express the logic in
an easy-to-use format that can eventually be converted to an array of gates. This
thesis uses an HDL for that purpose throughout. A Hardware Description Language
(HDL) is a software programming language used to model the intended operation
of a piece of hardware. There are two aspects to the description of hardware
that an HDL facilitates: abstract behaviour modelling and hardware structure
modelling.
• The behavioural constructs of Verilog could describe both hardware and test
stimulus.
According to these features, we choose Verilog as the HDL in this thesis to write
the code and program the FPGA board.
Simulation is a fundamental and essential part of the design process for any
electronic product, not just FPGA devices. For FPGA devices, simulation
is the process of verifying the functional characteristics of models at any level
of behaviour, that is, from high levels of abstraction down to low levels. The
basic arrangement for simulation is shown in Figure 4.2.
In this thesis, we choose the Xilinx ISE software as the simulator for the FPGA
board hardware simulation. Xilinx ISE (Integrated Synthesis Environment) is
produced by Xilinx for synthesis and analysis of HDL designs. The ISE software
controls all aspects of the design flow [24].
Through the Project Navigator interface (shown in figure 4.3), you can access all
of the design entry and design implementation tools. You can also access the files
and documents associated with your project.
In this section, we first present the complexity analysis obtained by applying the
1-step Overlap-free KA (Karatsuba) to even-term polynomials (ETP). Then we apply the
proposed modified algorithm on the FPGA and obtain the results for $GF(2^{128})$.
We now give an example to compare the proposed modified method with the
original KOA. We assume n = 4 and let
$A = a_3x^3 + a_2x^2 + a_1x + a_0 = A_H x^2 + A_L$
$B = b_3x^3 + b_2x^2 + b_1x + b_0 = B_H x^2 + B_L$
There are three products of polynomials of degree 1 in (4.1), and they can be
computed recursively using the KOA at a cost of $2T_X$.
To show the role of the overlap in (4.1), we group the three products in (4.1) and
write them as polynomials of degree 2 in x as follows:
$d_2x^2 + d_1x + d_0 = A_H B_H$
$f_2x^2 + f_1x + f_0 = A_L B_L$
Then we have
Obviously, one XOR gate delay $1T_X$ is required to compute the overlap summations
$(d_0 + e_2)$ and $(e_0 + f_2)$. Because we need $2T_X$ to perform the XOR
operations in the curly bracket of (4.1), the total number of XOR gate delays of
the original KOA is 2 + 1 + 2 = 5.
$AB = \{A_e(y)B_e(y) + x^2 A_o(y)B_o(y)\} + x\{A_e(y)B_o(y) + A_o(y)B_e(y)\}$
$= \{A_e(y)B_e(y) + yA_o(y)B_o(y)\} + x\{(A_e(y) + A_o(y))(B_e(y) + B_o(y)) - (A_e(y)B_e(y) + A_o(y)B_o(y))\}$ (4.3)
$p_2y^2 + p_1y + p_0 = A_e(y)B_e(y)$
$q_2y^2 + q_1y + q_0 = A_o(y)B_o(y)$
We need $1T_X$ to perform the "+" operations in the last two equations. We also need
$2T_X$ to compute the three products of polynomials of degree 1 in y in the above
four equations. Then the product AB can be written as follows:
$AB = (p_2y^2 + p_1y + p_0) + y(q_2y^2 + q_1y + q_0) + x\{(r_2y^2 + r_1y + r_0) + (s_2y^2 + s_1y + s_0)\}$
$= q_2x^6 + (p_2 + q_1)x^4 + (p_1 + q_0)x^2 + p_0 + x\{(r_2 + s_2)x^4 + (r_1 + s_1)x^2 + (r_0 + s_0)\}$ (4.4)
Evidently, one XOR gate delay is needed to obtain the summations in the five
brackets. Therefore the total number of XOR gate delay is 4, and 1Tx has been
saved compared to the original KOA.
Figure 4.4 shows the multiplier architecture obtained by applying one step of the
Overlap-free KA algorithm as in the above example, if m = n is even. The multiplier
includes three stages: the splitting stage, the sub-multiplier stage and the
alignment stage, where the three sub-multipliers operate in parallel.
In this architecture [16], the function of each part can be clearly identified. The
splitting stage requires m XOR gates to generate the inputs for the middle
multiplier, which computes the product of $A_e(y) + A_o(y)$ and $B_e(y) + B_o(y)$.
The alignment stage merges the outputs of the sub-multipliers according to their
degrees. Both in figure 4.4 and in (4.5), common sub-expressions are found when
calculating $D_{m/2 \dots m-2}$ and $D_{m \dots 3m/2-2}$:
$D_{m/2 \dots m-2} = [U_{m/2 \dots m-2} + W_{0 \dots m/2-2}] + U_{0 \dots m/2-2} + V_{0 \dots m/2-2}$
$D_{m \dots 3m/2-2} = [U_{m/2 \dots m-2} + W_{0 \dots m/2-2}] + W_{m/2 \dots m-2} + V_{m/2 \dots m-2}$ (4.5)
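module mul_2_module (
    input  [1:0] A,
    input  [1:0] B,
    output [3:0] mul_2
);
    // Coefficient of x^0: A_L * B_L
    assign mul_2[0] = A[0] & B[0];
    // Coefficient of x^2: A_H * B_H
    assign mul_2[2] = A[1] & B[1];
    // Coefficient of x^1: (A_L + A_H)(B_L + B_H) + A_L*B_L + A_H*B_H
    assign mul_2[1] = ((A[0] ^ A[1]) & (B[0] ^ B[1])) ^ mul_2[0] ^ mul_2[2];
    // The product of two degree-1 polynomials has degree 2, so bit 3 is zero
    assign mul_2[3] = 1'b0;
endmodule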
is equal to the function $A_L B_L$,
assign mul_2[2] = A[1] & B[1]
is equal to the function $A_H B_H$,
and the expression (A[0]^A[1])&(B[0]^B[1])^mul_2[0]^mul_2[2]
Then we extend the value of n from 2 to 4; the corresponding Verilog HDL is shown in
table 4.2. Since 4 is exactly double the size of 2, we use nested module
instantiation (the nested and transferred statement) to build the module. For this
value, overlap occurs during the alignment stage, and we apply the proposed
algorithm in this part, as also shown in table 4.2. The specific code is as follows:
assign d7 = d2 ^ d1 ^ d0;
assign mul_4[7:0] = {d2[3:2], (d2[1:0] ^ d7[3:2]),
                     (d0[3:2] ^ d7[1:0]), d0[1:0]};
In the same way, we can extend the value of n up to 128. The detailed Verilog HDL
code is given in the Appendix at the end of this thesis.
Table 4.2: Verilog HDL n = 4 module
module mul_4_module (
    input  [3:0] A,
    input  [3:0] B,
    output [7:0] mul_4
);
    wire [3:0] d0, d1, d2, d7;
    // d0 = A_L * B_L
    mul_2_module u0 ((A[1:0]), (B[1:0]), (d0));
    // d1 = (A_L + A_H)(B_L + B_H)
    mul_2_module u1 ((A[1:0] ^ A[3:2]), (B[1:0] ^ B[3:2]), (d1));
    // d2 = A_H * B_H
    mul_2_module u2 ((A[3:2]), (B[3:2]), (d2));
    // Middle term: d1 + d0 + d2
    assign d7 = d2 ^ d1 ^ d0;
    // Alignment: d2 * x^4 + d7 * x^2 + d0
    assign mul_4[7:0] = {d2[3:2], (d2[1:0] ^ d7[3:2]),
                         (d0[3:2] ^ d7[1:0]), d0[1:0]};
endmodule
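Before moving to larger n, the two small modules can be cross-checked in software. The following Python sketch is a bit-level model written for illustration only; it is not a Verilog testbench and is not part of the thesis code. It mirrors the signal names d0, d1, d2 and d7 of the n = 4 module and compares every input pair against a schoolbook polynomial product.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook polynomial product over GF(2), used as the reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def mul_2(a: int, b: int) -> int:
    """Model of mul_2_module: 2-bit operands, 4-bit result (bit 3 is zero)."""
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1
    lo = a0 & b0                               # mul_2[0]
    hi = a1 & b1                               # mul_2[2]
    mid = ((a0 ^ a1) & (b0 ^ b1)) ^ lo ^ hi    # mul_2[1]
    return (hi << 2) | (mid << 1) | lo

def mul_4(a: int, b: int) -> int:
    """Model of mul_4_module built from three mul_2 instances."""
    d0 = mul_2(a & 3, b & 3)
    d1 = mul_2((a & 3) ^ (a >> 2), (b & 3) ^ (b >> 2))
    d2 = mul_2(a >> 2, b >> 2)
    d7 = d2 ^ d1 ^ d0
    # Same alignment as the Verilog concatenation: d2*x^4 + d7*x^2 + d0
    return (d2 << 4) ^ (d7 << 2) ^ d0

assert all(mul_4(a, b) == clmul(a, b) for a in range(16) for b in range(16))
```

The exhaustive check covers all 256 input pairs, which is feasible only for small n; for n = 128 a model like this would have to be compared on sampled inputs instead.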
Following the nested and transferred statement, we finally obtain the module for
n = 128 in Appendix A. Then we use the simulator, the Xilinx ISE software, to
complete the simulation of the whole module. In the simulator, we first target this
module to the chosen board, Xilinx Spartan-605. We then complete the Implement
Design part, including translate, map and place & route. After that, we generate
the programming file, and from the simulation part we obtain the RTL schematic in
figure 4.5.
In figure 4.5, we can directly know that our module exactly follows multiplier
architecture, in figure 4.4. There are several CLBs shown in the RTL scheme,
including the input, output and the name of the block, which also illustrates the
steps.
Opening each CLB shows the LUTs, FFs and MUXes it contains. Figure 4.6
summarises the kinds of LUT that occur in the whole module.
Internally, a LUT comprises 1-bit memory cells (SRAM bits) and a set of
multiplexers. One of these SRAM bits is made available at the LUT’s output,
depending on the value(s) fed to the control line(s) of the multiplexers.
Because of these features, the LUT is an important cell in the CLB. If we can
design the LUT structure, we may optimize the input-to-output speed, which on a
chip corresponds to the speed of reading and writing information.
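The description above corresponds to a very small behavioural model. The sketch
below is illustrative only and does not describe a specific Xilinx primitive:
the module name lut_model, the parameters K and INIT, and the port names sel and
y are hypothetical, and the default INIT value simply programs a 4-input XOR.

```verilog
// Behavioural model of a K-input LUT (illustrative; all names are hypothetical).
module lut_model #(
    parameter K = 4,
    // One stored bit per input combination. 16'h6996 is the truth table of a
    // 4-input XOR: bit i is 1 when i has an odd number of 1s.
    parameter [(1 << K) - 1:0] INIT = 16'h6996
) (
    input  [K-1:0] sel,   // LUT inputs, used as multiplexer control lines
    output         y      // selected memory cell
);
    // The stored bits play the role of the SRAM cells.
    wire [(1 << K) - 1:0] cells = INIT;

    // Indexing by sel models the multiplexer tree that picks one cell.
    assign y = cells[sel];
endmodule
```

Changing INIT changes the logic function without changing the structure, which
is why the same LUT cell can realise any function of its K inputs.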
We can also run the ISim simulation in the simulator, shown in Figures 4.7 and
4.8 without and with input stimuli respectively. We can set the value of each
input, read the output value directly, analyse the time delay and observe the
waveform changes if we define a clock.
In the next section, we discuss the time delay and compare the output values
using the ISim simulation.
In this section, we first present the simulation results of the proposed
modified module, written in Verilog and simulated in the ISE system. We then
compare it with other published multipliers for the fields GF(2^4), GF(2^8) and
GF(2^16), with reference to a specific paper.
First, we take the simulation results of the proposed modified module using the
overlap-free Karatsuba multiplication algorithm for GF(2^4) as an example, shown
in the following code and figures.
The Verilog code for the proposed modified module is given in Appendix B. In
this code, the two inputs are first set to 001 and 001 respectively, and the
system waits 100 ns for the global reset to finish. The value of B, one of the
inputs, is then stepped from 001 to 111, changing every 1 ns. Running this in
the simulation system gives the following figure.
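The stimulus just described could be written along the following lines. This is
a sketch under stated assumptions, not the Appendix B code: the testbench name
mul_4_tb and the reference function clmul4 are hypothetical, the body of
mul_2_module is a stand-in for the module defined earlier in the thesis, and the
input values are written as 4-bit numbers to match the ports.

```verilog
`timescale 1ns / 1ps

// Stand-in for the n = 2 module (assumed: product over GF(2), bit 3 always 0).
module mul_2_module (input [1:0] A, input [1:0] B, output [3:0] mul_2);
    assign mul_2[0] = A[0] & B[0];
    assign mul_2[1] = (A[0] & B[1]) ^ (A[1] & B[0]);
    assign mul_2[2] = A[1] & B[1];
    assign mul_2[3] = 1'b0;
endmodule

// n = 4 module, as in Table 4.2.
module mul_4_module (input [3:0] A, input [3:0] B, output [7:0] mul_4);
    wire [3:0] d0, d1, d2, d7;
    mul_2_module u0 (A[1:0],          B[1:0],          d0);
    mul_2_module u1 (A[1:0] ^ A[3:2], B[1:0] ^ B[3:2], d1);
    mul_2_module u2 (A[3:2],          B[3:2],          d2);
    assign d7    = d2 ^ d1 ^ d0;
    assign mul_4 = {d2[3:2], d2[1:0] ^ d7[3:2], d0[3:2] ^ d7[1:0], d0[1:0]};
endmodule

module mul_4_tb;
    reg  [3:0] A, B;
    wire [7:0] mul_4;
    integer    i;

    mul_4_module uut (.A(A), .B(B), .mul_4(mul_4));

    // Reference model: shift-and-XOR multiplication of polynomials over GF(2).
    function [7:0] clmul4;
        input [3:0] a, b;
        integer k;
        begin
            clmul4 = 8'b0;
            for (k = 0; k < 4; k = k + 1)
                if (b[k]) clmul4 = clmul4 ^ ({4'b0, a} << k);
        end
    endfunction

    initial begin
        A = 4'b0001;
        B = 4'b0001;
        #100;                          // wait for the global reset to finish
        for (i = 1; i <= 7; i = i + 1) begin
            B = i[3:0];                // step B from 001 to 111
            #1;                        // hold each value for 1 ns
            if (mul_4 !== clmul4(A, B))
                $display("MISMATCH A=%b B=%b mul_4=%b", A, B, mul_4);
            else
                $display("ok       A=%b B=%b mul_4=%b", A, B, mul_4);
        end
        $finish;
    end
endmodule
```

Comparing the output with an independent shift-and-XOR reference is a design
choice of this sketch; it lets the waveform be checked automatically instead of
by inspection.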
The figure shows, in binary, the multiplication of two 4-bit numbers. Ports A
and B are the input ports that accept the numbers to be multiplied, and port
mul_4 is the output port at which their product is obtained. For example, the
product of 0001 and 0001 (binary), applied at ports A and B respectively, is
obtained at the output port mul_4 as 00000001. The products for the other finite
fields considered, GF(2^8) and GF(2^16), are obtained similarly and are shown in
Figures 4.10 and 4.11 respectively.
4.3.2 Comparison
For the comparison, we refer to the paper FPGA Based Modified Karatsuba
Multiplier [32], because it covers a useful range of finite field multipliers.
We have studied the performance of each multiplier over GF(2^4), GF(2^8) and
GF(2^16) using the Xilinx ISE simulation tool. All multipliers
The table in Figure 4.12 shows the device utilization and combinational path
delay of various types of GF(2^4) multipliers. The proposed modified multiplier
occupies 6 out of 6822 slices and has a combinational path delay of 10.101 ns.
The Modified Karatsuba multiplier, which has the minimum slice count and
combinational path delay among the other multipliers, occupies 6 out of 6822
slices and has a delay of 13.057 ns. Although the two designs use the same
number of slices, the combinational path delay of the proposed modified
multiplier is 23.4% lower than that of the Modified Karatsuba multiplier.
To make the comparison clearer, we implement only the polynomial multiplication
part, which is examined further in Chapter 5. In the following comparison for
GF(2^8) and GF(2^16), we therefore compare the Karatsuba, Modified Karatsuba and
proposed modified Overlap-free multiplication algorithms.
Tables 4.3 and 4.4 show the device utilization and combinational path delay of
the three types of multiplier for GF(2^8) and GF(2^16) respectively. The
combinational path delays of the proposed modified Overlap-free multiplier are
13.425 ns and 18.277 ns respectively. For GF(2^8), this delay is 32.97% lower
than that of the Karatsuba multiplier and 21.19% lower than that of the Modified
Karatsuba multiplier. For GF(2^16), it is 32.34% and 25.13% lower than those of
the Karatsuba multiplier and the Modified Karatsuba multiplier respectively.
Although the proposed modified Overlap-free multiplier does not occupy
noticeably fewer slices than the other two methods, its maximum combinational
path delay is significantly lower than theirs.
In conclusion, the proposed modified multiplier module has lower hardware space
complexity and time complexity than the other finite field multipliers. This
result confirms the comparison made in Chapter 3: multiplication based on the
Overlap-free Karatsuba algorithm has a lower time delay than the other kinds of
finite field multiplier.
Chapter 5
Conclusion
In this chapter, we summarize the main contributions of this thesis and propose
future work on related implementations.
Table 5.1 shows the number of XOR gates for the finite fields defined by the
irreducible polynomials given in Equation 5.1.
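\[
\begin{aligned}
GF(2^{113}) &: F(x)_{113} = x^{113} + x^{9} + 1\\
GF(2^{128}) &: F(x)_{128} = x^{128} + x^{7} + x^{2} + x + 1\\
GF(2^{163}) &: F(x)_{163} = x^{163} + x^{7} + x^{6} + x^{3} + 1\\
GF(2^{193}) &: F(x)_{193} = x^{193} + x^{15} + 1\\
GF(2^{233}) &: F(x)_{233} = x^{233} + x^{74} + 1\\
GF(2^{283}) &: F(x)_{283} = x^{283} + x^{12} + x^{7} + x^{5} + 1
\end{aligned}
\tag{5.1}
\]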
F(x)_{128} is adopted for the GHASH function in the AES-GCM standard [36], and
the other polynomials are recommended for elliptic curve cryptosystems by the
NIST FIPS 186-2 standard [34] or the SECG domain parameters [35].
[3] N. Koblitz, “Elliptic curve cryptosystems,” Math. Comp., vol. 48, no. 177,
pp. 203-209, 1987.
[5] R. Lidl and H. Niederreiter, “Introduction to finite fields and their
applications,” Cambridge University Press, 1994.
Bibliography 48
[8] Y. Li, X. Ma, Y. Zhang, and C. Qi, “Mastrovito form of non-recursive
Karatsuba multiplier for all trinomials,” IEEE Transactions on Computers,
vol. 66, no. 9, pp. 1573-1584, 2017.
[12] H. Fan, J. Sun, M. Gu, and K.-Y. Lam, “Overlap-free Karatsuba-Ofman
polynomial multiplication algorithms,” IET Information Security, vol. 4, no. 1,
pp. 8-14, 2010.
[13] Fan, H., and Hasan, M. A., “A New Approach to Subquadratic Space
Complexity Parallel Multipliers for Extended Binary Fields,” IEEE Transactions
on Computers, vol. 56, no. 2, pp. 224-233, Feb. 2007.
[14] Gathen, J. V. Z., and Shokrollahi, J., “Efficient FPGA-based Karatsuba
Multipliers for Polynomials over F_2,” Proc. 12th Workshop on Selected Areas in
Cryptography (SAC 2005), LNCS 3897, pp. 359-369, 2006.
[17] Paar, C., “A New Architecture for a Parallel Finite Field Multiplier with
Low Complexity Based on Composite Fields,” IEEE Transactions on Computers,
vol. 45, no. 7, pp. 856-861, July 1996.
[20] Sneha H.L., “Purpose and Internal Functionality of FPGA Look-Up Tables,”
[Online]. Available: https://www.allaboutcircuits.com/technical-articles/purpose-and-internal-functionality-of-fpga-look-up-tables/
[23] Nielsen AA, Der BS, Shin J, Vaidyanathan P, Paralanov V, Strychalski EA,
Ross D, Densmore D, Voigt CA, “Genetic circuit design automation,” Science,
vol. 352 (6281), 2016.
[25] Gang Zhou, Harald Michalik, and László Hinsenkamp, “Complexity analysis
and efficient implementations of bit parallel finite field multipliers based
on Karatsuba-Ofman algorithm on FPGAs,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 18, no. 7, July 2010.
[26] T. Zhang and K.K. Parhi, “Systematic Design of Original and Modified
Mastrovito Multipliers for General Irreducible Polynomials,” IEEE Trans.
Computers, vol. 50, no. 7, pp. 734-749, July 2001.
[32] Jagannath Samanta, Razia Sultana, Jaydeb Bhaumik, “FPGA based modified
Karatsuba multiplier,” International Conference on VLSI and Signal
Processing (ICVSP), vol. 10-12, January 2014.
[33] H. Wu, “Bit-parallel finite field multiplier and squarer using polynomial
basis,” IEEE Transactions on Computers, vol. 51, no. 7, pp. 750-758, 2002.
[34] Digital Signature Standard (DSS), FIPS PUB 186-2, NIST, 2000.
[35] Certicom Research, ON, Canada, “SEC 2: Recommended elliptic curve domain
parameters,” 2000.
Appendices 51
module mul_4_module (
input [3:0] A ,
input [3:0] B ,
output [7:0] mul_4
);
wire [3:0] d0 , d1 , d2 , d7 ;
mul_2_module u0 (( A [1:0]) ,( B [1:0]) ,( d0 ));
mul_2_module u1 (( A [1:0]^ A [3:2]) ,( B [1:0]^ B [3:2]) ,( d1 ));
mul_2_module u2 (( A [3:2]) ,( B [3:2]) ,( d2 ));
assign d7 = d2 ^ d1 ^ d0 ;
assign mul_4 [7:0]={ d2 [3:2] ,( d2 [1:0]^ d7 [3:2]) ,( d0 [3:2]^ d7 [1:0]) , d0 [1:0]};
endmodule
module mul_8_module (
input [7:0] A ,
input [7:0] B ,
output [15:0] mul_8
);
wire [7:0] d0 , d1 , d2 , d7 ;
mul_4_module u3 (( A [3:0]) ,( B [3:0]) ,( d0 ));
mul_4_module u4 (( A [3:0]^ A [7:4]) ,( B [3:0]^ B [7:4]) ,( d1 ));
mul_4_module u5 (( A [7:4]) ,( B [7:4]) ,( d2 ));
assign d7 = d2 ^ d1 ^ d0 ;
assign mul_8 [15:0]={ d2 [7:4] ,( d2 [3:0]^ d7 [7:4]) ,( d0 [7:4]^ d7 [3:0]) , d0 [3:0]};
endmodule
module mul_16_module (
input [15:0] A ,
input [15:0] B ,
output [31:0] mul_16
);
wire [15:0] d0 , d1 , d2 , d7 ;
mul_8_module u6 (( A [7:0]) ,( B [7:0]) ,( d0 ));
mul_8_module u7 (( A [7:0]^ A [15:8]) ,( B [7:0]^ B [15:8]) ,( d1 ));
mul_8_module u8 (( A [15:8]) ,( B [15:8]) ,( d2 ));
assign d7 = d2 ^ d1 ^ d0 ;
assign mul_16 [31:0]={ d2 [15:8] ,( d2 [7:0]^ d7 [15:8]) ,( d0 [15:8]^ d7 [7:0]) , d0 [7:0]};
endmodule
module mul_32_module (
input [31:0] A ,
input [31:0] B ,
output [63:0] mul_32
);
wire [31:0] d0 , d1 , d2 , d7 ;
mul_16_module u9 (( A [15:0]) ,( B [15:0]) ,( d0 ));
mul_16_module u10 (( A [15:0]^ A [31:16]) ,( B [15:0]^ B [31:16]) ,( d1 ));
mul_16_module u11 (( A [31:16]) ,( B [31:16]) ,( d2 ));
assign d7 = d2 ^ d1 ^ d0 ;
assign mul_32 [63:0]={ d2 [31:16] ,( d2 [15:0]^ d7 [31:16]) ,( d0 [31:16]^ d7 [15:0]) , d0 [15:0]};
endmodule
module mul_64_module (
input [63:0] A ,
input [63:0] B ,
output [127:0] mul_64
);
wire [63:0] d0 , d1 , d2 , d7 ;
mul_32_module u12 (( A [31:0]) ,( B [31:0]) ,( d0 ));
mul_32_module u13 (( A [31:0]^ A [63:32]) ,( B [31:0]^ B [63:32]) ,( d1 ));
mul_32_module u14 (( A [63:32]) ,( B [63:32]) ,( d2 ));
assign d7 = d2 ^ d1 ^ d0 ;
assign mul_64 [127:0]={ d2 [63:32] ,( d2 [31:0]^ d7 [63:32]) ,( d0 [63:32]^ d7 [31:0]) , d0 [31:0]};
endmodule
end
endmodule
Vita Auctoris