ECC Software Implementation On Embedded Platforms
Cryptography Software
Implementation on
Embedded Platforms
By
Mohamed Said Sulaiman Albahri
September-2019
Abstract
The demand for resource-constrained devices built on 8-bit and 32-bit microcontrollers has
increased due to the requirements of applications such as Radio Frequency Identification
(RFID), the Internet of Things (IoT) and Wireless Sensor Networks (WSN). The difficulty of
applying efficient security to these applications and their microcontroller platforms is one of
the significant reasons for their limited acceptance. Public key cryptography (PKC) schemes,
such as RSA and Elliptic Curve Cryptography (ECC), are generally considered among the most
powerful cryptosystems and can provide a high level of security. However, RSA involves
computationally intensive arithmetic with key sizes of 1024-2048 bits. ECC is therefore a
feasible alternative that provides a similar level of security with a smaller key size and fewer
arithmetic computations. Nevertheless, achieving highly efficient ECC implementations on
microcontroller devices remains a concern because of the limitations of these devices.
This thesis presents techniques for achieving highly efficient ECC on microcontroller
devices that could be used in applications such as IoT, WSN and RFID. We implement an
efficient ECC cryptosystem on a single-core microcontroller and on a homogeneous multicore
microcontroller. The Elliptic Curve Digital Signature Algorithm (ECDSA) is implemented on
8-bit and 32-bit microcontrollers and its performance is evaluated for the possible
combinations of finite field arithmetic, point doubling, point addition and scalar point
multiplication algorithms. The developed technique reduced the time required to generate an
ECDSA key to 83 ms on the 32-bit microcontroller and 263 ms on the 8-bit microcontroller. The
parallelization of Comba multiplication in GF(2^163), implemented on a homogeneous multicore
microcontroller, achieved a performance improvement of 85% in comparison with a single-core
microcontroller. The feasibility of the algorithms and the advantages of adopting
parallelization are validated by using these algorithms to implement ECC scalar point
multiplication over GF(2^m) on the XMOS multicore microcontroller. To our knowledge, the
proposed solutions for a multicore microcontroller that could be used in applications such as
IoT, WSN and RFID are the first of their kind.
Acknowledgements
First, I would like to express my gratitude to my supervisor, Dr. Mohammed Benaissa, for his
valuable help and support and his continuous encouragement and direction. There is no doubt
that without his effort and support, this thesis would not have been achievable.
I am also very grateful to Dr. Luke Seed and Professor Sakir Sezer for kindly agreeing to
evaluate and examine me during my viva. Their discussions were very productive and
effective, and added valuable information to my knowledge as well as to my thesis.
To the spirit of my dear father, and to all those mentioned above, I am very thankful.
Mohamed Said Albahri
Sheffield-05/09/2019
Table of Contents
Abstract
Acknowledgements
Table of Figures
3.3 The Arduino Mega2560 and Arduino Due Architecture
3.4 Efficient Method of Improving Relic toolkit on Arduino Devices
3.4 Proposed Design
Chapter 6 Software Design: Fast Parallel ECC Point Multiplication over Prime Fields
6.1 Introduction
6.3 Proposed Design ECC point Multiplication over Prime Fields
Table of Figures
Figure 2.1 Main Families Public Key Cryptography
Figure 2.2 Public key cryptography model
Figure 2.3 Symmetric-key cryptography model
Figure 2.4 ECC Implementation Pyramid
Figure 2.5 The Elliptic Curve y^2 = x^3 − 5x + 4 over R
Figure 2.6 The Elliptic Curve Point Addition y^2 = x^3 − x + 1
Figure 2.7 Elliptic Curve Point for y^2 = x^3 + x
Figure 3.1 Recommended Arduino High Speed Preset
Figure 3.2 Recommended Arduino Low area Preset
Figure 3.3 Time Execution for EDSA
Figure 3.4 Time Execution for EDSA
Figure 3.5 Time Execution for EDSA
Figure 3.6 Time Execution for EDSA
Figure 4.1 XSI -U Series 16 core devices
Figure 4.2 Schematic representation 4 × 4 word multiplication using Comba Algorithm
Figure 4.3 Xtimecomposer Task-1 Flow Diagram
Figure 4.4 Result Analysis of Implemented Comba Algorithm with Fast Reduction
Figure 4.5 Result Analysis of Implemented Comba Algorithm Without Reduction
Figure 5.1 Shifting Operation in Fast Reduction Process with word size=8
Figure 5.2 Sequential ECC Scalar Point Multiplication Intel Vtune Analysis [3]
Figure 5.3 Parallel ECC Scalar Point Multiplication xmos single core
Figure 6.1 Sequential Single Core ECC Scalar Point Multiplication in xmos
Figure 6.2 Parallel Multicore ECC Scalar Point Multiplication in xmos
List of Tables
Table 2.1 NIST Guidelines for Public key size [2]
Table 2.2 Point addition and Point Doubling Operation Counts [1]
Table 2.3 Double and Addition in Binary Algorithm
Table 2.4 NAF with Addition and Subtraction
Table 2.5 Number of operations for point addition and point doubling [1]
Table 4.1 Comparison with State of Art of Comba Implementation
Table 5.1 Time Complexity Table
Table 5.2 Operation Details and Performance
Table 5.3 Comparison With State of Arts
Table 6.1 Comparison With State-Of-Art
List of Algorithms
Algorithm 2.1 Fast Reduction modulo P192 = 2^192 − 2^64 − 1
Algorithm 2.2 Multi-precision addition
Algorithm 2.3 Multi-precision subtraction
Algorithm 2.4 Integer Multiplication - Product Scanning
Algorithm 2.5 Integer Multiplication - Operand Scanning
Algorithm 2.6 Comba Algorithm
Algorithm 2.7 Extended Euclidean Algorithm for Integers
Algorithm 2.8 F_p Inversion using Extended Euclidean
Algorithm 2.9 Pencil and Paper Polynomial Multiplication
Algorithm 2.10 Elliptic Curve Key Pair Generation
Algorithm 2.11 Elliptic Curve Digital Signature Algorithm (Generation)
Algorithm 2.12 Elliptic Curve Diffie-Hellman Key exchange (ECDH)
Algorithm 2.13 Elliptic Curve Digital Signature Algorithm (Verification)
Algorithm 2.14 Elliptic Curve ElGamal Analogue Encryption
Algorithm 2.15 Elliptic Curve ElGamal Analogue Encryption
Algorithm 4.1 Comba's Algorithm over GF(2^m)
Algorithm 4.2 Comba's Fast reduction method with W = 32 for GF(2^163), f(z) = z^163 + z^7 + z^6 + z^3 + 1
Algorithm 4.3 Modified Comba's Algorithm over GF(2^m)
Algorithm 4.4 Fast reduction Modification using Word size = 8
Algorithm 4.5 Fast reduction Modification using Word size = 16
Algorithm 5.1 Point Doubling in LD coordinate system where a ∈ {0, 1}
Algorithm 5.2 Point Addition in LD coordinate system where a ∈ {0, 1}
Algorithm 5.3 Left-to-right binary method for point multiplication
Algorithm 5.4 Modified Point Doubling in LD coordinate system where a ∈ {0, 1}
Algorithm 5.5 Modified Point Addition in LD coordinate system where a ∈ {0, 1}
Algorithm 5.6 Modified Left-to-right binary method for point multiplication
Algorithm 6.1 Comba Multiplication Technique
Algorithm 6.2 Fast Reduction modulo P256 = 2^256 − 2^224 + 2^192 + 2^96 − 1
Algorithm 6.3 Modified Comba Multiplication Technique
Algorithm 6.4 Modified Jacobian doubling (a = −3)
Algorithm 6.5 Modified (X, Y)- only co-Z addition with update XYCZ -
Glossary
AES Advanced Encryption Standard
ECC Elliptic Curve Cryptography
ECDH Elliptic Curve Diffie-Hellman
ECDSA Elliptic Curve Digital Signature Algorithm
FPGA Field Programmable Gate Array
FLT Fermat’s Little Theorem
GF Galois Field
GF2 Galois Field of characteristic two
IEEE Institute of Electrical and Electronics Engineers
IoT Internet of Things
MD Message Digest
MSB Most Significant Bit
NAF Non-Adjacent Form
NIST National Institute of Standards and Technology
PKC Public Key Cryptography
RFID Radio Frequency Identification
RSA Rivest-Shamir-Adleman
SHA-1 Secure Hash Algorithm-1
SECG Standards for Efficient Cryptography Group
WSN Wireless Sensor Network
XOR Exclusive OR (logical Operation)
τNAF τ-adic Non-Adjacent Forms
Chapter 1: Introduction of Thesis
This chapter provides an overview of the complete thesis. It starts by describing the need for
cryptography in securing the new technologies of WSN and IoT. Then, it explains how the
aims of the thesis were developed, followed by the main contributions. Finally, a detailed list
of the published papers is given.
1.1 Overview
The emergence of new technologies such as the Wireless Sensor Network (WSN),
Wireless Body Area Network (WBAN), Internet of Things (IoT) and Radio Frequency
Identification (RFID) is based on embedded microcontroller platforms. These microcontrollers
are considered resource-constrained devices: they have small code memory and low-speed
processors, and they work with limited battery resources. The main function of these
microcontrollers in such applications is to aggregate the information produced by the sensors
connected to them and transmit the data over a communication channel to its destination. For
example, a WBAN could be designed to provide patients and doctors with real-time data about
a patient's vital parameters, such as blood pressure, pulse rate, body temperature and other
parameters. Recent advances in medication related to WBAN go beyond merely transmitting
data and may involve injecting medicine into the patient, as in an online insulin pump system.
Because such highly critical applications can pose hazards to human health, the security of
these applications and their networks is a concern.
Modern cryptography plays a vital role in ensuring the security of these applications.
Different types of symmetric cryptography could be used to guarantee confidentiality,
integrity and authenticity for the provided services [13]. However, employing highly effective
Public Key Cryptography (PKC), such as RSA or ECC, on microcontrollers leads to many
technical challenges that need to be addressed in advance. For example, deploying ECC
schemes slows down processing because of the complexity of the arithmetic operations
associated with these schemes [13].
In this context, there have been many attempts to improve the efficiency of ECC on
constrained devices, which in turn made it possible to implement ECC algorithms on such
devices. As a result, researchers have been encouraged to undertake further research, which
has also demonstrated the ability of ECC to provide the same level of security as RSA with a
smaller key size on these devices. The fundamental approach of ECC makes its implementation
flexible and configurable. For example, ECC can be built over either a binary finite field or a
prime finite field. Moreover, the latest research shows that ECC can be implemented in
hardware, in software, or as a combination of the two. This calls for further investigation to
identify the solutions that provide the highest performance with low power consumption and a
high security level.
Based on the above concerns, we aim to provide solutions that result in optimal ECC
performance while maintaining the efficiency of the applications. We address these concerns
by enhancing performance through an ECC software implementation. Our choice is based on
the flexibility and simplicity of integrating ECC with the application architecture. Hence, in
this thesis, we used two different approaches, detailed below, to provide directions for our
research.
Furthermore, the security level of an ECC design depends heavily on the underlying layer,
the finite field arithmetic operations, and on the field size. Therefore, a flexible tool that
supports different ECC curves over the binary field or the prime field is a necessity. Such a
tool allows the required level of security to be selected. It also enables smooth communication
between the microcontroller and applications when applications have different security
requirements.
Among these types of multicore microcontrollers, XMOS is considered the most suitable,
since it works like an ordinary microcontroller while being able to tackle problems beyond the
capabilities of a traditional microcontroller. It has a multicore processor that allows multiple
tasks to be executed simultaneously. It also provides timing analysis along with hardware
simulation through a powerful IDE known as Xtimecomposer [15]. Design and implementation
on an XMOS multicore microcontroller is very flexible, since parallelization can be invoked in
the main function or within any other function of the program. Invoking parallel tasks from the
main function allows the developer to assign a particular core to each parallel task, whereas
invoking parallelization in other functions leaves the core allocation to the system. Thus, for a
complex implementation in which only a particular function needs to be parallelized, the
parallelization should be implemented within that function instead of the main function, and
the system will distribute the tasks dynamically across the logical cores based on the
availability of resources.
Such features and functionalities make these platforms more suitable for accommodating
complex algorithms. In later stages, this could lead in particular to the integration of
applications (such as IoT or WSN) that read from sensors in parallel with the tasks needed for
secure key generation, encryption and security protocols, which can themselves be executed
in parallel.
different ECC curves, including curves over binary fields GF(2^m), where m=163, m=233,
m=283, m=409 and m=571, as well as curves for ECC implementation over the prime field GF(p),
where p is a 192-, 224-, 256-, 384- or 521-bit prime [16]. Having said that, parallelizing such
complex ECC algorithms in software is not an easy task. However, the features and
functionalities of the XMOS multicore microcontroller and its powerful Xtimecomposer IDE
can help in tackling such challenges. As mentioned before, our selection of this platform is
based not only on implementing ECC, but also on the flexibility and simplicity of integrating
the ECC cryptosystem with the applications' sensor data collection and overall communication
protocols.
Hence, this thesis focuses on different layers of the ECC cryptosystem. We initially adopt
the well-known algorithms assigned to each layer in order to obtain an effective speedup of
the sequential and parallel performance. In addition, some algorithms are modified to improve
the overall ECC performance. The proposed techniques serve the purpose of enhancing ECC
performance on microcontrollers that could be used in different applications, such as RFID,
WSN and IoT. End-to-end encryption and decryption, higher-level protocol communication,
power consumption analysis and system failure analysis are not considered in this thesis.
The currently published works focus mainly on enhancing ECC performance on a
single-core microcontroller. Many of them created their own library for their targeted devices.
However, we noticed that these works support only a few of the algorithms that an ECC
implementation should offer. ECC is a highly algorithm-dependent cryptosystem, with many
algorithms available that can either increase or decrease its performance. When the set of
algorithms available for implementing ECC is limited, the speedup of the ECC processor
cannot reach the desired level. Hence, in this work, we started by importing a highly effective
open-source library that offers flexible and reconfigurable features, and we increased ECC
efficiency by selecting the best combination from the wide range of algorithms provided by
the tool.
Also, a significant number of published works relate to improving ECC efficiency on
large-scale computers and processors. However, none of the published works attempts to
improve ECC on a homogeneous multicore microcontroller. Therefore, to our knowledge, we
are the first to propose boosting ECC on a multicore microcontroller, despite the overall
complexity of implementing ECC on such a constrained device.
The initial research indicated the presence of some open-source tools that help in
boosting the efficiency of ECC on microcontroller-based platforms. Some of them are limited
to supporting a specific platform, while others can support a wide range of microcontrollers.
To overcome this limitation, we ported the open-source Relic toolkit to our microcontrollers
and demonstrated the resulting enhancement in ECC performance. The novelty of our
contribution is summarized below:
fast reduction algorithm to support the 8-bit word size, which was a result of
modifying the 32-bit reduction algorithm.
● Our second novelty relates to the second approach: an overall improvement in
ECC scalar multiplication over the binary field GF(2^m) for m=163, m=233, m=283,
m=409 and m=571, using an XMOS homogeneous multicore microcontroller. In this
work, we report a 63% improvement in ECC point multiplication in comparison
with its sequential implementation on a single core. Furthermore, in this
particular work, the three algorithms listed below have been modified and
optimized:
I. Point Doubling in the LD coordinate system was modified by applying a
parallelization principle to it. Accordingly, we reduced the number of
algorithm steps from 14 to 9.
II. Point Addition in the LD coordinate system was modified by applying a
parallelization principle to it. Hence, we reduced the number of
algorithm steps from 26 to 20.
III. A modified left-to-right binary method for point multiplication is
proposed. The enhancement is achieved by first scanning k from the most
significant bit (MSB) downwards to locate the first non-zero bit. Once
this bit is found, P's coordinates are loaded into Q and the loop
operation starts from that position.
● The third novelty, which also relates to the second proposed approach, is our
contribution towards improving ECC point multiplication over GF(P), where P is a
128-, 192-, 256- or 384-bit prime, using an XMOS homogeneous multicore
microcontroller. In this work, we obtained an 80% improvement in ECC point
multiplication compared with its sequential implementation on a single core.
In this study we also modified the algorithms listed below:
I. We parallelized the Comba algorithm proposed in [17] and [18] and
replaced the original algorithm proposed in [19] with this modified
parallelized algorithm.
II. We modified the (X, Y)-only co-Z conjugate addition with update
(XYCZ-ADDC) algorithm, reducing the original sequential operational
steps of the algorithm from 19 to 13, at a cost of only 5M+3+16A and
using 6 field registers.
III. We modified the (X, Y)-only co-Z addition with update (XYCZ-ADD)
algorithm, reducing the original sequential operation steps of the
algorithm from 13 to only 7.
IV. We modified the Jacobian doubling (a=−3) algorithm, reducing the
original sequential point doubling operation steps from 18 to 15.
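The modified left-to-right binary method listed among the contributions above can be
sketched as follows. This is a minimal illustration only, not the thesis implementation: it
assumes a small textbook curve y^2 = x^3 + 2x + 2 over F_17 in affine coordinates instead of
the LD or co-Z coordinates and NIST field sizes used in this work, and the names point_add and
scalar_mult are hypothetical. The sketch only shows the initial scan for the first non-zero bit
of k, after which Q is loaded with P so that no doublings are spent on leading zero bits.

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (an assumption for illustration only).
P_MOD, A = 17, 2
INF = None  # the point at infinity


def point_add(p1, p2):
    """Affine point addition on the toy curve; also covers doubling and INF."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, p):
    """Left-to-right binary method that skips the leading zero bits of k."""
    if k == 0 or p is INF:
        return INF
    top = k.bit_length() - 1          # position of the first non-zero bit from the MSB
    q = p                             # load P into Q instead of starting from INF
    for i in range(top - 1, -1, -1):  # process the remaining bits, MSB to LSB
        q = point_add(q, q)           # point doubling
        if (k >> i) & 1:
            q = point_add(q, p)       # point addition
    return q
```

For example, scalar_mult(2, (5, 1)) doubles the base point (5, 1) once and performs no
addition.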
Hence, our software design for efficient ECC tackles ECC over the binary field as well
as ECC over the prime field. Furthermore, in this work, we considered three different data
widths: 8, 16 and 32 bits. This is the first effort that attempts to enhance the performance of
ECC by considering curves over the binary as well as the prime fields.
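As a reminder of what the Comba technique named in the contributions computes, the sketch
below shows a sequential product-scanning (Comba) multiplication of multi-word integers with
a configurable word size of 8, 16 or 32 bits. It is an illustrative sketch under stated
assumptions, not the parallel algorithm of this thesis: the helper names to_words, from_words
and comba_mul are hypothetical, operands are little-endian word lists, and no modular
reduction is performed.

```python
def to_words(x, n, w):
    """Split a non-negative integer into n little-endian words of w bits."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(n)]


def from_words(words, w):
    """Recombine little-endian w-bit words into an integer."""
    return sum(v << (w * i) for i, v in enumerate(words))


def comba_mul(a, b, w=8):
    """Product-scanning multiplication: one output word per column of partial products."""
    n, m = len(a), len(b)
    mask = (1 << w) - 1
    out = [0] * (n + m)
    acc = 0  # column accumulator, carried from one column to the next
    for k in range(n + m - 1):
        # accumulate every partial product a[i] * b[j] with i + j == k
        for i in range(max(0, k - m + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]
        out[k] = acc & mask  # low word of the column becomes the result word
        acc >>= w            # the rest is the carry into the next column
    out[n + m - 1] = acc
    return out
```

Each column k depends on the previous column only through the carried accumulator, which is
the property a parallel variant has to work around.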
Finally, in this research, we have been able to enhance the performance of ECC on
microcontrollers that could be used for different applications. Our contributions therefore
show that PKC can be used on constrained devices and that secure communication can readily
be established on the various types of microcontrollers considered in this thesis.
The next chapter provides the reader with essential background and historical
information about cryptography. It also gives a detailed explanation of, and distinction
between, symmetric and asymmetric cryptography. Furthermore, the elliptic curve group law
and point multiplication algorithms are detailed. Finally, Chapter 2 presents a discussion of
the domain parameters and protocols of ECC.
Chapter 5 presents the concept of parallelizing ECC point multiplication over a
binary field. The chapter begins by introducing the overall concept of ECC over a binary
field, followed by the related mathematical background. Then the proposed solution for
increasing the performance of ECC point multiplication is elaborated. The details of
implementing the proposed solution are provided in Section 5.4. Finally, the performance
analysis is presented along with the conclusions drawn from it.
Chapter 6 elaborates the proposed solution for enhancing ECC scalar point
multiplication, this time over the prime field. First, a general description of ECC over a
prime field is given along with its mathematical background. Then the proposed solution is
described along with the technical details of its implementation. Finally, the results are
analyzed and the conclusions drawn from them are presented.
Finally, Chapter 7 provides a summary of the complete research along with
suggestions for future work.
Albahri, M. S., and M. Benaissa. "Parallel Elliptic Curve Cryptography over GF(P) on a
Homogeneous Multi Core Microcontroller." To be submitted to IEEE Embedded Systems
Letters (ESL).
Chapter 2: Background Theory
In the past, steganography, a word of Greek origin meaning “covered writing”,
referred to a technique for communicating secret messages. In contrast to cryptography,
steganography conceals the message itself by covering it with something else, whereas
cryptography conceals the content of the message by enciphering it [20].
The word ‘cryptography’ comes from the Greek root words kryptos and gráphō, which
together mean “hidden writing” [21]. Cryptography has a very long and interesting history,
dating back some 4000 years. The best description of the history of cryptography can be
found in [22], whose author traces it from the first use of cryptography by the Egyptians in
1900 B.C. to the 20th century.
As per [1] and [2], the first attempt to cipher a plain text was developed by Roman
generals in the 1st century B.C. They ciphered the message by shifting each letter a fixed
number of positions down the alphabet. This ciphering procedure came to be known as
Caesar's Cipher.
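The shifting procedure described above can be sketched in a few lines. This is an
illustrative sketch only: it assumes the 26-letter Latin alphabet and leaves all other
characters unchanged, and the function name caesar is hypothetical.

```python
import string


def caesar(text, shift):
    """Shift each Latin letter by `shift` positions, wrapping around the alphabet."""
    out = []
    for ch in text:
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        elif ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # spaces, digits and punctuation are left as they are
    return "".join(out)
```

Deciphering uses the same function with the negated shift, for example caesar(ciphertext, -3)
for a message enciphered with a shift of 3.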
It is believed that the first transposition cipher was used by the Spartans in the 5th
century B.C. [21]. They exchanged secret messages by winding a slender strip of parchment
around a rod called a scytale. To decipher the message, the strip had to be rewound around a
scytale of identical thickness.
In the 9th century, the first code-breaking textbook, Istikhraj al-Kotob Al-Mu'amah, was
published by the Islamic mathematician Abū Yūsuf Yaʻqūb ibn Isḥāq al-Kindī. In his
book, al-Kindī introduced the alphabetic cipher and frequency analysis techniques [21].
The Middle Ages witnessed more progress in the cryptographic field. During this
time, most Western European governments used cryptography to communicate with their
ambassadors. The most significant developments took place in Italy: in 1452, Venice
established a new organization with three secretaries responsible for ciphering and
deciphering the government's messages [23].
In 1553, a new concept, the Vigenère cipher, was published by the Italian Renaissance
scholar Leon Battista Alberti. This new concept was considered a strong polyalphabetic
substitution cipher at that time [21].
Furthermore, the historical record shows that cryptography played a vital role in
the outcome of both world wars. For example, the invention of radio transmission in 1895
markedly changed the use of cryptography in telegraphic communication, and the French
military managed to intercept German communication during the First World War. This was
possible because French cryptanalysts had broken the double columnar transposition cipher
created by the German military [23]. The Enigma machine is an encryption device created and
used in the mid-twentieth century to protect commercial, diplomatic and military
communication. It was used extensively by Nazi Germany during World War II, in all branches
of the German military. However, Enigma encryption proved vulnerable to cryptanalytic
attacks by Germany's adversaries, at first Polish and French and, later, a massive effort by
the United Kingdom at Bletchley Park. Although Germany introduced a series of improvements
to Enigma that hampered decryption efforts to varying degrees, they did not ultimately
prevent Britain and its allies from exploiting Enigma-encoded messages as a major source of
intelligence during the war. Numerous commentators state that this flow of intelligence
shortened the war considerably and may even have altered its outcome.
However, the work published by Diffie and Hellman in 1976 created new directions
in cryptography. In fact, this work is considered the most impressive development in the
development of public-key cryptography, which was developed for tackling the drawbacks of
this scheme. The concept of this scheme is based on using a single key during the encryption
and decryption processes. It consists of five main components as listed below [20, 26]:
➢ Plaintext
➢ Encryption Algorithm
➢ Secret Key
➢ Ciphertext
➢ Decryption Algorithm
Figure 2.1 illustrates symmetric-key cryptography. Here, the plaintext is the
original message, which is fed into the encryption algorithm. The encryption algorithm
performs various substitutions and transformations on the plaintext.
[Figure: families of public-key algorithms: IFP (e.g. RSA), DLP (e.g. Diffie-Hellman) and ECDLP (e.g. ECDH and ECDSA)]
The secret key K used in this scheme is completely independent of the plaintext and
the encryption algorithm. The key is generally a string over the binary alphabet {0, 1}. From
the plaintext X, the sender forms the ciphertext Y as a function of K. Based on this, the
encryption transformation can be written in the form given below:
$Y = E_K(X)$
The decryption process allows the receiver to retrieve the original message, X, using the
decryption function below:
$X = D_K(Y)$
During the decryption process, the decryption algorithm must apply the same secret
key that was used in the encryption process. Therefore, the algorithm must not allow
anyone to decipher the message without the pre-shared key, or to work out the secret
key from the ciphertext. These are the minimum requirements for an encryption
algorithm. Accordingly, an implementation must ensure the secrecy of the secret key,
because all the information becomes readable to an opponent who knows both the secret
key and the encryption algorithm.
In general, the security of this scheme is maintained by ensuring that the security
mechanisms are being used by the sender and receiver. It should be also noted that
encryption algorithms are not kept secret for the following reasons:
Symmetric-key cryptography has two categories: stream ciphers and block ciphers. A
stream cipher encrypts bits individually by adding a bit from a secret keystream to each
plaintext bit. Stream ciphers are further divided into two types: synchronous and
asynchronous. In a synchronous stream cipher, the keystream depends only on the secret
key, whereas in an asynchronous stream cipher the keystream also depends on the
ciphertext.
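As a minimal sketch of the bitwise addition described above, the following Python fragment XORs the plaintext with a keystream. The keystream generator (SHA-256 in counter mode) and the names `keystream` and `stream_xor` are assumptions made for illustration only; they are not A5/1 or any cipher used in this thesis.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy synchronous keystream: depends only on the key (illustrative, not A5/1)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def stream_xor(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR (addition modulo 2) of data and keystream."""
    ks = keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


ciphertext = stream_xor(b"shared secret", b"attack at dawn")
recovered = stream_xor(b"shared secret", ciphertext)
```

Because addition modulo 2 is its own inverse, the same function serves for both encryption and decryption.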
The synchronous stream cipher is the type most widely used in practice; one example is
the A5/1 cipher used in the GSM mobile phone standard. A block cipher, in contrast,
encrypts an entire block of plaintext at a time using the secret key.
The Advanced Encryption Standard (AES), the Data Encryption Standard (DES) and triple
DES (3DES) are standards that fall under the block cipher category. AES has a block length
of 128 bits (16 bytes) and DES has a block length of 64 bits. To summarize, symmetric-key
cryptography still plays a major role in providing security services such as
confidentiality, integrity and authenticity, owing to its efficiency and short key length.
Although it has powerful advantages, it also has some drawbacks, which are listed below:
2. Setting up shared keys manually results in losing control over the secret keys,
especially in a large network that contains a large number of entities.
3. More storage is needed for storing a large number of key pairs.
These drawbacks can be tackled by using public-key cryptography in line with symmetric
key efficiencies and functionalities.
1. Digital Signature
2. Encryption
In addition to the above mechanisms, public-key schemes allow implementing all
required functionalities for modern security protocols, such as SSL/TLS [27]. However,
implementing public-key schemes is not an easy task due to their high computational
requirements. For that reason, a practical cryptographic system mixes symmetric and
asymmetric key cryptography, which is referred to as a hybrid cryptosystem. In a hybrid
cryptosystem, the public-key algorithm is used for key establishment, and the
symmetric-key algorithm is used to perform the data encryption.
[Figure: public-key cryptography model; recoverable labels: Key Generation, Public Key Distribution, Public Key, Private Key, Plain Text, Encryption, Cipher Text, Insecure Channel, Decryption, Plain Text (Output)]
There are three main families of public-key algorithms, listed in Figure 2.3.
Figure 2.3 Symmetric-key cryptography model [recoverable labels: Plain Text (Input); Encryption Algorithm (e.g. AES, DES), Y = E(K,X); Decryption Algorithm (reverse of the encryption algorithm), X = D(K,Y); Plain Text (Output)]
The first category of public-key algorithms is based on the Integer Factorization Problem (IFP).
These algorithms rely on the difficulty of determining the prime factors of a given positive
integer. RSA, named after its developers Rivest, Shamir and Adleman, is
one of the best-known members of the IFP family [27].
The last family is based on the Elliptic Curve Discrete Logarithm Problem
(ECDLP). Given points P and Q on an elliptic curve that is defined over a finite field,
the problem is to find the positive integer K such that Q=K.P. The Elliptic Curve Diffie-Hellman
(ECDH) key exchange protocol and the Elliptic Curve Digital Signature Algorithm
(ECDSA) are examples of ECDLP-based algorithms.
Groups
DEF. 2.1 A group (G) is defined as a set of elements along with binary operation "*",
satisfying four properties. An Abelian group is the most common type of algebraic groups
that satisfies the four properties defined below [1, 20, 29]:
Although a group involves a single operation, the properties imposed
on that operation allow it to support a pair of operations. For example, a group defined
with addition also supports subtraction, because subtraction is addition of the inverse:
if the identity element is e = 0, then the inverse of a is $a^{-1} = -a$.
DEF. 2.2 If the set has a finite number of elements, it is said to be a finite group; otherwise, it
is known as an infinite group [20]. The number of elements in a finite group is referred to as
the order of the group, |G|. If the group is not finite, its order is infinite.
DEF. 2.4 A multiplicative finite group G is called cyclic if all elements of the group
can be generated by repeated application of the group operation to a single element; that is,
if there is an element $a \in G$ such that for any $b \in G$ there is some integer $j$ with
$b = a^j$. Such an element is called a generator of the cyclic group, and we write
$G = \langle a \rangle$ [28].
DEF. 2.5 If a subgroup of a group can be produced by taking the powers of an element,
then it is known as a cyclic subgroup. The term "power" here stands for repeatedly
applying the group operation to the element, as presented below [20]:
$a^n = a * a * \cdots * a$ ($n$ times)
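The definitions of a cyclic group and a cyclic subgroup can be illustrated with a short Python sketch over the multiplicative group of integers modulo a small prime. The function name `cyclic_subgroup` and the choice of modulus 7 are assumptions made purely for illustration.

```python
def cyclic_subgroup(a: int, p: int) -> list:
    """Return [a^1, a^2, ...] modulo the prime p, up to and including the identity 1.

    Assumes 0 < a < p with p prime, so that a is invertible and the loop terminates.
    """
    elements = []
    x = 1
    while True:
        x = (x * a) % p          # repeated application of the group operation
        elements.append(x)
        if x == 1:               # back at the identity: the subgroup is complete
            return elements


full = cyclic_subgroup(3, 7)     # 3 generates every non-zero residue modulo 7
sub = cyclic_subgroup(2, 7)      # 2 generates only a proper cyclic subgroup
```

Here 3 is a generator of the whole group, whereas the powers of 2 form a cyclic subgroup of order 3.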
Ring
DEF. 2.6 A ring is an algebraic structure with two operations, denoted as
$R = \langle \{\cdots\}, *, \square \rangle$. The first operation must fulfil all of the Abelian
group properties. The second operation must satisfy only the first two properties, and it
must distribute over the first operation.
DEF. 2.7 The second operation is distributive over the first if, for all elements a, b and c of R,
$a \square (b * c) = (a \square b) * (a \square c)$ and $(a * b) \square c = (a \square c) * (b \square c)$.
A commutative ring is a ring in which the
commutative property is also satisfied for the second operation.
Field
DEF. 2.8 A field, denoted by $F = \langle \{\cdots\}, *, \square \rangle$, is a commutative ring in which the second
operation satisfies all five properties defined for the first operation, except that the identity of
the first operation (sometimes called the zero element) has no inverse.
2.4.2 RSA
RSA, which realizes the public-key concept first introduced by Diffie and Hellman, was
developed by Ron Rivest, Adi Shamir and Len Adleman in 1978. Their main objective was to
develop a cryptographic algorithm that could meet the requirements of a public-key
cryptosystem. The RSA algorithm can be categorized as a block cipher. The plaintext
and ciphertext in RSA are integers between 0 and n − 1, where n typically has 1024 bits,
or 309 decimal digits, that is, $n < 2^{1024}$.
RSA is used in various applications such as key transport and encrypting small pieces of
data. Digital signatures are another application of RSA, used for example in digital
certificates on the internet. RSA cannot replace symmetric ciphers, since it is much
slower than AES owing to its higher number of computations. The main
purpose of RSA encryption is to provide highly secure key exchange for a
symmetric cipher; RSA is therefore generally used together with a symmetric cipher such as
AES, which performs the bulk encryption of the data [20]. RSA encryption is carried out
in the ring $\mathbb{Z}_n$ using modular computation. The functions below describe RSA
encryption and decryption.
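RSA Encryption:
Given the public key $K_{pub} = (n, e)$ and the plaintext $X$, the ciphertext is $Y = e_{K_{pub}}(X) \equiv X^e \bmod n$, where $X, Y \in \mathbb{Z}_n$.
RSA Decryption:
Given the private key $K_{private} = d$ and the ciphertext $Y$, the plaintext is $X = d_{K_{private}}(Y) \equiv Y^d \bmod n$, where $X, Y \in \mathbb{Z}_n$.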
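The two functions above can be tried out with a toy example in Python. The primes 61 and 53, the exponent 17 and the function names are assumptions chosen only to keep the numbers small; real RSA keys are far larger and use padding, which this sketch omits.

```python
# Toy parameters (assumed for illustration; far too small for real use).
p, q = 61, 53
n = p * q                      # modulus n = 3233
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent: e * d = 1 (mod phi), Python 3.8+


def rsa_encrypt(x: int, e: int, n: int) -> int:
    """Y = X^e mod n."""
    return pow(x, e, n)


def rsa_decrypt(y: int, d: int, n: int) -> int:
    """X = Y^d mod n."""
    return pow(y, d, n)


x = 65
y = rsa_encrypt(x, e, n)
x_again = rsa_decrypt(y, d, n)
```

Both directions are a single modular exponentiation, which is why the cost of RSA is dominated by large-number modular multiplication.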
Generally, implementing RSA is more demanding than implementing 3DES or AES, since
it involves the exponentiation of large numbers. It also requires algorithms for modular
multiplication and for square-and-multiply exponentiation [20]. As RSA has high computational
complexity, Elliptic Curve Cryptography (ECC) is another option for implementing a public-key
cryptosystem, owing to its attractive features and reported efficiency [1]. Some of the merits of
ECC are listed below:
● ECC can provide the same level of security as RSA with smaller key sizes.
● ECC requires less memory and offers faster arithmetic operations.
● ECC is well suited to constrained devices such as mobile phones, as it requires
less memory and consumes less power.
Therefore, based on these advantages, ECC is the main scope of this thesis.
Besides the RSA digital signature and the ElGamal digital signature, the Elliptic Curve
Digital Signature Algorithm (ECDSA) emerged as an alternative solution for providing
digital signature services. ECDSA consists of three processes: key generation, signing
and verification. These processes are explained in detail below for
ECDSA between the sender, Alice, and the receiver, Bob:
➢ Signing Process: The signing process consists of selecting a random
number, creating a third point, calculating the signature and sending the message
with the signature.
1. Alice selects a random number $r$ between 1 and $q - 1$.
2. Alice calculates the third point $P = r \times e_1(\ldots, \ldots)$, with $P = (U, V)$.
3. Alice calculates the first signature $s_1 = U \bmod q$.
Table 1 shows that ECC requires a much smaller key size than RSA to provide the same
level of security, since its security per key bit is much higher. For example, the level of
security offered by a 3072-bit legacy (RSA) key is the same as that offered by a 256-bit
ECC key; ECC thus offers better performance with a key that is 1/12th the size of the RSA
key. This advantage grows as the security level increases. Hence, ECC can be used
effectively in constrained platforms such as wireless devices, handheld computers and
smart cards [30].
Elliptic curve cryptography is rich in theory and deep
arithmetic. Therefore, an ECC implementation requires attention to several layers of
arithmetic, operations and algorithms. Figure 2.4 describes the system levels of an elliptic
curve cryptography implementation. The top layer of the pyramid represents the
ECC protocol level, such as the Elliptic Curve Digital Signature Algorithm (ECDSA), Elliptic
Curve Diffie-Hellman (ECDH) and the Elliptic Curve Integrated Encryption Scheme (ECIES).
Scalar point multiplication is the second level, directly below the ECC protocols. The
fundamental operation of elliptic curve cryptography is scalar point multiplication, which
is defined in equation (2.1) as follows:
Q = k.P (2.1)
[Figure 2.4: system levels of an ECC implementation; recoverable labels: ECC Protocols, Scalar Point Multiplication]
The scalar point multiplication is built on point addition and point doubling.
The point addition operation takes two given points $P, Q \in E$ and produces a point known as the
sum of P and Q, $P + Q \in E$, whereas field multiplication is the process of
multiplying two elements in the multiplicative group GF*(q) of integers modulo a prime p.
Different point multiplication algorithms are presented in [1] to compute (2.1). The
performance of ECC depends on the point multiplication and its associated coordinate
systems. The scalar point multiplication for two points $Q, P \in E/GF(q)$, which belong to an
elliptic curve $E$ over $GF(q)$, is defined over an Abelian group as shown below:
$Q = k.P = \underbrace{P + P + \cdots + P}_{k-1 \text{ additions}}$
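The repeated addition above is usually evaluated with the double-and-add method rather than with k − 1 separate additions. The following Python sketch is an assumption-laden illustration: it uses the textbook affine formulas for a short Weierstrass curve over a prime field, a toy curve $y^2 = x^3 + 2x + 2$ over GF(17), and hypothetical names `point_add` and `scalar_mult`; it is not the implementation evaluated in this thesis.

```python
P_MOD, A = 17, 2                 # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumed)
INFINITY = None                  # None stands for the point at infinity


def point_add(p1, p2):
    """Affine point addition and doubling on the toy curve."""
    if p1 is INFINITY:
        return p2
    if p2 is INFINITY:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INFINITY                          # P + (-P) = infinity
    if p1 == p2:                                 # doubling: tangent slope
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                        # addition: chord slope
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, point):
    """Compute k.P by scanning the bits of k from least to most significant."""
    result = INFINITY
    addend = point
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)       # point doubling
        k >>= 1
    return result


base = (5, 1)
doubled = scalar_mult(2, base)
```

On this toy curve the base point (5, 1) has order 19, so multiplying it by 19 returns the point at infinity.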
An elliptic curve E over a field K can be defined over the field R of real
numbers, the field Q of rational numbers, the field C of complex numbers, or a finite field $F_q$ of
$q = p^r$ elements. Figure 2.5 shows an elliptic curve over the rational field Q. An elliptic
curve over K is defined as the set of points $(x, y)$ with $x, y \in K$ that satisfy the
following equation:
$E: y^2 = x^3 + ax + b$ (2.3)
where $a, b \in K$, the cubic polynomial $x^3 + ax + b$ has no multiple roots, and K is a field of
characteristic ≠ 2, 3.
Therefore, the elliptic curve E over a finite field GF is defined using the following long
Weierstrass equation in projective form [1]:
$E: Y^2 Z + a_1 XYZ + a_3 YZ^2 = X^3 + a_2 X^2 Z + a_4 XZ^2 + a_6 Z^3$ (2.4)
where $a_1, a_2, a_3, a_4, a_6 \in GF$.
The Weierstrass equation (2.4) represents a smooth elliptic curve in projective coordinates. Its
counterpart in affine (Euclidean) coordinates has the form:
$E: y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$ (2.5)
where $a_1, a_2, a_3, a_4, a_6 \in GF$ and $\Delta \neq 0$, with $\Delta$ the discriminant of $E$, defined as below:
$\Delta = -d_2^2 d_8 - 8 d_4^3 - 27 d_6^2 + 9 d_2 d_4 d_6$
$d_2 = a_1^2 + 4a_2$
$d_4 = 2a_4 + a_1 a_3$
$d_6 = a_3^2 + 4a_6$
$d_8 = a_1^2 a_6 + 4a_2 a_6 - a_1 a_3 a_4 + a_2 a_3^2 - a_4^2$
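The smoothness condition can be checked numerically. The sketch below evaluates the discriminant exactly as written above for integer coefficients; the function name `discriminant` and the sample curves are assumptions for illustration.

```python
def discriminant(a1, a2, a3, a4, a6):
    """Discriminant of y^2 + a1*x*y + a3*y = x^3 + a2*x^2 + a4*x + a6."""
    d2 = a1 * a1 + 4 * a2
    d4 = 2 * a4 + a1 * a3
    d6 = a3 * a3 + 4 * a6
    d8 = a1 * a1 * a6 + 4 * a2 * a6 - a1 * a3 * a4 + a2 * a3 * a3 - a4 * a4
    return -d2 * d2 * d8 - 8 * d4 ** 3 - 27 * d6 * d6 + 9 * d2 * d4 * d6


smooth = discriminant(0, 0, 0, -1, 1)      # y^2 = x^3 - x + 1
singular = discriminant(0, 0, 0, -3, 2)    # y^2 = x^3 - 3x + 2 has a double root at x = 1
```

For a curve of the short form $y^2 = x^3 + ax + b$ the expression collapses to $-16(4a^3 + 27b^2)$, so the first call gives a non-zero value and the second gives zero.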
However, the simplified Weierstrass equations provided by [1] show that it is not
necessary to use the full equations (2.4) and (2.5): the original Weierstrass
equation can be rewritten in a simpler form, depending on the field characteristic.
added to (−P), the additive inverse of P, which should give us back O, the point
at infinity. Therefore, −P is the additive inverse of P.
$y^2 = x^3 - x + 1$
This is the form of the elliptic curve chosen here. If P and Q are two
points on this curve, we can define the addition operation for P and Q using
Diophantus's technique: we draw a straight line through them, and it
intersects the elliptic curve at a third point. We then take the reflection of the
third point in the x-axis as the sum, and we refer to R as the sum of P and Q.
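The chord construction can be reproduced with exact rational arithmetic. The sketch below is a minimal illustration for two points with distinct x-coordinates on $y^2 = x^3 - x + 1$; the function names and the sample points are assumptions, and the tangent (doubling) case is deliberately left out.

```python
from fractions import Fraction


def on_curve(point):
    """Check that the point satisfies y^2 = x^3 - x + 1."""
    x, y = point
    return y * y == x ** 3 - x + 1


def chord_add(p, q):
    """Add two points with different x-coordinates by the chord construction."""
    (x1, y1), (x2, y2) = p, q
    lam = Fraction(y2 - y1, x2 - x1)     # slope of the line through P and Q
    x3 = lam * lam - x1 - x2             # x-coordinate of the third intersection
    y3 = lam * (x1 - x3) - y1            # reflected in the x-axis
    return (x3, y3)


r1 = chord_add((0, 1), (1, 1))
r2 = chord_add((1, 1), (3, 5))
```

For example, the line through (1, 1) and (3, 5) is y = 2x − 1; it meets the curve again at (0, −1), whose reflection (0, 1) is the sum.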
Having such choices and flexibility makes ECC feasible for both constrained
devices and high-performance servers. In what follows, we first provide the fundamental
arithmetic for curves defined over the prime field $GF(p)$ (Section 2.5.3) and for elliptic
curves over the binary field $GF(2^m)$ (Section 2.5.4). In this thesis, we present only the
algorithms that are applied in our research. Further details of other algorithms can be found in [1,
25].
The elliptic curve over prime field is represented by the following equation:
The finite field elements for the above equation are the integers between 0 and p − 1.
All elliptic curve operations on these integers use modular arithmetic, which includes
addition, subtraction, multiplication, division and multiplicative inversion. The prime
number is selected according to the SEC specification, where p ranges between 112 and
521 bits. Figure 2.5 shows points generated by the SageMath tool for the curve
$y^2 = x^3 + x$ over a prime field.
Mathematically, an elliptic curve over the prime field $GF(p)$, with p a prime and
$a, b \in F_p$, can be defined by the equation below:
$E: y^2 \equiv x^3 + ax + b \pmod{p}$
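For a small prime, the set of affine points defined by this congruence can simply be enumerated. The sketch below uses brute force on a toy curve; the parameters a = 2, b = 2, p = 17 and the name `affine_points` are assumptions for illustration, and the approach is only feasible for very small fields.

```python
def affine_points(a, b, p):
    """All (x, y) with y^2 = x^3 + a*x + b (mod p), found by exhaustive search."""
    return [(x, y)
            for x in range(p)
            for y in range(p)
            if (y * y - (x ** 3 + a * x + b)) % p == 0]


points = affine_points(2, 2, 17)
group_order = len(points) + 1        # add one for the point at infinity
```

Every coordinate is an integer between 0 and p − 1, as stated above; the group order counts the affine points plus the point at infinity.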
▪ Using the standard projective coordinates, the projective point (X: Y: Z), Z ≠ 0,
corresponds to the affine point (X/Z, Y/Z). The point at infinity ∞
corresponds to (0: 1: 0), and the negative of (X: Y: Z) is (X: −Y: Z).
The elliptic curve equation that corresponds to the standard projective coordinates is:
$Y^2 Z = X^3 + aXZ^2 + bZ^3$
▪ Using the Jacobian projective coordinates, the projective point (X: Y: Z), Z ≠ 0,
corresponds to the affine point $(X/Z^2, Y/Z^3)$. The point at infinity ∞
corresponds to (1: 1: 0), and the negative of (X: Y: Z) is (X: −Y: Z). The elliptic curve
equation corresponding to the Jacobian projective coordinates is:
$Y^2 = X^3 + aXZ^4 + bZ^6$
The calculation of point addition and point doubling is highly dependent on the type of
projective coordinates selected. These representations are advantageous when inversion is
much more expensive than multiplication in the finite field, because the intermediate point
additions and point doublings need no inversion; only a single inversion is needed at the
end, to convert the result back to affine coordinates. This inversion can be computed using
Fermat's Little Theorem: $x^{-1} \equiv x^{p-2} \bmod p$. Table 2.2 shows the
complexity of the group operations for different coordinate representations on
$y^2 = x^3 - 3x + b$, where A = affine, J = Jacobian, P = standard projective, I =
field inversion, M = field multiplication and S = field squaring.
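The single final inversion can be sketched as a conversion from Jacobian to affine coordinates, with the inverse obtained from Fermat's Little Theorem. The function names and the toy values (a point on $y^2 = x^3 + 2x + 2$ over GF(17)) are assumptions for illustration only.

```python
def fermat_inverse(x, p):
    """x^(-1) = x^(p-2) mod p for a prime p and x not divisible by p."""
    return pow(x, p - 2, p)


def jacobian_to_affine(X, Y, Z, p):
    """Map the Jacobian point (X: Y: Z), Z != 0, to the affine point (X/Z^2, Y/Z^3)."""
    z_inv = fermat_inverse(Z, p)             # the only inversion in the conversion
    z_inv2 = z_inv * z_inv % p
    return (X * z_inv2 % p, Y * z_inv2 * z_inv % p)


# (11: 10: 3) is the affine point (5, 1) scaled by Z = 3: X = 5*3^2, Y = 1*3^3 (mod 17).
affine = jacobian_to_affine(11, 10, 3, 17)
```

One inversion plus a handful of multiplications replaces the inversion that every affine addition or doubling would otherwise require.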
overall performance of ECC algorithms on GF(P). Since field arithmetic over GF(P) consists
of several algorithms, we presented some of these algorithms along with their functionality.
Modular Reduction:
Modular reduction is used to reduce a value modulo P, where P is large. A
finite field of order P, GF(P), with P prime, is identified with the set $Z_P$ of integers
$\{0, 1, \cdots, P - 1\}$. Modular arithmetic is based on dividing C by M,
where $C, M \in Z$ and $M < C$. Modular reduction is the process of computing
$R = C \bmod M$, i.e., the remainder R of the division, which can be written as:
$R = C - \lfloor C / M \rfloor M$
$p_{521} = 2^{521} - 1$
where the $c_i$ are the digits of the base-$2^{64}$ representation of $c$, each $c_i \in [0, 2^{64} - 1]$.
The higher powers of 2 in this representation are reduced using the congruences
$2^{192} \equiv 2^{64} + 1$, $2^{256} \equiv 2^{128} + 2^{64}$ and $2^{320} \equiv 2^{128} + 2^{64} + 1 \pmod{p}$,
which follow from $p_{192} = 2^{192} - 2^{64} - 1$. Then $c$ modulo $p$ can be obtained by adding the four 192-bit integers
$c_5 2^{128} + c_5 2^{64} + c_5$, $c_4 2^{128} + c_4 2^{64}$, $c_3 2^{64} + c_3$ and $c_2 2^{128} + c_1 2^{64} + c_0$, that is,
$s_1 = (c_2, c_1, c_0)$, $s_2 = (0, c_3, c_3)$,
$s_3 = (c_4, c_4, 0)$, $s_4 = (c_5, c_5, c_5)$,
and repeatedly subtracting p until the result is less than p. This procedure is the fast
modular reduction for the modulus $p_{192} = 2^{192} - 2^{64} - 1$ given in Algorithm 2.1 [1].
However, in Chapter 6, we consider using a
Generic Generalized Mersenne Reduction procedure as proposed by [19].
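The four-term addition described above can be sketched in Python, where arbitrary-precision integers stand in for the multiword values. The names `reduce_p192` and `join` are assumptions for illustration; an embedded implementation would operate on word arrays instead.

```python
P192 = 2 ** 192 - 2 ** 64 - 1
MASK64 = 2 ** 64 - 1


def join(hi, mid, lo):
    """Build the 192-bit integer (hi, mid, lo) from three 64-bit words."""
    return (hi << 128) | (mid << 64) | lo


def reduce_p192(c):
    """Reduce 0 <= c < P192^2 modulo P192 using only additions and subtractions."""
    c0, c1, c2, c3, c4, c5 = [(c >> (64 * i)) & MASK64 for i in range(6)]
    s1 = join(c2, c1, c0)
    s2 = join(0, c3, c3)
    s3 = join(c4, c4, 0)
    s4 = join(c5, c5, c5)
    r = s1 + s2 + s3 + s4
    while r >= P192:             # at most a few subtractions are needed
        r -= P192
    return r


product = (P192 - 1) * (P192 - 2)
reduced = reduce_p192(product)
```

No division is performed: the special form of the prime turns the reduction into word selection, three additions and a short correction loop.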
Addition and Subtraction: The mechanism for performing field addition and subtraction
is quite straightforward. Algorithm 2.2 and Algorithm 2.3, proposed in [1], perform the addition
and subtraction, respectively, of multiword integers, where an assignment of the form "(ε,z)←w" for
3) Return (𝜀, 𝑐)
Modular Addition:
Else if 𝑐 ≥ 𝑃 𝑡ℎ𝑒𝑛 𝑐 ← 𝑐 − 𝑝
Modular Subtraction:
Modular Multiplication:
The efficiency of modular multiplication plays a very important role in the overall
performance of ECC. To perform a field multiplication of $a, b \in F_P$, a and b are first
multiplied as integers, and the result is then reduced modulo p. Operand scanning, product
scanning, the Comba algorithm, Montgomery multiplication and Karatsuba multiplication are
the most popular multiplication algorithms. The operand scanning and product scanning
methods work with the 2W-bit quantity (UV) obtained by concatenating the
W-bit words U and V. Algorithm 2.4 illustrates integer multiplication using the operand
scanning method. In this algorithm, the main operation is executed in step 2.2 and is known as
the inner product operation. It computes $C[i+j] + A[i] \cdot B[j] + U$, where all operands are
W-bit values. The inner product is therefore
bounded by $2(2^W - 1) + (2^W - 1)^2 = 2^{2W} - 1$, so it can be represented by (UV).
2.3 𝐶[𝑖 + 𝑡] ← 𝑈
3) Return (𝑐)
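The operand-scanning method and its inner product operation can be sketched as follows. The word size W = 8 bits, the little-endian word lists and the helper names `to_words` and `from_words` are assumptions for illustration; the sketch follows the general textbook structure rather than reproducing Algorithm 2.4 line by line.

```python
W = 8                         # assumed word size in bits
MASK = (1 << W) - 1


def to_words(n, t):
    """Split an integer into t little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(t)]


def from_words(words):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def operand_scanning_mul(a, b):
    """Multiply two t-word integers; the result has 2t words."""
    t = len(a)
    c = [0] * (2 * t)
    for i in range(t):
        u = 0
        for j in range(t):
            uv = c[i + j] + a[i] * b[j] + u      # inner product, fits in 2W bits
            assert uv <= (1 << (2 * W)) - 1
            u, c[i + j] = uv >> W, uv & MASK     # split (UV) into U and V
        c[i + t] = u
    return c


x, y = 0xDEADBEEF, 0xCAFEBABE
result = from_words(operand_scanning_mul(to_words(x, 4), to_words(y, 4)))
```

The assertion inside the loop checks the bound stated above: the inner product never exceeds $2^{2W} - 1$, so it always fits in the two-word quantity (UV).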
1: (𝑡, 𝑢, 𝑣) ← 0
2: for i from 0 by 1 to s do
3: for j from 0 by 1 to i do
5: end for
6: 𝑃𝑖 ← 𝑣
7: 𝑣 ← 𝑢, 𝑢 ← 𝑡, 𝑡 ← 0
8: end for
13: 𝑃𝑖 ← 𝑣
14: 𝑣 ← 𝑢, 𝑢 ← 𝑡, 𝑡 ← 0
15: 𝑃2𝑠−1 ← 𝑣
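A runnable sketch of product scanning with a three-word accumulator (t, u, v) is given below. It is an illustration under stated assumptions, not a reconstruction of the listing above: the word size W = 8, the single merged column loop and the helper names are all chosen here for brevity.

```python
W = 8                         # assumed word size in bits
MASK = (1 << W) - 1


def to_words(n, s):
    """Split an integer into s little-endian W-bit words."""
    return [(n >> (W * i)) & MASK for i in range(s)]


def from_words(words):
    """Recombine little-endian W-bit words into an integer."""
    return sum(w << (W * i) for i, w in enumerate(words))


def comba_mul(a, b):
    """Product scanning: finish each result word (column) before moving on."""
    s = len(a)
    p = [0] * (2 * s)
    t = u = v = 0
    for i in range(2 * s - 1):                           # one pass per column
        for j in range(max(0, i - s + 1), min(i, s - 1) + 1):
            acc = ((t << (2 * W)) | (u << W) | v) + a[j] * b[i - j]
            v = acc & MASK
            u = (acc >> W) & MASK
            t = acc >> (2 * W)
        p[i] = v                                         # column finished: store once
        v, u, t = u, t, 0                                # shift the accumulator
    p[2 * s - 1] = v
    return p


x, y = 0xDEADBEEF, 0xCAFEBABE
result = from_words(comba_mul(to_words(x, 4), to_words(y, 4)))
```

Each result word is written to memory exactly once, in contrast to operand scanning, which reads and rewrites partial results in every inner iteration.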
Modular Squaring:
In addition to multiplication, the Comba algorithm can also be used for
squaring. Comba squaring is structured as a nested loop, and the nested
loop in Comba squaring is iterated only $s^2/2$ times, compared with $(s^2 + s)/2$ in single-
precision multiplication. The operation in the inner loop can be presented in the
following form:
Modular Inversion:
The inverse in a prime field can be found using direct
exponentiation. If $B$ is an element of the prime field $GF(P)$ and $C$ is the inverse of
$B$, then the inverse can be computed as $C = B^{-1} = B^{P-2}$.
However, direct exponentiation is considered costly, as it involves modular
multiplication, modular squaring and modular reduction. Therefore, the binary extended
Euclidean algorithm can be considered the most effective way of implementing
inversion [1], because the only divisions performed are by 2 and are carried out
with a right shift. The gcd of positive integers $a$ and $b$ is normally computed with
the classical Euclidean algorithm. The algorithm divides $b$ by $a$, where $b \geq a$, to
obtain a quotient $q$ and a remainder $r$ satisfying $b = qa + r$ and $0 \leq r < a$. Since
$\gcd(a, b) = \gcd(r, a)$, the problem is reduced to computing $\gcd(r, a)$, whose arguments
$(r, a)$ are smaller than $(a, b)$, and the process is repeated until one of the arguments
is 0, giving $\gcd(0, d) = d$. The algorithm must terminate, because the non-negative
remainders are strictly decreasing. The method is also very efficient, since the number of
division steps can be shown to be at most $2k$, where $k$ is the bit length of $a$. The
mechanism above can be extended, as in Algorithm 2.7, to find integers $x$ and $y$ such that
$ax + by = d$, where $d = \gcd(a, b)$.
2) 𝑥1 ← 1, 𝑦1 ← 0, 𝑥2 ← 0, 𝑦2 ← 1
3) While u ≠ 0 do
3.1 𝑞 ← ⌊𝑣⁄𝑢⌋, 𝑟 ← 𝑣 − 𝑞𝑢, 𝑥 ← 𝑥2 − 𝑞𝑥1 , 𝑦 ← 𝑦2 − 𝑞𝑦1
3.2 𝑣 ← 𝑢, 𝑢 ← 𝑟, 𝑥2 ← 𝑥1 , 𝑥1 ← 𝑥, 𝑦2 ← 𝑦1 , 𝑦1 ← 𝑦
4)𝑑 ← 𝑣, 𝑥 ← 𝑥2 , 𝑦 ← 𝑦2
5) Return (𝑑, 𝑥, 𝑦)
2) 𝑥1 ← 1, 𝑦1 ← 0, 𝑥2 ← 0
3) While u ≠ 0 do
3.1 𝑞 ← ⌊𝑣⁄𝑢⌋, 𝑟 ← 𝑣 − 𝑞𝑢, 𝑥 ← 𝑥2 − 𝑞𝑥1
3.2 𝑣 ← 𝑢, 𝑢 ← 𝑟, 𝑥2 ← 𝑥1 , 𝑥1 ← 𝑥
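The update steps listed above translate almost directly into Python. The following sketch assumes the usual initialisation u ← a, v ← b for the step that is not shown, and the function names `extended_gcd` and `mod_inverse` are chosen here for illustration.

```python
def extended_gcd(a, b):
    """Return (d, x, y) with a*x + b*y = d = gcd(a, b), for positive a <= b."""
    u, v = a, b                              # assumed initialisation step
    x1, y1, x2, y2 = 1, 0, 0, 1
    while u != 0:
        q = v // u
        r = v - q * u
        x = x2 - q * x1
        y = y2 - q * y1
        v, u = u, r
        x2, x1 = x1, x
        y2, y1 = y1, y
    return v, x2, y2


def mod_inverse(a, p):
    """Inverse of a modulo a prime p, taken from the coefficient x of a*x + p*y = 1."""
    d, x, _ = extended_gcd(a, p)
    if d != 1:
        raise ValueError("a is not invertible modulo p")
    return x % p


d, x, y = extended_gcd(46, 240)
inverse = mod_inverse(3, 17)
```

For inversion only the x-column is needed, which is why the second listing above drops the y-updates.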
$y^2 + xy = x^3 + ax^2 + b$ (2.1)
where the parameters $a, b \in GF(2^m)$, $b \neq 0$, and $\{(X_i, Y_i)$, for $X_i, Y_i \in GF(2^m)\}$ is the set of
solutions of equation (2.1). The number of points on the curve over $GF(2^m)$ is denoted by
$\#E(2^m)$, and the additive inverse of a point $R(X_r, Y_r)$ of $E(2^m)$ is defined as
$-R = (X_r, X_r + Y_r)$. The points of the elliptic curve $E(2^m)$ form an additive group
satisfying the closure, identity and inverse properties. [1] proposed the following set of
rules defining the operations of an elliptic curve over $GF(2^m)$:
The result of adding two points P and Q, where $P: (X_p, Y_p) \in E(2^m)$, $Q: (X_q, Y_q) \in E(2^m)$ and
$X_p \neq X_q$, is the point R with coordinates $(X_r, Y_r)$ given by
$X_r = s^2 + s + X_p + X_q + a$, $Y_r = s(X_p + X_r) + Y_p + X_r$, where $s = \frac{Y_p + Y_q}{X_p + X_q}$ (2.2)
The result of doubling a point P, where $P: (X_p, Y_p) \in E(2^m)$ and $X_p \neq 0$, is the point R with
coordinates $(X_r, Y_r)$ given by
$X_r = s^2 + s + a$, $Y_r = X_p^2 + (s + 1)X_r$, where $s = X_p + \frac{Y_p}{X_p}$ (2.3)
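Equations (2.2) and (2.3) can be exercised on a very small binary field. The sketch below assumes the field GF(2^4) with reduction polynomial $x^4 + x + 1$ and the curve parameters a = 1000 and b = 1001 (in binary), chosen only to keep the example small; all function names are hypothetical, and field elements are stored as integers whose bits are the polynomial coefficients.

```python
M = 4
F_POLY = 0b10011                 # x^4 + x + 1 (assumed toy reduction polynomial)
A, B = 0b1000, 0b1001            # assumed toy curve parameters a and b


def gf_mul(a, b):
    """Multiply in GF(2^4): shift-and-add with reduction after each shift."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F_POLY
    return r


def gf_inv(a):
    """a^(2^m - 2) is the inverse of a non-zero element."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r


def on_curve(pt):
    x, y = pt
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(A, x2) ^ B


def point_add(p, q):
    """Equation (2.2); requires different x-coordinates."""
    (xp, yp), (xq, yq) = p, q
    s = gf_mul(yp ^ yq, gf_inv(xp ^ xq))
    xr = gf_mul(s, s) ^ s ^ xp ^ xq ^ A
    return (xr, gf_mul(s, xp ^ xr) ^ yp ^ xr)


def point_double(p):
    """Equation (2.3); requires a non-zero x-coordinate."""
    xp, yp = p
    s = xp ^ gf_mul(yp, gf_inv(xp))
    xr = gf_mul(s, s) ^ s ^ A
    return (xr, gf_mul(xp, xp) ^ gf_mul(s ^ 1, xr))


P, Q = (0b0010, 0b1111), (0b1100, 0b1100)
total = point_add(P, Q)
twice = point_double(P)
```

Addition in $GF(2^m)$ is a bitwise XOR, so every "+" in equations (2.2) and (2.3) becomes `^` in the code.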
The processes of elliptic curve key generation and elliptic curve digital signature
involve modular inverse operations. As mentioned previously, modular inversion
is considered too costly in terms of time complexity compared with multiplication.
An alternative that avoids this cost is to convert the affine
coordinates $(X, Y)$ of an elliptic curve point to projective coordinates $(X^*, Y^*, Z^*)$ and to
carry the denominators of the operations in $Z^*$. The process is completed by converting
the projective coordinates $(X^*, Y^*, Z^*)$ back to affine coordinates $(X, Y)$. There are different
types of projective coordinates for the non-supersingular elliptic curve
shown below:
$E: y^2 + xy = x^3 + ax^2 + b$
$Y^2 Z + XYZ = X^3 + aX^2 Z + bZ^3$
where the point at infinity ∞ corresponds to (0: 1: 0) and the negative of (X: Y: Z) is
(X: X + Y: Z). The projective point (X: Y: Z) with Z ≠ 0 corresponds to the affine point (X/Z, Y/Z),
where c = 1 and d = 1.
$Y^2 + XYZ = X^3 + aX^2 Z^2 + bZ^6$
$Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4$
where the point at infinity ∞ corresponds to (1: 0: 0) and the negative of (X: Y: Z) is
(X: X + Y: Z). The projective point (X: Y: Z) with Z ≠ 0 corresponds to the affine point $(X/Z, Y/Z^2)$,
where c = 1 and d = 2.
In practice, there is no practical use for implementing elliptic curves over the real
numbers, owing to computational limitations and constraints. Therefore, in this
subsection, we discuss the mechanism and related arithmetic of implementing
elliptic curves over the binary field $GF(2^m)$, where the order of the elliptic curve is up to
m bits. Until recently, most applications, such as ECDSA over $GF(2^m)$, have defined m to
be equal to or greater than 163 bits. Thus, most ECDSA operations over $GF(2^m)$ involve
m-bit integers. In other words, the elliptic curve coefficients, the point coordinates and
the curve order are all m-bit numbers. Implementing elliptic curves over the binary field $GF(2^m)$
requires several binary arithmetic functions, including modular reduction, addition,
multiplication, squaring and inversion.
Reduction modulo $f(x)$ in $GF(2^m)$ is possible when $f(x) = x^m + r(x)$ is an
irreducible (primitive) binary polynomial of degree $m$ and the elements of $GF(2^m)$ are
generated using $f(x)$, so that the elements of $GF(2^m)$ are polynomials of degree at most
$m - 1$. Each $GF(2^m)$ field element is thus an m-bit value, which can be written in polynomial
form as $a(x) = a_{m-1}x^{m-1} + \cdots + a_2 x^2 + a_1 x + a_0$ or in vector form as
$A = [a_{m-1}, a_{m-2}, \cdots, a_2, a_1, a_0]$.
Squaring or multiplying polynomials of degree $m - 1$ produces an
output polynomial $Y(x)$ of degree up to $2m - 2$. Therefore, we compute $Y(x)$ modulo $f(x)$
whenever the degree of $Y(x)$ is at least the degree of the primitive polynomial, which
ensures that the degree of the result is less than $m$. The reduction uses the fact that
$x^i \equiv x^{i-m} r(x) \pmod{f(x)}$ for $i \geq m$. For example, if $m = 163$, then $Y(x)$ has
degree up to $2m - 2 = 324$. Accordingly, $Y(x)$ can be stored as a vector of 32-bit words, which
could be represented as below:
Assume that $f(x)$ is a trinomial or pentanomial whose middle terms are close to each
other. Then the reduction of $Y(x)$ modulo $f(x)$ can be performed very
effectively by reducing 32 bits at a time. The reduction of word $y[9]$ consists of adding
$y[9]$ four times to $Y$. In detail, the 0th (least significant) bit of
$y[9]$ is added to bits 132, 131, 128 and 125 of $Y$. Then the next bit of $y[9]$ is added to
bits 133, 132, 129 and 126 of $Y$, and so on. For example, if $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$, then
the modular reduction for $y[9]$, which holds bits 288 to 319 of $Y$, proceeds as
shown below:
Further details about modular reduction over the binary field can be found in
Chapter 4 (Section 4.4.2) and in [1].
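The bit positions quoted above can be verified with a bit-at-a-time reduction in Python, where an integer holds the coefficients of a binary polynomial. This is a sketch for checking the arithmetic only: the names are hypothetical, the pentanomial $x^{163} + x^7 + x^6 + x^3 + 1$ is assumed, and an efficient implementation would process a whole 32-bit word per step.

```python
M = 163
R_POLY = (1 << 7) | (1 << 6) | (1 << 3) | 1      # r(x) = x^7 + x^6 + x^3 + 1
F_POLY = (1 << M) | R_POLY                       # f(x) = x^163 + r(x)


def reduce_mod_f(y):
    """Reduce a binary polynomial modulo f(x), highest bit first."""
    for i in range(y.bit_length() - 1, M - 1, -1):
        if (y >> i) & 1:
            # Replace x^i by x^(i-m) * r(x): clear bit i and add the shifted r(x).
            y ^= F_POLY << (i - M)
    return y


bit_288 = reduce_mod_f(1 << 288)
bit_289 = reduce_mod_f(1 << 289)
```

Bit 288 is bit 0 of word 9 when 32-bit words are used, so its image under the reduction gives the four target positions for the first bit of that word.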
1) for (i = 0 to 2t − 1)
     C[i] ← 0
   End for
2) for (i = 0 to t − 1)
     for (j = 0 to t − 1)
       P, Q ← A[i] ⨂ B[j]
       C[i+j] ← C[i+j] ⨁ P
       C[i+j+1] ← C[i+j+1] ⨁ Q
     End for
   End for
The third field element is known as $c(z) = a(z) \cdot b(z) \bmod f(z)$. Finite
field multiplication involves two steps: polynomial multiplication and reduction
modulo an irreducible polynomial. There are two benefits of using a sparse
irreducible polynomial: firstly, it simplifies the reduction process, and secondly, it has
fewer nonzero terms to process. Many algorithms have been
proposed by researchers to implement binary field multiplication, including the
pencil-and-paper polynomial multiplication algorithm, the Karatsuba-Ofman algorithm, the
Montgomery algorithm and the Comba multiplication algorithm.
The procedure for multiplying $a(x)$ and $b(x)$ of degree at most $m - 1$ using the
Karatsuba-Ofman algorithm mainly consists of two steps. The first step is to split $a(x)$
and $b(x)$ into two polynomials of degree at most $(m/2) - 1$. If m is odd,
the polynomials are prepended with zeros. Thus, $A(x) = A_1(x)X + A_0(x)$ and
$B(x) = B_1(x)X + B_0(x)$, where $X = x^{m/2}$. Accordingly,
$a(x) \cdot b(x) = A_1 B_1 X^2 + [(A_1 + A_0)(B_1 + B_0) + A_1 B_1 + A_0 B_0]X + A_0 B_0$,
so the product is derived from three products of polynomials of degree
at most $(m/2) - 1$, per the following steps:
The Karatsuba-Ofman algorithm can be applied to the three sub-products in turn, which leads
to a recursive multiplication algorithm. However, practical implementation shows that the
number of recursion levels used is ultimately dictated by the overhead associated with the
implementation, and by the relative performance of the multiplication and addition
operations. For instance, multiplying 192-bit binary polynomials on a processor with a
32-bit word length leads to the
following recursions [29]:
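As an illustration of the split-and-recombine identity (and not a reproduction of the recursion listing from [29]), the Python sketch below multiplies binary polynomials stored as integers. The 32-bit threshold mirrors the word length mentioned above; the function names and test operands are assumptions.

```python
def clmul(a, b):
    """Schoolbook carry-less (binary polynomial) multiplication."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba(a, b, m):
    """Multiply binary polynomials of degree < m with three half-size products."""
    if m <= 32:                               # base case: one machine word
        return clmul(a, b)
    half = (m + 1) // 2                       # odd m: the upper half is zero-padded
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = karatsuba(a0, b0, half)              # A0 * B0
    hi = karatsuba(a1, b1, half)              # A1 * B1
    mid = karatsuba(a0 ^ a1, b0 ^ b1, half) ^ lo ^ hi
    return (hi << (2 * half)) ^ (mid << half) ^ lo


MASK192 = (1 << 192) - 1
a = (3 ** 121) & MASK192
b = (5 ** 83) & MASK192
product = karatsuba(a, b, 192)
```

With 192-bit operands the recursion splits 192 into 96, then 48, then 24 bits before reaching the single-word base case; a real implementation might stop earlier if the recursion overhead outweighs the saved multiplications.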
The first approach to computing $a \cdot b$ in $GF(2^m)$ with Montgomery multiplication was
proposed by Koc and Acar in [36]. Their proposal is based on computing $a \cdot b \cdot r^{-1}$ in $GF(2^m)$, where $r$
$r(x)\,r^{-1}(x) + f'(x)\,f(x) = 1$
The Extended Euclidean algorithm can be used to compute the polynomials $r^{-1}(x)$ and
$f'(x)$. In order to compute the Montgomery product at word level, it is only required to
calculate the $w$-bit word $f_0(x)$ rather than the entire polynomial $f(x)$, whose length
is $k = tw$. It is worth mentioning that the efficiency of the inversion algorithm is
based on the observation that the polynomial $f_0(x)$ and its inverse satisfy
$f_0(x) f'(x) = 1 \bmod x^i$ for $i = 1, 2, 3, \dots, w$.
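As a hedged illustration of the quantity $a \cdot b \cdot r^{-1}$, the following Python sketch gives a bit-level Montgomery product in $GF(2^m)$ with $r = x^m$. It is a simplification, not the word-level method discussed above: it processes one bit per iteration, so no precomputed $f'(x)$ is needed. Function names are hypothetical and polynomials are stored as integers.

```python
def mont_mul(a: int, b: int, f: int, m: int) -> int:
    """Return a(x) * b(x) * x^(-m) mod f(x), for deg a, deg b < m = deg f."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:     # add b when coefficient a_i is 1
            c ^= b
        if c & 1:            # f has constant term 1, so this clears bit 0
            c ^= f
        c >>= 1              # exact division by x
    return c


def poly_mod(c: int, f: int) -> int:
    """Plain remainder, used here only to precompute x^(2m) mod f."""
    deg_f = f.bit_length() - 1
    while c.bit_length() - 1 >= deg_f:
        c ^= f << (c.bit_length() - 1 - deg_f)
    return c


def field_mul_via_montgomery(a: int, b: int, f: int, m: int) -> int:
    """Ordinary product a*b mod f obtained from two Montgomery products."""
    r2 = poly_mod(1 << (2 * m), f)          # r^2 mod f
    return mont_mul(mont_mul(a, b, f, m), r2, f, m)
```

The second Montgomery product by $r^2$ cancels the factor $r^{-1}$ introduced by the first one; in a long computation the operands would normally stay in the Montgomery domain instead.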
On the other hand, Comba [33] proposed accelerating the multiplication by reducing the
number of memory references during execution. The proposed idea is based on eliminating
the write-back operation simply by changing the order of partial product
generation/accumulation, so that each result word is computed in its entirety and in
sequence. The operation starts with the least significant word, and only the values of
𝐴[𝑖] and 𝐵[𝑗] need to be read from memory. Further improvement of the Comba method can
significantly reduce the storage overhead: the products 𝑢(𝑥)𝑏(𝑥) are computed for all
polynomials 𝑢(𝑥) of degree less than 𝑤 (the window length). Further detail about the
Comba algorithm is discussed in Chapter 4.
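The reordering idea can be sketched as follows in Python for binary polynomials stored as little-endian lists of 32-bit words. This is an illustrative sketch only (the names and the word size W are assumptions, and it is not the optimised version of Chapter 4): each output word C[k] is accumulated in two local variables and written to memory exactly once.

```python
W = 32
MASK = (1 << W) - 1


def clmul_word(x: int, y: int):
    """Carry-less product of two W-bit words, returned as (high, low) words."""
    r = 0
    for i in range(W):
        if (y >> i) & 1:
            r ^= x << i
    return r >> W, r & MASK


def comba_mul(A, B):
    """Product-scanning multiplication; A and B have the same number of words."""
    t = len(A)
    assert len(B) == t
    C = [0] * (2 * t)
    acc_lo = acc_hi = 0                   # column accumulator kept in "registers"
    for k in range(2 * t - 1):            # one column per output word
        for i in range(max(0, k - t + 1), min(k, t - 1) + 1):
            hi, lo = clmul_word(A[i], B[k - i])
            acc_lo ^= lo
            acc_hi ^= hi
        C[k] = acc_lo                     # single write, no read-modify-write
        acc_lo, acc_hi = acc_hi, 0        # high part carries into next column
    C[2 * t - 1] = acc_lo
    return C
```

Compared with the row-wise loop that updates C[i+j] and C[i+j+1] for every partial product, the column-wise order never reads C back from memory.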
identity, zero and inverse elements. As described earlier (section 2.5), the idea of the
elliptic curve group is that, given two points on the elliptic curve, further points can
be produced from them. Therefore, in this section, we provide a detailed description of
the group laws for the elliptic curve over the prime field and the binary field.
In general, an elliptic curve 𝐸 over a prime field 𝐾 with characteristic not equal to 2
or 3 is the set of solutions that satisfy the following simplified Weierstrass equation:
$E/K: y^2 = x^3 + ax + b$
where $a, b \in K$ and $4a^3 + 27b^2 \neq 0$, combined with the point at infinity ∞. The
group laws of an elliptic curve over the prime field are shown below [1]:
Identity
Negatives
If 𝑃 = (𝑥, 𝑦) ∈ 𝐸(𝐾), then 𝑃 + 𝑄 = ∞, where the point 𝑄 = (𝑥, −𝑦) ∈ 𝐸(𝐾) is known
as the negative of 𝑃 and is denoted by −𝑃. Note that −∞ = ∞.
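Point Addition
$x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2$ and $y_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x_1 - x_3) - y_1$.
Point Doubling
$x_3 = \left(\frac{3x_1^2 + a}{2y_1}\right)^2 - 2x_1$ and $y_3 = \left(\frac{3x_1^2 + a}{2y_1}\right)(x_1 - x_3) - y_1$.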
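The group laws over the prime field can be checked with a small Python sketch. The curve y^2 = x^3 + 2x + 2 over GF(17) and the point (5, 1) are toy values chosen only for illustration (they are not parameters used in this thesis), and the function name is hypothetical. The modular inverse uses pow(x, -1, p), which assumes Python 3.8 or later.

```python
p, a, b = 17, 2, 2        # toy curve y^2 = x^3 + 2x + 2 over GF(17)
INF = None                # point at infinity (identity element)


def ec_add(P, Q):
    """Affine addition on E(GF(p)), covering identity, negatives and doubling."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF                                    # Q = -P
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # doubling slope
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # addition slope
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```

For example, ec_add((5, 1), (5, 1)) evaluates the doubling formula and ec_add((5, 1), (6, 3)) evaluates the addition formula.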
A non-supersingular elliptic curve 𝐸 over $F_{2^m}$ is the set of solutions to the
following simplified Weierstrass equation [1]:
$E/F_{2^m}: y^2 + xy = x^3 + ax^2 + b$
Identity
Negatives
Point Addition
Point Doubling
$x_3 = \lambda^2 + \lambda + a = x_1^2 + \frac{b}{x_1^2}$ and $y_3 = x_1^2 + \lambda x_3 + x_3$ with $\lambda = x_1 + y_1/x_1$
Scalar multiplication is the dominant ECC operation, consuming about 80% of the time
spent executing an ECC operation [4]. Scalar point multiplication can be implemented
using various algorithms, including the binary double-and-add algorithm, the window
method, the NAF and wNAF algorithms, the sliding window algorithm and the Montgomery
Ladder algorithm.
The idea of this algorithm is based on writing 𝑘 in binary format and then performing
point additions and point doublings accordingly. A point doubling is carried out for
every ′0′ bit, whereas a point doubling followed by a point addition is carried out for
every ′1′ bit. For example, if 𝑘 = 19 = (10011)2, then the point additions and point
doublings are performed as per Table 2.3.
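The bit-by-bit rule can be sketched in Python as follows. The function is written against abstract add and double operations so that it stays self-contained; in the usage shown, ordinary integers stand in for curve points (P is replaced by 1, so the result is simply k). The name double_and_add and this stand-in group are illustrative assumptions.

```python
def double_and_add(k, P, add, double, identity):
    """Left-to-right binary scalar multiplication; also counts the operations."""
    Q = identity
    doublings = additions = 0
    for bit in bin(k)[2:]:          # most significant bit first
        Q = double(Q)               # every bit costs one doubling
        doublings += 1
        if bit == "1":              # a '1' bit costs one extra addition
            Q = add(Q, P)
            additions += 1
    return Q, doublings, additions


# Stand-in group: integers under addition, with "point" P = 1.
result, n_dbl, n_add = double_and_add(
    19, 1, add=lambda x, y: x + y, double=lambda x: 2 * x, identity=0
)
```

For 𝑘 = 19 this loop performs one doubling per bit and one addition per nonzero bit; an implementation would typically skip the trivial doubling and addition on the leading bit.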
The main purpose of this algorithm is to obtain a signed-digit representation of 𝑘 in
which no two nonzero digits are adjacent to each other. Thus, the binary form of the
previous example can be rewritten as below, using the NAF algorithm:
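A minimal Python sketch of the standard NAF recoding is given below for illustration (the function name is hypothetical). Each odd remainder is replaced by the digit +1 or −1 that makes the next digit zero.

```python
def naf(k: int):
    """Non-adjacent form of k, most significant digit first, digits in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)     # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d              # k is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits[::-1]


def naf_value(digits):
    """Evaluate a signed-digit string back to an integer (for checking)."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value
```

For 𝑘 = 19 this gives the digits (1, 0, 1, 0, −1), that is 16 + 4 − 1, and for 𝑘 = 7 it gives (1, 0, 0, −1), which replaces three additions by one addition and one subtraction.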
In this part, we consider scalar multiplication over the binary curve in particular. The
elliptic curve is defined using the Weierstrass equation below:
$E: y^2 + xy = x^3 + ax^2 + b$ with $a, b \in F_{2^m}$
where $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2)$ are two points on the curve $E(F_{2^m})$.
Thus, to compute the point addition $P_3 = P_1 + P_2 = (x_3, y_3)$, the following should be considered:
$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a$,
$y_3 = (x_1 + x_3)\lambda + x_3 + y_1$,
where $\lambda = \frac{y_1 + y_2}{x_1 + x_2}$ if $P_1 \neq P_2$, and $\lambda = \frac{y_1}{x_1} + x_1$ if $P_1 = P_2$.
From the above, it is clear that point addition and point doubling each involve an
inversion operation in 𝐺𝐹(2𝑚), which is considered a costly operation. Therefore, it is
recommended to use projective coordinates instead, as they allow the curve operations to
be performed with a few more field multiplications but without field inversions.
There are different types of coordinate systems, including the standard projective
coordinate system, the López-Dahab projective coordinate system, the Jacobian projective
coordinate system and the affine coordinate system. When planning to use any one of
these, it is suggested to consider the number of field operations, as summarized in the
following table:
Table 2.5 Number of operations for point addition and point doubling [1]
Coordinate system      General addition   General addition (mixed coordinates)   Doubling
Affine                 V+M                ---                                    V+M
Standard projective    13M                12M                                    7M
Jacobian projective    14M                10M                                    5M
López-Dahab            14M                8M                                     4M
The main domain parameters for an elliptic curve over the prime field GF(p) are p, a, b,
G, n and h. These parameters are defined as below:
1. a and b: These two parameters define the curve 𝑦² mod 𝑝 = (𝑥³ + 𝑎𝑥 + 𝑏) mod 𝑝.
2. p: The prime that defines the finite field.
3. G: The base (generator) point (𝑋𝐺 , 𝑌𝐺 ), from which other points are generated.
4. n: The order of G; it bounds the scalar selected for multiplication.
5. h: The cofactor, that is, the number of points on the elliptic curve divided by n.
On the other hand, the elliptic curve domain parameters over the binary field GF(2𝑚 )
are given by 𝒎, 𝒇(𝒙), 𝒂, 𝒃, 𝑮, 𝒏 and 𝒉. Each of these parameters is described below:
In asymmetric cryptography, such as RSA and ECC, the cost is expressed as a number of bit
operations. For a scalar 𝑘, this number is of the order of $\log k \cdot \log^3 q$, where
log without a base denotes the natural logarithm [40]. This is the number of bit
operations needed to calculate the coordinates of the multiple 𝑘𝑃. Accordingly, it is
possible to derive efficient public key protocols by adopting ECC [41].
3. Return Z
have similar authentication parameters plus the public key 𝑄𝐵 [24]. Therefore, to perform
ECDSA tasks, a user 𝐴 is required to use his key pair (𝑄, 𝐾𝐴) [1] as shown in Algorithm
2.12.
IV. Elliptic Curve Digital Signature Algorithm ECDSA (Verification)
The ECDSA verification process takes the domain parameters (𝑞, 𝐹𝑅, 𝑎, 𝑏, 𝑃, 𝑛, ℎ), a
message 𝑚 and the proposed signature. Algorithm 2.13 illustrates the process leading to
the acceptance or rejection of the digital signature assigned to the message. This
process requires a user 𝐵 to validate user 𝐴's signature (𝑠, 𝑟) assigned to message 𝑚.
Accordingly, user 𝐵 needs to carry out the steps of the algorithm using an authenticated
copy of 𝐴's public key 𝑄.
INPUT: Domain parameters (𝑞, 𝐹𝑅, 𝑎, 𝑏, 𝑃, 𝑛, ℎ), signature (𝑠, 𝑟), message 𝑚
OUTPUT: Signature acceptance or rejection
1. Verify that 𝑠 and 𝑟 are integers in the interval [1, n-1]; otherwise the signature is
rejected.
2. Compute ℎ(𝑚).
3. Compute 𝑢 = 𝑟⁻¹ mod n.
4. Compute 𝑣1 = 𝑢 · ℎ(𝑚) mod 𝑛 and 𝑣2 = 𝑢 · 𝑠 mod 𝑛.
5. Compute (𝑥2 , 𝑦2 ) = 𝑣1 𝑃 + 𝑣2 𝑄 and let 𝑤 = 𝑥2 mod 𝑛.
6. If 𝑤 = 𝑠 the signature is verified, else rejected.
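The verification flow can be exercised end to end with the following Python sketch. It is a toy under several assumptions: the curve y^2 = x^3 + 2x + 2 over GF(17) with base point P = (5, 1) of order n = 19 is far too small to be secure, SHA-256 stands in for ℎ, and the signing function is not taken from this chapter; it is chosen so that it is consistent with the verification steps and the (𝑠, 𝑟) labelling above, where 𝑠 is the x-coordinate component. It assumes Python 3.8 or later.

```python
import hashlib

p, a, b = 17, 2, 2          # toy curve over GF(17), illustration only
P, n = (5, 1), 19           # base point and its (prime) order
INF = None


def ec_add(A, B):
    if A is INF:
        return B
    if B is INF:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if A == B:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_mul(k, A):
    R = INF
    for bit in bin(k)[2:]:
        R = ec_add(R, R)
        if bit == "1":
            R = ec_add(R, A)
    return R


def h(m: bytes) -> int:
    return int.from_bytes(hashlib.sha256(m).digest(), "big") % n


def sign(m: bytes, d: int, k: int):
    """Assumed signing step; returns None when the nonce k must be rejected."""
    s = ec_mul(k, P)[0] % n
    r = pow(k, -1, n) * (h(m) + d * s) % n
    return None if s == 0 or r == 0 else (s, r)


def verify(m: bytes, sig, Q) -> bool:
    s, r = sig
    if not (1 <= s < n and 1 <= r < n):        # step 1: range check
        return False
    u = pow(r, -1, n)                          # step 3
    v1, v2 = u * h(m) % n, u * s % n           # step 4
    X = ec_add(ec_mul(v1, P), ec_mul(v2, Q))   # step 5
    return X is not INF and X[0] % n == s      # step 6
```

In this sketch v1·P + v2·Q equals k·P for a valid signature, so the recovered x-coordinate matches 𝑠. A real implementation would draw k from a secure random source and use standardized domain parameters.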
to use his private key together with the cipher text (C1 , C2 ): B needs to multiply the
first point by his private key and subtract the obtained result from the second point to
retrieve the plain text.
2.7 Conclusions
Throughout this chapter, we provided the background details associated with
cryptography, such as its history and the differences between symmetric and asymmetric
types of cryptography. We then introduced fundamental information related to
Elliptic Curve Cryptography, such as finite field algorithms, point addition, point
doubling and ECC protocols. The main focus of this chapter was to provide preliminary
information relevant to the work in this thesis.
Chapter 3: Software Design: ECC Implementation on Single Core Microcontroller
3.1 Introduction
The recent and expected proliferation of wireless sensor networks (WSN) with all its
economical and societal benefits across a range of applications spanning healthcare, home,
environment, and defense will face serious limitations if security concerns are not addressed.
Cryptography plays a very important role in achieving security.
Elliptic Curve Cryptography (ECC) is increasingly becoming the first choice for
public key cryptography implementation, as it requires much shorter key sizes compared to
RSA for the same level of security. The implementation of ECC on sensor node platforms
remains a challenge due to the resource limitations of these nodes. Therefore, low
resource ECC implementations are required, with optimization techniques that speed up the
ECC operations and reduce the memory usage without prohibitive complexity.
Experimental analysis and evaluation of the Elliptic Curve Digital Signature Algorithm
(ECDSA) on both an 8-bit and a 32-bit platform (Arduino Mega2560 and Arduino Due) has been
carried out, and comparative implementation results are given. To our knowledge, no such
analysis and results have been reported to date.
The implementation results obtained show that ECDSA key generation on the Arduino
Due can be achieved in 90 ms compared to 263 ms on the Arduino Mega2560 for m = 163.
Furthermore, implementation optimisation configurations (such as multi-precision GF(2^m)
arithmetic) are shown to improve the performance of ECDSA on the Arduino Due to
83 ms. These results will act as a useful benchmark and as guidance in the selection of
the optimization techniques provided by the tool for a given WSN application.
This chapter is organized as follows: Section 3.2 provides the related work on
software implementation on microcontrollers and the ECC background. The third section
illustrates the Arduino Mega2560 and Arduino Due architectures. The efficient
implementation and optimizations provided by the relic-toolkit are presented in section
3.4. Our proposed optimization using the relic code is discussed in section 3.5.
Accordingly, our implementation work is described and results analyses are illustrated in
section 3.5. Finally, we conclude this discussion in section 3.6.
3.2 Background
In 1985, Neal Koblitz and Victor S. Miller independently proposed Elliptic Curve
Cryptography, which is based on elliptic curve theory. Currently, ECC is considered to be
one of the main players for implementing security in different applications. ECC has the
capability to provide many cryptographic schemes, such as key management, digital
signature and verification. Besides these services and its strong security, ECC offers
more efficient computation with shorter key lengths compared to other public key
cryptography solutions, such as RSA and Diffie-Hellman. ECC can be defined over prime
fields and binary fields. For the purpose of this work, we consider elliptic curves over
binary fields. The equation below represents the elliptic curve over binary fields:
$y^2 + xy = x^3 + ax^2 + b$
where $b \neq 0$ and the values of 𝑥, 𝑦, 𝑎 and 𝑏 are polynomials represented as 𝑛-bit words.
Finding points on the curve can be achieved by using a generator polynomial and an
irreducible polynomial. The rules for point addition in 𝐺𝐹(2𝑚 ) are different from those
in 𝐺𝐹(𝑝). If 𝑃 = (𝑥1 , 𝑦1 ), 𝑄 = (𝑥2 , 𝑦2 ) and 𝑄 ≠ 𝑃, then 𝑃 + 𝑄 = (𝑥3 , 𝑦3 ) can be
determined as shown below:
$\lambda = \frac{y_2 + y_1}{x_2 + x_1}$
$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a$
$y_3 = \lambda(x_1 + x_3) + x_3 + y_1$
and, for point doubling ($P = Q$),
$\lambda = x_1 + y_1/x_1$
$x_3 = \lambda^2 + \lambda + a$
$y_3 = x_1^2 + (\lambda + 1)x_3$
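The addition and doubling rules over a binary field can be tried out with the Python sketch below. To keep it small it uses the toy field GF(2^4) with f(x) = x^4 + x + 1 and the hypothetical coefficients a = b = 1, rather than the 163-bit NIST curves used in this work; all names are illustrative. Field elements are integers whose bits are polynomial coefficients, and addition is XOR.

```python
M, F = 4, 0b10011           # toy field GF(2^4), f(x) = x^4 + x + 1
A, B = 1, 1                 # hypothetical curve coefficients (b != 0)
INF = None


def fmul(u, v):
    """Field multiplication: carry-less product followed by reduction mod f."""
    r = 0
    for i in range(M):
        if (v >> i) & 1:
            r ^= u << i
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= F << (i - M)
    return r


def finv(u):
    """Inverse as u^(2^M - 2), built from repeated squarings."""
    r = 1
    for _ in range(M - 1):
        u = fmul(u, u)
        r = fmul(r, u)
    return r


def on_curve(Pt):
    if Pt is INF:
        return True
    x, y = Pt
    x2 = fmul(x, x)
    return fmul(y, y) ^ fmul(x, y) == fmul(x2, x) ^ fmul(A, x2) ^ B


def ec_add(P1, P2):
    if P1 is INF:
        return P2
    if P2 is INF:
        return P1
    (x1, y1), (x2, y2) = P1, P2
    if x1 == x2:
        if y2 == x1 ^ y1:                    # P2 = -P1 = (x1, x1 + y1)
            return INF
        lam = x1 ^ fmul(y1, finv(x1))        # doubling: lambda = x1 + y1/x1
        x3 = fmul(lam, lam) ^ lam ^ A
        y3 = fmul(x1, x1) ^ fmul(lam ^ 1, x3)
    else:
        lam = fmul(y1 ^ y2, finv(x1 ^ x2))   # addition slope
        x3 = fmul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
        y3 = fmul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Each affine operation costs one field inversion (finv), which is the motivation for the projective coordinates selected later in this chapter.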
The Elliptic Curve Digital Signature Algorithm (ECDSA) is used for digital signature
purposes and consists of three main procedures: key pair generation, signature generation
and signature verification.
The Elliptic Curve Diffie Hellman (ECDH) protocol is used for exchanging the keys between
two parties over an insecure channel. The purpose for having the ECC schemes is to provide
a high level of security with smaller key sizes. Therefore, it is important for both parties
involved in the communication to have pre-defined and agreed domain parameters for each
scheme. The detailed specification can be found in Chapter 2, section 2.6.2.
This is the first Arduino board to be equipped with a 32-bit ARM core processor. It is
also designed to support 54 digital input/output pins, of which 12 can be used as PWM
outputs. Compared to the Arduino Mega2560, this microcontroller has a much higher clock
frequency, which can reach up to 84 MHz. Furthermore, it is equipped with a USB
OTG capable connection and a JTAG header, as well as reset and erase buttons. Additional
information can be found in [47].
Portability: The relic-toolkit can be used with different types of wireless sensor
platforms, such as ARM, AVR and MSP. Additionally, the library can be built on different
operating systems, such as Windows (using MinGW), Ubuntu and Mac OS. In this
work, we consider importing and testing the relic library on the Arduino Mega2560 (8-bit
AVR processor) and the Arduino Due (32-bit ARM Cortex processor).
Efficiency: In order to accomplish the desired efficiencies from the tool, we decided to
implement the ECC over binary fields based on the potential result reported by the end to end
security. We also used an assembly version (shown as K163-asm) file in order to achieve
better performance as recommended by [48] wherein a new secure and energy-efficient
communication model for the Constrained Application Protocol (CoAP), was developed for
smart object networks. This model ensured authenticity over a network of multi-hop
topology.
Functionality: This principle is ensured through the practical implementation for different
public key cryptography schemes provided by the relic-toolkit such as ECDH and ECDSA.
Comba Algorithm: The Comba algorithm is a technique in which the partial products are
ordered and scheduled. Multiple precision is required for big number arithmetic. In
multiple precision arithmetic, the computations are carried out on digits whose precision
is constrained by the memory available on the host system. It is highly efficient for
public key cryptography implementations in resolving memory limitations as well as
overcoming overflow issues. Multiple precision solves such problems by extending the
integer representation while still using a single precision data type. The Comba
algorithm [49] shows better performance compared to the schoolbook multiplication method.
The relic-toolkit also allows users to select from other multiple precision arithmetic
algorithms besides the Comba algorithm, such as schoolbook multiplication, Karatsuba
multiplication and others.
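The ordering of partial products for multiple precision integers can be sketched in Python as follows. This is a generic illustration and not the relic-toolkit code: operands are little-endian lists of 32-bit digits, the names are hypothetical, and the Python integer acc models the wide (three-digit) accumulator that a C or assembly implementation would keep in registers.

```python
W = 32
MASK = (1 << W) - 1


def comba_mul_int(a, b):
    """Column-wise (Comba) product of two equal-length multi-digit integers."""
    t = len(a)
    assert len(b) == t
    c = [0] * (2 * t)
    acc = 0                                   # running column sum plus carries
    for k in range(2 * t - 1):                # one output digit per column
        for i in range(max(0, k - t + 1), min(k, t - 1) + 1):
            acc += a[i] * b[k - i]            # all products with i + j = k
        c[k] = acc & MASK                     # store the finished digit once
        acc >>= W                             # carry into the next column
    c[2 * t - 1] = acc
    return c


def to_int(digits):
    """Little-endian digit list back to a Python integer (for checking)."""
    return sum(d << (W * i) for i, d in enumerate(digits))
```

Carry propagation happens once per column rather than once per partial product, which is where the saving over the schoolbook method comes from.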
The implementation of the Montgomery modular reduction algorithm involves fewer
single precision multiplications in comparison with Barrett modular reduction, which
requires two modified multipliers [86]. Previous software implementations of the
Montgomery algorithm reported slower speeds. This challenge has been resolved by
researchers through combining the Montgomery modular reduction and Comba algorithms. The
combination is achieved by allowing the Comba algorithm to act as the multiplier.
methods. Using specialized squaring reduces the number of operand loads by approximately
half compared with using multiplication algorithms. Additionally, it improves the
computation performance for the duplicated partial products. Moreover, specialized
squaring algorithms help to overcome the limitations of baseline multiplication
algorithms. These limitations can be summarized in two main points: first, the need to
process a single precision shift inside the nested loop; second, the difficulty of
performing the product doubling inside the inner loop.
The comba squaring algorithm could be used to solve these drawbacks. The concept
of comba squaring is to some extent similar to the comba multiplication algorithm with some
differences that help to accommodate the single precision shifting and doubling processes.
The relic-toolkit supports three other multiple precision squaring algorithms besides
Comba squaring, namely Karatsuba squaring, recursive Karatsuba and the schoolbook
method [49]. In this work, we configure the relic library with Comba squaring to obtain
better performance.
Optimization for Elliptic Curve Arithmetic
Point Representations: There are different coordinate systems that can be used to
represent points on the elliptic curve, the most popular being affine coordinates and
projective coordinates. Projective coordinates can be considered an option that helps to
avoid the costly inversion operations. The results reported by [38] show better
performance compared to affine coordinates. The relic-toolkit has been designed to
support both, and we selected projective coordinates to achieve a higher performance.
Point Multiplications: Point multiplication or scalar multiplication is implemented
through a series of point addition and point doubling operations. The key is obtained
after conducting a full cycle of addition and doubling operations. Point multiplication
over a binary elliptic curve can be implemented with different algorithms, such as the
left-to-right binary algorithm, halving, right-to-left width-w and others. The relic
library provides six different algorithms, including the basic binary point
multiplication algorithm, Lopez-Dahab point multiplication and right-to-left width-w
(T)NAF. Since the sliding window method is more helpful in speeding up the scalar
multiplication, we selected the right-to-left width-w (T)NAF algorithm for performing the
point multiplication. The concept of sliding windows is based on scanning a group of bits
at a time and performing the point doublings for them together [44].
to its significance in some ECC schemes. For example, the implementation of ECDSA
requires two types of point multiplication: the first, for signature generation, uses a
fixed point; the second, for the signature verification process, uses a point that is
also fixed but not known in advance. The speed of the signature verification process can
be increased by using simultaneous multiple-point multiplication [1]. Different methods
have been proposed for simultaneous point multiplication, such as Shamir's trick, joint
sparse form and interleaving. The relic-toolkit supports all of these methods plus the
basic simultaneous point multiplication method, and the method can be selected during the
relic building process.
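Of the methods listed, Shamir's trick is the simplest to sketch. The Python fragment below is an illustration only and does not reproduce the relic-toolkit implementation: it computes u1·P + u2·Q with one shared chain of doublings, using abstract add and double operations so that ordinary integers can stand in for curve points. All names are hypothetical.

```python
def shamir_trick(u1, P, u2, Q, add, double, identity):
    """Simultaneous multiplication u1*P + u2*Q with a single doubling chain."""
    PQ = add(P, Q)                              # precomputed once
    R = identity
    nbits = max(u1.bit_length(), u2.bit_length())
    for i in range(nbits - 1, -1, -1):          # scan both scalars together
        R = double(R)
        b1, b2 = (u1 >> i) & 1, (u2 >> i) & 1
        if b1 and b2:
            R = add(R, PQ)
        elif b1:
            R = add(R, P)
        elif b2:
            R = add(R, Q)
    return R


# Stand-in group: integers under addition ("points" 3 and 5).
value = shamir_trick(13, 3, 7, 5,
                     add=lambda x, y: x + y, double=lambda x: 2 * x, identity=0)
```

Compared with two separate scalar multiplications followed by one addition, roughly half of the doublings are saved, which is why the technique suits the v1·P + v2·Q computation in signature verification.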
We imported the relic-toolkit into the Arduino Mega2560 (8-bit AVR processor) [47] and
the Arduino Due (32-bit ARM processor). Our selection of these platforms is based on the
fact that we aimed to implement the ECC schemes on a processor that does not require
operating system support. Furthermore, the 8-bit to 32-bit processor range is a
representative range for resource-constrained embedded applications. We imported
relic-0.3.1 onto the two platform boards, and experimented with the performance of ECDSA
and ECDH over binary fields using different NIST standard curves (NIST-K163, NIST-B163).
In order to obtain better performance, we examined the presets provided by [46]. The
execution timings of the codes were measured using the inbuilt millis() function provided
by the Arduino.h library. Furthermore, we measured the amount of RAM using the
"MemoryFree.h" library besides the avr-size and arm-none-eabi-size tools.
Experiment Setup:
In order to build the library, we installed the avr-gcc version 4.5.3 compiler and the
cmake cross-platform build tool, version 2.8.7. The recommended presets by [16] shown in
Figures 4 and 5 in the Appendix were used for building the library with low memory
optimization algorithms and faster time execution respectively, compared to the original
recommended presets. For importing the relic-toolkit into the Arduino Due, we installed
the Arduino extension plug-in (embedxcode) in the Xcode IDE on Mac OS X version 10.7.3,
and then we imported the relic-toolkit into the Xcode IDE.
Due to the importance of time and memory usage, we evaluated our ECC implementation
based on these two factors. The Arduino Mega2560 is an 8-bit microcontroller, but it has
the capability to manipulate 16×16-bit operations by using two separate registers. The
Arduino Due, on the other hand, is a 32-bit microcontroller and can easily handle 8- and
16-bit operations. We measured the execution time using the inbuilt function millis()
provided by the Arduino library. This function returns the timing result in milliseconds
using the Arduino internal timer #0 (TCNT0). The timer runs at 16 MHz on the Arduino
Mega2560 and at 84 MHz on the Arduino Due. We measured the amount of RAM using the
arm-none-eabi-size tool for the Arduino Due, and the avr-size tool for the Arduino
Mega2560.
ECDSA: In this part, we present the main results obtained with regard to ECDSA
performance. Figure 3.3 below shows the execution time for ECDSA key generation on both
platforms.
As expected, the Arduino Mega2560 takes more time to generate the ECDSA keys, as
it runs at a much lower clock frequency than the Arduino Due. Figure 3.4 presents a
comparison of the binary field arithmetic with the basic and Comba algorithms. The binary
field arithmetic with the BASIC algorithm configuration resulted in an improved
performance on the Due, as shown in Figure 3.4.
The performance on the mega2560 was improved using the assembly code provided in the
library. This enhancement is represented by the figures which include the time execution
improvement and memory usages, respectively. These results show even better performance
compared to the results reported by [44] and [45].
3.6 Conclusions
In this work, we illustrated the potential of implementing relic-toolkits on sensor node
platforms. We also evaluated some of the optimization methods and their effectiveness in the
ECDSA implementation performance. The configuration features provided by the relic-
toolkit can help enhance the ECC performance, which could be considered as a benchmark
and guidance for the developer planning to use the relic in resource constrained processor
platforms, such as the ones presented in this thesis.
Chapter 4: Software Design: Efficient Field Arithmetic over 𝐺𝐹2163
Efficient field arithmetic over $GF(2^{163})$ is proposed in this chapter. Our novel
proposal aims to enhance the performance of the Comba algorithm by examining the
possibility of running it on a homogeneous multicore microcontroller. We start this
chapter by highlighting the importance of the multiplication process in the overall
performance of ECC. Then we detail our proposal for parallelizing the Comba algorithm.
After that, we provide the analytical details for the obtained results. In this work, we
managed to enhance the Comba algorithm by about 90%.
4.1 Introduction
Modern, inexpensive constrained devices have motivated researchers to use them in
Wireless Sensor Networks (WSN) for applications such as hazardous environments, military
operations and medical monitoring, with close attention to maintaining the necessary
security requirements. Recently, Elliptic Curve Cryptography (ECC) has proved to
be a competitive substitute for standard public key cryptosystems like RSA, DSA and DH.
In particular, it can provide the same level of security as RSA with a shorter key size,
lower processing time and smaller memory size.
ECC can be implemented based on prime finite field arithmetic GF(p) or binary finite
field arithmetic GF(2^m), where m is prime, and its performance is highly dependent on
the multiplication operation of the finite field arithmetic. Indeed, [51] states that
around 80% of the execution time is consumed by the multiplication operation in a
software implementation. Therefore, various attempts have been made to reduce the
execution time, such as the work done by [52] and [53], which suggested modified
algorithms and provided new multiplication techniques suited to microcontroller platforms
that can be used in WSNs. Lately, the work conducted by [54] expressed the benefits of
employing multicore embedded platforms in WSNs through energy savings and execution time
improvements.
This chapter is organized as follows. Firstly, section 4.1 introduces the ECC
concept and how multiplication plays an important role in ECC performance.
In section 4.2, we discuss the related works that have been conducted to enhance the
Therefore, many attempts have been made by researchers to enhance the performance
of multi-precision multiplication. For instance, [11] suggested a simple architectural
enhancement to a general-purpose processor core that assists the execution of arithmetic
operations in binary finite fields GF(2^m). Their suggestion is based on a modification
of the MULSC instruction supplied by the SPARC V8 architecture, which was used to
implement the left-to-right comb technique of Lopez and Dahab. Using this approach, the
authors reported a speed increase of 90 per cent in addition to a remarkable reduction in
the use of RAM. The work uses the polynomial basis as well as special polynomials such as
trinomials, pentanomials and all-one polynomials (AOP) to develop an extensive and
careful study of finite field multiplication over GF(2^m). The Montgomery multiplication
scheme is used to carry out this multiplication, and its application is also described.
The work focuses on different arithmetic operations in elliptic curve cryptography over
GF(2^m). The performance is also discussed in terms of the number of components, latency,
and space and time complexity.
Michael Hutter and Erich Wenger [53] proposed a novel multiplication technique
to increase the performance of multiplication. Their technique is based on the product
scanning approach, but it divides the calculation into several rows. In this method, the
authors reduced the number of necessary load instructions through caching of operands;
load instructions are usually among the most expensive operations on modern processors.
The authors evaluated the new technique on an 8-bit ATmega128 microcontroller and
compared the result with existing solutions. The implementation requires only 2,395 clock
cycles for a 160-bit multiplication, which outperforms related work by 10% to 23%,
including an improvement of 23% over the result reported by [51]. The number of load
instructions required is decreased from 167 (needed by the best-known hybrid
multiplication) to just 80. The implementation also scales well for larger integer sizes
(required for RSA) and limited register sets. It also fully complies with existing
multiply-accumulate instructions integrated into most of the available processors.
Next, Hwajeong Seo [58] proposed a novel method, named carry-once, to perform
multi-precision multiplication with accumulation of intermediate results. The
principal idea of this technique is to optimize the number of addition instructions
required for updating the intermediate result. With this method, the authors reported
better performance of multi-precision multiplication when implementing 160-bit
multiplication on the ATmega128.
Z. Liu and J. Großschädl [59] proposed a new software technique for improving the
performance of Montgomery modular multiplication on an 8-bit AVR microcontroller. Using
assembly language, the authors implemented six hybrid Montgomery multiplication
algorithms on the AVR microcontroller. The authors take advantage of hybrid
multiplication and combine it with Montgomery multiplication to enhance the modular
multiplication. Accordingly, they evaluated the performance of the new method for
operands ranging from 160 to 1024 bits.
The work in [5] proposed new efficient techniques for improving multiplication,
squaring, modular reduction and inversion in $GF(2^{163})$ and $GF(2^{233})$ on the MICAz
Mote microcontroller. In this work, the authors proposed using Karatsuba's multiplication
algorithm to divide the multiplication problem into two sub-problems, which are then
handled separately. In addition to that, they suggested saving the
already shifted results produced in the first phase. This process can help to reload the
intermediate result into registers for multi-precision shifting of some of the read memory
already released during the first phase.
[60] describes new techniques for parallelizing binary field arithmetic on computers
equipped with modern vector instruction sets. The authors detail methods for implementing
field multiplication, squaring and square root, and they present a constant memory
lookup-based multiplication mechanism. In this work, the authors implemented the finite
field arithmetic as an arithmetic backend of the relic toolkit [16] for testing and
benchmarking purposes.
Speeding up arithmetic operations and improving their effectiveness in software
implementations of 𝐺𝐹(𝑝) is the subject of the work proposed by [61]. Their work focused
mainly on increasing the performance of finite field multiplication on 32-bit and 64-bit
platforms using the Comba algorithm. The authors suggested implementing carry
accumulation by adding 32-bit variables into a 64-bit accumulator variable, which avoids
handling the carry after each addition. Instead, they proposed to accumulate the carry in
the final iteration.
H. Seo, Y. Lee, H. Kim, T. Park, and H. Kim [55] presented further techniques for
improving the performance of multi-precision multiplication on an embedded
microprocessor. The authors proposed enhancing the carry-once method by applying the
operand caching methodology, with further optimization of multiplication and accumulation
(MAC), unbalanced comb and comb-window methods. In this work, the authors also optimized
the product scanning method by reducing the number of required registers.
However, despite the number of works that have aimed to enhance the performance of
multi-precision multiplication, we found that none of the previous works attempted to
improve it using a multicore microcontroller. Thus, in the present work, we introduce new
methodologies for enhancing multi-precision multiplication on a multicore
microcontroller.
As shown in Figure 4.1, the xCORE startKit microcontroller is equipped with
analog and digital nodes. The digital node consists of an xCORE tile, a switch, and a PLL
(Phase-Locked Loop), whereas the analog node comprises the USB PHY, a multi-channel
ADC (Analog to Digital Converter), deep sleep memory, an oscillator, a real-time counter and
power supply control. Communication between the analog and digital nodes requires a link
that is capable of switching to the digital node.
flexible multicore microcontroller component. It consists of integrated I/O, on-chip
memory and multiple logical cores, which can run simultaneously. Each logical core is
guaranteed a slice of the processing power, can execute computational code, and can
provide control software and I/O interfaces. The logical cores use channels to exchange
data within a tile or across tiles, while the tiles are connected using a switch network
known as xCONNECT. xCONNECT uses a proprietary physical layer protocol to
add additional resources to a design. Additionally, the I/O pins are driven through
intelligent ports, which can serialize data, interpret strobe signals and wait for
scheduled times or events, making the device well suited to real-time control applications.
Each tile has up to 8 active logical cores, which issue instructions down a shared
four-stage pipeline. The instructions from the active cores are issued round-robin. If up to
four logical cores are in use, each core is allocated a quarter of the processing cycles. In
contrast, activating more than four logical cores results in each core being allocated at
least $1/n$ of the cycles (for $n$ cores).
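The allocation rule above can be written as a one-line function. The following is a minimal sketch based only on the description in this section; the helper name `core_share` is hypothetical and is not part of any Xmos tool or library.

```python
from fractions import Fraction

PIPELINE_STAGES = 4  # shared four-stage pipeline, as described in the text

def core_share(active_cores: int) -> Fraction:
    """Minimum fraction of tile cycles guaranteed to each active logical core."""
    if not 1 <= active_cores <= 8:
        raise ValueError("a tile has between 1 and 8 active logical cores")
    # Up to four cores: a quarter each. More than four: at least 1/n each.
    return Fraction(1, max(PIPELINE_STAGES, active_cores))
```

For example, two active cores each receive a quarter of the cycles, whereas eight active cores each receive at least one eighth.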
For further details about the Xmos multicore microcontroller, we refer the reader to [14].
32 bits on single core microcontrollers. Our obtained results outperform most of the
published single core modular multiplication implementations and require much fewer
cycles. We achieve more than 85% enhancement of the measured latency in comparison to a
single-core implementation.
5) Return 𝐶(𝑧)
processors by splitting the operation into smaller blocks. There are many techniques to
implement multi-precision multiplication over $GF(2^m)$; these include product scanning [33],
hybrid scanning, operand caching, and consecutive operand-caching techniques [11].
Product scanning techniques (also known as Comba) are considered to be the most efficient
for large operands. Comba, as illustrated in Algorithm 4.1, is based on two individual outer
loops that generate the multiplier operand indices and inner loops that generate the
multiplicand operand indices. The multiplicand and multiplier operands are processed
column-wise, as explained in Figure 4.2, where $t = 4$ and the inner loops iterate
$4^2 = 16$ times in total.
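The column-wise structure described above can be sketched as follows. This is an illustrative sketch and not the thesis implementation: the word size `W = 8`, the function names `clmul_word` and `comba_gf2`, and the little-endian word order are all assumptions made for the example. Word products are carry-less, since addition in $GF(2^m)$ is XOR.

```python
W = 8                      # word size in bits (assumed for this sketch)
MASK = (1 << W) - 1

def clmul_word(x: int, y: int) -> int:
    """Carry-less (GF(2)[z]) product of two W-bit words."""
    r = 0
    for bit in range(W):
        if (y >> bit) & 1:
            r ^= x << bit
    return r

def comba_gf2(a, b):
    """Multiply two t-word polynomials column by column.

    a and b are little-endian lists of W-bit words. Returns the 2t-word
    product and the number of inner-loop iterations performed.
    """
    t = len(a)
    c = [0] * (2 * t)
    acc = 0                # column accumulator, at most 2W-1 bits
    inner = 0
    # First outer loop: columns 0 .. t-1
    for k in range(t):
        for i in range(k + 1):
            acc ^= clmul_word(a[i], b[k - i])
            inner += 1
        c[k] = acc & MASK
        acc >>= W
    # Second outer loop: columns t .. 2t-2
    for k in range(t, 2 * t - 1):
        for i in range(k - t + 1, t):
            acc ^= clmul_word(a[i], b[k - i])
            inner += 1
        c[k] = acc & MASK
        acc >>= W
    c[2 * t - 1] = acc
    return c, inner
```

With four-word operands, the two outer loops together execute the inner body 10 + 6 = 16 times, matching the count given for $t = 4$.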
Recently, Comba algorithms have also been shown to be efficient in enhancing the
implementation performance of Fully Homomorphic Encryption schemes [62], [63]. In
this work, we propose enhancing multi-precision multiplication using Comba on a
homogeneous multicore microcontroller, which allows multiple instruction flows to be carried out
concurrently. This could form a basis for further research on implementations on such
platforms.
Accordingly, the multiplication is performed using two outer-loop functions that are
called simultaneously from the main function, as shown in Figure 4.3.
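The split into two simultaneously running outer loops can be illustrated as below. This is a sketch under stated assumptions: Python threads stand in for the logical cores (they show the task split, not a speed-up), the names `columns` and `parallel_comba_gf2` are hypothetical, and each task returns its columns as one integer so that the two halves can be combined with a single XOR.

```python
from concurrent.futures import ThreadPoolExecutor

W = 8  # word size in bits (assumed for this sketch)

def clmul_word(x: int, y: int) -> int:
    """Carry-less product of two W-bit words."""
    r = 0
    for bit in range(W):
        if (y >> bit) & 1:
            r ^= x << bit
    return r

def columns(a, b, k_start, k_end):
    """XOR-accumulate product columns k_start .. k_end-1 of a*b."""
    t = len(a)
    part = 0
    for k in range(k_start, k_end):
        col = 0
        for i in range(max(0, k - t + 1), min(k, t - 1) + 1):
            col ^= clmul_word(a[i], b[k - i])
        part ^= col << (W * k)   # place the column at its word offset
    return part

def parallel_comba_gf2(a, b):
    """Run the two outer loops as two tasks and merge their results."""
    t = len(a)
    with ThreadPoolExecutor(max_workers=2) as pool:
        low = pool.submit(columns, a, b, 0, t)            # columns 0 .. t-1
        high = pool.submit(columns, a, b, t, 2 * t - 1)   # columns t .. 2t-2
        return low.result() ^ high.result()
```

Because addition in a binary field produces no carries, the two halves do not depend on each other and can be merged at the end.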
We applied the 32-bit fast modular multiplication as given in Algorithm 2, whereas for the
8-bit and 16-bit word sizes we modified Algorithm 2, as seen in Algorithm 4.4 and Algorithm
4.5 below, which illustrate fast modular reduction for 8-bit and 16-bit words, respectively.
The reduction process starts after the multiplication and operates on the 326-bit result of
the 163-bit multiplication. For the 8-bit word size ($w = 8$), the 326 bits are divided into
words, and the number of words is calculated from $(163 \times 2)/8$. The reduction
polynomial is
$f(z) = z^{163} + z^7 + z^6 + z^3 + 1$
For the 8-bit word size, no shifting instruction is required at the initial step, and the
bits are copied as they are. After that, bits 163 to 325 are left-shifted by 3 bits, and then
by 6 bits and 7 bits. Following each shift, the extra bits that move beyond the 325-bit mark
are placed back into the starting positions left vacant by the shift. The extra bits are then
serially shifted by 3, 6 and 7 bits. Afterwards, to obtain the 163-bit reduced output, all
columns from bit 0 to bit 162 are added sequentially using the XOR instruction. A similar
strategy is used for performing reduction with word size = 16.
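The shift-and-XOR folding can be sketched on whole integers, ignoring word boundaries. This is an illustrative sketch rather than the word-level Algorithm 4.4 or 4.5; the function name `reduce_163` is hypothetical. It relies on $z^{163} \equiv z^7 + z^6 + z^3 + 1 \pmod{f(z)}$, so the bits above position 162 are XORed back with shifts of 0, 3, 6 and 7.

```python
M = 163
LOW_MASK = (1 << M) - 1

def reduce_163(c: int) -> int:
    """Reduce a binary polynomial modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1."""
    while c >> M:
        high = c >> M                    # bits 163 and above
        # Fold the high part back with no shift, 3-, 6- and 7-bit shifts.
        c = (c & LOW_MASK) ^ high ^ (high << 3) ^ (high << 6) ^ (high << 7)
    return c
```

The loop runs a second time only to fold the few bits that the 3-, 6- and 7-bit shifts push above bit 162, which corresponds to the handling of the extra bits described above.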
We further summarize and compare the number of cycles obtained for $GF(2^{163})$
Comba parallel multiplication on single-core and multicore platforms with the state of the
art for $GF(2^{163})$ and $GF(160)$ multiplication. Different types of algorithms, platforms
and finite fields were used in previous state-of-the-art works, which makes a fair
comparison between our work and other published works difficult. Therefore, we used the
number of cycles as a reasonable metric for comparison and evaluation. Furthermore, [66]
used 21 registers out of 32 to execute the LD Multiplication Algorithm on a single-core
ATmega128 8-bit processor, whereas our approach on the Xmos 32-bit multicore
microcontroller used only 12 registers for single-core multiplication and 4 registers in
each task for performing simultaneous Comba.

Comba Algorithm, Time (μs)
Platform              W=32     W=16     W=8
ARDUINO MEGA2560      -        616      14116
ARDUINO DUE           -        380      689
Xmos (single core)    1.064    0.816    0.76
Xmos (two cores)      0.99     0.788    0.784

Comba Algorithm with Reduction, Time (μs)
Platform              W=16       W=8
ARDUINO MEGA2560      1040       14340
ARDUINO DUE           553        901
Xmos (single core)    198.686    311.972
Xmos (two cores)      97.596     151.76

Figure 4.4 Result Analysis of Implemented Comba Algorithm with Fast Reduction
4.6 Conclusions
In this work, we have shown that homogeneous multicore microcontroller platforms
are a feasible option to enhance the performance of Comba multiplication over binary finite
fields, thereby enhancing the performance of ECC implementations. We have detailed the
design of a modified Comba multiplier over the binary finite field $GF(2^{163})$ corresponding to
an ECC curve using an Xmos startKit homogeneous multicore platform that can be adopted in
WSN applications. The design required a modification of both the Comba algorithm and the
fast reduction step to accommodate the reduction process for 8- and 16-bit word sizes.
About 90% improvement in cycle performance was achieved compared to the single core
implementation. Further work will concentrate on implementing the ECC point multiplication
based on the modified Comba multiplier.
Chapter 5: Software Implementation of Parallelized ECC Scalar Point Multiplication over Binary Fields
5.1 Introduction
Elliptic curve cryptography (ECC)-based security has potential use in resource
constrained applications, such as RFID tags and wireless sensor networks (WSN) and
therefore the Internet of Things (IoT). Compared to RSA, ECC requires shorter length keys
for the same level of security and is computationally more efficient, and therefore, it has the
ability to provide high security with faster processing time and fewer resources. In general,
scalar point multiplication (PM) is the main operation in Elliptic Curve Cryptography [1].
The PM can be implemented either over binary extension fields $GF(2^m)$ or over prime fields
$GF(p)$. In the ECC PM, the public key is computed by multiplying a base point on the elliptic
curve, P, with a private key (an integer), K. A Koblitz curve [38] is a special elliptic curve that is
resource friendly due to its simplicity, and therefore, it is used in this work; however, a
random binary curve can also be implemented with an extra latency overhead. The ECC PM
can be implemented in software, hardware, and as a software/hardware co-design. A pure
software implementation is attractive on battery-run devices due to its flexibility and low
resource requirements.
The crucial problem of software implementations is the latency due to the word-level
computations required and frequent memory operations. Thus, different efforts have been
conducted by researchers to enhance ECC performance in pure software design by modifying
ECC related algorithms as reported by [68] and [69]. General purpose multicore processors
are being increasingly adopted as alternative platforms to single core architectures for high-
performance domain specific applications, such as ECC. For example, in [11], the authors
proposed hardware design for separated hybrid scanning parallelization for Montgomery
Multiplication using multicore approach by constructing two, four and eight soft cores on
FPGA. The reported results in this work show good speed, large communication latency
tolerance and good scalability. [70] proposed fully programmable curve-based crypto
processors to accelerate scalar point multiplication of ECC using the GEZEL
hardware/software co-design platform. Also, ECC multi-core software implementations on
Intel Xeon Quad-Core processors using OpenMP are reported in [71] and [72]. Another
example for software multicore implementation is reported by [8] where right to left double
and add algorithm is parallelized using two threads through OpenMP library. We observed
that almost all of the previous ECC multicore software implementations were implemented
on powerful platforms that may not be suited to low-resource WSN type applications using
devices with limited resources and lower clock speeds.
Here, the proposed multicore ECC point multiplication is compared with a single-core
version implemented on the same Xmos device and on an Arduino Due for $GF(2^{163})$. The
proposed multicore implementation performs 60% better than the single-core implementation.
It is also evident that the multicore design performed better in the embedded implementation.
the proposed ECC and results. Finally, we end this chapter with some conclusions in Section 5.6.
5.2 Background
The elliptic curve E over a binary field is defined as the set of points satisfying the
Weierstrass equation, together with the point at infinity O:
$E: y^2 + xy = x^3 + ax^2 + b$ (5.1)
where $a, b \in GF(2^m)$ and $b \neq 0$. The fundamental operation of elliptic curve
cryptography is scalar point multiplication, which is defined in (5.2) as follows:
$Q = k \cdot P$ (5.2)
where k is an integer, P is a point on the elliptic curve and Q is a new point on the elliptic
curve. The new point Q is produced by scalar point multiplication, $k \cdot P = P + \dots + P + P$,
that is, Q is the result of k-1 point additions of P.
Different point multiplication algorithms are presented in [1] to compute (5.2). The
performance of ECC depends on the point multiplication and its associated coordinate
system. In this work, standard Lopez-Dahab projective coordinates are chosen to avoid
the expensive field inversion operation that is involved in the affine coordinate system.
Moreover, these coordinates require less computation than Jacobian projective
coordinates [1]. In Lopez-Dahab (LD) projective coordinates, a point $(X, Y, Z)$ with
$Z \neq 0$ on the elliptic curve corresponds to the affine point $(X/Z, Y/Z^2)$ [1].
$Z_3 \leftarrow X_1^2 Z_1^2$
$X_3 \leftarrow X_1^4 + b Z_1^4$
The point doubling based on the LD projective coordinate [1] has 4 finite field multiplication
operations, 5 finite field squaring operations and 4 finite field addition operations, as shown
in Algorithm 5.1. The algorithm has two temporary variables 𝑇1 and 𝑇2 to save intermediate
results of point doubling.
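1: If $P = \infty$ then return $(\infty)$
2: $T_1 \leftarrow Z_1^2$ {$T_1 \leftarrow Z_1^2$}
7: $T_2 \leftarrow T_1 \cdot b$ {$T_2 \leftarrow b Z_1^4$}
13: $T_1 \leftarrow T_2 \cdot Z_3$ {$b Z_1^4 Z_3$}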
$D \leftarrow B^2 (C + a Z_1^2)$
$Z_3 \leftarrow C^2$
$E \leftarrow A \cdot C$
$X_3 \leftarrow A^2 + D + E$
$F \leftarrow X_3 + x_2 \cdot Z_3$
$G \leftarrow (x_2 + y_2) \cdot Z_3^2$
$Y_3 \leftarrow (E + Z_3) \cdot F + G$
The point addition algorithm has 8 finite field multiplications, 5 squarings and 9 additions, as
shown in Algorithm 5.2. In addition, $T_1$, $T_2$ and $T_3$ are temporary variables used for
the addition operations.
For the projective to affine coordinate conversion, a single field inversion is used. In this
work, we implement the left-to-right binary method for point multiplication, as
shown in Algorithm 5.3 [1], that is suitable for an initial operation.
OUTPUT: $kP$
1) $Q \leftarrow \infty$
2) For $i$ from $t-1$ down to 0 do
a. $Q \leftarrow 2Q$
b. if $k_i = 1$ then $Q \leftarrow Q + P$
3) Return ($Q$)
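The control flow of the left-to-right binary method can be sketched independently of the curve arithmetic. In the sketch below, the group operations are passed in as functions; the name `double_and_add` and its parameters are hypothetical, and the usage example substitutes integers modulo n for elliptic curve points purely to exercise the loop.

```python
def double_and_add(k, P, t, add, double, infinity):
    """Left-to-right binary method: returns kP for a t-bit scalar k."""
    Q = infinity
    for i in range(t - 1, -1, -1):   # scan k from bit t-1 down to bit 0
        Q = double(Q)                # Q <- 2Q
        if (k >> i) & 1:
            Q = add(Q, P)            # Q <- Q + P
    return Q

# Stand-in group for illustration only: integers modulo n under addition.
n = 1009
result = double_and_add(0b1011, 7, 4,
                        add=lambda a, b: (a + b) % n,
                        double=lambda a: (2 * a) % n,
                        infinity=0)
```

Here `result` equals 11 · 7 = 77. On a curve, `add` and `double` would be the point addition and point doubling routines described above.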
In this work, we adopt the product scanning algorithm (i.e. the Comba algorithm) for its
reported efficiency [74]. The algorithm runs two nested loops, an outer loop and an inner
loop. The outer loop handles the index of the multiplier and the inner loop is
responsible for generating the index of the multiplicand. The number of times the inner loop
iterates depends on the number of words required for a given field size ($m$). For example, if
there are $t$ words, the number of iterations will be $t$. In each inner loop iteration, there is
one $GF(2^m)$ multiplication, one XOR and two load operations for collecting each column
product. The column products are $2w$ bits in size. Thus, a second store operation is required
in the outer loop to hold the result of the partial product.
The main idea of the linear algorithm is to accomplish the squaring by inserting a zero
between every two consecutive bits of $a(z)$, starting from bit position 1. In this process, the
odd positions are filled with zeros and the even positions are loaded with the bits of the input
polynomial. After the squaring operation, the output has $2m-1$ bits and must be reduced to
$m$ bits by a reduction operation.
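The zero-insertion step can be sketched with a byte-wise lookup table. The table-based approach, the 8-bit word size and the names `SQ_TABLE` and `square_gf2` are assumptions made for this illustration; the text above specifies only the bit interleaving itself.

```python
# SQ_TABLE[b] spreads the 8 bits of b over the even positions of a 16-bit value,
# leaving zeros in the odd positions.
SQ_TABLE = [sum(((b >> i) & 1) << (2 * i) for i in range(8)) for b in range(256)]

def square_gf2(words):
    """Square a binary polynomial given as little-endian bytes.

    Each input byte expands to two output bytes; the result still has to be
    reduced modulo the field polynomial.
    """
    out = []
    for w in words:
        s = SQ_TABLE[w]
        out.append(s & 0xFF)   # low byte of the expanded word
        out.append(s >> 8)     # high byte of the expanded word
    return out
```

For instance, the byte 0b00000111 expands to 0b00010101 followed by a zero byte, since each input bit moves to twice its original position.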
fast reduction polynomials [1]. Thus, the modified 8-bit fast reduction algorithm that we
proposed in [74] is implemented over $GF(2^{163})$ with the irreducible polynomial.
Additionally, the remaining $GF(2^m)$ operations were adopted and implemented as described in [1].
As illustrated in Figure 5.1, the result of a multiplication or squaring operation is 325 bits.
We need to reduce the 325 bits to 163 bits (i.e. bits 162 to 0). In the fast reduction method,
the 163rd to 325th bits are added to the 0th to 162nd bits with shifting. For the irreducible
polynomial in (3) over $GF(2^m)$, the bitwise addition is performed with no shifting, 3-bit
shifting, 6-bit shifting and finally 7-bit shifting. Moreover, the 3-bit, 6-bit and 7-bit
shifting operations shift extra bits over the 162nd bit, as shown in Figure 5.1. For example,
three extra bits overflow due to the 3-bit shifting, and therefore these extra bits must be
added to the bits starting from the rightmost bit (0th bit) in a shifted fashion, as per the
order of the irreducible polynomial. Thus, in 3-bit shifting, the extra bits (323rd, 324th and
325th) are added with no shift, a 3-bit shift, a 6-bit shift and finally a 7-bit shift from the
rightmost bit (0th bit). The same approach applies for the 6-bit and 7-bit shifting cases, as
shown in Figure 5.1.
Figure 5.1 Shifting Operation in Fast Reduction Process with word size=8
1: If $P = \infty$ then return $(\infty)$
2: $T_1 \leftarrow Z_1^2$ {$T_1 \leftarrow Z_1^2$}
In Algorithm 5.2, there are 26 steps, including the conditional step that triggers the point
doubling operation, $Y_1 = 0$. In the modified algorithm, Algorithm 5.5, we reduced the
number of steps to 20. Furthermore, in this work, we advocate a new approach
to parallelization, namely interleaving parallelization, as shown, for example, in step 16 of
the modified algorithm.
5.3.2 Proposed Left to Right Double and Add Scalar Point Multiplication
In order to obtain better time complexity, we modified the left-to-right algorithm [1].
The enhancement is achieved by first scanning the most significant bits (MSBs) of k in
order to find the first non-zero bit from the MSB. When the first non-zero bit is found, P's
coordinates are loaded into Q to start the loop operation, as shown in Algorithm 3. The start
position of the loop is the position of the first non-zero bit, which is stored in the variable
"pos" as shown in the algorithm. The loop operation then continues from "pos-1" to the 0th
bit of k. The scanning process reduces latency for k inputs with zeros in the MSBs.
$Q = kP = (Q_x, Q_y, Q_z)$
1: $Q \leftarrow \infty$
a) if $K_i$ = '0' then
1. break
2. end if
b) else if $K_i$ = '1' then
1. load Q = P
2. store i in pos
3. end if
c) end for
3: For i in pos-1 to 0 do
a) Perform_point_doubling: $Q \leftarrow 2Q$
b) if $K_i$ = '1' then
1. Perform_point_addition: $Q \leftarrow P + Q$
2. end if
c) end for
4: Return (Q)
In [1], the expected cost of the algorithm is $mD$ point doublings (D) and $(m/2)A$ point
additions (A). In the proposed Algorithm 3, the main loop starts only after the first
non-zero bit from the MSBs has been found. Thus, the number of point addition operations
will be $(m/2-1)$ and the number of point doubling operations will be $(m-(m-1-pos))$, where
$(m-1-pos)$ represents the number of zeros at the MSBs. Therefore, the expected running
time for the scalar point multiplication can be represented by
$(m/2-1)A + (m-(m-1-pos)-1)D$, which is potentially lower than that of the original
algorithm in [1].
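The effect of the MSB scan on the operation count can be checked with a small sketch. The function name `scan_double_and_add` is hypothetical, the scan in this sketch stops at the first non-zero bit, and integers modulo n again stand in for curve points so that the loop and the counters can be exercised without the field arithmetic.

```python
def scan_double_and_add(k, P, m, add, double, infinity):
    """Left-to-right method that skips the leading zero bits of an m-bit scalar.

    Returns (kP, number_of_doublings, number_of_additions).
    """
    pos = None
    for i in range(m - 1, -1, -1):       # scan from the MSB
        if (k >> i) & 1:
            pos = i                      # position of the first non-zero bit
            break
    if pos is None:
        return infinity, 0, 0            # k = 0
    Q = P                                # load Q = P for the leading one
    doublings = additions = 0
    for i in range(pos - 1, -1, -1):
        Q = double(Q)
        doublings += 1
        if (k >> i) & 1:
            Q = add(Q, P)
            additions += 1
    return Q, doublings, additions

# Stand-in group for illustration only: integers modulo n under addition.
n = 1009
best_case = scan_double_and_add(3, 7, 163,
                                add=lambda a, b: (a + b) % n,
                                double=lambda a: (2 * a) % n,
                                infinity=0)
```

For k = 3 in a 163-bit scalar, the sketch performs one load, one doubling and one addition, and the number of doublings in general equals pos, in agreement with the $(m-(m-1-pos)-1)D$ term above.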
Figure 5.2 Sequential ECC Scalar Point Multiplication Intel Vtune Analysis [3]
Figure 5.3 Parallel ECC Scalar Point Multiplication xmos single core
The optimal utilization of the cores is a key part of the optimization of ECC point
multiplication, as shown in Table 5.1. In this work, we managed to utilize all 8 cores
provided. Notably, the performance of each binary finite field arithmetic operation that
is implemented on a particular core of the Xmos device can be obtained. The parallel
operation has a latency equivalent to one multiplication latency, whereas the other
operations (field squaring and field addition) are performed on the fly. Furthermore, the
overall number of processors for performing the scalar point multiplication over projective
coordinates is tested for k=3 {"00…011"}, which involves a load Q operation for the MSB
of k, 1, followed by a point doubling and a point addition for the last bit of k, 1.
To quantify this, k inputs for the best case (k=3), the average case (the same as the
complexity in [8]) and the worst case (k=0x"07ff..f") were investigated. Table 5.2
illustrates the time complexity of the scalar point multiplication for different inputs.
𝑘=0
Infinity 69093 64
𝑘=1
Loading 70270 65
Multiplication 5 318
Point Doubling for 4 logic cores (core 0 Squaring 5.648 353
Step Number 4 To core 3)
Addition 1.536 96
Multiplication 5 318
Point Addition One Core used
Step No 10 (Core 0)
Squaring 5.648 353
As already stated, to our knowledge, our work is the first attempt to implement ECC scalar
point multiplication on a homogeneous multicore microcontroller. The comparison is
therefore intended to provide context rather than a like-for-like comparison, as the
platforms are not comparable in terms of available resources, as shown in Table 5.3.
C. Negre [6]    Qualcomm Snapdragon    Parallel (Double, Halve)-and-add with NAF4    1060    1591 × 10^3    $GF(2^{233})$
5.6 Conclusions
In this chapter, we proposed new modified algorithms that overcome data
dependencies in ECC computations and hence enable parallel implementation of ECC on
multi-core platforms efficiently. A pure software implementation for ECC scalar point
multiplication over 𝐺𝐹(2𝑚 ) using the Xmos multi-core microcontroller was implemented
using these algorithms, which confirmed the feasibility and improvements of adopting
parallelism in ECC implementations. We advocate that homogeneous multicore platforms
can be useful for resource-constrained applications where strong security is required.
Potentially, our parallelization approach could be adopted to improve cryptography
operations and to open up the potential of having strong public key cryptography in software
with high performance and flexibility on a range of multicore microcontroller platforms.
Hence, future work will investigate deploying these algorithms on alternative multicore
platforms.
Chapter 6: Software Design: Fast Parallel ECC Point Multiplication over Prime Fields
6.1 Introduction
There is an expectation that the emergence of multicore processing will enable a
new generation of sensor nodes suited to the anticipated growth of information-rich
applications using Wireless Sensor Networks (WSNs) [76]. Specifically, low-power
homogeneous multicore microcontrollers can be very attractive for sensing the necessary data
and carrying out the required computations across multiple cores concurrently. Security
remains a major concern, as it is for single-core sensor nodes. Hence, there is a need to
consider the safety of the devices at the time of implementation. ECC has developed as an
alternative to RSA in public key-based security applications due to its higher security
per bit. However, the complex ECC computations based on either
binary $GF(2^m)$ or prime $GF(p)$ fields make their implementation on small, low-power
devices challenging.
Although [77] concluded that ECC over binary fields is faster than ECC over prime
fields in hardware implementations, the timing results reported by [1] for ECC software
implementations on Intel Pentium processors show the opposite. In fact, [1] reported faster
performance for the NIST-recommended prime field curves than for the binary field
curves. A pure ECC software implementation over prime fields optimized for high
performance would be attractive for implementation on sensor node processors. Researchers
have made different efforts to enhance ECC performance on these devices, such as
the works reported in [78-80].
Parallel ECC implementations over prime fields have been reported for both hardware
and software platforms [8, 81, 82]. For example, [8, 81, 82] improved the ECC
performance over prime fields for 256-bit and 160-bit $GF(p)$ using a Read after Write (RAW)
strategy implemented in a Modular Arithmetic Logic Unit (MALU) in hardware.
Similarly, in [8, 81, 82] the authors proposed a parallel software approach for enhancing the
scalar point multiplication over the prime field ($p = 2^{255} - 19$) on an Intel Core 2
workstation. We noticed that parallel software for ECC over prime fields was implemented on
powerful platforms. These implementations may not suit low-resource, WSN-type applications
using devices with limited resources and lower clock speeds.
The contribution of this work is mainly to further explore the feasibility and potential
of parallelizing and improving ECC scalar point multiplication over prime fields for four SECG
curves, secp128r1, secp192r1, secp256r1 and secp384r1, on a homogeneous multicore
microcontroller (Xmos) [14], [15]. To our knowledge, this is the first ECC over prime
The parameters $G$, $n$ and $h$ define the base point, the order of $G$ and the cofactor,
respectively. These four curves follow the definition given by equation (1). In fact, these
curves share the same basic arithmetic operations, which work for all of them. The only
differences between them are the modulus and the size of the numbers. Therefore, in the
coming subsections, we provide more detail on the prime finite field arithmetic
operations involved in our work.
In principle, the construction of an elliptic curve depends mainly on the selection of
the point representation, point doubling, point addition and point multiplication. Implementing
these operations makes it possible to create a trapdoor for implementing different protocols
such as the Digital Signature Algorithm (DSA) and Diffie-Hellman based encryption. Different
ECC standards bodies recommend implementing ECC over an underlying finite field (Galois
Field, GF), either a prime field or a binary field $GF(2^m)$ [15].
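1: $(t, u, v) \leftarrow 0$
2: for i from 0 by 1 to s do
3: for j from 0 by 1 to i do
5: end for
6: $P_i \leftarrow v$
7: $v \leftarrow u$, $u \leftarrow t$, $t \leftarrow 0$
8: end for
13: $P_i \leftarrow v$
14: $v \leftarrow u$, $u \leftarrow t$, $t \leftarrow 0$
15: $P_{2s-1} \leftarrow v$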
In this work, we considered the FIPS 186-2 standard, which provides the different moduli
illustrated below that enable fast reduction algorithms, especially for a word size of 32 bits.
We also provide an example of these algorithms (Algorithm 6.2). For further
information and algorithms, we refer the reader to [1], Section 2.2.6:
$s_1 = (c_7, c_6, c_5, c_4, c_3, c_2, c_1, c_0)$,
$s_2 = (c_{15}, c_{14}, c_{13}, c_{12}, c_{11}, 0, 0, 0)$,
$s_3 = (0, c_{15}, c_{14}, c_{13}, c_{12}, 0, 0, 0)$,
$s_4 = (c_{15}, c_{14}, 0, 0, 0, c_{10}, c_9, c_8)$,
$s_5 = (c_8, c_{13}, c_{15}, c_{14}, c_{13}, c_{11}, c_{10}, c_9)$,
$s_6 = (c_{10}, c_8, 0, 0, 0, c_{13}, c_{12}, c_{11})$,
$s_7 = (c_{11}, c_9, 0, 0, c_{15}, c_{14}, c_{13}, c_{12})$,
$s_8 = (c_{12}, 0, c_{10}, c_9, c_8, c_{15}, c_{14}, c_{13})$,
$s_9 = (c_{13}, 0, c_{11}, c_{10}, c_9, 0, c_{15}, c_{14})$,
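The nine word-level terms can be checked against a plain modular reduction. The sketch below assumes that the terms belong to the reduction modulo the NIST prime $p_{256} = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$ with 32-bit words $c_0, \dots, c_{15}$, and that they are combined as $s_1 + 2s_2 + 2s_3 + s_4 + s_5 - s_6 - s_7 - s_8 - s_9$, which is the usual combination for this prime but is not shown in the excerpt above. The names `pack` and `reduce_p256` are hypothetical.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def pack(words):
    """Join 32-bit words, listed most significant first, into one integer."""
    x = 0
    for w in words:
        x = (x << 32) | w
    return x

def reduce_p256(c: int) -> int:
    """Reduce a product of up to 512 bits using the word-level terms s1..s9."""
    w = [(c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]   # w[i] is c_i
    s1 = pack([w[7], w[6], w[5], w[4], w[3], w[2], w[1], w[0]])
    s2 = pack([w[15], w[14], w[13], w[12], w[11], 0, 0, 0])
    s3 = pack([0, w[15], w[14], w[13], w[12], 0, 0, 0])
    s4 = pack([w[15], w[14], 0, 0, 0, w[10], w[9], w[8]])
    s5 = pack([w[8], w[13], w[15], w[14], w[13], w[11], w[10], w[9]])
    s6 = pack([w[10], w[8], 0, 0, 0, w[13], w[12], w[11]])
    s7 = pack([w[11], w[9], 0, 0, w[15], w[14], w[13], w[12]])
    s8 = pack([w[12], 0, w[10], w[9], w[8], w[15], w[14], w[13]])
    s9 = pack([w[13], 0, w[11], w[10], w[9], 0, w[15], w[14]])
    # Python's % maps the signed sum into [0, P256); on a microcontroller this
    # final step would be a few conditional additions or subtractions of p256.
    return (s1 + 2 * s2 + 2 * s3 + s4 + s5 - s6 - s7 - s8 - s9) % P256
```

The benefit of this form is that the reduction needs only word moves, additions and subtractions, with no division or multiplication.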
costly due to the field inversion operations which are required during the Elliptic Curve
Scalar Point Multiplication (ECSPM) computations.
There are several types of projective coordinates that can help avoid the inversion
operations; these include standard projective coordinates, Jacobian projective coordinates
and Chudnovsky coordinates. We used nano-ecc [19], which is an open-source library,
and Jacobian coordinates are used in this work. The selection of this type is based on
the good results reported by [1], [65], [69] and [85].
1: $(t, u, v) \leftarrow 0$
2: par{
3: for i from 0 by 1 to s do
4: for j from 0 by 1 to i do
5: $(t, u, v) \leftarrow (t, u, v) + a_j \times b_{i-j}$
6: end for
7: $P_i \leftarrow v$
8: $v \leftarrow u$, $u \leftarrow t$, $t \leftarrow 0$
10: end for
11: end par}
12: par{
13: for i from s by 1 to 2s-1 do
14: for j from i+1-s by 1 to s do
15: $(t, u, v) \leftarrow (t, u, v) + a_j \times b_{i-j}$
16: end for
17: $P_i \leftarrow v$
18: $v \leftarrow u$, $u \leftarrow t$, $t \leftarrow 0$
19: end for
20: end par}
21: $P_{2s-1} \leftarrow v$
During each iteration, the inner loops perform a multiply-and-accumulate operation,
in which two $w$-bit words are multiplied and the $2w$-bit product is added to the
cumulative sum. However, the cumulative sum is likely to be longer than $2w$ bits, so three
$w$-bit registers are required to store it. As shown in Algorithm 6.1, the cumulative sum is
held in $(t, u, v)$, which represents the integer value $t \cdot 2^{2w} + u \cdot 2^w + v$.
The operations carried out in lines 7 and 14 of Algorithm 6.1 represent a $w$-bit right shift
of the cumulative sum $(t, u, v)$.
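The role of the three-word accumulator can be sketched as follows. This is an illustrative sequential sketch, not the parallel version: the names `mac` and `comba_mul` are hypothetical, the word size $w = 32$ is assumed, and the loop bounds are adapted to zero-based indexing with $s$ words per operand.

```python
W = 32
MASK = (1 << W) - 1

def mac(t, u, v, x, y):
    """(t, u, v) <- (t, u, v) + x*y, where t, u and v are W-bit words."""
    acc = ((t << (2 * W)) | (u << W) | v) + x * y
    return acc >> (2 * W), (acc >> W) & MASK, acc & MASK

def comba_mul(a, b):
    """Product scanning of two s-word integers (little-endian word lists)."""
    s = len(a)
    p = [0] * (2 * s)
    t = u = v = 0
    for i in range(s):                     # columns 0 .. s-1
        for j in range(i + 1):
            t, u, v = mac(t, u, v, a[j], b[i - j])
        p[i] = v
        v, u, t = u, t, 0                  # W-bit right shift of (t, u, v)
    for i in range(s, 2 * s - 1):          # columns s .. 2s-2
        for j in range(i + 1 - s, s):
            t, u, v = mac(t, u, v, a[j], b[i - j])
        p[i] = v
        v, u, t = u, t, 0
    p[2 * s - 1] = v
    return p
```

The third word `t` absorbs the carries that accumulate when several $2w$-bit products are added in the same column, which is why two registers are not sufficient.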
The above algorithm can be computed using 6 field registers when $a = -3$. In this library, a
modified Jacobian point doubling (Algorithm 14) developed in [83] has been selected.
However, we propose a modification to speed up the point doubling computation.
Our proposed modification executes finite field operations within the point doubling
algorithm in parallel. We first carried out a Read After Write (RAW) dependency analysis of
the algorithm to identify operations that can run in parallel, and then scheduled several of
them in parallel in a way that suits our target platform. For example, we parallelized
step 3 ($T_4 \leftarrow T_4^2$) with step 4 ($T_2 \leftarrow T_2 \cdot T_3$) since there is
no data dependency between them. Using this methodology together with careful rescheduling,
we reduced the point doubling to 15 steps, compared to the 18 steps reported by [83]. The
following algorithm gives a low-level description of the new parallel version of the modified
Jacobian point doubling algorithm of [83].
($T_1 = X_1$, $T_2 = Y_1$, $T_3 = Z_1$)
1: $T_4 \leftarrow T_2^2$ [$Y_1^2$]
2: $T_5 \leftarrow T_1 \cdot T_4$ [$X_1 Y_1^2 = A$]
3: $T_4 \leftarrow T_4^2$ [$Y_1^4$] || $T_2 \leftarrow T_2 \cdot T_3$ [$Y_1 Z_1 = Z_3$] || $T_3 \leftarrow T_3^2$ [$Z_1^2$] [Parallel Operations]
4: $T_1 \leftarrow T_1 + T_3$ [$X_1 + Z_1^2$] || $T_3 \leftarrow T_3 + T_3$ [$2Z_1^2$] [Parallel Operations]
5: $T_3 \leftarrow T_1 - T_3$ [$X_1 - Z_1^2$]
6: $T_1 \leftarrow T_1 \cdot T_3$ [$X_1^2 - Z_1^4$]
7: $T_3 \leftarrow T_1 + T_1$ [$2(X_1^2 - Z_1^4)$]
8: $T_1 \leftarrow T_1 + T_3$ [$3(X_1^2 - Z_1^4)$]
9: $T_1 \leftarrow T_1 / 2$ [$\frac{3}{2}(X_1^2 - Z_1^4) = B$]
10: $T_3 \leftarrow T_1^2$ [$B^2$]
11: $T_3 \leftarrow T_3 - T_5$ [$B^2 - A$]
12: $T_3 \leftarrow T_3 - T_5$ [$B^2 - 2A = X_3$]
13: $T_5 \leftarrow T_5 - T_3$ [$A - X_3$]
14: $T_1 \leftarrow T_1 \cdot T_5$ [$B(A - X_3)$]
15: $T_1 \leftarrow T_1 - T_4$ [$B(A - X_3) - Y_1^4 = Y_3$]
Return ($T_3$, $T_1$, $T_2$)
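The register schedule of the 15-step doubling can be checked with a short sketch. It is an illustration under stated assumptions rather than the thesis code: the names `jacobian_double`, `half`, `to_affine` and `affine_double` are hypothetical, the prime $2^{127} - 1$ and the curve $y^2 = x^3 - 3x + 7$ with the point (2, 3) are chosen only for testing and are not among the thesis curves, and a parallel step is modelled by a Python tuple assignment, which reads all right-hand sides before writing any register. Step 12 subtracts $T_5$ (the value A), as the bracketed result $B^2 - 2A$ requires.

```python
P_MOD = 2**127 - 1      # illustrative prime (assumption), not a thesis curve
A_COEFF = -3            # the doubling formulas below assume a = -3
B_COEFF = 7             # chosen so that (2, 3) satisfies y^2 = x^3 - 3x + 7


def half(x, p=P_MOD):
    """Return x/2 mod p for odd p and 0 <= x < p, without a modular inversion."""
    return (x + p) >> 1 if x & 1 else x >> 1


def jacobian_double(X1, Y1, Z1, p=P_MOD):
    """Jacobian point doubling for a = -3, following the 15-step schedule."""
    t1, t2, t3 = X1 % p, Y1 % p, Z1 % p
    t4 = t2 * t2 % p                                      # 1: Y1^2
    t5 = t1 * t4 % p                                      # 2: A = X1*Y1^2
    t4, t2, t3 = t4 * t4 % p, t2 * t3 % p, t3 * t3 % p    # 3: Y1^4 || Z3 || Z1^2
    t1, t3 = (t1 + t3) % p, (t3 + t3) % p                 # 4: X1+Z1^2 || 2*Z1^2
    t3 = (t1 - t3) % p                                    # 5: X1 - Z1^2
    t1 = t1 * t3 % p                                      # 6: X1^2 - Z1^4
    t3 = (t1 + t1) % p                                    # 7: 2(X1^2 - Z1^4)
    t1 = (t1 + t3) % p                                    # 8: 3(X1^2 - Z1^4)
    t1 = half(t1, p)                                      # 9: B
    t3 = t1 * t1 % p                                      # 10: B^2
    t3 = (t3 - t5) % p                                    # 11: B^2 - A
    t3 = (t3 - t5) % p                                    # 12: X3 = B^2 - 2A
    t5 = (t5 - t3) % p                                    # 13: A - X3
    t1 = t1 * t5 % p                                      # 14: B(A - X3)
    t1 = (t1 - t4) % p                                    # 15: Y3
    return t3, t1, t2                                     # (X3, Y3, Z3)


def to_affine(X, Y, Z, p=P_MOD):
    """Convert Jacobian (X : Y : Z) to affine (X/Z^2, Y/Z^3); needs Python 3.8+."""
    z_inv = pow(Z, -1, p)
    return X * z_inv**2 % p, Y * z_inv**3 % p


def affine_double(x, y, p=P_MOD):
    """Reference affine doubling, used only to check the Jacobian result."""
    lam = (3 * x * x + A_COEFF) * pow(2 * y, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    return x3, (lam * (x - x3) - y) % p
```

Calling `to_affine(*jacobian_double(2, 3, 1))` should give the same point as `affine_double(2, 3)`. The output differs from the textbook Jacobian doubling only by a scaling of the projective representative (Z3 is $Y_1 Z_1$ rather than $2 Y_1 Z_1$), which is why step 9 halves the slope term.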
This mechanism yields a very efficient co-Z point addition with a cost of 5M+2S+7A. In this
work, we propose a parallel co-Z point addition that removes dependencies in order to speed
up the point addition. We thereby reduced the number of steps required for the co-Z point
addition to 7, compared to the 13 steps reported by [83], as shown in Algorithm 6.4.
Algorithm 6.5 Modified (X, Y)-only co-Z addition with update (XYCZ-ADD)
INPUT: $(X_1, Y_1)$ and $(X_2, Y_2)$ s.t. $P \equiv (X_1 : Y_1 : Z)$ and $Q \equiv (X_2 : Y_2 : Z)$ for some $Z \in \mathbb{F}_q$, $P, Q \in E(\mathbb{F}_q)$
OUTPUT: $(X_3, Y_3)$ and $(X'_1, Y'_1)$ s.t. $P \equiv (X'_1 : Y'_1 : Z_3)$ and $P + Q \equiv (X_3 : Y_3 : Z_3)$ for some $Z_3 \in \mathbb{F}_q$
($T_1 = X_1$, $T_2 = Y_1$, $T_3 = X_2$, $T_4 = Y_2$)
1: $T_5 \leftarrow T_3 - T_1$ [$X_2 - X_1$]
2: $T_5 \leftarrow T_5^2$ [$(X_2 - X_1)^2 = A$]
3: $T_1 \leftarrow T_1 \cdot T_5$ [$X_1 A = B$] || $T_3 \leftarrow T_3 \cdot T_5$ [$X_2 A = C$] [Parallel Operations]
4: $T_4 \leftarrow T_4 - T_2$ [$Y_2 - Y_1$]
5: $T_5 \leftarrow T_4^2$ [$(Y_2 - Y_1)^2 = D$]
6: $T_5 \leftarrow T_5 - T_1$ [$D - B$]
7: $T_5 \leftarrow T_5 - T_3$ [$X_3$] || $T_3 \leftarrow T_3 - T_1$ [$C - B$]
   $T_2 \leftarrow T_2 \cdot T_3$ [$Y_1(C - B)$] || $T_3 \leftarrow T_1 - T_5$ [$B - X_3$]
   $T_4 \leftarrow T_4 \cdot T_3$ [$(Y_2 - Y_1)(B - X_3)$] || $T_4 \leftarrow T_4 - T_2$ [$Y_3$]
   [Parallel Operations]
with $F = (Y_1 + Y_3)^2$ and (A, B, C, D, E) as defined in (4). In [83], the authors proposed
a co-Z conjugate addition algorithm with 19 steps and a cost of 5M+3S+16A, using 6 field
registers. In this work, however, the proposed approach requires just 15 steps, as shown in
Algorithm 6.4.
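The (X, Y)-only co-Z addition with update can be sketched in the same way. The sketch below is an assumption-laden illustration, not the thesis implementation: `xycz_add` and `affine_add` are hypothetical names, the prime $2^{127} - 1$ is chosen only for testing, and only the pairs of operations that are clearly independent are grouped into a tuple assignment (all right-hand sides are read before any register is written). The remaining operations run in sequence.

```python
P_MOD = 2**127 - 1      # illustrative prime (assumption), not a thesis curve


def xycz_add(X1, Y1, X2, Y2, p=P_MOD):
    """Co-Z addition with update on (X, Y) coordinates only.

    The inputs share an implicit Z coordinate. Returns ((X1', Y1'), (X3, Y3)),
    the updated P and the sum P + Q, which share Z3 = Z * (X2 - X1).
    """
    t1, t2, t3, t4 = X1 % p, Y1 % p, X2 % p, Y2 % p
    t5 = (t3 - t1) % p                        # X2 - X1
    t5 = t5 * t5 % p                          # A = (X2 - X1)^2
    t1, t3 = t1 * t5 % p, t3 * t5 % p         # B = X1*A || C = X2*A
    t4 = (t4 - t2) % p                        # Y2 - Y1
    t5 = t4 * t4 % p                          # D = (Y2 - Y1)^2
    t5 = (t5 - t1) % p                        # D - B
    t5, t3 = (t5 - t3) % p, (t3 - t1) % p     # X3 = D - B - C || C - B
    t2 = t2 * t3 % p                          # E = Y1 (C - B), updated Y1'
    t3 = (t1 - t5) % p                        # B - X3
    t4 = t4 * t3 % p                          # (Y2 - Y1)(B - X3)
    t4 = (t4 - t2) % p                        # Y3
    return (t1, t2), (t5, t4)


def affine_add(x1, y1, x2, y2, p=P_MOD):
    """Reference affine addition for two points with x1 != x2 (check only)."""
    lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p
```

To compare against `affine_add`, divide the returned X values by $Z_3^2$ and the Y values by $Z_3^3$, where $Z_3 = Z (X_2 - X_1)$ is computed from the input coordinates. The updated P maps back to the original affine point, which is what allows the next co-Z operation to reuse it without any inversion.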
Figure 6.1 Sequential Single Core ECC Scalar Point Multiplication in xmos (bar chart; axes: No of Cycles and Time (µs); curves: secp128r1, secp192r1, secp256r1, secp384r1)
Then, we applied our parallelization to the Comba algorithm and to the point doubling and
point addition algorithms, as shown in Figure 6.2.
As can be seen from these figures, we reduced the time and the number of clock cycles
needed to compute the ECC scalar point multiplication by up to 85% for some curves. In this
work, we used 6 of the 8 logical cores provided by the XMOS startKIT.
To our knowledge, our work is the first attempt to implement ECC scalar point
multiplication over the prime field GF(P) on a homogeneous multicore microcontroller. We
report the latest contributions concerned with parallel software implementation. In this
table, we aim to give context rather than a like-for-like comparison, as the platforms are not
comparable with respect to resource availability. To sum up, our implementation shows the
feasibility of using a parallel approach on a homogeneous multicore microcontroller to improve
ECC performance in constrained devices. This is an attractive option for low-resource
applications, as it helps strong public key cryptography schemes and protocols to run faster
on smaller devices.
[Figure 6.2: bar chart; axis: Time (µs); curves: secp128r1, secp192r1, secp256r1, secp384r1]
6.6 Conclusion
In this chapter, we addressed the practical feasibility of a parallel software
implementation of ECC scalar point multiplication on a homogeneous multicore
microcontroller. In particular, we proposed an efficient ECC scalar point multiplication
implementation that runs on the XMOS startKIT and scales to the different standard ECC
curves over the prime field GF(P) recommended by SECG [16]. To maximize the performance of
ECC point multiplication on the multicore microcontroller, three novel parallelizations have
been proposed: the parallelization of the Comba finite field multiplication algorithm, of the
point doubling algorithm and of the point addition algorithm. This implementation runs the
whole ECC point multiplication in only 123 µs for secp128r1, 129 µs for secp192r1, 224 µs for
secp256r1 and 18.82 µs for secp384r1. Our proposed methods boost cryptographic operations
and show that strong public key cryptography can be implemented in parallel software with
high performance and flexibility on a multicore microcontroller platform.
Chapter 7: Conclusion and Future Research
Throughout the third chapter of this thesis, we proposed enhancing the performance of
ECC on Arduino 8-bit and 32-bit microcontrollers. In this work, we imported RELIC
(relic-toolkit), a well-known modern cryptographic library written in C. RELIC provides high
efficiency and flexibility for modern cryptography, and supports different cryptographic
protocols such as ECDSA, ECDH, ECMQV and ECSS. It also offers a wide range of configurable
algorithms, which can be selected during the RELIC setup. In particular, relic-0.3.1 was
imported onto the two platform boards, and we measured the performance of ECDSA and ECDH
over binary fields using different standard NIST curves (NIST-K163, NIST-B163). We then
enhanced the performance of ECDSA by identifying the best combination of algorithms in the
RELIC presets, and reported better performance on the Arduino DUE than on the Arduino
Mega2560.
In the fourth chapter, we worked on enhancing the multiplication operation in
$GF(2^{163})$. Two novelties were demonstrated. Firstly, we parallelized the Comba algorithm
on a homogeneous multicore microcontroller (XMOS), which results in much better performance
than a single-core software implementation of Comba multiplication on a microcontroller.
The second novelty is that we modified a fast 32-bit modular reduction algorithm to
support 8-bit words in $GF(2^{163})$ and integrated it with the Comba algorithm. With this,
we were able to report better performance of the Comba algorithm for both implementations:
parallel Comba without reduction and Comba multiplication with reduction.
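As a reminder of what multiplication with reduction in $GF(2^{163})$ involves, the following sketch combines a word-level product-scanning loop over 8-bit words with a reduction modulo the NIST polynomial $x^{163} + x^7 + x^6 + x^3 + 1$. It is a sequential illustration under assumptions: the names `clmul_word`, `comba_gf2` and `reduce_163` are hypothetical, and the reduction shown is a plain bit-serial one, not the fast word-level reduction that the thesis adapted to 8-bit words.

```python
M = 163
# NIST reduction polynomial for K-163 / B-163: x^163 + x^7 + x^6 + x^3 + 1
F_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul_word(x, y, w=8):
    """Carry-less product of two w-bit words, i.e. polynomials over GF(2)."""
    r = 0
    for i in range(w):
        if (y >> i) & 1:
            r ^= x << i
    return r


def comba_gf2(a, b, w=8):
    """Product scanning over GF(2)[x]; column sums use XOR, so no carries occur.

    a and b are lists of s words, least significant word first.
    Returns the 2s-word unreduced product.
    """
    s = len(a)
    mask = (1 << w) - 1
    acc = 0
    p = [0] * (2 * s)
    for i in range(2 * s - 1):
        for j in range(max(0, i - s + 1), min(i, s - 1) + 1):
            acc ^= clmul_word(a[j], b[i - j], w)
        p[i] = acc & mask
        acc >>= w                      # keep the high part for the next column
    p[2 * s - 1] = acc
    return p


def reduce_163(c):
    """Reduce a polynomial (as an integer bit vector) modulo F_POLY, bit by bit."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F_POLY << (i - M)
    return c
```

A 163-bit field element fits in 21 bytes, so both operands would be split into 21 words before calling `comba_gf2`, and the 42-word result is recombined into one integer before `reduce_163` is applied.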
A new modified parallelized Lopez-Dahab point doubling has been proposed here. This
novel idea helps us obtain high performance for the overall ECC scalar point multiplication.
Our method reduces the point doubling algorithm from 14 steps to only 9 steps.
The second novel contribution concerns modifying the Lopez-Dahab point addition
algorithm and parallelizing some of its steps. We reduced the algorithm from 26 steps to
20 steps using parallelization mechanisms. Furthermore, we introduced the concept of
parallelizing inside a parallelized round.
Thirdly, we enhanced the overall performance of the Left-to-Right Double-and-Add scalar
point multiplication algorithm. This was achieved by designing the algorithm so that it first
scans the bits of k from the most significant bit (MSB) downwards to locate the first
non-zero bit. Once this bit is found, the coordinates of P are copied into Q before the loop
operation starts.
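The scanning step described above can be sketched as follows. This is a minimal illustration with hypothetical names (`scalar_mul_ltr`, and the `add` and `double` callbacks that stand for the point addition and point doubling routines); it is not taken from the thesis code. Skipping the leading zero bits means the loop never has to double or add the point at infinity.

```python
def scalar_mul_ltr(k, P, add, double, nbits):
    """Left-to-right double-and-add that skips the leading zero bits of k.

    add(Q, P) and double(Q) are the group operations supplied by the caller;
    nbits is the fixed bit length of the scalar field.
    """
    if not 0 < k < (1 << nbits):
        raise ValueError("k must satisfy 0 < k < 2**nbits")
    i = nbits - 1
    while not (k >> i) & 1:            # scan from the MSB for the first 1 bit
        i -= 1
    Q = P                              # first non-zero bit found: Q <- P
    for j in range(i - 1, -1, -1):     # process the remaining bits
        Q = double(Q)
        if (k >> j) & 1:
            Q = add(Q, P)
    return Q
```

As a quick check that does not need curve arithmetic, the additive group of integers modulo n can stand in for the curve: with `add = lambda a, b: (a + b) % n` and `double = lambda a: (2 * a) % n`, the call `scalar_mul_ltr(k, 7, add, double, 32)` should return `(k * 7) % n`.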
Overall, in this work we were able to report much higher performance in ECC point
multiplication over binary fields for all related standardized curves, including
$GF(2^{163})$, $GF(2^{233})$, $GF(2^{283})$, $GF(2^{409})$ and $GF(2^{571})$.
In addition, we tried to improve ECC on constrained devices from different perspectives.
Thus, in Chapter 6, we obtained better performance of ECC point multiplication over the
prime field $GF(P)$. There we claimed three more novelties, summarized below.
The first claimed novelty is the parallelization of the Comba algorithm for $GF(P)$,
which builds on the earlier parallelization of the Comba algorithm in $GF(2^{163})$. With
this, we parallelized the multiplication arithmetic of ECC over $GF(P)$.
Secondly, we proposed a parallel Jacobian point doubling over $GF(P)$. This solution
reduces the 18 steps of the sequential algorithm to 15 steps, which positively impacted the
overall performance of ECC point multiplication over $GF(P)$.
The third novelty is the parallelization of the co-Z point addition. We proposed a
parallel implementation of this algorithm and reduced it to 7 steps, compared to the
13 steps of the original algorithm.
In this thesis, we considered how to improve the efficiency of software implementations of
Elliptic Curve Cryptography on microcontroller platforms. For the purpose of our research,
we considered implementations mainly on three types of microcontroller: the Arduino
Mega2560 with an 8-bit microcontroller, the Arduino DUE with a 32-bit ARM processor and the
XMOS startKIT. We also confirmed the scalability of the implementation across different
curves of the SEC1 standard: different SEC1 curves were implemented for both $GF(2^m)$ and
$GF(P)$, and benchmark results were presented.
Several attempts have been made to improve the performance of ECC in software. There have
also been efforts to boost efficiency by using assembly language or by modifying the existing
related algorithms. However, none of these attempts tried to improve ECC performance using a
homogeneous multicore microcontroller.
Therefore, to our knowledge, the solutions proposed herein are the first proposal for
enhancing ECC performance on this type of microcontroller at the time of writing this thesis.
It is also worth mentioning that our proposed solution can easily be integrated with any
application that uses an XMOS or Arduino microcontroller, so that ECC can act as a security
service provider on these platforms.
Besides the speed enhancement of ECC that we have achieved in this work, power
consumption is one of the aspects that needs to be taken into consideration. Our
implementation is proposed for microcontrollers used in applications (WSN, IoT) that depend
heavily on low-power resources such as batteries. The impact of introducing such ECC
algorithms alongside the processing and communication related to data sensing and other
tasks inside these microcontrollers should be carefully evaluated and addressed.
One of the challenges that needs to be considered when implementing cryptography for any
system is studying the strength of the security system against cryptanalysis mechanisms and
algorithms. In fact, lightweight applications, such as the devices we used in our work, need
to be evaluated against side-channel attacks. Using strong cryptography in such applications
does not mean that these devices are strong enough to withstand any side-channel attack.
A side-channel attack can be mounted in a passive manner, where timing and power information
are used simultaneously during the attack. One possible defence is to use resistant
algorithms, but such a solution requires large memory and doubles the execution time.
Therefore, a deep investigation of how to protect ECC on XMOS and Arduino devices from
side-channel attacks is one of the issues that could be further investigated.
End-to-end encryption using parallelized ECC on XMOS devices is one of the open
problems we suggest here. We therefore recommend implementing our proposed solutions in a
real-time network environment, such as a WSN or IoT network, to check their effectiveness.
Examining our proposed solution on different platforms, such as Parallax, is another
open problem, since we have examined our solutions on only one type of homogeneous
multicore microcontroller. This work can help to understand how other platforms react to and
cope with such solutions.
References
[1] D. Hankerson, S. Vanstone, and A. J. Menezes, Guide to elliptic curve cryptography. Springer,
2004.
[2] The Basics of ECC [Online]. Available: https://www.certicom.com/index.php/the-basics-of-ecc
[3] (2016). Intel® Parallel Studio XE 2016: High Performance for HPC Applications and Big Data
Analytics | Intel® Developer Zone. Available: https://software.intel.com/en-us/blogs/Intel-Parallel-Studio-XE-2016
[4] Y. R. Venturini and U. Sorocaba, "Performance analysis of parallel modular multiplication
algorithms for ECC in mobile devices," Revista de Sistemas de Informaçao da FSMA, no. 13,
pp. 57-67, 2014.
[5] D. F. Aranha, R. Dahab, J. López, and L. B. Oliveira, "Efficient implementation of elliptic curve
cryptography in wireless sensors," Advances in Mathematics of Communications, vol. 4, no.
2, pp. 169-187, 2010.
[6] C. Negre and J.-M. Robert, "New parallel approaches for scalar multiplication in elliptic curve
over fields of small characteristic," Computers, IEEE Transactions on, vol. 64, no. 10, pp.
2875-2890, 2015.
[7] A. Kargl, S. Pyka, and H. Seuschek, "Fast Arithmetic on ATmega128 for Elliptic Curve
Cryptography," IACR Cryptology ePrint Archive, vol. 2008, p. 442, 2008.
[8] J.-M. Robert, "Software Implementation of Parallelized ECSM over Binary and Prime Fields,"
2014.
[9] U. S. Kanniah and A. Samsudin, "Multi-threading elliptic curve cryptosystems," in
Telecommunications and Malaysia International Conference on Communications, 2007. ICT-
MICC 2007. IEEE International Conference on, 2007, pp. 134-139: IEEE.
[10] C. P. Gouvêa, L. B. Oliveira, and J. López, "Efficient software implementation of public-key
cryptography on sensor networks using the MSP430X microcontroller," Journal of
Cryptographic Engineering, vol. 2, no. 1, pp. 19-29, 2012.
[11] S. Tillich and J. Großschädl, "A simple architectural enhancement for fast and flexible elliptic
curve cryptography over binary finite fields GF (2 m)," in Advances in Computer Systems
Architecture: Springer, 2004, pp. 282-295.
[12] L. B. Oliveira et al., "TinyPBC: Pairings for authenticated identity-based non-interactive key
distribution in sensor networks," Computer Communications, vol. 34, no. 3, pp. 485-493,
2011.
[13] C. Alcaraz, J. Lopez, R. Roman, and H.-H. Chen, "Selecting key management schemes for WSN
applications," Computers & Security, vol. 31, no. 8, pp. 956-966, 2012.
[14] (2015). What is startKIT? | XMOS. Available: http://www.xmos.com/startKit/what
[15] (2014). sec1_final.pdf (application/pdf Object). Available:
http://www.secg.org/collateral/sec1_final.pdf
[16] D. F. Aranha and C. P. L. Gouvêa. RELIC is an Efficient LIbrary for Cryptography. Available:
http://code.google.com/p/relic-toolkit/
[17] A. Szekely and S. Tillich, "Algorithm exploration for long integer modular arithmetic on a
SPARC V8 processor with cryptography extensions," in null, 2005, pp. 187-194: IEEE.
[18] R. Brumnik, V. Kovtun, and A. Okhrimenko, "Techniques for performance improvement of
integer multiplication in cryptographic applications."
[19] (2016). iSECPartners/nano-ecc. Available: https://github.com/iSECPartners/nano-ecc
[20] B. A. Forouzan, Cryptography And Network Security (Sie). Tata McGraw-Hill Education, 2011.
[21] (2016). http://www.billthelizard.com/2009/05/brief-history-of-cryptography.html.
[22] D. Kahn, The codebreakers. Weidenfeld and Nicolson, 1974.
[23] F. Cohen, "A short history of cryptography," Fred Cohen & Associates, 2001.
[24] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, Handbook of applied cryptography.
CRC press, 2010.
[25] S. William and W. Stallings, Cryptography and Network Security, 4/E. Pearson Education
India, 2006.
[26] C. Paar and J. Pelzl, Understanding cryptography: a textbook for students and practitioners.
Springer, 2010.
[27] S. S. Kumar, "Elliptic curve cryptography for constrained devices," Ruhr University Bochum,
2006.
[28] R. Lidl and H. Niederreiter, Introduction to finite fields and their applications. Cambridge
university press, 1994.
[29] I. Branovic, R. Giorgi, and E. Martinelli, "Instruction Set Extensions for Elliptic Curve
Cryptography over Binary Finite Fields."
[30] ECC Holds Key to Next-Gen Cryptography [Online]. Available: http://www.design-reuse.com/articles/7409/ecc-holds-key-to-next-gen-cryptography.html
[31] F. Brechenmacher, "A history of galois fields," 2012.
[32] E. W. Weisstein. (2014). Finite Field -- from Wolfram MathWorld [Text]. Available:
http://mathworld.wolfram.com/FiniteField.html
[33] P. G. Comba, "Exponentiation cryptosystems on the IBM PC," IBM systems journal, vol. 29,
no. 4, pp. 526-538, 1990.
[34] D. Knuth, "The Art of Computer Programming. Seminumerical Algorithms, vol. 2, 1981,"
Distributed Sensor Networks International Journal of Mechanical Engineering Advances in.
[35] A. Karatsuba and Y. Ofman, "Multiplication of multidigit numbers on automata," in Soviet
physics doklady, 1963, vol. 7, p. 595.
[36] C. Koc and T. Acar, "Montgomery Multiplication in GF( 2k)," An International Journal, vol. 14,
no. 1, pp. 57-69.
[37] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of
computation, vol. 44, no. 170, pp. 519-521, 1985.
[38] D. Hankerson, J. L. Hernandez, and A. Menezes, "Software implementation of elliptic curve
cryptography over binary fields," in Cryptographic Hardware and Embedded Systems—CHES
2000, 2000, pp. 1-24: Springer.
[39] S. Certicom, "SEC 2: Recommended elliptic curve domain parameters," Proceeding of
Standards for Efficient Cryptography, Version, vol. 1, 2000.
[40] N. Koblitz, A course in number theory and cryptography. Springer Science & Business Media,
1994.
[41] N. Koblitz, A. Menezes, and S. Vanstone, "The state of elliptic curve cryptography," Designs,
codes and cryptography, vol. 19, no. 2-3, pp. 173-193, 2000.
[42] R. C. Merkle, "Secure communications over insecure channels," Communications of the ACM,
vol. 21, no. 4, pp. 294-299, 1978.
[43] E. Barker, L. Chen, A. Roginsky, and M. Smid, "Recommendation for pair-wise key
establishment schemes using discrete logarithm cryptography," NIST special publication, vol.
800, p. 56A, 2013.
[44] A. Liu and P. Ning, "TinyECC: A configurable library for elliptic curve cryptography in wireless
sensor networks," in Information Processing in Sensor Networks, 2008. IPSN'08. International
Conference on, 2008, pp. 245-256: IEEE.
[45] S. C. Seo, H. Dong-Guk, H. C. Kim, and H. Seokhie, "TinyECCK: Efficient Elliptic Curve
Cryptography Implementation over GF(2^m) on 8-Bit
Micaz Mote," IEICE transactions on information and systems, vol. 91, no. 5, pp. 1338-1347,
2008.
[46] (2014). AVRCryptoLib. Available: http://www.emsign.nl/
[47] (2014). Arduino - HomePage. Available: http://www.arduino.cc/
[48] M. Sethi, J. Arkko, and A. Keranen, "End-to-end security for sleepy smart object networks,"
in Local Computer Networks Workshops (LCN Workshops), 2012 IEEE 37th Conference on,
2012, pp. 964-972.
[49] T. S. Denis, BigNum Math: Implementing Cryptographic Multiple Precision Arithmetic.
Syngress Publishing, 2006.
[50] M. Albahri and M. Benaissa, "Parallel comba multiplication in GF(2^163) using homogenous
multicore microcontroller," in Electronics, Circuits, and Systems (ICECS), 2015 IEEE
International Conference on, 2015, pp. 641-644: IEEE.
[51] N. Gura, A. Patel, A. Wander, H. Eberle, and S. C. Shantz, "Comparing elliptic curve
cryptography and RSA on 8-bit CPUs," in Cryptographic Hardware and Embedded Systems-
CHES 2004: Springer, 2004, pp. 119-132.
[52] R. Brumnik, V. Kovtun, A. Okhrimenko, and S. Kavun, "Techniques for Performance
Improvement of Integer Multiplication in Cryptographic Applications," Mathematical
Problems in Engineering, vol. 2014, 2014.
[53] M. Hutter and E. Wenger, "Fast multi-precision multiplication for public-key cryptography on
embedded microprocessors," in Cryptographic Hardware and Embedded Systems–CHES
2011: Springer, 2011, pp. 459-474.
[54] A. Munir, A. Gordon-Ross, and S. Ranka, "Multi-core embedded wireless sensor networks:
Architecture and applications," Parallel and Distributed Systems, IEEE Transactions on, vol.
25, no. 6, pp. 1553-1562, 2014.
[55] H. Seo, Y. Lee, H. Kim, T. Park, and H. Kim, "Binary and prime field multiplication for public
key cryptography on embedded microprocessors," Security and Communication Networks,
vol. 7, no. 4, pp. 774-787, 2014.
[56] M. Hutter and P. Schwabe, "Multiprecision multiplication on AVR revisited," 2014.
[57] H. Seo and H. Kim, "Implementation of Multi-Precision Multiplication over Sensor Networks
with Efficient Instructions," Journal of Information and Communication Convergence
Engineering, vol. 11, no. 1, pp. 12-16, 2013.
[58] H. Seo and H. Kim, "Optimized multi-precision multiplication for public-key cryptography on
embedded microprocessors," International Journal of Computer and Communication
Engineering, vol. 2, no. 3, pp. 255-259, 2013.
[59] Z. Liu and J. Großschädl, "New speed records for montgomery modular multiplication on 8-
bit AVR microcontrollers," in Progress in Cryptology–AFRICACRYPT 2014: Springer, 2014, pp.
215-234.
[60] D. F. Aranha, J. López, and D. Hankerson, "Efficient software implementation of binary field
arithmetic using vector instruction sets," in Progress in Cryptology–LATINCRYPT 2010:
Springer, 2010, pp. 144-161.
[61] V. Kovtun and A. Okhrimenko, "Approaches for the performance increasing of software
implementation of integer multiplication in prime fields," IACR Cryptology ePrint Archive,
vol. 2012, p. 170, 2012.
[62] C. Moore, M. O'Neill, N. Hanley, and E. O'Sullivan, "Accelerating integer-based fully
homomorphic encryption using Comba multiplication," in Signal Processing Systems (SiPS),
2014 IEEE Workshop on, 2014, pp. 1-6: IEEE.
[63] C. Moore, M. O'Neill, E. O'Sullivan, Y. Doroz, and B. Sunar, "Practical homomorphic
encryption: A survey," in Circuits and Systems (ISCAS), 2014 IEEE International Symposium
on, 2014, pp. 2792-2795: IEEE.
[64] (2014). sec2_final.pdf (application/pdf Object). Available:
http://www.secg.org/collateral/sec2_final.pdf
[65] M. J. Quinn, Parallel Programming in C with MPI and OpenMP. McGraw-Hill Education
Group, 2003.
[66] D. F. Aranha, R. Dahab, J. López, and L. B. Oliveira, "Efficient implementation of elliptic curve
cryptography in wireless sensors," Adv. in Math. of Comm., vol. 4, no. 2, pp. 169-187, 2010.
[67] M. Albahri, M. Benaissa, and Z. U. A. Khan, "Parallel Implementation of ECC Point
Multiplication on a Homogeneous Multi-Core Microcontroller," in Mobile Ad-Hoc and Sensor
Networks (MSN), 2016 12th International Conference on, 2016, pp. 386-389: IEEE.
[68] J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J.
López, "Software implementation of binary elliptic curves: impact of the carry-less multiplier
on scalar multiplication," in Cryptographic Hardware and Embedded Systems–CHES 2011:
Springer, 2011, pp. 108-123.
[69] J. López and R. Dahab, "Fast multiplication on elliptic curves over GF (2m) without
precomputation," in Cryptographic Hardware and Embedded Systems, 1999, pp. 316-327:
Springer.
[70] F. Rodrıguez-Henrıquez, N. A. Saqib, and A. Dıaz-Pérez, "A fast parallel implementation of
elliptic curve point multiplication over GF (2 m)," Microprocessors and Microsystems, vol. 28,
no. 5, pp. 329-339, 2004.
[71] J. V. Tembhurne and S. R. Sathe, "Performance evaluation of long integer multiplication
using OpenMP and MPI on shared memory architecture," in Contemporary Computing (IC3),
2014 Seventh International Conference on, 2014, pp. 283-288: IEEE.
[72] M. Purnaprajna, C. Puttmann, and M. Porrmann, "Power aware reconfigurable
multiprocessor for elliptic curve cryptography," in Design, Automation and Test in Europe,
2008. DATE'08, 2008, pp. 1462-1467: IEEE.
[73] J. López and R. Dahab, "Improved algorithms for elliptic curve arithmetic in GF (2n)," in
Selected areas in cryptography, 1998, pp. 201-212: Springer.
[74] M. Albahri and M. Benaissa, "Parallel comba multiplication in GF(2^163) using homogenous
multicore microcontroller," in 2015 IEEE International Conference on Electronics, Circuits,
and Systems (ICECS), 2015, pp. 641-644: IEEE.
[75] J. Großschädl and E. Savaş, "Instruction set extensions for fast arithmetic in finite fields GF
(p) and GF (2 m)," in Cryptographic Hardware and Embedded Systems-CHES 2004: Springer,
2004, pp. 133-147.
[76] A. Munir, A. Gordon-Ross, and S. Ranka, "Multi-core Embedded Wireless Sensor Networks:
Architecture and Applications," 2013.
[77] E. Wenger and M. Hutter, "Exploring the design space of prime field vs. binary field ECC-
hardware implementations," in Information Security Technology for Applications: Springer,
2011, pp. 256-271.
[78] T. VanAmeron and W. Skiba, "Implementing efficient 384-bit NIST Elliptic Curve over prime
fields on an ARM946E," in Military Communications Conference, 2008. MILCOM 2008. IEEE,
pp. 1-7: IEEE.
[79] P. Szczechowiak, L. B. Oliveira, M. Scott, M. Collier, and R. Dahab, "NanoECC: Testing the
limits of elliptic curve cryptography in sensor networks," in Wireless sensor networks:
Springer, 2008, pp. 305-320.
[80] Z. Liu, J. Großschädl, and D. S. Wong, "Low-weight primes for lightweight elliptic curve
cryptography on 8-bit AVR processors," in Information Security and Cryptology, 2013, pp.
217-235: Springer.
[81] K. Sakiyama, E. De Mulder, B. Preneel, and I. Verbauwhede, "A parallel processing hardware
architecture for elliptic curve cryptosystems," in Acoustics, Speech and Signal Processing,
2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, 2006, vol. 3, pp. III-
III: IEEE.
[82] S.-C. Chung, J.-W. Lee, H.-C. Chang, and C.-Y. Lee, "A high-performance elliptic curve
cryptographic processor over GF (p) with SPA resistance," in Circuits and Systems (ISCAS),
2012 IEEE International Symposium on, 2012, pp. 1456-1459: IEEE.
[83] M. Rivain, "Fast and Regular Algorithms for Scalar Multiplication over Elliptic Curves," IACR
Cryptology ePrint Archive, vol. 2011, p. 338, 2011.
[84] P. Longa and A. Miri, "Fast and flexible elliptic curve point arithmetic over prime fields,"
Computers, IEEE Transactions on, vol. 57, no. 3, pp. 289-302, 2008.
[85] P. Longa and C. Gebotys, "Efficient techniques for high-speed elliptic curve cryptography," in
Cryptographic hardware and embedded systems, CHES 2010: Springer, 2010, pp. 80-94.