2011 - High-Speed Elliptic Curve and Pairing-Based
Cryptography
by
Patrick Longa
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Doctor of Philosophy
in
Electrical and Computer Engineering
I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including
any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
Elliptic Curve Cryptography (ECC), independently proposed by Miller [Mil86] and Koblitz
[Kob87] in the mid-1980s, is gaining momentum to consolidate its status as the public-key system of
choice in a wide range of applications and to expand further into settings traditionally
occupied by RSA and DL-based systems. The absence of known subexponential attacks on
this cryptosystem directly translates to shorter keylengths for a given security level and,
consequently, has led to implementations with better bandwidth usage, reduced power and
memory requirements, and higher speeds. Moreover, the dramatic entry of pairing-based
cryptosystems defined on elliptic curves at the beginning of the new millennium has opened the
possibility of a plethora of innovative applications, solving in some cases longstanding problems
in cryptography. Nevertheless, public-key cryptography (PKC) is still relatively expensive in
comparison with its symmetric-key counterpart, and it remains an open challenge to further
reduce the computing cost of the most time-consuming PKC primitives to guarantee their
adoption for secure communication in commercial and Internet-based applications. This is
especially true for pairing computations. Thus, it is of paramount importance to research methods
that permit the efficient realization of Elliptic Curve and Pairing-based Cryptography on the
many new platforms and applications.
This thesis deals with efficient methods and explicit formulas for computing elliptic curve
scalar multiplication and pairings over fields of large prime characteristic with the objective of
enabling the realization of software implementations at very high speeds.
To achieve this main goal in the case of elliptic curves, we accomplish the following tasks:
identify the elliptic curve settings with the fastest arithmetic; accelerate the precomputation stage
of scalar multiplication; study number representations and scalar multiplication algorithms for
speeding up the evaluation stage; identify the most efficient field arithmetic algorithms and optimize
them; analyze the architecture of the targeted platforms to maximize the performance of ECC
operations; identify the most efficient coordinate systems and optimize explicit formulas; and realize
implementations on x86-64 processors with an optimal algorithmic selection among all studied
cases.
In the case of pairings, the following tasks are accomplished: accelerate tower and curve
arithmetic; identify the most efficient tower and field arithmetic algorithms and optimize them;
identify the curve setting with the fastest arithmetic and optimize it; identify state-of-the-art
techniques for the Miller loop and final exponentiation; and realize an implementation on x86-64
processors with an optimal algorithmic selection.
The most outstanding contributions that have been achieved with the methodologies above in
this thesis can be summarized as follows:
• Two novel precomputation schemes are introduced and shown to achieve the lowest costs
in the literature for different curve forms and scalar multiplication primitives. The
detailed cost formulas of the schemes are derived for the most relevant scenarios.
• A new methodology based on the operation cost per bit to devise highly optimized and
compact multibase algorithms is proposed. Derived multibase chains using bases {2,3}
and {2,3,5} are shown to achieve the lowest theoretical costs for scalar multiplication on
certain curve forms and for scenarios with and without precomputations. In addition, the
zero and nonzero density formulas of the original (width-w) multibase NAF method are
derived by using Markov chains. The application of “fractional” windows to the
multibase method is described together with the derivation of the corresponding density
formulas.
• Incomplete reduction and branchless arithmetic techniques are optimally combined for
devising high-performance field arithmetic. Efficient algorithms for “small” modular
operations using suitably chosen pseudo-Mersenne primes are carefully analyzed and
optimized for incomplete reduction.
• Data dependencies between contiguous field operations are discovered to be a source of
performance degradation on x86-64 processors. Three techniques for reducing the
number of potential pipeline stalls due to these dependencies are proposed: field
arithmetic scheduling, merging of point operations and merging of field operations.
• Explicit formulas for two relevant cases, namely Weierstrass and Twisted Edwards
curves over Fp and Fp2, are carefully optimized employing incomplete reduction, a
minimal number of operations, and a reduced number of data dependencies between
contiguous field operations.
• The best algorithms for the field, point and scalar arithmetic, studied or proposed in this
thesis, are brought together to realize four high-speed implementations on x86-64
processors at the 128-bit security level. The presented results set new speed records for
elliptic curve scalar multiplication, with a cost reduction of up to 34% in comparison
with the best previous results in the literature.
• A generalized lazy reduction technique that enables the elimination of up to 32% of the
modular reductions in the pairing computation is proposed. Further, a methodology that
keeps intermediate results under Montgomery reduction boundaries, maximizing the
number of operations that can be performed without carry checks, is introduced.
Optimized formulas for the popular tower Fp → Fp2 → Fp6 → Fp12 are explicitly stated,
and a detailed operation count that permits determining the theoretical cost improvement
attainable with the proposed method is carried out for the case of an optimal ate pairing
on a Barreto-Naehrig (BN) curve at the 128-bit security level.
• The best algorithms for the different stages of the pairing computation, including the
proposed techniques and optimizations, are brought together to realize a high-speed
implementation at the 128-bit security level. The presented results on x86-64 processors
set new speed records for pairings, with a cost reduction of up to 34% in comparison
with the best published result.
From a general viewpoint, the proposed methods and optimized formulas have a practical
impact on the performance of cryptographic protocols based on elliptic curves and pairings in a
wide range of applications. In particular, the introduced implementations represent a direct and
significant improvement that may be exploited in performance-dominated applications such as
high-demand Web servers in which millions of secure transactions need to be generated.
Acknowledgements
This Ph.D. thesis would not have been possible without the support and encouragement of many
people. My thanks go first to my supervisor, Dr. Catherine Gebotys, for her invaluable support
and guidance throughout my Ph.D. studies.
I am also grateful to all my professors at the University of Waterloo, especially to Dr. Anwar
Hasan and Dr. David Jao for providing me with very useful feedback and comments on my
preliminary technical reports that later became part of Chapter 4, and to Dr. Hiren Patel for his
useful feedback on my research about efficient ECC implementation that later became part of
Chapter 5.
I would like to thank my committee members, Dr. Gordon Agnew, Dr. Anwar Hasan, Dr.
Michael Scott and Dr. Doug Stinson, for taking the time to read this thesis and providing
many useful suggestions that helped me improve this work.
My thanks go to Dr. Michael Scott for his valuable help and feedback when I was developing
the ECC implementations presented in Chapter 5 on top of the MIRACL crypto library that he
developed; and to Dr. Huseyin Hisil for very valuable discussions on elliptic curves. The work of
both authors has been an inspiration and the basis for several developments in this thesis.
I would like to thank Diego F. Aranha, Dr. Catherine Gebotys, Dr. Koray Karabina and Dr.
Julio Lopez for our joint work on pairings [AKL+10], which is part of Chapter 6.
Special thanks go to Diego F. Aranha for his friendship, our always interesting discussions on
cryptography and its efficient implementation, and our joint effort to develop the pairing
implementation presented in Chapter 6.
My thanks go to Tom St Denis, Diego F. Aranha and Dr. Colin Walter for providing valuable
comments for several sections of this thesis.
This work would not have been possible without the financial support of the NSERC
Alexander Graham Bell Canada Graduate Scholarship – Doctoral (CGS-D) and the University of
Waterloo President's Graduate Scholarship. Also, many test results presented in Chapters 5 and 6
were obtained using the facilities of the Shared Hierarchical Academic Research Computing
Network (SHARCNET) and Compute/Calcul Canada. My sincere gratitude goes to all of them.
Last but not least, I am profoundly grateful to my mother, Patricia Pierola, and my brothers,
Patricia and Franccesco Longa, for their love and unconditional encouragement. Above all, I
dedicate this work to my wife, Veronica Zeballos, and my daughter, Adriana Longa, whose
infinite love, support, patience and faith in my work made this thesis possible.
To my wife and daughter,
Veronica and Adriana,
my lights in this World
Table of Contents
Author’s Declaration ii
Abstract iii
Acknowledgements vii
Dedication ix
Table of Contents xi
List of Tables xvii
List of Algorithms xxi
List of Acronyms xxiii
Chapter 1: Introduction 1
1.1. Motivation 1
1.2. Contributions 6
1.3. Outline 8
Chapter 2: Background 11
2.1. Preliminaries 11
Appendices
Permissions 197
Bibliography 199
List of Tables
Table 2.1. Key sizes for ECC and RSA for equivalent security levels [NIST07]. .......................17
Table 2.2. Costs (in terms of multiplications and squarings) of point operations using Jacobian
(J ) and mixed Jacobian-affine coordinates. ..............................................................................24
Table 2.3. Costs of point operations for an extended Jacobi quartic curve with d = 1 using
extended Jacobi quartic ( JQ e ) coordinates. ..............................................................................28
Table 2.4. Costs of point operations for a Twisted Edwards curve using inverted Edwards (IE)
and mixed homogeneous/extended homogeneous ( E / E e ) coordinates. .....................................30
Table 3.2. Costs of addition/conjugate addition formulas using projective (J, IE and JQ e ) and
affine coordinates. .....................................................................................................................48
Table 3.3. Costs of the LG precomputation scheme: case 1 in projective coordinates using J,
JQ e and IE; case 2 using one inversion; and case 3 in A. .........................................................53
Table 3.4. Cost of the LG precomputation scheme for tables of the form ci P ± di Q : case 1 in
projective coordinates; case 2 using one inversion; and case 3 in affine coordinates. ..................54
Table 3.5. Costs of different schemes using multiple inversions (case 3) and I/M ranges for
which each scheme achieves the lowest cost on a standard curve form (1M = 0.8S). ..................55
Table 3.6. Performance comparison of LG and LM Schemes with the C-based method (case 1) in
160-bit scalar multiplication on a standard curve form (1M = 0.8S). ..........................................56
Table 3.7. Performance comparison of LG and LM Schemes with the C-based method (case 1) in
256-bit scalar multiplication on a standard curve form (1M = 0.8S). ..........................................57
Table 3.8. Performance comparison of LG and LM Schemes with the C-based method (case 1) in
512-bit scalar multiplication on a standard curve form (1M = 0.8S). ..........................................57
Table 3.9. Performance comparison of LG and LM Schemes with the DOS method in 160-bit
scalar multiplication for different memory constraints on a standard curve (1M = 0.8S). ............59
Table 3.10. Performance comparison of LG Scheme with methods using a traditional chain for
cases 1 and 2 on JQ e and IE coordinates (1M = 0.8S). ............................................................61
Table 3.11. Cost of 160-bit scalar multiplication using Frac-wNAF and the LG Scheme (cases 1
and 2); and I/M range for which case 1 achieves the lowest cost on JQ e and IE (1M = 0.8S). ..62
Table 3.12. Cost of 512-bit scalar multiplication using Frac-wNAF and the LG Scheme (cases 1
and 2); and I/M range for which case 1 achieves the lowest cost on JQ e and IE (1M = 0.8S). ..62
Table 3.13. Performance comparison of LG Scheme and a scheme using traditional additions for
computing tables of the form ci P ± di Q , cases 1 and 2 (1M = 0.8S). ..........................................63
Table 3.14. Cost of 160-bit multiple scalar multiplication using window-based JSF and LG
Scheme (cases 1 and 2); and I/M ranges for which case 1 achieves the lowest cost; 1M = 0.8S...64
Table 3.15. Cost of 512-bit multiple scalar multiplication using window-based JSF and LG
Scheme (cases 1 and 2); and I/M ranges for which case 1 achieves the lowest cost; 1M = 0.8S...64
Table 4.1. Cost-per-bit for statements in CONDITION1, bases {2,3}, w = 2, J coordinates. ......87
Table 4.2. Cost-per-bit for statements in CONDITION2, bases {2,3}, w = 2, J coordinates. ......89
Table 4.3. Comparison of double-base and triple-base scalar multiplication methods (n = 160
bits; 1S = 0.8M). .......................................................................................................................96
Table 4.4. Comparison of double-base and triple-base scalar multiplication methods (n = 256
bits; 1S = 0.8M). .......................................................................................................................97
Table 4.5. Comparison of lowest costs using multibase and radix-2 methods for scalar
multiplication, n = 160 bits (cost of precomputation is not included). ........................................98
Table 5.1. Cost (in cycles) of modular operations when using incomplete reduction (IR) and
complete reduction (CR); p = 2^256 − 189. .............................................................. 111
Table 5.2. Cost (in cycles) of modular operations without conditional branches (w/o CB) against
operations using conditional branches (with CB); p = 2^256 − 189. .......................... 113
Table 5.3. Cost (in cycles) of point operations with Jacobian coordinates when using incomplete
reduction (IR) or complete reduction (CR) and with or without conditional branches (CB);
p = 2^256 − 189. ........................................................................................................... 114
Table 5.4. Various sequences of field operations with different levels of contiguous data
dependence. ............................................................................................................................ 118
Table 5.5. Average cost (in cycles) of modular operations using best-case (no contiguous data
dependencies, Sequence 1) and worst-case (strong contiguous data dependence, Sequence 2)
“arrangements” (p = 2^256 − 189, on a 2.66GHz Intel Core 2 Duo E6750). ................................. 118
Table 5.6. Cost (in cycles) of point doubling using Jacobian coordinates with different numbers of
contiguous data dependencies and the corresponding reduction in the cost of point multiplication.
“Unscheduled” refers to implementations with a high number of dependencies (here, 10
dependencies per doubling). “Scheduled and merged” refers to implementations optimized
through the scheduling of field operations, merging of point operations and merging of field
operations (here, 1.25 dependencies per doubling); p = 2^256 − 189. ........................................... 121
Table 5.7. Cost (in cycles) of point multiplication on 64-bit architectures. .............................. 131
Table 6.1. Different options to convert negative results to positive after a subtraction of the
form c = a + l·b, where a, b ∈ [0, mp^2], m ∈ Z+ and l ∈ Z with l < 0 s.t. lmp < 2^N. ........................ 142
Table 6.2. Operation counts for arithmetic required by Miller’s algorithm when using: (i)
generalized lazy reduction technique; (ii) basic lazy reduction applied to Fp 2 arithmetic only. 156
Table A.1. Performance comparison of LG and LM Schemes with the DOS method in 256-bit
scalar multiplication for different memory constraints on a standard curve (1M = 0.8S)………185
Table A.2. Performance comparison of LG and LM Schemes with the DOS method in 512-bit
scalar multiplication for different memory constraints on a standard curve (1M = 0.8S)………186
List of Algorithms
Algorithm 2.9. Optimal ate pairing on BN curves (including the case u < 0 )...........................35
Algorithm 5.2. Modular subtraction with a pseudo-Mersenne prime and complete reduction . 109
Algorithm 6.2. Multiplication in Fp6 without reduction ( ×6 , cost of 6mu + 28a ) ................... 145
Algorithm 6.3. Multiplication in Fp12 ( ×12 , cost of 18mu + 6r + 110a )................................... 146
Algorithm 6.5. Point doubling in Jacobian coordinates (cost of 6mu + 5su + 10r + 10a + 4 M ) 149
Algorithm 6.6. Point addition in Jacobian coordinates (cost of 10mu + 3su + 11r + 10a + 4M ) 150
Algorithm 6.8. Point addition in homogeneous coordinates (cost of 11mu + 2su +11r +12a + 4M )
............................................................................................................................................... 152
Algorithm 6.9. Modified optimal ate pairing on BN curves (generalized for u < 0 ) ............... 154
List of Acronyms
Chapter 1
Introduction
1.1. Motivation
Since its discovery by Diffie and Hellman in 1976 [DH76], public-key cryptography (PKC) has
revolutionized the way communications are secured by governments, banks,
enterprises and even ordinary people. Based on clever mathematical constructs, public-key systems
emerged to alleviate the difficult problem of key management and distribution, and to provide such
powerful tools as digital signatures. See, for example, [HMV04, Section 1.2] or [ACD+05,
Section 1] for an introduction to PKC.
Nonetheless, RSA, the dominant public-key system for many years, and discrete
logarithm (DL)-based cryptosystems are already exhibiting clear limitations in maintaining an
acceptable performance level across the plethora of new applications and platforms of the new
millennium, which range from constrained, power-limited wireless devices [BCH+00, Lau04] to
cluster servers performing millions of secure transactions for e-commerce and e-banking
[GGC02, GSF04]. A relatively new, more “compact” player in the public-key crypto arena has
been gaining increasing attention in academia and commercial applications: elliptic curve
cryptosystems.
The complex and elegant mathematics behind elliptic curves had attracted number theorists and
algebraic geometers long before the remarkable work by Lenstra [Len87] on using elliptic curves
for factoring led to the independent discovery by Miller [Mil86] and Koblitz [Kob87] of Elliptic
Curve Cryptography (ECC) in 1985. Since then, with the exception of some studies that found
vulnerabilities in certain special curves [MOV93, Sma99], it has not been possible to find better
attacks than Pollard’s rho [Pol78], which runs in exponential time, for elliptic curves with a large
prime-order subgroup. As a consequence, elliptic curve cryptosystems require shorter keys to
attain a certain security level in comparison with those required by the traditional RSA and DL-
based systems. For instance, to achieve a level of security equivalent to the Advanced Encryption
Standard algorithm with 256 bits (AES-256), the National Institute of Standards and Technology
(NIST) recommends the use of ECC keys of 512 bits, whereas RSA would require keylengths of
more than 15000 bits [NIST07]. This significant difference in favour of ECC has led in many
scenarios to faster, more power-efficient and/or memory-friendly implementations, which make
this cryptosystem especially attractive for constrained devices such as wireless sensor nodes,
smartcards, personal digital assistants (PDAs), cellphones, smartphones, and many others.
Moreover, the superior speed of ECC over RSA helps improve the performance of
Web servers in which public-key transactions may be a bottleneck, thus enabling the use of
strong cryptography in a wider range of Internet-based applications [GSF04].
A clear example of the importance of ECC in future commercial and governmental
applications has been set by the inclusion of ECC primitives in the U.S. National Security
Agency (NSA) Suite B Cryptography, which contains a set of recommended algorithms for
classified and unclassified U.S. security systems and information [NSA09]. In particular, the
Elliptic Curve Digital Signature Algorithm (ECDSA) and the Elliptic Curve Diffie-Hellman
(ECDH) key exchange over prime fields (see §2.2.3) are recommended in Suite B for providing
security up to the top secret level. Hence, ECC is arguably becoming the dominant public-
key system in many applications, and is expected to occupy that privileged position for several
years to come. As a direct consequence of this technological shift, the efficient implementation of
ECC schemes on software and hardware platforms is gaining key importance for realizing strong
cryptography.
In that direction, this thesis deals with the fast and efficient computation of elliptic curve
scalar multiplication. This critical operation, denoted by kP (where k is a scalar and P a point on
an elliptic curve), is the central and most time-consuming operation in ECC. Although several
methods to compute kP efficiently have been proposed and extensively studied over the years, it is
still a very interesting challenge to further improve the performance of this operation. Elliptic
curve scalar multiplication comprises three arithmetic layers: field arithmetic, point arithmetic
and scalar arithmetic. Cryptographic protocols and schemes work on top of these layers; see
Section 2.2.3 and [HMV04, Chapter 4] for an overview. In this thesis, we focus on improving the
overall computation at all three arithmetic layers to try to achieve the highest speed possible in
software. In this effort we follow these steps: (i) identify the elliptic curve settings with the
fastest arithmetic; (ii) accelerate the precomputation stage of scalar multiplication; (iii) study
number representations and scalar multiplication algorithms for speeding up the evaluation stage;
(iv) identify the most efficient field arithmetic algorithms and optimize them; (v) analyze the
architecture of the targeted platforms for maximizing the performance of ECC operations; (vi)
identify the most efficient coordinate systems and optimize explicit formulas; and (vii) realize
implementations on x86-64 processors with an optimal algorithmic selection among all studied
cases.
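The evaluation stage targeted by steps (ii) and (iii) refines the classic left-to-right double-and-add method, sketched below with the point operations abstracted as callables. The integer stand-ins in the demo are purely illustrative; any concrete group with doubling and addition can be plugged in.

```python
def double_and_add(k, P, dbl, add):
    """Left-to-right double-and-add: computes kP for a scalar k >= 1.

    `dbl` and `add` stand in for the curve's point doubling and point
    addition; the group-element representation is left abstract here.
    """
    assert k >= 1
    Q = P
    for bit in bin(k)[3:]:      # skip the '0b' prefix and the leading 1-bit
        Q = dbl(Q)              # one doubling per remaining scalar bit
        if bit == '1':
            Q = add(Q, P)       # one addition per nonzero bit
    return Q

# Toy check using the additive group of integers in place of a curve:
print(double_and_add(13, 5, lambda x: 2 * x, lambda x, y: x + y))  # 65
```

The cost is one doubling per bit plus one addition per nonzero bit, which is why the scalar representations discussed below aim to reduce the nonzero density.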
Grouping the steps above, let us examine in greater detail the most relevant
problems and aspects addressed in this study.
A practical strategy that reduces the number of required operations at the expense of some extra
memory is the use of precomputations. In this case, a table of points is built and stored in
advance (precomputation stage) for later use during the execution of the scalar multiplication
itself (evaluation stage). The effect of computing these additional points in the overall cost
basically depends on the context in which the scalar multiplication occurs. In [HMV04],
Hankerson et al. distinguish two possible scenarios that depend on prior knowledge of the
initial point P, and classify the different methods for scalar multiplication accordingly.
Let us illustrate both scenarios, and their subtleties, in the context of the ECDH key exchange
(see Section 2.2.3): when Bob and Alice each compute the initial scalar multiplication using a
random scalar in the first phase of the ECDH scheme, both use a publicly known point P for the
computation. Because P is available beforehand, it is obvious that methods that extensively
exploit precomputations to reduce the cost of the evaluation stage are preferable in this scenario.
Examples of efficient methods in this case are comb methods [HMV04, Section 3.3.2]. On the
other hand, during the second phase of the ECDH scheme, Bob and Alice exchange the results
from the first phase and calculate a new scalar multiplication. This time, however, the results
(which are also points on the curve) are not known in advance by their corresponding recipients.
Although methods may still exploit precomputations, this time the overall cost includes the costs
of both the precomputation and evaluation stages. A well-known method in this case is width-w
NAF (wNAF) [Sol00], which is the windowed version of the standard non-adjacent form (NAF).
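To make the wNAF representation concrete, here is a minimal recoding sketch; the digit order and names are our own choices for illustration and are not taken from [Sol00]:

```python
def wnaf(k, w):
    """Width-w NAF of an integer k > 0, least-significant digit first.

    Every nonzero digit is odd and lies in (-2^(w-1), 2^(w-1)), and any
    two nonzero digits are separated by at least w - 1 zeros, which
    lowers the number of point additions in the evaluation stage.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)           # k mod 2^w
            if d >= (1 << (w - 1)):
                d -= 1 << w            # pick the negative representative
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

# w = 2 gives the standard NAF; e.g. 7 = -1 + 2^3:
print(wnaf(7, 2))  # [-1, 0, 0, 1]
```

Negative digits cost nothing extra on elliptic curves because point negation is essentially free, which is what makes signed representations like NAF and wNAF attractive.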
The cost of the evaluation stage in the computation kP is strongly tied to the representation used
for the scalar k. With the exception of Montgomery’s method [Mon87], the most popular
approach has been the use of the NAF or wNAF representation in combination with some version
of the double-and-add algorithm (see Section 2.2.4.3). However, recently there has been an
increased interest in using novel arithmetic representations of integers based on double- and
multi-base number systems [DJM98, DIM05]. In general, it has been observed that these
representations enable a reduction in the number of point operations required for computing kP.
However, it is still an open question to what extent and in which scenarios the
new multibase representations reduce the computational cost of scalar multiplication. It has been
shown that these methods in fact reduce the number of point operations but in exchange they
require more complex formulas beyond point doubling and addition. The question above
could be partially answered by finding the “optimal” (or close-to-optimal) multibase
representation of a given scalar for a particular setting, where “optimal” is defined here
relative to the computational cost and not to the minimization of the number of additions.
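For illustration, a naive greedy decomposition over the bases {2, 3} is sketched below. This simple unsigned variant only hints at the refined multibase chains developed later in this thesis; it makes no attempt to minimize operation cost.

```python
def largest_2a3b_leq(k):
    """Largest integer of the form 2^a * 3^b (a, b >= 0) not exceeding k."""
    best, p3 = 1, 1
    while p3 <= k:
        t = p3
        while t * 2 <= k:      # pile on factors of 2 under the bound
            t *= 2
        best = max(best, t)
        p3 *= 3                # next power of 3
    return best

def double_base_greedy(k):
    """Greedy (unsigned) double-base decomposition: k as a sum of 2^a * 3^b terms."""
    terms = []
    while k > 0:
        t = largest_2a3b_leq(k)
        terms.append(t)
        k -= t
    return terms

# 100 = 96 + 4 needs only two {2,3}-terms, versus three in binary (64 + 32 + 4):
print(double_base_greedy(100))  # [96, 4]
```

The shorter term lists translate into fewer point additions, at the price of needing efficient tripling formulas on the curve.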
Over the years, many efforts have focused on efficient implementation of ECC primitives on
different platforms [BHL+01, GPW+04, GAS+05, Ber06, CS09]. An incomplete list includes the
analysis on 8-bit microcontrollers, 32-bit embedded devices, graphics processing units, and
processors based on the x86 Instruction Set Architecture (ISA) or the Cell Broadband Engine
Architecture, among many others. At a high level, these works provide two main contributions:
As a side-effect, when different test results are made available, readers learn from direct
comparisons among alternative methods or algorithms.
Processors based on the x86-64 ISA [AMD] have become increasingly popular in the last few
years and are now being extensively used for notebook, desktop and server computers. Hence,
efficient cryptographic computation on these processors is of paramount importance to realize
strong cryptography in a wide variety of applications. Relevant questions are then: what
methods, formulas and parameters, once combined, achieve the highest performance for
computing ECC primitives on these processors? And what features of these devices can
be exploited to gain (or, sometimes, not to lose) performance? It is then clear that, for best
results, the analysis should contemplate architectural features of the processors under consideration.
Elliptic curves over prime fields have traditionally been represented in their short Weierstrass form,
y^2 = x^3 + ax + b, where a, b ∈ Fp. More specifically, the projective form of this curve equation
using Jacobian coordinates has been the preferred elliptic curve shape for many years by most
implementers and standardization bodies such as NIST and IEEE [NIST00, NIST09, IEEE00].
However, in the last few years intense research has focused on new and improved curve
forms. Although these curves have not been standardized by national/international bodies to
date, they provide attractive advantages such as faster arithmetic and/or higher resilience against
certain side-channel analysis (SCA) attacks [Sma01, BJ03b, BL07]. Since in this thesis we are
particularly interested in high-speed cryptography, we focus on two curve forms that currently
exhibit the lowest point operation costs: the extended Jacobi quartic form, y^2 = dx^4 + 2ax^2 + 1,
a, d ∈ Fp; and the Twisted Edwards form, ax^2 + y^2 = 1 + dx^2y^2, a, d ∈ Fp. For each case, we
consider in our analysis and implementations the coordinate system(s) and curve parameters that
in our experience provide the highest performance (see Section 2.2.5 for further details):
We also include the short Weierstrass form because of its widespread use in practice:
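As a small concrete illustration of the Twisted Edwards addition law mentioned above, the sketch below works in affine coordinates over a toy field; the prime and curve constants are values we chose for the example and are far too small for cryptographic use.

```python
# Toy Twisted Edwards curve a*x^2 + y^2 = 1 + d*x^2*y^2 over F_p
# (illustrative parameters only, not a cryptographic curve):
p, a, d = 13, 1, 2

def on_curve(P):
    x, y = P
    return (a * x * x + y * y - 1 - d * x * x * y * y) % p == 0

def ted_add(P, Q):
    """Unified affine addition: the same formula also handles doubling (P == Q)."""
    x1, y1 = P
    x2, y2 = Q
    t = d * x1 * x2 * y1 * y2
    x3 = (x1 * y2 + y1 * x2) * pow(1 + t, -1, p) % p   # modular inverse via 3-arg pow
    y3 = (y1 * y2 - a * x1 * x2) * pow(1 - t, -1, p) % p
    return (x3, y3)

P = (1, 0)                  # a point on this toy curve; (0, 1) is the identity
assert on_curve(P)
print(ted_add(P, P))        # (0, 12)
```

The fact that one formula covers generic addition, doubling and the identity (no exceptional cases for suitable parameters) is one reason Twisted Edwards arithmetic is attractive both for speed and for resistance to certain side-channel attacks.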
Pairing-Based Cryptography
Since Boneh and Franklin [BF01], following pioneering works by several authors [Jou00,
SOK00, Ver01], formalized the use of pairings based on elliptic curves with the introduction of
Identity-Based Encryption (IBE) in 2001, the interest of cryptographers and implementers in this
new research area has grown dramatically. This is mainly due to the potential of pairings for
elegantly solving many open problems in cryptography such as Identity-Based Encryption
[BF01], short signatures [BLS04], multi-party key agreements [Jou00], among many others. See,
for example, [Men09] for an introduction to pairing-based cryptography.
Nevertheless, the pairing computation, which is the central and most time-consuming
operation in most pairing-based schemes, is still relatively expensive in comparison with ECC
operations (e.g., an elliptic curve scalar multiplication is about ten times faster than a pairing
computation at the 128-bit security level on x86-64 processors [BGM+10, GLS09]). Hence, the
development of techniques and methods leading to the optimization of the pairing computation is of
great importance. Given the technological shift to x86-64-based processors, a series of efforts
have recently developed faster pairing implementations targeting these platforms [HMS08,
NNS10, BGM+10]. However, it remains a challenge to further optimize this
crucial operation and thereby incentivize the adoption of these elegant cryptosystems in commercial
applications.
In this thesis, we focus on improving the overall pairing computation to try to achieve the
highest speed possible in software. In this effort we follow these steps: (i) accelerate tower and
curve arithmetic; (ii) identify the most efficient tower and field arithmetic algorithms and optimize
them; (iii) identify the elliptic curve setting with the fastest arithmetic and optimize it; (iv) identify
state-of-the-art techniques for the Miller loop and final exponentiation; and (v) realize an
implementation on x86-64 processors with an optimal algorithmic selection.
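The lazy reduction idea pursued for the tower arithmetic can be previewed with Fp2 multiplication, where modular reductions are deferred until each output coefficient is fully accumulated. The prime below is an illustrative pseudo-Mersenne value of our choosing; BN curves use their own parameterized primes.

```python
p = 2**256 - 189            # illustrative prime, not a BN-curve prime

def fp2_mul_eager(a, b):
    """Schoolbook multiplication in Fp2 = Fp[u]/(u^2 + 1), reducing every product."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 % p - a1 * b1 % p) % p,   # 4 reductions in total
            (a0 * b1 % p + a1 * b0 % p) % p)

def fp2_mul_lazy(a, b):
    """Same result, but integer products are accumulated unreduced and each
    output coefficient is reduced once: 2 reductions instead of 4."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 - a1 * b1) % p,
            (a0 * b1 + a1 * b0) % p)

x, y = (3, 5), (7, 11)
assert fp2_mul_lazy(x, y) == fp2_mul_eager(x, y)
```

In Python the savings are invisible, but in a multiprecision implementation each avoided reduction is a costly Montgomery or pseudo-Mersenne step, provided intermediate results are kept below the reduction boundary; the generalized technique extends this idea across entire tower-field and curve formulas.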
1.2. Contributions
In this thesis, we propose efficient methods and optimized explicit formulas for accelerating the
computation of elliptic curve scalar multiplication and pairings on ordinary curves over prime
fields. In many cases, the improvements are generic and apply to different types of (hardware and
software) platforms.
Our main contributions can be summarized as follows:
applications where memory is not scarce. An extensive comparison with other works is
performed on curves using Jacobian, extended Jacobi quartic and inverted Edwards
coordinates at different security levels. A relevant comparison with the fastest curves
using radix-2 methods is presented and demonstrates that “slower” curves employing
refined multibase chains become competitive for suitably chosen curve parameters on
memory-constrained devices.
• We bring together the most efficient ECC algorithms for performing elliptic curve scalar
multiplication on x86-64 processors and optimize them using techniques from computer
architecture. We study the optimal combination of the incomplete reduction technique and the elimination of conditional branches to achieve high-speed field arithmetic over Fp using
a pseudo-Mersenne prime. We also demonstrate the high penalty incurred by data
dependencies between instructions in neighbouring field operations. Three generic
techniques are proposed to minimize the number of pipeline stalls due to true data
dependencies and to reduce the number of function calls and memory accesses. Further,
explicit formulas are optimized by minimizing the number of “small” field operations,
which are not inexpensive on the targeted platforms. Improved explicit formulas exploiting incomplete reduction and exhibiting a minimal number of operations and a reduced number of data dependencies between contiguous field operations are derived
and explicitly stated for Jacobian coordinates and mixed Twisted Edwards
homogeneous/extended homogeneous coordinates for two cases: with and without using
the GLS method [GLS09]. Record-breaking implementations demonstrating the
significant performance improvements obtained with the optimizations and techniques
under analysis at the 128-bit security level are described. Benchmark results for different
x86-64 processors exhibiting up to 34% cost reduction in comparison with the best
published results are presented.
• We introduce a generalized lazy reduction technique that allows us to eliminate up to
32% of the total number of modular reductions when applied to the towering and curve
arithmetic in the pairing computation. Furthermore, we present a methodology to keep
intermediate results under Montgomery reduction boundaries so that the number of
operations without carry checks is maximized. We illustrate the method with the well-known tower Fp → Fp2 → Fp6 → Fp12, for which case we explicitly state the improved
formulas. Curve arithmetic using Jacobian and homogeneous coordinates is optimized
using the projective equivalence class and with the application of lazy reduction. A
detailed operation count that allows us to determine the theoretical cost improvement
attainable with the proposed method is carried out for the case of an optimal ate pairing
on a BN curve [BN05] at the 128-bit security level. To illustrate the practical
performance boost obtained with the new formulas, we realize a record-breaking implementation.
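The generalized lazy reduction idea can be illustrated on the first tower level: in a multiplication in Fp2 = Fp[i]/(i^2 + 1), double-precision products are accumulated and only the two output coefficients are reduced, instead of reducing after every integer multiplication. A minimal Python sketch under illustrative toy parameters (the prime below is hypothetical; real pairing code uses 254-bit BN primes and word-level double-precision accumulators):

```python
# Lazy reduction in Fp2 = Fp[i]/(i^2 + 1): reduce once per output
# coefficient. Toy prime for illustration only (p ≡ 3 mod 4, so i^2 = -1
# is a non-residue and the extension is a field).
p = 2**31 - 1

def fp2_mul_lazy(a, b):
    """(a0 + a1*i)(b0 + b1*i) with exactly two modular reductions."""
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0 - a1 * b1        # double-precision accumulation, no reduction
    t1 = a0 * b1 + a1 * b0
    return (t0 % p, t1 % p)       # one reduction per coefficient

def fp2_mul_eager(a, b):
    """Reference version reducing after every product."""
    a0, a1 = a
    b0, b1 = b
    return (((a0 * b0) % p - (a1 * b1) % p) % p,
            ((a0 * b1) % p + (a1 * b0) % p) % p)

assert fp2_mul_lazy((5, 7), (11, 13)) == fp2_mul_eager((5, 7), (11, 13))
```

In an actual implementation the saving comes from performing fewer reductions on double-precision words; Python's arbitrary-precision integers hide this, so the sketch only shows where the reductions can be delayed.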
The details above only highlight the most relevant contributions of this thesis. The reader is
referred to Chapters 3, 4, 5 and 6 for additional outcomes.
Partial results that have been developed further in this thesis already appear in the following
relevant publications:
[1] “New Composite Operations and Precomputation Scheme for Elliptic Curve
Cryptosystems over Prime Fields”, with A. Miri. In Proc. Int. Conference on Practice
and Theory in Public Key Cryptography (PKC 2008), 2008. This corresponds to part of
Chapter 3.
[2] “Novel Precomputation Schemes for Elliptic Curve Cryptosystems”, with C. Gebotys. In
Proc. Int. Conference on Applied Cryptography and Network Security (ACNS 2009),
2009. This corresponds to part of Chapter 3.
[3] “Fast Multibase Methods and Other Several Optimizations for Elliptic Curve Scalar
Multiplication”, with C. Gebotys. In Proc. Int. Conference on Practice and Theory in
Public Key Cryptography (PKC 2009), 2009. This corresponds to Chapter 4.
[4] “Efficient Techniques for High-Speed Elliptic Curve Cryptography”, with C. Gebotys. In
Proc. Workshop on Cryptographic Hardware and Embedded Systems (CHES 2010),
2010. This corresponds to Chapter 5.
[5] “Faster Explicit Formulas for Computing Pairings over Ordinary Curves”, with D.F.
Aranha, K. Karabina, C. Gebotys and J. Lopez. In Proc. Advances in Cryptology -
Eurocrypt 2011 (to appear), 2011. This corresponds to Chapter 6.
1.3. Outline
This thesis is organized as follows. In Chapter 2, we present the mathematical background
necessary for the understanding of Elliptic Curve and Pairing-based Cryptography, including
curve definitions and operation costs that will be used throughout the thesis.
In Chapter 3, we introduce the novel precomputation schemes, namely LM and LG schemes,
and present their operation costs when applied to different curve forms in various settings.
In Chapter 4, we discuss our contributions for accelerating the evaluation stage using
multibase representations. We present the theoretical analysis of the (width-w) multibase NAF
method, optimize the windowed variant by applying fractional windows and introduce the new
methodology to derive refined algorithms able to find improved multibase chains.
Chapter 2
Background
In this chapter, we introduce the mathematical tools that are considered fundamental for the
understanding of Elliptic Curve and Pairing-based Cryptography. For more extensive treatments,
the reader is referred to [HMV04, ACD+05]. We begin with an exposition of basic abstract algebra and elliptic curves, and then discuss the security foundations of ECC, some of the most popular EC-based cryptographic schemes and the arithmetic layers that constitute the computation of elliptic curve scalar multiplication. Next, we summarize some advanced
research topics related to special curves and the Galbraith-Lin-Scott (GLS) method, which are
extensively used in Chapters 3-5. We end this chapter with a brief introduction to Pairing-based
Cryptography, including a description of the optimal ate pairing used in Chapter 6.
2.1. Preliminaries
In this section, we introduce some fundamental concepts about finite groups, finite fields, cyclic
subgroups and the generalized discrete logarithm problem.
Finite Groups
A set G is called a finite group with order q, and denoted by (G, ∗ ), if it has a finite number q of
elements, has a binary operation ∗ : G × G → G and satisfies the following properties [HMV04]:
(i) Associativity: (a ∗ b) ∗ c = a ∗ (b ∗ c) for all a, b, c ∈ G.
(ii) Existence of an identity: there exists an element e ∈ G such that a ∗ e = e ∗ a = a for all a ∈ G.
(iii) Existence of inverses: for each a ∈ G there exists an element a^(−1) ∈ G such that a ∗ a^(−1) = a^(−1) ∗ a = e.
In addition, the group is called abelian if it satisfies the commutativity law, that is,
a ∗ b = b ∗ a , for all elements a, b ∈ G .
If the binary (group) operation is called addition (+), then the group is additive. In this case,
the identity element is usually denoted by 0 (zero) and the additive inverse of an element a is
denoted by −a. If, otherwise, the binary (group) operation is called multiplication (·), then the finite group is multiplicative. In this case, the identity element is usually denoted by 1 and the multiplicative inverse of an element a is denoted by a^(−1).
Finite Fields
A field is a set F together with two operations, addition (+) and multiplication (·), s.t. (F, +) and (F*, ·) are abelian groups, where F* = F \ {0}, and the distributive law (a + b)·c = a·c + b·c holds for all elements a, b, c ∈ F. A finite field of order q exists if and only if q is a prime power, i.e., q = p^m, where p is a prime and m ≥ 1. We denote this field by Fq and distinguish the case m = 1, in which Fq is a prime field, from the case m > 1, in which Fq is an extension field of Fp.
Two notable cases are extensively used today to build elliptic curve cryptosystems: prime
fields Fp and binary fields F2m . In this thesis, we focus on the former case. Also, other
extension fields Fp m of large prime characteristic are employed in many applications including
pairing-based cryptography (see Section 2.3) and new ECC systems based on the GLS method
(see Section 2.2.6).
Cyclic Subgroups
Let G be a finite group of order n with multiplication (·) as binary operation, and let g be an element of G. The set ⟨g⟩ = {g^i : 0 ≤ i ≤ r − 1} is the subgroup of G generated by g, where r is the order of the element g, that is, r is the smallest positive integer for which g^r = 1. It is known that r always exists and is in fact a divisor of n. Then, G is a cyclic group with generator g if G = ⟨g⟩ (i.e., r = n holds). The set ⟨g⟩ is also a group itself under the same binary operation and is called the cyclic subgroup of G generated by g. Moreover, a cyclic group G contains exactly one cyclic subgroup of order d for each divisor d of n.
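These facts are easy to check on a toy group; a small Python illustration using the multiplicative group (Z/13Z)*, of order n = 12 (the modulus 13 is an arbitrary small prime chosen only for this sketch):

```python
# Cyclic subgroups of (Z/13Z)*, a multiplicative group of order n = 12.
p = 13
G = list(range(1, p))                 # the twelve elements of (Z/13Z)*

def generated(g):
    """Return the cyclic subgroup <g> = {g^i mod p : 0 <= i <= r-1}."""
    elems, x = [], 1
    while True:
        elems.append(x)
        x = (x * g) % p
        if x == 1:                    # g^r = 1: the subgroup is complete
            return sorted(elems)

assert generated(2) == G              # 2 generates the whole group: cyclic
assert generated(3) == [1, 3, 9]      # <3> has order 3, a divisor of 12
for g in G:                           # the order of every element divides n
    assert 12 % len(generated(g)) == 0
```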
In the next section, we explore the way in which all the points belonging to an elliptic curve
over a prime field Fp form an abelian group under addition, and how the cyclic subgroups of this
group can be used to implement EC-based cryptosystems.
If we define elliptic curve points as the pairs (x, y) solving the curve equation (2.1) and L is any extension field of K, the set of L-rational points on EW,a1,a2,a3,a4,a6 is:
EW(L) = {(x, y) ∈ L × L : y^2 + a1xy + a3y = x^3 + a2x^2 + a4x + a6} ∪ {O},   (2.2)
where O represents the point at infinity and is an L-rational point for all extension fields L of K.
Definition 2.1. Two elliptic curves E1 = EW,a1,a2,a3,a4,a6 and E2 = EW,b1,b2,b3,b4,b6 defined over K in Weierstrass form are said to be isomorphic over K if there exist r, s, t ∈ K and u ∈ K \ {0} such that the mapping (also called an admissible change of variables):
(x, y) → (u^2 x + r, u^3 y + u^2 s x + t)   (2.3)
transforms E1 into E2.
Definition 2.2. If r, s, t ∈ K̄ (the algebraic closure of K) and u ∈ K̄ \ {0} in the setting of Definition 2.1, then the curves E1 and E2 are isomorphic over K̄, or twists of each other. Moreover, j(E1) = j(E2) if and only if E1 and E2 are twists, where j(·) denotes the j-invariant of a given curve equation.
Following Definition 2.2, the j-invariant can be used to determine if two curves are twists.
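For a short Weierstrass curve y^2 = x^3 + ax + b over Fp, the j-invariant is j = 1728·4a^3/(4a^3 + 27b^2), and it is unchanged by the twisting substitution a → au^2, b → bu^3. A quick Python sanity check with illustrative toy parameters:

```python
# The j-invariant is invariant under twisting: the curves with parameters
# (a, b) and (a*u^2, b*u^3) share the same j. Toy prime for illustration.
p = 1009

def j_invariant(a, b):
    """j = 1728 * 4a^3 / (4a^3 + 27b^2) mod p, assuming a nonsingular curve."""
    four_a3 = 4 * pow(a, 3, p) % p
    den = (four_a3 + 27 * pow(b, 2, p)) % p
    return 1728 * four_a3 * pow(den, p - 2, p) % p    # Fermat inversion

a, b, u = 7, 11, 3
assert j_invariant(a, b) == j_invariant(a * u**2 % p, b * u**3 % p)
```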
The Weierstrass equation has had a privileged role in most standards and cryptographic applications because every elliptic curve can be expressed in this form. Moreover, it enables efficient computation when simplified to its isomorphic forms over K known as short Weierstrass curves, which are obtained through an admissible change of variables.
Since in the present work we mainly focus our attention on prime fields Fp with p > 3, we limit the following definitions to Fp only. However, the reader should be aware that the same descriptions extend to any field K of characteristic greater than 3.
For the case of Fp with p > 3, the general Weierstrass equation (2.1) simplifies to the
following form, known as short Weierstrass form:
EW ,a ,b : y 2 = x 3 + ax + b , (2.4)
where a, b ∈ Fp and 4a^3 + 27b^2 ≠ 0. The points on this curve, together with the point at infinity O, form an additive abelian group (EW(Fp), +) when the so-called chord-and-tangent rule is used to define the group operation. In this case, the point at infinity O acts as the identity element of the group law (see Section 2.2.4.2 for more details).
Cyclic subgroups of the group ( EW (Fp ), +) can be used to build elliptic curve cryptosystems.
The hardness of these constructs is based on the so-called Elliptic Curve Discrete Logarithm
Problem (ECDLP), described next.
Let E/Fp be an elliptic curve defined over Fp. If P ∈ E(Fp) is a point of order r, the cyclic subgroup of E(Fp) generated by P is {O, P, 2P, …, (r − 1)P}. Then, if we define the scalar k as an integer in the range [1, r − 1], we can represent the main operation in ECC, namely, scalar multiplication (a.k.a. point multiplication), as the following computation:
Q = kP.   (2.6)
Definition 2.3. Given the cyclic group (E(Fp), +) with generator P and a point Q ∈ ⟨P⟩, the ECDLP is defined as the problem of determining the unique integer k ∈ [0, r − 1] such that Q = kP, where r is the order of points P and Q.
The ECDLP is assumed to be harder than other recognized problems such as integer
factorization and the discrete logarithm problem in the multiplicative group of a finite field,
which are the foundations of RSA [RSA78] and the ElGamal [ElG84] cryptosystems,
respectively.
To assess more precisely the impact of the attacks available for each problem, we first
introduce the following definition about algorithmic running time.
Definition 2.4. If we define the running time of a given algorithm with input n by Ln[a, c] = O(exp((c + ε)(ln n)^a (ln ln n)^(1−a))), where c > 0 and 0 ≤ a ≤ 1 are constants and lim(n→∞) ε = 0, then it is said to be polynomial in ln n (i.e., O((ln n)^(c+ε))) if a = 0, exponential in ln n (i.e., O(n^(c+ε))) if a = 1, and subexponential if 0 < a < 1.
Then, the parameter a can be seen as a measure of the efficacy of an attack against a particular problem: higher values indicate inefficiency (the running time approaches exponential) and lower values indicate efficiency (the running time approaches polynomial). As a consequence, one would prefer systems for which only exponential attacks are known.
In particular, the need for increasingly larger keys in RSA and traditional DL-based systems
is due to the existence of a sub-exponential attack, known as the Number Field Sieve (NFS)
[LLM+93, Gor93], which solves the integer factorization and discrete logarithm problems. This
attack falls in the category of the well-known index calculus attacks, and has an expected running time of Ln[1/3, 1.923]. In contrast, the fastest known method to solve the ECDLP is Pollard's rho [Pol78], which falls in the category of square-root attacks and has the exponential running time O(√r), where r is the order of the cyclic group with generator P in the setting of Definition 2.3.
Note that there are “weaker” curves such as supersingular curves for which it is feasible to
transport the ECDLP to the DLP in the group (Fq^k)* using the Weil pairing and then to apply index
calculus attacks [MOV93]. However, for the wide range of remaining elliptic curves with large
prime order subgroup there are still no better attacks than Pollard’s rho.
In conclusion, it is expected that the key sizes required for ECC using a suitably chosen curve
and underlying field for a given security level are significantly smaller than those required for
traditional cryptosystems based on the integer factorization and DL problems.
Table 2.1 shows the key sizes for EC-based and RSA cryptosystems for equivalent security
levels, as recommended by [NIST07]. Security levels are shown at the bottom of the table and
refer to the bitlength n of keys in a well-designed symmetric cryptosystem such that a brute-force attack would require performing 2^n steps in order to break the system. For instance, an attacker would need to go through all 2^256 possible keys to break AES-256, where n = 256. Estimates for
ECC and RSA systems are based on the key size necessary to successfully run the fastest
algorithm that solves each problem (i.e., Pollard’s rho and NFS, respectively) in a number of
steps that matches the corresponding security level.
Table 2.1. Key sizes for ECC and RSA for equivalent security levels [NIST07].
As we can observe from Table 2.1, ECC requires much smaller keys. This directly translates
to important savings in bandwidth and memory requirements to transmit/store key material.
Moreover, with the rapid advances in software/hardware implementation in recent years, that advantage has also been extended to faster execution times.
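The comparison in Table 2.1 can be reproduced approximately from the running times above. A rough Python sketch (a crude estimate that drops the ε/o(1) terms of Definition 2.4, so the NFS figure is only indicative):

```python
import math

# Bit-security estimates: Pollard's rho costs ~sqrt(r) group operations,
# while NFS costs ~L_n[1/3, 1.923] (constants and o(1) terms dropped).
def rho_security_bits(group_bits):
    return group_bits / 2

def nfs_security_bits(modulus_bits):
    ln_n = modulus_bits * math.log(2)
    cost = math.exp(1.923 * ln_n ** (1 / 3) * math.log(ln_n) ** (2 / 3))
    return math.log2(cost)

# A 256-bit EC group matches the 128-bit level exactly; a 3072-bit RSA
# modulus lands in the same neighbourhood under this crude formula.
assert rho_security_bits(256) == 128.0
assert 130 < nfs_security_bits(3072) < 145
```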
These advantages directly reflect on cryptographic systems based on elliptic curves that have
single and multiple scalar multiplications as their main primitives. Next, we review some of the
best known elliptic curve cryptosystems.
Based on the original key exchange proposed by Diffie and Hellman in [DH76], the ECDH scheme makes use of elliptic curve groups to allow two parties to establish a shared secret key over a
public medium. The protocol is illustrated in Algorithm 2.2 for the case of ECC.
The Elliptic Curve Digital Signature Algorithm (ECDSA) is the most popular EC-based signature scheme.
1363-2000 and ISO/IEC 15946-2. Signature generation and verification are illustrated in
Algorithms 2.5 and 2.6. H denotes a hash function that is assumed to be preimage and collision
resistant.
The security of the ECDH key exchange, the ElGamal elliptic curve cryptosystem and ECDSA is based on the intractability of the ECDLP in ⟨P⟩. In addition, ECDSA requires that the hash function H be preimage and collision resistant. As can be seen, scalar multiplication (or multiple scalar multiplication) constitutes the central (and most time-consuming) operation of the schemes above. Hence, speeding up this operation has a direct impact on the computing performance of any cryptographic protocol based on elliptic curves.
In the following section, we briefly describe the arithmetic layers that constitute the
computation of elliptic curve scalar multiplication. The interested reader is referred to [HMV04,
ACD+05] for a more detailed look at the topic.
The computation of elliptic curve scalar multiplication consists of three arithmetic levels or
layers: field, point and scalar arithmetic. As previously seen, a cryptographic protocol or scheme
works on top of scalar multiplication. However, since this thesis focuses on the efficient
computation of this operation, our discussion will center on the aforementioned arithmetic levels.
Since modular reduction represents an important portion of the cost of computing modular
arithmetic, it is relevant to optimize this operation. In the setting of elliptic curve point
multiplication, the selection of a prime of special form (e.g., a pseudo-Mersenne prime p s.t.
p ≈ 2m ) enables very efficient modular reduction; see Chapter 5 for an implementation of the
field arithmetic using a pseudo-Mersenne prime. If a general form for the prime p is mandatory
for security concerns (e.g., in pairing-based cryptosystems), then the use of Montgomery
arithmetic [Mon85] is a popular choice given its relatively efficient reduction step. In this case, elements x are represented in the form a = x·2^N mod p, where N = t·w, 2^N > p, w is the computer wordlength and t is the number of words. Montgomery reduction produces a·2^(−N) mod p for an input a < 2^N·p. Then, Montgomery multiplication of elements a = x·2^N mod p and b = y·2^N mod p computes c = a·b·2^(−N) mod p = (x·y·2^(2N))·2^(−N) (mod p) = x·y·2^N mod p, which is again in Montgomery representation; see Chapter 6 for an implementation of the field arithmetic using Montgomery arithmetic with a prime of “general” form.
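The Montgomery arithmetic just described can be sketched in a few lines of Python; the modulus and word parameters below are illustrative only (real implementations operate word by word, and `pow(-p, -1, R)` requires Python 3.8+):

```python
# Montgomery representation and reduction (REDC) with toy parameters.
p = 1000000007                 # sample odd prime of no special form
N = 32                         # R = 2^N > p
R = 1 << N
p_inv = pow(-p, -1, R)         # p' = -p^{-1} mod R

def mont_reduce(a):
    """For 0 <= a < R*p, return a * 2^(-N) mod p (REDC)."""
    m = (a * p_inv) % R        # choose m so that a + m*p is divisible by R
    t = (a + m * p) >> N       # exact division by R; t < 2p
    return t - p if t >= p else t

def to_mont(x):
    return (x << N) % p        # x -> x * 2^N mod p

def mont_mul(a, b):
    """(x*2^N)(y*2^N)*2^(-N) = x*y*2^N mod p: stays in Montgomery form."""
    return mont_reduce(a * b)

x, y = 123456789, 987654321
c = mont_mul(to_mont(x), to_mont(y))
assert mont_reduce(c) == (x * y) % p   # convert back and compare
```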
The reader is referred to [HMV04, Chapter 2] and [ACD+05, Chapter 10] for more detailed
discussions about efficient algorithms to perform integer arithmetic and field operations.
In the remainder of this work, we use the following notation in italics to specify the
computing time (or computing cost) required to perform field operations in Fp : A (field addition
or subtraction), S (field squaring), M (field multiplication) and I (field inversion). In some cases,
multiplication by a curve parameter is required. The cost of this operation is denoted by D.
In theoretical estimates throughout this work, we make the following assumptions:
1S = 0.8M , which is commonly used in the literature; the costs of computing field addition/
subtraction and division/multiplication by a small constant are roughly equivalent to one another
and/or negligible in comparison with the cost of field multiplication and squaring; and curve
parameters are suitably chosen such that the cost of multiplying by these constants is negligible.
Whenever required for simplification purposes, the assumptions above are applied in our
theoretical cost analysis. However, the reader should be aware that these assumptions may vary
from one implementation to another.
This level consists of the binary (group) operation accompanying the defined additive abelian
group ( E(Fp ), +) . The different variants of this group operation are better known as point
operations.
The elementary representation of points uses the natural (x, y) coordinates, which in the context of ECC are called affine coordinates (denoted by A for the remainder of this work). As previously stated, the group addition is geometrically defined by the chord-and-tangent rule: (i) the result of adding two distinct points is the reflection about the x-axis of the third point at which the line through the two points intersects the curve; this operation is referred to as point addition and can be visualized in Figure 2.1(a) over the real numbers. (ii) The result of adding a point to itself is the reflection about the x-axis of the second point at which the tangent line at the original point intersects the curve; this operation is referred to as point doubling and can be visualized in Figure 2.1(b) over the real numbers.
Following the geometrical definition, it is relatively easy to derive the following formula to
add two points. Let EW be an elliptic curve over Fp in short Weierstrass form (2.4), where p >
3. Given two points P = ( x1 , y1 ) and Q = ( x2 , y2 ) ∈ EW (Fp ) , where P ≠ ±Q , the addition
P + Q = ( x3 , y3 ) is obtained as follows:
x3 = λ^2 − x1 − x2,   y3 = λ(x1 − x3) − y1,   (2.7)
where λ = (y2 − y1)/(x2 − x1). This addition formula has a cost of 1I + 2M + 1S.
Similarly, the formula for point doubling in affine coordinates can be easily derived from the geometric description above. Let EW be an elliptic curve over Fp in short Weierstrass form (2.4), where p > 3. Given a point P = (x1, y1) ∈ EW(Fp), 2P = (x3, y3) can be
obtained as follows:
x3 = λ^2 − 2x1,   y3 = λ(x1 − x3) − y1,   (2.8)
where λ = (3x1^2 + a)/(2y1). The cost of the previous formula is 1I + 2M + 2S.
There are a few exceptions to the previous formulas that can be solved by applying the
identity element, namely, the point at infinity O. Recall that the point at infinity can be
geometrically defined as the point “lying far out on the y-axis such that any line x = c, for some
constant c, parallel to the y-axis passes through it” [ACD+05]. Thus, if P = (x1, y1) and Q = (x1, −y1), then the addition is given by: P + Q = (x1, y1) + (x1, −y1) = O. The point Q = (x1, −y1) is called the negative of P and is denoted by −P. Similarly, P + O = O + P = P, and O = −O.
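Formulas (2.7) and (2.8), together with the exceptional cases just listed, translate directly into code. A minimal Python sketch over an illustrative toy curve (the parameters are arbitrary and carry no cryptographic strength):

```python
# Affine chord-and-tangent arithmetic on y^2 = x^3 + ax + b over Fp.
p, a, b = 97, 2, 3            # toy curve parameters
O = None                      # the point at infinity

def on_curve(P):
    if P is O:
        return True
    x, y = P
    return (y * y - (x**3 + a * x + b)) % p == 0

def inv(x):
    return pow(x, p - 2, p)   # Fermat inversion (p prime)

def add(P, Q):
    if P is O:
        return Q
    if Q is O:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O                                   # Q = -P
    if P == Q:
        lam = (3 * x1 * x1 + a) * inv(2 * y1) % p  # tangent slope, (2.8)
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p         # chord slope, (2.7)
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

P = (3, 6)                    # 6^2 = 36 = 3^3 + 2*3 + 3 (mod 97)
assert on_curve(P) and on_curve(add(P, P)) and on_curve(add(add(P, P), P))
assert add(P, (3, (-6) % p)) is O                  # P + (-P) = O
```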
in which case the third coordinate Z permits replacing inversions with a few other field operations. More precisely, given a prime field Fp and c, d ∈ Z+, there is an equivalence relation ∼ among nonzero triplets over Fp, such that [HMV04]:
(X : Y : Z) ∼ (λ^c X : λ^d Y : λZ), for any λ ∈ Fp \ {0}.   (2.9)
Note that, in the Jacobian representation (c = 2, d = 3), each projective point (X : Y : Z) corresponds to the affine point (X/Z^2, Y/Z^3). In this case, the curve equation (2.4) acquires the projective form Y^2 = X^3 + aXZ^4 + bZ^6, the negative of a point P = (X : Y : Z) is given by −P = (X : −Y : Z) and the point at infinity corresponds to O = (1 : 1 : 0).
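For the Jacobian case, the equivalence class can be verified numerically: scaling a triplet by (λ^2, λ^3, λ) leaves the corresponding affine point unchanged. A short Python check with arbitrary toy values:

```python
# Jacobian coordinates: (X : Y : Z) ~ (λ^2 X : λ^3 Y : λZ) both map to the
# affine point (X/Z^2, Y/Z^3). Toy prime and values for illustration.
p = 101

def to_affine(P):
    X, Y, Z = P
    zi = pow(Z, p - 2, p)                  # Z^{-1} via Fermat's little theorem
    return (X * zi * zi % p, Y * zi * zi * zi % p)

P = (5, 7, 2)
lam = 9
Q = (lam**2 * 5 % p, lam**3 * 7 % p, lam * 2 % p)
assert to_affine(P) == to_affine(Q)        # same equivalence class
```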
In Table 2.2, we summarize costs of the most efficient point formulas in J coordinates,
including recently proposed composite operations such as tripling (3P) and quintupling (5P) of a
point, which are built on top of traditional doubling and addition operations and are relevant for
the efficient implementation of multibase scalar multiplication methods (see Chapter 4). Also, we
include the highly efficient doubling-addition operation proposed by the author in [Lon07] which
computes the recurrent value 2P + Q and is more efficient than performing a doubling followed
by an addition when using Jacobian coordinates (see also [LM08b]). Besides “traditional” costs
in each case, we also show costs of formulas after applying the technique of replacing
multiplications by squarings (labeled as “Using S-M tradings”) [LM08], based on the algebraic substitutions a·b = [(a + b)^2 − a^2 − b^2]/2 or 2a·b = (a + b)^2 − a^2 − b^2. In general, this technique is more efficient whenever M − S > 4A or M − S > 2A, respectively. The reader is
referred to our online database [Lon08] for complete details about state-of-the-art formulas using
Jacobian coordinates.
Note that formulas considered in Table 2.2 fix a = −3 in the curve equation (2.4) for
efficiency purposes. This assumption, which has been shown not to impose significant restrictions
to the cryptosystem [BJ03], has been recommended and incorporated in public-key standards
[NIST00, IEEE00].
Table 2.2. Costs (in terms of multiplications and squarings) of point operations using Jacobian
(J ) and mixed Jacobian-affine coordinates.
Point operation                                              Cost: “Traditional”     Cost: Using S-M tradings
Doubling (DBL), 2J → J 4M + 4S 3M + 5S
Mixed doubling (mDBL), 2A → J 2M + 4S 1M + 5S
Tripling (TPL), 3J → J 9M + 5S 7M + 7S
Mixed tripling (mTPL), 3A → J 7M + 5S 5M + 7S
Quintupling (QPL), 5J → J 13M + 9S 10M + 12S
Mixed quintupling (mQPL), 5A → J 12M + 8S 8M + 12S
Mixed addition (mADD), J + A → J 8M + 3S 7M + 4S
Mixed addition (mmADD), A + A → J 4M + 2S 4M + 2S
Addition (ADD), J + J → J 12M + 4S 11M + 5S
Addition with two stored values ( ADD[1,1] ), J + J → J 11M + 3S 10M + 4S
Addition with four stored values ( ADD[2,2] ), J + J → J 10M + 2S 9M + 3S
Mixed doubling-addition (mDBLADD), 2J + A → J 13M + 5S 11M + 7S
Doubling-addition (DBLADD), 2J + J → J 17M + 6S 14M + 9S
Doubling-addition ( DBLADD[1,1] ), 2J + J → J 16M + 5S 13M + 8S
For the remainder, doubling (2P), tripling (3P), quintupling (5P), addition (P+Q) and
doubling-addition (2P+Q) are denoted by DBL, TPL, QPL, ADD and DBLADD, respectively. If
at least one of the inputs is in affine and the output is in J coordinates, the operations use mixed
coordinates (see Cohen et al. [CMO98]) and are denoted by mDBL, mTPL, mQPL, mADD and
mDBLADD, corresponding to each of the previous point operations. For addition, the case in
which both inputs are in affine is denoted by mmADD. Costs are expressed in terms of field
multiplications (M) and squarings (S) only. The reader is referred to [Lon08] for the full
operation count.
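As an illustration of the doubling cost in Table 2.2, the S-M traded sequence for a = −3 (the well-known 3M + 5S "dbl-2001-b" formulas from the explicit-formulas literature) can be checked against the affine result on a toy curve; additions and multiplications by the small constants 3, 4 and 8 are counted as "small" operations:

```python
# Jacobian doubling on y^2 = x^3 - 3x + b with cost 3M + 5S.
# Toy parameters: p = 97, b = 3; the point (1, 1) lies on the curve.
p = 97

def dbl_jacobian(X1, Y1, Z1):
    delta = Z1 * Z1 % p                             # S
    gamma = Y1 * Y1 % p                             # S
    beta = X1 * gamma % p                           # M
    alpha = 3 * (X1 - delta) * (X1 + delta) % p     # M (the 3 is "small")
    X3 = (alpha * alpha - 8 * beta) % p             # S
    Z3 = ((Y1 + Z1) ** 2 - gamma - delta) % p       # S
    Y3 = (alpha * (4 * beta - X3) - 8 * gamma * gamma) % p   # M + S
    return X3, Y3, Z3

# Check against the affine formula (2.8): doubling (1, 1) on
# y^2 = x^3 - 3x + 3 gives λ = 0, hence 2P = (-2, -1) = (95, 96) mod 97.
X3, Y3, Z3 = dbl_jacobian(1, 1, 1)
zi = pow(Z3, p - 2, p)
assert (X3 * zi * zi % p, Y3 * zi**3 % p) == (95, 96)
```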
In some cases, it is possible to reduce the cost of certain operations if some values are precomputed in advance. That is the case of addition and doubling-addition with stored values (identified by the subscripts [M, S], where M and S denote the number of precalculated multiplications and squarings, respectively). If, for instance, the values Zi^2 and Zi^3 are calculated for each precomputed point diP in windowed methods, the costs of the aforementioned operations can be reduced by 1M + 1S. Maximum savings can be achieved if four values, namely Z1^2, Z1^3, Z2^2 and Z2^3, can be precalculated before performing an addition of the form (X1 : Y1 : Z1) + (X2 : Y2 : Z2). In this case, we can save up to 2M + 2S.
Variants of J coordinates have also been explored in the literature. In particular, the four-tuple (X : Y : Z : aZ^4) and five-tuple (X : Y : Z : Z^2 : Z^3), known as modified Jacobian (Jm) [CMO98] and Chudnovsky (C) [CC86] coordinates, respectively, permit the saving of some operations by passing recurrent values between point operations. However, most benefits achieved with these representations are virtually cancelled when assuming a = −3 in the EC equation and with the alternative use of operations with stored values. Another (somewhat less efficient) system, referred to as homogeneous (H) coordinates, is defined by fixing c = d = 1 in (2.9).
The costs presented in Table 2.2 (specifically, costs labeled as “Using S-M tradings”) will be
used later for assessing the methods proposed for precomputation and multibase scalar
multiplication in Chapters 3 and 4, respectively. Also, our high-speed implementations of scalar
multiplication in Chapter 5 are based on standard curves using this system. In this case, given the
relatively high cost of additions and other “small” operations on x86-64 processors, we make use
of “traditional” operations without exploiting S-M tradings.
directly translates to a reduction in the number of required additions). By adjusting the double-and-add method for this case, we obtain what is known as the double-and-(add-or-subtract) method; see Algorithm 2.7(b). Popular signed binary representations are the standard non-adjacent form (NAF) and its variants, which are briefly described in the next subsection.
Note that Algorithm 2.7 presents left-to-right versions of the methods discussed above. There
are also right-to-left variants which can be advantageous when protection against side-channel
analysis (SCA) attacks is required. The same observation applies to other methods such as the
Montgomery Ladder [Mon87].
In the remainder of this work, for a scalar multiplication kP, we assume that P ∈ E(Fp) is of order r and E(Fp) is of order #E(Fp) = h·r, where r is prime and h << r. Since it is known that #E(Fp) ≈ p following Hasse's theorem (see Theorem 3.7 in [HMV04]), we have that r ≈ p. Then, if k is a scalar randomly chosen in the range [1, r − 1], the average length of k in binary representation is n = log2 p and the corresponding operation will be referred to as n-bit scalar multiplication. In this case, the double-and-add and double-and-(add-or-subtract) algorithms require on average (n − 1) main loop iterations. We refer to the number of nonzero digits in a given scalar representation as its nonzero density or Hamming weight. In particular, for scalar multiplication, the nonzero density of the representation of k directly translates to the number of point additions required to compute kP.
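The left-to-right double-and-add loop can be sketched generically; for illustration the "group" below is the integers under addition, so the result must equal k·P (in ECC, dbl and add would be the point operations and zero the point at infinity):

```python
# Generic left-to-right double-and-add: one doubling per bit of k (the
# first is trivial) plus one addition per nonzero bit. Assumes k >= 1.
def double_and_add(k, P, dbl, add, zero):
    Q = zero
    for bit in bin(k)[2:]:        # scan bits from most to least significant
        Q = dbl(Q)                # always double
        if bit == '1':
            Q = add(Q, P)         # add when the bit is set
    return Q

# Integers under addition as a stand-in group: doubling is 2x, adding is +.
assert double_and_add(1234567, 5, lambda x: 2 * x,
                      lambda x, y: x + y, 0) == 1234567 * 5
```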
Among the signed radix-2 representations using digits from the set D = {0, ±1}, NAF is a canonical representation with the fewest nonzero digits for any scalar k [Rei60]. The NAF representation of k contains at most one nonzero digit among any two successive digits. The expected nonzero density of this representation is δ_NAF = 1/3. Hence, the average cost of an n-bit scalar multiplication using NAF is approximately (n − 1)DBL + (n/3)ADD, where DBL and ADD represent the cost of doubling and addition, respectively.
If there is memory available, one can exploit precomputations by means of a method known as wNAF [Sol00], which uses precomputed values to “insert” windows of width w. The latter permits the consecutive execution of several doublings to reduce the density of the expansion. The wNAF representation of k contains at most one nonzero digit among any w successive digits, and uses the digit set D = {0, ±1, ±3, ±5, …, ±(2^(w−1) − 1)}, where w ∈ Z+ and w > 2. The average density of nonzero digits for a window of width w is δ_wNAF = 1/(w + 1), and the number of required precomputed points is (2^(w−2) − 1) (hereafter, we refer as precomputed points to non-trivial points not including {O, P}). Hence, the cost using this method is approximately (n − 1)DBL + (n/(w + 1))ADD plus the cost of the precomputation stage.
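The width-w recoding generalizes the NAF scan by consuming w bits at a time; a Python sketch checking the digit set and value preservation (the scalar is an arbitrary example):

```python
# Width-w NAF: odd digits from {±1, ±3, ..., ±(2^(w-1) - 1)}, at most one
# nonzero digit among any w consecutive positions.
def wnaf(k, w):
    digits = []
    while k:
        if k & 1:
            d = k % (1 << w)           # k mod 2^w, an odd residue
            if d >= 1 << (w - 1):
                d -= 1 << w            # map into the signed odd digit set
            k -= d                     # k becomes divisible by 2^w
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

w, k = 5, 0xDEADBEEF
d = wnaf(k, w)
assert sum(di * 2**i for i, di in enumerate(d)) == k
assert all(di == 0 or (di % 2 and abs(di) <= 2**(w - 1) - 1) for di in d)
```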
During the last few years, there has been a growing interest in studying curve forms different from the standardized Weierstrass form (2.4). These special curves have gained increasing attention because in some cases they offer higher resilience against side-channel analysis attacks and/or enable faster implementations. In this work, we focus on two special curve forms that have been shown to achieve very high performance: extended Jacobi quartics and Twisted Edwards curves. Next, we briefly describe both curve shapes in their most generalized form.
An extended Jacobi quartic curve over F_p is defined by the equation

E_JQ : y² = dx⁴ + 2ax² + 1,

where a, d ∈ F_p and d(a² − d) ≠ 0. Results by Billet and Joye [BJ03b] show that every elliptic curve of even order can be written in extended Jacobi quartic form. The projective curve in weighted projective coordinates is given by Y² = dX⁴ + 2aX²Z² + Z⁴, where a projective point (X : Y : Z) corresponds to the affine point (X/Z, Y/Z²). In this case, the negative of a point P = (X : Y : Z) is represented by −P = (−X : Y : Z) and the identity element is given by (0 : 1 : 1).
The most efficient formulae for this case have been developed by Hisil et al. [HWC+07, HWC+08b] using an extended coordinate system of the form (X : Y : Z : X² : Z²), which will be referred to in the remainder of this work as JQ^e.
Note that, recently, Hisil et al. [HWC+09] proposed the use of a mixed system that efficiently combines homogeneous coordinates with an extended homogeneous coordinate system of the form (X : Y : Z : T), where T = X²/Z. However, the formulas for composite operations known to date are faster in weighted projective coordinates JQ^e.
In Table 2.3, we summarize the costs of formulas using extended Jacobi quartic coordinates [HWC+07, HWC+08b]. Note that it is also possible to trade multiplications for squarings in some cases (labeled as "Using S-M tradings"). Similarly to the case of operations with stored values using Jacobian coordinates (see Section 2.2.4.2), the original cost of addition can be reduced further. For instance, the addition with cost 7M + 4S can be reduced to 7M + 3S by noting that (X_i + Z_i)² can be precomputed for each precomputed point (see [HWC+07] for complete details).

Table 2.3. Costs of point operations for an extended Jacobi quartic curve with d = 1 using extended Jacobi quartic (JQ^e) coordinates.

Point operation   Coord.                 Cost ("Traditional")   Cost ("Using S-M tradings")
DBL               2(JQ^e) → JQ^e         3M + 4S + 1D           2M + 5S + 1D
mDBL              2(A) → JQ^e            1M + 6S + 1D           7S + 1D
TPL               3(JQ^e) → JQ^e         8M + 4S + 1D           8M + 4S + 1D
mTPL              3(A) → JQ^e            5M + 6S + 2D           5M + 6S + 2D
mADD              JQ^e + A → JQ^e        7M + 2S + 1D           6M + 3S + 1D
mmADD             A + A → JQ^e           4M + 3S + 1D           4M + 3S + 1D
ADD               JQ^e + JQ^e → JQ^e     8M + 3S + 1D           7M + 4S + 1D
ADD[0,1]          JQ^e + JQ^e → JQ^e     8M + 2S + 1D           7M + 3S + 1D
Given the relatively “well-balanced” performance among all point operations listed in Table
2.3, we use these costs (specifically, the costs labeled as “Using S-M tradings”, assuming that
1D ≈ 0 M ) for evaluating the multibase methods in Chapter 4. We also use this system for
illustrating the use of the LG precomputation scheme in Chapter 3.
Table 2.4. Costs of point operations for a Twisted Edwards curve using inverted Edwards (IE) and mixed homogeneous/extended homogeneous (E/E^e) coordinates.

                  IE (a = 1)                                          E/E^e (a = −1)
Point operation   Coord.          "Traditional"    S-M tradings       Coord.             "Traditional"   S-M tradings
DBL               2(IE) → IE      4M + 3S + 1D     3M + 4S + 1D       2(E) → E           4M + 3S         3M + 4S
mDBL              2(A) → IE       4M + 2S          3M + 3S            2(A) → E           -               -
mADD              IE + A → IE     8M + 1S + 1D     8M + 1S + 1D       E^e + A → E^e      7M              7M
mmADD             A + A → IE      7M               7M                 A + A → E^e        -               -
ADD               IE + IE → IE    9M + 1S + 1D     9M + 1S + 1D       E^e + E^e → E^e    8M              8M
Recently, Galbraith et al. [GLS09] proposed to perform ECC computations on the quadratic twist E′ of an elliptic curve E over F_{p^2} with an efficiently computable homomorphism ψ(x, y) = (αx, βy) such that ψ(P) = λP and λ² + 1 ≡ 0 (mod r), where P ∈ E′(F_{p^2})[r]. Then, following [GLV01], kP can be computed as a multiple point multiplication of the form k₀P + k₁(λP), where k₀ and k₁ have approximately half the bitlength of k. The integers k₀ and k₁ can be calculated by solving a closest vector problem in a lattice or (in the case of a random scalar k) by simply choosing the integers directly [GLS09].
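The decomposition step can be sketched at the integer level. The parameters below (r = 65537, λ = 256, and the short lattice basis) are toy values chosen for illustration only; a real GLS instance would use the group order of E′(F_{p^2}):

```python
from fractions import Fraction

# Toy parameters: r = 65537 is prime and lam = 256 satisfies lam^2 ≡ -1 (mod r).
r, lam = 65537, 256
assert (lam * lam + 1) % r == 0

# Two short vectors of the lattice {(x, y) : x + y*lam ≡ 0 (mod r)}:
v1 = (-256, 1)                     # -256 + 1*256 = 0
v2 = (1, 256)                      # 1 + 256*256 = 65537 ≡ 0 (mod r)

def decompose(k):
    # Babai round-off: write (k, 0) in the basis {v1, v2} and round.
    det = v1[0] * v2[1] - v1[1] * v2[0]
    b1 = round(Fraction(k * v2[1], det))
    b2 = round(Fraction(-k * v1[1], det))
    k0 = k - b1 * v1[0] - b2 * v2[0]
    k1 = -b1 * v1[1] - b2 * v2[1]
    return k0, k1                  # k ≡ k0 + k1*lam (mod r), both ~ sqrt(r)

k0, k1 = decompose(40000)
assert (k0 + k1 * lam - 40000) % r == 0
assert abs(k0) < 2**9 and abs(k1) < 2**9
```

With such a decomposition, kP is evaluated as the two-dimensional multiple point multiplication k₀P + k₁(λP) with half-length scalars.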
It has also been observed that the GLS method can be adapted to different curve forms. In Chapter 5, we evaluate various techniques and optimizations in combination with this method to realize high-speed elliptic curve implementations in software. For this purpose, we choose curves in Weierstrass and Twisted Edwards form. The details of these curve forms using the GLS method, mainly taken from the literature, are summarized next. For complete details, please refer to [GLS09, GLS08].
Corollary 2.1. Let the curve E_W over F_p be defined by (2.4) with #E_W(F_p) = p + 1 − t points, where t is called the trace of E_W/F_p, |t| ≤ 2√p, and let µ be a quadratic non-residue in F_{p^2}. If ab ≠ 0, E_W is isomorphic to the curve:

E′_W/F_{p^2} : y² = x³ + a′x + b′,  (2.13)

with a′ = µ²a and b′ = µ³b ∈ F_{p^2}, and #E′_W(F_{p^2}) = (p − 1)² + t². The curve E′_W is the quadratic twist of E_W over F_{p^2}. The twisting isomorphism is given by φ : E_W → E′_W, φ(x, y) = (u²x, u³y) with u² = µ, which is defined over F_{p^4}. The group homomorphism is given by:

ψ(x, y) = ( (µ/µ^p) · x̄ , (µ^(3/2)/µ^(3p/2)) · ȳ ),  (2.14)

where x̄ and ȳ denote the Galois conjugates of x and y, respectively.
Corollary 2.2. Let the curve E_TE over F_p be defined by (2.12) with #E_TE(F_p) = p + 1 − t points, 4 | (p + 1 − t), |t| ≤ 2√p, and let µ be a quadratic non-residue in F_{p^2}. Then E_TE is isomorphic to the curve:

E′_TE/F_{p^2} : a′x² + y² = 1 + d′x²y²,  (2.15)

and the group homomorphism is given by:

ψ(x, y) = ( (µ^p/µ) · x̄ , ȳ ).  (2.16)
Following [GLS09], for our implementations on Weierstrass and Twisted Edwards curves in Chapter 5 we fix p = 2^127 − 1 ≡ 3 (mod 4) and µ = 2 + i ∈ F_{p^2}, where i² = −1. The chosen prime is assumed to provide approximately 128 bits of security.
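These parameter choices can be sanity-checked numerically. The sketch below uses the fact that µ = 2 + i is a square in F_{p^2} exactly when its norm (2 + i)(2 − i) = 5 is a square in F_p, so a single Legendre-symbol computation suffices:

```python
p = 2**127 - 1                     # Mersenne prime M127
assert p % 4 == 3                  # square roots in F_p cost one exponentiation
# mu = 2 + i is a non-square in F_{p^2} iff its norm 5 is a non-square
# in F_p; Euler's criterion confirms 5^((p-1)/2) ≡ -1 (mod p):
assert pow(5, (p - 1) // 2, p) == p - 1
```

The same check, via quadratic reciprocity, reduces to p ≡ 2 (mod 5), which holds since 2^127 ≡ 3 (mod 5).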
Definition 2.5. Let E be an elliptic curve defined over F_q, P ∈ E(F_q) a point of order r, xP ∈ ⟨P⟩ for a random integer x ∈ [0, r − 1], and aP ∈ ⟨P⟩ a reusable point for an integer a ∈ [0, r − 1]. The Static Diffie-Hellman Problem (denoted by Static DHP) is defined as the problem of determining axP.
Recently, Granger [Gra10] introduced a new attack that was shown to solve the Static DHP in heuristic time O(q^(1−1/(m+1))) for any elliptic curve E(F_{q^m}) if an attacker has access to a Static DHP oracle. Hence, this result is immediately more efficient than Gaudry's attack and, most importantly, faster than Pollard's rho attack if m = 2. Accordingly, it is suggested to avoid the
use of the GLS method in settings where the Static DHP applies (e.g., when the same Diffie-
Hellman secret is reused for various Diffie-Hellman key agreements). Alternatively, one may
increase the key size accordingly to make this attack and Pollard’s rho algorithm roughly
equivalent for solving the ECDLP in E (Fq 2 ) .
We remark that it is known that the Static DHP can be solved for any arbitrary curve in
E (Fq ) with O(q1/ 3 ) Static DHP oracle queries and O(q1/ 3 ) group operations [BG04], which is
faster than the best generic attack achieving complexity O(q1/ 2 ) , namely Pollard’s rho.
Also, it immediately follows that e(aP, bR) = e(P, R)^(ab) = e(bP, aR) for any two integers a and b.
Bilinear pairings provide elegant solutions to some longstanding problems in cryptography, such as Identity-Based Encryption (IBE) [BF01, SOK00], three-party one-round Diffie-Hellman key exchange [Jou00] and short signatures [BLS04], among others, and they have been the focus of intense research since their introduction to cryptography by Boneh, Franklin and others at the beginning of the new millennium. For illustration purposes, we show in Algorithm 2.8 the three-party one-round key agreement by Joux using a bilinear pairing on (G_1, G_T). The reader is referred to [Men09] for a discussion of other fundamental pairing-based protocols.
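A toy model can make the protocol flow concrete. The "pairing" below is the insecure map e(aP, bP) = g^(ab) on integer stand-ins for G_1 and G_T, chosen only to exhibit bilinearity; all parameter values are illustrative:

```python
# Insecure stand-in groups: G1 = (Z_r, +) with generator P = 1, and
# G_T = the order-r subgroup of Z_p*; e(aP, bP) = g^(a*b) is bilinear.
p, r = 1019, 509                   # primes with r | p - 1
g = pow(2, (p - 1) // r, p)        # generator of the order-r subgroup

def e(aP, bP):
    return pow(g, aP * bP, p)

# Joux-style one-round three-party key agreement (cf. Algorithm 2.8):
a, b, c = 123, 456, 78             # the parties' secret scalars
A, B, C = a % r, b % r, c % r      # broadcast "points" aP, bP, cP
K_A = pow(e(B, C), a, p)           # A computes e(bP, cP)^a = e(P, P)^(abc)
K_B = pow(e(A, C), b, p)
K_C = pow(e(A, B), c, p)
assert K_A == K_B == K_C
```

Each party broadcasts one value and performs one pairing plus one exponentiation, which is exactly the one-round structure described above.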
The security of Algorithm 2.8 relies on the infeasibility of computing e(P, P)^(abc) given P, aP, bP and cP by an eavesdropper. This is an instance of the so-called Bilinear Diffie-Hellman Problem, whose intractability is the security basis of many pairing-based protocols. As will be evident later, the hardness of this problem implies the hardness of the Diffie-Hellman Problem.
Definition 2.6. The Bilinear Diffie-Hellman Problem (denoted by BDHP) is the problem of computing e(P, R)^(xy) given P, xP, yP and R.
Definition 2.7. The (Computational) Diffie-Hellman Problem (denoted by DHP) is the problem of computing xyP given P, xP and yP.
Note that if the DHP can be solved in G_1, the value xyP is available and e(P, R)^(xy) can be easily computed as e(xyP, R). A similar conclusion holds for G_T. Since the DHP can be easily solved if the DLP can be solved (DLP ≥_P DHP, that is, the DHP is not harder than the DLP), it can be concluded that DLP ≥_P DHP ≥_P BDHP. Since nothing else is known about the difficulty of solving the BDHP, it is assumed to be as difficult as the DHP, and the security of pairing-based cryptographic schemes ultimately relies on the hardness of the DLP in G_1, G_2 and G_T.
Miller introduced in [Mil86b] an algorithm to evaluate rational functions on algebraic curves, enabling the efficient computation of pairings with linear complexity with respect to the input size (see also [Mil04]). Since then, many optimizations have been proposed to improve the so-called Miller's algorithm by, for instance, reducing the loop length [HSV06, LLP09, Ver10] or constructing pairing-friendly elliptic curves [BN05, BW05, SB06].
When G_1 = G_2 the pairing is called symmetric and is defined over supersingular curves. In this case, the η_T pairing is arguably the most efficient algorithm [BGO+07]. If, otherwise, G_1 ≠ G_2, the pairing is called asymmetric and is defined over ordinary elliptic curves. In this case, the optimized variants of the Tate pairing [BKL+02] (e.g., the ate [HSV06], R-ate [LLP09] and optimal ate [Ver10] pairings) achieve the highest performance.
In this work, we focus on the efficient implementation of asymmetric pairings with ordinary curves (see Chapter 6). Accordingly, we will assume the following groups for the construction of pairings: G_1, G_2 = cyclic subgroups of E(F_{p^k}); G_T = cyclic subgroup of F*_{p^k}.
For the case of ordinary curves, Barreto and Naehrig [BN05] proposed a large and easy-to-generate family of elliptic curves (called BN curves) with embedding degree k = 12, which is optimal for implementing pairings at the 128-bit security level. For our analysis and tests we choose the optimal ate pairing algorithm [Ver10]. We stress, however, that according to our tests other variants of the Tate pairing achieve similar performance on the targeted platforms (i.e., x86-64-based processors).
E_BN : y² = x³ + b,  (2.17)

defined over F_p with b ≠ 0 and embedding degree k = 12, where p = 36u⁴ + 36u³ + 24u² + 6u + 1, the prime order is n = 36u⁴ + 36u³ + 18u² + 6u + 1, and u ∈ Z.
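These parameterizations can be checked mechanically. In the sketch below, u is an illustrative BN parameter of the shape used in the literature for ~128-bit security, and t = 6u² + 1 is the trace, so that n = p + 1 − t:

```python
def bn_params(u):
    # BN parameterization: p and n are quartic polynomials in u; when both
    # are prime, E: y^2 = x^3 + b over F_p has n points and k = 12.
    p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
    n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1
    t = 6*u**2 + 1                 # Frobenius trace
    assert n == p + 1 - t
    return p, n, t

# Illustrative low-Hamming-weight parameter:
p, n, t = bn_params(-(2**62 + 2**55 + 1))
assert p.bit_length() == 254 and n.bit_length() == 254
```

The identity n = p + 1 − t holds for every integer u; a concrete instantiation additionally requires p and n prime, which is not tested here.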
Let π_p : E_BN → E_BN be the p-power Frobenius endomorphism π_p(x, y) = (x^p, y^p), E_BN[n] the n-torsion subgroup of E_BN, E′_BN the sextic twist E′_BN/F_{p^2} : y² = x³ + bξ with ξ neither a cube nor a square in F_{p^2}, G_1 = E_BN[n] ∩ Ker(π_p − [1]) = E_BN(F_p)[n], G_2 the preimage E′_BN(F_{p^2})[n] of E_BN[n] ∩ Ker(π_p − [p]) ⊆ E_BN(F_{p^12})[n], and G_T = µ_n ⊂ F*_{p^12} the group of n-th roots of unity. The optimal ate pairing on curve (2.17) is defined as [NNS10]:

a_opt : G_2 × G_1 → G_T
(Q, P) → ( f_{r,Q}(P) · l_{[r]Q, π_p(Q)}(P) · l_{[r]Q + π_p(Q), −π_p²(Q)}(P) )^((p^12 − 1)/n),  (2.18)

where r = 6u + 2, f_{r,Q} denotes the Miller function and l_{A,B} the line through points A and B.
Algorithm 2.9. Optimal ate pairing on BN curves (including the case u < 0)
Input: P ∈ G_1, Q ∈ G_2, r = 6u + 2 = Σ_{i=0}^{log₂ r} r_i 2^i
The computation ends with the final exponentiation, which corresponds to line 9 of Algorithm 2.9. Note that the power (p^12 − 1)/n is factored into the exponents (p⁶ − 1), (p² + 1) and (p⁴ − p² + 1)/n.
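This factorization is easy to verify at the integer level, using the identity p^12 − 1 = (p⁶ − 1)(p² + 1)(p⁴ − p² + 1) together with the fact that n divides Φ₁₂(p) = p⁴ − p² + 1 on BN curves (a sketch with an illustrative BN parameter u):

```python
u = -(2**62 + 2**55 + 1)                   # illustrative BN parameter
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1

phi12 = p**4 - p**2 + 1                    # 12th cyclotomic polynomial at p
assert p**12 - 1 == (p**6 - 1) * (p**2 + 1) * phi12
assert phi12 % n == 0                      # embedding degree 12: n | Phi_12(p)
# The "hard part" of the final exponentiation is the exponent phi12 // n:
assert (p**12 - 1) // n == (p**6 - 1) * (p**2 + 1) * (phi12 // n)
```

The easy part (raising to p⁶ − 1 and p² + 1) only needs conjugations and Frobenius maps in F_{p^12}, while the hard part Φ₁₂(p)/n dominates the cost.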
Chapter 3
New Precomputation Schemes
This chapter revisits the problem of calculating precomputations efficiently when the base point(s) are not known in advance. There are two standard table forms used by most elliptic curve scalar multiplication methods in the literature: d_iP and c_iP + d_iQ, where c_i, d_i ∈ D = {0, ±1, ±3, ±5, ..., ±m} with m odd. In the first case, the non-trivial points d_iP, where d_i ∈ D⁺\{0, 1} = {3, 5, ..., m} with m odd, must be calculated on-line. In the second case, it is required (in the extreme case) to calculate on-line the non-trivial points c_iP ± d_iQ, where c_i, d_i ∈ D⁺ = {0, 1, 3, 5, ..., m}, c_i > 1 if d_i = 0, d_i > 1 if c_i = 0, and m is odd.
points can be computed on-the-fly at negligible cost. In the remainder, we will refer to these
tables built with non-trivial points as simply di P and ci P ± di Q, respectively. Well-known
methods to compute scalar multiplication using the former table are wNAF and Frac-wNAF in
the case of single scalar multiplication, and the interleaving NAF method in the case of multiple
scalar multiplication [HMV04]. Methods that employ a table with the form ci P ± di Q are
commonly intended for multiple scalar multiplication, such as the Joint Sparse Form (JSF)
[Sol01] and its variants [KZZ04, OKN10].
In this chapter, we propose two novel methods for precomputing points and carry out an
exhaustive analysis at different memory and security requirement levels:
• The first scheme, referred to as the Longa-Miri (LM) Scheme, is based on the special addition with identical Z coordinate [Mel07] and is intended for tables with the form d_iP.
• The second scheme, referred to as the LG Scheme, is based on the new "conjugate" addition technique introduced in this chapter and applies to tables with the forms d_iP and c_iP ± d_iQ.
The different schemes are adapted and analyzed (whenever relevant) in three possible
scenarios (see Section 3.1.1): case 1, without using inversions; case 2, using only one inversion;
and case 3, using multiple inversions. The analysis of the proposed schemes includes three
curves of interest: Weierstrass curves using Jacobian coordinates J, extended Jacobi quartics
using extended Jacobi quartic coordinates JQ e , and Twisted Edwards curves using inverted
Edwards coordinates IE .
This chapter is organized as follows. §3.1 discusses the most relevant previous work. §3.2
introduces the LM precomputation scheme for standard curves using Jacobian coordinates,
targeting the single scalar multiplication case. §3.3 introduces the LG precomputation scheme
and discusses its applicability to different curve forms for both single and multiple scalar
multiplication. §3.4 presents the performance analysis of the proposed schemes, including
detailed comparisons with previous methods. §3.5 discusses other applications for conjugate
additions. And, finally, some conclusions are drawn in §3.6.
The most commonly used precomputation table has the form di P , where di ∈ D + \ {0,1} =
{3,5,..., m} , for some odd integer m. This table form can be found in most algorithms to compute
scalar multiplication such as the wNAF and Frac-wNAF methods (see Section 2.2.4.3).
The traditional approach is to compute the points by following the sequence P → 3P → 5P
→ … → mP with the application of an addition with 2P at each step. Depending on the
coordinate system(s) applied for the calculation, we can distinguish three different cases:
• Case 1: points are precomputed and left in some projective system. This scenario has the potential advantage of very low cost because no additional coordinate system conversion is required. However, because the points are left in a certain projective system, additions during the evaluation stage have the general form and one cannot make use of efficient mixed addition or mixed doubling-addition operations.
• Case 2: points are computed in some projective system and then converted to affine coordinates. The latter step is usually performed with Montgomery's simultaneous inversion method in order to reduce the number of inversions (see Alg. 2.26 of [HMV04]). In this scenario, the precomputation cost is higher because of the conversion-to-affine step. However, the use of mixed additions (or mixed doubling-additions) allows reducing costs during the evaluation stage.
• Case 3: points are computed and left in affine coordinates. This is probably the most expensive of the three cases in terms of speed, mainly because inversion is especially expensive over prime fields. One potential advantage of this approach is that the memory requirement is kept to a minimum.
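Since cases 2 and 3 hinge on the cost of inversion, the simultaneous-inversion trick referred to above is worth making explicit. A minimal sketch (function name ours):

```python
def simultaneous_invert(vals, p):
    # Montgomery's trick: invert len(vals) field elements with a single
    # inversion plus 3*(len(vals) - 1) multiplications.
    prefix = [1]
    for v in vals:
        prefix.append(prefix[-1] * v % p)        # running products
    inv = pow(prefix[-1], p - 2, p)              # the one inversion (Fermat)
    out = [0] * len(vals)
    for i in range(len(vals) - 1, -1, -1):
        out[i] = inv * prefix[i] % p             # strip all factors but vals[i]
        inv = inv * vals[i] % p                  # peel vals[i] off the inverse
    return out

p = 2**127 - 1
zs = [3, 5, 7, 11, 123456789]
assert simultaneous_invert(zs, p) == [pow(z, p - 2, p) for z in zs]
```

Applied to the Z coordinates of a whole precomputed table, this is what limits case 2 to a single field inversion.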
Cases 1 and 2 were studied by Cohen et al. [CMO98] when they proposed the use of mixed coordinates to implement scalar multiplication on Weierstrass curves. In particular, Cohen et al. proposed two alternatives using different coordinate systems: (C₁, C₂, C₃) = (J^m, J, C) and (C₁, C₂, C₃) = (J^m, J, A), where C₁ represents the system in which doublings are performed, C₂ the system for every doubling before an addition, and C₃ the system in which additions are performed (in the evaluation and precomputation stages). In particular, the first approach, which computes precomputations in C coordinates (corresponding to case 1), was shown to be more efficient than the second approach using A coordinates combined with Montgomery's simultaneous inversion method (corresponding to case 2) whenever, approximately, 1I > 30M.
Nevertheless, the conclusions drawn in [CMO98] are somewhat outdated because J^m coordinates (proposed for the evaluation stage in both cases) do not provide any advantage if a = −3, as discussed in Section 2.2.4.2. Also, Cohen et al.'s approach to case 2 involves the use of Montgomery's method over groups of points. However, a more popular alternative in recent years has been to apply the method to all points in the table so that the number of inversions is limited to only one. In this scenario, possible approaches are to compute precomputed points in J, C or H coordinates and then use Montgomery's method over all the partial points.
Very recently, Dahmen et al. [DOS07] proposed a highly efficient method (called the DOS
method) and showed that it is more cost-effective than all other previous schemes using one
inversion (case 2). Also, when compared to the approach using only A coordinates (case 3), the
DOS method exhibits superior performance for a wide range of I/M ratios. The DOS method’s
cost is 1I + (10 L − 1) M + (4 L + 4) S , where L = (m − 1) / 2 is the number of non-trivial points in
the table, and it has a memory requirement of (2L + 4) registers (in this thesis, we assume that each "register" can store a field element). One disadvantage of the DOS method is that there is no straightforward version to compute points as in case 1.
X₃ = (Y₂ − Y₁)² − (X₂ − X₁)³ − 2X₁(X₂ − X₁)²,
Y₃ = (Y₂ − Y₁)(X₁(X₂ − X₁)² − X₃) − Y₁(X₂ − X₁)³,
Z₃ = Z(X₂ − X₁).  (3.1)
Remarkably, Meloni also noticed that one can extract from (3.1) a new representative of P = (X₁ : Y₁ : Z) given by (X₁(X₂ − X₁)², Y₁(X₂ − X₁)³, Z(X₂ − X₁)), which has Z coordinate identical to that of P + Q = (X₃ : Y₃ : Z₃). So one can continue applying the same formula recursively.
The new addition only costs 5M + 2S, which represents a significant reduction in comparison
with 8M + 3S (or 7M + 4S), corresponding to the mixed Jacobian-affine addition (see Table 2.2).
Unfortunately, it is not possible to directly replace traditional additions with this special
operation since, obviously, it is expected that additions are computed over operands with
different Z coordinates during standard scalar multiplication. Hence, Meloni [Mel07] applied his
formula to the context of scalar multiplication with star addition chains, where the particular
sequence of operations allows the replacement of each traditional addition by formula (3.1)
(referred to as ADDCo- Z for the remainder, borrowing notation from [GMJ10]).
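A small numerical sketch of (3.1) may be helpful. The toy curve and helper names below are ours; the check maps the co-Z result back to affine coordinates and compares it against the textbook chord addition:

```python
p, a, b = 10007, 3, 7              # toy curve y^2 = x^3 + a*x + b over F_p

def find_two_points():
    pts, x = [], 0
    while len(pts) < 2:
        rhs = (x*x*x + a*x + b) % p
        y = pow(rhs, (p + 1) // 4, p)       # square root (p ≡ 3 mod 4)
        if y * y % p == rhs:
            pts.append((x, y))
        x += 1
    return pts

def affine_add(P, Q):                       # textbook chord rule, P != ±Q
    (x1, y1), (x2, y2) = P, Q
    lam = (y2 - y1) * pow(x2 - x1, p - 2, p) % p
    x3 = (lam*lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def coz_add(P, Q, Z=1):
    # Meloni's addition (3.1): both operands share the same Z coordinate.
    (X1, Y1), (X2, Y2) = P, Q
    d = X2 - X1
    X3 = ((Y2 - Y1)**2 - d**3 - 2*X1*d*d) % p
    Y3 = ((Y2 - Y1) * (X1*d*d - X3) - Y1*d**3) % p
    return X3, Y3, Z * d % p

P, Q = find_two_points()
X3, Y3, Z3 = coz_add(P, Q)                  # affine inputs, i.e., shared Z = 1
iZ = pow(Z3, p - 2, p)
assert (X3 * iZ * iZ % p, Y3 * iZ**3 % p) == affine_add(P, Q)
```

The 5M + 2S count is visible in the code: two squarings ((Y₂ − Y₁)² and d²) and five multiplications, with d³ obtained as d · d².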
Nevertheless, the author noticed in [Lon07] that the new addition can in fact be useful to
devise new formulas for composite operations such as doubling-addition that are applicable to
traditional scalar multiplication methods (see [Lon07] and also [LM08b]).
In Section 3.2, we again exploit the ADDCo-Z operation to build low-cost precomputation tables. The new approach, called the LM Scheme, offers very low cost and can be easily adapted to cases 1 and 2, exhibiting higher performance and flexibility than the DOS method.
To the best of our knowledge, most research in the literature has only explored the efficiency of precomputation schemes on standard curves in Weierstrass form (2.4).
NOTE: Okeya et al.'s idea is similar to the proposed concept of "conjugate" addition. However, their observation was restricted to affine coordinates, whereas we discovered the idea of saving operations in the computation of P ± Q when observing redundancies in projective coordinate formulae. In general, projective coordinates are largely preferred for curves over prime fields (especially for special curves), so savings in these settings are more valuable.
For the remainder of this chapter, we assume that the curve parameters for the curves under analysis can be chosen such that the cost of multiplying by a curve constant is negligible in comparison with a regular multiplication. Also, in most cases additions and subtractions are neglected in our cost analysis. These assumptions greatly simplify the analysis without affecting the conclusions.
d_iP = 2P + ⋯ + 2P + 2P + P,  (3.2)
performing additions from right to left. We will show that all the additions in (3.2) can be
computed with the ADDCo- Z operation proposed by Meloni [Mel07], reducing costs in
comparison with previous approaches.
The direct scheme applying (3.2) and calculating the points in J coordinates is referred to as
LM Scheme, case 1. Furthermore, although the author proposed in [Lon07, Section 3.4.1] a
version of the method using only one inversion (case 2), in this work we observe that some
values computed during the aforementioned additions can be efficiently exploited to minimize
costs during conversion to A coordinates. In this regard, we present two new and optimized
schemes which are referred to as LM Scheme, cases 2a and 2b.
where α = 3x₁² + a, β = ((x₁ + y₁²)² − x₁² − y₁⁴)/2 = x₁y₁², and the input and result are P = (x₁, y₁) and 2P = (X₂ : Y₂ : Z₂) ∈ E(F_p), respectively. Formula (3.3) can be easily derived from the doubling formula (5.2), Section 5.4, by setting Z₁ = 1, and has a cost of only 1M + 5S + 12A. Note that, if 1M − 1S < 4A, then computing β directly as x₁ · y₁² is more efficient, with a total cost of 2M + 4S + 8A.
Then, by fixing λ = y₁ in (2.10), we can set a point P⁽¹⁾ equivalent to P given by:

P⁽¹⁾ = (X₁⁽¹⁾, Y₁⁽¹⁾, Z₁⁽¹⁾) = (x₁y₁², y₁⁴, y₁) ≡ P = (x₁, y₁, 1),

whose computation does not involve additional costs since its coordinates have already been computed in (3.3). The following additions to compute the points d_iP are then performed using the special addition ADDCo-Z as follows:
1st: Compute 3P = 2P + P⁽¹⁾ = (X₂, Y₂, Z₂) + (X₁⁽¹⁾, Y₁⁽¹⁾, Z₁⁽¹⁾) = (X₃, Y₃, Z₃):

X₃ = (Y₁⁽¹⁾ − Y₂)² − (X₁⁽¹⁾ − X₂)³ − 2X₂(X₁⁽¹⁾ − X₂)²,
Y₃ = (Y₁⁽¹⁾ − Y₂)(X₂(X₁⁽¹⁾ − X₂)² − X₃) − Y₂(X₁⁽¹⁾ − X₂)³,
Z₃ = Z₂(X₁⁽¹⁾ − X₂).

2nd: Fix 2P⁽¹⁾ = (X₂⁽¹⁾, Y₂⁽¹⁾, Z₂⁽¹⁾) = (X₂(X₁⁽¹⁾ − X₂)², Y₂(X₁⁽¹⁾ − X₂)³, Z₂(X₁⁽¹⁾ − X₂)) ≡ (X₂, Y₂, Z₂), and compute 5P = 2P⁽¹⁾ + 3P = (X₂⁽¹⁾, Y₂⁽¹⁾, Z₂⁽¹⁾) + (X₃, Y₃, Z₃) = (X₄, Y₄, Z₄):

X₄ = (Y₃ − Y₂⁽¹⁾)² − (X₃ − X₂⁽¹⁾)³ − 2X₂⁽¹⁾(X₃ − X₂⁽¹⁾)²,
Y₄ = (Y₃ − Y₂⁽¹⁾)(X₂⁽¹⁾(X₃ − X₂⁽¹⁾)² − X₄) − Y₂⁽¹⁾(X₃ − X₂⁽¹⁾)³,
Z₄ = Z₂⁽¹⁾(X₃ − X₂⁽¹⁾),   A₄ = (X₃ − X₂⁽¹⁾),   B₄ = (X₃ − X₂⁽¹⁾)²,   C₄ = (X₃ − X₂⁽¹⁾)³.

⋮

((m−1)/2)th: Fix 2P^((m−3)/2) = (X₂^((m−3)/2), Y₂^((m−3)/2), Z₂^((m−3)/2)) = (X₂^((m−5)/2)(X_((m−1)/2) − X₂^((m−5)/2))², Y₂^((m−5)/2)(X_((m−1)/2) − X₂^((m−5)/2))³, Z₂^((m−5)/2)(X_((m−1)/2) − X₂^((m−5)/2))) ≡ (X₂^((m−5)/2), Y₂^((m−5)/2), Z₂^((m−5)/2)), and compute mP = 2P^((m−3)/2) + (m − 2)P = (X_((m+3)/2), Y_((m+3)/2), Z_((m+3)/2)):

X_((m+3)/2) = (Y_((m+1)/2) − Y₂^((m−3)/2))² − (X_((m+1)/2) − X₂^((m−3)/2))³ − 2X₂^((m−3)/2)(X_((m+1)/2) − X₂^((m−3)/2))²,
Y_((m+3)/2) = (Y_((m+1)/2) − Y₂^((m−3)/2))(X₂^((m−3)/2)(X_((m+1)/2) − X₂^((m−3)/2))² − X_((m+3)/2)) − Y₂^((m−3)/2)(X_((m+1)/2) − X₂^((m−3)/2))³,
Z_((m+3)/2) = Z₂^((m−3)/2)(X_((m+1)/2) − X₂^((m−3)/2)),   A_((m+3)/2) = (X_((m+1)/2) − X₂^((m−3)/2)),
B_((m+3)/2) = (X_((m+1)/2) − X₂^((m−3)/2))²,   C_((m+3)/2) = (X_((m+1)/2) − X₂^((m−3)/2))³.
Intermediate values A_i and (B_i, C_i), for i = 4 to (m + 3)/2, are stored for LM Scheme, cases 2a and 2b, respectively, and used in Step 2 to save some computations when converting points to A coordinates. Note that the LM Scheme, case 1, does not require storing these values.
This step involves the conversion from J to A of the points (X_i : Y_i : Z_i) computed in Step 1, for i = 3 to (m + 3)/2, m > 3, enabling the use of the efficient mixed addition operation during the evaluation stage of scalar multiplication.
Conversion from J to A is achieved by computing (X_i/Z_i², Y_i/Z_i³, 1) (see Section 2.2.4.2). Then, to avoid computing several expensive inversions, we use a modified version of Montgomery's method of simultaneous inversion to limit the requirement to only one inversion for all the points in the precomputed table d_iP.
In LM Scheme, case 2a, we first compute the inverse r = Z_((m+3)/2)⁻¹, and then recover every point using (X_i/Z_i², Y_i/Z_i³, 1) as follows:

mP : x_((m+3)/2) = r² · X_((m+3)/2), y_((m+3)/2) = r³ · Y_((m+3)/2),
⋮
3P : r = r · A₄, x₃ = r² · X₃, y₃ = r³ · Y₃.

In LM Scheme, case 2b, the recovery uses the stored pairs (B_i, C_i) instead; for instance:

3P : r₁ = r₁ · B₄, r₂ = r₂ · C₄, x₃ = r₁ · X₃, y₃ = r₂ · Y₃.
where L = (m − 1)/2 is the number of non-trivial points in the table d_iP. The cost in (3.4) assumes the use of the addition (or doubling-addition) with stored values during the evaluation stage, which requires precalculating the values Z_i² and Z_i³ (see Table 2.2). Otherwise, the cost can be reduced to only (5L + 1)M + (2L + 5)S. In terms of memory usage (for temporary calculations and point storage), LM Scheme, case 1, requires (5L + 6) registers if using the addition or doubling-addition with stored values, or (3L + 6) registers if using operations without stored values.
The LM Scheme, case 2a, has the following cost:
In terms of memory usage, LM Scheme, case 2a, requires (3L + 3) registers overall. In the
case of LM Scheme, case 2b, the cost is as follows:
For this scheme, we require (4 L + 1) registers when L > 1. For L = 1, the requirement is fixed
at 6 registers. It will be shown later that memory requirements of cases 2a and 2b do not exceed
the memory allocated for scalar multiplication for small or intermediate values of L, whereas case
1 does not exceed memory constraints in any case. For the detailed estimation of costs and
memory requirements of the LM Scheme, cases 1, 2a and 2b, please refer to Appendix A2.
For the record, the original scheme in [Lon07] has a cost of 1I + (11L + 2) M + (3L + 5) S . As
can be seen in (3.5) and (3.6), the new LM Scheme variants represent an important improvement
in terms of computing cost. In particular, case 2b achieves the lowest cost in scenarios using one
inversion at the expense of some extra memory.
Next, we analyze the memory requirements for scalar multiplication and determine if our
method adjusts to such constraints.
In the case of using general additions (or doubling-additions), or general additions (or doubling-additions) with stored values, for the evaluation stage (i.e., case 1), scalar multiplication requires in total (3L + R) or (5L + R) registers, respectively, where R is the number of registers needed by the most memory-consuming point operation in a given implementation. In scalar multiplications using solely radix 2, addition and doubling-addition are usually such operations. Depending on the implementation details, these operations can require up to 8 registers [Lon08]. Consequently, the LM Scheme, case 1, adjusts to the above requirements as it always holds that 3L + 6 ≤ 3L + R.
First, note that P − Q = P + (−Q). As the negative of a point only involves changing at most one of the coordinate values in the projective representation (see Sections 2.2.4.2 and 2.2.5), it is then expected that computing P + Q and P − Q share most of the intermediate computations.
Let us illustrate this with the point addition formula using J coordinates. Let P = (X₁ : Y₁ : Z₁) and Q = (X₂ : Y₂ : Z₂) be two points on an elliptic curve E_W/F_p. If the addition P + Q = (X₃ : Y₃ : Z₃) is performed using the optimized addition formula:
X₃ = α² − β³ − 2Z₂²X₁β², Y₃ = α(Z₂²X₁β² − X₃) − Z₂³Y₁β³, Z₃ = θβ,  (3.7)

with α = Z₁³Y₂ − Z₂³Y₁, β = Z₁²X₂ − Z₂²X₁ and θ = Z₁Z₂, then the conjugate addition P − Q = (X₄ : Y₄ : Z₃) can be obtained by reusing the values β³, Z₂²X₁β² and Z₂³Y₁β³ as:

X₄ = γ² − β³ − 2Z₂²X₁β², Y₄ = γ(X₄ − Z₂²X₁β²) − Z₂³Y₁β³,  (3.8)

where γ = Z₁³Y₂ + Z₂³Y₁. Note that (3.8) only involves the extra cost of 1M + 1S, which is significantly less than the cost of a general addition (3.7) (i.e., 11M + 5S). If we also consider other usually neglected operations, the cost drops from 11M + 5S + 9A + 1(×2) + 1(÷2) to only 1M + 1S + 4A. In total, the addition/conjugate addition pair costs 12M + 6S + 13A + 1(×2) + 1(÷2).
It may seem that performing this conjugate operation would involve several extra registers to
store partial values temporarily. However, memory requirements can be minimized by
performing P + Q and P − Q concurrently. For instance, a possible execution sequence for
computing P ± Q using formulas (3.7) and (3.8) is the one shown in Table 3.1.
The execution of the addition/conjugate addition pair detailed in Table 3.1 requires only 8 registers (including temporary registers and registers storing input/output coordinates), which is the same memory requirement as for the addition formula alone. Thus, executing the conjugate addition
does not increase the memory consumption in this case. Similar results are expected for other
coordinate systems.
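The sharing of intermediate values can be made explicit in code. The following sketch (toy curve and function names ours) computes P + Q via (3.7) and P − Q via the conjugate formula, reusing β³, Z₂²X₁β² and Z₂³Y₁β³ so that the second result costs only the extra γ² and one multiplication by γ:

```python
p, a, b = 10007, 3, 7                       # toy curve y^2 = x^3 + a*x + b

def affine(P):
    X, Y, Z = P
    iZ = pow(Z, p - 2, p)
    return X * iZ * iZ % p, Y * iZ**3 % p

def add_and_conj_add(P, Q):
    # P + Q via (3.7) and P - Q via its conjugate: beta^3, Z2^2*X1*beta^2
    # and Z2^3*Y1*beta^3 are computed once and shared by both results.
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    beta  = (Z1*Z1*X2 - Z2*Z2*X1) % p
    alpha = (Z1**3*Y2 - Z2**3*Y1) % p
    gamma = (Z1**3*Y2 + Z2**3*Y1) % p       # the only new quantity
    A = Z2*Z2*X1 * beta*beta % p
    B = beta**3 % p
    C = Z2**3*Y1 * B % p
    Z3 = Z1 * Z2 * beta % p
    X3 = (alpha*alpha - B - 2*A) % p        # P + Q
    Y3 = (alpha * (A - X3) - C) % p
    X4 = (gamma*gamma - B - 2*A) % p        # P - Q: extra cost 1M + 1S
    Y4 = (gamma * (X4 - A) - C) % p
    return (X3, Y3, Z3), (X4, Y4, Z3)       # both results share Z3

# Two affine points, lifted to Jacobian representatives with Z = 2 and Z = 3:
pts, x = [], 0
while len(pts) < 2:
    rhs = (x*x*x + a*x + b) % p
    y = pow(rhs, (p + 1) // 4, p)           # square root (p ≡ 3 mod 4)
    if y * y % p == rhs:
        pts.append((x, y))
    x += 1
(x1, y1), (x2, y2) = pts
S, D = add_and_conj_add((x1*4 % p, y1*8 % p, 2), (x2*9 % p, y2*27 % p, 3))

for (sx, sy), q2y in ((affine(S), y2), (affine(D), -y2)):
    lam = (q2y - y1) * pow(x2 - x1, p - 2, p) % p
    assert sx == (lam*lam - x1 - x2) % p and sy == (lam*(x1 - sx) - y1) % p
```

The final loop checks both outputs against the affine chord rule applied to Q and to −Q, confirming that the conjugate result really is P − Q.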
INPUT: P = ( X 1 : Y1 : Z1 ) and Q = ( X 2 : Y2 : Z 2 ) ; T1 ← X 1 , T2 ← Y1 , T3 ← Z1 , T4 ← X 2 , T5 ← Y2 , T6 ← Z 2
OUTPUT: P + Q = ( X 3 : Y3 : Z 3 ) = (T1 : T2 : T3 ) and P − Q = ( X 4 : Y4 : Z 4 ) = (T4 : T5 : T3 )
Table 3.2. Costs of addition/conjugate addition formulas using projective (J, IE and JQ^e) and affine coordinates.

Point operation    Standard curve (a = −3), J    Twisted Edwards (a = 1), IE    Ext. Jacobi quartic (d = 1), JQ^e
In the following section, we introduce novel precomputation schemes for tables with the
forms di P and ci P ± di Q that take advantage of the new conjugate formulas. We again consider
all three precomputation scenarios, i.e., cases 1, 2 and 3.
We propose a recursive scheme that first reaches a "strategic" point and then efficiently applies the conjugate addition technique described in the previous section. In the following, we define as "strategic" those points that can be efficiently computed and from which it is possible to calculate the maximum possible number of precomputed points at the lowest cost. The steps of our scheme are detailed in the following.
The main body of our scheme is detailed in Algorithm 3.1. In this step, points can be computed in projective coordinates using operations from Tables 2.2, 2.3 or 2.4 (case 1), or directly in A coordinates (case 3). If projective points are to be converted to A (case 2), then Step 2, described next, should additionally be performed.
Example 3.1. If m = 13, Algorithm 3.1 computes the first points as P → 3 P → 6 P , where 6P is
the first “strategic” point. From this, 5P and 7P (close points) are calculated by adding
6 P + ( − P ) and 6P + P . Note that the latter operations can be calculated with a low cost
addition/conjugate addition pair. Then, Algorithm 3.1 calculates the following “strategic” point
(since m > 12) by doubling 6 P → 12 P = rmax P , and finally computes close points 9P, 11P and
13P by performing 12P + (−3P ) , 12 P + ( − P ) and 12P + P , respectively. Again the last two
operations can also be computed with an addition/conjugate addition pair.
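The structure of Example 3.1 can be traced at the integer level (a sketch; the operation labels and function name are ours and only mirror the sequence described above, no curve arithmetic is performed):

```python
def lg_trace(m=13):
    # Integer-level trace of Example 3.1: multiples of P are represented
    # by their integer index; each tuple is (operation, input(s), output(s)).
    ops = [("TPL", 1, (3,)),              # P -> 3P
           ("DBL", 3, (6,)),              # 3P -> 6P: first "strategic" point
           ("ADD/cADD", (6, 1), (5, 7)),  # 6P -/+ P: one conjugate pair
           ("DBL", 6, (12,)),             # next strategic point, since m > 12
           ("ADD", (12, 3), (9,)),        # 12P + (-3P)
           ("ADD/cADD", (12, 1), (11, 13))]
    produced = {d for _, _, outs in ops for d in outs}
    table = sorted(d for d in produced if d % 2 and d >= 3)
    return ops, table

ops, table = lg_trace()
assert table == [3, 5, 7, 9, 11, 13]
assert sum(op[0] == "ADD/cADD" for op in ops) == 2
```

The trace makes the accounting visible: six odd multiples are produced with one tripling, two doublings, one plain addition and only two addition/conjugate addition pairs.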
In Appendix A4, we have sketched the derivation of points for tables with different values of m. Note that the described method does not include the case m = 5. For a table with m = 5 in JQ^e and J coordinates, it is more efficient to compute the points by performing P → 2P → 4P, and then obtaining 3P and 5P with an addition/conjugate addition pair (i.e., 4P + (−P) and 4P + P). For the case IE, we suggest computing the table following the sequence P → 2P → 3P → 5P.
If mixed addition (or mixed DBLADD) is significantly more efficient than general addition (or
general DBLADD) in a given setting, then it could be convenient to express the precomputed
table in A coordinates.
It is known that conversion to A can be achieved by calculating (X_i/Z_i², Y_i/Z_i³), (X_i/Z_i, Y_i/Z_i²) and (Z_i/X_i, Z_i/Y_i) for points in J, JQe and IE coordinates, respectively. For each setting, the calculation of denominators (denoted by u_i) can be carried out efficiently by using
Montgomery's method of simultaneous inversion. In this way, the number of expensive inversions can be limited to only one.
First, we compute the inverse U = (u₁u₂…u_t)⁻¹, where the u_i are all distinct denominators of the conversion expressions above (without considering exponents) from all the non-trivial points in the table {3P, 5P, …, mP}. For J and JQe, the number of such denominators reduces to only t = (m − 1)/2 − c, where c is the number of points computed via conjugate addition, since points computed with addition/conjugate addition pairs share the same Z coordinate (see Appendix A3). For IE, t = m − 1 as each point has two distinct denominators, namely X_i and Y_i.
Then, the individual denominators u_i are recovered from U and scaled with the corresponding exponent (if any), and the results are finally multiplied by their corresponding numerators following the conversion expressions.
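As an illustration of this conversion, the following sketch batch-converts Jacobian points over a small prime field using Montgomery's trick with a single inversion (function and variable names are ours; the field and points are toy values):

```python
def batch_jacobian_to_affine(points, p):
    """Convert Jacobian points (X : Y : Z) over GF(p), p prime, to affine
    (X/Z^2, Y/Z^3) with a single field inversion via Montgomery's
    simultaneous-inversion trick."""
    zs = [Z for (_, _, Z) in points]
    # Prefix products u_1, u_1*u_2, ..., u_1*...*u_t.
    prefix = [zs[0]]
    for z in zs[1:]:
        prefix.append(prefix[-1] * z % p)
    # The only inversion: U = (u_1 * ... * u_t)^(-1) mod p.
    U = pow(prefix[-1], p - 2, p)
    # Back-substitution recovers each z_i^(-1) with two multiplications.
    invs = [0] * len(zs)
    for i in range(len(zs) - 1, 0, -1):
        invs[i] = U * prefix[i - 1] % p
        U = U * zs[i] % p
    invs[0] = U
    return [(X * iz * iz % p, Y * iz * iz * iz % p)
            for (X, Y, Z), iz in zip(points, invs)]
```

For instance, over GF(97) the points (12 : 40 : 2) and (78 : 17 : 5) convert back to (3, 5) and (7, 11).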
Thus, the use of conjugate additions reduces the cost of Montgomery's method for J and JQe. Following the details above, it can be verified that one saves (4M + 1S) and (3M + 1S) per point computed with a conjugate addition using J and JQe coordinates, respectively.
The "generic" costs of the proposed scheme, cases 1/3 and case 2, are given by:
Cost_{LG Scheme, cases 1/3} = 1TPL + (ω − 2)DBL + (2ε − L + 1)ADD + (L − ε − 1)ADD-ADD′,  (3.9)
Cost_{J→A} = 1I + (6L − 4c − 3)M + (L − c)S,  (3.11)
Cost_{JQe→A} = 1I + (5L − 3c − 3)M + (2L − c)S,  (3.12)
Please refer to Appendix A5 for the proof. We remark that cost formula (3.14) generalizes to any projective system. Hence, depending on the curve form selected, some additional speed-ups are available. Let us discuss some of these optimizations in the context of J coordinates. First, when performing additions with a "strategic" point Q, the values Z_Q² and Z_Q³ are calculated in the first mixed addition, say Q + P = (X_Q : Y_Q : Z_Q) + (x_P, y_P). Then, subsequent general additions of the form Q + R = (X_Q : Y_Q : Z_Q) + (X_R : Y_R : Z_R) can be executed using ADD[1,1] in case 1, saving (1M + 1S) per operation. This can be optimized further by using ADD[2,2] instead, saving (2M + 2S) per general addition, if one assumes that the evaluation stage
employs additions with stored values and all values Z_i² and Z_i³ need to be precomputed in case 2. Also, one squaring can be saved every time a doubling 2P_j is performed to get a "strategic" point, since the value Z_j² can be obtained from the initial tripling or the mixed addition preceding this doubling. Moreover, as observed before, addition and conjugate addition formulas share the same Z coordinate. Hence, in case 2 we only require (1M + 1S) to get Z_i² and Z_i³ for two points computed with an addition/conjugate addition pair. Similar savings apply to the conversion to affine for case 1, where one saves (4M + 1S) per conjugate addition, as discussed in the previous section. By applying these optimizations to (3.14) with (3.11), we obtain the following cost formulas for the LG Scheme, cases 1 and 2, using Jacobian coordinates:
Note that it is still possible to further optimize cost (3.16) for case 2 if every addition with 3P is computed with ADD[1,1] by reusing the values Z_{3P}² and Z_{3P}³ computed in the tripling operation. This saves an extra (1M + 1S) per addition with 3P.
The following optimizations to cost formula (3.14) using JQe coordinates are analogous to the ones described for J coordinates. First, one squaring can be saved every time a doubling 2P_j is performed to get a "strategic" point by noting that (X_j + Z_j)² can be obtained from the initial tripling or the mixed addition preceding this doubling. Also, when performing additions with a "strategic" point Q, the value (X_Q + Z_Q)² is calculated in the first mixed addition. Then, each extra addition with the same point Q can be executed using ADD[0,1] in case 1, saving 1S per operation. This can be optimized further by using ADD[0,2] instead, saving 2S per general addition, if one assumes that the evaluation stage employs additions with stored values and all values (X_i + Z_i)² need to be precomputed in case 2. Thus, the optimized costs of the LG Scheme, case 1 and case 2, using extended Jacobi quartic coordinates are given by:
Again, it is still possible to further optimize cost (3.18) for case 2 if every addition with 3P is computed with ADD[0,1] by reusing the value (X_{3P} + Z_{3P})² computed in the tripling operation. This saves an extra squaring per addition with 3P.
In Table 3.3 we list the cost of the LG Scheme for various values of L using the derived formulas (3.15), (3.16), (3.17) and (3.18). Costs for IE coordinates can be obtained by simply applying operations from Tables 2.4 and 3.2 to cost formulas (3.14) and (3.13). As operations in
affine are relatively expensive in extended Jacobi quartic and Twisted Edwards curves (see Table
3.2), we only show the performance of case 3 in the setting of standard curves estimated with
formula (3.9). In Sections 3.4.1 and 3.4.2, we carry out an exhaustive evaluation of this method’s
performance.
Table 3.3. Costs of the LG precomputation scheme: case 1 in projective coordinates using J, JQe and IE; case 2 using one inversion; and case 3 in A.
# Points (L) | Case 1, J | Case 1, JQe | Case 1, IE | Case 2, J | Case 2, JQe | Case 2, IE | Case 3, A
3 | 17M + 17S | 15M + 17S | 22M + 8S | 1I + 27M + 18S | 1I + 24M + 20S | 1I + 40M + 8S | 3I + 13M + 8S
7 | 40M + 32S | 34M + 32S | 51M + 14S | 1I + 64M + 33S | 1I + 57M + 37S | 1I + 93M + 14S | 6I + 23M + 14S
15 | 85M + 57S | 71M + 57S | 108M + 22S | 1I + 139M + 60S | 1I + 122M + 68S | 1I + 198M + 22S | 11I + 41M + 24S
This scenario mainly applies to methods for computing multiple scalar multiplications, such as those based on the JSF [Sol01, OKN10, SEI10]. In this case, the application of our strategy of conjugate additions is straightforward, since precomputed points have the form c_iP ± d_iQ and each point pair c_iP + d_iQ and c_iP − d_iQ with c_i, d_i ≠ 0 can be computed with an addition/conjugate addition pair. Points c_iP and d_iQ are computed using the chain P → P + 2P = 3P → 3P + 2P = 5P → … → (m − 2)P + 2P = mP. Interestingly, we note that, for the case of Jacobian coordinates with m ≥ 5, this chain can be performed using the LM Scheme, thus reducing the cost further.
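The table sizes involved can be checked with a short enumeration. The sketch below (our own illustration) builds the table {c_iP ± d_iQ} symbolically and confirms that the number of addition/conjugate addition pairs is (m + 1)²/4 and the total number of non-trivial points is L = (m² + 4m − 1)/2:

```python
def jsf_table_counts(m):
    """Enumerate the precomputed table for the window-based JSF with odd
    window values up to m. Returns (L, c): the number of non-trivial points
    and the number of addition/conjugate addition pairs c_i*P +/- d_i*Q."""
    odds = range(1, m + 1, 2)
    points = [(ci, 0) for ci in odds if ci > 1]      # 3P, 5P, ..., mP
    points += [(0, di) for di in odds if di > 1]     # 3Q, 5Q, ..., mQ
    conj_pairs = 0
    for ci in odds:
        for di in odds:
            points += [(ci, di), (ci, -di)]          # one ADD plus one ADD'
            conj_pairs += 1
    return len(points), conj_pairs

# Consistency check against the closed-form counts used in this section:
for m in (1, 3, 5):
    L, c = jsf_table_counts(m)
    assert L == (m * m + 4 * m - 1) // 2 and c == (m + 1) ** 2 // 4
```

For example, m = 5 gives the L = 22 points of Table 3.4, of which 9 pairs are served by conjugate additions.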
In the following, we analyze the cost involved when precomputing points for the window-based JSF [OKN10, SEI10]. The extension of the method to similar table forms follows easily.
Cost_{LG Scheme, cases 1/3(2)} = (m − 1)ADD + ((m + 1)²/4)(ADD-ADD′) + 2⌈(m − 1)/m⌉DBL (+Cost_{P→A}),  (3.19)
where L = (m² + 4m − 1)/2 > 1 and again Cost_{P→A} (which only applies to case 2) denotes the cost of converting points from projective to affine coordinates and is defined by cost formulas (3.11), (3.12) and (3.13) for J, JQe and IE, respectively. For these formulas, c = (m + 1)²/4. Cost
(3.19) assumes that points c_iP ± d_iQ for which c_i or d_i = 0 are computed using the chain P → P + 2P = 3P → 3P + 2P = 5P → … → (m − 2)P + 2P = mP. As mentioned before, one can apply the LM Scheme to this computation when using J coordinates. The cost of this combined LG/LM Scheme is given by (m ≥ 5):
Cost_{LG Scheme, J, cases 1(2)} = 2DBL + (m − 1)ADD_{Co-Z} + ((m + 1)²/4)(ADD-ADD′) (+Cost2_{J→A}),  (3.20)
where Cost2_{J→A} = [2m(m + 4) − 1]M + [(m + 1)²/4 + 2]S applies to case 2 only and represents the cost of converting points from Jacobian to affine coordinates using a modified Montgomery simultaneous inversion method that has been adapted to case 2b of the LM Scheme and the use of conjugate additions. Please refer to Appendix A6 for the proof and extended details.
We remark that further optimizations are possible, such as the use of mixed coordinates or
efficient tripling formulas. Similarly, certain coordinate systems such as J and JQ e allow again
the use of efficient addition formulas with stored values, following the same optimizations
described in Section 3.3.2.1.
In Table 3.4, we show the cost performance of the proposed scheme for the curve forms
under analysis and considering the discussed optimizations. As operations in affine are relatively
expensive in JQ e and IE coordinates, we only show the performance of case 3 in the setting of
standard curves. We carry out the evaluation of this method’s performance in Section 3.4.3.
Table 3.4. Cost of the LG precomputation scheme for tables of the form c_iP ± d_iQ: case 1 in projective coordinates; case 2 using one inversion; and case 3 in affine coordinates.
# Points (L) | Case 1, J | Case 1, JQe | Case 1, IE | Case 2, J | Case 2, JQe | Case 2, IE | Case 3, A
22 | 107M + 65S | 100M + 74S | 159M + 18S | 1I + 175M + 68S | 1I + 180M + 91S | 1I + 291M + 18S | 15I + 48M + 26S
There are several schemes in the literature for computing precomputed points on standard curves (see Section 3.1.1). The simplest approaches suggest performing computations in A or C coordinates using the chain P → 3P → 5P → … → mP. The latter requires one doubling and L = (m − 1)/2 additions, which can be expressed as follows in terms of field operations:
Cost_A = (L + 1)I + (2L + 2)M + (L + 2)S,  (3.21)
Note that (3.22) shows a better performance than the estimated cost given by [DOS07], since we consider that the initial doubling 2P is computed as 2A → C with a cost of 2M + 5S, the first addition P + 2P is computed with a mixed addition A + C → C (7M + 4S), and the following (L − 1) additions as C + C → C (10M + 4S). The new operation costs are obtained by
applying the technique of replacing multiplications by squarings [LM08]. The memory
requirements of the A- and C-based methods are (2 L + R ) and (5 L + R ) registers, respectively,
where R is again the memory requirement of the most memory-demanding point operation used
for scalar multiplication.
Let us first compare the performance of the proposed methods with approaches using several
inversions (case 3). In this case, we show in Table 3.5 the performance comparison of the LG
Scheme, case 3, with the traditional A-based approach whose cost is given by (3.21). Also, the
I/M ratios for which the traditional, LG and LM methods achieve the lowest cost are shown at the
Table 3.5. Costs of different schemes using multiple inversions (case 3) and I/M ranges for
which each scheme achieves the lowest cost on a standard curve form (1M = 0.8S).
# Points (L) | 2 | 3 (w = 4) | 6 | 7 (w = 5) | 14 | 15 (w = 6)
LG Scheme (case 3) | 3I + 12.8M | 3I + 19.4M | 6I + 31.4M | 6I + 34.2M | 11I + 57.4M | 11I + 60.2M
Traditional (3.21) | 3I + 9.2M | 4I + 12M | 7I + 20.4M | 8I + 23.2M | 15I + 42.8M | 16I + 45.6M
I/M range (LM, case 2b) | I > 8.4M | I > 8.6M | I > 8M | I > 9M | I > 9.6M | I > 10.4M
I/M range (LG, case 3) | – | 7.4M < I < 8.6M | – | 5.5M < I < 9M | 3.7M < I < 9.6M | 2.9M < I < 10.4M
I/M range (traditional) | I < 8.4M | I < 7.4M | I < 8M | I < 5.5M | I < 3.7M | I < 2.9M
bottom of the table. Note that we include the LM Scheme, case 2b, in the comparison to determine the efficiency gained by using an approach based on only one inversion (case 2).
An important result from Table 3.5 is that the LM Scheme, case 2b, outperforms approaches using several inversions for a wide range of I/M ratios. In general, this method is superior whenever inversion costs more than 8-10 times a multiplication, which holds in the majority of implementations over prime fields. On the other hand, the LG Scheme, case 3, is only suitable for low/intermediate I/M values.
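The break-even I/M ratios at the bottom of Table 3.5 can be reproduced from the cost expressions. The sketch below (our helper; costs are taken from formula (3.6) with 1S = 0.8M and from Table 3.5) computes the ratio above which a cost with fewer inversions wins:

```python
def crossover_im(cost_a, cost_b):
    """I/M ratio above which cost_a = (inversions, mults) beats cost_b.
    Requires cost_a to use fewer inversions than cost_b."""
    (ia, ma), (ib, mb) = cost_a, cost_b
    assert ia < ib
    return (ma - mb) / (ib - ia)

def lm_case_2b(L):
    # LM Scheme, case 2b: 1I + (10.6L + 4.8)M  (formula (3.6), 1S = 0.8M)
    return (1, 10.6 * L + 4.8)

# LG Scheme, case 3, costs taken from Table 3.5 (1S = 0.8M):
print(round(crossover_im(lm_case_2b(3), (3, 19.4)), 2))  # L = 3: 8.6, matching "I > 8.6M"
print(round(crossover_im(lm_case_2b(7), (6, 34.2)), 2))  # L = 7: 8.96, i.e., roughly "I > 9M"
```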
Now, let us evaluate methods for case 1, and consider the C-based approach, whose cost is given by (3.22), for our comparisons. In this case, we should also consider the cost of scalar multiplication, as the evaluation stage in C coordinates has a cost different from that of our methods. When precomputations are in C, Cohen et al. [CMO98] proposed the use of J + C → J^m to perform additions (10M + 6S), 2J^m → J (2M + 5S) for every doubling preceding an addition, and 2J^m → J^m (3M + 5S) for the remaining doublings. Again, we have reduced the cost of these operations by applying the technique discussed in [LM08] to trade multiplications for squarings. Using this scheme, the scalar multiplication cost including precomputation (3.22) is as follows:
In the case of the LG and LM Schemes, case 1, we consider the use of additions with stored values. Thus, the approximate cost of scalar multiplication is given by:
n·DBL + ((n − 1)δ_Frac-wNAF/(L + 1))mADD + (L(n − 1)δ_Frac-wNAF/(L + 1))ADD[1,1] + Cost_{scheme, case 1},  (3.24)
where 1DBL = 3M + 5S, 1mADD = 7M + 4S, 1ADD[1,1] = 10M + 4S (as in Table 2.2) and Cost_{scheme, case 1} is given by (3.4) or (3.15) for the LM and LG Schemes, respectively.
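Formula (3.24) is straightforward to evaluate numerically. The helper below (our own sketch) expresses the estimate in multiplications, using the operation costs just quoted with 1S = 0.8M; the Frac-wNAF density δ and the precomputation cost are inputs supplied by the caller:

```python
def scalarmul_cost_case1(n, L, delta, precomp_M):
    """Evaluate (3.24) in field multiplications (M), taking 1S = 0.8M and the
    Jacobian costs DBL = 3M + 5S, mADD = 7M + 4S, ADD[1,1] = 10M + 4S.
    `delta` is the Frac-wNAF nonzero density for the chosen L (an input here),
    and `precomp_M` is Cost_{scheme, case 1} already reduced to M."""
    S = 0.8
    DBL, mADD, ADD11 = 3 + 5 * S, 7 + 4 * S, 10 + 4 * S
    adds = (n - 1) * delta            # expected number of nonzero digits
    mixed = adds / (L + 1)            # fraction served by mixed additions
    general = L * adds / (L + 1)      # remainder uses ADD[1,1]
    return n * DBL + mixed * mADD + general * ADD11 + precomp_M
```

For example, with n = 160, L = 3, an assumed density δ = 0.25 and the precomputation cost omitted, the estimate is about 1615M.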
Tables 3.6, 3.7 and 3.8 show the costs of performing an n-bit scalar multiplication using the
different methods above (case 1) for n = 160, 256 and 512 bits, respectively. We show results for
Table 3.6. Performance comparison of LG and LM Schemes with the C-based method (case 1) in
160-bit scalar multiplication on a standard curve form (1M = 0.8S).
# Points (L) 2 3 (w = 4) 4 5 6 7 (w = 5)
Table 3.7. Performance comparison of LG and LM Schemes with the C-based method (case 1) in
256-bit scalar multiplication on a standard curve form (1M = 0.8S).
# Points (L) 2 3 (w = 4) 4 5 6 7 (w = 5) 8
# Points (L) 9 10 11 15 (w = 6)
Table 3.8. Performance comparison of LG and LM Schemes with the C-based method (case 1) in
512-bit scalar multiplication on a standard curve form (1M = 0.8S).
# Points (L) 2 3 (w = 4) 4 5 6 7 (w = 5) 8
# Points (L) 9 10 11 12 13 14 15 (w = 6)
all the possible and practical values of L. Also, note that all the methods considered exhibit the same memory requirement, namely, 5L + R.
As can be seen above, the LM method, case 1, achieves the highest performance in all cases for any number of precomputed points, surpassing the C-based approach by up to 4.1%.
Also, it is important to note that LG Scheme’s performance is comparable (or equivalent) to that
of LM Scheme in several cases. The latter especially holds for standard window values w (L = 3,
7, 15).
Let us now compare methods using only one inversion (case 2). Previous methods in this scenario perform computations in H, J or C coordinates and then convert the points to A by using Montgomery's simultaneous inversion method to limit the number of inversions to one. The costs of these methods are extracted from [DOS07] (assuming that 1S = 0.8M):
Recently, Dahmen et al. [DOS07] proposed a new scheme, known as DOS, whose computations are performed efficiently using affine formulas only. This scheme has a low memory requirement given by (2L + 4) registers and computing cost:
which shows its superiority when compared to methods (3.25), (3.26) and (3.27) requiring only one inversion. However, the proposed LM Scheme achieves even lower computing costs, given by Cost_{LM, case 2a} = 1I + (11.4L + 4)M and Cost_{LM, case 2b} = 1I + (10.6L + 4.8)M (assuming that 1S = 0.8M in formulas (3.5) and (3.6)). Therefore, the LM Scheme (specifically, case 2b) achieves the lowest cost in the literature when the number of inversions is limited to one. The LM Scheme, case 2a, also achieves high performance with the advantage of requiring less memory.
The previous comparison applies to scenarios where memory is not limited. For applications with strict memory constraints, it would be more realistic to compare methods for a certain number of available registers. In Table 3.9, the cost of each method is restricted by the maximum number of registers available for the evaluation stage. For each method, we show the total cost of performing a 160-bit scalar multiplication and the optimal number of precomputed points L when considering that a maximum of (2L_ES + R) registers are available for the evaluation stage (i.e., L ≤ L_ES). For our analysis, we set R = 7. Also, to compare the performance of schemes using no
inversions (case 1) with methods using one inversion (case 2), we include costs of the most
efficient scheme found for case 1 (i.e., LM Scheme, case 1; see Tables 3.6, 3.7 and 3.8) and show
at the bottom of each table the I/M range for which LM Scheme, case 1, would achieve the
lowest cost. For comparisons for n = 256, 512, please refer to Appendix A7.
Table 3.9. Performance comparison of LG and LM Schemes with the DOS method in 160-bit
scalar multiplication for different memory constraints on a standard curve (1M = 0.8S).
I/M range (LM, case1) I > 90M I > 115M I > 117M I > 97M I > 97M
I/M range (LM, case1) I > 73M I > 70M I > 71M I > 60M I > 55M
From the results in Tables 3.9, A.1 and A.2 that target case 2, it can be seen that the LM Scheme achieves the lowest cost in most cases for the different security levels (the lowest cost per register allowance is shown in bold). For n = 160 bits, the LM Scheme, case 2b, offers the lowest costs except for L_ES = 4, in which case the LM Scheme, case 2a, is slightly cheaper. For n = 256 bits, the LM Scheme, cases 2a and 2b, again achieves the lowest cost in all cases except for L_ES = 5, for which the DOS method offers a slight advantage. In the case of n = 512 bits, the DOS method finds its best performance, achieving the lowest cost for L_ES = 5, 6 and 8. Also, the LG Scheme, case 2, proves more advantageous for L_ES = 7 and 8. Nevertheless, in most cases the LM Scheme still achieves the highest performance. Also, in settings where memory is not constrained, the highest speed-up is achieved with the LM Scheme, case 2b, for any value of n.
Finally, when comparing methods for case 1 and case 2, it can be observed that the LM Scheme, case 1, can be advantageous for n = 160 bits if the ratio I/M is at least 50-60 and a large number of registers is available. For n = 256 bits, that threshold increases to ratios greater than 90-100. And for n = 512 bits, the LM Scheme, case 1, would be the most efficient method only for extremely high I/M ratios.
In this section, we analyze and compare the performance of the proposed LG Scheme (Section 3.3) with extended Jacobi quartic and inverted Edwards coordinates. As we could not find any literature related to precomputation schemes in these settings, we have derived the cost formulas for precomputing points using the traditional chain P → 3P → 5P → … → mP. For the case without inversions (case 1), the cost of precomputation is given by (1S = 0.8M):
for IE and JQe coordinates, respectively. These costs have been derived by adding the costs of performing one mixed doubling, one mixed addition and (L − 1) general additions. For JQe, we consider the use of ADD[0,1] to reduce costs during the evaluation stage. For case 2, the costs are given by (1S = 0.8M):
which have been derived by adding the costs of eq. (3.13) and (3.12) with c = 0 (for Montgomery's simultaneous inversion method) to eq. (3.29) and (3.30), respectively.
In Table 3.10, we compare the costs of these schemes with the LG Scheme for different
standard windows w. Costs for LG Scheme are calculated with formulas (3.14), (3.17), (3.18). As
can be seen, the LG Scheme outperforms the methods using traditional chains in all covered
cases for both IE and JQ e coordinates. Note also that the advantage increases with the window
size. For instance, if 1I = 30M, w = 6, JQ e , the cost reduction is as high as 20% and 24% in
cases 1 and 2, respectively.
Let us now compare the performance of cases 1 and 2 of LG Scheme with the objective of
determining the best method for each possible scenario. In this analysis we should also consider
the scalar multiplication cost since different point operation costs apply to different cases. We
consider the fractional width-w NAF method for our analysis. For case 1, the approximated cost of
Table 3.10. Performance comparison of LG Scheme with methods using a traditional chain for
cases 1 and 2 on JQ e and IE coordinates (1M = 0.8S).
scalar multiplication is given by eq. (3.24), and for case 2, the cost is given by:
Tables 3.11 and 3.12 show the performance of scalar multiplication including the costs of the
LG Scheme, cases 1 and 2. At the bottom of the table, we display the I/M range for which case 1
is the most efficient approach.
As can be observed from Tables 3.11 and 3.12, on IE and JQe coordinates the LG Scheme, case 1, achieves the best performance for the most common I/M ratios if n = 160 bits. This result differs from that for standard curves, where the use of one inversion during precomputation is only efficient for high I/M ratios (see Table 3.9). For higher security levels (n = 512 bits), the difference between case 1 and case 2 shrinks. Ultimately, the most effective approach is determined by the particular I/M ratio of a given implementation. However, as the window size grows, case 1 would again be largely preferred. Therefore, for applications where memory is not scarce, the LG Scheme, case 1, achieves the lowest cost in both JQe and IE coordinates.
In this section, we analyze and compare the performance of the LG Scheme when targeting multiple scalar multiplication methods such as the JSF (Section 3.3.3). In particular, we first compare our approach with the computation using traditional additions, and then we evaluate the performance of cases 1 and 2 for the window-based JSF.
Table 3.11. Cost of 160-bit scalar multiplication using Frac-wNAF and the LG Scheme (cases 1
and 2); and I/M range for which case 1 achieves the lowest cost on JQ e and IE (1M = 0.8S).
# of Points (L)
Method Curve
2 3 (w = 4) 6 ≥ 7 (w = 5)
Table 3.12. Cost of 512-bit scalar multiplication using Frac-wNAF and the LG Scheme (cases 1
and 2); and I/M range for which case 1 achieves the lowest cost on JQ e and IE (1M = 0.8S).
# of Points (L)
Method Curve
2 3 (w = 4) 6 7 (w = 5)
I/M range (case 1), JQe | I > 71M | I > 66M | I > 51M | I > 48M
I/M range (case 1), IE | I > 64M | I > 59M | I > 40M | I > 33M
# of Points (L)
Method Curve
14 ≥ 15 (w = 6)
Cost_{scheme, cases 1(2)} = ((m(m + 4) − 1)/2)ADD + 2⌈(m − 1)/m⌉DBL (+Cost_{P→A}),  (3.34)
where Cost_{P→A} applies to case 2 only and represents the cost of conversion from projective to affine coordinates given by eq. (3.11), (3.12) and (3.13) with c = 0 for J, JQe and IE coordinates, respectively. For J and JQe, cost (3.34) can again be optimized further by using mixed coordinates, tripling formulas and additions with stored values. In Table 3.13, we compare the performance of this scheme with the LG Scheme. The costs for the latter method are taken from Table 3.4.
Table 3.13. Performance comparison of LG Scheme and a scheme using traditional additions for
computing tables of the form ci P ± di Q , cases 1 and 2 (1M = 0.8S).
# of Points (L)
Method Curve
L = 2 (m = 1) L = 10 (m = 3) L = 22 (m = 5)
As can be seen, the LG Scheme outperforms the method using traditional additions in all cases covered. For instance, if 1I = 30M, L = 22, JQe, the cost reduction is as high as 22% for both cases 1 and 2. Remarkably, the largest improvements are obtained with J coordinates due to the combined use of the LG and LM Schemes (see Section 3.3.3), especially in case 2, where larger savings are obtained through both methods when converting points to affine coordinates. For instance, if 1I = 30M, L = 22, J, the cost reduction is as high as 38% in case 2.
Assuming that points P and Q are unknown before execution and given in affine coordinates, a multiple scalar multiplication of the form kP + lQ using the window-based JSF costs approximately [n·DBL + δ_JSF(L/(L + 2))(n − 1)ADD + δ_JSF(2/(L + 2))(n − 1)mADD] + Cost_{scheme, case 1} for case 1, and [n·DBL + δ_JSF(n − 1)mADD] + Cost_{scheme, case 2} for case 2, where δ_JSF = 0.5 if m = 1, δ_JSF = 0.3575 if m = 3, δ_JSF = 0.31 if m = 5 [SEI10], and Cost_{scheme, case x} represents the precomputation cost given by formula (3.19). For J and JQe, we again use ADD[M,S] instead of ADD. The estimates obtained with these cost formulas are displayed in Tables 3.14 and 3.15.
Table 3.14. Cost of 160-bit multiple scalar multiplication using window-based JSF and LG
Scheme (cases 1 and 2); and I/M ranges for which case 1 achieves the lowest cost; 1M = 0.8S.
# of Points (L)
Method Curve
L = 2 (m = 1) L = 10 (m = 3) L = 22 (m = 5)
Table 3.15. Cost of 512-bit multiple scalar multiplication using window-based JSF and LG
Scheme (cases 1 and 2); and I/M ranges for which case 1 achieves the lowest cost; 1M = 0.8S.
# of Points (L)
Method Curve
L = 2 (m = 1) L = 10 (m = 3) L = 22 (m = 5)
Similarly to the case of single scalar multiplication (see Table 3.11), case 1 achieves the best performance for the most common I/M ratios for n = 160 bits with JQe and IE coordinates. However, if n = 512 bits, the range of I/M ratios for which case 2 is more efficient increases significantly. Also, note that case 2 appears to be the best choice for J coordinates over a wide range of I/M ratios, especially for high security levels, i.e., n = 512.
After developing the LG Scheme, we became aware of other (virtually simultaneous) efforts based on similar ideas. Avanzi, Heuberger and Prodinger [AHP08] also noticed the savings introduced by computations of the form P ± Q when precomputing points in projective coordinates. They, however, analyzed the applicability of this idea in the context of Koblitz curves with τ-adic representations using LD coordinates. In a talk at ECC 2008 [Sco08], Scott described an approach similar to the LG Scheme for the case of single scalar multiplication. He also proposed exploiting the similarities between P + Q and P − Q during precomputation, but using a slightly different sequence to compute the points. After analyzing the settings discussed in this chapter, we conclude that our calculation sequence achieves better performance.
3.6. Conclusions
This chapter introduced new schemes for precomputing points, a basic ingredient to accelerate
the fastest variable-scalar-variable-point scalar multiplication methods which are based on
window-based strategies.
After presenting the most relevant previous work in §3.1, we introduced in §3.2 the LM Scheme, which is intended for standard curves using Jacobian coordinates, and adapted it to two typical scenarios for precomputation: case 1, without using inversions; and case 2, using one inversion. For the latter, we presented two variants that have slightly different speed and memory requirements. The theoretical costs for each case were derived (with the corresponding proofs in the appendix), exploiting state-of-the-art formulas and techniques for maximal performance. In particular, for a number L of non-trivial points, case 1 has a cost of (5L + 1)M + (2L + 5)S (or (6L + 1)M + (3L + 5)S when using operations with stored values) and case 2b has a cost of 1I + 9L·M + (2L + 6)S, which are the lowest costs in the literature for tables of the form d_iP.
In §3.3, we introduced the highly flexible LG Scheme, which is based on the concept of conjugate addition and can be adapted to any curve form or type of scalar multiplication (i.e., single and multiple scalar versions). We also discussed its applicability to cases 1, 2 and 3, and analyzed its efficiency in three curve settings: standard curves using Jacobian coordinates, extended Jacobi quartics using extended Jacobi quartic coordinates, and Twisted Edwards curves using inverted Edwards coordinates. Moreover, for the case of multiple scalar multiplication using Jacobian coordinates, we proposed a novel scheme combining the LG and LM approaches.
The theoretical costs for each case were derived (with the corresponding proofs in the appendix).
Chapter 4: Scalar Multiplication using Multibase Chains
In this chapter, we describe efficient methods based on multibase representations and analyze
their performance to compute elliptic curve scalar multiplication at the evaluation stage. Our
contributions can be summarized as follows:
• We include a thorough discussion and analysis of the most relevant methods based on
double- and multi-base representations in the literature. We categorize the different
approaches and highlight their advantages and disadvantages.
• We provide an improved and more thorough exposition of the original multibase NAF
(mbNAF) method and its variants, which were introduced by the author in [Lon07]. In
particular, we include the analysis of the average density of these methods when using
bases {2,3} and {2,3,5} that was deferred in [Lon07].
• We apply the concept of “fractional” windows to improve the flexibility of the windowed
variant of mbNAF so that implementers can freely choose the optimal number of
precomputations in a given application.
• We apply the concept of operation cost per bit to the derivation of efficient multibase
algorithms able to find cheaper multibase chains for scalar multiplication. We argue that
this approach, assuming unrestricted resources, leads to optimal multibase chains for any
given scalar. For practical scenarios, we present very compact algorithms that yield
(conjecturally, close to optimal) multibase chains.
• Finally, we perform an exhaustive performance evaluation of the various methods for
different security levels and for three different curve forms: standard curves using
Jacobian coordinates (J ), extended Jacobi quartics using extended Jacobi quartic
coordinates ( JQ e ) and Twisted Edwards curves using inverted Edwards coordinates
(IE). These results allow us to assess the state of affairs of the use of double bases and
multibases in practice.
For the remainder of this chapter, we assume that curve parameters can be chosen such that the cost of multiplying by a curve constant is negligible in comparison with a regular multiplication. Also, additions and subtractions are neglected in the cost analysis. These assumptions greatly simplify our analysis without affecting the conclusions.
This chapter is organized as follows. §4.1 discusses the most relevant previous work and
categorizes the different approaches based on double- and multi-base representations. §4.2
discusses the mbNAF method and its variants, and provides the zero and nonzero density
formulas obtained with the use of Markov chains. §4.3 details the application of “fractional”
windows to mbNAF. §4.4 presents the methodology based on the operation cost per bit to derive
more efficient multibase chains. §4.5 evaluates the performance of the different methods in
comparison with other works in the literature for different security levels and memory
constraints. §4.6 discusses potential variants of the proposed methods and their application to
other settings. This section also discusses the challenges still faced by methods using double- and
multi-base representations. Finally, some conclusions are drawn in §4.7.
Recently, new methods have been proposed for scalar multiplication using number
representations based on double- and multi-base number systems, which mix different
bases to decrease the number of terms required in the representation of integers. Building on
previous work by Dimitrov and Cooklev [DC95], the use of the so-called Double-Base Number
System (DBNS) for cryptographic applications was first proposed by Dimitrov et al. in [DJM98].
In this number system an integer k is represented as follows:

k = ∑_{i=1}^{K} s_i · 2^{b_i} · 3^{c_i} ,    (4.1)

where s_i ∈ {−1, 1}.
To enable the use of the DBNS in the setting of ECC, Dimitrov et al. [DIM05] were the first to
introduce the concept of double-base chains, where the exponents b_i and c_i must decrease as i
increases. This was later generalized to multi-base chains (i.e., using two or more bases) by the
author in [Lon07] and by Mishra and Dimitrov in [MD07]. Of particular interest are the facts that
multibase chains are redundant and that some representations are highly sparse, which, as a
consequence, allows a reduction in the Hamming weight of the scalar expansion (that is, a
reduction in the number of additions in the point multiplication). Let us illustrate the latter with
the following example.
Example 4.1. The representation of k = 9750 using NAF is given by 9750 = 2^13 + 2^11 − 2^9 +
2^5 − 2^3 − 2, which requires 13DBL + 5ADD using Horner's scheme for scalar multiplication
(i.e., the computation uses the expansion 9750P = 2(2^2(2^2(2^4(2^2(2^2·P + P) − P) + P) − P) − P)).
If one instead uses the double-base chain 9750 = 2^10·3^2 + 2^6·3^2 − 2^4·3 + 2·3, the scalar
multiplication takes the form 9750P = 2·3·(2^3·(2^2·3·(2^4·P + P) − P) + P) and costs 10DBL +
2TPL + 3ADD, which reduces the nonzero density in comparison with the NAF representation.
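The operation counts in Example 4.1 are easy to check with a short script (the helper names below are illustrative, not from this thesis):

```python
def naf(k):
    """Signed-binary non-adjacent form of k, least-significant digit first."""
    digits = []
    while k > 0:
        if k % 2:
            d = 2 - (k % 4)      # d in {-1, +1}; forces the next digit to be 0
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

k = 9750
naf_digits = naf(k)
assert sum(d * 2**i for i, d in enumerate(naf_digits)) == k

# Horner evaluation of the NAF needs one doubling per digit after the leading
# one and one addition per remaining nonzero digit: 13DBL + 5ADD.
dbls = len(naf_digits) - 1
adds = sum(1 for d in naf_digits if d != 0) - 1
print(dbls, adds)                 # -> 13 5

# Double-base chain 9750 = 2^10*3^2 + 2^6*3^2 - 2^4*3 + 2*3:
terms = [(1, 10, 2), (1, 6, 2), (-1, 4, 1), (1, 1, 1)]
assert sum(s * 2**b * 3**c for s, b, c in terms) == k
# Horner form 2*3*(2^3*(2^2*3*(2^4*P + P) - P) + P): 10DBL + 2TPL + 3ADD.
```

Note that the double-base chain trades three of the NAF's doublings and two of its additions for two triplings, which is what lowers the overall cost on curves with a fast tripling.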
Multibase chains using {2^{b_i}·3^{c_i}}-terms or {2^{b_i}·3^{c_i}·5^{d_i}}-terms are particularly attractive for
ECC because the point operations associated with these bases (namely, point doubling, tripling
and quintupling) are the cheapest-per-bit point operations available for some elliptic curves.
Nevertheless, multibase chains are not unique and this poses the conjecturally hard problem
of determining (in a reasonable amount of time and utilization of resources) the optimal
multibase chain for a given integer. Hereinafter, we use the term optimal to define a multibase
chain for a given scalar k that achieves the lowest cost when applied to the computation of the
point multiplication kP. In contrast to radix 2-based representations, the complexity of this
analysis is significantly higher as the point operations involved (e.g., doubling, tripling,
quintupling and addition) have different costs per bit that even vary with the type of elliptic
curve. Hence, it does not necessarily hold that representations with the lowest nonzero density
achieve the lowest cost. Note that this complexity increases with the number of bases in the
representation.
Although it remains an open problem to find the optimal double- or multi-base chains, there
have appeared in the literature several efforts trying to find “efficient” multibase chains and using
them advantageously in the computation of elliptic curve scalar multiplication. In general, there
are two main approaches to find a double-base or multi-base representation for a given integer in
the setting of elliptic curves: using a “Greedy” algorithm [DIM05, DI06] and using division
chains [CJL+06, Lon07] (borrowing the term from [Wal98]).
The division-chain approach consists in deriving scalar representations by consecutive division
by integers from a suitably chosen set of bases. When the partial result is not divisible by at least one
base, a particular rule defines how to approximate the value to a close number that is again
divisible by one or more bases. Note that methods using division chains are apparently easier to
analyze, for instance by means of Markov chains. Moreover, they do not rely on pre-stored tables
for conversion, immediately enabling their use in memory-constrained applications. In the 90's
several algorithms with different division rules were proposed for reducing the cost of
exponentiation [DC95, CCY96, Wal98] (the term “division chain” was coined by Walter in
[Wal98]). Walter [Wal02] also exploited these ideas to develop an exponentiation method with
random selection of bases to protect against certain SCA attacks. Nevertheless, it seems that the
binary/ternary algorithm by Ciet et al. [CJL+06] was the first method using division chains that
was intended for ECC applications. In this case, a partial result obtained after dividing by bases 2
and 3 is approximated to the closest term that is congruent to 0 (mod 6). Since this approximation
gives roughly equivalent "weight" to bases 2 and 3, the method has some efficiency limitations,
especially in most common ECC settings where doubling is much faster than tripling and
addition. In fact, if one does not take into account the memory/conversion overhead, it can be the
case that “Greedy”-based approaches achieve better performance (see, for example, Table 2 in
[DH08]). In [Lon07] (see also [LM08c]), the author introduced new algorithms able to find
generalized multi-base chains, efficiently solving for the first time the problems of the memory
penalty and of the difficulty of analyzing the zero and nonzero densities of a multibase expansion.
Remarkably, this approach also achieves better cost performance than the "Greedy" approach. The new method finds
multibase chains by creating a “window” with a fixed width with one of the bases (referred to as
“main base”) and then approximates the partial scalar value to it. The latter guarantees the
execution of a minimum number of operations with the “main base” before the following
addition happens, similar to the way NAF of a scalar is generated with base 2. Moreover, the
nonzero density is further reduced because, once an addition is performed, not only doublings but
also triplings, quintuplings, and so on, can be used. This new approach is called multibase NAF
(denoted by mbNAF). Its window-based version using an extended digit set appears as a natural
extension and is referred to as width-w multibase NAF (wmbNAF).
Note that one does not need to restrict the "window" in mbNAF to only one base. In fact, in
[Lon07] (see also [LM08c, Section 5.3]), the author proposed an extended wmbNAF method that
generalizes the use of windows, such that the approximation after the divisibility tests is
performed to the generic value a = a_1^{w_1}·a_2^{w_2}· … ·a_J^{w_J} for a set of bases {a_1, a_2, …, a_J}, where
w_j ≥ 0 are integers. For instance, the use of a = 2^{w_1}·3^{w_2} was shown to be especially efficient on
the elliptic curves with degree 3 isogenies proposed in [DIK06] and known as DIK curves (see
Table 8 in [LM08c]). Note that this method was recently rediscovered by Adikari, Dimitrov and
Imbert in [ADI10, Section 3.1] and [Adi10, Chapter 5] for the case of bases {2,3}. Also, note that
the binary/ternary algorithm by [CJL+06] is a special case of extended wmbNAF when a = 2 ⋅ 3 .
More recently, Doche et al. [DH08] introduced a new method that also finds double-base
chains using division chains, although using a somewhat more complex tree-based approach in
comparison with multibase NAF. Their method basically divides the values (k_i + 1) and
(k_i − 1) by 2 and 3 for B distinct values k_i that are coprime to 6, and keeps the B division
sequences that reach the lowest values. This procedure is repeated with the new values until reaching 1.
Initialization proceeds as above although in this case the algorithm keeps all the possible
sequences until B distinct values ki are obtained. As will be evident later, the disadvantage of
this method is that the division sequences that are chosen at each iteration are the ones whose
final values are the lowest ones. However, a long sequence of divisions alone does not guarantee
optimal cost. This drawback is somewhat minimized by keeping up to B values at each iteration
(and then the probability that a long sequence is also among the cheapest ones increases).
However, it is evident that one may avoid storing unnecessary sequences by applying an
operation cost analysis instead.
In Section 4.4, we introduce a methodology to derive algorithms able to find more efficient
multibase chains. Our technique is based on the careful analysis of the operation cost per bit,
which helps to choose the most efficient division sequence per iteration. We argue that the
inclusion of this analysis in the design of any multibase algorithm potentially enables the
derivation of the fastest multibase chains.
A multibase expansion of an integer k has the form

k = ∑_{i=1}^{K} s_i ∏_{j=1}^{J} a_j^{c_i(j)} ,    (4.2)

where: a_1 ≠ … ≠ a_J are prime integers from a set of bases A = {a_1, …, a_J} (a_1: the main base);
K is the length of the expansion;
s_i are signed digits from a given set D \ {0}, i.e., |s_i| ≥ 1 and s_i ∈ D \ {0};
c_i(j) are decreasing exponents, s.t. c_1(j) ≥ c_2(j) ≥ … ≥ c_K(j) ≥ 0 for 2 ≤ j ≤ J; and
c_i(1) are decreasing exponents for the main base a_1 (i.e., j = 1), s.t. c_i(1) ≥ c_{i+1}(1) + 2 ≥ 2
for 1 ≤ i ≤ K − 1.
Note that the last two conditions above guarantee that an expansion of the form (4.2) is
efficiently executed by a scalar multiplication using Horner’s method as follows:
kP = ∑_{i=1}^{K} s_i ∏_{j=1}^{J} a_j^{c_i(j)} P = ∏_{j=1}^{J} a_j^{d_K(j)} ( ∏_{j=1}^{J} a_j^{d_{K−1}(j)} ( … ∏_{j=1}^{J} a_j^{d_1(j)} (s_1·P) + s_2·P … ) + s_K·P ) ,    (4.3)

where d_K(j) = c_K(j) and d_i(j) = c_i(j) − c_{i+1}(j) for 1 ≤ i ≤ K − 1, so that d_K(1) ≥ 0 and
d_i(1) ≥ 2 for 1 ≤ i ≤ K − 1. The latter is equivalent to the last condition in
(4.2) and incorporates the non-adjacency property in the multibase representation. Basically, it
fixes the minimal number of consecutive operations with the “main base” (i.e., a1 ) between any
two additions to two. Note that an operation with the main base refers to doubling if a1 = 2 or
tripling if a1 = 3 , and so on.
On the other hand, if we relax the previous condition and allow larger window sizes (i.e.,
allowing 3, 4, or more, consecutive operations with the main base between any two additions) we
can reduce further the average number of nonzero terms in the scalar representation at the
expense of a larger digit set D and, consequently, a larger precomputed table. The previous
technique is known as wmbNAF.
The mbNAF and wmbNAF representations require the following digit set [Lon07]:
D = { 0, ±1, ±2, …, ±(a_1^w − 1)/2 } \ { ±1·a_1, ±2·a_1, …, ±((a_1^{w−1} − 1)/2)·a_1 } ,    (4.4)
where w ∈ Z^+, w ≥ 2 (w = 2 for mbNAF). Without considering {O, P}, the digit set (4.4) implies
that a scalar multiplication would require precomputing the points d_i·P, where d_i ∈ D^+ \ {0,1}
(note that only the points d_i·P with positive d_i need to be stored in the table, as the negatives of
points can be computed on-the-fly at negligible cost). Thus, the precomputation table consists of
(a_1^w − a_1^{w−1} − 2)/2 points. Note that if w = 2 (the mbNAF case), the precomputation
requirement is minimal. For instance, in the case a_1 = 2 we do not need to store any points besides {O, P}.
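As a quick check of the table sizes implied by (4.4), the following sketch (illustrative code, not from this thesis) enumerates the positive digits and compares the count of non-trivial points against the formula above:

```python
def digit_set(a1, w):
    """Positive digits of the set (4.4) for main base a1 and width w."""
    top = (a1**w - 1) // 2
    # multiples of a1 up to a1*(a1^(w-1) - 1)/2 are removed from the set
    removed = {a1 * j for j in range(1, (a1**(w - 1) - 1) // 2 + 1)}
    return [d for d in range(1, top + 1) if d not in removed]

for w in (2, 3, 4, 5):
    D_pos = digit_set(2, w)
    # points to precompute: every positive digit except the trivial digit 1
    assert len(D_pos) - 1 == (2**w - 2**(w - 1) - 2) // 2

print(digit_set(2, 4))   # -> [1, 3, 5, 7]
```

For w = 2 and a_1 = 2 the set collapses to {1}, confirming that no extra points beyond {O, P} are needed.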
It can be easily seen that selecting the main base according to the relative efficiency of its
corresponding point operation guarantees that more of these operations are used on average,
which potentially decreases the computational cost of scalar multiplication. In the remainder
(and following what is observed in most common ECC settings over prime fields), we will
assume that doubling is the most efficient point operation available and, hence, that a_1 = 2.
It is important to remark that, obviously, eq. (4.2) does not define unique representations.
For instance, both expressions 2^10·3^2 + 2^6·3^2 − 2^4·3 + 2·3 and 2^9·3^3 − 2^7·3^3 − 2^5·3^3 +
2^3·3^3 + 2·3^2 + 3^2 + 3 are valid mbNAF representations of the integer 9750
following (4.2). In [Lon07], the author provided algorithms based on division chains that
efficiently find a (w)mbNAF chain of the form (4.2) which, for a given window width and set of
bases, is unique for each integer. Note that we have integrated the algorithms for finding mbNAFs
and wmbNAFs into Algorithm 4.1.
The values k_i^{(b_i)} in Algorithm 4.1 represent the digits in the multibase NAF representation, where
k_i ∈ D, see (4.4), and the superscript (b_i) represents the base b_i ∈ A associated to the digit in
position i. The function mods denotes the signed residue: k mods 2^w is the integer r ≡ k (mod 2^w) with r ∈ (−2^{w−1}, 2^{w−1}].
Example 4.2. The mbNAF representation of 9750 obtained with Algorithm 4.1 using the division
sequence 9750 →(÷2) 4875 →(÷3) 1625 →(−1) 1624 →(÷8) 203 →(+1) 204 →(÷4) 51 →(÷3) 17 →(−1) 16 →(÷16) 1
is (2,3)NAF_2(9750) = 1^{(2)} 0^{(2)} 0^{(2)} 0^{(2)} 1^{(2)} 0^{(3)} 0^{(2)} −1^{(2)} 0^{(2)} 0^{(2)} 1^{(2)} 0^{(3)} 0^{(2)},
which allows us to compute the corresponding scalar
multiplication 9750P as 2·3·(2^3·(2^2·3·(2^4·P + P) − P) + P) using Horner's method. The latter
involves 1mDBL + 9DBL + 2TPL + 3mADD. For instance, using costs from Table 2.4 (JQ^e, 1S = 0.8M),
9750P costs 107.2M. Compare this to the cost using NAF: NAF(9750) = 1 0 1 0 −1 0 0 0 1 0 −1 0 −1 0,
given by 1mDBL + 12DBL + 5mADD = 119.6M.
For brevity (and whenever understood in the context), we will refer as the multibase NAF of
an integer k to the unique representation found through Algorithm 4.1.
One of the attractive properties of multibase NAF representations found with Algorithm 4.1 is
that the average number of operations can be precisely determined by using Markov chains. The
following theorems are presented in this regard. With a slight abuse of notation, density refers to
the number of occurrences of a certain point operation relative to the total number of zero and
nonzero digits in a given representation.
Theorem 4.1. The average densities of additions, doublings and triplings for the (w)mbNAF
using bases A = {2,3} are approximately:

δ_1 = 2^w / (3(2^{w−2} − s) + 2^w(w + 1)) ,  δ_{x2} = 2^w(w + 1) / (3(2^{w−2} − s) + 2^w(w + 1)) ,  δ_{x3} = 3(2^{w−2} − s) / (3(2^{w−2} − s) + 2^w(w + 1)) ,

respectively, where s = ⌊(2^{w−2} + 1)/3⌋ and w ∈ Z^+, w ≥ 2 (w = 2 for mbNAF).
Proof. The method can be modeled as a Markov chain with three states in the case of bases
{2,3}: "0^{(2)}", "0^{(3)}" and "0^{(2)}…0^{(2)} k_i^{(2)}" (a run of w − 1 zero digits followed by a nonzero
digit). This Markov chain is irreducible and aperiodic and, hence, it has a stationary distribution,
which is given by:

π("0^{(2)}…0^{(2)} k_i^{(2)}") = 2^w / (2^{w+1} + 3(2^{w−2} − s)) ,  π("0^{(2)}") = 2^w / (2^{w+1} + 3(2^{w−2} − s)) ,  π("0^{(3)}") = 3(2^{w−2} − s) / (2^{w+1} + 3(2^{w−2} − s)) .

Therefore, nonzero digits k_i appear 2^w times out of every w·2^w + 2^w + 3(2^{w−2} − s) digits,
which proves the assertion about the nonzero density. Doublings and triplings (i.e., zero and
nonzero digits with bases 2 and 3, respectively) appear w·2^w + 2^w times and 3(2^{w−2} − s) times
out of every w·2^w + 2^w + 3(2^{w−2} − s) digits, respectively. This proves the assertion about the
average densities of doublings and triplings. □
Theorem 4.2. The average densities of additions, doublings, triplings and quintuplings for the
(w)mbNAF using bases A = {2,3,5} are approximately:

δ_1 = 2^{w+3} / Ω ,  δ_{x2} = 2^{w+3}(w + 1) / Ω ,  δ_{x3} = 24(2^{w−2} − s) / Ω  and  δ_{x5} = 5(2^{w−1} − r − t) / Ω ,

where Ω = 17·2^{w−1} − 5r − 24s − 5t + 2^{w+3}(w + 1), s is as defined in Theorem 4.1, and r and t
are the analogous small rounding terms for base 5.
Proof. For the case of bases A = {2,3,5}, the method can be modeled with four states: "0^{(2)}",
"0^{(3)}", "0^{(5)}" and "0^{(2)}…0^{(2)} k_i^{(2)}" (w − 1 zeros followed by a nonzero digit). The probability
matrix in this case is as follows (columns in the order of the states just listed):

"0^{(2)}":                1/2   (2^{w−2} − s)/2^w   (2^{w−2} − r + s − t)/2^{w+2}   (3·2^{w−2} + r + 3s + t)/2^{w+2}
"0^{(3)}":                0     1/3                 1/6                            1/2
"0^{(5)}":                0     0                   1/5                            4/5
"0^{(2)}…0^{(2)} k_i^{(2)}":  1/2   (2^{w−2} − s)/2^w   (2^{w−2} − r + s − t)/2^{w+2}   (3·2^{w−2} + r + 3s + t)/2^{w+2}

This Markov chain is irreducible and aperiodic with stationary distribution:

π("0^{(2)}…0^{(2)} k_i^{(2)}") = 2^{w+3}/ω ,  π("0^{(2)}") = 2^{w+3}/ω ,  π("0^{(3)}") = 24(2^{w−2} − s)/ω ,  π("0^{(5)}") = 5(2^{w−1} − r − t)/ω ,

where ω = 49·2^{w−1} − 5r − 24s − 5t. Therefore, nonzero digits k_i appear 2^{w+3} times out of every
2^{w+3}·w + 2^{w+3} + 24(2^{w−2} − s) + 5(2^{w−1} − r − t) digits, which proves our assertion about the
nonzero density. Doublings, triplings and quintuplings (i.e., zero and nonzero digits with bases
2, 3 and 5, respectively) appear 2^{w+3}·w + 2^{w+3}, 24(2^{w−2} − s) and 5(2^{w−1} − r − t) times out of
every 2^{w+3}·w + 2^{w+3} + 24(2^{w−2} − s) + 5(2^{w−1} − r − t) digits, respectively. This proves our
assertion about the average densities of the aforementioned operations. □
Let us determine the average number of operations for the multibase NAF method with the
help of the presented theorems. First, the expected numbers of doublings, triplings and additions
are given by #DBL = δ_{x2}·digits, #TPL = δ_{x3}·digits and #ADD = δ_1·digits, where digits
represents the total number of (zero and nonzero) digits in the expansion (note that a nonzero
digit involves one doubling and one addition). Then, we can assume that 2^{#DBL}·3^{#TPL} ≈ 2^{n−1},
where n represents the average bitlength of the scalar k. Thus, #DBL·log 2 + #TPL·log 3 ≈
(n − 1)·log 2 and, replacing #DBL and #TPL, we can estimate digits as follows:

digits ≈ (n − 1)·log 2 / (δ_{x2}·log 2 + δ_{x3}·log 3) ,    (4.5)
which allows us to determine #DBL, #TPL and #ADD using the expressions above and Theorem
4.1. For instance, in the case of mbNAF with bases A = {2,3} and w = 2, the average densities of
doublings, triplings and additions derived from Theorem 4.1 are 4/5, 1/5 and 4/15, respectively.
If n = 160 bits, we determine that digits = 142.35 using (4.5). Then, the average cost of a scalar
multiplication using Table 2.4 (JQ^e, 1S = 0.8M) is approximately 113.88DBL + 28.47TPL +
37.96mADD = 1321M. Similarly, if we use bases A = {2,3,5}, the average cost can be estimated
as approximately 97.06DBL + 24.27TPL + 10.11QPL + 32.35mADD = 1299.82M. Compare the
previous costs to that offered by NAF: 159DBL + 53mADD = 1399.2M (in this case,
δ_NAF = 1/3). Hence, theoretically, it is determined that (2,3)NAF and (2,3,5)NAF surpass NAF
(case with no precomputations, JQ^e) by about 5.6% and 7.1%, respectively.
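These estimates are easy to reproduce. In the sketch below (our own helper code), the per-operation costs DBL = 6.0M, TPL = 11.2M and mADD = 8.4M are inferred from the worked examples in this chapter, since Table 2.4 is not reproduced here:

```python
import math

# Average densities for mbNAF with bases {2,3} and w = 2 (Theorem 4.1)
d_dbl, d_tpl, d_add = 4/5, 1/5, 4/15

n = 160
# eq. (4.5): estimated total number of digits in the expansion
digits = (n - 1) * math.log(2) / (d_dbl * math.log(2) + d_tpl * math.log(3))
n_dbl, n_tpl, n_add = d_dbl * digits, d_tpl * digits, d_add * digits

# Per-operation costs in field multiplications (JQ^e, 1S = 0.8M; inferred)
DBL, TPL, mADD = 6.0, 11.2, 8.4
cost = n_dbl * DBL + n_tpl * TPL + n_add * mADD
naf_cost = 159 * DBL + 53 * mADD

print(round(digits, 2), round(cost), round(naf_cost, 1))  # -> 142.35 1321 1399.2
```

The ratio 1 − 1321/1399.2 ≈ 5.6% matches the stated advantage of (2,3)NAF over NAF.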
It is still possible to find more efficient multibase chains at the expense of some increase in
the complexity of the original multibase NAF. The improved algorithms will be discussed in
Section 4.4. Next, we optimize the basic multibase NAF methods using a recoding based on
fractional windows.
For the remainder, we will assume that the main base a_1 is 2. First, let us establish our ideal
table with an unrestricted number of non-trivial points d_i·P, where d_i ∈ D^+ \ {0,1} = {3, 5, …, m}
and m ∈ Z^+, m ≥ 3, is odd. If we define m in terms of the standard window width w, it can be
expressed as:

m = 2^{w−2} + h ,    (4.6)

where 2^{w−2} < m < 2^{w−1} and h ≥ 1 is odd.
We define the rules of the recoding scheme for bases A = {2, a_2, …, a_J} in Algorithm 4.2.
Basically, the proposed recoding first detects whether k is divisible by one of the bases.
Otherwise, it establishes a window of width w and checks whether k can be approximated to the
closest extreme of the window using any of the available digits d_i. It can be verified that the
latter is accomplished if step 2 or 4 is satisfied. Otherwise, the established window is too large
and, hence, it is "reduced" to the immediately preceding window size to which k can be
approximated (condition in step 3).
An algorithm to convert any integer to the Frac-wmbNAF representation can be easily derived by
replacing steps 3-6 in Algorithm 4.1 by steps 1-5 of Algorithm 4.2. In this case, we will denote
the Frac-wmbNAF of an integer k by (2, a_2, …, a_J)NAF_{w,L}(k) = (…, k_2^{(b_2)}, k_1^{(b_1)}), where w is the
standard window width according to (4.6) and L represents the number of precomputed points,
that is, L = (m − 1)/2.
Let us illustrate the new recoding with the following example.
Example 4.3. If k = 9750 and m = 5, then d_i ∈ D^+ \ {0,1} = {3, 5}, and w = 4 and h = 1 by means
of eq. (4.6). Then, the Frac-wmbNAF of 9750 is (2,3)NAF_{4,2}(9750) = 1^{(2)} 0^{(2)} 0^{(2)} 0^{(2)} −3^{(2)} 0^{(2)}
0^{(2)} 0^{(2)} −5^{(2)} 0^{(2)} 0^{(2)} 1^{(2)} 0^{(3)} 0^{(2)}, and the conversion process can be visualized as the division
chain 9750 →(÷2) 4875 →(÷3) 1625 →(−1) 1624 →(÷8) 203 →(+5) 208 →(÷16) 13 →(+3) 16 →(÷16) 1.
Observe that, when 1625 is obtained, an addition with 7 would be required to approximate the value
to 1632 (which is the closest number ≡ 0 (mod 2^4), as required by a standard window w = 4).
However, 7 is not part of the precomputed table, so the window width is reduced accordingly to
w = 3 and the value 1625 is approximated to the closest value in the new window (i.e., 1624)
using an addition with −1.
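The window-reduction rule of Example 4.3 can be sketched as follows (an illustrative re-implementation for bases {2,3}, not the thesis's Algorithm 4.2 verbatim):

```python
def mods(k, m):
    """Signed residue of k modulo m, taken in (-m/2, m/2]."""
    r = k % m
    return r - m if r > m // 2 else r

def frac_wmbnaf(k, w, m):
    """Frac-wmbNAF digits for bases {2,3}, least-significant first.
    The precomputed (positive) digits are {1, 3, ..., m}."""
    digits = []
    while k > 0:
        if k % 2 == 0:
            digits.append((0, 2)); k //= 2
        elif k % 3 == 0:
            digits.append((0, 3)); k //= 3
        else:
            j, d = w, mods(k, 2**w)
            while abs(d) > m:          # digit not in the table: shrink window
                j -= 1
                d = mods(k, 2**j)
            digits.append((d, 2)); k = (k - d) // 2
    return digits

digits = frac_wmbnaf(9750, w=4, m=5)
v = 0
for d, b in reversed(digits):          # Horner reconstruction, MSB first
    v = v * b + d
assert v == 9750
```

At the partial value 1625 the signed residue modulo 16 is −7, which exceeds m = 5, so the sketch drops to the residue modulo 8 and emits the digit 1, exactly as in the example.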
We now present the following theorem regarding the average density of this method for the
case A = {2,3} .
Theorem 4.3. The average densities of nonzero terms, doublings and triplings of the Frac-
wmbNAF using bases A = {2,3}, window size w and L available points (represented by
(2,3)NAF_{w,L}) are approximately:

δ_1 = 2^w / (8(L + 1) − 3(u + v) + 2^{w−2}(4w − 1)) ,  δ_{x2} = (8(L + 1) + 2^w(w − 1)) / (8(L + 1) − 3(u + v) + 2^{w−2}(4w − 1))  and
δ_{x3} = 3(2^{w−2} − (u + v)) / (8(L + 1) − 3(u + v) + 2^{w−2}(4w − 1)) ,

respectively, where u = ⌊(L + 2)/3⌋ and v = ⌊(2^{w−2} − L)/3⌋.
Proof. Let us consider the following states to model this fractional window method using Markov
chains: "0^{(2)}", "0^{(3)}", "0^{(2)}…0^{(2)} k_i^{(2)}" (w − 2 zeros followed by a nonzero digit) and
"0^{(2)}…0^{(2)} k_i^{(2)}" (w − 1 zeros followed by a nonzero digit). Writing t = L + 1, the probability
matrix (columns in the same order as the states above) is as follows:

"0^{(2)}":                        1/2   (t − ⌊(t + 1)/3⌋)/(4t)   (2^{w−2} − t)(t + ⌊(t + 1)/3⌋)/(2^w·t)   (t + ⌊(t + 1)/3⌋)/2^w
"0^{(3)}":                        0     1/3                      (2^{w−2} − t)/(3·2^{w−3})                t/(3·2^{w−3})
"0^{(2)}…0^{(2)} k_i^{(2)}" (w − 2):  0     α                        β                                       1 − α − β
"0^{(2)}…0^{(2)} k_i^{(2)}" (w − 1):  1/2   (t − ⌊(t + 1)/3⌋)/(4t)   (2^{w−2} − t)(t + ⌊(t + 1)/3⌋)/(2^w·t)   (t + ⌊(t + 1)/3⌋)/2^w

where α = ((2^{w−2} − t) − ⌊(2^{w−2} − t + 1)/3⌋)/(2(2^{w−2} − t)) and β = ((2^{w−2} − t) + ⌊(2^{w−2} − t + 1)/3⌋)/2^{w−1}.
This Markov chain is irreducible and aperiodic with the stationary distribution:

π("0^{(2)}") = 16t/µ ,  π("0^{(3)}") = 12(2^{w−2} − (u + v))/µ ,  π("0^{(2)}…0^{(2)} k_i^{(2)}", w − 2) = 16(2^{w−2} − t)/µ ,  π("0^{(2)}…0^{(2)} k_i^{(2)}", w − 1) = 16t/µ ,

where µ = 16t − 12(u + v) + 7·2^w. Therefore, nonzero digits k_i appear 2^w times out of every
8t − 3(u + v) + 2^{w−2}(4w − 1) digits, proving the assertion about the nonzero density. Doublings
and triplings (i.e., zero and nonzero digits with bases 2 and 3, respectively) appear
8t + 2^w(w − 1) times and 3(2^{w−2} − (u + v)) times out of every 8t − 3(u + v) + 2^{w−2}(4w − 1)
digits, respectively. This proves the assertion about the average densities of doublings and
triplings. □
With Theorem 4.3, it is possible to theoretically estimate the expected numbers of doublings,
triplings and additions for this method. For instance, following the procedure detailed in
Section 4.2.1, we can estimate the cost of a scalar multiplication (without including the
precomputation cost) for n = 160 bits using L = 2 points (w = 4) as 132.7DBL + 16.6TPL +
29.5mADD = 1229.9M (JQ^e, 1S = 0.8M). Compare this to the cost achieved by Frac-wNAF,
namely 159DBL + 35.3mADD = 1250.5M (δ_Frac-wNAF = 1/4.5 when using m = 5; see Section
2.2.4.3). Further cost reductions are observed for the case of A = {2,3,5}.
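These figures follow from Theorem 4.3 together with eq. (4.5); a brief check (again using the JQ^e per-operation costs inferred from the worked examples in this chapter):

```python
import math

w, L, n = 4, 2, 160
u, v = (L + 2) // 3, (2**(w - 2) - L) // 3
denom = 8 * (L + 1) - 3 * (u + v) + 2**(w - 2) * (4 * w - 1)
d_add = 2**w / denom
d_dbl = (8 * (L + 1) + 2**w * (w - 1)) / denom
d_tpl = 3 * (2**(w - 2) - (u + v)) / denom
assert abs(d_dbl + d_tpl - 1) < 1e-12      # every digit has base 2 or 3

# eq. (4.5), then expected operation counts and total cost
digits = (n - 1) * math.log(2) / (d_dbl * math.log(2) + d_tpl * math.log(3))
DBL, TPL, mADD = 6.0, 11.2, 8.4            # JQ^e costs, 1S = 0.8M (inferred)
cost = (d_dbl * DBL + d_tpl * TPL + d_add * mADD) * digits
print(round(d_dbl * digits, 1), round(d_tpl * digits, 1), round(cost))
# -> 132.7 16.6 1230 (the text's 1229.9M, up to rounding)
```

The same two-line change of w and L reproduces the Frac-wNAF-style trade-off between table size and density.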
Definition 4.1. The operation cost per bit of an elliptic curve point operation is given by
ς(operation) = cost(operation)/bitlength(operation).
Following a common practice in the literature, we express operation costs in terms of field
multiplications (M) and squarings (S), assuming the approximation 1S = 0.8M. For instance, a
point doubling in Jacobian coordinates costs ς(DBL) = DBL/log_2 2 = 7 field multiplications per
bit, where DBL = 3M + 5S (see Table 2.2).
Note that the definition above can be readily extended to division sequences. In this case, one
should take into account the cost of all the operations involved and their corresponding
bitlengths.
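For instance, the cost per bit of a single operation or of a whole division sequence can be computed as follows (a sketch; the tripling cost 7M + 7S is an assumption for illustration, since Table 2.2 is not reproduced here):

```python
import math

S = 0.8                     # squaring cost relative to a field multiplication
DBL = 3 + 5 * S             # 7.0M per doubling (Jacobian, from the text)
TPL = 7 + 7 * S             # 12.6M per tripling (assumed for this sketch)

def cost_per_bit(sequence):
    """sequence: list of (cost_in_M, base) pairs forming a division sequence."""
    total_cost = sum(c for c, _ in sequence)
    total_bits = sum(math.log2(b) for _, b in sequence)
    return total_cost / total_bits

print(cost_per_bit([(DBL, 2)]))                      # sigma(DBL) = 7.0
print(round(cost_per_bit([(TPL, 3)]), 2))            # sigma(TPL) ~ 7.95
print(round(cost_per_bit([(DBL, 2), (TPL, 3)]), 2))  # one DBL + one TPL ~ 7.58
```

Under these assumed costs a doubling gains one bit for 7.0M while a tripling gains log_2 3 bits for 12.6M, so mixed sequences can be cheaper per bit than either operation alone would suggest.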
Corollary 4.1. Among all possible chains using a given set of bases A, the optimal chains for a
given integer k are the ones with the lowest cost per bit.

If, for instance, A = {2,3}, Corollary 4.1 implies that the optimal chains for a given integer k minimize

ς(chain) = (#DBL·DBL + #TPL·TPL + #ADD·ADD) / (#DBL·bitlength(DBL) + #TPL·bitlength(TPL) + #ADD·bitlength(ADD)) ,

where OP and #OP denote the cost of a certain operation and the number of times this operation is
used, respectively. With a slight abuse of notation, bitlength(ADD) represents the number of bits
added to or subtracted from the total bitlength after an addition with a digit from a given digit set.
Obviously, an exhaustive search evaluating the costs per bit of all possible division sequences
from k would yield the optimal chains for this scalar. Nevertheless, for cryptographic purposes,
one should constrain the search to "smaller" ranges. For instance, it seems natural to limit the
magnitude of the digits in the digit set.
Proposition 4.1. Let D \ {0} = {±1, ±3, ±5, …, ±m} be a digit set with m << k for a scalar
multiplication kP. Then, the "bitlength" of an addition with any digit d_i ∈ D \ {0} (i.e., the bit
reduction or increase due to the addition operation) is negligible in comparison with the total
bitlength and approximates zero on average.
Proposition 4.2. Let A = {a_1, a_2, …, a_J} be a set of bases, where the a_j ∈ A are all primes, and let
a_1^{w_1}· … ·a_J^{w_J} | (k − k_i) for an integer k, a digit k_i ∈ D = {0, ±1, ±3, ±5, …, ±m} and integers w_j ≥ 0.
Then, for given values kP, k_iP ∈ E(F_p), the cost per bit of computing a_1^{w_1}· … ·a_J^{w_J}·(kP − k_iP),
which is denoted by ς(k − k_i), can be estimated as follows:

ς(k − k_i) ≈ (v·ADD + ∑_{j=1}^{J} w_j·(a_jP)) / (∑_{j=1}^{J} w_j·log_2 a_j) ,    (4.7)

where a_jP represents the cost of the point operation corresponding to base a_j (for instance,
a_jP = DBL if a_j = 2) and v represents the number of additions, such that v = 2 if k_i ≠ 0 and
v = 1 if k_i = 0.
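A sketch of the cost-per-bit estimate (4.7) in code; the per-operation costs are illustrative assumptions, not values from the thesis's tables:

```python
import math

OP_COST = {2: 7.0, 3: 12.6}   # DBL, TPL in field mults (assumed, Jacobian-like)
ADD = 10.2                    # mixed-addition cost (assumed)

def sigma(exponents, k_i):
    """Cost per bit of computing prod a_j^{w_j} * (kP - k_iP), as in (4.7).
    exponents: dict mapping base a_j -> exponent w_j."""
    v = 2 if k_i != 0 else 1  # number of additions charged to the sequence
    cost = v * ADD + sum(w * OP_COST[a] for a, w in exponents.items())
    bits = sum(w * math.log2(a) for a, w in exponents.items())
    return cost / bits

# A CONDITION1-style comparison: two doublings on (k - k_i) versus
# one doubling and two triplings on (k - d_i):
print(round(sigma({2: 2}, 1), 2))         # -> 17.2
print(round(sigma({2: 1, 3: 2}, 1), 2))   # -> 12.61
```

Under these assumed costs, the sequence that trades doublings for triplings wins per bit, which is exactly the kind of decision the refined algorithm of the next section automates.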
In this section we propose a new algorithm that has been derived by rewriting the original
multibase NAF and adding a few conditional statements. The refined multibase algorithm is
shown as Algorithm 4.3. In the remainder, we will refer to chains obtained from this algorithm as
refined multibase chains.
The refined algorithm chooses among candidate division sequences by evaluating them according to the cost-per-bit function (4.7).
Now, let k be a partial value of the scalar during the execution of Algorithm 4.3, let w_j, w′_j ≥ 0 be
integers, let d_i, k_i ∈ D = {0, ±1, ±3, ±5, …, ±(2^{w−1} − 1)}, d_i ≠ 0, with a standard window width w,
let e be a parameter > 0, and let φ, µ be odd integers not divisible by any of the bases
{a_1, a_2, …, a_J}. The conditional statements to be inserted in Algorithm 4.3 follow the criteria below:
CONDITION1 aims at reducing the length of the expansion by using more expensive point
operations (i.e., operations with bases a_j, where j > 1) that yield cheaper-per-bit chains than the
usual sequence of operations with base a_1 after each nonzero term. Similarly, CONDITION2
determines whether there is a chain involving an addition that is cheaper per bit than the sequence
directly dividing by the bases.
Note that if one assumes that the values obtained after executing a given sequence followed by an
addition are approximately uniformly distributed over the odd numbers, then choosing the
cheapest-per-bit sequence for a partial value k would ultimately yield a multibase chain for the
full point multiplication that is cheaper on average. However, Algorithm 4.3 does not necessarily
execute the full sequence that was chosen. It instead re-evaluates and analyzes the costs of new
sequences after each doubling, tripling or quintupling. Hence, CONDITION1 and 2 above include
a security parameter, namely e, to guarantee that the chosen sequence is significantly better than
the usual one.
Although the number of divisibility tests with different combinations of bases A that can be
evaluated in CONDITION1 and 2 of Algorithm 4.3 is potentially high, we show in the following
that only a few tests are necessary to achieve performance (conjecturally) close to optimal.
Next, we illustrate the design of efficient CONDITION1 and 2 for the case A = {2,3} . Since
extension to other cases easily follows, we simply sketch the design for the case A = {2,3,5} (see
Section 4.4.1.2). As before, we fix the main base a1 to 2.
CONDITION1:
Following the criteria discussed previously and given a partial scalar k, a standard window width w
and the set of bases A = {a_1, a_2} = {2,3}, where a_j ∈ Z^+, we propose the following format for
CONDITION1 in Algorithm 4.3:

1: if ((k − d_{i,1}) mod 2^{w′_{1,1}}·3^{w′_{2,1}} = 0 and (k − k_i) mod 2^{w+1} ≠ 0) or …
2:    ((k − d_{i,2}) mod 2^{w′_{1,2}}·3^{w′_{2,2}} = 0 and (k − k_i) mod 2^{w+2} ≠ 0) or …
   ⋮
C:    ((k − d_{i,C}) mod 2^{w′_{1,C}}·3^{w′_{2,C}} = 0 and (k − k_i) mod 2^{w+C} ≠ 0)    (4.8)

where w′_{j,c} ≥ 0 are integers, k_i = k mods 2^w, c is the condition number such that 1 ≤ c ≤ C, and
d_{i,c} ∈ D \ {0} = {±1, ±3, ±5, …, ±(2^{w−1} − 1)}. In order to guarantee a cheaper-per-bit sequence at
each evaluation of CONDITION1, it is required that ς(k − d_{i,c}) + e_c < ς(k − k_i), which compares
the sequence costs up to the next addition using positive values e_c for 1 ≤ c ≤ C. Using function
(4.7), this is roughly equivalent to comparison (4.9).
We next illustrate the procedure for selecting the values w′_{j,c} and e_c for format (4.8) using eq.
(4.9) when w = 2. The procedure can be easily extended to other window sizes.
First, we build two tables: one with the costs per bit corresponding to sequences containing
exactly d doublings (for the congruency of (k − k_i)) and another with the costs per bit
corresponding to sequences divisible by 2^{w′_{1,c}}·3^{w′_{2,c}} (for the congruency of (k − d_{i,c})). Note that,
since w = 2, it always holds that w′_{1,c} = 1. We show in Table 4.1 the results for Jacobian
coordinates using costs from Table 2.2 (assuming that 1S = 0.8M). Since w = 2, calculations are
performed with mixed additions (the cost of one mixed addition is obtained as mADD = mDBLADD − DBL).
Using Table 4.1, it is easy to see that (k − k_i) ≡ 0 (mod 4) but ≢ 0 (mod 8) yields a sequence
that is more expensive per bit than, at least, one with (k − d_{i,c}) ≡ 0 (mod 3); (k − k_i) ≡ 0 (mod 8)
but ≢ 0 (mod 16) yields a sequence that is more expensive per bit than, at least, one with
(k − d_{i,c}) ≡ 0 (mod 9); (k − k_i) ≡ 0 (mod 16) but ≢ 0 (mod 32) yields a sequence that is more
expensive per bit than, at least, one with (k − d_{i,c}) ≡ 0 (mod 27); and so on. This analysis gives a
close idea of the statements that should be defined in (4.8) for CONDITION1. In fact, if we plug the congruency evaluations above
into (4.8) for conditions c = 1, 2, 3, and so on (in that order), the multibase chains obtained are
expected to be cheaper on average than those produced by the case without conditions (i.e.,
mbNAF, given by Algorithm 4.1). Nevertheless, choosing the minimal condition for which the
congruency with (k − k_i) is more expensive is not necessarily optimal. In other words, it is still
possible to do better by choosing the optimal parameter e_c for each case.
For the latter, it is necessary to perform an analysis of the costs of the possible combinations. For
instance, consider the evaluation "if ((k − d_{i,1}) mod 3^t = 0 and (k − k_i) mod 8 ≠ 0)" with w = 2,
t ≥ 1 an integer and C = c = 1 in (4.8) to implement CONDITION1. The cost per bit in this case is
approximately given by (4.10). It can be seen from (4.10) that optimality is achieved with min(α).
For example, for J, JQ^e and IE coordinates (using operation costs from Tables 2.2, 2.3 and 2.4
and assuming 1S = 0.8M), min(α) is obtained with t = 2. Notice that the analysis of α can go
deeper and include a higher number of consecutive doublings and triplings. However, the
frequency of occurrence decreases rapidly with the number of consecutive operations, and so
does their impact on the cost. A similar analysis can be carried out to determine optimal values
for the subsequent conditions c in (4.8).
Additionally, it is necessary to determine the influence of C on the cost performance. A probability analysis similar to the one performed above can be carried out to determine the optimal C. However, this analysis increases in complexity very rapidly. Instead, we ran several
tests to evaluate the cost performance of full 160-bit scalar multiplications. The results are
discussed in the subsection “Analysis of Multiple Conditions”, pp. 90.
CONDITION2:
Following the criteria discussed previously and given a partial scalar k, standard window width w
and set of bases A = {a1 , a2 } = {2,3} , where a j ∈ Z + , we propose the following format for
CONDITION2 in Algorithm 4.3:
1: if ((k − di,1) mod 2^(w′1,1) = 0 and k mod 3^2 ≠ 0) or …
2: ((k − di,2) mod 2^(w′1,2) = 0 and k mod 3^3 ≠ 0) or …
⋮
C: ((k − di,C) mod 2^(w′1,C) = 0 and k mod 3^(C+1) ≠ 0)    (4.11)
where again w′j,c ≥ 0 are integers, c is the condition number s.t. 1 ≤ c ≤ C, and di,c ∈ D \ {0} = {±1, ±3, ±5, …, ±(2^(w−1) − 1)}. To guarantee a cheaper-per-bit sequence at each evaluation of
CONDITION2, it is required that ς(k − di,c) + ec < ς(k), which compares the sequence costs up to the next addition using positive values ec for 1 ≤ c ≤ C. Using function (4.7), this is roughly equivalent to comparison (4.12).
Let us now illustrate the procedure for selecting values w′j,c and ec for format (4.11) using
eq. (4.12) when w = 2 .
Similarly to the case of CONDITION1, we first build two tables: one with the costs per bit corresponding to sequences divisible by 2^(w′1,c) (for congruency of (k − di,c)) and another with the costs per bit corresponding to sequences with exactly t triplings (for congruency of k). In Table 4.2, we show the results for Jacobian coordinates using costs from Table 2.2 (assuming that 1S = 0.8M). Again, we assume that w = 2, calculations are performed with mixed additions and the cost of one mixed addition is obtained as mADD = mDBLADD − DBL.
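As an illustration of how such cost tables can be built, the sketch below computes the cost per bit of the two sequence types, following the pattern visible in expression (4.13): a run of operations followed by one mixed addition, divided by the number of scalar bits the run consumes. The numeric operation costs used here are hypothetical placeholders standing in for the actual values of Table 2.2.

```python
from math import log2

# Hypothetical operation costs in field multiplications (M), with 1S = 0.8M;
# the thesis takes the real values from Table 2.2 (Jacobian coordinates).
DBL, TPL, mADD = 7.2, 12.6, 10.4

def cost_per_bit_doublings(w):
    """Sequence divisible by 2^w: w doublings then one mixed addition,
    consuming w bits of the scalar."""
    return (w * DBL + mADD) / w

def cost_per_bit_triplings(t):
    """Sequence with exactly t triplings then one mixed addition,
    consuming t*log2(3) bits of the scalar."""
    return (t * TPL + mADD) / (t * log2(3))
```

With these placeholder costs, longer runs amortize the addition and are cheaper per bit, which is exactly the effect the conditions in (4.11) try to exploit.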
Using Table 4.2, we can see that k ≡ 0 (mod 3), k ≢ 0 (mod 9) yields a sequence that is more expensive per bit than, at least, (k − di,c) ≡ 0 (mod 8); k ≡ 0 (mod 9), k ≢ 0 (mod 27) yields a sequence that is more expensive per bit than, at least, (k − di,c) ≡ 0 (mod 32); k ≡ 0 (mod 27), k ≢ 0 (mod 81) yields a sequence that is more expensive per bit than, at least, (k − di,c) ≡ 0 (mod 128); and so on. If these congruency evaluations are directly plugged into (4.11) for conditions c = 1, 2, 3, and so on (in that order), the multibase chains obtained are expected to be cheaper on average than those produced
by the case without conditions (i.e., mbNAF; Algorithm 4.1). However, choosing the minimal condition for which congruency with k is more expensive is not necessarily optimal. In this case, it is necessary to perform a more detailed analysis of the costs of the possible combinations. For instance, consider the evaluation “if ((k − di,1) mod 2^d = 0 and k mod 9 ≠ 0)” with w = 2, integer d > 1 and C = c = 1 in (4.11) to implement CONDITION2 in Algorithm 4.3. The cost per bit in this case is approximately given by:
(2/3) β + (1/3) (2TPL + 1ADD)/(2 log2 3),    (4.13)
By analyzing (4.13), it can be seen that optimality is achieved with min(β). For instance, for J, JQ^e and IE coordinates (using operation costs from Tables 2.2, 2.3 and 2.4 and assuming 1S = 0.8M), min(β) is obtained with d = 4. Although the analysis of β can go deeper and include higher numbers of consecutive doublings and triplings, the occurrence of such sequences decreases rapidly with the number of consecutive operations, and so does their impact on the cost. A similar analysis can be carried out to determine optimal values for the following conditions c in (4.11).
In the following example, we illustrate the derivation of a multibase chain using Algorithm
4.3 with an efficient selection of parameters for CONDITION1 (4.8) and CONDITION2 (4.11), as
discussed in this section. In the remainder, conditions from (4.8) and (4.11) are denoted by
pairing values 2^(w′1,c) ⋅ 3^(w′2,c) and 2^(w+c), and values 2^(w′1,c) and 3^(c+1), respectively, as follows:
(2^(w′1,1)⋅3^(w′2,1)-2^(w+1), 2^(w′1,2)⋅3^(w′2,2)-2^(w+2), …, 2^(w′1,C)⋅3^(w′2,C)-2^(w+C) | 2^(w′1,1)-3^2, 2^(w′1,2)-3^3, …, 2^(w′1,C)-3^(C+1)),
where paired values for CONDITION1 and 2 are separated by “|”. For instance, in Example 4.4
conditions denoted by (9-8|32-9) mean that Algorithm 4.3 includes the evaluation
“ if ((k − di ,1 )mod9 = 0 and (k − ki )mod8 ≠ 0) ” as CONDITION1 and the evaluation
“ if ((k − di ,1 )mod32 = 0 and k mod9 ≠ 0) ” as CONDITION2.
Example 4.4. Using Algorithm 4.3, we find the following refined multibase chain for computing 8821P by using bases {2,3}, w = 2 and conditions (9-8|32-9): 8821 = 1(3) 0(2) 0(2) 0(2) −1(2) 0(2) 0(2) 0(2) 0(2) −1(2) 0(3) 0(2) 1(2), which has been derived using the division sequence 8821 − 1 → 8820 → 8820/2 = 4410 → 4410/2 = 2205 → 2205/3 = 735 → 735 + 1 → 736/2 = 368 → 368/2 = 184 → 184/2 = 92 → 92/2 = 46 → 46/2 = 23 → 23 + 1 → 24/2 = 12 → 12/2 = 6 → 6/2 = 3 → 3/3 = 1.
Notice that, for instance, the partial value 735 is conveniently approximated to 736, by means
of CONDITION1, instead of dividing it by 3, allowing the efficient insertion of several
consecutive doublings that ultimately reduce the nonzero density of the expansion. If we compare the performance of this multibase chain for computing 8821P against that of the basic multibase NAF approach with the same window size, we observe that the cost is reduced from 8DBL + 3TPL + 4mADD = 115.2M to only 10DBL + 2TPL + 3mADD = 107.6M (JQ^e, 1S = 0.8M).
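As a quick consistency check, the chain of Example 4.4 can be evaluated with Horner’s rule while counting the implied point operations. The digit list below is a sketch transcribing the division sequence above, least significant digit first:

```python
# Digits (digit, base) read off the division sequence of Example 4.4,
# least significant digit first; the leading 1 of the chain is implicit.
digits = [(1, 2), (0, 2), (0, 3), (-1, 2), (0, 2), (0, 2), (0, 2), (0, 2),
          (-1, 2), (0, 2), (0, 2), (0, 3)]

v, dbl, tpl, madd = 1, 0, 0, 0            # start from the implicit leading 1
for d, b in reversed(digits):             # most significant digit first
    v = v * b + d                         # one DBL or TPL, plus mADD if d != 0
    dbl += (b == 2); tpl += (b == 3); madd += (d != 0)

assert (v, dbl, tpl, madd) == (8821, 10, 2, 3)   # 10DBL + 2TPL + 3mADD
```

The operation count recovered this way matches the 10DBL + 2TPL + 3mADD figure quoted in the example.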
Finally, a probability analysis can be carried out to determine the optimal C for
CONDITION1 and 2. As stated before, this analysis increases in complexity very rapidly, so
instead we have run many tests to evaluate the cost performance of full 160-bit scalar
multiplications. The results are discussed in the following subsection.
[Figure 4.1: plot of Cost (M), Jacobian coordinates, vs. conditional statements nc, (1|1), (2|2), (3|3), (4|4).]
Figure 4.1. Cost of 160-bit point multiplication without precomputations using refined multibase chains. Conditional statements: nc = no conditions, (1|1) = (9-8|32-9), (2|2) = (9-8,27-16|32-9,64-27), (3|3) = (9-8,27-16,81-32|32-9,64-27,128-81), (4|4) = (9-8,27-16,81-32,243-64|32-9,64-27,128-81,256-243).
[Figure 4.2: plot of Cost (M), Jacobian coordinates, vs. conditional statements nc, (1|1), (2|2), (3|3), (4|4).]
Figure 4.2. Cost of 160-bit point multiplication with w = 5 using refined multibase chains. Conditional statements: nc = no conditions, (1|1) = (144-64|64-9), (2|2) = (144-64,324-128|64-9,512-27), (3|3) = (144-64,324-128,648-256|64-9,512-27,1024-81), (4|4) = (144-64,324-128,648-256,972-512|64-9,512-27,1024-81,2048-243).
Example 4.5. If we select conditions (9-8|32-9), window size w = 2 and bases {2,3}, then it is
straightforward to transform Algorithm 4.3 and replace lines 3 to J+5 with the following:
ki = k mods 4, bi = 2
if k mod 2 = 0, ki = 0
elseif k mod 3 = 0 and ~[(k − ki) mod 32 = 0 and k mod 9 ≠ 0], ki = 0, bi = 3
elseif (k + ki) mod 9 = 0 and (k − ki) mod 8 ≠ 0, ki = −ki
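As a concrete rendering of this compact evaluation, the loop below recodes a scalar with w = 2, bases {2,3} and conditions (9-8|32-9). This is a sketch: the digit/base update k ← (k − ki)/bi at the end of each iteration is assumed from Algorithm 4.3, and the function names are illustrative.

```python
def mods(k, m):
    """Signed residue of k modulo m, in [-m/2, m/2)."""
    r = k % m
    return r - m if r >= m // 2 else r

def refined_chain_23(k):
    """Refined multibase recoding for w = 2, bases {2,3}, conditions (9-8|32-9).
    Returns (digit, base) pairs, least significant first; the leading digit 1
    of the chain is left implicit."""
    digits = []
    while k > 1:
        ki, bi = mods(k, 4), 2
        if k % 2 == 0:
            ki = 0
        elif k % 3 == 0 and not ((k - ki) % 32 == 0 and k % 9 != 0):
            ki, bi = 0, 3
        elif (k + ki) % 9 == 0 and (k - ki) % 8 != 0:
            ki = -ki                      # approximate k instead of truncating
        digits.append((ki, bi))
        k = (k - ki) // bi                # update assumed from Algorithm 4.3
    return digits
```

For k = 8821 this loop reproduces the division sequence of Example 4.4: ten base-2 steps, two base-3 steps and three nonzero digits.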
Example 4.6. If we select conditions (144-64|64-9), window size w = 5 and bases {2,3}, then lines 3 to J+5 of Algorithm 4.3 can be replaced analogously.
The modified algorithms above are obtained by removing redundancy in the evaluations and rearranging conditional statements once the design parameters are fixed. We remark that these algorithms are equivalent to Algorithm 4.3 and yield the same output for a given scalar when using the same design parameters. As a consequence, we observe that the refined multibase methodology described in this section can achieve (conjecturally) close-to-optimal performance with highly compact algorithms.
Suggested formats for CONDITION1 and 2 in this case are provided below.
CONDITION1:
Given a partial scalar k, standard window width w and set of bases A = {a1 , a2 , a3} = {2,3,5} ,
where a j ∈ Z + , we propose the following format for CONDITION1 in Algorithm 4.3:
1: if ((k − di,1) mod 2^(w′1,1) ⋅ 3^(w′2,1) ⋅ 5^(w′3,1) = 0 and (k − ki) mod 2^(w+1) ≠ 0) or …
2: ((k − di,2) mod 2^(w′1,2) ⋅ 3^(w′2,2) ⋅ 5^(w′3,2) = 0 and (k − ki) mod 2^(w+2) ≠ 0) or …
⋮
C: ((k − di,C) mod 2^(w′1,C) ⋅ 3^(w′2,C) ⋅ 5^(w′3,C) = 0 and (k − ki) mod 2^(w+C) ≠ 0)    (4.14)
where w′j,c ≥ 0 are integers, ki = k mods 2^w, c is the condition number such that 1 ≤ c ≤ C and di,c ∈ D \ {0} = {±1, ±3, ±5, …, ±(2^(w−1) − 1)}.
CONDITION2:
Given a partial scalar k, standard window width w and set of bases A = {a1 , a2 , a3} = {2,3,5} ,
where a j ∈ Z + , we propose the following format for CONDITION2.1 in Algorithm 4.3:
1: if (k mod 5 ≠ 0) and
  1.1: [if ((k − di,1) mod 2^(w′1,1) = 0 and k mod 3^2 ≠ 0) or …
  1.B: ((k − di,B) mod 2^(w′1,B) = 0 and k mod 3^(B+1) ≠ 0)] or …
2: if (k mod 5 = 0) and
  2.1: [if ((k − di,1) mod 2^(w″1,1) = 0 and k mod 3⋅5 ≠ 0) or …
  2.C: ((k − di,C) mod 2^(w″1,C) = 0 and k mod 3^u ⋅ 5^v ≠ 0)]    (4.15)
where w′j,b, w″j,c, u, v ≥ 0 are integers, b and c are the condition numbers such that 1 ≤ b ≤ B, 1 ≤ c ≤ C, and di,b, di,c ∈ D \ {0} = {±1, ±3, ±5, …, ±(2^(w−1) − 1)}. Note that the upper section of (4.15) evaluates conditions when sequences are not divisible by 5, whereas the lower section evaluates conditions when sequences are divisible by both 3 and 5.
For CONDITION2.2, we propose the following format:
1: if ((k − di,1) mod 2^(w′1,1) = 0 and k mod 5^2 ≠ 0) or …
2: ((k − di,2) mod 2^(w′1,2) = 0 and k mod 5^3 ≠ 0) or …
⋮
C: ((k − di,C) mod 2^(w′1,C) = 0 and k mod 5^(C+1) ≠ 0)    (4.16)
where again w′j,c ≥ 0 are integers, c is the condition number s.t. 1 ≤ c ≤ C, and di,c ∈ D \ {0} = {±1, ±3, ±5, …, ±(2^(w−1) − 1)}.
The costs using the different methods are summarized in Tables 4.3 and 4.4 for n = 160 and 256 bits, respectively. We have further sped up the proposed multibase methods by saving some initial computations. This technique is similar to that proposed in [Elm06, Section 4.2.2], plus some additional savings gained with the use of composite operations (i.e., tripling, quintupling).
Note that for Jacobian coordinates we use the efficient doubling-addition (DBLADD)
operation instead of traditional addition for all the proposed methods. This operation has also
been used to improve the performance of the tree-based approach by Doche et al. [DH08].
As can be seen, in the scenario without precomputations, the new refined multibase chains obtained from Algorithm 4.3 achieve the lowest costs for all curves and security levels under analysis. For instance, our results reduce costs by 3% and 10% in comparison with the tree-based method and NAF, respectively, on both JQ^e and J coordinates with n = 160 bits. On the other hand, the basic multibase NAF using bases {2,3} and {2,3,5} achieves better performance than the original double-base method based on the “Greedy” algorithm [DIM05]. That is in addition to the attractive features of mbNAF such as simplicity, memory efficiency and ease of theoretical analysis. The tree-based method achieves slightly lower costs than mbNAF for bases {2,3} when using IE coordinates. However, mbNAF with bases {2,3,5} surpasses the performance of this method in all the remaining cases. We remark that the tree-based method also finds double-base chains using division chains, although it uses a search-based approach that consumes more memory than the basic multibase NAF.
Remarkably, in some scenarios using J, refined multibase chains with bases {2,3,5} and no precomputations surpass the performance of the fastest NAF-based method using an optimal number of precomputed points. For instance, if n = 160 bits the multibase method is always superior when 1I > 19M.
For comparison in the scenario with optimal number of precomputations, we include results
by Bernstein et al. [BBL+07]. This work uses a double-base method based on the “Greedy”
algorithm that has been optimized for the use of precomputations. We can see that both the basic
wmbNAF and the refined multibase chains offer lower computing costs for all the cases under
analysis. Note that in this case the performance gap is due to a combination of superior multibase
chains and precomputation schemes, faster point operations (e.g., we use the doubling-addition
operation in Jacobian coordinates) and the inclusion of the technique to save initial computations.
A more serious competition is brought by the recent work by Meloni and Hasan [MH09],
which proposes the use of DBNS representations in combination with Yao’s algorithm. This
method, denoted by Yao-DBNS, is not based on division chains and has been shown to be
efficient when using DBNS representations obtained with the “Greedy” algorithm. Therefore, it
is intended for platforms where memory is not scarce.
If there are no memory restrictions, the refined multibase chains using bases {2,3,5} and Yao-
Table 4.3. Comparison of double-base and triple-base scalar multiplication methods (n = 160 bits; 1S = 0.8M).
Yao-DBNS (Greedy), Meloni et al. [MH09] N/A N/A 1211M N/A N/A 1259M N/A N/A 1475M
Double-base (Greedy), Bernstein et al. [BBL+07] 7 N/A 1311M 7 N/A 1290M 7 N/A 1504M †
6 55.4M 1476M
(Fractional) wNAF 7 59.6M 1246M 7 62.2M 1291M
6 1I+68.4M 1I + 1432M
† Without using doubling-addition operation [LM08b].
(1) Bases {2,3}.
(2) Bases {2,3,5}.
Table 4.4. Comparison of double-base and triple-base scalar multiplication methods (n = 256 bits; 1S = 0.8M).
Yao-DBNS (Greedy), Meloni et al. [MH09] N/A N/A 1911M N/A N/A 1993M N/A N/A 2316M
Double-base (Greedy), Bernstein et al. [BBL+07] 8 N/A 2071M 8 N/A 2041M 7 N/A 2379M †
8 72.2M 2326M
(Fractional) wNAF 8 69M 1954M 8 72M 2023M
8 1I+89.6M 1I + 2235M
† Without using doubling-addition operation [LM08b].
(1) Bases {2,3}.
(2) Bases {2,3,5}.
DBNS achieve very close performance for all cases and security levels under analysis. The gap when using JQ^e and IE coordinates is between ~0% and 1% in favor of Yao-DBNS. Given the small theoretical gap, and because factors such as cache performance and operation cost variations influence computing time in practice, both methods are expected to achieve equivalent performance for all practical purposes. When using J coordinates, the refined multibase chains remain faster than Yao-DBNS, with an advantage of 2%-3%.
Table 4.5. Comparison of lowest costs using multibase and radix-2 methods for scalar
multiplication, n = 160 bits (cost of precomputation is not included).
ExtJQuartic, d = 1, JQ^e refined (2,3,5) 1281M 1346M 1428M 1186M 1254M 1339M
TEdwards, a = 1, IE refined (2,3) 1372M 1444M 1534M 1233M 1303M 1390M
TEdwards, a = −1, E/E^e (w)NAF 1353M 1353M 1353M 1181M 1181M 1181M
the clear winner given the significant overhead introduced by extra multiplications by constants
and/or the reduced gain margin obtained with the use of multibases.
In conclusion, if curve parameters are suitably chosen then curves using multibase methods (which would otherwise be slower) may become competitive and even faster than the fastest known curves using radix 2 on memory-constrained devices. For other applications with no memory constraints, we suggest using the fastest curves with (Frac-)wNAF.
the number of iterations required to the number of additions. The impact in the cost of scalar
multiplication is left as future work.
Closely following developments for single scalar multiplication, recent efforts have appeared for speeding up multiple scalar multiplication of the form kP + lQ using double-base chains. See for instance [DKS09], which presents the analogue of the original tree-based approach [DH08], or [ADI10]. All these works employ division chains and can be improved by exploiting the methodology based on the operation cost per bit exposed in this chapter. The different variants discussed in this subsection could also be adapted to this case.
Very recently, and building on top of our techniques published at PKC 2009 [LG09], Walter [Wal11] also proposed the use of the cost per bit to derive multibase algorithms based on division chains. Although his methodology is based on a slightly more elaborate cost function, the results are expected to be similar to those obtained with the methodology in Section 4.4.1.1. The algorithms in [Wal11] are similar (with some variations) to the ones proposed at PKC 2009 and revisited here. Although [Wal11] presented slightly better results, we implemented and tested the modified algorithms under the same conditions in which all our algorithms were tested, and they achieved equivalent or slightly lower performance than our results. Walter also proposed simplifying the algorithms to obtain much more compact versions. Following these suggestions, we derived compact versions of our algorithms in the subsection “Highly Compact Multibase Algorithms, Bases {2,3}”, pp. 92.
It has been shown in this chapter that the use of double- and multi-base representations enables faster scalar multiplication in terms of field multiplications and squarings. However, the conversion step in double-base and multi-base methods is more time-consuming than in methods based on radix 2. This may or may not be a limiting factor depending on the characteristics of a particular implementation and the chosen platform.
If scalar conversion to multibase representation is expensive, then it must be performed off-
line, limiting the applicability of these methods to scenarios in which the same scalar k is reused
several times or the conversion can be carried out during an idle time (e.g., between the first and
second phases of the ECDH scheme during data transmission). To overcome this restriction,
more research is necessary for developing efficient conversion mechanisms for popular
4.7. Conclusions
This chapter discussed the efficient design of scalar multiplication algorithms based on double
and multibase chains.
In §4.1, we categorized and analyzed the most relevant methods using double-base and multi-
base representations in the literature, highlighting advantages and disadvantages. Then in §4.2 we
formally described the original (width-w) multibase NAF method, presenting the theoretical
analysis of the different variants using Markov chains. In §4.3, we applied the fractional window
recoding to multibase NAF. The revised method allows any number of precomputations,
enabling lower costs and/or better coupling to memory-constrained environments.
In §4.4, we introduced a novel methodology based on the analysis of point operation cost per
bit to design flexible algorithms able to find more efficient multibase chains. This approach was
implicitly used in Longa and Gebotys [LG09] to derive refined mbNAF chains, although an
explicit description of the algorithm derivation was missing. We have filled the gap in this
chapter. Intuitively, given unlimited resources this approach is expected to lead to optimal
multibase chains. We demonstrated that very compact algorithms are still able to achieve high
performance. We derived algorithms for the case of bases {2,3} and {2,3,5}, and analyzed the
performance gain with the increase in the complexity of the multibase evaluation. For illustration
purposes, we focused the analysis on three scenarios: standard curves using Jacobian coordinates, extended Jacobi quartic curves using extended Jacobi quartic coordinates, and Twisted Edwards curves using inverted Edwards coordinates.
In §4.5 we carried out a detailed comparison of the studied methods with the best approaches
in the literature. For further cost improvement, we applied the best precomputation method
developed in Chapter 3 for each scenario. After extensive comparisons with the most efficient
methods in the literature, we concluded that the refined multibase chains achieve the highest
performance on all scenarios with no precomputations, introducing cost reductions in the range
7%-10% in comparison with NAF. For the case of optimal use of precomputations, we showed that the proposed algorithms are among the fastest, achieving practically equivalent performance to recent methods such as Yao-DBNS [MH09]. In this case, the theoretical cost reductions are in the range 1%-3% in comparison with (Frac-)wNAF.
Finally, in §4.6 we discussed several potential extensions of the multibase approach based on the analysis of the operation cost per bit. We detailed how this tool could lead to different variants of the proposed multibase algorithms and how it could even improve existing methods in the literature. Other possible applications such as multiple scalar multiplication were also covered, as well as a discussion of open problems that challenge the practicality of double-base and multi-base methods in real applications. In conclusion, we suggested the use of multibases for memory-constrained devices when the conversion step (if expensive) can be performed off-line. When precomputations are allowed, the gain may be negligible, and faster curves that do not exploit multibases are available.
Chapter 5
Efficient Techniques for Implementing Elliptic Curves in Software
In this chapter, we analyze and present experimental data evaluating the efficiency of several
techniques for speeding up the computation of elliptic curve point multiplication on emerging
x86-64 processor architectures. Our approach is based on a careful optimization of elliptic curve
operations at all arithmetic layers in combination with techniques from computer architecture.
Our contributions can be summarized as follows:
Chapter 5: Efficient Techniques for Implementing Elliptic Curves in Software
subtractions and small constants and maximizing the use of operations exploiting IR.
• We study the extension of all the previous techniques to field arithmetic over Fp^2, which has several applications in cryptography, including its use as the underlying field in the recently proposed Galbraith-Lin-Scott (GLS) method.
• We explicitly state improved explicit formulas using IR, with a minimal number of operations and a reduced number of data dependencies between contiguous field operations, for two relevant cases: standard curves using J coordinates and Twisted Edwards curves using mixed homogeneous/extended homogeneous (E/E^e) coordinates.
• Finally, to illustrate the significant savings obtained by combining all the previous
techniques with state-of-the-art ECC algorithms we present high-speed implementations
of point multiplication that are up to 34% faster than the best previous results on x86-64
processors. Our software takes into account results from Chapter 3 and includes the best
precomputation scheme corresponding to each setting.
Analysis and tests presented in this chapter are carried out on emerging x86-64 processors, which are in widespread use in notebooks, desktops, workstations and servers.
The reader should note, however, that some techniques and analysis are generic and can be
extended to other computing devices based on 32-, 16- or 8-bit architectures. Whenever relevant,
we briefly discuss the applicability of the techniques under analysis to other architectures.
This chapter is organized as follows. After discussing some relevant previous work and
background related to x86-64 processors in §5.1, we describe the techniques for optimizing
modular reduction using a pseudo-Mersenne prime, namely incomplete reduction and elimination
of conditional branches, in §5.2. Then, in §5.3 we study data dependencies between field
operations and analyze some efficient countermeasures when their effect is potentially negative
to performance. In §5.4, we describe our optimizations to explicit formulas that enable a
reduction in the number of additions and other “small” operations. The extension of the
techniques above to quadratic extension fields is presented in §5.5. Our high-speed
implementations with and without exploiting the GLS method that illustrate the performance gain
obtained with the techniques under analysis are presented in §5.6. Some conclusions are drawn in
§5.7.
the number of point operations required for computing scalar multiplication. Other approaches
have focused on constructing curve forms with fast group arithmetic and/or improved resilience
against certain side-channel analysis (SCA) attacks [Sma01, BJ03b, Edw07], complemented by
research studying efficient projective systems and optimized explicit formulas for point
operations [CC86, CMO98, HWC+08, LM08b, HWC+09]. Yet another important aspect refers to
the efficient implementation of long integer modular arithmetic [Kar95, Mon85, Com90,
YSK02]. Given the myriad of possibilities, it is a very difficult task to determine which methods, once combined for the computation of scalar multiplication, are the most efficient ones for a specific platform. Notable efforts in this direction are the efficient implementations on constrained 8-bit microcontrollers by [GPW+04, UWL+07], on 32-bit embedded devices by [XB01, GAS+05], on Graphical Processing Units (GPUs) by [SG08], on processors based on the Cell Broadband Engine Architecture (CBEA) by [CS09], and on 32-bit processors by [BHL+01, Ber06], among others. In this work, we carry out this analysis for the increasingly popular x86-64-based processors.
platforms, the cost of addition and other similar operations becomes non-negligible. A consequence of this observation is that, on the targeted 64-bit platforms, optimizing these usually neglected “small” operations becomes relevant.
Another important feature is the highly pipelined architectures of these processors. For
instance, experiments by [Fog2] suggest that Intel Atom, Intel Core 2 Duo and AMD
architectures have pipelines with 16, 15 and 12 stages, respectively. Although sophisticated
branch prediction techniques exist, it is expected that the “random” nature of cryptographic
computations, specifically of modular reduction, causes expensive mispredictions that force the
pipeline to flush.
In this work, we analyze the performance of combining incomplete reduction (IR) and the
elimination of conditional branches to obtain high-speed field arithmetic for performing
operations such as addition, subtraction and multiplication/division by small constants using a
very efficient pseudo-Mersenne prime. This effort puts together in an optimal way techniques by
[YSK02], which only provided IR algorithms targeting primes of general form, with branchless
field arithmetic recently adopted by some cryptographic libraries [mpFq, MIR]. In the process,
we present experimental data quantifying the performance improvement obtained by eliminating
branches in the field arithmetic.
We also analyze the influence of deeply pipelined architectures on the execution of ECC point multiplication. In particular, the increased number of pipeline stages makes (true) data dependencies between instructions in contiguous field operations expensive, because these can potentially stall the execution for several clock cycles. These dependencies, also known as read-after-write (RAW) dependencies, are typically found between field operations when the result of one operation is required as input by a following operation. In this work, we demonstrate the potentially high cost incurred by these dependencies, which is hardly avoided by compilers and dynamic schedulers in processors, and propose three techniques to reduce their effect: field arithmetic scheduling, merging of field operations and merging of point operations.
The techniques above are first applied to modular operations using a prime p, which are used
for performing the Fp arithmetic in ECC over prime fields. However, some of these techniques
are generic and can also be extended to different scenarios using other underlying fields. For
instance, Galbraith et al. [GLS09] recently proposed a faster way to do ECC that exploits an
efficiently computable endomorphism to accelerate the execution of point multiplication over a
quadratic extension field (a.k.a. the GLS method); see Section 2.2.6. Accordingly, we extend our analysis to Fp^2 arithmetic and show that the proposed techniques also lead to significant gains in performance in this case.
Our extensive tests assessing the techniques under analysis cover at least one representative
x86-64-based CPU from each processor class: 1.66GHz Intel Atom N450 from the notebook/
netbook class, 2.66GHz Intel Core 2 Duo E6750 from the desktop class, and 2.6GHz AMD
Opteron 252 and 3.0GHz AMD Phenom II X4 940 from the server/workstation class.
Finally, to assess their effectiveness for a full point multiplication, the proposed techniques are applied to state-of-the-art implementations using Jacobian (J) and mixed Twisted Edwards homogeneous/extended homogeneous (E/E^e) coordinates on the targeted processors. Our measurements show that the proposed optimizations (in combination with state-of-the-art point formulas/coordinate systems, precomputation schemes and exponentiation methods) significantly speed up the execution time of point multiplication, surpassing the best previous results by considerable margins. For instance, we show that a point multiplication at the 128-bit security level can be computed in only 181000 cycles (about 60 µsec) on an AMD Phenom II X4 when combining E/E^e with GLS. This represents a cost reduction of about 29% over the closest previous result; see Section 5.6.4 for complete details.
This technique was introduced by Yanik et al. [YSK02] for the case of primes of general form. Given two numbers in the range [0, p − 1], it consists of allowing the result of an operation to stay in the range [0, 2^s − 1] instead of executing a complete reduction, where p < 2^s < 2p − 1, s = n ⋅ w, w is the basic wordlength (typically, w = 8, 16, 32, 64) and n is the number of words. If the modulus is a pseudo-Mersenne prime of the form 2^m − c such that m = s and c < 2^w, then the method becomes even more advantageous. In the case of addition, for example, the result can be reduced by first discarding the carry bit in the most significant word and then adding the correction value c, which fits in a single w-bit register. Also note that this last addition does not produce an overflow because 2 × (2^m − c − 1) − (2^m − c) < 2^m. The procedure is illustrated for the case of modular addition in Algorithm 5.1(b), for which the reduction step described above is
performed in steps 4-8. In contrast, it can be seen in Algorithm 5.1(a) that a complete reduction additionally requires the execution of steps 9-14, which perform a subtraction r − p in case p ≤ r < 2^m, where r is the partial result from step 3.
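The carry-discard-and-correct step can be sketched at the integer level as follows. The prime used here, p = 2^255 − 19, is only a convenient example of a pseudo-Mersenne modulus; the thesis targets a prime of this general shape rather than this particular one.

```python
M, C = 255, 19               # example pseudo-Mersenne prime p = 2^M - C
P = (1 << M) - C

def add_ir(a, b):
    """Addition with incomplete reduction: inputs in [0, p-1], output left
    in [0, 2^M - 1] (a sketch of Algorithm 5.1(b) at the integer level)."""
    r = a + b
    if r >> M:                       # carry bit set in the top word:
        r = (r - (1 << M)) + C       # discard it and add the correction c
    return r                         # congruent to a + b (mod p), maybe > p
```

Since a + b ≤ 2(2^M − C − 1), the corrected value is at most 2^M − C − 2, so the correction itself never overflows, exactly as argued above.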
Yanik et al. [YSK02] also show that subtraction can benefit from IR when using a prime p of arbitrary form. However, we show in the following that for primes of special form, such as pseudo-Mersenne primes, this is not necessarily the case.
Modular Subtraction:
Let us consider Algorithm 5.2. After step 3 we obtain the completely reduced value r = a − b if
borrow = 0. If, otherwise, borrow = 1, then this bit is discarded and the partial result is given by
r = a − b + 2^m, where b > a. This value is incorrect because of the extra addition of 2^m.
Step 6 performs the computation r + p = (a − b + 2^m) + (2^m − c) = a − b − c + 2^(m+1), where
2^m < a − b − c + 2^(m+1) < 2^(m+1) since −2^m + c < a − b < 0. Then, by simply discarding the final carry
from this result (i.e., by subtracting 2^m) we obtain the correct, completely reduced result
a − b − c + 2^(m+1) − 2^m = a − b + p, where 0 < a − b + p < p. Since Algorithm 5.2 gives the
correct, completely reduced result for either value of borrow after step 3 (similarly to the case of
the carry in Algorithm 5.1(b)), there is no need for incomplete reduction in this case.
Algorithm 5.2. Modular subtraction with a pseudo-Mersenne prime and complete reduction
Input: integers a, b ∈ [0, p − 1], p = 2^m − c, m = n·w, where n, w, c ∈ Z+ and c < 2^w
Output: r = a − b (mod p)
1: borrow = 0
2: For i from 0 to n − 1 do
3:    (borrow, r[i]) ← a[i] − b[i] − borrow
4: If borrow = 1 then
5:    carry = 0
6:    For i from 0 to n − 1 do
7:       (carry, r[i]) ← r[i] + p[i] + carry
8: Return r
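A high-level sketch of Algorithm 5.2 (for illustration only; the multiprecision words are modeled with Python integers and the borrow/carry handling is expressed with a bit mask):

```python
# Sketch of Algorithm 5.2: modular subtraction for p = 2^m - c. When a borrow
# occurs, the partial result a - b + 2^m is corrected by adding p and
# discarding the final carry, which already yields a completely reduced
# result, so incomplete reduction brings no benefit here.
M, C = 256, 189
P = 2**M - C
MASK = (1 << M) - 1

def sub_cr(a, b):
    r = (a - b) & MASK        # word-wise subtraction; a borrow adds 2^m
    if a < b:                 # borrow = 1
        r = (r + P) & MASK    # add p, discard the final carry (i.e., 2^m)
    return r                  # always in [0, p - 1]
```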
Nevertheless, there are other types of “small” operations that may benefit from the use of IR.
Next we analyze the cases that are useful to the setting of ECC over prime fields.
Division by 2:
This operation using IR is illustrated in Algorithm 5.3(b). If the value a is even, then the
division by 2 can be applied directly through steps 5-7, where (carry, r[i]) ← (carry, r[i])/2
represents the concurrent assignments r[i] ← (carry·2^w + r[i])/2 and carry ← r[i] (mod 2).
In this case, if a ∈ [0, 2^m − 2] then the result r ∈ [0, 2^(m−1) − 1] is completely reduced since
2^(m−1) − 1 << 2^m − c for practical values of m such that c < 2^w and w < m − 1. If, otherwise, the
operand a is odd, we first add p to a in steps 3-4 to obtain an equivalent value from the residue
class that is even. Then 2^m − c + 1 < p + a < 2^(m+1) − c − 1, where the partial result has at most
m + 1 bits and is stored in (carry, r). The operation is then completed by dividing by 2 through
steps 5-7, where the final result satisfies 2^(m−1) − (c − 1)/2 < (p + a)/2 < 2^m − (c + 1)/2. Hence,
the result is incompletely reduced because 2^m − c ≤ 2^m − (c + 1)/2 ≤ 2^m − 1. If the result needs
to be completely reduced then, for the case that (p + a)/2 ∈ [p, 2^m − (c + 1)/2], one needs to
additionally compute a subtraction by p such that 0 ≤ (p + a)/2 − p < (c − 1)/2 < 2^m − c, as
performed in steps 9-12 of Algorithm 5.3(a).
It is also interesting to note that, in the case that the input a is in completely reduced form, i.e., if
a ∈ [0, p − 1], after steps 6-7 in Algorithm 5.3(b) we get 2^(m−1) − (c + 1)/2 < (p + a)/2 < 2^m − c,
which is in completely reduced form.
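The IR division by 2 can be sketched as follows (illustration only; Python integers again stand in for the word array, and the (carry, r) pair of Algorithm 5.3(b) is simply the (m+1)-bit intermediate value):

```python
# Sketch of Algorithm 5.3(b): division by 2 with incomplete reduction for
# p = 2^m - c. An odd operand is first made even by adding p; the result of
# the shift may then land in [p, 2^m - 1], i.e., incompletely reduced.
M, C = 256, 189
P = 2**M - C

def div2_ir(a):
    if a & 1:
        a += P        # a + p fits in m + 1 bits, held as (carry, r)
    return a >> 1     # result in [0, 2^m - 1]
```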
Table 5.1. Cost (in cycles) of modular operations when using incomplete reduction (IR) and
complete reduction (CR); p = 2^256 − 189.
As can be seen, in our experiments using the pseudo-Mersenne prime p = 2^256 − 189 we
obtain significant cost reductions ranging from 7% up to 41% when using IR.
It is important to note that, because multiplication and squaring may accept inputs in the
range [0, 2^m − 1], an operation using IR can precede either of these two operations. Thus, the
reduction process (which is left "incomplete" by the operation using IR) is fully completed by
the subsequent multiplication or squaring at no additional cost. If care is taken when
implementing point operations, virtually all additions and multiplications/divisions by small
constants can be implemented with IR, because most of their results are later required only
by multiplications or squarings. See Appendix B1 for details about the scheduling of field
operations over Fp suggested for point formulas using J and E/E^e coordinates.
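The key point, namely that a multiplication's own modular reduction absorbs an incompletely reduced input at no extra cost, can be checked directly (a sketch; add_ir is the IR addition of Algorithm 5.1(b)):

```python
# A value left incompletely reduced (in [0, 2^256 - 1], possibly >= p) can
# feed a multiplication directly: the multiplication's reduction completes it.
M, C = 256, 189
P = 2**M - C

def add_ir(a, b):
    # IR addition (Algorithm 5.1(b)): discard the carry bit, add back c
    r = a + b
    return (r & ((1 << M) - 1)) + C * (r >> M)

def mul_mod(a, b):
    # modular multiplication; accepts operands in the full range [0, 2^256 - 1]
    return (a * b) % P
```

For instance, with a = p − 10 and b = 100 the IR sum equals p + 90 ≥ p, yet multiplying it by any x yields the same residue as multiplying the fully reduced sum.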
Conditional branches may be expensive on modern processors with deep pipelines if the
prediction strategy fails in a large fraction of instances for a particular implementation.
Recovering from a mispredicted branch requires flushing the pipeline, wasting several clock
cycles, which may increase the overall cost significantly. In particular, the reduction portion of
modular addition, subtraction and similar operations is traditionally expressed with a conditional
branch. For example, let us consider the evaluation in step 4 of Algorithm 5.1(b) for performing a
modular addition with IR. Because a, b ∈ [0, p − 1] and 2^m − p = c (again considering p = 2^m − c
and m = s), where c is a relatively small number such that 2^m ≈ p for practical parameters, the
possible values of carry after computing a + b in steps 2-3, where (a + b) ∈ [0, 2p − 2], are
(approximately) equally distributed and describe a "random" sequence for all practical purposes.
In this scenario, only an average of 50% of the predictions can be correct in the best case. Similar
results are expected for the conditional branches in other operations (see Algorithms 5.1, 5.2 and 5.3).
To avoid the latter effect, it is possible to eliminate conditional branches by using techniques
such as look-up tables or branch predication. In Figure 5.1, we illustrate the replacement of the
conditional branch in step 4 of Algorithm 5.1(b) by a predicated move instruction (Figure 5.1(a))
and by a look-up table with indexed indirect addressing (Figure 5.1(b)). In both cases, the
strategy is to perform an addition with 0 if there is no carry-out (i.e., the reduction step is not
required) or an addition with c = 189, where p = 2^256 − 189, if there is a carry-out and the
computation (a + b − 2^256) + 189 is necessary. On the targeted CPUs, our tests reveal that branch
predication performs slightly better in most cases. This conclusion is platform-dependent and, in
the case of the targeted processors, may be due to the faster execution of cmov in comparison
with the memory access required by the look-up table approach.
(a) (b)
> cmovnc %rax,%rcx > adcq $0,%rax
> addq %rcx,%r8 > addq (%rcx,%rax,8),%r8
> movq %r8,8(%rdx) > movq %r8,8(%rdx)
> adcq $0,%r9 > adcq $0,%r9
> movq %r9,16(%rdx) > movq %r9,16(%rdx)
> adcq $0,%r10 > adcq $0,%r10
> movq %r10,24(%rdx) > movq %r10,24(%rdx)
> adcq $0,%r11 > adcq $0,%r11
> movq %r11,32(%rdx) > movq %r11,32(%rdx)
> ret > ret
Figure 5.1. Steps 4-9 of Algorithm 5.1(b) for executing modular addition using IR, where p = 2^256 − 189. The conditional
branch is replaced by (a) a cmov instruction (initial values %rax=0, %rcx=189) and (b) a look-up table using indexed
indirect addressing mode (preset values %rax=0, (%rcx)=0, 8(%rcx)=189). The partial addition a + b from step 3 is
stored in registers %r8-%r11 and the final result is stored at x(%rdx). The x86-64 assembly code uses AT&T syntax.
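In a high-level language the same predication idea, selecting the correction value 0 or c arithmetically from the carry bit rather than branching on it, looks as follows (sketch only; real implementations perform this in assembly, as in Figure 5.1):

```python
# Branch-free analogue of Figure 5.1: the correction value (0 or c = 189) is
# selected from the carry bit by table lookup (Fig. 5.1(b) style) instead of
# via a poorly predicted conditional branch.
M, C = 256, 189
MASK = (1 << M) - 1
TABLE = (0, C)                        # two-entry look-up table, indexed by carry

def add_ir_branchless(a, b):
    r = a + b
    carry = r >> M                    # 0 or 1: the carry flag after step 3
    return (r & MASK) + TABLE[carry]  # no data-dependent branch is executed
```

The cmov variant of Figure 5.1(a) is the same selection performed in a register instead of through memory.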
Table 5.2. Cost (in cycles) of modular operations without conditional branches (w/o CB) against
operations using conditional branches (with CB); p = 2^256 − 189.
As can be seen in Table 5.2, removing conditional branches reduces the cost of modular
operations, partly due to the fact that modular operations exploiting IR allow very compact
implementations that are even easier to schedule efficiently when branches are removed. It is
also interesting to note that, when comparing the performance of the Core 2 Duo and the
Opteron, gains are higher for the former processor, which has more pipeline stages. Roughly
speaking, the gain obtained by eliminating (poorly predictable) CBs on these architectures grows
proportionally with the number of stages in the pipeline. In contrast, the gains on the Intel Atom
are significantly smaller since pipelined execution and instruction-level parallelism (ILP) on this
in-order processor are much less efficient and, hence, the relative cost of the misprediction
penalty is reduced.
Following the conclusions above, we have implemented the ECC point formulas such that the
gain obtained by combining IR and the elimination of CBs is maximal. The reader is referred to
Appendix B1 for details about the cost of point formulas in terms of field operations when using
J and E/E^e coordinates.
Next, we evaluate the cost of point doubling and doubling-addition (using Jacobian
coordinates) when their “small” field operations are implemented with complete or incomplete
reduction and with or without conditional branches. For the analysis, we use the revised doubling
formula (5.2), Section 5.4, and the doubling-addition formula introduced in [Lon07, formula
(3.5), Section 3.2]. The results are shown in Table 5.3.
As can be seen, the computing costs of point doubling and doubling-addition on the AMD
processor are reduced by 12% and 9%, respectively, by combining the elimination of conditional
branches with the use of incomplete reduction. Without taking into account precomputation and
the final inversion to convert to affine coordinates, these reductions represent about 11% of the
computing cost of point multiplication. A similar figure is observed for the Intel Core 2 Duo, on
which doubling
Table 5.3. Cost (in cycles) of point operations with Jacobian coordinates when using incomplete
reduction (IR) or complete reduction (CR) and with or without conditional branches (CB);
p = 2^256 − 189.
and doubling-addition are reduced by approximately 11% and 8%, respectively. These savings
represent a reduction of about 10% in the cost of point multiplication (again, without considering
precomputation and the final inversion). In contrast, following previous observations (see Table
5.2), the techniques are less effective on architectures such as the Intel Atom, where ILP is less
powerful and the branch misprediction penalty is relatively less expensive. In this case, the cost
reduction of point multiplication is only about 3%.
We remark that the algorithms discussed in this section, which combine completely and
incompletely reduced numbers, are generic and can be applied to different platforms. Also, the
gain obtained by eliminating conditional branches is strongly tied to the pipeline length, so in
general it is expected to provide a performance improvement on any architecture with a high
number of pipeline stages, such as most AMD and Intel processors.
Definition 5.1. Let i and j be the computer orders of instructions I_i and I_j in a given program
flow. We say that instruction I_j depends on instruction I_i if i < j and

[W(I_i) ∩ R(I_j)] ∪ [R(I_i) ∩ W(I_j)] ∪ [W(I_i) ∩ W(I_j)] ≠ ∅,

where R(I_x) is the set of memory locations or registers that are read by I_x and W(I_x) is the set
of memory locations or registers written by I_x.
Modern out-of-order processors and compilers deal relatively well with anti-dependencies
(R(I_i) ∩ W(I_j) ≠ ∅, i.e., I_i reads a location later updated by I_j) and output dependencies
(W(I_i) ∩ W(I_j) ≠ ∅, i.e., both I_i and I_j write to the same location) through register renaming.
However, true or read-after-write (RAW) dependencies (W(I_i) ∩ R(I_j) ≠ ∅, i.e., I_j reads
something written by I_i) cannot be removed in the strict sense of the term and are more
dangerous to the performance of architectures exploiting ILP.
Corollary 5.1. Let I_i and I_j be write and read instructions, respectively, holding a true data
dependence, i.e., W(I_i) ∩ R(I_j) ≠ ∅, where i < j and I_i and I_j are scheduled for execution at
the i-th and j-th cycle, respectively, on a non-superscalar pipelined architecture. Then, if
ρ = j − i < δ_write, the pipeline is stalled for at least (δ_write − ρ) cycles, where δ_write is the
number of cycles required by the write instruction I_i to complete its pipeline latency after
instruction fetch.
> addq %rcx,%r8
> movq %r8,8(%rdx)
> adcq $0,%r9
> movq %r9,16(%rdx)
> adcq $0,%r10
> movq %r10,24(%rdx)
> adcq $0,%r11
> Add(op1,op2,res1) > movq %r11,32(%rdx)
> Add(res1,op3,res2) > xorq %rax,%rax
> movq $0xBD,%rcx
> movq 8(%rdi),%r8
> addq 8(%rsi),%r8
> movq 16(%rdi),%r9
> adcq 16(%rsi),%r9
> movq 24(%rdi),%r10
> adcq 24(%rsi),%r10
> movq 32(%rdi),%r11
> adcq 32(%rsi),%r11
Figure 5.2. Field additions with RAW dependencies on an x86-64 CPU (p = 2^256 − 189). High-level field operations
are in the left column and the low-level assembly instructions corresponding to each field operation are to the right.
Destination x(%rdx) of the first field addition = source x(%rdi) of the second field addition. RAW dependencies are
indicated by arrows.
As can be seen in Figure 5.2, results stored to memory in the last stage of the first addition
are read at the beginning of the second addition. First, if a compiler or out-of-order scheduler is
unable to identify the common addresses, then it will not be able to exploit rescheduling to
prevent pipeline stalls due to inter-field-operation dependencies. Moreover, four consecutive
writes to memory and then four consecutive reads need to be performed because operands
are 256 bits long, distributed over four 64-bit registers. This obviously complicates the extraction
of any benefit from data forwarding. If δ_write > ρ_x for at least one of the dependences x indicated
by arrows, then the pipeline is expected to stall for at least (δ_write − ρ_x) cycles. Thus, for the
write/read sequence in Figure 5.2, the pipeline is roughly stalled by max(δ_write − ρ_x) cycles for
0 ≤ x < 4.
Definition 5.2. Two field operations OP_i(op_m, op_n, res_p) and OP_j(op_r, op_s, res_t) are said to be
data dependent at the field arithmetic level if i < j and res_p = op_r or res_p = op_s, where OP_i
and OP_j denote the field operations performed at the i-th and j-th positions during a program
execution, and op and res are registers holding the inputs and result, respectively. This is
called a contiguous data dependence in the field arithmetic if j − i = 1, i.e., if OP_i and OP_j are
consecutive in the execution sequence. When understood from the context, we refer to these
dependencies at the field arithmetic level simply as contiguous data dependencies, for brevity.
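Definition 5.2 can be turned into a small checker that counts contiguous data dependencies in a sequence of field operations (a sketch; the operation and register names are illustrative):

```python
# Checker for Definition 5.2: count contiguous data dependencies in a
# sequence of field operations given as (inputs, output) register-name pairs,
# listed in execution order.
def contiguous_dependencies(ops):
    """Count pairs of consecutive operations where the result of OP_i is an
    input of OP_{i+1}."""
    return sum(1 for prev, cur in zip(ops, ops[1:]) if prev[1] in cur[0])
```

Such a checker is handy when hand-scheduling point formulas: a candidate ordering of the field operations can be scored automatically.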
For the applications targeted in this work, all field operations follow a writing/reading
pattern similar to the one shown in Figure 5.2 and, hence, two contiguous, data-dependent field
operations hold several data dependencies x between their internal write/read instructions.
Following Definition 5.2 and Corollary 5.1, contiguous data dependencies pose a problem when
δ_write > ρ_x in a given program execution, in which case the pipeline is stalled by roughly
max(δ_write − ρ_x) cycles over all dependencies x. Note that with fewer dependent write/read
instruction pairs (i.e., at smaller field sizes), the expression max(δ_write − ρ_x) grows, and so does
the number of potentially stalled cycles. Similarly, at larger computer wordlengths w the value
max(δ_write − ρ_x) is expected to increase, worsening the effect of contiguous data dependencies.
For instance, neglecting other architectural factors and assuming a fixed pipeline length, these
dependencies are expected to affect performance more dramatically on 64-bit architectures than
on 32-bit architectures.
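The stall estimate max(δ_write − ρ_x) can be captured by a toy model (illustrative only; real penalties depend on forwarding paths and the specific microarchitecture):

```python
# Toy model of Corollary 5.1 generalized to several write/read pairs: each
# RAW dependence x with distance rho_x = j - i stalls the pipeline by about
# (delta_write - rho_x) cycles when rho_x < delta_write, and the sequence
# pays roughly the worst such stall.
def estimated_stall(dependences, delta_write):
    """dependences: (i, j) cycle positions of each dependent write/read pair."""
    stalls = [delta_write - (j - i) for (i, j) in dependences
              if j - i < delta_write]
    return max(stalls) if stalls else 0
```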
Closely following the analysis above, we propose three techniques that help reduce the
number of contiguous data dependencies, and we study several practical scenarios in which this
allows us to improve the execution performance of point multiplication. As a side effect, our
techniques also reduce the number of function calls and memory accesses. The reader should
note that these additional benefits are processor-independent.
For Sequence 1, which exhibits no contiguous data dependencies, the execution reaches its
maximal performance with an average of 110 cycles per multiplication because for any pair of
data-dependent multiplications we have ρ_x >> δ_write. In contrast, Sequence 2 is highly
dependent because each output is required as input by the following operation. In this case,
δ_write > ρ_x for at least one dependence x. This is the worst-case scenario, with an average of
128 cycles per multiplication, which is about 14% less efficient than the "ideal" case. We have
also studied other possible arrangements, such as Sequence 3, in which the operands of
Sequence 2 have been reordered. This slightly amortizes the impact of contiguous data
dependencies because ρ_x is increased, improving the performance to 125 cycles per multiplication.
Table 5.4. Various sequences of field operations with different levels of contiguous data
dependence.
Similarly, we have also tested the effect of contiguous data dependencies on other field
operations. In Table 5.5, we summarize the most representative field operation "arrangements"
and their costs. As can be seen, the reductions in cost obtained by switching from an execution
with strong contiguous data dependence (worst-case scenario with Sequence 2) to an execution
with no contiguous data dependencies (best-case scenario with Sequence 1) range from
approximately 9% up to 33% on an Intel Core 2 Duo. Similar results were observed for the
targeted AMD Opteron and Phenom II processors, where the high performance of their
architectures significantly reduces the relative distances ρ_x between data-dependent write/read
instructions, increasing the value max(δ_write − ρ_x). Thus, minimizing contiguous data
dependencies is expected to improve the execution of point multiplication on all these processors.
In contrast, Sequence 1 and Sequence 2 perform similarly on processors such as the Intel Atom,
whose much less powerful architecture tends to increase the values ρ_x such that δ_write < ρ_x for
all dependencies x.

Table 5.5. Average cost (in cycles) of modular operations using best-case (no contiguous data
dependencies, Sequence 1) and worst-case (strong contiguous data dependence, Sequence 2)
"arrangements" (p = 2^256 − 189, on a 2.66GHz Intel Core 2 Duo E6750).

Operation                      Sequence 1   Sequence 2   Cost reduction
Subtraction                        21           23             9%
Addition with IR                   20           24            17%
Multiplication by 2 with IR        19           23            17%
Multiplication by 3 with IR        28           34            18%
Division by 2 with IR              20           30            33%
Squaring                          101          113            11%
Multiplication                    110          128            14%
This technique complements and increases the gain obtained by scheduling field operations. As
expected, in some cases it is not possible to eliminate all contiguous data dependencies within a
point formula. A clever way to increase the chances of eliminating more of these dependencies is
to "merge" successive point operations into unified functions.
For example, let us consider the following sequence of field operations for computing a point
doubling using Jacobian coordinates, 2(X1 : Y1 : Z1) → (X1 : Y1 : Z1), where DblSub(b, c, a)
represents the operation a ← b − 2c (mod p) (see Section 5.3.3):
In total, there are five contiguous data dependencies between field operations (denoted by
"•") in the sequence above. Note that the last stage accounts for most of the dependencies, which
are very difficult to eliminate. However, if another point doubling follows, one can merge both
successive operations and thereby reduce the number of contiguous data-dependent operations.
Consider, for example, the following arrangement of two consecutive doublings:
As can be seen, the sequence above (instructions from the second doubling are in bold)
allows us to further reduce the number of dependencies from five to only two.
In ECC implementations, it appears natural to merge successive doubling operations, or a
doubling and an addition. Efficient elliptic curve point multiplications kP use NAF in
combination with some windowing strategy to recode the scalar k (see Section 2.2.4.3). For
instance, wNAF guarantees at least w successive doublings between point additions. Also, one
can exploit the efficient doubling-addition operation by [Lon07] for Jacobian coordinates, or the
combined (dedicated) doubling-(dedicated) addition by [HWC+08] for mixed Twisted Edwards
homogeneous/extended homogeneous coordinates (see Table 2.4). Hence, an efficient solution
for these systems is to merge (w − 1) consecutive doublings (for an optimal choice of w) in a
separate function and to merge each addition with the preceding doubling in another function. On
the other hand, if an efficient doubling-addition formula is not available for a certain setting, it
is suggested to merge w consecutive doublings in one function and keep the addition in a
separate function. Note that for different coordinate systems, curve forms or point multiplication
methods, the optimal merging strategy may vary or include different operations.
Remarkably, a side effect of this technique is that the number of function calls to point
formulas is also reduced.
This technique consists of merging various field operations with common operands and
implementing them in a joint function. There are two scenarios where this approach becomes
attractive:
• Operands are required by more than one field operation: merging reduces the number of
memory reads/writes.
We remark that the feasibility of merging certain field operations strictly depends on the
chosen platform and the number of general-purpose registers available to the programmer/
compiler. Also, before deciding on a merging option, implementers should analyze and test the
increase in code size and how this affects, for example, the performance of the cache.
Accordingly, in the setting of ECC over prime fields, it is not recommended to merge
multiplication and squaring with other operations if multiple functions containing these
operations are necessary, since the code increase could potentially degrade cache performance.
Example 5.1. Taking into account the considerations above, the following merged field
operations can be advantageous on x86-64-based processors using J and E/E^e coordinates:
a − 2b (mod p), a + a + a (mod p), and the merging of a − b (mod p) and (a − b) − 2c (mod p).
We remark that the list in the example above is not exhaustive. Different platforms with more
registers may enable a much wider range of merging options. Also, other possibilities for
merging could be available for different coordinate systems and/or underlying fields (for
instance, see Section 5.5.2 for the merging options suggested for ECC implementations over
quadratic extension fields).
To illustrate the impact of scheduling field operations, merging point operations and merging
field operations, we show in Table 5.6 the cost of point doubling using Jacobian coordinates
when using these techniques, in comparison with a naïve implementation with a high number of
dependencies. As can be seen, by reducing the number of dependencies from ten to about one per
doubling, minimizing function calls and reducing the number of memory reads/writes, we are
able to reduce the cost of a doubling by 12% and 8% on the Intel Core 2 Duo and AMD Opteron
processors, respectively. It is also important to note that on a processor such as the AMD
Opteron, which has a shorter pipeline and consequently loses less to contiguous data
dependencies (smaller δ_write with roughly the same values ρ_x as the Intel Core 2 Duo), the
estimated gain obtained with these techniques for point multiplication is lower (5%) than on the
Intel processor (9%). Finally, following our analysis in previous sections, the Intel Atom only
obtains a very small improvement in this case because contiguous data dependencies do not
affect its execution performance significantly (see Section 5.3.1).
Table 5.6. Cost (in cycles) of point doubling using Jacobian coordinates with different numbers of
contiguous data dependencies and the corresponding reduction in the cost of point multiplication.
"Unscheduled" refers to implementations with a high number of dependencies (here, 10
dependencies per doubling). "Scheduled and merged" refers to implementations optimized
through the scheduling of field operations, merging of point operations and merging of field
operations (here, 1.25 dependencies per doubling); p = 2^256 − 189.
The reader is referred to Appendix B1 for the explicit formulas optimized by scheduling or
merging field operations and merging point operations for the case of J and E/E^e coordinates.
X3 = α² − 2β,   Y3 = α·(β − X3) − Y1⁴,   Z3 = Y1·Z1,                    (5.2)

where α = 3(X1 + Z1²)(X1 − Z1²)/2 and β = X1·Y1². With formula (5.2), the operation count is
reduced to 4Mul + 4Sqr + 1Add + 5Sub + 1Mul3 + 1Div2, replacing two multiplications by 2 with
one subtraction. Moreover, because the number of constants is minimized, there are greater
chances that more "small" operations can be executed using incomplete reduction. In Algorithm
5.4, we show an efficient implementation of point doubling (5.2) with optimal use of incomplete
reduction (every addition and multiplication/division by a constant precedes a multiplication or
squaring), a minimized number of contiguous data dependencies between field operations, and
use of merged field operations. This execution costs 4Mul + 4Sqr + 1Add_IR + 3Sub + 1DblSub +
1Mul3_IR + 1Div2_IR (where operation_IR denotes an operation using incomplete reduction) and
has 5 contiguous data dependencies. In Algorithm 5.4, the operators ⊕, ⊗ and ⊘ represent
addition, multiplication by a constant and division by a constant using incomplete reduction,
respectively. These operations are computed with Algorithm 5.1(b) for addition and
multiplication by 3, and with Algorithm 5.3(b) for division by 2 (see Section 5.2.1 for details).
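The doubling with α = 3(X1 + Z1²)(X1 − Z1²)/2 and β = X1·Y1² can be sketched in Python and checked against the affine doubling law for a = −3 curves (illustration only; a real implementation would use the scheduled, IR-aware field operations described in this section):

```python
# Sketch of the revised Jacobian doubling (5.2) for a = -3 curves
# (y^2 = x^3 - 3x + b): cost 4Mul + 4Sqr plus small operations.
P = 2**256 - 189

def div2(a):
    # division by 2 mod p (Algorithm 5.3 style: make the operand even, shift)
    return a >> 1 if a % 2 == 0 else (a + P) >> 1

def dbl_jacobian(X1, Y1, Z1):
    zz = Z1 * Z1 % P
    alpha = div2(3 * (X1 + zz) * (X1 - zz) % P)   # Mul3, Add, Sub, Mul, Div2
    yy = Y1 * Y1 % P
    beta = X1 * yy % P
    X3 = (alpha * alpha - 2 * beta) % P            # DblSub pattern
    Y3 = (alpha * (beta - X3) - yy * yy) % P       # Y1^4 = (Y1^2)^2
    Z3 = Y1 * Z1 % P
    return X3, Y3, Z3
```

Note that the choice of a point (x, y) with y ≠ 0 implicitly fixes b = y² − x³ + 3x, so the formula can be validated without searching for curve points.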
In certain formulas, another optimization is possible. If 1Mul − 1Sqr > 4Add and the values
a² and b² are available, one can compute a·b as ((a + b)² − a² − b²)/2. See, for example, the
addition and doubling-addition formulas, option 1, of the online database EPAF [Lon08].
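The identity a·b = ((a + b)² − a² − b²)/2 is easy to sanity-check (a sketch; here the division by 2 is performed with a precomputed modular inverse for brevity, whereas an implementation would use the Div2 routine):

```python
# Trading a multiplication for a squaring when a^2 and b^2 are already
# available: a*b = ((a + b)^2 - a^2 - b^2)/2 (mod p).
P = 2**256 - 189
INV2 = pow(2, P - 2, P)   # 1/2 mod p (p is odd)

def mul_from_squares(a, b, a2, b2):
    # a2 and b2 are the cached squares a^2 mod p and b^2 mod p
    return ((a + b) * (a + b) - a2 - b2) * INV2 % P
```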
We remark that the optimizations above are not limited to 64-bit architectures and that they are
in general advantageous on any platform whenever division by 2 is approximately as efficient as
field addition.
Finally, we observe that in some settings field subtraction is more efficient than addition with
complete reduction (see, for example, Table 5.2 when using a pseudo-Mersenne prime). Thus,

¹ Mul = multiplication, Sqr = squaring, Add = addition, Sub = subtraction, Mulx = multiplication
by x, Divx = division by x.
whenever possible, one can convert those additions that cannot exploit IR into subtractions. For
this purpose, one applies λ = −1 ∈ Fp* to the corresponding formula.
As described in Section 2.2.6.1, each Fp² operation consists of a few field operations over Fp.
Thus, the analysis of data dependencies and the scheduling of operations should be performed
taking this underlying layer into account. For instance, let us consider the execution of an Fp²
multiplication followed by a subtraction, shown in Figure 5.3. Note that the multiplication is
implemented using Karatsuba with 3 Fp multiplications and 5 Fp additions/subtractions.
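A Karatsuba-style Fp² multiplication with three Fp multiplications can be sketched as follows, assuming i² = −1 (which holds whenever p ≡ 3 (mod 4); the 256-bit prime of this chapter is used purely for illustration, while GLS actually works over a quadratic extension of a half-size field):

```python
# Karatsuba-style multiplication in Fp2 = Fp(i) with i^2 = -1: three Fp
# multiplications instead of four, at the price of a few extra additions.
P = 2**256 - 189

def mul_fp2(a, b):
    a0, a1 = a                          # a = a0 + a1*i, components in Fp
    b0, b1 = b
    t0 = a0 * b0 % P
    t1 = a1 * b1 % P
    t2 = (a0 + a1) * (b0 + b1) % P      # Karatsuba term
    return ((t0 - t1) % P, (t2 - t0 - t1) % P)
```

The exact arrangement of the additions/subtractions (e.g., the DblSub/Sub tail of Figure 5.3) is what the scheduling discussed next optimizes.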
As can be seen in Figure 5.3, the scheduling of the internal Fp operations of the Fp²
multiplication has been arranged in such a way that contiguous data dependencies between Fp
operations are minimal (there is only one dependency, between DblSub and Sub, in the last stage
of the multiplication). A similar analysis can be performed between contiguous higher-layer Fp²
operations. In Figure 5.3, the last Fp operation of the multiplication and the first Fp operation of
the subtraction hold a contiguous data dependence. There are different ways to eliminate this
problem. For example, it can be eliminated by rescheduling the Fp² subtraction and addition, as
shown in Figure 5.4(a). Note that the addition does not hold any dependence with the
multiplication or the subtraction, as required. Alternatively, if the internal Fp field operations of
the subtraction in Fp² are rescheduled, as shown in Figure 5.4(b), the contiguous data
dependence is also eliminated.
These strategies can be applied to point formulas to minimize the appearance of such
dependencies. The reader is referred to Appendix B2 for details about the scheduling of Fp²
operations suggested for point formulas using J and E/E^e coordinates.
In the case of the GLS method, merging point doublings is not as advantageous as in the
traditional scenario of ECC over Fp, because most contiguous data dependencies can be
eliminated by simply rescheduling field operations inside the point formulas using the techniques
from the previous subsection (see Appendix B2). Moreover, GLS employs point multiplication
techniques such as interleaving, which do not guarantee a long series of consecutive doublings
between additions. Nevertheless, the use of the merged doubling-addition operation (when
applicable), which is a recurrent operation in interleaving, is still advantageous.
On the other hand, merging field operations is more advantageous in this scenario than over
Fp. There are two reasons for this. First, arithmetic over Fp² works on top of the arithmetic
over Fp, which opens new possibilities for merging more Fp operations. Second, operations
are over fields of half the size, which means that fewer registers are required for representing
field elements and more registers are available for holding intermediate operands.
> Add(op1[1],op1[2],t1)
> Add(op2[1],op2[2],t2)
> Mult(op1[2],op2[2],t3)
> Mult(t1,t2,res1[2])
> Mult(op1[2],op2[1],res1[1])
> DblSub(res1[2],res1[1],t3)
> Mult(op1,op2,res1) > Sub(res1[1],t3,res1[1])
> Sub(res1,op3,res2) > Sub(res1[1],op3[1],res2[1])
> Add(op4,op5,res3) > Sub(res1[2],op3[2],res2[2])
Figure 5.3. Fp² operations with contiguous data dependencies. High-level Fp² operations are in the left column and
their corresponding low-level Fp operations are in the right column. Fp² elements (a + bi) are represented as
(op[1], op[2]). Dependencies are indicated by arrows.
(a)
> Add(op1[1],op1[2],t1)
> Add(op2[1],op2[2],t2)
> Mult(op1[2],op2[2],t3)
> Mult(t1,t2,res1[2])
>
> DblSub(res1[2],res1[1],t3)
> Mult(op1,op2,res1) > Sub(res1[1],t3,res1[1])
> Add(op4,op5,res3) ...
> Sub(res1,op3,res2) > Sub(res1[1],op3[1],res2[1])
> Sub(res1[2],op3[2],res2[2])
(b)
> Add(op1[1],op1[2],t1)
> Add(op2[1],op2[2],t2)
> Mult(op1[2],op2[2],t3)
> Mult(t1,t2,res1[2])
> Mult(op1[2],op2[1],res1[1])
> DblSub(res1[2],res1[1],t3)
> Mult(op1,op2,res1) > Sub(res1[1],t3,res1[1])
> Sub(res1,op3,res2) > Sub(res1[2],op3[2],res2[2])
> Add(op4,op5,res3) > Sub(res1[1],op3[1],res2[1])
Figure 5.4. (a) Contiguous data dependencies eliminated by scheduling Fp² field operations; (b) contiguous data
dependencies eliminated by scheduling Fp field operations.
Example 5.2. The following merged field operations can be advantageous on x86-64-based
processors using J and E/E^e coordinates over quadratic extension fields: a − 2b (mod p),
(a + a + a)/2 (mod p), a + b − c (mod p), the merging of a + b (mod p) and a − b (mod p), the merging
of a − b (mod p) and c − d (mod p), and the merging of a + a (mod p) and a + a + a (mod p).
Again, we remark that the list above is not intended to be exhaustive and different merging
options could be more advantageous or be available on different platforms with different
coordinate systems or underlying fields. The reader is referred to Appendix B2 for the explicit
formulas optimized with the proposed techniques for the case of J and E / E e coordinates using
the GLS method.
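The flavor of two of the merged operations listed in Example 5.2 can be modeled in a few lines. The sketch below is illustrative Python (the thesis routines are x86-64 assembly; the function names are ours), and the corrections only mimic the spirit of the branch-elimination techniques of Section 5.2:

```python
# Illustrative Python models of two merged operations from Example 5.2 for
# p = 2^256 - 189 (the thesis routines are x86-64 assembly; names are ours).

P = 2**256 - 189

def sub_dbl_mod(a, b):
    """Merged a - 2b (mod p) for a, b in [0, p): one pass with at most two
    corrections, instead of two separate modular subtractions."""
    r = a - 2 * b                 # r lies in (-2p, p)
    r += P * (r < 0)              # branch-free style correction (bool as 0/1)
    r += P * (r < 0)
    return r

def triple_half_mod(a):
    """Merged (a + a + a)/2 (mod p) for a in [0, p): division by 2 becomes a
    shift, adding the odd modulus p first when the value is odd."""
    r = a + a + a                 # r < 3p
    if r >= P:
        r -= P
    if r >= P:
        r -= P
    return r >> 1 if r & 1 == 0 else (r + P) >> 1
```

Merging pays off because intermediate results stay in registers and a single correction pass replaces the per-operation reductions of a naive sequence.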
Field Arithmetic
As previously described, the field arithmetic over F_p using the pseudo-Mersenne prime p = 2^256 − 189 was written in x86-64 assembly language and optimized by exploiting incomplete reduction and elimination of conditional branches for modular addition, subtraction and multiplication/division by constants (see Section 5.2). For squaring and multiplication, two methods are commonly preferred in the literature for implementation on general-purpose processors: schoolbook (or operand-scanning) and Comba [Com90] (or product-scanning) (see Section 5.3 of [EYK09] or Section 2.2.2 of [HMV04]). Both methods require n² w-bit multiplications when multiplying two n-digit numbers. However, we chose to implement Comba's method since it requires approximately 3n² w-bit additions, whereas schoolbook requires 4n². Modular reduction for both operations exploits the fact that 2^256 ≡ 189 (mod p), so r ≡ (r mod 2^256) + 189·(r >> 256) (mod p), where r is the result of integer multiplication or squaring. Our code was aggressively optimized by carefully scheduling instructions to exploit instruction-level parallelism.
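The product-scanning pattern and the fold r ≡ (r mod 2^256) + 189·(r >> 256) can be sketched as follows (illustrative Python, not the thesis' assembly; names are ours):

```python
# Illustrative Python model: Comba product-scanning multiplication of
# 4x64-bit operands, followed by the pseudo-Mersenne folding for
# p = 2^256 - 189, using 2^256 = 189 (mod p).

W = 64                       # word size
MASK = (1 << W) - 1
P = 2**256 - 189

def comba_mult(a, b, n=4):
    """Product scanning: accumulate each column of partial products in turn."""
    A = [(a >> (W * i)) & MASK for i in range(n)]
    B = [(b >> (W * i)) & MASK for i in range(n)]
    acc, digits = 0, []
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += A[i] * B[k - i]          # all products of column k
        digits.append(acc & MASK)           # emit digit k of the result
        acc >>= W                           # carry into column k + 1
    digits.append(acc & MASK)
    return sum(d << (W * i) for i, d in enumerate(digits))

def reduce_p256189(r):
    """Reduce r < 2^512 modulo p = 2^256 - 189 by repeated folding."""
    while r >> 256:
        r = (r & (2**256 - 1)) + 189 * (r >> 256)
    return r - P if r >= P else r
```

A production version keeps the column accumulator in three machine words and handles carries explicitly; the big-integer model above only captures the column-by-column evaluation order of Comba's method.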
Point Arithmetic
For our implementations, we chose J and E/E^e coordinates and used the execution patterns based on doublings and doubling-additions proposed in [Lon07] and [HWC+08] for J and E/E^e, respectively. The costs in terms of multiplications and squarings can be found in Tables 2.2 and 2.4. Note that we use general additions (or general doubling-additions) because inversion is relatively expensive and its inclusion during precomputation would cancel any gain from using addition with mixed coordinates during the evaluation stage.
This arithmetic layer was optimized through the use of the techniques described in Sections 5.3 and 5.4, namely field arithmetic scheduling, merging of field and point operations and minimization of field operations. Because maximal performance was found with a window of size 5 for the scalar recoding using wNAF (see the next subsection), we merged four consecutive doublings into a joint function, and every addition with the preceding doubling into another function. Please refer to Appendix B1 for complete details about the employed formulas exhibiting a minimal number of field operations, different merged field operations and a reduced number of contiguous data dependencies.
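To convey the idea of merging consecutive point operations into one function, the sketch below chains two Jacobian doublings inside a single call, using the widely known a = −3 doubling formulas; this is only a Python model with our own names, not the scheduled, operation-minimal formulas of Appendix B1:

```python
def dbl_jac(X1, Y1, Z1, p):
    """One Jacobian doubling on y^2 = x^3 - 3x + b (a = -3 formulas)."""
    delta = Z1 * Z1 % p
    gamma = Y1 * Y1 % p
    beta = X1 * gamma % p
    alpha = 3 * (X1 - delta) * (X1 + delta) % p
    X3 = (alpha * alpha - 8 * beta) % p
    Z3 = ((Y1 + Z1) * (Y1 + Z1) - gamma - delta) % p
    Y3 = (alpha * (4 * beta - X3) - 8 * gamma * gamma) % p
    return X3, Y3, Z3

def quadruple_jac(X1, Y1, Z1, p):
    """'Merged' 4P = 2(2P): two chained doublings in a single function call,
    mirroring (only in spirit) the thesis' merged doubling functions."""
    return dbl_jac(*dbl_jac(X1, Y1, Z1, p), p)
```

In a merged function the intermediate (X, Y, Z) never leaves registers and the surrounding call/return overhead is paid once for the whole run of doublings.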
For scalar recoding we use wNAF, which offers the minimal nonzero density among signed binary representations for a given window width (i.e., for a certain number of precomputed points) [Ava05]. In particular, we use Alg. 3.35 of [HMV04] for conversion from integer to wNAF
representation. Although left-to-right conversion algorithms exist [Ava05], which save memory
and allow on-the-fly computation of point multiplication, they are not advantageous on the
targeted CPUs. In fact, our tests show that converting the scalar to wNAF and then executing the
point multiplication achieves higher performance than interleaving conversion and point
multiplication. That is because the latter approach “interrupts” the otherwise smooth flow of
point multiplication by calling the conversion function at every iteration of the double-and-add
algorithm. Our choice is also justified because there are no stringent constraints in terms of
memory in the targeted platforms.
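The recoding itself is short; a sketch in the style of Alg. 3.35 of [HMV04] (illustrative Python, names ours):

```python
# Sketch of width-w NAF recoding (illustrative Python; digit order is
# least-significant first).

def wnaf(k, w):
    """Width-w NAF of k > 0: nonzero digits are odd, lie in (-2^(w-1), 2^(w-1)),
    and any two nonzero digits are separated by at least w - 1 zeros."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)                # k mods 2^w: signed residue
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d                          # now k is divisible by 2
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits
```

For w = 5 (the window found optimal above), the nonzero digits select among the precomputed odd multiples ±P, ±3P, …, ±15P.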
For precomputation on J coordinates, we choose the variant of the LM scheme that does not require inversions, whose cost is given by formula (3.4) (Section 3.2.2). This method achieves the lowest cost for precomputing points, given by (6L + 2)M + (3L + 4)S, where L represents the number of non-trivial points (note that we avoid here the S−M trading in the first doubling). On E/E^e, we precompute points in the traditional way using the sequence P + 2P + 2P + … + 2P, adding 2P with general additions. Because precomputed points are left in projective form, no inversion is required and the cost is given by (8L + 4)M + 2S. This involves computing 2P as 2A → E^e, which costs 5M + 2S (one squaring is saved because Z_P = 1; one extra multiplication is required to compute the T coordinate of 2P), one mixed addition to compute P + 2P as A + E^e → E^e that costs 7M and (L − 1) general additions E^e + E^e → E^e that cost 8M each.
For this case we make use of the optimized assembly module for the field arithmetic over F_p² written by M. Scott [MIR], which exploits the Mersenne prime p = 2^127 − 1, allowing a very simple reduction step with no conditional branches.
For the point arithmetic, we slightly modify the formulas used in the "traditional" implementations, since in this case they require a few extra multiplications by the twisted curve parameter µ (see Section 2.2.6). For example, the (dedicated) addition using extended Twisted Edwards coordinates with cost 8M (p. 332 of [HWC+08]) cannot be used in this case and has to be replaced by a formula that costs 9M (also discussed on p. 332 of [HWC+08] as "9M+1D"), which is one multiplication more expensive ("1D" is avoided because parameter a is still set to −1). Accordingly (and also following our discussions in Sections 5.3.1 and 5.5.1), the scheduling
of the field arithmetic slightly differs. Moreover, different merging options for the field and point
arithmetic are exploited (see Section 5.5.2). The reader is referred to Appendix B2 for complete
details about the revised formulas exhibiting minimal number of field operations, different
merged operations and reduced number of contiguous data dependencies.
For point multiplication, each of the two scalars k₀ and k₁ in the multiple point multiplication k₀P + k₁(λP) is converted using fractional wNAF [Möl05], and then the evaluation stage is
executed using interleaving (see Alg. 3.51 of [HMV04]). Similarly to our experiments with the
“traditional” implementations, we remark that the separation of the conversion and evaluation
stages yields better performance in the targeted platforms.
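The interleaving strategy can be modeled abstractly. In the sketch below (illustrative Python, names ours) the "curve" is just the additive group of integers, so the shared-doubling structure of k₀P + k₁Q with per-scalar wNAF tables can be checked against ordinary arithmetic; the real implementation uses fractional windows and actual curve points:

```python
# Abstract model of interleaved double-and-add for k0*P + k1*Q (names ours).
# "Points" are plain integers under addition so results are checkable.

def wnaf(k, w):
    """Width-w NAF recoding helper (least-significant digit first)."""
    digits = []
    while k > 0:
        if k & 1:
            d = k % (1 << w)
            if d >= 1 << (w - 1):
                d -= 1 << w
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

def interleaved_mul(k0, k1, P, Q, w=4):
    """Evaluate k0*P + k1*Q with one shared doubling chain (interleaving)."""
    n0, n1 = wnaf(k0, w), wnaf(k1, w)
    tbl0 = {d: d * P for d in range(1, 1 << (w - 1), 2)}   # odd multiples of P
    tbl1 = {d: d * Q for d in range(1, 1 << (w - 1), 2)}   # odd multiples of Q
    acc = 0                                                # identity "point"
    for i in range(max(len(n0), len(n1)) - 1, -1, -1):
        acc = acc + acc                                    # one shared doubling
        if i < len(n0) and n0[i]:
            acc += tbl0[n0[i]] if n0[i] > 0 else -tbl0[-n0[i]]
        if i < len(n1) and n1[i]:
            acc += tbl1[n1[i]] if n1[i] > 0 else -tbl1[-n1[i]]
    return acc
```

The point of interleaving is visible in the loop: both scalars share the single doubling per iteration, so the doubling cost is paid once instead of twice.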
For precomputation on J, we use the LM scheme that has minimal cost among methods using only one inversion. The cost in this case is given by eq. (3.6). We avoid here the S−M trading in the first doubling, so the precomputation cost is 1I + (9L + 1)M + (2L + 5)S, where L represents the number of non-trivial points. A fractional window with L = 6 achieves optimal performance in our case.
Again, on E/E^e coordinates we precompute points using general additions in the sequence P + 2P + … + 2P. Precomputed points are better left in projective coordinates, in which case the cost is given by (9L + 4)M + 2S. This cost involves the computation of 2P as 2A → E^e, which costs 5M + 2S (one squaring is saved because Z_P = 1; one extra multiplication is required to compute the T coordinate of 2P), one mixed addition to compute P + 2P as A + E^e → E^e that costs 8M and (L − 1) general additions E^e + E^e → E^e that cost 9M each. In this case, an integral
Next, we detail the curves used for our implementations. These curves provide approximately 128 bits of security and were found with a modified version of Schoof's algorithm provided with MIRACL.
• For the implementation on the short Weierstrass form over F_p using J, we chose the curve E_W: y² = x³ − 3x + B, where p = 2^256 − 189, B = 0xfd63c3319814da55e88e9328e96273c483dca6cc84df53ec8d91b1b3e0237064 and #E_W(F_p) = 10r, where r is the 253-bit prime:
11579208923731619542357098500868790785394551372836712768287417232790500318517.
The implementation corresponding to this curve is referred to as jac256189 in the remainder.
• For Twisted Edwards over F_p using E/E^e coordinates, we chose the curve E_TE: −x² + y² = 1 + 358x²y², where p = 2^256 − 189 and #E_TE(F_p) = 4r, where r is the 255-bit prime:
28948022309329048855892746252171976963381653644566793329716531190136815607949.
The implementation corresponding to this curve is referred to as ted256189 in the remainder.
• Let E_TE−GLS: −x² + y² = 1 + 109x²y² be defined over F_p, where p = 2^127 − 1. For the case of Twisted Edwards using the GLS method, we use the quadratic twist E′_TE−GLS: −µx² + y² = 1 + 109µx²y² of E_TE−GLS/F_p², where µ = 2 + i ∈ F_p² is non-square. In this case, #E′_TE−GLS(F_p²) = 4r, where r is the 252-bit prime:
7237005577332262213973186563042994240709941236554960197665975021634500559269.
The implementation corresponding to this curve is referred to as ted1271gls in the remainder.
5.6.4. Timings
The fastest implementations using E/E^e and J coordinates are ted256189 and jac256189, respectively. In this case, ted256189 and jac256189 are 22% and 28% faster than the previously best cycle counts, due to Hisil et al. [HWC+09] also using E/E^e and J coordinates, respectively, on a 2.66GHz Intel Core 2 Duo.
It is also interesting to note that the performance boost given by the GLS method strongly depends on the characteristics of a given platform. For instance, ted1271gls and jac1271gls are about 40% and 45% faster than their "counterparts" over F_p, namely ted256189 and jac256189, respectively, on an Intel Atom N450. On an Intel Core 2 Duo E6750, the differences reduce to 25% and 32%, respectively. And on an AMD Opteron processor, the differences reduce even further to only 9% and 13%, respectively. Thus, there seems to be a certain correlation between an architecture's "aggressiveness" in scheduling operations and exploiting ILP and the gap between the costs of F_p and F_p² operations on x86-64 based processors. In general, the greater such "aggressiveness", the smaller the F_p–F_p² gap. And since working on the quadratic extension involves a considerable increase in the number of multiplications and additions, GLS loses its attractiveness if such a gap is not large enough on a given platform. For the record, ted1271gls achieves the best cycle counts on an AMD Opteron, with an advantage of about 31% over the best previous result in the literature by [GT07b], and on an AMD Phenom II X4, with an advantage of about 29% over the closest result obtained by gls1271-ref4 [MIR].
For extended benchmark results and comparisons with other previous works on different 64-
bit processors, the reader is referred to our online database [Lon10].
5.7. Conclusions
In this chapter we have proposed and evaluated different techniques and optimizations to speed
up elliptic curve scalar multiplication over prime fields on the increasingly popular x86-64-based
processors. We have carefully studied the architecture of these processors and optimized the
arithmetic of elliptic curves at the different computational levels accordingly. Extensive tests
have been carried out on at least one x86-64 processor representative from the notebook/netbook,
desktop and server/workstation processor classes. Whenever relevant, we have also discussed the
extension of the analysis and optimizations to other microarchitectures.
After detailing in §5.1 some previous work and the general features of x86-64 processors that
are most relevant to this work, we studied the performance boost obtained when combining
incomplete reduction and elimination of conditional branches with the use of a highly-efficient
pseudo-Mersenne prime in §5.2. We provided explicit algorithms for performing different
variants of modular addition, subtraction, multiplication by constant and division by constant
with incompletely and completely reduced numbers. Our tests on the targeted platforms reveal
cost reductions as high as 9% and 12% in the computation of point doubling and doubling-
addition, respectively, when combining the techniques above. Overall, the cost reduction in a
Chapter 6
Efficient Techniques for Implementing Pairings in Software
In this chapter, we propose efficient methods and optimized explicit formulas that significantly speed up the computation of pairings on ordinary curves over prime fields. Our contributions can be summarized as follows:
This chapter is organized as follows. After discussing relevant previous work in §6.1, we
describe the generalization of lazy reduction to pairing-friendly tower fields in §6.2. In the same
section, we discuss how to optimize the implementation of tower field arithmetic when dealing
with both single- and double-precision operations, and illustrate the flexible methodology with
the popular tower F_p → F_p² → F_p⁶ → F_p¹². In §6.3, we present our optimizations to the curve
arithmetic in the Miller loop, including the application of lazy reduction. Then, in §6.4 we
describe our high-speed implementation of an optimal ate pairing on BN curves, carry out a
detailed operation count and compare our results with the previously best results in the literature.
We end this chapter with some conclusions in §6.5.
The technique of lazy reduction goes back to at least Crypto'98 [WD98], and has been advantageously exploited in many works in different scenarios [LH00, Ava04, Sco07]. According to [LH00], multiplication in F_p^k can be performed with k reductions modulo p when F_p^k is seen as a direct extension over F_p via an irreducible binomial. This improves on the traditional method, which requires k² reductions (or k(k + 1)/2 reductions when using Karatsuba multiplication). Lazy reduction was first employed in the context of pairing computation by [Sco07] to eliminate reductions in F_p² multiplication. Essentially, when using the Karatsuba method for multiplication in F_p², lazy reduction lowers the number of reductions from 3 to only 2. Note that these savings come at the cost of somewhat more expensive additions. If, for instance, one considers the tower F_p → F_p² → F_p⁶ → F_p¹², then this approach requires 3·6·3 = 54 integer multiplications with 2·6·3 = 36 reductions modulo p for performing one multiplication in F_p¹²; see [Sco07, HMS08, BGM+10].
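For concreteness, the F_p² case can be sketched in a few lines (illustrative Python, names ours): three integer multiplications, but only two reductions, because each coefficient is accumulated unreduced first.

```python
# Sketch of lazy reduction in F_p^2 = F_p[i]/(i^2 + 1) with Karatsuba:
# 3 integer multiplications, 2 reductions (instead of 3).

def fp2_mul_lazy(a, b, p):
    a0, a1 = a
    b0, b1 = b
    t0 = a0 * b0                   # double-precision products, no reduction
    t1 = a1 * b1
    t2 = (a0 + a1) * (b0 + b1)
    c0 = (t0 - t1) % p             # reduction 1: real part a0*b0 - a1*b1
    c1 = (t2 - t0 - t1) % p        # reduction 2: imaginary part a0*b1 + a1*b0
    return (c0, c1)
```

The additions and subtractions now operate on double-length values, which is the extra cost mentioned above.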
In this work we go a step further and generalize lazy reduction to the whole pairing
computation, including the tower extension and curve arithmetic. For instance, these
optimizations allow us to eliminate about 32% of reductions in a state-of-the-art implementation
of the optimal ate pairing using a BN curve with 128 bits of security; see Section 6.4.2 for
complete details.
Recently, many authors have targeted the efficient software implementation of bilinear pairings at the 128-bit security level. The most remarkable results include the computation of the R-ate pairing in 10,000,000 cycles on one core of an Intel Core 2 Duo processor by Hankerson et al. [HMS08], and the computation of the optimal ate pairing in 4,380,000 cycles on one core of an Intel Core 2 Quad Q9550 by Naehrig et al. [NNS10] and in 2,950,000 cycles on one core of an Intel Core 2 Duo T7100 by Beuchat et al. [BGM+10]. Beuchat et al. also report an optimal ate pairing computation in 2,330,000 cycles on one core of an Intel Core i7 860 processor.
In this work, to demonstrate the effectiveness of our optimizations, we realize a high-speed implementation of an optimal ate pairing at the 128-bit security level that additionally incorporates the latest advancements in the area, including software techniques by Beuchat et al. [BGM+10] to optimize carry handling and eliminate function call overheads in the F_p² arithmetic, and the use of efficient compressed squarings and decompression in cyclotomic subgroups to speed up computations in the final exponentiation by Karabina (see [Kar10] and also [AKL+10, Section 5.2]). We report a pairing computation in 2,194,000 cycles on one core of an Intel Core 2 Duo E6750 and in 1,688,000 cycles on an Intel Core i5 540M. Moreover, we also report a pairing computation in only 1,562,000 cycles (~0.5 ms) on an AMD Phenom II X4 940 processor. Taking into account timings on identical platforms, our results introduce improvements between 28% and 34% in comparison with the best previous results.
Lemma 6.1. A sum of products of the form ∑ ±aᵢ·bᵢ mod p, where aᵢ, bᵢ are elements in Montgomery representation, can be reduced with only one Montgomery reduction modulo p by accumulating inner products as double-precision integers, provided that ∑ ±aᵢ·bᵢ < 2^N·p, where N = n·w, n is the exact number of words required to represent p, i.e., n = ⌈log₂ p / w⌉, and w is the computer word-size.
Lemma 6.1 defines the basic lazy reduction technique in the context of Montgomery reduction. Readers should note that internal additions and subtractions with partial results r "slightly" outside the Montgomery reduction range [0, 2^N·p], i.e., 2^N·p ≤ r < 2^{N+1}·p, can be easily corrected at negligible cost by performing a subtraction with 2^N·p.
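A minimal model of Lemma 6.1 (illustrative Python; real code operates word-by-word rather than on full-width integers, and the helper names are ours — mont_redc is the textbook REDC):

```python
# Model of Lemma 6.1: a sum of products sum(+/- a_i*b_i), accumulated in
# double precision, needs a single Montgomery reduction, provided the
# accumulator stays in [0, 2^N * p). Operands are in Montgomery form, so the
# reduced result carries the usual extra factor 2^{-N} mod p.

def mont_redc(T, p, N):
    """Montgomery reduction REDC: returns T * 2^{-N} mod p for 0 <= T < 2^N * p."""
    R = 1 << N
    p_inv = pow(-p, -1, R)          # -p^{-1} mod 2^N (precomputable)
    m = (T * p_inv) % R
    t = (T + m * p) >> N            # exact division by 2^N
    return t - p if t >= p else t

def sum_of_products_reduced(pairs, signs, p, N):
    """One Montgomery reduction for sum(signs[i] * a_i * b_i) (Lemma 6.1)."""
    acc = sum(s * a * b for s, (a, b) in zip(signs, pairs))
    while acc < 0:                  # make non-negative by adding a multiple of p
        acc += (1 << N) * p
    assert acc < (1 << N) * p       # Lemma 6.1 bound on the accumulator
    return mont_redc(acc, p, N)
```

The single call to mont_redc replaces one reduction per product, which is where the savings of the generalized lazy reduction come from.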
Next, we present our main result applying lazy reduction to towering-friendly fields.
Proof. We prove this theorem in the wider context of generic tower extensions built with irreducible binomials. Let F_p^k be a direct extension of F_p, where k = ∏_{i=1}^{t} nᵢ, and let an element a ∈ F_p^k be represented in polynomial basis as a(X) = a₀ + a₁X + … + a_{k−1}X^{k−1}. Then one can use the following tower construction F_p^{n₀} = F_p → F_p^{n₁} → F_p^{n₁·n₂} → … → F_p^{n₁·n₂·…·n_t} = F_p^k to represent the extension field F_p^k s.t. F_p^{n₁} = F_p[u]/(u^{n₁} − β), F_p^{n₁·n₂} = F_p^{n₁}[v]/(v^{n₂} − ξ), …,
a·b = ∑_{m=0}^{n₁−1} [ ( ∑_{j=0}^{m} a_j b_{m−j} mod p ) + β ( ∑_{j=m+1}^{n₁−1} a_j b_{m−j+n₁} mod p ) ] u^m
    = ∑_{m=0}^{n₁−1} [ ( ∑_{j=0}^{m} a_j b_{m−j} + β ∑_{j=m+1}^{n₁−1} a_j b_{m−j+n₁} ) mod p ] u^m = ∑_{m=0}^{n₁−1} (c_m mod p) u^m,   (6.1)
It is important to note that there is no restriction in the selection of parameters for the
irreducible binomials (e.g., β and ξ in the proof above). However, for efficiency purposes one should select small parameters so that multiplications by them can be converted into a few additions and subtractions (see, for example, the chosen parameters in the illustrative tower in Section 6.2.2).
The next theorem extends our result to towering-friendly fields exploiting Karatsuba
multiplication.
Theorem 6.2. Multiplication in a towering-friendly field F_p^k built with irreducible binomials, where k = 2^d·3^e, d ≥ 1, e ≥ 0, can be computed with k reductions and 3^d·6^e multiplications.
v₀ = a₀·b₀, v₁ = a₁·b₁, v₂ = a₂·b₂,
c₀ = v₀ + β_{k_i}[(a₁ + a₂)(b₁ + b₂) − v₁ − v₂], c₁ = (a₀ + a₁)(b₀ + b₁) − v₀ − v₁ + β_{k_i}·v₂,
c₂ = (a₀ + a₂)(b₀ + b₂) − v₀ + v₁ − v₂, c = c₀ + c₁x + c₂x², (6.5)
Theorem 6.2 shows that lazy reduction can be combined with Karatsuba multiplication for efficient computation in tower extension fields. In fact, it is straightforward to generalize lazy reduction to any formula that involves only sums (or subtractions) of products of the form ∑ ±aᵢ·bᵢ, with aᵢ, bᵢ in an intermediate field of the tower, such as complex squaring or the asymmetric squaring formulas devised by Chung and Hasan [CH07].
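A sketch of one cubic layer with the formulas in (6.5) (illustrative Python, names ours; the base field is taken to be F_p itself and the binomial is x³ − B for a small constant B, so the result is directly checkable):

```python
# One cubic Karatsuba layer with delayed reduction, following (6.5):
# 6 base multiplications, 3 reductions (one per output coefficient).

def fp3_mul_lazy(a, b, B, p):
    """Multiply a, b in F_p[x]/(x^3 - B) with lazy reduction."""
    a0, a1, a2 = a
    b0, b1, b2 = b
    v0, v1, v2 = a0 * b0, a1 * b1, a2 * b2     # products 1-3
    r12 = (a1 + a2) * (b1 + b2)                # product 4
    r01 = (a0 + a1) * (b0 + b1)                # product 5
    r02 = (a0 + a2) * (b0 + b2)                # product 6
    c0 = v0 + B * (r12 - v1 - v2)              # mult. by small B: a few additions in practice
    c1 = r01 - v0 - v1 + B * v2
    c2 = r02 - v0 + v1 - v2
    return (c0 % p, c1 % p, c2 % p)            # one delayed reduction per coefficient
```

Nesting a quadratic version of the same idea d times and this cubic version e times gives the 3^d·6^e multiplication and k = 2^d·3^e reduction counts of Theorem 6.2.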
For efficiency purposes, we suggest a different treatment for the highest layer in the tower
arithmetic. The proof of Theorem 6.1 implies that reductions can be completely delayed to the end of the last layer by applying lazy reduction, but in some cases (when the optimal k is already reached and no reductions can be saved) it is more efficient to perform reductions right after multiplications or squarings. This will be illustrated later with the computation of squaring in F_p¹² in Section 6.2.2.
In summary, the generalized lazy reduction can be applied to every computation involving
operations in tower extension fields in the Miller loop and final exponentiation, including the
recently proposed compressed squarings by [Kar10] (see Appendix C1).
Remarkably, in the Miller loop, reductions can also be delayed from the underlying F_p² field during multiplication and squaring to the arithmetic layer immediately above, i.e., the point arithmetic and line evaluation. Similarly to the tower extension, reductions in this upper layer should only be delayed in cases where this technique leads to fewer reductions. For details, see Section 6.3.
There are some penalties when delaying reductions. In particular, single-precision operations (with operands occupying n = ⌈log₂ p / w⌉ words, where w is the computer word-size) are replaced by double-precision operations (with operands occupying 2n words). However, this disadvantage can be minimized in terms of speed by selecting a field size smaller than the word-size boundary, because then the technique can be exploited more extensively for optimizing double-precision arithmetic.
If the modulus p is selected such that l = ⌈log₂ p⌉ < N, where again N = n·w, n = ⌈l/w⌉ and w is the computer word-size, then several consecutive additions without carry-outs in the most significant word (MSW) can be performed before a multiplication of the form c = a·b, where a, b ∈ [0, 2^N − 1] s.t. c < 2^{2N}. In the case of Montgomery reduction, the restriction is given by the upper bound c < 2^N·p. Similarly, when delaying reductions, the result of a multiplication without reduction has maximum value (p − 1)² < 2^{2N} (assuming that a, b ∈ [0, p]), and several consecutive double-precision additions without carry-outs in the MSW (and, in some cases, subtractions without borrow-outs in the MSW) can be performed before reduction. When using Montgomery reduction, up to ∼2^N/p additions can be performed without carry checks.
Furthermore, cheaper single- and double-precision operations exploiting this "extra room" can be combined for maximal performance. The challenge is to optimally balance their use in the tower arithmetic, since both may interfere with each other. For instance, if intermediate values are allowed to grow up to 2p before multiplication (instead of p), then the maximum result would be 4p². This strategy makes use of cheaper single-precision additions without carry checks but limits the number of double-precision additions that can be executed without carry checks after
multiplication with delayed reduction. As will become evident later, to maximize the gain obtained with the proposed methodology one should take into account the relative costs of operations and the maximum bounds.
In the case of double-precision arithmetic, different optimizing alternatives are available. Let us analyze them in the context of Montgomery arithmetic. First, as pointed out by [BGM+10], if c > 2^N·p, where c is the result of a double-precision addition, then c can be restored with a cheaper single-precision subtraction of 2^N·p (note that the first half of this value consists of zeroes only). Second, different options are available to convert negative numbers to positive after a double-precision subtraction. In particular, let us consider the computation c = a + l·b, where a, b ∈ [0, mp²], m ∈ ℤ⁺ and l ∈ ℤ, l < 0, which is a recurrent operation (for instance, when l = β from Section 6.2.2). For this operation, we have explored the alternatives listed in Table 6.1, which can be integrated in the tower arithmetic with different advantages.
Table 6.1. Different options to convert negative results to positive after a subtraction of the form c = a + l·b, where a, b ∈ [0, mp²], m ∈ ℤ⁺ and l ∈ ℤ, l < 0, s.t. |l|mp < 2^N.
In particular, Options 2 and 4 in Table 6.1 require conditional checks that make the corresponding operations more expensive. Nevertheless, these options may be valuable when negative values cannot be corrected with other options without violating the upper bound. Also note that Option 2 can make use of a cheaper single-precision subtraction for converting negative results to positive. Options 1 and 3 are particularly efficient because no conditional checks are required. Moreover, if l is small enough (and h maximized for Option 1), several subsequent operations can avoid carry checks. Of the two, Option 1 is generally more efficient because adding 2^N·p/2^h requires less than double-precision arithmetic if h ≤ w, where w is the computer word-size.
Next, we demonstrate how the different design options discussed in this section can be
exploited with a clever selection of parameters and applied to different operations combining
single- and double-precision arithmetic to speed up the extension field arithmetic.
For our illustrative analysis, we use the tower F_p → F_p² → F_p⁶ → F_p¹² constructed as follows [PSN+10]:
• F_p² = F_p[i]/(i² − β), where β = −1.
• F_p⁶ = F_p²[v]/(v³ − ξ), where ξ = 1 + i.
• F_p¹² = F_p⁶[w]/(w² − v).
We use a similar tower construction for our illustrative implementation of the optimal ate pairing on a BN curve (see Section 6.4.1 for complete details).
When targeting the 128-bit security level, single- and double-precision operations are defined by operands with sizes N = 256 and 2N = 512, respectively. For our selected prime, ⌈log₂ p⌉ = 254 and 2^N ≈ 6.8p. We use the following notation [AKL+10]:
(i) +, −, × are operators not involving carry handling or modular reduction for boundary
keeping;
(ii) ⊕, , ⊗ are operators producing reduced results through carry handling or modular
reduction;
(iii) a superscript in an operator is used to denote the extension degree involved in the
operation;
(iv) notation a_{i,j} is used to address the j-th subfield element inside extension field element aᵢ;
(v) variables with lower-case t and upper-case T represent single- and double-precision integers or extension field elements composed of single- and double-precision integers, respectively.
For the remainder of the chapter, we assume that (except when explicitly stated) double-
precision addition has the cost of 2A and 2a in F_p and F_p², respectively, which approximately follows what we observe in practice.
Note that, as stated before, if c > 2^N·p after an addition c = a + b in double-precision, we correct the result by computing c − 2^N·p. By analogy with subtraction (see Table 6.1), we refer to the latter as "Option 2". The remaining references to "Option x" are taken from Table 6.1.
We will now illustrate a selection of operations for efficient multiplication in F_p¹², beginning with multiplication in F_p². Let a, b, c ∈ F_p² such that a = a₀ + a₁i, b = b₀ + b₁i, c = a·b = c₀ + c₁i. The required operations for computing F_p² multiplication are detailed in Algorithm 6.1. As explained in Beuchat et al. [BGM+10, Section 5.2], when using the Karatsuba method and aᵢ, bᵢ ∈ F_p, c₁ = (a₀ + a₁)(b₀ + b₁) − a₀b₀ − a₁b₁ = a₀b₁ + a₁b₀ < 2p² < 2^N·p; additions are single-precision, reduction after multiplication can be delayed and hence subtractions are double-precision (steps 1-3 in Algorithm 6.1). Obviously, these operations do not require carry checks. For c₀ = a₀b₀ − a₁b₁, c₀ lies in the interval [−p², p²] and a negative result can be converted to positive using Option 1 with h = 2 or Option 2, for which the final c₀ is in the range [0, (2^N·p/4) + p²] ⊂ [0, 2^N·p] or [0, 2^N·p], respectively (step 4 in Algorithm 6.1). Following Theorem 6.1, all reductions can be completely delayed to the next arithmetic layer (higher extension or curve arithmetic).
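A small numeric check of the Option 1 correction with h = 2 (illustrative Python; the modulus below is a stand-in 254-bit odd integer, not the BN prime, and the function name is ours):

```python
# Option 1 correction, h = 2: c0 = a0*b0 - a1*b1 lies in [-p^2, p^2];
# unconditionally adding the multiple of p given by 2^(N-h) * p makes it
# non-negative (since p < 2^(N-2)) while keeping it below 2^N * p, so no
# conditional check is needed and the value modulo p is unchanged.

def option1_correct(c, p, N, h=2):
    """Branch-free correction: add (2^N / 2^h) * p, a multiple of p."""
    return c + (1 << (N - h)) * p

p = (1 << 254) - 127        # stand-in odd 254-bit modulus
N = 256
a0, b0, a1, b1 = 1, 1, p - 1, p - 1
c0 = a0 * b0 - a1 * b1      # negative double-precision value
c0 = option1_correct(c0, p, N)
assert 0 <= c0 < (1 << N) * p              # stays inside the Montgomery range
assert c0 % p == (a0 * b0 - a1 * b1) % p   # value modulo p is unchanged
```

Since 2^{N−h}·p is a multiple of p whose low half is not all zeroes only for h ≤ w, the addition costs less than a full double-precision operation in practice.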
Let us now define multiplication in F_p⁶. Let a, b, c ∈ F_p⁶ such that a = (a₀ + a₁v + a₂v²), b = (b₀ + b₁v + b₂v²), c = a·b = (c₀ + c₁v + c₂v²). The required operations for computing F_p⁶ multiplication are detailed in Algorithm 6.2. In this case, c₀ = v₀ + ξ[(a₁ + a₂)(b₁ + b₂) − v₁ − v₂], c₁ = (a₀ + a₁)(b₀ + b₁) − v₀ − v₁ + ξv₂ and c₂ = (a₀ + a₂)(b₀ + b₂) − v₀ − v₂ + v₁, where v₀ = a₀b₀, v₁ = a₁b₁ and v₂ = a₂b₂. First, note that the pattern s_x = (aᵢ + a_j)(bᵢ + b_j) − vᵢ − v_j repeats for each c_x, 0 ≤ x ≤ 2. After multiplications using Algorithm 6.1 with Option 1 (h = 2), we have vᵢ,₀, v_j,₀ ∈ [0, (2^N·p/4) + p²] and vᵢ,₁, v_j,₁ ∈ [0, 2p²] (step 1 of Alg. 6.2). Outputs of single-precision additions of the forms (aᵢ + a_j) and (bᵢ + b_j) are in the range [0, 2p] and hence do not produce carries (steps 2, 9 and 17 of Alg. 6.2). The corresponding F_p² multiplications r_x = (aᵢ + a_j)(bᵢ + b_j)
using Algorithm 6.1 with Option 2 give results in the ranges r_x,₀ ∈ [0, 2^N·p] and r_x,₁ ∈ [0, 8p²] (steps 3, 10 and 18). Although max(r_x,₁) = 8p² > 2^N·p, note that 8p² < 2^{2N} and s_x,₁ = aᵢ,₀b_j,₁ + aᵢ,₁b_j,₀ + a_j,₀bᵢ,₁ + a_j,₁bᵢ,₀ ∈ [0, 4p²] since s_x = aᵢb_j + a_jbᵢ. Hence, for 0 ≤ x ≤ 2, double-precision subtractions for computing s_x,₁ using Karatsuba do not require carry checks (steps 4 and 6, 11 and 13, 19 and 21). For computing s_x,₀ = r_x,₀ − (vᵢ,₀ + v_j,₀), the addition does not require a carry check (output range [0, 2(2^N·p/4 + p²)] ⊂ [0, 2^N·p]) and the subtraction gives a result in the range [0, 2^N·p] when using Option 2 (steps 5, 12 and 20). For computing c₀, multiplication by ξ, i.e., S₀ = ξs₀, involves the operations S₀,₀ = s₀,₀ − s₀,₁ and S₀,₁ = s₀,₀ + s₀,₁,
which are computed in double-precision using Option 2 to get the output range [0, 2^N·p] (step 7). Similarly, the final additions with v₀ require Option 2 to get again the output range [0, 2^N·p] (step 8). For computing c₁, S₁ = ξv₂ is computed as S₁,₀ = v₂,₀ − v₂,₁ and S₁,₁ = v₂,₀ + v₂,₁, where the former requires a double-precision subtraction using Option 1 (h = 1) to get a result in the range [0, 2^N·p/2 + 2^N·p/4 + p²] ⊂ [0, 2^N·p] (step 14) and the latter requires a double-precision addition with no carry check to get a result in the range [0, (2^N·p/4) + 3p²] ⊂ [0, 2^N·p] (step 15). Then, c₁,₀ = s₁,₀ + S₁,₀ and c₁,₁ = s₁,₁ + S₁,₁ involve double-precision additions using Option 2 to obtain results in the range [0, 2^N·p] (step 16). Results c₂,₀ = s₂,₀ + v₁,₀ and c₂,₁ = s₂,₁ + v₁,₁ require a double-precision addition using Option 2 (final output range [0, 2^N·p], step 22) and a double-precision addition without carry check (final output range [0, 6p²] ⊂ [0, 2^N·p], step 23), respectively. Modular reductions have again been delayed to the last layer F_p¹².
Finally, let a, b, c ∈ F_p¹² such that a = a₀ + a₁w, b = b₀ + b₁w, c = a·b = c₀ + c₁w. Algorithm 6.3 details the required operations for computing multiplication in F_p¹². In this case, c₁ = (a₀ + a₁)(b₀ + b₁) − a₀b₀ − a₁b₁. At step 1, F_p⁶ multiplications a₀b₀ and a₁b₁ give outputs in the range ⊂ [0, 2^N·p] using Algorithm 6.2. Additions a₀ + a₁ and b₀ + b₁ are single-precision, reduced modulo p, so that the multiplication (a₀ + a₁)(b₀ + b₁) in step 2 gives an output in the range ⊂ [0, 2^N·p] using Algorithm 6.2. Then, subtractions by a₁b₁ and a₀b₀ use double-precision operations with Option 2 to have an output range [0, 2^N·p], so that we can apply Montgomery reduction at step 5 to obtain the result modulo p. For c₀ = a₀b₀ + va₁b₁, multiplication by v, i.e., T = v·v₁, where vᵢ = aᵢbᵢ, involves the double-precision operations T₀,₀ = v₂,₀ − v₂,₁, T₀,₁ = v₂,₀ + v₂,₁, T₁ = v₀ and T₂ = v₁, all performed with Option 2 to obtain the output range [0, 2^N·p] (steps 6-7). The final addi-
146
Chapter 6: Efficient Techniques for Implementing Pairings in Software
tion with a0b0 uses double-precision with Option 2 again so that we can apply Montgomery
reduction at step 9 to obtain the result modulo p. We remark that, by applying the lazy reduction
technique using the operation sequence above, we have reduced the number of reductions in
Fp 6 from 3 to only 2, or the number of total modular reductions in Fp from 54 (or 36 if lazy
reduction is employed in Fp 2 ) to only k = 12.
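The effect of delaying reductions can be sketched in a few lines. The Python fragment below models Fp2 = Fp[i]/(i^2 + 1) multiplication with plain integers and a hypothetical stand-in modulus (the actual implementation uses the 254-bit BN prime and Montgomery reduction, not `%`): Karatsuba with lazy reduction spends only 2 reductions per multiplication instead of 3, and multiplication by ξ = 1 + i costs only one addition and one subtraction:

```python
p = 2**61 - 1  # hypothetical stand-in modulus; the implementation uses a 254-bit BN prime

def fp2_mul_lazy(a, b):
    """Karatsuba multiplication in Fp2 = Fp[i]/(i^2 + 1) with lazy reduction:
    the three integer sub-products are kept unreduced (double precision) and
    only the two output coefficients are reduced -- 2 reductions, not 3."""
    a0, a1 = a
    b0, b1 = b
    v0 = a0 * b0                 # unreduced
    v1 = a1 * b1                 # unreduced
    v2 = (a0 + a1) * (b0 + b1)   # unreduced
    return (v0 - v1) % p, (v2 - v0 - v1) % p   # i^2 = -1 folds into a subtraction

def fp2_mul_by_xi(a):
    """Multiplication by xi = 1 + i: (a0 + a1*i)(1 + i) = (a0 - a1) + (a0 + a1)*i,
    i.e., one subtraction and one addition, with no multiplication at all."""
    a0, a1 = a
    return (a0 - a1) % p, (a0 + a1) % p
```

Both routines agree with schoolbook complex multiplication; in particular fp2_mul_by_xi(a) coincides with fp2_mul_lazy(a, (1, 1)).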
As previously stated, there are situations when it is more efficient to perform reductions right
after multiplications and squarings in the last arithmetic layer of the tower construction. We
illustrate the latter with squaring in Fp12 . As shown in Algorithm 6.4, a total of 2 reductions in
Fp6 are required when performing Fp6 multiplications in step 4. If lazy reduction were applied, the number of reductions would stay at 2 and, worse, the total cost would increase because some operations would require double-precision. The reader should note that the approach suggested by [PSN+10], where the formulas in [CH07] are employed for computing squarings in internal cubic extensions of Fp12, saves 1m in comparison with Algorithm 6.4. However, we experimented with this approach using several combinations of formulas and towerings, and it remained consistently slower than Algorithm 6.4 due to an increase in the number of additions.
Chapter 6: Efficient Techniques for Implementing Pairings in Software
formulas which have consecutive and redundant multiplications/squarings), there are a few
places where this technique can be applied. To be consistent with other results in the literature, we only assume that a double-precision addition costs 2A in Fp and 2a in Fp2 when applying the lazy reduction technique. When this technique is not applied, we do not distinguish between single- and double-precision additions.
The curve arithmetic in the Miller loop is traditionally performed using Jacobian coordinates
[HMS08, BGM+10]. Let the point T = ( X 1 , Y1 , Z1 ) ∈ E ′( Fp 2 ) be in Jacobian coordinates. The
point doubling computation 2T = ( X 3 , Y3 , Z 3 ) and evaluation of the arising line function l at
point P = ( xP , y P ) ∈ E ( Fp ) are traditionally performed with the following formulae [HMS08,
Section 2]:
An operation count of (6.6) reveals that this formula can be performed with
6m + 5 s + 11a + 4 M . We present the following revised formula that requires fewer Fp 2 additions:
X3 = 9X1^4/4 − 2X1Y1^2, Y3 = (3X1^2/2)·(X1Y1^2 − X3) − Y1^4, Z3 = Y1Z1,
l = Z3Z1^2·yP − (3X1^2Z1^2·xP)/2 + (3X1^3/2 − Y1^2). (6.7)
This doubling formula only requires 6m + 5s + 8a + 4M if computed as follows (x̄P = −xP is precomputed):
X3 = A^2 − D, E = C − X3, Y3 = A·E − B^2, Z3 = Y1·Z1, F = Z1^2,
Algorithm 6.5. Point doubling in Jacobian coordinates (cost of 6mu + 5su + 10r + 10a + 4M)
Input: T = (X1, Y1, Z1) ∈ E′(Fp2), P = (xP, yP) ∈ E(Fp) and x̄P = −xP
Output: 2T = (X3, Y3, Z3) ∈ E′(Fp2) and the tangent line l ∈ Fp12
1: t0 ← X1 ⊗ X1, t2 ← Z1 ⊗ Z1
2: t1 ← t0 ⊕ t0, Z3 ← Y1 ⊗ Z1
3: t0 ← t0 ⊕ t1, t3 ← Y1 ⊗ Y1
4: t0 ← t0/2
5: t1 ← t0 ⊗ t2, t4 ← t0 ⊗ X1
6: l1,0,0 ← t1,0 ⊗ x̄P, l1,0,1 ← t1,1 ⊗ x̄P, l1,1 ← t4 ⊖ t3, t2 ← Z3 ⊗ t2
7: t1 ← t3 ⊗ X1
8: l0,0,0 ← t2,0 ⊗ yP, l0,0,1 ← t2,1 ⊗ yP, Y1 ← t1 ⊕ t1, X1 ← t0 ⊗ t0
9: X3 ← X1 ⊖ Y1
10: t1 ← t1 ⊖ X3
11: T0 ← t3 ×2 t3, T1 ← t0 ×2 t1 (Option 1, h = 2)
12: T1 ← T1 ⊖ T0 (Option 2)
13: Y3 ← T1 mod 2p
14: Return 2T = (X3, Y3, Z3) and l = (l0, l1)
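As a sanity check, doubling formula (6.7) can be exercised over a toy curve. The following Python sketch uses hypothetical small parameters (p = 1019, b = 5) in place of the 254-bit BN setting, and verifies that the Jacobian output of (6.7) normalizes to the same affine point as textbook chord-tangent doubling:

```python
p, b = 1019, 5            # hypothetical toy parameters: E: y^2 = x^3 + b over Fp, p % 4 == 3
inv2 = pow(2, p - 2, p)   # 1/2 mod p

def find_point():
    # since p = 3 (mod 4), a square root of a residue rhs is rhs^((p+1)/4) mod p
    for x in range(1, p):
        rhs = (x**3 + b) % p
        y = pow(rhs, (p + 1) // 4, p)
        if y != 0 and y * y % p == rhs:
            return x, y

def dbl_jacobian_67(X1, Y1, Z1):
    # formula (6.7): X3 = 9X1^4/4 - 2X1Y1^2, Y3 = (3X1^2/2)(X1Y1^2 - X3) - Y1^4, Z3 = Y1Z1
    A = 3 * X1 * X1 * inv2 % p    # 3X1^2/2
    B = Y1 * Y1 % p               # Y1^2
    C = X1 * B % p                # X1*Y1^2
    X3 = (A * A - 2 * C) % p
    Y3 = (A * (C - X3) - B * B) % p
    return X3, Y3, Y1 * Z1 % p

def jacobian_to_affine(X, Y, Z):
    zi = pow(Z, p - 2, p)         # x = X/Z^2, y = Y/Z^3
    return X * zi * zi % p, Y * zi * zi * zi % p

def dbl_affine(x, y):
    lam = 3 * x * x * pow(2 * y, p - 2, p) % p   # tangent slope on y^2 = x^3 + b
    x3 = (lam * lam - 2 * x) % p
    return x3, (lam * (x - x3) - y) % p

x, y = find_point()
assert jacobian_to_affine(*dbl_jacobian_67(x, y, 1)) == dbl_affine(x, y)
```

Note that (6.7) outputs the standard Jacobian doubling scaled by λ = 1/2, i.e., (X3/4, Y3/8, Z3/2), which represents the same projective point; normalizing to affine makes the comparison scale-invariant.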
θ = Y2Z1^3 − Y1, λ = X2Z1^2 − X1,
l = Z3·yP + (θ·x̄P) + (θX2 − Y2Z3), (6.8)
that costs 10mu + 3su + 11r + 10a + 4M when exploiting lazy reduction (see Algorithm 6.6).
Costello et al. [CLN10, Section 5] proposed the use of homogeneous coordinates to perform the
curve arithmetic entirely on the twist. Their point doubling/line evaluation formula costs 2m + 7 s +
Algorithm 6.6. Point addition in Jacobian coordinates (cost of 10mu + 3su + 11r + 10a + 4M)
Input: T = (X1, Y1, Z1) and R = (X2, Y2, Z2) ∈ E′(Fp2), P = (xP, yP) ∈ E(Fp) and x̄P = −xP
Output: T + R = (X3, Y3, Z3) ∈ E′(Fp2) and the line l ∈ Fp12
1: t1 ← Z1 ⊗ Z1
2: t3 ← X2 ⊗ t1, t1 ← t1 ⊗ Z1
3: t3 ← t3 ⊖ X1, t4 ← t1 ⊗ Y2
4: Z3 ← Z1 ⊗ t3, t0 ← t4 ⊖ Y1, t1 ← t3 ⊗ t3
5: t4 ← t1 ⊗ t3, X3 ← t0 ⊗ t0
6: t1 ← t1 ⊗ X1
7: t3 ← t1 ⊕ t1
8: X3 ← X3 ⊖ t3
9: X3 ← X3 ⊖ t4
10: t1 ← t1 ⊖ X3
11: T0 ← t0 ×2 t1, T1 ← t4 ×2 Y1 (Option 1, h = 2)
12: T0 ← T0 ⊖ T1 (Option 2)
13: Y3 ← T0 mod 2p, l1,0,0 ← t0,0 ⊗ x̄P, l1,0,1 ← t0,1 ⊗ x̄P
14: T0 ← t0 ×2 X2, T1 ← Z3 ×2 Y2 (Option 1, h = 2)
15: T0 ← T0 ⊖ T1 (Option 2)
16: l1,1 ← T0 mod 2p, l0,0,0 ← Z3,0 ⊗ yP, l0,0,1 ← Z3,1 ⊗ yP
17: Return T + R = (X3, Y3, Z3) and l = (l0, l1)
23a + 4M + 1Mb′. The twisting of point P, given in our case by (xP·w^2, yP·w^3) = ((xP/ξ)·v^2, (yP/ξ)·vw), is eliminated by multiplying the whole line evaluation by ξ and relying on the final
exponentiation to eliminate this extra factor [CLN10]. Clearly, the main drawback of this formula
is the high number of additions. We present the following revised formula:
X3 = (X1Y1/2)·(Y1^2 − 9b′Z1^2), Y3 = ((Y1^2 + 9b′Z1^2)/2)^2 − 27b′^2·Z1^4, Z3 = 2Y1^3·Z1,
l = (−2Y1Z1·yP)·vw − (3X1^2·xP)·v^2 + ξ·(3b′Z1^2 − Y1^2). (6.9)
A = X1·Y1/2, B = Y1^2, C = Z1^2, D = 3C, E0 = D0 + D1, E1 = D1 − D0,
F = 3E, X3 = A·(B − F), G = (B + F)/2, Y3 = G^2 − 3E^2,
H = (Y1 + Z1)^2 − (B + C), Z3 = B·H, I = E − B, l0,0,0 = I0 − I1,
We point out that in practice we have observed that m − s ≈ 3a. Hence, it is more efficient to compute X1Y1 directly than to derive it from (X1 + Y1)^2, B and X1^2. If this were not the case, the formula could be computed with cost 2m + 7s + 23a + 4M.
Doubling formula (6.9) requires 25 Fp reductions (3 per Fp2 multiplication using Karatsuba, 2 per Fp2 squaring and 1 per Fp multiplication). First, by delaying reductions inside Fp2 arithmetic, the number of reductions per multiplication goes down to only 2, giving 22 reductions in total. Moreover, reductions corresponding to G^2 and 3E^2 in Y3 (see execution (6.10)) can be
further delayed and merged, eliminating the need for two reductions. Thus, the number of reductions is now 20 and the total cost of formula (6.9) is 3mu + 6su + 8r + 22a + 4M. The details are shown in Algorithm 6.7.
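Formula (6.9) admits the same kind of toy verification. The sketch below (Python; hypothetical small parameters p = 1019 and b′ = 7, ignoring the line evaluation and the Fp2 structure of the twist) checks the projective doubling of (6.9) against affine chord-tangent doubling:

```python
p, bp = 1019, 7           # hypothetical toy parameters: E: y^2 = x^3 + b' over Fp (bp = b')
inv2 = pow(2, p - 2, p)   # 1/2 mod p

def dbl_homogeneous_69(X1, Y1, Z1):
    # formula (6.9): X3 = (X1Y1/2)(Y1^2 - 9b'Z1^2),
    # Y3 = ((Y1^2 + 9b'Z1^2)/2)^2 - 27b'^2 Z1^4, Z3 = 2Y1^3 Z1
    B = Y1 * Y1 % p
    C = Z1 * Z1 % p
    D = 9 * bp * C % p                        # 9b'Z1^2
    X3 = X1 * Y1 % p * inv2 % p * (B - D) % p
    G = (B + D) * inv2 % p
    Y3 = (G * G - 27 * bp * bp * C * C) % p
    return X3, Y3, 2 * B * Y1 * Z1 % p

def homogeneous_to_affine(X, Y, Z):
    zi = pow(Z, p - 2, p)                     # x = X/Z, y = Y/Z
    return X * zi % p, Y * zi % p

def dbl_affine(x, y):
    lam = 3 * x * x * pow(2 * y, p - 2, p) % p
    x3 = (lam * lam - 2 * x) % p
    return x3, (lam * (x - x3) - y) % p

def find_point():
    for x in range(1, p):
        rhs = (x**3 + bp) % p
        y = pow(rhs, (p + 1) // 4, p)         # square root works since p = 3 (mod 4)
        if y != 0 and y * y % p == rhs:
            return x, y

x, y = find_point()
assert homogeneous_to_affine(*dbl_homogeneous_69(x, y, 1)) == dbl_affine(x, y)
```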
Let T = (X1, Y1, Z1) and R = (X2, Y2, Z2) ∈ E′(Fp2) be points in homogeneous coordinates. To compute T + R = (X3, Y3, Z3) and the line l evaluated at point P = (xP, yP) ∈ E(Fp), we
use the following addition formula:
θ = Y1 − Y2 Z1 , λ = X1 − X 2 Z1 ,
that costs 11mu + 2 su + 11r + 12 a + 4 M when employing lazy reduction (see Alg. 6.8 below).
Algorithm 6.8. Point addition in homogeneous coordinates (cost of 11mu + 2su + 11r + 12a + 4M)
Input: T = (X1, Y1, Z1) and R = (X2, Y2, Z2) ∈ E′(Fp2), P = (xP, yP) ∈ E(Fp) and ȳP = −yP
Output: T + R = (X3, Y3, Z3) ∈ E′(Fp2) and the line l ∈ Fp12
1: t1 ← Z1 ⊗ X2, t2 ← Z1 ⊗ Y2
2: t1 ← X1 ⊖ t1, t2 ← Y1 ⊖ t2
3: t3 ← t1 ⊗ t1
4: X3 ← t3 ⊗ X1, t4 ← t2 ⊗ t2
5: t3 ← t1 ⊗ t3, t4 ← t4 ⊗ Z1
6: t4 ← t3 ⊕ t4
7: t4 ← t4 ⊖ X3
8: t4 ← t4 ⊖ X3
9: X3 ← X3 ⊖ t4
10: T1 ← t2 ×2 X3, T2 ← t3 ×2 Y1 (Option 1, h = 2)
11: T2 ← T1 ⊖ T2 (Option 2)
12: Y3 ← T2 mod 2p, X3 ← t1 ⊗ t4, Z3 ← t3 ⊗ Z1
13: l0,2,0 ← t2,0 ⊗ xP, l0,2,1 ← t2,1 ⊗ xP
14: l0,2 ← −l0,2
15: T1 ← t2 ×2 X2, T2 ← t1 ×2 Y2 (Option 1, h = 2)
16: T2 ← T1 ⊖ T2 (Option 2)
17: t2 ← T2 mod 2p
18: l0,0,0 ← t2,0 ⊖ t2,1, l0,0,1 ← t2,0 ⊕ t2,1 (≡ l0,0 ← ξ·t2)
19: l1,1,0 ← t1,0 ⊗ ȳP, l1,1,1 ← t1,1 ⊗ ȳP
20: Return T + R = (X3, Y3, Z3) and l = (l0, l1)
For our analysis and tests, we use the Barreto-Naehrig (BN) curve:
EBN: y^2 = x^3 + 2 (6.12)
defined over Fp, where p = 36u^4 + 36u^3 + 24u^2 + 6u + 1 ≡ 3 (mod 4), embedding degree k = 12, prime order n = 36u^4 + 36u^3 + 18u^2 + 6u + 1 and u = −(2^62 + 2^55 + 1) ∈ Z.
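These parameters are straightforward to reproduce. The following Python check (integer identities only; it does not verify primality of p and n) confirms the size and congruence properties used throughout this section:

```python
u = -(2**62 + 2**55 + 1)                   # chosen BN parameter
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1  # field characteristic
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1  # (prime) group order

assert p.bit_length() == 254    # the 254-bit prime targeting the 128-bit security level
assert p % 4 == 3               # enables beta = -1, i.e., Fp2 = Fp[i]/(i^2 + 1)
assert p + 1 - n == 6*u**2 + 1  # Frobenius trace t = 6u^2 + 1, since n = p + 1 - t
```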
To implement the arithmetic over extension fields efficiently, we follow the
recommendations in [IEEE08] to represent Fp k with a tower of extensions using irreducible
binomials. Accordingly, we represent Fp12 using the flexible towering scheme used in [DSD07,
HMS08, BGM+10, PSN+10] combined with the parameters suggested by [PSN+10]:
• Fp2 = Fp[i]/(i^2 − β), where β = −1.
• Fp4 = Fp2[s]/(s^2 − ξ), where ξ = 1 + i.
• Fp6 = Fp2[v]/(v^3 − ξ), where ξ = 1 + i.
• Fp12 = Fp4[t]/(t^3 − s) or Fp6[w]/(w^2 − v).
As can be seen in Algorithm 6.1, the selection of β = −1 , enabled by the fact that
p ≡ 3(mod 4) , accelerates Fp 2 arithmetic since multiplications by β can be computed as simple
subtractions [PSN+10].
Although several variants of the Tate pairing are available (e.g., R-ate, optimal ate, X-ate),
our experiments reveal that they achieve very similar performance. For testing purposes, we
choose to implement the optimal ate pairing given by:
aopt : G2 × G1 → GT
(Q, P) → ( f_{r,Q}(P) · l_{[r]Q, πp(Q)}(P) · l_{[r]Q + πp(Q), −πp^2(Q)}(P) )^((p^12 − 1)/n), (6.13)
where r = 6u + 2 < 0 since u < 0. To accommodate the negative r, Aranha et al. [AKL+10] modify Algorithm 2.9 by replacing an expensive inversion with a simple conjugation.
The details are shown in Algorithm 6.9. For complete details, the reader is referred to [AKL+10,
Section 5.1].
Curve arithmetic and line evaluation in Algorithm 6.9 (lines 1, 2, 5, 6, 9) were implemented
with the optimized formulas in homogeneous coordinates discussed in Section 6.3.2 (Algorithm
6.7 and Algorithm 6.8). Towering arithmetic (lines 3, 5, 6, 9, 10) was optimized with the lazy
reduction technique as described in Section 6.2. Following [AKL+10], for accumulating line
evaluations into the Miller variable, Fp12 is represented using the towering Fp → Fp 2 →
Fp 4 → Fp12 and a special (dense × sparse)-multiplication (called sparse multiplication) costing
13mu + 6r + 61a is used (steps 5 and 6 of Algorithm 6.9). Aranha et al. also point out that, during
the first iteration of the loop, a squaring in Fp12 can be eliminated since the Miller variable is
initialized as 1 (step 1 in Algorithm 2.9) and a special (sparse × sparse) multiplication (called
sparser multiplication) costing 7 mu + 5r + 30a is used to multiply the first two line evaluations
(step 3 of Algorithm 6.9). This sparser multiplication is also used for multiplying the two final
line evaluations in step 9 of the algorithm. Final exponentiation in step 10 was implemented with
the method by Scott et al. [SBC+09], in which the power (p^12 − 1)/n is factored into the exponents (p^6 − 1), (p^2 + 1) and (p^4 − p^2 + 1)/n. Among them, the most expensive part is the computation
with the exponent ( p 4 − p 2 + 1) / n . In this case, the execution can be performed in the cyclotomic
subgroup Gφ6 ( Fp 2 ) , which requires, among other operations, 3 exponentiations by u . In order
to speed up these exponentiations, we use the faster compressed squarings by Karabina [Kar10].
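The exponent factorization can be validated numerically for the chosen parameters. The following Python sketch checks that (p^12 − 1) splits as claimed and that n divides the cyclotomic factor p^4 − p^2 + 1:

```python
u = -(2**62 + 2**55 + 1)
p = 36*u**4 + 36*u**3 + 24*u**2 + 6*u + 1
n = 36*u**4 + 36*u**3 + 18*u**2 + 6*u + 1

# "easy" part (p^6 - 1)(p^2 + 1) times the "hard" cyclotomic part p^4 - p^2 + 1
assert (p**6 - 1) * (p**2 + 1) * (p**4 - p**2 + 1) == p**12 - 1
# n divides the cyclotomic factor, so (p^4 - p^2 + 1)/n is an integer exponent
assert (p**4 - p**2 + 1) % n == 0
```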
Algorithm 6.9. Modified optimal ate pairing on BN curves (generalized for u < 0)
Input: P ∈ G1, Q ∈ G2, r = |6u + 2| = Σ_{i=0}^{log2(r)} ri·2^i
Remarkably, we note that these compressed squarings can be sped up by applying the generalized
lazy reduction again. In total, about 8% of reductions can be eliminated per exponentiation by
u . The reader is referred to Appendix C1 for complete details.
We now consider all the described improvements and state-of-the-art techniques to carry out a
detailed operation count of an optimal ate pairing over BN curves using Algorithm 6.9. We aim
to determine the performance gain obtained with the use of the generalized lazy reduction
technique introduced in Section 6.2.
Operation counts for the arithmetic performed by Miller's algorithm when using the generalized lazy reduction are detailed in Table 6.2. For reference, we also include costs when
using lazy reduction for Fp 2 arithmetic only (referred to as basic lazy reduction).
First, using the parameter selection detailed in Section 6.4.1, the Miller loop in Algorithm 6.9 requires 1 negation in Fp to precompute the coordinate −yP; 64 point doublings with line evaluations, 6 point additions with line evaluations, 2 negations, 1 p-power Frobenius and 1 p^2-power Frobenius in E′(Fp2); and 1 conjugation, 66 sparse multiplications, 63 squarings, 2 sparser multiplications and 1 multiplication in Fp12. Thus, the cost of the Miller loop when using
the generalized lazy reduction technique ( MLGL ) is given by:
And the total cost of the Miller loop when using basic lazy reduction ( MLBL ) is given by:
Table 6.2. Operation counts for arithmetic required by Miller's algorithm when using: (i) the generalized lazy reduction technique; (ii) basic lazy reduction applied to Fp2 arithmetic only.

Fp2 arithmetic                   (i) Generalized                (ii) Basic
Add/Sub                          1a = 2A                        1a = 2A
Double-precision Add/Sub         2a                             2a
Multiplication by ξ              2A                             2A
Double-precision Mult. by ξ      4A                             -
Conjugation                      1A                             1A
Reduction                        r = 2R                         r = 2R
Multiplication                   m = mu + r = 3Mu + 2R + 8A     m = mu + r = 3Mu + 2R + 8A
Squaring                         s = su + r = 2Mu + 2R + 3A     s = su + r = 2Mu + 2R + 3A
Inversion                        i = 1I + 2M + 2S + 2A          i = 1I + 2M + 2S + 2A

Fp12 arithmetic                  (i) Generalized                (ii) Basic
Add/Sub                          6a = 12A                       6a = 12A
Conjugation                      3a                             3a
Multiplication                   18mu + 6r + 110a               18m + 67a
Sparse Multiplication            13mu + 6r + 61a                13m + 36a
Sparser Multiplication           7mu + 5r + 30a                 7m + 18a
Squaring                         12mu + 6r + 73a                12m + 51a
Cyclotomic Squaring              9su + 6r + 46a                 6m + 61a
Compressed Squaring              6su + 4r + 31a                 4m + 27a
p-power Frobenius                5m + 6A                        5m + 6A
p^2-power Frobenius              10M + 2a                       10M + 2a
Inversion                        1i + 25mu + 9su + 24r + 112a   1i + 25m + 9s + 82a
FEGL = (1i + 25mu + 9su + 24r + 112a) + 4(3a) + 15(18mu + 6r + 110a) + 3(1i + 36mu + 372su + 9m + 6s
And the total cost of the final exponentiation when using basic lazy reduction ( FE BL ) is
given by:
FEBL = (1i + 25m + 9s + 82a) + 4(3a) + 15(18m + 67 a) + 3(1i + 293m + 6s + 1830a) + 4(6m + 61a)
After adding (6.14) with (6.16) and adding (6.15) with (6.17), we obtain:
Therefore, in the case of a state-of-the-art optimal ate pairing the generalized lazy reduction
technique allows us to eliminate about 32% of reductions. For instance, if we assume that
1M u = 0.65R and 1A = 0.1R (neglecting the cost of field inversions for simplification purposes)
the expected cost reduction for the whole pairing computation is approximately 9%. Obviously,
this estimate is expected to grow with the ratios R/A (reduction/addition) and R / M u (reduction/
integer multiplication).
A software implementation was developed in collaboration with Diego F. Aranha to evaluate the
performance boost obtained with the introduced techniques and improved explicit formulas. To
optimize carry handling and eliminate function call overheads, we followed suggestions by
[BGM+10] and implemented the Fp 2 arithmetic purely in Assembly. Higher-level algorithms
were implemented using the C language and compiled with GCC. To obtain our cycle counts, we
ran our implementations 10^4 times, averaged, and rounded the results to the nearest 1000 cycles. Table 6.3 compares the timings of our Basic and Optimized implementations: the former
employs lazy reduction below Fp 2 only, whereas the latter is fully optimized with the lazy
reduction technique applied to the whole pairing computation. Both implementations exploit
faster compressed squarings and our optimized explicit formulas using homogeneous
coordinates. Therefore, Table 6.3 directly illustrates the benefits of using the generalized lazy
reduction technique discussed in Section 6.2. As can be seen, this technique enables in practice
cost reductions between 12% and 18% on x86-64-based processors.
Table 6.4 compares our implementation results with Beuchat et al. [BGM+10], which
presented the previously fastest implementation at the 128-bit security level in the literature. We
remark that the tested Core i5 exhibits a microarchitecture that is equivalent to the Core i7
processor employed by [BGM+10]. To confirm this assumption, we benchmarked software by
Beuchat et al. and compared the results with the ones reported in [BGM+10]. We also note that
Phenom II was not considered in [BGM+10] and that we could not find a Core 2 Duo machine
producing the same timings as in [BGM+10]. Hence, timings for these two architectures were
measured independently by the authors using the available software.
First, observe that the basic implementation in Table 6.3 consistently outperforms Beuchat et
al.’s results. This is due to our careful implementation using an optimal choice of parameters
combined with optimized curve arithmetic in homogeneous coordinates and faster cyclotomic
formulas. When lazy reduction is enabled (optimized implementation), pairing computation
becomes faster than the best previous result by 28%-34%.
For extended benchmark results and comparisons with other previous works on different 64-
bit processors, the reader is referred to our online database [Lon10b].
6.5. Conclusions
In this chapter, we have proposed efficient methods and improved explicit formulas that significantly speed up the computation of pairings on ordinary curves over prime fields. Most remarkably,
the introduced generalized lazy reduction technique is shown to apply to every computation
involving tower field operations found in the Miller loop and final exponentiation, including the
recently proposed compressed squarings by [Kar10] (see Appendix C1).
After discussing relevant previous work in §6.1, we introduced the generalized lazy reduction
technique in the context of tower extension fields in §6.2. We described a methodology that relies
on the careful selection of the field size to keep intermediate results under Montgomery
boundaries with the objective of reducing costs of additions/subtractions and maximizing the use
of operations without carry checks. Moreover, we illustrated the efficient realization of these
techniques with the popular tower Fp → Fp2 → Fp6 → Fp12, detailing the improved explicit formulas.
Chapter 7
Conclusions
In the last few years, intense research has been focused on the efficient computation of elliptic
curve and pairing primitives to enable their realization in the plethora of potential applications
and emerging platforms of the new millennium. This thesis has focused on devising efficient
methods and formulas for enabling high-speed elliptic curve and pairing-based cryptography
over fields of large prime characteristic. These results have a practical impact in the performance
of cryptographic protocols and schemes based on elliptic curves and pairings. Most remarkably, a
careful selection of state-of-the-art algorithms has led to the realization of record-breaking
implementations in software. For instance, these results may directly increase the number of
secure transaction requests per second that can be processed by a Web server in an Internet-based
application such as e-banking or e-commerce. This could potentially lead to savings in hardware
costs for corporations, to more Web-based content being protected and to reduced waiting times
during online transactions for consumers, among other benefits.
A more detailed description of the contributions of this thesis follows in §7.1. Possible future
research directions are described in §7.2.
were described.
Chapter 3 introduced two new schemes for precomputing points. The LM Scheme, which is
intended for tables of form di P on standard curves using Jacobian coordinates, was adapted to
the case using only one inversion (case 2) and to the case without inversions (case 1). For case 2,
two variants were proposed with slightly different memory requirements and speeds, case 2a and
case 2b. It was shown that the new method achieves the lowest costs in the literature when using
an optimal number of precomputations. For instance, LM Scheme, case 2b, has a cost of
1I + (9L)M + (2L + 6)S with L non-trivial points, which is the lowest in the literature among
methods using one inversion only. The cost formulas for the different variants were derived (see
proofs in Appendices A1 and A2). On the other hand, the LG Scheme, which is based on the
proposed idea of conjugate additions in projective coordinates, was shown to apply to different
curve forms and types of scalar multiplication. Conjugate addition formulas were derived for J,
JQ e and IE coordinates (see Appendix A3). Moreover, an efficient method combining the LM
and LG Schemes was proposed for the case of multiple scalar multiplication on standard curves
using J. The generic cost formulas for single and multiple scalar multiplications were derived
(see proofs in Appendices A5 and A6), as well as the cost formulas of the optimized schemes for
J, JQe and IE coordinates. Finally, an extensive comparative analysis of different precomputation methods for different scenarios, memory requirements and security levels was
carried out to determine the most efficient scheme for each case when using J, JQ e and IE
coordinates. In general, it was shown that for the great majority of cases the proposed schemes
achieve the best performance. Refer to §3.4 for complete details. Finally, potential applications
for the use of conjugate additions were described (see §3.5). The outcomes of this chapter were
exploited for speeding up further scalar multiplication in Chapters 4 and 5.
Chapter 4 studied efficient multibase representations for scalar multiplication and how efficient these methods are in different scenarios. First, a taxonomy and comparative analysis of
the various double- and multi-base methods for scalar multiplication were discussed. Then, the
theoretical analysis of the multibase NAF (mbNAF) method and its windowed variant, wmbNAF,
were developed. Our methods were modeled using Markov chains and formulas for estimating
the average zero and nonzero densities for cases with bases {2,3} and {2,3,5} were derived.
Then, the “fractional” windows recoding was applied to the setting of wmbNAF to solve the
problem of restricted number of precomputations imposed by standard windows. The new
method, denoted by Frac-wmbNAF, allows a flexible number of precomputations in the
execution of scalar multiplication, which makes it ideal for applications with restricted memory.
The method was also analyzed theoretically using Markov chains for the case with bases {2,3}.
Furthermore, a new methodology based on the operation cost per bit to derive efficient multibase
algorithms was introduced. The optimized algorithms were implemented in Matlab to perform an
extensive comparison for computing scalar multiplication when using J, JQ e and IE
coordinates. The cases with bases {2,3} and {2,3,5} using (Frac-w)mbNAF and the refined
multibase chains were compared with the performance of standard NAF-based methods and the
most efficient double-base methods in the literature. For proposed and standard NAF methods,
the best precomputation scheme available for each case was applied (using results from Chapter
3). The conclusion was that, currently, the proposed refined multibase chains achieve the lowest
costs found in the literature among methods without precomputations, for all curve forms under
analysis. For instance, using bases {2,3,5} and {2,3} for n = 160 bits we can perform a scalar
multiplication with costs of only 1451M (field multiplications) and 1351M in Jacobian and
inverted Edwards coordinates, respectively. With JQ e , that cost can be as low as 1261M using
bases {2,3,5}. These results provide cost reductions between 7%-10% in comparison with NAF.
Similar results were attained by the refined multibase chains using an optimal number of
precomputations, although in this case the gain was only 1%-3% in comparison with (Frac)-
wNAF (see §4.5 for complete details). A relevant comparison with the fastest curves using
standard radix-2 methods followed. In conclusion, “slower” curves that can advantageously
exploit multibase chains may become competitive with the “fastest” curves using radix-2
methods when curve parameters are suitably chosen and no precomputations are allowed.
Finally, a discussion of potential applications and variants of the proposed methods was included,
as well as a critical look at the practical implications of double- and multi-base number systems
in the computation of scalar multiplication (see §4.6). In conclusion, the use of multibases was
recommended for memory-constrained devices and when the conversion step (if expensive) can
be performed off-line. For non-constrained devices, it was shown that the gain may be negligible
and that faster curves without exploiting multibases are available. These conclusions were
confirmed by tests on real x86-64-based implementations in §5.6.4, subsection “Timings using
Multibase Methods”.
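The flavor of these recodings can be conveyed in a short sketch. The Python fragment below is a simplified multibase recoding with bases {2,3} and digit set {0, ±1} (an illustration in the spirit of mbNAF, not the exact recoding analyzed in Chapter 4); it produces a chain of (base, digit) steps and verifies that Horner evaluation recovers the scalar:

```python
def mbnaf23(k):
    """Simplified multibase recoding with bases {2,3} and digits {0, +1, -1}:
    divide out 2 or 3 while possible; otherwise choose d in {-1, +1} so that
    k - d is divisible by 4 (NAF-style: the next radix-2 digit is forced to 0)."""
    ops = []
    while k > 0:
        if k % 2 == 0:
            ops.append((2, 0))
            k //= 2
        elif k % 3 == 0:
            ops.append((3, 0))
            k //= 3
        else:
            d = 2 - (k % 4)      # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            ops.append((2, d))
            k = (k - d) // 2
    return ops

def evaluate(ops):
    # Horner evaluation; on a curve this becomes the double/triple-and-add loop
    x = 0
    for base, digit in reversed(ops):
        x = x * base + digit
    return x

assert all(evaluate(mbnaf23(k)) == k for k in range(1, 3000))
```

Each (base, 0) step corresponds to a point doubling or tripling, and each nonzero digit additionally costs one point addition or subtraction, which is where the low nonzero density of multibase chains pays off.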
Chapter 5 studied and brought together the most efficient algorithms at the field, point and scalar arithmetic levels with the objective of achieving high-speed implementations of ECC on
x86-64 processors. Optimizations at different levels were carefully tuned for the targeted
architectures. First, incomplete reduction and branchless arithmetic were optimally combined for
suitably chosen pseudo-Mersenne primes for achieving efficient arithmetic in Fp . Dependencies
between consecutive field operations were found to degrade the performance on the targeted
processors by stalling the pipeline. The rescheduling and merging of field operations and the
merging of point operations were proposed to minimize this problem. These techniques also
reduce the number of function calls and memory accesses. Explicit point formulas for the
relevant cases of J and E / E e over Fp and Fp 2 were optimized by reducing the number of
“small” operations and by applying the techniques aforementioned (see Appendices B1 and B2).
By combining all optimized formulas with state-of-the-art algorithms, including the use of the
LM precomputation scheme (see §5.6.1 and §5.6.2 for further details), we presented two
Precomputations for other special curves and settings. In particular, for the efficient Twisted
Edwards curve using E / E e or extended Jacobi quartics using homogeneous/extended
homogeneous coordinates it is still unknown if other precomputation schemes with higher
efficiency than the traditional scheme using P → 3 P → 5 P → … → mP exist. Further
research could focus on the development of improved schemes for these systems. Also, in §3.5 it
was observed that conjugate additions can be derived for formulas over F2m . The application of
LG-like precomputation schemes to this setting requires further analysis.
More composite formulas and efficient conversion to multibase. In §4.6.1, it was argued that the main obstacle to the use of multiple bases in a wide range of applications is the computing cost of conversion from binary to multibase. Further research is needed to improve the
implementation of conversion algorithms on different platforms. This effort can be
complemented by the development of efficient tripling and quintupling formulas for other
coordinate systems such as E / E e where radix-2 methods are still more efficient.
Implementation on constrained devices. Following the results and analysis in §4.5 and §4.6.1,
the use of multibase methods is more promising for devices with constrained memory resources
in which the gain is maximal in terms of speed. However, these devices are usually limited in
terms of power. Further investigation supported with implementations is required for assessing
the practical impact of using multibase methods in these platforms with such a constraint.
Analysis on other platforms; improving ECC over binary fields, HECC. Several software
techniques and optimizations were proposed for elliptic curve point multiplication over Fp and
Fp 2 in Chapter 5. The analysis and implementations targeted x86-64 processors. In many cases,
the proposed techniques and optimized formulas are generic and further study could be devoted
to test them on different platforms, e.g., embedded devices with 32-bit and 8-bit
microarchitectures. Moreover, further research can be focused on applying similar methods to the
case over F2m . For instance, it would be interesting to analyze whether data dependencies
degrade performance of field operations and if similar countermeasures also apply. In fact,
further study could analyze the application of these methods to other settings such as
Hyperelliptic Curve Cryptosystems.
Generalized lazy reduction on other platforms. This technique was shown to reduce
significantly the computing cost of pairings on various x86-64-based processors. Practical
implementation of the technique in Field Programmable Gate Arrays (FPGAs), 32-bit embedded
devices or microcontrollers with 8-bit architectures would be highly valuable. In certain cases,
the gain is expected to grow even further as the ratio multiplication/addition is usually larger on
smaller devices in which embedded multipliers are much less powerful.
Appendix A
In this section, we present the pseudocode of the LM Scheme described in Section 3.2.
Appendix A1: Pseudocode of the LM Precomputation Scheme
Lemma A.1. Algorithm A.1, that computes the initial doubling (3.3) of Step 1 (see Section
3.2.1), costs 1M + 5S and requires 6 temporary registers.
Lemma A.2. Algorithm A.2, that computes the first addition 2P + P in sequence (3.2) using ADDCo-Z, costs 5M + 2S and requires 6 temporary registers if the precomputed table contains
only one point. Otherwise, Algorithm A.2 requires 6 temporary registers for calculations plus 2
extra registers to store the ( X , Y ) coordinates of 3P. To adapt Algorithm A.2 to case 1, it should
also store the Z coordinate of 3P in register Z 3 .
Lemma A.3. Algorithm A.3, that computes the subsequent additions in sequence (3.2) using ADDCo-Z operations, costs 5M + 2S per extra point, requires 6 temporary registers for
calculations, and 3 (respectively 4) extra registers per point for case 2a (case 2b) to store the values
X , Y , A ( X , Y , B, C ) . In the last iteration the memory requirement is reduced by storing values
X , Y ( X , Y , B ) in temporary registers. To adapt Algorithm A.3 to case 1, one should execute the
steps that correspond to case 2a except that, instead of values A(i +3) / 2 , one should store Z
coordinates of points iP.
Lemma A.4. Algorithm A.4, that computes the modified Montgomery’s method corresponding
to Step 2 (see Section 3.2.1), costs 1I + (3M + 1S) + (4M + 1S)(L − 1) and 1I + (3M + 1S) + 4(L − 1)M for cases 2a and 2b, respectively, and requires 4 temporary registers for calculations and
storage for the affine coordinates (x, y) of (L − 1) precomputed points. In addition, case 2a
requires (L − 1) registers for values A j , and case 2b requires 2(L − 1) registers for values
( B j , C j ) . This step is not executed in case 1.
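Step 2 recovers affine coordinates for all precomputed points at the cost of a single inversion; this rests on Montgomery's simultaneous-inversion trick, which trades the remaining inversions for multiplications. A minimal Python sketch of the trick (the prime is illustrative, and this is not the thesis' Algorithm A.4 itself):

```python
# Montgomery's simultaneous inversion: n field elements inverted at a
# cost of 1I + 3(n - 1)M. Illustrative prime; a sketch of the trick only.
p = 10007

def batch_inverse(vals):
    n = len(vals)
    prefix = [vals[0]]
    for v in vals[1:]:                       # (n-1)M: running products
        prefix.append(prefix[-1] * v % p)
    inv = pow(prefix[-1], p - 2, p)          # the single inversion (1I)
    out = [0] * n
    for i in range(n - 1, 0, -1):            # 2(n-1)M: unwind the products
        out[i] = inv * prefix[i - 1] % p     # inverse of vals[i]
        inv = inv * vals[i] % p              # now the inverse of prefix[i-1]
    out[0] = inv
    return out
```

In the scheme this is interleaved with the recovery of the affine (x, y) pairs, which is where the extra (3M + 1S)-type terms per point in Lemma A.4 come from.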
A2 Cost Analysis of the LM Precomputation Scheme
Theorem A.1. The LM Scheme, case 1, costs (5L + 1)M + (2L + 5)S (plus (1M + 1S)L if the values Zi^2 and Zi^3 are also precomputed) and requires (3L + 6) registers, where L is the number of non-trivial points in the precomputed table diP. The requirement increases to (5L + 6) if the values Zi^2 and Zi^3 are also stored in order to use the addition (or doubling-addition) with stored values during evaluation.
Proof: Following Lemmas A.1–A.3, Algorithms A.1, A.2 and A.3 cost 1M + 5S, 5M + 2S and (5M + 2S)(L − 1), respectively. Also, precomputing the values Zi^2, Zi^3 (to enable the use of ADD or DBLADD with stored values during the evaluation stage) costs (1M + 1S)L. By adding these values we obtain the cost of the LM Scheme, case 1, above. In terms of memory, this method only requires 6 temporary registers during the execution of Algorithms A.1, A.2 and A.3 plus 3 registers to store the (X : Y : Z) coordinates of each precomputed point. That makes a total requirement of 3L + 6 registers. If the pair Zi^2/Zi^3 is also stored per point, the total requirement increases to 5L + 6. □
Theorem A.2. The LM Scheme, case 2a, has the following cost:
Proof: Following Lemmas A.1-A.3, Algorithms A.1, A.2 and A.3 cost 1M + 5S, 5M + 2S and
(5M + 2S)(L − 1), respectively. According to Lemma A.4, Algorithm A.4 costs 1I + (3M + 1S) +
(4M + 1S)(L − 1). By adding these values, we obtain the cost of the LM Scheme, case 2a, above.
Regarding memory requirements, Algorithm A.1 needs 6 temporary registers T1, …, T6. The same registers can be reused by Algorithm A.2 for calculations. Additionally, it needs 2 extra registers to store the (X, Y) coordinates corresponding to 3P, making a total of 6 + 2 = 8 registers (see Lemma A.2). Algorithm A.3 also reuses the temporary registers T1, …, T6, and requires 3 registers per point, except the last one, to store the (X, Y, A) values. For the last iteration, we only require registers T1, …, T6 and 1 extra register to store A since the last (X, Y) coordinates are stored in T1 and T2 (see Lemma A.3). That makes an accumulated requirement of 6 + 2 + 3(L − 2) + 1 = 3L + 3 at the end of Algorithm A.3, for L ≥ 2. If L = 1, Algorithm A.3 is not executed, and the requirement is fixed by Algorithm A.2 at only 6 registers (note that in this case the (X, Y) coordinates
are stored in T1 and T2). Algorithm A.4 requires 4 temporary registers for calculations (where T1 and T2 can store the (x, y) coordinates of the last point mP), 2(L − 1) − 2 registers for the (x, y) coordinates of the remaining (L − 1) points (assuming that T3 and T4 can store the (x, y) coordinates of 3P) and (L − 1) registers for the values Aj for 4 ≤ j ≤ (m + 3)/2, m > 3 odd, making a total requirement of 3L − 1. In conclusion, the LM Scheme, case 2a, requires 3L + 3 registers. □
Theorem A.3. The LM Scheme, case 2b, has the following cost:
Proof: Following Lemmas A.1-A.3, Algorithms A.1, A.2 and A.3 have the same costs as cases 1
and 2a, and Algorithm A.4 costs 1I + (3M + 1S) + (4M)(L − 1). Adding these costs we obtain the
value indicated for the LM Scheme, case 2b. Regarding memory requirements, Algorithm A.1
needs 6 registers T1 ,… , T6 , which can be reused by Algorithm A.2 for temporary calculations.
Additionally, Algorithm A.2 needs 2 extra registers to store ( X , Y ) coordinates corresponding to
3P, making a total of 6 + 2 = 8 registers (see Lemma A.2). Algorithm A.3 also reuses registers
T1, …, T6, and requires 4 registers per point, except the last one, to store the (X, Y, B, C) values. For the last iteration, we only require registers T1, …, T6 and 1 extra register to store C since the last (X, Y) coordinates are stored in T1 and T2, and T6 stores B (see Lemma A.3). That makes an
accumulated requirement of 6 + 2 + 4(L − 2) + 1 = 4L + 1 at the end of Algorithm A.3, for L ≥ 2.
If L = 1, we do not compute Algorithm A.3, and the requirement is fixed by Algorithm A.2 at
only 6 registers as pointed out in the analysis for case 2a. Algorithm A.4 requires 4 registers for
calculations (where T1 and T2 can store the (x, y) coordinates of the last point mP), 2(L − 1) − 2 registers for the (x, y) coordinates of the remaining (L − 1) points (assuming that T3 and T4 can store the (x, y) coordinates of 3P) and 2(L − 1) registers for the values Bj, Cj for 4 ≤ j ≤ (m + 3)/2, m > 3 odd, making a total requirement of 4L − 2 registers. In conclusion, case 2b requires 4L + 1
registers. □
A3 Conjugate Addition Formulas
X4 = γ^2 − (4β^3 + 8Z2^2 X1 β^2) ,  Y4 = γ(Z2^2 X1 β^2 − X4) − Z2^3 Y1 β^3 ,  Z4 = Z3 ,    (A.1)
Thus, the cost of an addition/conjugate addition pair is (7M + 4S) + (2M + 1S) = 9M + 5S when using an ADD operation, or (7M + 3S) + (2M + 1S) = 9M + 4S when using an ADD[0,1] operation. See Tables 2.4 and 3.2.
In the case of mixed addition, let P = (X1 : Y1 : Z1 : X1^2 : Z1^2) and Q = (x2, y2, x2^2) be two points in JQ^e and A coordinates, respectively, on an extended Jacobi quartic curve E_JQ/Fp with d = 1 in (2.11). If the mixed addition P + Q is performed using the following formula due to [HWC+07, HWC+08b]:
where α = (X1 + Z1)^2 − (X1^2 + Z1^2), and the partial values (α + 2Y1), αx2, −2Y1y2, (X1^2 x2^2 + Z1^2), [2(X1^2 + Z1^2)(x2^2 + 1) + 2Y1y2], aαx2, Z3 and Z3^2 are temporarily stored, then the conjugate mixed addition P − Q = P + (−Q) = (X1 : Y1 : Z1 : X1^2 : Z1^2) + (−x2, y2, x2^2) = (X4 : Y4 : Z4 : X4^2 : Z4^2) can be performed with 2M + 1S + 7A + 2(×2) as follows:
Thus, the cost of a mixed addition/conjugate mixed addition pair is (6M + 3S) + (2M + 1S) = 8M + 4S. See Tables 2.4 and 3.2.
field additions):
and the partial values [X1X2Y1Y2 + d(Z1Z2)^2], X1X2, Y1Y2, [X1X2Y1Y2 − d(Z1Z2)^2], X1Y2, X2Y1 and Z1Z2 are temporarily stored, then the conjugate addition P − Q = P + (−Q) = (X1 : Y1 : Z1) + (−X2 : Y2 : Z2) = (X4 : Y4 : Z4) can be performed with the following (at a cost of only 4M + 2A):
Thus, the cost of an addition/conjugate addition pair is (10M + 1S) + 4M = 14M + 1S.
The formula for mixed addition can be obtained by setting Z2 = 1 in formula (A.7) and has a cost of 9M + 1S + 4A. Then, if the partial values (X1x2Y1y2 + dZ1^2), (X1x2Y1y2 − dZ1^2), X1x2, Y1y2, X1y2 and x2Y1 are temporarily cached, the conjugate mixed addition P − Q = P + (−Q) = (X1 : Y1 : Z1) + (x2, −y2) = (X4 : Y4 : Z4) can be performed by:
Z4 = −Z1(X1x2 + Y1y2)(X1y2 − x2Y1) ,    (A.9)
which costs only 4M + 2A. Therefore, the cost of a mixed addition/conjugate mixed addition pair is (9M + 1S) + 4M = 13M + 1S.
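The savings behind these pairs are easiest to see in affine coordinates: P + Q and P − Q share the slope denominator (x2 − x1), so once that denominator has been inverted for P + Q, the conjugate result costs only a few extra multiplications. The projective formulas above realize the same idea without any inversion; the Python sketch below (illustrative prime and curve y^2 = x^3 − 3x + 7, not from the thesis) only demonstrates the shared-intermediate principle.

```python
# Affine illustration of an addition/conjugate addition pair: P + Q and
# P - Q share the inverted denominator (x2 - x1). Illustrative prime/curve.
p = 10007

def find_point(x):
    while True:                              # brute-force a point on the curve
        rhs = (x * x * x - 3 * x + 7) % p
        y = pow(rhs, (p + 1) // 4, p)        # sqrt candidate (p % 4 == 3)
        if y * y % p == rhs:
            return x, y
        x += 1

def affine_add(P, Q):                        # chord rule, x1 != x2
    (x1, y1), (x2, y2) = P, Q
    lam = (y2 - y1) * pow(x2 - x1, p - 2, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def add_and_conjugate(P, Q):
    (x1, y1), (x2, y2) = P, Q
    d = pow(x2 - x1, p - 2, p)               # the single shared inversion
    res = []
    for ys in (y2, -y2):                     # Q, then -Q = (x2, -y2)
        lam = (ys - y1) * d % p
        x3 = (lam * lam - x1 - x2) % p
        res.append((x3, (lam * (x1 - x3) - y1) % p))
    return res                               # [P + Q, P - Q]

P, Q = find_point(2), find_point(5)
S, D = add_and_conjugate(P, Q)
```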
A4 Calculation of Precomputed Points for the LG Scheme
The following table shows the proposed sequences for computing a table with the form diP, where di ∈ D+ \ {0,1} = {3, 5, …, m} with m odd. For m = 5, the first sequence corresponds to J and JQ^e coordinates, and the second one to IE coordinates. Tied arrows denote an addition/conjugate addition pair (or a mixed addition/conjugate mixed addition pair if the addition is performed with the affine point P).
[Table of computation sequences omitted: its rows list the odd multiples 3, 5, 7, 9, 11, 13 and 15, 17, 19, 27, 29, 31; the tied arrows marking the addition/conjugate addition pairs do not survive text extraction.]
A5 Cost Analysis of the LG Scheme, Table diP
Theorem A.4. Given an elliptic curve E of arbitrary form, the cost of using the LG Scheme for
computing a precomputed table with the form d i P , where di ∈ D+ \{0,1} = {3,5,..., m} with m
odd and the base point P ∈ E (Fp ) , is given by:
Proof: first, note that m ≥ 3. If rmax is defined as the value of the highest “strategic” point, then it holds that rmax = 3·2^(ω−2) for some integer ω ≥ 2, since “strategic” points have the form P_(i+1) = 2P_i for integers i ≥ 0 with P_0 = 3P. It easily follows that calculating all “strategic” points up to rmax·P = (3·2^(ω−2))P requires one tripling and (ω − 2) doublings. Then, additions are required to compute each point in the table except 3P, which is already calculated. Since there are L non-trivial points in the table, we require (L − 1) additions in total. Let us now estimate the number of regular additions required for computing points below rmax·P, and then above rmax·P. First, up to rmax·P there are rmax/2 odd points, of which (rmax/6) − 1 are computed with a conjugate addition. If P and 3P are discarded, we require (rmax/2) − (rmax/6) + 1 − 2 = (rmax/3) − 1 regular additions up to rmax·P. Above rmax·P there is a range for which points are computed with
conjugate additions. Then we need to establish the value rmax < k < m s.t. points kP, (k + 2)P, …, mP are calculated with regular additions. Following Appendix A4, it is straightforward to note that k = 9 if rmax = 6, k = 17 if rmax = 12, k = 33 if rmax = 24, and so on. Thus, k = (4rmax + 3)/3 and, hence, (m − (4rmax + 3)/3)/2 + 1 = L + 1 − 2rmax/3 regular additions are required above rmax·P. However, an exception happens when m < k, in which case the number of additions above rmax·P should be zero. The latter can be accomplished by simply multiplying ⌊(2rmax − 1 + m − (4rmax + 3)/3)/(2rmax − 1)⌋ = ⌊(6L + 2rmax − 3)/(6rmax − 3)⌋ with L + 1 − 2rmax/3. Therefore, the total number of regular additions is given by the expression ε = ⌊(6L + 2rmax − 3)/(6rmax − 3)⌋·(L + 1 − 2rmax/3) + (rmax/3) − 1. Since it was established that
there are (L − 1) additions in total, (L − 1 − ε) of them are addition/conjugate addition pairs and ε − (L − 1 − ε) = 2ε − L + 1 are individual additions. By definition, case 2 additionally requires the cost of converting the projective points to affine coordinates. □
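The counting argument above can be exercised numerically. The helper below is hypothetical, written for this appendix's definitions: L = (m − 1)/2 non-trivial points, and rmax taken as the largest value 3·2^(ω−2) not exceeding m.

```python
# Operation counts of Theorem A.4 for the table {3P, 5P, ..., mP}.
# Hypothetical helper: L = (m-1)//2 and r_max = largest 3*2^(w-2) <= m.
def lg_counts(m):
    L = (m - 1) // 2                                 # non-trivial points
    r = 3
    while 2 * r <= m:                                # r = r_max, highest "strategic" point
        r *= 2
    indicator = (6 * L + 2 * r - 3) // (6 * r - 3)   # floor term: 1 iff m >= k, else 0
    eps = indicator * (L + 1 - 2 * r // 3) + r // 3 - 1   # total regular additions
    pairs = L - 1 - eps                              # addition/conjugate addition pairs
    singles = 2 * eps - L + 1                        # individual regular additions
    return L, r, eps, pairs, singles
```

For m = 15 this yields L = 7, rmax = 12 and ε = 3, i.e., three pairs and no individual additions, consistent with building 5P/7P, 9P/15P and 11P/13P around the “strategic” points 6P and 12P.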
Corollary A.1. In the setting of Theorem A.4, the cost of the LG Scheme when using mixed
coordinates is given by:
Proof: assuming that the base point P is given in affine coordinates, P_0 = 3P can be computed using a mixed tripling with the form 3A → P. Since (ω − 2) doublings are required, there are also (ω − 2) “strategic” points. By definition, m > rmax, so for each “strategic” point P_j there is always a pair of points with the form P_j ± P. Then, there are (ω − 2) points that can be calculated with an addition/conjugate addition pair using mixed projective-affine coordinates, that is, computing P ± A → P. According to Theorem A.4, there are (L − ε − 1) addition/conjugate addition pairs in total. Hence, (L − ε − 1) − (ω − 2) = L − ε − ω + 1 are addition/conjugate addition pairs using Jacobian coordinates, that is, computing P ± P → P. □
A6 Cost Analysis of the LG Scheme, Table ciP ± diQ
Theorem A.5. Given an elliptic curve E of arbitrary form, the cost of using the LG Scheme for
computing a precomputed table with the form ci P ± di Q , where ci , di ∈ D + = {0,1,3,5,..., m} ,
ci > 1 if di = 0 , di > 1 if ci = 0 , m odd and P, Q are points in E ( F p ) , is given by:
Cost cases 1/3(2) = (m − 1)ADD + ((m + 1)^2/4)(ADD + ADD′) + 2⌈(m − 1)/m⌉DBL (+ Cost_P→A) ,
where L = (m^2 + 4m − 1)/2 > 1 is the number of non-trivial points in the table and Cost_P→A denotes the cost of converting points from projective to affine coordinates in case 2.
Proof: first, let us establish the value L. There are (m + 1) points with the form ciP or diQ, which can be combined in (m + 1)^2/2 ways to get points of the form ciP ± diQ with cidi ≠ 0. By discarding the points P and Q, we obtain the total number of non-trivial points as L = (m + 1)^2/2 + (m + 1) − 2 = (m(m + 4) − 1)/2. As it always holds that m ≥ 1, then L > 1. The points ciP or diQ with
ci ≥ 3 and di ≥ 3 can be computed with two sequences of the form P → P + 2P = 3P → 3P + 2P = 5P → … → (m − 2)P + 2P = mP. This requires in total two doublings and (m − 1) additions. Note that when m = 1, no calculations are required for this part. Hence, for m ≥ 1 the number of required doublings can be expressed by 2⌈(m − 1)/m⌉. Finally, the computation of the (m + 1)^2/2 points ciP ± diQ with cidi ≠ 0 involves (m + 1)^2/4 addition/conjugate addition pairs. By definition, case 2 additionally requires the cost of converting points from projective to affine coordinates. □
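The bookkeeping of Theorem A.5 is summarized by the sketch below (a hypothetical helper written for this proof's counts; names are illustrative):

```python
# Table size and operation counts from Theorem A.5 for c_iP ± d_iQ with
# c_i, d_i in {0, 1, 3, ..., m}, m odd. Hypothetical helper for sanity checks.
def multi_table_counts(m):
    L = (m * m + 4 * m - 1) // 2        # non-trivial points in the table
    pairs = (m + 1) ** 2 // 4           # addition/conjugate addition pairs
    adds = m - 1                        # additions building the c_iP and d_iQ chains
    dbls = 2 * -(-(m - 1) // m)         # 2*ceil((m-1)/m): 0 if m == 1, else 2
    return L, pairs, adds, dbls
```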
Theorem A.6. In the setting of Theorem A.5 and assuming that m ≥ 5, the cost of the LG Scheme when using Jacobian coordinates is given by:
Cost cases 1(2) = 2DBL + (m − 1)ADDCo-Z + ((m + 1)^2/4)(ADD + ADD′) (+ Cost_2J→A) ,
where Cost_2J→A = [2m(m + 4) − 1]M + [(m + 1)^2/4 + 2]S for case 2.
Proof: according to Theorem A.1, if m ≥ 3, points with the form diP, where di ∈ D+ \ {0,1} = {3, 5, …, m}, can be computed with the sequence P → P + 2P = 3P → 3P + 2P = 5P → … → (m − 2)P + 2P = mP using one (mixed) doubling and (m − 1)/2 additions with identical Z coordinate. Then, the points ciP and diQ with ci ≥ 3 and di ≥ 3 can be computed with two doublings and (m − 1) additions with identical Z coordinate. The restriction m ≥ 5 arises because when m = 3 it is more efficient to compute 3P directly with a (mixed) tripling operation. Following Theorem A.5, the computation of the (m + 1)^2/2 points ciP ± diQ with cidi ≠ 0 involves (m + 1)^2/4 addition/conjugate addition pairs. □
A7 Comparison of LG and LM Schemes using Jacobian
Coordinates
The tables below compare the performance of the LM and LG Schemes with the DOS method for n = 256 and 512. For each method, we show the cost of performing an n-bit scalar multiplication and the optimal number of precomputed points L when considering that a maximum of (2L_ES + R) registers are available for the evaluation stage (i.e., L ≤ L_ES). For our analysis, R = 7. Also, to compare the performance of the schemes for cases 1 and 2, we include the costs of the most efficient scheme for case 1 (i.e., LM Scheme, case 1) and show at the bottom of each table the I/M range for which LM Scheme, case 1, would achieve the lowest cost.
Table A.1. Performance comparison of LG and LM Schemes with the DOS method in 256-bit
scalar multiplication for different memory constraints on a standard curve (1M = 0.8S).
I/M range (LM, case1) I > 152M I > 199M I > 211M I > 172M I > 179M
I/M range (LM, case1) I > 141M I > 134M I > 138M I > 109M I > 92M
Table A.2. Performance comparison of LG and LM Schemes with the DOS method in 512-bit
scalar multiplication for different memory constraints on a standard curve (1M = 0.8S).
I/M range (LM, case1) I > 321M I > 426M I > 463M I > 391M I > 419M
I/M range (LM, case1) I > 344M I > 344M I > 350M I > 319M I > 319M
I/M range (LM, case1) I > 286M I > 291M I > 291M I > 259M I > 262M
I/M range (LM, case1) I > 228M I > 232M I > 234M I > 227M I > 220M
I/M range (LM, case1) I > 214M I > 207M I > 173M
B Appendix B
The following Maple scripts detail the improved explicit formulas for the case of Jacobian (J) and mixed Twisted Edwards homogeneous/extended homogeneous (E/E^e) coordinates exploiting the techniques discussed in Chapter 5, namely incomplete reduction, merging and scheduling of field operations, and merging of point operations.
These formulas have been used for the “traditional” implementations discussed in Section 5.6.1. Temporary registers are denoted by ti, and Mul = multiplication, Sqr = squaring, Add = addition, Sub = subtraction, Mulx = multiplication by x, Divx = division by x, Neg = negation. DblSub represents the computation a − 2b (mod p), and SubDblSub represents the merging of a − b (mod p) and (a − b) − 2c (mod p). Underlined field operations are merged, and operationIR represents a field operation using incomplete reduction. In practice, input registers are reused to store the result of an operation.
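Among the techniques listed above, incomplete reduction (IR) deserves a small illustration: an IR addition subtracts the modulus only when the result overflows the word boundary 2^n, so intermediate values may remain in [p, 2^n) and are fully reduced only by a later operation. A toy Python model with an illustrative 16-bit modulus (a sketch of the idea, not the thesis code):

```python
# Toy model of incomplete reduction: for p < 2^n, an IR addition folds
# the result back below 2^n by subtracting p, possibly leaving a value
# in [p, 2^n). Illustrative 16-bit modulus.
n = 16
p = 65521                      # a prime just below 2^16

def add_ir(a, b):
    s = a + b
    while s >= 1 << n:         # reduce only on overflow of the n-bit boundary
        s -= p                 # congruence mod p is preserved
    return s

def mul_full(a, b):
    return a * b % p           # a later multiplication reduces fully

# (a + b + c) * d with IR on the two additions:
def fused(a, b, c, d):
    return mul_full(add_ir(add_ir(a, b), c), d)
```

The point of IR is that the cheap carry check replaces a full comparison against p on every addition, while correctness is restored by the full reduction in the final operation.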
Appendix B1: Explicit Formulas for “Traditional” Implementations
# In practice, Xout,Yout,Zout reuse the registers X1,Y1,Z1 for all cases below.
t4:=Z1^2; t3:=Y1^2; t1:=X1+t4; t4:=X1-t4; t0:=3*t4; t5:=X1*t3; t4:=t1*t0; t0:=t3^2;
t1:=t4/2; t3:=t1^2; Zout:=Y1*Z1; Xout:=t3-2*t5; t3:=t5-Xout; t5:=t1*t3; Yout:=t5-t0;
simplify([x3-Xout/Zout^2]), simplify([y3-Yout/Zout^3]); # Check
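The doubling sequence above can also be transcribed and checked outside Maple. The Python sketch below mirrors the register allocation exactly (the halving t1 := t4/2 becomes a multiplication by 2^(−1) mod p) and compares the result with the affine doubling law; the prime and curve (a = −3, b = 7) are illustrative choices for the check, not from the thesis.

```python
# Python transcription of the Maple doubling sequence above (Jacobian
# coordinates, a = -3). Illustrative prime and curve y^2 = x^3 - 3x + 7.
p = 10007
inv2 = pow(2, p - 2, p)                      # t4/2 is a modular halving

def dbl(X1, Y1, Z1):
    t4 = Z1 * Z1 % p;  t3 = Y1 * Y1 % p
    t1 = (X1 + t4) % p;  t4 = (X1 - t4) % p;  t0 = 3 * t4 % p
    t5 = X1 * t3 % p;  t4 = t1 * t0 % p;  t0 = t3 * t3 % p
    t1 = t4 * inv2 % p;  t3 = t1 * t1 % p
    Zout = Y1 * Z1 % p
    Xout = (t3 - 2 * t5) % p
    t3 = (t5 - Xout) % p;  t5 = t1 * t3 % p
    Yout = (t5 - t0) % p
    return Xout, Yout, Zout

def to_affine(X, Y, Z):
    zi = pow(Z, p - 2, p)
    return X * zi * zi % p, Y * pow(zi, 3, p) % p

def find_point(x):
    while True:                              # brute-force a point on the curve
        rhs = (x * x * x - 3 * x + 7) % p
        y = pow(rhs, (p + 1) // 4, p)        # sqrt candidate (p % 4 == 3)
        if y * y % p == rhs:
            return x, y
        x += 1

def affine_dbl(P):                           # tangent rule for a = -3
    x1, y1 = P
    lam = (3 * x1 * x1 - 3) * pow(2 * y1, p - 2, p) % p
    x3 = (lam * lam - 2 * x1) % p
    return x3, (lam * (x1 - x3) - y1) % p

P = find_point(2)
R = to_affine(*dbl(P[0], P[1], 1))
```

Note that the script outputs Z3 = Y1·Z1 rather than 2Y1·Z1; the halved slope t1 compensates, so the affine result is unchanged.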
B2 Explicit Formulas for GLS-Based Implementations
These formulas have been used for the GLS-based implementations discussed in Section 5.6.2.
Temporary registers are denoted by ti and Mul = multiplication, Sqr = squaring, Add = addition,
Sub = subtraction, Mulx = multiplication by x, Divx = division by x, Neg = negation. DblSub
represents the operation a − 2b (mod p) or a − b − c (mod p) , Mul3Div2 represents the operation
( a + a + a ) / 2 (mod p) , AddSub represents the merging of a + b (mod p ) and a − b (mod p ) ,
AddSub2 represents a + b − c (mod p) , SubSub represents the merging of a − b (mod p ) and
c − d (mod p ) , and Mul2Mul3 represents the merging of a + a (mod p ) and a + a + a (mod p) .
Underlined field operations are merged and operationIR represents a field operation using
incomplete reduction. In practice, input registers are reused to store the result of an operation.
C Appendix C
with Ai , j = ( gi + g j )( gi + ξ g j ) and Bi , j = gi g j .
The formulae above have a cost of 4 multiplications and 4 reductions in Fp 2 . The following
improved version was proposed in [AKL+10]:
h2 = 2g2 + 3ξ(S4,5 − S4 − S5) ,  h3 = 3(S4 + ξS5) − 2g3 ,
h4 = 3(S2 + ξS3) − 2g4 ,  h5 = 2g5 + 3(S2,3 − S2 − S3) ,    (C.2)
Appendix C1: Optimized Compressed Squarings
In contrast, the traditional computation would cost (using lazy reduction below Fp2 only):
Hence, our technique reduces the number of reductions in Fp2 by about 8% (from 299 to 275) in one exponentiation g^u computed with the new compressed squarings.
PERMISSIONS
Partial results that have been included and extended in this Dissertation appear in [LM08b,
LG09, LG09b, LG10, AKL+10]. Complete references to original publications have been
included in the Bibliography in compliance with Springer’s copyright, which states: “The Author
retains the right to use his/her Contribution for his/her further scientific career by including the
final published paper in his/her dissertation or doctoral thesis provided acknowledgement is
given to the original source of publication.”
Bibliography
[BBT+08] D. Bernstein, P. Birkner, T. Lange, C. Peters and M. Joye, "Twisted Edwards Curves,"
Progress in Cryptology - Africacrypt 2008, LNCS Vol. 5023, pp. 389-405, Springer, 2008.
[BCH+00] M. Brown, D. Cheung, D. Hankerson, J. Lopez, M. Kirkup and A. Menezes, "PGP in
Constrained Wireless Devices," Usenix Security Symposium, pp. 247-261, 2000.
[Ber06] D. Bernstein, "Curve25519: New Diffie-Hellman Speed Records," International Conference on
Practice and Theory in Public Key Cryptography (PKC 2006), LNCS Vol. 3958, pp. 207-228, Springer,
2006.
[BF01] D. Boneh and M. Franklin, "Identity-Based Encryption from the Weil Pairing," Advances in
Cryptology - Crypto 2001, LNCS Vol. 2139, pp. 213-229, Springer, 2001.
[BG04] D. Brown and R. Gallant, "The Static Diffie-Hellman Problem," Cryptology ePrint Archive,
Report 2004/306, 2004.
[BGM+10] J. Beuchat, J.E. González-Díaz, S. Mitsunari, E. Okamoto, F. Rodríguez-Henríquez and T.
Teruya, "High-Speed Software Implementation of the Optimal Ate Pairing over Barreto-Naehrig Curves,"
International Conference on Pairing-Based Cryptography (Pairing 2010), LNCS Vol. 6487, pp. 21-39,
Springer, 2010.
[BGO+07] P.S.L.M. Barreto, S. Galbraith, C. O'hEigeartaigh and M. Scott, "Efficient Pairing
Computation on Supersingular Abelian Varieties," Designs, Codes and Cryptography, Vol. 42, pp. 239-
271, 2007.
[BHL+01] M. Brown, D. Hankerson, J. Lopez and A. Menezes, "Software Implementation of the NIST
Elliptic Curves over Prime Fields," Topics in Cryptology - CT-RSA 2001, LNCS Vol. 2020, pp. 250-265,
Springer, 2001.
[BJ03] O. Billet and M. Joye, "Fast Point Multiplication on Elliptic Curves through Isogenies," Applied
Algebra, Algebraic Algorithms, and Error Correcting Codes Symposium (AAECC-15), LNCS Vol. 2643,
pp. 43-50, Springer, 2003.
[BJ03b] O. Billet and M. Joye, "The Jacobi Model of an Elliptic Curve and Side-Channel Analysis,"
International Conference on Applied Algebra, Algebraic Algorithms and Error-Correcting Codes (AAECC
'03), LNCS Vol. 2643, pp. 34-42, Springer, 2003.
[BKL+02] P.S.L.M. Barreto, H.Y. Kim, B. Lynn and M. Scott, "Efficient Algorithms for Pairing-Based
Cryptosystems," Advances in Cryptology - Crypto 2002, LNCS Vol. 2442, pp. 354-368, Springer, 2002.
[BL07] D. Bernstein and T. Lange, "Faster Addition and Doubling on Elliptic Curves," Advances in
Cryptology - Asiacrypt 2007, LNCS Vol. 4833, pp. 29-50, Springer, 2007.
[BL07b] D. Bernstein and T. Lange, "Inverted Edwards Coordinates," Applied Algebra, Algebraic
Algorithms, and Error Correcting Codes Symposium (AAECC-17), LNCS Vol. 4851, pp. 20-27, Springer,
2007.
[BL08] D. Bernstein and T. Lange, "Analysis and Optimization of Elliptic-Curve Single-Scalar
Multiplication," Finite Fields and Applications: Proceedings of Fq8, Vol. 461, pp. 1-18, 2008.
[BLS03] P.S.L.M. Barreto, B. Lynn and M. Scott, "On the Selection of Pairing-Friendly Groups," Int.
Workshop on Selected Areas in Cryptography (SAC 2003), LNCS Vol. 3006, pp. 17-25, Springer, 2003.
[BLS03b] P.S.L.M. Barreto, B. Lynn and M. Scott, "Constructing Elliptic Curves with Prescribed
Embedding Degrees," Security in Communication Networks, LNCS Vol. 2576, pp. 257-267, Springer,
2003.
[BLS04] D. Boneh, B. Lynn and H. Shacham, "Short signatures from the Weil pairing," Journal of
Cryptology, Vol. 17, pp. 297-319, 2004.
[BLS04b] P.S.L.M. Barreto, B. Lynn and M. Scott, "Efficient Implementation of Pairing-Based
Cryptosystems," Journal of Cryptology, Vol. 17, pp. 321-334, 2004.
[BN05] P.S.L.M. Barreto and M. Naehrig, "Pairing-Friendly Elliptic Curves of Prime Order,"
International Workshop on Selected Areas in Cryptography (SAC 2005), LNCS Vol. 3897, pp. 319-331,
Springer, 2005.
[BPP07] R. Barua, S.K. Pandey and R. Pankaj, "Efficient Window-Based Scalar Multiplication on
Elliptic Curves using Double-Base Number System," International Conference on Cryptolology -
Indocrypt 2007, LNCS Vol. 4859, pp. 351-360, Springer, 2007.
[BS10] N. Benger and M. Scott, "Constructing Tower Extensions of Finite Fields for Implementation of
Pairing-Based Cryptography," International Workshop on Arithmetic of Finite Fields (WAIFI 2010), LNCS
Vol. 6087, pp. 180-189, Springer, 2010.
[BW05] F. Brezing and Z. Weng, "Elliptic Curves Suitable for Pairing Based Cryptography," Designs,
Codes and Cryptography, Vol. 37(1), pp. 133-141, 2005.
[CC86] D.V. Chudnovsky and G.V. Chudnovsky, "Sequences of Numbers Generated by Addition in
Formal Groups and New Primality and Factorization Tests," Advances in Applied Mathematics, Vol. 7(4),
pp. 385-434, 1986.
[CCY96] C.-Y. Chen, C.-C. Chang and W.-P. Yang, "Hybrid Method for Modular Exponentiation with
Precomputation," Electronics Letters, Vol. 32(6), pp. 540-541, 1996.
[CH07] J. Chung and M.A. Hasan, "Asymmetric Squaring Formulae," IEEE Symposium on Computer
Arithmetic (ARITH-18 2007), pp. 113-122, 2007.
[CHB+09] C. Costello, H. Hisil, C. Boyd, J. Gonzalez Nieto and K.K. Wong, "Faster Pairings on Special
Weierstrass Curves," International Conference on Pairing-Based Cryptography (Pairing 2009), LNCS
Vol. 5671, pp. 89-101, Springer, 2009.
[CJL+06] M. Ciet, M. Joye, K. Lauter and P.L. Montgomery, "Trading Inversions for Multiplications in
Elliptic Curve Cryptography," Designs, Codes and Cryptography, Vol. 39(2), pp. 189-206, 2006.
[CLN10] C. Costello, T. Lange and M. Naehrig, "Faster Pairing Computations on Curves with High-
Degree Twists," International Conference on Practice and Theory in Public Key Cryptography (PKC
2010), LNCS Vol. 6056, pp. 224-242, Springer, 2010.
[CMO98] H. Cohen, A. Miyaji and T. Ono, "Efficient Elliptic Curve Exponentiation using Mixed
Coordinates," Advances in Cryptology - Asiacrypt '98, LNCS Vol. 1514, pp. 51-65, Springer, 1998.
[Com90] P.G. Comba, "Exponentiation Cryptosystems on the IBM PC," IBM Systems Journal, Vol. 29,
pp. 526-538, 1990.
[CS09] N. Costigan and P. Schwabe, "Fast Elliptic-Curve Cryptography on the Cell Broadband Engine,"
Progress in Cryptology - Africacrypt 2009, LNCS Vol. 5580, pp. 368-385, Springer, 2009.
[DC95] V. Dimitrov and T. Cooklev, "Two Algorithms for Modular Exponentiation based on
Nonstandard Arithmetics," IEICE Transactions on Fundamentals of Electronics, Communications and
Computer Science (Special Issue on Cryptography and Information Security), Vol. E78-A, pp. 82–87,
1995.
[Fog2] A. Fog, "The Microarchitecture of Intel, AMD and VIA CPUs," 2009. Available online at:
http://www.agner.org/optimize/#manuals, accessed on January 2010.
[Fre06] D. Freeman, "Constructing Pairing-Friendly Elliptic Curves with Embedding Degree 10,"
International Symposium on Algorithmic Number Theory (ANTS-VII), LNCS Vol. 4076, pp. 452-465,
Springer, 2006.
[FVV09] J. Fan, F. Vercauteren and I. Verbauwhede, "Faster Fp-Arithmetic for Cryptographic Pairings
on Barreto-Naehrig Curves," International Workshop on Cryptographic Hardware and Embedded Systems
(CHES 2009), LNCS Vol. 5747, pp. 240-253, Springer, 2009.
[GAS+05] J. Großschädl, R. Avanzi, E. Savaş and S. Tillich, "Energy-Efficient Software Implementation
of Long Integer Modular Arithmetic," International Workshop on Cryptographic Hardware and Embedded
Systems (CHES 2005), LNCS Vol. 3659, pp. 75-90, Springer, 2005.
[Gau09] P. Gaudry, "Index Calculus for Abelian Varieties of Small Dimension and the Elliptic Curve
Discrete Logarithm Problem," Journal of Symbolic Computation, Vol. 44, pp. 1690-1702, 2009.
[GGC02] V. Gupta, S. Gupta and S. Chang, "Performance Analysis of Elliptic Curve Cryptography for
SSL," ACM Workshop on Wireless Security (WiSe), Mobicom 2002, 2002.
[GLS08] S. Galbraith, X. Lin and M. Scott, "Endomorphisms for Faster Elliptic Curve Cryptography on
a Large Class of Curves," Cryptology ePrint Archive, Report 2008/194, 2008.
[GLS09] S. Galbraith, X. Lin and M. Scott, "Endomorphisms for Faster Elliptic Curve Cryptography on
a Large Class of Curves," Advances in Cryptology - Eurocrypt 2009, LNCS Vol. 5479, pp. 518-535,
Springer, 2009.
[GLV01] R. Gallant, R. Lambert and S. Vanstone, "Faster Point Multiplication on Elliptic Curves with
Efficient Endomorphisms," Advances in Cryptology - Crypto 2001, LNCS Vol. 2139, pp. 190-200,
Springer, 2001.
[GMJ10] R.R. Goundar, A. Miyaji and M. Joye, "Co-Z Addition Formulæ and Binary Ladders on
Elliptic Curves," International Workshop on Cryptographic Hardware and Embedded Systems (CHES
2010), LNCS Vol. 6225, pp. 65-79, Springer, 2010.
[Gor93] D. Gordon, "Discrete Logarithms in GF(p) using the Number Field Sieve," SIAM Journal on
Discrete Mathematics, Vol. 6, pp. 124-138, 1993.
[GPW+04] N. Gura, A. Patel, A. Wander, H. Eberle and S.C. Shantz, "Comparing Elliptic Curve
Cryptography and RSA on 8-bit CPUs," International Workshop on Cryptographic Hardware and
Embedded Systems (CHES 2004), LNCS Vol. 3156, pp. 119-132, Springer, 2004.
[Gra10] R. Granger, "On the Static Diffie-Hellman Problem on Elliptic Curves over Extension Fields,"
Advances in Cryptology - Asiacrypt 2010, LNCS Vol. 6477, pp. 283-302, Springer, 2010.
[GSF04] V. Gupta, D. Stebila and S. Fung, "Speeding up Secure Web Transactions using Elliptic Curve
Cryptography," Annual Network and Distributed System Security (NDSS) Symposium, 2004.
[GT07b] P. Gaudry and E. Thomé, "The mpFq Library and Implementing Curve-Based Key
Exchanges," SPEED 2007, pp. 49-64, 2007.
[His10] H. Hisil, "Elliptic Curves, Group Law, and Efficient Computation," PhD. Thesis, Queensland
University of Technology, 2010. Available online at: http://eprints.qut.edu.au/33233/
[Lau04] K. Lauter, "The Advantages of Elliptic Cryptography for Wireless Security," IEEE Wireless
Communications, Vol. 11(1), pp. 62-67, 2004.
[LD99] J. Lopez and R. Dahab, "Improved Algorithms for Elliptic Curve Arithmetic in GF(2n),"
International Workshop on Selected Areas in Cryptography (SAC '98), LNCS Vol. 1556, pp. 201-212,
Springer, 1999.
[Len87] H. Lenstra, "Factoring Integers with Elliptic Curves," Annals of Mathematics, Vol. 126, pp.
649-673, 1987.
[LG08] P. Longa and C. Gebotys, "Setting Speed Records with the (Fractional) Multibase Non-Adjacent
Form Method for Efficient Elliptic Curve Scalar Multiplication," CACR Technical Report, CACR 2008-06,
2008.
[LG09] P. Longa and C. Gebotys, "Fast Multibase Methods and Other Several Optimizations for Elliptic
Curve Scalar Multiplication," International Conference on Practice and Theory in Public Key
Cryptography (PKC 2009), LNCS Vol. 5443, pp. 443-462, Springer, 2009.
[LG09b] P. Longa and C. Gebotys, "Novel Precomputation Schemes for Elliptic Curve Cryptosystems,"
International Conference on Applied Cryptography and Network Security (ACNS 2009), LNCS Vol. 5536,
pp. 71-88, Springer, 2009.
[LG10] P. Longa and C. Gebotys, "Efficient Techniques for High-Speed Elliptic Curve Cryptography,"
Workshop on Cryptographic Hardware and Embedded Systems (CHES 2010), LNCS Vol. 6225, pp. 80-94,
Springer, 2010.
[LH00] C.H. Lim and H.S. Hwang, "Fast Implementation of Elliptic Curve Arithmetic in GF(pm),"
International Conference on Practice and Theory in Public Key Cryptography (PKC 2000), LNCS Vol.
1751, pp. 405-421, Springer, 2000.
[LLM+93] A. Lenstra, H. Lenstra, M. Manasse and J. Pollard, "The Number Field Sieve," The
Development of the Number Field Sieve, LNCS Vol. 1554, pp. 11-42, Springer, 1993.
[LLP09] E. Lee, H.-S. Lee and C.-M. Park, "Efficient and Generalized Pairing Computation on Abelian
Varieties," IEEE Transactions on Information Theory, Vol. 55(4), pp. 1793-1803, 2009.
[LM08] P. Longa and A. Miri, "Fast and Flexible Elliptic Curve Point Arithmetic over Prime Fields,"
IEEE Transactions on Computers, Vol. 57(3), pp. 289-302, 2008.
[LM08b] P. Longa and A. Miri, "New Composite Operations and Precomputation Scheme for Elliptic
Curve Cryptosystems over Prime Fields," International Conference on Practice and Theory in Public Key
Cryptography (PKC 2008), LNCS Vol. 4939, pp. 229-247, Springer, 2008.
[LM08c] P. Longa and A. Miri, "New Multibase Non-Adjacent Form Scalar Multiplication and its
Application to Elliptic Curve Cryptosystems (extended version)," Cryptology ePrint Archive, Report
2008/052, 2008.
[Lon07] P. Longa, "Accelerating the Scalar Multiplication on Elliptic Curve Cryptosystems over Prime
Fields," Master's Thesis, University of Ottawa, 2007. Available online at:
http://patricklonga.bravehost.com/publications.html
[Lon08] P. Longa, "ECC Point Arithmetic Formulae (EPAF) Database," 2008. Available online at:
http://patricklonga.bravehost.com/jacobian.html#jac
[Lon10] P. Longa, "Speed Benchmarks for Elliptic Curve Scalar Multiplication," 2010. Available online
at: http://www.patricklonga.bravehost.com/speed_ecc.html#speed
[Lon10b] P. Longa, "Speed Benchmarks for Pairings over Ordinary Curves," 2010. Available online at:
http://patricklonga.bravehost.com/speed_pairing.html#speed
[MD07] P.K. Mishra and V. Dimitrov, "Efficient Quintuple Formulas for Elliptic Curves and Efficient
Scalar Multiplication using Multibase Number Representation," Information Security Conference (ISC
2007), LNCS Vol. 4779, pp. 390-406, Springer, 2007.
[Mel07] N. Meloni, "New Point Addition Formulae for ECC Applications," International Workshop on
Arithmetic of Finite Fields (WAIFI 2007), LNCS Vol. 4547, pp. 189-201, Springer, 2007.
[Men09] A. Menezes, "An Introduction to Pairing-Based Cryptography," Recent Trends in
Cryptography, Vol. 477 of Contemporary Mathematics, pp. 47-65, AMS-RSME, 2009.
[MH09] N. Meloni and M.A. Hasan, "Elliptic Curve Scalar Multiplication Combining Yao's Algorithm
and Double Bases," International Workshop on Cryptographic Hardware and Embedded Systems (CHES
2009), LNCS Vol. 5747, pp. 304-316, Springer, 2009.
[Mil86] V. Miller, "Use of Elliptic Curves in Cryptography," Advances in Cryptology - Crypto '85,
LNCS Vol. 218, pp. 417-426, Springer, 1986.
[Mil86b] V. Miller, "Short Programs for Functions on Curves," 1986. Available online at:
http://crypto.stanford.edu/miller
[Mil04] V. Miller, "The Weil Pairing, and its Efficient Calculation," Journal of Cryptology, Vol. 17, pp.
235-261, 2004.
[MIR] M. Scott, "Multiprecision Integer and Rational Arithmetic C/C++ Library (MIRACL)." Available
online at: http://www.shamus.ie/
[Möl01] B. Möller, "Algorithms for Multi-Exponentiation," Selected Areas in Cryptography (SAC
2001), LNCS Vol. 2259, pp. 165-180, Springer, 2001.
[Möl03] B. Möller, "Improved Techniques for Fast Exponentiation," International Conference on
Information Security and Cryptology (ICISC 2002), LNCS Vol. 2587, pp. 298-312, Springer, 2003.
[Möl05] B. Möller, "Fractional Windows Revisited: Improved Signed-Digit Representations for
Efficient Exponentiation," International Conference on Information Security and Cryptology (ICISC
2004), LNCS Vol. 3506, pp. 137-153, Springer, 2005.
[Mon85] P.L. Montgomery, "Modular Multiplication without Trial Division," Mathematics of
Computation, Vol. 44, pp. 519-521, 1985.
[Mon87] P.L. Montgomery, "Speeding the Pollard and Elliptic Curve Methods of Factorization,"
Mathematics of Computation, Vol. 48, pp. 243-264, 1987.
[Mor90] F. Morain and J. Olivos, "Speeding up the Computations on an Elliptic Curve using Addition-
Subtraction Chains," Theoretical Informatics and Applications, Vol. 24(6), pp. 531-544, 1990.
[MOV93] A. Menezes, T. Okamoto and S. Vanstone, "Reducing Elliptic Curve Logarithms to
Logarithms in a Finite Field," IEEE Transactions on Information Theory, Vol. 39, pp. 1639-1646, 1993.
[mpFq] P. Gaudry and E. Thomé, "mpFq – A Finite Field Library." Available online at:
http://mpfq.gforge.inria.fr/mpfq-1.0-rc2.tar.gz
[NIST00] National Institute of Standards and Technology (NIST), "Digital Signature Standard (DSS),"
FIPS PUB 186-2, 2000. Available online at:
http://csrc.nist.gov/publications/PubsFIPS.html
[NIST07] National Institute of Standards and Technology (NIST), "Recommendation for Key
Management - Part 1: General (Revised)," NIST Special Publication 800-57, 2007. Available online at:
http://csrc.nist.gov/publications/PubsSPs.html
[NIST09] National Institute of Standards and Technology (NIST), "Digital Signature Standard (DSS),"
FIPS PUB 186-3, 2009. Available online at:
http://csrc.nist.gov/publications/PubsFIPS.html
[NNS10] M. Naehrig, R. Niederhagen and P. Schwabe, "New Software Speed Records for
Cryptographic Pairings," Progress in Cryptology - Latincrypt 2010, LNCS Vol. 6212, pp. 109-123,
Springer, 2010.
[NSA09] U.S. National Security Agency (NSA), "NSA Suite B Cryptography," Fact Sheet NSA Suite B
Cryptography, 2009. Available online at:
http://www.nsa.gov/ia/programs/suiteb_cryptography/index.shtml
[OKN10] K. Okeya, H. Kato and Y. Nogami, "Width-3 Joint Sparse Form," International Conference on
Information Security, Practice and Experience (ISPEC 2010), LNCS Vol. 6047, pp. 67-84, Springer, 2010.
[OTV05] K. Okeya, T. Takagi and C. Vuillaume, "Efficient Representations on Koblitz Curves with
Resistance to Side Channel Attacks," Australasian Conference on Information Security and Privacy
(ACISP 2005), LNCS Vol. 3574, pp. 218-229, Springer, 2005.
[Pol78] J. Pollard, "Monte Carlo Methods for Index Computation (mod p)," Mathematics of Computation,
Vol. 32, pp. 918-924, 1978.
[Pro03] J. Proos, "Joint Sparse Forms and Generating Zero Columns when Combing," Technical Report
CORR 2003-23, University of Waterloo, 2003.
[PSN+10] G. Pereira, M. Simplicio Jr, M. Naehrig and P.S.L.M. Barreto, "A Family of Implementation-
Friendly BN Elliptic Curves," Cryptology ePrint Archive, Report 2010/429, 2010.
[Rei60] G.W. Reitwiesner, "Binary Arithmetic," Advances in Computers, Vol. 1, pp. 231-308, 1960.
[RSA78] R. Rivest, A. Shamir and L. Adleman, "A Method for Obtaining Digital Signatures and Public-
Key Cryptosystems," Communications of the ACM, Vol. 21(2), pp. 120-126, 1978.
[SB06] M. Scott and P.S.L.M. Barreto, "Generating more MNT Elliptic Curves," Designs, Codes and
Cryptography, Vol. 38(2), pp. 209-217, 2006.
[SBC+09] M. Scott, N. Benger, M. Charlemagne, L. Dominguez Perez and E. Kachisa, "On the Final
Exponentiation for Calculating Pairings on Ordinary Elliptic Curves," International Conference on
Pairing-Based Cryptography (Pairing 2009), LNCS Vol. 5671, pp. 78-88, Springer, 2009.
[Sch10] O. Schirokauer, "The Number Field Sieve for Integers of Low Weight," Mathematics of
Computation, Vol. 79(269), pp. 583-602, 2010.
[Sco07] M. Scott, "Implementing Cryptographic Pairings," International Conference on Pairing-Based
Cryptography (Pairing 2007), LNCS Vol. 4575, pp. 177-196, Springer, 2007.
[Sco08] M. Scott, "A Faster Way to Do ECC," Talk at the 12th Workshop on Elliptic Curve
Cryptography (ECC 2008), 2008. Available online at: http://www.hyperelliptic.org/tanja/conf/ECC08/
[SEI10] V. Suppakitpaisarn, M. Edahiro and H. Imai, "Optimal Average Joint Hamming Weight and
Minimal Weight Conversion of d Integers," Cryptology ePrint Archive, Report 2010/300, 2010.
[SEI11] V. Suppakitpaisarn, M. Edahiro and H. Imai, "Fast Elliptic Curve Cryptography using Optimal
Double-Base Chains," Cryptology ePrint Archive, Report 2011/030, 2011.
[SG08] R. Szerwinski and T. Güneysu, "Exploiting the Power of GPUs for Asymmetric Cryptography,"
International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2008), LNCS Vol.
5154, pp. 79-99, Springer, 2008.
[Sma99] N.P. Smart, "The Discrete Logarithm Problem on Elliptic Curves of Trace One," Journal of
Cryptology, Vol. 12, pp. 193-196, 1999.
[Sma01] N.P. Smart, "The Hessian Form of an Elliptic Curve," International Workshop on
Cryptographic Hardware and Embedded Systems (CHES 2001), LNCS Vol. 2162, pp. 118-125, Springer,
2001.
[SOK00] R. Sakai, K. Ohgishi and M. Kasahara, "Cryptosystems Based on Pairings," The 2000
Symposium on Cryptography and Information Security, 2000.
[Sol00] J. Solinas, "Efficient Arithmetic on Koblitz Curves," Designs, Codes and Cryptography, Vol.
19(2-3), pp. 195-249, 2000.
[Sol01] J. Solinas, "Low-Weight Binary Representations for Pairs of Integers," Technical Report CORR
2001-41, University of Waterloo, 2001.
[UWL+07] O. Ugus, D. Westhoff, R. Laue, A. Shoufan and S.A. Huss, "Optimized Implementation of
Elliptic Curve Based Additive Homomorphic Encryption for Wireless Sensor Networks," Workshop on
Embedded Systems Security (WESS 2007), 2007.
[Ver01] E. Verheul, "Self-Blindable Credential Certificates from the Weil Pairing," Advances in
Cryptology - Asiacrypt 2001, LNCS Vol. 2248, pp. 533-551, Springer, 2001.
[Ver10] F. Vercauteren, "Optimal Pairings," IEEE Transactions on Information Theory, Vol. 56(1), pp.
455-461, 2010.
[Wal98] C.D. Walter, "Exponentiation using Division Chains," IEEE Transactions on Computers, Vol.
47(7), pp. 757-765, 1998.
[Wal02] C.D. Walter, "MIST: an Efficient, Randomized Exponentiation Algorithm for Resisting Power
Analysis," Topics in Cryptology - CT-RSA 2002, LNCS Vol. 2271, pp. 142-174, Springer, 2002.
[Wal11] C.D. Walter, "Fast Scalar Multiplication for ECC over GF(p) Using Division Chains,"
International Workshop on Information Security Applications (WISA 2010), LNCS Vol. 6513, pp. 61-75,
Springer, 2011.
[WD98] D. Weber and T. Denny, "The Solution of McCurley's Discrete Log Challenge," Advances in
Cryptology - Crypto '98, LNCS Vol. 1462, pp. 458-471, Springer, 1998.
[XB01] S.-B. Xu and L. Batina, "Efficient Implementation of Elliptic Curve Cryptosystems on an
ARM7 with Hardware Accelerator," International Conference on Information and Communications
Security (ICICS '01), LNCS Vol. 2200, pp. 11-16, Springer, 2001.
[YSK02] T. Yanik, E. Savaş and Ç.K. Koç, "Incomplete Reduction in Modular Arithmetic," IEE
Proceedings - Computers and Digital Techniques, Vol. 149(2), pp. 46-52, 2002.
[ZZH08] C.-A. Zhao, F. Zhang and J. Huang, "A Note on the Ate Pairing," International Journal of
Information Security, Vol. 7(6), pp. 379-382, 2008.